Cordelia Schmid (2023):
Making AI More Intelligent – Smart Image Recognition for Autonomous Robots
The German computer scientist Cordelia Schmid is a pioneer in computer-aided image processing. Schmid developed revolutionary new procedures that enable computers to understand image content. Thanks to her algorithms, artificial intelligence (AI) can locate a motif or an object in a database of images within a fraction of a second. The prize winner is currently doing research on systems that can semantically interpret videos and even predict future actions. Her goals include the development of robots that will be able to respond to verbal commands and be employed, for instance, as intelligent assistants in hospitals and in care for the aged.
Text: Claus-Peter Sesín
Translation: Michael Wilson, Jacqui Allen, Markus Dressel
Artificial intelligence (AI) has developed explosively over the last ten years. In November 2022, the US company OpenAI introduced ChatGPT, a chatbot capable of understanding questions and providing elaborate answers in an eerily human manner. Since then, AI has been a major topic in the media, with the tone of reports ranging from exaggerated hopes to warnings of misuse. What is certain is that AI will decisively shape both the way we live and future economic developments.
Cordelia Schmid is one of the most important pioneers in AI research. In her dissertation, published in 1996, she developed fundamentally new procedures in the field of image recognition—an important subfield of AI—which led to enormous breakthroughs in computer vision. In the following years, she also succeeded in developing powerful new computer vision algorithms, which later became the established standard. Today, one focus of Cordelia Schmid’s research is on “multimodal transformers”. These are computer systems which, on the basis of video and audio information, can analyse and understand the content of videos and predict forthcoming actions in these videos. This field of technology is an important intermediate step toward future robotic assistants for hospitals and care homes. The prize winner’s long-term goal is to develop smart household robots that can respond to spoken commands, for example, to fetch cucumbers from the refrigerator and then even wash and cut them.
Schmid received her master’s degree in computer science in 1992 from the Karlsruhe Institute of Technology. In 1996, she was awarded her PhD at the National Polytechnic Institute of Grenoble (Grenoble INP). After working as a postdoc at the British Oxford Robotics Institute, she moved to the French National Institute for Research in Digital Science and Technology (Inria) in 1997, where she received her advanced academic qualification (habilitation) in 2001. She has been a research director of this institute since 2004. Schmid was also a co-editor of the International Journal of Computer Vision from 2004 to 2012, and from 2013 to 2018 its editor in chief. Furthermore, Schmid has worked part time for Google Research since 2018.
AI pioneers took their cue from the human brain as early as 1956
AI research began in the USA back in the 1950s. In 1956, twenty top researchers from the fields of computer science, mathematics, and information theory met at Dartmouth College in the state of New Hampshire. Their goal: understanding how a computer could be programmed to understand human language. The group quickly agreed that such a computer would in some sense have to imitate the human brain.
Two rival factions, however, quickly formed. One group around the “AI pope” Marvin Minsky stored numerous if-then rules in a computer. These so-called expert systems were able to independently draw conclusions. For example, given a rule A “When it rains, the street is slippery” and a rule B “When the street is slippery, there are more car accidents”, such a system could conclude that “When it rains, there are more car accidents.” Minsky believed that an expert system, if only it could be programmed with a sufficient number of rules or symbols, might eventually even develop something like consciousness.
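The chaining of rule A and rule B can be illustrated with a minimal sketch in Python. The rule format and the function name derive_conclusions are assumptions made purely for illustration; historical expert systems were far more elaborate.

```python
# Illustrative sketch only: each rule is a (premise, conclusion) pair.
RULES = [
    ("it rains", "the street is slippery"),                      # rule A
    ("the street is slippery", "there are more car accidents"),  # rule B
]

def derive_conclusions(fact, rules):
    """Return everything that follows from `fact` by chaining rules forward."""
    known = {fact}
    changed = True
    while changed:                      # repeat until no rule adds anything new
        changed = False
        for premise, conclusion in rules:
            if premise in known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known - {fact}

print(derive_conclusions("it rains", RULES))
```

Starting from “it rains”, the loop first applies rule A and then rule B, so the system arrives at “there are more car accidents” without that rule ever having been stored directly.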
The second group around the cognitive scientist Frank Rosenblatt attempted to copy the functioning of the brain in a more direct way—by using artificial neurons. A neuron in the brain has access points (dendrites) and an output point (the synapse). The information gathered by the dendrites is passed on via the synapse, which then reaches the dendrites of other neurons. This is how information is passed on within the brain. The strength of the connections is variable and is determined by the stimuli and material learned.
In 1957, Rosenblatt and his team were the first to construct a network of artificial neurons. It had inputs and outputs, similar to the neurons of the brain. The strength with which information was transmitted between these artificial neurons was determined by so-called weights. Whereas stimulating signals corresponded to weights with positive numbers, inhibiting signals were associated with negative numbers. Rosenblatt showed that his network—named “Perceptron”—was already capable of executing simple logical operations, using formal operators such as AND, OR, or NOT.
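How a single artificial neuron with weights can carry out such logical operations can be sketched in a few lines of Python. The weight and bias values below are hand-picked assumptions (many other values would work), and the function names are hypothetical; this is not a reconstruction of Rosenblatt’s original machine.

```python
def neuron(inputs, weights, bias):
    """Return 1 ("fire") if the weighted sum of the inputs plus the bias is positive."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0

# Positive weights act as stimulating signals, negative weights as inhibiting ones.
def logical_and(a, b):
    return neuron([a, b], [1.0, 1.0], -1.5)   # fires only if both inputs are 1

def logical_or(a, b):
    return neuron([a, b], [1.0, 1.0], -0.5)   # fires if at least one input is 1

def logical_not(a):
    return neuron([a], [-1.0], 0.5)           # the inhibiting weight flips the input

for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b))
```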
With his Perceptron, Rosenblatt created the prototype of an artificial neural network (ANN) and paved the way for modern AI systems. In fact, almost all of today’s AI applications are based on ANNs. In modern ANNs the neurons are arranged in numerous layers. There is an input and an output layer, with further layers for the fine-grained processing of information between them. The more layers, the more powerful the ANN’s performance. So-called deep networks have many intermediate layers.
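Why additional layers add power can be illustrated with the “exclusive or” (XOR) operation, which a single threshold neuron cannot compute but a small two-layer arrangement can. The following Python sketch uses hand-picked weights and hypothetical function names; it is meant as an illustration of layering, not as a model of any real system.

```python
def step(total):
    """Threshold activation: the neuron fires only for a positive sum."""
    return 1 if total > 0 else 0

def xor_network(a, b):
    # Intermediate ("hidden") layer with two neurons.
    h_or = step(1.0 * a + 1.0 * b - 0.5)    # fires if a or b is 1
    h_and = step(1.0 * a + 1.0 * b - 1.5)   # fires only if both are 1
    # Output layer: stimulated by h_or, inhibited by h_and.
    return step(1.0 * h_or - 1.0 * h_and - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```

The output neuron never sees the raw inputs; it only combines what the intermediate layer has already worked out, which is the basic idea behind stacking layers.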
However, in the early days of AI, a controversy arose amongst AI pioneers. Marvin Minsky and colleagues proved mathematically that it was either “impossible” or would take an “infinitely long” time to program such a powerful ANN. Subsequently, research funds were cancelled, and Minsky’s expert systems dominated the AI scene of the time—with rather modest results.
Artificial Intelligence was a hot topic long before ChatGPT entered the stage: In 1950, British mathematician Alan Turing (1912-1954, left) introduced the Turing Test. It defines when a machine can be considered “intelligent”. In 1957, American cognitive scientist Frank Rosenblatt (1928-1971, second from left) developed the prototype of an artificial neural network. This “Perceptron” (third from left) was already capable of identifying simple objects and could, for example, differentiate between the letters “C” and “D”. Today, almost all AI systems employ Rosenblatt’s principle. In 1972, the expert system “Mycin” (fourth from left) was able to make medical diagnoses based on stored rules. It was not until much later, in 2014, that Google presented its first self-driving car (fifth from left). It would take even longer, until 2016, for the AI system “AlphaGo” to beat the reigning world champion Lee Sedol in the board game Go (sixth from left).
Artificial neural networks learn from training
ANNs had their first breakthrough in the 1980s. Rather than programming them, AI researchers solved the problem posed by Minsky by feeding ANNs training data. This made it possible, for example, to train an ANN to distinguish apples from pears. To do so, the input layer is connected to a camera. The goal is for the output layer to provide a fitting description of the content, such as “apple” or “pear”, for example as text on a monitor. Initially, the weights in the ANN layers are set to random values. For this reason, the ANN makes many errors at first, such as reporting “apple” although a pear was displayed. To correct these mistakes, a human trainer* tells the network whether it was right or wrong in each individual case. In the process, the errors are passed back internally from the output layer towards the input layer, a procedure known as backpropagation. The ANN thus “learns” to adjust the weights in each of its layers so that its outputs move closer and closer to the desired results. After the ANN has been sufficiently trained, it is able to classify the two types of fruit as apple or pear with a high rate of success, even for new images that it has never seen before.
*In modern ANNs, annotated training data is employed during training in which the correct classification—initially masked—has already been recorded, making it possible to automate the training.
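The principle of learning from corrected mistakes can be sketched with a single artificial neuron and the classic perceptron learning rule. This is a deliberate simplification: real image classifiers have many layers, work on camera pixels, and use a more elaborate training procedure. The two features (roundness, elongation), the data values, and all names below are invented for illustration.

```python
import random

# Hypothetical hand-made features: (roundness, elongation), both between 0 and 1.
# Label 0 = "apple", label 1 = "pear". The labels play the role of the trainer.
TRAINING_DATA = [
    ((0.90, 0.30), 0), ((0.80, 0.20), 0), ((0.95, 0.35), 0), ((0.85, 0.25), 0),
    ((0.50, 0.80), 1), ((0.40, 0.90), 1), ((0.55, 0.75), 1), ((0.45, 0.85), 1),
]
LABELS = ["apple", "pear"]

def predict(weights, bias, features):
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if total > 0 else 0

def train(data, learning_rate=0.5, max_epochs=1000, seed=0):
    rng = random.Random(seed)
    weights = [rng.uniform(-1, 1) for _ in range(2)]   # random starting weights
    bias = rng.uniform(-1, 1)
    for _ in range(max_epochs):
        errors = 0
        for features, label in data:
            error = label - predict(weights, bias, features)   # -1, 0 or +1
            if error != 0:
                errors += 1
                # Nudge each weight in the direction that reduces the mistake.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
        if errors == 0:      # every training example is now classified correctly
            break
    return weights, bias

weights, bias = train(TRAINING_DATA)
print(LABELS[predict(weights, bias, (0.88, 0.28))])   # a fruit not in the training set
```

Because the labels are stored alongside the data, no human has to sit next to the machine during training, which is the automation the footnote describes.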
Cordelia Schmid’s algorithms set the stage for fast image searches on the Internet
When Cordelia Schmid wrote her groundbreaking dissertation in 1996, image classification was still in its infancy: “Back in these days, the systems were only capable of recognising simple geometric shapes such as circles, triangles, or squares, and this only against a uniform background.” Schmid substantially improved image recognition by focusing on distinctive features of an image, so-called “local image descriptors.” These image descriptors represent the spatial dimensions of the object displayed. Image descriptors enable computers to identify an object even when it is partially hidden or displayed from a different perspective. In other words, a system employing image descriptors will recognise the Eiffel Tower even if it is photographed from below from a short distance, or slanted from the side, or from a long distance when a tree blocks the view of parts of the tower. With this innovation, Schmid laid the foundation for today’s search engines to search millions of images on the internet within seconds.
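Why local descriptors tolerate partial occlusion can be illustrated with a toy matching scheme: an image is identified by counting how many of the visible local descriptors have a close counterpart in a stored image. The three-number descriptors, the database entries, the threshold, and the function names below are all made up for illustration; the descriptors and matching strategy in Schmid’s actual work are considerably more sophisticated.

```python
import math

def distance(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def count_matches(query_descriptors, image_descriptors, threshold=0.1):
    """Count query descriptors that have a close counterpart in the stored image."""
    return sum(
        1 for q in query_descriptors
        if min((distance(q, d) for d in image_descriptors), default=math.inf) < threshold
    )

def best_match(query_descriptors, database):
    """Return the name of the stored image with the most matching descriptors."""
    return max(database, key=lambda name: count_matches(query_descriptors, database[name]))

# Toy database with invented descriptors for two landmarks.
DATABASE = {
    "eiffel_tower": [(0.1, 0.9, 0.3), (0.7, 0.2, 0.5), (0.4, 0.4, 0.8), (0.9, 0.1, 0.1)],
    "brandenburg_gate": [(0.2, 0.1, 0.9), (0.6, 0.6, 0.1), (0.3, 0.8, 0.7), (0.8, 0.5, 0.4)],
}

# A partially hidden tower: only two of its four descriptors are visible,
# and both are slightly perturbed by the change in viewpoint.
query = [(0.12, 0.88, 0.31), (0.69, 0.22, 0.50)]
print(best_match(query, DATABASE))
```

Since each descriptor votes on its own, losing some of them behind a tree lowers the score but does not prevent recognition.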
After the turn of the millennium, automatic image recognition made great advances and engendered many novel approaches. During this period, Cordelia Schmid developed benchmark tests that made it possible to determine the most effective of these numerous new methods. Among the test criteria was—in addition to a high success rate in finding the desired images—that the processing speed be as fast as possible.
In 2006, Schmid developed another standard procedure for image recognition: “spatial pyramid matching”. This approach divides images into smaller and smaller sections, which makes the process of grasping spatial structures more flexible. “We were now able to clearly distinguish between categories such as ‘bedroom’ and ‘living room,’ and image content such as a beach scene was recognised at first sight,” says Schmid.
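The idea of dividing an image into smaller and smaller sections can be sketched as follows. Each toy image is a small grid of “visual word” indices; histograms are computed for the whole image, then for four quarters, then for sixteen cells, and two images are compared by how much their histograms overlap. This is a simplified sketch under stated assumptions: the published method additionally weights the levels differently, which is omitted here, and the grids and names are invented.

```python
def pyramid_histogram(grid, num_words, levels=3):
    """Concatenate visual-word histograms of ever finer sub-regions.

    `grid` is a square 2D list of visual-word indices whose side length
    is divisible by 2 ** (levels - 1).
    """
    size = len(grid)
    features = []
    for level in range(levels):
        cells = 2 ** level              # the image is split into cells x cells regions
        step = size // cells
        for cell_row in range(cells):
            for cell_col in range(cells):
                histogram = [0] * num_words
                for r in range(cell_row * step, (cell_row + 1) * step):
                    for c in range(cell_col * step, (cell_col + 1) * step):
                        histogram[grid[r][c]] += 1
                features.extend(histogram)
    return features

def intersection(h1, h2):
    """Histogram intersection: the amount by which two histograms overlap."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Two 4x4 toy images with the same content in a different layout
# (word 0 = "sky", word 1 = "sand").
beach = [[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]]
flipped = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]

print(intersection(pyramid_histogram(beach, 2), pyramid_histogram(beach, 2)))
print(intersection(pyramid_histogram(beach, 2), pyramid_histogram(flipped, 2)))
```

A single histogram over the whole image cannot tell the two toy images apart, because both contain equal amounts of “sky” and “sand”. The finer levels record where each word occurs, so the upside-down scene scores much lower.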
AI systems were also making advances in other fields. Their successes in games, which had previously been considered the domain of humans, attracted special attention. As early as 1997, an IBM system named “Deep Blue” had defeated the then reigning world chess champion Garry Kasparov in a match. Deep Blue, however, was not an AI but a powerful standard computer into which the rules of chess and thousands of matches between grand masters had been programmed. Deep Blue achieved its victory over Kasparov with “brute force computing power”: it analysed 126 million positions per second.
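The brute-force idea, examining every possible continuation of a game, can be sketched on a far simpler game than chess. In the toy game assumed here, two players alternately take one, two, or three stones from a pile, and whoever takes the last stone wins. The game and the function names are chosen only for illustration. Real chess programs cannot search to the end of the game and need additional techniques that are not shown.

```python
from functools import lru_cache

@lru_cache(maxsize=None)   # remember positions that were already analysed
def can_win(stones):
    """True if the player to move can force a win with `stones` left."""
    # A position is winning if some move leaves the opponent in a losing one.
    return any(not can_win(stones - take) for take in (1, 2, 3) if take <= stones)

def best_move(stones):
    """Return a winning number of stones to take, or None if every move loses."""
    for take in (1, 2, 3):
        if take <= stones and not can_win(stones - take):
            return take
    return None

for pile in range(1, 9):
    print(pile, best_move(pile))
```

The search is exhaustive: every legal move is followed to the end of the game. This works for a pile of stones but becomes hopeless once the number of possible continuations grows too large.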
Other games are less easily formalised, such as the Japanese board game Go. While Go also uses black and white stones, its board has as many as 361 points of intersection, whereas a chessboard has only 64 squares. The standard method employed by chess computers, namely to run through all the possible moves one after another, fails with Go because of its greater complexity. Only an ANN was able to handle it. It was not until 2016 that “AlphaGo,” an ANN developed by the Google company DeepMind, managed to defeat the reigning Go world champion. AlphaGo was trained on 160,000 games between Go masters. In 2017, its successor AlphaGo Zero achieved superhuman capabilities. It defeated its predecessor AlphaGo in one hundred out of one hundred matches. The special feature of AlphaGo Zero: It had taught itself the rules and strategies by using deep learning. The supervised learning of early ANNs was now succeeded by unsupervised learning, which in many cases has been even more productive.
The way machine learning works can be demonstrated by the example of AI translation programs. Literal translations from one language to another are often clumsy or misleading. In order to translate accurately, programs have to consider both the particularities of each language and the semantic context. For instance, the word “nut” in connection with a screw or bolt means something different than “nut” in the context of food. Computer scientists thus had the idea to train translation ANNs using professionally created translations as the learning material.
AI versus human: Deep Blue’s victory over world chess champion Garry Kasparov sparked international attention (left). Go is far more complex than chess. It was not until 2016 that the AI system “AlphaGo” managed to defeat the world champion Lee Sedol (right).
Machines learn how to learn—even without human supervision
This is how automated or “self-supervised” learning works: The ANN is given sample pairs of original texts and professional translations. In a first step, roughly ten to twenty percent of individual words and sentence fragments in the translation are masked for the ANN. The ANN then compares the original text and the fragmentary translation and guesses which words or sentence fragments are the best semantic fit for the gaps. In a second step, the masking is removed, so that the ANN sees the complete professional translation. In this way, the system can learn from mistakes and expand its knowledge step by step. The ANN acquires this linguistic knowledge not by using logical rules, but in a purely statistical manner—by determining which words occur in which order most frequently in a given semantic context.
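The purely statistical guessing of masked words can be illustrated with a very small sketch. Instead of a neural network and translation pairs, it simply counts which word most often appears between a given left and right neighbour in a handful of example sentences. The corpus, the “[MASK]” token, and the function names are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

CORPUS = [
    "place the steak in the pan",
    "place the fish in the pan",
    "place the steak on the grill",
    "cut the steak in half",
]

def build_context_counts(sentences):
    """Count which word appears between each (left neighbour, right neighbour) pair."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(1, len(words) - 1):
            counts[(words[i - 1], words[i + 1])][words[i]] += 1
    return counts

def guess_masked(words, counts, mask="[MASK]"):
    """Fill a single masked word with the most frequent candidate for its context."""
    i = words.index(mask)
    if i == 0 or i == len(words) - 1:
        return None                      # this sketch needs a neighbour on both sides
    candidates = counts.get((words[i - 1], words[i + 1]))
    return candidates.most_common(1)[0][0] if candidates else None

counts = build_context_counts(CORPUS)
print(guess_masked("place the [MASK] in the oven".split(), counts))
```

No grammatical rule is stored anywhere in this sketch; the guess follows only from how often each word was seen in the same surroundings.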
The AI model “VideoBert” simultaneously analyses images and text in videos. The system is trained by omitting parts from the audio channel—such as “steak”—which VideoBert then has to guess. The same happens in the image channel. After training, the model can predict both forthcoming actions and speech in the video.
“VideoBert” uses online cooking videos for unsupervised learning
Utilising the same principle, Cordelia Schmid is currently conducting research on vision-language models, for example on a system named “VideoBert”. This system can independently analyse video instructions on the internet, such as cooking videos. One task of this AI system is to train itself to predict subsequent actions in the videos. VideoBert works multimodally, which means that it simultaneously examines the image sequences and the corresponding spoken text (e.g., “place the steak in the pan,” see figure). In VideoBert, Schmid also utilises the masking procedure: words or video sequences are omitted, and VideoBert then has to guess them. Self-supervised learning has the advantage that thousands of cooking videos are available online and can be used free of charge.
After the self-training, Schmid showed VideoBert cooking videos that it had never seen before. When it saw, for example, a bowl with flour and cacao, it was able to predict that later a chocolate cake would be baked with these ingredients and could generate appropriate images of the expected final product. “Coming versions of VideoBert will even be capable of preparing written recipes after viewing new cooking videos,” Schmid adds. Based on this multimodal vision-language understanding, the prize winner is also planning to develop intelligent robot assistants for hospitals and care homes in the future.
A vision-enabled challenger of ChatGPT
One of the projects that Cordelia Schmid intends to pursue with the prize money is to design a kind of vision-enabled competitor to the chatbot ChatGPT, which is a deep ANN consisting of a particularly high number of layers. Having trained with countless data sets from the internet, ChatGPT is able to process natural language and can provide suitable answers. The result is usually eloquent and detailed. In February, the German politician Tiemo Wölken gave a speech to the European Parliament that had been written completely by ChatGPT to demonstrate the capability of the system to the public.
Schmid also considers the performance of ChatGPT “impressive,” but criticises the fact that the “model is not self-explanatory and is strongly dependent on data. It has a limited context window and cannot learn from experience. Above all, there is no physical interaction with the real world.” Schmid wants to develop a “truly intelligent” rival bot capable of processing visual information and 3D environmental data, which it would continuously incorporate into its knowledge base. It should be equipped with an additional memory for this knowledge so that learned material does not remain hidden in its network of inner neuronal weights, which would be next to impossible to disentangle. Until now, conventional ANNs have been a kind of black box; no one knows precisely how they reach their decisions. In contrast, Schmid’s new bot should be able to give reasons for its decisions by referring to the knowledge in its additional memory. “Our goal is to make it possible to explain the output,” Schmid explains. “Later, the bot should utilise its 3D visual knowledge in order to navigate autonomously in an unfamiliar environment.”
That ChatGPT does not always work reliably was demonstrated by the German science journalist Jürgen Scriba, who had the idea to have ChatGPT prepare a biography of a made-up nutritional scientist named Dr. Anton Wirsing. ChatGPT put together an extensive vita and in response to questions even began to make up a story that Dr. Wirsing had emigrated to the USA and studied at Harvard—despite the fact that Dr. Wirsing had never existed.
Language-based generative AI such as ChatGPT thus opens the door to fake news. Thanks to the well-formed sentences and seemingly precise facts, these fakes appear particularly convincing and trustworthy. Because of the great danger of misuse, leading AI experts called in March 2023 for a six-month moratorium on the development of the most powerful AI systems.
Generative AI can produce arbitrary images on command, such as fictional footage of King Charles’ coronation ceremony (left). When AI-generated images of Donald Trump or the Pope went viral on the internet (right), many users were unaware of their fictional nature.
Growing dangers of fake news from generative AI
Further deep fake dangers stem from images and videos created or manipulated by AI. In 2014, the Canadian computer scientist Yoshua Bengio and his colleagues succeeded in reprogramming an ANN that had been trained to provide text descriptions of images so that it could also run backwards. Put simply, the user enters a description of the desired image and the ANN creates a suitable virtual image, based on the wealth of data used in its training. By employing this image generator, artists can create truly fantastic works of art. This new style of art is referred to as “deep art”. Recently, an AI-generated photo even won a prestigious photo competition. Yet, the new technology also enables conspiracy theorists to create fake photos of prominent people in compromising situations that look deceptively real. There are already AI-generated fake photos of Donald Trump in prison and of the Pope in a rapper outfit circulating online (see images).
Two rival ANNs are usually employed in AI image generators. The first one (the generator) produces the images or videos on the basis of text commands. The second (the discriminator) works similarly to a human trainer and checks whether the outcome appears sufficiently genuine. The goal is reached when the discriminator can no longer find any differences from the stored originals. An arrangement in which two such ANNs oppose one another is referred to as a generative adversarial network (GAN).
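In the GAN literature, this contest is commonly written as a single objective that one network tries to maximise and the other to minimise. The symbols below are notation introduced here and do not appear in the article: G is the generator, D the control network, x a genuine image, and z the random input from which G produces an image.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

Here D(x) is the probability that the control network assigns to an image being genuine. The control network tries to make the expression large by scoring genuine images high and generated ones low, while the generator tries to make it small by producing images that are scored as genuine.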
Misleading rumours of an “AI world domination”
Due to the tremendous capabilities of recent AI, some cautionary voices claim that the systems could ultimately “achieve world domination.” For instance, Elon Musk, the American entrepreneur known for his provocative assertions, believes that AI will soon have the ability to autonomously create successors that are even more intelligent. American robot scientist Hans Moravec and inventor Ray Kurzweil had such dreams back in the 1980s. Yet, nothing of the sort has occurred so far.
The fact remains that AI is in principle nothing other than software running on computers—and these computers solely process the commands of a human programmer. Despite their apparent “intelligence,” which is essentially based on statistical learning, AI systems neither have consciousness nor are they capable of autonomous intentional action. There is no evidence for this to change in the foreseeable future.
Admittedly, however, AI is likely to have an impact on the future of the workforce. The World Economic Forum expects that by 2027, every eighth job could be lost as a consequence of AI. A study conducted by the Leibniz Institute for Economic Research (RWI) draws a different conclusion, suggesting that in the future, AI could actually result in even more employment. Cordelia Schmid is also optimistic: “Responsibly developed, AI has the potential to revolutionise our society—just as steam power and electricity once did. AI can help solve some of the most urgent problems in the world, from sustainability to health. I personally am enthusiastic about the opportunities for research that it creates.”
The Prize Winner 2023
Cordelia Schmid was born in Mainz in 1967. Her father was a physicist, and her mother a teacher of English and French, and later a housewife. “As a child I wanted to be a pilot. In school, I discovered my passion for mathematics. And my father’s profession helped me get an understanding of research early on.”
After graduating from school, Schmid studied computer science at the Karlsruhe Institute of Technology. She received her master’s degree in 1992 with a thesis on robot vision. “That inspired me to do research in the field of object recognition. At the time, computers were very bad at it; it was even difficult for them to recognise a simple cube.”
In 1996, the prize winner was awarded her PhD at the National Polytechnic Institute of Grenoble (Grenoble INP). In her dissertation, she developed fundamentally new procedures of image recognition, procedures that enabled enormous breakthroughs in computer vision and subsequently became the established standard. “It was the first study that used grey values to identify objects in images.”
After working as a postdoc at the British Oxford Robotics Institute, she moved to the French National Institute for Research in Digital Science and Technology (Inria) in Grenoble in 1997, where she received her advanced academic qualification (habilitation) in 2001. She has been a research director of this institute since 2004. Schmid also served as a co-editor of the International Journal of Computer Vision from 2004 to 2012, and from 2013 to 2018 as its editor in chief. Furthermore, Schmid has worked part time for Google Research since 2018.
The old notion that maths is “nothing for girls” is an unwarranted prejudice, according to the world-renowned computer scientist. Schmid was influenced by women such as the nuclear physicist Marie Curie, whose biography fascinates her. “Yet, many of my role models and mentors have been or are men.” She therefore advises “girls and women to not only stay on the lookout for female, but also for male role models and mentors in order to make a career in this supposedly male-dominated field.”
Cordelia Schmid has been awarded other scientific prizes in the past. She wishes to use the funds from the Körber-Stiftung to develop a kind of vision-enabled competitor to the chatbot ChatGPT. This new system would “also interact visually with the real world and have a separate memory for storing knowledge.”
In her free time, the prize winner enjoys reading crime stories, novels, and psychology books, including books on management. She likes to ski and loves hiking and climbing.
Dr Thomas Paulsen on the Körber Prize
“The Körber Prize is not just another source of funding. It supports scientific projects without the pressures of commercialisation and research bureaucracy.”
Dr. Thomas Paulsen
Member of the Executive Board, Körber-Stiftung
Dr Paulsen, what makes the Körber Prize special?
The Körber Prize is the only major science award with a decidedly European focus. Europe needs excellent science in order to keep up with research environments in the USA and Asia. That’s why the Körber Prize honours researchers from Europe’s scientific community who have demonstrated remarkable achievements and are expected to make further breakthroughs in the future. It is perhaps no coincidence that so far eight Körber Prize laureates have also been awarded the Nobel Prize. The significance of the Körber Prize is also underscored by its substantial prize money of one million euros, making it one of the most highly endowed science prizes in the world.
In times of third-party funding and industrial research, isn’t that just a drop in the bucket?
Not at all! The Körber Prize is not just another source of funding, but a distinction for outstanding scientific accomplishments. More importantly, it supports scientific projects without the pressures of commercialisation and research bureaucracy. Science is an open and often unpredictable process. It is for this reason that the Körber Prize allows a great freedom in the allocation of the prize money, enabling the winners to focus on the science instead of getting bogged down writing project reports.
Freedom of research sounds good. But what’s in it for society?
We want to overcome the tension between scientific freedom and social benefit. The Körber Prize therefore rewards research that holds the prospect of creating actual social value. This is expressed in the words of our founder Kurt A. Körber, who initiated the prize to contribute “to sustaining the living conditions on our planet.” A perfect example is this year’s winner Cordelia Schmid. As a result of her work on artificial intelligence, it may be possible to tackle many social problems in the future, such as the overloading of the healthcare and nursing systems or the lack of workers in many areas.
At the same time, AI poses challenges to society. Which will prevail from your point of view: the benefits or the risks?
They are equally important. AI can simplify tasks and make many things more efficient, but it can also lead to the loss of jobs, to discrimination, and to the spread of fake news. Therefore, we have to be aware when dealing with AI and perhaps acquire new skills in order to better understand it. Yet, we should remain open to this new development. Here I agree with Cordelia Schmid: Responsibly developed, AI has an unimaginable potential, which should definitely be utilised. That’s why the Körber-Stiftung actively promotes engagement with AI—with this year’s Körber Prize, but also in many other projects.
Award Ceremony 2023
Photos of the presentation of the Körber European Science Prize 2023 to Cordelia Schmid in the Hamburg City Hall on 08 September 2023
These photos are free to use in the context of news coverage with the credit Körber-Stiftung/Claudia Höhne.