How Computers Learn to Understand our World

Artificial Intelligence is everywhere
Artificial Intelligence is everywhere in our lives. Have you noticed that your camera automatically detects faces when you try to take a photo? Your favorite online music streaming service or video platform analyzes what you play and when you hit pause in order to make recommendations based on what it thinks you like. Google image search uses methods to automatically label videos and photos so you can find them even though no one has ever put tags on them. I am dictating this text right now to my computer and it knows how to transcribe what I’m saying (more or less). These are just a few examples of common artificial intelligence (AI) tools seamlessly embedded in our everyday lives.
What is AI?
But what exactly is artificial intelligence? Actually, artificial intelligence is not exactly a scientific term. In computer science, the term Machine Learning is more common. In the most general sense, Machine Learning describes a model or algorithm that can solve a non-trivial task because it has seen training examples before. The simplest illustration is linear regression: fitting a straight line through a bunch of data points. As you might remember from high school, every linear function has two parameters: the slope and the intercept with the y-axis. Let’s call that function f(x) = mx + c. When you want to fit a straight line to some x-y pairs (“the data”), you want to tune those two parameters m and c such that the line describes the data as accurately as possible.

Linear Regression. Given data points, find the straight line that fits the data as accurately as possible by adjusting slope and y-intercept.
Now, you need to be a little more precise as to what “accurate” means. Usually, you define something like a cost function. The cost function measures the discrepancy between the prediction of the model with certain parameters and the true value. In the example of linear regression, a simple cost function is the absolute difference between f(x) (with parameters m and c) and y (the true y-value for a given x): |f(x) - y|, summed over all examples in your data set. (The squared difference is another popular choice.) This cost is what you want to minimize.
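The idea of a cost function can be made concrete with a few lines of Python. The following is only a minimal sketch, not a definitive implementation: the toy data set is made up for illustration, and the helper names f, cost and fit_line are assumptions that do not come from any library. Note also that fit_line minimizes the squared difference rather than the absolute difference, because that variant has a simple closed-form solution; cost itself uses the absolute difference as described in the text.

```python
def f(x, m, c):
    """Straight line with slope m and y-intercept c."""
    return m * x + c


def cost(m, c, data):
    """Sum of the absolute differences |f(x) - y| over all (x, y) pairs."""
    return sum(abs(f(x, m, c) - y) for x, y in data)


def fit_line(data):
    """Closed-form least-squares fit.

    Assumption: this minimizes the *squared* error, a common stand-in
    for the absolute error used in cost(), chosen here because it can
    be solved directly without an iterative optimizer.
    """
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Hypothetical toy data, scattered around the line y = 2x + 1.
data = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8), (4.0, 9.1)]

m, c = fit_line(data)
print(f"fitted slope m = {m:.2f}, intercept c = {c:.2f}")
print("cost of the fitted line:", round(cost(m, c, data), 3))
print("cost of a poor guess (m = 0, c = 0):", round(cost(0.0, 0.0, data), 3))
```

The fitted line should have a much lower cost than the poor guess, which is exactly what "tuning the parameters" is meant to achieve.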
Data is everything
As you can see, data plays the key role. In general, the more data you have, the better the model or algorithm will perform. That’s why Google does such a great job at its image search: they are able to process an incredible amount of data. Ideally, the data set is not just some random collection of images or other records. For many applications the data needs to be labeled; in the case of images, for example, with a description of the content of the image, with a keyword or with a class label. The goal is to define a model that can be seen as a function, just like in the linear regression example above. The function takes an input (image) and yields an output (keyword). However, just two parameters, as in the example above, are nowhere near sufficient. The functions we’re talking about in deep learning are very complex and can easily have a billion parameters. Because there is such an insane number of parameters to tune, a lot of training data is needed.
Complex tasks require complex models
One example of a typical machine learning problem is image classification. Here, you want to map an image to, say, one of a thousand classes. As mentioned, a quite sophisticated function is needed to get the job done. Why? Because an image itself is quite a complex thing. It consists of many pixels that don’t carry any meaning per se. Only when pixels are looked at in the context of their neighborhood (image patches) does some semantic information start to emerge. Natural images have a huge variance: just picture what an image of a car could look like. What color? What shape? Are there any reflections? What is the background? All this information is present in the pixels but completely irrelevant to the classification task. Thus, a model needs to be robust against all these disturbing factors. Inspiration comes from neuroscience. State-of-the-art image recognition systems work with something called artificial neural networks.

Example of a neural network. The input neurons take, for example, the pixels of an input image. These values are then passed forward through the network, activating some neurons. Eventually, one or more output neurons are activated, depending on the task.
A neural network in the computer
Let’s start small. What is a neuron? In our model, a neuron is an object with many inputs and one output. If the (weighted) sum of the inputs exceeds a threshold, the neuron outputs a signal; otherwise it remains silent. Many of these neurons form a layer and many layers build up a network. Simple tasks like handwritten digit recognition can be solved by such a neural network with a stunning accuracy of over 99% on unseen data. If there are multiple layers between input and output, the structure is called “deep”, hence the term “deep learning”.
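The neuron model described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how production networks are built: the function names neuron, layer and network are hypothetical, each input is multiplied by a weight before summing (a standard refinement of the plain sum), and the weights and thresholds are picked by hand rather than learned from data. The tiny two-layer network computes "exclusive or": it fires when exactly one of its two inputs is active.

```python
def neuron(inputs, weights, threshold):
    """Output 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0


def layer(inputs, weight_rows, thresholds):
    """A layer is a list of neurons that all see the same inputs."""
    return [neuron(inputs, w, t) for w, t in zip(weight_rows, thresholds)]


def network(inputs, layers):
    """Pass the inputs forward through each layer in turn."""
    activations = inputs
    for weight_rows, thresholds in layers:
        activations = layer(activations, weight_rows, thresholds)
    return activations


# Hand-picked (not learned) parameters for illustration only.
xor_net = [
    # Hidden layer: first neuron acts as "OR", second neuron as "AND".
    ([[1, 1], [1, 1]], [0.5, 1.5]),
    # Output layer: fires if "OR" is active but "AND" is not.
    ([[1, -1]], [0.5]),
]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", network([a, b], xor_net)[0])
```

In a real network these weights and thresholds are exactly the parameters that training adjusts, just like m and c in the linear regression example.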
Neural networks are so good that they even outperform humans in specific tasks such as traffic sign recognition. In an experiment, participants were shown real-world images with traffic signs on them. They were asked to classify the sign within a second or so (quite realistic, considering that a driver can only see the sign for a short time when driving past it). Neural networks were able to classify more signs correctly in a shorter amount of time.
The deeper, the better
Why is it so important to have deep structures? How are deep neural networks so powerful that they can do photo classification, image segmentation, speech transcription and many more magical-sounding things? Remember, we want our model (or function) to be super complex. The more parameters there are, the more complex a function can be. However, an increasing number of parameters also means that more data examples are needed for training.
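To get a feeling for how quickly the number of parameters grows, here is a rough sketch that counts them for a network in which every neuron is connected to every neuron of the previous layer. The function name count_parameters and the layer sizes are assumptions chosen for illustration; each connection is counted as one weight and each neuron as one threshold.

```python
def count_parameters(layer_sizes):
    """Count weights and thresholds of a fully connected network.

    layer_sizes lists the number of neurons per layer, input first.
    Every neuron has one weight per neuron in the previous layer,
    plus one threshold of its own.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total


# One input, one output: the straight line f(x) = mx + c.
print(count_parameters([1, 1]))

# Hypothetical small digit recognizer: a 28 x 28 pixel image (784 inputs),
# one hidden layer of 100 neurons, and 10 output classes.
print(count_parameters([784, 100, 10]))

# Adding more hidden layers (going "deeper") adds more parameters.
print(count_parameters([784, 100, 100, 100, 10]))
```

Even the small hypothetical digit recognizer has tens of thousands of parameters, compared with the two parameters of a straight line.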

Visualization of features learned by a deep network. The higher up in the network hierarchy, the more abstract/complex are the features that neurons will learn. Visualizations are projections to the input space. (Image: Rob Fergus)
A hierarchy of layers is able to extract features of growing abstraction, starting with the most basic ones (e.g. edges). The next layer learns what combinations of edges mean. The following layer will detect more abstract structures, and higher layers can detect complex features such as faces, wheels or cats. It has become a sport among scientists to design such networks and use them to solve complicated tasks which had previously been tackled with conventional mathematical methods. The big problem is: what are the rules for such networks? What is the ideal design for a problem? And what is the best algorithm to “train” the network, i.e. to adjust the parameters in order to achieve the best performance? All these questions and many more are part of current research. We are witnessing the evolution of a whole new science, a mix of neuroscience and computer science. Researchers will have to cope with technical questions like those mentioned above and discuss ethical standards for Machine Learning approaches. It may be the most exciting era ever.
About the author
Philip Haeusser did his Master of Science in Physics at the University of California, Santa Cruz and is now pursuing a PhD in Computer Science at the TU Munich.
Contact:
Philip Häusser M.Sc.
Technische Universität München
Department of Informatics
Tel: +49-89-289-17788
haeusser@cs.tum.edu