Deep Learning: Learns increasingly complex features, Source: Andrew Ng

Before the recent rise of deep learning, the limiting factor of traditional machine learning and computer vision was handcrafted feature extraction: a human had to tell the algorithm what to look for and how to classify an image. Hand-engineering an algorithm to detect an object in an image is a long and laborious process that requires an expert to extract the relevant features, and it mostly led to unsatisfying results. Thousands of computer vision experts working on these technologies for many years could not achieve what a three-year-old child learns by looking at millions of images.

So the basic concept behind deep learning for image recognition is to let the algorithm itself extract the features needed to classify an image, based on a large set of training images.

This is typically achieved with a hierarchy of layers: early layers detect simple features and patterns such as (1) light and dark pixels and (2) edges and shapes, and later layers combine them into larger structures such as (3) eyes, noses or mouths and finally (4) a human face.

Convolutional Neural Networks

A very popular type of neural network that has proven very effective at image recognition tasks is the Convolutional Neural Network (CNN). We don’t want to dig too deep here and will stick with a rough overview to illustrate how CNNs reduce complexity to perform this task. Convolution in this context can be viewed as the process of filtering an image for specific patterns. Convolutional networks combine the following types of layers, each performing a different task.

  • Convolutional layers: looking for patterns in the data
  • Rectified Linear Units (ReLUs): setting negative responses to zero, which adds the non-linearity that lets stacked layers combine patterns into larger structures
  • Pooling layer: reducing complexity
  • Fully connected layer: connecting the findings with labeled data for classification

Source: Course Notes Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition by Andrej Karpathy
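
To make the four layer types above a little more concrete, here is a minimal sketch of a single forward pass in plain Python. Everything in it is an assumption made for illustration: the 6x6 image, the hand-written vertical-edge filter and the two-class weights are invented, and a real CNN would learn its filters and weights from training data instead of having them written by hand.

```python
# Toy forward pass: convolution -> ReLU -> max pooling -> fully connected.
# The image, the filter and the weights are made up for illustration.

def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def relu(feature_map):
    """Set negative responses to zero."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]

def fully_connected(feature_map, weights, bias):
    """Flatten the feature map and compute one score per class."""
    flat = [v for row in feature_map for v in row]
    return [sum(w * x for w, x in zip(ws, flat)) + b
            for ws, b in zip(weights, bias)]

# 6x6 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# Filter that responds to a dark-to-bright transition from left to right.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

feature_map = relu(conv2d(image, vertical_edge))  # 4x4, high where the edge is
pooled = max_pool(feature_map)                    # 2x2 summary of the 4x4 map

# Hypothetical weights for two classes: "edge" and "no edge".
class_weights = [[0.25] * 4, [-0.25] * 4]
class_bias = [-1.0, 1.0]
scores = fully_connected(pooled, class_weights, class_bias)

print(feature_map[0])  # [0, 3, 3, 0]
print(pooled)          # [[3, 3], [3, 3]]
print(scores)          # [2.0, -2.0]
```

The pooling step shrinks the 4x4 feature map to 2x2 while keeping the strong edge response, which is the "reducing complexity" part of the list above.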

Training a machine like a child

Inspired by the idea of providing computers with the same kind of learning experience a child has in his or her early developmental years, meaning a very large quantity of images, a team of researchers led by Fei-Fei Li (Stanford University) undertook a huge crowdsourcing effort to create a database containing millions of annotated images, the ImageNet database. Back in 2007 colleagues advised Fei-Fei Li to do something more useful for her tenure, but the massive effort that went into this project laid the foundation for the breakthrough of neural networks in computer vision by providing them with a very large set of labeled data.1

The image recognition revolution

In 2012 a team from the University of Toronto submitted AlexNet, the first model based on a CNN, to the annual ImageNet Large Scale Visual Recognition Challenge. This CNN brought the error rate down to 15%, from 26% for the best conventional machine vision solution.2

The winning solution of the 2015 challenge, submitted by a team from Microsoft, was the first CNN to beat a human, bringing the error rate down to 3.5% compared to the human benchmark of 5.1%. Driven by these breakthroughs, CNNs are finding their way into all fields of computer vision, including many medical and life science applications such as radiology, pathology and genomics. And while computer vision solutions based on handcrafted algorithms struggled to achieve acceptable results, deep learning based solutions, e.g. for cancer detection, suddenly appear to be a viable way to assist doctors in diagnosis.

How do CNNs “see” images?

Photo: Janko Ferlic, Unsplash

For a long time researchers did not have a clear idea of what exactly happens inside a neural network. The goal of the Deep Visualization project is to provide a better understanding by visualizing how CNNs see images. Researchers synthetically created images that maximally activate individual neurons in a Deep Neural Network (DNN). The images show what each neuron “wants to see”, and thus what each neuron has learned to look for.3

Source: Jason Yosinski
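
The idea of synthesizing an image that maximally activates a neuron can be sketched in a few lines. This is a heavily simplified illustration and not the Deep Visualization project's actual method: it assumes a single linear "neuron" with hand-picked 3x3 weights, whereas the real work operates on deep networks and uses backpropagated gradients together with regularization. The function names and parameters are hypothetical.

```python
# Toy activation maximization: start from a gray image and repeatedly nudge
# each pixel in the direction that increases the neuron's response.

def activation(weights, image):
    """Response of a single linear neuron: weighted sum of the pixels."""
    return sum(w * x
               for w_row, x_row in zip(weights, image)
               for w, x in zip(w_row, x_row))

def maximize_activation(weights, steps=20, step_size=0.1):
    """Gradient ascent on the input image, with pixels kept in [0, 1]."""
    image = [[0.5 for _ in row] for row in weights]  # uniform gray start
    for _ in range(steps):
        # For a linear neuron, the gradient of the activation with respect
        # to each pixel is simply that pixel's weight.
        image = [
            [min(1.0, max(0.0, x + step_size * w))
             for w, x in zip(w_row, x_row)]
            for w_row, x_row in zip(weights, image)
        ]
    return image

# Hypothetical neuron that has "learned" to look for a vertical edge.
neuron_weights = [[-1, 0, 1],
                  [-1, 0, 1],
                  [-1, 0, 1]]

gray = [[0.5] * 3 for _ in range(3)]
preferred = maximize_activation(neuron_weights)

print(activation(neuron_weights, gray))       # 0.0
print(preferred[0])                           # [0.0, 0.5, 1.0]
print(activation(neuron_weights, preferred))  # 3.0
```

The resulting image is dark on the left and bright on the right, which is exactly the pattern this toy neuron responds to most strongly. The middle column stays gray because its weight is zero, so the neuron does not care about it.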

So are we finally there? 

Can we train machines to see and process visual information like humans? And will they be able to replace us, e.g. in driving our cars or curing our diseases? Not yet, according to Olga Russakovsky, one of the organizers of the ImageNet challenge. In an article in New Scientist she points out that the algorithms only have to identify images as belonging to one of a thousand categories, which is tiny compared to what humans are capable of. To show true intelligence, machines would have to draw inferences about the wider context of an image, and what might happen one second after a picture was taken.4

In addition to the limitations in the type of tasks they can perform and in adding context to the information, neural networks are also easily fooled. GoogLeNet, the winning submission of the ImageNet challenge 2014, struggled to recognize images that had filters applied (e.g. on Instagram) or that depicted objects in an abstract form such as a 3D rendering, painting or sketch.5 A Two Minute Papers episode on “Breaking Deep Learning Systems With Adversarial Examples” illustrates how easily neural networks can be fooled by adding noise to the images.6 Another study shows how neural networks make high-confidence predictions for images that are unrecognizable to humans.7
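
How a small amount of carefully chosen noise can flip a prediction is easiest to see on a toy model. The sketch below assumes a hypothetical four-pixel linear classifier with made-up weights and labels, and perturbs the input in the spirit of the fast gradient sign method. It is not the attack used in any of the cited sources, and real attacks compute the gradient through a full network.

```python
# Toy adversarial example against a linear classifier.
# Weights, bias, labels and the input "image" are made up for illustration.

WEIGHTS = [1.0, -2.0, 3.0, -1.0]
BIAS = -0.5

def score(pixels):
    """Linear score: positive means "cat", otherwise "dog"."""
    return sum(w * x for w, x in zip(WEIGHTS, pixels)) + BIAS

def predict(pixels):
    return "cat" if score(pixels) > 0 else "dog"

def sign(value):
    return (value > 0) - (value < 0)

def adversarial(pixels, epsilon):
    """Move every pixel by at most epsilon in the direction that lowers
    the score. For a linear model the gradient is just the weight vector."""
    return [x - epsilon * sign(w) for w, x in zip(WEIGHTS, pixels)]

original = [0.5, 0.2, 0.4, 0.6]
perturbed = adversarial(original, epsilon=0.05)

print(predict(original))   # cat
print(predict(perturbed))  # dog
```

No pixel changes by more than 0.05, yet the label flips. Each tiny change is aligned with the model's weights, so the effects add up across all pixels instead of cancelling out.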

The recent news that Google/Alphabet is changing its plans from developing its own self-driving car to partnering with car manufacturers and equipping their cars with sensors8 could be seen as a symptom of the fact that technical development is slower than expected, and that technology still has a long way to go before we can develop truly intelligent machines that replace humans in complex tasks like driving a car or diagnosing cancer. But today’s neural networks for image recognition can already provide valuable assistance in such tasks by recognizing obstacles on the road or patterns in tissue that indicate the presence of cancer.