Generally speaking, humans are good at visualizing and understanding images. We cannot see text without reading the words, and we have no trouble identifying separate objects and understanding the relationships between them.
That is, we can tell cats from dogs, water bottles from mugs, and red lights from green. Computers, on the other hand, struggle with this problem. You or I have no trouble going online and watching videos of cats. We might not even think twice about it, but when a computer did the same thing in 2012, it took 16,000 processors, and the result was notable enough to earn publications and keynote talks. [src: https://www.wired.com/2012/06/google-x-neural-network/]
Identifying cats is one thing (especially considering the algorithm had no concept of what a cat was beforehand), but it’s not the same as driving a car. An autonomous vehicle won’t have the same compute power that the cat algorithm had, it won’t have the same amount of time, and the penalty for an incorrect classification could be much worse.
So what are some of the ways that a self-driving car can detect and identify objects? Autonomous vehicles can carry any number of sensors, ranging from heat and humidity sensors to GPS, tactile sensors, radar, and, of course, cameras.
For this post, we will talk about how cameras on autonomous vehicles identify objects.
The most rudimentary technique for image recognition is something called edge detection. The fundamental idea is that pixels belonging to a single object will be relatively similar, while pixels belonging to different objects will be relatively different. So if we calculate the difference from each pixel to its neighbors, we can draw an edge wherever that difference is large or dramatic. Once we have drawn all the edges in our image, we should have a fairly good idea of what scene we're looking at. For example, here are the input and output images from an edge detection algorithm known as the Canny algorithm. [src: https://en.wikipedia.org/wiki/Edge_detection#Canny]
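To make the pixel-to-pixel idea concrete, here is a minimal sketch in plain Python. It is not the Canny algorithm, which adds smoothing, gradient estimation, and edge thinning on top of this idea. The function name detect_edges, the threshold of 50, and the tiny test image are all assumptions chosen for illustration.

```python
def detect_edges(image, threshold=50):
    """Mark a pixel as an edge when it differs sharply from its right or lower neighbor.

    `image` is a list of rows of grayscale values (0-255).
    Returns a grid of the same size holding 1 for edge pixels and 0 otherwise.
    """
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Difference to the right-hand neighbor (0 in the last column).
            dx = abs(image[y][x + 1] - image[y][x]) if x + 1 < width else 0
            # Difference to the neighbor below (0 in the last row).
            dy = abs(image[y + 1][x] - image[y][x]) if y + 1 < height else 0
            if max(dx, dy) > threshold:
                edges[y][x] = 1
    return edges


# A dark square (value 20) on a bright background (value 200).
image = [
    [200, 200, 200, 200, 200],
    [200,  20,  20,  20, 200],
    [200,  20,  20,  20, 200],
    [200,  20,  20,  20, 200],
    [200, 200, 200, 200, 200],
]

for row in detect_edges(image):
    print("".join("#" if value else "." for value in row))
```

The printed grid traces the outline of the square and leaves its uniform interior blank. If the square were only slightly darker than the background (say 180 against 200), no difference would exceed the threshold and no edges would be found.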
One use case for edge detection might be to find the outline of a car and to track its movement. A real-world example of this is the driver-assistance app iOnRoad, which used a camera to guide the host vehicle by measuring the distance between the taillights of the cars ahead. That is, if the camera can use edge detection to tell what is a body panel and what is a taillight, then the algorithm can track how far apart the two taillights appear in the image. The closer the car ahead, the farther apart its taillights appear, so that separation gives an estimate of the distance to the car.
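As a rough sketch of how taillight separation could translate into distance, we can use the standard pinhole-camera relation: distance = focal length x real width / apparent width. This is an assumption about the general approach, not a description of how iOnRoad actually worked. The function name, the focal length of 1000 pixels, and the 1.5 m taillight separation are all hypothetical values.

```python
def distance_to_car(pixel_separation, focal_length_px=1000.0, taillight_separation_m=1.5):
    """Estimate the distance to the car ahead, in meters.

    Uses the pinhole-camera relation: distance = f * W / w, where
    f is the focal length in pixels (hypothetical default),
    W is the real distance between the taillights in meters (hypothetical default),
    w is the distance between the taillights in the image, in pixels.
    """
    if pixel_separation <= 0:
        raise ValueError("pixel_separation must be positive")
    return focal_length_px * taillight_separation_m / pixel_separation


# The taillights appear closer together as the car ahead pulls away.
for pixels in (300, 150, 75):
    print(f"{pixels} px apart: about {distance_to_car(pixels):.1f} m ahead")
```

Halving the apparent separation doubles the estimated distance. The estimate is only as good as the assumed real-world taillight spacing, which varies from car to car.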
While edge detection is useful for pulling some data out of pixels, it has two large downsides. First, it performs terribly in low-light settings, or in settings with little contrast (think: clouds and snow). Second, the information it provides requires a lot of overhead to turn into actionable advice. The administrator of the algorithm needs to understand a lot of different things about the image before the data becomes useful, and the algorithm itself has no logical concept of what it is looking at.
More advanced image recognition algorithms contain concepts of objects. An algorithm looking for street signs will have some concept of what a street sign is: its general shape, size, and colors, and what to do with this information. These object-aware algorithms can roughly be categorized as image classification algorithms.
The algorithm can use this information to better identify objects by matching outlines and patterns to a known set of images. This also means that when an image is identified, the algorithm knows relatively more about the scene around it: a stopped car means we should stop, a merge sign painted on our lane means we need to move over.
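Here is a minimal sketch of matching an outline against a known set of patterns. Real classifiers are far more sophisticated. The 5x5 templates, the labels, the ACTIONS table, and the function names are all made up for illustration.

```python
# Known outlines, drawn as tiny 5x5 grids ("#" = filled, "." = empty).
TEMPLATES = {
    "yield sign": [
        "#####",
        ".###.",
        ".###.",
        "..#..",
        "..#..",
    ],
    "stop sign": [
        ".###.",
        "#####",
        "#####",
        "#####",
        ".###.",
    ],
    "warning sign": [
        "..#..",
        ".###.",
        "#####",
        ".###.",
        "..#..",
    ],
}

# What the vehicle should do once a sign is recognized (hypothetical mapping).
ACTIONS = {
    "yield sign": "slow down and give way",
    "stop sign": "stop",
    "warning sign": "proceed with caution",
}


def mismatch(a, b):
    """Count the cells where two patterns disagree."""
    return sum(ca != cb for row_a, row_b in zip(a, b) for ca, cb in zip(row_a, row_b))


def classify(pattern):
    """Return the label of the template that differs from `pattern` in the fewest cells."""
    return min(TEMPLATES, key=lambda label: mismatch(pattern, TEMPLATES[label]))


# A noisy stop sign: one cell differs from the stored template.
observed = [
    ".###.",
    "#####",
    "##.##",
    "#####",
    ".###.",
]

label = classify(observed)
print(label, "->", ACTIONS[label])
```

Because the match only has to be the closest one, not an exact one, the noisy outline is still recognized, and the label carries an action with it.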
The actual process of image classification is much more varied and complex than edge detection. However, a fairly common implementation is to use neural networks. We won’t cover neural networks here, but effectively they are layers of filters the computer applies to data in order to determine what it is looking at. For example, the computer might ask “Is there a large red area in the image?” and then “Is the red area shaped like an octagon, or some angled view of an octagon?” and then “Does the red octagon say STOP?”, in which case the algorithm concludes that it is probably looking at a stop sign. Each of those questions would be modelled as a layer, or node, in the neural network, and these networks can span thousands or millions of nodes. More interestingly, these networks learn and get better the more they are used, sort of like our brains.
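The chain of questions can be sketched as a tiny two-layer network. This is a toy under heavy assumptions: the three input scores (how red, how octagon-like, how much the text resembles STOP) are handed to the network ready-made, and the weights are picked by hand. A real network works from raw pixels and learns its weights from training data. All names here are hypothetical.

```python
import math


def sigmoid(x):
    """Squash any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """One fully connected layer: each node takes a weighted sum of the inputs, then a sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


# Hand-picked, purely illustrative weights. A real network learns these.
HIDDEN_WEIGHTS = [
    [6.0, 6.0, 0.0],  # node 1: "is it red AND octagon-shaped?"
    [0.0, 6.0, 6.0],  # node 2: "is it octagon-shaped AND does it say STOP?"
]
HIDDEN_BIASES = [-9.0, -9.0]
OUTPUT_WEIGHTS = [[6.0, 6.0]]  # output node: "do both hidden nodes agree?"
OUTPUT_BIASES = [-9.0]


def stop_sign_score(redness, octagon_likeness, stop_text_likeness):
    """Return a score between 0 and 1; higher means more likely a stop sign."""
    hidden = layer([redness, octagon_likeness, stop_text_likeness], HIDDEN_WEIGHTS, HIDDEN_BIASES)
    return layer(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES)[0]


print("red octagon saying STOP:", round(stop_sign_score(1.0, 1.0, 1.0), 3))
print("red octagon, no text:   ", round(stop_sign_score(1.0, 1.0, 0.0), 3))
print("red car:                ", round(stop_sign_score(1.0, 0.0, 0.0), 3))
```

Only the input that answers yes to all three questions gets a high score. Training a network amounts to adjusting those weight tables automatically until the scores come out right on known examples.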
It is no surprise, then, that as computers and algorithms evolve, we are trying to make them more and more like us. After all, we have no trouble identifying cats or road signs, so it makes sense to model computer vision after our own.