Traffic sign detection
We use a Machine Learning approach to detect traffic signs. This means that the project is split into two big parts: Training the ML model, and then using it in the ROS module.
We tried several different Machine Learning algorithms. Initially, we started out with neural networks, which didn't work well enough for a number of reasons. Still, we describe that approach at the end of this document, because we think that it might be promising to continue working on this in the future.
In the end, we decided to use SVMs with HOG features. In the following, we will describe how to train the model, and how it works in ROS.
In professional literature, detecting traffic signs is usually done in these two steps:
- Localization: Finding the location of possible signs in a large image
- Classification: Given a cropped image where only a sign (or maybe nothing relevant) is visible, classify whether it's a sign or not, and which sign it is
We roughly follow these steps, but the first step is just a very rough localization that we call finding candidates.
We want to take the camera image from ROS and then find traffic signs. It's also important that we're able to show where in the image the traffic sign was found. For visualization purposes, we draw a bounding box for this.
For now, we only want to find this red triangle sign. There are two additional important constraints:
- False positives should be avoided (i.e. we want a very high precision, while not hurting recall too much; the definitions follow below)
- The total running time per analysed frame must be under 1 second when running on the car
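For reference, precision and recall have their usual definitions here:

$$\text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN}$$

where TP, FP and FN count true positives, false positives and false negatives.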
For training, several datasets can be used. We tried a number of established datasets and preprocessed them for our needs. GTSRB (for classification) and GTSDB (for localization) are two well-known datasets with German traffic signs that can be used if one needs a lot of training data. These datasets contain traffic sign images that were taken from cars on real streets.
To get a Machine Learning model that generalizes better to our specific use case, we also created and labelled several datasets on our own. These datasets contain images that were taken directly from the car and show our custom-built traffic signs in the lab.
Generally, we got better results using our own datasets. However, this heavily depends on the Machine Learning algorithm that's being used. We ended up using SVMs with HOG features. This approach requires significantly less training data than neural networks do. For neural networks, it's better to either use GTSRB / GTSDB or to generate even more training data than we did.
When more training data is necessary, one can manipulate the individual images slightly, e.g. by adding a small skew or blur. This way, much more training data can be created on the fly.
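A minimal sketch of this kind of on-the-fly augmentation with OpenCV; the shear and blur ranges below are illustrative assumptions, not the values we used:

```cpp
#include <opencv2/opencv.hpp>

// Create a slightly distorted copy of a training image.
cv::Mat augment(const cv::Mat& src, cv::RNG& rng) {
    // Small random horizontal shear via an affine warp.
    float shear = rng.uniform(-0.1f, 0.1f);
    cv::Mat warp = (cv::Mat_<float>(2, 3) << 1, shear, 0,
                                             0, 1,     0);
    cv::Mat out;
    cv::warpAffine(src, out, warp, src.size());

    // Mild Gaussian blur with a random odd kernel size (1, 3 or 5).
    int k = 2 * rng.uniform(0, 3) + 1;
    cv::GaussianBlur(out, out, cv::Size(k, k), 0);
    return out;
}
```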
For getting access to the datasets and the code for training, please send an email to jscoder or florian.hartmann in the ZEDAT mail system.
Our final Machine Learning model uses SVMs with HOG features. This has one huge advantage: very little training data is needed. We can create a good model using ~20 training images and a really great one with ~30. Of course, this only works when the images meet some constraints:
- They need to have a good quality
- They need to be sufficiently distinct from each other, i.e. having two images that look nearly the same won't help much
- They need to cover all/most of the ways a sign might look at prediction time. This means it's important to show signs at different sizes, distances, angles and lighting conditions
For an implementation of SVMs, we make use of dlib, which works in Python as well as in C++. There are also some helpers for using dlib in ROS.
To make things easier, we decided to do the prototyping and training in Python. The ROS part is of course done completely in C++. For examples of how to train the SVM, one can look at our Jupyter notebooks.
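The actual training code lives in those notebooks, but dlib exposes the same HOG detector training in C++ as well. A rough sketch, where the detection window size and the C parameter are assumptions rather than our tuned values:

```cpp
#include <dlib/svm_threaded.h>
#include <dlib/image_processing.h>
#include <dlib/image_processing/scan_fhog_pyramid.h>
#include <dlib/array.h>
#include <dlib/array2d.h>

typedef dlib::scan_fhog_pyramid<dlib::pyramid_down<6>> image_scanner_type;

// images: training images; boxes: one list of sign rectangles per image.
void trainDetector(const dlib::array<dlib::array2d<unsigned char>>& images,
                   const std::vector<std::vector<dlib::rectangle>>& boxes) {
    image_scanner_type scanner;
    scanner.set_detection_window_size(80, 80);  // assumed sliding-window size

    dlib::structural_object_detection_trainer<image_scanner_type> trainer(scanner);
    trainer.set_num_threads(4);
    trainer.set_c(1);  // SVM regularization: higher fits the training data more tightly

    dlib::object_detector<image_scanner_type> detector = trainer.train(images, boxes);
    dlib::serialize("own27.svm") << detector;  // file name later loaded by the ROS node
}
```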
Deciding which SVM to select is a matter of finding a good trade-off between accuracy and performance. Training on all 27 images of our favoured dataset yields an extremely accurate model, but also quite a slow one.
The alternative is to use only the first 20 images. This still gives a good model, just with lower accuracy and much better performance.
Of course, which model to take is a question of what performance one wants to have and how exactly the model is being used. In the ROS package, we use the more accurate model, but do a lot of preprocessing to make sure that we query the model as few times as possible.
We recorded videos of these two SVMs in action:
When viewing both of the videos next to each other, it becomes clear that own27 is doing a much better job. We're also using these names (own27.svm / own20.svm) as file names for the SVMs. A closer analysis of the results is given later in the section Evaluating the results.
Our program needs the cv_bridge and image_transport packages. The cv_bridge package is used for converting images between ROS image messages and OpenCV images. The image_transport package is used for subscribing to and publishing images.
For Machine Learning functionality, we use dlib. We added instructions for adding dlib compilation to the build process. When working on a new car, make sure to download dlib and place it in the correct folder.
There is only one package, named traffic_sign_detection, in the workspace (e.g. traffic_sign_ws for debugging or catkin_ws_user for general use). Inside the package, there is a single node, named detection, which does the image receiving, preprocessing, detecting and publishing.
The detection node subscribes to the topic app/camera/rgb/image_raw, so it receives the image frames taken by the camera. Using the cv_bridge package, it converts each image to OpenCV format.
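The subscribing and converting part of the detection node essentially looks like this; a sketch with a simplified callback body:

```cpp
#include <ros/ros.h>
#include <image_transport/image_transport.h>
#include <cv_bridge/cv_bridge.h>
#include <sensor_msgs/image_encodings.h>

void imageCallback(const sensor_msgs::ImageConstPtr& msg) {
    cv_bridge::CvImagePtr cv_ptr;
    try {
        // Convert the ROS image message to an OpenCV BGR image.
        cv_ptr = cv_bridge::toCvCopy(msg, sensor_msgs::image_encodings::BGR8);
    } catch (cv_bridge::Exception& e) {
        ROS_ERROR("cv_bridge exception: %s", e.what());
        return;
    }
    // cv_ptr->image is a cv::Mat; candidate finding and the SVM run on it.
}

int main(int argc, char** argv) {
    ros::init(argc, argv, "detection");
    ros::NodeHandle nh;
    image_transport::ImageTransport it(nh);
    image_transport::Subscriber sub =
        it.subscribe("app/camera/rgb/image_raw", 1, imageCallback);
    ros::spin();
    return 0;
}
```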
For general testing, we also wrote another node, pub.cpp, for directly publishing images. It reads an image from the given path and publishes it on the topic image_reading. Running the command rosrun traffic_sign_detection pub $PATH_OF_THE_IMAGE starts the publishing. To use it, just change the subscribed topic in the detection node to image_reading, and it will subscribe to these images instead.
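A minimal version of such a publisher node might look like this (the 10 Hz publish rate is an assumption):

```cpp
#include <ros/ros.h>
#include <image_transport/image_transport.h>
#include <cv_bridge/cv_bridge.h>
#include <std_msgs/Header.h>
#include <opencv2/opencv.hpp>

int main(int argc, char** argv) {
    ros::init(argc, argv, "pub");
    ros::NodeHandle nh;
    image_transport::ImageTransport it(nh);
    image_transport::Publisher pub = it.advertise("image_reading", 1);

    // Read the image from the path given on the command line.
    cv::Mat image = cv::imread(argv[1], cv::IMREAD_COLOR);
    sensor_msgs::ImagePtr msg =
        cv_bridge::CvImage(std_msgs::Header(), "bgr8", image).toImageMsg();

    ros::Rate rate(10);  // keep republishing so late subscribers get frames
    while (ros::ok()) {
        pub.publish(msg);
        rate.sleep();
    }
    return 0;
}
```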
One central problem of the SVM is that it's very slow. Basically, the SVM part moves a sliding window across the image and applies the SVM with HOG features to every one of these subimages. The vast majority of subimages don't contain any signs, so this is very inefficient.
To make it more efficient, we want to do candidate finding: This means we do some preliminary filtering so that we have to query the SVM fewer times. By doing this, performance improves a lot.
Finding candidates should work like this:
- It needs to find all signs, because otherwise we're hurting accuracy
- It's ok if it also finds some other stuff, but it shouldn't produce hundreds of candidates
In other words: All signs have to be candidates, but it's alright if only 20% of all candidates are also signs.
To find candidates, we use OpenCV (a sketch follows after this list):
- Find red pixels, using the HSV color format
- Find areas with many red pixels, using OpenCV contours, and
- Filter with some heuristics:
  - Only use candidates that are approximately square, because the sign bounding box is always nearly square
  - Only use candidates that have a certain size, i.e. if an area is too large or too small then it's most probably not a sign
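A sketch of this pipeline; the HSV thresholds and the size limits are illustrative assumptions, not our exact values:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Rect> findCandidates(const cv::Mat& bgr) {
    cv::Mat hsv;
    cv::cvtColor(bgr, hsv, cv::COLOR_BGR2HSV);

    // Red wraps around the hue axis in HSV, so combine two hue ranges.
    cv::Mat maskLow, maskHigh, mask;
    cv::inRange(hsv, cv::Scalar(0, 100, 100), cv::Scalar(10, 255, 255), maskLow);
    cv::inRange(hsv, cv::Scalar(160, 100, 100), cv::Scalar(179, 255, 255), maskHigh);
    mask = maskLow | maskHigh;

    // Areas with many red pixels show up as contours of the mask.
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    std::vector<cv::Rect> candidates;
    for (const std::vector<cv::Point>& contour : contours) {
        cv::Rect box = cv::boundingRect(contour);
        double aspect = static_cast<double>(box.width) / box.height;
        // Heuristics: roughly square, and neither tiny nor huge.
        if (aspect > 0.7 && aspect < 1.4 &&
            box.area() > 400 && box.area() < 40000)
            candidates.push_back(box);
    }
    return candidates;
}
```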
Then we give the candidates to the SVM until a configured time limit has passed.
We only load the SVM file once, when the node starts up. Then, as many candidates as time allows are given to the SVM. This needs very little code because we just use dlib's SVM implementation. Finally, we go through the results and draw them into the image.
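A sketch of this loop; detector is the dlib type from the training sketch above, and the names findCandidates, detectSigns and budgetMs are ours for illustration:

```cpp
#include <dlib/image_processing.h>
#include <dlib/image_processing/scan_fhog_pyramid.h>
#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>
#include <chrono>
#include <vector>

typedef dlib::scan_fhog_pyramid<dlib::pyramid_down<6>> image_scanner_type;
dlib::object_detector<image_scanner_type> detector;

void loadDetector() {
    dlib::deserialize("own27.svm") >> detector;  // done once at node startup
}

std::vector<cv::Rect> detectSigns(const cv::Mat& bgr,
                                  const std::vector<cv::Rect>& candidates,
                                  long budgetMs) {
    const auto start = std::chrono::steady_clock::now();
    std::vector<cv::Rect> found;
    for (const cv::Rect& box : candidates) {
        const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start).count();
        if (elapsed > budgetMs)
            break;  // time budget used up, skip the remaining candidates

        // Wrap the candidate subimage for dlib without copying and run the detector.
        cv::Mat roi = bgr(box);
        dlib::cv_image<dlib::bgr_pixel> dlibImg(roi);
        for (const dlib::rectangle& r : detector(dlibImg))
            found.push_back(cv::Rect(box.x + r.left(), box.y + r.top(),
                                     r.width(), r.height()));
    }
    return found;
}
```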
There are two possible outcomes: either we find a traffic sign in the current frame, or we don't. In the first case, we use the OpenCV function cv::rectangle to draw a bounding box around the traffic sign, and then publish the image through the already created image publisher imagePub by calling imagePub.publish(cv_ptr->toImageMsg()). In the second case, there is no bounding box to draw, so we directly publish the original image.
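Put together with the callback from above, the publishing step is just a few lines (found being the result of the detection sketch):

```cpp
// Draw each detection into the frame, then publish (with or without boxes).
for (const cv::Rect& box : found)
    cv::rectangle(cv_ptr->image, box, cv::Scalar(255, 0, 0), 2);
imagePub.publish(cv_ptr->toImageMsg());
```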
An important consideration of the project is that we shouldn't completely block the CPU, but should also let other packages work. To be able to do this, we make sure that our launched node is nice.
We configured our launch file to automatically enable niceness for our node. This means that if other important nodes want to do work on the CPU, we are nice and let them have the resources. But if not many other nodes are running on the car, then we take the CPU resources, because they are available and it wouldn't make sense to just leave them without work.
We created a special launch file that takes care of starting the node with the nice property: adding launch-prefix="nice" to the node entry makes the process run with a lower CPU priority.
Running roslaunch traffic_sign_detection detection.launch starts the traffic_sign_detection package with the nice property.
The accuracy of the SVM that we ended up using can be seen in this video:
https://drive.google.com/open?id=0B_Kof8cSs3Y7NGxBdDR4dzBseG8
It's able to find most of the traffic signs, and we eventually managed to improve performance enough to run this exact SVM on the car.
This video should showcase that the SVM works extremely well in most cases. Therefore we will also focus on the not so good cases for the remainder of this section.
Very small changes to the image can sometimes lead to a different prediction, as shown in this example:

Of course to the human eye, this difference in prediction looks very weird. It still shouldn't be a huge problem for using the module, because one can just reuse the same prediction for several frames.
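One simple way to do that reuse, as a sketch (the maximum age of 5 frames is an assumed value):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Keep showing the last detection for a few frames when a single frame misses.
struct DetectionCache {
    std::vector<cv::Rect> lastBoxes;
    int framesSinceHit = 0;
    static const int maxAge = 5;

    std::vector<cv::Rect> update(const std::vector<cv::Rect>& current) {
        if (!current.empty()) {
            lastBoxes = current;
            framesSinceHit = 0;
        } else if (++framesSinceHit > maxAge) {
            lastBoxes.clear();  // the cached detection is stale, drop it
        }
        return lastBoxes;
    }
};
```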
Dealing with very small signs is hard because small subimages don't work for the SVM HOG approach. In this image, we only find one sign. This is because it's slightly closer to the camera than the other sign.
One frame later, we can detect both signs because they're now close enough to the camera:
We spent some time measuring the distances: usually, the maximum distance at which we're able to detect signs is a bit more than 1 meter. After that, the signs just become too small. Again, this shouldn't be a huge problem for using the module, because the signs are always recognized as soon as they get close enough.
Additionally, the >1 meter distance seems to be quite sufficient to find small signs:
In theory, there is also a maximum image size. Because the SVM HOG approach uses sliding windows, we need to configure some sensible maximum image size. In practice, we found that this rarely messes up detection.
To give an example where our model works well, it's generally not a problem to detect multiple signs, e.g. for this image:
The number of signs that we can detect is only limited by how much time we want to spend on detection. If we don't want to exceed 1 second, then it's not possible to detect a large number of signs (more than about 5). In the lab we don't have that many signs, so we don't really hit that limit here. If the signs are very close to the camera (i.e. more expensive to analyse), then finding even two signs can be expensive, but this heavily depends on the exact image being used.
Our model also works well in the case of bad angles. This is because we spent extra time on making this work by adapting training for this case.

One very important part of our work was to make sure that we don't get many false positives. In practice, this part works very well. In the example videos linked above, you can see that there is not a single false positive in a 90-second video, even if we choose the weaker SVM. In terms of precision, this means we achieve a perfect precision of 1.
In practice, the SVM is only as good as the candidate finding allows it to be. So it's also important to evaluate how well finding candidates works.
In the following images, we highlight each found candidate with a small black rectangle. In a bold blue rectangle, you can see all the candidates where a traffic sign was found. Notice that the candidate rectangle is larger than the traffic sign rectangle. This is because we want a closer bounding box for detection but a larger more general one for finding candidates.
Because the candidate finding works using colors, one might think that it's very easy to confuse it. But as seen here, it doesn't think the orange ball should be a candidate:
Sometimes, the candidate finding detects parts of an image that humans would consider brown, but in HSV it's in the red color range, for example shoes:
Of course, this isn't bad, as the SVM correctly recognizes that a shoe is not a traffic sign.
For most of the lab, the candidate finding ignores the background. Only in this one direction does it pick up candidates in the background:
We only implemented traffic sign detection for one kind of sign. Here you can see that we don't recognize other traffic signs as our kind of traffic sign:
Depending on the background, there might be more candidates. Again, this only happens when facing this one direction. If a sign had been in this image, it would have been hard to detect, because we don't have enough CPU time to consider all candidates.
But keep in mind that we tried hard to create this example image. We had to move the camera angle up, just to make sure that we show more of the background.
Something important to keep in mind are lighting differences. By using HSV, we can deal with them pretty well, but of course we are not completely immune to them.
The following test images were recorded late in the evening. The first image was taken when the light was on and the blinds were open.
This second image was taken with the blinds closed, which made the room darker. You can see this by looking at the white piece of paper on the cupboard in the top left corner. In this image it's a lot clearer to see, because no light is shining on it.
In both cases, our solution finds the traffic sign.
As mentioned earlier, when really wanting to confuse the model, it's possible to do so by heavily changing the lighting conditions. For this image, we turned the light off and closed the blinds. The room is now extremely dark, and we can't find the sign anymore. The candidate finding fails here.
Neither candidate finding nor the SVM classification is completely perfect. But in our personal testing, both worked very well. When driving the car around, we always found each sign at some point. Maybe not immediately, because the sign wasn't close enough or the angle was bad, but once the car moved a bit, each sign was always found. The flickering looks weird but shouldn't be a big problem in production.
The goal was to stay under 1 second. Generally, we managed to reach this goal. In the code for the package, one can configure after how many milliseconds no new candidates are given to the SVM anymore.
Currently, we configured it in a way so that the prediction stays under 1s for the majority of cases, but in special cases a slightly longer time is allowed. Of course, this can be changed so that we always stay easily under 1s.
If only one sign is visible and the background is not complicated, detection takes around 300-500 ms. If two signs are visible, it goes up to a value closer to 1 s. It's not possible to detect many more signs in one second. But these values are just ballpark figures; it always depends on the size of the signs and the background. The smaller the sign, the less runtime we need.
If no sign is visible and the background is not complicated, it only takes about 30 ms to come to that conclusion. As described earlier, CPU usage is also not a problem.
In terms of accuracy the SVM model is already working very well. The main pain point is the speed. To improve this one could:
- Train the SVM on different kinds of signs
- Make candidate finding smarter, e.g. using Bayesian filters or just by adding more heuristics for good candidates. Also make it work under different lighting conditions
- Try different libraries for SVMs
- Completely get rid of SVMs and use a faster algorithm
- Try to make the neural network approach work
We spent a lot of time trying to make deep learning work for this use case. There were some good results, but in the end we decided to drop this approach in favor of SVMs with HOG features.
Neural networks are very good for traffic sign classification (i.e. after a possible sign has already been localized). Using a dataset like GTSRB, we managed to get above human performance. However, localizing the sign in the image is much more difficult. With a dataset like GTSDB, we can get about 97% accuracy for real-world images, but it generalizes much worse to our lab images. Because we didn't have much data for lab images, we weren't able to train for our specific use case.
Instead of trying to localize signs, we tried a sliding window approach with candidate finding, very similar to what we did for SVMs. This didn't work well enough for neural networks because the SVM got better results (since it needed much less training data). But it was still worth experimenting with this, because we ended up reusing many of those ideas for the SVM approach.
Another big problem was that we had a hard time running deep learning libraries on the car. We spent a lot of time trying to make TensorFlow or Caffe run on the car, but in our case it never worked because of ProtoBuf compile errors.