# Project: Follow Me


## Objectives

  1. Use Udacity's quadrotor simulator to collect scene data from a busy city environment, consisting of landscape, buildings, non-hero persons, and a hero person.
  2. Build a fully convolutional network using Keras for use in semantic segmentation.
  3. Train the network on collected scene data to classify environment objects, non-hero persons, and the hero person. Adjust various network parameters as necessary.
  4. Test the network on provided test images and deploy the network in a "Follow Me" setting, where the quadrotor must follow the hero person walking through a busy city environment.

## Writeup

Here I will discuss various aspects of the project and my approach as the project progressed. Link to rubric here.

### Fully Convolutional Networks - A Conceptual Discussion

Fully convolutional networks (FCNs) preserve the spatial structure of an image while still being able to perform image tasks such as classification. Regular convolutional networks often end with one or more fully connected layers, which connect every pixel of every feature map to each hidden node. This destroys the spatial relationship that each pixel has with the pixels around it. To avoid this, fully convolutional networks employ 1x1 convolutions at the end of the "encoder", in lieu of fully connected layers. This still allows the network to learn by combining different feature maps (coupled with an activation function such as ReLU), yet it preserves the spatial layout of the feature maps. It is therefore similar to a fully connected layer, but it only combines features along the depth of the output map (the channel dimension).
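To make that contrast concrete, here is a minimal sketch in tf.keras (the feature map shapes and filter counts are arbitrary illustrations, not the values used in this project):

```python
import tensorflow as tf

# A stand-in for an encoder output: an 8x8 grid of 64-channel features.
features = tf.keras.Input(shape=(8, 8, 64))

# Fully connected head: flattening discards the 8x8 spatial grid.
fc_head = tf.keras.layers.Dense(32, activation='relu')(
    tf.keras.layers.Flatten()(features))                 # shape (None, 32)

# 1x1 convolution head: mixes channels at every location,
# so the 8x8 spatial layout is preserved.
conv_1x1_head = tf.keras.layers.Conv2D(
    32, kernel_size=1, activation='relu')(features)      # shape (None, 8, 8, 32)
```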

The "encoder" part of the FCN has the exact same function as a regular convolutional network. In the encoder, you employ several layers of convolutions to extract features from the input data and output feature maps that have decreasing spatial resolution and increasing depth. It ends with 1x1 convolutions, as discussed above. The "decoder" part of an FCN takes the feature maps outputted by the encoder and uses them to rebuild a representation of the image through upscaling and further convolutions. There are several ways to accomplish this, such as with transposed convolutions or bilinear upsampling (followed by more convolutions).

This encoder/decoder architecture is particularly useful for scene segmentation, where one desires a pixel-wise classification of an image. Because the encoder extracts features and preserves their spatial context within the image, the decoder can take the encoder's output, upscale it back to the original image size, and perform pixel-wise classification on the upscaled feature map. This requires training the network on masked versions of the original images, so that the network "learns" the features that correspond to each class. Bounding boxes are also useful for scene understanding, but they may not be adequate in certain situations, especially where a box cannot capture the shape of an object (e.g. a curvy road in an image). Fully convolutional networks, given properly labeled mask images, can handle such cases. However, in the encoding step some resolution is lost, because each successive layer compresses the image into a smaller spatial area, so the final mask output will have lower resolution than the original image. One can alleviate this issue by adding skip connections, whereby encoder feature maps are concatenated with (or added to) the decoder inputs of the same spatial size. This improves the spatial accuracy of the output layers.

This model would work well for following any kind of object, assuming appropriate training data is available. As this simulator features a human being as the object of interest, the network is trained to identify human beings. If one wanted to follow another type of object (e.g. car, dog, cat, etc.), one would need to procure images of that object, along with properly masked versions of those images.

### Network Architecture

To carry out semantic segmentation in this project, I created a fully convolutional network using the Keras framework with TensorFlow as the backend. The encoder portion of the network initially consisted of layers of separable convolutions (with ReLU activations and batch normalization). These convolutional layers are followed by one or more 1x1 convolutions (also with ReLU activations and batch normalization). The decoder portion consisted of layers of bilinear upsampling coupled with a skip connection and a separable convolution (with ReLU and batch normalization). Bilinear upsampling is a computationally efficient way of upscaling feature maps to a higher resolution. Skip connections are feature maps taken from the encoder portion; these feature maps are concatenated with the output feature maps from bilinear upsampling. The subsequent separable convolution takes these upscaled and skip-connected feature maps and "learns" further spatial features from them.
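A rough sketch of these encoder and decoder blocks (written with tf.keras; the helper names and filter handling are my own illustration, not the exact code from the project notebook):

```python
import tensorflow as tf
from tensorflow.keras import layers

def encoder_block(x, filters, strides=2):
    """Separable convolution + batch normalization + ReLU."""
    x = layers.SeparableConv2D(filters, kernel_size=3,
                               strides=strides, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation('relu')(x)

def decoder_block(x, skip, filters):
    """Bilinear upsampling + skip connection + separable convolution."""
    x = layers.UpSampling2D(size=(2, 2), interpolation='bilinear')(x)
    x = layers.Concatenate()([x, skip])      # skip connection from the encoder
    x = layers.SeparableConv2D(filters, kernel_size=3, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation('relu')(x)
```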

At first, I started with just a single layer of encoding and decoding. With this shallow architecture, it was difficult to get meaningful results. I had to keep the encoder portion "thin", with just a few output layers. I also found that the decoder portion needed a large number of output layers in order to classify correctly. In spite of these architecture adjustments, I was only able to achieve a final IoU score of 0.3341. Please see the spreadsheet parameter_exploration.ods, also located in this repository, for a more detailed progression of my network architecture throughout the various training runs of this project.

As I explored the results of various training runs on this first data set, I noticed one particular trend: the network would often classify objects based merely on color. If a building or landscape feature was red, the network would identify it as the hero. At this point, I realized that I should not be employing a separable convolution on the input layer, which has only three channels of data. With a separable convolution there, the network can only use linear combinations of merely three depthwise filters to generate all the output feature maps for the next layer, which is not sufficient for more complex classification tasks (columns M through P seem to confirm this suspicion). From that point on, I always used a regular convolution for the first layer, followed by separable convolutions in subsequent layers. After making that change and adding a second encoder and decoder layer to the network, I observed a sizable increase in accuracy (see column K in parameter_exploration.ods).
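The parameter counts show why a separable convolution on a 3-channel input is so restrictive (a quick check with tf.keras; the 32-filter width is only for illustration):

```python
import tensorflow as tf

input_shape = (None, 160, 160, 3)  # an RGB input image

# Regular 3x3 convolution: 3*3*3*32 weights + 32 biases = 896 parameters.
regular = tf.keras.layers.Conv2D(32, kernel_size=3, padding='same')
regular.build(input_shape)

# Separable 3x3 convolution: a depthwise pass with only 3 spatial filters
# (3*3*3 = 27 weights), then a 1x1 pointwise mix (3*32 = 96 weights) + 32 biases.
separable = tf.keras.layers.SeparableConv2D(32, kernel_size=3, padding='same')
separable.build(input_shape)

print(regular.count_params())    # 896
print(separable.count_params())  # 155
```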

At this point, I started adjusting the filter depths in the encoder and decoder to see if I could optimize the two-layer architecture. I found that with this number of layers, I still needed both skip connections (see columns Q and R) to maintain a high score. Eventually, I augmented the training data with my second data set, but there didn't seem to be much improvement until I added a third encoder/decoder layer, bringing the network to three layers. This brought another increase in accuracy (see column V). After this, I continued experimenting with different filter depths and also trained a few four-layer networks. With regard to the number of 1x1 convolution layers at the end of the encoder, I found that having two layers worked well, but I did not test thoroughly whether this was the optimal number. In my last training run, I achieved a final score of 0.4567 with a three-layer network (column AC). Please see below for a diagram of the final network that I used; the diagram is also located at ./docs/misc/network_diagram.jpg.

![Network diagram](./docs/misc/network_diagram.jpg)
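In code, the final three-layer architecture looks roughly like the sketch below (reusing the encoder_block/decoder_block helpers sketched earlier; the filter depths shown here are placeholders, not the exact values recorded in parameter_exploration.ods):

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(160, 160, 3))

# First layer: a regular convolution, since the input has only 3 channels.
x1 = layers.Conv2D(32, kernel_size=3, strides=2, padding='same',
                   activation='relu')(inputs)              # 80x80
x2 = encoder_block(x1, 64)                                 # 40x40
x3 = encoder_block(x2, 128)                                # 20x20

# Two 1x1 convolutions at the end of the encoder.
x = layers.Conv2D(256, kernel_size=1, activation='relu')(x3)
x = layers.BatchNormalization()(x)
x = layers.Conv2D(256, kernel_size=1, activation='relu')(x)
x = layers.BatchNormalization()(x)

# Decoder with skip connections back to the matching encoder outputs.
x = decoder_block(x, x2, 128)                              # 40x40
x = decoder_block(x, x1, 64)                               # 80x80
x = decoder_block(x, inputs, 32)                           # 160x160
outputs = layers.Conv2D(3, kernel_size=1, activation='softmax')(x)  # 3 classes

model = tf.keras.Model(inputs, outputs)
```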

### Parameter Tuning

The main hyperparameters to tune for this network were the number of training epochs, the initial learning rate, and the batch size. Other tunable parameters related to network architecture, such as the number of layers or the depth of the output feature maps, are discussed above.

The best number of training epochs seemed to be around 30. Usually after 30 epochs, the training and validation loss would start to oscillate about some value, increasing in some epochs and decreasing in others. For some networks, the training loss would continue to decrease very slightly while the validation loss remained about the same. Because each network seemed to converge at a different rate, for the later networks I would train for 20 epochs, observe the training and validation loss pattern, and then continue training a few epochs at a time, repeating as necessary until the loss stopped decreasing (see cell 186 in model_training.html).
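In Keras, this incremental approach amounts to calling fit repeatedly on the same model, a few epochs at a time (a minimal sketch; train_gen, val_gen, and the step counts are placeholders for the project's data generators):

```python
# Initial run of 20 epochs.
history = model.fit(train_gen, steps_per_epoch=steps, epochs=20,
                    validation_data=val_gen, validation_steps=val_steps)
best_val_loss = min(history.history['val_loss'])

# Continue a few epochs at a time; Keras keeps the learned weights,
# so each call resumes from where the previous one left off.
while True:
    history = model.fit(train_gen, steps_per_epoch=steps, epochs=5,
                        validation_data=val_gen, validation_steps=val_steps)
    new_best = min(history.history['val_loss'])
    if new_best >= best_val_loss:
        break  # stop once the validation loss no longer improves
    best_val_loss = new_best
```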

Batch size was determined by what the hardware could handle. As shown in parameter_exploration.ods, most of the earlier networks were trained using a batch size of 128. However, later networks, which were larger and featured more output maps, could only train with a batch size of 64.

In earlier networks, an initial learning rate of 0.01 seemed to work well. Later on, I became concerned that this rate was too high and was preventing the network from settling on an optimal solution, so I lowered the initial learning rate to 0.001. The lower rate didn't seem to have a huge impact on performance (see columns S through W), and the training convergence rate remained about the same, so I kept the initial learning rate at 0.001. I also learned later that the Adam optimizer adapts its effective per-parameter step sizes during training, which acts as a form of built-in learning-rate adjustment. Nevertheless, for the very final network, I stepped the initial learning rate down for the final five epochs of training (from 0.001 to 0.0001), to ensure that the loss continued to decrease toward an optimal solution.
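A minimal sketch of that final step-down, assuming the model was compiled with an Adam optimizer (generator names are placeholders as above):

```python
import tensorflow as tf

# Lower the Adam learning rate from 0.001 to 0.0001 for the final five epochs.
tf.keras.backend.set_value(model.optimizer.learning_rate, 1e-4)
model.fit(train_gen, steps_per_epoch=steps, epochs=5,
          validation_data=val_gen, validation_steps=val_steps)
```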

### Data Collection

Based on the performance of my network in the Segmentation Lab, I realized that I wanted to collect data myself. For this first data set, I collected a batch of data with the drone following the hero person as she walked throughout the entire city. A second batch consisted of the drone patrolling throughout the city, with non-hero persons walking throughout (the hero was not present in this batch). The third batch had the drone fly around in a set pattern, with the hero also walking around in the area, with a large number of non-hero persons in the vicinity; this was repeated in a number of different areas. In this first data set, I collected a total of 3456 images.

After training on this first data set, I noticed that some areas could perhaps be improved by collecting more data. In particular, I saw that the network would classify the sides of buildings or any upright object (e.g. rainwater leaders) as non-hero persons. In retrospect, this may have been partly due to deficiencies in my network architecture, but I also suspected there wasn't enough data with various buildings in view. So I decided to collect a second data set to augment the first. In this set, I first collected a large number of images with various buildings in view. In another batch, I had the drone go in a back-and-forth pattern, with various non-hero spawn points in the distance. In some of these runs, the hero would also be walking back and forth in the distance; in others, she would not be present. In total, this second data set had 2334 images.

Upon training on the aggregate data set consisting of the first and second data sets, I observed that the test scores for far-away recognition were still quite low, for non-hero and hero persons alike. So I collected an additional 759 images with the hero and non-hero persons far away from the drone. This was the third and final data set I collected and trained on before submitting this project. In total, there were 6549 images in my aggregate training data set.

For validation, I collected 1061 images in the same fashion as the first training data set, using the same three settings as described above. Because the second and third data sets were simply more targeted runs on the same types of images, I decided not to collect more validation data during those runs.

### Future Enhancements

As shown at the very end of model_training.html (see cells 10-12), the network does quite well at detecting the hero in the first case, where the quad is following behind the target; the average IoU for the hero here is 0.94. In the second case, where the quad is on patrol and the target is not visible, the network does well in identifying non-hero persons, although there are still a number of false positives. In the third case, the network has some trouble detecting the hero from a distance: there are about equal numbers of true positives and false negatives, and the average IoU for the hero is 0.26. Improving the network's ability to detect the hero from a distance would lead to a much higher final score.
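For reference, the per-class IoU figures quoted above are computed from pixel-wise masks along these lines (a minimal NumPy sketch; the project's evaluation code may differ in detail):

```python
import numpy as np

def class_iou(pred_mask, true_mask, class_id):
    """Intersection over union for one class, given integer label masks."""
    pred = (pred_mask == class_id)
    true = (true_mask == class_id)
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    return intersection / union if union > 0 else float('nan')

# Example: hero labeled as class 2 in 160x160 masks (class ids are assumed).
# hero_iou = class_iou(predicted_mask, ground_truth_mask, class_id=2)
```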

Looking at the prediction results, it is apparent that the network has difficulty distinguishing between the hero and non-hero persons. To help the network make this distinction, the first step would be to gather more data, particularly images where the hero is at a distance. Because the quad's camera is constantly rotating, it can be difficult to obtain images with the hero in view at a distance. Second, improvements to the network architecture should be considered. Perhaps more encoder layers should be regular convolutions instead of separable convolutions, giving the network more parameters to work with. A second separable convolution could also be added to each decoder layer, further increasing the network's learning capacity. I did not experiment with transposed convolutions in lieu of bilinear upsampling, but that is another possible modification.