# Project: Follow Me

For this project a Fully Convolutional Network (FCN) was designed. About a dozen training runs were performed, and design decisions were made based on the resulting data. The first few runs used 4-core CPU resources only. As the network complexity and the number of epochs grew, it quickly became clear that satisfactory results were nearly impossible to achieve within a reasonable time, so an AWS instance with GPU resources was used for further model development.

I started with a simple network architecture composed of a downsampling layer, a 1x1 convolution, an upsampling layer, and a final convolution. The encoder comprises a separable convolution layer that extracts features from the input image; its reduced parameter count simplifies the network and reduces overfitting. Fully connected layers can't be used for this project because spatial information must be preserved (every pixel in the image has to be classified), which is why a 1x1 convolution is used instead. The decoder upsamples the input feature map using bilinear interpolation. A skip connection via concatenation after upsampling lets the decoder use data from the high-resolution input image, and an additional separable convolution is applied after the concatenation. Nonlinearity is introduced by ReLU. Batch normalization is applied after every convolution for better convergence. The final convolution provides classification probabilities for every pixel.
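As a rough illustration of these building blocks, here is a minimal sketch in tf.keras. The project itself used helper layers provided with the course material, so the layer choices and signatures here are assumptions, not the exact original code:

```python
from tensorflow.keras import layers

def encoder_block(x, filters, strides=2):
    # Separable convolution extracts features with far fewer parameters
    # than a regular convolution; strides=2 performs the downsampling.
    x = layers.SeparableConv2D(filters, 3, strides=strides,
                               padding='same', activation='relu')(x)
    # Batch normalization after every convolution for better convergence.
    return layers.BatchNormalization()(x)

def decoder_block(x, skip, filters):
    # Bilinear upsampling doubles the spatial resolution.
    x = layers.UpSampling2D(size=2, interpolation='bilinear')(x)
    # Skip connection: concatenate with a higher-resolution feature map.
    x = layers.concatenate([x, skip])
    # Additional separable convolution after the concatenation.
    x = layers.SeparableConv2D(filters, 3, padding='same',
                               activation='relu')(x)
    return layers.BatchNormalization()(x)
```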

A random search was used to choose the starting hyperparameters. For the very first runs, only 2 to 10 epochs were used along with a relatively small part of the provided data. For learning rates > 0.05, batch sizes <= 16, and filter counts <= 16, the training and validation loss values were unstable and showed poor convergence or even divergence. With large batch sizes (up to 256) and many encoder/decoder filters (up to 512), the network performed only slightly better. Increasing the separable convolution kernel size up to 10x10 gave slightly better results, but not enough to be considered satisfactory. After additionally analyzing popular FCN designs for semantic segmentation, I concluded that for further investigation the hyperparameters should be limited to learning rate in [0.001, 0.3], batch size in [32, 64], and filter count in [32, 1024]. A rough sketch of such a search appears below.
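The search itself can be as simple as sampling from discrete ranges; `train_and_score` below is a hypothetical helper standing in for a full short training run, and the candidate values merely mirror the ranges mentioned above:

```python
import random

# Discrete ranges explored in the early runs (values from the text above).
search_space = {
    'learning_rate': [0.3, 0.1, 0.05, 0.01, 0.005, 0.001],
    'batch_size':    [16, 32, 64, 128, 256],
    'num_filters':   [16, 32, 64, 128, 256, 512],
}

best_loss, best_params = float('inf'), None
for _ in range(10):
    params = {k: random.choice(v) for k, v in search_space.items()}
    # train_and_score is a hypothetical helper: it trains the FCN with the
    # given hyperparameters for a few epochs and returns the validation loss.
    val_loss = train_and_score(**params)
    if val_loss < best_loss:
        best_loss, best_params = val_loss, params
```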

In the next step of the FCN development I doubled the number of encoder and decoder layers. The resulting network (2 downsampling layers, a 1x1 convolution, 2 upsampling layers, and a final convolution) showed better results, but they were still unacceptable.

The final FCN architecture comprises 3 x (separable convolution layer -> separable convolution downsampling layer), a 1x1 convolution layer, 3 x (upsampling layer with 2 separable convolutions), and a final convolution:

Network Diagram

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 160, 160, 3)       0
_________________________________________________________________
separable_conv2d_keras_1 (Se (None, 160, 160, 32)      155
_________________________________________________________________
batch_normalization_1 (Batch (None, 160, 160, 32)      128
_________________________________________________________________
separable_conv2d_keras_2 (Se (None, 80, 80, 32)        1344
_________________________________________________________________
batch_normalization_2 (Batch (None, 80, 80, 32)        128
_________________________________________________________________
separable_conv2d_keras_3 (Se (None, 80, 80, 64)        2400
_________________________________________________________________
batch_normalization_3 (Batch (None, 80, 80, 64)        256
_________________________________________________________________
separable_conv2d_keras_4 (Se (None, 40, 40, 64)        4736
_________________________________________________________________
batch_normalization_4 (Batch (None, 40, 40, 64)        256
_________________________________________________________________
separable_conv2d_keras_5 (Se (None, 40, 40, 128)       8896
_________________________________________________________________
batch_normalization_5 (Batch (None, 40, 40, 128)       512
_________________________________________________________________
separable_conv2d_keras_6 (Se (None, 20, 20, 128)       17664
_________________________________________________________________
batch_normalization_6 (Batch (None, 20, 20, 128)       512
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 20, 20, 1024)      132096
_________________________________________________________________
batch_normalization_7 (Batch (None, 20, 20, 1024)      4096
_________________________________________________________________
bilinear_up_sampling2d_1 (Bi (None, 40, 40, 1024)      0
_________________________________________________________________
concatenate_1 (Concatenate)  (None, 40, 40, 1088)      0
_________________________________________________________________
separable_conv2d_keras_7 (Se (None, 40, 40, 128)       149184
_________________________________________________________________
batch_normalization_8 (Batch (None, 40, 40, 128)       512
_________________________________________________________________
separable_conv2d_keras_8 (Se (None, 40, 40, 128)       17664
_________________________________________________________________
batch_normalization_9 (Batch (None, 40, 40, 128)       512
_________________________________________________________________
bilinear_up_sampling2d_2 (Bi (None, 80, 80, 128)       0
_________________________________________________________________
concatenate_2 (Concatenate)  (None, 80, 80, 160)       0
_________________________________________________________________
separable_conv2d_keras_9 (Se (None, 80, 80, 64)        11744
_________________________________________________________________
batch_normalization_10 (Batc (None, 80, 80, 64)        256
_________________________________________________________________
separable_conv2d_keras_10 (S (None, 80, 80, 64)        4736
_________________________________________________________________
batch_normalization_11 (Batc (None, 80, 80, 64)        256
_________________________________________________________________
bilinear_up_sampling2d_3 (Bi (None, 160, 160, 64)      0
_________________________________________________________________
concatenate_3 (Concatenate)  (None, 160, 160, 67)      0
_________________________________________________________________
separable_conv2d_keras_11 (S (None, 160, 160, 32)      2779
_________________________________________________________________
batch_normalization_12 (Batc (None, 160, 160, 32)      128
_________________________________________________________________
separable_conv2d_keras_12 (S (None, 160, 160, 32)      1344
_________________________________________________________________
batch_normalization_13 (Batc (None, 160, 160, 32)      128
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 160, 160, 3)       99
=================================================================
Total params: 362,521
Trainable params: 358,681
Non-trainable params: 3,840
```
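Under the same assumptions as the block sketch above, the full model could be assembled roughly as follows; the layer widths and concatenation targets follow the summary, but this is an illustrative reconstruction, not the original code:

```python
from tensorflow.keras import Model, layers

def fcn_model(input_shape=(160, 160, 3), num_classes=3):
    inputs = layers.Input(shape=input_shape)

    # Encoder: 3 x (same-resolution separable conv -> strided downsampling conv).
    e1 = encoder_block(inputs, 32, strides=1)   # 160x160x32
    e2 = encoder_block(e1, 32)                  # 80x80x32
    e3 = encoder_block(e2, 64, strides=1)       # 80x80x64
    e4 = encoder_block(e3, 64)                  # 40x40x64
    e5 = encoder_block(e4, 128, strides=1)      # 40x40x128
    e6 = encoder_block(e5, 128)                 # 20x20x128

    # 1x1 convolution goes deep without losing spatial information.
    x = layers.Conv2D(1024, 1, activation='relu')(e6)
    x = layers.BatchNormalization()(x)          # 20x20x1024

    # Decoder: 3 x (bilinear upsampling -> skip concatenation -> 2 convs);
    # encoder_block with strides=1 serves as the second separable conv + BN.
    x = decoder_block(x, e4, 128)               # 40x40 (concat gives 1088 channels)
    x = encoder_block(x, 128, strides=1)
    x = decoder_block(x, e2, 64)                # 80x80 (concat gives 160 channels)
    x = encoder_block(x, 64, strides=1)
    x = decoder_block(x, inputs, 32)            # 160x160 (concat gives 67 channels)
    x = encoder_block(x, 32, strides=1)

    # Final convolution: per-pixel class probabilities.
    outputs = layers.Conv2D(num_classes, 1, activation='softmax')(x)
    return Model(inputs, outputs)
```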

This network better suits the idea of edges -> shapes -> objects feature extraction. However, many layers and filters lead to a lot of computation. The network gives acceptable results even with the default data set (final_score = 0.447 with num_epochs = 100). Fixing the number of epochs at 20 and using grid search, I ended up with the following hyperparameters:

```
kernel = 3
learning_rate = 0.005
batch_size = 64
```
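Feeding these values into training might look like the sketch below. The optimizer choice and the data generators are assumptions (the write-up doesn't name them), and `num_train` stands in for the size of the training set:

```python
from tensorflow.keras import optimizers

model = fcn_model()
model.compile(optimizer=optimizers.Adam(learning_rate=0.005),  # assumed optimizer
              loss='categorical_crossentropy')
model.fit(train_generator,                  # hypothetical training data generator
          epochs=20,
          steps_per_epoch=num_train // 64,  # num_train: training set size
          validation_data=val_generator)    # hypothetical validation generator
```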

For this small number of epochs with the default data set I got final_score = 0.424. Using 512 additional samples collected in the simulator's DL Training mode led to a noticeable improvement: with the same hyperparameters but the expanded data set, I got final_score = 0.4597.

The model is limited to identifying the hero in this specific environment. To follow another object, a dedicated data set has to be collected and the hyperparameters have to be tuned to train a new model. A different network architecture might better suit a different data set.

If I continued working on this project, I would collect much more data, since that looks like a very promising direction. The next step would be fine-tuning the learning rate. The number of epochs should be increased until the training and validation losses start to diverge (to prevent overfitting). It would also be worth trying Bayesian optimization or other methods of automatic hyperparameter tuning. Finally, it might be better to concatenate the decoder layers with encoded_layer_1, encoded_layer_3, and encoded_layer_5 instead of the inputs, encoded_layer_2, and encoded_layer_4, as sketched below.
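A sketch of that alternative wiring, reusing the blocks above and taking the skips from the same-resolution convolutions (e5, e3, e1 in the fcn_model sketch) before each downsampling step:

```python
# Alternative decoder wiring: skip connections taken from the
# same-resolution convolutions instead of inputs, e2 and e4.
x = decoder_block(x, e5, 128)           # 40x40 skip from encoded_layer_5
x = encoder_block(x, 128, strides=1)
x = decoder_block(x, e3, 64)            # 80x80 skip from encoded_layer_3
x = encoder_block(x, 64, strides=1)
x = decoder_block(x, e1, 32)            # 160x160 skip from encoded_layer_1
x = encoder_block(x, 32, strides=1)
```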