In the previous posts I experimented with the recognition of single MNIST digits and continued with sequences of MNIST digits. I decided to add localization of the digits to the task too. The goal of the whole project is to create a live camera app able to recognize and localize a sequence of digits.

The first part: Digit sequence recognition

The code is available on GitHub.

Update 31.7.2017

Digit sequence recognition and localization

I modified the original dataset by placing the digit sequences at a random position on a 128×256 canvas. The labels now consist of the digits to recognize plus the x, y coordinates, width, and height defining the bounding box of the sequence. The modified dataset has the same number of examples as the original (165,000 training, 15,000 validation, 30,000 testing), but it is much larger: the original training dataset is about 2 GB, the modified one about 42 GB. The modified dataset doesn't fit into memory, so I have to load it from HDD. This increased the training time from a few tens of minutes to approximately 8 hours.
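
For illustration, a minimal sketch of how such an example could be generated (a hypothetical helper, not the actual generation code from the repository):

```python
import numpy as np

def place_sequence(digits, canvas_h=128, canvas_w=256):
    """Place a sequence of 28x28 MNIST digits at a random position
    on a blank canvas; return the canvas and the bounding box label."""
    seq = np.concatenate(digits, axis=1)        # digits side by side
    h, w = seq.shape
    y = np.random.randint(0, canvas_h - h + 1)  # random top-left corner
    x = np.random.randint(0, canvas_w - w + 1)
    canvas = np.zeros((canvas_h, canvas_w), dtype=np.float32)
    canvas[y:y + h, x:x + w] = seq
    return canvas, (x, y, w, h)                 # label: x, y, width, height
```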

Examples of the modified dataset (the tiny red rectangle is the center of the bounding box)

Sequence model

I apply the original sequence model to the modified dataset to compare how much harder the recognition task becomes with the bigger input. I don't localize the sequence yet. On the original dataset, the sequence model achieved an accuracy of 1.0 on the training set, 0.8 on the validation set, and 0.82 on the testing set. On the modified dataset, it achieves an accuracy of only 0.17 on the training set, 0.17 on the validation set, and 0.15 on the testing set. (Reminder: a sequence is considered correctly recognized only if all of its digits are correctly recognized.) This shows that recognizing the sequence in the bigger input is a much harder task. The suspicious thing is the low accuracy despite the low loss. I will also report the “per digit” recognition accuracy for future models.

Plots: sequence model accuracy (training, validation, testing) and loss

Update 1.8.2017

Localization model

I create a baseline model for sequence classification and localization. The model consists of three convolutional layers followed by two fully connected layers. The output is divided into a classification part and a localization part. The classification part is the recurrent head used in the previous models (a GRU unrolled five times, followed by a fully connected layer). The localization part consists of two fully connected layers. The total loss is the sum of cross entropy (classification) and mean squared error (localization).
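
A minimal TensorFlow 1.x sketch of that combined loss, assuming per-position digit logits and a 4-number bounding box regression (all tensor names are mine, not taken from the repository; the placeholders just stand in for the network outputs):

```python
import tensorflow as tf

digit_labels = tf.placeholder(tf.int64, [None, 5])        # 5 digits per sequence
digit_logits = tf.placeholder(tf.float32, [None, 5, 10])  # per-position logits
bbox_true = tf.placeholder(tf.float32, [None, 4])         # x, y, width, height
bbox_pred = tf.placeholder(tf.float32, [None, 4])

# classification: cross entropy averaged over all digit positions
classification_loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=digit_labels, logits=digit_logits))

# localization: mean squared error of the bounding box regression
localization_loss = tf.losses.mean_squared_error(bbox_true, bbox_pred)

# the total loss is a plain sum of the two parts
total_loss = classification_loss + localization_loss
```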

Localization model (architecture diagram)

The model learns to localize the sequence with “not so great” precision. The classification fails almost completely.

Metric/Dataset             Training   Validation   Testing
Mean localization error    312.7      313.4        312.6
Sequence accuracy          0.0        0.0          0.0
Character accuracy         0.13       0.13         0.13

Plots: classification loss, localization loss, and total loss (classification loss + localization loss)

One of the two correctly classified examples from the testing dataset (green: true bounding box, red: predicted bounding box)

Incorrectly classified example from the testing dataset (green: true bounding box, red: predicted bounding box)

Localization model with square error

I create a new model by removing the fully connected layers between the convolutional part and the output heads of the previous model. I also add two more convolutional layers. I use the plain squared error as the localization loss instead of the mean squared error.
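
The change only swaps the reduction of the localization term, but it changes the scale of the loss and its gradients considerably. A sketch, reusing bbox_true and bbox_pred from the snippet above:

```python
# previous model: mean squared error, averaged over batch and coordinates
localization_loss_mse = tf.reduce_mean(tf.square(bbox_true - bbox_pred))

# this model: plain squared error, summed instead of averaged
localization_loss_sq = tf.reduce_sum(tf.square(bbox_true - bbox_pred))
```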

Localization model with square error (architecture diagram)

The model learns to localize the sequence much better. However, the classification fails again.

Metric/Dataset             Training   Validation   Testing
Mean localization error    0.32       0.32         0.34
Sequence accuracy          0.0        0.0          0.0
Character accuracy         0.10       0.11         0.10

Plots: classification loss, localization loss, and total loss (classification loss + localization loss)

Localized examples from the testing dataset (green: true bounding box, red: predicted bounding box)

Localization model for classification only

I suspect that it is not possible for the network to learn the classification at all. I test this hypothesis by removing the localization loss from training, so that the loss consists of cross entropy only.

The network achieves better results than the “Sequence model” used yesterday, which refutes my hypothesis. It seems the model is able to learn the classification, but only if the localization loss is not present. Next I will try multiplying the individual losses by weights.

Metric/Dataset             Training   Validation   Testing
Sequence accuracy          0.91       0.75         0.74
Character accuracy         0.98       0.94         0.94

Correctly classified testing example

Update 2.8.2017

Weighted loss

I use the model from the previous day (which was able to do either localization or classification, but not both at the same time). I increase the number of fully connected layers in the localization head and I weight the losses. The loss formula is:

loss = 1000 * “classification loss (cross entropy)” + “localization error (mean squared error)”
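
In code, reusing the loss tensors from the sketches above, the weighting is a single line (1000 is the constant from the formula):

```python
# raw magnitudes at the start of training (see the observation below):
# localization loss ~ 9000, classification loss ~ 2.8
total_loss = 1000.0 * classification_loss + localization_loss
```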

Model architecture

The model finally learns to classify the sequence and localize it at the same time. My explanation is that the localization loss is much bigger than the classification loss (approximately 9000 vs. 2.8) at the start of training. This causes the weights to adjust to the localization task first, as that lowers the total loss the most, and it then prevents the model from learning the classification task, because improving the classification would increase the localization loss more than it would decrease the classification loss. The weight allows both losses to be optimized at the same time.

Correctly classified testing examples (green: true bounding box, red: predicted bounding box)

Testing example with one error (green: true bounding box, red: predicted bounding box)

Metric/Dataset             Training   Validation   Testing
Mean localization error    2.06       2.24         2.32
Sequence accuracy          0.86       0.75         0.74
Character accuracy         0.97       0.94         0.94

Plots: accuracy per character (training, validation, testing), accuracy per sequence (training, validation, testing), mean position error (training, validation, testing), classification loss, localization loss, and total loss

Update 3.8.2017

Sequences of variable length

I make a new dataset consisting of sequences of variable length. The minimal length of a sequence is one and the maximal is five. Missing digits are labeled by a special character in the dataset.
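
A minimal sketch of such label padding, assuming the special “no digit” character is encoded as class 10, as in the images below:

```python
import numpy as np

NO_DIGIT = 10  # special class for a missing digit
MAX_LEN = 5

def pad_label(digits):
    """Pad a variable-length digit sequence to MAX_LEN with NO_DIGIT."""
    label = np.full(MAX_LEN, NO_DIGIT, dtype=np.int64)
    label[:len(digits)] = digits
    return label

pad_label([3, 1, 4])  # -> array([ 3,  1,  4, 10, 10])
```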

Example of the dataset (the tiny red rectangle is the center of the bounding box)

Weighted loss

I apply the model from the previous day. The model doesn't have any problem learning to classify and localize on this new dataset. The tens in the labels in the following images are the special character signifying a missing digit.

Correctly classified testing examples (green: true bounding box, red: predicted bounding box)

Testing example with one error (green: true bounding box, red: predicted bounding box)

Metric/Dataset             Training   Validation   Testing
Mean localization error    2.62       2.90         2.93
Sequence accuracy          0.86       0.79         0.80
Character accuracy         0.97       0.95         0.95

Plots: classification loss, localization loss, and total loss

Update 27.8.2017

SVHN dataset recognition

I move on to a more realistic task: recognition and localization of digits in real-life images. I use the SVHN dataset for this purpose. It consists of images of house numbers taken from Street View. There are 73,257 digits for training, 26,032 digits for testing, and 531,131 additional, somewhat less difficult samples to use as extra training data. The images are annotated with the numbers and with the bounding box of each digit.

Example of SVHN dataset

The images have different resolutions, so I reshape them all to 256×128 px. I also calculate the bounding box of the whole number by taking the extremes of the individual digit bounding boxes (their union). I don't use a validation set this time, which I consider a mistake from today's point of view.
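
A minimal sketch of that bounding-box merge, assuming each digit box is given as (x, y, width, height):

```python
def merge_boxes(boxes):
    """Union of per-digit boxes (x, y, w, h) -> box of the whole number."""
    x1 = min(x for x, y, w, h in boxes)
    y1 = min(y for x, y, w, h in boxes)
    x2 = max(x + w for x, y, w, h in boxes)
    y2 = max(y + h for x, y, w, h in boxes)
    return (x1, y1, x2 - x1, y2 - y1)

merge_boxes([(10, 5, 8, 14), (19, 6, 8, 13)])  # -> (10, 5, 17, 14)
```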

SVHN example with the bounding box

Baseline model

At first I use a standard model without any bells and whistles. It consists of four convolutional layers, followed by a recurrent head for digit recognition and a fully connected head for sequence localization (five layers with ReLU between them). The loss function is the weighted one described above. I use no dropout or batch normalization. The results are much worse than on the previous task.

Metric/Dataset             Training   Testing
Mean localization error    82.0       351.6
Sequence accuracy          0.50       0.20
Character accuracy         0.97       0.87

Correct test example (10 in the label means no digit, the red rectangle is the predicted bounding box)

Incorrect test example (10 in the label means no digit, the red rectangle is the predicted bounding box)

Convolution from the paper

The next model is inspired by the paper Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks. The authors use five convolutional layers with 5×5 filters. The localization head has only a single layer. This model performs better on whole-sequence accuracy and on localization.

Metric/Dataset             Training   Testing
Mean localization error    84.4       251.3
Sequence accuracy          0.70       0.38
Character accuracy         0.93       0.82

Convolution from the paper with dropout and fully connected recognition head

The previous model overfits, so I add dropout with a keep probability of 0.8 in front of both the localization head and the recognition head. I also stop using the recurrent head with the GRU and replace it with six fully connected layers, one for each digit.
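
A sketch of that recognition head in TensorFlow 1.x, assuming features is the flattened output of the convolutional part (names and sizes are illustrative):

```python
import tensorflow as tf

features = tf.placeholder(tf.float32, [None, 1024])  # flattened conv output
dropped = tf.nn.dropout(features, keep_prob=0.8)

# one independent fully connected classifier per digit position,
# 11 classes each (digits 0-9 plus "no digit" as class 10)
digit_logits = [tf.layers.dense(dropped, 11, name='digit_%d' % i)
                for i in range(6)]
```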

Model architecture

The model improves the sequence accuracy. It also shows that the recurrent head can be replaced by several fully connected layers without a drop in accuracy.

Metric/Dataset             Training   Testing
Mean localization error    91.7       264.7
Sequence accuracy          0.83       0.40
Character accuracy         0.96       0.82

Update 28.8.2017

Unsuccessful transfer learning experiment

I try to apply transfer learning to the previous model. The plan is to first train the digit recognition head, then freeze the convolutional layers and train the localization head. I also normalize the input here, which I should have done from the beginning. The recognition result is not significantly better (although it is the best so far), so I don't proceed to train the localization head.

Metric/Dataset             Training   Testing
Sequence accuracy          0.69       0.43
Character accuracy         0.92       0.84

Mean image of the SVHN training dataset

Unsuccessful transfer learning with L2 regularization

I use the approach from the previous model and add L2 regularization of the convolutional layers. The loss is:

loss = “classification loss (cross entropy)” + 0.001 * “L2 regularization”
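
A sketch of adding that penalty, reusing classification_loss from the earlier snippets (selecting the variables by name is my assumption about the implementation):

```python
# L2 penalty over the convolutional kernels
l2_penalty = tf.add_n([tf.nn.l2_loss(v)
                       for v in tf.trainable_variables()
                       if 'conv' in v.name])
loss = classification_loss + 0.001 * l2_penalty
```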

It doesn't bring any improvement, so I don't train the localization head.

Metric/Dataset             Training   Testing
Sequence accuracy          0.89       0.36
Character accuracy         0.92       0.82

Mixing training and extra parts of dataset without pooling

I mix the training and extra parts of the dataset, and I also replace the 2×2 max pooling with a 2×2 convolution with a 2×2 stride. I hope that removing the pooling will improve the localization performance. I report only the performance on the testing set, because the training set is too large to evaluate (again my mistake of not using a validation set). I train for 50k steps and manually decrease the learning rate. No breakthrough again.
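
The pooling replacement in TensorFlow 1.x terms; the input shape and filter count are illustrative:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 128, 256, 32])  # previous layer's activations

# before: fixed downsampling by 2x2 max pooling
# x = tf.layers.max_pooling2d(x, pool_size=2, strides=2)

# after: learned downsampling by a 2x2 convolution with stride 2
x = tf.layers.conv2d(x, filters=64, kernel_size=2, strides=2,
                     activation=tf.nn.relu)
```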

Metric/Dataset             Testing
Mean localization error    423.7
Sequence accuracy          0.42
Character accuracy         0.84

“No digit” class reweighting

I really wonder what is wrong this time. As a last attempt, I reweight the “no digit” class, because it forms the majority of the labels in the dataset.

Distribution of digit classes in the dataset (10 is “no digit”)

I multiply the loss produced by the “no digit” class by 0.125. I train for 60k steps and manually decrease the learning rate. It brings no improvement.
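
A sketch of the reweighting, with the per-position labels and logits from the earlier snippets:

```python
# per-example, per-position cross entropy, not yet reduced
per_digit_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=digit_labels, logits=digit_logits)

# scale the loss of the over-represented "no digit" class (10) by 0.125
weights = tf.where(tf.equal(digit_labels, 10),
                   0.125 * tf.ones_like(per_digit_loss),
                   tf.ones_like(per_digit_loss))
classification_loss = tf.reduce_mean(weights * per_digit_loss)
```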

Metric/Dataset             Testing
Mean localization error    349.6
Sequence accuracy          0.40
Character accuracy         0.82
