In the previous posts I experimented with the recognition of single MNIST digits and continued with sequences of MNIST digits. Now I have decided to add localization of the digits to the task too. The goal of the whole project is to create a live camera app able to recognize and localize a sequence of digits.
Sequence of digits recognition and localization
I modified the original dataset by placing the digit sequences at a random position on a 128×256 canvas. The labels now consist of the digits to recognize plus the x, y coordinates, width, and height defining the bounding box of the sequence. The modified dataset has the same number of examples as the original (165,000 training, 15,000 validation, 30,000 test), but it is much larger: the original training dataset is 2 GB, while the modified one is 42 GB. Because of its size, the modified dataset no longer fits into memory, so I have to load it from the HDD. This increased the training time from a few tens of minutes to approximately 8 hours.
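The placement step can be sketched roughly as follows. This is a minimal numpy illustration, not the actual generation code; the function name `place_sequence` and the fixed 28-pixel digit size are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def place_sequence(digits, canvas_h=128, canvas_w=256, digit_size=28):
    """Paste a row of 28x28 digit images at a random position on a blank canvas.

    `digits` has shape (n, 28, 28). Returns the canvas and the bounding box
    (x, y, w, h) of the pasted sequence, which becomes part of the label.
    """
    n = digits.shape[0]
    seq_w, seq_h = n * digit_size, digit_size
    x = rng.integers(0, canvas_w - seq_w + 1)   # random top-left corner
    y = rng.integers(0, canvas_h - seq_h + 1)
    canvas = np.zeros((canvas_h, canvas_w), dtype=np.float32)
    strip = np.concatenate(list(digits), axis=1)  # (28, n*28) digit strip
    canvas[y:y + seq_h, x:x + seq_w] = strip
    return canvas, (x, y, seq_w, seq_h)

digits = rng.random((5, 28, 28)).astype(np.float32)  # stand-in for MNIST digits
canvas, bbox = place_sequence(digits)
```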
I applied the original sequence model to the modified dataset to see how much harder it is to recognize the sequence in the bigger input. I don't localize the sequence yet. On the original dataset, the Sequence model achieved an accuracy of 1.0 on the training set, 0.8 on the validation set, and 0.82 on the testing set. On the modified dataset, it achieves an accuracy of only 0.17 on the training set, 0.17 on the validation set, and 0.15 on the testing set. (Reminder: a sequence is considered correctly recognized only if all of its digits are correctly recognized.) This shows that recognizing the sequence in the bigger input is a much harder task. The suspicious thing is the low accuracy despite the low loss. I will report the per-digit recognition accuracy of future models too.
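The "all digits correct" metric from the reminder above can be expressed in a few lines of numpy; this is just an illustrative sketch, and the function name is mine.

```python
import numpy as np

def sequence_accuracy(pred, target):
    """Fraction of sequences whose digits are ALL predicted correctly.

    pred, target: integer arrays of shape (batch, seq_len).
    """
    return np.mean(np.all(pred == target, axis=1))

pred   = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
target = np.array([[1, 2, 3], [4, 5, 0], [7, 8, 9]])
sequence_accuracy(pred, target)  # 2 of 3 sequences fully match -> 2/3
```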
I created a baseline model for sequence classification and localization. The model consists of three convolutional layers followed by two fully connected layers, and its outputs are split into a classification part and a localization part. The classification part is the recurrent head used in the previous models (a GRU unrolled five times, followed by a fully connected layer). The localization output is produced by two fully connected layers. The total loss is the sum of the cross entropy (classification) and the mean squared error (localization).
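The combined loss can be sketched in numpy as follows. This is my own illustrative implementation of "cross entropy + MSE", not the post's actual training code; the shapes (batch, 5 digits, 10 classes, 4 box coordinates) are assumptions based on the task description.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean per-digit cross entropy. logits: (batch, seq_len, 10), labels: (batch, seq_len)."""
    z = logits - logits.max(axis=-1, keepdims=True)           # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.mean()

def mse(pred_box, true_box):
    """Mean squared error over the (x, y, w, h) box coordinates. Shapes: (batch, 4)."""
    return np.mean((pred_box - true_box) ** 2)

def total_loss(logits, labels, pred_box, true_box):
    """Sum of the classification and localization losses, as in the baseline model."""
    return cross_entropy(logits, labels) + mse(pred_box, true_box)

logits = np.zeros((2, 5, 10))                 # uniform predictions
labels = np.zeros((2, 5), dtype=int)
boxes  = np.array([[10.0, 20.0, 140.0, 28.0]] * 2)
loss = total_loss(logits, labels, boxes, boxes)  # box term is 0 for a perfect box
```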
The model learns to localize the sequence with “not so great” precision. The classification fails almost completely.
|                         | Training | Validation | Testing |
|-------------------------|----------|------------|---------|
| Mean localization error | 312.7    | 313.4      | 312.6   |
Localization model with square error
I created a new model by removing the fully connected layers between the convolutional part and the output parts of the previous model, and I added two more convolutional layers. I also use the squared error as the localization loss instead of the mean squared error (MSE).
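The difference between the two localization losses is only a scale factor. Assuming "squared error" means the sum of squared errors over the four box coordinates (my reading of the post), it is simply 4× the MSE, which changes the loss magnitude and the gradient scale:

```python
import numpy as np

pred_box = np.array([10.0, 20.0, 140.0, 28.0])   # predicted (x, y, w, h)
true_box = np.array([12.0, 18.0, 138.0, 30.0])

sse = np.sum((pred_box - true_box) ** 2)   # summed squared error
mse = np.mean((pred_box - true_box) ** 2)  # mean squared error
# For a 4-component box, sse == 4 * mse, so the summed version weighs
# localization more heavily in the total loss.
```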
The model learns to localize the sequence much better. However, the classification fails again.
|                         | Training | Validation | Testing |
|-------------------------|----------|------------|---------|
| Mean localization error | 0.32     | 0.32       | 0.34    |
Localization model for classification only
I suspect that it is not possible for the network to learn the classification. I want to test this hypothesis by removing the localization loss from learning, so the loss consists of cross entropy only.
The network achieves better results than the "Sequence model" used yesterday, which refutes my hypothesis. It seems that the model is able to learn the classification, but only when the localization loss is not present. I will try multiplying the individual losses by weights.
I use the model from the previous day (which was able to do either localization or classification, but not both at the same time). I increased the number of fully connected layers in the localization part, and I weight the losses. The loss formula is:
loss = 1000 * "classification loss (cross entropy)" + "localization loss (mean squared error)"
The model finally learns to classify the sequence and localize it at the same time. My explanation is that the localization loss is much bigger than the classification loss (approximately 9000 vs. 2.8). This causes the weights to adjust to the localization task at the beginning of training, as that lowers the total loss the most. It also prevents the model from learning the classification task later, because doing so would increase the localization loss more than it would decrease the classification loss. The weighting allows both losses to be optimized at the same time.
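Plugging in the approximate magnitudes from the post shows why the weight helps: without it, the localization term dominates the sum, while the factor of 1000 brings the two terms to a comparable scale. The function name is mine; the numbers are the approximate values quoted above.

```python
def weighted_loss(class_loss, loc_loss, class_weight=1000.0):
    """loss = 1000 * classification loss + localization loss (the post's weighting)."""
    return class_weight * class_loss + loc_loss

# Approximate magnitudes early in training, as quoted in the post:
class_loss, loc_loss = 2.8, 9000.0
unweighted = class_loss + loc_loss               # ~9002.8, localization dominates
weighted = weighted_loss(class_loss, loc_loss)   # ~11800, both terms comparable
```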
|                         | Training | Validation | Testing |
|-------------------------|----------|------------|---------|
| Mean localization error | 2.06     | 2.24       | 2.32    |
Sequences of variable length
I made a new dataset consisting of sequences of variable length. The minimum length of a sequence is one and the maximum is five. The missing digits are labeled with a special character in the dataset.
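Padding the labels to a fixed length might look like this. The post later shows the special character rendered as a ten, so I assume here that class id 10 marks a missing digit; the function name is hypothetical.

```python
import numpy as np

BLANK = 10    # assumed class id for the "missing digit" special character
MAX_LEN = 5   # maximum sequence length in the dataset

def pad_labels(digits):
    """Pad a variable-length digit list to MAX_LEN with the BLANK class."""
    assert 1 <= len(digits) <= MAX_LEN
    return np.array(list(digits) + [BLANK] * (MAX_LEN - len(digits)))

pad_labels([3, 1, 4])  # -> array([ 3,  1,  4, 10, 10])
```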
I applied the model from the previous day. It has no problem learning to classify and localize on this new dataset. The tens in the labels in the following images are the special characters signifying a missing digit.
|                         | Training | Validation | Testing |
|-------------------------|----------|------------|---------|
| Mean localization error | 2.62     | 2.90       | 2.93    |