In the previous post, I experimented with the recognition of single MNIST digits and continued with sequences of MNIST digits. I have now decided to add localization of the digits to the task. The goal of the whole project is to create a live camera app able to recognize and localize a sequence of digits.

The first part: Digit sequence recognition

The code is available on GitHub.

Update 31.7.2017

Digit sequence recognition and localization

I modified the original dataset by placing the digit sequences at a random position on a 128×256 canvas. The labels now consist of the digits to recognize together with the x, y coordinates, width, and height defining the bounding box of the sequence. The modified dataset has the same number of examples as the original (165,000 training, 15,000 validation, 30,000 testing), but it is much larger: the original training dataset is 2 GB, the modified one is 42 GB. Because of its size, the modified dataset doesn't fit into memory, so I have to load it from the HDD. This increased the training time from a few tens of minutes to approximately 8 hours.
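The actual generation code is in the repository; as an illustration, here is a minimal sketch of the placement step, assuming the digit sequence has already been rendered into a small image (all names are mine):

```python
import numpy as np

CANVAS_H, CANVAS_W = 128, 256

def place_sequence(sequence_img, digits):
    """Place a rendered digit-sequence image at a random position on an empty canvas."""
    h, w = sequence_img.shape
    canvas = np.zeros((CANVAS_H, CANVAS_W), dtype=sequence_img.dtype)
    y = np.random.randint(0, CANVAS_H - h + 1)  # top-left corner of the sequence
    x = np.random.randint(0, CANVAS_W - w + 1)
    canvas[y:y + h, x:x + w] = sequence_img
    # Label: the digits plus the bounding box (x, y, width, height).
    return canvas, {"digits": digits, "bbox": (x, y, w, h)}
```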

Example of modified dataset (tiny red rectangle is center of bounding box)

Example of modified dataset

Example of modified dataset

Sequence model

I apply the original sequence model to the modified dataset to see how much harder it is to recognize the sequence in the bigger input. I don't localize the sequence yet. On the original dataset, the sequence model achieved an accuracy of 1.0 on the training set, 0.8 on the validation set, and 0.82 on the testing set. On the modified dataset, it achieves an accuracy of only 0.17 on the training set, 0.17 on the validation set, and 0.15 on the testing set. (Reminder: a sequence is considered correctly recognized only if all of its digits are correctly recognized.) This shows that recognizing the sequence in the bigger input is a much harder task. The suspicious thing is the low accuracy despite the low loss. I will therefore also report the "per digit" recognition accuracy for future models.
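To make the two metrics concrete, here is a minimal sketch of how sequence accuracy and "per digit" (character) accuracy can be computed (array names are my own):

```python
import numpy as np

def sequence_metrics(pred, true):
    """pred, true: integer arrays of shape (num_examples, sequence_length)."""
    per_digit = pred == true
    char_acc = per_digit.mean()             # "per digit" accuracy
    seq_acc = per_digit.all(axis=1).mean()  # every digit in the sequence must match
    return char_acc, seq_acc
```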

Sequence model train accuracy

Sequence model validation accuracy

Sequence model test accuracy

Sequence model loss

Update 1.8.2017

Localization model

I create a baseline model for sequence classification and localization. The model consists of three convolutional layers followed by two fully connected layers, and its outputs are divided into a classification part and a localization part. The classification part is the same as in the previous models (a GRU unrolled five times, followed by a fully connected layer). The localization output is two fully connected layers. The total loss is the sum of the cross entropy (classification) and the mean squared error (localization).
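The authoritative implementation is in the GitHub repository; as an illustration, here is a minimal PyTorch sketch of the described architecture, with layer sizes and names being my assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalizationModel(nn.Module):
    """Three conv layers + two shared fully connected layers, with a
    classification head (GRU unrolled five times + fully connected layer)
    and a localization head (two fully connected layers -> x, y, w, h)."""

    def __init__(self, num_classes=10, seq_len=5):
        super().__init__()
        self.seq_len = seq_len
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        feat = 64 * 16 * 32  # 128x256 input downsampled 8x by the three convs
        self.fc = nn.Sequential(nn.Linear(feat, 256), nn.ReLU(),
                                nn.Linear(256, 256), nn.ReLU())
        self.gru = nn.GRUCell(256, 128)
        self.cls = nn.Linear(128, num_classes)
        self.loc = nn.Sequential(nn.Linear(256, 64), nn.ReLU(),
                                 nn.Linear(64, 4))

    def forward(self, x):
        h = self.fc(self.conv(x).flatten(1))
        state = torch.zeros(x.size(0), 128, device=x.device)
        logits = []
        for _ in range(self.seq_len):  # one GRU step per output digit
            state = self.gru(h, state)
            logits.append(self.cls(state))
        return torch.stack(logits, dim=1), self.loc(h)

def total_loss(logits, bbox_pred, digits, bbox_true):
    ce = F.cross_entropy(logits.flatten(0, 1), digits.flatten())
    return ce + F.mse_loss(bbox_pred, bbox_true)
```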

Localization model

The model learns to localize the sequence with “not so great” precision. The classification fails almost completely.
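The tables below report a "mean localization error"; the exact definition is in the repository, but a plausible reading is the mean absolute difference of the predicted and true box parameters (my assumption):

```python
import numpy as np

def mean_localization_error(bbox_pred, bbox_true):
    """Mean absolute difference over the four box parameters (x, y, w, h)."""
    return np.abs(np.asarray(bbox_pred) - np.asarray(bbox_true)).mean()
```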

Metric / Dataset          Training   Validation   Testing
Mean localization error   312.7      313.4        312.6
Sequence accuracy         0.0        0.0          0.0
Character accuracy        0.13       0.13         0.13

Classification loss

Localization loss

Total loss (classification loss + localization loss)

One of two correctly classified examples of testing dataset (Green - true bounding box, Red - predicted bounding box)

Incorrectly classified example of testing dataset (Green - true bounding box, Red - predicted bounding box)

Localization model with square error

I create a new model by removing the fully connected layers between the convolutional part and the output parts of the previous model, and I add two more convolutional layers. I also use the square error as the localization loss instead of the mean squared error.
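Reading "square error" as the summed (unaveraged) squared error (my interpretation), the change amounts to swapping the reduction of the loss:

```python
import torch.nn.functional as F

def localization_loss(bbox_pred, bbox_true, summed=True):
    # summed=True:  square error of this model (no averaging, larger gradients);
    # summed=False: mean squared error of the previous model.
    return F.mse_loss(bbox_pred, bbox_true, reduction='sum' if summed else 'mean')
```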

Localization model with square error

The model learns to localize the sequence much better. However, the classification fails again.

Metric / Dataset          Training   Validation   Testing
Mean localization error   0.32       0.32         0.34
Sequence accuracy         0.0        0.0          0.0
Character accuracy        0.10       0.11         0.10

Classification loss

Localization loss

Total loss (classification loss + localization loss)

Localized example of testing dataset (Green - true bounding box, Red - predicted bounding box)

Localized example of testing dataset (Green - true bounding box, Red - predicted bounding box)

Localization model for classification only

I suspect that it is not possible for the network to learn the classification. I test this hypothesis by removing the localization loss from training, so that the loss consists of the cross entropy only.

The network achieves better results than the "Sequence model" used yesterday, which refutes my hypothesis. It seems the model is able to learn the classification only when the localization loss is not present. I will test multiplying the individual losses by weights.

Metric / Dataset      Training   Validation   Testing
Sequence accuracy     0.91       0.75         0.74
Character accuracy    0.98       0.94         0.94

Correctly classified testing example

Update 2.8.2017

Weighted loss

I use the model from the previous day (which was able to learn localization or classification, but not both at the same time). I increase the number of fully connected layers in the localization part and weight the losses. The loss formula is:

loss = 1000 * "classification loss (cross entropy)" + "localization error (mean squared error)"
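In code, the weighted loss could look like this (a sketch following the formula above; tensor names are mine):

```python
import torch.nn.functional as F

CLS_WEIGHT = 1000.0  # weight from the formula above

def weighted_loss(logits, bbox_pred, digits, bbox_true):
    ce = F.cross_entropy(logits.flatten(0, 1), digits.flatten())  # classification
    mse = F.mse_loss(bbox_pred, bbox_true)                        # localization
    return CLS_WEIGHT * ce + mse
```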

Model architecture

The model finally learns to classify the sequence and localize it at the same time. My explanation is that the localization loss is much bigger than the classification loss (approximately 9000 vs. 2.8). This causes the weights to adjust to the localization task at the beginning of training, as that lowers the total loss the most. It also prevents the model from learning the classification task later, because doing so would increase the localization loss more than it would decrease the classification loss. The weight of 1000 brings the two losses to a comparable scale and allows optimizing both at the same time.

Correctly classified testing example (Green - true bounding box, Red - predicted bounding box)

Correctly classified testing example (Green - true bounding box, Red - predicted bounding box)

Testing example with one error (Green - true bounding box, Red - predicted bounding box)

Metric / Dataset          Training   Validation   Testing
Mean localization error   2.06       2.24         2.32
Sequence accuracy         0.86       0.75         0.74
Character accuracy        0.97       0.94         0.94

Accuracy per character on training dataset

Accuracy per character on validation dataset

Accuracy per character on testing dataset

Accuracy per sequence on training dataset

Accuracy per sequence on validation dataset

Accuracy per sequence on testing dataset

Mean of position error on training dataset

Mean of position error on validation dataset

Mean of position error on testing dataset

Classification loss

Localization loss

Total loss

Update 3.8.2017

Sequences of variable length

I make a new dataset consisting of sequences of variable length: a sequence has at least one and at most five digits. The missing digits are labeled with a special character in the dataset.
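A minimal sketch of the label padding (the special character is class 10, as visible in the labels of the images further below; function and constant names are mine):

```python
import numpy as np

MAX_LEN = 5
BLANK = 10  # special class marking a missing digit

def pad_labels(digits):
    """Pad a sequence of one to five digit labels to a fixed length."""
    padded = np.full(MAX_LEN, BLANK, dtype=np.int64)
    padded[:len(digits)] = digits
    return padded
```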

Example of dataset (tiny red rectangle is center of bounding box)

Weighted loss

I apply the model from the previous day. It doesn't have any problem learning to classify and localize on this new dataset. The tens in the labels in the following images are the special character signifying a missing digit.

Correctly classified testing example (Green - true bounding box, Red - predicted bounding box)

Correctly classified testing example (Green - true bounding box, Red - predicted bounding box)

Testing example with one error (Green - true bounding box, Red - predicted bounding box)

Metric / Dataset          Training   Validation   Testing
Mean localization error   2.62       2.90         2.93
Sequence accuracy         0.86       0.79         0.80
Character accuracy        0.97       0.95         0.95

Classification loss

Localization loss

Total loss
