In the previous posts I experimented with the recognition of single MNIST digits and continued with sequences of MNIST digits. Now I have decided to add localization of the digits to the task too. The goal of the whole project is to create a live camera app able to recognize and localize a sequence of digits.
Sequence of digits recognition and localization
I modified the original dataset by placing the digit sequences at a random position on a 128×256 canvas. The labels now consist of the digits to recognize plus the x, y coordinates, width, and height defining the bounding box of the sequence. The modified dataset has the same number of examples as the original (165,000 training, 15,000 validation, 30,000 test), but it is much larger: the original training dataset is 2 GB, while the modified one is 42 GB. Because of its size, the modified dataset no longer fits into memory, so I have to load it from the HDD. This increased the training time from a few tens of minutes to approximately 8 hours.
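The placement step can be sketched roughly as follows. This is a minimal numpy illustration, not the actual generation code; the function name `place_sequence` and the fixed 28-pixel digit size are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def place_sequence(digits, canvas_h=128, canvas_w=256, digit_size=28):
    """Paste a row of 28x28 digit images at a random position on a blank canvas.

    `digits` has shape (n, 28, 28). Returns the canvas and the bounding box
    (x, y, w, h) of the pasted sequence, which becomes part of the label.
    """
    n = digits.shape[0]
    seq_w, seq_h = n * digit_size, digit_size
    x = rng.integers(0, canvas_w - seq_w + 1)   # random top-left corner
    y = rng.integers(0, canvas_h - seq_h + 1)
    canvas = np.zeros((canvas_h, canvas_w), dtype=np.float32)
    strip = np.concatenate(list(digits), axis=1)  # (28, n*28) digit strip
    canvas[y:y + seq_h, x:x + seq_w] = strip
    return canvas, (x, y, seq_w, seq_h)

digits = rng.random((5, 28, 28)).astype(np.float32)  # stand-in for MNIST digits
canvas, bbox = place_sequence(digits)
```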
I applied the original sequence model to the modified dataset to see how much harder it is to recognize the sequence in the bigger input. I don't localize the sequence yet. On the original dataset, the Sequence model achieved an accuracy of 1.0 on the training set, 0.8 on the validation set, and 0.82 on the testing set. On the modified dataset, it achieves an accuracy of only 0.17 on the training set, 0.17 on the validation set, and 0.15 on the testing set. (Reminder: a sequence is considered correctly recognized only if all of its digits are correctly recognized.) This shows that recognizing the sequence in the bigger input is a much harder task. The suspicious thing is the low accuracy despite the low loss. I will report the per-digit recognition accuracy of future models too.
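The "all digits correct" metric from the reminder above can be expressed in a few lines of numpy; this is just an illustrative sketch, and the function name is mine.

```python
import numpy as np

def sequence_accuracy(pred, target):
    """Fraction of sequences whose digits are ALL predicted correctly.

    pred, target: integer arrays of shape (batch, seq_len).
    """
    return np.mean(np.all(pred == target, axis=1))

pred   = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
target = np.array([[1, 2, 3], [4, 5, 0], [7, 8, 9]])
sequence_accuracy(pred, target)  # 2 of 3 sequences fully match -> 2/3
```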
I created a baseline model for sequence classification and localization. The model consists of three convolutional layers followed by two fully connected layers, and its outputs are split into a classification part and a localization part. The classification part is the recurrent head used in the previous models (a GRU unrolled five times, followed by a fully connected layer). The localization output is produced by two fully connected layers. The total loss is the sum of the cross entropy (classification) and the mean squared error (localization).
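The combined loss can be sketched in numpy as follows. This is my own illustrative implementation of "cross entropy + MSE", not the post's actual training code; the shapes (batch, 5 digits, 10 classes, 4 box coordinates) are assumptions based on the task description.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean per-digit cross entropy. logits: (batch, seq_len, 10), labels: (batch, seq_len)."""
    z = logits - logits.max(axis=-1, keepdims=True)           # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.mean()

def mse(pred_box, true_box):
    """Mean squared error over the (x, y, w, h) box coordinates. Shapes: (batch, 4)."""
    return np.mean((pred_box - true_box) ** 2)

def total_loss(logits, labels, pred_box, true_box):
    """Sum of the classification and localization losses, as in the baseline model."""
    return cross_entropy(logits, labels) + mse(pred_box, true_box)

logits = np.zeros((2, 5, 10))                 # uniform predictions
labels = np.zeros((2, 5), dtype=int)
boxes  = np.array([[10.0, 20.0, 140.0, 28.0]] * 2)
loss = total_loss(logits, labels, boxes, boxes)  # box term is 0 for a perfect box
```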
The model learns to localize the sequence with “not so great” precision. The classification fails almost completely.
|                         | Training | Validation | Testing |
|-------------------------|----------|------------|---------|
| Mean localization error | 312.7    | 313.4      | 312.6   |
Localization model with square error
I created a new model by removing the fully connected layers between the convolutional part and the output parts of the previous model, and I added two more convolutional layers. I also use the squared error as the localization loss instead of the mean squared error (MSE).
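The difference between the two localization losses is only a scale factor. Assuming "squared error" means the sum of squared errors over the four box coordinates (my reading of the post), it is simply 4× the MSE, which changes the loss magnitude and the gradient scale:

```python
import numpy as np

pred_box = np.array([10.0, 20.0, 140.0, 28.0])   # predicted (x, y, w, h)
true_box = np.array([12.0, 18.0, 138.0, 30.0])

sse = np.sum((pred_box - true_box) ** 2)   # summed squared error
mse = np.mean((pred_box - true_box) ** 2)  # mean squared error
# For a 4-component box, sse == 4 * mse, so the summed version weighs
# localization more heavily in the total loss.
```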
The model learns to localize the sequence much better. However, the classification fails again.
|                         | Training | Validation | Testing |
|-------------------------|----------|------------|---------|
| Mean localization error | 0.32     | 0.32       | 0.34    |
Localization model for classification only
I suspect that it is not possible for the network to learn the classification. I want to test this hypothesis by removing the localization loss from learning, so the loss consists of cross entropy only.
The network achieves better results than the "Sequence model" used yesterday, which refutes my hypothesis. It seems that the model is able to learn the classification, but only when the localization loss is not present. I will try multiplying the individual losses by weights.
I use the model from the previous day (which was able to do either localization or classification, but not both at the same time). I increased the number of fully connected layers in the localization part, and I weight the losses. The loss formula is:
loss = 1000 * "classification loss (cross entropy)" + "localization loss (mean squared error)"
The model finally learns to classify the sequence and localize it at the same time. My explanation is that the localization loss is much bigger than the classification loss (approximately 9000 vs. 2.8). This causes the weights to adjust to the localization task at the beginning of training, as that lowers the total loss the most. It also prevents the model from learning the classification task later, because doing so would increase the localization loss more than it would decrease the classification loss. The weighting allows both losses to be optimized at the same time.
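Plugging in the approximate magnitudes from the post shows why the weight helps: without it, the localization term dominates the sum, while the factor of 1000 brings the two terms to a comparable scale. The function name is mine; the numbers are the approximate values quoted above.

```python
def weighted_loss(class_loss, loc_loss, class_weight=1000.0):
    """loss = 1000 * classification loss + localization loss (the post's weighting)."""
    return class_weight * class_loss + loc_loss

# Approximate magnitudes early in training, as quoted in the post:
class_loss, loc_loss = 2.8, 9000.0
unweighted = class_loss + loc_loss               # ~9002.8, localization dominates
weighted = weighted_loss(class_loss, loc_loss)   # ~11800, both terms comparable
```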
|                         | Training | Validation | Testing |
|-------------------------|----------|------------|---------|
| Mean localization error | 2.06     | 2.24       | 2.32    |
Sequences of variable length
I made a new dataset consisting of sequences of variable length. The minimum length of a sequence is one and the maximum is five. The missing digits are labeled with a special character in the dataset.
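Padding the labels to a fixed length might look like this. The post later shows the special character rendered as a ten, so I assume here that class id 10 marks a missing digit; the function name is hypothetical.

```python
import numpy as np

BLANK = 10    # assumed class id for the "missing digit" special character
MAX_LEN = 5   # maximum sequence length in the dataset

def pad_labels(digits):
    """Pad a variable-length digit list to MAX_LEN with the BLANK class."""
    assert 1 <= len(digits) <= MAX_LEN
    return np.array(list(digits) + [BLANK] * (MAX_LEN - len(digits)))

pad_labels([3, 1, 4])  # -> array([ 3,  1,  4, 10, 10])
```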
I applied the model from the previous day. It has no problem learning to classify and localize on this new dataset. The tens in the labels in the following images are the special characters signifying a missing digit.
|                         | Training | Validation | Testing |
|-------------------------|----------|------------|---------|
| Mean localization error | 2.62     | 2.90       | 2.93    |