Water meter recognition

Recognizing digits from a meter is a classic computer vision problem and can be tackled in a wide variety of ways. Deep learning could be used to do basic object recognition (e.g. YOLOv3), Haar cascades could be used to detect digits by sliding a window across the image, and so on. While these are valid solutions, I wanted to build something low-powered that is able to run on a Raspberry Pi, so I went with my own segmentation and then trained a LightGbm model on the shapes.

Extracting the digits from the image

A cropped image of the meter looks like this:

It's pretty clear and the camera (in this case an old phone) is going to be taking new snapshots from the same location so there won't be much deviation from the above image.
The first step is to normalize the image, making sure that the values are spread out to the full 0-255 range to be more invariant to changes in brightness.
The next step is simple binary thresholding. This gives us the following image:
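As a rough illustration of these two steps, here is a minimal sketch in plain Python. It assumes the cropped image is already a grayscale grid of 0-255 values and that the digits are brighter than the background; the function names and the cutoff of 128 are my own assumptions, not values taken from the actual implementation.

```python
def normalize(gray):
    """Stretch pixel values linearly so they span the full 0-255 range."""
    flat = [p for row in gray for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A flat image has no contrast to stretch; return a copy unchanged.
        return [row[:] for row in gray]
    scale = 255.0 / (hi - lo)
    return [[round((p - lo) * scale) for p in row] for row in gray]


def threshold(gray, cutoff=128):
    """Binary threshold: 1 (white) where the pixel is >= cutoff, else 0 (black).

    The cutoff of 128 is an assumed value for illustration.
    """
    return [[1 if p >= cutoff else 0 for p in row] for row in gray]


# A tiny low-contrast "image": values only span 60..140 before normalizing.
image = [
    [60, 62, 90],
    [61, 140, 138],
    [60, 139, 61],
]
normalized = normalize(image)
mask = threshold(normalized)
```

Without the normalization step, a fixed cutoff would behave differently on a dark snapshot than on a bright one, which is why the stretch comes first.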

The digits are looking pretty clear versus the background, but some touch the outer border of the meter, which is annoying. That border needs to be removed. That can be done by counting the number of black pixels on each row from top to bottom (for the top border) and bottom to top (for the bottom border). Stop removing rows as soon as black pixels make up more than 70% of the row. The same can be done for the left and right borders, but this time stop when the black pixels exceed 50% of the total pixels in the column. These percentage values took some trial and error to see what worked best. After trimming the borders the image looks like this:
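The border trimming could look roughly like the sketch below. It assumes a binary mask stored as a list of rows with 0 for black and 1 for white, and it measures the columns after the rows have been trimmed; that ordering, like the function names, is an assumption made for the example.

```python
def black_fraction(pixels):
    """Fraction of pixels in a row or column that are black (0)."""
    return sum(1 for p in pixels if p == 0) / len(pixels)


def trim_borders(mask, row_limit=0.7, col_limit=0.5):
    """Drop border rows/columns until one is mostly black (background)."""
    top, bottom = 0, len(mask)
    # Walk inwards from the top and bottom edges.
    while top < bottom and black_fraction(mask[top]) <= row_limit:
        top += 1
    while bottom > top and black_fraction(mask[bottom - 1]) <= row_limit:
        bottom -= 1
    rows = mask[top:bottom]
    if not rows:
        return []

    def column(x):
        return [row[x] for row in rows]

    left, right = 0, len(rows[0])
    # Same idea for the left and right edges, with the lower limit.
    while left < right and black_fraction(column(left)) <= col_limit:
        left += 1
    while right > left and black_fraction(column(right - 1)) <= col_limit:
        right -= 1
    return [row[left:right] for row in rows]


# A white frame (1s) around a mostly black interior with a few digit pixels.
framed = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
trimmed = trim_borders(framed)
```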

Great! Now the digits are nicely separated from each other. Each digit can now be detected by a depth-first search: scan over the image and at each white pixel encountered do a "flood fill" to get the whole shape. Then remove shapes that aren't large enough to be a digit (e.g. shorter than a third of the image height). That approach works great when the digits are all clearly visible, but not so much when a digit is transitioning from one to another, like in this image:

The transitioning digit would be 2 separate shapes, both too small and thus ignored.
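The flood-fill detection described above can be sketched as follows. This version uses an explicit stack for the depth-first search and 4-connectivity (up, down, left, right); the connectivity and the function name are assumptions for the example. The sample mask also shows the problem with a transitioning digit: its two halves are each too short and get dropped.

```python
def find_shapes(mask, min_height_ratio=1 / 3):
    """Return bounding boxes (left, top, right, bottom) of tall-enough shapes."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if mask[y][x] != 1 or seen[y][x]:
                continue
            # Depth-first flood fill starting from this white pixel.
            stack = [(y, x)]
            seen[y][x] = True
            top, bottom, left, right = y, y, x, x
            while stack:
                cy, cx = stack.pop()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            # Keep only shapes that are tall enough to be a digit.
            if bottom - top + 1 >= height * min_height_ratio:
                boxes.append((left, top, right, bottom))
    return sorted(boxes)  # sorted by left edge, i.e. reading order


digits = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],  # column 5: top half of a transitioning digit
    [0, 1, 0, 1, 0, 0, 0],  # column 3: a speck of noise
    [0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],  # column 5: bottom half of the same digit
    [0, 0, 0, 0, 0, 0, 0],
]
shapes = find_shapes(digits)
```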
As the digits are always in the same place, I spent some time trying to detect the offset of the first image and then do a region-based cut-out to get the digits, but that didn't work very well. Instead I opted to fill the gaps between the parts of a digit with the help of morphological operations: dilation and erosion. Before the operations the clean binary mask looks like:

Now after dilating twice:

and doing erosion to restore the original position:

you can see that the separated parts are now joined together. Sadly this doesn't look much like any digit anymore, so let's just use these shapes' bounding boxes on the original binary mask:

Nice! Now we have a nice segmentation per digit, even for the transitioning digits. It's a pretty close call though: you can see that the dilation in the last 2 digits is only a single pixel away from the other digit, and it frequently happened that those 2 ended up being a single shape. To solve that I checked whether any shape had a width > 1.8x its height and split that shape in 2 so there were 2 shapes again. That fixed the issue.
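Here is a minimal sketch of the gap filling and the wide-shape split. It assumes a 3x3 structuring element, the same number of erosions as dilations, and a split exactly down the middle; none of those details are spelled out above, so treat them as assumptions. Boxes are `(left, top, right, bottom)` with inclusive coordinates.

```python
def _morph(mask, hit_value):
    """Set a pixel to hit_value if any pixel in its 3x3 neighbourhood has it."""
    height, width = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for y in range(height):
        for x in range(width):
            for ny in range(max(0, y - 1), min(height, y + 2)):
                for nx in range(max(0, x - 1), min(width, x + 2)):
                    if mask[ny][nx] == hit_value:
                        out[y][x] = hit_value
    return out


def dilate(mask):
    """Grow white regions by one pixel in every direction."""
    return _morph(mask, 1)


def erode(mask):
    """Shrink white regions by one pixel in every direction."""
    return _morph(mask, 0)


def close_gaps(mask, iterations=2):
    """Dilate, then erode the same number of times, to join nearby parts."""
    for _ in range(iterations):
        mask = dilate(mask)
    for _ in range(iterations):
        mask = erode(mask)
    return mask


def split_wide_boxes(boxes, max_ratio=1.8):
    """Split any box wider than max_ratio times its height into two halves."""
    result = []
    for left, top, right, bottom in boxes:
        width, height = right - left + 1, bottom - top + 1
        if width > max_ratio * height:
            middle = left + width // 2
            result.append((left, top, middle - 1, bottom))
            result.append((middle, top, right, bottom))
        else:
            result.append((left, top, right, bottom))
    return result


# Two halves of one digit in column 3, separated by a two-pixel gap.
halves = [[0] * 7 for _ in range(12)]
for row in (3, 4, 7, 8):
    halves[row][3] = 1
closed = close_gaps(halves)
boxes = split_wide_boxes([(0, 0, 19, 9), (25, 0, 34, 9)])
```

The closed mask is only used to find the bounding boxes; the pixels for classification are still cut from the original binary mask.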

Recognizing each shape

Now that we can successfully extract each shape, we need to build some kind of model that can recognize a shape and say what digit it is. The easiest way to do so is multiclass classification from a feature vector, in this case with the LightGbm implementation of ML.NET. This algorithm gave me the best classification results after testing a training set with AutoML. There are some hurdles to overcome to actually build a training set though. Shapes can be of arbitrary size, whereas the feature vector needs to have a fixed length. Due to noise, shapes can also be offset and aren't exactly centered all the time.
The first step was to manually label all samples taken previously (do this while listening to a podcast to not go completely insane):

Next was to save each shape in a folder for its labelled digit, e.g. all samples for the digit '3':

Now that we have samples for each digit we can build a training set, as follows:
For each digit:

Create a blank image of 20x20. The feature vector will thus have a length of 400.
Take a random sample from the digit's folder and place it at a random position in the blank image. This ensures that offset digits are represented in the training set.
Add 10 random white pixels to make the model more resilient to noise.

Doing this 10000 times per digit results in a training set of 100000 samples. These can be saved to a .tsv file (along with the digit as label) or used directly to train the LightGbm model.
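The sample generation could be sketched like this. It assumes every shape is a binary grid that fits inside 20x20 and that the samples are already loaded into a dictionary keyed by digit; `samples_by_digit`, the function names and the seed are hypothetical and only serve the example. Noise pixels may land on top of each other or on the shape itself.

```python
import random

SIZE = 20          # canvas is SIZE x SIZE, so 400 features
NOISE_PIXELS = 10  # random white pixels added per sample


def make_training_sample(shape, rng):
    """Place a shape at a random offset on a blank canvas and add noise."""
    shape_h, shape_w = len(shape), len(shape[0])
    canvas = [[0] * SIZE for _ in range(SIZE)]
    # Random offset that keeps the whole shape inside the canvas.
    off_y = rng.randint(0, SIZE - shape_h)
    off_x = rng.randint(0, SIZE - shape_w)
    for y in range(shape_h):
        for x in range(shape_w):
            canvas[off_y + y][off_x + x] = shape[y][x]
    for _ in range(NOISE_PIXELS):
        canvas[rng.randrange(SIZE)][rng.randrange(SIZE)] = 1
    # Flatten row by row into a fixed-length feature vector.
    return [p for row in canvas for p in row]


def build_training_set(samples_by_digit, per_digit, seed=0):
    """Return (label, features) pairs, per_digit of them for every digit."""
    rng = random.Random(seed)
    rows = []
    for digit, shapes in samples_by_digit.items():
        for _ in range(per_digit):
            rows.append((digit, make_training_sample(rng.choice(shapes), rng)))
    return rows


# Two toy "digits" standing in for the labelled sample folders.
samples_by_digit = {
    1: [[[1], [1], [1]]],
    7: [[[1, 1, 1], [0, 0, 1], [0, 0, 1]]],
}
training_set = build_training_set(samples_by_digit, per_digit=3)
```

Writing each pair out as one tab-separated line (label first, then the 400 values) would give the .tsv file mentioned above.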

... 50 minutes of training later ...

We now have a model that can recognize shapes, great! I've skipped the cross-validation part here because AutoML already did that for me (on a smaller set) and indicated an accuracy of 99.79%. The actual prediction part is now pretty simple and reuses the steps described above to build a training sample:

Get the shapes from the input image as described above
Process each of them from left to right by placing the shape's pixels in the middle of a blank 20x20 image
Transform that resulting image into a feature vector and feed it to the prediction engine that has loaded the trained model
Take the label with the highest score; that's going to (hopefully) be the digit for that particular shape
Concatenate the digits together and voila, a full recognition of the meter.
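The prediction steps above can be sketched as follows. The trained model is represented by a `predict_scores` callable that takes a 400-value feature vector and returns one score per digit; that callable, like the other names here, is a hypothetical stand-in and not the ML.NET prediction engine API. The `fake_scores` function only exists to make the example runnable.

```python
SIZE = 20  # same canvas size as used for the training samples


def center_on_canvas(shape):
    """Place a shape in the middle of a blank canvas and flatten it."""
    shape_h, shape_w = len(shape), len(shape[0])
    off_y = (SIZE - shape_h) // 2
    off_x = (SIZE - shape_w) // 2
    canvas = [[0] * SIZE for _ in range(SIZE)]
    for y in range(shape_h):
        for x in range(shape_w):
            canvas[off_y + y][off_x + x] = shape[y][x]
    return [p for row in canvas for p in row]


def read_meter(mask, boxes, predict_scores):
    """Classify each boxed shape from left to right and join the digits."""
    digits = []
    for left, top, right, bottom in sorted(boxes):
        # Cut the shape out of the original binary mask (inclusive box).
        shape = [row[left:right + 1] for row in mask[top:bottom + 1]]
        scores = predict_scores(center_on_canvas(shape))
        # The label is the index of the highest score.
        best = max(range(len(scores)), key=scores.__getitem__)
        digits.append(str(best))
    return "".join(digits)


def fake_scores(features):
    """Stand-in model: 'recognizes' a shape by its white pixel count."""
    scores = [0.0] * 10
    scores[sum(features) % 10] = 1.0
    return scores


mask = [
    [1, 1, 0, 0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
]
reading = read_meter(mask, [(4, 0, 6, 2), (0, 0, 1, 1)], fake_scores)
```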

Conclusion

The recognition is pretty accurate: almost all the images I fed it are predicted perfectly. The only ones that had an issue are the ones with the last digit transitioning, but that can be solved if these transitions are also added to the training set. Some of those are correctly recognized even though they were not part of the training set! I think the random offset is doing its work here, as the score will occasionally be higher for a partially matched number than for the other numbers. One final remark though: ML.NET is currently not available for ARM, so it won't run on a Raspberry Pi for now. It is planned for the future though. In the meantime I've wrapped the prediction in a tiny ASP.NET Core REST service that the image can be uploaded to in order to get a prediction; that way it can easily be done through a simple 'curl --upload-file' statement.