We began with the idea of adding hand gesture recognition to our little LEGO robot, Charlie, while working on OpenCV's International Spatial AI competition. We converted MediaPipe's hand and finger tracking model for use on the OAK-D device and applied Kazuhito Takahashi's gesture recognizer to perform this task. This combination worked so well that we thought it might be relatively easy to add support for ASL alphabet recognition as well. Little did we know that this decision would lead us down a deep rabbit hole. We are happy to report that everything turned out really well in the end. We hope that this blog post will help others build on our efforts to make better ASL recognizers for the OAK-D. For those interested in the technical details, the complete dataset and source code are available at https://github.com/cortictechnology/hand_asl_recognition.
The American Sign Language Alphabet Hand Signals. Source: APSEA
Before we started any real work on ASL recognition, we looked around for related work that other people had already done. We found this great GitHub repo by David Lee that performed ASL recognition by fine-tuning the Yolov5 object detector on his own ASL dataset. We could have followed his approach, but that would have meant training our own Yolov5 detector for the OAK-D and getting it to run alongside MediaPipe's hand tracker, which already employs a palm detector. Running two detectors this way would require a lot of computational power and would significantly lower our framerate.
The lightbulb moment hit us when we realized that we could easily find a bounding rectangle for the entire hand from the detected locations of the landmarks. We could then run an image classifier on the cropped hand image to do ASL recognition. This approach would save a significant amount of computational resources and would allow us to run both the hand tracker and the ASL recognizer in the same DepthAI pipeline. We quickly set out to train our image classifier to implement this "brilliant" idea.
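To make the cropping idea concrete, here is a minimal sketch of how a bounding rectangle can be derived from hand landmarks. It assumes the landmarks arrive as (x, y) pairs normalized to the range 0 to 1, and the function name and the `margin` padding parameter are our own illustrative choices rather than part of the released code.

```python
def hand_bounding_box(landmarks, frame_w, frame_h, margin=0.1):
    """Return (left, top, right, bottom) in pixels around all landmarks.

    landmarks: iterable of (x, y) pairs normalized to [0, 1] (assumed format).
    margin: extra padding, as a fraction of the larger box side (hypothetical).
    """
    # Convert normalized coordinates to pixel coordinates.
    xs = [x * frame_w for x, _ in landmarks]
    ys = [y * frame_h for _, y in landmarks]

    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)

    # Pad the tight box so fingertips are not cut off at the edge.
    pad = margin * max(x1 - x0, y1 - y0)

    # Clamp to the frame so the crop never indexes outside the image.
    left = max(0, int(x0 - pad))
    top = max(0, int(y0 - pad))
    right = min(frame_w, int(x1 + pad) + 1)
    bottom = min(frame_h, int(y1 + pad) + 1)
    return left, top, right, bottom
```

With an image stored as a NumPy-style array, the crop passed to the classifier would then be something like `frame[top:bottom, left:right]`.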
We selected MobileNetV2 as our model of choice because it's lightweight and has fewer parameters compared to MobileNetV3. We used David's ASL dataset to perform transfer learning on this model: we first trained a newly added softmax layer on top of the MobileNetV2 model for the classification task, then unfroze the entire model to fine-tune the pre-trained weights. Because the letters "J" and "Z" require motion, we removed them from our classifier. After a few training runs, we found that the classification accuracy hovered around 68-71%, which is definitely not usable.
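Dropping "J" and "Z" leaves 24 output classes. The sketch below shows one way to map the softmax output back to a letter. It assumes the classes are ordered alphabetically, which may not match the ordering in the released model, and `decode_prediction` is a hypothetical helper name.

```python
import string

# 24 static letters: the alphabet minus the motion-based "J" and "Z".
# Alphabetical class order is an assumption for this sketch.
ASL_LABELS = [c for c in string.ascii_uppercase if c not in "JZ"]


def decode_prediction(probabilities):
    """Return (letter, confidence) for the highest-scoring class."""
    if len(probabilities) != len(ASL_LABELS):
        raise ValueError("expected one probability per static letter")
    # Index of the largest probability (argmax) without extra dependencies.
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return ASL_LABELS[best], probabilities[best]
```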
We tried the same dataset on a more complex EfficientNetB0 and found that the accuracy was still around 70%. This told us that the number of images in this dataset was not enough to train the classifier. We needed more data.
For the next couple of hours, we scrambled to capture more images for each letter. We managed to collect 480 new images and added them to David's dataset. We applied the usual data augmentation techniques, such as random rotations, horizontal flips, and horizontal and vertical shifts, to these new images.
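As a rough illustration of what flips and shifts do to an image, here is a dependency-free sketch that treats an image as a list of pixel rows. It is not the code we trained with: a real pipeline would normally use the augmentation utilities of its training framework, rotation is left out for brevity, and the function names and `max_shift` parameter are illustrative.

```python
import random


def horizontal_flip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def shift(img, dx, dy, fill=0):
    """Translate the image by (dx, dy) pixels, filling exposed pixels."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out


def augment(img, rng, max_shift=1):
    """Apply a random flip and a random shift to one image."""
    if rng.random() < 0.5:
        img = horizontal_flip(img)
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return shift(img, dx, dy)
```

Calling `augment(img, random.Random(0))` several times with different seeds yields several variants of the same captured image, which is how a small set of photos can be stretched into a larger training set.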
After using this new dataset to train both MobileNetV2 and EfficientNetB0, we found that the accuracy for both was very similar, around 90%. We believe that EfficientNetB0 would perform better (in terms of accuracy) if more training data were available. However, for our current dataset, we decided to go back to MobileNetV2 because its inference speed is much faster than that of EfficientNetB0.
We integrated this new ASL classifier into Charlie and ran various tests on it. We found that using MediaPipe's hand tracker allowed us to detect hands much further away compared to David's Yolov5 model.
Also, according to David's README on GitHub, the Yolov5 model returned no predictions for the letters "G" and "H" and incorrect predictions for "D", "E", "P", and "R". Our model seems to work just fine for these letters. We originally anticipated problems with the letters "A", "M", "N", "S", and "T" as they look very similar to one another. However, we were pleasantly surprised to find that the letters "A", "M", "S", and "T" can all be classified correctly. The only letter that presents significant classification problems is "N".
We are reasonably certain that the problem with the letter "N" can be eliminated by collecting more data. As for the letters "J" and "Z", we should be able to use Kazuhito Takahashi's gesture recognizer to do the job.
In summary, the two days that we spent on data collection, training, testing, and integrating this ASL classifier into our system were time well spent. We are open-sourcing our dataset, the ASL classification model, and the sample code we created at https://github.com/cortictechnology/hand_asl_recognition. We sincerely hope that someone will be able to improve it even further.
The hand tracker from MediaPipe is actually able to identify left and right hands. We took advantage of this property and added a simple way to detect which side (palmar or dorsal) of the hand the camera sees.
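One possible way to do this, sketched below, is to look at the winding order of three landmarks: the wrist, the index finger base, and the pinky base (indices 0, 5, and 17 in MediaPipe's hand landmark layout). This is an illustration under stated assumptions and not necessarily the method in our repository. It assumes image coordinates with x to the right and y downward, a non-mirrored frame, and a handedness label that names the person's actual hand. If the frame is mirrored or the tracker reports handedness for a mirrored view, the label or the sign test needs to be swapped.

```python
WRIST, INDEX_MCP, PINKY_MCP = 0, 5, 17  # MediaPipe hand landmark indices


def visible_side(landmarks, handedness):
    """Guess which side of the hand faces the camera.

    landmarks: sequence of 21 (x, y) points in image coordinates (y down).
    handedness: "left" or "right", the person's actual hand (assumption).
    Returns "palmar", "dorsal", or "unknown" for a degenerate pose.
    """
    wx, wy = landmarks[WRIST]
    ix, iy = landmarks[INDEX_MCP]
    px, py = landmarks[PINKY_MCP]

    # z-component of (index - wrist) x (pinky - wrist): its sign encodes
    # whether the index base lies clockwise or counter-clockwise of the pinky.
    cross = (ix - wx) * (py - wy) - (iy - wy) * (px - wx)
    if cross == 0:
        return "unknown"

    # Under the stated assumptions, a right hand showing its palm gives a
    # negative cross product; a left hand gives the mirror image.
    is_palmar = (cross < 0) == (handedness.lower() == "right")
    return "palmar" if is_palmar else "dorsal"
```

Because only the sign of the cross product matters, the check is cheap and does not depend on the distance of the hand from the camera.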