Google Open-Sources Real-Time Gesture Recognition Tech

Google used computer vision and machine learning to develop a better way to perceive hand shapes and motions in real time, for use in gesture control systems, sign language recognition and augmented reality. The result is the ability to infer up to 21 3D points of a hand (or hands) on a mobile phone from a single frame. Google, which demonstrated the technique at the 2019 Conference on Computer Vision and Pattern Recognition, also put the source code and a complete use case scenario on GitHub.

According to VentureBeat, Google also implemented its new technique “in MediaPipe, a cross-platform framework for building multimodal applied machine learning pipelines to process perceptual data of different modalities (such as video and audio).”

“The ability to perceive the shape and motion of hands can be a vital component in improving the user experience across a variety of technological domains and platforms,” wrote research engineers Valentin Bazarevsky and Fan Zhang in a Google AI Blog post. “We hope that providing this hand perception functionality to the wider research and development community will result in an emergence of creative use cases, stimulating new applications and new research avenues.”

The new technique is made up of three AI models that work together: BlazePalm, a palm detector that analyzes a frame and returns a hand bounding box; “a hand landmark model that looks at the cropped image region defined by the palm detector and returns 3D hand points; and a gesture recognizer that classifies the previously-computed point configuration into a set of gestures.”
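To make the handoff between the three stages concrete, here is a minimal Python sketch of how such a pipeline might be wired together. It is not Google's code or the MediaPipe API: the names `BoundingBox`, `crop` and `run_pipeline` are hypothetical, the frame is assumed to be a plain list of pixel rows, and the three models are passed in as ordinary functions so that any implementation could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point3D = Tuple[float, float, float]
Frame = List[List[int]]  # assumed here: a grayscale image as rows of pixel values


@dataclass
class BoundingBox:
    """Hypothetical hand bounding box in pixel coordinates."""
    x: int
    y: int
    width: int
    height: int


def crop(frame: Frame, box: BoundingBox) -> Frame:
    """Cut the region described by the bounding box out of the frame."""
    rows = frame[box.y:box.y + box.height]
    return [row[box.x:box.x + box.width] for row in rows]


def run_pipeline(
    frame: Frame,
    detect_palm: Callable[[Frame], Optional[BoundingBox]],
    predict_landmarks: Callable[[Frame], List[Point3D]],
    classify_gesture: Callable[[List[Point3D]], str],
) -> Optional[str]:
    """Run the three stages in order on a single frame.

    Returns the gesture label, or None when no palm is found.
    """
    box = detect_palm(frame)               # stage 1: palm detector
    if box is None:
        return None
    region = crop(frame, box)              # hand region defined by the detector
    landmarks = predict_landmarks(region)  # stage 2: 3D hand points
    return classify_gesture(landmarks)     # stage 3: gesture recognizer
```

Because each stage only sees the output of the one before it, the landmark model in this sketch works on a small cropped region rather than on the whole frame.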

Among the challenges, BlazePalm “has to contend with a lack of features while spotting occluded and self-occluded hands.” Google researchers “trained a palm detector instead of a hand detector” to overcome this problem, “since estimating bounding boxes of objects like fists tends to be easier than detecting hands and fingers.”

After the palm is detected, “the hand landmark model takes over, performing localization of 21 3D hand-knuckle coordinates inside the detected hand regions,” a task that took “30,000 real-world images manually annotated with coordinates, plus high-quality synthetic hand model rendered over various backgrounds and mapped to the corresponding coordinates.” Finally, the gesture recognition system determines “the state of each finger from joint angles and maps the set of finger states to predefined gestures.”
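The last step, deriving finger states from joint angles and looking them up in a table of gestures, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Google's implementation: the landmark ordering (wrist first, then four points per finger from base to tip), the 160-degree threshold, the gesture table and the `synthetic_hand` helper are all hypothetical choices made for the example.

```python
import math

# Assumed landmark layout (not specified in the article): index 0 is the
# wrist, followed by four points per finger, ordered from base to tip.
FINGERS = {
    "thumb": (1, 2, 3, 4),
    "index": (5, 6, 7, 8),
    "middle": (9, 10, 11, 12),
    "ring": (13, 14, 15, 16),
    "pinky": (17, 18, 19, 20),
}

# Hypothetical gesture table: set of extended fingers -> label.
GESTURES = {
    frozenset(): "fist",
    frozenset({"index"}): "one",
    frozenset({"index", "middle"}): "two",
    frozenset({"index", "pinky"}): "rock",
    frozenset(FINGERS): "five",
}


def joint_angle(a, b, c):
    """Angle in degrees at point b between the segments b->a and b->c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(p * q for p, q in zip(v1, v2))
    norm = math.sqrt(sum(p * p for p in v1)) * math.sqrt(sum(q * q for q in v2))
    if norm == 0:
        return 0.0
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.degrees(math.acos(cosine))


def finger_is_extended(landmarks, joints, threshold_deg=160.0):
    """Treat a finger as extended when both inner joints are nearly straight."""
    base, j1, j2, tip = joints
    return (
        joint_angle(landmarks[base], landmarks[j1], landmarks[j2]) > threshold_deg
        and joint_angle(landmarks[j1], landmarks[j2], landmarks[tip]) > threshold_deg
    )


def classify(landmarks):
    """Map the set of extended fingers to a predefined gesture label."""
    extended = frozenset(
        name for name, joints in FINGERS.items()
        if finger_is_extended(landmarks, joints)
    )
    return GESTURES.get(extended, "unknown")


def synthetic_hand(extended):
    """Build 21 made-up points: straight fingers for names in `extended`,
    fingers bent at right angles otherwise. For trying the code only."""
    landmarks = [(0.0, 0.0, 0.0)]  # wrist
    for i, name in enumerate(FINGERS):
        x = float(i)
        if name in extended:
            landmarks += [(x, 1.0, 0.0), (x, 2.0, 0.0), (x, 3.0, 0.0), (x, 4.0, 0.0)]
        else:
            landmarks += [(x, 1.0, 0.0), (x, 2.0, 0.0), (x, 2.0, -1.0), (x, 1.0, -1.0)]
    return landmarks
```

Calling `classify(synthetic_hand({"index", "pinky"}))` looks up the set of extended fingers in the table; adding a new static gesture in this sketch only requires a new table entry, not a new model.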

According to Bazarevsky and Zhang, the system can recognize counting gestures from multiple cultures (e.g. American, European and Chinese) and various hand signs including a closed fist, ‘OK’, ‘rock’, and ‘Spiderman’.

Bazarevsky, Zhang and their team “plan to extend the technology with more robust and stable tracking, and to enlarge the number of gestures it can reliably detect and support dynamic gestures unfolding in time.”

For more information on MediaPipe, visit the project’s GitHub repository.