Facebook Using Self-Supervised Models to Build AI Systems

Facebook debuted Learning from Videos, a project designed to learn from the audio, images, and text of publicly available Facebook videos in order to improve its core AI systems. By drawing on data from hundreds of languages and countries, said Facebook, the project will also help enable “entirely new experiences.” Learning from Videos, which began in 2020, has also helped to improve recommendations in Instagram Reels. Facebook, Google, and others are focusing on self-supervised techniques, which learn from unlabeled data, rather than on manually labeled datasets to improve AI.

VentureBeat reports that “Facebook says it’s using Generalized Data Transformations (GDT), a self-supervised system that learns the relationships between sounds and images, to suggest Instagram Reel clips relevant to recently watched videos while filtering out near-duplicates.”

GDT, which involves a “series of models trained across dozens of GPUs on a dataset of millions of Reels and videos from Instagram,” can learn to match an image of an audience clapping with the sound of applause, for example. Likewise, it “can surface recommendations based on videos that sound alike or look alike, respectively, by leveraging audio as a signal.”
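The core idea behind this kind of audio-visual matching can be illustrated with a small contrastive-learning sketch. This is not Facebook's GDT implementation; random vectors stand in for real audio and video encoder outputs, and the encoder, batch size, and temperature are all illustrative assumptions. Embeddings of the audio and video from the same clip should score higher than mismatched pairs, which is what an InfoNCE-style loss encourages:

```python
# Minimal sketch of contrastive audio-visual matching (GDT-style idea only).
# Random features stand in for real audio/video encoder outputs.
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Pretend these are outputs of an audio encoder and a video encoder for a
# batch of 4 clips; row i of each matrix comes from the same source clip.
shared = rng.normal(size=(4, 16))
audio_emb = l2_normalize(shared + 0.1 * rng.normal(size=(4, 16)))
video_emb = l2_normalize(shared + 0.1 * rng.normal(size=(4, 16)))

# Pairwise cosine similarities: entry (i, j) compares audio i with video j.
sim = audio_emb @ video_emb.T

# InfoNCE-style loss: each audio embedding should "pick out" the video
# embedding from its own clip (the diagonal of the similarity matrix).
def info_nce(sim, temperature=0.1):
    logits = sim / temperature
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # diagonal = matching pairs

print("loss:", info_nce(sim))

# Recommendation by sound ("videos that sound alike"): rank clips by audio
# similarity to a query clip's audio embedding.
ranking = np.argsort(-(audio_emb @ audio_emb[0]))
print("clips most similar in sound to clip 0:", ranking)
```

Once trained, the same embeddings serve both purposes the article mentions: ranking candidates by similarity for recommendations, and flagging near-duplicates when similarity is suspiciously high.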

Learning from Videos includes “Facebook’s work on wav2vec 2.0, an improved machine learning framework for self-supervised speech recognition.” According to Facebook, “when applied to millions of hours of unlabeled videos and 100 hours of labeled data, wav2vec 2.0 reduced the relative word error rate by 20 percent compared with supervised-only baselines.”

Facebook is now scaling wav2vec 2.0 “with millions of additional hours of speech from 25 languages to reduce labeling, bolster the performance of low- and medium-resource models, and improve other speech and audio tasks.”
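What makes wav2vec 2.0 "self-supervised" is its pre-training task: spans of latent speech frames are masked, and the model must identify the true latent for each masked position among distractors drawn from the same utterance. The toy sketch below captures only that objective, not Facebook's implementation; the encoder, masking scheme, and noise level are stand-in assumptions (real wav2vec 2.0 masks spans, quantizes targets, and uses a Transformer context network):

```python
# Toy sketch of the wav2vec 2.0 pre-training objective (not the real model):
# mask latent speech frames, then pick the true latent among distractors.
import numpy as np

rng = np.random.default_rng(1)

T, D = 50, 8                       # 50 latent frames, 8 dims each
latents = rng.normal(size=(T, D))  # stand-in for the CNN feature encoder output

# Mask ~50% of frames (the real model masks contiguous spans; single
# frames here for brevity).
masked = rng.random(T) < 0.5

# A real model runs a Transformer over the masked sequence to produce
# context representations; here "context" is just the true latent plus noise.
context = latents + 0.2 * rng.normal(size=(T, D))

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# For each masked frame, score the true latent against 9 distractors
# sampled from other positions in the same utterance.
correct = 0
n_masked = int(masked.sum())
for t in np.flatnonzero(masked):
    distractor_idx = rng.choice(np.delete(np.arange(T), t), size=9, replace=False)
    candidates = np.vstack([latents[t:t + 1], latents[distractor_idx]])
    scores = [cosine(context[t], c) for c in candidates]
    correct += int(np.argmax(scores) == 0)   # index 0 is the true latent

print(f"identified true latent at {correct}/{n_masked} masked positions")
```

Because the objective needs no transcripts, it can consume the "millions of additional hours" of unlabeled speech the article describes; the 100 hours of labeled data are only needed afterward, to fine-tune the pre-trained model for actual transcription.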

The company reported that it is also using “the Audio Visual Textual (AVT) model that aggregates and compares sound and visual information from videos as well as titles, captions, and descriptions,” adding speech recognition as an input, and plans to “apply the model to millions of videos before it begins testing it across its platform.” Also coming from the Learning from Videos project is TimeSformer (short for Time-Space Transformer), “a Facebook-developed framework for video understanding that’s based purely on the Transformer architecture.”

The company stated that TimeSformer “attains the best reported numbers on a range of action recognition benchmarks … [and] also takes roughly one-third the time to train as comparable models … and requires less than one-tenth the amount of compute for inference and can learn from video clips up to 102 seconds in length, much longer than most video-analyzing AI models.” According to Facebook AI researcher Lorenzo Torresani, “TimeSformer can be trained in 14 hours with 32 GPUs.”
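Much of that efficiency comes from TimeSformer's "divided space-time attention," which attends over time and over space separately instead of over all frame-patch tokens at once. The shape-level sketch below (a simplification, not the released TimeSformer code; the single untrained attention head and the dimensions are illustrative assumptions) shows the mechanics and why dividing the attention is cheaper:

```python
# Shape-level sketch of divided space-time attention (TimeSformer-style idea).
# One untrained single-head attention stands in for the real learned layers.
import numpy as np

rng = np.random.default_rng(2)

T, N, D = 8, 196, 32   # 8 frames, 14x14 = 196 patches per frame, 32-dim tokens
tokens = rng.normal(size=(T, N, D))

def self_attention(x):
    """Plain single-head self-attention over a (seq, dim) array."""
    q = k = v = x                     # untrained: projection weights omitted
    scores = q @ k.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Temporal attention: each spatial location attends across the T frames.
out = np.stack([self_attention(tokens[:, p, :]) for p in range(N)], axis=1)
# Spatial attention: each frame's patches attend to one another.
out = np.stack([self_attention(out[t]) for t in range(T)], axis=0)
print("output shape:", out.shape)

# Why divide? Joint attention builds one (T*N) x (T*N) score matrix, while
# divided attention builds N matrices of T x T plus T matrices of N x N.
joint = (T * N) ** 2
divided = N * T ** 2 + T * N ** 2
print("score-matrix entries, joint vs divided:", joint, divided)
```

The gap between the two counts grows with clip length, which is consistent with the article's point that TimeSformer can handle clips far longer than most video-analyzing models.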

Facebook has asserted that “systems like TimeSformer, GDT, wav2vec 2.0, and AVT will advance research to teach machines to understand long-form actions in videos, an important step for AI applications geared toward human understanding … [and] form the foundation of applications that can comprehend what’s happening in videos on a more granular level.”

“We are just starting to scratch the surface of self-supervised learning,” said Facebook AI director Geoffrey Zweig. “There’s lots to do to build upon the models that we use, and we want to do so with speed and at scale for broad applicability,” he added.