Google Scientists Generate Realistic Videos at Scale with AI

Google research scientists report that they have produced realistic frames from open source video data sets at scale. Neural networks can already generate complete videos from only a start and end frame, but the complexity, information density and randomness of video have made it too challenging to create realistic clips at scale. The scientists wrote that, to their knowledge, “this is the first promising application of video-generation models to videos of this complexity.” The systems are based on a neural architecture known as Transformers, as described in a Google Brain paper, and are autoregressive, “meaning they generate videos pixel by pixel.”

VentureBeat reports that the scientists, in a paper titled “Scaling Autoregressive Video Models” posted to a preprint server, wrote that their “[AI] models are capable of producing diverse and surprisingly realistic continuations on a subset of videos from Kinetics, a large scale action recognition data set of … videos exhibiting phenomena such as camera movement, complex object interactions, and diverse human movement.”

Deep neural networks, including Transformers, “contain neurons (functions) that transmit ‘signals’ from input data and slowly adjust the synaptic strength — weights — of each connection.” But Transformers also “have attention, such that every output element is connected to every input element and the weightings between them are calculated dynamically,” a feature that “enables the video-generating systems to efficiently model clips as 3D volumes — rather than sequences of still frames — and drives direct interactions between representations of the videos’ pixels across dimensions.”
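
To make the attention idea concrete, here is a minimal sketch in plain Python, not the researchers' implementation. It assumes a tiny toy clip, skips the learned query, key and value projections and the masking a real Transformer would use, and the function names are hypothetical. It flattens a video volume (time, height, width) into one list of pixels and lets every output pixel draw on every input pixel, with weights computed from the data itself.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Plain dot-product self-attention with no learned parameters.

    Each output is a weighted average of ALL inputs; the weights are
    computed on the fly from dot products between the inputs.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        output = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Toy clip (an assumption for illustration): 2 frames of 2x2 pixels,
# each pixel described by a 2-number feature vector.
video = [
    [[[float(t), float(y + x)] for x in range(2)] for y in range(2)]
    for t in range(2)
]

# Treat the clip as one 3D volume: flatten time, height and width together
# so pixels in different frames can interact directly.
flat = [pixel for frame in video for row in frame for pixel in row]
outputs, weights = self_attention(flat)
```

Because the clip is flattened before attention is applied, a pixel in the second frame can attend to a pixel in the first frame as easily as to its neighbor in the same frame, which is the sense in which the clip is treated as a volume rather than a sequence of stills.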

The Google researchers “combined the Transformer-derived architecture with approaches that generate images as sequences of smaller, sub-scaled image slices.” These slices are sub-sampled, lower-resolution versions of the video. Once a slice is generated, “the padding in the video is replaced with the generated output and the process is repeated for the next slice.”
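
The slicing step can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the stride values, the padding marker `PAD`, and the helper names `subscale_slices` and `write_slice` are hypothetical, and the "model" is replaced by copying in the true slice, whereas a real system would generate each slice conditioned on what has been filled in so far.

```python
def subscale_slices(video, st, sh, sw):
    """Split a video (frames -> rows -> pixels) into interleaved slices.

    Each slice keeps every st-th frame, sh-th row and sw-th column,
    starting from a different offset, so it is a lower-resolution
    sub-sampling of the full clip.
    """
    slices = {}
    for a in range(st):
        for b in range(sh):
            for c in range(sw):
                slices[(a, b, c)] = [
                    [row[c::sw] for row in frame[b::sh]]
                    for frame in video[a::st]
                ]
    return slices

def write_slice(canvas, piece, offset, strides):
    """Overwrite the padding in `canvas` at one slice's positions."""
    (a, b, c), (st, sh, sw) = offset, strides
    for i, frame in enumerate(piece):
        for j, row in enumerate(frame):
            for k, value in enumerate(row):
                canvas[a + i * st][b + j * sh][c + k * sw] = value

# Toy video: 4 frames of 4x4 pixels; each value encodes its (t, y, x) position.
video = [
    [[100 * t + 10 * y + x for x in range(4)] for y in range(4)]
    for t in range(4)
]
strides = (2, 2, 2)
slices = subscale_slices(video, *strides)

# Start from a fully padded canvas and fill it in one slice at a time.
PAD = -1
canvas = [[[PAD] * 4 for _ in range(4)] for _ in range(4)]
for offset in sorted(slices):
    # Stand-in for the model: here the known slice is simply copied in.
    write_slice(canvas, slices[offset], offset, strides)
```

With strides of 2 in each dimension the toy clip splits into eight slices of 2x2x2 pixels, and after the loop every padded position has been replaced, so the canvas matches the original clip.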

In their experiments, the Google researchers “modeled slices of four frames by first feeding their AI systems video from the BAIR Robot Pushing robot data set, which consists of roughly 40,000 training videos and 256 test videos showing a robotic arm pushing and grasping objects in a box … [and then] applied the models to down-sampled videos from the Kinetics-600 data set, a large-scale action-recognition corpus containing about 400,000 YouTube videos across 600 action classes.”

The team reported that videos generated for “limited subsets” such as cooking videos were “highly encouraging” and included “complex object interactions like steam and fire.”

“This marks a departure from the often very narrow domains discussed in the video generation literature to date, such as artificially generated videos of moving digits or shapes, or videos depicting natural, yet highly constrained environments, such as robot arms interacting with a small number of different objects with a fixed camera angle and background,” they wrote.