VideoPoet: Google Launches a Multimodal AI Video Generator

Google has unveiled a new large language model designed to advance video generation. VideoPoet is capable of text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio. “The leading video generation models are almost exclusively diffusion-based,” Google says, citing Imagen Video as an example. Google finds this counterintuitive, since “LLMs are widely recognized as the de facto standard due to their exceptional learning capabilities across various modalities.” VideoPoet eschews the diffusion approach of relying on separately trained components for each task in favor of integrating many video generation capabilities into a single LLM.

Google did this “by heavily ‘pre-training’ the VideoPoet LLM on 270 million videos and more than 1 billion text-and-image pairs from ‘the public Internet and other sources,’ and specifically, turning that data into text embeddings, visual tokens, and audio tokens, on which the AI model was ‘conditioned,’” VentureBeat writes, calling the results “pretty jaw-dropping, even in comparison to some of the state-of-the-art consumer-facing video generation models such as Runway and Pika.”
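To make that pipeline more concrete, here is one way such multimodal data could be flattened into a single token sequence for an autoregressive LLM. This is a minimal illustrative sketch only; the tokenizers, vocabulary sizes, and ID offsets below are assumptions made for the example, not VideoPoet’s actual components.

```python
# Illustrative sketch: flattening text, video and audio into one token
# sequence that an autoregressive LLM can be conditioned on.
# All tokenizers and vocabulary ranges here are hypothetical stand-ins.

from typing import List

# Hypothetical vocabulary layout: disjoint ID ranges per modality.
TEXT_VOCAB = 32_000      # assumed text token range [0, 32000)
VISUAL_VOCAB = 8_192     # assumed visual token range [32000, 40192)
AUDIO_VOCAB = 4_096      # assumed audio token range [40192, 44288)

VISUAL_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + VISUAL_VOCAB

def tokenize_text(prompt: str) -> List[int]:
    """Toy stand-in for a text tokenizer (hash words into the text range)."""
    return [hash(w) % TEXT_VOCAB for w in prompt.split()]

def tokenize_video(frames: List[List[int]]) -> List[int]:
    """Toy stand-in for a discrete visual tokenizer: each frame is assumed
    to already be a grid of codebook indices, flattened and offset into the
    visual ID range."""
    return [VISUAL_OFFSET + code for frame in frames for code in frame]

def tokenize_audio(codes: List[int]) -> List[int]:
    """Toy stand-in for a neural audio codec producing discrete codes."""
    return [AUDIO_OFFSET + c for c in codes]

def build_sequence(prompt: str, frames, audio_codes) -> List[int]:
    """Interleave all modalities into one sequence for the LLM."""
    return tokenize_text(prompt) + tokenize_video(frames) + tokenize_audio(audio_codes)

if __name__ == "__main__":
    seq = build_sequence(
        "a robot cat eating spaghetti",
        frames=[[1, 2, 3, 4], [5, 6, 7, 8]],   # two tiny "frames" of codebook IDs
        audio_codes=[10, 11, 12],
    )
    print(len(seq), seq[:8])
```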

Google Research explains in a preprint research paper that VideoPoet’s “input images can be animated to produce motion, and (optionally cropped or masked) video can be edited for inpainting or outpainting.” For stylization, the model “takes in a video representing the depth and optical flow, which represent the motion, and paints contents on top to produce the text-guided style.”
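In practice, that suggests a conditioning setup along the lines of the toy sketch below, in which depth and optical-flow tokens sit in the context alongside the style prompt while the model autoregressively predicts stylized video tokens. Every name here, including predict_next_token, is a hypothetical placeholder rather than VideoPoet’s actual interface.

```python
# Hedged sketch of the stylization setup described above: condition on depth
# and optical-flow tokens (the motion/structure) plus a style prompt, then
# autoregressively predict output video tokens.

import random
from typing import List

def stylize(depth_tokens: List[int], flow_tokens: List[int],
            style_prompt_tokens: List[int], num_output_tokens: int) -> List[int]:
    # The conditioning context: text prompt plus motion/structure signals.
    context = style_prompt_tokens + depth_tokens + flow_tokens
    output: List[int] = []
    for _ in range(num_output_tokens):
        # Stand-in for a forward pass of the LLM; a real model would score
        # the full context plus the output so far and sample the next token.
        next_token = predict_next_token(context + output)
        output.append(next_token)
    return output

def predict_next_token(sequence: List[int]) -> int:
    """Dummy sampler so the sketch runs; a real model call goes here."""
    return random.randrange(8_192)

if __name__ == "__main__":
    stylized = stylize(depth_tokens=[1, 2, 3], flow_tokens=[4, 5, 6],
                       style_prompt_tokens=[7, 8], num_output_tokens=16)
    print(stylized)
```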

The researchers believe their LLM approach may help address the problems AI has in generating realistic longer videos. Generated clips currently top out at about three minutes, and those running longer tend to be riddled with artifacts.

The diffusion-based methods of generating video “that are often considered the current top performers” usually start with a pre-trained image model, like Stable Diffusion, to produce “high-fidelity images for individual frames, and then fine-tune the model to improve temporal consistency across video frames,” the paper notes.
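For comparison, that two-stage recipe might look roughly like the following: a pre-trained image model produces individual frames, and a temporal objective is then used during fine-tuning to keep adjacent frames coherent. The functions are toy placeholders, not the API of Stable Diffusion or any specific video model.

```python
# Rough sketch of the two-stage diffusion recipe described in the quote:
# generate frames with a pre-trained image model, then fine-tune with a
# temporal objective so adjacent frames stay consistent.

from typing import List

def generate_frame(prompt: str, seed: int) -> List[float]:
    """Placeholder for a pre-trained text-to-image diffusion model."""
    return [float(seed)] * 4  # a tiny fake "frame"

def temporal_consistency_loss(frames: List[List[float]]) -> float:
    """Penalize large changes between adjacent frames (toy L2 difference)."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(f1, f2))
        for f1, f2 in zip(frames, frames[1:])
    )

if __name__ == "__main__":
    clip = [generate_frame("a walking tree", seed=i) for i in range(8)]
    # During fine-tuning, this loss would be minimized alongside the usual
    # denoising objective to improve frame-to-frame coherence.
    print("temporal loss:", temporal_consistency_loss(clip))
```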

Not insignificantly, VideoPoet was found “capable of zero-shot video generation,” the researchers observe, noting the visual LLM “begins to show an ability to handle new tasks that were not included in its training.” One example is “the ability to perform new editing tasks by sequentially chaining training tasks together.”
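The sketch below illustrates what such chaining could look like in code: the token output of one task becomes the conditioning input of the next, all served by the same model. The three task wrappers are hypothetical names invented for this example, not functions from a published VideoPoet API.

```python
# Illustrative zero-shot task chaining: feed the output tokens of one task
# in as the conditioning for the next. All functions are hypothetical
# stand-ins for calls to a single multitask LLM.

from typing import List

def run_task(task: str, tokens: List[int]) -> List[int]:
    """Stand-in for one conditioned generation pass of the LLM."""
    print(f"running {task} on {len(tokens)} tokens")
    return tokens + [hash(task) % 1000]  # append a fake "result" token

def animate_image(image_tokens: List[int]) -> List[int]:
    return run_task("image-to-video", image_tokens)

def stylize(video_tokens: List[int]) -> List[int]:
    return run_task("stylization", video_tokens)

def outpaint(video_tokens: List[int]) -> List[int]:
    return run_task("outpainting", video_tokens)

if __name__ == "__main__":
    # Chain: animate a still image, restyle the result, then widen the frame.
    result = outpaint(stylize(animate_image([1, 2, 3])))
    print(result)
```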

The VideoPoet site features numerous output samples. Though still quite short, the multimodal clips showcase detailed images of things like a robot cat eating spaghetti and a walking tree. The gallery includes a one-minute short called “Rookie the Raccoon” that points to a future of animation in which humans fine-tune and AI does the heavy lifting.
