Google’s AutoFlip for Automated AI-Enabled Video Reframing

Google has unveiled AutoFlip, an open source, AI-enabled tool for automated video reframing. Much video is captured in landscape aspect ratios such as 16:9 and 4:3, which are poorly suited to other displays, vertical screens in particular. The traditional approach is to statically crop whatever doesn’t fit the destination device, which usually gives an unsatisfactory result. AutoFlip instead relies on AI object detection and tracking to understand the video content before reframing it.

VentureBeat reports that Google Research senior software engineers Nathan Frey and Zheng Sun said they were “excited to release this tool directly to developers and filmmakers, reducing the barriers to their design creativity and reach through the automation of video editing.”

“The ability to adapt any video format to various aspect ratios is becoming increasingly important as the diversity of devices for video content consumption continues to rapidly increase,” they added.

AutoFlip “detects changes in the composition that signify scene changes in order to isolate scenes for processing.” It finds those changes by computing the color histogram of each frame and comparing it to the histograms of prior frames. For each shot, it relies on “video analysis to identify salient content before reframing the scene, chiefly by selecting an optimized camera mode and path.” AutoFlip then “buffers the video until the scene is complete before making reframing decisions in order to optimize the reframing for the entire scene.”
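To make the histogram comparison concrete, here is a minimal sketch in plain Python. It is not AutoFlip's code: the function names, the bin count, the distance measure and the threshold are all assumptions chosen for illustration, and it compares each frame only to the one immediately before it, which is a simplification of "prior frames."

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (r, g, b), each 0..255


def color_histogram(frame: Sequence[Pixel], bins: int = 4) -> List[float]:
    """Return a normalized RGB histogram with bins**3 buckets (assumed layout)."""
    hist = [0.0] * (bins ** 3)
    for r, g, b in frame:
        # Map each 0..255 channel value to a bucket index in 0..bins-1.
        ri, gi, bi = r * bins // 256, g * bins // 256, b * bins // 256
        hist[(ri * bins + gi) * bins + bi] += 1.0
    total = float(len(frame)) or 1.0
    return [count / total for count in hist]


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Half the L1 distance: 0.0 for identical histograms, 1.0 for disjoint ones."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_scene_cuts(frames: Sequence[Sequence[Pixel]],
                      threshold: float = 0.5) -> List[int]:
    """Return indices of frames whose histogram differs sharply from the previous one.

    The threshold is a hypothetical tuning value, not one taken from AutoFlip.
    """
    cuts: List[int] = []
    previous = None
    for index, frame in enumerate(frames):
        current = color_histogram(frame)
        if previous is not None and histogram_distance(previous, current) > threshold:
            cuts.append(index)
        previous = current
    return cuts
```

Each frame here is a flat sequence of RGB tuples. A real pipeline would decode frames with a video library and would likely smooth the comparison over several frames to avoid false cuts on flashes or fast motion.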

AI-based object detection is used to “find interesting content in the frame, like people, animals, text overlays, logos, and motion … [and] face and object detection models are integrated with AutoFlip through MediaPipe, a framework that enables the development of pipelines for processing multimodal data, which uses Google’s TensorFlow Lite machine learning framework on processors.”

That structure, said Google, allows AutoFlip to be extensible “so developers can add detection algorithms for different use cases and video content.” AutoFlip chooses among stationary, panning or tracking techniques for reframing “depending on the way objects behave during the scene.”
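The article does not say how AutoFlip picks among the three techniques, so the following is only one plausible heuristic, sketched under stated assumptions: the subject's horizontal center is known for every frame as a fraction of frame width, a nearly motionless subject gets a stationary camera, roughly constant-velocity motion gets a pan, and anything else gets tracking. The function name and both tolerances are hypothetical.

```python
from typing import Sequence


def choose_camera_mode(xs: Sequence[float],
                       still_tol: float = 0.05,
                       linear_tol: float = 0.03) -> str:
    """Pick "stationary", "panning" or "tracking" from per-frame subject positions.

    xs holds the subject's horizontal center per frame, normalized to 0..1.
    Both tolerances are illustrative guesses, not AutoFlip parameters.
    """
    if not xs:
        raise ValueError("need at least one position")

    # Subject barely moves across the scene: a fixed camera is enough.
    if max(xs) - min(xs) <= still_tol:
        return "stationary"

    # Fit x = slope * t + intercept by least squares over frame indices.
    n = len(xs)
    mean_t = (n - 1) / 2.0
    mean_x = sum(xs) / n
    var_t = sum((t - mean_t) ** 2 for t in range(n))
    slope = sum((t - mean_t) * (x - mean_x) for t, x in enumerate(xs)) / var_t
    intercept = mean_x - slope * mean_t

    # If every position stays close to the fitted line, a constant-speed pan fits.
    worst = max(abs(x - (slope * t + intercept)) for t, x in enumerate(xs))
    return "panning" if worst <= linear_tol else "tracking"
```

For example, positions that drift steadily from 0.1 to 0.5 would be classed as a pan, while positions that jump back and forth across the frame would fall through to tracking.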

Depending on the reframing strategy chosen, “AutoFlip determines a cropping window for each frame while preserving the content of interest.” A configuration graph provides the reframing settings, and “if it becomes impossible to cover all the required region, the system will automatically switch to a less aggressive strategy by applying a letterbox effect,” using the background color, if the background is a solid color, to fill in the frame.
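The crop-or-letterbox decision can be illustrated with a short sketch. It assumes a landscape source being reframed to a narrower target, a required region given as a horizontal pixel interval, and a crop that always keeps the full source height. The `Reframe` type and `plan_reframe` function are hypothetical names, and the sketch only reports how much padding is needed; it does not pick a fill color.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reframe:
    mode: str        # "crop" or "letterbox"
    x: int           # left edge of the kept window, in source pixels
    width: int       # width of the kept window, in source pixels
    pad_top: int     # padding rows above the content (source-pixel units)
    pad_bottom: int  # padding rows below the content


def plan_reframe(frame_w: int, frame_h: int,
                 target_w: int, target_h: int,
                 req_left: int, req_right: int) -> Reframe:
    """Plan a full-height crop that covers [req_left, req_right), else letterbox."""
    if not 0 <= req_left < req_right <= frame_w:
        raise ValueError("required region must lie inside the frame")

    # Widest window with the target aspect ratio at full source height.
    crop_w = frame_h * target_w // target_h
    if crop_w > frame_w:
        raise ValueError("sketch assumes the target is narrower than the source")

    req_w = req_right - req_left
    if req_w <= crop_w:
        # Center the window on the required region, then clamp it to the frame.
        center = (req_left + req_right) // 2
        x = min(max(center - crop_w // 2, 0), frame_w - crop_w)
        return Reframe("crop", x, crop_w, 0, 0)

    # Required region is too wide: keep all of it and pad vertically instead.
    canvas_h = req_w * target_h // target_w
    pad = canvas_h - frame_h
    return Reframe("letterbox", req_left, req_w, pad // 2, pad - pad // 2)
```

With a 1920x1080 source and a 9:16 target, a required region spanning pixels 800 to 1000 fits inside a 607-pixel-wide crop, while a region spanning 100 to 1500 forces the letterbox path.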

In the future, said Frey and Sun, the focus will be on “improving AutoFlip’s ability to detect ‘objects relevant to the intent of the video,’ such as speaker detection for interviews or animated face detection on cartoons, and ensuring input video with overlays on the edges of the screen (such as text or logos) aren’t cropped from the view.”