Facebook’s VideoStory Relies on AI to Automate Storytelling

Facebook’s video clips get over 8 billion views a day on average, but people with poor Internet connections or with disabilities often cannot access them. That led Facebook to create VideoStory, which the company described in a research paper as “A Dataset for Telling the Stories of Social Media Videos.” The paper, to be delivered at the Conference on Empirical Methods in Natural Language Processing, noted that “automatically telling the stories using multi-sentence descriptions of videos would allow bridging this gap.”

VentureBeat reports that “to compile the dataset of 20,000 videos and 123,000 descriptive sentences, the team set out to find videos with ‘high engagement’ on social media,” meaning those with lots of comments and shares. Information from each video was then transformed into “detailed captions describing the sequence of events.”

The team created “annotated paragraphs describing objects, situations, and important details,” and linked sentences with time stamps for the videos, which ran from 20 to 180 seconds. The completed clips had “about five sentences on average, each aligned to roughly 18 seconds of footage.”
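To make the structure concrete, here is a minimal sketch of what one such time-aligned annotation might look like. The field names, values, and sample sentences below are purely illustrative assumptions, not the dataset's actual schema; only the averages (about five sentences per clip, roughly 18 seconds each, clip lengths of 20 to 180 seconds) come from the article.

```python
from dataclasses import dataclass

@dataclass
class CaptionSegment:
    start_sec: float   # timestamp where this sentence's segment begins
    end_sec: float     # timestamp where it ends
    sentence: str      # one sentence of the descriptive paragraph

@dataclass
class AnnotatedVideo:
    video_id: str
    duration_sec: float             # clips ran from 20 to 180 seconds
    segments: list[CaptionSegment]  # about five sentences on average

# A made-up example entry in this hypothetical schema.
example = AnnotatedVideo(
    video_id="v001",
    duration_sec=90.0,
    segments=[
        CaptionSegment(0.0, 18.0, "A dog runs onto a beach."),
        CaptionSegment(18.0, 36.0, "It chases a ball into the surf."),
        CaptionSegment(36.0, 54.0, "The owner throws the ball again."),
        CaptionSegment(54.0, 72.0, "The dog shakes water everywhere."),
        CaptionSegment(72.0, 90.0, "They walk off down the shore."),
    ],
)

# Each sentence covers roughly 18 seconds of footage, matching the
# averages quoted above.
avg_span = sum(s.end_sec - s.start_sec for s in example.segments) / len(example.segments)
print(avg_span)  # 18.0
```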

Next, the team trained an AI system on VideoStory to “caption videos automatically,” using 17,098 videos for training, 999 for validation, and 1,011 for testing. A recurrent neural network, an architecture typically used in natural language processing, described each segment of a video. To “ensure the overall system took into account correlations between past and future events, they incorporated context from each previous segment description with a second machine learning model.”
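The two-model arrangement can be sketched at a very high level: one model captions each segment, and the context step carries the previous segment's description into the next one. The functions below are toy stand-ins for illustration only, not Facebook's actual architecture; a real system would generate text from visual features with learned recurrent models.

```python
def caption_segment(segment_features, prev_description):
    # Stand-in for the recurrent captioner: a real model would decode a
    # sentence from visual features, conditioned on prior context.
    sentence = " ".join(segment_features)
    if prev_description:
        # The previous description shapes how this sentence is phrased.
        sentence = f"then, {sentence}"
    return sentence.capitalize() + "."

def caption_video(segments):
    descriptions = []
    prev = None  # no prior description exists for the first segment
    for features in segments:
        desc = caption_segment(features, prev)
        descriptions.append(desc)
        prev = desc  # the "context model's" job: fuse this into the next step
    return descriptions

story = caption_video([
    ["a", "dog", "runs", "to", "the", "water"],
    ["it", "fetches", "a", "ball"],
    ["the", "owner", "throws", "it", "again"],
])
print(" ".join(story))
# A dog runs to the water. Then, it fetches a ball. Then, the owner throws it again.
```

The point of the chaining is the one the researchers make: without carried-over context, each segment would be described in isolation, and the output would read as disconnected captions rather than a story.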

In testing, the team found that generated captions “weren’t consistently right … but the results demonstrated that the model, trained on the VideoStory dataset, benefited from the addition of contextual information.”

“High-quality video descriptions are more than bags of single-sentence captions; they should tell a coherent story,” they wrote. “[Our] evaluations show that our dataset is complementary to prior work due to more diverse topics and the selection of engaging videos which tell a story. Our VideoStory dataset can serve as a good benchmark to build models for story understanding and multi-sentence video description.”