Researchers Create AI Technique to Generate Video Captions

Researchers at Microsoft Research Asia and the Harbin Institute of Technology have developed a new technique that uses artificial intelligence to generate live video comments. Earlier approaches relied on encoder-decoder models but did not model the interaction between videos and comments, so they produced mostly irrelevant comments. The new technique, based on a model that iteratively learns to capture the representations of audio, video and comments, outperforms current methods, according to the research team.

VentureBeat reports that the system, which is based on Google’s Transformer architecture, “matches the most relevant comments with videos from a candidate set so that it jointly learns cross-modal representations.”

Transformers, as with all neural networks, contain “functions (neurons) arranged in layers that transmit signals from data and slowly adjust the connections’ strength (weights) … [and] uniquely … have attention, which means that every output element is connected to every input element, and the weightings between them are calculated dynamically.”
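
To make the attention idea concrete, here is a minimal sketch of scaled dot-product attention, the standard formulation used in Transformers, written in plain Python. It is an illustration only and not the researchers' code; the function names `softmax` and `attention` are chosen here for the example.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Every output is a weighted mix of all value vectors, and the
    weights are computed on the fly from query-key similarity.
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights
```

When a query resembles one key more than the others, that key's value receives a larger weight in the output, which is what "calculated dynamically" refers to.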

The automatic live commenting system is composed of “an encoder layer that converts different modalities of a video and a candidate comment into vectors (i.e., mathematical representations); a matching layer that learns the representation for each modality; and a prediction layer that outputs a score measuring the matching degree between a video clip and a comment.”
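
The three stages described above can be pictured with a toy pipeline. The sketch below is not the researchers' model: every function is a hypothetical stand-in (simple normalization for the encoder, a similarity-weighted refinement for the matching layer, and a sigmoid over average similarity for the prediction layer), and the inputs are assumed to be small, pre-extracted feature lists of equal length.

```python
import math

def encode(features):
    """Encoder stand-in: scale a raw feature list to a unit-length vector."""
    norm = math.sqrt(sum(x * x for x in features)) or 1.0
    return [x / norm for x in features]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def match(modalities, comment, rounds=2):
    """Matching stand-in: over a few rounds, pull each modality vector
    toward the comment in proportion to how similar they already are."""
    reps = dict(modalities)
    for _ in range(rounds):
        refined = {}
        for name, vec in reps.items():
            w = max(dot(vec, comment), 0.0)
            refined[name] = encode([v + w * c for v, c in zip(vec, comment)])
        reps = refined
    return reps

def predict(reps, comment):
    """Prediction stand-in: average similarity squashed into (0, 1)."""
    avg = sum(dot(vec, comment) for vec in reps.values()) / len(reps)
    return 1.0 / (1.0 + math.exp(-avg))

def score(visual, audio, context, comment_features):
    """Run encode -> match -> predict for one clip and one candidate."""
    comment = encode(comment_features)
    modalities = {
        "visual": encode(visual),
        "audio": encode(audio),
        "context": encode(context),
    }
    return predict(match(modalities, comment), comment)
```

A candidate whose vector points the same way as the clip's modalities scores higher than one that points away, which is the behavior the matching degree is meant to capture.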

With a video and time-stamp, “the model aims to select a comment from a candidate set that is most relevant to the video clip near the time-stamp based on the surrounding comments, the visual part, and the audio part.” For the visuals, the system “samples video frames near the time-stamp.”
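
The selection step can be sketched as follows. The window sizes, the one-frame-per-second sampling rate, and the helper names are assumptions made for illustration; the article does not state the values the researchers used, and `score_fn` stands in for whatever model produces the matching score.

```python
def sample_frames(timestamp, window=2.0, fps=1.0, duration=None):
    """Return frame times (in seconds) within +/- window of the time-stamp."""
    step = 1.0 / fps
    start = max(0.0, timestamp - window)
    end = timestamp + window
    if duration is not None:
        end = min(end, duration)
    count = int((end - start) / step + 1e-9)
    return [round(start + i * step, 3) for i in range(count + 1)]

def surrounding_comments(comments, timestamp, window=5.0):
    """Keep the text of (time, text) comments posted near the time-stamp."""
    return [text for time, text in comments if abs(time - timestamp) <= window]

def select_comment(candidates, score_fn):
    """Pick the candidate with the highest matching score."""
    return max(candidates, key=score_fn)
```

For example, `sample_frames(10.0)` yields the frame times from 8 to 12 seconds, and `select_comment` returns whichever candidate the supplied scoring function rates highest.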

The system was evaluated with a “video-comment data set containing 2,361 videos and 895,929 comments, collected from the Chinese video streaming platform Bilibili.” Researchers “constructed a candidate comments set in which each video clip contained 100 comments comprising the ground-truth comments, top 20 popular comments, and random selected comments … [and] the model outperformed several baselines in terms of several measures, including relevance and correctness.”
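
A candidate set of this shape could be assembled roughly as shown below. This is a sketch under assumptions: the helper names, the de-duplication, and the fixed random seed are illustrative, and the two ranking measures (recall at k and reciprocal rank) are common choices for this kind of task rather than measures named in the article.

```python
import random

def build_candidates(ground_truth, popular, pool, size=100, top_popular=20, seed=0):
    """Combine ground-truth, top popular, and randomly selected comments."""
    rng = random.Random(seed)
    candidates = list(dict.fromkeys(ground_truth))  # de-duplicate, keep order
    seen = set(candidates)
    for comment in popular[:top_popular]:
        if comment not in seen:
            candidates.append(comment)
            seen.add(comment)
    remaining = [c for c in dict.fromkeys(pool) if c not in seen]
    need = max(0, size - len(candidates))
    candidates.extend(rng.sample(remaining, min(need, len(remaining))))
    return candidates

def recall_at_k(ranked, ground_truth, k):
    """1.0 if any ground-truth comment appears in the top k, else 0.0."""
    return 1.0 if any(c in ground_truth for c in ranked[:k]) else 0.0

def reciprocal_rank(ranked, ground_truth):
    """1 / position of the first ground-truth comment, or 0.0 if absent."""
    for position, comment in enumerate(ranked, start=1):
        if comment in ground_truth:
            return 1.0 / position
    return 0.0
```

Averaging these per-clip values over a test set gives a single number for comparing a model against baselines.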

The researchers noted that “the multimodal pre-training will be a promising direction to explore, where tasks like image captioning and video captioning will benefit from pre-trained models.” They added that “for future research, we will further investigate the multimodal interactions among vision, audio, and text in … real-world applications.”

The research is available in a preprint paper, and the system’s code is available on GitHub.