Disney, Rutgers Scientists Use AI to Generate Storyboards

Disney Research and Rutgers University scientists have created an end-to-end model that uses artificial intelligence to produce a storyboard and video from the text of movie screenplays. This kind of text-to-animation model is not new, but this research advances the state of the art by producing animations without annotated data or pre-training. The researchers wrote that the system is “capable of handling complex sentences” and is intended to make creatives’ work “more efficient and less tedious.”

VentureBeat reports that, in the paper, which was published on the preprint server arXiv.org, the researchers noted that the applications of “automatically generating animation from natural language text” include screenwriting, instructional videos and public safety, “enabling faster iteration, prototyping and proof of concept for content creators.” Most text-to-animation tools aren’t able to handle complex sentences because “neither the input sentences nor the output animations have a fixed structure.”

To solve the problem, the researchers constructed a modular neural network that included “a novel script-parsing module that automatically isolates relevant text from scene descriptions in screenplays; a natural language processing module that simplifies complex sentences using a set of linguistic rules and extracts information from the simplified sentences into predefined action representations; and an animation generation model that translates said representations into animation sequences.”

The system first “determines if a given snippet contains a particular syntactic structure and subsequently splits and assembles it into simpler sentences, recursively processing it until no further simplification is possible,” then applies a coordination step and a lexical simplifier, which “matches actions in the simplified sentences with 52 animations (expanded to 92 by a dictionary of synonyms) in a predefined library.”
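
The two steps described above, recursive simplification followed by lexical matching, can be sketched in Python. This is a toy illustration and not the researchers' implementation: the single splitting rule (on " and "), the three-entry animation dictionary, the crude suffix stripping, and the function names simplify and map_to_animation are all assumptions made for the example. The system in the paper uses a larger set of linguistic rules and a library of 52 animations.

```python
from typing import List, Optional

# Hypothetical animation library: canonical animation -> synonym verbs.
# The real library has 52 animations; these entries are invented.
ANIMATION_SYNONYMS = {
    "look": {"watch", "see", "stare"},
    "walk": {"stroll", "stride"},
    "sit": {"perch"},
}

# Flatten the dictionary into a verb -> animation lookup table.
VERB_TO_ANIMATION = {}
for animation, synonyms in ANIMATION_SYNONYMS.items():
    VERB_TO_ANIMATION[animation] = animation
    for synonym in synonyms:
        VERB_TO_ANIMATION[synonym] = animation


def simplify(sentence: str) -> List[str]:
    """Recursively split a sentence on ' and ' until no split applies."""
    sentence = sentence.strip().rstrip(".")
    left, sep, right = sentence.partition(" and ")
    if not sep:
        return [sentence]  # base case: nothing left to simplify
    # Assumption: the first word of the left clause is the subject,
    # and it is reused when the right clause does not name its own.
    subject = left.split()[0]
    if not right[:1].isupper():
        right = f"{subject} {right}"
    return simplify(left) + simplify(right)


def map_to_animation(simple_sentence: str) -> Optional[str]:
    """Match the verb (assumed to be the second word) to an animation."""
    words = simple_sentence.lower().split()
    if len(words) < 2:
        return None
    verb = words[1]
    # Crude stand-in for lemmatization: try stripping "es" and "s".
    candidates = [verb]
    if verb.endswith("es"):
        candidates.append(verb[:-2])
    if verb.endswith("s"):
        candidates.append(verb[:-1])
    for candidate in candidates:
        if candidate in VERB_TO_ANIMATION:
            return VERB_TO_ANIMATION[candidate]
    return None  # no matching animation in the library


parts = simplify("Mary walks to the window and watches the street and sits.")
actions = [map_to_animation(part) for part in parts]
```

Here parts holds three simple sentences ("Mary walks to the window", "Mary watches the street", "Mary sits") and actions holds the matched animations ("walk", "look", "sit"). A verb missing from the dictionary returns None.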

The content is fed to the pipeline, dubbed Cardinal, which turns the actions into previsualizations in the Unreal video game engine. The system has been trained with “scene descriptions from 996 screenplays drawn from over 1,000 scripts scraped from freely available sources including IMSDb, SimplyScripts, and ScriptORama5.” The resulting corpus comprises “525,708 descriptions containing 1,402,864 sentences, 920,817 (over 40 percent) of which had at least one action verb.”
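
As a quick arithmetic check on the quoted figures (a reader's calculation, not something taken from the paper), the share of sentences with at least one action verb works out to roughly 65.6 percent, which is consistent with "over 40 percent" but considerably higher. The variable names below are chosen for illustration only.

```python
# Figures as quoted in the article.
total_sentences = 1_402_864
action_verb_sentences = 920_817

# Share of sentences containing at least one action verb, in percent.
share_percent = action_verb_sentences / total_sentences * 100
rounded_share = round(share_percent, 1)
```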

Twenty-two participants evaluated 20 animations on a five-point scale, and 68 percent said the system created “reasonable” animation from the screenplays in question. The researchers did, however, acknowledge that the system has flaws: “its list of actions and objects isn’t exhaustive, and occasionally, the lexical simplification fails to map verbs (like ‘watches’) to similar animations (‘look’) or creates only a few simplified sentences for a verb that has many subjects in the original sentence.”

The research team is continuing its work and intends to eventually “leverage discourse information by considering the sequence of actions which are described in the text.”