April 22, 2019
Facebook is bringing back FMV (full motion video) games, which use pre-recorded video files to display action. Thanks to the work of Facebook AI Research scientists, the new FMV games are much improved, with a system that can extract controllable characters from real-world videos and then control their motion, thus generating new image sequences. Facebook AI Research scientists, in collaboration with Tel Aviv University, also unveiled a system that converts the audio of one singer into the voice of another, without supervision.
VentureBeat reports that the team that worked on FMV games used “two neural networks, or layers of mathematical functions modeled after biological neurons: Pose2Pose, a framework that maps a current pose and a single-instance control signal to the next pose, and Pose2Frame, which plops the current pose and new pose (along with a given background) on an output frame.”
A joystick or keyboard’s “low-dimensional” signal can control the “reanimation.” Researchers used three videos — a tennis player outdoors, a person swinging a sword indoors and a person walking — to train the system.
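The two-stage flow described above can be sketched in a few lines of Python. Everything here is illustrative, a minimal stand-in for the learned networks, and none of the names or representations come from Facebook's actual code: a Pose2Pose-like step maps the current pose plus a low-dimensional control signal (such as a joystick direction) to the next pose, and a Pose2Frame-like step composites that pose onto a background to produce the output frame.

```python
# Hypothetical sketch of the Pose2Pose / Pose2Frame pipeline.
# All names and data representations are assumptions for illustration,
# not Facebook AI Research's implementation.

from dataclasses import dataclass
from typing import List, Tuple

Pose = List[Tuple[float, float]]  # 2D joint coordinates of the character

@dataclass
class Frame:
    background: str
    pose: Pose

def pose2pose(current: Pose, control: Tuple[float, float]) -> Pose:
    """Map the current pose and a low-dimensional control signal
    (e.g. a joystick direction) to the next pose. Here this is just
    a translation; in the real system it is a learned network."""
    dx, dy = control
    return [(x + dx, y + dy) for x, y in current]

def pose2frame(current: Pose, next_pose: Pose, background: str) -> Frame:
    """Composite the new pose onto a given background to produce
    the next output frame."""
    return Frame(background=background, pose=next_pose)

# Driving the character one step to the right with a joystick-like signal:
pose = [(0.0, 0.0), (1.0, 1.0)]
frame = pose2frame(pose, pose2pose(pose, (0.5, 0.0)), "tennis_court")
# frame.pose is now [(0.5, 0.0), (1.5, 1.0)]
```

Looping this pair of steps, feeding each output pose back in as the next input, is what generates the new image sequences the researchers describe.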
“Each network addresses a computational problem not previously fully met, together paving the way for the generation of video games with realistic graphics,” said the researchers in a paper published on arXiv.org. “In addition, controllable characters extracted from YouTube-like videos can find their place in the virtual worlds and augmented realities.” Startup Promethean AI and Nvidia are also working on AI tools to assist game design.
Also reported in VB, the research on “Unsupervised Singing Voice Conversion” (also available on arXiv.org) described the creation of a system to convert one singer’s voice to another. The team stated that “their model was able to learn to convert between singers from just five to 30 minutes of their singing voices” due to “an innovative training scheme and data augmentation technique.”
“[Our approach] could lead, for example, to the ability to free oneself from some of the limitations of one’s own voice,” wrote the researchers. “The proposed network is not conditioned on the text or on the notes [and doesn’t] require parallel training data between the various singers, nor [does it] employ a transcript of the audio to either text … or to musical notes … While existing pitch correction methods … correct local pitch shifts, our work offers flexibility along the other voice characteristics.”
The method is based on WaveNet, “a Google-developed autoencoder (a type of AI used to learn representations for sets of data unsupervised) that generates models from the waveforms of audio recordings” and also includes back-translation, “which involves converting one data sample to a target sample (in this case, one singer’s voice to another) before translating it back and tweaking its next attempt if it doesn’t match the original.”
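The back-translation idea mentioned above can be shown with a toy example. This is a deliberately simplified sketch, with made-up per-singer "voice" offsets standing in for the learned conversion network; the real system operates on audio waveforms, not lists of numbers.

```python
# Toy illustration of back-translation: convert a sample toward a target
# singer, translate it back, and measure the round-trip error that
# training would push toward zero. The SHIFT offsets are a hypothetical
# stand-in for a learned voice-conversion model.

SHIFT = {"A": 0.0, "B": 2.0}  # stand-in "voice" offsets per singer

def convert(sample, singer):
    """Stand-in for the learned conversion network: move the sample
    toward the target singer's voice."""
    return [x + SHIFT[singer] for x in sample]

def invert(sample, singer):
    """Inverse conversion, translating the sample back."""
    return [x - SHIFT[singer] for x in sample]

def round_trip_error(sample, target):
    """How far the A -> B -> A round trip drifts from the original."""
    back = invert(convert(sample, target), target)
    return sum(abs(a - b) for a, b in zip(sample, back))

sample = [1.0, 2.0, 3.0]
err = round_trip_error(sample, "B")
# err is 0.0 here because the toy conversion is exactly invertible;
# in training, a nonzero error is the signal used to tweak the model.
```

The point of the technique is that it needs no parallel recordings of two singers performing the same song: the model supervises itself by checking how well it can undo its own conversions.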
The research team also used “virtual identities” to create synthetic samples closer to the source singer’s voice, and a “confusion network” to make sure the results were “singer-agnostic.”
The training was based on five singers with 10 songs chosen at random from Stanford’s Digital Archive of Mobile Performances (DAMP) and 12 singers with four songs each from the National University of Singapore’s Sung and Spoken Corpus (NUS-48E).