Facebook Reveals New AI-Powered Text-to-Speech System

Facebook has introduced an AI text-to-speech (TTS) system that produces a second of audio in 500 milliseconds. According to Facebook, the system, paired with a new approach to data collection, enabled the creation of a British-accented voice in six months, versus the year or more required for other voices. The TTS system now powers Facebook’s Portal smart displays. It runs in real time on ordinary processors and is also available as a service for other applications, including Facebook’s VR products.

VentureBeat reports that “most modern AI TTS systems require graphics cards, field-programmable gate arrays (FPGAs), or custom-designed AI chips like Google’s tensor processing units (TPUs) to run, train, or both.” Such specialized hardware, like the 32 TPUs one Google AI system runs in parallel, is expensive. Facebook’s TTS system, however, doesn’t rely on such hardware; the company reported that “its system attained a 160 times speedup compared with a baseline, making it fit for computationally constrained devices.”

The system has four parts: “a linguistic front-end, a prosody model, an acoustic model, and a neural vocoder.” The first “converts text into a sequence of linguistic features, such as sentence type and phonemes,” while the prosody model “draws on the linguistic features, style, speaker, and language embeddings … to predict sentences’ speech-level rhythms and their frame-level fundamental frequencies.”
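The first two stages can be sketched in miniature. The phoneme table, feature names, and conditioning scheme below are invented for illustration and are not Facebook’s actual implementation:

```python
# Toy sketch of a linguistic front-end and prosody model.
# The lexicon and numeric constants are illustrative assumptions.

TOY_LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def linguistic_frontend(text):
    """Convert text into a sequence of linguistic features: here, phonemes
    plus a crude sentence-type flag."""
    sentence_type = "question" if text.strip().endswith("?") else "statement"
    phonemes = []
    for word in text.lower().strip("?!. ").split():
        phonemes.extend(TOY_LEXICON.get(word, ["UNK"]))
    return {"phonemes": phonemes, "sentence_type": sentence_type}

def prosody_model(features, speaker_embedding):
    """Predict a (fake) fundamental-frequency contour per phoneme,
    conditioned on a speaker embedding; questions get a rising contour."""
    base_f0 = 100.0 + 10.0 * sum(speaker_embedding)
    rise = 20.0 if features["sentence_type"] == "question" else 0.0
    n = len(features["phonemes"])
    return [base_f0 + rise * i / max(1, n - 1) for i in range(n)]

feats = linguistic_frontend("hello world?")
f0 = prosody_model(feats, speaker_embedding=[0.1, 0.2])
```

The point of the split is the same as in the real system: symbolic features come out of the front-end, and per-frame prosodic targets come out of the prosody model.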

The acoustic model “leverages a conditional architecture to make predictions based on spectral inputs, or specific frequency-based features … [which] enables it to focus on information packed into neighboring frames.”
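One common way to let a model “focus on information packed into neighboring frames” is a small convolution across time; the NumPy sketch below shows that idea and is an assumption, not Facebook’s architecture:

```python
import numpy as np

def conv_over_frames(spectral, kernel):
    """Slide a small kernel across time so each output frame mixes
    information from its neighboring frames (receptive field = len(kernel))."""
    n_frames, _ = spectral.shape
    k = len(kernel)
    pad = k // 2
    padded = np.pad(spectral, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(spectral)
    for t in range(n_frames):
        # weighted sum of the k frames centered on t
        out[t] = sum(kernel[j] * padded[t + j] for j in range(k))
    return out

frames = np.ones((5, 3))          # 5 spectral frames, 3 frequency bins
smoothed = conv_over_frames(frames, kernel=[0.25, 0.5, 0.25])
```

Because the kernel weights sum to 1, constant input passes through unchanged; with real spectral features, each output frame blends its immediate temporal neighbors.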

Further, Facebook trained a “lighter and smaller vocoder,” which consists of “a submodel that upsamples … the input feature encodings from frame rate (187 predictions per second) to sample rate (24,000 predictions per second).”
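Those two rates imply each conditioning frame must cover roughly 24,000 / 187 ≈ 128 audio samples. A minimal sketch of such upsampling, using nearest-frame repetition (real vocoders typically use learned or interpolated upsampling):

```python
import numpy as np

FRAME_RATE = 187      # conditioning predictions per second
SAMPLE_RATE = 24_000  # audio samples per second

def upsample(frame_features, n_samples):
    """Stretch frame-rate feature encodings to sample rate by mapping each
    output sample back to its nearest source frame."""
    frame_features = np.asarray(frame_features)
    n_frames = len(frame_features)
    idx = np.minimum((np.arange(n_samples) * n_frames) // n_samples,
                     n_frames - 1)
    return frame_features[idx]

# one second of conditioning: 187 frames stretched across 24,000 samples
one_second = upsample(np.arange(FRAME_RATE, dtype=float), SAMPLE_RATE)
```

After upsampling, every audio sample has an aligned conditioning vector, which is what the sample-level submodel consumes next.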

A second submodel “similar to DeepMind’s WaveRNN speech synthesis algorithm generates audio a sample at a time at a rate of 24,000 samples per second.” The vocoder requires that samples be synthesized in sequential order, which “makes real-time voice synthesis a major challenge.”
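The sequential constraint can be seen in a toy autoregressive loop: each sample depends on the previous one, so generation cannot be parallelized across time. The “model” here is a one-line stand-in, not WaveRNN:

```python
def autoregressive_vocoder(conditioning, step):
    """Generate samples one at a time; each new sample depends on the last,
    so the loop must run in sequential order. At 24,000 samples per second,
    this dependency is what makes real-time synthesis a challenge."""
    samples = []
    prev = 0.0
    for c in conditioning:
        prev = step(prev, c)   # next sample is a function of the previous one
        samples.append(prev)
    return samples

# toy "model": decay the previous sample and add the conditioning signal
out = autoregressive_vocoder([0.5, 0.25, 0.0],
                             step=lambda prev, c: 0.5 * prev + c)
```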

VentureBeat adds that “all models consist of neurons, which are layered, connected functions … [and] signals from input data travel from layer to layer and slowly ‘tune’ the output by adjusting the strength (weights) of each connection.”
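That description, layered connected functions whose weights tune the output, can be made concrete with a two-layer forward pass (weights and inputs here are arbitrary illustrative numbers):

```python
def dense_layer(inputs, weights, bias, activation):
    """One 'neuron' layer: a weighted sum of inputs plus a bias,
    passed through an activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def relu(z):
    return max(0.0, z)

# Signals travel from layer to layer; training would adjust the
# weights to tune the final output.
hidden = dense_layer([1.0, 2.0], weights=[0.5, -0.25], bias=0.1, activation=relu)
output = dense_layer([hidden], weights=[2.0], bias=0.0, activation=relu)
```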

Neural networks ingest “embeddings in the form of multidimensional arrays like scalars (single numbers), vectors (ordered arrays of scalars), and matrices (scalars arranged into one or more columns and one or more rows) … [as well as] a fourth entity type that encapsulates scalars, vectors, and matrices [that] adds in descriptions of valid linear transformations (or relations)” — i.e., tensors.
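In NumPy terms, these four entity types are simply arrays of increasing rank; a minimal illustration:

```python
import numpy as np

scalar = np.float64(3.0)                      # a single number
vector = np.array([1.0, 2.0, 3.0])            # an ordered array of scalars
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])   # scalars in rows and columns
tensor = np.zeros((2, 3, 4))                  # higher-rank container

dims = [np.ndim(scalar), vector.ndim, matrix.ndim, tensor.ndim]

# a matrix also describes a linear transformation acting on vectors:
transformed = matrix @ np.array([1.0, 0.0])
```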

Facebook’s future plans are to “use the TTS system and data collection method to add more accents, dialects, and languages beyond French, German, Italian, and Spanish to its portfolio” and also to make “the system even lighter and more efficient than it is currently so that it can run on smaller devices.” It is also “exploring features to make Portal’s voice respond with different speaking styles based on context.”