Google and IBM Create Advanced Text-to-Speech Systems

Both IBM and Google recently advanced the development of text-to-speech (TTS) systems that create high-quality digital speech. OpenAI found that, since 2012, the compute used in the largest AI training runs has grown more than 300,000-fold. IBM created a much less compute-intensive model for speech synthesis, stating that it can synthesize speech in real time and adapt to new speaking styles with little data. Google and Imperial College London created a generative adversarial network (GAN) to produce high-quality synthetic speech.

VentureBeat reports that IBM researchers Zvi Kons, Slava Shechtman and Alex Sorin, who presented a paper at Interspeech 2019, noted that “to produce this high-quality speech, most TTS systems depend on large and complex neural network models that are difficult to train and do not allow real-time speech synthesis, even when leveraging GPUs.”

The IBM team created a new system that “consists of three interconnected parts: a prosody feature predictor, an acoustic feature predictor, and a neural vocoder.”

Prosody prediction “learns the duration, pitch, and energy of speech samples, toward the goal of better representing a speaker’s style … [and] the acoustic feature production … creates representations of the speaker’s voice in the training or adaptation data, while the vocoder generates speech samples from the acoustic features.”
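The flow of data through the three parts can be sketched in a few lines of Python. This is only an illustration of how the stages hand off to one another, not IBM's implementation: every function name, the fixed prosody values, the 5 ms frame length and the 16 kHz sample rate are assumptions made for the example, and each stage is a stub standing in for a trained neural network.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prosody:
    duration: float  # seconds
    pitch: float     # Hz
    energy: float    # arbitrary units


def predict_prosody(phonemes: List[str]) -> List[Prosody]:
    # Stub: a real predictor would be a trained network; fixed values are assumed here.
    return [Prosody(duration=0.08, pitch=120.0, energy=1.0) for _ in phonemes]


def predict_acoustic_features(phonemes: List[str], prosody: List[Prosody],
                              frame_s: float = 0.005) -> List[List[float]]:
    # Stub: expand each phoneme into frames according to its predicted duration.
    frames: List[List[float]] = []
    for _, p in zip(phonemes, prosody):
        n_frames = max(1, round(p.duration / frame_s))
        frames.extend([[p.pitch, p.energy] for _ in range(n_frames)])
    return frames


def vocode(frames: List[List[float]], sample_rate: int = 16_000,
           frame_s: float = 0.005) -> List[float]:
    # Stub: a neural vocoder would generate a waveform; silence of the right length is returned.
    samples_per_frame = round(sample_rate * frame_s)
    return [0.0] * (len(frames) * samples_per_frame)


def synthesize(phonemes: List[str]) -> List[float]:
    prosody = predict_prosody(phonemes)                      # stage 1: duration, pitch, energy
    frames = predict_acoustic_features(phonemes, prosody)    # stage 2: acoustic features
    return vocode(frames)                                    # stage 3: speech samples


audio = synthesize(["h", "i"])
print(len(audio), "samples")
```

Separating the stages this way is what makes speaker adaptation plausible with little data: in principle each stage can be retrained on the target speaker without rebuilding the whole system.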

Working together, the components can “adapt synthesized voice to a target speaker via retraining, based on a small amount of data from the target speaker.” In a test, the system adapted to a new voice with as little as five minutes of speech from the target speaker. The research is “the basis for IBM’s new Watson TTS service.”

Elsewhere, VentureBeat reports that Google and Imperial College London stated that their GAN-based TTS “not only generates high-fidelity speech with ‘naturalness’ but that it’s highly parallelizable, meaning it’s more easily trained across multiple machines compared with conventional alternatives.” The researchers noted that the limitation of current TTS models is “that they are difficult to parallelize over time: they predict each time step of an audio signal in sequence, which is computationally expensive and often impractical.”

“A lot of recent research on neural models for TTS has focused on improving parallelism by predicting multiple time steps in parallel,” they added. “An alternative approach for parallel waveform generation would be to use generative adversarial networks.”

Their proposed system “consists of a convolutional neural network that learned to produce raw audio by training on a corpus of speech with 567 encoded phonetic, duration, and pitch data.” They sampled “44 hours of two-second windows together with the corresponding linguistic features computed for five-millisecond windows” in order to “enable the model to generate sentences of arbitrary length.”
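To make the window arithmetic concrete, the following Python sketch crops one aligned training window from a longer utterance. The two-second window and five-millisecond feature step come from the article; the 24 kHz sample rate, the function name and the toy data are assumptions for illustration only and are not taken from the researchers' code.

```python
import random

SAMPLE_RATE = 24_000   # assumption: the article does not state the audio sample rate
WINDOW_S = 2           # two-second training windows (from the article)
FRAME_MS = 5           # linguistic features every five milliseconds (from the article)

SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000           # audio samples per feature frame
FRAMES_PER_WINDOW = WINDOW_S * 1000 // FRAME_MS              # feature frames per window
SAMPLES_PER_WINDOW = FRAMES_PER_WINDOW * SAMPLES_PER_FRAME   # audio samples per window


def sample_window(audio, features, rng):
    """Crop one aligned (audio, features) window at a random frame boundary."""
    max_start = len(features) - FRAMES_PER_WINDOW
    if max_start < 0:
        raise ValueError("utterance is shorter than one training window")
    start = rng.randrange(max_start + 1)
    feature_window = features[start:start + FRAMES_PER_WINDOW]
    audio_start = start * SAMPLES_PER_FRAME
    audio_window = audio[audio_start:audio_start + SAMPLES_PER_WINDOW]
    return audio_window, feature_window


# Toy three-second utterance: silent audio plus one placeholder feature per frame.
features = list(range(600))
audio = [0.0] * (len(features) * SAMPLES_PER_FRAME)

audio_window, feature_window = sample_window(audio, features, random.Random(0))
print(len(audio_window), len(feature_window))
```

Because the network is convolutional, training on fixed-length crops like these does not tie it to two-second outputs, which is how the quoted passage can speak of generating sentences of arbitrary length.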

The researchers “evaluated GAN-TTS’ performance on a set of 1,000 sentences, first with human evaluators.” In the end, the best-performing model, which was trained for as many as 1 million steps, achieved scores comparable to baselines while requiring only 0.64 MFLOPs (millions of floating-point operations) per sample, versus 1.97 MFLOPs per sample for WaveNet.
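A quick calculation puts the two per-sample figures side by side. The numbers are the ones quoted above; the variable names are only for illustration.

```python
GAN_TTS_MFLOPS_PER_SAMPLE = 0.64   # from the article
WAVENET_MFLOPS_PER_SAMPLE = 1.97   # from the article

ratio = WAVENET_MFLOPS_PER_SAMPLE / GAN_TTS_MFLOPS_PER_SAMPLE
saving = 1 - GAN_TTS_MFLOPS_PER_SAMPLE / WAVENET_MFLOPS_PER_SAMPLE

print(f"WaveNet needs about {ratio:.2f}x the operations per sample")
print(f"GAN-TTS uses about {saving:.0%} fewer operations per sample")
```

On these figures, GAN-TTS needs roughly a third of the arithmetic WaveNet does for each audio sample it generates.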