New Alexa Speaking Style Created by Neural Text-to-Speech

Amazon is training Alexa to speak like a newscaster, a feature that will roll out in a few weeks. The speaking style is built on Amazon’s neural text-to-speech (NTTS) work. The voice doesn’t sound human, but it does stress words the way a TV or radio announcer would. Before creating it, Amazon conducted a survey showing that users prefer a newscaster style when listening to articles. The voice is also an example of “the next generation of speech synthesis,” based on machine learning.

The Verge, which provides audio samples of the new voice and Alexa’s standard voice style, notes that until now Alexa has used a tried-and-true method called concatenative speech synthesis, which “involves breaking up speech samples into distinct sounds (known as phonemes) and then stitching them back together to form new words and sentences.”
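The stitching idea described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the unit inventory, the phoneme symbols, and the sine tones standing in for recorded snippets are hypothetical, and production systems select among many recorded units rather than one per phoneme.

```python
import math

SAMPLE_RATE = 16000  # assumed sample rate for this toy example


def make_unit(freq_hz, duration_ms=50):
    """Stand-in for a recorded phoneme snippet: a short sine tone."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical inventory: phoneme symbol -> stored audio samples.
UNIT_INVENTORY = {
    "HH": make_unit(200),
    "AH": make_unit(300),
    "L": make_unit(400),
    "OW": make_unit(500),
}


def concatenate(phonemes, crossfade=80):
    """Stitch stored units together, linearly crossfading at each join."""
    out = []
    for symbol in phonemes:
        unit = UNIT_INVENTORY[symbol]
        if not out:
            out.extend(unit)
            continue
        # Blend the tail of the audio so far with the head of the next unit.
        for i in range(crossfade):
            w = (i + 1) / (crossfade + 1)
            out[-crossfade + i] = (1 - w) * out[-crossfade + i] + w * unit[i]
        out.extend(unit[crossfade:])
    return out


audio = concatenate(["HH", "AH", "L", "OW"])
```

Because each unit is stored audio, changing how a word is stressed would mean recording or selecting different units, which hints at why a new speaking style is hard to reach with this approach.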

Although this method can provide “surprisingly good results,” the newer machine learning-enabled method is taking over. Google relied on AI lab DeepMind to produce “a new form of speech synthesis for Google Assistant” last October.

The new voice relied on applying machine learning to “audio clips from real life news channels” to learn their speech patterns. “It’s difficult to describe these nuances precisely in words, and a data-driven approach can discover and generalize these more efficiently than a human,” said Amazon’s Trevor Wood. Amazon, which also created a whisper mode for Alexa, reported that “it only took a few hours of data to teach Alexa the newscaster speaking voice.”

On Amazon’s developer blog, the company noted that the success of teaching Alexa newscaster style in a few hours “paves the way for Alexa and other services to adopt different speaking styles in different contexts, improving customer experiences.” The blog also provides examples that compare “speech synthesized using concatenative synthesis, NTTS with standard neutral style, and NTTS with newscaster style.”

The blog added that, according to users, “synthetic speech produced by neural networks sounds much more natural than speech produced through concatenative methods.” Amazon also scored listener responses, reporting that ratings for neutral NTTS reduced the discrepancy between human and synthetic speech by 46 percent, whereas the NTTS newscaster style “[shrunk] the discrepancy by a further 35 percent.”
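The two percentages can be combined in more than one way, and the article does not say which is meant. The sketch below works through both readings; the function name and the choice of a baseline gap of 1.0 are illustrative assumptions, not figures from Amazon.

```python
def remaining_gap(first_cut, further_cut, compounding):
    """Fraction of the original human-vs-synthetic gap left after both cuts.

    compounding=True  assumes the further cut applies to what is left
                      after the first cut.
    compounding=False assumes both cuts are measured against the
                      original gap.
    """
    after_first = 1.0 - first_cut
    if compounding:
        return after_first * (1.0 - further_cut)
    return after_first - further_cut


# Reading 1: 35 percent of the already reduced gap (roughly 0.35 remains).
compounded = remaining_gap(0.46, 0.35, compounding=True)

# Reading 2: 35 percent of the original gap (roughly 0.19 remains).
additive = remaining_gap(0.46, 0.35, compounding=False)
```

Under either reading, the newscaster style is rated closer to human speech than neutral NTTS, which leaves 0.54 of the original gap.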

The NTTS system is made up of “a neural network that converts a sequence of phonemes into a sequence of ‘spectrograms’, or snapshots of the energy levels in different frequency bands” and a vocoder that “converts the spectrograms into a continuous audio signal.” The newscaster-style voice, the blog added, “would have been impossible with previous techniques based on concatenative synthesis.”
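To make the spectrogram idea concrete, here is a minimal sketch that measures the energy in a few frequency bands for successive slices of a signal. It is not Amazon's network or vocoder: in NTTS the spectrograms are predicted from phonemes, whereas this example only computes one from existing audio. The frame length, hop size, band count, and test tone are all assumed values.

```python
import cmath
import math


def frame_energies(frame, num_bands):
    """Energy per frequency band for one frame, using a naive DFT."""
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(acc) ** 2)
    band_size = half // num_bands
    return [sum(power[b * band_size:(b + 1) * band_size]) for b in range(num_bands)]


def spectrogram(signal, frame_len=64, hop=32, num_bands=4):
    """One row of band energies per overlapping frame of the signal."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    return [frame_energies(signal[s:s + frame_len], num_bands) for s in starts]


SAMPLE_RATE = 8000  # assumed; each band then spans 1000 Hz
tone = [math.sin(2 * math.pi * 1500 * i / SAMPLE_RATE) for i in range(256)]
spec = spectrogram(tone)

# Index of the strongest band in each frame; a 1500 Hz tone falls in the
# 1000-2000 Hz band, which is index 1.
loudest = [row.index(max(row)) for row in spec]
```

Each row of `spec` is one "snapshot" in the sense of the quote above; a vocoder performs the reverse step, turning a sequence of such snapshots back into a waveform.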