Google and Amazon Use AI to Improve Speech Recognition

Google’s artificial intelligence researchers made an unexpected discovery with the company’s new SpecAugment data augmentation model for automatic speech recognition. Rather than augmenting input audio waveforms, SpecAugment applies augmentation directly to the audio spectrogram. The researchers discovered, to their surprise, that models trained with SpecAugment outperformed all other speech recognition methods, even without a language model. Amazon also revealed research on improving Alexa’s speech recognition by 15 percent.
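SpecAugment’s core idea is easy to illustrate: instead of perturbing the waveform, mask random bands of the spectrogram itself. Below is a minimal sketch, assuming a 2-D log-mel spectrogram array; the mask widths and single-mask policy are illustrative choices, not the paper’s exact hyperparameters (the full policy also includes time warping, omitted here).

```python
import numpy as np

def spec_augment(spec, freq_mask_width=10, time_mask_width=20, rng=None):
    """Zero out one random frequency band and one random time span.

    spec: array of shape (num_mel_bins, num_frames).
    Returns an augmented copy; the input is left untouched.
    """
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_bins, n_frames = spec.shape

    # Frequency masking: hide a random band of mel bins.
    f = rng.integers(0, freq_mask_width + 1)
    f0 = rng.integers(0, n_bins - f + 1)
    spec[f0:f0 + f, :] = 0.0

    # Time masking: hide a random span of frames.
    t = rng.integers(0, time_mask_width + 1)
    t0 = rng.integers(0, n_frames - t + 1)
    spec[:, t0:t0 + t] = 0.0
    return spec
```

Because the masking happens on the spectrogram, the same augmentation pipeline works regardless of how the original audio was recorded or encoded.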

VentureBeat reports that Google AI resident Daniel S. Park and research scientist William Chan wrote in a blog post that “while our networks still benefit from adding a language model, our results are encouraging in that it suggests the possibility of training networks that can be used for practical purposes without the aid of a language model.”

Automatic speech recognition (ASR) is used to transcribe speech into text for Google Assistant in Home smart speakers and Android’s Gboard dictation tool.

According to a PricewaterhouseCoopers survey in 2018, “reductions in word error rates can be a key factor in conversational AI adoption rates.” VB also notes that “advances in language models and compute power have driven reductions in word error rates that in recent years … have made typing with your voice faster than your thumbs.”
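The word error rate mentioned above is the standard ASR metric: the word-level edit distance between a reference transcript and the system’s hypothesis, divided by the reference length. A minimal sketch (a plain Levenshtein computation, not any vendor’s scoring tool):

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# Two substitutions over a four-word reference → WER of 0.5
word_error_rate("turn on the lights", "turn off the light")
```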

Elsewhere, VB reports that Amazon published a research paper on “End-to-End Anchored Speech Recognition” that describes a way to isolate noise and thus “improve the assistant’s ability to recognize speech by 15 percent.”

Alexa AI senior applied scientist Xin Fan explains, “One of the ways that we’re always trying to improve Alexa’s performance is by teaching her to ignore speech that isn’t intended for her. We assume that the speaker who activates an Alexa-enabled device by uttering its ‘wake word’ — usually ‘Alexa’ — is the one Alexa should be listening to … Essentially, our technique takes an acoustic snapshot of the wake word and compares subsequent speech to it.”

The technique to isolate wake words was merged with a standard speech recognition model, and two variations were tested. In an experiment, researchers trained one AI model to “more explicitly” recognize wake words, by adding a component “that directly compared the wake word acoustics with those of subsequent speech and then by using the result as an input to a separate component that learned to mask bits of the encoder’s vector.”
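The comparison Fan describes can be pictured as a gate over the encoder’s frame vectors: frames that resemble the wake-word speaker pass through, others are suppressed. The sketch below uses a simple cosine-similarity gate as a hypothetical stand-in for Amazon’s learned masking component; the function and its inputs are illustrative, not taken from the paper.

```python
import numpy as np

def anchored_mask(wake_frames, speech_frames):
    """Down-weight frames that do not resemble the wake-word "anchor".

    wake_frames:   (n_wake, dim) encoder vectors for the wake word.
    speech_frames: (n_speech, dim) encoder vectors for subsequent speech.
    Returns gated speech frames of the same shape.
    """
    # Average the wake-word frames into a single anchor embedding.
    anchor = wake_frames.mean(axis=0)
    anchor = anchor / np.linalg.norm(anchor)

    # Cosine similarity of each subsequent frame to the anchor.
    norms = np.linalg.norm(speech_frames, axis=1, keepdims=True)
    sims = (speech_frames / norms) @ anchor

    # Squash similarities into a (0, 1) gate and apply it per frame.
    gate = 1.0 / (1.0 + np.exp(-5.0 * sims))
    return speech_frames * gate[:, None]
```

In the paper’s actual system the gate is learned jointly with the recognizer rather than fixed, which is what allows it to mask “bits of the encoder’s vector” selectively.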

That variation performed worse, however, reducing the error rate by only 13 percent rather than 15 percent. The research team “speculates that this is because its masking decisions were based solely on the state of the encoder network” and plans for future masking to consider the decoder as well.