Apple Researchers Improving Accuracy of Virtual Assistant

Over 50 million people worldwide use Apple’s virtual assistant Siri. Apple, focused on improving Siri’s capabilities, has published research on improving voice trigger detection, speaker verification and language identification for multiple speakers. Apple researchers suggest training a single AI model for both automatic speech recognition and speaker recognition. Rather than approaching them as two independent tasks, the researchers showed that the tasks can actually help one another to “estimate both properties.”
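
The joint approach can be illustrated with a toy sketch (all names, dimensions and weights here are hypothetical, not Apple’s actual architecture): a single shared encoder is computed once per audio frame, and two lightweight output heads, one per task, reuse its output.

```python
import random

random.seed(0)

DIM = 8  # toy feature dimension (hypothetical)

def shared_encoder(audio_frame):
    """Shared representation computed once per frame (stand-in for a neural encoder)."""
    return [x * 0.5 + 0.1 for x in audio_frame]

def phonetic_head(features):
    """Task 1: score each of 3 toy phone classes from the shared features."""
    weights = [[0.2] * DIM, [0.5] * DIM, [0.9] * DIM]
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def speaker_head(features):
    """Task 2: produce a toy speaker embedding from the same shared features."""
    return [f * 2.0 for f in features[:4]]

frame = [random.random() for _ in range(DIM)]
feats = shared_encoder(frame)   # the expensive step runs once...
phones = phonetic_head(feats)   # ...then both task heads reuse it,
speaker = speaker_head(feats)   # which is where the savings come from
```

Because the heavy encoder runs only once, both tasks share its memory and compute cost, which is the practical benefit the researchers describe.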

VentureBeat reports that, “the researchers devised three sets of models capable of learning phonetic and speaker information, which they trained on a data set containing over 16,000 hours of annotated samples where 5,000 hours of audio had phonetic labels … [with] over 100 subjects contribut[ing] to the corpus using a smart speaker device in a range of acoustic settings.”

To allow measurement of the “false alarm” rate, 2,000 hours of “continuous audio recordings from TV, radio, and podcasts that didn’t contain the trigger phrase were added.” The models “showed an aptitude for learning both phonetic and speaker information while yielding accuracies ‘at least as good’ as the baseline models for each task.”

One of the three models actually showed a “relative improvement of 7.6 percent over the baseline on a text-independent task.”

The researchers noted that, “from a practical standpoint, being able to share computation between the two tasks can save on-device memory, computation time or latency, and the amount of power/battery consumed.” Another related study looked at “the task of false trigger mitigation, where speech not intended for a voice assistant like Siri is purposefully ignored by the assistant.”

The researchers, using a “graph neural network (GNN), a type of AI model that operates on the graph structure where every node is associated with a label and the goal is to predict the label of the nodes without ground-truth,” mitigated 87 percent of false triggers. In the future, “the team plans to extend GNN-based processing to other tasks, such as user-intent classification.”
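
The node-labeling idea can be sketched with a simple label-propagation heuristic (illustrative only; the paper’s GNN is a learned model, and the graph and labels below are hypothetical):

```python
# Toy graph where some nodes have known labels ("trigger" vs. "not_trigger")
# and unlabeled nodes take the majority label of their labeled neighbors.
edges = {  # adjacency list for a hypothetical 5-node graph
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3],
}
labels = {0: "trigger", 1: "trigger", 4: "not_trigger"}  # known labels

def predict(node):
    """Predict an unlabeled node's label by majority vote of labeled neighbors."""
    votes = [labels[n] for n in edges[node] if n in labels]
    return max(set(votes), key=votes.count) if votes else None

print(predict(2))  # neighbors 0 and 1 are both labeled "trigger"
print(predict(3))  # only labeled neighbor, 4, is "not_trigger"
```

A trained GNN replaces this majority vote with learned message-passing, but the goal is the same: infer labels for nodes without ground truth from their neighborhood.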

Other Apple researchers are looking into a “speaker language identification system tailored to scenarios involving multilingual speakers.” The Washington Post commissioned a study of Amazon and Google smart speakers, finding that they were “30 percent less likely to understand non-American accents than those of native-born users.”

The researchers’ solution “incorporates knowledge about usage patterns into a dictation system that’s able to make decisions for speakers across over 60 locales,” via an “acoustic sub-model [that] makes predictions based on the evidence conveyed by the speech signal, and a context-aware prediction component [that] takes into account the assorted interaction context signals.”
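
One simple way to picture combining the two components (the numbers and the multiplicative weighting below are hypothetical; the actual system’s fusion is more sophisticated) is to treat the acoustic scores as likelihoods and the usage context as a prior:

```python
# Toy fusion of an acoustic model's per-locale scores with a usage-based
# prior. A speaker whose dictation history is mostly Spanish gets a high
# es_MX prior, which can override ambiguous acoustic evidence.
acoustic_scores = {"en_US": 0.40, "es_MX": 0.35, "hi_IN": 0.25}
usage_prior = {"en_US": 0.20, "es_MX": 0.70, "hi_IN": 0.10}

def pick_locale(acoustic, prior):
    """Combine acoustic evidence with context: score = likelihood * prior."""
    combined = {loc: acoustic[loc] * prior[loc] for loc in acoustic}
    return max(combined, key=combined.get)

print(pick_locale(acoustic_scores, usage_prior))  # es_MX: context tips the call
```

Here the acoustic model slightly favors en_US, but the usage prior flips the decision to es_MX, which is the kind of correction a context-aware component enables for multilingual speakers.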

The researchers also “developed a custom metric dubbed Average User Accuracy (AUA) that they say better reflects ‘population-level’ usage patterns in models.” The solution achieved “an average of 87 percent accuracy across language combinations while improving worst-case accuracy by over 60 percent relative to the baseline.”
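
The intuition behind a per-user metric can be sketched as follows (hypothetical data and a simplified definition; the published AUA also models usage patterns): compute accuracy separately for each user, then average, so heavy users do not drown out everyone else.

```python
# Toy "Average User Accuracy" vs. pooled accuracy. A metric pooled over
# all utterances is dominated by the heavy user; averaging per-user
# accuracies surfaces the poorly served light user.
results = {  # user -> list of per-utterance outcomes (True = correct)
    "user_a": [True] * 98 + [False] * 2,    # heavy user, 98% accurate
    "user_b": [True, False, False, False],  # light user, 25% accurate
}

def average_user_accuracy(results):
    """Mean of per-user accuracies: every user counts equally."""
    per_user = [sum(r) / len(r) for r in results.values()]
    return sum(per_user) / len(per_user)

def pooled_accuracy(results):
    """Accuracy over all utterances pooled together."""
    flat = [x for r in results.values() for x in r]
    return sum(flat) / len(flat)

print(average_user_accuracy(results))  # 0.615 -- light user counts equally
print(pooled_accuracy(results))        # ~0.952 -- dominated by heavy user
```

The gap between the two numbers is why a population-level metric like AUA better reflects the worst-case experience the researchers report improving.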