Companies Turn to AI for New Approaches to Audio Solutions

Understanding speech visually, by reading lips, in addition to aurally is an advantage AI has been waiting for, according to researchers at Meta Platforms (formerly Facebook). The company says it has developed a framework that learns by watching, Audio-Visual Hidden Unit BERT (AV-HuBERT), and that it is 75 percent more accurate than competing automated speech recognition systems on several metrics. Meta claims that AV-HuBERT outperforms the former best audiovisual speech recognition system with only one-tenth the input data, which makes it potentially useful for languages with little or no audio data.

Smartphone apps, augmented reality glasses and camera-equipped speakers, like Amazon’s Echo Show, could also benefit from AV-HuBERT. “In the future, AI frameworks like AV-HuBERT could be used to improve the performance of speech recognition technology in noisy everyday conditions — for example, interactions at a party or in a bustling street market,” Meta AI research scientist Abdelrahman Mohamed told VentureBeat.

The article suggests AV-HuBERT has been designed “somewhat uniquely” in that it “leverages unsupervised, or self-supervised, learning.”

“Meta isn’t the first to apply AI to the problem of lip-reading. In 2016, researchers at the University of Oxford created a system that was nearly twice as accurate as experienced lip readers in certain tests and could process video in close-to-real-time,” VentureBeat explains.

“In 2017, Alphabet-owned DeepMind trained a system on thousands of hours of TV shows to correctly translate about 50 percent of words without errors on a test set, far better than a human expert’s 12.4 percent. But the University of Oxford and DeepMind models, as with many subsequent lip-reading models, were limited in the range of vocabulary that they could recognize.”

These models “also required datasets paired with transcripts in order to train,” notes VentureBeat, and they didn’t process audio from speakers in the video content used to learn visually.

Meanwhile, across the pond, Alex Mitchell created a website and app called Boomy that lets creators make their own songs with artificial intelligence. After a few clicks, Boomy says it will compose an original piece for you in under 30 seconds.

“It swiftly picks the track’s key, chords and melody,” writes BBC. “And from there you can then finesse your song. You can do things such as add or strip out instruments, change the tempo, adjust the volumes, add echoes, make everything sound brighter or softer, and lay down some vocals.”
