OpenAI Rolls Out Open-Source Speech Recognition System

OpenAI has released a new open-source AI speech recognition model called Whisper that, the company says, approaches human-level accuracy and robustness in recognizing and translating audio. Use cases include transcription of speeches, interviews, podcasts and conversations. “Moreover, it enables transcription in multiple languages, as well as translation from those languages into English,” says OpenAI, which is open-sourcing models and inference code on GitHub “to serve as a foundation for building useful applications and for further research on robust speech processing.”

While there are already some very good automatic speech recognition (ASR) systems — some central to software and services from Google, Amazon and Meta Platforms — OpenAI says what sets Whisper apart is its training: 680,000 hours of multilingual and multitask supervised data collected from the web.

That large dataset “leads to improved robustness to accents, background noise and technical language,” OpenAI notes in its announcement. “The primary intended users of [the Whisper] models are AI researchers studying robustness, generalization, capabilities, biases and constraints of the current model.”

Ars Technica says Whisper is “an encoder-decoder transformer, a type of neural network that can use context gleaned from input data to learn associations that can then be translated into the model’s output.”

Citing OpenAI’s Whisper GitHub repository, TechCrunch writes that the models “show strong ASR results in ~10 languages. They may exhibit additional capabilities … if fine-tuned on certain tasks like voice activity detection, speaker classification or speaker diarization but have not been robustly evaluated in these areas.”

Whisper doesn’t work equally well across languages, exhibiting higher error rates on languages and vocal intonations that were not well represented in the training data. TechCrunch reports that this is “nothing new to the world of speech recognition, unfortunately. Biases have long plagued even the best systems, with a 2020 Stanford study finding systems from Amazon, Apple, Google, IBM and Microsoft made far fewer errors — about 35 percent — with users who are white than with users who are Black.”

Still, OpenAI is framing Whisper’s advances as a net win. “Since these methods learn directly from raw audio without the need for human labels, they can productively use large datasets of unlabeled speech and have been quickly scaled up to 1,000,000 hours of training data,” the technology’s researchers write in a scholarly paper, noting that “when fine-tuned on standard benchmarks, this approach has improved the state of the art, especially in a low-data setting.”

Related:
I Used OpenAI’s New Tech to Transcribe Audio Right on My Laptop, The Verge, 9/23/22