Google Open-Sources Technology For Real-Time Captions

Google is looking to help developers create real-time captioning for long-form conversations in multiple languages. The company recently open-sourced the speech engine used for Live Transcribe, its Android speech-to-text transcription app designed for those who are deaf or hard of hearing, and posted the source code on GitHub. Launched in February, Live Transcribe uses machine learning algorithms to convert speech in more than 70 languages and dialects into real-time captions.

“Unlike Android’s upcoming Live Caption feature, Live Transcribe is a full-screen experience, uses your smartphone’s microphone (or an external microphone), and relies on the Google Cloud Speech API,” reports VentureBeat.

Live Transcribe allows users to type responses back on the screen. It is available on 1.8 billion Android devices. (Live Caption will be exclusive to select Android Q devices.)

According to the Google Open Source Blog, “relying on the cloud introduces several complications — most notably robustness to ever-changing network connections, data costs, and latency. Today, we are sharing our transcription engine with the world so that developers everywhere can build applications with robust transcription.” (The source code is available on GitHub.)

Google’s speech engine closes and restarts its streaming requests to accommodate pauses and silence. It also “buffers audio locally and then sends it upon reconnection,” notes VB. Google evaluated audio codecs including FLAC, AMR-WB and Opus, each with different trade-offs under different conditions. For example: “To reduce latency even further than the Cloud Speech API already does, Live Transcribe uses a custom Opus encoder. The encoder increases bitrate just enough so that ‘latency is visually indistinguishable to sending uncompressed audio.’”
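
The buffer-and-resend behavior described above can be illustrated with a minimal sketch. This is not Google's code: the class name `BufferedAudioSender`, the `send` callback, and the `max_chunks` limit are all hypothetical, and a real client would stream to a speech service rather than call a plain function. The sketch only shows the general idea of queuing audio chunks while the network is down and flushing them in order once it returns.

```python
from collections import deque
from typing import Callable, Deque


class BufferedAudioSender:
    """Hypothetical helper: holds audio chunks locally while offline
    and sends them in their original order after reconnection."""

    def __init__(self, send: Callable[[bytes], None], max_chunks: int = 1000):
        # `send` is an assumed transport callback that raises
        # ConnectionError when the network is unavailable.
        self._send = send
        # A bounded deque discards the oldest chunk once it is full,
        # so memory use stays capped during a long outage.
        self._pending: Deque[bytes] = deque(maxlen=max_chunks)
        self.connected = True

    def submit(self, chunk: bytes) -> None:
        """Queue a chunk, then try to send everything that is pending."""
        self._pending.append(chunk)
        self.flush()

    def flush(self) -> None:
        """Send pending chunks until the queue is empty or sending fails."""
        while self.connected and self._pending:
            chunk = self._pending[0]
            try:
                self._send(chunk)
            except ConnectionError:
                # Keep the chunk queued and wait for on_reconnect().
                self.connected = False
                return
            self._pending.popleft()

    def on_reconnect(self) -> None:
        """Call when connectivity returns; resends the buffered audio."""
        self.connected = True
        self.flush()

    def pending_count(self) -> int:
        return len(self._pending)
```

A chunk is removed from the queue only after `send` succeeds, so audio captured during an outage is delayed rather than lost, up to the buffer limit.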

“Opus, AMR-WB, and FLAC encoding can be easily enabled and configured,” explains VB. The Live Transcribe speech engine also “contains a text formatting library for visualizing ASR confidence, speaker ID, and more.”
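
To give a rough sense of what visualizing ASR confidence can mean, here is a small sketch. It does not use or reproduce the Live Transcribe formatting library; the function `format_by_confidence`, its thresholds, and the choice of `<i>` and `<u>` tags are assumptions made purely for illustration. It marks up each recognized word according to how confident the recognizer was.

```python
from typing import Iterable, Tuple


def format_by_confidence(
    words: Iterable[Tuple[str, float]],
    low: float = 0.5,
    high: float = 0.8,
) -> str:
    """Hypothetical formatter: wraps each word in markup based on its
    recognition confidence (0.0 to 1.0). Thresholds are illustrative."""
    parts = []
    for word, confidence in words:
        if confidence >= high:
            parts.append(word)                # confident: plain text
        elif confidence >= low:
            parts.append(f"<i>{word}</i>")    # uncertain: italic
        else:
            parts.append(f"<u>{word}</u>")    # doubtful: underlined
    return " ".join(parts)
```

A caption renderer could then display the plain words normally and draw the reader's attention to the marked ones, which are more likely to be misrecognized.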