Meta Says Its AI-Compressed Audio Codec Beats MP3 by 10x

Meta Platforms says its vision for the metaverse will rely heavily on compression technology “to deliver high-quality, uninterrupted experiences for everyone.” With that in mind, it has tasked its Fundamental AI Research (FAIR) lab with developing “hypercompression” solutions. First up is EnCodec, an audio codec it says compresses audio to files roughly 10 times smaller than 64 kbps MP3 with no perceptible loss in quality. The EnCodec protocol has the potential to greatly improve the sound and reliability of speech over low-bandwidth connections (like when your mobile phone is only getting one bar). It also works for music.

Meta publicly unveiled EnCodec last month in a paper called “High Fidelity Neural Audio Compression,” by Meta AI researchers Alexandre Défossez, Jade Copet, Gabriel Synnaeve and Yossi Adi.

“Meta describes its method as a three-part system trained to compress audio to a desired target size,” writes Ars Technica, noting “first, the encoder transforms uncompressed data into a lower frame rate ‘latent space,’” then “the ‘quantizer’ compresses the representation to the target size while keeping track of the most important information that will later be used to rebuild the original signal.”

The compressed signal can then be sent out over a network or saved to disk. Lastly, “the decoder turns the compressed data back into audio in real time using a neural network on a single CPU,” Ars Technica explains.
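The encoder → quantizer → decoder pipeline described above can be sketched in a few lines. The toy residual vector quantizer below is a stand-in for EnCodec’s learned networks (the dimensions, codebook sizes, and “skip” codeword are illustrative assumptions, not details from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 8          # latent dimensionality (illustrative)
CODEBOOK = 16    # entries per codebook (illustrative)
STAGES = 4       # number of residual quantization stages (illustrative)

codebooks = rng.normal(size=(STAGES, CODEBOOK, DIM))
codebooks[:, 0] = 0.0  # a zero "skip" codeword so the residual never grows

def rvq_encode(latent):
    """Quantize one latent vector into STAGES small codebook indices."""
    residual = latent.copy()
    indices = []
    for cb in codebooks:
        # pick the codeword closest to what is left of the signal
        i = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(i)
        residual = residual - cb[i]  # the next stage quantizes the remainder
    return indices

def rvq_decode(indices):
    """Rebuild the latent by summing the chosen codewords."""
    return sum(codebooks[s][i] for s, i in enumerate(indices))

latent = rng.normal(size=DIM)   # stand-in for one encoder output frame
codes = rvq_encode(latent)      # this small index list is what gets transmitted
recon = rvq_decode(codes)

err = np.linalg.norm(latent - recon)
print(codes, round(err, 3))
```

Each stage quantizes only the error left by the previous one, which is how a quantizer can capture “the most important information” first and still hit a fixed target size: the bitrate is just the number of stages times the bits per index.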

Meta researchers say the company is the first to use this technology with 48 kHz stereo audio “(slightly better than CDs’ 44.1 kHz sampling rate), which is typical for music files distributed on the Internet,” Ars Technica says.
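Back-of-envelope arithmetic shows why those sampling rates matter. Assuming 16-bit samples (a common choice, not stated in the article), uncompressed 48 kHz stereo audio is enormous compared to any of the bitrates discussed:

```python
# Raw bitrate of 48 kHz, 16-bit stereo audio, and the compression
# ratios implied by a 64 kbps MP3 and a codec 10x more efficient.
sample_rate = 48_000   # samples per second, per channel
bit_depth = 16         # bits per sample (assumed; not stated in the article)
channels = 2           # stereo

raw_kbps = sample_rate * bit_depth * channels / 1000
print(raw_kbps)            # 1536.0 kbps uncompressed
print(raw_kbps / 64)       # 24.0x compression to reach 64 kbps MP3
print(raw_kbps / 6.4)      # 240.0x at one-tenth the MP3 bitrate
```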

“We believe we can attain even smaller file sizes, as we haven’t yet reached the limits of quantization techniques,” Meta wrote in a blog post. “On the applied research side, there is more work that can be done on the trade-off between computing power and the size of compressed audio,” the company added, noting that, thanks to AI, “dedicated chips, such as those that are already on phones and laptops, could be improved in the future to help compress and decompress files, while consuming less power.”

Meta’s use of discriminators is key to “compressing the audio as much as possible without losing key elements of a signal that make it distinctive and recognizable,” Ars Technica says.

“The key to lossy compression is to identify changes that will not be perceivable by humans, as perfect reconstruction is impossible at low bit rates,” Meta adds, explaining that the discriminators are used to improve the perceptual quality of the generated samples, creating “a cat-and-mouse game where the discriminator’s job is to differentiate between real samples and reconstructed samples.”