Meta’s Multimodal AI Model Translates Nearly 100 Languages

Meta Platforms is releasing SeamlessM4T, the world’s “first all-in-one multilingual multimodal AI translation and transcription model,” according to the company. SeamlessM4T can perform speech-to-text, speech-to-speech, text-to-speech, and text-to-text translations for up to 100 languages, depending on the task. “Our single model provides on-demand translations that enable people who speak different languages to communicate more effectively,” Meta claims, adding that SeamlessM4T “implicitly recognizes the source languages without the need for a separate language identification model.”
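To make the multitask interface concrete, the sketch below shows how the four translation modes might be invoked from a single model. It assumes the Hugging Face transformers port of SeamlessM4T and the public "facebook/hf-seamless-m4t-medium" checkpoint; those names are assumptions about the open release, not details from Meta's announcement, so verify them against current documentation.

```python
# A minimal sketch of SeamlessM4T's single-model multitask interface,
# assuming the Hugging Face transformers port (v4.35+) and the
# "facebook/hf-seamless-m4t-medium" checkpoint.
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

# Text-to-text translation: English in, French out.
text_inputs = processor(text="Hello, how are you?",
                        src_lang="eng", return_tensors="pt")
tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
print(processor.decode(tokens[0].tolist()[0], skip_special_tokens=True))

# Text-to-speech translation: the same call emits a waveform when
# generate_speech is left at its default of True.
waveform = model.generate(**text_inputs, tgt_lang="fra")[0]

# Speech input goes through processor(audios=...); per Meta, the source
# language is recognized implicitly, so no src_lang is needed for speech.
```

Note that the task is selected by the inputs and generation flags rather than by loading a separate model per task, which is the design point Meta is claiming above.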

Meta builds on formidable translation accomplishments that include No Language Left Behind, a text-to-text machine translation model, and the Universal Speech Translator, a direct speech-to-speech translation system.

SeamlessM4T derives most directly from Meta’s Massively Multilingual Speech (MMS) model, which has speech synthesis and language recognition capabilities across more than 1,100 languages.

Building a universal language translator is challenging because existing systems cover only a small fraction of the world's languages, according to Meta. The company stresses that its single-system approach is superior to pipelines that chain separate models, delivering greater efficiency and higher-quality translations.

“In keeping with our approach to open science, we’re publicly releasing SeamlessM4T under a research license to allow researchers and developers to build on this work,” Meta announced in a blog post, adding that it is simultaneously releasing the metadata of SeamlessAlign, “the biggest open multimodal translation dataset to date, totaling 270,000 hours of mined speech and text alignments.”

TechCrunch notes, “Meta isn’t the only one investing resources in developing sophisticated AI translation and transcription tools,” calling SeamlessM4T “among the more ambitious efforts to date.” Other companies with competing models available commercially or as open source include Amazon, Microsoft and OpenAI, in addition to several startups.

Since 2022, Google has been publicly discussing its AI-powered Universal Speech Model (USM), which it describes as a step toward understanding the world's 1,000 most spoken languages.

Meanwhile, Mozilla has partnered with Nvidia on the ambitious Mozilla Common Voice (MCV), a publicly crowdsourced multilingual speech corpus it calls “the largest of its kind in the world.” Models pre-trained on MCV data are being made available as open source.

Related:
SeamlessM4T — Massively Multilingual & Multimodal Machine Translation, Meta White Paper, 8/22/23
