Facebook Adds 24 Languages to Rosetta Translation Feature

Facebook’s Rosetta is a machine learning system that extracts text in many languages from more than one billion images in real time. Facebook built its own optical character recognition (OCR) system to process that huge volume of content, day in and day out. In a recent blog post, Facebook explained how Rosetta works, using a convolutional neural network to recognize and transcribe text, including non-Latin alphabets and non-English words. The system was trained on a mix of human- and machine-annotated public images.
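The two-step structure the post describes (find text regions, then transcribe each one) can be sketched conceptually as below. All function names and return values are hypothetical placeholders, not Rosetta's actual API:

```python
# Conceptual sketch of a detect-then-transcribe OCR pipeline.
# Everything here is a stand-in: a real system would run trained
# detection and recognition models instead of these dummies.

def detect_text_regions(image):
    """Stand-in for a detection model that finds boxes containing text."""
    return [{"box": (10, 10, 120, 40)}]  # dummy region

def transcribe_region(image, region):
    """Stand-in for a recognition model that reads the text in one box."""
    return "hello"  # dummy transcription

def extract_text(image):
    # Detection feeds recognition: one transcription per detected region.
    return [transcribe_region(image, r) for r in detect_text_regions(image)]

print(extract_text(object()))  # ['hello']
```

The key design point is the separation of concerns: detection scales with image size, while recognition runs once per detected region.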

Engadget reports that “various teams within Facebook and Instagram are already using Rosetta to surface more content and to police their platforms.” The company recently added 24 new languages to its automatic translation services “including Serbian, Belarusian, Marathi, Sinhalese, Telugu, Nepali, Kannada, Urdu, Punjabi, Cambodian, Pashto, Mongolian, Zulu, Xhosa and Somali,” although translations are “at an early stage,” meaning they’ll “still have a lot of errors.”

In its own blog post, Facebook explains that it is currently “serving nearly 6 billion translations per day to our community,” enabled by artificial intelligence and neural machine translation (NMT). In 2018, Facebook’s Language and Translation Technologies (LATTE) team set the goal of “no language left behind.” The primary challenge is a lack of training data, since “most of these languages do not have a quantity of readily available human translations.”

With the addition of the 24 new languages, Facebook is now “serving translations for a total of 4,504 language directions (a pair of languages between which we offer translation, e.g., English to Spanish).”
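Because a direction is an ordered pair, each pair of languages contributes two directions. A toy illustration (language codes chosen arbitrarily):

```python
from itertools import permutations

# A "language direction" is an ordered pair: English->Spanish and
# Spanish->English count as two separate directions.
languages = ["en", "es", "sr", "be"]  # illustrative subset
directions = list(permutations(languages, 2))

print(len(directions))  # 4 languages yield 4 * 3 = 12 directions
```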

NMT models were trained using the open source PyTorch Translate framework, “converted … to the ONNX format,” and run in the Caffe2 environment. LATTE used several strategies. The first was to increase labeled in-domain data (i.e., Facebook posts), manually labeling “millions of words in 25 languages.” Second, the group explored “semi-supervised and data augmentation methods to generate additional training data,” an approach that relies on “lower-accuracy models that are used to generate artificial training data.”
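One standard way to generate artificial training data with a lower-accuracy model is back-translation-style augmentation. The sketch below is illustrative only; the function is a placeholder for a real reverse-direction model, and the post does not specify Facebook's exact recipe:

```python
# Back-translation-style augmentation sketch (all names hypothetical).
# A lower-accuracy reverse model translates monolingual target-language
# text back into the source language, yielding (synthetic source,
# real target) pairs for training the forward model.

def reverse_translate(sentence: str) -> str:
    """Stand-in for a low-accuracy target->source model."""
    return "<synthetic> " + sentence  # placeholder: a real model would translate

monolingual_target = ["Hola mundo", "Buenos dias"]

# Each pair: noisy synthetic source paired with a clean human-written target.
augmented_pairs = [(reverse_translate(t), t) for t in monolingual_target]
print(augmented_pairs[0])
```

The target side of each synthetic pair is genuine human text, which is why such data can still teach the forward model to produce fluent output.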

The team also “explored using monolingual data in a technique known as copy-target,” in which the target side is copied over to the source side; this too resulted in improvements. “The intuition behind these two techniques is that they help low-resource models do better training of their decoder (the component that produces the translation) and produce more fluent translations.”
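The copy-target idea reduces to a very simple data transformation, sketched below with made-up example sentences:

```python
# Copy-target augmentation sketch (illustrative only): for a low-resource
# pair, monolingual target-side sentences are copied onto the source side,
# so the decoder sees more fluent target-language text during training.

monolingual_target = ["The cat sat on the mat.", "It rained all day."]

copy_target_pairs = [(t, t) for t in monolingual_target]  # source == target

for src, tgt in copy_target_pairs:
    assert src == tgt
print(len(copy_target_pairs))
```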

Another technique was to “build a translation detector that can tell when two sentences in different languages are translations of each other” and then “use this detector to mine translations from multilingual webpages,” which helped in 70 percent of the experiments, although such methods can “introduce noise when the data they generate is not accurate.”
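A common way to build such a detector (not necessarily Facebook's) is to score candidate sentence pairs by similarity of cross-lingual embeddings and keep pairs above a threshold. The vectors below are toy values invented for illustration; real systems use learned multilingual sentence embeddings:

```python
import math

# Hypothetical parallel-sentence mining sketch: score cross-lingual
# sentence pairs with cosine similarity over (toy) embedding vectors
# and keep only pairs above a threshold.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

candidate_pairs = [
    (("Hello world", [0.9, 0.1, 0.2]), ("Hola mundo", [0.88, 0.12, 0.25])),
    (("Hello world", [0.9, 0.1, 0.2]), ("Hasta luego", [0.1, 0.9, 0.4])),
]

THRESHOLD = 0.9  # tuned on held-out data; mined pairs can still be noisy
mined = [
    (src, tgt)
    for (src, u), (tgt, v) in candidate_pairs
    if cosine(u, v) > THRESHOLD
]
print(mined)
```

The threshold is the lever behind the noise trade-off the article mentions: lowering it mines more pairs but admits more inaccurate ones.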

Last, “one of the most effective ways we found to improve the quality of a translation system for a specific dialect direction was to combine it with other related directions,” giving the example of improving translations from Belarusian to English by leveraging the relationship between Belarusian and Ukrainian to build a multilingual system. “The multilingual systems we experimented with were able to leverage similarities in dialects from the same language families.”
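One widely used way to combine related directions into a multilingual system (a plausible sketch, not confirmed as Facebook's exact method) is to prepend a target-language token to each source sentence and train a single model on the pooled corpora:

```python
# Sketch of pooling related directions into one multilingual training set
# (the tagging scheme is an assumption, not taken from Facebook's post).
# Prepending a target-language token lets one model serve several related
# directions, so scarce Belarusian->English data can benefit from the
# larger, related Ukrainian->English corpus.

corpora = {
    ("be", "en"): [("Прывітанне", "Hello")],                       # low-resource
    ("uk", "en"): [("Привіт", "Hello"), ("Дякую", "Thank you")],   # related, larger
}

pooled = [
    (f"<2{tgt_lang}> {src}", tgt)
    for (src_lang, tgt_lang), pairs in corpora.items()
    for src, tgt in pairs
]
print(len(pooled))  # 3 tagged examples feed a single shared model
```

Because Belarusian and Ukrainian share vocabulary and grammar, the shared model's encoder can transfer what it learns from the larger corpus to the smaller one.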