Microsoft and Nvidia Debut World’s Largest Language Model

Microsoft and Nvidia have trained what they describe as the most powerful AI-driven language model to date, the Megatron-Turing Natural Language Generation model (MT-NLG), which has “set the new standard for large-scale language models in both model scale and quality,” the firms say. As the successor to the companies’ Turing NLG 17B and Megatron-LM, the new MT-NLG has 530 billion parameters, or “3x the number of parameters compared to the existing largest model of this type,” and, according to the companies, demonstrates unmatched accuracy in a broad set of natural language tasks.

Such tasks include completion prediction, reading comprehension, common sense reasoning, natural language inference and word sense disambiguation. The larger the number of parameters in a model, the richer and more nuanced its understanding of language, the companies wrote in a joint blog post by Microsoft’s Ali Alvi and Nvidia’s Paresh Kharya.

“As a result, they generalize well as effective zero- or few-shot learners, with high accuracy on many NLP,” or natural language processing, tasks and datasets, the duo writes, listing downstream applications such as summarization, automatic dialogue generation, translation, semantic search, and programming code autocompletion.

The companies write that training such models is challenging for two main reasons:

  1. It is no longer possible to fit the parameters of these models in the memory of even the largest GPU.
  2. The large number of compute operations required can result in unrealistically long training times, if special attention is not paid to optimizing the algorithms, software and hardware stack all together.
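The first point can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the 530 billion parameter count comes from the companies, while the 16-bit weight precision and the 80 GB accelerator size are assumptions made for this example, not figures from the blog post. It also counts weights alone; training additionally needs memory for gradients, optimizer state and activations, so the real requirement is larger.

```python
import math

PARAMS = 530_000_000_000   # MT-NLG parameter count, as reported by the companies
BYTES_PER_PARAM = 2        # assumption: weights stored as 16-bit floats
GPU_MEMORY_GB = 80         # assumption: a single 80 GB accelerator


def weight_memory_gb(params, bytes_per_param):
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def min_gpus_for_weights(params, bytes_per_param, gpu_memory_gb):
    """Smallest number of GPUs whose combined memory can hold the weights."""
    return math.ceil(weight_memory_gb(params, bytes_per_param) / gpu_memory_gb)


if __name__ == "__main__":
    print(weight_memory_gb(PARAMS, BYTES_PER_PARAM))                     # 1060.0
    print(min_gpus_for_weights(PARAMS, BYTES_PER_PARAM, GPU_MEMORY_GB))  # 14
```

Under these assumptions the weights alone occupy roughly 1 TB, more than a dozen times the capacity of a single such device, which is why the parameters have to be split across many GPUs.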

To train MT-NLG, Microsoft and Nvidia created a training dataset of 270 billion tokens from English-language websites. “Tokens, a way of separating pieces of text into smaller units in natural language, can either be words, characters, or parts of words,” writes VentureBeat. The MT-NLG trained by absorbing “a set of examples to learn patterns among data points, like grammatical and syntactical rules.”
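The three token granularities VentureBeat mentions can be illustrated with a toy example. This is a minimal sketch, not the tokenizer used for MT-NLG: the function names are hypothetical, and the fixed-width "subword" split is a stand-in chosen for simplicity, whereas production systems typically learn their subword vocabulary from data.

```python
def word_tokens(text):
    """Split on whitespace: one token per word."""
    return text.split()


def char_tokens(text):
    """One token per character, spaces included."""
    return list(text)


def toy_subword_tokens(text, size=4):
    """Cut each word into fixed-width pieces (a toy stand-in for learned subwords)."""
    pieces = []
    for word in text.split():
        pieces.extend(word[i:i + size] for i in range(0, len(word), size))
    return pieces


if __name__ == "__main__":
    sample = "language models"
    print(word_tokens(sample))         # ['language', 'models']
    print(toy_subword_tokens(sample))  # ['lang', 'uage', 'mode', 'ls']
    print(len(char_tokens(sample)))    # 15
```

The same text yields 2, 4 or 15 tokens depending on the scheme, which is why a dataset's size in tokens depends on how the text is split.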

The dataset was primarily drawn from The Pile, “an 835GB collection of 22 smaller datasets created by the open source AI research effort EleutherAI. The Pile spans academic sources (e.g., Arxiv, PubMed), communities (StackExchange, Wikipedia), code repositories (Github), and more,” VentureBeat summarizes. Microsoft and Nvidia filtered the results for quality control, combining them with data snapshots from Common Crawl, a vast collection of webpages, news stories and social media posts maintained by the nonprofit of the same name and hosted on Amazon Web Services.

Models such as MT-NLG can amplify biases present in the data used to train them. Microsoft and Nvidia concede the model “picks up stereotypes and biases” from training data, a concern that VentureBeat attributes to a portion of the dataset having been sourced “from communities with pervasive gender, race, physical, and religious prejudices, which curation can’t completely address.”

Microsoft and Nvidia assert they’re “committed to working on addressing [this] problem.”