Nvidia Turbo Charges NeMo Megatron Large Training Model

Nvidia has issued a software update for NeMo Megatron, its formidable framework for training giant language models, increasing efficiency and speed. Arriving barely a year after Nvidia unveiled Megatron, this latest improvement further leverages the transformer architecture that has become synonymous with deep learning since Google introduced the concept in 2017. New features result in what Nvidia says is a 5x reduction in memory requirements and up to a 30 percent gain in speed for models as large as 1 trillion parameters, making NeMo Megatron better at handling transformer tasks across the entire stack.

Forbes points out that the 30 percent speedup translates into a 175B-parameter model like OpenAI’s GPT-3 taking only 24 days to train on 1,024 A100 GPUs, rather than 34. While OpenAI has essentially replaced GPT-3 with InstructGPT as the default model for its APIs (as MIT Technology Review explained earlier this year), the company continues to use Nvidia technology to power its efforts. Nvidia’s Megatron update uses a new hyperparameter tool that automatically finds the correct training and inference configs.
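To see how the quoted training times relate to the 30 percent figure, the arithmetic can be checked in a few lines of Python. This is only a back-of-the-envelope sketch using the two day counts cited by Forbes; the variable names are illustrative, and reading the figure as "time saved" rather than "throughput gained" is an assumption.

```python
# Day counts quoted in the article for a 175B-parameter model.
days_before = 34
days_after = 24

# Fraction of wall-clock training time saved by the update.
time_saved = (days_before - days_after) / days_before

# The same change expressed as a gain in training throughput.
throughput_gain = days_before / days_after - 1

print(f"time saved:      {time_saved:.1%}")
print(f"throughput gain: {throughput_gain:.1%}")
```

Under these numbers the reduction in training time is close to 30 percent, which suggests the headline figure describes time saved; the equivalent throughput gain is larger.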

Megatron also adds two processes to reduce memory requirements during training: sequence parallelism (SP), which “reduces activation memory requirements beyond the usual tensor and pipeline parallelism methods,” and selective activation recomputation (SAR), “a novel approach that focuses on selecting activations with high-memory low-compute requirements to recompute when memory constraints are too tight — which avoids the inefficiency of full activation recomputation,” Forbes explains.
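The idea behind selective activation recomputation can be illustrated with a toy Python sketch. This is not Nvidia's implementation: the `Activation` class, the `select_for_recompute` function, the greedy ranking rule, and every number below are hypothetical, chosen only to show what preferring "high-memory low-compute" activations might look like.

```python
from dataclasses import dataclass


@dataclass
class Activation:
    name: str
    memory_gb: float       # memory held if the activation is kept (made-up value)
    recompute_cost: float  # relative cost to regenerate it (made-up value)


def select_for_recompute(activations, memory_budget_gb):
    """Greedily drop activations until what remains fits the memory budget.

    Activations that free the most memory per unit of recompute cost are
    dropped first; everything else stays stored.
    """
    stored = sum(a.memory_gb for a in activations)
    ranked = sorted(
        activations,
        key=lambda a: a.memory_gb / a.recompute_cost,
        reverse=True,
    )
    chosen = []
    for act in ranked:
        if stored <= memory_budget_gb:
            break  # budget met, no need to recompute anything else
        chosen.append(act.name)
        stored -= act.memory_gb
    return chosen, stored


# Hypothetical per-layer activations, for illustration only.
activations = [
    Activation("attention_softmax", 8.0, 1.0),
    Activation("attention_dropout", 4.0, 0.5),
    Activation("mlp_hidden", 6.0, 12.0),
    Activation("layernorm_input", 2.0, 1.0),
]

chosen, stored = select_for_recompute(activations, memory_budget_gb=10.0)
print("recompute:", chosen)
print("stored memory (GB):", stored)
```

In this toy setup the cheap-to-regenerate attention activations are dropped and the expensive `mlp_hidden` activation stays in memory, which is the contrast with full activation recomputation that the Forbes quote draws.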

The one-fifth memory requirement is measured “with respect to doing only tensor parallelism,” and it comes with overhead of only “2–4 percent, in comparison to +36 percent in the case of applying full activation recomputation,” Forbes writes, noting that the reduction holds across model sizes “from 22 billion to 1 trillion parameters.”

“As a framework, NeMo Megatron is a ‘top-to-bottom’ stack,” Nvidia VP of deep learning software Ujval Kapasi told VentureBeat, which says that means “it includes GPU-accelerated machine learning libraries, hardware and networking optimizations for cluster deployments.”

Nvidia has released NeMo Megatron to a limited number of clients in early access, but it’s already being used to help train “some of the largest models on the planet,” VentureBeat writes, listing among them BLOOM, the BigScience Large Open-science Open-access Multilingual Language Model released this month with “support for 46 human languages and 13 programming languages.”

“Our stack is specifically optimized for Nvidia DGX SuperPODs,” Kapasi told VentureBeat, referring to the GPU clusters that run large language model systems, adding that the Megatron stack “also works well on cloud systems.”

Right now, few areas of AI are generating more interest than natural language processing (NLP), which is Megatron’s specialty. And while it’s most often described as an LLM, or large language model, framework, “large” seems to understate a framework that can train and deploy large-scale models “up to trillions of parameters,” Nvidia says.

The Next Platform provides a quick survey detailing industry heavyweights, like Meta Platforms, that “are driving hard to the hoop in LLM. Google has GLaM, with 1.2 trillion parameters and LaMDA (137 billion) and in April introduced a new LLM called PaLM (Pathways Language Model), part of the company’s larger Pathways AI architecture.”

“DeepMind has Gopher (280 billion) and Chinchilla (70 billion) and OpenAI two years ago unveiled GPT-3, a LLM with 175 billion parameters. Nvidia and Microsoft last year announced Megatron-Turing Natural Language Generation model (MT-NLG), with 530 billion parameters.”