Microsoft, Nvidia Partner on Azure-Hosted AI Supercomputer

Microsoft has entered into a multi-year deal with Nvidia to build what the companies are calling “one of the world’s most advanced supercomputers.” It will pair Microsoft Azure’s supercomputing infrastructure with Nvidia GPUs, networking and a full stack of AI software to help enterprises train, deploy and scale AI, including large, state-of-the-art models. “AI is fueling the next wave of automation across enterprises and industrial computing, enabling organizations to do more with less as they navigate economic uncertainties,” Microsoft cloud and AI group executive VP Scott Guthrie said of the alliance.

The partnership “unlocks the world’s most scalable supercomputer platform” and offers “state-of-the-art AI capabilities for every enterprise,” Guthrie added in a joint news release.

As part of the collaboration, Nvidia will use Azure’s scalable virtual machine instances to research and accelerate advances in generative AI, a rapidly emerging area of AI in which foundation models like Megatron-Turing NLG 530B serve as the basis for unsupervised, self-learning algorithms that create new text, code, digital images, video or audio.

Microsoft Azure’s AI-optimized virtual machines will be the first public cloud instances to incorporate Nvidia Quantum-2 400Gb/s InfiniBand networking. “Customers can deploy thousands of GPUs in a single cluster to train massive large language models, build the most complex recommender systems at scale, and enable generative AI at scale,” the companies said.

Among the resources the partners are pouring into the new powerhouse is Microsoft DeepSpeed, a deep learning optimization software suite that will leverage the Nvidia H100 Transformer Engine to accelerate transformer-based models used for large language models, generative AI and writing computer code, among other applications.

TechCrunch calls the H100 GPU “the flagship of Nvidia’s Hopper architecture,” accelerating machine-learning workloads by 1.5x to 6x over the previous-generation A100.

The demand for high-performance AI training infrastructure “has led to an arms race of sorts among cloud and hardware vendors,” TechCrunch reports, citing Cerebras’ unveiling this week of Andromeda, a 13.5-million-core AI supercomputer that the company says delivers more than 1 exaflop of AI compute.

“Google and Amazon continue to invest in their own proprietary solutions, offering custom-designed chips — e.g. TPUs and Trainium — for accelerating AI training in the cloud,” TechCrunch says, predicting the push for more powerful AI hardware “will continue for the foreseeable future,” with a recent study indicating that compute requirements for large-scale AI models doubled every 10.7 months on average between 2016 and 2022.

In 2019, Microsoft announced a $1 billion investment in OpenAI as part of a partnership to build an AI supercomputer hosted on the Azure cloud. At the time it was proclaimed one of the largest supercomputers in the world, but the new Nvidia deal is “presumably to support even more ambitious AI workloads,” TechCrunch writes.