SageMaker HyperPod: Amazon Accelerates AI Model Training

Amazon has added five new capabilities to its SageMaker service, including SageMaker HyperPod, which accelerates large language and foundation model training and tuning. SageMaker HyperPod is said to shorten training time by up to 40 percent with purpose-built infrastructure designed for distributed training at scale. Among the other capabilities, SageMaker Inference reduces foundation model deployment costs by 50 percent and latency by 20 percent on average by optimizing the use of accelerators, Amazon claims. “SageMaker HyperPod removes the undifferentiated heavy lifting involved in building and optimizing machine learning infrastructure,” said Amazon.

The SageMaker concept debuted in 2017 and the company continues to improve it. SageMaker HyperPod “helps developers to train foundation models across thousands of chips,” SiliconANGLE writes, explaining that “foundation models are often too complex to be trained using a single AI chip” and as a result are usually “split across multiple processors,” which is technically complex and “can take weeks or months depending on the amount of hardware involved.”
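To make the splitting concrete, here is a minimal, generic sketch of distributed data-parallel training in PyTorch. It is not SageMaker-specific; the model, data, and hyperparameters are placeholders, and a real foundation model would additionally shard parameters across devices (model or tensor parallelism) because it cannot fit on one chip.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train():
    # One process per accelerator; a launcher such as torchrun sets RANK,
    # WORLD_SIZE, and LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model standing in for one shard of a much larger network.
    model = torch.nn.Linear(4096, 4096).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    for step in range(1000):
        x = torch.randn(8, 4096, device=f"cuda:{local_rank}")  # dummy batch
        loss = ddp_model(x).sum()
        loss.backward()        # gradients are all-reduced across processes
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    train()
```

Coordinating many such processes, and keeping them healthy for weeks, is the operational burden HyperPod is pitched at removing.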

SageMaker HyperPod provides access to on-demand AI training clusters. “Developers can provision a cluster through a combination of point-and-click commands and relatively simple scripts, which is significantly faster than manually configuring infrastructure,” SiliconANGLE notes.
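As a rough sense of what those “relatively simple scripts” can look like, below is a hedged sketch of creating a HyperPod cluster with boto3. The instance type, counts, role ARN, lifecycle-script location, and exact field names are illustrative assumptions and should be checked against the current SageMaker CreateCluster API reference.

```python
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")

# Illustrative values only: the ARN, S3 URI, and instance settings are
# placeholders, and field names should be verified against AWS docs.
response = sagemaker.create_cluster(
    ClusterName="demo-hyperpod-cluster",
    InstanceGroups=[
        {
            "InstanceGroupName": "worker-group",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/lifecycle-scripts/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
        }
    ],
)
print(response["ClusterArn"])
```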

SageMaker HyperPod could emerge as a competitor to Google’s TPU Pods, SiliconANGLE suggests, noting the Alphabet technology trains using “clusters of up to 4,096 artificial intelligence chips that are available through the search giant’s public cloud.” The chips are coordinated by Google’s recently unveiled Cloud TPU Multislice Training, which automates basic maintenance tasks so developers can save time.

SageMaker Clarify “makes it easier for customers to evaluate and select foundation models quickly based on parameters that support responsible use of AI,” Amazon explains in its announcement, adding that SageMaker Canvas is a visual, no-code interface that helps customers prep data “in just a few clicks” and build models from foundation models using natural-language prompts.

TechCrunch points out that HyperPod “allows users to frequently save checkpoints, allowing them to pause, analyze and optimize the training process without having to start over,” and also saves time through built-in fail-safes “so that when a GPU goes down for some reason, the entire training process doesn’t fail, too.”
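The checkpoint-and-resume idea TechCrunch describes follows the standard pattern sketched below. This is a generic PyTorch example, not the HyperPod implementation: persist model and optimizer state periodically so training can restart from the last good step after a failure instead of from scratch.

```python
import torch

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    # Persist everything needed to resume: weights, optimizer state, and step.
    torch.save(
        {"step": step,
         "model_state": model.state_dict(),
         "optimizer_state": optimizer.state_dict()},
        path,
    )

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    # Restore saved state so training continues where it left off.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["step"]
```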
