Amazon SageMaker HyperPod Slurm clusters now allow setting minimum capacity requirements with continuous provisioning

Amazon SageMaker HyperPod now allows users to set minimum capacity requirements for Slurm clusters with continuous provisioning, giving them more control over cluster availability for AI/ML workloads. This is particularly useful for distributed training frameworks that need a fixed number of nodes and for teams that must meet service level agreements.

Amazon SageMaker HyperPod now lets users specify a minimum capacity requirement, known as MinCount, for clusters that are orchestrated with Slurm and use continuous provisioning. With this update, HyperPod can bring a cluster up with whatever partial capacity is available, so AI/ML workloads can start sooner while the remaining instances are added in the background.

MinCount addresses the need of certain training workloads for a guaranteed minimum number of nodes before they can run effectively. Users define the minimum number of instances required before an instance group transitions to the InService status, which gives them more control over when a cluster becomes available for job scheduling. This is especially useful for distributed training workloads built on frameworks such as PyTorch FSDP, Megatron-LM, or NVIDIA NeMo, which often rely on a fixed number of nodes to run efficiently and correctly. It also helps teams secure a baseline GPU count, whether to meet service level agreements (SLAs) or for cost efficiency, before starting a training run.

To use this feature, set the MinInstanceCount parameter in the CreateCluster or UpdateCluster API request. This parameter carries the MinCount value and establishes the minimum capacity threshold for an instance group. The group remains in the Creating or Updating status until the threshold is met, after which it transitions to InService and becomes available for Slurm job scheduling. HyperPod continues to launch instances beyond MinCount until the target instance count is reached. If the MinCount requirement is not met within three hours, HyperPod automatically reverts the instance group to its last successful state.
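The behavior described above can be sketched in a few lines of Python. This is an illustrative model only, not HyperPod's implementation, and it does not call AWS. MinInstanceCount and the Creating, Updating, and InService statuses come from this announcement. The other request field names (InstanceGroupName, InstanceType, InstanceCount) are assumed to follow the existing CreateCluster instance-group shape, and the "RolledBack" label is a hypothetical stand-in for "reverted to the last successful state". A real request needs further fields, such as lifecycle configuration and an execution role, which are omitted here.

```python
from datetime import timedelta

# Rollback window stated in the announcement.
MIN_COUNT_TIMEOUT = timedelta(hours=3)


def build_instance_group(name, instance_type, target_count, min_count):
    """Build a partial instance-group entry for a CreateCluster/UpdateCluster request.

    Field names other than MinInstanceCount are assumptions; required fields
    such as lifecycle configuration and execution role are left out.
    """
    # Local sanity check only; this is not a documented service-side rule.
    if not 0 <= min_count <= target_count:
        raise ValueError("min_count must be between 0 and target_count")
    return {
        "InstanceGroupName": name,
        "InstanceType": instance_type,
        "InstanceCount": target_count,   # target capacity
        "MinInstanceCount": min_count,   # MinCount threshold
    }


def expected_status(running, min_count, elapsed, pending_status="Creating"):
    """Model the instance-group status described in the announcement.

    running: instances provisioned so far
    elapsed: time since the create or update request (timedelta)
    pending_status: "Creating" or "Updating", depending on the request
    """
    if running >= min_count:
        # Threshold met: the group can accept Slurm jobs while
        # provisioning continues toward the target count.
        return "InService"
    if elapsed >= MIN_COUNT_TIMEOUT:
        # Hypothetical label: the group is reverted to its last successful state.
        return "RolledBack"
    return pending_status


# Illustrative values: a 16-node target that becomes usable at 8 nodes.
group = build_instance_group("gpu-workers", "ml.p5.48xlarge", 16, 8)
print(group)
print(expected_status(5, group["MinInstanceCount"], timedelta(hours=1)))
print(expected_status(8, group["MinInstanceCount"], timedelta(hours=1)))
```

In this model, a group with 5 of its 8 required instances stays in Creating after one hour, and moves to InService as soon as the eighth instance is running, even though the remaining 8 instances of the 16-node target are still being provisioned.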

This feature is now available in all AWS Regions where Amazon SageMaker HyperPod is supported. To learn more about setting minimum capacity requirements for your cluster, see the Minimum capacity requirements (MinCount) section of the Amazon SageMaker AI documentation.