Amazon SageMaker HyperPod enhances support for Slurm-managed clusters with continuous provisioning

Amazon SageMaker HyperPod now supports continuous provisioning for Slurm-managed clusters. Training jobs can start on whatever capacity is available while HyperPod provisions the remaining instances in the background.

Amazon SageMaker HyperPod now supports continuous provisioning for clusters managed by the Slurm orchestrator, giving enterprise customers greater flexibility and efficiency when running large-scale AI/ML training workloads. Customers running Slurm-based clusters need training to start quickly, scaling that does not block, maintenance without operational disruption, and detailed visibility into cluster operations.

Previously, if any instance group in a cluster could not be fully provisioned, the entire cluster creation or scaling operation failed and rolled back, causing delays and requiring manual intervention. With continuous provisioning for Slurm, SageMaker HyperPod automatically provisions the remaining capacity in the background, so training jobs can begin immediately on the instances that are already available.
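
The difference between the two behaviors can be illustrated with a toy model. The sketch below is not HyperPod code: the GroupRequest class, both functions, and the capacity numbers are hypothetical and exist only to contrast an all-or-nothing rollback with a start-now, fill-in-later approach.

```python
from dataclasses import dataclass


@dataclass
class GroupRequest:
    """Hypothetical description of one instance group request."""
    name: str
    requested: int  # instances asked for
    available: int  # instances an imagined capacity pool can supply right now


def all_or_nothing(groups):
    """Earlier behavior: any shortfall fails the whole operation (rollback)."""
    if any(g.available < g.requested for g in groups):
        return {}  # nothing is usable after the rollback
    return {g.name: g.requested for g in groups}


def continuous(groups):
    """Continuous behavior: use what is available, keep the rest pending."""
    ready = {g.name: min(g.requested, g.available) for g in groups}
    pending = {
        g.name: g.requested - ready[g.name]
        for g in groups
        if g.requested > ready[g.name]
    }
    return ready, pending


groups = [
    GroupRequest("controller", requested=1, available=1),
    GroupRequest("workers", requested=16, available=12),
]

rolled_back = all_or_nothing(groups)  # empty: the worker shortfall fails everything
ready, pending = continuous(groups)   # 12 workers usable now, 4 still pending
```

In this model the shortfall of four workers leaves the all-or-nothing request with no usable capacity, while the continuous variant exposes the controller and twelve workers at once and tracks the remainder separately.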

Provisioning is priority-based: HyperPod launches the Slurm controller node first, then launches login and worker nodes in parallel, so the cluster becomes operational as quickly as possible. HyperPod also retries failed node launches asynchronously and adds nodes to the Slurm cluster as they become available, so clusters reach their intended scale reliably without manual intervention.
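
The ordering described above (controller first, then login and worker nodes in parallel, with retries for failed launches) can be sketched with Python's asyncio. This is a simulation under stated assumptions, not HyperPod's implementation: launch_node, provision, the node names, and the failure counts are all hypothetical, and a real launch would involve far more than a single yield point.

```python
import asyncio


async def launch_node(name, failures, log, max_attempts=5):
    """Simulate launching one node, retrying until it succeeds.

    `failures` maps a node name to how many attempts should fail first
    (a stand-in for transient capacity errors).
    """
    for attempt in range(1, max_attempts + 1):
        await asyncio.sleep(0)  # yield to other tasks; stands in for launch latency
        if failures.get(name, 0) > 0:
            failures[name] -= 1  # this attempt fails; retry on the next loop
            continue
        log.append(name)  # node joins the simulated cluster
        return attempt
    raise RuntimeError(f"{name} did not launch after {max_attempts} attempts")


async def provision(login, workers, failures):
    log = []
    # Priority step: the controller must be up before anything else starts.
    await launch_node("controller", failures, log)
    # Login and worker nodes launch concurrently; a retrying node
    # does not hold back the others.
    names = login + workers
    attempts = await asyncio.gather(
        *(launch_node(n, failures, log) for n in names)
    )
    return log, dict(zip(names, attempts))


# worker-1 is configured to fail twice before succeeding.
log, attempts = asyncio.run(
    provision(["login-1"], ["worker-1", "worker-2"], {"worker-1": 2})
)
```

Here the controller always appears first in the join log, and worker-1 joins last on its third attempt while login-1 and worker-2 become available without waiting for it.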

The feature also supports concurrent, non-blocking scaling operations across multiple instance groups, so a capacity shortage in one group does not block scaling in the others. Together, these enhancements are designed to reduce time-to-training, improve resource utilization, and let customers focus on innovation rather than infrastructure management.

This feature is available for new SageMaker HyperPod clusters that use the Slurm orchestrator. To enable continuous provisioning, set the NodeProvisioningMode parameter to “Continuous” when creating a new HyperPod cluster with the CreateCluster API. The feature can also be enabled when creating new clusters through the AWS CLI and the SageMaker AI console.
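
As a rough illustration of the API route, the following Python sketch builds a CreateCluster request with NodeProvisioningMode set to "Continuous" and, separately, shows how it might be sent with the boto3 SageMaker client. The cluster name, instance group name, instance type, S3 URI, script name, and role ARN are placeholders, not values from this announcement, and a real Slurm cluster may require additional fields (for example, networking settings); consult the CreateCluster API reference before relying on it.

```python
def build_create_cluster_request(cluster_name, instance_groups):
    """Assemble CreateCluster parameters with continuous provisioning enabled."""
    return {
        "ClusterName": cluster_name,
        "InstanceGroups": instance_groups,
        # The setting described in this announcement.
        "NodeProvisioningMode": "Continuous",
    }


def create_cluster(request):
    """Send the request; needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so building the request works without boto3

    client = boto3.client("sagemaker")
    return client.create_cluster(**request)


# All values below are placeholders for illustration only.
request = build_create_cluster_request(
    "example-slurm-cluster",
    [
        {
            "InstanceGroupName": "worker-group",
            "InstanceType": "ml.g5.8xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://example-bucket/lifecycle-scripts/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/ExampleHyperPodRole",
        }
    ],
)
```

Keeping the request construction separate from the network call makes the payload easy to inspect or log before anything is created; calling create_cluster(request) would submit it.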

Continuous provisioning is available in all AWS Regions where Amazon SageMaker HyperPod is supported. For more information on continuous provisioning for Slurm clusters, see the Amazon SageMaker HyperPod User Guide.