- Add Recipe Job Init Experience: `hyp init hyp-recipe-job` command for launching HyperPod recipe jobs (#409)
- Introduce `replica_count` parameter for training jobs; deprecate `node_count` in favor of `replica_count` (#401)
- Exclude `DELETE_COMPLETE` stacks at API level to prevent `ListStacks` throttling (#410)
- Fix `ml.` prefix and wrong MIG profiles for `p6-b200.48xlarge` (#399)
- Fix namespaced-role-and-bindings helm chart bugs (#406)
- Release Health Monitoring Agent 1.0.1481.0_1.0.392.0 with bug fixes (#405)
- Add g7e instance types to HyperPod helm chart values (nvidia/EFA device plugins) (#380)
- Add g7e instance types to Python constants and CLI (#385, #390)
- Add g7e instance types to health-monitoring-agent node affinity (#381)
- Add B300 MIG profiles to GPU operator ConfigMap (#396)
- Add MIG profile support for ml.p6-b300.48xlarge (Blackwell Ultra) (#398)
- CRD updates: BYO certificate, RequestLimitsConfig, Custom Kubernetes support (#402)
- Bump hyperpod-inference-operator subchart to v2.1.0 with image tag v3.1 (#402)
- Support AWS_REGION env var, cluster context fallback, centralize boto3 client creation (#395)
- Handle pagination in cluster stack listing (#394)
- Require --instance-type when specifying accelerator resources (#393)
- Fix EFA field naming in PyTorch job template v1.1: `efa_interfaces` -> `efa`, `efa_interfaces_limit` -> `efa_limit` (#392)
- Fix deep health check nodeSelector label to `sagemaker.amazonaws.com/deep-health-check-status: Passed` (#386)
- Remove non-EFA instance types from EFA device plugin nodeAffinity to prevent CrashLoopBackOff (#389)
- Add missing instance types and fix EFA/memory resource specs (#385)
- Release Health Monitoring Agent 1.0.1434.0_1.0.388.0 (#388)
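
The stack-listing entries above (#410, #394) combine two ideas: filter out `DELETE_COMPLETE` stacks on the server side, and follow pagination so no stacks are missed. The following is a minimal sketch of that pattern using the boto3 CloudFormation paginator, not the CLI's actual implementation. The helper name `iter_stack_summaries` and the constant `ACTIVE_STACK_STATUSES` are hypothetical, and the status list is written from the CloudFormation documentation as best understood here.

```python
from typing import Any, Dict, Iterator, List

# Every CloudFormation stack status except DELETE_COMPLETE (assumed list).
# Passing these as StackStatusFilter makes the service skip deleted stacks,
# which keeps the number of ListStacks pages (and API calls) small.
ACTIVE_STACK_STATUSES: List[str] = [
    "CREATE_IN_PROGRESS", "CREATE_FAILED", "CREATE_COMPLETE",
    "ROLLBACK_IN_PROGRESS", "ROLLBACK_FAILED", "ROLLBACK_COMPLETE",
    "DELETE_IN_PROGRESS", "DELETE_FAILED",
    "UPDATE_IN_PROGRESS", "UPDATE_COMPLETE_CLEANUP_IN_PROGRESS",
    "UPDATE_COMPLETE", "UPDATE_FAILED",
    "UPDATE_ROLLBACK_IN_PROGRESS", "UPDATE_ROLLBACK_FAILED",
    "UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS", "UPDATE_ROLLBACK_COMPLETE",
    "REVIEW_IN_PROGRESS",
    "IMPORT_IN_PROGRESS", "IMPORT_COMPLETE",
    "IMPORT_ROLLBACK_IN_PROGRESS", "IMPORT_ROLLBACK_FAILED",
    "IMPORT_ROLLBACK_COMPLETE",
]


def iter_stack_summaries(cfn_client: Any) -> Iterator[Dict[str, Any]]:
    """Yield stack summaries from every page, excluding deleted stacks.

    `cfn_client` is expected to behave like boto3.client("cloudformation").
    """
    paginator = cfn_client.get_paginator("list_stacks")
    for page in paginator.paginate(StackStatusFilter=ACTIVE_STACK_STATUSES):
        # Each page carries its results under the "StackSummaries" key.
        yield from page.get("StackSummaries", [])
```

With real credentials, this could be called as `iter_stack_summaries(boto3.client("cloudformation"))`. Taking the client as a parameter keeps the helper easy to exercise with a stub in unit tests.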
Space CLI
- Added list all functionality and documentation updates
- Disabled traceback for cleaner error output
Inference Operator
- Inference Operator AddOn with NodeAffinity support and version 3.0 update
- Updated hyperpod-inference-operator to version 2.0.0 in HyperPodHelmChart
- Added AddOn migration script and README
Monitoring & Observability
- Emit metrics for CLI commands
Testing & Validation
- Added unit tests for inference CRDs
- Added CRD format check for inference
Dependencies & Versions
- Updated GPU operator container toolkit version
- Updated aws-efa-k8s-device-plugin version to 0.5.20
Configuration
- Instance types CRD changes
- Fixed syntax error in inferenceendpointconfigs by removing tab
- Add EFA support in manifest for training jobs (#345)
- Add end-to-end example documentation (#350)
- Add 4 new HyperPod GA regions (ca-central-1, ap-southeast-3, ap-southeast-4, eu-south-2) (#360)
- Update documentation for elastic training arguments (#343)
- Upgrade Inference Operator helm chart (#346)
- Update MIG config for GPU operator (#358)
- Release Health Monitoring Agent 1.0.1249.0_1.0.359.0 with enhanced Nvidia timeout analysis and bug fixes (#361)
- Fix canary test failures for GPU quota allocation integration tests (#356)
- Fix region fallback logic for health-monitoring-agent image URIs (#360)
- Remove command flag from init pytorch job integration test (#351)
- Skip expensive integration tests to improve CI performance (#355)
- Elastic training support for HyperPodTrainingOperator, released in the re:Invent 2025 keynote 3. Elastic training dynamically scales distributed machine learning operations.
- HyperPod Dev Spaces template for data scientists to create, manage, and access interactive ML development environments with configurable resource allocation and namespace isolation
- Support for KVCaching, intelligent routing, tiered storage, MIG
- Support for fractional GPU
- Support for KVCache and Intelligent Routing in template version 1.1
- Users can modify the Jinja template to add parameters supported by the CRD through the init experience, for further CLI customization
- MIG support for model deployment on SageMaker Hyperpod Inference
- Describe cluster command
- Users can run `hyp describe cluster` to get more information about HyperPod clusters
- Jinja template handling logic for inference and training
- Users can modify the Jinja template to add parameters supported by the CRD through the init experience of inference and training, for further CLI customization
- Cluster creation template versioning
- Users can choose the CloudFormation template version through the cluster creation experience
- KVCache and intelligent routing for HyperPod Inference
- Supported InferenceEndpointConfig CRD is updated to v1
- KVCache and Intelligent Routing support is added in template version 1.1
- Init Experience
- Init, Validate, and Create JumpStart endpoint, Custom endpoint, and PyTorch Training Job with local configuration
- Cluster management
- Bug fixes for cluster creation
- Fix for production canary failures caused by bad training job template.
- New version for Health Monitoring Agent (1.0.790.0_1.0.266.0) with minor improvements and bug fixes.
- Cluster management
- Bug fixes for cluster creation
- Enable cluster template to be installed with the HyperPod CLI.
- Cluster management
- Creation of cluster stack
- Describing and listing a cluster stack
- Updating a cluster
- Init Experience
- Init, Validate, Create with local configurations
- Task Governance feature for training jobs.
- Update volume flag to support hostPath and PVC
- Add an option to disable the deployment of KubeFlow TrainingOperator
- Enable telemetry for CLI
- Training Job - Create, List, Get
- Inference Jumpstart - Create, List, Get, Invoke
- Inference Custom - Create, List, Get, Invoke
- Observability changes
- feature: The HyperPod CLI now supports HyperPod recipes. HyperPod recipes enable customers to get started training and fine-tuning popular publicly available foundation models like Llama 3.1 405B in minutes.
- feature: Add support for SageMaker HyperPod CLI