Changelog

v3.8.0 (2026-04-16)

Features

  • Add Recipe Job Init Experience: hyp init hyp-recipe-job command for launching HyperPod recipe jobs (#409)
  • Introduce replica_count parameter for training jobs; deprecate node_count in favor of replica_count (#401)
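
A parameter rename like this is commonly handled with a small compatibility shim. The sketch below is not the CLI's actual implementation; `resolve_replica_count` is a hypothetical helper that shows one way to keep accepting the deprecated `node_count` while preferring `replica_count`.

```python
import warnings
from typing import Optional


def resolve_replica_count(
    replica_count: Optional[int] = None,
    node_count: Optional[int] = None,
) -> Optional[int]:
    """Hypothetical helper: prefer replica_count, fall back to node_count."""
    if node_count is not None:
        # Tell callers the old name is going away.
        warnings.warn(
            "node_count is deprecated; use replica_count instead",
            DeprecationWarning,
            stacklevel=2,
        )
        if replica_count is None:
            return node_count
        if replica_count != node_count:
            # Refuse to guess when the two values conflict.
            raise ValueError("replica_count and node_count disagree")
    return replica_count
```

Raising on conflicting values, rather than silently picking one, is a design choice made for this sketch; the real CLI may behave differently.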

Bug Fixes

  • Exclude DELETE_COMPLETE stacks at API level to prevent ListStacks throttling (#410)
  • Fix ml. prefix and wrong MIG profiles for p6-b200.48xlarge (#399)
  • Fix namespaced-role-and-bindings helm chart bugs (#406)
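
Filtering deleted stacks "at API level" presumably means passing a status filter to CloudFormation's `ListStacks` rather than discarding results client-side. A minimal sketch, assuming a boto3 CloudFormation client is passed in; `iter_live_stacks` and `ACTIVE_STATUSES` are hypothetical names, and the status list is an illustrative subset rather than the full set the API accepts.

```python
from typing import Any, Dict, Iterator, List

# Illustrative subset of CloudFormation stack statuses (not exhaustive).
# DELETE_COMPLETE is deliberately left out so the service never returns it.
ACTIVE_STATUSES: List[str] = [
    "CREATE_IN_PROGRESS", "CREATE_COMPLETE", "CREATE_FAILED",
    "ROLLBACK_IN_PROGRESS", "ROLLBACK_COMPLETE", "ROLLBACK_FAILED",
    "UPDATE_IN_PROGRESS", "UPDATE_COMPLETE",
    "UPDATE_ROLLBACK_IN_PROGRESS", "UPDATE_ROLLBACK_COMPLETE",
    "DELETE_IN_PROGRESS", "DELETE_FAILED",
]


def iter_live_stacks(cfn_client: Any) -> Iterator[Dict[str, Any]]:
    """Yield stack summaries, letting the service drop deleted stacks."""
    paginator = cfn_client.get_paginator("list_stacks")
    # The filter is applied server-side, so fewer pages are fetched.
    for page in paginator.paginate(StackStatusFilter=ACTIVE_STATUSES):
        yield from page.get("StackSummaries", [])
```

With boto3 installed, this could be called as `iter_live_stacks(boto3.client("cloudformation"))`. Fewer pages means fewer `ListStacks` calls, which is the usual way to reduce throttling.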

Health Monitoring Agent

  • Release Health Monitoring Agent 1.0.1481.0_1.0.392.0 with bug fixes (#405)

v3.7.1 (2026-04-08)

New Instance Type Support

  • Add g7e instance types to HyperPod helm chart values (nvidia/EFA device plugins) (#380)
  • Add g7e instance types to Python constants and CLI (#385, #390)
  • Add g7e instance types to health-monitoring-agent node affinity (#381)
  • Add B300 MIG profiles to GPU operator ConfigMap (#396)
  • Add MIG profile support for ml.p6-b300.48xlarge (Blackwell Ultra) (#398)

Inference Operator

  • CRD updates: BYO certificate, RequestLimitsConfig, Custom Kubernetes support (#402)
  • Bump hyperpod-inference-operator subchart to v2.1.0 with image tag v3.1 (#402)

Enhancements

  • Support AWS_REGION env var, cluster context fallback, centralize boto3 client creation (#395)
  • Handle pagination in cluster stack listing (#394)
  • Require --instance-type when specifying accelerator resources (#393)
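
The region fallback described above can be pictured as a short precedence chain. The sketch below is an assumption, not the CLI's code: `resolve_region` is a hypothetical function, and the order (explicit value, then `AWS_REGION`, then the region embedded in an EKS cluster ARN used as the kubeconfig context name) is a guess at a sensible precedence.

```python
import os
from typing import Mapping, Optional


def resolve_region(
    explicit: Optional[str] = None,
    context_name: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical precedence: explicit > AWS_REGION > cluster context ARN."""
    env = os.environ if env is None else env
    if explicit:
        return explicit
    if env.get("AWS_REGION"):
        return env["AWS_REGION"]
    # Assumed context format: arn:aws:eks:<region>:<account>:cluster/<name>
    if context_name and context_name.startswith("arn:"):
        parts = context_name.split(":")
        if len(parts) > 3 and parts[3]:
            return parts[3]
    raise ValueError("Could not determine AWS region")
```

Centralizing this logic in one function means every client is created with the same region, which is the likely motivation behind centralizing boto3 client creation.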

Bug Fixes

  • Fix EFA field naming in PyTorch job template v1.1: efa_interfaces -> efa, efa_interfaces_limit -> efa_limit (#392)
  • Fix deep health check nodeSelector label to sagemaker.amazonaws.com/deep-health-check-status: Passed (#386)
  • Remove non-EFA instance types from EFA device plugin nodeAffinity to prevent CrashLoopBackOff (#389)
  • Add missing instance types and fix EFA/memory resource specs (#385)
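
For configurations that still use the old EFA field names, the rename can be applied mechanically. This is only an illustrative sketch: `migrate_efa_fields` is a hypothetical helper, and the CLI is not known to ship such a migration.

```python
from typing import Any, Dict

# Old name -> new name, taken from the changelog entry above.
RENAMED_FIELDS: Dict[str, str] = {
    "efa_interfaces": "efa",
    "efa_interfaces_limit": "efa_limit",
}


def migrate_efa_fields(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of config with old EFA keys renamed; other keys untouched."""
    return {RENAMED_FIELDS.get(key, key): value for key, value in config.items()}
```

The helper does not try to reconcile a config that contains both an old and a new name for the same field; whichever key appears later wins.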

Health Monitoring Agent

  • Release Health Monitoring Agent 1.0.1434.0_1.0.388.0 (#388)

v3.7.0 (2026-03-02)

Space CLI

  • Added list all functionality and documentation updates
  • Disabled traceback for cleaner error output

Inference Operator

  • Inference Operator AddOn with NodeAffinity support and version 3.0 update
  • Updated hyperpod-inference-operator to version 2.0.0 in HyperPodHelmChart
  • Added AddOn migration script and README

Enhancements

Monitoring & Observability

  • Emit metrics for CLI commands

Testing & Validation

  • Added unit tests for inference CRDs
  • Added CRD format check for inference

Dependencies & Versions

  • Updated GPU operator container toolkit version
  • Updated aws-efa-k8s-device-plugin version to 0.5.20

Configuration

  • Instance types CRD changes

Bug Fixes

  • Fixed syntax error in inferenceendpointconfigs by removing tab

v3.6.0 (2026-01-27)

Features

  • Add EFA support in manifest for training jobs (#345)
  • Add end-to-end example documentation (#350)
  • Add 4 new HyperPod GA regions (ca-central-1, ap-southeast-3, ap-southeast-4, eu-south-2) (#360)

Enhancements

  • Update documentation for elastic training arguments (#343)
  • Upgrade Inference Operator helm chart (#346)
  • Update MIG config for GPU operator (#358)
  • Release Health Monitoring Agent 1.0.1249.0_1.0.359.0 with enhanced Nvidia timeout analysis and bug fixes (#361)

Bug Fixes

  • Fix canary test failures for GPU quota allocation integration tests (#356)
  • Fix region fallback logic for health-monitoring-agent image URIs (#360)
  • Remove command flag from init pytorch job integration test (#351)
  • Skip expensive integration tests to improve CI performance (#355)

v3.5.0 (2025-12-03)

Features

  • Elastic training support for HyperPodTrainingOperator, released in re:Invent 2025 keynote 3. Elastic training dynamically scales distributed machine learning operations.

v3.4.0 (2025-11-20)

Features

  • HyperPod Dev Spaces template for data scientists to create, manage, and access interactive ML development environments with configurable resource allocation and namespace isolation
  • Support for KVCaching, intelligent routing, tiered storage, and MIG
  • Support for fractional GPU
  • Support for KVCache and Intelligent Routing in template version 1.1
  • Users can modify the Jinja template through the init experience to add parameters supported by the CRD, for further CLI customization
  • MIG support for model deployment on SageMaker HyperPod Inference

v3.3.1 (2025-10-30)

Features

  • Describe cluster command
    • Users can run hyp describe cluster to get more information about HyperPod clusters
  • Jinja template handling logic for inference and training
    • Users can modify the Jinja template through the init experience of inference and training to add parameters supported by the CRD, for further CLI customization
  • Cluster creation template versioning
    • Users can choose the CloudFormation template version through the cluster creation experience
  • KVCache and intelligent routing for HyperPod Inference
    • The supported InferenceEndpointConfig CRD version is updated to v1
    • KVCache and Intelligent Routing support is added in template version 1.1

v3.3.0 (2025-09-23)

Features

  • Init Experience
    • Init, Validate, and Create JumpStart endpoint, Custom endpoint, and PyTorch Training Job with local configuration
  • Cluster management
    • Bug fixes for cluster creation

v3.2.2 (2025-09-10)

Features

  • Fix for production canary failures caused by bad training job template.
  • New version for Health Monitoring Agent (1.0.790.0_1.0.266.0) with minor improvements and bug fixes.

v3.2.1 (2025-08-27)

Features

  • Cluster management
    • Bug fixes for cluster creation
    • Enable the cluster template to be installed with the HyperPod CLI.

v3.2.0 (2025-08-25)

Features

  • Cluster management
    • Creation of cluster stack
    • Describing and listing a cluster stack
    • Updating a cluster
  • Init Experience
    • Init, Validate, Create with local configurations

v3.1.0 (2025-08-13)

Features

  • Task Governance feature for training jobs.

v3.0.2 (2025-07-31)

Features

  • Update volume flag to support hostPath and PVC
  • Add an option to disable the deployment of KubeFlow TrainingOperator
  • Enable telemetry for CLI

v3.0.0 (2025-07-10)

Features

  • Training Job - Create, List, Get
  • Inference Jumpstart - Create, List, Get, Invoke
  • Inference Custom - Create, List, Get, Invoke
  • Observability changes

v2.0.0 (2024-12-04)

Features

  • feature: The HyperPod CLI now supports HyperPod recipes. HyperPod recipes enable customers to get started training and fine-tuning popular publicly available foundation models like Llama 3.1 405B in minutes. Learn more (here).

v1.0.0 (2024-09-09)

Features

  • feature: Add support for SageMaker HyperPod CLI