aws-samples/sample-sagemaker-mlflow-rest-apis

Amazon SageMaker AI MLflow CDK Project

This CDK project demonstrates how to access Amazon SageMaker AI MLflow REST APIs via an external service. It supports two MLflow deployment modes — Tracking Server and Serverless App — and creates a Flask web application that serves as a secure proxy for external access to MLflow APIs, following AWS best practices.

Architecture Overview

The project consists of four main stacks:

  1. Networking Stack - Foundation VPC infrastructure (shared)
  2. SageMaker Domain Stack - Amazon SageMaker AI Studio domain with user profiles (shared)
  3. Managed MLflow Stack - MLflow Tracking Server or Serverless App (type-specific)
  4. Flask App Stack - Flask proxy application with ALB for MLflow API access (type-specific)

MLflow Deployment Modes

The project supports two MLflow deployment modes, selected via the mlflowType CDK context parameter:

| Aspect | Tracking Server (`tracking`) | Serverless App (`serverless`) |
|---|---|---|
| Resource | `CfnMlflowTrackingServer` | Custom Resource (Lambda-backed `create_mlflow_app`) |
| API Endpoint | `https://{region}.experiments.sagemaker.aws` | `https://mlflow.sagemaker.{region}.app.aws` |
| SigV4 Service | `sagemaker-mlflow` | `sagemaker` |
| Request Header | `x-mlflow-sm-tracking-server-arn` | `x-sm-mlflow-app-arn` |
| IAM Permission | `sagemaker-mlflow:*` | `sagemaker:CallMlflowAppApi`, `sagemaker-mlflow:*` |
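The per-mode differences in the table above can be captured in a small lookup. The sketch below is illustrative (the helper name and structure are not from the project's code); the endpoint, service, and header values are taken directly from the table:

```python
# Per-mode request parameters for proxying MLflow API calls
# (values from the comparison table; this helper is illustrative only).
MLFLOW_MODES = {
    "tracking": {
        "endpoint": "https://{region}.experiments.sagemaker.aws",
        "sigv4_service": "sagemaker-mlflow",
        "resource_header": "x-mlflow-sm-tracking-server-arn",
    },
    "serverless": {
        "endpoint": "https://mlflow.sagemaker.{region}.app.aws",
        "sigv4_service": "sagemaker",
        "resource_header": "x-sm-mlflow-app-arn",
    },
}

def mode_config(mlflow_type: str, region: str) -> dict:
    """Resolve the endpoint, SigV4 service name, and ARN header for a mode."""
    cfg = dict(MLFLOW_MODES[mlflow_type])
    cfg["endpoint"] = cfg["endpoint"].format(region=region)
    return cfg
```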

Both modes can coexist in the same account/region — the Networking and Amazon SageMaker AI Domain stacks are shared, while the MLflow and Flask App stacks are named per type (e.g., sagemaker-infra-mlflow-tracking, sagemaker-infra-flaskapp-serverless).
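The per-type naming convention that lets both modes coexist can be sketched as follows (a minimal illustration; only the `mlflow`/`flaskapp` name patterns appear in this README):

```python
def stack_names(mlflow_type: str) -> dict:
    """Derive the per-type stack names, e.g. sagemaker-infra-mlflow-tracking
    (sketch; the shared Networking and Domain stacks keep fixed names)."""
    if mlflow_type not in ("tracking", "serverless"):
        raise ValueError(f"unknown mlflowType: {mlflow_type}")
    return {
        "mlflow": f"sagemaker-infra-mlflow-{mlflow_type}",
        "flaskapp": f"sagemaker-infra-flaskapp-{mlflow_type}",
    }
```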

Stack Dependencies

NetworkingStack (Foundation - shared)
├── SageMakerDomainStack (depends on NetworkingStack - shared)
├── ManagedMlflowStack (depends on NetworkingStack + SageMakerDomainStack - per mlflowType)
└── FlaskAppStack (depends on NetworkingStack + ManagedMlflowStack - per mlflowType)
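The dependency tree above can be expressed as a small graph to compute a valid deployment order (a sketch only; `cdk deploy --all` resolves this ordering automatically):

```python
# Stack dependency graph from the tree above (sketch, not project code).
STACK_DEPS = {
    "NetworkingStack": [],
    "SageMakerDomainStack": ["NetworkingStack"],
    "ManagedMlflowStack": ["NetworkingStack", "SageMakerDomainStack"],
    "FlaskAppStack": ["NetworkingStack", "ManagedMlflowStack"],
}

def deploy_order(deps: dict) -> list:
    """Return a deployment order that respects dependencies (Kahn-style;
    assumes the graph is acyclic, as CDK requires)."""
    order, placed = [], set()
    while len(order) < len(deps):
        for stack, requires in deps.items():
            if stack not in placed and all(r in placed for r in requires):
                order.append(stack)
                placed.add(stack)
    return order
```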

Features

Networking Stack

  • Multi-AZ VPC with public, private, and isolated subnets
  • VPC endpoints for cost optimization and security
  • Security groups with least privilege access
  • NAT gateways for outbound internet access

Amazon SageMaker AI Domain Stack

  • Amazon SageMaker AI Studio domain with VPC-only access
  • IAM execution role with appropriate permissions
  • S3 bucket for Amazon SageMaker AI artifacts
  • Default user profile configuration

Managed MLflow Stack

  • Tracking Server mode: Amazon SageMaker AI MLflow tracking server with automatic model registration
  • Serverless App mode: Lambda Custom Resource that provisions a serverless MLflow app via create_mlflow_app API with auto-bundled boto3 Lambda layer (built automatically during cdk synth/cdk deploy — uses local pip/pip3 if available, falls back to Docker)
  • S3 bucket for MLflow artifacts with lifecycle policies
  • Fully managed backend (no database management required)

Flask App Stack

  • Flask proxy application supporting both tracking server and serverless MLflow modes
  • Auto-detects mode from ARN format (:mlflow-app/ in ARN = serverless)
  • SigV4 request signing with mode-appropriate service name and headers
  • Application Load Balancer (internet-facing, port 80 -> EC2 port 5000)
  • EC2 instance (Ubuntu 24.04 LTS) with AMI auto-resolved via SSM Parameter Store
  • Dedicated IAM role (FlaskMlflowRole) assumed via STS with minimal MLflow permissions
  • Security groups with least privilege access
  • Automated deployment with user data scripts (S3 helper download, systemd service setup)
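The ARN-based mode auto-detection mentioned above amounts to a one-line check. A sketch (function name and example ARNs are illustrative):

```python
def detect_mlflow_mode(resource_arn: str) -> str:
    """Classify the MLflow resource from its ARN, mirroring the proxy's
    rule: ':mlflow-app/' marks a serverless app; anything else is
    treated as a tracking server."""
    return "serverless" if ":mlflow-app/" in resource_arn else "tracking"
```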

Prerequisites

To follow along with this walkthrough, make sure you have the following prerequisites:

  • An AWS account
  • A workstation with the following tools installed:
    • AWS CLI configured with permissions to create VPC, EC2, Amazon SageMaker AI, S3, IAM, CloudFormation, and Application Load Balancer resources in the AWS account
    • Node.js v18+
      • NPM
      • AWS CDK CLI v2.100.0+
    • Python 3.x with pip/pip3 (for auto-bundling the boto3 Lambda layer during serverless deployments); if unavailable, Docker is used as a fallback

Helper Scripts

The project includes automated deployment of helper scripts to S3. Ensure the helpers/ folder contains:

  • app/main.py - Flask application (dual-mode proxy with SigV4 signing)
  • app/aws_utils.py - AWS utilities (STS AssumeRole, SigV4Auth with configurable service name)
  • app/requirements.txt - Python dependencies (boto3 >= 1.42.0)
  • mlflowproxy.service - Systemd service file (environment variables injected by CDK)
  • setup_mlflow_proxy_app.sh - Setup script
  • install_python13.sh - Python 3.13 installation script

Note: Helper scripts are automatically uploaded to S3 during CDK deployment using BucketDeployment.

Installation

  1. Clone the repository and install dependencies:

     npm install

  2. Bootstrap CDK (if not already done):

     npx cdk bootstrap aws://<ACCOUNT_ID>/<REGION>

Deployment

Deploy with Tracking Server (default)

npx cdk deploy --all

or explicitly:

npx cdk deploy --all -c mlflowType=tracking

Deploy with Serverless MLflow App

npx cdk deploy --all -c mlflowType=serverless

Deploy both modes (coexist in the same account/region)

npx cdk deploy --all -c mlflowType=tracking
npx cdk deploy --all -c mlflowType=serverless

Deploy to a different region

CDK_DEFAULT_REGION=us-west-2 npx cdk bootstrap aws://<ACCOUNT_ID>/us-west-2
CDK_DEFAULT_REGION=us-west-2 npx cdk deploy --all -c mlflowType=serverless

Configuration

Environment Variables

  • CDK_DEFAULT_ACCOUNT - AWS account ID
  • CDK_DEFAULT_REGION - AWS region (defaults to us-east-1)

Flask Application Components

The Flask application includes:

  • main.py - Flask proxy application with dual-mode MLflow API routing (tracking server / serverless)
  • aws_utils.py - AWS authentication, STS AssumeRole, and SigV4 request signing
  • mlflowproxy.service - Systemd service for automatic startup (environment variables: SM_MLFLOW_ROLE_ARN, SM_MLFLOW_AWS_REGION, SM_MLFLOW_RESOURCE_ARN)
  • setup_mlflow_proxy_app.sh - Installation and configuration script
  • install_python13.sh - Python 3.13 installation script

Cleanup

To destroy all resources for a specific mode:

npx cdk destroy --all -c mlflowType=tracking
npx cdk destroy --all -c mlflowType=serverless

Note: Some resources like S3 buckets have retention policies and may need manual deletion. The Networking and Amazon SageMaker AI Domain stacks are shared — destroy them only after both MLflow/Flask stacks are removed.

Troubleshooting

Common Issues

  1. VPC Endpoint Creation Fails: Ensure the region supports all required VPC endpoints
  2. Amazon SageMaker AI Domain Creation Timeout: Check security group rules and subnet configurations
  3. MLflow APIs Not Accessible: Verify Flask app is running and ALB health checks are passing
  4. Flask App 403 Errors: Check IAM role permissions — tracking server needs sagemaker-mlflow:*, serverless additionally needs sagemaker:CallMlflowAppApi
  5. ALB Health Check Failures: Verify EC2 instance is running and Flask app is started on port 5000
  6. Circular Dependency Errors: Ensure security group rules don't create circular references
  7. EC2 User Data Fails: Check /var/log/user-data.log via SSM — common cause is dpkg lock held by unattended-upgrades at boot
  8. Serverless MLflow App Creation Fails: The boto3 Lambda layer is auto-bundled during synthesis. If local pip/pip3 is unavailable, Docker must be running for the fallback bundling. Ensure boto3 >= 1.42.0 is installed in the layer (bundled Lambda boto3 is too old for create_mlflow_app API)
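The boto3 version requirement in item 8 can be checked up front before attempting the call. A minimal sketch (the helper name is illustrative; the 1.42.0 floor comes from this README):

```python
def supports_create_mlflow_app(boto3_version: str, minimum=(1, 42, 0)) -> bool:
    """Return True if this boto3 version is new enough for the
    create_mlflow_app API (this project requires boto3 >= 1.42.0)."""
    parts = tuple(int(p) for p in boto3_version.split(".")[:3])
    return parts >= minimum
```

At runtime you would pass `boto3.__version__` to the check.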

Contributors

This project was developed and maintained by:

  • Manish Garg
  • Ashish Bhatt
  • Ram Yennapusa
