This CDK project demonstrates how to access Amazon SageMaker AI MLflow REST APIs through an external service. It supports two MLflow deployment modes, Tracking Server and Serverless App, and creates a Flask web application that serves as a secure proxy for external access to the MLflow APIs, following AWS best practices.
The project consists of four main stacks:
- Networking Stack - Foundation VPC infrastructure (shared)
- SageMaker Domain Stack - Amazon SageMaker AI Studio domain with user profiles (shared)
- Managed MLflow Stack - MLflow Tracking Server or Serverless App (type-specific)
- Flask App Stack - Flask proxy application with ALB for MLflow API access (type-specific)
The project supports two MLflow deployment modes, selected via the mlflowType CDK context parameter:
| Aspect | Tracking Server (`tracking`) | Serverless App (`serverless`) |
|---|---|---|
| Resource | `CfnMlflowTrackingServer` | Custom Resource (Lambda-backed `create_mlflow_app`) |
| API Endpoint | `https://{region}.experiments.sagemaker.aws` | `https://mlflow.sagemaker.{region}.app.aws` |
| SigV4 Service | `sagemaker-mlflow` | `sagemaker` |
| Request Header | `x-mlflow-sm-tracking-server-arn` | `x-sm-mlflow-app-arn` |
| IAM Permission | `sagemaker-mlflow:*` | `sagemaker:CallMlflowAppApi`, `sagemaker-mlflow:*` |
Both modes can coexist in the same account/region — the Networking and Amazon SageMaker AI Domain stacks are shared, while the MLflow and Flask App stacks are named per type (e.g., sagemaker-infra-mlflow-tracking, sagemaker-infra-flaskapp-serverless).
NetworkingStack (Foundation - shared)
├── SageMakerDomainStack (depends on NetworkingStack - shared)
├── ManagedMlflowStack (depends on NetworkingStack + SageMakerDomainStack - per mlflowType)
└── FlaskAppStack (depends on NetworkingStack + ManagedMlflowStack - per mlflowType)
- Multi-AZ VPC with public, private, and isolated subnets
- VPC endpoints for cost optimization and security
- Security groups with least privilege access
- NAT gateways for outbound internet access
- Amazon SageMaker AI Studio domain with VPC-only access
- IAM execution role with appropriate permissions
- S3 bucket for Amazon SageMaker AI artifacts
- Default user profile configuration
- Tracking Server mode: Amazon SageMaker AI MLflow tracking server with automatic model registration
- Serverless App mode: Lambda Custom Resource that provisions a serverless MLflow app via the `create_mlflow_app` API, with an auto-bundled boto3 Lambda layer (built automatically during `cdk synth`/`cdk deploy`; uses local `pip`/`pip3` if available, falls back to Docker)
- S3 bucket for MLflow artifacts with lifecycle policies
- Fully managed backend (no database management required)
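The pip-vs-Docker bundling decision described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `find_local_pip` and `layer_install_command` are hypothetical helper names.

```python
import shutil
from pathlib import Path
from typing import List, Optional


def find_local_pip() -> Optional[str]:
    """Locate a usable pip on PATH, preferring `pip` over `pip3`."""
    return shutil.which("pip") or shutil.which("pip3")


def layer_install_command(pip: Optional[str], out_dir: str,
                          version: str = "1.42.0") -> Optional[List[str]]:
    """Build the command that installs boto3 into the layer's python/ dir.

    Returns None when no local pip exists, signalling the caller to fall
    back to CDK's Docker-based bundling instead.
    """
    if pip is None:
        return None  # no local pip: use Docker bundling
    target = Path(out_dir) / "python"  # Lambda layers expect a python/ root
    return [pip, "install", f"boto3>={version}", "--target", str(target)]
```

During synthesis, the returned command would then be run (for example with `subprocess.run(cmd, check=True)`) before the layer directory is zipped and attached to the Lambda function.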
- Flask proxy application supporting both tracking server and serverless MLflow modes
- Auto-detects the mode from the ARN format (`:mlflow-app/` in the ARN indicates serverless)
- SigV4 request signing with mode-appropriate service name and headers
- Application Load Balancer (internet-facing, port 80 -> EC2 port 5000)
- EC2 instance (Ubuntu 24.04 LTS) with AMI auto-resolved via SSM Parameter Store
- Dedicated IAM role (`FlaskMlflowRole`) assumed via STS with minimal MLflow permissions
- Security groups with least privilege access
- Automated deployment with user data scripts (S3 helper download, systemd service setup)
To follow along with this walkthrough, make sure you have the following prerequisites:
- An AWS account
- A workstation with the following tools installed:
- AWS CLI configured with permissions to create VPC, EC2, Amazon SageMaker AI, S3, IAM, CloudFormation, and Application Load Balancer resources in the AWS account
- Node.js v18+
- NPM
- AWS CDK CLI v2.100.0+
- Python 3.x with `pip`/`pip3` (for auto-bundling the boto3 Lambda layer during serverless deployments); if unavailable, Docker is used as a fallback
The project includes automated deployment of helper scripts to S3. Ensure the helpers/ folder contains:
- `app/main.py` - Flask application (dual-mode proxy with SigV4 signing)
- `app/aws_utils.py` - AWS utilities (STS AssumeRole, SigV4Auth with a configurable service name)
- `app/requirements.txt` - Python dependencies (boto3 >= 1.42.0)
- `mlflowproxy.service` - Systemd service file (environment variables injected by CDK)
- `setup_mlflow_proxy_app.sh` - Setup script
- `install_python13.sh` - Python 3.13 installation script
Note: Helper scripts are automatically uploaded to S3 during CDK deployment using BucketDeployment.
- Clone the repository and install dependencies:

  ```
  npm install
  ```

- Bootstrap CDK (if not already done):

  ```
  npx cdk bootstrap aws://<ACCOUNT_ID>/<REGION>
  ```

- Deploy all stacks with `npx cdk deploy --all`, or select a mode explicitly:

  ```
  npx cdk deploy --all -c mlflowType=tracking
  npx cdk deploy --all -c mlflowType=serverless
  ```

- To bootstrap and deploy in a specific region:

  ```
  CDK_DEFAULT_REGION=us-west-2 npx cdk bootstrap aws://<ACCOUNT_ID>/us-west-2
  CDK_DEFAULT_REGION=us-west-2 npx cdk deploy --all -c mlflowType=serverless
  ```

Environment variables:

- `CDK_DEFAULT_ACCOUNT` - AWS account ID
- `CDK_DEFAULT_REGION` - AWS region (defaults to us-east-1)
The Flask application includes:
- main.py - Flask proxy application with dual-mode MLflow API routing (tracking server / serverless)
- aws_utils.py - AWS authentication, STS AssumeRole, and SigV4 request signing
- mlflowproxy.service - Systemd service for automatic startup (environment variables: `SM_MLFLOW_ROLE_ARN`, `SM_MLFLOW_AWS_REGION`, `SM_MLFLOW_RESOURCE_ARN`)
- setup_mlflow_proxy_app.sh - Installation and configuration script
- install_python13.sh - Python 3.13 installation script
To destroy all resources for a specific mode:
```
npx cdk destroy --all -c mlflowType=tracking
npx cdk destroy --all -c mlflowType=serverless
```

Note: Some resources, such as S3 buckets, have retention policies and may need manual deletion. The Networking and Amazon SageMaker AI Domain stacks are shared; destroy them only after both MLflow/Flask stacks are removed.
- VPC Endpoint Creation Fails: Ensure the region supports all required VPC endpoints
- Amazon SageMaker AI Domain Creation Timeout: Check security group rules and subnet configurations
- MLflow APIs Not Accessible: Verify Flask app is running and ALB health checks are passing
- Flask App 403 Errors: Check IAM role permissions; the tracking server needs `sagemaker-mlflow:*`, and serverless additionally needs `sagemaker:CallMlflowAppApi`
- ALB Health Check Failures: Verify the EC2 instance is running and the Flask app is started on port 5000
- Circular Dependency Errors: Ensure security group rules don't create circular references
- EC2 User Data Fails: Check `/var/log/user-data.log` via SSM; a common cause is the dpkg lock held by `unattended-upgrades` at boot
- Serverless MLflow App Creation Fails: The boto3 Lambda layer is auto-bundled during synthesis. If local `pip`/`pip3` is unavailable, Docker must be running for the fallback bundling. Ensure boto3 >= 1.42.0 is installed in the layer (the Lambda runtime's bundled boto3 is too old for the `create_mlflow_app` API)
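The boto3 version requirement from the last item can be verified quickly at runtime. This sketch assumes 1.42.0 as the minimum version per the requirements above; `supports_create_mlflow_app` is a hypothetical helper, not part of the project.

```python
def supports_create_mlflow_app(boto3_version: str) -> bool:
    """Return True when the given boto3 version string is >= 1.42.0,
    the minimum assumed here to expose the create_mlflow_app API."""
    parts = tuple(int(p) for p in boto3_version.split(".")[:3])
    return parts >= (1, 42, 0)


# Example usage inside the Lambda handler:
#   import boto3
#   if not supports_create_mlflow_app(boto3.__version__):
#       raise RuntimeError("boto3 layer too old for create_mlflow_app")
```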
This project was developed and maintained by:
- Manish Garg
- Ashish Bhatt
- Ram Yennapusa