aws-samples/sample-sagemaker-mlflow-rest-apis

Amazon SageMaker AI MLflow CDK Project

This CDK project demonstrates how to access Amazon SageMaker AI MLflow REST APIs via an external service. It supports two MLflow deployment modes — Tracking Server and Serverless App — and creates a Flask web application that serves as a secure proxy for external access to MLflow APIs, following AWS best practices.

Architecture Overview

The project consists of four main stacks:

  1. Networking Stack - Foundation VPC infrastructure (shared)
  2. SageMaker Domain Stack - Amazon SageMaker AI Studio domain with user profiles (shared)
  3. Managed MLflow Stack - MLflow Tracking Server or Serverless App (type-specific)
  4. Flask App Stack - Flask proxy application with ALB for MLflow API access (type-specific)

MLflow Deployment Modes

The project supports two MLflow deployment modes, selected via the mlflowType CDK context parameter:

| Aspect | Tracking Server (`tracking`) | Serverless App (`serverless`) |
|---|---|---|
| Resource | `CfnMlflowTrackingServer` | Custom Resource (Lambda-backed `create_mlflow_app`) |
| API Endpoint | `https://{region}.experiments.sagemaker.aws` | `https://mlflow.sagemaker.{region}.app.aws` |
| SigV4 Service | `sagemaker-mlflow` | `sagemaker` |
| Request Header | `x-mlflow-sm-tracking-server-arn` | `x-sm-mlflow-app-arn` |
| IAM Permission | `sagemaker-mlflow:*` | `sagemaker:CallMlflowAppApi`, `sagemaker-mlflow:*` |
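The per-mode differences in the table above can be captured in a small lookup. The sketch below is illustrative (the helper name and structure are not from the project's code); the endpoint, service, and header values are taken directly from the table:

```python
# Per-mode request parameters for proxying MLflow API calls
# (values from the comparison table; this helper is illustrative only).
MLFLOW_MODES = {
    "tracking": {
        "endpoint": "https://{region}.experiments.sagemaker.aws",
        "sigv4_service": "sagemaker-mlflow",
        "resource_header": "x-mlflow-sm-tracking-server-arn",
    },
    "serverless": {
        "endpoint": "https://mlflow.sagemaker.{region}.app.aws",
        "sigv4_service": "sagemaker",
        "resource_header": "x-sm-mlflow-app-arn",
    },
}

def mode_config(mlflow_type: str, region: str) -> dict:
    """Resolve the endpoint, SigV4 service name, and ARN header for a mode."""
    cfg = dict(MLFLOW_MODES[mlflow_type])
    cfg["endpoint"] = cfg["endpoint"].format(region=region)
    return cfg
```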

Both modes can coexist in the same account/region — the Networking and Amazon SageMaker AI Domain stacks are shared, while the MLflow and Flask App stacks are named per type (e.g., sagemaker-infra-mlflow-tracking, sagemaker-infra-flaskapp-serverless).
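The per-type naming convention that lets both modes coexist can be sketched as follows (a minimal illustration; only the `mlflow`/`flaskapp` name patterns appear in this README):

```python
def stack_names(mlflow_type: str) -> dict:
    """Derive the per-type stack names, e.g. sagemaker-infra-mlflow-tracking
    (sketch; the shared Networking and Domain stacks keep fixed names)."""
    if mlflow_type not in ("tracking", "serverless"):
        raise ValueError(f"unknown mlflowType: {mlflow_type}")
    return {
        "mlflow": f"sagemaker-infra-mlflow-{mlflow_type}",
        "flaskapp": f"sagemaker-infra-flaskapp-{mlflow_type}",
    }
```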

Stack Dependencies

NetworkingStack (Foundation - shared)
├── SageMakerDomainStack (depends on NetworkingStack - shared)
├── ManagedMlflowStack (depends on NetworkingStack + SageMakerDomainStack - per mlflowType)
└── FlaskAppStack (depends on NetworkingStack + ManagedMlflowStack - per mlflowType)
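The dependency tree above can be expressed as a small graph to compute a valid deployment order (a sketch only; `cdk deploy --all` resolves this ordering automatically):

```python
# Stack dependency graph from the tree above (sketch, not project code).
STACK_DEPS = {
    "NetworkingStack": [],
    "SageMakerDomainStack": ["NetworkingStack"],
    "ManagedMlflowStack": ["NetworkingStack", "SageMakerDomainStack"],
    "FlaskAppStack": ["NetworkingStack", "ManagedMlflowStack"],
}

def deploy_order(deps: dict) -> list:
    """Return a deployment order that respects dependencies (Kahn-style;
    assumes the graph is acyclic, as CDK requires)."""
    order, placed = [], set()
    while len(order) < len(deps):
        for stack, requires in deps.items():
            if stack not in placed and all(r in placed for r in requires):
                order.append(stack)
                placed.add(stack)
    return order
```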

Features

Networking Stack

  • Multi-AZ VPC with public, private, and isolated subnets
  • VPC endpoints for cost optimization and security
  • Security groups with least privilege access
  • NAT gateways for outbound internet access

Amazon SageMaker AI Domain Stack

  • Amazon SageMaker AI Studio domain with VPC-only access
  • IAM execution role with appropriate permissions
  • S3 bucket for Amazon SageMaker AI artifacts
  • Default user profile configuration

Managed MLflow Stack

  • Tracking Server mode: Amazon SageMaker AI MLflow tracking server with automatic model registration
  • Serverless App mode: Lambda Custom Resource that provisions a serverless MLflow app via create_mlflow_app API with auto-bundled boto3 Lambda layer (built automatically during cdk synth/cdk deploy — uses local pip/pip3 if available, falls back to Docker)
  • S3 bucket for MLflow artifacts with lifecycle policies
  • Fully managed backend (no database management required)

Flask App Stack

  • Flask proxy application supporting both tracking server and serverless MLflow modes
  • Auto-detects mode from ARN format (:mlflow-app/ in ARN = serverless)
  • SigV4 request signing with mode-appropriate service name and headers
  • Application Load Balancer (internet-facing, port 80 -> EC2 port 5000)
  • EC2 instance (Ubuntu 24.04 LTS) with AMI auto-resolved via SSM Parameter Store
  • Dedicated IAM role (FlaskMlflowRole) assumed via STS with minimal MLflow permissions
  • Security groups with least privilege access
  • Automated deployment with user data scripts (S3 helper download, systemd service setup)
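The ARN-based mode auto-detection mentioned above amounts to a one-line check. A sketch (function name and example ARNs are illustrative):

```python
def detect_mlflow_mode(resource_arn: str) -> str:
    """Classify the MLflow resource from its ARN, mirroring the proxy's
    rule: ':mlflow-app/' marks a serverless app; anything else is
    treated as a tracking server."""
    return "serverless" if ":mlflow-app/" in resource_arn else "tracking"
```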

Prerequisites

To follow along with this walkthrough, make sure you have the following prerequisites:

  • An AWS account
  • A workstation with the following tools installed:
    • AWS CLI configured with permissions to create VPC, EC2, Amazon SageMaker AI, S3, IAM, CloudFormation, and Application Load Balancer resources in the AWS account
    • Node.js v18+
      • NPM
      • AWS CDK CLI v2.100.0+
    • Python 3.x with pip/pip3 (for auto-bundling the boto3 Lambda layer during serverless deployments); if unavailable, Docker is used as a fallback

Helper Scripts

The project includes automated deployment of helper scripts to S3. Ensure the helpers/ folder contains:

  • app/main.py - Flask application (dual-mode proxy with SigV4 signing)
  • app/aws_utils.py - AWS utilities (STS AssumeRole, SigV4Auth with configurable service name)
  • app/requirements.txt - Python dependencies (boto3 >= 1.42.0)
  • mlflowproxy.service - Systemd service file (environment variables injected by CDK)
  • setup_mlflow_proxy_app.sh - Setup script
  • install_python13.sh - Python 3.13 installation script

Note: Helper scripts are automatically uploaded to S3 during CDK deployment using BucketDeployment.

Installation

  1. Clone the repository and install dependencies:

     npm install

  2. Bootstrap CDK (if not already done):

     npx cdk bootstrap aws://<ACCOUNT_ID>/<REGION>

Deployment

Deploy with Tracking Server (default)

npx cdk deploy --all

or explicitly:

npx cdk deploy --all -c mlflowType=tracking

Deploy with Serverless MLflow App

npx cdk deploy --all -c mlflowType=serverless

Deploy both modes (coexist in the same account/region)

npx cdk deploy --all -c mlflowType=tracking
npx cdk deploy --all -c mlflowType=serverless

Deploy to a different region

CDK_DEFAULT_REGION=us-west-2 npx cdk bootstrap aws://<ACCOUNT_ID>/us-west-2
CDK_DEFAULT_REGION=us-west-2 npx cdk deploy --all -c mlflowType=serverless

Configuration

Environment Variables

  • CDK_DEFAULT_ACCOUNT - AWS account ID
  • CDK_DEFAULT_REGION - AWS region (defaults to us-east-1)

Flask Application Components

The Flask application includes:

  • main.py - Flask proxy application with dual-mode MLflow API routing (tracking server / serverless)
  • aws_utils.py - AWS authentication, STS AssumeRole, and SigV4 request signing
  • mlflowproxy.service - Systemd service for automatic startup (environment variables: SM_MLFLOW_ROLE_ARN, SM_MLFLOW_AWS_REGION, SM_MLFLOW_RESOURCE_ARN)
  • setup_mlflow_proxy_app.sh - Installation and configuration script
  • install_python13.sh - Python 3.13 installation script

Cleanup

To destroy all resources for a specific mode:

npx cdk destroy --all -c mlflowType=tracking
npx cdk destroy --all -c mlflowType=serverless

Note: Some resources like S3 buckets have retention policies and may need manual deletion. The Networking and Amazon SageMaker AI Domain stacks are shared — destroy them only after both MLflow/Flask stacks are removed.

Troubleshooting

Common Issues

  1. VPC Endpoint Creation Fails: Ensure the region supports all required VPC endpoints
  2. Amazon SageMaker AI Domain Creation Timeout: Check security group rules and subnet configurations
  3. MLflow APIs Not Accessible: Verify Flask app is running and ALB health checks are passing
  4. Flask App 403 Errors: Check IAM role permissions — tracking server needs sagemaker-mlflow:*, serverless additionally needs sagemaker:CallMlflowAppApi
  5. ALB Health Check Failures: Verify EC2 instance is running and Flask app is started on port 5000
  6. Circular Dependency Errors: Ensure security group rules don't create circular references
  7. EC2 User Data Fails: Check /var/log/user-data.log via SSM — common cause is dpkg lock held by unattended-upgrades at boot
  8. Serverless MLflow App Creation Fails: The boto3 Lambda layer is auto-bundled during synthesis. If local pip/pip3 is unavailable, Docker must be running for the fallback bundling. Ensure boto3 >= 1.42.0 is installed in the layer (bundled Lambda boto3 is too old for create_mlflow_app API)
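The boto3 version requirement in item 8 can be checked up front before attempting the call. A minimal sketch (the helper name is illustrative; the 1.42.0 floor comes from this README):

```python
def supports_create_mlflow_app(boto3_version: str, minimum=(1, 42, 0)) -> bool:
    """Return True if this boto3 version is new enough for the
    create_mlflow_app API (this project requires boto3 >= 1.42.0)."""
    parts = tuple(int(p) for p in boto3_version.split(".")[:3])
    return parts >= minimum
```

At runtime you would pass `boto3.__version__` to the check.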

Contributors

This project was developed and maintained by:

  • Manish Garg
  • Ashish Bhatt
  • Ram Yennapusa
