Production deployment patterns and operational best practices.
- Container Deployment
- Azure Container Apps Job (GPU) — azd
- Batch Mode (Azure SDK)
- Resource Requirements
- Storage Configuration
- Azure Blob Storage Setup
- Security Hardening
- Monitoring and Logging
Create docker-compose.yml:
version: '3.8'
services:
3dgs-processor:
image: 3dgs-processor:gpu
container_name: 3dgs-processor
restart: unless-stopped
# Resource limits
deploy:
resources:
limits:
cpus: '8'
memory: 16G
reservations:
cpus: '4'
memory: 8G
# Volume mounts
volumes:
- ./input:/data/input
- ./output:/data/output
- ./processed:/data/processed
- ./error:/data/error
- ./config.yaml:/app/config.yaml:ro
# Environment configuration
environment:
INPUT_PATH: /data/input
OUTPUT_PATH: /data/output
PROCESSED_PATH: /data/processed
ERROR_PATH: /data/error
BACKEND: gsplat
LOG_LEVEL: info
STABILITY_TIMEOUT_SECS: 30
MAX_RETRIES: 3
RETENTION_DAYS: 30
MIN_DISK_SPACE_GB: 50
# Logging configuration
logging:
driver: json-file
options:
max-size: "100m"
max-file: "5"
# Health check
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40sDeploy:
docker-compose up -d
docker-compose logs -f 3dgs-processorCreate k8s/deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: 3dgs-processor
namespace: 3dgs
spec:
replicas: 1 # Single instance (sequential processing)
selector:
matchLabels:
app: 3dgs-processor
template:
metadata:
labels:
app: 3dgs-processor
spec:
containers:
- name: processor
image: 3dgs-processor:gpu
imagePullPolicy: Always
resources:
requests:
cpu: "4"
memory: "8Gi"
limits:
cpu: "8"
memory: "16Gi"
env:
- name: INPUT_PATH
value: "/data/input"
- name: OUTPUT_PATH
value: "/data/output"
- name: PROCESSED_PATH
value: "/data/processed"
- name: ERROR_PATH
value: "/data/error"
- name: BACKEND
valueFrom:
configMapKeyRef:
name: 3dgs-config
key: backend
volumeMounts:
- name: data
mountPath: /data
- name: config
mountPath: /app/config.yaml
subPath: config.yaml
readOnly: true
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 60
periodSeconds: 30
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
volumes:
- name: data
persistentVolumeClaim:
claimName: 3dgs-pvc
- name: config
configMap:
name: 3dgs-configCreate k8s/pvc.yaml for persistent storage:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: 3dgs-pvc
namespace: 3dgs
spec:
accessModes:
- ReadWriteOnce
storageClassName: fast-ssd
resources:
requests:
storage: 500GiDeploy:
kubectl create namespace 3dgs
kubectl apply -f k8s/pvc.yaml
kubectl apply -f k8s/deployment.yaml
kubectl logs -f deployment/3dgs-processor -n 3dgsDeploy serverless containers with Azure Files or Blobfuse2:
Option 1: Using Azure Files (Simpler)
# Set variables
RESOURCE_GROUP="<your-resource-group>"
LOCATION="eastus"
CONTAINER_NAME="3dgs-processor"
IMAGE="youracr.azurecr.io/3dgs-processor:gpu"
STORAGE_ACCOUNT="<your-storage-account>"
STORAGE_KEY="<your-storage-key>"
# Create container with Azure Files mount
az container create \
--resource-group $RESOURCE_GROUP \
--name $CONTAINER_NAME \
--image $IMAGE \
--cpu 8 \
--memory 16 \
--os-type Linux \
--restart-policy Always \
--environment-variables \
INPUT_PATH=/data/input \
OUTPUT_PATH=/data/output \
PROCESSED_PATH=/data/processed \
ERROR_PATH=/data/error \
BACKEND=gsplat \
LOG_LEVEL=info \
--azure-file-volume-account-name $STORAGE_ACCOUNT \
--azure-file-volume-account-key "$STORAGE_KEY" \
--azure-file-volume-share-name input \
--azure-file-volume-mount-path /data/inputOption 2: Using Managed Identity with Azure Blob Storage
Create aci-deployment.yaml:
apiVersion: '2021-09-01'
location: eastus
name: 3dgs-processor-group
properties:
containers:
- name: 3dgs-processor
properties:
image: youracr.azurecr.io/3dgs-processor:gpu
resources:
requests:
cpu: 8
memoryInGB: 16
environmentVariables:
- name: INPUT_PATH
value: /data/input
- name: OUTPUT_PATH
value: /data/output
- name: PROCESSED_PATH
value: /data/processed
- name: ERROR_PATH
value: /data/error
- name: BACKEND
value: gsplat
- name: AZURE_STORAGE_ACCOUNT
value: <your-storage-account>
volumeMounts:
- name: input-volume
mountPath: /data/input
- name: output-volume
mountPath: /data/output
volumes:
- name: input-volume
azureFile:
shareName: input
storageAccountName: <your-storage-account>
storageAccountKey: <key>
- name: output-volume
azureFile:
shareName: output
storageAccountName: <your-storage-account>
storageAccountKey: <key>
osType: Linux
restartPolicy: Always
identity:
type: SystemAssigned
type: Microsoft.ContainerInstance/containerGroupsDeploy:
az container create --resource-group $RESOURCE_GROUP --file aci-deployment.yamlMonitor ACI deployment:
# View logs
az container logs --resource-group $RESOURCE_GROUP --name $CONTAINER_NAME --follow
# Check status
az container show --resource-group $RESOURCE_GROUP --name $CONTAINER_NAME --query instanceView.state
# Execute commands inside container
az container exec --resource-group $RESOURCE_GROUP --name $CONTAINER_NAME --exec-command "/bin/sh"
# Delete when done
az container delete --resource-group $RESOURCE_GROUP --name $CONTAINER_NAMECost Optimization:
- Use Azure Container Instances for variable workloads (pay per second)
- Use AKS for consistent high-volume processing
- Set
--restart-policy OnFailurefor one-time jobs - Use spot instances for non-critical workloads
Deploy the 3DGS processor as a serverless GPU job on Azure Container Apps using the
Azure Developer CLI (azd). The job runs in batch mode: download videos from Azure Blob
Storage → extract frames → COLMAP reconstruction → gsplat GPU training → export PLY/SPLAT
→ upload results → exit.
Key characteristics:
- Serverless GPU — NVIDIA T4 (16 GB VRAM) via
Consumption-GPU-NC8as-T4workload profile - No local Docker required — images are built remotely via ACR Tasks
- Batch mode — single job execution, no long-running container
- Managed Identity — user-assigned MI for RBAC-based access to ACR and Storage
- RBAC separation — infrastructure provisioning and role assignments are separate steps
- Existing resource preservation — optionally reuse pre-existing Resource Groups, ACR,
Storage Accounts, and Container Apps Environments. These resources are referenced via
Bicep's
existingkeyword and are not deleted byazd down
| Requirement | Purpose |
|---|---|
Azure CLI (az) |
Azure resource management |
Azure Developer CLI (azd) |
Infrastructure-as-code orchestration |
| Azure subscription | With GPU quota in your target region |
| Test data | Download via ./scripts/e2e/01-download-testdata.sh |
No local Docker daemon is needed. The GPU image is built remotely on Azure Container Registry Tasks.
# 1. Initialize the azd environment
azd init
# 2. Configure environment
azd env set AZURE_LOCATION swedencentral
azd env set USE_GPU true
# 3. (Optional) Reuse existing Azure resources instead of creating new ones
# See "Reusing Existing Azure Resources" below for details.
./scripts/configure-existing-resources.sh
# 4. Provision infrastructure (also builds the GPU image on ACR — ~40 min)
azd provision
# 5. (Privileged user) Assign RBAC roles to the Managed Identity
./infra/scripts/assign-rbac.sh
# 6. Verify RBAC assignments
./infra/scripts/verify-rbac.sh
# 7. Upload test data to Azure Blob Storage
./infra/scripts/upload-testdata.sh
# 8. Run the GPU job
./infra/scripts/run-job.sh --logsazd provision creates the following Azure resources (all in a single resource group).
If you have configured existing resources via ./scripts/configure-existing-resources.sh,
those resources are referenced (Bicep existing keyword) rather than created — their
lifecycle is not managed by this deployment, and azd down will not delete them.
| Resource | Bicep Module | Purpose | Can Reuse Existing? |
|---|---|---|---|
| Resource Group | main.bicep |
rg-<env-name> |
✅ Yes |
| User-Assigned Managed Identity | modules/managed-identity.bicep |
Authenticate to ACR and Storage | No (always created) |
| Azure Container Registry (Basic) | modules/acr.bicep |
Store GPU Docker images | ✅ Yes |
| Storage Account + 4 containers | modules/storage.bicep |
input, output, processed, error |
✅ Yes |
| Log Analytics Workspace | modules/monitoring.bicep |
Container log aggregation | ✅ Yes |
| Container Apps Environment | modules/container-apps-env.bicep |
GPU workload profile (T4) | ✅ Yes |
| Container Apps Job (Manual trigger) | modules/container-apps-job.bicep |
The processor job itself | No (always created) |
After provisioning, the postprovision hook automatically:
- Builds the GPU Docker image via ACR Tasks (
infra/scripts/hooks/acr-build.sh) - Updates the Container Apps Job with the new image
RBAC role assignments are intentionally separated from infrastructure provisioning. This is because in many organizations, the person deploying infrastructure does not have permission to assign IAM roles. The two operations may require different privilege levels.
The deployer needs these permissions on the Azure subscription or resource group:
| Permission | Why |
|---|---|
Microsoft.Resources/subscriptions/resourceGroups/write |
Create the resource group |
Microsoft.ContainerRegistry/registries/* |
Create ACR and push images |
Microsoft.Storage/storageAccounts/* |
Create storage account and containers |
Microsoft.App/managedEnvironments/* |
Create Container Apps Environment |
Microsoft.App/jobs/* |
Create Container Apps Job |
Microsoft.ManagedIdentity/userAssignedIdentities/* |
Create managed identity |
Microsoft.OperationalInsights/workspaces/* |
Create Log Analytics workspace |
Typically the Contributor built-in role on the subscription is sufficient.
The preprovision hook (infra/scripts/hooks/preprovision.sh) also attempts to assign
deployer-level RBAC (AcrPush, Storage Blob Data Contributor) so the deployer can push
images and upload test data. This requires the deployer to have User Access Administrator
or Owner on the resource group. If the deployer lacks this permission, the hook
continues (it's non-blocking) — a privileged user can assign these roles later.
The Managed Identity needs two role assignments so the container job can pull images and read/write blobs at runtime:
| Role | Scope | Role Definition ID | Purpose |
|---|---|---|---|
| AcrPull | Container Registry | 7f951dda-4ed3-4680-a7ca-43fe172d538d |
Pull GPU image from ACR |
| Storage Blob Data Contributor | Storage Account | ba92f5b4-2d11-453d-a403-e96b0029c9fe |
Download input videos, upload outputs, move blobs |
A user with one of these roles on the target scope:
- Owner (full control, including RBAC)
- User Access Administrator (RBAC management only)
- A custom role with
Microsoft.Authorization/roleAssignments/writepermission
If the deployer does not have these permissions, the
azd provisionstep will succeed but the job will fail at runtime with authentication errors (HTTP 403 on Storage or image pull failures on ACR). Have a privileged user run the RBAC scripts.
# Assign roles via Azure CLI (reads values from azd env automatically)
./infra/scripts/assign-rbac.sh
# Or assign via Bicep deployment (alternative)
./infra/scripts/assign-rbac.sh --use-bicep# Check that both roles are assigned
./infra/scripts/verify-rbac.shExpected output when roles are correctly assigned:
🔍 Verifying RBAC role assignments for Managed Identity...
Principal ID: <managed-identity-principal-id>
✅ AcrPull on Container Registry
✅ Storage Blob Data Contributor on Storage Account
✅ All RBAC role assignments are in place.
If any roles are missing:
❌ AcrPull on Container Registry — MISSING
⚠️ 1 RBAC role assignment(s) missing.
Run './infra/scripts/assign-rbac.sh' as a privileged user to fix.
| Missing Role | Symptom |
|---|---|
| AcrPull | Job execution fails immediately — Container Apps cannot pull the image. The execution shows Failed status with no container logs (image never starts). |
| Storage Blob Data Contributor | Container starts but fails with Failed to download input blobs from Azure Blob Storage — the managed identity token is rejected with HTTP 403. |
The South Building dataset (128 multi-view images from UNC Chapel Hill) is used for testing. The upload script downloads it if needed, creates 3 test videos, and uploads them:
# Download test data + upload to blob storage (default prefix: south_building/)
./infra/scripts/upload-testdata.sh
# Custom blob prefix
./infra/scripts/upload-testdata.sh --prefix my_scene/
# Download only (no upload)
./infra/scripts/upload-testdata.sh --download-only# Start job and return immediately
./infra/scripts/run-job.sh
# Start job, wait for completion, show status
./infra/scripts/run-job.sh --wait
# Start job, wait, then show container logs
./infra/scripts/run-job.sh --logsA successful run produces these outputs in the output blob container:
| File | Description | Typical Size |
|---|---|---|
south_building/south_building.ply |
3D Gaussian point cloud (real GPU-trained geometry) | ~65 KB |
south_building/south_building.splat |
Web-optimized format for real-time rendering | ~53 KB |
south_building/manifest.json |
Video metadata (resolution, duration, codec, frame count) | ~7 KB |
south_building/.checkpoint.json |
Pipeline progress tracking | ~11 KB |
Input videos are moved from input/ to processed/ on success, or error/ on failure.
After modifying the processor code, rebuild and redeploy without re-provisioning:
# Build new image on ACR + update the job
./infra/scripts/deploy-job.sh
# Or skip the build and just redeploy the existing image
./infra/scripts/deploy-job.sh --skip-buildSet these via azd env set <NAME> <VALUE> before provisioning:
| Variable | Default | Description |
|---|---|---|
AZURE_LOCATION |
(required) | Azure region. Must support GPU T4 (e.g., swedencentral, eastus, westus) |
USE_GPU |
true |
Enable GPU workload profile |
GPU_PROFILE_TYPE |
Consumption-GPU-NC8as-T4 |
GPU type (Consumption-GPU-NC8as-T4 or Consumption-GPU-NC24-A100) |
PROCESSOR_BACKEND |
gsplat |
3DGS backend (gsplat, gaussian-splatting, mock) |
INCLUDE_RBAC |
true |
Include RBAC assignments in Bicep deployment |
USE_STORAGE_KEYS |
false |
Use storage account keys instead of RBAC (fallback) |
EXISTING_RESOURCE_GROUP |
(empty) | Name of an existing Resource Group to deploy into (see Reusing Existing Resources) |
EXISTING_ACR_NAME |
(empty) | Name of an existing Azure Container Registry to reuse |
EXISTING_STORAGE_ACCOUNT_NAME |
(empty) | Name of an existing Storage Account to reuse |
EXISTING_CONTAINER_APPS_ENV_NAME |
(empty) | Name of an existing Container Apps Environment to reuse |
EXISTING_LOG_ANALYTICS_NAME |
(empty) | Name of an existing Log Analytics workspace to reuse |
FORCE_DELETE |
(empty) | Set to true to allow azd down when existing resources are configured |
Tip: Use
./scripts/configure-existing-resources.shto interactively set theEXISTING_*variables instead of setting them manually.
Serverless GPU (T4) Container Apps is available in these regions:
swedencentral, eastus, westus, canadacentral, brazilsouth, australiaeast,
italynorth, francecentral, centralindia, japaneast, northcentralus,
southcentralus, southeastasia, southindia, westeurope, westus2, westus3
If you get a deployment error about workload profiles, ensure your subscription has GPU quota in the selected region. Check quota at: Azure Portal → Quotas
Before running azd down, remove the RBAC assignments:
./infra/scripts/cleanup-rbac.sh
azd down --purge --force
⚠️ Existing resource protection: If your environment is configured to reuse existing Azure resources,azd downwill be blocked by thepredownhook to prevent accidental deletion of shared infrastructure. See Reusing Existing Azure Resources for details.
All infrastructure scripts are in infra/scripts/:
| Script | Purpose | Requires Privilege? |
|---|---|---|
hooks/preprovision.sh |
Captures deployer identity, runs RBAC preflight check | No (auto-run by azd) |
hooks/postprovision.sh |
Builds GPU image on ACR, updates job | No (auto-run by azd) |
hooks/predown.sh |
Blocks azd down when existing resources are configured (override with FORCE_DELETE=true) |
No (auto-run by azd) |
hooks/acr-build.sh |
Creates minimal staging dir, runs az acr build for GPU target |
No |
assign-rbac.sh |
Assigns AcrPull + Storage Blob Data Contributor to MI | Yes — Owner or User Access Admin |
verify-rbac.sh |
Checks if required RBAC roles are assigned | No |
cleanup-rbac.sh |
Removes RBAC role assignments | Yes — Owner or User Access Admin |
run-job.sh |
Starts a job execution with --wait/--logs options |
No |
deploy-job.sh |
Rebuilds image on ACR + updates job | No |
upload-testdata.sh |
Downloads South Building dataset + uploads videos to blob storage | No (needs Storage Blob Data Contributor on deployer) |
Additional scripts in scripts/:
| Script | Purpose | Requires Privilege? |
|---|---|---|
configure-existing-resources.sh |
Interactive wizard to select existing Azure resources for reuse (sets azd env vars) | No (needs Azure CLI login) |
"MANAGED_IDENTITY_PRINCIPAL_ID is not set"
Run azd provision first — this creates the managed identity and saves its principal ID.
Image pull failure (no container logs)
The AcrPull role is missing. Run ./infra/scripts/assign-rbac.sh.
"Failed to download input blobs from Azure Blob Storage"
Either: (a) Storage Blob Data Contributor is missing — run ./infra/scripts/assign-rbac.sh,
or (b) no blobs exist at the BATCH_INPUT_PREFIX — run ./infra/scripts/upload-testdata.sh.
"Reconstruction quality too low"
Increase FRAME_RATE (e.g., from 2 to 3) in the job env vars to extract more frames.
ACR build timeout The GPU image build compiles COLMAP from source (~30 min). The default timeout is 3600s. If it still times out, check ACR Tasks quotas.
COLMAP matching timeout
Ensure COLMAP_MATCHER=sequential (not exhaustive) in the job env vars.
azd down blocked with "EXISTING RESOURCES DETECTED"
The predown hook prevents accidental deletion of shared resources. Either reset
the existing resource configuration or explicitly opt in:
# Option 1: Remove existing resource configuration and tear down normally
./scripts/configure-existing-resources.sh --reset
azd down --purge --force
# Option 2: Force delete (will destroy the entire resource group)
azd env set FORCE_DELETE true
azd down --purge --forceWhen deploying into an environment with pre-existing Azure infrastructure (e.g., a shared
resource group, a team ACR, or a storage account with data), you can configure azd to
reference those resources instead of creating new ones. Referenced resources use Bicep's
existing keyword — their lifecycle is not managed by this deployment and they are
not deleted by azd down.
| Resource | Env Variable | What Happens When Set |
|---|---|---|
| Resource Group | EXISTING_RESOURCE_GROUP |
Deploy into this RG instead of creating rg-<env-name> |
| Azure Container Registry | EXISTING_ACR_NAME |
Reference existing ACR; skip modules/acr.bicep |
| Storage Account | EXISTING_STORAGE_ACCOUNT_NAME |
Reference existing storage; skip modules/storage.bicep. Required blob containers (input, output, processed, error) must already exist |
| Container Apps Environment | EXISTING_CONTAINER_APPS_ENV_NAME |
Reference existing env; skip modules/container-apps-env.bicep |
| Log Analytics Workspace | EXISTING_LOG_ANALYTICS_NAME |
Reference existing workspace; skip modules/monitoring.bicep. When reusing an existing Container Apps Environment, this is not needed (the existing env already has logging configured) |
Resources that are always created by the deployment, regardless of configuration:
- Managed Identity — a new user-assigned identity is always created
- Container Apps Job — the processor job itself is always created
- RBAC role assignments — AcrPull and Storage Blob Data Contributor (when
INCLUDE_RBAC=true)
The easiest way to configure resource reuse is the interactive wizard:
./scripts/configure-existing-resources.shThe script:
- Lists available Resource Groups in your subscription
- Scans the selected RG for ACRs, Storage Accounts, and Container Apps Environments
- Lets you pick which resources to reuse (or skip to create new ones)
- Validates that required blob containers exist on a selected Storage Account
- Sets the
EXISTING_*environment variables in the azd environment - Sets
AZURE_LOCATIONto match the existing resource group's region
You can also set the variables directly:
azd env set EXISTING_RESOURCE_GROUP "my-shared-rg"
azd env set EXISTING_ACR_NAME "myteamacr"
azd env set EXISTING_STORAGE_ACCOUNT_NAME "mystorageaccount"
azd env set EXISTING_CONTAINER_APPS_ENV_NAME "my-cae"
azd env set EXISTING_LOG_ANALYTICS_NAME "my-law"Leave a variable empty (or omit it) to create that resource fresh.
To go back to creating all resources from scratch:
./scripts/configure-existing-resources.sh --resetThis clears all EXISTING_* variables and FORCE_DELETE from the azd environment.
When any EXISTING_* variable is set, azd down is blocked by the predown hook
(infra/scripts/hooks/predown.sh). This prevents accidental deletion of the resource
group (and everything inside it, including the existing shared resources).
The hook prints which existing resources are configured and exits with an error:
╔══════════════════════════════════════════════════════╗
║ ⚠️ EXISTING RESOURCES DETECTED ║
╚══════════════════════════════════════════════════════╝
This environment references pre-existing Azure resources:
• Resource Group: my-shared-rg
• Storage Account: mystorageaccount
Running 'azd down' would delete the resource group and ALL
resources inside it, including those listed above.
To proceed with teardown despite existing resources:
# Explicitly opt in to deletion
azd env set FORCE_DELETE true
azd down --purge --force
⚠️ Warning:FORCE_DELETEdeletes the entire resource group, including any existing resources you configured for reuse. Use with extreme caution.
The infra/main.bicep template uses conditional deployment based on whether EXISTING_*
parameters are empty:
var useExistingAcr = !empty(existingAcrName)
// New ACR — only created when not reusing
module acr 'modules/acr.bicep' = if (!useExistingAcr) { ... }
// Existing ACR — only referenced when reusing
module existingAcrRef 'modules/existing-acr.bicep' = if (useExistingAcr) { ... }
// Outputs resolve to whichever was used
output AZURE_CONTAINER_REGISTRY_NAME string = useExistingAcr
? existingAcrRef.outputs.name
: acr.outputs.nameThe same pattern applies to Storage Account, Container Apps Environment, and Resource Group.
Step-by-step instructions covering the complete lifecycle — from initial setup through a successful job run to clean teardown.
Ensure the following are installed and configured:
# Azure CLI — logged in
az login
az account show # verify correct subscription
# Azure Developer CLI
azd version # requires azd ≥ 1.5
# Test data (South Building dataset)
./scripts/e2e/01-download-testdata.sh# ── 1. Initialize ────────────────────────────────────────────────────────
azd init
# Choose an environment name (e.g., "3dgs-dev")
# ── 2. Configure ─────────────────────────────────────────────────────────
azd env set AZURE_LOCATION swedencentral # Must support GPU T4
azd env set USE_GPU true
# ── 3. Provision infrastructure ──────────────────────────────────────────
# Creates: RG, ACR, Storage, Container Apps Env, MI, Job
# Also builds the GPU Docker image on ACR (~40 min first time)
azd provision
# ── 4. Assign RBAC roles ─────────────────────────────────────────────────
# Requires Owner or User Access Administrator on the resource group
./infra/scripts/assign-rbac.sh
./infra/scripts/verify-rbac.sh
# ── 5. Upload test data ──────────────────────────────────────────────────
./infra/scripts/upload-testdata.sh
# ── 6. Run the job ────────────────────────────────────────────────────────
./infra/scripts/run-job.sh --logs
# ── 7. Verify outputs ────────────────────────────────────────────────────
# Check the output blob container for PLY, SPLAT, manifest, checkpoint files
STORAGE_ACCOUNT=$(azd env get-value AZURE_STORAGE_ACCOUNT_NAME)
az storage blob list --account-name "$STORAGE_ACCOUNT" -c output --auth-mode login -o table# ── 1. Initialize ────────────────────────────────────────────────────────
azd init
# ── 2. Configure existing resources ──────────────────────────────────────
# Interactive: select which existing resources to reuse
./scripts/configure-existing-resources.sh
# The script sets AZURE_LOCATION automatically from the resource group.
# Optionally configure additional settings:
azd env set USE_GPU true
# ── 3. Provision ─────────────────────────────────────────────────────────
# Only creates resources that aren't being reused (MI, Job, and any
# resources you chose to create fresh)
azd provision
# ── 4. Assign RBAC + upload data + run ────────────────────────────────────
./infra/scripts/assign-rbac.sh
./infra/scripts/verify-rbac.sh
./infra/scripts/upload-testdata.sh
./infra/scripts/run-job.sh --logsWhen all resources were created by azd, teardown is straightforward:
# ── 1. Clean up RBAC assignments ─────────────────────────────────────────
./infra/scripts/cleanup-rbac.sh
# ── 2. Destroy all resources ─────────────────────────────────────────────
azd down --purge --force
# Deletes: Resource Group and all resources inside it
# --purge: permanently deletes soft-delete-enabled resources (Key Vault, etc.)
# --force: skip confirmation promptsWhen existing resources are configured, azd down is blocked to prevent accidental
deletion. You have two options:
Option A — Remove only deployment-managed resources (recommended):
This approach leaves existing shared resources intact and only removes what the deployment created (Managed Identity, Container Apps Job, RBAC assignments):
# 1. Clean up RBAC assignments
./infra/scripts/cleanup-rbac.sh
# 2. Reset existing resource config so azd down only targets managed resources
./scripts/configure-existing-resources.sh --reset
# 3. Tear down
azd down --purge --forceNote: Because
azd downdeletes the entire resource group, this option still deletes the RG and all resources in it when the RG was created by the deployment. If you deployed into an existing RG, be aware thatazd downwill attempt to delete that RG too. In that case, use Option B or manually delete only the managed resources.
Option B — Force delete everything (use with caution):
# 1. Clean up RBAC assignments
./infra/scripts/cleanup-rbac.sh
# 2. Explicitly opt in to deletion
azd env set FORCE_DELETE true
# 3. Destroy the entire resource group (including existing resources!)
azd down --purge --force
⚠️ Warning: This deletes the entire resource group and all resources inside it, including any pre-existing ACR, Storage Account, or Container Apps Environment you configured for reuse.
Batch mode processes a single job using the Azure Blob Storage SDK directly — no BlobFuse2 FUSE mounts, no privileged containers, no continuous watching. The processor downloads input videos from blob storage, runs the pipeline locally, uploads outputs, and exits.
┌─────────────────┐ ┌──────────────────────────────────┐ ┌──────────────────┐
│ Azure Blob │ │ 3DGS Processor (batch mode) │ │ Azure Blob │
│ input container│────>│ 1. Download MP4s from blob │ │ output container │
│ scene_001/ │ │ 2. FFmpeg frame extraction │ │ │
│ view1.mp4 │ │ 3. FFprobe metadata │────>│ scene_001.ply │
│ view2.mp4 │ │ 4. Manifest generation │ │ scene_001.splat │
│ view3.mp4 │ │ 5. COLMAP reconstruction │ │ manifest.json │
│ │ │ 6. 3DGS training │ │ │
│ │ │ 7. Export PLY/SPLAT │ │ │
└─────────────────┘ │ 8. Upload outputs to blob │ └──────────────────┘
│ 9. Move inputs → processed/error │
│ 10. Exit (code 0 or 1) │
└──────────────────────────────────┘
# Set variables
RESOURCE_GROUP=<your-resource-group>
STORAGE_ACCOUNT=my3dgsdata # Must be globally unique, lowercase, 3-24 chars
LOCATION=eastus
# Create resource group (skip if exists)
az group create --name $RESOURCE_GROUP --location $LOCATION
# Create storage account
az storage account create \
--name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--sku Standard_LRS \
--kind StorageV2
# Create the four containers
for CONTAINER in input output processed error; do
az storage container create \
--account-name $STORAGE_ACCOUNT \
--name $CONTAINER
done# Upload MP4 files to the input container under a scene prefix
az storage blob upload-batch \
--account-name $STORAGE_ACCOUNT \
--destination input \
--source ./my-scene-videos/ \
--destination-path scene_001/
# Verify upload
az storage blob list \
--account-name $STORAGE_ACCOUNT \
--container-name input \
--prefix scene_001/ \
--output tableThe input container should look like:
input/
scene_001/
view1.mp4
view2.mp4
view3.mp4
See the Authentication section below for details on each method.
RUN_MODE=batch \
AZURE_STORAGE_ACCOUNT=$STORAGE_ACCOUNT \
BATCH_INPUT_PREFIX=scene_001/ \
BACKEND=mock \
cargo run --release# List output blobs
az storage blob list \
--account-name $STORAGE_ACCOUNT \
--container-name output \
--output table
# Download results
az storage blob download-batch \
--account-name $STORAGE_ACCOUNT \
--source output \
--destination ./results/After success, inputs are moved to the processed container:
processed/
scene_001/
view1.mp4
view2.mp4
view3.mp4
On failure, inputs are moved to the error container instead.
# Mode selection
RUN_MODE=batch # Required: enables batch mode
# Azure storage account
AZURE_STORAGE_ACCOUNT=mystorageaccount # Required: the storage account name
# Batch job input
BATCH_INPUT_PREFIX=scene_001/ # Required: blob prefix to process
BATCH_JOB_ID=my-job-123 # Optional: custom job ID for logging
# Processing backend
BACKEND=mock # Or: gsplat, gaussian-splatting, 3dgs-cpp# Container names (defaults shown)
AZURE_BLOB_CONTAINER_INPUT=input # Where MP4s are uploaded
AZURE_BLOB_CONTAINER_OUTPUT=output # Where PLY/SPLAT/manifest are written
AZURE_BLOB_CONTAINER_PROCESSED=processed # Archive after success
AZURE_BLOB_CONTAINER_ERROR=error # Archive after failure
# Auth: see Authentication section below
AZURE_USE_MANAGED_IDENTITY=true # Use Managed Identity (Azure VMs, ACI, AKS)
AZURE_STORAGE_SAS_TOKEN="sv=2022-11-02&..." # Use SAS token auth
# Paths default to /tmp/3dgs-work/* in batch mode, but can be overridden
OUTPUT_PATH=/custom/output/dir
TEMP_PATH=/custom/temp/dirBatch mode supports three authentication methods, checked in priority order:
| Priority | Method | Env Variable | Best For |
|---|---|---|---|
| 1 | SAS Token | AZURE_STORAGE_SAS_TOKEN |
CI/CD pipelines, time-limited access, cross-tenant |
| 2 | Managed Identity | AZURE_USE_MANAGED_IDENTITY=true |
Azure VMs, ACI, AKS (production) |
| 3 | Azure CLI | (default) | Local development (az login) |
Note: Storage account keys (connection strings) are not supported in batch mode. Use SAS tokens for key-based scenarios — they provide time-limited access and can be scoped to specific containers and operations.
Generate a SAS token with the required permissions and set it as an environment variable. The token is appended to the storage endpoint URL — no credential object is needed.
# Generate an account-level SAS token (read, write, delete, list on blob service)
SAS_TOKEN=$(az storage account generate-sas \
--account-name $STORAGE_ACCOUNT \
--permissions rwdlac \
--services b \
--resource-types sco \
--expiry $(date -u -d '+24 hours' +%Y-%m-%dT%H:%MZ) \
--output tsv)
# Run with SAS token
RUN_MODE=batch \
AZURE_STORAGE_ACCOUNT=$STORAGE_ACCOUNT \
AZURE_STORAGE_SAS_TOKEN="$SAS_TOKEN" \
BATCH_INPUT_PREFIX=scene_001/ \
BACKEND=mock \
cargo run --releaseYou can also generate a container-level SAS token for tighter scoping:
# Container-level SAS (input container: read+list, output container: write)
# Note: you'll need separate tokens per container, or use account-level SAS
SAS_TOKEN=$(az storage container generate-sas \
--account-name $STORAGE_ACCOUNT \
--name input \
--permissions rl \
--expiry $(date -u -d '+24 hours' +%Y-%m-%dT%H:%MZ) \
--output tsv)# Log in to Azure (one-time setup)
az login
# Assign yourself Storage Blob Data Contributor role
az role assignment create \
--assignee $(az ad signed-in-user show --query id -o tsv) \
--role "Storage Blob Data Contributor" \
--scope /subscriptions/$(az account show --query id -o tsv)/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT
# Run — Azure CLI credentials are used automatically
RUN_MODE=batch \
AZURE_STORAGE_ACCOUNT=$STORAGE_ACCOUNT \
BATCH_INPUT_PREFIX=scene_001/ \
BACKEND=mock \
cargo run --releaseImportant:
az loginalone is not enough — you also need the Storage Blob Data Contributor RBAC role on the storage account. Without it, you'll get a 403 Forbidden error even though authentication succeeds.
# Enable system-assigned managed identity on a VM
az vm identity assign --name myVM --resource-group $RESOURCE_GROUP
# Get the VM's identity principal ID
PRINCIPAL_ID=$(az vm show --name myVM --resource-group $RESOURCE_GROUP \
--query identity.principalId -o tsv)
# Assign Storage Blob Data Contributor role
az role assignment create \
--assignee $PRINCIPAL_ID \
--role "Storage Blob Data Contributor" \
--scope /subscriptions/$(az account show --query id -o tsv)/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT
# On the VM, run with managed identity
RUN_MODE=batch \
AZURE_STORAGE_ACCOUNT=$STORAGE_ACCOUNT \
AZURE_USE_MANAGED_IDENTITY=true \
BATCH_INPUT_PREFIX=scene_001/ \
BACKEND=gsplat \
./3dgs-processor# Prerequisites: az login, RBAC role assigned
az login
RUN_MODE=batch \
AZURE_STORAGE_ACCOUNT=my3dgsdata \
BATCH_INPUT_PREFIX=scene_001/ \
BACKEND=mock \
RECONSTRUCTION_BACKEND=colmap \
FRAME_RATE=2 \
cargo run --release# Generate SAS token
SAS_TOKEN=$(az storage account generate-sas \
--account-name my3dgsdata \
--permissions rwdlac --services b --resource-types sco \
--expiry $(date -u -d '+24 hours' +%Y-%m-%dT%H:%MZ) \
--output tsv)
# Run in Docker — no privileged mode needed
docker run --rm \
-e RUN_MODE=batch \
-e AZURE_STORAGE_ACCOUNT=my3dgsdata \
-e AZURE_STORAGE_SAS_TOKEN="$SAS_TOKEN" \
-e BATCH_INPUT_PREFIX=scene_001/ \
-e BACKEND=mock \
youracr.azurecr.io/3dgs-processor:cpudocker run --rm \
-e RUN_MODE=batch \
-e AZURE_STORAGE_ACCOUNT=my3dgsdata \
-e AZURE_USE_MANAGED_IDENTITY=true \
-e BATCH_INPUT_PREFIX=scene_001/ \
-e BACKEND=gsplat \
youracr.azurecr.io/3dgs-processor:gpuaz container create \
--resource-group mygroup \
--name 3dgs-batch-job \
--image myregistry.azurecr.io/3dgs-processor:gpu \
--assign-identity \
--environment-variables \
RUN_MODE=batch \
AZURE_STORAGE_ACCOUNT=my3dgsdata \
BATCH_INPUT_PREFIX=scene_001/ \
BACKEND=gsplat \
AZURE_USE_MANAGED_IDENTITY=true \
RECONSTRUCTION_BACKEND=colmap \
FRAME_RATE=2 \
--restart-policy Never \
--cpu 8 --memory 16# Use --secure-environment-variables for SAS tokens
az container create \
--resource-group mygroup \
--name 3dgs-batch-job \
--image myregistry.azurecr.io/3dgs-processor:gpu \
--environment-variables \
RUN_MODE=batch \
AZURE_STORAGE_ACCOUNT=my3dgsdata \
BATCH_INPUT_PREFIX=scene_001/ \
BACKEND=gsplat \
--secure-environment-variables \
AZURE_STORAGE_SAS_TOKEN="$SAS_TOKEN" \
--restart-policy Never \
--cpu 8 --memory 16| Feature | Watch Mode | Batch Mode |
|---|---|---|
| Azure access | BlobFuse2 (FUSE mount) | Azure SDK (direct API) |
| Privileged container | Yes (--device /dev/fuse) |
No |
| Lifecycle | Continuous service | Single job → exit |
| Trigger | File watcher detects new folders | External orchestration |
INPUT_PATH etc. |
Required local paths | Optional (defaults to /tmp) |
| Auth: Azure CLI | ✗ | ✓ (default for local dev) |
| Auth: SAS Token | Via BlobFuse2 config | ✓ AZURE_STORAGE_SAS_TOKEN |
| Auth: Managed Identity | Via BlobFuse2 config | ✓ AZURE_USE_MANAGED_IDENTITY=true |
| Auth: Connection String | Via BlobFuse2 config | ✗ (use SAS token instead) |
| Symptom | Cause | Fix |
|---|---|---|
AZURE_STORAGE_ACCOUNT is required |
Missing env var | Set AZURE_STORAGE_ACCOUNT=<name> |
No blobs found in container 'input' with prefix 'X' |
Wrong prefix or empty container | Check az storage blob list --container-name input --prefix X |
403 Forbidden with Azure CLI auth |
Missing RBAC role | Assign Storage Blob Data Contributor role |
403 Forbidden with SAS token |
Token expired or wrong permissions | Regenerate SAS with rwdlac permissions on blob service |
401 Unauthorized with Managed Identity |
Identity not assigned or wrong scope | Verify az vm identity show and RBAC assignment |
| Process exits with code 1 | Pipeline failure | Check stderr logs; inputs moved to error container |
Failed to create DeveloperToolsCredential |
Not logged in | Run az login |
- CPU: 4 cores (8+ cores recommended)
- RAM: 8GB (16GB+ recommended for large scenes)
- Disk: 100GB (500GB+ recommended for production)
- GPU: Optional (required for
gaussian-splattingandgsplatbackends)
Resource needs scale with:
- Video Resolution: 4K requires 2-3x resources vs. 1080p
- Video Count: 5+ videos per job require more memory
- Training Iterations: Higher iterations = longer GPU time
- Concurrent Jobs: Sequential processing (1 at a time)
Example Profiles:
| Profile | Videos | Resolution | CPU | RAM | Disk | GPU |
|---|---|---|---|---|---|---|
| Small | 2-3 | 720p-1080p | 4 | 8GB | 100GB | Optional |
| Medium | 3-5 | 1080p-1440p | 8 | 16GB | 250GB | Recommended |
| Large | 5+ | 1440p-4K | 16 | 32GB | 500GB | Required |
- Use SSDs for input/output paths (10x faster frame extraction)
- Separate volumes for input, output, and processed (parallel I/O)
- Local storage preferred over network mounts for performance
- NVMe drives ideal for high-throughput scenarios
Recommended directory layout:
/data/
├── input/ # Active uploads (fast SSD)
├── output/ # Processing outputs (fast SSD)
├── processed/ # Archive (slower HDD acceptable)
└── error/ # Failed jobs (manual review)
Mount with appropriate permissions:
docker run -d \
-v /data/input:/data/input \
-v /data/output:/data/output \
-v /data/processed:/data/processed:ro \ # Read-only for safety
-v /data/error:/data/error \
...NFS Configuration:
volumes:
- type: nfs
source: nfs-server.example.com:/export/3dgs/input
target: /data/input
volume:
nocopy: true
o: "addr=nfs-server.example.com,rw,sync"Note: Use poll_watcher for NFS mounts (set WATCHER_MODE=poll).
# Create file share
az storage share create --name 3dgs-input --quota 1024
# Mount in container
docker run -d \
-v 3dgs-input:/data/input \
--mount type=volume,dst=/data/input,volume-driver=azure,volume-opt=share=3dgs-input \
...- Azure Subscription - Active subscription (free tier works for testing)
- Azure Storage Account - General Purpose v2 with Blob Storage
- Azure CLI - For setup and management (
az loginrequired) - Authentication - One of: Connection String, SAS Token, Azure AD, or Managed Identity
- Linux Container Host - Blobfuse2 requires Linux kernel (not available on macOS)
Use the provided setup script to create test resources:
# Clone repository
cd 3DGS-accelerator
# Create Azure resources (auto-generates unique storage account name)
./scripts/azure-setup.sh
# Load credentials
source azure-test-config.env
# Run test to validate
./scripts/azure-test.sh
# Cleanup when done
./scripts/azure-cleanup.sh --delete-allFor production deployment, continue with the sections below.
# Set variables
RESOURCE_GROUP="<your-resource-group>"
LOCATION="eastus"
STORAGE_ACCOUNT="3dgsprodstorage" # Must be globally unique, lowercase, 3-24 chars
# Create resource group
az group create \
--name $RESOURCE_GROUP \
--location $LOCATION
# Create storage account
az storage account create \
--name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--sku Standard_LRS \
--kind StorageV2 \
--https-only true \
--min-tls-version TLS1_2# Option A: Using connection string (if shared key enabled)
CONN_STRING=$(az storage account show-connection-string \
--name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--query connectionString -o tsv)
for container in input output processed error; do
az storage container create \
--name $container \
--connection-string "$CONN_STRING"
done
# Option B: Using Azure AD authentication (enterprise environments)
for container in input output processed error; do
az storage container create \
--name $container \
--account-name $STORAGE_ACCOUNT \
--auth-mode login
doneContainer Purpose:
input/- Active uploads waiting for processingoutput/- Generated PLY/SPLAT models and manifestsprocessed/- Archive of successfully processed jobserror/- Failed jobs for manual review
The processor supports four authentication methods. Choose based on your security requirements:
Best for: Local testing, simple setups
Security:
Availability: ❌ Not available if shared key access disabled
# Get connection string
az storage account show-connection-string \
--name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--query connectionString -o tsv
# Use in container
docker run -d --privileged \
--name 3dgs-processor \
-e AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=3dgsprodstorage;AccountKey=abc123...;EndpointSuffix=core.windows.net" \
-e AZURE_BLOB_CONTAINER_INPUT=input \
-e AZURE_BLOB_CONTAINER_OUTPUT=output \
-e AZURE_BLOB_CONTAINER_PROCESSED=processed \
-e AZURE_BLOB_CONTAINER_ERROR=error \
youracr.azurecr.io/3dgs-processor:gpuBest for: Temporary access, third-party integrations
Security: ✅ Time-limited, scoped permissions
Availability:
Generate User Delegation SAS (Azure AD-based):
# Requires Storage Blob Data Contributor role
EXPIRY="2026-12-31"
# Get user delegation SAS (works even with shared key disabled)
az storage container generate-sas \
--name input \
--account-name $STORAGE_ACCOUNT \
--permissions racwdl \
--expiry $EXPIRY \
--auth-mode login \
--as-user
# Output: se=2026-12-31&sp=racwdl&sv=2021-06-08&sr=c&sig=...
# Use in container
docker run -d --privileged \
--name 3dgs-processor \
-e AZURE_STORAGE_ACCOUNT=$STORAGE_ACCOUNT \
-e AZURE_STORAGE_SAS_TOKEN="se=2026-12-31&sp=racwdl&sv=2021-06-08&sr=c&sig=..." \
-e AZURE_BLOB_CONTAINER_INPUT=input \
youracr.azurecr.io/3dgs-processor:gpuBest Practices:
- Set expiry to shortest necessary duration (e.g., 7 days for testing, 90 days for production)
- Use user delegation SAS (Azure AD-based) instead of account SAS
- Rotate SAS tokens regularly
- Store in secrets manager (Azure Key Vault), not in code
Best for: Enterprise environments, security-first organizations
Security: ✅ No shared keys, identity-based, full audit trail
Availability: ✅ Works when shared key access is disabled
Setup:
# 1. Assign role to user/service principal/managed identity
USER_ID=$(az ad signed-in-user show --query id -o tsv)
az role assignment create \
--role "Storage Blob Data Contributor" \
--assignee $USER_ID \
--scope "/subscriptions/<subscription-id>/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT"
# 2. Login to Azure (for local testing)
az login
# 3. Run container with Azure AD
docker run -d --privileged \
--name 3dgs-processor \
-v $HOME/.azure:/root/.azure:ro \
-e AZURE_STORAGE_ACCOUNT=$STORAGE_ACCOUNT \
-e AZURE_USE_AZURE_AD=true \
youracr.azurecr.io/3dgs-processor:gpuFor CI/CD (Service Principal):
# Create service principal
SP=$(az ad sp create-for-rbac --name "3dgs-processor-sp" --role "Storage Blob Data Contributor" --scopes "/subscriptions/<sub>/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT")
# Use credentials in container
docker run -d --privileged \
-e AZURE_TENANT_ID=$(echo $SP | jq -r .tenant) \
-e AZURE_CLIENT_ID=$(echo $SP | jq -r .appId) \
-e AZURE_CLIENT_SECRET=$(echo $SP | jq -r .password) \
-e AZURE_STORAGE_ACCOUNT=$STORAGE_ACCOUNT \
youracr.azurecr.io/3dgs-processor:gpuBest for: Azure VM, Azure Container Instances, AKS
Security: ✅✅ No credentials, Azure-managed, most secure
Availability: Only in Azure environments (VM/ACI/AKS)
Azure Container Instances Example:
# 1. Create container with managed identity
az container create \
--resource-group $RESOURCE_GROUP \
--name 3dgs-processor \
--image youracr.azurecr.io/3dgs-processor:gpu \
--cpu 8 --memory 16 \
--assign-identity [system] \
--environment-variables \
AZURE_STORAGE_ACCOUNT=$STORAGE_ACCOUNT \
AZURE_USE_MANAGED_IDENTITY=true
# 2. Get managed identity ID
IDENTITY_ID=$(az container show \
--resource-group $RESOURCE_GROUP \
--name 3dgs-processor \
--query identity.principalId -o tsv)
# 3. Grant storage access
az role assignment create \
--role "Storage Blob Data Contributor" \
--assignee-object-id $IDENTITY_ID \
--scope "/subscriptions/<sub>/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT"No credentials needed in container! Azure manages authentication automatically.
Many organizations disable shared key access for compliance (SOC 2, ISO 27001, etc.):
# Check if shared key is disabled
az storage account show \
--name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--query allowSharedKeyAccess -o tsv
# Output: falseIf shared key is disabled:
✅ Azure AD authentication - Works (use --auth-mode login)
✅ User delegation SAS - Works (Azure AD-based)
✅ Managed Identity - Works (Azure AD-based)
❌ Connection string - Blocked
❌ Account SAS - Blocked
Required Setup:
# 1. Assign required role
az role assignment create \
--role "Storage Blob Data Contributor" \
--assignee <user-or-sp-object-id> \
--scope "/subscriptions/<sub>/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT"
# 2. All Azure CLI commands must use --auth-mode login
az storage blob upload \
--account-name $STORAGE_ACCOUNT \
--container-name input \
--name test.mp4 \
--file test.mp4 \
--auth-mode login # Required!
# 3. Container uses Azure AD or managed identity (see methods 3 & 4 above)Requirements:
- Linux kernel (not available on macOS/Windows)
- Privileged container (requires FUSE kernel module access)
- AMD64 architecture (blobfuse2 not available for ARM64)
Basic Mount:
docker run -d --privileged \
--name 3dgs-processor \
-e AZURE_STORAGE_ACCOUNT=$STORAGE_ACCOUNT \
-e AZURE_STORAGE_CONNECTION_STRING="..." \
-e INPUT_PATH=/mnt/blobfuse/input \
-e OUTPUT_PATH=/mnt/blobfuse/output \
-e PROCESSED_PATH=/mnt/blobfuse/processed \
-e ERROR_PATH=/mnt/blobfuse/error \
youracr.azurecr.io/3dgs-processor:gpuPerformance Tuning:
# Configure caching for better performance
docker run -d --privileged \
--name 3dgs-processor \
-e AZURE_STORAGE_ACCOUNT=$STORAGE_ACCOUNT \
-e AZURE_STORAGE_CONNECTION_STRING="..." \
-e BLOBFUSE_CACHE_SIZE_MB=10240 \
-e BLOBFUSE_FILE_CACHE_TIMEOUT_SECS=120 \
-e BLOBFUSE_ATTR_CACHE_TIMEOUT_SECS=60 \
-e BLOBFUSE_ENABLE_STREAMING=true \
youracr.azurecr.io/3dgs-processor:gpuCache Configuration:
BLOBFUSE_CACHE_SIZE_MB: Local disk cache size (default: 1024, recommend: 10240 for production)BLOBFUSE_FILE_CACHE_TIMEOUT_SECS: File cache validity (default: 120)BLOBFUSE_ATTR_CACHE_TIMEOUT_SECS: Attribute cache (default: 60)BLOBFUSE_ENABLE_STREAMING: Stream large files (default: false)
Why Privileged? Blobfuse2 requires CAP_SYS_ADMIN to mount FUSE filesystems.
Deployment with Managed Identity (Most Secure):
# Set variables
RESOURCE_GROUP="<your-resource-group>"
LOCATION="eastus"
CONTAINER_NAME="3dgs-processor"
ACR_NAME="yourcontainerregistry"
IMAGE="$ACR_NAME.azurecr.io/3dgs-processor:gpu"
STORAGE_ACCOUNT="3dgsprodstorage"
# Create container with managed identity
az container create \
--resource-group $RESOURCE_GROUP \
--name $CONTAINER_NAME \
--image $IMAGE \
--cpu 8 \
--memory 16 \
--os-type Linux \
--restart-policy Always \
--assign-identity [system] \
--acr-identity [system] \
--environment-variables \
INPUT_PATH=/mnt/blobfuse/input \
OUTPUT_PATH=/mnt/blobfuse/output \
PROCESSED_PATH=/mnt/blobfuse/processed \
ERROR_PATH=/mnt/blobfuse/error \
AZURE_STORAGE_ACCOUNT=$STORAGE_ACCOUNT \
AZURE_USE_MANAGED_IDENTITY=true \
BACKEND=gsplat \
LOG_LEVEL=info \
HEALTH_CHECK_ENABLED=true \
HEALTH_CHECK_PORT=8080 \
RETENTION_DAYS=30 \
MAX_RETRIES=3 \
--ports 8080
# Get managed identity principal ID
IDENTITY_ID=$(az container show \
--resource-group $RESOURCE_GROUP \
--name $CONTAINER_NAME \
--query identity.principalId -o tsv)
# Grant storage access to managed identity
az role assignment create \
--role "Storage Blob Data Contributor" \
--assignee-object-id $IDENTITY_ID \
--scope "/subscriptions/<subscription-id>/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT"
# Grant ACR pull access (if using private ACR)
az role assignment create \
--role "AcrPull" \
--assignee-object-id $IDENTITY_ID \
--scope "/subscriptions/<subscription-id>/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.ContainerRegistry/registries/$ACR_NAME"
echo "Waiting 2 minutes for role assignments to propagate..."
sleep 120
# Restart container to apply permissions
az container restart --resource-group $RESOURCE_GROUP --name $CONTAINER_NAMEMonitor Deployment:
# View logs (follow mode)
az container logs \
--resource-group $RESOURCE_GROUP \
--name $CONTAINER_NAME \
--follow
# Check health endpoint
CONTAINER_IP=$(az container show \
--resource-group $RESOURCE_GROUP \
--name $CONTAINER_NAME \
--query ipAddress.ip -o tsv)
curl http://$CONTAINER_IP:8080/health
# Check container status
az container show \
--resource-group $RESOURCE_GROUP \
--name $CONTAINER_NAME \
--query '{state:instanceView.state,cpu:containers[0].resources.requests.cpu,memory:containers[0].resources.requests.memoryInGB,restartCount:instanceView.currentState.restartCount}'
# Execute command inside container
az container exec \
--resource-group $RESOURCE_GROUP \
--name $CONTAINER_NAME \
--exec-command "/bin/bash"Update Deployment:
# Delete old container
az container delete \
--resource-group $RESOURCE_GROUP \
--name $CONTAINER_NAME \
--yes
# Deploy new version (repeat create command with new image tag)Cost Optimization:
# Use spot instances for development (60-90% cheaper)
az container create \
--resource-group $RESOURCE_GROUP \
--name $CONTAINER_NAME-dev \
--image $IMAGE \
--priority Spot \
--cpu 4 \
--memory 8 \
...
# Set restart policy for one-time jobs
--restart-policy OnFailure # Only restart on failure, not on successCreate GPU-enabled VM:
RESOURCE_GROUP="<your-resource-group>"
VM_NAME="3dgs-gpu-vm"
LOCATION="eastus"
STORAGE_ACCOUNT="3dgsprodstorage"
# Create VM with GPU (NC-series for CUDA)
az vm create \
--resource-group $RESOURCE_GROUP \
--name $VM_NAME \
--location $LOCATION \
--size Standard_NC6s_v3 \
--image Ubuntu2204 \
--admin-username azureuser \
--generate-ssh-keys \
--assign-identity [system]
# Get managed identity
IDENTITY_ID=$(az vm show \
--resource-group $RESOURCE_GROUP \
--name $VM_NAME \
--query identity.principalId -o tsv)
# Grant storage access
az role assignment create \
--role "Storage Blob Data Contributor" \
--assignee-object-id $IDENTITY_ID \
--scope "/subscriptions/<sub>/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT"
# SSH into VM
VM_IP=$(az vm show \
--resource-group $RESOURCE_GROUP \
--name $VM_NAME \
--show-details \
--query publicIps -o tsv)
ssh azureuser@$VM_IPSetup VM:
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
# Install NVIDIA drivers (for GPU)
sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker
# Verify GPU
nvidia-smi
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smiRun Container on VM:
# Login to Azure from VM (for managed identity)
az login --identity
# Run container with GPU
docker run -d --privileged \
--name 3dgs-processor \
--gpus all \
--restart unless-stopped \
-v /mnt/blobfuse:/mnt/blobfuse \
-e AZURE_STORAGE_ACCOUNT=$STORAGE_ACCOUNT \
-e AZURE_USE_MANAGED_IDENTITY=true \
-e INPUT_PATH=/mnt/blobfuse/input \
-e OUTPUT_PATH=/mnt/blobfuse/output \
-e PROCESSED_PATH=/mnt/blobfuse/processed \
-e ERROR_PATH=/mnt/blobfuse/error \
-e BACKEND=gsplat \
-e LOG_LEVEL=info \
youracr.azurecr.io/3dgs-processor:gpu
# Monitor
docker logs -f 3dgs-processorAuto-start on VM reboot:
# Create systemd service
sudo tee /etc/systemd/system/3dgs-processor.service <<EOF
[Unit]
Description=3DGS Video Processor
Requires=docker.service
After=docker.service
[Service]
Restart=always
ExecStart=/usr/bin/docker start -a 3dgs-processor
ExecStop=/usr/bin/docker stop -t 10 3dgs-processor
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable 3dgs-processor
sudo systemctl start 3dgs-processor
sudo systemctl status 3dgs-processorSee Kubernetes Deployment section above for full deployment manifests.
AKS-specific quick start:
# Create AKS cluster with managed identity
az aks create \
--resource-group $RESOURCE_GROUP \
--name 3dgs-aks-cluster \
--node-count 2 \
--node-vm-size Standard_D8s_v3 \
--enable-managed-identity \
--generate-ssh-keys
# Get credentials
az aks get-credentials \
--resource-group $RESOURCE_GROUP \
--name 3dgs-aks-cluster
# Enable Azure AD workload identity (for pod-level managed identity)
az aks update \
--resource-group $RESOURCE_GROUP \
--name 3dgs-aks-cluster \
--enable-oidc-issuer \
--enable-workload-identity
# Deploy application
kubectl create namespace 3dgs
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/pvc.yaml
# Monitor
kubectl logs -f deployment/3dgs-processor -n 3dgs
kubectl get pods -n 3dgsScale down when idle:
# Manual scale
kubectl scale deployment/3dgs-processor --replicas=0 -n 3dgs # Stop
kubectl scale deployment/3dgs-processor --replicas=1 -n 3dgs # Start
# Auto-scale based on CPU (future enhancement)
kubectl autoscale deployment/3dgs-processor \
--cpu-percent=50 \
--min=0 \
--max=3 \
-n 3dgs- Run as Non-Root User (when possible):
USER 1000:1000- Read-Only Root Filesystem:
security_opt:
- no-new-privileges:true
read_only: true
tmpfs:
- /tmp
- /var/tmp- Limit Capabilities:
cap_drop:
- ALL
cap_add:
- SYS_ADMIN # Required for Blobfuse2 mountingNever hardcode credentials. Use secrets management:
# Docker Compose with secrets
secrets:
azure_connection_string:
file: ./secrets/azure_conn.txt
services:
3dgs-processor:
secrets:
- azure_connection_string
environment:
AZURE_STORAGE_CONNECTION_STRING_FILE: /run/secrets/azure_connection_stringKubernetes Secrets:
kubectl create secret generic azure-storage \
--from-literal=connection-string="DefaultEndpointsProtocol=..." \
-n 3dgs
# Reference in deployment
env:
- name: AZURE_STORAGE_CONNECTION_STRING
valueFrom:
secretKeyRef:
name: azure-storage
key: connection-string- Restrict Ingress: Only health check port (8080) if needed
- Use Private Endpoints: Azure Storage private endpoints
- Firewall Rules: Whitelist container IPs in storage account
Logs automatically redact Azure credentials:
[INFO] Mounting Azure Blob Storage: account=mystorageaccount, container=3dgs-input, auth=[REDACTED]
Logs are JSON-formatted for easy parsing:
{
"timestamp": "2024-01-01T12:00:00Z",
"level": "INFO",
"target": "three_dgs_processor::processor::job",
"message": "Job completed successfully",
"job_id": "scene_001",
"duration_secs": 245.3,
"frames_extracted": 450,
"gaussian_count": 150000
}ELK Stack:
logging:
driver: "fluentd"
options:
fluentd-address: "fluentd.example.com:24224"
tag: "3dgs-processor"Azure Monitor:
docker run -d \
--log-driver=azuremonitor \
--log-opt WorkspaceId=<workspace-id> \
--log-opt WorkspaceKey=<workspace-key> \
...Key metrics to monitor:
- Job Success Rate:
processed/vs.error/folder counts - Processing Duration: Time from upload to output
- Disk Usage:
/datapartition utilization - Memory Usage: Container RSS
- Frame Extraction Rate: Frames/second
- Training Loss: Final loss value per job
Prometheus Integration (future):
# Expose metrics endpoint
-e METRICS_PORT=9090
# Scrape config
scrape_configs:
- job_name: '3dgs-processor'
static_configs:
- targets: ['processor:9090']HTTP health endpoint:
curl http://localhost:8080/health
# Response
{
"status": "healthy",
"uptime_secs": 86400,
"jobs_processed": 42,
"disk_free_gb": 150.5
}- Configuration: Version control
config.yamland env files - Processed Data: Archive to cold storage after retention period
- Error Folder: Backup before manual deletion
- Logs: Retain for 30-90 days
Restart Resilience:
- Service checks
processed/anderror/on startup - Duplicate detection prevents reprocessing
- In-progress jobs resume from last checkpoint
Data Loss Prevention:
- Use Azure Blob Storage geo-redundant replication
- Regular snapshots of persistent volumes
- Separate processed/error archives from active processing
Horizontal Scaling: Not recommended (sequential processing model)
Vertical Scaling: Increase CPU/RAM/GPU for faster processing
Alternative Approach: Deploy multiple instances with different input folders
# Instance 1
-e INPUT_PATH=/data/input-1
# Instance 2
-e INPUT_PATH=/data/input-2Route uploads to different instances based on load balancing.
Error: Key based authentication is not permitted on this storage account
Cause: Shared key access is disabled (enterprise security policy)
Solution: Use Azure AD authentication or managed identity:
# Option 1: Azure AD
docker run -d --privileged \
-v $HOME/.azure:/root/.azure:ro \
-e AZURE_STORAGE_ACCOUNT=$STORAGE_ACCOUNT \
-e AZURE_USE_AZURE_AD=true \
youracr.azurecr.io/3dgs-processor:gpu
# Option 2: Service Principal
docker run -d --privileged \
-e AZURE_TENANT_ID=$TENANT_ID \
-e AZURE_CLIENT_ID=$CLIENT_ID \
-e AZURE_CLIENT_SECRET=$CLIENT_SECRET \
-e AZURE_STORAGE_ACCOUNT=$STORAGE_ACCOUNT \
youracr.azurecr.io/3dgs-processor:gpuError: AuthorizationFailed or This request is not authorized to perform this operation
Cause: Missing role assignment
Solution: Grant Storage Blob Data Contributor role:
# For user
az role assignment create \
--role "Storage Blob Data Contributor" \
--assignee $(az ad signed-in-user show --query id -o tsv) \
--scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
# For service principal
az role assignment create \
--role "Storage Blob Data Contributor" \
--assignee <sp-object-id> \
--scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
# For managed identity (from ACI/VM)
az role assignment create \
--role "Storage Blob Data Contributor" \
--assignee-object-id <managed-identity-principal-id> \
--scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"Wait 5 minutes for role propagation before retrying.
Error: SAS token expired or Server failed to authenticate the request
Cause: SAS token expired or malformed
Solution: Generate new SAS token:
# User delegation SAS (works with shared key disabled)
az storage container generate-sas \
--name input \
--account-name $STORAGE_ACCOUNT \
--permissions racwdl \
--expiry 2026-12-31 \
--auth-mode login \
--as-userError: fusermount: command not found or FUSE device not found
Cause: Running on macOS or Windows (FUSE requires Linux kernel)
Solution: Deploy to Linux environment:
- Azure Container Instances (Linux)
- Azure VM (Ubuntu/Debian)
- Local Linux machine via Docker/Podman
Error: operation not permitted when mounting
Cause: Container not running in privileged mode
Solution: Add --privileged flag:
docker run -d --privileged ... # Required for FUSEError: blobfuse2: No such file or directory
Cause: Running ARM64 image (blobfuse2 only available on AMD64)
Solution: Use AMD64 image:
docker pull --platform linux/amd64 youracr.azurecr.io/3dgs-processor:cpuError: Slow mount performance or timeouts
Cause: Insufficient cache configuration
Solution: Increase cache size:
# Add caching environment variables
-e BLOBFUSE_CACHE_SIZE_MB=10240 \
-e BLOBFUSE_FILE_CACHE_TIMEOUT_SECS=120Error: No space left on device
Cause: Insufficient disk space in container or host
Solution:
# 1. Check host disk space
df -h
# 2. Clean old containers/images
docker system prune -a
# 3. Increase container disk quota (if using overlay2)
docker run -d --storage-opt size=100G ...
# 4. Check retention policy removes old jobs
-e RETENTION_DAYS=7 \
-e MIN_DISK_SPACE_GB=50Error: Container exits immediately or processor crashes on startup
Cause: Missing required environment variables
Solution: Check logs and verify minimum required variables:
# View logs
docker logs 3dgs-processor
# Minimum required variables
-e INPUT_PATH=/mnt/blobfuse/input \
-e OUTPUT_PATH=/mnt/blobfuse/output \
-e PROCESSED_PATH=/mnt/blobfuse/processed \
-e ERROR_PATH=/mnt/blobfuse/error \
-e AZURE_STORAGE_ACCOUNT=<account> \
-e AZURE_STORAGE_CONNECTION_STRING=<connection-string> # OR other auth methodError: Connection timeout or Unable to connect to Azure
Cause: Firewall, private endpoint, or network policy
Solution:
# 1. Check firewall rules on storage account
az storage account show \
--name $STORAGE_ACCOUNT \
--query networkRuleSet
# 2. Add container IP to whitelist (if restricted)
az storage account network-rule add \
--account-name $STORAGE_ACCOUNT \
--ip-address <container-public-ip>
# 3. Enable service endpoint (for ACI/AKS)
az storage account update \
--name $STORAGE_ACCOUNT \
--default-action Deny \
--bypass AzureServices
# 4. Use private endpoint (VNet deployment)
az network private-endpoint create \
--name 3dgs-private-endpoint \
--resource-group $RESOURCE_GROUP \
--vnet-name <vnet> --subnet <subnet> \
--private-connection-resource-id <storage-account-id> \
--group-id blob \
--connection-name 3dgs-blob-connectionError: DNS resolution failed for storage account
Cause: Private endpoint DNS not configured
Solution: Configure private DNS zone:
az network private-dns zone create \
--resource-group $RESOURCE_GROUP \
--name privatelink.blob.core.windows.net
az network private-dns link vnet create \
--resource-group $RESOURCE_GROUP \
--zone-name privatelink.blob.core.windows.net \
--name 3dgs-dns-link \
--virtual-network <vnet> \
--registration-enabled falseSlow blob upload/download
Causes & Solutions:
-
Geographic distance: Deploy container in same region as storage account
# Check storage account location az storage account show --name $STORAGE_ACCOUNT --query location # Deploy to same region az container create --location <same-region> ...
-
Insufficient bandwidth: Upgrade storage account tier
# Upgrade to Premium (better IOPS) az storage account update \ --name $STORAGE_ACCOUNT \ --sku Premium_LRS
-
Cache not configured: Add blobfuse2 cache settings (see above)
-
Concurrent operations throttled: Azure storage limits operations per second
# Monitor throttling az monitor metrics list \ --resource <storage-account-id> \ --metric "Throttling"
Check container status:
# Container logs
docker logs 3dgs-processor --tail 100 --follow
# Container exec for debugging
docker exec -it 3dgs-processor /bin/bash
# Check mounted blobfuse
df -h # Inside container
mount | grep blobfuse
# Test Azure connectivity
az storage blob list \
--account-name $STORAGE_ACCOUNT \
--container-name input \
--auth-mode loginCheck Azure resources:
# Storage account status
az storage account show \
--name $STORAGE_ACCOUNT \
--query '{name:name, location:location, sku:sku.name, allowSharedKey:allowSharedKeyAccess}'
# List containers
az storage container list \
--account-name $STORAGE_ACCOUNT \
--auth-mode login
# Check role assignments
az role assignment list \
--scope "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT"
# View activity logs (recent operations)
az monitor activity-log list \
--resource-group $RESOURCE_GROUP \
--offset 1hIf issues persist:
-
Review Documentation:
- TROUBLESHOOTING.md - General troubleshooting
- scripts/AZURE_TESTING.md - Azure test scripts
- USER_GUIDE.md - Configuration reference
-
Test with Scripts:
# Validate Azure setup ./scripts/azure-setup.sh source azure-test-config.env ./scripts/azure-test.sh
-
Check Logs:
- Container logs:
docker logs 3dgs-processor - Azure Activity Logs: Azure Portal → Storage Account → Activity log
- Health endpoint:
curl http://localhost:8080/health
- Container logs:
-
File Issue:
- GitHub Issues: Include logs, configuration (redact credentials), and error messages