Progress Tracking and Checkpointing

The 3DGS video processor includes comprehensive progress tracking and checkpointing capabilities that enable:

Real-time progress monitoring via health endpoint
Automatic checkpoint persistence for restart resilience
Resume from checkpoint after failures or restarts
Detailed stage tracking through the processing pipeline

Overview
Processing Stages
Checkpoint Storage
Health Endpoint Integration
Resume Capability
Usage Examples
Configuration

Overview

The progress tracking system monitors job execution through 8 distinct stages, saving checkpoint data to disk after each stage completes. This provides:

Visibility: Track progress percentage and current stage
Resilience: Resume jobs after process restarts
Monitoring: Query job status via HTTP health endpoint
Debugging: Inspect saved checkpoints for troubleshooting

Processing Stages

Jobs progress through the following stages:

Stage	Name	Description	Progress %
0	Validation	Folder validation and video discovery	0%
1	FrameExtraction	Extract frames from all videos concurrently	12.5%
2	MetadataExtraction	Extract GPS and camera metadata	25%
3	ManifestGeneration	Generate manifest.json with camera intrinsics	37.5%
4	ColmapReconstruction	Run COLMAP sparse reconstruction	50%
5	Training	Train 3DGS model	62.5%
6	PlyExport	Export model to PLY format	75%
7	SplatExport	Export model to SPLAT format	87.5%
8	Completed	Job finished successfully	100%

Progress percentage is calculated as: (stage_number / 8) * 100

Checkpoint Storage

File Location

Checkpoints are stored in the output directory as:

{output_folder}/.checkpoint.json

For example:

/mnt/output/job-abc123/.checkpoint.json

Checkpoint Format

Checkpoints are JSON files containing:

{
  "job_id": "job-abc123",
  "stage": "Training",
  "input_folder": "/mnt/input/scene-001",
  "output_folder": "/mnt/output/job-abc123",
  "temp_folder": "/tmp/job-abc123",
  "timestamp": 1708790400,
  "completed_stages": {
    "video_count": 3,
    "total_frames": 450,
    "manifest_path": "/mnt/output/job-abc123/manifest.json",
    "colmap_sparse_path": "/tmp/job-abc123/colmap/sparse",
    "colmap_points": 125000,
    "gaussian_count": null,
    "ply_path": null,
    "splat_path": null
  }
}

Checkpoint Lifecycle

Created: When job starts (Validation stage)
Updated: After each stage completes
Finalized: When job completes successfully (marked as Completed)
Retained: Kept for status queries (cleaned up by retention policy)

Health Endpoint Integration

When the health check endpoint is enabled (HEALTH_CHECK_ENABLED=true), it exposes real-time progress information.

Enabling Health Endpoint

export HEALTH_CHECK_ENABLED=true
export HEALTH_CHECK_PORT=8080

Health Response Format

{
  "state": "processing",
  "last_update": "2026-02-24T12:30:00Z",
  "current_job": {
    "job_id": "job-abc123",
    "stage": "Training",
    "progress_percentage": 62.5,
    "video_count": 3,
    "total_frames": 450,
    "gaussian_count": null,
    "started_at": "2026-02-24T12:00:00Z"
  }
}

Querying Progress

curl http://localhost:8080/health | jq .

# Example output:
{
  "state": "processing",
  "current_job": {
    "job_id": "scene-capture-20260224",
    "stage": "Training",
    "progress_percentage": 62.5,
    "video_count": 5,
    "total_frames": 750,
    "started_at": "2026-02-24T12:00:00Z"
  }
}

Monitoring Progress in Script

#!/bin/bash
# Monitor job progress until completion

while true; do
  response=$(curl -s http://localhost:8080/health)
  state=$(echo "$response" | jq -r '.state')
  
  if [ "$state" == "processing" ]; then
    progress=$(echo "$response" | jq -r '.current_job.progress_percentage')
    stage=$(echo "$response" | jq -r '.current_job.stage')
    echo "Progress: ${progress}% - Stage: $stage"
  elif [ "$state" == "watching" ]; then
    echo "Job completed, processor watching for new jobs"
    break
  else
    echo "State: $state"
  fi
  
  sleep 5
done

Resume Capability

The processor automatically resumes jobs from the last completed checkpoint.

How Resume Works

On job start: Check for existing .checkpoint.json in output folder
Validate checkpoint: Ensure it's recent (< 24 hours) and not completed
Resume from stage: Skip already-completed stages, continue from current stage
Re-process if needed: Some stages (frame extraction, training) may be re-executed

Resume Example

If a job fails at the Training stage:

Stage 0 (Validation): ✓ Completed (skipped on resume)
Stage 1 (FrameExtraction): ✓ Completed (skipped on resume)  
Stage 2 (MetadataExtraction): ✓ Completed (skipped on resume)
Stage 3 (ManifestGeneration): ✓ Completed (skipped on resume)
Stage 4 (ColmapReconstruction): ✓ Completed (skipped on resume)
Stage 5 (Training): ✗ Failed (resume starts here)

On restart, the job automatically resumes from Training stage, skipping all previous stages.

Force Fresh Start

To force a fresh start (ignore checkpoints), delete the checkpoint file:

rm /mnt/output/job-abc123/.checkpoint.json

Usage Examples

Example 1: Monitor Progress Programmatically

use three_dgs_processor::health::{HealthCheckState, JobProgress};
use three_dgs_processor::processor::ProcessingStage;

async fn monitor_job(health_state: &HealthCheckState) {
    loop {
        let status = health_state.get_status().await;
        
        if let Some(job) = status.current_job {
            println!("Job: {} - {}% - {}", 
                job.job_id,
                job.progress_percentage,
                job.stage
            );
            
            if job.progress_percentage >= 100.0 {
                break;
            }
        }
        
        tokio::time::sleep(Duration::from_secs(5)).await;
    }
}

Example 2: Inspect Checkpoint

# View current checkpoint
cat /mnt/output/job-abc123/.checkpoint.json | jq .

# Check progress percentage
cat /mnt/output/job-abc123/.checkpoint.json | jq '.stage' | \
  python -c "
stages = {'Validation': 0, 'FrameExtraction': 12.5, 'MetadataExtraction': 25,
          'ManifestGeneration': 37.5, 'ColmapReconstruction': 50,
          'Training': 62.5, 'PlyExport': 75, 'SplatExport': 87.5, 
          'Completed': 100}
import sys, json
stage = json.load(sys.stdin)
print(f'{stages.get(stage, 0)}%')
"

Example 3: Resume After Restart

# Start processor
docker run -d --name 3dgs-processor \
  -e INPUT_PATH=/mnt/input \
  -e OUTPUT_PATH=/mnt/output \
  youracr.azurecr.io/3dgs-processor:cpu

# ... processor crashes or is stopped ...

# Restart - automatically resumes from checkpoint
docker start 3dgs-processor

# Check logs to confirm resume
docker logs 3dgs-processor | grep "Resuming from checkpoint"

Configuration

Progress tracking is always enabled. Checkpoint files are created automatically in the output directory.

Environment Variables

No specific configuration needed for progress tracking itself. Related settings:

# Health endpoint (optional, for exposing progress via HTTP)
HEALTH_CHECK_ENABLED=true
HEALTH_CHECK_PORT=8080

# Output path (where checkpoints are stored)
OUTPUT_PATH=/mnt/output

# Retention (how long to keep completed checkpoints)
CLEANUP_RETENTION_DAYS=7  # Default: 7 days

Checkpoint Retention

Completed checkpoints (stage = Completed) are retained based on CLEANUP_RETENTION_DAYS:

export CLEANUP_RETENTION_DAYS=7  # Keep completed checkpoints for 7 days

Old checkpoints are cleaned up by the retention scheduler to prevent disk space issues.

Architecture

Key Components

ProgressTracker (src/processor/progress.rs)
- Tracks current stage and progress percentage
- Saves checkpoints to disk
- Provides resume capability
JobCheckpoint (src/processor/progress.rs)
- Serializable checkpoint data structure
- Persisted as JSON to output folder
ProcessingStage (src/processor/progress.rs)
- Enum of all pipeline stages
- Provides progress percentage calculation
HealthCheckState (src/health/mod.rs)
- Exposes progress via HTTP endpoint
- Updated by ProgressTracker during job execution

Data Flow

Job Execution
     ↓
ProgressTracker (creates checkpoint)
     ↓
Complete Stage → Update checkpoint → Save to disk
     ↓
Update HealthCheckState (if enabled)
     ↓
Health Endpoint returns progress

Troubleshooting

Checkpoint Not Resuming

Issue: Job restarts from beginning even though checkpoint exists

Solutions:

Check checkpoint age - must be < 24 hours
Verify checkpoint is not marked as Completed
Check logs for "Checkpoint too old" warnings
Ensure output folder path matches checkpoint location

Progress Not Updating in Health Endpoint

Issue: /health endpoint shows stale progress

Solutions:

Verify HEALTH_CHECK_ENABLED=true
Check health endpoint port is correct
Ensure job is actually running (check container logs)
Confirm checkpoint file is being updated (check timestamps)

Disk Space Issues from Checkpoints

Issue: Many old checkpoint files consuming space

Solutions:

Reduce CLEANUP_RETENTION_DAYS value
Manually delete old output folders
Enable automatic cleanup (verify retention scheduler is running)

Best Practices

Monitor Progress: Enable health endpoint in production for visibility
Set Reasonable Retention: Balance debugging needs vs disk space
Log Checkpoint Events: Retain logs showing checkpoint save/resume
Test Resume: Periodically test restart resilience in staging
Handle Failed Jobs: Move failed jobs to error folder to avoid re-processing

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Progress Tracking and Checkpointing

Table of Contents

Overview

Processing Stages

Checkpoint Storage

File Location

Checkpoint Format

Checkpoint Lifecycle

Health Endpoint Integration

Enabling Health Endpoint

Health Response Format

Querying Progress

Monitoring Progress in Script

Resume Capability

How Resume Works

Resume Example

Force Fresh Start

Usage Examples

Example 1: Monitor Progress Programmatically

Example 2: Inspect Checkpoint

Example 3: Resume After Restart

Configuration

Environment Variables

Checkpoint Retention

Architecture

Key Components

Data Flow

Troubleshooting

Checkpoint Not Resuming

Progress Not Updating in Health Endpoint

Disk Space Issues from Checkpoints

Best Practices

Related Documentation

FilesExpand file tree

PROGRESS_TRACKING.md

Latest commit

History

PROGRESS_TRACKING.md

File metadata and controls

Progress Tracking and Checkpointing

Table of Contents

Overview

Processing Stages

Checkpoint Storage

File Location

Checkpoint Format

Checkpoint Lifecycle

Health Endpoint Integration

Enabling Health Endpoint

Health Response Format

Querying Progress

Monitoring Progress in Script

Resume Capability

How Resume Works

Resume Example

Force Fresh Start

Usage Examples

Example 1: Monitor Progress Programmatically

Example 2: Inspect Checkpoint

Example 3: Resume After Restart

Configuration

Environment Variables

Checkpoint Retention

Architecture

Key Components

Data Flow

Troubleshooting

Checkpoint Not Resuming

Progress Not Updating in Health Endpoint

Disk Space Issues from Checkpoints

Best Practices

Related Documentation