The 3DGS video processor includes comprehensive progress tracking and checkpointing capabilities that enable:
- Real-time progress monitoring via health endpoint
- Automatic checkpoint persistence for restart resilience
- Resume from checkpoint after failures or restarts
- Detailed stage tracking through the processing pipeline
- Overview
- Processing Stages
- Checkpoint Storage
- Health Endpoint Integration
- Resume Capability
- Usage Examples
- Configuration
## Overview

The progress tracking system monitors job execution through 8 distinct stages, saving checkpoint data to disk after each stage completes. This provides:
- Visibility: Track progress percentage and current stage
- Resilience: Resume jobs after process restarts
- Monitoring: Query job status via HTTP health endpoint
- Debugging: Inspect saved checkpoints for troubleshooting
## Processing Stages

Jobs progress through the following stages:
| Stage | Name | Description | Progress % |
|---|---|---|---|
| 0 | Validation | Folder validation and video discovery | 0% |
| 1 | FrameExtraction | Extract frames from all videos concurrently | 12.5% |
| 2 | MetadataExtraction | Extract GPS and camera metadata | 25% |
| 3 | ManifestGeneration | Generate manifest.json with camera intrinsics | 37.5% |
| 4 | ColmapReconstruction | Run COLMAP sparse reconstruction | 50% |
| 5 | Training | Train 3DGS model | 62.5% |
| 6 | PlyExport | Export model to PLY format | 75% |
| 7 | SplatExport | Export model to SPLAT format | 87.5% |
| 8 | Completed | Job finished successfully | 100% |
Progress percentage is calculated as `(stage_number / 8) * 100`.
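As an illustration, the stage-to-progress mapping can be sketched in Python. This is a hypothetical helper, not the processor's actual API; the stage names and the `/8` formula come from the table above.

```python
# Pipeline stages in order, as listed in the table above (illustrative only).
STAGES = [
    "Validation", "FrameExtraction", "MetadataExtraction",
    "ManifestGeneration", "ColmapReconstruction", "Training",
    "PlyExport", "SplatExport", "Completed",
]

def progress_percentage(stage: str) -> float:
    """Return (stage_number / 8) * 100 for a named stage."""
    return STAGES.index(stage) / 8 * 100

print(progress_percentage("Training"))   # 62.5
print(progress_percentage("Completed"))  # 100.0
```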
## Checkpoint Storage

Checkpoints are stored in the output directory as:

```text
{output_folder}/.checkpoint.json
```

For example:

```text
/mnt/output/job-abc123/.checkpoint.json
```
Checkpoints are JSON files containing:

```json
{
  "job_id": "job-abc123",
  "stage": "Training",
  "input_folder": "/mnt/input/scene-001",
  "output_folder": "/mnt/output/job-abc123",
  "temp_folder": "/tmp/job-abc123",
  "timestamp": 1708790400,
  "completed_stages": {
    "video_count": 3,
    "total_frames": 450,
    "manifest_path": "/mnt/output/job-abc123/manifest.json",
    "colmap_sparse_path": "/tmp/job-abc123/colmap/sparse",
    "colmap_points": 125000,
    "gaussian_count": null,
    "ply_path": null,
    "splat_path": null
  }
}
```

Checkpoint lifecycle:

- Created: When the job starts (Validation stage)
- Updated: After each stage completes
- Finalized: When job completes successfully (marked as Completed)
- Retained: Kept for status queries (cleaned up by retention policy)
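A minimal sketch of the save-after-each-stage behaviour, using the field names from the example checkpoint above. The write-then-rename pattern is an assumption for illustration, not necessarily how the processor persists its files:

```python
import json
import os
import time

def save_checkpoint(output_folder: str, job_id: str, stage: str,
                    completed_stages: dict) -> str:
    """Persist checkpoint state as {output_folder}/.checkpoint.json.

    Writes to a temp file and renames it into place, so a crash
    mid-write never leaves a corrupt checkpoint behind (assumption
    for illustration).
    """
    checkpoint = {
        "job_id": job_id,
        "stage": stage,
        "output_folder": output_folder,
        "timestamp": int(time.time()),
        "completed_stages": completed_stages,
    }
    path = os.path.join(output_folder, ".checkpoint.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(checkpoint, f, indent=2)
    os.replace(tmp, path)  # atomic on POSIX filesystems
    return path
```

Calling this after every completed stage keeps the on-disk checkpoint in step with the job, which is what makes the resume behaviour below possible.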
## Health Endpoint Integration

When the health check endpoint is enabled (`HEALTH_CHECK_ENABLED=true`), it exposes real-time progress information:

```bash
export HEALTH_CHECK_ENABLED=true
export HEALTH_CHECK_PORT=8080
```

The endpoint returns a JSON payload such as:

```json
{
  "state": "processing",
  "last_update": "2026-02-24T12:30:00Z",
  "current_job": {
    "job_id": "job-abc123",
    "stage": "Training",
    "progress_percentage": 62.5,
    "video_count": 3,
    "total_frames": 450,
    "gaussian_count": null,
    "started_at": "2026-02-24T12:00:00Z"
  }
}
```

Query it with curl:

```bash
curl http://localhost:8080/health | jq .
```
Example output:

```json
{
  "state": "processing",
  "current_job": {
    "job_id": "scene-capture-20260224",
    "stage": "Training",
    "progress_percentage": 62.5,
    "video_count": 5,
    "total_frames": 750,
    "started_at": "2026-02-24T12:00:00Z"
  }
}
```

A script can poll the endpoint until the job completes:

```bash
#!/bin/bash
# Monitor job progress until completion
while true; do
    response=$(curl -s http://localhost:8080/health)
    state=$(echo "$response" | jq -r '.state')
    if [ "$state" == "processing" ]; then
        progress=$(echo "$response" | jq -r '.current_job.progress_percentage')
        stage=$(echo "$response" | jq -r '.current_job.stage')
        echo "Progress: ${progress}% - Stage: $stage"
    elif [ "$state" == "watching" ]; then
        echo "Job completed, processor watching for new jobs"
        break
    else
        echo "State: $state"
    fi
    sleep 5
done
```

## Resume Capability

The processor automatically resumes jobs from the last completed checkpoint.
- On job start: Check for an existing `.checkpoint.json` in the output folder
- Validate checkpoint: Ensure it is recent (< 24 hours old) and not marked Completed
- Resume from stage: Skip already-completed stages and continue from the current stage
- Re-process if needed: Some stages (frame extraction, training) may be re-executed
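The validate-and-resume decision above can be sketched as follows. The 24-hour limit and the Completed check come from the text; the function name, field access, and return convention are illustrative, not the processor's real code:

```python
import json
import os
import time
from typing import Optional

MAX_CHECKPOINT_AGE_SECS = 24 * 60 * 60  # checkpoints older than 24h are ignored

def load_resumable_checkpoint(output_folder: str) -> Optional[dict]:
    """Return checkpoint data if the job can resume, else None."""
    path = os.path.join(output_folder, ".checkpoint.json")
    if not os.path.exists(path):
        return None  # no checkpoint: start fresh
    with open(path) as f:
        checkpoint = json.load(f)
    if checkpoint.get("stage") == "Completed":
        return None  # finished jobs are never resumed
    if time.time() - checkpoint.get("timestamp", 0) > MAX_CHECKPOINT_AGE_SECS:
        return None  # too old: start fresh
    return checkpoint
```

A `None` result means the processor starts the job from the Validation stage; anything else names the stage to resume from.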
If a job fails at the Training stage:
```text
Stage 0 (Validation):           ✓ Completed (skipped on resume)
Stage 1 (FrameExtraction):      ✓ Completed (skipped on resume)
Stage 2 (MetadataExtraction):   ✓ Completed (skipped on resume)
Stage 3 (ManifestGeneration):   ✓ Completed (skipped on resume)
Stage 4 (ColmapReconstruction): ✓ Completed (skipped on resume)
Stage 5 (Training):             ✗ Failed (resume starts here)
```
On restart, the job automatically resumes from Training stage, skipping all previous stages.
To force a fresh start (ignoring checkpoints), delete the checkpoint file:

```bash
rm /mnt/output/job-abc123/.checkpoint.json
```

## Usage Examples

Monitoring a job programmatically through the health state:

```rust
use std::time::Duration;

use three_dgs_processor::health::{HealthCheckState, JobProgress};
use three_dgs_processor::processor::ProcessingStage;

async fn monitor_job(health_state: &HealthCheckState) {
    loop {
        let status = health_state.get_status().await;
        if let Some(job) = status.current_job {
            println!(
                "Job: {} - {}% - {}",
                job.job_id, job.progress_percentage, job.stage
            );
            if job.progress_percentage >= 100.0 {
                break;
            }
        }
        tokio::time::sleep(Duration::from_secs(5)).await;
    }
}
```
Inspecting a saved checkpoint from the command line:

```bash
# View current checkpoint
cat /mnt/output/job-abc123/.checkpoint.json | jq .

# Check progress percentage
cat /mnt/output/job-abc123/.checkpoint.json | jq '.stage' | \
python -c "
import sys, json
stages = {'Validation': 0, 'FrameExtraction': 12.5, 'MetadataExtraction': 25,
          'ManifestGeneration': 37.5, 'ColmapReconstruction': 50,
          'Training': 62.5, 'PlyExport': 75, 'SplatExport': 87.5,
          'Completed': 100}
stage = json.load(sys.stdin)
print(f'{stages.get(stage, 0)}%')
"
```

Resuming after a container restart:

```bash
# Start processor
docker run -d --name 3dgs-processor \
  -e INPUT_PATH=/mnt/input \
  -e OUTPUT_PATH=/mnt/output \
  youracr.azurecr.io/3dgs-processor:cpu

# ... processor crashes or is stopped ...

# Restart - automatically resumes from checkpoint
docker start 3dgs-processor

# Check logs to confirm resume
docker logs 3dgs-processor | grep "Resuming from checkpoint"
```

## Configuration

Progress tracking is always enabled; checkpoint files are created automatically in the output directory.
No configuration is needed for progress tracking itself. Related settings:

```bash
# Health endpoint (optional, for exposing progress via HTTP)
HEALTH_CHECK_ENABLED=true
HEALTH_CHECK_PORT=8080

# Output path (where checkpoints are stored)
OUTPUT_PATH=/mnt/output

# Retention (how long to keep completed checkpoints)
CLEANUP_RETENTION_DAYS=7  # Default: 7 days
```

Completed checkpoints (stage = `Completed`) are retained based on `CLEANUP_RETENTION_DAYS`:

```bash
export CLEANUP_RETENTION_DAYS=7  # Keep completed checkpoints for 7 days
```

Old checkpoints are cleaned up by the retention scheduler to prevent disk-space issues.
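The retention policy amounts to a scan like the following. This is an illustrative sketch only; the real scheduler lives inside the processor and its layout assumptions (one job folder per directory under the output root) are taken from the examples in this document:

```python
import json
import os
import time

def find_expired_checkpoints(output_root: str, retention_days: int = 7) -> list:
    """List completed checkpoints older than the retention window."""
    cutoff = time.time() - retention_days * 86400
    expired = []
    for entry in os.scandir(output_root):
        path = os.path.join(entry.path, ".checkpoint.json")
        if not entry.is_dir() or not os.path.exists(path):
            continue
        with open(path) as f:
            checkpoint = json.load(f)
        # Only completed jobs fall under CLEANUP_RETENTION_DAYS;
        # in-flight checkpoints are left alone so jobs can resume.
        if (checkpoint.get("stage") == "Completed"
                and checkpoint.get("timestamp", 0) < cutoff):
            expired.append(path)
    return expired
```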
## Implementation Components

- `ProgressTracker` (`src/processor/progress.rs`)
  - Tracks current stage and progress percentage
  - Saves checkpoints to disk
  - Provides resume capability
- `JobCheckpoint` (`src/processor/progress.rs`)
  - Serializable checkpoint data structure
  - Persisted as JSON to the output folder
- `ProcessingStage` (`src/processor/progress.rs`)
  - Enum of all pipeline stages
  - Provides progress percentage calculation
- `HealthCheckState` (`src/health/mod.rs`)
  - Exposes progress via the HTTP endpoint
  - Updated by `ProgressTracker` during job execution
Data flow:

```text
Job Execution
    ↓
ProgressTracker (creates checkpoint)
    ↓
Complete Stage → Update checkpoint → Save to disk
    ↓
Update HealthCheckState (if enabled)
    ↓
Health Endpoint returns progress
```
## Troubleshooting

Issue: Job restarts from the beginning even though a checkpoint exists

Solutions:
- Check the checkpoint age - it must be < 24 hours old
- Verify the checkpoint is not marked as Completed
- Check logs for "Checkpoint too old" warnings
- Ensure the output folder path matches the checkpoint location

Issue: The `/health` endpoint shows stale progress

Solutions:
- Verify `HEALTH_CHECK_ENABLED=true`
- Check that the health endpoint port is correct
- Ensure the job is actually running (check container logs)
- Confirm the checkpoint file is being updated (check timestamps)

Issue: Many old checkpoint files consuming disk space

Solutions:
- Reduce the `CLEANUP_RETENTION_DAYS` value
- Manually delete old output folders
- Enable automatic cleanup (verify the retention scheduler is running)
## Best Practices

- Monitor Progress: Enable the health endpoint in production for visibility
- Set Reasonable Retention: Balance debugging needs against disk space
- Log Checkpoint Events: Retain logs showing checkpoint saves and resumes
- Test Resume: Periodically test restart resilience in staging
- Handle Failed Jobs: Move failed jobs to an error folder to avoid re-processing
## Related Documentation

- Architecture - System architecture overview
- Deployment - Deployment guide
- User Guide - End-to-end usage guide
- Troubleshooting - Common issues and solutions