Technical Pipeline Documentation
Ben Tang
We present a comprehensive pipeline for zero-shot 3D hand pose estimation from multi-view RGB videos of patients with dystonia performing piano and banjo playing assessments. Our approach addresses three primary challenges: (1) extreme out-of-distribution hand poses characteristic of dystonia, (2) motion blur from ballistic key presses, and (3) temporal consistency across video frames.
The pipeline comprises five stages: audio-based multi-view synchronization with sub-frame precision, adaptive image preprocessing, single-view 3D hand pose estimation using Hamba with test-time augmentation for uncertainty quantification, camera calibration achieving <0.1 pixel intrinsic and ~0.8 pixel extrinsic error, and multi-view fusion via L-BFGS optimization with inverse-variance weighted reprojection loss.
We employ the MANO parametric hand model with consistent shape parameters across all frames and a two-stage optimization strategy. Our batch optimization approach enables true bidirectional temporal smoothing, eliminating phase-shift artifacts.
- Introduction
- Related Work
- Stage 1: Video Synchronization
- Stage 2: Image Preprocessing
- Stage 3: Single-View Hand Pose Estimation
- Stage 4: Camera Calibration
- Stage 5: Multi-View Fusion
- Challenges and Lessons Learned
- Results and Visualization
- Discussion
- Appendices
Dystonia is a movement disorder characterized by sustained or intermittent muscle contractions causing abnormal, often repetitive, movements or postures. Quantitative assessment of dystonia severity remains challenging, as current clinical evaluations rely heavily on subjective rating scales. Objective biomechanical measurements derived from hand motion during functional tasks — such as piano or banjo playing — could provide valuable clinical biomarkers.
This project develops a computer vision pipeline to reconstruct accurate 3D hand pose from multi-view RGB videos of dystonia patients. The goal is to derive biomechanical measurements including joint angles, velocity profiles, jerk metrics, and abnormal posture patterns.
Three primary challenges distinguish this application from standard hand pose estimation:
Challenge 1: Extreme Out-of-Distribution Poses. Dystonic hands exhibit configurations rarely seen in standard training datasets:
- Front finger completely curled while middle finger rigidly extended
- Ring finger curled with thumb and pinky maximally splayed
- Sustained abnormal co-contraction patterns
This is the primary challenge — most off-the-shelf pose estimators fail catastrophically on these extreme poses.
Challenge 2: Motion Blur. Piano key presses involve rapid ballistic finger movements, causing significant motion blur at 60 FPS. This blur degrades both hand detection and pose estimation accuracy.
Challenge 3: Temporal Consistency. Frame-to-frame jitter in pose estimates must be minimized while preserving genuine high-frequency movements characteristic of dystonia (involuntary jerks, tremor).
Input (3 RGB Videos)
→ [Stage 1] Sync (Audio FFT)
→ [Stage 2] Preprocess (CLAHE)
→ [Stage 3] Pose (Hamba + TTA)
↘
[Stage 4] Calibration (Checkerboard) → [Stage 5] Fusion (L-BFGS) → Output (MANO)
| Input | Specification |
|---|---|
| Videos | 3 synchronized RGB (Front, Left, Right) |
| Resolution | 1920×1080 @ 59.94/60 FPS |
| Duration | ~3 minutes per recording |
| Cameras | Static, fixed relative positions |
| Calibration | Checkerboard videos (~50% of subjects) |
| Output | Specification |
|---|---|
| MANO models | Per-frame |
| Shape | Consistent across all frames |
| 3D joints | 21 joints in world coordinates (mm) |
| Uncertainty | Per-joint |
| Metrics | Joint angles, velocity, acceleration, jerk |
During model selection, we evaluated both 2D and 3D methods:
| Method | Type | Result |
|---|---|---|
| MediaPipe | 2D | Failed on dystonia |
| OpenPose | 2D | Outdated |
| ViTPose | 2D | No 3D constraints |
| HaMeR | 3D | Inconsistent |
| WiLoR | 3D | Good detector |
| Hamba | 3D | Best poses |
Selected approach: WiLoR detector + Hamba pose estimator.
FürElise (Wang et al., 2024): Piano hand motion synthesis using 5 cameras with MIDI ground truth. Their pipeline: HaMeR → RANSAC triangulation → Butterworth smoothing → IK refinement with MIDI contacts. We lack MIDI ground truth and have only 3 views.
Look Ma, No Markers (Hewitt et al., 2024): This Microsoft Research paper fundamentally shaped our multi-view fusion approach. Key contributions we adapt:
- Probabilistic landmark weighting: Weight each landmark by its predicted visibility confidence. High-confidence views dominate; occluded views contribute minimally.
- Inverse-variance weighting: Fuse multi-view predictions using $1/\sigma^2$ weighting, where $\sigma$ is landmark uncertainty.
- No explicit occlusion detection: Let learned uncertainties handle occlusion implicitly, avoiding the chicken-and-egg problem where you need the 3D pose to detect occlusions but need occlusion information to estimate the 3D pose.
- Holistic body model: Optimize a parametric body model rather than raw 3D keypoints, ensuring anatomically plausible poses.
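As a small illustration of the inverse-variance idea in the list above, the following sketch fuses several per-view estimates of a single landmark. The function name `fuse_inverse_variance` and the `eps` stabilizer are assumptions made for this example and are not taken from the cited paper or from our pipeline code.

```python
import numpy as np

def fuse_inverse_variance(estimates, sigmas, eps=1e-5):
    """Fuse per-view estimates of one landmark using 1/sigma^2 weights.

    estimates: (V, D) array, one D-dimensional estimate per view.
    sigmas:    (V,) array of per-view uncertainties (same units as estimates).
    """
    estimates = np.asarray(estimates, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    weights = 1.0 / (sigmas ** 2 + eps)      # confident views get large weights
    weights = weights / weights.sum()        # normalize so the weights sum to 1
    return weights @ estimates, weights

# Two confident views agree; a third, uncertain view is far off.
fused, w = fuse_inverse_variance(
    [[10.0, 20.0], [12.0, 22.0], [30.0, 40.0]],
    [1.0, 1.0, 10.0],
)
```

The uncertain third view receives well under 1% of the total weight, so the fused estimate stays close to the two confident views without any explicit occlusion test.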
MANO (Romero et al., 2017) is a parametric hand model learned from ~1000 high-resolution 3D scans:
Shape β ∈ ℝ¹⁰ ─┐
Pose θ ∈ ℝ⁴⁸ ─┤── MANO Forward Kinematics ──┬── Vertices V ∈ ℝ⁷⁷⁸ˣ³
Translation t ∈ ℝ³ ┘ └── Joints J ∈ ℝ²¹ˣ³
- Shape $\boldsymbol{\beta}$: Controls hand proportions (consistent across frames)
- Pose $\boldsymbol{\theta}$: Encodes 16 joint rotations (3 DOF each)
- Forward kinematics: Produces 778 mesh vertices and 21 joint positions
The three camera views are not hardware-synchronized. We recover temporal alignment using audio-based cross-correlation with sub-frame precision.
Extract audio @ 16kHz (FFmpeg)
↓
FFT cross-correlation: c[n] = Σ_k a_ref[k] · a_target[k+n]
↓
Find correlation peak within ±60s window
↓
Parabolic interpolation for sub-sample precision
↓
Convert to frame offset: δ = t_offset × fps
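Cross-Correlation. For reference audio $a_{\text{ref}}$ and target audio $a_{\text{target}}$, the correlation at lag $n$ is $c[n] = \sum_k a_{\text{ref}}[k]\, a_{\text{target}}[k+n]$.
Computed via FFT in `scipy.signal.correlate`.
Sub-Frame Precision. Fit a parabola to the correlation peak and its two neighbors; the vertex of the parabola gives the sub-sample offset.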
Confidence Metric. Based on peak-to-noise ratio:
Key Insight: A confidence of at least 0.7 indicates reliable sync; a confidence below 0.3 requires manual verification.
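The synchronization steps above can be sketched with NumPy alone. This is a minimal sketch, not the pipeline's implementation: `estimate_offset` is a hypothetical helper, the FFT is written out with `numpy.fft` rather than `scipy.signal.correlate`, and the confidence metric is omitted because its exact definition is not reproduced here.

```python
import numpy as np

def estimate_offset(a_ref, a_target, sample_rate, fps, max_lag_s=60.0):
    """Return (frame_offset, lag_samples) with a_target[k + lag] ~ a_ref[k]."""
    a_ref = np.asarray(a_ref, dtype=float) - np.mean(a_ref)
    a_target = np.asarray(a_target, dtype=float) - np.mean(a_target)
    n = len(a_ref) + len(a_target) - 1
    nfft = 1 << (n - 1).bit_length()          # zero-pad to avoid circular aliasing
    # Correlation theorem: c[lag] = sum_k a_ref[k] * a_target[k + lag]
    spec = np.conj(np.fft.rfft(a_ref, nfft)) * np.fft.rfft(a_target, nfft)
    c = np.fft.irfft(spec, nfft)
    # Restrict the search window; negative lags wrap to the end of c.
    max_lag = min(int(max_lag_s * sample_rate), nfft // 2 - 1)
    lags = np.arange(-max_lag, max_lag + 1)
    c_win = c[lags % nfft]
    p = int(np.argmax(c_win))
    # Parabolic interpolation through the peak and its two neighbors.
    delta = 0.0
    if 0 < p < len(c_win) - 1:
        y0, y1, y2 = c_win[p - 1], c_win[p], c_win[p + 1]
        denom = y0 - 2.0 * y1 + y2
        if denom != 0.0:
            delta = 0.5 * (y0 - y2) / denom
    lag = lags[p] + delta
    return lag / sample_rate * fps, lag

# Synthetic check: the target is the reference delayed by 37 samples.
rng = np.random.default_rng(0)
ref = rng.standard_normal(4000)
target = np.concatenate([np.zeros(37), ref])
frame_offset, lag = estimate_offset(ref, target, sample_rate=16000, fps=60.0)
```

At 16 kHz one audio sample is 1/16000 s, far shorter than a video frame at 60 FPS, which is why the audio lag can resolve sub-frame offsets.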
Operations are applied in a specific order to prevent noise amplification:
Raw Frame → Bilateral Denoise → Gamma → CLAHE → Unsharp Mask → Enhanced Frame
(reduce noise) (brightness) (contrast) (edges)
We experimented with RVRT (Recurrent Video Restoration Transformer) for motion deblurring. RVRT did improve motion clarity significantly, particularly for fast ballistic finger movements. However, we decided against including it for two reasons:
- Computational overhead: Processing 1080p video at 60 FPS required chunking into overlapping segments with 16-frame windows, making processing roughly 10× slower than real time even on high-end GPUs.
- Hallucination risk: Neural deblurring models can hallucinate plausible but incorrect textures and positions. For clinical applications where measurement accuracy is paramount, this introduces an unacceptable source of error — the model may "sharpen" a finger into a position it was never actually in.
Instead, we rely on the enhancement pipeline (CLAHE, gamma, sharpening), which is computationally cheap and does not introduce hallucinated content.
Bilateral Denoising. Edge-preserving filter:
Gamma Correction:
Values
CLAHE. Applied to the L channel in LAB space with a mode-dependent clip limit (2.0 to 3.0 in the modes that enable it) and an 8×8 tile grid.
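Unsharp Masking: $I_{\text{sharp}} = I + \alpha\,(I - G_\sigma * I)$, where $G_\sigma$ is a Gaussian kernel with standard deviation $\sigma$ and $\alpha$ is the sharpening strength.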
| Mode | Denoise | Gamma | CLAHE | Sharpen | Use Case |
|---|---|---|---|---|---|
| `minimal` | -- | 1.0 | -- | -- | Good lighting |
| `standard` | -- | 0.6 | 2.0 | -- | Indoor video |
| `dystonia` | h=6 | 0.85 | 2.0 | 0.2 | Shadowed hands |
| `latching_fix` | h=3 | 0.75 | 3.0 | 1.5 | Finger separation |
Key Insight: The `latching_fix` mode uses aggressive sharpening (α = 1.5) to create Mach bands at edges, helping separate overlapping fingers (e.g., thumb "latching" onto index finger).
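To make the gamma and sharpening steps concrete, here is a NumPy-only sketch of the two operations on a grayscale image scaled to [0, 1]. It is illustrative and rests on assumptions: the gamma convention (`I ** gamma`, so values below 1 brighten), the helper names, and the reflect padding are choices made for this example, and the bilateral denoise and CLAHE steps are left out.

```python
import numpy as np

def gamma_correct(img, gamma):
    """Assumed convention: I_out = I_in ** gamma on a [0, 1] image."""
    return np.clip(img, 0.0, 1.0) ** gamma

def gaussian_blur(img, sigma):
    """Separable Gaussian blur of a 2D float image with reflected borders."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="reflect")
    # Filter along rows, then along columns; "valid" removes the padding again.
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def unsharp_mask(img, alpha, sigma):
    """I_sharp = I + alpha * (I - blur(I)), clipped to [0, 1]."""
    return np.clip(img + alpha * (img - gaussian_blur(img, sigma)), 0.0, 1.0)

# A vertical step edge standing in for the boundary between two fingers.
img = np.full((32, 32), 0.3)
img[:, 16:] = 0.7
sharp = unsharp_mask(img, alpha=1.5, sigma=1.5)
```

With the `latching_fix` settings (α = 1.5, σ = 1.5) the output overshoots on the bright side of the edge and undershoots on the dark side, which is the Mach-band effect described above; flat regions are left unchanged.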
Frame → [WiLoR YOLO Hand Detector] → [Crop & Augment] → [Hamba Bi-Mamba]
(conf=0.6) (7 TTA passes) ↓
MANO Params (β, θ, t)
2D Keypoints + Uncertainty
Multiple Detection Handling. Select largest bounding box per hand type:
Missing Frame Interpolation. Linear interpolation between valid neighbors:
Outlier Detection. Median filter with MAD (Median Absolute Deviation):
Seven passes with varying scale (1.2, 1.3, 1.4) and rotation (0°, ±20°, ±40°). Results are aggregated to compute mean keypoints and per-joint uncertainty.
Uncertainty Computation. Per-joint standard deviation across the seven TTA passes.
This gives a scalar uncertainty in pixels for each of the 21 joints.
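A minimal sketch of the aggregation, assuming the keypoints from each pass have already been mapped back into the original image frame. How the x and y spreads are combined into one scalar is an assumption here (root of the summed variances); `aggregate_tta` is a hypothetical name.

```python
import numpy as np

def aggregate_tta(keypoints):
    """keypoints: (P, 21, 2) predictions from P TTA passes, in original pixels.

    Returns the mean keypoints (21, 2) and a scalar uncertainty per joint (21,).
    """
    keypoints = np.asarray(keypoints, dtype=float)
    mean = keypoints.mean(axis=0)
    std_xy = keypoints.std(axis=0, ddof=1)        # per-axis sample std, (21, 2)
    sigma = np.sqrt((std_xy ** 2).sum(axis=1))    # combine x and y spread
    return mean, sigma

# Seven passes that agree everywhere except on the x coordinate of joint 5.
passes = np.zeros((7, 21, 2))
passes[:, 5, 0] = np.arange(7)
mean, sigma = aggregate_tta(passes)
```

Joints on which the augmented passes disagree receive a larger `sigma` and are therefore down-weighted by the inverse-variance term in the fusion stage.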
Motion Valley Detection → Two-Stage Checkerboard Detection → Pattern Valid?
├─ yes → OpenCV calibrateCamera → K, distortion (<0.1 px error)
└─ no → retry detection
Motion Valley Detection. Select frames with minimal camera/checkerboard motion:
- Compute frame-to-frame mean absolute difference
- Apply moving average smoothing (window = 5)
- Find local minima, expand to valleys
- Select the stillest frames, ensuring at least 30 frames of separation between them
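The selection procedure above can be approximated with a short greedy routine. This sketch simplifies the method: instead of finding local minima and expanding them into valleys, it takes the lowest-motion frames in order and enforces the separation constraint. `select_still_frames` and its default arguments are hypothetical.

```python
import numpy as np

def select_still_frames(frames, num_frames=20, smooth_window=5, min_separation=30):
    """frames: (T, H, W) grayscale array. Returns sorted indices of still frames."""
    frames = np.asarray(frames, dtype=float)
    # motion[i] is the mean absolute difference between frame i+1 and frame i.
    motion = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    # Moving average with edge padding so the borders are not biased toward zero.
    half = smooth_window // 2
    kernel = np.ones(smooth_window) / smooth_window
    smoothed = np.convolve(np.pad(motion, half, mode="edge"), kernel, mode="valid")
    selected = []
    for idx in np.argsort(smoothed):              # stillest candidates first
        if all(abs(int(idx) - s) >= min_separation for s in selected):
            selected.append(int(idx))
        if len(selected) == num_frames:
            break
    return sorted(selected)

# Synthetic clip: constant motion except for two still periods.
steps = np.ones(199)
steps[50:60] = 0.0
steps[150:160] = 0.0
position = np.concatenate([[0.0], np.cumsum(steps)])
frames = position[:, None, None] * np.ones((1, 4, 4))
picked = select_still_frames(frames, num_frames=2)
```

On the synthetic clip the two selected indices fall inside the two still periods, one from each, because the separation constraint rejects a second pick from the same valley.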
Pattern Validation. Reject false positives (e.g., piano keys) by validating corner intensity patterns: diagonal quadrants should match, adjacent quadrants should differ.
Fixed Parameters. Principal point fixed at image center,
Front camera defines the world origin ($R = I$, $\mathbf{t} = \mathbf{0}$).
Solution: Y-Coordinate Normalization. Checkerboard detectors may return corners in reversed order. We enforce consistency:
- Top-right corner: index $c - 1$
- Bottom-left corner: index $(r-1) \cdot c$
- If $y_{\text{top-right}} > y_{\text{bottom-left}}$: flip corners

This deterministic check is instant and 100% reliable.
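The check can be written in a few lines. The sketch assumes row-major corner order with `rows` rows and `cols` columns of inner corners, and it interprets "flip corners" as reversing the corner sequence; both are assumptions, and `normalize_corner_order` is a hypothetical helper.

```python
import numpy as np

def normalize_corner_order(corners, rows, cols):
    """corners: (rows*cols, 2) pixel coordinates in detector (row-major) order."""
    corners = np.asarray(corners, dtype=float).reshape(rows * cols, 2)
    top_right = corners[cols - 1]
    bottom_left = corners[(rows - 1) * cols]
    # Image y grows downward, so the top-right corner should have the smaller y.
    if top_right[1] > bottom_left[1]:
        corners = corners[::-1].copy()
    return corners

# A 3x4 grid in canonical order, and the same grid returned in reverse.
rows, cols = 3, 4
ys, xs = np.meshgrid(np.arange(rows) * 10.0, np.arange(cols) * 10.0, indexing="ij")
grid = np.stack([xs.ravel(), ys.ravel()], axis=1)
fixed = normalize_corner_order(grid[::-1], rows, cols)
```

Applying the same rule to every view keeps corner indices consistent across cameras, which the extrinsic calibration relies on when matching 2D corners between views.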
Reject misdetections by validating geometric consistency:
- Row vectors should be parallel within each row
- Column vectors should be parallel within each column
- Each cell should form a parallelogram (opposite sides equal)
Maximum deviation ratio ≤ 0.15 (deviation / mean spacing).
Achieved Accuracy:
- Intrinsic: < 0.1 pixels reprojection error
- Extrinsic: **~ 0.8 pixels** reprojection error (after all fixes)
Initialization:
β: median across all frames/views
θ: Front view + median hand pose
t: triangulated wrist
↓
Stage 1: Global Positioning (50 L-BFGS iterations)
Optimize: θ_global, t
Freeze: θ_hand, β
Loss: L_reproj + 10λ_temp · L_temporal
↓
Stage 2: Full Refinement (150 L-BFGS iterations)
Optimize: θ_global, θ_hand, t
Freeze: β
Loss: L_reproj + λ_anchor · L_anchor + λ_temp · L_temporal
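For gradient-based optimization, we use the continuous 6D rotation representation (Zhou et al., 2019), which keeps the first two columns of the rotation matrix $R = [\mathbf{r}_1\ \mathbf{r}_2\ \mathbf{r}_3]$:
$\mathbf{r}_{6D} = [\mathbf{r}_1^\top, \mathbf{r}_2^\top]^\top \in \mathbb{R}^6$
The inverse mapping applies Gram-Schmidt orthogonalization, which ensures a valid rotation matrix.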
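A NumPy sketch of the two mappings follows, written for a single rotation rather than a batch. The function names are illustrative; the construction itself (two columns, Gram-Schmidt, cross product for the third axis) follows Zhou et al.

```python
import numpy as np

def matrix_to_6d(R):
    """Keep the first two columns of a 3x3 rotation matrix as a 6-vector."""
    R = np.asarray(R, dtype=float)
    return np.concatenate([R[:, 0], R[:, 1]])

def sixd_to_matrix(d6):
    """Gram-Schmidt on the two 3-vectors, then a cross product for the third axis."""
    d6 = np.asarray(d6, dtype=float)
    a1, a2 = d6[:3], d6[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2_orth = a2 - np.dot(b1, a2) * b1        # remove the component along b1
    b2 = a2_orth / np.linalg.norm(a2_orth)
    b3 = np.cross(b1, b2)                     # right-handed, so det(R) = +1
    return np.stack([b1, b2, b3], axis=1)

# Round trip for a rotation of 0.3 rad about the z axis.
c, s = np.cos(0.3), np.sin(0.3)
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
R_back = sixd_to_matrix(matrix_to_6d(Rz))

# Any non-degenerate 6-vector maps to a valid rotation.
R_any = sixd_to_matrix([1.0, 2.0, 3.0, -1.0, 0.5, 2.0])
```

Because every non-degenerate 6-vector maps to a proper rotation, the optimizer can update the six numbers freely without a projection step after each iteration.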
The primary loss measures 2D reprojection error with inverse-variance weighting:
where:
- $\pi_v(\cdot)$: perspective projection with radial distortion for view $v$
- $w_{v,j}^{\text{manual}}$: manual occlusion weight
- $\sigma_{v,j}$: TTA uncertainty
- $\epsilon = 10^{-5}$: numerical stability
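One plausible form of the weighted reprojection term, built only from the symbols defined above, is sketched below. The exact normalization and the use of a plain squared residual are assumptions for illustration; `reprojection_loss` takes already projected joints as input, so the projection $\pi_v$ itself is outside the sketch.

```python
import numpy as np

def reprojection_loss(projected, observed, sigma, manual_weight, eps=1e-5):
    """Sum over views and joints of w_manual / (sigma^2 + eps) * ||pi(J) - x||^2.

    projected, observed: (V, J, 2) pixel coordinates.
    sigma, manual_weight: (V, J) TTA uncertainty and manual occlusion weight.
    """
    projected = np.asarray(projected, dtype=float)
    observed = np.asarray(observed, dtype=float)
    residual_sq = ((projected - observed) ** 2).sum(axis=-1)      # (V, J)
    weights = np.asarray(manual_weight, dtype=float) / (
        np.asarray(sigma, dtype=float) ** 2 + eps
    )
    return float((weights * residual_sq).sum())

# Two views, one joint, identical 5-pixel residual in both views.
projected = np.array([[[3.0, 4.0]], [[3.0, 4.0]]])
observed = np.zeros((2, 1, 2))
sigma = np.array([[1.0], [5.0]])
manual = np.array([[1.0], [0.0]])      # second view marked as fully occluded
loss = reprojection_loss(projected, observed, sigma, manual)
```

With the second view's manual weight set to 0.0 only the first view contributes, which mirrors how the Right camera's thumb and index weights remove those joints from the loss.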
Regularizes hand pose toward initialization:
Key Insight: Anchor loss is applied only to hand pose, not global orientation or translation, because those are affected by focal length scaling from Hamba initialization.
Penalizes vertex velocity across frames:
Using all 778 vertices (not just 21 joints) provides smoother regularization.
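The anchor and temporal terms described above might look as follows in NumPy. The squared-L2 forms and the averaging in the temporal term are assumptions; the array shapes follow the MANO dimensions given earlier (45 hand-pose parameters, 778 vertices).

```python
import numpy as np

def anchor_loss(theta_hand, theta_hand_init):
    """Squared deviation of the hand pose (N, 45) from its initialization."""
    diff = np.asarray(theta_hand, dtype=float) - np.asarray(theta_hand_init, dtype=float)
    return float((diff ** 2).sum())

def temporal_loss(vertices):
    """vertices: (N, 778, 3). Mean squared vertex displacement between frames."""
    velocity = np.diff(np.asarray(vertices, dtype=float), axis=0)   # (N-1, 778, 3)
    return float((velocity ** 2).sum(axis=-1).mean())

# A mesh translating by one unit along x per frame.
N = 5
base = np.zeros((778, 3))
shift = np.arange(N, dtype=float)[:, None, None] * np.array([1.0, 0.0, 0.0])
moving = base[None] + shift
static = np.repeat(base[None], N, axis=0)

l_moving = temporal_loss(moving)
l_static = temporal_loss(static)
l_anchor = anchor_loss(np.ones((2, 45)), np.zeros((2, 45)))
```

Only the 45 hand-pose parameters enter `anchor_loss`; global orientation and translation are deliberately left out, in line with the Key Insight above.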
| View | Thumb | Index | Middle | Ring | Pinky |
|---|---|---|---|---|---|
| Front | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Left | 1.0 | 1.0 | 0.25-0.5 | 0.125-0.25 | 0.0 |
| Right | 0.0 | 0.0 | 0.125-0.25 | 0.25-0.5 | 1.0 |
Sequential: [1] → [2] → [3] → [4] → [5] ← Phase shift!
Batch: [1] ↔ [2] ↔ [3] ↔ [4] ↔ [5] ← Bidirectional
Analogy: Sequential = tugging a string left then right (never taut). Batch = pulling both ends simultaneously (taut = smooth + accurate).
Solution: Phase-Shift-Free Smoothing. By optimizing all $N$ frames simultaneously, each frame is pulled toward both neighbors. This provides true bidirectional temporal smoothing without the phase shift of forward-only or forward-backward sequential processing.
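A one-dimensional toy problem illustrates the difference. The batch version solves a quadratic smoothing problem over all frames at once; a causal exponential smoother stands in for forward-only sequential processing. Both functions are illustrative stand-ins and not the L-BFGS optimization itself.

```python
import numpy as np

def batch_smooth(y, lam):
    """Minimize sum (x - y)^2 + lam * sum (x[t+1] - x[t])^2 over all frames jointly."""
    n = len(y)
    D = np.diff(np.eye(n), axis=0)                 # (n-1, n) first-difference operator
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

def forward_smooth(y, alpha):
    """Causal exponential smoothing: each frame only sees the past."""
    out = np.empty(len(y))
    out[0] = y[0]
    for t in range(1, len(y)):
        out[t] = alpha * y[t] + (1.0 - alpha) * out[t - 1]
    return out

t = np.arange(101)
pulse = np.maximum(0.0, 1.0 - np.abs(t - 50) / 25.0)   # symmetric peak at frame 50
batch_peak = int(np.argmax(batch_smooth(pulse, lam=10.0)))
forward_peak = int(np.argmax(forward_smooth(pulse, alpha=0.3)))
```

The batch solution keeps its peak at frame 50, while the forward smoother's peak arrives later. That delay is the phase shift, and in a clinical setting it would bias the timing of any detected jerk or key press.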
Challenge: Hamba uses a pseudo focal length (~37,500) that differs from the actual camera focal length (~1,800). This causes depth initialization errors.
Solution:
- Initialize wrist via triangulation using actual calibration
- L-BFGS optimizes translation to match 2D observations
- Do not anchor global orientation or translation (only hand pose)
| Failure | Description |
|---|---|
| DLT initialization | Triangulating weak-perspective 2D predictions produced bad 3D |
| Camera divergence | Extrinsics drifted >100mm, >45° |
| Circular dependency | Occlusion detection needs 3D pose, but 3D pose is what we optimize |
Lesson: Use MANO forward kinematics from Hamba, not DLT triangulation.
- CoTracker Temporal Tracking: For static multi-view cameras, 3D optimization provides better temporal consistency than 2D tracking.
- Occlusion via Mesh Ray-Casting: Circular dependency (needs 3D pose to detect occlusion). Manual weights + TTA uncertainty suffice.
Jerk Score. Third derivative of position indicating movement smoothness:
Peak Acceleration. Threshold-based (>5 m/s²) flash with linear decay.
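The kinematic quantities behind these metrics can be obtained by repeated finite differencing of a joint trajectory. The sketch below assumes positions in metres and a constant frame rate; `finite_difference_kinematics` is a hypothetical helper, and any smoothing applied before differentiation is omitted.

```python
import numpy as np

def finite_difference_kinematics(positions, fps):
    """positions: (T, 3) joint trajectory in metres.

    Returns per-frame magnitudes of velocity (m/s), acceleration (m/s^2), jerk (m/s^3).
    """
    dt = 1.0 / fps
    velocity = np.gradient(positions, dt, axis=0)
    acceleration = np.gradient(velocity, dt, axis=0)
    jerk = np.gradient(acceleration, dt, axis=0)
    return (np.linalg.norm(velocity, axis=1),
            np.linalg.norm(acceleration, axis=1),
            np.linalg.norm(jerk, axis=1))

# Cubic motion along x: x(t) = t^3, so the true jerk is 6 m/s^3 everywhere.
t = np.arange(60) / 60.0
positions = np.stack([t ** 3, np.zeros_like(t), np.zeros_like(t)], axis=1)
speed, accel, jerk = finite_difference_kinematics(positions, fps=60.0)
```

Away from the first and last few frames, where `np.gradient` falls back to one-sided differences, the recovered jerk matches the analytic value. On real data each differentiation amplifies noise, which is why temporally smooth 3D reconstructions matter for this metric.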
Z-score relative to PIANO10M healthy dataset:
Color mapping:
| Aspect | FürElise | Ours |
|---|---|---|
| Camera views | 5 | 3 |
| Ground truth | MIDI sensors | None |
| Pose model | HaMeR | Hamba |
| Refinement | IK from contacts | L-BFGS |
| Subjects | Elite pianists | Dystonia patients |
We compensate for fewer views and no ground truth through better single-view estimation (Hamba), TTA uncertainty, and batch optimization.
- Only 3 views (weaker triangulation geometry)
- No ground truth for validation
- Manual occlusion weights (not learned)
- Consistent $\boldsymbol{\beta}$ cannot handle shape changes
- Video-based models (HaMeR-V) for native temporal consistency
- Multi-view architectures as they mature
- Learned uncertainty prediction
- Clinical validation with severity ratings
Key contributions:
- Robust estimation: WiLoR detector + Hamba pose + TTA uncertainty
- Sub-pixel calibration: Motion valleys + pattern validation + Y-normalization
- Principled fusion: Inverse-variance weighted L-BFGS with manual occlusion weights
- Phase-shift-free smoothing: Batch optimization for true bidirectional coupling
- Design lessons: MANO forward kinematics beats DLT triangulation
| Mode | Denoise h | Gamma | CLAHE Clip | CLAHE Tile | Sharpen α | Sharpen σ |
|---|---|---|---|---|---|---|
| minimal | 0 | 1.0 | 0 | 8 | 0 | 1.0 |
| standard | 0 | 0.6 | 2.0 | 8 | 0 | 1.0 |
| dystonia | 6 | 0.85 | 2.0 | 8 | 0.2 | 1.0 |
| latching_fix | 3 | 0.75 | 3.0 | 8 | 1.5 | 1.5 |
| Idx | Joint | Idx | Joint | Idx | Joint |
|---|---|---|---|---|---|
| 0 | Wrist | 7 | Index DIP | 14 | Ring PIP |
| 1 | Thumb CMC | 8 | Index Tip | 15 | Ring DIP |
| 2 | Thumb MCP | 9 | Middle MCP | 16 | Ring Tip |
| 3 | Thumb IP | 10 | Middle PIP | 17 | Pinky MCP |
| 4 | Thumb Tip | 11 | Middle DIP | 18 | Pinky PIP |
| 5 | Index MCP | 12 | Middle Tip | 19 | Pinky DIP |
| 6 | Index PIP | 13 | Ring MCP | 20 | Pinky Tip |
| View | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Front | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Left | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.25 | 0.5 |
| Right | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.125 | 0.25 |
| View | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| Front | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Left | 0.5 | 0.5 | 0.125 | 0.25 | 0.25 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 |
| Right | 0.25 | 0.25 | 0.25 | 0.5 | 0.5 | 0.5 | 1.0 | 1.0 | 1.0 | 1.0 |
- C. Lugaresi et al., "MediaPipe: A Framework for Building Perception Pipelines," arXiv:1906.08172, 2019.
- Z. Cao et al., "OpenPose: Realtime Multi-Person 2D Pose Estimation," IEEE TPAMI, 2019.
- Y. Xu et al., "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation," NeurIPS, 2022.
- R. Khirodkar et al., "Sapiens: Foundation for Human Vision Models," arXiv:2408.12569, 2024.
- J. Romero et al., "Embodied Hands: Modeling and Capturing Hands and Bodies Together," SIGGRAPH Asia, 2017.
- G. Pavlakos et al., "Reconstructing Hands in 3D with Transformers," CVPR, 2024.
- R. Potamias et al., "WiLoR: End-to-end 3D Hand Localization and Reconstruction," arXiv:2409.12259, 2024.
- H. Dong et al., "Hamba: Single-view 3D Hand Reconstruction with Graph-guided Bi-Scanning Mamba," arXiv:2407.09646, 2024.
- R. Wang et al., "FürElise: Capturing and Physically Synthesizing Hand Motions of Piano Performance," arXiv:2410.05791, 2024.
- C. Hewitt et al., "Look Ma, no markers: holistic performance capture without the hassle," arXiv:2410.11520, 2024.
- Y. Zhou et al., "On the Continuity of Rotation Representations in Neural Networks," CVPR, 2019.
- N. Karaev et al., "CoTracker: It is Better to Track Together," ECCV, 2024.