bentang18/ReDys

Multi-View 3D Hand Pose Estimation for Dystonia Assessment

Technical Pipeline Documentation

Ben Tang


Abstract

We present a comprehensive pipeline for zero-shot 3D hand pose estimation from multi-view RGB videos of patients with dystonia performing piano and banjo playing assessments. Our approach addresses three primary challenges: (1) extreme out-of-distribution hand poses characteristic of dystonia, (2) motion blur from ballistic key presses, and (3) temporal consistency across video frames.

The pipeline comprises five stages: audio-based multi-view synchronization with sub-frame precision, adaptive image preprocessing, single-view 3D hand pose estimation using Hamba with test-time augmentation for uncertainty quantification, camera calibration achieving <0.1 pixel intrinsic and ~0.8 pixel extrinsic error, and multi-view fusion via L-BFGS optimization with inverse-variance weighted reprojection loss.

We employ the MANO parametric hand model with consistent shape parameters across all frames and a two-stage optimization strategy. Our batch optimization approach enables true bidirectional temporal smoothing, eliminating phase-shift artifacts.


1. Introduction

Clinical Motivation

Dystonia is a movement disorder characterized by sustained or intermittent muscle contractions causing abnormal, often repetitive, movements or postures. Quantitative assessment of dystonia severity remains challenging, as current clinical evaluations rely heavily on subjective rating scales. Objective biomechanical measurements derived from hand motion during functional tasks — such as piano or banjo playing — could provide valuable clinical biomarkers.

This project develops a computer vision pipeline to reconstruct accurate 3D hand pose from multi-view RGB videos of dystonia patients. The goal is to derive biomechanical measurements including joint angles, velocity profiles, jerk metrics, and abnormal posture patterns.

Technical Challenges

Three primary challenges distinguish this application from standard hand pose estimation:

Challenge 1: Extreme Out-of-Distribution Poses. Dystonic hands exhibit configurations rarely seen in standard training datasets:

  • Index finger completely curled while middle finger rigidly extended
  • Ring finger curled with thumb and pinky maximally splayed
  • Sustained abnormal co-contraction patterns

This is the primary challenge — most off-the-shelf pose estimators fail catastrophically on these extreme poses.

Challenge 2: Motion Blur. Piano key presses involve rapid ballistic finger movements causing significant motion blur at 60 FPS. This blur degrades both hand detection and pose estimation accuracy.

Challenge 3: Temporal Consistency. Frame-to-frame jitter in pose estimates must be minimized while preserving genuine high-frequency movements characteristic of dystonia (involuntary jerks, tremor).

Pipeline Overview

Input (3 RGB Videos)
  → [Stage 1] Sync (Audio FFT)
  → [Stage 2] Preprocess (CLAHE)
  → [Stage 3] Pose (Hamba + TTA)
                                    ↘
  [Stage 4] Calibration (Checkerboard) → [Stage 5] Fusion (L-BFGS) → Output (MANO)

Input and Output Specifications

| Input | Specification |
|---|---|
| Videos | 3 synchronized RGB (Front, Left, Right) |
| Resolution | 1920×1080 @ 59.94/60 FPS |
| Duration | ~3 minutes per recording |
| Cameras | Static, fixed relative positions |
| Calibration | Checkerboard videos (~50% of subjects) |

| Output | Specification |
|---|---|
| MANO models | Per-frame $(\boldsymbol{\beta}, \boldsymbol{\theta}, \mathbf{t})$ |
| Shape $\boldsymbol{\beta}$ | Consistent across all frames |
| 3D joints | 21 joints in world coordinates (mm) |
| Uncertainty | Per-joint $\sigma$ from TTA |
| Metrics | Joint angles, velocity, acceleration, jerk |

2. Related Work

Evolution of Hand Pose Estimation

During model selection we evaluated both 2D and 3D methods:

| Method | Type | Result |
|---|---|---|
| MediaPipe | 2D | Failed on dystonia |
| OpenPose | 2D | Outdated |
| ViTPose | 2D | No 3D constraints |
| HaMeR | 3D | Inconsistent |
| WiLoR | 3D | Good detector |
| Hamba | 3D | Best poses |

Selected approach: WiLoR detector + Hamba pose estimator.

Key Inspirations

Fur Elise (Wang et al., 2024): Piano hand motion synthesis using 5 cameras with MIDI ground truth. Their pipeline: HaMeR → RANSAC triangulation → Butterworth smoothing → IK refinement with MIDI contacts. We lack MIDI ground truth and have only 3 views.

Look Ma, No Markers (Hewitt et al., 2024): This Microsoft Research paper fundamentally shaped our multi-view fusion approach. Key contributions we adapt:

  • Probabilistic landmark weighting: Weight each landmark by its predicted visibility confidence. High-confidence views dominate; occluded views contribute minimally.
  • Inverse-variance weighting: Fuse multi-view predictions using $1/\sigma^2$ weighting, where $\sigma$ is landmark uncertainty.
  • No explicit occlusion detection: Let learned uncertainties handle occlusion implicitly, avoiding the chicken-and-egg problem where you need the 3D pose to detect occlusions but need occlusion information to estimate the 3D pose.
  • Holistic body model: Optimize a parametric body model rather than raw 3D keypoints, ensuring anatomically plausible poses.
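
As a small worked example of the inverse-variance rule (the numbers are illustrative, not measurements from this project), consider one landmark coordinate seen in two views with estimates $x_1$, $x_2$ and uncertainties $\sigma_1 = 1$ px, $\sigma_2 = 3$ px:

$$\hat{x} = \frac{x_1/\sigma_1^2 + x_2/\sigma_2^2}{1/\sigma_1^2 + 1/\sigma_2^2} = \frac{x_1 + x_2/9}{10/9} = 0.9\,x_1 + 0.1\,x_2$$

The view that is three times less certain receives one ninth of the weight, which is how an uncertain or occluded view can be suppressed without an explicit occlusion test.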

The MANO Hand Model

MANO (Romero et al., 2017) is a parametric hand model learned from ~1000 high-resolution 3D scans:

Shape β ∈ ℝ¹⁰  ─┐
Pose  θ ∈ ℝ⁴⁸  ─┤── MANO Forward Kinematics ──┬── Vertices V ∈ ℝ⁷⁷⁸ˣ³
Translation t ∈ ℝ³ ┘                              └── Joints   J ∈ ℝ²¹ˣ³
  • Shape $\boldsymbol{\beta}$: Controls hand proportions (consistent across frames)
  • Pose $\boldsymbol{\theta}$: Encodes 16 joint rotations (3 DOF each)
  • Forward kinematics: Produces 778 mesh vertices and 21 joint positions

3. Stage 1: Video Synchronization

The three camera views are not hardware-synchronized. We recover temporal alignment using audio-based cross-correlation with sub-frame precision.

Algorithm

Extract audio @ 16kHz (FFmpeg)
       ↓
FFT cross-correlation: c[n] = Σ_k a_ref[k] · a_target[k+n]
       ↓
Find correlation peak within ±60s window
       ↓
Parabolic interpolation for sub-sample precision
       ↓
Convert to frame offset: δ = t_offset × fps

Mathematical Formulation

Cross-Correlation. For reference audio $a_{\text{ref}}$ (Front) and target $a_{\text{target}}$:

$$c[n] = \sum_{k} a_{\text{ref}}[k] \cdot a_{\text{target}}[k + n]$$

Computed via FFT in $O(N \log N)$ using scipy.signal.correlate.
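
The correlation above can be sketched in a few lines. This is a minimal illustration that uses NumPy's FFT directly rather than scipy.signal.correlate, so that the lag convention of $c[n]$ is explicit. The function name `xcorr_lag` is hypothetical, and audio extraction and the ±60 s search window are omitted.

```python
import numpy as np

def xcorr_lag(a_ref, a_target, sample_rate):
    """Return (lag_samples, lag_seconds) maximizing c[n] = sum_k a_ref[k] * a_target[k + n]."""
    a_ref = np.asarray(a_ref, dtype=float)
    a_target = np.asarray(a_target, dtype=float)
    # Zero-pad so the circular correlation computed by the FFT equals the linear one.
    n_fft = len(a_ref) + len(a_target) - 1
    spectrum = np.conj(np.fft.rfft(a_ref, n_fft)) * np.fft.rfft(a_target, n_fft)
    c = np.fft.irfft(spectrum, n_fft)
    peak = int(np.argmax(c))
    # Indices 0..len(a_target)-1 hold non-negative lags; the rest wrap around to negative lags.
    lag = peak if peak < len(a_target) else peak - n_fft
    return lag, lag / sample_rate
```

With this convention a positive lag means the target recording starts later than the reference by that many samples.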

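Sub-Frame Precision. Fit a parabola to the correlation peak $y_1$ and its two neighbors $y_0$ and $y_2$; the vertex gives the sub-sample offset $\delta$ (in samples) relative to the peak index: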

$$\delta = \frac{y_0 - y_2}{2(y_0 - 2y_1 + y_2)}, \quad \delta \in [-0.5, 0.5]$$
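
A minimal sketch of the refinement and the conversion to a frame offset follows. The helper names `parabolic_offset` and `lag_to_frames` are hypothetical, and the zero-denominator guard is an added assumption for flat peaks.

```python
def parabolic_offset(y0, y1, y2):
    """Sub-sample offset of the parabola vertex through (-1, y0), (0, y1), (1, y2)."""
    denom = 2.0 * (y0 - 2.0 * y1 + y2)
    if denom == 0.0:  # assumption: treat a flat peak as having no sub-sample shift
        return 0.0
    return (y0 - y2) / denom

def lag_to_frames(lag_samples, delta, sample_rate, fps):
    """Convert an integer lag plus its sub-sample refinement into a frame offset."""
    t_offset = (lag_samples + delta) / sample_rate  # seconds
    return t_offset * fps

# Example: samples of y = -(x - 0.3)^2 at x = -1, 0, 1 recover the vertex at 0.3.
delta = parabolic_offset(-1.69, -0.09, -0.49)
frames = lag_to_frames(1600, delta, sample_rate=16000, fps=60.0)
```

At 16 kHz and 60 FPS one video frame spans about 267 audio samples, so even the integer lag is already well below one frame; the parabolic step refines it further.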

Confidence Metric. Based on peak-to-noise ratio:

$$\text{confidence} = \min\left(1.0,\ \frac{c[\text{peak}]}{20 \cdot \text{median}(|c|) + \epsilon}\right)$$

Key Insight: A confidence of at least 0.7 indicates a reliable sync; a confidence below 0.3 requires manual verification.
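
The confidence metric can be sketched as below. The value of `eps` and the label for the band between 0.3 and 0.7 are assumptions, since the text does not specify either; the function names are hypothetical.

```python
from statistics import median

def sync_confidence(c, peak_index, eps=1e-12):
    """Peak-to-noise confidence: min(1, c[peak] / (20 * median(|c|) + eps))."""
    noise_floor = median(abs(v) for v in c)
    return min(1.0, c[peak_index] / (20.0 * noise_floor + eps))

def sync_label(confidence):
    """Map a confidence value onto the thresholds quoted in the text."""
    if confidence >= 0.7:
        return "reliable"
    if confidence < 0.3:
        return "manual verification"
    return "intermediate"  # assumption: the text does not name this band
```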


4. Stage 2: Image Preprocessing

Enhancement Pipeline

Operations are applied in a fixed order, with denoising first, so that the later contrast and sharpening steps do not amplify noise:

Raw Frame → Bilateral Denoise → Gamma → CLAHE → Unsharp Mask → Enhanced Frame
              (reduce noise)    (brightness) (contrast) (edges)

Motion Deblurring (Considered but Not Used)

We experimented with RVRT (Recurrent Video Restoration Transformer) for motion deblurring. RVRT did improve motion clarity significantly, particularly for fast ballistic finger movements. However, we decided against including it for two reasons:

  1. Computational overhead: Processing 1080p video at 60 FPS required chunking into overlapping segments with 16-frame windows, which made processing roughly 10× slower than real time even on high-end GPUs.
  2. Hallucination risk: Neural deblurring models can hallucinate plausible but incorrect textures and positions. For clinical applications where measurement accuracy is paramount, this introduces an unacceptable source of error — the model may "sharpen" a finger into a position it was never actually in.

We rely on the enhancement pipeline (CLAHE, gamma, sharpening) which is computationally cheap and does not introduce hallucinated content.

Mathematical Formulations

Bilateral Denoising. Edge-preserving filter:

$$I_{\text{out}}(x) = \frac{1}{W_p} \sum_{x_i \in \Omega} I(x_i) \cdot f_r(|I(x_i) - I(x)|) \cdot g_s(\|x_i - x\|)$$

Gamma Correction:

$$I_{\text{out}} = I_{\text{in}}^{1/\gamma}$$

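With intensities normalized to $[0, 1]$, $\gamma < 1$ gives an exponent above 1 (e.g., $\gamma = 0.6 \Rightarrow I^{1.667}$), which darkens mid-tones, while $\gamma > 1$ gives an exponent below 1 and brightens them.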

CLAHE. Applied to L channel in LAB space with clip limit and 8×8 tile grid.

Unsharp Masking:

$$I_{\text{sharp}} = I \cdot (1 + \alpha) - (G_\sigma * I) \cdot \alpha$$

where $G_\sigma$ is a Gaussian kernel and $\alpha$ is the sharpening strength.
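
To illustrate the overshoot that unsharp masking produces at an edge, here is a 1D sketch of the formula above applied to a step signal. It assumes a Gaussian truncated at three standard deviations and edge-replicated borders; both choices, and the function names, are illustrative rather than taken from the pipeline.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian, truncated at 3 sigma (an assumption)."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def unsharp_1d(signal, alpha, sigma):
    """I_sharp = I * (1 + alpha) - (G_sigma * I) * alpha for a 1D signal."""
    signal = np.asarray(signal, dtype=float)
    k = gaussian_kernel(sigma)
    padded = np.pad(signal, len(k) // 2, mode="edge")  # replicate borders
    blurred = np.convolve(padded, k, mode="valid")     # same length as signal
    return signal * (1 + alpha) - blurred * alpha

# A step edge sharpened with the latching_fix values (alpha = 1.5, sigma = 1.5).
step = [0.0] * 10 + [1.0] * 10
sharp = unsharp_1d(step, alpha=1.5, sigma=1.5)
```

On the step signal the output dips below 0 just before the edge and rises above 1 just after it, while flat regions are unchanged. The sketch does not clip the result back to the valid intensity range.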

Preset Modes

| Mode | Denoise | Gamma | CLAHE | Sharpen | Use Case |
|---|---|---|---|---|---|
| minimal | -- | 1.0 | -- | -- | Good lighting |
| standard | -- | 0.6 | 2.0 | -- | Indoor video |
| dystonia | h=6 | 0.85 | 2.0 | 0.2 | Shadowed hands |
| latching_fix | h=3 | 0.75 | 3.0 | 1.5 | Finger separation |

Key Insight: The latching_fix mode uses aggressive sharpening (α = 1.5) to create Mach bands at edges, helping separate overlapping fingers (e.g., thumb "latching" onto index finger).


5. Stage 3: Single-View Hand Pose Estimation

Architecture: WiLoR Detector + Hamba Pose

Frame → [WiLoR YOLO Hand Detector] → [Crop & Augment] → [Hamba Bi-Mamba]
           (conf=0.6)                   (7 TTA passes)         ↓
                                                    MANO Params (β, θ, t)
                                                    2D Keypoints + Uncertainty

Hand Detection with Robustness

Multiple Detection Handling. Select largest bounding box per hand type:

$$\text{selected} = \arg\max_{\text{det}} (x_2 - x_1)(y_2 - y_1)$$

Missing Frame Interpolation. Linear interpolation between valid neighbors:

$$\text{bbox}_t = (1-\alpha) \cdot \text{bbox}_{t_{\text{prev}}} + \alpha \cdot \text{bbox}_{t_{\text{next}}}, \quad \alpha = \frac{t - t_{\text{prev}}}{t_{\text{next}} - t_{\text{prev}}}$$
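
The two rules above (largest box per hand, linear interpolation across gaps) can be sketched as follows, assuming boxes are `(x1, y1, x2, y2)` tuples and frame indices are numeric; the function names are hypothetical.

```python
def largest_box(boxes):
    """Pick the detection with the largest area (x2 - x1) * (y2 - y1)."""
    return max(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))

def interpolate_bbox(bbox_prev, bbox_next, t_prev, t_next, t):
    """Linearly interpolate a missing box at frame t between two valid neighbors."""
    alpha = (t - t_prev) / (t_next - t_prev)
    return tuple((1 - alpha) * p + alpha * n for p, n in zip(bbox_prev, bbox_next))

# Frame 11 is missing; frames 10 and 14 have valid detections.
filled = interpolate_bbox((0, 0, 10, 10), (10, 20, 30, 40), t_prev=10, t_next=14, t=11)
```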

Outlier Detection. Median filter with MAD (Median Absolute Deviation):

$$\text{outlier if } |c_t - \text{median}(c)| > 3 \cdot \text{MAD}, \quad \text{MAD} = \text{median}(|c_i - \text{median}(c)|)$$
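
A minimal sketch of the MAD test follows. It assumes $c_t$ is a scalar per-frame quantity (for example one bounding-box center coordinate) and applies the test over the whole sequence rather than a sliding window; the text does not specify either detail.

```python
from statistics import median

def mad_outliers(values, k=3.0):
    """Flag values whose distance from the median exceeds k times the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [abs(v - med) > k * mad for v in values]

# One box center jumps far from the others and is flagged.
flags = mad_outliers([100, 101, 99, 100, 102, 98, 100, 250])
```

If more than half of the values are identical the MAD is zero and every deviating value is flagged, so a small floor on the MAD may be needed in practice.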

Test-Time Augmentation (TTA)

Seven passes with varying scale (1.2, 1.3, 1.4) and rotation (0°, ±20°, ±40°). Results are aggregated to compute mean keypoints and per-joint uncertainty.

Uncertainty Computation. Per-joint standard deviation across $K$ TTA passes:

$$\sigma_j = \sqrt{\text{Var}(x_j) + \text{Var}(y_j)} = \sqrt{\sigma_{x,j}^2 + \sigma_{y,j}^2}$$

This gives a scalar uncertainty in pixels for each of the 21 joints.
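
The aggregation can be sketched as below, assuming the TTA keypoints have already been mapped back to the original image frame and stacked into an array of shape `(K, 21, 2)`. The use of population variance (`ddof=0`) is an assumption; the text does not say which estimator is used.

```python
import numpy as np

def tta_aggregate(keypoints):
    """Return mean keypoints (J, 2) and per-joint sigma (J,) from K TTA passes."""
    kp = np.asarray(keypoints, dtype=float)  # shape (K, J, 2)
    mean = kp.mean(axis=0)
    var = kp.var(axis=0)                     # per-joint Var(x), Var(y)
    sigma = np.sqrt(var.sum(axis=1))         # sqrt(Var(x) + Var(y)), in pixels
    return mean, sigma
```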


6. Stage 4: Camera Calibration

Intrinsic Calibration Pipeline

Motion Valley Detection → Two-Stage Checkerboard Detection → Pattern Valid?
                                                                ├─ yes → OpenCV calibrateCamera → K, distortion (<0.1 px error)
                                                                └─ no  → retry detection

Motion Valley Detection. Select frames with minimal camera/checkerboard motion:

  1. Compute frame-to-frame mean absolute difference
  2. Apply moving average smoothing (window = 5)
  3. Find local minima, expand to valleys
  4. Select stillest, ensuring ≥30 frame separation
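
The selection steps can be sketched as follows, assuming the per-frame motion signal (step 1) has already been computed. The valley expansion in step 3 is omitted, and the greedy selection, the `max_frames` cap, and the function names are assumptions for illustration.

```python
def moving_average(values, window=5):
    """Centered moving average; windows are truncated at the sequence ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def local_minima(values):
    """Indices strictly below the previous value and not above the next one."""
    return [i for i in range(1, len(values) - 1)
            if values[i] < values[i - 1] and values[i] <= values[i + 1]]

def select_still_frames(motion, max_frames=20, min_separation=30, window=5):
    """Greedily keep the stillest minima that are at least min_separation frames apart."""
    smooth = moving_average(motion, window)
    chosen = []
    for i in sorted(local_minima(smooth), key=lambda idx: smooth[idx]):
        if all(abs(i - j) >= min_separation for j in chosen):
            chosen.append(i)
        if len(chosen) == max_frames:
            break
    return sorted(chosen)
```

Picking candidates in order of increasing motion means that when two valleys are closer than the separation limit, the stiller one is kept.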

Pattern Validation. Reject false positives (e.g., piano keys) by validating corner intensity patterns: diagonal quadrants should match, adjacent quadrants should differ.

Fixed Parameters. Principal point fixed at image center, $k_3 = p_1 = p_2 = 0$, optionally couple $f_x = f_y$.

Extrinsic Calibration

Front camera defines the world origin ($R = I, t = 0$). Left and Right cameras are calibrated via pairwise stereo calibration against Front.

180° Rotation Ambiguity Resolution

Solution: Y-Coordinate Normalization. Checkerboard detectors may return corners in reversed order. We enforce consistency:

  • Top-right corner: index $c - 1$
  • Bottom-left corner: index $(r-1) \cdot c$
  • If $y_{\text{top-right}} > y_{\text{bottom-left}}$: flip corners

This deterministic check is instant and 100% reliable.
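
A sketch of this check, assuming corners are stored in row-major order as `(x, y)` pairs for a grid with `rows` × `cols` inner corners, and that "flip" means reversing the corner order (which corresponds to a 180° rotation of the pattern):

```python
def normalize_corner_order(corners, rows, cols):
    """Reverse the corner list if the grid was detected rotated by 180 degrees."""
    top_right = corners[cols - 1]
    bottom_left = corners[(rows - 1) * cols]
    # Image y grows downward, so the top-right corner should have the smaller y.
    if top_right[1] > bottom_left[1]:
        return corners[::-1]
    return corners
```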

Parallelogram Grid Validation

Reject misdetections by validating geometric consistency:

  • Row vectors should be parallel within each row
  • Column vectors should be parallel within each column
  • Each cell should form a parallelogram (opposite sides equal)

Maximum deviation ratio ≤ 0.15 (deviation / mean spacing).
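
One way to implement the cell-level part of this test is sketched below. It assumes row-major `(x, y)` corners and defines the deviation of a cell as the length of the difference between its top and bottom side vectors; the exact deviation measure used in the pipeline is not specified, so this definition is an assumption.

```python
import math

def grid_is_parallelogram(corners, rows, cols, max_ratio=0.15):
    """Check that every grid cell is close to a parallelogram."""
    def side(r0, c0, r1, c1):
        (x0, y0), (x1, y1) = corners[r0 * cols + c0], corners[r1 * cols + c1]
        return (x1 - x0, y1 - y0)

    spacings, deviations = [], []
    for r in range(rows - 1):
        for c in range(cols - 1):
            top = side(r, c, r, c + 1)
            bottom = side(r + 1, c, r + 1, c + 1)
            left = side(r, c, r + 1, c)
            right = side(r, c + 1, r + 1, c + 1)
            spacings.extend(math.hypot(*v) for v in (top, bottom, left, right))
            # top - bottom equals left - right, so one difference covers both side pairs.
            deviations.append(math.hypot(top[0] - bottom[0], top[1] - bottom[1]))
    mean_spacing = sum(spacings) / len(spacings)
    return max(deviations) / mean_spacing <= max_ratio
```

With a 10-pixel grid, shifting a single corner by 1 pixel passes (ratio about 0.10), while a 5-pixel shift fails (ratio about 0.49).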

Achieved Accuracy:

  • Intrinsic: < 0.1 pixels reprojection error
  • Extrinsic: ~0.8 pixels reprojection error (after all fixes)

7. Stage 5: Multi-View Fusion via L-BFGS Optimization

Optimization Overview

Initialization:
  β: median across all frames/views
  θ: Front view + median hand pose
  t: triangulated wrist
       ↓
Stage 1: Global Positioning (50 L-BFGS iterations)
  Optimize: θ_global, t
  Freeze: θ_hand, β
  Loss: L_reproj + 10λ_temp · L_temporal
       ↓
Stage 2: Full Refinement (150 L-BFGS iterations)
  Optimize: θ_global, θ_hand, t
  Freeze: β
  Loss: L_reproj + λ_anchor · L_anchor + λ_temp · L_temporal

6D Rotation Representation

For gradient-based optimization, we use the continuous 6D rotation representation (Zhou et al., 2019):

$$\text{rot2sixd}(R) = [R_{:,0},\ R_{:,1}] \in \mathbb{R}^6$$

Inverse via Gram-Schmidt orthogonalization ensures a valid rotation matrix.
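
The forward and inverse maps can be sketched in NumPy as follows. The name `sixd2rot` is hypothetical; an actual optimizer would use a differentiable tensor library, but the arithmetic is the same.

```python
import numpy as np

def rot2sixd(R):
    """Concatenate the first two columns of a 3x3 rotation matrix."""
    R = np.asarray(R, dtype=float)
    return np.concatenate([R[:, 0], R[:, 1]])

def sixd2rot(d6):
    """Recover a rotation matrix from any 6D vector via Gram-Schmidt."""
    d6 = np.asarray(d6, dtype=float)
    a1, a2 = d6[:3], d6[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2_orth = a2 - np.dot(b1, a2) * b1      # remove the component along b1
    b2 = a2_orth / np.linalg.norm(a2_orth)
    b3 = np.cross(b1, b2)                   # right-handed third axis
    return np.stack([b1, b2, b3], axis=1)   # columns are b1, b2, b3
```

Because the inverse orthonormalizes its input, the optimizer can update the six numbers freely and still obtain a valid rotation, provided the two 3-vectors are non-zero and not parallel.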

Loss Functions

Reprojection Loss

The primary loss measures 2D reprojection error with inverse-variance weighting:

$$\mathcal{L}_{\text{reproj}} = \frac{1}{N_{\text{obs}}} \sum_{v \in \mathcal{V}} \sum_{j=1}^{21} \frac{w_{v,j}^{\text{manual}}}{2(\sigma_{v,j}^2 + \epsilon)} \left\|\pi_v(J_j) - \hat{J}_{v,j}^{2D}\right\|_2^2$$

where:

  • $\pi_v(\cdot)$: perspective projection with radial distortion for view $v$
  • $w_{v,j}^{\text{manual}}$: manual occlusion weight
  • $\sigma_{v,j}$: TTA uncertainty
  • $\epsilon = 10^{-5}$: numerical stability
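
For a single frame, the weighted loss can be sketched as below. The sketch takes already-projected 2D joints as input (the projection $\pi_v$ is not implemented) and assumes $N_{\text{obs}}$ counts the view-joint pairs with non-zero manual weight, which the text does not state explicitly.

```python
import numpy as np

def reprojection_loss(projected, observed, sigma, weights, eps=1e-5):
    """Inverse-variance weighted squared reprojection error for one frame.

    projected, observed: (V, J, 2) pixel coordinates; sigma, weights: (V, J).
    """
    projected = np.asarray(projected, dtype=float)
    observed = np.asarray(observed, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    weights = np.asarray(weights, dtype=float)
    sq_err = np.sum((projected - observed) ** 2, axis=-1)      # (V, J)
    per_obs = weights * sq_err / (2.0 * (sigma ** 2 + eps))
    n_obs = max(int(np.count_nonzero(weights)), 1)             # assumption, see lead-in
    return float(per_obs.sum() / n_obs)
```

For the same pixel error, a joint with $\sigma = 1$ px contributes roughly four times as much as one with $\sigma = 2$ px, and a manual weight of 0 removes the observation entirely.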

Anchor Loss (Stage 2 Only)

Regularizes hand pose toward initialization:

$$\mathcal{L}_{\text{anchor}} = \frac{1}{N} \sum_{t=1}^{N} \left\|\boldsymbol{\theta}_{\text{hand},t}^{6D} - \boldsymbol{\theta}_{\text{anchor},t}^{6D}\right\|_2^2$$

Key Insight: Anchor loss is applied only to hand pose, not global orientation or translation, because those are affected by focal length scaling from Hamba initialization.

Temporal Smoothness Loss

Penalizes vertex velocity across frames:

$$\mathcal{L}_{\text{temporal}} = \frac{1}{N-1} \sum_{t=2}^{N} \sum_{v=1}^{778} \left\|V_t^{(v)} - V_{t-1}^{(v)}\right\|_2^2$$

Using all 778 vertices (not just 21 joints) provides smoother regularization.
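
A NumPy sketch of this loss over a whole sequence, assuming the vertices are stacked into an array of shape `(N, 778, 3)` (the function itself works for any vertex count):

```python
import numpy as np

def temporal_loss(vertices):
    """Mean over frame pairs of the summed squared vertex displacement."""
    V = np.asarray(vertices, dtype=float)   # shape (N, num_vertices, 3)
    diffs = V[1:] - V[:-1]                  # displacement between consecutive frames
    return float(np.sum(diffs ** 2) / (len(V) - 1))
```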

Manual Per-View Per-Joint Weights

| View | Thumb | Index | Middle | Ring | Pinky |
|---|---|---|---|---|---|
| Front | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Left | 1.0 | 1.0 | 0.25-0.5 | 0.125-0.25 | 0.0 |
| Right | 0.0 | 0.0 | 0.125-0.25 | 0.25-0.5 | 1.0 |

Batch vs. Sequential Optimization

Sequential:  [1] → [2] → [3] → [4] → [5]    ← Phase shift!
Batch:       [1] ↔ [2] ↔ [3] ↔ [4] ↔ [5]    ← Bidirectional

Analogy: Sequential = tugging a string left then right (never taut). Batch = pulling both ends simultaneously (taut = smooth + accurate).

Solution: Phase-Shift-Free Smoothing. By optimizing all $N$ frames simultaneously, each frame is pulled toward both neighbors. This provides true bidirectional temporal smoothing without the phase shift of forward-only or forward-backward sequential processing.
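
The bidirectional coupling can be seen by differentiating the temporal smoothness loss with respect to a vertex of an interior frame. This is a short derivation from the definition of $\mathcal{L}_{\text{temporal}}$, treating each vertex position as a free variable for illustration:

$$\frac{\partial \mathcal{L}_{\text{temporal}}}{\partial V_t^{(v)}} = \frac{2}{N-1}\left(2V_t^{(v)} - V_{t-1}^{(v)} - V_{t+1}^{(v)}\right)$$

The gradient vanishes when $V_t^{(v)}$ is the midpoint of its two neighbors, so the smoothing term pulls each frame symmetrically toward the past and the future and favors neither direction in time.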

Focal Length Mismatch Handling

Challenge: Hamba uses a pseudo focal length (~37,500) that differs from the actual camera focal length (~1,800). This causes depth initialization errors.

Solution:

  1. Initialize wrist via triangulation using actual calibration
  2. L-BFGS optimizes translation to match 2D observations
  3. Do not anchor global orientation or translation (only hand pose)

8. Challenges and Lessons Learned

Abandoned: Spatio-Temporal Bundle Adjustment (STBA)

| Failure | Description |
|---|---|
| DLT initialization | Triangulating weak-perspective 2D predictions produced bad 3D |
| Camera divergence | Extrinsics drifted >100mm, >45° |
| Circular dependency | Occlusion detection needs 3D pose, but 3D pose is what we optimize |

Lesson: Use MANO forward kinematics from Hamba, not DLT triangulation.

Other Abandoned Approaches

  • CoTracker Temporal Tracking: For static multi-view cameras, 3D optimization provides better temporal consistency than 2D tracking.
  • Occlusion via Mesh Ray-Casting: Circular dependency (needs 3D pose to detect occlusion). Manual weights + TTA uncertainty suffice.

9. Results and Visualization

Dystonia Metrics

Jerk Score. Third derivative of position indicating movement smoothness:

$$\text{jerk}_j(t) = \left\|\frac{d^3 J_j(t)}{dt^3}\right\|_2, \quad \text{score}_j = \left(\frac{\text{jerk}_j - P_{50}}{P_{95} - P_{50}}\right)^{0.5}$$

Peak Acceleration. Threshold-based (>5 m/s²) flash with linear decay.
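
The jerk score can be sketched with finite differences as below. The sketch assumes $P_{50}$ and $P_{95}$ are the 50th and 95th percentiles of the jerk values, and it clips the normalized value to $[0, 1]$ before the square root so that values below the median do not produce a negative radicand; both points are assumptions, as are the function names.

```python
import numpy as np

def jerk_magnitude(joints, fps):
    """Jerk magnitude of one joint from (T, 3) positions; returns T - 3 values."""
    joints = np.asarray(joints, dtype=float)
    dt = 1.0 / fps
    third_diff = np.diff(joints, n=3, axis=0)   # forward third finite difference
    return np.linalg.norm(third_diff, axis=-1) / dt ** 3

def jerk_score(jerk):
    """Percentile-normalized jerk score with a square-root response."""
    jerk = np.asarray(jerk, dtype=float)
    p50, p95 = np.percentile(jerk, [50, 95])
    if p95 <= p50:                              # degenerate case: no spread
        return np.zeros_like(jerk)
    normalized = np.clip((jerk - p50) / (p95 - p50), 0.0, 1.0)  # assumption: clip
    return np.sqrt(normalized)
```

Finite differences amplify frame-to-frame noise strongly at the third derivative, which is one reason the smoothness of the fused trajectory matters for this metric.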

Baseline Comparison

Z-score relative to PIANO10M healthy dataset:

$$z_j = \frac{|\theta_j - \mu_{\text{baseline}}|}{\sigma_{\text{baseline}}}$$

Color mapping: $z \leq 1$ (green), $1 < z < 3$ (yellow), $z \geq 3$ (red).
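
The z-score and its color bins can be sketched as follows; the function names are hypothetical and the baseline statistics are passed in as plain numbers.

```python
def baseline_z(theta, mu_baseline, sigma_baseline):
    """Absolute z-score of a joint angle relative to the healthy baseline."""
    return abs(theta - mu_baseline) / sigma_baseline

def z_color(z):
    """Map a z-score to the visualization color."""
    if z <= 1:
        return "green"
    if z < 3:
        return "yellow"
    return "red"

# Example: an angle of 70 against a baseline of 40 +/- 10 gives z = 3 (red).
color = z_color(baseline_z(70.0, 40.0, 10.0))
```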


10. Discussion

Comparison with Fur Elise

| Aspect | Fur Elise | Ours |
|---|---|---|
| Camera views | 5 | 3 |
| Ground truth | MIDI sensors | None |
| Pose model | HaMeR | Hamba |
| Refinement | IK from contacts | L-BFGS |
| Subjects | Elite pianists | Dystonia patients |

We compensate for fewer views and no ground truth through better single-view estimation (Hamba), TTA uncertainty, and batch optimization.

Limitations

  1. Only 3 views (weaker triangulation geometry)
  2. No ground truth for validation
  3. Manual occlusion weights (not learned)
  4. Consistent $\boldsymbol{\beta}$ cannot handle shape changes

Future Work

  1. Video-based models (HaMeR-V) for native temporal consistency
  2. Multi-view architectures as they mature
  3. Learned uncertainty prediction
  4. Clinical validation with severity ratings

Conclusion

Key contributions:

  1. Robust estimation: WiLoR detector + Hamba pose + TTA uncertainty
  2. Sub-pixel calibration: Motion valleys + pattern validation + Y-normalization
  3. Principled fusion: Inverse-variance weighted L-BFGS with manual occlusion weights
  4. Phase-shift-free smoothing: Batch optimization for true bidirectional coupling
  5. Design lessons: MANO forward kinematics beats DLT triangulation

Appendices

A. Complete Preprocessing Parameters

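| Mode | Denoise h | Gamma | CLAHE Clip | CLAHE Tile | Sharpen α | Sharpen σ |
|---|---|---|---|---|---|---|
| minimal | 0 | 1.0 | 0 | 8 | 0 | 1.0 |
| standard | 0 | 0.6 | 2.0 | 8 | 0 | 1.0 |
| dystonia | 6 | 0.85 | 2.0 | 8 | 0.2 | 1.0 |
| latching_fix | 3 | 0.75 | 3.0 | 8 | 1.5 | 1.5 |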

B. MANO Joint Indices

| Idx | Joint | Idx | Joint | Idx | Joint |
|---|---|---|---|---|---|
| 0 | Wrist | 7 | Index DIP | 14 | Ring PIP |
| 1 | Thumb CMC | 8 | Index Tip | 15 | Ring DIP |
| 2 | Thumb MCP | 9 | Middle MCP | 16 | Ring Tip |
| 3 | Thumb IP | 10 | Middle PIP | 17 | Pinky MCP |
| 4 | Thumb Tip | 11 | Middle DIP | 18 | Pinky PIP |
| 5 | Index MCP | 12 | Middle Tip | 19 | Pinky DIP |
| 6 | Index PIP | 13 | Ring MCP | 20 | Pinky Tip |

C. Complete Manual Weights (Right Hand)

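| View | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Front | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Left | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.25 | 0.5 |
| Right | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.125 | 0.25 |

| View | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| Front | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Left | 0.5 | 0.5 | 0.125 | 0.25 | 0.25 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 |
| Right | 0.25 | 0.25 | 0.25 | 0.5 | 0.5 | 0.5 | 1.0 | 1.0 | 1.0 | 1.0 |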

References

  1. C. Lugaresi et al., "MediaPipe: A Framework for Building Perception Pipelines," arXiv:1906.08172, 2019.
  2. Z. Cao et al., "OpenPose: Realtime Multi-Person 2D Pose Estimation," IEEE TPAMI, 2019.
  3. Y. Xu et al., "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation," NeurIPS, 2022.
  4. R. Khirodkar et al., "Sapiens: Foundation for Human Vision Models," arXiv:2408.12569, 2024.
  5. J. Romero et al., "Embodied Hands: Modeling and Capturing Hands and Bodies Together," SIGGRAPH Asia, 2017.
  6. G. Pavlakos et al., "Reconstructing Hands in 3D with Transformers," CVPR, 2024.
  7. R. Potamias et al., "WiLoR: End-to-end 3D Hand Localization and Reconstruction," arXiv:2409.12259, 2024.
  8. H. Dong et al., "Hamba: Single-view 3D Hand Reconstruction with Graph-guided Bi-Scanning Mamba," arXiv:2407.09646, 2024.
  9. R. Wang et al., "Fur Elise: Capturing and Physically Synthesizing Hand Motions of Piano Performance," arXiv:2410.05791, 2024.
  10. C. Hewitt et al., "Look Ma, No Markers: Holistic Performance Capture," arXiv:2410.11520, 2024.
  11. Y. Zhou et al., "On the Continuity of Rotation Representations in Neural Networks," CVPR, 2019.
  12. N. Karaev et al., "CoTracker: It is Better to Track Together," ECCV, 2024.
