Multi-View 3D Hand Pose Estimation for Dystonia Assessment

Technical Pipeline Documentation

Ben Tang

Abstract

We present a comprehensive pipeline for zero-shot 3D hand pose estimation from multi-view RGB videos of patients with dystonia performing piano and banjo playing assessments. Our approach addresses three primary challenges: (1) extreme out-of-distribution hand poses characteristic of dystonia, (2) motion blur from ballistic key presses, and (3) temporal consistency across video frames.

The pipeline comprises five stages: audio-based multi-view synchronization with sub-frame precision, adaptive image preprocessing, single-view 3D hand pose estimation using Hamba with test-time augmentation for uncertainty quantification, camera calibration achieving <0.1 pixel intrinsic and ~0.8 pixel extrinsic error, and multi-view fusion via L-BFGS optimization with inverse-variance weighted reprojection loss.

We employ the MANO parametric hand model with consistent shape parameters across all frames and a two-stage optimization strategy. Our batch optimization approach enables true bidirectional temporal smoothing, eliminating phase-shift artifacts.

1. Introduction

Clinical Motivation

Dystonia is a movement disorder characterized by sustained or intermittent muscle contractions causing abnormal, often repetitive, movements or postures. Quantitative assessment of dystonia severity remains challenging, as current clinical evaluations rely heavily on subjective rating scales. Objective biomechanical measurements derived from hand motion during functional tasks — such as piano or banjo playing — could provide valuable clinical biomarkers.

This project develops a computer vision pipeline to reconstruct accurate 3D hand pose from multi-view RGB videos of dystonia patients. The goal is to derive biomechanical measurements including joint angles, velocity profiles, jerk metrics, and abnormal posture patterns.

Technical Challenges

Three primary challenges distinguish this application from standard hand pose estimation:

Challenge 1: Extreme Out-of-Distribution Poses. Dystonic hands exhibit configurations rarely seen in standard training datasets:

Index finger completely curled while middle finger rigidly extended

Ring finger curled with thumb and pinky maximally splayed

Sustained abnormal co-contraction patterns

This is the primary challenge — most off-the-shelf pose estimators fail catastrophically on these extreme poses.

Challenge 2: Motion Blur. Piano key presses involve rapid ballistic finger movements that cause significant motion blur at 60 FPS. This blur degrades both hand detection and pose estimation accuracy.

Challenge 3: Temporal Consistency. Frame-to-frame jitter in pose estimates must be minimized while preserving genuine high-frequency movements characteristic of dystonia (involuntary jerks, tremor).

Pipeline Overview

Input (3 RGB Videos)
  → [Stage 1] Sync (Audio FFT)
  → [Stage 2] Preprocess (CLAHE)
  → [Stage 3] Pose (Hamba + TTA)
                                    ↘
  [Stage 4] Calibration (Checkerboard) → [Stage 5] Fusion (L-BFGS) → Output (MANO)

Input and Output Specifications

Input	Specification
Videos	3 synchronized RGB (Front, Left, Right)
Resolution	1920×1080 @ 59.94/60 FPS
Duration	~3 minutes per recording
Cameras	Static, fixed relative positions
Calibration	Checkerboard videos (~50% of subjects)

Output	Specification
MANO models	Per-frame $(\boldsymbol{\beta}, \boldsymbol{\theta}, \mathbf{t})$
Shape $\boldsymbol{\beta}$	Consistent across all frames
3D joints	21 joints in world coordinates (mm)
Uncertainty	Per-joint $\sigma$ from TTA
Metrics	Joint angles, velocity, acceleration, jerk

2. Related Work

Evolution of Hand Pose Estimation

During model selection we evaluated both 2D and 3D methods:

Method	Type	Result
MediaPipe	2D	Failed on dystonia
OpenPose	2D	Outdated
ViTPose	2D	No 3D constraints
HaMeR	3D	Inconsistent
WiLoR	3D	Good detector
Hamba	3D	Best poses

Selected approach: WiLoR detector + Hamba pose estimator.

Key Inspirations

Fur Elise (Mao et al., 2024): Piano hand motion synthesis using 5 cameras with MIDI ground truth. Their pipeline: HaMeR → RANSAC triangulation → Butterworth smoothing → IK refinement with MIDI contacts. We lack MIDI ground truth and have only 3 views.

Look Ma, No Markers (Bagautdinov et al., 2024): This Microsoft Research paper fundamentally shaped our multi-view fusion approach. Key contributions we adapt:

Probabilistic landmark weighting: Weight each landmark by its predicted visibility confidence. High-confidence views dominate; occluded views contribute minimally.
Inverse-variance weighting: Fuse multi-view predictions using $1/\sigma^2$ weighting, where $\sigma$ is landmark uncertainty.
No explicit occlusion detection: Let learned uncertainties handle occlusion implicitly, avoiding the chicken-and-egg problem where you need the 3D pose to detect occlusions but need occlusion information to estimate the 3D pose.
Holistic body model: Optimize a parametric body model rather than raw 3D keypoints, ensuring anatomically plausible poses.

The MANO Hand Model

MANO (Romero et al., 2017) is a parametric hand model learned from ~1000 high-resolution 3D scans:

Shape β ∈ ℝ¹⁰  ─┐
Pose  θ ∈ ℝ⁴⁸  ─┤── MANO Forward Kinematics ──┬── Vertices V ∈ ℝ⁷⁷⁸ˣ³
Translation t ∈ ℝ³ ┘                              └── Joints   J ∈ ℝ²¹ˣ³

Shape $\boldsymbol{\beta}$: Controls hand proportions (consistent across frames)
Pose $\boldsymbol{\theta}$: Encodes 16 joint rotations (global wrist orientation plus 15 finger joints, 3 DOF each)
Forward kinematics: Produces 778 mesh vertices and 21 joint positions

3. Stage 1: Video Synchronization

The three camera views are not hardware-synchronized. We recover temporal alignment using audio-based cross-correlation with sub-frame precision.

Algorithm

Extract audio @ 16kHz (FFmpeg)
       ↓
FFT cross-correlation: c[n] = Σ_k a_ref[k] · a_target[k+n]
       ↓
Find correlation peak within ±60s window
       ↓
Parabolic interpolation for sub-sample precision
       ↓
Convert to frame offset: δ = t_offset × fps

Mathematical Formulation

Cross-Correlation. For reference audio $a_{\text{ref}}$ (Front) and target $a_{\text{target}}$:

$$c[n] = \sum_{k} a_{\text{ref}}[k] \cdot a_{\text{target}}[k + n]$$

Computed via FFT in $O(N \log N)$ using scipy.signal.correlate.

Sub-Frame Precision. Fit parabola to peak and neighbors $(y_0, y_1, y_2)$:

$$\delta = \frac{y_0 - y_2}{2(y_0 - 2y_1 + y_2)}, \quad \delta \in [-0.5, 0.5]$$
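
The cross-correlation and parabolic refinement above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `estimate_offset` and its parameters are hypothetical, and the audio tracks are assumed to be mono NumPy arrays already resampled to a common rate.

```python
import numpy as np
from scipy.signal import correlate, correlation_lags

def estimate_offset(a_ref, a_target, sample_rate=16_000, max_lag_s=60.0):
    """Estimate how many seconds a_target lags a_ref (negative if it leads)."""
    a_ref = np.asarray(a_ref, dtype=float)
    a_target = np.asarray(a_target, dtype=float)

    # c[n] = sum_k a_ref[k] * a_target[k + n], evaluated via FFT.
    c = correlate(a_target, a_ref, mode="full", method="fft")
    lags = correlation_lags(len(a_target), len(a_ref), mode="full")

    # Restrict the peak search to the +/- max_lag_s window.
    idx = np.flatnonzero(np.abs(lags) <= int(max_lag_s * sample_rate))
    peak = idx[np.argmax(c[idx])]

    # Parabolic interpolation through the peak and its two neighbours.
    delta = 0.0
    if 0 < peak < len(c) - 1:
        y0, y1, y2 = c[peak - 1], c[peak], c[peak + 1]
        denom = 2.0 * (y0 - 2.0 * y1 + y2)
        if denom != 0.0:
            delta = float(np.clip((y0 - y2) / denom, -0.5, 0.5))

    return float(lags[peak] + delta) / sample_rate
```

Multiplying the returned offset by the video frame rate gives the fractional frame offset between the two views.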

Confidence Metric. Based on peak-to-noise ratio:

$$\text{confidence} = \min\left(1.0,\ \frac{c[\text{peak}]}{20 \cdot \text{median}(|c|) + \epsilon}\right)$$

Key Insight: A confidence of at least 0.7 indicates reliable sync; a confidence below 0.3 requires manual verification.
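
As a rough sketch of the confidence metric, the function below evaluates the peak-to-noise ratio on a correlation array. The name `sync_confidence` and the value of `eps` are assumptions; the source only specifies the formula.

```python
import numpy as np

def sync_confidence(c, eps=1e-8):
    """Peak-to-noise confidence in [0, 1] for a cross-correlation array c."""
    c = np.asarray(c, dtype=float)
    noise_floor = 20.0 * np.median(np.abs(c)) + eps  # eps value is an assumption
    return float(min(1.0, c.max() / noise_floor))
```

A single sharp peak over a low noise floor saturates at 1.0, while a flat correlation with no distinct peak stays near 0.05.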

4. Stage 2: Image Preprocessing

Enhancement Pipeline

Operations are applied in a specific order to prevent noise amplification:

Raw Frame → Bilateral Denoise → Gamma → CLAHE → Unsharp Mask → Enhanced Frame
              (reduce noise)    (brightness) (contrast) (edges)

Motion Deblurring (Considered but Not Used)

We experimented with RVRT (Recurrent Video Restoration Transformer) for motion deblurring. RVRT did improve motion clarity significantly, particularly for fast ballistic finger movements. However, we decided against including it for two reasons:

Computational overhead: Processing 1080p video at 60 FPS required chunking into overlapping 16-frame segments, and ran roughly 10× slower than real time even on high-end GPUs.
Hallucination risk: Neural deblurring models can hallucinate plausible but incorrect textures and positions. For clinical applications where measurement accuracy is paramount, this introduces an unacceptable source of error — the model may "sharpen" a finger into a position it was never actually in.

We rely on the enhancement pipeline (CLAHE, gamma, sharpening) which is computationally cheap and does not introduce hallucinated content.

Mathematical Formulations

Bilateral Denoising. Edge-preserving filter:

$$I_{\text{out}}(x) = \frac{1}{W_p} \sum_{x_i \in \Omega} I(x_i) \cdot f_r(|I(x_i) - I(x)|) \cdot g_s(|x_i - x|)$$

Gamma Correction:

$$I_{\text{out}} = I_{\text{in}}^{\gamma}, \quad I_{\text{in}} \in [0, 1]$$

Values $\gamma < 1$ brighten the image (e.g., $\gamma = 0.6$ maps $I_{\text{in}} = 0.25$ to $0.25^{0.6} \approx 0.44$).

CLAHE. Applied to the L channel in LAB space with a per-mode clip limit (2.0 or 3.0 in the presets) and an 8×8 tile grid.

Unsharp Masking:

$$I_{\text{sharp}} = I \cdot (1 + \alpha) - (G_\sigma * I) \cdot \alpha$$

where $G_\sigma$ is a Gaussian kernel and $\alpha$ is the sharpening strength.
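
To make the unsharp masking formula concrete, here is a NumPy-only sketch for a single-channel image with values in [0, 1]. The helper names are hypothetical, and the Gaussian blur is a simple separable convolution rather than whatever routine the pipeline actually calls.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur G_sigma * I with edge replication."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    k /= k.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def unsharp_mask(img, alpha, sigma):
    """I_sharp = I * (1 + alpha) - (G_sigma * I) * alpha, clipped to [0, 1]."""
    img = np.asarray(img, dtype=float)
    sharp = img * (1.0 + alpha) - gaussian_blur(img, sigma) * alpha
    return np.clip(sharp, 0.0, 1.0)
```

Flat regions pass through unchanged because the blur equals the input there. At a step edge the bright side overshoots and the dark side undershoots, and that overshoot grows with $\alpha$.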

Preset Modes

Mode	Denoise	Gamma	CLAHE	Sharpen	Use Case
`minimal`	--	1.0	--	--	Good lighting
`standard`	--	0.6	2.0	--	Indoor video
`dystonia`	h=6	0.85	2.0	0.2	Shadowed hands
`latching_fix`	h=3	0.75	3.0	1.5	Finger separation

Key Insight: The latching_fix mode uses aggressive sharpening (α = 1.5) to create Mach bands at edges, helping separate overlapping fingers (e.g., thumb "latching" onto index finger).

5. Stage 3: Single-View Hand Pose Estimation

Architecture: WiLoR Detector + Hamba Pose

Frame → [WiLoR YOLO Hand Detector] → [Crop & Augment] → [Hamba Bi-Mamba]
           (conf=0.6)                   (7 TTA passes)         ↓
                                                    MANO Params (β, θ, t)
                                                    2D Keypoints + Uncertainty

Hand Detection with Robustness

Multiple Detection Handling. Select largest bounding box per hand type:

$$\text{selected} = \arg\max_{\text{det}} (x_2 - x_1)(y_2 - y_1)$$

Missing Frame Interpolation. Linear interpolation between valid neighbors:

$$\text{bbox}_t = (1-\alpha) \cdot \text{bbox}_{t_{\text{prev}}} + \alpha \cdot \text{bbox}_{t_{\text{next}}}, \quad \alpha = \frac{t - t_{\text{prev}}}{t_{\text{next}} - t_{\text{prev}}}$$
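
The selection and interpolation rules can be written in a few lines. This sketch assumes boxes are `(x1, y1, x2, y2)` tuples; the function names are illustrative only.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)

def select_largest(detections):
    """Keep the detection with the largest area for one hand type."""
    return max(detections, key=box_area)

def interpolate_bbox(t, t_prev, bbox_prev, t_next, bbox_next):
    """Linearly interpolate a missing box between two valid neighbours."""
    alpha = (t - t_prev) / (t_next - t_prev)
    return tuple((1.0 - alpha) * p + alpha * n for p, n in zip(bbox_prev, bbox_next))
```

For example, a frame halfway between two detections receives the coordinate-wise average of the neighbouring boxes.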

Outlier Detection. Median filter with MAD (Median Absolute Deviation):

$$\text{outlier if } |c_t - \text{median}(c)| > 3 \cdot \text{MAD}, \quad \text{MAD} = \text{median}(|c_i - \text{median}(c)|)$$
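
A minimal sketch of the MAD test follows. It assumes `c` is a 1D series of one bounding-box quantity (for instance a centre coordinate) and uses a single global median; a windowed median, if that is what the pipeline uses, would apply the same test per window.

```python
import numpy as np

def mad_outliers(c, k=3.0):
    """Boolean mask of samples farther than k * MAD from the median."""
    c = np.asarray(c, dtype=float)
    med = np.median(c)
    mad = np.median(np.abs(c - med))
    return np.abs(c - med) > k * mad
```

Note that when more than half of the samples are identical the MAD is zero, so any deviation is flagged; a small floor on the MAD may be needed in practice.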

Test-Time Augmentation (TTA)

Seven passes with varying scale (1.2, 1.3, 1.4) and rotation (0°, ±20°, ±40°). Results are aggregated to compute mean keypoints and per-joint uncertainty.

Uncertainty Computation. Per-joint standard deviation across $K$ TTA passes:

$$\sigma_j = \sqrt{\text{Var}(x_j) + \text{Var}(y_j)} = \sqrt{\sigma_{x,j}^2 + \sigma_{y,j}^2}$$

This gives a scalar uncertainty in pixels for each of the 21 joints.
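
The aggregation step might look like the sketch below. It assumes the $K$ augmented predictions have already been mapped back to original image coordinates and stacked into an array of shape `(K, 21, 2)`; the use of population variance (`ddof=0`) is also an assumption.

```python
import numpy as np

def tta_uncertainty(keypoints):
    """Return mean 2D keypoints (21, 2) and per-joint sigma (21,) in pixels."""
    kp = np.asarray(keypoints, dtype=float)   # (K, 21, 2)
    mean = kp.mean(axis=0)
    var = kp.var(axis=0)                      # per-joint Var(x), Var(y)
    sigma = np.sqrt(var[:, 0] + var[:, 1])
    return mean, sigma
```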

6. Stage 4: Camera Calibration

Intrinsic Calibration Pipeline

Motion Valley Detection → Two-Stage Checkerboard Detection → Pattern Valid?
                                                                ├─ yes → OpenCV calibrateCamera → K, distortion (<0.1 px error)
                                                                └─ no  → retry detection

Motion Valley Detection. Select frames with minimal camera/checkerboard motion:

Compute frame-to-frame mean absolute difference
Apply moving average smoothing (window = 5)
Find local minima, expand to valleys
Select the stillest frames, ensuring ≥30 frame separation
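
The steps above can be sketched on a precomputed motion signal (one mean absolute difference per consecutive frame pair). The function name and the greedy selection are assumptions, and the "expand to valleys" step is omitted for brevity.

```python
import numpy as np

def select_still_frames(motion, n_select, window=5, min_sep=30):
    """Pick up to n_select local minima of the smoothed motion signal."""
    motion = np.asarray(motion, dtype=float)
    smooth = np.convolve(motion, np.ones(window) / window, mode="same")

    # Interior local minima: strictly below the left neighbour, not above the right.
    inner = (smooth[1:-1] < smooth[:-2]) & (smooth[1:-1] <= smooth[2:])
    candidates = np.flatnonzero(np.r_[False, inner, False])

    # Greedily take the stillest candidates that respect the separation rule.
    chosen = []
    for idx in candidates[np.argsort(smooth[candidates], kind="stable")]:
        if all(abs(int(idx) - c) >= min_sep for c in chosen):
            chosen.append(int(idx))
        if len(chosen) == n_select:
            break
    return sorted(chosen)
```

The motion signal itself can be computed from grayscale frames of shape `(T, H, W)` as `np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))`.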

Pattern Validation. Reject false positives (e.g., piano keys) by validating corner intensity patterns: diagonal quadrants should match, adjacent quadrants should differ.

Fixed Parameters. Principal point fixed at image center, $k_3 = p_1 = p_2 = 0$, optionally couple $f_x = f_y$.

Extrinsic Calibration

Front camera defines the world origin ($R = I, t = 0$). Left and Right cameras are calibrated via pairwise stereo calibration against Front.

180° Rotation Ambiguity Resolution

Solution: Y-Coordinate Normalization. Checkerboard detectors may return corners in reversed order. With corners stored row by row for a grid of $r$ rows and $c$ columns, we enforce consistency:

Top-right corner: index $c - 1$

Bottom-left corner: index $(r-1) \cdot c$

If $y_{\text{top-right}} > y_{\text{bottom-left}}$: flip corners

This deterministic check is instant and 100% reliable.
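
A small sketch of the check, assuming corners are stored row by row and that "flip" means reversing the corner sequence (which corresponds to a 180° rotation of the grid). The function name is hypothetical.

```python
import numpy as np

def normalize_corner_order(corners, rows, cols):
    """Return corners ordered so the top-right corner lies above the bottom-left."""
    corners = np.asarray(corners, dtype=float).reshape(rows * cols, 2)
    top_right = corners[cols - 1]
    bottom_left = corners[(rows - 1) * cols]
    # Image y grows downward, so top-right must have the smaller y.
    if top_right[1] > bottom_left[1]:
        corners = corners[::-1]
    return corners
```

The `reshape` also accepts the `(N, 1, 2)` layout that OpenCV corner detectors return.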

Parallelogram Grid Validation

Reject misdetections by validating geometric consistency:

Row vectors should be parallel within each row
Column vectors should be parallel within each column
Each cell should form a parallelogram (opposite sides equal)

Maximum deviation ratio ≤ 0.15 (deviation / mean spacing).
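
One plausible reading of these checks is sketched below: compare neighbouring step vectors along rows and columns, compare opposite sides of each cell, and normalise the worst deviation by the mean corner spacing. The exact deviation measure used in the pipeline is not stated, so treat this as an assumption.

```python
import numpy as np

def grid_is_valid(corners, rows, cols, max_ratio=0.15):
    """Accept a detection only if the corner grid is locally parallelogram-like."""
    g = np.asarray(corners, dtype=float).reshape(rows, cols, 2)
    h = g[:, 1:] - g[:, :-1]          # steps along each row
    v = g[1:] - g[:-1]                # steps along each column
    spacing = np.mean(np.r_[np.linalg.norm(h, axis=-1).ravel(),
                            np.linalg.norm(v, axis=-1).ravel()])
    deviations = [
        h[:, 1:] - h[:, :-1],         # consecutive steps within a row
        v[1:] - v[:-1],               # consecutive steps within a column
        h[1:] - h[:-1],               # top vs bottom side of each cell
        v[:, 1:] - v[:, :-1],         # left vs right side of each cell
    ]
    worst = max(np.linalg.norm(d, axis=-1).max() for d in deviations if d.size)
    return bool(worst / spacing <= max_ratio)
```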

Achieved Accuracy:

Intrinsic: < 0.1 pixels reprojection error
Extrinsic: ~0.8 pixels reprojection error (after all fixes)

7. Stage 5: Multi-View Fusion via L-BFGS Optimization

Optimization Overview

Initialization:
  β: median across all frames/views
  θ: Front view + median hand pose
  t: triangulated wrist
       ↓
Stage 1: Global Positioning (50 L-BFGS iterations)
  Optimize: θ_global, t
  Freeze: θ_hand, β
  Loss: L_reproj + 10λ_temp · L_temporal
       ↓
Stage 2: Full Refinement (150 L-BFGS iterations)
  Optimize: θ_global, θ_hand, t
  Freeze: β
  Loss: L_reproj + λ_anchor · L_anchor + λ_temp · L_temporal

6D Rotation Representation

For gradient-based optimization, we use the continuous 6D rotation representation (Zhou et al., 2019):

$$\text{rot2sixd}(R) = [R_{:,0},\ R_{:,1}] \in \mathbb{R}^6$$

Inverse via Gram-Schmidt orthogonalization ensures a valid rotation matrix.
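
For reference, a NumPy sketch of the forward and inverse mappings for a single rotation. The name `sixd2rot` is an assumption chosen to mirror `rot2sixd`; an optimizer would use a batched, differentiable version.

```python
import numpy as np

def rot2sixd(R):
    """Stack the first two columns of a 3x3 rotation matrix into a 6-vector."""
    return np.concatenate([R[:, 0], R[:, 1]])

def sixd2rot(x):
    """Recover a valid rotation matrix from any 6-vector via Gram-Schmidt."""
    a1, a2 = x[:3], x[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2 = a2 - np.dot(b1, a2) * b1      # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)              # right-handed third axis
    return np.stack([b1, b2, b3], axis=1)
```

Because the inverse re-orthogonalizes its input, an unconstrained optimizer step in 6D space always maps back to a proper rotation.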

Loss Functions

Reprojection Loss

The primary loss measures 2D reprojection error with inverse-variance weighting:

$$\mathcal{L}_{\text{reproj}} = \frac{1}{N_{\text{obs}}} \sum_{v \in \mathcal{V}} \sum_{j=1}^{21} \frac{w_{v,j}^{\text{manual}}}{2(\sigma_{v,j}^2 + \epsilon)} \left\|\pi_v(J_j) - \hat{J}_{v,j}^{2D}\right\|_2^2$$

where:

$\pi_v(\cdot)$: perspective projection with radial distortion for view $v$
$w_{v,j}^{\text{manual}}$: manual occlusion weight
$\sigma_{v,j}$: TTA uncertainty
$\epsilon = 10^{-5}$: numerical stability
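
The loss for a single frame can be sketched in NumPy as follows. Several simplifications are assumptions: the projection is a plain pinhole model without the radial distortion term, $N_{\text{obs}}$ is taken as the number of (view, joint) pairs, and the per-view dictionary keys are hypothetical.

```python
import numpy as np

def project(J, K, R, t):
    """Pinhole projection of 3D joints (21, 3) into pixel coordinates (21, 2)."""
    cam = J @ R.T + t
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def reprojection_loss(J, views, eps=1e-5):
    """Inverse-variance weighted squared reprojection error, averaged over observations."""
    total, n_obs = 0.0, 0
    for v in views:
        pred = project(J, v["K"], v["R"], v["t"])
        sq_err = np.sum((pred - v["kp2d"]) ** 2, axis=1)
        weight = v["w"] / (2.0 * (v["sigma"] ** 2 + eps))
        total += float(np.sum(weight * sq_err))
        n_obs += len(sq_err)
    return total / n_obs
```

Joints with a manual weight of zero drop out of the sum entirely, and joints with large TTA uncertainty contribute proportionally less.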

Anchor Loss (Stage 2 Only)

Regularizes hand pose toward initialization:

$$\mathcal{L}_{\text{anchor}} = \frac{1}{N} \sum_{t=1}^{N} \left\|\boldsymbol{\theta}_{\text{hand},t}^{6D} - \boldsymbol{\theta}_{\text{anchor},t}^{6D}\right\|_2^2$$

Key Insight: Anchor loss is applied only to hand pose, not global orientation or translation, because those are affected by focal length scaling from Hamba initialization.

Temporal Smoothness Loss

Penalizes vertex velocity across frames:

$$\mathcal{L}_{\text{temporal}} = \frac{1}{N-1} \sum_{t=2}^{N} \sum_{v=1}^{778} \left\|V_t^{(v)} - V_{t-1}^{(v)}\right\|_2^2$$

Using all 778 vertices (not just 21 joints) provides smoother regularization.
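
As a minimal sketch, the temporal term reduces to a sum of squared frame-to-frame vertex differences. It assumes the vertices for all $N \geq 2$ frames are stacked in an array of shape `(N, 778, 3)`; the function name is illustrative.

```python
import numpy as np

def temporal_loss(V):
    """Mean over frame pairs of the summed squared vertex displacement."""
    V = np.asarray(V, dtype=float)     # (N, 778, 3)
    d = np.diff(V, axis=0)             # V_t - V_{t-1} for consecutive frames
    return float(np.sum(d ** 2) / (len(V) - 1))
```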

Manual Per-View Per-Joint Weights

View	Thumb	Index	Middle	Ring	Pinky
Front	1.0	1.0	1.0	1.0	1.0
Left	1.0	1.0	0.25-0.5	0.125-0.25	0.0
Right	0.0	0.0	0.125-0.25	0.25-0.5	1.0

Batch vs. Sequential Optimization

Sequential:  [1] → [2] → [3] → [4] → [5]    ← Phase shift!
Batch:       [1] ↔ [2] ↔ [3] ↔ [4] ↔ [5]    ← Bidirectional

Analogy: Sequential = tugging a string left then right (never taut). Batch = pulling both ends simultaneously (taut = smooth + accurate).

Solution: Phase-Shift-Free Smoothing. By optimizing all $N$ frames simultaneously, each frame is pulled toward both neighbors. This provides true bidirectional temporal smoothing without the phase shift of forward-only or forward-backward sequential processing.
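
A 1D toy example illustrates the difference. It is not the pipeline's optimizer: the batch problem here is a quadratic (data term plus squared first differences) solved in closed form instead of with L-BFGS, and the sequential baseline is a simple forward-only exponential filter. All names and parameter values are assumptions.

```python
import numpy as np

def batch_smooth(y, lam=10.0):
    """Minimise sum (x_t - y_t)^2 + lam * sum (x_t - x_{t-1})^2 over all frames at once."""
    n = len(y)
    D = np.diff(np.eye(n), axis=0)     # (n-1, n) first-difference operator
    A = np.eye(n) + lam * D.T @ D
    return np.linalg.solve(A, np.asarray(y, dtype=float))

def causal_smooth(y, a=0.3):
    """Forward-only exponential filter: each frame sees only the past."""
    out = np.empty(len(y))
    state = y[0]
    for i, value in enumerate(y):
        state = a * value + (1.0 - a) * state
        out[i] = state
    return out

t = np.arange(101)
pulse = np.clip(1.0 - np.abs(t - 50) / 10.0, 0.0, None)  # symmetric peak at frame 50
```

On this symmetric pulse the batch solution keeps its maximum at frame 50, while the forward-only filter reaches its maximum later, which is the phase shift the batch formulation avoids.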

Focal Length Mismatch Handling

Challenge: Hamba uses a pseudo focal length (~37,500) that differs from the actual camera focal length (~1,800). This causes depth initialization errors.

Solution:

Initialize wrist via triangulation using actual calibration

L-BFGS optimizes translation to match 2D observations

Do not anchor global orientation or translation (only hand pose)

8. Challenges and Lessons Learned

Abandoned: Spatio-Temporal Bundle Adjustment (STBA)

Failure	Description
DLT initialization	Triangulating weak-perspective 2D predictions produced bad 3D
Camera divergence	Extrinsics drifted >100mm, >45°
Circular dependency	Occlusion detection needs 3D pose, but 3D pose is what we optimize

Lesson: Use MANO forward kinematics from Hamba, not DLT triangulation.

Other Abandoned Approaches

CoTracker Temporal Tracking: For static multi-view cameras, 3D optimization provides better temporal consistency than 2D tracking.
Occlusion via Mesh Ray-Casting: Circular dependency (needs 3D pose to detect occlusion). Manual weights + TTA uncertainty suffice.

9. Results and Visualization

Dystonia Metrics

Jerk Score. Third derivative of position indicating movement smoothness:

$$\text{jerk}_j(t) = \left\|\frac{d^3 J_j(t)}{dt^3}\right\|_2, \quad \text{score}_j = \left(\frac{\text{jerk}_j - P_{50}}{P_{95} - P_{50}}\right)^{0.5}$$
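
A sketch of the jerk computation using finite differences is given below. Two details are assumptions: the percentiles are taken over the jerk values passed in, and values below $P_{50}$ are clipped to zero so the square root stays real.

```python
import numpy as np

def jerk_magnitude(J, fps):
    """Norm of the third finite difference of one joint trajectory J of shape (T, 3)."""
    dt = 1.0 / fps
    d3 = np.diff(np.asarray(J, dtype=float), n=3, axis=0) / dt ** 3
    return np.linalg.norm(d3, axis=-1)

def jerk_score(jerk):
    """Percentile-normalised jerk score; assumes P95 > P50."""
    jerk = np.asarray(jerk, dtype=float)
    p50, p95 = np.percentile(jerk, [50, 95])
    scaled = (jerk - p50) / (p95 - p50)
    return np.sqrt(np.clip(scaled, 0.0, None))   # clipping at 0 is an assumption
```

For a cubic trajectory $x(t) = t^3$ the third difference is exact, so `jerk_magnitude` returns the analytic value 6 at every sample.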

Peak Acceleration. In the visualization, a joint flashes when its acceleration exceeds 5 m/s², and the highlight then decays linearly.

Baseline Comparison

Z-score relative to PIANO10M healthy dataset:

$$z_j = \frac{|\theta_j - \mu_{\text{baseline}}|}{\sigma_{\text{baseline}}}$$

Color mapping: $z \leq 1$ (green), $1 < z < 3$ (yellow), $z \geq 3$ (red).
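
The z-score and color thresholds translate directly into code. The function names are illustrative, and the baseline mean and standard deviation are assumed to be supplied per joint angle.

```python
def baseline_z(theta, mu_baseline, sigma_baseline):
    """Absolute z-score of a joint angle relative to the healthy baseline."""
    return abs(theta - mu_baseline) / sigma_baseline

def z_color(z):
    """Map a z-score to the visualization color bands."""
    if z <= 1:
        return "green"
    if z < 3:
        return "yellow"
    return "red"
```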

10. Discussion

Comparison with Fur Elise

Aspect	Fur Elise	Ours
Camera views	5	3
Ground truth	MIDI sensors	None
Pose model	HaMeR	Hamba
Refinement	IK from contacts	L-BFGS
Subjects	Elite pianists	Dystonia patients

We compensate for fewer views and no ground truth through better single-view estimation (Hamba), TTA uncertainty, and batch optimization.

Limitations

Only 3 views (weaker triangulation geometry)
No ground truth for validation
Manual occlusion weights (not learned)
Consistent $\boldsymbol{\beta}$ cannot handle shape changes

Future Work

Video-based models (HaMeR-V) for native temporal consistency
Multi-view architectures as they mature
Learned uncertainty prediction
Clinical validation with severity ratings

Conclusion

Key contributions:

Robust estimation: WiLoR detector + Hamba pose + TTA uncertainty
Sub-pixel calibration: Motion valleys + pattern validation + Y-normalization
Principled fusion: Inverse-variance weighted L-BFGS with manual occlusion weights
Phase-shift-free smoothing: Batch optimization for true bidirectional coupling
Design lessons: MANO forward kinematics beats DLT triangulation

Appendices

A. Complete Preprocessing Parameters

Mode	Denoise h	Gamma	CLAHE Clip	CLAHE Tile	Sharpen α	Sharpen σ
minimal	0	1.0	0	8	0	1.0
standard	0	0.6	2.0	8	0	1.0
dystonia	6	0.85	2.0	8	0.2	1.0
latching_fix	3	0.75	3.0	8	1.5	1.5

B. MANO Joint Indices

Idx	Joint	Idx	Joint	Idx	Joint
0	Wrist	7	Index DIP	14	Ring PIP
1	Thumb CMC	8	Index Tip	15	Ring DIP
2	Thumb MCP	9	Middle MCP	16	Ring Tip
3	Thumb IP	10	Middle PIP	17	Pinky MCP
4	Thumb Tip	11	Middle DIP	18	Pinky PIP
5	Index MCP	12	Middle Tip	19	Pinky DIP
6	Index PIP	13	Ring MCP	20	Pinky Tip

C. Complete Manual Weights (Right Hand)

View	0	1	2	3	4	5	6	7	8	9	10
Front	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0
Left	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	0.25	0.5
Right	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.125	0.25

View	11	12	13	14	15	16	17	18	19	20
Front	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0
Left	0.5	0.5	0.125	0.25	0.25	0.25	0.0	0.0	0.0	0.0
Right	0.25	0.25	0.25	0.5	0.5	0.5	1.0	1.0	1.0	1.0

References

C. Lugaresi et al., "MediaPipe: A Framework for Building Perception Pipelines," arXiv:1906.08172, 2019.
Z. Cao et al., "OpenPose: Realtime Multi-Person 2D Pose Estimation," IEEE TPAMI, 2019.
Y. Xu et al., "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation," NeurIPS, 2022.
R. Khirodkar et al., "Sapiens: Foundation for Human Vision Models," arXiv:2408.12569, 2024.
J. Romero et al., "Embodied Hands: Modeling and Capturing Hands and Bodies Together," SIGGRAPH Asia, 2017.
G. Pavlakos et al., "Reconstructing Hands in 3D with Transformers," CVPR, 2024.
R. Potamias et al., "WiLoR: End-to-end 3D Hand Localization and Reconstruction," arXiv:2409.12259, 2024.
H. Li et al., "Hamba: Single-view 3D Hand Reconstruction with Graph-guided Bi-Scanning Mamba," arXiv:2407.09646, 2024.
L. Mao et al., "Fur Elise: Capturing and Physically Synthesizing Hand Motions of Piano Performance," arXiv:2410.05791, 2024.
T. Bagautdinov et al., "Look Ma, No Markers: Holistic Performance Capture," arXiv:2410.11520, 2024.
Y. Zhou et al., "On the Continuity of Rotation Representations in Neural Networks," CVPR, 2019.
N. Karaev et al., "CoTracker: It is Better to Track Together," ECCV, 2024.
