Zhizheng Liu, Joe Lin, Wayne Wu, Bolei Zhou
University of California, Los Angeles

git clone --recursive git@github.com:genforce/JOSH.git
cd JOSH
conda create -n josh python=3.10 -y # must use python 3.10 for chumpy compatibility
conda activate josh
# assume CUDA 12.8, install pytorch and packages
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
pip install --no-build-isolation git+https://github.com/mattloper/chumpy
pip install -e .
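Chumpy only works with Python 3.10, so it is worth verifying the interpreter before installing the rest of the stack. A small hypothetical check (not part of the repo):

```python
import sys

def is_supported_python(version_info=sys.version_info):
    """JOSH's environment pins Python 3.10 for chumpy compatibility."""
    return (version_info.major, version_info.minor) == (3, 10)

if __name__ == "__main__":
    print("Python OK" if is_supported_python() else "Please use Python 3.10")
```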
- Download the SMPL body models (`SMPL_MALE.pkl`, `SMPL_FEMALE.pkl`, `SMPL_NEUTRAL.pkl`) from the official webpage and place them under the `data/smpl` folder.
- Download the VIMO checkpoint (`vimo_checkpoint.pth.tar`) for HMR and place it under `data/checkpoints`.
- Download the DECO checkpoint (`deco_best.pth`) for contact estimation and place it under `data/checkpoints`.
- Move the function `parse_chunks` from `third_party/tram/lib/pipeline/tools.py` to `third_party/tram/lib/models/hmr_vimo.py` so we don't install extra dependencies.
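Before running the demo, a quick sanity check that every model file and checkpoint is in place can save a failed run. A minimal sketch (hypothetical helper, not part of the repo; the paths follow the layout described above):

```python
from pathlib import Path

# Files expected by the setup steps above (paths relative to the repo root)
REQUIRED_FILES = [
    "data/smpl/SMPL_MALE.pkl",
    "data/smpl/SMPL_FEMALE.pkl",
    "data/smpl/SMPL_NEUTRAL.pkl",
    "data/checkpoints/vimo_checkpoint.pth.tar",
    "data/checkpoints/deco_best.pth",
]

def missing_files(root=".", required=REQUIRED_FILES):
    """Return the subset of required files that do not exist under root."""
    root_path = Path(root)
    return [f for f in required if not (root_path / f).is_file()]

if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:", *missing, sep="\n  ")
    else:
        print("All model files and checkpoints found.")
```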
Assuming the demo video is located at `$input_folder/XXXX.mp4`, run the following:
rerun --serve-grpc # in another terminal, for visualization
bash josh_demo.sh $input_folder
For example, running `bash josh_demo.sh assets/demo1` stores all the intermediate outputs as well as the final result under `$input_folder`.
Compared to the original paper, we now support using the local point cloud from the state-of-the-art method Pi3 as initialization, which can lead to better reconstruction quality.
Note that since JOSH is an optimization-based method, you may want to tune the hyper-parameters for the optimal performance (see josh/config.py). With the default hyperparameters, you should get the following results after running the demos:
Demo 1 Sample Output
Demo 2 Sample Output
Long Demo Sample Output
For long videos (>=200 frames), we apply chunk processing and then aggregate the chunk results by simply concatenating them (see josh/aggregate_results.py). We will leave global bundle adjustment to future work.
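The chunking scheme described above can be sketched as follows. This is a simplified illustration rather than the repo's actual `josh/aggregate_results.py`; the 200-frame chunk size matches the threshold above, but the exact splitting and aggregation logic is an assumption:

```python
def split_into_chunks(num_frames, chunk_size=200):
    """Split frame indices [0, num_frames) into consecutive chunks of at most chunk_size."""
    return [list(range(start, min(start + chunk_size, num_frames)))
            for start in range(0, num_frames, chunk_size)]

def aggregate_chunks(chunk_results):
    """Aggregate per-chunk outputs by simple concatenation (no global bundle adjustment)."""
    aggregated = []
    for result in chunk_results:
        aggregated.extend(result)
    return aggregated
```

With the default size, a 450-frame video would be processed as chunks of 200, 200, and 50 frames, and their results concatenated in order.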
Download the JOSH3R checkpoint from this link to `$CKPT_PATH`, use the same `$input_folder` from the JOSH demo, and run the following:
python josh/inference_josh3r.py --input_folder "$input_folder" --ckpt_path $CKPT_PATH --visualize
Note that the scene reconstruction quality of JOSH3R may not be great due to the end-to-end inference of the base model MASt3R without optimization, but the global human trajectory prediction should look more plausible.
pip install evo # for camera pose evaluation
We provide evaluation scripts under `josh/eval` for all the datasets, with basic instructions. Please refer to the original dataset repos for data downloading and processing. The scripts are not thoroughly tested; feel free to open an issue if you encounter any problems or bugs.
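The `evo` package reports camera pose metrics such as Absolute Trajectory Error (ATE). As a rough illustration of what the RMSE number means, here is a pure-Python sketch, assuming the trajectories are already aligned (evo additionally performs Umeyama alignment and handles timestamps; this helper is hypothetical):

```python
import math

def ate_rmse(gt_positions, est_positions):
    """Root-mean-square translation error between two aligned camera trajectories.

    Each trajectory is a list of (x, y, z) camera positions of equal length.
    """
    assert len(gt_positions) == len(est_positions), "trajectories must match in length"
    squared_errors = [
        sum((g - e) ** 2 for g, e in zip(gt, est))  # squared distance per frame
        for gt, est in zip(gt_positions, est_positions)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```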
We would like to thank the following projects for inspiring our work and open-sourcing their implementations:
Human Mesh Recovery: WHAM, TRAM, HMR2.0
Human Detection and Segmentation: SAM3
Scene Reconstruction: DUSt3R, MASt3R, Pi3
Human Contact Estimation: BSTRO, DECO
Evaluation Datasets: EMDB, SLOPER4D, RICH
For any questions or discussions, please contact Zhizheng Liu.
If our work is helpful to your research, please cite the following:
@inproceedings{liu2026joint,
  title={Joint Optimization for 4D Human-Scene Reconstruction in the Wild},
  author={Liu, Zhizheng and Lin, Joe and Wu, Wayne and Zhou, Bolei},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}
