AIR ORCHESTRA

Summary

Goal

The goal of this project is to build a real-time mixed-reality (AR/MR) system that lets users place and play a virtual orchestra on the real surfaces of their environment.

The system is designed to be plug-and-play, requiring no calibration, and to run on low-power devices.

Main objectives:

Create a real-time interactive “air orchestra” experience
Ensure zero-setup usability (no calibration required)
Keep the system as lightweight as possible (high portability)
Avoid heavy ML/VLM-based solutions in favor of efficient computer vision techniques
Design a modular and extensible architecture for future expansions and features

Architecture Rationale

The codebase was organized according to the role of each module.
The main criterion was to separate data acquisition, processing logic, visualization, resource management, and persistence.

Input Layer

The input/ folder contains components responsible for reading or estimating incoming data from external sources.

camera.py stores the camera model (CameraIntrinsics, Camera) because these classes describe the sensing device used by the system.
depth_model.py contains depth estimation because it produces depth information from input images.
hand_tracking.py contains hand-related functions because they interpret live gesture input.

Input modules do not implement application logic; they only transform raw or external data into a usable form.

Core Layer

The core/ folder contains the main computational logic of the application.

geometry.py includes geometric computations such as normal estimation, rotation matrices, and size estimation.
surface.py includes depth normalization and surface clustering because these operations extract meaningful structure from depth data.
floor_detection.py contains floor and surface masking logic because it classifies regions of the scene.
mesh_transform.py handles mesh transformations because it applies scale and translation to 3D objects.
collision.py checks overlaps because collision detection is part of the placement decision process.

Core modules do not depend on user-interface rendering or file I/O. They represent the actual reasoning part of the system.

Rendering Layer

The rendering/ folder contains everything related to visualization.

depth_visual.py converts depth maps into displayable images and overlays masks.
scene.py draws the status and scene elements on screen.

This separation is useful because visualization should remain independent from the underlying placement logic.
As a result, the same computation can be reused even if the presentation changes.

Assets Layer

The assets/ folder contains resources and helper functions for loading them.

mesh_loader.py loads 3D models and provides a fallback mesh.
sound.py handles audio playback and sound selection.

Storage Layer

The scene_storage/ folder contains functions for saving and loading placements.

serialize_placements
save_placements
load_placements

Application Layer

init.py
main.py

Input -> Core Processing -> Collision/Placement -> Rendering -> Storage

Method

Input

RGB image captured by the laptop camera
Camera intrinsic parameters (fx, fy, cx, cy) assumed from a pinhole camera model

Depth estimation

The system processes the RGB image using Depth Anything to generate a monocular depth map. This depth map is then back-projected into a 3D point cloud using the assumed camera intrinsics. By doing so, the scene is represented in 3D space rather than purely in image space.
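The back-projection step can be sketched as follows. This is a minimal illustration, not the project's code: the function names and the 60-degree field of view used to guess the intrinsics are assumptions, and `depth` is taken to be a positive depth-like value per pixel (the model output is relative, so the scale is arbitrary).

```python
import numpy as np

def assumed_intrinsics(width, height, fov_deg=60.0):
    """Guess pinhole intrinsics from the image size.

    fov_deg is an assumed horizontal field of view, not a calibrated value.
    """
    fx = 0.5 * width / np.tan(np.radians(fov_deg) / 2.0)
    fy = fx  # square pixels assumed
    cx, cy = width / 2.0, height / 2.0
    return fx, fy, cx, cy

def backproject(depth, fx, fy, cx, cy):
    """Turn an (H, W) depth map into an (H, W, 3) map of camera-space points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)
```

A pixel at the principal point maps to a point on the optical axis, so `backproject` returns `(0, 0, depth)` there.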

The system can operate in two modalities:

surface mode: returns all detected stable surfaces
floor mode: returns only the dominant floor region

Once the 3D structure is reconstructed, local surface properties are extracted using classical computer vision techniques. Surface normals are estimated from local neighborhoods, giving the orientation of small surface patches. These normals are then clustered to group regions with consistent orientation, allowing the system to separate different types of surfaces in the environment.
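As a rough illustration of the idea, the sketch below estimates per-pixel normals from an (H, W, 3) point map using finite differences, then groups pixels by the coordinate axis their normal is most aligned with. Both function names are hypothetical, and the axis-based grouping is a deliberately crude stand-in for the clustering the project actually uses.

```python
import numpy as np

def normals_from_points(points):
    """Estimate unit normals for an (H, W, 3) point map.

    The normal at each pixel is the cross product of the local
    horizontal and vertical tangent vectors (finite differences).
    """
    du = np.gradient(points, axis=1)  # change along image columns
    dv = np.gradient(points, axis=0)  # change along image rows
    n = np.cross(du, dv)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.maximum(norm, 1e-12)

def group_by_orientation(normals):
    """Label each pixel 0, 1 or 2: the axis its normal is most aligned with."""
    return np.argmax(np.abs(normals), axis=-1)
```

With the camera looking along +z and y as the vertical image axis, floor-like patches tend to fall in group 1 and wall-like patches facing the camera in group 2.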

The resulting clusters are filtered using geometric constraints to remove unstable or irrelevant regions. Only surfaces that satisfy stability criteria and coherent orientation are retained.

Floor detection applies additional constraints on top of the estimated surface mask. The candidate regions are first restricted to the lower half of the image, based on the assumption that the floor is typically located in this area of the camera view. The remaining candidates are filtered for depth consistency by removing extreme depth values with percentile-based clipping. The result is passed through connected-component analysis, and the largest region in contact with the bottom edge of the image is selected.
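The three floor constraints can be sketched in plain NumPy as below. The function name, the 5th/95th percentile bounds and the 4-connected flood fill are assumptions chosen for illustration; in practice OpenCV's connected-components routines would be the more natural choice.

```python
from collections import deque
import numpy as np

def select_floor(surface_mask, depth, lo_pct=5.0, hi_pct=95.0):
    """Return a boolean floor mask from a surface mask and a depth map."""
    h, w = surface_mask.shape
    cand = surface_mask.astype(bool).copy()
    cand[: h // 2] = False                      # 1) keep the lower half only
    if not cand.any():
        return cand
    lo, hi = np.percentile(depth[cand], [lo_pct, hi_pct])
    cand &= (depth >= lo) & (depth <= hi)       # 2) clip extreme depth values

    # 3) label 4-connected regions, keep the largest one touching the bottom row
    labels = np.zeros((h, w), dtype=np.int32)
    best_label, best_size, next_label = 0, 0, 1
    for sy, sx in zip(*np.nonzero(cand)):
        if labels[sy, sx]:
            continue
        queue = deque([(sy, sx)])
        labels[sy, sx] = next_label
        size, touches_bottom = 0, False
        while queue:
            y, x = queue.popleft()
            size += 1
            touches_bottom = touches_bottom or (y == h - 1)
            for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                if 0 <= ny < h and 0 <= nx < w and cand[ny, nx] and not labels[ny, nx]:
                    labels[ny, nx] = next_label
                    queue.append((ny, nx))
        if touches_bottom and size > best_size:
            best_label, best_size = next_label, size
        next_label += 1
    if best_label == 0:
        return np.zeros((h, w), dtype=bool)
    return labels == best_label
```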

To improve results in both cases, morphological operations are applied to reduce noise and improve spatial continuity. A closing operation with a 9x9 kernel is used to fill small gaps, followed by an opening operation with a 3x3 kernel to remove small isolated artifacts.

The selected 3D points are then projected back into the image plane to generate a 2D interaction mask. This mask can be overlaid on the screen with alpha blending and used for AR object placement.
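Both steps are short enough to sketch directly. The names `project` and `overlay_mask`, the default green color and the 0.4 opacity are illustrative assumptions rather than the project's actual values.

```python
import numpy as np

def project(points, fx, fy, cx, cy):
    """Project (N, 3) camera-space points to (N, 2) pixel coordinates."""
    z = points[:, 2]
    u = fx * points[:, 0] / z + cx
    v = fy * points[:, 1] / z + cy
    return np.stack([u, v], axis=1)

def overlay_mask(frame_bgr, mask, color=(0, 255, 0), alpha=0.4):
    """Alpha-blend a solid color over the masked pixels of a BGR frame."""
    out = frame_bgr.astype(np.float32)
    sel = mask.astype(bool)
    out[sel] = (1.0 - alpha) * out[sel] + alpha * np.array(color, dtype=np.float32)
    return np.clip(out, 0, 255).astype(np.uint8)
```

Pixels outside the mask are left untouched, so the camera image stays fully visible wherever placement is not allowed.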

Gesture-Based Control

The system uses gesture recognition to control interaction modes and actions. Gestures are evaluated over multiple consecutive frames to ensure temporal stability and reduce noise.

Two main gesture types are used:

"thumbs up", used to confirm object placement
"victory sign", used to toggle between interaction modes

For environments where gesture recognition is unreliable, a keyboard fallback is provided.
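The multi-frame gesture evaluation described above can be sketched as a small debouncer. The class name, the five-frame default and the fire-once behavior are assumptions; the labels are arbitrary strings supplied by whatever recognizer is in use.

```python
class GestureDebouncer:
    """Report a gesture only after it has been seen on N consecutive frames."""

    def __init__(self, required_frames=5):
        self.required_frames = required_frames
        self._current = None   # label seen on the most recent frames
        self._count = 0        # how many consecutive frames it has been seen
        self._fired = False    # whether the current run was already reported

    def update(self, label):
        """Feed one per-frame label (or None). Returns the label once per stable run."""
        if label != self._current:
            self._current, self._count, self._fired = label, 0, False
        self._count += 1
        if label is not None and not self._fired and self._count >= self.required_frames:
            self._fired = True
            return label
        return None
```

Firing once per run means that holding a thumbs up does not place the same instrument on every frame; the gesture has to be released and repeated.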

Spatial Placement and Interaction Logic

Object placement is handled in 3D space. Each object is represented by a convex hull computed from its mesh. This allows collisions to be checked against an approximation of the geometry, without a per-triangle evaluation.

Object orientation is aligned with the estimated surface normal, computed via PCA on a local depth neighborhood. This lets the placed object follow the orientation of the detected surface.
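A minimal sketch of this idea follows: the normal is the direction of least variance of the neighborhood points, and a rotation matrix then maps the object's up axis onto it. The function names, the choice of +y as the object's up axis in the usage below, and the convention of flipping the normal toward the camera are assumptions.

```python
import numpy as np

def pca_normal(points, toward=(0.0, 0.0, -1.0)):
    """Normal of an (N, 3) patch: the principal axis with the least variance.

    The sign is chosen so the normal points along `toward`
    (by default back toward a camera looking down +z).
    """
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    normal = vt[-1]
    if np.dot(normal, np.asarray(toward)) < 0:
        normal = -normal
    return normal / np.linalg.norm(normal)

def rotation_aligning(a, b):
    """Rotation matrix R such that R @ a is parallel to b (Rodrigues' formula)."""
    a = np.asarray(a, dtype=float) / np.linalg.norm(a)
    b = np.asarray(b, dtype=float) / np.linalg.norm(b)
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    if np.isclose(c, -1.0):
        # Opposite vectors: rotate 180 degrees about any axis perpendicular to a.
        axis = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-8:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis /= np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    k = np.array([[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]])
    return np.eye(3) + k + k @ k / (1.0 + c)
```

For example, `rotation_aligning([0, 1, 0], pca_normal(patch))` gives the rotation to apply to a mesh whose up axis is +y.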

Finally, once an object is placed, occlusion is handled using the reconstructed object geometry and the previously estimated 3D surface coordinates. This gives a correct depth ordering in the final render.

Interaction and Sound Triggering

User interaction is based on hand tracking and gesture recognition. For each detected hand, a 2D cursor is extracted and projected into 3D space by casting a ray from the camera origin through the corresponding pixel.

The ray direction is computed by unprojecting the cursor position at a fixed depth and normalizing the resulting vector. This ray is then tested against all placed virtual objects in the scene. For each object, the system computes the shortest distance between the object position and the ray, selecting the closest valid candidate.

If the distance is below a predefined threshold, the object is considered “hit”. A per-instrument cooldown mechanism is applied to prevent repeated triggering at high frequency. When a valid hit occurs, the corresponding sound is played.
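The ray test and cooldown described in this section can be sketched as follows. The class and function names, the threshold and cooldown values, and the use of a dictionary of object positions are assumptions made for illustration.

```python
import time
import numpy as np

def pixel_ray(u, v, fx, fy, cx, cy, depth=1.0):
    """Unit ray from the camera origin through pixel (u, v)."""
    p = np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])
    return p / np.linalg.norm(p)

def point_ray_distance(point, direction):
    """Shortest distance from a point to a ray starting at the origin."""
    t = max(float(np.dot(point, direction)), 0.0)  # clamp: nothing behind the camera
    return float(np.linalg.norm(point - t * direction))

class HitTester:
    """Pick the closest object to the ray and apply a per-object cooldown."""

    def __init__(self, threshold=0.15, cooldown_s=0.3, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self._last_hit = {}  # object name -> time of the last accepted hit

    def pick(self, direction, objects):
        """objects maps a name to a 3D position. Returns the name to trigger, or None."""
        best_name, best_dist = None, self.threshold
        for name, pos in objects.items():
            d = point_ray_distance(np.asarray(pos, dtype=float), direction)
            if d < best_dist:
                best_name, best_dist = name, d
        if best_name is None:
            return None
        now = self.clock()
        if now - self._last_hit.get(best_name, -np.inf) < self.cooldown_s:
            return None  # still cooling down
        self._last_hit[best_name] = now
        return best_name
```

Passing the clock in as a parameter keeps the cooldown logic easy to test without waiting in real time.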

Optimization

Scene changes are tracked via a dirty flag, which triggers recomputation of the geometric pipeline when changes occur.
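The dirty-flag pattern itself is simple. In the sketch below, `SceneCache` and its `recompute` callback are hypothetical names standing in for the project's geometric pipeline.

```python
class SceneCache:
    """Cache an expensive result and recompute it only after a change."""

    def __init__(self, recompute):
        self._recompute = recompute  # callable that runs the expensive pipeline
        self._dirty = True           # nothing computed yet
        self._value = None

    def mark_dirty(self):
        """Call whenever the scene changes (object placed, scene cleared, ...)."""
        self._dirty = True

    def get(self):
        """Return the cached result, recomputing first if the scene changed."""
        if self._dirty:
            self._value = self._recompute()
            self._dirty = False
        return self._value
```

Frames that do not change the scene then reuse the cached result instead of rerunning the pipeline.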

Scene Persistence (JSON)

The position, type, and orientation of instruments can be saved and reloaded using a JSON-based persistence system. Saved configurations are independent of the current scene state.
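A minimal sketch of such a persistence layer is shown below. The three function names match those listed under the Storage Layer, but their signatures, the dictionary-based placement records and the field names `type`, `position` and `rotation` are assumptions.

```python
import json
from pathlib import Path

def serialize_placements(placements):
    """Convert placement records to plain JSON-compatible lists and floats."""
    return [
        {
            "type": p["type"],
            "position": [float(x) for x in p["position"]],
            "rotation": [[float(x) for x in row] for row in p["rotation"]],
        }
        for p in placements
    ]

def save_placements(placements, path):
    """Write the placements to a JSON file."""
    text = json.dumps(serialize_placements(placements), indent=2)
    Path(path).write_text(text, encoding="utf-8")

def load_placements(path):
    """Read placements back as a list of dictionaries."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Converting every value with `float` means NumPy arrays and scalars can be passed in directly, since `json` cannot serialize them on its own.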

Limitations

The system uses a monocular depth estimation model (Depth Anything), which does not provide true metric depth. As a result, absolute scale is inaccurate and may vary depending on scene and lighting conditions. More complex models were tested, but their computational cost was too high for our goals and for real-time capture.

Camera intrinsic parameters (fx, fy, cx, cy) are not obtained through calibration and are instead assumed based on a pinhole camera model. This approximation can introduce geometric distortions, particularly affecting 3D reconstruction and raycasting accuracy. This design choice was intentional to maintain a plug-and-play system with no setup requirements.

Surface detection is based on local geometric consistency and normal clustering. It can become unstable or fail when the camera is tilted, when the floor is not in contact with the bottom of the image (floor detection only), or when pointing at particularly shiny or noisily textured surfaces.

The main performance bottleneck of the system comes from mesh complexity and textures. Even after optimizations, FPS drops can occur during placement. This does not happen when using primitive geometries, suggesting a rendering rather than an algorithmic limitation. Thanks to our optimizations, this problem occurs only during the placement procedure.

Future Work

Future improvements could focus on:

incremental or partial updates of the scene instead of full recomputation
lightweight self-calibration techniques to improve camera intrinsics at runtime without explicit user input
hybrid depth approaches (e.g., temporal smoothing or multi-frame fusion) to improve stability without significantly increasing computational cost
enhanced interaction models, including multi-user support or more expressive gesture vocabularies
testing and optimizing the pipeline on mobile or embedded hardware

Requirements

mediapipe 0.10.35

numpy 2.4.6

open3d 0.19.0

opencv_contrib_python 4.13.0.92

opencv_python 4.13.0.92

Pillow 12.2.0

torch 2.11.0

transformers 5.8.0

trimesh 4.12.2

How to run it

run main.py

usable flags

--camera: Sets the camera index number to use (default: 0).
--output-dir: Specifies the directory path where outputs (e.g., saved placements or debug data) will be stored (default: DEFAULT_OUTPUT).
--gesture-model: Specifies the file path to the gesture recognition model used by MediaPipe (default: DEFAULT_GESTURE_MODEL).
--model-id: Selects the Hugging Face model ID for Depth Anything V2. Only two variants are supported:
- "depth-anything/Depth-Anything-V2-Small-hf"
- "depth-anything/Depth-Anything-V2-Large-hf" (default: "depth-anything/Depth-Anything-V2-Small-hf")
--device: Defines the computation device used to run the depth model. Supported values include cpu, mps (Apple Silicon), or cuda (NVIDIA GPU). If not specified, the best available device is selected automatically (default: None).
--mask-mode: Determines the strategy used to generate the placement mask after depth estimation:
- "floor": restricts placement to floor-like regions only
- "both": allows placement on both floor and general horizontal surfaces
  (default: "both")

Controls

Key	Gesture	Action	Mode
R	-	Reload the scene and run surface detection	All modes
1-5	-	Select one of the instruments	All modes
M	-	Change detection modality (floor/surface)	All modes
P	-	Disable/enable the overlay	All modes
Q	-	Exit	All modes
P	Victory sign	Switch between PLACE mode and PLAY mode	All modes
C	-	Clear the scene	All modes
S	-	Save the scene	All modes
Space	Thumbs up (left hand)	Place the object	PLACE mode only

Tutorial

Press R to initialize the scene and perform surface detection.

Note: The system assumes the camera is pointing toward a valid surface (e.g., the floor or several surfaces, not yourself).

To use a different surface-detection modality, press M and then R to reload.

Note: The floor algorithm assumes that the floor is in contact with the lower edge of the image. The surface algorithm is less heavily filtered and therefore noisier.

The system enters PLACE mode.
A green area will appear, representing valid placement zones.
Use your right index finger to point where you want to place the instrument.
Confirm placement by:
- Pressing Space, or
- Performing a thumbs up gesture with your left hand
Use keys 1–5 to select different instruments.
To switch to interaction mode:
- Press P, or
- Perform a victory sign gesture
In PLAY mode, bring one or both index fingers close to an instrument to interact with it.

In Conclusion

This project engaged us on multiple levels. We applied knowledge from 3D modeling and mesh processing, 2D projection and unwrapping concepts, 3D/2D geometry and mathematics, as well as computer vision, graphics, and audio processing.

While it is not a production-ready system, we believe it establishes a solid foundation for future development.

In particular, it helped us build practical experience that can be extended into more complex projects and strengthened our technical portfolio and CV.
