Team: Andrea Migliore · Giovanni Elisei · Carlo Fiammenghi
SemanticSpot 3D is an end-to-end pipeline that turns a walkthrough video of a room into a semantically searchable 3D point cloud.
A user can type a natural-language query (e.g. "silver water bottle") and the system highlights the matching object directly inside a 3D viewer.
| Stage | What happens | Tools |
|---|---|---|
| 1. Data Capture | Record a video walking around the environment | Phone camera |
| 2. Structure from Motion | Extract frames → estimate camera poses → reconstruct a dense 3D point cloud | COLMAP |
| 3. Semantic Projection | Segment every frame with SAM, embed each segment with CLIP, then project the features onto the 3D points via known camera intrinsics/extrinsics | SAM (ViT-B) + CLIP (ViT-B/32) |
| 4. Semantic Search | At query time, encode the text with CLIP, compute cosine similarity against every point's feature vector, and highlight the best matches | CLIP + DBSCAN (spatial cleanup) |
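
The query-time step in stage 4 can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in app.py: the function name `top_matches`, the `threshold` default, and the array layout (one feature row per 3D point) are assumptions, and the text embedding is expected to come from the CLIP text encoder.

```python
import numpy as np

def top_matches(point_feats, text_feat, threshold=0.25):
    """Return indices of points similar to the query, plus all similarity scores.

    point_feats: (N, D) array, one CLIP feature per 3D point (assumed layout).
    text_feat:   (D,) array, the CLIP embedding of the text query.
    threshold:   hypothetical default; the app exposes this as a setting.
    """
    # Normalise both sides so the dot product equals cosine similarity.
    p_norm = np.clip(np.linalg.norm(point_feats, axis=1, keepdims=True), 1e-12, None)
    t_norm = max(np.linalg.norm(text_feat), 1e-12)
    sims = (point_feats / p_norm) @ (text_feat / t_norm)
    # Keep every point whose similarity reaches the threshold.
    return np.flatnonzero(sims >= threshold), sims
```

Normalising first keeps the score in the range [-1, 1], so a single threshold behaves the same regardless of feature magnitude.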
The Streamlit app allows you to:
- Search by text: type an object description to highlight matching points in the 3D model.
- Visualize consistency: show a heatmap based on how often points were detected across different camera angles.
- Filter noise: run DBSCAN to remove isolated points and outliers.
- Adjust settings: change similarity thresholds, consistency limits, and point counts in real time.
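
The noise filter can be illustrated with scikit-learn's DBSCAN, which labels points that belong to no dense cluster as -1. The sketch below assumes that "removing isolated points" means dropping those noise labels; the function name and the `eps` / `min_samples` defaults are hypothetical and would need tuning to the scale of the scene.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def drop_isolated_points(points, eps=0.05, min_samples=10):
    """Remove points that DBSCAN marks as noise.

    points: (N, 3) array of XYZ coordinates of the highlighted points.
    eps, min_samples: hypothetical defaults, in scene units.
    Returns the kept points and the boolean mask used to select them.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    keep = labels != -1  # -1 is DBSCAN's label for noise
    return points[keep], keep
```

Returning the mask alongside the points makes it possible to filter colours or feature vectors with the same selection.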
Run the app on the pre-built scene.
- Python 3.10+
- Git
git clone https://github.com/AndreMigX/SemanticSpot3D.git
cd SemanticSpot3D
python -m venv venv
source venv/bin/activate # macOS / Linux
# venv\Scripts\activate # Windows
pip install -r requirements.txt

The preprocessed semantic scene file (semantic_scene.npz, ~637 MB) is too large for Git. Download it into the project root:
curl -L -o semantic_scene.npz \
"https://docs.google.com/uc?export=download&id=1nosa5p3XwNpPgQvJaOqPbCTfPBMOMP-E&confirm=t"

streamlit run app.py

Or use the scripts:
./run.sh # macOS / Linux
run.bat # Windows

The app will open automatically in your browser at http://localhost:8501.
Process a new video and generate your own semantic_scene.npz, then launch the demo on it.
- Everything listed above (Python 3.10+, Git, and the installed requirements)
- COLMAP installed on your system
- A walkthrough video of the scene you want to reconstruct
Place your video as src/video.mp4, then run:
python src/extract_frames.py

This extracts frames at 3 fps into src/frames/.
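
For readers curious how fixed-rate frame sampling works, here is a minimal sketch using OpenCV. It is not the contents of src/extract_frames.py: the function names, the `frame_XXXXX.jpg` naming scheme, and the 30 fps fallback are assumptions made for illustration.

```python
import os

def frame_step(video_fps, target_fps=3.0):
    """Number of source frames to advance per saved frame (at least 1)."""
    return max(1, round(video_fps / target_fps))

def extract_frames(video_path, out_dir, target_fps=3.0):
    """Save roughly target_fps frames per second of video into out_dir."""
    import cv2  # OpenCV, imported lazily so frame_step has no dependencies

    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    # Fall back to 30 fps (an assumption) if the container reports no frame rate.
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = frame_step(fps, target_fps)
    index = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video or read error
            break
        if index % step == 0:
            # Hypothetical naming scheme; the real script may differ.
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```

Called as `extract_frames("src/video.mp4", "src/frames")`, a 30 fps video would keep every tenth frame.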
Use COLMAP to reconstruct the 3D scene from the extracted frames:
- Import the frames and run feature extraction + matching
- Run sparse reconstruction, then dense reconstruction (patch-match stereo + fusion)
- Export the model in text format (cameras.txt, images.txt, points3D.txt) into colmap/model/
- Place the fused dense point cloud (fused.ply) in colmap/
Note: Extracting the dense point cloud in COLMAP requires an NVIDIA GPU with CUDA.
The SAM ViT-B checkpoint (~358 MB) is required for segmentation:
curl -L -o sam_vit_b_01ec64.pth \
"https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth"

Open and run all cells in preprocessor.ipynb. This will:
- Load the dense point cloud and voxel-downsample it
- For each video frame: run SAM segmentation → CLIP embedding → 3D projection
- Average the CLIP features across views and compute consistency scores
- Save everything to semantic_scene.npz
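
The projection and averaging steps above can be sketched as follows. This is a simplified illustration rather than the notebook's code: it assumes a distortion-free pinhole camera, COLMAP's world-to-camera convention (x_cam = R · x_world + t), no occlusion handling, and a consistency score defined as the fraction of frames in which a point was observed. All function names are hypothetical.

```python
import numpy as np

def project_points(points, R, t, fx, fy, cx, cy, width, height):
    """Project (N, 3) world points into a pinhole camera.

    Returns (N, 2) pixel coordinates and a mask of points that land
    inside the image and in front of the camera.
    """
    cam = points @ R.T + t                    # world -> camera coordinates
    z = cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)       # avoid division by zero
    u = fx * cam[:, 0] / safe_z + cx
    v = fy * cam[:, 1] / safe_z + cy
    visible = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return np.stack([u, v], axis=1), visible

def accumulate(feat_sum, hits, visible, frame_feats):
    """Add one frame's per-point features in place and count observations."""
    feat_sum[visible] += frame_feats[visible]
    hits[visible] += 1

def finalize(feat_sum, hits, n_frames):
    """Average features over observations; consistency = hits / n_frames (assumed)."""
    mean = feat_sum / np.maximum(hits, 1)[:, None]
    return mean, hits / n_frames
```

In a full pipeline, `frame_feats` would hold, for each visible point, the CLIP embedding of the SAM segment covering its projected pixel.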
streamlit run app.py

The app loads semantic_scene.npz from the project root.
SemanticSpot3D/
├── app.py # Streamlit app
├── preprocessor.ipynb # Pipeline to create semantic point cloud
├── requirements.txt # Python dependencies
├── run.sh / run.bat # Launch scripts
├── src/
│ ├── extract_frames.py # Frame extraction utility
│ ├── frames/ # Extracted video frames
│ └── video.mp4 # Input video
├── colmap/
│ ├── fused.ply # Dense 3D point cloud from COLMAP
│ └── model/ # COLMAP text exports (cameras, images, points3D)
├── semantic_scene.npz # Semantic scene (gitignored, download separately)
└── sam_vit_b_01ec64.pth # SAM checkpoint (gitignored, download separately)
| Component | Technology |
|---|---|
| 3D Reconstruction | COLMAP |
| Image Segmentation | Segment Anything (SAM) - ViT-B |
| Semantic Features | CLIP - ViT-B/32 (HuggingFace Transformers) |
| Spatial Filtering | DBSCAN (scikit-learn) |
| App | Streamlit + Plotly |
| Point Cloud I/O | Open3D |