SemanticSpot 3D

Team: Andrea Migliore · Giovanni Elisei · Carlo Fiammenghi

Summary

SemanticSpot 3D is an end-to-end pipeline that turns a walkthrough video of a room into a semantically searchable 3D point cloud.
A user can type a natural-language query (e.g. "silver water bottle") and the system highlights the matching object directly inside a 3D viewer.

How it works

Stage	What happens	Tools
1. Data Capture	Record a video walking around the environment	Phone camera
2. Structure from Motion	Extract frames → estimate camera poses → reconstruct a dense 3D point cloud	COLMAP
3. Semantic Projection	Segment every frame with SAM, embed each segment with CLIP, then project the features onto the 3D points using the camera intrinsics and extrinsics estimated by COLMAP	SAM (ViT-B) + CLIP (ViT-B/32)
4. Semantic Search	At query time, encode the text with CLIP, compute cosine similarity against every point's feature vector, and highlight the best matches	CLIP + DBSCAN (spatial cleanup)
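
The query-time step in stage 4 can be sketched as follows. This is a minimal illustration, not the code in app.py: it assumes the query text has already been encoded by CLIP into a vector and that the per-point features are available as an (N, D) array. The function name rank_points and the default threshold are hypothetical.

```python
import numpy as np


def rank_points(point_features, text_embedding, threshold=0.25):
    """Return (indices, scores) of points with cosine similarity >= threshold.

    point_features: (N, D) array, one feature vector per 3D point.
    text_embedding: (D,) array for the query text.
    threshold: hypothetical cutoff; the app exposes a similar setting.
    """
    feats = np.asarray(point_features, dtype=np.float64)
    query = np.asarray(text_embedding, dtype=np.float64)

    # L2-normalize both sides so the dot product equals cosine similarity.
    feat_norms = np.linalg.norm(feats, axis=1, keepdims=True)
    feats = feats / np.clip(feat_norms, 1e-12, None)
    query = query / max(np.linalg.norm(query), 1e-12)

    scores = feats @ query
    keep = np.flatnonzero(scores >= threshold)
    # Sort the surviving points so the best matches come first.
    order = keep[np.argsort(-scores[keep])]
    return order, scores[order]
```

Points with an all-zero feature vector (never seen in any frame) get a score of 0 and drop out for any positive threshold.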

Demo Video

https://youtu.be/KG0jpMZ2JLM

The Streamlit app allows you to:

Search by text: type an object description to highlight matching points in the 3D model.
Visualize consistency: show a heatmap based on how often points were detected across different camera angles.
Filter noise: run DBSCAN to remove isolated points and outliers.
Adjust settings: change similarity thresholds, consistency limits, and point counts in real time.
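
The noise filter can be illustrated with scikit-learn's DBSCAN, which assigns the label -1 to points that do not belong to any dense cluster. A minimal sketch: the function name and the eps / min_samples values are assumptions, not the app's actual defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def remove_isolated_points(xyz, eps=0.05, min_samples=10):
    """Return a boolean mask that is False for points DBSCAN marks as noise.

    xyz: (N, 3) array of point coordinates (e.g. the currently highlighted points).
    eps, min_samples: hypothetical values; tune them to the scale of the scene.
    """
    xyz = np.asarray(xyz, dtype=np.float64)
    if len(xyz) == 0:
        return np.zeros(0, dtype=bool)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(xyz)
    return labels != -1
```

Because eps is a distance in scene units, a value that works for one reconstruction may be too strict or too loose for another.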

Run the Demo (pre-computed scene)

Run the app on the pre-built scene.

Prerequisites

Python 3.10+
Git

1. Clone and install

git clone https://github.com/AndreMigX/SemanticSpot3D.git
cd SemanticSpot3D

python -m venv venv
source venv/bin/activate        # macOS / Linux
# venv\Scripts\activate          # Windows

pip install -r requirements.txt

2. Download the pre-computed scene

The preprocessed semantic scene file (semantic_scene.npz, ~637 MB) is too large for Git. Download it into the project root:

curl -L -o semantic_scene.npz \
  "https://docs.google.com/uc?export=download&id=1nosa5p3XwNpPgQvJaOqPbCTfPBMOMP-E&confirm=t"

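Downloads from Google Drive occasionally save an HTML page instead of the file. Since an .npz archive is a ZIP file, a quick sanity check is possible before launching the app. This is a sketch: the helper name describe_scene is hypothetical, and because the array names inside the archive are not documented here, the code only lists whatever it finds.

```python
import zipfile

import numpy as np


def describe_scene(path="semantic_scene.npz"):
    """Return {array_name: (shape, dtype)} for an .npz file, or raise ValueError."""
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path} is not a valid .npz archive; re-download it")
    summary = {}
    with np.load(path) as scene:
        for name in scene.files:
            arr = scene[name]  # reads the array from disk
            summary[name] = (arr.shape, str(arr.dtype))
    return summary
```

Calling describe_scene() from the project root prints nothing by itself; wrap it in print() to see the stored arrays.
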
3. Launch the app

streamlit run app.py

Or use the scripts:

./run.sh          # macOS / Linux
run.bat           # Windows

The app will open automatically in your browser at http://localhost:8501.

Run the Full Pipeline from Scratch

Process a new video and generate your own semantic_scene.npz, then launch the demo on it.

Prerequisites

Everything from the demo setup above (Python 3.10+, Git, and the installed requirements)
COLMAP installed on your system
A walkthrough video of the scene you want to reconstruct

Step 1: Extract frames from the video

Place your video as src/video.mp4, then run:

python src/extract_frames.py

This extracts frames at 3 fps into src/frames/.
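
Sampling at a fixed rate amounts to keeping roughly every (source fps / 3)-th frame. The sketch below shows only that index arithmetic; it is not the contents of src/extract_frames.py, and the function name frames_to_keep is hypothetical.

```python
def frames_to_keep(total_frames, source_fps, target_fps=3.0):
    """Indices of the source frames to keep when resampling to target_fps."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Spacing between kept frames; clamp to 1 so we never try to upsample.
    step = max(source_fps / target_fps, 1.0)
    indices = []
    k = 0
    # Multiply instead of accumulating to avoid floating-point drift.
    while int(k * step) < total_frames:
        indices.append(int(k * step))
        k += 1
    return indices
```

For a 30 fps video this keeps every tenth frame, so a one-minute walkthrough yields about 180 images for COLMAP.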

Step 2: Run COLMAP

Use COLMAP to reconstruct the 3D scene from the extracted frames:

Import the frames and run feature extraction + matching
Run sparse reconstruction, then dense reconstruction (patch-match stereo + fusion)
Export the model in text format (cameras.txt, images.txt, points3D.txt) into colmap/model/
Place the fused dense point cloud (fused.ply) in colmap/

Note: Dense reconstruction in COLMAP (patch-match stereo) requires an NVIDIA GPU with CUDA.
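
The same steps can also be run through COLMAP's command-line interface. The sketch below only builds the command sequence so it can be reviewed before running. The subcommands and flags are standard COLMAP CLI options, but the workspace layout (colmap/database.db, colmap/sparse, colmap/dense), the choice of exhaustive matching, and exporting the sparse model rather than the undistorted one are assumptions, not settings confirmed by this project.

```python
from pathlib import Path


def colmap_commands(images="src/frames", workspace="colmap"):
    """Return the COLMAP CLI calls, in order, as argument lists."""
    ws = Path(workspace)
    db = (ws / "database.db").as_posix()        # assumed path
    sparse = (ws / "sparse").as_posix()         # assumed path; create it first
    model0 = (ws / "sparse" / "0").as_posix()   # first model written by the mapper
    dense = (ws / "dense").as_posix()           # assumed path
    model_txt = (ws / "model").as_posix()       # text export; create it first
    fused = (ws / "fused.ply").as_posix()
    return [
        ["colmap", "feature_extractor", "--database_path", db, "--image_path", images],
        ["colmap", "exhaustive_matcher", "--database_path", db],
        ["colmap", "mapper", "--database_path", db, "--image_path", images,
         "--output_path", sparse],
        ["colmap", "image_undistorter", "--image_path", images, "--input_path", model0,
         "--output_path", dense, "--output_type", "COLMAP"],
        ["colmap", "patch_match_stereo", "--workspace_path", dense,
         "--workspace_format", "COLMAP"],
        ["colmap", "stereo_fusion", "--workspace_path", dense,
         "--workspace_format", "COLMAP", "--output_path", fused],
        ["colmap", "model_converter", "--input_path", model0,
         "--output_path", model_txt, "--output_type", "TXT"],
    ]
```

Each list can be passed to subprocess.run(cmd, check=True) once the colmap/sparse and colmap/model directories exist.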

Step 3: Download the SAM checkpoint

The SAM ViT-B checkpoint (~358 MB) is required for segmentation:

curl -L -o sam_vit_b_01ec64.pth \
  "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth"

Step 4: Run the preprocessing notebook

Open and run all cells in preprocessor.ipynb. This will:

Load the dense point cloud and voxel-downsample it
For each video frame: run SAM segmentation → CLIP embedding → 3D projection
Average the CLIP features across views and compute consistency scores
Save everything to semantic_scene.npz
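
The projection and averaging steps can be sketched with a plain pinhole camera model. This is an illustration under simplifying assumptions, not the notebook's code: no lens distortion, no occlusion test, and one embedding per segment id. All function and parameter names are hypothetical, and defining consistency as the fraction of frames in which a point received a feature is also an assumption. R and t follow the COLMAP convention of mapping world coordinates to camera coordinates.

```python
import numpy as np


def project_points(xyz, K, R, t, width, height):
    """Project world points into a frame; return pixel columns, rows, visibility mask."""
    cam = xyz @ R.T + t                     # world -> camera coordinates
    z = cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)     # placeholder depth for points behind the camera
    uv = (cam @ K.T)[:, :2] / safe_z[:, None]
    u = np.floor(uv[:, 0]).astype(int)
    v = np.floor(uv[:, 1]).astype(int)
    visible = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return u, v, visible


def accumulate_frame(feat_sum, hits, xyz, K, R, t, segment_map, segment_embeddings):
    """Add one frame's segment embeddings to every point that lands inside a segment.

    segment_map: (H, W) int array, -1 where no segment covers the pixel.
    segment_embeddings: (S, D) array, one embedding per segment id.
    feat_sum (N, D) and hits (N,) are updated in place.
    """
    height, width = segment_map.shape
    u, v, visible = project_points(xyz, K, R, t, width, height)
    idx = np.flatnonzero(visible)
    seg = segment_map[v[idx], u[idx]]
    covered = seg >= 0
    idx, seg = idx[covered], seg[covered]
    feat_sum[idx] += segment_embeddings[seg]
    hits[idx] += 1


def finalize(feat_sum, hits, n_frames):
    """Average features over the views that saw each point; consistency = hits / n_frames."""
    mean = feat_sum / np.maximum(hits, 1)[:, None]
    consistency = hits / max(n_frames, 1)
    return mean, consistency
```

Without an occlusion test, points hidden behind an object still receive that object's feature, which is one reason a consistency score across views is useful for filtering.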

Step 5: Launch the app

streamlit run app.py

The app loads semantic_scene.npz from the project root.

Project Structure

SemanticSpot3D/
├── app.py                  # Streamlit app
├── preprocessor.ipynb      # Pipeline to create semantic point cloud
├── requirements.txt        # Python dependencies
├── run.sh / run.bat        # Launch scripts
├── src/
│   ├── extract_frames.py   # Frame extraction utility
│   ├── frames/             # Extracted video frames
│   └── video.mp4           # Input video
├── colmap/
│   ├── fused.ply           # Dense 3D point cloud from COLMAP
│   └── model/              # COLMAP text exports (cameras, images, points3D)
├── semantic_scene.npz      # Semantic scene (gitignored, download separately)
└── sam_vit_b_01ec64.pth    # SAM checkpoint (gitignored, download separately)

Tech Stack

Component	Technology
3D Reconstruction	COLMAP
Image Segmentation	Segment Anything (SAM) - ViT-B
Semantic Features	CLIP - ViT-B/32 (HuggingFace Transformers)
Spatial Filtering	DBSCAN (scikit-learn)
App	Streamlit + Plotly
Point Cloud I/O	Open3D
