+
+
+ Open-Vocabulary Object Tracking with Grounding DINO, SAM 2, and CLIP
+
+ We present an open-vocabulary object tracking system that lets users find, segment, and track arbitrary objects in images and videos using natural-language queries.
+
+ Our pipeline combines Grounding DINO for text-conditioned object detection, CLIP for semantic verification, and SAM 2 for segmentation and temporal tracking.
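+
+ As a rough illustration of how the three stages might compose, the sketch below wires a detector, a verifier, and a segmenter together as plain callables. It is not the project's implementation: `run_pipeline`, `Detection`, the two thresholds, and the stub functions are hypothetical names and values chosen for the example, and no real Grounding DINO, CLIP, or SAM 2 API is called.
+
```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


@dataclass
class Detection:
    box: Box
    det_score: float   # detector confidence for the text query
    clip_score: float  # image-text similarity from the verification stage


def run_pipeline(
    image: Any,
    query: str,
    detect: Callable[[Any, str], List[Tuple[Box, float]]],
    verify: Callable[[Any, Box, str], float],
    segment: Callable[[Any, Box], Any],
    det_thresh: float = 0.3,    # assumed value, not taken from the source
    clip_thresh: float = 0.25,  # assumed value, not taken from the source
) -> List[Tuple[Detection, Any]]:
    """Detect candidate boxes, drop those that fail verification, segment the rest."""
    results = []
    for box, det_score in detect(image, query):
        if det_score < det_thresh:
            continue  # detector is not confident enough
        clip_score = verify(image, box, query)
        if clip_score < clip_thresh:
            continue  # cropped region does not match the query semantically
        mask = segment(image, box)  # box-prompted segmentation
        results.append((Detection(box, det_score, clip_score), mask))
    return results


# Stub stages standing in for the real models (hypothetical outputs).
def fake_detect(image, query):
    return [((10, 10, 50, 50), 0.9), ((0, 0, 5, 5), 0.1), ((20, 20, 80, 80), 0.6)]


def fake_verify(image, box, query):
    return 0.4 if box[0] == 10 else 0.1


def fake_segment(image, box):
    return {"box": box}  # placeholder for a binary mask


kept = run_pipeline(None, "red mug", fake_detect, fake_verify, fake_segment)
```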
+
+ The system supports interactive querying through a Gradio web interface and demonstrates how modern vision foundation models can be integrated into a unified visual understanding pipeline.
+
+