GuoleiSun/Awesome-SAM2

Awesome SAM2 (Segment Anything in Images and Videos)


📖 About This Repository

This repository aims to be a comprehensive collection of materials (papers, code, datasets, demos) about SAM2 (Segment Anything in Images and Videos), Meta AI's vision foundation model.

SAM2 extends the original SAM from images to videos within a single promptable model. This curated list covers the rapidly expanding ecosystem of SAM2 applications across diverse domains, from medical imaging to robotics and from 3D reconstruction to video generation.

🔥 Why SAM2?

SAM2 has revolutionized segmentation tasks by:

  • Providing unified image and video segmentation capabilities
  • Enabling zero-shot generalization across domains
  • Offering efficient real-time processing
  • Supporting diverse prompting mechanisms (points, boxes, masks), with text prompting added by follow-up works listed here

📈 Repository Stats: Currently tracking 500+ papers and projects across 15+ domains

🤝 Contributing: We continuously improve this collection. Feel free to submit PRs for missed works, corrections, or new categories!


This repo collects materials (papers, code, slides) about SAM2 (Segment Anything in Images and Videos), a vision foundation model released by Meta AI. We are continuously improving the project. PRs for missed works (papers, repos) are welcome.


📊 Repository Statistics

Domain Papers Key Highlights
🏥 Medical 80+ Surgery, 3D medical imaging, pathology
🎬 Video Processing 70+ Object tracking, temporal consistency
🤖 Robotics 50+ Manipulation, navigation, human-robot interaction
🛰️ Remote Sensing 20+ Satellite imagery, environmental monitoring
🎨 Generation/Editing 35+ Video synthesis, image editing, creative tools
🧊 3D Processing 45+ Point clouds, mesh processing, reconstruction
🎯 Core Segmentation 60+ Novel applications and improvements
Total 500+ Across 15+ domains

💡 Quick Navigation: Use Ctrl+F to search for specific keywords, or use the Table of Contents below to jump to domain-specific papers.
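For offline keyword search, the flat rows in this list can also be filtered with a short script. The following is a minimal sketch, assuming each entry is a single line of the form `YYYY.MM Title Link`; the names `Entry`, `parse_row`, `search`, and the `LINK_LABELS` tuple are hypothetical helpers introduced for illustration, and the label set is taken from the rows in this list and may not be exhaustive.

```python
import re
from dataclasses import dataclass

# Link-column labels observed in this list (assumption: not exhaustive).
LINK_LABELS = ("🔗 Code", "🌐Project page", "📖 Repo", "📊 Data", "🕒Soon", "NA")

# A row starts with a release date such as "2025.03", followed by the rest of the line.
ROW = re.compile(r"^(\d{4})\.(\d{2})\s+(.+)$")


@dataclass
class Entry:
    year: int
    month: int
    title: str
    link: str  # empty string when no known link label ends the row


def parse_row(line):
    """Parse one 'YYYY.MM Title Link' row; return None for headers and other lines."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    rest = m.group(3)
    link = ""
    for label in LINK_LABELS:
        # Require a space before the label so a title ending in e.g. "DNA" is not split.
        if rest.endswith(" " + label):
            link = label
            rest = rest[: -len(label)].rstrip()
            break
    return Entry(int(m.group(1)), int(m.group(2)), rest, link)


def search(lines, keyword):
    """Return entries whose title contains keyword (case-insensitive)."""
    needle = keyword.lower()
    entries = (parse_row(line) for line in lines)
    return [e for e in entries if e is not None and needle in e.title.lower()]


sample = [
    "Release Title Code",
    "2024.10 SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree 🔗 Code",
    "2025.04 MedSAM2: Segment Anything in 3D Medical Images and Videos 🔗 Code",
    "2024.08 Is SAM 2 Better than SAM in Medical Image Segmentation? NA",
]
hits = search(sample, "medical")
for e in hits:
    print(f"{e.year}.{e.month:02d} | {e.title} | {e.link or 'no link listed'}")
```

Rows whose link column uses a label outside `LINK_LABELS` are still matched; the unrecognised label simply stays at the end of the title.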

Contents

Papers/Projects

📚 Surveys & Reviews

Comprehensive overviews and systematic analyses of SAM2 applications across various domains

Release Title Code
2024.07 Segment Anything for Videos: A Systematic Survey 📖 Repo
2024.08 Unleashing the Potential of SAM2 for Biomedical Images and Videos: A Survey 📖 Repo
2024.10 On Efficient Variants of Segment Anything Model: A Survey NA
2025.03 SAM2 for Image and Video Segmentation: A Comprehensive Survey NA
2025.09 A Systematic Survey and Meta-Analysis of the Segment Anything Model in Remote Sensing Image Processing: Challenges, Advances, Applications, and Opportunities 📖 Repo

🎯 Traditional Segmentation

Core image and video segmentation applications, including novel architectures and domain-specific adaptations

🖼️ Image Segmentation

Release Title Code
2024.10 Towards Natural Image Matting in the Wild via Real-Scenario Prior 🔗 Code
2024.11 CorrCLIP: Reconstructing Correlations in CLIP with Off-the-Shelf Foundation Models for Open-Vocabulary Semantic Segmentation NA
2025.03 Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement 🌐Project page
2025.04 MGD-SAM2: Multi-view Guided Detail-enhanced Segment Anything Model 2 for High-Resolution Class-agnostic Segmentation 🔗 Code
2025.04 Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding 🌐Project page

Segmentation Applications

Release Title Code
2025.03 Unveiling the Potential of Segment Anything Model 2 for RGB-Thermal Semantic Segmentation with Language Guidance 🔗 Code
2025.03 MemorySAM: Memorize Modalities and Semantics with Segment Anything Model 2 for Multi-modal Semantic Segmentation 🔗 Code
2025.03 Segment Any-Quality Images with Generative Latent Space Enhancement NA
2025.03 Superpowering Open-Vocabulary Object Detectors for X-ray Vision 🔗 Code
2025.04 S4M: Boosting Semi-Supervised Instance Segmentation with SAM 🌐Project page
2025.04 KAN-SAM: Kolmogorov-Arnold Network Guided Segment Anything Model for RGB-T Salient Object Detection NA
2025.04 MovSAM: A Single-image Moving Object Segmentation Framework 🔗 Code
2025.04 Few-Shot Adaptation of Grounding DINO for Agricultural Domain NA
2025.08 LENS: Learning to Segment Anything with Unified Reinforced Reasoning 🔗 Code
2025.09 Semantic Segmentation of Marine Animal Images by U-Net Based on Multi-Cognitive Visual Adapter and Dual-Attention Fusion Mechanism 📊 Data
2025.10 SinkSAM-Net: Knowledge-driven self-supervised sinkhole segmentation using topographic priors and Segment Anything Model 🌐Project page

Other Image Tasks

Release Title Code
2024.11 SAMWISE: Infusing wisdom in SAM2 for Text-Driven Video Segmentation 🔗 Code
2024.11 SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory 🔗 Code
2025.04 MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking 🔗 Code
2025.05 Vision Foundation Model Embedding-Based Semantic Anomaly Detection NA
2025.05 Synthetic Data Pre-Training for Runway Damage Assessment NA
2025.05 Single-sided estimates of surface breaking porosity in additive manufacturing using multiple inspection techniques NA
2025.05 TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models 🔗 Code
2025.05 PixelThink: Towards Efficient Chain-of-Pixel Reasoning 🌐Project page
2025.05 SAMamba: Adaptive State Space Modeling with Hierarchical Vision for Infrared Small Target Detection 🔗 Code
2025.06 Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval 🔗 Code
2025.08 DOMR: Establishing Cross-View Segmentation via Dense Object Matching NA
2025.08 WeedSense: Multi-Task Learning for Weed Segmentation, Height Estimation, and Growth Stage Classification 🌐Project page

🎬 Video Segmentation

Temporal segmentation, object tracking, and video understanding applications

Release Title Code
2024.08 Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track NA
2024.08 The 2nd Solution for LSVOS Challenge RVOS Track: Spatial-temporal Refinement for Consistent Semantic Segmentation NA
2024.08 LSVOS Challenge 3rd Place Report: SAM2 and Cutie based VOS NA
2024.09 Temporally Propagated Masks and Boxes: Combining the Best of Both Worlds for Multi-Object Tracking NA
2024.10 SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree 🔗 Code
2024.11 A Distractor-Aware Memory for Visual Object Tracking with SAM2 🔗 Code
2024.11 There is no SAMantics! Exploring SAM as a Backbone for Visual Understanding Tasks 🔗 Code
2024.12 VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLMs 🌐Project page
2025.02 Towards Fine-grained Interactive Segmentation in Images and Videos NA
2025.03 WeGen: A Unified Model for Interactive Multimodal Generation as We Chat 🔗 Code
2025.03 Pseudo-LiDAR With Two-Dimensional Instance for Monocular Three-Dimensional Object Tracking NA
2025.03 Segment Any Motion in Videos 🌐Project page
2025.04 SAM2MOT: A Novel Paradigm of Multi-Object Tracking by Segmentation 🔗 Code
2025.04 DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency 🔗 Code
2025.07 HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking 🔗 Code

Referring/Reasoning Video Object Segmentation

Release Title Code
2024.11 SAMWISE: Infusing wisdom in SAM2 for Text-Driven Video Segmentation 🔗 Code
2024.11 SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory 🔗 Code
2025.03 Online Reasoning Video Segmentation with Just-in-Time Digital Twins NA
2025.04 The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation NA
2025.04 GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation 🌐Project page

Other Video Tasks

Release Title Code
2025.03 MMCD: Memory-Based Multimodal Change Detection NA
2025.03 EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting 🕒Soon
2025.03 Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene 🌐Project page
2025.03 EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining 🔗 Code
2025.03 High Temporal Consistency through Semantic Similarity Propagation in Semi-Supervised Video Semantic Segmentation for Autonomous Flight 🔗 Code
2025.03 FusionSegReID: Advancing Person Re-Identification with Multimodal Retrieval and Precise Segmentation NA
2025.04 Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting 🔗 Code
2025.04 How Can Objects Help Video-Language Understanding? 🕒Soon
2025.04 SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos NA
2025.05 Research on a traffic flow statistical algorithm based on YBOVDT and SAM2 📊 Data
2025.05 One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory NA
2025.06 ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer 🔗 Code
2025.06 Track Any Object: A Granular Video Anomaly Detection Pipeline 🌐Project page
2025.06 Open-World Object Counting in Videos 🔗 Code
2025.06 Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations NA
2025.06 SAM2RL: Towards Reinforcement Learning Memory Control in Segment Anything Model 2 NA
2025.07 Visual tracking by matching points using diffusion model 🔗 Code
2025.07 Intelligent and quantitative ligament breakup event analysis in 65 kHz off-axis holographic video of swirl spray 🔗 Code
2025.07 Towards Blind Bitstream-corrupted Video Recovery: A Visual Foundation Model-driven Framework NA
2025.07 SAMITE: Position Prompted SAM2 with Calibrated Memory for Visual Object Tracking 🔗 Code
2025.08 Towards automated video-based human behavior analysis: leveraging AI capabilities for spatial behavior detection NA

🔊 Audio-visual segmentation (AVS)

Multi-modal approaches combining audio and visual information for segmentation

Release Title Code
2024.08 Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation 🔗 Code
2025.02 Audio visual segmentation through text embeddings NA

📊 Graph Learning

Scene graph generation and graph-based reasoning with SAM2

Release Title Code
2025.03 Universal Scene Graph Generation 🌐Project page

🏥 Medical Domain

Healthcare applications including surgery, diagnostics, and biomedical research

Medical Video & 3D Segmentation

Release Title Code
2024.08 Segment anything in medical images and videos: Benchmark and deployment 🔗 Code
2024.08 SAM 2 in Robotic Surgery: An Empirical Evaluation for Robustness and Generalization in Surgical Video Segmentation NA
2024.08 Performance and Non-adversarial Robustness of the Segment Anything Model 2 in Surgical Video Segmentation NA
2024.08 Novel adaptation of video segmentation to 3D MRI: efficient zero-shot knee segmentation with SAM2 NA
2024.08 Biomedical SAM 2: Segment anything in biomedical images and videos 🔗 Code
2024.08 Polyp SAM 2: Advancing Zero-shot Polyp Segmentation in Colorectal Cancer Detection 🔗 Code
2024.08 Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning 🔗 Code
2024.09 SAM-OCTA2: Layer Sequence OCTA Segmentation with Fine-tuned Segment Anything Model 2 🔗 Code
2024.09 Self-Prompting Polyp Segmentation in Colonoscopy using Hybrid Yolo-SAM 2 Model 🔗 Code
2024.10 A-MFST: Adaptive Multi-Flow Sparse Tracker for Real-Time Tissue Tracking Under Occlusion NA
2024.10 ECHOPulse: ECG controlled echocardio-grams video generation 🔗 Code
2024.11 Phase-Informed Tool Segmentation for Manual Small-Incision Cataract Surgery NA
2025.02 SASVi - Segment Any Surgical Video 🔗 Code
2025.02 Text-Promptable propagation for referring medical image sequence segmentation NA
2025.02 Less is More? Revisiting the Importance of Frame Rate in Real-Time Zero-Shot Surgical Video Segmentation
2025.03 SurgiSAM2: Fine-tuning a foundational model for surgical video anatomy segmentation and detection 🔗 Code(& dataset)
2025.03 Surgical Gaussian Surfels: Highly Accurate Real-time Surgical Scene Rendering 🔗 Code
2025.03 Rethinking Few-Shot Medical Image Segmentation by SAM2: A Training-Free Framework with Augmentative Prompting and Dynamic Matching NA
2025.03 Self-Prompting Driven SAM2 for 3D Medical Image Segmentation NA
2025.04 RP-SAM2: Refining Point Prompts for Stable Surgical Instrument Segmentation 🔗 Code
2025.04 Agglomerating Large Vision Encoders via Distillation for VFSS Segmentation NA
2025.04 MedSAM2: Segment Anything in 3D Medical Images and Videos 🔗 Code
2025.05 Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos 🕒Soon
2025.05 Adapting Segment Anything 2 for Diabetic Retinopathy Lesion Segmentation NA
2025.07 Beyond Rigid AI: Towards Natural Human-Machine Symbiosis for Interoperative Surgical Assistance NA
2025.07 Towards Affordable Tumor Segmentation and Visualization for 3D Breast MRI Using SAM2 NA
2025.08 Edge2Prompt: Modality-Agnostic Model for Out-of-Distribution Liver Segmentation NA
2025.08 F2PASeg: Feature Fusion for Pituitary Anatomy Segmentation in Endoscopic Surgery 🔗 Code
2025.08 TSMS-SAM2: Multi-scale Temporal Sampling Augmentation and Memory-Splitting Pruning for Promptable Video Object Segmentation and Tracking in Surgical Scenarios 🔗 Code
2025.08 SAM2Med3D: Leveraging video foundation models for 3D breast MRI segmentation Data Upon Request

Medical Image Segmentation

Release Title Code
2024.08 SAM & SAM 2 in 3D Slicer: SegmentWithSAM Extension for Annotating Medical Images 🔗 Code
2024.08 SAM2-PATH: A better segment anything model for semantic segmentation in digital pathology 🔗 Code
2024.08 Is SAM 2 Better than SAM in Medical Image Segmentation? NA
2024.08 A Short Review and Evaluation of SAM2's Performance in 3D CT Image Segmentation 🔗 Code
2024.08 Interactive 3D Medical Image Segmentation 🔗 Code
2024.08 SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation 🔗 Code
2024.08 SAM2-Adapter: Evaluating & Adapting Segment Anything 2 in Downstream Tasks: Camouflage, Shadow, Medical Image Segmentation, and More 🔗 Code
2024.08 Retrieval-augmented Few-shot Medical Image Segmentation with Foundation Models
2024.10 SAM-Swin: SAM-Driven Dual-Swin Transformers with Adaptive Lesion Enhancement for Laryngo-Pharyngeal Tumor Detection 🔗 Code
2024.11 A multi-task learning model for clinically interpretable sesamoiditis grading NA
2024.11 Zero-shot capability of SAM-family models for bone segmentation in CT scans NA
2024.11 SAM-I2I: Unleash the Power of Segment Anything Model for Medical Image Translation NA
2024.12 Medical SAM 2: Segment Medical Images As Video Via Segment Anything Model 2 🔗 Code
2025.03 Self-Prompting Driven SAM2 for 3D Medical Image Segmentation NA
2025.03 WeakMedSAM: Weakly-Supervised Medical Image Segmentation via SAM with Sub-Class Exploration and Prompt Affinity Mining 🔗 Code
2025.03 Research on recognition of diabetic retinopathy hemorrhage lesions based on fine tuning of segment anything model NA
2025.04 HRMedSeg: Unlocking High-resolution Medical Image segmentation via Memory-efficient Attention Modeling 🔗 Code
2025.04 Prompt Once, Segment Everything: Leveraging SAM 2 Potential for Infinite Medical Image Segmentation with a Single Prompt 🔗 Code
2025.04 Semi-automated segmentation of magnitude images in 4D flow MR scans using segment anything model 2 (SAM 2) NA
2025.05 ReSurgSAM2: Referring Segment Anything in Surgical Video via Credible Long-term Tracking 🔗 Code
2025.06 MorphSAM: Learning the Morphological Prompts from Atlases for Spine Image Segmentation NA
2025.06 SynPo: Boosting Training-Free Few-Shot Medical Segmentation via High-Quality Negative Prompts 🌐Project page
2025.06 Detection of Breast Cancer Lumpectomy Margin with SAM-incorporated Forward-Forward Contrastive Learning 🔗 Code
2025.07 Speckle2Self: Self-Supervised Ultrasound Speckle Reduction Without Clean Data 🕒Soon
2025.08 Training-Free Breast Ultrasound Image Segmentation with Retrieval-based SAM2 NA
2025.08 CXR-ODDet: An Omni-Decoupled Multi-Class Lesion Localization Framework for Automatic Chest X-Ray Analysis NA
2025.08 LGFFM: A Localized and Globalized Frequency Fusion Model for Ultrasound Image Segmentation 🔗 Code
2025.09 Co-Seg: Mutual Prompt-Guided Collaborative Learning for Tissue and Nuclei Segmentation 🔗 Code

Other Medical Applications

Release Title Code
2025.03 Flip Learning: Weakly supervised erase to segment nodules in breast ultrasound NA
2025.03 From Monocular Vision to Autonomous Action: Guiding Tumor Resection via 3D Reconstruction NA
2025.03 Operating Room Workflow Analysis via Reasoning Segmentation over Digital Twins NA
2025.03 Early Detection and Classification of Lung Cancer using Segment Anything Model 2 and Dense Net NA
2025.04 Zero-Shot 4D Lidar Panoptic Segmentation NA
2025.04 SYNTHFM: Training Modality-Agnostic Foundation Models for Medical Image Segmentation Without Real Medical Data NA
2025.04 VoxelFeat: Voxel-wise foundation model features NA
2025.06 Leadership Assessment in Pediatric Intensive Care Unit Team Training NA
2025.08 GM-ABS: Promptable Generalist Model Drives Active Barely Supervised Training in Specialist Model for 3D Medical Image Segmentation 🔗 Code
2025.08 Transgene-free generation of mouse post-gastrulation whole embryo models solely from naive ESCs and iPSCs NA
2025.09 A Machine Learning Assisted Tool and Numerical Model for Analyzing Lipid Nanoparticles 🔗 Code
2025.09 Evaluating the Efficacy of Mebendazole Repurposing for Ovarian Cancer Therapy Using Optical Coherence Tomography NA
2025.09 Evaluation of Radio Frequency Ablation in Human Left Atrial Tissues for Atrial Fibrillation Using Optical Coherence Tomography NA
2025.09 Abundance of Maternal Mitochondrial Genome Is Dispensable up to the Mitochondrial Genome Activation in Post-Implantation Embryos NA
2025.09 The Evaluation of a Deep Learning Approach to Automatic Segmentation of Teeth and Shade Guides for Tooth Shade Matching Using the SAM2 Algorithm NA
2025.09 Neck-focused Remote Photoplethysmography (rPPG): A comparative study using clinical data and the PyVHR framework NA

🎭 Camouflaged Object Detection (COD)

Detecting and segmenting objects that blend with their surroundings

Video COD

Release Title Code
2024.07 Evaluating SAM2's Role in Camouflaged Object Detection: From SAM to SAM2 🔗 Code
2024.09 When SAM2 Meets Video Camouflaged Object Segmentation: A Comprehensive Evaluation and Adaptation 🔗 Code
2025.03 CamSAM2: Segment Anything Accurately in Camouflaged Videos 🔗 Code
2025.04 CamoSAM2: Motion-Appearance Induced Auto-Refining Prompts for Video Camouflaged Object Detection NA

Image COD

Release Title Code
2024.08 SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation 🔗 Code
2024.08 SAM2-Adapter: Evaluating & Adapting Segment Anything 2 in Downstream Tasks: Camouflage, Shadow, Medical Image Segmentation, and More 🔗 Code
2025.07 HFS-SAM2: Segment Anything Model 2 with High-Frequency Feature Supplementation for Camouflaged Object Detection 🔗 Code
2025.09 SLENet: A Guidance-Enhanced Network for Underwater Camouflaged Object Detection NA

🛰️ Remote Sensing

Satellite imagery analysis, environmental monitoring, and geospatial applications

Release Title Code
2024.11 DED-SAM: Adapting Segment Anything Model 2 for Dual Encoder-Decoder Change Detection NA
2025.01 Prompt-Based Segmentation at Multiple Resolutions and Lighting Conditions using Segment Anything Model 2 NA
2025.03 Customized SAM 2 for Referring Remote Sensing Image Segmentation NA
2025.05 InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition 🔗 Code
2025.06 Baltimore Atlas: FreqWeaver Adapter for Semi-supervised Ultra-high Spatial Resolution Land Cover Classification NA
2025.06 Bundle adjustment for multi-source Mars orbiter imagery with generalized control constraints NA
2025.07 Leveraging SAM 2 and LiDAR for Automated Individual Tree Crown Delineation: A Comparative Evaluation of Prompting Methods 🔗 Code
2025.07 Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment 🔗 Code
2025.07 CSW-SAM: a cross-scale algorithm for very-high-resolution water body segmentation based on segment anything model 2 NA
2025.07 A Fine Agricultural Flood Segmentation Model For HJ-2E S-band SAR Data NA
2025.08 DeH4R: A Decoupled and Hybrid Method for Road Network Graph Extraction 🔗 Code
2025.09 PeftCD: Leveraging Vision Foundation Models with Parameter-Efficient Fine-Tuning for Remote Sensing Change Detection 🔗 Code
2025.09 An Automatic Sample Augmentation Method for Paddy Rice Mapping Based on Segment Anything Model and Phenological Features—A Case Study in Southwest China NA
2025.09 SOPSeg: Prompt-based Small Object Instance Segmentation in Remote Sensing Imagery NA
2025.09 BiSAM-CD: Zero-Shot Remote Sensing Change Detection via Bidirectional Temporal Memory in SAM2 🔗 Code
2025.10 SinkSAM-Net: Knowledge-driven self-supervised sinkhole segmentation using topographic priors and Segment Anything Model 🌐Project page

🧊 3D Processing & Point Clouds

Three-dimensional data processing, reconstruction, and analysis

3D Segmentation

Release Title Code
2024.08 Segment Any Mesh: Zero-shot Mesh Part Segmentation via Lifting Segment Anything 2 to 3D 🔗 Code
2024.11 Object and Contact Point Tracking in Demonstrations Using 3D Gaussian Splatting NA
2024.11 Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking 🌐Project page
2025.03 Segment-then-Splat: A Unified Approach for 3D Open-Vocabulary Segmentation based on Gaussian Splatting 🌐Project page
2025.04 DSM: Building A Diverse Semantic Map for 3D Visual Grounding 🌐Project page
2025.07 GraphSeg: Constructing Segmented 3D Representations via Graph Edge Addition and Contraction NA

3D Reconstruction

Release Title Code
2024.11 Updating Dynamic 3D Scene Graphs from Egocentric Observations 🌐Project page
2024.12 Deblur4DGS: 4D Gaussian Splatting from Blurry Monocular Videos 🌐Project page
2025.02 Inter3D: A Benchmark and Strong Baseline for Human-Interactive 3D Object Reconstruction 🔗 Code

Other 3D Applications

Release Title Code
2025.03 LP-Gaussians: Learnable Parametric Gaussian Splatting for Efficient Dynamic Reconstruction of Single-View Scenes 🌐Project page
2025.03 DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction 🌐Project page
2025.03 Free Your Hands: Lightweight Relightable Turntable Capture Pipeline NA
2025.03 WildSeg3D: Segment Any 3D Objects in the Wild from 2D Images 🕒Soon
2025.03 Pseudo-LiDAR With Two-Dimensional Instance for Monocular Three-Dimensional Object Tracking NA
2025.03 SED-MVS: Segmentation-Driven and Edge-Aligned Deformation Multi-View Stereo with Depth Restoration and Occlusion Constraint NA
2025.03 SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining 🔗 Code
2025.03 COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting 🔗 Code
2025.03 Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields 🌐Project page
2025.03 Semantic Consistent Language Gaussian Splatting for Point-Level Open-vocabulary Querying 🌐Project page
2025.04 Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting 🌐Project page
2025.04 FMLGS: Fast Multilevel Language Embedded Gaussians for Part-level Interactive Agents NA
2025.05 Constructing a 3D Town from a Single Image 🌐Project page
2025.06 GENMANIP: LLM-driven Simulation for Generalizable Instruction-Following Manipulation 🌐Project page
2025.06 GenMOJO: Robust Multi-Object 4D Generation for In-the-wild Videos 🌐Project page
2025.06 CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image 🌐Project page
2025.06 BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing 🌐Project page
2025.07 LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion 🌐Project page
2025.07 Consistent Bokeh for Multi-View Images With 3D Gaussian Splatting 🔗 Code
2025.07 Defect segmentation and 3D reconstruction in concrete structures using SAM 2 and 3D Gaussian splatting Upon Request
2025.07 Image-Guided Shape-from-Template Using Mesh Inextensibility Constraints 🔗 Code
2025.07 MG-Mono: A Lightweight Multi-Granularity Method for Self-Supervised Monocular Depth Estimation 🔗 Code
2025.08 Training-free automatic instance segmentation of girder bridge point cloud via large model fusion with reverse entity modelling verification NA
2025.08 SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass 🌐Project page

🎨 Image or Video Generation & Editing

Creative applications including content generation, editing, and artistic tools

Release Title Code
2024.10 AdaptiveDrag: Semantic-Driven Dragging on Diffusion-Based Image Editing 🔗 Code
2024.11 VideoDirector: Precise Video Editing via Text-to-Video Models 🌐Project page
2024.11 VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing 🌐Project page
2024.11 Generative Omnimatte: Learning to Decompose Video into Layers 🌐Project page
2024.12 InterDyn: Controllable Interactive Dynamics with Video Diffusion Models 🌐Project page
2025.01 MovieCharacter: A Tuning-Free Framework for Controllable Character Video Synthesis 🌐Project page
2025.01 BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations 🌐Project page
2025.03 TransVDM: Motion-Constrained Video Diffusion Model for Transparent Video Synthesis NA
2025.03 Towards More Accurate Personalized Image Generation: Addressing Overfitting and Evaluation Bias 🔗 Code
2025.03 Unified Dense Prediction of Video Diffusion NA
2025.03 DreamInsert: Zero-Shot Image-to-Video Object Insertion from A Single Image 🕒Soon
2025.03 FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing 🌐Project page
2025.03 MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance 🌐Project page
2025.03 Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting 🌐Project page
2025.03 Multi-Subject and Motion Customization of Text-to-Video Diffusion Models 🌐Project page
2025.04 DreamFuse: Adaptive Image Fusion with Diffusion Transformer 🌐Project page
2025.04 Enhanced Semantic Extraction and Guidance for UGC Image Super Resolution 🔗 Code
2025.06 Keyframe-Guided Creative Video Inpainting 🌐Project page
2025.06 OmniGen2: Exploration to Advanced Multimodal Generation 🌐Project page
2025.07 Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning 🔗 Code
2025.07 Enhanced Velocity Field Modeling for Gaussian Video Reconstruction NA
2025.08 NEP: Autoregressive Image Editing via Next Editing Token Prediction 🌐Project page

🗺️ Simultaneous Localization and Mapping (SLAM / VO)

Navigation, mapping, and localization applications

Release Title Code
2024.11 OVO-SLAM: Open-Vocabulary Online Simultaneous Localization and Mapping NA
2025.06 MCOO-SLAM: A Multi-Camera Omnidirectional Object SLAM System NA
2025.07 VISTA: Monocular Segmentation-Based Mapping for Appearance and View-Invariant Global Localization NA
2025.08 FlyMeThrough: Human-AI Collaborative 3D Indoor Mapping with Commodity Drones NA

💡 Light Field Segmentation

Advanced imaging techniques and multi-dimensional visual processing

Release Title Code
2024.11 Segment Anything in Light Fields for Real-Time Applications via Constrained Prompting 🔗 Code

🤖 Robotics

Autonomous systems, manipulation, navigation, and human-robot interaction

Release Title Code
2024.10 A Pipeline for Segmenting and Structuring RGB-D Data for Robotics Applications NA
2024.10 VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model 🌐Project page
2025.02 Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos 🌐Project page
2025.02 Map Space Belief Prediction for Manipulation-Enhanced Mapping (To be released)
2025.02 Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids 🌐Project page
2025.03 DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping 🌐Project page
2025.03 Autonomous Dissection in Robotic Cholecystectomy NA
2025.03 MetaFold: Language-Guided Multi-Category Garment Folding Framework via Trajectory Generation and Foundation Model 🌐Project page
2025.03 LuciBot: Automated Robot Policy Learning from Generated Videos 🌐Project page
2025.03 IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models 🌐Project page
2025.03 VISO-Grasp: Vision-Language Informed Spatial Object-centric 6-DoF Active View Planning and Grasping in Clutter and Invisibility NA
2025.03 ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis 🌐Project page
2025.04 Slot-Level Robotic Placement via Visual Imitation from Single Human Video 🌐Project page
2025.04 Entangled chip removal utilizing mass-spring model with mobile manipulator NA
2025.05 Symbolically-Guided Visual Plan Inference from Uncurated Video Data NA
2025.05 Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer 🌐Project page
2025.05 Grasp the Invisibility by Vision-Language guided Active View Planning NA
2025.07 Geometry-aware 4D Video Generation for Robot Manipulation 🌐Project page
2025.07 Object-Centric Mobile Manipulation through SAM2-Guided Perception and Imitation Learning NA
2025.07 GraspGen: A Diffusion-based Framework for 6-DOF Grasping 🌐Project page
2025.07 RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping 🔗 Code
2025.08 Train Once, Deploy Anywhere: Realizing Data-Efficient Dynamic Object Manipulation 🔗 Code
2025.09 ObjectReact: Learning Object-Relative Control for Visual Navigation 🌐Project page

⚡ Adaptation, Compression & Edge Applications

Efficiency optimizations, model compression, and deployment on resource-constrained devices

Release Title Code
2025.03 LVMScissor: Split and Schedule Large Vision Model Inference on Mobile Edges via Salp Swarm Algorithm NA
2025.03 SALT: Parameter-Efficient Fine-Tuning via Singular Value Adaptation with Low-Rank Transformation 🔗 Code
2025.04 Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation Models NA
2025.05 Deploying Vision Foundation AI Models on the Edge. The SAM2 Experience NA
2025.09 ZLATTE: A Geometry-Aware, Learning-Free Framework for Language-Driven Trajectory Reshaping in Human-Robot Interaction NA

📖 Training

Resources for model training, datasets, and learning frameworks

Datasets

Curated datasets and benchmarks for SAM2 training and evaluation

Release Title Code
2025.02 SurgPose: a Dataset for Articulated Robotic Surgical Tool Pose Estimation and Tracking 🌐Project page
2025.02 The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition NA
2025.02 Picking the Cream of the Crop: Visual-Centric Data Selection with Collaborative Agents 🔗 Code
2025.03 Phantom: Training Robots Without Robots Using Only Human Videos 🌐Project page
2025.03 Scalable Real2Sim: Physics-Aware Asset Generation Via Robotic Pick-and-Place Setups 🌐Project page
2025.03 What Are You Doing? A Closer Look at Controllable Human Video Generation 🔗 Code
2025.03 Instrument-Splatting: Controllable Photorealistic Reconstruction of Surgical Instruments Using Gaussian Splatting 🕒Soon
2025.03 Referring to Any Person 🌐Project page
2025.03 AUTV: Creating Underwater Video Datasets with Pixel-wise Annotations NA
2025.03 DynOPETs: A Versatile Benchmark for Dynamic Object Pose Estimation and Tracking in Moving Camera Scenarios 🌐Project page
2025.05 A fusion network for multi-modality medical image registration with progressive feature alignment 🔗 Code
2025.04 InteractAvatar: Modeling Hand-Face Interaction in Photorealistic Avatars with Deformable Gaussians NA
2025.04 VideoSPatS: Video SPatiotemporal Splines for Disentangled Occlusion, Appearance and Motion Modeling and Editing 🌐Project page
2025.04 UrbanWaste: In-the-Bin Dataset for Waste Disposal Inspection with Multi-Granularity Hierarchical Labels 🌐Project page
2025.06 HD-EPIC: A Highly-Detailed Egocentric Video Dataset 🌐Project page
2025.06 GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities 🌐Project page
2025.06 INTERNSPATIAL: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models NA
2025.06 BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos 🌐Project page
2025.06 SAM4D: Segment Anything in Camera and LiDAR Streams 🌐Project page
2025.06 XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation 🌐Project page
2025.07 A New Dataset and Performance Benchmark for Real-time Spacecraft Segmentation in Onboard Flight Computers 🔗 Code
2025.07 Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation 🌐Project page
2025.08 MOSE: Complex Video Object Segmentation Dataset 🌐Project page
2025.08 DreamVE: Unified Instruction-based Image and Video Editing 🌐Project page
2025.09 SPATIALVID: A Large-scale Video Dataset with Spatial Annotations 🌐Project page

🔄 Used for Data Augmentation (/Tool)

Tools and methods for data synthesis, augmentation, and preprocessing

Release Title Code
2025.02 Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach 🌐Project page
2025.03 A Taxonomy for Evaluating Generalist Robot Policies 🌐Project page
2025.03 CRESTE: Scalable Mapless Navigation with Internet Scale Priors and Counterfactual Guidance 🌐Project page
2025.03 Shaken, Not Stirred: A Novel Dataset for Visual Understanding of Glasses in Human-Robot Bartending Tasks 🌐Project page
2025.03 YOLOE: Real-Time Seeing Anything 🔗 Code
2025.03 VACE: All-in-One Video Creation and Editing 🌐Project page
2025.03 V2M4: 4D Mesh Animation Reconstruction from a Single Monocular Video 🌐Project page
2025.03 Better Together: Unified Motion Capture and 3D Avatar Reconstruction NA
2025.03 CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance NA
2025.03 RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance 🌐Project page
2025.03 Evaluating the FLUX.1 Synthetic Data on YOLOv9 for AI-Powered Poultry Farming NA
2025.03 Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation 🌐Project page
2025.05 LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration 🔗 Code
2025.04 VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning 🌐Project page
2025.04 M2Flow: A Motion Information Fusion Framework for Enhanced Unsupervised Optical Flow Estimation in Autonomous Driving NA
2025.05 Interspatial Attention for Efficient 4D Human Video Generation 🌐Project page
2025.06 Real-Time Per-Garment Virtual Try-On with Temporal Consistency for Loose-Fitting Garments NA
2025.06 Impact of Synthetic Data from Diffusion Models on Weed Detection Performance NA
2025.06 VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models 🌐Project page
2025.06 Building Software for Analyzing Muck Piles After Blasting in Laboratory Conditions with Integrated Artificial Intelligence NA
2025.06 WeedSwin hierarchical vision transformer with SAM-2 for multi-stage weed detection and classification On Request
2025.07 Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data 🔗 Code
2025.07 RCG: Safety-Critical Scenario Generation for Robust Autonomous Driving via Real-World Crash Grounding 🕒Soon

🛠️ Training Helper

Supporting tools and frameworks for model training and fine-tuning

Release Title Code
2025.03 DINeMo: Learning Neural Mesh Models with no 3D Annotations 🌐Project page
2025.04 CoProSketch: Controllable and Progressive Sketch Generation with Diffusion Model NA
2025.04 Aligning Anime Video Generation with Human Feedback 🕒Soon
2025.04 OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding 🌐Project page
2025.06 HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation 🌐Project page
2025.06 Enhancing Visual Localization with Cross-Domain Image Generation 🌐Project page
2025.07 RoboBrain 2.0: See Better. Think Harder. Do Smarter. 🌐Project page
2025.07 Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents 🔗 Code

📊 Performance Evaluations

Benchmarking, evaluation metrics, and comparative studies

Release Title Code
2025.02 Vector-Quantized Vision Foundation Models for Object-Centric Learning NA
2025.04 WorldScore: A Unified Evaluation Benchmark for World Generation 🌐Project page
2025.04 BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting NA
2025.05 UWSAM: Segment Anything Model Guided Underwater Instance Segmentation and A Large-scale Benchmark Dataset 🔗 Code
2025.05 Leveraging Segment Anything Model 2 (SAM 2) to optimize segmentation for synthetic data quality in high-clutter baggage NA
2025.05 Synergistic Enhancement: A Study on the Design of Large Models Assisted by End-to-End Road Damage Prompt Network and Methods for Quantification of Damage Morphological Features NA
2025.06 Towards Scalable and Generalizable Earth Observation Data Mining via Foundation Model Composition NA
2025.06 AI-Driven MRI-based Brain Tumour Segmentation Benchmarking NA
2025.07 A multi-modal dataset for insect biodiversity with imagery and DNA at the trap and individual level 🔗 Code
2025.07 enLLASD: An ensemble deep learning framework to automate derivation of lower-limb alignments for skeletal dysplasia NA
2025.07 Semantic Segmentation of iPS Cells: Case Study on Model Complexity in Biomedical Imaging NA
2025.08 Segmenting the Complex and Irregular in Two-Phase Flows: A Real-World Empirical Study with SAM2 NA

Post Processing

Release Title Code
2025.03 Easi3R: Estimating Disentangled Motion from DUSt3R Without Training 🌐Project page
2025.04 Multi-identity Human Image Animation with Structural Video Diffusion NA
2025.06 Vid-CamEdit: Video Camera Trajectory Editing with Generative Rendering from Estimated Geometry 🌐Project page
2025.06 Leader360V: A Large-scale, Real-world 360 Video Dataset for Multi-task Learning in Diverse Environments NA
2025.07 Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction 🌐Project page

🛡️ Robustness

Security, adversarial robustness, and reliability studies

Release Title Code
2025.04 Robust SAM: On the Adversarial Robustness of Vision Foundation Models NA

🌟 Unique Applications/Usage

Novel and creative applications that don't fit traditional categories

Release Title Code
2024.09 Towards Robust Automation of Surgical Systems via Digital Twin-based Scene Representations from Foundation Models NA
2024.09 Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models 🔗 Code
2024.09 Point of Interest Recognition and Tracking in Aerial Video during Live Cycling Broadcasts NA
2024.10 ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting 🔗 Code
2024.10 GRS: Generating Robotic Simulation Tasks from Real-World Images NA
2024.10 Iterative Optimization Annotation Pipeline and ALSS-YOLO-Seg for Efficient Banana Plantation Segmentation in UAV Imagery NA
2024.10 Next Best Sense: Guiding Vision and Touch with FisherRF for 3D Gaussian Splatting 🔗 Code
2025.01 Zero-Shot Pupil Segmentation with SAM 2: A Case Study of Over 14 Million Images 📊Data
2025.02 Best Foot Forward: Robust Foot Reconstruction in-the-wild
2025.03 ROCKET-2: Steering Visuomotor Policy via Cross-View Goal Alignment 🌐Project page
2025.03 JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse 🌐Project page
2025.04 MORPHEUS: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments 🌐Project page
2025.05 Air-Ground Collaboration for Language-Specified Missions in Unknown Environments NA
2025.06 In Situ Detection and Measurement of Broccoli Heads Under Different Lighting Conditions Using Proximal Remote Sensing NA
2025.07 Zero-Shot Recognition of Test Tube Types by Automatically Collecting and Labeling RGB Data NA
2025.07 Box Pose and Shape Estimation and Domain Adaptation for Large-Scale Warehouse Automation NA
2025.07 Phys2Real: Physically-Informed Gaussian Splatting for Adaptive Sim-to-Real Transfer in Robotic Manipulation NA

🤝 Contributing

We welcome contributions from the community! Here's how you can help improve this repository:

How to Contribute

  1. Submit a Pull Request with new papers/projects
  2. Report Issues for broken links or incorrect information
  3. Suggest New Categories for emerging SAM2 applications
  4. Improve Organization by suggesting better categorization

Contribution Guidelines

  • Format: Follow the existing table format with Release Date | Title | Code links
  • Quality: Include only peer-reviewed papers or significant projects
  • Recency: Focus on 2024+ publications (SAM2 era)
  • Completeness: Provide accurate metadata and working links when available
  • Categories: Place papers in the most appropriate domain category

Adding New Papers

| YYYY.MM | [Paper Title](paper_link) | [🔗 Code](code_link) / [🌐Project page](project_link) / NA |
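
The row template above can also be filled in programmatically. The following is a minimal Python sketch, not a tool shipped with this repository: `format_row`, `RELEASE_PATTERN`, and all parameter names are hypothetical, and the `example.org` URLs are placeholders. It assumes a row without a code or project link should fall back to `NA`, as the template suggests.

```python
import re

# Hypothetical helper for illustration; not part of this repository.
# Matches a release date in the YYYY.MM form used in the tables.
RELEASE_PATTERN = re.compile(r"^[0-9]{4}\.(0[1-9]|1[0-2])$")


def format_row(release, title, paper_link, link_label=None, link_url=None):
    """Return one Markdown table row in the Release | Title | Code layout."""
    if not RELEASE_PATTERN.match(release):
        raise ValueError(f"release must look like YYYY.MM, got {release!r}")
    # Fall back to NA when no code or project link is supplied.
    if link_label and link_url:
        code_cell = f"[{link_label}]({link_url})"
    else:
        code_cell = "NA"
    return f"| {release} | [{title}]({paper_link}) | {code_cell} |"


# Placeholder values only; replace them with the real paper metadata.
print(format_row("2025.03", "Paper Title", "https://example.org/paper",
                 "🔗 Code", "https://example.org/code"))
print(format_row("2025.03", "Paper Title", "https://example.org/paper"))
```

Rejecting malformed dates up front keeps a pasted row consistent with the `YYYY.MM` convention before it reaches a pull request.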

Link Icons Guide

  • 🔗 [🔗 Code] - Source code repositories
  • 🌐 [🌐Project page] - Official project websites
  • 📊 [📊Data] - Datasets and benchmarks
  • 🖥️ [🖥️Demo] - Interactive demonstrations
  • 📖 [📖Repo] - Documentation repositories
  • 🕒 🕒Soon - Coming soon
  • NA - Not available
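
As a rough illustration, the icon guide above can be expressed as a small lookup that produces the Code cell of a row. This is only a sketch under stated assumptions: `LINK_LABELS`, `UNLINKED_LABELS`, `link_cell`, and the short keys such as `"code"` or `"soon"` are hypothetical names, not something this repository defines. It assumes that `🕒Soon` and `NA` are written as plain text while every other label wraps a URL.

```python
# Hypothetical lookup built from the Link Icons Guide; all names are illustrative.
LINK_LABELS = {
    "code": "🔗 Code",
    "project": "🌐Project page",
    "data": "📊Data",
    "demo": "🖥️Demo",
    "repo": "📖Repo",
}

# Labels that appear as plain text without a link.
UNLINKED_LABELS = {"soon": "🕒Soon", "na": "NA"}


def link_cell(kind, url=None):
    """Return the Markdown text for the Code column of a table row."""
    if kind in UNLINKED_LABELS:
        return UNLINKED_LABELS[kind]
    if kind not in LINK_LABELS:
        raise KeyError(f"unknown link kind: {kind!r}")
    if not url:
        raise ValueError(f"link kind {kind!r} needs a URL")
    return f"[{LINK_LABELS[kind]}]({url})"


# Placeholder URL only.
print(link_cell("project", "https://example.org/project"))
print(link_cell("soon"))
```

Keeping the labels in one place avoids spelling drift between entries, such as a missing space in `🔗 Code`.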

📈 Repository Metrics

  • Total Papers: 500+ and growing
  • Domains Covered: 15+ major application areas
  • Time Range: 2024-2025 (SAM2 era)
  • Update Frequency: Weekly additions
  • Community: Open for contributions

🎉 Acknowledgments

Special thanks to:

  • Meta AI for releasing SAM2 and advancing the field
  • Research Community for the incredible pace of SAM2 adoption and innovation
  • Contributors who help maintain and improve this collection
  • Reviewers who ensure quality and accuracy

📄 License

This collection is maintained under the MIT License. Individual papers and projects retain their original licenses.


⭐ Star this repository if you find it helpful! ⭐
Last updated: August 2025 | Maintained by the Community
