When using local SAM-Audio models (`sam-audio-large` and `sam-audio-large-tv`), the separation results are consistently worse than those shown on the official demo page (https://aidemos.meta.com/segment-anything/editor/segment-audio). The issues manifest in two main areas:

- **General Performance Gap:** The separation quality from the local models is consistently worse than the official demo page, even when using the official checkpoints and following the official examples.
- **Visual Prompting Performance:** When using visual prompting without text descriptions (setting `descriptions=[""]` as recommended in the documentation), the separation results are particularly poor, often failing to isolate the target sound effectively.
Environment

- Model versions tested:
  - `sam-audio-large` (from hub)
  - `sam-audio-large-tv` (from hub)
- SAM3 version: latest from `git+https://github.com/facebookresearch/sam3.git`
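Since version drift between the demo and a local install is one possible cause of the gap, it may help to attach exact package versions to this report. A minimal way to capture them (the `grep` pattern is a guess; adjust it to whatever `pip list` actually shows for your install):

```shell
# Record the interpreter version and any SAM/torch-related packages.
# Package names here are assumptions, not confirmed distribution names.
python3 -c "import platform; print('python', platform.python_version())"
python3 -m pip freeze | grep -i -E 'sam|torch' || echo "no matching packages found"
```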
I've tried the following approaches to improve the results, but none have resolved the performance gap:

- Parameter Configuration:
  - ✅ Enabled `predict_spans=True`
  - ✅ Set `reranking_candidates=8`
  - ✅ Used empty string `descriptions=[""]` for visual prompting
- Model Versions:
  - ✅ Tested both `sam-audio-large` and `sam-audio-large-tv`
  - ✅ Verified checkpoint files are complete and properly loaded
  - ✅ Confirmed model weights are loaded with `strict=True`
- Mask Quality:
  - ✅ Confirmed mask values are correct (binary masks from SAM3)
- Text Prompting Comparison:
  - When using text prompting alone (without visual masks), results are better but still do not match the official demos
  - The combination of visual + text prompting shows some improvement, but still falls short
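For concreteness, here is a minimal sketch of how the configuration above is assembled. The function and class names are illustrative placeholders, not the actual SAM-Audio API; only the parameter names and values come from the setup described above:

```python
# Illustrative sketch only: the final `model.separate(...)` call shown in
# the comment is a placeholder, not the real SAM-Audio entry point.
# The parameter values are the ones used in the experiments above.
SEPARATION_KWARGS = dict(
    descriptions=[""],       # empty text prompt => visual prompting only
    predict_spans=True,      # span prediction enabled
    reranking_candidates=8,  # rerank among 8 candidates
)

def make_config(model_name: str) -> dict:
    """Bundle a model name with the shared separation parameters."""
    return {"model": model_name, **SEPARATION_KWARGS}

configs = [make_config(m) for m in ("sam-audio-large", "sam-audio-large-tv")]
# Hypothetical call shape: model.separate(audio, masks=sam3_masks, **SEPARATION_KWARGS)
```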
Specific Issues Observed

- **Gender Voice Separation Failure:** On the official model hub demos, the model successfully separates male and female voices. However, when using local models for inference, male and female voices are consistently mixed together, even when the visual object is clearly an isolated male person. This suggests the visual prompting is not effectively guiding the audio separation.
- **Irrelevant Sound Extraction with Empty Descriptions:** When using visual prompting with `descriptions=[""]` (as recommended in the documentation), the model often extracts sounds that are completely unrelated to the visual object in the mask. For example, when masking a person speaking, the model might extract background music or environmental sounds instead of the person's voice.
Request for Help
We've been working on this issue for some time and have tried various approaches, but the performance gap persists. We would greatly appreciate any guidance or insights from the maintainers. If there are any undocumented configuration requirements, preprocessing steps, or known limitations that could explain these discrepancies, please let us know. Thank you for your time and for maintaining this excellent project!
Additional Information
- I've verified that the SAM3 masks are being generated correctly and match the expected format
- The video files and audio inputs are the same as those used in the official examples
- Model loading completes without errors or warnings
- All dependencies are installed correctly
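For reference, the mask verification mentioned above amounts to checks like the following. This is a minimal sketch assuming SAM3 produces hard 2-D binary (0/1) masks, as stated above; `check_mask` is a helper written for this report, not part of any library:

```python
import numpy as np

def check_mask(mask: np.ndarray) -> None:
    """Sanity-check a SAM3 mask before passing it to SAM-Audio:
    it should be 2-D, strictly binary, and non-empty."""
    assert mask.ndim == 2, f"expected a 2-D mask, got shape {mask.shape}"
    values = np.unique(mask)
    assert set(values.tolist()) <= {0, 1}, f"non-binary values: {values}"
    assert mask.any(), "mask is empty -- no target region selected"

# Example with a synthetic 4x4 mask covering a 2x2 region.
m = np.zeros((4, 4), dtype=np.uint8)
m[1:3, 1:3] = 1
check_mask(m)  # passes without raising
```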