When using local SAM-Audio models (`sam-audio-large` and `sam-audio-large-tv`), the separation results are consistently worse than those shown on the official demo page (https://aidemos.meta.com/segment-anything/editor/segment-audio). The issues manifest in two main areas:

- **General Performance Gap:** The separation quality from the local models is consistently worse than the official demo page, even when using the official checkpoints and following the official examples.
- **Visual Prompting Performance:** When using visual prompting without text descriptions (setting `descriptions=[""]` as recommended in the documentation), the separation results are particularly poor, often failing to isolate the target sound effectively.
Environment

- Model versions tested:
  - `sam-audio-large` (from hub)
  - `sam-audio-large-tv` (from hub)
- SAM3 version: latest from `git+https://github.com/facebookresearch/sam3.git`
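Since version drift between the demo and a local install is one possible cause of the gap, it may help to attach exact package versions to this report. A minimal way to capture them (the `grep` pattern is a guess; adjust it to whatever `pip list` actually shows for your install):

```shell
# Record the interpreter version and any SAM/torch-related packages.
# Package names here are assumptions, not confirmed distribution names.
python3 -c "import platform; print('python', platform.python_version())"
python3 -m pip freeze | grep -i -E 'sam|torch' || echo "no matching packages found"
```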
I've tried the following approaches to improve the results, but none have resolved the performance gap:

- Parameter Configuration:
  - ✅ Enabled `predict_spans=True`
  - ✅ Set `reranking_candidates=8`
  - ✅ Used empty string `descriptions=[""]` for visual prompting
- Model Versions:
  - ✅ Tested both `sam-audio-large` and `sam-audio-large-tv`
  - ✅ Verified checkpoint files are complete and properly loaded
  - ✅ Confirmed model weights are loaded with `strict=True`
- Mask Quality:
  - ✅ Confirmed mask values are correct (binary masks from SAM3)
- Text Prompting Comparison:
  - When using text prompting alone (without visual masks), results are better but still do not match the official demos
  - The combination of visual + text prompting shows some improvement, but still falls short
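For concreteness, here is a minimal sketch of how the configuration above is assembled. The function and class names are illustrative placeholders, not the actual SAM-Audio API; only the parameter names and values come from the setup described above:

```python
# Illustrative sketch only: the final `model.separate(...)` call shown in
# the comment is a placeholder, not the real SAM-Audio entry point.
# The parameter values are the ones used in the experiments above.
SEPARATION_KWARGS = dict(
    descriptions=[""],       # empty text prompt => visual prompting only
    predict_spans=True,      # span prediction enabled
    reranking_candidates=8,  # rerank among 8 candidates
)

def make_config(model_name: str) -> dict:
    """Bundle a model name with the shared separation parameters."""
    return {"model": model_name, **SEPARATION_KWARGS}

configs = [make_config(m) for m in ("sam-audio-large", "sam-audio-large-tv")]
# Hypothetical call shape: model.separate(audio, masks=sam3_masks, **SEPARATION_KWARGS)
```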
Specific Issues Observed

- **Gender Voice Separation Failure:** On the official model hub demos, the model successfully separates male and female voices. However, when using local models for inference, male and female voices are consistently mixed together, even when the visual object is clearly an isolated male person. This suggests the visual prompting is not effectively guiding the audio separation.
- **Irrelevant Sound Extraction with Empty Descriptions:** When using visual prompting with `descriptions=[""]` (as recommended in the documentation), the model often extracts sounds that are completely unrelated to the visual object in the mask. For example, when masking a person speaking, the model might extract background music or environmental sounds instead of the person's voice.
Request for Help
We've been working on this issue for some time and have tried various approaches, but the performance gap persists. We would greatly appreciate any guidance or insights from the maintainers. If there are any undocumented configuration requirements, preprocessing steps, or known limitations that could explain these discrepancies, please let us know. Thank you for your time and for maintaining this excellent project!
Additional Information
- I've verified that the SAM3 masks are being generated correctly and match the expected format
- The video files and audio inputs are the same as those used in the official examples
- Model loading completes without errors or warnings
- All dependencies are installed correctly
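For reference, the mask verification mentioned above amounts to checks like the following. This is a minimal sketch assuming SAM3 produces hard 2-D binary (0/1) masks, as stated above; `check_mask` is a helper written for this report, not part of any library:

```python
import numpy as np

def check_mask(mask: np.ndarray) -> None:
    """Sanity-check a SAM3 mask before passing it to SAM-Audio:
    it should be 2-D, strictly binary, and non-empty."""
    assert mask.ndim == 2, f"expected a 2-D mask, got shape {mask.shape}"
    values = np.unique(mask)
    assert set(values.tolist()) <= {0, 1}, f"non-binary values: {values}"
    assert mask.any(), "mask is empty -- no target region selected"

# Example with a synthetic 4x4 mask covering a 2x2 region.
m = np.zeros((4, 4), dtype=np.uint8)
m[1:3, 1:3] = 1
check_mask(m)  # passes without raising
```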