Cannot reproduce official model hub results with local models (large/tv_large) #82

@Whale36377

Description

When using the local SAM-Audio models (sam-audio-large and sam-audio-large-tv), I consistently get worse separation results than those shown on the official demo page (https://aidemos.meta.com/segment-anything/editor/segment-audio). The issues manifest in two main areas:

  1. General Performance Gap: The separation quality from the local models is consistently worse than on the official demo page, even when using the official checkpoints and following the official examples.

  2. Visual Prompting Performance: When using visual prompting without text descriptions (setting descriptions=[""] as recommended in the documentation), the separation results are particularly poor, often failing to isolate the target sound at all. A sketch of this call follows below.
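For concreteness, the visual-only prompting setup looks roughly like the sketch below. This is a minimal sketch, not the confirmed package API: the `sam_audio` import path, `SamAudio` class, `from_pretrained` loader, and `separate()` signature are placeholders for illustration; only the model IDs and the `descriptions`, `predict_spans`, and `reranking_candidates` parameters come from the documentation and this report.

```python
import torch

# Minimal sketch of the visual-only prompting setup. The sam_audio import
# path, SamAudio class, and separate() signature are placeholders, NOT the
# confirmed package API; only the model IDs and the descriptions /
# predict_spans / reranking_candidates parameters come from the docs.
from sam_audio import SamAudio  # assumed import path

model = SamAudio.from_pretrained("sam-audio-large").eval()

mixture = torch.randn(1, 16000 * 10)      # placeholder: 10 s mono mixture
sam3_mask = torch.zeros(1, 480, 640)      # placeholder: binary mask from SAM3

with torch.inference_mode():
    separated = model.separate(
        audio=mixture,
        masks=[sam3_mask],
        descriptions=[""],        # empty string for visual-only prompting, per the docs
        predict_spans=True,
        reranking_candidates=8,
    )
```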

Environment

  • Model versions tested:
    • sam-audio-large (from hub)
    • sam-audio-large-tv (from hub)
  • SAM3 version: Latest from git+https://github.com/facebookresearch/sam3.git

I've tried the following approaches to improve the results, but none have resolved the performance gap:

  1. Parameter Configuration:

    • ✅ Enabled predict_spans=True
    • ✅ Set reranking_candidates=8
    • ✅ Used empty string descriptions=[""] for visual prompting
  2. Model Versions:

    • ✅ Tested both sam-audio-large and sam-audio-large-tv
    • ✅ Verified checkpoint files are complete and properly loaded
    • ✅ Confirmed model weights are loaded with strict=True (see the loading sketch after this list)
  3. Mask Quality:

    • ✅ Confirmed mask values are correct (binary masks from SAM3; see the mask check after this list)
  4. Text Prompting Comparison:

    • When using text prompting alone (without visual masks), results are better but still not matching the official demos
    • The combination of visual + text prompting shows some improvement, but still falls short
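To rule out a silent weight mismatch (item 2 above), the checkpoints were loaded with strict key matching. Here is a minimal sketch of that check, assuming a standard PyTorch checkpoint; `build_model` and the checkpoint path are placeholders, while `torch.load` and `load_state_dict(strict=True)` are standard PyTorch calls:

```python
import torch

# Minimal sketch of the strict-loading check from item 2. build_model()
# and ckpt_path are placeholders, not part of the sam3 package.
def load_and_verify(build_model, ckpt_path: str) -> torch.nn.Module:
    model = build_model()
    state = torch.load(ckpt_path, map_location="cpu")
    # Some checkpoints nest the weights under a wrapper key.
    if isinstance(state, dict) and "state_dict" in state:
        state = state["state_dict"]
    # strict=True raises RuntimeError on any missing or unexpected key,
    # so returning normally means every parameter matched.
    model.load_state_dict(state, strict=True)
    return model
```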
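And the mask sanity check from item 3, a small sketch assuming SAM3 masks follow the usual 0/1 convention:

```python
import torch

# Sanity check used for item 3: confirms the SAM3 mask is binary and
# non-empty before it is handed to the audio model. The 0/1 convention
# matches what this report describes.
def check_mask(mask: torch.Tensor) -> None:
    values = torch.unique(mask)
    assert set(values.tolist()) <= {0.0, 1.0}, f"non-binary mask values: {values}"
    assert mask.any(), "mask is empty; nothing is selected"
```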

Specific Issues Observed

  1. Gender Voice Separation Failure: In the official model hub demos, the model successfully separates male and female voices. With the local models, however, male and female voices are consistently mixed together, even when the mask clearly isolates a single male speaker. This suggests the visual prompt is not effectively guiding the audio separation.

  2. Irrelevant Sound Extraction with Empty Descriptions: When using visual prompting with descriptions=[""] (as recommended in the documentation), the model often extracts sounds that are completely unrelated to the visual object in the mask. For example, when masking a person speaking, the model might extract background music or environmental sounds instead of the person's voice.

Request for Help

I've been working on this issue for some time and have tried various approaches, but the performance gap persists. I would greatly appreciate any guidance or insights from the maintainers. If there are any undocumented configuration requirements, preprocessing steps, or known limitations that could explain these discrepancies, please let me know. Thank you for your time and for maintaining this excellent project!

Additional Information

  • I've verified that the SAM3 masks are being generated correctly and match the expected format
  • The video files and audio inputs are the same as those used in the official examples
  • Model loading completes without errors or warnings
  • All dependencies are installed correctly
