The Choices of Reasoning Models? #11

@HBinLi

Description

Hi authors,

Thank you for your excellent work and valuable contribution to the community. I have a question regarding the reproducibility of the DeepConf results.

When I attempted to reproduce the method with Qwen3-VL-Instruct 8B and Qwen3-VL-Thinking 8B on pure-text reasoning (AIME24 and AIME25), I found that confidence-based voting is quite unstable; in many cases it performs even worse than simple majority voting.
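To make the comparison concrete, here is a minimal sketch of the two voting schemes being compared. The function names and the per-trace confidence values are illustrative only; DeepConf derives its confidences from token-level log-probabilities rather than the hand-picked numbers used here. The toy example shows how a single overconfident wrong trace can flip a confidence-weighted vote that plain majority voting gets right, which is one way the instability can surface:

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Plain majority voting: the most frequent final answer wins."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(answers, confidences):
    """Confidence-weighted voting: each trace's answer is weighted by
    its confidence score (illustrative; the real scores would come
    from the model's token log-probabilities)."""
    weights = defaultdict(float)
    for ans, conf in zip(answers, confidences):
        weights[ans] += conf
    return max(weights, key=weights.get)


# Toy example: two low-confidence correct traces vs. one
# overconfident wrong trace.
answers = ["42", "42", "7"]
confidences = [0.3, 0.3, 0.9]

print(majority_vote(answers))                           # -> 42
print(confidence_weighted_vote(answers, confidences))   # -> 7
```

In this toy case the weighted vote is flipped to the wrong answer because the single wrong trace carries more total weight (0.9) than the two correct ones combined (0.6), so miscalibrated confidences can make the weighted scheme underperform simple majority voting.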

This leads me to wonder: does the effectiveness of DeepConf depend on the choice of model?
Have you analyzed the relationship between model selection and the stability of confidence-based voting?

Thanks again for your great work, and I look forward to your insights.
