Hi authors,
Thank you for your excellent work and valuable contribution to the community. I have a question regarding the reproducibility of the DeepConf results.
When I attempted to reproduce the method using Qwen3-VL-Instruct 8B and Qwen3-VL-Thinking 8B for pure-text reasoning on AIME24 and AIME25, I observed that confidence-based voting is quite unstable and, in many cases, performs even worse than simple majority voting.
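For reference, my voting step follows this minimal sketch (this is my own reproduction, not necessarily the authors' exact implementation): each trace produces a final answer plus a scalar confidence score (e.g., a mean token confidence), majority voting counts answers, and confidence-weighted voting sums the scores per answer. The example traces below are hypothetical, chosen to show how one overconfident trace can flip the weighted result:

```python
from collections import Counter, defaultdict

def majority_vote(traces):
    # traces: list of (answer, confidence) pairs; confidence is ignored here
    counts = Counter(answer for answer, _ in traces)
    return counts.most_common(1)[0][0]

def confidence_weighted_vote(traces):
    # accumulate each trace's confidence score onto its answer
    weights = defaultdict(float)
    for answer, conf in traces:
        weights[answer] += conf
    return max(weights, key=weights.get)

# hypothetical example: two low-confidence traces agree on "42",
# one overconfident outlier says "7"
traces = [("42", 0.3), ("42", 0.3), ("7", 0.9)]
print(majority_vote(traces))             # "42"
print(confidence_weighted_vote(traces))  # "7" -- the outlier wins
```

This is the kind of divergence I see on the Qwen3-VL models: a few overconfident but wrong traces dominate the weighted vote.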
This leads me to wonder: does the effectiveness of DeepConf depend on the choice of model?
Have the authors conducted similar discussions or analyses regarding the relationship between model selection and the stability of confidence-based voting?
Thanks again for your great work, and I look forward to your insights.