Hi authors,
Thank you for your excellent work and valuable contribution to the community. I have a question regarding the reproducibility of the DeepConf results.
When I attempted to reproduce the method using Qwen3-VL-Instruct 8B and Qwen3-VL-Thinking 8B for pure-text reasoning on AIME24 and AIME25, I observed that confidence-based voting is quite unstable and, in many cases, performs even worse than simple majority voting.
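For reference, my voting step follows this minimal sketch (this is my own reproduction, not necessarily the authors' exact implementation): each trace produces a final answer plus a scalar confidence score (e.g., a mean token confidence), majority voting counts answers, and confidence-weighted voting sums the scores per answer. The example traces below are hypothetical, chosen to show how one overconfident trace can flip the weighted result:

```python
from collections import Counter, defaultdict

def majority_vote(traces):
    # traces: list of (answer, confidence) pairs; confidence is ignored here
    counts = Counter(answer for answer, _ in traces)
    return counts.most_common(1)[0][0]

def confidence_weighted_vote(traces):
    # accumulate each trace's confidence score onto its answer
    weights = defaultdict(float)
    for answer, conf in traces:
        weights[answer] += conf
    return max(weights, key=weights.get)

# hypothetical example: two low-confidence traces agree on "42",
# one overconfident outlier says "7"
traces = [("42", 0.3), ("42", 0.3), ("7", 0.9)]
print(majority_vote(traces))             # "42"
print(confidence_weighted_vote(traces))  # "7" -- the outlier wins
```

This is the kind of divergence I see on the Qwen3-VL models: a few overconfident but wrong traces dominate the weighted vote.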
This leads me to wonder: does the effectiveness of DeepConf depend on the choice of model?
Have the authors conducted similar discussions or analyses regarding the relationship between model selection and the stability of confidence-based voting?
Thanks again for your great work, and I look forward to your insights.