I ran the MMLU benchmark on Qwen/Qwen3.5-4B and got an overall score of 58.71, but the leaderboard shows 79.1. What am I missing? Why is there such a large gap between the two results?
python evaluate_from_local.py --model "Qwen/Qwen3.5-4B"
GPU: A100
I also tried the llm-eval lib and got the same result.
`
------category level sta------
Average accuracy 0.8006 - biology
Average accuracy 0.5919 - business
Average accuracy 0.5936 - chemistry
Average accuracy 0.6659 - computer science
Average accuracy 0.7026 - economics
Average accuracy 0.4272 - engineering
Average accuracy 0.6296 - health
Average accuracy 0.4751 - history
Average accuracy 0.3569 - law
Average accuracy 0.6736 - math
Average accuracy 0.5346 - other
Average accuracy 0.5030 - philosophy
Average accuracy 0.6005 - physics
Average accuracy 0.6855 - psychology
------average acc sta------
Average accuracy: 0.5871
`
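As a quick sanity check on the numbers above, here is a minimal sketch that recomputes the unweighted mean of the per-category accuracies and the gap to the leaderboard figure. The variable names are my own and the values are copied from the output above. The unweighted mean comes out around 0.5886, slightly different from the reported 0.5871. I assume this is because the script weights each category by its number of questions, but I have not verified that.

```python
# Per-category accuracies copied from the evaluation output above.
category_acc = {
    "biology": 0.8006,
    "business": 0.5919,
    "chemistry": 0.5936,
    "computer science": 0.6659,
    "economics": 0.7026,
    "engineering": 0.4272,
    "health": 0.6296,
    "history": 0.4751,
    "law": 0.3569,
    "math": 0.6736,
    "other": 0.5346,
    "philosophy": 0.5030,
    "physics": 0.6005,
    "psychology": 0.6855,
}

# Unweighted (macro) mean: every category counts equally.
macro_avg = sum(category_acc.values()) / len(category_acc)

# Overall accuracy printed by the script and the leaderboard score quoted above.
reported_overall = 0.5871
leaderboard_score = 0.791

# Gap between the leaderboard and my run, in percentage points.
gap_points = (leaderboard_score - reported_overall) * 100

print(f"Unweighted mean of categories: {macro_avg:.4f}")
print(f"Reported overall accuracy:     {reported_overall:.4f}")
print(f"Gap to leaderboard:            {gap_points:.2f} points")
```

Either way of averaging leaves a gap of roughly 20 points, so the averaging method alone does not explain the difference.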
Any help would be appreciated 👍 @Wyyyb @lavdnone2