Benchmark Result Mismatch #79

I ran the MMLU benchmark on `Qwen/Qwen3.5-4B` and got an overall score of 58.71, but the leaderboard shows 79.1. What am I missing? Why is there such a large gap between the results?
`python evaluate_from_local.py --model "Qwen/Qwen3.5-4B"`
GPU: A100
I also tried the llm-eval library and got the same result.

```
------category level sta------
Average accuracy 0.8006 - biology
Average accuracy 0.5919 - business
Average accuracy 0.5936 - chemistry
Average accuracy 0.6659 - computer science
Average accuracy 0.7026 - economics
Average accuracy 0.4272 - engineering
Average accuracy 0.6296 - health
Average accuracy 0.4751 - history
Average accuracy 0.3569 - law
Average accuracy 0.6736 - math
Average accuracy 0.5346 - other
Average accuracy 0.5030 - philosophy
Average accuracy 0.6005 - physics
Average accuracy 0.6855 - psychology

------average acc sta------
Average accuracy: 0.5871
```
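
As a sanity check on the numbers above, here is a minimal sketch that recomputes the unweighted mean of the per-category accuracies and compares it with the reported overall figure. The dictionary is copied by hand from the output; the names `category_acc` and `macro_average` are made up for this illustration and are not part of `evaluate_from_local.py`. It is only an assumption that the script weights its overall figure by the number of questions per category.

```python
# Per-category accuracies copied from the output above.
category_acc = {
    "biology": 0.8006,
    "business": 0.5919,
    "chemistry": 0.5936,
    "computer science": 0.6659,
    "economics": 0.7026,
    "engineering": 0.4272,
    "health": 0.6296,
    "history": 0.4751,
    "law": 0.3569,
    "math": 0.6736,
    "other": 0.5346,
    "philosophy": 0.5030,
    "physics": 0.6005,
    "psychology": 0.6855,
}

reported_overall = 0.5871   # "Average accuracy" line printed by the script
leaderboard_score = 0.791   # leaderboard figure (79.1)


def macro_average(acc_by_category):
    """Unweighted mean: every category counts equally."""
    return sum(acc_by_category.values()) / len(acc_by_category)


macro = macro_average(category_acc)
print(f"unweighted category mean: {macro:.4f}")
print(f"reported overall:         {reported_overall:.4f}")
print(f"gap to leaderboard:       {leaderboard_score - reported_overall:.4f}")
```

The unweighted mean lands within a fraction of a point of the reported overall score, so the aggregation itself does not look like the source of the roughly 20-point gap to the leaderboard.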
Any help would be appreciated 👍 @Wyyyb @lavdnone2
