Benchmark Result Mismatch #79

@Linux-Server

I ran the MMLU benchmark on Qwen/Qwen3.5-4B and got an overall score of 58.71, but the leaderboard shows 79.1. What am I missing? Why is there such a large gap between the results?
Command: `python evaluate_from_local.py --model "Qwen/Qwen3.5-4B"`
GPU: A100
I also tried the llm-eval library and got the same result.

```
------category level sta------
Average accuracy 0.8006 - biology
Average accuracy 0.5919 - business
Average accuracy 0.5936 - chemistry
Average accuracy 0.6659 - computer science
Average accuracy 0.7026 - economics
Average accuracy 0.4272 - engineering
Average accuracy 0.6296 - health
Average accuracy 0.4751 - history
Average accuracy 0.3569 - law
Average accuracy 0.6736 - math
Average accuracy 0.5346 - other
Average accuracy 0.5030 - philosophy
Average accuracy 0.6005 - physics
Average accuracy 0.6855 - psychology

------average acc sta------
Average accuracy: 0.5871
```
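As a sanity check on the numbers above, here is a minimal sketch that parses the log, takes the unweighted mean of the 14 category scores, and measures the gap to the leaderboard figure. The variable names and regular expressions are illustrative only and are not part of `evaluate_from_local.py`. The reading of the reported overall score as weighted by question count is an assumption, not something confirmed from the script.

```python
import re

# Log output copied from the run above.
LOG = """\
Average accuracy 0.8006 - biology
Average accuracy 0.5919 - business
Average accuracy 0.5936 - chemistry
Average accuracy 0.6659 - computer science
Average accuracy 0.7026 - economics
Average accuracy 0.4272 - engineering
Average accuracy 0.6296 - health
Average accuracy 0.4751 - history
Average accuracy 0.3569 - law
Average accuracy 0.6736 - math
Average accuracy 0.5346 - other
Average accuracy 0.5030 - philosophy
Average accuracy 0.6005 - physics
Average accuracy 0.6855 - psychology
Average accuracy: 0.5871
"""

# Category lines have the form "Average accuracy <score> - <category>".
CATEGORY_RE = re.compile(r"^Average accuracy (\d\.\d+) - (.+)$", re.MULTILINE)
# The overall line has a colon instead: "Average accuracy: <score>".
OVERALL_RE = re.compile(r"^Average accuracy: (\d\.\d+)$", re.MULTILINE)

per_category = {
    name.strip(): float(score) for score, name in CATEGORY_RE.findall(LOG)
}
overall = float(OVERALL_RE.search(LOG).group(1))

# Unweighted (macro) mean: every category counts equally.
macro_mean = sum(per_category.values()) / len(per_category)

# Leaderboard figure quoted in this issue, in percentage points.
LEADERBOARD_SCORE = 79.1
gap_points = LEADERBOARD_SCORE - overall * 100

print(f"categories parsed: {len(per_category)}")
print(f"macro mean:        {macro_mean:.4f}")
print(f"reported overall:  {overall:.4f}")
print(f"gap to leaderboard: {gap_points:.2f} points")
```

The macro mean comes out at about 0.5886, close to the reported 0.5871. The small difference is what you would expect if the overall score is weighted by the number of questions per category (an assumption). Either way, the roughly 20-point gap to the leaderboard is not an averaging artifact.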
Any help would be appreciated 👍 @Wyyyb @lavdnone2
