ParsiEval: A Benchmark for Persian Language Understanding

This project introduces ParsiEval, a comprehensive benchmark designed to evaluate the performance of Large Language Models (LLMs) on Persian language tasks. The primary goal of ParsiEval is to provide a standardized and challenging testbed for assessing the capabilities of LLMs in understanding and reasoning in Persian.

Dataset

The ParsiEval dataset consists of 364 multiple-choice questions that span a wide variety of topics, including:

History
Literature
General Knowledge
Science

Each question is designed to test a model's ability to comprehend context, recall facts, and make logical inferences in Persian.

Goal and Motivation

The development of high-quality benchmarks for languages other than English is crucial for advancing the field of multilingual NLP. ParsiEval aims to fill this gap for the Persian language by providing a robust evaluation suite that can be used to:

Track the progress of Persian language models over time.
Identify the strengths and weaknesses of different model architectures.
Drive research and development in Persian language technology.

Results

Here are the results of the evaluation for different models:

Rank	Model	Size	License	API Provider	Avg Latency	Total Latency	Accuracy
1	gemini-2.5-pro	unknown	proprietary	openrouter	8.87s	3230.16s	84.62%
2	gpt-5-2025-08-07	unknown	apache-2.0	openai	6.08s	2297.52s	82.01%
3	grok-4	unknown	proprietary	openrouter	11.59s	1599.73s	81.88%
4	gpt-4o	unknown	proprietary	openai	0.90s	326.08s	79.12%
5	gpt-5-mini	unknown	proprietary	openrouter	1.98s	721.03s	77.75%
6	kimi-k2-instruct	1000b	modified-mit	cerebras	1.37s	497.76s	77.20%
7	llama-4-maverick-17b-128e-instruct	17b	open-weight-llama4	openrouter	1.79s	651.43s	75.27%
8	deepseek-v3-0324	685	mit	openrouter	2.72s	991.41s	74.73%
9	llama-3.3-70b	70b	open-weight-llama3.3	cerebras	1.75s	638.42s	73.35%
10	llama-4-scout-17b-16e-instruct	17b	open-weight-llama4	cerebras	0.68s	247.32s	71.43%
11	qwen-3-235b-a22b-instruct-2507	235b	apache-2.0	cerebras	0.70s	253.16s	70.60%
12	gpt-oss-120b	120b	apache-2.0	cerebras	2.84s	1034.47s	70.05%
13	gemma-3-27b-it	27b	open-weight-gemma	openrouter	1.13s	409.72s	67.03%
14	qwen3-30b-a3b-thinking	30b	apache-2.0	openrouter	10.29s	3744.42s	65.66%
15	gpt-oss-20b	20b	apache-2.0	cerebras	2.73s	994.36s	63.74%
16	gpt-4.1-nano	unknown	proprietary	openai	1.05s	382.53s	63.74%
17	gemma2-9b-it	9b	open-weight-gemma	groq	2.24s	815.02s	59.62%
18	mistral-small-2503	24	apache-2.0	openrouter	2.60s	948.16s	59.34%
19	qwen-3-32b	32b	apache-2.0	groq	1.86s	677.18s	59.07%
20	qwen3-30b-a3b-instruct-2507	30b	apache-2.0	openrouter	0.74s	267.75s	57.69%
21	google/gemma-3n-e4b	4b	open-weight-gemma	lmstudio	0.17s	63.68s	54.55%
22	qwen/qwen3-4b-thinking-2507	4b	apache-2.0	lmstudio	45.23s	13569.00s	51.97%
23	gemma3:4b	4B	open-weight-gemma	ollama	0.13s	46.63s	43.13%
24	qwen3-4b-instruct-2507	4b	apache-2.0	lmstudio	0.23s	83.35s	42.86%
25	llama3.2:3b	3B	open-weight-llama3	ollama	0.13s	46.11s	34.34%
26	qwen2.5:1.5b	1.5B	apache-2.0	ollama	0.07s	24.79s	32.14%
27	phi4-mini	1.3B	mit	ollama	0.24s	88.75s	29.67%
28	smollm3-3b	3b	apache-2.0	lmstudio	0.26s	94.60s	27.20%
29	gemma3:270m	270M	open-weight-gemma	ollama	0.16s	58.82s	25.00%
30	liquid/lfm2-1.2b	1.2b	open-weight-lfm1.0	lmstudio	0.67s	243.90s	25.00%
31	gemma3:1b	1B	open-weight-gemma	ollama	0.19s	70.21s	23.08%
32	gemma-3-270m-it	270m	open-weight-gemma	lmstudio	0.05s	17.19s	21.15%
33	qwen2.5:0.5b	0.5B	apache-2.0	ollama	1.00s	365.51s	19.78%
34	llama3.2:1b	1B	open-weight-llama3.2	ollama	0.17s	63.05s	17.03%
35	gemma2:2b	2B	open-weight-gemma	ollama	0.81s	295.59s	15.93%

Accuracy

Top Models

Analysis of the highest performing models

Local Models

Examination of models running on local hardware

Future Plan

Dataset Expansion

Increase Question Volume: Expand from 364 to 1000+ questions across all domains
Add New Domains: Include categories such as:
Academic: Educational and scholarly content
Business: Commercial and economic topics
Food: Culinary knowledge and food culture
History: Historical events and figures
Literature: Persian literature and literary analysis
Medical: Healthcare and medical knowledge
Politics: Political science and current affairs
Social Media: Digital communication and social platforms
Technology: Technical and scientific concepts
Arts and Culture: Cultural heritage and art forms
Sports and Recreation: Physical activities and leisure pursuits
Geography and Environment: Landscapes, climates, and natural resources
Religion and Philosophy: Belief systems and philosophical viewpoints
Law and Legal Studies: Legal systems, principles, and practices

Technical Enhancements

Leaderboard System: Create a public leaderboard for tracking model performance over time
Multi-modal Extensions: Explore incorporation of visual elements for comprehensive language understanding

Research Directions

Academic Publication: Publish research findings and benchmark results to arXiv for broader scientific community access
Cross-lingual Analysis: Compare Persian model performance with other languages
Fine-tuning Studies: Investigate optimal fine-tuning strategies for Persian language models
Cultural Adaptation: Study the impact of cultural context on model performance

Community Engagement

Open Source Contributions: Encourage community contributions for question generation and validation
Academic Partnerships: Collaborate with Persian language and NLP research institutions

Long-term Vision

Standardization: Establish ParsiEval as the de facto standard for Persian language model evaluation
International Recognition: Promote adoption in international NLP conferences and competitions
Real-world Applications: Bridge the gap between benchmark performance and practical Persian language applications
Continuous Evolution: Maintain relevance through regular updates reflecting current language usage and cultural developments

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
parsi-eval-2		parsi-eval-2
plots		plots
results		results
.gitignore		.gitignore
README.md		README.md
create_visuals.py		create_visuals.py
parsi-eval-1.csv		parsi-eval-1.csv
run_eval.py		run_eval.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ParsiEval: A Benchmark for Persian Language Understanding

Dataset

Goal and Motivation

Results

Accuracy

Top Models

Local Models

Future Plan

Dataset Expansion

Technical Enhancements

Research Directions

Community Engagement

Long-term Vision

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ParsiEval: A Benchmark for Persian Language Understanding

Dataset

Goal and Motivation

Results

Accuracy

Top Models

Local Models

Future Plan

Dataset Expansion

Technical Enhancements

Research Directions

Community Engagement

Long-term Vision

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages