UrduReason-Eval is an evaluation dataset for assessing reasoning capabilities in Urdu language models. The dataset contains 800 carefully curated reasoning problems across multiple categories, designed to test both linguistic understanding and mathematical reasoning in Urdu.
- Total Samples: 800
- Language: Urdu (اردو)
- Primary Focus: Multi-step reasoning, arithmetic, and logical deduction
- Intended Use: Evaluation only (not for training)
The dataset is organized by reasoning categories:
| Category | Count | Description |
|---|---|---|
| arithmetic_reasoning | 236 | Mathematical word problems requiring multi-step calculations |
| logical_deduction | 205 | Problems requiring logical inference and constraint satisfaction |
| temporal_reasoning | 154 | Time-based problems involving dates, schedules, and durations |
| comparative_reasoning | 148 | Comparative problems involving relationships and ordering |
| commonsense_causal_reasoning | 50 | Causal reasoning about everyday physical and chemical processes |
| causal_reasoning | 7 | Abstract causal reasoning problems |
Difficulty distribution:
- Hard: 567 samples (70.9%)
- Medium: 233 samples (29.1%)

Answer-type distribution:
- short_text: 541 samples (67.6%)
- integer: 251 samples (31.4%)
- float: 8 samples (1.0%)
Urdu is a low-resource language with over 230 million speakers worldwide, yet evaluation benchmarks for Urdu language models remain limited. UrduReason-Eval addresses this gap by providing a comprehensive evaluation dataset that:
- Tests Multi-step Reasoning: Problems require combining multiple pieces of information and performing sequential reasoning steps
- Evaluates Language Understanding: Questions are written in natural Urdu, testing both comprehension and reasoning
- Covers Diverse Reasoning Types: From arithmetic to logical deduction to temporal reasoning
- Provides Standardized Evaluation: Consistent format and answer normalization for reliable model comparison
This dataset is intended for researchers and developers working on:
- Urdu language models
- Low-resource language evaluation
- Multilingual reasoning benchmarks
- Cross-lingual transfer learning
The dataset was created through an LLM-assisted generation process:
- Initial Generation: Questions were generated with assistance from large general-purpose language models
- Manual Review: All samples underwent human review for:
  - Correctness of questions and answers
  - Clarity and naturalness of the Urdu language
  - Appropriateness of difficulty level
  - Elimination of ambiguous or underdetermined problems
- Normalization: All numeric answers were normalized to Western numerals (0-9) for consistent evaluation
- Validation: Mathematical correctness was verified, and format inconsistencies were corrected
Quality-assurance steps included:
- All arithmetic problems were manually verified for correctness
- Answers were normalized to ensure consistent evaluation
- Ambiguous questions with multiple valid answers were removed or corrected
- Format violations (e.g., mixed numeral systems) were fixed
- Duplicate entries were eliminated
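The numeral normalization described above can be sketched as follows. This is an illustrative helper, not part of the dataset tooling; it maps Urdu (Extended Arabic-Indic) and Arabic-Indic digits to Western numerals:

```python
# Map Urdu (Extended Arabic-Indic, U+06F0-U+06F9) and Arabic-Indic
# (U+0660-U+0669) digits to Western numerals 0-9.
URDU_TO_WESTERN = str.maketrans("۰۱۲۳۴۵۶۷۸۹" + "٠١٢٣٤٥٦٧٨٩", "0123456789" * 2)

def normalize_numerals(text: str) -> str:
    """Rewrite any Urdu/Arabic-Indic digits in text as Western digits."""
    return text.translate(URDU_TO_WESTERN)
```

Non-digit characters, including Urdu letters, pass through unchanged.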
- Total Entries: 800
- Average Question Length: ~150-200 Urdu words
- Answer Format: Normalized (Western numerals for numeric answers)
- Coverage: 6 reasoning categories, 2 difficulty levels
Intended uses:
- Evaluation: Assessing reasoning capabilities of Urdu language models
- Benchmarking: Comparing different models on standardized reasoning tasks
- Research: Studying reasoning in low-resource languages
- Development: Testing model improvements on Urdu reasoning tasks

Out-of-scope uses:
- Training: This dataset is designed for evaluation, not training
- Fine-tuning: Using this dataset for fine-tuning may lead to data leakage in evaluation
- Commercial Applications: See License section for usage restrictions
Each entry contains the following fields:
- `id`: Unique identifier (e.g., "ur_reason_001")
- `category`: Reasoning category (see Dataset Splits section)
- `question`: The reasoning problem in Urdu
- `gold_answer`: The correct answer (normalized format)
- `answer_type`: Type of answer ("integer", "float", or "short_text")
- `difficulty`: Difficulty level ("hard" or "medium")
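For illustration, a record following this schema might look like the following; all values except the field names and formats are hypothetical:

```python
# Hypothetical entry illustrating the dataset schema
example_entry = {
    "id": "ur_reason_001",
    "category": "arithmetic_reasoning",
    "question": "...",            # an Urdu word problem (omitted here)
    "gold_answer": "42",          # normalized to Western numerals
    "answer_type": "integer",
    "difficulty": "hard",
}
```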
To use this dataset, install the required dependencies:
```bash
pip install -r requirements.txt
```

Or install directly:

```bash
pip install datasets pandas
```

```python
from datasets import load_dataset

# Load the dataset (assuming the default "train" split)
dataset = load_dataset("HaseebAsif/UrduReason-Eval", split="train")

# Access by category
arithmetic = dataset.filter(lambda x: x["category"] == "arithmetic_reasoning")

# Example entry
example = dataset[0]
print(f"Question: {example['question']}")
print(f"Answer: {example['gold_answer']}")
print(f"Category: {example['category']}")
print(f"Difficulty: {example['difficulty']}")
```

For evaluation, compare model outputs against the `gold_answer` field. For numeric answers, ensure outputs are normalized to Western numerals. For text answers, exact match or semantic similarity metrics may be used.
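A minimal exact-match scorer along these lines might look as follows; the function names are illustrative, not part of the dataset's tooling, and normalization here covers only digit mapping, whitespace, and case:

```python
# Map Urdu (Extended Arabic-Indic) digits to Western numerals
URDU_TO_WESTERN = str.maketrans("۰۱۲۳۴۵۶۷۸۹", "0123456789")

def normalize(ans: str) -> str:
    # Trim whitespace, lowercase, and rewrite Urdu digits as Western digits
    return ans.strip().lower().translate(URDU_TO_WESTERN)

def exact_match(predictions, references):
    """Fraction of predictions that exactly match gold answers after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

For example, `exact_match(["۴۲"], ["42"])` scores the Urdu-numeral prediction as correct against the normalized gold answer.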
This dataset is archived on Zenodo and available at: https://doi.org/10.5281/zenodo.18260252
If you use this dataset in your research, please cite:
```bibtex
@dataset{urdureason_eval_2026,
  title  = {UrduReason-Eval: Urdu Reasoning Evaluation Dataset},
  author = {Asif, Haseeb},
  year   = {2026},
  doi    = {10.5281/zenodo.18260252},
  url    = {https://huggingface.co/datasets/HaseebAsif/UrduReason-Eval}
}
```

License: See LICENSE file.
This dataset was created to support research in Urdu language understanding and reasoning. We thank the community for feedback and contributions.
For questions, issues, or contributions, please open an issue on the Hugging Face dataset repository or email me at m.haseebasif5@gmail.com
Note: This dataset is provided as-is for research purposes. Users are responsible for ensuring appropriate use and compliance with applicable laws and regulations.