UrduReason-Eval is an evaluation dataset for assessing reasoning capabilities in Urdu language models. The dataset contains 800 carefully curated reasoning problems across multiple categories, designed to test both linguistic understanding and mathematical reasoning in Urdu.
- Total Samples: 800
- Language: Urdu (اردو)
- Primary Focus: Multi-step reasoning, arithmetic, and logical deduction
- Intended Use: Evaluation only (not for training)
The dataset is organized by reasoning categories:
| Category | Count | Description |
|---|---|---|
| arithmetic_reasoning | 236 | Mathematical word problems requiring multi-step calculations |
| logical_deduction | 205 | Problems requiring logical inference and constraint satisfaction |
| temporal_reasoning | 154 | Time-based problems involving dates, schedules, and durations |
| comparative_reasoning | 148 | Comparative problems involving relationships and ordering |
| commonsense_causal_reasoning | 50 | Causal reasoning about everyday physical and chemical processes |
| causal_reasoning | 7 | Abstract causal reasoning problems |
Difficulty distribution:
- Hard: 567 samples (70.9%)
- Medium: 233 samples (29.1%)

Answer-type distribution:
- short_text: 541 samples (67.6%)
- integer: 251 samples (31.4%)
- float: 8 samples (1.0%)
Urdu is a low-resource language with over 230 million speakers worldwide, yet evaluation benchmarks for Urdu language models remain limited. UrduReason-Eval addresses this gap by providing a comprehensive evaluation dataset that:
- Tests Multi-step Reasoning: Problems require combining multiple pieces of information and performing sequential reasoning steps
- Evaluates Language Understanding: Questions are written in natural Urdu, testing both comprehension and reasoning
- Covers Diverse Reasoning Types: From arithmetic to logical deduction to temporal reasoning
- Provides Standardized Evaluation: Consistent format and answer normalization for reliable model comparison
This dataset is intended for researchers and developers working on:
- Urdu language models
- Low-resource language evaluation
- Multilingual reasoning benchmarks
- Cross-lingual transfer learning
The dataset was created through an LLM-assisted generation process:
- Initial Generation: Questions were generated with assistance from large general-purpose language models
- Manual Review: All samples underwent human review for:
  - Correctness of questions and answers
  - Clarity and naturalness of the Urdu language
  - Appropriateness of difficulty level
  - Elimination of ambiguous or underdetermined problems
- Normalization: All numeric answers were normalized to Western numerals (0-9) for consistent evaluation
- Validation: Mathematical correctness was verified, and format inconsistencies were corrected
Quality-assurance steps included:
- All arithmetic problems were manually verified for correctness
- Answers were normalized to ensure consistent evaluation
- Ambiguous questions with multiple valid answers were removed or corrected
- Format violations (e.g., mixed numeral systems) were fixed
- Duplicate entries were eliminated
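The numeral normalization described above can be sketched as follows. This is an illustrative helper, not part of the dataset tooling; it maps Urdu (Extended Arabic-Indic) and Arabic-Indic digits to Western numerals:

```python
# Map Urdu (Extended Arabic-Indic, U+06F0-U+06F9) and Arabic-Indic
# (U+0660-U+0669) digits to Western numerals 0-9.
URDU_TO_WESTERN = str.maketrans("۰۱۲۳۴۵۶۷۸۹" + "٠١٢٣٤٥٦٧٨٩", "0123456789" * 2)

def normalize_numerals(text: str) -> str:
    """Rewrite any Urdu/Arabic-Indic digits in text as Western digits."""
    return text.translate(URDU_TO_WESTERN)
```

Non-digit characters, including Urdu letters, pass through unchanged.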
- Total Entries: 800
- Average Question Length: ~150-200 Urdu words
- Answer Format: Normalized (Western numerals for numeric answers)
- Coverage: 6 reasoning categories, 2 difficulty levels
Intended uses:
- Evaluation: Assessing reasoning capabilities of Urdu language models
- Benchmarking: Comparing different models on standardized reasoning tasks
- Research: Studying reasoning in low-resource languages
- Development: Testing model improvements on Urdu reasoning tasks

Out-of-scope uses:
- Training: This dataset is designed for evaluation, not training
- Fine-tuning: Using this dataset for fine-tuning may lead to data leakage in evaluation
- Commercial Applications: See License section for usage restrictions
Each entry contains the following fields:
- `id`: Unique identifier (e.g., "ur_reason_001")
- `category`: Reasoning category (see Dataset Splits section)
- `question`: The reasoning problem in Urdu
- `gold_answer`: The correct answer (normalized format)
- `answer_type`: Type of answer ("integer", "float", or "short_text")
- `difficulty`: Difficulty level ("hard" or "medium")
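For illustration, a record following this schema might look like the following; all values except the field names and formats are hypothetical:

```python
# Hypothetical entry illustrating the dataset schema
example_entry = {
    "id": "ur_reason_001",
    "category": "arithmetic_reasoning",
    "question": "...",            # an Urdu word problem (omitted here)
    "gold_answer": "42",          # normalized to Western numerals
    "answer_type": "integer",
    "difficulty": "hard",
}
```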
To use this dataset, install the required dependencies:
```bash
pip install -r requirements.txt
```

Or install directly:

```bash
pip install datasets pandas
```

```python
from datasets import load_dataset

# Load the dataset (assuming the default "train" split)
dataset = load_dataset("HaseebAsif/UrduReason-Eval", split="train")

# Access by category
arithmetic = dataset.filter(lambda x: x["category"] == "arithmetic_reasoning")

# Example entry
example = dataset[0]
print(f"Question: {example['question']}")
print(f"Answer: {example['gold_answer']}")
print(f"Category: {example['category']}")
print(f"Difficulty: {example['difficulty']}")
```

For evaluation, compare model outputs against the `gold_answer` field. For numeric answers, ensure outputs are normalized to Western numerals. For text answers, exact match or semantic similarity metrics may be used.
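A minimal exact-match scorer along these lines might look as follows; the function names are illustrative, not part of the dataset's tooling, and normalization here covers only digit mapping, whitespace, and case:

```python
# Map Urdu (Extended Arabic-Indic) digits to Western numerals
URDU_TO_WESTERN = str.maketrans("۰۱۲۳۴۵۶۷۸۹", "0123456789")

def normalize(ans: str) -> str:
    # Trim whitespace, lowercase, and rewrite Urdu digits as Western digits
    return ans.strip().lower().translate(URDU_TO_WESTERN)

def exact_match(predictions, references):
    """Fraction of predictions that exactly match gold answers after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

For example, `exact_match(["۴۲"], ["42"])` scores the Urdu-numeral prediction as correct against the normalized gold answer.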
This dataset is archived on Zenodo and available at: https://doi.org/10.5281/zenodo.18260252
If you use this dataset in your research, please cite:
```bibtex
@dataset{urdureason_eval_2026,
  title  = {UrduReason-Eval: Urdu Reasoning Evaluation Dataset},
  author = {Asif, Haseeb},
  year   = {2026},
  doi    = {10.5281/zenodo.18260252},
  url    = {https://huggingface.co/datasets/HaseebAsif/UrduReason-Eval}
}
```

License: See LICENSE file.
This dataset was created to support research in Urdu language understanding and reasoning. We thank the community for feedback and contributions.
For questions, issues, or contributions, please open an issue on the Hugging Face dataset repository or email me at m.haseebasif5@gmail.com
Note: This dataset is provided as-is for research purposes. Users are responsible for ensuring appropriate use and compliance with applicable laws and regulations.