UrduReason-Eval: Urdu Reasoning Evaluation Dataset

🔗 View on Hugging Face

Dataset Description

UrduReason-Eval is an evaluation dataset for assessing reasoning capabilities in Urdu language models. The dataset contains 800 carefully curated reasoning problems across multiple categories, designed to test both linguistic understanding and mathematical reasoning in Urdu.

  • Total Samples: 800
  • Language: Urdu (اردو)
  • Primary Focus: Multi-step reasoning, arithmetic, and logical deduction
  • Intended Use: Evaluation only (not for training)

Dataset Splits

The dataset is organized by reasoning categories:

| Category | Count | Description |
| --- | --- | --- |
| arithmetic_reasoning | 236 | Mathematical word problems requiring multi-step calculations |
| logical_deduction | 205 | Problems requiring logical inference and constraint satisfaction |
| temporal_reasoning | 154 | Time-based problems involving dates, schedules, and durations |
| comparative_reasoning | 148 | Comparative problems involving relationships and ordering |
| commonsense_causal_reasoning | 50 | Causal reasoning about everyday physical and chemical processes |
| causal_reasoning | 7 | Abstract causal reasoning problems |

Difficulty Distribution

  • Hard: 567 samples (70.9%)
  • Medium: 233 samples (29.1%)

Answer Types

  • short_text: 541 samples (67.6%)
  • integer: 251 samples (31.4%)
  • float: 8 samples (1.0%)

Motivation & Goal

Urdu is a low-resource language with over 230 million speakers worldwide, yet evaluation benchmarks for Urdu language models remain limited. UrduReason-Eval addresses this gap by providing a comprehensive evaluation dataset that:

  1. Tests Multi-step Reasoning: Problems require combining multiple pieces of information and performing sequential reasoning steps
  2. Evaluates Language Understanding: Questions are written in natural Urdu, testing both comprehension and reasoning
  3. Covers Diverse Reasoning Types: From arithmetic to logical deduction to temporal reasoning
  4. Provides Standardized Evaluation: Consistent format and answer normalization for reliable model comparison

This dataset is intended for researchers and developers working on:

  • Urdu language models
  • Low-resource language evaluation
  • Multilingual reasoning benchmarks
  • Cross-lingual transfer learning

Dataset Creation

Generation Process

The dataset was created through an LLM-assisted generation process:

  1. Initial Generation: Questions were generated with assistance from large general-purpose language models
  2. Manual Review: All samples underwent human review for:
    • Correctness of questions and answers
    • Clarity and naturalness of Urdu language
    • Appropriateness of difficulty level
    • Elimination of ambiguous or underdetermined problems
  3. Normalization: All numeric answers were normalized to Western numerals (0-9) for consistent evaluation
  4. Validation: Mathematical correctness was verified, and format inconsistencies were corrected

Quality Assurance

  • All arithmetic problems were manually verified for correctness
  • Answers were normalized to ensure consistent evaluation
  • Ambiguous questions with multiple valid answers were removed or corrected
  • Format violations (e.g., mixed numeral systems) were fixed
  • Duplicate entries were eliminated
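Two of the checks above (mixed numeral systems and duplicate entries) can be approximated automatically. The following is a minimal sketch, not the pipeline actually used to build the dataset; the function names `has_mixed_numerals` and `find_duplicate_questions` are hypothetical, and records are assumed to be plain dicts with `id` and `question` keys.

```python
import re
from collections import defaultdict

# Western digits versus Arabic-Indic (U+0660-U+0669) and
# Extended Arabic-Indic (U+06F0-U+06F9) digits; the latter are the Urdu forms.
_WESTERN = re.compile(r"[0-9]")
_EASTERN = re.compile(r"[\u0660-\u0669\u06F0-\u06F9]")


def has_mixed_numerals(text: str) -> bool:
    """Return True if the text contains both Western and Eastern digits."""
    return bool(_WESTERN.search(text)) and bool(_EASTERN.search(text))


def find_duplicate_questions(records):
    """Map each repeated question (whitespace-collapsed) to the ids sharing it."""
    ids_by_question = defaultdict(list)
    for rec in records:
        key = " ".join(rec["question"].split())
        ids_by_question[key].append(rec["id"])
    return {q: ids for q, ids in ids_by_question.items() if len(ids) > 1}
```

Collapsing whitespace only catches near-verbatim duplicates; paraphrased duplicates would still need the manual review described above.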

Dataset Statistics

  • Total Entries: 800
  • Average Question Length: ~150-200 Urdu words
  • Answer Format: Normalized (Western numerals for numeric answers)
  • Coverage: 6 reasoning categories, 2 difficulty levels

Intended Use

✅ Recommended Uses

  • Evaluation: Assessing reasoning capabilities of Urdu language models
  • Benchmarking: Comparing different models on standardized reasoning tasks
  • Research: Studying reasoning in low-resource languages
  • Development: Testing model improvements on Urdu reasoning tasks

❌ Not Recommended

  • Training: This dataset is designed for evaluation, not training
  • Fine-tuning: Fine-tuning on this dataset leaks the test data into the model and invalidates any later evaluation on it
  • Commercial Applications: See License section for usage restrictions

Data Fields

Each entry contains the following fields:

  • id: Unique identifier (e.g., "ur_reason_001")
  • category: Reasoning category (see Dataset Splits section)
  • question: The reasoning problem in Urdu
  • gold_answer: The correct answer (normalized format)
  • answer_type: Type of answer ("integer", "float", or "short_text")
  • difficulty: Difficulty level ("hard" or "medium")
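A small validator can confirm that a loaded record matches the fields listed above. This is an illustrative sketch: `validate_record` is a hypothetical helper, and it assumes `gold_answer` can be converted to a string, since the storage type of that field is not specified here.

```python
REQUIRED_FIELDS = (
    "id", "category", "question", "gold_answer", "answer_type", "difficulty",
)
ANSWER_TYPES = {"integer", "float", "short_text"}
DIFFICULTIES = {"hard", "medium"}


def validate_record(rec: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in rec]

    answer_type = rec.get("answer_type")
    if "answer_type" in rec and answer_type not in ANSWER_TYPES:
        problems.append(f"unexpected answer_type: {answer_type!r}")
    if "difficulty" in rec and rec["difficulty"] not in DIFFICULTIES:
        problems.append(f"unexpected difficulty: {rec['difficulty']!r}")

    # Numeric gold answers should parse with the declared type.
    if "gold_answer" in rec and answer_type in ("integer", "float"):
        parser = int if answer_type == "integer" else float
        try:
            parser(str(rec["gold_answer"]).strip())
        except ValueError:
            problems.append(f"gold_answer does not parse as {answer_type}")
    return problems
```

Returning a list of problems instead of raising makes it easy to scan a whole set of records and report every issue in one pass.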

Installation

To use this dataset, install the required dependencies:

pip install -r requirements.txt

Or install directly:

pip install datasets pandas

Usage Example

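from datasets import load_dataset

# Load the dataset. Without a split argument this returns a DatasetDict
# keyed by split name, not a single Dataset.
dataset_dict = load_dataset("HaseebAsif/UrduReason-Eval")

# Split names are not documented here, so take the first available split
split_name = next(iter(dataset_dict))
dataset = dataset_dict[split_name]

# Access by category
arithmetic = dataset.filter(lambda x: x["category"] == "arithmetic_reasoning")

# Example entry
example = dataset[0]
print(f"Question: {example['question']}")
print(f"Answer: {example['gold_answer']}")
print(f"Category: {example['category']}")
print(f"Difficulty: {example['difficulty']}")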

Evaluation

For evaluation, compare model outputs against gold_answer fields. For numeric answers, ensure outputs are normalized to Western numerals. For text answers, exact match or semantic similarity metrics may be used.
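The comparison described above can be sketched as follows. This is one possible scoring scheme under stated assumptions, not an official scorer for the dataset: `normalize`, `is_correct`, and `accuracy` are hypothetical helpers, text answers are scored by exact match after whitespace cleanup, and the float tolerance `tol` is an arbitrary choice.

```python
import math

# Map Arabic-Indic (U+0660-U+0669) and Extended Arabic-Indic (U+06F0-U+06F9)
# digits to Western digits 0-9.
_DIGIT_MAP = {
    base + i: str(i) for base in (0x0660, 0x06F0) for i in range(10)
}


def normalize(text) -> str:
    """Convert digits to Western numerals and collapse whitespace."""
    return " ".join(str(text).translate(_DIGIT_MAP).split())


def is_correct(prediction, gold, answer_type, tol=1e-6) -> bool:
    """Compare a model output with gold_answer according to answer_type."""
    pred, ref = normalize(prediction), normalize(gold)
    try:
        if answer_type == "integer":
            return int(pred) == int(ref)
        if answer_type == "float":
            return math.isclose(float(pred), float(ref), abs_tol=tol)
    except ValueError:
        # Output that does not parse as a number counts as wrong.
        return False
    return pred == ref  # short_text: exact match


def accuracy(predictions, records) -> float:
    """Fraction of records whose prediction matches gold_answer."""
    if not records:
        return 0.0
    hits = sum(
        is_correct(p, r["gold_answer"], r["answer_type"])
        for p, r in zip(predictions, records)
    )
    return hits / len(records)
```

Exact match is strict for `short_text` answers; swapping the last line of `is_correct` for a semantic similarity check is the natural extension if paraphrased answers should count.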

DOI

This dataset is archived on Zenodo and available at: https://doi.org/10.5281/zenodo.18260252

Citations

If you use this dataset in your research, please cite:

@dataset{urdureason_eval_2026,
  title = {UrduReason-Eval: Urdu Reasoning Evaluation Dataset},
  author = {Asif, Haseeb},
  year = {2026},
  doi = {10.5281/zenodo.18260252},
  url = {https://huggingface.co/datasets/HaseebAsif/UrduReason-Eval}
}

License

See the LICENSE file.

Acknowledgments

This dataset was created to support research in Urdu language understanding and reasoning. We thank the community for feedback and contributions.

Contact

For questions, issues, or contributions, please open an issue on the Hugging Face dataset repository or email me at m.haseebasif5@gmail.com.


Note: This dataset is provided as-is for research purposes. Users are responsible for ensuring appropriate use and compliance with applicable laws and regulations.
