📑 BBC Dari Web Crawler

A Python-based crawler for building a structured Dari text dataset

📘 Project Overview

This project is a Python web crawler that extracts and organizes news articles from BBC Dari into a structured dataset.
The crawler systematically collects headlines, publication dates, authors, article bodies, and URLs, and saves them into a CSV file (dari_dataset.csv) encoded in UTF-8.

The dataset is intended for academic and research use, particularly in Natural Language Processing (NLP), machine learning, and linguistic studies involving the Dari language.

🎯 Objectives

✅ Build a clean and reliable Dari-language dataset.
✅ Facilitate NLP tasks such as classification, sentiment analysis, and summarization.
✅ Provide resources for machine learning models trained on low-resource languages.
✅ Contribute to linguistic research in Dari.

🔑 Features

🌐 Crawls multiple BBC Dari categories (Politics, Economy, Culture, World, etc.).
📝 Extracts clean, UTF-8 encoded text suitable for NLP tasks.
💾 Stores structured data in a CSV format.
🔄 Designed to be extensible to other Dari-language news sources.
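As a rough illustration of what "clean, UTF-8 encoded text" can involve, here is a minimal sketch of a normalization step. The function name `clean_text` and the exact set of characters it removes are assumptions made for this example; it is not taken from the project's `cleaner.py`.

```python
import re
import unicodedata

# Invisible marks that often leak into scraped text (zero-width space,
# directional marks, byte-order mark). The zero-width non-joiner (U+200C)
# is deliberately kept because it is meaningful in Persian-script spelling.
_INVISIBLE = re.compile("[\u200b\u200e\u200f\u202a-\u202e\ufeff]")
_WHITESPACE = re.compile(r"[ \t\u00a0]+")


def clean_text(raw: str) -> str:
    """Normalize scraped text (hypothetical helper for illustration)."""
    # Use one canonical Unicode form so identical words compare equal.
    text = unicodedata.normalize("NFC", raw)
    text = _INVISIBLE.sub("", text)
    # Collapse runs of spaces, tabs, and non-breaking spaces on each line.
    lines = (_WHITESPACE.sub(" ", line).strip() for line in text.splitlines())
    # Drop empty lines so paragraphs are separated by a single newline.
    return "\n".join(line for line in lines if line)
```

Keeping U+200C is a deliberate choice in this sketch: stripping it would merge or alter words that depend on it, which matters for downstream NLP tokenization.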

📂 Dataset Format (`dari_dataset.csv`)

| Column | Description |
| --- | --- |
| `title` | Headline of the article |
| `date` | Publication date (ISO format) |
| `author` | Author name (if available, else `null`) |
| `content` | Full article text |
| `url` | Original source link |
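The column layout above can be written and read back with Python's standard `csv` module. The following is a minimal sketch, not the project's `saver.py`: the helper names `save_articles` and `load_articles` are hypothetical, and storing the literal string `null` for a missing author is one assumed reading of the table.

```python
import csv

# Column order taken from the dataset format table.
FIELDNAMES = ["title", "date", "author", "content", "url"]


def save_articles(rows, path):
    """Write article dicts to a UTF-8 CSV (hypothetical helper)."""
    # newline="" lets the csv module handle line endings inside quoted fields.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        for row in rows:
            record = {name: row.get(name) or "" for name in FIELDNAMES}
            # Assumption: a missing author is stored as the string "null".
            record["author"] = row.get("author") or "null"
            writer.writerow(record)


def load_articles(path):
    """Read the CSV back into a list of dicts keyed by column name."""
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))
```

Calling `save_articles(rows, "dari_dataset.csv")` followed by `load_articles("dari_dataset.csv")` should round-trip multi-line article bodies, because the writer quotes any field that contains a newline.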

🛠️ Installation & Usage

Prerequisites

Python 3.9+
Install dependencies:

pip install requests beautifulsoup4 pandas
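With those dependencies installed, extracting the dataset fields from a downloaded page might look roughly like the sketch below. The function name `parse_article` is hypothetical, and the selectors (`<h1>` for the headline, `<time datetime="...">` for the date, `<p>` tags inside `<main>` for the body) are assumptions about the page markup rather than the project's actual parsing rules. The author is left as `None` because no author selector is assumed here.

```python
from bs4 import BeautifulSoup


def parse_article(html: str, url: str) -> dict:
    """Extract dataset fields from article HTML (illustrative sketch)."""
    soup = BeautifulSoup(html, "html.parser")
    headline = soup.find("h1")      # assumed headline element
    time_tag = soup.find("time")    # assumed to carry a datetime attribute
    # Prefer <main> when present, otherwise search the whole document.
    container = soup.find("main") or soup
    paragraphs = [p.get_text(" ", strip=True) for p in container.find_all("p")]
    return {
        "title": headline.get_text(strip=True) if headline else "",
        "date": time_tag.get("datetime", "") if time_tag else "",
        "author": None,
        "content": "\n".join(p for p in paragraphs if p),
        "url": url,
    }
```

The HTML string would typically come from something like `requests.get(url, timeout=10).text`. Keeping parsing separate from fetching means the parser can be tested on saved pages without network access.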

