Arabic Corpus Refinery

A specialized tool for cleaning and normalizing Arabic text corpora using Google's Gemini models. It prepares high-quality text data for training Large Language Models (LLMs) while preserving the unique characteristics of Arabic dialects and expressions.

Features

Automated Text Cleaning: Uses Google Gemini to clean and normalize Arabic text
Dialect Preservation: Maintains unique dialect expressions and vocabulary while standardizing orthography
Memory Efficient: Processes large corpora in chunks so it can run on low-memory systems (1 GB RAM)
Resume Capability: Can resume processing from where it left off if interrupted
Progress Tracking: Real-time dashboard showing processing progress and statistics
Robust Error Handling: Includes retry logic for API calls and comprehensive logging
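
The retry logic mentioned above is not documented in detail, so the following is only a minimal sketch of one common approach: exponential backoff around a failing call. The function name call_with_retry and its parameters are hypothetical and are not taken from main.py.

```python
import random
import time


def call_with_retry(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on any exception with exponential backoff.

    All names and defaults here are illustrative assumptions.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, 2*base_delay, 4*base_delay, ... plus a little jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            sleep(delay)
```

The sleep parameter is injectable so the behavior can be checked without actually waiting; in real use the default time.sleep applies.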

Prerequisites

Python 3.7 or higher
Google Gemini API key
Text corpus files in .txt format

Installation

Clone the repository:

```
git clone https://github.com/o96a/LLMCorpusKit.git
cd LLMCorpusKit
```

Install dependencies:
```
pip install -r requirements.txt
```
Set up environment variables:
- Copy .env.example to .env
- Get your Google Gemini API key from Google AI Studio
- Add your API key to the .env file:
```
cp .env.example .env
# Edit .env file and add your API key
```
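
To confirm the key is visible before starting a long run, a small check like the one below may help. It is a sketch using only the standard library: the variable name GOOGLE_API_KEY comes from the Troubleshooting section, while the helper names read_env_file and get_api_key are hypothetical, and main.py may load the file differently.

```python
import os


def read_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file into a dict (comments and blanks skipped)."""
    values = {}
    if not os.path.exists(path):
        return values
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path=".env"):
    """Return GOOGLE_API_KEY from the environment or the .env file, else raise."""
    key = os.environ.get("GOOGLE_API_KEY") or read_env_file(path).get("GOOGLE_API_KEY")
    if not key:
        raise RuntimeError("GOOGLE_API_KEY is not set; add it to your .env file")
    return key
```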

Usage

Prepare your corpus:
- Create a corpus folder in the project directory
- Place your .txt files containing Arabic text in this folder
Run the cleaning process:
```
python main.py
```
Monitor progress:
- The script displays a real-time dashboard with progress information
- Cleaned files will be saved in the cleaned_corpus folder
- Processing state is automatically saved and can be resumed if interrupted
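
The format of the saved state is not described here, so the sketch below only illustrates the general checkpoint-and-resume idea. It assumes a JSON file that maps each file name to the number of chunks already cleaned; that layout and the function names load_state, save_state, and process_file are hypothetical.

```python
import json
import os


def load_state(path="processing_state.json"):
    """Return saved progress, or an empty dict when no state file exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def save_state(state, path="processing_state.json"):
    """Write to a temp file and rename it, so an interruption is unlikely to leave a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(state, fh, ensure_ascii=False)
    os.replace(tmp_path, path)


def process_file(name, chunks, clean, state, path="processing_state.json"):
    """Clean the chunks of one file, skipping those already recorded as done."""
    done = state.get(name, 0)
    results = []
    for index in range(done, len(chunks)):
        results.append(clean(chunks[index]))
        state[name] = index + 1  # checkpoint after every chunk
        save_state(state, path)
    return results
```

Checkpointing after every chunk trades a little extra disk I/O for losing at most one chunk of work when the run is interrupted.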

Configuration

You can modify the following settings in main.py:

CHUNK_SIZE: Size of text chunks processed at once (default: 15000 characters)
MODEL_NAME: Gemini model to use (default: 'gemini-2.0-flash')
CORPUS_PATH: Path to input corpus files (default: 'corpus')
CLEANED_PATH: Path for cleaned output files (default: 'cleaned_corpus')
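
To make the effect of CHUNK_SIZE concrete, here is a minimal sketch of character-based chunking. It assumes chunks are cut at line boundaries where possible; whether main.py does this, or streams the file instead of holding it in memory, is not stated, and the function name split_into_chunks is hypothetical.

```python
CHUNK_SIZE = 15000  # characters, matching the documented default


def split_into_chunks(text, chunk_size=CHUNK_SIZE):
    """Split text into pieces of at most chunk_size characters.

    Prefers to cut just after the last newline inside the window so that
    lines stay whole; falls back to a hard cut when no newline is found.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            newline = text.rfind("\n", start, end)
            if newline > start:
                end = newline + 1  # keep the newline with the preceding chunk
        chunks.append(text[start:end])
        start = end
    return chunks
```

Joining the returned chunks reproduces the input exactly, so no text is lost or duplicated at chunk boundaries.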

Text Processing Rules

The tool applies the following cleaning and normalization rules:

Orthographic Normalization

Standardizes "ى" (alif maqsura) to "ي" (ya) where appropriate
Removes diacritics and tatweel marks
Standardizes laughter expressions (e.g., "ههههه" → "ههه")
Removes excessive punctuation
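
The tool delegates these rules to Gemini, so the following is not its implementation. It is a deterministic sketch of what three of the rules mean in practice, written with regular expressions. The thresholds (four or more repeated letters for laughter, any repeated mark for punctuation) are assumptions, and the context-dependent alif maqsura rule is left out. Ellipses are left untouched.

```python
import re

# Arabic diacritics: tanwin, short vowels, shadda, sukun (U+064B to U+0652)
# and the superscript (dagger) alif (U+0670).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation mark
HEH = "\u0647"  # the letter used in written laughter
# Four or more consecutive heh letters are treated as laughter (heuristic).
LAUGHTER = re.compile(HEH + "{4,}")
# Runs of the same mark among ! ? Arabic question mark and Arabic comma.
REPEATED_PUNCT = re.compile(r"([!?\u061F\u060C])\1+")


def normalize(text):
    """Apply the diacritic, tatweel, laughter, and punctuation rules in order."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = LAUGHTER.sub(HEH * 3, text)
    text = REPEATED_PUNCT.sub(r"\1", text)
    return text
```

A rule-based pass like this can serve as a cheap sanity check on model output, for example by verifying that no diacritics or tatweel marks remain in a cleaned file.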

Spelling Correction

Fixes common typos and spelling mistakes
Standardizes word variations while preserving dialect

Content Refinement

Removes conversational filler words with no semantic value
Improves sentence structure and punctuation
Handles transliterated foreign words consistently

Preservation Constraints

Does NOT translate to Modern Standard Arabic (MSA)
Does NOT remove unique dialect expressions
Maintains the authentic flavor of the dialect

File Structure

```
LLMCorpusKit/
├── main.py                 # Main processing script
├── requirements.txt        # Python dependencies
├── .env.example            # Environment variables template
├── .gitignore              # Git ignore rules
├── README.md               # This file
├── corpus/                 # Input text files (create this folder)
├── cleaned_corpus/         # Output cleaned files (auto-created)
├── processing_state.json   # Processing state (auto-created)
└── processing.log          # Processing logs (auto-created)
```

Troubleshooting

Common Issues

API Key Error:
- Ensure your .env file contains a valid GOOGLE_API_KEY
- Verify your API key has access to Gemini models
Memory Issues:
- Reduce CHUNK_SIZE if running on very low memory systems
- Ensure you have enough disk space for output files
Processing Interruption:
- The script automatically saves progress and can resume from the last checkpoint
- Simply run python main.py again to continue

Logs and Debugging

Check processing.log for detailed error messages
The script provides verbose console output during processing
Processing state is saved in processing_state.json

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/new-feature)
Commit your changes (git commit -am 'Add new feature')
Push to the branch (git push origin feature/new-feature)
Create a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Google Gemini AI for powerful text processing capabilities
The Arabic language community for preserving rich dialects

Support

If you encounter any issues or have questions, please:

Check the troubleshooting section above
Search existing GitHub Issues
Create a new issue with detailed information about your problem

Note: This tool is designed for Arabic text processing. It can work with various Arabic dialects and maintains their unique characteristics while improving text quality.
