A specialized tool for cleaning and normalizing Arabic text corpora using Google's Gemini. This tool is designed to prepare high-quality text data for training Large Language Models (LLMs) while preserving the unique characteristics of Arabic dialects and expressions.
- Automated Text Cleaning: Uses Google Gemini AI to intelligently clean and normalize Arabic text
- Dialect Preservation: Maintains unique dialect expressions and vocabulary while standardizing orthography
- Memory Efficient: Processes large corpora in fixed-size chunks, so it can run on low-memory systems (as little as 1 GB of RAM)
- Resume Capability: Can resume processing from where it left off if interrupted
- Progress Tracking: Real-time dashboard showing processing progress and statistics
- Robust Error Handling: Includes retry logic for API calls and comprehensive logging
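The memory-efficient chunking described above can be sketched as streaming a file in fixed-size pieces instead of loading it whole. This is an illustration only; `read_chunks` is a hypothetical name, not the tool's actual API:

```python
def read_chunks(path, chunk_size=15000):
    """Yield successive fixed-size chunks of a text file without loading it all into memory."""
    with open(path, encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

Each chunk can then be sent to the API independently, which keeps peak memory usage proportional to the chunk size rather than the corpus size.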
- Python 3.7 or higher
- Google Gemini API key
- Text corpus files in `.txt` format
- Clone the repository:

  ```bash
  git clone https://github.com/o96a/LLMCorpusKit.git
  cd LLMCorpusKit
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Set up environment variables:

  - Copy `.env.example` to `.env`:

    ```bash
    cp .env.example .env
    ```

  - Get your Google Gemini API key from Google AI Studio
  - Add your API key to the `.env` file
- Prepare your corpus:

  - Create a `corpus` folder in the project directory
  - Place your `.txt` files containing Arabic text in this folder
- Run the cleaning process:

  ```bash
  python main.py
  ```

- Monitor progress:

  - The script displays a real-time dashboard with progress information
  - Cleaned files will be saved in the `cleaned_corpus` folder
  - Processing state is automatically saved and can be resumed if interrupted
You can modify the following settings in `main.py`:

- `CHUNK_SIZE`: Size of text chunks processed at once (default: 15000 characters)
- `MODEL_NAME`: Gemini model to use (default: `gemini-2.0-flash`)
- `CORPUS_PATH`: Path to input corpus files (default: `corpus`)
- `CLEANED_PATH`: Path for cleaned output files (default: `cleaned_corpus`)
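These settings are plain constants, so tuning is a one-line edit. A sketch of what they look like, shown with their default values (their exact location in `main.py` may differ):

```python
# Tunable settings in main.py (defaults shown)
CHUNK_SIZE = 15000                # characters per chunk sent to Gemini
MODEL_NAME = "gemini-2.0-flash"   # Gemini model identifier
CORPUS_PATH = "corpus"            # folder containing input .txt files
CLEANED_PATH = "cleaned_corpus"   # folder for cleaned output files
```

For example, halving `CHUNK_SIZE` to `7500` roughly halves the memory held per API call, at the cost of more requests.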
The tool applies the following cleaning and normalization rules:
- Standardizes "ى" (alif maqsura) to "ي" (ya) where appropriate
- Removes diacritics and tatweel marks
- Standardizes laughter expressions (e.g., "ههههه" → "ههه")
- Removes excessive punctuation
- Fixes common typos and spelling mistakes
- Standardizes word variations while preserving dialect
- Removes conversational filler words with no semantic value
- Improves sentence structure and punctuation
- Handles transliterated foreign words consistently
- Does NOT translate to Modern Standard Arabic (MSA)
- Does NOT remove unique dialect expressions
- Maintains the authentic flavor of the dialect
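The cleaning itself is delegated to Gemini, but the purely mechanical rules above (stripping diacritics and tatweel, collapsing laughter and repeated punctuation) can be illustrated with plain regular expressions. This is a hedged sketch of the rules, not the tool's implementation:

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # tanween, harakat, shadda, sukun, dagger alif
TATWEEL = re.compile(r"\u0640+")                    # kashida/tatweel elongation marks

def normalize(text):
    text = DIACRITICS.sub("", text)               # remove diacritics
    text = TATWEEL.sub("", text)                  # remove tatweel
    text = re.sub(r"ه{4,}", "ههه", text)          # collapse long laughter to three
    text = re.sub(r"([!?؟.])\1+", r"\1", text)    # collapse repeated punctuation
    return text
```

For instance, `normalize("كـــتاب")` yields `"كتاب"`, and a long `"ههههههه"` collapses to `"ههه"` as in the rule above.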
```
LLMCorpusKit/
├── main.py                  # Main processing script
├── requirements.txt         # Python dependencies
├── .env.example             # Environment variables template
├── .gitignore               # Git ignore rules
├── README.md                # This file
├── corpus/                  # Input text files (create this folder)
├── cleaned_corpus/          # Output cleaned files (auto-created)
├── processing_state.json    # Processing state (auto-created)
└── processing.log           # Processing logs (auto-created)
```
- API Key Error:

  - Ensure your `.env` file contains a valid `GOOGLE_API_KEY`
  - Verify your API key has access to Gemini models
- Memory Issues:

  - Reduce `CHUNK_SIZE` if running on very low-memory systems
  - Ensure you have enough disk space for output files
- Processing Interruption:

  - The script automatically saves progress and can resume from the last checkpoint
  - Simply run `python main.py` again to continue

- Check `processing.log` for detailed error messages
- The script provides verbose console output during processing
- Processing state is saved in `processing_state.json`
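The checkpoint-and-resume behaviour can be pictured like this. The JSON keys shown are hypothetical, chosen for illustration; the real schema of `processing_state.json` may differ:

```python
import json
import os

STATE_FILE = "processing_state.json"

def load_state():
    """Return the last saved checkpoint, or a fresh state if none exists."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, encoding="utf-8") as f:
            return json.load(f)
    return {"current_file": None, "chunks_done": 0}  # hypothetical schema

def save_state(state):
    """Write the checkpoint via a temp file so an interrupted write can't corrupt it."""
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f, ensure_ascii=False)
    os.replace(tmp, STATE_FILE)  # atomic rename on POSIX
```

Saving after every chunk means that, at worst, one chunk of work is repeated after an interruption.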
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-feature`)
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature/new-feature`)
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Google Gemini AI for powerful text processing capabilities
- The Arabic language community for preserving rich dialects
If you encounter any issues or have questions, please:
- Check the troubleshooting section above
- Search existing GitHub Issues
- Create a new issue with detailed information about your problem
Note: This tool is designed for Arabic text processing. It can work with various Arabic dialects and maintains their unique characteristics while improving text quality.