# facugen

Dataset generator CLI for multilingual student feedback. Generate high-quality synthetic datasets for training and evaluating sentiment analysis models in Philippine languages and dialects.

## Features
- Multilingual Support: English, Tagalog, Cebuano, Taglish, and Cebuano-English mix.
- Multiple Providers: Integration with OpenAI, Google Gemini, and local models via Ollama.
- Sync & Async Generation: Fast concurrent generation using `asyncio`.
- Balanced Sampling: Option to generate equal distributions of sentiment labels.
- Progress Tracking: Built-in `tqdm` support for monitoring generation.
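The balanced-sampling and seed features above combine naturally: pick an even split of target labels, then shuffle them with a seeded RNG so runs are reproducible. A minimal sketch of that idea (a hypothetical helper, not facugen's actual implementation):

```python
import random

LABELS = ["positive", "neutral", "negative"]

def balanced_labels(count, seed=None):
    # Assign each sample a target sentiment so the three labels are
    # (near-)evenly represented, then shuffle deterministically.
    # Hypothetical helper -- facugen's internals may differ.
    rng = random.Random(seed)
    labels = [LABELS[i % len(LABELS)] for i in range(count)]
    rng.shuffle(labels)
    return labels
```

With the same seed, the same label sequence comes back every run, which is what makes `--seed` useful for reproducible datasets.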
## Installation

Ensure you have `uv` installed.

```bash
git clone https://github.com/CtrlAltElite-Devs/facugen.git
cd facugen
uv sync
```

## Configuration

Create a `.env` file in the root directory:

```env
OPENAI_API_KEY=your_openai_key
GEMINI_API_KEY=your_gemini_key
```

For local models, ensure Ollama is installed and running on your machine. No API key is required.
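If a run fails with an authentication error, it usually means the `.env` file was not picked up. A quick way to check which provider keys the process can actually see (a hypothetical debugging helper, not part of facugen):

```python
import os

def check_provider_keys(keys=("OPENAI_API_KEY", "GEMINI_API_KEY")):
    # Report which provider keys are visible to the current process.
    # Hypothetical helper for debugging a .env that is not loading.
    return {k: bool(os.environ.get(k)) for k in keys}
```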
## Usage

### Basic Generation

```bash
uv run facugen generate --lang taglish --count 100
```

### Using Local Models (Ollama)

Pull your desired model first:

```bash
ollama pull qwen2.5:7b
```

Run generation using the local model:

```bash
uv run facugen generate --lang tagalog --count 50 --model qwen2.5:7b
```

### Full Example

```bash
uv run facugen generate \
  --lang cebu_eng_mix \
  --count 30 \
  --model gpt-4o-mini \
  --balance-labels \
  --seed 42 \
  --async \
  --concurrency 5 \
  --out datasets/feedback.jsonl
```

## Options

- `--lang`: Target language (choices: `cebu_eng_mix`, `cebuano`, `english`, `tagalog`, `taglish`).
- `--count`: Number of samples to generate.
- `--model`: Model to use (e.g., `gpt-4o`, `gemini-1.5-flash`, `qwen2.5:7b`).
- `--out`: Output path for the JSONL file (default: `out/dataset.jsonl`).
- `--balance-labels`: Ensure an equal distribution of positive, neutral, and negative sentiments.
- `--seed`: Set a random seed for reproducible dataset generation.
- `--async`: Enable concurrent requests for much faster generation.
- `--concurrency`: Maximum number of parallel requests (default: 5, which is also the recommended value to avoid aggressive rate limiting from providers).
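The output is JSONL: one JSON object per line. A short sketch for inspecting the label distribution of a generated file, assuming each record carries a `label` field (the actual schema may differ):

```python
import json
from collections import Counter

def label_distribution(path):
    # Tally sentiment labels in a generated JSONL file, one JSON
    # object per line. Assumes a "label" field per record -- the
    # real facugen schema may use different field names.
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                counts[json.loads(line)["label"]] += 1
    return counts
```

This is a handy sanity check after a `--balance-labels` run: the three counts should be equal (or off by at most one).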
> [!WARNING]
> Gemini models are currently more limited in features than OpenAI models within this CLI:
>
> - Rate Limiting: Gemini free-tier models (like `gemini-2.5-flash-lite`) have very strict quotas (e.g., 20 requests per day).
> - Stability: You may encounter more frequent `RESOURCE_EXHAUSTED` errors with Gemini. The CLI implements exponential backoff, but large batches may still fail once daily quotas are reached.
> - Inference Speed: In concurrent mode, Gemini models may throttle more aggressively than OpenAI models.
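The warning above mentions that the CLI retries with exponential backoff. A minimal sketch of that general pattern, not facugen's actual code:

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    # Retry `call`, doubling the wait after each failure and adding
    # jitter so concurrent workers do not retry in lockstep.
    # Sketch of the general backoff pattern; facugen's internals
    # (error types, delays, retry caps) may differ.
    for attempt in range(max_retries):
        try:
            return call()
        except RuntimeError:
            if attempt == max_retries - 1:
                raise  # retries exhausted; surface the error
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.25))
```

Backoff helps with transient throttling, but note that it cannot help once a hard daily quota is consumed: at that point every retry fails until the quota resets.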
## License

MIT