Skip to content

Dataset generator CLI for multilingual student feedback. Generate high-quality, synthetic datasets for training and evaluating sentiment analysis models in various Philippine languages and dialects.

Notifications You must be signed in to change notification settings

CtrlAltElite-Devs/facugen

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FacuGen

Dataset generator CLI for multilingual student feedback. Generate high-quality, synthetic datasets for training and evaluating sentiment analysis models in various Philippine languages and dialects.

Features

  • Multilingual Support: Supports English, Tagalog, Cebuano, Taglish, and Cebuano-English mix.
  • Multiple Providers: Integration with OpenAI, Google Gemini, and local models via Ollama.
  • Sync & Async Generation: Fast concurrent generation using asyncio.
  • Balanced Sampling: Option to generate equal distributions of sentiment labels.
  • Progress Tracking: Built-in tqdm support for monitoring generation.

Installation

Ensure you have uv installed.

git clone https://github.com/CtrlAltElite-Devs/facugen.git
cd facugen
uv sync

Configuration

Cloud Providers

Create a .env file in the root directory:

OPENAI_API_KEY=your_openai_key
GEMINI_API_KEY=your_gemini_key

Local Models (Ollama)

Ensure Ollama is installed and running on your machine. No API key is required for local models.

Usage

Basic Generation

uv run facugen generate --lang taglish --count 100

Local Generation (Ollama)

Pull your desired model first:

ollama pull qwen2.5:7b

Run generation using the local model:

uv run facugen generate --lang tagalog --count 50 --model qwen2.5:7b

Advanced Options

uv run facugen generate \
  --lang cebu_eng_mix \
  --count 30 \
  --model gpt-4o-mini \
  --balance-labels \
  --seed 42 \
  --async \
  --concurrency 5 \
  --out datasets/feedback.jsonl

Command Arguments

  • --lang: Target language (choices: cebu_eng_mix, cebuano, english, tagalog, taglish).
  • --count: Number of samples to generate.
  • --model: Model to use (e.g., gpt-4o, gemini-1.5-flash, qwen2.5:7b).
  • --out: Output path for JSONL file (default: out/dataset.jsonl).
  • --balance-labels: Ensures equal distribution of positive, neutral, and negative sentiments.
  • --seed: Set a random seed for reproducible dataset generation.
  • --async: Enables concurrent requests for much faster generation.
  • --concurrency: Max number of parallel requests (default: 5). Recommended value is 5 to avoid aggressive rate limiting from providers.

Important: Model Limitations

Warning

Gemini Model Support: Currently, Gemini models are more limited in features compared to OpenAI models within this CLI:

  • Rate Limiting: Gemini free-tier models (like gemini-2.5-flash-lite) have very strict quotas (e.g., 20 requests per day).
  • Stability: You may encounter more frequent RESOURCE_EXHAUSTED errors with Gemini. The CLI implements exponential backoff, but large batches may still fail if daily quotas are reached.
  • Inference Speed: In concurrent mode, Gemini models might throttle more aggressively than OpenAI.

License

MIT

About

Dataset generator CLI for multilingual student feedback. Generate high-quality, synthetic datasets for training and evaluating sentiment analysis models in various Philippine languages and dialects.

Topics

Resources

Stars

Watchers

Forks

Languages