Brainbrew: the ridiculously easy, stupidly powerful no-code machine that turns your boring PDFs and TXT files into god-tier synthetic LLM training data.

Think of it like a mad scientist + coffee machine combo: you dump in documents, hit one button, and BOOM — fresh, high-quality instruction datasets appear like magic. No coding. No spreadsheets. No crying over JSON formatting at 3 a.m.
We took the original prototype, slayed every bug, switched to production-grade distilabel magic, added semantic chunking, multi-model ensemble, dataset deduplication, quality scoring, progress bars, Docker, and a bunch of other goodies… then wrapped it in a shiny Streamlit UI that even your grandma could use.
Current version: v1.2.0
- Zero coding — literally just upload files and click "Generate Dataset"
- Distilabel-powered evolution — Evol-Instruct with configurable evolution depth
- Multi-model ensemble — comma-separate your models for diverse, high-quality output
- Semantic chunking — paragraph-aware document splitting that respects topic boundaries
- Dataset deduplication — exact-match + near-duplicate removal via shingle Jaccard
- Quality scoring — SUPER / GOOD / NORMAL / BAD / DISASTER grades after generation
- 4 export formats — Alpaca, ShareGPT, ChatML, and OpenAI fine-tuning JSONL
- vLLM or OpenAI — choose speed (GPU) or zero-setup (API)
- Auto LoRA training — optional one-click fine-tune with Unsloth
- Hugging Face publish — one checkbox and your dataset is live on the Hub
- Resume support — crashed runs resume from the last completed batch
- Error handling & progress bars — because crashes are for amateurs
- Docker ready — run it anywhere without summoning the dependency demon
- 132+ automated tests — full CI/CD with pytest, ruff, and mypy
In short: it's the tool every AI tinkerer wanted and could never find.
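To make the SUPER / GOOD / NORMAL / BAD / DISASTER grades above concrete, here is an illustrative sketch of a post-generation grader. Brainbrew's actual scoring logic may differ; the thresholds and refusal markers below are assumptions, not the shipped implementation.

```python
# Hypothetical quality grader: thresholds and refusal markers are
# illustrative, not Brainbrew's actual scoring rules.
REFUSAL_MARKERS = ("i cannot", "i can't", "as an ai")

def grade_example(instruction: str, output: str) -> str:
    """Return a SUPER/GOOD/NORMAL/BAD/DISASTER grade for one record."""
    text = output.lower()
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "DISASTER"          # refusals are useless as training data
    words = len(output.split())
    if words < 5:
        return "BAD"               # too short to teach anything
    if words < 30:
        return "NORMAL"
    if words < 120:
        return "GOOD"
    return "SUPER"                 # long, substantive answer

print(grade_example("Explain X", "I cannot help with that."))  # → DISASTER
```

A real grader would also weigh instruction/output relevance (e.g. with an LLM judge), but even cheap heuristics like these catch the worst records before export.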
- Quality Modes: Fast (cheap & quick), Balanced (sweet spot), Research (maximum brain juice)
- Output Formats: Alpaca, ShareGPT, ChatML, OpenAI — pick what your training framework needs
- Smart Filtering: Automatic refusal cleaning + quality scoring dashboard
- Multi-Model Ensemble: Split prompts across multiple teacher models for diversity
- Deduplication: Exact hash + near-duplicate Jaccard filtering
- Cost Estimator: See estimated cost and time before you click Generate
- Live Stats: Record count, average output length, uniqueness ratio
- Dataset Preview: See the first 5 examples before downloading
- Checkpoint/Resume: Large runs save state for crash recovery
- Pydantic Config: Type-safe everything (no more surprise crashes)
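The "exact hash + near-duplicate Jaccard" filtering above can be sketched in a few lines. This is a minimal illustration of shingle-based Jaccard deduplication, not Brainbrew's internal code; the shingle size and 0.8 threshold are assumed defaults.

```python
# Near-duplicate filtering via word-shingle Jaccard similarity.
# Shingle size (3) and threshold (0.8) are illustrative defaults.

def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of n-word shingles for a text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedupe(records: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a record only if it isn't too similar to an already-kept one."""
    kept: list[str] = []
    kept_shingles: list[set] = []
    for rec in records:
        s = shingles(rec)
        if all(jaccard(s, prev) < threshold for prev in kept_shingles):
            kept.append(rec)
            kept_shingles.append(s)
    return kept

docs = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog today",  # near-duplicate
    "Completely different instruction about baking bread",
]
print(len(dedupe(docs)))  # → 2
```

Exact duplicates score a Jaccard of 1.0, so the same pass removes them too; in practice you'd hash records first so the cheap exact check runs before the quadratic similarity scan.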
```bash
git clone https://github.com/Yog-Sotho/Brainbrew.git
cd Brainbrew
bash install.sh
```

The installer handles everything: Python version check, virtual environment, pip dependencies, GPU detection, and .env setup.
```bash
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.sample .env
```

Edit .env:

```bash
OPENAI_API_KEY=sk-...
HF_TOKEN=hf_...
HF_USERNAME=yourusername
```

Then launch the app:

```bash
streamlit run app.py
```

Boom. Browser opens. You're now a dataset wizard.
```bash
docker build -t brainbrew .
docker run --gpus all -p 8501:8501 --env-file .env brainbrew
```

Or use the installer:

```bash
bash install.sh --docker
```

Open http://localhost:8501 and flex.
- Upload your PDFs or TXT files (multiple OK!)
- Pick your teacher model (GPT-4o for API, or Llama-3.1-8B for vLLM speed)
- Optionally enter multiple models comma-separated for ensemble diversity
- Choose quality mode (Fast / Balanced / Research)
- Choose output format (Alpaca / ShareGPT / ChatML / OpenAI)
- Slide to desired dataset size
- Optional: enable semantic chunking, deduplication, LoRA training, HF publish
- Smash the big Generate Dataset button
- Check your quality score, preview examples, and download
Done. Go train a model that actually knows your niche.
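The four output formats store the same instruction/response pair in different layouts. Here is a sketch of three of them (Alpaca, ShareGPT, and OpenAI fine-tuning JSONL) using the common community conventions for each; ChatML is a text template with `<|im_start|>`/`<|im_end|>` role markers rather than a JSON schema. The example record is invented for illustration.

```python
import json

# One instruction/response pair, shown in three export layouts.
example = {"instruction": "Summarise the attached policy.",
           "input": "",
           "output": "The policy covers ..."}

# Alpaca: flat instruction/input/output records
alpaca = example

# ShareGPT: a list of role-tagged conversation turns
sharegpt = {"conversations": [
    {"from": "human", "value": example["instruction"]},
    {"from": "gpt", "value": example["output"]},
]}

# OpenAI fine-tuning JSONL: one {"messages": [...]} object per line
openai_record = {"messages": [
    {"role": "user", "content": example["instruction"]},
    {"role": "assistant", "content": example["output"]},
]}

print(json.dumps(openai_record))
```

Pick whichever shape your training framework parses natively; Axolotl and LLaMA-Factory, for instance, accept both Alpaca and ShareGPT out of the box.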
- Use vLLM — Lightning fast (needs 24+ GB VRAM)
- OpenAI API Key — fallback for laptop warriors
- HF Token — for publishing
- Semantic Chunking — paragraph-aware splitting (experimental)
- Deduplication — remove near-duplicate instruction/output pairs
- Temperature, LoRA rank, batch size — smart-defaulted but tweakable
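The semantic chunking setting can be pictured as greedy paragraph packing: paragraphs are accumulated into a chunk until a size limit is hit, and the split always lands on a paragraph boundary. Brainbrew delegates this to LangChain text splitters; the sketch below just shows the idea, with an assumed character limit.

```python
# Illustrative paragraph-aware chunking: never splits mid-paragraph.
# max_chars is an assumed knob, not Brainbrew's actual setting name.

def chunk_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)   # close the chunk at a topic boundary
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = ("First topic paragraph."
       "\n\n" "Second paragraph, same topic."
       "\n\n" "A third paragraph about something else entirely.")
print(len(chunk_by_paragraph(doc, max_chars=60)))  # → 2
```

Compared with fixed-size character splitting, this keeps each chunk topically coherent, which gives the teacher model cleaner context to generate instructions from.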
- Streamlit – beautiful UI
- distilabel 1.5.x – the real MVP (Evol-Instruct + generation + filtering)
- vLLM – GPU wizardry
- Unsloth – fastest LoRA training on the planet
- LangChain text splitters – character & semantic chunking
- Pydantic + Structlog – no more "it worked on my machine" excuses
- pytest – 132+ tests with CI/CD via GitHub Actions
| Mode | GPU Needed? | Speed | Cost |
|---|---|---|---|
| OpenAI API | None | Medium | $$ (API) |
| vLLM (8B) | 24 GB+ VRAM | Blazing | Free |
| LoRA training | 8 GB+ VRAM | Fast | Free |
Pro tip: Start with OpenAI mode. Once it works, flex with vLLM on RunPod/Modal.
- "CUDA out of memory" — Turn off vLLM or use a smaller model
- OpenAI rate limit — Chill, use a smaller batch size or wait
- Nothing happens — Check console + make sure you uploaded files
- HF publish fails — Token wrong? Repo name taken? Classic.
- bitsandbytes error — Needs CUDA. Expected on CPU-only machines.
Still stuck? Open an issue. We'll roast the bug together.
Brainbrew ships with 132+ automated tests covering config validation, security (API key leakage, filename sanitisation), pipeline orchestration, exporter formats, LoRA training, HF publishing, and more. No GPU required to run tests.
```bash
# Install test deps
pip install pytest

# Run all tests
pytest tests/ -v

# Run just security tests
pytest tests/test_security.py -v
```

CI runs automatically on every push and PR via GitHub Actions.
Love it? Want to make it even cooler?
- Fork it
- Make changes (we love clean PRs)
- Run `pytest tests/ -v` and make sure everything passes
- Submit PR
Ideas welcome: RAG retrieval, multi-modal support, web UI for cloud, additional export formats, etc.
MIT — do whatever you want. Just don't blame us if your model becomes too powerful and takes over the world.


