Brainbrew: the ridiculously easy, stupidly powerful no-code machine that turns your boring PDFs and TXT files into god-tier synthetic LLM training data.

Think of it like a mad scientist + coffee machine combo: you dump in documents, hit one button, and BOOM — fresh, high-quality instruction datasets appear like magic. No coding. No spreadsheets. No crying over JSON formatting at 3 a.m.
We took the original prototype, slayed every bug, switched to production-grade distilabel magic, added semantic chunking, multi-model ensemble, dataset deduplication, quality scoring, progress bars, Docker, and a bunch of other goodies… then wrapped it in a shiny Streamlit UI that even your grandma could use.
Current version: v1.2.0
- Zero coding — literally just upload files and click "Generate Dataset"
- Distilabel-powered evolution — Evol-Instruct with configurable evolution depth
- Multi-model ensemble — comma-separate your models for diverse, high-quality output
- Semantic chunking — paragraph-aware document splitting that respects topic boundaries
- Dataset deduplication — exact-match + near-duplicate removal via shingle Jaccard
- Quality scoring — SUPER / GOOD / NORMAL / BAD / DISASTER grades after generation
- 4 export formats — Alpaca, ShareGPT, ChatML, and OpenAI fine-tuning JSONL
- vLLM or OpenAI — choose speed (GPU) or zero-setup (API)
- Auto LoRA training — optional one-click fine-tune with Unsloth
- Hugging Face publish — one checkbox and your dataset is live on the Hub
- Resume support — crashed runs resume from the last completed batch
- Error handling & progress bars — because crashes are for amateurs
- Docker ready — run it anywhere without summoning the dependency demon
- 132+ automated tests — full CI/CD with pytest, ruff, and mypy
In short: it's the tool every AI tinkerer wanted and could never find.
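To make the SUPER / GOOD / NORMAL / BAD / DISASTER grades above concrete, here is an illustrative sketch of a post-generation grader. Brainbrew's actual scoring logic may differ; the thresholds and refusal markers below are assumptions, not the shipped implementation.

```python
# Hypothetical quality grader: thresholds and refusal markers are
# illustrative, not Brainbrew's actual scoring rules.
REFUSAL_MARKERS = ("i cannot", "i can't", "as an ai")

def grade_example(instruction: str, output: str) -> str:
    """Return a SUPER/GOOD/NORMAL/BAD/DISASTER grade for one record."""
    text = output.lower()
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "DISASTER"          # refusals are useless as training data
    words = len(output.split())
    if words < 5:
        return "BAD"               # too short to teach anything
    if words < 30:
        return "NORMAL"
    if words < 120:
        return "GOOD"
    return "SUPER"                 # long, substantive answer

print(grade_example("Explain X", "I cannot help with that."))  # → DISASTER
```

A real grader would also weigh instruction/output relevance (e.g. with an LLM judge), but even cheap heuristics like these catch the worst records before export.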
- Quality Modes: Fast (cheap & quick), Balanced (sweet spot), Research (maximum brain juice)
- Output Formats: Alpaca, ShareGPT, ChatML, OpenAI — pick what your training framework needs
- Smart Filtering: Automatic refusal cleaning + quality scoring dashboard
- Multi-Model Ensemble: Split prompts across multiple teacher models for diversity
- Deduplication: Exact hash + near-duplicate Jaccard filtering
- Cost Estimator: See estimated cost and time before you click Generate
- Live Stats: Record count, average output length, uniqueness ratio
- Dataset Preview: See the first 5 examples before downloading
- Checkpoint/Resume: Large runs save state for crash recovery
- Pydantic Config: Type-safe everything (no more surprise crashes)
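The "exact hash + near-duplicate Jaccard" filtering above can be sketched in a few lines. This is a minimal illustration of shingle-based Jaccard deduplication, not Brainbrew's internal code; the shingle size and 0.8 threshold are assumed defaults.

```python
# Near-duplicate filtering via word-shingle Jaccard similarity.
# Shingle size (3) and threshold (0.8) are illustrative defaults.

def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of n-word shingles for a text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedupe(records: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a record only if it isn't too similar to an already-kept one."""
    kept: list[str] = []
    kept_shingles: list[set] = []
    for rec in records:
        s = shingles(rec)
        if all(jaccard(s, prev) < threshold for prev in kept_shingles):
            kept.append(rec)
            kept_shingles.append(s)
    return kept

docs = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog today",  # near-duplicate
    "Completely different instruction about baking bread",
]
print(len(dedupe(docs)))  # → 2
```

Exact duplicates score a Jaccard of 1.0, so the same pass removes them too; in practice you'd hash records first so the cheap exact check runs before the quadratic similarity scan.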
```bash
git clone https://github.com/Yog-Sotho/Brainbrew.git
cd Brainbrew
bash install.sh
```

The installer handles everything: Python version check, virtual environment, pip dependencies, GPU detection, and .env setup.
```bash
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.sample .env
```

Edit .env:

```bash
OPENAI_API_KEY=sk-...
HF_TOKEN=hf_...
HF_USERNAME=yourusername
```

Then launch the app:

```bash
streamlit run app.py
```

Boom. Browser opens. You're now a dataset wizard.
```bash
docker build -t brainbrew .
docker run --gpus all -p 8501:8501 --env-file .env brainbrew
```

Or use the installer:

```bash
bash install.sh --docker
```

Open http://localhost:8501 and flex.
- Upload your PDFs or TXT files (multiple OK!)
- Pick your teacher model (GPT-4o for API, or Llama-3.1-8B for vLLM speed)
- Optionally enter multiple models comma-separated for ensemble diversity
- Choose quality mode (Fast / Balanced / Research)
- Choose output format (Alpaca / ShareGPT / ChatML / OpenAI)
- Slide to desired dataset size
- Optional: enable semantic chunking, deduplication, LoRA training, HF publish
- Smash the big Generate Dataset button
- Check your quality score, preview examples, and download
Done. Go train a model that actually knows your niche.
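The four output formats store the same instruction/response pair in different layouts. Here is a sketch of three of them (Alpaca, ShareGPT, and OpenAI fine-tuning JSONL) using the common community conventions for each; ChatML is a text template with `<|im_start|>`/`<|im_end|>` role markers rather than a JSON schema. The example record is invented for illustration.

```python
import json

# One instruction/response pair, shown in three export layouts.
example = {"instruction": "Summarise the attached policy.",
           "input": "",
           "output": "The policy covers ..."}

# Alpaca: flat instruction/input/output records
alpaca = example

# ShareGPT: a list of role-tagged conversation turns
sharegpt = {"conversations": [
    {"from": "human", "value": example["instruction"]},
    {"from": "gpt", "value": example["output"]},
]}

# OpenAI fine-tuning JSONL: one {"messages": [...]} object per line
openai_record = {"messages": [
    {"role": "user", "content": example["instruction"]},
    {"role": "assistant", "content": example["output"]},
]}

print(json.dumps(openai_record))
```

Pick whichever shape your training framework parses natively; Axolotl and LLaMA-Factory, for instance, accept both Alpaca and ShareGPT out of the box.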
- Use vLLM — Lightning fast (needs 24+ GB VRAM)
- OpenAI API Key — fallback for laptop warriors
- HF Token — for publishing
- Semantic Chunking — paragraph-aware splitting (experimental)
- Deduplication — remove near-duplicate instruction/output pairs
- Temperature, LoRA rank, batch size — smart-defaulted but tweakable
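The semantic chunking setting can be pictured as greedy paragraph packing: paragraphs are accumulated into a chunk until a size limit is hit, and the split always lands on a paragraph boundary. Brainbrew delegates this to LangChain text splitters; the sketch below just shows the idea, with an assumed character limit.

```python
# Illustrative paragraph-aware chunking: never splits mid-paragraph.
# max_chars is an assumed knob, not Brainbrew's actual setting name.

def chunk_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)   # close the chunk at a topic boundary
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = ("First topic paragraph."
       "\n\n" "Second paragraph, same topic."
       "\n\n" "A third paragraph about something else entirely.")
print(len(chunk_by_paragraph(doc, max_chars=60)))  # → 2
```

Compared with fixed-size character splitting, this keeps each chunk topically coherent, which gives the teacher model cleaner context to generate instructions from.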
- Streamlit – beautiful UI
- distilabel 1.5.x – the real MVP (Evol-Instruct + generation + filtering)
- vLLM – GPU wizardry
- Unsloth – fastest LoRA training on the planet
- LangChain text splitters – character & semantic chunking
- Pydantic + Structlog – no more "it worked on my machine" excuses
- pytest – 132+ tests with CI/CD via GitHub Actions
| Mode | GPU Needed? | Speed | Cost |
|---|---|---|---|
| OpenAI API | None | Medium | $$ (API) |
| vLLM (8B) | 24 GB+ VRAM | Blazing | Free |
| LoRA training | 8 GB+ VRAM | Fast | Free |
Pro tip: Start with OpenAI mode. Once it works, flex with vLLM on RunPod/Modal.
- "CUDA out of memory" — Turn off vLLM or use a smaller model
- OpenAI rate limit — Chill, use a smaller batch size or wait
- Nothing happens — Check console + make sure you uploaded files
- HF publish fails — Token wrong? Repo name taken? Classic.
- bitsandbytes error — Needs CUDA. Expected on CPU-only machines.
Still stuck? Open an issue. We'll roast the bug together.
Brainbrew ships with 132+ automated tests covering config validation, security (API key leakage, filename sanitisation), pipeline orchestration, exporter formats, LoRA training, HF publishing, and more. No GPU required to run tests.
```bash
# Install test deps
pip install pytest

# Run all tests
pytest tests/ -v

# Run just security tests
pytest tests/test_security.py -v
```

CI runs automatically on every push and PR via GitHub Actions.
Love it? Want to make it even cooler?
- Fork it
- Make changes (we love clean PRs)
- Run `pytest tests/ -v` and make sure everything passes
- Submit PR
Ideas welcome: RAG retrieval, multi-modal support, web UI for cloud, additional export formats, etc.
MIT — do whatever you want. Just don't blame us if your model becomes too powerful and takes over the world.


