
Brainbrew

The ridiculously easy, stupidly powerful no-code machine that turns your boring PDFs and TXT files into god-tier synthetic LLM training data


Brainbrew — Think of it like a mad scientist + coffee machine combo: you dump in documents, hit one button, and BOOM — fresh, high-quality instruction datasets appear like magic. No coding. No spreadsheets. No crying over JSON formatting at 3 a.m.

We took the original prototype, slayed every bug, switched to production-grade distilabel magic, added semantic chunking, multi-model ensemble, dataset deduplication, quality scoring, progress bars, Docker, and a bunch of other goodies… then wrapped it in a shiny Streamlit UI that even your grandma could use.

Current version: v1.2.0

Knowledge inputs flowing into Brainbrew

Drop in any knowledge — PDFs, text, books, docs — and let Brainbrew do the rest.


Why Brainbrew Slaps

  • Zero coding — literally just upload files and click "Generate Dataset"
  • Distilabel-powered evolution — Evol-Instruct with configurable evolution depth
  • Multi-model ensemble — comma-separate your models for diverse, high-quality output
  • Semantic chunking — paragraph-aware document splitting that respects topic boundaries
  • Dataset deduplication — exact-match + near-duplicate removal via shingle Jaccard
  • Quality scoring — SUPER / GOOD / NORMAL / BAD / DISASTER grades after generation
  • 4 export formats — Alpaca, ShareGPT, ChatML, and OpenAI fine-tuning JSONL
  • vLLM or OpenAI — choose speed (GPU) or zero-setup (API)
  • Auto LoRA training — optional one-click fine-tune with Unsloth
  • Hugging Face publish — one checkbox and your dataset is live on the Hub
  • Resume support — crashed runs resume from the last completed batch
  • Error handling & progress bars — because crashes are for amateurs
  • Docker ready — run it anywhere without summoning the dependency demon
  • 132+ automated tests — full CI/CD with pytest, ruff, and mypy

In short: it's the tool every AI tinkerer has wanted and never quite found.
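The deduplication pass mentioned above boils down to two filters: an exact content hash plus word-shingle Jaccard similarity for near-duplicates. A minimal sketch of the idea (illustrative, not Brainbrew's actual implementation):

```python
import hashlib

def shingles(text: str, k: int = 5) -> set[str]:
    """Lowercased word k-shingles of a string."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def deduplicate(records: list[str], threshold: float = 0.8) -> list[str]:
    """Drop exact duplicates (SHA-256) and near-duplicates (shingle Jaccard)."""
    kept: list[str] = []
    kept_shingles: list[set[str]] = []
    seen_hashes: set[str] = set()
    for rec in records:
        h = hashlib.sha256(rec.encode()).hexdigest()
        if h in seen_hashes:
            continue  # exact duplicate
        sh = shingles(rec)
        if any(jaccard(sh, prev) >= threshold for prev in kept_shingles):
            continue  # near-duplicate of something already kept
        seen_hashes.add(h)
        kept.append(rec)
        kept_shingles.append(sh)
    return kept
```

The threshold is the usual knob: higher keeps more borderline pairs, lower is more aggressive.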


Features

  • Quality Modes: Fast (cheap & quick), Balanced (sweet spot), Research (maximum brain juice)
  • Output Formats: Alpaca, ShareGPT, ChatML, OpenAI — pick what your training framework needs
  • Smart Filtering: Automatic refusal cleaning + quality scoring dashboard
  • Multi-Model Ensemble: Split prompts across multiple teacher models for diversity
  • Deduplication: Exact hash + near-duplicate Jaccard filtering
  • Cost Estimator: See estimated cost and time before you click Generate
  • Live Stats: Record count, average output length, uniqueness ratio
  • Dataset Preview: See the first 5 examples before downloading
  • Checkpoint/Resume: Large runs save state for crash recovery
  • Pydantic Config: Type-safe everything (no more surprise crashes)

Quick Start (Takes 2 Minutes)

1. Clone & Setup

git clone https://github.com/Yog-Sotho/Brainbrew.git
cd Brainbrew

2. Run the installer (Python 3.12+)

bash install.sh

The installer handles everything: Python version check, virtual environment, pip dependencies, GPU detection, and .env setup.

3. Or install manually

python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.sample .env

Edit .env:

OPENAI_API_KEY=sk-...
HF_TOKEN=hf_...
HF_USERNAME=yourusername

4. Run It

streamlit run app.py

Boom. Browser opens. You're now a dataset wizard.


Docker (For the Cool Kids)

docker build -t brainbrew .
docker run --gpus all -p 8501:8501 --env-file .env brainbrew

Or use the installer:

bash install.sh --docker

Open http://localhost:8501 and flex.


How to Use (So Easy It's Embarrassing)

  1. Upload your PDFs or TXT files (multiple OK!)
  2. Pick your teacher model (GPT-4o for API, or Llama-3.1-8B for vLLM speed)
  3. Optionally enter multiple models comma-separated for ensemble diversity
  4. Choose quality mode (Fast / Balanced / Research)
  5. Choose output format (Alpaca / ShareGPT / ChatML / OpenAI)
  6. Slide to desired dataset size
  7. Optional: enable semantic chunking, deduplication, LoRA training, HF publish
  8. Smash the big Generate Dataset button
  9. Check your quality score, preview examples, and download

Done. Go train a model that actually knows your niche.
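For context, the four output formats in step 5 are just different shapes of the same instruction/output record. A rough sketch, using the field names these formats conventionally use:

```python
import json

record = {
    "instruction": "Summarise the document.",
    "input": "",
    "output": "The document describes...",
}

# Alpaca: flat instruction / input / output fields
alpaca = record

# ShareGPT: conversation turns tagged "from": "human" / "gpt"
sharegpt = {"conversations": [
    {"from": "human", "value": record["instruction"]},
    {"from": "gpt", "value": record["output"]},
]}

# OpenAI fine-tuning JSONL: one {"messages": [...]} chat per line
openai_line = json.dumps({"messages": [
    {"role": "user", "content": record["instruction"]},
    {"role": "assistant", "content": record["output"]},
]})

# ChatML: plain text with <|im_start|> / <|im_end|> delimiters
chatml = (f"<|im_start|>user\n{record['instruction']}<|im_end|>\n"
          f"<|im_start|>assistant\n{record['output']}<|im_end|>")
```

Pick whichever shape your training framework ingests; the information content is identical.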

Brainbrew output flowing into your model

High-quality Q&A pairs stream straight into your model's brain. Automated study, zero effort.


Advanced Settings (Sidebar)

  • Use vLLM — Lightning fast (needs 24+ GB VRAM)
  • OpenAI API Key — fallback for laptop warriors
  • HF Token — for publishing
  • Semantic Chunking — paragraph-aware splitting (experimental)
  • Deduplication — remove near-duplicate instruction/output pairs
  • Temperature, LoRA rank, batch size — smart-defaulted but tweakable
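Semantic chunking, at its simplest, means splitting on paragraph boundaries before applying a size limit, so no chunk straddles a topic break. A minimal stand-in (Brainbrew itself uses LangChain text splitters for this):

```python
def semantic_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack whole paragraphs (blank-line separated) into chunks
    of up to max_chars, never splitting inside a paragraph.
    Illustrative sketch only."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)   # close the chunk at a paragraph boundary
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```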

Tech Stack

  • Streamlit – beautiful UI
  • distilabel 1.5.x – the real MVP (Evol-Instruct + generation + filtering)
  • vLLM – GPU wizardry
  • Unsloth – fastest LoRA training on the planet
  • LangChain text splitters – character & semantic chunking
  • Pydantic + Structlog – no more "it worked on my machine" excuses
  • pytest – 132+ tests with CI/CD via GitHub Actions

Hardware Requirements

| Mode          | GPU needed?  | Speed   | Cost     |
|---------------|--------------|---------|----------|
| OpenAI API    | None         | Medium  | $$ (API) |
| vLLM (8B)     | 24 GB+ VRAM  | Blazing | Free     |
| LoRA training | 8 GB+ VRAM   | Fast    | Free     |

Pro tip: Start with OpenAI mode. Once it works, flex with vLLM on RunPod/Modal.
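Before committing to API spend, a back-of-envelope check in the spirit of the built-in Cost Estimator looks like this. The per-token price below is a placeholder; check your provider's current rates:

```python
def estimate_cost(n_examples: int, tokens_per_example: int = 800,
                  usd_per_million_tokens: float = 5.0) -> float:
    """Rough API cost: examples x average tokens x price per token.
    Both defaults are assumed placeholders, not real pricing."""
    total_tokens = n_examples * tokens_per_example
    return total_tokens / 1_000_000 * usd_per_million_tokens

# e.g. a 2,000-example dataset at these assumptions costs about $8
```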


Troubleshooting

  • "CUDA out of memory" — Turn off vLLM or switch to a smaller model
  • OpenAI rate limit — Chill, use a smaller batch size, or wait a bit
  • Nothing happens — Check console + make sure you uploaded files
  • HF publish fails — Token wrong? Repo name taken? Classic.
  • bitsandbytes error — Needs CUDA. Expected on CPU-only machines.

Still stuck? Open an issue. We'll roast the bug together.


Testing

Brainbrew ships with 132+ automated tests covering config validation, security (API key leakage, filename sanitisation), pipeline orchestration, exporter formats, LoRA training, HF publishing, and more. No GPU required to run tests.

# Install test deps
pip install pytest

# Run all tests
pytest tests/ -v

# Run just security tests
pytest tests/test_security.py -v

CI runs automatically on every push and PR via GitHub Actions.


Contributing

The Brainbrew community brewing together

Every great dataset starts with great contributors. Jump in — the cauldron is warm.

Love it? Want to make it even cooler?

  1. Fork it
  2. Make changes (we love clean PRs)
  3. Run pytest tests/ -v and make sure everything passes
  4. Submit PR

Ideas welcome: RAG retrieval, multi-modal support, web UI for cloud, additional export formats, etc.


License

MIT — do whatever you want. Just don't blame us if your model becomes too powerful and takes over the world.


Now go brew some brains.

Made with chaos, coffee, and zero patience for bad datasets.

Star the repo if it saved you 20 hours this week. You know you want to.

Yog-Sotho ❤️
