Dataset Lab is a powerful, file-based dataset engineering system for creating, refining, and exporting high-quality instruction-style QA datasets. Built with Python/FastAPI (backend) and React/Vite (frontend).
```bash
# 1. Clone the repository
git clone https://github.com/amarnath123456789/Dataset-Creator-App.git
cd Dataset-Creator-App

# 2. Install everything (Python venv + pip + npm — all automated)
python install.py

# 3. Start the app
python datasetlab.py start
```

The app will open automatically at http://localhost:5173 🎉

Windows users: you can also just double-click `start.bat` — no terminal needed.
All commands are run through the `datasetlab.py` CLI:

```bash
python datasetlab.py <command>
```

| Command | Description |
|---|---|
| `start` | Start the backend + frontend servers |
| `stop` | Stop all running servers |
| `status` | Show live server status |
| `open` | Open the app in your browser |
| `logs` | Show recent server log output |
- Upload & Chunking — Ingest source documents and split into manageable chunks.
- Generation — Use local or cloud LLMs to generate QA pairs.
- Refinement — Clean, filter, and format the generated datasets.
- Export — Export to JSON, JSONL, or CSV ready for fine-tuning.
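The chunking stage above can be pictured with a small sketch. This is an illustration only, not Dataset Lab's actual implementation: `chunk_text` is a hypothetical helper whose defaults mirror the `DEFAULT_CHUNK_SIZE` / `DEFAULT_CHUNK_OVERLAP` settings from the config.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping character chunks (hypothetical sketch)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # each chunk starts `step` chars after the last
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # final chunk already reaches the end of the text
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary visible to both chunks, which tends to improve QA generation at the edges.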
| Requirement | Details |
|---|---|
| OS | Windows 10/11, macOS (M1/M2/Intel), Linux |
| Python | 3.9+ — Download |
| Node.js | 18+ — Download |
| RAM | 8GB minimum (16GB+ recommended for local LLMs) |
| Disk | 2GB+ for app; 4–10GB per local model |
| Ollama | Optional — for offline local LLMs — Download |
`install.py` checks all of these for you and will warn you if anything is missing.

`install.py` creates the `.env` file for you automatically. You can also create or edit it manually at `dataset-lab/.env`:
```env
# Document Processing
DEFAULT_CHUNK_SIZE=800
DEFAULT_CHUNK_OVERLAP=100
DEFAULT_SIMILARITY_THRESHOLD=0.92

# Cloud LLM API Keys (optional — only needed for cloud models)
OPENAI_API_KEY=your_openai_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here
```

To use local models with Ollama:

- Install Ollama from ollama.com
- Pull a model: `ollama run llama3`
- Dataset Lab auto-detects Ollama at `http://localhost:11434`.
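Auto-detection can be approximated with a quick HTTP check against Ollama's `/api/tags` endpoint (Ollama's REST endpoint for listing installed models). This is a hedged sketch of the idea, not Dataset Lab's own code; `ollama_available` is a hypothetical helper.

```python
import json
import urllib.error
import urllib.request

def ollama_available(base_url: str = "http://localhost:11434",
                     timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url (sketch)."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            data = json.load(resp)
            # A healthy server responds with {"models": [...]}
            return "models" in data
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or non-JSON response: treat as absent
        return False
```

A short timeout matters here so the app starts quickly even when no local LLM is installed.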
If you prefer to set up manually without the installer:
```bash
# Backend
cd dataset-lab
python -m venv .venv
.venv\Scripts\activate       # Windows
source .venv/bin/activate    # macOS/Linux
pip install -r backend/requirements.txt

# Frontend
cd frontend
npm install

# Run backend (from dataset-lab/)
python -m backend.main

# Run frontend (new terminal, from dataset-lab/frontend/)
npm run dev
```

Project layout:

```text
Dataset-Creator-App/
├── install.py          ← One-command installer
├── datasetlab.py       ← CLI runner (start/stop/status/open/logs)
├── start.bat           ← Windows double-click starter
├── start.sh            ← macOS/Linux shell starter
└── dataset-lab/
    ├── backend/        ← FastAPI backend
    │   ├── engines/    ← LLM & processing logic
    │   ├── routes/     ← API endpoints
    │   ├── main.py     ← Entry point
    │   └── requirements.txt
    ├── frontend/       ← React / Vite frontend
    │   ├── src/
    │   └── package.json
    ├── projects/       ← Generated project data
    ├── .venv/          ← Python virtual environment (created by installer)
    ├── .logs/          ← Server logs (created on start)
    └── .env            ← Your config (created by installer)
```
| Problem | Fix |
|---|---|
| `python install.py` fails | Ensure Python 3.9+ and Node 18+ are installed and on your PATH |
| Pipeline stuck in "Running" | Delete `.running` / `.stop` files in `dataset-lab/projects/<project>/` |
| Cannot connect to Ollama | Run `ollama run llama3` and verify it serves at `http://localhost:11434` |
| Port 8000/5173 in use | Stop the conflicting process or change the port in `backend/main.py` |
| Backend/frontend crashed | Run `python datasetlab.py logs` to see what went wrong |
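The stuck-pipeline fix can also be scripted. A minimal sketch, assuming the `.running` / `.stop` marker files sit directly in the project folder as described above; `clear_pipeline_markers` is a hypothetical helper, not part of Dataset Lab's CLI.

```python
from pathlib import Path

def clear_pipeline_markers(project_dir: str) -> list[str]:
    """Delete stale .running / .stop marker files; return what was removed."""
    removed = []
    for name in (".running", ".stop"):
        marker = Path(project_dir) / name
        if marker.exists():
            marker.unlink()  # remove the stale marker so the pipeline can restart
            removed.append(name)
    return removed
```

Only run this after confirming the pipeline process really is dead, otherwise a live run could be disrupted.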
| Error | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError` | Missing Python package | Run `python install.py` again |
| CORS Error | Frontend can't reach backend | Make sure the backend is running (`python datasetlab.py status`) |
| `npm error: …` | Missing Node modules | Run `python install.py` again |
- Fork the repository.
- Create a branch: `git checkout -b feature/awesome-feature`
- Commit with clear messages.
- Push and open a Pull Request.
Thank you for using Dataset Lab! Found a bug? Open an issue on GitHub.