A complete guide to building the data pipeline behind models like GPT-4, Claude, DeepSeek-V3, and Llama 3 — from crawling raw internet data to producing a training-ready token stream.
- Overview: What Goes Into an LLM
- Phase 1: Web Crawling
- Phase 2: Text Extraction
- Phase 3: Quality Filtering
- Phase 4: Deduplication
- Phase 5: Text Cleaning & Normalization
- Phase 6: Domain-Specific Data
- Phase 7: Custom Data — GitHub, Google, PDFs, APIs
- Phase 8: Synthetic Data Generation
- Phase 9: Tokenization
- Phase 10: Data Mixing & Packaging
- Phase 11: Curriculum Learning & Data Scheduling
- Phase 12: Training at Scale
- Phase 13: Post-Training — SFT, DPO, RLHF
- Phase 14: End-to-End with superGPT
- Appendix: Open Datasets
- References
A frontier LLM requires trillions of tokens of diverse, high-quality data. Here's what the leading labs actually use:
| Model | Total Tokens | General Text | Code | Math & Reasoning | Multilingual |
|---|---|---|---|---|---|
| Llama 3 | 15T | 50% | 17% | 25% | 8% |
| DeepSeek-V3 | 14.8T | ~55% | ~20% | ~15% | ~10% |
| GPT-4 | ~13T (est.) | ~50% | ~20% | ~20% | ~10% |
┌──────────────┐ ┌───────────────┐ ┌──────────────┐
│ RAW CRAWL │───▶│ EXTRACTION │───▶│ FILTERING │
│ (petabytes) │ │ HTML → Text │ │ Quality/PII │
└──────────────┘ └───────────────┘ └──────────────┘
│
┌───────────────┐ ┌──────▼───────┐
│ TOKENIZE │◀───│ DEDUP │
│ BPE/SentPc │ │ MinHash/LSH │
└───────┬───────┘ └──────────────┘
│
┌───────▼───────┐ ┌──────────────┐
│ MIX & PACK │───▶│ TRAINING │
│ Ratios/Shards│ │ FSDP/DeepSp │
└───────────────┘ └──────────────┘
Common Crawl is a free, open repository of web crawl data containing petabytes of web pages. This is what FineWeb, RedPajama, The Pile, and most open LLM datasets are built from.
# Common Crawl data is on AWS S3 (free, no auth required)
# Each monthly crawl = ~3 billion pages
# List available crawls
aws s3 ls s3://commoncrawl/ --no-sign-request
# Download a WARC segment
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2024-10/segments/... \
./data/raw/ --no-sign-request
File formats:
- WARC (Web ARChive): Raw HTTP responses + HTML — use this one
- WET: Pre-extracted plain text (lower quality)
- WAT: Metadata only (URLs, timestamps)
Tip
Always process WARC files yourself with Trafilatura rather than using WET files. Your text quality will be dramatically better.
"""Custom web crawler for specialized domains."""
import scrapy
class LLMDataSpider(scrapy.Spider):
name = "llm_data"
custom_settings = {
'ROBOTSTXT_OBEY': True, # Always respect robots.txt
'DOWNLOAD_DELAY': 1.0, # 1 req/sec (be polite)
'CONCURRENT_REQUESTS': 8,
'DEPTH_LIMIT': 3,
}
def parse(self, response):
text = ' '.join(response.css('p::text, h1::text, h2::text').getall())
if len(text) > 200:
yield {'url': response.url, 'text': text}
for href in response.css('a::attr(href)').getall():
yield response.follow(href, self.parse)
Crawling tools comparison:
| Tool | Best For | Scale |
|---|---|---|
| Common Crawl | Full web corpus, free | Petabyte |
| Scrapy | Custom domain crawls | Medium |
| Trafilatura | Text extraction from HTML | Any |
| Firecrawl | JS-rendered pages | Small-Med |
Caution
Always check: robots.txt, Terms of Service, copyright/licensing, GDPR/CCPA for personal data. Use permissively-licensed data when possible.
Raw HTML is full of menus, ads, JavaScript. You need only the main content.
"""Extract clean text from HTML using Trafilatura (industry standard)."""
import trafilatura
def extract_from_warc(warc_path):
"""Process a Common Crawl WARC file."""
import warcio
with open(warc_path, 'rb') as f:
for record in warcio.ArchiveIterator(f):
if record.rec_type == 'response':
html = record.content_stream().read().decode('utf-8', errors='ignore')
text = trafilatura.extract(html, favor_precision=True)
if text and len(text) > 200:
yield {
'url': record.rec_headers.get('WARC-Target-URI'),
'text': text,
}
For petabyte scale, use Apache Spark on AWS EMR:
# Spark job: distribute across 1000+ executors
warc_paths = spark.read.text("s3://commoncrawl/.../warc.paths.gz")
extracted = warc_paths.rdd.flatMap(extract_from_warc)
extracted.saveAsTextFile("s3://my-bucket/extracted/")
This is the most impactful step. Quality > quantity.
"""Based on FineWeb, RedPajama, and DeepMind Gopher heuristics."""
import re
from collections import Counter
def heuristic_filter(text: str) -> tuple[bool, str]:
"""Returns (pass, reason). Document must pass ALL checks."""
words = text.split()
n_words = len(words)
if n_words < 50: return False, "too_short"
if n_words > 100_000: return False, "too_long"
# Average word length (catches garbled text)
avg_wl = sum(len(w) for w in words) / n_words
if avg_wl < 3 or avg_wl > 10: return False, "unusual_words"
# Alpha ratio (catches code dumps, symbol spam)
alpha = sum(c.isalpha() for c in text)
if alpha / max(len(text), 1) < 0.5: return False, "low_alpha"
# Line repetition
lines = text.split('\n')
if len(set(lines)) / max(len(lines), 1) < 0.5:
return False, "too_repetitive"
# Top bigram frequency (from Gopher paper)
bigrams = [f"{words[i]} {words[i+1]}" for i in range(len(words)-1)]
bc = Counter(bigrams)
if bc and bc.most_common(1)[0][1] / len(bigrams) > 0.2:
return False, "repeated_ngrams"
# URL density (catches link farms)
urls = re.findall(r'https?://\S+', text)
if len(urls) / max(n_words, 1) > 0.1:
return False, "too_many_urls"
return True, "passed"
"""Filter by language using fastText (176 languages)."""
import fasttext
model = fasttext.load_model('lid.176.bin')
def detect_language(text):
pred = model.predict(text.replace('\n', ' ')[:1000], k=1)
return pred[0][0].replace('__label__', ''), pred[1][0]
def is_english(text, min_conf=0.65):
lang, conf = detect_language(text)
return lang == 'en' and conf >= min_conf
This is what separates FineWeb-Edu from FineWeb — and why it trains dramatically better models:
"""
The FineWeb-Edu approach:
1. Use Llama-3-70B to score 500K samples on educational quality (0-5)
2. Train a fast fastText classifier on those labels
3. Run classifier on the full 15T token corpus
4. Keep only documents scoring ≥ 3
This single step improved downstream benchmarks by 10-20%.
"""
QUALITY_PROMPT = """Rate the educational quality of this text (0-5):
0 = Spam/ads 1 = Social media 2 = Average web content
3 = Informative 4 = Educational 5 = Textbook-quality
Text: {text}
Score:"""
# Score 500K samples with a big LLM → train fastText → filter everything
# pip install fasttext
# fasttext supervised -input labels.txt -output quality_model -epoch 25
"""Remove personal data (email, phone, SSN, credit cards)."""
import re
PII_PATTERNS = {
'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b',
'phone': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
'cc': r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b',
}
def redact_pii(text):
for name, pattern in PII_PATTERNS.items():
text = re.sub(pattern, f'[{name.upper()}_REDACTED]', text)
return text
Web crawls contain 30-50% duplicates. Training on duplicates wastes compute and causes memorization.
import hashlib
def exact_dedup(documents):
seen = set()
for doc in documents:
h = hashlib.sha256(doc['text'].encode()).hexdigest()
if h not in seen:
seen.add(h)
yield doc
Used by FineWeb, RedPajama, The Pile, Dolma. Finds documents that are ~80%+ similar.
"""
MinHash LSH — the industry standard for near-duplicate detection.
1. Convert documents to n-gram sets (shingles)
2. Compute MinHash signatures (compact fingerprints)
3. Use LSH banding to find candidate pairs
4. Remove near-duplicates (Jaccard similarity > 0.8)
"""
from datasketch import MinHash, MinHashLSH
def create_minhash(text, num_perm=128):
m = MinHash(num_perm=num_perm)
words = text.lower().split()
for i in range(len(words) - 4):
shingle = ' '.join(words[i:i+5])
m.update(shingle.encode('utf-8'))
return m
def fuzzy_dedup(documents, threshold=0.8):
lsh = MinHashLSH(threshold=threshold, num_perm=128)
minhashes = {}
for doc in documents:
mh = create_minhash(doc['text'])
minhashes[doc['id']] = mh
try:
lsh.insert(doc['id'], mh)
except ValueError:
pass # Already exists
# Find duplicates
duplicates = set()
for doc_id, mh in minhashes.items():
if doc_id in duplicates: continue
for candidate in lsh.query(mh):
if candidate != doc_id:
duplicates.add(candidate)
return [d for d in documents if d['id'] not in duplicates]
For trillion-token scale, use suffix array dedup (finds duplicate substrings across documents):
# pip install text-dedup
python -m text_dedup.suffix_array --input corpus.jsonl --output deduped.jsonl --k 50
Even after filtering and dedup, raw text has encoding issues, invisible characters, and formatting problems that hurt training.
"""
Unicode normalization — critical for consistent tokenization.
Without this, 'é' (precomposed) and 'é' (e + combining accent) are different tokens.
"""
import unicodedata
import re
def normalize_unicode(text: str) -> str:
"""Normalize to NFC form and clean problematic characters."""
# NFC: precompose characters (é as single char, not e + accent)
text = unicodedata.normalize('NFC', text)
# Remove zero-width characters (invisible but break tokenization)
text = re.sub(r'[\u200b\u200c\u200d\ufeff\u00ad]', '', text)
# Normalize quotes and dashes
text = text.replace('"', '"').replace('"', '"')
text = text.replace(''', "'").replace(''', "'")
text = text.replace('–', '-').replace('—', '-')
text = text.replace('…', '...')
# Remove control characters (except newline, tab)
text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]', '', text)
return text
"""
Deep clean text extracted from web pages.
Even after Trafilatura, some HTML artifacts remain.
"""
import re
def deep_clean_text(text: str) -> str:
"""Remove residual HTML, fix encoding, normalize whitespace."""
# Remove any residual HTML tags
text = re.sub(r'<[^>]+>', '', text)
# Remove HTML entities
import html
text = html.unescape(text)
# Fix common encoding artifacts
replacements = {
'’': "'", '“': '"', 'â€\x9d': '"',
'â€"': '—', 'â€"': '–', ' ': ' ',
'é': 'é', 'á': 'á', 'ñ': 'ñ',
}
for bad, good in replacements.items():
text = text.replace(bad, good)
# Collapse excessive whitespace
text = re.sub(r'[ \t]+', ' ', text) # Multiple spaces → one
text = re.sub(r'\n{3,}', '\n\n', text) # 3+ newlines → 2
text = re.sub(r'(\n\s*){3,}', '\n\n', text) # Blank lines
# Remove lines that are just URLs
lines = text.split('\n')
lines = [l for l in lines if not re.match(r'^\s*https?://\S+\s*$', l)]
# Remove lines that are just navigation (short repeated patterns)
lines = [l for l in lines if len(l.strip()) > 3 or l.strip() == '']
return '\n'.join(lines).strip()
"""
Complete text cleaning pipeline — run this on every document
AFTER extraction and BEFORE tokenization.
"""
def clean_document(text: str) -> str | None:
"""Full cleaning pipeline. Returns None if document should be dropped."""
# Step 1: Unicode normalize
text = normalize_unicode(text)
# Step 2: Deep clean (HTML artifacts, encoding issues)
text = deep_clean_text(text)
# Step 3: Remove excessive repetition within document
lines = text.split('\n')
seen_lines = set()
deduped_lines = []
for line in lines:
stripped = line.strip().lower()
if stripped and stripped in seen_lines:
continue # Skip duplicate lines within same document
seen_lines.add(stripped)
deduped_lines.append(line)
text = '\n'.join(deduped_lines)
# Step 4: Final length check
words = text.split()
if len(words) < 50:
return None # Too short after cleaning
return text
Code dramatically improves reasoning — even for non-code tasks.
Primary source: The Stack v2 — 6.4 TB, 358 languages, permissively licensed.
from datasets import load_dataset
ds = load_dataset("bigcode/the-stack-v2", split="train", streaming=True)
def filter_code(example):
code = example['content']
if len(code) < 100 or len(code) > 100_000: return False
# Skip auto-generated
if any(m in code[:500] for m in ['auto-generated', 'DO NOT EDIT']): return False
# Skip minified
if len(code) / max(code.count('\n'), 1) > 200: return False
return True
# Language weights (by impact on reasoning)
LANG_WEIGHTS = {
'python': 0.25, 'javascript': 0.12, 'typescript': 0.10,
'java': 0.10, 'c++': 0.08, 'c': 0.06, 'go': 0.05,
'rust': 0.05, 'shell': 0.03, 'sql': 0.03,
}
Mathematical data is critical for reasoning transfer.
| Dataset | Size | Content | Use |
|---|---|---|---|
| OpenWebMath | 14.7B tokens | Math from web (LaTeX, proofs) | Pre-training |
| MathPile | 9.5B tokens | Textbooks, arXiv, StackExchange | Pre-training |
| GSM8K | 8.5K problems | Grade school word problems | SFT |
| MATH | 12.5K problems | Competition math (AMC, Olympiad) | SFT |
| NuminaMath-CoT | 860K problems | CoT solutions, all levels | SFT |
# Key: preserve LaTeX notation during processing!
def process_math(text):
# DON'T strip $...$ or \[...\] — these teach math reasoning
text = re.sub(r'\n{3,}', '\n\n', text)
has_math = any(m in text for m in ['$', '\\frac', '\\sum', 'theorem', 'proof'])
return text if has_math else None
- arXiv: 2M+ papers with full LaTeX source — Kaggle mirror
- S2ORC: 81M papers across all fields — GitHub
- PubMed: 35M+ biomedical abstracts
- Project Gutenberg: 70K+ public domain books — gutenberg.org
- Critical for teaching long-range coherence
| Dataset | Size | Quality |
|---|---|---|
| OpenAssistant/oasst2 | ~50M tok | High |
| WizardLM Evol-Instruct | ~100M tok | High |
| UltraChat | ~500M tok | Medium |
| Orca-Math | ~50M tok | High |
Most practitioners don't start from Common Crawl. Here's how to build a custom dataset from real-world sources.
"""
Crawl GitHub repos for code training data.
Uses the GitHub API + git clone for full content.
"""
import os
import subprocess
import requests
import json
GITHUB_TOKEN = os.getenv("GITHUB_TOKEN") # Set your PAT
HEADERS = {"Authorization": f"token {GITHUB_TOKEN}"} if GITHUB_TOKEN else {}
def search_github_repos(language: str, min_stars: int = 100, max_repos: int = 500):
"""Find popular repos by language."""
repos = []
page = 1
while len(repos) < max_repos:
url = (f"https://api.github.com/search/repositories"
f"?q=language:{language}+stars:>{min_stars}"
f"&sort=stars&per_page=100&page={page}")
resp = requests.get(url, headers=HEADERS)
data = resp.json()
if 'items' not in data:
break
for item in data['items']:
# Only permissively-licensed repos
license_key = (item.get('license') or {}).get('key', '')
if license_key in ['mit', 'apache-2.0', 'bsd-2-clause', 'bsd-3-clause']:
repos.append({
'name': item['full_name'],
'url': item['clone_url'],
'stars': item['stargazers_count'],
'license': license_key,
'language': language,
})
page += 1
if page > 5:
break
return repos[:max_repos]
def clone_and_extract(repo_url, output_dir, extensions=None):
"""Clone a repo and extract source files."""
repo_name = repo_url.split('/')[-1].replace('.git', '')
clone_dir = f"/tmp/repos/{repo_name}"
# Shallow clone (saves bandwidth)
subprocess.run(['git', 'clone', '--depth', '1', repo_url, clone_dir],
capture_output=True, timeout=60)
if extensions is None:
extensions = {'.py', '.js', '.ts', '.java', '.cpp', '.c', '.go', '.rs',
'.rb', '.php', '.swift', '.kt', '.scala', '.sh', '.sql'}
documents = []
for root, _, files in os.walk(clone_dir):
# Skip hidden dirs, tests, vendor, node_modules
if any(skip in root for skip in ['/.', '/test', '/vendor', '/node_modules',
'/__pycache__', '/dist', '/build']):
continue
for fname in files:
ext = os.path.splitext(fname)[1]
if ext in extensions:
fpath = os.path.join(root, fname)
try:
with open(fpath, 'r', errors='ignore') as f:
content = f.read()
if 100 < len(content) < 100_000: # Skip tiny/huge files
documents.append({
'text': content,
'source': f"github:{repo_url}:{fname}",
'language': ext[1:],
})
except:
pass
# Cleanup
subprocess.run(['rm', '-rf', clone_dir], capture_output=True)
return documents
# Example: Crawl top Python repos
repos = search_github_repos('python', min_stars=500, max_repos=200)
all_code = []
for repo in repos:
docs = clone_and_extract(repo['url'], 'data/code/')
all_code.extend(docs)
print(f" {repo['name']}: {len(docs)} files")
print(f"Total: {len(all_code)} code files from {len(repos)} repos")
"""
Crawl content from Google search results on specific topics.
Useful for building domain-specific datasets (medical, legal, etc.)
"""
import requests
from bs4 import BeautifulSoup
import trafilatura
import time
def google_search(query: str, num_results: int = 50) -> list[str]:
"""Get URLs from Google search (use SerpAPI for production)."""
# Option 1: Use SerpAPI (recommended, has free tier)
# pip install google-search-results
from serpapi import GoogleSearch
urls = []
for start in range(0, num_results, 10):
params = {
"q": query,
"start": start,
"num": 10,
"api_key": os.getenv("SERPAPI_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results.get("organic_results", []):
urls.append(result["link"])
time.sleep(1) # Rate limit
return urls[:num_results]
def crawl_search_results(queries: list[str]) -> list[dict]:
"""Crawl and extract text from Google search results."""
documents = []
for query in queries:
urls = google_search(query, num_results=30)
for url in urls:
try:
downloaded = trafilatura.fetch_url(url)
if downloaded:
text = trafilatura.extract(downloaded, favor_precision=True)
if text and len(text) > 200:
documents.append({
'text': text,
'source': url,
'query': query,
})
except Exception as e:
continue
time.sleep(0.5) # Be polite
return documents
# Example: Build a machine learning textbook corpus
ml_queries = [
"transformer architecture explained tutorial",
"backpropagation gradient descent mathematics",
"attention mechanism self-attention multi-head",
"convolutional neural networks image recognition",
"reinforcement learning policy gradient methods",
"natural language processing tokenization embeddings",
"generative adversarial networks training stability",
"diffusion models denoising score matching",
]
ml_docs = crawl_search_results(ml_queries)
print(f"Collected {len(ml_docs)} documents on ML topics")
"""
Extract training data from PDF files.
Great for textbooks, research papers, technical manuals.
"""
def extract_pdf_text(pdf_path: str) -> str:
"""Extract clean text from a PDF file."""
# Option 1: PyMuPDF (fastest, best quality)
import fitz # pip install PyMuPDF
doc = fitz.open(pdf_path)
text_parts = []
for page in doc:
text = page.get_text("text")
if text.strip():
text_parts.append(text)
full_text = '\n'.join(text_parts)
# Clean up PDF artifacts
import re
full_text = re.sub(r'-\n(\w)', r'\1', full_text) # Fix hyphenation
full_text = re.sub(r'(\w)\n(\w)', r'\1 \2', full_text) # Fix line breaks mid-sentence
full_text = re.sub(r'\n{3,}', '\n\n', full_text)
return full_text
def process_pdf_directory(pdf_dir: str) -> list[dict]:
"""Process all PDFs in a directory."""
import glob
documents = []
for pdf_path in glob.glob(os.path.join(pdf_dir, '**/*.pdf'), recursive=True):
try:
text = extract_pdf_text(pdf_path)
if len(text) > 500: # Skip very short PDFs
documents.append({
'text': text,
'source': pdf_path,
'type': 'pdf',
})
except Exception as e:
print(f"Error processing {pdf_path}: {e}")
return documents
# Example: Process a directory of ML textbooks
docs = process_pdf_directory('/data/textbooks/')
print(f"Extracted text from {len(docs)} PDFs")
"""
Process a Wikipedia dump — one of the cleanest, most structured data sources.
Every major LLM uses Wikipedia.
"""
from datasets import load_dataset
# HuggingFace hosts pre-processed Wikipedia
wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
documents = []
for i, article in enumerate(wiki):
text = article['text']
if len(text) > 500: # Skip stubs
documents.append({
'text': text,
'title': article['title'],
'source': 'wikipedia',
})
if i >= 100_000: # Limit for example
break
print(f"Loaded {len(documents)} Wikipedia articles")
"""
StackOverflow Q&A — excellent for teaching models to answer technical questions.
Available as public data dumps.
"""
from datasets import load_dataset
# Load StackOverflow questions/answers
so = load_dataset("koutch/stackoverflow_python", split="train", streaming=True)
def format_qa_pair(example):
"""Format as question → answer for training."""
return {
'text': (
f"Question: {example['question_body']}\n\n"
f"Answer: {example['answer_body']}\n\n"
f"Score: {example['answer_score']}"
),
'source': 'stackoverflow',
}
# Alternative: Download the full XML dump from archive.org
# https://archive.org/details/stackexchange
"""
Collect data from APIs — news, research, documentation.
"""
import requests
# Example: arXiv API for recent papers
def fetch_arxiv_papers(query: str, max_results: int = 100):
"""Fetch papers from arXiv API."""
import xml.etree.ElementTree as ET
url = f"http://export.arxiv.org/api/query?search_query=all:{query}&max_results={max_results}"
resp = requests.get(url)
root = ET.fromstring(resp.text)
ns = {'atom': 'http://www.w3.org/2005/Atom'}
papers = []
for entry in root.findall('atom:entry', ns):
title = entry.find('atom:title', ns).text.strip()
abstract = entry.find('atom:summary', ns).text.strip()
papers.append({
'text': f"Title: {title}\n\nAbstract: {abstract}",
'source': 'arxiv',
})
return papers
# Example: News API
def fetch_news(query: str, api_key: str):
"""Fetch news articles from NewsAPI."""
url = f"https://newsapi.org/v2/everything?q={query}&apiKey={api_key}&pageSize=100"
resp = requests.get(url)
articles = []
for item in resp.json().get('articles', []):
if item.get('content'):
articles.append({
'text': f"{item['title']}\n\n{item['content']}",
'source': item['url'],
})
return articles
"""
Merge data from all custom sources into a unified JSONL corpus.
"""
import json
def merge_sources(source_lists: dict[str, list], output_path: str):
"""Merge multiple data sources into a single JSONL file."""
total = 0
with open(output_path, 'w') as f:
for source_name, documents in source_lists.items():
for doc in documents:
doc['source_type'] = source_name
f.write(json.dumps(doc, ensure_ascii=False) + '\n')
total += 1
print(f"Merged {total} documents from {len(source_lists)} sources")
print(f"Saved to: {output_path}")
# Example usage:
merge_sources({
'github_code': all_code, # From 8.1
'web_articles': ml_docs, # From 8.2
'pdf_textbooks': pdf_docs, # From 8.3
'wikipedia': wiki_docs, # From 8.4
'stackoverflow': so_docs, # From 8.5
'arxiv_papers': arxiv_docs, # From 8.6
}, output_path='data/custom_corpus.jsonl')
Synthetic data is now essential for modern LLMs. DeepSeek-V3 used synthetic data from DeepSeek-R1 for reasoning. GPT-4 was trained partly on GPT-4-generated data.
"""
Self-Instruct: start with ~100 seed tasks, use an LLM to generate thousands more.
Original method from Stanford (2023). Cost-effective for SFT data.
"""
from openai import OpenAI
import json, random
client = OpenAI()
SEED_TASKS = [
{"instruction": "Write a Python function to sort a list of dictionaries by a key.",
"output": "def sort_dicts(lst, key):\n return sorted(lst, key=lambda x: x[key])"},
{"instruction": "Explain what a neural network is in simple terms.",
"output": "A neural network is a computer system inspired by the human brain..."},
{"instruction": "Solve: What is 15% of 240?",
"output": "15% of 240 = 0.15 × 240 = 36"},
# Add 100+ diverse seed tasks
]
def generate_instructions(seed_tasks: list, n_generate: int = 5000):
"""Generate new instructions using self-instruct."""
all_tasks = list(seed_tasks)
for i in range(n_generate):
# Sample 3 random examples as context
examples = random.sample(seed_tasks, min(3, len(seed_tasks)))
examples_text = "\n".join(
f"Instruction: {t['instruction']}\nOutput: {t['output']}"
for t in examples
)
prompt = f"""Here are some example tasks:
{examples_text}
Generate a new, different instruction and its output.
Be creative and diverse. Cover coding, math, writing, analysis, and reasoning.
New Instruction:"""
response = client.chat.completions.create(
model="gpt-4o-mini", # Cheap and fast
messages=[{"role": "user", "content": prompt}],
max_tokens=500,
temperature=0.9,
)
text = response.choices[0].message.content
# Parse instruction and output from response
if "Output:" in text:
parts = text.split("Output:", 1)
all_tasks.append({
"instruction": parts[0].strip(),
"output": parts[1].strip(),
})
if (i + 1) % 100 == 0:
print(f" Generated {i+1}/{n_generate} instructions")
return all_tasks
"""
Evol-Instruct (WizardLM): Take simple instructions and evolve them
through multiple rounds to increase complexity and diversity.
Evolution operators:
1. Add constraints ("...do this without using loops")
2. Deepen ("explain the underlying mathematics")
3. Concretize ("give a specific example with real data")
4. Increase reasoning ("solve step-by-step showing all work")
5. Broaden ("extend to handle edge cases")
"""
EVOLUTION_PROMPTS = {
"add_constraints": """Rewrite this instruction to add 2-3 specific constraints
that make it more challenging:
Original: {instruction}
Evolved:""",
"deepen": """Rewrite this instruction to require deeper knowledge and understanding:
Original: {instruction}
Evolved:""",
"concretize": """Rewrite this instruction to be more specific with concrete
examples, real data, or specific scenarios:
Original: {instruction}
Evolved:""",
"increase_reasoning": """Rewrite this instruction to require multi-step
reasoning, showing all intermediate steps:
Original: {instruction}
Evolved:""",
}
def evolve_instruction(instruction: str, n_rounds: int = 3) -> list[str]:
"""Evolve a simple instruction through multiple complexity rounds."""
evolved = [instruction]
for round_num in range(n_rounds):
operator = random.choice(list(EVOLUTION_PROMPTS.keys()))
prompt = EVOLUTION_PROMPTS[operator].format(instruction=evolved[-1])
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0.8,
)
new_instruction = response.choices[0].message.content.strip()
evolved.append(new_instruction)
return evolved
"""
Magpie (2024): Generate instruction data WITHOUT any seed tasks.
Just prompt an instruction-tuned model with the chat template prefix.
The model naturally generates a user query, then you generate a response.
This is the cheapest, most scalable synthetic data method.
"""
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
def magpie_generate(model_name="meta-llama/Meta-Llama-3-8B-Instruct", n=1000):
"""Generate instruction-response pairs using Magpie method."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
# The "trick": just give the model the chat template prefix
# and it will generate a realistic user query
system_prefix = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
pairs = []
for i in range(n):
# Step 1: Generate a user instruction
inputs = tokenizer(system_prefix, return_tensors="pt").to(model.device)
with torch.no_grad():
output = model.generate(**inputs, max_new_tokens=200, temperature=0.9,
do_sample=True, top_p=0.95)
instruction = tokenizer.decode(output[0], skip_special_tokens=True)
# Step 2: Generate the assistant response
full_prompt = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>
{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
output = model.generate(**inputs, max_new_tokens=1000, temperature=0.7)
response = tokenizer.decode(output[0], skip_special_tokens=True)
pairs.append({"instruction": instruction, "response": response})
if (i + 1) % 100 == 0:
print(f" Magpie: {i+1}/{n} pairs generated")
return pairs
"""
Generate CoT (Chain-of-Thought) reasoning data from a strong model.
This is how DeepSeek-V3 got reasoning from DeepSeek-R1.
"""
def generate_cot_data(problems: list[str], model="gpt-4o"):
"""Generate step-by-step solutions using a strong model."""
cot_data = []
for problem in problems:
response = client.chat.completions.create(
model=model,
messages=[{
"role": "system",
"content": "Solve the following problem step by step. "
"Show ALL intermediate reasoning. "
"Use <think>...</think> tags for reasoning, "
"then give the final answer."
}, {
"role": "user",
"content": problem
}],
temperature=0.3,
)
solution = response.choices[0].message.content
cot_data.append({
"problem": problem,
"solution": solution,
"type": "chain_of_thought",
})
return cot_data
# Combine with rejection sampling: generate N solutions, keep only correct ones
def rejection_sample(problem, correct_answer, model="gpt-4o-mini", n=8):
"""Generate N solutions, keep only ones that reach the correct answer."""
correct_solutions = []
for _ in range(n):
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": f"Solve: {problem}"}],
temperature=0.9, # High temp for diversity
)
solution = response.choices[0].message.content
if str(correct_answer) in solution:
correct_solutions.append(solution)
return correct_solutions # Multiple diverse correct solutions
| Tokenizer | Vocab Size | Used By | Notes |
|---|---|---|---|
| tiktoken (cl100k) | 100K | GPT-4, GPT-3.5 | Fast, byte-level BPE |
| SentencePiece | 32K-128K | LLaMA, Mistral | Good multilingual support |
| Qwen tokenizer | 151K | Qwen2.5, DeepSeek-V3 | Best for code + multilingual |
"""Train a BPE tokenizer optimized for YOUR data mix."""
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(
vocab_size=100_000,
min_frequency=100,
special_tokens=["<pad>", "<unk>", "<bos>", "<eos>"],
show_progress=True,
)
# Train on a representative sample of your data mix
tokenizer.train(files=[
"data/web_sample.txt", # 50%
"data/code_sample.txt", # 20%
"data/math_sample.txt", # 20%
"data/multilingual.txt", # 10%
], trainer=trainer)
tokenizer.save("my_tokenizer.json")
"""Convert filtered text corpus into binary training files."""
import numpy as np
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
vocab_size = tokenizer.vocab_size # 151,643
# Choose dtype based on vocab size
dtype = np.uint32 if vocab_size > 65535 else np.uint16
all_tokens = []
for doc in read_jsonl("filtered_corpus.jsonl"):
tokens = tokenizer.encode(doc['text'])
tokens.append(tokenizer.eos_token_id) # Document separator
all_tokens.extend(tokens)
# Save as memory-mapped binary
arr = np.array(all_tokens, dtype=dtype)
arr.tofile("data/train.bin")
print(f"Saved {len(arr):,} tokens ({arr.nbytes / 1e9:.1f} GB)")
Important
Always use uint32 if your vocab size > 65,535! Using uint16 for large vocabs (like Qwen's 151K) will silently corrupt your data by truncating token IDs. This is a common bug — we found and fixed it in superGPT.
The ratio of different data domains dramatically affects model capabilities. Here's the recipe:
"""
Data mixing strategy based on Llama 3 and DeepSeek-V3.
Key insights:
1. Start with ~50% general web text (foundation)
2. Code improves reasoning even on non-code tasks (17-20%)
3. Math is the highest-leverage domain for reasoning (15-25%)
4. Overtrain on high-quality data in later phases (annealing)
"""
# Phase 1: Main pre-training (first 80% of compute)
PRETRAIN_MIX = {
'web_text': 0.50, # FineWeb-Edu, RedPajama, C4
'code': 0.17, # The Stack, GitHub
'math': 0.15, # OpenWebMath, MathPile
'academic': 0.08, # arXiv, S2ORC, PubMed
'books': 0.05, # Gutenberg, BookCorpus
'multilingual': 0.05, # CC-100, OPUS
}
# Phase 2: Annealing (final 20% of compute)
# Upsample high-quality data for final polish
ANNEAL_MIX = {
'web_text': 0.30, # Only top-quality (FineWeb-Edu score ≥ 4)
'code': 0.25, # Upsample — code helps reasoning
'math': 0.25, # Heavy upsample on math
'academic': 0.10,
'books': 0.05,
'multilingual': 0.05,
}
"""
Shard data for multi-GPU/multi-node training.
Each shard should be ~256 MB - 1 GB.
"""
import numpy as np
import os
def shard_data(input_path, output_dir, n_shards=256):
"""Split a large .bin file into evenly-sized shards."""
os.makedirs(output_dir, exist_ok=True)
data = np.memmap(input_path, dtype=np.uint32, mode='r')
shard_size = len(data) // n_shards
for i in range(n_shards):
start = i * shard_size
end = start + shard_size if i < n_shards - 1 else len(data)
shard = np.array(data[start:end], dtype=np.uint32)
shard.tofile(os.path.join(output_dir, f"shard_{i:05d}.bin"))
print(f"Created {n_shards} shards in {output_dir}")
print(f"Shard size: ~{shard_size * 4 / 1e6:.0f} MB each")
# Example: 100B tokens → 400 GB → 400 shards of 1 GB each
shard_data("data/train.bin", "data/shards/", n_shards=400)
"""
Efficient data loading for billion-token training runs.
Uses memory-mapped files — never loads the full dataset into RAM.
"""
import numpy as np
import torch
class ShardedDataLoader:
"""Load data from multiple shards, cycling through them."""
def __init__(self, shard_dir, block_size, batch_size, dtype=np.uint32):
self.shards = sorted(
[os.path.join(shard_dir, f) for f in os.listdir(shard_dir)
if f.endswith('.bin')]
)
self.block_size = block_size
self.batch_size = batch_size
self.dtype = dtype
self.current_shard = 0
self._load_shard()
def _load_shard(self):
self.data = np.memmap(
self.shards[self.current_shard],
dtype=self.dtype, mode='r'
)
def get_batch(self):
ix = torch.randint(len(self.data) - self.block_size - 1, (self.batch_size,))
x = torch.stack([
torch.from_numpy(self.data[i:i+self.block_size].astype(np.int64))
for i in ix
])
y = torch.stack([
torch.from_numpy(self.data[i+1:i+1+self.block_size].astype(np.int64))
for i in ix
])
# Advance shard every N batches
self.current_shard = (self.current_shard + 1) % len(self.shards)
self._load_shard()
return x, y
Training on data in a strategic order — easy → hard — can improve convergence by 10-20%. This is called curriculum learning.
"""
Score documents by difficulty using multiple metrics.
Train on easy data first, complex data later.
"""
import re
import math
def score_difficulty(text: str) -> float:
"""Score text difficulty 0-1 (0 = easy, 1 = hard)."""
words = text.split()
n_words = len(words)
if n_words == 0:
return 0.5
# 1. Average word length (longer words = harder)
avg_word_len = sum(len(w) for w in words) / n_words
word_score = min(avg_word_len / 10, 1.0)
# 2. Sentence length (longer sentences = harder)
sentences = re.split(r'[.!?]+', text)
avg_sent_len = n_words / max(len(sentences), 1)
sent_score = min(avg_sent_len / 40, 1.0)
# 3. Vocabulary richness (more unique words = harder)
vocab_ratio = len(set(w.lower() for w in words)) / n_words
vocab_score = vocab_ratio
# 4. Technical content (LaTeX, code, formulas = harder)
tech_markers = sum(1 for m in ['$', '\\frac', 'def ', 'class ', 'import ',
'theorem', 'proof', 'algorithm']
if m in text)
tech_score = min(tech_markers / 5, 1.0)
# 5. Document length (longer = harder to learn from)
len_score = min(math.log(n_words + 1) / math.log(10000), 1.0)
# Weighted combination
return (0.2 * word_score + 0.2 * sent_score + 0.2 * vocab_score +
0.25 * tech_score + 0.15 * len_score)
"""
Implement curriculum learning with staged data scheduling.
"""
import numpy as np
def create_curriculum_shards(documents, n_stages=4, output_dir='data/curriculum/'):
"""Sort documents by difficulty and create staged shards."""
import os
# Score all documents
scored = [(score_difficulty(doc['text']), doc) for doc in documents]
scored.sort(key=lambda x: x[0]) # Easy first
# Split into stages
stage_size = len(scored) // n_stages
for stage in range(n_stages):
start = stage * stage_size
end = start + stage_size if stage < n_stages - 1 else len(scored)
stage_docs = [doc for _, doc in scored[start:end]]
stage_dir = os.path.join(output_dir, f'stage_{stage}')
os.makedirs(stage_dir, exist_ok=True)
# Tokenize and save this stage
# ... (use your tokenization pipeline)
avg_diff = np.mean([s for s, _ in scored[start:end]])
print(f"Stage {stage}: {len(stage_docs)} docs, avg difficulty: {avg_diff:.3f}")
# Training schedule:
# Stage 0 (iter 0-2500): Easy web text, simple sentences
# Stage 1 (iter 2500-5000): Medium articles, basic code
# Stage 2 (iter 5000-7500): Hard academic papers, complex code
# Stage 3 (iter 7500-10000): Hardest math proofs, research papers
"""
In the last 20% of training, switch to a curated high-quality mix.
This is what Llama 3 calls the 'annealing' phase.
Key changes during annealing:
1. Upsample highest-quality data (FineWeb-Edu score >= 4)
2. Upsample code and math (helps reasoning)
3. Reduce learning rate to near-zero
4. May increase context length (e.g., 4K → 128K)
"""
# Llama 3 annealing schedule:
# - Last 40M tokens on highest quality mix
# - Linear LR decay from 3.5e-5 → 0
# - Code weight: 17% → 25%
# - Math weight: 25% → 35%
# - Result: +2-3% on reasoning benchmarks vs no annealing
| Model Size | GPUs Needed | VRAM per GPU | Training Time | Estimated Cost |
|---|---|---|---|---|
| 125M | 1× A100/H100 | 40 GB | 1-3 days | $100-500 |
| 1B | 4× A100 | 40 GB each | 1-2 weeks | $2K-5K |
| 7B | 8× A100 | 80 GB each | 2-4 weeks | $20K-50K |
| 70B | 64× A100 | 80 GB each | 1-2 months | $300K-1M |
| 405B (Llama 3) | 16K H100 | 80 GB each | 54 days | $30M+ |
| 671B (DeepSeek-V3) | 2048 H800 | 80 GB each | 2 months | $5.5M |
"""
At scale, you need multiple forms of parallelism:
1. Data Parallelism (DP): Same model on each GPU, different data
- Simplest. Works up to ~8 GPUs.
- PyTorch DDP or FSDP
2. Tensor Parallelism (TP): Split layers across GPUs
- Splits attention heads and FFN across GPUs within a node
- Requires fast interconnect (NVLink)
3. Pipeline Parallelism (PP): Split layers across nodes
- Layer 0-15 on Node 1, Layer 16-31 on Node 2, etc.
- Works across slower interconnects
4. Expert Parallelism (EP): For MoE models
- Each GPU holds a subset of experts
- Tokens are all-to-all dispatched to the right GPU
"""
# For most users: FSDP is the easiest path to multi-GPU
# torchrun --nproc_per_node=8 -m supergpt.training.train \
# --preset medium --distributed --compile
# superGPT handles FSDP setup automatically:
from supergpt.training.train import main
# Just add --distributed flag
# Single GPU (up to ~1B params on 80GB VRAM)
python -m supergpt.training.train \
--preset small \
--data-dir data \
--max-iters 50000 \
--batch-size 32 \
--lr 3e-4 \
--compile \
--device cuda
# Multi-GPU with FSDP
torchrun --nproc_per_node=8 \
-m supergpt.training.train \
--preset medium \
--data-dir data \
--max-iters 100000 \
--distributed \
--compile
# With FP8 (2× memory savings on H100)
python -m supergpt.training.train \
--preset large \
--data-dir data \
--fp8 \
--compile
# With logging
python -m supergpt.training.train \
--preset small \
--wandb \
--tensorboard
| Parameter | Small (≤1B) | Medium (1-10B) | Large (≥10B) |
|---|---|---|---|
| Learning Rate | 3e-4 | 1.5e-4 | 6e-5 |
| Batch Size (tokens) | 512K | 2M | 4M+ |
| Warmup Steps | 2,000 | 2,000 | 2,000 |
| LR Schedule | Cosine | Cosine | Cosine |
| Weight Decay | 0.1 | 0.1 | 0.1 |
| Gradient Clipping | 1.0 | 1.0 | 1.0 |
| Context Length | 2K → 8K | 4K → 32K | 4K → 128K |
After pre-training, the model can predict text but can't follow instructions. Post-training turns it into an assistant.
Train on 500K-1.5M instruction-following examples. The model learns the system/user/assistant turn pattern.
# superGPT SFT with LoRA (memory-efficient)
python -m supergpt.training.finetune \
--checkpoint checkpoints/best.pt \
--data-dir data/sft/ \
--lora-rank 64 \
--max-iters 5000 \
--lr 2e-5
Align the model to human preferences. DPO (Direct Preference Optimization) is simpler than classic RLHF — no reward model needed.
# superGPT DPO alignment
python -m supergpt.training.align_dpo \
--checkpoint checkpoints/sft_best.pt \
--data-dir data/preferences/ \
--beta 0.1 \
--max-iters 2000
Transfer reasoning from a large teacher to a smaller student. DeepSeek-V3 distilled from DeepSeek-R1 for enhanced reasoning.
# superGPT distillation
python -m supergpt.training.distill \
--teacher checkpoints/large_model.pt \
--student-preset small \
--data-dir data/ \
--alpha 0.7 \
--temperature 2.0
Here's a complete walkthrough of building and training a model from scratch using superGPT.
# Option A: Use the built-in data pipeline (FineWeb-Edu, Wikipedia, etc.)
python -m supergpt.training.data_pipeline \
--dataset fineweb-edu \
--tokenizer Qwen/Qwen2.5-0.5B \
--max-tokens 100000000 \
--output-dir data/
# Option B: Use character-level data (for quick tests)
python -m supergpt.training.data_pipeline \
--dataset shakespeare \
--output-dir data/
# Option C: Process your own text files
# Put .txt files in a folder, then:
python -m supergpt.training.data_pipeline \
--input-dir /path/to/your/text/files/ \
--tokenizer Qwen/Qwen2.5-0.5B \
--output-dir data/
# Option D: Stream from HuggingFace
python -m supergpt.training.data_pipeline \
--dataset HuggingFaceFW/fineweb-edu \
--tokenizer Qwen/Qwen2.5-0.5B \
--max-tokens 500000000 \
--output-dir data/
# See all presets
python -m supergpt.training.train --help
# Available presets:
# micro - 2.1M params (testing, CPU ok)
# small - 10.6M params (1 GPU, fast iteration)
# medium - 124M params (GPT-2 scale, 1 GPU)
# large - 350M params (multi-GPU recommended)
# xl - 774M params (4+ GPUs)
# xxl - 1.5B params (8+ GPUs)
# Quick test (5 mins, CPU/single GPU)
python -m supergpt.training.train \
--preset micro \
--data-dir data/ \
--max-iters 1000 \
--device cuda
# Real training (1-2 hours on 1 GPU)
python -m supergpt.training.train \
--preset small \
--data-dir data/ \
--max-iters 10000 \
--batch-size 32 \
--lr 3e-4 \
--compile \
--device cuda
# Production training (multi-GPU)
torchrun --nproc_per_node=4 \
-m supergpt.training.train \
--preset medium \
--data-dir data/ \
--max-iters 50000 \
--distributed \
--compile
# Resume from checkpoint (automatic)
python -m supergpt.training.train \
--preset small \
--data-dir data/ \
--max-iters 20000 \
--compile --device cuda
# superGPT auto-detects checkpoints/latest.pt and resumes
# Generate from your trained model
python -m supergpt.inference.generate \
--checkpoint checkpoints/best.pt \
--prompt "The key principles of machine learning are" \
--max-tokens 200 \
--temperature 0.8
# Interactive mode
python -m supergpt.inference.generate \
--checkpoint checkpoints/best.pt \
--interactive
# LoRA fine-tuning (memory efficient, any GPU)
python -m supergpt.training.finetune \
--checkpoint checkpoints/best.pt \
--data-dir data/sft/ \
--lora-rank 64 \
--max-iters 5000 \
--lr 2e-5 \
--compile --device cuda
# Knowledge distillation: teacher → student
python -m supergpt.training.distill \
--teacher checkpoints/large_model.pt \
--student-preset small \
--data-dir data/ \
--alpha 0.7 \
--temperature 2.0 \
--max-iters 10000 \
--compile --device cuda
# Export to ONNX for inference
python -m supergpt.inference.export \
--checkpoint checkpoints/best.pt \
--format onnx \
--output model.onnx
# Serve as an API
python -m supergpt.inference.serve \
--checkpoint checkpoints/best.pt \
--port 8000
# → http://localhost:8000/generate?prompt=Hello&max_tokens=100
#!/bin/bash
# End-to-end: from raw text to trained model
# 1. Prepare your data directory
mkdir -p my_data/raw
# Put your .txt, .jsonl, or .md files in my_data/raw/
# 2. Tokenize
python -m supergpt.training.data_pipeline \
--input-dir my_data/raw/ \
--tokenizer Qwen/Qwen2.5-0.5B \
--output-dir my_data/tokenized/
# 3. Train
python -m supergpt.training.train \
--preset small \
--data-dir my_data/tokenized/ \
--max-iters 10000 \
--batch-size 32 \
--lr 3e-4 \
--compile \
--device cuda
# 4. Test generation
python -m supergpt.inference.generate \
--checkpoint checkpoints/best.pt \
--prompt "In this tutorial, we will learn" \
--max-tokens 200 \
--temperature 0.7
echo "Done! Model saved in checkpoints/"
| Dataset | Size | Description | Link |
|---|---|---|---|
| FineWeb-Edu | 1.3T tokens | Ultra-high-quality filtered web text | HuggingFace |
| FineWeb | 15T tokens | Full Common Crawl processed with quality filters | HuggingFace |
| RedPajama-V2 | 30T tokens | Full web corpus with 40+ quality annotations | Together AI |
| The Pile | 825 GB | Curated mix of 22 diverse sources | HuggingFace |
| Dolma | 3T tokens | Open corpus for OLMo (Allen AI) | HuggingFace |
| C4 | 750 GB | Cleaned Common Crawl (original T5 data) | HuggingFace |
| Dataset | Size | Description | Link |
|---|---|---|---|
| The Stack v2 | 6.4 TB | 358 programming languages, permissive licenses | HuggingFace |
| StarCoder Data | 783 GB | Curated code for StarCoder models | HuggingFace |
| Dataset | Size | Description | Link |
|---|---|---|---|
| OpenWebMath | 14.7B tokens | Math content filtered from Common Crawl | HuggingFace |
| MathPile | 9.5B tokens | Textbooks, arXiv, StackExchange | HuggingFace |
| GSM8K | 8.5K | Grade school math with solutions | HuggingFace |
| MATH | 12.5K | Competition math with step-by-step solutions | HuggingFace |
| NuminaMath-CoT | 860K | Chain-of-thought math solutions | HuggingFace |
| Dataset | Size | Description | Link |
|---|---|---|---|
| OpenAssistant | 161K msgs | Human-written multi-turn conversations | HuggingFace |
| UltraChat | 1.5M convos | Multi-turn synthetic conversations | HuggingFace |
| WizardLM | 196K | Evolved instructions for complex tasks | HuggingFace |
| Orca-Math | 200K | Math word problems for reasoning | HuggingFace |
| Tool | Purpose | Link |
|---|---|---|
| text-dedup | Deduplication at scale | GitHub |
| datasketch | MinHash / LSH implementation | GitHub |
| Trafilatura | Web text extraction | GitHub |
| fastText | Language detection + quality classification | GitHub |
| tokenizers | Fast BPE/WordPiece/Unigram training | GitHub |
| datatrove | HuggingFace's data processing toolkit | GitHub |
- Llama 3 Technical Report — Meta, 2024
- DeepSeek-V3 Technical Report — DeepSeek AI, 2024
- FineWeb: 15T Tokens of High-Quality Web Data — HuggingFace, 2024
- The Stack: 3TB of Source Code — BigCode Project, 2022
- Scaling Data-Constrained Language Models — Muennighoff et al., 2023
- OpenWebMath: Open Dataset of High-Quality Mathematical Web Text — Paster et al., 2023
- RedPajama-V2: An Open Dataset with 30T Tokens — Together AI, 2023