Mediaeater/microgpt

microgpt v10

Character-level GPT built from scratch in pure NumPy. No PyTorch, no frameworks — just matrix math and backpropagation.

Trained on a curated vocabulary of memetics, information warfare, cognitive science, persuasion, cryptography, and network science terms. Generates novel concept names in the same domain.

Quick Start

```shell
# Generate novel concepts (creative checkpoint)
python3 microgpt.py --load ckpt_creative.npz --generate_only --num_samples 50

# Generate accurate, domain-grounded terms (v9 full)
python3 microgpt.py --load ckpt_v9.npz --generate_only --num_samples 50

# Generate with novelty analysis
python3 microgpt.py --load ckpt_creative.npz --generate_only --num_samples 100 --novelty

# Train a new model (small, ~1 hour)
python3 microgpt.py --n_embd 64 --n_layer 6 --n_head 8 --block_size 48 \
  --num_steps 200000 --save ckpt.npz

# Train a large model (~6 hours with v5 optimizations)
python3 microgpt.py --n_embd 128 --n_layer 6 --n_head 8 --block_size 64 \
  --num_steps 200000 --lr 0.003 --save ckpt_large.npz
```

Architecture

  • Multi-head causal self-attention with RMSNorm
  • ReLU MLP (4x expansion)
  • Cosine LR schedule with warmup
  • AdamW optimizer
  • Dropout on attention and MLP
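Two of the pieces above can be sketched compactly in NumPy. This is a minimal illustration of RMSNorm and masked causal attention for a single head, not the repo's actual implementation; shapes and variable names are assumptions.

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-5):
    # Normalize by the root-mean-square over the feature axis (no mean subtraction).
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def causal_attention(q, k, v):
    # q, k, v: (T, head_dim). Future positions are masked before the softmax.
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (T, T) similarity scores
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -np.inf, scores)           # block attention to the future
    scores -= scores.max(axis=-1, keepdims=True)       # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because of the mask, position 0 can only attend to itself, so the first output row is exactly `v[0]` — a handy sanity check for any causal-attention implementation.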

Features

| Feature | Description |
|---|---|
| First-char debiasing | Penalizes overrepresented starting characters (`--debias 0.5`) |
| Quality scoring | Automated pass/fail checks: uppercase, length, triple chars, vowel ratio, consonant soup, truncation |
| Repetition penalty | N-gram tracking to avoid repeated trigrams (`--rep_penalty 1.5`) |
| Post-processing | Auto-capitalizes the first letter |
| Novelty analysis | `--novelty` flag compares output to training data via Levenshtein distance |
| Sample deduplication | Within-run dedup with retry (`MAX_RETRIES=10`) |
| Garble filter | Rejects nonsense words via edit distance against the training vocabulary |
| Novel term collection | `--collect novel_terms.txt` appends novel outputs with dedup |
| Self-contained checkpoints | Saves vocab + model config in the `.npz` for zero-config loading |
| float32 training | 1.7x speedup on Apple Silicon vs float64 |
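The novelty analysis and garble filter both rest on edit distance against the training vocabulary. A minimal sketch of that idea, with the 0.3 novelty threshold being an assumed value rather than the repo's actual setting:

```python
def levenshtein(a, b):
    # Classic two-row dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def novelty(sample, vocab, threshold=0.3):
    # Normalized distance to the nearest training term; at or above the
    # (assumed) threshold counts as novel, far below it suggests memorization.
    nearest = min(levenshtein(sample.lower(), t.lower()) / max(len(sample), len(t))
                  for t in vocab)
    return nearest, nearest >= threshold
```

A garble filter would invert the same check: a sample whose nearest-neighbor distance is extremely high matches nothing in the domain and can be rejected as noise.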

Checkpoints

| Checkpoint | Params | Dataset | Steps | Quality | Novelty | Use for |
|---|---|---|---|---|---|---|
| `ckpt_creative.npz` | 1.2M | 5,869 terms (v7) | 50K | 96% | 94.5% | Novel concept generation |
| `ckpt_v9.npz` | 1.2M | 6,473 terms (v9) | 200K | 97% | 63% | Accurate, domain-grounded generation |
| `ckpt_v8.npz` | 1.2M | 6,447 terms (v8) | 200K | 97% | 69% | Superseded by v9 |
| `ckpt_v7.npz` | 1.2M | 5,869 terms (v7) | 200K | 88% | 93.5% | Superseded by v8 |

Two-checkpoint strategy: ckpt_creative.npz (v7, 50K steps) leads on novelty at 94.5%, while ckpt_v9.npz leads on quality at 97%. Novelty dropped in v8/v9 because the expanded dataset covers more of the concept space, so outputs are more likely to land near a training term: a measurement artifact, not reduced creativity.

Note: ckpt_200k.npz and ckpt_200k_large.npz are legacy checkpoints without saved vocab. Use --data input_backup.txt when loading them.
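Self-contained checkpoints work by bundling the vocabulary and model config into the same `.npz` archive as the weights. A sketch of the idea with NumPy's `savez`/`load`; the key names and config layout here are illustrative assumptions, not the repo's actual format:

```python
import numpy as np

def save_checkpoint(path, params, vocab, config):
    # Store weights alongside vocab and config so loading needs no --data file.
    # "vocab"/"config" key names are hypothetical, chosen for this sketch.
    np.savez(path,
             vocab=np.array(list(vocab)),
             config=np.array([config["n_embd"], config["n_layer"],
                              config["n_head"], config["block_size"]]),
             **params)

def load_checkpoint(path):
    ckpt = np.load(path, allow_pickle=False)
    vocab = [str(c) for c in ckpt["vocab"]]
    n_embd, n_layer, n_head, block_size = (int(x) for x in ckpt["config"])
    params = {k: ckpt[k] for k in ckpt.files if k not in ("vocab", "config")}
    return params, vocab, dict(n_embd=n_embd, n_layer=n_layer,
                               n_head=n_head, block_size=block_size)
```

Legacy checkpoints predate this scheme, which is why they still need the training file passed explicitly for the vocabulary.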

Dataset

input.txt — 6,508 terms across these clusters:

  • Memetics & information warfare (original core)
  • Cognitive biases & psychology
  • Game theory & decision science
  • Rhetoric & persuasion
  • Network science & behavioral economics
  • Cryptography & privacy
  • Propaganda & IO techniques
  • Disinformation & media manipulation
  • MITRE ATT&CK, ATLAS & DISARM frameworks (AI attack/defense)
  • OWASP LLM Top 10 (AI security)
  • CAPEC attack patterns
  • Dark patterns & deceptive design
  • AI safety & alignment (MIRI, ARC, DeepMind)
  • Cognitive security (COGSEC)
  • EU AI Act & NIST AI RMF (governance)
  • 167 curated novel terms from 6 flywheel cycles

Analysis Tools

| Script | Purpose |
|---|---|
| `novelty_check.py` | Standalone novelty analysis (edit distance vs training data) |
| `diversity_metrics.py` | Batch diversity: unique rate, entropy, type-token ratio |
| `audit_expanded.py` | Dataset quality audit: dupes, charset, length, near-dupes |
| `profile_instrumented.py` | Per-operation training profiler with FLOP estimation |
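The three batch-diversity numbers have standard definitions, sketched below as one possible reading of what `diversity_metrics.py` computes (the actual script may differ): unique rate over whole samples, character-level Shannon entropy in bits, and type-token ratio over whitespace-split words.

```python
import math
from collections import Counter

def diversity_metrics(samples):
    # samples: list of generated strings.
    unique_rate = len(set(samples)) / len(samples)

    # Character-level Shannon entropy (bits) across the whole batch.
    counts = Counter("".join(samples))
    total = sum(counts.values())
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())

    # Type-token ratio: unique words / total words.
    words = [w for s in samples for w in s.split()]
    ttr = len(set(words)) / len(words)

    return {"unique_rate": unique_rate, "entropy": entropy, "type_token_ratio": ttr}
```

A collapsing model shows up immediately in these numbers: repeated samples drag the unique rate down, and a shrinking character distribution drags the entropy toward zero.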

Version History

  • v10 — (training) Flywheel cycle 6, 6,508 terms
  • v9 — Flywheel cycle 6: 35 novel terms from v9 creative; 97% quality, 63% novel at 50K
  • v8 — Flywheel cycle 5: 26 novel terms; 97% quality, 69% novel at 50K. Dataset expanded to 6,447 terms via ATT&CK, biases, fallacies, rhetoric, CAPEC, dark patterns
  • v7 — Flywheel cycle 4: 13 novel terms → 5,869 terms; 94.5% novelty at 96% quality (50K sweet spot)
  • v6 — Flywheel cycle 3: 25 novel terms → 5,856 terms; 87% novelty at 97% quality
  • v5 — Dataset scaled to 5,831 terms via external sources (ATLAS, DISARM, OWASP, COGSEC, EU AI Act, NIST); float32 training, 1.7x speedup; garble filter, sample dedup, --collect flag
  • v4 — Debiasing, quality scoring, repetition penalty, checkpoint vocab/config saving
  • v3 — Cosine LR, AdamW, batched training, dropout, top-k/top-p sampling

Memorization vs Novelty

v7 sweep (1.2M params, 5,869 terms) shows novelty staying above 92% and quality above 88% across all checkpoints:

| Steps | Quality | Novel% | AvgDist | Notes |
|---|---|---|---|---|
| 10K | 92% | 99.5% | 0.395 | Near-random, very novel |
| 20K | 98% | 96% | 0.357 | Highest quality |
| 50K | 96% | 94.5% | 0.308 | Best balance — sweet spot |
| 90K | 90% | 92% | 0.314 | Quality dip in mid-range |
| 140K | 96% | 92.5% | 0.326 | Quality recovers |
| 180K | 94% | 94.5% | 0.327 | Late novelty peak |
| 200K | 88% | 93.5% | 0.329 | Quality drops at end |

Novelty progression across versions: v4 55% → v5 84% → v6 87% → v7 94.5% → v8 69% → v9 63%.

v8+ novelty drop reflects broader training coverage (6,447+ terms vs 5,869), not reduced creativity. The v7 creative checkpoint remains the novelty benchmark.

Next Steps

  • Fuse QKV projections for additional 5-10% speedup
  • Explore word-level tokenizer for multi-word pattern capture
  • Scale dataset to 10-15K terms for further diversity
  • Continue data flywheel: collect → curate → retrain (6 cycles completed, 167 novel terms curated)
  • Find an earlier sweet spot (20-30K steps) for the v8+ dataset size to recover novelty
