Mediaeater/microgpt

microgpt v10

Character-level GPT built from scratch in pure NumPy. No PyTorch, no frameworks — just matrix math and backpropagation.

Trained on a curated vocabulary of memetics, information warfare, cognitive science, persuasion, cryptography, and network science terms. Generates novel concept names in the same domain.

Quick Start

```shell
# Generate novel concepts (creative checkpoint)
python3 microgpt.py --load ckpt_creative.npz --generate_only --num_samples 50

# Generate accurate, domain-grounded terms (v9 full)
python3 microgpt.py --load ckpt_v9.npz --generate_only --num_samples 50

# Generate with novelty analysis
python3 microgpt.py --load ckpt_creative.npz --generate_only --num_samples 100 --novelty

# Train a new model (small, ~1 hour)
python3 microgpt.py --n_embd 64 --n_layer 6 --n_head 8 --block_size 48 \
  --num_steps 200000 --save ckpt.npz

# Train a large model (~6 hours with v5 optimizations)
python3 microgpt.py --n_embd 128 --n_layer 6 --n_head 8 --block_size 64 \
  --num_steps 200000 --lr 0.003 --save ckpt_large.npz
```

Architecture

  • Multi-head causal self-attention with RMSNorm
  • ReLU MLP (4x expansion)
  • Cosine LR schedule with warmup
  • AdamW optimizer
  • Dropout on attention and MLP
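Two of the pieces above can be sketched compactly in NumPy. This is a minimal illustration of RMSNorm and masked causal attention for a single head, not the repo's actual implementation; shapes and variable names are assumptions.

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-5):
    # Normalize by the root-mean-square over the feature axis (no mean subtraction).
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def causal_attention(q, k, v):
    # q, k, v: (T, head_dim). Future positions are masked before the softmax.
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (T, T) similarity scores
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -np.inf, scores)           # block attention to the future
    scores -= scores.max(axis=-1, keepdims=True)       # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because of the mask, position 0 can only attend to itself, so the first output row is exactly `v[0]` — a handy sanity check for any causal-attention implementation.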

Features

| Feature | Description |
|---|---|
| First-char debiasing | Penalizes overrepresented starting characters (`--debias 0.5`) |
| Quality scoring | Automated pass/fail checks: uppercase, length, triple chars, vowel ratio, consonant soup, truncation |
| Repetition penalty | N-gram tracking to avoid repeated trigrams (`--rep_penalty 1.5`) |
| Post-processing | Auto-capitalizes the first letter |
| Novelty analysis | `--novelty` flag compares output to training data via Levenshtein distance |
| Sample deduplication | Within-run dedup with retry (`MAX_RETRIES=10`) |
| Garble filter | Rejects nonsense words via edit distance against the training vocabulary |
| Novel term collection | `--collect novel_terms.txt` appends novel outputs with dedup |
| Self-contained checkpoints | Saves vocab + model config in the `.npz` for zero-config loading |
| float32 training | 1.7x speedup on Apple Silicon vs float64 |
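The novelty analysis and garble filter both rest on edit distance against the training vocabulary. A minimal sketch of that idea, with the 0.3 novelty threshold being an assumed value rather than the repo's actual setting:

```python
def levenshtein(a, b):
    # Classic two-row dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def novelty(sample, vocab, threshold=0.3):
    # Normalized distance to the nearest training term; at or above the
    # (assumed) threshold counts as novel, far below it suggests memorization.
    nearest = min(levenshtein(sample.lower(), t.lower()) / max(len(sample), len(t))
                  for t in vocab)
    return nearest, nearest >= threshold
```

A garble filter would invert the same check: a sample whose nearest-neighbor distance is extremely high matches nothing in the domain and can be rejected as noise.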

Checkpoints

| Checkpoint | Params | Dataset | Steps | Quality | Novelty | Use for |
|---|---|---|---|---|---|---|
| `ckpt_creative.npz` | 1.2M | 5,869 terms (v7) | 50K | 96% | 94.5% | Novel concept generation |
| `ckpt_v9.npz` | 1.2M | 6,473 terms (v9) | 200K | 97% | 63% | Accurate, domain-grounded generation |
| `ckpt_v8.npz` | 1.2M | 6,447 terms (v8) | 200K | 97% | 69% | Superseded by v9 |
| `ckpt_v7.npz` | 1.2M | 5,869 terms (v7) | 200K | 88% | 93.5% | Superseded by v8 |

Two-checkpoint strategy: ckpt_creative.npz (v7, 50K steps) leads on novelty at 94.5%, while ckpt_v9.npz leads on quality at 97%. Novelty dropped in v8/v9 because the expanded dataset covers more of the concept space, so outputs are more likely to land near a training term: a measurement artifact, not reduced creativity.

Note: ckpt_200k.npz and ckpt_200k_large.npz are legacy checkpoints without saved vocab. Use --data input_backup.txt when loading them.
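Self-contained checkpoints work by bundling the vocabulary and model config into the same `.npz` archive as the weights. A sketch of the idea with NumPy's `savez`/`load`; the key names and config layout here are illustrative assumptions, not the repo's actual format:

```python
import numpy as np

def save_checkpoint(path, params, vocab, config):
    # Store weights alongside vocab and config so loading needs no --data file.
    # "vocab"/"config" key names are hypothetical, chosen for this sketch.
    np.savez(path,
             vocab=np.array(list(vocab)),
             config=np.array([config["n_embd"], config["n_layer"],
                              config["n_head"], config["block_size"]]),
             **params)

def load_checkpoint(path):
    ckpt = np.load(path, allow_pickle=False)
    vocab = [str(c) for c in ckpt["vocab"]]
    n_embd, n_layer, n_head, block_size = (int(x) for x in ckpt["config"])
    params = {k: ckpt[k] for k in ckpt.files if k not in ("vocab", "config")}
    return params, vocab, dict(n_embd=n_embd, n_layer=n_layer,
                               n_head=n_head, block_size=block_size)
```

Legacy checkpoints predate this scheme, which is why they still need the training file passed explicitly for the vocabulary.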

Dataset

input.txt — 6,508 terms across these clusters:

  • Memetics & information warfare (original core)
  • Cognitive biases & psychology
  • Game theory & decision science
  • Rhetoric & persuasion
  • Network science & behavioral economics
  • Cryptography & privacy
  • Propaganda & IO techniques
  • Disinformation & media manipulation
  • MITRE ATT&CK, ATLAS & DISARM frameworks (AI attack/defense)
  • OWASP LLM Top 10 (AI security)
  • CAPEC attack patterns
  • Dark patterns & deceptive design
  • AI safety & alignment (MIRI, ARC, DeepMind)
  • Cognitive security (COGSEC)
  • EU AI Act & NIST AI RMF (governance)
  • 167 curated novel terms from 6 flywheel cycles

Analysis Tools

| Script | Purpose |
|---|---|
| `novelty_check.py` | Standalone novelty analysis (edit distance vs training data) |
| `diversity_metrics.py` | Batch diversity: unique rate, entropy, type-token ratio |
| `audit_expanded.py` | Dataset quality audit: dupes, charset, length, near-dupes |
| `profile_instrumented.py` | Per-operation training profiler with FLOP estimation |
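The three batch-diversity numbers have standard definitions, sketched below as one possible reading of what `diversity_metrics.py` computes (the actual script may differ): unique rate over whole samples, character-level Shannon entropy in bits, and type-token ratio over whitespace-split words.

```python
import math
from collections import Counter

def diversity_metrics(samples):
    # samples: list of generated strings.
    unique_rate = len(set(samples)) / len(samples)

    # Character-level Shannon entropy (bits) across the whole batch.
    counts = Counter("".join(samples))
    total = sum(counts.values())
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())

    # Type-token ratio: unique words / total words.
    words = [w for s in samples for w in s.split()]
    ttr = len(set(words)) / len(words)

    return {"unique_rate": unique_rate, "entropy": entropy, "type_token_ratio": ttr}
```

A collapsing model shows up immediately in these numbers: repeated samples drag the unique rate down, and a shrinking character distribution drags the entropy toward zero.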

Version History

  • v10 — (training) Flywheel cycle 6, 6,508 terms
  • v9 — Flywheel cycle 6: 35 novel terms from v9 creative; 97% quality, 63% novel at 50K
  • v8 — Flywheel cycle 5: 26 novel terms; 97% quality, 69% novel at 50K. Dataset expanded to 6,447 terms via ATT&CK, biases, fallacies, rhetoric, CAPEC, dark patterns
  • v7 — Flywheel cycle 4: 13 novel terms → 5,869 terms; 94.5% novelty at 96% quality (50K sweet spot)
  • v6 — Flywheel cycle 3: 25 novel terms → 5,856 terms; 87% novelty at 97% quality
  • v5 — Dataset scaled to 5,831 terms via external sources (ATLAS, DISARM, OWASP, COGSEC, EU AI Act, NIST); float32 training, 1.7x speedup; garble filter, sample dedup, --collect flag
  • v4 — Debiasing, quality scoring, repetition penalty, checkpoint vocab/config saving
  • v3 — Cosine LR, AdamW, batched training, dropout, top-k/top-p sampling

Memorization vs Novelty

v7 sweep (1.2M params, 5,869 terms) shows novelty staying above 92% and quality above 88% across all checkpoints:

| Steps | Quality | Novel% | AvgDist | Notes |
|---|---|---|---|---|
| 10K | 92% | 99.5% | 0.395 | Near-random, very novel |
| 20K | 98% | 96% | 0.357 | Highest quality |
| 50K | 96% | 94.5% | 0.308 | Best balance — sweet spot |
| 90K | 90% | 92% | 0.314 | Quality dip in mid-range |
| 140K | 96% | 92.5% | 0.326 | Quality recovers |
| 180K | 94% | 94.5% | 0.327 | Late novelty peak |
| 200K | 88% | 93.5% | 0.329 | Quality drops at end |

Novelty progression across versions: v4 55% → v5 84% → v6 87% → v7 94.5% → v8 69% → v9 63%.

v8+ novelty drop reflects broader training coverage (6,447+ terms vs 5,869), not reduced creativity. The v7 creative checkpoint remains the novelty benchmark.

Next Steps

  • Fuse QKV projections for additional 5-10% speedup
  • Explore word-level tokenizer for multi-word pattern capture
  • Scale dataset to 10-15K terms for further diversity
  • Continue data flywheel: collect → curate → retrain (6 cycles completed, 167 novel terms curated)
  • Find an earlier sweet spot (20-30K steps) for the v8+ dataset size to recover novelty
