Training companion for the OWASP Amass NLP plugin.
Produces nlp_model.bin — a character-level 4-gram language model that Amass uses to
generate probabilistic subdomain name candidates during reconnaissance.
Amass's traditional subdomain generation relies on a static wordlist and deterministic
mutation rules. amass-nlp adds a statistical component: a character-level Markov chain
trained on millions of real subdomain labels. Given a seed label such as api, the model
generates plausible continuations (api-v2, api-gateway, api-prod) ranked by learned
likelihood, rather than guessing from a fixed list.
Key properties:
- Pure Go: no CGO, no external C libraries, builds with CGO_ENABLED=0
- Streaming trainer: processes hundred-million-line datasets in O(1) memory per line
- Small model: 2–3 MB gob-encoded binary; loads in milliseconds, scores in microseconds
- Online learning: the Amass plugin updates model weights in real time from DNS feedback and persists the learned delta across sessions
- Community deltas: operators may contribute anonymised learned deltas; these are aggregated and folded into every monthly release
amass-nlp/
├── pkg/ngram/
│ ├── ngram.go # NgramModel: Train, Generate, Score, Reinforce, Merge, Save, Load
│ └── ngram_test.go # Unit tests and perplexity benchmark
├── pkg/corpus/
│ ├── corpus.go # Shared corpus reader: gzip/stream + hostname extraction
│ └── corpus_test.go
├── pkg/vocab/
│ ├── vocab.go # LocaleVocab/Bundle types + gob I/O (wire-compatible with the plugin)
│ ├── build.go # Differential-frequency extractor: public-suffix split + log-odds
│ └── *_test.go
├── cmd/train/
│ └── main.go # Streaming trainer CLI
├── cmd/vocab/
│ └── main.go # Locale vocabulary bundle builder CLI
├── cmd/mkdelta/
│ └── main.go # Computes marginal counts between two model versions
├── cmd/aggregate/
│ └── main.go # Merges operator contribution deltas into one file
├── .github/workflows/
│ └── train.yml # Monthly CI: contributions + SecLists + Tranco + crt.sh → GH Release
├── prev_model.bin # Previous release model; delta baseline for the next CI run
└── go.mod
The model learns the probability of each character given the three preceding characters,
over the DNS label alphabet [a-z0-9-] plus a start and end token.
P(char | context) = (count(context, char) + 1) / (count(context) + vocabSize)
Laplace smoothing ensures every context-character pair has non-zero probability, so the model degrades gracefully on unseen patterns.
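The smoothing formula can be sketched in a few lines of Go. This is a minimal illustration, not the code in pkg/ngram: the `model` type, `prob`, and `exampleModel` are hypothetical names, and `vocabSize = 38` is an assumption (37 label characters plus the end token).

```go
package main

import "fmt"

// vocabSize is an assumption: 37 label characters [a-z0-9-] plus the end token.
const vocabSize = 38

// model is a hypothetical, minimal stand-in for the sparse count tables.
type model struct {
	counts map[string]map[byte]uint32 // context -> next char -> count
	total  map[string]uint32          // context -> total observations
}

// prob returns the Laplace-smoothed P(next | context).
// Unseen contexts and characters read as zero from the maps.
func (m *model) prob(context string, next byte) float64 {
	c := float64(m.counts[context][next])
	t := float64(m.total[context])
	return (c + 1) / (t + vocabSize)
}

// exampleModel holds made-up counts for the context "api".
func exampleModel() *model {
	return &model{
		counts: map[string]map[byte]uint32{"api": {'-': 6, '1': 2}},
		total:  map[string]uint32{"api": 8},
	}
}

func main() {
	m := exampleModel()
	fmt.Printf("%.4f\n", m.prob("api", '-')) // (6+1)/(8+38)
	fmt.Printf("%.4f\n", m.prob("api", 'z')) // (0+1)/(8+38)
	fmt.Printf("%.4f\n", m.prob("zzz", 'a')) // (0+1)/(0+38)
}
```

A pair that was never observed still gets a small non-zero probability, which is the graceful degradation described above.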
Why 4-grams: A context of three characters captures common subdomain patterns
(api-v, k8s-, -prod, us-east) without requiring a transformer or GPU. Inference
takes microseconds per candidate.
Storage: Sparse map (context → next_char → count). With real-world training data
the model is 2–3 MB — small enough to commit to the Amass repository and bundle in
every release binary.
The model is serialised with Go's encoding/gob:
type NgramModel struct {
Order int // n-gram order (4)
Counts map[string]map[byte]uint32 // context → next char → count
Total map[string]uint32 // context → total observations
Version int // codec version for forward compatibility
}
A full model and a delta file use the same format. A delta simply has sparse marginal
counts (only the counts that increased since the last release). This means every .bin
file produced by this repo can be inspected, merged, or applied with the same tools.
Files are written atomically (.tmp + rename) to prevent corruption on SIGTERM.
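The write-then-rename pattern can be sketched as follows. This is an illustration under assumptions, not the repository's Save implementation: `saveAtomic`, `loadFile`, and `roundTrip` are hypothetical helpers, and the struct simply mirrors the fields listed above.

```go
package main

import (
	"encoding/gob"
	"fmt"
	"os"
	"path/filepath"
)

// NgramModel mirrors the struct shown above.
type NgramModel struct {
	Order   int
	Counts  map[string]map[byte]uint32
	Total   map[string]uint32
	Version int
}

// saveAtomic (hypothetical name) encodes to path+".tmp", then renames it
// over path, so a reader never sees a half-written file.
func saveAtomic(path string, m *NgramModel) error {
	tmp := path + ".tmp"
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	err = gob.NewEncoder(f).Encode(m)
	if err == nil {
		err = f.Sync() // flush to disk before the rename
	}
	if cerr := f.Close(); err == nil {
		err = cerr
	}
	if err != nil {
		os.Remove(tmp)
		return err
	}
	return os.Rename(tmp, path)
}

// loadFile decodes a gob-encoded model from path.
func loadFile(path string) (*NgramModel, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	var m NgramModel
	if err := gob.NewDecoder(f).Decode(&m); err != nil {
		return nil, err
	}
	return &m, nil
}

// roundTrip saves and reloads a tiny model in a temp directory. It returns a
// reloaded count and whether a .tmp file was left behind.
func roundTrip() (uint32, bool, error) {
	dir, err := os.MkdirTemp("", "ngram")
	if err != nil {
		return 0, false, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "nlp_model.bin")
	in := &NgramModel{Order: 4, Version: 1,
		Counts: map[string]map[byte]uint32{"api": {'-': 3}},
		Total:  map[string]uint32{"api": 3}}
	if err := saveAtomic(path, in); err != nil {
		return 0, false, err
	}
	out, err := loadFile(path)
	if err != nil {
		return 0, false, err
	}
	_, statErr := os.Stat(path + ".tmp")
	return out.Counts["api"]['-'], statErr == nil, nil
}

func main() {
	fmt.Println(roundTrip())
}
```

On POSIX filesystems a rename within one directory replaces the target atomically, so an interrupted save leaves either the old file or the new one.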
When Amass starts, it builds an effective model by merging three layers:
effective = base_model (resources/nlp_model.bin, shipped with binary)
+ NLPLearnRateCI × nlp_ci_patch.bin (~/.config/amass/, replaced on sync)
+ NLPLearnRateUser × nlp_learned.bin (~/.config/amass/, never replaced)
| Layer | File | Updated by | Default rate |
|---|---|---|---|
| Base | resources/nlp_model.bin | Amass release pipeline | n/a |
| CI patch | ~/.config/amass/nlp_ci_patch.bin | amass --nlp-sync | 0.05 |
| User | ~/.config/amass/nlp_learned.bin | Online learning during scans | 0.50 |
By default the user layer outweighs the CI patch, because its rate (0.50) is ten times the CI rate (0.05). The model therefore adapts to the names actually found in a user's target environment, and a new monthly release does not overwrite those learned patterns.
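The three-layer merge can be illustrated with a short Go sketch. The names `counts`, `mergeScaled`, and `effective` are hypothetical, and rounding scaled counts to the nearest integer is an assumption; the plugin's Merge may round or saturate differently.

```go
package main

import (
	"fmt"
	"math"
)

type counts map[string]map[byte]uint32 // context -> next char -> count

// mergeScaled adds round(rate*count) from src into dst.
// Rounding to nearest and treating a nil source or non-positive rate as a
// no-op are assumptions made for this sketch.
func mergeScaled(dst, src counts, rate float64) {
	if src == nil || rate <= 0 {
		return
	}
	for ctx, row := range src {
		for c, n := range row {
			add := uint32(math.Round(rate * float64(n)))
			if add == 0 {
				continue
			}
			if dst[ctx] == nil {
				dst[ctx] = map[byte]uint32{}
			}
			dst[ctx][c] += add // overflow handling omitted in this sketch
		}
	}
}

// effective builds the startup model: base + rateCI*ciPatch + rateUser*learned.
func effective(base, ciPatch, learned counts, rateCI, rateUser float64) counts {
	out := counts{}
	mergeScaled(out, base, 1.0)
	mergeScaled(out, ciPatch, rateCI)
	mergeScaled(out, learned, rateUser)
	return out
}

func main() {
	base := counts{"api": {'-': 100}}
	ci := counts{"api": {'-': 40}}
	user := counts{"api": {'-': 10, 'x': 4}}
	eff := effective(base, ci, user, 0.05, 0.50)
	fmt.Println(eff["api"]['-'], eff["api"]['x']) // 100+2+5 and 0+0+2
}
```

With the default rates, ten user observations contribute as much as one hundred CI observations.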
- Seed the generator with the trailing Order-1 characters of a known label
- Sample the next character from the smoothed distribution
- Repeat until the end token or the 63-character RFC 1035 label limit is reached
- Generate 5× the requested number of candidates, score all, return the top N
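The generation steps above can be sketched as a small, self-contained program. Several details are assumptions for illustration only: the `^` padding and `$` end token, the mean log-probability used for ranking, and all function names. The sketch also skips label validation, which the real generator performs.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"sort"
)

const (
	order    = 4
	maxLabel = 63                                       // RFC 1035 label limit
	symbols  = "abcdefghijklmnopqrstuvwxyz0123456789-$" // '$' is a hypothetical end token
	pad      = "^^^"                                    // hypothetical start padding (order-1 chars)
)

type model map[string]map[byte]uint32 // context -> next char -> count

// train counts every 4-gram of one label, including the end token.
func (m model) train(label string) {
	s := pad + label + "$"
	for i := order - 1; i < len(s); i++ {
		ctx := s[i-order+1 : i]
		if m[ctx] == nil {
			m[ctx] = map[byte]uint32{}
		}
		m[ctx][s[i]]++
	}
}

// prob is the Laplace-smoothed P(c | ctx).
func (m model) prob(ctx string, c byte) float64 {
	var total uint32
	for _, n := range m[ctx] {
		total += n
	}
	return (float64(m[ctx][c]) + 1) / float64(int(total)+len(symbols))
}

// score is the mean log-probability per character (an assumed ranking metric).
func (m model) score(label string) float64 {
	s, sum := pad+label+"$", 0.0
	for i := order - 1; i < len(s); i++ {
		sum += math.Log(m.prob(s[i-order+1:i], s[i]))
	}
	return sum / float64(len(s)-order+1)
}

// sample draws the next symbol from the smoothed distribution for ctx.
func (m model) sample(ctx string, rng *rand.Rand) byte {
	r := rng.Float64()
	for i := 0; i < len(symbols); i++ {
		if r -= m.prob(ctx, symbols[i]); r <= 0 {
			return symbols[i]
		}
	}
	return '$' // guards against floating-point leftovers
}

// generate oversamples 5x, deduplicates, ranks by score, and keeps the top n.
func (m model) generate(seed string, n int, rng *rand.Rand) []string {
	seen, out := map[string]bool{}, []string{}
	for i := 0; i < 5*n; i++ {
		label := seed
		for len(label) < maxLabel {
			s := pad + label
			c := m.sample(s[len(s)-(order-1):], rng)
			if c == '$' {
				break
			}
			label += string(c)
		}
		if !seen[label] {
			seen[label] = true
			out = append(out, label)
		}
	}
	sort.Slice(out, func(i, j int) bool { return m.score(out[i]) > m.score(out[j]) })
	if len(out) > n {
		out = out[:n]
	}
	return out
}

// demo trains on a few made-up labels and generates with a fixed RNG seed.
func demo(n int) (model, []string) {
	m := model{}
	for _, l := range []string{"api", "api-v2", "api-prod", "api-gateway"} {
		m.train(l)
	}
	return m, m.generate("api", n, rand.New(rand.NewSource(1)))
}

func main() {
	_, out := demo(3)
	fmt.Println(out)
}
```

Oversampling before ranking trades a little extra sampling work for better candidates: most random draws are poor, and scoring is cheap.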
| Signal | Action |
|---|---|
| DNS A / AAAA / CNAME resolved | Increment n-gram counts for that label (positive) |
| Structurally invalid label generated | Decrement counts (negative) |
| NXDOMAIN | No change — a valid name that is inactive today may become live |
The learned delta is stored separately in ~/.config/amass/nlp_learned.bin and merged
into the base model at startup with a configurable learn rate (default 0.5).
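The feedback rules in the table can be sketched as a single update function. This is an illustrative guess at the behaviour, not the plugin's Reinforce: the `Signal` constants, the `^^^` padding, and the `$` end token are hypothetical, and the saturation checks reflect the underflow and overflow safety mentioned in the test suite.

```go
package main

import (
	"fmt"
	"math"
)

// Signal is a hypothetical enum for the three feedback outcomes.
type Signal int

const (
	Resolved Signal = iota // DNS A / AAAA / CNAME resolved
	Invalid                // structurally invalid label generated
	NXDomain               // no change
)

type Model struct {
	Counts map[string]map[byte]uint32 // context -> next char -> count
	Total  map[string]uint32          // context -> total observations
}

// Reinforce applies one feedback signal to every 4-gram of label.
// Increments saturate at MaxUint32 and decrements stop at zero.
func (m *Model) Reinforce(label string, sig Signal) {
	if sig == NXDomain {
		return // a valid name that is inactive today may become live
	}
	s := "^^^" + label + "$" // hypothetical start padding and end token
	for i := 3; i < len(s); i++ {
		ctx, c := s[i-3:i], s[i]
		n := m.Counts[ctx][c]
		switch {
		case sig == Resolved && n < math.MaxUint32 && m.Total[ctx] < math.MaxUint32:
			if m.Counts[ctx] == nil {
				m.Counts[ctx] = map[byte]uint32{}
			}
			m.Counts[ctx][c] = n + 1
			m.Total[ctx]++
		case sig == Invalid && n > 0:
			m.Counts[ctx][c] = n - 1
			m.Total[ctx]--
		}
	}
}

func main() {
	m := &Model{Counts: map[string]map[byte]uint32{}, Total: map[string]uint32{}}
	m.Reinforce("api", Resolved)
	m.Reinforce("api", Resolved)
	m.Reinforce("api", NXDomain)
	m.Reinforce("api", Invalid)
	fmt.Println(m.Counts["^^^"]['a']) // two increments, one decrement
}
```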
# Build all packages and CLI tools
go build ./...
# Build individual tools
go build -o train ./cmd/train
go build -o vocab ./cmd/vocab
go build -o mkdelta ./cmd/mkdelta
go build -o aggregate ./cmd/aggregate
No special build flags are required. The module is pure Go with no external dependencies.
Trains a new model from one or more hostname files.
Usage:
# --order sets the n-gram order (default 4)
go run ./cmd/train \
  --input <file> [--input <file> ...] \
  --order 4 \
  --output nlp_model.bin
Reads hostname files line by line (O(1) memory per line), extracts every DNS label from every hostname, and trains the model. Supports multiple input files in any combination of the formats listed below.
Builds the locale-conditioned subdomain vocabulary bundle (nlp_vocab.bin) consumed by the
Amass NLP plugin. Where the character model learns the shape of labels, the vocabulary
supplies the actual locale-characteristic words a domain in a given region tends to use
(e.g. German impressum/anmeldung, Japanese romaji saiyo).
Usage:
# --min-count        minimum in-locale occurrences for a token (default 5)
# --top              max tokens per locale, by score (default 1000; 0 = all)
# --alpha            log-odds smoothing constant (default 0.5)
# --per-country-dir  optional: also write nlp_vocab_<cc>.bin per country
go run ./cmd/vocab \
  --input ct_hostnames.txt [--input <file> ...] \
  --min-count 10 \
  --top 2000 \
  --alpha 0.5 \
  --output nlp_vocab.bin \
  --per-country-dir ./out
Each host is classified by its public suffix into a locale group (a ccTLD such as .de, with
language-sharing neighbours folded in — e.g. Austria's .at under German, Brazil's .com.br
under Portuguese) or the international gTLD baseline (.com/.net/.org/...). For every
locale, a label's weight is the log-odds of its in-locale frequency versus the baseline;
labels that are over-represented in the locale are kept. Universally common labels (www,
mail, api) cancel against the baseline and are excluded — those are already covered by
brute-force wordlists and the character model. Best fed Certificate Transparency corpora
(full FQDNs with real subdomains); bare-label wordlists like SecLists carry no locale signal.
The bundle is gob-encoded and wire-compatible with the plugin's loader.
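To make the scoring idea concrete, here is one common form of a smoothed log-odds ratio. It is an assumption that cmd/vocab uses this exact formula; the function name `logOdds` and the counts in `main` are invented for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// logOdds compares a token's smoothed odds inside a locale with its odds in
// the gTLD baseline. alpha is the additive smoothing constant.
// Positive values mean the token is over-represented in the locale.
func logOdds(inLocale, localeTotal, inBase, baseTotal, alpha float64) float64 {
	locale := (inLocale + alpha) / (localeTotal - inLocale + alpha)
	base := (inBase + alpha) / (baseTotal - inBase + alpha)
	return math.Log(locale) - math.Log(base)
}

func main() {
	// Made-up counts: 1,000 labels seen under .de, 100,000 under gTLDs.
	fmt.Printf("impressum: %.2f\n", logOdds(50, 1000, 10, 100000, 0.5))
	fmt.Printf("www:       %.4f\n", logOdds(300, 1000, 30000, 100000, 0.5))
}
```

A token such as www that appears at the same rate in both corpora scores close to zero, so it falls out of the ranking, while a locale-specific token scores well above zero.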
Computes the marginal counts between two model versions. Used by the CI workflow to
produce the monthly nlp_delta_YYYYMMDD.bin release asset.
Usage:
go run ./cmd/mkdelta \
--prev prev_model.bin \
--new nlp_model.bin \
--output nlp_delta_YYYYMMDD.bin
For each n-gram context and next-character pair the delta contains
max(0, new_count − prev_count). Counts that stayed the same or decreased are omitted.
The output is a valid NgramModel gob file and can be merged with NgramModel.Merge.
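The delta rule is small enough to show in full. The sketch below uses a bare count map instead of NgramModel, and `marginal` is a hypothetical name.

```go
package main

import "fmt"

type counts map[string]map[byte]uint32 // context -> next char -> count

// marginal returns max(0, next-prev) for every pair, omitting zeros so the
// result stays sparse.
func marginal(prev, next counts) counts {
	out := counts{}
	for ctx, row := range next {
		for c, n := range row {
			p := prev[ctx][c] // zero when the pair is new
			if n <= p {
				continue // unchanged or decreased: omitted
			}
			if out[ctx] == nil {
				out[ctx] = map[byte]uint32{}
			}
			out[ctx][c] = n - p
		}
	}
	return out
}

func main() {
	prev := counts{"api": {'-': 10, '1': 4}}
	next := counts{"api": {'-': 13, '1': 4}, "k8s": {'-': 2}}
	d := marginal(prev, next)
	fmt.Println(d["api"]['-'], d["k8s"]['-'], len(d["api"])) // 3 2 1
}
```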
Merges N operator contribution delta files into a single aggregated delta, scaling each contributor's counts by a fixed factor to prevent any one operator from dominating the community model.
Usage:
# --scale is the per-contributor scale (default 0.1)
go run ./cmd/aggregate \
  --input contrib1.bin [--input contrib2.bin ...] \
  --scale 0.1 \
  --output aggregated_contribs.bin
The --scale flag must be in (0, 1]. The aggregated output is passed to cmd/train
as an additional --input during the CI training run.
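A rough sketch of the aggregation step follows. It assumes scaled counts are summed as floats and rounded once at the end; the real tool may round per contributor instead. The names `aggregate` and `pair` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

type counts map[string]map[byte]uint32 // context -> next char -> count

type pair struct {
	ctx string
	c   byte
}

// aggregate scales every contributor's counts by scale and sums them.
// The scale must lie in (0, 1], as the CLI requires.
func aggregate(contribs []counts, scale float64) (counts, error) {
	if !(scale > 0 && scale <= 1) {
		return nil, errors.New("scale must be in (0, 1]")
	}
	sums := map[pair]float64{}
	for _, contrib := range contribs {
		for ctx, row := range contrib {
			for c, n := range row {
				sums[pair{ctx, c}] += scale * float64(n)
			}
		}
	}
	out := counts{}
	for p, v := range sums {
		n := uint32(math.Round(v)) // assumption: round once, after summing
		if n == 0 {
			continue
		}
		if out[p.ctx] == nil {
			out[p.ctx] = map[byte]uint32{}
		}
		out[p.ctx][p.c] = n
	}
	return out, nil
}

func main() {
	a := counts{"api": {'-': 20}}
	b := counts{"api": {'-': 30}, "vpn": {'1': 2}}
	agg, err := aggregate([]counts{a, b}, 0.1)
	fmt.Println(agg["api"]['-'], len(agg["vpn"]), err)
}
```

At scale 0.1 a single contributor needs roughly ten observations of a pattern to move the aggregated count by one, which limits how far any one operator can pull the community model.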
Amass users may opt in to sharing their learned deltas with the community. When
NLPShareLearned: true is set in the Amass configuration:
- amass --nlp-sync scrubs the user's nlp_learned.bin: any context-character pair whose count is below NLPShareMinCount (default 5) is zeroed before upload. This ensures that patterns seen only a handful of times (potentially sensitive target-specific names) are never shared.
- The scrubbed delta is uploaded to the contributions release tag in this repository.
- On the next monthly CI run, cmd/aggregate merges all contributed deltas at scale 0.1 before training begins. The combined contribution is one additional --input to cmd/train.
Privacy guarantees:
- Only n-gram counts are uploaded — never raw hostnames or FQDNs
- No operator identifier is attached to the file
- Sharing is opt-in; the default is NLPShareLearned: false
- The minimum-count threshold prevents rare (potentially sensitive) patterns from leaking
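The minimum-count scrub can be sketched as a filter over the learned counts. `scrub` is a hypothetical name, and dropping a pair from the sparse map is treated here as equivalent to zeroing it.

```go
package main

import "fmt"

type counts map[string]map[byte]uint32 // context -> next char -> count

// scrub returns a copy of learned that keeps only pairs observed at least
// minCount times. The input is left untouched.
func scrub(learned counts, minCount uint32) counts {
	out := counts{}
	for ctx, row := range learned {
		for c, n := range row {
			if n < minCount {
				continue // rare pattern: never shared
			}
			if out[ctx] == nil {
				out[ctx] = map[byte]uint32{}
			}
			out[ctx][c] = n
		}
	}
	return out
}

func main() {
	learned := counts{
		"api": {'-': 12, 'x': 2}, // 'x' was seen only twice
		"zq7": {'k': 1},          // possibly a target-specific name
	}
	shared := scrub(learned, 5)
	fmt.Println(len(shared), shared["api"]['-'], len(shared["api"])) // 1 12 1
}
```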
go run ./cmd/train \
--input /path/to/amass/resources/namelist.txt \
--input /path/to/amass/resources/alterations.txt \
--order 4 \
--output nlp_model.bin
Produces a ~40 KB bootstrap model in under a second.
The CI workflow uses four sources. To replicate locally:
mkdir -p data
# 1. SecLists 1M subdomain wordlist (~8.5 MB compressed)
curl -L "https://github.com/danielmiessler/SecLists/raw/master/Discovery/DNS/subdomains-top1million-full.7z" \
-o data/subdomains-top1million-full.7z
7z e data/subdomains-top1million-full.7z -odata/ -y
rm data/subdomains-top1million-full.7z
# 2. Tranco top-1M with subdomains
curl -L "https://tranco-list.eu/top-1m-incl-subdomains.csv.zip" -o data/tranco.zip
TRANCO_ENTRY=$(unzip -Z1 data/tranco.zip | head -n 1)
unzip -p data/tranco.zip "$TRANCO_ENTRY" > data/tranco.csv
rm data/tranco.zip
# 3. crt.sh Certificate Transparency log sample
for tld in com net org io dev app; do
  curl -sSf "https://crt.sh/?q=%25.$tld&output=json&limit=10000" \
    | python3 -c "
import json,sys
data=json.load(sys.stdin)
seen=set()
for e in data:
    for n in e.get('name_value','').splitlines():
        n=n.strip().lstrip('*.')
        if n and n not in seen:
            seen.add(n)
            print(n)
" >> data/ct_hostnames.txt 2>/dev/null || true
done
# 4. Train — no pre-processing needed; all formats are detected automatically
go run ./cmd/train \
--input data/subdomains-top1million-full.txt \
--input data/tranco.csv \
--input data/ct_hostnames.txt \
--order 4 \
--output nlp_model.bin
The trainer auto-detects the format of each --input file:
| Format | Example line | Notes |
|---|---|---|
| Plain label or FQDN | api.example.com | One entry per line |
| Gzip-compressed | (detected by magic bytes) | Decompressed transparently |
| SecLists COUNT LABEL | 5617 www | Leading decimal count stripped automatically |
| Tranco CSV RANK,FQDN | 1,google.com | Leading rank column stripped automatically |
| Rapid7 FDNS JSON | {"name":"api.example.com","type":"A",...} | Extracts name field |
| Zone file | api.example.com. 300 IN A 1.2.3.4 | Extracts first field, strips trailing dot |
Every label from every hostname is extracted and ingested. Files with hundreds of millions of lines are streamed line by line in constant memory.
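Per-line format detection along these lines could look like the sketch below. It is not the code in pkg/corpus: `extractHost`, `isDigits`, `clean`, and `labels` are hypothetical names, the detection order is an assumption, and gzip handling is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// isDigits reports whether s is a non-empty run of ASCII digits.
func isDigits(s string) bool {
	if s == "" {
		return false
	}
	for i := 0; i < len(s); i++ {
		if s[i] < '0' || s[i] > '9' {
			return false
		}
	}
	return true
}

// clean lower-cases a hostname and strips whitespace and a trailing dot.
func clean(h string) string {
	return strings.ToLower(strings.TrimSuffix(strings.TrimSpace(h), "."))
}

// extractHost pulls the hostname out of one input line, or returns "".
func extractHost(line string) string {
	line = strings.TrimSpace(line)
	if line == "" {
		return ""
	}
	if strings.HasPrefix(line, "{") { // JSON record: use the name field
		var rec struct {
			Name string `json:"name"`
		}
		if err := json.Unmarshal([]byte(line), &rec); err != nil {
			return ""
		}
		return clean(rec.Name)
	}
	if i := strings.IndexByte(line, ','); i > 0 && isDigits(line[:i]) {
		return clean(line[i+1:]) // RANK,FQDN
	}
	fields := strings.Fields(line)
	if len(fields) >= 2 && isDigits(fields[0]) {
		return clean(fields[1]) // COUNT LABEL
	}
	return clean(fields[0]) // plain entry, or first field of a zone record
}

// labels splits a hostname into its DNS labels.
func labels(host string) []string {
	if host == "" {
		return nil
	}
	return strings.Split(host, ".")
}

func main() {
	for _, line := range []string{
		"api.example.com",
		"5617 www",
		"1,google.com",
		`{"name":"api.example.com","type":"A"}`,
		"api.example.com. 300 IN A 1.2.3.4",
	} {
		fmt.Println(labels(extractHost(line)))
	}
}
```

Because each line is classified and consumed independently, a `bufio.Scanner` loop around `extractHost` keeps memory use flat regardless of file size.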
After producing a new nlp_model.bin, generate the delta against the previous release:
go run ./cmd/mkdelta \
--prev prev_model.bin \
--new nlp_model.bin \
--output "nlp_delta_$(date +%Y%m%d).bin"
# Update the baseline for the next delta
cp nlp_model.bin prev_model.bin
The delta file is published alongside the full model in the GitHub Release so that Amass
clients can sync incrementally (amass --nlp-sync) without downloading the full model.
# Run all tests with verbose output
go test -v ./...
# Run tests with coverage report
go test -v -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
# Run benchmarks
go test -bench=. -benchmem ./pkg/ngram/
The test suite covers:
| Test | What it checks |
|---|---|
| TestNew | Model initialises with correct order and version |
| TestTrain_* | Counts populate; invalid/oversized labels are rejected |
| TestScore_* | Known labels outscore unknown; invalid labels return −∞ |
| TestGenerate_* | Output count, RFC 1123 validity, no duplicates, descending score order |
| TestReinforce_* | Positive/negative effect on score; underflow and overflow safety |
| TestMerge_* | Delta integration; nil/zero-scale no-ops |
| TestSaveLoad_RoundTrip | Gob encode → decode preserves all counts and scores exactly |
| TestSave_IsAtomic | No .tmp file left after successful write |
| TestLoadFile_NotFound | Returns error for missing path |
| TestIsValidLabel | RFC 1123 label validation edge cases |
| TestPerplexity_HeldOut | Held-out perplexity < 30 (random baseline = 38) |
The perplexity test trains on a small corpus and verifies the model achieves meaningful compression on a held-out set of real subdomain patterns, catching regressions in the core learning algorithm.
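The perplexity figure and the random baseline of 38 can be reproduced with a short sketch. Perplexity is computed here as exp of the negative mean log-probability per predicted symbol; the `^^^` padding, `$` end token, training labels, and function names are illustrative assumptions rather than the contents of ngram_test.go.

```go
package main

import (
	"fmt"
	"math"
)

const symbols = 38 // assumption: 37 label characters plus an end token

type model map[string]map[byte]uint32 // context -> next char -> count

// train counts every 4-gram of one label, including the end token.
func (m model) train(label string) {
	s := "^^^" + label + "$"
	for i := 3; i < len(s); i++ {
		ctx := s[i-3 : i]
		if m[ctx] == nil {
			m[ctx] = map[byte]uint32{}
		}
		m[ctx][s[i]]++
	}
}

// prob is the Laplace-smoothed P(c | ctx).
func (m model) prob(ctx string, c byte) float64 {
	var total uint32
	for _, n := range m[ctx] {
		total += n
	}
	return (float64(m[ctx][c]) + 1) / (float64(total) + symbols)
}

// perplexity returns exp(-mean log P) over every predicted symbol in heldOut.
func (m model) perplexity(heldOut []string) float64 {
	sum, n := 0.0, 0
	for _, label := range heldOut {
		s := "^^^" + label + "$"
		for i := 3; i < len(s); i++ {
			sum += math.Log(m.prob(s[i-3:i], s[i]))
			n++
		}
	}
	if n == 0 {
		return math.NaN()
	}
	return math.Exp(-sum / float64(n))
}

func main() {
	heldOut := []string{"api-v1"}

	untrained := model{}
	fmt.Printf("untrained: %.1f\n", untrained.perplexity(heldOut)) // uniform over 38 symbols

	trained := model{}
	for _, l := range []string{"api", "api-v2", "api-prod", "api-dev", "api-test"} {
		trained.train(l)
	}
	fmt.Printf("trained:   %.1f\n", trained.perplexity(heldOut))
}
```

An untrained model assigns 1/38 to every symbol, so its perplexity is exactly the random baseline; any learning at all pushes the held-out figure below that.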
.github/workflows/train.yml runs monthly (1st of the month, 02:00 UTC) and on
workflow_dispatch or repository_dispatch from the Amass release pipeline.
Steps:
- Download operator contributions: fetches *.bin delta files from the contributions release tag (gracefully skipped if the tag does not exist)
- Aggregate contributions: runs cmd/aggregate on all contributed deltas at scale 0.1; skipped if no contributions were received
- Download SecLists: subdomains-top1million-full.7z from danielmiessler/SecLists; hard-fails if the extracted file has fewer than 1,000,000 lines
- Download Tranco: top-1m-incl-subdomains.csv.zip from tranco-list.eu; hard-fails if fewer than 500,000 lines
- Sample crt.sh: fetches up to 10,000 recent FQDNs per TLD across the international gTLD baseline (com/net/org/io) and the bundled locale ccTLDs (de/fr/jp/...)
- Train model: runs cmd/train on all available inputs; produces nlp_model.bin
- Build vocabulary: runs cmd/vocab on the FQDN corpora (CT logs + Tranco); produces nlp_vocab.bin, or is omitted if the sample yields no locale-specific tokens
- Compute delta: runs cmd/mkdelta against prev_model.bin; produces nlp_delta_YYYYMMDD.bin
- Update baseline: commits the new nlp_model.bin as prev_model.bin with [skip ci] for the next monthly delta
- Create GitHub Release: publishes nlp_model.bin, the delta file, and (when present) nlp_vocab.bin as assets under the tag model-YYYYMMDD-<amass-version>
SHA256 checksums for all downloaded files are logged to the CI output as an audit trail. All GitHub Actions are pinned to specific commit SHAs to prevent supply-chain substitution.
The release tag format is model-YYYYMMDD-<amass-version>. The Amass release pipeline
downloads the latest nlp_model.bin asset before packaging its binaries.
amass-nlp (this repo) owasp-amass/amass
───────────────────────────────────── ──────────────────────────────────
cmd/train → nlp_model.bin → resources/nlp_model.bin (base layer)
cmd/vocab → nlp_vocab.bin → resources/nlp_vocab.bin (locale vocabulary)
cmd/mkdelta → nlp_delta_YYYYMMDD.bin → ~/.config/amass/nlp_ci_patch.bin (CI layer)
pkg/ngram/ngram.go → engine/plugins/nlp/model.go (shared codec)
pkg/vocab/vocab.go → engine/plugins/nlp/vocab.go (shared codec)
The Amass plugin contains a private copy of the model codec (same algorithm, unexported
types) to avoid an import cycle during the pre-publication phase. Once this module is
published to owasp-amass/amass-nlp, that private copy will be replaced with a direct
import of github.com/owasp-amass/amass-nlp/pkg/ngram.
- SecLists by Daniel Miessler: the subdomains-top1million-full wordlist is the primary training corpus for this model. SecLists is the security tester's companion: a collection of multiple types of lists used during security assessments, including usernames, passwords, URLs, sensitive data patterns, fuzzing payloads, web shells, and more. The subdomain lists used here were zone-transferred from Cloudflare and represent the one million most-used subdomains.
- Tranco by Victor Le Pochat, Tom Van Goethem, Samaneh Tajalizadehkhoob, Maciej Korczyński, and Wouter Joosen: a research-grade domain popularity list combining CrUX, Farsight passive DNS, Majestic, Cloudflare Radar, and Cisco Umbrella. Used here in its subdomains variant to provide real traffic-weighted FQDNs from the most-visited sites on the internet.
- crt.sh by Sectigo: Certificate Transparency log search used to supplement training with a live sample of recently-seen FQDNs.
Apache 2.0 — see LICENSE.