Dude4Linux/amass-nlp

amass-nlp

Training companion for the OWASP Amass NLP plugin. Produces nlp_model.bin — a character-level 4-gram language model that Amass uses to generate probabilistic subdomain name candidates during reconnaissance.

Overview

Amass's traditional subdomain generation relies on a static wordlist and deterministic mutation rules. amass-nlp adds a statistical component: a character-level Markov chain trained on millions of real subdomain labels. Given a seed label such as api, the model generates plausible continuations (api-v2, api-gateway, api-prod) ranked by learned likelihood, rather than guessing from a fixed list.

Key properties:

  • Pure Go — no CGO, no external C libraries, builds with CGO_ENABLED=0
  • Streaming trainer — processes hundred-million-line datasets in O(1) memory per line
  • Small model — 2–3 MB gob-encoded binary; loads in milliseconds, scores in microseconds
  • Online learning — the Amass plugin updates model weights in real time from DNS feedback and persists the learned delta across sessions
  • Community deltas — operators may contribute anonymised learned deltas; these are aggregated and folded into every monthly release

Repository layout

amass-nlp/
├── pkg/ngram/
│   ├── ngram.go        # NgramModel: Train, Generate, Score, Reinforce, Merge, Save, Load
│   └── ngram_test.go   # Unit tests and perplexity benchmark
├── pkg/corpus/
│   ├── corpus.go       # Shared corpus reader: gzip/stream + hostname extraction
│   └── corpus_test.go
├── pkg/vocab/
│   ├── vocab.go        # LocaleVocab/Bundle types + gob I/O (wire-compatible with the plugin)
│   ├── build.go        # Differential-frequency extractor: public-suffix split + log-odds
│   └── *_test.go
├── cmd/train/
│   └── main.go         # Streaming trainer CLI
├── cmd/vocab/
│   └── main.go         # Locale vocabulary bundle builder CLI
├── cmd/mkdelta/
│   └── main.go         # Computes marginal counts between two model versions
├── cmd/aggregate/
│   └── main.go         # Merges operator contribution deltas into one file
├── .github/workflows/
│   └── train.yml       # Monthly CI: contributions + SecLists + Tranco + crt.sh → GH Release
├── prev_model.bin      # Previous release model; delta baseline for the next CI run
└── go.mod

Model design

Character 4-gram Markov chain

The model learns the probability of each character given the three preceding characters, over the DNS label alphabet [a-z0-9-] plus a start and end token.

P(char | context) = (count(context, char) + 1) / (count(context) + vocabSize)

Laplace smoothing ensures every context-character pair has non-zero probability, so the model degrades gracefully on unseen patterns.
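
As a rough illustration of the formula above, the following sketch computes the smoothed probability from raw counts. The value 38 for vocabSize is an assumption (26 letters, 10 digits, the hyphen, and the end token), chosen to match the random baseline quoted in the test table; the function name is hypothetical.

```go
package main

import "fmt"

// vocabSize is an assumption: 26 letters + 10 digits + '-' + the end token.
const vocabSize = 38

// smoothedProb applies add-one (Laplace) smoothing.
// pairCount is count(context, char); contextTotal is count(context).
func smoothedProb(pairCount, contextTotal uint32) float64 {
	return (float64(pairCount) + 1) / (float64(contextTotal) + vocabSize)
}

func main() {
	// Context "api" observed 100 times, followed by '-' in 40 of them.
	fmt.Printf("seen pair:      %.4f\n", smoothedProb(40, 100))
	// A context never seen in training falls back to the uniform 1/vocabSize.
	fmt.Printf("unseen context: %.4f\n", smoothedProb(0, 0))
}
```

An unseen context therefore scores every character at 1/38, which is what lets the model keep generating when it wanders outside its training data.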

Why 4-grams: A context of three characters captures common subdomain patterns (api-v, k8s-, -prod, us-east) without requiring a transformer or GPU. Inference takes microseconds per candidate.

Storage: Sparse map (context → next_char → count). With real-world training data the model is 2–3 MB — small enough to commit to the Amass repository and bundle in every release binary.

Model file format

The model is serialised with Go's encoding/gob:

type NgramModel struct {
    Order   int                        // n-gram order (4)
    Counts  map[string]map[byte]uint32 // context → next char → count
    Total   map[string]uint32          // context → total observations
    Version int                        // codec version for forward compatibility
}

A full model and a delta file use the same format. A delta simply has sparse marginal counts (only the counts that increased since the last release). This means every .bin file produced by this repo can be inspected, merged, or applied with the same tools.
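
Because full models and deltas share one struct, a single encode/decode path covers both. The sketch below round-trips a small delta through encoding/gob in memory. The struct is copied from above; the Version value of 1 and the roundTrip helper are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// NgramModel mirrors the struct shown above.
type NgramModel struct {
	Order   int
	Counts  map[string]map[byte]uint32
	Total   map[string]uint32
	Version int
}

// roundTrip gob-encodes m into a buffer and decodes it into a fresh value.
func roundTrip(m *NgramModel) (*NgramModel, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(m); err != nil {
		return nil, err
	}
	out := new(NgramModel)
	if err := gob.NewDecoder(&buf).Decode(out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	// A delta is just a sparse model: only the counts that changed.
	delta := &NgramModel{
		Order:   4,
		Counts:  map[string]map[byte]uint32{"api": {'-': 3}},
		Total:   map[string]uint32{"api": 3},
		Version: 1, // assumed codec version, for illustration only
	}
	got, err := roundTrip(delta)
	if err != nil {
		panic(err)
	}
	fmt.Println(got.Order, got.Counts["api"]['-'], got.Total["api"])
}
```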

Files are written atomically (.tmp + rename) to prevent corruption on SIGTERM.
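
A minimal sketch of the .tmp + rename pattern follows. It is not the repository's Save implementation; writeFileAtomic and demo are hypothetical names, and the fsync before the rename is an extra precaution assumed here.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to path+".tmp" and then renames it over path,
// so readers see either the old file or the complete new one.
func writeFileAtomic(path string, data []byte) error {
	tmp := path + ".tmp"
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	_, err = f.Write(data)
	if err == nil {
		err = f.Sync() // flush to disk before the rename makes it visible
	}
	if cerr := f.Close(); err == nil {
		err = cerr
	}
	if err != nil {
		os.Remove(tmp) // do not leave a partial temp file behind
		return err
	}
	return os.Rename(tmp, path)
}

// demo writes into a scratch directory and reports what is left afterwards.
func demo() (content string, tmpLeft bool, err error) {
	dir, err := os.MkdirTemp("", "amass-nlp-demo")
	if err != nil {
		return "", false, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "nlp_model.bin")
	if err = writeFileAtomic(path, []byte("model bytes")); err != nil {
		return "", false, err
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return "", false, err
	}
	_, statErr := os.Stat(path + ".tmp")
	return string(data), statErr == nil, nil
}

func main() {
	content, tmpLeft, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("content=%q tmpLeft=%v\n", content, tmpLeft)
}
```

If the process is killed before the rename, only the .tmp file is affected and the previous model stays intact.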

Three-layer model architecture

When Amass starts, it builds an effective model by merging three layers:

effective = base_model                          (resources/nlp_model.bin, shipped with binary)
          + NLPLearnRateCI   × nlp_ci_patch.bin  (~/.config/amass/, replaced on sync)
          + NLPLearnRateUser × nlp_learned.bin   (~/.config/amass/, never replaced)

| Layer | File | Updated by | Default rate |
|---|---|---|---|
| Base | resources/nlp_model.bin | Amass release pipeline | |
| CI patch | ~/.config/amass/nlp_ci_patch.bin | amass --nlp-sync | 0.05 |
| User | ~/.config/amass/nlp_learned.bin | Online learning during scans | 0.50 |

At the default rates the user layer outweighs the CI patch, because its rate (0.50) is ten times the CI rate (0.05). This means the model adapts to the names actually found in a user's target environment, without overwriting those learned patterns when a new monthly release arrives.
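
The layering can be sketched as a scaled sum over count maps. This is an illustration under assumptions, not the plugin's Merge: the names addScaled and effective are hypothetical, rounding scaled counts to the nearest integer is a guess, and overflow handling is omitted.

```go
package main

import (
	"fmt"
	"math"
)

type counts map[string]map[byte]uint32

// addScaled folds rate × layer into dst. A nil layer or a non-positive rate
// is a no-op. Rounding to the nearest integer is an assumption.
func addScaled(dst, layer counts, rate float64) {
	if layer == nil || rate <= 0 {
		return
	}
	for ctx, next := range layer {
		for ch, n := range next {
			add := uint32(math.Round(float64(n) * rate))
			if add == 0 {
				continue
			}
			if dst[ctx] == nil {
				dst[ctx] = make(map[byte]uint32)
			}
			dst[ctx][ch] += add
		}
	}
}

// effective builds base + rateCI × ciPatch + rateUser × learned.
func effective(base, ciPatch, learned counts, rateCI, rateUser float64) counts {
	out := counts{}
	addScaled(out, base, 1.0)
	addScaled(out, ciPatch, rateCI)
	addScaled(out, learned, rateUser)
	return out
}

func main() {
	base := counts{"api": {'-': 100}}
	ci := counts{"api": {'-': 40}}
	user := counts{"api": {'-': 10, 'x': 4}}
	eff := effective(base, ci, user, 0.05, 0.50)
	// 100 + 40×0.05 + 10×0.50 = 107 for '-', and 4×0.50 = 2 for 'x'.
	fmt.Println(eff["api"]['-'], eff["api"]['x'])
}
```

In this toy case ten user observations add more weight (5) than forty CI observations (2), which is the effect the default rates are meant to produce.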

Generation

  1. Seed the generator with the trailing Order-1 characters of a known label
  2. Sample the next character from the smoothed distribution
  3. Repeat until the end token or the 63-character RFC 1035 label limit is reached
  4. Generate 5× the requested number of candidates, score all, return the top N
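
The four steps above can be sketched as follows. This is a toy under stated assumptions, not NgramModel.Generate: '$' stands in for the end token, short contexts are used as-is instead of start-token padding, candidates are ranked by mean log-probability, and the model in toyModel is invented for the demo.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"sort"
)

const (
	symbols  = "abcdefghijklmnopqrstuvwxyz0123456789-$" // '$' is a hypothetical end token
	endTok   = '$'
	maxLabel = 63 // RFC 1035 label limit
	order    = 4
)

type counts map[string]map[byte]uint32

// tail returns the last order-1 bytes of s, or all of s if it is shorter.
func tail(s string) string {
	if len(s) > order-1 {
		return s[len(s)-(order-1):]
	}
	return s
}

// prob is the add-one smoothed probability of ch following ctx.
func prob(c counts, ctx string, ch byte) float64 {
	var total uint32
	for _, n := range c[ctx] {
		total += n
	}
	return (float64(c[ctx][ch]) + 1) / (float64(total) + float64(len(symbols)))
}

// sampleNext draws one symbol from the smoothed distribution for ctx.
func sampleNext(c counts, ctx string, rng *rand.Rand) byte {
	r := rng.Float64()
	for i := 0; i < len(symbols); i++ {
		r -= prob(c, ctx, symbols[i])
		if r < 0 {
			return symbols[i]
		}
	}
	return endTok
}

// generate extends seed until the end token or the label length limit.
func generate(c counts, seed string, rng *rand.Rand) string {
	label := seed
	for len(label) < maxLabel {
		next := sampleNext(c, tail(label), rng)
		if next == endTok {
			break
		}
		label += string(next)
	}
	return label
}

// score is the mean log-probability per symbol, end token included.
func score(c counts, label string) float64 {
	seq := label + string(endTok)
	var sum float64
	for i := 0; i < len(seq); i++ {
		sum += math.Log(prob(c, tail(seq[:i]), seq[i]))
	}
	return sum / float64(len(seq))
}

// topCandidates oversamples 5×n, drops duplicates, and keeps the n best.
func topCandidates(c counts, seed string, n int, rng *rand.Rand) []string {
	seen := map[string]bool{}
	var out []string
	for i := 0; i < 5*n; i++ {
		if cand := generate(c, seed, rng); !seen[cand] {
			seen[cand] = true
			out = append(out, cand)
		}
	}
	sort.Slice(out, func(i, j int) bool { return score(c, out[i]) > score(c, out[j]) })
	if len(out) > n {
		out = out[:n]
	}
	return out
}

// toyModel strongly prefers the continuation "-v2" after "api".
func toyModel() counts {
	return counts{"api": {'-': 1000}, "pi-": {'v': 1000}, "i-v": {'2': 1000}, "-v2": {'$': 1000}}
}

// demo runs the generator with a fixed random seed.
func demo() []string {
	return topCandidates(toyModel(), "api", 3, rand.New(rand.NewSource(1)))
}

func main() {
	for _, cand := range demo() {
		fmt.Println(cand)
	}
}
```

Oversampling and re-scoring matter because sampling is noisy: a single unlucky draw sends the chain into unseen contexts, and ranking pushes those candidates to the bottom.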

Online learning (in the Amass plugin)

| Signal | Action |
|---|---|
| DNS A / AAAA / CNAME resolved | Increment n-gram counts for that label (positive) |
| Structurally invalid label generated | Decrement counts (negative) |
| NXDOMAIN | No change: a valid name that is inactive today may become live |

The learned delta is stored separately in ~/.config/amass/nlp_learned.bin and merged into the base model at startup with a configurable learn rate (default 0.5).
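
A sketch of how the three signals might map onto count updates is shown below. The Signal type, the '^' and '$' padding tokens, and the reinforce signature are all hypothetical; the context totals tracked by the real model are left out for brevity.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

const (
	order    = 4
	startTok = "^" // hypothetical start-of-label marker
	endTok   = "$" // hypothetical end-of-label marker
)

// Signal is the DNS feedback observed for a generated label.
type Signal int

const (
	Resolved     Signal = iota // A / AAAA / CNAME answer received
	InvalidLabel               // generator produced a structurally invalid label
	NXDomain                   // name does not exist today
)

type counts map[string]map[byte]uint32

// reinforce walks every (context, next char) pair in label and nudges its
// count by one, saturating at 0 and at the uint32 maximum.
func reinforce(c counts, label string, sig Signal) {
	if sig == NXDomain {
		return // no change: the name may become live later
	}
	padded := strings.Repeat(startTok, order-1) + label + endTok
	for i := order - 1; i < len(padded); i++ {
		ctx, ch := padded[i-(order-1):i], padded[i]
		n := c[ctx][ch]
		switch {
		case sig == Resolved && n < math.MaxUint32:
			if c[ctx] == nil {
				c[ctx] = make(map[byte]uint32)
			}
			c[ctx][ch] = n + 1
		case sig == InvalidLabel && n > 0:
			c[ctx][ch] = n - 1
		}
	}
}

func main() {
	c := counts{}
	reinforce(c, "api", Resolved)
	reinforce(c, "api", Resolved)
	reinforce(c, "api", NXDomain)
	fmt.Println(c["^^^"]['a'], c["api"]['$'])
}
```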

Building

# Build all packages and CLI tools
go build ./...

# Build individual tools
go build -o train   ./cmd/train
go build -o vocab   ./cmd/vocab
go build -o mkdelta ./cmd/mkdelta
go build -o aggregate ./cmd/aggregate

No special build flags are required. The module is pure Go with no external dependencies.

Tools

cmd/train — Streaming trainer

Trains a new model from one or more hostname files.

Usage:
  go run ./cmd/train \
    --input <file> [--input <file> ...] \
    --order  4 \
    --output nlp_model.bin

Flags:
  --order   n-gram order (default 4)

Reads hostname files line by line, holding only the current line in memory (O(1) memory per line), extracts every DNS label from every hostname, and trains the model. Supports multiple input files in any combination of the formats listed below.

cmd/vocab — Locale vocabulary builder

Builds the locale-conditioned subdomain vocabulary bundle (nlp_vocab.bin) consumed by the Amass NLP plugin. Where the character model learns the shape of labels, the vocabulary supplies the actual locale-characteristic words a domain in a given region tends to use (e.g. German impressum/anmeldung, Japanese romaji saiyo).

Usage:
  go run ./cmd/vocab \
    --input ct_hostnames.txt [--input <file> ...] \
    --min-count 10 \
    --top       2000 \
    --alpha     0.5 \
    --output    nlp_vocab.bin \
    --per-country-dir ./out

Flags:
  --min-count        minimum in-locale occurrences for a token (default 5)
  --top              max tokens per locale, by score (default 1000; 0 = all)
  --alpha            log-odds smoothing constant (default 0.5)
  --per-country-dir  optional: also write nlp_vocab_<cc>.bin per country

Each host is classified by its public suffix into a locale group (a ccTLD such as .de, with language-sharing neighbours folded in — e.g. Austria's .at under German, Brazil's .com.br under Portuguese) or the international gTLD baseline (.com/.net/.org/...). For every locale, a label's weight is the log-odds of its in-locale frequency versus the baseline; labels that are over-represented in the locale are kept. Universally common labels (www, mail, api) cancel against the baseline and are excluded — those are already covered by brute-force wordlists and the character model. Best fed Certificate Transparency corpora (full FQDNs with real subdomains); bare-label wordlists like SecLists carry no locale signal. The bundle is gob-encoded and wire-compatible with the plugin's loader.
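
The exact scoring formula is not spelled out here, so the following is only one plausible reading: a log-odds ratio with the smoothing constant added to every cell, so that zero counts stay finite. The function names and the example counts are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// logOdds compares a token's odds inside a locale with its odds in the
// baseline. alpha is added to each cell (assumed form of the smoothing).
func logOdds(locCount, locTotal, baseCount, baseTotal, alpha float64) float64 {
	loc := math.Log((locCount + alpha) / (locTotal - locCount + alpha))
	base := math.Log((baseCount + alpha) / (baseTotal - baseCount + alpha))
	return loc - base
}

// keep applies the minimum-count filter and requires over-representation.
func keep(locCount, locTotal, baseCount, baseTotal, alpha, minCount float64) bool {
	return locCount >= minCount && logOdds(locCount, locTotal, baseCount, baseTotal, alpha) > 0
}

func main() {
	// Invented counts: "impressum" is common under .de, rare in the baseline.
	fmt.Printf("impressum: %+.2f\n", logOdds(50, 1000, 1, 100000, 0.5))
	// "www" is common everywhere, so its score is not positive and it is dropped.
	fmt.Printf("www:       %+.2f\n", logOdds(100, 1000, 20000, 100000, 0.5))
	fmt.Println("keep impressum:", keep(50, 1000, 1, 100000, 0.5, 5))
}
```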

cmd/mkdelta — Delta computation

Computes the marginal counts between two model versions. Used by the CI workflow to produce the monthly nlp_delta_YYYYMMDD.bin release asset.

Usage:
  go run ./cmd/mkdelta \
    --prev prev_model.bin \
    --new  nlp_model.bin  \
    --output nlp_delta_YYYYMMDD.bin

For each n-gram context and next-character pair the delta contains max(0, new_count − prev_count). Counts that stayed the same or decreased are omitted. The output is a valid NgramModel gob file and can be merged with NgramModel.Merge.
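
The rule max(0, new_count − prev_count) can be sketched directly over the count maps. computeDelta is a hypothetical name; the real tool also writes the result as a gob-encoded NgramModel, which is omitted here.

```go
package main

import "fmt"

type counts map[string]map[byte]uint32

// computeDelta keeps only the pairs whose count grew between prev and next,
// storing the increase. Unchanged or decreased pairs are omitted.
func computeDelta(prev, next counts) counts {
	delta := counts{}
	for ctx, chars := range next {
		for ch, n := range chars {
			p := prev[ctx][ch] // 0 if the pair is new
			if n > p {
				if delta[ctx] == nil {
					delta[ctx] = make(map[byte]uint32)
				}
				delta[ctx][ch] = n - p
			}
		}
	}
	return delta
}

func main() {
	prev := counts{"api": {'-': 5, 'x': 3}}
	next := counts{"api": {'-': 8, 'x': 2}, "k8s": {'-': 4}}
	d := computeDelta(prev, next)
	_, hasX := d["api"]['x']
	fmt.Println(d["api"]['-'], d["k8s"]['-'], hasX)
}
```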

cmd/aggregate — Contribution aggregation

Merges N operator contribution delta files into a single aggregated delta, scaling each contributor's counts by a fixed factor to prevent any one operator from dominating the community model.

Usage:
  go run ./cmd/aggregate \
    --input contrib1.bin [--input contrib2.bin ...] \
    --scale  0.1 \
    --output aggregated_contribs.bin

Flags:
  --scale   per-contributor scale (default 0.1)

The --scale flag must be in (0, 1]. The aggregated output is passed to cmd/train as an additional --input during the CI training run.

Operator contributions

Amass users may opt in to sharing their learned deltas with the community. When NLPShareLearned: true is set in the Amass configuration:

  1. amass --nlp-sync scrubs the user's nlp_learned.bin: any context-character pair whose count is below NLPShareMinCount (default 5) is zeroed before upload. This ensures that patterns seen fewer than NLPShareMinCount times (potentially sensitive target-specific names) are never shared.
  2. The scrubbed delta is uploaded to the contributions release tag in this repository.
  3. On the next monthly CI run, cmd/aggregate merges all contributed deltas at scale 0.1 before training begins. The combined contribution is one additional --input to cmd/train.
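
The scrubbing in step 1 can be sketched as a threshold filter over the learned counts. The scrub function is hypothetical, and dropping a pair from the sparse map is assumed to be equivalent to zeroing it.

```go
package main

import "fmt"

type counts map[string]map[byte]uint32

// scrub returns a copy of learned that keeps only pairs seen at least
// minCount times. The input is left untouched.
func scrub(learned counts, minCount uint32) counts {
	out := counts{}
	for ctx, chars := range learned {
		for ch, n := range chars {
			if n < minCount {
				continue // rare pattern: possibly target-specific, never shared
			}
			if out[ctx] == nil {
				out[ctx] = make(map[byte]uint32)
			}
			out[ctx][ch] = n
		}
	}
	return out
}

func main() {
	learned := counts{
		"api": {'-': 12},         // common pattern, safe to share
		"acm": {'e': 2},          // seen twice, stays local
		"k8s": {'-': 5, 'x': 4},  // only the pair at the threshold survives
	}
	shared := scrub(learned, 5)
	fmt.Println(len(shared), shared["api"]['-'], shared["k8s"]['-'])
}
```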

Privacy guarantees:

  • Only n-gram counts are uploaded — never raw hostnames or FQDNs
  • No operator identifier is attached to the file
  • Sharing is opt-in; the default is NLPShareLearned: false
  • The minimum-count threshold prevents rare (potentially sensitive) patterns from leaking

Training

Quick smoke test (local wordlists)

go run ./cmd/train \
  --input /path/to/amass/resources/namelist.txt \
  --input /path/to/amass/resources/alterations.txt \
  --order 4 \
  --output nlp_model.bin

Produces a ~40 KB bootstrap model in under a second.

Full training run (recommended)

The CI workflow uses four sources: operator contributions, SecLists, Tranco, and crt.sh. To replicate it locally with the three public datasets:

mkdir -p data

# 1. SecLists 1M subdomain wordlist (~8.5 MB compressed)
curl -L "https://github.com/danielmiessler/SecLists/raw/master/Discovery/DNS/subdomains-top1million-full.7z" \
  -o data/subdomains-top1million-full.7z
7z e data/subdomains-top1million-full.7z -odata/ -y
rm data/subdomains-top1million-full.7z

# 2. Tranco top-1M with subdomains
curl -L "https://tranco-list.eu/top-1m-incl-subdomains.csv.zip" -o data/tranco.zip
TRANCO_ENTRY=$(unzip -Z1 data/tranco.zip | head -n 1)
unzip -p data/tranco.zip "$TRANCO_ENTRY" > data/tranco.csv
rm data/tranco.zip

# 3. crt.sh Certificate Transparency log sample
for tld in com net org io dev app; do
  curl -sSf "https://crt.sh/?q=%25.$tld&output=json&limit=10000" \
    | python3 -c "
import json,sys
data=json.load(sys.stdin)
seen=set()
for e in data:
    for n in e.get('name_value','').splitlines():
        n=n.strip().lstrip('*.')
        if n and n not in seen:
            seen.add(n)
            print(n)
" >> data/ct_hostnames.txt 2>/dev/null || true
done

# 4. Train — no pre-processing needed; all formats are detected automatically
go run ./cmd/train \
  --input data/subdomains-top1million-full.txt \
  --input data/tranco.csv \
  --input data/ct_hostnames.txt \
  --order 4 \
  --output nlp_model.bin

Supported input formats

The trainer auto-detects the format of each --input file:

| Format | Example line | Notes |
|---|---|---|
| Plain label or FQDN | api.example.com | One entry per line |
| Gzip-compressed | (detected by magic bytes) | Decompressed transparently |
| SecLists COUNT LABEL | 5617 www | Leading decimal count stripped automatically |
| Tranco CSV RANK,FQDN | 1,google.com | Leading rank column stripped automatically |
| Rapid7 FDNS JSON | {"name":"api.example.com","type":"A",...} | Extracts name field |
| Zone file | api.example.com. 300 IN A 1.2.3.4 | Extracts first field, strips trailing dot |

Every label from every hostname is extracted and ingested. Files with hundreds of millions of lines are processed streaming with constant memory.
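
As a rough sketch of per-line detection for the text formats in the table, the function below returns the hostname found on one input line. It is not the pkg/corpus reader: extractHostname, isDigits, and normalize are hypothetical names, the detection order is a guess, and gzip handling is omitted.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// isDigits reports whether s is a non-empty run of ASCII digits.
func isDigits(s string) bool {
	if s == "" {
		return false
	}
	for i := 0; i < len(s); i++ {
		if s[i] < '0' || s[i] > '9' {
			return false
		}
	}
	return true
}

// normalize lower-cases a hostname and strips the zone-file trailing dot.
func normalize(host string) string {
	return strings.TrimSuffix(strings.ToLower(strings.TrimSpace(host)), ".")
}

// extractHostname pulls the hostname out of one line in any supported format.
func extractHostname(line string) string {
	line = strings.TrimSpace(line)
	if line == "" {
		return ""
	}
	if strings.HasPrefix(line, "{") { // Rapid7 FDNS JSON
		var rec struct {
			Name string `json:"name"`
		}
		if json.Unmarshal([]byte(line), &rec) != nil {
			return ""
		}
		return normalize(rec.Name)
	}
	if rank, host, ok := strings.Cut(line, ","); ok && isDigits(rank) { // Tranco CSV
		return normalize(host)
	}
	fields := strings.Fields(line)
	if len(fields) >= 2 && isDigits(fields[0]) { // SecLists COUNT LABEL
		return normalize(fields[1])
	}
	return normalize(fields[0]) // plain entry, or first field of a zone-file line
}

func main() {
	lines := []string{
		"api.example.com",
		"5617 www",
		"1,google.com",
		`{"name":"api.example.com","type":"A"}`,
		"api.example.com. 300 IN A 1.2.3.4",
	}
	for _, l := range lines {
		host := extractHostname(l)
		fmt.Println(host, strings.Split(host, "."))
	}
}
```

Each hostname is then split on dots so that every label, not just the leftmost one, can be fed to the trainer.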

Computing a delta after training

After producing a new nlp_model.bin, generate the delta against the previous release:

go run ./cmd/mkdelta \
  --prev prev_model.bin \
  --new  nlp_model.bin  \
  --output "nlp_delta_$(date +%Y%m%d).bin"

# Update the baseline for the next delta
cp nlp_model.bin prev_model.bin

The delta file is published alongside the full model in the GitHub Release so that Amass clients can sync incrementally (amass --nlp-sync) without downloading the full model.

Testing

# Run all tests with verbose output
go test -v ./...

# Run tests with coverage report
go test -v -coverprofile=coverage.out ./...
go tool cover -html=coverage.out

# Run benchmarks
go test -bench=. -benchmem ./pkg/ngram/

The test suite covers:

| Test | What it checks |
|---|---|
| TestNew | Model initialises with correct order and version |
| TestTrain_* | Counts populate; invalid/oversized labels are rejected |
| TestScore_* | Known labels outscore unknown; invalid labels return −∞ |
| TestGenerate_* | Output count, RFC 1123 validity, no duplicates, descending score order |
| TestReinforce_* | Positive/negative effect on score; underflow and overflow safety |
| TestMerge_* | Delta integration; nil/zero-scale no-ops |
| TestSaveLoad_RoundTrip | Gob encode → decode preserves all counts and scores exactly |
| TestSave_IsAtomic | No .tmp file left after successful write |
| TestLoadFile_NotFound | Returns error for missing path |
| TestIsValidLabel | RFC 1123 label validation edge cases |
| TestPerplexity_HeldOut | Held-out perplexity < 30 (random baseline = 38) |

The perplexity test trains on a small corpus and verifies the model achieves meaningful compression on a held-out set of real subdomain patterns, catching regressions in the core learning algorithm.
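
For reference, perplexity is the exponential of the average negative log-probability per predicted symbol. The sketch below (hypothetical function name, invented probabilities) shows why a model that guesses uniformly over 38 symbols scores exactly 38, the random baseline quoted in the table.

```go
package main

import (
	"fmt"
	"math"
)

// perplexity takes the probability the model assigned to each symbol that
// actually occurred and returns exp(-mean(log p)).
func perplexity(probs []float64) float64 {
	var sum float64
	for _, p := range probs {
		sum += math.Log(p)
	}
	return math.Exp(-sum / float64(len(probs)))
}

func main() {
	// Uniform guessing over 38 symbols.
	uniform := []float64{1.0 / 38, 1.0 / 38, 1.0 / 38, 1.0 / 38}
	fmt.Printf("uniform: %.1f\n", perplexity(uniform))

	// A model that has learned something assigns higher probabilities.
	trained := []float64{0.30, 0.10, 0.05, 0.20}
	fmt.Printf("trained: %.1f\n", perplexity(trained))
}
```

Lower is better: a held-out perplexity under 30 means the model is, on average, choosing among fewer than 30 equally likely symbols instead of 38.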

CI workflow

.github/workflows/train.yml runs monthly (1st of the month, 02:00 UTC) and on workflow_dispatch or repository_dispatch from the Amass release pipeline.

Steps:

  1. Download operator contributions — fetches *.bin delta files from the contributions release tag (gracefully skipped if the tag does not exist)
  2. Aggregate contributions — runs cmd/aggregate on all contributed deltas at scale 0.1; skipped if no contributions were received
  3. Download SecLists: subdomains-top1million-full.7z from danielmiessler/SecLists; hard-fails if the extracted file has fewer than 1,000,000 lines
  4. Download Tranco: top-1m-incl-subdomains.csv.zip from tranco-list.eu; hard-fails if fewer than 500,000 lines
  5. Sample crt.sh — fetches up to 10,000 recent FQDNs per TLD across the international gTLD baseline (com/net/org/io) and the bundled locale ccTLDs (de/fr/jp/...)
  6. Train model — runs cmd/train on all available inputs; produces nlp_model.bin
  7. Build vocabulary — runs cmd/vocab on the FQDN corpora (CT logs + Tranco); produces nlp_vocab.bin, or is omitted if the sample yields no locale-specific tokens
  8. Compute delta — runs cmd/mkdelta against prev_model.bin; produces nlp_delta_YYYYMMDD.bin
  9. Update baseline — commits the new nlp_model.bin as prev_model.bin with [skip ci] for the next monthly delta
  10. Create GitHub Release — publishes nlp_model.bin, the delta file, and (when present) nlp_vocab.bin as assets under the tag model-YYYYMMDD-<amass-version>

SHA256 checksums for all downloaded files are logged to the CI output as an audit trail. All GitHub Actions are pinned to specific commit SHAs to prevent supply-chain substitution.

The release tag format is model-YYYYMMDD-<amass-version>. The Amass release pipeline downloads the latest nlp_model.bin asset before packaging its binaries.

Relationship to the Amass plugin

amass-nlp (this repo)                    owasp-amass/amass
─────────────────────────────────────    ──────────────────────────────────
cmd/train  → nlp_model.bin           →   resources/nlp_model.bin  (base layer)
cmd/vocab  → nlp_vocab.bin           →   resources/nlp_vocab.bin  (locale vocabulary)
cmd/mkdelta → nlp_delta_YYYYMMDD.bin →   ~/.config/amass/nlp_ci_patch.bin (CI layer)
pkg/ngram/ngram.go                   →   engine/plugins/nlp/model.go (shared codec)
pkg/vocab/vocab.go                   →   engine/plugins/nlp/vocab.go (shared codec)

The Amass plugin contains a private copy of the model codec (same algorithm, unexported types) to avoid an import cycle during the pre-publication phase. Once this module is published to owasp-amass/amass-nlp, that private copy will be replaced with a direct import of github.com/owasp-amass/amass-nlp/pkg/ngram.

Acknowledgments

  • SecLists by Daniel Miessler — the subdomains-top1million-full wordlist is the primary training corpus for this model. SecLists is the security tester's companion: a collection of multiple types of lists used during security assessments, including usernames, passwords, URLs, sensitive data patterns, fuzzing payloads, web shells, and more. The subdomain lists used here were zone-transferred from Cloudflare and represent the one million most-used subdomains.

  • Tranco by Victor Le Pochat, Tom Van Goethem, Samaneh Tajalizadehkhoob, Maciej Korczyński, and Wouter Joosen — a research-grade domain popularity list combining CrUX, Farsight passive DNS, Majestic, Cloudflare Radar, and Cisco Umbrella. Used here in its subdomains variant to provide real traffic-weighted FQDNs from the most-visited sites on the internet.

  • crt.sh by Sectigo — Certificate Transparency log search used to supplement training with a live sample of recently-seen FQDNs.

License

Apache 2.0 — see LICENSE.
