feat: add pattern mining capabilities with Regulator integration by tarunKoyalwar · Pull Request #283 · projectdiscovery/alterx

tarunKoyalwar · 2025-11-10T07:08:43Z

Summary

This PR adds in-flight pattern mining capabilities to AlterX by integrating the Regulator pattern mining algorithm. This enables automatic discovery of subdomain patterns from observed data, complementing the existing template-based generation.

Key Features

Three operation modes:
- default: Original AlterX behavior with user-defined patterns
- discover: Pattern mining only (mined patterns)
- both: Combined user-defined and mined patterns with deduplication
Pattern mining algorithm:
- Three-phase clustering (edit distance, n-grams, prefix clustering)
- Quality control with thresholds and ratios
- DFA-based subdomain generation
CLI flags:
- -m, -mode: Operation mode (default/discover/both)
- -min-distance: Minimum Levenshtein distance for clustering (default: 2)
- -max-distance: Maximum Levenshtein distance for clustering (default: 10)
- -pattern-threshold: Minimum synthetic subdomains before ratio check (default: 500)
- -quality-ratio: Max ratio of synthetic/observed subdomains (default: 25)
- -save-rules: Save discovered patterns to JSON file
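
The two quality flags can be read as a two-step gate. The following is a minimal sketch, assuming a pattern passes outright when it generates fewer subdomains than -pattern-threshold, and otherwise must keep its synthetic/observed ratio below -quality-ratio. The function name `acceptPattern` and the exact comparison operators are assumptions for illustration, not code taken from this PR.

```go
package main

import "fmt"

// acceptPattern is a hypothetical helper, not the PR's actual code.
// Assumption: patterns below the threshold are accepted without a ratio
// check; larger patterns must keep synthetic/observed under maxRatio.
func acceptPattern(synthetic, observed, threshold int, maxRatio float64) bool {
	if observed <= 0 {
		return false // nothing observed, so the ratio is undefined
	}
	if synthetic < threshold {
		return true
	}
	return float64(synthetic)/float64(observed) < maxRatio
}

func main() {
	// Defaults from the flag list: threshold 500, ratio 25.
	fmt.Println(acceptPattern(120, 4, 500, 25))    // below threshold
	fmt.Println(acceptPattern(6000, 300, 500, 25)) // ratio 20
	fmt.Println(acceptPattern(6000, 10, 500, 25))  // ratio 600
}
```

Under these assumptions, a pattern that explodes into thousands of candidates from only a handful of observed hosts is dropped, while a similarly large pattern backed by many observations is kept.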

Changes in This PR

Integrated Regulator pattern mining algorithm into AlterX
Added DFA engine for pattern-based generation
Implemented three-phase clustering algorithm
Added deduplication between mined and user-defined patterns
Added tests for core functionality
Added comprehensive documentation

Code Quality

✅ All linting checks passing (0 issues)
✅ All tests passing (68.4% coverage)
✅ Proper error handling following Go best practices
✅ No build artifacts committed

Testing

# Run tests
make test

# Run linter
make lint

# Build
make build

Example Usage

# Discover patterns from domains
echo -e "api.example.com\ndev.example.com\nprod.example.com" | alterx -m discover

# Combine mined and user patterns
echo -e "api.example.com\ndev.example.com" | alterx -m both

# Save discovered patterns
echo -e "api.example.com\ndev.example.com" | alterx -m discover -save-rules patterns.json

Summary by CodeRabbit

New Features
- Pattern mining with discover/both modes and CLI flags to configure mining, estimate and save rules.
- Deduplicated output handling for cleaner, unique subdomain lists.
Documentation
- New architecture and development guide added; README updated with build instructions and Makefile targets.
Chores
- Makefile added to streamline build/test/lint workflows.
- .gitignore updated.
Tests
- Added tests covering output deduplication.

- Add proper error handling for all Write() and Close() operations
- Use defer with error handlers for cleanup operations
- Remove unused extractTargetDomain() function
- Add coverage.html to .gitignore to exclude build artifacts

All changes follow Go best practices:
- Deferred error handlers with logging
- t.Fatalf() for test error handling
- Named return pattern for defer close error propagation

Linting: 0 issues (previously 12 issues)
Tests: All passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

coderabbitai · 2025-11-10T07:09:00Z

Walkthrough

The pull request adds pattern-mining functionality and DFA/NFA-based regex automation, implements output deduplication via a new DedupingWriter, integrates a three-mode CLI (default, discover, both) with pattern-mining flags, and introduces build/docs (Makefile, CLAUDE.md, README updates) and tests.

Changes

Cohort / File(s)	Summary
Build & Documentation `\.gitignore`, `Makefile`, `README.md`, `CLAUDE.md`	Updated gitignore entries (removed `cmd/alterx/alterx` from ignored list, added top-level `/cmd/alterx/alterx` and `/alterx`, added `coverage.html`); added Makefile with build/test/lint/fmt/deps/run/clean targets and variables; expanded README with Pattern Mining and Building-from-Source instructions; added CLAUDE.md architecture and developer guide.
Deduplication System `dedupe_writer.go`, `dedupe_writer_test.go`	New `DedupingWriter` type with async deduplication pipeline, buffered line handling, blacklist seeding, Write/Close/Count methods; tests cover dedup behavior, blacklist skipping, dashed-line skipping, multi-line writes, and empty-line skipping.
Pattern Mining Core `internal/patternmining/patternmining.go`, `internal/patternmining/regex.go`, `internal/patternmining/clustering.go`	New Miner implementing three-phase clustering (edit distance, n-grams, prefix clustering), distance table memoization, regex generation from clusters (tokenization, per-position alternates, optional parts), numeric-range compression, rule validation/serialization, estimation and generation APIs, and helper utilities.
DFA/NFA Automation `internal/dank/dank.go`	New `DankEncoder` implementing regex preprocessing, Thompson-style NFA construction, epsilon closures, determinization, Brzozowski-style minimization (reverse+determinize cycles), dead-state completion, counting and generation of accepted strings, and introspection helpers.
CLI & Runner Integration `cmd/alterx/main.go`, `internal/runner/runner.go`, `examples/main.go`	`cmd/alterx/main.go`: added mode validation (default/discover/both), root-domain homogeneity validation (publicsuffix), pattern-mining flow for discover/both modes, unified dedup writer usage, output counting including discovered items, and helper writer functions. `internal/runner/runner.go`: added exported Options fields for pattern-mining flags (Mode, MinDistance, MaxDistance, PatternThreshold, QualityRatio, NgramsLimit, SaveRules) and flag parsing. `examples/main.go`: added error handling around ExecuteWithWriter.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant CLI as cmd/alterx/main
    participant Validator as RootDomainValidator
    participant Miner as PatternMiner
    participant Dedup as DedupingWriter
    participant Engine as AlterxEngine

    CLI->>Validator: validate mode & root domain (publicsuffix)
    alt mode == discover or both
        CLI->>Miner: Mine patterns from inputs
        Miner-->>CLI: rules & pattern metadata
        alt mode == discover
            CLI->>Dedup: write discovered subdomains
            Dedup-->>CLI: ack / unique count
            CLI->>CLI: exit (discover complete)
        else mode == both
            CLI->>Miner: save rules (JSON)
            Miner-->>CLI: saved
        end
    end

    alt mode == default or both
        CLI->>Engine: run alterations with combined patterns
        Engine->>Dedup: emit subdomains
        Dedup-->>Engine: deduped output
        Engine-->>CLI: execution complete
    end

    CLI->>Dedup: close writer
    Dedup-->>CLI: total unique count
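
The branching in the diagram above can be summarized as a small decision table. This is a sketch only: the `plan` type and `planFor` function are hypothetical names introduced for illustration and do not appear in the PR.

```go
package main

import "fmt"

// plan records which stages run for a mode (hypothetical type).
type plan struct {
	Mine      bool // run the pattern miner on the input
	RunEngine bool // run the template-based alteration engine
}

// planFor maps a mode string to the stages it enables and rejects
// anything other than the three documented modes.
func planFor(mode string) (plan, error) {
	switch mode {
	case "default":
		return plan{RunEngine: true}, nil
	case "discover":
		return plan{Mine: true}, nil
	case "both":
		return plan{Mine: true, RunEngine: true}, nil
	default:
		return plan{}, fmt.Errorf("invalid mode %q (want default, discover, or both)", mode)
	}
}

func main() {
	for _, m := range []string{"default", "discover", "both", "fast"} {
		p, err := planFor(m)
		fmt.Println(m, p, err)
	}
}
```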

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

internal/patternmining/* — clustering, distance caching, regex generation, numeric-range compression, and rule filtering logic.
internal/dank/dank.go — NFA/DFA construction, epsilon closures, determinization/reversal correctness, and generation/counting algorithms.
cmd/alterx/main.go — multi-mode control flow, proper combination of discovered and user-provided patterns, writer lifecycle, and root-domain homogeneity checks (verify publicsuffix usage and error conditions).
dedupe_writer.go — concurrency correctness, channel/worker lifecycle, buffering/line-splitting edge cases, and error handling when writing underlying output.

Poem

🐰
I munched on hosts and hopped through trees,
I mined patterns with regex breeze,
I thumped away duplicates with glee,
DFA and NFA now dance with me,
Hop—discover, default, both—y's a jubilee! 🎉

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 45.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main addition: pattern mining capabilities with Regulator integration, which is the primary feature across all changed files.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feat-pm

Comment `@coderabbitai help` to get the list of available commands and usage tips.

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (1)

dedupe_writer_test.go (1)
36-38: Drop the post-Close sleeps

Close() blocks until the worker finishes (wg.Wait()), so the 100 ms sleeps just slow the suite and add flake potential. Please remove them across the subtests.

Apply this diff:
-		// Give a moment for async processing to complete
-		time.Sleep(100 * time.Millisecond)
-
@@
-		time.Sleep(100 * time.Millisecond)
-
@@
-		time.Sleep(100 * time.Millisecond)
-
@@
-		time.Sleep(100 * time.Millisecond)
-
@@
-		time.Sleep(100 * time.Millisecond)
-
Also applies to: 70-72, 107-109, 132-134, 154-156
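
To illustrate why the sleeps are redundant, here is a minimal sketch of an asynchronous sink whose Close() closes the channel and then waits on a sync.WaitGroup. The `lineSink` type is a hypothetical stand-in and is not the PR's DedupingWriter; it only assumes the same close-then-wait shape described in the comment above.

```go
package main

import (
	"fmt"
	"sync"
)

// lineSink is a hypothetical async sink, not the PR's DedupingWriter.
type lineSink struct {
	lines chan string
	wg    sync.WaitGroup
	mu    sync.Mutex
	seen  map[string]struct{}
}

func newLineSink() *lineSink {
	s := &lineSink{lines: make(chan string, 100), seen: make(map[string]struct{})}
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		for l := range s.lines { // exits once the channel is closed and drained
			s.mu.Lock()
			s.seen[l] = struct{}{}
			s.mu.Unlock()
		}
	}()
	return s
}

// Add queues a line for the background worker.
func (s *lineSink) Add(line string) { s.lines <- line }

// Close signals completion and blocks until the worker has drained the channel.
func (s *lineSink) Close() {
	close(s.lines)
	s.wg.Wait()
}

// Count returns the number of unique lines seen so far.
func (s *lineSink) Count() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.seen)
}

func main() {
	s := newLineSink()
	for _, l := range []string{"a.example.com", "b.example.com", "a.example.com"} {
		s.Add(l)
	}
	s.Close()
	fmt.Println(s.Count()) // safe to read immediately; Close has already waited
}
```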

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a82e842 and 76f6ba1.

📒 Files selected for processing (13)

.gitignore (1 hunks)
CLAUDE.md (1 hunks)
Makefile (1 hunks)
README.md (2 hunks)
cmd/alterx/main.go (2 hunks)
dedupe_writer.go (1 hunks)
dedupe_writer_test.go (1 hunks)
examples/main.go (1 hunks)
internal/dank/dank.go (1 hunks)
internal/patternmining/clustering.go (1 hunks)
internal/patternmining/patternmining.go (1 hunks)
internal/patternmining/regex.go (1 hunks)
internal/runner/runner.go (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (5)

internal/patternmining/clustering.go (1)

internal/patternmining/patternmining.go (1)

Miner (82-85)

dedupe_writer_test.go (1)

dedupe_writer.go (1)

NewDedupingWriter (27-46)

cmd/alterx/main.go (3)

internal/runner/runner.go (2)

ParseFlags (42-152)

Options (17-40)

dedupe_writer.go (1)

NewDedupingWriter (27-46)

internal/patternmining/patternmining.go (2)

NewMiner (88-93)

Options (27-44)

internal/patternmining/regex.go (1)

internal/patternmining/patternmining.go (1)

Miner (82-85)

internal/patternmining/patternmining.go (2)

internal/runner/runner.go (1)

Options (17-40)

internal/dank/dank.go (1)

NewDankEncoder (73-98)

🪛 LanguageTool

CLAUDE.md

[grammar] ~199-~199: Ensure spelling is correct
Context: ... - Maintain compatibility with original alterx API - Keep pattern mining as optional f...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

[grammar] ~199-~199: Ensure spelling is correct
Context: ...n compatibility with original alterx API - Keep pattern mining as optional feature ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.18.1)

CLAUDE.md

8-8: Images should have alternate text (alt text)

(MD045, no-alt-text)

9-9: Images should have alternate text (alt text)

(MD045, no-alt-text)

10-10: Images should have alternate text (alt text)

(MD045, no-alt-text)

11-11: Images should have alternate text (alt text)

(MD045, no-alt-text)

12-12: Images should have alternate text (alt text)

(MD045, no-alt-text)

13-13: Images should have alternate text (alt text)

(MD045, no-alt-text)

43-43: Link text should be descriptive

(MD059, descriptive-link-text)

176-176: Bare URL used

(MD034, no-bare-urls)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Test Builds (macOS-latest, 1.21.x)
GitHub Check: Test Builds (windows-latest, 1.21.x)

🔇 Additional comments (2)

internal/patternmining/regex.go (2)

119-145: ...

119-145: ...

cmd/alterx/main.go

dedupe_writer.go

internal/patternmining/patternmining.go

Fixed multiple critical and major issues identified in code review:

Critical fixes:
- Fix getNValidateRootDomain to return computed root domain
- Add proper error handling for publicsuffix.EffectiveTLDPlusOne
- Strengthen domain validation with dot boundary check
- Fix dedupe_writer buffer handling to process only complete lines
- Replace bufio.Scanner with bytes.IndexByte for correct partial write handling

Major fixes:
- Include MaxDistance in clustering loop bounds (Phase 1 and Phase 3)
- Fix nwords type assertion from int to int64 with proper casting

Improvements:
- Remove unnecessary time.Sleep calls from tests (Close() already blocks)
- Clean up unused imports

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 76f6ba1 and 4e32f35.

📒 Files selected for processing (4)

cmd/alterx/main.go (2 hunks)
dedupe_writer.go (1 hunks)
dedupe_writer_test.go (1 hunks)
internal/patternmining/patternmining.go (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

dedupe_writer_test.go

🧰 Additional context used

🧬 Code graph analysis (2)

cmd/alterx/main.go (4)

internal/runner/runner.go (2)

ParseFlags (42-152)

Options (17-40)

dedupe_writer.go (1)

NewDedupingWriter (26-45)

internal/patternmining/patternmining.go (2)

NewMiner (88-93)

Options (27-44)

mutator.go (1)

Options (27-43)

internal/patternmining/patternmining.go (2)

internal/runner/runner.go (1)

Options (17-40)

internal/dank/dank.go (1)

NewDankEncoder (73-98)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Test Builds (macOS-latest, 1.21.x)
GitHub Check: Test Builds (windows-latest, 1.21.x)

🔇 Additional comments (16)

dedupe_writer.go (3)

24-45: LGTM!

The constructor properly initializes the blacklist, buffered channel, and background processor. The channel buffer size of 100 provides reasonable async capacity.

83-108: LGTM!

The Write implementation correctly handles partial writes by processing only complete newline-terminated lines and buffering incomplete fragments. This addresses the previous review concern about bufio.Scanner mishandling partial writes.
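
A minimal sketch of the buffering approach described here, assuming complete lines are cut with bytes.IndexByte and any trailing fragment is held until the next write or an explicit flush. The `lineSplitter` type and its `Flush` method are hypothetical and simplified; the PR's writer additionally deduplicates and processes lines asynchronously.

```go
package main

import (
	"bytes"
	"fmt"
)

// lineSplitter is a hypothetical, synchronous illustration of
// partial-write handling; it is not the PR's DedupingWriter.
type lineSplitter struct {
	buf   []byte   // bytes received after the last newline
	lines []string // complete lines seen so far
}

// Write consumes only newline-terminated lines and keeps the remainder buffered.
func (s *lineSplitter) Write(p []byte) (int, error) {
	s.buf = append(s.buf, p...)
	for {
		i := bytes.IndexByte(s.buf, '\n')
		if i < 0 {
			break // incomplete fragment stays in the buffer
		}
		s.lines = append(s.lines, string(s.buf[:i]))
		s.buf = s.buf[i+1:]
	}
	return len(p), nil
}

// Flush emits a trailing fragment that never received its newline.
func (s *lineSplitter) Flush() {
	if len(s.buf) > 0 {
		s.lines = append(s.lines, string(s.buf))
		s.buf = nil
	}
}

func main() {
	s := &lineSplitter{}
	// A hostname split across two writes must not be emitted in halves.
	s.Write([]byte("api.exa"))
	s.Write([]byte("mple.com\ndev.exam"))
	s.Write([]byte("ple.com"))
	fmt.Println(s.lines)
	s.Flush()
	fmt.Println(s.lines)
}
```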

110-137: LGTM!

The Close method correctly flushes any remaining buffered data (handling files without trailing newlines), signals completion via channel closure, and waits for the background processor. The Count method is properly synchronized.

cmd/alterx/main.go (5)

19-34: LGTM!

Mode validation is correct, and the deferred cleanup order is proper (LIFO execution ensures dedupWriter.Close() flushes before the underlying output is closed). Seeding the dedup writer with known domains effectively filters them from the output.
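
The LIFO point can be checked with a small standalone sketch. The step labels are illustrative only and are not taken from the PR's code.

```go
package main

import "fmt"

// run records the order in which deferred cleanups execute.
func run() []string {
	var order []string
	func() {
		// Registered first, so it runs last.
		defer func() { order = append(order, "close output file") }()
		// Registered second, so it runs first and can still flush to the file.
		defer func() { order = append(order, "close dedup writer") }()
		order = append(order, "write results")
	}()
	return order
}

func main() {
	for _, step := range run() {
		fmt.Println(step)
	}
}
```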

40-91: LGTM!

The pattern mining integration correctly validates the target domain (the previous issue where getNValidateRootDomain returned an empty string has been fixed), mines patterns, optionally saves rules, and handles both discover and both modes appropriately.

94-132: LGTM!

The default/both mode flow correctly combines user-defined patterns with optional mined patterns and uses the shared dedup writer to ensure output uniqueness across both modes. The final count accurately reflects deduplicated results.

134-155: LGTM!

The helper functions cleanly abstract output writer creation and cleanup. File permissions (0644) are appropriate for output files.

157-181: LGTM!

The root domain validation now correctly returns the computed root domain (fixing the previous empty-string bug), properly handles publicsuffix errors, and enforces strict boundary checking with "."+rootDomain to prevent false positives like evil-example.com matching example.com.
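
A minimal sketch of the boundary check, assuming the root domain has already been computed elsewhere (the PR uses publicsuffix.EffectiveTLDPlusOne for that step). The helper name `belongsToRoot` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// belongsToRoot reports whether host is the root domain itself or a
// subdomain of it. Prefixing the root with "." keeps look-alike domains
// such as evil-example.com from matching example.com.
func belongsToRoot(host, root string) bool {
	return host == root || strings.HasSuffix(host, "."+root)
}

func main() {
	fmt.Println(belongsToRoot("api.example.com", "example.com"))
	fmt.Println(belongsToRoot("evil-example.com", "example.com"))
	// A bare suffix check would wrongly accept the look-alike:
	fmt.Println(strings.HasSuffix("evil-example.com", "example.com"))
}
```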

internal/patternmining/patternmining.go (8)

1-93: LGTM!

The package documentation clearly attributes the original Regulator algorithm. Type definitions are well-structured with appropriate JSON tags for serialization, and the constructor is straightforward.

119-148: LGTM!

Phase 1 edit-distance clustering now correctly includes MaxDistance in the iteration range (fixed from < to <=), ensuring the user-specified maximum distance is evaluated as intended. The clustering logic and pattern validation are sound.
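
The off-by-one is easy to see in isolation. The sketch below only enumerates the distances a loop would visit; `distanceSteps` is a hypothetical helper, and the defaults 2 and 10 come from the flag list in the PR description.

```go
package main

import "fmt"

// distanceSteps lists every distance the clustering loop evaluates.
// The bound is inclusive, so maxDist itself is visited.
func distanceSteps(minDist, maxDist int) []int {
	var steps []int
	for k := minDist; k <= maxDist; k++ {
		steps = append(steps, k)
	}
	return steps
}

func main() {
	steps := distanceSteps(2, 10)
	fmt.Println(len(steps), steps[len(steps)-1])
}
```

With an exclusive bound (`k < maxDist`) the same call would stop at 9 and never evaluate the user-specified maximum.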

150-246: LGTM!

Phase 2 n-gram prefix clustering correctly includes MaxDistance in both outer and nested loops (fixed from < to <=). The redundant prefix filtering (lines 189-195) appropriately prevents duplicate prefix patterns.

248-260: LGTM!

The result collection and sorting logic is correct. Returning both patterns and metadata enables downstream processing and rule persistence.

262-319: LGTM!

validateDomains properly filters malformed inputs and validates tokenization. buildDistanceTable computes necessary pairwise distances (O(n²) is unavoidable for this clustering approach). generateNgrams deterministically produces unigrams and bigrams with optional limiting.

321-402: LGTM!

SaveRules correctly captures close errors in the deferred function. EstimateCount efficiently uses the DFA's NumWords to count without generating strings. GenerateFromPatterns properly computes fixedSlice, handles negative cases, and removes double dots to prevent malformed output.
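
Counting without generating can be sketched as a memoized traversal over DFA states. The `dfa` type below is a hypothetical minimal representation, not the PR's DankEncoder, and the sketch assumes the automaton is acyclic so the accepted language is finite.

```go
package main

import "fmt"

// dfa is a hypothetical minimal DFA representation for illustration.
type dfa struct {
	start  int
	accept map[int]bool
	next   map[int]map[byte]int // state -> symbol -> state
}

// countWords returns how many distinct strings the DFA accepts.
// Assumes no cycles; each state's count is computed once and reused.
func (d *dfa) countWords() int64 {
	memo := make(map[int]int64)
	var visit func(s int) int64
	visit = func(s int) int64 {
		if n, ok := memo[s]; ok {
			return n
		}
		var n int64
		if d.accept[s] {
			n = 1 // the path that ends here is one accepted string
		}
		for _, t := range d.next[s] {
			n += visit(t)
		}
		memo[s] = n
		return n
	}
	return visit(d.start)
}

func main() {
	// Language: (a|b)(1|2|3), i.e. 2 * 3 accepted strings.
	d := &dfa{
		start:  0,
		accept: map[int]bool{2: true},
		next: map[int]map[byte]int{
			0: {'a': 1, 'b': 1},
			1: {'1': 2, '2': 2, '3': 2},
		},
	}
	fmt.Println(d.countWords())
}
```

Because transitions in a DFA are deterministic, each accepted string corresponds to exactly one path, so summing path counts gives the word count directly.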

404-467: LGTM!

Pattern quality validation properly enforces both threshold and ratio constraints. Helper methods for deduplication, prefix filtering, and token extraction are straightforward and correct.

469-604: LGTM!

groupRulesByStep correctly asserts nwords as int64 before casting to int (fixing the previous type assertion bug). The Levenshtein implementation is standard, and escapeForDankEncoder properly handles regex special characters for the DFA engine.
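
The type assertion issue can be reproduced in a few lines. This sketch assumes the count is stored as an int64 inside an interface-typed map value; the `nwordsFrom` helper and the map shape are hypothetical and only the "nwords" name comes from the commit message.

```go
package main

import "fmt"

// nwordsFrom extracts the word count from loosely typed metadata.
// Asserting the dynamic type that was actually stored (int64) and then
// converting avoids a failed assertion against int.
func nwordsFrom(meta map[string]interface{}) (int, error) {
	v, ok := meta["nwords"].(int64)
	if !ok {
		return 0, fmt.Errorf("nwords: expected int64, got %T", meta["nwords"])
	}
	return int(v), nil
}

func main() {
	meta := map[string]interface{}{"nwords": int64(42)}

	// Asserting to int fails even though the value fits: int and int64
	// are distinct types in Go.
	_, ok := meta["nwords"].(int)
	fmt.Println(ok)

	n, err := nwordsFrom(meta)
	fmt.Println(n, err)
}
```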

cmd/alterx/main.go

dedupe_writer.go

tarunKoyalwar and others added 3 commits November 10, 2025 10:50

minor improvements

bd91b5a

complete implementation

0217a3a

tarunKoyalwar mentioned this pull request Nov 10, 2025

Addition: In-flight pattern mining #281

Closed

5 tasks

coderabbitai bot reviewed Nov 10, 2025

cmd/alterx/main.go (outdated, resolved)

dedupe_writer.go (outdated, resolved)

internal/patternmining/patternmining.go (outdated, resolved)

internal/patternmining/patternmining.go (resolved)

tarunKoyalwar linked an issue Nov 10, 2025 that may be closed by this pull request

Implement Pattern Mining TODOs #282

Closed

5 tasks

tarunKoyalwar self-assigned this Nov 10, 2025

coderabbitai bot reviewed Nov 10, 2025

cmd/alterx/main.go (resolved)

dedupe_writer.go (resolved)

tarunKoyalwar requested a review from ehsandeep November 10, 2025 08:02

ehsandeep approved these changes Nov 10, 2025

ehsandeep merged commit 03451bf into main Nov 10, 2025
9 checks passed

ehsandeep deleted the feat-pm branch November 10, 2025 11:13

ehsandeep merged 4 commits into main from feat-pm