Conversation
- Add proper error handling for all Write() and Close() operations - Use defer with error handlers for cleanup operations - Remove unused extractTargetDomain() function - Add coverage.html to .gitignore to exclude build artifacts All changes follow Go best practices: - Deferred error handlers with logging - t.Fatalf() for test error handling - Named return pattern for defer close error propagation Linting: 0 issues (previously 12 issues) Tests: All passing 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
WalkthroughThe pull request adds pattern-mining functionality and DFA/NFA-based regex automation, implements output deduplication via a new DedupingWriter, integrates a three-mode CLI (default, discover, both) with pattern-mining flags, and introduces build/docs (Makefile, CLAUDE.md, README updates) and tests. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant CLI as cmd/alterx/main
participant Validator as RootDomainValidator
participant Miner as PatternMiner
participant Dedup as DedupingWriter
participant Engine as AlterxEngine
CLI->>Validator: validate mode & root domain (publicsuffix)
alt mode == discover or both
CLI->>Miner: Mine patterns from inputs
Miner-->>CLI: rules & pattern metadata
alt mode == discover
CLI->>Dedup: write discovered subdomains
Dedup-->>CLI: ack / unique count
CLI->>CLI: exit (discover complete)
else mode == both
CLI->>Miner: save rules (JSON)
Miner-->>CLI: saved
end
end
alt mode == default or both
CLI->>Engine: run alterations with combined patterns
Engine->>Dedup: emit subdomains
Dedup-->>Engine: deduped output
Engine-->>CLI: execution complete
end
CLI->>Dedup: close writer
Dedup-->>CLI: total unique count
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Areas requiring extra attention:
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
dedupe_writer_test.go (1)
36-38: Drop the post-Close sleeps
Close()blocks until the worker finishes (wg.Wait()), so the 100 ms sleeps just slow the suite and add flake potential. Please remove them across the subtests.Apply this diff:
- // Give a moment for async processing to complete - time.Sleep(100 * time.Millisecond) - @@ - time.Sleep(100 * time.Millisecond) - @@ - time.Sleep(100 * time.Millisecond) - @@ - time.Sleep(100 * time.Millisecond) - @@ - time.Sleep(100 * time.Millisecond) -Also applies to: 70-72, 107-109, 132-134, 154-156
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
.gitignore(1 hunks)CLAUDE.md(1 hunks)Makefile(1 hunks)README.md(2 hunks)cmd/alterx/main.go(2 hunks)dedupe_writer.go(1 hunks)dedupe_writer_test.go(1 hunks)examples/main.go(1 hunks)internal/dank/dank.go(1 hunks)internal/patternmining/clustering.go(1 hunks)internal/patternmining/patternmining.go(1 hunks)internal/patternmining/regex.go(1 hunks)internal/runner/runner.go(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
internal/patternmining/clustering.go (1)
internal/patternmining/patternmining.go (1)
Miner(82-85)
dedupe_writer_test.go (1)
dedupe_writer.go (1)
NewDedupingWriter(27-46)
cmd/alterx/main.go (3)
internal/runner/runner.go (2)
ParseFlags(42-152)Options(17-40)dedupe_writer.go (1)
NewDedupingWriter(27-46)internal/patternmining/patternmining.go (2)
NewMiner(88-93)Options(27-44)
internal/patternmining/regex.go (1)
internal/patternmining/patternmining.go (1)
Miner(82-85)
internal/patternmining/patternmining.go (2)
internal/runner/runner.go (1)
Options(17-40)internal/dank/dank.go (1)
NewDankEncoder(73-98)
🪛 LanguageTool
CLAUDE.md
[grammar] ~199-~199: Ensure spelling is correct
Context: ... - Maintain compatibility with original alterx API - Keep pattern mining as optional f...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~199-~199: Ensure spelling is correct
Context: ...n compatibility with original alterx API - Keep pattern mining as optional feature ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
CLAUDE.md
8-8: Images should have alternate text (alt text)
(MD045, no-alt-text)
9-9: Images should have alternate text (alt text)
(MD045, no-alt-text)
10-10: Images should have alternate text (alt text)
(MD045, no-alt-text)
11-11: Images should have alternate text (alt text)
(MD045, no-alt-text)
12-12: Images should have alternate text (alt text)
(MD045, no-alt-text)
13-13: Images should have alternate text (alt text)
(MD045, no-alt-text)
43-43: Link text should be descriptive
(MD059, descriptive-link-text)
176-176: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Test Builds (macOS-latest, 1.21.x)
- GitHub Check: Test Builds (windows-latest, 1.21.x)
🔇 Additional comments (2)
internal/patternmining/regex.go (2)
119-145: ...
119-145: ...
Fixed multiple critical and major issues identified in code review: Critical fixes: - Fix getNValidateRootDomain to return computed root domain - Add proper error handling for publicsuffix.EffectiveTLDPlusOne - Strengthen domain validation with dot boundary check - Fix dedupe_writer buffer handling to process only complete lines - Replace bufio.Scanner with bytes.IndexByte for correct partial write handling Major fixes: - Include MaxDistance in clustering loop bounds (Phase 1 and Phase 3) - Fix nwords type assertion from int to int64 with proper casting Improvements: - Remove unnecessary time.Sleep calls from tests (Close() already blocks) - Clean up unused imports 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
cmd/alterx/main.go(2 hunks)dedupe_writer.go(1 hunks)dedupe_writer_test.go(1 hunks)internal/patternmining/patternmining.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- dedupe_writer_test.go
🧰 Additional context used
🧬 Code graph analysis (2)
cmd/alterx/main.go (4)
internal/runner/runner.go (2)
ParseFlags(42-152)Options(17-40)dedupe_writer.go (1)
NewDedupingWriter(26-45)internal/patternmining/patternmining.go (2)
NewMiner(88-93)Options(27-44)mutator.go (1)
Options(27-43)
internal/patternmining/patternmining.go (2)
internal/runner/runner.go (1)
Options(17-40)internal/dank/dank.go (1)
NewDankEncoder(73-98)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Test Builds (macOS-latest, 1.21.x)
- GitHub Check: Test Builds (windows-latest, 1.21.x)
🔇 Additional comments (16)
dedupe_writer.go (3)
24-45: LGTM!The constructor properly initializes the blacklist, buffered channel, and background processor. The channel buffer size of 100 provides reasonable async capacity.
83-108: LGTM!The Write implementation correctly handles partial writes by processing only complete newline-terminated lines and buffering incomplete fragments. This addresses the previous review concern about
bufio.Scannermishandling partial writes.
110-137: LGTM!The Close method correctly flushes any remaining buffered data (handling files without trailing newlines), signals completion via channel closure, and waits for the background processor. The Count method is properly synchronized.
cmd/alterx/main.go (5)
19-34: LGTM!Mode validation is correct, and the deferred cleanup order is proper (LIFO execution ensures
dedupWriter.Close()flushes before the underlyingoutputis closed). Seeding the dedup writer with known domains effectively filters them from the output.
40-91: LGTM!The pattern mining integration correctly validates the target domain (the previous issue where
getNValidateRootDomainreturned an empty string has been fixed), mines patterns, optionally saves rules, and handles bothdiscoverandbothmodes appropriately.
94-132: LGTM!The default/both mode flow correctly combines user-defined patterns with optional mined patterns and uses the shared dedup writer to ensure output uniqueness across both modes. The final count accurately reflects deduplicated results.
134-155: LGTM!The helper functions cleanly abstract output writer creation and cleanup. File permissions (0644) are appropriate for output files.
157-181: LGTM!The root domain validation now correctly returns the computed root domain (fixing the previous empty-string bug), properly handles
publicsuffixerrors, and enforces strict boundary checking with"."+rootDomainto prevent false positives likeevil-example.commatchingexample.com.internal/patternmining/patternmining.go (8)
1-93: LGTM!The package documentation clearly attributes the original Regulator algorithm. Type definitions are well-structured with appropriate JSON tags for serialization, and the constructor is straightforward.
119-148: LGTM!Phase 1 edit-distance clustering now correctly includes
MaxDistancein the iteration range (fixed from<to<=), ensuring the user-specified maximum distance is evaluated as intended. The clustering logic and pattern validation are sound.
150-246: LGTM!Phase 2 n-gram prefix clustering correctly includes
MaxDistancein both outer and nested loops (fixed from<to<=). The redundant prefix filtering (lines 189-195) appropriately prevents duplicate prefix patterns.
248-260: LGTM!The result collection and sorting logic is correct. Returning both patterns and metadata enables downstream processing and rule persistence.
262-319: LGTM!
validateDomainsproperly filters malformed inputs and validates tokenization.buildDistanceTablecomputes necessary pairwise distances (O(n²) is unavoidable for this clustering approach).generateNgramsdeterministically produces unigrams and bigrams with optional limiting.
321-402: LGTM!
SaveRulescorrectly captures close errors in the deferred function.EstimateCountefficiently uses the DFA'sNumWordsto count without generating strings.GenerateFromPatternsproperly computesfixedSlice, handles negative cases, and removes double dots to prevent malformed output.
404-467: LGTM!Pattern quality validation properly enforces both threshold and ratio constraints. Helper methods for deduplication, prefix filtering, and token extraction are straightforward and correct.
469-604: LGTM!
groupRulesByStepcorrectly assertsnwordsasint64before casting toint(fixing the previous type assertion bug). The Levenshtein implementation is standard, andescapeForDankEncoderproperly handles regex special characters for the DFA engine.
Summary
This PR adds in-flight pattern mining capabilities to AlterX by integrating the Regulator pattern mining algorithm. This enables automatic discovery of subdomain patterns from observed data, complementing the existing template-based generation.
Key Features
Three operation modes:
default: Original AlterX behavior with user-defined patternsdiscover: Pattern mining only (mined patterns)both: Combined user-defined and mined patterns with deduplicationPattern mining algorithm:
CLI flags:
-m, --mode: Operation mode (default/discover/both)-min-distance: Minimum Levenshtein distance for clustering (default: 2)-max-distance: Maximum Levenshtein distance for clustering (default: 10)-pattern-threshold: Minimum synthetic subdomains before ratio check (default: 500)-quality-ratio: Max ratio of synthetic/observed subdomains (default: 25)-save-rules: Save discovered patterns to JSON fileChanges in This PR
Code Quality
Testing
Example Usage
Related
This PR supersedes PR #281 which encountered issues. This is a clean implementation with all validations passing.
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Chores
Tests