Addition: In-flight pattern mining #281

Closed
tarunKoyalwar wants to merge 5 commits into main from pattern-mining
Conversation

@tarunKoyalwar
Member

Summary

This PR adds an in-flight pattern mining implementation to AlterX, enabling automatic pattern discovery from subdomain datasets. The implementation uses hierarchical clustering algorithms to identify common patterns and generate DSL templates automatically.

Key Features

  • Hierarchical Ngram-Based Clustering: Multi-level clustering approach combining ngram prefix matching, token extraction, and edit distance clustering
  • Levenshtein Distance Clustering: Groups similar subdomains based on edit distance thresholds
  • Modular Architecture: Clean separation of concerns across multiple files:
    • clustering.go: Core clustering algorithms and orchestration
    • tokenization.go: Token extraction and parsing logic
    • pattern_generation.go: DSL pattern generation pipeline
    • pm.go: Main PatternMiner interface and execution flow
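
To make the edit-distance step concrete, here is a minimal sketch of threshold-based grouping. The names `levenshtein` and `clusterByDistance`, the greedy single-pass strategy, and the threshold value are illustrative assumptions, not the code in clustering.go.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the byte-wise edit distance between a and b,
// using the classic two-row dynamic programming table.
func levenshtein(a, b string) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

// clusterByDistance (hypothetical helper) places each item into the first
// cluster whose first member is within threshold edits, or starts a new one.
func clusterByDistance(items []string, threshold int) [][]string {
	var clusters [][]string
	for _, item := range items {
		placed := false
		for i := range clusters {
			if levenshtein(clusters[i][0], item) <= threshold {
				clusters[i] = append(clusters[i], item)
				placed = true
				break
			}
		}
		if !placed {
			clusters = append(clusters, []string{item})
		}
	}
	return clusters
}

func main() {
	hosts := []string{"api1", "api2", "mail", "api3"}
	for _, cluster := range clusterByDistance(hosts, 1) {
		fmt.Println(cluster)
	}
}
```

With a threshold of 1, api1, api2, and api3 share a cluster and mail gets its own. The greedy pass compares only against each cluster's first member, so results depend on input order; a production implementation would likely need something more robust.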

Implementation Approach

This PR focuses on high-level logic and architecture; functions that are not yet implemented are placeholders marked with TODOs. The structure is designed to be:

  • Well-documented with clear algorithm descriptions and examples
  • Easy to test and extend
  • Ready for incremental implementation

Attribution

This implementation is based on the regulator project by @cramppet. Regulator is a subdomain pattern mining tool that uses hierarchical clustering algorithms to automatically discover patterns in subdomain datasets. We've adapted and extended these concepts to provide automatic pattern generation capabilities within AlterX.

Special thanks to @cramppet for the excellent work on subdomain pattern analysis.

Testing

Test scaffolding is in place in pm_test.go and utils_test.go, ready to validate the placeholder functions once they are implemented.

Next Steps

  • Implement tokenization logic
  • Implement pattern generation algorithm
  • Add comprehensive test cases
  • Performance optimization
  • Integration with existing AlterX features

🤖 Generated with Claude Code

This commit adds an in-flight pattern mining implementation to AlterX,
enabling automatic pattern discovery from subdomain datasets.

Key additions:
- Hierarchical ngram-based clustering algorithm
- Levenshtein distance clustering for subdomain grouping
- Modular architecture with clean separation of concerns:
  * clustering.go: Core clustering algorithms and orchestration
  * tokenization.go: Token extraction and parsing logic
  * pattern_generation.go: DSL pattern generation pipeline
  * pm.go: Main PatternMiner interface and execution flow
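
The ngram-based step listed above can be pictured as bucketing labels by a shared leading ngram before any finer clustering. The sketch below is an assumption about the general idea only; `groupByPrefix` and the choice of n are hypothetical and do not come from this commit.

```go
package main

import (
	"fmt"
	"sort"
)

// groupByPrefix (hypothetical helper) buckets labels by their first n bytes.
// Labels with n or fewer bytes are keyed by the whole label.
func groupByPrefix(labels []string, n int) map[string][]string {
	if n < 1 {
		n = 1 // guard against zero or negative ngram sizes
	}
	groups := make(map[string][]string)
	for _, label := range labels {
		key := label
		if len(label) > n {
			key = label[:n]
		}
		groups[key] = append(groups[key], label)
	}
	return groups
}

func main() {
	labels := []string{"api-prod", "api-dev", "mail1", "db"}
	groups := groupByPrefix(labels, 3)

	// Sort keys so the printed order is stable across runs.
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, groups[k])
	}
}
```

Each bucket could then be handed to a finer-grained step such as edit-distance clustering.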

Implementation approach:
- High-level logic with placeholder functions for future implementation
- Well-documented algorithms with clear examples
- Test structure in place for validation

Attribution:
This implementation is based on the regulator project by @cramppet
(https://github.com/cramppet/regulator), adapted for AlterX's needs.

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai

coderabbitai bot commented Nov 5, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@ehsandeep
Member

@coderabbitai create and link github issue for this!

@coderabbitai coderabbitai bot mentioned this pull request Nov 6, 2025
5 tasks
@coderabbitai

coderabbitai bot commented Nov 6, 2025

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get PR details and comments to understand what issue needs to be created
gh pr view 281 --repo projectdiscovery/alterx --json title,body,comments --jq '.'

Length of output: 4745


🏁 Script executed:

#!/bin/bash
# Create a GitHub issue for tracking the implementation of TODOs in PR #281
gh issue create \
  --repo projectdiscovery/alterx \
  --title "Implement Pattern Mining TODOs" \
  --body "## Overview

This issue tracks the implementation of placeholder functions and TODOs for the in-flight pattern mining feature introduced in PR #281.

## Background

PR #281 adds the architectural foundation for pattern mining in AlterX, with a modular design across multiple files (\`clustering.go\`, \`tokenization.go\`, \`pattern_generation.go\`, \`pm.go\`). The PR focuses on high-level logic with placeholder functions marked for future implementation.

## Tasks

- [ ] Implement tokenization logic
- [ ] Implement pattern generation algorithm
- [ ] Add comprehensive test cases
- [ ] Performance optimization
- [ ] Integration with existing AlterX features

## References

- PR: https://github.com/projectdiscovery/alterx/pull/281
- Comment: https://github.com/projectdiscovery/alterx/pull/281#issuecomment-3497288412
- Requested by: @ehsandeep

## Attribution

Based on the [regulator](https://github.com/cramppet/regulator) project by [@cramppet](https://github.com/cramppet)." \
  --assignee ehsandeep

Length of output: 53


@ehsandeep: I've created issue #282 to track the implementation of the pattern mining TODOs from this PR.

The issue includes:

  • All 5 tasks from the "Next Steps" section
  • References to this PR and your comment
  • Attribution to the regulator project
  • Assigned to you

You can view it here: #282

@ehsandeep ehsandeep linked an issue Nov 6, 2025 that may be closed by this pull request
5 tasks
tarunKoyalwar and others added 4 commits November 7, 2025 04:10
Features:
- Add PatternProvider interface for pattern generation strategies
- Implement ManualPatternProvider for user-specified patterns
- Implement MinedPatternProvider for automatic pattern discovery
- Add CLI flags for all mining options under "Pattern Mining" group
- Add comprehensive user documentation (PATTERN_MINING.md)

Fixes:
- Fix static pattern generation for identical inputs
- Fix unchecked error returns in defer statements (3 files)
- Remove unused functions (delimiter validation, pattern quality)
- Fix staticcheck warnings (code simplification)
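
For readers unfamiliar with the "unchecked error returns in defer statements" class of lint finding, a common Go idiom for addressing it is sketched below. This is a generic illustration; `writeLines` is a hypothetical function and is not taken from the three files touched by this commit.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeLines writes each line to path. The deferred closure checks the
// error from Close instead of discarding it, and reports it through the
// named return value when no earlier error occurred.
func writeLines(path string, lines []string) (err error) {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer func() {
		if cerr := f.Close(); cerr != nil && err == nil {
			err = cerr
		}
	}()
	for _, line := range lines {
		if _, err = fmt.Fprintln(f, line); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	path := filepath.Join(os.TempDir(), "alterx-defer-example.txt")
	if err := writeLines(path, []string{"api-prod", "api-staging"}); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		os.Exit(1)
	}
	fmt.Println("wrote", path)
}
```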

The mutator now seamlessly switches between manual and discover modes
based on the -d flag, while maintaining complete backward compatibility.
All integration tests pass, and the output matches the Python implementation exactly.
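
The provider split described in this commit could look roughly like the sketch below. Only the type names PatternProvider, ManualPatternProvider, and MinedPatternProvider and the -d flag come from the commit message; the `Patterns` method, its signature, the `mine` field, `selectProvider`, and the example patterns are assumptions made for illustration.

```go
package main

import "fmt"

// PatternProvider is an assumed shape for the strategy interface:
// given input domains, return the DSL patterns to use.
type PatternProvider interface {
	Patterns(domains []string) ([]string, error)
}

// ManualPatternProvider returns user-specified patterns unchanged.
type ManualPatternProvider struct {
	patterns []string
}

func (m ManualPatternProvider) Patterns(_ []string) ([]string, error) {
	return m.patterns, nil
}

// MinedPatternProvider delegates to a mining function (hypothetical field).
type MinedPatternProvider struct {
	mine func(domains []string) ([]string, error)
}

func (m MinedPatternProvider) Patterns(domains []string) ([]string, error) {
	return m.mine(domains)
}

// selectProvider (hypothetical) picks a strategy from the discover flag.
func selectProvider(discover bool, manual []string, mine func([]string) ([]string, error)) PatternProvider {
	if discover {
		return MinedPatternProvider{mine: mine}
	}
	return ManualPatternProvider{patterns: manual}
}

func main() {
	// Stand-in miner; a real one would run the clustering pipeline.
	stubMiner := func(domains []string) ([]string, error) {
		return []string{"api{{p0}}"}, nil
	}
	provider := selectProvider(true, []string{"{{sub}}-{{word}}.{{suffix}}"}, stubMiner)
	patterns, err := provider.Patterns([]string{"api-prod.example.com", "api-staging.example.com"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(patterns)
}
```

Keeping both strategies behind one interface is what would let the mutator stay unaware of where its patterns come from.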

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Features:
- Add Mode field with three options: "default", "discover", "both"
- When -d is used without -mode, defaults to "discover" (mined only, no defaults)
- Add -m/--mode flag to explicitly choose pattern mode
- "both" mode combines mined patterns with defaults for maximum coverage
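
The mode rules above can be sketched as a small resolution function. The three mode names and the "-d without a mode means discover" rule come from this commit message; `resolveMode`, `selectPatterns`, the fallback to "default" when neither flag is set, and the mined-before-defaults ordering are assumptions.

```go
package main

import "fmt"

const (
	ModeDefault  = "default"
	ModeDiscover = "discover"
	ModeBoth     = "both"
)

// resolveMode (hypothetical) applies the defaulting rule: an empty mode
// becomes "discover" when discovery input was given, otherwise "default".
func resolveMode(mode string, discover bool) (string, error) {
	switch mode {
	case "":
		if discover {
			return ModeDiscover, nil
		}
		return ModeDefault, nil
	case ModeDefault, ModeDiscover, ModeBoth:
		return mode, nil
	default:
		return "", fmt.Errorf("unknown mode %q", mode)
	}
}

// selectPatterns (hypothetical) returns the pattern set for a resolved mode.
func selectPatterns(mode string, defaults, mined []string) []string {
	switch mode {
	case ModeDiscover:
		return mined
	case ModeBoth:
		// Copy first so the caller's slices are never modified.
		out := append([]string{}, mined...)
		return append(out, defaults...)
	default:
		return defaults
	}
}

func main() {
	mode, err := resolveMode("", true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(mode, selectPatterns(mode, []string{"default-pattern"}, []string{"api{{p0}}"}))
}
```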

Pattern Generation Fix:
- Replace simplified prefix matching with full tokenization pipeline
- Use analyzeTokenAlignment, buildDSLPattern, and extractPayloads
- Properly handles multi-variable patterns like api{{p0}}{{p1}}
- Correctly includes delimiters in payloads (e.g., ["-prod", "-staging"])
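
To show what a multi-variable pattern with delimiter-carrying payloads produces, here is a small expansion sketch. It is not AlterX's DSL engine: `expand` is a hypothetical helper, and the p1 payload values are invented for the example (only the api{{p0}}{{p1}} pattern and the ["-prod", "-staging"] payload appear in the commit message).

```go
package main

import (
	"fmt"
	"strings"
)

// expand (hypothetical) substitutes each {{name}} placeholder, in the order
// given by names, with every value in payloads[name]. The result is the
// Cartesian product of the payload lists. A name with no payload values
// yields no results at all.
func expand(pattern string, names []string, payloads map[string][]string) []string {
	results := []string{pattern}
	for _, name := range names {
		placeholder := "{{" + name + "}}"
		var next []string
		for _, partial := range results {
			for _, value := range payloads[name] {
				next = append(next, strings.ReplaceAll(partial, placeholder, value))
			}
		}
		results = next
	}
	return results
}

func main() {
	payloads := map[string][]string{
		"p0": {"-prod", "-staging"}, // delimiter is part of the payload
		"p1": {"", "-01"},           // invented values for illustration
	}
	for _, host := range expand("api{{p0}}{{p1}}", []string{"p0", "p1"}, payloads) {
		fmt.Println(host)
	}
}
```

Because the delimiter travels with the payload, the empty p1 value yields api-prod rather than a name with a dangling hyphen.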

Tests:
- All integration tests pass (manual and discover modes)
- TestPatternDifferences PASS (Go matches Python exactly)
- TestGeneratePattern PASS (all 7 test cases)
- TestCrossValidation PASS (10/10 cases)
- 43/45 tests passing (2 minor failures in intermediate stages)

Binary Output: Verified working correctly with proper domain generation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@tarunKoyalwar
Member Author

This PR has been superseded by PR #283, which includes:

  • Complete pattern mining implementation
  • All linting errors resolved (0 issues)
  • All tests passing
  • Proper error handling following Go best practices
  • Clean commit history

PR #283 can be safely merged. This PR can be closed.

Development

Successfully merging this pull request may close these issues.

Implement Pattern Mining TODOs

2 participants