
DankEncoder: Add context support and max results limit to prevent hanging on complex patterns #285

@tarunKoyalwar

Problem

When pattern mining runs in discover mode on inputs with many domains (roughly 900 or more), the subdomain generation step can hang indefinitely. The hang occurs in GenerateAtFixedLength() when it processes patterns that cause exponential recursion in the DFS traversal.

Example Input

An input file with 898 subdomains under the .tstaging.tools domain hangs at around pattern 400-500 during the generation phase. Sample domains that contribute to the problematic patterns:

mobile-prod-genymotion-64.ue1.mobile.tstaging.tools
grafana.ue1.s11.tstaging.tools
kibana-logging.eck.ue1.cloudhub.tstaging.tools
gateway.ue1.stg1.tstagingsub-97-35-127.gateway.ue1.stg1.tstaging.tools

The issue is that certain discovered patterns, when fed to DankEncoder's GenerateAtFixedLength(), trigger deep recursion that takes an impractical amount of time to complete.

Root Cause

The GenerateAtFixedLength() function in internal/dank/dank.go:

  1. Uses a pure recursive DFS without any early-exit conditions
  2. Cannot be interrupted or cancelled
  3. Places no limit on the number of results generated

Requested Features

1. Context Support for Early Cancellation

Add a context parameter to allow graceful cancellation:

func (d *DankEncoder) GenerateAtFixedLengthWithContext(ctx context.Context, fixedLen int) ([]string, error)

This would allow the caller to set timeouts and cancel expensive operations.

2. Max Results Limit

Add a parameter to limit the maximum number of results and exit early:

func (d *DankEncoder) GenerateAtFixedLengthWithLimit(fixedLen int, maxResults int) []string

This would prevent runaway recursion by stopping once a threshold is reached.
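
A minimal sketch of the early exit, again using a hypothetical `toyEncoder` stand-in rather than the real DankEncoder. Only the method signature comes from this issue; treating `maxResults <= 0` as "no limit" is an assumption of the sketch.

```go
package main

// toyEncoder is a hypothetical stand-in for DankEncoder: it enumerates every
// string of a fixed length over a small alphabet with a recursive DFS.
type toyEncoder struct {
	alphabet []byte
}

// GenerateAtFixedLengthWithLimit mirrors the signature proposed in this issue.
// Assumption: maxResults <= 0 means "no limit".
func (e *toyEncoder) GenerateAtFixedLengthWithLimit(fixedLen int, maxResults int) []string {
	var out []string
	buf := make([]byte, 0, fixedLen)

	// dfs reports whether the search should continue. It returns false once
	// the limit is reached so that every enclosing call unwinds immediately.
	var dfs func() bool
	dfs = func() bool {
		if len(buf) == fixedLen {
			out = append(out, string(buf))
			return maxResults <= 0 || len(out) < maxResults
		}
		for _, c := range e.alphabet {
			buf = append(buf, c)
			more := dfs()
			buf = buf[:len(buf)-1] // backtrack before trying the next symbol
			if !more {
				return false
			}
		}
		return true
	}

	dfs()
	return out
}
```

One caveat: a result limit bounds the output, but if the real traversal can spend a long time in branches that never produce a result, the limit alone may not bound the running time. That is an argument for offering the context variant as well.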

Benefits

  • Prevents hanging on complex patterns
  • Allows reasonable timeouts for pattern generation
  • Makes discover mode viable for larger input sets
  • Maintains backwards compatibility (keep existing functions, add new variants)

Workaround

The current workaround is a conservative check against the NumWords() estimate, but the estimate is not accurate enough to keep every problematic pattern from being processed.
