Architecture Documentation

Overview

The zig-regex library implements a regular expression engine using Thompson's NFA constructionalgorithm combined withthread-based simulation for pattern matching. This architecture provides linear time complexity O(n*m) for matching, where n is the input length and m is the pattern size.

Design Philosophy

Zero Dependencies - Uses only Zig's standard library
Memory Safety - Full allocator control, no hidden allocations
Correctness First - Focus on correct implementation before optimization
Clean Separation - Each phase (parsing, compilation, execution) is isolated
Testability - Every component is independently testable

Component Architecture

┌─────────────────┐
│  User Input     │
│  Pattern String │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     LEXER       │  ← Tokenizes pattern into tokens
│  (parser.zig)   │    (literals, operators, etc.)
└────────┬────────┘
         │ Tokens
         ▼
┌─────────────────┐
│     PARSER      │  ← Builds Abstract Syntax Tree
│  (parser.zig)   │    using recursive descent
└────────┬────────┘
         │ AST
         ▼
┌─────────────────┐
│   COMPILER      │  ← Thompson NFA Construction
│ (compiler.zig)  │    Converts AST → NFA
└────────┬────────┘
         │ NFA
         ▼
┌─────────────────┐
│   VIRTUAL       │  ← Thread-based simulation
│   MACHINE       │    Matches input against NFA
│   (vm.zig)      │
└────────┬────────┘
         │ Match Results
         ▼
┌─────────────────┐
│  Public API     │  ← High-level matching functions
│  (regex.zig)    │    find(), replace(), split()
└─────────────────┘

Module Breakdown

1. common.zig (174 lines)

Purpose: Shared types and utilities

Key Components:

CharRange - Represents a character range (e.g., 'a'-'z')
CharClass - Set of character ranges with negation support
CharClasses - Predefined classes (\d, \w, \s, etc.)
CompileFlags - Future support for flags (i, m, s, x, u)
Span - Source position for error reporting

Design Notes:

Character classes use ranges for memory efficiency
Negation handled at match time, not during construction

2. ast.zig (267 lines)

Purpose: Abstract Syntax Tree representation

Node Types:

pub const NodeType = enum {
    literal,      // Single character 'a'
    any,          // Wildcard '.'
    concat,       // Implicit concatenation 'ab'
    alternation,  // Choice 'a|b'
    star,         // Zero or more 'a*'
    plus,         // One or more 'a+'
    optional,     // Zero or one 'a?'
    repeat,       // Bounded 'a{m,n}'
    char_class,   // Character set '[a-z]'
    group,        // Capture group '(a)'
    anchor,       // Position assertion '^', '$', '\b'
    empty,        // Epsilon transition
};

Key Features:

Tagged union for type-safe node data
Recursive structure for nested expressions
Memory management via allocator
Span tracking for error messages

Design Notes:

Each node owns its children (tree ownership)
destroy() recursively frees the entire tree
Factory methods ensure proper initialization

3. parser.zig (395 lines)

Purpose: Lexical analysis and parsing

Architecture:

Lexer (Tokenizer)
    │
    ├─ peek()     - Look at current char
    ├─ advance()  - Move to next char
    ├─ makeToken()- Create token
    └─ parseEscape() - Handle \ sequences

Parser (Recursive Descent)
    │
    ├─ parseAlternation()  (lowest precedence)
    ├─ parseConcat()
    ├─ parseRepeat()
    ├─ parsePrimary()      (highest precedence)
    └─ parseCharClass()

Operator Precedence (high to low):

Literals, groups, character classes
Quantifiers (*, +, ?, {m,n})
Concatenation (implicit)
Alternation (|)

Design Notes:

Single-pass parsing
Error recovery not implemented (fails fast)
Lookahead limited to one token
Character classes parsed separately

4. compiler.zig (455 lines)

Purpose: NFA construction via Thompson's algorithm

NFA Components:

pub const State = struct {
    id: StateId,
    transitions: ArrayList(Transition),
    is_accepting: bool,
    capture_start: ?usize,
    capture_end: ?usize,
};

pub const Transition = struct {
    transition_type: TransitionType,
    to: StateId,
    data: TransitionData,
};

Thompson Construction Rules:

Pattern	NFA Fragment
Literal 'a'	`s0 --a--> s1`
Concatenation 'ab'	`s0 --a--> s1 --ε--> s2 --b--> s3`
Alternation 'a\|b'	`s0 --ε--> (s1 --a--> s2)<br>    └--ε--> (s3 --b--> s4)`
Star 'a*'	`s0 --ε--> s1 --a--> s2 --ε--> s1<br>    └--ε--> accept`

Design Notes:

Each AST node compiles to a Fragment (start, accept states)
Epsilon transitions used for control flow
Greedy matching achieved through simulation order
Capture groups marked on states, not transitions

5. vm.zig (340 lines)

Purpose: NFA simulation engine

Thread-Based Simulation:

Thread = {
    state: StateId,
    capture_starts: []?usize,
    capture_ends: []?usize,
}

Algorithm:

1. Start with one thread at start state
2. Compute epsilon closure
3. For each input character:

   a. For each thread, try all transitions
   b. Create new threads for matches
   c. Track longest match (greedy)

4. Return longest accepting match

Key Features:

Greedy Matching: Continues matching to find longest match
Epsilon Closure: Precomputed before consuming input
Capture Tracking: Each thread maintains capture positions
No Backtracking: All paths explored simultaneously

Time Complexity:

O(n _ m) where n = input length, m = number of states
Worst case: O(n _ 2^p) where p = number of alternations

Space Complexity:

O(m * c) where m = states, c = capture groups
Thread list grows with state complexity

Design Notes:

Threads represent possible execution paths
Visited states tracked to avoid infinite loops
Anchors checked during epsilon closure
Match preference: longer match > earlier match

6. regex.zig (385 lines)

Purpose: Public API and convenience functions

API Surface:

// Compilation
Regex.compile(allocator, pattern) -> Regex
regex.deinit()

// Matching
regex.isMatch(input) -> bool
regex.find(input) -> ?Match
regex.findAll(allocator, input) -> []Match

// Transformation
regex.replace(allocator, input, replacement) -> []u8
regex.replaceAll(allocator, input, replacement) -> []u8
regex.split(allocator, input) -> [][]const u8

Design Notes:

Immutable compiled patterns (can reuse)
All allocations explicit and user-controlled
Match results contain slices (no copies)
Iterator pattern not yet implemented

7. errors.zig (67 lines)

Purpose: Error types and reporting

Error Types:

pub const RegexError = error{
    // Parse errors
    InvalidPattern,
    UnexpectedCharacter,
    UnexpectedEndOfPattern,
    InvalidEscapeSequence,
    InvalidCharacterClass,
    InvalidQuantifier,
    UnmatchedParenthesis,
    UnmatchedBracket,
    EmptyPattern,

    // Runtime errors
    CompilationFailed,
    TooManyStates,
    OutOfMemory,
};

Error Context:

Position tracking in pattern
Human-readable messages
Pointer to error location

Algorithms

Thompson's NFA Construction

Why Thompson?

Simple to implement
Guaranteed linear time matching
No catastrophic backtracking
Naturally supports alternation

How it works:

Each regex operator has a standard NFA fragment pattern
Fragments compose recursively
Epsilon transitions connect fragments
Result is a single NFA with one start and one accept state

Thread-Based Simulation

Why Threads?

Simulates NFA without building DFA (space efficient)
Handles large state spaces
Natural support for captures

Greedy Matching:

Continue simulation even after finding a match
Keep longest match found
Ensures a* matches "aaa" not ""

Epsilon Closure:

Precompute all states reachable via ε-transitions
Check anchors during closure
Avoids redundant state exploration

Memory Management

Ownership Model:

Regex
  └─ owns NFA
       └─ owns States
            └─ own Transitions

AST (temporary)
  └─ owns Nodes (recursively)

Matches
  └─ borrows input (slices)
  └─ owns capture array

Allocation Points:

Pattern string duplication
AST node creation (freed after compilation)
NFA states and transitions
VM threads during matching (freed each match)
Match results and captures

Guidelines:

User provides allocator for all operations
No hidden global allocations
Clear ownership semantics
defer patterns for cleanup

Performance Characteristics

Time Complexity

Operation	Complexity	Notes
Compile	O(p)	p = pattern length
isMatch	O(n _ m)	n = input, m = states
find	O(n _ m _ k)	k = positions to try
findAll	O(n _ m)	Amortized
replace	O(n + r)	r = replacement length
split	O(n _ m)	Plus allocation

Space Complexity

Component	Complexity	Notes
AST	O(p)	Temporary
NFA	O(p)	Persistent
VM Threads	O(m _ c)	c = captures
Matches	O(r)	r = result count

Optimization Opportunities

Not Yet Implemented:

DFA construction for static patterns
State minimization
Literal prefix extraction
Boyer-Moore-style searching
JIT compilation
SIMD for character classes

Testing Strategy

Test Categories:

Unit Tests - Each module tested independently
Integration Tests - End-to-end pattern matching
Edge Cases - Empty strings, boundary conditions
Compliance Tests - Standard regex behaviors
Property Tests - Invariants (future)

Current Coverage:

35+ tests across all modules
All basic features tested
Edge cases covered
No fuzzing yet

Future Enhancements

Phase 6: Quality & Testing

Comprehensive fuzzing
Property-based tests
Performance regression tests
Memory leak detection

Phase 7: Performance

DFA construction option
Literal prefix optimization
State minimization
Benchmark suite

Phase 8: Extended Features

Named capture groups
Backreferences
Lookahead/lookbehind
Conditional patterns
Unicode properties

Phase 9: API Improvements

Iterator interface
Streaming matches
Partial matching
Match options/flags

References

Algorithms:

Thompson, Ken (1968). "Regular Expression Search Algorithm"
Cox, Russ (2007). "Regular Expression Matching Can Be Simple And Fast"

Implementations:

RE2 (Google) - Inspiration for architecture
Rust regex crate - API design reference
PCRE - Feature completeness reference

Contributing

When modifying the architecture:

Maintain phase separation
Keep allocations explicit
Preserve linear time guarantees
Add tests for new features
Update this document

Last Updated: 2025-01-26 Zig Version: 0.15.1 Lines of Code: ~2,300

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Architecture Documentation

Overview

Design Philosophy

Component Architecture

Module Breakdown

1. common.zig (174 lines)

2. ast.zig (267 lines)

3. parser.zig (395 lines)

4. compiler.zig (455 lines)

5. vm.zig (340 lines)

6. regex.zig (385 lines)

7. errors.zig (67 lines)

Algorithms

Thompson's NFA Construction

Thread-Based Simulation

Memory Management

Performance Characteristics

Time Complexity

Space Complexity

Optimization Opportunities

Testing Strategy

Future Enhancements

Phase 6: Quality & Testing

Phase 7: Performance

Phase 8: Extended Features

Phase 9: API Improvements

References

Contributing

FilesExpand file tree

ARCHITECTURE.md

Latest commit

History

ARCHITECTURE.md

File metadata and controls

Architecture Documentation

Overview

Design Philosophy

Component Architecture

Module Breakdown

1. common.zig (174 lines)

2. ast.zig (267 lines)

3. parser.zig (395 lines)

4. compiler.zig (455 lines)

5. vm.zig (340 lines)

6. regex.zig (385 lines)

7. errors.zig (67 lines)

Algorithms

Thompson's NFA Construction

Thread-Based Simulation

Memory Management

Performance Characteristics

Time Complexity

Space Complexity

Optimization Opportunities

Testing Strategy

Future Enhancements

Phase 6: Quality & Testing

Phase 7: Performance

Phase 8: Extended Features

Phase 9: API Improvements

References

Contributing