The zig-regex library implements a regular expression engine using Thompson's NFA constructionalgorithm combined withthread-based simulation for pattern matching. This architecture provides linear time complexity O(n*m) for matching, where n is the input length and m is the pattern size.
- Zero Dependencies - Uses only Zig's standard library
- Memory Safety - Full allocator control, no hidden allocations
- Correctness First - Focus on correct implementation before optimization
- Clean Separation - Each phase (parsing, compilation, execution) is isolated
- Testability - Every component is independently testable
┌─────────────────┐
│ User Input │
│ Pattern String │
└────────┬────────┘
│
▼
┌─────────────────┐
│ LEXER │ ← Tokenizes pattern into tokens
│ (parser.zig) │ (literals, operators, etc.)
└────────┬────────┘
│ Tokens
▼
┌─────────────────┐
│ PARSER │ ← Builds Abstract Syntax Tree
│ (parser.zig) │ using recursive descent
└────────┬────────┘
│ AST
▼
┌─────────────────┐
│ COMPILER │ ← Thompson NFA Construction
│ (compiler.zig) │ Converts AST → NFA
└────────┬────────┘
│ NFA
▼
┌─────────────────┐
│ VIRTUAL │ ← Thread-based simulation
│ MACHINE │ Matches input against NFA
│ (vm.zig) │
└────────┬────────┘
│ Match Results
▼
┌─────────────────┐
│ Public API │ ← High-level matching functions
│ (regex.zig) │ find(), replace(), split()
└─────────────────┘
Purpose: Shared types and utilities
Key Components:
CharRange- Represents a character range (e.g., 'a'-'z')CharClass- Set of character ranges with negation supportCharClasses- Predefined classes (\d, \w, \s, etc.)CompileFlags- Future support for flags (i, m, s, x, u)Span- Source position for error reporting
Design Notes:
- Character classes use ranges for memory efficiency
- Negation handled at match time, not during construction
Purpose: Abstract Syntax Tree representation
Node Types:
pub const NodeType = enum {
literal, // Single character 'a'
any, // Wildcard '.'
concat, // Implicit concatenation 'ab'
alternation, // Choice 'a|b'
star, // Zero or more 'a*'
plus, // One or more 'a+'
optional, // Zero or one 'a?'
repeat, // Bounded 'a{m,n}'
char_class, // Character set '[a-z]'
group, // Capture group '(a)'
anchor, // Position assertion '^', '$', '\b'
empty, // Epsilon transition
};Key Features:
- Tagged union for type-safe node data
- Recursive structure for nested expressions
- Memory management via allocator
- Span tracking for error messages
Design Notes:
- Each node owns its children (tree ownership)
destroy()recursively frees the entire tree- Factory methods ensure proper initialization
Purpose: Lexical analysis and parsing
Architecture:
Lexer (Tokenizer)
│
├─ peek() - Look at current char
├─ advance() - Move to next char
├─ makeToken()- Create token
└─ parseEscape() - Handle \ sequences
Parser (Recursive Descent)
│
├─ parseAlternation() (lowest precedence)
├─ parseConcat()
├─ parseRepeat()
├─ parsePrimary() (highest precedence)
└─ parseCharClass()
Operator Precedence (high to low):
- Literals, groups, character classes
- Quantifiers (*, +, ?, {m,n})
- Concatenation (implicit)
- Alternation (|)
Design Notes:
- Single-pass parsing
- Error recovery not implemented (fails fast)
- Lookahead limited to one token
- Character classes parsed separately
Purpose: NFA construction via Thompson's algorithm
NFA Components:
pub const State = struct {
id: StateId,
transitions: ArrayList(Transition),
is_accepting: bool,
capture_start: ?usize,
capture_end: ?usize,
};
pub const Transition = struct {
transition_type: TransitionType,
to: StateId,
data: TransitionData,
};Thompson Construction Rules:
| Pattern | NFA Fragment |
|---|---|
| Literal 'a' | s0 --a--> s1 |
| Concatenation 'ab' | s0 --a--> s1 --ε--> s2 --b--> s3 |
| Alternation 'a|b' | s0 --ε--> (s1 --a--> s2)<br> └--ε--> (s3 --b--> s4) |
| Star 'a*' | s0 --ε--> s1 --a--> s2 --ε--> s1<br> └--ε--> accept |
Design Notes:
- Each AST node compiles to a Fragment (start, accept states)
- Epsilon transitions used for control flow
- Greedy matching achieved through simulation order
- Capture groups marked on states, not transitions
Purpose: NFA simulation engine
Thread-Based Simulation:
Thread = {
state: StateId,
capture_starts: []?usize,
capture_ends: []?usize,
}
Algorithm:
1. Start with one thread at start state
2. Compute epsilon closure
3. For each input character:
a. For each thread, try all transitions
b. Create new threads for matches
c. Track longest match (greedy)
4. Return longest accepting match
Key Features:
- Greedy Matching: Continues matching to find longest match
- Epsilon Closure: Precomputed before consuming input
- Capture Tracking: Each thread maintains capture positions
- No Backtracking: All paths explored simultaneously
Time Complexity:
- O(n _ m) where n = input length, m = number of states
- Worst case: O(n _ 2^p) where p = number of alternations
Space Complexity:
- O(m * c) where m = states, c = capture groups
- Thread list grows with state complexity
Design Notes:
- Threads represent possible execution paths
- Visited states tracked to avoid infinite loops
- Anchors checked during epsilon closure
- Match preference: longer match > earlier match
Purpose: Public API and convenience functions
API Surface:
// Compilation
Regex.compile(allocator, pattern) -> Regex
regex.deinit()
// Matching
regex.isMatch(input) -> bool
regex.find(input) -> ?Match
regex.findAll(allocator, input) -> []Match
// Transformation
regex.replace(allocator, input, replacement) -> []u8
regex.replaceAll(allocator, input, replacement) -> []u8
regex.split(allocator, input) -> [][]const u8Design Notes:
- Immutable compiled patterns (can reuse)
- All allocations explicit and user-controlled
- Match results contain slices (no copies)
- Iterator pattern not yet implemented
Purpose: Error types and reporting
Error Types:
pub const RegexError = error{
// Parse errors
InvalidPattern,
UnexpectedCharacter,
UnexpectedEndOfPattern,
InvalidEscapeSequence,
InvalidCharacterClass,
InvalidQuantifier,
UnmatchedParenthesis,
UnmatchedBracket,
EmptyPattern,
// Runtime errors
CompilationFailed,
TooManyStates,
OutOfMemory,
};Error Context:
- Position tracking in pattern
- Human-readable messages
- Pointer to error location
Why Thompson?
- Simple to implement
- Guaranteed linear time matching
- No catastrophic backtracking
- Naturally supports alternation
How it works:
- Each regex operator has a standard NFA fragment pattern
- Fragments compose recursively
- Epsilon transitions connect fragments
- Result is a single NFA with one start and one accept state
Why Threads?
- Simulates NFA without building DFA (space efficient)
- Handles large state spaces
- Natural support for captures
Greedy Matching:
- Continue simulation even after finding a match
- Keep longest match found
- Ensures
a*matches "aaa" not ""
Epsilon Closure:
- Precompute all states reachable via ε-transitions
- Check anchors during closure
- Avoids redundant state exploration
Ownership Model:
Regex
└─ owns NFA
└─ owns States
└─ own Transitions
AST (temporary)
└─ owns Nodes (recursively)
Matches
└─ borrows input (slices)
└─ owns capture array
Allocation Points:
- Pattern string duplication
- AST node creation (freed after compilation)
- NFA states and transitions
- VM threads during matching (freed each match)
- Match results and captures
Guidelines:
- User provides allocator for all operations
- No hidden global allocations
- Clear ownership semantics
- defer patterns for cleanup
| Operation | Complexity | Notes |
|---|---|---|
| Compile | O(p) | p = pattern length |
| isMatch | O(n _ m) | n = input, m = states |
| find | O(n _ m _ k) | k = positions to try |
| findAll | O(n _ m) | Amortized |
| replace | O(n + r) | r = replacement length |
| split | O(n _ m) | Plus allocation |
| Component | Complexity | Notes |
|---|---|---|
| AST | O(p) | Temporary |
| NFA | O(p) | Persistent |
| VM Threads | O(m _ c) | c = captures |
| Matches | O(r) | r = result count |
Not Yet Implemented:
- DFA construction for static patterns
- State minimization
- Literal prefix extraction
- Boyer-Moore-style searching
- JIT compilation
- SIMD for character classes
Test Categories:
- Unit Tests - Each module tested independently
- Integration Tests - End-to-end pattern matching
- Edge Cases - Empty strings, boundary conditions
- Compliance Tests - Standard regex behaviors
- Property Tests - Invariants (future)
Current Coverage:
- 35+ tests across all modules
- All basic features tested
- Edge cases covered
- No fuzzing yet
- Comprehensive fuzzing
- Property-based tests
- Performance regression tests
- Memory leak detection
- DFA construction option
- Literal prefix optimization
- State minimization
- Benchmark suite
- Named capture groups
- Backreferences
- Lookahead/lookbehind
- Conditional patterns
- Unicode properties
- Iterator interface
- Streaming matches
- Partial matching
- Match options/flags
Algorithms:
- Thompson, Ken (1968). "Regular Expression Search Algorithm"
- Cox, Russ (2007). "Regular Expression Matching Can Be Simple And Fast"
Implementations:
- RE2 (Google) - Inspiration for architecture
- Rust regex crate - API design reference
- PCRE - Feature completeness reference
When modifying the architecture:
- Maintain phase separation
- Keep allocations explicit
- Preserve linear time guarantees
- Add tests for new features
- Update this document
Last Updated: 2025-01-26 Zig Version: 0.15.1 Lines of Code: ~2,300