Red Architecture

Rust implementation of GNU sed - Stream Editor

Overview
Core Architecture
Module Structure
Key Components
Data Flow
Bytes and Encoding
Regex Engine
GNU sed Compatibility
Development Guide

Overview

Red is a drop-in replacement for GNU sed, written in Rust.

Design Principles

Two-Phase Processing: Compile scripts once, execute many times
Unified Context: Single configuration object passed through all modules
Type Safety: Leverage Rust's type system for correctness
GNU Compatibility: Match GNU sed behavior

Core Architecture

Comparison with GNU sed

Aspect	GNU sed	Red	Notes
Command representation	1 struct with union	2 enums (parser + runtime)	Standard AST→IR pattern
Processing	Byte-first	String-first (UTF-8)	Raw bytes tracked separately
Regex engine	POSIX regex + DFA	Custom 3-level matcher	Literal/DFA/NFA
Memory	Obstack allocator	Standard Rust allocations
Jumps	Array indices	Pre-resolved indices	Resolved at compile time

High-Level Flow

┌─────────────┐
│   CLI Args  │
└──────┬──────┘
       │
       ▼
┌─────────────┐     ┌──────────────┐
│  Context    │────▶│  Validation  │
└──────┬──────┘     └──────────────┘
       │
       ▼
┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│   Parser    │────▶│    Lexer     │────▶│  AST Nodes   │
└──────┬──────┘     └──────────────┘     └──────────────┘
       │                                 (parser::Command)
       │
       ▼
┌────────────────────┐
│ convert_to_runtime │ (AST → RuntimeCommand)
└──────┬─────────────┘
       │
       ▼
┌─────────────┐
│  Commands   │ (RuntimeCommand with compiled regexes)
└──────┬──────┘
       │
       ▼
┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│   Engine    │────▶│  Evaluator   │────▶│    Output    │
└─────────────┘     └──────────────┘     └──────────────┘

Phase 1: Compilation

Input: Script text + Context Output: List of RuntimeCommand Purpose: Validate and optimize scripts before execution

pub fn parse_scripts_to_commands(
    scripts: &[(String, ScriptSource)],
    ctx: &Context,
) -> Result<Vec<RuntimeCommand>>

Phase 2: Execution

Input: Commands + Input lines Output: Transformed text Purpose: Apply commands to input data

pub fn execute_over_lines(
    commands: &[RuntimeCommand],
    // ... parameters
) -> Result<()>

Module Structure

Module Dependency Graph

main.rs
    │
    ▼
cli.rs ──────────────────────────────────────┐
    │                                        │
    ▼                                        ▼
lib.rs ◄──────────────────────────────── errors.rs
    │
    ├──────────────┬──────────────┬──────────────┐
    ▼              ▼              ▼              ▼
parser/        engine/        fileio/        util/
├── mod.rs     ├── mod.rs     ├── mod.rs     ├── regex.rs
├── lexer.rs   ├── exec.rs    ├── lines.rs   └── version.rs
└── ast.rs     ├── addr.rs    ├── inplace.rs
               ├── types.rs   └── encoding.rs
               └── pattern_space.rs
                    │
                    ▼
               regex/
               ├── mod.rs
               ├── parser.rs
               ├── dfa.rs
               ├── nfa.rs
               ├── backtrack.rs
               └── literal.rs

1. CLI Layer (`src/cli.rs`, `src/main.rs`)

Responsibility: Argument parsing and validation

Uses lexopt for natural order preservation
Separate print_usage() and help_text() functions
Exit codes match GNU sed (4 for usage errors)

Key Functions:

pub fn parse_args() -> Result<CliArgs>

2. Context Layer (`src/context.rs`)

Responsibility: Unified configuration management

pub struct Context {
    // Regex mode
    pub extended_regex: bool,

    // POSIX compliance
    pub posix: PosixMode,
    pub strict_posix: bool,

    // Execution modes
    pub quiet: bool,
    pub null_data: bool,
    pub unbuffered: bool,
    pub separate_files: bool,

    // In-place editing
    pub in_place_suffix: Option<String>,
    pub follow_symlinks: bool,

    // Formatting
    pub line_length: usize,

    // Security
    pub sandbox: bool,

    // Scripts (for #n quiet mode detection)
    pub scripts_with_sources: Vec<(String, ScriptSource)>,
}

POSIX Modes:

pub enum PosixMode {
    Extended,  // GNU extensions enabled (default)
    Correct,   // POSIXLY_CORRECT env var
    Basic,     // --posix flag (strict)
}

3. File I/O Layer (`src/fileio/`)

Responsibility: File reading, encoding detection, and backup utilities

encoding.rs - Encoding detection from BOM and locale
lines.rs - File/stdin reading with line splitting
inplace.rs - Backup suffix expansion for in-place editing
mod.rs - Module exports and output flushing helper

Key Functions:

pub fn read_all_lines(files, null_data, follow_symlinks) -> Result<(lines, bytes, filenames, encoding, ends_with_sep)>
pub fn split_file_content(content, null_data) -> (lines, bytes, encoding, ends_with_sep)
pub fn expand_backup_suffix(file_path, suffix) -> String
pub fn flush_output(out, unbuffered)

4. Parser Layer (`src/parser/`)

Responsibility: Script compilation and validation

Lexer (`src/parser/lexer.rs`)

Tokenizes sed scripts using a state machine approach.

Key optimizations:

Single-pass tokenization
Bracket depth tracking
Escape sequence handling
Raw bytes preservation for non-UTF-8

AST (`src/parser/ast.rs`)

Defines parser command structure:

pub enum Command {
    Substitute { flags, pattern, replacement, ... },
    Delete { range, negated },
    Print { range, negated, ... },
    Group { commands },      // Flattened at compile time
    Comment(String),         // Discarded at compile time
    Version { version },     // Checked at compile time
    // ... more command types
}

Parser (`src/parser.rs`)

Converts tokens to AST nodes with validation:

pub fn parse(tokens: &[Token], ctx: &Context) -> Result<Vec<Command>>

5. Engine Layer (`src/engine/`)

Responsibility: Command execution and state management

Types (`src/engine/types.rs`)

Runtime command representation:

pub enum Command {
    Substitution {
        pattern: SedRegex,               // Compiled regex
        replacement: ReplacementTemplate, // Parsed replacement
        global: bool,
        print: bool,
        literal_pattern: Option<String>,  // Fast-path optimization
        literal_replacement: Option<String>,
        // ... more fields
    },
    // ... more runtime commands
}

ExecutionContext (`src/engine/exec.rs`)

Maintains execution state:

pub struct ExecutionContext {
    pub pattern_space: PatternSpace,
    pub hold_space: String,
    pub hold_space_raw: Vec<u8>,
    pub line_number: usize,
    pub is_last_line: bool,
    // ... state fields
}

AddressEvaluator (`src/engine/addr.rs`)

Evaluates address ranges with state tracking:

pub fn evaluate(&mut self, range: Option<&AddressRange>, ctx: &ExecutionContext) -> bool

Features:

Line number ranges (5,10)
Regex matches (/pattern/)
Step addressing (1~2)
Dollar ($) for last line
Range state machine (INACTIVE → ACTIVE → CLOSED)

6. Error Handling (`src/errors.rs`)

Unified error types:

pub enum SedError {
    Parse { message, source },
    Io { operation, path, error },
    Rename { source_path, dest_path, error },
    Runtime { message },
    Usage { message },
}

GNU sed compatible error messages and exit codes.

Key Components

Two-Stage Command Representation

Red uses a deliberate two-stage command representation, following the classic AST-to-IR pattern from compilers:

Parser Command (`parser/ast.rs`)

pub enum Command {
    Substitution {
        pattern: String,           // Raw regex pattern
        replacement: String,       // Raw replacement string
        flags: SubstitutionFlags,  // Struct with all flags
        // ...
    },
    Group { commands: Vec<Command> },  // Flattened at compile time
    Version { version: String },       // Compile-time check only
    Comment(String),                   // Discarded
    // ... more variants
}

Purpose: Preserves source structure, enables error reporting with context

Runtime Command (`engine/types.rs`)

pub enum Command {
    Substitution {
        pattern: SedRegex,               // Compiled regex
        replacement: ReplacementTemplate, // Parsed replacement
        global: bool,                    // Expanded from flags
        print: bool,
        literal_pattern: Option<String>, // Fast-path optimization
        literal_replacement: Option<String>,
        // ... fields with optimizations
    },
    // ... more variants (no Group/Version/Comment)
}

Purpose: Optimized for execution, pre-compiled patterns, literal optimizations

Why Two Enums?

Separation of concerns: Parser produces semantic representation, engine uses optimized representation
Compile-time processing: Group commands are flattened, Version is checked, Comments discarded
Optimization fields: Runtime adds literal_pattern, literal_replacement, use_last that don't exist in source
Type differences: String → SedRegex, String → ReplacementTemplate, SubstitutionFlags → individual booleans
Common compiler pattern: Similar to AST → IR transformation in production compilers

PatternSpace: Dual Representation

The PatternSpace struct (src/engine/pattern_space.rs) maintains synchronized representations:

pub struct PatternSpace {
    raw: Vec<u8>,              // Raw bytes (may contain invalid UTF-8)
    active_start: usize,       // Active pointer for D command optimization
    text_cache: Option<String>, // Cached UTF-8 representation
}

Key methods:

raw() - Returns active bytes slice
text() - Returns UTF-8 text (from cache)
set_raw() - Updates from bytes, rebuilds text cache with lossy conversion
delete_first_line() - O(1) using active pointer (GNU sed technique)

Data Flow

1. Script Compilation

  Script Text
       │
       ▼
┌─────────────┐
│   Lexer     │ Tokenize with raw bytes tracking
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Parser    │ Build AST (parser::Command)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Validator  │ Check POSIX rules, labels
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Converter  │ compile_to_runtime_commands()
└──────┬──────┘
       │
       ▼
RuntimeCommand[] (with compiled regexes)

2. Command Execution

Input Lines (with raw bytes)
         │
         ▼
┌──────────────────┐
│ ExecutionContext │◄─── Commands
│ - pattern_space  │
│ - hold_space     │
│ - line_number    │
└────────┬─────────┘
         │
         ▼
┌──────────────────────┐
│ apply_commands_with_ │
│ context()            │
└───────┬──────────────┘
        │
        ▼
 CommandResult (with raw bytes)
        │
        ▼
    ┌────────┐
    │ Output │ (preserves raw bytes)
    └────────┘

3. In-Place Editing

  Original File
       │
       ▼
┌──────────────┐
│ Read Content │ (as raw bytes)
└──────┬───────┘
       │
       ▼
┌──────────────┐
│   Process    │ (preserve raw bytes through pipeline)
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  Temp File   │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│    Rename    │ (atomic)
└──────────────┘

Bytes and Encoding

Design: UTF-8 First with Raw Bytes Tracking

Red uses Rust's String type (valid UTF-8) as the primary text representation for regex matching, but maintains raw bytes throughout the pipeline for output.

Raw Bytes Preservation

Input (bytes) → PatternSpace.raw → Commands → Output (bytes)
                     ↓
               text_cache (lossy)
                     ↓
              Regex matching on text

Implementation:

Hold space raw bytes tracking (hold_space_raw: Vec<u8>)
CommandResult::Continue(String, Option<Vec<u8>>) - carries raw bytes through engine
CommandResult::Print(String, Option<Vec<u8>>) - outputs raw bytes when available
LiteralBytes(Vec<u8>) token in replacement templates

Component Status

Component	Status	Implementation
Script reading	Bytes	`Vec<u8>` + lossy UTF-8 conversion
Lexer	Bytes-aware	char-to-byte mapping for raw bytes
Parser	Bytes-aware	`replacement_raw_bytes` field
Replacement template	Bytes	`LiteralBytes(Vec<u8>)` token
Escape sequences	Raw output	`\xNN`, `\oNNN`, `\dNNN`, `\cX`
Pattern matching	UTF-8	Regex operates on text cache
Data I/O	Bytes	Raw bytes preserved through hold/pattern space
Substitution pipeline	Bytes	Raw bytes tracked across multiple substitutions

Regex Engine

Red implements a custom regex engine with three different matchers, each optimized for different pattern types.

Three-Level Optimization Strategy

pub enum Matcher {
    Literal(LiteralMatcher),   // Fastest: direct string/byte search
    Dfa(DfaMatcher),           // Fast: Deterministic Finite Automaton
    Nfa(NfaMatcher),           // Full-featured: Nondeterministic Finite Automaton
}

Matcher Selection (compile time)

  Pattern Analysis
         │
         ▼
┌─────────────────────┐
│ Is literal?         │──Yes──▶ LiteralMatcher (fastest)
│ (no metacharacters) │
└────────┬────────────┘
         │ No
         ▼
┌─────────────────┐
│ Has backrefs?   │──Yes──▶ NfaMatcher (required)
│ (\1, \2, etc)   │
└────────┬────────┘
         │ No
         ▼
┌─────────────────┐
│ MBCS locale?    │──Yes──▶ NfaMatcher (for non-ASCII patterns)
│ (MB_CUR_MAX > 1)│
└────────┬────────┘
         │ No
         ▼
      DfaMatcher (fast, no backrefs)

Literal Matcher (`src/regex/literal.rs`)

When used: Patterns without any regex metacharacters (e.g., hello, foo_bar)

Implementation:

pub struct LiteralMatcher {
    pattern: String,
    ignore_case: bool,
}

impl LiteralMatcher {
    // String-based matching (UTF-8)
    pub fn find(&self, text: &str) -> Option<usize>

    // Byte-based matching (MBCS-safe for ASCII patterns)
    pub fn find_bytes(&self, bytes: &[u8]) -> Option<usize>
    pub fn find_bytes_from(&self, bytes: &[u8], start: usize) -> Option<usize>
}

MBCS behavior: ASCII-only literal patterns can safely use byte-level matching even in MBCS locales, because ASCII bytes (0x00-0x7F) never appear as part of multibyte characters.

DFA Matcher (`src/regex/dfa.rs`)

When used: Simple patterns without backreferences in single-byte locales

How DFA works:

Compiles pattern into a state machine at parse time
Each state has exactly one transition per input character
No backtracking - processes input in a single pass
Memory: O(states × alphabet_size)

Limitations:

Cannot support backreferences - This is a mathematical impossibility, not an implementation limitation. Backreferences (\1, \2) require the regex engine to "remember" what was matched and compare against it later. DFAs have no memory of the path taken to reach a state.
Disabled in MBCS locales - DFA operates on fixed-width input units. In MBCS, character widths vary (1-4 bytes), requiring boundary detection that DFA architecture doesn't support.

NFA Matcher (`src/regex/nfa.rs`, `src/regex/backtrack.rs`)

When used:

Patterns with backreferences (\1, \2, etc.)
Patterns with capturing groups that need extraction
All patterns in MBCS locales (for non-ASCII patterns)

How NFA differs from DFA:

                    DFA                          NFA
             ┌─────────────────┐          ┌─────────────────┐
Input 'a' →  │ State 1 ──────▶ │ State 2  │ State 1 ──┬───▶ │ State 2
             │    (one path)   │          │           └───▶ │ State 3
             └─────────────────┘          │    (multiple)   │
                                          └─────────────────┘

NFA can be in multiple states simultaneously
Explores all possible paths (via backtracking)
Can "remember" captured groups for backreferences
More flexible but slower

MBCS support in NFA:

// NFA uses MbText for character boundary tracking
pub struct MbText<'a> {
    raw: &'a [u8],
    char_boundaries: Vec<usize>,  // Where each character starts
    char_validity: Vec<bool>,     // Is each sequence valid?
}

// Transitions respect character boundaries
match transition {
    Transition::Any => {
        // In MBCS mode: advance by full character, not byte
        let char_len = mbtext.char_at(pos).len();
        next_pos = pos + char_len;
    }
}

Why Both DFA and NFA?

Aspect	DFA	NFA
Speed	O(n) - very fast	O(n×m) - slower
Backreferences	Impossible	Full support
Memory usage	Higher (state table)	Lower
MBCS support	Disabled	Full support
Use case	Simple patterns	Complex patterns

Cannot remove either:

Removing DFA: Performance regression for simple patterns
Removing NFA: Backreferences would be impossible (s/$foo$/\1\1/ fails)

MBCS Locale Behavior

When MB_CUR_MAX > 1 (Shift-JIS, EUC-JP, etc.):

Pattern Type              Matcher Used       Why
─────────────────────────────────────────────────────
ASCII literal "hello"     LiteralMatcher     Safe: ASCII bytes don't overlap MBCS
Non-ASCII literal "日本"  NfaMatcher         Needs MBCS boundary handling
Any with backrefs         NfaMatcher         DFA can't do backrefs
Simple /[a-z]+/           NfaMatcher         DFA disabled in MBCS

Example of why DFA fails in MBCS:

Shift-JIS "表" = bytes [0x95, 0x5C]
ASCII "\" = byte [0x5C]

Pattern: /\\/  (match backslash)

DFA (byte-level): Would incorrectly match inside "表"
                  because 0x5C appears as second byte

NFA (MBCS-aware): Correctly skips "表" because MbText
                  knows 0x5C is part of a 2-byte character

GNU sed Compatibility

Compatibility Features

CLI Compatibility

Natural order preservation of -e and -f
Separate usage (stderr) vs help (stdout) messages
Exit code 4 for usage errors
Write error detection (/dev/full)

Command Compatibility

All standard sed commands (s, d, p, a, i, c, etc.)
Address ranges (1,10, /start/,/end/, $)
Step addressing (1~2)
Hold space operations (h, H, g, G, x)
Branching (b, t, T, :label)
Byte-level operations (l command with octal escapes)

Multibyte Character Set (MBCS) Support

Full MBCS support for non-UTF-8 locales (Shift-JIS, EUC-JP, etc.)

Feature	Implementation
Locale detection	`mbcs::initialize()` via `setlocale()`, `MB_CUR_MAX`
Character boundaries	`MbText` struct with `mbrlen()`
Regex dot matching	`Transition::Any` respects MB boundaries
Character classes	`mbrtowc()` conversion for CharSet matching
Capture groups	`raw_bytes` field preserves MBCS bytes
Backreferences	Byte-level comparison in MB mode
Invalid/incomplete sequences	`char_validity` tracking, skip in dot match
Raw bytes output	`render_replacement_to_bytes()` sink

Key MBCS Data Structures:

pub struct MbText<'a> {
    raw: &'a [u8],                    // Raw input bytes
    char_boundaries: Vec<usize>,      // Byte offset for each char start
    char_validity: Vec<bool>,         // Valid vs invalid/incomplete
}

pub struct MbChar<'a> {
    bytes: &'a [u8],                  // One logical character's bytes
}

SELinux Support

When built with the selinux feature (Linux only):

Preserves SELinux contexts during in-place editing
Without --follow-symlinks: Uses symlink's context for new file
With --follow-symlinks: Uses target file's context

Development Guide

Adding New Commands

Add token in src/parser/lexer.rs
Add AST node in src/parser/ast.rs
Add parsing logic in src/parser.rs
Add runtime command in src/engine/types.rs
Add conversion in src/lib.rs:convert_new_command_to_old()
Add execution logic in src/engine/exec.rs
Add tests in tests/

Adding New Options

Add field to CliArgs in src/cli.rs
Add parsing in parse_args()
Add to Context in src/context.rs
Update RunConfig in src/lib.rs
Add validation if needed

Testing

# Unit tests
cargo test

# GNU sed compatibility tests
./scripts/run_gnused_tests.sh

# Specific test
./scripts/run_gnused_tests.sh --test-pattern "help.sh"

# Fast mode (debug build, compact output)
./scripts/run_gnused_tests.sh --fast

FilesExpand file tree

ARCHITECTURE.md

Latest commit

History