Skip to content

Latest commit

 

History

History
777 lines (614 loc) · 23.5 KB

File metadata and controls

777 lines (614 loc) · 23.5 KB

Red Architecture

Rust implementation of GNU sed - Stream Editor


Table of Contents

  1. Overview
  2. Core Architecture
  3. Module Structure
  4. Key Components
  5. Data Flow
  6. Bytes and Encoding
  7. Regex Engine
  8. GNU sed Compatibility
  9. Development Guide

Overview

Red is a drop-in replacement for GNU sed, written in Rust.

Design Principles

  1. Two-Phase Processing: Compile scripts once, execute many times
  2. Unified Context: Single configuration object passed through all modules
  3. Type Safety: Leverage Rust's type system for correctness
  4. GNU Compatibility: Match GNU sed behavior

Core Architecture

Comparison with GNU sed

Aspect GNU sed Red Notes
Command representation 1 struct with union 2 enums (parser + runtime) Standard AST→IR pattern
Processing Byte-first String-first (UTF-8) Raw bytes tracked separately
Regex engine POSIX regex + DFA Custom 3-level matcher Literal/DFA/NFA
Memory Obstack allocator Standard Rust allocations
Jumps Array indices Pre-resolved indices Resolved at compile time

High-Level Flow

┌─────────────┐
│   CLI Args  │
└──────┬──────┘
       │
       ▼
┌─────────────┐     ┌──────────────┐
│  Context    │────▶│  Validation  │
└──────┬──────┘     └──────────────┘
       │
       ▼
┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│   Parser    │────▶│    Lexer     │────▶│  AST Nodes   │
└──────┬──────┘     └──────────────┘     └──────────────┘
       │                                 (parser::Command)
       │
       ▼
┌────────────────────┐
│ convert_to_runtime │ (AST → RuntimeCommand)
└──────┬─────────────┘
       │
       ▼
┌─────────────┐
│  Commands   │ (RuntimeCommand with compiled regexes)
└──────┬──────┘
       │
       ▼
┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│   Engine    │────▶│  Evaluator   │────▶│    Output    │
└─────────────┘     └──────────────┘     └──────────────┘

Phase 1: Compilation

Input: Script text + Context Output: List of RuntimeCommand Purpose: Validate and optimize scripts before execution

pub fn parse_scripts_to_commands(
    scripts: &[(String, ScriptSource)],
    ctx: &Context,
) -> Result<Vec<RuntimeCommand>>

Phase 2: Execution

Input: Commands + Input lines Output: Transformed text Purpose: Apply commands to input data

pub fn execute_over_lines(
    commands: &[RuntimeCommand],
    // ... parameters
) -> Result<()>

Module Structure

Module Dependency Graph

main.rs
    │
    ▼
cli.rs ──────────────────────────────────────┐
    │                                        │
    ▼                                        ▼
lib.rs ◄──────────────────────────────── errors.rs
    │
    ├──────────────┬──────────────┬──────────────┐
    ▼              ▼              ▼              ▼
parser/        engine/        fileio/        util/
├── mod.rs     ├── mod.rs     ├── mod.rs     ├── regex.rs
├── lexer.rs   ├── exec.rs    ├── lines.rs   └── version.rs
└── ast.rs     ├── addr.rs    ├── inplace.rs
               ├── types.rs   └── encoding.rs
               └── pattern_space.rs
                    │
                    ▼
               regex/
               ├── mod.rs
               ├── parser.rs
               ├── dfa.rs
               ├── nfa.rs
               ├── backtrack.rs
               └── literal.rs

1. CLI Layer (src/cli.rs, src/main.rs)

Responsibility: Argument parsing and validation

  • Uses lexopt for natural order preservation
  • Separate print_usage() and help_text() functions
  • Exit codes match GNU sed (4 for usage errors)

Key Functions:

pub fn parse_args() -> Result<CliArgs>

2. Context Layer (src/context.rs)

Responsibility: Unified configuration management

pub struct Context {
    // Regex mode
    pub extended_regex: bool,

    // POSIX compliance
    pub posix: PosixMode,
    pub strict_posix: bool,

    // Execution modes
    pub quiet: bool,
    pub null_data: bool,
    pub unbuffered: bool,
    pub separate_files: bool,

    // In-place editing
    pub in_place_suffix: Option<String>,
    pub follow_symlinks: bool,

    // Formatting
    pub line_length: usize,

    // Security
    pub sandbox: bool,

    // Scripts (for #n quiet mode detection)
    pub scripts_with_sources: Vec<(String, ScriptSource)>,
}

POSIX Modes:

pub enum PosixMode {
    Extended,  // GNU extensions enabled (default)
    Correct,   // POSIXLY_CORRECT env var
    Basic,     // --posix flag (strict)
}

3. File I/O Layer (src/fileio/)

Responsibility: File reading, encoding detection, and backup utilities

  • encoding.rs - Encoding detection from BOM and locale
  • lines.rs - File/stdin reading with line splitting
  • inplace.rs - Backup suffix expansion for in-place editing
  • mod.rs - Module exports and output flushing helper

Key Functions:

pub fn read_all_lines(files, null_data, follow_symlinks) -> Result<(lines, bytes, filenames, encoding, ends_with_sep)>
pub fn split_file_content(content, null_data) -> (lines, bytes, encoding, ends_with_sep)
pub fn expand_backup_suffix(file_path, suffix) -> String
pub fn flush_output(out, unbuffered)

4. Parser Layer (src/parser/)

Responsibility: Script compilation and validation

Lexer (src/parser/lexer.rs)

Tokenizes sed scripts using a state machine approach.

Key optimizations:

  • Single-pass tokenization
  • Bracket depth tracking
  • Escape sequence handling
  • Raw bytes preservation for non-UTF-8

AST (src/parser/ast.rs)

Defines parser command structure:

pub enum Command {
    Substitute { flags, pattern, replacement, ... },
    Delete { range, negated },
    Print { range, negated, ... },
    Group { commands },      // Flattened at compile time
    Comment(String),         // Discarded at compile time
    Version { version },     // Checked at compile time
    // ... more command types
}

Parser (src/parser.rs)

Converts tokens to AST nodes with validation:

pub fn parse(tokens: &[Token], ctx: &Context) -> Result<Vec<Command>>

5. Engine Layer (src/engine/)

Responsibility: Command execution and state management

Types (src/engine/types.rs)

Runtime command representation:

pub enum Command {
    Substitution {
        pattern: SedRegex,               // Compiled regex
        replacement: ReplacementTemplate, // Parsed replacement
        global: bool,
        print: bool,
        literal_pattern: Option<String>,  // Fast-path optimization
        literal_replacement: Option<String>,
        // ... more fields
    },
    // ... more runtime commands
}

ExecutionContext (src/engine/exec.rs)

Maintains execution state:

pub struct ExecutionContext {
    pub pattern_space: PatternSpace,
    pub hold_space: String,
    pub hold_space_raw: Vec<u8>,
    pub line_number: usize,
    pub is_last_line: bool,
    // ... state fields
}

AddressEvaluator (src/engine/addr.rs)

Evaluates address ranges with state tracking:

pub fn evaluate(&mut self, range: Option<&AddressRange>, ctx: &ExecutionContext) -> bool

Features:

  • Line number ranges (5,10)
  • Regex matches (/pattern/)
  • Step addressing (1~2)
  • Dollar ($) for last line
  • Range state machine (INACTIVE → ACTIVE → CLOSED)

6. Error Handling (src/errors.rs)

Unified error types:

pub enum SedError {
    Parse { message, source },
    Io { operation, path, error },
    Rename { source_path, dest_path, error },
    Runtime { message },
    Usage { message },
}

GNU sed compatible error messages and exit codes.


Key Components

Two-Stage Command Representation

Red uses a deliberate two-stage command representation, following the classic AST-to-IR pattern from compilers:

Parser Command (parser/ast.rs)

pub enum Command {
    Substitution {
        pattern: String,           // Raw regex pattern
        replacement: String,       // Raw replacement string
        flags: SubstitutionFlags,  // Struct with all flags
        // ...
    },
    Group { commands: Vec<Command> },  // Flattened at compile time
    Version { version: String },       // Compile-time check only
    Comment(String),                   // Discarded
    // ... more variants
}

Purpose: Preserves source structure, enables error reporting with context

Runtime Command (engine/types.rs)

pub enum Command {
    Substitution {
        pattern: SedRegex,               // Compiled regex
        replacement: ReplacementTemplate, // Parsed replacement
        global: bool,                    // Expanded from flags
        print: bool,
        literal_pattern: Option<String>, // Fast-path optimization
        literal_replacement: Option<String>,
        // ... fields with optimizations
    },
    // ... more variants (no Group/Version/Comment)
}

Purpose: Optimized for execution, pre-compiled patterns, literal optimizations

Why Two Enums?

  1. Separation of concerns: Parser produces semantic representation, engine uses optimized representation
  2. Compile-time processing: Group commands are flattened, Version is checked, Comments discarded
  3. Optimization fields: Runtime adds literal_pattern, literal_replacement, use_last that don't exist in source
  4. Type differences: String → SedRegex, String → ReplacementTemplate, SubstitutionFlags → individual booleans
  5. Common compiler pattern: Similar to AST → IR transformation in production compilers

PatternSpace: Dual Representation

The PatternSpace struct (src/engine/pattern_space.rs) maintains synchronized representations:

pub struct PatternSpace {
    raw: Vec<u8>,              // Raw bytes (may contain invalid UTF-8)
    active_start: usize,       // Active pointer for D command optimization
    text_cache: Option<String>, // Cached UTF-8 representation
}

Key methods:

  • raw() - Returns active bytes slice
  • text() - Returns UTF-8 text (from cache)
  • set_raw() - Updates from bytes, rebuilds text cache with lossy conversion
  • delete_first_line() - O(1) using active pointer (GNU sed technique)

Data Flow

1. Script Compilation

  Script Text
       │
       ▼
┌─────────────┐
│   Lexer     │ Tokenize with raw bytes tracking
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Parser    │ Build AST (parser::Command)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Validator  │ Check POSIX rules, labels
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Converter  │ compile_to_runtime_commands()
└──────┬──────┘
       │
       ▼
RuntimeCommand[] (with compiled regexes)

2. Command Execution

Input Lines (with raw bytes)
         │
         ▼
┌──────────────────┐
│ ExecutionContext │◄─── Commands
│ - pattern_space  │
│ - hold_space     │
│ - line_number    │
└────────┬─────────┘
         │
         ▼
┌──────────────────────┐
│ apply_commands_with_ │
│ context()            │
└───────┬──────────────┘
        │
        ▼
 CommandResult (with raw bytes)
        │
        ▼
    ┌────────┐
    │ Output │ (preserves raw bytes)
    └────────┘

3. In-Place Editing

  Original File
       │
       ▼
┌──────────────┐
│ Read Content │ (as raw bytes)
└──────┬───────┘
       │
       ▼
┌──────────────┐
│   Process    │ (preserve raw bytes through pipeline)
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  Temp File   │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│    Rename    │ (atomic)
└──────────────┘

Bytes and Encoding

Design: UTF-8 First with Raw Bytes Tracking

Red uses Rust's String type (valid UTF-8) as the primary text representation for regex matching, but maintains raw bytes throughout the pipeline for output.

Raw Bytes Preservation

Input (bytes) → PatternSpace.raw → Commands → Output (bytes)
                     ↓
               text_cache (lossy)
                     ↓
              Regex matching on text

Implementation:

  • Hold space raw bytes tracking (hold_space_raw: Vec<u8>)
  • CommandResult::Continue(String, Option<Vec<u8>>) - carries raw bytes through engine
  • CommandResult::Print(String, Option<Vec<u8>>) - outputs raw bytes when available
  • LiteralBytes(Vec<u8>) token in replacement templates

Component Status

Component Status Implementation
Script reading Bytes Vec<u8> + lossy UTF-8 conversion
Lexer Bytes-aware char-to-byte mapping for raw bytes
Parser Bytes-aware replacement_raw_bytes field
Replacement template Bytes LiteralBytes(Vec<u8>) token
Escape sequences Raw output \xNN, \oNNN, \dNNN, \cX
Pattern matching UTF-8 Regex operates on text cache
Data I/O Bytes Raw bytes preserved through hold/pattern space
Substitution pipeline Bytes Raw bytes tracked across multiple substitutions

Regex Engine

Red implements a custom regex engine with three different matchers, each optimized for different pattern types.

Three-Level Optimization Strategy

pub enum Matcher {
    Literal(LiteralMatcher),   // Fastest: direct string/byte search
    Dfa(DfaMatcher),           // Fast: Deterministic Finite Automaton
    Nfa(NfaMatcher),           // Full-featured: Nondeterministic Finite Automaton
}

Matcher Selection (compile time)

  Pattern Analysis
         │
         ▼
┌─────────────────────┐
│ Is literal?         │──Yes──▶ LiteralMatcher (fastest)
│ (no metacharacters) │
└────────┬────────────┘
         │ No
         ▼
┌─────────────────┐
│ Has backrefs?   │──Yes──▶ NfaMatcher (required)
│ (\1, \2, etc)   │
└────────┬────────┘
         │ No
         ▼
┌─────────────────┐
│ MBCS locale?    │──Yes──▶ NfaMatcher (for non-ASCII patterns)
│ (MB_CUR_MAX > 1)│
└────────┬────────┘
         │ No
         ▼
      DfaMatcher (fast, no backrefs)

Literal Matcher (src/regex/literal.rs)

When used: Patterns without any regex metacharacters (e.g., hello, foo_bar)

Implementation:

pub struct LiteralMatcher {
    pattern: String,
    ignore_case: bool,
}

impl LiteralMatcher {
    // String-based matching (UTF-8)
    pub fn find(&self, text: &str) -> Option<usize>

    // Byte-based matching (MBCS-safe for ASCII patterns)
    pub fn find_bytes(&self, bytes: &[u8]) -> Option<usize>
    pub fn find_bytes_from(&self, bytes: &[u8], start: usize) -> Option<usize>
}

MBCS behavior: ASCII-only literal patterns can safely use byte-level matching even in MBCS locales, because ASCII bytes (0x00-0x7F) never appear as part of multibyte characters.

DFA Matcher (src/regex/dfa.rs)

When used: Simple patterns without backreferences in single-byte locales

How DFA works:

  • Compiles pattern into a state machine at parse time
  • Each state has exactly one transition per input character
  • No backtracking - processes input in a single pass
  • Memory: O(states × alphabet_size)

Limitations:

  • Cannot support backreferences - This is a mathematical impossibility, not an implementation limitation. Backreferences (\1, \2) require the regex engine to "remember" what was matched and compare against it later. DFAs have no memory of the path taken to reach a state.
  • Disabled in MBCS locales - DFA operates on fixed-width input units. In MBCS, character widths vary (1-4 bytes), requiring boundary detection that DFA architecture doesn't support.

NFA Matcher (src/regex/nfa.rs, src/regex/backtrack.rs)

When used:

  • Patterns with backreferences (\1, \2, etc.)
  • Patterns with capturing groups that need extraction
  • All patterns in MBCS locales (for non-ASCII patterns)

How NFA differs from DFA:

                    DFA                          NFA
             ┌─────────────────┐          ┌─────────────────┐
Input 'a' →  │ State 1 ──────▶ │ State 2  │ State 1 ──┬───▶ │ State 2
             │    (one path)   │          │           └───▶ │ State 3
             └─────────────────┘          │    (multiple)   │
                                          └─────────────────┘
  • NFA can be in multiple states simultaneously
  • Explores all possible paths (via backtracking)
  • Can "remember" captured groups for backreferences
  • More flexible but slower

MBCS support in NFA:

// NFA uses MbText for character boundary tracking
pub struct MbText<'a> {
    raw: &'a [u8],
    char_boundaries: Vec<usize>,  // Where each character starts
    char_validity: Vec<bool>,     // Is each sequence valid?
}

// Transitions respect character boundaries
match transition {
    Transition::Any => {
        // In MBCS mode: advance by full character, not byte
        let char_len = mbtext.char_at(pos).len();
        next_pos = pos + char_len;
    }
}

Why Both DFA and NFA?

Aspect DFA NFA
Speed O(n) - very fast O(n×m) - slower
Backreferences Impossible Full support
Memory usage Higher (state table) Lower
MBCS support Disabled Full support
Use case Simple patterns Complex patterns

Cannot remove either:

  • Removing DFA: Performance regression for simple patterns
  • Removing NFA: Backreferences would be impossible (s/\(foo\)/\1\1/ fails)

MBCS Locale Behavior

When MB_CUR_MAX > 1 (Shift-JIS, EUC-JP, etc.):

Pattern Type              Matcher Used       Why
─────────────────────────────────────────────────────
ASCII literal "hello"     LiteralMatcher     Safe: ASCII bytes don't overlap MBCS
Non-ASCII literal "日本"  NfaMatcher         Needs MBCS boundary handling
Any with backrefs         NfaMatcher         DFA can't do backrefs
Simple /[a-z]+/           NfaMatcher         DFA disabled in MBCS

Example of why DFA fails in MBCS:

Shift-JIS "表" = bytes [0x95, 0x5C]
ASCII "\" = byte [0x5C]

Pattern: /\\/  (match backslash)

DFA (byte-level): Would incorrectly match inside "表"
                  because 0x5C appears as second byte

NFA (MBCS-aware): Correctly skips "表" because MbText
                  knows 0x5C is part of a 2-byte character

GNU sed Compatibility

Compatibility Features

CLI Compatibility

  • Natural order preservation of -e and -f
  • Separate usage (stderr) vs help (stdout) messages
  • Exit code 4 for usage errors
  • Write error detection (/dev/full)

Command Compatibility

  • All standard sed commands (s, d, p, a, i, c, etc.)
  • Address ranges (1,10, /start/,/end/, $)
  • Step addressing (1~2)
  • Hold space operations (h, H, g, G, x)
  • Branching (b, t, T, :label)
  • Byte-level operations (l command with octal escapes)

Multibyte Character Set (MBCS) Support

Full MBCS support for non-UTF-8 locales (Shift-JIS, EUC-JP, etc.)

Feature Implementation
Locale detection mbcs::initialize() via setlocale(), MB_CUR_MAX
Character boundaries MbText struct with mbrlen()
Regex dot matching Transition::Any respects MB boundaries
Character classes mbrtowc() conversion for CharSet matching
Capture groups raw_bytes field preserves MBCS bytes
Backreferences Byte-level comparison in MB mode
Invalid/incomplete sequences char_validity tracking, skip in dot match
Raw bytes output render_replacement_to_bytes() sink

Key MBCS Data Structures:

pub struct MbText<'a> {
    raw: &'a [u8],                    // Raw input bytes
    char_boundaries: Vec<usize>,      // Byte offset for each char start
    char_validity: Vec<bool>,         // Valid vs invalid/incomplete
}

pub struct MbChar<'a> {
    bytes: &'a [u8],                  // One logical character's bytes
}

SELinux Support

When built with the selinux feature (Linux only):

  • Preserves SELinux contexts during in-place editing
  • Without --follow-symlinks: Uses symlink's context for new file
  • With --follow-symlinks: Uses target file's context

Development Guide

Adding New Commands

  1. Add token in src/parser/lexer.rs
  2. Add AST node in src/parser/ast.rs
  3. Add parsing logic in src/parser.rs
  4. Add runtime command in src/engine/types.rs
  5. Add conversion in src/lib.rs:convert_new_command_to_old()
  6. Add execution logic in src/engine/exec.rs
  7. Add tests in tests/

Adding New Options

  1. Add field to CliArgs in src/cli.rs
  2. Add parsing in parse_args()
  3. Add to Context in src/context.rs
  4. Update RunConfig in src/lib.rs
  5. Add validation if needed

Testing

# Unit tests
cargo test

# GNU sed compatibility tests
./scripts/run_gnused_tests.sh

# Specific test
./scripts/run_gnused_tests.sh --test-pattern "help.sh"

# Fast mode (debug build, compact output)
./scripts/run_gnused_tests.sh --fast

References