Rust implementation of GNU sed - Stream Editor
- Overview
- Core Architecture
- Module Structure
- Key Components
- Data Flow
- Bytes and Encoding
- Regex Engine
- GNU sed Compatibility
- Development Guide
Red is a drop-in replacement for GNU sed, written in Rust.
- Two-Phase Processing: Compile scripts once, execute many times
- Unified Context: Single configuration object passed through all modules
- Type Safety: Leverage Rust's type system for correctness
- GNU Compatibility: Match GNU sed behavior
| Aspect | GNU sed | Red | Notes |
|---|---|---|---|
| Command representation | 1 struct with union | 2 enums (parser + runtime) | Standard AST→IR pattern |
| Processing | Byte-first | String-first (UTF-8) | Raw bytes tracked separately |
| Regex engine | POSIX regex + DFA | Custom 3-level matcher | Literal/DFA/NFA |
| Memory | Obstack allocator | Standard Rust allocations | |
| Jumps | Array indices | Pre-resolved indices | Resolved at compile time |
┌─────────────┐
│ CLI Args │
└──────┬──────┘
│
▼
┌─────────────┐ ┌──────────────┐
│ Context │────▶│ Validation │
└──────┬──────┘ └──────────────┘
│
▼
┌─────────────┐ ┌──────────────┐ ┌──────────────┐
│ Parser │────▶│ Lexer │────▶│ AST Nodes │
└──────┬──────┘ └──────────────┘ └──────────────┘
│ (parser::Command)
│
▼
┌────────────────────┐
│ convert_to_runtime │ (AST → RuntimeCommand)
└──────┬─────────────┘
│
▼
┌─────────────┐
│ Commands │ (RuntimeCommand with compiled regexes)
└──────┬──────┘
│
▼
┌─────────────┐ ┌──────────────┐ ┌──────────────┐
│ Engine │────▶│ Evaluator │────▶│ Output │
└─────────────┘ └──────────────┘ └──────────────┘
Input: Script text + Context
Output: List of RuntimeCommand
Purpose: Validate and optimize scripts before execution
pub fn parse_scripts_to_commands(
scripts: &[(String, ScriptSource)],
ctx: &Context,
) -> Result<Vec<RuntimeCommand>>Input: Commands + Input lines Output: Transformed text Purpose: Apply commands to input data
pub fn execute_over_lines(
commands: &[RuntimeCommand],
// ... parameters
) -> Result<()>main.rs
│
▼
cli.rs ──────────────────────────────────────┐
│ │
▼ ▼
lib.rs ◄──────────────────────────────── errors.rs
│
├──────────────┬──────────────┬──────────────┐
▼ ▼ ▼ ▼
parser/ engine/ fileio/ util/
├── mod.rs ├── mod.rs ├── mod.rs ├── regex.rs
├── lexer.rs ├── exec.rs ├── lines.rs └── version.rs
└── ast.rs ├── addr.rs ├── inplace.rs
├── types.rs └── encoding.rs
└── pattern_space.rs
│
▼
regex/
├── mod.rs
├── parser.rs
├── dfa.rs
├── nfa.rs
├── backtrack.rs
└── literal.rs
Responsibility: Argument parsing and validation
- Uses
lexoptfor natural order preservation - Separate
print_usage()andhelp_text()functions - Exit codes match GNU sed (4 for usage errors)
Key Functions:
pub fn parse_args() -> Result<CliArgs>Responsibility: Unified configuration management
pub struct Context {
// Regex mode
pub extended_regex: bool,
// POSIX compliance
pub posix: PosixMode,
pub strict_posix: bool,
// Execution modes
pub quiet: bool,
pub null_data: bool,
pub unbuffered: bool,
pub separate_files: bool,
// In-place editing
pub in_place_suffix: Option<String>,
pub follow_symlinks: bool,
// Formatting
pub line_length: usize,
// Security
pub sandbox: bool,
// Scripts (for #n quiet mode detection)
pub scripts_with_sources: Vec<(String, ScriptSource)>,
}POSIX Modes:
pub enum PosixMode {
Extended, // GNU extensions enabled (default)
Correct, // POSIXLY_CORRECT env var
Basic, // --posix flag (strict)
}Responsibility: File reading, encoding detection, and backup utilities
encoding.rs- Encoding detection from BOM and localelines.rs- File/stdin reading with line splittinginplace.rs- Backup suffix expansion for in-place editingmod.rs- Module exports and output flushing helper
Key Functions:
pub fn read_all_lines(files, null_data, follow_symlinks) -> Result<(lines, bytes, filenames, encoding, ends_with_sep)>
pub fn split_file_content(content, null_data) -> (lines, bytes, encoding, ends_with_sep)
pub fn expand_backup_suffix(file_path, suffix) -> String
pub fn flush_output(out, unbuffered)Responsibility: Script compilation and validation
Tokenizes sed scripts using a state machine approach.
Key optimizations:
- Single-pass tokenization
- Bracket depth tracking
- Escape sequence handling
- Raw bytes preservation for non-UTF-8
Defines parser command structure:
pub enum Command {
Substitute { flags, pattern, replacement, ... },
Delete { range, negated },
Print { range, negated, ... },
Group { commands }, // Flattened at compile time
Comment(String), // Discarded at compile time
Version { version }, // Checked at compile time
// ... more command types
}Converts tokens to AST nodes with validation:
pub fn parse(tokens: &[Token], ctx: &Context) -> Result<Vec<Command>>Responsibility: Command execution and state management
Runtime command representation:
pub enum Command {
Substitution {
pattern: SedRegex, // Compiled regex
replacement: ReplacementTemplate, // Parsed replacement
global: bool,
print: bool,
literal_pattern: Option<String>, // Fast-path optimization
literal_replacement: Option<String>,
// ... more fields
},
// ... more runtime commands
}Maintains execution state:
pub struct ExecutionContext {
pub pattern_space: PatternSpace,
pub hold_space: String,
pub hold_space_raw: Vec<u8>,
pub line_number: usize,
pub is_last_line: bool,
// ... state fields
}Evaluates address ranges with state tracking:
pub fn evaluate(&mut self, range: Option<&AddressRange>, ctx: &ExecutionContext) -> boolFeatures:
- Line number ranges (5,10)
- Regex matches (/pattern/)
- Step addressing (1~2)
- Dollar ($) for last line
- Range state machine (INACTIVE → ACTIVE → CLOSED)
Unified error types:
pub enum SedError {
Parse { message, source },
Io { operation, path, error },
Rename { source_path, dest_path, error },
Runtime { message },
Usage { message },
}GNU sed compatible error messages and exit codes.
Red uses a deliberate two-stage command representation, following the classic AST-to-IR pattern from compilers:
pub enum Command {
Substitution {
pattern: String, // Raw regex pattern
replacement: String, // Raw replacement string
flags: SubstitutionFlags, // Struct with all flags
// ...
},
Group { commands: Vec<Command> }, // Flattened at compile time
Version { version: String }, // Compile-time check only
Comment(String), // Discarded
// ... more variants
}Purpose: Preserves source structure, enables error reporting with context
pub enum Command {
Substitution {
pattern: SedRegex, // Compiled regex
replacement: ReplacementTemplate, // Parsed replacement
global: bool, // Expanded from flags
print: bool,
literal_pattern: Option<String>, // Fast-path optimization
literal_replacement: Option<String>,
// ... fields with optimizations
},
// ... more variants (no Group/Version/Comment)
}Purpose: Optimized for execution, pre-compiled patterns, literal optimizations
- Separation of concerns: Parser produces semantic representation, engine uses optimized representation
- Compile-time processing: Group commands are flattened, Version is checked, Comments discarded
- Optimization fields: Runtime adds
literal_pattern,literal_replacement,use_lastthat don't exist in source - Type differences: String → SedRegex, String → ReplacementTemplate, SubstitutionFlags → individual booleans
- Common compiler pattern: Similar to AST → IR transformation in production compilers
The PatternSpace struct (src/engine/pattern_space.rs) maintains synchronized representations:
pub struct PatternSpace {
raw: Vec<u8>, // Raw bytes (may contain invalid UTF-8)
active_start: usize, // Active pointer for D command optimization
text_cache: Option<String>, // Cached UTF-8 representation
}Key methods:
raw()- Returns active bytes slicetext()- Returns UTF-8 text (from cache)set_raw()- Updates from bytes, rebuilds text cache with lossy conversiondelete_first_line()- O(1) using active pointer (GNU sed technique)
Script Text
│
▼
┌─────────────┐
│ Lexer │ Tokenize with raw bytes tracking
└──────┬──────┘
│
▼
┌─────────────┐
│ Parser │ Build AST (parser::Command)
└──────┬──────┘
│
▼
┌─────────────┐
│ Validator │ Check POSIX rules, labels
└──────┬──────┘
│
▼
┌─────────────┐
│ Converter │ compile_to_runtime_commands()
└──────┬──────┘
│
▼
RuntimeCommand[] (with compiled regexes)
Input Lines (with raw bytes)
│
▼
┌──────────────────┐
│ ExecutionContext │◄─── Commands
│ - pattern_space │
│ - hold_space │
│ - line_number │
└────────┬─────────┘
│
▼
┌──────────────────────┐
│ apply_commands_with_ │
│ context() │
└───────┬──────────────┘
│
▼
CommandResult (with raw bytes)
│
▼
┌────────┐
│ Output │ (preserves raw bytes)
└────────┘
Original File
│
▼
┌──────────────┐
│ Read Content │ (as raw bytes)
└──────┬───────┘
│
▼
┌──────────────┐
│ Process │ (preserve raw bytes through pipeline)
└──────┬───────┘
│
▼
┌──────────────┐
│ Temp File │
└──────┬───────┘
│
▼
┌──────────────┐
│ Rename │ (atomic)
└──────────────┘
Red uses Rust's String type (valid UTF-8) as the primary text representation for regex matching, but maintains raw bytes throughout the pipeline for output.
Input (bytes) → PatternSpace.raw → Commands → Output (bytes)
↓
text_cache (lossy)
↓
Regex matching on text
Implementation:
- Hold space raw bytes tracking (
hold_space_raw: Vec<u8>) CommandResult::Continue(String, Option<Vec<u8>>)- carries raw bytes through engineCommandResult::Print(String, Option<Vec<u8>>)- outputs raw bytes when availableLiteralBytes(Vec<u8>)token in replacement templates
| Component | Status | Implementation |
|---|---|---|
| Script reading | Bytes | Vec<u8> + lossy UTF-8 conversion |
| Lexer | Bytes-aware | char-to-byte mapping for raw bytes |
| Parser | Bytes-aware | replacement_raw_bytes field |
| Replacement template | Bytes | LiteralBytes(Vec<u8>) token |
| Escape sequences | Raw output | \xNN, \oNNN, \dNNN, \cX |
| Pattern matching | UTF-8 | Regex operates on text cache |
| Data I/O | Bytes | Raw bytes preserved through hold/pattern space |
| Substitution pipeline | Bytes | Raw bytes tracked across multiple substitutions |
Red implements a custom regex engine with three different matchers, each optimized for different pattern types.
pub enum Matcher {
Literal(LiteralMatcher), // Fastest: direct string/byte search
Dfa(DfaMatcher), // Fast: Deterministic Finite Automaton
Nfa(NfaMatcher), // Full-featured: Nondeterministic Finite Automaton
} Pattern Analysis
│
▼
┌─────────────────────┐
│ Is literal? │──Yes──▶ LiteralMatcher (fastest)
│ (no metacharacters) │
└────────┬────────────┘
│ No
▼
┌─────────────────┐
│ Has backrefs? │──Yes──▶ NfaMatcher (required)
│ (\1, \2, etc) │
└────────┬────────┘
│ No
▼
┌─────────────────┐
│ MBCS locale? │──Yes──▶ NfaMatcher (for non-ASCII patterns)
│ (MB_CUR_MAX > 1)│
└────────┬────────┘
│ No
▼
DfaMatcher (fast, no backrefs)
When used: Patterns without any regex metacharacters (e.g., hello, foo_bar)
Implementation:
pub struct LiteralMatcher {
pattern: String,
ignore_case: bool,
}
impl LiteralMatcher {
// String-based matching (UTF-8)
pub fn find(&self, text: &str) -> Option<usize>
// Byte-based matching (MBCS-safe for ASCII patterns)
pub fn find_bytes(&self, bytes: &[u8]) -> Option<usize>
pub fn find_bytes_from(&self, bytes: &[u8], start: usize) -> Option<usize>
}MBCS behavior: ASCII-only literal patterns can safely use byte-level matching even in MBCS locales, because ASCII bytes (0x00-0x7F) never appear as part of multibyte characters.
When used: Simple patterns without backreferences in single-byte locales
How DFA works:
- Compiles pattern into a state machine at parse time
- Each state has exactly one transition per input character
- No backtracking - processes input in a single pass
- Memory: O(states × alphabet_size)
Limitations:
- Cannot support backreferences - This is a mathematical impossibility, not an implementation limitation. Backreferences (
\1,\2) require the regex engine to "remember" what was matched and compare against it later. DFAs have no memory of the path taken to reach a state. - Disabled in MBCS locales - DFA operates on fixed-width input units. In MBCS, character widths vary (1-4 bytes), requiring boundary detection that DFA architecture doesn't support.
When used:
- Patterns with backreferences (
\1,\2, etc.) - Patterns with capturing groups that need extraction
- All patterns in MBCS locales (for non-ASCII patterns)
How NFA differs from DFA:
DFA NFA
┌─────────────────┐ ┌─────────────────┐
Input 'a' → │ State 1 ──────▶ │ State 2 │ State 1 ──┬───▶ │ State 2
│ (one path) │ │ └───▶ │ State 3
└─────────────────┘ │ (multiple) │
└─────────────────┘
- NFA can be in multiple states simultaneously
- Explores all possible paths (via backtracking)
- Can "remember" captured groups for backreferences
- More flexible but slower
MBCS support in NFA:
// NFA uses MbText for character boundary tracking
pub struct MbText<'a> {
raw: &'a [u8],
char_boundaries: Vec<usize>, // Where each character starts
char_validity: Vec<bool>, // Is each sequence valid?
}
// Transitions respect character boundaries
match transition {
Transition::Any => {
// In MBCS mode: advance by full character, not byte
let char_len = mbtext.char_at(pos).len();
next_pos = pos + char_len;
}
}| Aspect | DFA | NFA |
|---|---|---|
| Speed | O(n) - very fast | O(n×m) - slower |
| Backreferences | Impossible | Full support |
| Memory usage | Higher (state table) | Lower |
| MBCS support | Disabled | Full support |
| Use case | Simple patterns | Complex patterns |
Cannot remove either:
- Removing DFA: Performance regression for simple patterns
- Removing NFA: Backreferences would be impossible (
s/\(foo\)/\1\1/fails)
When MB_CUR_MAX > 1 (Shift-JIS, EUC-JP, etc.):
Pattern Type Matcher Used Why
─────────────────────────────────────────────────────
ASCII literal "hello" LiteralMatcher Safe: ASCII bytes don't overlap MBCS
Non-ASCII literal "日本" NfaMatcher Needs MBCS boundary handling
Any with backrefs NfaMatcher DFA can't do backrefs
Simple /[a-z]+/ NfaMatcher DFA disabled in MBCS
Example of why DFA fails in MBCS:
Shift-JIS "表" = bytes [0x95, 0x5C]
ASCII "\" = byte [0x5C]
Pattern: /\\/ (match backslash)
DFA (byte-level): Would incorrectly match inside "表"
because 0x5C appears as second byte
NFA (MBCS-aware): Correctly skips "表" because MbText
knows 0x5C is part of a 2-byte character
- Natural order preservation of
-eand-f - Separate usage (stderr) vs help (stdout) messages
- Exit code 4 for usage errors
- Write error detection (
/dev/full)
- All standard sed commands (s, d, p, a, i, c, etc.)
- Address ranges (1,10, /start/,/end/, $)
- Step addressing (1~2)
- Hold space operations (h, H, g, G, x)
- Branching (b, t, T, :label)
- Byte-level operations (l command with octal escapes)
Full MBCS support for non-UTF-8 locales (Shift-JIS, EUC-JP, etc.)
| Feature | Implementation |
|---|---|
| Locale detection | mbcs::initialize() via setlocale(), MB_CUR_MAX |
| Character boundaries | MbText struct with mbrlen() |
| Regex dot matching | Transition::Any respects MB boundaries |
| Character classes | mbrtowc() conversion for CharSet matching |
| Capture groups | raw_bytes field preserves MBCS bytes |
| Backreferences | Byte-level comparison in MB mode |
| Invalid/incomplete sequences | char_validity tracking, skip in dot match |
| Raw bytes output | render_replacement_to_bytes() sink |
Key MBCS Data Structures:
pub struct MbText<'a> {
raw: &'a [u8], // Raw input bytes
char_boundaries: Vec<usize>, // Byte offset for each char start
char_validity: Vec<bool>, // Valid vs invalid/incomplete
}
pub struct MbChar<'a> {
bytes: &'a [u8], // One logical character's bytes
}When built with the selinux feature (Linux only):
- Preserves SELinux contexts during in-place editing
- Without
--follow-symlinks: Uses symlink's context for new file - With
--follow-symlinks: Uses target file's context
- Add token in
src/parser/lexer.rs - Add AST node in
src/parser/ast.rs - Add parsing logic in
src/parser.rs - Add runtime command in
src/engine/types.rs - Add conversion in
src/lib.rs:convert_new_command_to_old() - Add execution logic in
src/engine/exec.rs - Add tests in
tests/
- Add field to
CliArgsinsrc/cli.rs - Add parsing in
parse_args() - Add to
Contextinsrc/context.rs - Update
RunConfiginsrc/lib.rs - Add validation if needed
# Unit tests
cargo test
# GNU sed compatibility tests
./scripts/run_gnused_tests.sh
# Specific test
./scripts/run_gnused_tests.sh --test-pattern "help.sh"
# Fast mode (debug build, compact output)
./scripts/run_gnused_tests.sh --fast