26 commits
9e52ff4
CHRN-27 - Update reader.rs (#1)
fr3akX Oct 16, 2025
233447e
add ralphex entries to .gitignore
fr3akX Feb 27, 2026
c2e2c9d
feat: fix cfg_attr inline hints in reader.rs
fr3akX Feb 27, 2026
ccd4f07
feat: optimize varint32 with fast-path batch bounds check
fr3akX Feb 27, 2026
bd8a4bf
feat: optimize varint64 with fast-path batch bounds check
fr3akX Feb 27, 2026
65063f5
feat: add bounds check in read_len and capacity hint in read_packed
fr3akX Feb 27, 2026
97baaf9
feat: expand benchmarks for fixed32, fixed64, string, and packed reads
fr3akX Feb 27, 2026
7e11452
feat: verify acceptance criteria for reader optimizations
fr3akX Feb 27, 2026
fe21c56
fix: address code review findings
fr3akX Feb 27, 2026
9e29d8d
fix: address code review findings
fr3akX Feb 27, 2026
07040fc
move completed plan: 2026-02-27-optimize-reader-hot-paths.md
fr3akX Feb 27, 2026
ec45864
feat: replace generic read_fixed with direct from_le_bytes implementa…
fr3akX Feb 27, 2026
bc09774
feat: add #[cold] annotations to varint slow paths
fr3akX Feb 27, 2026
d619949
feat: add size_hint and ExactSizeIterator to PackedFixed iterators
fr3akX Feb 27, 2026
67094db
feat: benchmark and validate reader optimizations
fr3akX Feb 27, 2026
e59d8ec
feat: move completed plan to docs/plans/completed
fr3akX Feb 27, 2026
8fbae99
fix: address code review findings
fr3akX Feb 27, 2026
205fa03
feat: add branchless varint32 decode for ARM64
fr3akX Feb 27, 2026
9766b79
feat: add branchless varint64 decode for ARM64
fr3akX Feb 27, 2026
f75e4fa
feat: add NEON batch varint32 decode for packed fields
fr3akX Feb 27, 2026
06e4c0b
feat: add ARM64-targeted benchmarks for branchless varint and NEON ba…
fr3akX Feb 27, 2026
1b705ac
feat: verify acceptance criteria for ARM64 NEON read path optimizations
fr3akX Feb 27, 2026
8c2d4d4
feat: update documentation with ARM64/NEON optimization patterns
fr3akX Feb 27, 2026
1d80dce
fix: address code review findings
fr3akX Feb 27, 2026
4b6093d
perf
fr3akX Mar 3, 2026
1f072a2
expose fields
fr3akX Mar 3, 2026
3 changes: 3 additions & 0 deletions .gitignore
@@ -57,3 +57,6 @@ quick-protobuf/tests/rust_protobuf/v2/issue_170_c.rs
quick-protobuf/tests/rust_protobuf/v2/issue_170_common.rs
quick-protobuf/tests/rust_protobuf/v2/issue_170_d.rs
quick-protobuf/tests/rust_protobuf/v2/issue_170_e.rs

# ralphex progress logs
.ralphex/progress/
75 changes: 75 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,75 @@
# CLAUDE.md — Developer Knowledge Base

## Project Structure

- `quick-protobuf/src/reader.rs` — core protobuf reader; hot path optimizations live here
- `quick-protobuf/src/writer.rs` — core protobuf writer
- `quick-protobuf/benches/benches.rs` — criterion benchmarks
- `pb-rs/` — code generator (separate crate)

## Build and Test Commands

- Library tests: `cargo test --manifest-path quick-protobuf/Cargo.toml --lib`
- All tests: `cargo test --manifest-path quick-protobuf/Cargo.toml`
- Benchmarks: `cargo bench --manifest-path quick-protobuf/Cargo.toml`
- Linter: `cargo clippy --manifest-path quick-protobuf/Cargo.toml`

## Architecture: BytesReader Performance Patterns

### Fast-path / slow-path split
Hot reader methods (`read_varint32`, `read_varint64`) check whether enough bytes are
available for the maximum possible read, then use direct slice indexing without per-byte
bounds checks. The slow-path fallback is a separate `#[cold]` function. Follow this
pattern for any future varint or variable-length optimizations.
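
A minimal sketch of the split, shown as free functions with hypothetical signatures (the real code is methods on `BytesReader` returning the crate's `Result`):

```rust
/// Fast path: one up-front bounds check covers the longest possible
/// varint32, so the byte loop can index the slice directly.
fn read_varint32(bytes: &[u8], pos: &mut usize, end: usize) -> Option<u32> {
    if *pos + 5 > end {
        return read_varint32_slow(bytes, pos, end);
    }
    let mut out = 0u32;
    for i in 0..5 {
        let b = bytes[*pos + i];
        out |= ((b & 0x7f) as u32) << (i * 7);
        if b & 0x80 == 0 {
            *pos += i + 1;
            return Some(out);
        }
    }
    None // 5th byte still had the continuation bit set: malformed varint32
}

/// Slow path: per-byte bounds checks, only reached near buffer boundaries.
#[cold]
fn read_varint32_slow(bytes: &[u8], pos: &mut usize, end: usize) -> Option<u32> {
    let mut out = 0u32;
    for i in 0..5 {
        if *pos + i >= end {
            return None;
        }
        let b = *bytes.get(*pos + i)?;
        out |= ((b & 0x7f) as u32) << (i * 7);
        if b & 0x80 == 0 {
            *pos += i + 1;
            return Some(out);
        }
    }
    None
}
```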

### Fixed-width reads
`read_fixed64`, `read_fixed32`, `read_sfixed64`, `read_sfixed32`, `read_float`,
`read_double` all use `self.end` (the current message-end cursor) as the primary bounds
limit, then `.get()` for a defensive secondary check against `bytes.len()`. Use
`from_le_bytes` directly; do not add `byteorder` back to `reader.rs`.
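
A rough sketch of the pattern, in free-function form with simplified error handling:

```rust
/// Primary bound is the current message end; `get` adds the defensive
/// secondary check against the physical buffer length.
fn read_fixed32(bytes: &[u8], pos: &mut usize, end: usize) -> Option<u32> {
    if *pos + 4 > end {
        return None; // would read past the current message boundary
    }
    let arr: [u8; 4] = bytes.get(*pos..*pos + 4)?.try_into().ok()?;
    *pos += 4;
    Some(u32::from_le_bytes(arr))
}
```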

### Sub-reader boundary (`self.end`)
`self.end` tracks the logical end of the current message context (set by `read_len_varint`
for length-delimited fields). All read methods must check `self.end`, not just
`bytes.len()`, to avoid escaping sub-message boundaries. `read_u8` enforces this.
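
The same rule in miniature, again with an assumed free-function signature:

```rust
fn read_u8(bytes: &[u8], pos: &mut usize, end: usize) -> Option<u8> {
    if *pos >= end {
        return None; // stop at the logical message end, not just bytes.len()
    }
    let b = *bytes.get(*pos)?;
    *pos += 1;
    Some(b)
}
```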

### Dependencies
- `byteorder` is used only in `writer.rs`. Do not add it to `reader.rs`.

### PackedFixed iterators
Any iterator type over `PackedFixed` must implement both `size_hint()` returning
`(remaining, Some(remaining))` and `ExactSizeIterator`. Test both `Owned` and `Borrowed`
variants.
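
A stand-in sketch of the contract (the struct here is hypothetical; the real iterators wrap `PackedFixed`):

```rust
struct PackedIterSketch<'a, T> {
    items: &'a [T],
    index: usize,
}

impl<'a, T: Copy> Iterator for PackedIterSketch<'a, T> {
    type Item = T;

    fn next(&mut self) -> Option<T> {
        let item = self.items.get(self.index).copied();
        self.index += item.is_some() as usize;
        item
    }

    // Without this override the default is (0, None), so collect()
    // cannot pre-allocate.
    fn size_hint(&self) -> (usize, Option<usize>) {
        let remaining = self.items.len() - self.index;
        (remaining, Some(remaining))
    }
}

// Sound because the remaining count is always exact.
impl<'a, T: Copy> ExactSizeIterator for PackedIterSketch<'a, T> {}
```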

### ARM64/NEON optimizations
ARM64-specific optimizations are gated on `cfg(target_arch = "aarch64")` with the existing
scalar code preserved under `cfg(not(target_arch = "aarch64"))`. NEON is mandatory on
ARMv8-A, so no feature flag is needed — only the target arch gate.

Three optimization layers exist:

1. **Branchless single varint decode** (`decode_varint32_branchless`, `decode_varint64_branchless`):
Free functions that load 8 bytes as a u64, use bit manipulation (MSB mask, isolate-first-terminator, shift-and-mask payload extraction) to decode without branches. Used in
`read_varint32` and `read_varint64` fast paths on aarch64.

2. **NEON batch varint32 decode** (`batch_decode_varint32_neon`):
Unsafe function using `core::arch::aarch64` NEON intrinsics (`vld1q_u8`, `vshrq_n_u8`,
`vmulq_u8`, `vpaddlq_*`) to extract continuation-bit masks from 16-byte chunks, then
decodes individual varints branchlessly. Called from `read_packed_int32` (see the mask-extraction sketch after this list).

3. **`read_packed_int32` method**: Public method on `BytesReader` that uses the NEON batch
path on aarch64 with scalar fallback for tail bytes and on non-aarch64 architectures.
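
A sketch of the mask-extraction step only; the function name and exact shape are illustrative, and the real `batch_decode_varint32_neon` goes further by turning the mask into varint boundaries and decoding:

```rust
#[cfg(target_arch = "aarch64")]
unsafe fn continuation_mask_neon(chunk: *const u8) -> u16 {
    // SAFETY: caller must guarantee `chunk` points to at least 16 readable bytes.
    use core::arch::aarch64::*;
    let v = vld1q_u8(chunk);
    // Lane i becomes 1 where byte i has its continuation (MSB) bit set.
    let msb = vshrq_n_u8::<7>(v);
    // Weight each lane by its bit position within an 8-byte half.
    let weights: [u8; 16] = [1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128];
    let bits = vmulq_u8(msb, vld1q_u8(weights.as_ptr()));
    // Pairwise-widening adds collapse each 8-byte half into one bitmask sum.
    let sum = vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(bits)));
    // Bit i of the result is set iff byte i continues its varint.
    (vgetq_lane_u64::<0>(sum) as u16) | ((vgetq_lane_u64::<1>(sum) as u16) << 8)
}
```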

When adding new ARM64 optimizations, follow these conventions:
- Gate with `cfg(target_arch = "aarch64")`, keep scalar fallback under `cfg(not(...))`
- Use `#[cfg(any(target_arch = "aarch64", test))]` for helper functions that need unit testing on all architectures (sketched after this list)
- Use `core::arch::aarch64` intrinsics (stable since Rust 1.59), not inline assembly
- Unsafe NEON functions must document their safety preconditions
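
A sketch of the testing convention, with a hypothetical helper:

```rust
// Compiled on aarch64 for real use, and everywhere under `test` so the
// unit test below runs on any host.
#[cfg(any(target_arch = "aarch64", test))]
fn count_complete_varints(cont_mask: u16) -> u32 {
    // A varint terminates at every byte whose continuation bit is clear.
    (!cont_mask).count_ones()
}

#[cfg(test)]
mod arch_helper_tests {
    #[test]
    fn counts_terminators() {
        // Bytes 0 and 2 continue; the other 14 bytes each end a varint.
        assert_eq!(super::count_complete_varints(0b101), 14);
    }
}
```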

## Inline Attribute Convention

Use `#[cfg_attr(feature = "std", inline(always))]` for hot single-expression methods and
`#[cfg_attr(feature = "std", inline)]` for larger methods. The bare `#[cfg_attr(std, ...)]`
form is a no-op (never set by the compiler) — always use `feature = "std"`.
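
In practice (hypothetical helper, not project API):

```rust
#[cfg_attr(feature = "std", inline(always))] // hot single-expression method
fn wire_type(tag: u32) -> u32 {
    tag & 0x7
}

// By contrast, `#[cfg_attr(std, inline(always))]` names a cfg key that
// rustc never sets on its own, so the inline hint silently never applies.
```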
12 changes: 12 additions & 0 deletions Changelog.md
@@ -8,6 +8,18 @@
- test: Adding missing tests
- chore: Changes to the build process or auxiliary tools/libraries/documentation

## quick-protobuf 0.8.2
- fix: correct `#[cfg_attr(std, inline)]` to `#[cfg_attr(feature = "std", inline)]` in reader.rs and writer.rs (inlining was previously a no-op)
- fix: `read_len` now validates declared field length does not exceed remaining buffer, returning `Error::UnexpectedEndOfBuffer`
- fix: `read_u8` now respects sub-reader `end` boundary, preventing reads past length-delimited message boundaries
- perf: varint32 and varint64 decoding use a fast-path batch bounds check, reducing per-byte overhead in the common case
- perf: fixed-width reads (`read_fixed32`, `read_fixed64`, `read_sfixed32`, `read_sfixed64`, `read_float`, `read_double`) rewritten with direct `from_le_bytes` and tighter `self.end` bounds check
- perf: `read_packed` pre-allocates result `Vec` with element-count capacity hint
- feat: `PackedFixedIntoIter` and `PackedFixedRefIter` implement `ExactSizeIterator` and `size_hint()`
- feat: add `read_packed_int32` method on `BytesReader` for optimized packed int32 field reading
- perf: ARM64 (aarch64) branchless varint32 and varint64 decoding using bit manipulation
- perf: ARM64 (aarch64) NEON-accelerated batch varint32 decode for packed int32 fields via `read_packed_int32`

## pb-rs 0.10.0
- fix: fix nested items and package name resolution
- fix: parser now parses comments successfully
139 changes: 139 additions & 0 deletions docs/plans/completed/2026-02-27-arm64-neon-read-path-optimizations.md
@@ -0,0 +1,139 @@
# ARM64 NEON Read Path Optimizations

## Overview

Add ARM64 NEON-accelerated read path optimizations to BytesReader, targeting AWS Graviton
processors. Three optimization areas identified through read path review: branchless single
varint decode, NEON batch varint decode for packed fields, and NEON-accelerated PackedFixed
bulk conversion. All optimizations are gated on `cfg(target_arch = "aarch64")` with automatic
scalar fallback on other architectures.

## Read Path Review Findings

### High value targets

1. **Branchless single varint decode** (read_varint32, read_varint64)
- Current code: branch-per-byte, up to 5 branches for varint32, 10 for varint64
- Optimization: load 8 bytes at once as u64, use bit manipulation + ARM64 CLZ to find
varint length, extract value branchlessly
- Covers: next_tag, read_int32, read_uint32, read_sint32, read_bool, read_enum, and all
length-prefix reads (read_string, read_bytes, read_message, read_packed)
- Expected impact: 20-40% improvement on multi-byte varints, reduces branch mispredictions

2. **NEON batch varint decode for packed repeated fields** (read_packed with varint types)
- Current code: tight scalar loop calling read_varint32 per element
- Optimization: masked-VByte approach - load 16 bytes via NEON, determine all varint
boundaries from continuation bits in parallel, extract 4+ values simultaneously
- Expected impact: 2-4x for packed int32/uint32/sint32 fields

### Medium value targets

3. **PackedFixed bulk conversion** (make_vec_from_unaligned_buf)
- Current code: ptr::copy (already good), but iteration uses per-element read_unaligned
- ARM64 handles unaligned loads efficiently, so improvement is marginal
- Skip: not worth the complexity for ARM64 where unaligned loads are fast

### Low value targets (skip)

4. **UTF-8 validation in read_string** - Rust stdlib already uses SIMD internally
5. **Field skipping in read_unknown** - low volume, not worth optimizing

## Context

- Files involved:
- `quick-protobuf/src/reader.rs` (main target)
- `quick-protobuf/benches/benches.rs` (benchmark additions)
- `quick-protobuf/Cargo.toml` (no changes expected - no new deps)
- Related patterns: fast-path/slow-path split, `#[cfg_attr(feature = "std", inline)]` convention
- Dependencies: none - uses `core::arch::aarch64` intrinsics (stable since Rust 1.59, MSRV is 1.62.1)
- Target gating: `cfg(target_arch = "aarch64")` - NEON is mandatory on ARMv8-A, no feature flag needed

## Development Approach

- **Testing approach**: Regular (code first, then tests)
- Complete each task fully before moving to the next
- ARM64-specific code is gated with `cfg(target_arch)`, scalar fallback is the existing code
- All existing tests must continue to pass on all architectures
- **CRITICAL: every task MUST include new/updated tests**
- **CRITICAL: all tests must pass before starting next task**

## Implementation Steps

### Task 1: Branchless varint32 decode

**Files:**
- Modify: `quick-protobuf/src/reader.rs`

Replace the branch-per-byte varint32 fast path with a branchless approach on aarch64 (a decode sketch follows this checklist):

- [x] Add `#[cfg(target_arch = "aarch64")]` branchless varint32 fast path in `read_varint32`
- Load 8 bytes as u64 (little-endian) from `bytes[self.start..]`
- Extract continuation-bit mask: `let cont = val & 0x8080808080808080`
- Find the first clear MSB to determine length: `((cont ^ 0x8080808080808080).trailing_zeros() / 8 + 1)` (`trailing_zeros` lowers to RBIT+CLZ on ARM64)
- Apply per-byte masks to strip continuation bits and combine 7-bit groups
- Clamp to 5 bytes max for varint32
- [x] Keep existing scalar fast path under `#[cfg(not(target_arch = "aarch64"))]`
- [x] Write tests: varint32 decode produces identical results for all varint sizes (1-5 bytes)
- [x] Run project test suite - must pass before task 2
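
A sketch of the decode under these assumptions: the caller has already verified at least 8 readable bytes, and the function name and free-function form are hypothetical:

```rust
fn decode_varint32_branchless(chunk: [u8; 8]) -> (u32, usize) {
    let val = u64::from_le_bytes(chunk);
    // Bits set where a byte's continuation bit (MSB) is clear.
    let terminators = !val & 0x8080_8080_8080_8080;
    // First clear MSB marks the last byte; clamp to 5 bytes for varint32.
    let len = ((terminators.trailing_zeros() / 8) as usize + 1).min(5);
    // Drop bytes past the varint, then strip the continuation bits.
    let payload = (val & (u64::MAX >> (64 - 8 * len))) & 0x7f7f_7f7f_7f7f_7f7f;
    // Re-pack the 7-bit groups; fixed trip count, so this unrolls flat.
    let mut out = 0u64;
    for i in 0..5 {
        out |= ((payload >> (8 * i)) & 0x7f) << (7 * i);
    }
    (out as u32, len)
}
```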

### Task 2: Branchless varint64 decode

**Files:**
- Modify: `quick-protobuf/src/reader.rs`

Apply the same branchless approach to varint64:

- [x] Add `#[cfg(target_arch = "aarch64")]` branchless varint64 fast path in `read_varint64`
- Same u64-load approach for the first 8 bytes
- For varints > 8 bytes (9-10), load a second u64 or fall back to scalar for the tail
- Combine parts using shifts matching the existing r0/r1/r2 decomposition
- [x] Keep existing scalar fast path under `#[cfg(not(target_arch = "aarch64"))]`
- [x] Write tests: varint64 decode produces identical results for all varint sizes (1-10 bytes)
- [x] Run project test suite - must pass before task 3

### Task 3: NEON batch varint32 decode for packed fields

**Files:**
- Modify: `quick-protobuf/src/reader.rs`

Add a NEON-accelerated path for `read_packed` when decoding varint-encoded types:

- [x] Add `#[cfg(target_arch = "aarch64")]` module with NEON batch varint32 decode function
- Uses `core::arch::aarch64` intrinsics: `vld1q_u8`, `vcltq_u8`, `vandq_u8`, `vgetq_lane_u8`
- Load 16 bytes via `vld1q_u8`
- Extract continuation bits via vector comparison against 0x80
- Convert continuation-bit bitmask to varint boundary positions
- Decode multiple varints (typically four or more) per 16-byte load using lookup tables or shift chains
- Write decoded u32 values directly into output Vec
- Handle tail (remaining bytes < 16) with scalar fallback
- [x] Add `read_packed_varint32_neon` method to BytesReader, called from `read_packed` when
the read closure matches varint32 patterns, or add a dedicated `read_packed_int32` method
- [x] Write tests: batch decode matches scalar decode for varied varint sizes and counts
- [x] Write tests: edge cases - empty packed field, single element, exactly 16 bytes, 17 bytes
- [x] Run project test suite - must pass before task 4

### Task 4: ARM64-targeted benchmarks

**Files:**
- Modify: `quick-protobuf/benches/benches.rs`

Add benchmarks that highlight the SIMD improvements (a hypothetical criterion sketch follows the checklist):

- [x] Add `read_packed_int32` benchmark (packed repeated int32 with mixed varint sizes)
- [x] Add `read_packed_int32_small_values` benchmark (1-byte varints, best case for batch decode)
- [x] Add `read_packed_int32_large_values` benchmark (4-5 byte varints)
- [x] Add `read_varint32_multibyte` benchmark (varints that are 2-4 bytes, to measure branchless benefit)
- [x] Verify benchmarks compile and run
- [x] Run project test suite - must pass before task 5
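
A hypothetical criterion shape for these entries; the payload generator is an assumption and the scalar reference loop stands in for the real `BytesReader` calls:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Encode `n` as a base-128 varint (helper for building test payloads).
fn encode_varint(mut n: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
}

fn bench_read_packed_int32(c: &mut Criterion) {
    let mut payload = Vec::new();
    for i in 0..1024u64 {
        encode_varint(i * 2654435761 % (1 << 20), &mut payload); // mixed 1-3 byte varints
    }
    c.bench_function("read_packed_int32", |b| {
        b.iter(|| {
            // Scalar reference decode of the whole packed payload.
            let mut pos = 0;
            let mut acc = 0u64;
            while pos < payload.len() {
                let mut v = 0u64;
                let mut shift = 0;
                loop {
                    let byte = payload[pos];
                    pos += 1;
                    v |= ((byte & 0x7f) as u64) << shift;
                    shift += 7;
                    if byte & 0x80 == 0 {
                        break;
                    }
                }
                acc = acc.wrapping_add(v);
            }
            black_box(acc)
        })
    });
}

criterion_group!(benches, bench_read_packed_int32);
criterion_main!(benches);
```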

### Task 5: Verify acceptance criteria

- [x] Run full test suite: `cargo test --manifest-path quick-protobuf/Cargo.toml`
- [x] Run linter: `cargo clippy --manifest-path quick-protobuf/Cargo.toml`
- [x] Verify all aarch64-gated code compiles (cross-check with `--target aarch64-unknown-linux-gnu` if available)
- [x] Verify non-aarch64 builds are unaffected (existing scalar paths unchanged)

### Task 6: Update documentation

- [x] Update CLAUDE.md with ARM64/NEON optimization patterns and conventions
- [x] Move this plan to `docs/plans/completed/`
89 changes: 89 additions & 0 deletions docs/plans/completed/2026-02-27-optimize-reader-fixed-reads.md
@@ -0,0 +1,89 @@
# Optimize reader fixed-width reads and code layout

## Overview

Second round of reader optimizations following the completed varint fast-path work. Focuses on three areas: (1) eliminating the generic `read_fixed` closure pattern in favor of direct `from_le_bytes` calls, (2) improving code layout with `#[cold]` annotations on slow paths, and (3) adding `size_hint` to PackedFixed iterators for better collection performance.

## Context

- Files involved: `quick-protobuf/src/reader.rs`, `quick-protobuf/benches/benches.rs`
- Related patterns: varint fast-path pattern from previous optimization round
- Dependencies: byteorder stays (still used by writer.rs); reader stops importing it
- Previous work: `docs/plans/completed/2026-02-27-optimize-reader-hot-paths.md` (inline fixes, varint fast paths, read_len bounds check, read_packed capacity hint)

## Findings from inspection

### 1. Generic `read_fixed<M, F>` closure pattern (lines 438-446)

The private `read_fixed` method takes a closure parameter `F: Fn(&[u8]) -> M` which dispatches to byteorder's `LE::read_u64`, `LE::read_u32`, etc. While the compiler monomorphizes and inlines this, the indirection chain is: caller -> read_fixed (generic) -> bytes.get().ok_or() -> closure -> LE::read_u64 -> u64::from_le_bytes. Replacing with direct `from_le_bytes` calls in each method eliminates the generic + closure layer, uses the tighter `self.end` bound instead of `bytes.len()` used by `get()`, and removes the byteorder import from reader.rs.
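
The shape of the change, sketched with assumed free-function signatures:

```rust
// Old shape: generic + closure dispatch, bounds-checked only via bytes.get().
fn read_fixed_generic<M>(
    bytes: &[u8],
    pos: &mut usize,
    n: usize,
    read: impl Fn(&[u8]) -> M,
) -> Option<M> {
    let slice = bytes.get(*pos..*pos + n)?;
    *pos += n;
    Some(read(slice))
}

// New shape: direct from_le_bytes with the tighter `end` bound checked first.
fn read_fixed64_direct(bytes: &[u8], pos: &mut usize, end: usize) -> Option<u64> {
    if *pos + 8 > end {
        return None;
    }
    let arr: [u8; 8] = bytes.get(*pos..*pos + 8)?.try_into().ok()?;
    *pos += 8;
    Some(u64::from_le_bytes(arr))
}
```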

### 2. Missing `#[cold]` on slow paths (lines 179, 313)

`read_varint32_slow` and `read_varint64_slow` are fallback paths hit only near buffer boundaries. Without `#[cold]`, the compiler treats them as equally likely as the fast path, potentially placing slow-path code in hot cache lines and degrading instruction cache utilization.

### 3. PackedFixed iterators missing `size_hint` (lines 970, 1014)

Both `PackedFixedIntoIter` and `PackedFixedRefIter` implement `Iterator` but don't override `size_hint()`, defaulting to `(0, None)`. This means `collect::<Vec<_>>()` and similar operations cannot pre-allocate, causing repeated reallocations.

## Development Approach

- **Testing approach**: Regular (code first, then tests)
- Complete each task fully before moving to the next
- **CRITICAL: every task MUST include new/updated tests**
- **CRITICAL: all tests must pass before starting next task**

## Implementation Steps

### Task 1: Replace generic read_fixed with direct from_le_bytes implementations

**Files:**
- Modify: `quick-protobuf/src/reader.rs`

Replace the generic `read_fixed<M, F: Fn(&[u8]) -> M>` method and its callers with direct implementations that use Rust's native `from_le_bytes` methods. Each method does its own bounds check against `self.end` (tighter than `bytes.len()` used by `get()`) and calls `from_le_bytes` directly.

- [x] Remove the private `read_fixed<M, F>` generic method
- [x] Implement `read_fixed64` directly: bounds check `self.start + 8 <= self.end`, then `u64::from_le_bytes(bytes[self.start..self.start+8].try_into().unwrap())`
- [x] Implement `read_fixed32` directly with same pattern (4 bytes, u32)
- [x] Implement `read_sfixed64` directly (8 bytes, i64)
- [x] Implement `read_sfixed32` directly (4 bytes, i32)
- [x] Implement `read_float` directly (4 bytes, f32 via `f32::from_le_bytes`)
- [x] Implement `read_double` directly (8 bytes, f64 via `f64::from_le_bytes`)
- [x] Remove `use byteorder::ByteOrder` and `use byteorder::LittleEndian as LE` imports from reader.rs
- [x] Run lib tests (`cargo test --manifest-path quick-protobuf/Cargo.toml --lib`)

### Task 2: Add #[cold] to slow paths

**Files:**
- Modify: `quick-protobuf/src/reader.rs`

- [x] Add `#[cold]` attribute to `read_varint32_slow`
- [x] Add `#[cold]` attribute to `read_varint64_slow`
- [x] Run lib tests

### Task 3: Add size_hint to PackedFixed iterators

**Files:**
- Modify: `quick-protobuf/src/reader.rs`

- [x] Implement `size_hint()` on `PackedFixedIntoIter`: return `(remaining, Some(remaining))` where `remaining = self.packed_fixed.len() - self.index`
- [x] Implement `size_hint()` on `PackedFixedRefIter`: same pattern
- [x] Add `ExactSizeIterator` impl for both iterators (since the count is always known)
- [x] Add test: verify `size_hint` returns correct values during iteration
- [x] Add test: verify `collect::<Vec<_>>()` produces correct results (exercises size_hint for pre-allocation)
- [x] Run lib tests

### Task 4: Benchmark and validate optimizations

**Files:**
- Modify: `quick-protobuf/benches/benches.rs`

- [x] Run existing benchmarks to establish baseline before changes (if not already done)
- [x] Run benchmarks after all changes to measure impact
- [x] Verify no regressions in any benchmark
- [x] Run full test suite (`cargo test --manifest-path quick-protobuf/Cargo.toml --lib`)
- [x] Run clippy (`cargo clippy --manifest-path quick-protobuf/Cargo.toml`)

### Task 5: Update documentation

- [x] Update CLAUDE.md if internal patterns changed
- [x] Move this plan to `docs/plans/completed/`