26 commits
9e52ff4
CHRN-27 - Update reader.rs (#1)
fr3akX Oct 16, 2025
233447e
add ralphex entries to .gitignore
fr3akX Feb 27, 2026
c2e2c9d
feat: fix cfg_attr inline hints in reader.rs
fr3akX Feb 27, 2026
ccd4f07
feat: optimize varint32 with fast-path batch bounds check
fr3akX Feb 27, 2026
bd8a4bf
feat: optimize varint64 with fast-path batch bounds check
fr3akX Feb 27, 2026
65063f5
feat: add bounds check in read_len and capacity hint in read_packed
fr3akX Feb 27, 2026
97baaf9
feat: expand benchmarks for fixed32, fixed64, string, and packed reads
fr3akX Feb 27, 2026
7e11452
feat: verify acceptance criteria for reader optimizations
fr3akX Feb 27, 2026
fe21c56
fix: address code review findings
fr3akX Feb 27, 2026
9e29d8d
fix: address code review findings
fr3akX Feb 27, 2026
07040fc
move completed plan: 2026-02-27-optimize-reader-hot-paths.md
fr3akX Feb 27, 2026
ec45864
feat: replace generic read_fixed with direct from_le_bytes implementa…
fr3akX Feb 27, 2026
bc09774
feat: add #[cold] annotations to varint slow paths
fr3akX Feb 27, 2026
d619949
feat: add size_hint and ExactSizeIterator to PackedFixed iterators
fr3akX Feb 27, 2026
67094db
feat: benchmark and validate reader optimizations
fr3akX Feb 27, 2026
e59d8ec
feat: move completed plan to docs/plans/completed
fr3akX Feb 27, 2026
8fbae99
fix: address code review findings
fr3akX Feb 27, 2026
205fa03
feat: add branchless varint32 decode for ARM64
fr3akX Feb 27, 2026
9766b79
feat: add branchless varint64 decode for ARM64
fr3akX Feb 27, 2026
f75e4fa
feat: add NEON batch varint32 decode for packed fields
fr3akX Feb 27, 2026
06e4c0b
feat: add ARM64-targeted benchmarks for branchless varint and NEON ba…
fr3akX Feb 27, 2026
1b705ac
feat: verify acceptance criteria for ARM64 NEON read path optimizations
fr3akX Feb 27, 2026
8c2d4d4
feat: update documentation with ARM64/NEON optimization patterns
fr3akX Feb 27, 2026
1d80dce
fix: address code review findings
fr3akX Feb 27, 2026
4b6093d
perf
fr3akX Mar 3, 2026
1f072a2
expose fields
fr3akX Mar 3, 2026
3 changes: 3 additions & 0 deletions .gitignore
@@ -57,3 +57,6 @@ quick-protobuf/tests/rust_protobuf/v2/issue_170_c.rs
quick-protobuf/tests/rust_protobuf/v2/issue_170_common.rs
quick-protobuf/tests/rust_protobuf/v2/issue_170_d.rs
quick-protobuf/tests/rust_protobuf/v2/issue_170_e.rs

# ralphex progress logs
.ralphex/progress/
75 changes: 75 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,75 @@
# CLAUDE.md — Developer Knowledge Base

## Project Structure

- `quick-protobuf/src/reader.rs` — core protobuf reader; hot path optimizations live here
- `quick-protobuf/src/writer.rs` — core protobuf writer
- `quick-protobuf/benches/benches.rs` — criterion benchmarks
- `pb-rs/` — code generator (separate crate)

## Build and Test Commands

- Library tests: `cargo test --manifest-path quick-protobuf/Cargo.toml --lib`
- All tests: `cargo test --manifest-path quick-protobuf/Cargo.toml`
- Benchmarks: `cargo bench --manifest-path quick-protobuf/Cargo.toml`
- Linter: `cargo clippy --manifest-path quick-protobuf/Cargo.toml`

## Architecture: BytesReader Performance Patterns

### Fast-path / slow-path split
Hot reader methods (`read_varint32`, `read_varint64`) check whether enough bytes are
available for the maximum possible read, then use direct slice indexing without per-byte
bounds checks. The slow-path fallback is a separate `#[cold]` function. Follow this
pattern for any future varint or variable-length optimizations.
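
A minimal sketch of the split, shown as free functions with hypothetical signatures (the real code is methods on `BytesReader` returning the crate's `Result`):

```rust
/// Fast path: one up-front bounds check covers the longest possible
/// varint32, so the byte loop can index the slice directly.
fn read_varint32(bytes: &[u8], pos: &mut usize, end: usize) -> Option<u32> {
    if *pos + 5 > end {
        return read_varint32_slow(bytes, pos, end);
    }
    let mut out = 0u32;
    for i in 0..5 {
        let b = bytes[*pos + i];
        out |= ((b & 0x7f) as u32) << (i * 7);
        if b & 0x80 == 0 {
            *pos += i + 1;
            return Some(out);
        }
    }
    None // 5th byte still had the continuation bit set: malformed varint32
}

/// Slow path: per-byte bounds checks, only reached near buffer boundaries.
#[cold]
fn read_varint32_slow(bytes: &[u8], pos: &mut usize, end: usize) -> Option<u32> {
    let mut out = 0u32;
    for i in 0..5 {
        if *pos + i >= end {
            return None;
        }
        let b = *bytes.get(*pos + i)?;
        out |= ((b & 0x7f) as u32) << (i * 7);
        if b & 0x80 == 0 {
            *pos += i + 1;
            return Some(out);
        }
    }
    None
}
```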

### Fixed-width reads
`read_fixed64`, `read_fixed32`, `read_sfixed64`, `read_sfixed32`, `read_float`,
`read_double` all use `self.end` (the current message-end cursor) as the primary bounds
limit, then `.get()` for a defensive secondary check against `bytes.len()`. Use
`from_le_bytes` directly; do not add `byteorder` back to `reader.rs`.
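
A rough sketch of the pattern, in free-function form with simplified error handling:

```rust
/// Primary bound is the current message end; `get` adds the defensive
/// secondary check against the physical buffer length.
fn read_fixed32(bytes: &[u8], pos: &mut usize, end: usize) -> Option<u32> {
    if *pos + 4 > end {
        return None; // would read past the current message boundary
    }
    let arr: [u8; 4] = bytes.get(*pos..*pos + 4)?.try_into().ok()?;
    *pos += 4;
    Some(u32::from_le_bytes(arr))
}
```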

### Sub-reader boundary (`self.end`)
`self.end` tracks the logical end of the current message context (set by `read_len_varint`
for length-delimited fields). All read methods must check `self.end`, not just
`bytes.len()`, to avoid escaping sub-message boundaries. `read_u8` enforces this.
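
The same rule in miniature, again with an assumed free-function signature:

```rust
fn read_u8(bytes: &[u8], pos: &mut usize, end: usize) -> Option<u8> {
    if *pos >= end {
        return None; // stop at the logical message end, not just bytes.len()
    }
    let b = *bytes.get(*pos)?;
    *pos += 1;
    Some(b)
}
```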

### Dependencies
- `byteorder` is used only in `writer.rs`. Do not add it to `reader.rs`.

### PackedFixed iterators
Any iterator type over `PackedFixed` must implement both `size_hint()` returning
`(remaining, Some(remaining))` and `ExactSizeIterator`. Test both `Owned` and `Borrowed`
variants.
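
A stand-in sketch of the contract (the struct here is hypothetical; the real iterators wrap `PackedFixed`):

```rust
struct PackedIterSketch<'a, T> {
    items: &'a [T],
    index: usize,
}

impl<'a, T: Copy> Iterator for PackedIterSketch<'a, T> {
    type Item = T;

    fn next(&mut self) -> Option<T> {
        let item = self.items.get(self.index).copied();
        self.index += item.is_some() as usize;
        item
    }

    // Without this override the default is (0, None), so collect()
    // cannot pre-allocate.
    fn size_hint(&self) -> (usize, Option<usize>) {
        let remaining = self.items.len() - self.index;
        (remaining, Some(remaining))
    }
}

// Sound because the remaining count is always exact.
impl<'a, T: Copy> ExactSizeIterator for PackedIterSketch<'a, T> {}
```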

### ARM64/NEON optimizations
ARM64-specific optimizations are gated on `cfg(target_arch = "aarch64")` with the existing
scalar code preserved under `cfg(not(target_arch = "aarch64"))`. NEON is mandatory on
ARMv8-A, so no feature flag is needed — only the target arch gate.

Three optimization layers exist:

1. **Branchless single varint decode** (`decode_varint32_branchless`, `decode_varint64_branchless`):
Free functions that load 8 bytes as a u64, use bit manipulation (MSB mask, isolate-first-terminator, shift-and-mask payload extraction) to decode without branches. Used in
`read_varint32` and `read_varint64` fast paths on aarch64.

2. **NEON batch varint32 decode** (`batch_decode_varint32_neon`):
Unsafe function using `core::arch::aarch64` NEON intrinsics (`vld1q_u8`, `vshrq_n_u8`,
`vmulq_u8`, `vpaddlq_*`) to extract continuation-bit masks from 16-byte chunks, then
decodes individual varints branchlessly. Called from `read_packed_int32` (see the mask-extraction sketch after this list).

3. **`read_packed_int32` method**: Public method on `BytesReader` that uses the NEON batch
path on aarch64 with scalar fallback for tail bytes and on non-aarch64 architectures.
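
A sketch of the mask-extraction step only; the function name and exact shape are illustrative, and the real `batch_decode_varint32_neon` goes further by turning the mask into varint boundaries and decoding:

```rust
#[cfg(target_arch = "aarch64")]
unsafe fn continuation_mask_neon(chunk: *const u8) -> u16 {
    // SAFETY: caller must guarantee `chunk` points to at least 16 readable bytes.
    use core::arch::aarch64::*;
    let v = vld1q_u8(chunk);
    // Lane i becomes 1 where byte i has its continuation (MSB) bit set.
    let msb = vshrq_n_u8::<7>(v);
    // Weight each lane by its bit position within an 8-byte half.
    let weights: [u8; 16] = [1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128];
    let bits = vmulq_u8(msb, vld1q_u8(weights.as_ptr()));
    // Pairwise-widening adds collapse each 8-byte half into one bitmask sum.
    let sum = vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(bits)));
    // Bit i of the result is set iff byte i continues its varint.
    (vgetq_lane_u64::<0>(sum) as u16) | ((vgetq_lane_u64::<1>(sum) as u16) << 8)
}
```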

When adding new ARM64 optimizations, follow these conventions:
- Gate with `cfg(target_arch = "aarch64")`, keep scalar fallback under `cfg(not(...))`
- Use `#[cfg(any(target_arch = "aarch64", test))]` for helper functions that need unit testing on all architectures (sketched after this list)
- Use `core::arch::aarch64` intrinsics (stable since Rust 1.59), not inline assembly
- Unsafe NEON functions must document their safety preconditions
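
A sketch of the testing convention, with a hypothetical helper:

```rust
// Compiled on aarch64 for real use, and everywhere under `test` so the
// unit test below runs on any host.
#[cfg(any(target_arch = "aarch64", test))]
fn count_complete_varints(cont_mask: u16) -> u32 {
    // A varint terminates at every byte whose continuation bit is clear.
    (!cont_mask).count_ones()
}

#[cfg(test)]
mod arch_helper_tests {
    #[test]
    fn counts_terminators() {
        // Bytes 0 and 2 continue; the other 14 bytes each end a varint.
        assert_eq!(super::count_complete_varints(0b101), 14);
    }
}
```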

## Inline Attribute Convention

Use `#[cfg_attr(feature = "std", inline(always))]` for hot single-expression methods and
`#[cfg_attr(feature = "std", inline)]` for larger methods. The bare `#[cfg_attr(std, ...)]`
form is a no-op (never set by the compiler) — always use `feature = "std"`.
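
In practice (hypothetical helper, not project API):

```rust
#[cfg_attr(feature = "std", inline(always))] // hot single-expression method
fn wire_type(tag: u32) -> u32 {
    tag & 0x7
}

// By contrast, `#[cfg_attr(std, inline(always))]` names a cfg key that
// rustc never sets on its own, so the inline hint silently never applies.
```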
12 changes: 12 additions & 0 deletions Changelog.md
@@ -8,6 +8,18 @@
- test: Adding missing tests
- chore: Changes to the build process or auxiliary tools/libraries/documentation

## quick-protobuf 0.8.2
- fix: correct `#[cfg_attr(std, inline)]` to `#[cfg_attr(feature = "std", inline)]` in reader.rs and writer.rs (inlining was previously a no-op)
- fix: `read_len` now validates declared field length does not exceed remaining buffer, returning `Error::UnexpectedEndOfBuffer`
- fix: `read_u8` now respects sub-reader `end` boundary, preventing reads past length-delimited message boundaries
- perf: varint32 and varint64 decoding use a fast-path batch bounds check, reducing per-byte overhead in the common case
- perf: fixed-width reads (`read_fixed32`, `read_fixed64`, `read_sfixed32`, `read_sfixed64`, `read_float`, `read_double`) rewritten with direct `from_le_bytes` and tighter `self.end` bounds check
- perf: `read_packed` pre-allocates result `Vec` with element-count capacity hint
- feat: `PackedFixedIntoIter` and `PackedFixedRefIter` implement `ExactSizeIterator` and `size_hint()`
- feat: add `read_packed_int32` method on `BytesReader` for optimized packed int32 field reading
- perf: ARM64 (aarch64) branchless varint32 and varint64 decoding using bit manipulation
- perf: ARM64 (aarch64) NEON-accelerated batch varint32 decode for packed int32 fields via `read_packed_int32`

## pb-rs 0.10.0
- fix: fix nested items and package name resolution
- fix: parser now parses comments successfully
139 changes: 139 additions & 0 deletions docs/plans/completed/2026-02-27-arm64-neon-read-path-optimizations.md
@@ -0,0 +1,139 @@
# ARM64 NEON Read Path Optimizations

## Overview

Add ARM64 NEON-accelerated read path optimizations to BytesReader, targeting AWS Graviton
processors. Three optimization areas identified through read path review: branchless single
varint decode, NEON batch varint decode for packed fields, and NEON-accelerated PackedFixed
bulk conversion. All optimizations are gated on `cfg(target_arch = "aarch64")` with automatic
scalar fallback on other architectures.

## Read Path Review Findings

### High value targets

1. **Branchless single varint decode** (read_varint32, read_varint64)
- Current code: branch-per-byte, up to 5 branches for varint32, 10 for varint64
- Optimization: load 8 bytes at once as u64, use bit manipulation + ARM64 CLZ to find
varint length, extract value branchlessly
- Covers: next_tag, read_int32, read_uint32, read_sint32, read_bool, read_enum, and all
length-prefix reads (read_string, read_bytes, read_message, read_packed)
- Expected impact: 20-40% improvement on multi-byte varints, reduces branch mispredictions

2. **NEON batch varint decode for packed repeated fields** (read_packed with varint types)
- Current code: tight scalar loop calling read_varint32 per element
- Optimization: masked-VByte approach - load 16 bytes via NEON, determine all varint
boundaries from continuation bits in parallel, extract 4+ values simultaneously
- Expected impact: 2-4x for packed int32/uint32/sint32 fields

### Medium value targets

3. **PackedFixed bulk conversion** (make_vec_from_unaligned_buf)
- Current code: ptr::copy (already good), but iteration uses per-element read_unaligned
- ARM64 handles unaligned loads efficiently, so improvement is marginal
- Skip: not worth the complexity for ARM64 where unaligned loads are fast

### Low value targets (skip)

4. **UTF-8 validation in read_string** - Rust stdlib already uses SIMD internally
5. **Field skipping in read_unknown** - low volume, not worth optimizing

## Context

- Files involved:
- `quick-protobuf/src/reader.rs` (main target)
- `quick-protobuf/benches/benches.rs` (benchmark additions)
- `quick-protobuf/Cargo.toml` (no changes expected - no new deps)
- Related patterns: fast-path/slow-path split, `#[cfg_attr(feature = "std", inline)]` convention
- Dependencies: none - uses `core::arch::aarch64` intrinsics (stable since Rust 1.59, MSRV is 1.62.1)
- Target gating: `cfg(target_arch = "aarch64")` - NEON is mandatory on ARMv8-A, no feature flag needed

## Development Approach

- **Testing approach**: Regular (code first, then tests)
- Complete each task fully before moving to the next
- ARM64-specific code is gated with `cfg(target_arch)`, scalar fallback is the existing code
- All existing tests must continue to pass on all architectures
- **CRITICAL: every task MUST include new/updated tests**
- **CRITICAL: all tests must pass before starting next task**

## Implementation Steps

### Task 1: Branchless varint32 decode

**Files:**
- Modify: `quick-protobuf/src/reader.rs`

Replace the branch-per-byte varint32 fast path with a branchless approach on aarch64 (a decode sketch follows this checklist):

- [x] Add `#[cfg(target_arch = "aarch64")]` branchless varint32 fast path in `read_varint32`
- Load 8 bytes as u64 (little-endian) from `bytes[self.start..]`
- Extract continuation-bit mask: `let cont = val & 0x8080808080808080`
- Find the first clear MSB to determine length: `((cont ^ 0x8080808080808080).trailing_zeros() / 8 + 1)` (`trailing_zeros` lowers to RBIT+CLZ on ARM64)
- Apply per-byte masks to strip continuation bits and combine 7-bit groups
- Clamp to 5 bytes max for varint32
- [x] Keep existing scalar fast path under `#[cfg(not(target_arch = "aarch64"))]`
- [x] Write tests: varint32 decode produces identical results for all varint sizes (1-5 bytes)
- [x] Run project test suite - must pass before task 2
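
A sketch of the decode under these assumptions: the caller has already verified at least 8 readable bytes, and the function name and free-function form are hypothetical:

```rust
fn decode_varint32_branchless(chunk: [u8; 8]) -> (u32, usize) {
    let val = u64::from_le_bytes(chunk);
    // Bits set where a byte's continuation bit (MSB) is clear.
    let terminators = !val & 0x8080_8080_8080_8080;
    // First clear MSB marks the last byte; clamp to 5 bytes for varint32.
    let len = ((terminators.trailing_zeros() / 8) as usize + 1).min(5);
    // Drop bytes past the varint, then strip the continuation bits.
    let payload = (val & (u64::MAX >> (64 - 8 * len))) & 0x7f7f_7f7f_7f7f_7f7f;
    // Re-pack the 7-bit groups; fixed trip count, so this unrolls flat.
    let mut out = 0u64;
    for i in 0..5 {
        out |= ((payload >> (8 * i)) & 0x7f) << (7 * i);
    }
    (out as u32, len)
}
```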

### Task 2: Branchless varint64 decode

**Files:**
- Modify: `quick-protobuf/src/reader.rs`

Apply the same branchless approach to varint64:

- [x] Add `#[cfg(target_arch = "aarch64")]` branchless varint64 fast path in `read_varint64`
- Same u64-load approach for the first 8 bytes
- For varints > 8 bytes (9-10), load a second u64 or fall back to scalar for the tail
- Combine parts using shifts matching the existing r0/r1/r2 decomposition
- [x] Keep existing scalar fast path under `#[cfg(not(target_arch = "aarch64"))]`
- [x] Write tests: varint64 decode produces identical results for all varint sizes (1-10 bytes)
- [x] Run project test suite - must pass before task 3

### Task 3: NEON batch varint32 decode for packed fields

**Files:**
- Modify: `quick-protobuf/src/reader.rs`

Add a NEON-accelerated path for `read_packed` when decoding varint-encoded types:

- [x] Add `#[cfg(target_arch = "aarch64")]` module with NEON batch varint32 decode function
- Uses `core::arch::aarch64` intrinsics: `vld1q_u8`, `vcltq_u8`, `vandq_u8`, `vgetq_lane_u8`
- Load 16 bytes via `vld1q_u8`
- Extract continuation bits via vector comparison against 0x80
- Convert continuation-bit bitmask to varint boundary positions
- Decode multiple varints (typically four or more) per 16-byte load using lookup tables or shift chains
- Write decoded u32 values directly into output Vec
- Handle tail (remaining bytes < 16) with scalar fallback
- [x] Add `read_packed_varint32_neon` method to BytesReader, called from `read_packed` when
the read closure matches varint32 patterns, or add a dedicated `read_packed_int32` method
- [x] Write tests: batch decode matches scalar decode for varied varint sizes and counts
- [x] Write tests: edge cases - empty packed field, single element, exactly 16 bytes, 17 bytes
- [x] Run project test suite - must pass before task 4

### Task 4: ARM64-targeted benchmarks

**Files:**
- Modify: `quick-protobuf/benches/benches.rs`

Add benchmarks that highlight the SIMD improvements (a hypothetical criterion sketch follows the checklist):

- [x] Add `read_packed_int32` benchmark (packed repeated int32 with mixed varint sizes)
- [x] Add `read_packed_int32_small_values` benchmark (1-byte varints, best case for batch decode)
- [x] Add `read_packed_int32_large_values` benchmark (4-5 byte varints)
- [x] Add `read_varint32_multibyte` benchmark (varints that are 2-4 bytes, to measure branchless benefit)
- [x] Verify benchmarks compile and run
- [x] Run project test suite - must pass before task 5
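
A hypothetical criterion shape for these entries; the payload generator is an assumption and the scalar reference loop stands in for the real `BytesReader` calls:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Encode `n` as a base-128 varint (helper for building test payloads).
fn encode_varint(mut n: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
}

fn bench_read_packed_int32(c: &mut Criterion) {
    let mut payload = Vec::new();
    for i in 0..1024u64 {
        encode_varint(i * 2654435761 % (1 << 20), &mut payload); // mixed 1-3 byte varints
    }
    c.bench_function("read_packed_int32", |b| {
        b.iter(|| {
            // Scalar reference decode of the whole packed payload.
            let mut pos = 0;
            let mut acc = 0u64;
            while pos < payload.len() {
                let mut v = 0u64;
                let mut shift = 0;
                loop {
                    let byte = payload[pos];
                    pos += 1;
                    v |= ((byte & 0x7f) as u64) << shift;
                    shift += 7;
                    if byte & 0x80 == 0 {
                        break;
                    }
                }
                acc = acc.wrapping_add(v);
            }
            black_box(acc)
        })
    });
}

criterion_group!(benches, bench_read_packed_int32);
criterion_main!(benches);
```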

### Task 5: Verify acceptance criteria

- [x] Run full test suite: `cargo test --manifest-path quick-protobuf/Cargo.toml`
- [x] Run linter: `cargo clippy --manifest-path quick-protobuf/Cargo.toml`
- [x] Verify all aarch64-gated code compiles (cross-check with `--target aarch64-unknown-linux-gnu` if available)
- [x] Verify non-aarch64 builds are unaffected (existing scalar paths unchanged)

### Task 6: Update documentation

- [x] Update CLAUDE.md with ARM64/NEON optimization patterns and conventions
- [x] Move this plan to `docs/plans/completed/`
89 changes: 89 additions & 0 deletions docs/plans/completed/2026-02-27-optimize-reader-fixed-reads.md
@@ -0,0 +1,89 @@
# Optimize reader fixed-width reads and code layout

## Overview

Second round of reader optimizations following the completed varint fast-path work. Focuses on three areas: (1) eliminating the generic `read_fixed` closure pattern in favor of direct `from_le_bytes` calls, (2) improving code layout with `#[cold]` annotations on slow paths, and (3) adding `size_hint` to PackedFixed iterators for better collection performance.

## Context

- Files involved: `quick-protobuf/src/reader.rs`, `quick-protobuf/benches/benches.rs`
- Related patterns: varint fast-path pattern from previous optimization round
- Dependencies: byteorder stays (still used by writer.rs); reader stops importing it
- Previous work: `docs/plans/completed/2026-02-27-optimize-reader-hot-paths.md` (inline fixes, varint fast paths, read_len bounds check, read_packed capacity hint)

## Findings from inspection

### 1. Generic `read_fixed<M, F>` closure pattern (lines 438-446)

The private `read_fixed` method takes a closure parameter `F: Fn(&[u8]) -> M` which dispatches to byteorder's `LE::read_u64`, `LE::read_u32`, etc. While the compiler monomorphizes and inlines this, the indirection chain is: caller -> read_fixed (generic) -> bytes.get().ok_or() -> closure -> LE::read_u64 -> u64::from_le_bytes. Replacing with direct `from_le_bytes` calls in each method eliminates the generic + closure layer, uses the tighter `self.end` bound instead of `bytes.len()` used by `get()`, and removes the byteorder import from reader.rs.
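
The shape of the change, sketched with assumed free-function signatures:

```rust
// Old shape: generic + closure dispatch, bounds-checked only via bytes.get().
fn read_fixed_generic<M>(
    bytes: &[u8],
    pos: &mut usize,
    n: usize,
    read: impl Fn(&[u8]) -> M,
) -> Option<M> {
    let slice = bytes.get(*pos..*pos + n)?;
    *pos += n;
    Some(read(slice))
}

// New shape: direct from_le_bytes with the tighter `end` bound checked first.
fn read_fixed64_direct(bytes: &[u8], pos: &mut usize, end: usize) -> Option<u64> {
    if *pos + 8 > end {
        return None;
    }
    let arr: [u8; 8] = bytes.get(*pos..*pos + 8)?.try_into().ok()?;
    *pos += 8;
    Some(u64::from_le_bytes(arr))
}
```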

### 2. Missing `#[cold]` on slow paths (lines 179, 313)

`read_varint32_slow` and `read_varint64_slow` are fallback paths hit only near buffer boundaries. Without `#[cold]`, the compiler treats them as equally likely as the fast path, potentially placing slow-path code in hot cache lines and degrading instruction cache utilization.

### 3. PackedFixed iterators missing `size_hint` (lines 970, 1014)

Both `PackedFixedIntoIter` and `PackedFixedRefIter` implement `Iterator` but don't override `size_hint()`, defaulting to `(0, None)`. This means `collect::<Vec<_>>()` and similar operations cannot pre-allocate, causing repeated reallocations.

## Development Approach

- **Testing approach**: Regular (code first, then tests)
- Complete each task fully before moving to the next
- **CRITICAL: every task MUST include new/updated tests**
- **CRITICAL: all tests must pass before starting next task**

## Implementation Steps

### Task 1: Replace generic read_fixed with direct from_le_bytes implementations

**Files:**
- Modify: `quick-protobuf/src/reader.rs`

Replace the generic `read_fixed<M, F: Fn(&[u8]) -> M>` method and its callers with direct implementations that use Rust's native `from_le_bytes` methods. Each method does its own bounds check against `self.end` (tighter than `bytes.len()` used by `get()`) and calls `from_le_bytes` directly.

- [x] Remove the private `read_fixed<M, F>` generic method
- [x] Implement `read_fixed64` directly: bounds check `self.start + 8 <= self.end`, then `u64::from_le_bytes(bytes[self.start..self.start+8].try_into().unwrap())`
- [x] Implement `read_fixed32` directly with same pattern (4 bytes, u32)
- [x] Implement `read_sfixed64` directly (8 bytes, i64)
- [x] Implement `read_sfixed32` directly (4 bytes, i32)
- [x] Implement `read_float` directly (4 bytes, f32 via `f32::from_le_bytes`)
- [x] Implement `read_double` directly (8 bytes, f64 via `f64::from_le_bytes`)
- [x] Remove `use byteorder::ByteOrder` and `use byteorder::LittleEndian as LE` imports from reader.rs
- [x] Run lib tests (`cargo test --manifest-path quick-protobuf/Cargo.toml --lib`)

### Task 2: Add #[cold] to slow paths

**Files:**
- Modify: `quick-protobuf/src/reader.rs`

- [x] Add `#[cold]` attribute to `read_varint32_slow`
- [x] Add `#[cold]` attribute to `read_varint64_slow`
- [x] Run lib tests

### Task 3: Add size_hint to PackedFixed iterators

**Files:**
- Modify: `quick-protobuf/src/reader.rs`

- [x] Implement `size_hint()` on `PackedFixedIntoIter`: return `(remaining, Some(remaining))` where `remaining = self.packed_fixed.len() - self.index`
- [x] Implement `size_hint()` on `PackedFixedRefIter`: same pattern
- [x] Add `ExactSizeIterator` impl for both iterators (since the count is always known)
- [x] Add test: verify `size_hint` returns correct values during iteration
- [x] Add test: verify `collect::<Vec<_>>()` produces correct results (exercises size_hint for pre-allocation)
- [x] Run lib tests

### Task 4: Benchmark and validate optimizations

**Files:**
- Modify: `quick-protobuf/benches/benches.rs`

- [x] Run existing benchmarks to establish baseline before changes (if not already done)
- [x] Run benchmarks after all changes to measure impact
- [x] Verify no regressions in any benchmark
- [x] Run full test suite (`cargo test --manifest-path quick-protobuf/Cargo.toml --lib`)
- [x] Run clippy (`cargo clippy --manifest-path quick-protobuf/Cargo.toml`)

### Task 5: Update documentation

- [x] Update CLAUDE.md if internal patterns changed
- [x] Move this plan to `docs/plans/completed/`