
perf: parallelize attestation signature verification#274

Merged
pablodeymo merged 3 commits into main from parallel-sig-verification
Apr 10, 2026

Conversation

@MegaRedHand
Collaborator

Summary

  • Parallelize attestation signature verification using rayon's par_iter
  • Signature verification was sequential (~40-50ms per attestation), causing blocks with many attestations to take 1.4-1.9s to process
  • Split into two phases: prepare inputs sequentially (cheap), then verify in parallel
  • Observed in devnet: with 36 attestations, block processing took ~1.4-1.9s; expect ~4x speedup on 4 cores

Test plan

  • make fmt passes
  • make lint passes
  • make test passes (all 101 tests, including signature spec tests)
  • Deploy to devnet and verify block processing times drop

@github-actions

🤖 Kimi Code Review

Overall: Good optimization moving signature verification to parallel execution. The approach of separating cheap input preparation from expensive verification is correct. However, there are performance and observability regressions to address.

Critical Issues

crates/blockchain/src/store.rs

1. Performance regression: Unnecessary Vec clone in hot loop (Line ~1201)

verify_aggregated_signature(proof_data, public_keys.clone(), message, *slot)

Each attestation clones the entire public_keys Vec (48 bytes × N validators) inside the parallel worker. With many attestations, this significantly increases memory traffic.

Fix: Change verify_aggregated_signature signature to accept &[PublicKey] instead of Vec<PublicKey>, or wrap the Vec in Arc<[PublicKey]> during input preparation to share ownership without cloning.
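Both options can be sketched with plain std types (a `PublicKey` stand-in here, not the real `ValidatorPublicKey`; the actual `verify_aggregated_signature` takes more parameters):

```rust
use std::sync::Arc;

#[derive(Clone)]
struct PublicKey([u8; 48]);

// Option 1: borrow a slice. Callers pass `&keys`; no clone, no ownership
// transfer, and the parallel workers only need a shared reference.
fn verify_with_slice(keys: &[PublicKey]) -> bool {
    !keys.is_empty()
}

// Option 2: share ownership via Arc<[PublicKey]>. Cloning the Arc copies a
// pointer plus a refcount bump, not the underlying key material.
fn verify_with_arc(keys: Arc<[PublicKey]>) -> bool {
    !keys.is_empty()
}

fn demo() -> (bool, bool) {
    let keys = vec![PublicKey([0u8; 48])];
    let shared: Arc<[PublicKey]> = keys.clone().into();
    (verify_with_slice(&keys), verify_with_arc(Arc::clone(&shared)))
}
```

Option 1 is the smaller API change if the verifier never needs ownership; Option 2 fits when the keys must outlive the caller's scope.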

2. Loss of timing metrics (Lines ~1196-1207)

The original code wrapped verification in:

let _timing = metrics::time_pq_sig_aggregated_signatures_verification();

This metric is entirely removed. For consensus monitoring, you need visibility into aggregate signature verification latency.

Fix: Either:

  • Add the timing metric back around the entire parallel block (measures total time, not per-signature)
  • Use rayon::spawn with manual timing if per-signature timing is needed
  • Document the intentional removal in PR description

Code Quality

3. Inconsistent error handling pattern

The sequential preparation phase uses collect::<Result<_, _>>() (good), but the parallel phase uses try_for_each with manual match on the result. The match is unnecessary since verify_aggregated_signature already returns Result<(), E>.

Suggestion (Line ~1198-1208):

.par_iter()
.try_for_each(|(proof_data, public_keys, message, slot)| {
    verify_aggregated_signature(proof_data, public_keys.clone(), message, *slot)
        .map(|_| metrics::inc_pq_sig_aggregated_signatures_valid())
        .map_err(|e| {
            metrics::inc_pq_sig_aggregated_signatures_invalid();
            StoreError::AggregateVerificationFailed(e)
        })
})

4. Module import placement (Line ~1197)

use rayon::prelude::*;

Placing this inside the function body is unconventional. Move to top-level imports for consistency.

Security & Correctness

5. Verification completeness

Good: Using try_for_each ensures all attestations are verified (not just first N), and short-circuits on first failure, maintaining consensus safety.

6. Thread safety

Good: The verification_inputs Vec is immutable during parallel iteration, and verify_aggregated_signature appears pure (no mutable shared state). Safe for Rayon.

Documentation

7. RELEASE.md & README.md updates

Good: Clear explanation of the new unstable vs latest tagging strategy. The table in README accurately reflects the release process.

Nitpicks

Line ~1184: The comment // cheap: bit checks + pubkey lookups is helpful, but note that validator_indices + pubkey lookups are actually O(N) in the number of validators per attestation. For large committees, this could be non-trivial, but still likely cheaper than crypto verification.

Line ~1203: *slot dereference suggests slot is &u32. Verify that verify_aggregated_signature doesn't actually need u32 by value (if it just reads it, &u32 would suffice).

Summary

  1. Fix the public_keys.clone() overhead - this is the blocker
  2. Restore or document the timing metric removal
  3. Move use rayon::prelude::*; to imports
  4. Consider simplifying the error handling in the parallel block

The parallelization strategy is sound and maintains consensus correctness.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

🤖 Claude Code Review

Here's my review of PR #274:


PR Review: perf: parallelize attestation signature verification

Summary

The two-phase approach (sequential input preparation → parallel verification) is correct and well-motivated. The Once-based verifier initialization is thread-safe, and prometheus counters use atomics so concurrent inc_* calls are safe. The logic is sound.


Issues

par_iter() forces an unnecessary Vec<ValidatorPublicKey> clone per attestation

store.rs:1217

// Current: par_iter() borrows, forcing .clone() on each parallel call
verification_inputs
    .par_iter()
    .try_for_each(|(proof_data, public_keys, message, slot)| {
        let result =
            verify_aggregated_signature(proof_data, public_keys.clone(), message, *slot);

verify_aggregated_signature takes public_keys: Vec<ValidatorPublicKey> by value, and verification_inputs is not used after this point. Switching to into_par_iter() moves each element instead of borrowing, eliminating the clone:

verification_inputs
    .into_par_iter()
    .try_for_each(|(proof_data, public_keys, message, slot)| {
        let result =
            verify_aggregated_signature(proof_data, public_keys, &message, slot);

Each XMSS pubkey is 52 bytes; with a full attestation committee this is a non-trivial allocation per attestation that is entirely avoidable.


Timing metric dropped without replacement

store.rs (removed context)

The RAII guard let _timing = metrics::time_pq_sig_aggregated_signatures_verification() was removed and no parallel equivalent added. The overall aggregated_elapsed log field covers wall-clock time, but the lean_pq_sig_aggregated_signatures_verification histogram loses all data. Wrapping each closure body with a fresh timing guard still works correctly in parallel (each guard measures one verification call on its thread):

.try_for_each(|(proof_data, public_keys, message, slot)| {
    let _timing = metrics::time_pq_sig_aggregated_signatures_verification();
    let result = verify_aggregated_signature(...);

Given that the stated motivation is observing a 4× speedup, keeping this metric would let you verify it in production without relying solely on log output.


use rayon::prelude::* inside function body

store.rs:1212

This import belongs at the top of the file with the other use declarations, not embedded mid-function. It's not a correctness issue but is inconsistent with the rest of the codebase.


Non-deterministic error attribution on failure

Minor: the sequential version always returned the error from the first failing attestation (by index). try_for_each in rayon returns the error from whichever thread completes first. For consensus correctness this doesn't matter, but it may affect debugging (the error log won't consistently point to the same offending attestation index). Acceptable tradeoff, worth a comment.


Docs changes

README.md and RELEASE.md updates are clear and accurate. The tag convention table is a useful addition.


Verdict

Approve with the into_par_iter() fix applied — that's the one concrete regression introduced by the refactor. The timing metric drop is secondary but worth restoring for observability.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@greptile-apps
Contributor

greptile-apps bot commented Apr 10, 2026

Greptile Summary

This PR parallelizes attestation signature verification in verify_signatures using rayon's par_iter, splitting the work into a cheap sequential preparation phase (bit checks, pubkey lookups) and a parallel verification phase. The implementation is correct and the expected ~4× speedup on 4 cores should address the observed 1.4–1.9 s block processing times with 36 attestations. Documentation changes update container tagging conventions.

Confidence Score: 5/5

Safe to merge — parallelization is correct, error propagation is sound, and all findings are P2 style/observability suggestions.

The two-phase approach (sequential prep, parallel verify) is correctly implemented. Prometheus counters are atomic so calling them from rayon threads is safe. try_for_each correctly short-circuits on the first error. The only findings are a dropped timing histogram (observability regression, not a correctness issue) and an in-function import placement. Both are P2.

crates/blockchain/src/store.rs — dropped per-attestation timing metric worth restoring.

Important Files Changed

Filename: Overview
crates/blockchain/src/store.rs: Parallel attestation signature verification added via rayon; the per-attestation timing metric (time_pq_sig_aggregated_signatures_verification) is silently dropped from this path.
crates/blockchain/Cargo.toml: Added rayon workspace dependency for parallel iteration.
Cargo.toml: Registered rayon = "1.10" in workspace dependencies.
README.md: Added Container Releases section and a new roadmap item for formal STF verification.
RELEASE.md: Updated tag convention: unstable replaces latest for main-branch builds; latest now tracks the current devnet.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[verify_signatures called] --> B{attestation count\n== signature count?}
    B -- No --> ERR1[Err: AttestationSignatureMismatch]
    B -- Yes --> C[Sequential preparation phase\nbits check + pubkey lookup]
    C --> D{Any input error?}
    D -- Yes --> ERR2[Err: ParticipantsMismatch /\nInvalidValidatorIndex /\nPubkeyDecodingFailed]
    D -- No --> E[verification_inputs Vec collected]
    E --> F[rayon par_iter]
    F --> G1[Thread 1: verify_aggregated_signature]
    F --> G2[Thread 2: verify_aggregated_signature]
    F --> G3[Thread N: verify_aggregated_signature]
    G1 & G2 & G3 --> H{All OK?}
    H -- Any Err --> ERR3[Err: AggregateVerificationFailed]
    H -- All Ok --> I[inc valid metrics]
    I --> J[verify proposer signature sequentially]
    J --> K[Log timing info]
    K --> L[Ok]
```

Signature verification dominated block processing time (~40-50ms per
attestation, sequential). With 36 attestations this meant ~1.4-1.9s per
block, causing the target to fall hundreds of slots behind the head.

Split into two phases:
1. Prepare inputs sequentially (cheap: bit checks, pubkey lookups)
2. Verify signatures in parallel via rayon's par_iter

On a 4-core machine with 36 attestations, expect ~4x speedup
(1.4s -> ~350ms).
@MegaRedHand force-pushed the parallel-sig-verification branch from 1a71cbf to e45113b on April 10, 2026 18:34
Comment on lines +1199 to +1214
verification_inputs
    .par_iter()
    .try_for_each(|(proof_data, public_keys, message, slot)| {
        let result =
            verify_aggregated_signature(proof_data, public_keys.clone(), message, *slot);
        match result {
            Ok(()) => {
                metrics::inc_pq_sig_aggregated_signatures_valid();
                Ok(())
            }
            Err(e) => {
                metrics::inc_pq_sig_aggregated_signatures_invalid();
                Err(StoreError::AggregateVerificationFailed(e))
            }
        }
    })?;

P2 Per-attestation timing metric dropped

The _timing = metrics::time_pq_sig_aggregated_signatures_verification() guard was removed and nothing replaces it inside the parallel closure. The histogram (LEAN_PQ_SIG_AGGREGATED_SIGNATURES_VERIFICATION_TIME_SECONDS) is still registered and still used in the gossip path (line 458), so it will now only reflect gossip-path verifications, silently under-counting block-path verifications. Since TimingGuard is a plain RAII struct with no shared state, it's safe to create per-thread inside par_iter. Consider adding it back:

Suggested change

verification_inputs
    .par_iter()
    .try_for_each(|(proof_data, public_keys, message, slot)| {
        let _timing = metrics::time_pq_sig_aggregated_signatures_verification();
        let result =
            verify_aggregated_signature(proof_data, public_keys.clone(), message, *slot);
        match result {
            Ok(()) => {
                metrics::inc_pq_sig_aggregated_signatures_valid();
                Ok(())
            }
            Err(e) => {
                metrics::inc_pq_sig_aggregated_signatures_invalid();
                Err(StoreError::AggregateVerificationFailed(e))
            }
        }
    })?;

.collect::<Result<_, StoreError>>()?;

// Run expensive signature verification in parallel
use rayon::prelude::*;

P2 Import inside function body

use rayon::prelude::* is placed mid-function. The codebase convention is to keep imports at the top of the file or at least at the top of the function; placing it inline between a comment and the expression it enables makes it easy to miss. Consider moving it to the top of the file alongside the other use statements.

Suggested change

    use rayon::prelude::*;
    // Run expensive signature verification in parallel
    verification_inputs

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@github-actions

🤖 Codex Code Review

  1. High: the new verification_inputs staging turns aggregate verification from streaming to fully buffered, and then clones every pubkey set again inside the Rayon closure. In the old code, each attestation’s Vec<ValidatorPublicKey> was built, verified, and dropped immediately; now one vector is retained for every attestation, and public_keys.clone() duplicates it once more before calling verify_aggregated_signature. Because both the attestation list and signature list are capped at 4096, and each bitlist can itself cover up to 4096 validators, this changes memory from O(participants in one attestation) to O(total participants across the whole block), which is a new block-import DoS surface. See store.rs, store.rs, block.rs, block.rs, attestation.rs. I’d keep this streaming or use into_par_iter() over owned work items in bounded batches so the pubkey vectors are moved, not cloned and retained en masse.

  2. Medium: this makes peer-triggered block validation consume the global Rayon pool from the hot import path, so one large block can now fan verifier work across all cores and contend with fork choice, networking, and proposer duties. Previously the damage from an expensive block was naturally limited to one validation thread; after this change it becomes whole-process CPU amplification. That is a meaningful operational/security regression for a consensus client unless parallelism is explicitly bounded or gated behind a threshold. See store.rs, store.rs.

  3. Low: block import no longer records time_pq_sig_aggregated_signatures_verification() per aggregate proof, while the gossip attestation path still does. That silently changes the meaning of the histogram and will make verification latency dashboards/alerts inconsistent after merge. See store.rs, store.rs, metrics.rs.

No fork-choice, justification/finalization, STF, or SSZ logic changed in this PR; the risk is isolated to XMSS aggregate verification behavior.

I couldn’t run cargo test here because the sandbox blocks rustup from creating temp files under its default home.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

Switch from par_iter() + public_keys.clone() to into_par_iter() which
moves ownership of each tuple into the parallel closure, eliminating
the per-attestation Vec<ValidatorPublicKey> clone.
@MegaRedHand added the performance label ("Performance improvements or possible performance improvements") on Apr 10, 2026
@pablodeymo merged commit e98859d into main on Apr 10, 2026
3 checks passed
@pablodeymo deleted the parallel-sig-verification branch on April 10, 2026 21:50