Batch proving with discriminant reuse (stacked on #325) by hoffmang9 · Pull Request #359 · Chia-Network/chiavdf

hoffmang9 · 2026-05-13T03:38:40Z

Summary

add the Trick 2 batch-proving path from the bbr fork (f3c73bf, 750df66) so jobs sharing (challenge, size_bits, x0) can reuse one squaring trajectory
include the follow-up RSS stabilization commit (445cb0d) for the batch finalization path
stack this branch on top of PR #325 (Add streaming one-wesolowski compaction APIs) so implementation work can start before #325 merges; once #325 merges, this PR will be rebased to present only the PR2 delta
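A rough sketch of the batching idea follows. It uses a toy group (integers mod 2^31 - 1) in place of the class group, and the names `BatchKey`, `Job` and `run_batches` are hypothetical, not the fast_wrapper API. Jobs are grouped by (challenge, size_bits, x0), each group is squared once up to its largest iteration count, and each job's y is snapshotted as the shared trajectory passes that job's T.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Toy modulus standing in for the class group (assumption, not chiavdf's arithmetic).
constexpr uint64_t kToyModulus = 2147483647ULL;  // 2^31 - 1

// Hypothetical batch key: jobs that agree on all three fields share one trajectory.
struct BatchKey {
    std::string challenge;
    int size_bits;
    uint64_t x0;
    bool operator<(const BatchKey& other) const {
        return std::tie(challenge, size_bits, x0) <
               std::tie(other.challenge, other.size_bits, other.x0);
    }
};

struct Job {
    BatchKey key;
    uint64_t iterations;  // T for this job
    uint64_t y = 0;       // output: x0^(2^T) in the toy group
};

// Squares once per key up to the largest T in that group and snapshots y for each
// job as the shared trajectory passes its T. Returns the number of squarings done.
uint64_t run_batches(std::vector<Job>& jobs) {
    std::map<BatchKey, std::vector<Job*>> groups;
    for (Job& job : jobs) groups[job.key].push_back(&job);

    uint64_t squarings = 0;
    for (auto& entry : groups) {
        std::vector<Job*>& members = entry.second;
        std::sort(members.begin(), members.end(),
                  [](const Job* a, const Job* b) { return a->iterations < b->iterations; });
        uint64_t cur = entry.first.x0 % kToyModulus;
        uint64_t done = 0;
        for (Job* job : members) {
            for (; done < job->iterations; ++done, ++squarings) cur = cur * cur % kToyModulus;
            job->y = cur;  // per-job finalization would be handed to a worker here
        }
    }
    return squarings;
}
```

In this toy, size_bits only participates in the grouping key; in chiavdf the challenge and size_bits together determine the discriminant. Each key costs max(T) squarings rather than the sum of its jobs' T values.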

Notes

this is intentionally opened as a draft stacked PR before #325 (Add streaming one-wesolowski compaction APIs) is merged
the current diff includes the #325 commits because of the stacked base

Test plan

wait for #325 to merge, then rebase/update this branch onto main
run full CI matrix after rebase
run fast-wrapper batch API validation in downstream integration

Made with Cursor

Note

High Risk
Adds a large new C/C++ proving path that changes how Wesolowski proofs are generated and introduces new multi-threaded batch finalization and fast-path counter slot allocation, which could affect correctness and stability under concurrency.

Overview
Adds a new fast_wrapper C ABI for compaction-oriented one-Wesolowski proving. The prover can stream bucket updates during squaring (this requires a known y_ref) and can optionally use an incremental GetBlock mapping to avoid an expensive exponentiation per block. A batch API reuses a single squaring trajectory across many jobs sharing (challenge, x, discriminant_size_bits) and offloads per-job finalization to a small worker pool.
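The streaming idea can be illustrated with the standard Wesolowski decomposition. The proof is pi = x^q with q = floor(2^T / B), where B is a prime derived by hashing the input and output, which is presumably why the streaming path needs y_ref before squaring starts. Writing q in base 2^k, each checkpoint x^(2^(k*j)) reached during squaring can be multiplied straight into the bucket for its digit, and pi = prod_d bucket[d]^d at the end. This sketch uses a toy group (integers mod 2^31 - 1), a single block parameter k (no l), and hypothetical helper names (`get_block`, `streaming_proof`); it is not the fast_wrapper implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy group: integers mod a Mersenne prime, standing in for the class group.
constexpr uint64_t kP = 2147483647ULL;  // 2^31 - 1

uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) { return a % m * (b % m) % m; }

uint64_t powmod(uint64_t base, uint64_t exp, uint64_t m) {
    uint64_t result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = mulmod(result, base, m);
        base = mulmod(base, base, m);
        exp >>= 1;
    }
    return result;
}

// Digit j (base 2^k) of q = floor(2^T / B), computed from 2^(T - k(j+1)) mod B.
uint64_t get_block(uint64_t j, unsigned k, uint64_t T, uint64_t B) {
    uint64_t s = T - k * j;            // exponent of 2 remaining at this block
    uint64_t w = s < k ? s : k;        // block width (the top block may be short)
    uint64_t r = powmod(2, s - w, B);  // one modular exponentiation per block
    return (r << w) / B;               // always < 2^w <= 2^k
}

// Returns pi = x^floor(2^T / B) while squaring x exactly T times; writes y = x^(2^T).
// B must be known before squaring starts. Assumes B < 2^31 and 1 <= k <= 16.
uint64_t streaming_proof(uint64_t x, uint64_t T, uint64_t B, unsigned k, uint64_t* y_out) {
    assert(k >= 1 && k <= 16);
    std::vector<uint64_t> buckets(uint64_t{1} << k, 1);  // bucket[d]: product of checkpoints with digit d
    uint64_t cur = x % kP;                                // x^(2^i) after i squarings
    for (uint64_t i = 0; i < T; ++i) {
        if (i % k == 0) {  // checkpoint x^(2^(k*j)) with j = i / k
            uint64_t d = get_block(i / k, k, T, B);
            buckets[d] = mulmod(buckets[d], cur, kP);
        }
        cur = mulmod(cur, cur, kP);
    }
    *y_out = cur;
    uint64_t pi = 1;  // bucket[0] contributes x^0 and is skipped
    for (uint64_t d = 1; d < buckets.size(); ++d)
        pi = mulmod(pi, powmod(buckets[d], d, kP), kP);
    return pi;
}
```

`get_block` here pays one modular exponentiation per checkpoint, which is the per-block cost the incremental GetBlock mapping is described as avoiding.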

Updates the core squaring loop integration to support batch lifecycle hooks (OnBatchStart/OnBatchReplay), introduces quiet_mode to suppress stdout in library use, and assigns per-thread fast-counter slots via vdf_fast_pairindex() to reduce collisions when running multiple VDFs in-process.

Build/CI updates ensure cmake is present on macOS runners and extend Makefile.vdf-client with optional PIC builds and a new libchiavdf_fastc.a target; adds docs describing the new bluebox compaction optimizations.

Reviewed by Cursor Bugbot for commit 8e76d33. Bugbot is set up for automated code reviews on this repo.

Introduce a fast C wrapper with streaming proof generation, incremental GetBlock optimization, and memory-budgeted (k,l) tuning, plus the minimal runtime/build infrastructure needed to embed chiavdf in multi-worker clients. Co-authored-by: Cursor <cursoragent@cursor.com>
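The PR does not spell out how the incremental GetBlock mapping works. One plausible scheme, shown here only as an assumption: digit j of q = floor(2^T / B) depends on r_j = 2^(T - k(j+1)) mod B, and because B is an odd prime, r_(j+1) = r_j * (2^k)^(-1) mod B. Digits can then be produced in checkpoint order with one modular multiplication each instead of one exponentiation each. `IncrementalBlocks` and its helpers are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) { return a % m * (b % m) % m; }

uint64_t powmod(uint64_t base, uint64_t exp, uint64_t m) {
    uint64_t result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = mulmod(result, base, m);
        base = mulmod(base, base, m);
        exp >>= 1;
    }
    return result;
}

// Hypothetical generator for the base-2^k digits of q = floor(2^T / B), lowest first.
// Assumes B is an odd prime (so 2^k is invertible mod B), B < 2^31 and 1 <= k <= 16.
class IncrementalBlocks {
public:
    IncrementalBlocks(uint64_t T, unsigned k, uint64_t B)
        : T_(T), k_(k), B_(B), full_blocks_(T / k),
          inv_step_(powmod(powmod(2, k, B), B - 2, B)),  // (2^k)^-1 mod B by Fermat
          r_(T >= k ? powmod(2, T - k, B) : 0) {}        // r_0 = 2^(T-k) mod B

    // Call exactly ceil(T / k) times; the j-th call returns digit j.
    uint64_t next() {
        const uint64_t j = j_++;
        if (j < full_blocks_) {
            const uint64_t digit = (r_ << k_) / B_;  // floor(2^k * r_j / B)
            r_ = mulmod(r_, inv_step_, B_);          // r_(j+1) = r_j * 2^-k mod B
            return digit;
        }
        // Short top block of width T - k*j < k: the digit is floor(2^width / B).
        return (uint64_t{1} << (T_ - k_ * j)) / B_;
    }

private:
    uint64_t T_;
    unsigned k_;
    uint64_t B_;
    uint64_t full_blocks_;
    uint64_t inv_step_;
    uint64_t r_;
    uint64_t j_ = 0;
};
```

The only exponentiations happen once in the constructor; how the PR actually maps blocks may differ.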

Guard the fast pairindex slot selection behind the existing x86/asm feature checks and return slot 0 on non-x86 targets, where threading counters are not compiled. Co-authored-by: Cursor <cursoragent@cursor.com>

Install cmake via Homebrew and export its bin path in the C libraries and wheel workflows so self-hosted macOS jobs don't fail when cmake is missing from PATH. Co-authored-by: Cursor <cursoragent@cursor.com>

…ocation. Track and roll back per-batch checkpoints when replaying a failed fast batch, and switch pairindex slot allocation to unsigned atomics to avoid negative modulo indexing after counter wraparound. Co-authored-by: Cursor <cursoragent@cursor.com>

Document that batch bounds use completed-iteration base values while OnIteration is normalized to 1-based indices to avoid ambiguity in replay tracking. Co-authored-by: Cursor <cursoragent@cursor.com>

Expose missing batch C bindings and debug visibility so downstream Rust tests can validate tuner behavior end-to-end. Co-authored-by: Cursor <cursoragent@cursor.com>

Default CHIA_VDF_FAST_COUNTER_SLOTS to 100 in threading.h so upstream builds keep lower BSS usage while allowing embedded deployments to override via compiler defines. Co-authored-by: Cursor <cursoragent@cursor.com>

Use one program-wide atomic slot allocator for `vdf_fast_pairindex()` so concurrent VDF computations started from different translation units cannot collide on shared fast counter slots. Co-authored-by: Cursor <cursoragent@cursor.com>

Reject k>=64 before any 64-bit left-shift and reuse validated bucket spans for allocation, indexing, and finalization loops so invalid parameter tuning cannot trigger undefined behavior. Co-authored-by: Cursor <cursoragent@cursor.com>
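A minimal sketch of validating k before shifting, with hypothetical names (`bucket_span`, `make_buckets`) and an assumed `max_buckets` memory cap. Shifting a 64-bit value by 64 or more bits is undefined behavior in C++, so the check has to come first, and the validated span is then the only value used to size the bucket array. Rejecting k == 0 is an additional assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical: number of buckets for k-bit blocks, or nullopt when 1 << k would be
// undefined for a 64-bit operand (or k == 0, which this sketch also rejects).
std::optional<uint64_t> bucket_span(unsigned k) {
    if (k == 0 || k >= 64) return std::nullopt;  // validate before any shift
    return uint64_t{1} << k;
}

// Allocates buckets from the validated span only; later indexing and finalization
// loops would iterate over buckets.size() rather than recomputing 1 << k.
std::optional<std::vector<uint64_t>> make_buckets(unsigned k, uint64_t max_buckets) {
    const std::optional<uint64_t> span = bucket_span(k);
    if (!span || *span > max_buckets) return std::nullopt;
    return std::vector<uint64_t>(*span, 1);
}
```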

Add compile-time guards that reject zero fast-counter slot configurations before modulo indexing, and export Homebrew's cmake path in macOS workflows so cmake is available within the same step on Intel runners. Co-authored-by: Cursor <cursoragent@cursor.com>
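The build-time knob can be sketched as follows. The default of 100 and the override-by-define behavior come from the commit messages above; `kFastCounterSlots` and `slot_for` are hypothetical. The `static_assert` turns a zero slot count into a compile error instead of a modulo by zero at runtime, and keeping the ticket unsigned means a wrapped counter still maps to a valid index (a negative signed value would give a negative remainder in C++).

```cpp
#include <cassert>
#include <cstddef>

// Default slot count; an embedding build can pass -DCHIA_VDF_FAST_COUNTER_SLOTS=N.
#ifndef CHIA_VDF_FAST_COUNTER_SLOTS
#define CHIA_VDF_FAST_COUNTER_SLOTS 100
#endif

// Reject a zero (or negative) slot count at compile time; it is used as a modulus below.
static_assert(CHIA_VDF_FAST_COUNTER_SLOTS > 0,
              "CHIA_VDF_FAST_COUNTER_SLOTS must be positive");

constexpr std::size_t kFastCounterSlots = CHIA_VDF_FAST_COUNTER_SLOTS;

// Hypothetical mapping from an allocation ticket to a slot. The ticket is unsigned, so
// a counter that wraps past its maximum still produces an index in [0, slots).
inline std::size_t slot_for(unsigned ticket) {
    return static_cast<std::size_t>(ticket) % kFastCounterSlots;
}
```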

Drop the root-level development patch file that diverged from the live implementation, and adjust the streaming tuner cost model so bucket-update work scales with checkpoint count and `l` instead of only `k`. Co-authored-by: Cursor <cursoragent@cursor.com>

Replace per-iteration modulo checks with next-checkpoint tracking in the streaming callback, and integrate the scheduling update with batch replay boundaries so rollback/replay semantics remain correct in the current upstreamed implementation. Co-authored-by: Cursor <cursoragent@cursor.com>
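A sketch of next-checkpoint tracking combined with the indexing convention from the earlier commit (batch bases count completed iterations, OnIteration receives 1-based indices). `CheckpointTracker` and `SimulateBatch` are hypothetical scaffolding, and the equality test assumes iterations arrive in order without gaps.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical streaming callback that fires every `interval` iterations. Instead of
// testing iteration % interval on every call, it keeps the next due iteration.
class CheckpointTracker {
public:
    explicit CheckpointTracker(uint64_t interval) : interval_(interval), next_(interval) {}

    // `iteration` is 1-based: the number of squarings completed, including this one.
    void OnIteration(uint64_t iteration) {
        if (iteration != next_) return;  // one comparison on the hot path
        hits_.push_back(iteration);
        next_ += interval_;
    }

    // Test scaffolding: a batch is described by its base (iterations already
    // completed) and its length, so it reports iterations base + 1 .. base + count.
    void SimulateBatch(uint64_t base, uint64_t count) {
        for (uint64_t i = 1; i <= count; ++i) OnIteration(base + i);
    }

    const std::vector<uint64_t>& hits() const { return hits_; }

private:
    uint64_t interval_;
    uint64_t next_;
    std::vector<uint64_t> hits_;
};
```

With an interval of 4, a batch with base 0 and length 6 followed by a batch with base 6 and length 7 fires at iterations 4, 8 and 12, regardless of where the batch boundaries fall.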

Lease fast counter slots with per-slot in-use tracking so long-lived processes can recycle released slots safely, and restore the one-weso proof diagnostic behind quiet_mode to keep client logging behavior consistent. Co-authored-by: Cursor <cursoragent@cursor.com>
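A minimal sketch of leasing slots from one program-wide allocator, assuming a fixed slot count and hypothetical names (`SlotLeaser`, `acquire`, `release`). The cursor is an unsigned atomic so wraparound stays well defined, and the per-slot flags let released slots be reused. The commits do not say what happens when every slot is leased; here `acquire` returns `std::nullopt` and leaves that policy to the caller.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical program-wide allocator for fast-counter slots.
template <std::size_t N>
class SlotLeaser {
    static_assert(N > 0, "need at least one slot");

public:
    // Leases a free slot, scanning from a rotating start so callers spread out.
    // Returns std::nullopt when every slot is already leased.
    std::optional<std::size_t> acquire() {
        // Unsigned counter: wraps modulo 2^bits and never becomes negative.
        const unsigned start = cursor_.fetch_add(1u, std::memory_order_relaxed);
        for (std::size_t probe = 0; probe < N; ++probe) {
            const std::size_t slot = (static_cast<std::size_t>(start) + probe) % N;
            bool expected = false;
            if (in_use_[slot].compare_exchange_strong(expected, true, std::memory_order_acquire)) {
                return slot;
            }
        }
        return std::nullopt;
    }

    // Returns a slot so a later computation can lease it again.
    void release(std::size_t slot) { in_use_[slot].store(false, std::memory_order_release); }

private:
    std::atomic<unsigned> cursor_{0};
    std::array<std::atomic<bool>, N> in_use_{};  // value-initialized to false
};
```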

(cherry picked from commit f3c73bf)

(cherry picked from commit 750df66)

(cherry picked from commit 445cb0d)

hoffmang9 · 2026-05-13T03:59:32Z

@Ealrann Assuming these two PRs are merged, is there anything else you need in order to abandon your chiavdf fork and just use a release of this repo?

Ensure batch proving joins pending finalizer work and frees allocated output arrays on exceptions so stack-referenced state cannot outlive the call frame. Also remove an unused internal batch-free helper that duplicated the public C API. Co-authored-by: Cursor <cursoragent@cursor.com>

Handle replay notifications in streaming Wesolowski callbacks by rejecting replayed batches instead of reusing irreversibly accumulated bucket state. This prevents silent incorrect proofs when the fast squaring path replays a corrupted batch. Co-authored-by: Cursor <cursoragent@cursor.com>
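The replay rule can be sketched as a callback that latches a failure flag. The hook names echo OnBatchStart/OnBatchReplay from the overview and OnIteration from the earlier commits, but their signatures, the `StreamingCallback` class, and the integer standing in for bucket state are assumptions. Once a batch is replayed, state that already absorbed that batch is treated as unrecoverable, so `Finalize` refuses to produce a proof rather than return a silently wrong one.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical streaming-proof callback with batch lifecycle hooks.
class StreamingCallback {
public:
    void OnBatchStart(uint64_t /*base*/, uint64_t /*count*/) { ++batches_started_; }

    // Folds each iteration into state that is never undone (stand-in for bucket products).
    void OnIteration(uint64_t iteration) {
        if (replayed_) return;  // state is already unusable; skip the work
        accumulator_ = accumulator_ * 31 + iteration;
    }

    // The fast squaring path is about to re-run a batch it judged corrupt. Anything that
    // batch already contributed cannot be rolled back here, so latch a failure.
    void OnBatchReplay(uint64_t /*base*/) { replayed_ = true; }

    // No proof after a replay; how the caller recovers is outside this sketch.
    std::optional<uint64_t> Finalize() const {
        if (replayed_) return std::nullopt;
        return accumulator_;
    }

    uint64_t batches_started() const { return batches_started_; }

private:
    uint64_t accumulator_ = 0;
    uint64_t batches_started_ = 0;
    bool replayed_ = false;
};
```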

Keep discriminant and stop-flag state alive for catch-path finalizer joins so callback references never dangle during exception unwinding. This preserves safe cleanup when batch proving fails mid-flight. Co-authored-by: Cursor <cursoragent@cursor.com>
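The cleanup ordering from the last two exception-handling commits can be sketched with an RAII guard. `JoinGuard` and `finalize_all` are hypothetical names, and the worker body is a placeholder for real finalization. The state the workers read (discriminant copy, stop flag, output array) is declared before the guard, and C++ destroys locals in reverse declaration order, so the guard joins every worker before that state goes away, whether the function returns or throws. Without the guard, unwinding past a joinable `std::thread` would call `std::terminate`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

// Joins every joinable worker when it leaves scope, including during stack unwinding.
class JoinGuard {
public:
    explicit JoinGuard(std::vector<std::thread>& workers) : workers_(workers) {}
    ~JoinGuard() {
        for (std::thread& t : workers_) {
            if (t.joinable()) t.join();
        }
    }
    JoinGuard(const JoinGuard&) = delete;
    JoinGuard& operator=(const JoinGuard&) = delete;

private:
    std::vector<std::thread>& workers_;
};

// Hypothetical batch driver. Everything the workers touch is declared before `guard`,
// so it is destroyed only after the guard has joined all workers.
std::unique_ptr<uint64_t[]> finalize_all(const std::string& discriminant_in,
                                         std::size_t jobs, bool fail_midway) {
    const std::string discriminant = discriminant_in;   // owned by this frame
    std::atomic<bool> stop{false};
    auto outputs = std::make_unique<uint64_t[]>(jobs);  // freed automatically on throw
    uint64_t* const out = outputs.get();                // stays valid after outputs is moved out
    std::vector<std::thread> workers;
    JoinGuard guard(workers);

    for (std::size_t i = 0; i < jobs; ++i) {
        workers.emplace_back([&stop, &discriminant, out, i] {
            if (!stop.load()) out[i] = discriminant.size() + i;  // placeholder finalization
        });
        if (fail_midway && i == jobs / 2) {
            stop.store(true);  // ask running workers to skip their work
            throw std::runtime_error("squaring failed mid-flight");
        }
    }
    return outputs;  // `guard` still joins every worker before the caller sees the array
}
```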

Add a reducer-aware streaming proof finalization method used by batch worker threads and remove stale unused bucket replay members. This keeps batch finalization functional while cleaning dead scaffolding flagged by review. Co-authored-by: Cursor <cursoragent@cursor.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 88602ba.

Drop the unused batch progress trampoline scaffolding in fast_wrapper to reduce maintenance noise and keep the batch callback flow aligned with current direct progress handling. Co-authored-by: Cursor <cursoragent@cursor.com>

hoffmang9 and others added 16 commits February 23, 2026 23:34

Fix non-x86 build break in vdf_fast_pairindex.

7be0752


Ensure cmake is present on macOS CI runners.

3755be2


Clarify batch iteration indexing in streaming callback.

fd000ab


Add streaming tuner diagnostics and batch fast-wrapper APIs.

3f82dc2


Make fast-thread counter slots build-configurable.

95f8ff1


trick 2

1e19548

(cherry picked from commit f3c73bf)

Event queue

81b7b0d

(cherry picked from commit 750df66)

Fix unbounded RSS growth

ea0a0ab

(cherry picked from commit 445cb0d)

hoffmang9 marked this pull request as ready for review May 13, 2026 03:59