
optimization: Batch DB and git operations in post-run pipeline#38

Merged
TrevorBasinger merged 7 commits into main from cg/backend-speedups
Mar 17, 2026

Conversation

@christophergeyer
Member

Summary

Reduces post-run overhead from ~10 ms/file to ~2–3 ms/file by eliminating N+1 patterns in artifact registration and file classification.

  • Batch artifact registration: Single SELECT ... IN + bulk add_all() + one flush() replaces per-file register/link calls
  • Batch edge-existence checks: existing_input_paths()/existing_output_paths() replace per-file has_input_path()/has_output_path() loops
  • Batch hash lookups: get_hashes_batch() for get_inputs()/get_outputs()
  • Cache git ls-files: Single subprocess call replaces per-file git ls-files --error-unmatch
  • ROAR_TIMING instrumentation: Set ROAR_TIMING=1 to get a JSON timing breakdown (tracer, provenance, record) on stderr — useful for benchmarking
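The batch-registration pattern in the first bullet can be sketched as follows, with stdlib sqlite3 standing in for the ORM. The `artifacts` table, columns, and `register_batch` signature here are illustrative assumptions, not the PR's actual code:

```python
import sqlite3

def register_batch(conn, artifacts):
    """Register many (path, hash) pairs using one SELECT ... IN plus one
    bulk INSERT, replacing a per-file register/link round-trip.
    Schema and names are hypothetical."""
    if not artifacts:
        return 0
    hashes = [h for _, h in artifacts]
    placeholders = ",".join("?" * len(hashes))
    # One read round-trip: find which hashes are already registered.
    existing = {
        row[0]
        for row in conn.execute(
            f"SELECT hash FROM artifacts WHERE hash IN ({placeholders})", hashes
        )
    }
    new_rows = [(path, h) for path, h in artifacts if h not in existing]
    # One write round-trip, analogous to a single add_all() + flush().
    conn.executemany("INSERT INTO artifacts (path, hash) VALUES (?, ?)", new_rows)
    conn.commit()
    return len(new_rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artifacts (path TEXT, hash TEXT UNIQUE)")
conn.execute("INSERT INTO artifacts VALUES ('a.txt', 'h1')")
inserted = register_batch(conn, [("a.txt", "h1"), ("b.txt", "h2"), ("c.txt", "h3")])
print(inserted)  # 2 — 'h1' already existed, so only two rows are inserted
```

The key point is that the number of round-trips is constant (one SELECT, one bulk INSERT) regardless of how many files the run touched.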

Overall, this reduces backend time from ~8 ms/file to ~2.5 ms/file.
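The cached `git ls-files` lookup from the bullet list might look roughly like this; the helper names (`parse_tracked`, `tracked_files`, `is_tracked`) are hypothetical:

```python
import subprocess
from functools import lru_cache

def parse_tracked(out: bytes) -> frozenset:
    """Parse NUL-delimited `git ls-files -z` output into a set of paths."""
    return frozenset(p.decode() for p in out.split(b"\0") if p)

@lru_cache(maxsize=None)
def tracked_files(repo_root: str = ".") -> frozenset:
    """One subprocess call per repo root, cached for the rest of the run."""
    out = subprocess.run(
        ["git", "ls-files", "-z"],
        cwd=repo_root, capture_output=True, check=True,
    ).stdout
    return parse_tracked(out)

def is_tracked(path: str, repo_root: str = ".") -> bool:
    """A set lookup replaces a per-file `git ls-files --error-unmatch <path>`
    subprocess spawn, which dominates classification cost at scale."""
    return path in tracked_files(repo_root)
```

Spawning one subprocess instead of N is where most of the classification savings come from: process startup is typically a millisecond or more, dwarfing the set lookup.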

Test plan

  • Unit tests pass (test_job_recording, test_job_recording_dedup, test_file_filter, test_file_classifier_perf, test_byte_range_registration)
  • Regression-based benchmarks (bm_trace_1) confirm per-file costs with tracer/post-run breakdown
  • Verify roar show @N displays correct inputs/outputs (exercises get_hashes_batch path)

🤖 Generated with https://claude.com/claude-code

chrisgeyertreqs and others added 7 commits on March 16, 2026 at 02:29
Reduces per-file post-processing overhead from ~8.5ms to ~2.8ms by:

- Batch git ls-files: single subprocess call instead of one per file
  in classify_all (files.py)
- Batch artifact registration: bulk hash lookup and bulk insert instead
  of per-file ORM queries (artifact.py, job_recording.py)
- Batch job edge creation: add_inputs_batch/add_outputs_batch with
  single flush instead of per-file insert+flush (job.py)
- Batch hash retrieval: get_hashes_batch eliminates N+1 queries in
  get_inputs/get_outputs (artifact.py, job.py)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
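A minimal sketch of the `get_hashes_batch` idea, again with stdlib sqlite3 standing in for the ORM (the schema is illustrative):

```python
import sqlite3

def get_hashes_batch(conn, ids):
    """Map artifact id -> hash with one IN-clause SELECT, replacing an
    N+1 loop of one query per id in get_inputs/get_outputs."""
    if not ids:
        return {}
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, hash FROM artifacts WHERE id IN ({placeholders})",
        list(ids),
    )
    return dict(rows)  # ids absent from the table are simply missing keys

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artifacts (id INTEGER PRIMARY KEY, hash TEXT)")
conn.executemany("INSERT INTO artifacts VALUES (?, ?)",
                 [(1, "h1"), (2, "h2"), (3, "h3")])
hashes = get_hashes_batch(conn, [1, 3, 99])
print(hashes)  # {1: 'h1', 3: 'h3'}
```

Returning a dict lets callers do O(1) lookups per file after a single query, instead of one query per file.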
Replace per-file has_input_path/has_output_path calls with single
IN-clause queries, fix redundant setdefault in get_hashes_batch, and
document register_batch's reduced signature vs register().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
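The IN-clause existence check described in this commit could be sketched as below; the `job_inputs` table name is an assumption:

```python
import sqlite3

def existing_input_paths(conn, job_id, paths):
    """Return the subset of `paths` that already have an input edge for
    `job_id`, via one IN-clause query instead of a per-path
    has_input_path() call."""
    if not paths:
        return set()
    placeholders = ",".join("?" * len(paths))
    rows = conn.execute(
        f"SELECT path FROM job_inputs "
        f"WHERE job_id = ? AND path IN ({placeholders})",
        [job_id, *paths],
    )
    return {r[0] for r in rows}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_inputs (job_id INTEGER, path TEXT)")
conn.executemany("INSERT INTO job_inputs VALUES (?, ?)",
                 [(1, "a.txt"), (1, "b.txt"), (2, "c.txt")])
present = existing_input_paths(conn, 1, ["a.txt", "c.txt", "d.txt"])
print(present)  # {'a.txt'}
```

Callers then filter their batch against the returned set before inserting, so deduplication costs one query total rather than one per file.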
When ROAR_TIMING=1 is set, prints a JSON timing summary to stderr
with tracer, post-run, provenance, and record phase durations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@TrevorBasinger TrevorBasinger merged commit 0205d0f into main Mar 17, 2026
12 checks passed