perf: C++ hot-path optimizations (LUT encoding, buffer reuse, misc) #83

Merged
aviezerl merged 6 commits into master from perf/pwm-lut-encoding on Apr 3, 2026

aviezerl commented Apr 3, 2026

Summary

  • Replace switch-based DNA base encoding with compile-time lookup tables in DnaPSSM and GseqString — eliminates branch mispredictions in per-base hot loops
  • Replace string-based mode comparisons with enum dispatch in gseq.pwm inner loop
  • Reuse DP buffer across calls in PWMEditDistanceScorer::compute_with_indels
  • Reuse bin_vals buffer in GenomeTrackFixedBin instead of reallocating per chromosome
  • GIntervals::range(chromid): use chrom map for O(k) instead of scanning all intervals O(n)
  • GenomeTrackSmooth: replace modulo with conditional increment in per-sample path
  • GenomeTrackArrays: bulk-read array values in single call instead of 2×N separate reads

Test plan

  • All 17,801 tests pass (alutil::tst(parallel=TRUE))
  • Clean compilation with no warnings

aviezerl added 6 commits April 3, 2026 09:44
Replace all switch-statement character encoding in PWM scoring hot loops
with static 256-entry lookup tables (BASE_ENCODE, COMPLEMENT_ENCODE,
NEUTRAL_CHAR). This eliminates indirect branches that are predicted
correctly only ~25% of the time on random DNA sequences.

Key changes:
- Add DnaLookupTables struct with three static lookup tables in DnaPSSM.h
- Replace all encode()/get_log_prob(char) calls with direct table lookups
- Fix double-switch in calc_like_rc (complement switch + encode switch)
- Fix latent bug in integrate_energy RC path: 'case h' was written where
  'case g' was intended

Benchmarks (50K sequences x 400bp, bidirectional):
- gseq.pwm max mode: 15.5s -> 0.76s (20x speedup)
- gseq.pwm lse mode: 16.6s -> 1.88s (8.8x speedup)
- vtrack forward:     35.5s -> 2.48s (14.3x speedup)
- vtrack bidirect:    42.3s -> 2.68s (15.8x speedup)

All 1220 PWM tests pass. All 17800 tests pass.
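
For readers unfamiliar with the technique, here is a minimal sketch of table-driven base encoding. The table names follow the commit message, but the numeric codes (A=0, C=1, G=2, T=3), the `kUnknownBase` sentinel, the `Table256` and `PssmColumn` types, and the two scoring functions are assumptions made for illustration; the actual DnaLookupTables layout in DnaPSSM.h may differ.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: codes and sentinel are assumed, not taken from DnaPSSM.h.
constexpr std::uint8_t kUnknownBase = 4;

struct Table256 { std::uint8_t v[256]; };

// Forward table: any byte other than A/C/G/T (either case) maps to
// kUnknownBase. Built entirely at compile time.
constexpr Table256 make_base_encode() {
    Table256 t{};
    for (int i = 0; i < 256; ++i) t.v[i] = kUnknownBase;
    t.v['A'] = t.v['a'] = 0;
    t.v['C'] = t.v['c'] = 1;
    t.v['G'] = t.v['g'] = 2;
    t.v['T'] = t.v['t'] = 3;
    return t;
}

// Complement table: one lookup yields the code of the complementary
// base, replacing a complement switch followed by an encode switch.
constexpr Table256 make_complement_encode() {
    Table256 t{};
    for (int i = 0; i < 256; ++i) t.v[i] = kUnknownBase;
    t.v['A'] = t.v['a'] = 3;  // A -> T
    t.v['C'] = t.v['c'] = 2;  // C -> G
    t.v['G'] = t.v['g'] = 1;  // G -> C
    t.v['T'] = t.v['t'] = 0;  // T -> A
    return t;
}

constexpr Table256 BASE_ENCODE = make_base_encode();
constexpr Table256 COMPLEMENT_ENCODE = make_complement_encode();
static_assert(BASE_ENCODE.v['g'] == 2, "table is evaluated at compile time");

// One motif position: scores for A, C, G, T, plus an assumed fifth
// slot holding the score given to unknown bases.
using PssmColumn = std::array<float, 5>;

// Forward-strand score. seq must hold at least pssm.size() characters.
inline float score_forward(const std::vector<PssmColumn> &pssm, const char *seq) {
    float s = 0.0f;
    for (std::size_t i = 0; i < pssm.size(); ++i)
        s += pssm[i][BASE_ENCODE.v[static_cast<unsigned char>(seq[i])]];
    return s;
}

// Reverse-complement score: walk the window backwards and use the
// complement table, so the loop body stays branch-free.
inline float score_reverse(const std::vector<PssmColumn> &pssm, const char *seq) {
    const std::size_t n = pssm.size();
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += pssm[i][COMPLEMENT_ENCODE.v[static_cast<unsigned char>(seq[n - 1 - i])]];
    return s;
}
```

The cast to `unsigned char` matters: indexing with a plain `char` that happens to be negative would read outside the table.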

Convert PWM scoring mode dispatch from per-position string comparisons
(mode == "lse", "max", etc.) to a pre-parsed enum (PwmMode). The string
is parsed once at function entry, then integer comparison is used in
the inner loop. Applied to both gap and non-gap code paths.
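
The parse-once pattern described above can be sketched as follows. Only the "lse" and "max" mode strings and the `PwmMode` name come from the commit message; the enumerator names, `parse_pwm_mode`, `log_sum_exp`, and `aggregate_scores` are hypothetical helpers, and the real enum likely has more members.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <stdexcept>
#include <string>
#include <vector>

// Assumed subset of modes; the commit message mentions "lse", "max", etc.
enum class PwmMode { Max, Lse };

// String comparison happens here, once per call, not once per position.
inline PwmMode parse_pwm_mode(const std::string &mode) {
    if (mode == "max") return PwmMode::Max;
    if (mode == "lse") return PwmMode::Lse;
    throw std::invalid_argument("unknown PWM mode: " + mode);
}

// Numerically stable log(exp(a) + exp(b)), treating -inf as "empty".
inline double log_sum_exp(double a, double b) {
    const double ninf = -std::numeric_limits<double>::infinity();
    if (a == ninf) return b;
    if (b == ninf) return a;
    const double hi = std::max(a, b);
    const double lo = std::min(a, b);
    return hi + std::log1p(std::exp(lo - hi));
}

// Combine per-position scores according to the requested mode.
inline double aggregate_scores(const std::vector<double> &scores,
                               const std::string &mode_str) {
    const PwmMode mode = parse_pwm_mode(mode_str);  // parsed at entry
    double acc = -std::numeric_limits<double>::infinity();
    for (double s : scores) {
        switch (mode) {  // integer comparison in the inner loop
            case PwmMode::Max: acc = std::max(acc, s); break;
            case PwmMode::Lse: acc = log_sum_exp(acc, s); break;
        }
    }
    return acc;
}
```

Parsing at entry also means an unrecognized mode is rejected before any scoring work starts, instead of silently falling through every per-position comparison.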

Replace per-window heap allocation of the 3D DP table with a reusable
member buffer (m_dp_buffer). The buffer grows to the maximum needed
size and is reused across subsequent windows, eliminating thousands of
malloc/free cycles per interval.
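
As an illustration of the grow-only buffer idea, the sketch below applies it to a plain 2D Levenshtein table. This is a stand-in, not the PR's 3D table or its PWM scoring: the `EditDistanceScorer` class and its methods are hypothetical, and only the `m_dp_buffer` member name is borrowed from the commit message.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical scorer; the real PWMEditDistanceScorer is more involved.
class EditDistanceScorer {
public:
    int distance(const std::string &a, const std::string &b) {
        const std::size_t rows = a.size() + 1;
        const std::size_t cols = b.size() + 1;
        const std::size_t needed = rows * cols;

        // Grow only when this call needs more room than any earlier call.
        // Stale contents are harmless: every cell is written before it is read.
        if (m_dp_buffer.size() < needed) m_dp_buffer.resize(needed);
        int *dp = m_dp_buffer.data();

        for (std::size_t i = 0; i < rows; ++i) dp[i * cols] = static_cast<int>(i);
        for (std::size_t j = 0; j < cols; ++j) dp[j] = static_cast<int>(j);

        for (std::size_t i = 1; i < rows; ++i) {
            for (std::size_t j = 1; j < cols; ++j) {
                const int sub = dp[(i - 1) * cols + (j - 1)] + (a[i - 1] != b[j - 1] ? 1 : 0);
                const int del = dp[(i - 1) * cols + j] + 1;
                const int ins = dp[i * cols + (j - 1)] + 1;
                dp[i * cols + j] = std::min({sub, del, ins});
            }
        }
        return dp[needed - 1];
    }

    // Exposed so callers (and tests) can observe that no regrowth occurred.
    std::size_t buffer_size() const { return m_dp_buffer.size(); }

private:
    std::vector<int> m_dp_buffer;  // reused across calls
};
```

The trade-off is that the scorer keeps its high-water-mark allocation alive for its whole lifetime, and a single instance is no longer safe to share across threads without synchronization.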

Replace per-call vector<float> allocations for bin reads with a
reusable member scratch buffer (m_scratch_bin_vals). Eliminates
heap allocations in the read_interval hot path for FixedBin tracks.

- GIntervals::range(chromid): use m_chrom2itr for O(k) per-chrom
  iteration instead of scanning all intervals O(n)
- GenomeTrackSmooth: replace modulo with conditional increment in
  per-sample hot path (avoids integer division)
- GenomeTrackArrays: bulk-read ArrayVal array in one call instead of
  2*N separate BufferedFile::read() calls per element
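
To illustrate the first item, here is a sketch of a per-chromosome index over a chromosome-sorted interval list. It assumes that `range(chromid)` reports the smallest start and largest end on that chromosome, which the commit message does not state. `IntervalSet`, `Interval`, and `m_chrom_begin` are hypothetical names standing in for GIntervals and m_chrom2itr.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Interval {
    int chromid;
    std::int64_t start;
    std::int64_t end;
};

// Hypothetical container; intervals must arrive sorted by chromid.
class IntervalSet {
public:
    IntervalSet(std::vector<Interval> sorted, int num_chroms)
        : m_intervals(std::move(sorted)), m_chrom_begin(num_chroms + 1, 0) {
        // Count intervals per chromosome, then prefix-sum so that
        // [m_chrom_begin[c], m_chrom_begin[c + 1]) is chromosome c's slice.
        for (const Interval &iv : m_intervals) ++m_chrom_begin[iv.chromid + 1];
        for (int c = 0; c < num_chroms; ++c) m_chrom_begin[c + 1] += m_chrom_begin[c];
    }

    // Indexed version: touches only the intervals of one chromosome.
    // Returns {0, 0} when the chromosome has no intervals.
    std::pair<std::int64_t, std::int64_t> range(int chromid) const {
        const std::size_t first = m_chrom_begin[chromid];
        const std::size_t last = m_chrom_begin[chromid + 1];
        if (first == last) return {0, 0};
        std::int64_t lo = m_intervals[first].start;
        std::int64_t hi = m_intervals[first].end;
        for (std::size_t i = first + 1; i < last; ++i) {
            lo = std::min(lo, m_intervals[i].start);
            hi = std::max(hi, m_intervals[i].end);
        }
        return {lo, hi};
    }

    // Baseline for comparison: scans every interval on every call.
    std::pair<std::int64_t, std::int64_t> range_scan(int chromid) const {
        bool found = false;
        std::int64_t lo = 0, hi = 0;
        for (const Interval &iv : m_intervals) {
            if (iv.chromid != chromid) continue;
            lo = found ? std::min(lo, iv.start) : iv.start;
            hi = found ? std::max(hi, iv.end) : iv.end;
            found = true;
        }
        return {lo, hi};
    }

private:
    std::vector<Interval> m_intervals;
    std::vector<std::size_t> m_chrom_begin;
};
```

Both methods return the same result; the indexed one pays a one-time pass at construction so that each later query is proportional to the size of one chromosome's slice.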

aviezerl merged commit 149d2f8 into master Apr 3, 2026
4 of 5 checks passed
aviezerl deleted the perf/pwm-lut-encoding branch April 3, 2026 08:27