Skip to content

🐎 accelerate findcirc#8

Merged
dmiller15 merged 5 commits intomasterfrom
dm-speedup-findcirc
Mar 26, 2026
Merged

🐎 accelerate findcirc#8
dmiller15 merged 5 commits intomasterfrom
dm-speedup-findcirc

Conversation

@dmiller15
Copy link
Copy Markdown
Contributor

@dmiller15 dmiller15 commented Feb 5, 2026

Description

Speed improvement for find_circ. Original code can be found here: https://github.com/daaaaande/circs_snake/blob/master/scripts/pipelines/find_circ.py.

It's a 10 year old piece of software that is extremely slow. I decided to change a couple of things:

  1. I examined the profile of a test run an observed that it's doing a lot of expensive numpy calls that can be better handled more cheaply through bytewise comparisons. Additionally, I added an early exit to the function so that it will stop as soon as it exceeds the mismatch limit.
  2. find_breakpoints needs to be called on each pair of reads. The original implementation does this one pair at a time and that's really quite rate limiting for the pipe. The best way to unlock things here was to add multiprocessing. The multiprocessing is done in batches of 500 across as many threads as possible.

For this review, a lot of the code is copied from the original. For subworkflows/local/circs/bin/find_circ_mp.py everything can be taken as-is except for:

  • mismatches_bytes_early
  • find_breakpoints
  • worker_init
  • worker_process
  • worker_process_inner
  • main script lines 480-590

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

Process time drops from 5.5 to 1.5 hours!

Test Configuration:

  • Environment:
  • Test files:

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings
  • I have committed any related changes to the PR

@dmiller15 dmiller15 marked this pull request as ready for review February 9, 2026 18:15
@dmiller15 dmiller15 self-assigned this Feb 9, 2026
@dmiller15 dmiller15 added feature New functionality refactor Something needs to be done better labels Feb 9, 2026
Copy link
Copy Markdown
Member

@migbro migbro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty nice overall. I mostly just comments. I assume you've compared to the original implementation to ensure functional equivalence?

Comment thread subworkflows/local/circs/bin/find_circ_fast_main.py Outdated
Comment thread subworkflows/local/circs/bin/find_circ_mp.py
return diff
return diff

def find_breakpoints(A, B, read, genome, chrom, asize, margin, maxdist):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add docstring?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, I didn't write this. I just swapped out the mismatch function.

spliced_mv = memoryview(spliced)
internal_mv = memoryview(internal_b)

# local copies to avoid global lookups in hot loop
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting...is this for speed, or do global lookups have like "changed during iteration" problems?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is for speed. It's one of the many suggestions CHOPGPT came up with. The improvement was minimal compared to dropping numpy and doing early exit. Left it in anyway.

Base automatically changed from dm-circ-init to master February 12, 2026 16:23
@dmiller15 dmiller15 requested a review from migbro February 19, 2026 16:03
Copy link
Copy Markdown
Member

@migbro migbro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments addressed, thanks!

@dmiller15 dmiller15 merged commit 6c08cf8 into master Mar 26, 2026
@dmiller15 dmiller15 deleted the dm-speedup-findcirc branch March 26, 2026 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature New functionality refactor Something needs to be done better

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants