
Split dedup into dedup_seq and dedup_pos#16

Merged
nrminor merged 1 commit into schema-v2.5.0 from dedup-decomposition
Feb 17, 2026

@nrminor
Member

@nrminor nrminor commented Feb 13, 2026

Summary

The pipeline currently has a single dedup parameter that controls two distinct deduplication strategies that operate at different stages:

  1. Sequence-based (clumpify): removes exact/near-exact duplicate reads based on sequence content, runs during read preprocessing
  2. Positional (samtools markdup): removes PCR/optical duplicates based on mapping position, runs after alignment in the minimap2 module

This PR splits them into dedup_seq and dedup_pos so users can enable one without the other. The original dedup parameter is preserved as an umbrella that enables both, maintaining backward compatibility.

Resolution chain

Each flag follows the same precedence pattern used by the other preprocessing flags:

specific param ?: umbrella ?: master switch

Concretely:

  • dedup_seq ?: dedup ?: preprocess — gates clumpify in PREPROCESS_READS
  • dedup_pos ?: dedup ?: preprocess — gates samtools markdup in MAP_READS_TO_CONTIGS

So --dedup still turns both on, --preprocess still turns everything on, and the new flags provide fine-grained control.
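
As a rough illustration of the chain above, here is a minimal Python sketch. It is not the pipeline's Nextflow code: `resolve` and `resolve_dedup_flags` are hypothetical names, and Python's `or` stands in for Groovy's `?:` because both fall back whenever the left-hand value is falsy.

```python
from typing import Optional


def resolve(
    specific: Optional[bool],
    umbrella: Optional[bool],
    master: Optional[bool],
) -> bool:
    """Mimic `specific ?: umbrella ?: master` with truthiness-based fallback."""
    # `or` returns the first truthy operand, or the last operand if none is truthy.
    return bool(specific or umbrella or master)


def resolve_dedup_flags(params: dict) -> dict:
    """Resolve both dedup gates from a params mapping (hypothetical helper)."""
    dedup = params.get("dedup")
    preprocess = params.get("preprocess")
    return {
        # Gates clumpify in this sketch.
        "dedup_seq": resolve(params.get("dedup_seq"), dedup, preprocess),
        # Gates samtools markdup in this sketch.
        "dedup_pos": resolve(params.get("dedup_pos"), dedup, preprocess),
    }


if __name__ == "__main__":
    print(resolve_dedup_flags({"dedup": True}))
    print(resolve_dedup_flags({"dedup_seq": True}))
```

One property of truthiness-based fallback worth keeping in mind: an explicit false on the specific flag falls through to the next level, so `resolve(False, True, None)` is true in this sketch. Whether the actual Nextflow code treats an explicit false differently is not shown in this PR.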

Changes

Nextflow layer:

  • nextflow.config — added dedup_seq and dedup_pos params (both default null)
  • workflows/preprocess_reads.nf — resolution chain now checks dedup_seq first
  • modules/minimap2.nf — resolution chain now checks dedup_pos first

Python layer (CLI, model, schema):

  • lib/py_nvd/models.py — added dedup_seq and dedup_pos fields to NvdParams
  • schemas/nvd-params.v2.5.0.schema.json — added schema entries
  • lib/py_nvd/cli/commands/run.py — added --dedup-seq/--no-dedup-seq and --dedup-pos/--no-dedup-pos
  • lib/py_nvd/cli/commands/preset.py — same
  • lib/py_nvd/params.py — added to template generation list
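
For readers unfamiliar with paired on/off flags that default to unset, the following is a minimal sketch using only the standard library. It assumes nothing about how `run.py` or `NvdParams` are actually implemented: `DedupParamsSketch`, `build_parser`, `parse_dedup`, and the `nvd-sketch` program name are all hypothetical, and `argparse.BooleanOptionalAction` (Python 3.9+) is used here only because it generates the same `--flag/--no-flag` shape.

```python
import argparse
from dataclasses import dataclass
from typing import Optional


@dataclass
class DedupParamsSketch:
    """Hypothetical stand-in for the dedup-related fields of a params model."""

    dedup: Optional[bool] = None      # umbrella flag
    dedup_seq: Optional[bool] = None  # sequence-based dedup
    dedup_pos: Optional[bool] = None  # positional dedup


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="nvd-sketch")
    for flag in ("--dedup", "--dedup-seq", "--dedup-pos"):
        # BooleanOptionalAction registers both --flag and --no-flag;
        # default=None keeps "not given" distinct from an explicit False.
        parser.add_argument(flag, action=argparse.BooleanOptionalAction, default=None)
    return parser


def parse_dedup(argv: list[str]) -> DedupParamsSketch:
    ns = build_parser().parse_args(argv)
    return DedupParamsSketch(dedup=ns.dedup, dedup_seq=ns.dedup_seq, dedup_pos=ns.dedup_pos)


if __name__ == "__main__":
    print(parse_dedup(["--dedup-seq"]))
    print(parse_dedup(["--no-dedup-pos"]))
```

Leaving unset flags as `None` rather than `False` is what lets a downstream fallback chain tell "not specified" apart from "explicitly disabled".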

Generated:

  • lib/py_nvd/_fingerprint.json — regenerated (nextflow.config changed)

@wkgardner
Collaborator

The precedence order makes sense and won't mess with the backwards compatibility of others. I am glad that we now have a way to more finely tune the dedup level we are using.

It looks like using both dedup_seq and dedup_pos at the same time works the same as the original dedup, and both ternary statements follow the correct resolution logic. ?: Groovy baby!

@nrminor
Member Author

nrminor commented Feb 13, 2026

Yes! Gotta love Groovy's Elvis operator (?:)!

@nrminor nrminor merged commit bfdfb82 into schema-v2.5.0 Feb 17, 2026