feat: parameterize Markov order k in gsynth (k=1..8) #82

Merged
aviezerl merged 7 commits into master from feat/gsynth-variable-k on Apr 1, 2026

@aviezerl
Collaborator

Summary

  • Add k parameter to gsynth.train() (default 5L, range 1-8) to configure the Markov order for synthetic genome generation
  • Refactor C++ StratifiedMarkovModel from fixed NUM_5MERS=1024 arrays to dynamic vectors sized by 4^k
  • Backward compatible: old models (k=5) load without changes; .gsm format uses version 2 for k!=5
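To illustrate the move from fixed 1024-entry arrays to storage sized by 4^k, here is a minimal sketch of a context-count table. The names `KmerCounts`, `num_kmers`, and `base_code` are hypothetical and are not taken from `StratifiedMarkovModel`; the 2-bit base encoding and the `context * 4 + next_base` layout are assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Number of distinct k-mers over {A,C,G,T}: 4^k == 2^(2k).
inline std::size_t num_kmers(int k) {
    return std::size_t{1} << (2 * k);
}

// Map a base to a 2-bit code; returns -1 for anything else (e.g. N).
inline int base_code(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

// Hypothetical count table (assumed layout): counts[context * 4 + next_base].
struct KmerCounts {
    int k;
    std::vector<std::uint64_t> counts;

    // Storage is sized at runtime: 4^k contexts times 4 possible next bases.
    explicit KmerCounts(int k_) : k(k_), counts(num_kmers(k_) * 4, 0) {}

    // Count every (k-base context, next base) pair in seq.
    // A non-ACGT character resets the context.
    void add_sequence(const std::string& seq) {
        const std::size_t mask = num_kmers(k) - 1;
        std::size_t ctx = 0;
        int valid = 0;  // consecutive valid bases currently held in ctx (capped at k)
        for (char c : seq) {
            const int code = base_code(c);
            if (code < 0) { valid = 0; ctx = 0; continue; }
            if (valid >= k) counts[ctx * 4 + static_cast<std::size_t>(code)] += 1;
            ctx = ((ctx << 2) | static_cast<std::size_t>(code)) & mask;
            if (valid < k) ++valid;
        }
    }
};
```

With k=5 the table has 1024 x 4 entries, matching the old fixed layout; with k=8 it grows to 65536 x 4 entries per table.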

Changes

C++ (4 files): StratifiedMarkovModel.h/cpp accepts runtime k with dynamic storage. GenomeSynthTrain.cpp and GenomeSynthSample.cpp accept k from R. Binary format v2 stores k in header; v1 loads as k=5.

R (1 file): gsynth.train(k=5L) with strict validation (rejects non-integer, out-of-range). Propagated through gsynth.save/load/sample. Uses double arithmetic to avoid integer overflow for large models.
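The overflow that motivates the double arithmetic can be checked directly. The product total_bins * 4^k * 4 reaches 2^31 at k=8 with 8192 bins, one past the largest signed 32-bit value. The sketch below reproduces that arithmetic in C++; the function names are hypothetical and only mirror the quantity the R code calls expected_n.

```cpp
#include <cstdint>
#include <limits>

// Expected number of table entries: total_bins * 4^k * 4, computed in 64 bits.
inline std::int64_t expected_entries(std::int64_t total_bins, int k) {
    return total_bins * (std::int64_t{1} << (2 * k)) * 4;
}

// True when the entry count would not fit in a signed 32-bit integer,
// which is the range of R's integer type.
inline bool overflows_int32(std::int64_t total_bins, int k) {
    return expected_entries(total_bins, k) >
           std::numeric_limits<std::int32_t>::max();
}

// The same quantity as a double; the result is exact as long as it stays below 2^53.
inline double expected_entries_double(std::int64_t total_bins, int k) {
    return static_cast<double>(total_bins) *
           static_cast<double>(std::int64_t{1} << (2 * k)) * 4.0;
}
```

For k=8, 8191 bins still fit in 32 bits (2,147,221,504 entries) while 8192 bins do not (2,147,483,648 entries).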

Tests (1 new file): test-gsynth-k.R — 7 test blocks (54 expectations) covering k=1/3/5/7, validation, save/load round-trip, backward compat, stratified models, .gsm version assertion.

Test plan

  • All 12,034 existing gsynth tests pass (backward compat)
  • 54 new k-parameter tests pass
  • Clean compilation with zero warnings
  • Explicit k=5 produces results identical to the default
  • Save/load round-trip preserves k and produces identical samples
  • Non-integer k values (e.g. 3.5) rejected
  • .gsm version bumped to 2 for k!=5 models
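The validation behaviour listed above (reject non-integer values such as 3.5 rather than truncating them, then enforce the 1-8 range) can be sketched as follows. The actual check lives in the R function; `validate_k` is a hypothetical C++ analogue written only to show the order of the checks.

```cpp
#include <cmath>
#include <stdexcept>

// Validate a numeric k: reject non-integer values BEFORE any truncation,
// then enforce the supported range 1..8.
inline int validate_k(double k) {
    if (!std::isfinite(k) || std::floor(k) != k)
        throw std::invalid_argument("k must be an integer value");
    if (k < 1.0 || k > 8.0)
        throw std::invalid_argument("k must be between 1 and 8");
    return static_cast<int>(k);  // safe: k is a whole number in 1..8
}
```

Checking integrality first matters: casting 3.5 to an integer would yield 3, which passes the range check and silently trains a different model than the caller asked for.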

aviezerl and others added 7 commits March 30, 2026 23:42
…t 5)

The gsynth system previously hardcoded a Markov order of 5 throughout.
This adds a `k` parameter to `gsynth.train()` supporting orders 1-8,
enabling users to trade off model complexity vs. context sensitivity.

C++: StratifiedMarkovModel uses runtime k with dynamic vector storage
instead of fixed NUM_5MERS=1024 arrays. Binary format v2 stores k in
header; v1 files load as k=5 for backward compatibility.

R: gsynth.train(k=5L) with validation, stored in model object. Propagated
through save/load/sample. Old models without k default to 5.

Tests: 7 new test blocks (52 expectations) covering k=1/3/5/7, validation,
save/load round-trip, backward compat, and stratified models.

total_bins * num_kmers * 4L overflows R's 32-bit integer for k=8 models
with >= 8192 bins. Use double arithmetic for expected_n instead.

k=3.5 was silently converted to 3 via as.integer(). Now validate that
k == as.integer(k) before coercion, erroring on non-integer numerics.

v1 implied fixed 1024x4 context layout. Models with k != 5 now emit
version: 2 so external .gsm readers can distinguish variable-order files.
k=5 models continue to emit version: 1 for backward compatibility.
@aviezerl aviezerl merged commit b858da4 into master Apr 1, 2026
5 of 6 checks passed