feat: parameterize Markov order k in gsynth (k=1..8) by aviezerl · Pull Request #82 · tanaylab/misha

aviezerl · 2026-03-30T21:05:42Z

Summary

Add a k parameter to gsynth.train() (default 5L, range 1-8) to configure the Markov order for synthetic genome generation
Refactor C++ StratifiedMarkovModel from fixed NUM_5MERS=1024 arrays to dynamic vectors sized by 4^k
Backward compatible: old models (k=5) load without changes; the .gsm format uses version 2 only when k != 5

Changes

C++ (4 files): StratifiedMarkovModel.h/cpp accept a runtime k with dynamic storage. GenomeSynthTrain.cpp and GenomeSynthSample.cpp accept k from R. Binary format v2 stores k in the header; v1 loads as k=5.

R (1 file): gsynth.train(k=5L) with strict validation (rejects non-integer and out-of-range values). k is propagated through gsynth.save/load/sample. The expected element count is computed in double arithmetic to avoid integer overflow for large models.

Tests (1 new file): test-gsynth-k.R — 7 test blocks (54 expectations) covering k=1/3/5/7, validation, save/load round-trip, backward compat, stratified models, .gsm version assertion.
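The move from fixed 1024-entry arrays to storage sized by 4^k can be sketched as follows. This is a minimal illustration, not the PR's actual StratifiedMarkovModel: the class name `KmerCountTable`, the helper `base_code`, and the row-major `num_kmers x 4` layout are assumptions made for the example, and the 1..8 range follows the PR description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Map a base to 0..3; anything else (e.g. 'N') returns -1.
inline int base_code(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

// Hypothetical stand-in for one bin of the model; not the PR's actual class.
class KmerCountTable {
public:
    explicit KmerCountTable(int k)
        : k_(checked_order(k)),
          num_kmers_(std::size_t{1} << (2 * k_)),  // 4^k == 2^(2k)
          counts_(num_kmers_ * 4, 0) {}

    int order() const { return k_; }
    std::size_t num_kmers() const { return num_kmers_; }
    std::size_t size() const { return counts_.size(); }

    // Count of base `next` (0..3) observed after the context with index `ctx`.
    std::uint64_t count(std::size_t ctx, int next) const {
        return counts_.at(ctx * 4 + static_cast<std::size_t>(next));
    }

    // Slide over seq, counting each base against the k valid bases before it.
    void observe(const std::string& seq) {
        const std::size_t mask = num_kmers_ - 1;
        std::size_t ctx = 0;  // rolling base-4 encoding of the last k bases
        int valid = 0;        // consecutive valid bases seen, capped at k
        for (char c : seq) {
            const int b = base_code(c);
            if (b < 0) { ctx = 0; valid = 0; continue; }  // restart after N
            if (valid == k_) {
                assert(ctx < num_kmers_);
                counts_[ctx * 4 + static_cast<std::size_t>(b)] += 1;
            }
            ctx = ((ctx << 2) | static_cast<std::size_t>(b)) & mask;
            if (valid < k_) ++valid;
        }
    }

private:
    static int checked_order(int k) {
        if (k < 1 || k > 8) throw std::invalid_argument("k must be in 1..8");
        return k;
    }

    int k_;
    std::size_t num_kmers_;
    std::vector<std::uint64_t> counts_;
};
```

With k=5 the table has 1024 contexts, matching the old NUM_5MERS constant; with k=8 it has 65536.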

Test plan

All 12,034 existing gsynth tests pass (backward compat)
54 new k-parameter tests pass
Clean compilation with zero warnings
Explicit k=5 produces results identical to the default
Save/load round-trip preserves k and produces identical samples
Non-integer k values (e.g. 3.5) are rejected
.gsm version bumped to 2 for k!=5 models

…t 5)

The gsynth system previously hardcoded a Markov order of 5 throughout. This adds a `k` parameter to `gsynth.train()` supporting orders 1-8, enabling users to trade off model complexity vs. context sensitivity.

C++: StratifiedMarkovModel uses runtime k with dynamic vector storage instead of fixed NUM_5MERS=1024 arrays. Binary format v2 stores k in header; v1 files load as k=5 for backward compatibility.

R: gsynth.train(k=5L) with validation, stored in model object. Propagated through save/load/sample. Old models without k default to 5.

Tests: 7 new test blocks (52 expectations) covering k=1/3/5/7, validation, save/load round-trip, backward compat, and stratified models.


aviezerl and others added 7 commits March 30, 2026 23:42

fix: avoid integer overflow when loading large high-order .gsm models

1888792

total_bins * num_kmers * 4L overflows R's 32-bit integer for k=8 models with >= 8192 bins. Use double arithmetic for expected_n instead.
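The 8192-bin threshold follows from the arithmetic: for k=8 there are 4^8 = 65536 contexts, so each bin holds 65536 * 4 = 2^18 cells, and 8192 = 2^13 bins give 2^31 cells, one more than the 32-bit signed maximum of 2^31 - 1. A small C++ sketch of the same check is below; the function names are illustrative, and the actual fix lives in the R loader.

```cpp
#include <cstdint>
#include <limits>

// Number of k-mer contexts for a given order: 4^k == 2^(2k).
constexpr std::int64_t num_kmers(int k) {
    return std::int64_t{1} << (2 * k);
}

// Expected cell count (bins x contexts x 4 bases), computed in 64 bits.
constexpr std::int64_t expected_cells(std::int64_t total_bins, int k) {
    return total_bins * num_kmers(k) * 4;
}

// True when the count does not fit in a 32-bit signed integer.
constexpr bool overflows_int32(std::int64_t total_bins, int k) {
    return expected_cells(total_bins, k) >
           std::numeric_limits<std::int32_t>::max();
}

// The same count as a double, mirroring the switch to double arithmetic.
// Doubles represent integers exactly up to 2^53, far above this range.
inline double expected_cells_as_double(double total_bins, int k) {
    return total_bins * static_cast<double>(num_kmers(k)) * 4.0;
}
```

Here `overflows_int32(8192, 8)` is true while `overflows_int32(8191, 8)` is false, which matches the ">= 8192 bins" condition in the commit message.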

fix: reject non-integer k values instead of silently truncating

7f4cd5e

k=3.5 was silently converted to 3 via as.integer(). Now validate that k == as.integer(k) before coercion, erroring on non-integer numerics.
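The order of operations matters here: the whole-number check has to run before the value is narrowed to an integer, otherwise 3.5 has already become 3 by the time it is inspected. A hedged C++ analogue of that check is sketched below. The PR's check is written in R; `validate_order` and its default bounds of 1..8 are assumptions made for this example.

```cpp
#include <cmath>
#include <stdexcept>

// Hypothetical helper: validate a numeric k, then narrow it to int.
// Rejects NaN/infinity, fractional values, and values outside [min_k, max_k].
inline int validate_order(double k, int min_k = 1, int max_k = 8) {
    if (!std::isfinite(k)) {
        throw std::invalid_argument("k must be a finite number");
    }
    // Compare against the truncated value BEFORE converting, so that
    // 3.5 is reported as an error instead of silently becoming 3.
    if (k != std::trunc(k)) {
        throw std::invalid_argument("k must be a whole number");
    }
    if (k < min_k || k > max_k) {
        throw std::invalid_argument("k is out of range");
    }
    return static_cast<int>(k);
}
```

Under these assumptions `validate_order(5.0)` returns 5, while `validate_order(3.5)` and `validate_order(9.0)` both throw `std::invalid_argument`.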

fix: bump .gsm metadata version to 2 for k != 5 models

86224f9

v1 implied fixed 1024x4 context layout. Models with k != 5 now emit version: 2 so external .gsm readers can distinguish variable-order files. k=5 models continue to emit version: 1 for backward compatibility.
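The versioning rule can be written down as two small functions: one choosing the version to emit, one recovering k on read. This is only a sketch of the rule as stated in the commit message. The function names are hypothetical and nothing here reflects the actual on-disk layout of .gsm files, which the PR text does not show.

```cpp
#include <stdexcept>

// Version to emit for a model of order k: k == 5 stays on version 1 so
// existing readers keep working; any other order is written as version 2.
constexpr int gsm_version_for_order(int k) {
    return k == 5 ? 1 : 2;
}

// Order implied when reading: version 1 always means k = 5 (the fixed
// 1024x4 layout); version 2 uses the k stored alongside the version.
inline int order_from_metadata(int version, int stored_k) {
    if (version == 1) return 5;
    if (version == 2) return stored_k;
    throw std::runtime_error("unsupported .gsm version");
}
```

A reader that only understands version 1 can therefore refuse version 2 files outright instead of misreading a variable-order table as 1024x4.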

Style code (GHA)

0e52f4b

chore: bump version to 5.6.9

2793ff5

feat: raise max Markov order from k=8 to k=10

34b475a

aviezerl merged commit b858da4 into master Apr 1, 2026
5 of 6 checks passed

aviezerl merged 7 commits into master from feat/gsynth-variable-k
