feat: parameterize Markov order k in gsynth (k=1..8) #82
Merged
…t 5)

The gsynth system previously hardcoded a Markov order of 5 throughout. This adds a `k` parameter to `gsynth.train()` supporting orders 1-8, enabling users to trade off model complexity vs. context sensitivity.

C++: StratifiedMarkovModel uses runtime k with dynamic vector storage instead of fixed NUM_5MERS=1024 arrays. Binary format v2 stores k in header; v1 files load as k=5 for backward compatibility.

R: gsynth.train(k=5L) with validation, stored in model object. Propagated through save/load/sample. Old models without k default to 5.

Tests: 7 new test blocks (52 expectations) covering k=1/3/5/7, validation, save/load round-trip, backward compat, and stratified models.
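To make the storage change concrete, here is a minimal sketch of a count table sized at runtime by 4^k rather than by a fixed 1024-entry array. The names `num_kmers`, `base_code`, `kmer_index`, and `ContextCounts` are hypothetical, and the 2-bit A/C/G/T encoding and row-major layout are assumptions; the PR does not show how `StratifiedMarkovModel` actually encodes contexts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: number of distinct contexts of length k over {A,C,G,T}.
inline std::size_t num_kmers(int k) {
    if (k < 1 || k > 8) throw std::invalid_argument("k must be in 1..8");
    return std::size_t{1} << (2 * k);  // 4^k == 2^(2k)
}

// Assumed 2-bit base encoding; returns -1 for anything other than A/C/G/T.
inline int base_code(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

// Map a k-mer to an index in [0, 4^k), or -1 if it contains a non-ACGT base.
inline std::int64_t kmer_index(const std::string& kmer) {
    std::int64_t idx = 0;
    for (char c : kmer) {
        const int code = base_code(c);
        if (code < 0) return -1;
        idx = idx * 4 + code;
    }
    return idx;
}

// Hypothetical count table: 4^k context rows x 4 next-base columns, row-major.
struct ContextCounts {
    int k;
    std::vector<std::uint32_t> counts;

    explicit ContextCounts(int order)
        : k(order), counts(num_kmers(order) * 4, 0) {}

    // Record one (context, next base) observation; ignores invalid input.
    void observe(const std::string& context, char next) {
        if (context.size() != static_cast<std::size_t>(k)) return;
        const std::int64_t row = kmer_index(context);
        const int col = base_code(next);
        if (row < 0 || col < 0) return;
        ++counts[static_cast<std::size_t>(row) * 4 + col];
    }
};
```

With k=5 this reproduces the 1024 x 4 layout of the old fixed arrays, while k=8 grows the table to 65536 x 4 entries.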
total_bins * num_kmers * 4L overflows R's 32-bit integer for k=8 models with >= 8192 bins. Use double arithmetic for expected_n instead.
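The threshold in this commit message can be checked by hand: for k=8 there are 4^8 = 65536 contexts, so 8192 * 65536 * 4 = 2^31, one more than the largest 32-bit signed integer (2^31 - 1 = 2147483647). The sketch below redoes that arithmetic in 64-bit C++ for illustration only; `expected_entries` and `fits_in_int32` are hypothetical names, and the actual fix lives in the R code, which is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: total_bins * 4^k * 4, computed in 64 bits so the
// product itself cannot overflow for the k range 1..8.
inline std::int64_t expected_entries(std::int64_t total_bins, int k) {
    const std::int64_t num_kmers = std::int64_t{1} << (2 * k);  // 4^k
    return total_bins * num_kmers * 4;
}

// True if n is representable as a 32-bit signed integer.
inline bool fits_in_int32(std::int64_t n) {
    return n >= std::numeric_limits<std::int32_t>::min() &&
           n <= std::numeric_limits<std::int32_t>::max();
}
```

A double holds every integer up to 2^53 exactly, so switching `expected_n` to double arithmetic leaves ample headroom for this product.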
k=3.5 was silently converted to 3 via as.integer(). Now validate that k == as.integer(k) before coercion, erroring on non-integer numerics.
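The order of operations matters here: the whole-number check has to run before the value is narrowed to an integer, otherwise 3.5 has already become 3. The following is a rough C++ analogue of that rule, not the PR's R code; `validate_k` and its error messages are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>

// Hypothetical validator: reject non-finite, fractional, and out-of-range
// values first, and only then convert to int.
inline int validate_k(double k) {
    if (!std::isfinite(k)) {
        throw std::invalid_argument("k must be a finite number");
    }
    if (k != std::trunc(k)) {
        throw std::invalid_argument("k must be a whole number");
    }
    if (k < 1 || k > 8) {
        throw std::invalid_argument("k must be in 1..8");
    }
    return static_cast<int>(k);
}
```

Under this sketch `validate_k(5.0)` returns 5, while `validate_k(3.5)` and `validate_k(9.0)` throw instead of being truncated or accepted.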
v1 implied fixed 1024x4 context layout. Models with k != 5 now emit version: 2 so external .gsm readers can distinguish variable-order files. k=5 models continue to emit version: 1 for backward compatibility.
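The versioning rule reduces to two small decisions, one on write and one on read. The sketch below states them as plain functions; `gsm_version_for` and `order_on_load` are hypothetical names, and the real header layout is not shown in this PR.

```cpp
#include <cassert>

// Version to emit on write: k=5 keeps version 1 so existing readers that
// assume the fixed 1024x4 layout continue to work; any other k emits 2.
inline int gsm_version_for(int k) {
    return k == 5 ? 1 : 2;
}

// Markov order to use on read: version 1 files carry no k and imply k=5;
// version 2 files are assumed to carry k explicitly (stored_k).
inline int order_on_load(int version, int stored_k) {
    return version == 1 ? 5 : stored_k;
}
```

One consequence of this design is that a reader cannot treat version 2 as "newer than" version 1 in general: freshly written k=5 models still carry version 1.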
Summary
- Add `k` parameter to `gsynth.train()` (default `5L`, range 1-8) to configure the Markov order for synthetic genome generation
- Convert `StratifiedMarkovModel` from fixed `NUM_5MERS=1024` arrays to dynamic vectors sized by `4^k`
- `.gsm` format uses version 2 for k != 5

Changes

C++ (4 files):
`StratifiedMarkovModel.h/cpp` accepts runtime k with dynamic storage. `GenomeSynthTrain.cpp` and `GenomeSynthSample.cpp` accept k from R. Binary format v2 stores k in header; v1 loads as k=5.

R (1 file):
`gsynth.train(k=5L)` with strict validation (rejects non-integer, out-of-range). Propagated through `gsynth.save/load/sample`. Uses double arithmetic to avoid integer overflow for large models.

Tests (1 new file):
`test-gsynth-k.R`: 7 test blocks (54 expectations) covering k=1/3/5/7, validation, save/load round-trip, backward compat, stratified models, .gsm version assertion.

Test plan