Conversation
… formats All 15 distributions now follow a single consistent standard: - operator<< delegates to toString() - operator>> parses exactly the text that toString() / operator<< writes, including reading and discarding derived-value lines (Mean, Variance, etc.) so the stream is positioned correctly for the next token after the call Previously broken (pre-existing bugs): - 9 distributions had operator<< writing a different format from toString() (e.g. Gaussian wrote 'Normal Distribution: Mean = V' while toString() writes 'Gaussian Distribution: mu (mean) = V') - Binomial/NegativeBinomial operator>> parsed a parenthesised 'Dist(r,p)' format that no operator<< ever produced - StudentT operator>> used a 65-line key=value parser (CC 22) for a format no longer written by operator<< - Poisson operator>> expected 4 tokens before the value but toString() writes 6 tokens (lambda (rate parameter) =) - Discrete operator>> tried to read numSymbols as an integer first token, failing immediately on the 'Discrete' keyword written by operator<< hmm.cpp parse_* fixes (called after type keyword consumed): - Fixed token counts for parse_exponential, parse_gamma, parse_log_normal, parse_pareto, parse_poisson, parse_uniform, parse_beta - All parse_* functions now skip trailing derived-value lines (Mean, Variance, Median, Mode) so operator>>(Hmm) advances cleanly to the next state header - Added parse_binomial and parse_rayleigh to kStreamParsers map No format changes to toString() itself; the verbose format is preserved. Co-Authored-By: Oz <oz-agent@warp.dev>
Stream IO (test_hmm_stream_io.cpp): - AllDistributionsStreamRoundTrip: 14-state HMM covering every distribution except NegativeBinomial; verifies operator<</>> round-trip with 1e-5 tolerance (toString() uses 6 decimal places, not max_digits10) - NegativeBinomialStreamLimitation: confirms two-word type dispatch limitation throws std::runtime_error as documented JSON sanitization (test_hmm_json.cpp): - StatesCapExceededThrows / NonFiniteStatesThrows / InputSizeLimitThrows - PiArrayLongerThanNThrows / TransMatrixLongerThanNThrows - DiscreteNCapExceededThrows / DiscretePropsArrayLongerThanNThrows - StatesAtCapAccepted: confirms valid inputs within limits are accepted XML round-trips (test_xml_file_io.cpp): - Removed all GTEST_SKIP() from round-trip tests; they now pass with the fixed stream parsers - XMLRoundTripConsistency verifies DiscreteDistribution symbol probabilities (1e-5 tolerance) - HMMStreamOperators checks distribution type and symbol count Co-Authored-By: Oz <oz-agent@warp.dev>
…ON I/O
segmental_kmeans_trainer.cpp:
- Add file-level algorithm overview (outline of pi/trans/emis/optimize loop,
convergence criterion, DiscreteDistribution restriction)
- Add function-level comments for Clusters::Clusters (initial partitioning
strategy), remove (SIZE_MAX sentinel), learnPi, learnTrans, learnEmis
(1e-10 probability floor rationale), optimizeCluster (convergence indicator),
flattenObservationLists (why contiguous order matters for initialisation)
hmm_validator.cpp:
- Accept both .json (load_json) and any other extension (XMLFileReader)
- Auto-detect format from file extension; report 'legacy format' for XML
- Update usage message and file description accordingly
- Add #include <filesystem> for std::filesystem::path
basic_hmm_example.cpp:
- Replace raw std::ofstream/ifstream stream round-trip ('Testing File
Read/Write') with save_json/load_json demonstration
- Update file header comment: 'XML file round-trip' -> 'JSON file round-trip'
- Remove stale #include <fstream>
libhmm.h:
- Add #include 'libhmm/io/hmm_json.h' to the I/O section
- Mark XML headers as legacy/deprecated in umbrella header comments
- JSON is now the recommended format for new code
Co-Authored-By: Oz <oz-agent@warp.dev>
Adds a samples/ directory with two reference HMMs, each in both formats:
two_state_gaussian.{json,xml} -- 2-state continuous HMM with Gaussian
emissions. Useful for validating basic load/inference pipelines.
casino.{json,xml} -- 2-state discrete HMM (dishonest casino,
Durbin et al. 1998). Fair die (uniform) vs loaded die (face 6 biased
to 37.5%). Useful for Viterbi decoding and discrete distribution tests.
All four files validated with hmm_validator: Load, Validate, ForwardBackward,
and Viterbi all pass. JSON files use exact IEEE 754 fractions (max_digits10);
XML files match the token format produced by XMLFileWriter.
samples/README.md documents parameters, format notes, and usage examples.
Co-Authored-By: Oz <oz-agent@warp.dev>
std::from_chars for floating-point is not available in AppleClang's libc++ (macOS CI failure). Replace with std::strtod, which: - provides the same consumed-position semantics via the end-pointer - is available on all C++17 platforms (no feature-test gap) - uses errno for range errors Also add std::locale::classic() imbue to write_double() to guarantee the '.' decimal separator regardless of the process global locale, matching the expectation of strtod under the default C locale. Removes <charconv> and <system_error> includes; adds <cerrno> and <cstdlib>. Co-Authored-By: Oz <oz-agent@warp.dev>
OldCrow
added a commit
that referenced
this pull request
May 5, 2026
CMakeLists.txt: VERSION 3.3.0 -> 3.4.0 include/libhmm/libhmm.h: @Version 3.4.0 CHANGELOG.md: add [3.4.0] entry covering all three refactor phases: - Phase 1 (PR #4): code quality - weighted stats helper, duplicate preamble elimination, BaumWelch CC 32->13, common.h split - Phase 2 (PR #5): common.h fan-in 24 reduction - Phase 3 (PR #6): JSON serializer, IO alignment bug fix, operator>> decomposition, sanitization, sample files README.md: - Version badge: 3.1.2 -> 3.4.0 - Tests badge: 36/36 -> 38/38 - Add I/O section (JSON save_json/load_json, legacy XML) - Add samples/ to project structure - Add io/ to include/libhmm/ directory listing - Update tools/ and tests/ descriptions Co-Authored-By: Oz <oz-agent@warp.dev>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Completes Tiers 4 and 5 of the code quality refactoring roadmap, plus a comprehensive fix of pre-existing IO correctness bugs discovered during implementation.
Tier 5 — Custom JSON serializer/deserializer
No external dependencies. Full round-trip at
max_digits10precision.json_utils(include/libhmm/io/json_utils.h+src/io/json_utils.cpp): schema-awareReadertokenizer + write helpers (write_double,write_array,write_matrix,write_distribution).read_double_array(N)andread_double_matrix(N,N)accept optional size caps.to_json() const override+static from_json(json::Reader&)factory method.hmm_json(include/libhmm/io/hmm_json.h+src/io/hmm_json.cpp):to_json(Hmm),from_json(string_view),save_json(Hmm, path),load_json(path)free functions. Anonymous-namespace factory map gives CC 2 dispatch regardless of distribution count.serialization.h,distribution_io_utils.h(zero callers).isValidXMLFilerenamedcanParseAsHmm, format documented, deprecation notices added.libhmm.hupdated:hmm_json.hadded as the recommended IO entry point; XML headers marked legacy.Input sanitization
kMaxHmmStateskMaxJsonInputByteskMaxDiscreteSymbolsread_double_array(N)read_double_matrix(N,N)isfiniteon states valuestatic_cast<size_t>(inf)Tier 4 —
operator>>(istream, Hmm)decompositionoperator>>was 258 lines, CC 18 (lizard warning). Decomposed by extracting each distribution parser lambda into a namedstaticfunction in the anonymous namespace:operator>>; zero lizard warnings inhmm.cpp.std::function<>map with plainStreamParserFnfunction-pointer map (eliminates heap-allocated closures).parse_binomialandparse_rayleigh(previously absent from dispatch table).Pre-existing IO correctness bugs (discovered during implementation)
All 15 distributions had inconsistent
operator<<,operator>>, andtoString()implementations — many parsers were reading formats that nooperator<<ever wrote. Fixed systematically:operator<<rewrote a different format fromtoString()(e.g. Gaussian wrote"Normal Distribution: Mean = V"whiletoString()writes"Gaussian Distribution: μ (mean) = V"). All now delegate totoString().operator>>: rewritten to token-scan thetoString()format, including skipping derived-value lines (Mean, Variance, etc.) so the HMM stream parser advances correctly to the next state.parse_*functions inhmm.cpp: wrong token counts corrected (e.g.parse_poissonexpectedλ =buttoString()writesλ (rate parameter) =).Discrete::operator>>: replaced legacy bare-integer format with currenttoString()format.StudentT::operator>>: CC 22 over-engineered key=value parser replaced with 3-line token scan (CC 2).Tests added
tests/io/test_hmm_json.cpp(new):TypeFieldPrefix: all 15 distributionto_json()type fieldsAllDistributionsRoundTrip: 15-state HMM, exact (bit-identical) parameter recoveryFileSaveLoadRoundTrip:save_json/load_jsonfile round-triptests/io/test_hmm_stream_io.cpp(extended):AllDistributionsStreamRoundTrip: 14-state HMM viaoperator<</operator>>(all distributions except NegativeBinomial)NegativeBinomialStreamLimitation: documented limitation confirmedtests/io/test_xml_file_io.cpp(updated):GTEST_SKIP()— round-trip tests now pass with correct parsersXMLRoundTripConsistencyOther
segmental_kmeans_trainer.cpp: implementation-level documentation added (algorithm overview, per-function comments explaining partitioning strategy, probability floor rationale, convergence indicator, etc.)hmm_validator: accepts.json(JSON) and any other extension (legacy XML); format auto-detected from file extension.basic_hmm_example: section 4 replaced rawofstream/ifstreamround-trip withsave_json/load_jsondemonstration.samples/: new directory withtwo_state_gaussian.{json,xml}andcasino.{json,xml}— validated throughhmm_validator, tracked in the repo.Metrics
hmm.cpp)operator>>(Hmm)CCStudentT::operator>>CCPre-existing lizard warnings retained as acknowledged:
BetaDistribution::getProbability(CC 23),BetaDistribution::fit(CC 16),NegativeBinomialDistribution::fit(CC 19),ParetoDistribution::fit(CC 17) — all algorithmic complexity in numerical methods, not structural.