Skip to content

refactor: Phase 3 — JSON serializer, IO alignment, sanitization, operator>> decomposition#6

Merged
OldCrow merged 5 commits intomainfrom
refactor/code-quality-phase3
May 5, 2026
Merged

refactor: Phase 3 — JSON serializer, IO alignment, sanitization, operator>> decomposition#6
OldCrow merged 5 commits intomainfrom
refactor/code-quality-phase3

Conversation

@OldCrow
Copy link
Copy Markdown
Owner

@OldCrow OldCrow commented May 5, 2026

Summary

Completes Tiers 4 and 5 of the code quality refactoring roadmap, plus a comprehensive fix of pre-existing IO correctness bugs discovered during implementation.


Tier 5 — Custom JSON serializer/deserializer

No external dependencies. Full round-trip at max_digits10 precision.

  • json_utils (include/libhmm/io/json_utils.h + src/io/json_utils.cpp): schema-aware Reader tokenizer + write helpers (write_double, write_array, write_matrix, write_distribution). read_double_array(N) and read_double_matrix(N,N) accept optional size caps.
  • All 15 distributions: to_json() const override + static from_json(json::Reader&) factory method.
  • hmm_json (include/libhmm/io/hmm_json.h + src/io/hmm_json.cpp): to_json(Hmm), from_json(string_view), save_json(Hmm, path), load_json(path) free functions. Anonymous-namespace factory map gives CC 2 dispatch regardless of distribution count.
  • Dead code deleted: serialization.h, distribution_io_utils.h (zero callers).
  • XML classes deprecated: dead boost TODO comment blocks removed, isValidXMLFile renamed canParseAsHmm, format documented, deprecation notices added.
  • libhmm.h updated: hmm_json.h added as the recommended IO entry point; XML headers marked legacy.

Input sanitization

Guard Limit Rationale
kMaxHmmStates 4096 N×N matrix at N=4096 ≈ 128 MB
kMaxJsonInputBytes 10 MB Well above any realistic HMM file
kMaxDiscreteSymbols 65536 512 KB per distribution
read_double_array(N) N elements Prevents heap growth before dimension check
read_double_matrix(N,N) N rows × N cols Same
isfinite on states value Guards UB static_cast<size_t>(inf)

Tier 4 — operator>>(istream, Hmm) decomposition

operator>> was 258 lines, CC 18 (lizard warning). Decomposed by extracting each distribution parser lambda into a named static function in the anonymous namespace:

  • CC: 18 → 7 on operator>>; zero lizard warnings in hmm.cpp.
  • Replaced std::function<> map with plain StreamParserFn function-pointer map (eliminates heap-allocated closures).
  • Added parse_binomial and parse_rayleigh (previously absent from dispatch table).
  • NegativeBinomial two-word type limitation documented and confirmed by test.

Pre-existing IO correctness bugs (discovered during implementation)

All 15 distributions had inconsistent operator<<, operator>>, and toString() implementations — many parsers were reading formats that no operator<< ever wrote. Fixed systematically:

  • 9 distributions: operator<< rewrote a different format from toString() (e.g. Gaussian wrote "Normal Distribution: Mean = V" while toString() writes "Gaussian Distribution: μ (mean) = V"). All now delegate to toString().
  • All 15 standalone operator>>: rewritten to token-scan the toString() format, including skipping derived-value lines (Mean, Variance, etc.) so the HMM stream parser advances correctly to the next state.
  • 6 parse_* functions in hmm.cpp: wrong token counts corrected (e.g. parse_poisson expected λ = but toString() writes λ (rate parameter) =).
  • Discrete::operator>>: replaced legacy bare-integer format with current toString() format.
  • StudentT::operator>>: CC 22 over-engineered key=value parser replaced with 3-line token scan (CC 2).
  • All GTEST_SKIP() removed from IO tests — they were hiding these bugs.

Tests added

tests/io/test_hmm_json.cpp (new):

  • TypeFieldPrefix: all 15 distribution to_json() type fields
  • AllDistributionsRoundTrip: 15-state HMM, exact (bit-identical) parameter recovery
  • FileSaveLoadRoundTrip: save_json / load_json file round-trip
  • Error cases: unknown type, malformed JSON, zero states
  • 8 sanitization boundary tests: states cap, non-finite states, 11 MB input, overlong pi/trans/probs arrays, Discrete n cap

tests/io/test_hmm_stream_io.cpp (extended):

  • AllDistributionsStreamRoundTrip: 14-state HMM via operator<</operator>> (all distributions except NegativeBinomial)
  • NegativeBinomialStreamLimitation: documented limitation confirmed

tests/io/test_xml_file_io.cpp (updated):

  • Removed all GTEST_SKIP() — round-trip tests now pass with correct parsers
  • Added distribution parameter verification in XMLRoundTripConsistency

Other

  • segmental_kmeans_trainer.cpp: implementation-level documentation added (algorithm overview, per-function comments explaining partitioning strategy, probability floor rationale, convergence indicator, etc.)
  • hmm_validator: accepts .json (JSON) and any other extension (legacy XML); format auto-detected from file extension.
  • basic_hmm_example: section 4 replaced raw ofstream/ifstream round-trip with save_json/load_json demonstration.
  • samples/: new directory with two_state_gaussian.{json,xml} and casino.{json,xml} — validated through hmm_validator, tracked in the repo.

Metrics

Metric Before After
Lizard warnings (hmm.cpp) 1 (CC 18) 0
operator>>(Hmm) CC 18 7
StudentT::operator>> CC 22 2
Tests passing 37/37 38/38
GTEST_SKIP instances 4 0

Pre-existing lizard warnings retained as acknowledged: BetaDistribution::getProbability (CC 23), BetaDistribution::fit (CC 16), NegativeBinomialDistribution::fit (CC 19), ParetoDistribution::fit (CC 17) — all algorithmic complexity in numerical methods, not structural.


Conversation: https://app.warp.dev/conversation/ca60ff01-fa5e-45cf-be1a-5dd234ddd87c
Plan: https://app.warp.dev/drive/notebook/WeK7cgQicMiiFNphwUmzXT

OldCrow and others added 5 commits May 4, 2026 22:41
… formats

All 15 distributions now follow a single consistent standard:
- operator<< delegates to toString()
- operator>> parses exactly the text that toString() / operator<< writes,
  including reading and discarding derived-value lines (Mean, Variance, etc.)
  so the stream is positioned correctly for the next token after the call

Previously broken (pre-existing bugs):
- 9 distributions had operator<< writing a different format from toString()
  (e.g. Gaussian wrote 'Normal Distribution: Mean = V' while toString()
  writes 'Gaussian Distribution: mu (mean) = V')
- Binomial/NegativeBinomial operator>> parsed a parenthesised 'Dist(r,p)'
  format that no operator<< ever produced
- StudentT operator>> used a 65-line key=value parser (CC 22) for a format
  no longer written by operator<<
- Poisson operator>> expected 4 tokens before the value but toString() writes
  6 tokens (lambda (rate parameter) =)
- Discrete operator>> tried to read numSymbols as an integer first token,
  failing immediately on the 'Discrete' keyword written by operator<<

hmm.cpp parse_* fixes (called after type keyword consumed):
- Fixed token counts for parse_exponential, parse_gamma, parse_log_normal,
  parse_pareto, parse_poisson, parse_uniform, parse_beta
- All parse_* functions now skip trailing derived-value lines (Mean, Variance,
  Median, Mode) so operator>>(Hmm) advances cleanly to the next state header
- Added parse_binomial and parse_rayleigh to kStreamParsers map

No format changes to toString() itself; the verbose format is preserved.

Co-Authored-By: Oz <oz-agent@warp.dev>
Stream IO (test_hmm_stream_io.cpp):
- AllDistributionsStreamRoundTrip: 14-state HMM covering every distribution
  except NegativeBinomial; verifies operator<</>> round-trip with 1e-5
  tolerance (toString() uses 6 decimal places, not max_digits10)
- NegativeBinomialStreamLimitation: confirms two-word type dispatch
  limitation throws std::runtime_error as documented

JSON sanitization (test_hmm_json.cpp):
- StatesCapExceededThrows / NonFiniteStatesThrows / InputSizeLimitThrows
- PiArrayLongerThanNThrows / TransMatrixLongerThanNThrows
- DiscreteNCapExceededThrows / DiscretePropsArrayLongerThanNThrows
- StatesAtCapAccepted: confirms valid inputs within limits are accepted

XML round-trips (test_xml_file_io.cpp):
- Removed all GTEST_SKIP() from round-trip tests; they now pass with
  the fixed stream parsers
- XMLRoundTripConsistency verifies DiscreteDistribution symbol
  probabilities (1e-5 tolerance)
- HMMStreamOperators checks distribution type and symbol count

Co-Authored-By: Oz <oz-agent@warp.dev>
…ON I/O

segmental_kmeans_trainer.cpp:
- Add file-level algorithm overview (outline of pi/trans/emis/optimize loop,
  convergence criterion, DiscreteDistribution restriction)
- Add function-level comments for Clusters::Clusters (initial partitioning
  strategy), remove (SIZE_MAX sentinel), learnPi, learnTrans, learnEmis
  (1e-10 probability floor rationale), optimizeCluster (convergence indicator),
  flattenObservationLists (why contiguous order matters for initialisation)

hmm_validator.cpp:
- Accept both .json (load_json) and any other extension (XMLFileReader)
- Auto-detect format from file extension; report 'legacy format' for XML
- Update usage message and file description accordingly
- Add #include <filesystem> for std::filesystem::path

basic_hmm_example.cpp:
- Replace raw std::ofstream/ifstream stream round-trip ('Testing File
  Read/Write') with save_json/load_json demonstration
- Update file header comment: 'XML file round-trip' -> 'JSON file round-trip'
- Remove stale #include <fstream>

libhmm.h:
- Add #include 'libhmm/io/hmm_json.h' to the I/O section
- Mark XML headers as legacy/deprecated in umbrella header comments
- JSON is now the recommended format for new code

Co-Authored-By: Oz <oz-agent@warp.dev>
Adds a samples/ directory with two reference HMMs, each in both formats:

  two_state_gaussian.{json,xml}  -- 2-state continuous HMM with Gaussian
    emissions. Useful for validating basic load/inference pipelines.

  casino.{json,xml}  -- 2-state discrete HMM (dishonest casino,
    Durbin et al. 1998). Fair die (uniform) vs loaded die (face 6 biased
    to 37.5%). Useful for Viterbi decoding and discrete distribution tests.

All four files validated with hmm_validator: Load, Validate, ForwardBackward,
and Viterbi all pass. JSON files use exact IEEE 754 fractions (max_digits10);
XML files match the token format produced by XMLFileWriter.

samples/README.md documents parameters, format notes, and usage examples.

Co-Authored-By: Oz <oz-agent@warp.dev>
std::from_chars for floating-point is not available in AppleClang's
libc++ (macOS CI failure). Replace with std::strtod, which:
- provides the same consumed-position semantics via the end-pointer
- is available on all C++17 platforms (no feature-test gap)
- uses errno for range errors

Also add std::locale::classic() imbue to write_double() to guarantee
the '.' decimal separator regardless of the process global locale,
matching the expectation of strtod under the default C locale.

Removes <charconv> and <system_error> includes; adds <cerrno> and <cstdlib>.

Co-Authored-By: Oz <oz-agent@warp.dev>
@OldCrow OldCrow merged commit 38c96bb into main May 5, 2026
7 checks passed
OldCrow added a commit that referenced this pull request May 5, 2026
CMakeLists.txt: VERSION 3.3.0 -> 3.4.0
include/libhmm/libhmm.h: @Version 3.4.0

CHANGELOG.md: add [3.4.0] entry covering all three refactor phases:
- Phase 1 (PR #4): code quality - weighted stats helper, duplicate
  preamble elimination, BaumWelch CC 32->13, common.h split
- Phase 2 (PR #5): common.h fan-in 24 reduction
- Phase 3 (PR #6): JSON serializer, IO alignment bug fix, operator>>
  decomposition, sanitization, sample files

README.md:
- Version badge: 3.1.2 -> 3.4.0
- Tests badge: 36/36 -> 38/38
- Add I/O section (JSON save_json/load_json, legacy XML)
- Add samples/ to project structure
- Add io/ to include/libhmm/ directory listing
- Update tools/ and tests/ descriptions

Co-Authored-By: Oz <oz-agent@warp.dev>
@OldCrow OldCrow deleted the refactor/code-quality-phase3 branch May 5, 2026 04:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant