release: v5.9.0 — je_network.method config + JE CSV widen + post-v5.8.0 fixes by mivertowski · Pull Request #188 · mivertowski/SyntheticData

mivertowski · 2026-05-08T19:17:02Z

v5.9.0 minor release bundling the customer-feedback follow-ups
that landed on top of v5.8.0 plus the JE CSV widening for
predecessor_line_id (model field added in v5.8.0 but never
surfaced through CSV).

Added — `graph_export.je_network.method` config option

At the HF reference-dataset scale (1 M JE lines), the default
Cartesian method produced ≈225 M edges totalling ~52 GB. Method A
drops that to ≈600 k edges (~80 MB) while preserving the
exact-1.0-confidence cohort and matching the 2024 paper exactly.
The HF dataset config now ships with `method: a`.

Added — `predecessor_line_id` column on JE CSV

v5.8.0 added the model field but only JSON exposed it. CSV
header is now 43 columns ending with predecessor_line_id.

Fixed — silent-drop drift in distribution structs (#183)

11 of 12 shipped templates had silently-dropped fields. The fix
combines schema extensions, #[serde(deny_unknown_fields)],
per-field defaults, improved validator error messages, and a new
regression test that walks every template.

Fixed — variance overflow in `unusual_item_generator` (#183)

Defence-in-depth guard against rust_decimal overflow on
account-level balances above 10^15.

Fixed — Benford magnitude sampling (#185)

BenfordSampler::sample_with_first_digit and
EnhancedBenfordSampler::sample_with_digits now derive
order-of-magnitude from the parent log-normal instead of
sampling uniformly across [log10(min), log10(max)].
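
As a rough illustration of the idea above, the sketch below keeps the order of magnitude of a draw from the parent distribution and only re-stamps its leading digit. It is not the project's `BenfordSampler` code: the function name `with_first_digit` and the choice to pass the parent draw in as a plain `f64` (rather than sampling it) are assumptions made to keep the example dependency-free.

```rust
/// Hypothetical helper, not the crate's API: re-stamp the leading digit of
/// `parent_draw` while keeping its order of magnitude. `parent_draw` stands
/// in for a sample from the parent log-normal distribution.
fn with_first_digit(parent_draw: f64, first_digit: u32) -> Option<f64> {
    // Only digits 1..=9 are valid leading digits; the draw must be positive.
    if !(1..=9).contains(&first_digit) || !parent_draw.is_finite() || parent_draw <= 0.0 {
        return None;
    }
    // Order of magnitude comes from the parent draw, not from a uniform
    // sample over [log10(min), log10(max)].
    let magnitude = parent_draw.log10().floor() as i32;
    let scale = 10f64.powi(magnitude);
    let mantissa = parent_draw / scale; // nominally in [1, 10)
    let fraction = mantissa - mantissa.floor();
    Some((f64::from(first_digit) + fraction) * scale)
}
```

With this shape, a parent draw of 2766.0 and a requested first digit of 7 stays in the thousands instead of landing anywhere between the configured minimum and maximum magnitudes.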

Fixed — quintillion-scale amounts in BenfordViolationStrategy (#185)

Magnitude was computed from full Decimal to_string() (decimal
point stripped), producing target × 10^18 fraud amounts on
routine inputs. Now uses f64::log10().floor() capped at 12.

| Customer mult_overflow repro | Before | After |
|---|---|---|
| Max line debit | $2.32 × 10^19 | $24.58 × 10^6 |
| Mean line amount | $2.55 × 10^14 | $2,766 |
| Lines > $10B | 58 | 0 |
| Materiality (entity 1000) | $3.69 × 10^16 | $391 K |
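
The failure mode can be reproduced in a few lines. The sketch below is an assumption-laden illustration, not the actual `BenfordViolationStrategy` code: both function names are hypothetical, and the "before" variant simply counts the digits of a string rendering, which is one way a decimal carrying 18 fractional digits could inflate the magnitude by 10^18.

```rust
/// Cap taken from the PR text ("capped at 12").
const MAX_MAGNITUDE: u32 = 12;

/// Hypothetical "before" behaviour: count every digit in the string form,
/// ignoring the decimal point, and treat (digits - 1) as the magnitude.
fn magnitude_from_digits(repr: &str) -> u32 {
    let digits = repr.chars().filter(|c| c.is_ascii_digit()).count() as u32;
    digits.saturating_sub(1)
}

/// Hypothetical "after" behaviour: floor(log10(amount)), capped.
/// Amounts below 1 (or non-finite) are treated as magnitude 0 here.
fn magnitude_from_log10(amount: f64) -> u32 {
    if !amount.is_finite() || amount < 1.0 {
        return 0;
    }
    (amount.log10().floor() as u32).min(MAX_MAGNITUDE)
}
```

For an amount of 1234.56 rendered with 18 fractional digits, the digit-count variant reports magnitude 21 while the log10 variant reports 3, a gap of 18 orders of magnitude.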

Fixed — CI runner sizing (#184)

ubuntu-latest runs --lib --bins only; full integration suite
runs on macOS/Windows. Same restriction on Code Coverage job.

Bumped — workspace version 5.8.0 → 5.9.0

All workspace Cargo.toml references updated. CHANGELOG
restructured: post-cut entries that had accumulated under the
v5.8.0 heading now live under a clean v5.9.0 release entry.

Test plan

149/149 datasynth-config lib tests pass.
scenario_templates_validate integration test passes
(12/12 shipped templates deserialize + validate cleanly).
Sequential v5.9.0 release-binary regen of the HF reference
config completes in 26.7 s, produces 1 059 679-line JE CSV
with 43 columns + 61 653-edge je_network.csv (Method A).
Parquet conversion produces 6 artefacts totalling ~45 MB.

🤖 Generated with Claude Code

…refresh

## Background

The v5.8.0 release of `graphs/je_network.csv` defaulted to a full Cartesian product of debit × credit lines per JE. At the HuggingFace reference-dataset scale (1 M JE lines × 12 months × 10 companies, with `audit-group` + period-close + IC matching), this produced **225 M edges totalling 52 GB of CSV**, too large to publish. Each multi-line consolidation contributes O(n × m) edges, so a 50-debit / 50-credit period-close entry alone yields 2 500 edges, and the long tail of high-cardinality JEs (intercompany batches, opening balances) dominates the total.

Methods A through E from Ivertowski (2024) *Hardware Accelerated Method for Accounting Network Generation* recognise this combinatorial blow-up explicitly: Method A is the bijective `2-line → 1-edge` case (≈ 60 % of postings per the paper) and Methods B–E enumerate progressively lossier choices for the rest.

## What this commit adds

A new opt-in YAML field `graph_export.je_network.method` lets configs choose between:

- `cartesian` (default): current v5.8.0 behaviour, full Cartesian product. Backward-compatible.
- `a`: Method A from the 2024 paper. Emits exactly one edge per 2-line journal entry and skips multi-line entries. Edge count drops from O(n × m × |JEs|) to O(|2-line JEs|); confidence is exactly 1.0 on every edge.

## Implementation

- `datasynth-config/src/schema.rs`: new `JeNetworkMethod` enum (`cartesian` | `a`) and `JeNetworkConfig` struct, embedded in `GraphExportConfig.je_network` with `#[serde(default)]`.
- `datasynth-runtime/src/output_writer.rs::write_je_network_csv`: takes the method as a parameter, skips multi-line JEs when `method == A`. Public entrypoints (`write_all_output`, `write_all_output_with_root`, `write_all_output_with_layout`) forward the value, defaulting to `JeNetworkMethod::default()` (= Cartesian) so callers that don't care preserve old behaviour.
- `datasynth-cli/src/main.rs`: passes `generator_config.graph_export.je_network.method` to the writer.

## HF dataset refresh (v5.5.1 → v5.8.0)

`configs/examples/hf/journal_entries_1m.yaml`:

- Header bumped to v5.8.0; documents the v5.6.0/5.7.0/5.8.0 additions (ISO 21378 fields, industry-pack expansion option, je_network export, Benford magnitude fix, distribution silent-drop closure).
- New `graph_export.je_network.method: a` block, which keeps the edge file shippable.

`scripts/hf_to_parquet.py`:

- New `je_network_csv_to_parquet()` step that converts the edge list to a single `je_network.parquet` (typically < 10 MB at Method A; < 1 GB at Cartesian compressed).

`configs/examples/hf/README.md`: documents the new `je_network.parquet` artefact and its schema.

## Measured impact (1 M JE-line config, audit-group + 12 months + 10 companies)

| Metric | Cartesian (default) | Method A |
|---|---|---|
| Edges | 225 020 023 | **61 654** |
| Edge-file size (CSV) | 52 GB | **13 MB** |
| Edge-file size (parq.) | ~7 GB est. | **5.6 MB** |
| Generation time | 16 m 46 s | **24 s** |
| Per-edge confidence | 1.0 / 1.0/(n·m) | **1.0** (all) |

Total HF staging directory (`hf_staging/`) at Method A: **~45 MB** (3-shard JE 39 MB + je_network 5.6 MB + COA / TB / cost-centres / profit-centres ~210 KB).

## Tests

- `cargo test -p datasynth-config --lib`: 149/149 pass.
- `cargo test -p datasynth-config --test scenario_templates_validate`: 12/12 templates still parse cleanly.
- Round-trip regen + parquet conversion produces a valid HF-shaped staging tree (this is also the verification artefact for the `je_network` cardinality / confidence claims above).

## CHANGELOG

Attached under v5.8.0 per release-cadence guidance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
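
The difference between the two methods can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not the code in `write_je_network_csv`: the `Edge` struct, the `Line` tuple, the function `edges_for_entry`, the debit-to-credit edge direction, and the reading of Cartesian confidence as `1.0 / (n · m)` are all hypothetical. Only the enum name `JeNetworkMethod`, its two variants, and the Cartesian default come from the commit message.

```rust
/// Mirrors the enum named in the commit message; Cartesian is the default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum JeNetworkMethod {
    #[default]
    Cartesian,
    A,
}

/// Hypothetical edge record for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Edge {
    from_account: String,
    to_account: String,
    confidence: f64,
}

/// One journal-entry line as (account, is_debit). Hypothetical shape.
type Line<'a> = (&'a str, bool);

fn edges_for_entry(lines: &[Line<'_>], method: JeNetworkMethod) -> Vec<Edge> {
    let debits: Vec<&str> = lines.iter().filter(|l| l.1).map(|l| l.0).collect();
    let credits: Vec<&str> = lines.iter().filter(|l| !l.1).map(|l| l.0).collect();
    if debits.is_empty() || credits.is_empty() {
        return Vec::new();
    }
    // Method A keeps only the 1-debit / 1-credit case and skips the rest.
    let is_two_line = debits.len() == 1 && credits.len() == 1;
    if method == JeNetworkMethod::A && !is_two_line {
        return Vec::new();
    }
    // Assumed confidence model: 1 / (n * m), which is exactly 1.0 for 2-line JEs.
    let pairs = debits.len() * credits.len();
    let confidence = 1.0 / pairs as f64;
    let mut edges = Vec::with_capacity(pairs);
    for &debit in &debits {
        for &credit in &credits {
            edges.push(Edge {
                from_account: debit.to_string(),
                to_account: credit.to_string(),
                confidence,
            });
        }
    }
    edges
}
```

Under these assumptions a 50-debit / 50-credit entry yields 2 500 edges with `JeNetworkMethod::Cartesian` and none with `JeNetworkMethod::A`, while a 2-line entry yields a single confidence-1.0 edge under both.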

Bundles the post-v5.8.0 customer-feedback follow-ups and the v5.8.0 model field that the CSV writer hadn't surfaced yet:

- Cargo.toml: workspace version 5.8.0 → 5.9.0 (minor bump).
- crates/datasynth-runtime/src/output_writer.rs: JE CSV now ends with `predecessor_line_id` as the 43rd column. v5.8.0 added the model field but the CSV writer kept the 42-column schema and silently dropped the value. Old column-positional consumers will need to either ignore the new column or extend their reader; the JSON output already had it via serde.
- CHANGELOG.md: restructured. All post-v5.8.0-cut entries that had been appended under the v5.8.0 heading (#183, #184, #185, #186 the silent-drop / quintillion-amount / Benford magnitude / CI-runner fixes, plus the new `graph_export.je_network.method` config option from PR #188) now live under a single v5.9.0 release entry per release-cadence guidance. v5.8.0 itself is back to its original release-day content.

## Tests

- 149/149 datasynth-config lib tests pass.
- scenario_templates_validate integration test passes (12/12 shipped templates deserialize + validate cleanly under the new schema).
- Sequential v5.9.0 release-binary regen of the HF reference config completes in 26.7 s, produces a 1 059 679-line JE CSV with 43 columns and a 61 653-edge je_network.csv (Method A); parquet conversion produces ~45 MB of staging artefacts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
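
For consumers affected by the widened CSV, one option is to look columns up by header name instead of by position, so a file with or without the 43rd column reads the same way. The sketch below is a hypothetical reader-side helper, not part of the datasynth crates; it assumes a plain comma-separated file with no quoted fields, and every column name other than `predecessor_line_id` is made up for the example.

```rust
/// Find a column's position from the header row (naive split, no quoting).
fn column_index(header: &str, name: &str) -> Option<usize> {
    header
        .trim_end_matches(|c| c == '\r' || c == '\n')
        .split(',')
        .position(|col| col.trim() == name)
}

/// Fetch one field of a data row by index (naive split, no quoting).
fn field(row: &str, index: usize) -> Option<&str> {
    row.trim_end_matches(|c| c == '\r' || c == '\n')
        .split(',')
        .nth(index)
}

/// Read `predecessor_line_id` if the file has it; `None` for older files.
fn predecessor_line_id<'a>(header: &str, row: &'a str) -> Option<&'a str> {
    let index = column_index(header, "predecessor_line_id")?;
    field(row, index)
}
```

A reader built this way returns `None` for pre-v5.9.0 files and the column value for newer ones, without shifting any other column.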

Closes a customer-feedback finding from the v5.8.0 review window: three JE columns were populated by generators independent of the master tables and therefore did not join cleanly to them.

| JE column | Pre-fix | Master | Pre-fix join |
|---|---|---|---|
| cost_center | CC1000-CC5000 (const) | 210 master ids | 0 / 5 |
| profit_center | PC-{COMP}-{P2P\|O2C\|R2R\|H2R} | 120 master ids | 0 / 51 |
| created_by | NPARKE015 (UserPool, separate) | 737 employees | 0 / 737 |

## Plumbing

JeGenerator (builder, immutable):

- with_cost_center_pool(Vec<String>) overrides const fallback
- with_profit_center_pool(Vec<String>) overrides legacy derive
- with_user_pool(UserPool) overrides auto-generated

DocumentFlowJeGenerator (mutable setters):

- set_cost_center_pool(Vec<String>)
- set_profit_center_pool(Vec<String>)

UserPool::from_employees(&[Employee]) constructor in datasynth-core so the orchestrator can build a UserPool sharing user_ids with the employees master.

JournalEntryGenerator::split now propagates cost_center_pool / profit_center_pool to sub-generators (the previous version copied vendor/customer/material pools but not the new ones, which made parallel workers fall back to the const and emit ~40 % CC1000 values on a 1M-line config).

## Per-line enrichment

Both generators' enrich_line_items pick deterministically from the company-filtered subset of the master pool (`cc_seed.wrapping_add(i) % pool.len()`), filtering by the entry's company_code so a JE for company "1000" gets a cost_center with "-1000-" in its id. Falls back to the legacy hardcoded `COST_CENTER_POOL` const only when the orchestrator did not provide a master pool, which preserves backward-compatibility for configs that skip master-data generation.

## Employee.user_id format alignment

employee_generator's user_id format was `u{counter:06}` (regular employees) and `exec{counter:04}` (executives), disjoint from UserPool's `name.to_user_id(counter)` format (e.g. NPARKE015). Both now use `name.to_user_id(counter)` so UserPool and Employee share the same vocabulary; UserPool::from_employees then builds a UserPool whose user_ids are exactly the employee_ids JE references.

## Asset generator

Asset.cost_center was hardcoded as `CC-{COMP}-ADMIN`, but the cost-centers master uses categories FIN / PROD / SALES / RD / CORP; ADMIN never appeared. Switched to `CC-{COMP}-CORP` (closest semantic match for asset ownership).

## Verified on the v5.9.0 HF reference dataset (1 M JE lines)

| Linkage | Pre-fix | Post-fix |
|---|---|---|
| JE.cost_center ↔ cost_centers.id | 0 / 5 | **210 / 210** |
| JE.profit_center ↔ profit_centers.id | 0 / 51 | **120 / 120** |
| JE.created_by ↔ employees.user_id | 0 / 737 | **675 / 737** |

The 62 created_by values that don't join are `BATCH0001-…` system users (automated postings, correctly *not* tied to a human employee record).

## Audited generators that don't need this treatment

Document-flow ↔ master linkages are already clean:

- purchase_orders.vendor_id → vendors.vendor_id ✅
- sales_orders.customer_id → customers.customer_id ✅
- *.items[].material_id → materials.material_id ✅

Hardcoded pools in treasury (COUNTERPARTIES, LENDERS, ISSUING_BANKS), audit (KAM_POOL), governance (KEY_DECISIONS, AUDIT_COMMITTEE_MATTERS), management_report (POSITIVE_COMMENTARY etc.), and legal_document (ENGAGEMENT_LETTER_TERMS etc.) are domain-specific text templates, not master-pool consumers, and need no plumbing.

## Tests

- 149/149 datasynth-config lib pass.
- 688/688 datasynth-core::models lib pass.
- 7/7 datasynth-generators::master_data::employee_generator lib pass.
- Sequential v5.9.0 release-binary regen of the HF reference config completes in 23.7 s; 1 M-row JE CSV produces a 100 % CC + 100 % PC + 92 % created_by master-data join rate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
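
The per-line pick described under "Per-line enrichment" can be sketched as follows. This is an illustration under assumptions rather than the body of `enrich_line_items`: the function `pick_cost_center` is hypothetical, the five fallback ids are guessed from the "CC1000-CC5000" range in the table, and the sketch also falls back when the master pool holds no id for the company, which the commit message does not specify.

```rust
/// Assumed contents of the legacy const (the source only gives the range).
const FALLBACK_COST_CENTERS: [&str; 5] = ["CC1000", "CC2000", "CC3000", "CC4000", "CC5000"];

/// Deterministically pick a cost center for line `i` of an entry.
fn pick_cost_center(master_pool: &[String], company_code: &str, cc_seed: u64, i: u64) -> String {
    // Keep only master ids that embed "-{company_code}-", e.g. "CC-1000-FIN".
    let marker = format!("-{company_code}-");
    let candidates: Vec<&String> = master_pool
        .iter()
        .filter(|id| id.contains(&marker))
        .collect();
    // Same arithmetic as the commit message: (cc_seed + i) mod pool length.
    let slot = cc_seed.wrapping_add(i);
    if candidates.is_empty() {
        // No usable master ids: fall back to the legacy constant pool.
        let index = (slot % FALLBACK_COST_CENTERS.len() as u64) as usize;
        FALLBACK_COST_CENTERS[index].to_string()
    } else {
        let index = (slot % candidates.len() as u64) as usize;
        candidates[index].clone()
    }
}
```

Because the pick depends only on the seed, the line index, and the filtered pool, regenerating with the same inputs reproduces the same cost centers, and every value joins to the master table whenever a pool is supplied.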

mivertowski and others added 2 commits May 8, 2026 21:16

mivertowski changed the title ~~feat(graph): add je_network method config + HF v5.8.0 dataset refresh~~ release: v5.9.0 — je_network.method config + JE CSV widen + post-v5.8.0 fixes May 8, 2026

mivertowski merged commit b7e95db into main May 8, 2026
16 checks passed

mivertowski deleted the chore/hf-dataset-v5.8.0-refresh branch May 8, 2026 22:59

mivertowski mentioned this pull request May 9, 2026

chore(hf): bake audit-p2p + supply-chain-ocel recipes; align all data-card READMEs #189

Merged

mivertowski merged 3 commits into main from chore/hf-dataset-v5.8.0-refresh
