release: v5.9.0 — je_network.method config + JE CSV widen + post-v5.8.0 fixes#188
Merged
mivertowski merged 3 commits into main on May 8, 2026
…refresh
## Background
The v5.8.0 release of `graphs/je_network.csv` defaulted to a full
Cartesian product of debit × credit lines per JE. At the
HuggingFace reference-dataset scale (1 M JE lines × 12 months × 10
companies, with `audit-group` + period-close + IC matching), this
produced **225 M edges totalling 52 GB of CSV** — too large to
publish. Each multi-line consolidation contributes O(n × m)
edges, so a 50-debit / 50-credit period-close entry alone yields
2 500 edges, and the long tail of high-cardinality JEs (intercompany
batches, opening balances) dominates the total.
Methods A through E from Ivertowski (2024)
*Hardware Accelerated Method for Accounting Network Generation*
recognise this combinatorial blow-up explicitly: Method A is the
bijective `2-line → 1-edge` case (≈ 60 % of postings per the
paper) and Methods B–E enumerate progressively lossier choices for
the rest.
## What this commit adds
A new opt-in YAML field `graph_export.je_network.method` lets
configs choose between:
- `cartesian` (default) — current v5.8.0 behaviour, full
Cartesian product. Backward-compatible.
- `a` — Method A from the 2024 paper: emit exactly one edge per
2-line journal entry and skip multi-line entries. Edge count
drops from O(n × m × |JEs|) to O(|2-line JEs|); confidence is
exactly 1.0 on every edge.
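The difference between the two methods can be sketched per journal entry as follows. This is a minimal illustration, not the writer's actual code: the `Method` enum and `edges_for_entry` function are hypothetical names, and lines are represented only by their indices.

```rust
/// Hypothetical mirror of the config enum; the real type lives in
/// `datasynth-config` and may differ in shape.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Method {
    Cartesian,
    A,
}

/// Returns (debit_line, credit_line) index pairs for one journal entry.
/// `debits` and `credits` hold the line indices on each side.
pub fn edges_for_entry(
    debits: &[usize],
    credits: &[usize],
    method: Method,
) -> Vec<(usize, usize)> {
    match method {
        // Every debit line is paired with every credit line: n * m edges.
        Method::Cartesian => debits
            .iter()
            .flat_map(|&d| credits.iter().map(move |&c| (d, c)))
            .collect(),
        // Only the bijective 2-line case (one debit, one credit) yields an edge;
        // multi-line entries are skipped entirely.
        Method::A => match (debits, credits) {
            ([d], [c]) => vec![(*d, *c)],
            _ => Vec::new(),
        },
    }
}
```

Under this sketch a 50-debit / 50-credit entry produces 2 500 pairs with `Method::Cartesian` and none with `Method::A`, while a 2-line entry produces the same single edge under both.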
## Implementation
- `datasynth-config/src/schema.rs`: new `JeNetworkMethod` enum
(`cartesian` | `a`) and `JeNetworkConfig` struct, embedded in
`GraphExportConfig.je_network` with `#[serde(default)]`.
- `datasynth-runtime/src/output_writer.rs::write_je_network_csv`:
takes the method as a parameter, skips multi-line JEs when
`method == A`. Public entrypoints (`write_all_output`,
`write_all_output_with_root`, `write_all_output_with_layout`)
forward the value, defaulting to `JeNetworkMethod::default()`
(= Cartesian) so callers that don't care preserve old behaviour.
- `datasynth-cli/src/main.rs`: passes
`generator_config.graph_export.je_network.method` to the writer.
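The defaulting behaviour described above can be sketched without serde. This is an assumption-laden stand-in: the real enum is deserialised through serde with `#[serde(default)]`, whereas this sketch uses `FromStr`, and `resolve_method` is a hypothetical helper that plays the role of an absent YAML field.

```rust
/// Hypothetical stand-in for `JeNetworkMethod`. The real enum is
/// deserialised with serde; `FromStr` is used here to stay std-only.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub enum JeNetworkMethod {
    /// Default variant, so existing configs keep the v5.8.0 behaviour.
    #[default]
    Cartesian,
    A,
}

impl std::str::FromStr for JeNetworkMethod {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "cartesian" => Ok(Self::Cartesian),
            "a" => Ok(Self::A),
            other => Err(format!("unknown je_network method: {other}")),
        }
    }
}

/// Resolves an optional config value: an absent field falls back to
/// `JeNetworkMethod::default()`, an unknown value is an error.
pub fn resolve_method(raw: Option<&str>) -> Result<JeNetworkMethod, String> {
    match raw {
        Some(s) => s.parse(),
        None => Ok(JeNetworkMethod::default()),
    }
}
```

Keeping `Cartesian` as the `Default` variant is what lets callers that never mention the method preserve the old output.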
## HF dataset refresh (v5.5.1 → v5.8.0)
`configs/examples/hf/journal_entries_1m.yaml`:
- Header bumped to v5.8.0; documents the v5.6.0/5.7.0/5.8.0
additions (ISO 21378 fields, industry-pack expansion option,
je_network export, Benford magnitude fix, distribution
silent-drop closure).
- New `graph_export.je_network.method: a` block — keeps the
edge file shippable.
`scripts/hf_to_parquet.py`:
- New `je_network_csv_to_parquet()` step that converts the
edge list to a single `je_network.parquet` (typically
< 10 MB at Method A; < 1 GB at Cartesian compressed).
`configs/examples/hf/README.md`: documents the new
`je_network.parquet` artefact and its schema.
## Measured impact (1 M JE-line config, audit-group + 12 months + 10 companies)
| Metric | Cartesian (default) | Method A |
|-----------------------|---------------------|---------------|
| Edges | 225 020 023 | **61 654** |
| Edge-file size (CSV) | 52 GB | **13 MB** |
| Edge-file size (parq.)| ~7 GB est. | **5.6 MB** |
| Generation time | 16 m 46 s | **24 s** |
| Per-edge confidence | 1.0 / 1.0/(n·m) | **1.0** (all) |
Total HF staging directory (`hf_staging/`) at Method A:
**~45 MB** (3-shard JE 39 MB + je_network 5.6 MB + COA / TB /
cost-centres / profit-centres ~210 KB).
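As a worked check on the table, the per-edge confidence and the edge-count reduction can be computed directly. The sketch below reads the `1.0 / 1.0/(n·m)` cell as "1.0 for 2-line entries, otherwise 1/(n·m)", which is an interpretation rather than something the table states outright; both function names are hypothetical.

```rust
/// Per-edge confidence for an entry with `n` debit and `m` credit lines
/// under the Cartesian method, read as 1 / (n * m).
/// Returns None when either side is empty (no edges exist).
pub fn cartesian_edge_confidence(n: u64, m: u64) -> Option<f64> {
    if n == 0 || m == 0 {
        return None;
    }
    Some(1.0 / (n as f64 * m as f64))
}

/// Ratio between two edge counts, e.g. Cartesian vs Method A.
pub fn reduction_factor(before: u64, after: u64) -> f64 {
    before as f64 / after as f64
}
```

With the measured counts, `reduction_factor(225_020_023, 61_654)` comes out at roughly 3 650, and a 50 × 50 period-close entry gets a per-edge confidence of 0.0004.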
## Tests
- `cargo test -p datasynth-config --lib` — 149/149 pass.
- `cargo test -p datasynth-config --test scenario_templates_validate`
— 12/12 templates still parse cleanly.
- Round-trip regen + parquet conversion produces a valid HF-shaped
staging tree (this is also the verification artefact for the
`je_network` cardinality / confidence claims above).
## CHANGELOG
Attached under v5.8.0 per release-cadence guidance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundles the post-v5.8.0 customer-feedback follow-ups and the v5.8.0
model field that the CSV writer hadn't surfaced yet:
- Cargo.toml: workspace version 5.8.0 → 5.9.0 (minor bump).
- crates/datasynth-runtime/src/output_writer.rs: JE CSV now ends
with `predecessor_line_id` as the 43rd column. v5.8.0 added the
model field but the CSV writer kept the 42-column schema and
silently dropped the value. Old column-positional consumers will
need to either ignore the new column or extend their reader; the
JSON output already had it via serde.
- CHANGELOG.md: restructured. All post-v5.8.0-cut entries that had
been appended under the v5.8.0 heading (#183, #184, #185, #186 the
silent-drop / quintillion-amount / Benford magnitude / CI-runner
fixes, plus the new `graph_export.je_network.method` config option
from PR #188) now live under a single v5.9.0 release entry per
release-cadence guidance. v5.8.0 itself is back to its original
release-day content.
## Tests
- 149/149 datasynth-config lib tests pass.
- scenario_templates_validate integration test passes (12/12
shipped templates deserialize + validate cleanly under the new
schema).
- Sequential v5.9.0 release-binary regen of the HF reference config
completes in 26.7 s, produces a 1 059 679-line JE CSV with 43
columns and a 61 653-edge je_network.csv (Method A); parquet
conversion produces ~45 MB of staging artefacts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
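For consumers affected by the 42 → 43 column change, one option is to look columns up by header name rather than by position. The sketch below is a hypothetical reader, not part of the release: it splits naively on commas (so it does not handle quoted fields), and every column name other than `predecessor_line_id` is made up for illustration.

```rust
/// Finds a column's position by name instead of assuming a fixed layout.
pub fn column_index(header_line: &str, name: &str) -> Option<usize> {
    header_line
        .trim_end()
        .split(',')
        .position(|col| col.trim() == name)
}

/// Reads one field from a data row. Returns None if the column is absent,
/// e.g. in a pre-v5.9.0 file without `predecessor_line_id`.
/// Naive comma split: quoted fields containing commas are not supported.
pub fn field<'a>(header_line: &str, row: &'a str, name: &str) -> Option<&'a str> {
    let idx = column_index(header_line, name)?;
    row.trim_end().split(',').nth(idx)
}
```

A reader written this way returns `None` for `predecessor_line_id` on a v5.8.0 file and the trailing value on a v5.9.0 file, without any change to how the other columns are read.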
Closes a customer-feedback finding from the v5.8.0 review window:
three JE columns were populated by generators independent of the
master tables and therefore did not join cleanly to them.
| JE column | Pre-fix value | Master | Pre-fix join |
|---|---|---|---|
| cost_center | CC1000-CC5000 (const) | 210 master ids | 0 / 5 |
| profit_center | `PC-{COMP}-{P2P\|O2C\|R2R\|H2R}` | 120 master ids | 0 / 51 |
| created_by | NPARKE015 (UserPool, separate) | 737 employees | 0 / 737 |
## Plumbing
JeGenerator (builder, immutable):
- with_cost_center_pool(Vec<String>) overrides const fallback
- with_profit_center_pool(Vec<String>) overrides legacy derive
- with_user_pool(UserPool) overrides auto-generated
DocumentFlowJeGenerator (mutable setters):
- set_cost_center_pool(Vec<String>)
- set_profit_center_pool(Vec<String>)
UserPool::from_employees(&[Employee]) constructor in datasynth-core
so the orchestrator can build a UserPool sharing user_ids with
the employees master.
JournalEntryGenerator::split now propagates cost_center_pool /
profit_center_pool to sub-generators (the previous version copied
vendor/customer/material pools but not the new ones, which made
parallel workers fall back to the const and emit ~40 % CC1000
values on a 1M-line config).
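The `split` bug described above is easy to reintroduce whenever a new pool is added. One way to guard against it, sketched here with hypothetical types rather than the real `JournalEntryGenerator`, is to group all pools in a single struct that is cloned as a unit.

```rust
/// Hypothetical grouping of every master-data pool a generator uses.
#[derive(Clone, Debug, Default)]
pub struct PoolSet {
    pub vendors: Vec<String>,
    pub cost_centers: Vec<String>,
    pub profit_centers: Vec<String>,
}

/// Hypothetical generator; only the parts relevant to splitting are shown.
#[derive(Clone, Debug)]
pub struct Generator {
    pub seed: u64,
    pub pools: PoolSet,
}

impl Generator {
    /// Splits into `parts` sub-generators with distinct seeds. The whole
    /// `PoolSet` is cloned, so no worker falls back to a hardcoded default.
    pub fn split(&self, parts: u64) -> Vec<Generator> {
        (0..parts)
            .map(|i| Generator {
                seed: self.seed.wrapping_add(i),
                pools: self.pools.clone(),
            })
            .collect()
    }
}
```

Whether the real generator stores its pools this way is not stated in the commit; the point of the sketch is only that every sub-generator must see the same cost-center and profit-center pools as its parent.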
## Per-line enrichment
Both generators' enrich_line_items pick deterministically from
the company-filtered subset of the master pool
(`cc_seed.wrapping_add(i) % pool.len()`), filtering by the entry's
company_code so a JE for company "1000" gets a cost_center with
"-1000-" in its id. Falls back to the legacy hardcoded
`COST_CENTER_POOL` const only when the orchestrator did not provide
a master pool — preserves backward-compatibility for configs that
skip master-data generation.
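The selection rule above can be sketched as a standalone function. The names `pick_cost_center` and `FALLBACK_POOL` are hypothetical, the five fallback values are guessed from the "CC1000-CC5000" range, and falling back when no id matches the company (rather than only when the pool is empty) is an assumption of this sketch.

```rust
/// Assumed contents of the legacy const; only the CC1000-CC5000 range
/// is given in the commit message.
const FALLBACK_POOL: [&str; 5] = ["CC1000", "CC2000", "CC3000", "CC4000", "CC5000"];

/// Picks a cost center deterministically for line `i` of an entry.
/// Master ids are filtered to those containing "-{company_code}-".
pub fn pick_cost_center(pool: &[String], company_code: &str, cc_seed: u64, i: u64) -> String {
    let needle = format!("-{company_code}-");
    let subset: Vec<&String> = pool.iter().filter(|id| id.contains(&needle)).collect();

    if subset.is_empty() {
        // No master pool (or no id for this company): legacy fallback.
        let idx = (cc_seed.wrapping_add(i) % FALLBACK_POOL.len() as u64) as usize;
        return FALLBACK_POOL[idx].to_string();
    }

    // Same seed and line index always select the same master id.
    let idx = (cc_seed.wrapping_add(i) % subset.len() as u64) as usize;
    subset[idx].clone()
}
```

Because the index depends only on `cc_seed` and `i`, regenerating with the same seed reproduces the same cost-center assignment, and a JE for company "1000" can only receive ids that contain "-1000-".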
## Employee.user_id format alignment
employee_generator's user_id format was `u{counter:06}` (regular
employees) and `exec{counter:04}` (executives) — disjoint from
UserPool's `name.to_user_id(counter)` format (e.g. NPARKE015).
Both now use `name.to_user_id(counter)` so UserPool and Employee
share the same vocabulary; UserPool::from_employees then builds a
UserPool whose user_ids are exactly the employee user_ids that JEs
reference in created_by.
## Asset generator
Asset.cost_center was hardcoded as `CC-{COMP}-ADMIN`, but the
cost-centers master uses categories FIN / PROD / SALES / RD /
CORP — ADMIN never appeared. Switched to `CC-{COMP}-CORP`
(closest semantic match for asset ownership).
## Verified on the v5.9.0 HF reference dataset (1 M JE lines)
| Linkage | Pre-fix | Post-fix |
|----------------------------------------|---------|--------------|
| JE.cost_center ↔ cost_centers.id | 0 / 5 | **210 / 210**|
| JE.profit_center↔ profit_centers.id | 0 / 51 | **120 / 120**|
| JE.created_by ↔ employees.user_id | 0 / 737 | **675 / 737**|
The 62 created_by values that don't join are `BATCH0001-…` system
users (automated postings, correctly *not* tied to a human
employee record).
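A linkage check of this kind can be reproduced with a small set comparison. The function below is a hypothetical sketch, and it assumes the table counts distinct JE values (matched / total distinct), which fits the figures but is not spelled out in the commit. The ids in the usage note are illustrative.

```rust
use std::collections::HashSet;

/// Returns (matched, total) over the distinct values in `column`,
/// where a value matches if it appears in `master_ids`.
pub fn join_rate(column: &[&str], master_ids: &[&str]) -> (usize, usize) {
    let master: HashSet<&str> = master_ids.iter().copied().collect();
    let distinct: HashSet<&str> = column.iter().copied().collect();
    let matched = distinct.iter().filter(|v| master.contains(*v)).count();
    (matched, distinct.len())
}
```

For a `created_by` column holding `NPARKE015`, `NPARKE015`, `BATCH0001`, `JDOE001` and an employee master holding `NPARKE015`, `JDOE001`, `ASMITH002`, this returns `(2, 3)`: the batch user is the one value that does not join.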
## Audited generators that don't need this treatment
Document-flow ↔ master linkages are already clean:
- purchase_orders.vendor_id → vendors.vendor_id ✅
- sales_orders.customer_id → customers.customer_id ✅
- *.items[].material_id → materials.material_id ✅
Hardcoded pools in treasury (COUNTERPARTIES, LENDERS,
ISSUING_BANKS), audit (KAM_POOL), governance (KEY_DECISIONS,
AUDIT_COMMITTEE_MATTERS), management_report (POSITIVE_COMMENTARY
etc.), and legal_document (ENGAGEMENT_LETTER_TERMS etc.) are
domain-specific text templates, not master-pool consumers, and
need no plumbing.
## Tests
- 149/149 datasynth-config lib pass.
- 688/688 datasynth-core::models lib pass.
- 7/7 datasynth-generators::master_data::employee_generator lib
pass.
- Sequential v5.9.0 release-binary regen of the HF reference
config completes in 23.7 s; 1 M-row JE CSV produces a 100 %
CC + 100 % PC + 92 % created_by master-data join rate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v5.9.0 minor release bundling the customer-feedback follow-ups
that landed on top of v5.8.0 plus the JE CSV widening for
`predecessor_line_id` (model field added in v5.8.0 but never
surfaced through CSV).
## Contents
### Added: `graph_export.je_network.method` config option
Two methods exposed:
- `cartesian` (default): full Cartesian product of debit × credit
lines per JE. Backward-compatible with v5.8.0 behaviour.
- `a`: Method A from Ivertowski (2024), exactly one edge per
2-line JE; multi-line entries skipped. Confidence = 1.0.

At the HF reference-dataset scale (1 M JE lines) the default
Cartesian method produced ≈225 M edges totalling ~52 GB. Method A
drops that to ≈62 k edges (~13 MB of CSV) while preserving the
exact-1.0-confidence cohort and matching the 2024 paper exactly.
The HF dataset config now ships with `method: a`.
### Added: `predecessor_line_id` column on JE CSV
v5.8.0 added the model field but only JSON exposed it. The CSV
header is now 43 columns ending with `predecessor_line_id`.
### Fixed: silent-drop drift in distribution structs (#183)
11 of 12 shipped templates had silently-dropped fields. Schema
extensions + `#[serde(deny_unknown_fields)]` + per-field defaults;
validation walks every template.
### Fixed: variance overflow in `unusual_item_generator` (#183)
Defence-in-depth guard against `rust_decimal` overflow on
account-level balances above 10^15.
### Fixed: Benford magnitude sampling (#185)
`BenfordSampler::sample_with_first_digit` and
`EnhancedBenfordSampler::sample_with_digits` now derive
order-of-magnitude from the parent log-normal instead of
sampling uniformly across `[log10(min), log10(max)]`.
### Fixed: quintillion-scale amounts in BenfordViolationStrategy (#185)
Magnitude was computed from the full Decimal `to_string()` (decimal
point stripped), producing `target × 10^18` fraud amounts on
routine inputs. Now uses `f64::log10().floor()` capped at 12.
### Fixed: CI runner sizing (#184)
ubuntu-latest runs `--lib --bins` only; the full integration suite
runs on macOS/Windows. Same restriction on the `Code Coverage` job.
### Bumped: workspace version 5.8.0 → 5.9.0
All workspace `Cargo.toml` references updated. CHANGELOG
restructured: post-cut entries that had accumulated under the
v5.8.0 heading now live under a clean v5.9.0 release entry.
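The quintillion-scale fix (#185) listed above can be illustrated with a small comparison. The pre-fix function is a guess at how a digit-counting magnitude could reach 10^18 on an ordinary amount; it is not copied from the repository, and both function names are hypothetical. The post-fix function follows the description: floor of log10, capped at 12.

```rust
/// Assumed shape of the pre-fix logic: count every digit in the decimal's
/// string form, fractional digits included, and treat that as magnitude.
pub fn magnitude_from_digits(decimal_text: &str) -> i32 {
    let digits = decimal_text.chars().filter(|c| c.is_ascii_digit()).count() as i32;
    digits - 1
}

/// Post-fix logic as described: floor(log10(x)), capped at 12.
/// Non-positive or non-finite inputs are mapped to 0 in this sketch.
pub fn magnitude_from_log10(x: f64) -> i32 {
    if x <= 0.0 || !x.is_finite() {
        return 0;
    }
    (x.log10().floor() as i32).min(12)
}
```

For the text `1234.567890123456789` the digit-counting version yields 18, because the fractional digits are counted too, while the log-based version yields 3 for the same value.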
## Test plan
- `scenario_templates_validate` integration test passes
(12/12 shipped templates deserialize + validate cleanly).
- Sequential v5.9.0 release-binary regen of the HF reference
config completes in 26.7 s, produces a 1 059 679-line JE CSV
with 43 columns + a 61 653-edge je_network.csv (Method A).
🤖 Generated with Claude Code