release: v5.9.0 — je_network.method config + JE CSV widen + post-v5.8.0 fixes #188

Merged
mivertowski merged 3 commits into main from chore/hf-dataset-v5.8.0-refresh
May 8, 2026

@mivertowski mivertowski commented May 8, 2026

v5.9.0 minor release bundling the customer-feedback follow-ups
that landed on top of v5.8.0 plus the JE CSV widening for
predecessor_line_id (model field added in v5.8.0 but never
surfaced through CSV).

Contents

Added — graph_export.je_network.method config option

Two methods exposed:

  • cartesian (default) — full Cartesian product of debit × credit
    lines per JE. Backward-compatible with v5.8.0 behaviour.
  • a — Method A from Ivertowski (2024): exactly one edge per
    2-line JE; multi-line entries skipped. Confidence = 1.0.

At the HF reference-dataset scale (1 M JE lines) the default
Cartesian method produced ≈225 M edges totalling ~52 GB. Method A
drops that to ≈61.7 k edges (~13 MB of CSV) while preserving the
exact-1.0-confidence cohort and matching the 2024 paper exactly.
The HF dataset config now ships with method: a.

Added — predecessor_line_id column on JE CSV

v5.8.0 added the model field but only JSON exposed it. CSV
header is now 43 columns ending with predecessor_line_id.

Fixed — silent-drop drift in distribution structs (#183)

11 of 12 shipped templates had silently-dropped fields. Schema
extensions + #[serde(deny_unknown_fields)] + per-field defaults
+ improved validator error messages + new regression test that
walks every template.

Fixed — variance overflow in unusual_item_generator (#183)

Defence-in-depth guard against rust_decimal overflow on
account-level balances above 10^15.

Fixed — Benford magnitude sampling (#185)

BenfordSampler::sample_with_first_digit and
EnhancedBenfordSampler::sample_with_digits now derive
order-of-magnitude from the parent log-normal instead of
sampling uniformly across [log10(min), log10(max)].
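
The idea can be illustrated with a minimal sketch. It assumes the caller has already drawn `parent_draw` from the parent log-normal; the function name and signature here are hypothetical and not the actual `BenfordSampler` API. The order of magnitude comes from the draw itself, and only the leading digit is replaced.

```rust
/// Hypothetical helper: force `first_digit` onto a value drawn from the
/// parent distribution while keeping that draw's order of magnitude.
/// Returns None for an invalid digit or a non-positive / non-finite draw.
/// Rounding at exact powers of ten is not specially handled in this sketch.
pub fn with_first_digit(parent_draw: f64, first_digit: u8) -> Option<f64> {
    if !(1..=9).contains(&first_digit) || !parent_draw.is_finite() || parent_draw <= 0.0 {
        return None;
    }
    // Order of magnitude taken from the parent draw, not from a uniform
    // sample over [log10(min), log10(max)].
    let magnitude = parent_draw.log10().floor() as i32;
    let scale = 10f64.powi(magnitude);
    // Mantissa lies in [1, 10); its fractional part supplies the trailing digits.
    let mantissa = parent_draw / scale;
    Some((f64::from(first_digit) + mantissa.fract()) * scale)
}
```

Under these assumptions a parent draw of 2766.0 with first digit 7 stays in the thousands (about 7766) instead of landing anywhere in the configured range.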

Fixed — quintillion-scale amounts in BenfordViolationStrategy (#185)

Magnitude was computed from full Decimal to_string() (decimal
point stripped), producing target × 10^18 fraud amounts on
routine inputs. Now uses f64::log10().floor() capped at 12.

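| Customer mult_overflow repro | Before        | After         |
|------------------------------|---------------|---------------|
| Max line debit               | $2.32 × 10^19 | $24.58 × 10^6 |
| Mean line amount             | $2.55 × 10^14 | $2,766        |
| Lines > $10B                 | 58            | 0             |
| Materiality (entity 1000)    | $3.69 × 10^16 | $391 K        |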
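
A rough sketch of the failure mode and the fix follows. Both function names are hypothetical, the buggy variant is a reconstruction from the description above rather than the original code, and the input is assumed to be a decimal rendered with 18 fractional digits. Treating amounts below 1 as magnitude 0 is also an assumption.

```rust
/// Reconstructed buggy variant: counts every digit of the rendered decimal,
/// so fractional digits are mistaken for integer digits.
pub fn magnitude_from_rendered_digits(rendered: &str) -> u32 {
    let digits = rendered.chars().filter(|c| c.is_ascii_digit()).count() as u32;
    digits.saturating_sub(1)
}

/// Fixed variant: floor(log10(amount)), capped at 12.
/// Amounts below 1 (or non-finite) map to magnitude 0 in this sketch.
pub fn magnitude_capped(amount: f64) -> u32 {
    if !amount.is_finite() || amount < 1.0 {
        return 0;
    }
    (amount.log10().floor() as u32).min(12)
}
```

For 1234.5 rendered with 18 fractional digits, the digit-count variant reports magnitude 21 while the capped variant reports 3. That 18-order gap is consistent with the target × 10^18 amounts described above.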

Fixed — CI runner sizing (#184)

ubuntu-latest runs --lib --bins only; full integration suite
runs on macOS/Windows. Same restriction on Code Coverage job.

Bumped — workspace version 5.8.0 → 5.9.0

All workspace Cargo.toml references updated. CHANGELOG
restructured: post-cut entries that had accumulated under the
v5.8.0 heading now live under a clean v5.9.0 release entry.

Test plan

  • 149/149 datasynth-config lib tests pass.
  • scenario_templates_validate integration test passes
    (12/12 shipped templates deserialize + validate cleanly).
  • Sequential v5.9.0 release-binary regen of the HF reference
    config completes in 26.7 s, produces 1 059 679-line JE CSV
    with 43 columns + 61 653-edge je_network.csv (Method A).
  • Parquet conversion produces 6 artefacts totalling ~45 MB.

🤖 Generated with Claude Code

mivertowski and others added 2 commits May 8, 2026 21:16
…refresh

## Background

The v5.8.0 release of `graphs/je_network.csv` defaulted to a full
Cartesian product of debit × credit lines per JE.  At the
HuggingFace reference-dataset scale (1 M JE lines × 12 months × 10
companies, with `audit-group` + period-close + IC matching), this
produced **225 M edges totalling 52 GB of CSV** — too large to
publish.  Each multi-line consolidation contributes O(n × m)
edges, so a 50-debit / 50-credit period-close entry alone yields
2 500 edges, and the long tail of high-cardinality JEs (intercompany
batches, opening balances) dominates the total.

Methods A through E from Ivertowski (2024)
*Hardware Accelerated Method for Accounting Network Generation*
recognise this combinatorial blow-up explicitly: Method A is the
bijective `2-line → 1-edge` case (≈ 60 % of postings per the
paper) and Methods B–E enumerate progressively lossier choices for
the rest.

## What this commit adds

A new opt-in YAML field `graph_export.je_network.method` lets
configs choose between:

  - `cartesian` (default) — current v5.8.0 behaviour, full
    Cartesian product. Backward-compatible.
  - `a` — Method A from the 2024 paper: emit exactly one edge per
    2-line journal entry and skip multi-line entries. Edge count
    drops from O(n × m × |JEs|) to O(|2-line JEs|); confidence is
    exactly 1.0 on every edge.
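
The difference between the two methods can be sketched as follows. The `Line` struct and `edges_for_entry` function are hypothetical simplifications, not the code in `output_writer.rs`, and "2-line JE" is interpreted here as exactly one debit line plus one credit line.

```rust
/// Mirrors the `cartesian` | `a` config values; Cartesian is the default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum JeNetworkMethod {
    #[default]
    Cartesian,
    A,
}

/// Hypothetical, simplified journal-entry line.
#[derive(Debug, Clone)]
pub struct Line {
    pub account: String,
    pub is_debit: bool,
}

/// Returns (debit account, credit account) edges for a single journal entry.
pub fn edges_for_entry(lines: &[Line], method: JeNetworkMethod) -> Vec<(String, String)> {
    let debits: Vec<&Line> = lines.iter().filter(|l| l.is_debit).collect();
    let credits: Vec<&Line> = lines.iter().filter(|l| !l.is_debit).collect();

    // Method A keeps only entries with exactly one debit and one credit line.
    if method == JeNetworkMethod::A && !(debits.len() == 1 && credits.len() == 1) {
        return Vec::new();
    }

    // Otherwise emit every debit x credit pair: n * m edges per entry.
    let mut edges = Vec::with_capacity(debits.len() * credits.len());
    for d in &debits {
        for c in &credits {
            edges.push((d.account.clone(), c.account.clone()));
        }
    }
    edges
}
```

With this sketch a 50-debit / 50-credit entry yields 2 500 edges under `Cartesian` and none under `A`, while a plain 2-line entry yields one edge under either method.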

## Implementation

- `datasynth-config/src/schema.rs`: new `JeNetworkMethod` enum
  (`cartesian` | `a`) and `JeNetworkConfig` struct, embedded in
  `GraphExportConfig.je_network` with `#[serde(default)]`.
- `datasynth-runtime/src/output_writer.rs::write_je_network_csv`:
  takes the method as a parameter, skips multi-line JEs when
  `method == A`. Public entrypoints (`write_all_output`,
  `write_all_output_with_root`, `write_all_output_with_layout`)
  forward the value, defaulting to `JeNetworkMethod::default()`
  (= Cartesian) so callers that don't care preserve old behaviour.
- `datasynth-cli/src/main.rs`: passes
  `generator_config.graph_export.je_network.method` to the writer.

## HF dataset refresh (v5.5.1 → v5.8.0)

`configs/examples/hf/journal_entries_1m.yaml`:

  - Header bumped to v5.8.0; documents the v5.6.0/5.7.0/5.8.0
    additions (ISO 21378 fields, industry-pack expansion option,
    je_network export, Benford magnitude fix, distribution
    silent-drop closure).
  - New `graph_export.je_network.method: a` block — keeps the
    edge file shippable.

`scripts/hf_to_parquet.py`:

  - New `je_network_csv_to_parquet()` step that converts the
    edge list to a single `je_network.parquet` (typically
    < 10 MB at Method A; < 1 GB at Cartesian compressed).

`configs/examples/hf/README.md`: documents the new
`je_network.parquet` artefact and its schema.

## Measured impact (1 M JE-line config, audit-group + 12 months + 10 companies)

| Metric                | Cartesian (default) | Method A      |
|-----------------------|---------------------|---------------|
| Edges                 | 225 020 023         | **61 654**    |
| Edge-file size (CSV)  | 52 GB               | **13 MB**     |
| Edge-file size (parq.)| ~7 GB est.          | **5.6 MB**    |
| Generation time       | 16 m 46 s           | **24 s**      |
| Per-edge confidence   | 1.0 / 1.0/(n·m)     | **1.0** (all) |

Total HF staging directory (`hf_staging/`) at Method A:
**~45 MB** (3-shard JE 39 MB + je_network 5.6 MB + COA / TB /
cost-centres / profit-centres ~210 KB).

## Tests

- `cargo test -p datasynth-config --lib` — 149/149 pass.
- `cargo test -p datasynth-config --test scenario_templates_validate`
  — 12/12 templates still parse cleanly.
- Round-trip regen + parquet conversion produces a valid HF-shaped
  staging tree (this is also the verification artefact for the
  `je_network` cardinality / confidence claims above).

## CHANGELOG

Attached under v5.8.0 per release-cadence guidance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bundles the post-v5.8.0 customer-feedback follow-ups and the
v5.8.0 model field that the CSV writer hadn't surfaced yet:

- Cargo.toml: workspace version 5.8.0 → 5.9.0 (minor bump).
- crates/datasynth-runtime/src/output_writer.rs: JE CSV now
  ends with `predecessor_line_id` as the 43rd column.  v5.8.0
  added the model field but the CSV writer kept the 42-column
  schema and silently dropped the value.  Old column-positional
  consumers will need to either ignore the new column or extend
  their reader; the JSON output already had it via serde.
- CHANGELOG.md: restructured.  All post-v5.8.0-cut entries that
  had been appended under the v5.8.0 heading (#183, #184, #185,
  #186: the silent-drop, quintillion-amount, Benford-magnitude,
  and CI-runner fixes, plus the new `graph_export.je_network.method`
  config option from PR #188) now live under a single v5.9.0
  release entry per release-cadence guidance.  v5.8.0 itself is
  back to its original release-day content.

## Tests

- 149/149 datasynth-config lib tests pass.
- scenario_templates_validate integration test passes (12/12
  shipped templates deserialize + validate cleanly under the
  new schema).
- Sequential v5.9.0 release-binary regen of the HF reference
  config completes in 26.7 s, produces a 1 059 679-line JE CSV
  with 43 columns and a 61 653-edge je_network.csv (Method A);
  parquet conversion produces ~45 MB of staging artefacts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@mivertowski mivertowski changed the title from "feat(graph): add je_network method config + HF v5.8.0 dataset refresh" to "release: v5.9.0 — je_network.method config + JE CSV widen + post-v5.8.0 fixes" May 8, 2026

Closes a customer-feedback finding from the v5.8.0 review window:
three JE columns were populated by generators independent of the
master tables and therefore did not join cleanly to them.

| JE column     | Pre-fix value                    | Master         | Pre-fix join |
|---------------|----------------------------------|----------------|--------------|
| cost_center   | CC1000-CC5000 (const)            | 210 master ids | 0 / 5        |
| profit_center | PC-{COMP}-{P2P\|O2C\|R2R\|H2R}   | 120 master ids | 0 / 51       |
| created_by    | NPARKE015 (UserPool, separate)   | 737 employees  | 0 / 737      |

## Plumbing

JeGenerator (builder, immutable):
  - with_cost_center_pool(Vec<String>)  overrides const fallback
  - with_profit_center_pool(Vec<String>) overrides legacy derive
  - with_user_pool(UserPool)             overrides auto-generated

DocumentFlowJeGenerator (mutable setters):
  - set_cost_center_pool(Vec<String>)
  - set_profit_center_pool(Vec<String>)

UserPool::from_employees(&[Employee]) constructor in datasynth-core
so the orchestrator can build a UserPool sharing user_ids with
the employees master.

JournalEntryGenerator::split now propagates cost_center_pool /
profit_center_pool to sub-generators (the previous version copied
vendor/customer/material pools but not the new ones, which made
parallel workers fall back to the const and emit ~40 % CC1000
values on a 1M-line config).
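
The class of bug fixed in `split` can be sketched with a simplified generator. The `JeGen` struct, its fields, and the seed derivation are hypothetical stand-ins, not the real `JournalEntryGenerator`. The point is that every optional pool has to be copied to each sub-generator, otherwise workers fall back silently.

```rust
use std::sync::Arc;

/// Hypothetical, stripped-down generator holding shared master pools.
#[derive(Clone, Default)]
pub struct JeGen {
    pub seed: u64,
    pub vendor_pool: Arc<Vec<String>>,
    pub cost_center_pool: Option<Arc<Vec<String>>>,
    pub profit_center_pool: Option<Arc<Vec<String>>>,
}

impl JeGen {
    /// Splits into `parts` sub-generators for parallel workers.
    /// All pools are forwarded, including the optional ones.
    pub fn split(&self, parts: u64) -> Vec<JeGen> {
        (0..parts)
            .map(|i| JeGen {
                // Illustrative per-worker seed derivation (assumption).
                seed: self.seed.wrapping_add(i),
                vendor_pool: Arc::clone(&self.vendor_pool),
                cost_center_pool: self.cost_center_pool.clone(),
                profit_center_pool: self.profit_center_pool.clone(),
            })
            .collect()
    }
}
```

Cloning an `Option<Arc<_>>` only bumps a reference count, so forwarding the pools to every worker does not copy the underlying master data.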

## Per-line enrichment

Both generators' enrich_line_items pick deterministically from
the company-filtered subset of the master pool
(`cc_seed.wrapping_add(i) % pool.len()`), filtering by the entry's
company_code so a JE for company "1000" gets a cost_center with
"-1000-" in its id.  Falls back to the legacy hardcoded
`COST_CENTER_POOL` const only when the orchestrator did not provide
a master pool — preserves backward-compatibility for configs that
skip master-data generation.
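
The selection rule above can be sketched as a free function. The function name, the exact contents of the fallback const, and the behaviour when the company-filtered subset is empty (falling back to the const) are assumptions made for illustration.

```rust
/// Legacy fallback pool; the exact values are assumed from "CC1000-CC5000".
const COST_CENTER_POOL: [&str; 5] = ["CC1000", "CC2000", "CC3000", "CC4000", "CC5000"];

/// Picks a cost center for line `i` of an entry posted to `company_code`.
pub fn pick_cost_center(
    master_pool: Option<&[String]>,
    company_code: &str,
    cc_seed: u64,
    i: u64,
) -> String {
    // Deterministic slot: same seed and line index always give the same pick.
    let slot = cc_seed.wrapping_add(i);

    if let Some(pool) = master_pool {
        // Keep only master ids that belong to this company, e.g. "-1000-".
        let needle = format!("-{}-", company_code);
        let subset: Vec<&String> = pool.iter().filter(|id| id.contains(&needle)).collect();
        if !subset.is_empty() {
            return subset[(slot % subset.len() as u64) as usize].clone();
        }
    }

    // No master pool (or, in this sketch, no ids for the company): legacy const.
    COST_CENTER_POOL[(slot % COST_CENTER_POOL.len() as u64) as usize].to_string()
}
```

Using `wrapping_add` keeps the index computation defined even when the seed is close to `u64::MAX`.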

## Employee.user_id format alignment

employee_generator's user_id format was `u{counter:06}` (regular
employees) and `exec{counter:04}` (executives) — disjoint from
UserPool's `name.to_user_id(counter)` format (e.g. NPARKE015).
Both now use `name.to_user_id(counter)` so UserPool and Employee
share the same vocabulary; UserPool::from_employees then builds a
UserPool whose user_ids are exactly the employee user_ids that JE
lines reference in created_by.

## Asset generator

Asset.cost_center was hardcoded as `CC-{COMP}-ADMIN`, but the
cost-centers master uses categories FIN / PROD / SALES / RD /
CORP — ADMIN never appeared.  Switched to `CC-{COMP}-CORP`
(closest semantic match for asset ownership).

## Verified on the v5.9.0 HF reference dataset (1 M JE lines)

| Linkage                                | Pre-fix | Post-fix     |
|----------------------------------------|---------|--------------|
| JE.cost_center  ↔ cost_centers.id      | 0 / 5   | **210 / 210**|
| JE.profit_center↔ profit_centers.id    | 0 / 51  | **120 / 120**|
| JE.created_by   ↔ employees.user_id    | 0 / 737 | **675 / 737**|

The 62 created_by values that don't join are `BATCH0001-…` system
users (automated postings, correctly *not* tied to a human
employee record).
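
A join check of this kind can be sketched as below. The commit message does not spell out how the ratios in the table were computed, so the definition used here (distinct JE values found in the master, over all distinct JE values) and the function name are assumptions.

```rust
use std::collections::HashSet;

/// Returns (matched, total) over the distinct values of one JE column,
/// where `matched` counts the values that also exist in the master table.
pub fn join_coverage<'a>(
    je_values: impl IntoIterator<Item = &'a str>,
    master_ids: &HashSet<&str>,
) -> (usize, usize) {
    // Deduplicate first so repeated postings do not inflate the counts.
    let distinct: HashSet<&str> = je_values.into_iter().collect();
    let matched = distinct.iter().filter(|v| master_ids.contains(*v)).count();
    (matched, distinct.len())
}
```

Values that legitimately have no master record, such as the batch system users, show up as the gap between the two numbers.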

## Audited generators that don't need this treatment

Document-flow ↔ master linkages are already clean:
  - purchase_orders.vendor_id  → vendors.vendor_id    ✅
  - sales_orders.customer_id   → customers.customer_id ✅
  - *.items[].material_id      → materials.material_id ✅

Hardcoded pools in treasury (COUNTERPARTIES, LENDERS,
ISSUING_BANKS), audit (KAM_POOL), governance (KEY_DECISIONS,
AUDIT_COMMITTEE_MATTERS), management_report (POSITIVE_COMMENTARY
etc.), and legal_document (ENGAGEMENT_LETTER_TERMS etc.) are
domain-specific text templates, not master-pool consumers, and
need no plumbing.

## Tests

- 149/149 datasynth-config lib tests pass.
- 688/688 datasynth-core::models lib tests pass.
- 7/7 datasynth-generators::master_data::employee_generator lib
  tests pass.
- Sequential v5.9.0 release-binary regen of the HF reference
  config completes in 23.7 s; 1 M-row JE CSV produces a 100 %
  CC + 100 % PC + 92 % created_by master-data join rate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mivertowski mivertowski merged commit b7e95db into main May 8, 2026
16 checks passed
@mivertowski mivertowski deleted the chore/hf-dataset-v5.8.0-refresh branch May 8, 2026 22:59