diff --git a/CHANGELOG.md b/CHANGELOG.md index c3fdb9a6..3d693c4a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,86 +5,55 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). -## [5.8.0] - 2026-05-08 - -### Added — `graphs/je_network.csv` flat edge-list export - -A new CSV under `graphs/je_network.csv` is now produced alongside -`journal_entries.csv` whenever CSV output is enabled. Each row -represents one debit↔credit flow within a single JE, formed via the -cartesian product of debit lines × credit lines (the approach in -`datasynth-graph::TransactionGraphBuilder`): - -| Column | Source | -|---|---| -| `edge_id` | UUID v5 of `(document_id, debit_line_number, credit_line_number)` — stable across regenerations | -| `document_id` | parent JE | -| `posting_date` | from header | -| `from_account` | credit line's `gl_account` (outgoing edge) | -| `to_account` | debit line's `gl_account` (incoming edge) | -| `from_line_id` | credit line's `transaction_id` (v5.5.1 stable line UUID) | -| `to_line_id` | debit line's `transaction_id` | -| `amount` | proportionally allocated (`(debit / total_debit) × (credit / total_credit) × debit_amount`) | -| `confidence` | `1.0` for 2-line JEs (Method A from Ivertowski et al.); `1/(n×m)` for n-debit / m-credit JEs (Method B/C approximation) | -| `predecessor_edge_id` | first outgoing edge of the predecessor line in a document chain (P2P / O2C); empty for root JEs | -| `business_process`, `is_fraud`, `is_anomaly` | propagated from header for analytics filtering | - -Joins back to `journal_entries.csv` via `transaction_id` so any tool -that already loads the JE table can build the accounting-network -graph directly without invoking the graph crate's specialised -exporters (PyTorch Geometric / Neo4j / DGL — those remain available -under the same 
`graphs/` directory and carry richer feature sets). - -### Added — `JournalEntryLine::predecessor_line_id` - -New optional field on every JE line. Populated by the document-flow -JE generator when a JE is derived from a chained document — a -payment line's predecessor is the corresponding line in the vendor- -invoice JE; an invoice's GR/IR line's predecessor is the matching -goods-receipt line. `None` for purely-GL adjustments, period-close, -payroll, or root documents in a chain. - -Wiring is `O(N)` along the chain via gl_account match across -adjacent JEs (`document_flow_je_generator::wire_predecessor_chain`). -Position-by-`gl_account` matching is intentionally simple and -unambiguous on canonical P2P / O2C chain shapes; ties (multiple -lines of the same gl_account in the predecessor JE) match to the -first occurrence — deterministic but lossy on multi-position chains. -A strict 1-to-1 line-position matcher is future work. - -### Background - -The flat edge-list is the format consumers need to build accounting -networks per Ivertowski et al. (2024) -*"Hardware-Accelerated Method for Accounting Network Generation"* -(EY DID Research). The paper specifies a directed graph -`G(t₀, t₁) = A(t₀, t₁) × E(t₀, t₁)` where credit lines emit outgoing -edges from account nodes and debit lines emit incoming edges. v5.8.0 -makes that surface available without requiring downstream code to -re-derive the matching from raw line items. - -The line-items-per-JE distribution that the paper measured (Tables II -and III: 60.68% of JEs have 2 lines, 16.63% have 4, 88% have an even -count, with system-batch-job tails reaching 1000+ lines) is already -faithfully implemented in `datasynth_core::distributions::line_item` -and was unchanged in this release. 
- -### Verification - -- New integration test `je_network_export_end_to_end` (consolidated - to one orchestrator run to keep test memory bounded) — schema, - edge count `Σ(n_debit × n_credit)`, line-id join-back, and - predecessor-edge presence all assert in 1 pass -- 1 339 datasynth-core unit tests pass -- 1 142 datasynth-generators unit tests pass - -### Compatibility - -Pure addition. No schema changes to existing files; default behaviour -unchanged for runs that don't enable document flows. The new -`predecessor_line_id` field on `JournalEntryLine` uses -`#[serde(default, skip_serializing_if = "Option::is_none")]` so v5.7.0 -fixtures deserialise into v5.8.0 readers cleanly. +## [5.9.0] - 2026-05-08 + +Customer-feedback follow-up release on top of v5.8.0. Bundles a +new opt-in graph-export configuration, the silent-drop / +amount-inflation fixes that emerged from a customer audit, the +defence-in-depth guard in the unusual-item generator, the +JE-CSV widening for `predecessor_line_id`, and the CI runner-sizing +workaround. All issues raised during the v5.8.0 review window +land here as a coherent minor release. + +### Added — `graph_export.je_network.method` config option + +The `graphs/je_network.csv` writer now honours +`graph_export.je_network.method` from the YAML config. Two +methods are exposed: + +- `cartesian` (default) — full Cartesian product of debit × credit + lines per JE, matching the v5.8.0 release behaviour. Bijective + on 2-line entries and `n × m` Cartesian on multi-line entries. +- `a` — Method A from Ivertowski (2024): emit exactly one edge per + 2-line JE (1 debit + 1 credit) and skip multi-line entries + entirely. Edge count = number of 2-line JEs (≈ 60 % of postings + per the paper); confidence is exactly `1.0` on every edge. 
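The per-entry edge counts implied by the two methods can be sketched as follows. This is a minimal illustration, not the crates' API: `JeNetworkMethod` here mirrors the config enum, but `edges_per_je` is a hypothetical helper.

```rust
// Illustrative sketch of the documented per-JE edge counts for the two
// `graph_export.je_network.method` settings. `edges_per_je` is not a
// real function in the crates; the counts follow the changelog text.
#[derive(Clone, Copy, PartialEq)]
enum JeNetworkMethod {
    Cartesian, // full n_debit × n_credit product per JE
    A,         // one edge per 2-line JE; multi-line JEs skipped
}

fn edges_per_je(n_debit: usize, n_credit: usize, method: JeNetworkMethod) -> usize {
    match method {
        JeNetworkMethod::Cartesian => n_debit * n_credit,
        JeNetworkMethod::A if n_debit == 1 && n_credit == 1 => 1,
        JeNetworkMethod::A => 0, // multi-line entries are skipped entirely
    }
}

fn main() {
    // A 50-debit / 50-credit period-close JE alone yields 2 500
    // Cartesian edges but contributes nothing under Method A.
    assert_eq!(edges_per_je(50, 50, JeNetworkMethod::Cartesian), 2_500);
    assert_eq!(edges_per_je(50, 50, JeNetworkMethod::A), 0);
    assert_eq!(edges_per_je(1, 1, JeNetworkMethod::A), 1);
}
```

The ≈225 M-vs-≈600 k totals quoted below amount to summing this per-entry count over the run's line-count distribution.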
+ +Background: at the HF reference-dataset scale (1 M JE lines × 12 +months × 10 companies, with `audit-group` + period-close + IC +matching enabled), the default Cartesian method produced ≈225 M +edges totalling ~52 GB of CSV — too large for HF publication. +Method A drops that to ≈600 k edges (~80 MB) while preserving the +exact-1.0-confidence cohort and matching the 2024 paper exactly. + +The HF dataset config (`configs/examples/hf/journal_entries_1m.yaml`) +now ships with `graph_export.je_network.method: a`. + +### Added — `predecessor_line_id` column on JE CSV + +v5.8.0 added `JournalEntryLine::predecessor_line_id` as a model +field but only the JSON serialiser exposed it; the CSV writer kept +the v5.6.0 42-column schema and silently dropped the field. The +column-positional consumers that read `journal_entries.csv` thus +had no way to follow a posting back to its predecessor in a P2P +or O2C document chain — the chain context was reachable only via +`graphs/je_network.csv` `predecessor_edge_id`, which requires +joining a separate file. + +The CSV header now ends with `predecessor_line_id` as the 43rd +column, populated for every line that has a predecessor populated +on the model. Old consumers that hardcode 42 columns will need +to either ignore the new column or add it to their schema. ### Fixed — silent-drop drift in distribution structs (#183) @@ -218,6 +187,75 @@ applies the same restriction `timeout-minutes: 60` added to the `Test` job for graceful cancellation if anything genuinely hangs in the future. 
+### Fixed — JE ↔ master linkage gaps (cost-centers / profit-centers / employees) + +A customer-feedback audit revealed that three JE columns were +populated from generators independent of the master tables and +therefore did not join cleanly: + +| JE column | Pre-fix | Master | Status | +|---|---|---|---| +| `cost_center` | `CC1000–CC5000` (hardcoded const) | `CC-1000-FIN-AP` (210 ids) | **0 / 5 joined** | +| `profit_center` | `PC-{COMP}-{P2P\|O2C\|R2R\|H2R}` (derived) | `PC-1000-CONSUMER-FOOD` (120 ids) | **0 / 51 joined** | +| `created_by` | `NPARKE015` (UserPool, separate generator) | `EMP-1000-000001` / `u000001` (employees) | **0 / 737 joined** | + +The fix plumbs master pools through `JeGenerator` and +`DocumentFlowJeGenerator`: + +- New builder methods on `JeGenerator`: + - `with_cost_center_pool(Vec<String>)` — overrides the + hardcoded const. + - `with_profit_center_pool(Vec<String>)` — overrides the + legacy derivation. + - `with_user_pool(UserPool)` — overrides the auto-generated + pool from `with_country_pack_names`. +- New setters on `DocumentFlowJeGenerator`: + - `set_cost_center_pool(Vec<String>)` + - `set_profit_center_pool(Vec<String>)` +- New constructor on `UserPool`: + - `UserPool::from_employees(&[Employee])` — builds a UserPool + from the generated employee master so the two systems share + `user_id` values. +- Per-line enrichment in both JE generators now picks + deterministically from the company-filtered subset of the + master pool (via `cc_seed.wrapping_add(i) % pool.len()`), + falling back to the legacy const only when the orchestrator + did not provide a master pool (e.g. configs that skip + master-data generation). +- `JournalEntryGenerator::split` propagates the new pools to + parallel sub-generators (otherwise sub-workers fall back to + the legacy const and a 1 M-line config emitted ~40 % of + cost_centers via the legacy path). +- `Employee.user_id` format changed from `u{:06}` / + `exec{:04}` to `name.to_user_id(counter)` (e.g.
`NPARKE015`), + matching the format `UserPool` already uses so values flow + bidirectionally without translation. +- `master_data/asset_generator.rs`: `Asset.cost_center` switched + from `CC-{COMP}-ADMIN` (which the master never produced) to + `CC-{COMP}-CORP` (master Level-1 category). + +### Audited — other generators that *don't* need this treatment + +The audit also confirmed that the document-flow ↔ master +linkages are already clean: `purchase_orders.vendor_id` → +`vendors.vendor_id`, `sales_orders.customer_id` → +`customers.customer_id`, and `*.items[].material_id` → +`materials.material_id` all join 1-to-1. Hardcoded pools in +`treasury` (`COUNTERPARTIES`, `LENDERS`, `ISSUING_BANKS`), +`audit` (`KAM_POOL`), `governance` (`KEY_DECISIONS`, +`AUDIT_COMMITTEE_MATTERS`), `management_report` +(`POSITIVE_COMMENTARY` etc.), and `legal_document` +(`ENGAGEMENT_LETTER_TERMS` etc.) are domain-specific text +templates, not master-pool consumers, and need no plumbing. + +### Verified on the v5.9.0 HF reference dataset (1 M JE lines) + +| Linkage | Pre-fix | Post-fix | +|---|---|---| +| `JE.cost_center` ↔ `cost_centers.id` | 0 / 5 | **210 / 210** | +| `JE.profit_center` ↔ `profit_centers.id` | 0 / 51 | **120 / 120** | +| `JE.created_by` ↔ `employees.user_id` | 0 / 737 | **675 / 737** (the 62 unmatched are `BATCH0001-BATCH00xx` automated/system postings, correctly *not* tied to a human employee record) | + ### Verification — fixes - All 149 `datasynth-config` lib tests pass. @@ -230,6 +268,87 @@ cancellation if anything genuinely hangs in the future. fraud-type field values, instead of silently dropping fields and reporting a misleading sum. +## [5.8.0] - 2026-05-08 + +### Added — `graphs/je_network.csv` flat edge-list export + +A new CSV under `graphs/je_network.csv` is now produced alongside +`journal_entries.csv` whenever CSV output is enabled. 
Each row +represents one debit↔credit flow within a single JE, formed via the +cartesian product of debit lines × credit lines (the approach in +`datasynth-graph::TransactionGraphBuilder`): + +| Column | Source | +|---|---| +| `edge_id` | UUID v5 of `(document_id, debit_line_number, credit_line_number)` — stable across regenerations | +| `document_id` | parent JE | +| `posting_date` | from header | +| `from_account` | credit line's `gl_account` (outgoing edge) | +| `to_account` | debit line's `gl_account` (incoming edge) | +| `from_line_id` | credit line's `transaction_id` (v5.5.1 stable line UUID) | +| `to_line_id` | debit line's `transaction_id` | +| `amount` | proportionally allocated (`(debit / total_debit) × (credit / total_credit) × debit_amount`) | +| `confidence` | `1.0` for 2-line JEs (Method A from Ivertowski et al.); `1/(n×m)` for n-debit / m-credit JEs (Method B/C approximation) | +| `predecessor_edge_id` | first outgoing edge of the predecessor line in a document chain (P2P / O2C); empty for root JEs | +| `business_process`, `is_fraud`, `is_anomaly` | propagated from header for analytics filtering | + +Joins back to `journal_entries.csv` via `transaction_id` so any tool +that already loads the JE table can build the accounting-network +graph directly without invoking the graph crate's specialised +exporters (PyTorch Geometric / Neo4j / DGL — those remain available +under the same `graphs/` directory and carry richer feature sets). + +### Added — `JournalEntryLine::predecessor_line_id` + +New optional field on every JE line. Populated by the document-flow +JE generator when a JE is derived from a chained document — a +payment line's predecessor is the corresponding line in the vendor- +invoice JE; an invoice's GR/IR line's predecessor is the matching +goods-receipt line. `None` for purely-GL adjustments, period-close, +payroll, or root documents in a chain. 
+ +Wiring is `O(N)` along the chain via gl_account match across +adjacent JEs (`document_flow_je_generator::wire_predecessor_chain`). +Position-by-`gl_account` matching is intentionally simple and +unambiguous on canonical P2P / O2C chain shapes; ties (multiple +lines of the same gl_account in the predecessor JE) match to the +first occurrence — deterministic but lossy on multi-position chains. +A strict 1-to-1 line-position matcher is future work. + +### Background + +The flat edge-list is the format consumers need to build accounting +networks per Ivertowski et al. (2024) +*"Hardware-Accelerated Method for Accounting Network Generation"* +(EY DID Research). The paper specifies a directed graph +`G(t₀, t₁) = A(t₀, t₁) × E(t₀, t₁)` where credit lines emit outgoing +edges from account nodes and debit lines emit incoming edges. v5.8.0 +makes that surface available without requiring downstream code to +re-derive the matching from raw line items. + +The line-items-per-JE distribution that the paper measured (Tables II +and III: 60.68% of JEs have 2 lines, 16.63% have 4, 88% have an even +count, with system-batch-job tails reaching 1000+ lines) is already +faithfully implemented in `datasynth_core::distributions::line_item` +and was unchanged in this release. + +### Verification + +- New integration test `je_network_export_end_to_end` (consolidated + to one orchestrator run to keep test memory bounded) — schema, + edge count `Σ(n_debit × n_credit)`, line-id join-back, and + predecessor-edge presence all assert in 1 pass +- 1 339 datasynth-core unit tests pass +- 1 142 datasynth-generators unit tests pass + +### Compatibility + +Pure addition. No schema changes to existing files; default behaviour +unchanged for runs that don't enable document flows. The new +`predecessor_line_id` field on `JournalEntryLine` uses +`#[serde(default, skip_serializing_if = "Option::is_none")]` so v5.7.0 +fixtures deserialise into v5.8.0 readers cleanly. 
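The first-occurrence `gl_account` matching described in this entry can be sketched as follows. The `Line` struct, field names, and GL account numbers are illustrative stand-ins, not the generator's actual types:

```rust
// Sketch of the documented first-occurrence rule: each line of a JE
// links to the FIRST line with the same gl_account in the predecessor
// JE; accounts absent upstream stay None. `Line` is a stand-in type.
struct Line {
    line_id: String,
    gl_account: String,
    predecessor_line_id: Option<String>,
}

fn wire(predecessor: &[Line], current: &mut [Line]) {
    for line in current.iter_mut() {
        line.predecessor_line_id = predecessor
            .iter()
            .find(|p| p.gl_account == line.gl_account) // first occurrence wins
            .map(|p| p.line_id.clone());
    }
}

fn main() {
    let mk = |id: &str, acct: &str| Line {
        line_id: id.into(),
        gl_account: acct.into(),
        predecessor_line_id: None,
    };
    // Vendor-invoice JE → payment JE: the payment's AP-side line links
    // back to the invoice line on the same account ("2100" is an
    // assumed AP account); the cash line has no upstream match.
    let invoice = vec![mk("inv-1", "2100"), mk("inv-2", "5400")];
    let mut payment = vec![mk("pay-1", "2100"), mk("pay-2", "1000")];
    wire(&invoice, &mut payment);
    assert_eq!(payment[0].predecessor_line_id.as_deref(), Some("inv-1"));
    assert_eq!(payment[1].predecessor_line_id, None);
}
```

As the entry notes, this is deterministic but lossy when the predecessor JE repeats a `gl_account` across several positions.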
+ ## [5.7.0] - 2026-05-07 ### Added — Industry account-pack sub-account expansion (opt-in) diff --git a/Cargo.lock b/Cargo.lock index 86bd6c15..d024bf87 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -1395,7 +1395,7 @@ checksum = "d7a1e2f27636f116493b8b860f5546edb47c8d8f8ea73e1d2a20be88e28d1fea" [[package]] name = "datasynth-audit-fsm" -version = "5.8.0" +version = "5.9.0" dependencies = [ "arrow", "chrono", @@ -1422,7 +1422,7 @@ dependencies = [ [[package]] name = "datasynth-audit-optimizer" -version = "5.8.0" +version = "5.9.0" dependencies = [ "chrono", "datasynth-audit-fsm", @@ -1437,7 +1437,7 @@ dependencies = [ [[package]] name = "datasynth-banking" -version = "5.8.0" +version = "5.9.0" dependencies = [ "chrono", "csv", @@ -1457,7 +1457,7 @@ dependencies = [ [[package]] name = "datasynth-cli" -version = "5.8.0" +version = "5.9.0" dependencies = [ "anyhow", "assert_cmd", @@ -1496,7 +1496,7 @@ dependencies = [ [[package]] name = "datasynth-config" -version = "5.8.0" +version = "5.9.0" dependencies = [ "chrono", "datasynth-banking", @@ -1511,7 +1511,7 @@ dependencies = [ [[package]] name = "datasynth-core" -version = "5.8.0" +version = "5.9.0" dependencies = [ "candle-core", "candle-nn", @@ -1540,7 +1540,7 @@ dependencies = [ [[package]] name = "datasynth-eval" -version = "5.8.0" +version = "5.9.0" dependencies = [ "chrono", "datasynth-config", @@ -1565,7 +1565,7 @@ dependencies = [ [[package]] name = "datasynth-fingerprint" -version = "5.8.0" +version = "5.9.0" dependencies = [ "anyhow", "arrow", @@ -1594,7 +1594,7 @@ dependencies = [ [[package]] name = "datasynth-generators" -version = "5.8.0" +version = "5.9.0" dependencies = [ "chrono", "datasynth-config", @@ -1618,7 +1618,7 @@ dependencies = [ [[package]] name = "datasynth-graph" -version = "5.8.0" +version = "5.9.0" dependencies = [ "chrono", "datasynth-banking", @@ -1637,7 +1637,7 @@ dependencies = [ [[package]] name = "datasynth-group" -version = "5.8.0" +version = "5.9.0" dependencies = [ "assert_cmd", 
"blake3", @@ -1668,7 +1668,7 @@ dependencies = [ [[package]] name = "datasynth-ocpm" -version = "5.8.0" +version = "5.9.0" dependencies = [ "chrono", "datasynth-core", @@ -1684,7 +1684,7 @@ dependencies = [ [[package]] name = "datasynth-output" -version = "5.8.0" +version = "5.9.0" dependencies = [ "arrow", "chrono", @@ -1708,7 +1708,7 @@ dependencies = [ [[package]] name = "datasynth-runtime" -version = "5.8.0" +version = "5.9.0" dependencies = [ "chrono", "crossbeam-channel", @@ -1747,7 +1747,7 @@ dependencies = [ [[package]] name = "datasynth-server" -version = "5.8.0" +version = "5.9.0" dependencies = [ "anyhow", "argon2", @@ -1797,7 +1797,7 @@ dependencies = [ [[package]] name = "datasynth-standards" -version = "5.8.0" +version = "5.9.0" dependencies = [ "chrono", "datasynth-core", @@ -1813,7 +1813,7 @@ dependencies = [ [[package]] name = "datasynth-test-utils" -version = "5.8.0" +version = "5.9.0" dependencies = [ "chrono", "datasynth-banking", @@ -1836,7 +1836,7 @@ dependencies = [ [[package]] name = "datasynth-workspace" -version = "5.8.0" +version = "5.9.0" dependencies = [ "chrono", "criterion", diff --git a/Cargo.toml b/Cargo.toml index f1ade7c4..8ddbe644 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -24,7 +24,7 @@ exclude = ["fuzz", "attic/datasynth-graph-export"] # Root package for workspace-level benchmarks [package] name = "datasynth-workspace" -version = "5.8.0" +version = "5.9.0" edition = "2021" publish = false @@ -44,7 +44,7 @@ tempfile = { workspace = true } serde_json = { workspace = true } [workspace.package] -version = "5.8.0" +version = "5.9.0" edition = "2021" license = "Apache-2.0" rust-version = "1.88" @@ -60,22 +60,22 @@ categories = ["simulation", "command-line-utilities"] # Internal crates - version required for crates.io publishing # Version must match workspace.package.version to prevent cargo from resolving # old incompatible versions during publish verification -datasynth-core = { version = "5.8.0", path = "crates/datasynth-core" } 
-datasynth-config = { version = "5.8.0", path = "crates/datasynth-config" } -datasynth-generators = { version = "5.8.0", path = "crates/datasynth-generators" } -datasynth-output = { version = "5.8.0", path = "crates/datasynth-output" } -datasynth-runtime = { version = "5.8.0", path = "crates/datasynth-runtime" } -datasynth-graph = { version = "5.8.0", path = "crates/datasynth-graph" } -datasynth-server = { version = "5.8.0", path = "crates/datasynth-server" } -datasynth-test-utils = { version = "5.8.0", path = "crates/datasynth-test-utils" } -datasynth-eval = { version = "5.8.0", path = "crates/datasynth-eval" } -datasynth-ocpm = { version = "5.8.0", path = "crates/datasynth-ocpm" } -datasynth-banking = { version = "5.8.0", path = "crates/datasynth-banking" } -datasynth-fingerprint = { version = "5.8.0", path = "crates/datasynth-fingerprint" } -datasynth-standards = { version = "5.8.0", path = "crates/datasynth-standards" } -datasynth-audit-fsm = { version = "5.8.0", path = "crates/datasynth-audit-fsm" } -datasynth-audit-optimizer = { version = "5.8.0", path = "crates/datasynth-audit-optimizer" } -datasynth-group = { version = "5.8.0", path = "crates/datasynth-group" } +datasynth-core = { version = "5.9.0", path = "crates/datasynth-core" } +datasynth-config = { version = "5.9.0", path = "crates/datasynth-config" } +datasynth-generators = { version = "5.9.0", path = "crates/datasynth-generators" } +datasynth-output = { version = "5.9.0", path = "crates/datasynth-output" } +datasynth-runtime = { version = "5.9.0", path = "crates/datasynth-runtime" } +datasynth-graph = { version = "5.9.0", path = "crates/datasynth-graph" } +datasynth-server = { version = "5.9.0", path = "crates/datasynth-server" } +datasynth-test-utils = { version = "5.9.0", path = "crates/datasynth-test-utils" } +datasynth-eval = { version = "5.9.0", path = "crates/datasynth-eval" } +datasynth-ocpm = { version = "5.9.0", path = "crates/datasynth-ocpm" } +datasynth-banking = { version = "5.9.0", path 
= "crates/datasynth-banking" } +datasynth-fingerprint = { version = "5.9.0", path = "crates/datasynth-fingerprint" } +datasynth-standards = { version = "5.9.0", path = "crates/datasynth-standards" } +datasynth-audit-fsm = { version = "5.9.0", path = "crates/datasynth-audit-fsm" } +datasynth-audit-optimizer = { version = "5.9.0", path = "crates/datasynth-audit-optimizer" } +datasynth-group = { version = "5.9.0", path = "crates/datasynth-group" } # Serialization serde = { version = "1.0", features = ["derive"] } diff --git a/configs/examples/hf/README.md b/configs/examples/hf/README.md index d0a37631..5517198b 100644 --- a/configs/examples/hf/README.md +++ b/configs/examples/hf/README.md @@ -55,10 +55,22 @@ chart_of_accounts.parquet trial_balances.parquet cost_centers.parquet profit_centers.parquet +je_network.parquet # Accounting-network edges (v5.8.0+) README.md # Dataset card (copy from previous publish) generation_config.yaml # Reproducibility receipt (this file) ``` +`je_network.parquet` is the flat Cartesian-product edge list produced +by v5.8.0 — one row per `(debit_line, credit_line)` pair within each +journal entry, joinable back to the `data/train-*.parquet` JE shards +via `from_line_id` / `to_line_id` (which match the `transaction_id` +column on the JE side). Schema: `edge_id`, `document_id`, +`posting_date`, `from_account`, `to_account`, `from_line_id`, +`to_line_id`, `amount`, `confidence`, `predecessor_edge_id`, +`business_process`, `is_fraud`, `is_anomaly`. See the v5.8.0 +CHANGELOG entry for the design rationale and the Methods A–E +reference (Ivertowski 2024). + ## Adapting to other HF dataset repos Each future HF dataset gets its own YAML in this directory. 
The diff --git a/configs/examples/hf/journal_entries_1m.yaml b/configs/examples/hf/journal_entries_1m.yaml index 805c8be5..b47bb502 100644 --- a/configs/examples/hf/journal_entries_1m.yaml +++ b/configs/examples/hf/journal_entries_1m.yaml @@ -1,7 +1,19 @@ -# VynFi Journal Entries v5.5.1 — HF dataset regeneration -# Reproduces the published 10-company / 12-month / manufacturing dataset -# at the v5.5.1 column schema (49 columns, transaction_id + FSA category + -# enriched source_system + ISA-240 audit flags + per-field NULL profiles). +# VynFi Journal Entries v5.8.0 — HF dataset regeneration +# +# Changes since v5.5.1 baseline: +# v5.6.0 ISO 21378 account-class / sub-class fields +# (account_class_name, account_sub_class, account_sub_class_name) +# v5.7.0 Industry sub-account expansion (opt-in, enabled below) — splits +# canonical control accounts into product-line / channel / +# cost-centre sub-accounts for a more realistic chart +# v5.8.0 graphs/je_network.csv flat edge-list export (Cartesian-product +# debit↔credit edges) + JournalEntryLine.predecessor_line_id for +# document-chain tracing (P2P / O2C predecessor lookup) +# v5.8.0 BenfordViolationStrategy magnitude bug fixed (#185) — fraud +# amounts no longer inflate to quintillion scale +# v5.8.0 Distribution-struct silent-drop drift closed (#183) — typos in +# fraud_type / behavior / credit-rating distributions now fail +# parse with `unknown field` instead of silently dropping global: industry: manufacturing @@ -121,9 +133,20 @@ temporal_patterns: audit: enabled: false # Audit data is published in a separate dataset +# Accounting-network edge-list export (v5.8.0+). Method A produces +# exactly one edge per 2-line journal entry (≈60% of postings per +# Ivertowski 2024) and skips the multi-line Cartesian-product +# expansion that would otherwise blow the dataset to 200 M+ edges +# at the 1 M-line scale. Confidence is exactly 1.0 on every edge. 
+graph_export: + je_network: + method: a + output: output_directory: "./output" formats: [csv, json] # Default phase config (single-entity, no group consolidation). -# Banking / OCEL / graph export disabled — published in their own datasets. +# Banking / OCEL graph exports disabled — published in their own +# datasets; only the lightweight `graphs/je_network.csv` ships with +# this config. diff --git a/crates/datasynth-cli/src/main.rs b/crates/datasynth-cli/src/main.rs index 6f9db198..b2fd8d7b 100644 --- a/crates/datasynth-cli/src/main.rs +++ b/crates/datasynth-cli/src/main.rs @@ -1525,6 +1525,7 @@ fn run_main() -> Result<()> { &output, generator_config.output.export_layout, &generator_config.output.formats, + generator_config.graph_export.je_network.method, ) { tracing::warn!("Some output files may not have been written: {}", e); } diff --git a/crates/datasynth-config/src/schema.rs b/crates/datasynth-config/src/schema.rs index 45b988e5..dca6852e 100644 --- a/crates/datasynth-config/src/schema.rs +++ b/crates/datasynth-config/src/schema.rs @@ -557,6 +557,47 @@ pub struct GraphExportConfig { /// DGL-specific export settings. #[serde(default)] pub dgl: DglExportConfig, + + /// `graphs/je_network.csv` flat edge-list export settings (v5.8.0+). + #[serde(default)] + pub je_network: JeNetworkConfig, +} + +/// Method used to construct edges from journal entries when writing +/// `graphs/je_network.csv` (v5.8.0+). +/// +/// Reference: Ivertowski (2024), *Hardware Accelerated Method for +/// Accounting Network Generation*, Methods A through E. +#[derive(Debug, Clone, Copy, Default, Serialize, Deserialize, PartialEq, Eq)] +#[serde(rename_all = "snake_case")] +pub enum JeNetworkMethod { + /// Method B (full Cartesian product) for every JE — bijective on + /// 2-line entries (Method A) and `n × m` Cartesian for multi-line + /// entries with proportional amount allocation. 
Default for + /// backward compatibility with v5.8.0 datasets that already + /// consumed the Cartesian-product output, but produces O(n × m) + /// edges per JE — a 50-debit / 50-credit period-close + /// consolidation alone yields 2 500 edges, and a typical + /// HF-scale 1 M-line config can blow up to 200 M+ edges. + #[default] + Cartesian, + /// Method A only — emit a single edge per 2-line journal entry + /// (1 debit + 1 credit) and skip multi-line entries entirely. + /// Edge count = number of 2-line JEs (≈ 60 % of entries per the + /// 2024 paper); per-edge confidence is exactly `1.0`. Recommended + /// for published reference datasets where size and exactness + /// matter more than recall on multi-line consolidations. + A, +} + +/// Configuration for the `graphs/je_network.csv` flat edge-list +/// export (v5.8.0+). +#[derive(Debug, Clone, Default, Serialize, Deserialize)] +#[serde(deny_unknown_fields)] +pub struct JeNetworkConfig { + /// Edge-construction method (see [`JeNetworkMethod`]). + #[serde(default)] + pub method: JeNetworkMethod, } fn default_graph_types() -> Vec { @@ -591,6 +632,7 @@ impl Default for GraphExportConfig { output_subdirectory: "graphs".to_string(), hypergraph: HypergraphExportSettings::default(), dgl: DglExportConfig::default(), + je_network: JeNetworkConfig::default(), } } } diff --git a/crates/datasynth-core/src/models/user.rs b/crates/datasynth-core/src/models/user.rs index 271fa1b9..63d5dcd6 100644 --- a/crates/datasynth-core/src/models/user.rs +++ b/crates/datasynth-core/src/models/user.rs @@ -297,6 +297,32 @@ impl UserPool { } } + /// Build a `UserPool` from a slice of generated [`Employee`] records. + /// + /// Each employee contributes one user with the same `user_id`, + /// `persona`, `working_hours`, and display name. 
This is the + /// canonical way to source user identities from master data so + /// that `JE.created_by` joins back to `employees.user_id` + /// (closes the v5.9.0 linkage gap that had JE creators using a + /// pool disjoint from the employees master). + pub fn from_employees(employees: &[Employee]) -> Self { + let mut pool = Self::new(); + for emp in employees { + let display_name = if emp.first_name.is_empty() && emp.last_name.is_empty() { + emp.user_id.clone() + } else { + format!("{} {}", emp.first_name, emp.last_name) + }; + let mut user = User::new(emp.user_id.clone(), display_name, emp.persona); + user.email = Some(emp.email.clone()); + user.department = emp.department_id.clone(); + user.cost_centers = emp.cost_center.iter().cloned().collect(); + user.working_hours = emp.working_hours.clone(); + pool.add_user(user); + } + pool + } + /// Add a user to the pool. pub fn add_user(&mut self, user: User) { let idx = self.users.len(); diff --git a/crates/datasynth-generators/src/document_flow/document_flow_je_generator.rs b/crates/datasynth-generators/src/document_flow/document_flow_je_generator.rs index 03ab6d45..21820127 100644 --- a/crates/datasynth-generators/src/document_flow/document_flow_je_generator.rs +++ b/crates/datasynth-generators/src/document_flow/document_flow_je_generator.rs @@ -118,6 +118,14 @@ pub struct DocumentFlowJeGenerator { /// uses the framework-specific auxiliary account (e.g., PCG "4010001", SKR04 "33000001") /// instead of the raw partner ID. auxiliary_account_lookup: HashMap<String, String>, + /// Cost-center IDs sourced from the generated cost-centers master so + /// document-flow-derived JEs (P2P / O2C) reference IDs that join + /// back to `cost_centers.id`. Falls back to the hardcoded + /// `COST_CENTER_POOL` const when empty. + cost_center_pool: Vec<String>, + /// Profit-center IDs sourced from the generated profit-centers master. + /// Same population semantics as `cost_center_pool`.
+ profit_center_pool: Vec<String>, } impl DocumentFlowJeGenerator { @@ -132,6 +140,8 @@ impl DocumentFlowJeGenerator { config, uuid_factory: DeterministicUuidFactory::new(seed, GeneratorType::DocumentFlow), auxiliary_account_lookup: HashMap::new(), + cost_center_pool: Vec::new(), + profit_center_pool: Vec::new(), } } @@ -144,6 +154,17 @@ impl DocumentFlowJeGenerator { self.auxiliary_account_lookup = lookup; } + /// Set the cost-center pool (master-data IDs). See + /// `JeGenerator::with_cost_center_pool` for semantics. + pub fn set_cost_center_pool(&mut self, ids: Vec<String>) { + self.cost_center_pool = ids; + } + + /// Set the profit-center pool (master-data IDs). + pub fn set_profit_center_pool(&mut self, ids: Vec<String>) { + self.profit_center_pool = ids; + } + /// Build an account description lookup from the configured accounts. fn account_description_map(&self) -> HashMap<String, String> { let mut map = HashMap::new(); @@ -215,23 +236,60 @@ impl DocumentFlowJeGenerator { line.account_description = desc_map.get(&line.gl_account).cloned(); } - // 2. cost_center for expense accounts (5xxx/6xxx) + // 2. cost_center for expense accounts (5xxx/6xxx). + // When the orchestrator wired a master-data pool via + // `set_cost_center_pool`, draw from it filtered to the + // entry's company; otherwise fall back to the hardcoded + // `COST_CENTER_POOL`.
if line.cost_center.is_none() { let first_char = line.gl_account.chars().next().unwrap_or('0'); if first_char == '5' || first_char == '6' { - let idx = cc_seed.wrapping_add(i) % Self::COST_CENTER_POOL.len(); - line.cost_center = Some(Self::COST_CENTER_POOL[idx].to_string()); + if !self.cost_center_pool.is_empty() { + let needle = format!("-{company_code}-"); + let candidates: Vec<&String> = self + .cost_center_pool + .iter() + .filter(|id| id.contains(&needle)) + .collect(); + let pool: Vec<&String> = if candidates.is_empty() { + self.cost_center_pool.iter().collect() + } else { + candidates + }; + let idx = cc_seed.wrapping_add(i) % pool.len(); + line.cost_center = Some(pool[idx].clone()); + } else { + let idx = cc_seed.wrapping_add(i) % Self::COST_CENTER_POOL.len(); + line.cost_center = Some(Self::COST_CENTER_POOL[idx].to_string()); + } } } - // 3. profit_center from company code + business process + // 3. profit_center: master pool when available, else + // derived from company code + business process (legacy). if line.profit_center.is_none() { - let suffix = match business_process { - Some(BusinessProcess::P2P) => "-P2P", - Some(BusinessProcess::O2C) => "-O2C", - _ => "", - }; - line.profit_center = Some(format!("PC-{company_code}{suffix}")); + if !self.profit_center_pool.is_empty() { + let needle = format!("-{company_code}-"); + let candidates: Vec<&String> = self + .profit_center_pool + .iter() + .filter(|id| id.contains(&needle)) + .collect(); + let pool: Vec<&String> = if candidates.is_empty() { + self.profit_center_pool.iter().collect() + } else { + candidates + }; + let idx = cc_seed.wrapping_add(i) % pool.len(); + line.profit_center = Some(pool[idx].clone()); + } else { + let suffix = match business_process { + Some(BusinessProcess::P2P) => "-P2P", + Some(BusinessProcess::O2C) => "-O2C", + _ => "", + }; + line.profit_center = Some(format!("PC-{company_code}{suffix}")); + } } // 4. 
line_text: fall back to header_text
diff --git a/crates/datasynth-generators/src/je_generator.rs b/crates/datasynth-generators/src/je_generator.rs
index 80e92372..005bd560 100644
--- a/crates/datasynth-generators/src/je_generator.rs
+++ b/crates/datasynth-generators/src/je_generator.rs
@@ -55,6 +55,16 @@ pub struct JournalEntryGenerator {
     customer_pool: CustomerPool,
     // Material pool for realistic material references
     material_pool: Option<MaterialPool>,
+    // Cost-center IDs sourced from the generated cost-centers master so
+    // `JE.cost_center` joins back to `cost_centers.id`. Populated via
+    // [`with_cost_center_pool`] from the orchestrator after master-data
+    // generation; falls back to the hardcoded `COST_CENTER_POOL` const
+    // when empty (configs that skip master-data generation).
+    cost_center_pool: Vec<String>,
+    // Profit-center IDs sourced from the generated profit-centers master
+    // so `JE.profit_center` joins back to `profit_centers.id`. Same
+    // population semantics as `cost_center_pool`.
+    profit_center_pool: Vec<String>,
     // Flag indicating whether we're using real master data vs defaults
     using_real_master_data: bool,
     // Fraud generation
@@ -355,6 +365,8 @@ impl JournalEntryGenerator {
             vendor_pool: VendorPool::standard(),
             customer_pool: CustomerPool::standard(),
             material_pool: None,
+            cost_center_pool: Vec::new(),
+            profit_center_pool: Vec::new(),
             using_real_master_data: false,
             fraud_config: FraudConfig::default(),
             persona_errors_enabled: true, // Enable by default for realism
@@ -844,6 +856,44 @@ impl JournalEntryGenerator {
             .with_materials(materials)
     }
 
+    /// Set the cost-center pool used by line-item enrichment.
+    ///
+    /// The orchestrator wires this from the generated cost-centers
+    /// master so `JE.cost_center` joins back to `cost_centers.id`.
+    /// When the pool is non-empty `enrich_line_items` picks
+    /// deterministically from it; the hardcoded fallback
+    /// `COST_CENTER_POOL` const is only used when the pool is empty
+    /// (configs that don't generate cost-center master data).
+    pub fn with_cost_center_pool(mut self, ids: Vec<String>) -> Self {
+        self.cost_center_pool = ids;
+        self
+    }
+
+    /// Set the profit-center pool used by line-item enrichment.
+    ///
+    /// Same semantics as [`with_cost_center_pool`] but for the
+    /// profit-centers master. Without this, the legacy
+    /// `PC-{company_code}-{P2P|O2C|R2R|H2R}` derivation is used —
+    /// which is consistent within a generation run but does not
+    /// match the format the master data generator emits.
+    pub fn with_profit_center_pool(mut self, ids: Vec<String>) -> Self {
+        self.profit_center_pool = ids;
+        self
+    }
+
+    /// Replace the auto-generated user pool with an externally-built one.
+    ///
+    /// The orchestrator builds a [`UserPool`] from the generated
+    /// employee master ([`UserPool::from_employees`]) and passes it
+    /// here, so `JE.created_by` joins back to `employees.user_id`.
+    /// Without this call, [`with_country_pack_names`] generates its
+    /// own user pool whose ids are disjoint from the employee
+    /// master.
+    pub fn with_user_pool(mut self, pool: UserPool) -> Self {
+        self.user_pool = Some(pool);
+        self
+    }
+
     /// Replace the user pool with one generated from a [`CountryPack`].
     ///
     /// This is an alternative to the default name-culture distribution that
@@ -1148,24 +1198,70 @@ impl JournalEntryGenerator {
     }
 
         // 2. cost_center: assign to expense accounts (5xxx/6xxx)
+        //
+        // When the orchestrator has provided a master-data-sourced
+        // pool (`with_cost_center_pool`), pick from it so the value
+        // joins back to `cost_centers.id`. Otherwise fall back to
+        // the legacy hardcoded `COST_CENTER_POOL` const.
+ // + // Selection within the pool is filtered to entries that + // mention the entry's `company_code` (master IDs follow + // the `CC-{company}-...` convention) so cross-company + // contamination is avoided; if no pool entry matches the + // company we fall through to the full pool. if line.cost_center.is_none() { let first_char = line.gl_account.chars().next().unwrap_or('0'); if first_char == '5' || first_char == '6' { - let idx = cc_seed.wrapping_add(i) % Self::COST_CENTER_POOL.len(); - line.cost_center = Some(Self::COST_CENTER_POOL[idx].to_string()); + if !self.cost_center_pool.is_empty() { + let needle = format!("-{company_code}-"); + let candidates: Vec<&String> = self + .cost_center_pool + .iter() + .filter(|id| id.contains(&needle)) + .collect(); + let pool: Vec<&String> = if candidates.is_empty() { + self.cost_center_pool.iter().collect() + } else { + candidates + }; + let idx = cc_seed.wrapping_add(i) % pool.len(); + line.cost_center = Some(pool[idx].clone()); + } else { + let idx = cc_seed.wrapping_add(i) % Self::COST_CENTER_POOL.len(); + line.cost_center = Some(Self::COST_CENTER_POOL[idx].to_string()); + } } } - // 3. profit_center: derive from company code + business process + // 3. profit_center: assign from master pool when available + // (`with_profit_center_pool`); otherwise derive from + // company code + business process (legacy behaviour, which + // does not match the master-data PC ID format). 
if line.profit_center.is_none() { - let suffix = match business_process { - Some(BusinessProcess::P2P) => "-P2P", - Some(BusinessProcess::O2C) => "-O2C", - Some(BusinessProcess::R2R) => "-R2R", - Some(BusinessProcess::H2R) => "-H2R", - _ => "", - }; - line.profit_center = Some(format!("PC-{company_code}{suffix}")); + if !self.profit_center_pool.is_empty() { + let needle = format!("-{company_code}-"); + let candidates: Vec<&String> = self + .profit_center_pool + .iter() + .filter(|id| id.contains(&needle)) + .collect(); + let pool: Vec<&String> = if candidates.is_empty() { + self.profit_center_pool.iter().collect() + } else { + candidates + }; + let idx = cc_seed.wrapping_add(i) % pool.len(); + line.profit_center = Some(pool[idx].clone()); + } else { + let suffix = match business_process { + Some(BusinessProcess::P2P) => "-P2P", + Some(BusinessProcess::O2C) => "-O2C", + Some(BusinessProcess::R2R) => "-R2R", + Some(BusinessProcess::H2R) => "-H2R", + _ => "", + }; + line.profit_center = Some(format!("PC-{company_code}{suffix}")); + } } // 4. line_text: fall back to header_text if not already set @@ -2530,6 +2626,13 @@ impl ParallelGenerator for JournalEntryGenerator { gen.vendor_pool = self.vendor_pool.clone(); gen.customer_pool = self.customer_pool.clone(); gen.material_pool = self.material_pool.clone(); + // v5.9.0: master-data pools so sub-generators emit + // CC/PC values that join back to the corresponding + // masters (without these clones, parallel workers + // fell back to the hardcoded `COST_CENTER_POOL` const + // and the legacy `PC-{COMP}-{P2P|O2C|...}` derivation). 
+ gen.cost_center_pool = self.cost_center_pool.clone(); + gen.profit_center_pool = self.profit_center_pool.clone(); gen.using_real_master_data = self.using_real_master_data; gen.fraud_config = self.fraud_config.clone(); gen.persona_errors_enabled = self.persona_errors_enabled; diff --git a/crates/datasynth-generators/src/master_data/asset_generator.rs b/crates/datasynth-generators/src/master_data/asset_generator.rs index 4b0a8610..11093687 100644 --- a/crates/datasynth-generators/src/master_data/asset_generator.rs +++ b/crates/datasynth-generators/src/master_data/asset_generator.rs @@ -252,7 +252,11 @@ impl AssetGenerator { // Set location info asset.location = Some(format!("P{company_code}")); - asset.cost_center = Some(format!("CC-{company_code}-ADMIN")); + // v5.9.0: align with master cost-centers vocabulary (FIN / PROD / + // SALES / RD / CORP). ADMIN was specific to asset_generator and + // didn't exist in the cost-centers master, so JE lines derived + // from these assets fell outside the cost_centers join. + asset.cost_center = Some(format!("CC-{company_code}-CORP")); // Generate serial number for equipment if matches!( @@ -306,7 +310,11 @@ impl AssetGenerator { asset.account_determination = self.generate_account_determination(&asset_class); asset.location = Some(format!("P{company_code}")); - asset.cost_center = Some(format!("CC-{company_code}-ADMIN")); + // v5.9.0: align with master cost-centers vocabulary (FIN / PROD / + // SALES / RD / CORP). ADMIN was specific to asset_generator and + // didn't exist in the cost-centers master, so JE lines derived + // from these assets fell outside the cost_centers join. 
+ asset.cost_center = Some(format!("CC-{company_code}-CORP")); if matches!( asset_class, diff --git a/crates/datasynth-generators/src/master_data/employee_generator.rs b/crates/datasynth-generators/src/master_data/employee_generator.rs index 23c11671..0c685060 100644 --- a/crates/datasynth-generators/src/master_data/employee_generator.rs +++ b/crates/datasynth-generators/src/master_data/employee_generator.rs @@ -265,7 +265,11 @@ impl EmployeeGenerator { let name = self.name_generator.generate_name(&mut self.rng); let employee_id = format!("EMP-{}-{:06}", company_code, self.employee_counter); - let user_id = format!("u{:06}", self.employee_counter); + // v5.9.0: align Employee.user_id with UserPool's user_id format so a + // UserPool built from this employee pool produces user_ids that join + // back to `employees.user_id`. Previously this was `u{:06}` which + // was disjoint from the username-style ids JE.created_by used. + let user_id = name.to_user_id(self.employee_counter); let email = self.name_generator.generate_email(&name); let job_level = self.select_job_level(); @@ -538,7 +542,8 @@ impl EmployeeGenerator { let name = self.name_generator.generate_name(&mut self.rng); let employee_id = format!("EMP-{}-{:06}", company_code, self.employee_counter); - let user_id = format!("exec{:04}", self.employee_counter); + // v5.9.0: see `generate_employee` for the format change rationale. 
+        let user_id = name.to_user_id(self.employee_counter);
         let email = self.name_generator.generate_email(&name);
 
         let mut employee = Employee::new(
diff --git a/crates/datasynth-runtime/src/enhanced_orchestrator.rs b/crates/datasynth-runtime/src/enhanced_orchestrator.rs
index b2c67927..f72ba474 100644
--- a/crates/datasynth-runtime/src/enhanced_orchestrator.rs
+++ b/crates/datasynth-runtime/src/enhanced_orchestrator.rs
@@ -10970,13 +10970,43 @@ impl EnhancedOrchestrator {
         // Pass fraud configuration for fraud injection
         let je_pack = self.primary_pack();
 
+        // Master-data CC / PC pools so JE.cost_center and
+        // JE.profit_center join back to `cost_centers.id` and
+        // `profit_centers.id` (v5.9.0: closes the linkage gap that
+        // had `JE.cost_center = "CC1000"` while the master used
+        // `CC-1000-FIN` etc.). Empty when no master is present —
+        // the generator falls back to its hardcoded constants.
+        let cc_pool: Vec<String> = self
+            .master_data
+            .cost_centers
+            .iter()
+            .map(|c| c.id.clone())
+            .collect();
+        let pc_pool: Vec<String> = self
+            .master_data
+            .profit_centers
+            .iter()
+            .map(|p| p.id.clone())
+            .collect();
+
+        // Build a UserPool from the generated employee master so
+        // JE.created_by joins back to `employees.user_id` (v5.9.0:
+        // closes the third linkage gap; previously JeGenerator
+        // generated its own UserPool internally, with ids disjoint
+        // from the employee master).
+        let user_pool_from_employees =
+            datasynth_core::models::UserPool::from_employees(&self.master_data.employees);
+
         let mut generator = generator
             .with_master_data(
                 &self.master_data.vendors,
                 &self.master_data.customers,
                 &self.master_data.materials,
             )
+            .with_cost_center_pool(cc_pool)
+            .with_profit_center_pool(pc_pool)
             .with_country_pack_names(je_pack)
+            .with_user_pool(user_pool_from_employees)
             .with_country_pack_temporal(
                 self.config.temporal_patterns.clone(),
                 self.seed + 200,
@@ -11083,6 +11113,30 @@ impl EnhancedOrchestrator {
         let populate_fec = je_config.populate_fec_fields;
         let mut generator = DocumentFlowJeGenerator::with_config_and_seed(je_config, self.seed);
 
+        // Master-data CC / PC pools so document-flow-derived JEs
+        // (P2P / O2C postings) reference IDs that join back to the
+        // cost-centers / profit-centers masters. Same plumbing as
+        // for `JeGenerator` above; falls back to hardcoded const
+        // pools when masters are absent.
+        let cc_pool: Vec<String> = self
+            .master_data
+            .cost_centers
+            .iter()
+            .map(|c| c.id.clone())
+            .collect();
+        let pc_pool: Vec<String> = self
+            .master_data
+            .profit_centers
+            .iter()
+            .map(|p| p.id.clone())
+            .collect();
+        if !cc_pool.is_empty() {
+            generator.set_cost_center_pool(cc_pool);
+        }
+        if !pc_pool.is_empty() {
+            generator.set_profit_center_pool(pc_pool);
+        }
+
         // Build auxiliary account lookup from vendor/customer master data so that
         // FEC auxiliary_account_number uses framework-specific GL accounts (e.g.,
         // PCG "4010001") instead of raw partner IDs.
diff --git a/crates/datasynth-runtime/src/output_writer.rs b/crates/datasynth-runtime/src/output_writer.rs
index 41a3fe6e..e2d1ef6b 100644
--- a/crates/datasynth-runtime/src/output_writer.rs
+++ b/crates/datasynth-runtime/src/output_writer.rs
@@ -102,6 +102,11 @@ fn write_journal_entries_csv(
     // v5.6.0 added (ISO 21378 Audit Data Collection classification):
     //   account_class, account_class_name (Level-2 e.g.
"A.B" / "Trade Receivables") // account_sub_class, account_sub_class_name (Level-3 e.g. "A.B.A" / "Trade Accounts Receivable") + // v5.8.0 added: + // predecessor_line_id (UUID v5 of preceding line in document chain; + // populated by document_flow_je_generator for + // P2P / O2C chains, empty for chain heads and + // for purely-GL adjustments) writeln!( w, "document_id,company_code,fiscal_year,fiscal_period,posting_date,document_date,\ @@ -113,7 +118,8 @@ fn write_journal_entries_csv( is_manual,is_post_close,source_system,\ account_description,financial_statement_category,\ assignment,value_date,tax_code,transaction_id,\ - account_class,account_class_name,account_sub_class,account_sub_class_name" + account_class,account_class_name,account_sub_class,account_sub_class_name,\ + predecessor_line_id" )?; // Build a CoA → (short_description, ISO class, ISO sub-class) lookup. @@ -175,7 +181,7 @@ fn write_journal_entries_csv( }); writeln!( w, - "{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{}", + "{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{}", h.document_id, csv_escape(&h.company_code), h.fiscal_year, @@ -220,6 +226,7 @@ fn write_journal_entries_csv( csv_escape(coa_class_name), csv_escape(coa_sub_class), csv_escape(coa_sub_class_name), + csv_opt_str(&line.predecessor_line_id), )?; } } @@ -252,6 +259,7 @@ fn write_journal_entries_csv( fn write_je_network_csv( result: &EnhancedGenerationResult, output_dir: &Path, + method: datasynth_config::JeNetworkMethod, ) -> Result<(), Box> { use rust_decimal::Decimal; @@ -323,6 +331,19 @@ fn write_je_network_csv( continue; } + // Method A: bijective on 2-line entries only. Multi-line JEs + // are skipped under this method — see Ivertowski (2024) + // Methods A through E. 
The full Cartesian product of a
+        // multi-line consolidation produces O(n × m) edges per JE,
+        // which dominates total dataset size at scale; multi-line
+        // edges are only emitted under the default setting,
+        // `graph_export.je_network.method: cartesian`.
+        if method == datasynth_config::JeNetworkMethod::A
+            && !(debits.len() == 1 && credits.len() == 1)
+        {
+            continue;
+        }
+
         let total_debit: Decimal = debits.iter().map(|i| je.lines[*i].debit_amount).sum();
         let total_credit: Decimal = credits.iter().map(|i| je.lines[*i].credit_amount).sum();
         if total_debit.is_zero() || total_credit.is_zero() {
@@ -536,6 +557,7 @@ pub fn write_all_output(
             datasynth_config::FileFormat::Csv,
             datasynth_config::FileFormat::Json,
         ],
+        datasynth_config::JeNetworkMethod::default(),
     )
 }
 
@@ -560,7 +582,13 @@ pub fn write_all_output_with_root(
     formats: &[datasynth_config::FileFormat],
 ) -> Result<(), Box<dyn std::error::Error>> {
     let effective = root.effective_dir();
-    write_all_output_with_layout(result, &effective, export_layout, formats)
+    write_all_output_with_layout(
+        result,
+        &effective,
+        export_layout,
+        formats,
+        datasynth_config::JeNetworkMethod::default(),
+    )
 }
 
 /// Write all generated data with a configurable export layout and format set.
@@ -573,6 +601,7 @@ pub fn write_all_output_with_layout(
     output_dir: &Path,
     export_layout: datasynth_config::ExportLayout,
     formats: &[datasynth_config::FileFormat],
+    je_network_method: datasynth_config::JeNetworkMethod,
 ) -> Result<(), Box<dyn std::error::Error>> {
     let csv_enabled = formats.is_empty()
         || formats.contains(&datasynth_config::FileFormat::Csv)
@@ -632,7 +661,7 @@ pub fn write_all_output_with_layout(
         // Always emit when CSV is requested; cheap relative to the
         // main JE table.
s.spawn(|| { - if let Err(e) = write_je_network_csv(result, output_dir) { + if let Err(e) = write_je_network_csv(result, output_dir, je_network_method) { warn!("Failed to write graphs/je_network.csv: {}", e); } }); diff --git a/scripts/hf_to_parquet.py b/scripts/hf_to_parquet.py index 76ad3b10..fd3476b9 100755 --- a/scripts/hf_to_parquet.py +++ b/scripts/hf_to_parquet.py @@ -1,15 +1,21 @@ #!/usr/bin/env python3 """ -Convert v5.5.1 generation outputs into HF-ready parquet artefacts. +Convert DataSynth generation outputs into HF-ready parquet artefacts. Inputs (under --output-dir): journal_entries.csv -> data/train-{NNNNN}-of-{TOTAL}.parquet + graphs/je_network.csv -> je_network.parquet (v5.8.0+) chart_of_accounts.json -> chart_of_accounts.parquet period_close/trial_balances.json -> trial_balances.parquet master_data/cost_centers.json -> cost_centers.parquet (optional) master_data/profit_centers.json -> profit_centers.parquet (optional) The JE table is sharded so each shard is <= ~150 MB compressed. + +The accounting-network edge list (`graphs/je_network.csv`) is the +v5.8.0+ flat Cartesian-product of debit↔credit edges per journal +entry — joinable back to `journal_entries.parquet` via +`from_line_id` / `to_line_id`. """ from __future__ import annotations @@ -89,6 +95,31 @@ def flush(shard_idx: int, dfs: list[pd.DataFrame]) -> None: return rows_emitted +def je_network_csv_to_parquet(csv_path: Path, out_path: Path) -> int: + """Convert the v5.8.0 `graphs/je_network.csv` flat edge list to parquet. + + 13 columns; written as a single file (typical row counts are 1.5–3× + the JE row count, but the per-row payload is much smaller, so a + single parquet stays well under the ~150 MB shard threshold for + the configurations published on HF). 
+ """ + dtypes = { + "amount": "float64", + "confidence": "float64", + "is_fraud": "bool", + "is_anomaly": "bool", + } + parse_dates = ["posting_date"] + + df = pd.read_csv(csv_path, dtype=dtypes, parse_dates=parse_dates, low_memory=False) + out_path.parent.mkdir(parents=True, exist_ok=True) + table = pa.Table.from_pandas(df, preserve_index=False) + pq.write_table(table, out_path, compression="zstd", compression_level=9) + size_mb = out_path.stat().st_size / 1024 / 1024 + print(f" wrote {out_path.name}: {len(df):,} edges, {size_mb:.1f} MB (Accounting Network)") + return len(df) + + def json_to_parquet(json_path: Path, out_path: Path, label: str) -> int: """Convert a JSON array (or dict-with-list) to a single parquet file.""" with open(json_path) as f: @@ -220,6 +251,12 @@ def main() -> int: print(f"Converting {src.name} -> parquet …") json_to_parquet(src, dst, label) + # 5. Accounting network (v5.8.0+) — flat Cartesian-product edge list. + je_network_csv = out / "graphs" / "je_network.csv" + if je_network_csv.exists(): + print("Converting graphs/je_network.csv -> parquet …") + je_network_csv_to_parquet(je_network_csv, hf / "je_network.parquet") + print(f"\nAll artefacts written to {hf}/") return 0