chore(hf): bake audit-p2p + supply-chain-ocel recipes; align all data-card READMEs #189

Merged
mivertowski merged 2 commits into main from chore/hf-dataset-fleet-v5.9.0
May 9, 2026

@mivertowski
Owner

HF fleet refresh, v5.9.0. Bakes reproducible recipes for two
more VynFi datasets and aligns the data-card READMEs across the
whole fleet.

What

Recipes baked into the repo

  • configs/examples/hf/audit_p2p.yaml — 234-document P2P/O2C
    corpus. 10 entities × 6 months × manufacturing × 3 %
    document-level fraud × Method-A je_network bonus.
  • configs/examples/hf/supply_chain_ocel.yaml — 30 k-event /
    7.4 k-object native OCEL 2.0 corpus. ~50 % bigger than the
    v0.x release (was 20 k events) for serious process-mining
    usability.
  • scripts/hf_audit_p2p_to_parquet.py — flattens the document-
    flow JSON files into HF's subdir-per-config layout.
  • scripts/hf_supply_chain_ocel_to_parquet.py — converts OCPM
    JSON outputs to HF's subdir-per-config layout.
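
The conversion step described above can be sketched roughly as follows. This is not the code in scripts/hf_audit_p2p_to_parquet.py; it is a minimal illustration that assumes each input file is named `<config>.json`, holds a top-level JSON array of objects, and is written to `<config>/train.parquet`. The helper names `flatten`, `plan_outputs`, and `convert` are hypothetical, as is the dotted-key flattening rule; only the four config names come from the commit message.

```python
import json
from pathlib import Path

# Config names taken from the commit message; everything else here is assumed.
CONFIGS = ["purchase_orders", "goods_receipts", "vendor_invoices", "payments"]


def flatten(record, parent=""):
    """Flatten nested dicts into one level using dotted keys; lists are JSON-encoded."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        elif isinstance(value, list):
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat


def plan_outputs(input_dir, output_dir):
    """Map each existing <config>.json to <output_dir>/<config>/train.parquet."""
    plan = {}
    for config in CONFIGS:
        source = Path(input_dir) / f"{config}.json"
        if source.exists():
            plan[source] = Path(output_dir) / config / "train.parquet"
    return plan


def convert(input_dir, output_dir):
    """Write one Parquet file per config (requires pandas and pyarrow)."""
    import pandas as pd  # imported lazily so the helpers above stay stdlib-only

    for source, target in plan_outputs(input_dir, output_dir).items():
        rows = [flatten(r) for r in json.loads(source.read_text())]
        target.parent.mkdir(parents=True, exist_ok=True)
        pd.DataFrame(rows).to_parquet(target, index=False)
```

Keeping one subdirectory per config means each document type can be loaded as its own table without touching the others.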

Datasets uploaded to HF

READMEs aligned

All seven VynFi datasets now share the same data-card structure:
frontmatter → TL;DR → "What's included" → Schema highlights →
Quick start → Generation → What changed in v5.9.0 (or
Provenance note for non-regenerated sets) → License → Citation →
Related.
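
A structure like this is easy to check mechanically. The sketch below is not part of the PR; it assumes each section is a level-2 Markdown heading (`## Title`) spelled exactly as listed above, and that the frontmatter is a YAML block opening with `---` on the first line. The function name `check_card` is hypothetical.

```python
import re

# Section order as listed in the PR description.
# Each slot is a tuple of accepted titles.
EXPECTED = [
    ("TL;DR",),
    ("What's included",),
    ("Schema highlights",),
    ("Quick start",),
    ("Generation",),
    ("What changed in v5.9.0", "Provenance note"),
    ("License",),
    ("Citation",),
    ("Related",),
]


def check_card(readme_text):
    """Return a list of problems; an empty list means the card matches the layout."""
    problems = []
    if not readme_text.startswith("---\n"):
        problems.append("missing YAML frontmatter")
    # Assumption: sections are level-2 Markdown headings ("## Title").
    headings = re.findall(r"^## +(.+?)[ \t]*$", readme_text, flags=re.MULTILINE)
    if len(headings) != len(EXPECTED):
        problems.append(f"expected {len(EXPECTED)} sections, found {len(headings)}")
    for position, (found, accepted) in enumerate(zip(headings, EXPECTED), start=1):
        if found not in accepted:
            problems.append(
                f"section {position}: got {found!r}, expected one of {accepted}"
            )
    return problems
```

Running `check_card` over every README in the fleet would flag a card whose sections drift out of order. A line starting with `## ` inside a fenced code block would be counted as a heading, so a stricter check would need a real Markdown parser.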

The four datasets that were not regenerated for v5.9.0
(vynfi-aml-100k, vynfi-sar-narratives,
vynfi-ocel-manufacturing, vynfi-group-audit-enterprise-2000)
each carry a Provenance note explaining why the v5.9.0 fixes
don't apply (banking module / reconstructed events / group-audit
engine were untouched).

Cleanup

The six duplicate datasets without the vynfi- prefix
(audit-p2p, supply-chain-ocel, sar-narratives, aml-100k,
journal-entries-1m, ocel-manufacturing) were deleted from the
hub.
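
For illustration, duplicates of this kind can be identified from a plain list of repository names. The sketch below is an assumption about how one might do it, not a record of how the cleanup was performed; the function name `unprefixed_duplicates` is hypothetical.

```python
PREFIX = "vynfi-"


def unprefixed_duplicates(repo_names):
    """Return names that also exist with the vynfi- prefix, sorted alphabetically."""
    names = set(repo_names)
    return sorted(
        name
        for name in names
        if not name.startswith(PREFIX) and PREFIX + name in names
    )
```

The result is only a candidate list. Deleting a dataset from the hub is irreversible, so any actual deletion (for example with `HfApi.delete_repo(..., repo_type="dataset")` from `huggingface_hub`) would want a manual review first.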

Why

The HF org is the public showcase for the DataSynth project.
Consumers browsing the org should see uniform UX across the
fleet — same headings in the same order, same citation form,
same load-dataset snippet style. This PR makes that uniformity a
property of the org.

🤖 Generated with Claude Code

mivertowski and others added 2 commits May 9, 2026 08:55
VynFi HF dataset fleet refresh, v5.9.0:

- configs/examples/hf/audit_p2p.yaml (NEW) — reproducible recipe
  for the 234-document P2P/O2C corpus published as
  `VynFi/vynfi-audit-p2p`.  10 entities × 6 months × manufacturing,
  3 % document-level fraud, predecessor_line_id chain context, and
  the v5.9.0 master-pool linkages on cost-centre / profit-centre /
  employees so document references join the JE-side master tables.
- configs/examples/hf/supply_chain_ocel.yaml (NEW) — reproducible
  recipe for the 30 k-event / 7.4 k-object native OCEL 2.0 dataset
  published as `VynFi/vynfi-supply-chain-ocel`.  ~50 % bigger than
  the v0.x release (was 20 k events) for serious process-mining
  usability, while keeping per-company JE volume at `ten_k` to
  bound the in-memory orchestrator footprint.
- scripts/hf_audit_p2p_to_parquet.py (NEW) — converts the
  document-flow JSON files to the HF subdir-per-config layout
  (purchase_orders/, goods_receipts/, vendor_invoices/, payments/).
- scripts/hf_supply_chain_ocel_to_parquet.py (NEW) — converts the
  process-mining JSON outputs to the HF subdir-per-config layout
  (events/, objects/, anomaly_labels/, document_events/).

The existing `journal_entries_1m.yaml` + `hf_to_parquet.py`
remain the canonical recipe for the 1 M-line JE dataset.

Companion to the README alignment pushed to all seven VynFi
datasets in this same refresh: every dataset card now shares a
common structure (frontmatter → TL;DR → "What's included" →
Schema highlights → Quick start → Generation → What changed in
v5.9.0 / Provenance note → License → Citation → Related), so a
consumer browsing the org sees a uniform UX across the fleet.

Cleanup: the six duplicate datasets without the `vynfi-` prefix
(`audit-p2p`, `supply-chain-ocel`, `sar-narratives`, `aml-100k`,
`journal-entries-1m`, `ocel-manufacturing`) were deleted from the
hub as part of this refresh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Interactive ISO 21378 Level-2 account-class graph from
VynFi/vynfi-journal-entries-1m je_network.parquet — Method-A
edge list aggregated to ~30 class nodes with click-to-drill-down,
business-process / fraud / anomaly filters, and force-directed
or hierarchical layout.

Live: https://huggingface.co/spaces/VynFi/accounting-network-explorer
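
The aggregation from an account-level edge list to class-level nodes can be sketched as below. This is not the Space's code. It assumes edges arrive as `(source_account, target_account, amount)` tuples, and it uses the first two digits of the account number as a stand-in for the ISO 21378 Level-2 class, which is a hypothetical mapping chosen only for illustration.

```python
from collections import defaultdict


def account_class(account_number):
    """Hypothetical mapping: use the first two digits as the class code."""
    return str(account_number)[:2]


def aggregate_edges(edges):
    """Collapse account-level (source, target, amount) edges into class-level totals."""
    totals = defaultdict(float)
    for source, target, amount in edges:
        totals[(account_class(source), account_class(target))] += amount
    return dict(totals)
```

Each key in the result is one class-to-class edge, so the node set of the drawn graph is the set of class codes that appear in the keys.
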
@mivertowski merged commit 8efd54b into main May 9, 2026
16 checks passed
@mivertowski deleted the chore/hf-dataset-fleet-v5.9.0 branch May 9, 2026 07:35