chore(hf): bake audit-p2p + supply-chain-ocel recipes; align all data-card READMEs #189
Merged
mivertowski merged 2 commits into main on May 9, 2026
VynFi HF dataset fleet refresh, v5.9.0:

- configs/examples/hf/audit_p2p.yaml (NEW): reproducible recipe for the 234-document P2P/O2C corpus published as `VynFi/vynfi-audit-p2p`. 10 entities × 6 months × manufacturing, 3 % document-level fraud, predecessor_line_id chain context, and the v5.9.0 master-pool linkages on cost-centre / profit-centre / employees so document references join the JE-side master tables.
- configs/examples/hf/supply_chain_ocel.yaml (NEW): reproducible recipe for the 30 k-event / 7.4 k-object native OCEL 2.0 dataset published as `VynFi/vynfi-supply-chain-ocel`. ~50 % bigger than the v0.x release (was 20 k events) for serious process-mining usability, while keeping per-company JE volume at `ten_k` to bound the in-memory orchestrator footprint.
- scripts/hf_audit_p2p_to_parquet.py (NEW): converts the document-flow JSON files to the HF subdir-per-config layout (purchase_orders/, goods_receipts/, vendor_invoices/, payments/).
- scripts/hf_supply_chain_ocel_to_parquet.py (NEW): converts the process-mining JSON outputs to the HF subdir-per-config layout (events/, objects/, anomaly_labels/, document_events/).

The existing `journal_entries_1m.yaml` + `hf_to_parquet.py` remain the canonical recipe for the 1 M-line JE dataset.

Companion to the README alignment pushed to all seven VynFi datasets in this same refresh: every dataset card now shares a common structure (frontmatter → TL;DR → "What's included" → Schema highlights → Quick start → Generation → What changed in v5.9.0 / Provenance note → License → Citation → Related), so a consumer browsing the org sees a uniform UX across the fleet.

Cleanup: the six duplicate datasets without the `vynfi-` prefix (`audit-p2p`, `supply-chain-ocel`, `sar-narratives`, `aml-100k`, `journal-entries-1m`, `ocel-manufacturing`) were deleted from the hub as part of this refresh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Interactive ISO 21378 Level-2 account-class graph from VynFi/vynfi-journal-entries-1m je_network.parquet — Method-A edge list aggregated to ~30 class nodes with click-to-drill-down, business-process / fraud / anomaly filters, and force-directed or hierarchical layout. Live: https://huggingface.co/spaces/VynFi/accounting-network-explorer
HF fleet refresh, v5.9.0. Bakes reproducible recipes for two
more VynFi datasets and aligns the data-card READMEs across the
whole fleet.
What
Recipes baked into the repo
- configs/examples/hf/audit_p2p.yaml: 234-document P2P/O2C corpus. 10 entities × 6 months × manufacturing × 3 % document-level fraud × Method-A je_network bonus.
- configs/examples/hf/supply_chain_ocel.yaml: 30 k-event / 7.4 k-object native OCEL 2.0 corpus. ~50 % bigger than the v0.x release (was 20 k events) for serious process-mining usability.
- scripts/hf_audit_p2p_to_parquet.py: flattens the document-flow JSON files into HF's subdir-per-config layout.
- scripts/hf_supply_chain_ocel_to_parquet.py: converts OCPM JSON outputs to HF's subdir-per-config layout.
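For readers unfamiliar with the subdir-per-config layout, here is a minimal sketch of what such a converter might look like. It is not the code in scripts/hf_audit_p2p_to_parquet.py. The four config names come from the commit message; the input file names (`<config>.json`), the output file name (`train.parquet`), the dotted-column flattening, and the assumption that each JSON file is a top-level array of objects are all hypothetical.

```python
import json
from pathlib import Path

# Config directory names are taken from the commit message. The matching
# JSON input file names are an assumption made for this sketch.
AUDIT_P2P_CONFIGS = ("purchase_orders", "goods_receipts", "vendor_invoices", "payments")


def plan_layout(json_dir, out_dir, configs=AUDIT_P2P_CONFIGS):
    """Map each config to (source JSON, target Parquet); skip missing sources."""
    json_dir, out_dir = Path(json_dir), Path(out_dir)
    plan = {}
    for name in configs:
        src = json_dir / f"{name}.json"
        if src.exists():
            # One subdirectory per config; "train.parquet" is a hypothetical name.
            plan[name] = (src, out_dir / name / "train.parquet")
    return plan


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names; lists become JSON strings."""
    flat = {}
    for key, value in record.items():
        col = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{col}."))
        elif isinstance(value, list):
            flat[col] = json.dumps(value)
        else:
            flat[col] = value
    return flat


def convert(json_dir, out_dir):
    """Write one Parquet file per config. Needs pandas and pyarrow installed."""
    import pandas as pd  # imported lazily so plan_layout/flatten work without it

    for name, (src, dst) in plan_layout(json_dir, out_dir).items():
        # Assumes each source file holds a top-level JSON array of objects.
        rows = [flatten(r) for r in json.loads(src.read_text())]
        dst.parent.mkdir(parents=True, exist_ok=True)
        pd.DataFrame(rows).to_parquet(dst, index=False)
```

Keeping the path planning separate from the Parquet write makes the layout easy to inspect before any files are produced.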
Datasets uploaded to HF
- VynFi/vynfi-journal-entries-1m: already refreshed in PR #188 (release v5.9.0: je_network.method config + JE CSV widen + post-v5.8.0 fixes).
- VynFi/vynfi-audit-p2p: re-uploaded with v5.9.0 generation.
- VynFi/vynfi-supply-chain-ocel: re-uploaded with v5.9.0 generation, 30 k events / 7.4 k objects.
READMEs aligned
All seven VynFi datasets now share the same data-card structure:
frontmatter → TL;DR → "What's included" → Schema highlights →
Quick start → Generation → What changed in v5.9.0 (or
Provenance note for non-regenerated sets) → License → Citation →
Related.
The four datasets that were not regenerated for v5.9.0
(vynfi-aml-100k, vynfi-sar-narratives, vynfi-ocel-manufacturing,
vynfi-group-audit-enterprise-2000) each carry a Provenance note
explaining why the v5.9.0 fixes don't apply (banking module /
reconstructed events / group-audit engine were untouched).
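The shared card structure lends itself to a mechanical check. The sketch below is one possible way to verify that a README follows the section order listed above. It is not part of this PR. The section names are taken from the description, but the assumption that every section is a level-2 (`## `) heading, and the function names, are hypothetical.

```python
import re

# Section order as listed in the PR description. A tuple means either
# heading is acceptable in that position.
EXPECTED = [
    "TL;DR",
    "What's included",
    "Schema highlights",
    "Quick start",
    "Generation",
    ("What changed in v5.9.0", "Provenance note"),
    "License",
    "Citation",
    "Related",
]


def card_headings(readme_text):
    """Return the level-2 headings that follow the optional YAML frontmatter."""
    body = re.sub(r"\A---\n.*?\n---\n", "", readme_text, flags=re.DOTALL)
    # Assumes sections use "## " headings; deeper headings are ignored.
    return [m.group(1).strip() for m in re.finditer(r"^## +(.+)$", body, flags=re.MULTILINE)]


def follows_structure(readme_text):
    """True if the headings match EXPECTED exactly, in order."""
    headings = card_headings(readme_text)
    if len(headings) != len(EXPECTED):
        return False
    for got, want in zip(headings, EXPECTED):
        allowed = want if isinstance(want, tuple) else (want,)
        if got not in allowed:
            return False
    return True
```

A check like this could run over every dataset card in the org, so that a regenerated set ("What changed in v5.9.0") and a non-regenerated set ("Provenance note") both pass while a reordered or renamed section fails.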
Cleanup
The six duplicate datasets without the vynfi- prefix
(audit-p2p, supply-chain-ocel, sar-narratives, aml-100k,
journal-entries-1m, ocel-manufacturing) were deleted from the hub.
Why
The HF org is the public showcase for the DataSynth project.
Consumers browsing the org should see uniform UX across the
fleet — same headings in the same order, same citation form,
same load-dataset snippet style. This PR makes that uniformity a
property of the org.
🤖 Generated with Claude Code