A SHACL-inspired, LPG-native data-quality validator. You declare the expected shape of a property graph as a YAML PG Shape Contract; PGSC loads your data into an in-memory graph, runs the contract, and emits a violation report. Rules are meant to be authored by hand or generated from natural language by an LLM.
First use case: detecting the data-quality defects injected by twindq-bench and scoring detection against its ground-truth ledger.
📖 PG Shape Contract — Language Specification — the full syntax and semantics reference (SHACL-style): document structure, both shape kinds, and every constraint component. Learning by example? See authoring by example — natural-language requirements translated to shape definitions.
SHACL is RDF-native; real customer graphs (Neo4j, Spanner Graph, CSV exports) are property graphs where relationships are first-class and entities carry property bags. PGSC keeps SHACL's grammar (shapes, cardinality, severity) and swaps in an LPG execution model plus two LPG-native ideas: source provenance (catalog vs telemetry in one graph) and named cross-source checks.
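To make the LPG execution model concrete, here is a minimal sketch of a property graph whose nodes and edges carry a source tag, with two NodeShape-style checks (a required property and an edge cardinality) scoped to one source. The `Node` and `Edge` classes, the check functions, and the sample labels are all hypothetical illustrations, not PGSC's actual model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    label: str
    source: str  # provenance tag: "catalog" or "telemetry"
    props: dict = field(default_factory=dict)  # the node's property bag


@dataclass
class Edge:
    src: str
    dst: str
    type: str
    source: str  # relationships are first-class and carry provenance too


def check_required_property(nodes, label, source, prop):
    """Return ids of `label` nodes from `source` whose `prop` is missing or empty."""
    return [
        n.id
        for n in nodes
        if n.label == label
        and n.source == source
        and n.props.get(prop) in (None, "")
    ]


def check_out_degree(nodes, edges, label, edge_type, min_count, max_count):
    """Return ids of `label` nodes whose outgoing `edge_type` count is out of range."""
    counts = {n.id: 0 for n in nodes if n.label == label}
    for e in edges:
        if e.type == edge_type and e.src in counts:
            counts[e.src] += 1
    return [nid for nid, c in counts.items() if not (min_count <= c <= max_count)]


# Hypothetical sample data, not taken from twindq-bench.
nodes = [
    Node("d1", "Device", "catalog", {"vendor": "acme"}),
    Node("d2", "Device", "catalog", {}),
    Node("i1", "Interface", "catalog", {"name": "eth0"}),
]
edges = [Edge("d1", "i1", "HAS_INTERFACE", "catalog")]

missing_vendor = check_required_property(nodes, "Device", "catalog", "vendor")
bad_degree = check_out_degree(nodes, edges, "Device", "HAS_INTERFACE", 1, 48)
print(missing_vendor)  # ['d2']
print(bad_degree)      # ['d2']
```

Scoping each check to a single `source` is what allows catalog and telemetry data to sit in one graph without one source's gaps being reported against the other.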
pip install -e '.[dev]' # add ',viz' for the KuzuDB / GraphXR export
pytest -q

# validate a twindq-bench output dir against a contract → violation-report.yaml
pgsc validate --contract contracts/telecom-twindq-v1.yaml \
--bench-out ../gnnsandbox-twindq-bench/benchmark/out/chaos \
--report out/chaos-report.yaml
# validate AND score against the ground-truth ledger → precision/recall table
pgsc score --contract contracts/telecom-twindq-v1.yaml \
--bench-out ../gnnsandbox-twindq-bench/benchmark/out/chaos
# debug: load + summarise a benchmark output dir
pgsc load --bench-out ../gnnsandbox-twindq-bench/benchmark/out/chaos

    twindq-bench out/<scenario>/            pgsc.graph.load_twindq
      catalog/*.csv   (as-designed) ─┐
      telemetry.json  (as-running)  ─┴─►  one merged LPG, every node/edge tagged
                                          source ∈ {catalog, telemetry} + a
                                          source-independent resolution `key`
                                            │  pgsc.validate.engine
      contract.yaml ──► typed shapes ──────►│  NodeShape        (per source)
                        (pgsc.model)        │  CrossSourceShape (compare via key)
                                            ▼
                                          list[Violation] ──► report.yaml
                                            │  pgsc.bench.score (vs defect_ledger.jsonl)
                                            ▼
                                          precision / recall per class & ISO dimension
Catalog references entities by name, telemetry by synthetic id; both are
mapped to a shared key, so cross-source defects (an entity in one source only,
or conflicting attributes) surface as key mismatches. See the
language spec for the full reference, contracts/ for the
telecom contracts, and pgsc/model/shapes.py for the normative shape model.
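The shared-key comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the loader's implementation: the `resolution_key` normalisation rule, the assumption that telemetry records carry a `name` field, the finding labels, and all sample data are hypothetical.

```python
import re


def resolution_key(name: str) -> str:
    """Hypothetical normalisation: lowercase, then drop non-alphanumerics."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Catalog side: entities referenced by name (sample data, assumed shape).
catalog = {
    "Core-Router-01": {"vendor": "acme", "site": "LON"},
    "edge_switch_07": {"vendor": "acme", "site": None},
}

# Telemetry side: entities referenced by synthetic id (sample data, assumed shape).
telemetry = {
    "dev-4f2a": {"name": "core router 01", "vendor": "acme", "site": "LON"},
    "dev-9c1e": {"name": "edge-switch-07", "vendor": "acme", "site": "PAR"},
    "dev-0b77": {"name": "ghost-node-99", "vendor": "zeta", "site": "NYC"},
}


def cross_source_findings(catalog, telemetry, attrs):
    """Compare both sources via the shared key and list mismatches."""
    cat = {resolution_key(name): props for name, props in catalog.items()}
    tel = {resolution_key(props["name"]): props for props in telemetry.values()}
    findings = []
    for key in sorted(cat.keys() | tel.keys()):
        if key not in tel:
            findings.append((key, "missing_in_telemetry"))
        elif key not in cat:
            findings.append((key, "missing_in_catalog"))
        else:
            # Entity exists in both sources: compare the selected attributes.
            for attr in attrs:
                if cat[key].get(attr) != tel[key].get(attr):
                    findings.append((key, f"attribute_conflict:{attr}"))
    return findings


findings = cross_source_findings(catalog, telemetry, ["vendor", "site"])
for finding in findings:
    print(finding)
# ('edgeswitch07', 'attribute_conflict:site')
# ('ghostnode99', 'missing_in_catalog')
```

Because `Core-Router-01` and `core router 01` normalise to the same key, the spelling difference alone produces no finding; only the entity present in one source and the conflicting `site` attribute surface.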
The validator and scorer work end-to-end on all five twindq-bench scenarios,
scored against contracts/telecom-twindq-v1.yaml, with 0 violations on
defect-free data and 100% precision on every scenario.
| Scenario | detection recall | precision | notes |
|---|---|---|---|
| topology_mismatch | 100% | 100% | |
| phantom_devices | 100% | 100% | |
| stale_catalog | 100% | 100% | cross-source attribute-presence catches stale nulls |
| naming_drift | 100% | 100% | normalized-key reconciliation collapses a rename to one signal |
| chaos | 91% | 100% | residual recall = catalog-side defects that are structurally unobservable |
Every contract rule yields 0 violations on defect-free data (the calibration invariant, enforced by the test suite), and precision is 100% on all five scenarios. Earlier sub-100% precision came from a gap in normalized-key entity resolution, since closed: a renamed, split, or merged entity now reconciles to a single finding instead of flooding its interfaces and links with cascade false positives. Closing the gap raised naming_drift precision from 10% to 100% and chaos precision from 85% to 100%, with no recall loss.
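For readers unfamiliar with how such figures are derived, the following is a minimal sketch of per-class precision and recall computed by matching reported violations against a ground-truth ledger. It is not `pgsc.bench.score`: the `(entity_key, defect_class)` matching rule, the `score` function, and the sample records are assumptions made for illustration, and the real `defect_ledger.jsonl` schema may differ.

```python
from collections import defaultdict


def score(violations, ledger):
    """Return {defect_class: (precision, recall)}.

    Both inputs are iterables of (entity_key, defect_class) pairs. A violation
    counts as a true positive only if the same pair appears in the ledger
    (an assumed matching rule).
    """
    found, truth = set(violations), set(ledger)
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for _, cls in found & truth:
        counts[cls]["tp"] += 1  # reported and injected
    for _, cls in found - truth:
        counts[cls]["fp"] += 1  # reported but never injected
    for _, cls in truth - found:
        counts[cls]["fn"] += 1  # injected but not reported

    table = {}
    for cls, c in counts.items():
        reported = c["tp"] + c["fp"]
        injected = c["tp"] + c["fn"]
        precision = c["tp"] / reported if reported else None
        recall = c["tp"] / injected if injected else None
        table[cls] = (precision, recall)
    return table


# Hypothetical sample records, not real benchmark output.
ledger = [
    ("d1", "validity.malformed_id"),
    ("d2", "validity.malformed_id"),
    ("l1", "cross_source.structural_conflict"),
]
violations = [
    ("d1", "validity.malformed_id"),
    ("l1", "cross_source.structural_conflict"),
]

table = score(violations, ledger)
print(table["validity.malformed_id"])             # (1.0, 0.5)
print(table["cross_source.structural_conflict"])  # (1.0, 1.0)
```

In this toy case the unreported `d2` defect lowers recall for its class while precision stays at 1.0, the same pattern as a defect the loader cannot observe.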
Known v1 limits: (a) chaos recall is 91% — the residual is catalog-side defects
the loader cannot observe: two catalog-interface validity.malformed_id (the CSV
renders interfaces by name, never the id) and one
cross_source.structural_conflict whose catalog endpoints are empty — a
benchmark-side note for Matteo; (b) v1 ships two shape kinds (NodeShape,
CrossSourceShape) and no general subgraph/edge DSL yet. The natural-language →
contract authoring path is delivered as the /pgsc-author Claude Code skill
(see .claude/skills/pgsc-author/), not a standalone llm/ pipeline.