PGSC — PG Shape Constraints

A SHACL-inspired, LPG-native data-quality validator. You declare the expected shape of a property graph as a YAML PG Shape Contract; PGSC loads your data into an in-memory graph, runs the contract, and emits a violation report. Rules are meant to be authored by hand or generated from natural language by an LLM.

First use case: detecting the data-quality defects injected by twindq-bench and scoring detection against its ground-truth ledger.

📖 PG Shape Contract — Language Specification — the full syntax and semantics reference (SHACL-style): document structure, both shape kinds, and every constraint component. Learning by example? See authoring by example — natural-language requirements translated to shape definitions.

Why not just SHACL?

SHACL is RDF-native; real customer graphs (Neo4j, Spanner Graph, CSV exports) are property graphs where relationships are first-class and entities carry property bags. PGSC keeps SHACL's grammar (shapes, cardinality, severity) and swaps in an LPG execution model plus two LPG-native ideas: source provenance (catalog vs telemetry in one graph) and named cross-source checks.
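To make "SHACL grammar, LPG execution" concrete, here is a minimal Python sketch of a cardinality check over property-bag nodes. It is an illustration only: `Node`, `PropertyRule`, and `check_node` are hypothetical names, not PGSC's model (the normative one lives in pgsc/model/shapes.py), and the `Device` / `hostname` data is made up.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A property-graph node: a label, a source tag, and a property bag."""
    label: str
    source: str                       # "catalog" or "telemetry"
    props: dict = field(default_factory=dict)


@dataclass
class PropertyRule:
    """Hypothetical SHACL-style constraint: min/max cardinality plus a severity."""
    name: str
    min_count: int = 0
    max_count: Optional[int] = None   # None means unbounded
    severity: str = "Violation"


def count_values(node, name):
    """Missing or None counts as 0 values; a list counts its items; anything else is 1."""
    value = node.props.get(name)
    if value is None:
        return 0
    return len(value) if isinstance(value, list) else 1


def check_node(node, target_label, target_source, rules):
    """Return one message per broken rule; nodes outside the target are skipped."""
    if node.label != target_label or node.source != target_source:
        return []
    findings = []
    for rule in rules:
        n = count_values(node, rule.name)
        too_few = n < rule.min_count
        too_many = rule.max_count is not None and n > rule.max_count
        if too_few or too_many:
            findings.append(f"{rule.severity}: {node.label}.{rule.name} has {n} value(s)")
    return findings


# Illustrative data: every catalog Device must carry exactly one hostname.
rules = [PropertyRule("hostname", min_count=1, max_count=1)]
good = Node("Device", "catalog", {"hostname": "core-rtr-01"})
bad = Node("Device", "catalog", {"hostname": None})
findings_good = check_node(good, "Device", "catalog", rules)
findings_bad = check_node(bad, "Device", "catalog", rules)
```

The check is scoped by both label and source, which mirrors the "NodeShape (per source)" idea: the same label can be held to different rules depending on whether the node came from the catalog or from telemetry.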

Install

pip install -e '.[dev]'          # add ',viz' for the KuzuDB / GraphXR export
pytest -q

Use

# validate a twindq-bench output dir against a contract → violation-report.yaml
pgsc validate --contract contracts/telecom-twindq-v1.yaml \
  --bench-out ../gnnsandbox-twindq-bench/benchmark/out/chaos \
  --report out/chaos-report.yaml

# validate AND score against the ground-truth ledger → precision/recall table
pgsc score --contract contracts/telecom-twindq-v1.yaml \
  --bench-out ../gnnsandbox-twindq-bench/benchmark/out/chaos

# debug: load + summarise a benchmark output dir
pgsc load --bench-out ../gnnsandbox-twindq-bench/benchmark/out/chaos

How it works

twindq-bench out/<scenario>/        pgsc.graph.load_twindq
  catalog/*.csv  (as-designed) ─┐
  telemetry.json (as-running)  ─┴─► one merged LPG, every node/edge tagged
                                    source ∈ {catalog, telemetry} + a
                                    source-independent resolution `key`
                                         │  pgsc.validate.engine
   contract.yaml ──► typed shapes ──────►│  NodeShape  (per source)
   (pgsc.model)                          │  CrossSourceShape (compare via key)
                                         ▼
                                    list[Violation] ──► report.yaml
                                         │  pgsc.bench.score (vs defect_ledger.jsonl)
                                         ▼
                                    precision / recall per class & ISO dimension

Catalog references entities by name, telemetry by synthetic id; both are mapped to a shared key, so cross-source defects (an entity in one source only, or conflicting attributes) surface as key mismatches. See the language spec for the full reference, contracts/ for the telecom contracts, and pgsc/model/shapes.py for the normative shape model.
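The key-based comparison described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not the code in pgsc.validate.engine: the row layout, the `hostname` field used to derive the telemetry key, and the finding names (`catalog_only`, `telemetry_only`, `attribute_conflict`) are all hypothetical.

```python
def index_by_key(records, key_fn):
    """Index each record under its shared resolution key."""
    return {key_fn(r): r for r in records}


def cross_source_findings(catalog, telemetry, compare_attrs):
    """Compare two key-indexed sources and return (kind, key, attr) tuples."""
    findings = []
    # Entities present in one source only.
    for key in sorted(catalog.keys() - telemetry.keys()):
        findings.append(("catalog_only", key, None))
    for key in sorted(telemetry.keys() - catalog.keys()):
        findings.append(("telemetry_only", key, None))
    # Entities present in both sources but disagreeing on an attribute.
    for key in sorted(catalog.keys() & telemetry.keys()):
        for attr in compare_attrs:
            if catalog[key].get(attr) != telemetry[key].get(attr):
                findings.append(("attribute_conflict", key, attr))
    return findings


# Hypothetical rows: the catalog identifies devices by name, telemetry by a
# synthetic id; here we assume telemetry also reports a hostname to key on.
catalog_rows = [
    {"name": "core-rtr-01", "vendor": "acme"},
    {"name": "edge-sw-07", "vendor": "acme"},
]
telemetry_rows = [
    {"id": "dev-001", "hostname": "core-rtr-01", "vendor": "globex"},
    {"id": "dev-002", "hostname": "edge-sw-99", "vendor": "acme"},
]

catalog = index_by_key(catalog_rows, lambda r: r["name"])
telemetry = index_by_key(telemetry_rows, lambda r: r["hostname"])
findings = cross_source_findings(catalog, telemetry, ["vendor"])
```

Once both sources are indexed by the same key, "entity in one source only" reduces to a set difference and "conflicting attributes" to a per-key comparison on the intersection.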

Status (v1, 2026-05-26)

Validator + scorer working end-to-end on all five twindq-bench scenarios, scored against contracts/telecom-twindq-v1.yaml. 0 violations on defect-free data and 100% precision on every scenario.

Scenario	detection recall	precision	notes
topology_mismatch	100%	100%
phantom_devices	100%	100%
stale_catalog	100%	100%	cross-source attribute-presence catches stale nulls
naming_drift	100%	100%	normalized-key reconciliation collapses a rename to one signal
chaos	91%	100%	residual recall = catalog-side defects that are structurally unobservable
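
For readers who want the arithmetic behind the table, here is a minimal sketch of per-class precision and recall against a ground-truth ledger. It is not pgsc.bench.score: matching a detection to a ledger entry by a `(defect_class, entity_key)` pair is an assumption, as are the entity keys in the example; the real ledger is defect_ledger.jsonl, whose schema is not shown here.

```python
from collections import defaultdict


def score(detected, ledger):
    """Per-class precision/recall.

    Both arguments are sets of (defect_class, entity_key) pairs; `ledger`
    is the ground truth. A ratio with a zero denominator is reported as None.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for item in detected:
        counts[item[0]]["tp" if item in ledger else "fp"] += 1
    for item in ledger - detected:
        counts[item[0]]["fn"] += 1

    table = {}
    for cls, c in counts.items():
        found = c["tp"] + c["fp"]   # everything the validator reported
        truth = c["tp"] + c["fn"]   # everything the ledger says was injected
        table[cls] = {
            "precision": c["tp"] / found if found else None,
            "recall": c["tp"] / truth if truth else None,
        }
    return table


# Hypothetical example: one injected defect goes undetected, nothing is
# reported falsely.
ledger = {
    ("validity.malformed_id", "if-1"),
    ("validity.malformed_id", "if-2"),
    ("cross_source.structural_conflict", "link-9"),
}
detected = {
    ("validity.malformed_id", "if-1"),
    ("cross_source.structural_conflict", "link-9"),
}
table = score(detected, ledger)
```

In this example precision stays at 1.0 for both classes while recall for `validity.malformed_id` drops to 0.5, the same pattern as the chaos row: missed defects lower recall without touching precision.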

Every contract rule yields 0 violations on defect-free data (the calibration invariant, enforced by the test suite), and precision is 100% on all five scenarios. The earlier sub-100% precision came from a gap in normalized-key entity resolution, which is now closed: a renamed, split, or merged entity reconciles to a single finding instead of flooding its interfaces and links with cascade false positives. Closing the gap took naming_drift precision from 10% to 100% and chaos from 85% to 100%, with no recall loss.
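
The cascade effect is easy to reproduce in miniature. The sketch below uses a hypothetical `normalize` function (case-fold, drop every non-alphanumeric character); PGSC's actual normalization rule is not documented here and may differ. It shows only how cascade mismatches collapse, not how the single rename finding is emitted.

```python
import re


def normalize(name):
    """Hypothetical normalization: case-fold and drop non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9]", "", name.casefold())


def unmatched(catalog_names, telemetry_names, key_fn):
    """Keys that appear in exactly one source under the given key function."""
    a = {key_fn(n) for n in catalog_names}
    b = {key_fn(n) for n in telemetry_names}
    return a ^ b


# One device and its two interfaces, spelled differently in each source.
catalog_names = ["Core-RTR-01", "Core-RTR-01/Gi0/1", "Core-RTR-01/Gi0/2"]
telemetry_names = ["core_rtr_01", "core_rtr_01/gi0/1", "core_rtr_01/gi0/2"]

raw = unmatched(catalog_names, telemetry_names, key_fn=lambda n: n)
norm = unmatched(catalog_names, telemetry_names, key_fn=normalize)
```

With raw names, the one renamed device produces six unmatched keys (the device and both interfaces, once per source); with normalized keys, none of them are unmatched, so child entities no longer generate false positives of their own.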

Known v1 limits: (a) chaos recall is 91%. The residual is catalog-side defects the loader cannot observe: two catalog-interface validity.malformed_id defects (the CSV renders interfaces by name, never by id) and one cross_source.structural_conflict whose catalog endpoints are empty. This is a benchmark-side note for Matteo. (b) v1 ships two shape kinds (NodeShape, CrossSourceShape) and no general subgraph/edge DSL yet. The natural-language → contract authoring path is delivered as the /pgsc-author Claude Code skill (see .claude/skills/pgsc-author/), not as a standalone llm/ pipeline.
