Dataset engineering framework for deterministic ingestion, cleaning, release packaging, and analysis.
A framework-oriented repository for turning heterogeneous annotation exports into auditable dataset releases that can be validated, published, and reused over time.
label-lab is designed as a research instrument, not a one-off conversion
script.
It is meant to support:
- explicit source adapters instead of hidden format assumptions
- one canonical dataset model before release packaging hardens
- deterministic cleaning and quality-review passes
- artifact-backed dataset publication and later audit
- long-lived format evolution without mutating accepted historical outputs
Key principle:
- source formats adapt into one canonical dataset surface, then release packaging hardens around that normalized representation
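The principle above can be sketched as a small adapter interface. This is an illustration only: `CanonicalRecord`, `SourceAdapter`, `CocoAdapter`, and every field name below are hypothetical and are not taken from `src/label_lab/`. The only outside fact assumed is the standard COCO layout (`annotations` entries with `image_id`, `category_id`, and `bbox`, plus `categories` entries with `id` and `name`).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class CanonicalRecord:
    """Hypothetical canonical annotation record; field names are illustrative."""

    image_id: str
    label: str
    bbox_xywh: tuple
    source_format: str


class SourceAdapter(Protocol):
    """Hypothetical adapter interface: one raw export in, canonical records out."""

    def to_canonical(self, export: dict) -> Iterable[CanonicalRecord]: ...


class CocoAdapter:
    """Toy adapter for a COCO-style dict with 'annotations' and 'categories'."""

    def to_canonical(self, export: dict) -> list:
        # Resolve category ids to names once, so records carry plain labels.
        names = {c["id"]: c["name"] for c in export["categories"]}
        return [
            CanonicalRecord(
                image_id=str(a["image_id"]),
                label=names[a["category_id"]],
                bbox_xywh=tuple(float(v) for v in a["bbox"]),
                source_format="coco",
            )
            for a in export["annotations"]
        ]
```

Downstream steps in such a design would depend only on `CanonicalRecord`, so supporting a new source format means adding one adapter and leaving cleaning and packaging untouched.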
- src/label_lab/: active Python runtime for adapters, normalization, and release emission
- tests/: runtime and CLI validation surface
- artifacts/: persisted output surface for cleaned datasets, reports, and release bundles
- docs/README.md: docs index
- docs/architecture.md: runtime map and format posture
- docs/source_formats.md: current source-adapter support matrix
- docs/runbooks.md: setup, validation, and release-emission guide
- Use a Python environment that already contains the repo dependencies. On declaratively managed workstations, no repo-local .venv is required.

    just setup

- Run the baseline validation passes.

    just lint
    just typecheck
    just test

- Emit a release bundle from a supported source export.

    PYTHONPATH=src python -m label_lab.main emit-release \
      --source-coco /path/to/annotations.json \
      --output-dir artifacts/releases/demo \
      --release-version 0.1.0

Or from a bounded anno-lab raw export:

    PYTHONPATH=src python -m label_lab.main emit-release \
      --source-anno-lab-raw /path/to/raw_collection_export.json \
      --output-dir artifacts/releases/demo-anno-lab-raw \
      --release-version 0.1.0

Or from LabelMe JSON sidecars:

    PYTHONPATH=src python -m label_lab.main emit-release \
      --source-labelme /path/to/labelme-sidecars \
      --output-dir artifacts/releases/demo-labelme \
      --release-version 0.1.0

For repeated-measure anno-lab raw exports, emit the repo-local review artifact instead of a release bundle:

    PYTHONPATH=src python -m label_lab.main emit-review-groups \
      --source-anno-lab-raw /path/to/repeated_measure_export.json \
      --output-path artifacts/reviews/demo-anno-lab-raw-review-groups.json

Then import adjudicated selections into a canonical-record artifact:

    PYTHONPATH=src python -m label_lab.main emit-canonical-records \
      --review-groups artifacts/reviews/demo-anno-lab-raw-review-groups.json \
      --adjudication /path/to/review_adjudication.json \
      --output-path artifacts/reviews/demo-anno-lab-raw-canonical-records.json

Source Adapter -> Canonical Dataset Model -> Deterministic Cleaning / Analysis -> Release Bundle
- source adapters normalize incoming exports into stable dataset records
- cleaning and analysis operate on canonical records instead of raw format quirks
- release emission writes a manifest-backed bundle for downstream use
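To make "deterministic" concrete, here is a minimal sketch of a cleaning pass whose output does not depend on input order, followed by a content digest that could be recorded next to the result. The function names `clean_records` and `content_digest` are hypothetical, and records are modeled as plain JSON-compatible dicts purely for illustration; the real runtime may represent and clean records differently.

```python
import hashlib
import json


def clean_records(records):
    """Drop exact duplicates and return records in a stable order.

    Each record is keyed by its sorted-key JSON form, so two dicts with the
    same content collapse to one entry regardless of key order.
    """
    unique = {json.dumps(r, sort_keys=True): r for r in records}
    # Sorting by the serialized key makes the output independent of input order.
    return [unique[key] for key in sorted(unique)]


def content_digest(records):
    """Return a SHA-256 hex digest over a canonical JSON serialization."""
    payload = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this shape, running the pass twice over the same export, or over a shuffled copy of it, yields the same digest, which is one simple way to check a rerun against an earlier output.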
- a public emit-release CLI for COCO-first bundle emission from direct COCO inputs, bounded anno-lab raw exports, or LabelMe JSON sidecars, plus a repo-local emit-review-groups CLI and emit-canonical-records CLI for repeated-measure anno_lab_raw adjudication artifacts
- an explicit sources/ adapter boundary with release-ready COCO, anno_lab_raw, and labelme inputs while LVIS remains planned but fail-closed
- manifest-backed releases that currently write release_manifest.json plus annotations.coco.json
- src/label_lab/release_contract.py as the public producer authority for the emitted bundle contract and artifact names
- an active Python runtime under src/label_lab/
- an artifact surface under artifacts/ for reproducible generated outputs
- a checked-in minimal contract fixture under tests/fixtures/published_release_bundle_v1/ for bundle-regression coverage
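A downstream consumer might sanity-check an emitted bundle before using it. The sketch below only checks that the two files named above exist and parse as JSON. It assumes nothing about the manifest schema, and `load_bundle` is a hypothetical helper, not part of label_lab. The file names are hardcoded here for illustration; `src/label_lab/release_contract.py` remains the authority for them.

```python
import json
from pathlib import Path

# File names as listed in this README; release_contract.py is the source of truth.
REQUIRED_FILES = ("release_manifest.json", "annotations.coco.json")


def load_bundle(bundle_dir):
    """Load both bundle files, failing if either one is missing."""
    root = Path(bundle_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if missing:
        raise FileNotFoundError(f"bundle {root} is missing: {', '.join(missing)}")
    # Parse every required file; invalid JSON raises json.JSONDecodeError.
    return {
        name: json.loads((root / name).read_text(encoding="utf-8"))
        for name in REQUIRED_FILES
    }
```

For example, `load_bundle("artifacts/releases/demo")` would return a dict keyed by file name once the emit-release command has written to that directory.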
- harden anno-lab raw-export ingestion without pulling raw-export logic back into anno-lab
- keep repeated-measure adjudication explicit and repo-local before widening source support or publication semantics
- expand adapter seams for YOLO, LVIS, and PASCAL VOC after the LabelMe path is stable
- add richer provenance, quality-review, and analysis sidecars without distorting the base publication contract
- preserve source provenance and deterministic transform identity
- fail closed when geometry or metadata semantics are ambiguous
- keep richer metadata in explicit sidecars instead of overloading raw COCO objects
- keep .env.example placeholder-only
- treat historical artifacts as additive, not mutable
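As one reading of "fail closed" for geometry, a validator can reject any bounding box it cannot interpret unambiguously, with no clamping or repair. The sketch below is an assumption about what such a check could look like: `AmbiguousGeometryError` and `validate_bbox_xywh` are hypothetical names, and the specific rules (finite values, positive size, fully inside the image) are illustrative rather than the project's actual policy.

```python
import math


class AmbiguousGeometryError(ValueError):
    """Raised when a box cannot be interpreted without guessing (hypothetical)."""


def validate_bbox_xywh(bbox, image_width, image_height):
    """Return the box as a float tuple, or raise instead of silently fixing it."""
    if len(bbox) != 4:
        raise AmbiguousGeometryError(f"expected 4 values, got {len(bbox)}")
    x, y, w, h = (float(v) for v in bbox)
    if not all(math.isfinite(v) for v in (x, y, w, h)):
        raise AmbiguousGeometryError(f"non-finite coordinate in {bbox!r}")
    if w <= 0 or h <= 0:
        raise AmbiguousGeometryError(f"non-positive size in {bbox!r}")
    if x < 0 or y < 0 or x + w > image_width or y + h > image_height:
        # Clamping would change the annotation, so reject rather than repair.
        raise AmbiguousGeometryError(f"box {bbox!r} extends outside the image")
    return (x, y, w, h)
```

Rejecting the record keeps the decision visible to a reviewer, whereas a silent clamp would bake an undocumented transform into the release.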
- docs/README.md: docs index
- docs/architecture.md: architecture, runtime seams, and format posture
- docs/source_formats.md: source-adapter support matrix and raw-export / LVIS boundaries
- docs/runbooks.md: bootstrap, validation, and release-emission reference
- artifacts/README.md: artifact-output guidance
This project is licensed under the MIT License. See LICENSE for the full
text.