(Markdown) code generation#451

Open
Seth Fitzsimmons (mojodna) wants to merge 21 commits into dev from codegen

Conversation

@mojodna
Collaborator

Summary

Add overture-schema-codegen, a code generator that produces documentation from
Pydantic schema models.

Pydantic's model_json_schema() flattens the schema's domain vocabulary into JSON
Schema primitives. NewType names, constraint provenance, and custom constraint classes
disappear. Navigating Python's type annotation machinery -- NewType chains, nested
Annotated wrappers, union filtering, generic resolution -- is complex. The codegen
does it once. analyze_type() unwraps annotations into TypeInfo, a flat
target-independent representation that renderers consume without re-entering the type
system.

Architecture

Four layers with strict downward imports:

Rendering            ← Output formatting, all presentation decisions
Output Layout        ← What to generate, where it goes
Extraction           ← TypeInfo, FieldSpec, ModelSpec, UnionSpec
Discovery            ← discover_models() from overture-schema-core

analyze_type() is the central function. A single iterative loop peels NewType,
Annotated, Union, and container wrappers in fixed order, accumulating constraints tagged
with the NewType that contributed them. The result is a TypeInfo dataclass that
downstream modules consume without re-entering the type system.
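A minimal sketch of that unwrapping loop, using only stdlib `typing` helpers. The real `analyze_type()` returns a richer `TypeInfo`; the bare tuple here is illustrative:

```python
from typing import Annotated, Union, get_args, get_origin

def unwrap(annotation):
    """Peel NewType, Annotated, Optional, and list wrappers iteratively,
    collecting Annotated metadata along the way."""
    metadata = []
    is_list = False
    while True:
        # NewType instances expose the wrapped type via __supertype__.
        if hasattr(annotation, "__supertype__"):
            annotation = annotation.__supertype__
            continue
        origin = get_origin(annotation)
        if origin is Annotated:
            args = get_args(annotation)
            annotation = args[0]
            metadata.extend(args[1:])
            continue
        if origin is Union:
            # Collapse Optional[X] to X; real union handling is richer.
            non_none = [a for a in get_args(annotation) if a is not type(None)]
            if len(non_none) == 1:
                annotation = non_none[0]
                continue
        if origin is list:
            is_list = True
            annotation = get_args(annotation)[0]
            continue
        return annotation, metadata, is_list  # terminal type reached
```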

Both concrete BaseModel subclasses and discriminated union type aliases (like Segment = Annotated[Union[RoadSegment, ...], ...]) satisfy the FeatureSpec protocol and flow
through the same pipeline. Union extraction finds the common base class, partitions
fields into shared and variant-specific, and extracts the discriminator mapping.
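The shared/variant-specific partition can be illustrated with stdlib dataclasses standing in for the real Pydantic models (class and field names below are hypothetical):

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class SegmentBase:            # hypothetical common base
    id: str
    geometry: str

@dataclass
class RoadSegment(SegmentBase):
    speed_limit: Optional[int] = None

@dataclass
class RailSegment(SegmentBase):
    gauge_mm: Optional[int] = None

def partition_fields(variants):
    """Split fields into those present on every variant and those
    specific to individual variants, keyed by variant name."""
    per_variant = {v.__name__: {f.name for f in fields(v)} for v in variants}
    shared = set.intersection(*per_variant.values())
    specific = {name: fs - shared for name, fs in per_variant.items()}
    return shared, specific
```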

markdown_pipeline.py orchestrates the full pipeline without I/O: tree expansion,
supplementary type collection, path assignment, reverse references, and rendering.
Returns list[RenderedPage]. The CLI writes files to disk with Docusaurus frontmatter.

Design doc: packages/overture-schema-codegen/docs/design.md

Changes outside the codegen package

Preparatory fixes and refactors in core/system/CLI packages:

  • Rename ModelKey.class_name to entry_point (carries module:Class path, not just the
    class name)
  • Attach docstrings to NewTypes at runtime (so the codegen can extract them)
  • Add resolve_discriminator_field_name() to system feature module
  • Fix relative imports and f-string prefixes in core
  • Use dict instead of Mapping in system test util type hints

Example (real) data added to theme pyproject.toml files (addresses, base, buildings,
divisions, places) under [examples.ModelName] sections.

What's in the package

Source:

| Module | Purpose |
| --- | --- |
| `type_analyzer.py` | Iterative type unwrapping into `TypeInfo` |
| `specs.py` | Data structures shared between extraction and rendering |
| `type_registry.py` | Type name → per-target display string mapping |
| `model_extraction.py` | Pydantic model → `ModelSpec`, tree expansion |
| `union_extraction.py` | Union alias → `UnionSpec`, discriminator mapping |
| `enum_extraction.py` | Enum → `EnumSpec` |
| `newtype_extraction.py` | NewType → `NewTypeSpec` |
| `primitive_extraction.py` | Numeric primitives and geometry types |
| `field_constraint_description.py` | Constraint objects → display text |
| `model_constraint_description.py` | Model-level constraints → prose |
| `module_layout.py` | Python module paths → output directories |
| `type_collection.py` | Supplementary type discovery from field trees |
| `path_assignment.py` | Type names → output file paths |
| `link_computation.py` | Relative links between output pages |
| `reverse_references.py` | "Used By" reference computation |
| `markdown_type_format.py` | `TypeInfo` → markdown type strings with links |
| `markdown_renderer.py` | Jinja2 template driver for all page types |
| `example_loader.py` | TOML example loading, validation, flattening |
| `markdown_pipeline.py` | Pipeline orchestration (no I/O) |
| `cli.py` | Click CLI: `generate` and `list` commands |
| `case_conversion.py` | PascalCase → snake_case |
| `docstring.py` | Custom vs. auto-generated docstring detection |
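As a concrete example, the PascalCase → snake_case conversion in `case_conversion.py` is typically a pair of regex passes (a sketch, not necessarily the package's exact code):

```python
import re

def to_snake_case(name: str) -> str:
    """Insert underscores at acronym and lower-to-upper boundaries, then lowercase."""
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)  # "HTTPServer" -> "HTTP_Server"
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s)        # "RoadSegment" -> "Road_Segment"
    return s.lower()
```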

Tests: unit tests per module, golden file tests for
rendered markdown, integration tests against real schema models.

Design decisions worth reviewing

analyze_type is iterative, not recursive. The while True loop handles arbitrary
nesting depth (NewType wrapping Annotated wrapping NewType wrapping Annotated...)
without stack growth. Dict key/value types are the one exception where it recurses.

Cache insertion before recursion in expand_model_tree. The sub-model's ModelSpec
enters the cache before its fields are expanded. A back-edge encounter finds the cached
entry and marks starts_cycle=True rather than infinite-looping.
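The cache-before-recursion pattern, reduced to a dict-based sketch (the real code builds `ModelSpec` dataclasses; dicts here keep the example self-contained):

```python
def expand_tree(model, cache=None):
    """Expand a model's sub-model tree. The spec enters the cache *before*
    its children are expanded, so a back-edge finds the cached entry and
    marks it as a cycle start instead of recursing forever."""
    cache = {} if cache is None else cache
    name = model["name"]
    if name in cache:                      # back-edge: already being expanded
        cache[name]["starts_cycle"] = True
        return cache[name]
    spec = {"name": name, "children": [], "starts_cycle": False}
    cache[name] = spec                     # insert BEFORE recursion
    for child in model.get("fields", []):
        spec["children"].append(expand_tree(child, cache))
    return spec
```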

FeatureSpec is a Protocol, not a base class. ModelSpec and UnionSpec have
different field structures (flat list vs. annotated-field list with variant provenance).
A protocol lets them share a pipeline interface without forcing inheritance.
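A toy version of the structural-typing choice (the members shown are illustrative, not the package's actual FeatureSpec surface):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class FeatureSpec(Protocol):
    """Anything with a name and a field_count() satisfies the protocol;
    no inheritance required."""
    name: str
    def field_count(self) -> int: ...

class ModelSpec:                      # flat field list
    def __init__(self, name, fields):
        self.name = name
        self.fields = fields
    def field_count(self) -> int:
        return len(self.fields)

class UnionSpec:                      # shared + variant-specific fields
    def __init__(self, name, shared, variant_fields):
        self.name = name
        self.shared = shared
        self.variant_fields = variant_fields
    def field_count(self) -> int:
        return len(self.shared) + len(self.variant_fields)
```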

Constraint provenance via ConstraintSource. Each constraint records which NewType
contributed it. Field-level constraints with source=None render on the field;
constraints with a named source render on the NewType's own page. This prevents
duplication.
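The rendering rule, with a simplified constraint record (field and helper names are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Constraint:
    text: str
    source: Optional[str] = None  # name of the NewType that contributed it

def render_on_field(constraints):
    """Unattributed constraints belong to the field itself."""
    return [c.text for c in constraints if c.source is None]

def render_on_newtype(constraints, newtype_name):
    """Sourced constraints render once, on that NewType's own page."""
    return [c.text for c in constraints if c.source == newtype_name]
```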

Test plan

  • make check passes (pytest + doctests + ruff + mypy)
  • make install && overture-codegen generate --format markdown --output-dir /tmp/schema-docs produces output
  • Spot-check generated markdown for a union feature (e.g., Segment) and a model
    feature (e.g., Building) -- field tables, links, constraint descriptions, examples
  • Verify cross-page links resolve correctly (supplementary types link back to
    features, features link to shared types)

The live schema reference contains Markdown produced by these changes (modulo some improvements from today).

pytest-subtests merged into pytest core as of pytest 9.
Update test imports from pytest_subtests.SubTests to
_pytest.subtests.Subtests.
- Add -q, --tb=short to `make test` for compact output
- Set verbosity_subtests=0 to suppress per-subtest
  progress characters (the u/,/- markers from pytest's
  built-in subtests support)

Bare triple-quoted strings after NewType assignments are
expression statements that Python never attaches to the
NewType object, leaving __doc__ as None. Convert each to
an explicit __doc__ assignment so codegen and introspection
tools can read them at runtime.
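For example (the docstring text here is illustrative):

```python
from typing import NewType

FeatureVersion = NewType("FeatureVersion", int)
# A bare triple-quoted string on the next line would be a no-op
# expression statement; assigning __doc__ makes it introspectable.
FeatureVersion.__doc__ = "Version number of a feature, incremented on change."
```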

Same pattern DocumentedEnum uses for enum member docs.

OvertureFeature validator error message had two continuation
lines missing the f-prefix, so {self.__class__.__name__} was
rendered literally. Also add missing space before "and".

Replace hardcoded discriminator_fields tuple ("type", "theme",
"subtype") in _process_union_member with the discriminator field
name extracted from the union's Annotated metadata.

introspect_union already extracted the discriminator field name
but didn't pass it through to member processing. Now it does,
so unions using any field name as discriminator work correctly.

For nested unions, parent discriminator values are extracted from
nested leaf models to preserve structural tuple classification.

Feature.field_discriminator now attaches _field_name to the
callable, and _extract_discriminator_name reads it. This handles
the Discriminator-wrapping-a-callable case that str(disc) got
wrong silently.

Make _extract_literal_value return str directly instead of object,
eliminating implicit str() conversions at call sites. Add comment
explaining nested union re-indexing under the parent discriminator.

Remove redundant test covered by TestDiscriminatorDiscovery and
debugging print() calls from TestStructuralTuples.

The field holds the entry point value in "module:Class" format, not a
class name. The old name required callers to know this (codegen's cli.py
had a comment explaining it, and assigned to a local `entry_point`
variable to compensate).

Empty package with build config, namespace packages, and
py.typed marker. Declares click, jinja2, tomli, and
overture-schema-core/system as dependencies.

Type analyzer (analyze_type) handles all type unwrapping in a
single iterative function: NewType → Annotated → Union → list →
terminal classification. Constraints accumulate from Annotated
metadata with source tracking via ConstraintSource.

Data structures: TypeInfo (type representation), FieldSpec
(model field), ModelSpec (model), EnumSpec, NewTypeSpec,
PrimitiveSpec.

Type registry maps type names to per-target string
representations via TypeMapping. is_semantic_newtype()
distinguishes meaningful NewTypes from pass-through aliases.

Utilities: case_conversion (snake_case), docstring (cleaning
and custom-docstring detection).

Domain-specific extractors that consume analyze_type() and
produce specs:

- model_extraction: extract_model() for Pydantic models with
  MRO-aware field ordering, alias resolution, and recursive
  sub-model expansion via expand_model_tree()
- enum_extraction: extract_enum() for DocumentedEnum classes
- newtype_extraction: extract_newtype() for semantic NewTypes
- primitive_extraction: extract_primitives() for numeric types
  with range and precision introspection
- union_extraction: extract_union() with field merging across
  discriminated union variants

Shared test fixtures in codegen_test_support.py.

Generate prose from extracted constraint data:

- field_constraint_description: describe field-level
  constraints (ranges, patterns, unique items, hex colors)
  as human-readable notes with NewType source attribution
- model_constraint_description: describe model-level
  constraints (@require_any_of, @radio_group, @min_fields_set,
  @require_if, @forbid_if) as prose, with consolidation of
  same-field conditional constraints

Determine what artifacts to generate and where they go:

- module_layout: compute output directories for entry points,
  map Python module paths to filesystem output paths via
  compute_output_dir
- path_assignment: build_placement_registry maps types to
  output file paths. Feature models get {theme}/{slug}/,
  shared types get types/{subsystem}/, theme-local types
  nest under their feature or sit flat at theme level
- type_collection: discover supplementary types (enums,
  NewTypes, sub-models) by walking expanded feature trees
- link_computation: relative_link() computes cross-page
  links, LinkContext holds page path + registry for
  resolving links during rendering

Embed JSON example features in [tool.overture-schema.examples]
sections. Each example is a complete GeoJSON Feature matching
the theme's Pydantic model, used by the codegen example_loader
to render example tables in documentation.

Jinja2 templates and rendering logic for documentation pages:

- markdown_renderer: orchestrates page rendering for features,
  enums, NewTypes, primitives, and geometry. Recursively expands
  MODEL-kind fields inline with dot-notation.
- markdown_type_format: type string formatting with link-aware
  rendering via LinkContext
- example_loader: loads examples from theme pyproject.toml,
  validates against Pydantic models, flattens to dot-notation
- reverse_references: computes "Used By" cross-references
  between types and the features that reference them

Templates: feature, enum, newtype, primitives, geometry pages.
Golden-file snapshot tests verify rendered output stability.

Adds renderer-specific fixtures to conftest.py (cli_runner,
primitives_markdown, geometry_markdown).

Click-based CLI entry point (overture-codegen generate) that
wires discovery → extraction → output layout → rendering:

- Discovers models via discover_models() entry points
- Filters themes, extracts specs, builds placement registry
- Renders markdown pages with field tables, examples, cross-
  references, and sidebar metadata
- Supports --theme filtering and --output-dir targeting

Integration tests verify extraction against real Overture
models (Building, Division, Segment, etc.) to catch schema
drift. CLI tests verify end-to-end generation, output
structure, and link integrity.

Design doc covers the four-layer architecture, analyze_type(),
domain-specific extractors, and extension points for new output
targets.

Walkthrough traces Segment through the full pipeline
module-by-module in dependency order, with FeatureVersion as a
secondary example for constraint provenance in the type analyzer.

README describes the problem (Pydantic flattens domain vocabulary),
the "unwrap once, render many" approach, CLI usage, architecture
overview, and programmatic API.

TypeInfo.literal_value discarded multi-value Literals entirely
(Literal["a", "b"] got None). Renamed to literal_values as a
tuple of all args so consumers decide presentation.

single_literal_value() preserves its contract: returns the
value for single-arg Literals, None otherwise. Callers
(example_loader, union_extraction) are unchanged.

Multi-value Literals render as pipe-separated quoted values
in markdown tables: `"a"` \| `"b"`.
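The two accessors, sketched with stdlib typing only:

```python
from typing import get_args

def literal_values(annotation) -> tuple:
    """All Literal args as a tuple; consumers decide presentation."""
    return get_args(annotation)

def single_literal_value(annotation):
    """The value for single-arg Literals, None otherwise."""
    args = get_args(annotation)
    return args[0] if len(args) == 1 else None
```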

```python
raise TypeError("Bare list without type argument is not supported")
state.is_list = True
annotation = args[0]
continue
```


Haven't reviewed everything yet, but I found an issue in here while testing the parquet generator: this does not properly unpack nested lists. I don't think it's a problem for the markdown, but it surfaces in Divisions where we have list[NewType("Hierarchy", list[HierarchyItem])]. You can see the diff in the resulting arrow schemas:

Generated:

```
list<element: struct<division_id: string not null, subtype: string not null, name: string not null>>
```

Release Data (2026-02-18.0):

```
list<element: list<element: struct<division_id: string, subtype: string, name: string>>>
```

This should instead use something like a list_depth or a recursive unwrap.
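One possible shape for that fix, counting list depth with an unwrap that follows NewType supertypes (a stdlib-only sketch, not the actual patch):

```python
from typing import get_args, get_origin

def unwrap_lists(annotation):
    """Return (element_type, depth), following NewType supertypes so that
    list[NewType("Hierarchy", list[Item])] reports depth 2, not 1."""
    depth = 0
    while True:
        if hasattr(annotation, "__supertype__"):  # NewType wrapper
            annotation = annotation.__supertype__
            continue
        if get_origin(annotation) is list:
            depth += 1
            annotation = get_args(annotation)[0]
            continue
        return annotation, depth
```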

