Item fragments #448

Merged
bitner merged 12 commits into main from item_fragments
Jun 9, 2026
bitner commented Jun 1, 2026
Collaborator

Technical Context

Description

Reworks how STAC items are stored and retrieved for the upcoming v0.10.0 breaking release of pgstac.
New storage model

  • item_fragments — deduplicated shared subtrees (asset metadata, link shapes, root keys) stored once per collection, keyed by a 32-byte sha256 (hash bytea). Items reference their fragment via fragment_id.
  • items — per-item delta columns (assets, properties, links, extra) plus ~30 promoted scalar columns for well-known queryables (datetime, platform, gsd, eo:*, proj:*, view:*, sat:*, file:*, sci:*).
  • collections.fragment_config text[] — list of fragment paths, auto-derived from item_assets on collection creation. It can be overridden to deduplicate additional content that is common to all items in a collection.
  • items.link_hrefs / item_fragments.links_template — link split storage: shared link shape (rel/type/title) deduped into the fragment; per-item hrefs stored separately.
  • item_field_registry — tracks observed JSON paths per collection for queryable discovery. It also makes it possible to derive the full schema of a collection's data, for use when writing to schema-requiring formats such as Parquet.
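
As a rough illustration of the fragment idea in the list above, here is a minimal Python sketch (not the pgstac implementation). It assumes fragment paths are tuples of keys; the helper names `pop_path`, `set_path`, and `dehydrate` are hypothetical, and the hashing here uses a simple sorted-key JSON dump rather than pgstac's canonical serializer.

```python
import copy
import hashlib
import json

_MISSING = object()  # sentinel so a stored JSON null is not mistaken for "absent"


def pop_path(doc, path):
    """Remove and return the value at `path` (tuple of keys), or _MISSING."""
    node = doc
    for key in path[:-1]:
        node = node.get(key, _MISSING)
        if not isinstance(node, dict):
            return _MISSING
    return node.pop(path[-1], _MISSING)


def set_path(doc, path, value):
    """Set `value` at `path`, creating intermediate objects as needed."""
    node = doc
    for key in path[:-1]:
        node = node.setdefault(key, {})
    node[path[-1]] = value


def dehydrate(item, fragment_paths, fragments):
    """Split `item` into a shared fragment and a per-item delta.

    `fragments` is a dict standing in for the fragment table, keyed by the
    raw 32-byte sha256 of the fragment. Returns (fragment_key, delta).
    """
    delta = copy.deepcopy(item)  # leave the caller's item untouched
    fragment = {}
    for path in fragment_paths:
        value = pop_path(delta, path)
        if value is not _MISSING:
            set_path(fragment, path, value)
    payload = json.dumps(fragment, sort_keys=True, separators=(",", ":"))
    key = hashlib.sha256(payload.encode("utf-8")).digest()
    fragments.setdefault(key, fragment)  # store each distinct fragment once
    return key, delta
```

Two items that differ only in their hrefs would then share one fragment entry, and each delta would carry just the per-item href.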

The items_staging tables have been updated to work with the schema changes.

New functions compute a canonical hash that yields the same value inside Postgres/pgstac and in external tools, which allows fast lookups and diffing when loading data.
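
For readers computing the hash outside the database, a minimal Python sketch of the general approach might look like the following. It is an approximation only: the function names are hypothetical, and Python's `json.dumps` does not format every number the way RFC 8785 requires (for example, it writes `1.0` where the RFC expects `1`), so byte-for-byte agreement with pgstac should be checked against the project's own test vectors rather than assumed.

```python
import hashlib
import json


def canonical_bytes(value) -> bytes:
    """Serialize JSON-compatible data with sorted keys and compact separators.

    Assumption: this approximates an RFC 8785-style canonical form. Keys are
    sorted by code point, strings are emitted as UTF-8 rather than \\u escapes,
    and NaN/Infinity are rejected because they are not valid JSON.
    """
    text = json.dumps(
        value,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
        allow_nan=False,
    )
    return text.encode("utf-8")


def item_hash(item) -> bytes:
    """Return the raw 32-byte sha256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(item)).digest()
```

Because keys are sorted before hashing, two documents that differ only in key order produce the same digest, which is what makes the hash usable for diffing before a load.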

Fixes #158 and #425: datetime: null round-trip

The STAC spec requires "datetime": null to be explicitly present when start_datetime/end_datetime are used. Earlier versions applied jsonb_strip_nulls to the full properties object, silently dropping it. The new temporal_properties_from_item builds jsonb_build_object('datetime', NULL) before the jsonb_strip_nulls block that covers only promoted scalars, so the explicit JSON null survives end-to-end through both get_item and search.
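
The ordering described above can be sketched in Python as follows. This is only an illustration of the idea, not the SQL in the PR; `strip_nulls` and `hydrate_properties` are hypothetical names, and `strip_nulls` handles a flat object only.

```python
def strip_nulls(obj: dict) -> dict:
    """Drop keys whose value is None from a flat object."""
    return {key: value for key, value in obj.items() if value is not None}


def hydrate_properties(datetime_value, start, end, promoted: dict) -> dict:
    """Rebuild item properties while keeping an explicit `datetime: null`.

    The temporal object is built first and is never null-stripped; only the
    promoted scalar columns are stripped.
    """
    temporal = {"datetime": datetime_value}  # kept even when None
    if start is not None:
        temporal["start_datetime"] = start
    if end is not None:
        temporal["end_datetime"] = end
    return {**temporal, **strip_nulls(promoted)}


props = hydrate_properties(
    None,
    "2020-01-01T00:00:00Z",
    "2020-01-02T00:00:00Z",
    {"platform": "landsat-8", "gsd": None},
)
# "datetime" is present with a null value; the unset "gsd" column is dropped.
```

Stripping nulls from the whole merged object instead would remove the `datetime` key, which is the behavior the two issues reported.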

Test gate

scripts/test --formatting --pgtap --basicsql --pgdump is green (349 PGTap tests; pg_dump → pgstac_restore round-trip verified).

--pypgstac and --migrations are intentionally skipped. These will be fixed in upcoming PRs before the pgstac v0.10.0 release. We are deliberately keeping the PRs leading to v0.10.0 small, and accepting that some tests will not pass in the interim, so that we can iterate toward this breaking release.


Checklist

  • Linting: Code is formatted and linted
  • Tests: Tests pass. New tests added for split-storage round-trip, fragment dedup, canonical hash contract, datetime:null fix (#158 "search response does not include datetime when null"; #425 "bug: jsonb_strip_nulls() in hydration functions strips datetime: null from item properties, producing invalid STAC items"), jsonb_merge_recursive depth-4/collision correctness, and item_hash bytea sizing.
  • Edge Cases: Verified: depth-4 fragment paths, scalar key collisions in merge, empty items, range items with datetime:null via both get_item and search, upsert idempotency (re-upsert 1,000 items with no fragment explosion), changed-upsert reflected, direct UPDATE leaves item_hash stable.
  • Documentation: README not updated (no new env vars or config); CLAUDE.md, AGENTS.md, and CHANGELOG.md updated.
  • Accountability: I can explain the implementation logic for every line of code submitted.

AI tool usage

  • AI (Copilot or something similar) supported my development of this PR. See our policy about AI tool use. Use of AI tools must be indicated.

Policy: We require a "human-in-the-loop." You are the author and are fully accountable for all submitted code. Please ensure all tool-generated content is thoroughly reviewed before submission to ensure it is not an "extractive contribution" that squanders maintainer time.

bitner and others added 8 commits June 1, 2026 16:01
- Dockerfile: add plprofiler and plpgsql_check for profiling sessions
- scripts/loadsampledata: new host-facing fixture-loader; extend in-container
  version to load Planetary Computer NAIP, Landsat, and Sentinel-2 fixtures
- scripts/container-scripts/test: add --pgdump gate; update flag docs
- Developer docs: CLAUDE.md migration workflow and test-gate guidance; AGENTS.md
  persona definitions; scripts.instructions.md updated for new scripts
- CHANGELOG.md: unreleased entries for v0.10.0 split-storage changes
- .gitignore: ignore local .plans/ planning artifacts

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add 1,000-item NDJSON snapshots for landsat-c2-l2, naip, and sentinel-2-l2a
under src/pgstac/tests/testdata/planetary-computer/. Deterministic fixtures
(fetched once, checked in) for reproducible disk-size measurement and
benchmarking of the v0.10 split-storage schema. Each collection exercises a
different data shape: Landsat (25 assets with many constant sub-keys), NAIP
(4 assets dominated by per-item Azure blob hrefs), Sentinel-2 (23 assets with
per-item varying properties). Includes a fixture-summary.json recording fetch
parameters.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…s, ingest

Split the monolithic items.content JSONB into typed columns and a deduplicated
fragment store, with server-side hydration on every read.

Schema
- items: per-item delta (assets/properties/links/extra) + ~30 promoted scalar
  columns (datetime, platform, eo:*, proj:*, view:*, sat:*, file:*, sci:*) with
  native BTREE indexes + fragment_id reference into item_fragments
- item_fragments(collection, hash bytea, content, links_template): deduplicated
  shared subtrees keyed by raw 32-byte sha256 (compact unique index)
- collections.fragment_config text[]: per-collection fragment paths, auto-derived
  from item_assets sub-keys (depth-3 paths for stable asset metadata)
- item_field_registry: tracks observed JSON paths per collection for queryable
  discovery and schema inference
- items_deleted_log: tombstone table for soft-delete audit

Dehydrate at ingest (items_staging_triggerfunc → items_staging_dehydrate)
- Set-based pipeline: dehydrate → fragment extract → ON CONFLICT hash dedup
  → strip fragment-covered keys; shared by insert/ignore/upsert branches via
  items_staging_dehydrate() so the enriched column list lives in one place
- Links split storage: shared link shape (rel/type/title, no href) deduped in
  item_fragments.links_template; per-item hrefs in items.link_hrefs
- Partition creation and stats updates queued via run_or_queue (ingest returns fast)
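
The links split described in the list above could be pictured with the following Python sketch. It is an illustration under assumptions, not the SQL pipeline: `split_links` and `join_links` are hypothetical names, and the sketch assumes hrefs are stored as a list aligned by position with the template.

```python
def split_links(links):
    """Split STAC links into a shared shape (no href) and per-item hrefs."""
    template = [
        {key: value for key, value in link.items() if key != "href"}
        for link in links
    ]
    hrefs = [link.get("href") for link in links]
    return template, hrefs


def join_links(template, hrefs):
    """Rebuild full links by pairing each shape with its href by position."""
    return [
        {**shape, "href": href} if href is not None else dict(shape)
        for shape, href in zip(template, hrefs)
    ]


links = [
    {"rel": "self", "type": "application/geo+json", "href": "https://example.test/items/a"},
    {"rel": "parent", "type": "application/json", "href": "https://example.test/collections/c"},
]
template, hrefs = split_links(links)
restored = join_links(template, hrefs)  # equal to the original `links`
```

Items in the same collection usually share the same rel/type/title shapes, so the template deduplicates well while only the short href list varies per item.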

Hydrate at read (content_hydrate, format_item, search)
- jsonb_merge_recursive with disjoint fast-path: ingest strip removes
  fragment-owned keys from per-item columns, so the two sub-objects almost
  always have disjoint keys; merge shallow-concats and only recurses on real
  overlap (~2.5× faster asset merge, byte-identical output verified on 3,000
  real items + depth-4/collision unit tests)
- promoted_properties_from_item: direct jsonb_strip_nulls(jsonb_build_object)
  mirroring content_dehydrate (~35% faster than the prior per-item defs-join)
- tstz_to_stac_text: canonical UTC serializer (trims trailing zeros)
- Net: content_hydrate 27–50% faster on the Planetary Computer fixtures
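
The disjoint fast-path for the recursive merge can be sketched in Python like this. It is a minimal illustration rather than the SQL function; the name `merge_recursive` is hypothetical, and the rule that the second argument wins when a non-object value collides is an assumption of this sketch, not something stated in this commit message.

```python
def merge_recursive(base: dict, overlay: dict) -> dict:
    """Deep-merge two JSON objects, with a shortcut when keys do not overlap."""
    # Fast path: no shared keys, so a shallow concatenation is already correct.
    if base.keys().isdisjoint(overlay):
        return {**base, **overlay}
    merged = dict(base)
    for key, value in overlay.items():
        existing = merged.get(key)
        if isinstance(existing, dict) and isinstance(value, dict):
            # Real overlap between two objects: recurse one level down.
            merged[key] = merge_recursive(existing, value)
        else:
            # Assumption: the overlay value replaces the base value.
            merged[key] = value
    return merged


fragment = {"assets": {"B1": {"type": "image/tiff", "roles": ["data"]}}}
delta = {"assets": {"B1": {"href": "s3://bucket/a.tif"}}}
item = merge_recursive(fragment, delta)
```

Here the top two levels overlap (`assets`, `B1`), so the function recurses twice, and the innermost call takes the fast path because `type`/`roles` and `href` are disjoint.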

Externally reproducible content_hash
- jsonb_canonical(jsonb): RFC 8785-aligned serializer (code-point-sorted keys,
  compact separators, UTF-8 strings, IEEE-754 shortest-round-trip numbers)
- content_hash = sha256(jsonb_canonical(item)) — verified byte-identical to a
  Python reference on 3,000 real items plus numeric/unicode edge cases
- Set once at ingest; items_touch_triggerfunc no longer recomputes on UPDATE

Queryables and CQL routing
- promoted_queryables_defaults() populates queryables.property_path for all
  promoted scalar columns; CQL2 translator bypasses JSONB cast and hits native
  BTREE indexes directly for promoted queryables
- Permissions for new tables/functions in 998_idempotent_post.sql

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- pgtap/001a_jsonutils.sql: jsonb_merge_recursive disjoint fast-path (depth-4,
  collision, NULL/empty guards); jsonb_canonical key-sort, numbers, nested
  objects; pgstac_item_hash vector pinning the external reproducibility contract
- pgtap/002_collections.sql: fragment_config auto-derivation from item_assets
- pgtap/002a_queryables.sql: promoted_queryables_defaults, property_path routing
- pgtap/003_items.sql: split-storage round-trip (create/get/update/upsert/delete),
  fragment dedup, root-key fragmentation, link split storage, promoted column
  values, touch trigger leaves content_hash stable on direct UPDATE
- pgtap/004_search.sql: format_item hydration, CQL promoted-column routing
- pgtap/9999_readonly.sql: read-only role access checks for new tables/functions
- pgtap.sql: plan count updated to 343
- basic/hydration.sql + .sql.out: assert properties.datetime absent from stored
  row (promoted) and correctly rehydrated via get_item
- basic/crud_functions.sql: ORDER BY id on multi-row queries for deterministic
  output; .sql.out regenerated
- basic/cql2_searches.sql.out: updated for promoted-column routing output

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Regenerate src/pgstac/pgstac.sql (assembled base install) and the unreleased
base migration from the edited sql/ source. The incremental migration
(pgstac--0.9.11--unreleased.sql) reflects the schema delta from 0.9.11; it will
be finalized and renamed when the v0.10.0 release branch is assembled.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…, #425)

The STAC spec requires `"datetime": null` to be explicitly present in item
properties when `start_datetime`/`end_datetime` are used. Earlier pgstac
versions applied jsonb_strip_nulls to the full properties object during
hydration, silently dropping it and producing invalid STAC output.

The new split-storage hydration (temporal_properties_from_item) builds
`jsonb_build_object('datetime', NULL)` before the jsonb_strip_nulls block that
covers only the promoted scalar columns, so the explicit JSON null is preserved
end-to-end.

Tests added:
- pgtap/003_items.sql: four assertions covering get_item and search — key
  presence (? 'datetime') and value type ('null'::jsonb) for a range item
- basic/hydration.sql: search() check alongside the existing get_item check,
  with regenerated .out confirming null in both paths (plan: 343 → 347)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… private, move jsonb_field_rows

Five related schema and API improvements:

item_hash bytea (was content_hash text)
  Store the canonical item digest as a raw 32-byte sha256 (bytea) instead of a
  64-char hex text. Half the storage per row; direct binary comparison on the
  unique index; 'octet_length(item_hash) = 32' replaces 'length = 64' in tests.
  Applies to both items and items_deleted_log.
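
The size difference between the two representations is easy to confirm outside the database. The short Python sketch below uses only the standard library; `digest_forms` is a hypothetical helper name and the payload is an arbitrary example.

```python
import hashlib


def digest_forms(payload: bytes) -> tuple[bytes, str]:
    """Return the sha256 of `payload` as raw bytes and as hex text."""
    raw = hashlib.sha256(payload).digest()
    return raw, raw.hex()


raw, hex_text = digest_forms(b'{"id":"example"}')
print(len(raw), len(hex_text))  # prints: 32 64

# Python counterpart of decode(repeat('aa', 32), 'hex'): a 32-byte placeholder.
placeholder = bytes.fromhex("aa" * 32)
```

The hex form is recoverable from the raw bytes at any time, which is why storing only the 32-byte value loses nothing.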

jsonb_hash(jsonb) RETURNS bytea (was pgstac_item_hash RETURNS text)
  General-purpose RFC 8785-aligned canonical hash: sha256(utf8(jsonb_canonical(j))).
  Returns bytea directly; call encode(..., 'hex') when a printable string is needed.
  Always schema-qualified as pgstac.jsonb_hash() to avoid shadowing the pg_catalog
  hash support function of the same name (which returns integer for index hashing).
  The private column is intentionally excluded from this hash — it is operator
  metadata outside the STAC item identity contract.

private jsonb on items (restored)
  The old items schema had a private jsonb column for operator metadata not
  returned by the STAC API. It was dropped in the v0.10 rewrite; add it back.
  Not included in content_dehydrate (always NULL from ingest), not in
  items_content_distinct_sql (not item content), and not in hydration output.
  Operators set it via direct UPDATE, same pattern as collections.private.

jsonb_field_rows moved to 001a_jsonutils.sql
  The recursive JSONB path-walker is a general utility, not items-specific.
  Moving it to jsonutils makes it available earlier in the load order and
  alongside jsonb_leaf_rows, jsonb_common_values, and the other JSONB helpers.
  A comment in 003a_items.sql notes that it is defined in 001a.

Tests updated
  - plan: 347 → 349 (two new has_function checks: jsonb_hash, jsonb_field_rows)
  - 003_items.sql: content_hash → item_hash everywhere; length = 64 →
    octet_length = 32; tombstone INSERT uses decode(repeat('aa',32), 'hex')
    (32-byte bytea placeholder, was 64-char text)
  - 004_search.sql: content_hash → item_hash in explicit INSERT column list
  - 001a_jsonutils.sql: pgstac_item_hash → jsonb_hash; expected value is
    decode('77f18c0a…', 'hex') (bytea, not text)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bitner commented Jun 2, 2026

Collaborator Author

@copilot turn off the rustac tests in CI, we do not expect these to work yet with the schema rework.

Copilot AI commented Jun 2, 2026

Contributor

@copilot turn off the rustac tests in CI, we do not expect these to work yet with the schema rework.

Done in f9ff0f1 — I disabled the rust-crate job in .github/workflows/continuous-integration.yml with if: ${{ false }} so rustac tests no longer run in CI for now.

Copilot finished work on behalf of bitner June 2, 2026 13:58
bitner marked this pull request as ready for review June 2, 2026 14:49
bitner requested review from gadomski and hrodmn June 2, 2026 14:49

hrodmn left a comment

Collaborator

The changes look good to me! Switching to the explicit top-level columns for item properties seems like a good directional change - are we going to do something similar for collections or does the small scale of the collections table not warrant the added complexity?

Note: I am trying to update my dev environment with scripts/update and scripts/migrate so I can interact with the new db schema but I can't seem to get the right incantation.

I tried:

scripts/update  # exits cleanly
scripts/migrate  # fails

error:

Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "pypgstac": executable file not found in $PATH

$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;


-- jsonb_canonical: RFC 8785 (JSON Canonicalization Scheme)-aligned serialization.

Collaborator

This is great!

-- promoted_item_property_defs: Shared STAC-property mapping for promoted native
-- item columns. This function lives with queryables metadata so queryable seeding
-- and items-table property extraction reference the same source of truth.
CREATE OR REPLACE FUNCTION promoted_item_property_defs()

Collaborator

It would be nice to document somewhere the criteria that we would use when deciding to promote additional fields.

Collaborator Author

"promoted" fields are still hard coded, there is no user ability to modify those. I chose the list based on common properties and the properties of extensions marked as "stable" - I would consider the "criteria" to be that on any major version release (since it is a breaking change to the schema) that we would update that list with anything new that has been marked "stable".

Member

I feel a little uneasy about promoting extension fields — though the extension may be "stable", it might have breaking releases (e.g. proj). Core fields I'm much more comfortable with.

CREATE TABLE IF NOT EXISTS collections (
key bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
id text GENERATED ALWAYS AS (content->>'id') STORED UNIQUE NOT NULL,
content JSONB NOT NULL,

Collaborator

Maybe it is outside the scope of this PR but are you considering moving collection properties into top-level columns instead of stuffing them into content?

Collaborator Author

I wasn't because collections is generally small enough, that it's not really a "hot path", so the gains relative to items is pretty minimal.

-- the DELETE snapshot is taken but becomes referenced by a later insert could be removed.
-- The retention_interval guard makes this unlikely for normal ingest, but operators should
-- still run gc_fragments during low-ingest periods or with a sufficiently conservative
-- retention interval. This is a documented operational tradeoff, not a silent invariant.

Collaborator

Thanks for the explanation about the lack of foreign key for fragment_id


-- Create an item
SELECT create_item((SELECT content FROM test_items LIMIT 1));
SELECT id, geometry, collection, datetime, end_datetime, content, private FROM items WHERE collection='pgstactest-crudtest';

Collaborator

Is private gone now?

Collaborator Author

no it's still there but it still is basically a place to stash things that all the mechanisms of pgstac ignore

bitner commented Jun 2, 2026

Collaborator Author

The changes look good to me! Switching to the explicit top-level columns for item properties seems like a good directional change - are we going to do something similar for collections or does the small scale of the collections table not warrant the added complexity?

Note: I am trying to update my dev environment with scripts/update and scripts/migrate so I can interact with the new db schema but I can't seem to get the right incantation.

I tried:

scripts/update  # exits cleanly
scripts/migrate  # fails

error:

Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "pypgstac": executable file not found in $PATH

So, the next couple of releases with these SQL changes have broken some things that we will be working on later, so that we could keep the amount to review per PR more reasonable. Right now there is no incremental migration and some things with pypgstac are broken. The incremental migration is going to be its own PR at the end of this cycle of v0.10 development. Also, there will be a PR that replaces pypgstac (or more accurately, just makes it a wrapper around rustac pgstac), but the API will be a breaking change and will more closely reflect the rustac/pystac API. I'll update the PR description with instructions on how to get things set up in the meantime.

gadomski left a comment

Member

Pausing review mid-stream while we figure out extension fields in the item fragments.

Comment thread .github/workflows/continuous-integration.yml Outdated
Comment thread docs/src/pgstac.md
Comment on lines +90 to +91
The legacy `conf.nohydrate` flag is still accepted in the request JSON for backward
compatibility, but split-storage search always returns hydrated items.

Member

Am I reading this correctly, that hydration is always at the database layer now? If so, do we need to exercise this through stac-fastapi-pgstac to make sure this won't be a performance regression?

Collaborator Author

No. Rather than a flag that returns an item collection that then needs to be blown apart and reconstructed, the new rustac pgstac (and Python wrapper) functions will run a query that gets the raw rows (using functions coming in the next PR), so those tools (which stac-fastapi-pgstac will need to be updated to use) will have a much faster and more memory-efficient path. We still maintain the search function that returns the full item collection, but it will no longer have that option, so for things that use THAT function, yes, there might be a performance regression. That being said, the new hydration flow IS considerably faster than the old one.

Comment on lines +19 to +25
per_item_asset_fields CONSTANT text[] := ARRAY[
'href',
'file:size', 'file:checksum', 'file:local_path',
'alternate',
'storage:path', 'storage:platform', 'storage:region',
'storage:requester_pays', 'storage:tier'
];

Member

🤔 if things vary, they shouldn't be item_assets anyways, right? E.g. href isn't in any of the values of Collection.item_assets.

Collaborator Author

Fair point, this is a bit of an artifact from iterating through a number of ideas for generating the default fragment config, but I think it's ok to leave in as a safety valve since there are no constraints on what anybody might stick into item_assets.

Member

there are no constraints on what anybody might stick into item_assets

That suggests a default allow list might be safer, and allow folks to improve performance by manually providing more fields?

Comment thread src/pgstac/sql/003a_items.sql Outdated
bitner and others added 2 commits June 4, 2026 13:47
Co-authored-by: Pete Gadomski <pete.gadomski@gmail.com>
Audit of the promoted native-column queryables against the current core common
metadata + extension specs (projection v2.0.0, eo v2.0.0, view v1.1.0, sat v1.2.0,
scientific v1.0.0, file v2.1.0):

- proj:epsg (integer) -> proj:code (text). Renamed + retyped per projection v2.0.0
  (proj:epsg removed; proj:code is a string like "EPSG:32659").
- eo:bands -> bands. eo:bands was removed in eo v2.0.0 in favor of the STAC 1.1
  common `bands` array.
- file:values_regex removed -- not a field in the file extension (v2.1.0).
- Added proj:geometry (object), sat:platform_international_designator (text),
  sat:anx_datetime (timestamptz/date-time), view:moon_azimuth, view:moon_elevation.

Applied across all six COLUMN LIST SYNC CONTRACT locations (items DDL,
content_dehydrate, promoted_item_property_defs, promoted_properties_from_item,
items_staging_dehydrate, promoted_items_column_list); strip_promoted_properties and
the queryable seeding auto-derive. Updated pgtap/002a_queryables (was hardcoded to
proj:epsg) and regenerated pgstac.sql.

Legacy items keep deprecated field names round-tripping via the JSON fragment (just
not promoted to a native column). pgtap + all 9 basic-SQL suites green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bitner requested a review from gadomski June 4, 2026 20:08

gadomski left a comment

Member

Maybe I missed it, but can you add a table showing the "reference version" for each extension?

bitner commented Jun 8, 2026

Collaborator Author

Maybe I missed it, but can you add a table showing the "reference version" for each extension?

@gadomski just added a doc at docs/src/promoted-fields.md

bitner requested a review from gadomski June 9, 2026 14:12
bitner merged commit fd1e3e7 into main Jun 9, 2026
13 checks passed