Skip to content

feat: denormalize message and post tags into TEXT[] columns#1116

Merged
odesenfans merged 3 commits into
mainfrom
od/denormalize-tags
May 7, 2026
Merged

feat: denormalize message and post tags into TEXT[] columns#1116
odesenfans merged 3 commits into
mainfrom
od/denormalize-tags

Conversation

@odesenfans
Copy link
Copy Markdown
Collaborator

@odesenfans odesenfans commented May 5, 2026

Summary

  • Adds messages.tags TEXT[] and posts.tags TEXT[] with GIN indexes, populated by an Alembic migration that backfills from the per-type JSONB locations and drops the obsolete POST-only ix_messages_content_tags_gin.
  • Switches the tag filter on /messages and /posts from a JSONB ?| probe (POST-only, full heap scan elsewhere) to tags && ARRAY[...] against the new column. Side effect: tag filtering now actually works for STORE / INSTANCE / PROGRAM.
  • Models auto-extract tags from JSONB content based on message type at write time (extract_tags helper in db/models/messages.py, plus a tiny PostDb.__init__).
  • Pins aleph-message to commit 3c71a82 (Tighten tag storage per content type aleph-message#156, "Tighten tag storage per content type") so the schema matches the validator's per-type rules:
    • POST + AGGREGATE: content -> 'content' -> 'tags'
    • STORE: content -> 'tags'
    • INSTANCE/PROGRAM: content -> 'metadata' -> 'tags'

Migration notes

  • Single migration 0056_7e5a630e4b36_denormalize_message_tags: schema add, batched backfill (50k/batch with intermediate COMMITs), GIN indexes built CONCURRENTLY on a separate AUTOCOMMIT connection, then drop of the legacy index.
  • Empty/absent tag lists stay NULL in the column (distinguishable from explicit []).
  • Downgrade restores the old POST-only JSONB GIN index and drops both new columns.

Test plan

  • hatch run testing:test tests/api/test_list_messages.py::test_get_messages_filter_by_tags
  • hatch run testing:test tests/api/test_list_messages.py::test_get_messages_filter_by_tags_per_type
  • hatch run testing:test tests/api/test_list_messages.py::test_get_messages_filter_by_shared_tag_returns_all_types
  • hatch run testing:test tests/api/test_posts.py::test_get_posts_tags
  • Run migration against a staging snapshot and spot-check messages.tags / posts.tags populated for each message type that carries tags
  • hatch run linting:all

@odesenfans odesenfans force-pushed the od/denormalize-tags branch from 3c26ea6 to 24ebecd Compare May 5, 2026 20:51
foxpatch-aleph
foxpatch-aleph previously approved these changes May 5, 2026
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR correctly denormalizes tags from per-type JSONB paths into native TEXT[] columns on both messages and posts tables. The migration properly backfills data with batched updates, builds GIN indexes concurrently, and drops the legacy POST-only index. The model changes auto-extract tags at write time via a shared extract_tags helper. Query filters are correctly switched from JSONB has_any to array overlap. The merged-post queries properly coalesce tags from amend vs original. Existing tests align with the new behavior — fixture posts and messages will get tags populated through the init paths.

deployment/migrations/versions/0056_7e5a630e4b36_denormalize_message_tags.py (line 49): Minor concern: _batched_update relies on implicit autocommit mode after the prior explicit COMMIT in upgrade(). While this works in practice (the COMMIT at line 113 leaves the connection in autocommit mode), consider adding connection.execution_options(isolation_level="AUTOCOMMIT") before the loop to make this explicit and safer against future refactors. Alternatively, use connection.begin() / connection.begin().commit() instead of raw BEGIN/COMMIT text.

src/aleph/db/models/messages.py (line 80): Nice touch distinguishing NULL (no tags) from empty array (explicitly empty) — the not tags check ensures empty lists stay NULL. This matches the migration's jsonb_array_length > 0 guard.

src/aleph/db/accessors/posts.py (line 96): The coalesce(Amend.tags, Original.tags) correctly prioritizes the amend's tags when present, matching the existing coalesce(Amend.content, Original.content) pattern.

@odesenfans odesenfans force-pushed the od/denormalize-tags branch 2 times, most recently from d5907c4 to 0b742fc Compare May 5, 2026 21:23
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid refactor that denormalizes tags into native TEXT[] columns with GIN indexes, fixing tag filtering for all message types (not just POST). The migration is well-structured with batched backfill and concurrent index builds. The extract_tags helper cleanly maps the per-type tag locations. A few minor issues around Protocol definitions and transaction housekeeping in the migration, but nothing blocking.

src/aleph/db/accessors/posts.py (line 40): The MergedPost Protocol (line 40) and MergedPostV0 Protocol (line 52) are missing the tags field. Both make_select_merged_post_stmt and make_select_merged_post_with_message_info_stmt now return a tags column via func.coalesce(Amend.tags, Original.tags).label("tags"). Add tags: Optional[List[str]] to both Protocol definitions so the type contract matches the actual query results.

deployment/migrations/versions/0056_7e5a630e4b36_denormalize_message_tags.py (line 152): The if was_in_transaction: connection.execute(text("BEGIN")) at the end of upgrade() and downgrade() is unnecessary — no statements execute after it. It's harmless but dead code. Consider removing it.

src/aleph/db/models/messages.py (line 207): Minor: in MessageDb.__init__, extract_tags is called via setdefault, but from_pending_message explicitly passes tags=extract_tags(...) so the __init__ call is redundant (the setdefault correctly doesn't override). Not a bug, just a tiny inefficiency.

deployment/migrations/versions/0056_7e5a630e4b36_denormalize_message_tags.py (line 45): Minor: _batched_update manually issues raw BEGIN/COMMIT to control transaction boundaries for the batched backfill. Consider adding a brief comment explaining why raw SQL transactions are used instead of letting Alembic manage them (to bound WAL growth with intermediate commits).

foxpatch-aleph
foxpatch-aleph previously approved these changes May 5, 2026
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR cleanly denormalizes tags from JSONB content into native TEXT[] columns on both messages and posts tables, with GIN indexes for efficient filtering. The extract_tags helper correctly routes per-message-type JSONB paths, the accessor changes switch from a POST-only ?| probe to && overlap (fixing tag filtering for STORE/INSTANCE/PROGRAM as a side effect), and the migration follows the established batching pattern from prior migrations. The coalesce(Amend.tags, Original.tags) in the merged-post queries correctly prefer amender tags. Test fixtures properly exercise the new __init__-level tag extraction. No blocking issues found.

deployment/migrations/versions/0056_7e5a630e4b36_denormalize_message_tags.py (line 49): The _batched_update function manually issues BEGIN/COMMIT via raw SQL, which creates nested transactions (SAVEPOINTs) under SQLAlchemy's transaction management. This matches the pattern from migration 0048 and works correctly, but it's a fragile pattern. If this were new code I'd suggest using connection.begin() / connection.commit() instead of raw text("BEGIN")/text("COMMIT"), but since it's consistent with the existing migration style, no change needed.

src/aleph/db/models/posts.py (line 35): The PostDb.__init__ extracts tags from content.get('tags'), which matches the POST content schema. This is correct since the posts table stores the inner post content (not the full message envelope). Consider a brief inline comment to clarify that content here is the post-level payload, not the message-level one — this distinction tripped me up on first read.

pyproject.toml (line 33): Pinning aleph-message to a specific git commit rather than a version is a reasonable short-term approach but makes reproducibility harder. A minor note: consider adding a comment in the toml referencing the PR or issue number so the reason for the pin is clear to future maintainers.

odesenfans added 2 commits May 6, 2026 11:03
Promote the tag list out of JSONB into native ``messages.tags`` and
``posts.tags`` columns indexed by GIN. The previous tag filter only
looked at ``content.content.tags`` and skipped non-POST messages, so
it never matched STORE / INSTANCE / PROGRAM messages. The new column
unifies the per-type locations matched by the tighter aleph-message
validation:

* POST + AGGREGATE: ``content -> 'content' -> 'tags'``
* STORE:            ``content -> 'tags'``
* INSTANCE/PROGRAM: ``content -> 'metadata' -> 'tags'``

Pin aleph-message to the snapshot that introduced the per-type rules.
Adds a parametrized API test that seeds one message of each tagged
type (POST, AGGREGATE, STORE, INSTANCE, PROGRAM) with tags stored at
the legacy per-type JSONB key, then asserts:

* a type-unique tag returns exactly the matching message;
* a shared tag returns one match per type.

Guards against regressing back into the previous POST-only filter.
@odesenfans odesenfans force-pushed the od/denormalize-tags branch from 5700009 to b04ac1f Compare May 6, 2026 09:03
Production data contains posts where ``content->'tags'`` is a scalar
string rather than an array. The original backfill used
``jsonb_array_length(...) > 0`` to skip empty arrays, but
``jsonb_array_length`` raises on scalars and PostgreSQL is free to
evaluate that predicate before the ``jsonb_typeof = 'array'`` guard.

Replace the length check with a JSONB literal comparison
(``<> '[]'::jsonb``). Equality is defined across all JSONB types, so
the predicate never aborts regardless of the planner's evaluation
order. NULL-vs-empty semantics are preserved: absent and explicitly
empty tag arrays both stay NULL in the new column.

Re-running the migration after this fix is idempotent: rows already
backfilled stay populated and are skipped via ``tags IS NULL``.
@odesenfans odesenfans merged commit b98c716 into main May 7, 2026
4 checks passed
@odesenfans odesenfans deleted the od/denormalize-tags branch May 7, 2026 09:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants