feat: denormalize message and post tags into TEXT[] columns by odesenfans · Pull Request #1116 · aleph-im/pyaleph

odesenfans · 2026-05-05T20:39:52Z

Summary

Adds messages.tags TEXT[] and posts.tags TEXT[] with GIN indexes, populated by an Alembic migration that backfills from the per-type JSONB locations and drops the obsolete POST-only ix_messages_content_tags_gin.
Switches the tag filter on /messages and /posts from a JSONB ?| probe (POST-only, full heap scan elsewhere) to tags && ARRAY[...] against the new column. Side effect: tag filtering now actually works for STORE / INSTANCE / PROGRAM.
Models auto-extract tags from JSONB content based on message type at write time (extract_tags helper in db/models/messages.py, plus a tiny PostDb.__init__).
Pins aleph-message to commit 3c71a82 (Tighten tag storage per content type aleph-message#156, "Tighten tag storage per content type") so the schema matches the validator's per-type rules:
- POST + AGGREGATE: content -> 'content' -> 'tags'
- STORE: content -> 'tags'
- INSTANCE/PROGRAM: content -> 'metadata' -> 'tags'

Migration notes

Single migration 0056_7e5a630e4b36_denormalize_message_tags: schema add, batched backfill (50k/batch with intermediate COMMITs), GIN indexes built CONCURRENTLY on a separate AUTOCOMMIT connection, then drop of the legacy index.
Empty/absent tag lists stay NULL in the column (distinguishable from explicit []).
Downgrade restores the old POST-only JSONB GIN index and drops both new columns.

Test plan

hatch run testing:test tests/api/test_list_messages.py::test_get_messages_filter_by_tags
hatch run testing:test tests/api/test_list_messages.py::test_get_messages_filter_by_tags_per_type
hatch run testing:test tests/api/test_list_messages.py::test_get_messages_filter_by_shared_tag_returns_all_types
hatch run testing:test tests/api/test_posts.py::test_get_posts_tags
Run migration against a staging snapshot and spot-check messages.tags / posts.tags populated for each message type that carries tags
hatch run linting:all

foxpatch-aleph

The PR correctly denormalizes tags from per-type JSONB paths into native TEXT[] columns on both messages and posts tables. The migration properly backfills data with batched updates, builds GIN indexes concurrently, and drops the legacy POST-only index. The model changes auto-extract tags at write time via a shared extract_tags helper. Query filters are correctly switched from JSONB has_any to array overlap. The merged-post queries properly coalesce tags from amend vs original. Existing tests align with the new behavior — fixture posts and messages will get tags populated through the init paths.

deployment/migrations/versions/0056_7e5a630e4b36_denormalize_message_tags.py (line 49): Minor concern: _batched_update relies on implicit autocommit mode after the prior explicit COMMIT in upgrade(). While this works in practice (the COMMIT at line 113 leaves the connection in autocommit mode), consider adding connection.execution_options(isolation_level="AUTOCOMMIT") before the loop to make this explicit and safer against future refactors. Alternatively, use connection.begin() / connection.begin().commit() instead of raw BEGIN/COMMIT text.

src/aleph/db/models/messages.py (line 80): Nice touch distinguishing NULL (no tags) from empty array (explicitly empty) — the not tags check ensures empty lists stay NULL. This matches the migration's jsonb_array_length > 0 guard.

src/aleph/db/accessors/posts.py (line 96): The coalesce(Amend.tags, Original.tags) correctly prioritizes the amend's tags when present, matching the existing coalesce(Amend.content, Original.content) pattern.

foxpatch-aleph

Solid refactor that denormalizes tags into native TEXT[] columns with GIN indexes, fixing tag filtering for all message types (not just POST). The migration is well-structured with batched backfill and concurrent index builds. The extract_tags helper cleanly maps the per-type tag locations. A few minor issues around Protocol definitions and transaction housekeeping in the migration, but nothing blocking.

src/aleph/db/accessors/posts.py (line 40): The MergedPost Protocol (line 40) and MergedPostV0 Protocol (line 52) are missing the tags field. Both make_select_merged_post_stmt and make_select_merged_post_with_message_info_stmt now return a tags column via func.coalesce(Amend.tags, Original.tags).label("tags"). Add tags: Optional[List[str]] to both Protocol definitions so the type contract matches the actual query results.

deployment/migrations/versions/0056_7e5a630e4b36_denormalize_message_tags.py (line 152): The if was_in_transaction: connection.execute(text("BEGIN")) at the end of upgrade() and downgrade() is unnecessary — no statements execute after it. It's harmless but dead code. Consider removing it.

src/aleph/db/models/messages.py (line 207): Minor: in MessageDb.__init__, extract_tags is called via setdefault, but from_pending_message explicitly passes tags=extract_tags(...) so the __init__ call is redundant (the setdefault correctly doesn't override). Not a bug, just a tiny inefficiency.

deployment/migrations/versions/0056_7e5a630e4b36_denormalize_message_tags.py (line 45): Minor: _batched_update manually issues raw BEGIN/COMMIT to control transaction boundaries for the batched backfill. Consider adding a brief comment explaining why raw SQL transactions are used instead of letting Alembic manage them (to bound WAL growth with intermediate commits).

foxpatch-aleph

This PR cleanly denormalizes tags from JSONB content into native TEXT[] columns on both messages and posts tables, with GIN indexes for efficient filtering. The extract_tags helper correctly routes per-message-type JSONB paths, the accessor changes switch from a POST-only ?| probe to && overlap (fixing tag filtering for STORE/INSTANCE/PROGRAM as a side effect), and the migration follows the established batching pattern from prior migrations. The coalesce(Amend.tags, Original.tags) in the merged-post queries correctly prefer amender tags. Test fixtures properly exercise the new __init__-level tag extraction. No blocking issues found.

deployment/migrations/versions/0056_7e5a630e4b36_denormalize_message_tags.py (line 49): The _batched_update function manually issues BEGIN/COMMIT via raw SQL, which creates nested transactions (SAVEPOINTs) under SQLAlchemy's transaction management. This matches the pattern from migration 0048 and works correctly, but it's a fragile pattern. If this were new code I'd suggest using connection.begin() / connection.commit() instead of raw text("BEGIN")/text("COMMIT"), but since it's consistent with the existing migration style, no change needed.

src/aleph/db/models/posts.py (line 35): The PostDb.__init__ extracts tags from content.get('tags'), which matches the POST content schema. This is correct since the posts table stores the inner post content (not the full message envelope). Consider a brief inline comment to clarify that content here is the post-level payload, not the message-level one — this distinction tripped me up on first read.

pyproject.toml (line 33): Pinning aleph-message to a specific git commit rather than a version is a reasonable short-term approach but makes reproducibility harder. A minor note: consider adding a comment in the toml referencing the PR or issue number so the reason for the pin is clear to future maintainers.

Promote the tag list out of JSONB into native ``messages.tags`` and ``posts.tags`` columns indexed by GIN. The previous tag filter only looked at ``content.content.tags`` and skipped non-POST messages, so it never matched STORE / INSTANCE / PROGRAM messages. The new column unifies the per-type locations matched by the tighter aleph-message validation: * POST + AGGREGATE: ``content -> 'content' -> 'tags'`` * STORE: ``content -> 'tags'`` * INSTANCE/PROGRAM: ``content -> 'metadata' -> 'tags'`` Pin aleph-message to the snapshot that introduced the per-type rules.

Adds a parametrized API test that seeds one message of each tagged type (POST, AGGREGATE, STORE, INSTANCE, PROGRAM) with tags stored at the legacy per-type JSONB key, then asserts: * a type-unique tag returns exactly the matching message; * a shared tag returns one match per type. Guards against regressing back into the previous POST-only filter.

Production data contains posts where ``content->'tags'`` is a scalar string rather than an array. The original backfill used ``jsonb_array_length(...) > 0`` to skip empty arrays, but ``jsonb_array_length`` raises on scalars and PostgreSQL is free to evaluate that predicate before the ``jsonb_typeof = 'array'`` guard. Replace the length check with a JSONB literal comparison (``<> '[]'::jsonb``). Equality is defined across all JSONB types, so the predicate never aborts regardless of the planner's evaluation order. NULL-vs-empty semantics are preserved: absent and explicitly empty tag arrays both stay NULL in the new column. Re-running the migration after this fix is idempotent: rows already backfilled stay populated and are skipped via ``tags IS NULL``.

odesenfans force-pushed the od/denormalize-tags branch from 3c26ea6 to 24ebecd Compare May 5, 2026 20:51

foxpatch-aleph previously approved these changes May 5, 2026

View reviewed changes

odesenfans dismissed foxpatch-aleph’s stale review via d5907c4 May 5, 2026 21:21

odesenfans force-pushed the od/denormalize-tags branch 2 times, most recently from d5907c4 to 0b742fc Compare May 5, 2026 21:23

foxpatch-aleph approved these changes May 5, 2026

View reviewed changes

foxpatch-aleph previously approved these changes May 5, 2026

View reviewed changes

odesenfans dismissed foxpatch-aleph’s stale review via 5700009 May 6, 2026 08:09

odesenfans added 2 commits May 6, 2026 11:03

odesenfans force-pushed the od/denormalize-tags branch from 5700009 to b04ac1f Compare May 6, 2026 09:03

odesenfans merged commit b98c716 into main May 7, 2026
4 checks passed

odesenfans deleted the od/denormalize-tags branch May 7, 2026 09:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: denormalize message and post tags into TEXT[] columns#1116

feat: denormalize message and post tags into TEXT[] columns#1116
odesenfans merged 3 commits into
mainfrom
od/denormalize-tags

odesenfans commented May 5, 2026 •

edited

Loading

Uh oh!

foxpatch-aleph left a comment

Uh oh!

foxpatch-aleph left a comment

Uh oh!

foxpatch-aleph left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

odesenfans commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Migration notes

Test plan

Uh oh!

foxpatch-aleph left a comment

Choose a reason for hiding this comment

Uh oh!

foxpatch-aleph left a comment

Choose a reason for hiding this comment

Uh oh!

foxpatch-aleph left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

odesenfans commented May 5, 2026 •

edited

Loading