feat: denormalize message and post tags into TEXT[] columns#1116
Conversation
3c26ea6 to
24ebecd
Compare
foxpatch-aleph
left a comment
There was a problem hiding this comment.
The PR correctly denormalizes tags from per-type JSONB paths into native TEXT[] columns on both messages and posts tables. The migration properly backfills data with batched updates, builds GIN indexes concurrently, and drops the legacy POST-only index. The model changes auto-extract tags at write time via a shared extract_tags helper. Query filters are correctly switched from JSONB has_any to array overlap. The merged-post queries properly coalesce tags from amend vs original. Existing tests align with the new behavior — fixture posts and messages will get tags populated through the init paths.
deployment/migrations/versions/0056_7e5a630e4b36_denormalize_message_tags.py (line 49): Minor concern: _batched_update relies on implicit autocommit mode after the prior explicit COMMIT in upgrade(). While this works in practice (the COMMIT at line 113 leaves the connection in autocommit mode), consider adding connection.execution_options(isolation_level="AUTOCOMMIT") before the loop to make this explicit and safer against future refactors. Alternatively, use connection.begin() / connection.begin().commit() instead of raw BEGIN/COMMIT text.
src/aleph/db/models/messages.py (line 80): Nice touch distinguishing NULL (no tags) from empty array (explicitly empty) — the not tags check ensures empty lists stay NULL. This matches the migration's jsonb_array_length > 0 guard.
src/aleph/db/accessors/posts.py (line 96): The coalesce(Amend.tags, Original.tags) correctly prioritizes the amend's tags when present, matching the existing coalesce(Amend.content, Original.content) pattern.
d5907c4 to
0b742fc
Compare
foxpatch-aleph
left a comment
There was a problem hiding this comment.
Solid refactor that denormalizes tags into native TEXT[] columns with GIN indexes, fixing tag filtering for all message types (not just POST). The migration is well-structured with batched backfill and concurrent index builds. The extract_tags helper cleanly maps the per-type tag locations. A few minor issues around Protocol definitions and transaction housekeeping in the migration, but nothing blocking.
src/aleph/db/accessors/posts.py (line 40): The MergedPost Protocol (line 40) and MergedPostV0 Protocol (line 52) are missing the tags field. Both make_select_merged_post_stmt and make_select_merged_post_with_message_info_stmt now return a tags column via func.coalesce(Amend.tags, Original.tags).label("tags"). Add tags: Optional[List[str]] to both Protocol definitions so the type contract matches the actual query results.
deployment/migrations/versions/0056_7e5a630e4b36_denormalize_message_tags.py (line 152): The if was_in_transaction: connection.execute(text("BEGIN")) at the end of upgrade() and downgrade() is unnecessary — no statements execute after it. It's harmless but dead code. Consider removing it.
src/aleph/db/models/messages.py (line 207): Minor: in MessageDb.__init__, extract_tags is called via setdefault, but from_pending_message explicitly passes tags=extract_tags(...) so the __init__ call is redundant (the setdefault correctly doesn't override). Not a bug, just a tiny inefficiency.
deployment/migrations/versions/0056_7e5a630e4b36_denormalize_message_tags.py (line 45): Minor: _batched_update manually issues raw BEGIN/COMMIT to control transaction boundaries for the batched backfill. Consider adding a brief comment explaining why raw SQL transactions are used instead of letting Alembic manage them (to bound WAL growth with intermediate commits).
foxpatch-aleph
left a comment
There was a problem hiding this comment.
This PR cleanly denormalizes tags from JSONB content into native TEXT[] columns on both messages and posts tables, with GIN indexes for efficient filtering. The extract_tags helper correctly routes per-message-type JSONB paths, the accessor changes switch from a POST-only ?| probe to && overlap (fixing tag filtering for STORE/INSTANCE/PROGRAM as a side effect), and the migration follows the established batching pattern from prior migrations. The coalesce(Amend.tags, Original.tags) in the merged-post queries correctly prefer amender tags. Test fixtures properly exercise the new __init__-level tag extraction. No blocking issues found.
deployment/migrations/versions/0056_7e5a630e4b36_denormalize_message_tags.py (line 49): The _batched_update function manually issues BEGIN/COMMIT via raw SQL, which creates nested transactions (SAVEPOINTs) under SQLAlchemy's transaction management. This matches the pattern from migration 0048 and works correctly, but it's a fragile pattern. If this were new code I'd suggest using connection.begin() / connection.commit() instead of raw text("BEGIN")/text("COMMIT"), but since it's consistent with the existing migration style, no change needed.
src/aleph/db/models/posts.py (line 35): The PostDb.__init__ extracts tags from content.get('tags'), which matches the POST content schema. This is correct since the posts table stores the inner post content (not the full message envelope). Consider a brief inline comment to clarify that content here is the post-level payload, not the message-level one — this distinction tripped me up on first read.
pyproject.toml (line 33): Pinning aleph-message to a specific git commit rather than a version is a reasonable short-term approach but makes reproducibility harder. A minor note: consider adding a comment in the toml referencing the PR or issue number so the reason for the pin is clear to future maintainers.
Promote the tag list out of JSONB into native ``messages.tags`` and ``posts.tags`` columns indexed by GIN. The previous tag filter only looked at ``content.content.tags`` and skipped non-POST messages, so it never matched STORE / INSTANCE / PROGRAM messages. The new column unifies the per-type locations matched by the tighter aleph-message validation: * POST + AGGREGATE: ``content -> 'content' -> 'tags'`` * STORE: ``content -> 'tags'`` * INSTANCE/PROGRAM: ``content -> 'metadata' -> 'tags'`` Pin aleph-message to the snapshot that introduced the per-type rules.
Adds a parametrized API test that seeds one message of each tagged type (POST, AGGREGATE, STORE, INSTANCE, PROGRAM) with tags stored at the legacy per-type JSONB key, then asserts: * a type-unique tag returns exactly the matching message; * a shared tag returns one match per type. Guards against regressing back into the previous POST-only filter.
5700009 to
b04ac1f
Compare
Production data contains posts where ``content->'tags'`` is a scalar string rather than an array. The original backfill used ``jsonb_array_length(...) > 0`` to skip empty arrays, but ``jsonb_array_length`` raises on scalars and PostgreSQL is free to evaluate that predicate before the ``jsonb_typeof = 'array'`` guard. Replace the length check with a JSONB literal comparison (``<> '[]'::jsonb``). Equality is defined across all JSONB types, so the predicate never aborts regardless of the planner's evaluation order. NULL-vs-empty semantics are preserved: absent and explicitly empty tag arrays both stay NULL in the new column. Re-running the migration after this fix is idempotent: rows already backfilled stay populated and are skipped via ``tags IS NULL``.
Summary
messages.tags TEXT[]andposts.tags TEXT[]with GIN indexes, populated by an Alembic migration that backfills from the per-type JSONB locations and drops the obsolete POST-onlyix_messages_content_tags_gin./messagesand/postsfrom a JSONB?|probe (POST-only, full heap scan elsewhere) totags && ARRAY[...]against the new column. Side effect: tag filtering now actually works for STORE / INSTANCE / PROGRAM.extract_tagshelper indb/models/messages.py, plus a tinyPostDb.__init__).aleph-messageto commit3c71a82(Tighten tag storage per content type aleph-message#156, "Tighten tag storage per content type") so the schema matches the validator's per-type rules:content -> 'content' -> 'tags'content -> 'tags'content -> 'metadata' -> 'tags'Migration notes
0056_7e5a630e4b36_denormalize_message_tags: schema add, batched backfill (50k/batch with intermediate COMMITs), GIN indexes builtCONCURRENTLYon a separate AUTOCOMMIT connection, then drop of the legacy index.NULLin the column (distinguishable from explicit[]).Test plan
hatch run testing:test tests/api/test_list_messages.py::test_get_messages_filter_by_tagshatch run testing:test tests/api/test_list_messages.py::test_get_messages_filter_by_tags_per_typehatch run testing:test tests/api/test_list_messages.py::test_get_messages_filter_by_shared_tag_returns_all_typeshatch run testing:test tests/api/test_posts.py::test_get_posts_tagsmessages.tags/posts.tagspopulated for each message type that carries tagshatch run linting:all