Tighten tag storage per content type by odesenfans · Pull Request #156 · aleph-im/aleph-message

odesenfans · 2026-05-05T10:52:36Z

Summary

Replaces the unified BaseContent.tags field (previous version of this PR) with per-type validation that tightens tag storage where it already lives, without introducing a parallel canonical location.

Stores: tags: Optional[List[Tag]] is now a typed top-level field on StoreContent (this is the only type where the canonical location is at the content root).
Posts and aggregates: a field validator on the inner content dict ensures that, when a tags key is present, it matches the canonical Optional[List[Tag]] shape with the usual 1–64 char/16 entry caps. The legacy nested content.content.tags location stays canonical.
Instances and programs: BaseExecutableContent.metadata is tightened from Dict[str, Any] to scalars (str, int, float, bool, null) plus an optional tags key holding a List[Tag]. The legacy content.metadata.tags location stays canonical.

Adds MAX_METADATA_KEY_LENGTH = 64 and MAX_METADATA_VALUE_LENGTH = 512 so metadata is bounded by bytes, not just by entry count.

BaseContent.tags and the Message.tags() accessor introduced earlier in this branch are removed.

Why this approach

The earlier design added tags as a typed field on BaseContent and provided a unified accessor that fell back to legacy nested locations for posts/aggregates. That gave us a second valid place to store tags on every message type without removing the existing ones — codifying two parallel conventions per type indefinitely. Tightening the legacy locations instead keeps a single canonical path per type and defers the "unify the schema" question to a future message format revision.

Hash compatibility

This is a strict-validation tightening, not a structural change. Existing messages whose tag lists already match List[Tag] and whose metadata values are scalars are unaffected. Messages that today store malformed tags (non-list, non-string elements, oversized) or non-scalar metadata values will fail to revalidate — same trade-off as the prior version, on the legacy locations rather than a new one. Worth running a count against prod data before this lands.

Test plan

pytest aleph_message/tests/test_models.py (offline) — 63 passed, 1 skipped, including:
- StoreContent.tags: count/length/empty-string boundaries and serialization defaults
- PostContent / AggregateContent: nested content.tags shape validation, count/length boundaries, non-list and non-string-entry rejection, top-level tags rejection, non-dict content unaffected
- InstanceContent metadata: tags shape validation, scalar value enforcement, dict/list value rejection (outside tags), key/value length boundaries, top-level tags rejection
black, isort, ruff, mypy clean
Network-dependent tests against the testnet endpoint — left for CI

Promotes tags to a typed field on BaseContent (Optional[List[str]], max 16 tags of 64 chars each) so every message type carries them consistently. Adds a tags() method on BaseMessage that falls back to the legacy content.content.tags location for POST and AGGREGATE messages so existing data remains addressable. The field defaults to None and is excluded from serialization when unset, preserving item_hash compatibility for messages that don't opt in.

foxpatch-aleph

Clean, well-designed change. The tags field on BaseContent is correctly typed with sensible constraints (16 max tags, 1-64 chars each), defaults to None for hash compatibility, and the tags() accessor on BaseMessage provides a useful unified interface with proper legacy fallback for POST/AGGREGATE and defensive filtering. Test coverage is thorough — boundary tests, serialization round-trip, canonical vs legacy precedence, empty-list semantics, malformed input handling, and the non-fallback for executable types are all covered.

Replace the unified `BaseContent.tags` field and `Message.tags()` accessor with per-type validation that reflects how each message format already stores tags in the wild: - StoreContent gains `tags: Optional[List[Tag]]` as a typed top-level field (kept canonical for stores). - PostContent and AggregateContent validate the optional `tags` key inside their nested user-content dict against the canonical Tag list shape (legacy location remains canonical for posts/aggregates). - BaseExecutableContent tightens `metadata` from `Dict[str, Any]` to scalars (str, int, float, bool, null) plus an optional `tags` key that holds a Tag list (legacy location remains canonical for instances/programs). Adds MAX_METADATA_KEY_LENGTH (64) and MAX_METADATA_VALUE_LENGTH (512) to bound metadata values by bytes, not just by entry count. The previous unified-field approach added a second valid location for tags without removing the legacy ones, which would have left two co-existing conventions per content type. This change keeps tags where they already live and tightens their shape instead.

foxpatch-aleph

The PR cleanly tightens tag storage per content type without introducing a parallel canonical location. StoreContent gains a typed top-level tags field, Post/Aggregate content validates nested content.tags via a shared adapter, and BaseExecutableContent.metadata is tightened to scalars plus an optional tags key. The new constants (MAX_N_TAGS, MAX_TAG_LENGTH, MAX_METADATA_KEY_LENGTH, MAX_METADATA_VALUE_LENGTH) are well-placed. The isinstance(value, (bool, int, float)) check correctly orders bool before int. Tests are thorough with boundary cases covered. One minor suggestion below.

aleph_message/tests/test_models.py (line 1010): Consider adding a test for ProgramContent metadata validation as well (e.g., reusing _load_instance_fixture pattern with factory=ProgramMessage). ProgramContent inherits from BaseExecutableContent but isn't explicitly exercised — a fixture like machine.json could be used. Not a blocker since the shared validator is on the base class and InstanceContent already exercises it.

Production tags reach ~200 characters; the previous 64-char cap would reject existing messages. Bump the limit to 256 to keep validation tighter than 'no cap' while not rejecting historical content.

foxpatch-aleph

The PR is well-structured and correctly tightens tag storage per content type. The approach of keeping canonical tag locations per type (StoreContent.tags, Post/Aggregate content.tags, Instance/Program metadata.tags) rather than introducing a unified top-level field is sound. One genuine bug was found: the metadata validator for Instance/Program messages rejects {"tags": null} because the TypeAdapter was called unconditionally when the key was "tags", without a None guard — unlike the nested-tags validator for Posts/Aggregates which already handles this correctly. I've fixed this in aleph_message/models/execution/abstract.py:97-99 and added a matching test. All 68 tests pass now.

aleph_message/models/execution/abstract.py (line 97): BUG: The metadata validator calls _METADATA_TAGS_ADAPTER.validate_python(value) unconditionally when key == "tags", but the adapter is for List[Tag] (not Optional[List[Tag]]). This means {"tags": null} in metadata raises a ValidationError, while the equivalent for Posts/Aggregates correctly allows null (guarded at aleph_message/models/__init__.py:132). Fixed by adding a if value is not None guard. Added test_metadata_tags_null_allowed for parity with test_post_nested_tags_null_allowed.

aleph_message/models/__init__.py (line 124): NIT: The _NESTED_TAGS_ADAPTER and _METADATA_TAGS_ADAPTER are identical — consider defining it once in abstract.py and importing both places for DRY. Not a blocker.

aleph_message/models/__init__.py (line 132): NIT: The test_post_top_level_tags_rejected and test_instance_top_level_tags_rejected tests pass tags as a kwarg to the constructor, which triggers extra="forbid" on BaseContent. This is correct behavior, but the test could be more explicit about why tags is rejected (i.e., it's because extra="forbid" on BaseContent, not a custom check).

foxpatch-aleph approved these changes May 5, 2026

View reviewed changes

odesenfans changed the title ~~Add tags field to BaseContent and Message.tags() accessor~~ Tighten tag storage per content type May 5, 2026

foxpatch-aleph approved these changes May 5, 2026

View reviewed changes

odesenfans mentioned this pull request May 5, 2026

feat: denormalize message and post tags into TEXT[] columns aleph-im/pyaleph#1116

Merged

6 tasks

Bump MAX_TAG_LENGTH to 256

3c71a82

Production tags reach ~200 characters; the previous 64-char cap would reject existing messages. Bump the limit to 256 to keep validation tighter than 'no cap' while not rejecting historical content.

foxpatch-aleph requested changes May 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tighten tag storage per content type#156

Tighten tag storage per content type#156
odesenfans wants to merge 3 commits into
mainfrom
od/tags

odesenfans commented May 5, 2026 •

edited

Loading

Uh oh!

foxpatch-aleph left a comment

Uh oh!

foxpatch-aleph left a comment

Uh oh!

foxpatch-aleph left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

odesenfans commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why this approach

Hash compatibility

Test plan

Uh oh!

foxpatch-aleph left a comment

Choose a reason for hiding this comment

Uh oh!

foxpatch-aleph left a comment

Choose a reason for hiding this comment

Uh oh!

foxpatch-aleph left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

odesenfans commented May 5, 2026 •

edited

Loading