Tighten tag storage per content type#156
Conversation
Promotes tags to a typed field on BaseContent (Optional[List[str]], max 16 tags of 64 chars each) so every message type carries them consistently. Adds a tags() method on BaseMessage that falls back to the legacy content.content.tags location for POST and AGGREGATE messages so existing data remains addressable. The field defaults to None and is excluded from serialization when unset, preserving item_hash compatibility for messages that don't opt in.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
Clean, well-designed change. The tags field on BaseContent is correctly typed with sensible constraints (16 max tags, 1-64 chars each), defaults to None for hash compatibility, and the tags() accessor on BaseMessage provides a useful unified interface with proper legacy fallback for POST/AGGREGATE and defensive filtering. Test coverage is thorough — boundary tests, serialization round-trip, canonical vs legacy precedence, empty-list semantics, malformed input handling, and the non-fallback for executable types are all covered.
Replace the unified `BaseContent.tags` field and `Message.tags()` accessor with per-type validation that reflects how each message format already stores tags in the wild: - StoreContent gains `tags: Optional[List[Tag]]` as a typed top-level field (kept canonical for stores). - PostContent and AggregateContent validate the optional `tags` key inside their nested user-content dict against the canonical Tag list shape (legacy location remains canonical for posts/aggregates). - BaseExecutableContent tightens `metadata` from `Dict[str, Any]` to scalars (str, int, float, bool, null) plus an optional `tags` key that holds a Tag list (legacy location remains canonical for instances/programs). Adds MAX_METADATA_KEY_LENGTH (64) and MAX_METADATA_VALUE_LENGTH (512) to bound metadata values by bytes, not just by entry count. The previous unified-field approach added a second valid location for tags without removing the legacy ones, which would have left two co-existing conventions per content type. This change keeps tags where they already live and tightens their shape instead.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
The PR cleanly tightens tag storage per content type without introducing a parallel canonical location. StoreContent gains a typed top-level tags field, Post/Aggregate content validates nested content.tags via a shared adapter, and BaseExecutableContent.metadata is tightened to scalars plus an optional tags key. The new constants (MAX_N_TAGS, MAX_TAG_LENGTH, MAX_METADATA_KEY_LENGTH, MAX_METADATA_VALUE_LENGTH) are well-placed. The isinstance(value, (bool, int, float)) check correctly orders bool before int. Tests are thorough with boundary cases covered. One minor suggestion below.
aleph_message/tests/test_models.py (line 1010): Consider adding a test for ProgramContent metadata validation as well (e.g., reusing _load_instance_fixture pattern with factory=ProgramMessage). ProgramContent inherits from BaseExecutableContent but isn't explicitly exercised — a fixture like machine.json could be used. Not a blocker since the shared validator is on the base class and InstanceContent already exercises it.
Production tags reach ~200 characters; the previous 64-char cap would reject existing messages. Bump the limit to 256 to keep validation tighter than 'no cap' while not rejecting historical content.
foxpatch-aleph
left a comment
There was a problem hiding this comment.
The PR is well-structured and correctly tightens tag storage per content type. The approach of keeping canonical tag locations per type (StoreContent.tags, Post/Aggregate content.tags, Instance/Program metadata.tags) rather than introducing a unified top-level field is sound. One genuine bug was found: the metadata validator for Instance/Program messages rejects {"tags": null} because the TypeAdapter was called unconditionally when the key was "tags", without a None guard — unlike the nested-tags validator for Posts/Aggregates which already handles this correctly. I've fixed this in aleph_message/models/execution/abstract.py:97-99 and added a matching test. All 68 tests pass now.
aleph_message/models/execution/abstract.py (line 97): BUG: The metadata validator calls _METADATA_TAGS_ADAPTER.validate_python(value) unconditionally when key == "tags", but the adapter is for List[Tag] (not Optional[List[Tag]]). This means {"tags": null} in metadata raises a ValidationError, while the equivalent for Posts/Aggregates correctly allows null (guarded at aleph_message/models/__init__.py:132). Fixed by adding a if value is not None guard. Added test_metadata_tags_null_allowed for parity with test_post_nested_tags_null_allowed.
aleph_message/models/__init__.py (line 124): NIT: The _NESTED_TAGS_ADAPTER and _METADATA_TAGS_ADAPTER are identical — consider defining it once in abstract.py and importing both places for DRY. Not a blocker.
aleph_message/models/__init__.py (line 132): NIT: The test_post_top_level_tags_rejected and test_instance_top_level_tags_rejected tests pass tags as a kwarg to the constructor, which triggers extra="forbid" on BaseContent. This is correct behavior, but the test could be more explicit about why tags is rejected (i.e., it's because extra="forbid" on BaseContent, not a custom check).
Summary
Replaces the unified
BaseContent.tagsfield (previous version of this PR) with per-type validation that tightens tag storage where it already lives, without introducing a parallel canonical location.tags: Optional[List[Tag]]is now a typed top-level field onStoreContent(this is the only type where the canonical location is at the content root).contentdict ensures that, when atagskey is present, it matches the canonicalOptional[List[Tag]]shape with the usual 1–64 char/16 entry caps. The legacy nestedcontent.content.tagslocation stays canonical.BaseExecutableContent.metadatais tightened fromDict[str, Any]to scalars (str, int, float, bool, null) plus an optionaltagskey holding aList[Tag]. The legacycontent.metadata.tagslocation stays canonical.Adds
MAX_METADATA_KEY_LENGTH = 64andMAX_METADATA_VALUE_LENGTH = 512so metadata is bounded by bytes, not just by entry count.BaseContent.tagsand theMessage.tags()accessor introduced earlier in this branch are removed.Why this approach
The earlier design added
tagsas a typed field onBaseContentand provided a unified accessor that fell back to legacy nested locations for posts/aggregates. That gave us a second valid place to store tags on every message type without removing the existing ones — codifying two parallel conventions per type indefinitely. Tightening the legacy locations instead keeps a single canonical path per type and defers the "unify the schema" question to a future message format revision.Hash compatibility
This is a strict-validation tightening, not a structural change. Existing messages whose tag lists already match
List[Tag]and whose metadata values are scalars are unaffected. Messages that today store malformed tags (non-list, non-string elements, oversized) or non-scalar metadata values will fail to revalidate — same trade-off as the prior version, on the legacy locations rather than a new one. Worth running a count against prod data before this lands.Test plan
pytest aleph_message/tests/test_models.py(offline) — 63 passed, 1 skipped, including:content.tagsshape validation, count/length boundaries, non-list and non-string-entry rejection, top-leveltagsrejection, non-dict content unaffectedtags), key/value length boundaries, top-leveltagsrejection