Skip to content

Tighten tag storage per content type#156

Open
odesenfans wants to merge 3 commits into
mainfrom
od/tags
Open

Tighten tag storage per content type#156
odesenfans wants to merge 3 commits into
mainfrom
od/tags

Conversation

@odesenfans
Copy link
Copy Markdown
Contributor

@odesenfans odesenfans commented May 5, 2026

Summary

Replaces the unified BaseContent.tags field (previous version of this PR) with per-type validation that tightens tag storage where it already lives, without introducing a parallel canonical location.

  • Stores: tags: Optional[List[Tag]] is now a typed top-level field on StoreContent (this is the only type where the canonical location is at the content root).
  • Posts and aggregates: a field validator on the inner content dict ensures that, when a tags key is present, it matches the canonical Optional[List[Tag]] shape with the usual 1–64 char/16 entry caps. The legacy nested content.content.tags location stays canonical.
  • Instances and programs: BaseExecutableContent.metadata is tightened from Dict[str, Any] to scalars (str, int, float, bool, null) plus an optional tags key holding a List[Tag]. The legacy content.metadata.tags location stays canonical.

Adds MAX_METADATA_KEY_LENGTH = 64 and MAX_METADATA_VALUE_LENGTH = 512 so metadata is bounded by bytes, not just by entry count.

BaseContent.tags and the Message.tags() accessor introduced earlier in this branch are removed.

Why this approach

The earlier design added tags as a typed field on BaseContent and provided a unified accessor that fell back to legacy nested locations for posts/aggregates. That gave us a second valid place to store tags on every message type without removing the existing ones — codifying two parallel conventions per type indefinitely. Tightening the legacy locations instead keeps a single canonical path per type and defers the "unify the schema" question to a future message format revision.

Hash compatibility

This is a strict-validation tightening, not a structural change. Existing messages whose tag lists already match List[Tag] and whose metadata values are scalars are unaffected. Messages that today store malformed tags (non-list, non-string elements, oversized) or non-scalar metadata values will fail to revalidate — same trade-off as the prior version, on the legacy locations rather than a new one. Worth running a count against prod data before this lands.

Test plan

  • pytest aleph_message/tests/test_models.py (offline) — 63 passed, 1 skipped, including:
    • StoreContent.tags: count/length/empty-string boundaries and serialization defaults
    • PostContent / AggregateContent: nested content.tags shape validation, count/length boundaries, non-list and non-string-entry rejection, top-level tags rejection, non-dict content unaffected
    • InstanceContent metadata: tags shape validation, scalar value enforcement, dict/list value rejection (outside tags), key/value length boundaries, top-level tags rejection
  • black, isort, ruff, mypy clean
  • Network-dependent tests against the testnet endpoint — left for CI

Promotes tags to a typed field on BaseContent (Optional[List[str]],
max 16 tags of 64 chars each) so every message type carries them
consistently. Adds a tags() method on BaseMessage that falls back to
the legacy content.content.tags location for POST and AGGREGATE
messages so existing data remains addressable.

The field defaults to None and is excluded from serialization when
unset, preserving item_hash compatibility for messages that don't
opt in.
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, well-designed change. The tags field on BaseContent is correctly typed with sensible constraints (16 max tags, 1-64 chars each), defaults to None for hash compatibility, and the tags() accessor on BaseMessage provides a useful unified interface with proper legacy fallback for POST/AGGREGATE and defensive filtering. Test coverage is thorough — boundary tests, serialization round-trip, canonical vs legacy precedence, empty-list semantics, malformed input handling, and the non-fallback for executable types are all covered.

Replace the unified `BaseContent.tags` field and `Message.tags()` accessor
with per-type validation that reflects how each message format already
stores tags in the wild:

- StoreContent gains `tags: Optional[List[Tag]]` as a typed top-level
  field (kept canonical for stores).
- PostContent and AggregateContent validate the optional `tags` key
  inside their nested user-content dict against the canonical Tag list
  shape (legacy location remains canonical for posts/aggregates).
- BaseExecutableContent tightens `metadata` from `Dict[str, Any]` to
  scalars (str, int, float, bool, null) plus an optional `tags` key
  that holds a Tag list (legacy location remains canonical for
  instances/programs).

Adds MAX_METADATA_KEY_LENGTH (64) and MAX_METADATA_VALUE_LENGTH (512)
to bound metadata values by bytes, not just by entry count.

The previous unified-field approach added a second valid location for
tags without removing the legacy ones, which would have left two
co-existing conventions per content type. This change keeps tags where
they already live and tightens their shape instead.
@odesenfans odesenfans changed the title Add tags field to BaseContent and Message.tags() accessor Tighten tag storage per content type May 5, 2026
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR cleanly tightens tag storage per content type without introducing a parallel canonical location. StoreContent gains a typed top-level tags field, Post/Aggregate content validates nested content.tags via a shared adapter, and BaseExecutableContent.metadata is tightened to scalars plus an optional tags key. The new constants (MAX_N_TAGS, MAX_TAG_LENGTH, MAX_METADATA_KEY_LENGTH, MAX_METADATA_VALUE_LENGTH) are well-placed. The isinstance(value, (bool, int, float)) check correctly orders bool before int. Tests are thorough with boundary cases covered. One minor suggestion below.

aleph_message/tests/test_models.py (line 1010): Consider adding a test for ProgramContent metadata validation as well (e.g., reusing _load_instance_fixture pattern with factory=ProgramMessage). ProgramContent inherits from BaseExecutableContent but isn't explicitly exercised — a fixture like machine.json could be used. Not a blocker since the shared validator is on the base class and InstanceContent already exercises it.

Production tags reach ~200 characters; the previous 64-char cap would
reject existing messages. Bump the limit to 256 to keep validation
tighter than 'no cap' while not rejecting historical content.
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is well-structured and correctly tightens tag storage per content type. The approach of keeping canonical tag locations per type (StoreContent.tags, Post/Aggregate content.tags, Instance/Program metadata.tags) rather than introducing a unified top-level field is sound. One genuine bug was found: the metadata validator for Instance/Program messages rejects {"tags": null} because the TypeAdapter was called unconditionally when the key was "tags", without a None guard — unlike the nested-tags validator for Posts/Aggregates which already handles this correctly. I've fixed this in aleph_message/models/execution/abstract.py:97-99 and added a matching test. All 68 tests pass now.

aleph_message/models/execution/abstract.py (line 97): BUG: The metadata validator calls _METADATA_TAGS_ADAPTER.validate_python(value) unconditionally when key == "tags", but the adapter is for List[Tag] (not Optional[List[Tag]]). This means {"tags": null} in metadata raises a ValidationError, while the equivalent for Posts/Aggregates correctly allows null (guarded at aleph_message/models/__init__.py:132). Fixed by adding a if value is not None guard. Added test_metadata_tags_null_allowed for parity with test_post_nested_tags_null_allowed.

aleph_message/models/__init__.py (line 124): NIT: The _NESTED_TAGS_ADAPTER and _METADATA_TAGS_ADAPTER are identical — consider defining it once in abstract.py and importing both places for DRY. Not a blocker.

aleph_message/models/__init__.py (line 132): NIT: The test_post_top_level_tags_rejected and test_instance_top_level_tags_rejected tests pass tags as a kwarg to the constructor, which triggers extra="forbid" on BaseContent. This is correct behavior, but the test could be more explicit about why tags is rejected (i.e., it's because extra="forbid" on BaseContent, not a custom check).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants