feat(schema): provider data-policy metadata fields #5

Merged
OriginalGary merged 2 commits into main from feat/provider-metadata
May 5, 2026


OriginalGary commented May 5, 2026

Summary

Adds five data-policy fields to ProviderSchema and audits them across all 152 providers, in two commits:

  1. Initial schema + conservative audit — Zod invariants, 13 providers classified, 139 TODO
  2. Priority-1 audit resolution — 10 more providers verified, tier-dependent pattern documented

Schema touchpoints

Single schema layer extended: src/shared/validation/providerSchema.ts → src/shared/constants/providers.ts. The main RegistryEntry in providerRegistry.ts (operational call config) is unchanged; sensitivity routing will join on provider ID.

Invariants

| # | Rule | Error message shape |
| --- | --- | --- |
| 1 | local=true → trains_on_data=false | Provider "id": invariant violated — local=true requires trains_on_data=false |
| 2 | e2ee=true → trains_on_data=false | Provider "id": invariant violated — e2ee=true requires trains_on_data=false |
| 3 | data_residency="multi" → retention_days not null | Provider "id": invariant violated — data_residency="multi" requires retention_days to be specified |
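The three invariants can be sketched as a plain cross-field check. The PR enforces them through Zod at module load time; the dependency-free version below only illustrates the logic. The field names and message shapes come from the table above, while `ProviderPolicy`, `checkInvariants`, and `validateAll` are hypothetical names, not the PR's actual identifiers.

```typescript
// Field names follow the PR; the interface and function names are hypothetical.
interface ProviderPolicy {
  id: string;
  trains_on_data: boolean;
  data_residency: string; // e.g. "US", "SG", "multi", "local", "unknown"
  retention_days: number | null;
  local: boolean;
  e2ee: boolean;
}

// Returns one message per violated invariant, in the order of the table.
function checkInvariants(p: ProviderPolicy): string[] {
  // \u2014 is the dash used in the PR's error message shape.
  const prefix = `Provider "${p.id}": invariant violated \u2014 `;
  const errors: string[] = [];
  if (p.local && p.trains_on_data) {
    errors.push(prefix + "local=true requires trains_on_data=false");
  }
  if (p.e2ee && p.trains_on_data) {
    errors.push(prefix + "e2ee=true requires trains_on_data=false");
  }
  if (p.data_residency === "multi" && p.retention_days === null) {
    errors.push(prefix + 'data_residency="multi" requires retention_days to be specified');
  }
  return errors;
}

// Throws on the first provider with any violation (fail fast at load time).
function validateAll(providers: ProviderPolicy[]): void {
  for (const p of providers) {
    const errors = checkInvariants(p);
    if (errors.length > 0) {
      throw new Error(errors.join("; "));
    }
  }
}
```

Note that `retention_days: 0` (as used for bedrock) satisfies invariant 3, because the check is against `null`, not against falsy values.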

Audit results

Confidently classified: 23 providers (up from 13 in commit 1)

| Provider | trains_on_data | data_residency | retention_days | local |
| --- | --- | --- | --- | --- |
| anthropic | false | US | 30 | false |
| openai | false | US | 30 | false |
| glm | false | SG | null | false |
| glm-cn | false | CN | null | false |
| glmt | false | SG | null | false |
| azure-openai | false | multi | 30 | false |
| azure-ai | false | multi | 30 | false |
| bedrock | false | multi | 0 | false |
| vertex | false | multi | 30 | false |
| vertex-partner | false | multi | 30 | false |
| lm-studio … comfyui (10 local) | false | local | 0 | true |
| searxng-search | false | local | 0 | true |

glmt classification: glmt is NOT a separate aggregator — it is a preset variant (thinking mode + higher token budget + longer timeout) on the same Z.ai API endpoint (https://api.z.ai/api/anthropic/v1/messages) as glm. Registry entry confirms baseUrl is identical. Same Z.ai Additional Terms §3.b apply → same SG residency and trains_on_data=false as glm.

Tier-dependent providers (3): gemini, codex, github — inline comments upgraded from TODO to Verified, but conservative default (trains_on_data=true) is preserved. Override mechanism deferred to workstream 4.

Anthropic ZDR: confirmed e2ee=false. ZDR is a contractual guarantee, not an end-to-end-encrypted architecture (contractual ≠ architectural).

gemini.trains_on_data confirmed true (assertion in test suite prevents regression).

TODO(sam): ~129 providers remain.
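Unverified providers keep the conservative defaults named in the first commit message (trains_on_data=true, data_residency="unknown", retention_days=null), and verified values are applied on top. A minimal sketch of that layering follows. `conservativeDefault` and `withVerified` are hypothetical helpers, and the `local=false` / `e2ee=false` defaults are an assumption, since the PR text does not state them.

```typescript
interface ProviderPolicy {
  id: string;
  trains_on_data: boolean;
  data_residency: string;
  retention_days: number | null;
  local: boolean;
  e2ee: boolean;
}

// Conservative defaults for a provider that has not been audited yet.
function conservativeDefault(id: string): ProviderPolicy {
  return {
    id,
    trains_on_data: true,
    data_residency: "unknown",
    retention_days: null,
    local: false, // assumption: not stated in the PR
    e2ee: false, // assumption: not stated in the PR
  };
}

// Overlays only the fields that were actually verified against a policy source.
function withVerified(
  id: string,
  verified: Partial<Omit<ProviderPolicy, "id">>,
): ProviderPolicy {
  return { ...conservativeDefault(id), ...verified };
}

// Tier-dependent example from the second commit: github gets a verified
// residency and retention, but trains_on_data stays at the conservative default.
const github = withVerified("github", { data_residency: "US", retention_days: 28 });
```

Keeping the default and the overlay separate makes it visible which fields were verified and which are still placeholders.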

Sam-verify priority — remaining

  1. Aggregators (openrouter, laozhang, etc.) — need contractual guarantee from aggregator, not upstream
  2. Remaining OAuth providers (cursor, gitlab-duo, kimi-coding, claude consumer OAuth, etc.)
  3. Enterprise cloud (watsonx, oci, sap, databricks, etc.)
  4. Niche/specialized — all remaining APIKEY, WEB_COOKIE, AUDIO, SEARCH

Tier eligibility summary

| Tier | Count | Providers |
| --- | --- | --- |
| tier-1 (all) | 152 | all |
| tier-2 (trains_on_data=false) | 23 | anthropic, openai, glm, glm-cn, glmt, azure-openai, azure-ai, bedrock, vertex, vertex-partner, 11 local + searxng-search |
| tier-3 (local=true) | 12 | 11 LOCAL_PROVIDERS + searxng-search |
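The tier predicates in the table can be read as a small function over the policy flags. This is a sketch under the assumption that eligibility depends only on `trains_on_data` and `local`, as the table's parenthesised conditions suggest; `eligibleTiers` and the `Tier` type are hypothetical names.

```typescript
type Tier = "tier-1" | "tier-2" | "tier-3";

interface PolicyFlags {
  trains_on_data: boolean;
  local: boolean;
}

// tier-1: every provider; tier-2: trains_on_data=false; tier-3: local=true.
function eligibleTiers(p: PolicyFlags): Tier[] {
  const tiers: Tier[] = ["tier-1"];
  if (!p.trains_on_data) tiers.push("tier-2");
  if (p.local) tiers.push("tier-3");
  return tiers;
}
```

Because invariant 1 forces `trains_on_data=false` whenever `local=true`, any valid tier-3 provider is also tier-2 eligible, so the tiers nest as the counts in the table imply.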

Surprises for sensitivity-routing prompt

  • AGGREGATOR_PROVIDER_IDS Set (17 entries) and SELF_HOSTED_CHAT_PROVIDER_IDS Set (8 entries) already exist in providers.ts — sensitivity routing can use these for fast-path decisions
  • Two schema layers (providers.ts = policy, providerRegistry.ts = call config) — routing joins on provider ID
  • sdwebui and comfyui are local=true (tier-3) but are NOT in SELF_HOSTED_CHAT_PROVIDER_IDS (image-only)
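Since policy and call config live in separate layers, routing has to join them on provider ID. One way that join might look is sketched below. `PolicyEntry`, `CallConfig`, and `joinOnProviderId` are hypothetical names with deliberately reduced shapes; only `baseUrl` and the policy field names are taken from the PR text.

```typescript
// Hypothetical minimal shapes; the real entries carry more fields.
interface PolicyEntry {
  id: string;
  trains_on_data: boolean;
  local: boolean;
}

interface CallConfig {
  id: string;
  baseUrl: string;
}

interface RoutableProvider {
  policy: PolicyEntry;
  config: CallConfig;
}

// Inner join on provider ID. IDs found in only one layer are reported
// in `unmatched` instead of being dropped silently.
function joinOnProviderId(
  policies: PolicyEntry[],
  registry: CallConfig[],
): { joined: Record<string, RoutableProvider>; unmatched: string[] } {
  const configById: Record<string, CallConfig> = {};
  for (const c of registry) configById[c.id] = c;

  const joined: Record<string, RoutableProvider> = {};
  const unmatched: string[] = [];
  for (const policy of policies) {
    const config = configById[policy.id];
    if (config === undefined) {
      unmatched.push(policy.id); // policy entry without call config
    } else {
      joined[policy.id] = { policy, config };
      delete configById[policy.id];
    }
  }
  // Whatever remains has call config but no policy entry.
  for (const id of Object.keys(configById)) unmatched.push(id);
  return { joined, unmatched };
}
```

Reporting unmatched IDs is a design choice for this sketch: a provider that exists in one layer but not the other is more likely a registry bug than something routing should quietly skip.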

Tests

32 assertions in tests/unit/provider-metadata-schema.test.ts:

  • Schema field acceptance, data_residency format validation
  • All 3 invariant violations with named-provider error messages
  • Regression guards: gemini.trains_on_data=true, bedrock.retention_days=0, azure-openai invariant-3 path
  • glm/glm-cn/glmt, vertex/vertex-partner, github/codex tier-dependent fields
  • Full registry load guard + all-fields-present sweep

🤖 Generated with Claude Code

Adds trains_on_data, data_residency, retention_days, local, and e2ee to
ProviderSchema (src/shared/validation/providerSchema.ts). All five fields
are required and Zod-validated at module load time with three cross-field
invariants enforced with named-provider error messages:

  1. local=true → trains_on_data must be false
  2. e2ee=true → trains_on_data must be false
  3. data_residency="multi" → retention_days must not be null

Provider audit: 13 providers confidently classified (anthropic, openai,
and all 11 self-hosted local providers including searxng-search). The
remaining 139 providers are assigned conservative defaults
(trains_on_data=true, data_residency="unknown", retention_days=null) with
per-provider TODO(sam) comments pointing to their policy URL. Anthropic
ZDR is explicitly marked e2ee=false (contractual ≠ architectural).

22 new tests in provider-metadata-schema.test.ts cover: field acceptance,
data_residency format validation, all three invariant violations with
named-provider error messages, local provider bulk assertions, the
Anthropic ZDR e2ee=false case, full registry load regression guard, and
validateProviders throw-on-violation paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions Bot commented May 5, 2026

CI Coverage Report

  • Coverage job: success
  • PR test policy: failure

Coverage artifact was not available for this run.

PR Test Policy

This PR changes production code in src/, open-sse/, electron/, or bin/ without accompanying automated tests.

… bedrock, vertex, tier-dependent)

Applies verified data-policy values to 10 providers previously at
conservative TODO defaults:

  glm, glmt      — Z.ai international, trains_on_data=false, data_residency=SG
  glm-cn         — Z.ai China endpoint, trains_on_data=false, data_residency=CN
  azure-openai   — trains_on_data=false, data_residency=multi, retention_days=30
  azure-ai       — same Azure AOAI data-privacy policy as azure-openai
  bedrock        — trains_on_data=false, data_residency=multi, retention_days=0
                   (zero-persistence architecture, not just contractual)
  vertex         — trains_on_data=false, data_residency=multi, retention_days=30
  vertex-partner — same Vertex AI DPA covers partner models in Model Garden

Tier-dependent providers (gemini, codex, github) keep trains_on_data=true
but have their inline comments upgraded from TODO to Verified, with policy
source URLs and an explanation that the conservative default applies because
Graze cannot determine subscription tier from an API key or OAuth token.
github gets retention_days=28 and data_residency=US from its published policy.

Adds §5a Tier-dependent providers to GRAZE.md documenting the pattern,
Graze's stance, and the deferred override mechanism.

10 new test assertions: gemini.trains_on_data=true regression guard,
bedrock.retention_days=0, azure-openai invariant-3 path, glm/glm-cn/glmt
residency, vertex/vertex-partner, github tier-dependent fields, codex.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OriginalGary merged commit c77b999 into main May 5, 2026
3 of 4 checks passed
OriginalGary deleted the feat/provider-metadata branch May 5, 2026 19:30