Skip to content

feat(sync): centralize parser-output validation before persistence#802

Merged
wesm merged 15 commits into
kenn-io:mainfrom
mjacobs:feat/central-parser-output-validation
Jun 24, 2026
Merged

feat(sync): centralize parser-output validation before persistence#802
wesm merged 15 commits into
kenn-io:mainfrom
mjacobs:feat/central-parser-output-validation

Conversation

@mjacobs

@mjacobs mjacobs commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Several parsers extract fields by walking undocumented payloads — the Antigravity protobuf most of all. Bytes pulled out that way can pass utf8.Valid yet not be text, and each incident so far earned its own narrow guard (isPlausibleModelName, maxPlausibleTokens, isNoisyAntigravityStepString, …). The NUL-in-a-model-name case (#647) bit hardest: it broke pg push and recurred silently many times before the exit code was noticed. This replaces the scattered per-site guards with one validation/sanitization pass at the sync→persistence seam, so the next bad field — in a format nobody has written a guard for yet — is handled by default.

What it does

internal/sync/validate.go adds validateAndSanitize over db.Message/db.UsageEvent/db.Session, wired into prepareSessionWrite, toDBUsageEvents, and the incremental-append path so every write flows through it. The policy is sanitize/clamp, never drop, so no legitimate row is lost for an agent format we haven't characterized:

  • roles outside {user, assistant, system, tool} → empty (new parser.ValidRole)
  • per-rune control-character strip (C0/C1/DEL except \n \t \r) — C1 controls are valid two-byte UTF-8 and previously survived utf8.Valid
  • model name clamped to 128 bytes on a rune boundary
  • token counts bounded to [0, 2_000_000]
  • timestamps outside 2000–2100 blanked

Idempotency and backend parity

The per-rune strip extends the shared db.SanitizeUTF8, already applied symmetrically by the local fingerprint builders and the Postgres push/readback path, so re-sanitizing is a no-op and fingerprints don't drift (no spurious re-push). ContentLength is recomputed after the strip, and visualStudioCopilotMessageHasArchiveUpdate sanitizes parsed content before comparison to avoid a per-sync rewrite of control-char content. The pass runs before rows diverge into the SQLite/PostgreSQL/DuckDB paths.

Tradeoffs and limitations

  • One-time effect: pre-existing sessions whose stored values contain C1/control bytes are detected as changed and rewritten once on the next sync, then stabilize.
  • Message.TokenUsage and the nested ToolCalls/ToolResults strings are intentionally left to the symmetric fingerprint/pg-push paths, not the central pass — documented in validate.go as a follow-up.
  • validateAndSanitize returns per-category fix counts that callers currently discard — a seam for a future parser-health surface.

Reviewers: internal/sync/validate.go, the seam wiring in internal/sync/engine.go, and the db.SanitizeUTF8 extension.

@roborev-ci

roborev-ci Bot commented Jun 22, 2026

Copy link
Copy Markdown

roborev: Combined Review (1006c85)

Sanitization has medium-risk correctness regressions that can skew stored metrics and aggregates.

Medium

  • internal/sync/validate.go:142
    sanitizeMessage now overwrites every ContentLength with len(m.Content), even when no sanitization occurred. Several parsers intentionally set ContentLength to include thinking/reasoning text or source-length semantics, so this changes stored metrics and archive/fingerprint behavior for normal messages, not just sanitized ones.
    Fix: Preserve the parser’s length semantics and only subtract bytes removed by sanitization, or recompute from the same fields the parser used with tests for content+thinking cases.

  • internal/sync/engine.go:6905
    Token counts are clamped in messages and usage events after session aggregate totals have already been computed, and sanitizeSession leaves TotalOutputTokens / PeakContextTokens untouched. A corrupt per-message or usage-event token value can therefore be removed from child rows but still persist in the session totals shown by list/analytics views.
    Fix: Sanitize token-bearing rows before aggregate totals are computed, or recompute session totals after validation by summing the sanitized child rows while allowing the final session total to exceed the per-row cap.


Panel: ci_default_security | Synthesis: codex, 20s | Members: codex_default (codex/default, done, 4m55s), codex_security (codex/security, done, 1m45s) | Total: 7m0s

@roborev-ci

roborev-ci Bot commented Jun 22, 2026

Copy link
Copy Markdown

roborev: Combined Review (1ee9717)

Summary verdict: two medium correctness issues remain; no security findings were reported.

Medium

  • internal/sync/validate.go:142: sanitizeMessage overwrites every ContentLength with len(m.Content). Some parsers intentionally use semantic lengths that differ from display content length, such as thinking-inclusive lengths or tool-only messages with empty display content but nonzero work/result length. This can corrupt stored message length metrics and fingerprints.

    • Fix: Preserve parser-provided semantic length and adjust only by the sanitization delta for fields contributing to that length, or recompute only when the original length matched the raw content length.
  • internal/sync/engine.go:6905: Message and usage-event token counts are clamped after session aggregate token totals have already been derived, while sanitizeSession leaves TotalOutputTokens and PeakContextTokens unchanged. Implausible per-message or usage-event token values can still persist in session-level totals even though the underlying row was clamped.

    • Fix: After token clamping, recompute session aggregates from sanitized per-message rows/events when those aggregates are message/event-derived, while preserving truly aggregate-only provider totals.

Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 5m30s), codex_security (codex/security, done, 22s) | Total: 6m1s

@roborev-ci

roborev-ci Bot commented Jun 22, 2026

Copy link
Copy Markdown

roborev: Combined Review (d20cc7f)

Medium findings remain: parser-defined content_length semantics are being overwritten, and incremental session metadata still bypasses the new validation path.

Medium

  • Location: internal/sync/validate.go:142
    Problem: sanitizeMessage now overwrites every parser-provided ContentLength with len(m.Content), but some parsers intentionally use a different contract, such as content plus reasoning/thinking text or content without display wrapper markers. This can corrupt stored content_length and change fingerprints/diff behavior even when no control bytes were stripped.
    Fix: Preserve parser length semantics and adjust only for bytes removed by sanitization, or move sanitized length computation into parser-specific conversion where the intended contract is known. Add coverage for thinking/reasoning messages from affected parsers.

  • Location: internal/sync/engine.go:7637
    Problem: The incremental path validates only appended message rows; session fields written by UpdateSessionIncremental still bypass the new validation policy. An implausible appended timestamp can be persisted as ended_at, and token aggregates computed before message clamping can store raw overlarge values, creating different results from a full sync of the same file.
    Fix: Run incremental session update values through the same timestamp validation and recompute token aggregates from sanitized message rows before calling UpdateSessionIncremental. Add an incremental-sync test for out-of-window timestamps and overlarge token counts.


Panel: ci_default_security | Synthesis: codex, 10s | Members: codex_default (codex/default, done, 5m13s), codex_security (codex/security, done, 1m51s) | Total: 7m14s

@roborev-ci

roborev-ci Bot commented Jun 22, 2026

Copy link
Copy Markdown

roborev: Combined Review (c7879ce)

Findings require changes before merge.

Medium

  • internal/sync/engine.go:7390: visualStudioCopilotMessageLooksIncomplete compares raw parsed ContentLength against stored rows that are now sanitized and length-adjusted. A truncated reparse containing stripped control bytes can look long enough and bypass archive preservation, allowing incomplete content to replace the archived transcript. Compare using the sanitized parsed content length, or sanitize a copy of parsed messages before archive reconciliation.

Panel: ci_default_security | Synthesis: codex, 5s | Members: codex_default (codex/default, done, 6m55s), codex_security (codex/security, done, 1m42s) | Total: 8m42s

@roborev-ci

roborev-ci Bot commented Jun 22, 2026

Copy link
Copy Markdown

roborev: Combined Review (d1b6e05)

Medium confidence with two Medium findings to address before merge.

Medium

  • internal/sync/engine.go:8082 - The OpenCode archive preservation check still compares raw parsed ContentLength before the new sanitizer runs. A truncated parse padded with stripped control bytes can look long enough here, skip archive preservation, then be sanitized and written as shorter content, overwriting the archived complete row. Apply the same sanitized parsed-message comparison used for VS Copilot in openCodeMessageLooksIncomplete, or run validation before archive preservation checks.

  • internal/db/db.go:231 - The persisted message/session shape changed, but dataVersion remains 48. Existing rows marked current can skip full replacement, so old unsanitized content, timestamps, roles, and token counts remain in place while new code assumes stored rows have passed the sanitizer. Bump dataVersion and update the CurrentDataVersion test so existing databases get a full resync.


Panel: ci_default_security | Synthesis: codex, 11s | Members: codex_default (codex/default, done, 5m50s), codex_security (codex/security, done, 23s) | Total: 6m24s

@mjacobs mjacobs force-pushed the feat/central-parser-output-validation branch from d1b6e05 to 0cf101e Compare June 22, 2026 19:46
@roborev-ci

roborev-ci Bot commented Jun 22, 2026

Copy link
Copy Markdown

roborev: Combined Review (443f38a)

Summary verdict: one medium correctness issue remains; no security issues were found.

Medium

  • internal/sync/validate.go:43: Message.TokenUsage is excluded from validation, but usage and cost aggregation still read token counts from the raw token_usage JSON instead of the clamped ContextTokens/OutputTokens columns. A corrupt token_usage.output_tokens value can still inflate usage dashboards and session cost despite the new clamp.
    • Fix: Clamp or normalize numeric fields inside TokenUsage, or update usage aggregation to use the clamped columns consistently. Add a regression test with an over-bound token_usage JSON value.

Panel: ci_default_security | Synthesis: codex, 7s | Members: codex_default (codex/default, done, 6m14s), codex_security (codex/security, done, 24s) | Total: 6m45s

mjacobs and others added 5 commits June 24, 2026 11:40
Add a single validation and sanitization pass over db-layer rows before persistence so every parser flows through the same storage contract. The pass normalizes roles, strips unsafe control runes, bounds model identifiers and per-row token counts, and blanks implausible timestamps without dropping otherwise valid rows.

The follow-up fixes keep derived session totals and usage reporting consistent with those sanitized rows across message-derived totals, usage-event-derived totals, SQLite, PostgreSQL, and DuckDB aggregation paths. They also preserve fingerprint stability by keeping raw JSON bytes while clamping every token count when aggregations read them.

Follow-up commits included:
- fix(sync): recompute content length after sanitize and correct validation comments
- test(sync): escape control runes and use new() in validation tests
- fix(sync): avoid double-pointer deref in timestamp blanking for nilaway
- fix(sync): delta-adjust content length and validate incremental ended_at
- fix(sync): compare sanitized parsed content in VS Copilot archive reconcile
- test(sync): set ordinal on incremental append tests for WriteSessionIncremental
- fix(sync): re-derive message-derived session token totals from sanitized rows
- fix(sync): re-derive event-derived session token totals from sanitized rows
- fix(sync): include cache tokens in event-derived peak context reconciliation
- fix(sync): share one event-token rollup between parser and reconciliation
- fix(usage): clamp raw token usage in aggregations

Co-authored-by: Wes McKinney <wesmckinn+git@gmail.com>
Session-summary usage events mirror parser-level aggregate totals rather than per-turn rows, so using them to detect event-derived aggregates can clamp valid long-session totals. The reconciliation now excludes those summary events while retaining the per-turn event clamp path.\n\nThe same validation pass now accounts for stripped thinking text when parser semantics include it in content_length, and cache economics reuses the shared clamped token_usage counter so stats cannot be inflated by corrupt raw JSON values.
Central timestamp sanitization blanks implausible message timestamps to an empty string. Bounded SQLite usage queries previously treated that value as a real message timestamp, so the row could miss the message-time bounds and never reach the session-start fallback branch.\n\nTreat empty message timestamps as missing with NULLIF throughout the SQLite usage row selection and bounded branch predicates so sanitized rows follow the same fallback behavior as NULL timestamps.
Session-sourced usage events represent aggregate summary totals for agents such as Hermes, Vibe, and Zed. Those rows should still floor corrupt negative token counts, but the per-row plausibility cap is too small for legitimate long-session summary totals and caused daily/session usage reporting to disagree with the preserved session aggregate fields.\n\nKeep bounded SQLite message scans on the raw timestamp column so the timestamp index remains usable, while retaining NULLIF handling only for blank timestamp fallback behavior.
PostgreSQL usage aggregation needs the same source-aware token handling as SQLite. Session-sourced usage events represent aggregate summary totals for Hermes, Vibe, Zed, and similar agents, so valid long-session totals can exceed the per-row parser clamp while still needing negative values floored.\n\nMirror the SQLite read-side token rules in PostgreSQL daily/session usage so pg serve and pg-backed reports do not cap valid summary totals after push sync.
@wesm wesm force-pushed the feat/central-parser-output-validation branch from 443f38a to f84294c Compare June 24, 2026 17:16
@roborev-ci

roborev-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

roborev: Combined Review (f84294c)

Summary verdict: one medium-severity backend parity issue should be fixed before merge.

Medium

  • internal/duckdb/analytics_usage.go:2287, internal/duckdb/activityreport.go:366
    • DuckDB clamps every non-message usage_events row to MaxPlausibleTokens, including source = 'session' summary events. SQLite and PostgreSQL preserve upper bounds for these session-summary rows and only floor negatives, so DuckDB daily usage, top-session cost, session cost, and activity reports can undercount agents like Hermes/Vibe/Zed when summary totals exceed 2M.
    • Suggested fix: add a WHEN source = 'session' THEN GREATEST(<token>, 0) branch for each usage-event token field in the DuckDB normalization SQL, plus a DuckDB coverage test for a source='session' event above db.MaxPlausibleTokens.

Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 5m0s), codex_security (codex/security, done, 2m57s) | Total: 8m5s

DuckDB usage aggregation needs the same source-aware token handling as SQLite and PostgreSQL. Session-sourced usage events are aggregate summary totals for long-running agents, so capping them at the per-row parser limit undercounts valid usage and cost after DuckDB sync.

Keep ordinary usage events and message-derived JSON bounded, but allow session summary rows to exceed the cap while still flooring negative values so DuckDB reports stay behaviorally aligned with the other backends.
@roborev-ci

roborev-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

roborev: Combined Review (1c24565)

Summary verdict: two medium issues remain; no high or critical findings were reported.

Medium

  • internal/db/usage.go:282: COALESCE(NULLIF(m.timestamp, ''), s.started_at) can return SQL NULL when a message has a blank timestamp and its session has no started_at. scanUsageRow and scanDailyUsageRow scan ts into a string, so session usage and unbounded usage queries can fail with a NULL scan error.
    Fix: Make the usage ts expressions non-null, for example COALESCE(NULLIF(m.timestamp, ''), s.started_at, ''), or scan into sql.NullString.

  • internal/sync/validate.go:109: The new validation pass is private to internal/sync and only called by sync engine paths. Uploaded parsed sessions still go through sessionBatchWriteFromParsed and WriteSessionBatchAtomic without this sanitization, so control bytes, overlong models, and inflated token fields can still persist through the upload API.
    Fix: Move/export the shared sanitizer or apply it at the DB batch-write boundary, and route the upload conversion through the same token aggregate reconciliation.


Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 8m0s), codex_security (codex/security, done, 3m27s) | Total: 11m36s

wesm added 3 commits June 24, 2026 12:57
The upload API writes parsed sessions through the DB batch path without going through the sync engine. That left local uploads able to persist unsafe control bytes, overlong model names, and per-row token counts that the sync path would clamp before storage.

Move the sanitizer to the DB write boundary and keep sync as a thin caller so upload and sync writes share the same storage contract. Row-derived session token totals are re-derived after clamping, while session-summary totals remain preserved. Usage row timestamps now coalesce to an empty string so blank message timestamps with missing session starts cannot surface NULL scan errors.
DuckDB usage reporting scans the derived local_date string, so rows with no message/event timestamp and no session start need the same non-null read contract as SQLite. Without this, uploaded blank-start sessions can still fail in DuckDB read mode even though local SQLite usage no longer does.

Keep the raw DuckDB timestamp column typed as TIMESTAMP and coalesce only the formatted local date, preserving date filters while avoiding NULL string scans.
DuckDB usage rows with missing timestamps need a non-null fallback dedup key as well as a non-null local date. Otherwise unrelated blank-timestamp rows without stronger source identifiers can collapse into one partition and undercount usage after sync.

Coalesce the timestamp string only inside the fallback key so ordinary timestamp typing and stronger Claude/source/dedup identifiers remain unchanged.
@roborev-ci

roborev-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

roborev: Combined Review (e5a6493)

No issues found.


Panel: ci_default_security | Synthesis: codex | Members: codex_default (codex/default, done, 13m9s), codex_security (codex/security, done, 35s) | Total: 13m44s

@roborev-ci

roborev-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

roborev: Combined Review (11f8dbc)

Medium findings remain; no Critical or High issues were reported.

Medium

  • internal/importer/importer.go:229 and internal/importer/importer.go:366
    Imported parser output still persists through UpsertSession + ReplaceSessionMessages, which do not call the new ValidateAndSanitize path. That leaves import flows able to store control/NUL bytes, overlong models, and implausible timestamps despite the new centralized validation.
    Fix: Route importer writes through WriteSessionBatchAtomic, or add sanitization at the lower-level DB write boundary used by UpsertSession / message replacement APIs.

  • internal/db/usage.go:905
    parseUsageTokenInt accumulates into int before clamping, so an extremely large token literal can overflow and then be clamped as a wrapped negative or small value instead of MaxPlausibleTokens.
    Fix: Parse token integers with saturation or explicit overflow handling while still consuming the remaining digits; add a test above math.MaxInt64.


Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 7m34s), codex_security (codex/security, done, 24s) | Total: 8m6s

CI showed WebKit resolving the target controls but timing out while waiting for Playwright's stability check on the first WebKit interactions. The tests already assert visibility before the action and verify the post-click URL or active session state, so forcing these clicks keeps the behavioral coverage while avoiding the browser-specific actionability stall.
@roborev-ci

roborev-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

roborev: Combined Review (61d5da9)

Summary verdict: One medium issue remains; no critical or high findings were reported.

Medium

  • internal/importer/importer.go:229: The new sanitizer is wired into sync and batch writes, but import still persists parser-derived rows through UpsertSession and ReplaceSessionMessages directly. Imported sessions can still retain control bytes, overlong model strings, and implausible timestamps instead of getting the new centralized validation.
    • Fix: Route imports through WriteSessionBatchAtomic, or apply ValidateAndSanitize inside the lower-level UpsertSession / ReplaceSessionMessages paths so all parser-derived writes share the same cleanup.

Panel: ci_default_security | Synthesis: codex, 6s | Members: codex_default (codex/default, done, 6m43s), codex_security (codex/security, done, 1m32s) | Total: 8m21s

wesm added 3 commits June 24, 2026 14:09
GitHub's WebKit workers can spend most of the global 20s per-test budget before the first project tests finish, while the same scenarios complete quickly once the workers are warm. The affected tests still assert the same behavior; they just get enough budget for the cold WebKit startup path that CI is exercising.
Imports and some legacy paths write parser-derived rows through UpsertSession and ReplaceSessionMessages instead of the atomic batch writer. Keeping sanitization only on batch writes left those paths able to persist the same unsafe values the central validation pass was meant to normalize.

Apply the idempotent sanitizer at the lower DB write boundary and cover the import path with a regression that exercises control bytes, implausible timestamps, and overlong model strings. Move the WebKit E2E timeout to describe configuration so it applies before fixture and beforeEach work begins.
ChatGPT re-imports refresh session_name through a targeted metadata update instead of UpsertSession so they can preserve existing session fields. That path still receives parser-derived titles and needs the same text sanitization as full session writes.

Sanitize a copied session_name before the targeted update so nil and caller-owned pointer semantics stay intact while imported metadata cannot retain control bytes.
@roborev-ci

roborev-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

roborev: Combined Review (c09af0b)

Summary verdict: Two medium correctness issues remain; no high or critical findings were reported.

Medium

  • internal/db/usage.go:843: parseUsageTokenCounters is not JSON-aware. It resumes scanning from the closing quote of quoted numeric values, so valid token usage such as Shelley’s {"input_tokens":"-5","cache_read_input_tokens":"100","output_tokens":"42"} can drop later counters. It also scans nested/string keys instead of only top-level keys, diverging from PostgreSQL/DuckDB extraction.

    • Fix: Use a JSON-aware top-level scanner that handles quoted numbers and escaping, or restore top-level gjson.Get(...) extraction with clamping for complete JSON and a separate tolerant path for truncated rows.
  • internal/db/validate.go:192: Model IDs are clamped before UTF-8/control sanitization in both message and usage-event sanitizers. A long dirty prefix such as NUL/control bytes followed by a valid model can be truncated to only removable bytes, then sanitized to an empty or partial model, causing usage rows to be excluded from pricing.

    • Fix: Sanitize the model string first, then clamp the cleaned value to MaxModelLen; add coverage for long control-prefixed model values.

Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 9m1s), codex_security (codex/security, done, 2m53s) | Total: 12m3s

wesm added 2 commits June 24, 2026 14:56
Message token_usage can contain quoted numeric fields and nested metadata, but the fallback counter scanner was not respecting JSON structure. That could drop later counters or count nested and string keys differently from PostgreSQL and DuckDB.\n\nParser-derived model IDs also need cleanup before applying the length cap so long dirty prefixes do not erase the valid pricing model that follows.
SQLite's tolerant token_usage scanner still needs JSON string semantics when it skips ignored metadata or reads quoted numeric counters. Go literal decoding rejects valid JSON escapes such as escaped slash, which could stop scanning before later counters and diverge from the PostgreSQL and DuckDB extraction paths.\n\nUse JSON-compatible string decoding so valid escaped metadata does not prevent later top-level usage counters from contributing.
@roborev-ci

roborev-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

roborev: Combined Review (c2f1cf1)

Summary verdict: No medium, high, or critical findings were reported.

The only reported issue was low severity and is omitted per instructions.


Panel: ci_default_security | Synthesis: codex, 5s | Members: codex_default (codex/default, done, 11m13s), codex_security (codex/security, done, 2m22s) | Total: 13m40s

@wesm wesm merged commit 800c19b into kenn-io:main Jun 24, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants