feat(sync): centralize parser-output validation before persistence#802
Conversation
roborev: Combined Review (
|
roborev: Combined Review (
|
roborev: Combined Review (
|
roborev: Combined Review (
|
roborev: Combined Review (
|
d1b6e05 to
0cf101e
Compare
roborev: Combined Review (
|
Add a single validation and sanitization pass over db-layer rows before persistence so every parser flows through the same storage contract. The pass normalizes roles, strips unsafe control runes, bounds model identifiers and per-row token counts, and blanks implausible timestamps without dropping otherwise valid rows. The follow-up fixes keep derived session totals and usage reporting consistent with those sanitized rows across message-derived totals, usage-event-derived totals, SQLite, PostgreSQL, and DuckDB aggregation paths. They also preserve fingerprint stability by keeping raw JSON bytes while clamping every token count when aggregations read them. Follow-up commits included: - fix(sync): recompute content length after sanitize and correct validation comments - test(sync): escape control runes and use new() in validation tests - fix(sync): avoid double-pointer deref in timestamp blanking for nilaway - fix(sync): delta-adjust content length and validate incremental ended_at - fix(sync): compare sanitized parsed content in VS Copilot archive reconcile - test(sync): set ordinal on incremental append tests for WriteSessionIncremental - fix(sync): re-derive message-derived session token totals from sanitized rows - fix(sync): re-derive event-derived session token totals from sanitized rows - fix(sync): include cache tokens in event-derived peak context reconciliation - fix(sync): share one event-token rollup between parser and reconciliation - fix(usage): clamp raw token usage in aggregations Co-authored-by: Wes McKinney <wesmckinn+git@gmail.com>
Session-summary usage events mirror parser-level aggregate totals rather than per-turn rows, so using them to detect event-derived aggregates can clamp valid long-session totals. The reconciliation now excludes those summary events while retaining the per-turn event clamp path.\n\nThe same validation pass now accounts for stripped thinking text when parser semantics include it in content_length, and cache economics reuses the shared clamped token_usage counter so stats cannot be inflated by corrupt raw JSON values.
Central timestamp sanitization blanks implausible message timestamps to an empty string. Bounded SQLite usage queries previously treated that value as a real message timestamp, so the row could miss the message-time bounds and never reach the session-start fallback branch.\n\nTreat empty message timestamps as missing with NULLIF throughout the SQLite usage row selection and bounded branch predicates so sanitized rows follow the same fallback behavior as NULL timestamps.
Session-sourced usage events represent aggregate summary totals for agents such as Hermes, Vibe, and Zed. Those rows should still floor corrupt negative token counts, but the per-row plausibility cap is too small for legitimate long-session summary totals and caused daily/session usage reporting to disagree with the preserved session aggregate fields.\n\nKeep bounded SQLite message scans on the raw timestamp column so the timestamp index remains usable, while retaining NULLIF handling only for blank timestamp fallback behavior.
PostgreSQL usage aggregation needs the same source-aware token handling as SQLite. Session-sourced usage events represent aggregate summary totals for Hermes, Vibe, Zed, and similar agents, so valid long-session totals can exceed the per-row parser clamp while still needing negative values floored.\n\nMirror the SQLite read-side token rules in PostgreSQL daily/session usage so pg serve and pg-backed reports do not cap valid summary totals after push sync.
443f38a to
f84294c
Compare
roborev: Combined Review (
|
DuckDB usage aggregation needs the same source-aware token handling as SQLite and PostgreSQL. Session-sourced usage events are aggregate summary totals for long-running agents, so capping them at the per-row parser limit undercounts valid usage and cost after DuckDB sync. Keep ordinary usage events and message-derived JSON bounded, but allow session summary rows to exceed the cap while still flooring negative values so DuckDB reports stay behaviorally aligned with the other backends.
roborev: Combined Review (
|
The upload API writes parsed sessions through the DB batch path without going through the sync engine. That left local uploads able to persist unsafe control bytes, overlong model names, and per-row token counts that the sync path would clamp before storage. Move the sanitizer to the DB write boundary and keep sync as a thin caller so upload and sync writes share the same storage contract. Row-derived session token totals are re-derived after clamping, while session-summary totals remain preserved. Usage row timestamps now coalesce to an empty string so blank message timestamps with missing session starts cannot surface NULL scan errors.
DuckDB usage reporting scans the derived local_date string, so rows with no message/event timestamp and no session start need the same non-null read contract as SQLite. Without this, uploaded blank-start sessions can still fail in DuckDB read mode even though local SQLite usage no longer does. Keep the raw DuckDB timestamp column typed as TIMESTAMP and coalesce only the formatted local date, preserving date filters while avoiding NULL string scans.
DuckDB usage rows with missing timestamps need a non-null fallback dedup key as well as a non-null local date. Otherwise unrelated blank-timestamp rows without stronger source identifiers can collapse into one partition and undercount usage after sync. Coalesce the timestamp string only inside the fallback key so ordinary timestamp typing and stronger Claude/source/dedup identifiers remain unchanged.
roborev: Combined Review (
|
roborev: Combined Review (
|
CI showed WebKit resolving the target controls but timing out while waiting for Playwright's stability check on the first WebKit interactions. The tests already assert visibility before the action and verify the post-click URL or active session state, so forcing these clicks keeps the behavioral coverage while avoiding the browser-specific actionability stall.
roborev: Combined Review (
|
GitHub's WebKit workers can spend most of the global 20s per-test budget before the first project tests finish, while the same scenarios complete quickly once the workers are warm. The affected tests still assert the same behavior; they just get enough budget for the cold WebKit startup path that CI is exercising.
Imports and some legacy paths write parser-derived rows through UpsertSession and ReplaceSessionMessages instead of the atomic batch writer. Keeping sanitization only on batch writes left those paths able to persist the same unsafe values the central validation pass was meant to normalize. Apply the idempotent sanitizer at the lower DB write boundary and cover the import path with a regression that exercises control bytes, implausible timestamps, and overlong model strings. Move the WebKit E2E timeout to describe configuration so it applies before fixture and beforeEach work begins.
ChatGPT re-imports refresh session_name through a targeted metadata update instead of UpsertSession so they can preserve existing session fields. That path still receives parser-derived titles and needs the same text sanitization as full session writes. Sanitize a copied session_name before the targeted update so nil and caller-owned pointer semantics stay intact while imported metadata cannot retain control bytes.
roborev: Combined Review (
|
Message token_usage can contain quoted numeric fields and nested metadata, but the fallback counter scanner was not respecting JSON structure. That could drop later counters or count nested and string keys differently from PostgreSQL and DuckDB.\n\nParser-derived model IDs also need cleanup before applying the length cap so long dirty prefixes do not erase the valid pricing model that follows.
SQLite's tolerant token_usage scanner still needs JSON string semantics when it skips ignored metadata or reads quoted numeric counters. Go literal decoding rejects valid JSON escapes such as escaped slash, which could stop scanning before later counters and diverge from the PostgreSQL and DuckDB extraction paths.\n\nUse JSON-compatible string decoding so valid escaped metadata does not prevent later top-level usage counters from contributing.
roborev: Combined Review (
|
Several parsers extract fields by walking undocumented payloads — the Antigravity protobuf most of all. Bytes pulled out that way can pass
utf8.Validyet not be text, and each incident so far earned its own narrow guard (isPlausibleModelName,maxPlausibleTokens,isNoisyAntigravityStepString, …). The NUL-in-a-model-name case (#647) bit hardest: it brokepg pushand recurred silently many times before the exit code was noticed. This replaces the scattered per-site guards with one validation/sanitization pass at the sync→persistence seam, so the next bad field — in a format nobody has written a guard for yet — is handled by default.What it does
internal/sync/validate.goaddsvalidateAndSanitizeoverdb.Message/db.UsageEvent/db.Session, wired intoprepareSessionWrite,toDBUsageEvents, and the incremental-append path so every write flows through it. The policy is sanitize/clamp, never drop, so no legitimate row is lost for an agent format we haven't characterized:{user, assistant, system, tool}→ empty (newparser.ValidRole)\n\t\r) — C1 controls are valid two-byte UTF-8 and previously survivedutf8.Valid[0, 2_000_000]Idempotency and backend parity
The per-rune strip extends the shared
db.SanitizeUTF8, already applied symmetrically by the local fingerprint builders and the Postgres push/readback path, so re-sanitizing is a no-op and fingerprints don't drift (no spurious re-push).ContentLengthis recomputed after the strip, andvisualStudioCopilotMessageHasArchiveUpdatesanitizes parsed content before comparison to avoid a per-sync rewrite of control-char content. The pass runs before rows diverge into the SQLite/PostgreSQL/DuckDB paths.Tradeoffs and limitations
Message.TokenUsageand the nestedToolCalls/ToolResultsstrings are intentionally left to the symmetric fingerprint/pg-push paths, not the central pass — documented invalidate.goas a follow-up.validateAndSanitizereturns per-category fix counts that callers currently discard — a seam for a future parser-health surface.Reviewers:
internal/sync/validate.go, the seam wiring ininternal/sync/engine.go, and thedb.SanitizeUTF8extension.