Skip to content

Guard desktop startup against legacy provider state#1027

Open
Amer-alsayed wants to merge 6 commits intopingdotgg:mainfrom
Amer-alsayed:codex/fix-orchestration-legacy-provider
Open

Guard desktop startup against legacy provider state#1027
Amer-alsayed wants to merge 6 commits intopingdotgg:mainfrom
Amer-alsayed:codex/fix-orchestration-legacy-provider

Conversation

@Amer-alsayed
Copy link

@Amer-alsayed Amer-alsayed commented Mar 13, 2026

What Changed

  • Desktop preload fetches the authenticated WS URL via IPC so the renderer always connects to the correct backend endpoint.
  • ProviderSessionDirectory drops unknown persisted providers (e.g., legacy gemini) instead of failing startup.
  • OrchestrationEventStore skips legacy/unreadable events and logs a warning rather than crashing the server.

Why

Persisted legacy provider data can cause orchestration event decode failures at startup, which restarts the backend and leaves the UI stuck (e.g., Add project hangs). This keeps startup resilient and preserves UX while still logging the issue.

UI Changes

None.

Checklist

  • This PR is small and focused
  • I explained what changed and why
  • No UI changes
  • Tests run: bun fmt, bun lint, bun typecheck

Note

Guard desktop startup against legacy provider state and undecodable events

  • readFromSequence in OrchestrationEventStore.ts now skips individual decode failures (logging a warning) rather than failing the stream, and advances pagination based on the last row sequence instead of the last decoded event.
  • getBinding in ProviderSessionDirectory.ts deletes and drops invalid provider sessions instead of propagating a decode error, preventing stale/legacy state from blocking startup.
  • The desktop preload now fetches wsUrl synchronously from the main process via a new desktop:get-ws-url IPC channel instead of reading from an environment variable, returning null on failure.

Macroscope summarized c038d0c.

@chatgpt-codex-connector
Copy link

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@github-actions github-actions bot added size:XL 500-999 changed lines (additions + deletions). vouch:unvouched PR author is not yet trusted in the VOUCHED list. labels Mar 13, 2026
@coderabbitai
Copy link

coderabbitai bot commented Mar 13, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f9466f27-a959-425b-97af-82cb47c29ee2

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can generate a title for your PR based on the changes.

Add @coderabbitai placeholder anywhere in the title of your PR and CodeRabbit will replace it with a title based on the changes in the PR. You can change the placeholder by changing the reviews.auto_title_placeholder setting.

if (normalizedLimit === 0) {
return Stream.empty;
}
const readPage = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Critical Layers/OrchestrationEventStore.ts:216

When a row fails decoding, its sequence is excluded from events, so the pagination cursor (events[events.length - 1]!.sequence) advances past only successfully decoded events. The next query fetches from that cursor, re-retrieving the failed row which fails decoding again, creating an infinite loop. For example: rows with sequences [1,2,3] are fetched, row 3 fails decoding, events becomes [{seq:1},{seq:2}], cursor becomes 2, next query refetches row 3, which fails again.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file apps/server/src/persistence/Layers/OrchestrationEventStore.ts around line 216:

When a row fails decoding, its sequence is excluded from `events`, so the pagination cursor (`events[events.length - 1]!.sequence`) advances past only successfully decoded events. The next query fetches from that cursor, re-retrieving the failed row which fails decoding again, creating an infinite loop. For example: rows with sequences [1,2,3] are fetched, row 3 fails decoding, `events` becomes [{seq:1},{seq:2}], cursor becomes 2, next query refetches row 3, which fails again.

Evidence trail:
apps/server/src/persistence/Layers/OrchestrationEventStore.ts lines 230-241 (decode failure returns Option.none), lines 248-261 (pagination logic with cursor based on last successful event's sequence at line 258)

@github-actions github-actions bot added size:M 30-99 changed lines (additions + deletions). and removed size:XL 500-999 changed lines (additions + deletions). labels Mar 13, 2026
@Amer-alsayed
Copy link
Author

Updated pagination to advance using the last row sequence when skipping invalid events, preventing decode loops. Desktop WS URL IPC fix and legacy provider cleanup unchanged. Tests: bun fmt, bun lint, bun typecheck.

@juliusmarminge
Copy link
Member

did you run into an issue where the state wasn't backwards compatible?

i think we just need to make sure we have good migrations in https://github.com/pingdotgg/t3code/tree/main/apps/server/src/persistence/Migrations whenever we do breaking changes that auto-migrates people's state to be compatible?

@Amer-alsayed
Copy link
Author

@juliusmarminge Yes, that’s what I ran into.

The failure mode here was older persisted local state containing legacy provider/runtime data and orchestration events that the current startup path no longer decoded successfully, which caused bootstrap to fail and the backend to restart repeatedly.

I agree the right long-term fix is to keep persisted state backwards compatible through explicit migrations in apps/server/src/persistence/Migrations whenever we introduce breaking persistence changes. I scoped this PR as a small defensive reliability fix so already-incompatible state does not leave the app unusable, but I think the proper follow-up would be a migration that normalizes or removes legacy provider-specific persisted rows/events during upgrade.

@juliusmarminge
Copy link
Member

@juliusmarminge Yes, that’s what I ran into.

The failure mode here was older persisted local state containing legacy provider/runtime data and orchestration events that the current startup path no longer decoded successfully, which caused bootstrap to fail and the backend to restart repeatedly.

I agree the right long-term fix is to keep persisted state backwards compatible through explicit migrations in apps/server/src/persistence/Migrations whenever we introduce breaking persistence changes. I scoped this PR as a small defensive reliability fix so already-incompatible state does not leave the app unusable, but I think the proper follow-up would be a migration that normalizes or removes legacy provider-specific persisted rows/events during upgrade.

you know when we broke your state? were you using some fork maybe? we did make some breaking changes in the 0.0.0-alpha releases but since 0.0.1 every change should have been stable

@Amer-alsayed
Copy link
Author

@juliusmarminge Yes, I think that explains it.

I had previously been running a fork of 0.0.9 where I added Gemini support, and after installing 0.0.11 the app stopped working. So the incompatible persisted state was most likely produced by my fork, not by the normal upstream 0.0.1+ upgrade path.

From what I saw, the concrete failure was legacy provider/runtime/event data in local state that the current startup path could no longer decode cleanly. So I don’t want to frame this as evidence of a stable migration regression upstream. I think this PR is better understood as a small defensive guard so invalid or fork-produced persisted state does not leave the app unusable.

@juliusmarminge
Copy link
Member

Yes i think it's good defensive check. Just wanted to track down the cause of your error as we shouldn't have broken anything on our end!

Will do a proper review later

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:M 30-99 changed lines (additions + deletions). vouch:unvouched PR author is not yet trusted in the VOUCHED list.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants