direct: add resilience against eventual consistency + fix tests #5694

Open
denik wants to merge 3 commits into main from denik/eventual-consistency

denik (Contributor) commented Jun 23, 2026

Changes

  • Add a deterministic eventual-consistency simulation to the testserver for the dashboard backend (the first GET always returns a stale response, the next one returns the correct one).
  • Update the direct engine to retry 404s when we know the resource should exist (e.g. after a create or update).
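
The two changes above can be illustrated with a small, language-neutral sketch in Python. It is not the testserver or direct-engine code: `StaleOnceStore`, `NotFound`, and `read_with_retry` are hypothetical names, and the stale read is modelled as a 404 for simplicity.

```python
import time


class NotFound(Exception):
    """Stands in for an HTTP 404 from the backend."""


class StaleOnceStore:
    """Toy backend: the first GET after each write misses, later GETs succeed."""

    def __init__(self):
        self._data = {}
        self._stale = set()  # ids whose next GET should still look stale

    def put(self, resource_id, value):
        self._data[resource_id] = value
        self._stale.add(resource_id)

    def get(self, resource_id):
        if resource_id in self._stale:
            self._stale.discard(resource_id)
            raise NotFound(resource_id)  # deterministic: exactly one stale read
        if resource_id not in self._data:
            raise NotFound(resource_id)
        return self._data[resource_id]


def read_with_retry(store, resource_id, expect_exists, attempts=3, interval=0.001):
    """Retry a 404 only when the caller knows the resource should exist."""
    for attempt in range(attempts):
        try:
            return store.get(resource_id)
        except NotFound:
            # A 404 for a resource we never created is a real miss: do not retry.
            if not expect_exists or attempt == attempts - 1:
                raise
            time.sleep(interval)


store = StaleOnceStore()
store.put("dash-1", {"name": "sales"})
result = read_with_retry(store, "dash-1", expect_exists=True)
```

Because the simulation is deterministic (one stale read per write), a test that forgets to retry fails every time rather than intermittently.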

Why

We've seen the dashboard API behave in an eventually consistent way, which causes cloud tests to fail.

Tests

  • Update tests to avoid reading stale values (e.g. parse the output of the PUT instead of doing a follow-up GET).
  • In some cases, retry the GET request if we can see that it is stale (it still returns the old ETag value).
  • A new script, retry.py, retries based on a substring in the response.
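
As a rough illustration of substring-based retrying, here is a minimal sketch. It is not the actual retry.py, whose interface is not shown in this PR: the function name, its parameters, and the choice to retry while a stale marker is still present are all assumptions.

```python
import time


def retry_while_contains(fetch, stale_marker, attempts=5, interval=0.001):
    """Call fetch() until its text no longer contains stale_marker.

    Returns the last response either way, so the caller can still see
    a stale body if every attempt was stale.
    """
    response = fetch()
    for _ in range(attempts - 1):
        if stale_marker not in response:
            break
        time.sleep(interval)
        response = fetch()
    return response


# Simulated GET: the first read still carries the old ETag.
responses = iter(['{"etag": "v1"}', '{"etag": "v2"}'])
fresh = retry_while_contains(lambda: next(responses), stale_marker='"etag": "v1"')
```

Bounding the number of attempts keeps a permanently stale backend from hanging the test; the stale body is returned so the failure shows up in the test output.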

…ngine

The testserver now returns 404 on the first dashboard GET after a create
(eventual-consistency token), and the direct engine retries reads on 404
when it knows the resource should exist (has an ID on record).

Co-authored-by: Isaac
github-actions bot (Contributor) commented Jun 23, 2026

Approval status: pending

/acceptance/bundle/ - needs approval

4 files changed
Suggested: @pietern
Also eligible: @janniklasrose, @shreyas-goenka, @andrewnester, @anton-107, @lennartkats-db

/bundle/ - needs approval

5 files changed
Suggested: @pietern
Also eligible: @janniklasrose, @shreyas-goenka, @andrewnester, @anton-107, @lennartkats-db

General files (require maintainer)

8 files changed
Based on git history:

  • @pietern -- recent work in libs/testserver/, bundle/direct/, bundle/direct/dresources/

Any maintainer (@andrewnester, @anton-107, @pietern, @shreyas-goenka, @simonfaltum, @renaudhartert-db) can approve all areas.
See OWNERS for ownership rules.

@denik denik temporarily deployed to test-trigger-is June 23, 2026 19:02 — with GitHub Actions Inactive
eng-dev-ecosystem-bot (Collaborator) commented Jun 23, 2026

Integration test report

Commit: 84f6cf6

Run: 28077092784

| Env | 🟨 KNOWN | ✅ pass | 🙈 skip | Time |
| --- | --- | --- | --- | --- |
| 🟨 aws linux | 1 | 216 | 99 | 2:56 |
| 🟨 aws windows | 1 | 218 | 97 | 2:44 |
| 🟨 aws-ucws linux | 1 | 297 | 18 | 3:30 |
| 🟨 aws-ucws windows | 1 | 299 | 16 | 3:34 |
| 🟨 azure linux | 1 | 216 | 98 | 3:08 |
| 🟨 azure windows | 1 | 218 | 96 | 2:41 |
| 🟨 azure-ucws linux | 1 | 299 | 15 | 4:12 |
| 🟨 azure-ucws windows | 1 | 301 | 13 | 3:52 |
| 🟨 gcp linux | 1 | 215 | 100 | 3:01 |
| 🟨 gcp windows | 1 | 217 | 98 | 2:51 |

| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 🟨 TestAccept | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K |

These just delegated to DoRead with no readiness polling. The post-create
eventual-consistency read is already handled by refreshRemoteState, which
retries on 404 via retryOnTransientOrMissing.

Co-authored-by: Isaac
@denik denik temporarily deployed to test-trigger-is June 24, 2026 04:46 — with GitHub Actions Inactive
The matrix DATABRICKS_BUNDLE_ENGINE value is only set on the CLI subprocess
env, so reading it via env.Get(t.Context()) in PrepareServerAndClient returned
"" and the EC token was never selected -- the simulation was dead in tests.

Thread the per-variant env into PrepareServerAndClient and gate EC on an
explicit TESTS_STALE_ONCE=1 (direct engine only). Enable it for the dashboards
tests and the no_drift invariant; migrate/continue_293 invoke terraform or the
old CLI which do not retry, so they are left out.
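
The gating described here can be sketched as follows, in Python for brevity. This is not the code from PrepareServerAndClient: `stale_once_enabled` is a hypothetical helper, and the engine value `"direct"` is an assumption based on the "direct engine only" wording.

```python
def stale_once_enabled(variant_env):
    """Decide from the per-variant env, not from the test process env."""
    return (
        variant_env.get("TESTS_STALE_ONCE") == "1"
        and variant_env.get("DATABRICKS_BUNDLE_ENGINE") == "direct"
    )


# The matrix value lives only in the env passed to the CLI subprocess,
# so the test process env sees nothing and the simulation stays off.
process_env = {}
variant_env = {"DATABRICKS_BUNDLE_ENGINE": "direct", "TESTS_STALE_ONCE": "1"}

enabled_from_process = stale_once_enabled(process_env)
enabled_from_variant = stale_once_enabled(variant_env)
```

Requiring an explicit opt-in flag means tests that shell out to tools without retry logic are unaffected unless they ask for the simulation.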

With EC genuinely on, WaitAfterCreate is required again to consume the
post-create stale read inside deploy; the resulting 404 retry is expected and
is logged at debug (not warn). The retry interval is set to 1ms for acceptance
tests to avoid 15s sleeps.

Co-authored-by: Isaac
@denik denik temporarily deployed to test-trigger-is June 24, 2026 05:21 — with GitHub Actions Inactive