acc: Always destroy deployed bundles on test exit in cloud runs#5585
acc: Always destroy deployed bundles on test exit in cloud runs#5585chrisst wants to merge 4 commits into
Conversation
When acceptance tests run against real cloud workspaces (CLOUD_ENV set), a test that fails, times out, or exits mid-script never reaches its own 'bundle destroy' step: scripts run under 'bash -e' and the merged script.cleanup parts are skipped on failure. The deployed resources (SQL warehouses, pipelines, jobs, ...) then leak in the shared test workspaces. Leaked started warehouses recently exhausted a GCP quota and took CI down for two days; we observed 100+ leaked warehouses and dozens of leaked test pipelines in a single workspace. This adds a harness-level safety net: on cloud runs, runTest registers a t.Cleanup (before starting the script, so it also covers timeouts and mid-test failures) that scans the test's temp dir for bundle state directories (<bundle_root>/.databricks/bundle/<target>) and runs '$CLI bundle destroy --auto-approve --target <target>' in each bundle root, reusing the exact environment the script ran with. The mechanism is deliberately best effort and invisible to test output: - It is a no-op for local testserver runs (gated on CLOUD_ENV). - It runs after output comparison and logs only via t.Logf, so cleanup output is never compared against expected out files. - Double-destroy is harmless: 'bundle destroy' on an already-destroyed bundle exits 0 with 'No active deployment found to destroy!'. In the common success path the shared script.cleanup already removed .databricks, so nothing is even attempted. - Destroy failures are logged but never fail the test. Co-authored-by: Isaac
Use context.WithoutCancel(t.Context()) instead of context.Background() (t.Context() is already canceled when cleanups run), make the best-effort nilerr skip explicit, and trim narration comments. Co-authored-by: Isaac
Waiting for approvalBased on git history, these people are best suited to review:
Eligible reviewers: Suggestions based on git history. See OWNERS for ownership rules. |
Integration test reportCommit: 9aa5344
20 interesting tests: 13 SKIP, 7 RECOVERED
Top 20 slowest tests (at least 2 minutes):
|
shreyas-goenka
left a comment
There was a problem hiding this comment.
Can you expand on the motivation for this change? We do have cleaup scripts in eng-dev-ecosystem that every night cleanup all resources and file trees in our test workspaces. That automatically captures orphaned resources.
Are we seeing more resource exhausted errors or running into limits?
Tangentially related: @pietern is also looking into creating separate worksapces for DABs vs PAE to help with the test load.
| } | ||
| } | ||
|
|
||
| func destroyBundle(t *testing.T, cliPath, bundleRoot, target string, env []string) { |
There was a problem hiding this comment.
A lot of acceptance tests do not create a bundle. Most (all?) also clean up after themselves. It's not clear having this is a net benefit.
Resources are currently leaking so fast that they are causing daily quota failures, I've been struggling lately to get PRs merged due to quotas being hit. IMO having a nightly cleanup is nice, but an imperfect solution. If we can address the cleanup right after the test run it will stop resources from building up during the day and possibly causing failures due to quotas. Also it's going to be better from a cost management perspective. If we can clean up with a high certainty right after a test suite runs then that's the gold standard. re splitting dabs and pae - that's nice but tangential. We still should aim to clean up resources as soon as they are irrelevant. And the nightly cleanup should serve as a good janitor. given all that - I'm not sure the best way to do targeted cleanups, this was just an attempt to use dabs to do the cleanup if dabs was the one doing the creation. I've got some other ideas to clean up per run, but this felt like a cheap and easy way to attempt a targeted cleanup. |
|
I'll be handling this another layer up to account for any timeout induced resource leakages. |
Pull request was closed
What
Adds a harness-level guarantee that bundles deployed by acceptance tests are destroyed when the test exits, for cloud runs (
CLOUD_ENVset):runTestregisters at.Cleanupbefore the script starts (covering failures,requireaborts, and script timeouts), capturing a clone of the exact env the script ran with;.databricks/bundle/<target>state dirs and runs$CLI bundle destroy --auto-approve --target <target>per bundle root (10-minute cap per destroy);t.Logfonly — never fail the test; double-destroy is harmless ("No active deployment found to destroy!" exits 0);Why
Acceptance scripts run under
bash -e, andscript.cleanupfragments are appended after the main body — they never execute when a script fails or times out betweenbundle deployandbundle destroy. Against shared cloud test workspaces this leaks real resources: during the 2026-06-11/12 incident one shared GCP workspace had accumulated 100+ leaked test warehouses and dozens of leakedtest-bundle-pipeline-*pipelines, exhausting the project's local-SSD quota and blocking terraform-provider CI for ~2 days (ref ES-1974228).Cleanup output cannot pollute golden files: it runs after output comparison and goes only to the test log.
Known limitation: destroy is best-effort — a bundle deployed with required
--varflags or a config corrupted mid-test may still fail to destroy; this is logged as a leak warning rather than failing the run.Tests
go build ./...,go vet ./acceptancepass; local deploy+destroy acceptance tests (bundle/resources/sql_warehouses,bundle/resources/pipelines/recreate-keysacross all engine variants) pass with no output regressions.This pull request and its description were written by Isaac.