Fix stale group listing after rename in groups migration workflow#4738

Open
mwojtyczka wants to merge 3 commits into main from fix_reflext_account_groups

@mwojtyczka (Collaborator) commented Apr 15, 2026

Changes

  • Fix non-monotonic consistency gap between rename_workspace_local_groups and reflect_account_groups_on_workspace in the migrate-groups workflow
  • When the groups API returns stale data showing a renamed group under its old name, detect this by comparing group IDs and proceed with account group reflection instead of skipping

Problem

A customer ran the migrate-groups workflow and observed that all steps completed successfully on the first run, but only the group renames took effect: account groups were never added to the workspace and permissions were not migrated. Running the workflow a second time completed the remaining steps.

The root cause is a non-monotonic consistency gap between steps 2 and 3 of the workflow. Step 2 (rename_workspace_local_groups) renames the groups and waits up to 2 minutes for the API listing to reflect the change. Even after that wait has confirmed the rename, step 3 (reflect_account_groups_on_workspace) makes a fresh API call that can hit a different cache server and still see the old group name. Because name_in_account then appears to be present as a workspace group, step 3 skips the group on the assumption that it already exists, which turns both step 3 (reflect) and step 4 (apply permissions) into no-ops. The existing 2-minute wait in _wait_for_renamed_groups does not protect against this because the groups API is not monotonically consistent: seeing the correct state once does not guarantee that subsequent calls will also see it.
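
The gap can be illustrated with a toy model. This is only a sketch: the round-robin replica behaviour, the class name StaleListing, and the group names and ID are assumptions made for illustration, not how the groups API is actually implemented.

```python
import itertools


class StaleListing:
    """Toy listing API backed by replicas that may lag behind a write (hypothetical)."""

    def __init__(self, replica_views):
        # Each replica holds its own view of the workspace: group ID -> display name.
        self._views = itertools.cycle(replica_views)

    def list_groups(self):
        # Every call is answered by the next replica in round-robin order.
        return dict(next(self._views))


GROUP_ID = "123"
fresh_view = {GROUP_ID: "db-temp-analysts"}  # rename already visible
stale_view = {GROUP_ID: "analysts"}          # rename not yet visible

api = StaleListing([fresh_view, stale_view])


def rename_visible(listing):
    """Return True when the listing shows the group under its temporary name."""
    return listing.get(GROUP_ID) == "db-temp-analysts"


# Step 2 waits until the rename is visible; here the first read already succeeds.
first_read = rename_visible(api.list_groups())

# Step 3 lists again, lands on the lagging replica, and sees the old name.
second_read = rename_visible(api.list_groups())
```

In this model, a wait loop that stops at the first successful read says nothing about the next read, which is the situation step 3 runs into.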

Fix

In reflect_account_groups_on_workspace, when a workspace group matching name_in_account is found, compare its ID against migrated_group.id_in_workspace:

  • Same ID: the renamed group is appearing under its old name because of a stale cache. Log a warning and proceed with reflecting the account group.
  • Different ID: a genuinely different workspace group exists with that name. Skip it, as before.
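
The decision described above can be sketched as a small pure function. This is a minimal illustration, not the code in this PR: MigratedGroupSketch, Action, and decide are hypothetical names, and only the fields id_in_workspace and name_in_account are taken from the description.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    REFLECT = "reflect"
    REFLECT_STALE = "reflect, listing is stale"
    SKIP = "skip"


@dataclass(frozen=True)
class MigratedGroupSketch:
    """Hypothetical stand-in carrying only the two fields the check needs."""

    id_in_workspace: str
    name_in_account: str


def decide(migrated: MigratedGroupSketch, workspace_ids_by_name: dict) -> Action:
    """Choose what to do for one group, given a listing of display name -> group ID."""
    found_id = workspace_ids_by_name.get(migrated.name_in_account)
    if found_id is None:
        # No workspace group uses the account group's name: reflect as usual.
        return Action.REFLECT
    if found_id == migrated.id_in_workspace:
        # Same ID: this is the renamed group seen under its old name.
        return Action.REFLECT_STALE
    # Different ID: an unrelated workspace group owns the name.
    return Action.SKIP
```

For example, with migrated = MigratedGroupSketch("123", "analysts"), a listing of {"analysts": "123"} yields Action.REFLECT_STALE, while {"analysts": "999"} yields Action.SKIP. The check relies on the group ID staying the same across a rename, which is what the fix assumes.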

Functionality

  • added relevant user documentation
  • added new CLI command
  • modified existing command: databricks labs ucx ...
  • added a new workflow
  • modified existing workflow: groups-migration
  • added a new table
  • modified existing table: ...

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • verified on staging environment (screenshot attached)

Verification on staging environment

[Screenshot of the staging run]

Tested with 2 groups.

All 4 steps completed successfully in a single run. The rename step took ~6.7 minutes, well beyond the 2-minute consistency wait. Critically, reflect_account_groups_on_workspace succeeded and did real work (321s, not a quick no-op skip), followed by apply_permissions completing in 11s. This confirms the fix is working: even if the groups API returned stale data, step 3 would detect it by comparing group IDs and proceed correctly.

Logs

rename_workspace_local_groups

10:04:12 INFO [d.l.u.workspace_access.groups] Listing workspace groups (resource_type=Group) with id,displayName,externalId,meta ...
10:04:13 INFO [d.l.u.workspace_access.groups] Found 1962 Group
10:04:13 INFO [d.l.u.workspace_access.groups] Listing workspace groups (resource_type=WorkspaceGroup) with id,displayName,meta,externalId,members,roles,entitlements ...
10:09:30 INFO [d.l.u.workspace_access.groups] Found 612 WorkspaceGroup
10:09:43 INFO [d.l.u.workspace_access.groups] Starting to rename 2 groups for migration...
10:09:55 INFO [d.l.blueprint.parallel][rename_groups_in_the_workspace_0] rename groups in the workspace 2/2, rps: 0.173/sec
10:09:55 INFO [d.l.blueprint.parallel] Finished 'rename groups in the workspace' tasks: 0% results available (0/2). Took 0:00:11.567695
10:09:56 INFO [d.l.blueprint.parallel][waiting_for_renamed_groups_in_the_workspace_0] waiting for renamed groups in the workspace 2/2, rps: 1.760/sec
10:09:56 INFO [d.l.blueprint.parallel] Finished 'waiting for renamed groups in the workspace' tasks: 0% results available (0/2). Took 0:00:01.137327
10:09:56 INFO [d.l.u.workspace_access.groups] Listing workspace groups (resource_type=WorkspaceGroup) with id,displayName ...
10:09:56 INFO [d.l.u.workspace_access.groups] Found 2574 WorkspaceGroup
10:09:56 INFO [d.l.u.workspace_access.groups] Listing workspace groups (resource_type=WorkspaceGroup) with id,displayName ...
10:09:56 INFO [d.l.u.workspace_access.groups] Found 2574 WorkspaceGroup

reflect_account_groups_on_workspace

10:10:11 INFO [d.l.u.workspace_access.groups] Listing account groups with id,displayName,externalId...
10:10:11 INFO [d.l.u.workspace_access.groups] Found 2527 account groups
10:10:11 INFO [d.l.u.workspace_access.groups] Listing workspace groups (resource_type=Group) with id,displayName,externalId,meta ...
10:10:11 INFO [d.l.u.workspace_access.groups] Found 1962 Group
10:10:11 INFO [d.l.u.workspace_access.groups] Listing workspace groups (resource_type=WorkspaceGroup) with id,displayName,meta,externalId,members,roles,entitlements ...
10:15:17 INFO [d.l.u.workspace_access.groups] Found 612 WorkspaceGroup
10:15:19 INFO [d.l.u.workspace_access.groups] Starting to reflect 2 account groups into workspace for migration...
10:15:19 INFO [d.l.blueprint.parallel][reflect_account_groups_on_this_workspace_1] reflect account groups on this workspace 2/2, rps: 3.244/sec
10:15:19 INFO [d.l.blueprint.parallel] Finished 'reflect account groups on this workspace' tasks: 100% results available (2/2). Took 0:00:00.618280

apply_permissions

10:15:34 INFO [d.l.u.workspace_access.groups] Migrating permissions for 2 account groups.
10:15:34 INFO [d.l.u.workspace_access.groups] Migrating permissions: db-temp-rc-test-group-a (workspace) -> rc-test-group-a (account) starting
10:15:35 INFO [d.l.u.workspace_access.groups] Migrating permissions: db-temp-rc-test-group-a (workspace) -> rc-test-group-a (account) progress=11(+11)
10:15:35 INFO [d.l.u.workspace_access.groups] Migrating permissions: db-temp-rc-test-group-a (workspace) -> rc-test-group-a (account) finished
10:15:35 INFO [d.l.u.workspace_access.groups] Migrated 11 permissions: db-temp-rc-test-group-a (workspace) -> rc-test-group-a (account)
10:15:35 INFO [d.l.u.workspace_access.groups] Migrating permissions: db-temp-rc-test-group-b (workspace) -> rc-test-group-b (account) starting
10:15:36 INFO [d.l.u.workspace_access.groups] Migrating permissions: db-temp-rc-test-group-b (workspace) -> rc-test-group-b (account) progress=5(+5)
10:15:37 INFO [d.l.u.workspace_access.groups] Migrating permissions: db-temp-rc-test-group-b (workspace) -> rc-test-group-b (account) finished
10:15:37 INFO [d.l.u.workspace_access.groups] Migrated 5 permissions: db-temp-rc-test-group-b (workspace) -> rc-test-group-b (account)
10:15:37 INFO [d.l.u.workspace_access.groups] Migrated 16 permissions for 2/2 groups successfully.
10:15:37 INFO [d.l.u.workspace_access.workflows] Group permission migration completed successfully.

@mwojtyczka mwojtyczka requested a review from a team as a code owner April 15, 2026 08:31
@mwojtyczka mwojtyczka requested a review from asnare April 15, 2026 08:31
@mwojtyczka mwojtyczka changed the title from "Fix stale group listing in reflect_account_groups_on_workspace after rename" to "Fix stale group listing after rename in groups migration workflow" Apr 15, 2026
codecov Bot commented Apr 22, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.81%. Comparing base (4451cbe) to head (0f857d9).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4738      +/-   ##
==========================================
+ Coverage   87.79%   87.81%   +0.01%     
==========================================
  Files         123      123              
  Lines       17595    17598       +3     
  Branches     3717     3718       +1     
==========================================
+ Hits        15448    15454       +6     
+ Misses       1458     1456       -2     
+ Partials      689      688       -1     


@mwojtyczka mwojtyczka requested a review from FastLee April 29, 2026 09:44
@mwojtyczka (Collaborator, Author) commented

@asnare @FastLee any chance to review this?

github-actions Bot commented Jun 9, 2026


❌ 59/60 passed, 1 flaky, 1 failed, 5 skipped, 6h22m23s total

❌ test_reflect_account_groups_on_workspace_warns_skipping_when_a_workspace_group_has_same_name: databricks.labs.blueprint.parallel.ManyError: Detected 1 failures: TimeoutError: Timed out after 0:20:00 (32m53.406s)
databricks.labs.blueprint.parallel.ManyError: Detected 1 failures: TimeoutError: Timed out after 0:20:00
[gw7] linux -- Python 3.10.20 /home/runner/work/ucx/ucx/.venv/bin/python
12:45 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_sccpm5pwc.groups] fetching groups inventory
12:45 DEBUG [databricks.labs.ucx.framework.crawlers] Inventory table not found
Traceback (most recent call last):
  File "/home/runner/work/ucx/ucx/src/databricks/labs/ucx/framework/crawlers.py", line 152, in _snapshot
    cached_results = list(fetcher())
  File "/home/runner/work/ucx/ucx/src/databricks/labs/ucx/workspace_access/groups.py", line 647, in _try_fetch
    for row in self._sql_backend.fetch(f"SELECT * FROM {escape_sql_identifier(self.full_name)}"):
  File "/home/runner/work/ucx/ucx/.venv/lib/python3.10/site-packages/databricks/labs/lsql/core.py", line 344, in fetch_all
    execute_response = self.execute(
  File "/home/runner/work/ucx/ucx/.venv/lib/python3.10/site-packages/databricks/labs/lsql/core.py", line 268, in execute
    self._raise_if_needed(status)
  File "/home/runner/work/ucx/ucx/.venv/lib/python3.10/site-packages/databricks/labs/lsql/core.py", line 478, in _raise_if_needed
    raise NotFound(error_message)
databricks.sdk.errors.platform.NotFound: [TABLE_OR_VIEW_NOT_FOUND] The table or view `hive_metastore`.`dummy_sccpm5pwc`.`groups` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS. SQLSTATE: 42P01; line 1 pos 14
12:45 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_sccpm5pwc.groups] crawling new set of snapshot data for groups
12:50 DEBUG [databricks.labs.ucx.framework.crawlers] [hive_metastore.dummy_sccpm5pwc.groups] found 1 new records for groups
12:50 WARNING [databricks.labs.ucx.workspace_access.groups] Stale workspace group listing for ucx_GYy0h-ra78c33c72 (id=2127253178895634): group was already renamed to ucx-temp-ucx_GYy0h-ra78c33c72, proceeding with account group reflection
13:10 ERROR [databricks.labs.blueprint.parallel] reflect account groups on this workspace('151861181404295') task failed: Timed out after 0:20:00
Traceback (most recent call last):
  File "/home/runner/work/ucx/ucx/.venv/lib/python3.10/site-packages/databricks/sdk/retries.py", line 36, in wrapper
    return func(*args, **kwargs)
  File "/home/runner/work/ucx/ucx/.venv/lib/python3.10/site-packages/databricks/labs/blueprint/limiter.py", line 65, in wrapper
    return func(*args, **kwargs)
  File "/home/runner/work/ucx/ucx/src/databricks/labs/ucx/workspace_access/groups.py", line 881, in _reflect_account_group_to_workspace
    self._ws.api_client.do("PUT", path, data=json.dumps({"permissions": ["USER"]}))
  File "/home/runner/work/ucx/ucx/.venv/lib/python3.10/site-packages/databricks/sdk/core.py", line 85, in do
    return self._api_client.do(
  File "/home/runner/work/ucx/ucx/.venv/lib/python3.10/site-packages/databricks/sdk/_base_client.py", line 196, in do
    response = call(
  File "/home/runner/work/ucx/ucx/.venv/lib/python3.10/site-packages/databricks/sdk/retries.py", line 57, in wrapper
    raise err
  File "/home/runner/work/ucx/ucx/.venv/lib/python3.10/site-packages/databricks/sdk/retries.py", line 36, in wrapper
    return func(*args, **kwargs)
  File "/home/runner/work/ucx/ucx/.venv/lib/python3.10/site-packages/databricks/sdk/_base_client.py", line 298, in _perform
    raise error from None
databricks.sdk.errors.platform.ResourceConflict: Workspace group with name ucx_GYy0h-ra78c33c72 already exists.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/ucx/ucx/.venv/lib/python3.10/site-packages/databricks/labs/blueprint/parallel.py", line 168, in inner
    return func(*args, **kwargs), None
  File "/home/runner/work/ucx/ucx/.venv/lib/python3.10/site-packages/databricks/sdk/retries.py", line 65, in wrapper
    raise TimeoutError(f"Timed out after {timeout}") from last_err
TimeoutError: Timed out after 0:20:00
13:10 CRITICAL [databricks.labs.blueprint.parallel] All 'reflect account groups on this workspace' tasks failed!!!

Flaky tests:

  • 🤪 test_some_entitlements[True] (25.754s)

Running from acceptance #9042
