
[GG-181] Use intermediate host for swap moves #2248

Open
bimboterminator1 wants to merge 1 commit into feature/ADBDEV-6608 from GG-181

Conversation


@bimboterminator1 bimboterminator1 commented Feb 18, 2026

Previously, a primary and its mirror could coexist on the same host during the
execution of moves in which the segments simply swap their hosts. This violates
the HA rule for the whole cluster.

When the suboptimal rebalance plan requires swapping the locations
of a primary segment and its mirror, the planner now decomposes
this into three safe phases using an intermediate host to prevent
primary-mirror coexistence violations.
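
As an illustration, here is a minimal, self-contained sketch of that three-phase
decomposition; Segment and Move below are simplified stand-ins, not the planner's
actual types:

from collections import namedtuple

# Illustrative stand-ins, not the planner's real types.
Segment = namedtuple("Segment", ["content", "role", "host"])
Move = namedtuple("Move", ["segment", "source_host", "target_host"])

def decompose_swap_sketch(primary, mirror, intermediate_host):
    # Phase 1: move the mirror out of the way, onto the intermediate host.
    phase1 = Move(mirror, mirror.host, intermediate_host)
    # Phase 2: move the primary to its final host (the mirror's old host).
    phase2 = Move(primary, primary.host, mirror.host)
    # Phase 3: bring the mirror from the intermediate host to its final host
    # (the primary's old host), so primary and mirror never share a host.
    phase3 = Move(mirror, intermediate_host, primary.host)
    return phase1, phase2, phase3

# Example mirroring the plan below: content 2 swaps between sdw1 and sdw2,
# with sdw3 serving as the intermediate host.
print(decompose_swap_sketch(Segment(2, "p", "sdw1"), Segment(2, "m", "sdw2"), "sdw3"))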

planner.py now detects swap moves in form_moves() and chooses an appropriate
third host for the mirror movement. The candidate is chosen based on available
space, taking into account other planned moves, host status, and how many swaps
each host already serves as an intermediate for.
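
A rough sketch of that selection, assuming a pre-filtered candidate list that
already excludes the two swap hosts and inactive hosts; the dictionary fields and
scoring below are illustrative, not the real select_intermediate_host():

# Scores each candidate by remaining headroom, penalizing hosts that are already
# used as intermediates for other swaps; returns the best host name or None.
def pick_intermediate_host(candidates, mirror_size_kb, used_intermediate_hosts):
    best_host, best_score = None, None
    for host in candidates:
        headroom_kb = host["free_kb"] - host["planned_kb"]
        # Skip hosts that cannot hold the mirror on top of already planned moves.
        if headroom_kb < mirror_size_kb:
            continue
        # Prefer more free space and fewer swaps already routed to this host.
        score = headroom_kb - used_intermediate_hosts.get(host["name"], 0) * mirror_size_kb
        if best_score is None or score > best_score:
            best_host, best_score = host["name"], score
    return best_host

# Example: sdw3 wins because it has more headroom left after planned moves.
candidates = [
    {"name": "sdw3", "free_kb": 900_000, "planned_kb": 100_000},
    {"name": "sdw4", "free_kb": 500_000, "planned_kb": 50_000},
]
print(pick_intermediate_host(candidates, mirror_size_kb=200_000, used_intermediate_hosts={}))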
With this in place, a plan that previously looked like:


---------------------------------BALANCE MOVES----------------------------------
Total moves planned: 2

  [1] Move Segment(content=3, dbid=5, role=p) [254.73 MB]
      From: sdw1:7005 → /home/gpadmin/.data/primary/gpseg3
      To:   sdw2:7005 → /home/gpadmin/.data/primary/gpseg3

  [2] Move Segment(content=3, dbid=11, role=m) [190.44 MB]
      From: sdw2:7053 → /home/gpadmin/.data/mirror/gpseg3
      To:   sdw1:7053 → /home/gpadmin/.data/mirror/gpseg3

now expands into three moves:


---------------------------------BALANCE MOVES----------------------------------
Total moves planned: 3

  [1] Move Segment(content=2, dbid=10, role=m) [190.45 MB]
      From: sdw2:7052 → /home/gpadmin/.data/mirror/gpseg2
      To:   sdw3:7052 → /home/gpadmin/.data/mirror/gpseg2

  [2] Move Segment(content=2, dbid=4, role=p) [254.74 MB]
      From: sdw1:7004 → /home/gpadmin/.data/primary/gpseg2
      To:   sdw2:7004 → /home/gpadmin/.data/primary/gpseg2

  [3] Move Segment(content=2, dbid=10, role=m) [190.45 MB]
      From: sdw3:7052 → /home/gpadmin/.data/mirror/gpseg2
      To:   sdw1:7054 → /home/gpadmin/.data/mirror/gpseg2

Additionally, unit tests were fixed, because we forgot to check them in
previous patches.

Behave tests were not added, since we don't have the functionality to create
a cluster with a custom configuration.

Example of a configuration with a single swap for manual testing:

QD_PRIMARY_ARRAY=cdw~cdw~7000~/home/gpadmin/.data/gpseg-1~1~-1~0
declare -a PRIMARY_ARRAY=(
sdw1~sdw1~7002~/home/gpadmin/.data/primary/gpseg0~2~0~11100
sdw1~sdw1~7003~/home/gpadmin/.data/primary/gpseg1~3~1~11110
sdw1~sdw1~7004~/home/gpadmin/.data/primary/gpseg2~4~2~11220
sdw2~sdw2~7003~/home/gpadmin/.data/primary/gpseg3~5~3~11350
sdw3~sdw3~7004~/home/gpadmin/.data/primary/gpseg4~6~4~11360
sdw3~sdw3~7005~/home/gpadmin/.data/primary/gpseg5~7~5~11370
)
declare -a MIRROR_ARRAY=(
sdw3~sdw3~7050~/home/gpadmin/.data/mirror/gpseg0~8~0~51130
sdw3~sdw3~7051~/home/gpadmin/.data/mirror/gpseg1~9~1~51140
sdw2~sdw2~7052~/home/gpadmin/.data/mirror/gpseg2~10~2~51160
sdw1~sdw1~7053~/home/gpadmin/.data/mirror/gpseg3~11~3~51160
sdw2~sdw2~7054~/home/gpadmin/.data/mirror/gpseg4~12~4~51200
sdw2~sdw2~7055~/home/gpadmin/.data/mirror/gpseg5~13~5~51136
)

@whitehawk

In the description:
'now exands into three moves' -> 'now expands into three moves'

'remove_hosts', 'remove_hosts_file', 'target_datadirs', 'target_datadirs_file',
'target_segment_count', 'mirror_mode', 'skip_rebalance', 'show_plan',
'skip_resource_estimation', 'analyze', 'replay_lag', 'hba_hostnames'
'skip_resource_estimation', 'analyze', 'replay_lag', 'hba_hostnames', 'inplace_swap_roles'


Please add checks for these new incompatible combinations into the test.
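
A hypothetical sketch of what such a check could look like; validate_options()
and the option subset below are stand-ins, not the utility's real validation
entry point:

import pytest

# Stand-in validator: reject inplace_swap_roles combined with any option from
# the incompatibility list above (subset shown for brevity).
INCOMPATIBLE_WITH_INPLACE_SWAP = ["remove_hosts", "target_datadirs", "skip_rebalance"]

def validate_options(opts):
    if opts.get("inplace_swap_roles"):
        for name in INCOMPATIBLE_WITH_INPLACE_SWAP:
            if opts.get(name):
                raise ValueError(f"inplace_swap_roles cannot be combined with {name}")

@pytest.mark.parametrize("name", INCOMPATIBLE_WITH_INPLACE_SWAP)
def test_inplace_swap_roles_incompatible(name):
    with pytest.raises(ValueError):
        validate_options({"inplace_swap_roles": True, name: True})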

batch_mirror_steps = []
batch_primary_steps = [[],[],[]]

def is_swap_phase3(move: LogicalMove) -> bool:


Why do we need this phase metadata? From my understanding, when we are performing a swap, it will always transform into 3 movements:

  1. mirror to an intermediate host;
  2. primary to its target host;
  3. mirror from an intermediate host to its target host;

So there will always be a role change between any of these 3 steps, and they will not get into the same batch.

Or am I missing something? Can you please advise a cluster configuration where only the role_changed check is not enough?
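
For reference, a self-contained sketch of the batching-by-role-change idea
mentioned above; Move here is a simplified stand-in for the planner's move type:

from dataclasses import dataclass
from typing import List

@dataclass
class Move:
    content: int
    role: str  # 'p' for primary, 'm' for mirror

def split_into_batches(moves: List[Move]) -> List[List[Move]]:
    batches: List[List[Move]] = []
    for move in moves:
        # Start a new batch whenever the role differs from the previous move's role.
        if not batches or batches[-1][-1].role != move.role:
            batches.append([])
        batches[-1].append(move)
    return batches

# A single swap ordered as [phase1 mirror, phase2 primary, phase3 mirror]
# changes role m -> p -> m, so the three moves end up in three separate batches.
print(split_into_batches([Move(2, "m"), Move(2, "p"), Move(2, "m")]))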

"Need intermediate host to avoid primary and mirror coexistence." )

swap_move_ids = set()
for prim_move, mir_move in swap_pairs:


Can we merge this loop with the one below:

...
            for prim_move, mir_move in swap_pairs:
                content_id = prim_move.seg.getSegmentContentId()
                try:
                    # Select intermediate host
...

?

            active_hosts_count = len([h for h in self.target_hosts
                                      if h.status in [HostStatus.ACTIVE, HostStatus.NEW]])
            if active_hosts_count < 3:
                raise ValidationError(


There is also a check in select_intermediate_host():

        if not candidates:
            raise PlanningError(
                f"No intermediate host available for swap of content {content_id}. "
                f"Need at least 3 hosts to perform swaps safely."
            )

which intersects with this check. The check in select_intermediate_host() seems to be more general, so maybe we can leave only it, and remove the check here?

                    swap_phase3_moves.append(phase3)

                except Exception as e:
                    raise PlanningError(f"Failed to plan swap for content {content_id}: {e}")


Maybe add a reminder here about the --inplace-swap-roles option?

Comment on lines +832 to +903
        if swap_pairs:
            self.logger.info(f"Detected {len(swap_pairs)} primary-mirror pairs which just swap hosts")
            active_hosts_count = len([h for h in self.target_hosts
                                      if h.status in [HostStatus.ACTIVE, HostStatus.NEW]])
            if active_hosts_count < 3:
                raise ValidationError(
                    "Cannot safely perform primary-mirror swaps with less than 3 hosts. "
                    "Need intermediate host to avoid primary and mirror coexistence." )

            swap_move_ids = set()
            for prim_move, mir_move in swap_pairs:
                swap_move_ids.add(prim_move.seg.getSegmentDbId())
                swap_move_ids.add(mir_move.seg.getSegmentDbId())

            # Track intermediate host usage for better distribution
            used_intermediate_hosts = {}

            # Decompose swaps with intermediate host selection
            swap_phase1_moves = []
            swap_phase2_moves = []
            swap_phase3_moves = []

            for prim_move, mir_move in swap_pairs:
                content_id = prim_move.seg.getSegmentContentId()
                try:
                    # Select intermediate host
                    intermediate_host = self.select_intermediate_host(
                        prim_move,
                        mir_move,
                        used_intermediate_hosts,
                        resource_estimator  # Pass the estimator with cached filesystem data
                    )

                    # Decompose into 3 phases
                    phase1, phase2, phase3 = self.decompose_swap(
                        prim_move,
                        mir_move,
                        intermediate_host,
                        port_allocator
                    )

                    swap_phase1_moves.append(phase1)
                    swap_phase2_moves.append(phase2)
                    swap_phase3_moves.append(phase3)

                except Exception as e:
                    raise PlanningError(f"Failed to plan swap for content {content_id}: {e}")

            # Collect non-swap moves
            non_swap_mirror_moves = []
            non_swap_primary_moves = []

            for move in moves:
                if move.seg.getSegmentDbId() not in swap_move_ids:
                    if move.seg.isSegmentPrimary():
                        non_swap_primary_moves.append(move)
                    else:
                        non_swap_mirror_moves.append(move)

            moves_with_swap = []
            # Batch 1: All initial mirror moves (regular + phase1)
            moves_with_swap.extend(non_swap_mirror_moves)
            moves_with_swap.extend(swap_phase1_moves)

            # Batch 2: All primary moves (regular + phase2)
            moves_with_swap.extend(non_swap_primary_moves)
            moves_with_swap.extend(swap_phase2_moves)

        return moves
            # Batch 3: Phase 3 mirrors (must execute last)
            moves_with_swap.extend(swap_phase3_moves)

            final_moves = moves_with_swap


Maybe move this logic under the if swap_pairs: condition into a separate function in order to improve readability? So it would be something like:

Suggested change
        if swap_pairs:
            final_moves = self.handle_swaps(......)

            return self.filesystem_allocations[fs_key]['required_kb']
        return 0

    def get_available_space_for_directory(self, hostname: str, host_address: str, directory: str) -> Optional[int]:


This one doesn't seem to be used anywhere.

@whitehawk

Behave tests were not added, since we don't have the functionality to create
a cluster with a custom configuration.

I think it is easy to add.
For example, with the following patch:

diff --git a/gpMgmt/test/behave/mgmt_utils/steps/mgmt_utils.py b/gpMgmt/test/behave/mgmt_utils/steps/mgmt_utils.py
index 059f6e27c8f..663c971d012 100644
--- a/gpMgmt/test/behave/mgmt_utils/steps/mgmt_utils.py
+++ b/gpMgmt/test/behave/mgmt_utils/steps/mgmt_utils.py
@@ -4546,6 +4546,13 @@ def impl(context, delay):
 def impl(context):
     os.environ['GPMGMT_FAULT_DELAY_MS'] = ""

+@given('store text in file "{filename}"')
+@then('store text in file "{filename}"')
+@when('store text in file "{filename}"')
+def impl(context, filename):
+    with open(filename, 'w') as fp:
+        fp.write(context.text + '\n')
+
 @given('stub')
 def impl(context):
     pass

And the following sample test definition:

    Scenario: test N. ggrebalance create custom cluster
        Given the database is not running
         And a working directory of the test as '/data/gpdata/ggrebalance'
         And the user runs command "ssh sdw1 mkdir -p /data/gpdata/ggrebalance/primary"
         And the user runs command "ssh sdw1 mkdir -p /data/gpdata/ggrebalance/mirror"
         And the user runs command "ssh sdw2 mkdir -p /data/gpdata/ggrebalance/primary"
         And the user runs command "ssh sdw2 mkdir -p /data/gpdata/ggrebalance/mirror"
         And the user runs command "ssh sdw3 mkdir -p /data/gpdata/ggrebalance/primary"
         And the user runs command "ssh sdw3 mkdir -p /data/gpdata/ggrebalance/mirror"
         And store text in file "/tmp/config_file"
            """
            TRUSTED_SHELL=ssh
            ENCODING=UNICODE
            SEG_PREFIX=gpseg
            HEAP_CHECKSUM=on
            HBA_HOSTNAMES=0
            QD_PRIMARY_ARRAY=cdw~cdw~7000~/data/gpdata/ggrebalance/gpseg-1~1~-1~0
            declare -a PRIMARY_ARRAY=(
            sdw1~sdw1~7002~/data/gpdata/ggrebalance/primary/gpseg0~2~0~11100
            sdw1~sdw1~7003~/data/gpdata/ggrebalance/primary/gpseg1~3~1~11110
            sdw1~sdw1~7004~/data/gpdata/ggrebalance/primary/gpseg2~4~2~11220
            sdw2~sdw2~7003~/data/gpdata/ggrebalance/primary/gpseg3~5~3~11350
            sdw3~sdw3~7004~/data/gpdata/ggrebalance/primary/gpseg4~6~4~11360
            sdw3~sdw3~7005~/data/gpdata/ggrebalance/primary/gpseg5~7~5~11370
            )
            declare -a MIRROR_ARRAY=(
            sdw3~sdw3~7050~/data/gpdata/ggrebalance/mirror/gpseg0~8~0~51130
            sdw3~sdw3~7051~/data/gpdata/ggrebalance/mirror/gpseg1~9~1~51140
            sdw2~sdw2~7052~/data/gpdata/ggrebalance/mirror/gpseg2~10~2~51160
            sdw1~sdw1~7053~/data/gpdata/ggrebalance/mirror/gpseg3~11~3~51160
            sdw2~sdw2~7054~/data/gpdata/ggrebalance/mirror/gpseg4~12~4~51200
            sdw2~sdw2~7055~/data/gpdata/ggrebalance/mirror/gpseg5~13~5~51136
            )
            """
        When initialize a cluster using "/tmp/config_file"
        Then the temporary file "/tmp/config_file" is removed

I was able to bring up a cluster with the required custom config:

Running behave on management scripts...
@ggrebalance_basics
Feature: ggrebalance behave tests

  Scenario: test N. ggrebalance create custom cluster
    Given the database is not running                                              # 9.089s
    And a working directory of the test as '/data/gpdata/ggrebalance'              # 0.001s
    And the user runs command "ssh sdw1 mkdir -p /data/gpdata/ggrebalance/primary" # 0.159s
    And the user runs command "ssh sdw1 mkdir -p /data/gpdata/ggrebalance/mirror"  # 0.193s
    And the user runs command "ssh sdw2 mkdir -p /data/gpdata/ggrebalance/primary" # 0.185s
    And the user runs command "ssh sdw2 mkdir -p /data/gpdata/ggrebalance/mirror"  # 0.193s
    And the user runs command "ssh sdw3 mkdir -p /data/gpdata/ggrebalance/primary" # 0.172s
    And the user runs command "ssh sdw3 mkdir -p /data/gpdata/ggrebalance/mirror"  # 0.174s
    And store text in file "/tmp/config_file"                                      # 0.001s
      """
      TRUSTED_SHELL=ssh
      ENCODING=UNICODE
      SEG_PREFIX=gpseg
      HEAP_CHECKSUM=on
      HBA_HOSTNAMES=0
      QD_PRIMARY_ARRAY=cdw~cdw~7000~/data/gpdata/ggrebalance/gpseg-1~1~-1~0
      declare -a PRIMARY_ARRAY=(
      sdw1~sdw1~7002~/data/gpdata/ggrebalance/primary/gpseg0~2~0~11100
      sdw1~sdw1~7003~/data/gpdata/ggrebalance/primary/gpseg1~3~1~11110
      sdw1~sdw1~7004~/data/gpdata/ggrebalance/primary/gpseg2~4~2~11220
      sdw2~sdw2~7003~/data/gpdata/ggrebalance/primary/gpseg3~5~3~11350
      sdw3~sdw3~7004~/data/gpdata/ggrebalance/primary/gpseg4~6~4~11360
      sdw3~sdw3~7005~/data/gpdata/ggrebalance/primary/gpseg5~7~5~11370
      )
      declare -a MIRROR_ARRAY=(
      sdw3~sdw3~7050~/data/gpdata/ggrebalance/mirror/gpseg0~8~0~51130
      sdw3~sdw3~7051~/data/gpdata/ggrebalance/mirror/gpseg1~9~1~51140
      sdw2~sdw2~7052~/data/gpdata/ggrebalance/mirror/gpseg2~10~2~51160
      sdw1~sdw1~7053~/data/gpdata/ggrebalance/mirror/gpseg3~11~3~51160
      sdw2~sdw2~7054~/data/gpdata/ggrebalance/mirror/gpseg4~12~4~51200
      sdw2~sdw2~7055~/data/gpdata/ggrebalance/mirror/gpseg5~13~5~51136
      )
      """
    When initialize a cluster using "/tmp/config_file"                             # 125.132s
    Then the temporary file "/tmp/config_file" is removed                          # 0.000s

1 feature passed, 0 failed, 27 skipped
1 scenario passed, 0 failed, 841 skipped
11 steps passed, 0 failed, 14721 skipped, 0 undefined
Took 2m15.298s
make: Leaving directory '/home/gpadmin/gpdb_src/gpMgmt'
gpadmin@cdw:~/gpdb_src$ psql postgres -c "select * from gp_segment_configuration;"
 dbid | content | role | preferred_role | mode | status | port | hostname | address |                 datadir
------+---------+------+----------------+------+--------+------+----------+---------+-----------------------------------------
    1 |      -1 | p    | p              | n    | u      | 7000 | cdw      | cdw     | /data/gpdata/ggrebalance/gpseg-1
    4 |       2 | p    | p              | s    | u      | 7004 | sdw1     | sdw1    | /data/gpdata/ggrebalance/primary/gpseg2
   10 |       2 | m    | m              | s    | u      | 7052 | sdw2     | sdw2    | /data/gpdata/ggrebalance/mirror/gpseg2
    5 |       3 | p    | p              | s    | u      | 7003 | sdw2     | sdw2    | /data/gpdata/ggrebalance/primary/gpseg3
   11 |       3 | m    | m              | s    | u      | 7053 | sdw1     | sdw1    | /data/gpdata/ggrebalance/mirror/gpseg3
    2 |       0 | p    | p              | s    | u      | 7002 | sdw1     | sdw1    | /data/gpdata/ggrebalance/primary/gpseg0
    8 |       0 | m    | m              | s    | u      | 7050 | sdw3     | sdw3    | /data/gpdata/ggrebalance/mirror/gpseg0
    3 |       1 | p    | p              | s    | u      | 7003 | sdw1     | sdw1    | /data/gpdata/ggrebalance/primary/gpseg1
    9 |       1 | m    | m              | s    | u      | 7051 | sdw3     | sdw3    | /data/gpdata/ggrebalance/mirror/gpseg1
    6 |       4 | p    | p              | s    | u      | 7004 | sdw3     | sdw3    | /data/gpdata/ggrebalance/primary/gpseg4
   12 |       4 | m    | m              | s    | u      | 7054 | sdw2     | sdw2    | /data/gpdata/ggrebalance/mirror/gpseg4
    7 |       5 | p    | p              | s    | u      | 7005 | sdw3     | sdw3    | /data/gpdata/ggrebalance/primary/gpseg5
   13 |       5 | m    | m              | s    | u      | 7055 | sdw2     | sdw2    | /data/gpdata/ggrebalance/mirror/gpseg5
(13 rows)

gpadmin@cdw:~/gpdb_src$


Therefore, I suggest adding behave tests within the scope of this ticket.

Also, it would be nice to see cluster configurations that require several swaps covered by the tests.
