ADBDEV-8057: Fix pg_rewind leaving replica on old timeline#1970
Merged
silent-observer merged 3 commits intoadb-7.2.0from Sep 19, 2025
Merged
ADBDEV-8057: Fix pg_rewind leaving replica on old timeline#1970silent-observer merged 3 commits intoadb-7.2.0from
silent-observer merged 3 commits intoadb-7.2.0from
Conversation
cc0a1c8 to
df9dd60
Compare
RekGRpth
reviewed
Sep 4, 2025
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
Is this not the same case as described in the first commit? Add the appropriate description in the second commit. |
KnightMurloc
reviewed
Sep 8, 2025
|
If switches to the first commit, the tests pass with unchanged to pg_rewind. |
silent-observer
pushed a commit
that referenced
this pull request
Sep 15, 2025
It's entirely possible that the calculated "point of divergence" is before the end of WAL on the target, because when we calculate the point of divergence, we assume that the target's WAL extends to infinity on its current timeline. It might be more clear to calculate the end of WAL first, and if the target is a direct ancestor of the source, consider the end of target WAL as the point of divergence. However, it doesn't change the result: the target is a direct ancestor of the source, and doesn't need to be rewound. Add tests to cover the "no rewind required" cases, both when the target's end of WAL is exactly at the point of divergence, which is what the removed assertion assumed (case B in the test), and the case where the assertion failed (case A). Changes from original patch: test was made more general to prepare for more detailed tests in the next commits. In particular, we pg_rewind from a running source node, not from a stopped one, and the source node is passed as an argument. This doesn't affect the test validity for this patch. Discussion: https://www.postgresql.org/message-id/18575-1b9b3b60e2a480de@postgresql.org Ticket: ADBDEV-8057
silent-observer
pushed a commit
that referenced
this pull request
Sep 15, 2025
Alternatively, perhaps just copying the history file would be enough? Changes from original patch: - Move sourceHistfile initialization to before rewind_parseTimeLineHistory, since rewind_parseTimeLineHistory actually modifies it. - Don't copy timeline history file when dry_run is specified. Dry run shouldn't make changes that can potentially break the target server when applied incorrectly. - Fix 107_empty_conf.pl test to preserve primary_conninfo setting. Ticket: ADBDEV-8057
silent-observer
added a commit
that referenced
this pull request
Sep 15, 2025
In one specific edge case, when the source server makes a checkpoint exactly after switching to a new timeline, switchpoint LSN exactly matches latest checkpoint LSN. This can happen, for example, if the source has just performed archive recovery. Then setting minRecoveryPoint to the checkpoint LSN in pg_rewind is illegal and causes an error at startup, since minRecoveryPoint actually means the minimum allowed end of WAL, and we want that to be after the switch occurs, not at the end of the previous timeline. As a fix, add +1 to the checkpoint LSN when setting minRecoveryPoint, which now correctly handles this case. Also add a test scenario for the source node that has performed recovery. Ticket: ADBDEV-8057
df9dd60 to
506a166
Compare
Author
|
I split the patch into three commits now.
|
silent-observer
pushed a commit
that referenced
this pull request
Sep 16, 2025
Alternatively, perhaps just copying the history file would be enough? Changes from original patch: - Refactor sourceHistfile to be an out parameter instead of global state. - Move sourceHistfile initialization to before rewind_parseTimeLineHistory, since rewind_parseTimeLineHistory actually modifies it. - Don't copy timeline history file when dry_run is specified. Dry run shouldn't make changes that can potentially break the target server when applied incorrectly. - Fix 107_empty_conf.pl test to preserve primary_conninfo setting. Ticket: ADBDEV-8057
silent-observer
added a commit
that referenced
this pull request
Sep 16, 2025
In one specific edge case, when the source server makes a checkpoint exactly after switching to a new timeline, switchpoint LSN exactly matches latest checkpoint LSN. This can happen, for example, if the source has just performed archive recovery. Then setting minRecoveryPoint to the checkpoint LSN in pg_rewind is illegal and causes an error at startup, since minRecoveryPoint actually means the minimum allowed end of WAL, and we want that to be after the switch occurs, not at the end of the previous timeline. As a fix, add +1 to the checkpoint LSN when setting minRecoveryPoint, which now correctly handles this case. Also add a test scenario for the source node that has performed recovery. Ticket: ADBDEV-8057
506a166 to
c06e2a8
Compare
silent-observer
pushed a commit
that referenced
this pull request
Sep 16, 2025
Alternatively, perhaps just copying the history file would be enough? Changes from original patch: - Refactor sourceHistfile to be an out parameter instead of global state. - Move sourceHistfile initialization to before rewind_parseTimeLineHistory, since rewind_parseTimeLineHistory actually modifies it. - Don't copy timeline history file when dry_run is specified. Dry run shouldn't make changes that can potentially break the target server when applied incorrectly. - Fix 107_empty_conf.pl test to preserve primary_conninfo setting. Ticket: ADBDEV-8057
silent-observer
added a commit
that referenced
this pull request
Sep 16, 2025
In one specific edge case, when the source server makes a checkpoint exactly after switching to a new timeline, switchpoint LSN exactly matches latest checkpoint LSN. This can happen, for example, if the source has just performed archive recovery. Then setting minRecoveryPoint to the checkpoint LSN in pg_rewind is illegal and causes an error at startup, since minRecoveryPoint actually means the minimum allowed end of WAL, and we want that to be after the switch occurs, not at the end of the previous timeline. As a fix, add +1 to the checkpoint LSN when setting minRecoveryPoint, which now correctly handles this case. Also add a test scenario for the source node that has performed recovery. Ticket: ADBDEV-8057
c06e2a8 to
fbc4fec
Compare
silent-observer
added a commit
that referenced
this pull request
Sep 18, 2025
In one specific edge case, when the source server makes a checkpoint exactly after switching to a new timeline, switchpoint LSN exactly matches latest checkpoint LSN. This can happen, for example, if the source has just performed archive recovery. Then setting minRecoveryPoint to the checkpoint LSN in pg_rewind is illegal and causes an error at startup, since minRecoveryPoint actually means the minimum allowed end of WAL, and we want that to be after the switch occurs, not at the end of the previous timeline. As a fix, add +1 to the checkpoint LSN when setting minRecoveryPoint, which now correctly handles this case. Also add a test scenario for the source node that has performed recovery. Ticket: ADBDEV-8057
fbc4fec to
b713bf6
Compare
KnightMurloc
previously approved these changes
Sep 18, 2025
This comment was marked as resolved.
This comment was marked as resolved.
Member
pgindent these your changes |
Author
Yes, changes to 107_empty_conf.pl only fix config line being accidentally deleted, they don't change the nature of the test. 010_target_is_ancestor.pl is the main test for this patch. |
It's entirely possible that the calculated "point of divergence" is before the end of WAL on the target, because when we calculate the point of divergence, we assume that the target's WAL extends to infinity on its current timeline. It might be more clear to calculate the end of WAL first, and if the target is a direct ancestor of the source, consider the end of target WAL as the point of divergence. However, it doesn't change the result: the target is a direct ancestor of the source, and doesn't need to be rewound. Add tests to cover the "no rewind required" cases, both when the target's end of WAL is exactly at the point of divergence, which is what the removed assertion assumed (case B in the test), and the case where the assertion failed (case A). Changes from original patch: test was made more general to prepare for more detailed tests in the next commits. In particular, we pg_rewind from a running source node, not from a stopped one, and the source node is passed as an argument. This doesn't affect the test validity for this patch. Discussion: https://www.postgresql.org/message-id/18575-1b9b3b60e2a480de@postgresql.org Ticket: ADBDEV-8057
silent-observer
pushed a commit
that referenced
this pull request
Sep 19, 2025
Alternatively, perhaps just copying the history file would be enough? Changes from original patch: - Refactor sourceHistfile to be an out parameter instead of global state. - Move sourceHistfile initialization to before rewind_parseTimeLineHistory, since rewind_parseTimeLineHistory actually modifies it. - Don't copy timeline history file when dry_run is specified. Dry run shouldn't make changes that can potentially break the target server when applied incorrectly. - Fix 107_empty_conf.pl test to preserve primary_conninfo setting. Ticket: ADBDEV-8057
silent-observer
added a commit
that referenced
this pull request
Sep 19, 2025
In one specific edge case, when the source server makes a checkpoint exactly after switching to a new timeline, switchpoint LSN exactly matches latest checkpoint LSN. This can happen, for example, if the source has just performed archive recovery. Then setting minRecoveryPoint to the checkpoint LSN in pg_rewind is illegal and causes an error at startup, since minRecoveryPoint actually means the minimum allowed end of WAL, and we want that to be after the switch occurs, not at the end of the previous timeline. As a fix, add +1 to the checkpoint LSN when setting minRecoveryPoint, which now correctly handles this case. Also add a test scenario for the source node that has performed recovery. Ticket: ADBDEV-8057
b713bf6 to
cc50848
Compare
Alternatively, perhaps just copying the history file would be enough? Changes from original patch: - Refactor sourceHistfile to be an out parameter instead of global state. - Move sourceHistfile initialization to before rewind_parseTimeLineHistory, since rewind_parseTimeLineHistory actually modifies it. - Don't copy timeline history file when dry_run is specified. Dry run shouldn't make changes that can potentially break the target server when applied incorrectly. - Fix 107_empty_conf.pl test to preserve primary_conninfo setting. Ticket: ADBDEV-8057
In one specific edge case, when the source server makes a checkpoint exactly after switching to a new timeline, switchpoint LSN exactly matches latest checkpoint LSN. This can happen, for example, if the source has just performed archive recovery. Then setting minRecoveryPoint to the checkpoint LSN in pg_rewind is illegal and causes an error at startup, since minRecoveryPoint actually means the minimum allowed end of WAL, and we want that to be after the switch occurs, not at the end of the previous timeline. As a fix, add +1 to the checkpoint LSN when setting minRecoveryPoint, which now correctly handles this case. Also add a test scenario for the source node that has performed recovery. Ticket: ADBDEV-8057
cc50848 to
a962757
Compare
RekGRpth
approved these changes
Sep 19, 2025
KnightMurloc
approved these changes
Sep 19, 2025
silent-observer
pushed a commit
that referenced
this pull request
Sep 19, 2025
It's entirely possible that the calculated "point of divergence" is before the end of WAL on the target, because when we calculate the point of divergence, we assume that the target's WAL extends to infinity on its current timeline. It might be more clear to calculate the end of WAL first, and if the target is a direct ancestor of the source, consider the end of target WAL as the point of divergence. However, it doesn't change the result: the target is a direct ancestor of the source, and doesn't need to be rewound. Add tests to cover the "no rewind required" cases, both when the target's end of WAL is exactly at the point of divergence, which is what the removed assertion assumed (case B in the test), and the case where the assertion failed (case A). Changes from original patch: test was made more general to prepare for more detailed tests in the next commits. In particular, we pg_rewind from a running source node, not from a stopped one, and the source node is passed as an argument. This doesn't affect the test validity for this patch. Discussion: https://www.postgresql.org/message-id/18575-1b9b3b60e2a480de@postgresql.org Ticket: ADBDEV-8057
silent-observer
pushed a commit
that referenced
this pull request
Sep 19, 2025
Alternatively, perhaps just copying the history file would be enough? Changes from original patch: - Refactor sourceHistfile to be an out parameter instead of global state. - Move sourceHistfile initialization to before rewind_parseTimeLineHistory, since rewind_parseTimeLineHistory actually modifies it. - Don't copy timeline history file when dry_run is specified. Dry run shouldn't make changes that can potentially break the target server when applied incorrectly. - Fix 107_empty_conf.pl test to preserve primary_conninfo setting. Ticket: ADBDEV-8057
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Remove bogus assertion in pg_rewind
It's entirely possible that the calculated "point of divergence" is
before the end of WAL on the target, because when we calculate the
point of divergence, we assume that the target's WAL extends to
infinity on its current timeline. It might be more clear to calculate
the end of WAL first, and if the target is a direct ancestor of the
source, consider the end of target WAL as the point of
divergence. However, it doesn't change the result: the target is a
direct ancestor of the source, and doesn't need to be rewound.
Add tests to cover the "no rewind required" cases, both when the
target's end of WAL is exactly at the point of divergence, which is
what the removed assertion assumed (case B in the test), and the case
where the assertion failed (case A).
Changes from original patch: test was made more general to prepare for more
detailed tests in the next commits. In particular, we pg_rewind from a
running source node, not from a stopped one, and the source node is passed as
an argument. This doesn't affect the test validity for this patch.
Discussion: https://www.postgresql.org/message-id/18575-1b9b3b60e2a480de@postgresql.org
Alternatively, perhaps just copying the history file would be enough?
Changes from original patch:
since rewind_parseTimeLineHistory actually modifies it.
shouldn't make changes that can potentially break the target server when
applied incorrectly.
Fix pg_rewind edge case when checkpoint is at switchpoint
In one specific edge case, when the source server makes a checkpoint exactly
after switching to a new timeline, switchpoint LSN exactly matches latest
checkpoint LSN. This can happen, for example, if the source has just performed
archive recovery. Then setting minRecoveryPoint to the checkpoint LSN in
pg_rewind is illegal and causes an error at startup, since minRecoveryPoint
actually means the minimum allowed end of WAL, and we want that to be after
the switch occurs, not at the end of the previous timeline.
As a fix, add +1 to the checkpoint LSN when setting minRecoveryPoint, which
now correctly handles this case. Also add a test scenario for the source node
that has performed recovery.
Note: Do not squash to preserve authorship