Skip to content

libpcp: recover from corrupted archives in multi-archive contexts#2586

Open
kurik wants to merge 2 commits into
performancecopilot:mainfrom
kurik:corrupted-archive-recovery
Open

libpcp: recover from corrupted archives in multi-archive contexts#2586
kurik wants to merge 2 commits into
performancecopilot:mainfrom
kurik:corrupted-archive-recovery

Conversation

@kurik
Copy link
Copy Markdown
Contributor

@kurik kurik commented May 7, 2026

When reading multi-archive PCP logs (e.g. Cockpit metrics history), a corrupted volume no longer terminates the entire replay. Instead, corrupted archives are flagged and skipped so that data from healthy volumes can still be read.

  • Add 'corrupted' field to __pmMultiLogCtl to track damaged archives
  • Skip corrupt/unreadable archives during multi-archive context init with a warning rather than failing outright
  • Rebuild shared __pmLogCtl when lost due to failed archive opens
  • On PM_ERR_LOGREC in __pmLogRead, advance to the next (or previous) archive instead of returning an error in multi-archive mode
  • Skip known-corrupted archives in LogChangeToNextArchive() and LogChangeToPreviousArchive()
  • Update QA tests 722 and 1671 for new recovery behaviour

Resolves: RHEL-169855

When reading multi-archive PCP logs (e.g. Cockpit metrics history),
a corrupted volume no longer terminates the entire replay.  Instead,
corrupted archives are flagged and skipped so that data from healthy
volumes can still be read.

- Add 'corrupted' field to __pmMultiLogCtl to track damaged archives
- Skip corrupt/unreadable archives during multi-archive context init
  with a warning rather than failing outright
- Rebuild shared __pmLogCtl when lost due to failed archive opens
- On PM_ERR_LOGREC in __pmLogRead, advance to the next (or previous)
  archive instead of returning an error in multi-archive mode
- Skip known-corrupted archives in LogChangeToNextArchive() and
  LogChangeToPreviousArchive()
- Update QA tests 722 and 1671 for new recovery behaviour

Resolves: RHEL-169855
@kurik kurik requested review from kmcdonell and natoscott May 7, 2026 08:54
@kurik
Copy link
Copy Markdown
Contributor Author

kurik commented May 7, 2026

@natoscott , @kmcdonell May I ask you for a review, please?

Comment thread src/libpcp/src/logutil.c Outdated
"__pmLogRead: corrupted archive \"%s\" vol %d, attempting to skip",
acp->ac_log->name, acp->ac_curvol);
if (acp->ac_cur_log >= 0 && acp->ac_cur_log < acp->ac_num_logs)
acp->ac_log_list[acp->ac_cur_log]->corrupted = 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kurik I wonder if it would be simpler (and safer) to remove the corrupt archive from the list, instead of keeping it around flagging it as bad? If we keep it around there's always a chance it might be used somewhere accidentally when it should not be, whereas if we drop it completely from the list we cannot really go wrong can we?

Rather than flagging corrupted archives via a per-entry "corrupted"
field in __pmMultiLogCtl and skipping over them during archive
transitions, remove them from ac_log_list entirely when corruption
is detected at read time.
This eliminates the "corrupted" field from __pmMultiLogCtl and the
recursive skip logic in LogChangeToNextArchive() and
LogChangeToPreviousArchive(), replacing it with direct removal from
the list (free, memmove, adjust ac_num_logs) followed by an
immediate switch to the appropriate neighbouring archive.
@kmcdonell
Copy link
Copy Markdown
Member

[babble from slack]
@kurik More or less orthogonal to the fixes for libpcp, it may be worth (I'm not sure how, but particularly for RH Support) publicising that pmlogcheck -r or -R if you're brave, repairs the most common forms of archive corruption (the last record in one of the files is incomplete). Since I took my shell script hack and turned it into C code in pmlogcheck I've had 100% success in repairing corrupted archives and preserving as much data as is possible. This is something that happens infrequently, but regularly, in the QA Farm and allows pmlogger-daily to recover from corrupted archives (something that cannot be done in libpcp and something where we don't want to skip the corrupted archive if there is useful information up to the point of corruption). On reflection, I wonder if the pmlogcheck logic should be somehow incorporated into libpcp (some PCP_FOO_OPTION?) and this makes it not-so-orthogonal, so I'll add this comment to the PR so the thought bubble does not get lost.

If they are seeing corruption that is not of the "truncation" form I'd be very interested to know the details of the corruption.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants