libpcp: recover from corrupted archives in multi-archive contexts #2586
kurik wants to merge 2 commits into
When reading multi-archive PCP logs (e.g. Cockpit metrics history), a corrupted volume no longer terminates the entire replay. Instead, corrupted archives are flagged and skipped so that data from healthy volumes can still be read.

- Add 'corrupted' field to __pmMultiLogCtl to track damaged archives
- Skip corrupt/unreadable archives during multi-archive context init with a warning rather than failing outright
- Rebuild shared __pmLogCtl when lost due to failed archive opens
- On PM_ERR_LOGREC in __pmLogRead, advance to the next (or previous) archive instead of returning an error in multi-archive mode
- Skip known-corrupted archives in LogChangeToNextArchive() and LogChangeToPreviousArchive()
- Update QA tests 722 and 1671 for new recovery behaviour

Resolves: RHEL-169855
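The flag-and-skip behaviour described above can be illustrated with a minimal sketch. It is not the libpcp code: `demo_entry_t` and `demo_next_healthy()` are hypothetical stand-ins, and the real `__pmMultiLogCtl` layout and the signatures of `LogChangeToNextArchive()` / `LogChangeToPreviousArchive()` are not reproduced here. The sketch uses a simple loop to find the nearest archive that is not flagged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of the multi-archive list. */
typedef struct {
    const char *name;
    int corrupted;      /* non-zero once the archive is known to be damaged */
} demo_entry_t;

/*
 * Starting from index cur, walk in direction step (+1 forwards, -1 backwards)
 * and return the index of the first archive not flagged as corrupted,
 * or -1 when no healthy archive remains in that direction.
 */
int
demo_next_healthy(const demo_entry_t *list, int num, int cur, int step)
{
    assert(list != NULL && (step == 1 || step == -1));

    for (int i = cur + step; i >= 0 && i < num; i += step) {
        if (!list[i].corrupted)
            return i;
    }
    return -1;
}
```

A caller reading forwards would pass `step = 1` and treat `-1` as end of data; a caller reading backwards would pass `step = -1`.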
@natoscott, @kmcdonell May I ask you for a review, please?
        "__pmLogRead: corrupted archive \"%s\" vol %d, attempting to skip",
        acp->ac_log->name, acp->ac_curvol);
    if (acp->ac_cur_log >= 0 && acp->ac_cur_log < acp->ac_num_logs)
        acp->ac_log_list[acp->ac_cur_log]->corrupted = 1;
@kurik I wonder if it would be simpler (and safer) to remove the corrupt archive from the list, instead of keeping it around and flagging it as bad? If we keep it around there's always a chance it might be used somewhere accidentally when it should not be, whereas if we drop it completely from the list we cannot really go wrong, can we?
Rather than flagging corrupted archives via a per-entry "corrupted" field in __pmMultiLogCtl and skipping over them during archive transitions, remove them from ac_log_list entirely when corruption is detected at read time. This eliminates the "corrupted" field from __pmMultiLogCtl and the recursive skip logic in LogChangeToNextArchive() and LogChangeToPreviousArchive(), replacing it with direct removal from the list (free, memmove, adjust ac_num_logs) followed by an immediate switch to the appropriate neighbouring archive.
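The removal steps named above (free, memmove, adjust the count, then switch to a neighbour) can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: `demo_log_t`, `demo_ctx_t` and `demo_drop_archive()` are hypothetical, with fields that only play the role of `ac_log_list`, `ac_num_logs` and `ac_cur_log`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the libpcp structures; not the real definitions. */
typedef struct {
    char *name;
} demo_log_t;

typedef struct {
    demo_log_t **log_list;  /* plays the role of ac_log_list */
    int num_logs;           /* plays the role of ac_num_logs */
    int cur_log;            /* plays the role of ac_cur_log */
} demo_ctx_t;

/*
 * Remove entry idx from the list, then pick the neighbouring archive:
 * the following one when reading forwards, the preceding one otherwise.
 * Returns the new current index, or -1 if there is no such neighbour.
 */
int
demo_drop_archive(demo_ctx_t *ctx, int idx, int forward)
{
    assert(ctx != NULL && idx >= 0 && idx < ctx->num_logs);

    /* release the damaged entry */
    free(ctx->log_list[idx]->name);
    free(ctx->log_list[idx]);

    /* close the gap by shifting the tail of the list down one slot */
    memmove(&ctx->log_list[idx], &ctx->log_list[idx + 1],
            (size_t)(ctx->num_logs - idx - 1) * sizeof(ctx->log_list[0]));
    ctx->num_logs--;

    /* after the shift the old "next" entry sits at idx, the old "previous" at idx - 1 */
    if (forward)
        ctx->cur_log = (idx < ctx->num_logs) ? idx : -1;
    else
        ctx->cur_log = (idx > 0) ? idx - 1 : -1;
    return ctx->cur_log;
}
```

Because the entry is gone from the list after the call, no later code path can select it by accident, which is the safety argument made in the review comment.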
[babble from slack] If they are seeing corruption that is not of the "truncation" form I'd be very interested to know the details of the corruption.