Fix skip_records over-counting when partial record precedes num_rows page skip #9374
jonded94 wants to merge 7 commits into apache:main
Conversation
Thanks for tracking this down @jonded94! Rows spanning page boundaries have been the bane of my existence 😬. I want to give this a good look this afternoon.
I've looked at the code and the test, and I'm not quite sure this is the correct fix. Given the data (…), maybe something like:

```rust
pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
    let mut remaining_records = num_records;
    let mut all_v2_pages = true; // << add this boolean
    while remaining_records != 0 {
        if self.num_buffered_values == self.num_decoded_values {
            ...
            if let Some(rows) = rows {
                // this short circuits decoding levels, but can be incorrect
                // if V1 pages are also present
                if rows <= remaining_records && all_v2_pages { // << add to this test
                    self.page_reader.skip_next_page()?;
                    remaining_records -= rows;
                    continue;
                }
            } else {
                all_v2_pages = false; // << set to false if we encounter V1 page
            }
```
Hey @etseidl 👋 Thanks for your review! I'm not super well versed around data page decoding, but I agree that after skipping 2 records (…). I'm really sorry I can't invest more time into that right now, but I hope what I committed is now finally correct!
```rust
// offset index), its records are self-contained, so the
// partial from the previous page is complete at this boundary.
if let Some(decoder) = self.rep_level_decoder.as_mut() {
    if decoder.flush_partial() {
```
I think this is correct. If all pages are V2, then has_partial will never be true, because V2 pages must start on a new record (R=0). If all pages are V1 this will never trigger because num_rows will be None. This case only applies when switching from V1 to V2, in which case it's appropriate to call flush_partial because, as said above, V2 pages must start at a new record.
So that means that this bug requires a parquet file with a column chunk that has both V1 and V2 pages?
I made such a file (that triggers the error with a RowSelection): 🤯 #9395
> If all pages are V2, then has_partial will never be true, because V2 pages must start on a new record (R=0). If all pages are V1 this will never trigger because num_rows will be None. This case only applies when switching from V1 to V2
I know nothing about v1 vs. v2 parquet, but the unit test says it only uses v2 pages, with the comment
> parquet-rs writes only v2 data pages and no offset index is loaded, so at_record_boundary() returns false for non-last page
So it sounds like there are multiple ways for decoding to end in the middle of a page (leaving has_partial=true), but the correct fix in all cases is to flush_partial upon reaching a new v2 page (because the page must start on a new record)?
> I know nothing about v1 vs. v2 parquet, but the unit test says it only uses v2 pages, with the comment
The original comment was for an earlier revision of the test which mixed V1 and V2 pages.
> So it sounds like there are multiple ways for decoding to end in the middle of a page (leaving has_partial=true), but the correct fix in all cases is to flush_partial upon reaching a new v2 page (because the page must start on a new record)?
The issue is decoding hitting the end of a page. There's no way of knowing if the page ends at a record boundary (unless the page is the last one in a column chunk), so the has_partial flag is set and the number of records read is off by one. We should be able to reason that if V2 headers or the offset index are present we are at a record boundary, but unfortunately earlier writers (including the one in this crate) did not universally follow the spec. As a consequence, at_record_boundary, which is used to set has_record_delimiter, is only true at a column chunk boundary (i.e. it was decided to not use V2/offset index as a guarantor of page boundaries being record boundaries).
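To make the partial-record mechanics described above concrete, here is a minimal standalone sketch (not the arrow-rs implementation; `count_records` is a hypothetical helper): a record starts wherever the repetition level is 0, so a record is only provably finished once the *next* level-0 entry — or the end of the column chunk — is seen. A page's levels alone can never prove its last record is complete, which is exactly why a `has_partial` flag must be carried across page boundaries.

```rust
/// Counts records that are provably complete within one page's repetition
/// levels, and reports whether a trailing record may continue onto the
/// next page (the "has_partial" situation).
fn count_records(rep_levels: &[i16]) -> (usize, bool) {
    let mut complete = 0;
    let mut in_record = false;
    for &r in rep_levels {
        if r == 0 {
            if in_record {
                // The start of a new record proves the previous one ended.
                complete += 1;
            }
            in_record = true;
        }
    }
    // If any record was started, the last one is still "partial" from this
    // page's point of view: only the next page (or the end of the column
    // chunk) can confirm it is complete.
    (complete, in_record)
}

fn main() {
    // Two lists of two values each: [a, b], [c, d] -> rep levels 0,1,0,1.
    // Only the first list is provably complete; the second might continue
    // with more `1` levels on the next page.
    let (complete, has_partial) = count_records(&[0, 1, 0, 1]);
    assert_eq!(complete, 1);
    assert!(has_partial);
    println!("complete={complete}, has_partial={has_partial}");
}
```

Under the "V2 pages start at a record boundary" assumption, seeing a new V2 page plays the same role as the next level-0 entry here: it retroactively completes the pending partial record.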
The current fix, however, does make that assumption (as does the proposed alternate fix). I had proposed simply not allowing the whole-page-skip optimization in the presence of repetition, and instead continuing to decode repetition levels, but that solution doesn't work when whole pages have not been fetched based on the offset index (found this when an async reader test of row skipping failed). I've put up a kludgey fix in #9406 which at least passes all tests, but is likely still too brittle.
I do think we likely need to add a test (if one doesn't already exist), for correctly skipping records when pages do not start at a record boundary (as is still allowed for V1 pages with no offset index).
As I said earlier, page spanning records are the bane of my existence.
I haven't been following the conversation very closely, but IIRC records shouldn't be split across any pages, regardless of V1 or V2. The issue is that many older writers, including arrow-rs, did occasionally do this, as the spec at the time was ambiguous. IIRC this was independent of V1 vs V2.
IMO enough time has probably passed that we can just assume that records aren't split across pages, and error on encountering a page with a non-zero initial repetition level. Potentially with some flag to disable this, that also disables things like filter pushdown
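A minimal sketch of the check suggested here (the helper name is hypothetical, not an existing arrow-rs API): under the "records aren't split across pages" assumption, a page whose first repetition level is non-zero begins mid-record and could simply be rejected.

```rust
/// Rejects data pages that begin in the middle of a record, i.e. whose
/// first repetition level is non-zero. Under the assumption that writers
/// no longer split records across pages, such a page indicates either a
/// legacy writer or corruption.
fn check_page_record_boundary(rep_levels: &[i16]) -> Result<(), String> {
    match rep_levels.first() {
        Some(&r) if r != 0 => Err(format!(
            "data page begins mid-record: initial repetition level is {r}, expected 0"
        )),
        // An empty page, or one starting a new record, is fine.
        _ => Ok(()),
    }
}

fn main() {
    assert!(check_page_record_boundary(&[0, 1, 1]).is_ok());
    assert!(check_page_record_boundary(&[1, 0]).is_err());
    println!("boundary checks passed");
}
```

An opt-out flag, as suggested, would skip this check and also disable optimizations (like filter pushdown) that rely on the boundary assumption.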
> IMO enough time has probably passed that we can just assume that records aren't split across pages
Datapoint: The file that led to the original error message was written with arrow-rs version 57.1.0: parquet viewer and more verbose debug dumps. In any case, at least at my company we probably have a few PiB of data written with this or an even earlier version.
> Datapoint: The file that led to the original error message was written with arrow-rs version 57.1.0: parquet viewer and #9370 (comment). In any case, at least at my company we probably have a few PiB of data written with this or an even earlier version.
To be fair, neither the test data, nor almost certainly your data, have rows that span pages. Rather, it's the assumption that the data may contain such rows that is leading to the incorrect skip behavior.
> I haven't been following the conversation very closely, but IIRC records shouldn't be split across any pages, regardless of V1 or V2. The issue is that many older writers, including arrow-rs, did occasionally do this, as the spec at the time was ambiguous. IIRC this was independent of V1 vs V2.
Thanks for chiming in @tustvold. The last time this was kicked around in the Parquet community (see apache/parquet-format#244) the consensus seemed to me to be that when writing files with V1 headers and no offset index, it was still ok to have rows span multiple pages. In fact, pyarrow as of 20.0.0 will still produce such files.
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
run benchmark arrow_reader

I would just like to make sure this isn't causing some major regression in performance (I don't think it is) and then it looks good to me. It would be nice to add an "end-to-end" reproducer with an actual parquet file -- I made one for your consideration in #9395
🤖

🤖: Benchmark completed Details
Some of the benchmarks look like they got slower, but it might be noise. I will run them again to see.

run benchmark arrow_reader

🤖

🤖: Benchmark completed Details
Actually, what I stated (and tested for) here was wrong: this isn't about mixing v1 and v2 data pages, but can be reproduced with v2 data pages only. That's what my new commit 365bd9a is about. Additionally, I verified that the original file I had this issue with also only contains v2 data pages. Please read this comment for a very detailed overview of all data pages in the affected row group: #9370 (comment)

I also added a correctly (currently panicking) test in PR #9399 that I validated to no longer panic once the bugfix of this PR is introduced.

Regarding the actual bugfix introduced in this PR: it seems to lead to some quite hefty performance regressions, especially in the benchmarks launched by @alamb
Hey, unfortunately this is a bit out of my skill set/expertise, but here is a slightly different fix: I validated that the bugfix still lets the test in this PR and also my end-to-end regression test in #9399 succeed (i.e., for the latter I had to remove the …). If you think the new bugfix would make more sense, I'll commit it.

Claude output about why this new bugfix is potentially faster:
Hmm, I had assumed (…). I think maybe we should only trust (…). Something like:

```rust
// If page has less rows than the remaining records to
// be skipped, skip entire page
// If no repetition levels, num_levels == num_rows
// We cannot trust V2 pages to always respect record boundaries, so we'll only
// perform this short-circuit when no lists are present.
if self.rep_level_decoder.is_none() {
    if let Some(rows) = metadata.num_levels {
        if rows <= remaining_records {
            self.page_reader.skip_next_page()?;
            remaining_records -= rows;
            continue;
        }
    }
}
```
# Which issue does this PR close?

Related to #9374 #9370

# Rationale for this change

In #9374, I only came up with a unit test that didn't really throw the "Not all children array length are the same!" error that issue #9370 is about. In this PR, tests are introduced that reproduce this issue end-to-end and are expected to still panic. In test `test_list_struct_page_boundary_desync_produces_length_mismatch`, we get exactly the error message the original issue is about, while `test_row_selection_list_column_v2_page_boundary_skip` shows a slightly different error message.

# What changes are included in this PR?

Two tests are introduced, which currently are expected to panic:

- test_row_selection_list_column_v2_page_boundary_skip
- test_list_struct_page_boundary_desync_produces_length_mismatch

# Are there any user-facing changes?

No
…state-after-skipping-all-records
I merged up from main and "should_panic"d the tests. I need to devote more time to studying this PR (and trying to reproduce the benchmark differences locally) tomorrow.
This is great work @jonded94 |
Hmm, never mind. This causes some async tests to fail (I suspect due to skipping the fetching/reading of pages via the page indexes). I'm testing a different approach in #9406. |
I think we shouldn't be thrown off by the benches. Compare the latest bench of this PR to the same from #9406 and notice how the (…)
# Which issue does this PR close?

RowSelection: #9370.

# Rationale for this change

The bug occurs when using RowSelection with nested types (like List) when: (…)

The issue was in `arrow-rs/parquet/src/column/reader.rs:287-382` (the `skip_records` function).

Root cause: When `skip_records` completed successfully after crossing page boundaries, the `has_partial` state in the `RepetitionLevelDecoder` could incorrectly remain true. This happened when: (…)

For a more descriptive explanation, look here: #9370 (comment)

# What changes are included in this PR?

Added code at the end of `skip_records` to reset the partial record state when all requested records have been successfully skipped.

This ensures that after `skip_records` completes, we're at a clean record boundary with no lingering partial record state, fixing the array length mismatch in `StructArrayReader`.
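As a rough illustration of the shape of such a fix (a self-contained sketch, not the actual arrow-rs patch; the struct below is a hypothetical stand-in for the real `RepetitionLevelDecoder`): once the skip completes, the lingering partial-record flag is simply taken and cleared.

```rust
/// Stand-in for the real decoder: only the partial-record flag is modeled.
struct RepLevelDecoderStub {
    has_partial: bool,
}

impl RepLevelDecoderStub {
    /// Clears any pending partial-record state, returning whether one was
    /// pending. Calling this after a successful skip ensures the next read
    /// starts at a clean record boundary.
    fn flush_partial(&mut self) -> bool {
        std::mem::take(&mut self.has_partial)
    }
}

fn main() {
    let mut decoder = RepLevelDecoderStub { has_partial: true };
    assert!(decoder.flush_partial());  // a partial record was pending
    assert!(!decoder.flush_partial()); // state is now clean
    println!("partial state cleared");
}
```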
# Are these changes tested?

Commit 365bd9a introduces a test showcasing this issue with v2 data pages only at the unit-test level. PR #9399 could be used to showcase the issue in an end-to-end way.

Previously wrong assumption, which held that it had to do with mixing v1 and v2 data pages: