#724 feat: add functions for add and replacing data directly with datafiles#723
Open
agaddis02 wants to merge 4 commits intoapache:mainfrom
Open
#724 feat: add functions for add and replacing data directly with datafiles#723agaddis02 wants to merge 4 commits intoapache:mainfrom
agaddis02 wants to merge 4 commits intoapache:mainfrom
Conversation
…, add explicit warning in docstring
Author
|
@rockwotj, @zeroshade & @subkanthi, this pull request has been updated based on the comments left on the other PR: #710. Let me know if anything else needs to be added or changed, I believe this addresses the main points of concerns in terms of adding test, explicitly warning users that this can be dangerous, and attempting to curb that danger as much as possible by validating all of the possible items we can (with out scanning the file). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Context
If you want to write your own parquet files and only use iceberg to handle the metadata, you are only left with the option (for the most part) of leveraging the
ReplaceDataFilesfunction.This function takes in a list of existing files and a list of new file paths to override that previous data with.
This function works fine for the most part, but the function includes a scan in it which means it's not actually taking your word that your new parquet files match the table schema.
This scan proves to be problematic in some cases when you are writing files very fast and leveraging multipart uploads. You know the location of all files, know they are valid parquet files, but the commit has the possibility to return an error because at the time of commit the file might not be fully available.
the error looks something like this at commit time:
failed to replace data files: error encountered during file conversion: parquet: could not read 8 bytes from end of file.Solution
We have tested this out in vendor code and opened a fork that adds a new function.
ReplaceDataFilesis scanning your file paths to try and ensure the schema of said files match the schema of the table you are inputting them into.We, and I would assume a lot of people writing their own parquet files, don't need this. Our ingestion framework guarantees we will never get a incorrect parquet file, and we also have access to our Parquet Schema and Arrow Schema for the entirety of the ingestion.
So I can build data files directly and would much rather just pass my own datafiles to this function, as I know the files will eventually be available and they will be correct. all this is doing is telling the metadata where to look at said file, there is no real harm in committing before that file is actually available unless you are querying it right away and it happens to not be available.
This also speeds up the commit time tremendously as this library doesn't need to go through scan all of the files for every single commit.