fix: resolve URI schemes in ArrowFileSystemFileIO before opening files #2
Closed
smaheshwar-pltr wants to merge 2 commits into
c78de0c to
09c1b18
Compare
Add OpenInputFile() and OpenOutputStream() public methods to ArrowFileSystemFileIO that apply ResolvePath() before delegating to the underlying Arrow filesystem. Update all reader/writer call sites (avro, parquet) and tests to use the new methods instead of calling fs() directly.

This ensures S3 URIs (s3://bucket/key) are properly resolved to bare paths (bucket/key) regardless of whether the caller goes through FileIO or accesses the Arrow filesystem directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
09c1b18 to
7a1b962
Compare
Closing: this fix has been superseded by upstream commit fc80e4b (feat(io): add streaming FileIO support, apache#641).
Draft
Issue
Closes #7
When reading Avro or Parquet files on S3 via `ArrowFileSystemFileIO`, the readers call `io->fs()->OpenInputFile()` directly on the underlying Arrow filesystem, bypassing `ResolvePath()`. This means `s3://` prefixes are never stripped. Arrow's `S3FileSystem` expects bare `bucket/key` paths, not full URIs.

As a result, consumers using iceberg-cpp to scan tables with S3-backed storage hit errors when files are opened.
PR
When reading or writing Avro/Parquet files on S3 via `ArrowFileSystemFileIO`, the file path includes the URI scheme (e.g., `s3://bucket/key`). Arrow's `S3FileSystem` expects bare paths (`bucket/key`). `ArrowFileSystemFileIO::ResolvePath()` handles this stripping in `ReadFile`/`WriteFile`/`DeleteFile`, but the Avro and Parquet readers/writers bypass `FileIO` and call `io->fs()->OpenInputFile()` / `io->fs()->OpenOutputStream()` directly, skipping `ResolvePath()` entirely.

This causes S3 file operations to fail when going through the Avro/Parquet code paths (e.g., manifest reads).
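To make the path mismatch concrete, here is a minimal sketch of scheme stripping using only the standard library. `StripUriScheme` is a hypothetical helper written for this illustration; it is not the actual `ArrowFileSystemFileIO::ResolvePath()` implementation, which may handle more cases.

```cpp
#include <string>

// Hypothetical helper (an assumption for illustration, not iceberg-cpp code):
// drops a leading "<scheme>://" so "s3://bucket/key" becomes "bucket/key".
std::string StripUriScheme(const std::string& location) {
  const std::string separator = "://";
  const std::string::size_type pos = location.find(separator);
  // No separator, or nothing before it: treat the input as a bare path.
  if (pos == std::string::npos || pos == 0) {
    return location;
  }
  // A '/' before the separator means "://" is part of the path itself,
  // not a scheme delimiter, so the input is left unchanged.
  if (location.find('/') < pos) {
    return location;
  }
  return location.substr(pos + separator.size());
}
```

Under these assumptions, a location that already is a bare `bucket/key` path passes through unchanged, so applying the helper more than once is harmless.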
Fix
Add `OpenInputFile()` and `OpenOutputStream()` public methods to `ArrowFileSystemFileIO` that apply `ResolvePath()` before delegating to the underlying Arrow filesystem. Update all four reader/writer call sites (`avro_reader`, `avro_writer`, `parquet_reader`, `parquet_writer`) to use the new methods instead of calling `fs()` directly.

This keeps URI resolution encapsulated in one place rather than scattered as manual `://` stripping in each reader/writer.
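The encapsulation idea can be sketched with standard-library stand-ins. `BarePathFileSystem` and `ResolvingFileIO` are hypothetical names invented for this illustration; they are not Arrow or iceberg-cpp types, and the real signatures and error handling differ.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Stand-in (assumption) for a filesystem that only understands bare
// "bucket/key" paths, as the description says of Arrow's S3FileSystem.
class BarePathFileSystem {
 public:
  void Put(const std::string& path, std::string contents) {
    files_[path] = std::move(contents);
  }
  // Returns the stored contents, or nullptr when the path is unknown.
  const std::string* OpenInputFile(const std::string& path) const {
    const auto it = files_.find(path);
    return it == files_.end() ? nullptr : &it->second;
  }

 private:
  std::map<std::string, std::string> files_;
};

// Stand-in (assumption) for a FileIO wrapper: callers hand it full URIs and
// it resolves them in one place before delegating to the filesystem.
class ResolvingFileIO {
 public:
  explicit ResolvingFileIO(std::shared_ptr<BarePathFileSystem> fs)
      : fs_(std::move(fs)) {}

  // Direct access skips resolution; this is the bypass the PR describes.
  const std::shared_ptr<BarePathFileSystem>& fs() const { return fs_; }

  // Resolves the location first, then delegates.
  const std::string* OpenInputFile(const std::string& location) const {
    return fs_->OpenInputFile(ResolvePath(location));
  }

 private:
  // Simplified resolution: strip everything up to and including "://".
  static std::string ResolvePath(const std::string& location) {
    const std::string::size_type pos = location.find("://");
    return pos == std::string::npos ? location : location.substr(pos + 3);
  }

  std::shared_ptr<BarePathFileSystem> fs_;
};
```

In this sketch, a file stored under `bucket/key` is not found by `io.fs()->OpenInputFile("s3://bucket/key")`, while `io.OpenInputFile("s3://bucket/key")` finds it because resolution happens inside the wrapper.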