
fix: resolve URI schemes in ArrowFileSystemFileIO before opening files #2

Closed
smaheshwar-pltr wants to merge 2 commits into main from fix-uri-resolution-in-readers

@smaheshwar-pltr (Owner) commented Apr 16, 2026

Issue

Closes #7

When reading Avro or Parquet files on S3 via ArrowFileSystemFileIO, the readers call io->fs()->OpenInputFile() directly on the underlying Arrow filesystem, bypassing ResolvePath(). This means s3:// prefixes are never stripped. Arrow's S3FileSystem expects bare bucket/key paths, not full URIs.

Consumers using iceberg-cpp to scan tables with S3-backed storage hit:

Invalid: Expected an S3 object path of the form 'bucket/key...', got a URI:
's3://warehouse/default/test_table/metadata/snap-487842974509551922-0-dc0a55d6-5df1-4ffa-a01c-b7481e5c663c.avro'
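To illustrate the mismatch, here is a minimal sketch of the kind of scheme stripping that has to happen before a location reaches Arrow's S3FileSystem. `StripUriScheme` is a hypothetical helper written for this example; it is not the actual `ResolvePath()` implementation in iceberg-cpp, which may handle more cases.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper (an assumption for illustration, not iceberg-cpp code):
// drop a leading "<scheme>://" so that "s3://bucket/key" becomes "bucket/key".
// Locations that contain no "://" are returned unchanged.
inline std::string StripUriScheme(std::string_view location) {
  constexpr std::string_view kSeparator = "://";
  const auto pos = location.find(kSeparator);
  if (pos == std::string_view::npos) {
    return std::string(location);  // already a bare path
  }
  return std::string(location.substr(pos + kSeparator.size()));
}
```

With this helper, `StripUriScheme("s3://warehouse/default/test_table/data.parquet")` yields `warehouse/default/test_table/data.parquet`, which is the bare `bucket/key` form the error message asks for. The sketch treats the first `://` as the scheme separator, which is a simplification.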

PR

When reading or writing Avro/Parquet files on S3 via ArrowFileSystemFileIO, the file path includes the URI scheme (e.g., s3://bucket/key), but Arrow's S3FileSystem expects bare paths (bucket/key). ArrowFileSystemFileIO::ResolvePath() strips the scheme in ReadFile/WriteFile/DeleteFile. The Avro and Parquet readers/writers, however, bypass FileIO and call io->fs()->OpenInputFile() / io->fs()->OpenOutputStream() directly, so ResolvePath() never runs.

This causes S3 file operations to fail when going through the Avro/Parquet code paths (e.g., manifest reads).

Fix

Add OpenInputFile() and OpenOutputStream() public methods to ArrowFileSystemFileIO that apply ResolvePath() before delegating to the underlying Arrow filesystem. Update all four reader/writer call sites (avro_reader, avro_writer, parquet_reader, parquet_writer) to use the new methods instead of calling fs() directly.

This keeps URI resolution encapsulated in one place rather than scattered as manual :// stripping in each reader/writer.
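The shape of the fix can be sketched without Arrow as follows. Everything here is a stand-in chosen for illustration: `FakeBarePathFileSystem` imitates a filesystem that rejects URIs, `FileIOSketch` imitates ArrowFileSystemFileIO, and strings and exceptions replace Arrow's file handles and `Result` types. The real signatures in iceberg-cpp are not shown in this PR description and may differ.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Stand-in for an Arrow filesystem such as S3FileSystem (assumption): it
// accepts only bare "bucket/key" paths and rejects anything with a scheme.
class FakeBarePathFileSystem {
 public:
  std::string OpenInputFile(const std::string& path) const {
    if (path.find("://") != std::string::npos) {
      throw std::invalid_argument("expected 'bucket/key...', got a URI: " + path);
    }
    return "opened:" + path;  // a real filesystem would return a file handle
  }
};

// Stand-in for ArrowFileSystemFileIO (assumption): owns the filesystem and
// exposes an OpenInputFile() that resolves the location before delegating.
class FileIOSketch {
 public:
  explicit FileIOSketch(std::shared_ptr<FakeBarePathFileSystem> fs)
      : fs_(std::move(fs)) {}

  // Direct access to the underlying filesystem; callers that use this
  // skip ResolvePath(), which is the bug described above.
  const std::shared_ptr<FakeBarePathFileSystem>& fs() const { return fs_; }

  // Resolve first, then delegate, so every caller gets the same handling.
  std::string OpenInputFile(const std::string& location) const {
    return fs_->OpenInputFile(ResolvePath(location));
  }

 private:
  // Simplified resolution: strip everything up to and including "://".
  static std::string ResolvePath(const std::string& location) {
    const auto pos = location.find("://");
    return pos == std::string::npos ? location : location.substr(pos + 3);
  }

  std::shared_ptr<FakeBarePathFileSystem> fs_;
};
```

In this sketch, a call site written as `io.fs()->OpenInputFile("s3://bucket/key.parquet")` throws, while `io.OpenInputFile("s3://bucket/key.parquet")` succeeds because the scheme is stripped inside the FileIO wrapper. Keeping the resolution in the wrapper means a reader or writer cannot forget it.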

@smaheshwar-pltr smaheshwar-pltr force-pushed the fix-uri-resolution-in-readers branch 4 times, most recently from c78de0c to 09c1b18 Compare April 16, 2026 22:55
@smaheshwar-pltr smaheshwar-pltr marked this pull request as ready for review April 17, 2026 01:31
Add OpenInputFile() and OpenOutputStream() public methods to
ArrowFileSystemFileIO that apply ResolvePath() before delegating to the
underlying Arrow filesystem. Update all reader/writer call sites (avro,
parquet) and tests to use the new methods instead of calling fs() directly.

This ensures S3 URIs (s3://bucket/key) are properly resolved to bare paths
(bucket/key) regardless of whether the caller goes through FileIO or
accesses the Arrow filesystem directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@smaheshwar-pltr smaheshwar-pltr force-pushed the fix-uri-resolution-in-readers branch from 09c1b18 to 7a1b962 Compare April 17, 2026 06:36
@smaheshwar-pltr (Owner, Author)

Closing: this fix has been superseded by upstream commit fc80e4b (feat(io): add streaming FileIO support, apache#641). ArrowFileSystemFileIO::NewInputFile/NewOutputFile and the new OpenArrowInputStream/OpenArrowOutputStream helpers now call ResolvePath() for the URI fast-path, and all four reader/writer call sites (avro/parquet × read/write) have been updated to use them. No remaining gap.

Development

Successfully merging this pull request may close these issues.

Avro/Parquet readers bypass FileIO::ResolvePath(), causing S3 URI scheme errors
