feat: add S3/OSS/HTTP extension support for cloud object storage#179

Open
BingqingLyu wants to merge 15 commits into alibaba:main from BingqingLyu:s3_oss_extension

Conversation

@BingqingLyu
Collaborator

PR Title: feat: add S3/OSS/HTTP extension support for cloud object storage
Reviewers:

Related Issues

closes #127

What does this PR do?

This PR adds comprehensive cloud object storage support to NeuG through a new S3 extension module. The extension provides read-only access to files stored on AWS S3, Alibaba Cloud OSS, and plain HTTP/HTTPS endpoints, enabling users to directly query data files from cloud storage without local downloads.

What changes in this PR?

New Features

  1. S3/OSS Filesystem Support (extension/s3/src/s3_filesystem.cc, include/s3_filesystem.h)

    • Implements S3FileSystem based on Arrow's S3 filesystem integration
    • Supports AWS S3, Alibaba Cloud OSS
    • Configurable region, endpoint, credentials (explicit keys, environment variables)
    • Anonymous credential support for public buckets
    • OSS endpoint auto-detection and region resolution
    • File glob pattern expansion
  2. HTTP/HTTPS Filesystem Support (extension/s3/src/http_filesystem.cc, include/http_filesystem.h)

    • Implements HTTPFileSystem using libcurl for direct HTTP/HTTPS file access
    • HTTP Range request support for efficient partial reads
    • SSL/TLS verification with configurable options
    • Timeout and retry configuration
  3. Configuration Options (include/s3_options.h, src/s3_options.cc, include/http_options.h)

    • S3 configuration: region, endpoint, credentials, timeouts
    • HTTP configuration: bearer token, SSL verification
  4. VFS Interface Integration (extension/s3/src/s3_extension.cpp)

    • Adapts S3 and HTTP filesystems to NeuG's neug::fsys::FileSystem interface
    • Implements glob(), toArrowFileSystem() methods
    • Registers "s3", "http", "https" protocols in FileSystemRegistry

Testing

  1. S3 Unit Tests (tests/s3_test.cpp)

    • URI parser validation (valid/invalid S3 URIs, glob patterns, bucket name validation)
    • S3 options configuration tests (default AWS, OSS endpoint, explicit region, HTTPS scheme)
    • Credential configuration tests (anonymous, explicit, AWS alias, role/web-identity rejection)
    • Multi-region S3 options validation
  2. HTTP Unit Tests (tests/http_test.cpp)

    • URI parser validation (HTTP/HTTPS schemes, port numbers, invalid formats)
    • HTTP filesystem creation and type checks
    • Configuration defaults and key validation
    • E2E test: real HTTPS access to public Parquet file with Parquet magic number verification
  3. Integration Tests (tools/python_bind/tests/test_load.py, example/complex_test.py)

    • Python E2E tests for S3/HTTP data loading
    • Complex test scenarios with multiple file formats

Documentation & CI

  1. Documentation (doc/source/extensions/load_s3.md, doc/source/extensions/index.md)

    • Comprehensive guide for S3/OSS/HTTP usage with examples
    • Configuration reference for all options
    • Authentication methods (explicit credentials, environment variables, anonymous access)
  2. CI/CD (.github/workflows/neug-extension-test.yml, .github/workflows/build-extensions.yml)

    • Added S3/HTTP test execution in extension test workflow
    • Build extension tests in CI pipeline

Build System

  1. CMake Integration (extension/s3/CMakeLists.txt, CMakeLists.txt, cmake/BuildArrowAsThirdParty.cmake)
    • Added S3 extension build configuration
    • Linked required dependencies: libcurl, Arrow S3 filesystem, AWS SDK
    • Configured test build for S3 extension

Contributor

@greptile-apps greptile-apps bot left a comment


@qodo-code-review

Review Summary by Qodo

Add S3/OSS/HTTP extension support for cloud object storage

✨ Enhancement 🧪 Tests 📝 Documentation


Walkthroughs

Description
• Adds comprehensive cloud object storage support through a new S3 extension module enabling
  read-only access to AWS S3, Alibaba Cloud OSS, and HTTP/HTTPS endpoints
• **S3/OSS Filesystem**: Implements S3FileSystem wrapper around Arrow's S3 filesystem with URI
  parsing, glob pattern expansion, and configurable credentials (explicit keys, environment variables,
  anonymous access)
• **HTTP/HTTPS Filesystem**: Implements HTTPFileSystem using libcurl with Range request support,
  SSL/TLS verification, and custom header configuration
• **Configuration System**: Centralized options for S3 (region, endpoint, credentials, timeouts) and
  HTTP (bearer token, SSL verification, proxy settings)
• **VFS Integration**: Registers S3, OSS, HTTP, and HTTPS protocols in the global FileSystemRegistry
  with proper interface adaptation
• **Comprehensive Testing**: Unit tests for URI parsing, configuration validation, and end-to-end
  integration tests accessing public datasets from AWS S3, Alibaba OSS, and HTTPS endpoints
• **Python Bindings**: E2E tests for data loading from cloud storage with row/column count and data
  type validation
• **Build System**: CMake configuration for S3 extension with Arrow S3 filesystem and libcurl
  dependencies
• **CI/CD Integration**: Extension tests added to GitHub workflows for automated validation
• **Documentation**: User guide covering installation, credential modes, configuration options, and
  query examples
Diagram
flowchart LR
  A["S3/OSS URIs<br/>HTTP/HTTPS URLs"] -->|"Parse & Configure"| B["S3FileSystem<br/>HTTPFileSystem"]
  B -->|"Implement VFS Interface"| C["FileSystemRegistry"]
  C -->|"Register Protocols"| D["s3://, oss://<br/>http://, https://"]
  B -->|"Adapt to Arrow"| E["Arrow FileSystem"]
  E -->|"Access Cloud Storage"| F["AWS S3<br/>Alibaba OSS<br/>HTTP Endpoints"]
  G["S3Options<br/>HTTPOptions"] -->|"Configure"| B
  H["Unit & Integration<br/>Tests"] -->|"Validate"| B


File Changes

1. extension/s3/tests/s3_test.cpp 🧪 Tests +679/-0

S3/OSS filesystem unit and integration tests

• Comprehensive S3/OSS URI parser tests validating bucket names, object keys, and glob patterns
• S3Options configuration tests covering AWS defaults, OSS endpoints, region resolution, and
 credential modes
• End-to-end integration tests accessing public OSS and AWS S3 datasets with anonymous credentials
• Negative test demonstrating that S3FileSystem cannot access plain HTTPS URLs (requires S3 API)

extension/s3/tests/s3_test.cpp


2. extension/s3/src/http_filesystem.cc ✨ Enhancement +660/-0

HTTP/HTTPS filesystem implementation with libcurl

• Implements HTTPRandomAccessFile for HTTP/HTTPS file access using libcurl with Range request
 support
• Implements HTTPFileSystem adapting libcurl to Arrow's FileSystem interface (read-only)
• Provides URI parsing for HTTP/HTTPS URLs with host, port, and path extraction
• Supports custom headers, bearer token authentication, SSL/TLS verification, and proxy
 configuration

extension/s3/src/http_filesystem.cc


3. extension/s3/src/s3_options.cc ✨ Enhancement +367/-0

S3/OSS configuration builder with credential resolution

• Implements S3OptionsBuilder to configure Arrow S3Options from NeuG FileSchema
• Resolves endpoint, region, and credentials with environment variable fallback support
• Auto-detects OSS region from endpoint and applies OSS-specific settings (virtual addressing)
• Validates and rejects unsupported credential kinds (Role, WebIdentity) with clear error messages

extension/s3/src/s3_options.cc


4. extension/s3/src/s3_filesystem.cc ✨ Enhancement +292/-0

S3/OSS filesystem wrapper with URI parsing and glob support

• Implements S3FileSystem wrapping Arrow's S3FileSystem with NeuG VFS interface
• Parses S3/OSS URIs and converts to Arrow-compatible paths (bucket/key format)
• Implements glob pattern expansion for S3 object keys
• Provides toArrowFileSystem() adapter using wrapper class for ownership model compatibility

extension/s3/src/s3_filesystem.cc


5. extension/s3/tests/http_test.cpp 🧪 Tests +258/-0

HTTP/HTTPS filesystem unit and integration tests

• HTTP/HTTPS URI parser tests validating scheme, host, port, and path extraction
• HTTPFileSystem instantiation and type checking tests
• End-to-end test accessing public Parquet file via HTTPS with magic number verification
• Tests for custom headers, bearer token, and SSL verification configuration

extension/s3/tests/http_test.cpp


6. extension/s3/src/s3_extension.cpp ✨ Enhancement +88/-0

S3 extension registration and initialization

• Registers S3FileSystem for s3:// and oss:// URI schemes in global VFS registry
• Registers HTTPFileSystem for http:// and https:// URI schemes
• Provides extension entry points (Init() and Name()) for dynamic loading

extension/s3/src/s3_extension.cpp


7. tools/python_bind/example/complex_test.py 🧪 Tests +148/-0

Python E2E tests for S3/OSS/HTTP extension

• Adds run_s3_extension_suite() function testing OSS, HTTP, and AWS S3 data loading
• Tests OSS public bucket access with anonymous credentials
• Tests HTTP/HTTPS direct URL access to OSS-hosted Parquet files
• Tests AWS S3 public dataset (Ookla Open Data) with anonymous access

tools/python_bind/example/complex_test.py


8. tools/python_bind/tests/test_load.py 🧪 Tests +166/-0

Python unit tests for S3/OSS/HTTP data loading

• Adds test for loading Parquet vertices/edges from OSS with inline S3 options
• Adds test for loading Parquet files via HTTP/HTTPS URLs
• Adds test for accessing public AWS S3 dataset (Ookla Open Data) with anonymous credentials
• Tests verify row counts, column counts, and data type validation

tools/python_bind/tests/test_load.py


9. cmake/BuildArrowAsThirdParty.cmake ⚙️ Configuration changes +5/-1

CMake Arrow S3 filesystem build configuration

• Adds conditional Arrow S3 filesystem build configuration based on ARROW_ENABLE_S3 flag
• Enables ARROW_S3 CMake option when S3 support is requested

cmake/BuildArrowAsThirdParty.cmake


10. extension/s3/include/s3_options.h ✨ Enhancement +215/-0

S3/OSS configuration options and builder interface

• Defines centralized S3 configuration option keys (endpoint, region, credentials)
• Documents credential resolution priority (Explicit > Default > Anonymous)
• Implements S3OptionsBuilder class for constructing Arrow S3Options from FileSchema
• Provides comprehensive documentation of OSS-specific handling and credential modes

extension/s3/include/s3_options.h


11. extension/s3/include/http_filesystem.h ✨ Enhancement +207/-0

HTTP/HTTPS filesystem implementation with libcurl support

• Introduces HTTPURIComponents struct for parsing HTTP(S) URLs with scheme, host, port, and path
 components
• Implements HTTPRandomAccessFile class extending Arrow's RandomAccessFile interface with
 libcurl-based HTTP Range request support
• Defines HTTPFileSystem class implementing both Arrow and NeuG filesystem interfaces for
 read-only HTTP/HTTPS access
• Provides factory function CreateHTTPFileSystem() for FileSchema-based HTTP filesystem creation

extension/s3/include/http_filesystem.h


12. extension/s3/include/glob_utils.h ✨ Enhancement +168/-0

Glob pattern matching utilities for cloud storage

• Implements MatchGlobPattern() inline function for lightweight glob pattern matching with * and
 ? wildcards
• Provides ResolvePathsWithGlobOnFs() helper function to resolve glob patterns on any Arrow
 FileSystem
• Supports recursive globbing with ** pattern and integrates with Arrow's FileSelector API

extension/s3/include/glob_utils.h


13. extension/s3/include/s3_filesystem.h ✨ Enhancement +105/-0

S3/OSS filesystem abstraction layer

• Defines S3URIComponents struct for parsing S3/OSS URIs with bucket and object key extraction
• Implements S3FileSystem class extending NeuG's FileSystem interface for S3-compatible storage
 access
• Provides buildS3Options() static method to construct Arrow S3Options from FileSchema
 configuration
• Includes factory function CreateS3FileSystem() for S3 filesystem creation

extension/s3/include/s3_filesystem.h


14. extension/s3/include/http_options.h ⚙️ Configuration changes +69/-0

HTTP configuration options and defaults

• Defines HTTPConfigOptionKeys struct with centralized HTTP configuration keys (authentication,
 headers, SSL, proxy, timeouts)
• Provides HTTPConfigDefaults struct with default values for SSL verification, timeouts, and retry
 settings

extension/s3/include/http_options.h


15. include/neug/utils/reader/options.h ✨ Enhancement +12/-0

Support for double-precision option values

• Adds DoubleOption() static factory method to support double-precision floating-point
 configuration values
• Includes exception handling for invalid double parsing with std::stod()

include/neug/utils/reader/options.h


16. doc/source/extensions/load_s3.md 📝 Documentation +136/-0

S3/OSS/HTTP extension user documentation

• Comprehensive documentation for S3 Extension covering installation, loading, and supported URI
 schemes (s3://, oss://, http://, https://)
• Details credential modes (Default, Anonymous, Explicit) with configuration options for endpoints,
 regions, and timeouts
• Provides query examples for AWS S3, Alibaba OSS, HTTP URLs, and glob patterns
• Explains integration with other extensions like PARQUET

doc/source/extensions/load_s3.md


17. .github/workflows/neug-extension-test.yml ⚙️ Configuration changes +5/-3

CI/CD integration for S3 extension testing

• Adds s3 to BUILD_EXTENSIONS environment variable alongside existing json and parquet
• Includes S3 extension test execution with ctest -R s3_extension_test
• Adds Python pytest for S3 tests in test_load.py

.github/workflows/neug-extension-test.yml


18. doc/source/extensions/index.md 📝 Documentation +5/-3

Extension documentation index updates

• Updates extension table to mark PARQUET and HTTP/HTTPS/S3/OSS as v0.2 (no longer "planned")
• Adds link to load_s3.md documentation for VFS extension
• Minor formatting improvements (blank line additions)

doc/source/extensions/index.md


19. extension/s3/CMakeLists.txt ⚙️ Configuration changes +40/-0

S3 extension CMake build configuration

• Defines CMake build configuration for S3 extension with CURL dependency
• Configures include directories for extension headers and Arrow/CURL libraries
• Links required libraries (CURL, neug) and sets up extension test build with Arrow dataset support

extension/s3/CMakeLists.txt


20. .github/workflows/build-extensions.yml ⚙️ Configuration changes +2/-2

Build workflow S3 extension inclusion

• Updates default extensions list to include s3 alongside json and parquet
• Applies change to both workflow dispatch and default branch build paths

.github/workflows/build-extensions.yml


21. extension/CMakeLists.txt ⚙️ Configuration changes +3/-4

Extension build system S3 integration

• Adds S3 extension to build system via add_extension_if_enabled("s3")
• Removes redundant OpenSSL and CRYPTO_LIB dependencies from add_extension_test() function

extension/CMakeLists.txt


22. CMakeLists.txt ⚙️ Configuration changes +6/-0

Arrow S3 support configuration

• Adds conditional Arrow S3 support configuration when s3 extension is enabled in
 BUILD_EXTENSIONS
• Sets ARROW_ENABLE_S3 cache variable to enable Arrow's S3 filesystem backend

CMakeLists.txt




qodo-code-review bot commented Apr 7, 2026

Code Review by Qodo

🐞 Bugs (6) 📘 Rule violations (0) 📎 Requirement gaps (1) 🎨 UX Issues (0)



Action required

1. Hardcoded scheme dispatch implemented 📎 Requirement gap ⚙ Maintainability
Description
The new remote filesystem support hardcodes URI scheme parsing/dispatch (via a custom VFS registry
and bespoke URI parsers) instead of using Arrow’s arrow::fs::FileSystemFromUri() dispatch
mechanism. This violates the requirement to rely on Arrow’s URI dispatch for scheme-based filesystem
resolution.
Code

extension/s3/src/s3_extension.cpp[R29-47]

+// Register S3/OSS filesystem factories in the global VFS registry
+static void RegisterS3Provider() {
+  auto* vfs = neug::main::MetadataRegistry::getVFS();
+
+  // Register for both s3:// and oss:// schemes
+  vfs->Register("s3", neug::extension::s3::CreateS3FileSystem);
+  vfs->Register("oss", neug::extension::s3::CreateS3FileSystem);
+
+  LOG(INFO) << "[s3 extension] S3FileSystem registered for schemes: s3, oss";
+}
+
+// Register HTTP/HTTPS filesystem factories in the global VFS registry
+static void RegisterHTTPProvider() {
+  auto* vfs = neug::main::MetadataRegistry::getVFS();
+
+  // Register for both http:// and https:// schemes
+  vfs->Register("http", neug::extension::http::CreateHTTPFileSystem);
+  vfs->Register("https", neug::extension::http::CreateHTTPFileSystem);
+
Evidence
PR Compliance ID 3 requires scheme-based filesystem resolution to be implemented via Arrow URI
dispatch (arrow::fs::FileSystemFromUri() or equivalent). The added code registers per-scheme
factories in a custom registry and separately parses schemes in custom URI parsers (s3:///oss://
and http(s)://) rather than delegating to Arrow’s URI dispatch.

Uses Arrow URI dispatch (FileSystemFromUri) for scheme-based filesystem resolution
extension/s3/src/s3_extension.cpp[29-47]
extension/s3/src/s3_filesystem.cc[145-153]
extension/s3/src/http_filesystem.cc[49-59]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Scheme-based filesystem resolution is implemented via custom string parsing and a custom registry, but the compliance requirement mandates using Arrow’s URI dispatch (`arrow::fs::FileSystemFromUri()` or an Arrow-supported equivalent).

## Issue Context
- Current implementation registers providers for `s3`, `oss`, `http`, `https` and uses custom URI parsing (`S3URIComponents::parse`, `HTTPURIComponents::parse`) instead of Arrow URI dispatch.
- Arrow’s dispatch should be used wherever possible for scheme-to-filesystem resolution.

## Fix Focus Areas
- src/utils/file_sys/file_system.cc[68-97]
- extension/s3/src/s3_extension.cpp[29-47]
- extension/s3/src/s3_filesystem.cc[141-184]
- extension/s3/src/http_filesystem.cc[46-90]



2. Empty endpoint override ignored 🐞 Bug ≡ Correctness
Description
getOptionWithEnv() skips schema.options entries with empty values, so setting ENDPOINT_OVERRIDE=""
does not override environment endpoints and can unintentionally route AWS S3 reads to an
env-configured OSS/S3-compatible endpoint.
This contradicts the test’s own comment that an empty ENDPOINT_OVERRIDE “Force[s] default AWS S3
endpoint”, and can cause incorrect runtime behavior in shared environments.
Code

extension/s3/src/s3_options.cc[R40-53]

+T getOptionWithEnv(const reader::options_t& options,
+                   const reader::Option<T>& opt,
+                   std::initializer_list<const char*> option_keys,
+                   std::initializer_list<const char*> env_keys) {
+  // Priority 1: schema.options - check all key aliases
+  for (const char* key : option_keys) {
+    auto it = options.find(key);
+    if (it != options.end() && !it->second.empty()) {
+      // Use the Option's parser to handle type conversion
+      reader::options_t temp;
+      temp.emplace(opt.getKey(), it->second);
+      return opt.get(temp);
+    }
+  }
Evidence
getOptionWithEnv() only honors schema.options when the value is non-empty, so an explicitly provided
empty string is ignored and the function falls back to environment variables. The S3 E2E test relies
on ENDPOINT_OVERRIDE="" to force default AWS behavior, but that configuration is ignored by the
current implementation.

extension/s3/src/s3_options.cc[40-53]
extension/s3/tests/s3_test.cpp[454-458]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`getOptionWithEnv()` ignores explicitly-provided empty-string option values, so callers cannot clear env-provided endpoint/region (e.g., to force default AWS S3) by setting `ENDPOINT_OVERRIDE=""`.

### Issue Context
The S3 tests and expected behavior assume an empty override should take precedence over environment variables.

### Fix Focus Areas
- extension/s3/src/s3_options.cc[40-53]
- extension/s3/src/s3_options.cc[138-158]

### Suggested fix
- Treat **presence** of any alias key in `schema.options` as higher priority than env vars, even if the value is empty.
 - For string options (endpoint/region), remove the `!it->second.empty()` guard so explicit empty values are honored.
 - Keep env fallback only when none of the option keys are present.
- Add/adjust a unit test to validate: when env has `OSS_ENDPOINT` but schema.options has `ENDPOINT_OVERRIDE=""`, the resolved endpoint is empty and AWS defaults are used.



3. HTTP ReadAt misreports bytes 🐞 Bug ≡ Correctness
Description
HTTPRandomAccessFile::ReadRange() logs a short response (received < requested) but still returns OK,
and ReadAt() always returns the requested nbytes even when fewer bytes were written.
This can feed Arrow readers buffers containing uninitialized/incorrect data, causing corrupt reads
or parsing failures.
Code

extension/s3/src/http_filesystem.cc[R339-433]

+arrow::Status HTTPRandomAccessFile::ReadRange(int64_t offset, int64_t length, void* buffer) {
+  if (closed_) {
+    return arrow::Status::Invalid("File is closed");
+  }
+  
+  // Setup GET request with HTTP Range header (RFC 7233)
+  curl_easy_reset(curl_handle_);
+  SetupCURLHandle(curl_handle_);
+  curl_easy_setopt(curl_handle_, CURLOPT_URL, url_.c_str());
+  
+  // Set Range header: "bytes=offset-end"
+  // IMPORTANT: CURLOPT_RANGE expects "start-end" format without "bytes="
+  // CURL will automatically add the "Range: bytes=" prefix
+  std::string range_value = std::to_string(offset) + "-" + 
+                           std::to_string(offset + length - 1);
+  curl_easy_setopt(curl_handle_, CURLOPT_RANGE, range_value.c_str());
+  
+  // Setup direct write to buffer
+  std::pair<void*, size_t> write_info{buffer, static_cast<size_t>(length)};
+  curl_easy_setopt(curl_handle_, CURLOPT_WRITEFUNCTION, WriteCallbackDirect);
+  curl_easy_setopt(curl_handle_, CURLOPT_WRITEDATA, &write_info);
+  
+  // Perform request
+  CURLcode res = curl_easy_perform(curl_handle_);
+  
+  if (res != CURLE_OK) {
+    return arrow::Status::IOError("HTTP Range request failed: " +
+                                   std::string(curl_easy_strerror(res)));
+  }
+  
+  // Verify response code
+  // 200 = OK (server doesn't support Range, sent full file)
+  // 206 = Partial Content (server supports Range, sent requested range)
+  long response_code = 0;
+  curl_easy_getinfo(curl_handle_, CURLINFO_RESPONSE_CODE, &response_code);
+  
+  if (response_code == 206) {
+    // Success: server sent the requested range
+    // Check how much data was actually written
+    if (write_info.second > 0) {
+      // Buffer wasn't completely filled - server sent less data than requested
+      LOG(WARNING) << "HTTP Range request returned less data than expected. "
+                   << "Requested: " << length << " bytes, "
+                   << "Received: " << (length - write_info.second) << " bytes";
+    }
+    return arrow::Status::OK();
+  } else if (response_code == 200) {
+    // Server doesn't support Range requests and sent the full file
+    // This is a critical error because we can't fit the full file in our buffer
+    return arrow::Status::IOError(
+        "Server doesn't support HTTP Range requests (returned 200 instead of 206). "
+        "Cannot read file efficiently without Range support.");
+  } else {
+    return arrow::Status::IOError("HTTP Range request failed with status " +
+                                   std::to_string(response_code));
+  }
+}
+
+arrow::Result<int64_t> HTTPRandomAccessFile::Tell() const {
+  if (closed_) {
+    return arrow::Status::Invalid("File is closed");
+  }
+  return position_;
+}
+
+arrow::Result<int64_t> HTTPRandomAccessFile::GetSize() {
+  if (closed_) {
+    return arrow::Status::Invalid("File is closed");
+  }
+  return file_size_;
+}
+
+arrow::Status HTTPRandomAccessFile::Seek(int64_t position) {
+  if (closed_) {
+    return arrow::Status::Invalid("File is closed");
+  }
+  if (position < 0) {
+    return arrow::Status::Invalid("Negative seek position");
+  }
+  position_ = position;
+  return arrow::Status::OK();
+}
+
+arrow::Result<int64_t> HTTPRandomAccessFile::ReadAt(int64_t position, int64_t nbytes, void* out) {
+  if (closed_) {
+    return arrow::Status::Invalid("File is closed");
+  }
+  
+  auto status = ReadRange(position, nbytes, out);
+  if (!status.ok()) {
+    return status;
+  }
+  
+  return nbytes;
+}
Evidence
The write callback tracks remaining capacity in write_info.second; if it’s still >0 after a 206
response, the code only logs a warning and returns OK. ReadAt() then returns nbytes
unconditionally, so callers can’t detect partial reads and may consume uninitialized bytes.

extension/s3/src/http_filesystem.cc[113-135]
extension/s3/src/http_filesystem.cc[339-394]
extension/s3/src/http_filesystem.cc[422-433]
extension/s3/src/http_filesystem.cc[435-453]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`HTTPRandomAccessFile` may successfully return from a range request even when fewer bytes were received than requested, and `ReadAt()` reports the requested size rather than actual bytes read.

### Issue Context
Arrow dataset/parquet readers rely on `RandomAccessFile::ReadAt` semantics; misreporting byte counts can cause corrupted buffers and undefined behavior.

### Fix Focus Areas
- extension/s3/src/http_filesystem.cc[339-394]
- extension/s3/src/http_filesystem.cc[422-453]

### Suggested fix
- Compute `bytes_received = length - write_info.second` and:
 - If `bytes_received == 0` and the server indicates EOF/416, return 0 (or an appropriate Arrow status).
 - If `bytes_received < length`, either return `bytes_received` (preferred) or return an IOError to avoid propagating partial data as complete.
- Update `ReadAt(position, nbytes, void*)` to return `bytes_received` instead of `nbytes`.
- Update `ReadAt(position, nbytes)` (buffer-returning overload) to return a buffer sized to `bytes_received` (e.g., allocate `nbytes`, then `Slice(0, bytes_received)`), or allocate exactly `bytes_received` when known.
- Add handling for `nbytes == 0` to avoid constructing an invalid range like `offset-(offset-1)`.



4. HTTP timeout parsing can throw 🐞 Bug ☼ Reliability
Description
HTTPRandomAccessFile::SetupCURLHandle() uses std::stoi on user-provided timeout values without
catching exceptions.
Invalid values can throw std::exception types that are not caught by HTTPFileSystem::OpenInputFile()
(it only catches neug::exception::Exception), potentially terminating the process.
Code

extension/s3/src/http_filesystem.cc[R294-307]

+  // Timeouts
+  auto connect_timeout_it = options_.find(HTTPConfigOptionKeys::kConnectTimeout);
+  int connect_timeout = HTTPConfigDefaults::kConnectTimeoutDefault;
+  if (connect_timeout_it != options_.end()) {
+    connect_timeout = std::stoi(connect_timeout_it->second);
+  }
+  curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT, connect_timeout);
+  
+  auto request_timeout_it = options_.find(HTTPConfigOptionKeys::kRequestTimeout);
+  int request_timeout = HTTPConfigDefaults::kRequestTimeoutDefault;
+  if (request_timeout_it != options_.end()) {
+    request_timeout = std::stoi(request_timeout_it->second);
+  }
+  curl_easy_setopt(curl, CURLOPT_TIMEOUT, request_timeout);
Evidence
SetupCURLHandle parses CONNECT_TIMEOUT/REQUEST_TIMEOUT using std::stoi without a try/catch.
OpenInputFile only catches neug::exception::Exception, so a thrown
std::invalid_argument/std::out_of_range will escape and can crash the caller.

extension/s3/src/http_filesystem.cc[294-307]
extension/s3/src/http_filesystem.cc[614-622]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Parsing HTTP timeout options with `std::stoi` can throw uncaught `std::exception` types, which are not handled by the current exception conversion path.

### Issue Context
Timeout values come from `FileSchema.options` (user input). Invalid input must not crash the process.

### Fix Focus Areas
- extension/s3/src/http_filesystem.cc[294-307]
- extension/s3/src/http_filesystem.cc[614-622]

### Suggested fix
- Wrap `std::stoi` calls in try/catch and throw a `THROW_INVALID_ARGUMENT_EXCEPTION(...)` (or return an Arrow Status) with a clear message indicating the bad key/value.
- Optionally extend `OpenInputFile()` to also catch `const std::exception&` and convert it to `arrow::Status::IOError` so unexpected std exceptions don’t terminate the process.



5. s3_extension_test missing curl link 🐞 Bug ≡ Correctness
Description
The s3_extension_test executable compiles HTTP filesystem sources that call libcurl, but the test
target does not link against CURL_LIBRARIES.
This will typically cause undefined-reference linker failures for curl symbols when building tests.
Code

extension/s3/CMakeLists.txt[R34-40]

+if(BUILD_TEST)
+    add_extension_test(NAME s3
+        TEST_SRCS ${CMAKE_CURRENT_SOURCE_DIR}/tests/s3_test.cpp
+                  ${CMAKE_CURRENT_SOURCE_DIR}/tests/http_test.cpp
+        EXTRA_LINK_LIBS arrow_dataset_objlib ${ARROW_PARQUET_LIB}
+    )
+endif()
Evidence
add_extension_test builds an executable from the test sources plus the extension sources, and links
only neug, ${ARG_EXTRA_LINK_LIBS}, ${GLOG_LIBRARIES}, and ${ARROW_LIB}—it does not link the
extension shared library target (which has CURL). The s3 extension’s add_extension_test invocation
does not add ${CURL_LIBRARIES} to EXTRA_LINK_LIBS.

extension/CMakeLists.txt[60-106]
extension/s3/CMakeLists.txt[34-40]
extension/s3/src/http_filesystem.cc[24-25]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`s3_extension_test` compiles code that uses libcurl but does not link against libcurl.

### Issue Context
`add_extension_test()` compiles extension sources directly into the test binary, so the test must explicitly link all dependencies used by those sources.

### Fix Focus Areas
- extension/s3/CMakeLists.txt[34-40]
- extension/CMakeLists.txt[87-106]

### Suggested fix
- In `extension/s3/CMakeLists.txt`, update the `add_extension_test(NAME s3 ...)` call to include curl:
 - Add `${CURL_LIBRARIES}` to `EXTRA_LINK_LIBS`.
 - If needed on some platforms, also add `${CURL_INCLUDE_DIRS}` via `EXTRA_INCLUDE_DIRS`.
- Ensure the resulting `s3_extension_test` links successfully in CI.
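Under those assumptions, the adjusted call might look like the fragment below, based on the snippet quoted above; variable names are taken from the review, not verified against the tree.

```cmake
if(BUILD_TEST)
    add_extension_test(NAME s3
        TEST_SRCS ${CMAKE_CURRENT_SOURCE_DIR}/tests/s3_test.cpp
                  ${CMAKE_CURRENT_SOURCE_DIR}/tests/http_test.cpp
        EXTRA_LINK_LIBS arrow_dataset_objlib ${ARROW_PARQUET_LIB} ${CURL_LIBRARIES}
    )
endif()
```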




Remediation recommended

6. Bracket globs never match 🐞 Bug ≡ Correctness
Description
The S3 glob logic treats '[' as a glob wildcard (in hasGlob detection and base_dir calculation) but
MatchGlobPattern only implements '*' and '?' matching.
Any path pattern containing bracket expressions will be treated as a glob but will never match,
leading to a misleading "No files matched" error.
Code

extension/s3/include/glob_utils.h[R100-106]

+  // Extract base directory (part before first wildcard)
+  std::string base_dir = pattern;
+  size_t wildcard_pos = std::min({
+      base_dir.find('*'),
+      base_dir.find('?'),
+      base_dir.find('[')});
+
Evidence
MatchGlobPattern explicitly documents and implements only '*' and '?' handling, but
ResolvePathsWithGlobOnFs uses find('[') when locating wildcards; S3URIComponents also sets hasGlob
based on '['. This mismatch causes bracket patterns to be treated as globs without any actual
bracket matching support.

extension/s3/include/glob_utils.h[29-78]
extension/s3/include/glob_utils.h[100-106]
extension/s3/src/s3_filesystem.cc[178-181]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The glob implementation flags `[` as a wildcard but the matcher doesn’t implement bracket expressions, so such patterns will never match.

### Issue Context
Users may reasonably expect `[0-9]`-style globs; current behavior fails with a confusing "No files matched".

### Fix Focus Areas
- extension/s3/include/glob_utils.h[29-78]
- extension/s3/include/glob_utils.h[100-106]
- extension/s3/src/s3_filesystem.cc[178-181]

### Suggested fix
Choose one:
1) Remove `[` from glob detection and base_dir wildcard scanning (treat it as a literal) to match the documented supported features.
2) Implement bracket-expression support in `MatchGlobPattern` (and add unit tests).

Also consider updating documentation/tests to reflect the supported glob syntax.
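If option 2 is chosen, bracket matching can be layered onto the existing `'*'`/`'?'` logic. A minimal standalone sketch under that assumption (function name and signature are illustrative; the real matcher lives in `glob_utils.h`):

```cpp
#include <cassert>
#include <string>

// Hypothetical extension of MatchGlobPattern: '*', '?', and POSIX-style
// bracket expressions with ranges and '!'/'^' negation.
bool MatchGlob(const std::string& pattern, const std::string& text,
               size_t p = 0, size_t t = 0) {
  while (p < pattern.size()) {
    char pc = pattern[p];
    if (pc == '*') {
      // '*' matches any run of characters: try every split point.
      for (size_t skip = t; skip <= text.size(); ++skip) {
        if (MatchGlob(pattern, text, p + 1, skip)) return true;
      }
      return false;
    }
    if (t >= text.size()) return false;
    if (pc == '?') { ++p; ++t; continue; }
    if (pc == '[') {
      size_t close = pattern.find(']', p + 1);
      if (close == std::string::npos) {
        // Unterminated bracket: fall back to treating '[' as a literal.
        if (text[t] != '[') return false;
        ++p; ++t; continue;
      }
      size_t i = p + 1;
      bool negate = (pattern[i] == '!' || pattern[i] == '^');
      if (negate) ++i;
      bool matched = false;
      while (i < close) {
        if (i + 2 < close && pattern[i + 1] == '-') {
          // Character range, e.g. 0-9
          if (pattern[i] <= text[t] && text[t] <= pattern[i + 2]) matched = true;
          i += 3;
        } else {
          if (pattern[i] == text[t]) matched = true;
          ++i;
        }
      }
      if (matched == negate) return false;
      p = close + 1; ++t; continue;
    }
    if (pc != text[t]) return false;
    ++p; ++t;
  }
  return t == text.size();
}
```

If option 1 (treat `[` as a literal) is preferred instead, the corresponding change is simply dropping `'['` from both the `hasGlob` detection and the `base_dir` wildcard scan.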



7. Unconditional network E2E tests 🐞 Bug ☼ Reliability
Description
s3_extension_test includes E2E tests that depend on live public S3/OSS access and call FAIL() when
connectivity or remote availability issues occur.
This can make CI non-deterministic in restricted or intermittently-connected environments.
Code

extension/s3/tests/s3_test.cpp[R413-483]

+TEST_F(S3ExtensionTest, E2E_InitializeOSSWithAnonymousAccess) {
+  // Test OSS public bucket with explicit anonymous access
+  // Using GraphScope public dataset on OSS
+  FileSchema schema;
+  schema.paths = {"oss://graphscope/neug-dataset/tinysnb/vPerson.parquet"};
+  schema.format = "parquet";
+  schema.protocol = "s3";
+  // Use AWS-compatible option names
+  schema.options["ENDPOINT_OVERRIDE"] = "oss-cn-beijing.aliyuncs.com";
+  schema.options["AWS_DEFAULT_REGION"] = "oss-cn-beijing";
+  schema.options["CREDENTIALS_KIND"] = "Anonymous";
+
+  auto fileInfo = provideS3(schema);
+  EXPECT_NE(fileInfo.fileSystem, nullptr);
+  EXPECT_FALSE(fileInfo.resolvedPaths.empty());
+
+  auto file_path = fileInfo.resolvedPaths[0];
+  // Note: resolvedPaths already returns Arrow-format paths ("bucket/key"), not "s3://" or "oss://"
+  // So we can use it directly with Arrow's GetFileInfo()
+  auto file_info_result = fileInfo.fileSystem->GetFileInfo(file_path);
+  if (file_info_result.ok()) {
+    auto info = *file_info_result;
+    LOG(INFO) << "Test: File info retrieved successfully";
+    LOG(INFO) << "  Type: " << (info.IsFile() ? "file" : "directory");
+    LOG(INFO) << "  Size: " << info.size() << " bytes";
+    EXPECT_TRUE(info.IsFile());
+    EXPECT_GT(info.size(), 0);  // File should have content
+    SUCCEED() << "OSS anonymous access works - file verified: " << info.size() << " bytes";
+  } else {
+    FAIL() << "Failed to get file info: " << file_info_result.status().ToString();
+  }
+}
+
+TEST_F(S3ExtensionTest, E2E_InitializeAWSS3WithAnonymousAccess) {
+  // Test with AWS S3 public bucket (no OSS, just AWS S3)
+  // Using NOAA's public dataset which supports anonymous access
+  FileSchema schema;
+  schema.paths = {"s3://noaa-ghcn-pds/csv/by_year/1763.csv"}; // NOAA public dataset
+  schema.format = "csv";
+  schema.protocol = "s3";
+  // Use AWS-compatible option names
+  // IMPORTANT: Explicitly set empty endpoint to override any environment variables
+  // This ensures we use default AWS S3 endpoint, not OSS
+  schema.options["ENDPOINT_OVERRIDE"] = "";  // Force default AWS S3 endpoint
+  schema.options["AWS_DEFAULT_REGION"] = "us-east-1";  // NOAA bucket is in us-east-1
+  schema.options["CREDENTIALS_KIND"] = "Anonymous";
+
+  LOG(INFO) << "Test: Initializing AWS S3 with anonymous credentials...";
+  auto fileInfo = provideS3(schema);
+  LOG(INFO) << "Test: AWS S3 FileSystem initialized successfully";
+  EXPECT_NE(fileInfo.fileSystem, nullptr);
+  EXPECT_FALSE(fileInfo.resolvedPaths.empty());
+
+  // Verify we can actually get file info (proves connectivity)
+  auto file_path = fileInfo.resolvedPaths[0];
+  LOG(INFO) << "Test: Attempting to get file info for: " << file_path;
+
+  // Note: resolvedPaths already returns Arrow-format paths ("bucket/key"), not "s3://bucket/key"
+  // So we can use it directly with Arrow's GetFileInfo()
+  auto file_info_result = fileInfo.fileSystem->GetFileInfo(file_path);
+  if (file_info_result.ok()) {
+    auto info = *file_info_result;
+    LOG(INFO) << "Test: File info retrieved successfully";
+    LOG(INFO) << "  Type: " << (info.IsFile() ? "file" : "directory");
+    LOG(INFO) << "  Size: " << info.size() << " bytes";
+    EXPECT_TRUE(info.IsFile());
+    EXPECT_GT(info.size(), 0);  // File should have content
+    SUCCEED() << "AWS S3 anonymous access works - file verified: " << info.size() << " bytes";
+  } else {
+    FAIL() << "Failed to get file info: " << file_info_result.status().ToString();
+  }
Evidence
The E2E tests access real public endpoints (GraphScope OSS, NOAA S3) and fail the test on any
GetFileInfo error rather than skipping when the network is unavailable, which risks flaky CI
outcomes.

extension/s3/tests/s3_test.cpp[413-444]
extension/s3/tests/s3_test.cpp[446-484]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Network-dependent E2E tests can fail CI due to transient connectivity or restricted environments.

### Issue Context
These are valuable tests, but they should be opt-in or skip gracefully when prerequisites aren’t met.

### Fix Focus Areas
- extension/s3/tests/s3_test.cpp[413-444]
- extension/s3/tests/s3_test.cpp[446-484]

### Suggested fix
- Gate live-network E2E tests behind an env var (e.g., `NEUG_RUN_S3_E2E=1`) and `GTEST_SKIP()` when not set.
- Alternatively, on network errors, replace `FAIL()` with `GTEST_SKIP()` (or a softer assertion) to avoid flakiness while still providing coverage when enabled.
- Keep pure unit tests (URI parsing/options building) always-on.
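A minimal sketch of the opt-in gate, reduced to a plain predicate so it can be shown standalone (the env var name `NEUG_RUN_S3_E2E` comes from the suggestion above; the `GTEST_SKIP()` wiring is indicated in a comment):

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Returns true only when the caller has explicitly opted in to
// live-network E2E tests via NEUG_RUN_S3_E2E=1.
bool ShouldRunS3E2E() {
  const char* flag = std::getenv("NEUG_RUN_S3_E2E");
  return flag != nullptr && std::string(flag) == "1";
}

// Inside each live-network TEST_F, the first line would then be:
//   if (!ShouldRunS3E2E()) GTEST_SKIP() << "set NEUG_RUN_S3_E2E=1 to run";
```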




Comment on lines +29 to +47
// Register S3/OSS filesystem factories in the global VFS registry
static void RegisterS3Provider() {
  auto* vfs = neug::main::MetadataRegistry::getVFS();

  // Register for both s3:// and oss:// schemes
  vfs->Register("s3", neug::extension::s3::CreateS3FileSystem);
  vfs->Register("oss", neug::extension::s3::CreateS3FileSystem);

  LOG(INFO) << "[s3 extension] S3FileSystem registered for schemes: s3, oss";
}

// Register HTTP/HTTPS filesystem factories in the global VFS registry
static void RegisterHTTPProvider() {
  auto* vfs = neug::main::MetadataRegistry::getVFS();

  // Register for both http:// and https:// schemes
  vfs->Register("http", neug::extension::http::CreateHTTPFileSystem);
  vfs->Register("https", neug::extension::http::CreateHTTPFileSystem);

Action required

1. Hardcoded scheme dispatch implemented 📎 Requirement gap ⚙ Maintainability

The new remote filesystem support hardcodes URI scheme parsing/dispatch (via a custom VFS registry
and bespoke URI parsers) instead of using Arrow’s arrow::fs::FileSystemFromUri() dispatch
mechanism. This violates the requirement to rely on Arrow’s URI dispatch for scheme-based filesystem
resolution.
Agent Prompt
## Issue description
Scheme-based filesystem resolution is implemented via custom string parsing and a custom registry, but the compliance requirement mandates using Arrow’s URI dispatch (`arrow::fs::FileSystemFromUri()` or an Arrow-supported equivalent).

## Issue Context
- Current implementation registers providers for `s3`, `oss`, `http`, `https` and uses custom URI parsing (`S3URIComponents::parse`, `HTTPURIComponents::parse`) instead of Arrow URI dispatch.
- Arrow’s dispatch should be used wherever possible for scheme-to-filesystem resolution.

## Fix Focus Areas
- src/utils/file_sys/file_system.cc[68-97]
- extension/s3/src/s3_extension.cpp[29-47]
- extension/s3/src/s3_filesystem.cc[141-184]
- extension/s3/src/http_filesystem.cc[46-90]


Comment on lines +40 to +53
T getOptionWithEnv(const reader::options_t& options,
                   const reader::Option<T>& opt,
                   std::initializer_list<const char*> option_keys,
                   std::initializer_list<const char*> env_keys) {
  // Priority 1: schema.options - check all key aliases
  for (const char* key : option_keys) {
    auto it = options.find(key);
    if (it != options.end() && !it->second.empty()) {
      // Use the Option's parser to handle type conversion
      reader::options_t temp;
      temp.emplace(opt.getKey(), it->second);
      return opt.get(temp);
    }
  }
Action required

2. Empty endpoint override ignored 🐞 Bug ≡ Correctness

getOptionWithEnv() skips schema.options entries with empty values, so setting ENDPOINT_OVERRIDE=""
does not override environment endpoints and can unintentionally route AWS S3 reads to an
env-configured OSS/S3-compatible endpoint.
This contradicts the test’s own comment that an empty ENDPOINT_OVERRIDE “Force[s] default AWS S3
endpoint”, and can cause incorrect runtime behavior in shared environments.
Agent Prompt
### Issue description
`getOptionWithEnv()` ignores explicitly-provided empty-string option values, so callers cannot clear env-provided endpoint/region (e.g., to force default AWS S3) by setting `ENDPOINT_OVERRIDE=""`.

### Issue Context
The S3 tests and expected behavior assume an empty override should take precedence over environment variables.

### Fix Focus Areas
- extension/s3/src/s3_options.cc[40-53]
- extension/s3/src/s3_options.cc[138-158]

### Suggested fix
- Treat **presence** of any alias key in `schema.options` as higher priority than env vars, even if the value is empty.
  - For string options (endpoint/region), remove the `!it->second.empty()` guard so explicit empty values are honored.
  - Keep env fallback only when none of the option keys are present.
- Add/adjust a unit test to validate: when env has `OSS_ENDPOINT` but schema.options has `ENDPOINT_OVERRIDE=""`, the resolved endpoint is empty and AWS defaults are used.
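A simplified sketch of that presence-over-value lookup, using a plain `std::map` stand-in for `reader::options_t` (names and signature are illustrative, not the real API):

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <string>

using Options = std::map<std::string, std::string>;

// Resolve a string option: any alias present in the options map wins,
// even when its value is empty, so ENDPOINT_OVERRIDE="" really does
// clear an env-provided endpoint. Env vars apply only when no alias
// key was supplied at all.
std::string ResolveString(const Options& opts,
                          std::initializer_list<const char*> option_keys,
                          std::initializer_list<const char*> env_keys,
                          const std::string& fallback) {
  for (const char* key : option_keys) {
    auto it = opts.find(key);
    if (it != opts.end()) return it->second;  // presence beats env, even if empty
  }
  for (const char* key : env_keys) {
    if (const char* v = std::getenv(key)) return v;
  }
  return fallback;
}
```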


Comment on lines +339 to +433
arrow::Status HTTPRandomAccessFile::ReadRange(int64_t offset, int64_t length, void* buffer) {
  if (closed_) {
    return arrow::Status::Invalid("File is closed");
  }

  // Setup GET request with HTTP Range header (RFC 7233)
  curl_easy_reset(curl_handle_);
  SetupCURLHandle(curl_handle_);
  curl_easy_setopt(curl_handle_, CURLOPT_URL, url_.c_str());

  // Set Range header: "bytes=offset-end"
  // IMPORTANT: CURLOPT_RANGE expects "start-end" format without "bytes="
  // CURL will automatically add the "Range: bytes=" prefix
  std::string range_value = std::to_string(offset) + "-" +
                            std::to_string(offset + length - 1);
  curl_easy_setopt(curl_handle_, CURLOPT_RANGE, range_value.c_str());

  // Setup direct write to buffer
  std::pair<void*, size_t> write_info{buffer, static_cast<size_t>(length)};
  curl_easy_setopt(curl_handle_, CURLOPT_WRITEFUNCTION, WriteCallbackDirect);
  curl_easy_setopt(curl_handle_, CURLOPT_WRITEDATA, &write_info);

  // Perform request
  CURLcode res = curl_easy_perform(curl_handle_);

  if (res != CURLE_OK) {
    return arrow::Status::IOError("HTTP Range request failed: " +
                                  std::string(curl_easy_strerror(res)));
  }

  // Verify response code
  // 200 = OK (server doesn't support Range, sent full file)
  // 206 = Partial Content (server supports Range, sent requested range)
  long response_code = 0;
  curl_easy_getinfo(curl_handle_, CURLINFO_RESPONSE_CODE, &response_code);

  if (response_code == 206) {
    // Success: server sent the requested range
    // Check how much data was actually written
    if (write_info.second > 0) {
      // Buffer wasn't completely filled - server sent less data than requested
      LOG(WARNING) << "HTTP Range request returned less data than expected. "
                   << "Requested: " << length << " bytes, "
                   << "Received: " << (length - write_info.second) << " bytes";
    }
    return arrow::Status::OK();
  } else if (response_code == 200) {
    // Server doesn't support Range requests and sent the full file
    // This is a critical error because we can't fit the full file in our buffer
    return arrow::Status::IOError(
        "Server doesn't support HTTP Range requests (returned 200 instead of 206). "
        "Cannot read file efficiently without Range support.");
  } else {
    return arrow::Status::IOError("HTTP Range request failed with status " +
                                  std::to_string(response_code));
  }
}

arrow::Result<int64_t> HTTPRandomAccessFile::Tell() const {
  if (closed_) {
    return arrow::Status::Invalid("File is closed");
  }
  return position_;
}

arrow::Result<int64_t> HTTPRandomAccessFile::GetSize() {
  if (closed_) {
    return arrow::Status::Invalid("File is closed");
  }
  return file_size_;
}

arrow::Status HTTPRandomAccessFile::Seek(int64_t position) {
  if (closed_) {
    return arrow::Status::Invalid("File is closed");
  }
  if (position < 0) {
    return arrow::Status::Invalid("Negative seek position");
  }
  position_ = position;
  return arrow::Status::OK();
}

arrow::Result<int64_t> HTTPRandomAccessFile::ReadAt(int64_t position, int64_t nbytes, void* out) {
  if (closed_) {
    return arrow::Status::Invalid("File is closed");
  }

  auto status = ReadRange(position, nbytes, out);
  if (!status.ok()) {
    return status;
  }

  return nbytes;
}
Action required

3. HTTP ReadAt misreports bytes 🐞 Bug ≡ Correctness

HTTPRandomAccessFile::ReadRange() logs a short response (received < requested) but still returns OK,
and ReadAt() always returns the requested nbytes even when fewer bytes were written.
This can feed Arrow readers buffers containing uninitialized/incorrect data, causing corrupt reads
or parsing failures.
Agent Prompt
### Issue description
`HTTPRandomAccessFile` may successfully return from a range request even when fewer bytes were received than requested, and `ReadAt()` reports the requested size rather than actual bytes read.

### Issue Context
Arrow dataset/parquet readers rely on `RandomAccessFile::ReadAt` semantics; misreporting byte counts can cause corrupted buffers and undefined behavior.

### Fix Focus Areas
- extension/s3/src/http_filesystem.cc[339-394]
- extension/s3/src/http_filesystem.cc[422-453]

### Suggested fix
- Compute `bytes_received = length - write_info.second` and:
  - If `bytes_received == 0` and the server indicates EOF/416, return 0 (or an appropriate Arrow status).
  - If `bytes_received < length`, either return `bytes_received` (preferred) or return an IOError to avoid propagating partial data as complete.
- Update `ReadAt(position, nbytes, void*)` to return `bytes_received` instead of `nbytes`.
- Update `ReadAt(position, nbytes)` (buffer-returning overload) to return a buffer sized to `bytes_received` (e.g., allocate `nbytes`, then `Slice(0, bytes_received)`), or allocate exactly `bytes_received` when known.
- Add handling for `nbytes == 0` to avoid constructing an invalid range like `offset-(offset-1)`.


Comment on lines +294 to +307
// Timeouts
auto connect_timeout_it = options_.find(HTTPConfigOptionKeys::kConnectTimeout);
int connect_timeout = HTTPConfigDefaults::kConnectTimeoutDefault;
if (connect_timeout_it != options_.end()) {
  connect_timeout = std::stoi(connect_timeout_it->second);
}
curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT, connect_timeout);

auto request_timeout_it = options_.find(HTTPConfigOptionKeys::kRequestTimeout);
int request_timeout = HTTPConfigDefaults::kRequestTimeoutDefault;
if (request_timeout_it != options_.end()) {
  request_timeout = std::stoi(request_timeout_it->second);
}
curl_easy_setopt(curl, CURLOPT_TIMEOUT, request_timeout);
Action required

4. HTTP timeout parsing can throw 🐞 Bug ☼ Reliability

HTTPRandomAccessFile::SetupCURLHandle() uses std::stoi on user-provided timeout values without
catching exceptions.
Invalid values can throw std::exception types that are not caught by HTTPFileSystem::OpenInputFile()
(it only catches neug::exception::Exception), potentially terminating the process.
Agent Prompt
### Issue description
Parsing HTTP timeout options with `std::stoi` can throw uncaught `std::exception` types, which are not handled by the current exception conversion path.

### Issue Context
Timeout values come from `FileSchema.options` (user input). Invalid input must not crash the process.

### Fix Focus Areas
- extension/s3/src/http_filesystem.cc[294-307]
- extension/s3/src/http_filesystem.cc[614-622]

### Suggested fix
- Wrap `std::stoi` calls in try/catch and throw a `THROW_INVALID_ARGUMENT_EXCEPTION(...)` (or return an Arrow Status) with a clear message indicating the bad key/value.
- Optionally extend `OpenInputFile()` to also catch `const std::exception&` and convert it to `arrow::Status::IOError` so unexpected std exceptions don’t terminate the process.
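One way the guarded parse might look, as a standalone helper (a sketch: the real code would throw `THROW_INVALID_ARGUMENT_EXCEPTION` with the offending key/value rather than silently fall back to the default):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Parse a timeout option without letting std::stoi's exceptions escape.
// Rejects trailing junk ("30s"), negative values, and unparseable input.
int ParseTimeoutOr(const std::string& value, int fallback) {
  try {
    size_t consumed = 0;
    int parsed = std::stoi(value, &consumed);
    if (consumed != value.size() || parsed < 0) return fallback;
    return parsed;
  } catch (const std::exception&) {  // invalid_argument or out_of_range
    return fallback;
  }
}
```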


Comment on lines +34 to +40
if(BUILD_TEST)
    add_extension_test(NAME s3
        TEST_SRCS ${CMAKE_CURRENT_SOURCE_DIR}/tests/s3_test.cpp
                  ${CMAKE_CURRENT_SOURCE_DIR}/tests/http_test.cpp
        EXTRA_LINK_LIBS arrow_dataset_objlib ${ARROW_PARQUET_LIB}
    )
endif()
Action required

5. s3_extension_test missing curl link 🐞 Bug ≡ Correctness

The s3_extension_test executable compiles HTTP filesystem sources that call libcurl, but the test
target does not link against CURL_LIBRARIES.
This will typically cause undefined-reference linker failures for curl symbols when building tests.
Agent Prompt
### Issue description
`s3_extension_test` compiles code that uses libcurl but does not link against libcurl.

### Issue Context
`add_extension_test()` compiles extension sources directly into the test binary, so the test must explicitly link all dependencies used by those sources.

### Fix Focus Areas
- extension/s3/CMakeLists.txt[34-40]
- extension/CMakeLists.txt[87-106]

### Suggested fix
- In `extension/s3/CMakeLists.txt`, update the `add_extension_test(NAME s3 ...)` call to include curl:
  - Add `${CURL_LIBRARIES}` to `EXTRA_LINK_LIBS`.
  - If needed on some platforms, also add `${CURL_INCLUDE_DIRS}` via `EXTRA_INCLUDE_DIRS`.
- Ensure the resulting `s3_extension_test` links successfully in CI.


- Fix empty endpoint override being ignored (comment #2)
  - Modified getOptionWithEnv to respect explicitly set empty values
  - Allows ENDPOINT_OVERRIDE="" to force default AWS S3 endpoint

- Fix HTTP ReadAt returning incorrect byte count (comment alibaba#3)
  - Changed ReadRange to return actual bytes read (Result<int64_t>)
  - Updated ReadAt to return actual bytes instead of requested nbytes
  - Handle zero-length reads and HTTP 416 responses
  - Use arrow::SliceBuffer for correct buffer sizing

- Fix HTTP timeout parsing uncaught exceptions (comment alibaba#4)
  - Added try-catch around std::stoi for timeout values
  - Added std::exception handler in OpenInputFile

- Fix s3_extension_test missing curl link library (comment alibaba#5)
  - Added CURL_LIBRARIES and CURL_INCLUDE_DIRS to test target
- S3TestEnvironment::SetUp() already initializes S3 subsystem globally
- Remove redundant EnsureS3Initialized() from S3ExtensionTest::SetUp()
- Each test case no longer needs to initialize S3 independently

Development

Successfully merging this pull request may close these issues.

[Feature] Support remote filesystems in extension (S3, OSS, HTTP)
