fix: route dataset manifest and file preview through service for private buckets #795
KeitaW wants to merge 3 commits into NVIDIA:main from …
Conversation
…t private buckets (NVIDIA#793)

The UI's fetchManifest and file preview proxy performed unsigned fetch() against S3 HTTPS URLs, which fails with 403 on private buckets.

Added two service-side proxy endpoints:
- GET /{bucket}/dataset/{name}/manifest — reads manifest JSON from storage using bucket credentials (supports S3, GCS, Azure, Swift, TOS)
- GET /{bucket}/dataset/{name}/file-content — streams individual file content with storage_path validation against the dataset container

Updated the UI to call these endpoints through the existing /api catch-all proxy instead of direct unsigned fetch. Removed the unused fetchManifest server action files.
The file preview panel sends a HEAD request before GET to check content-type and access. FastAPI's @router.get does not handle HEAD, returning 405 Method Not Allowed. Changed to @router.api_route with both GET and HEAD methods.
fetchDatasetFiles used a relative fetch('/api/bucket/...') which fails
during SSR because the request goes through the proxy without browser
auth cookies, resulting in 403 from the API gateway.
Re-introduced the server action pattern: fetchManifest now calls the
backend service directly using getServerApiBaseUrl() (internal URL),
bypassing the auth gateway. Works for both SSR and client hydration.
📝 Walkthrough

The PR adds backend API endpoints for fetching dataset manifests and file content, shifting manifest and file access from direct unsigned frontend fetches against storage to authenticated service-proxied requests that resolve private-bucket access failures.
Sequence Diagram

```mermaid
sequenceDiagram
    actor Browser as Browser/UI
    participant Proxy as Frontend<br/>Proxy Route
    participant Service as Backend<br/>Service
    participant Storage as Object<br/>Storage (S3)
    Note over Browser,Storage: New Flow: Service-Proxied Authenticated Access
    Browser->>Proxy: GET /proxy/dataset/file?bucket=X&name=Y&storagePath=...
    Proxy->>Service: GET /api/bucket/X/dataset/Y/file-content?storage_path=...
    Note over Service: Validate storage_path container<br/>matches dataset hash_location
    Service->>Storage: fetch(signed_url_or_creds)
    Storage-->>Service: File bytes
    Service-->>Proxy: StreamingResponse<br/>(inferred mime-type)
    Proxy-->>Browser: Forwarded response<br/>with headers
    rect rgba(100, 200, 100, 0.5)
        Note over Browser,Storage: Manifest Flow
        Browser->>Proxy: Manifest request
        Proxy->>Service: GET /api/bucket/X/dataset/Y/manifest?version=V
        Service->>Storage: Load manifest from version location
        Storage-->>Service: JSON manifest
        Service-->>Browser: Parsed manifest
    end
```
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 4 passed, ❌ 1 failed (1 warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/service/core/data/data_service.py`:
- Around line 1032-1036: The current check only compares
requested_backend.container to dataset_backend.container (constructed via
storage.construct_storage_backend), which allows any path in the same container;
instead validate that the supplied storage_path is within the dataset's storage
prefix by ensuring storage_path (or the backend's full path) starts with
dataset_info.hash_location (or its normalized prefix) before proceeding; if it
does not, raise osmo_errors.OSMOUserError with a clear message. Normalize both
paths (resolve trailing slashes, case/URL encoding as appropriate) prior to the
startswith check and keep the container equality check as a fast precondition.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e0ab25ae-7964-4301-9a03-e53a0d59233a
📒 Files selected for processing (10)
- src/service/core/data/data_service.py
- src/ui/next.config.ts
- src/ui/src/app/proxy/dataset/file/route.impl.production.ts
- src/ui/src/features/datasets/detail/components/dataset-detail-content.tsx
- src/ui/src/features/datasets/detail/components/file-preview-panel.tsx
- src/ui/src/lib/api/adapter/datasets-hooks.ts
- src/ui/src/lib/api/adapter/datasets.ts
- src/ui/src/lib/api/server/dataset-actions.production.ts
- src/ui/src/lib/api/server/dataset-actions.ts
- src/ui/src/mocks/handlers.ts
💤 Files with no reviewable changes (2)
- src/ui/next.config.ts
- src/ui/src/lib/api/server/dataset-actions.ts
```python
requested_backend = storage.construct_storage_backend(storage_path)
dataset_backend = storage.construct_storage_backend(dataset_info.hash_location)
if requested_backend.container != dataset_backend.container:
    raise osmo_errors.OSMOUserError(
        'Storage path does not belong to this dataset.')
```
Container-only validation may be overly permissive.
The validation only checks that requested_backend.container matches dataset_backend.container. This allows access to any file within the same container, not just files belonging to this specific dataset's storage prefix. Consider validating that storage_path starts with dataset_info.hash_location to restrict access to only the dataset's files.
🛡️ Proposed stricter path validation

```diff
 # Validate that the storage path belongs to this dataset's storage prefix
 requested_backend = storage.construct_storage_backend(storage_path)
 dataset_backend = storage.construct_storage_backend(dataset_info.hash_location)
-if requested_backend.container != dataset_backend.container:
+if requested_backend.container != dataset_backend.container or \
+        not storage_path.startswith(dataset_info.hash_location.rstrip('/')):
     raise osmo_errors.OSMOUserError(
         'Storage path does not belong to this dataset.')
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
requested_backend = storage.construct_storage_backend(storage_path)
dataset_backend = storage.construct_storage_backend(dataset_info.hash_location)
if requested_backend.container != dataset_backend.container or \
        not storage_path.startswith(dataset_info.hash_location.rstrip('/')):
    raise osmo_errors.OSMOUserError(
        'Storage path does not belong to this dataset.')
```
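One subtlety the review's "normalize both paths" advice guards against: a bare startswith check matches across path-segment boundaries, so the prefix 'b/ds' also accepts 'b/ds-other/...'. A sketch of a stricter helper (the function name is hypothetical, not code from this PR):

```python
def path_belongs_to_dataset(storage_path: str, hash_location: str) -> bool:
    # Normalize the trailing slash, then require either an exact match or
    # a match that ends at a '/' boundary, so 'b/ds-other/f' cannot pass
    # for the dataset prefix 'b/ds'.
    prefix = hash_location.rstrip('/')
    return storage_path == prefix or storage_path.startswith(prefix + '/')
```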
fix(service): use SingleObjectClient for manifest + file-content endpoints

#795 was authored against an older Client.get_object_stream() shape that accepted no positional args. On main, the public overloads now require a remote_path argument. Single-object fetches should go through storage.SingleObjectClient (already used by cli/dataset.py and lib/data/dataset/downloading.py) — its get_object_stream() takes no path because the path is bound at create() time from storage_uri.

mypy on data_service.py now passes; behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The conflict resolution when cherry-picking #795 left a multi-line generateFlatManifest call that prettier wants on one line under the project's print-width settings. Pure formatting; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
Four review-driven fixes that all touched the dataset preview path:

1. parse_uri_to_link preserves the override_url base path (CAIOS via reverse proxy). Previously it dropped parsed.path; an override of 'https://gateway.example.com/s3' produced URLs against gateway.example.com with no /s3 segment. Now host + base_path are preserved through both addressing styles, including the no-scheme-prefix case where urlparse drops the whole input into 'path'.
2. get_file_content validates against the dataset hash_location prefix, not just the bucket container. This was the unresolved CodeRabbit feedback on #795 — container-only matching let an authenticated caller request any object in the same bucket via the dataset endpoint.
3. Forward the Range request header and the accept-ranges/content-range response headers through the proxy so 206 Partial Content semantics survive (needed for video/large-file preview).
4. Remove the legacy 'url=' fallback from the proxy. It accepted any http(s) URL — a clear SSRF vector (cloud metadata endpoints, RFC1918, loopback). Manifests since #795 always carry storage_path; the legacy path was effectively dead. The UI's file-preview-panel was updated to drop the matching client-side fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
Thank you for the contribution; it's cherry-picked into #957.
Thanks for working on it @cypres!
* fix(storage): build dataset URLs against override_url for S3-compatible backends
S3Backend.parse_uri_to_link hardcoded the AWS pattern
'https://<bucket>.s3.<region>.amazonaws.com/...' regardless of credential
override_url. The UI consumes these URLs to fetch dataset content; for a
CAIOS-backed bucket this produced
'osmo-on-cw-dev-harnholm-datasets.s3.us-east-14a.amazonaws.com', which
ENOTFOUNDs in DNS.
Thread the credential's override_url and addressing_style through
parse_uri_to_link. When override_url is set, S3Backend builds a URL
against that host (virtual-host by default; addressing_style='path'
yields path-style for localstack/MinIO without wildcard DNS). AWS S3
behavior (no override_url) is unchanged. Other backends accept the
new kwargs but ignore them.
Updated callsites:
- data_service.py: dataset listing payload now respects bucket
config's default_credential.override_url. This is the immediate
UI fix.
- dataset/uploading.py (×2) + migrating.py: new uploads/migrations
record the correct URL in the manifest. Existing manifest entries
persisted with AWS-pattern URLs are not migrated (separate concern).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
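The URL-construction rule this commit describes can be sketched roughly as follows (function and parameter names are hypothetical; the real logic lives in S3Backend.parse_uri_to_link):

```python
from typing import Optional
from urllib.parse import urlparse

def build_object_url(bucket: str, key: str, region: str,
                     override_url: Optional[str] = None,
                     addressing_style: str = 'virtual') -> str:
    if override_url is None:
        # No override: keep the AWS virtual-host pattern unchanged.
        return f'https://{bucket}.s3.{region}.amazonaws.com/{key}'
    # urlparse drops a scheme-less input entirely into .path, so add one.
    if '://' not in override_url:
        override_url = 'https://' + override_url
    parsed = urlparse(override_url)
    base_path = parsed.path.rstrip('/')  # preserve e.g. the '/s3' segment
    if addressing_style == 'path':
        # Path style: works for localstack/MinIO without wildcard DNS.
        return f'{parsed.scheme}://{parsed.netloc}{base_path}/{bucket}/{key}'
    # Virtual-host style against the override host.
    return f'{parsed.scheme}://{bucket}.{parsed.netloc}{base_path}/{key}'
```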
* fix: route dataset file browser and preview through service to support private buckets (#793)
The UI's fetchManifest and file preview proxy performed unsigned fetch()
against S3 HTTPS URLs, which fails with 403 on private buckets.
Added two service-side proxy endpoints:
- GET /{bucket}/dataset/{name}/manifest — reads manifest JSON from storage
using bucket credentials (supports S3, GCS, Azure, Swift, TOS)
- GET /{bucket}/dataset/{name}/file-content — streams individual file
content with storage_path validation against dataset container
Updated UI to call these endpoints through the existing /api catch-all
proxy instead of direct unsigned fetch. Removed unused fetchManifest
server action files.
* fix: support HEAD method on file-content endpoint for preview preflight
The file preview panel sends a HEAD request before GET to check
content-type and access. FastAPI's @router.get does not handle HEAD,
returning 405 Method Not Allowed. Changed to @router.api_route with
both GET and HEAD methods.
* fix: use server action for manifest fetch to avoid SSR auth failure
fetchDatasetFiles used a relative fetch('/api/bucket/...') which fails
during SSR because the request goes through the proxy without browser
auth cookies, resulting in 403 from the API gateway.
Re-introduced the server action pattern: fetchManifest now calls the
backend service directly using getServerApiBaseUrl() (internal URL),
bypassing the auth gateway. Works for both SSR and client hydration.
* fix(service): use SingleObjectClient for manifest + file-content endpoints
#795 was authored against an older Client.get_object_stream() shape that
accepted no positional args. On main, the public overloads now require a
remote_path argument. Single-object fetches should go through
storage.SingleObjectClient (already used by cli/dataset.py and
lib/data/dataset/downloading.py) — its get_object_stream() takes no path
because the path is bound at create() time from storage_uri.
mypy on data_service.py now passes; behavior is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
* fix(ui): forward auth headers in SSR manifest fetch
The server-action call to /api/bucket/{bucket}/dataset/{name}/manifest
went out without the incoming request's session cookie, so oauth2-proxy
returned 401 with "No valid authentication in request" and the dataset
detail page crashed with "Failed to fetch manifest: 401".
getServerFetchHeaders() already exists in config.production.ts for
exactly this purpose — it pulls authorization + cookie from the SSR
request via next/headers and forwards them upstream. Use it.
The original docstring claimed the server action would "bypass the API
gateway auth layer" by hitting an internal service URL, but
getServerApiBaseUrl() returns the public hostname (Envoy + oauth2-proxy
front), so requests still pass through the gateway. Forwarding the
caller's session is the simpler fix and doesn't require a new internal
URL knob in deployment config.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
* fix(ui): forward filename to file-content endpoint for Content-Type guessing
storage_path is hash-keyed (e.g. .../hashes/<etag>) and carries no
extension. mimetypes.guess_type(storage_path) returned None, the service
fell back to application/octet-stream, and the preview panel rendered
every text file as a binary download.
Forward the original relative_path through the proxy as a 'filename'
query param. The service uses it for media-type guessing only; access
control still hinges on storage_path's container check. Falls through
cleanly when filename is absent (preserves current behavior for any
legacy caller).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
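A minimal sketch of the fallback order described above, assuming stdlib mimetypes-based guessing (the helper name is hypothetical):

```python
import mimetypes
from typing import Optional

def guess_media_type(storage_path: str, filename: Optional[str] = None) -> str:
    # storage_path is hash-keyed (no extension), so prefer the original
    # filename when the caller forwarded one; fall through cleanly when
    # it is absent.
    for candidate in (filename, storage_path):
        if candidate:
            guessed, _ = mimetypes.guess_type(candidate)
            if guessed:
                return guessed
    return 'application/octet-stream'
```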
* chore(ui): prettier-format manifest mock handler
The conflict resolution when cherry-picking #795 left a multi-line
generateFlatManifest call that prettier wants on one line under the
project's print-width settings. Pure formatting; no behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
* fix: dataset proxy hardening and override URL prefix support
Four review-driven fixes that all touched the dataset preview path:
1. parse_uri_to_link preserves override_url base path (CAIOS via reverse
proxy). Previously dropped parsed.path; an override of
'https://gateway.example.com/s3' produced URLs against
gateway.example.com with no /s3 segment. Now host + base_path are
preserved through both addressing styles, including the
no-scheme-prefix case where urlparse drops the whole input into 'path'.
2. get_file_content validates against dataset hash_location prefix, not
just the bucket container. This was the unresolved CodeRabbit feedback
on #795 — container-only matching let an authenticated caller request
any object in the same bucket via the dataset endpoint.
3. Forward Range request and accept-ranges/content-range response headers
through the proxy so 206 Partial Content semantics survive (needed for
video/large-file preview).
4. Remove the legacy 'url=' fallback from the proxy. It accepted any
http(s) URL — a clear SSRF vector (cloud metadata endpoints, RFC1918,
loopback). Manifests since #795 always carry storage_path; the legacy
path was effectively dead. UI's file-preview-panel updated to drop the
matching client-side fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
* fix(ui): cache hardening on auth-gated dataset proxy + HEAD query
Two review fixes for the auth-sensitive dataset preview path:
1. Stop forwarding upstream Cache-Control on /proxy/dataset/file. Storage
providers commonly set 'public, max-age=…' on object responses; passing
that through on a per-user authenticated route lets intermediate caches
serve one user's bytes to another. Drop 'cache-control' from
FORWARDED_RESPONSE_HEADERS and explicitly set 'Cache-Control:
private, no-store' on every response. ('vary' was never forwarded — left
as is.)
2. Drop staleTime: Infinity from the file-preview HEAD useQuery. Caching a
401/403 forever would prevent revalidation after the user's session
refreshes. staleTime: 0 restores standard React Query revalidation
semantics (refetch on focus/mount when stale).
The text-preview useQuery one block up still has staleTime: Infinity; left
alone here since the reviewer scoped the finding to the HEAD query, but
worth a follow-up if we see stuck 401s on the text preview specifically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
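A sketch of the header policy in point 1, written in Python for illustration (the actual proxy is the TypeScript route; all names here are hypothetical):

```python
# Allowlist of upstream response headers the proxy forwards.
# Note that 'cache-control' is deliberately absent.
FORWARDED_RESPONSE_HEADERS = {
    'content-type', 'content-length', 'accept-ranges', 'content-range',
}

def build_response_headers(upstream: dict) -> dict:
    out = {k.lower(): v for k, v in upstream.items()
           if k.lower() in FORWARDED_RESPONSE_HEADERS}
    # Auth-gated route: never let a shared cache store one user's bytes.
    out['cache-control'] = 'private, no-store'
    return out
```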
* fix(service): thread override_url/addressing_style through query_dataset link
query_dataset's metadata-enabled branch was the third parse_uri_to_link
callsite I missed when threading override_url + addressing_style for
private S3-compatible buckets — get_collection_info and get_dataset_info
already do this. Without it, /api/bucket/{bucket}/query results
on a CAIOS-backed bucket return AWS-pattern hostnames that ENOTFOUND on
the UI side, same failure mode the original parse_uri_to_link bug had.
Mirrors the default_credential fallback pattern (override_url and
addressing_style live on bucket_config.default_credential, not directly
on bucket_config).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
---------
Signed-off-by: Hans Arnholm <harnholm@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Keita Watanabe <keitaw09@gmail.com>
Summary
Fixes #793 — dataset file browser and preview broken on private S3 buckets.
- GET /{bucket}/dataset/{name}/manifest — service endpoint that reads manifest JSON from storage using bucket_config.default_credential (supports S3, GCS, Azure, Swift, TOS)
- GET /{bucket}/dataset/{name}/file-content — service endpoint that streams individual file content with storage path validation
- … storagePath is available

Test plan
- pnpm type-check && pnpm lint && pnpm test --run — all passing (751/751)
- pnpm dev:mock → navigate to dataset detail → file browser loads

Summary by CodeRabbit
Release Notes
New Features
Improvements