Skip to content

Native resumable upload for local files (fixes the #211 MCP timeout)#279

Merged
dannon merged 19 commits into
galaxyproject:mainfrom
dannon:fix/211-upload-timeout
Jun 17, 2026
Merged

Native resumable upload for local files (fixes the #211 MCP timeout)#279
dannon merged 19 commits into
galaxyproject:mainfrom
dannon:fix/211-upload-timeout

Conversation

@dannon

@dannon dannon commented Jun 11, 2026

Copy link
Copy Markdown
Member

Why

Uploading a large local file with galaxy_upload_file fails with MCP error -32001: Request timed out. The bytes stream through a single MCP request, and that request outlives the MCP SDK's hard-coded 60s per-request timeout, so the SDK aborts the call before Galaxy finishes ingesting. The URL variant (galaxy_upload_file_from_url) is fine -- Galaxy fetches server-side and the MCP request returns quickly. Nothing in loom can raise that timeout (it's baked into the SDK, below pi-mcp-adapter), so the only loom-side fix is to get the upload off the MCP request entirely. The aborted long request is also one of the things that trips the transport desync in #199 -- that's a separate problem, not addressed here.

What

A loom-native galaxy_upload_local_file tool that uploads with a resumable (TUS) transfer via tus-js-client, in its own in-process path -- no MCP request boundary, so no 60s wall. The broken MCP upload_file is hidden with excludeTools (URL upload stays), and the Galaxy context block nudges local uploads to the new tool.

Flow: TUS upload to /api/upload/resumable_upload -> POST /api/tools/fetch to trigger ingest -> wait the dataset to a terminal state before returning, so the model gets a real result rather than a "queued" non-answer (cf. #210). Plus:

  • Resumable for real -- an interrupted re-run picks up from the server's offset (findPreviousUploads/resumeFromPreviousUpload); a stale stored session restarts cleanly.
  • Credentials go through the child env, never argv.
  • Symlink guard -- the path is realpath-resolved before the sensitive/credential-path check, so a benign-named symlink can't smuggle out its target.

The upload core (galaxy-upload-tus.ts) is deliberately loom-free so it can move into @galaxyproject/galaxy-ops (galaxyproject/galaxy-mcp#61) when that lands and loom imports it directly (#202). This is the interim loom-side fix; the shared agent-ops package is the durable home. (It drops Python for uploads, but uvx is still needed for the galaxy-mcp server, so it doesn't remove Python from packaging on its own.)

Branch-history note: the early commits explored a uvx galaxy-upload CLI stopgap; the native TS implementation replaced it and the runner was deleted -- the net diff is native-only.

Testing

  • 550 unit tests pass (1 gated live test skips without creds); root + Orbit typechecks clean. Covers the orchestration branches (abort at each phase, fetch reject, terminal ingest states, read-back fallback) and the resume wiring (resume vs fresh start, abort-during-lookup).
  • Live-verified against test.galaxyproject.org (26.1.rc1): small file -> ok; 35 MiB multi-chunk -> ok; byte-exact binary round-trip (no to_posix_lines corruption); aborted-at-20 MiB upload resumes from 20 MiB on re-run; a planted dead session restarts cleanly.

Addresses #211.

@dannon dannon marked this pull request as ready for review June 11, 2026 16:22
dannon added 18 commits June 15, 2026 08:43
Pin the uvx package to a floor (galaxy-upload>=1.0.1) the way bin/loom.js
already does for galaxy-mcp -- detectUploadFailure leans on this CLI's stderr
format and its exit-0-on-ConnectionError behavior, so an unpinned spec could
drift out from under us. Also pass --no-auto-decompress: galaxy-upload defaults
to decompressing .gz/.zip on ingest, which silently turns an uploaded
fastqsanger.gz into an uncompressed dataset; keep the bytes as the user sent
them.
Needed for the new galaxy-upload-tus.ts module that replaces the
subprocess-based upload path.
buildFetchPayload, tusUpload, and waitForDataset -- the portable core that
will eventually move into galaxy-ops. No loom/pi imports; FileUrlStorage
imported via namespace cast to work around a gap in tus-js-client's type defs
(the runtime exports it, the .d.ts doesn't). 10 new tests covering the full
happy path, error path, abort, and timeout behavior.
…rsing, and test coverage

Move the pre-flight abort check after stream+finish() creation so cleanup
is unconditional regardless of when abort fires. Parse the session id with
the URL constructor instead of split() so trailing slashes and query strings
don't yield the wrong segment. Add a comment on the waitForDataset loop
clarifying what the timeout bounds. Add two missing tests: one for the
no-session-id rejection path, and one that verifies the settled guard
prevents a late onError from producing a second settlement after abort.
The tool now calls tusUpload() directly instead of shelling out to
uvx galaxy-upload. After the TUS transfer it posts to /tools/fetch
to trigger Galaxy ingest, then polls waitForDataset until a terminal
state. Fallback paths handle a missing outputs[] (history-contents
name-match) and a timed-out poll (return last-known state from the
fetch response).

Deleted galaxy-upload-runner.ts and removed the now-dead exports
(buildGalaxyUploadArgs, detectUploadFailure, GALAXY_UPLOAD_PACKAGE,
GalaxyUploadArgsOpts, UVX_MISSING). Test suite rewritten to mock the
TUS core and galaxy-api instead of the runner -- 14 handler tests
covering happy path, creds routing, abort, fetch rejection, read-back
fallback, and all existing guard cases.

543 tests green, typecheck clean.
…, add two abort tests

galaxyPost now maps AbortError to "Upload cancelled." instead of the misleading "Galaxy
rejected the upload" message. The waitForDataset catch gets a comment explaining why it's
intentionally non-fatal on abort/timeout -- the bytes are already uploaded and the dataset
exists, so returning last-known state is the right move. Added a comment on the fileName
vs realPath split so it doesn't read as an inconsistency. Two new tests cover the
galaxyPost-abort and waitForDataset-abort paths. The bare os/path imports in the test file
are used by resolveStoragePath's test and were left in place.
Skips automatically when GALAXY_URL / GALAXY_API_KEY / GALAXY_TEST_HISTORY_ID
are absent -- so CI and offline vitest runs stay green. When the env vars are
present it drives a real small-file upload end-to-end: TUS handshake ->
/api/tools/fetch -> waitForDataset, asserting the dataset reaches a terminal
state.
Adds a handler test asserting that a Galaxy ingest that lands in the
terminal 'error' state does not produce a tool-level error -- the bytes
got there, the ingest failed, and that distinction should be honest and
visible to the model.

Also drops a why-comment on the abort path in galaxy-upload-tus.ts: we
omit shouldTerminate intentionally so the partial TUS session stays on
the server and a retry can resume rather than re-uploading from scratch.
tus-js-client only stores resume state on its own -- it doesn't resume on
start(), so the FileUrlStorage we set up was being recorded and then thrown
away on retry, restarting every interrupted upload from zero. Wire
findPreviousUploads()/resumeFromPreviousUpload() before start() so a re-run of
the same file picks up from the server's offset. A stale stored session (404)
needs no special handling here -- tus-js-client drops it and creates a fresh
upload on its own since we always set endpoint. Verified live against
test.galaxyproject.org: a 60MiB upload aborted at 20MiB resumes from 20MiB on
re-run, and a planted dead session restarts cleanly without erroring.
…assertions

Review follow-up. The resume wiring added an async findPreviousUploads() step
before start(), so add a test that aborts while that lookup is in flight and
asserts start() never fires (the settled guard handles it). Also count
resumeFromPreviousUpload calls so the no-stored-partial test can prove it isn't
called at all, not just that resumedFrom stayed undefined. Plus a comment that
abort() before start() is a safe no-op.
@dannon dannon force-pushed the fix/211-upload-timeout branch from 53c9fab to 5651599 Compare June 15, 2026 12:45
A TUS transfer can complete while Galaxy's ingest lands the dataset in a state it won't accept as a tool input: error, discarded, or failed_metadata -- the exact complement of Galaxy's Dataset.valid_input_states (enforced by tools/parameters/dataset_matcher.py). Returning uploaded:true there let the agent build on a dataset it can't use, so report those as a tool failure with state-specific guidance: failed_metadata is recoverable by setting the datatype, while error/discarded are not. Also add the missing 'empty' terminal state so a zero-byte upload reaches a terminal state instead of polling until the ingest timeout.
@dannon dannon force-pushed the fix/211-upload-timeout branch from 4b21f1c to 87645e7 Compare June 17, 2026 21:50
@dannon dannon merged commit c817cf9 into galaxyproject:main Jun 17, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant