Skip to content

client: drop SelectiveHTTP2Transport, use HTTP/1.1 only#274

Merged
ggprior merged 2 commits into
mainfrom
georg/drop-http2-transport
May 11, 2026
Merged

client: drop SelectiveHTTP2Transport, use HTTP/1.1 only#274
ggprior merged 2 commits into
mainfrom
georg/drop-http2-transport

Conversation

@ggprior
Copy link
Copy Markdown
Contributor

@ggprior ggprior commented May 11, 2026

Summary

  • Drop SelectiveHTTP2Transport and the HTTPTransport import — ServiceClient.httpx_client now uses HTTP/1.1 everywhere.
  • Drop the [http2] extra from the httpx dependency so h2 is no longer pulled in transitively.
  • Inline comment explains why in case anyone is tempted to reintroduce HTTP/2 in the future.

Why

The transport opted /tabpfn/fit and /tabpfn/predict into HTTP/2 while leaving everything else on HTTP/1.1. Both are long-running unary POSTs (thinking-mode fits keep the stream open for 5–15 min) and gain nothing measurable from HTTP/2's multiplexing or HPACK — one request per fit, multipart body dominated by raw bytes.

What HTTP/2 did buy us: a state-machine race.

The h2 library treats a PING frame received while ConnectionState == CLOSED as a protocol violation (RFC 7540 §6.7 — PINGs only make sense while the connection is open). During long thinking fits, intermediate keepalive PINGs (Cloud Run's LB sends them on long-lived streams) regularly land on the client after the connection has been reaped by some hop, and httpx surfaces this as:

httpx.LocalProtocolError: Invalid input ConnectionInputs.RECV_PING in state ConnectionState.CLOSED

That exception class is not in either _fit's or _predict's @backoff.on_exception retry tuple — the tuple covers RemoteProtocolError (peer sent garbage) but not LocalProtocolError (our state machine got wrong-footed by a benign-but-out-of-order frame). So instead of being retried silently, the fit fails hard with a confusing error.

Reproduction / evidence

A 135-fit TabArena-Medium sweep against api.priorlabs.ai (15 datasets × 3 folds × 3 modes: plain / thinking_medium / thinking_high) ate 6 LocalProtocolErrors across thinking_medium + thinking_high:

{"error": "Invalid input ConnectionInputs.RECV_PING in state ConnectionState.CLOSED",
 "traceback": "...httpcore/_sync/http2.py:185...raise LocalProtocolError(exc)..."}

Re-running the same matrix locally with httpx_client monkey-patched to drop the selective HTTP/2 transport cleared all 5 missing thinking_high cells with zero protocol errors — confirming the race disappears.

Alternative considered

Adding httpx.LocalProtocolError to the retry tuple would also mask the issue, but it's strictly the worse fix:

  • still leaves HTTP/2 enabled with no upside
  • one retry per fit, ~5–15 min wasted on the doomed attempt
  • the h2 state machine can fail in other ways too (less common, but still in scope)

Dropping HTTP/2 entirely eliminates the class of bugs at the cost of zero observable functionality.

Test plan

  • pytest tests/unit/test_client.py — 22/22 pass (unchanged from main).
  • Local rerun of 5 previously-failing thinking_high TabArena-Medium tasks under the patched client — 0 protocol errors, all 5 complete cleanly.

`SelectiveHTTP2Transport` opted /tabpfn/fit and /tabpfn/predict into
HTTP/2 while leaving everything else on HTTP/1.1. Both endpoints are
long-running unary POSTs (thinking-mode fits keep the stream open for
5–15 min) and pick up no measurable benefit from HTTP/2's multiplexing
or HPACK — the body is a multipart upload dominated by raw bytes and
we only ever send one request per fit.

What HTTP/2 *did* buy us: a race condition. The h2 state machine
treats a PING frame received in the CLOSED state as a protocol
violation. During long thinking fits the server-side keepalive
PINGs (Cloud Run LB sends them on long-lived streams) regularly land
on the client *after* the connection has been reaped by some
intermediate hop, and httpx surfaces this as
`httpx.LocalProtocolError("Invalid input ConnectionInputs.RECV_PING
in state ConnectionState.CLOSED")`. That exception class is NOT in
either `_fit` or `_predict`'s `@backoff.on_exception` retry tuple
(the tuple covers `RemoteProtocolError` — peer sent garbage — but
not `LocalProtocolError` — our state machine got wrong-footed by
a benign-but-out-of-order frame). So instead of being retried
silently, the fit fails hard with a confusing error.

Empirically: a 135-fit TabArena-Medium sweep against api.priorlabs.ai
ate 6 of these errors across thinking_medium + thinking_high; re-running
the same matrix locally after patching `httpx_client` to drop the
selective HTTP/2 transport cleared all 5 thinking_high cells without
a single protocol error.

HTTP/1.1 has no PING frames at all and no equivalent state machine,
so the entire class of bug disappears. TCP-level resets surface as
the familiar `ConnectError` / `ReadError` family, which the retry
tuple already covers.

Side change: drop `[http2]` extra from the `httpx` dep since we no
longer use it; `h2` will no longer be pulled in transitively.

Net diff: 1 transport subclass + 1 import removed, 1 dep extra
dropped, ~10 lines of comment added explaining the rationale.
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request disables HTTP/2 support by removing the SelectiveHTTP2Transport class and reverting to HTTP/1.1 for all requests. This change addresses a protocol violation issue where long-running requests would fail due to keepalive PINGs from the load balancer being received after a connection was closed. The httpx[http2] dependency has also been simplified to httpx. I have no feedback to provide.

@ggprior ggprior marked this pull request as ready for review May 11, 2026 08:33
@ggprior ggprior requested a review from a team as a code owner May 11, 2026 08:33
@ggprior ggprior requested review from noahho and removed request for a team May 11, 2026 08:33
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@ggprior ggprior requested review from simo-prior and removed request for noahho May 11, 2026 08:33
@ggprior ggprior added this pull request to the merge queue May 11, 2026
Merged via the queue into main with commit f47ac38 May 11, 2026
6 checks passed
@ggprior ggprior deleted the georg/drop-http2-transport branch May 11, 2026 10:10
@ggprior
Copy link
Copy Markdown
Contributor Author

ggprior commented May 11, 2026

Post-patch validation:

Re-ran the failed-row subset of the TabArena-Medium sweep (thinking_high, 45 fits over 5h 56m on api.priorlabs.ai) against this branch's patched SDK:

HTTP/2 (main) HTTP/1.1 (this PR)
fits attempted 45 45
fits completed 40/45 45/45
LocalProtocolError (RECV_PING in CLOSED) 5 0
backoff retries needed 0 (LocalProtocolError bypasses the retry tuple) 0 (no transient issues at all)
total wallclock 6h 31m 5h 56m

Zero protocol errors across 5h 56m of long-running thinking fits — the 4.4% intermittent failure rate from the HTTP/2 PING race is gone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants