client: drop SelectiveHTTP2Transport, use HTTP/1.1 only#274
Conversation
`SelectiveHTTP2Transport` opted /tabpfn/fit and /tabpfn/predict into
HTTP/2 while leaving everything else on HTTP/1.1. Both endpoints are
long-running unary POSTs (thinking-mode fits keep the stream open for
5–15 min) and pick up no measurable benefit from HTTP/2's multiplexing
or HPACK — the body is a multipart upload dominated by raw bytes and
we only ever send one request per fit.
What HTTP/2 *did* buy us: a race condition. The h2 state machine
treats a PING frame received in the CLOSED state as a protocol
violation. During long thinking fits the server-side keepalive
PINGs (Cloud Run LB sends them on long-lived streams) regularly land
on the client *after* the connection has been reaped by some
intermediate hop, and httpx surfaces this as
`httpx.LocalProtocolError("Invalid input ConnectionInputs.RECV_PING
in state ConnectionState.CLOSED")`. That exception class is NOT in
either `_fit` or `_predict`'s `@backoff.on_exception` retry tuple
(the tuple covers `RemoteProtocolError` — peer sent garbage — but
not `LocalProtocolError` — our state machine got wrong-footed by
a benign-but-out-of-order frame). So instead of being retried
silently, the fit fails hard with a confusing error.
Empirically: a 135-fit TabArena-Medium sweep against api.priorlabs.ai
ate 6 of these errors across thinking_medium + thinking_high; re-running
the same matrix locally after patching `httpx_client` to drop the
selective HTTP/2 transport cleared all 5 thinking_high cells without
a single protocol error.
HTTP/1.1 has no PING frames at all and no equivalent state machine,
so the entire class of bug disappears. TCP-level resets surface as
the familiar `ConnectError` / `ReadError` family, which the retry
tuple already covers.
Side change: drop `[http2]` extra from the `httpx` dep since we no
longer use it; `h2` will no longer be pulled in transitively.
Net diff: 1 transport subclass + 1 import removed, 1 dep extra
dropped, ~10 lines of comment added explaining the rationale.
There was a problem hiding this comment.
Code Review
This pull request disables HTTP/2 support by removing the SelectiveHTTP2Transport class and reverting to HTTP/1.1 for all requests. This change addresses a protocol violation issue where long-running requests would fail due to keepalive PINGs from the load balancer being received after a connection was closed. The httpx[http2] dependency has also been simplified to httpx. I have no feedback to provide.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
Post-patch validation: Re-ran the failed-row subset of the TabArena-Medium sweep (
Zero protocol errors across 5h 56m of long-running thinking fits — the 4.4% intermittent failure rate from the HTTP/2 PING race is gone. |
Summary
SelectiveHTTP2Transportand theHTTPTransportimport —ServiceClient.httpx_clientnow uses HTTP/1.1 everywhere.[http2]extra from thehttpxdependency soh2is no longer pulled in transitively.Why
The transport opted
/tabpfn/fitand/tabpfn/predictinto HTTP/2 while leaving everything else on HTTP/1.1. Both are long-running unary POSTs (thinking-mode fits keep the stream open for 5–15 min) and gain nothing measurable from HTTP/2's multiplexing or HPACK — one request per fit, multipart body dominated by raw bytes.What HTTP/2 did buy us: a state-machine race.
The
h2library treats aPINGframe received whileConnectionState == CLOSEDas a protocol violation (RFC 7540 §6.7 — PINGs only make sense while the connection is open). During long thinking fits, intermediate keepalive PINGs (Cloud Run's LB sends them on long-lived streams) regularly land on the client after the connection has been reaped by some hop, and httpx surfaces this as:That exception class is not in either
_fit's or_predict's@backoff.on_exceptionretry tuple — the tuple coversRemoteProtocolError(peer sent garbage) but notLocalProtocolError(our state machine got wrong-footed by a benign-but-out-of-order frame). So instead of being retried silently, the fit fails hard with a confusing error.Reproduction / evidence
A 135-fit TabArena-Medium sweep against
api.priorlabs.ai(15 datasets × 3 folds × 3 modes: plain / thinking_medium / thinking_high) ate 6LocalProtocolErrors across thinking_medium + thinking_high:Re-running the same matrix locally with
httpx_clientmonkey-patched to drop the selective HTTP/2 transport cleared all 5 missing thinking_high cells with zero protocol errors — confirming the race disappears.Alternative considered
Adding
httpx.LocalProtocolErrorto the retry tuple would also mask the issue, but it's strictly the worse fix:h2state machine can fail in other ways too (less common, but still in scope)Dropping HTTP/2 entirely eliminates the class of bugs at the cost of zero observable functionality.
Test plan
pytest tests/unit/test_client.py— 22/22 pass (unchanged from main).