Conversation
Let's start dogfooding parallel type checking. This is about 3x faster than sequential on my mac laptop (not compiled).
ilevkivskyi
left a comment
There was a problem hiding this comment.
I don't remember if we install ast-serialize in all CI jobs, so you may need to update test.yml
|
It crashed on Windows: |
|
Yeah, something looks off on Windows, tests has been flaky for a while also, I am trying to debug test flakiness on Windows in #21220 |
This comment has been minimized.
This comment has been minimized.
|
@JukkaL I have re-run the PR build three times, it passed all three times. Unfortunately these Windows flakes are hard to reproduce. |
|
I wonder if running with a huge number of workers would reproduce this more reliably. I'm going to wait until we have a better understanding of the Windows issue. |
|
I asked Claude Code and Codex to investigate the crash, since I don't know much about Windows IPC. Here's one idea from Claude Code: ⏺ Here's my analysis. The root cause is a TOCTOU race condition in the Windows ready_to_read The Bug The Windows ready_to_read function at mypy/ipc.py:385 uses overlapped 1-byte probe reads on The race is in the cancel path at lines 425-433: for i, ov in pending:
if _winapi.WaitForSingleObject(ov.event, 0) == _winapi.WAIT_OBJECT_0:
_, err = ov.GetOverlappedResult(True)
data = ov.getbuffer()
if data:
conns[i].buffer.extend(data)
ready.append(i)
else:
ov.cancel() # <-- BUG: data may have been consumed alreadyHere's the sequence:
The TOCTOU race: Between step 3's WaitForSingleObject returning WAIT_TIMEOUT and step 4's
Why This Causes "No data received" The framing protocol prefixes each message with a 4-byte big-endian length. When 1 byte is lost
The Fix After ov.cancel(), always call GetOverlappedResult and getbuffer() to retrieve data from races for i, ov in pending:
if _winapi.WaitForSingleObject(ov.event, 0) == _winapi.WAIT_OBJECT_0:
_, err = ov.GetOverlappedResult(True)
data = ov.getbuffer()
if data:
conns[i].buffer.extend(data)
ready.append(i)
else:
ov.cancel()
# The I/O may have completed between WaitForSingleObject and cancel.
# Retrieve any consumed data to avoid losing bytes from the pipe.
try:
ov.GetOverlappedResult(True)
data = ov.getbuffer()
if data:
conns[i].buffer.extend(data)
except OSError:
pass # Properly cancelled, no data consumedThe saved data stays in conn.buffer. The next ready_to_read call will detect the non-empty Why It's Random
|
|
It looks like there are 3 different sources of flakes of flakes on Windows:
I am currently focusing on last two. For the first one, we can simply first cancel all of them after a wait, then check which were successful and mark those ready, which looks like a simpler fix that what Claude proposes. |
This may help debug random Windows crashes (see #21221).
This comment has been minimized.
This comment has been minimized.
|
I can look at the first issues on the list. |
|
I already have a fix for no 2 in the list (i.e. |
|
Btw @JukkaL you can click on little circular arrow it the build list in GitHub to re-run an individual job manually. You need to hover the mouse for the arrow to appear. |
|
I don't see that button to retry a build. Maybe I'm in a different A/B testing group, or it doesn't work on Firefox? #21228 is my attempt at fixing the first issue discussed above. |
I am on Firefox. You may need to cancel pending builds to see this (also again, the arrow is hidden until you hover the mouse above it). This feature is at least couple years old (and yeah, this is a horrible UI). |
Cancellation is asynchronous, so some data could be read after calling `cancel()`. Rework `ready_to_read` to first cancel all operations and then check for each if data is available. See discussion in #21221 for more context. I heavily relied on coding agent assist for this, but I did multiple review iterations and refinements.
|
Ah now I see the button. I'm going to merge my attempted fix first. |
|
I retried the GitHub actions job a bunch of times and it worked consistently. This looks good enough to merge now -- we can always revert if it's still flaky. @ilevkivskyi Do you have some fixes you want to merge first? |
No, please go ahead. My flake fixes are orthogonal to this, I will merge them later. |
|
According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅ |

Let's start dogfooding parallel type checking. This is about 3x faster for a cold run than sequential on my mac laptop (not compiled).