
Use parallel checking in self check #21221

Merged
JukkaL merged 5 commits into master from self-check-parallel on Apr 14, 2026

Conversation

@JukkaL
Collaborator

@JukkaL JukkaL commented Apr 14, 2026

Let's start dogfooding parallel type checking. This is about 3x faster for a cold run than sequential on my mac laptop (not compiled).

@JukkaL JukkaL requested a review from ilevkivskyi April 14, 2026 10:17
Member

@ilevkivskyi ilevkivskyi left a comment


I don't remember if we install ast-serialize in all CI jobs, so you may need to update test.yml

@JukkaL
Collaborator Author

JukkaL commented Apr 14, 2026

It crashed on Windows:

type: commands[0]> python runtests.py self
Success: no issues found in 335 source files
run self: ['D:\\a\\mypy\\mypy\\.tox\\type\\Scripts\\python.EXE', '-m', 'mypy', '--config-file', 'mypy_self_check.ini', '-p', 'mypy', '-p', 'mypyc']
type: commands[1]> python -m mypy --config-file mypy_self_check.ini misc --exclude misc/sync-typeshed.py
Success: no issues found in 20 source files
type: commands[2]> python -m mypy --config-file mypy_self_check.ini test-data/unit/plugins
error: INTERNAL ERROR -- Please try using mypy master on GitHub:
https://mypy.readthedocs.io/en/stable/common_issues.html#using-a-development-mypy-build
Please report a bug at https://github.com/python/mypy/issues
version: 2.0.0+dev.0b3d22c874c34722e6d78be42742738a563ed766
note: use --pdb to drop into pdb
Traceback (most recent call last):
  File "C:\hostedtoolcache\windows\Python\3.10.11\x64\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\hostedtoolcache\windows\Python\3.10.11\x64\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\a\mypy\mypy\mypy\__main__.py", line 50, in <module>
    console_entry()
  File "D:\a\mypy\mypy\mypy\__main__.py", line 15, in console_entry
    main()
  File "D:\a\mypy\mypy\mypy\main.py", line 143, in main
    res, messages, blockers = run_build(sources, options, fscache, t0, stdout, stderr)
  File "D:\a\mypy\mypy\mypy\main.py", line 233, in run_build
    res = build.build(sources, options, None, flush_errors, fscache, stdout, stderr)
  File "D:\a\mypy\mypy\mypy\build.py", line 406, in build
    result = build_inner(
  File "D:\a\mypy\mypy\mypy\build.py", line 509, in build_inner
    graph = dispatch(sources, manager, stdout, connect_threads)
  File "D:\a\mypy\mypy\mypy\build.py", line 3975, in dispatch
    process_graph(graph, manager)
  File "D:\a\mypy\mypy\mypy\build.py", line 4439, in process_graph
    done, still_working, results = manager.wait_for_done(graph)
  File "D:\a\mypy\mypy\mypy\build.py", line 1351, in wait_for_done
    return self.wait_for_done_workers(graph)
  File "D:\a\mypy\mypy\mypy\build.py", line 1370, in wait_for_done_workers
    buf = receive(self.workers[idx].conn)
  File "D:\a\mypy\mypy\mypy\ipc.py", line 461, in receive
    raise OSError("No data received")
OSError: No data received

type: exit 2 (1.09 seconds) D:\a\mypy\mypy> python -m mypy --config-file mypy_self_check.ini test-data/unit/plugins pid=1312
  type: FAIL code 2 (35.76=setup[0.53]+cmd[32.22,1.92,1.09] seconds)
  evaluation failed :( (36.14 seconds)

Error: Process completed with exit code 1.

@ilevkivskyi
Member

Yeah, something looks off on Windows; tests have been flaky for a while too. I am trying to debug test flakiness on Windows in #21220


@ilevkivskyi
Member

@JukkaL I have re-run the PR build three times, it passed all three times. Unfortunately these Windows flakes are hard to reproduce.

@JukkaL
Collaborator Author

JukkaL commented Apr 14, 2026

I wonder if running with a huge number of workers would reproduce this more reliably. I'm going to wait until we have a better understanding of the Windows issue.

@JukkaL
Collaborator Author

JukkaL commented Apr 14, 2026

I asked Claude Code and Codex to investigate the crash, since I don't know much about Windows IPC. Here's one idea from Claude Code:

⏺ Here's my analysis. The root cause is a TOCTOU race condition in the Windows ready_to_read
implementation that can lose pipe data.

The Bug

The Windows ready_to_read function at mypy/ipc.py:385 uses overlapped 1-byte probe reads on
message-mode named pipes to emulate select(). The pipes are configured as PIPE_TYPE_MESSAGE |
PIPE_READMODE_MESSAGE (line 256-257).

The race is in the cancel path at lines 425-433:

  for i, ov in pending:
      if _winapi.WaitForSingleObject(ov.event, 0) == _winapi.WAIT_OBJECT_0:
          _, err = ov.GetOverlappedResult(True)
          data = ov.getbuffer()
          if data:
              conns[i].buffer.extend(data)
          ready.append(i)
      else:
          ov.cancel()   # <-- BUG: data may have been consumed already

Here's the sequence:

  1. ready_to_read issues ReadFile(conn.connection, 1, overlapped=True) for each worker — a
    1-byte probe read on a message-mode pipe
  2. WaitForMultipleObjects returns when one event is signaled
  3. The code iterates all pending operations, checking each with
    WaitForSingleObject(ov.event, 0) (instant poll)
  4. For "not ready" connections, it calls ov.cancel()

The TOCTOU race: Between step 3's WaitForSingleObject returning WAIT_TIMEOUT and step 4's
ov.cancel(), the worker writes to the pipe and the kernel completes the overlapped read —
consuming 1 byte from the message. Then:

  • ov.cancel() calls CancelIoEx which returns ERROR_NOT_FOUND (I/O already completed) — silently
    ignored
  • The 1 byte is in the overlapped buffer, but ov.getbuffer() is never called
  • The ov object is garbage collected, and the byte is permanently lost

Why This Causes "No data received"

The framing protocol prefixes each message with a 4-byte big-endian length. When 1 byte is lost
from a message-mode pipe:

  1. The next read_bytes() call reads the remaining (N-1) bytes of the message
  2. The frame parser misinterprets the remaining bytes as a length header — getting a garbage
    size value
  3. frame_from_buffer() never assembles a complete frame
  4. The ReadFile loop eventually gets err == 0 with no more data, breaks out with bdata = None
  5. read_bytes() returns b"", and receive() raises OSError("No data received")
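The failure mode from the framing protocol can be illustrated with a small standalone sketch (the helper names `frame` and `parse_length` are hypothetical, not mypy's actual API): once a single byte is lost, the buffer shifts and the parser reads a garbage length.

```python
import struct

def frame(payload: bytes) -> bytes:
    # 4-byte big-endian length prefix followed by the payload
    return struct.pack(">I", len(payload)) + payload

def parse_length(buf: bytes) -> int:
    # Interpret the first 4 bytes of the buffer as the frame length
    return struct.unpack(">I", buf[:4])[0]

msg = frame(b"hello")           # b"\x00\x00\x00\x05hello"
print(parse_length(msg))        # 5: correct frame length

# Lose the first byte of the message, as in the race described above:
corrupted = msg[1:]
print(parse_length(corrupted))  # 1384: garbage length from the shifted bytes
```

With a garbage length like this, the reader waits for a frame that will never arrive, which matches the `OSError("No data received")` symptom in the traceback.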

The Fix

After ov.cancel(), always call GetOverlappedResult and getbuffer() to retrieve any data in case the I/O completed before the cancel took effect:

  for i, ov in pending:
      if _winapi.WaitForSingleObject(ov.event, 0) == _winapi.WAIT_OBJECT_0:
          _, err = ov.GetOverlappedResult(True)
          data = ov.getbuffer()
          if data:
              conns[i].buffer.extend(data)
          ready.append(i)
      else:
          ov.cancel()
          # The I/O may have completed between WaitForSingleObject and cancel.
          # Retrieve any consumed data to avoid losing bytes from the pipe.
          try:
              ov.GetOverlappedResult(True)
              data = ov.getbuffer()
              if data:
                  conns[i].buffer.extend(data)
          except OSError:
              pass  # Properly cancelled, no data consumed

The saved data stays in conn.buffer. The next ready_to_read call will detect the non-empty
buffer (line 381-384) and immediately return that connection as ready.

Why It's Random

  • The race window is very small (a few instructions between WaitForSingleObject and
    ov.cancel())
  • But with 4 workers, 3 probe reads are cancelled per ready_to_read call
  • Over hundreds of SCC processing rounds, the cumulative probability of hitting it becomes
    non-trivial
  • CI machines under load are more prone to context switches widening the race window
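As a back-of-the-envelope illustration of the cumulative probability (all numbers here are assumptions for the sketch, not measurements), even a tiny per-probe race probability compounds over a build:

```python
# Hypothetical numbers for illustration only.
p = 1e-4             # assumed chance of hitting the race per cancelled probe
probes_per_call = 3  # 3 probe reads cancelled per ready_to_read call (4 workers)
calls = 1000         # on the order of the SCC processing rounds in a build

# Probability of losing at least one byte across the whole build:
p_total = 1 - (1 - p) ** (probes_per_call * calls)
print(round(p_total, 3))  # about 0.26 with these assumed numbers
```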

@ilevkivskyi
Member

It looks like there are 3 different sources of flakes on Windows:

  • This one (potentially a 1-byte race on read)
  • I still see database is locked even when only setting WAL from the coordinator
  • There are still Malformed status file errors when trying to connect to a worker

I am currently focusing on the last two. For the first one, we can simply cancel all of the operations first after a wait, then check which were successful and mark those ready, which looks like a simpler fix than what Claude proposes.

JukkaL added a commit that referenced this pull request Apr 14, 2026
This may help debug random Windows crashes (see #21221).

@JukkaL
Collaborator Author

JukkaL commented Apr 14, 2026

I can look at the first issue on the list.

@ilevkivskyi
Member

I already have a fix for no. 2 on the list (i.e. database is locked), plus more graceful handling of failed worker starts, in #21220. I also added a random build ID to the status file names (for better debug logs; also, right now trying to run two builds in parallel will create a huge mess, but with separate build IDs it actually works)

@ilevkivskyi
Member

Btw @JukkaL you can click on the little circular arrow in the build list in GitHub to re-run an individual job manually. You need to hover the mouse for the arrow to appear.

@ilevkivskyi
Member

This one:
Screenshot from 2026-04-14 16-40-35

@JukkaL
Collaborator Author

JukkaL commented Apr 14, 2026

I don't see that button to retry a build. Maybe I'm in a different A/B testing group, or it doesn't work on Firefox?

#21228 is my attempt at fixing the first issue discussed above.

@ilevkivskyi
Member

I don't see that button to retry a build. Maybe I'm in a different A/B testing group, or it doesn't work on Firefox?

I am on Firefox. You may need to cancel pending builds to see it (and again, the arrow is hidden until you hover the mouse over it). This feature is at least a couple of years old (and yeah, this is a horrible UI).

JukkaL added a commit that referenced this pull request Apr 14, 2026
Cancellation is asynchronous, so some data could be read after calling
`cancel()`. Rework `ready_to_read` to first cancel all operations and
then check for each if data is available.

See discussion in #21221 for more context.

I heavily relied on coding agent assist for this, but I did multiple
review iterations and refinements.
@JukkaL
Collaborator Author

JukkaL commented Apr 14, 2026

Ah, now I see the button. I'm going to merge my attempted fix first.

@JukkaL
Collaborator Author

JukkaL commented Apr 14, 2026

I retried the GitHub actions job a bunch of times and it worked consistently. This looks good enough to merge now -- we can always revert if it's still flaky.

@ilevkivskyi Do you have some fixes you want to merge first?

@ilevkivskyi
Member

Do you have some fixes you want to merge first?

No, please go ahead. My flake fixes are orthogonal to this, I will merge them later.

@github-actions
Contributor

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

@JukkaL JukkaL merged commit 0eb7292 into master Apr 14, 2026
24 checks passed
@JukkaL JukkaL deleted the self-check-parallel branch April 14, 2026 18:55