
Use parallel checking in self check #21221

Merged
JukkaL merged 5 commits into master from self-check-parallel on Apr 14, 2026

Conversation

@JukkaL
Collaborator

@JukkaL JukkaL commented Apr 14, 2026

Let's start dogfooding parallel type checking. This is about 3x faster for a cold run than sequential on my mac laptop (not compiled).

@JukkaL JukkaL requested a review from ilevkivskyi April 14, 2026 10:17
Member

@ilevkivskyi ilevkivskyi left a comment


I don't remember if we install ast-serialize in all CI jobs, so you may need to update test.yml

@JukkaL
Collaborator Author

JukkaL commented Apr 14, 2026

It crashed on Windows:

type: commands[0]> python runtests.py self
Success: no issues found in 335 source files
run self: ['D:\\a\\mypy\\mypy\\.tox\\type\\Scripts\\python.EXE', '-m', 'mypy', '--config-file', 'mypy_self_check.ini', '-p', 'mypy', '-p', 'mypyc']
type: commands[1]> python -m mypy --config-file mypy_self_check.ini misc --exclude misc/sync-typeshed.py
Success: no issues found in 20 source files
type: commands[2]> python -m mypy --config-file mypy_self_check.ini test-data/unit/plugins
error: INTERNAL ERROR -- Please try using mypy master on GitHub:
https://mypy.readthedocs.io/en/stable/common_issues.html#using-a-development-mypy-build
Please report a bug at https://github.com/python/mypy/issues
version: 2.0.0+dev.0b3d22c874c34722e6d78be42742738a563ed766
note: use --pdb to drop into pdb
Traceback (most recent call last):
  File "C:\hostedtoolcache\windows\Python\3.10.11\x64\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\hostedtoolcache\windows\Python\3.10.11\x64\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\a\mypy\mypy\mypy\__main__.py", line 50, in <module>
    console_entry()
  File "D:\a\mypy\mypy\mypy\__main__.py", line 15, in console_entry
    main()
  File "D:\a\mypy\mypy\mypy\main.py", line 143, in main
    res, messages, blockers = run_build(sources, options, fscache, t0, stdout, stderr)
  File "D:\a\mypy\mypy\mypy\main.py", line 233, in run_build
    res = build.build(sources, options, None, flush_errors, fscache, stdout, stderr)
  File "D:\a\mypy\mypy\mypy\build.py", line 406, in build
    result = build_inner(
  File "D:\a\mypy\mypy\mypy\build.py", line 509, in build_inner
    graph = dispatch(sources, manager, stdout, connect_threads)
  File "D:\a\mypy\mypy\mypy\build.py", line 3975, in dispatch
    process_graph(graph, manager)
  File "D:\a\mypy\mypy\mypy\build.py", line 4439, in process_graph
    done, still_working, results = manager.wait_for_done(graph)
  File "D:\a\mypy\mypy\mypy\build.py", line 1351, in wait_for_done
    return self.wait_for_done_workers(graph)
  File "D:\a\mypy\mypy\mypy\build.py", line 1370, in wait_for_done_workers
    buf = receive(self.workers[idx].conn)
  File "D:\a\mypy\mypy\mypy\ipc.py", line 461, in receive
    raise OSError("No data received")
OSError: No data received

type: exit 2 (1.09 seconds) D:\a\mypy\mypy> python -m mypy --config-file mypy_self_check.ini test-data/unit/plugins pid=1312
  type: FAIL code 2 (35.76=setup[0.53]+cmd[32.22,1.92,1.09] seconds)
  evaluation failed :( (36.14 seconds)

Error: Process completed with exit code 1.

@ilevkivskyi
Member

Yeah, something looks off on Windows; tests have been flaky for a while too. I am trying to debug test flakiness on Windows in #21220


@ilevkivskyi
Member

@JukkaL I have re-run the PR build three times, it passed all three times. Unfortunately these Windows flakes are hard to reproduce.

@JukkaL
Collaborator Author

JukkaL commented Apr 14, 2026

I wonder if running with a huge number of workers would reproduce this more reliably. I'm going to wait until we have a better understanding of the Windows issue.

@JukkaL
Collaborator Author

JukkaL commented Apr 14, 2026

I asked Claude Code and Codex to investigate the crash, since I don't know much about Windows IPC. Here's one idea from Claude Code:

⏺ Here's my analysis. The root cause is a TOCTOU race condition in the Windows ready_to_read
implementation that can lose pipe data.

The Bug

The Windows ready_to_read function at mypy/ipc.py:385 uses overlapped 1-byte probe reads on
message-mode named pipes to emulate select(). The pipes are configured as PIPE_TYPE_MESSAGE |
PIPE_READMODE_MESSAGE (line 256-257).

The race is in the cancel path at lines 425-433:

  for i, ov in pending:
      if _winapi.WaitForSingleObject(ov.event, 0) == _winapi.WAIT_OBJECT_0:
          _, err = ov.GetOverlappedResult(True)
          data = ov.getbuffer()
          if data:
              conns[i].buffer.extend(data)
          ready.append(i)
      else:
          ov.cancel()   # <-- BUG: data may have been consumed already

Here's the sequence:

  1. ready_to_read issues ReadFile(conn.connection, 1, overlapped=True) for each worker — a
    1-byte probe read on a message-mode pipe
  2. WaitForMultipleObjects returns when one event is signaled
  3. The code iterates all pending operations, checking each with
    WaitForSingleObject(ov.event, 0) (instant poll)
  4. For "not ready" connections, it calls ov.cancel()

The TOCTOU race: Between step 3's WaitForSingleObject returning WAIT_TIMEOUT and step 4's
ov.cancel(), the worker writes to the pipe and the kernel completes the overlapped read —
consuming 1 byte from the message. Then:

  • ov.cancel() calls CancelIoEx which returns ERROR_NOT_FOUND (I/O already completed) — silently
    ignored
  • The 1 byte is in the overlapped buffer, but ov.getbuffer() is never called
  • The ov object is garbage collected, and the byte is permanently lost

Why This Causes "No data received"

The framing protocol prefixes each message with a 4-byte big-endian length. When 1 byte is lost
from a message-mode pipe:

  1. The next read_bytes() call reads the remaining (N-1) bytes of the message
  2. The frame parser misinterprets the remaining bytes as a length header — getting a garbage
    size value
  3. frame_from_buffer() never assembles a complete frame
  4. The ReadFile loop eventually gets err == 0 with no more data, breaks out with bdata = None
  5. read_bytes() returns b"", and receive() raises OSError("No data received")
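The failure mode from the framing protocol can be illustrated with a small standalone sketch (the helper names `frame` and `parse_length` are hypothetical, not mypy's actual API): once a single byte is lost, the buffer shifts and the parser reads a garbage length.

```python
import struct

def frame(payload: bytes) -> bytes:
    # 4-byte big-endian length prefix followed by the payload
    return struct.pack(">I", len(payload)) + payload

def parse_length(buf: bytes) -> int:
    # Interpret the first 4 bytes of the buffer as the frame length
    return struct.unpack(">I", buf[:4])[0]

msg = frame(b"hello")           # b"\x00\x00\x00\x05hello"
print(parse_length(msg))        # 5: correct frame length

# Lose the first byte of the message, as in the race described above:
corrupted = msg[1:]
print(parse_length(corrupted))  # 1384: garbage length from the shifted bytes
```

With a garbage length like this, the reader waits for a frame that will never arrive, which matches the `OSError("No data received")` symptom in the traceback.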

The Fix

After ov.cancel(), always call GetOverlappedResult and getbuffer() to retrieve any data in case the I/O completed before the cancel took effect:

  for i, ov in pending:
      if _winapi.WaitForSingleObject(ov.event, 0) == _winapi.WAIT_OBJECT_0:
          _, err = ov.GetOverlappedResult(True)
          data = ov.getbuffer()
          if data:
              conns[i].buffer.extend(data)
          ready.append(i)
      else:
          ov.cancel()
          # The I/O may have completed between WaitForSingleObject and cancel.
          # Retrieve any consumed data to avoid losing bytes from the pipe.
          try:
              ov.GetOverlappedResult(True)
              data = ov.getbuffer()
              if data:
                  conns[i].buffer.extend(data)
          except OSError:
              pass  # Properly cancelled, no data consumed

The saved data stays in conn.buffer. The next ready_to_read call will detect the non-empty
buffer (line 381-384) and immediately return that connection as ready.

Why It's Random

  • The race window is very small (a few instructions between WaitForSingleObject and
    ov.cancel())
  • But with 4 workers, 3 probe reads are cancelled per ready_to_read call
  • Over hundreds of SCC processing rounds, the cumulative probability of hitting it becomes
    non-trivial
  • CI machines under load are more prone to context switches widening the race window
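As a back-of-the-envelope illustration of the cumulative probability (all numbers here are assumptions for the sketch, not measurements), even a tiny per-probe race probability compounds over a build:

```python
# Hypothetical numbers for illustration only.
p = 1e-4             # assumed chance of hitting the race per cancelled probe
probes_per_call = 3  # 3 probe reads cancelled per ready_to_read call (4 workers)
calls = 1000         # on the order of the SCC processing rounds in a build

# Probability of losing at least one byte across the whole build:
p_total = 1 - (1 - p) ** (probes_per_call * calls)
print(round(p_total, 3))  # about 0.26 with these assumed numbers
```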

@ilevkivskyi
Member

It looks like there are 3 different sources of flakes on Windows:

  • This one (potentially a 1-byte race on read)
  • I still see database is locked even when only setting WAL from the coordinator
  • There are still Malformed status file errors when trying to connect to a worker

I am currently focusing on the last two. For the first one, we can simply cancel all of the operations first after a wait, then check which were successful and mark those ready, which looks like a simpler fix than what Claude proposes.

JukkaL added a commit that referenced this pull request Apr 14, 2026
This may help debug random Windows crashes (see #21221).

@JukkaL
Collaborator Author

JukkaL commented Apr 14, 2026

I can look at the first issue on the list.

@ilevkivskyi
Member

I already have a fix for no. 2 on the list (i.e. database is locked), plus more graceful handling of failed worker starts, in #21220. I also added a random build ID to the status file names (for better debug logs; also, right now trying to run two builds in parallel will create a huge mess, but with separate build IDs it actually works)

@ilevkivskyi
Member

Btw @JukkaL you can click on the little circular arrow in the build list in GitHub to re-run an individual job manually. You need to hover the mouse for the arrow to appear.

@ilevkivskyi
Member

This one:
Screenshot from 2026-04-14 16-40-35

@JukkaL
Collaborator Author

JukkaL commented Apr 14, 2026

I don't see that button to retry a build. Maybe I'm in a different A/B testing group, or it doesn't work on Firefox?

#21228 is my attempt at fixing the first issue discussed above.

@ilevkivskyi
Member

I don't see that button to retry a build. Maybe I'm in a different A/B testing group, or it doesn't work on Firefox?

I am on Firefox. You may need to cancel pending builds to see it (and again, the arrow is hidden until you hover the mouse over it). This feature is at least a couple of years old (and yeah, this is a horrible UI).

JukkaL added a commit that referenced this pull request Apr 14, 2026
Cancellation is asynchronous, so some data could be read after calling
`cancel()`. Rework `ready_to_read` to first cancel all operations and
then check for each if data is available.

See discussion in #21221 for more context.

I heavily relied on coding agent assist for this, but I did multiple
review iterations and refinements.
@JukkaL
Collaborator Author

JukkaL commented Apr 14, 2026

Ah, now I see the button. I'm going to merge my attempted fix first.

@JukkaL
Collaborator Author

JukkaL commented Apr 14, 2026

I retried the GitHub actions job a bunch of times and it worked consistently. This looks good enough to merge now -- we can always revert if it's still flaky.

@ilevkivskyi Do you have some fixes you want to merge first?

@ilevkivskyi
Member

Do you have some fixes you want to merge first?

No, please go ahead. My flake fixes are orthogonal to this, I will merge them later.

@github-actions
Contributor

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

@JukkaL JukkaL merged commit 0eb7292 into master Apr 14, 2026
24 checks passed
@JukkaL JukkaL deleted the self-check-parallel branch April 14, 2026 18:55