Skip to content

fix(join): declare loop-spawned closure locals to avoid soft-scope race#71

Merged
krynju merged 1 commit into
mainfrom
kr/fix-join-closure-race
May 24, 2026
Merged

fix(join): declare loop-spawned closure locals to avoid soft-scope race#71
krynju merged 1 commit into
mainfrom
kr/fix-join-closure-race

Conversation

@krynju
Copy link
Copy Markdown
Member

@krynju krynju commented May 23, 2026

Summary

_join(::Symbol, ::Any, ::DTable)'s process_one_chunk closure assigns to inner_l, inner_r, and outer_l — names that are also assigned later in the enclosing _join body (the intersect / Vector{UInt}() initialization for the outer-row merge). Julia's soft-scope rules therefore make those identifiers captured bindings shared between the enclosing function and every Dagger.@spawned invocation of the closure.

With N r-chunks the closure runs N times in parallel, all reading/writing the same three binding cells. The tuple destructure

inner_l, inner_r = match_inner_indices(...)

is two stores; a context switch between them lets one task observe another task's inner_l paired with its own inner_r. Mismatched lengths trigger a BoundsError in build_joined_table, and even when lengths happen to match, find_outer_indices(l, inner_l) runs against the wrong inner_l and the final merge ends up with extra rows. The bug manifested as DTable-on-DTable leftjoin/innerjoin returning the wrong row count under -t>1.

The fix is one line inside the closure:

local inner_l, inner_r, outer_l

That forces fresh per-invocation bindings and resolves the race. No other Dagger.@spawn site in src/ has the same trap — they all use _-prefixed parameter names, have inner locals that don't collide with the enclosing function, or are only spawned once.

This also re-enables the previously commented-out lj10 / ij10 assertions in test/table.jl that exercise DTable-on-DTable joins.

Test plan

  • Full test suite passes: 368/368 under julia -t8
  • lj10 / ij10 DTable-on-DTable join asserts re-enabled and pass
  • 50 fresh-cache leftjoin + innerjoin trials on -t80/50 mismatches (vs reliably wrong on main)
  • 30 stress trials with cold determine_schema cache per iteration — 0/30 mismatches

🤖 Generated with Claude Code

In `_join(::Symbol, ::Any, ::DTable)` the `process_one_chunk` closure
assigned to `inner_l`, `inner_r`, and `outer_l`. Those same identifiers
are also assigned later in the enclosing `_join` body (the
`intersect`/`Vector{UInt}()` initialization for the outer-row merge).
Under Julia's soft-scope rules that turns them into captured bindings
shared between the enclosing function and every concurrent
`Dagger.@spawn`ed invocation of the closure.

With N r-chunks the closure runs N times in parallel, all writing to
the same three binding cells. The tuple destructure
`inner_l, inner_r = match_inner_indices(...)` is two stores, so a
context switch between them lets one task observe another task's
`inner_l` paired with its own `inner_r` — mismatched lengths trigger a
`BoundsError` in `build_joined_table`, and even when lengths happen to
match, `find_outer_indices` runs against the wrong `inner_l` and the
final merge ends up with extra rows.

Declaring `local inner_l, inner_r, outer_l` in the closure body forces
fresh per-invocation bindings and resolves the race. Re-enables the
previously commented-out `lj10`/`ij10` `DTable`-on-`DTable` join tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@krynju krynju force-pushed the kr/fix-join-closure-race branch from 3d21233 to 80ecd3a Compare May 24, 2026 09:34
@krynju krynju merged commit 04185cf into main May 24, 2026
4 checks passed
@krynju krynju deleted the kr/fix-join-closure-race branch May 24, 2026 09:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant