Sub-issue of #33. The current dispatch runs `matmul_kernel` and then syncs implicitly via `.to("cpu")`. It is unclear whether the 4 ms host→XLA transfer and the 567 µs kernel actually pipeline or run serially.
If they run serially, an explicit `xm.mark_step()` between the transfer and the kernel dispatch may let them overlap, shaving up to the kernel's 567 µs off the critical path.
Action: instrument `_nki_matmul` with `torch_xla.sync()` calls between phases and compare against the current wall-clock time. Document the findings; if pipelining is already happening, close this issue with a note to that effect.
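A minimal sketch of how the comparison could be structured, assuming each phase of `_nki_matmul` can be wrapped in a zero-argument callable. The names `run_phases` and `overlap_seconds` are hypothetical helpers, not part of the codebase, and the `barrier` argument stands in for whatever blocking sync is used (for example a wrapper around `torch_xla.sync()`). The stand-in phases below use `time.sleep` so the harness runs without an XLA device.

```python
import time
from typing import Callable, Dict, Optional, Sequence, Tuple

# A phase is a (name, zero-argument callable) pair. Hypothetical type alias.
Phase = Tuple[str, Callable[[], None]]


def run_phases(
    phases: Sequence[Phase],
    barrier: Optional[Callable[[], None]] = None,
) -> Dict[str, float]:
    """Run phases in order and return seconds per phase plus a 'total' entry.

    If `barrier` is given, it is called after every phase, so each phase's
    time includes waiting for its outstanding work. Without a barrier, the
    per-phase times only cover dispatch, and only 'total' is meaningful,
    provided the last phase blocks (as the `.to("cpu")` readback does).
    """
    timings: Dict[str, float] = {}
    start = time.perf_counter()
    for name, fn in phases:
        t0 = time.perf_counter()
        fn()
        if barrier is not None:
            barrier()  # assumption: blocks until previously issued work is done
        timings[name] = time.perf_counter() - t0
    timings["total"] = time.perf_counter() - start
    return timings


def overlap_seconds(synced: Dict[str, float], unsynced: Dict[str, float]) -> float:
    """Time hidden by overlap: synced total minus unsynced total, floored at 0."""
    return max(0.0, synced["total"] - unsynced["total"])


if __name__ == "__main__":
    # Stand-in phases; replace with the real transfer, kernel, and readback.
    phases = [
        ("transfer", lambda: time.sleep(0.004)),
        ("kernel", lambda: time.sleep(0.0006)),
        ("readback", lambda: None),
    ]
    synced = run_phases(phases, barrier=lambda: None)
    unsynced = run_phases(phases)
    print("synced:", synced)
    print("unsynced:", unsynced)
    print("overlap (s):", overlap_seconds(synced, unsynced))
```

If the unsynced total is close to the synced total, the phases are effectively serial today; a clearly smaller unsynced total suggests pipelining is already happening. Averaging over several runs after a warm-up would make the comparison less noisy.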
Referenced profile: #33 (comment)