Investigate async execution / mark_step pipelining

Sub-issue of #33. Current dispatch does `matmul_kernel` → implicit sync via `.to(\"cpu\")`. Unclear whether the 4 ms host→XLA transfer and 567 µs kernel actually pipeline or run serially.

If serial: explicit `xm.mark_step()` between transfer and kernel dispatch may let them overlap, shaving up to the kernel's worth of time off the critical path.

**Action:** instrument `_nki_matmul` with `torch_xla.sync()` calls between phases and compare to current wall-clock. Document findings; if pipelining is already happening, close with that note.

Referenced profile: https://github.com/trnsci/trntensor/issues/33#issuecomment-4239594619

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate async execution / mark_step pipelining #35

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Investigate async execution / mark_step pipelining #35

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions