Investigate async execution / mark_step pipelining #35

@scttfrdmn

Description

Sub-issue of #33. The current dispatch path runs matmul_kernel and then syncs implicitly via .to("cpu"). It is unclear whether the 4 ms host→XLA transfer and the 567 µs kernel actually pipeline or run serially.

If serial: an explicit xm.mark_step() between the transfer and the kernel dispatch may let them overlap, taking up to the kernel's duration (567 µs, roughly 12% of the 4.57 ms transfer-plus-kernel total) off the critical path.

Action: instrument _nki_matmul with torch_xla.sync() calls (the newer equivalent of xm.mark_step()) between phases and compare against the current wall-clock time. Document the findings; if pipelining is already happening, close with a note to that effect.
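A rough sketch of how the comparison could be structured, not the actual _nki_matmul instrumentation. The `time_phases` helper and its `transfer`, `kernel`, and `barrier` arguments are hypothetical stand-ins: the harness only assumes that `barrier` blocks until all queued device work has finished.

```python
import statistics
import time
from typing import Callable, Dict


def time_phases(
    transfer: Callable[[], object],
    kernel: Callable[[object], object],
    barrier: Callable[[], None],
    iters: int = 20,
    warmup: int = 3,
) -> Dict[str, float]:
    """Median wall-clock seconds with and without a barrier between phases.

    transfer: hypothetical stand-in for the host->XLA copy
    kernel:   hypothetical stand-in for the matmul_kernel dispatch
    barrier:  assumed to block until all queued device work is done
    """

    def serialized():
        barrier()                      # start from an idle device
        t0 = time.perf_counter()
        x = transfer()
        barrier()                      # force the transfer to finish
        t1 = time.perf_counter()
        kernel(x)
        barrier()                      # force the kernel to finish
        t2 = time.perf_counter()
        return t1 - t0, t2 - t1

    def unsynced():
        barrier()                      # start from an idle device
        t0 = time.perf_counter()
        kernel(transfer())             # no barrier between the phases
        barrier()                      # single barrier at the end
        return time.perf_counter() - t0

    # Warm-up runs keep one-time costs (e.g. compilation) out of the samples.
    for _ in range(warmup):
        serialized()
        unsynced()

    ser = [serialized() for _ in range(iters)]
    uns = [unsynced() for _ in range(iters)]

    serialized_s = statistics.median(a + b for a, b in ser)
    unsynced_s = statistics.median(uns)
    return {
        "transfer_s": statistics.median(a for a, _ in ser),
        "kernel_s": statistics.median(b for _, b in ser),
        "serialized_s": serialized_s,
        "unsynced_s": unsynced_s,
        # Time recovered when the phases are not forced to run back to back.
        "overlap_s": serialized_s - unsynced_s,
    }
```

On a real device, `barrier` could be something like `torch_xla.sync()` followed by `xm.wait_device_ops()`, since a sync on its own only dispatches the pending graph and does not wait for it to finish. An `overlap_s` near zero would suggest the two phases already run back to back; a value approaching `kernel_s` would suggest the runtime is already overlapping them.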

Referenced profile: #33 (comment)

Metadata

Labels: feature
