
[Performance] Orchestration SO loading via file write + dlopen is costly on AICPU #357

@hw-native-sys-bot

Platform

All / Unknown

Runtime Variant

All / Unknown

Summary

The orchestration shared object (.so) is delivered to the AICPU through device global memory, but dlopen cannot load a shared library directly from a memory buffer. The current workaround writes the SO binary to a file on disk and then calls dlopen on that file. This file I/O path is costly and adds unnecessary latency to every execution.

Current flow (per invocation):

  1. Host embeds the orch SO binary into the Runtime struct (up to 4MB inline buffer: RUNTIME_MAX_ORCH_SO_SIZE)
  2. Entire Runtime struct (including the 4MB SO buffer) is DMA'd to device HBM
  3. AICPU reads the SO bytes from device memory
  4. AICPU writes the SO to disk via open()/write() — tries 5 candidate directories (/usr/lib64/aicpu_kernels/..., /var/tmp, /tmp)
  5. AICPU calls dlopen(so_path, RTLD_LAZY | RTLD_LOCAL) on the file
  6. dlsym() resolves function pointers (aicpu_orchestration_entry, etc.)
  7. After execution: dlclose() + unlink() deletes the file

Key costs:

  • File system I/O on AICPU for every single invocation (write + unlink)
  • Fixed 4MB DMA transfer regardless of actual SO size
  • No caching — the SO is written and deleted every run

Locations:

  • Host SO embedding: src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp (lines 226-244)
  • Runtime struct buffer: src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.h (line 39, 189-192)
  • AICPU file write + dlopen: src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (lines 1596-1674)
  • Cleanup (dlclose + unlink): same file (lines 2057-2062)
  • Same pattern exists in aicpu_build_graph and a5 variants

Git Commit ID

78f0869

Host Platform

Linux (aarch64)

Reproduction

Any example that uses device orchestration (not host_build_graph):

python examples/scripts/run_example.py \
    -k tests/device_tests/a2a3/tensormap_and_ringbuffer/paged_attention_unroll/kernels \
    -g tests/device_tests/a2a3/tensormap_and_ringbuffer/paged_attention_unroll/golden.py \
    -p a2a3 -d 4

The file write + dlopen overhead is included in every invocation's orchestration setup time.

Expected Performance

Orchestration SO loading should add near-zero overhead — the binary is already in device memory.

Actual Performance

Each invocation pays the filesystem I/O cost (writing the ~100-500KB SO to disk + dlopen + unlink). No exact timing is instrumented on the AICPU side, but the host-side TIMING: orch_so_copy log shows the DMA portion. The AICPU file-write + dlopen cost is hidden within orchestration startup.

Profiling Data (Optional)

Not yet measured in isolation. The host side logs TIMING: orch_so_copy but the AICPU file-write + dlopen latency has no instrumentation.

Additional Context

Possible alternatives to explore:

  1. memfd_create + dlopen("/proc/self/fd/N") — Create an anonymous in-memory file descriptor, write SO bytes to it, then dlopen via the /proc/self/fd/ path. Avoids real filesystem I/O entirely. Requires Linux 3.17+ (available on AICPU's aarch64 kernel). This is the most promising approach.

  2. Cache across invocations — Write the SO file once during initialization and reuse the dlopen handle across runs (pypto uses this approach with a firstCreatSo_ flag). Only helps for repeated runs.

  3. Right-size the DMA — Instead of always transferring the full 4MB RUNTIME_MAX_ORCH_SO_SIZE buffer, only DMA the actual SO size. This reduces the Runtime struct DMA cost.

  4. Separate DMA for SO binary — Instead of embedding the SO in the Runtime struct, send it as a separate device memory allocation with its own pointer. Avoids bloating the Runtime struct.

pypto comparison: pypto writes the SO file once at init time and caches the handle, avoiding per-invocation file I/O. However, it still relies on disk-backed dlopen. The memfd_create approach would be strictly better for both projects.

Metadata

Assignees: No one assigned
Labels: performance (Performance regression or optimization)
Type: No type
Project status: In Progress
Milestone: No milestone
