Platform
All / Unknown
Runtime Variant
All / Unknown
Summary
The orchestration function (.so) is sent to the AICPU via device global memory, but `dlopen` cannot load a shared library directly from memory. The current workaround writes the SO binary to a file on disk, then calls `dlopen` on that file. This file I/O round-trip is costly and adds unnecessary latency to every execution.
Current flow (per invocation):
- Host embeds the orch SO binary into the Runtime struct (up to a 4MB inline buffer: `RUNTIME_MAX_ORCH_SO_SIZE`)
- The entire Runtime struct (including the 4MB SO buffer) is DMA'd to device HBM
- AICPU reads the SO bytes from device memory
- AICPU writes the SO to disk via `open()`/`write()`, trying 5 candidate directories (`/usr/lib64/aicpu_kernels/...`, `/var/tmp`, `/tmp`)
- AICPU calls `dlopen(so_path, RTLD_LAZY | RTLD_LOCAL)` on the file; `dlsym()` resolves function pointers (`aicpu_orchestration_entry`, etc.)
- After execution: `dlclose()` + `unlink()` deletes the file
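The per-invocation round-trip above amounts to something like the following sketch (hypothetical code, not the actual `aicpu_executor.cpp` implementation; function and path names are illustrative):

```cpp
// Hypothetical sketch of the current disk-backed load path: SO bytes that
// are already in device memory get round-tripped through the filesystem
// just so dlopen can see a pathname.
#include <dlfcn.h>
#include <cstdio>
#include <cstddef>

// Write the SO image to `path`, then dlopen it. Returns nullptr on failure.
void* load_via_disk(const void* so_bytes, size_t so_len, const char* path) {
    std::FILE* f = std::fopen(path, "wb");       // filesystem write, every run
    if (!f) return nullptr;
    size_t written = std::fwrite(so_bytes, 1, so_len, f);
    std::fclose(f);
    if (written != so_len) { std::remove(path); return nullptr; }
    return dlopen(path, RTLD_LAZY | RTLD_LOCAL); // returns nullptr on bad ELF
}

// Mirror of the cleanup step: dlclose the handle and unlink the file.
void unload_via_disk(void* handle, const char* path) {
    if (handle) dlclose(handle);
    std::remove(path);                           // unlink, every run
}
```

This is the pattern whose write + unlink cost recurs on every invocation.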
Key costs:
- File system I/O on AICPU for every single invocation (write + unlink)
- Fixed 4MB DMA transfer regardless of actual SO size
- No caching — the SO is written and deleted every run
Locations:
- Host SO embedding: `src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp` (lines 226-244)
- Runtime struct buffer: `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.h` (lines 39, 189-192)
- AICPU file write + dlopen: `src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp` (lines 1596-1674)
- Cleanup (dlclose + unlink): same file (lines 2057-2062)
- The same pattern exists in the `aicpu_build_graph` and `a5` variants
Git Commit ID
78f0869
Host Platform
Linux (aarch64)
Reproduction
Any example that uses device orchestration (not `host_build_graph`):

```shell
python examples/scripts/run_example.py \
  -k tests/device_tests/a2a3/tensormap_and_ringbuffer/paged_attention_unroll/kernels \
  -g tests/device_tests/a2a3/tensormap_and_ringbuffer/paged_attention_unroll/golden.py \
  -p a2a3 -d 4
```
The file write + dlopen overhead is included in every invocation's orchestration setup time.
Expected Performance
Orchestration SO loading should add near-zero overhead — the binary is already in device memory.
Actual Performance
Each invocation pays the filesystem I/O cost (write a ~100-500KB SO to disk + dlopen + unlink). No exact timing is instrumented on the AICPU side, but the host-side `TIMING: orch_so_copy` log shows the DMA portion. The AICPU file write + dlopen cost is hidden within the orchestration startup.
Profiling Data (Optional)
Not yet measured in isolation. The host side logs `TIMING: orch_so_copy`, but the AICPU file-write + dlopen latency has no instrumentation.
Additional Context
Possible alternatives to explore:
- `memfd_create` + `dlopen("/proc/self/fd/N")` — Create an anonymous in-memory file descriptor, write the SO bytes to it, then `dlopen` it via the `/proc/self/fd/` path. Avoids real filesystem I/O entirely. Requires Linux 3.17+ (available on the AICPU's aarch64 kernel). This is the most promising approach.
- Cache across invocations — Write the SO file once during initialization and reuse the `dlopen` handle across runs (pypto uses this approach with a `firstCreatSo_` flag). Only helps for repeated runs.
- Right-size the DMA — Instead of always transferring the full 4MB `RUNTIME_MAX_ORCH_SO_SIZE` buffer, only DMA the actual SO size. This reduces the Runtime struct DMA cost.
- Separate DMA for SO binary — Instead of embedding the SO in the Runtime struct, send it as a separate device memory allocation with its own pointer. Avoids bloating the Runtime struct.
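The caching option can be sketched as follows, assuming a single orchestration SO per process (a minimal sketch with illustrative names, not the pypto implementation):

```cpp
// Hypothetical sketch: keep the dlopen handle in a process-wide slot so the
// disk write + dlopen cost is paid once, at first use, instead of per run.
#include <dlfcn.h>
#include <cstddef>

static void* g_orch_handle = nullptr;  // cached across invocations

// Returns the cached handle, opening the SO only on the first call.
void* get_orch_handle(const char* so_path) {
    if (g_orch_handle == nullptr) {
        g_orch_handle = dlopen(so_path, RTLD_LAZY | RTLD_LOCAL);
    }
    return g_orch_handle;  // subsequent calls skip dlopen entirely
}

// Called once at teardown rather than after every invocation.
void release_orch_handle() {
    if (g_orch_handle != nullptr) {
        dlclose(g_orch_handle);
        g_orch_handle = nullptr;
    }
}
```

Note this still pays the one-time file write and disk-backed `dlopen`; it only amortizes them across runs.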
pypto comparison: pypto writes the SO file once at init time and caches the handle, avoiding per-invocation file I/O. However, it still relies on disk-backed dlopen. The `memfd_create` approach would be strictly better for both projects.