Platform
All / Unknown
Runtime Variant
All / Unknown
Summary
The orchestration function (.so) is sent to the AICPU via device global memory, but `dlopen` cannot load a shared library directly from memory. The current workaround writes the SO binary to a file on disk, then calls `dlopen` on that file. This file I/O round-trip is costly and adds unnecessary latency to every execution.
Current flow (per invocation):
- Host embeds the orch SO binary into the Runtime struct (up to a 4MB inline buffer: `RUNTIME_MAX_ORCH_SO_SIZE`)
- The entire Runtime struct (including the 4MB SO buffer) is DMA'd to device HBM
- AICPU reads the SO bytes from device memory
- AICPU writes the SO to disk via `open()`/`write()`, trying 5 candidate directories (`/usr/lib64/aicpu_kernels/...`, `/var/tmp`, `/tmp`)
- AICPU calls `dlopen(so_path, RTLD_LAZY | RTLD_LOCAL)` on the file; `dlsym()` resolves function pointers (`aicpu_orchestration_entry`, etc.)
- After execution: `dlclose()` + `unlink()` deletes the file
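The per-invocation round-trip above amounts to something like the following sketch (hypothetical code, not the actual `aicpu_executor.cpp` implementation; function and path names are illustrative):

```cpp
// Hypothetical sketch of the current disk-backed load path: SO bytes that
// are already in device memory get round-tripped through the filesystem
// just so dlopen can see a pathname.
#include <dlfcn.h>
#include <cstdio>
#include <cstddef>

// Write the SO image to `path`, then dlopen it. Returns nullptr on failure.
void* load_via_disk(const void* so_bytes, size_t so_len, const char* path) {
    std::FILE* f = std::fopen(path, "wb");       // filesystem write, every run
    if (!f) return nullptr;
    size_t written = std::fwrite(so_bytes, 1, so_len, f);
    std::fclose(f);
    if (written != so_len) { std::remove(path); return nullptr; }
    return dlopen(path, RTLD_LAZY | RTLD_LOCAL); // returns nullptr on bad ELF
}

// Mirror of the cleanup step: dlclose the handle and unlink the file.
void unload_via_disk(void* handle, const char* path) {
    if (handle) dlclose(handle);
    std::remove(path);                           // unlink, every run
}
```

This is the pattern whose write + unlink cost recurs on every invocation.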
Key costs:
- File system I/O on AICPU for every single invocation (write + unlink)
- Fixed 4MB DMA transfer regardless of actual SO size
- No caching — the SO is written and deleted every run
Locations:
- Host SO embedding: `src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp` (lines 226-244)
- Runtime struct buffer: `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.h` (lines 39, 189-192)
- AICPU file write + dlopen: `src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp` (lines 1596-1674)
- Cleanup (dlclose + unlink): same file (lines 2057-2062)
- The same pattern exists in the `aicpu_build_graph` and `a5` variants
Git Commit ID
78f0869
Host Platform
Linux (aarch64)
Reproduction
Any example that uses device orchestration (not `host_build_graph`):

```shell
python examples/scripts/run_example.py \
  -k tests/device_tests/a2a3/tensormap_and_ringbuffer/paged_attention_unroll/kernels \
  -g tests/device_tests/a2a3/tensormap_and_ringbuffer/paged_attention_unroll/golden.py \
  -p a2a3 -d 4
```
The file write + dlopen overhead is included in every invocation's orchestration setup time.
Expected Performance
Orchestration SO loading should add near-zero overhead — the binary is already in device memory.
Actual Performance
Each invocation pays the filesystem I/O cost (write a ~100-500KB SO to disk + dlopen + unlink). No exact timing is instrumented on the AICPU side, but the host-side `TIMING: orch_so_copy` log shows the DMA portion. The AICPU file write + dlopen cost is hidden within the orchestration startup.
Profiling Data (Optional)
Not yet measured in isolation. The host side logs `TIMING: orch_so_copy`, but the AICPU file-write + dlopen latency has no instrumentation.
Additional Context
Possible alternatives to explore:
- `memfd_create` + `dlopen("/proc/self/fd/N")` — Create an anonymous in-memory file descriptor, write the SO bytes to it, then `dlopen` it via the `/proc/self/fd/` path. Avoids real filesystem I/O entirely. Requires Linux 3.17+ (available on the AICPU's aarch64 kernel). This is the most promising approach.
- Cache across invocations — Write the SO file once during initialization and reuse the `dlopen` handle across runs (pypto uses this approach with a `firstCreatSo_` flag). Only helps for repeated runs.
- Right-size the DMA — Instead of always transferring the full 4MB `RUNTIME_MAX_ORCH_SO_SIZE` buffer, only DMA the actual SO size. This reduces the Runtime struct DMA cost.
- Separate DMA for SO binary — Instead of embedding the SO in the Runtime struct, send it as a separate device memory allocation with its own pointer. Avoids bloating the Runtime struct.
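The caching option can be sketched as follows, assuming a single orchestration SO per process (a minimal sketch with illustrative names, not the pypto implementation):

```cpp
// Hypothetical sketch: keep the dlopen handle in a process-wide slot so the
// disk write + dlopen cost is paid once, at first use, instead of per run.
#include <dlfcn.h>
#include <cstddef>

static void* g_orch_handle = nullptr;  // cached across invocations

// Returns the cached handle, opening the SO only on the first call.
void* get_orch_handle(const char* so_path) {
    if (g_orch_handle == nullptr) {
        g_orch_handle = dlopen(so_path, RTLD_LAZY | RTLD_LOCAL);
    }
    return g_orch_handle;  // subsequent calls skip dlopen entirely
}

// Called once at teardown rather than after every invocation.
void release_orch_handle() {
    if (g_orch_handle != nullptr) {
        dlclose(g_orch_handle);
        g_orch_handle = nullptr;
    }
}
```

Note this still pays the one-time file write and disk-backed `dlopen`; it only amortizes them across runs.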
pypto comparison: pypto writes the SO file once at init time and caches the handle, avoiding per-invocation file I/O. However, it still relies on disk-backed dlopen. The `memfd_create` approach would be strictly better for both projects.