Compiled Runtime for Apple Neural Engine
Developed by Yilong Li during his research at the University of Wisconsin-Madison.
Direct Python control of Apple Neural Engine (ANE) via reverse-engineered private APIs. Compile MIL programs with baked weights, execute fused transformer blocks on ANE hardware, and cache kernels for repeated inference — no Core ML required.
What crane is:
- A Python runtime for executing custom compute graphs directly on Apple's Neural Engine
- Fused transformer block kernels: RMSNorm + multi-head attention (with RoPE and windowed masking) + SwiGLU MLP + residual connections — compiled into a single ANE evaluation
- Compile-time weight baking via `_ANEInMemoryModel` private APIs
- IOSurface ping-pong chaining: 32 blocks share 2 surfaces, eliminating all intermediate CPU-ANE transfers
- Baked rotary embeddings: cos/sin as MIL constants, enabling hidden-states-only I/O for chaining
- Dynamic kernel cache with bounded LRU eviction and explicit lifecycle management
- C bridge (`libane_bridge.dylib`) wrapping `_ANECompiler` / `_ANEInMemoryModelDescriptor` / `_ANERequest` into a ctypes-friendly interface
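The bounded LRU kernel cache with explicit lifecycle management can be modeled with an `OrderedDict`; this is an illustrative sketch, not the runtime's actual implementation (the class and `release` hook are hypothetical names):

```python
from collections import OrderedDict

class KernelCache:
    """Bounded LRU cache; evicted kernels are explicitly released."""

    def __init__(self, capacity, release):
        self.capacity = capacity
        self.release = release          # lifecycle hook for evicted kernels
        self._kernels = OrderedDict()

    def get(self, name):
        if name in self._kernels:
            self._kernels.move_to_end(name)   # mark most-recently-used
            return self._kernels[name]
        return None

    def put(self, name, kernel):
        self._kernels[name] = kernel
        self._kernels.move_to_end(name)
        while len(self._kernels) > self.capacity:
            _, old = self._kernels.popitem(last=False)  # evict LRU entry
            self.release(old)           # explicit teardown, not GC-driven

released = []
cache = KernelCache(2, released.append)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" becomes most-recent
cache.put("c", 3)     # capacity exceeded: evicts and releases "b"
```

The explicit `release` callback matters because compiled ANE programs hold hardware resources that should be freed deterministically rather than left to garbage collection.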
What crane is not:
- A replacement for Core ML, MLX, or any production inference stack
- A general-purpose ANE programming framework
- Tested on anything other than Apple Silicon M-series with macOS 15+
Measured on Qwen2.5-VL-3B-Instruct vision encoder (32 transformer blocks, dim=1280, 16 heads, head_dim=80, intermediate=3420), 384x384 input image, Apple Silicon:
| Stage | Warm Latency (ms) | Improvement |
|---|---|---|
| Per-operator ANE kernels, dynamic weights | 5,819 | baseline |
| + Compile-time weight baking (partial) | 4,498 | 1.3x |
| + Remove sequence chunking, full bake | 1,315 | 4.4x |
| + Fused block kernel | 260 | 22.4x |
| + Baked rotary + IOSurface ping-pong chain | 245 | 23.7x |
| Approach | Per-Block (ms) | ANE Evals/Block |
|---|---|---|
| Unfused (5 separate ANE calls + CPU ops) | 36.4 | 5 |
| Fused (1 single ANE call) | 7.9 | 1 |
| Fused + baked rotary + chain | 7.7 | 1 |
| Speedup (unfused vs. fused + chain) | 4.7x | |
With baked rotary embeddings, input and output have the same shape [1, dim, 1, seq]. This enables IOSurface chaining across blocks:
- Block 0: input=surfaceA, output=surfaceB
- Block 1: input=surfaceB, output=surfaceA
- Block 2: input=surfaceA, output=surfaceB
- ...
Only 1 CPU write (entry) + 1 CPU read (exit) for the entire 32-block chain. Intermediate blocks execute with zero CPU-ANE data transfer.
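The alternation above reduces to a parity flip on the surface index. A minimal Python sketch, modeling each surface as a mutable one-element list and each block as a pure function (stand-ins for the real IOSurfaces and ANE evals):

```python
def run_chain(kernels, x, surfaces):
    """Ping-pong two shared surfaces across a chain of kernels.

    surfaces: [surface_a, surface_b], each a one-element list standing in
    for a shared IOSurface. Each kernel reads its input surface and writes
    its output surface; the CPU only touches the entry and exit surfaces.
    """
    surfaces[0][0] = x                         # 1 CPU write at entry
    for i, kernel in enumerate(kernels):
        src = surfaces[i % 2]
        dst = surfaces[(i + 1) % 2]
        dst[0] = kernel(src[0])                # ANE eval: src -> dst, no CPU copy
    return surfaces[len(kernels) % 2][0]       # 1 CPU read at exit

# 32 "blocks", each adding 1, chained across surfaces A and B
kernels = [lambda v: v + 1 for _ in range(32)]
surface_a, surface_b = [None], [None]
out = run_chain(kernels, 0, [surface_a, surface_b])
print(out)  # 32
```

With an even block count the result lands back on the entry surface, which is why the CPU can read the final hidden states from a surface it already holds.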
Numerical parity (max / mean absolute element-wise difference):

| Block Type | max_abs_diff | mean_abs_diff |
|---|---|---|
| Windowed attention (block 0) | 0.066 | 0.003 |
| Full attention (block 7) | 0.023 | 0.003 |
| Baked rotary vs dynamic rotary | 0.008 | 0.000002 |
| 2-block chained vs sequential | 0.0 | 0.0 (bit-exact) |
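The max/mean absolute-difference figures above are computed element-wise. A minimal sketch of such a comparison (the `parity` helper is illustrative, not part of the crane API):

```python
import numpy as np

def parity(ane_out: np.ndarray, ref_out: np.ndarray):
    """Return (max_abs_diff, mean_abs_diff) between two result tensors."""
    diff = np.abs(ane_out.astype(np.float64) - ref_out.astype(np.float64))
    return float(diff.max()), float(diff.mean())

# Emulate the fp16 internal precision against an fp32 reference
ref = np.linspace(-1.0, 1.0, 8, dtype=np.float32)
ane = ref.astype(np.float16).astype(np.float32)  # fp16 round-trip
mx, mn = parity(ane, ref)
```

Identical tensors yield (0.0, 0.0), which is what the bit-exact chained-vs-sequential row reports.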
```
Python (numpy)                        ANE Hardware
      |                                    |
      |  pack [1, C, 1, S]                 |
      +--- write_input ------->  IOSurface (shared memory)
      |                                    |
      |                         compile MIL -> ANE program
      |                         bake weights as constants
      |                                    |
      +--- eval -------------->  ANE executes fused kernel
      |                                    |
      |  unpack (S, C)                     |
      +<-- read_output --------  IOSurface (shared memory)
```
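The `pack`/`unpack` steps in the diagram are layout transposes between the numpy-side (S, C) matrix and the ANE-side [1, C, 1, S] tensor. A numpy sketch (function names are illustrative):

```python
import numpy as np

def pack(x_sc: np.ndarray) -> np.ndarray:
    """(S, C) activation matrix -> [1, C, 1, S] tensor for the IOSurface."""
    s, c = x_sc.shape
    return x_sc.T.reshape(1, c, 1, s)

def unpack(y_ncs: np.ndarray) -> np.ndarray:
    """[1, C, 1, S] tensor from the IOSurface -> (S, C) activation matrix."""
    _, c, _, s = y_ncs.shape
    return y_ncs.reshape(c, s).T

x = np.arange(12, dtype=np.float32).reshape(3, 4)  # S=3 tokens, C=4 channels
```

Putting channels on the second axis matches the NCHW convention the ANE's conv-style kernels expect, with the sequence mapped to the width dimension.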
A fused VisionBlock kernel contains ~94 MIL operations:
- 2x RMSNorm (`reduce_sum` + `pow` + `rsqrt` + `mul`)
- 5x linear projection (`conv` with baked `[oc, ic, 1, 1]` weights)
- 5x bias addition
- 1x RoPE (`slice` + negate + `concat` + `mul` + `add`)
- 1x multi-head attention (`reshape` + `transpose` + `matmul` + masked `softmax` + `matmul`)
- 1x SiLU activation (`sigmoid` + `mul`)
- 2x residual `add`
- I/O casts (fp32 at boundaries, fp16 internally)
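For reference, the RMSNorm and SiLU entries map directly onto the listed primitives. A numpy sketch of the math these op sequences compute (not the MIL emission itself; `eps` is an assumed epsilon):

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    # reduce_sum + pow -> mean square; rsqrt + mul -> normalize and scale
    mean_sq = np.mean(x.astype(np.float32) ** 2, axis=-1, keepdims=True)
    return x * (1.0 / np.sqrt(mean_sq + eps)) * weight

def silu(x):
    # sigmoid + mul
    return x * (1.0 / (1.0 + np.exp(-x)))
```

Expressing the whole block in such primitive ops is what lets all ~94 operations compile into a single ANE evaluation instead of five separate dispatches.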
Requires macOS 15+ on Apple Silicon (M1/M2/M3/M4).
```shell
cd crane
make
```

This compiles `libane_bridge.dylib` from `src/ane_bridge.m`. No external dependencies: it uses only system frameworks (Foundation, IOSurface) and private ANE APIs resolved at runtime via `dlopen`.
Dynamic matmul (weights transferred on every call):

```python
import numpy as np
from crane import ANEBridgeLibrary
from crane.bridge import run_dyn_matmul

x = np.random.randn(64, 128).astype(np.float32)
w = np.random.randn(128, 256).astype(np.float32)
out = run_dyn_matmul(x, w)  # (64, 256), runs on ANE
```

Baked linear kernel (weights compiled into the program):

```python
from crane import compile_baked_linear_kernel
from crane.runtime import run_baked_linear_kernel

kernel = compile_baked_linear_kernel(
    ic=128, oc=256, seq=64,
    logical_kernel_name="my_linear",
    weights=w,  # baked at compile time
)
out = run_baked_linear_kernel(kernel, x)  # only the activation is transferred per call
```

Fused VisionBlock kernel:

```python
from crane import (
    compile_fused_vision_block,
    run_fused_vision_block,
    build_windowed_attention_mask,
)

# All weights baked at compile time
kernel = compile_fused_vision_block(
    block_weights=weights_dict,
    attention_mask=build_windowed_attention_mask(cu_seqlens, seq_len),
    seq=784, dim=1280, num_heads=16, head_dim=80, intermediate=3420,
    logical_name="block.0",
)

# Per-image: only hidden_states + cos + sin transferred
out = run_fused_vision_block(kernel, hidden_states, cos, sin,
                             seq=784, dim=1280, head_dim=80)
```

IOSurface ping-pong chain:

```python
from crane import (
    compile_fused_vision_block,
    run_ping_pong_chained_fused_vision_blocks,
    build_windowed_attention_mask,
)

# Compile 32 blocks with baked rotary cos/sin
kernels = []
for i in range(32):
    kernel = compile_fused_vision_block(
        block_weights=all_weights[i],
        attention_mask=masks[i],
        seq=784, dim=1280, num_heads=16, head_dim=80, intermediate=3420,
        logical_name=f"block.{i}",
        rotary_cos=cos,  # baked as MIL constants
        rotary_sin=sin,  # enables hidden-only I/O
    )
    kernels.append(kernel)

# Execute the entire 32-block chain with IOSurface ping-pong.
# Only 1 write + 1 read for all 32 blocks.
out = run_ping_pong_chained_fused_vision_blocks(
    kernels, hidden_states,
    seq=784, dim=1280,
)
```

To run the tests:

```shell
make test
```

Or manually:
```shell
CRANE_BRIDGE_PATH=src/libane_bridge.dylib python -m pytest tests/ -v
```

Project layout:

```
crane/
  src/
    ane_bridge.h       # C API header
    ane_bridge.m       # Objective-C bridge implementation
  crane/
    __init__.py        # Public API
    bridge.py          # Python ctypes bindings + MIL generators
    runtime.py         # ANEKernel compile/run/cache management
    fused_block.py     # Fused VisionBlock MIL generator + runtime
  reference/
    ane_runtime.h      # Original ANE runtime (compile/eval/IOSurface)
    ane_mil_gen.h      # MIL generators: conv, matmul, fused QKV, fused FFN
    stories_mil.h      # Fused SDPA + FFN forward kernels (block-level fusion)
    mil_dynamic_gqa.h  # GQA-aware dynamic kernels (Qwen3-0.6B)
    README.md          # Key MIL patterns and weight blob format reference
  tests/
    test_bridge.py     # ANE hardware tests
  Makefile
  README.md
```
The `reference/` directory contains the original Objective-C MIL generators from the ANE Training project that established the foundational patterns for ANE kernel programming. These include fused SDPA forward kernels, GQA attention, weight blob construction, and the runtime API that `ane_bridge.m` wraps.
- Private APIs: uses `_ANEClient`, `_ANECompiler`, `_ANEInMemoryModelDescriptor`; undocumented, may break with macOS updates
- fp16 internal precision: ANE computes in fp16; input/output are fp32 for compatibility
- Static shapes: MIL programs are compiled for fixed tensor shapes; different resolutions need recompilation
- Single input tensor: ANE kernels accept one input; multiple tensors are packed via channel concatenation
- macOS 15+ required: Tested on M-series chips only
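The single-input limitation is worked around by channel concatenation: extra tensors (e.g. rotary cos/sin when they are not baked) are stacked along the channel axis of the one [1, C, 1, S] input, presumably to be sliced apart again inside the MIL program. A numpy sketch of the host-side packing (function names are illustrative):

```python
import numpy as np

def pack_channels(*tensors):
    """Stack [1, C_i, 1, S] tensors along the channel axis into one input."""
    return np.concatenate(tensors, axis=1)

def split_channels(packed, widths):
    """Host-side inverse; inside MIL this would be one slice per tensor."""
    parts, start = [], 0
    for w in widths:
        parts.append(packed[:, start:start + w, :, :])
        start += w
    return parts

# e.g. hidden states (dim=1280) plus rotary cos (head_dim=80) for seq=784
hidden = np.zeros((1, 1280, 1, 784), dtype=np.float32)
cos = np.ones((1, 80, 1, 784), dtype=np.float32)
packed = pack_channels(hidden, cos)  # [1, 1360, 1, 784]
```

All tensors must share the same sequence length S for this to work, which the fixed-shape compilation already guarantees.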
Yilong (Jimmy) Li, University of Wisconsin-Madison
This project is part of ongoing research on cross-accelerator inference for vision-language models on edge SoCs.
This project uses Apple's private, undocumented APIs for research purposes. These APIs may change or break with any macOS update. This is independent research, not affiliated with or endorsed by Apple Inc. See Sega v. Accolade (1992) and DMCA Section 1201(f) regarding reverse engineering for interoperability.
MIT
