Compiled Runtime for Apple Neural Engine
Developed by Yilong Li during his research at the University of Wisconsin-Madison.
Direct Python control of Apple Neural Engine (ANE) via reverse-engineered private APIs. Compile MIL programs with baked weights, execute fused transformer blocks on ANE hardware, and cache kernels for repeated inference — no Core ML required.
What crane is:
- A Python runtime for executing custom compute graphs directly on Apple's Neural Engine
- Fused transformer block kernels: RMSNorm + multi-head attention (with RoPE and windowed masking) + SwiGLU MLP + residual connections — compiled into a single ANE evaluation
- Compile-time weight baking via `_ANEInMemoryModel` private APIs
- IOSurface ping-pong chaining: 32 blocks share 2 surfaces, eliminating all intermediate CPU-ANE transfers
- Baked rotary embeddings: cos/sin as MIL constants, enabling hidden-states-only I/O for chaining
- Dynamic kernel cache with bounded LRU eviction and explicit lifecycle management
- C bridge (`libane_bridge.dylib`) wrapping `_ANECompiler` / `_ANEInMemoryModelDescriptor` / `_ANERequest` into a ctypes-friendly interface
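The bounded LRU kernel cache with explicit lifecycle management can be modeled with an `OrderedDict`; this is an illustrative sketch, not the runtime's actual implementation (the class and `release` hook are hypothetical names):

```python
from collections import OrderedDict

class KernelCache:
    """Bounded LRU cache; evicted kernels are explicitly released."""

    def __init__(self, capacity, release):
        self.capacity = capacity
        self.release = release          # lifecycle hook for evicted kernels
        self._kernels = OrderedDict()

    def get(self, name):
        if name in self._kernels:
            self._kernels.move_to_end(name)   # mark most-recently-used
            return self._kernels[name]
        return None

    def put(self, name, kernel):
        self._kernels[name] = kernel
        self._kernels.move_to_end(name)
        while len(self._kernels) > self.capacity:
            _, old = self._kernels.popitem(last=False)  # evict LRU entry
            self.release(old)           # explicit teardown, not GC-driven

released = []
cache = KernelCache(2, released.append)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" becomes most-recent
cache.put("c", 3)     # capacity exceeded: evicts and releases "b"
```

The explicit `release` callback matters because compiled ANE programs hold hardware resources that should be freed deterministically rather than left to garbage collection.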
What crane is not:
- A replacement for Core ML, MLX, or any production inference stack
- A general-purpose ANE programming framework
- Tested on anything other than Apple Silicon M-series with macOS 15+
Measured on Qwen2.5-VL-3B-Instruct vision encoder (32 transformer blocks, dim=1280, 16 heads, head_dim=80, intermediate=3420), 384x384 input image, Apple Silicon:
| Stage | Warm Latency (ms) | Improvement |
|---|---|---|
| Per-operator ANE kernels, dynamic weights | 5,819 | baseline |
| + Compile-time weight baking (partial) | 4,498 | 1.3x |
| + Remove sequence chunking, full bake | 1,315 | 4.4x |
| + Fused block kernel | 260 | 22.4x |
| + Baked rotary + IOSurface ping-pong chain | 245 | 23.7x |
| Approach | Per-Block (ms) | ANE Evals/Block |
|---|---|---|
| Unfused (5 separate ANE calls + CPU ops) | 36.4 | 5 |
| Fused (1 single ANE call) | 7.9 | 1 |
| Fused + baked rotary + chain | 7.7 | 1 |
| Speedup (unfused vs. fused + chain) | 4.7x | |
With baked rotary embeddings, input and output have the same shape [1, dim, 1, seq]. This enables IOSurface chaining across blocks:
- Block 0: input=surfaceA, output=surfaceB
- Block 1: input=surfaceB, output=surfaceA
- Block 2: input=surfaceA, output=surfaceB
- ...
Only 1 CPU write (entry) + 1 CPU read (exit) for the entire 32-block chain. Intermediate blocks execute with zero CPU-ANE data transfer.
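The alternation above reduces to a parity flip on the surface index. A minimal Python sketch, modeling each surface as a mutable one-element list and each block as a pure function (stand-ins for the real IOSurfaces and ANE evals):

```python
def run_chain(kernels, x, surfaces):
    """Ping-pong two shared surfaces across a chain of kernels.

    surfaces: [surface_a, surface_b], each a one-element list standing in
    for a shared IOSurface. Each kernel reads its input surface and writes
    its output surface; the CPU only touches the entry and exit surfaces.
    """
    surfaces[0][0] = x                         # 1 CPU write at entry
    for i, kernel in enumerate(kernels):
        src = surfaces[i % 2]
        dst = surfaces[(i + 1) % 2]
        dst[0] = kernel(src[0])                # ANE eval: src -> dst, no CPU copy
    return surfaces[len(kernels) % 2][0]       # 1 CPU read at exit

# 32 "blocks", each adding 1, chained across surfaces A and B
kernels = [lambda v: v + 1 for _ in range(32)]
surface_a, surface_b = [None], [None]
out = run_chain(kernels, 0, [surface_a, surface_b])
print(out)  # 32
```

With an even block count the result lands back on the entry surface, which is why the CPU can read the final hidden states from a surface it already holds.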
Numerical parity (max / mean absolute element-wise difference):

| Block Type | max_abs_diff | mean_abs_diff |
|---|---|---|
| Windowed attention (block 0) | 0.066 | 0.003 |
| Full attention (block 7) | 0.023 | 0.003 |
| Baked rotary vs dynamic rotary | 0.008 | 0.000002 |
| 2-block chained vs sequential | 0.0 | 0.0 (bit-exact) |
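The max/mean absolute-difference figures above are computed element-wise. A minimal sketch of such a comparison (the `parity` helper is illustrative, not part of the crane API):

```python
import numpy as np

def parity(ane_out: np.ndarray, ref_out: np.ndarray):
    """Return (max_abs_diff, mean_abs_diff) between two result tensors."""
    diff = np.abs(ane_out.astype(np.float64) - ref_out.astype(np.float64))
    return float(diff.max()), float(diff.mean())

# Emulate the fp16 internal precision against an fp32 reference
ref = np.linspace(-1.0, 1.0, 8, dtype=np.float32)
ane = ref.astype(np.float16).astype(np.float32)  # fp16 round-trip
mx, mn = parity(ane, ref)
```

Identical tensors yield (0.0, 0.0), which is what the bit-exact chained-vs-sequential row reports.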
```
Python (numpy)                        ANE Hardware
      |                                    |
      |  pack [1, C, 1, S]                 |
      +--- write_input ------->  IOSurface (shared memory)
      |                                    |
      |                         compile MIL -> ANE program
      |                         bake weights as constants
      |                                    |
      +--- eval -------------->  ANE executes fused kernel
      |                                    |
      |  unpack (S, C)                     |
      +<-- read_output --------  IOSurface (shared memory)
```
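The `pack`/`unpack` steps in the diagram are layout transposes between the numpy-side (S, C) matrix and the ANE-side [1, C, 1, S] tensor. A numpy sketch (function names are illustrative):

```python
import numpy as np

def pack(x_sc: np.ndarray) -> np.ndarray:
    """(S, C) activation matrix -> [1, C, 1, S] tensor for the IOSurface."""
    s, c = x_sc.shape
    return x_sc.T.reshape(1, c, 1, s)

def unpack(y_ncs: np.ndarray) -> np.ndarray:
    """[1, C, 1, S] tensor from the IOSurface -> (S, C) activation matrix."""
    _, c, _, s = y_ncs.shape
    return y_ncs.reshape(c, s).T

x = np.arange(12, dtype=np.float32).reshape(3, 4)  # S=3 tokens, C=4 channels
```

Putting channels on the second axis matches the NCHW convention the ANE's conv-style kernels expect, with the sequence mapped to the width dimension.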
A fused VisionBlock kernel contains ~94 MIL operations:
- 2x RMSNorm (`reduce_sum` + `pow` + `rsqrt` + `mul`)
- 5x linear projection (`conv` with baked `[oc, ic, 1, 1]` weights)
- 5x bias addition
- 1x RoPE (`slice` + negate + `concat` + `mul` + `add`)
- 1x multi-head attention (`reshape` + `transpose` + `matmul` + masked `softmax` + `matmul`)
- 1x SiLU activation (`sigmoid` + `mul`)
- 2x residual `add`
- I/O casts (fp32 at boundaries, fp16 internally)
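For reference, the RMSNorm and SiLU entries map directly onto the listed primitives. A numpy sketch of the math these op sequences compute (not the MIL emission itself; `eps` is an assumed epsilon):

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    # reduce_sum + pow -> mean square; rsqrt + mul -> normalize and scale
    mean_sq = np.mean(x.astype(np.float32) ** 2, axis=-1, keepdims=True)
    return x * (1.0 / np.sqrt(mean_sq + eps)) * weight

def silu(x):
    # sigmoid + mul
    return x * (1.0 / (1.0 + np.exp(-x)))
```

Expressing the whole block in such primitive ops is what lets all ~94 operations compile into a single ANE evaluation instead of five separate dispatches.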
Requires macOS 15+ on Apple Silicon (M1/M2/M3/M4).
```shell
cd crane
make
```

This compiles `libane_bridge.dylib` from `src/ane_bridge.m`. No external dependencies: it uses only system frameworks (Foundation, IOSurface) and private ANE APIs resolved at runtime via `dlopen`.
Dynamic matmul (weights transferred on every call):

```python
import numpy as np
from crane import ANEBridgeLibrary
from crane.bridge import run_dyn_matmul

x = np.random.randn(64, 128).astype(np.float32)
w = np.random.randn(128, 256).astype(np.float32)
out = run_dyn_matmul(x, w)  # (64, 256), runs on ANE
```

Baked linear kernel (weights compiled into the program):

```python
from crane import compile_baked_linear_kernel
from crane.runtime import run_baked_linear_kernel

kernel = compile_baked_linear_kernel(
    ic=128, oc=256, seq=64,
    logical_kernel_name="my_linear",
    weights=w,  # baked at compile time
)
out = run_baked_linear_kernel(kernel, x)  # only the activation is transferred per call
```

Fused VisionBlock kernel:

```python
from crane import (
    compile_fused_vision_block,
    run_fused_vision_block,
    build_windowed_attention_mask,
)

# All weights baked at compile time
kernel = compile_fused_vision_block(
    block_weights=weights_dict,
    attention_mask=build_windowed_attention_mask(cu_seqlens, seq_len),
    seq=784, dim=1280, num_heads=16, head_dim=80, intermediate=3420,
    logical_name="block.0",
)

# Per-image: only hidden_states + cos + sin transferred
out = run_fused_vision_block(kernel, hidden_states, cos, sin,
                             seq=784, dim=1280, head_dim=80)
```

IOSurface ping-pong chain:

```python
from crane import (
    compile_fused_vision_block,
    run_ping_pong_chained_fused_vision_blocks,
    build_windowed_attention_mask,
)

# Compile 32 blocks with baked rotary cos/sin
kernels = []
for i in range(32):
    kernel = compile_fused_vision_block(
        block_weights=all_weights[i],
        attention_mask=masks[i],
        seq=784, dim=1280, num_heads=16, head_dim=80, intermediate=3420,
        logical_name=f"block.{i}",
        rotary_cos=cos,  # baked as MIL constants
        rotary_sin=sin,  # enables hidden-only I/O
    )
    kernels.append(kernel)

# Execute the entire 32-block chain with IOSurface ping-pong.
# Only 1 write + 1 read for all 32 blocks.
out = run_ping_pong_chained_fused_vision_blocks(
    kernels, hidden_states,
    seq=784, dim=1280,
)
```

To run the tests:

```shell
make test
```

Or manually:
```shell
CRANE_BRIDGE_PATH=src/libane_bridge.dylib python -m pytest tests/ -v
```

Project layout:

```
crane/
  src/
    ane_bridge.h       # C API header
    ane_bridge.m       # Objective-C bridge implementation
  crane/
    __init__.py        # Public API
    bridge.py          # Python ctypes bindings + MIL generators
    runtime.py         # ANEKernel compile/run/cache management
    fused_block.py     # Fused VisionBlock MIL generator + runtime
  reference/
    ane_runtime.h      # Original ANE runtime (compile/eval/IOSurface)
    ane_mil_gen.h      # MIL generators: conv, matmul, fused QKV, fused FFN
    stories_mil.h      # Fused SDPA + FFN forward kernels (block-level fusion)
    mil_dynamic_gqa.h  # GQA-aware dynamic kernels (Qwen3-0.6B)
    README.md          # Key MIL patterns and weight blob format reference
  tests/
    test_bridge.py     # ANE hardware tests
  Makefile
  README.md
```
The `reference/` directory contains the original Objective-C MIL generators from the ANE Training project that established the foundational patterns for ANE kernel programming. These include fused SDPA forward kernels, GQA attention, weight blob construction, and the runtime API that `ane_bridge.m` wraps.
- Private APIs: uses `_ANEClient`, `_ANECompiler`, `_ANEInMemoryModelDescriptor`; undocumented, may break with macOS updates
- fp16 internal precision: ANE computes in fp16; input/output are fp32 for compatibility
- Static shapes: MIL programs are compiled for fixed tensor shapes; different resolutions need recompilation
- Single input tensor: ANE kernels accept one input; multiple tensors are packed via channel concatenation
- macOS 15+ required: Tested on M-series chips only
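The single-input limitation is worked around by channel concatenation: extra tensors (e.g. rotary cos/sin when they are not baked) are stacked along the channel axis of the one [1, C, 1, S] input, presumably to be sliced apart again inside the MIL program. A numpy sketch of the host-side packing (function names are illustrative):

```python
import numpy as np

def pack_channels(*tensors):
    """Stack [1, C_i, 1, S] tensors along the channel axis into one input."""
    return np.concatenate(tensors, axis=1)

def split_channels(packed, widths):
    """Host-side inverse; inside MIL this would be one slice per tensor."""
    parts, start = [], 0
    for w in widths:
        parts.append(packed[:, start:start + w, :, :])
        start += w
    return parts

# e.g. hidden states (dim=1280) plus rotary cos (head_dim=80) for seq=784
hidden = np.zeros((1, 1280, 1, 784), dtype=np.float32)
cos = np.ones((1, 80, 1, 784), dtype=np.float32)
packed = pack_channels(hidden, cos)  # [1, 1360, 1, 784]
```

All tensors must share the same sequence length S for this to work, which the fixed-shape compilation already guarantees.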
Yilong (Jimmy) Li, University of Wisconsin-Madison
This project is part of ongoing research on cross-accelerator inference for vision-language models on edge SoCs.
This project uses Apple's private, undocumented APIs for research purposes. These APIs may change or break with any macOS update. This is independent research, not affiliated with or endorsed by Apple Inc. See Sega v. Accolade (1992) and DMCA Section 1201(f) regarding reverse engineering for interoperability.
MIT
