CRANE

Compiled Runtime for Apple Neural Engine

Developed by Yilong Li during his research at the University of Wisconsin-Madison.

Direct Python control of Apple Neural Engine (ANE) via reverse-engineered private APIs. Compile MIL programs with baked weights, execute fused transformer blocks on ANE hardware, and cache kernels for repeated inference — no Core ML required.

What This Is

  • A Python runtime for executing custom compute graphs directly on Apple's Neural Engine
  • Fused transformer block kernels: RMSNorm + multi-head attention (with RoPE and windowed masking) + SwiGLU MLP + residual connections — compiled into a single ANE evaluation
  • Compile-time weight baking via _ANEInMemoryModel private APIs
  • IOSurface ping-pong chaining: 32 blocks share 2 surfaces, eliminating all intermediate CPU-ANE transfers
  • Baked rotary embeddings: cos/sin as MIL constants, enabling hidden-states-only I/O for chaining
  • Dynamic kernel cache with bounded LRU eviction and explicit lifecycle management
  • C bridge (libane_bridge.dylib) wrapping _ANECompiler / _ANEInMemoryModelDescriptor / _ANERequest into a ctypes-friendly interface
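
The bounded LRU kernel cache mentioned above can be sketched in a few lines of Python. This is an illustrative sketch only — the class name and eviction hook are hypothetical, not CRANE's actual API:

```python
from collections import OrderedDict

class KernelCache:
    """Bounded LRU cache: evicts the least-recently-used compiled kernel."""
    def __init__(self, capacity=32, on_evict=None):
        self.capacity = capacity
        self.on_evict = on_evict          # explicit lifecycle hook (e.g. free the ANE program)
        self._cache = OrderedDict()

    def get(self, key):
        if key not in self._cache:
            return None
        self._cache.move_to_end(key)      # mark as most recently used
        return self._cache[key]

    def put(self, key, kernel):
        if key in self._cache:
            self._cache.move_to_end(key)
        self._cache[key] = kernel
        if len(self._cache) > self.capacity:
            old_key, old_kernel = self._cache.popitem(last=False)
            if self.on_evict:
                self.on_evict(old_key, old_kernel)   # release hardware resources
```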

What This Is Not

  • A replacement for Core ML, MLX, or any production inference stack
  • A general-purpose ANE programming framework
  • Tested on anything other than Apple Silicon M-series with macOS 15+

Results

Measured on Qwen2.5-VL-3B-Instruct vision encoder (32 transformer blocks, dim=1280, 16 heads, head_dim=80, intermediate=3420), 384x384 input image, Apple Silicon:

Optimization Progression

| Stage | Warm Latency (ms) | Improvement |
|---|---|---|
| Per-operator ANE kernels, dynamic weights | 5,819 | baseline |
| + Compile-time weight baking (partial) | 4,498 | 1.3x |
| + Remove sequence chunking, full bake | 1,315 | 4.4x |
| + Fused block kernel | 260 | 22.4x |
| + Baked rotary + IOSurface ping-pong chain | 245 | 23.7x |

Fused Block: Per-Block Latency

| Approach | Per-Block (ms) | ANE Evals/Block |
|---|---|---|
| Unfused (5 separate ANE calls + CPU ops) | 36.4 | 5 |
| Fused (1 single ANE call) | 7.9 | 1 |
| Fused + baked rotary + chain | 7.7 | 1 |
| Speedup | 4.7x | |

IOSurface Ping-Pong Chain

With baked rotary embeddings, input and output have the same shape [1, dim, 1, seq]. This enables IOSurface chaining across blocks:

  • Block 0: input=surfaceA, output=surfaceB
  • Block 1: input=surfaceB, output=surfaceA
  • Block 2: input=surfaceA, output=surfaceB
  • ...

Only 1 CPU write (entry) + 1 CPU read (exit) for the entire 32-block chain. Intermediate blocks execute with zero CPU-ANE data transfer.
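
The alternation logic above reduces to index parity. A minimal sketch, where plain Python objects stand in for IOSurfaces and the `eval_block` callable is hypothetical:

```python
def run_chain(kernels, surface_a, surface_b, eval_block):
    """Run blocks back-to-back, alternating two shared surfaces.
    Even blocks read A and write B; odd blocks read B and write A."""
    for i, kernel in enumerate(kernels):
        src, dst = (surface_a, surface_b) if i % 2 == 0 else (surface_b, surface_a)
        eval_block(kernel, src, dst)      # ANE eval: no CPU copy between blocks
    # the final output lands on B for an odd-length chain, on A for an even one
    return surface_b if len(kernels) % 2 == 1 else surface_a
```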

Parity vs Reference (MLX bfloat16)

| Block Type | max_abs_diff | mean_abs_diff |
|---|---|---|
| Windowed attention (block 0) | 0.066 | 0.003 |
| Full attention (block 7) | 0.023 | 0.003 |
| Baked rotary vs dynamic rotary | 0.008 | 0.000002 |
| 2-block chained vs sequential | 0.0 | 0.0 (bit-exact) |
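
The metrics in the table are the standard elementwise comparisons; a numpy sketch of how such parity numbers are computed:

```python
import numpy as np

def parity(ane_out, ref_out):
    """max_abs_diff / mean_abs_diff between ANE output and a reference."""
    diff = np.abs(ane_out.astype(np.float32) - ref_out.astype(np.float32))
    return float(diff.max()), float(diff.mean())
```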

Architecture

Python (numpy)                    ANE Hardware
    |                                 |
    |  pack [1, C, 1, S]             |
    +--- write_input ------>  IOSurface (shared memory)
    |                                 |
    |                         compile MIL -> ANE program
    |                         bake weights as constants
    |                                 |
    +--- eval -------------->  ANE executes fused kernel
    |                                 |
    |  unpack (S, C)                 |
    +<-- read_output --------  IOSurface (shared memory)
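
The pack/unpack steps in the diagram map a row-major (S, C) activation matrix to the [1, C, 1, S] layout and back. A numpy sketch of the round trip (function names are illustrative, not CRANE's API):

```python
import numpy as np

def pack(x_sc):
    """(S, C) -> [1, C, 1, S] for the IOSurface input."""
    s, c = x_sc.shape
    return np.ascontiguousarray(x_sc.T).reshape(1, c, 1, s)

def unpack(y_1c1s):
    """[1, C, 1, S] -> (S, C) after read_output."""
    _, c, _, s = y_1c1s.shape
    return y_1c1s.reshape(c, s).T
```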

A fused VisionBlock kernel contains ~94 MIL operations:

  • 2x RMSNorm (reduce_sum + pow + rsqrt + mul)
  • 5x linear projection (conv with baked [oc, ic, 1, 1] weights)
  • 5x bias addition
  • 1x RoPE (slice + negate + concat + mul + add)
  • 1x multi-head attention (reshape + transpose + matmul + masked softmax + matmul)
  • 1x SiLU activation (sigmoid + mul)
  • 2x residual add
  • I/O casts (fp32 at boundaries, fp16 internally)
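
The RoPE step listed above (slice + negate + concat + mul + add) is the standard rotate-half formulation; a numpy reference sketch:

```python
import numpy as np

def rope(x, cos, sin):
    """Rotate-half RoPE: x * cos + rotate_half(x) * sin.
    x: (..., head_dim); cos/sin broadcastable to x."""
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]          # slice
    rotated = np.concatenate([-x2, x1], axis=-1)   # negate + concat
    return x * cos + rotated * sin                 # mul + add
```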

Building

Requires macOS 15+ on Apple Silicon (M1/M2/M3/M4).

cd crane
make

This compiles libane_bridge.dylib from src/ane_bridge.m. No external dependencies — uses only system frameworks (Foundation, IOSurface) and private ANE APIs resolved at runtime via dlopen.
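
The compiled dylib is typically located via the CRANE_BRIDGE_PATH environment variable (as in the test command under Testing). A hedged sketch of path resolution and loading — the fallback path and error handling here are assumptions for illustration:

```python
import ctypes
import os

def load_bridge():
    """Resolve libane_bridge.dylib via CRANE_BRIDGE_PATH, falling back to
    the in-tree build output."""
    path = os.environ.get("CRANE_BRIDGE_PATH", "src/libane_bridge.dylib")
    if not os.path.exists(path):
        raise FileNotFoundError(f"bridge not built or path wrong: {path}")
    return ctypes.CDLL(path)   # exposes the ctypes-friendly C API
```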

Usage

Basic: Dynamic Matrix Multiply

import numpy as np
from crane import ANEBridgeLibrary
from crane.bridge import run_dyn_matmul

x = np.random.randn(64, 128).astype(np.float32)
w = np.random.randn(128, 256).astype(np.float32)
out = run_dyn_matmul(x, w)  # (64, 256), runs on ANE
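
As a sanity check, the same product can be computed on the CPU with numpy. Since the ANE computes in fp16 internally, compare against the fp32 reference with a loose tolerance rather than exact equality (the tolerance below is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 128)).astype(np.float32)
w = rng.standard_normal((128, 256)).astype(np.float32)

ref = x @ w  # fp32 CPU reference, shape (64, 256)
# On hardware: out = run_dyn_matmul(x, w)
#              assert np.abs(out - ref).max() < 0.05  # fp16-scale tolerance (assumption)
```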

Baked-Weight Linear

from crane import compile_baked_linear_kernel
from crane.runtime import run_baked_linear_kernel

kernel = compile_baked_linear_kernel(
    ic=128, oc=256, seq=64,
    logical_kernel_name="my_linear",
    weights=w,  # baked at compile time
)
out = run_baked_linear_kernel(kernel, x)  # only activation transferred per call

Fused Transformer Block

from crane import (
    compile_fused_vision_block,
    run_fused_vision_block,
    build_windowed_attention_mask,
)

# All weights baked at compile time
kernel = compile_fused_vision_block(
    block_weights=weights_dict,
    attention_mask=build_windowed_attention_mask(cu_seqlens, seq_len),
    seq=784, dim=1280, num_heads=16, head_dim=80, intermediate=3420,
    logical_name="block.0",
)

# Per-image: only hidden_states + cos + sin transferred
out = run_fused_vision_block(kernel, hidden_states, cos, sin,
                              seq=784, dim=1280, head_dim=80)
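
build_windowed_attention_mask takes cu_seqlens (cumulative window boundaries, as in the Qwen2.5-VL vision encoder). A hedged numpy sketch of what a block-diagonal windowed mask looks like — the additive -inf convention is an assumption about the actual helper:

```python
import numpy as np

def windowed_mask(cu_seqlens, seq_len):
    """Block-diagonal additive mask: tokens attend only within their window.
    cu_seqlens: cumulative boundaries, e.g. [0, 4, 8] for two windows of 4."""
    mask = np.full((seq_len, seq_len), -np.inf, dtype=np.float32)
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        mask[start:end, start:end] = 0.0   # allow attention inside the window
    return mask
```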

Fused Block with Baked Rotary + Ping-Pong Chain

from crane import (
    compile_fused_vision_block,
    run_ping_pong_chained_fused_vision_blocks,
    build_windowed_attention_mask,
)

# Compile 32 blocks with baked rotary cos/sin
kernels = []
for i in range(32):
    kernel = compile_fused_vision_block(
        block_weights=all_weights[i],
        attention_mask=masks[i],
        seq=784, dim=1280, num_heads=16, head_dim=80, intermediate=3420,
        logical_name=f"block.{i}",
        rotary_cos=cos,   # baked as MIL constants
        rotary_sin=sin,   # enables hidden-only I/O
    )
    kernels.append(kernel)

# Execute entire 32-block chain with IOSurface ping-pong
# Only 1 write + 1 read for all 32 blocks
out = run_ping_pong_chained_fused_vision_blocks(
    kernels, hidden_states,
    seq=784, dim=1280,
)

Testing

make test

Or manually:

CRANE_BRIDGE_PATH=src/libane_bridge.dylib python -m pytest tests/ -v

File Structure

crane/
  src/
    ane_bridge.h          # C API header
    ane_bridge.m          # Objective-C bridge implementation
    crane/
      __init__.py         # Public API
      bridge.py           # Python ctypes bindings + MIL generators
      runtime.py          # ANEKernel compile/run/cache management
      fused_block.py      # Fused VisionBlock MIL generator + runtime
  reference/
    ane_runtime.h         # Original ANE runtime (compile/eval/IOSurface)
    ane_mil_gen.h         # MIL generators: conv, matmul, fused QKV, fused FFN
    stories_mil.h         # Fused SDPA + FFN forward kernels (block-level fusion)
    mil_dynamic_gqa.h     # GQA-aware dynamic kernels (Qwen3-0.6B)
    README.md             # Key MIL patterns and weight blob format reference
  tests/
    test_bridge.py        # ANE hardware tests
  Makefile
  README.md

The reference/ directory contains the original Objective-C MIL generators from the ANE Training project that established the foundational patterns for ANE kernel programming. These include fused SDPA forward kernels, GQA attention, weight blob construction, and the runtime API that ane_bridge.m wraps.

Limitations

  • Private APIs: Uses _ANEClient, _ANECompiler, _ANEInMemoryModelDescriptor — undocumented, may break with macOS updates
  • fp16 internal precision: ANE computes in fp16; input/output are fp32 for compatibility
  • Static shapes: MIL programs are compiled for fixed tensor shapes; different resolutions need recompilation
  • Single input tensor: ANE kernels accept one input; multiple tensors are packed via channel concatenation
  • macOS 15+ required: Tested on M-series chips only
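
The single-input limitation is worked around by concatenating tensors along the channel axis of the [1, C, 1, S] layout. A numpy sketch of packing hidden states with cos/sin — the split offsets are illustrative, not CRANE's actual blob format:

```python
import numpy as np

def pack_inputs(hidden, cos, sin):
    """Concatenate [1, C, 1, S] tensors along channels into one ANE input."""
    return np.concatenate([hidden, cos, sin], axis=1)

def unpack_inputs(packed, dim, head_dim):
    """Recover the three tensors from the channel-concatenated input."""
    hidden = packed[:, :dim]
    cos = packed[:, dim:dim + head_dim]
    sin = packed[:, dim + head_dim:dim + 2 * head_dim]
    return hidden, cos, sin
```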

Author

Yilong (Jimmy) Li, University of Wisconsin-Madison

This project is part of ongoing research on cross-accelerator inference for vision-language models on edge SoCs.

Disclaimer

This project uses Apple's private, undocumented APIs for research purposes. These APIs may change or break with any macOS update. This is independent research, not affiliated with or endorsed by Apple Inc. See Sega v. Accolade (1992) and DMCA Section 1201(f) regarding reverse engineering for interoperability.

License

MIT
