Write compute kernels in explicit, portable syntax. Compile to shared libraries. Generate native bindings for Python, Rust, C++, PyTorch, and CMake.
No runtime. No garbage collector. No glue code.
Targets x86-64 (AVX2, AVX-512) and AArch64 (NEON).
Three workloads, measured honestly: warm-up discarded, 10 trials × 50 iterations, reporting peak throughput in GB/s, on 16M float32 elements (64 MB). All Eä kernels are autoresearch-optimized (dual accumulators, FMA, restrict pointers). The full benchmark script and methodology are included.
FMA: out[i] = a[i]*b[i] + c[i] — compute-bound
| Method | Time | GB/s | vs NumPy |
|---|---|---|---|
| NumPy (2-pass multiply+add) | 45,994 µs | 5.6 | baseline |
| Eä 1 thread | 6,921 µs | 37.0 | 6.6× |
| Eä 2 threads | 6,540 µs | 39.1 | 7.0× |
| Dask (2 chunks) | 56,448 µs | 4.5 | 0.81× |
| Ray (2 workers) | 89,106 µs | 2.9 | 0.52× |
Dot product: sum(a[i]*b[i]) — bandwidth-bound
| Method | Time | GB/s | vs NumPy |
|---|---|---|---|
| NumPy BLAS sdot | 3,570 µs | 35.9 | baseline |
| Eä 1 thread | 3,517 µs | 36.4 | 1.01× |
| Dask (2 chunks) | 6,657 µs | 19.2 | 0.54× |
| Ray (2 workers) | 26,159 µs | 4.9 | 0.14× |
SAXPY: y[i] = a*x[i] + y[i] — bandwidth-bound
| Method | Time | GB/s | vs NumPy |
|---|---|---|---|
| NumPy (2-pass multiply+add) | 7,637 µs | 16.8 | baseline |
| Eä 1 thread | 3,635 µs | 35.2 | 2.1× |
| Dask (2 chunks) | 57,131 µs | 2.2 | 0.13× |
| Ray (2 workers) | 91,306 µs | 1.4 | 0.08× |
Why: Eä fuses operations into single-pass SIMD (one FMA instruction where NumPy does two array passes). The dot product matches BLAS because dual accumulators with 4× unroll hide FMA latency and saturate memory bandwidth. Ray and Dask add serialization overhead that makes them 7–50× slower for single-machine work.
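The timing methodology above (warm-up discarded, trials × iterations, peak GB/s) can be sketched as a small Python harness. This is an illustrative stand-in, not the repository's benchmark script; `bench_gbps` and the element count here are placeholder choices.

```python
import time
import numpy as np

def bench_gbps(fn, n_bytes, trials=10, iters=50, warmup=3):
    """Peak throughput in GB/s: warm-up discarded, best trial kept,
    each trial averaging `iters` iterations."""
    for _ in range(warmup):
        fn()
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        for _ in range(iters):
            fn()
        best = min(best, (time.perf_counter() - t0) / iters)
    return n_bytes / best / 1e9

# NumPy FMA baseline: two array passes (multiply, then add).
n = 1 << 18  # smaller than the 16M elements in the tables, for a quick check
a, b, c = (np.random.rand(n).astype(np.float32) for _ in range(3))
gbps = bench_gbps(lambda: a * b + c, n_bytes=4 * n * 4)  # 3 reads + 1 write
print(f"{gbps:.1f} GB/s")
```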
```
export kernel vscale(data: *f32, out result: *mut f32 [cap: n], factor: f32)
    over i in n step 8
    tail scalar { result[i] = data[i] * factor }
{
    let v: f32x8 = load(data, i)
    store(result, i, v .* splat(factor))
}
```
Compile, bind, call:
```sh
ea kernel.ea --lib                       # -> kernel.so + kernel.ea.json
ea bind kernel.ea --python --rust --cpp  # -> kernel.py, kernel.rs, kernel.hpp
```

```python
import numpy as np, kernel

data = np.random.rand(1_000_000).astype(np.float32)
result = kernel.vscale(data, 2.0)  # output auto-allocated, length auto-filled, dtype checked
```

One kernel. Any host language. The binding handles allocation, length inference, and type checking.
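For intuition, here is a rough sketch of the shape such a generated wrapper takes (allocation, length inference, and dtype checking at the boundary). The names follow the `vscale` example above; the pure-NumPy branch stands in for the native call, and the actual `ea bind` output is not shown here.

```python
import ctypes
import numpy as np

def _load(lib_path="./kernel.so"):
    # Hypothetical: load the compiled shared library and declare the signature.
    lib = ctypes.CDLL(lib_path)
    lib.vscale.argtypes = [ctypes.POINTER(ctypes.c_float),
                           ctypes.POINTER(ctypes.c_float),
                           ctypes.c_float, ctypes.c_size_t]
    lib.vscale.restype = None
    return lib

def vscale(data, factor, _lib=None):
    data = np.ascontiguousarray(data, dtype=np.float32)  # dtype checked/coerced
    out = np.empty_like(data)                            # output auto-allocated
    n = data.size                                        # length auto-filled
    if _lib is None:
        out[:] = data * factor  # pure-NumPy stand-in for the native kernel
    else:
        _lib.vscale(data.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
                    out.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
                    ctypes.c_float(factor), n)
    return out
```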
Three workloads benchmarked against industry tools. Warm-cache medians, 20–50 timed runs, 5–10 warmup. Source, data, and scripts in each demo directory.
| Workload | Compared against | Speedup | Method |
|---|---|---|---|
| Vector search (dim=384) | FAISS IndexFlatIP | 4–8× | Dual-acc FMA, f32x8, next-vector prefetch |
| Sobel edge detection (720p–4K) | OpenCV | 5–6× (single-threaded) | Stencil f32x4, prefetch, L3 scaling analysis |
| CSV analytics (10–544 MB) | polars | 1.4–2.2× | Structural scan, SIMD reduction, binary search |
All three use ea bind for Python integration — zero manual ctypes. Validated across multiple input sizes. Full methodology and additional demos (conv2d at 265×, tokenizer at 406× vs NumPy) in COMPUTE_PATTERNS.md.
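As a reference point for the vector-search row: FAISS `IndexFlatIP` performs brute-force inner-product search with top-k selection, which can be expressed in plain NumPy. This is a sketch of the baseline's semantics, not of the Eä kernel.

```python
import numpy as np

def flat_ip_search(base, queries, k=5):
    """Brute-force inner-product search: score every base vector,
    return the top-k (scores, indices) per query, sorted descending."""
    scores = queries @ base.T                                   # (nq, nb)
    idx = np.argpartition(-scores, k - 1, axis=1)[:, :k]        # top-k, unordered
    order = np.argsort(-np.take_along_axis(scores, idx, axis=1), axis=1)
    top = np.take_along_axis(idx, order, axis=1)
    return np.take_along_axis(scores, top, axis=1), top

rng = np.random.default_rng(0)
base = rng.standard_normal((1000, 384)).astype(np.float32)      # dim=384, as in the demo
q = rng.standard_normal((4, 384)).astype(np.float32)
d, i = flat_ip_search(base, q, k=5)
```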
Reads the compiler's JSON metadata and generates idiomatic wrappers per target:
```sh
ea bind kernel.ea --python   # -> kernel.py (NumPy + ctypes)
ea bind kernel.ea --rust     # -> kernel.rs (FFI + safe wrappers)
ea bind kernel.ea --cpp      # -> kernel.hpp (std::span + extern "C")
ea bind kernel.ea --pytorch  # -> kernel_torch.py (autograd.Function)
ea bind kernel.ea --cmake    # -> CMakeLists.txt + EaCompiler.cmake
```

Pointer args become slices/arrays/tensors. Length params collapse. Types are checked at the boundary. Multiple targets in one invocation: `ea bind kernel.ea --python --rust --cpp`.
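The binder is driven by the compiler's JSON metadata. The actual `kernel.ea.json` schema is not documented here, so the shape below is purely an assumed illustration of the idea: kernel descriptions in, language-specific stubs out.

```python
import json

# Hypothetical metadata shape; the real .ea.json schema may differ.
meta = json.loads("""{
  "kernels": [{"name": "vscale",
               "params": [{"name": "data",   "type": "*f32"},
                          {"name": "result", "type": "*mut f32", "out": true, "cap": "n"},
                          {"name": "factor", "type": "f32"}]}]
}""")

for k in meta["kernels"]:
    # Out params are auto-allocated by the wrapper, so they drop out of the
    # host-language signature; length params collapse the same way.
    ins = [p["name"] for p in k["params"] if not p.get("out")]
    stub = "def {}({}): ...".format(k["name"], ", ".join(ins))
    print(stub)
```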
See what the compiler produced:
```sh
ea kernel.ea --emit-asm   # assembly output
ea kernel.ea --emit-llvm  # LLVM IR
ea kernel.ea --header     # C header
```

```sh
# Requirements: LLVM 18, Rust
sudo apt install llvm-18-dev clang-18 libpolly-18-dev libzstd-dev
cargo build --features=llvm

# Compile + bind + run
ea kernel.ea --lib
ea bind kernel.ea --python
python -c "import kernel; print(kernel.vscale([1.0, 2.0, 3.0], 10.0))"

# Run a demo
cd demo/eastat && python run.py

# Tests (475+ passing)
cargo test --features=llvm
```

f32x4, f32x8, f32x16¹, f64x2, f64x4, i32x4, i32x8, i32x16¹, i8x16, i8x32, u8x16, i16x8, i16x16
load, store, splat, fma, shuffle, select, load_masked, store_masked, gather, scatter¹, prefetch
reduce_add, reduce_max, reduce_min, min, max
maddubs_i16(u8x16, i8x16) -> i16x8 — SSSE3 pmaddubsw, 16 pairs/cycle
maddubs_i32(u8x16, i8x16) -> i32x4 — pmaddubsw+pmaddwd, safe i32 accumulation
widen_u8_f32x4, widen_i8_f32x4, widen_u8_f32x8, widen_i8_f32x8, widen_u8_f32x16¹, widen_i8_f32x16¹, widen_u8_i32x4, widen_u8_i32x8, widen_u8_i32x16¹, narrow_f32x4_i8, sqrt, rsqrt, exp, to_f32, to_i32, to_f64, to_i64
Bitwise: .&, .|, .^, .<<, .>> on integer vectors. Restrict pointers: *restrict T, *mut restrict T.
¹ Requires --avx512
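The semantics of `maddubs_i16` (SSSE3 `pmaddubsw`) are easy to state in NumPy: each adjacent pair of unsigned×signed byte products is summed, with signed saturation to i16. A minimal sketch of those semantics (not the intrinsic's implementation):

```python
import numpy as np

def maddubs_i16(a_u8, b_i8):
    """pmaddubsw semantics: p[j] = a[2j]*b[2j] + a[2j+1]*b[2j+1],
    u8 * i8 products, pairwise-summed, saturated to i16."""
    p = a_u8.astype(np.int32) * b_i8.astype(np.int32)      # 16 widened products
    pairs = p[0::2] + p[1::2]                              # 8 pairwise sums
    return np.clip(pairs, -32768, 32767).astype(np.int16)  # signed saturation

a = np.full(16, 255, dtype=np.uint8)
b = np.full(16, 127, dtype=np.int8)
r = maddubs_i16(a, b)  # 255*127*2 = 64770 saturates to 32767
```

This saturation step is why the doc pairs it with `maddubs_i32` (`pmaddubsw` + `pmaddwd`) when overflow-safe i32 accumulation is needed.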
```
export kernel name(...) over i in n step N tail <strategy> { ... }
```

Tail strategies: `tail scalar { ... }` (scalar fallback), `tail mask { ... }` (masked SIMD), `tail pad` (caller pads input). Output annotations (`out name: *mut T [cap: expr]`) drive auto-allocation in bindings.
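The `over i in n step N` + `tail scalar` loop shape can be mirrored in plain Python: process whole SIMD-width chunks first, then run a scalar pass over the remainder. A sketch (the chunked NumPy slice stands in for the `f32x8` body):

```python
import numpy as np

def vscale_chunked(data, factor, width=8):
    out = np.empty_like(data)
    main = (len(data) // width) * width
    for i in range(0, main, width):       # vector body: full-width lanes
        out[i:i+width] = data[i:i+width] * factor
    for i in range(main, len(data)):      # tail scalar { ... }
        out[i] = data[i] * factor
    return out
```

With `tail mask` the remainder would instead run one masked vector iteration; with `tail pad` the caller guarantees `n` is a multiple of the step.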
Also: for i in 0..n step 8 { ... } counted loops, foreach (i in 0..n) { ... } element-wise loops (LLVM auto-vectorizes at O2+), unroll(N), compile-time const, static_assert, #[cfg(x86_64)] / #[cfg(aarch64)] conditional compilation, C-compatible structs, multi-kernel files, pointer-to-pointer **T parameters.
Fusion eliminates memory round-trips between pipeline stages:
3 kernels (unfused): 8.55 ms — 0.9× (slightly slower, FFI + memory overhead)
1 kernel (fused): 0.07 ms — 111× faster than NumPy
If data leaves registers, you probably ended a kernel too early.
Analysis of when fusion helps and when it hurts: COMPUTE_PATTERNS.md.
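The round-trip cost is visible even inside NumPy, sketched here with three placeholder stages (scale, shift, sqrt; not the pipeline measured above). Each unfused stage materializes a full temporary array; the fused expression is where an Eä kernel would instead keep intermediates in registers for a single pass.

```python
import numpy as np

x = np.random.rand(1 << 20).astype(np.float32)

def unfused(x):
    t1 = x * 2.0         # stage 1: writes a full temporary
    t2 = t1 + 1.0        # stage 2: reads it back, writes another
    return np.sqrt(t2)   # stage 3: reads again, writes the result

def fused(x):
    # NumPy still allocates temporaries here; a fused Eä kernel would
    # evaluate the whole expression in one pass through memory.
    return np.sqrt(x * 2.0 + 1.0)
```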
Explicit over implicit. SIMD width, loop stepping, and memory access are programmer-controlled. No hidden allocations, no auto-vectorizer in the default path, no runtime. Ea is not a general-purpose language — no strings, collections, or modules. It accelerates host languages, it does not replace them.
```
.ea -> Lexer -> Parser -> Desugar -> Type Check -> Codegen (LLVM 18) -> .o / .so
                                                                     -> .ea.json -> ea bind
```
~12,000 lines of Rust. 475+ tests covering SIMD ops, C interop, structs, kernel constructs, tail strategies, binding generation, error suggestions, ARM targets. CI on x86-64, AArch64, Windows.
BENCHMARKS.md — performance tables. CHANGELOG.md — version history. 1.6.md — language specification.
Apache 2.0