DigitalServerHost · westkevin12 · Jun 5, 2026 · Jun 5, 2026
diff --git a/README.md b/README.md
@@ -6,7 +6,7 @@
 [![Tech: Go](https://img.shields.io/badge/Tech-Go_1.20%2B-00ADD8.svg)](#)
 [![Tech: Python](https://img.shields.io/badge/Tech-Python_3.10%2B-blue.svg)](#)
 [![Tech: C](https://img.shields.io/badge/Tech-C11-blue.svg)](#)
-[![Tech: Assembly](https://img.shields.io/badge/Tech-x86--64_Assembly-orange.svg)](#)
+[![Tech: Assembly](https://img.shields.io/badge/Tech-x86--64%20%2F%20ARM64%20Assembly-orange.svg)](#)
 [![GitHub Release](https://img.shields.io/github/v/release/DigitalServerHost/ORCHID?include_prereleases&sort=semver&color=FF69B4)](https://github.com/DigitalServerHost/ORCHID/releases/latest)
 [![GHCR Container](https://img.shields.io/badge/GHCR-Package_Registry-blueviolet.svg?logo=docker&logoColor=white)](https://github.com/DigitalServerHost/ORCHID/pkgs/container/orchid)
 [![Downloads](https://img.shields.io/github/downloads/DigitalServerHost/ORCHID/total?color=blue)](https://github.com/DigitalServerHost/ORCHID/releases)
@@ -29,12 +29,11 @@ Project **ORCHID** is the low-level micro-architectural execution core of the RA
 
 The absolute base foundation, research primitives, and original codebase layout can be found preserved on the legacy archive branch:
 👉 **[View the Baseline Concept Code (`tree/gatchimuchio-original`)](https://github.com/DigitalServerHost/ORCHID/tree/gatchimuchio-original)**
-
 ---
 
 ## 📊 Reproduced Locality Performance
 
-Under identical, mathematically verified logical execution constraints (512x512 matrix size, double-triplicate verification, and total 64 MiB L1-L3 cache flushes between timing runs), the locality-aligned (I-K-J) memory mapping sweeps demonstrate exceptionally high performance improvements. Badges below are dynamically parsed from current timing sweeps:
+Under identical, arithmetically verified execution constraints (512x512 matrix size and double-triplicate verification), the locality-aligned memory mapping sweeps show a consistent speedup over the flat baseline, summarized in the table below. The badges are generated from the current timing sweeps:
 
 | Metric              | Speedup                                                                                                                                                                                                                                       |
 | :------------------ | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -45,11 +44,23 @@ Under identical, mathematically verified logical execution constraints (512x512
 
 > [!NOTE]
 > **Understanding the Speedup Profiles:**
-> - **Physical Cache Locality (C Harness)**: The dynamic badges above measure the hardware execution speedup of cache-blocked locality-aligned loops (matrix multiplication) over flat baselines, yielding **3.6x - 4.0x** actual hardware speedups.
+> - **Physical Cache Locality (C Harness)**: The dynamic badges above measure the hardware execution speedup of cache-blocked locality-aligned loops (matrix multiplication) over flat baselines, yielding measured speedups of roughly **2.9x - 3.4x** with warm caches (see `evidence/reproduced/speedups.json`).
 > - **Parallel Memory Scheduler (Go Simulator)**: The scheduler unit tests (`TestBankedSchedulerTriad`) run a software-simulated queue model (STREAM-Triad) to measure bank serialization and parallel role routing. Because STREAM-Triad partitions requests into 3 distinct logical data streams (B-read, C-read, A-write), mapping them to 3 independent memory banks achieves a theoretical parallel speedup limit of exactly **3.0x** (which the Go scheduler hits at exactly **3.000x** cycle reduction).
 
 ---
 
+## 🖥️ Platform Target Support
+
+Project ORCHID features a **Heterogeneous Hardware Dispatch Plane** that lets the same kernels run on multiple architectures. The assembler (`orchid/assembler.py`) auto-detects the host architecture (or accepts an override via `--target`) and emits assembly for one of the following targets:
+
+- **`x86_64` (AVX-512)**: Standard vectorized loop using 512-bit vector registers with `prefetcht0` software prefetching.
+- **`arm64` (NEON / SVE)**: Vectorized execution using ARM64 NEON registers (`v0-v31`) with `prfm pldl1keep` software lookahead prefetching.
+- **`apple_amx` (Apple Silicon)**: Low-level matrix coprocessor wrapper with custom `amxinit`/`amxstop` instructions (`.word` directives).
+
+At runtime, the benchmarking harness (`locality/fair_harness.c`) queries hardware capabilities (`CPUID` on x86-64, `getauxval(AT_HWCAP)` for ARM64 SVE/ASIMD on Linux) and dispatches to the native assembly kernel when the required vector extension is present, falling back to the scalar C kernel otherwise.
+
+---
+
 ## 🏛️ Centralized Architectural Design & Blueprint
 
 To ensure professional documentation standards and maintain a clean, readable quickstart guide, Project ORCHID's deep technical designs, mathematical formulations, and nested folder blueprints have been centralized:

diff --git a/evidence/reproduced/speedups.json b/evidence/reproduced/speedups.json
@@ -1,6 +1,6 @@
 {
-  "min": "3.047x",
-  "median": "3.156x",
-  "max": "3.241x",
-  "mean": "3.150x"
+  "min": "2.871x",
+  "median": "3.171x",
+  "max": "3.396x",
+  "mean": "3.176x"
 }
diff --git a/locality/fair_harness.c b/locality/fair_harness.c
@@ -17,8 +17,15 @@
 #include <stdlib.h>
 #include <string.h>
 #include <time.h>
+
+#ifdef __x86_64__
 #include <cpuid.h>
-#include <xmmintrin.h>
+#elif defined(__aarch64__)
+#ifdef __linux__
+#include <sys/auxv.h>
+#include <asm/hwcap.h>
+#endif
+#endif
 
 /**
  * @name Configuration Constants
@@ -43,6 +50,7 @@ extern void matmul_flat(const int32_t *a, const int32_t *b, int32_t *c);
  */
 extern void matmul_locality(const int32_t *a, const int32_t *b, int32_t *c);
 
+#ifdef __x86_64__
 /**
  * @brief Dynamic CPUID hardware capability check for AVX-512 foundation support.
  */
@@ -54,11 +62,38 @@ static int has_avx512f(void) {
     __cpuid_count(7, 0, eax, ebx, ecx, edx);
     return (ebx & (1 << 16)) != 0; // AVX-512 Foundation is bit 16 in EBX of CPUID leaf 7, subleaf 0
 }
+#elif defined(__aarch64__)
+/**
+ * @brief Dynamic hardware capability check for ARM64 SVE support.
+ */
+static int has_sve(void) {
+#if defined(__linux__) && defined(HWCAP_SVE)
+    return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
+#else
+    return 0;
+#endif
+}
+
+/**
+ * @brief Dynamic hardware capability check for ARM64 NEON/ASIMD support.
+ */
+static int has_asimd(void) {
+#if defined(__linux__) && defined(HWCAP_ASIMD)
+    return (getauxval(AT_HWCAP) & HWCAP_ASIMD) != 0;
+#else
+    #if defined(__APPLE__)
+    return 1; // Apple Silicon always has NEON/ASIMD
+    #else
+    return 0;
+    #endif
+#endif
+}
+#endif
 
 /**
  * @brief Contiguous Locality-Aligned (I-K-J) fallback kernel in C.
- * Used when the host processor does not support native AVX-512 vector instructions.
- * Implements software cache prefetching via _mm_prefetch compiler intrinsics.
+ * Used when the host processor does not support native vector instructions.
+ * Implements software cache prefetching via GCC/Clang __builtin_prefetch.
  */
 static void matmul_locality_fallback(const int32_t *a, const int32_t *b, int32_t *c) {
     const int lookahead_stride = 16; // Prefetch 16 elements (64 bytes, 1 cache line) ahead
@@ -67,8 +102,8 @@ static void matmul_locality_fallback(const int32_t *a, const int32_t *b, int32_t
             int32_t aik = a[i * N + k];
             for (int j = 0; j < N; ++j) {
                 if (j + lookahead_stride < N) {
-                    _mm_prefetch((const char *)&b[k * N + j + lookahead_stride], _MM_HINT_T0);
-                    _mm_prefetch((const char *)&c[i * N + j + lookahead_stride], _MM_HINT_T0);
+                    __builtin_prefetch(&b[k * N + j + lookahead_stride], 0, 3);
+                    __builtin_prefetch(&c[i * N + j + lookahead_stride], 1, 3);
                 }
                 c[i * N + j] += aik * b[k * N + j];
             }
@@ -170,16 +205,32 @@ int main(void) {
 
     fill(a, b);
 
-    // Detect host AVX-512 capability at runtime
-    int use_avx512 = has_avx512f();
-    if (use_avx512) {
+    // Detect host capabilities at runtime and select appropriate dispatch path
+    int use_vector = 0;
+#ifdef __x86_64__
+    use_vector = has_avx512f();
+    if (use_vector) {
         printf("HARDWARE TELEMETRY: Native AVX-512 support detected. Dispatching to assembly vector kernel.\n");
     } else {
         printf("HARDWARE TELEMETRY: AVX-512 not supported. Dispatching to optimized scalar fallback kernel.\n");
     }
+#elif defined(__aarch64__)
+    use_vector = has_sve() || has_asimd();
+    if (use_vector) {
+        if (has_sve()) {
+            printf("HARDWARE TELEMETRY: Native ARM64 SVE support detected. Dispatching to assembly vector kernel.\n");
+        } else {
+            printf("HARDWARE TELEMETRY: Native ARM64 NEON/ASIMD support detected. Dispatching to assembly vector kernel.\n");
+        }
+    } else {
+        printf("HARDWARE TELEMETRY: ARM64 vector extensions not supported. Dispatching to optimized scalar fallback kernel.\n");
+    }
+#else
+    printf("HARDWARE TELEMETRY: Unsupported architecture. Dispatching to optimized scalar fallback kernel.\n");
+#endif
 
     void (*locality_kernel)(const int32_t*, const int32_t*, int32_t*) = 
-        use_avx512 ? matmul_locality : matmul_locality_fallback;
+        use_vector ? matmul_locality : matmul_locality_fallback;
 
     // Initial warm run & arithmetic validation check
     memset(cf, 0, BYTES);