19 changes: 15 additions & 4 deletions README.md
@@ -6,7 +6,7 @@
[![Tech: Go](https://img.shields.io/badge/Tech-Go_1.20%2B-00ADD8.svg)](#)
[![Tech: Python](https://img.shields.io/badge/Tech-Python_3.10%2B-blue.svg)](#)
[![Tech: C](https://img.shields.io/badge/Tech-C11-blue.svg)](#)
-[![Tech: Assembly](https://img.shields.io/badge/Tech-x86--64_Assembly-orange.svg)](#)
+[![Tech: Assembly](https://img.shields.io/badge/Tech-x86--64%20%2F%20ARM64%20Assembly-orange.svg)](#)
[![GitHub Release](https://img.shields.io/github/v/release/DigitalServerHost/ORCHID?include_prereleases&sort=semver&color=FF69B4)](https://github.com/DigitalServerHost/ORCHID/releases/latest)
[![GHCR Container](https://img.shields.io/badge/GHCR-Package_Registry-blueviolet.svg?logo=docker&logoColor=white)](https://github.com/DigitalServerHost/ORCHID/pkgs/container/orchid)
[![Downloads](https://img.shields.io/github/downloads/DigitalServerHost/ORCHID/total?color=blue)](https://github.com/DigitalServerHost/ORCHID/releases)
@@ -29,12 +29,11 @@ Project **ORCHID** is the low-level micro-architectural execution core of the RA

The absolute base foundation, research primitives, and original codebase layout can be found preserved on the legacy archive branch:
👉 **[View the Baseline Concept Code (`tree/gatchimuchio-original`)](https://github.com/DigitalServerHost/ORCHID/tree/gatchimuchio-original)**

---

## 📊 Reproduced Locality Performance

-Under identical, mathematically verified logical execution constraints (512x512 matrix size, double-triplicate verification, and total 64 MiB L1-L3 cache flushes between timing runs), the locality-aligned (I-K-J) memory mapping sweeps demonstrate exceptionally high performance improvements. Badges below are dynamically parsed from current timing sweeps:
+Under identical, mathematically verified logical execution constraints (512x512 matrix size and double-triplicate verification), the locality-aligned memory mapping sweeps run roughly 3x faster than the flat baseline (median 3.171x in `evidence/reproduced/speedups.json`). The badges below are parsed dynamically from the current timing sweeps:

| Metric | Speedup |
| :------------------ | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -45,11 +44,23 @@ Under identical, mathematically verified logical execution constraints (512x512

> [!NOTE]
> **Understanding the Speedup Profiles:**
-> - **Physical Cache Locality (C Harness)**: The dynamic badges above measure the hardware execution speedup of cache-blocked locality-aligned loops (matrix multiplication) over flat baselines, yielding **3.6x - 4.0x** actual hardware speedups.
+> - **Physical Cache Locality (C Harness)**: The dynamic badges above measure the hardware execution speedup of cache-blocked locality-aligned loops (matrix multiplication) over flat baselines, yielding **2.9x - 3.4x** measured speedups on warm cache lines (2.871x minimum, 3.396x maximum in `evidence/reproduced/speedups.json`).
> - **Parallel Memory Scheduler (Go Simulator)**: The scheduler unit tests (`TestBankedSchedulerTriad`) run a software-simulated queue model (STREAM-Triad) to measure bank serialization and parallel role routing. Because STREAM-Triad partitions requests into 3 distinct logical data streams (B-read, C-read, A-write), mapping them to 3 independent memory banks achieves a theoretical parallel speedup limit of exactly **3.0x** (which the Go scheduler hits at exactly **3.000x** cycle reduction).
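
The 3.0x ceiling quoted for the STREAM-Triad scheduler test follows from a counting argument: three request streams that share one bank are serialized, while three streams on three independent banks drain in parallel. The sketch below is a toy C model of that argument, not the repository's Go simulator. The names `triad_cycles`, `triad_speedup`, and `MAX_BANKS` are hypothetical, and the model assumes each bank retires one request per cycle with stream `s` pinned to bank `s % banks`.

```c
#include <assert.h>
#include <stdint.h>

enum { TRIAD_STREAMS = 3, MAX_BANKS = 8 }; /* streams: B-read, C-read, A-write */

/* Toy queue model (an assumption, not the ORCHID scheduler): every bank
 * retires one request per cycle, and stream s always targets bank s % banks.
 * Returns the number of cycles needed to drain all requests. */
uint64_t triad_cycles(uint64_t iterations, unsigned banks) {
    uint64_t queued[MAX_BANKS] = {0};
    assert(banks >= 1 && banks <= MAX_BANKS);
    for (unsigned s = 0; s < TRIAD_STREAMS; ++s)
        queued[s % banks] += iterations; /* one request per stream per iteration */
    uint64_t worst = 0;
    for (unsigned b = 0; b < banks; ++b)
        if (queued[b] > worst)
            worst = queued[b]; /* the busiest bank bounds the total time */
    return worst;
}

/* Speedup of a banked layout relative to a single shared bank. */
double triad_speedup(uint64_t iterations, unsigned banks) {
    uint64_t parallel = triad_cycles(iterations, banks);
    if (parallel == 0)
        return 0.0;
    return (double)triad_cycles(iterations, 1) / (double)parallel;
}
```

In this model, `triad_speedup(n, 3)` is exactly 3.0 for any nonzero `n`, and adding a fourth bank does not help because only three streams exist. With two banks, two streams collide on one bank and the speedup drops to 1.5.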

---

+## 🖥️ Platform Target Support
+
+Project ORCHID features a **Heterogeneous Hardware Dispatch Plane** that selects a native kernel per architecture. The assembler (`orchid/assembler.py`) auto-detects the host architecture (or accepts an override via `--target`) and emits assembly for one of the following targets:
+
+- **`x86_64` (AVX-512)**: Vectorized loop using 512-bit vector registers with `prefetcht0` software prefetching.
+- **`arm64` (NEON / SVE)**: Vectorized execution using ARM64 NEON registers (`v0-v31`) with `prfm pldl1keep` software lookahead prefetching.
+- **`apple_amx` (Apple Silicon)**: Low-level matrix coprocessor wrapper with custom `amxinit`/`amxstop` instructions encoded as `.word` directives.
+
+At runtime, the benchmarking harness (`locality/fair_harness.c`) probes hardware capabilities (`CPUID` on x86-64, `getauxval(AT_HWCAP)` for ARM64 SVE/ASIMD on Linux) and dispatches to the native assembly kernel, falling back to a scalar C kernel when no supported vector extension is detected.
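
The probe-then-dispatch pattern described for the harness can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not the code in `locality/fair_harness.c`: `kernel_scalar`, `kernel_vector`, `host_has_vector`, `select_kernel`, and `KDIM` are hypothetical names, the "vector" kernel is only a placeholder that calls the scalar one, and the x86-64 probe checks the CPUID feature bit only (a production check would also confirm that the OS saves AVX-512 register state).

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <cpuid.h>
#elif defined(__aarch64__) && defined(__linux__)
#include <asm/hwcap.h>
#include <sys/auxv.h>
#endif

enum { KDIM = 4 }; /* hypothetical tiny matrix size for the sketch */

/* Same shape as the harness's matmul_* declarations. */
typedef void (*matmul_fn)(const int32_t *a, const int32_t *b, int32_t *c);

/* Scalar I-K-J kernel used when no vector extension is available. */
void kernel_scalar(const int32_t *a, const int32_t *b, int32_t *c) {
    assert(a && b && c);
    for (int i = 0; i < KDIM; ++i)
        for (int k = 0; k < KDIM; ++k)
            for (int j = 0; j < KDIM; ++j)
                c[i * KDIM + j] += a[i * KDIM + k] * b[k * KDIM + j];
}

/* Placeholder for a native assembly kernel; here it reuses the scalar path. */
void kernel_vector(const int32_t *a, const int32_t *b, int32_t *c) {
    kernel_scalar(a, b, c);
}

/* Returns 1 when the host advertises the vector extension of interest. */
int host_has_vector(void) {
#if defined(__x86_64__)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;                         /* CPUID leaf 7 not available */
    return (int)((ebx >> 16) & 1u);       /* EBX bit 16: AVX-512 Foundation */
#elif defined(__aarch64__) && defined(__linux__) && defined(HWCAP_ASIMD)
    return (getauxval(AT_HWCAP) & HWCAP_ASIMD) != 0;
#else
    return 0;                             /* unknown platform: stay scalar */
#endif
}

/* Chooses the kernel once so the timing loop pays no per-call branch. */
matmul_fn select_kernel(int has_vector) {
    return has_vector ? kernel_vector : kernel_scalar;
}
```

A caller would write `matmul_fn k = select_kernel(host_has_vector());` once before the timing loop and then invoke `k(a, b, c)`. Taking the capability flag as a parameter keeps `select_kernel` testable on any host.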
+
+---
+
## 🏛️ Centralized Architectural Design & Blueprint

To ensure professional documentation standards and maintain a clean, readable quickstart guide, Project ORCHID's deep technical designs, mathematical formulations, and nested folder blueprints have been centralized:
8 changes: 4 additions & 4 deletions evidence/reproduced/speedups.json
@@ -1,6 +1,6 @@
{
-  "min": "3.047x",
-  "median": "3.156x",
-  "max": "3.241x",
-  "mean": "3.150x"
+  "min": "2.871x",
+  "median": "3.171x",
+  "max": "3.396x",
+  "mean": "3.176x"
}
69 changes: 60 additions & 9 deletions locality/fair_harness.c
@@ -17,8 +17,15 @@
#include <stdlib.h>
#include <string.h>
#include <time.h>

+#ifdef __x86_64__
 #include <cpuid.h>
 #include <xmmintrin.h>
+#elif defined(__aarch64__)
+#ifdef __linux__
+#include <sys/auxv.h>
+#include <asm/hwcap.h>
+#endif
+#endif

/**
* @name Configuration Constants
@@ -43,6 +50,7 @@ extern void matmul_flat(const int32_t *a, const int32_t *b, int32_t *c);
  */
 extern void matmul_locality(const int32_t *a, const int32_t *b, int32_t *c);

+#ifdef __x86_64__
/**
* @brief Dynamic CPUID hardware capability check for AVX-512 foundation support.
*/
@@ -54,11 +62,38 @@ static int has_avx512f(void) {
     __cpuid_count(7, 0, eax, ebx, ecx, edx);
     return (ebx & (1 << 16)) != 0; // AVX-512 Foundation is bit 16 in EBX of CPUID leaf 7, subleaf 0
 }
+#elif defined(__aarch64__)
+/**
+ * @brief Dynamic hardware capability check for ARM64 SVE support.
+ */
+static int has_sve(void) {
+#if defined(__linux__) && defined(HWCAP_SVE)
+    return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
+#else
+    return 0;
+#endif
+}
+
+/**
+ * @brief Dynamic hardware capability check for ARM64 NEON/ASIMD support.
+ */
+static int has_asimd(void) {
+#if defined(__linux__) && defined(HWCAP_ASIMD)
+    return (getauxval(AT_HWCAP) & HWCAP_ASIMD) != 0;
+#else
+#if defined(__APPLE__)
+    return 1; // Apple Silicon always has NEON/ASIMD
+#else
+    return 0;
+#endif
+#endif
+}
+#endif

 /**
  * @brief Contiguous Locality-Aligned (I-K-J) fallback kernel in C.
- * Used when the host processor does not support native AVX-512 vector instructions.
- * Implements software cache prefetching via _mm_prefetch compiler intrinsics.
+ * Used when the host processor does not support native vector instructions.
+ * Implements software cache prefetching via GCC/Clang __builtin_prefetch.
  */
 static void matmul_locality_fallback(const int32_t *a, const int32_t *b, int32_t *c) {
     const int lookahead_stride = 16; // Prefetch 16 elements (64 bytes, 1 cache line) ahead
@@ -67,8 +102,8 @@ static void matmul_locality_fallback(const int32_t *a, const int32_t *b, int32_t
             int32_t aik = a[i * N + k];
             for (int j = 0; j < N; ++j) {
                 if (j + lookahead_stride < N) {
-                    _mm_prefetch((const char *)&b[k * N + j + lookahead_stride], _MM_HINT_T0);
-                    _mm_prefetch((const char *)&c[i * N + j + lookahead_stride], _MM_HINT_T0);
+                    __builtin_prefetch(&b[k * N + j + lookahead_stride], 0, 3);
+                    __builtin_prefetch(&c[i * N + j + lookahead_stride], 1, 3);
                 }
                 c[i * N + j] += aik * b[k * N + j];
             }
@@ -170,16 +205,32 @@ int main(void) {

     fill(a, b);

-    // Detect host AVX-512 capability at runtime
-    int use_avx512 = has_avx512f();
-    if (use_avx512) {
+    // Detect host capabilities at runtime and select appropriate dispatch path
+    int use_vector = 0;
+#ifdef __x86_64__
+    use_vector = has_avx512f();
+    if (use_vector) {
         printf("HARDWARE TELEMETRY: Native AVX-512 support detected. Dispatching to assembly vector kernel.\n");
     } else {
         printf("HARDWARE TELEMETRY: AVX-512 not supported. Dispatching to optimized scalar fallback kernel.\n");
     }
+#elif defined(__aarch64__)
+    use_vector = has_sve() || has_asimd();
+    if (use_vector) {
+        if (has_sve()) {
+            printf("HARDWARE TELEMETRY: Native ARM64 SVE support detected. Dispatching to assembly vector kernel.\n");
+        } else {
+            printf("HARDWARE TELEMETRY: Native ARM64 NEON/ASIMD support detected. Dispatching to assembly vector kernel.\n");
+        }
+    } else {
+        printf("HARDWARE TELEMETRY: ARM64 Vector extensions not supported. Dispatching to optimized scalar fallback kernel.\n");
+    }
+#else
+    printf("HARDWARE TELEMETRY: Unsupported architecture. Dispatching to optimized scalar fallback kernel.\n");
+#endif

     void (*locality_kernel)(const int32_t*, const int32_t*, int32_t*) =
-        use_avx512 ? matmul_locality : matmul_locality_fallback;
+        use_vector ? matmul_locality : matmul_locality_fallback;

     // Initial warm run & arithmetic validation check
     memset(cf, 0, BYTES);
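
The arithmetic validation step matters because the flat and locality-aligned kernels visit the same multiply-accumulate terms in a different order, so their outputs should match exactly for integer inputs that do not overflow. The sketch below illustrates that check on a small matrix. It is an assumption-laden illustration rather than the harness's own validation code: `matmul_ijk`, `matmul_ikj`, `loop_orders_agree`, and `MAX_N` are hypothetical names, and the size is passed as a parameter instead of the harness's fixed `N`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Flat I-J-K order: the inner loop walks b down a column (stride n). */
void matmul_ijk(const int32_t *a, const int32_t *b, int32_t *c, int n) {
    assert(a != c && b != c); /* output must not alias the inputs */
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* I-K-J order: the inner loop walks b and c along rows (stride 1). */
void matmul_ikj(const int32_t *a, const int32_t *b, int32_t *c, int n) {
    assert(a != c && b != c);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k) {
            int32_t aik = a[i * n + k];
            for (int j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
}

/* Returns 1 when both loop orders give identical n x n results.
 * Assumes inputs small enough that no int32_t sum overflows. */
int loop_orders_agree(const int32_t *a, const int32_t *b, int n) {
    enum { MAX_N = 16 };
    int32_t c_flat[MAX_N * MAX_N];
    int32_t c_loc[MAX_N * MAX_N];
    if (n <= 0 || n > MAX_N)
        return 0;
    size_t bytes = (size_t)n * (size_t)n * sizeof(int32_t);
    memset(c_flat, 0, bytes); /* both kernels accumulate into c */
    memset(c_loc, 0, bytes);
    matmul_ijk(a, b, c_flat, n);
    matmul_ikj(a, b, c_loc, n);
    return memcmp(c_flat, c_loc, bytes) == 0;
}
```

Zeroing the output buffers before each run, as the `memset(cf, 0, BYTES)` call in the harness does, is required here because both kernels accumulate with `+=` rather than overwrite.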