feat: Heterogeneous Hardware Dispatch Plane by westkevin12 · Pull Request #14 · DigitalServerHost/ORCHID

westkevin12 · 2026-06-05T17:15:48Z

Description

This PR closes #5 by implementing a dynamic, multi-platform Heterogeneous Hardware Dispatch Plane for Project ORCHID.

Instead of locking compile targets and dynamic telemetry to x86-64 microarchitectures with AVX-512 extensions, this change expands the compilation pipeline to target ARM64 and Apple Silicon as well. It also extends the dynamic hardware telemetry with feature detection for the ARM64 SVE and NEON instruction sets.

Proposed Changes

1. Emitter Target Refactoring (`orchid/assembler.py`)

Refactored the core Python assembly generator to emit target-specific vector modules:
- x86-64: AVX-512 Foundation vectorized execution loop, using prefetcht0 to mask memory latency.
- ARM64: NEON vector registers (v0-v31), using prfm pldl1keep software prefetching at lookahead offsets.
- Apple Silicon (AMX): Proprietary Apple Matrix Coprocessor instructions (emitted via custom .word 0x00201000 / 0x00201020 directives for coprocessor startup/shutdown constraints).
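As a rough illustration of how the x86-64 and ARM64 prefetch forms listed above differ, the sketch below maps a target name to one prefetch line. The names PREFETCH_TEMPLATES and prefetch_line, and the registers and offsets used, are hypothetical and not taken from orchid/assembler.py. The x86-64 line assumes AT&T syntax, and the AMX .word directives are deliberately left out.

```python
# Hypothetical per-target prefetch templates. The real emitter in
# orchid/assembler.py may be structured quite differently.
PREFETCH_TEMPLATES = {
    # x86-64, AT&T operand order (assumed): displacement(base register).
    "x86_64": "prefetcht0 {offset}({base})",
    # AArch64: prefetch for load, into L1, with the "keep" (temporal) policy.
    # The immediate form needs an offset that is a multiple of 8.
    "arm64": "prfm pldl1keep, [{base}, #{offset}]",
}


def prefetch_line(target: str, base: str, offset: int) -> str:
    """Return one indented assembly line prefetching `offset` bytes past `base`."""
    try:
        template = PREFETCH_TEMPLATES[target]
    except KeyError:
        raise ValueError(f"no prefetch template for target {target!r}") from None
    return "    " + template.format(base=base, offset=offset)


if __name__ == "__main__":
    print(prefetch_line("x86_64", "%rsi", 256))
    print(prefetch_line("arm64", "x1", 256))
```

Keeping the templates in one table means the lookahead distance can stay a single parameter shared by every target.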
Added target selector CLI argument (--target supporting x86_64, arm64, and apple_amx).
Implemented automatic platform detection using Python's platform.machine() and sys.platform to determine the native host target by default.
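A minimal sketch of the selector and the default detection described above. The plan argument, --out-dir, --target, and the three target names come from this PR. The helper names, the --out-dir default, and the choice to map ARM64 on macOS to apple_amx are assumptions for illustration only.

```python
import argparse
import platform
import sys

TARGETS = ("x86_64", "arm64", "apple_amx")


def detect_native_target(machine=None, os_name=None):
    """Guess the native target from platform.machine() and sys.platform."""
    machine = (machine or platform.machine()).lower()
    os_name = os_name or sys.platform
    if machine in ("x86_64", "amd64"):
        return "x86_64"
    if machine in ("arm64", "aarch64"):
        # Assumption: ARM64 on macOS is Apple Silicon, so default to AMX.
        return "apple_amx" if os_name == "darwin" else "arm64"
    raise ValueError(f"unsupported host machine: {machine!r}")


def build_parser():
    parser = argparse.ArgumentParser(prog="orchid.assembler")
    parser.add_argument("plan", help="path to the .plan file")
    parser.add_argument("--out-dir", default="build")  # default is assumed
    parser.add_argument("--target", choices=TARGETS, default=None,
                        help="defaults to the detected native host target")
    return parser


def resolve_target(argv):
    """An explicit --target wins; otherwise fall back to host detection."""
    args = build_parser().parse_args(argv)
    return args.target or detect_native_target()


if __name__ == "__main__":
    print(resolve_target(["locality/matmul.plan", "--target", "arm64"]))
```

Leaving the --target default as None and resolving it after parsing keeps host detection from running when the user names a target explicitly.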

2. Multi-Platform Dynamic Telemetry (`locality/fair_harness.c`)

Replaced the x86-exclusive CPUID check with platform-conditional detection blocks:
- x86-64: Queries CPUID leaves to check for native AVX-512 Foundation (AVX-512F) support.
- ARM64 (Linux): Includes <sys/auxv.h> and queries system auxiliary vector flags (getauxval(AT_HWCAP)) for HWCAP_SVE and HWCAP_ASIMD support.
- ARM64 (macOS): Assumes standard NEON/ASIMD capability without a runtime query.
Replaced the x86-specific _mm_prefetch intrinsic in the C scalar fallback with the architecture-independent __builtin_prefetch builtin, so the fallback compiles with both GCC and Clang on all targets.
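The harness performs the ARM64 Linux check in C through getauxval(AT_HWCAP). As a cross-check from Python, the same bitmask can be read from /proc/self/auxv, as sketched below. The function names are hypothetical, and the constants are the Linux arm64 values (AT_HWCAP = 16, HWCAP_ASIMD = bit 1, HWCAP_SVE = bit 22), so the decoded flags are only meaningful on that platform.

```python
import struct

AT_NULL = 0               # terminator entry of the auxiliary vector
AT_HWCAP = 16             # key of the hardware capability bitmask
HWCAP_ASIMD = 1 << 1      # Linux arm64: Advanced SIMD (NEON)
HWCAP_SVE = 1 << 22       # Linux arm64: Scalable Vector Extension

# Each auxv entry is a (key, value) pair of unsigned longs.
# Assumption: a 64-bit process, so both fields are 8 bytes wide.
_ENTRY = struct.Struct("=QQ")


def hwcap_from_auxv(raw: bytes) -> int:
    """Return the AT_HWCAP value found in raw auxv bytes, or 0 if absent."""
    for offset in range(0, len(raw) - _ENTRY.size + 1, _ENTRY.size):
        key, value = _ENTRY.unpack_from(raw, offset)
        if key == AT_NULL:
            break
        if key == AT_HWCAP:
            return value
    return 0


def arm64_features(hwcap: int) -> dict:
    """Decode the two capability bits this PR's telemetry checks."""
    return {"asimd": bool(hwcap & HWCAP_ASIMD), "sve": bool(hwcap & HWCAP_SVE)}


def read_hwcap(path: str = "/proc/self/auxv") -> int:
    """Read the current process's AT_HWCAP (Linux only)."""
    with open(path, "rb") as fh:
        return hwcap_from_auxv(fh.read())
```

On an ARM64 Linux host, arm64_features(read_hwcap()) should agree with what the C harness reports.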

Verification Results

1. Timing Benchmarks (`./scripts/run_locality.sh`)

Verification passes on the native host (x86_64). This host does not support AVX-512, so the run takes the scalar fallback path, which uses the portable __builtin_prefetch builtin:

$ ./scripts/run_locality.sh
Step 1: Parsing matmul.plan and generating raw x86-64 assembly...
EMITTED Assembly Modules target=x86_64 size=512 flat.S locality.S to /home/west/github.com/westkevin12/RAMNET/ORCHID/locality/build
Step 2: Compiling fair_harness.c and generated assembly kernels...
Step 3: Running benchmark timing harness (alternating loop patterns)...
HARDWARE TELEMETRY: AVX-512 not supported. Dispatching to optimized scalar fallback kernel.
VERIFY equal N=512 operations=134217728
PAIR 1 order=flat-first flat_sec=0.239710875 locality_sec=0.070579891 speedup=3.396x
PAIR 2 order=locality-first flat_sec=0.220098070 locality_sec=0.067882409 speedup=3.242x
...
speedup_min=2.871x
speedup_median=3.171x
speedup_max=3.396x
speedup_mean=3.176x
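The per-pair speedup is flat_sec divided by locality_sec, and the operation count is N cubed. The sketch below recomputes both from the two PAIR lines shown above. The regular expression and function names are hypothetical. The summary statistics cannot be reproduced here because the remaining pairs are elided from the log.

```python
import re
import statistics

PAIR_RE = re.compile(
    r"flat_sec=(?P<flat>[0-9.]+)\s+locality_sec=(?P<loc>[0-9.]+)"
)

LOG = """\
PAIR 1 order=flat-first flat_sec=0.239710875 locality_sec=0.070579891 speedup=3.396x
PAIR 2 order=locality-first flat_sec=0.220098070 locality_sec=0.067882409 speedup=3.242x
"""


def speedups(log_text):
    """Recompute flat/locality speedups from the PAIR lines of a harness log."""
    return [float(m["flat"]) / float(m["loc"]) for m in PAIR_RE.finditer(log_text)]


def summarize(values):
    """Min, median, max, and mean, matching the summary fields in the log."""
    return {
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }


if __name__ == "__main__":
    for value in speedups(LOG):
        print(f"{value:.3f}x")   # prints 3.396x, then 3.242x
    print(512 ** 3)              # prints 134217728, the operations= field
```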

2. Multi-Target Compilation Assertions

Emitting assembly for the ARM64 and Apple AMX target configurations succeeds:

# ARM64 Emitter
$ python3 -m orchid.assembler locality/matmul.plan --out-dir locality/build --target arm64
EMITTED Assembly Modules target=arm64 size=512 flat.S locality.S to locality/build

# Apple AMX Emitter
$ python3 -m orchid.assembler locality/matmul.plan --out-dir locality/build --target apple_amx
EMITTED Assembly Modules target=apple_amx size=512 flat.S locality.S to locality/build

3. Go Scheduler Regression Stability

Running the scheduler tests under the Go race detector shows no concurrency regressions or thread-safety issues in the Go plane:

$ go test -race -v ./scheduler/...
=== RUN   TestBankedSchedulerTriad
    scheduler_test.go:129: VERIFY: Mathematical calculations are 100% identical!
    scheduler_test.go:130: Deterministic Serial Cycles: 4915200
    scheduler_test.go:131: Deterministic Parallel Cycles: 1638601
    scheduler_test.go:132: Theoretical Parallel Speedup achieved in Go: 3.000x
--- PASS: TestBankedSchedulerTriad (0.89s)
=== RUN   TestPhysicalNUMAAllocation
--- PASS: TestPhysicalNUMAAllocation (0.09s)
PASS
ok  	ORCHID/scheduler	2.059s
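For reference, the reported 3.000x is the ratio of the two cycle counts rounded to three decimals. The arithmetic can be checked as below; the function name is hypothetical. The parallel count is exactly one third of the serial count plus 201 cycles, which the log does not explain.

```python
def cycle_speedup(serial_cycles: int, parallel_cycles: int) -> float:
    """Ratio of serial to parallel cycle counts."""
    return serial_cycles / parallel_cycles


SERIAL = 4915200      # "Deterministic Serial Cycles" from the test log
PARALLEL = 1638601    # "Deterministic Parallel Cycles" from the test log

if __name__ == "__main__":
    print(f"{cycle_speedup(SERIAL, PARALLEL):.3f}x")  # prints 3.000x
    print(PARALLEL - SERIAL // 3)                     # prints 201
```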

feat: add ARM64 and Apple AMX assembly generation support to the assembler

ef93b77

westkevin12 added the patch label Jun 5, 2026

westkevin12 self-assigned this Jun 5, 2026

westkevin12 merged commit 7a3819f into main Jun 5, 2026
2 checks passed

westkevin12 merged 1 commit into main from Heterogeneous_Hardware_Dispatch_Plane
