
feat: Heterogeneous Hardware Dispatch Plane#14

Merged: westkevin12 merged 1 commit into main from Heterogeneous_Hardware_Dispatch_Plane on Jun 5, 2026


@westkevin12
Member

Description

This PR closes #5 by implementing a dynamic, multi-platform Heterogeneous Hardware Dispatch Plane for Project ORCHID.

Previously, both the compile targets and the runtime telemetry were tied to x86-64 microarchitectures with AVX-512 extensions. This change extends the compilation pipeline to target ARM64 and Apple Silicon, and adds feature detection for the ARM64 SVE and NEON instruction sets to the runtime hardware telemetry.


Proposed Changes

1. Emitter Target Refactoring (orchid/assembler.py)

  • Refactored the core Python assembly generator to emit target-specific vector modules:
    • x86-64: Standard AVX-512 Foundation vectorized execution loop, with prefetcht0 used to hide memory latency.
    • ARM64: 128-bit NEON vector registers (v0-v31), with prfm pldl1keep software prefetches issued at lookahead offsets.
    • Apple Silicon (AMX): Proprietary Apple Matrix Coprocessor instructions, emitted via custom .word 0x00201000 / 0x00201020 directives that bracket coprocessor startup and shutdown.
  • Added a target selector CLI argument (--target, accepting x86_64, arm64, and apple_amx).
  • When --target is omitted, the native host target is detected automatically from Python's platform.machine() and sys.platform.
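
The target selection described above could look roughly like the following. This is a minimal sketch, not the code in orchid/assembler.py: the helper names detect_native_target and resolve_target are hypothetical, the --out-dir default is assumed, and mapping ARM64 macOS hosts to apple_amx is an assumption about the default policy.

```python
import argparse
import platform
import sys

# Target names taken from the --target option described in this PR.
SUPPORTED_TARGETS = ("x86_64", "arm64", "apple_amx")


def detect_native_target(machine=None, sys_platform=None):
    """Map the host to a target name (hypothetical helper, not from the PR)."""
    machine = (machine or platform.machine()).lower()
    sys_platform = sys_platform or sys.platform
    if machine in ("x86_64", "amd64"):
        return "x86_64"
    if machine in ("arm64", "aarch64"):
        # Assumption: ARM64 macOS hosts default to the AMX emitter.
        return "apple_amx" if sys_platform == "darwin" else "arm64"
    raise ValueError(f"unsupported host architecture: {machine!r}")


def resolve_target(argv):
    """Return the explicit --target if given, else the detected host target."""
    parser = argparse.ArgumentParser(prog="orchid.assembler")
    parser.add_argument("plan")
    parser.add_argument("--out-dir", default="build")  # default is assumed
    parser.add_argument("--target", choices=SUPPORTED_TARGETS, default=None)
    args = parser.parse_args(argv)
    return args.target or detect_native_target()
```

Passing machine and sys_platform explicitly keeps the detection logic testable on a single host. An explicit --target always wins over detection.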

2. Multi-Platform Dynamic Telemetry (locality/fair_harness.c)

  • Replaced the x86-exclusive CPUID check with platform-conditional detection blocks:
    • x86-64: Queries CPUID leaves to check for native AVX-512 Foundation support.
    • ARM64 (Linux): Includes <sys/auxv.h> and queries the auxiliary vector (getauxval(AT_HWCAP)) for the HWCAP_SVE and HWCAP_ASIMD flags.
    • ARM64 (macOS): Assumes standard NEON/ASIMD capability without probing.
  • Replaced the x86-specific SSE _mm_prefetch intrinsic in the C scalar fallback with the architecture-independent __builtin_prefetch builtin, so the fallback compiles with both GCC and Clang on all targets.
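
For reviewers unfamiliar with the ARM64 Linux path, the bit test behind the HWCAP check can be illustrated in Python. This is only a sketch of the flag decoding, not the C code in fair_harness.c. It assumes the Linux arm64 bit positions for HWCAP_ASIMD (bit 1) and HWCAP_SVE (bit 22), and takes the value that getauxval(AT_HWCAP) would return as a plain integer. The function name decode_arm64_hwcap is hypothetical.

```python
# Assumed bit positions from the Linux arm64 hwcap definitions.
HWCAP_ASIMD = 1 << 1
HWCAP_SVE = 1 << 22


def decode_arm64_hwcap(hwcap):
    """Decode an AT_HWCAP bitmask into the two flags this PR checks."""
    return {
        "asimd": bool(hwcap & HWCAP_ASIMD),
        "sve": bool(hwcap & HWCAP_SVE),
    }


# Example: a core that reports NEON/ASIMD but not SVE.
features = decode_arm64_hwcap(HWCAP_ASIMD)
```

Each flag is tested independently because the two bits are reported separately in the bitmask.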

Verification Results

1. Timing Benchmarks (./scripts/run_locality.sh)

Verification passes on the native host (x86_64) via the scalar fallback path, which uses the portable __builtin_prefetch builtin:

$ ./scripts/run_locality.sh
Step 1: Parsing matmul.plan and generating raw x86-64 assembly...
EMITTED Assembly Modules target=x86_64 size=512 flat.S locality.S to /home/west/github.com/westkevin12/RAMNET/ORCHID/locality/build
Step 2: Compiling fair_harness.c and generated assembly kernels...
Step 3: Running benchmark timing harness (alternating loop patterns)...
HARDWARE TELEMETRY: AVX-512 not supported. Dispatching to optimized scalar fallback kernel.
VERIFY equal N=512 operations=134217728
PAIR 1 order=flat-first flat_sec=0.239710875 locality_sec=0.070579891 speedup=3.396x
PAIR 2 order=locality-first flat_sec=0.220098070 locality_sec=0.067882409 speedup=3.242x
...
speedup_min=2.871x
speedup_median=3.171x
speedup_max=3.396x
speedup_mean=3.176x
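
As a sanity check on the log above, the per-pair speedups can be recomputed from the printed timings. The sketch below assumes speedup is flat_sec divided by locality_sec, which is consistent with the numbers shown. It also notes that the reported operation count equals N cubed for N=512. The variable names are illustrative and not taken from the harness.

```python
def speedup(flat_sec, locality_sec):
    """Speedup of the locality kernel over the flat kernel (assumed definition)."""
    return flat_sec / locality_sec


# Timings copied from PAIR 1 and PAIR 2 in the log above.
PAIRS = [
    (0.239710875, 0.070579891),
    (0.220098070, 0.067882409),
]

lines = [
    f"PAIR {i} speedup={speedup(flat, loc):.3f}x"
    for i, (flat, loc) in enumerate(PAIRS, start=1)
]

# The VERIFY line reports operations=134217728, which matches N**3 for N=512.
N = 512
operations = N ** 3

for line in lines:
    print(line)
print(f"operations={operations}")
```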

2. Multi-Target Compilation Assertions

Emitting assembly for each non-native target configuration succeeds:

# ARM64 Emitter
$ python3 -m orchid.assembler locality/matmul.plan --out-dir locality/build --target arm64
EMITTED Assembly Modules target=arm64 size=512 flat.S locality.S to locality/build

# Apple AMX Emitter
$ python3 -m orchid.assembler locality/matmul.plan --out-dir locality/build --target apple_amx
EMITTED Assembly Modules target=apple_amx size=512 flat.S locality.S to locality/build

3. Go Scheduler Regression Stability

The Go race detector reports no concurrency regressions or thread-safety issues in the scheduler plane:

$ go test -race -v ./scheduler/...
=== RUN   TestBankedSchedulerTriad
    scheduler_test.go:129: VERIFY: Mathematical calculations are 100% identical!
    scheduler_test.go:130: Deterministic Serial Cycles: 4915200
    scheduler_test.go:131: Deterministic Parallel Cycles: 1638601
    scheduler_test.go:132: Theoretical Parallel Speedup achieved in Go: 3.000x
--- PASS: TestBankedSchedulerTriad (0.89s)
=== RUN   TestPhysicalNUMAAllocation
--- PASS: TestPhysicalNUMAAllocation (0.09s)
PASS
ok  	ORCHID/scheduler	2.059s
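
Note that the 3.000x figure in the test output is a rounded ratio of the two cycle counts. A quick check in Python, using only the numbers printed above, shows the exact ratio sits just under 3. The comparison against an even three-way split of the serial cycles is an illustrative assumption, not something the test asserts.

```python
# Cycle counts copied from the test output above.
SERIAL_CYCLES = 4915200
PARALLEL_CYCLES = 1638601

ratio = SERIAL_CYCLES / PARALLEL_CYCLES

# Cycles beyond an even three-way split of the serial count (assumed baseline).
extra_cycles = PARALLEL_CYCLES - SERIAL_CYCLES // 3

print(f"speedup={ratio:.3f}x extra_cycles={extra_cycles}")
```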

@westkevin12 westkevin12 self-assigned this Jun 5, 2026
@westkevin12 westkevin12 merged commit 7a3819f into main Jun 5, 2026
2 checks passed
