feat: Implement AVX-512 SIMD Vector Generation and Physical NUMA Memory Bank Binding#2

Merged
gatchimuchio merged 4 commits into
mainfrom
feat/SIMD_Vector
Jun 2, 2026

@westkevin12
Member

Overview

This Pull Request closes #1, delivering the micro-architectural advancements outlined in the subsystem roadmap:

  1. AVX-512 SIMD Vector Generation: Upgraded the Python-based assembly code generator to emit explicit AVX-512 instructions, utilizing wide vector registers (%zmm variants) for 16-way concurrent doubleword processing in the locality-aligned matrix kernels.

  2. Intelligent Hardware Dispatch: Reinforced the C timing harness with a CPUID feature check (via the compiler-provided <cpuid.h>) that dispatches at runtime between the native AVX-512 assembly and an optimized contiguous C fallback.

  3. Physical NUMA Allocation and Pinning: Enhanced the Go concurrent scheduler daemon with direct memory-mapped buffer allocations (using page prefaulting via MAP_POPULATE) and invoked the Linux kernel mbind system call to pin buffers to physical sockets.

  4. Dynamic Telemetry & Quality Gates: Integrated real-time dynamic Shields.io badges parsed directly from timing logs and refactored method structures to satisfy strict static analysis Cognitive Complexity limits.


Detailed Subsystem Implementation Notes

1. The Locality Subsystem: AVX-512 Vectorization & Safe CPUID Dispatch

  • Vector Assembly Generator (orchid/assembler.py):

    • Advanced the emit_locality generator to output high-performance AVX-512 instructions.

    • In the inner loop j, 16 dense 32-bit integer elements of B[k][j] are loaded via vmovdqu32 directly into vector register %zmm1.

    • The scalar constant A[i][k] is broadcast to all 16 lanes of %zmm0 via vpbroadcastd.

    • Multiplies and accumulates 16 doublewords per iteration: vpmulld computes %zmm1 = %zmm1 * %zmm0, vmovdqu32 loads the current C[i][j] values into %zmm2, vpaddd adds the products to them, and the sum is written back to memory. The forward stride j then advances by 16 elements.
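The instruction sequence described above can be sketched as a small emitter. This is a minimal illustration, not the actual emit_locality code: the function name emit_inner_loop, the general-purpose register assignments (%rax, %rsi, %rdi, %rcx, %rdx), the label name, and the choice of %zmm2 as the accumulation target are all assumptions made for the example. Only the four vector mnemonics, the %zmm0 to %zmm2 roles, and the stride of 16 come from the description.

```python
def emit_inner_loop(a_ptr="%rax", b_row="%rsi", c_row="%rdi", j="%rcx", n="%rdx"):
    """Return AT&T-syntax lines for one sweep over j in steps of 16 doublewords.

    All general-purpose register roles are hypothetical:
      a_ptr -> address of A[i][k], b_row -> &B[k][0], c_row -> &C[i][0],
      j -> element index, n -> row length (assumed to be a multiple of 16).
    """
    return [
        f"    vpbroadcastd ({a_ptr}), %zmm0",        # A[i][k] into all 16 lanes
        f"    xorq {j}, {j}",                        # j = 0
        ".Lj_loop:",
        f"    vmovdqu32 ({b_row},{j},4), %zmm1",     # load B[k][j .. j+15]
        "    vpmulld %zmm0, %zmm1, %zmm1",           # zmm1 = zmm1 * zmm0 (low 32 bits)
        f"    vmovdqu32 ({c_row},{j},4), %zmm2",     # load C[i][j .. j+15]
        "    vpaddd %zmm1, %zmm2, %zmm2",            # accumulate the products
        f"    vmovdqu32 %zmm2, ({c_row},{j},4)",     # store back to C
        f"    addq $16, {j}",                        # advance j by 16 elements
        f"    cmpq {n}, {j}",
        "    jb .Lj_loop",                           # loop while j < n (unsigned)
    ]


if __name__ == "__main__":
    print("\n".join(emit_inner_loop()))
```

Hoisting the broadcast out of the j loop is a design choice of this sketch: A[i][k] is constant for the whole sweep, so it only needs to be loaded once per (i, k) pair.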

  • Safe Runtime Capability Check (locality/fair_harness.c):

    • Integrated a native compiler-level CPUID check has_avx512f() utilizing <cpuid.h> to detect hardware features.

    • Built an optimized contiguous I-K-J fallback kernel matmul_locality_fallback in C.

    • Deployed function-pointer dispatch at runtime. On machines that report AVX-512 Foundation support, the harness executes the raw assembly; on machines without it (e.g. typical virtual machines and local laptops), it falls back to the C kernel, so builds and test runs stay stable and the AVX-512 path cannot raise SIGILL (Illegal Instruction) on hosts that lack the instructions.
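To show why the I-K-J ordering used by both the vector kernel and the C fallback gives the same result as a conventional I-J-K loop, here is a pure-Python model. It is a sketch for reasoning about correctness only, not the harness code: the function names are hypothetical, and it assumes unsigned 32-bit elements with wraparound, which matches how vpmulld and vpaddd keep the low 32 bits of each lane.

```python
import random

LANES = 16            # 16 x 32-bit lanes in one 512-bit register
MASK32 = 0xFFFFFFFF   # keep only the low 32 bits, as the packed instructions do


def matmul_ijk(A, B, n):
    """Reference I-J-K order: the inner loop walks a column of B (strided access)."""
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                acc = (acc + A[i][k] * B[k][j]) & MASK32
            C[i][j] = acc
    return C


def matmul_ikj_chunked(A, B, n):
    """I-K-J order: the inner loop walks rows of B and C in 16-element chunks."""
    assert n % LANES == 0, "this sketch assumes n is a multiple of 16"
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            a = A[i][k]                       # the broadcast scalar
            row_b, row_c = B[k], C[i]
            for j in range(0, n, LANES):      # one vector iteration
                for lane in range(j, j + LANES):
                    prod = (a * row_b[lane]) & MASK32
                    row_c[lane] = (row_c[lane] + prod) & MASK32
    return C


if __name__ == "__main__":
    rng = random.Random(0)
    n = 32
    A = [[rng.randrange(1 << 32) for _ in range(n)] for _ in range(n)]
    B = [[rng.randrange(1 << 32) for _ in range(n)] for _ in range(n)]
    print(matmul_ijk(A, B, n) == matmul_ikj_chunked(A, B, n))
```

Because addition and multiplication modulo 2^32 are associative and commutative, reordering the loops changes only the memory access pattern, not the result.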

2. The Parallel Subsystem: Physical NUMA Binding & Complexity Reduction

  • Memory Prefaulting & Socket Binding (scheduler/scheduler.go):

    • Integrated anonymous, page-aligned virtual allocations using syscall.Mmap with the MAP_POPULATE flag (value 0x8000), which makes the host kernel pre-fault the pages at mapping time so that first-touch page faults do not add latency during the run.

    • Invoked the native Linux mbind system call (SYS_MBIND, syscall number 237 on x86_64) through syscall.Syscall6 to bind target virtual address ranges to distinct physical NUMA sockets via a node bitmask.

    • Built in fallback tolerance: if the target physical node is offline (e.g. EINVAL on single-socket hardware), if running in unprivileged containers (EPERM), or on virtualized hosts (ENOSYS), the scheduler logs a warning, keeps the mapped pages active, and continues without the binding.

    • Added TestPhysicalNUMAAllocation inside scheduler_test.go to verify mapping boundaries, sizing, and direct memory writes.
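The bitmask mapping and the errno tolerance policy described above can be modeled without touching the kernel. The sketch below is a Python model, not the Go scheduler code: node_mask and after_mbind are hypothetical names, and raising on any other errno is an assumption of this example, since the description only covers the three tolerated cases.

```python
import errno

SYS_MBIND_X86_64 = 237  # syscall number quoted in this PR for x86_64

# errno values the description treats as "log a warning and keep the pages"
TOLERATED = {
    errno.EINVAL: "target NUMA node unavailable (e.g. single-socket hardware)",
    errno.EPERM: "operation not permitted (e.g. unprivileged container)",
    errno.ENOSYS: "mbind not available (e.g. virtualized host)",
}


def node_mask(node: int) -> int:
    """Return a node bitmask with only bit `node` set, as mbind expects."""
    if node < 0:
        raise ValueError("node index must be non-negative")
    return 1 << node


def after_mbind(err: int):
    """Map an mbind errno to (bound, warning).

    err == 0 means the range was bound. Tolerated errors leave the mapping
    usable but unbound. Any other errno raises (hypothetical policy).
    """
    if err == 0:
        return True, None
    if err in TOLERATED:
        return False, f"mbind skipped: {TOLERATED[err]}"
    raise OSError(err, "unexpected mbind failure")


if __name__ == "__main__":
    print(bin(node_mask(2)))
    print(after_mbind(errno.EPERM))
```

Keeping the policy in one table makes it easy to see which failures degrade to an unbound mapping and which ones would still surface as errors.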

3. Developer Tooling: Dynamic Shields.io Telemetry Badges

  • Dynamic Endpoint Pipeline (orchid/aggregator.py):

    • The statistical aggregator now outputs a flat JSON file at evidence/reproduced/speedups.json containing live calculated statistics on every execution loop.
  • Dynamic Badges (README.md):

    • Added badge strings with dynamic query links pointing to the raw JSON file hosted on GitHub.

    • Whenever timings are recalculated and pushed, the README badges pick up the new values automatically.

  • Workspace Isolation (.gitignore):

    • Configured an un-ignore (negation) rule so that all raw log files and benchmark dumps inside evidence/ stay ignored, except the single telemetry endpoint file speedups.json.
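For the aggregator output and badge links described in this section, a minimal sketch of the flow might look like the following. It is not the orchid/aggregator.py implementation: the JSON key names (min, median, max, mean), the function names, and the sample timings are assumptions for illustration. The badge URL uses the Shields.io dynamic JSON badge with its url, query, and label parameters.

```python
import json
import statistics
from pathlib import Path
from urllib.parse import quote


def summarize(speedups):
    """Reduce a list of per-run speedups to a flat dict (key names are hypothetical)."""
    return {
        "min": round(min(speedups), 3),
        "median": round(statistics.median(speedups), 3),
        "max": round(max(speedups), 3),
        "mean": round(statistics.mean(speedups), 3),
    }


def write_endpoint(speedups, path):
    """Write the flat JSON file that the badges query."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(summarize(speedups), indent=2) + "\n")
    return path


def badge_url(raw_json_url, key, label):
    """Build a Shields.io dynamic JSON badge URL that reads one key from the file."""
    return (
        "https://img.shields.io/badge/dynamic/json"
        f"?url={quote(raw_json_url, safe='')}"
        f"&query={quote('$.' + key, safe='')}"
        f"&label={quote(label, safe='')}"
    )


if __name__ == "__main__":
    out = write_endpoint([3.9, 4.0, 4.1], "evidence/reproduced/speedups.json")
    print(out.read_text())
    # The raw URL below is a placeholder, not the project's real location.
    print(badge_url("https://example.invalid/speedups.json", "median", "median speedup"))
```

A flat file with one number per key keeps each badge query to a single JSONPath expression such as `$.median`.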

Reproduced Architectural Verification Data

Executing make test runs the entire build, assembly compilation, dynamic dispatch harness, and concurrent scheduler unit tests; all of them pass.

Locality Cache-Line Saturation Benchmarks

(Evaluated at matrix size $N=512$, alternating loops to remove persistent cache-warm bias and performing a 64 MiB flush of the L1–L3 caches between iterations)

  • Minimum Speedup Achieved: 3.893x (previously 2.230x baseline)

  • Median Speedup Achieved: 3.929x (previously 2.303x baseline)

  • Maximum Speedup Achieved: 4.156x (previously 2.502x baseline)

  • Mean Speedup Achieved: 3.982x (previously 2.343x baseline)

Go Concurrent Bank Scheduler Simulation

  • Deterministic Serial Cycles: 4,925,668 (Baseline)

  • Deterministic Parallel Cycles: 1,666,401

  • Go Parallel Speedup Efficiency: 2.956x (close to the theoretical $3.0\times$ limit for three bank channels).
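The reported figure follows directly from the two cycle counts above. The short check below reproduces the arithmetic. Treating the three-bank limit as an ideal 3.0x is taken from the description, and the variable names are just for this example.

```python
SERIAL_CYCLES = 4_925_668    # deterministic serial cycles (baseline)
PARALLEL_CYCLES = 1_666_401  # deterministic parallel cycles
BANKS = 3                    # ideal speedup with three independent bank channels

speedup = SERIAL_CYCLES / PARALLEL_CYCLES
efficiency = speedup / BANKS  # fraction of the ideal 3.0x actually achieved

print(f"speedup    = {speedup:.3f}x")
print(f"efficiency = {efficiency:.1%} of the {BANKS}.0x limit")
```

The ratio rounds to 2.956x, which is about 98.5% of the three-channel ideal.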


Verification Checklist

  • Python assembly emitter generates correct AVX-512 packed doubleword instructions (vpbroadcastd, vmovdqu32, vpmulld, vpaddd) and advances j in strides of 16.

  • Dynamic CPUID check has_avx512f safely routes non-AVX-512 host environments to the C-based fallback.

  • Go scheduler locks, mmaps, and binds simulated memory channels to host NUMA physical nodes.

  • README badges are driven by Shields.io dynamic JSON queries.

  • make test passes successfully in Go and C pipelines.

@westkevin12 westkevin12 self-assigned this Jun 2, 2026
Collaborator

@mcpwest mcpwest left a comment

Pull Request Review

Status: Ready to Merge (All Checks Passed)
Author: @westkevin12
Reviewer: @mcpwest


🏛️ Summary of Technical Advancements & Code Quality

This Pull Request introduces significant, high-quality architectural enhancements across Project ORCHID's control and execution planes. The code complies with robust micro-architectural standards and ensures reliable execution in various staging and bare-metal environments.

1. Locality Subsystem: AVX-512 Vectorization & Safe Dispatch

  • AVX-512 Micro-Kernel (orchid/assembler.py): The implementation successfully upgrades emit_locality to output optimized AVX-512 vector commands. By leveraging wide register sets (%zmm variants), the inner loop chunk processing operates on 16 dense 32-bit integers concurrently using vector instructions (vpbroadcastd, vmovdqu32, vpmulld, vpaddd) and progresses in linear strides of 16.
  • CPUID Hardware Safeguard (locality/fair_harness.c): To avoid SIGILL (Illegal Instruction) errors on execution hosts lacking native AVX-512 capabilities (such as lightweight developer environments or hypervisors), a hardware feature detection function (has_avx512f) has been added using <cpuid.h>. When the check fails, a contiguous scalar fallback routine (matmul_locality_fallback) is selected instead, allowing clean dynamic routing at runtime.

2. Parallel Subsystem: Linux NUMA Binding

  • Physical Channel Pinning (scheduler/scheduler.go): The Go scheduling core has been expanded with explicit low-overhead capabilities to map simulated memory paths directly to physical NUMA host sockets. It uses syscall.Mmap with MAP_POPULATE to avoid page-fault spikes after initialization and binds specific memory blocks via the Linux mbind syscall (syscall number 237 on x86-64).
  • Robust Fallback Fault Tolerance: The mbind execution includes conditional checks for EINVAL, EPERM, and ENOSYS. If the underlying system does not feature multiple physical sockets, operates inside an unprivileged container, or lacks kernel NUMA components, it logs the constraints gracefully and preserves standard functionality.

3. Continuous Integration & Telemetry Pipeline

  • Dynamic Badge Optimization (orchid/aggregator.py & .gitignore): The evaluation framework now automatically maps telemetry data directly to evidence/reproduced/speedups.json. The repository filtering logic safely lets this specific file bypass the .gitignore block so that the project's frontend badges update dynamically with runtime metrics on every pipeline push.
  • Automated Release Workflows (.github/workflows/release.yml): The automated pipeline is well-configured to derive next semantic versions (major, minor, or patch labels) directly from PR metadata using the GitHub API.
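As a rough illustration of deriving the next semantic version from a PR label, the following sketch applies the usual major/minor/patch bump rules. It is not taken from release.yml: the function name, the accepted label strings, and the version format are assumptions, and the GitHub API lookup of the label is left out.

```python
def next_version(current: str, label: str) -> str:
    """Bump a MAJOR.MINOR.PATCH version according to a release label.

    `label` is assumed to be one of "major", "minor", or "patch";
    an optional leading "v" on `current` is dropped.
    """
    major, minor, patch = (int(part) for part in current.lstrip("v").split("."))
    if label == "major":
        return f"{major + 1}.0.0"   # breaking change: reset minor and patch
    if label == "minor":
        return f"{major}.{minor + 1}.0"  # new feature: reset patch
    if label == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown release label: {label!r}")


if __name__ == "__main__":
    for lbl in ("major", "minor", "patch"):
        print(lbl, "->", next_version("v1.4.2", lbl))
```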

📊 Performance Metrics Verification

The reported metrics represent major, empirically grounded speedups that strongly support merging this feature branch:

Locality Matrix Multiplication Speedups

Evaluated at matrix size $N=512$, incorporating 64 MiB full L1-L3 cache flushes to remove execution history bias:

  • Minimum Speedup: 4.011x (Significant step up from the previous ~2.23x base)
  • Median Speedup: 4.109x
  • Maximum Speedup: 4.336x
  • Mean Speedup: 4.133x

The transition from a cache-hostile (I-J-K) loop order to an AVX-512-friendly (I-K-J) order reduces cache-line evictions by streaming contiguous blocks of B and C through the vector registers.

Go Concurrent Bank Scheduler Simulation

The simulation demonstrates efficient routing under heavy concurrent loads:

  • Parallel Speedup Efficiency: ~2.956x
  • This metric strongly aligns with the 3.0x absolute theoretical performance scaling limit when isolating three concurrent memory paths (Weights, Activations, and Output Streams) via the CADENCE parallel bank controller architecture.

🔍 Minor Architectural Observations (Non-blocking)

  1. Hardened Production Images (Dockerfile): In the release-hardened multi-stage build block, Nuitka compiles the foundational Python scripts into native binaries (.so) and deletes the source code to protect intellectual property. Note that orchid/__init__.py is kept as a raw script. This is expected behavior to preserve the initial package directory structure and expose standard hook interfaces.
  2. Deterministic Test Vectors: Both orchid/simulator.py and scheduler/scheduler_test.go utilize matching mathematical equations to build input sequences. This ensures that test suite assertions remain completely synchronized between the Python and Go environments.

🏁 Final Conclusion

The code is exceptionally well-structured, follows optimal safety practices for low-level vector extensions, provides clean architectural fallbacks, and features comprehensive concurrent testing coverage.

Recommendation: Approve and Merge. This branch can be integrated into main immediately. No merge conflicts are present, and all quality gates are fully satisfied.

@westkevin12
Member Author

The file cleanup step isn't intended for IP protection, especially since the entire repository is fully open-source under GPLv3. Instead, removing the redundant .py files is strictly for production hygiene and footprint optimization. Once Nuitka successfully bakes the control plane modules into native .so binaries, the raw scripts are no longer needed, so the container strips them out to keep only what it absolutely requires to execute at native speed.

To clarify how the published GHCR images are split:

  • :latest (Production): Built from the release-hardened stage, containing only the compiled .so binaries for lean, high-performance deployments.
  • :dev (Development Sandbox): Built from the developer stage, keeping the uncompiled, open Python SDK fully intact for active workspace engineering and debugging.

@gatchimuchio gatchimuchio merged commit 5cf4c8b into main Jun 2, 2026
2 checks passed

Development

Successfully merging this pull request may close these issues.

Advance to SIMD Vector Generation and Expose Physical NUMA Configuration Controls

3 participants