feat: Transition from Mutex Fences to Lock-Free Queues#12
Merged
Conversation
…s and update performance metrics
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This PR closes #6 by refactoring the concurrency control in the Go scheduler daemon (
scheduler/scheduler.go) to optimize memory bank access for nanosecond-scale simulation. We have successfully replaced kernel-managed mutual exclusion locks (sync.Mutexfences) with a lock-free, array-backed circular queue lock (ALock) that coordinates concurrent operations entirely within user-space.Proposed Architectural Solution
1. Lockless Ticket-Based Queueing (Ring-Buffer Topology)
sync.MutexwithbankQueues []BankQueue.BankQueue) containing slots ofQueueItem(pre-allocated to a power-of-two size65536for fast bitmask indexing).atomic.AddUint64(&q.tail, 1) - 1. This eliminates Compare-and-Swap (CAS) retry loop overhead and cache invalidation storms during high-concurrency saturations.2. State-Machine & Channeled Handoff
To solve the classic CPU starvation and context-switch overhead of spinning when thousands of goroutines are scheduled concurrently:
Statevalues:0: Idle,1: Enqueued/Waiting,2: Processing).head == ticket(it is the current turn), it attempts to CAS the slot state from1 (Enqueued)to2 (Processing). If successful, it enters the critical section immediately without parking or allocating resources.q.ring[idx].sem). When the active slot holder releases the bank, it advancesheadand signals the next waiting thread using its channel, ensuring strict sequential handoff with minimal latency.runtime.Gosched()for lightweight hardware back-offs during state transit synchronization.3. 64-bit Alignment & API Contract Preservation
NewMemorySchedulerandAccess), allowing tests, matrix multiplication wrappers, and Python bindings to interface without modification.Verification and Benchmarks
1. Concurrency Correctness (Go Scheduler Unit Tests)
The test harness spawning 16,384 concurrent threads passes with 100% mathematical output identity and achieves the theoretical speedup maximum limit of 3.000x (STREAM-Triad 3-bank layout):
2. Go Race Detector Sweeps
Running the tests under the Go race detector confirms zero data races, leaks, or deadlock conditions:
3. Physical CPU Cache Locality Timing Loop
The physical hardware locality timing sweeps compiled from assembly verify that the system operates at maximum efficiency: