Micro-benchmarks for Go concurrency patterns in polling hot-loops.
⚠️ Scope: These benchmarks apply to polling patterns (`select` with a `default:` case) where you check channels millions of times per second. Most Go code uses blocking patterns instead—see Polling vs Blocking before drawing conclusions.
📖 New to this repo? Start with the Walkthrough for a guided tour with example outputs.
Measured on AMD Ryzen Threadripper PRO 3945WX, Go 1.25, Linux:
| Operation | Standard | Optimized | Speedup |
|---|---|---|---|
| Cancel check | 8.2 ns | 0.36 ns | 23x |
| Tick check | 86 ns | 5.6 ns | 15x |
Queue performance depends heavily on your goroutine topology:
SPSC (1 Producer → 1 Consumer):
| Implementation | Latency | Speedup |
|---|---|---|
| Channel | 248 ns | baseline |
| go-lock-free-ring (1 shard) | 114 ns | 2.2x |
| Our SPSC Ring (unguarded) | 36.5 ns | 6.8x |
MPSC (Multiple Producers → 1 Consumer):
| Producers | Channel | go-lock-free-ring | Speedup |
|---|---|---|---|
| 4 | 35 µs | 539 ns | 65x |
| 8 | 47 µs | 464 ns | 101x |
Key insight: Channels scale terribly with multiple producers due to lock contention. For MPSC patterns, go-lock-free-ring provides 65-100x speedup through sharded lock-free design.
The combined numbers below correspond to this hot-loop shape:

```go
for {
    if ctx.Done() { return }     // ← Cancel check
    if ticker.Tick() { flush() } // ← Tick check
    process(queue.Pop())         // ← Queue op
}
```

| Pattern | Standard | Optimized | Speedup |
|---|---|---|---|
| Cancel + Tick | 90 ns | 27 ns | 3.4x |
| Full loop | 130 ns | 63 ns | 2.1x |
| Throughput | Standard CPU | Optimized CPU | You Save |
|---|---|---|---|
| 100K ops/sec | 1.3% | 0.6% | 0.7% of a core |
| 1M ops/sec | 13% | 6% | 7% of a core |
| 10M ops/sec | 130% | 63% | 67% of a core |
TL;DR: At 10M ops/sec, switching to optimized patterns frees up 2/3 of a CPU core.
At the scale of millions of operations per second, idiomatic Go constructs like select on time.Ticker or standard channels introduce significant overhead. These bottlenecks stem from:
- Runtime Scheduling: The cost of parking/unparking goroutines.
- Lock Contention: The centralized timer heap in the Go runtime.
- Channel Internals: The overhead of hchan locking and memory barriers.
Example of code that can hit limits in tight loops:
```go
select {
case <-ctx.Done():
    return
case <-dropTicker.C:
    ...
default: // Non-blocking: returns immediately if nothing ready
}
```

Most Go code blocks rather than polls. Understanding this distinction is critical for interpreting these benchmarks.
Blocking (the standard pattern):

```go
select {
case <-ctx.Done():
    return
case v := <-ch:
    process(v)
// No default: goroutine parks until something is ready
}
```

- How it works: Goroutine yields to scheduler, wakes when a channel is ready
- CPU usage: Near zero while waiting
- Latency: Adds ~1-5µs scheduler wake-up time
- When to use: 99% of Go code—network servers, background workers, most pipelines
Polling (what this repo benchmarks):

```go
for {
    select {
    case <-ctx.Done():
        return
    case v := <-ch:
        process(v)
    default:
        // Do other work, check again immediately
    }
}
```

- How it works: Goroutine never parks, continuously checks channels
- CPU usage: 100% of one core while running
- Latency: Sub-microsecond response to channel events
- When to use: High-throughput loops, soft real-time, packet processing
| Your Situation | Pattern | These Benchmarks Apply? |
|---|---|---|
| HTTP server handlers | Blocking | ❌ Scheduler cost dominates |
| Background job worker | Blocking | ❌ Use standard patterns |
| Packet processing at 1M+ pps | Polling | ✅ Check overhead matters |
| Game loop / audio processing | Polling | ✅ Every nanosecond counts |
| Streaming data pipeline | Either | ⚠️ Depends on throughput |
Key insight: In blocking code, the scheduler wake-up cost (~1-5µs) dwarfs the channel check overhead (~20ns). Optimizing the check is pointless. In polling code, you're paying that check cost millions of times per second—that's where these optimizations shine.
This repo benchmarks polling hot-loop patterns where check overhead is the bottleneck.
Measure the raw cost of individual operations:
| Category | Standard Approach | Optimized Alternatives |
|---|---|---|
| Cancellation | `select` on `ctx.Done()` | `atomic.Bool` flag |
| Messaging | Buffered `chan` (SPSC) | Lock-free ring buffer |
| Time/Tick | `time.Ticker` in `select` | Batching / atomic / `nanotime` / TSC assembly |
The most credible guidance comes from testing interactions, not isolated micro-costs:
| Benchmark | What It Measures |
|---|---|
| `context-ticker` | Combined cost of checking cancellation + periodic tick |
| `channel-context` | Message processing with cancellation check per message |
| `full-loop` | Realistic hot loop: receive → process → check cancel → check tick |
Why combined matters: Isolated benchmarks can be misleading. A 10x speedup on context checking means nothing if your loop is bottlenecked on channel receives. The combined benchmarks reveal the actual improvement in realistic scenarios.
Queue performance varies dramatically based on goroutine topology. We benchmark three implementations:
| Implementation | Type | Best For |
|---|---|---|
| Go Channel | MPSC | Simple code, moderate throughput |
| Our SPSC Ring | SPSC | Maximum SPSC performance, zero allocs |
| go-lock-free-ring | MPSC | High-throughput multi-producer scenarios |
Cross-goroutine polling (our benchmark - separate producer/consumer goroutines):
| Implementation | Latency | Allocs | Speedup |
|---|---|---|---|
| Channel | 248 ns | 0 | baseline |
| go-lock-free-ring (1 shard) | 114 ns | 1 | 2.2x |
| Our SPSC Ring (unguarded) | 36.5 ns | 0 | 6.8x |
Same-goroutine (go-lock-free-ring native benchmarks):
| Benchmark | Latency | Notes |
|---|---|---|
| `BenchmarkWrite` | 35 ns | Single write operation |
| `BenchmarkTryRead` | 31 ns | Single read operation |
| `BenchmarkProducerConsumer` | 31 ns | Write + periodic drain in same goroutine |
| `BenchmarkConcurrentWrite` (8 producers) | 10.7 ns | Parallel writes, sharded |
Note: Cross-goroutine coordination adds ~80ns overhead. For batched same-goroutine patterns, go-lock-free-ring achieves 31 ns/op.
This is where go-lock-free-ring shines:
| Producers | Channel | go-lock-free-ring | Speedup |
|---|---|---|---|
| 4 | 35.3 µs | 539 ns | 65x |
| 8 | 47.1 µs | 464 ns | 101x |
Key insight: Channel lock contention scales terribly. With 8 producers, go-lock-free-ring is 101x faster due to its sharded design.
| Your Pattern | Recommendation | Why |
|---|---|---|
| 1 producer, 1 consumer | Our SPSC Ring | Fastest, zero allocs |
| N producers, 1 consumer | go-lock-free-ring | Sharding eliminates contention |
| Simple/infrequent | Channel | Simplicity, good enough |
For SPSC scenarios with separate producer/consumer goroutines, our simple ring (36.5 ns) beats go-lock-free-ring (114 ns).
Important: go-lock-free-ring's native benchmarks show ~31 ns/op for producer-consumer, but that's in the same goroutine. Our 114 ns measurement is for cross-goroutine polling, which adds coordination overhead. Both measurements are valid for their respective patterns.
Here's why our ring is faster in cross-goroutine scenarios:
1. CAS vs Simple Store
go-lock-free-ring must use Compare-And-Swap to safely handle multiple producers:
```go
// go-lock-free-ring: CAS to claim slot (expensive!)
if !atomic.CompareAndSwapUint64(&s.writePos, pos, pos+1) {
    continue // Retry if another producer won
}
```

Our SPSC ring just does a simple atomic store:
```go
// Our SPSC: simple store (fast!)
r.head.Store(head + 1)
```

CAS is 3-10x more expensive than a simple store because it must read, compare, and conditionally write while handling cache invalidation across cores.
2. Sequence Numbers for Race Protection
go-lock-free-ring uses per-slot sequence numbers to prevent a consumer from reading partially-written data:
```go
// go-lock-free-ring: extra atomic ops for safety
seq := atomic.LoadUint64(&sl.seq) // Check slot ready
if seq != pos { return false }
// ... write value ...
atomic.StoreUint64(&sl.seq, pos+1) // Signal to reader
```

Our SPSC ring skips this because we trust only one producer exists.
3. Boxing Allocations
```go
// go-lock-free-ring uses 'any' → 8 B allocation per write
sl.value = value

// Our ring uses generics → zero allocations
r.buf[head&r.mask] = v
```

What We Give Up:
| Safety Feature | Our SPSC Ring | go-lock-free-ring |
|---|---|---|
| Multiple producers | ❌ Undefined behavior | ✅ Safe |
| Race protection | ❌ Trust-based | ✅ Sequence numbers |
| Weak memory (ARM) | ⚠️ May need barriers | ✅ Proven safe |
Bottom line: Our SPSC ring is faster because it makes dangerous assumptions (single producer, x86 memory model). go-lock-free-ring is slower because it's provably safe for MPSC with explicit race protection. Use go-lock-free-ring for production multi-producer scenarios.
The in-repo RingBuffer includes debug guards that add ~25ns overhead:
```go
func (r *RingBuffer[T]) Push(v T) bool {
    if !r.pushActive.CompareAndSwap(0, 1) { // +10-15ns
        panic("concurrent Push")
    }
    defer r.pushActive.Store(0) // +10-15ns
    // ...
}
```

For production: Use the unguarded version or go-lock-free-ring.
We provide two lock-free queue implementations with different safety/performance tradeoffs:
1. Our SPSC Ring Buffer (internal/queue/ringbuf.go)
- Single-Producer, Single-Consumer only
- Generics-based (`[T any]`) — zero boxing allocations
- Simple atomic Load/Store (no CAS) — maximum speed
- Debug guards catch contract violations (disable for production)
- ⚠️ No race protection — trusts caller to maintain SPSC contract
- ⚠️ x86 optimized — may need memory barriers on ARM
- Best for: Dedicated producer/consumer goroutine pairs where you control both ends (see the usage sketch after this list)
2. go-lock-free-ring (external library)
- Multi-Producer, Single-Consumer (MPSC)
- Sharded design reduces contention across producers
- Uses CAS + sequence numbers for proven race-free operation
- Uses `any` type (causes boxing allocations)
- Configurable retry strategies for different load patterns
- ✅ Production-tested at 2300+ Mb/s throughput
- Best for: Fan-in patterns, worker pools, high-throughput pipelines
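To make the SPSC contract concrete, here is a hedged usage sketch built from the constructor and method signatures listed later in this README. The import path is illustrative only; `internal/` packages are importable only from within this module.

```go
package main

import (
	"fmt"

	"example.com/fastloops/internal/queue" // illustrative path, not the real module
)

func main() {
	// SPSC contract: exactly one goroutine calls Push, exactly one calls Pop.
	q := queue.NewRingBuffer[int](1024)

	go func() {
		for i := 0; i < 1000; i++ {
			for !q.Push(i) {
				// Buffer full: spin (or back off) until the consumer drains.
			}
		}
	}()

	received := 0
	for received < 1000 {
		if _, ok := q.Pop(); ok {
			received++
		}
	}
	fmt.Println("received", received, "items")
}
```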
| Feature | Our SPSC Ring | go-lock-free-ring |
|---|---|---|
| Producers | 1 only | Multiple |
| Consumers | 1 only | 1 only |
| Allocations | 0 | 1+ (boxing) |
| SPSC latency | 36.5 ns | 114 ns |
| 8-producer latency | N/A | 464 ns |
| Race protection | ❌ None | ✅ Sequence numbers |
| Write mechanism | Store | CAS + retry |
| Production ready | ⚠️ SPSC contract required | ✅ Battle-tested |
Instead of polling ctx.Done() in a select block, we use an atomic.Bool updated by a separate watcher goroutine. This replaces a channel receive with a much faster atomic load operation.
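As a rough sketch of that idea (the type here is hypothetical and simplified; the repo's actual implementation is `cancel.AtomicCanceler`):

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// AtomicFlag is a simplified stand-in for the repo's atomic canceler.
type AtomicFlag struct {
	done atomic.Bool
}

// Watch bridges a context to the flag: a side goroutine blocks on
// ctx.Done() and performs a single atomic store when it fires.
func (f *AtomicFlag) Watch(ctx context.Context) {
	go func() {
		<-ctx.Done()
		f.done.Store(true)
	}()
}

// Done is the hot-path check: one atomic load instead of a channel select.
func (f *AtomicFlag) Done() bool { return f.done.Load() }

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	var flag AtomicFlag
	flag.Watch(ctx)

	fmt.Println("before cancel:", flag.Done()) // false
	cancel()
	time.Sleep(time.Millisecond) // give the watcher goroutine time to observe ctx
	fmt.Println("after cancel:", flag.Done()) // true
}
```

The trade-off is a small propagation delay: the hot loop only observes cancellation after the watcher goroutine wakes and stores the flag.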
Standard time.Ticker uses the runtime's central timer heap, which can cause contention in high-performance apps. We are exploring:
- Batch-based counters: Only checking the time every N operations.
- Atomic time-sampling: Using a single global goroutine to update an atomic timestamp.
If your loop processes items rapidly, checking the clock on every iteration is expensive. Instead, check the time only once every 1,000 or 10,000 iterations.
```go
if count++; count%1000 == 0 {
    if time.Since(lastTick) >= interval {
        // Run logic
        lastTick = time.Now()
    }
}
```
If you have many goroutines that all need a "ticker," don't give them each a time.Ticker. Use one background goroutine that updates a global atomic variable with the current Unix nanoseconds. Your workers can then perform a simple atomic comparison.
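A rough sketch of the shared-clock approach (all names here are illustrative; the repo's `tick.AtomicTicker` may be structured differently):

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// nowNanos is one process-wide timestamp, refreshed by a single goroutine.
var nowNanos atomic.Int64

// startClock updates the shared timestamp at the given resolution.
func startClock(resolution time.Duration) {
	go func() {
		for {
			nowNanos.Store(time.Now().UnixNano())
			time.Sleep(resolution)
		}
	}()
}

// shouldTick is the worker-side check: one atomic load plus a comparison,
// with no per-worker time.Ticker and no channel receive.
func shouldTick(lastTick *int64, interval time.Duration) bool {
	now := nowNanos.Load()
	if now-*lastTick >= int64(interval) {
		*lastTick = now
		return true
	}
	return false
}

func main() {
	startClock(100 * time.Microsecond)
	last := time.Now().UnixNano()
	for i := 0; i < 5; i++ {
		time.Sleep(30 * time.Millisecond)
		fmt.Println("tick?", shouldTick(&last, 50*time.Millisecond))
	}
}
```

The refresh interval of the background goroutine bounds tick accuracy, so choose it several times finer than the smallest interval your workers need.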
For sub-microsecond precision where CPU usage is less important than latency, you can "spin" on the CPU until a specific runtime.nanotime is reached. This avoids the overhead of the Go scheduler parking and unparking your goroutine.
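A minimal sketch of the idea, using `time.Now` here for portability even though the text above refers to `runtime.nanotime`:

```go
package main

import (
	"fmt"
	"time"
)

// spinUntil busy-waits until the deadline instead of parking the goroutine,
// trading a fully occupied core for sub-microsecond wake-up latency.
func spinUntil(deadline time.Time) {
	for time.Now().Before(deadline) {
		// Busy-wait: the goroutine never yields to the scheduler.
	}
}

func main() {
	start := time.Now()
	spinUntil(start.Add(200 * time.Microsecond))
	fmt.Println("spun for", time.Since(start))
}
```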
For the lowest possible latency on x86, bypass the OS clock entirely and read the CPU's TSC directly. This is significantly faster than time.Now() because it avoids the overhead of the Go runtime and VDSO.
- Mechanism: Use a small assembly stub or `unsafe` to call the `RDTSC` instruction.
- Trade-off: Requires calibration (mapping cycles to nanoseconds; see the sketch below) and can be affected by CPU frequency scaling.
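For illustration, a calibration sketch under the assumption that the assembly stub below is exposed to Go as `func rdtsc() uint64`; the repo's auto-calibrating constructor presumably does something along these lines.

```go
// calibrateTSC estimates cycles-per-nanosecond by sampling the counter
// across a known wall-clock interval. Assumes rdtsc() is the Go declaration
// backed by the assembly stub shown below; requires the "time" import.
func calibrateTSC() float64 {
	startCycles := rdtsc()
	startWall := time.Now()
	time.Sleep(50 * time.Millisecond) // long enough to dominate measurement noise
	cycles := rdtsc() - startCycles
	elapsedNs := time.Since(startWall).Nanoseconds()
	return float64(cycles) / float64(elapsedNs) // value to pass as cyclesPerNs
}
```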
```asm
// internal/tick/tsc_amd64.s
TEXT ·rdtsc(SB), NOSPLIT, $0-8
    RDTSC
    SHLQ $32, DX
    ORQ  DX, AX
    MOVQ AX, ret+0(FP)
    RET
```

The Go runtime has an internal function `nanotime()` that returns a monotonic clock value. It is faster than `time.Now()` because it returns a single `int64` and avoids the overhead of constructing a `time.Time` struct.
- Mechanism: Access via `//go:linkname`.
- Benefit: Provides a middle ground between standard library safety and raw assembly speed.
```go
//go:linkname nanotime runtime.nanotime
func nanotime() int64
```
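A hedged sketch of how such a clock might drive interval checks (illustrative only; the repo's `NanotimeTicker` may differ, and a file using `//go:linkname` must import `unsafe`, typically as a blank import):

```go
// nanoTicker is an illustrative interval checker built on the linknamed
// clock declared above; requires the "time" import.
type nanoTicker struct {
	interval int64 // interval in nanoseconds
	next     int64 // next deadline, in nanotime units
}

func newNanoTicker(d time.Duration) *nanoTicker {
	return &nanoTicker{interval: int64(d), next: nanotime() + int64(d)}
}

// Tick reports whether the interval has elapsed: one clock read, no
// time.Time allocation, no channel receive.
func (t *nanoTicker) Tick() bool {
	now := nanotime()
	if now >= t.next {
		t.next = now + t.interval
		return true
	}
	return false
}
```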
├── cmd/ # CLI tools for interactive benchmarking
│ ├── channel/main.go # Queue comparison demo
│ ├── context/main.go # Cancel check comparison demo
│ ├── context-ticker/main.go # Combined benchmark demo
│ └── ticker/main.go # Tick check comparison demo
│
├── internal/
│ ├── cancel/ # Cancellation signaling
│ │ ├── cancel.go # Canceler interface
│ │ ├── context.go # Standard: ctx.Done() via select
│ │ ├── atomic.go # Optimized: atomic.Bool
│ │ └── *_test.go # Unit + benchmark tests
│ │
│ ├── queue/ # SPSC message passing
│ │ ├── queue.go # Queue[T] interface
│ │ ├── channel.go # Standard: buffered channel
│ │ ├── ringbuf.go # Optimized: lock-free ring buffer
│ │ └── *_test.go # Unit + benchmark + contract tests
│ │
│ ├── tick/ # Periodic triggers
│ │ ├── tick.go # Ticker interface
│ │ ├── ticker.go # Standard: time.Ticker
│ │ ├── batch.go # Optimized: check every N ops
│ │ ├── atomic.go # Optimized: runtime.nanotime
│ │ ├── tsc_amd64.go/.s # Optimized: raw RDTSC (x86 only)
│ │ ├── tsc_stub.go # Stub for non-x86 architectures
│ │ └── *_test.go # Unit + benchmark tests
│ │
│ └── combined/ # Interaction benchmarks
│ └── combined_bench_test.go
│
├── .github/workflows/ci.yml # CI: multi-version, multi-platform
├── Makefile # Build targets
├── README.md # This file
├── WALKTHROUGH.md # Guided tutorial with example output
├── BENCHMARKING.md # Environment setup & methodology
├── IMPLEMENTATION_PLAN.md # Design document
└── IMPLEMENTATION_LOG.md # Development log
Key directories:
- `internal/` — Core library implementations (standard vs optimized)
- `cmd/` — CLI tools that demonstrate the libraries with human-readable output
- `.github/workflows/` — CI testing across Go 1.21-1.23, Linux/macOS
```bash
# Run all tests
go test ./...
# Run benchmarks with memory stats
go test -bench=. -benchmem ./internal/...
# Run specific benchmark with multiple iterations (recommended for microbenches)
go test -run=^$ -bench=BenchmarkQueue -count=10 ./internal/queue
# Run with race detector (slower, but catches concurrency bugs)
go test -race ./...
# Compare results with benchstat (install: go install golang.org/x/perf/cmd/benchstat@latest)
go test -bench=. -count=10 ./internal/cancel > old.txt
# make changes...
go test -bench=. -count=10 ./internal/cancel > new.txt
benchstat old.txt new.txt
```

Micro-benchmarks measure one dimension in one environment. Keep these caveats in mind:
| Factor | Impact |
|---|---|
| Go version | Runtime internals change between releases |
| CPU architecture | x86 vs ARM, cache sizes, branch prediction |
| `GOMAXPROCS` | Contention patterns vary with parallelism |
| Power management | Turbo boost, frequency scaling affect TSC |
| Thermal state | Sustained load causes thermal throttling |
Recommendations:
- Use `benchstat` — Run benchmarks 10+ times and use `benchstat` to get statistically meaningful comparisons
- Pin CPU frequency — For TSC benchmarks: `sudo cpupower frequency-set -g performance`
- Isolate cores — For lowest variance: `taskset -c 0 go test -bench=...`
- Test your workload — These are micro-benchmarks; your mileage will vary in real applications
- Profile, don't assume — Use `go tool pprof` to confirm where time actually goes
Remember: A 10x speedup on a 200ns operation saves 180ns per call. If your loop runs 1M times/second, that's 180ms saved per second. If it runs 1000 times/second, that's 0.18ms—probably not worth the complexity.
The internal/ package provides minimal, focused implementations for benchmarking. Each sub-package exposes a single interface with two implementations: the standard library approach and the optimized alternative.
internal/
├── cancel/ # Cancellation signaling
│ ├── cancel.go # Interface definition
│ ├── context.go # Standard: ctx.Done() via select
│ └── atomic.go # Optimized: atomic.Bool flag
│
├── queue/ # SPSC message passing
│ ├── queue.go # Interface definition
│ ├── channel.go # Standard: buffered channel
│ └── ringbuf.go # Optimized: lock-free ring buffer
│
└── tick/ # Periodic triggers
├── tick.go # Interface definition
├── ticker.go # Standard: time.Ticker in select
├── batch.go # Optimized: check every N ops
├── atomic.go # Optimized: shared atomic timestamp
├── nanotime.go # Optimized: runtime.nanotime via linkname
└── tsc_amd64.s # Optimized: raw RDTSC assembly (x86)
Each package defines a minimal interface that both implementations satisfy:
```go
// internal/cancel/cancel.go
package cancel
// Canceler signals shutdown to workers.
type Canceler interface {
Done() bool // Returns true if cancelled
Cancel() // Trigger cancellation
}
```

```go
// internal/queue/queue.go
package queue
// Queue is a single-producer single-consumer queue.
type Queue[T any] interface {
Push(T) bool // Returns false if full
Pop() (T, bool)
}
```

```go
// internal/tick/tick.go
package tick
// Ticker signals periodic events.
type Ticker interface {
Tick() bool // Returns true if interval elapsed
Reset() // Reset without reallocation
Stop()
}
```

Standard Go convention—return concrete types, accept interfaces:
```go
// Standard implementations
cancel.NewContext(ctx context.Context) *ContextCanceler
queue.NewChannel[T any](size int) *ChannelQueue[T]
tick.NewTicker(interval time.Duration) *StdTicker
// Optimized implementations
cancel.NewAtomic() *AtomicCanceler
queue.NewRingBuffer[T any](size int) *RingBuffer[T]
tick.NewBatch(interval time.Duration, every int) *BatchTicker
tick.NewAtomicTicker(interval time.Duration) *AtomicTicker
tick.NewNanotime(interval time.Duration) *NanotimeTicker
tick.NewTSC(interval, cyclesPerNs float64) *TSCTicker // x86 only
tick.NewTSCCalibrated(interval time.Duration) *TSCTicker // auto-calibrates
```

Each cmd/ binary follows the same structure:
```go
func main() {
// Parse flags for iterations, warmup, etc.
// Run standard implementation
std := runBenchmark(standardImpl, iterations)
// Run optimized implementation
opt := runBenchmark(optimizedImpl, iterations)
// Print comparison
fmt.Printf("Standard: %v\nOptimized: %v\nSpeedup: %.2fx\n",
std, opt, float64(std)/float64(opt))
}
```

- No abstraction for abstraction's sake—interfaces exist only because we need to swap implementations
- Zero allocations in hot paths—pre-allocate, reuse, avoid escape to heap
- Benchmark-friendly—implementations expose internals needed for accurate measurement
- Copy-paste ready—each optimized implementation is self-contained for easy extraction