This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
PyDotCompute is a Python port of DotCompute's Ring Kernel System - a GPU-native actor model with persistent kernels and message passing. It enables developers to create persistent GPU kernels that communicate through message queues, ideal for real-time GPU compute pipelines and streaming data processing.
```bash
# Install for development
pip install -e ".[dev]"

# Install with CUDA support
pip install -e ".[cuda]"

# Install with Metal support (macOS only)
pip install -e ".[metal]"

# Install with performance optimizations (uvloop)
pip install -e ".[fast]"

# Install with Cython extensions (maximum performance)
pip install -e ".[cython]"
python setup_cython.py build_ext --inplace
```
```bash
# Run all tests
pytest

# Run tests with coverage
pytest --cov=pydotcompute

# Run only unit tests
pytest tests/unit/

# Skip CUDA tests (if no GPU)
pytest -m "not cuda"

# Skip Metal tests (if not on macOS)
pytest -m "not metal"

# Run benchmarks
python benchmarks/extended_benchmark.py
python benchmarks/pagerank_benchmark.py
python benchmarks/realtime_anomaly_benchmark.py
python benchmarks/metal_benchmark.py  # macOS only
```
```bash
# Type checking
mypy pydotcompute

# Linting
ruff check pydotcompute

# Build documentation
mkdocs serve
```

### Ring Kernel System (pydotcompute/ring_kernels/)
- `runtime.py` - `RingKernelRuntime`: Main coordinator managing kernel lifecycle, message routing, and telemetry. Use as async context manager. Auto-installs uvloop for 21μs message latency.
- `lifecycle.py` - `RingKernel`, `KernelContext`, `KernelState`: Two-phase launch (launch -> activate) with graceful shutdown.
- `message.py` - `@message` decorator and `RingKernelMessage` base class for type-safe msgpack serialization (see the sketch below).
- `queue.py` - `MessageQueue`: Async message queues with backpressure strategies (block/reject/drop_oldest).
- `fast_queue.py` - `FastMessageQueue`: O(1) priority banding with 4 bands (SYSTEM/HIGH/NORMAL/LOW). Zero-copy mode for in-process messaging.
- `telemetry.py` - Real-time GPU monitoring and kernel performance metrics.
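For orientation, a minimal sketch of what defining message types might look like; the `EchoRequest`/`EchoResponse` names, the dataclass-style field syntax, and the import path are illustrative assumptions, not the confirmed API:

```python
from pydotcompute.ring_kernels.message import message, RingKernelMessage  # import path assumed

# Hypothetical payload types for an echo-style actor; assumes the @message
# decorator accepts annotated fields, dataclass-style.
@message
class EchoRequest(RingKernelMessage):
    text: str

@message
class EchoResponse(RingKernelMessage):
    text: str
    latency_us: float
```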
### Performance Tiers (pydotcompute/ring_kernels/)

- `_loop.py` - uvloop auto-installation + eager_task_factory (Python 3.12+)
- `sync_queue.py` - `SyncQueue`, `SPSCQueue`: Threading-based queues for GIL-releasing workloads
- `threaded_kernel.py` - `ThreadedRingKernel`: Dedicated thread execution for blocking I/O
- `cython_kernel.py` - `CythonRingKernel`: Maximum performance using the Cython FastSPSCQueue
- `_cython/fast_spsc.pyx` - Lock-free SPSC queue with 0.33μs operations
### Memory Management (pydotcompute/core/)

- `unified_buffer.py` - `UnifiedBuffer`: Transparent host-device memory with lazy synchronization. Tracks dirty state (HOST_DIRTY/DEVICE_DIRTY/SYNCHRONIZED) to minimize transfers.
- `memory_pool.py` - `MemoryPool`: Memory pooling for buffer reuse.
- `accelerator.py` - `Accelerator`: GPU device abstraction (singleton).
- `orchestrator.py` - `ComputeOrchestrator`: Compute coordination.
### Backends (pydotcompute/backends/)

- `base.py` - Backend ABC: Interface all backends implement (allocate, free, copy_to_device, execute_kernel, compile_kernel); see the skeleton below.
- `cpu.py` - CPU simulation backend.
- `cuda.py` - CUDA backend via Numba JIT and CuPy arrays.
- `metal.py` - Metal backend via Apple MLX for macOS/Apple Silicon GPU acceleration.
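As a rough sketch of the interface shape only, here is what a minimal backend subclass could look like; the class name, import path, and every signature below are assumptions based on the method list above, not the project's actual definitions:

```python
import numpy as np
from pydotcompute.backends.base import Backend  # class name and import path are assumed

class NullBackend(Backend):
    """Illustrative no-op backend that keeps all 'device' data on the host."""

    def allocate(self, nbytes):           # signature is a guess
        return bytearray(nbytes)

    def free(self, buffer):
        pass                               # nothing to release in this sketch

    def copy_to_device(self, array):
        return np.asarray(array)           # no real device; hand back a host array

    def compile_kernel(self, func):
        return func                        # no compilation step here

    def execute_kernel(self, kernel, *args):
        return kernel(*args)               # run the "kernel" directly on the host
```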
### Decorators (pydotcompute/decorators/)

- `ring_kernel.py` - `@ring_kernel`: Decorator for defining persistent GPU actor kernels. Auto-registers with the global registry.
- `kernel.py` - `@kernel`, `@gpu_kernel`: Standard kernel decorators.
- `validators.py` - Runtime validation utilities.
| Tier | Implementation | Latency (p50) | Use Case |
|---|---|---|---|
| 1 (Default) | uvloop + FastMessageQueue | 21μs | Async Python code |
| 2 | ThreadedRingKernel | ~100μs | Blocking I/O, GIL-releasing |
| 3 | CythonRingKernel + FastSPSCQueue | 0.33μs queue ops | Multi-process, Cython extensions |
**Key insight:** uvloop (Tier 1) is optimal for pure Python because of the GIL; threading adds overhead, and Cython queues only shine in multi-process scenarios.
Ring Kernel Definition:
```python
@ring_kernel(kernel_id="my_actor", input_type=RequestType, output_type=ResponseType)
async def my_actor(ctx: KernelContext):
    while not ctx.should_terminate:
        msg = await ctx.receive()
        await ctx.send(ResponseType(...))
```

Runtime Usage (auto-installs uvloop):
```python
async with RingKernelRuntime() as runtime:
    await runtime.launch("my_actor")    # Phase 1: allocate resources
    await runtime.activate("my_actor")  # Phase 2: start processing
    await runtime.send("my_actor", request)
    response = await runtime.receive("my_actor")
```

Threaded Kernel (for blocking operations):
```python
def blocking_kernel(ctx: ThreadedKernelContext):
    while not ctx.should_terminate:
        msg = ctx.receive(timeout=0.1)  # Blocking receive
        if msg:
            ctx.send(process(msg))

with ThreadedRingKernel("worker", blocking_kernel) as kernel:
    kernel.send(request)
    response = kernel.receive()
```

Buffer State Machine:

- UNINITIALIZED -> HOST_ONLY/DEVICE_ONLY/SYNCHRONIZED
- HOST_DIRTY (after host write) -> SYNCHRONIZED (after device access)
- DEVICE_DIRTY (after device write) -> SYNCHRONIZED (after host access)
Metal Backend Usage (macOS):
```python
from pydotcompute.backends.metal import MetalBackend, get_vector_add_kernel
import numpy as np

backend = MetalBackend()
if backend.is_available:
    # Allocate and copy data
    data = backend.copy_to_device(np.array([1, 2, 3], dtype=np.float32))

    # Use pre-built kernels
    add_kernel = get_vector_add_kernel()
    result = add_kernel(data, data)  # [2, 4, 6]

    # Or compile custom kernels
    compiled = backend.compile_kernel(lambda x: x * 2 + 1)
    result = compiled(np.array([1, 2, 3], dtype=np.float32))
```

UnifiedBuffer with Metal:
```python
from pydotcompute.core.unified_buffer import UnifiedBuffer
import numpy as np

buffer = UnifiedBuffer((1000,), dtype=np.float32)
buffer.allocate()
buffer.host[:] = np.random.randn(1000)
buffer.mark_host_dirty()

# Access .metal property for Metal GPU operations (auto-syncs from host)
metal_arr = buffer.metal  # MLX array on Metal GPU
```

Message passing latency:
- p50: 63μs (full actor roundtrip)
- p99: 131μs
- Isolated queue: 21μs p50

PageRank benchmark:
- GPU wins at: 50K+ nodes (7-9x faster than CPU)
- Peak throughput: 1.7M edges/sec (GPU Sparse)
- Crossover point: ~1000 nodes dense, 5000 nodes sparse

Real-time anomaly detection (GPU actors):
- GPU Actors advantage: Persistent GPU state (no repeated transfers)
- Transfer overhead: 16-28% of batch processing time
- Best for: Long-running pipelines with context
| Queue Type | Put+Get (same thread) |
|---|---|
| FastMessageQueue (Python) | 1.8μs |
| FastSPSCQueue (Cython) | 0.33μs |
Metal (Apple Silicon):
- Matrix multiply speedup: 4-10x vs CPU at 1024x1024+
- Unified memory: Zero-copy host-device transfers
- Best for: Large matrix operations, streaming pipelines
- **uvloop beats threading for Python**: The GIL makes native threading slower than uvloop's libuv-based event loop for message passing.
- **Queue operations are fast, synchronization is slow**: Raw queue ops are ~1-2μs, but thread context switching adds 50-100μs.
- **Cython queues need multi-process**: The Cython FastSPSCQueue achieves 0.33μs but only shines in multi-process scenarios where the GIL isn't shared.
- **GPU wins at scale**: GPU acceleration becomes beneficial at 50K+ nodes for graphs, and for streaming with persistent state.
- **Zero-copy matters**: Using `serialize=False` for in-process messaging eliminates serialization overhead.
- pytest-asyncio is configured with `asyncio_mode = "auto"` - async tests run automatically.
- Fixtures in `tests/conftest.py` provide `runtime`, `accelerator`, `memory_pool`, `unified_buffer`, `message_queue`.
- CUDA tests are automatically skipped if CUDA is not available.
- Metal tests are automatically skipped if Metal/MLX is not available (non-macOS).
- Use `@pytest.mark.cuda` for CUDA-specific tests.
- Use `@pytest.mark.metal` for Metal-specific tests.
- Use `@pytest.mark.slow` for slow-running tests.
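An illustrative sketch of how these conventions combine in a test module (the test names and bodies are placeholders, not taken from the actual suite):

```python
import asyncio
import pytest

@pytest.mark.slow
async def test_pipeline_smoke():
    # Runs as a coroutine without extra decoration because asyncio_mode = "auto".
    await asyncio.sleep(0.01)

@pytest.mark.cuda
def test_accelerator_available(accelerator):
    # Uses the `accelerator` fixture from tests/conftest.py; the whole test is
    # skipped automatically when CUDA is not available.
    assert accelerator is not None
```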
- Python >= 3.11
- Core: numpy, msgpack
- Performance: uvloop (Linux/macOS, auto-installed)
- CUDA (optional): cupy-cuda12x, numba, pynvml
- Metal (optional, macOS only): mlx >= 0.4.0
- Cython (optional): cython >= 3.0.0
Set the environment variable before import:

```bash
PYDOTCOMPUTE_NO_UVLOOP=1 python my_script.py
```
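Equivalently, from Python, as long as the variable is set before the first `pydotcompute` import:

```python
import os

os.environ["PYDOTCOMPUTE_NO_UVLOOP"] = "1"  # must be set before importing pydotcompute

import pydotcompute  # noqa: E402  # uvloop auto-install is skipped
```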