CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

PyDotCompute is a Python port of DotCompute's Ring Kernel System - a GPU-native actor model with persistent kernels and message passing. It enables developers to create persistent GPU kernels that communicate through message queues, ideal for real-time GPU compute pipelines and streaming data processing.

Development Commands

# Install for development
pip install -e ".[dev]"

# Install with CUDA support
pip install -e ".[cuda]"

# Install with Metal support (macOS only)
pip install -e ".[metal]"

# Install with performance optimizations (uvloop)
pip install -e ".[fast]"

# Install with Cython extensions (maximum performance)
pip install -e ".[cython]"
python setup_cython.py build_ext --inplace

# Run all tests
pytest

# Run tests with coverage
pytest --cov=pydotcompute

# Run only unit tests
pytest tests/unit/

# Skip CUDA tests (if no GPU)
pytest -m "not cuda"

# Skip Metal tests (if not on macOS)
pytest -m "not metal"

# Run benchmarks
python benchmarks/extended_benchmark.py
python benchmarks/pagerank_benchmark.py
python benchmarks/realtime_anomaly_benchmark.py
python benchmarks/metal_benchmark.py  # macOS only

# Type checking
mypy pydotcompute

# Linting
ruff check pydotcompute

# Build documentation
mkdocs serve

Architecture

Core Components

Ring Kernel System (pydotcompute/ring_kernels/)

runtime.py - RingKernelRuntime: Main coordinator managing kernel lifecycle, message routing, and telemetry. Use as async context manager. Auto-installs uvloop for 21μs message latency.
lifecycle.py - RingKernel, KernelContext, KernelState: Two-phase launch (launch -> activate) with graceful shutdown.
message.py - @message decorator and RingKernelMessage base class for type-safe msgpack serialization.
queue.py - MessageQueue: Async message queues with backpressure strategies (block/reject/drop_oldest).
fast_queue.py - FastMessageQueue: O(1) priority banding with 4 bands (SYSTEM/HIGH/NORMAL/LOW). Zero-copy mode for in-process messaging.
telemetry.py - Real-time GPU monitoring and kernel performance metrics.

Performance Tiers (pydotcompute/ring_kernels/)

_loop.py - uvloop auto-installation + eager_task_factory (Python 3.12+)
sync_queue.py - SyncQueue, SPSCQueue: Threading-based queues for GIL-releasing workloads
threaded_kernel.py - ThreadedRingKernel: Dedicated thread execution for blocking I/O
cython_kernel.py - CythonRingKernel: Maximum performance using Cython FastSPSCQueue
_cython/fast_spsc.pyx - Lock-free SPSC queue with 0.33μs operations

Memory Management (pydotcompute/core/)

unified_buffer.py - UnifiedBuffer: Transparent host-device memory with lazy synchronization. Tracks dirty state (HOST_DIRTY/DEVICE_DIRTY/SYNCHRONIZED) to minimize transfers.
memory_pool.py - MemoryPool: Memory pooling for buffer reuse.
accelerator.py - Accelerator: GPU device abstraction (singleton).
orchestrator.py - ComputeOrchestrator: Compute coordination.

Backends (pydotcompute/backends/)

base.py - Backend ABC: Interface all backends implement (allocate, free, copy_to_device, execute_kernel, compile_kernel).
cpu.py - CPU simulation backend.
cuda.py - CUDA backend via Numba JIT and CuPy arrays.
metal.py - Metal backend via Apple MLX for macOS/Apple Silicon GPU acceleration.

Decorators (pydotcompute/decorators/)

ring_kernel.py - @ring_kernel: Decorator for defining persistent GPU actor kernels. Auto-registers with global registry.
kernel.py - @kernel, @gpu_kernel: Standard kernel decorators.
validators.py - Runtime validation utilities.

Performance Tiers

Tier	Implementation	Latency (p50)	Use Case
1 (Default)	uvloop + FastMessageQueue	21μs	Async Python code
2	ThreadedRingKernel	~100μs	Blocking I/O, GIL-releasing
3	CythonRingKernel + FastSPSCQueue	0.33μs queue ops	Multi-process, Cython extensions

Key Insight: uvloop (Tier 1) is optimal for pure Python due to GIL. Threading adds overhead. Cython queues shine in multi-process scenarios.

Key Patterns

Ring Kernel Definition:

@ring_kernel(kernel_id="my_actor", input_type=RequestType, output_type=ResponseType)
async def my_actor(ctx: KernelContext):
    while not ctx.should_terminate:
        msg = await ctx.receive()
        await ctx.send(ResponseType(...))

Runtime Usage (auto-installs uvloop):

async with RingKernelRuntime() as runtime:
    await runtime.launch("my_actor")      # Phase 1: allocate resources
    await runtime.activate("my_actor")    # Phase 2: start processing
    await runtime.send("my_actor", request)
    response = await runtime.receive("my_actor")

Threaded Kernel (for blocking operations):

def blocking_kernel(ctx: ThreadedKernelContext):
    while not ctx.should_terminate:
        msg = ctx.receive(timeout=0.1)  # Blocking receive
        if msg:
            ctx.send(process(msg))

with ThreadedRingKernel("worker", blocking_kernel) as kernel:
    kernel.send(request)
    response = kernel.receive()

Buffer State Machine: UNINITIALIZED -> HOST_ONLY/DEVICE_ONLY/SYNCHRONIZED HOST_DIRTY (after host write) -> SYNCHRONIZED (after device access) DEVICE_DIRTY (after device write) -> SYNCHRONIZED (after host access)

Metal Backend Usage (macOS):

from pydotcompute.backends.metal import MetalBackend, get_vector_add_kernel
import numpy as np

backend = MetalBackend()
if backend.is_available:
    # Allocate and copy data
    data = backend.copy_to_device(np.array([1, 2, 3], dtype=np.float32))

    # Use pre-built kernels
    add_kernel = get_vector_add_kernel()
    result = add_kernel(data, data)  # [2, 4, 6]

    # Or compile custom kernels
    compiled = backend.compile_kernel(lambda x: x * 2 + 1)
    result = compiled(np.array([1, 2, 3], dtype=np.float32))

UnifiedBuffer with Metal:

from pydotcompute.core.unified_buffer import UnifiedBuffer
import numpy as np

buffer = UnifiedBuffer((1000,), dtype=np.float32)
buffer.allocate()
buffer.host[:] = np.random.randn(1000)
buffer.mark_host_dirty()

# Access .metal property for Metal GPU operations (auto-syncs from host)
metal_arr = buffer.metal  # MLX array on Metal GPU

Performance Benchmarks

Message Latency (Extended Benchmark)

p50: 63μs (full actor roundtrip)
p99: 131μs
Isolated queue: 21μs p50

Graph Processing (PageRank Benchmark)

GPU wins at: 50K+ nodes (7-9x faster than CPU)
Peak throughput: 1.7M edges/sec (GPU Sparse)
Crossover point: ~1000 nodes dense, 5000 nodes sparse

Streaming (Real-Time Anomaly Benchmark)

GPU Actors advantage: Persistent GPU state (no repeated transfers)
Transfer overhead: 16-28% of batch processing time
Best for: Long-running pipelines with context

Queue Operations

Queue Type	Put+Get (same thread)
FastMessageQueue (Python)	1.8μs
FastSPSCQueue (Cython)	0.33μs

Metal Backend (Apple Silicon)

Matrix multiply speedup: 4-10x vs CPU at 1024x1024+
Unified memory: Zero-copy host-device transfers
Best for: Large matrix operations, streaming pipelines

Lessons Learned

uvloop beats threading for Python: The GIL makes native threading slower than uvloop's libuv-based event loop for message passing.
Queue operations are fast, synchronization is slow: Raw queue ops are ~1-2μs, but thread context switching adds 50-100μs.
Cython queues need multi-process: The Cython FastSPSCQueue achieves 0.33μs but only shines in multi-process scenarios where GIL isn't shared.
GPU wins at scale: GPU acceleration becomes beneficial at 50K+ nodes for graphs, and for streaming with persistent state.
Zero-copy matters: Using serialize=False for in-process messaging eliminates serialization overhead.

Testing

pytest-asyncio is configured with asyncio_mode = "auto" - async tests run automatically.
Fixtures in tests/conftest.py provide runtime, accelerator, memory_pool, unified_buffer, message_queue.
CUDA tests are automatically skipped if CUDA is not available.
Metal tests are automatically skipped if Metal/MLX is not available (non-macOS).
Use @pytest.mark.cuda for CUDA-specific tests.
Use @pytest.mark.metal for Metal-specific tests.
Use @pytest.mark.slow for slow-running tests.

Requirements

Python >= 3.11
Core: numpy, msgpack
Performance: uvloop (Linux/macOS, auto-installed)
CUDA (optional): cupy-cuda12x, numba, pynvml
Metal (optional, macOS only): mlx >= 0.4.0
Cython (optional): cython >= 3.0.0

Disabling uvloop

Set environment variable before import:

PYDOTCOMPUTE_NO_UVLOOP=1 python my_script.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

CLAUDE.md

Project Overview

Development Commands

Architecture

Core Components

Performance Tiers

Key Patterns

Performance Benchmarks

Message Latency (Extended Benchmark)

Graph Processing (PageRank Benchmark)

Streaming (Real-Time Anomaly Benchmark)

Queue Operations

Metal Backend (Apple Silicon)

Lessons Learned

Testing

Requirements

Disabling uvloop

Uh oh!

FilesExpand file tree

CLAUDE.md

Latest commit

History

CLAUDE.md

File metadata and controls

CLAUDE.md

Project Overview

Development Commands

Architecture

Core Components

Performance Tiers

Key Patterns

Performance Benchmarks

Message Latency (Extended Benchmark)

Graph Processing (PageRank Benchmark)

Streaming (Real-Time Anomaly Benchmark)

Queue Operations

Metal Backend (Apple Silicon)

Lessons Learned

Testing

Requirements

Disabling uvloop