pagedattention
Here are 9 public repositories matching this topic...
LLM inference engine from scratch — paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graph, tensor parallelism, OpenAI-compatible serving
Updated Apr 24, 2026 - Python
An Efficient and Versatile Inference Engine for Distributed LLM Serving
Updated Jun 15, 2026 - Python
A deterministic PyTorch autograd verification trap for catching silent KV-cache routing and block-alignment failures in vLLM and SGLang serving infrastructure.
Updated Jun 7, 2026 - Python
Deadline-aware KV-cache scheduling for protecting decode-critical request state under long-context LLM inference pressure.
Updated Jun 15, 2026 - Python
🚀 Accelerate LLM inference with Mini-Infer, a high-performance engine designed for efficiency and power in AI model deployment.
Updated Jun 16, 2026 - Python
What to consider when running AI Inference at scale on Kubernetes
Updated May 21, 2026
A minimal LLM inference engine implementing PagedAttention-style KV cache management on NanoGPT. Based on the "Efficient Memory Management for Large Language Model Serving with PagedAttention" paper.
Updated Apr 16, 2026 - Jupyter Notebook
From-scratch model of an LLM serving engine's systems core: paged KV-cache, continuous batching, preemption, and prefix caching — GPU-free, with reproducible benchmarks.
Updated May 30, 2026 - Python