The hardware multiplier for AI.
Break the VRAM barrier without rewriting a single line of Python.
Quickstart • How It Works • Benchmarks • Docs • Contributing
TensorShunt is an AOT (ahead-of-time) compiler pass and native runtime that intercepts PyTorch computation graphs and automatically manages memory across your entire hardware stack (GPU VRAM, CPU RAM, and NVMe storage) while hiding data-movement latency behind active computation.
The problem: You have a model that needs 80 GB of VRAM. You have a 24 GB GPU. Current options are too slow (HuggingFace Accelerate), too complex (DeepSpeed with 50+ config knobs), or too limited (Unsloth = LoRA only).
The solution: One line of Python.
import torch
import tensorshunt
model = YourMassiveModel()
optimized = torch.compile(model, backend=tensorshunt.backend())
# Training works. Inference works. No config files. No PhD required.
loss = optimized(inputs).sum()
loss.backward()

Existing solutions operate at the framework level: they see "layers" and "parameter groups." TensorShunt operates at the compiler IR level (MLIR), which means it sees every individual tensor, every operation, and every data dependency. This enables optimizations that are architecturally impossible in framework-level tools:
| Capability | DeepSpeed | FSDP | Accelerate | TensorShunt |
|---|---|---|---|---|
| NVMe offloading | ✅ | ❌ | ✅ | ✅ |
| Per-tensor scheduling | ❌ | ❌ | ❌ | ✅ |
| Guaranteed latency hiding | ❌ | ❌ | ❌ | ✅ (Compiler-driven) |
| Auto rematerialization | ❌ | ❌ | ❌ | ✅ |
| io_uring (modern async I/O) | ❌ | ❌ | ❌ | ✅ |
| Zero-config setup | ❌ | ❌ | ✅ | ✅ |
| Training + Inference | ✅ | ✅ | ✅ | ✅ |
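
The "Per-tensor scheduling" and "Auto rematerialization" rows come down to a decision made for each tensor: recompute it later, or move it off the GPU and fetch it back. The following is a minimal sketch of that kind of cost comparison, not TensorShunt's actual cost model. The `TensorInfo` fields, the `choose_strategy` function, and the 16 GB/s bandwidth figure are all hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TensorInfo:
    """Hypothetical per-tensor record; not a TensorShunt type."""
    name: str
    size_bytes: int           # bytes occupied in VRAM
    recompute_seconds: float  # estimated time to re-run the producing op(s)


def transfer_seconds(size_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Round trip: copy out of VRAM, then copy back before the next use."""
    return 2 * size_bytes / bandwidth_bytes_per_s


def choose_strategy(t: TensorInfo, bandwidth_bytes_per_s: float) -> str:
    """Pick the cheaper way to free the tensor's VRAM between two uses."""
    if t.recompute_seconds <= transfer_seconds(t.size_bytes, bandwidth_bytes_per_s):
        return "rematerialize"
    return "offload"


# Assumed host<->GPU bandwidth of 16 GB/s, chosen only for illustration.
PCIE_BANDWIDTH = 16e9

activation = TensorInfo("silu_out", size_bytes=256 * 2**20, recompute_seconds=0.002)
weight = TensorInfo("mlp_weight", size_bytes=256 * 2**20, recompute_seconds=float("inf"))

print(choose_strategy(activation, PCIE_BANDWIDTH))  # cheap elementwise op
print(choose_strategy(weight, PCIE_BANDWIDTH))      # weights cannot be recomputed
```

A framework-level tool that only sees whole layers cannot make this choice tensor by tensor, which is the distinction the table is drawing.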
- Linux (kernel ≥ 5.11 for io_uring support)
- Python ≥ 3.10
- PyTorch ≥ 2.2
- CUDA ≥ 12.0
- An NVMe drive (recommended, not required — falls back to RAM-only mode)
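
A quick preflight check for the first two requirements can be written with the standard library alone. This is a sketch, not a script shipped with TensorShunt; `parse_kernel_release` and `check_requirements` are hypothetical helper names, and the PyTorch and CUDA versions are not checked here.

```python
import platform
import sys


def parse_kernel_release(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    major, minor = release.split(".")[:2]
    # Keep only the leading digits of the minor part (e.g. '11-rc1' -> 11).
    digits = ""
    for ch in minor:
        if not ch.isdigit():
            break
        digits += ch
    return int(major), int(digits)


def check_requirements(release: str, python_version: tuple[int, int]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if parse_kernel_release(release) < (5, 11):
        problems.append(f"kernel {release} is older than 5.11 (io_uring support)")
    if python_version < (3, 10):
        problems.append("Python is older than 3.10")
    return problems


if __name__ == "__main__":
    if platform.system() != "Linux":
        print("TensorShunt targets Linux; skipping kernel check")
    else:
        for problem in check_requirements(platform.release(), sys.version_info[:2]):
            print("requirement not met:", problem)
```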
Currently, TensorShunt requires a source build to compile the native C++ runtime (CUDA 12.x required). Pre-compiled manylinux wheels are coming soon.
git clone https://github.com/eladwf/TensorShunt.git
cd TensorShunt
python -m venv .venv
source .venv/bin/activate
# Required if building the MLIR compiler passes (adjust paths for your LLVM version)
export LLVM_DIR=/usr/lib/llvm-20/lib/cmake/llvm
export MLIR_DIR=/usr/lib/llvm-20/lib/cmake/mlir
pip install -e ".[dev]"

import torch
import tensorshunt
from transformers import AutoModelForCausalLM
# Any PyTorch model — no modifications needed
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-70B")
# Compile with TensorShunt backend
optimized = torch.compile(model, backend=tensorshunt.backend())
# Run training as normal
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for batch in dataloader:
    loss = optimized(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

optimized = torch.compile(model, backend=tensorshunt.backend(
    nvme_path="/mnt/fast_nvme",     # NVMe spill directory
    max_vram_budget_gb=20,          # Leave 4GB headroom on a 24GB card
    remat_strategy="aggressive",    # Prefer recompute over transfer
    compression="lossy",            # FP8 compression on offloaded tensors
    profile=True,                   # Emit execution trace
))

TensorShunt has three layers:
┌─────────────────────────────────────────────────┐
│ Python Binding Layer (torch.compile backend) │ ← You interact here
├─────────────────────────────────────────────────┤
│ MLIR Compiler Engine │ ← Graph analysis & rewriting
│ ┌───────────┬──────────┬──────────┬──────────┐ │
│ │ Cost │ Liveness │ Transfer │ Remat │ │
│ │ Modeler │ Analysis │ Inject │ Pass │ │
│ └───────────┴──────────┴──────────┴──────────┘ │
├─────────────────────────────────────────────────┤
│ Native Orchestrator (C++ Runtime) │ ← Bare-metal execution
│ ┌──────────────┬────────────┬────────────────┐ │
│ │ io_uring │ CUDA │ Memory Pool │ │
│ │ I/O Engine │ Dispatcher │ Manager │ │
│ └──────────────┴────────────┴────────────────┘ │
└─────────────────────────────────────────────────┘
│ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│ NVMe │ │ GPU │ │ RAM │
└─────────┘ └─────────┘ └─────────┘
- Python layer captures the model graph via torch.compile / torch._dynamo
- MLIR compiler analyzes tensor lifetimes, queries hardware capabilities, and rewrites the graph to insert async memory transfers at optimal points
- Native runtime executes the scheduled graph on bare metal using io_uring for disk I/O and CUDA streams for GPU compute, always overlapping data movement with computation
For the full architecture deep-dive, see DESIGN.md.
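
As a rough illustration of the liveness-analysis step, the sketch below records every step at which each tensor is used in a linear op schedule, then reports idle gaps long enough that the tensor could be evicted from VRAM and prefetched back before its next use. It is a toy model under stated assumptions (a straight-line schedule, hypothetical op and tensor names, and an arbitrary `min_gap` threshold), not the MLIR pass itself.

```python
from collections import defaultdict

# A straight-line schedule: (op_name, inputs, outputs). Names are hypothetical.
SCHEDULE = [
    ("embed",   ["tokens", "w_embed"], ["h0"]),
    ("linear1", ["h0", "w1"],          ["h1"]),
    ("silu",    ["h1"],                ["h2"]),
    ("linear2", ["h2", "w2"],          ["h3"]),
    ("add",     ["h3", "h0"],          ["h4"]),  # residual: h0 is reused late
]


def use_steps(schedule):
    """Map each tensor to the ordered list of steps that read or write it."""
    steps = defaultdict(list)
    for step, (_, inputs, outputs) in enumerate(schedule):
        for name in inputs + outputs:
            steps[name].append(step)
    return dict(steps)


def offload_windows(schedule, min_gap=2):
    """Find idle gaps between consecutive uses of a tensor.

    Returns (tensor, evict_after_step, prefetch_before_step) triples for
    every gap of at least `min_gap` steps with no use in between.
    """
    windows = []
    for name, steps in use_steps(schedule).items():
        for prev, nxt in zip(steps, steps[1:]):
            if nxt - prev - 1 >= min_gap:
                windows.append((name, prev, nxt))
    return windows


for name, evict_after, prefetch_before in offload_windows(SCHEDULE):
    print(f"{name}: evict after step {evict_after}, prefetch before step {prefetch_before}")
```

In this schedule only `h0` qualifies: it is idle during steps 2 and 3 and needed again at step 4 for the residual add. A real pass would also weigh tensor size and transfer cost before inserting the transfer.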
TensorShunt is currently in Beta. It proves the core latency-hiding hypothesis, but has a few limitations we are actively working to resolve:
- Incomplete Native Kernel Coverage: We natively support the core operations required for LLM MLPs (Linear, RMSNorm, SiLU, Mul, Add). However, some complex operators (like certain variants of RoPE or FlashAttention) still fall back to eager PyTorch execution.
  - The Fix: We are actively expanding the native OpKind C++ dispatchers in runtime/src/graph_executor.cpp and integrating Cutlass/FlashAttention directly into the bare-metal runtime.
- Static Graph Requirement: TensorShunt currently relies on torch.compile capturing static computational graphs. Highly dynamic shapes or Python control flow (torch.cond) cause graph breaks that limit offloading efficiency.
  - The Fix: Enhancing the MLIR compiler passes to support dynamic shape propagation and symbolic memory budgeting.
- Single-Node Focus: The current engine is optimized for single-GPU or single-node NVMe/RAM offloading.
  - The Fix: Integration with FSDP/DDP for distributed, multi-node TensorShunt clusters with NVMe striping.
We simulated a 3.5 GB VRAM GPU on an RTX 4070 Super using expandable_segments to limit PyTorch's memory access. We then attempted to run mistralai/Mistral-7B-v0.1.
| Execution Method | Model Precision | VRAM Required | Result |
|---|---|---|---|
| Eager PyTorch | FP16 | ~14 GB | OOM: CUDA out of memory |
| PyTorch Quantized | 4-bit (bitsandbytes) | ~4.5 GB | OOM: CUDA out of memory |
| TensorShunt | FP16 | 3.5 GB | Success (322 ms) |
TensorShunt dynamically paged 12.05 GB of weights from pinned host RAM directly into a 3.5 GB VRAM staging pool, hiding 35.4% of the PCIe transfer latency behind active computation. It ran the model at FP16 precision under a VRAM limit where PyTorch runs out of memory even with 4-bit quantization.
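
One way to read a "hidden" percentage like the one above: with prefetching, the transfer for the next chunk of weights runs while the current chunk computes, so only the part of each transfer that outlasts the overlapping compute is exposed. The sketch below models that for per-layer timings. The timings are made-up numbers, and this definition of hidden fraction is an assumption; the benchmark script may measure it differently.

```python
def hidden_fraction(compute_s, transfer_s):
    """Fraction of total transfer time overlapped with compute.

    Model: the transfer for layer i+1 is issued when layer i starts
    computing, so it can hide behind compute_s[i]. The first transfer has
    nothing to hide behind and is fully exposed.
    """
    if len(compute_s) != len(transfer_s):
        raise ValueError("need one compute and one transfer time per layer")
    total = sum(transfer_s)
    if total == 0:
        return 1.0
    hidden = 0.0
    for i in range(1, len(transfer_s)):
        hidden += min(transfer_s[i], compute_s[i - 1])
    return hidden / total


# Made-up per-layer timings in milliseconds, with transfers taking longer
# than compute, as can happen when weights stream over PCIe at batch size 1.
compute_ms = [3.0, 3.0, 3.0, 3.0]
transfer_ms = [8.0, 8.0, 8.0, 8.0]

print(f"hidden: {hidden_fraction(compute_ms, transfer_ms):.1%}")
```

Under this model the hidden fraction rises as compute per layer grows relative to transfer time, for example with larger batch sizes or longer sequences.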
To reproduce this benchmark locally:
TENSORSHUNT_MAX_VRAM_GB=3.5 TENSORSHUNT_MAX_RAM_GB=4.0 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python benchmarks/scripts/run_real_hf_trial.py \
--model-id "mistralai/Mistral-7B-v0.1" \
--batch-size 1 --seq-len 128 --dtype float16 \
--fallback-policy raise \
--include-quantized --quantization 4bit \
--simulate-gpu-vram-gb 3.5

./benchmarks/scripts/run_python_e2e_demo.sh --device auto --profile

This generates a reproducible JSON artifact (default: benchmarks/results/python_e2e_demo_latest.json) containing latency and peak-VRAM comparisons plus environment/config metadata for reruns. Use TENSORSHUNT_PY_E2E_OUTPUT_NAME and TENSORSHUNT_PY_E2E_RUN_LABEL to capture profile-specific runs.
TensorShunt/
├── runtime/ # C++ native orchestrator (io_uring, CUDA, memory pools)
├── compiler/ # C++ MLIR compiler passes (cost model, liveness, scheduling)
├── python/ # Python bindings and torch.compile backend
├── profiler/ # Execution profiler and dashboard
├── benchmarks/ # Cross-component benchmark suite
├── docs/ # Detailed documentation
├── third_party/ # Vendored dependencies
└── tools/ # Dev scripts and utilities
| Document | Description |
|---|---|
| DESIGN.md | Full product design, architecture, competitive analysis, and roadmap |
| CONTRIBUTING.md | How to contribute, build, and test |
| docs/architecture.md | Detailed technical architecture |
| docs/getting-started.md | Installation and first-use guide |
| docs/configuration.md | All configuration options explained |
| docs/benchmarking.md | How to run and interpret benchmarks |
We welcome contributions! See CONTRIBUTING.md for:
- Build instructions
- Code style and conventions
- Testing requirements
- PR process
TensorShunt is licensed under the Business Source License 1.1 (BSL). It is free for non-production use and internal deployments. It restricts offering the software as a competing managed commercial service. The license automatically converts to an open-source Apache 2.0 license after four years. See the LICENSE file for details.
Beta — Core Engine Proven. TensorShunt successfully intercepts real HuggingFace models, lowers them to a native io_uring + CUDA C++ runtime, and executes them in constrained VRAM environments where native PyTorch fails. Pre-compiled wheels and multi-node FSDP integration are slated for upcoming releases.