eladwf/TensorShunt

TensorShunt

The hardware multiplier for AI.
Break the VRAM barrier without rewriting a single line of Python.


What is TensorShunt?

TensorShunt is an AOT (ahead-of-time) compiler pass and native runtime that intercepts PyTorch computation graphs and automatically manages memory across your entire hardware stack (GPU VRAM, CPU RAM, and NVMe storage), overlapping data movement with active computation to hide transfer latency.

The problem: You have a model that needs 80 GB of VRAM and a 24 GB GPU. The existing options are too slow (HuggingFace Accelerate), too complex (DeepSpeed with 50+ config knobs), or too limited (Unsloth is LoRA-only).

The solution: One line of Python.

import torch
import tensorshunt

model = YourMassiveModel()
optimized = torch.compile(model, backend=tensorshunt.backend())

# Training works. Inference works. No config files. No PhD required.
loss = optimized(inputs).sum()
loss.backward()

Why Not Just Use DeepSpeed / FSDP / Accelerate?

Every existing solution operates at the framework level — they see "layers" and "parameter groups." TensorShunt operates at the compiler IR level (MLIR), which means it sees every individual tensor, every operation, and every data dependency. This enables optimizations that are architecturally impossible in framework-level tools:

Capability DeepSpeed FSDP Accelerate TensorShunt
NVMe offloading ⚠️ (Sync only)
Per-tensor scheduling
Guaranteed latency hiding ⚠️ (Heuristics) ✅ (Compiler-driven)
Auto rematerialization
io_uring (modern async I/O)
Zero-config setup
Training + Inference

Quickstart

Requirements

  • Linux (kernel ≥ 5.11 for io_uring support)
  • Python ≥ 3.10
  • PyTorch ≥ 2.2
  • CUDA ≥ 12.0
  • An NVMe drive (recommended, not required — falls back to RAM-only mode)
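
The kernel and Python requirements can be checked with a few lines of standard-library Python before starting a source build. This is a minimal sketch and not part of TensorShunt; the helper names `parse_kernel_release` and `meets_requirements` are made up for illustration, and the PyTorch and CUDA requirements are not checked here.

```python
from __future__ import annotations

import platform
import re
import sys


def parse_kernel_release(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '6.5.0-35-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def meets_requirements(system: str, release: str, python_version: tuple[int, int]) -> bool:
    """True if the platform is Linux >= 5.11 running Python >= 3.10."""
    if system != "Linux":
        return False
    return parse_kernel_release(release) >= (5, 11) and python_version >= (3, 10)


if __name__ == "__main__":
    ok = meets_requirements(platform.system(), platform.release(), sys.version_info[:2])
    print("requirements met" if ok else "requirements not met")
```

Passing the system name and release string in as arguments, rather than reading them inside the function, keeps the check easy to test on any machine.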

Installation

Currently, TensorShunt requires a source build to compile the native C++ runtime (CUDA 12.x required). Pre-compiled manylinux wheels are coming soon.

git clone https://github.com/eladwf/TensorShunt.git
cd TensorShunt

python -m venv .venv
source .venv/bin/activate

# Required if building the MLIR compiler passes (adjust paths for your LLVM version)
export LLVM_DIR=/usr/lib/llvm-20/lib/cmake/llvm
export MLIR_DIR=/usr/lib/llvm-20/lib/cmake/mlir

pip install -e ".[dev]"

Basic Usage

import torch
import tensorshunt
from transformers import AutoModelForCausalLM

# Any PyTorch model, no modifications needed
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-70B")

# Compile with TensorShunt backend
optimized = torch.compile(model, backend=tensorshunt.backend())

# Run training as normal.
# `dataloader` is your own iterable of tokenized batches that include labels.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for batch in dataloader:
    loss = optimized(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Advanced Configuration

optimized = torch.compile(model, backend=tensorshunt.backend(
    nvme_path="/mnt/fast_nvme",          # NVMe spill directory
    max_vram_budget_gb=20,               # Leave 4GB headroom on a 24GB card
    remat_strategy="aggressive",         # Prefer recompute over transfer
    compression="lossy",                 # FP8 compression on offloaded tensors
    profile=True,                        # Emit execution trace
))
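
The remat_strategy option trades recomputation against data transfer. The sketch below shows the kind of comparison such a choice rests on. It is a toy cost model written for illustration, not TensorShunt's cost modeler: the names `TensorCost` and `choose`, the `bias` parameter, and every bandwidth and throughput figure are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorCost:
    name: str
    size_bytes: int       # bytes moved if the tensor is offloaded and reloaded
    recompute_flops: int  # FLOPs needed to regenerate it from its inputs


def transfer_seconds(t: TensorCost, bandwidth_bytes_per_s: float) -> float:
    # Offloading costs one copy out and one copy back in.
    return 2 * t.size_bytes / bandwidth_bytes_per_s


def recompute_seconds(t: TensorCost, flops_per_s: float) -> float:
    return t.recompute_flops / flops_per_s


def choose(t: TensorCost, bandwidth_bytes_per_s: float, flops_per_s: float,
           bias: float = 1.0) -> str:
    """Return 'recompute' or 'transfer'.

    bias > 1 tolerates recompute that is up to `bias` times slower than the
    transfer, a hypothetical stand-in for an "aggressive" setting.
    """
    if recompute_seconds(t, flops_per_s) <= bias * transfer_seconds(t, bandwidth_bytes_per_s):
        return "recompute"
    return "transfer"


if __name__ == "__main__":
    # Assumed hardware: 25 GB/s host link, 10 TFLOP/s sustained compute.
    silu = TensorCost("silu_out", 64_000_000, 32_000_000)
    matmul = TensorCost("matmul_out", 64_000_000, 2_000_000_000_000)
    for t in (silu, matmul):
        print(t.name, choose(t, 25e9, 10e12))
```

Cheap elementwise outputs are nearly free to regenerate, so they favor recompute. Matmul outputs are expensive to regenerate and favor transfer unless the bias is raised.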

How It Works

TensorShunt has three layers:

┌─────────────────────────────────────────────────┐
│  Python Binding Layer (torch.compile backend)   │  ← You interact here
├─────────────────────────────────────────────────┤
│  MLIR Compiler Engine                           │  ← Graph analysis & rewriting
│  ┌───────────┬──────────┬──────────┬──────────┐ │
│  │ Cost      │ Liveness │ Transfer │ Remat    │ │
│  │ Modeler   │ Analysis │ Inject   │ Pass     │ │
│  └───────────┴──────────┴──────────┴──────────┘ │
├─────────────────────────────────────────────────┤
│  Native Orchestrator (C++ Runtime)              │  ← Bare-metal execution
│  ┌──────────────┬────────────┬────────────────┐ │
│  │ io_uring     │ CUDA       │ Memory Pool    │ │
│  │ I/O Engine   │ Dispatcher │ Manager        │ │
│  └──────────────┴────────────┴────────────────┘ │
└─────────────────────────────────────────────────┘
         │              │              │
    ┌────┴────┐   ┌────┴────┐   ┌────┴────┐
    │  NVMe   │   │   GPU   │   │   RAM   │
    └─────────┘   └─────────┘   └─────────┘

  1. Python layer captures the model graph via torch.compile / torch._dynamo
  2. MLIR compiler analyzes tensor lifetimes, queries hardware capabilities, and rewrites the graph to insert async memory transfers at the points the cost model selects
  3. Native runtime executes the scheduled graph on bare metal using io_uring for disk I/O and CUDA streams for GPU compute, overlapping data movement with computation wherever the schedule allows
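
The lifetime analysis in step 2 can be pictured on a toy graph. The sketch below is plain Python rather than MLIR and is not taken from the TensorShunt compiler. The `Op` type and the function names are hypothetical, and tensor sizes are in arbitrary units. It computes when each tensor is live, the resulting peak footprint, and how long each tensor sits idle between touches, which is the kind of signal an offloading pass can use.

```python
from __future__ import annotations

from typing import NamedTuple


class Op(NamedTuple):
    name: str
    inputs: tuple[str, ...]
    output: str


def live_intervals(ops, graph_inputs=()):
    """Map each tensor to (first step it exists, last step it is read)."""
    first = {t: 0 for t in graph_inputs}
    last = {t: 0 for t in graph_inputs}
    for step, op in enumerate(ops):
        for t in op.inputs:
            last[t] = step
        first.setdefault(op.output, step)
        last.setdefault(op.output, step)
    return {t: (first[t], last[t]) for t in first}


def peak_size(ops, sizes, graph_inputs=()):
    """Largest total size of tensors that are live at the same step."""
    intervals = live_intervals(ops, graph_inputs)
    return max(
        sum(sizes[t] for t, (start, end) in intervals.items() if start <= step <= end)
        for step in range(len(ops))
    )


def idle_gaps(ops, graph_inputs=()):
    """Longest run of steps each tensor goes untouched between two uses."""
    touches = {t: [0] for t in graph_inputs}
    for step, op in enumerate(ops):
        for t in op.inputs:
            touches.setdefault(t, []).append(step)
        touches.setdefault(op.output, []).append(step)
    return {t: max((b - a for a, b in zip(s, s[1:])), default=0)
            for t, s in touches.items()}


if __name__ == "__main__":
    # A residual MLP block: out = linear2(silu(linear1(x))) + x
    ops = [
        Op("linear1", ("x", "w1"), "h"),
        Op("silu", ("h",), "a"),
        Op("linear2", ("a", "w2"), "y"),
        Op("add", ("y", "x"), "out"),
    ]
    sizes = {"x": 4, "w1": 16, "w2": 16, "h": 8, "a": 8, "y": 4, "out": 4}
    inputs = ("x", "w1", "w2")
    print(live_intervals(ops, inputs))
    print("peak:", peak_size(ops, sizes, inputs))
    print("idle gaps:", idle_gaps(ops, inputs))
```

In this example the residual input `x` and the weight `w2` have the longest idle gaps, so they are the natural candidates to keep off the GPU until shortly before they are needed.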

For the full architecture deep-dive, see DESIGN.md.

Current Limitations & Roadmap

TensorShunt is currently in Beta. It proves the core latency-hiding hypothesis, but has a few limitations we are actively working to resolve:

  1. Incomplete Native Kernel Coverage: We natively support the core operations required for LLM MLPs (Linear, RMSNorm, SiLU, Mul, Add). However, some complex operators (like certain variants of RoPE or FlashAttention) still fall back to eager PyTorch execution.
    • The Fix: We are actively expanding the native OpKind C++ dispatchers in runtime/src/graph_executor.cpp and integrating CUTLASS/FlashAttention directly into the bare-metal runtime.
  2. Static Graph Requirement: TensorShunt currently relies on torch.compile capturing static computational graphs. Highly dynamic shapes or data-dependent Python control flow (branches that would otherwise have to be expressed with torch.cond) cause graph breaks that limit offloading efficiency.
    • The Fix: Enhancing the MLIR compiler passes to support dynamic shape propagation and symbolic memory budgeting.
  3. Single-Node Focus: The current engine is optimized for single-GPU or single-node NVMe/RAM offloading.
    • The Fix: Integration with FSDP/DDP for distributed, multi-node TensorShunt clusters with NVMe striping.

Mistral-7B on a 3.5GB GPU

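We simulated a 3.5 GB VRAM GPU on an RTX 4070 Super by capping the memory available to PyTorch at 3.5 GB (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True together with --simulate-gpu-vram-gb 3.5, as in the command below). We then attempted to run mistralai/Mistral-7B-v0.1.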

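| Execution Method  | Model Precision      | VRAM Required | Result                  |
|-------------------|----------------------|---------------|-------------------------|
| Eager PyTorch     | FP16                 | ~14 GB        | OOM: CUDA out of memory |
| PyTorch Quantized | 4-bit (bitsandbytes) | ~4.5 GB       | OOM: CUDA out of memory |
| TensorShunt       | FP16                 | 3.5 GB        | Success (322 ms)        |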
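
The ~14 GB FP16 figure follows from parameter count alone. Here is a minimal sketch of that arithmetic, assuming a nominal 7 billion parameters and counting weights only. Activations, KV cache, and allocator overhead are not modeled, so the weights-only 4-bit number comes out lower than the table's ~4.5 GB. The function names are illustrative.

```python
def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Bytes needed to store the weights alone at the given precision."""
    return num_params * bits_per_param // 8


def to_gb(num_bytes: float) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_bytes / 1e9


def resident_fraction(num_params: int, bits_per_param: int, budget_gb: float) -> float:
    """Fraction of the weights that fits in the VRAM budget at one time."""
    return min(1.0, budget_gb * 1e9 / weight_bytes(num_params, bits_per_param))


if __name__ == "__main__":
    params = 7_000_000_000  # nominal "7B"; an assumption, not an exact count
    print("FP16 weights (GB):", to_gb(weight_bytes(params, 16)))
    print("4-bit weights (GB):", to_gb(weight_bytes(params, 4)))
    print("FP16 fraction resident in 3.5 GB:", resident_fraction(params, 16, 3.5))
```

Under these assumptions only about a quarter of the FP16 weights can be resident at once, so the rest has to be streamed in during execution.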

TensorShunt dynamically paged 12.05 GB of weights from pinned host RAM directly into a 3.5 GB VRAM staging pool, hiding 35.4% of the PCIe transfer latency behind active computation. It successfully executed the model in unquantized FP16, a configuration that runs out of memory in stock PyTorch even with 4-bit quantization.
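
To make "hiding a fraction of the transfer latency" concrete, here is a toy model of double-buffered layer paging with one copy engine and one compute engine. It is an illustration with made-up timings, not the TensorShunt scheduler, and it does not attempt to reproduce the 35.4% figure. `simulate` and `hidden_fraction` are hypothetical names.

```python
def simulate(transfer, compute):
    """Return (serial_time, pipelined_time) for per-layer durations.

    transfer[i] is the time to copy layer i's weights to the GPU and
    compute[i] is the time to execute layer i. Two staging buffers are
    assumed, so the copy for layer i must wait until layer i-2 has run.
    """
    serial = sum(transfer) + sum(compute)
    copy_done = []     # time at which each layer's weights have arrived
    compute_done = []  # time at which each layer has finished executing
    for i, (t, c) in enumerate(zip(transfer, compute)):
        buffer_free = compute_done[i - 2] if i >= 2 else 0.0
        prev_copy = copy_done[i - 1] if i >= 1 else 0.0
        copy_done.append(max(buffer_free, prev_copy) + t)
        prev_compute = compute_done[i - 1] if i >= 1 else 0.0
        compute_done.append(max(copy_done[i], prev_compute) + c)
    return serial, compute_done[-1]


def hidden_fraction(transfer, compute):
    """Share of total transfer time that overlapping removed from the wall clock."""
    serial, pipelined = simulate(transfer, compute)
    return (serial - pipelined) / sum(transfer)


if __name__ == "__main__":
    # Transfer-bound: copies take 3 units, compute takes 1 unit per layer.
    print(simulate([3] * 4, [1] * 4), hidden_fraction([3] * 4, [1] * 4))
    # Compute-bound: copies take 1 unit, compute takes 3 units per layer.
    print(simulate([1] * 4, [3] * 4), hidden_fraction([1] * 4, [3] * 4))
```

In this model the first copy can never be hidden, and a transfer-bound workload hides much less than a compute-bound one. The fraction therefore depends on the ratio of compute to copy time per layer.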

To reproduce this benchmark locally:

TENSORSHUNT_MAX_VRAM_GB=3.5 TENSORSHUNT_MAX_RAM_GB=4.0 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python benchmarks/scripts/run_real_hf_trial.py \
  --model-id "mistralai/Mistral-7B-v0.1" \
  --batch-size 1 --seq-len 128 --dtype float16 \
  --fallback-policy raise \
  --include-quantized --quantization 4bit \
  --simulate-gpu-vram-gb 3.5

./benchmarks/scripts/run_python_e2e_demo.sh --device auto --profile

The end-to-end demo script generates a reproducible JSON artifact (default: benchmarks/results/python_e2e_demo_latest.json) containing latency and peak-VRAM comparisons plus environment/config metadata for reruns. Use TENSORSHUNT_PY_E2E_OUTPUT_NAME and TENSORSHUNT_PY_E2E_RUN_LABEL to capture profile-specific runs.
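
The artifact can be inspected without knowing its schema in advance. The sketch below only parses the JSON and lists its top-level keys. It assumes nothing about field names, which are not documented in this README. The default path is the one quoted above, and the function names are illustrative.

```python
import json
from pathlib import Path

DEFAULT_ARTIFACT = Path("benchmarks/results/python_e2e_demo_latest.json")


def load_artifact(path=DEFAULT_ARTIFACT):
    """Parse the benchmark artifact and return it as a Python object."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def top_level_keys(artifact):
    """Sorted top-level keys, or an empty list if the root is not an object."""
    return sorted(artifact) if isinstance(artifact, dict) else []


if __name__ == "__main__":
    if DEFAULT_ARTIFACT.exists():
        print(top_level_keys(load_artifact()))
    else:
        print(f"{DEFAULT_ARTIFACT} not found; run the demo script first")
```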

Project Structure

TensorShunt/
├── runtime/          # C++ native orchestrator (io_uring, CUDA, memory pools)
├── compiler/         # C++ MLIR compiler passes (cost model, liveness, scheduling)
├── python/           # Python bindings and torch.compile backend
├── profiler/         # Execution profiler and dashboard
├── benchmarks/       # Cross-component benchmark suite
├── docs/             # Detailed documentation
├── third_party/      # Vendored dependencies
└── tools/            # Dev scripts and utilities

Documentation

| Document                | Description                                                           |
|-------------------------|-----------------------------------------------------------------------|
| DESIGN.md               | Full product design, architecture, competitive analysis, and roadmap |
| CONTRIBUTING.md         | How to contribute, build, and test                                    |
| docs/architecture.md    | Detailed technical architecture                                       |
| docs/getting-started.md | Installation and first-use guide                                      |
| docs/configuration.md   | All configuration options explained                                   |
| docs/benchmarking.md    | How to run and interpret benchmarks                                   |

Contributing

We welcome contributions! See CONTRIBUTING.md for:

  • Build instructions
  • Code style and conventions
  • Testing requirements
  • PR process

License

TensorShunt is licensed under the Business Source License 1.1 (BSL). It is free for non-production use and internal deployments. It restricts offering the software as a competing managed commercial service. The license automatically converts to an open-source Apache 2.0 license after four years. See the LICENSE file for details.

Status

Beta — Core Engine Proven. TensorShunt successfully intercepts real HuggingFace models, lowers them to a native io_uring + CUDA C++ runtime, and executes them in constrained VRAM environments where native PyTorch fails. Pre-compiled wheels and multi-node FSDP integration are slated for upcoming releases.
