TensorShunt

The hardware multiplier for AI.
Break the VRAM barrier without rewriting a single line of Python.

Quickstart • How It Works • Benchmarks • Docs • Contributing

What is TensorShunt?

TensorShunt is an AOT (Ahead-Of-Time) compiler pass and native runtime that intercepts PyTorch computation graphs and automatically manages memory across your entire hardware stack (GPU VRAM, CPU RAM, and NVMe storage), overlapping data movement with active computation to hide transfer latency.

The problem: You have a model that needs 80GB of VRAM. You have a 24GB GPU. Current options are too slow (HuggingFace Accelerate), too complex (DeepSpeed with 50+ config knobs), or too limited (Unsloth supports LoRA only).

The solution: One line of Python.

import torch
import tensorshunt

model = YourMassiveModel()  # placeholder: any torch.nn.Module
optimized = torch.compile(model, backend=tensorshunt.backend())

# Training works. Inference works. No config files. No PhD required.
loss = optimized(inputs).sum()
loss.backward()
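
For readers unfamiliar with the `backend=` argument: torch.compile accepts any callable that receives a torch.fx.GraphModule plus example inputs and returns a callable. The sketch below is not TensorShunt's backend. It is a minimal stand-in (the name `inspecting_backend` is hypothetical) that only records the captured operations and then runs the graph unchanged, to show where a backend hooks in.

```python
import torch

captured_ops = []  # filled in when the backend is invoked at first call


def inspecting_backend(gm: torch.fx.GraphModule, example_inputs):
    """Toy torch.compile backend: list the graph's ops, then run it as-is."""
    for node in gm.graph.nodes:
        if node.op in ("call_function", "call_method", "call_module"):
            captured_ops.append(str(node.target))
    # Returning gm.forward executes the captured graph without any rewriting.
    return gm.forward


def f(x):
    return torch.relu(x) + 1


compiled = torch.compile(f, backend=inspecting_backend)
out = compiled(torch.ones(4))
print(captured_ops)
```

A real backend would analyze or rewrite `gm` before returning a callable; this one only demonstrates the interface.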

Why Not Just Use DeepSpeed / FSDP / Accelerate?

Every existing solution operates at the framework level — they see "layers" and "parameter groups." TensorShunt operates at the compiler IR level (MLIR), which means it sees every individual tensor, every operation, and every data dependency. This enables optimizations that are architecturally impossible in framework-level tools:

Capability	DeepSpeed	FSDP	Accelerate	TensorShunt
NVMe offloading	✅	❌	⚠️ (Sync only)	✅
Per-tensor scheduling	❌	❌	❌	✅
Guaranteed latency hiding	⚠️ (Heuristics)	❌	❌	✅ (Compiler-driven)
Auto rematerialization	❌	❌	❌	✅
`io_uring` (modern async I/O)	❌	❌	❌	✅
Zero-config setup	❌	❌	✅	✅
Training + Inference	✅	✅	✅	✅

Quickstart

Requirements

- Linux (kernel ≥ 5.11 for io_uring support)
- Python ≥ 3.10
- PyTorch ≥ 2.2
- CUDA ≥ 12.0
- An NVMe drive (recommended, not required; falls back to RAM-only mode)
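
Before building, it can help to confirm the kernel and Python versions listed above. The following is a small stand-alone check using only the standard library; the helper names are illustrative and not part of TensorShunt.

```python
import platform
import re
import sys

MIN_KERNEL = (5, 11)  # minimum kernel version from the requirements list
MIN_PYTHON = (3, 10)  # minimum Python version from the requirements list


def parse_kernel_release(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_ok(release: str) -> bool:
    return parse_kernel_release(release) >= MIN_KERNEL


def python_ok(version_info=sys.version_info) -> bool:
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    if platform.system() == "Linux":
        print("kernel ok:", kernel_ok(platform.release()))
    else:
        print("not running on Linux")
    print("python ok:", python_ok())
```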

Installation

Currently, TensorShunt requires a source build to compile the native C++ runtime (CUDA 12.x required). Pre-compiled manylinux wheels are coming soon.

git clone https://github.com/eladwf/TensorShunt.git
cd TensorShunt

python -m venv .venv
source .venv/bin/activate

# Required if building the MLIR compiler passes (adjust paths for your LLVM version)
export LLVM_DIR=/usr/lib/llvm-20/lib/cmake/llvm
export MLIR_DIR=/usr/lib/llvm-20/lib/cmake/mlir

pip install -e ".[dev]"

Basic Usage

import torch
import tensorshunt
from transformers import AutoModelForCausalLM

# Any PyTorch model, no modifications needed
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-70B")

# Compile with TensorShunt backend
optimized = torch.compile(model, backend=tensorshunt.backend())

# Run training as normal (dataloader is your own DataLoader yielding dicts
# of tensors, including labels so that the model output carries a loss)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for batch in dataloader:
    loss = optimized(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Advanced Configuration

optimized = torch.compile(model, backend=tensorshunt.backend(
    nvme_path="/mnt/fast_nvme",          # NVMe spill directory
    max_vram_budget_gb=20,               # Leave 4GB headroom on a 24GB card
    remat_strategy="aggressive",         # Prefer recompute over transfer
    compression="lossy",                 # FP8 compression on offloaded tensors
    profile=True,                        # Emit execution trace
))
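
To give some intuition for what a setting like remat_strategy trades off, here is a rough sketch of a recompute-versus-transfer decision. It is not TensorShunt's cost model: the `TensorCost` type, the `remat_bias` knob, the 25 GB/s bandwidth, and the timings are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TensorCost:
    name: str
    size_bytes: int           # bytes to move back to the GPU if offloaded
    recompute_seconds: float  # estimated time to regenerate it on the GPU


def transfer_seconds(size_bytes: int, bandwidth_bytes_per_s: float) -> float:
    return size_bytes / bandwidth_bytes_per_s


def choose_action(t: TensorCost, bandwidth: float, remat_bias: float = 1.0) -> str:
    """Return 'recompute' or 'transfer'; remat_bias > 1 favors recompute."""
    budget = remat_bias * transfer_seconds(t.size_bytes, bandwidth)
    return "recompute" if t.recompute_seconds <= budget else "transfer"


PCIE_BW = 25e9  # assumed effective host-to-device bandwidth, bytes/s

big_cheap = TensorCost("activation_a", 512 * 2**20, 0.005)
small_costly = TensorCost("activation_b", 64 * 2**20, 0.020)

print(choose_action(big_cheap, PCIE_BW))
print(choose_action(small_costly, PCIE_BW))
print(choose_action(small_costly, PCIE_BW, remat_bias=10.0))
```

In this toy model, a more aggressive setting corresponds to a larger `remat_bias`, which shifts borderline tensors from transfer to recompute.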

How It Works

TensorShunt has three layers:

┌─────────────────────────────────────────────────┐
│  Python Binding Layer (torch.compile backend)   │  ← You interact here
├─────────────────────────────────────────────────┤
│  MLIR Compiler Engine                           │  ← Graph analysis & rewriting
│  ┌───────────┬──────────┬──────────┬──────────┐ │
│  │ Cost      │ Liveness │ Transfer │ Remat    │ │
│  │ Modeler   │ Analysis │ Inject   │ Pass     │ │
│  └───────────┴──────────┴──────────┴──────────┘ │
├─────────────────────────────────────────────────┤
│  Native Orchestrator (C++ Runtime)              │  ← Bare-metal execution
│  ┌──────────────┬────────────┬────────────────┐ │
│  │ io_uring     │ CUDA       │ Memory Pool    │ │
│  │ I/O Engine   │ Dispatcher │ Manager        │ │
│  └──────────────┴────────────┴────────────────┘ │
└─────────────────────────────────────────────────┘
         │              │              │
    ┌────┴────┐   ┌────┴────┐   ┌────┴────┐
    │  NVMe   │   │   GPU   │   │   RAM   │
    └─────────┘   └─────────┘   └─────────┘

- Python layer: captures the model graph via torch.compile / torch._dynamo.
- MLIR compiler: analyzes tensor lifetimes, queries hardware capabilities, and rewrites the graph to insert async memory transfers at points where they can overlap with computation.
- Native runtime: executes the scheduled graph using io_uring for disk I/O and CUDA streams for GPU compute, overlapping data movement with computation wherever the schedule allows.
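
The liveness step described above can be illustrated with a toy example. This is a simplified sketch over a hand-written op list, not the MLIR pass itself; the graph, tensor names, and sizes are made up.

```python
def tensor_lifetimes(ops):
    """ops: list of (output_name, [input_names]) in execution order.

    Returns {tensor: (first_step, last_step)}, the span during which each
    tensor must be resident on the GPU.
    """
    lifetimes = {}
    for step, (out, inputs) in enumerate(ops):
        for name in inputs:
            first, _ = lifetimes.get(name, (step, step))
            lifetimes[name] = (first, step)  # extend to the latest use
        lifetimes.setdefault(out, (step, step))
    return lifetimes


def peak_memory(ops, sizes):
    """Return (peak_bytes, step) assuming tensors are resident only while live."""
    lifetimes = tensor_lifetimes(ops)
    best = (0, 0)
    for step in range(len(ops)):
        live = sum(sizes[n] for n, (a, b) in lifetimes.items() if a <= step <= b)
        if live > best[0]:
            best = (live, step)
    return best


# Hypothetical three-op graph; sizes are in arbitrary units.
ops = [("h1", ["x", "w1"]), ("h2", ["h1", "w2"]), ("y", ["h2", "h1"])]
sizes = {"x": 1, "w1": 4, "w2": 4, "h1": 2, "h2": 2, "y": 1}

print(tensor_lifetimes(ops))
print(peak_memory(ops, sizes))
```

Tensors whose lifetime does not cover a given step are the ones a scheduler could keep in RAM or on NVMe at that point, which is why the peak here is lower than the sum of all sizes.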

For the full architecture deep-dive, see DESIGN.md.

Current Limitations & Roadmap

TensorShunt is currently in Beta. It proves the core latency-hiding hypothesis, but has a few limitations we are actively working to resolve:

- Incomplete Native Kernel Coverage: We natively support the core operations required for LLM MLPs (Linear, RMSNorm, SiLU, Mul, Add). However, some complex operators (such as certain variants of RoPE or FlashAttention) still fall back to eager PyTorch execution.
  - The Fix: We are actively expanding the native OpKind C++ dispatchers in runtime/src/graph_executor.cpp and integrating CUTLASS/FlashAttention directly into the bare-metal runtime.
- Static Graph Requirement: TensorShunt currently relies on torch.compile capturing static computational graphs. Highly dynamic shapes or data-dependent Python control flow (branches on tensor values that are not expressed through torch.cond) cause graph breaks that limit offloading efficiency.
  - The Fix: Enhancing the MLIR compiler passes to support dynamic shape propagation and symbolic memory budgeting.
- Single-Node Focus: The current engine is optimized for single-GPU or single-node NVMe/RAM offloading.
  - The Fix: Integration with FSDP/DDP for distributed, multi-node TensorShunt clusters with NVMe striping.
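
To make the kernel-coverage limitation concrete, the sketch below splits an op sequence into natively supported segments and eager-fallback segments. The supported set comes from the list above; the `partition` function, the op sequence, and the reading of a "raise" fallback policy as "fail instead of falling back" are assumptions for illustration only.

```python
NATIVE_OPS = {"Linear", "RMSNorm", "SiLU", "Mul", "Add"}  # from the list above


def partition(op_sequence, fallback_policy="eager"):
    """Group consecutive ops into ('native' | 'eager', [ops]) segments.

    With fallback_policy='raise', an unsupported op is an error instead of
    being routed to eager execution (assumed semantics).
    """
    segments = []
    for op in op_sequence:
        kind = "native" if op in NATIVE_OPS else "eager"
        if kind == "eager" and fallback_policy == "raise":
            raise NotImplementedError(f"no native kernel for {op}")
        if segments and segments[-1][0] == kind:
            segments[-1][1].append(op)
        else:
            segments.append((kind, [op]))
    return segments


# Hypothetical op sequence for one transformer block.
block = ["RMSNorm", "Linear", "RoPE", "FlashAttention", "Linear", "SiLU", "Mul"]
for kind, ops_in_segment in partition(block):
    print(kind, ops_in_segment)
```

Each switch between a native and an eager segment is a boundary the runtime has to cross, which is one reason broader native coverage matters.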

Mistral-7B on a 3.5GB GPU

We simulated a 3.5 GB VRAM GPU on an RTX 4070 Super by capping the memory available to PyTorch (the --simulate-gpu-vram-gb flag in the command below, run with the expandable_segments allocator setting enabled). We then attempted to run mistralai/Mistral-7B-v0.1.

Execution Method	Model Precision	VRAM Required	Result
Eager PyTorch	FP16	~14 GB	OOM: CUDA out of memory
PyTorch Quantized	4-bit (bitsandbytes)	~4.5 GB	OOM: CUDA out of memory
TensorShunt	FP16	3.5 GB	Success (322 ms)
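
As a sanity check on the "VRAM Required" column, weight storage alone can be estimated from the parameter count. The figure of roughly 7.24 billion parameters for Mistral-7B is an approximate value not taken from this README, and the estimate ignores activations, KV cache, and quantization overhead.

```python
def weight_gb(num_params: float, bits_per_param: int) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


MISTRAL_7B_PARAMS = 7.24e9  # approximate, assumed

print(f"FP16 : {weight_gb(MISTRAL_7B_PARAMS, 16):.2f} GB")
print(f"4-bit: {weight_gb(MISTRAL_7B_PARAMS, 4):.2f} GB")
```

The FP16 estimate lines up with the ~14 GB row. The 4-bit estimate is below the ~4.5 GB in the table, which is consistent with the table including overhead beyond raw weights, and both exceed the 3.5 GB limit.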

TensorShunt dynamically paged 12.05 GB of weights from pinned host RAM into a 3.5 GB VRAM staging pool, hiding 35.4% of the PCIe transfer latency behind active computation. It ran the model in unquantized FP16 under a VRAM limit at which PyTorch runs out of memory even with 4-bit quantization.
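
A "fraction of transfer latency hidden" figure can be understood with a small pipeline model. The sketch below is a simplified double-buffered simulation with made-up per-layer timings. It is not TensorShunt's scheduler and does not reproduce the 35.4% measurement.

```python
def simulate(transfer_s, compute_s, buffers=2):
    """Simulate prefetching layer i+1 while layer i computes.

    Returns (total_seconds, hidden_fraction), where hidden_fraction is the
    share of total transfer time that did not stall the compute stream.
    """
    n = len(transfer_s)
    if n == 0 or n != len(compute_s):
        raise ValueError("need equal-length, non-empty timing lists")
    t_end = [0.0] * n  # when each layer's weights finish arriving
    c_end = [0.0] * n  # when each layer's compute finishes
    for i in range(n):
        prev_transfer_done = t_end[i - 1] if i > 0 else 0.0
        # A staging buffer is reusable once the layer that held it is done.
        buffer_free = c_end[i - buffers] if i >= buffers else 0.0
        t_end[i] = max(prev_transfer_done, buffer_free) + transfer_s[i]
        prev_compute_done = c_end[i - 1] if i > 0 else 0.0
        c_end[i] = max(t_end[i], prev_compute_done) + compute_s[i]
    total = c_end[-1]
    stall = total - sum(compute_s)
    return total, 1.0 - stall / sum(transfer_s)


# Transfer-bound case: each layer takes 3 units to load and 1 to compute.
print(simulate([3.0] * 4, [1.0] * 4))
# Compute-bound case: each layer takes 1 unit to load and 3 to compute.
print(simulate([1.0] * 4, [3.0] * 4))
```

In this model the first transfer can never be hidden, and when transfers take longer than compute only part of each one overlaps, so the hidden fraction stays well below 100%.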

To reproduce this benchmark locally:

TENSORSHUNT_MAX_VRAM_GB=3.5 TENSORSHUNT_MAX_RAM_GB=4.0 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python benchmarks/scripts/run_real_hf_trial.py \
  --model-id "mistralai/Mistral-7B-v0.1" \
  --batch-size 1 --seq-len 128 --dtype float16 \
  --fallback-policy raise \
  --include-quantized --quantization 4bit \
  --simulate-gpu-vram-gb 3.5

The end-to-end Python demo can be run separately:

./benchmarks/scripts/run_python_e2e_demo.sh --device auto --profile

The demo script writes a reproducible JSON artifact (default: benchmarks/results/python_e2e_demo_latest.json) containing latency and peak-VRAM comparisons plus environment/config metadata for reruns. Use TENSORSHUNT_PY_E2E_OUTPUT_NAME and TENSORSHUNT_PY_E2E_RUN_LABEL to capture profile-specific runs.

Project Structure

TensorShunt/
├── runtime/          # C++ native orchestrator (io_uring, CUDA, memory pools)
├── compiler/         # C++ MLIR compiler passes (cost model, liveness, scheduling)
├── python/           # Python bindings and torch.compile backend
├── profiler/         # Execution profiler and dashboard
├── benchmarks/       # Cross-component benchmark suite
├── docs/             # Detailed documentation
├── third_party/      # Vendored dependencies
└── tools/            # Dev scripts and utilities

Documentation

Document	Description
DESIGN.md	Full product design, architecture, competitive analysis, and roadmap
CONTRIBUTING.md	How to contribute, build, and test
docs/architecture.md	Detailed technical architecture
docs/getting-started.md	Installation and first-use guide
docs/configuration.md	All configuration options explained
docs/benchmarking.md	How to run and interpret benchmarks

Contributing

We welcome contributions! See CONTRIBUTING.md for:

Build instructions
Code style and conventions
Testing requirements
PR process

License

TensorShunt is licensed under the Business Source License 1.1 (BSL). It is free for non-production use and internal deployments. It restricts offering the software as a competing managed commercial service. The license automatically converts to an open-source Apache 2.0 license after four years. See the LICENSE file for details.

Status

Beta — Core Engine Proven. TensorShunt successfully intercepts real HuggingFace models, lowers them to a native io_uring + CUDA C++ runtime, and executes them in constrained VRAM environments where native PyTorch fails. Pre-compiled wheels and multi-node FSDP integration are slated for upcoming releases.
