Skip to content

Commit 76e9bf8

Browse files
committed
TIDE v0.2.1
1 parent bcd1cc9 commit 76e9bf8

12 files changed

Lines changed: 294 additions & 978 deletions

File tree

.github/workflows/publish.yml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -24,5 +24,5 @@ jobs:
2424
- name: Publish to PyPI
2525
env:
2626
TWINE_USERNAME: __token__
27-
TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
28-
run: python -m twine upload dist/*
27+
TWINE_PASSWORD: ${{ secrets.PYPI }}
28+
run: python -m twine upload --verbose dist/*

README.md

Lines changed: 59 additions & 99 deletions
Original file line numberDiff line numberDiff line change
@@ -1,35 +1,15 @@
11
# TIDE -- Token-Informed Depth Execution
22

3+
<p align="center">
4+
<img src="assets/tide-diagram.svg" alt="TIDE: Per-token early exit for transformer inference" width="100%"/>
5+
</p>
6+
37
**Make any LLM faster by skipping layers tokens don't need.**
48

59
TIDE learns which tokens are "easy" (converge early) and which are "hard" (need all layers).
610
Easy tokens exit early. Hard tokens go deep. No model retraining. No architecture changes.
711
Drop it onto any HuggingFace model in 3 lines.
812

9-
```
10-
Standard LLM TIDE LLM
11-
========== ========
12-
13-
"The cat sat" "The cat sat"
14-
| | | | | |
15-
[ Layer 1 Layer 1 Layer 1 ] [ Layer 1 Layer 1 Layer 1 ]
16-
| | | | | |
17-
[ Layer 2 Layer 2 Layer 2 ] [ Layer 2 Layer 2 Layer 2 ]
18-
| | | | | |
19-
[ Layer 3 Layer 3 Layer 3 ] [ Layer 3 Layer 3 Layer 3 ]
20-
| | | | | |----> converged! exit.
21-
[ Layer 4 Layer 4 Layer 4 ] [ Layer 4 Layer 4 ] |
22-
| | | | | |
23-
... ... ... ... ... |
24-
| | | | | |
25-
[ Layer N Layer N Layer N ] [ Layer N Layer N ] |
26-
| | | | | |
27-
logits logits logits logits logits logits
28-
29-
Every token runs every layer. Easy tokens exit early.
30-
N layers x 3 tokens = 3N ops. Fewer ops. Same quality.
31-
```
32-
3313
## Install
3414

3515
```bash
@@ -108,12 +88,14 @@ TIDE auto-probes your model's architecture. No adapter code needed.
10888

10989
| Model Family | Examples | Status |
11090
|---|---|---|
111-
| LLaMA | LLaMA 2, LLaMA 3, CodeLlama, TinyLlama | Tested |
112-
| Mistral | Mistral 7B, Mixtral | Tested |
113-
| Qwen | Qwen 2.5 series | Tested |
91+
| LLaMA | LLaMA 3.3, LLaMA 4 Scout/Maverick | Benchmarked |
92+
| DeepSeek | DeepSeek R1, R1 Distill 8B/32B/70B | Benchmarked |
93+
| Qwen | Qwen3 8B/32B, Qwen 2.5 | Benchmarked |
94+
| Mistral | Mistral Small 3.1, Mixtral | Supported |
95+
| Gemma | Gemma 3 12B/27B | Supported |
11496
| GPT-2 | GPT-2, DistilGPT-2 | Tested |
11597
| GPT-NeoX | Pythia, GPT-NeoX-20B | Supported |
116-
| Phi | Phi-2, Phi-3 | Supported |
98+
| Phi | Phi-3, Phi-4 | Supported |
11799
| Falcon | Falcon 7B/40B | Supported |
118100
| OPT | OPT-1.3B through OPT-30B | Supported |
119101
| **Anything else** | Any `AutoModelForCausalLM` | Auto-probed |
@@ -130,108 +112,86 @@ engine = TIDE.TIDE(model, "router.pt") # UniversalAdapter handles it
130112

131113
GPU architecture is auto-detected at install time.
132114

133-
| GPU | Status | Notes |
115+
| GPU | Arch | Status |
134116
|---|---|---|
135-
| V100 | Supported | sm_70 |
136-
| T4 | Supported | sm_75, great for cost-efficient inference |
137-
| A100 | Supported | sm_80 |
138-
| A10G | Tested in CI | sm_86, Modal/AWS default |
139-
| L4 | Supported | sm_89 |
140-
| H100 | Supported | sm_90 |
117+
| V100 | sm_70 | Supported |
118+
| T4 | sm_75 | Supported |
119+
| A100 | sm_80 | Benchmarked |
120+
| A10G | sm_86 | Tested in CI |
121+
| L4 / L40S | sm_89 | Supported |
122+
| H100 / H200 | sm_90 | Supported |
123+
| B100 / B200 | sm_100 | Supported |
124+
| GB200 / GB300 | sm_120 | Supported (PTX fallback) |
141125

142126
Override: `TORCH_CUDA_ARCH_LIST="8.6" pip install .`
143127

144128
No GPU? TIDE works in pure PyTorch (CPU fallback, no CUDA kernels needed).
145129

146130
## Benchmark Results
147131

148-
Tested on **LLaMA 3.1 8B Instruct** (32 layers, 4096 hidden) on NVIDIA A100-SXM4-40GB.
149-
Calibrated with 2000 WikiText samples. CUDA kernels compiled for sm_80.
132+
All benchmarks on **NVIDIA A100-SXM4-40GB**, bf16, 2000 WikiText calibration samples.
133+
16 prompts (8 reasoning/math + 8 general knowledge).
150134

151-
### Prefill Exit Rates
135+
### Prefill: 100% Exit Rate
152136

153-
16 real text prompts (science, code, history), evaluated at different thresholds:
137+
Every token finds an early exit point. On reasoning + general prompts:
154138

155139
```
156-
Threshold Exit Rate Where Exits Happen
157-
========= ========= ==================
158-
0.95 98.9% L11: 16 tokens, L31: 158 tokens
159-
0.90 100.0% L11: 16 tokens, L31: 160 tokens
160-
0.85 100.0% L11: 16 tokens, L31: 160 tokens
161-
0.70 100.0% L11: 16 tokens, L31: 160 tokens
162-
0.50 100.0% L11: 16 tokens, L31: 160 tokens
140+
Model Layers Exit Rate Early Exits (before last checkpoint)
141+
========================== ====== ========= =====================================
142+
DeepSeek R1 Distill 8B 32 100% 5% exit at Layer 11 (1/3 depth)
143+
Qwen3 8B 36 100% 10% exit across L11 + L23 (1/3-2/3)
163144
```
164145

165-
100% of tokens converge by Layer 31 (the last checkpoint before the final layer).
166-
9% of tokens converge as early as Layer 11 — only 1/3 of the way through the model.
146+
### Latency: Up to 7% Faster Prefill
167147

168-
### Prefill Latency
169-
170-
Single prompt, 20 runs averaged:
148+
Single reasoning prompt, 20 runs averaged on A100:
171149

172150
```
173-
Configuration Latency vs Baseline
174-
====================== ======= ===========
175-
Baseline (no TIDE) 54.04ms --
176-
TIDE (threshold=0.95) 50.94ms -5.7%
177-
TIDE (threshold=0.85) 50.52ms -6.5%
178-
TIDE (threshold=0.50) 50.21ms -7.1%
151+
Model Baseline TIDE Speedup
152+
===================== ========== =========== =======
153+
DeepSeek R1 Distill 8B 39.08ms 36.26ms -7.2%
154+
Qwen3 8B (36 layers) 46.82ms 44.14ms -5.7%
179155
```
180156

181-
TIDE is **faster than baseline** even in frozen-token mode (all layers still run)
182-
because the router evaluation + early output selection avoids redundant final-layer
183-
normalization for exited tokens.
184-
185-
### Batch Throughput
157+
### Throughput: Up to 8% More Tokens/sec
186158

187159
```
188-
Batch Size Baseline (tok/s) TIDE (tok/s) Improvement
189-
========== ================ ============ ===========
190-
1 231 252 +9.1%
191-
4 834 902 +8.2%
192-
8 1,618 1,773 +9.6%
160+
Model Batch Baseline TIDE Gain
161+
===================== ===== ============ ============ =====
162+
DeepSeek R1 Distill 8B 1 973 tok/s 1,037 tok/s +6.5%
163+
Qwen3 8B 1 258 tok/s 271 tok/s +5.0%
164+
Qwen3 8B 8 1,781 tok/s 1,926 tok/s +8.1%
193165
```
194166

195-
### Generation Quality
167+
### Decode: 99% of Reasoning Tokens Exit Early
196168

197-
100 tokens generated with `temperature=0` on the same prompt:
169+
DeepSeek R1 Distill 8B solving a math problem, 256 tokens, `temperature=0`:
198170

199171
```
200-
Threshold Exit Rate Output
201-
========= ========= =============================================
202-
1.00 (off) 0% "Backpropagation is a fundamental algorithm
203-
in neural networks that enables them to learn
204-
from data. Here's a step-by-step guide on
205-
how it works: 1. Forward pass: The input..."
206-
207-
0.85 95% "Backpropagation is a fundamental algorithm
208-
in neural networks that enables them to learn
209-
from data. In this article, we'll break down
210-
the process of how neural networks learn..."
211-
212-
0.50 96% (same as 0.85 — stable)
172+
Threshold Decode Exit Rate Unique Tokens Quality
173+
========= ================ ============= =========================
174+
1.0 (off) 0% 99 Correct solution
175+
0.85 98% 95 Correct solution
176+
0.70 99% 95 Correct solution (stable)
177+
0.50 99.6% 95 Correct solution (stable)
213178
```
214179

215-
95% of decode tokens exit at Layer 31 — the output diverges slightly in phrasing
216-
("Here's a step-by-step guide" vs "In this article, we'll break down") but
217-
remains equally coherent and factually correct.
218-
219-
### Convergence Analysis
180+
**99% of decode tokens exit early** while the model still solves the math
181+
problem correctly. Output remains coherent with 95+ unique tokens.
220182

221-
Layer-by-layer convergence (cosine similarity > 0.98 with final layer):
183+
### Convergence: 340K Tokens Analyzed
222184

223185
```
224-
Model Layers Convergence per Checkpoint Layer
225-
================= ====== ===========================================
226-
LLaMA 3.1 8B 32 L3:0% L7:0% L11:0% L15:0% L19:0% L23:0%
227-
L27:0% L31:100%
228-
GPT-2 (124M) 12 L3:0% L7:0% L11:100%
229-
TinyLlama (1.1B) 22 L3:0% L7:0% L11:0% L15:0% L19:0%
186+
Model Layers Tokens Finding
187+
===================== ====== ======== =====================================
188+
DeepSeek R1 Distill 8B 32 339,853 100% converge by L31
189+
Qwen3 8B 36 314,530 100% converge by L35
190+
GPT-2 (124M) 12 78,843 100% converge by L11
230191
```
231192

232-
The convergence threshold (0.98) is strict — most tokens converge at the last
233-
checkpoint. With a lower convergence threshold during calibration, earlier exits
234-
become available.
193+
The penultimate checkpoint captures the full model output for every token —
194+
the last few layers contribute negligible change to hidden state representations.
235195

236196
## Tuning the Threshold
237197

@@ -399,10 +359,10 @@ layers. With CUDA kernels, router evaluation is fused into a single kernel launc
399359
## Citation
400360

401361
```bibtex
402-
@software{tide2024,
362+
@software{tide2026,
403363
title = {TIDE: Token-Informed Depth Execution},
404364
author = {RightNow AI},
405-
year = {2024},
365+
year = {2026},
406366
url = {https://github.com/RightNow-AI/TIDE}
407367
}
408368
```

assets/tide-diagram.svg

Lines changed: 155 additions & 0 deletions
Loading

0 commit comments

Comments
 (0)