I experimented with a few optimizations: weight/activation quantization, KV cache quantization, and KV cache downsampling.
Tests were conducted on an RTX 4060 Laptop GPU.
Baseline scene: keble-college-02, streaming mode with flags --first_k 320 --num_scale_frames 2 --kv_cache_sliding_window 48.
More details are available here: https://github.com/ureeey/lingbot-map-rtx4060-8g/tree/rtx4060_8g.
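
| Summary | Weight/activation quantization | KV cache fp8 quantization | KV cache downsampling | Peak VRAM (GB) | FPS | ATE | Trajectory smoothness |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | Disabled | Disabled | Disabled | 7.13 | 3.6 | Baseline | Baseline |
| Single optimization | Mixed int4/int8 | - | - | 5.37 | 2.2 | Slight change | Slight change |
| Single optimization | fp8 | - | - | 5.68 | 3.7 | Slight change | Significantly worse |
| Single optimization | - | Enabled | - | 5.17 | 3.3 | Slight change | Slight change |
| Single optimization | - | - | Enabled | 4.4 | 5.4 | Significantly worse | Slight change |
| Minimum memory | Mixed int4/int8 | Enabled | Enabled | 2.05 | 2.7 | Significantly worse | Slight change |
| Fastest | fp8 | - | Enabled | 2.96 | 5.8 | Significantly worse | Significantly worse |
| Balanced | Mixed int4/int8 | Enabled | - | 3.41 | 2.1 | Slight change | Slight change |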
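
To make the trade-offs easier to compare, here is a small sketch that turns the peak VRAM and FPS figures from the table above into percent changes relative to the baseline. The numbers are copied from the table; the row labels and the helper names (`relative_change`, `summarize`) are my own and are not part of the repository.

```python
# Peak VRAM (GB) and FPS, copied from the results table.
# Row labels are illustrative names, not identifiers from the repo.
BASELINE = (7.13, 3.6)

CONFIGS = {
    "W/A mixed int4/int8": (5.37, 2.2),
    "W/A fp8": (5.68, 3.7),
    "KV cache fp8": (5.17, 3.3),
    "KV cache downsampling": (4.4, 5.4),
    "Minimum memory": (2.05, 2.7),
    "Fastest": (2.96, 5.8),
    "Balanced": (3.41, 2.1),
}


def relative_change(value, baseline):
    """Return the percent change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


def summarize(configs=CONFIGS, baseline=BASELINE):
    """Map each configuration to (VRAM change %, FPS change %)."""
    base_mem, base_fps = baseline
    return {
        name: (relative_change(mem, base_mem), relative_change(fps, base_fps))
        for name, (mem, fps) in configs.items()
    }


if __name__ == "__main__":
    for name, (d_mem, d_fps) in summarize().items():
        print(f"{name:<24} VRAM {d_mem:+6.1f}%   FPS {d_fps:+6.1f}%")
```

Negative VRAM values mean memory saved, and negative FPS values mean a slowdown. This only covers the numeric columns, so the ATE and trajectory smoothness ratings still need to be read from the table.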