Min memory (2.05 GB), higher FPS (1.6x), or balanced mode: pick whichever you like. #80

Description

@ureeey

I experimented with a few optimizations: weight/activation quantization, KV cache quantization, and KV cache downsampling.
Tests were conducted on an RTX 4060 Laptop GPU.
Baseline scene: keble-college-02, streaming mode with flags --first_k 320 --num_scale_frames 2 --kv_cache_sliding_window 48.
More details are available here: https://github.com/ureeey/lingbot-map-rtx4060-8g/tree/rtx4060_8g.
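
For readers less familiar with the quantization options listed above, here is a minimal, illustrative sketch of symmetric per-tensor int8 quantization in plain Python. It is not the code from the linked branch; the function names `quantize_int8` and `dequantize_int8` are hypothetical, and the real implementation (mixed int4/int8, fp8) will differ. It only shows the general trade-off: fewer bits per value in exchange for a bounded rounding error.

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization (illustrative only).

    Returns integer codes in [-127, 127] and the scale needed to restore them.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


values = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(values)
restored = dequantize_int8(codes, scale)

# The rounding error of each value is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(values, restored))
print(codes, scale, max_error)
```

Small values such as `0.003` collapse to zero here, which is one generic way that quantization can shift accuracy metrics.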

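| Summary | Weight/activation quantization | KV cache fp8 quantization | KV cache downsampling | Peak VRAM (GB) | FPS | ATE | Trajectory smoothness |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | Disabled | Disabled | Disabled | 7.13 | 3.6 | Baseline | Baseline |
| Single optimization | Mixed int4/int8 | - | - | 5.37 | 2.2 | Slight change | Slight change |
| Single optimization | fp8 | - | - | 5.68 | 3.7 | Slight change | Significantly worse |
| Single optimization | - | Enabled | - | 5.17 | 3.3 | Slight change | Slight change |
| Single optimization | - | - | Enabled | 4.4 | 5.4 | Significantly worse | Slight change |
| Min memory | Mixed int4/int8 | Enabled | Enabled | 2.05 | 2.7 | Significantly worse | Slight change |
| Fastest | fp8 | - | Enabled | 2.96 | 5.8 | Significantly worse | Significantly worse |
| Balanced | Mixed int4/int8 | Enabled | - | 3.41 | 2.1 | Slight change | Slight change |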
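
To make the trade-offs easier to compare, the short script below recomputes the relative savings from the peak VRAM and FPS figures in the table. The dictionary keys and the helper name `relative_to_baseline` are only illustrative; the numbers are copied from the table.

```python
# Peak VRAM (GB) and FPS, copied from the results table.
BASELINE = {"peak_vram_gb": 7.13, "fps": 3.6}
CONFIGS = {
    "min_memory": {"peak_vram_gb": 2.05, "fps": 2.7},
    "fastest": {"peak_vram_gb": 2.96, "fps": 5.8},
    "balanced": {"peak_vram_gb": 3.41, "fps": 2.1},
}


def relative_to_baseline(config):
    """Return (fraction of peak VRAM saved, FPS as a multiple of baseline)."""
    vram_saving = 1 - config["peak_vram_gb"] / BASELINE["peak_vram_gb"]
    fps_ratio = config["fps"] / BASELINE["fps"]
    return vram_saving, fps_ratio


for name, config in CONFIGS.items():
    saving, ratio = relative_to_baseline(config)
    print(f"{name}: {saving:.0%} less peak VRAM, {ratio:.2f}x baseline FPS")
```

The 1.6x figure in the title corresponds to the fastest configuration (5.8 FPS against 3.6 FPS). The min-memory and balanced configurations both run slower than the baseline.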
