
V0.0.1.0209 #28

Merged
luozixin2 merged 80 commits into SJTU-DENG-Lab:v0.0.1.0209 from luozixin2:v0.0.1.0209
Feb 9, 2026

Conversation


@luozixin2 luozixin2 commented Feb 9, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added comprehensive quantization support including FP8, INT8, and INT4 weight quantization with GPTQ/AWQ/Marlin formats
    • Introduced FastDLLMV2 diffusion-based decoding strategy with improved performance
    • Added benchmarking framework (diffulex_bench) for model evaluation and metrics
    • Introduced profiling module (diffulex_profiler) with multiple backend support (VizTracer, PyTorch Profiler)
    • Enhanced KV cache with FP8 support for memory efficiency
  • Refactor

    • Reorganized quantization utilities into dedicated module with layered architecture
    • Implemented lazy loading for improved import-time performance

lzx and others added 30 commits December 15, 2025 05:29
- Add KvCacheDType enum supporting bf16/fp16/fp32/fp8_e4m3/fp8_e5m2
- Add parse_kv_cache_dtype() to convert string to dtype
- Add get_fp8_dtype_for_storage() to get FP8 dtype from vLLM platform
- Add compute_fp8_scale() to compute quantization scale using absmax
- Support FP8 storage as uint8 + view(fp8_dtype) pattern
- Add helper functions for FP8 min/max bounds
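A hedged sketch of the string-to-dtype mapping and the uint8 + view(fp8_dtype) storage convention described above; the repository's parse_kv_cache_dtype() and get_fp8_dtype_for_storage() may differ in names and signatures, so treat this as illustrative only:

```python
import torch

# FP8 caches are allocated as uint8 and reinterpreted via Tensor.view(fp8_dtype).
_KV_CACHE_DTYPES = {
    "bf16": (torch.bfloat16, torch.bfloat16),
    "fp16": (torch.float16, torch.float16),
    "fp32": (torch.float32, torch.float32),
    "fp8_e4m3": (torch.float8_e4m3fn, torch.uint8),
    "fp8_e5m2": (torch.float8_e5m2, torch.uint8),
}

def parse_kv_cache_dtype_sketch(name: str) -> tuple[torch.dtype, torch.dtype]:
    """Return (compute/view dtype, storage dtype) for a kv_cache_dtype string."""
    try:
        return _KV_CACHE_DTYPES[name]
    except KeyError:
        raise ValueError(f"Unsupported kv_cache_dtype: {name!r}") from None

view_dtype, storage_dtype = parse_kv_cache_dtype_sketch("fp8_e4m3")
cache = torch.zeros(4, 16, dtype=storage_dtype)   # allocated as uint8
print(cache.view(view_dtype).dtype)               # torch.float8_e4m3fn
```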
…che kernels

Core changes:
- Add kv_cache_dtype and k_scale/v_scale parameters to store/load wrappers
- Refactor store kernels to support FP8 quantization with per-head scale:
  * store_kvcache_kernel_causal_lm: add FP8 quantization logic
  * store_kvcache_kernel_diffusion_lm: add FP8 quantization logic
  * store_kvcache_kernel_diffusion_lm_distinct: add FP8 quantization logic
- Refactor load_kvcache_kernel_kv to support FP8 dequantization:
  * Load FP8 values from cache (uint8 storage + view to FP8 dtype)
  * Dequantize using per-head scale and cast to output dtype
  * Support BF16/FP16/FP32 cache without quantization overhead
- Update store_kvcache_unified_layout() to handle FP8 uint8->fp8 view
- Update store_kvcache_distinct_layout() to handle FP8 uint8->fp8 view
- Update load_kvcache() to support configurable output dtype (defaults to k_new.dtype)
- Use constexpr int constants instead of enum in Triton kernels (Triton limitation)

Technical details:
- FP8 uses absmax-based quantization: value_fp8 = clamp(value_fp32 / scale, fp8_range)
- FP8 dequantization: value_out = (value_fp8.to(float32) * scale).to(output_dtype)
- Scale can be scalar or per-head vector [num_kv_heads]
- Maintains backward compatibility: defaults to BF16 when kv_cache_dtype not specified
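A minimal PyTorch sketch of the absmax roundtrip described by the two formulas above. It is standalone and illustrative, not the repository's compute_fp8_scale; the shapes and the loose tolerance are assumptions taken from the tests mentioned later:

```python
import torch

def absmax_fp8_roundtrip(x: torch.Tensor, fp8_dtype=torch.float8_e4m3fn):
    # Per-head absmax scale: one value per KV head so that max|x| maps to fp8_max.
    fp8_max = torch.finfo(fp8_dtype).max
    scale = x.abs().amax(dim=(0, 2)).float() / fp8_max        # [num_kv_heads]
    # Quantize: clamp(x / scale, fp8_range), stored as uint8 and viewed as FP8.
    q = (x.float() / scale[None, :, None]).clamp(-fp8_max, fp8_max).to(fp8_dtype)
    cache = q.view(torch.uint8)                               # uint8 storage pattern
    # Dequantize: (fp8 -> fp32) * scale, then cast back to the output dtype.
    out = cache.view(fp8_dtype).to(torch.float32) * scale[None, :, None]
    return out.to(x.dtype), scale

x = torch.randn(128, 8, 64, dtype=torch.bfloat16)   # [num_tokens, num_kv_heads, head_dim]
x_rt, _ = absmax_fp8_roundtrip(x)
print(torch.allclose(x.float(), x_rt.float(), atol=1e-1, rtol=1e-1))
```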
- Update import from attention_v4 to ops module
- Fix function name from store_kvcache_unified to store_kvcache_unified_layout
- Add test_kv_cache_fp8_unified_roundtrip.py for unified layout FP8 store/load roundtrip
- Add test_kv_cache_fp8_distinct_roundtrip.py for distinct layout FP8 store test
- Test FP8 quantization/dequantization with per-head scales
- Verify roundtrip accuracy with atol=1e-1, rtol=1e-1 tolerance for FP8 precision
- Reduce num_warps from 4 to 1 to reduce shared memory usage
- Reduce num_unroll_cache from 4 to 2 to reduce shared memory usage
- Add comments explaining why BLOCK_M/BLOCK_N cannot be reduced
- Minor code formatting fix in kv_cache_kernels.py
- Add kv_cache_dtype field to Config class (default: bf16)
- Add _get_kv_cache_storage_info() helper function to determine storage dtype and itemsize
- Update allocate_kv_cache() in ModelRunnerForCausalLM to use kv_cache_dtype
- Update allocate_kv_cache() in ModelRunnerForDiffusionLM to use kv_cache_dtype
- Support FP8 KV cache allocation using uint8 storage dtype
- Add kv_cache_dtype parameter passing in attention layers (v4 and v5)
- Implement running max strategy for FP8 scale computation
- Pass scale parameters to store/load functions in forward method
- Update ContextForCausalLM to support kv_cache_dtype
- Update ModelRunnerForCausalLM to pass kv_cache_dtype to context
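A hedged sketch of the running-max strategy mentioned above (illustrative only; the actual helpers in attention_v4.py/attention_v5.py, such as _update_and_compute_fp8_scales(), may differ in shape and signature):

```python
import torch

class RunningFP8Scales:
    """Track the largest |K| and |V| magnitude seen so far per KV head and
    derive per-head FP8 scales from those running maxima."""
    def __init__(self, num_kv_heads: int, device, fp8_max: float = 448.0):  # 448 = e4m3 max
        self.k_max = torch.zeros(num_kv_heads, device=device)
        self.v_max = torch.zeros(num_kv_heads, device=device)
        self.fp8_max = fp8_max

    def update(self, k: torch.Tensor, v: torch.Tensor):
        # k, v: [num_tokens, num_kv_heads, head_dim]
        # Never shrink the running max, so newly stored values always fit the FP8 range.
        self.k_max = torch.maximum(self.k_max, k.abs().amax(dim=(0, 2)).float())
        self.v_max = torch.maximum(self.v_max, v.abs().amax(dim=(0, 2)).float())
        return self.k_max / self.fp8_max, self.v_max / self.fp8_max   # (k_scale, v_scale)
```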

Changes:
- attention_v4.py: Add _get_kv_cache_dtype(), _update_and_compute_fp8_scales(),
  _get_fp8_scales_from_max() methods; update forward() to pass scales
- attention_v5.py: Same changes as attention_v4.py
- context.py: Add kv_cache_dtype field to ContextForCausalLM
- model_runner.py: Pass kv_cache_dtype to set_context_causal_lm() calls

All tests passed including unit tests and FP8 roundtrip tests.
- Fix store_kvcache calls to pass context as keyword argument
- Resolves 'got multiple values for argument' error when using FP8 KV cache
- Verified with full pipeline test using FP8 KV cache

Changes:
- attention_v4.py: Pass context as keyword argument in store_kvcache call
- attention_v5.py: Same fix as attention_v4.py
- test_fp8_kv_cache_pipeline.py: Add integration test for FP8 KV cache in full pipeline

Test results:
- Successfully generated text using FP8 KV cache (fp8_e4m3)
- All 3 test prompts generated correctly
- No errors in FP8 quantization/dequantization path
- Add test_kv_cache_memory_usage.py to verify KV cache memory allocation
- Add test_kv_cache_speed_comparison.py to compare FP8 vs BF16 performance
- Verified FP8 reduces per-block memory by 50% and allows allocating 2x as many blocks
- Performance tests show FP8 is comparable to BF16 in speed

Test results:
- FP8: 428 blocks × 7 MB/block = 2996 MB total
- BF16: 214 blocks × 14 MB/block = 2996 MB total
- FP8 throughput: 63.15 tok/s vs BF16: 56.27 tok/s (12% faster)
…rom global memory fetching into fragment fetching
…ilable, checking errors of cuda graph capturing fixed.
- Fix quantize function to support 2D input tensors
- Implement FP8 unified store kernel and helper
- Implement FP8 load with Python-level dequantization
- Support both static and varlen decode modes
- Remove debug code
- Update documentation

Note: temp/ directory excluded from commit
- Add FP8 distinct store kernel (Triton)
- Add FP8 distinct store helper with Python-level quantization
- Update store_kvcache_distinct_layout to support FP8 strategy
- Extend _load_kvcache_fp8 to support distinct layout
- Fix _load_kvcache_bf16 to handle distinct layout stride calculation
- Implement distinct layout decode path in attn_impl.py
- Add load_kvcache export to diffulex_kernel/__init__.py
- Add test script for distinct layout
- Update .gitignore to exclude temp/ directory
…zation strategy support

- Rename dllm_flash_attn_prefill to _dllm_flash_attn_prefill_bf16
- Rename dllm_flash_attn_decode to _dllm_flash_attn_decode_bf16
- Add new dllm_flash_attn_prefill wrapper that dynamically selects kernel based on quantization strategy
- Add new dllm_flash_attn_decode wrapper that dynamically selects kernel based on quantization strategy
- Currently FP8 strategy uses BF16 kernel (FP8 kernels to be implemented later)
- Maintain backward compatibility with same function signatures
- Tested: BF16 path works correctly in end-to-end tests
Key optimizations:
1. Replace element-wise FP8->FP32->BF16 dequantization loops with T.copy for vectorized cast
2. Fuse K_Scale into score computation (avoid element-wise multiplication)
3. Fuse V_Scale into cache branch output (only affects cache path, not V_new)

Performance improvement:
- FP8 decode throughput: ~11.9 tok/s -> ~24.4 tok/s (2x improvement)
- FP8/BF16 decode ratio: 0.759x (was ~0.38x)

Technical details:
- Removed K_Cache_shared_fp8/V_Cache_shared_fp8 buffers and element-wise conversion loops
- Use T.copy(K_Cache[..], K_Cache_shared_bf16) for direct FP8->BF16 cast
- Apply K_Scale[kv_head_idx] to acc_score_kvcache after GEMM (before softmax)
- Apply V_Scale[kv_head_idx] to acc_score_kvcache before V_Cache GEMM (only cache branch)
- Maintains numerical equivalence with previous implementation
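The algebra behind optimizations 2 and 3 above, as a small torch check. Plain PyTorch stands in for the TileLang kernel here; the shapes and scale values are made up, and the new-token branch of the real kernel is omitted:

```python
import torch

torch.manual_seed(0)
D = 64                                   # head_dim
q = torch.randn(1, D)                    # one query row, one KV head
k_cache = torch.randn(16, D)             # cached K after the FP8 -> higher-precision cast (unscaled)
v_cache = torch.randn(16, D)
k_scale, v_scale = 0.07, 0.05            # per-head scales

# Naive path: dequantize (apply the scale) element-wise before the GEMMs.
score_naive = q @ (k_cache * k_scale).T
out_naive = torch.softmax(score_naive, dim=-1) @ (v_cache * v_scale)

# Fused path: run the GEMMs on unscaled values and fold the scales into the
# outputs -- K_Scale after the QK^T GEMM, V_Scale on the cache-branch scores.
score_fused = (q @ k_cache.T) * k_scale
out_fused = (torch.softmax(score_fused, dim=-1) * v_scale) @ v_cache

print(torch.allclose(out_naive, out_fused, atol=1e-6))   # True: same result, fewer element-wise ops
```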
Key changes:
1. Refactor the quantization module architecture:
   - Add a QuantizationConfig and registry system
   - Support quantization strategies for the KV cache and Attention-Q
   - Implement a strategy capability interface, removing hard-coded isinstance checks
   - Add AttnQQuantizationStrategy support (architecture layer; kernel to be implemented)

2. Rename the FP8 kernel:
   - dllm_flash_attn_decode_kernel_fp8 -> dllm_flash_attn_decode_kernel_bf16_q_fp8_kv
   - More accurately reflects what the kernel actually does (BF16 Q + FP8 KV)

3. Simplify the kernel implementation:
   - Remove the USE_KV_SHARED environment variable switch
   - Remove the fragment path, keeping only the shared-memory path
   - Simplify configuration management (a single config object instead of a dict)

4. Testing and validation:
   - Add end-to-end tests covering the BF16 and BF16 + FP8 KV paths
   - All tests pass and text generation works correctly

Backward compatible: existing APIs are unchanged and existing code needs no modification.
Merge updates from origin/main:
- Update the device list in README.md
- Update .gitignore to add cuda_cache/
- Update GitHub workflows permission configuration

Keep README.md as the original main-branch version, without quantization-related documentation.
- Add LinearQuantizationStrategy interface supporting weight+activation quantization
- Support layer-type-specific strategies (attn/mlp/other)
- Add registry system for linear quantization strategies
- Add Config fields: linear_attn_weight_dtype, linear_mlp_weight_dtype, linear_attn_act_dtype, linear_mlp_act_dtype
- Integrate factory to inject strategies into QuantizationContext
- Add dynamic dispatch in Linear.forward() based on quant_kind
- Tag Linear layers in models (dream/llada/sdar/fast_dllm_v2) with quant_kind
- Add placeholder strategies (stub) that raise NotImplementedError for non-bf16 dtypes
- Add unit tests for registry/factory/dispatch behavior
- Default bf16 behavior unchanged (fully backward compatible)

All non-bf16 paths currently raise NotImplementedError with clear error messages,
providing stable interface for future kernel/packed weight implementations.
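A minimal sketch of the registry-plus-dispatch pattern this commit introduces. The decorator, quant_kind tag, and config field names follow the bullet list above, but the exact signatures here are illustrative assumptions, not the repository API:

```python
import torch
import torch.nn.functional as F

_LINEAR_STRATEGIES: dict[str, type] = {}

def register_linear_strategy(dtype_name: str):
    def deco(cls):
        _LINEAR_STRATEGIES[dtype_name] = cls
        return cls
    return deco

@register_linear_strategy("bf16")
class LinearBF16Strategy:
    def forward(self, x, weight, bias=None):
        return F.linear(x, weight, bias)

@register_linear_strategy("int8")
class LinearInt8Stub:
    def forward(self, x, weight, bias=None):
        raise NotImplementedError("int8 linear path is a stub in this sketch")

# Per-layer-type configuration, mirroring linear_attn_weight_dtype / linear_mlp_weight_dtype.
CONFIG = {"attn": "bf16", "mlp": "bf16", "other": "bf16"}

def dispatch_linear(x, weight, bias=None, quant_kind: str = "other"):
    # Linear layers tagged with quant_kind pick their strategy from the registry.
    strategy = _LINEAR_STRATEGIES[CONFIG[quant_kind]]()
    return strategy.forward(x, weight, bias)

y = dispatch_linear(torch.randn(2, 8, dtype=torch.bfloat16),
                    torch.randn(4, 8, dtype=torch.bfloat16), quant_kind="attn")
print(y.shape)   # torch.Size([2, 4])
```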
luozixin2 and others added 21 commits January 13, 2026 16:29
- Remove the .cursor directory from git tracking
- Add .cursor/ to .gitignore to avoid accidental commits in the future
- Optimize W8A16 small-M decode: pad M<16 to 16 (instead of 64) and use block_M=16/32/64 (see the sketch after this list)
- Add w8a16_gemm_bias kernel with fused bias epilogue (opt-in via DIFFULEX_W8A16_FUSE_BIAS)
- Add runtime profiling hooks for W8A16 (DIFFULEX_LINEAR_PROFILE) to track M distribution and fallbacks
- Implement FP8 KV varlen fused dequantization kernel (Triton) for unified layout
- Add benchmark configs for W4A8 and W8A8 quantization strategies
- Add profiling hooks for KV cache load timing (DIFFULEX_PROFILE_KVCACHE)
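A small illustration of the small-M padding item at the top of this list: pad the decode batch up to the nearest block_M bucket instead of always padding to 64. The helper below is hypothetical, not the kernel's actual entry point:

```python
import torch
import torch.nn.functional as F

def pad_m_for_w8a16(x: torch.Tensor, buckets=(16, 32, 64)) -> torch.Tensor:
    """Pad the M (token) dimension of a 2D decode input [M, K] up to the
    smallest bucket that fits, so tiny decode steps waste less compute."""
    m = x.shape[0]
    target = next((b for b in buckets if m <= b), None)
    if target is None:
        # Large batches: round up to a multiple of the largest bucket instead.
        target = ((m + buckets[-1] - 1) // buckets[-1]) * buckets[-1]
    return F.pad(x, (0, 0, 0, target - m))

x = torch.randn(5, 4096, dtype=torch.bfloat16)    # M=5 decode step
print(pad_m_for_w8a16(x).shape)                   # torch.Size([16, 4096])
```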
Main additions:

1. **Marlin/AllSpark INT8 W8A16 quantization strategy integration**:
   - Add linear_marlin_int8_w8a16.py: a W8A16 quantization strategy based on the vLLM AllSpark kernel
   - Add diffulex_kernel/csrc/marlin/: AllSpark CUDA kernels vendored from vLLM
     * allspark_qgemm_w8a16.cu: W8A16 fused GEMM kernel
     * allspark_repack.cu: N32K16 weight repacking kernel
     * allspark_utils.cuh: utility functions and data structures
     * torch_bindings_marlin.cpp: PyTorch C++ bindings
   - Add diffulex_kernel/python/marlin_ops.py: Python interface for JIT compiling and loading the Marlin/AllSpark kernels

2. **Quantization strategy registry updates**:
   - Add a 'marlin' alias in registry.py (mapped to marlin_int8)
   - Import the new strategy in strategies/__init__.py

3. **Performance improvements**:
   - The Marlin W8A16 strategy significantly improves prefill throughput (from 4518.92 tok/s to 9520.91 tok/s, roughly 2.1x)
   - Decode throughput is close to the BF16 baseline (23.16 tok/s vs 23.36 tok/s)
   - Can be combined with the FP8 KV cache

4. **Other improvements**:
   - Optimized several quantization strategy implementations
   - Improved KV cache management
   - Enhanced profiler functionality
   - Added several benchmark configuration files
…support

feat: integrate Marlin/AllSpark INT8 W8A16 quantization strategy
Key changes:
- Add GPTQ Marlin (W4A16) and AWQ Marlin (W4A16) quantization strategies
- Fix loader.py to correctly load gptq_marlin-format weights (supporting Marlin-specific repacked qweight and permuted scales)
- Update quantize_model.py to support exporting the gptq_marlin format (symmetric quantization + Marlin repack/permute)
- Update linear.py:
  - Add an _offline_quant_bits buffer to store the quantization bit width
  - Add GPTQ runtime shuffle support (gptq_shuffle)
  - Add lazy repack support for GPTQ/AWQ Marlin (_maybe_prepare_offline_gptq_marlin/_awq_marlin)
  - Standardize on the vLLM format (int32 packed, fp16 scales)
- Simplify the individual strategy files and remove duplicated code
- Remove the old AllSpark Marlin implementation files
- Add several benchmark configuration files (GPTQ/AWQ Marlin, one per bit width)
benchmark_results is a locally generated evaluation artifact and should not live in the repository.
This commit removes it as a regular deletion and relies on the benchmark_results/ rule in .gitignore to keep it from being committed again.
- Add quant-method=auto support: use auto-gptq / awq for true calibrated quantization
- Add calibration data arguments: --calib-text-file, --calib-num-samples, --calib-seq-len, etc.
- Implement _export_autogptq_to_vllm_weights: export vLLM-format weights from an auto-gptq quantized model
- Implement _export_awq_to_vllm_weights: export vLLM-format weights from an awq quantized model
- Keep the old quant-method=simple implementation for backward compatibility
- Fix the shape inference and TP sharding logic for gptq_marlin scales in loader.py
- Fix linear_gptq_marlin_w4a16.py to remove an unnecessary bf16->fp16 conversion
Main refactoring:

1. **diffulex/layer/linear.py** - Significantly simplify the quantization logic (-197 lines):
   - Add `_forward_base()`: a unified forward dispatcher replacing the duplicated quantization branching in subclasses
   - Add `_build_offline_forward_kwargs()`: unified construction of forward arguments for offline quantization (GPTQ/AWQ)
   - Add helper methods such as `_get_linear_strategy()`, `_offline_meta()`, and `_infer_gptq_weight_bits()`
   - Fix the edge case in `LoRAMixin.merge_lora` where the base weight is None
   - Remove unused imports (marlin_zero_points, unpack_cols, marlin_make_empty_g_idx)

2. **diffulex/utils/loader.py** - Improve performance and code structure:
   - Scan the safetensors files once to build a key_to_file index, avoiding repeated file I/O
   - Cache the result of `model.named_modules()` to avoid rebuilding the dictionary
   - Add `_find_offline_capable_module()`: unified module lookup logic
   - Add `_load_tensors_for_prefix()`: load tensors centrally, opening only the files that are needed
   - Replace print() with logger.warning()/logger.exception() for consistent logging

3. **diffulex/engine/model_runner.py** - Eliminate duplicated loops:
   - Cache the attention module list once in `allocate_kv_cache`
   - Replace the repeated module traversal loops with `enumerate(attn_modules)`

4. **diffulex/utils/quantization/strategies/linear_int4_w4a16.py** - Fix a missing implementation:
   - Add the `quantize_weight_for_kernel` method, fixing a runtime error in W4A16 online quantization

5. Delete the unused configuration file `gptq_marlin_w2_bf16kv_varlen.yml`

Tests: verified that W8A16 online quantization and GPTQ offline quantization work correctly
- Change the final summary from the last step's instantaneous throughput to a true average (total tokens / total time)
- Add ms/step statistics to make performance analysis easier
- Fix the issue that only the last step's instantaneous value was shown instead of the average (see the sketch below)
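The distinction in a few lines (variable names are assumptions, not the benchmark script's actual code):

```python
# Per-step decode stats: tokens emitted and wall time in seconds.
step_tokens = [96, 64, 48, 32]
step_times = [1.20, 0.95, 0.90, 0.85]

instantaneous_last = step_tokens[-1] / step_times[-1]      # what the summary used to print
avg_throughput = sum(step_tokens) / sum(step_times)        # true average: total tokens / total time
ms_per_step = 1000.0 * sum(step_times) / len(step_times)   # the new ms/step statistic

print(f"last step: {instantaneous_last:.2f} tok/s | "
      f"average: {avg_throughput:.2f} tok/s | {ms_per_step:.1f} ms/step")
```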
- Quantized linear: drop kwargs/pop/duplicate availability checks; cache out_features and the necessary intermediate tensors
- Call vLLM CUDA ops directly (W8A8/GPTQ/AWQ/Marlin, etc.) to reduce Python glue overhead
- Handle qweight/scales layout and contiguity at load time, avoiding repeated work in forward
- Remove profiler record annotations from linear.py to keep the code clean
- Add trace/profile helper analysis scripts and related tests
… strategies

- Remove all .item() calls in LinearBase hot paths (a GPU->CPU sync breaks graph capture; pattern sketched after this list)
  - Add Python-side meta cache (_offline_quant_*_py, _gptq_is_shuffled_py, etc.)
  - Use in-place fill_() + Python mirrors for state updates
- Simplify linear quantization strategies for future CUDA Graph support
  - Remove fast_path checks and redundant branching in linear_marlin_int8_w8a16
  - Remove fast_path in linear_int8_w8a8 (unified vLLM path)
  - Simplify linear_gptq_w4a16 (direct torch.ops._C.gptq_gemm call)
  - Make linear_fp8_w8a16 use explicit quant_scales parameter
- Fix FP8 weight layout: do not force contiguous for transpose-view (KxN stride0==1)
- Remove profiler record_function wrappers (graph-friendly)

Net: -129 lines, cleaner codebase ready for CUDA Graph capture
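A sketch of the graph-friendly state pattern from the first bullet group above: keep a device tensor for persistence, but mirror it in a plain Python attribute so hot paths never call .item(), which forces a GPU-to-CPU sync. The _py suffix follows the commit text; the class itself is illustrative:

```python
import torch

class OfflineQuantMeta:
    def __init__(self, device="cpu"):
        # Device-side flag (e.g., registered as a buffer so it is saved/loaded)...
        self._gptq_is_shuffled = torch.zeros(1, dtype=torch.bool, device=device)
        # ...plus a Python-side mirror that hot paths read without any sync.
        self._gptq_is_shuffled_py = False

    def mark_shuffled(self):
        # In-place fill_() keeps the buffer's storage address stable (graph-safe),
        # while the Python mirror is updated on the host side.
        self._gptq_is_shuffled.fill_(True)
        self._gptq_is_shuffled_py = True

    def needs_shuffle(self) -> bool:
        # Hot path: no .item(), no GPU->CPU synchronization.
        return not self._gptq_is_shuffled_py

meta = OfflineQuantMeta()
print(meta.needs_shuffle())   # True
meta.mark_shuffled()
print(meta.needs_shuffle())   # False
```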
- Add per-layer ForwardPlan to pre-resolve bf16/quant/offline paths and reduce per-call Python branching.
- Prefer direct torch.ops kernels (GPTQ/AWQ/Marlin) with static args for stable capture.
- Fix D2F static CUDA graph capture/replay metadata (token buckets + cu_seqlens) and add profiler flag.
- Fix tensor shape mismatch bug in static+CUDA Graph decode mode (model_runner.py)
  - Improve bucket selection logic for variable token counts
  - Add safety fallback when runtime batch exceeds captured capacity
  - Fix metadata buffer initialization and padding

- Add new static mode benchmark configs:
  - awq_bf16kv_static.yml
  - gptq_marlin_w4_bf16kv_static.yml
  - gptq_marlin_w8_bf16kv_static.yml

- Update quantization strategies and loader utilities
- Update benchmark configurations for consistency
- Remove bench configs and quantization architecture docs added after v0.0.1
- Consolidate W8A16/DP tuning knobs from environment variables into Config/strategy.configure
- Remove hard-coded local paths and default GPU selections from examples/scripts, and fix syntax issues

coderabbitai bot commented Feb 9, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
diffulex/strategy/block_diffusion/engine/sequence.py (1)

49-55: ⚠️ Potential issue | 🔴 Critical

Type hint mismatch: modified_to parameter expects tensor but is typed as int.

Line 54 calls modified_to.item(), which is a tensor method (PyTorch/NumPy) to extract a scalar value. However, the parameter on line 49 is typed as int, which doesn't have an .item() method. This will cause an AttributeError at runtime.

🔧 Proposed fix: Update type hint to reflect tensor type

If using PyTorch tensors:

-    def modify_token(self, local_token_id: int, modified_to: int) -> None:
+    def modify_token(self, local_token_id: int, modified_to: torch.Tensor) -> None:
         if self.seq is None:
             raise RuntimeError("Diffusion block is not attached to a sequence.")
         target_id = local_token_id + self.global_start_id
         assert self.seq.token_ids[target_id] == self.mask_token_id
         self.seq.token_ids[target_id] = modified_to.item()  # type: ignore[assignment]
         self.seq.new_tokens += 1

Note: You'll need to add import torch at the top of the file.

Alternatively, if modified_to should actually be an int, remove the .item() call:

     def modify_token(self, local_token_id: int, modified_to: int) -> None:
         if self.seq is None:
             raise RuntimeError("Diffusion block is not attached to a sequence.")
         target_id = local_token_id + self.global_start_id
         assert self.seq.token_ids[target_id] == self.mask_token_id
-        self.seq.token_ids[target_id] = modified_to.item()  # type: ignore[assignment]
+        self.seq.token_ids[target_id] = modified_to  # type: ignore[assignment]
         self.seq.new_tokens += 1
diffulex_legacy/layers/attention/ops/kv_cache_kernels.py (1)

429-486: ⚠️ Potential issue | 🟠 Major

Guard against missing FP8 scales in load_kvcache

When kv_cache_dtype specifies FP8, the function must have valid scales to dequantize correctly. Line 484–485 will silently create unit scales if k_scale and v_scale are None, producing incorrect results. Add an explicit check to fail fast.

Suggested fix
     spec = parse_kv_cache_dtype(kv_cache_dtype)
+    if spec.is_fp8 and (k_scale is None or v_scale is None):
+        raise ValueError("FP8 KV cache requires k_scale and v_scale for load.")
diffulex_legacy/utils/context.py (1)

24-45: ⚠️ Potential issue | 🟡 Minor

Validate kv_cache_dtype to fail fast.
The new public parameter should be checked against supported values to avoid late runtime errors.

🛡️ Suggested guard
 def set_context_causal_lm(
     is_prefill,
     cu_seqlens_q=None, cu_seqlens_k=None,
     max_seqlen_q=0, max_seqlen_k=0,
     slot_mapping=None, context_lens=None, block_tables=None,
     kv_cache_dtype: str = "bf16"
 ) -> None:
+    allowed_kv_cache_dtypes = {
+        "bf16", "fp16", "fp32", "fp8", "fp8_e4m3", "fp8_e5m2"
+    }
+    if kv_cache_dtype not in allowed_kv_cache_dtypes:
+        raise ValueError(
+            f"Unsupported kv_cache_dtype: {kv_cache_dtype}. "
+            f"Expected one of {sorted(allowed_kv_cache_dtypes)}."
+        )
     global _CONTEXT_FOR_CAUSAL_LM
     _CONTEXT_FOR_CAUSAL_LM = ContextForCausalLM(
🤖 Fix all issues with AI agents
In `@diffulex_bench/datasets.py`:
- Around line 26-34: The bug is that slicing with dataset[:limit] turns the
Dataset into a dict-of-lists so the subsequent loop over dataset iterates keys;
replace that slice with dataset.select(range(limit)) so iteration yields
records. Update the code around load_dataset(..., split=split) and the
conditional that checks limit to use dataset = dataset.select(range(limit))
(referencing the dataset variable and load_dataset call) and ensure the rest of
the loop (for item in dataset, accessing item["question"], item["answer"])
continues to work with Dataset records.
- Around line 65-71: The code incorrectly slices the HuggingFace Dataset with
dataset[:limit], which can convert it to a list and break iteration; instead,
when limiting the humaneval dataset obtained by load_dataset("openai/humaneval")
assign dataset = dataset.select(range(limit)) (or
dataset.select(range(limit)).shuffle(...) if needed) so the result stays a
Dataset object and iteration in the subsequent loop over dataset works
correctly; update the block that checks limit to use
dataset.select(range(limit)) rather than dataset[:limit].
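The datasets-library behavior both fixes above rely on, as a short standalone check (requires the datasets package; the toy columns are illustrative):

```python
from datasets import Dataset

ds = Dataset.from_dict({"question": ["q1", "q2", "q3"], "answer": ["a1", "a2", "a3"]})

# Slicing returns a plain dict of columns, so iterating yields column *names*.
sliced = ds[:2]
print(list(sliced))                 # ['question', 'answer']

# select() keeps a Dataset, so iterating yields record dicts as the loop expects.
limited = ds.select(range(2))
for item in limited:
    print(item["question"], item["answer"])   # q1 a1 / q2 a2
```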

In `@diffulex_kernel/python/paged_attn_decode_triton.py`:
- Line 527: The assertion in paged_attn_decode_triton.py uses
attn_metadata.kv_cache_layout which doesn't exist on the AttnMetaDataBase class
and will raise AttributeError; fix by adding a default attribute
kv_cache_layout: str = "unified" to the AttnMetaDataBase definition in
diffulex/attention/metadata.py (so the assertion in paged_attn_decode_triton.py
continues to work), or alternatively change the assertion to use
getattr(attn_metadata, "kv_cache_layout", "unified") to provide a default —
update either the AttnMetaDataBase class (preferred) or the assertion
accordingly.

In `@diffulex_profiler/backends/viztracer.py`:
- Around line 53-67: The stop() method in VizTracer backend currently calls
self.tracer.stop() but never calls the required self.tracer.save(), so the trace
file is not written; update stop() (method stop, referencing self.tracer and
output_file) to call self.tracer.save() immediately after self.tracer.stop() and
before reading self.tracer.output_file, then proceed to build the result dict
and set self.tracer = None so the trace is persisted to disk.

In `@diffulex/sampler/sdar.py`:
- Around line 17-56: In forward(), the boolean flags margin_confidence and
neg_entropy are incorrectly compared to strings when passed into sample_tokens
(e.g., neg_entropy == "neg_entropy"), so True is never honored; change the calls
to normalize these inputs to booleans (accept both bool and legacy string
values) before passing them to sample_tokens — e.g., compute
normalized_neg_entropy = bool(neg_entropy) or normalized_neg_entropy =
(neg_entropy is True or neg_entropy == "neg_entropy") and similarly for
margin_confidence, then call sample_tokens(...,
neg_entropy=normalized_neg_entropy,
margin_confidence=normalized_margin_confidence); apply the same normalization
pattern wherever these flags are used (including other files llada.py, dream.py,
fast_dllm_v2.py) so sample_tokens always receives a proper bool.

In `@diffulex/strategy/fast_dllm_v2/engine/model_runner.py`:
- Around line 123-133: The code in model_runner.py fails to handle the IN_CACHE
state for seq.diffusion_blocks[-1], causing slot_mapping to be shorter than
input_ids; in the block that currently checks seq.diffusion_blocks[-1].is_active
and .is_to_cache, add an else branch that mirrors the active case by extending
slot_mapping with [-1] * self.diffusion_block_size so slot_mapping stays aligned
with the input_ids produced by diffusion_decoding_inputs(); update the branch
containing seq.diffusion_blocks[-1].is_active,
seq.diffusion_blocks[-1].is_to_cache, slot_mapping, and
diffusion_decoding_inputs() accordingly.

In `@diffulex/utils/quantization/strategies/kv_cache_bf16.py`:
- Around line 55-60: The BF16 alias registration (register_kv_cache_strategy ->
_build_kv_cache_bf16 returning KVCacheBF16Strategy) causes fp32/fp16 strings to
be treated as 2-byte storage but downstream code still uses hardcoded
dtype-to-size lookups; update callers to ask the strategy for its actual storage
dtype: call the strategy's get_storage_dtype() (e.g., on the KVCacheBF16Strategy
instance) and compute sizes via numpy dtype.itemsize instead of mapping strings
to sizes. Replace any hardcoded branches that assume "fp32" => 4 bytes (such as
code that computes itemsize) with a call to strategy.get_storage_dtype() and
np.dtype(...).itemsize so memory calculations match the registered strategy.

In `@diffulex/utils/quantization/strategies/linear_bf16.py`:
- Around line 9-14: The factory function _build_linear_bf16() calls
LinearBF16Strategy before that class is defined, causing a NameError at import;
move the class LinearBF16Strategy definition above the
`@register_linear_strategy-decorated` _build_linear_bf16 function (or
alternatively inline the class reference by returning an instance via a lambda
that imports/defines the class first) so that LinearBF16Strategy is defined when
_build_linear_bf16() is executed.
🟠 Major comments (15)
diffulex/engine/tp_worker.py-77-84 (1)

77-84: ⚠️ Potential issue | 🟠 Major

Async path bypasses activation-quant cache clear.
step_async reimplements step logic but skips the cache clear, so async generation can reuse stale activation-quant state across steps.

✅ Suggested fix (reuse step())
@@
-        def _step():
-            seqs, is_prefill = self.scheduler.schedule()
-            sample_output = self.model_runner.call("run", seqs, is_prefill)
-            n_diff_steps = self.scheduler.postprocess(seqs, sample_output)
-            outputs = [(seq.seq_id, seq.completion_token_ids) for seq in seqs if seq.is_finished]
-            num_tokens = sum(seq.num_tokens for seq in seqs) if is_prefill else sum(seq.new_tokens for seq in seqs)
-            deltas = []
-            return outputs, num_tokens, is_prefill, n_diff_steps, deltas
+        def _step():
+            return self.step()

Also applies to: 94-111

diffulex_profiler/exporters/summary.py-57-72 (1)

57-72: ⚠️ Potential issue | 🟠 Major

Fix output_file clobbering — summary may write to the wrong file.
The viztracer branch overwrites output_file, so the summary can end up written to the trace file (or "N/A").

Proposed fix
-            if m.backend_data and m.backend_data.get("backend") == "viztracer":
-                output_file = m.backend_data.get("output_file", "N/A")
-                summary_lines.append(f"  VizTracer Output: {output_file}")
+            if m.backend_data and m.backend_data.get("backend") == "viztracer":
+                viz_output_file = m.backend_data.get("output_file", "N/A")
+                summary_lines.append(f"  VizTracer Output: {viz_output_file}")
diffulex/attention/attn_impl.py-59-72 (1)

59-72: ⚠️ Potential issue | 🟠 Major

Initialize scales even when they are None.

The current guard skips update_scales on the first store, so k_scale/v_scale can remain None and later decoding may fail when a strategy requires scales. It’s safer to initialize via the strategy even when the scales are not yet set.

🐛 Proposed fix
-                # Update scales if quantization strategy requires them
-                if self.k_scale is not None and self.v_scale is not None:
-                    from diffulex.utils.quantization.context import get_kv_cache_strategy
-                    strategy = get_kv_cache_strategy()
-                    if strategy is not None:
-                        self.k_scale, self.v_scale = strategy.update_scales(
-                            k, v, self.k_scale, self.v_scale,
-                            self.num_kv_heads, k.device
-                        )
-                    # Pass scale to metadata if required by strategy
-                    if strategy is not None:
-                        strategy.maybe_set_attn_metadata_scales(
-                            attn_metadata, k_scale=self.k_scale, v_scale=self.v_scale
-                        )
+                # Update/initialize scales if quantization strategy requires them
+                from diffulex.utils.quantization.context import get_kv_cache_strategy
+                strategy = get_kv_cache_strategy()
+                if strategy is not None:
+                    self.k_scale, self.v_scale = strategy.update_scales(
+                        k, v, self.k_scale, self.v_scale,
+                        self.num_kv_heads, k.device
+                    )
+                    # Pass scale to metadata if required by strategy
+                    strategy.maybe_set_attn_metadata_scales(
+                        attn_metadata, k_scale=self.k_scale, v_scale=self.v_scale
+                    )
diffulex/strategy/d2f/engine/model_runner.py-293-309 (1)

293-309: ⚠️ Potential issue | 🟠 Major

Guard CUDA-graph replay when decode_mode isn’t static.

capture_cudagraph() is explicitly static-only, but run_model() will still replay graphs even when config/default is "varlen". This can mismatch the user’s requested decode mode and captured kernel path.

Proposed guard to align replay with static-only capture
-        if is_prefill or self.enforce_eager or input_ids.size(0) > 512:
+        if is_prefill or self.enforce_eager or input_ids.size(0) > 512:
             return self.model.compute_logits(self.model(input_ids, positions))
+        if self._get_decode_mode() != "static":
+            return self.model.compute_logits(self.model(input_ids, positions))
diffulex/strategy/fast_dllm_v2/engine/kvcache_manager.py-17-18 (1)

17-18: ⚠️ Potential issue | 🟠 Major

Confusing boolean comparison in can_append.

The expression (seq.cached_or_caching_num_tokens % self.block_size == 1) evaluates to a boolean (True/False), which is then compared with >=. This means the condition becomes len(free_block_ids) >= 1 when a new block is needed, and len(free_block_ids) >= 0 (always true) otherwise. This seems unintentional.

🐛 Suggested clarification

If the intent is "need at least one free block when tokens overflow to a new block":

     def can_append(self, seq: "FDV2Sequence") -> bool:
-        return len(self.free_block_ids) >= (seq.cached_or_caching_num_tokens % self.block_size == 1)
+        needs_new_block = seq.cached_or_caching_num_tokens % self.block_size == 1
+        return not needs_new_block or len(self.free_block_ids) >= 1
diffulex_bench/runner.py-19-53 (1)

19-53: ⚠️ Potential issue | 🟠 Major

Make trust_remote_code opt‑in for tokenizer loading.

Hardcoding trust_remote_code=True in AutoTokenizer.from_pretrained() allows arbitrary code execution when loading remote tokenizers. Add a configurable parameter with a safe default of False.

Suggested change
    def __init__(
        self,
        model_path: str,
        tokenizer_path: Optional[str] = None,
        wait_ready: bool = True,
+       trust_remote_code: bool = False,
        **diffulex_kwargs
    ):
         """
         Initialize the benchmark runner
         
         Args:
             model_path: Path to the model
             tokenizer_path: Path to the tokenizer, if None uses model_path
             wait_ready: Whether to wait for engine to be fully initialized before returning
+            trust_remote_code: Whether to trust remote code when loading tokenizer
             **diffulex_kwargs: Additional arguments to pass to Diffulex
         """
         self.model_path = model_path
         self.tokenizer_path = tokenizer_path or model_path
         self.logger = get_logger(__name__)
         
         # Initialize Diffulex engine
         self.logger.info("Initializing Diffulex engine...")
         self.llm = Diffulex(model_path, **diffulex_kwargs)
         
         # Wait for engine to be ready if requested
         if wait_ready:
             self._wait_for_ready()
         
         # Load tokenizer
         self.logger.info("Loading tokenizer...")
         self.tokenizer = AutoTokenizer.from_pretrained(
             self.tokenizer_path,
-            trust_remote_code=True
+            trust_remote_code=trust_remote_code
         )
         self.logger.success("Tokenizer loaded successfully")
diffulex_bench/config.py-60-65 (1)

60-65: ⚠️ Potential issue | 🟠 Major

Rename loop variable to avoid shadowing dataclasses.field (F811).

Both to_dict methods use field as a loop variable, which shadows the imported dataclasses.field and triggers Ruff F811. Use a different name (e.g., dc_field).

🛠️ Suggested fix
     def to_dict(self) -> Dict[str, Any]:
         """Convert to dictionary"""
         return {
-            field.name: getattr(self, field.name)
-            for field in self.__dataclass_fields__.values()
+            dc_field.name: getattr(self, dc_field.name)
+            for dc_field in self.__dataclass_fields__.values()
         }
@@
     def to_dict(self) -> Dict[str, Any]:
         """Convert to dictionary"""
         return {
-            field.name: getattr(self, field.name)
-            for field in self.__dataclass_fields__.values()
+            dc_field.name: getattr(self, dc_field.name)
+            for dc_field in self.__dataclass_fields__.values()
         }

Also applies to: 131-136

diffulex/strategy/fast_dllm_v2/engine/scheduler.py-27-35 (1)

27-35: ⚠️ Potential issue | 🟠 Major

Batch token cap check ignores cached tokens.

You check num_batched_tokens + projected but later add projected - seq.num_cached_tokens. This can prematurely block prefill and even trigger the “unable to schedule” error despite capacity. Compute a single projected_tokens and use it in both places.

🛠️ Suggested fix
         while self.waiting and num_seqs < self.max_num_seqs:
             seq = self.waiting[0]
             projected = len(seq) + seq.diffusion_block_size
+            projected_tokens = projected - seq.num_cached_tokens
             if (
-                num_batched_tokens + projected > self.max_num_batched_tokens
+                num_batched_tokens + projected_tokens > self.max_num_batched_tokens
                 or not self.block_manager.can_allocate(seq)
             ):
                 break
@@
-            num_batched_tokens += projected - seq.num_cached_tokens
+            num_batched_tokens += projected_tokens
diffulex_bench/metrics.py-66-83 (1)

66-83: ⚠️ Potential issue | 🟠 Major

Return contract mismatch in humaneval_pass_at_k.

The function is annotated to return float but returns None. Any caller doing math or serialization will blow up with a TypeError. Prefer to fail fast with NotImplementedError (or change the signature to Optional[float] and document it).

🛠️ Suggested fix (fail fast)
 def humaneval_pass_at_k(
     results: List[Dict[str, Any]],
     k: int = 1,
 ) -> float:
@@
-    return None
+    raise NotImplementedError(
+        "HumanEval pass@k requires code execution; implement evaluator before use."
+    )
diffulex_bench/config.py-67-103 (1)

67-103: ⚠️ Potential issue | 🟠 Major

get_diffulex_kwargs returns before adding optional params; kwargs undefined.

The function returns a dict immediately, so the optional quantization fields are never applied, and the later kwargs[...] lines reference an undefined name. Build kwargs first, then extend it.

🛠️ Suggested fix
     def get_diffulex_kwargs(self) -> Dict[str, Any]:
         """Get arguments to pass to Diffulex engine"""
-        return {
+        kwargs = {
             'model_name': self.model_name,
             'decoding_strategy': self.decoding_strategy,
             'mask_token_id': self.mask_token_id,
             'tensor_parallel_size': self.tensor_parallel_size,
             'data_parallel_size': self.data_parallel_size,
             'gpu_memory_utilization': self.gpu_memory_utilization,
             'max_model_len': self.max_model_len,
             'max_num_batched_tokens': self.max_num_batched_tokens,
             'max_num_seqs': self.max_num_seqs,
             'use_lora': self.use_lora,
             'lora_path': self.lora_path if self.use_lora else "",
             'enforce_eager': self.enforce_eager,
             'kv_cache_layout': self.kv_cache_layout,
             'accept_threshold': self.accept_threshold,
             'complete_threshold': self.complete_threshold,
             'add_new_block_threshold': self.add_new_block_threshold,
             'diffusion_block_size': self.diffusion_block_size,
         }
@@
         if self.linear_mlp_act_dtype is not None:
             kwargs['linear_mlp_act_dtype'] = self.linear_mlp_act_dtype
         
         return kwargs
diffulex/utils/quantization/strategies/linear_int8_w8a8.py-39-76 (1)

39-76: ⚠️ Potential issue | 🟠 Major

Scale‑shape mismatch between get_scale_shape and quantize output.

get_scale_shape returns (N,) but quantize() returns scales shaped [1, N] (and the cache comment says [N]). This mismatch can break scale buffer allocation/serialization. Align the declared shape with the actual returned tensor.

🛠️ Suggested fix (align to [1, N])
-        # Cache: id(weight) -> (qweight_int8 [N,K], w_scales_fp32 [N])
+        # Cache: id(weight) -> (qweight_int8 [N,K], w_scales_fp32 [1,N])
         self._weight_cache: dict[int, tuple[torch.Tensor, torch.Tensor]] = {}
@@
     def get_scale_shape(self, original_shape: tuple[int, ...], **kwargs: Any) -> tuple[int, ...]:
         _ = kwargs
         if len(original_shape) != 2:
             raise ValueError(f"Expected 2D weight [N,K], got {original_shape}")
-        return (original_shape[0],)
+        return (1, original_shape[0])
diffulex/engine/model_runner.py-40-47 (1)

40-47: ⚠️ Potential issue | 🟠 Major

Fix device_id usage in process-group init.

dist.init_process_group unconditionally indexes config.device_ids[rank], which breaks when device_ids is unset and can disagree with the later fallback path. Compute device_id once and use it consistently for both device setup and init_process_group.

🐛 Proposed fix
-        dist.init_process_group("nccl", init_method, world_size=self.world_size, rank=rank, device_id=config.device_ids[rank])
-        # Choose CUDA device for this TP rank.
-        # config.device_ids is already a list of logical CUDA device indices (respecting CUDA_VISIBLE_DEVICES).
-        # Do NOT add rank again, otherwise rank 1 with device_ids=[0,1] becomes device 2.
-        if getattr(config, "device_ids", None):
-            device_id = config.device_ids[rank]
-        else:
-            device_id = (getattr(config, "device_start", 0) or 0) + rank
+        # Choose CUDA device for this TP rank.
+        # config.device_ids is already a list of logical CUDA device indices (respecting CUDA_VISIBLE_DEVICES).
+        # Do NOT add rank again, otherwise rank 1 with device_ids=[0,1] becomes device 2.
+        if getattr(config, "device_ids", None):
+            device_id = config.device_ids[rank]
+        else:
+            device_id = (getattr(config, "device_start", 0) or 0) + rank
+        dist.init_process_group("nccl", init_method, world_size=self.world_size, rank=rank, device_id=device_id)
diffulex/strategy/fast_dllm_v2/engine/model_runner.py-189-228 (1)

189-228: ⚠️ Potential issue | 🟠 Major

CUDA-graph sizes can miss non-multiple-of-16 batch sizes.

When max_num_seqs isn’t a multiple of 16 (e.g., 20), seq_bs_list tops out at 16, so run_model can’t find a graph for num_tokens = 20 * block_size and raises StopIteration. Ensure the list always includes max_num_seqs.

🐛 Suggested fix
-        seq_bs_list = [1, 2, 4, 8] + list(range(16, max_num_seqs + 1, 16))
+        seq_bs_list = [1, 2, 4, 8] + list(range(16, max_num_seqs + 1, 16))
+        if max_num_seqs not in seq_bs_list:
+            seq_bs_list.append(max_num_seqs)
+        seq_bs_list = sorted(set(seq_bs_list))
diffulex/utils/quantization/strategies/linear_fp8_w8a8.py-44-47 (1)

44-47: ⚠️ Potential issue | 🟠 Major

Potential memory leak in weight cache using id(weight).

Using id(weight) as a cache key is risky because:

  1. If a weight tensor is deallocated and a new tensor is allocated at the same memory address, the cache will return stale quantized data.
  2. The cache holds strong references to quantized tensors, preventing garbage collection of old weights.

Consider using weakref or a bounded cache (e.g., LRU) to avoid unbounded memory growth, or clear the cache when model parameters change.

diffulex/strategy/fast_dllm_v2/engine/sequence.py-59-65 (1)

59-65: ⚠️ Potential issue | 🟠 Major

Type mismatch: modified_to is annotated as int but .item() is called on it.

The method signature shows modified_to: int, but line 64 calls modified_to.item() which implies it's a tensor. This will raise AttributeError if an actual int is passed.

🐛 Proposed fix

Either fix the type hint or handle both cases:

-    def modify_token(self, local_token_id: int, modified_to: int) -> None:
+    def modify_token(self, local_token_id: int, modified_to: int | torch.Tensor) -> None:
         if self.seq is None:
             raise RuntimeError("Diffusion block is not attached to a sequence.")
         target_id = local_token_id + self.global_start_id
         assert self.seq.token_ids[target_id] == self.mask_token_id
-        self.seq.token_ids[target_id] = modified_to.item()  # type: ignore[assignment]
+        value = modified_to.item() if hasattr(modified_to, 'item') else modified_to
+        self.seq.token_ids[target_id] = value
         self.seq.new_tokens += 1
🟡 Minor comments (34)
examples/test_bf16_kernel_e2e.py-70-70 (1)

70-70: ⚠️ Potential issue | 🟡 Minor

Remove unused f-string prefix
The f-string on line 70 contains no format expressions (no {...} placeholders), triggering Ruff F541. Remove the f prefix.

🛠️ Proposed fix
-    print(f"\n总计:")
+    print("\n总计:")
examples/test_fp8_kernel_e2e.py-72-72 (1)

72-72: ⚠️ Potential issue | 🟡 Minor

Remove extraneous f prefix from string without placeholders.

This f-string has no placeholders and should be a regular string.

🐛 Proposed fix
-    print(f"\n总计:")
+    print("\n总计:")
examples/test_fp8_linear.py-115-122 (1)

115-122: ⚠️ Potential issue | 🟡 Minor

Unused variables: M, mem_bf16, mem_fp8.

These variables are assigned but never used. Either remove them or use them for more detailed memory reporting.

🧹 Proposed fix (remove unused)
     device = torch.device("cuda")
     torch.cuda.empty_cache()
     torch.cuda.reset_peak_memory_stats()
     
     # BF16 baseline
-    M, K, N = 32, 512, 256
+    K, N = 512, 256
     weight_bf16 = torch.randn(N, K, dtype=torch.bfloat16, device=device)
-    mem_bf16 = torch.cuda.memory_allocated()
     
     # FP8 quantized
     strategy = create_linear_strategy(weight_dtype="fp8_e4m3", act_dtype="bf16")
     weight_fp8, scales = strategy.quantize_weight_for_kernel(weight_bf16, device=device)
-    mem_fp8 = torch.cuda.memory_allocated()
examples/test_gptq_awq_loading.py-52-66 (1)

52-66: ⚠️ Potential issue | 🟡 Minor

Guard _offline_quant_format access for compatibility

Line 54 assumes _offline_quant_format exists and is a tensor with .numel()/.item(). Some layers expose an int-style _offline_quant_format_py instead, which would raise AttributeError and break --list-layers. Consider a safe fallback.

Suggested fix
-                format_val = int(module._offline_quant_format.item()) if module._offline_quant_format.numel() > 0 else 0
+                fmt = getattr(module, "_offline_quant_format", None)
+                if fmt is None:
+                    format_val = int(getattr(module, "_offline_quant_format_py", 0) or 0)
+                else:
+                    format_val = int(fmt.item()) if fmt.numel() > 0 else 0
examples/test_fp8_kv_cache_comprehensive.py-1225-1306 (1)

1225-1306: ⚠️ Potential issue | 🟡 Minor

Fail fast when CUDA isn’t available

Several tests unconditionally allocate CUDA tensors; add an early guard in main() to give a clear message instead of stack traces.

Suggested fix
     args = parser.parse_args()
+
+    if not torch.cuda.is_available():
+        print("CUDA is required for FP8 KV cache tests.")
+        sys.exit(2)
diffulex/model/__init__.py-20-22 (1)

20-22: ⚠️ Potential issue | 🟡 Minor

Add stacklevel=2 to point warnings at the caller.
Without it, the warning points at this module instead of the import site.

🔧 Suggested change
-            warnings.warn(f"Failed to import {module_name}: {e!r}", RuntimeWarning)
+            warnings.warn(
+                f"Failed to import {module_name}: {e!r}",
+                RuntimeWarning,
+                stacklevel=2,
+            )
diffulex/strategy/d2f/engine/scheduler.py-108-116 (1)

108-116: ⚠️ Potential issue | 🟡 Minor

Guard against silent truncation in zip() during token assignment.

If true_local_ids and accepted_ids diverge, zip() will silently drop extras. Use strict=True to fail fast.

🔧 Safer iteration
-                for true_local_id, accepted_id in zip(true_local_ids, accepted_ids):
+                for true_local_id, accepted_id in zip(true_local_ids, accepted_ids, strict=True):

The project's Python >= 3.12 requirement supports zip(..., strict=True) (available since Python 3.10).

examples/test_fp8_kv_cache_python_dequant.py-72-72 (1)

72-72: ⚠️ Potential issue | 🟡 Minor

Remove extraneous f prefix from string without placeholders.

This f-string contains no placeholders, making the f prefix unnecessary.

🧹 Proposed fix
-    print(f"\n总计:")
+    print("\n总计:")
diffulex_profiler/example.py-46-47 (1)

46-47: ⚠️ Potential issue | 🟡 Minor

Potential division by zero if outputs is empty.

If llm.generate() returns an empty list, dividing by len(outputs) will raise ZeroDivisionError.

🛡️ Proposed defensive fix
         profiler.record_metric("num_outputs", len(outputs))
-        profiler.record_metric("avg_diff_steps", 
-                              sum(o['n_diff_steps'] for o in outputs) / len(outputs))
+        if outputs:
+            profiler.record_metric("avg_diff_steps", 
+                                  sum(o['n_diff_steps'] for o in outputs) / len(outputs))
examples/test_fp8_kv_cache_python_dequant.py-3-3 (1)

3-3: ⚠️ Potential issue | 🟡 Minor

Remove unused import.

The os module is imported but never used.

🧹 Proposed fix
-import os
 import time
diffulex_kernel/__init__.py-12-21 (1)

12-21: ⚠️ Potential issue | 🟡 Minor

Tidy up lint warnings (unused noqa, unsorted __all__).
Ruff is flagging both items; easy cleanup.

Suggested fix
-    from diffulex_kernel.python.dllm_flash_attn_kernels import (  # noqa: F401
+    from diffulex_kernel.python.dllm_flash_attn_kernels import (
         dllm_flash_attn_decode as dllm_flash_attn_decode,
         dllm_flash_attn_prefill as dllm_flash_attn_prefill,
     )
-    from diffulex_kernel.python.kv_cache_kernels import (  # noqa: F401
+    from diffulex_kernel.python.kv_cache_kernels import (
         load_kvcache as load_kvcache,
         store_kvcache_distinct_layout as store_kvcache_distinct_layout,
         store_kvcache_unified_layout as store_kvcache_unified_layout,
     )
@@
 __all__ = [
     "dllm_flash_attn_decode",
     "dllm_flash_attn_prefill",
-    "store_kvcache_distinct_layout",
-    "store_kvcache_unified_layout",
     "load_kvcache",
+    "store_kvcache_distinct_layout",
+    "store_kvcache_unified_layout",
 ]

Also applies to: 48-54

diffulex/utils/quantization/strategies/no_quantization.py-16-26 (1)

16-26: ⚠️ Potential issue | 🟡 Minor

Align quantize() output with declared BF16 storage dtype for consistency.

The current implementation returns tensors as-is, creating a mismatch with the advertised BF16 storage dtype. While quantize() is not called in the current codebase (only get_storage_dtype() is used), this inconsistency conflicts with KVCacheBF16Strategy, which does enforce the declared dtype by converting to BF16. For consistency and to avoid surprises if this method is called directly, apply the same pattern:

🔧 Suggested fix
 def quantize(self, tensor: torch.Tensor, **kwargs) -> tuple[torch.Tensor, None]:
-    """No quantization, return tensor as-is."""
-    return tensor, None
+    """No quantization, but normalize to storage dtype for consistency."""
+    if tensor.dtype != torch.bfloat16:
+        tensor = tensor.to(torch.bfloat16)
+    return tensor, None
diffulex/sampler/sdar.py-49-49 (1)

49-49: ⚠️ Potential issue | 🟡 Minor

Unused confidence variable.

confidence is never used after unpacking. Prefix with _ (or use it) to avoid lint noise.

Rename to unused placeholder
-                confidence, sampled_tokens, initial_confidence = self.sample_tokens(
+                _confidence, sampled_tokens, initial_confidence = self.sample_tokens(
diffulex_bench/report.py-29-31 (1)

29-31: ⚠️ Potential issue | 🟡 Minor

Replace lambda assignment with a local function.

Ruff E731 flags assigning a lambda; a small def keeps lint clean.

Simple refactor
-    report_lines = []
-    append_line = lambda line: report_lines.append(line)
+    report_lines = []
+    def append_line(line: str) -> None:
+        report_lines.append(line)
diffulex/strategy/d2f/engine/model_runner.py-28-47 (1)

28-47: ⚠️ Potential issue | 🟡 Minor

Remove the suggested import path change; the current import is correct via backward-compatibility re-export.

The import from diffulex.utils.kv_cache_dtype is intentionally valid—this module re-exports from the new location for backward compatibility. No change needed there.

However, the exception handling concern has merit: parse_kv_cache_dtype() can raise ValueError (for invalid dtype strings) and RuntimeError (for missing torch FP8 dtypes), and catching all exceptions silently masks these issues. If an invalid kv_cache_dtype is provided or FP8 dtypes are unavailable, defaulting to "varlen" may not be the desired behavior. Consider either letting these exceptions propagate or logging them explicitly before the fallback.

diffulex_kernel/python/dllm_flash_attn_prefill_tilelang.py-172-178 (1)

172-178: ⚠️ Potential issue | 🟡 Minor

Unused scale parameter.

The scale parameter is accepted but never used. The kernel computes its own scale at line 39 as (1.0 / HEAD_DIM) ** 0.5 * 1.44269504. This is inconsistent with flash_attn_varlen_func which uses the passed scale (line 193).

💡 Options to consider
  1. If the kernel should use the passed scale, modify the kernel to accept it as a parameter.
  2. If the hardcoded scale is intentional, document why it differs from the passed value or remove the parameter to avoid confusion.
diffulex/logger.py-44-47 (1)

44-47: ⚠️ Potential issue | 🟡 Minor

Restore record.levelname after formatting to prevent ANSI codes leaking to other handlers.

The same LogRecord object is shared across all handlers. When setup_logger() is called with both a console handler (using ColoredFormatter) and a file handler, the in-place mutation of record.levelname causes ANSI color codes to be written to the log file. This occurs in actual usage (e.g., diffulex_bench/main.py).

🔧 Suggested change
     def format(self, record):
-        log_color = self.COLORS.get(record.levelname, '')
-        record.levelname = f"{log_color}{record.levelname}{self.RESET}"
-        return super().format(record)
+        original_levelname = record.levelname
+        try:
+            log_color = self.COLORS.get(record.levelname, '')
+            record.levelname = f"{log_color}{record.levelname}{self.RESET}"
+            return super().format(record)
+        finally:
+            record.levelname = original_levelname
diffulex/logger.py-155-171 (1)

155-171: ⚠️ Potential issue | 🟡 Minor

Rich markup in success() will appear verbatim when using plain handlers.

The success() method is added to the global logging.Logger class at module import time. When Rich is installed, it always emits Rich markup [green]✓[/green]. However, loggers set up with use_rich=False use plain text handlers that don't interpret Rich markup, causing the tags to be printed literally. Detect RichHandler at runtime in the success() method and fall back to colorama/plain formatting when Rich handlers are not present.

🔧 Suggested change
-    if RICH_AVAILABLE:
-        def success(self, message: str, *args, **kwargs):
-            """Log success message with rich formatting"""
-            self.info(f"[green]✓[/green] {message}", *args, **kwargs)
-    else:
-        def success(self, message: str, *args, **kwargs):
-            """Log success message"""
-            if COLORAMA_AVAILABLE:
-                self.info(f"{Fore.GREEN}✓{Style.RESET_ALL} {message}", *args, **kwargs)
-            else:
-                self.info(f"✓ {message}", *args, **kwargs)
+    if RICH_AVAILABLE:
+        def success(self, message: str, *args, **kwargs):
+            """Log success message with rich formatting when applicable"""
+            if any(isinstance(h, RichHandler) for h in self.handlers):
+                self.info(f"[green]✓[/green] {message}", *args, **kwargs)
+            elif COLORAMA_AVAILABLE:
+                self.info(f"{Fore.GREEN}✓{Style.RESET_ALL} {message}", *args, **kwargs)
+            else:
+                self.info(f"✓ {message}", *args, **kwargs)
+    else:
+        def success(self, message: str, *args, **kwargs):
+            """Log success message"""
+            if COLORAMA_AVAILABLE:
+                self.info(f"{Fore.GREEN}✓{Style.RESET_ALL} {message}", *args, **kwargs)
+            else:
+                self.info(f"✓ {message}", *args, **kwargs)
diffulex_profiler/metrics.py-69-80 (1)

69-80: ⚠️ Potential issue | 🟡 Minor

Don’t silently swallow collector errors (S110).

The try/except/pass blocks hide failures and trip Ruff S110. Logging at debug keeps metrics best‑effort while preserving observability.

🛠️ Suggested fix (log and continue)
+import logging
@@
-import torch
+import torch
+
+logger = logging.getLogger(__name__)
@@
-        except (ImportError, Exception):
-            pass
+        except (ImportError, Exception) as exc:
+            logger.debug("pynvml metrics unavailable", exc_info=exc)
@@
-    except Exception:
-        pass
+    except Exception as exc:
+        logger.debug("collect_gpu_metrics failed", exc_info=exc)
@@
-    except Exception:
-        return {}
+    except Exception as exc:
+        logger.debug("collect_cpu_metrics failed", exc_info=exc)
+        return {}
@@
-    except Exception:
-        return {}
+    except Exception as exc:
+        logger.debug("collect_memory_metrics failed", exc_info=exc)
+        return {}

Also applies to: 90-96, 103-112

diffulex/strategy/fast_dllm_v2/engine/scheduler.py-102-110 (1)

102-110: ⚠️ Potential issue | 🟡 Minor

Guard against mismatched true_local_ids / accepted_ids lengths.

zip() truncates silently; if the lists diverge, some accepted tokens are never applied. Add an explicit length check (or raise) before iterating.

🛠️ Suggested fix
             sampled_tokens_map = sample_output.sampled_tokens_map.get(seq_id, {})
             for block_id, accepted_ids in accepted_ids_map.items():
                 if not accepted_ids:
                     continue
                 diffusion_block = seq.diffusion_blocks[int(block_id)]
                 sampled_tokens = sampled_tokens_map.get(block_id, [])
                 true_local_ids = true_ids_map.get(block_id, [])
+                if len(true_local_ids) != len(accepted_ids):
+                    raise ValueError(
+                        f"Mismatch for block {block_id}: "
+                        f"{len(true_local_ids)} true ids vs {len(accepted_ids)} accepted ids"
+                    )
                 for true_local_id, accepted_id in zip(true_local_ids, accepted_ids):
                     token = sampled_tokens[accepted_id]
diffulex/utils/quantization/kv_cache_dtype.py-56-61 (1)

56-61: ⚠️ Potential issue | 🟡 Minor

Handle both float8_e4m3fn and float8_e4m3fnuz when vLLM isn't available.

Different PyTorch builds expose different FP8 E4M3 dtypes depending on version and backend:

  • float8_e4m3fn is the OCP-standard variant (NVIDIA/CUDA builds)
  • float8_e4m3fnuz is the FNUZ variant (AMD/ROCm builds, particularly MI300+)

The current code only checks for float8_e4m3fn, so it would incorrectly raise RuntimeError even when float8_e4m3fnuz is available.

🛠️ Suggested fix
 def _get_fp8_e4m3_dtype() -> torch.dtype:
     if current_platform is None:
         if hasattr(torch, "float8_e4m3fn"):
             return torch.float8_e4m3fn  # type: ignore[attr-defined]
+        if hasattr(torch, "float8_e4m3fnuz"):
+            return torch.float8_e4m3fnuz  # type: ignore[attr-defined]
         raise RuntimeError("FP8 requested but vLLM current_platform is unavailable.")
     return current_platform.fp8_dtype()
diffulex/engine/model_runner.py-165-171 (1)

165-171: ⚠️ Potential issue | 🟡 Minor

Ensure WARMING_UP is reset on failure.

If _prefill_warmup() raises, the global warming flag remains set. Use try/finally to always reset it.

🧯 Suggested fix
-        set_warming_up(True)
-        torch.cuda.empty_cache()
-        torch.cuda.reset_peak_memory_stats()
-        self._prefill_warmup()
-        reset_warming_up()
+        set_warming_up(True)
+        try:
+            torch.cuda.empty_cache()
+            torch.cuda.reset_peak_memory_stats()
+            self._prefill_warmup()
+        finally:
+            reset_warming_up()
diffulex/engine/model_runner.py-151-163 (1)

151-163: ⚠️ Potential issue | 🟡 Minor

Guard warmup against zero sequences.

num_seqs becomes 0 when max_num_batched_tokens < max_model_len, which results in an empty warmup run. Add a guard to avoid a no-op or downstream errors.

🛠️ Suggested guard
-        num_seqs = min(max_num_batched_tokens // max_model_len, self.config.max_num_seqs)
+        num_seqs = min(max_num_batched_tokens // max_model_len, self.config.max_num_seqs)
+        if num_seqs <= 0:
+            logger.warning("Warmup skipped: max_num_batched_tokens < max_model_len")
+            return
diffulex/utils/quantization/__init__.py-42-68 (1)

42-68: ⚠️ Potential issue | 🟡 Minor

__all__ ordering trips RUF022.

Ruff expects __all__ to be sorted; consider sorting to avoid lint failures.

🔧 One-line fix
-__all__ = [
+__all__ = sorted([
     # Context
     'QuantizationContext',
     'get_quantization_context',
     'set_kv_cache_strategy',
     'get_kv_cache_strategy',
@@
     'ensure_scale_tensor',
     'view_fp8_cache',
-]
+])
diffulex/utils/loader.py-111-112 (1)

111-112: ⚠️ Potential issue | 🟡 Minor

Unused variable pack_factor - potential bug or dead code.

pack_factor is calculated at line 111 but never used within _set_offline_gptq_marlin_weight. This is flagged by static analysis (F841). Either remove it if unnecessary, or verify if it should be used in subsequent logic.

🔧 Proposed fix if unused
-    pack_factor = 32 // int(bits)
     group_size_norm = in_features if group_size == -1 else group_size
diffulex_kernel/python/kv_cache_kernels.py-1051-1055 (1)

1051-1055: ⚠️ Potential issue | 🟡 Minor

Debug reference check is overly strict - any mismatch raises RuntimeError.

The FP8 debug reference check at lines 1051-1055 raises a RuntimeError if max_diff_k > 0 or max_diff_v > 0. Due to floating-point precision differences between the fused Triton kernel and Python reference implementation, small differences are expected and should not cause failures.

Consider using a tolerance threshold instead:

🔧 Proposed fix
-                # Be strict: any mismatch likely indicates indexing/mask/scale bug.
-                if max_diff_k > 0 or max_diff_v > 0:
+                # Allow small numerical differences due to fp32/bf16 conversion order
+                TOLERANCE = 1e-3  # Adjust based on expected precision
+                if max_diff_k > TOLERANCE or max_diff_v > TOLERANCE:
                     raise RuntimeError(
-                        f"FP8 fused load mismatch: max_abs_diff k={max_diff_k} v={max_diff_v}. "
+                        f"FP8 fused load mismatch exceeds tolerance: max_abs_diff k={max_diff_k} v={max_diff_v} (tol={TOLERANCE}). "
                         "Set DIFFULEX_DEBUG_FP8_LOAD_REF=0 to disable."
                     )
diffulex/utils/quantization/strategies/linear_fp8_w8a8.py-109-117 (1)

109-117: ⚠️ Potential issue | 🟡 Minor

Cache invalidation only checks device, missing shape/dtype validation.

The cache recomputes when cached[0].device != x.device, but if the original weight tensor's content, shape, or dtype changes (e.g., during fine-tuning or model surgery), the cached quantized weight becomes stale.

Consider adding shape/dtype validation or using a versioning mechanism:

🛡️ Proposed fix
         wid = id(weight)
         cached = self._weight_cache.get(wid)
-        if cached is None or cached[0].device != x.device:
+        if (cached is None 
+            or cached[0].device != x.device
+            or cached[0].shape != (weight.shape[1], weight.shape[0])):  # [K,N] from [N,K]
             q_fp8, meta = self.quantize(weight)
diffulex/utils/loader.py-237-240 (1)

237-240: ⚠️ Potential issue | 🟡 Minor

Remove unused AWQ Marlin variables that have no implementation in the loader.

Variables want_awq_marlin and is_awq_marlin_ckpt are defined but never used within this function. Unlike want_gptq_marlin and is_gptq_marlin_ckpt which have corresponding checkpoint loading logic (e.g., qzeros creation at line 442), these AWQ Marlin variables lack any implementation. While AWQ Marlin inference support exists in LinearBase and strategy classes, the loader itself does not load AWQ Marlin checkpoints. Remove these unused variables or implement the missing AWQ Marlin checkpoint loading logic to match the GPTQ Marlin pattern.

diffulex/utils/quantization/strategies/linear_fp8_w8a16.py-37-38 (1)

37-38: ⚠️ Potential issue | 🟡 Minor

Potential memory leak: weight cache keyed by id(weight) can grow unbounded.

The cache self._weight_cache uses id(weight) as keys. Since id() returns memory addresses that can be reused after objects are garbage collected, this can lead to:

  1. Stale entries if weights are replaced
  2. Unbounded growth if many different weights are processed

Consider using weakref or implementing a bounded cache with eviction.

🛡️ Proposed fix using WeakValueDictionary or bounded cache
 class LinearFP8W8A16Strategy(LinearQuantizationStrategy):
     def __init__(self, weight_dtype: str = "fp8_e4m3") -> None:
         super().__init__()
         self.weight_dtype_str = weight_dtype
-        # Cache: id(weight) -> (q_fp8_KN [K,N], scale_fp32 [1])
-        self._weight_cache: dict[int, tuple[torch.Tensor, torch.Tensor]] = {}
+        # Cache: id(weight) -> (q_fp8_KN [K,N], scale_fp32 [1])
+        # Note: bounded to avoid unbounded growth; consider LRU if needed.
+        self._weight_cache: dict[int, tuple[torch.Tensor, torch.Tensor]] = {}
+        self._weight_cache_max_size: int = 64  # Limit cache size

Then in linear_forward, add eviction logic:

if len(self._weight_cache) > self._weight_cache_max_size:
    # Simple eviction: clear oldest entries
    self._weight_cache.clear()
diffulex_bench/lm_eval_model.py-267-279 (1)

267-279: ⚠️ Potential issue | 🟡 Minor

Unused local variables: avg_tokens, avg_nfe, avg_time.

These variables are computed but never used. Either use them in the log message or remove them.

♻️ Proposed fix - use in logging or remove

Option 1: Use them in logging:

             avg_tokens = self.total_generated_tokens / self.total_samples
             avg_nfe = self.total_nfe / self.total_samples
             avg_time = self.total_generation_time / self.total_samples
             throughput = num_tokens / total_time if total_time > 0 else 0
             
             self.logger.info(
                 f"Generated {len(results)} samples | "
                 f"Tokens: {num_tokens} | "
                 f"NFE: {num_nfe} | "
                 f"Time: {total_time:.2f}s | "
-                f"Throughput: {throughput:.2f} tok/s"
+                f"Throughput: {throughput:.2f} tok/s | "
+                f"Avg tokens/sample: {avg_tokens:.1f} | "
+                f"Avg NFE/sample: {avg_nfe:.1f}"
             )

Option 2: Remove unused variables:

-            avg_tokens = self.total_generated_tokens / self.total_samples
-            avg_nfe = self.total_nfe / self.total_samples
-            avg_time = self.total_generation_time / self.total_samples
             throughput = num_tokens / total_time if total_time > 0 else 0
diffulex/utils/quantization/strategies/linear_fp8_w8a16.py-119-129 (1)

119-129: ⚠️ Potential issue | 🟡 Minor

Cache invalidation issue: device mismatch check may cause redundant quantization.

When cached[0].device != x.device, the code re-quantizes but the old entry keyed by wid remains if the weight object hasn't changed. This could lead to repeated quantization if inputs alternate between devices.

Consider storing device as part of the cache key or updating the existing entry properly.

♻️ Suggested improvement
-            wid = id(weight)
-            cached = self._weight_cache.get(wid)
-            if cached is None or cached[0].device != x.device:
+            cache_key = (id(weight), x.device)
+            cached = self._weight_cache.get(cache_key)
+            if cached is None:
                 q_fp8, meta = self.quantize(weight)
                 q_fp8 = q_fp8.to(device=x.device)
                 scales = meta["scales"].to(device=x.device, dtype=torch.float32).reshape(1)
                 q_kn = q_fp8
-                self._weight_cache[wid] = (q_fp8, scales)
+                self._weight_cache[cache_key] = (q_fp8, scales)
             else:
                 q_kn, scales = cached
diffulex/strategy/fast_dllm_v2/engine/sequence.py-119-127 (1)

119-127: ⚠️ Potential issue | 🟡 Minor

Mutable default argument and incorrect error message.

  1. SamplingParams() as a default argument is evaluated once at function definition time, not per call. This can lead to shared state issues.
  2. The error message references "BDSequence" but the class is named "FDV2Sequence".
🐛 Proposed fix
     def __init__(
         self,
         token_ids: list[int],
-        sampling_params: SamplingParams = SamplingParams(),
+        sampling_params: SamplingParams | None = None,
         config: Config | None = None,
     ):
-        super().__init__(token_ids, sampling_params)
+        super().__init__(token_ids, sampling_params or SamplingParams())
         if config is None:
-            raise ValueError("BDSequence requires a Config instance.")
+            raise ValueError("FDV2Sequence requires a Config instance.")
diffulex/utils/quantization/strategies/linear_marlin_int8_w8a16.py-73-82 (1)

73-82: ⚠️ Potential issue | 🟡 Minor

Silent exception swallowing in configure() hides configuration errors.

The try-except-pass pattern here silently ignores all errors, including genuine configuration issues (e.g., invalid config types). Consider logging or at least catching more specific exceptions.

🔧 Proposed fix with logging
+import logging
+
+logger = logging.getLogger(__name__)
+
     def configure(self, *, diffulex_config: Any | None = None) -> None:
         # Prefer explicit config fields over environment-variable based tuning.
         if diffulex_config is None:
             return
         try:
             bn = int(getattr(diffulex_config, "linear_w8a16_quant_block_n", self._quant_block_n))
             self._quant_block_n = max(1, bn)
-        except Exception:
-            pass
+        except (TypeError, ValueError) as e:
+            logger.debug(f"Failed to parse linear_w8a16_quant_block_n: {e}")
         try:
             thr = int(getattr(diffulex_config, "linear_w8a16_allspark_cublas_m_threshold", self._cublas_m_thr))
             self._cublas_m_thr = max(1, thr)
-        except Exception:
-            pass
+        except (TypeError, ValueError) as e:
+            logger.debug(f"Failed to parse linear_w8a16_allspark_cublas_m_threshold: {e}")
diffulex/strategy/fast_dllm_v2/engine/sequence.py-225-249 (1)

225-249: ⚠️ Potential issue | 🟡 Minor

Unreachable elif condition when prefix is block-aligned.

When pad_prefix_len == 0, total_num_blocks == num_prefix_blocks, making the condition block_id == num_prefix_blocks unreachable within the loop range. This causes the elif is_last_prefix_block: branch (line 233) to never execute in the aligned case.

While the resulting behavior is correct—all blocks legitimately get TO_CACHE status when the prefix has no partial block—the unreachable code path suggests the logic could be clearer. Either refactor the condition to avoid the unreachable branch, or add a comment explaining why the elif exists despite being unreachable when aligned.

Note: This same pattern exists in both diffulex/strategy/block_diffusion/engine/sequence.py and diffulex/strategy/fast_dllm_v2/engine/sequence.py.

Comment on lines 26 to 34
dataset = load_dataset("gsm8k", "main", split=split)

if limit:
dataset = dataset[:limit]

results = []
for item in dataset:
question = item["question"]
answer = item["answer"]

⚠️ Potential issue | 🔴 Critical

Bug: Dataset slicing changes iteration behavior when limit is used.

When limit is provided, dataset[:limit] returns a dict of lists (column-wise), not a sliced Dataset. The subsequent for item in dataset: will then iterate over dictionary keys (e.g., 'question', 'answer') instead of individual records.

Use dataset.select(range(limit)) to maintain Dataset iteration behavior.

🐛 Proposed fix
     dataset = load_dataset("gsm8k", "main", split=split)
     
     if limit:
-        dataset = dataset[:limit]
+        dataset = dataset.select(range(min(limit, len(dataset))))
     
     results = []
     for item in dataset:
🤖 Prompt for AI Agents
In `@diffulex_bench/datasets.py` around lines 26 - 34, The bug is that slicing
with dataset[:limit] turns the Dataset into a dict-of-lists so the subsequent
loop over dataset iterates keys; replace that slice with
dataset.select(range(limit)) so iteration yields records. Update the code around
load_dataset(..., split=split) and the conditional that checks limit to use
dataset = dataset.select(range(limit)) (referencing the dataset variable and
load_dataset call) and ensure the rest of the loop (for item in dataset,
accessing item["question"], item["answer"]) continues to work with Dataset
records.

Comment on lines 65 to 71
dataset = load_dataset("openai/humaneval", split="test")

if limit:
dataset = dataset[:limit]

results = []
for item in dataset:

⚠️ Potential issue | 🔴 Critical

Same slicing bug as in load_gsm8k.

Apply the same fix here to maintain proper Dataset iteration.

🐛 Proposed fix
     dataset = load_dataset("openai/humaneval", split="test")
     
     if limit:
-        dataset = dataset[:limit]
+        dataset = dataset.select(range(min(limit, len(dataset))))
     
     results = []
     for item in dataset:
🤖 Prompt for AI Agents
In `@diffulex_bench/datasets.py` around lines 65 - 71, The code incorrectly slices
the HuggingFace Dataset with dataset[:limit], which can convert it to a list and
break iteration; instead, when limiting the humaneval dataset obtained by
load_dataset("openai/humaneval") assign dataset = dataset.select(range(limit))
(or dataset.select(range(limit)).shuffle(...) if needed) so the result stays a
Dataset object and iteration in the subsequent loop over dataset works
correctly; update the block that checks limit to use
dataset.select(range(limit)) rather than dataset[:limit].

    assert q.is_cuda and k.is_cuda and v.is_cuda and k_cache.is_cuda and v_cache.is_cuda
    assert q.dtype == torch.bfloat16 and k.dtype == torch.bfloat16 and v.dtype == torch.bfloat16
    assert attn_metadata.block_tables is not None and attn_metadata.context_lens is not None and attn_metadata.cu_seqlens_q is not None
    assert attn_metadata.kv_cache_layout == "unified", f"only unified layout supported, got {attn_metadata.kv_cache_layout}"

⚠️ Potential issue | 🔴 Critical

Missing kv_cache_layout attribute in AttnMetaDataBase.

This assertion accesses attn_metadata.kv_cache_layout, but according to the AttnMetaDataBase class definition in diffulex/attention/metadata.py, this attribute does not exist. This will raise an AttributeError at runtime, not the intended AssertionError.

Either add kv_cache_layout: str = "unified" to AttnMetaDataBase, or use getattr with a default:

🐛 Proposed fix
-    assert attn_metadata.kv_cache_layout == "unified", f"only unified layout supported, got {attn_metadata.kv_cache_layout}"
+    kv_layout = getattr(attn_metadata, "kv_cache_layout", "unified")
+    assert kv_layout == "unified", f"only unified layout supported, got {kv_layout}"

Or add the attribute to AttnMetaDataBase in diffulex/attention/metadata.py:

kv_cache_layout: str = "unified"
🤖 Prompt for AI Agents
In `@diffulex_kernel/python/paged_attn_decode_triton.py` at line 527, The
assertion in paged_attn_decode_triton.py uses attn_metadata.kv_cache_layout
which doesn't exist on the AttnMetaDataBase class and will raise AttributeError;
fix by adding a default attribute kv_cache_layout: str = "unified" to the
AttnMetaDataBase definition in diffulex/attention/metadata.py (so the assertion
in paged_attn_decode_triton.py continues to work), or alternatively change the
assertion to use getattr(attn_metadata, "kv_cache_layout", "unified") to provide
a default — update either the AttnMetaDataBase class (preferred) or the
assertion accordingly.

Comment on lines 53 to 67
    def stop(self) -> Optional[Dict[str, Any]]:
        """Stop VizTracer and return trace file path."""
        if self.tracer is None:
            return None

        self.tracer.stop()
        output_file = self.tracer.output_file

        result = {
            "backend": "viztracer",
            "output_file": str(output_file),
        }

        self.tracer = None
        return result

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

VizTracer API save() method after stop() documentation

💡 Result:

VizTracer’s documented manual (inline) workflow is:

from viztracer import VizTracer

tracer = VizTracer()
tracer.start()
# code you want to trace
tracer.stop()
tracer.save()              # or tracer.save("my_trace.json")

save(output_file=None, file_info=None, verbose=None) “parse[s] data and save[s] report”; if output_file is None, it writes to the tracer’s default output_file path. stop(stop_option=None) stops tracing (and supports stop_option="flush_as_finish" to log unfinished functions). [1] [2]


🌐 Web query:

VizTracer Python profiler correct usage stop save output file

💡 Result:

Command line (profile a whole script)

viztracer -o trace.json  my_script.py arg1 arg2
# or
python -m viztracer -o trace.html  my_script.py
viztracer -o trace.json -- my_script.py -o my_script_output.json

Inline (start/stop only around the code you care about)

Context-manager (auto stop+save on exit):

from viztracer import VizTracer

with VizTracer(output_file="trace.json"):
    do_work()

(viztracer.readthedocs.io)

Manual start/stop + save:

from viztracer import VizTracer

tracer = VizTracer(output_file="trace.json")   # optional; can also pass in save()
tracer.start()
do_work()
tracer.stop()
tracer.save()                                 # or tracer.save("other_name.json")

(viztracer.readthedocs.io)

Common “why is my file empty/missing?” pitfalls

  • You must call save() after stop() to write the output file. (viztracer.readthedocs.io)
  • It’s recommended start() and stop() be in the same frame (same call-stack level), otherwise issues can happen. (pypi.org)
  • If your program exits via os._exit(), VizTracer can’t save; prefer sys.exit()/normal shutdown. (viztracer.readthedocs.io)


🏁 Script executed:

# Search the codebase for VizTracer usage and imports
rg -i "viztracer" --type py -A 5 -B 5

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 20992


🏁 Script executed:

# Check if there are any tests or examples of VizTracer usage
fd -e py -e txt -e md | xargs grep -l "viztracer" 2>/dev/null | head -20

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 496


Add missing save() call after stop() to persist trace data.

VizTracer's documented API requires calling save() after stop() to write the trace data to the output file. Without it, the trace file will not be created, rendering the profiler non-functional. This is listed as a common pitfall in the official documentation.

🐛 Proposed fix
     def stop(self) -> Optional[Dict[str, Any]]:
         """Stop VizTracer and return trace file path."""
         if self.tracer is None:
             return None
         
         self.tracer.stop()
+        self.tracer.save()
         output_file = self.tracer.output_file
         
         result = {
             "backend": "viztracer",
             "output_file": str(output_file),
         }
         
         self.tracer = None
         return result
🤖 Prompt for AI Agents
In `@diffulex_profiler/backends/viztracer.py` around lines 53 - 67, The stop()
method in VizTracer backend currently calls self.tracer.stop() but never calls
the required self.tracer.save(), so the trace file is not written; update stop()
(method stop, referencing self.tracer and output_file) to call
self.tracer.save() immediately after self.tracer.stop() and before reading
self.tracer.output_file, then proceed to build the result dict and set
self.tracer = None so the trace is persisted to disk.

Comment on lines 17 to 56
    def forward(self, seqs: list[SequenceBase], logits: torch.Tensor, temperatures: torch.Tensor,
                top_p=None, top_k=None, margin_confidence=False, neg_entropy=False, threshold=0.95):
        attn_metadata = self.fetch_attn_metadata()
        split_logits = torch.split(
            logits, [len(seq) for seq in seqs] if attn_metadata.is_prefill
            else [attn_metadata.diffusion_block_size] * len(seqs), dim=0
        )

        accepted_ids_map = {}
        sampled_tokens_map = {}
        true_local_ids_map = {}
        for temperature, seq, seq_logits in zip(temperatures, seqs, split_logits):
            true_local_ids_sub_map = {}
            accepted_ids_sub_map = {}
            sampled_tokens_sub_map = {}

            last_logits = self._fetch_last_logits(seq_logits, seq)

            shifted_logits = self._shift_logits(seq_logits, last_logits)

            for block_id, block in enumerate(seq.diffusion_blocks):
                if not block.is_active or sum(block.local_mask_tokens) == 0:
                    continue

                if len(block.global_mask_token_ids) == 0:
                    continue

                if attn_metadata.is_prefill:
                    mask_token_logits = shifted_logits[block.global_mask_token_ids, ...]
                else:
                    mask_token_logits = shifted_logits[block.local_mask_token_ids, ...]

                confidence, sampled_tokens, initial_confidence = self.sample_tokens(
                    mask_token_logits,
                    temperature,
                    top_p=top_p,
                    top_k=top_k,
                    neg_entropy=(neg_entropy == "neg_entropy"),
                    margin_confidence=(margin_confidence == "margin_confidence")
                )

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# 1. First, let's look at the actual file to see parameter defaults
cat -n diffulex/sampler/sdar.py | head -60

# 2. Find all calls to forward() method to see how these flags are passed
rg -n "\.forward\(" diffulex -g'*.py' -A 5

# 3. Find sample_tokens definition to understand its signature
rg -n "def sample_tokens" diffulex -g'*.py' -A 10

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 3691


🏁 Script executed:

#!/bin/bash
# Find all calls to the forward method on SDAR sampler
rg -n "forward\(" diffulex -g'*.py' -B 2 -A 5 | grep -A 7 -B 2 "margin_confidence\|neg_entropy"

# Also search for any usage of these parameter names
rg -n "margin_confidence|neg_entropy" diffulex -g'*.py' | head -30

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 4622


🏁 Script executed:

#!/bin/bash
# Search for tests or documentation
find diffulex -type f \( -name "*test*.py" -o -name "*.md" \) -exec grep -l "margin_confidence\|neg_entropy" {} \;

# Also check for any actual calls to these forward methods
rg -n "\.forward\(" diffulex -g'*.py' -B 5 | grep -E "(forward|margin_confidence|neg_entropy)" | head -40

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 48


Flags margin_confidence/neg_entropy are broken across all samplers.

Parameters are declared as booleans (default False) but compared to strings, so passing True never enables the feature. The sample_tokens() method expects booleans, but these comparisons always evaluate to False. This pattern appears in sdar.py, llada.py, dream.py, and fast_dllm_v2.py.

Change the comparisons to accept both bool and legacy string values, or standardize on one type:

Suggested fix
-                confidence, sampled_tokens, initial_confidence = self.sample_tokens(
+                confidence, sampled_tokens, initial_confidence = self.sample_tokens(
                     mask_token_logits, 
                     temperature, 
                     top_p=top_p, 
                     top_k=top_k, 
-                    neg_entropy=(neg_entropy == "neg_entropy"),
-                    margin_confidence=(margin_confidence == "margin_confidence")
+                    neg_entropy=neg_entropy is True or neg_entropy == "neg_entropy",
+                    margin_confidence=margin_confidence is True or margin_confidence == "margin_confidence"
                 )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    def forward(self, seqs: list[SequenceBase], logits: torch.Tensor, temperatures: torch.Tensor,
                top_p=None, top_k=None, margin_confidence=False, neg_entropy=False, threshold=0.95):
        attn_metadata = self.fetch_attn_metadata()
        split_logits = torch.split(
            logits, [len(seq) for seq in seqs] if attn_metadata.is_prefill
            else [attn_metadata.diffusion_block_size] * len(seqs), dim=0
        )
        accepted_ids_map = {}
        sampled_tokens_map = {}
        true_local_ids_map = {}
        for temperature, seq, seq_logits in zip(temperatures, seqs, split_logits):
            true_local_ids_sub_map = {}
            accepted_ids_sub_map = {}
            sampled_tokens_sub_map = {}
            last_logits = self._fetch_last_logits(seq_logits, seq)
            shifted_logits = self._shift_logits(seq_logits, last_logits)
            for block_id, block in enumerate(seq.diffusion_blocks):
                if not block.is_active or sum(block.local_mask_tokens) == 0:
                    continue
                if len(block.global_mask_token_ids) == 0:
                    continue
                if attn_metadata.is_prefill:
                    mask_token_logits = shifted_logits[block.global_mask_token_ids, ...]
                else:
                    mask_token_logits = shifted_logits[block.local_mask_token_ids, ...]
                confidence, sampled_tokens, initial_confidence = self.sample_tokens(
                    mask_token_logits,
                    temperature,
                    top_p=top_p,
                    top_k=top_k,
                    neg_entropy=(neg_entropy == "neg_entropy"),
                    margin_confidence=(margin_confidence == "margin_confidence")
                )

    def forward(self, seqs: list[SequenceBase], logits: torch.Tensor, temperatures: torch.Tensor,
                top_p=None, top_k=None, margin_confidence=False, neg_entropy=False, threshold=0.95):
        attn_metadata = self.fetch_attn_metadata()
        split_logits = torch.split(
            logits, [len(seq) for seq in seqs] if attn_metadata.is_prefill
            else [attn_metadata.diffusion_block_size] * len(seqs), dim=0
        )
        accepted_ids_map = {}
        sampled_tokens_map = {}
        true_local_ids_map = {}
        for temperature, seq, seq_logits in zip(temperatures, seqs, split_logits):
            true_local_ids_sub_map = {}
            accepted_ids_sub_map = {}
            sampled_tokens_sub_map = {}
            last_logits = self._fetch_last_logits(seq_logits, seq)
            shifted_logits = self._shift_logits(seq_logits, last_logits)
            for block_id, block in enumerate(seq.diffusion_blocks):
                if not block.is_active or sum(block.local_mask_tokens) == 0:
                    continue
                if len(block.global_mask_token_ids) == 0:
                    continue
                if attn_metadata.is_prefill:
                    mask_token_logits = shifted_logits[block.global_mask_token_ids, ...]
                else:
                    mask_token_logits = shifted_logits[block.local_mask_token_ids, ...]
                confidence, sampled_tokens, initial_confidence = self.sample_tokens(
                    mask_token_logits,
                    temperature,
                    top_p=top_p,
                    top_k=top_k,
                    neg_entropy=neg_entropy is True or neg_entropy == "neg_entropy",
                    margin_confidence=margin_confidence is True or margin_confidence == "margin_confidence"
                )
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 28-28: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


[warning] 49-49: Unpacked variable confidence is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
In `@diffulex/sampler/sdar.py` around lines 17 - 56, In forward(), the boolean
flags margin_confidence and neg_entropy are incorrectly compared to strings when
passed into sample_tokens (e.g., neg_entropy == "neg_entropy"), so True is never
honored; change the calls to normalize these inputs to booleans (accept both
bool and legacy string values) before passing them to sample_tokens — e.g.,
compute normalized_neg_entropy = bool(neg_entropy) or normalized_neg_entropy =
(neg_entropy is True or neg_entropy == "neg_entropy") and similarly for
margin_confidence, then call sample_tokens(...,
neg_entropy=normalized_neg_entropy,
margin_confidence=normalized_margin_confidence); apply the same normalization
pattern wherever these flags are used (including other files llada.py, dream.py,
fast_dllm_v2.py) so sample_tokens always receives a proper bool.

Comment on lines 123 to 133
            if seq.diffusion_blocks[-1].is_active:
                slot_mapping.extend([-1] * self.diffusion_block_size)
            elif seq.diffusion_blocks[-1].is_to_cache:
                need_kv_cache_store = True
                num_pages_storing = seq.num_page_blocks_in_active_diffusion_block
                total_num_pages = len(seq.block_table)
                for i in range(0, num_pages_storing):
                    start = seq.block_table[(total_num_pages - 1) - num_pages_storing + i] * self.block_size
                    end = start + self.block_size
                    slot_mapping.extend(range(start, end))


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect FDV2 block state invariants and diffusion step transitions.
rg -n "class FDV2Block|is_active|is_to_cache|next_diffusion_step|diffusion_blocks" diffulex/strategy/fast_dllm_v2/engine/sequence.py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 1534


🏁 Script executed:

#!/bin/bash
sed -n '11,100p' diffulex/strategy/fast_dllm_v2/engine/sequence.py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 2650


🏁 Script executed:

#!/bin/bash
sed -n '110,145p' diffulex/strategy/fast_dllm_v2/engine/model_runner.py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 2206


🏁 Script executed:

#!/bin/bash
rg -n "def diffusion_decoding_inputs" diffulex/strategy/fast_dllm_v2/engine/sequence.py -A 20

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 1175


🏁 Script executed:

#!/bin/bash
rg -n "def active_block_token_ids|active_block_token_ids" diffulex/strategy/fast_dllm_v2/engine/sequence.py -B 2 -A 5

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 749


🏁 Script executed:

#!/bin/bash
sed -n '251,280p' diffulex/strategy/fast_dllm_v2/engine/sequence.py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 1189


Add handling for IN_CACHE state to prevent slot_mapping desync.

The code handles only is_active and is_to_cache states, but diffusion_blocks[-1] can also be IN_CACHE (which neither condition matches). When this occurs, input_ids is extended by diffusion_decoding_inputs() but slot_mapping is not, causing a length mismatch. Add an else branch to extend slot_mapping with [-1]s for consistency with the active case.
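
A minimal sketch of the missing branch, mirroring the is_active case above; the indentation and the -1 padding semantics are taken from this comment and the prompt below, not from the actual file:

            else:
                # Block is already IN_CACHE: it contributes no new KV slots, but
                # diffusion_decoding_inputs() still extends input_ids by one block,
                # so pad slot_mapping with -1 entries to keep the two aligned.
                slot_mapping.extend([-1] * self.diffusion_block_size)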

🤖 Prompt for AI Agents
In `@diffulex/strategy/fast_dllm_v2/engine/model_runner.py` around lines 123 -
133, The code in model_runner.py fails to handle the IN_CACHE state for
seq.diffusion_blocks[-1], causing slot_mapping to be shorter than input_ids; in
the block that currently checks seq.diffusion_blocks[-1].is_active and
.is_to_cache, add an else branch that mirrors the active case by extending
slot_mapping with [-1] * self.diffusion_block_size so slot_mapping stays aligned
with the input_ids produced by diffusion_decoding_inputs(); update the branch
containing seq.diffusion_blocks[-1].is_active,
seq.diffusion_blocks[-1].is_to_cache, slot_mapping, and
diffusion_decoding_inputs() accordingly.

Comment on lines +55 to +60
# NOTE: fp16/fp32 are currently routed to the BF16 kernels in Diffulex.
# Keeping them registered avoids breaking older configs while we add
# true fp16/fp32 KV-cache kernels in the future.
@register_kv_cache_strategy("bf16", "bfloat16", "fp16", "float16", "fp32", "float32")
def _build_kv_cache_bf16() -> KVCacheBF16Strategy:
    return KVCacheBF16Strategy()

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for code that makes assumptions about fp16/fp32 KV cache behavior
rg -n -C3 'kv_cache_dtype.*fp16|kv_cache_dtype.*fp32|kv_cache_dtype.*float16|kv_cache_dtype.*float32' --type=py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 3155


🏁 Script executed:

#!/bin/bash
# Search for other code that assumes itemsize or precision based on dtype
rg -n 'itemsize|parse_kv_cache_dtype' --type=py -B2 -A2

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 22244


🏁 Script executed:

#!/bin/bash
# Check BF16Strategy implementation
cat -n diffulex/utils/quantization/strategies/kv_cache_bf16.py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 3208


🏁 Script executed:

#!/bin/bash
# Check the legacy _get_kv_cache_storage_info function
sed -n '28,50p' diffulex_legacy/engine/model_runner.py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 971


🏁 Script executed:

#!/bin/bash
# Check parse_kv_cache_dtype to understand the enum values
sed -n '1,100p' diffulex/utils/quantization/kv_cache_dtype.py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 2911


Real precision mismatch between BF16Strategy aliases and downstream size assumptions.

When kv_cache_dtype="fp32" or "float32" are used, they are routed to BF16Strategy (2 bytes), but downstream code makes incorrect assumptions: diffulex_legacy/engine/model_runner.py:34-44 returns 4 bytes for FP32 despite the strategy using 2 bytes, and examples/test_fp8_kv_cache_comprehensive.py:506 calculates itemsize as 4 bytes for fp32. This causes memory miscalculation. Use strategy's get_storage_dtype() method instead of hardcoded dtype-to-size mapping (as done in diffulex/engine/model_runner.py:197).
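
A hedged sketch of the recommended sizing path; get_kv_cache_strategy and get_storage_dtype are the names used elsewhere in this PR, the (dtype, itemsize) return shape is assumed from the Linear strategies, and block_size / num_kv_heads / head_dim stand in for the runner's own values:

    strategy = get_kv_cache_strategy()                       # resolved from config.kv_cache_dtype (call signature assumed)
    storage_dtype, itemsize = strategy.get_storage_dtype()   # e.g. (torch.bfloat16, 2) for the bf16/fp16/fp32 aliases
    bytes_per_kv_block = 2 * block_size * num_kv_heads * head_dim * itemsize   # K and V planes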

🤖 Prompt for AI Agents
In `@diffulex/utils/quantization/strategies/kv_cache_bf16.py` around lines 55 -
60, The BF16 alias registration (register_kv_cache_strategy ->
_build_kv_cache_bf16 returning KVCacheBF16Strategy) causes fp32/fp16 strings to
be treated as 2-byte storage but downstream code still uses hardcoded
dtype-to-size lookups; update callers to ask the strategy for its actual storage
dtype: call the strategy's get_storage_dtype() (e.g., on the KVCacheBF16Strategy
instance) and compute sizes via numpy dtype.itemsize instead of mapping strings
to sizes. Replace any hardcoded branches that assume "fp32" => 4 bytes (such as
code that computes itemsize) with a call to strategy.get_storage_dtype() and
np.dtype(...).itemsize so memory calculations match the registered strategy.

Comment on lines 9 to 14
@register_linear_strategy(weight_dtype="bf16", act_dtype="bf16")
def _build_linear_bf16() -> LinearQuantizationStrategy:
    return LinearBF16Strategy()


class LinearBF16Strategy(LinearQuantizationStrategy):

⚠️ Potential issue | 🔴 Critical

Class referenced before definition.

The factory function _build_linear_bf16() references LinearBF16Strategy on line 11, but the class is only defined on line 14. Because the name is resolved when the factory body runs, this raises a NameError on the first call to the builder (and at import time if register_linear_strategy invokes the builder eagerly during registration). Defining the class before the factory avoids both cases.

🐛 Proposed fix: Move class definition before the factory function
 from diffulex.utils.quantization.registry import register_linear_strategy
 from diffulex.utils.quantization.strategy import LinearQuantizationStrategy
 

+class LinearBF16Strategy(LinearQuantizationStrategy):
+    """Default Linear strategy: no quantization (bf16/bf16)."""
+
+    @property
+    def name(self) -> str:
+        return "linear_bf16"
+
+    def get_storage_dtype(self) -> tuple[torch.dtype, int]:
+        # No special storage; keep as-is.
+        return torch.bfloat16, 2
+
+    def quantize(self, tensor: torch.Tensor, **kwargs):
+        _ = kwargs
+        return tensor, None
+
+    def dequantize(self, quantized: torch.Tensor, scale_or_metadata, **kwargs) -> torch.Tensor:
+        _ = scale_or_metadata, kwargs
+        return quantized
+
+    def get_scale_shape(self, original_shape: tuple[int, ...], **kwargs) -> tuple[int, ...]:
+        _ = original_shape, kwargs
+        return tuple()
+
+
 @register_linear_strategy(weight_dtype="bf16", act_dtype="bf16")
 def _build_linear_bf16() -> LinearQuantizationStrategy:
     return LinearBF16Strategy()
-
-
-class LinearBF16Strategy(LinearQuantizationStrategy):
-    """Default Linear strategy: no quantization (bf16/bf16)."""
-
-    @property
-    def name(self) -> str:
-        return "linear_bf16"
-
-    def get_storage_dtype(self) -> tuple[torch.dtype, int]:
-        # No special storage; keep as-is.
-        return torch.bfloat16, 2
-
-    def quantize(self, tensor: torch.Tensor, **kwargs):
-        _ = kwargs
-        return tensor, None
-
-    def dequantize(self, quantized: torch.Tensor, scale_or_metadata, **kwargs) -> torch.Tensor:
-        _ = scale_or_metadata, kwargs
-        return quantized
-
-    def get_scale_shape(self, original_shape: tuple[int, ...], **kwargs) -> tuple[int, ...]:
-        _ = original_shape, kwargs
-        return tuple()
🤖 Prompt for AI Agents
In `@diffulex/utils/quantization/strategies/linear_bf16.py` around lines 9 - 14,
The factory function _build_linear_bf16() calls LinearBF16Strategy before that
class is defined, causing a NameError when the builder runs (or at import time if
registration invokes it eagerly); move the class
LinearBF16Strategy definition above the `@register_linear_strategy`-decorated
_build_linear_bf16 function (or alternatively inline the class reference by
returning an instance via a lambda that imports/defines the class first) so that
LinearBF16Strategy is defined when _build_linear_bf16() is executed.

luozixin2 added 2 commits February 9, 2026 02:19
…eg entropy

- Added normalization for margin_confidence and neg_entropy parameters in DreamSampler, FastdLLMV2Sampler, LLaDASampler, and SDARSampler classes to improve clarity and consistency in handling these options.
- Introduced a new kv_cache_layout attribute in AttnMetaDataBase for better cache management.
- Updated the model_runner to handle slot mapping when blocks are already in cache.
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 13

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
diffulex_legacy/layers/attention/attention_v4.py (1)

57-173: ⚠️ Potential issue | 🟠 Major

Track and reset FP8 running-max on device changes to prevent torch.maximum failures.

The device parameter in both _update_and_compute_fp8_scales and _update_and_compute_q_fp8_scale is unused. Cached max tensors (k_max_abs, v_max_abs, q_max_abs) are initialized from input tensors at first call, so they reside on the device of those inputs. If the model is moved to a different device (e.g., via .to(device)) or in multi-GPU scenarios, subsequent calls pass tensors on a different device, causing torch.maximum to fail with a device mismatch error. Reset caches by comparing the provided device parameter against a newly tracked kv_cache_device attribute.

🛠️ Suggested fix (track device and reset on change)
@@
         self.kv_cache_dtype_cache: str | None = None
+        self.kv_cache_device: torch.device | None = None
@@
-        # Reset running max if dtype changed
-        if self.kv_cache_dtype_cache != kv_cache_dtype:
+        # Reset running max if dtype or device changed
+        if self.kv_cache_device != device or self.kv_cache_dtype_cache != kv_cache_dtype:
             self.k_max_abs = None
             self.v_max_abs = None
             self.q_max_abs = None
             self.kv_cache_dtype_cache = kv_cache_dtype
+            self.kv_cache_device = device
@@
-        # Reset running max if dtype changed
-        if self.kv_cache_dtype_cache != kv_cache_dtype:
+        # Reset running max if dtype or device changed
+        if self.kv_cache_device != device or self.kv_cache_dtype_cache != kv_cache_dtype:
             self.q_max_abs = None
             self.kv_cache_dtype_cache = kv_cache_dtype
+            self.kv_cache_device = device
diffulex/model/llada.py (1)

199-199: ⚠️ Potential issue | 🔴 Critical

Typo: nn.Moduledict should be nn.ModuleDict.

This will cause an AttributeError at runtime since PyTorch's class name uses a capital "D".

🐛 Proposed fix
-        self.transformer = nn.Moduledict(
+        self.transformer = nn.ModuleDict(
diffulex/strategy/d2f/engine/kvcache_manager.py (1)

44-58: ⚠️ Potential issue | 🟠 Major

Avoid hashing the wrong block when allocating multiple KV blocks.

With multi-block allocation, prev_end_token/prev_block_idx stay constant while last_block changes each iteration. If required spans multiple new blocks, the hash update can be applied to the wrong (newly allocated) block, corrupting hash_to_block_id. Gate hash finalization to the block that actually contains prev_end_token.

🧩 Proposed fix
         required = self._required_kv_blocks(seq)
+        prev_end_token = seq.cached_or_caching_num_tokens - seq.caching_num_tokens - 1
+        prev_block_idx = prev_end_token // self.block_size if prev_end_token >= 0 else -1
         # Allocate enough KV blocks to cover all cached_or_caching tokens.
         while len(block_table) < required:
             last_block = self.blocks[block_table[-1]]
             # Preserve the existing "finalize previous block hash" behavior before moving on.
-            if last_block.hash == -1:
-                prev_end_token = seq.cached_or_caching_num_tokens - seq.caching_num_tokens - 1
-                prev_block_idx = prev_end_token // self.block_size
-                if prev_block_idx < seq.num_blocks:
+            if last_block.hash == -1 and (len(block_table) - 1) == prev_block_idx:
+                if 0 <= prev_block_idx < seq.num_blocks:
                     token_ids: list[int] = seq.block(prev_block_idx)
                     prefix = self.blocks[block_table[-2]].hash if len(block_table) > 1 else -1
                     h = self.compute_hash(token_ids, prefix)
                     last_block.update(h, token_ids)
                     self.hash_to_block_id[h] = last_block.block_id
diffulex_legacy/engine/model_runner.py (1)

410-421: ⚠️ Potential issue | 🟡 Minor

Potential UnboundLocalError in the fallback loop.

If the initial get_num_kvcache_blocks call fails (line 411) and enters the except block, num_kvcache_blocks is referenced in the while condition (line 415) before being assigned, causing an UnboundLocalError.

🐛 Proposed fix
         try:
             num_kvcache_blocks = get_num_kvcache_blocks(config.gpu_memory_utilization)
             assert num_kvcache_blocks > 0
         except:  # noqa: E722
             gpu_memory_utilization = config.gpu_memory_utilization
+            num_kvcache_blocks = get_num_kvcache_blocks(gpu_memory_utilization)
             while num_kvcache_blocks <= 200: 
diffulex/strategy/block_diffusion/engine/model_runner.py (1)

188-268: ⚠️ Potential issue | 🟠 Major

Ensure reset_warming_up() runs even if capture fails.

If an exception occurs during capture, the global warming flag can stay enabled and affect subsequent runs. Wrap the body in try/finally.

🛠️ Suggested fix
-        set_warming_up(True)
-        config = self.config
-        hf_config = config.hf_config
+        set_warming_up(True)
+        try:
+            config = self.config
+            hf_config = config.hf_config
             ...
-        reset_warming_up()
+        finally:
+            reset_warming_up()
🤖 Fix all issues with AI agents
In `@diffulex_bench/config.py`:
- Around line 67-103: The get_diffulex_kwargs function currently returns a
literal dict immediately, then attempts to mutate an undefined kwargs and add
quantization fields; fix by creating a single kwargs variable (e.g., kwargs = {
... } using the current dict contents from get_diffulex_kwargs), remove the
early return, then conditionally set kv_cache_dtype, decode_mode,
linear_attn_weight_dtype, linear_mlp_weight_dtype, linear_attn_act_dtype, and
linear_mlp_act_dtype onto that kwargs object, and finally return kwargs; update
references in this function to avoid the undefined variable and ensure
quantization options are included.
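
A sketch of the flow described in the item above; the quantization field names come from this item, while the base fields and attribute names are illustrative:

    def get_diffulex_kwargs(self) -> dict:
        kwargs = {
            # ... keep the base fields from the current literal dict here ...
        }
        optional_fields = (
            "kv_cache_dtype", "decode_mode",
            "linear_attn_weight_dtype", "linear_mlp_weight_dtype",
            "linear_attn_act_dtype", "linear_mlp_act_dtype",
        )
        for name in optional_fields:
            value = getattr(self, name, None)
            if value is not None:
                kwargs[name] = value
        return kwargs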

In `@diffulex_bench/lm_eval_model.py`:
- Around line 223-236: The loop collects per-request gen_args but never applies
them; update the code that calls self.runner.generate to pass per-request
SamplingParams by mapping each req's gen_args into a SamplingParams instance
(merging/overriding defaults from self.sampling_params) and pass a list of
SamplingParams instead of a single self.sampling_params; specifically, keep
building gen_args in the for req in requests loop, convert each gen_args entry
into a SamplingParams (honoring fields like max_gen_toks and until) and call
self.runner.generate(prompts, per_request_sampling_params_list, use_tqdm=not
disable_tqdm) so the runner receives list[SamplingParams] and honors per-request
overrides.

In `@diffulex_bench/runner.py`:
- Around line 19-53: The tokenizer is being loaded with
AutoTokenizer.from_pretrained(..., trust_remote_code=True) inside __init__ which
is unsafe; add a new parameter (e.g., trust_remote_code: bool = False and
optional revision: Optional[str] = None) to the Runner __init__ signature, pass
that parameter to AutoTokenizer.from_pretrained and only set trust_remote_code
when explicitly True, and if a mutable remote execution is required encourage
pinning by forwarding revision to from_pretrained; update the __init__'s
tokenizer_path handling and the call site that constructs DiffulexRunner to
opt-in when needed (also apply same pattern to other modules like
diffulex/config.py and diffulex/engine/llm_engine.py where
AutoTokenizer.from_pretrained or model loading uses trust_remote_code).
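
A sketch of the opt-in signature described in the item above; the class name and the rest of __init__ are assumptions, while trust_remote_code and revision are standard AutoTokenizer.from_pretrained parameters:

from transformers import AutoTokenizer

class DiffulexRunner:  # actual class name in runner.py may differ
    def __init__(self, model_path: str, tokenizer_path: str | None = None,
                 trust_remote_code: bool = False, revision: str | None = None):
        self.tokenizer = AutoTokenizer.from_pretrained(
            tokenizer_path or model_path,
            trust_remote_code=trust_remote_code,  # off by default; callers opt in explicitly
            revision=revision,                    # pin a revision when remote code is allowed
        )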

In `@diffulex_kernel/python/kv_cache_kernels.py`:
- Around line 919-945: store_kvcache_distinct_layout currently doesn't trim
slot_mapping for partial-prefill cases, causing failures when slot_mapping is
longer than the current token slice; update store_kvcache_distinct_layout to
mirror the unified-layout behavior by slicing/trimming slot_mapping to the
actual token count before calling _store_kvcache_distinct_bf16 or
_store_kvcache_distinct_fp8 (i.e., compute the active length from key/value
tensors or attn_metadata and replace slot_mapping with slot_mapping[:active_len]
when it's longer), and then pass the trimmed slot_mapping into those helper
functions.

In `@diffulex_profiler/__init__.py`:
- Around line 12-17: The unconditional imports of VizTracerBackend and
PyTorchProfilerBackend cause ImportError when optional deps are absent; change
the top-level imports so ProfilerBackend and SimpleTimerBackend are imported
normally, but wrap imports of VizTracerBackend and PyTorchProfilerBackend in
try/except ImportError blocks (or use getattr fallback) and only add those names
to the module exports when successfully imported; also update the module's
__all__ to include the optional backend names conditionally so the package
doesn't fail to import if optional dependencies are missing.
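
A sketch of the guarded-import pattern for diffulex_profiler/__init__.py; backends/viztracer.py exists in this PR, the other submodule paths are assumptions:

from .backends.base import ProfilerBackend
from .backends.simple_timer import SimpleTimerBackend

__all__ = ["ProfilerBackend", "SimpleTimerBackend"]

try:
    from .backends.viztracer import VizTracerBackend
    __all__.append("VizTracerBackend")
except ImportError:  # viztracer not installed
    VizTracerBackend = None

try:
    from .backends.pytorch_profiler import PyTorchProfilerBackend
    __all__.append("PyTorchProfilerBackend")
except ImportError:  # torch profiler backend unavailable
    PyTorchProfilerBackend = None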

In `@diffulex_profiler/exporters/summary.py`:
- Around line 57-59: The loop is shadowing the module-level/file-level variable
output_file (set earlier around line 19) by reassigning output_file when
handling viztracer backend data; rename the local variable (e.g.,
viztracer_output_file or viz_output_file) inside the if m.backend_data and
m.backend_data.get("backend") == "viztracer" block and update the summary_lines
append to use that new name so the original output_file used for writing the
.txt summary is not overwritten; locate the handling code using symbols
m.backend_data, summary_lines, and output_file to make the change.

In `@diffulex/engine/model_runner.py`:
- Around line 193-197: The code calls strategy.get_storage_dtype() and later
expects strategy.init_scales(), but NoQuantizationStrategy (returned by
get_kv_cache_strategy fallback) doesn't implement init_scales, causing errors;
modify the fallback so get_kv_cache_strategy() never returns
NoQuantizationStrategy for KV-cache use (e.g., default to a KV-capable strategy
like BF16QuantizationStrategy) or add a guard before calling init_scales() to
skip/handle strategies without that method; update the logic around
get_kv_cache_strategy(), NoQuantizationStrategy, get_storage_dtype, and any
subsequent init_scales() calls (also apply the same change to the similar block
around lines 290-303) so only strategies that implement the KV-cache interface
are used for init_scales().
- Around line 165-171: In warmup_model, ensure reset_warming_up() always runs by
wrapping the work between set_warming_up(True) and reset_warming_up() in a
try/finally: call set_warming_up(True), do torch.cuda.empty_cache(),
torch.cuda.reset_peak_memory_stats() and call self._prefill_warmup() inside the
try block, and call reset_warming_up() in the finally block so that any
exception in _prefill_warmup() still clears the warming flag.
- Around line 151-163: In _prefill_warmup, guard against num_seqs resolving to
0: compute num_seqs from max_num_batched_tokens and max_model_len, and if
num_seqs == 0 log a debug/info message and return early so you don't call
self.run([]) or create zero-length seqs; otherwise continue to build seqs via
AutoSequence.create, call self.run(seqs, True), call seq.post_process() for each
seq, and still call torch.cuda.empty_cache() as before.
- Around line 40-49: Compute device_id before calling dist.init_process_group
and use that computed value for both init and torch.cuda.set_device to avoid
indexing a missing/short config.device_ids list; specifically, in
model_runner.py determine device_id by checking getattr(config, "device_ids",
None) and falling back to (getattr(config, "device_start", 0) or 0) + rank,
validate it against torch.cuda.device_count(), then pass device_id to
dist.init_process_group (instead of indexing config.device_ids again) and call
torch.cuda.set_device(device_id).

In `@diffulex/sampler/fast_dllm_v2.py`:
- Around line 69-72: Update the two schedulers that still call token.item()
(diffulex/strategy/fast_dllm_v2/engine/scheduler.py and
diffulex/strategy/block_diffusion/engine/scheduler.py): find the comparison
using token.item() == self.eos and replace it with a defensive conversion that
accepts either a Tensor or a Python int (e.g., if isinstance(token,
torch.Tensor): value = int(token.item()) else: value = int(token)) and then
compare value == self.eos; ensure this change is applied wherever sampled tokens
from sampled_tokens_sub_map or accepted_ids_sub_map are checked so list values
(already Python ints) and tensors both work correctly.
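
A sketch of the defensive conversion; token, sampled_tokens, accepted_id and self.eos follow the names in this item, and the surrounding loop plus the torch import are assumed:

                    token = sampled_tokens[accepted_id]
                    value = int(token.item()) if isinstance(token, torch.Tensor) else int(token)
                    if value == self.eos:
                        ...  # existing EOS handling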

In `@diffulex/strategy/fast_dllm_v2/engine/sequence.py`:
- Around line 121-125: The __init__ for the Sequence class currently uses a
shared mutable SamplingParams() as a default; change the signature to use
sampling_params: SamplingParams | None = None and inside Sequence.__init__
create a new instance when None (e.g., sampling_params = SamplingParams() if
sampling_params is None else sampling_params) before calling
super().__init__(token_ids, sampling_params), ensuring each Sequence gets its
own SamplingParams instance and avoiding shared mutable defaults.

In `@diffulex/utils/quantization/strategies/linear_marlin_int8_w8a16.py`:
- Around line 101-132: get_storage_dtype declares torch.uint8 storage but
quantize()/dequantize() use signed int8; change quantize in function
quantize(...) to produce uint8 by biasing signed int8 values (add 128) and
clamping to [0,255] and return dtype torch.uint8, and change dequantize in
dequantize(...) to accept the uint8 storage, convert back to signed by
subtracting 128 (or cast to int8 after subtract) before multiplying by scales;
ensure scales handling (scales.squeeze/unsqueeze) stays the same and types are
converted to float32 for arithmetic then result cast to bfloat16, so
get_storage_dtype, quantize, and dequantize are consistent.
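
A sketch of the bias-to-uint8 scheme described in the item above; the scale layout and rounding mode are assumptions, only the +128 bias, the [0, 255] storage range, and the float32/bfloat16 casts come from this item:

import torch

def quantize_to_uint8(w: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    q_int8 = torch.clamp(torch.round(w.to(torch.float32) / scales), -128, 127)
    return (q_int8 + 128).to(torch.uint8)        # biased storage in [0, 255]

def dequantize_from_uint8(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    signed = q.to(torch.float32) - 128.0         # undo the +128 bias
    return (signed * scales.to(torch.float32)).to(torch.bfloat16)
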
🟡 Minor comments (24)
diffulex/utils/quantization/quantize_model.py-147-201 (1)

147-201: ⚠️ Potential issue | 🟡 Minor

Remove unused pack_factor (ruff F841).

🧹 Suggested fix
-    pack_factor = 32 // bits
     qweight = gptq_pack(w_q, bits, size_k, size_n).contiguous()  # [K/pack, N]
diffulex/utils/quantization/quantize_model.py-707-712 (1)

707-712: ⚠️ Potential issue | 🟡 Minor

Remove unused f-string prefixes (ruff F541).

Lines 707 and 712 contain only literal strings with no variable interpolation, making the f prefix unnecessary.

🧹 Suggested fix
-    print(f"\n✓ Quantization complete!")
+    print("\n✓ Quantization complete!")
@@
-    print(f"\n  You can now use this directory directly as model path:")
+    print("\n  You can now use this directory directly as model path:")
diffulex/layer/linear.py-824-824 (1)

824-824: ⚠️ Potential issue | 🟡 Minor

Remove unused variable dev_key.

The variable dev_key is assigned but never used. This was also flagged by static analysis (F841).

🧹 Proposed fix
-                    dev_key = self._device_index(device)
diffulex/layer/linear.py-1298-1300 (1)

1298-1300: ⚠️ Potential issue | 🟡 Minor

Redundant check: in_features is already an int.

Line 1298 assigns in_features = int(self._offline_quant_in_features_py), so checking if in_features is None on line 1299 will never be true since int() never returns None.

🐛 Proposed fix
         in_features = int(self._offline_quant_in_features_py)
-        if in_features is None or in_features <= 0:
+        if in_features <= 0:
             raise RuntimeError("GPTQ offline 权重已加载,但无法推断 in_features 以计算 weight_bits。")
examples/test_fp8_kv_cache_distinct.py-72-72 (1)

72-72: ⚠️ Potential issue | 🟡 Minor

Remove extraneous f prefix from string without placeholders.

The f-string on this line has no placeholders, making the f prefix unnecessary.

Proposed fix
-    print(f"\n总计:")
+    print("\n总计:")
diffulex/strategy/fast_dllm_v2/attention/metadata.py-16-18 (1)

16-18: ⚠️ Potential issue | 🟡 Minor

Potential type issue: sum() on a tensor doesn't return a comparable scalar.

If context_lens is a torch.Tensor, sum(self.context_lens) will iterate and sum elements but returns a tensor, not a Python scalar. The comparison > 0 may not behave as expected for zero-dimensional tensors in some contexts.

Proposed fix
     def __post_init__(self):
-        if self.context_lens is not None and sum(self.context_lens) > 0:
+        if self.context_lens is not None and self.context_lens.sum().item() > 0:
             self.total_lens = self.diffusion_block_size + self.context_lens
diffulex_profiler/README.md-163-169 (1)

163-169: ⚠️ Potential issue | 🟡 Minor

Doc mismatch: use tokens= parameter name.

The API reference later lists record_throughput(tokens: int, ...), but this example uses total_tokens. Align the example with the API to avoid confusion.

📝 Proposed fix
-    profiler.record_throughput(total_tokens=1000)
+    profiler.record_throughput(tokens=1000)
examples/test_bf16_kernel_e2e.py-70-70 (1)

70-70: ⚠️ Potential issue | 🟡 Minor

Remove redundant f-string prefix.

Line 70 uses an f-string without any placeholders; drop the f to satisfy F541.

🔧 Proposed fix
-    print(f"\n总计:")
+    print("\n总计:")
examples/test_fp8_kv_cache_python_dequant.py-72-72 (1)

72-72: ⚠️ Potential issue | 🟡 Minor

Remove extraneous f prefix.

This f-string has no placeholders.

Proposed fix
-    print(f"\n总计:")
+    print("\n总计:")
examples/test_fastdllmv2_diffulex_gsm8k.py-69-75 (1)

69-75: ⚠️ Potential issue | 🟡 Minor

Create the profiling output directory before writing.
The nested log/profiles/... path will fail if the directory doesn’t exist.

💡 Proposed fix
     if PROFILE:
         output_file = "log/profiles/perf_dvllm_dream_7B.json"
+        os.makedirs(os.path.dirname(output_file), exist_ok=True)
         if os.path.exists(output_file):
             os.remove(output_file)
#!/bin/bash
# Sanity check: ensure the profiling output directory exists in the repo (if expected).
if [ ! -d log/profiles ]; then
  echo "log/profiles is missing; profiling output may fail unless created at runtime."
fi
examples/test_fp8_linear.py-115-122 (1)

115-122: ⚠️ Potential issue | 🟡 Minor

Remove unused variables to satisfy Ruff F841.

M, mem_bf16, and mem_fp8 are unused and will fail linting.

🛠️ Suggested fix
-    M, K, N = 32, 512, 256
+    K, N = 512, 256
     weight_bf16 = torch.randn(N, K, dtype=torch.bfloat16, device=device)
-    mem_bf16 = torch.cuda.memory_allocated()
@@
     strategy = create_linear_strategy(weight_dtype="fp8_e4m3", act_dtype="bf16")
     weight_fp8, scales = strategy.quantize_weight_for_kernel(weight_bf16, device=device)
-    mem_fp8 = torch.cuda.memory_allocated()
diffulex/sampler/sdar.py-51-58 (1)

51-58: ⚠️ Potential issue | 🟡 Minor

Rename unused confidence to _confidence to satisfy Ruff.

🛠️ Suggested fix
-                confidence, sampled_tokens, initial_confidence = self.sample_tokens(
+                _confidence, sampled_tokens, initial_confidence = self.sample_tokens(
                     mask_token_logits,
                     temperature,
                     top_p=top_p,
                     top_k=top_k,
                     neg_entropy=normalized_neg_entropy,
                     margin_confidence=normalized_margin_confidence,
                 )
diffulex_bench/metrics.py-66-83 (1)

66-83: ⚠️ Potential issue | 🟡 Minor

Align HumanEval stub typing and silence unused-arg warnings.

The function is annotated as float but returns None, and results/k are unused. Consider Optional[float] and a dummy assignment to avoid lint errors.

🛠️ Suggested fix
-def humaneval_pass_at_k(
-    results: List[Dict[str, Any]],
-    k: int = 1,
-) -> float:
+def humaneval_pass_at_k(
+    results: List[Dict[str, Any]],
+    k: int = 1,
+) -> Optional[float]:
@@
-    # Returns None, actual evaluation requires implementing code execution logic
+    _ = results, k
+    # Returns None, actual evaluation requires implementing code execution logic
     return None
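
Note that Optional[float] assumes Optional is available in diffulex_bench/metrics.py; if it is not already imported, the typing import would need to be extended, e.g.:

from typing import Any, Dict, List, Optional
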
diffulex/logger.py-44-47 (1)

44-47: ⚠️ Potential issue | 🟡 Minor

Restore record.levelname after formatting to prevent color codes leaking into file logs.

LogRecord is shared across all handlers in Python's logging system. When ColoredFormatter.format() mutates record.levelname to add ANSI color codes, subsequent handlers (like file handlers) receive the modified value, resulting in color codes appearing in log files.

Wrap the mutation in a try/finally block to restore the original value:

🛠️ Suggested fix
     def format(self, record):
         log_color = self.COLORS.get(record.levelname, '')
-        record.levelname = f"{log_color}{record.levelname}{self.RESET}"
-        return super().format(record)
+        original_levelname = record.levelname
+        try:
+            record.levelname = f"{log_color}{record.levelname}{self.RESET}"
+            return super().format(record)
+        finally:
+            record.levelname = original_levelname
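
For context, a minimal sketch of how the leak would surface without the restore (the handler setup is illustrative and the ColoredFormatter import path is an assumption):

import logging

from diffulex.logger import ColoredFormatter  # assumed import path

logger = logging.getLogger("demo")
console = logging.StreamHandler()
console.setFormatter(ColoredFormatter("%(levelname)s %(message)s"))
file_handler = logging.FileHandler("app.log")
file_handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(console)
logger.addHandler(file_handler)
logger.warning("hello")  # app.log stays ANSI-free only if levelname is restored
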
examples/test_fp8_linear.py-135-152 (1)

135-152: ⚠️ Potential issue | 🟡 Minor

Add early CUDA guard to skip FP8 tests when CUDA is unavailable.

The FP8 quantization tests depend on vLLM's Fp8LinearOp and custom CUDA kernels, which will fail when running on CPU-only setups or when FP8 kernels aren't available. This follows the same pattern already established in test_memory_usage() (lines 106–108), providing early feedback instead of cryptic runtime errors.

Suggested fix
 def main():
     """Run all end-to-end tests."""
     print("=" * 60)
     print("FP8 Linear Quantization End-to-End Tests")
     print("=" * 60)
     print()
+    if not torch.cuda.is_available():
+        print("CUDA not available; skipping FP8 tests.")
+        return 0
     
     try:
diffulex_kernel/python/paged_attn_decode_triton.py-72-75 (1)

72-75: ⚠️ Potential issue | 🟡 Minor

Rename l accumulator to satisfy Ruff E741 and improve clarity.

l is flagged as ambiguous. Renaming to lse/logsumexp avoids lint failures and makes the intent clearer (apply in all kernels).

✏️ Example rename (apply across kernels)
-    l = tl.zeros([BLOCK_M], dtype=tl.float32)
+    lse = tl.zeros([BLOCK_M], dtype=tl.float32)
@@
-        l_new = l * tl.exp(m - m_new) + tl.sum(p, axis=1)
+        l_new = lse * tl.exp(m - m_new) + tl.sum(p, axis=1)
@@
-        l = l_new
+        lse = l_new
@@
-    out = acc / l[:, None]
+    out = acc / lse[:, None]

Also applies to: 229-232, 395-397

diffulex/strategy/fast_dllm_v2/engine/scheduler.py-90-122 (1)

90-122: ⚠️ Potential issue | 🟡 Minor

Guard against mismatched accepted/true ID lengths.

zip() will silently truncate if the lists diverge, which can drop token updates without notice. Add a length check (or strict=True) before iterating.

🔧 Suggested guard
                 sampled_tokens = sampled_tokens_map.get(block_id, [])
                 true_local_ids = true_ids_map.get(block_id, [])
-                for true_local_id, accepted_id in zip(true_local_ids, accepted_ids):
+                if len(true_local_ids) != len(accepted_ids):
+                    raise ValueError(
+                        f"Mismatched lengths for block {block_id}: "
+                        f"{len(true_local_ids)} true_ids vs {len(accepted_ids)} accepted_ids"
+                    )
+                for true_local_id, accepted_id in zip(true_local_ids, accepted_ids):
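
On Python 3.10+, the same guarantee is also available without an explicit length check by passing strict=True to zip; a minimal sketch with illustrative lists:

true_local_ids = [0, 1, 2]
accepted_ids = [5, 7]  # mismatched on purpose
for true_local_id, accepted_id in zip(true_local_ids, accepted_ids, strict=True):
    pass  # raises ValueError because the lengths differ
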
diffulex_bench/main.py-19-77 (1)

19-77: ⚠️ Potential issue | 🟡 Minor

Filter out None values in model_args.

If optional fields are unset, key=None gets passed to lm_eval and can be misparsed.

🧹 Suggested fix
-    args_list = [f"{k}={v}" for k, v in args_dict.items()]
+    args_list = [f"{k}={v}" for k, v in args_dict.items() if v is not None]
diffulex_bench/config.py-60-65 (1)

60-65: ⚠️ Potential issue | 🟡 Minor

Rename the loop variable to avoid shadowing field.

Ruff flags F811 here; use a different loop variable.

🧹 Suggested cleanup
-        return {
-            field.name: getattr(self, field.name)
-            for field in self.__dataclass_fields__.values()
-        }
+        return {
+            f.name: getattr(self, f.name)
+            for f in self.__dataclass_fields__.values()
+        }
...
-        return {
-            field.name: getattr(self, field.name)
-            for field in self.__dataclass_fields__.values()
-        }
+        return {
+            f.name: getattr(self, f.name)
+            for f in self.__dataclass_fields__.values()
+        }

Also applies to: 131-136

diffulex/utils/quantization/strategies/__init__.py-10-19 (1)

10-19: ⚠️ Potential issue | 🟡 Minor

Remove unused # noqa: F401 directives.

Ruff flags these as unused since F401 isn’t enabled; dropping them keeps lint clean.

🧹 Suggested cleanup
-from diffulex.utils.quantization.strategies.linear_int8_w8a16 import LinearInt8W8A16Strategy  # noqa: F401
-from diffulex.utils.quantization.strategies.linear_int4_w4a16 import LinearInt4W4A16Strategy  # noqa: F401
-from diffulex.utils.quantization.strategies.linear_int8_w8a8 import LinearInt8W8A8Strategy  # noqa: F401
-from diffulex.utils.quantization.strategies.linear_int4_w4a8 import LinearInt4W4A8Strategy  # noqa: F401
-from diffulex.utils.quantization.strategies.linear_fp8_w8a16 import LinearFP8W8A16Strategy  # noqa: F401
-from diffulex.utils.quantization.strategies.linear_fp8_w8a8 import LinearFP8W8A8Strategy  # noqa: F401
-from diffulex.utils.quantization.strategies.linear_gptq_w4a16 import LinearGPTQW4A16Strategy  # noqa: F401
-from diffulex.utils.quantization.strategies.linear_gptq_marlin_w4a16 import LinearGPTQMarlinW4A16Strategy  # noqa: F401
-from diffulex.utils.quantization.strategies.linear_awq_w4a16 import LinearAWQW4A16Strategy  # noqa: F401
-from diffulex.utils.quantization.strategies.linear_awq_marlin_w4a16 import LinearAWQMarlinW4A16Strategy  # noqa: F401
+from diffulex.utils.quantization.strategies.linear_int8_w8a16 import LinearInt8W8A16Strategy
+from diffulex.utils.quantization.strategies.linear_int4_w4a16 import LinearInt4W4A16Strategy
+from diffulex.utils.quantization.strategies.linear_int8_w8a8 import LinearInt8W8A8Strategy
+from diffulex.utils.quantization.strategies.linear_int4_w4a8 import LinearInt4W4A8Strategy
+from diffulex.utils.quantization.strategies.linear_fp8_w8a16 import LinearFP8W8A16Strategy
+from diffulex.utils.quantization.strategies.linear_fp8_w8a8 import LinearFP8W8A8Strategy
+from diffulex.utils.quantization.strategies.linear_gptq_w4a16 import LinearGPTQW4A16Strategy
+from diffulex.utils.quantization.strategies.linear_gptq_marlin_w4a16 import LinearGPTQMarlinW4A16Strategy
+from diffulex.utils.quantization.strategies.linear_awq_w4a16 import LinearAWQW4A16Strategy
+from diffulex.utils.quantization.strategies.linear_awq_marlin_w4a16 import LinearAWQMarlinW4A16Strategy
diffulex/utils/quantization/strategies/__init__.py-21-37 (1)

21-37: ⚠️ Potential issue | 🟡 Minor

Sort __all__ to satisfy RUF022.

This keeps the public export list stable and lint-clean.

🔤 Suggested ordering
 __all__ = [
-    'NoQuantizationStrategy',
-    'KVCacheBF16Strategy',
-    'KVCacheFP8RunningMaxStrategy',
-    'LinearBF16Strategy',
-    'LinearStubStrategy',
-    'LinearInt8W8A16Strategy',
-    'LinearInt4W4A16Strategy',
-    'LinearInt8W8A8Strategy',
-    'LinearInt4W4A8Strategy',
-    'LinearFP8W8A16Strategy',
-    'LinearFP8W8A8Strategy',
-    'LinearGPTQW4A16Strategy',
-    'LinearGPTQMarlinW4A16Strategy',
-    'LinearAWQW4A16Strategy',
-    'LinearAWQMarlinW4A16Strategy',
+    'KVCacheBF16Strategy',
+    'KVCacheFP8RunningMaxStrategy',
+    'LinearAWQMarlinW4A16Strategy',
+    'LinearAWQW4A16Strategy',
+    'LinearBF16Strategy',
+    'LinearFP8W8A16Strategy',
+    'LinearFP8W8A8Strategy',
+    'LinearGPTQMarlinW4A16Strategy',
+    'LinearGPTQW4A16Strategy',
+    'LinearInt4W4A16Strategy',
+    'LinearInt4W4A8Strategy',
+    'LinearInt8W8A16Strategy',
+    'LinearInt8W8A8Strategy',
+    'LinearStubStrategy',
+    'NoQuantizationStrategy',
 ]
diffulex/utils/quantization/strategies/linear_awq_marlin_w4a16.py-124-131 (1)

124-131: ⚠️ Potential issue | 🟡 Minor

Add an explicit dtype guard before the kernel call.
linear_act_format is bf16, but linear_forward will pass other dtypes through; a fast-fail (or cast) avoids undefined behavior.

🛡️ Suggested fix
-        dtype_id = 1 if reshaped_x.dtype == torch.bfloat16 else (2 if reshaped_x.dtype == torch.float16 else 0)
+        if reshaped_x.dtype not in (torch.bfloat16, torch.float16):
+            raise RuntimeError("awq_marlin expects bf16/fp16 inputs.")
+        dtype_id = 1 if reshaped_x.dtype == torch.bfloat16 else 2
diffulex_legacy/layers/attention/ops/kv_cache_kernels.py-458-480 (1)

458-480: ⚠️ Potential issue | 🟡 Minor

Remove the redundant KvCacheDType import inside load_kvcache.
It redefines the already-imported symbol and triggers Ruff F811.

🧹 Suggested fix
-    from diffulex.utils.kv_cache_dtype import KvCacheDType
     if out_dtype == torch.bfloat16:
         out_dtype_enum = int(KvCacheDType.BF16)  # 0
diffulex/strategy/fast_dllm_v2/engine/sequence.py-126-127 (1)

126-127: ⚠️ Potential issue | 🟡 Minor

Fix copy‑paste error in the exception message.
The message says “BDSequence” but the class is FDV2Sequence, which can mislead callers.

✏️ Proposed fix
-            raise ValueError("BDSequence requires a Config instance.")
+            raise ValueError("FDV2Sequence requires a Config instance.")

Comment on lines 67 to 103
    def get_diffulex_kwargs(self) -> Dict[str, Any]:
        """Get arguments to pass to Diffulex engine"""
        return {
            'model_name': self.model_name,
            'decoding_strategy': self.decoding_strategy,
            'mask_token_id': self.mask_token_id,
            'tensor_parallel_size': self.tensor_parallel_size,
            'data_parallel_size': self.data_parallel_size,
            'gpu_memory_utilization': self.gpu_memory_utilization,
            'max_model_len': self.max_model_len,
            'max_num_batched_tokens': self.max_num_batched_tokens,
            'max_num_seqs': self.max_num_seqs,
            'use_lora': self.use_lora,
            'lora_path': self.lora_path if self.use_lora else "",
            'enforce_eager': self.enforce_eager,
            'kv_cache_layout': self.kv_cache_layout,
            'accept_threshold': self.accept_threshold,
            'complete_threshold': self.complete_threshold,
            'add_new_block_threshold': self.add_new_block_threshold,
            'diffusion_block_size': self.diffusion_block_size,
        }

        # Add quantization parameters if specified
        if self.kv_cache_dtype is not None:
            kwargs['kv_cache_dtype'] = self.kv_cache_dtype
        if self.decode_mode is not None:
            kwargs['decode_mode'] = self.decode_mode
        if self.linear_attn_weight_dtype is not None:
            kwargs['linear_attn_weight_dtype'] = self.linear_attn_weight_dtype
        if self.linear_mlp_weight_dtype is not None:
            kwargs['linear_mlp_weight_dtype'] = self.linear_mlp_weight_dtype
        if self.linear_attn_act_dtype is not None:
            kwargs['linear_attn_act_dtype'] = self.linear_attn_act_dtype
        if self.linear_mlp_act_dtype is not None:
            kwargs['linear_mlp_act_dtype'] = self.linear_mlp_act_dtype

        return kwargs

⚠️ Potential issue | 🟠 Major

get_diffulex_kwargs returns before adding quantization fields.

The function returns a dict immediately, so the quantization options are never applied and kwargs is undefined.

✅ Suggested fix
-        return {
+        kwargs = {
             'model_name': self.model_name,
             'decoding_strategy': self.decoding_strategy,
             'mask_token_id': self.mask_token_id,
             'tensor_parallel_size': self.tensor_parallel_size,
             'data_parallel_size': self.data_parallel_size,
             'gpu_memory_utilization': self.gpu_memory_utilization,
             'max_model_len': self.max_model_len,
             'max_num_batched_tokens': self.max_num_batched_tokens,
             'max_num_seqs': self.max_num_seqs,
             'use_lora': self.use_lora,
             'lora_path': self.lora_path if self.use_lora else "",
             'enforce_eager': self.enforce_eager,
             'kv_cache_layout': self.kv_cache_layout,
             'accept_threshold': self.accept_threshold,
             'complete_threshold': self.complete_threshold,
             'add_new_block_threshold': self.add_new_block_threshold,
             'diffusion_block_size': self.diffusion_block_size,
         }
         
         # Add quantization parameters if specified
         if self.kv_cache_dtype is not None:
             kwargs['kv_cache_dtype'] = self.kv_cache_dtype
         if self.decode_mode is not None:
             kwargs['decode_mode'] = self.decode_mode
         if self.linear_attn_weight_dtype is not None:
             kwargs['linear_attn_weight_dtype'] = self.linear_attn_weight_dtype
         if self.linear_mlp_weight_dtype is not None:
             kwargs['linear_mlp_weight_dtype'] = self.linear_mlp_weight_dtype
         if self.linear_attn_act_dtype is not None:
             kwargs['linear_attn_act_dtype'] = self.linear_attn_act_dtype
         if self.linear_mlp_act_dtype is not None:
             kwargs['linear_mlp_act_dtype'] = self.linear_mlp_act_dtype
         
         return kwargs
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     def get_diffulex_kwargs(self) -> Dict[str, Any]:
         """Get arguments to pass to Diffulex engine"""
-        return {
+        kwargs = {
             'model_name': self.model_name,
             'decoding_strategy': self.decoding_strategy,
             'mask_token_id': self.mask_token_id,
             'tensor_parallel_size': self.tensor_parallel_size,
             'data_parallel_size': self.data_parallel_size,
             'gpu_memory_utilization': self.gpu_memory_utilization,
             'max_model_len': self.max_model_len,
             'max_num_batched_tokens': self.max_num_batched_tokens,
             'max_num_seqs': self.max_num_seqs,
             'use_lora': self.use_lora,
             'lora_path': self.lora_path if self.use_lora else "",
             'enforce_eager': self.enforce_eager,
             'kv_cache_layout': self.kv_cache_layout,
             'accept_threshold': self.accept_threshold,
             'complete_threshold': self.complete_threshold,
             'add_new_block_threshold': self.add_new_block_threshold,
             'diffusion_block_size': self.diffusion_block_size,
         }
         # Add quantization parameters if specified
         if self.kv_cache_dtype is not None:
             kwargs['kv_cache_dtype'] = self.kv_cache_dtype
         if self.decode_mode is not None:
             kwargs['decode_mode'] = self.decode_mode
         if self.linear_attn_weight_dtype is not None:
             kwargs['linear_attn_weight_dtype'] = self.linear_attn_weight_dtype
         if self.linear_mlp_weight_dtype is not None:
             kwargs['linear_mlp_weight_dtype'] = self.linear_mlp_weight_dtype
         if self.linear_attn_act_dtype is not None:
             kwargs['linear_attn_act_dtype'] = self.linear_attn_act_dtype
         if self.linear_mlp_act_dtype is not None:
             kwargs['linear_mlp_act_dtype'] = self.linear_mlp_act_dtype
         return kwargs
🧰 Tools
🪛 Ruff (0.14.14)

[error] 91-91: Undefined name kwargs

(F821)


[error] 93-93: Undefined name kwargs

(F821)


[error] 95-95: Undefined name kwargs

(F821)


[error] 97-97: Undefined name kwargs

(F821)


[error] 99-99: Undefined name kwargs

(F821)


[error] 101-101: Undefined name kwargs

(F821)


[error] 103-103: Undefined name kwargs

(F821)

🤖 Prompt for AI Agents
In `@diffulex_bench/config.py` around lines 67 - 103, The get_diffulex_kwargs
function currently returns a literal dict immediately, then attempts to mutate
an undefined kwargs and add quantization fields; fix by creating a single kwargs
variable (e.g., kwargs = { ... } using the current dict contents from
get_diffulex_kwargs), remove the early return, then conditionally set
kv_cache_dtype, decode_mode, linear_attn_weight_dtype, linear_mlp_weight_dtype,
linear_attn_act_dtype, and linear_mlp_act_dtype onto that kwargs object, and
finally return kwargs; update references in this function to avoid the undefined
variable and ensure quantization options are included.

Comment on lines 223 to 236
        for req in requests:
            prompt = req.arguments[0]
            if self.add_bos_token and self.tokenizer.bos_token:
                prompt = self.tokenizer.bos_token + prompt
            prompts.append(prompt)
            gen_args.append(req.arguments[1] if len(req.arguments) > 1 else {})

        # Run generation
        start_time = time.time()
        outputs = self.runner.generate(
            prompts,
            self.sampling_params,
            use_tqdm=not disable_tqdm,
        )

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's understand the file structure and locate the relevant code
find . -name "lm_eval_model.py" -type f

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 99


🏁 Script executed:

# Let's examine the full function context around lines 223-236
cat -n diffulex_bench/lm_eval_model.py | sed -n '200,250p'

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 2007


🏁 Script executed:

# Check the entire generate_until function to understand gen_args lifecycle
rg -A 50 "def generate_until" diffulex_bench/lm_eval_model.py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 1747


🏁 Script executed:

# Look for the runner.generate method signature
rg "def generate" diffulex_bench/lm_eval_model.py | head -20

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 149


🏁 Script executed:

# Check what runner object is and its generate method
rg "self.runner" diffulex_bench/lm_eval_model.py | head -20

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 191


🏁 Script executed:

# Find BenchmarkRunner class definition
rg "class BenchmarkRunner" --type py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 113


🏁 Script executed:

# Look for imports and where BenchmarkRunner comes from
rg "BenchmarkRunner" diffulex_bench/lm_eval_model.py | head -10

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 154


🏁 Script executed:

# Find the BenchmarkRunner file
find . -name "*.py" -type f | xargs rg "class BenchmarkRunner" 2>/dev/null

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 115


🏁 Script executed:

# Check the generate method signature in BenchmarkRunner
rg -B 2 "def generate" --type py | grep -A 20 "class BenchmarkRunner" | head -30

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 48


🏁 Script executed:

# Search all Python files for generate method
find . -name "*.py" -type f -exec grep -l "def generate" {} \; | head -10

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 409


🏁 Script executed:

# Examine the BenchmarkRunner class and its generate method
cat -n diffulex_bench/runner.py | head -100

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 4772


🏁 Script executed:

# Find the generate method in BenchmarkRunner
rg -A 15 "def generate" diffulex_bench/runner.py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 544


🏁 Script executed:

# Check the full BenchmarkRunner class structure
wc -l diffulex_bench/runner.py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 94


🏁 Script executed:

# Get the complete generate method
rg -A 50 "def generate" diffulex_bench/runner.py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 1801


🏁 Script executed:

# Check what Instance class looks like from lm-eval
find . -name "*.py" -type f | xargs grep -l "class Instance" 2>/dev/null | head -5

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 48


🏁 Script executed:

# Look for lm-eval imports and understand Instance.arguments
rg "Instance" diffulex_bench/lm_eval_model.py | head -20

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 389


🏁 Script executed:

# Check if there are any other calls to runner.generate in the codebase
rg "runner.generate\|self.runner.generate" --type py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 48


🏁 Script executed:

# Check what Instance arguments structure looks like
cat -n diffulex_bench/lm_eval_model.py | sed -n '220,260p'

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 1692


🏁 Script executed:

# Search for any documentation or usage of Instance.arguments in the codebase
rg "arguments\[1\]\|arguments\[0\]" --type py diffulex_bench/

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 48


🏁 Script executed:

# Check if lm_eval is in the repository
find . -name "lm_eval*" -o -name "*lm_eval*" 2>/dev/null | head -10

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 99


🏁 Script executed:

# Look for requirements or setup files to understand lm-eval integration
find . -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" | xargs cat 2>/dev/null | grep -i "lm-eval\|lm_eval" -A 2 -B 2

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 125


🏁 Script executed:

# Search for any documentation about what generate_until should do
rg "generate_until\|loglikelihood_rolling" --type py -B 5 -A 20 | head -80

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 48


🏁 Script executed:

# Check the Diffulex.generate() signature that BenchmarkRunner wraps
rg -B 5 "def generate" diffulex_bench/runner.py | grep -A 15 "class BenchmarkRunner" | head -50

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 48


🏁 Script executed:

# Look at the Diffulex engine's generate method signature
find . -name "*.py" -type f | xargs grep -l "class Diffulex" 2>/dev/null | head -3

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 158


🏁 Script executed:

# Check if there's any documentation about what lm-eval expects
rg "until\|max_gen_toks\|generate_until" --type py diffulex_bench/lm_eval_model.py -B 3 -A 3

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 48


🏁 Script executed:

# Get more context on the full generate_until implementation
cat -n diffulex_bench/lm_eval_model.py | sed -n '206,280p'

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 3230


🏁 Script executed:

# Check the Diffulex class generate method signature
rg -A 20 "def generate" diffulex/diffulex.py | head -40

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 48


🏁 Script executed:

# Look for SamplingParams class definition
rg -B 2 -A 15 "class SamplingParams" --type py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 680


🏁 Script executed:

# Check if there are any examples of how to use per-request sampling
rg "SamplingParams\|sampling_params" diffulex_bench/lm_eval_model.py -B 2 -A 2 | head -50

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 48


🏁 Script executed:

# Get the full SamplingParams class definition
cat -n diffulex/sampling_params.py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 273


🏁 Script executed:

# Check how Diffulex.generate is actually called/used in examples
find . -name "eval*.py" -type f | head -5 | xargs cat 2>/dev/null | grep -B 5 -A 10 "\.generate\|runner.generate" | head -50

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 48


🏁 Script executed:

# Check the actual Diffulex class for generate method
cat -n diffulex/diffulex.py | head -200

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 538


🏁 Script executed:

# Search for how generate is called in tests or examples
rg "\.generate\(" --type py -B 2 -A 5 | head -80

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 5790


🏁 Script executed:

# Look for where sampling_params is initialized in lm_eval_model.py
rg "sampling_params" diffulex_bench/lm_eval_model.py | head -20

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 146


🏁 Script executed:

# Check the full class initialization to see sampling_params setup
cat -n diffulex_bench/lm_eval_model.py | sed -n '1,100p'

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 4244


🏁 Script executed:

# Look for any comments about lm-eval Instance.arguments structure
rg "Instance\|arguments" diffulex_bench/lm_eval_model.py -B 2 -A 2

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 48


🏁 Script executed:

# Check if there's documentation on what gen_args contains
grep -r "max_gen_toks\|until" --include="*.py" diffulex_bench/ examples/ 2>/dev/null | head -20

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 1765


🏁 Script executed:

# Check examples to see how they handle gen_args properly
cat -n examples/eval_dream.py | sed -n '1,250p' | grep -A 30 "def generate_until"

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 48


🏁 Script executed:

# Look at eval_llada.py for the proper pattern
cat -n examples/eval_llada.py | sed -n '1,350p' | grep -B 5 -A 40 "def generate_until"

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 48


🏁 Script executed:

# Check if gen_args contains "until" field
rg "gen_args\|gen_kwargs\|until" examples/eval_dream.py -B 2 -A 2 | head -50

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 48


🏁 Script executed:

# Find and examine the example files
find . -name "eval_dream.py" -o -name "eval_llada.py" | head -5

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 115


🏁 Script executed:

# Try looking at the examples directory directly
ls -la examples/

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 2425


🏁 Script executed:

# Check the examples more carefully
cat examples/eval_dream.py | head -300

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 12358


🏁 Script executed:

# Search for how gen_args is processed in eval_dream.py
grep -n "gen_args\|gen_kwargs" examples/eval_dream.py | head -30

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 205


🏁 Script executed:

# Get the generate_until function from eval_dream.py
rg "def generate_until" examples/eval_dream.py -A 60

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 2336


🏁 Script executed:

# Let's verify what the lm-eval Instance.arguments structure should be
# by checking if there are any test files or documentation
find . -name "*.md" -o -name "*README*" | xargs grep -l "Instance\|arguments\|generate_until" 2>/dev/null | head -5

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 113


🏁 Script executed:

# Check eval_llada.py to see if it has a better implementation
grep -n "gen_args\|gen_kwargs\|until" examples/eval_llada.py | head -40

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 923


🏁 Script executed:

# Get the generate_until from eval_llada.py to compare
rg "def generate_until" examples/eval_llada.py -A 80 | head -100

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 3345


🏁 Script executed:

# Let's check if there's any documentation about what lm-eval expects
cat diffulex_bench/README.md

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 7742


🏁 Script executed:

# Verify that Instance.arguments/args is indeed a 2-tuple with (prompt, gen_args_dict)
rg "req.arguments\|req.args" diffulex_bench/lm_eval_model.py examples/ -B 2 -A 2

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 48


🏁 Script executed:

# Check if runner.generate or Diffulex.generate could be modified to accept list of params
rg "def generate" diffulex/engine/tp_worker.py -A 10

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 1054


🏁 Script executed:

# Check if SamplingParams has a stop_sequences or until field
cat diffulex/sampling_params.py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 217


🏁 Script executed:

# Check the full signature again and see what parameters are supported
rg "class SamplingParams" -A 30 diffulex/sampling_params.py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 170


🏁 Script executed:

# Look for how stop sequences are handled in the engine
rg "stop\|until" diffulex/engine/tp_worker.py | head -20

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 48


Honor per-request generation arguments from lm-eval.

gen_args is collected but unused. This ignores per-request generation limits (max_gen_toks) and stop conditions (until), causing results to drift from harness expectations.

The runner already supports list[SamplingParams], so you can apply per-request overrides:

Suggested approach
         # Run generation
         start_time = time.time()
+        sampling_params_list = []
+        for g in gen_args:
+            sp = self.sampling_params
+            if "max_gen_toks" in g:
+                sp = copy.deepcopy(self.sampling_params)
+                sp.max_tokens = int(g["max_gen_toks"])
+            if "temperature" in g:
+                if sp is self.sampling_params:
+                    sp = copy.deepcopy(self.sampling_params)
+                sp.temperature = float(g["temperature"])
+            sampling_params_list.append(sp)
         outputs = self.runner.generate(
             prompts,
-            self.sampling_params,
+            sampling_params_list,
             use_tqdm=not disable_tqdm,
         )
+        # Post-process to handle "until" stop sequences (SamplingParams doesn't support them)
+        for i, output in enumerate(outputs):
+            text = output.get('text', '')
+            if "until" in gen_args[i]:
+                for stop_seq in gen_args[i]["until"]:
+                    if stop_seq in text:
+                        text = text.split(stop_seq)[0]
+            output['text'] = text
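
Note: this sketch assumes an import copy at module level and that each output is a dict exposing a 'text' key; if the runner returns a different output type, the post-processing step would need to be adapted accordingly.
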
🤖 Prompt for AI Agents
In `@diffulex_bench/lm_eval_model.py` around lines 223 - 236, The loop collects
per-request gen_args but never applies them; update the code that calls
self.runner.generate to pass per-request SamplingParams by mapping each req's
gen_args into a SamplingParams instance (merging/overriding defaults from
self.sampling_params) and pass a list of SamplingParams instead of a single
self.sampling_params; specifically, keep building gen_args in the for req in
requests loop, convert each gen_args entry into a SamplingParams (honoring
fields like max_gen_toks and until) and call self.runner.generate(prompts,
per_request_sampling_params_list, use_tqdm=not disable_tqdm) so the runner
receives list[SamplingParams] and honors per-request overrides.

Comment on lines 19 to 53
    def __init__(
        self,
        model_path: str,
        tokenizer_path: Optional[str] = None,
        wait_ready: bool = True,
        **diffulex_kwargs
    ):
        """
        Initialize the benchmark runner

        Args:
            model_path: Path to the model
            tokenizer_path: Path to the tokenizer, if None uses model_path
            wait_ready: Whether to wait for engine to be fully initialized before returning
            **diffulex_kwargs: Additional arguments to pass to Diffulex
        """
        self.model_path = model_path
        self.tokenizer_path = tokenizer_path or model_path
        self.logger = get_logger(__name__)

        # Initialize Diffulex engine
        self.logger.info("Initializing Diffulex engine...")
        self.llm = Diffulex(model_path, **diffulex_kwargs)

        # Wait for engine to be ready if requested
        if wait_ready:
            self._wait_for_ready()

        # Load tokenizer
        self.logger.info("Loading tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.tokenizer_path,
            trust_remote_code=True
        )
        self.logger.success("Tokenizer loaded successfully")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's confirm the code exists as described
cat -n diffulex_bench/runner.py | head -60

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 2388


🏁 Script executed:

# Search for all instances of trust_remote_code in the codebase
rg "trust_remote_code" --type py -n

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 4619


🌐 Web query:

Hugging Face Transformers AutoTokenizer trust_remote_code security best practices

💡 Result:

What trust_remote_code means (and why it’s risky)

  • Setting trust_remote_code=True allows Transformers “Auto*” classes (including AutoTokenizer.from_pretrained(...)) to download and execute Python code from the model repo (typically via auto_map / custom classes). Hugging Face explicitly recommends reviewing the code first and pinning a revision to avoid supply-chain updates. [1][2]

Best practices for using AutoTokenizer securely

1) Prefer trust_remote_code=False (default) and models/tokenizers implemented natively

  • Only enable trust_remote_code when you must load a custom tokenizer implementation. If you can switch to a tokenizer that’s already supported in Transformers, do that. [2]

2) If you must enable it, pin to an immutable revision

  • Always pass a commit hash (or other immutable ref) via revision=... when using trust_remote_code=True, so you don’t execute newly-pushed code later. [1][2]

3) Manually review the repo code you are about to execute

  • Review the custom Python files referenced by the repo (e.g., tokenizer/modeling code) before enabling trust_remote_code=True. [1][2]

4) Prefer safetensors and block unsafe weight formats

  • Use safe weight formats (safetensors) and enforce it where applicable (Transformers recommends use_safetensors to avoid pickle-based formats). While tokenizers aren’t usually pickle-loaded, projects commonly load both tokenizer + weights together, so apply this policy consistently. [1][3]

5) Keep dependencies patched (Transformers + related loaders)

  • Upgrade Transformers promptly; it has had deserialization-related RCE vulnerabilities fixed in newer versions (e.g., CVE-2024-3568 fixed in 4.38.0). [4]

6) Add operational guardrails (recommended in production)

  • Allowlist trusted orgs/repos; avoid arbitrary user repos.
  • Run model/tokenizer loading in a sandbox (container/VM, low privileges, read-only FS where possible, no secrets in env).
  • Use network restrictions (egress allowlist) during loading if feasible.
  • Consider policies that disallow trust_remote_code unless the publisher is verified/trusted (this is how some curated deployments gate models). [3]

Sources: Hugging Face Transformers security policy and guidance on trust_remote_code + pinning revisions [1][2], Hugging Face deployment security guidance disallowing trust_remote_code unless verified [3], and a representative Transformers deserialization CVE advisory underscoring the need to patch quickly [4].


Make trust_remote_code opt-in with default False in tokenizer loading.

Setting trust_remote_code=True allows arbitrary code execution from model/tokenizer repositories. This should default to False and only be enabled when explicitly requested. If the feature is required, also pin to an immutable revision to prevent execution of newly-pushed code.

🔧 Suggested change
@@
     def __init__(
         self,
         model_path: str,
         tokenizer_path: Optional[str] = None,
         wait_ready: bool = True,
+        trust_remote_code: bool = False,
         **diffulex_kwargs
     ):
@@
             model_path: Path to the model
             tokenizer_path: Path to the tokenizer, if None uses model_path
             wait_ready: Whether to wait for engine to be fully initialized before returning
+            trust_remote_code: Whether to allow remote code execution when loading the tokenizer
             **diffulex_kwargs: Additional arguments to pass to Diffulex
@@
         self.tokenizer = AutoTokenizer.from_pretrained(
             self.tokenizer_path,
-            trust_remote_code=True
+            trust_remote_code=trust_remote_code,
         )

Note: This pattern appears throughout the codebase. Consider applying the same fix to other instances in diffulex/config.py, diffulex/engine/llm_engine.py, and other modules that load tokenizers or models with trust_remote_code=True.
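
If remote code genuinely has to stay enabled for a given model, a hedged sketch of the hardened call (repo id and commit hash are placeholders; revision would also need to be threaded through the runner's parameters):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "org/custom-tokenizer",        # placeholder repo id
    trust_remote_code=True,        # explicit, reviewed opt-in
    revision="0123abcdef",         # pin an immutable commit hash (placeholder)
)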

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     def __init__(
         self,
         model_path: str,
         tokenizer_path: Optional[str] = None,
         wait_ready: bool = True,
+        trust_remote_code: bool = False,
         **diffulex_kwargs
     ):
         """
         Initialize the benchmark runner
         Args:
             model_path: Path to the model
             tokenizer_path: Path to the tokenizer, if None uses model_path
             wait_ready: Whether to wait for engine to be fully initialized before returning
+            trust_remote_code: Whether to allow remote code execution when loading the tokenizer
             **diffulex_kwargs: Additional arguments to pass to Diffulex
         """
         self.model_path = model_path
         self.tokenizer_path = tokenizer_path or model_path
         self.logger = get_logger(__name__)
         # Initialize Diffulex engine
         self.logger.info("Initializing Diffulex engine...")
         self.llm = Diffulex(model_path, **diffulex_kwargs)
         # Wait for engine to be ready if requested
         if wait_ready:
             self._wait_for_ready()
         # Load tokenizer
         self.logger.info("Loading tokenizer...")
         self.tokenizer = AutoTokenizer.from_pretrained(
             self.tokenizer_path,
-            trust_remote_code=True
+            trust_remote_code=trust_remote_code,
         )
         self.logger.success("Tokenizer loaded successfully")
🤖 Prompt for AI Agents
In `@diffulex_bench/runner.py` around lines 19 - 53, The tokenizer is being loaded
with AutoTokenizer.from_pretrained(..., trust_remote_code=True) inside __init__
which is unsafe; add a new parameter (e.g., trust_remote_code: bool = False and
optional revision: Optional[str] = None) to the Runner __init__ signature, pass
that parameter to AutoTokenizer.from_pretrained and only set trust_remote_code
when explicitly True, and if a mutable remote execution is required encourage
pinning by forwarding revision to from_pretrained; update the __init__'s
tokenizer_path handling and the call site that constructs DiffulexRunner to
opt-in when needed (also apply same pattern to other modules like
diffulex/config.py and diffulex/engine/llm_engine.py where
AutoTokenizer.from_pretrained or model loading uses trust_remote_code).

Comment on lines 919 to 945
def store_kvcache_distinct_layout(key: torch.Tensor, value: torch.Tensor,
                                  k_cache: torch.Tensor, v_cache: torch.Tensor,
                                  slot_mapping: torch.Tensor, attn_metadata: AttnMetaDataBase) -> None:
    """
    Store KV cache (distinct layout).
    Dynamically selects the appropriate kernel based on quantization strategy from context.
    """
    from diffulex.utils.quantization.context import get_kv_cache_strategy
    strategy = get_kv_cache_strategy()
    if strategy is None:
        _store_kvcache_distinct_bf16(key, value, k_cache, v_cache, slot_mapping)
        return

    fmt = getattr(strategy, "kv_cache_format", "bf16")
    if fmt == "bf16":
        _store_kvcache_distinct_bf16(key, value, k_cache, v_cache, slot_mapping)
        return
    if fmt == "fp8":
        if attn_metadata.k_scale is None or attn_metadata.v_scale is None:
            raise ValueError("FP8 quantization requires k_scale and v_scale in metadata")
        _store_kvcache_distinct_fp8(
            key, value, k_cache, v_cache, slot_mapping,
            attn_metadata.k_scale, attn_metadata.v_scale,
            strategy=strategy,
        )
        return
    raise ValueError(f"Unsupported kv_cache_format={fmt!r} for distinct layout (strategy={type(strategy)})")

⚠️ Potential issue | 🟠 Major

Mirror the unified-layout slot_mapping alignment in distinct layout.
Unified layout trims slot_mapping when it is longer than the current token slice, but distinct layout doesn't, which can break partial-prefill flows that pass longer mappings.

🛠️ Suggested fix (align slot_mapping length)
 def store_kvcache_distinct_layout(key: torch.Tensor, value: torch.Tensor, 
                                   k_cache: torch.Tensor, v_cache: torch.Tensor, 
                                   slot_mapping: torch.Tensor, attn_metadata: AttnMetaDataBase) -> None:
     """
     Store KV cache (distinct layout).
     Dynamically selects the appropriate kernel based on quantization strategy from context.
     """
+    N = int(key.shape[0])
+    if int(slot_mapping.numel()) != N:
+        if int(slot_mapping.numel()) > N:
+            slot_mapping = slot_mapping[-N:]
+        else:
+            raise AssertionError(
+                f"slot_mapping is shorter than key/value tokens: "
+                f"N={N}, slot_mapping.numel()={int(slot_mapping.numel())}"
+            )
     from diffulex.utils.quantization.context import get_kv_cache_strategy
     strategy = get_kv_cache_strategy()
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 938-938: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 945-945: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In `@diffulex_kernel/python/kv_cache_kernels.py` around lines 919 - 945,
store_kvcache_distinct_layout currently doesn't trim slot_mapping for
partial-prefill cases, causing failures when slot_mapping is longer than the
current token slice; update store_kvcache_distinct_layout to mirror the
unified-layout behavior by slicing/trimming slot_mapping to the actual token
count before calling _store_kvcache_distinct_bf16 or _store_kvcache_distinct_fp8
(i.e., compute the active length from key/value tensors or attn_metadata and
replace slot_mapping with slot_mapping[:active_len] when it's longer), and then
pass the trimmed slot_mapping into those helper functions.

Comment on lines 12 to 17
from diffulex_profiler.backends import (
    ProfilerBackend,
    SimpleTimerBackend,
    VizTracerBackend,
    PyTorchProfilerBackend,
)

⚠️ Potential issue | 🟠 Major

Unconditional import of optional backends will cause ImportError.

VizTracerBackend and PyTorchProfilerBackend are conditionally exported in diffulex_profiler/backends/__init__.py (only if their dependencies are available). However, this file imports them unconditionally, which will raise ImportError when the optional dependencies are missing.

Proposed fix: use conditional imports here as well
 from diffulex_profiler.backends import (
     ProfilerBackend,
     SimpleTimerBackend,
-    VizTracerBackend,
-    PyTorchProfilerBackend,
 )
+
+# Optional backends - only import if available
+try:
+    from diffulex_profiler.backends import VizTracerBackend
+except ImportError:
+    VizTracerBackend = None  # type: ignore[misc,assignment]
+
+try:
+    from diffulex_profiler.backends import PyTorchProfilerBackend
+except ImportError:
+    PyTorchProfilerBackend = None  # type: ignore[misc,assignment]

And update __all__ accordingly:

 __all__ = [
     "DiffulexProfiler",
     "ProfilerConfig",
     "PerformanceMetrics",
     "collect_gpu_metrics",
     "collect_cpu_metrics",
     "collect_memory_metrics",
     "ProfilerBackend",
     "SimpleTimerBackend",
-    "VizTracerBackend",
-    "PyTorchProfilerBackend",
     "ProfilerExporter",
     "JSONExporter",
     "CSVExporter",
     "SummaryExporter",
 ]
+
+if VizTracerBackend is not None:
+    __all__.append("VizTracerBackend")
+if PyTorchProfilerBackend is not None:
+    __all__.append("PyTorchProfilerBackend")
🤖 Prompt for AI Agents
In `@diffulex_profiler/__init__.py` around lines 12 - 17, The unconditional
imports of VizTracerBackend and PyTorchProfilerBackend cause ImportError when
optional deps are absent; change the top-level imports so ProfilerBackend and
SimpleTimerBackend are imported normally, but wrap imports of VizTracerBackend
and PyTorchProfilerBackend in try/except ImportError blocks (or use getattr
fallback) and only add those names to the module exports when successfully
imported; also update the module's __all__ to include the optional backend names
conditionally so the package doesn't fail to import if optional dependencies are
missing.

Comment on lines 165 to 171
    def warmup_model(self):
        """Model-specific warmup logic."""
        pass
        logger.info("Warming up model...")
        set_warming_up(True)
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        self._prefill_warmup()
        reset_warming_up()

⚠️ Potential issue | 🟠 Major

Ensure reset_warming_up() always runs.

If _prefill_warmup() raises, the warming flag stays set. Wrap with try/finally.

✅ Suggested fix
         logger.info("Warming up model...")
         set_warming_up(True)
-        torch.cuda.empty_cache()
-        torch.cuda.reset_peak_memory_stats()
-        self._prefill_warmup()
-        reset_warming_up()
+        try:
+            torch.cuda.empty_cache()
+            torch.cuda.reset_peak_memory_stats()
+            self._prefill_warmup()
+        finally:
+            reset_warming_up()
🤖 Prompt for AI Agents
In `@diffulex/engine/model_runner.py` around lines 165 - 171, In warmup_model,
ensure reset_warming_up() always runs by wrapping the work between
set_warming_up(True) and reset_warming_up() in a try/finally: call
set_warming_up(True), do torch.cuda.empty_cache(),
torch.cuda.reset_peak_memory_stats() and call self._prefill_warmup() inside the
try block, and call reset_warming_up() in the finally block so that any
exception in _prefill_warmup() still clears the warming flag.

Comment on lines 193 to 197
        # Get storage dtype and itemsize from quantization strategy
        strategy = get_kv_cache_strategy()
        if strategy is None:
            strategy = NoQuantizationStrategy()
        storage_dtype, itemsize = strategy.get_storage_dtype()

⚠️ Potential issue | 🔴 Critical

Fallback strategy lacks init_scales.

NoQuantizationStrategy doesn’t implement the KV-cache interface, so init_scales() will raise when no KV-cache strategy is configured. Default to a KV-cache strategy (e.g., BF16) or guard init_scales.

🔧 Suggested fix (KV-cache default)
-from diffulex.utils.quantization.strategies import NoQuantizationStrategy
+from diffulex.utils.quantization.strategies import KVCacheBF16Strategy
...
         strategy = get_kv_cache_strategy()
         if strategy is None:
-            strategy = NoQuantizationStrategy()
+            strategy = KVCacheBF16Strategy()
...
-        k_scale_init, v_scale_init = strategy.init_scales(num_kv_heads, device)
+        k_scale_init, v_scale_init = strategy.init_scales(num_kv_heads, device)

Also applies to: 290-303

🤖 Prompt for AI Agents
In `@diffulex/engine/model_runner.py` around lines 193 - 197, The code calls
strategy.get_storage_dtype() and later expects strategy.init_scales(), but
NoQuantizationStrategy (returned by get_kv_cache_strategy fallback) doesn't
implement init_scales, causing errors; modify the fallback so
get_kv_cache_strategy() never returns NoQuantizationStrategy for KV-cache use
(e.g., default to a KV-capable strategy like BF16QuantizationStrategy) or add a
guard before calling init_scales() to skip/handle strategies without that
method; update the logic around get_kv_cache_strategy(), NoQuantizationStrategy,
get_storage_dtype, and any subsequent init_scales() calls (also apply the same
change to the similar block around lines 290-303) so only strategies that
implement the KV-cache interface are used for init_scales().

Comment on lines +69 to +72
accepted_ids_list = accepted_ids.to(device="cpu").tolist()
true_local_ids_sub_map[str(block_id)] = [block.local_mask_token_ids[i] for i in accepted_ids_list]
accepted_ids_sub_map[str(block_id)] = accepted_ids_list
sampled_tokens_sub_map[str(block_id)] = sampled_tokens.to(device="cpu").tolist()

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for consumers of the maps mentioned in the review comment
# These are sub_maps being built, so look for references to the parent structures

# First, let's understand the context better by looking at the file
echo "=== File structure ===" 
wc -l diffulex/sampler/fast_dllm_v2.py

# Get context around lines 69-72
echo -e "\n=== Context around lines 69-72 ===" 
sed -n '50,90p' diffulex/sampler/fast_dllm_v2.py

# Search for where these _sub_map variables are used
echo -e "\n=== Searching for _sub_map usage ===" 
rg -n '_sub_map' diffulex/sampler/fast_dllm_v2.py

# Search for broader map usage patterns
echo -e "\n=== Searching for accepted_ids_map/sampled_tokens_map/true_local_ids_map ===" 
rg -n '\b(accepted_ids_map|sampled_tokens_map|true_local_ids_map)\b'

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 10352


🏁 Script executed:

#!/bin/bash
# Check the type annotations in base.py
echo "=== Type annotations in base.py ===" 
sed -n '76,83p' diffulex/sampler/base.py

# Examine the consumer code in the scheduler files
echo -e "\n=== Scheduler consumer code (fast_dllm_v2) ===" 
sed -n '99,115p' diffulex/strategy/fast_dllm_v2/engine/scheduler.py

echo -e "\n=== Scheduler consumer code (d2f) ===" 
sed -n '101,115p' diffulex/strategy/d2f/engine/scheduler.py

echo -e "\n=== Scheduler consumer code (block_diffusion) ===" 
sed -n '99,115p' diffulex/strategy/block_diffusion/engine/scheduler.py

# Also check if any code tries to call tensor methods on these values
echo -e "\n=== Check for tensor operations on accepted_ids ===" 
rg -n '\.to\(|\.cuda|\.device|\.shape|\.unsqueeze|\.squeeze' diffulex/strategy/fast_dllm_v2/engine/scheduler.py | head -20

echo -e "\n=== Check for tensor operations in d2f ===" 
rg -n '\.to\(|\.cuda|\.device|\.shape|\.unsqueeze|\.squeeze' diffulex/strategy/d2f/engine/scheduler.py | head -20

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 3772


🏁 Script executed:

#!/bin/bash
# Get more context around the token.item() calls
echo "=== fast_dllm_v2 scheduler context ===" 
sed -n '102,120p' diffulex/strategy/fast_dllm_v2/engine/scheduler.py

echo -e "\n=== d2f scheduler context ===" 
sed -n '104,120p' diffulex/strategy/d2f/engine/scheduler.py

echo -e "\n=== block_diffusion scheduler context ===" 
sed -n '102,120p' diffulex/strategy/block_diffusion/engine/scheduler.py

# Check if there's any conversion happening before .item() calls
echo -e "\n=== Check for any conversion before token.item() ===" 
rg -B5 'token\.item\(\)' diffulex/strategy/fast_dllm_v2/engine/scheduler.py

# Check what type token should be at that point
echo -e "\n=== sampled_tokens extraction ===" 
rg -B2 -A8 'sampled_tokens\[accepted_id\]' diffulex/strategy/fast_dllm_v2/engine/scheduler.py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 4437


Fix incompatible consumer code in schedulers that still expect tensor methods.

The maps now store Python lists, but two downstream schedulers still call .item() on tokens:

  • diffulex/strategy/fast_dllm_v2/engine/scheduler.py (line ~113): token.item() == self.eos will fail with AttributeError
  • diffulex/strategy/block_diffusion/engine/scheduler.py (line ~113): token.item() == self.eos will fail with AttributeError

The diffulex/strategy/d2f/engine/scheduler.py has already been correctly updated with defensive handling (isinstance check and int() conversion).

Update the two broken schedulers to match the d2f implementation: add defensive checks for tensor input, use int() conversion for list values, and replace .item() comparison with direct integer comparison.
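
A minimal sketch of the defensive conversion being described (the helper name is hypothetical; the d2f scheduler inlines an equivalent check):

import torch

def _as_token_id(token) -> int:
    """Accept either a 0-dim tensor or a plain int from the sampler maps."""
    if isinstance(token, torch.Tensor):
        return int(token.item())
    return int(token)

# In the scheduler loop, compare plain ints instead of calling .item() unconditionally:
#     if _as_token_id(token) == self.eos:
#         ...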

🤖 Prompt for AI Agents
In `@diffulex/sampler/fast_dllm_v2.py` around lines 69 - 72, Update the two
schedulers that still call token.item()
(diffulex/strategy/fast_dllm_v2/engine/scheduler.py and
diffulex/strategy/block_diffusion/engine/scheduler.py): find the comparison
using token.item() == self.eos and replace it with a defensive conversion that
accepts either a Tensor or a Python int (e.g., if isinstance(token,
torch.Tensor): value = int(token.item()) else: value = int(token)) and then
compare value == self.eos; ensure this change is applied wherever sampled tokens
from sampled_tokens_sub_map or accepted_ids_sub_map are checked so list values
(already Python ints) and tensors both work correctly.

Comment on lines 121 to 125
        token_ids: list[int],
        sampling_params: SamplingParams = SamplingParams(),
        config: Config | None = None,
    ):
        super().__init__(token_ids, sampling_params)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, let's check the exact code in sequence.py around lines 121-125
sed -n '110,130p' diffulex/strategy/fast_dllm_v2/engine/sequence.py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 854


🏁 Script executed:

#!/bin/bash
# Check the SamplingParams class definition to verify if it's mutable
head -100 diffulex/sampling_params.py

Repository: SJTU-DENG-Lab/Diffulex

Length of output: 217


Avoid instantiating SamplingParams in a default argument.

Default-argument instantiation happens at import time and creates a shared mutable instance across all function calls. Since SamplingParams is a non-frozen dataclass, modifications to this instance (whether in super().__init__() or elsewhere) will affect all subsequent sequences that don't explicitly pass sampling_params. Use None as the default and initialize inside __init__.
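
To make the failure mode concrete, here is a toy reproduction; the SamplingParams below is a stand-in dataclass defined only for the demo, not the project's class.

from dataclasses import dataclass

@dataclass
class SamplingParams:
    temperature: float = 1.0

def make_sequence(sampling_params: SamplingParams = SamplingParams()):  # default evaluated once, at definition time
    return sampling_params

a = make_sequence()
b = make_sequence()
assert a is b                    # every call without an argument shares one instance
a.temperature = 0.0
assert b.temperature == 0.0      # a mutation through one "sequence" leaks into the other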

🔧 Proposed fix
-        sampling_params: SamplingParams = SamplingParams(),
+        sampling_params: SamplingParams | None = None,
@@
-        super().__init__(token_ids, sampling_params)
+        if sampling_params is None:
+            sampling_params = SamplingParams()
+        super().__init__(token_ids, sampling_params)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
         token_ids: list[int],
-        sampling_params: SamplingParams = SamplingParams(),
+        sampling_params: SamplingParams | None = None,
         config: Config | None = None,
     ):
+        if sampling_params is None:
+            sampling_params = SamplingParams()
         super().__init__(token_ids, sampling_params)
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 122-122: Do not perform function call SamplingParams in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

🤖 Prompt for AI Agents
In `@diffulex/strategy/fast_dllm_v2/engine/sequence.py` around lines 121-125: the Sequence __init__ currently
uses a shared mutable SamplingParams() as a default. Change the signature to sampling_params: SamplingParams |
None = None and, inside Sequence.__init__, create a new instance when None (e.g., sampling_params =
SamplingParams() if sampling_params is None else sampling_params) before calling super().__init__(token_ids,
sampling_params), so each Sequence gets its own SamplingParams instance and shared mutable defaults are avoided.

Comment on lines 101 to 132
    def get_storage_dtype(self) -> tuple[torch.dtype, int]:
        # We store qweight as uint8 (bias128 representation).
        return torch.uint8, 1

    # ---- Required abstract methods (for registry/factory instantiation) ----
    def quantize(self, tensor: torch.Tensor, **kwargs: Any) -> tuple[torch.Tensor, Any]:
        """Reference per-output-channel symmetric int8 quantization.

        Returns:
            quantized_int8: [N,K] int8
            scales: [N] bf16
        """
        _ = kwargs
        if tensor.dim() != 2:
            raise ValueError(f"Expected 2D weight [N,K], got shape={tuple(tensor.shape)}")
        if tensor.dtype != torch.bfloat16:
            tensor = tensor.to(dtype=torch.bfloat16)
        abs_max = torch.abs(tensor).max(dim=-1, keepdim=True)[0]  # [N,1]
        scales = (abs_max.clamp(min=1e-8) / 127.0).to(dtype=torch.bfloat16)  # [N,1]
        q = torch.round(tensor.to(torch.float32) / scales.to(torch.float32)).clamp(-128, 127).to(torch.int8)
        return q, scales.squeeze(-1)

    def dequantize(self, quantized: torch.Tensor, scale_or_metadata: Any, **kwargs: Any) -> torch.Tensor:
        """Reference dequantization back to bf16."""
        _ = kwargs
        scales = scale_or_metadata.get("scales") if isinstance(scale_or_metadata, dict) else scale_or_metadata
        if scales is None:
            raise ValueError("scales required for dequantization")
        if scales.dim() == 1:
            scales = scales.unsqueeze(-1)
        return (quantized.to(torch.float32) * scales.to(torch.float32)).to(torch.bfloat16)


⚠️ Potential issue | 🟠 Major

Align quantize()/dequantize() with declared uint8 storage.
get_storage_dtype() advertises uint8, but quantize() returns int8 and dequantize() assumes signed values. This mismatch can break storage buffers created from the strategy metadata.

🔧 Suggested alignment (uint8 storage + bias128)
-        q = torch.round(tensor.to(torch.float32) / scales.to(torch.float32)).clamp(-128, 127).to(torch.int8)
-        return q, scales.squeeze(-1)
+        q_i16 = torch.round(tensor.to(torch.float32) / scales.to(torch.float32)).clamp(-128, 127).to(torch.int16)
+        q_u8 = (q_i16 + 128).to(torch.uint8)
+        return q_u8, scales.squeeze(-1)
-        return (quantized.to(torch.float32) * scales.to(torch.float32)).to(torch.bfloat16)
+        if quantized.dtype == torch.uint8:
+            q = quantized.to(torch.int16) - 128
+        else:
+            q = quantized.to(torch.int16)
+        return (q.to(torch.float32) * scales.to(torch.float32)).to(torch.bfloat16)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     def get_storage_dtype(self) -> tuple[torch.dtype, int]:
         # We store qweight as uint8 (bias128 representation).
         return torch.uint8, 1

     # ---- Required abstract methods (for registry/factory instantiation) ----
     def quantize(self, tensor: torch.Tensor, **kwargs: Any) -> tuple[torch.Tensor, Any]:
         """Reference per-output-channel symmetric int8 quantization.

         Returns:
             quantized_int8: [N,K] int8
             scales: [N] bf16
         """
         _ = kwargs
         if tensor.dim() != 2:
             raise ValueError(f"Expected 2D weight [N,K], got shape={tuple(tensor.shape)}")
         if tensor.dtype != torch.bfloat16:
             tensor = tensor.to(dtype=torch.bfloat16)
         abs_max = torch.abs(tensor).max(dim=-1, keepdim=True)[0]  # [N,1]
         scales = (abs_max.clamp(min=1e-8) / 127.0).to(dtype=torch.bfloat16)  # [N,1]
-        q = torch.round(tensor.to(torch.float32) / scales.to(torch.float32)).clamp(-128, 127).to(torch.int8)
-        return q, scales.squeeze(-1)
+        q_i16 = torch.round(tensor.to(torch.float32) / scales.to(torch.float32)).clamp(-128, 127).to(torch.int16)
+        q_u8 = (q_i16 + 128).to(torch.uint8)
+        return q_u8, scales.squeeze(-1)

     def dequantize(self, quantized: torch.Tensor, scale_or_metadata: Any, **kwargs: Any) -> torch.Tensor:
         """Reference dequantization back to bf16."""
         _ = kwargs
         scales = scale_or_metadata.get("scales") if isinstance(scale_or_metadata, dict) else scale_or_metadata
         if scales is None:
             raise ValueError("scales required for dequantization")
         if scales.dim() == 1:
             scales = scales.unsqueeze(-1)
-        return (quantized.to(torch.float32) * scales.to(torch.float32)).to(torch.bfloat16)
+        if quantized.dtype == torch.uint8:
+            q = quantized.to(torch.int16) - 128
+        else:
+            q = quantized.to(torch.int16)
+        return (q.to(torch.float32) * scales.to(torch.float32)).to(torch.bfloat16)
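
A quick round-trip check of the bias-128 encoding suggested above; the shapes, tolerances, and standalone style are illustrative, not taken from the repository's tests.

import torch

x = torch.randn(4, 8, dtype=torch.bfloat16)
abs_max = x.abs().max(dim=-1, keepdim=True)[0]                 # [N,1] per-row absmax
scales = (abs_max.clamp(min=1e-8) / 127.0).to(torch.bfloat16)  # [N,1]

q_i16 = torch.round(x.to(torch.float32) / scales.to(torch.float32)).clamp(-128, 127).to(torch.int16)
q_u8 = (q_i16 + 128).to(torch.uint8)                           # matches the declared uint8 storage

deq = ((q_u8.to(torch.int16) - 128).to(torch.float32) * scales.to(torch.float32)).to(torch.bfloat16)
assert q_u8.dtype == torch.uint8
assert torch.allclose(deq.to(torch.float32), x.to(torch.float32), atol=1e-1, rtol=1e-1)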
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 115-115: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 128-128: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In `@diffulex/utils/quantization/strategies/linear_marlin_int8_w8a16.py` around lines 101-132:
get_storage_dtype declares torch.uint8 storage, but quantize()/dequantize() use signed int8. Change
quantize(...) to produce uint8 by biasing the signed int8 values (add 128), clamping to [0, 255], and
returning dtype torch.uint8; change dequantize(...) to accept the uint8 storage and convert back to signed
by subtracting 128 before multiplying by the scales. Keep the scales handling (squeeze/unsqueeze) the same
and do the arithmetic in float32 before casting the result to bfloat16, so that get_storage_dtype, quantize,
and dequantize stay consistent.

luozixin2 added 2 commits February 9, 2026 02:55
…and revision support

- Added trust_remote_code and revision attributes to Config class for improved model and tokenizer loading flexibility.
- Updated model_runner and tp_worker to utilize the new configuration options when loading models and tokenizers (see the loading sketch after this list).
- Enhanced quantization strategies to handle initialization and storage more robustly.
- Improved error handling and logging for model warmup and KV cache allocation processes.
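
As a rough illustration of the loading path described above, a hedged sketch of how trust_remote_code and revision typically flow into Hugging Face loaders; the Config field names (config.model, config.trust_remote_code, config.revision) and the helper function are assumptions, not the exact model_runner / tp_worker code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(config):
    # Pass the new Config attributes straight through to the HF loaders.
    tokenizer = AutoTokenizer.from_pretrained(
        config.model,
        trust_remote_code=config.trust_remote_code,
        revision=config.revision,
    )
    model = AutoModelForCausalLM.from_pretrained(
        config.model,
        trust_remote_code=config.trust_remote_code,
        revision=config.revision,
        torch_dtype=torch.bfloat16,
    )
    return model, tokenizer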
…ed logits

- Enhanced the _fetch_last_logits method to include error handling for empty logits and out-of-bounds indices.
- Introduced a new _gather_shifted_logits_rows method to efficiently gather shifted logits without materializing the full tensor (see the sketch after this list).
- Updated DreamSampler and FastdLLMV2Sampler classes to utilize the new gathering method for improved performance and memory management.
- Ensured compatibility with cached-prefill scenarios by using query-length splits for logits.
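
A hedged sketch of that row-gathering idea; the name mirrors the commit's _gather_shifted_logits_rows, but the query-length layout, the one-position shift convention, and the signature are assumptions for illustration, not the exact sampler code.

import torch

def gather_shifted_logits_rows(logits: torch.Tensor, query_lens: list[int]) -> torch.Tensor:
    # logits: [sum(query_lens), vocab_size], packed per sequence.
    # With the usual one-position shift, the row that predicts each sequence's next
    # token is the last row of that sequence's query span, so only those rows are
    # gathered instead of materializing a full shifted copy of the logits.
    ends = torch.cumsum(torch.tensor(query_lens, device=logits.device), dim=0)
    last_rows = ends - 1                         # last query position of each sequence
    return logits.index_select(0, last_rows)     # [num_seqs, vocab_size]

# Example: three sequences with query lengths 4, 2, and 3 packed into 9 rows.
logits = torch.randn(9, 32)
out = gather_shifted_logits_rows(logits, [4, 2, 3])
assert out.shape == (3, 32)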
@luozixin2 luozixin2 merged commit 0a8f0f4 into SJTU-DENG-Lab:v0.0.1.0209 Feb 9, 2026
1 check passed
@luozixin2 luozixin2 deleted the v0.0.1.0209 branch February 9, 2026 03:56