Conversation
- Add KvCacheDType enum supporting bf16/fp16/fp32/fp8_e4m3/fp8_e5m2
- Add parse_kv_cache_dtype() to convert a string to a dtype
- Add get_fp8_dtype_for_storage() to get the FP8 dtype from the vLLM platform
- Add compute_fp8_scale() to compute the quantization scale using absmax
- Support FP8 storage via the uint8 + view(fp8_dtype) pattern
- Add helper functions for FP8 min/max bounds
…che kernels

Core changes:
- Add kv_cache_dtype and k_scale/v_scale parameters to store/load wrappers
- Refactor store kernels to support FP8 quantization with a per-head scale:
  * store_kvcache_kernel_causal_lm: add FP8 quantization logic
  * store_kvcache_kernel_diffusion_lm: add FP8 quantization logic
  * store_kvcache_kernel_diffusion_lm_distinct: add FP8 quantization logic
- Refactor load_kvcache_kernel_kv to support FP8 dequantization:
  * Load FP8 values from the cache (uint8 storage + view to FP8 dtype)
  * Dequantize using the per-head scale and cast to the output dtype
  * Support BF16/FP16/FP32 caches without quantization overhead
- Update store_kvcache_unified_layout() to handle the FP8 uint8->fp8 view
- Update store_kvcache_distinct_layout() to handle the FP8 uint8->fp8 view
- Update load_kvcache() to support a configurable output dtype (defaults to k_new.dtype)
- Use constexpr int constants instead of an enum in the Triton kernels (Triton limitation)

Technical details (see the sketch below):
- FP8 uses absmax-based quantization: value_fp8 = clamp(value_fp32 / scale, fp8_range)
- FP8 dequantization: value_out = (value_fp8.to(float32) * scale).to(output_dtype)
- The scale can be a scalar or a per-head vector [num_kv_heads]
- Maintains backward compatibility: defaults to BF16 when kv_cache_dtype is not specified
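A minimal PyTorch sketch of the absmax scheme summarized above, outside the Triton kernels; the helper name, shapes, and defaults are illustrative, not the repo's API:

```python
import torch

def fp8_roundtrip(x: torch.Tensor, fp8_dtype=torch.float8_e4m3fn, out_dtype=torch.bfloat16):
    finfo = torch.finfo(fp8_dtype)
    # Per-head absmax scale; x is assumed to be [num_tokens, num_kv_heads, head_dim]
    absmax = x.abs().amax(dim=(0, 2)).to(torch.float32)          # [num_kv_heads]
    scale = (absmax / finfo.max).clamp(min=1e-12)                # guard against all-zero heads
    # Quantize: clamp(value / scale, fp8_range), then cast to the FP8 dtype
    q = (x.to(torch.float32) / scale[None, :, None]).clamp(finfo.min, finfo.max).to(fp8_dtype)
    # Storage pattern: keep raw bytes as uint8, reinterpret as FP8 when reading
    stored = q.view(torch.uint8)
    # Dequantize: FP8 -> FP32, multiply by the scale, cast to the output dtype
    deq = (stored.view(fp8_dtype).to(torch.float32) * scale[None, :, None]).to(out_dtype)
    return deq

x = torch.randn(16, 8, 128, dtype=torch.bfloat16)
print((fp8_roundtrip(x).float() - x.float()).abs().max())  # small, limited by FP8 precision
```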
- Update the import from attention_v4 to the ops module
- Fix the function name from store_kvcache_unified to store_kvcache_unified_layout
- Add test_kv_cache_fp8_unified_roundtrip.py for the unified-layout FP8 store/load roundtrip
- Add test_kv_cache_fp8_distinct_roundtrip.py for the distinct-layout FP8 store test
- Test FP8 quantization/dequantization with per-head scales
- Verify roundtrip accuracy with atol=1e-1, rtol=1e-1 tolerance for FP8 precision
- Reduce num_warps from 4 to 1 to reduce shared memory usage
- Reduce num_unroll_cache from 4 to 2 to reduce shared memory usage
- Add comments explaining why BLOCK_M/BLOCK_N cannot be reduced
- Minor code formatting fix in kv_cache_kernels.py
… KV cache implementation
- Add a kv_cache_dtype field to the Config class (default: bf16)
- Add a _get_kv_cache_storage_info() helper to determine the storage dtype and itemsize
- Update allocate_kv_cache() in ModelRunnerForCausalLM to use kv_cache_dtype
- Update allocate_kv_cache() in ModelRunnerForDiffusionLM to use kv_cache_dtype
- Support FP8 KV cache allocation using a uint8 storage dtype (see the sketch below)
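A short sketch of the allocation idea: FP8 caches are allocated as raw uint8 bytes and reinterpreted as an FP8 dtype where kernels read and write them. The function name, shape, and layout below are illustrative, not the actual allocate_kv_cache() signature:

```python
import torch

def allocate_kv_cache(num_blocks, block_size, num_kv_heads, head_dim, kv_cache_dtype="bf16"):
    if kv_cache_dtype.startswith("fp8"):
        storage_dtype, itemsize = torch.uint8, 1     # 1 byte per element
    else:
        storage_dtype, itemsize = torch.bfloat16, 2  # 2 bytes per element
    shape = (2, num_blocks, block_size, num_kv_heads, head_dim)  # K and V planes
    cache = torch.empty(shape, dtype=storage_dtype)
    print(f"{cache.numel() * itemsize / 2**20:.1f} MiB for kv_cache_dtype={kv_cache_dtype}")
    return cache

cache = allocate_kv_cache(64, 256, 8, 128, kv_cache_dtype="fp8_e4m3")
fp8_view = cache.view(torch.float8_e4m3fn)  # reinterpret the bytes, no copy
```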
- Add kv_cache_dtype parameter passing in attention layers (v4 and v5)
- Implement a running-max strategy for FP8 scale computation (see the sketch below)
- Pass scale parameters to the store/load functions in the forward method
- Update ContextForCausalLM to support kv_cache_dtype
- Update ModelRunnerForCausalLM to pass kv_cache_dtype to the context

Changes:
- attention_v4.py: Add _get_kv_cache_dtype(), _update_and_compute_fp8_scales(), _get_fp8_scales_from_max() methods; update forward() to pass scales
- attention_v5.py: Same changes as attention_v4.py
- context.py: Add a kv_cache_dtype field to ContextForCausalLM
- model_runner.py: Pass kv_cache_dtype to set_context_causal_lm() calls

All tests passed, including unit tests and FP8 roundtrip tests.
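A hedged sketch of the running-max idea: keep a per-head absmax that only grows, and derive the FP8 scale from it at each step. The class and method names are illustrative, not the repo's helpers:

```python
import torch

class RunningMaxScale:
    def __init__(self, num_kv_heads: int, device, fp8_dtype=torch.float8_e4m3fn):
        self.running_max = torch.zeros(num_kv_heads, dtype=torch.float32, device=device)
        self.fp8_max = torch.finfo(fp8_dtype).max

    def update(self, k: torch.Tensor) -> torch.Tensor:
        # k: [num_tokens, num_kv_heads, head_dim]
        step_max = k.abs().amax(dim=(0, 2)).to(torch.float32)
        self.running_max = torch.maximum(self.running_max, step_max)  # never shrinks
        return self.scales()

    def scales(self) -> torch.Tensor:
        # scale = running_absmax / fp8_max, guarded against all-zero heads
        return (self.running_max / self.fp8_max).clamp(min=1e-12)

tracker = RunningMaxScale(num_kv_heads=8, device="cpu")
k_scale = tracker.update(torch.randn(16, 8, 128))
```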
- Fix store_kvcache calls to pass context as a keyword argument
- Resolves the 'got multiple values for argument' error when using the FP8 KV cache
- Verified with a full pipeline test using the FP8 KV cache

Changes:
- attention_v4.py: Pass context as a keyword argument in the store_kvcache call
- attention_v5.py: Same fix as attention_v4.py
- test_fp8_kv_cache_pipeline.py: Add an integration test for the FP8 KV cache in the full pipeline

Test results:
- Successfully generated text using the FP8 KV cache (fp8_e4m3)
- All 3 test prompts generated correctly
- No errors in the FP8 quantization/dequantization path
- Add test_kv_cache_memory_usage.py to verify KV cache memory allocation
- Add test_kv_cache_speed_comparison.py to compare FP8 vs BF16 performance
- Verified that FP8 reduces per-block memory by 50% and allows allocating 2x as many blocks
- Performance tests show FP8 is comparable to BF16 in speed

Test results:
- FP8: 428 blocks × 7 MB/block = 2996 MB total
- BF16: 214 blocks × 14 MB/block = 2996 MB total
- FP8 throughput: 63.15 tok/s vs BF16: 56.27 tok/s (12% faster)
…support feat: kv cache fp8 support
…s; remove unused checker.py
…rom global memory fetching into fragment fetching
…ilable, checking errors of cuda graph capturing fixed.
… and WARP_SPECIALIZATION
- Fix the quantize function to support 2D input tensors
- Implement the FP8 unified store kernel and helper
- Implement FP8 load with Python-level dequantization
- Support both static and varlen decode modes
- Remove debug code
- Update documentation

Note: temp/ directory excluded from commit
- Add an FP8 distinct store kernel (Triton)
- Add an FP8 distinct store helper with Python-level quantization
- Update store_kvcache_distinct_layout to support the FP8 strategy
- Extend _load_kvcache_fp8 to support the distinct layout
- Fix _load_kvcache_bf16 to handle distinct-layout stride calculation
- Implement the distinct-layout decode path in attn_impl.py
- Add the load_kvcache export to diffulex_kernel/__init__.py
- Add a test script for the distinct layout
- Update .gitignore to exclude the temp/ directory
…zation strategy support

- Rename dllm_flash_attn_prefill to _dllm_flash_attn_prefill_bf16
- Rename dllm_flash_attn_decode to _dllm_flash_attn_decode_bf16
- Add a new dllm_flash_attn_prefill wrapper that dynamically selects a kernel based on the quantization strategy
- Add a new dllm_flash_attn_decode wrapper that dynamically selects a kernel based on the quantization strategy
- Currently the FP8 strategy uses the BF16 kernel (FP8 kernels to be implemented later)
- Maintain backward compatibility with the same function signatures
- Tested: the BF16 path works correctly in end-to-end tests
…and pull requests
Key optimizations:
1. Replace element-wise FP8->FP32->BF16 dequantization loops with T.copy for a vectorized cast
2. Fuse K_Scale into the score computation (avoids element-wise multiplication)
3. Fuse V_Scale into the cache-branch output (only affects the cache path, not V_new)

Performance improvement:
- FP8 decode throughput: ~11.9 tok/s -> ~24.4 tok/s (2x improvement)
- FP8/BF16 decode ratio: 0.759x (was ~0.38x)

Technical details (see the sketch below):
- Removed the K_Cache_shared_fp8/V_Cache_shared_fp8 buffers and their element-wise conversion loops
- Use T.copy(K_Cache[..], K_Cache_shared_bf16) for a direct FP8->BF16 cast
- Apply K_Scale[kv_head_idx] to acc_score_kvcache after the GEMM (before softmax)
- Apply V_Scale[kv_head_idx] to acc_score_kvcache before the V_Cache GEMM (cache branch only)
- Maintains numerical equivalence with the previous implementation
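The scale fusion is valid because a per-head scalar can be pulled out of the GEMM: Q @ (s·K)^T = s·(Q @ K^T), and likewise for the V-side product. A quick PyTorch check with illustrative shapes (not the TileLang kernel itself):

```python
import torch

q = torch.randn(64, 128)    # [q_tokens, head_dim]
k = torch.randn(256, 128)   # [kv_tokens, head_dim], stand-in for dequantized cache values
s = torch.tensor(0.037)     # per-head scale

scores_dequant_first = q @ (s * k).T   # element-wise dequantize, then GEMM
scores_fused = s * (q @ k.T)           # GEMM on raw values, scale folded in once afterwards
print(torch.allclose(scores_dequant_first, scores_fused, atol=1e-5))  # True
```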
Main changes:
1. Refactor the quantization module architecture:
   - Add a QuantizationConfig and registry system
   - Support quantization strategies for the KV cache and Attention-Q
   - Implement a strategy-capability interface and remove hard-coded isinstance checks
   - Add AttnQQuantizationStrategy support (architecture layer only; kernel to be implemented)
2. Rename the FP8 kernel:
   - dllm_flash_attn_decode_kernel_fp8 -> dllm_flash_attn_decode_kernel_bf16_q_fp8_kv
   - More accurately reflects what the kernel actually does (BF16 Q + FP8 KV)
3. Simplify the kernel implementation:
   - Remove the USE_KV_SHARED environment-variable switch
   - Remove the fragment path and keep only the shared-memory path
   - Simplify configuration management (from a dict to a single config object)
4. Testing and validation:
   - Add end-to-end tests covering the BF16 and BF16+FP8 KV paths
   - All tests pass; text generation works correctly

Backward compatible: the existing API is unchanged and existing code needs no modification.
Merge updates from origin/main:
- Update the device list in README.md
- Update .gitignore to add cuda_cache/
- Update the GitHub workflows permission configuration

Keep README.md as the original main-branch version, without the quantization-related documentation.
…support Feat/kv cache fp8 support
- Add a LinearQuantizationStrategy interface supporting weight+activation quantization
- Support layer-type-specific strategies (attn/mlp/other)
- Add a registry system for linear quantization strategies (see the sketch below)
- Add Config fields: linear_attn_weight_dtype, linear_mlp_weight_dtype, linear_attn_act_dtype, linear_mlp_act_dtype
- Integrate a factory to inject strategies into QuantizationContext
- Add dynamic dispatch in Linear.forward() based on quant_kind
- Tag Linear layers in models (dream/llada/sdar/fast_dllm_v2) with quant_kind
- Add placeholder (stub) strategies that raise NotImplementedError for non-bf16 dtypes
- Add unit tests for registry/factory/dispatch behavior
- Default bf16 behavior unchanged (fully backward compatible)

All non-bf16 paths currently raise NotImplementedError with clear error messages, providing a stable interface for future kernel/packed-weight implementations.
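A minimal sketch of the registry + dispatch idea described above; the decorator, class, and key names are illustrative, not the repo's actual API:

```python
import torch
import torch.nn.functional as F

_LINEAR_STRATEGIES = {}

def register_linear_strategy(name):
    def deco(cls):
        _LINEAR_STRATEGIES[name] = cls
        return cls
    return deco

@register_linear_strategy("bf16")
class LinearBF16Strategy:
    def linear_forward(self, x, weight, bias=None):
        return F.linear(x, weight, bias)

@register_linear_strategy("int8")
class LinearInt8Stub:
    def linear_forward(self, x, weight, bias=None):
        raise NotImplementedError("int8 linear kernel not implemented yet")

def strategy_for(quant_kind: str, config: dict):
    # quant_kind is the tag on the Linear layer ("attn" / "mlp" / "other");
    # the Config fields pick a weight dtype per layer type, defaulting to bf16.
    dtype = {"attn": config.get("linear_attn_weight_dtype", "bf16"),
             "mlp": config.get("linear_mlp_weight_dtype", "bf16")}.get(quant_kind, "bf16")
    return _LINEAR_STRATEGIES[dtype]()

strategy = strategy_for("mlp", {"linear_mlp_weight_dtype": "bf16"})
y = strategy.linear_forward(torch.randn(4, 16), torch.randn(8, 16))
```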
- Remove the .cursor directory from git tracking
- Add .cursor/ to .gitignore to avoid committing it again in the future
- Optimize W8A16 small-M decode: pad M<16 to 16 (instead of 64) and use block_M=16/32/64 (see the sketch below)
- Add a w8a16_gemm_bias kernel with a fused bias epilogue (opt-in via DIFFULEX_W8A16_FUSE_BIAS)
- Add runtime profiling hooks for W8A16 (DIFFULEX_LINEAR_PROFILE) to track the M distribution and fallbacks
- Implement an FP8 KV varlen fused dequantization kernel (Triton) for the unified layout
- Add benchmark configs for the W4A8 and W8A8 quantization strategies
- Add profiling hooks for KV cache load timing (DIFFULEX_PROFILE_KVCACHE)
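A sketch of the small-M handling mentioned above (names and tile sizes illustrative, not the kernel launcher): pick the smallest supported block_M that covers M and pad the activations to it, instead of always padding small decode batches up to 64 rows.

```python
import torch

def pick_block_m(m: int, supported=(16, 32, 64)) -> int:
    for bm in supported:
        if m <= bm:
            return bm
    return supported[-1]

def pad_to_block_m(x: torch.Tensor) -> tuple[torch.Tensor, int]:
    m, k = x.shape
    block_m = pick_block_m(m)
    if m == block_m:
        return x, m
    padded = x.new_zeros(block_m, k)
    padded[:m] = x
    return padded, m  # keep the original M so the GEMM output can be sliced back

x = torch.randn(5, 4096, dtype=torch.bfloat16)
x_padded, orig_m = pad_to_block_m(x)   # 5 rows -> 16 rows, not 64
```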
Main additions:
1. **Marlin/AllSpark INT8 W8A16 quantization strategy integration**:
- Add linear_marlin_int8_w8a16.py: a W8A16 quantization strategy based on the vLLM AllSpark kernel
- Add diffulex_kernel/csrc/marlin/: AllSpark CUDA kernels vendored from vLLM
* allspark_qgemm_w8a16.cu: W8A16 fused GEMM kernel
* allspark_repack.cu: N32K16 weight-repack kernel
* allspark_utils.cuh: utility functions and data structures
* torch_bindings_marlin.cpp: PyTorch C++ bindings
- Add diffulex_kernel/python/marlin_ops.py: Python interface for JIT-compiling and loading the Marlin/AllSpark kernels
2. **Quantization strategy registration updates**:
- Add a 'marlin' alias in registry.py (mapped to marlin_int8)
- Import the new strategy in strategies/__init__.py
3. **Performance improvements**:
- The Marlin W8A16 strategy significantly improves prefill throughput (from 4518.92 tok/s to 9520.91 tok/s, about 2.1x)
- Decode throughput is close to the BF16 baseline (23.16 tok/s vs 23.36 tok/s)
- Can be combined with the FP8 KV cache
4. **Other improvements**:
- Optimized several quantization strategy implementations
- Improved KV cache management
- Enhanced the profiler
- Added several benchmark configuration files
…support Linear Quantization Support
…support feat: integrate Marlin/AllSpark INT8 W8A16 quantization strategy
Main changes:
- Add GPTQ Marlin (W4A16) and AWQ Marlin (W4A16) quantization strategies
- Fix loader.py to correctly load gptq_marlin-format weights (supporting Marlin-specific repacked qweight and permuted scales)
- Modify quantize_model.py to support exporting the gptq_marlin format (symmetric quantization + Marlin repack/permute)
- Update linear.py:
  - Add an _offline_quant_bits buffer to store the quantization bit width
  - Add GPTQ runtime shuffle support (gptq_shuffle)
  - Add lazy repack support for GPTQ/AWQ Marlin (_maybe_prepare_offline_gptq_marlin/_awq_marlin)
  - Standardize on the vLLM format (int32 packed, fp16 scales)
- Simplify the individual strategy files and remove duplicated code
- Remove the old AllSpark Marlin implementation files
- Add several benchmark configuration files (GPTQ/AWQ Marlin at each bit width)
benchmark_results is a locally generated evaluation artifact and should not be in version control. This commit removes it as a normal deletion and relies on the benchmark_results/ rule in .gitignore to prevent it from being committed again.
- Add quant-method=auto support: use auto-gptq / awq for real calibrated quantization
- Add calibration-data arguments: --calib-text-file, --calib-num-samples, --calib-seq-len, etc.
- Implement _export_autogptq_to_vllm_weights: export vLLM-format weights from an auto-gptq quantized model
- Implement _export_awq_to_vllm_weights: export vLLM-format weights from an awq quantized model
- Keep the old quant-method=simple implementation for backward compatibility
- Fix the gptq_marlin scales shape inference and TP sharding logic in loader.py
- Fix linear_gptq_marlin_w4a16.py to remove an unnecessary bf16->fp16 conversion
Main refactoring:
1. **diffulex/layer/linear.py** - greatly simplify the quantization logic (-197 lines):
   - Add `_forward_base()`: a unified forward dispatcher that replaces the duplicated quantization branches in subclasses
   - Add `_build_offline_forward_kwargs()`: unified construction of offline-quantization (GPTQ/AWQ) forward arguments
   - Add helper methods such as `_get_linear_strategy()`, `_offline_meta()`, and `_infer_gptq_weight_bits()`
   - Fix the edge case in `LoRAMixin.merge_lora` where the base weight is None
   - Remove unused imports (marlin_zero_points, unpack_cols, marlin_make_empty_g_idx)
2. **diffulex/utils/loader.py** - improve performance and code structure:
   - Scan the safetensors files once to build a key_to_file index, avoiding repeated file I/O
   - Cache the `model.named_modules()` result instead of rebuilding the dict repeatedly
   - Add `_find_offline_capable_module()`: unified module lookup
   - Add `_load_tensors_for_prefix()`: centralized tensor loading that opens only the necessary files
   - Replace print() with logger.warning()/logger.exception() for consistent logging
3. **diffulex/engine/model_runner.py** - eliminate duplicated loops:
   - Cache the attention-module list once in `allocate_kv_cache`
   - Replace the repeated module traversal loops with `enumerate(attn_modules)`
4. **diffulex/utils/quantization/strategies/linear_int4_w4a16.py** - fix a missing implementation:
   - Add the `quantize_weight_for_kernel` method, fixing the W4A16 online-quantization runtime error
5. Delete the unused configuration file `gptq_marlin_w2_bf16kv_varlen.yml`

Testing: verified that W8A16 online quantization and GPTQ offline quantization work correctly.
- Change the final summary from the last step's instantaneous throughput to a true average (total tokens / total time)
- Add ms/step statistics for easier performance analysis
- Fixes the issue where only the last step's instantaneous value was shown instead of the average
- Quantized linear: drop kwargs/pop and duplicated availability checks; cache out_features and the necessary intermediate tensors
- Call the vLLM CUDA ops directly (W8A8/GPTQ/AWQ/Marlin, etc.) to reduce Python glue overhead
- Handle the qweight/scales layout and contiguity at load time, avoiding repeated work in forward
- Remove the profiler record annotations from linear.py to keep the code clean
- Add trace/profile helper analysis scripts and related tests
… strategies

- Remove all .item() calls in LinearBase hot paths (GPU->CPU sync breaks graph capture); see the sketch below
- Add a Python-side meta cache (_offline_quant_*_py, _gptq_is_shuffled_py, etc.)
- Use in-place fill_() + Python mirrors for state updates
- Simplify linear quantization strategies for future CUDA Graph support
- Remove fast_path checks and redundant branching in linear_marlin_int8_w8a16
- Remove fast_path in linear_int8_w8a8 (unified vLLM path)
- Simplify linear_gptq_w4a16 (direct torch.ops._C.gptq_gemm call)
- Make linear_fp8_w8a16 use an explicit quant_scales parameter
- Fix the FP8 weight layout: do not force contiguous for a transpose view (KxN stride0==1)
- Remove profiler record_function wrappers (graph-friendly)

Net: -129 lines; a cleaner codebase ready for CUDA Graph capture
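A sketch of the ".item()-free" state pattern described above: the flag lives in a GPU buffer that is only mutated in-place with fill_(), while a plain Python mirror is used for branching, so hot paths never force a GPU->CPU sync that would break CUDA Graph capture. The attribute names echo the commit, but the class itself is illustrative:

```python
import torch

class GPTQShuffleState:
    def __init__(self, device: str = "cpu"):
        self._gptq_is_shuffled = torch.zeros(1, dtype=torch.int32, device=device)
        self._gptq_is_shuffled_py = False  # Python-side mirror of the GPU flag

    def mark_shuffled(self) -> None:
        self._gptq_is_shuffled.fill_(1)    # in-place update, graph-capture friendly
        self._gptq_is_shuffled_py = True

    def needs_shuffle(self) -> bool:
        # Branch on the mirror instead of calling self._gptq_is_shuffled.item()
        return not self._gptq_is_shuffled_py

state = GPTQShuffleState()
if state.needs_shuffle():
    state.mark_shuffled()
```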
- Add a per-layer ForwardPlan to pre-resolve bf16/quant/offline paths and reduce per-call Python branching.
- Prefer direct torch.ops kernels (GPTQ/AWQ/Marlin) with static args for stable capture.
- Fix D2F static CUDA graph capture/replay metadata (token buckets + cu_seqlens) and add a profiler flag.
- Fix a tensor shape mismatch bug in static+CUDA Graph decode mode (model_runner.py)
- Improve the bucket selection logic for variable token counts
- Add a safety fallback when the runtime batch exceeds the captured capacity
- Fix metadata buffer initialization and padding
- Add new static-mode benchmark configs:
  - awq_bf16kv_static.yml
  - gptq_marlin_w4_bf16kv_static.yml
  - gptq_marlin_w8_bf16kv_static.yml
- Update quantization strategies and loader utilities
- Update benchmark configurations for consistency
- Remove the bench configs and quantization architecture docs added after v0.0.1
- Move the W8A16/DP tuning knobs from environment variables into Config/strategy.configure
- Remove hard-coded local paths and default GPUs from the examples/scripts, and fix syntax issues
Feat/kv cache fp8 support
Actionable comments posted: 8
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
diffulex/strategy/block_diffusion/engine/sequence.py (1)
49-55: ⚠️ Potential issue | 🔴 Critical
Type hint mismatch: the `modified_to` parameter expects a tensor but is typed as `int`.
Line 54 calls `modified_to.item()`, which is a tensor method (PyTorch/NumPy) for extracting a scalar value. However, the parameter on line 49 is typed as `int`, which doesn't have an `.item()` method. This will cause an `AttributeError` at runtime.
🔧 Proposed fix: update the type hint to reflect the tensor type
If using PyTorch tensors:
- def modify_token(self, local_token_id: int, modified_to: int) -> None: + def modify_token(self, local_token_id: int, modified_to: torch.Tensor) -> None: if self.seq is None: raise RuntimeError("Diffusion block is not attached to a sequence.") target_id = local_token_id + self.global_start_id assert self.seq.token_ids[target_id] == self.mask_token_id self.seq.token_ids[target_id] = modified_to.item() # type: ignore[assignment] self.seq.new_tokens += 1Note: You'll need to add
`import torch` at the top of the file. Alternatively, if
`modified_to` should actually be an `int`, remove the `.item()` call:
def modify_token(self, local_token_id: int, modified_to: int) -> None: if self.seq is None: raise RuntimeError("Diffusion block is not attached to a sequence.") target_id = local_token_id + self.global_start_id assert self.seq.token_ids[target_id] == self.mask_token_id - self.seq.token_ids[target_id] = modified_to.item() # type: ignore[assignment] + self.seq.token_ids[target_id] = modified_to # type: ignore[assignment] self.seq.new_tokens += 1
diffulex_legacy/layers/attention/ops/kv_cache_kernels.py (1)
429-486: ⚠️ Potential issue | 🟠 Major
Guard against missing FP8 scales in `load_kvcache`.
When `kv_cache_dtype` specifies FP8, the function must have valid scales to dequantize correctly. Lines 484–485 will silently create unit scales if `k_scale` and `v_scale` are `None`, producing incorrect results. Add an explicit check to fail fast.
Suggested fix
spec = parse_kv_cache_dtype(kv_cache_dtype) + if spec.is_fp8 and (k_scale is None or v_scale is None): + raise ValueError("FP8 KV cache requires k_scale and v_scale for load.")
diffulex_legacy/utils/context.py (1)
24-45: ⚠️ Potential issue | 🟡 Minor
Validate `kv_cache_dtype` to fail fast.
The new public parameter should be checked against supported values to avoid late runtime errors.
🛡️ Suggested guard
def set_context_causal_lm( is_prefill, cu_seqlens_q=None, cu_seqlens_k=None, max_seqlen_q=0, max_seqlen_k=0, slot_mapping=None, context_lens=None, block_tables=None, kv_cache_dtype: str = "bf16" ) -> None: + allowed_kv_cache_dtypes = { + "bf16", "fp16", "fp32", "fp8", "fp8_e4m3", "fp8_e5m2" + } + if kv_cache_dtype not in allowed_kv_cache_dtypes: + raise ValueError( + f"Unsupported kv_cache_dtype: {kv_cache_dtype}. " + f"Expected one of {sorted(allowed_kv_cache_dtypes)}." + ) global _CONTEXT_FOR_CAUSAL_LM _CONTEXT_FOR_CAUSAL_LM = ContextForCausalLM(
🤖 Fix all issues with AI agents
In `@diffulex_bench/datasets.py`:
- Around line 26-34: The bug is that slicing with dataset[:limit] turns the
Dataset into a dict-of-lists so the subsequent loop over dataset iterates keys;
replace that slice with dataset.select(range(limit)) so iteration yields
records. Update the code around load_dataset(..., split=split) and the
conditional that checks limit to use dataset = dataset.select(range(limit))
(referencing the dataset variable and load_dataset call) and ensure the rest of
the loop (for item in dataset, accessing item["question"], item["answer"])
continues to work with Dataset records.
- Around line 65-71: The code incorrectly slices the HuggingFace Dataset with
dataset[:limit], which can convert it to a list and break iteration; instead,
when limiting the humaneval dataset obtained by load_dataset("openai/humaneval")
assign dataset = dataset.select(range(limit)) (or
dataset.select(range(limit)).shuffle(...) if needed) so the result stays a
Dataset object and iteration in the subsequent loop over dataset works
correctly; update the block that checks limit to use
dataset.select(range(limit)) rather than dataset[:limit].
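For context, a tiny standalone illustration of the Dataset slicing pitfall described above (toy in-memory data, not the benchmark loader itself):

```python
from datasets import Dataset

ds = Dataset.from_dict({"question": ["q1", "q2", "q3"], "answer": ["a1", "a2", "a3"]})

sliced = ds[:2]                 # dict of lists: {'question': [...], 'answer': [...]}
print(list(sliced))             # iterating yields the column names, not records

limited = ds.select(range(2))   # still a Dataset
for item in limited:            # iterating yields {'question': ..., 'answer': ...} records
    print(item["question"], item["answer"])
```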
In `@diffulex_kernel/python/paged_attn_decode_triton.py`:
- Line 527: The assertion in paged_attn_decode_triton.py uses
attn_metadata.kv_cache_layout which doesn't exist on the AttnMetaDataBase class
and will raise AttributeError; fix by adding a default attribute
kv_cache_layout: str = "unified" to the AttnMetaDataBase definition in
diffulex/attention/metadata.py (so the assertion in paged_attn_decode_triton.py
continues to work), or alternatively change the assertion to use
getattr(attn_metadata, "kv_cache_layout", "unified") to provide a default —
update either the AttnMetaDataBase class (preferred) or the assertion
accordingly.
In `@diffulex_profiler/backends/viztracer.py`:
- Around line 53-67: The stop() method in VizTracer backend currently calls
self.tracer.stop() but never calls the required self.tracer.save(), so the trace
file is not written; update stop() (method stop, referencing self.tracer and
output_file) to call self.tracer.save() immediately after self.tracer.stop() and
before reading self.tracer.output_file, then proceed to build the result dict
and set self.tracer = None so the trace is persisted to disk.
In `@diffulex/sampler/sdar.py`:
- Around line 17-56: In forward(), the boolean flags margin_confidence and
neg_entropy are incorrectly compared to strings when passed into sample_tokens
(e.g., neg_entropy == "neg_entropy"), so True is never honored; change the calls
to normalize these inputs to booleans (accept both bool and legacy string
values) before passing them to sample_tokens — e.g., compute
normalized_neg_entropy = bool(neg_entropy) or normalized_neg_entropy =
(neg_entropy is True or neg_entropy == "neg_entropy") and similarly for
margin_confidence, then call sample_tokens(...,
neg_entropy=normalized_neg_entropy,
margin_confidence=normalized_margin_confidence); apply the same normalization
pattern wherever these flags are used (including other files llada.py, dream.py,
fast_dllm_v2.py) so sample_tokens always receives a proper bool.
In `@diffulex/strategy/fast_dllm_v2/engine/model_runner.py`:
- Around line 123-133: The code in model_runner.py fails to handle the IN_CACHE
state for seq.diffusion_blocks[-1], causing slot_mapping to be shorter than
input_ids; in the block that currently checks seq.diffusion_blocks[-1].is_active
and .is_to_cache, add an else branch that mirrors the active case by extending
slot_mapping with [-1] * self.diffusion_block_size so slot_mapping stays aligned
with the input_ids produced by diffusion_decoding_inputs(); update the branch
containing seq.diffusion_blocks[-1].is_active,
seq.diffusion_blocks[-1].is_to_cache, slot_mapping, and
diffusion_decoding_inputs() accordingly.
In `@diffulex/utils/quantization/strategies/kv_cache_bf16.py`:
- Around line 55-60: The BF16 alias registration (register_kv_cache_strategy ->
_build_kv_cache_bf16 returning KVCacheBF16Strategy) causes fp32/fp16 strings to
be treated as 2-byte storage but downstream code still uses hardcoded
dtype-to-size lookups; update callers to ask the strategy for its actual storage
dtype: call the strategy's get_storage_dtype() (e.g., on the KVCacheBF16Strategy
instance) and compute sizes via numpy dtype.itemsize instead of mapping strings
to sizes. Replace any hardcoded branches that assume "fp32" => 4 bytes (such as
code that computes itemsize) with a call to strategy.get_storage_dtype() and
np.dtype(...).itemsize so memory calculations match the registered strategy.
In `@diffulex/utils/quantization/strategies/linear_bf16.py`:
- Around line 9-14: The factory function _build_linear_bf16() calls
LinearBF16Strategy before that class is defined, causing a NameError at import;
move the class LinearBF16Strategy definition above the
`@register_linear_strategy`-decorated _build_linear_bf16 function (or
alternatively inline the class reference by returning an instance via a lambda
that imports/defines the class first) so that LinearBF16Strategy is defined when
_build_linear_bf16() is executed.
🟠 Major comments (15)
diffulex/engine/tp_worker.py-77-84 (1)
77-84: ⚠️ Potential issue | 🟠 Major
Async path bypasses the activation-quant cache clear.
`step_async` reimplements the step logic but skips the cache clear, so async generation can reuse stale activation-quant state across steps.
✅ Suggested fix (reuse step())
@@ - def _step(): - seqs, is_prefill = self.scheduler.schedule() - sample_output = self.model_runner.call("run", seqs, is_prefill) - n_diff_steps = self.scheduler.postprocess(seqs, sample_output) - outputs = [(seq.seq_id, seq.completion_token_ids) for seq in seqs if seq.is_finished] - num_tokens = sum(seq.num_tokens for seq in seqs) if is_prefill else sum(seq.new_tokens for seq in seqs) - deltas = [] - return outputs, num_tokens, is_prefill, n_diff_steps, deltas + def _step(): + return self.step()Also applies to: 94-111
diffulex_profiler/exporters/summary.py-57-72 (1)
57-72: ⚠️ Potential issue | 🟠 Major
Fix `output_file` clobbering — the summary may write to the wrong file.
The viztracer branch overwrites `output_file`, so the summary can end up written to the trace file (or "N/A").
Proposed fix
- if m.backend_data and m.backend_data.get("backend") == "viztracer": - output_file = m.backend_data.get("output_file", "N/A") - summary_lines.append(f" VizTracer Output: {output_file}") + if m.backend_data and m.backend_data.get("backend") == "viztracer": + viz_output_file = m.backend_data.get("output_file", "N/A") + summary_lines.append(f" VizTracer Output: {viz_output_file}")diffulex/attention/attn_impl.py-59-72 (1)
59-72: ⚠️ Potential issue | 🟠 Major
Initialize scales even when they are `None`.
The current guard skips `update_scales` on the first store, so `k_scale`/`v_scale` can remain `None` and later decoding may fail when a strategy requires scales. It's safer to initialize via the strategy even when the scales are not yet set.
🐛 Proposed fix
- # Update scales if quantization strategy requires them - if self.k_scale is not None and self.v_scale is not None: - from diffulex.utils.quantization.context import get_kv_cache_strategy - strategy = get_kv_cache_strategy() - if strategy is not None: - self.k_scale, self.v_scale = strategy.update_scales( - k, v, self.k_scale, self.v_scale, - self.num_kv_heads, k.device - ) - # Pass scale to metadata if required by strategy - if strategy is not None: - strategy.maybe_set_attn_metadata_scales( - attn_metadata, k_scale=self.k_scale, v_scale=self.v_scale - ) + # Update/initialize scales if quantization strategy requires them + from diffulex.utils.quantization.context import get_kv_cache_strategy + strategy = get_kv_cache_strategy() + if strategy is not None: + self.k_scale, self.v_scale = strategy.update_scales( + k, v, self.k_scale, self.v_scale, + self.num_kv_heads, k.device + ) + # Pass scale to metadata if required by strategy + strategy.maybe_set_attn_metadata_scales( + attn_metadata, k_scale=self.k_scale, v_scale=self.v_scale + )diffulex/strategy/d2f/engine/model_runner.py-293-309 (1)
293-309: ⚠️ Potential issue | 🟠 Major
Guard CUDA-graph replay when `decode_mode` isn't static.
`capture_cudagraph()` is explicitly static-only, but `run_model()` will still replay graphs even when the config/default is "varlen". This can mismatch the user's requested decode mode and the captured kernel path.
Proposed guard to align replay with static-only capture
- if is_prefill or self.enforce_eager or input_ids.size(0) > 512: + if is_prefill or self.enforce_eager or input_ids.size(0) > 512: return self.model.compute_logits(self.model(input_ids, positions)) + if self._get_decode_mode() != "static": + return self.model.compute_logits(self.model(input_ids, positions))diffulex/strategy/fast_dllm_v2/engine/kvcache_manager.py-17-18 (1)
17-18: ⚠️ Potential issue | 🟠 Major
Confusing boolean comparison in `can_append`.
The expression `(seq.cached_or_caching_num_tokens % self.block_size == 1)` evaluates to a boolean (True/False), which is then compared with `>=`. This means the condition becomes `len(free_block_ids) >= 1` when a new block is needed and `len(free_block_ids) >= 0` (always true) otherwise. This seems unintentional.
🐛 Suggested clarification
If the intent is "need at least one free block when tokens overflow to a new block":
def can_append(self, seq: "FDV2Sequence") -> bool: - return len(self.free_block_ids) >= (seq.cached_or_caching_num_tokens % self.block_size == 1) + needs_new_block = seq.cached_or_caching_num_tokens % self.block_size == 1 + return not needs_new_block or len(self.free_block_ids) >= 1diffulex_bench/runner.py-19-53 (1)
19-53: ⚠️ Potential issue | 🟠 Major
Make `trust_remote_code` opt-in for tokenizer loading.
Hardcoding `trust_remote_code=True` in `AutoTokenizer.from_pretrained()` allows arbitrary code execution when loading remote tokenizers. Add a configurable parameter with a safe default of `False`.
Suggested change
def __init__( self, model_path: str, tokenizer_path: Optional[str] = None, wait_ready: bool = True, + trust_remote_code: bool = False, **diffulex_kwargs ): """ Initialize the benchmark runner Args: model_path: Path to the model tokenizer_path: Path to the tokenizer, if None uses model_path wait_ready: Whether to wait for engine to be fully initialized before returning + trust_remote_code: Whether to trust remote code when loading tokenizer **diffulex_kwargs: Additional arguments to pass to Diffulex """ self.model_path = model_path self.tokenizer_path = tokenizer_path or model_path self.logger = get_logger(__name__) # Initialize Diffulex engine self.logger.info("Initializing Diffulex engine...") self.llm = Diffulex(model_path, **diffulex_kwargs) # Wait for engine to be ready if requested if wait_ready: self._wait_for_ready() # Load tokenizer self.logger.info("Loading tokenizer...") self.tokenizer = AutoTokenizer.from_pretrained( self.tokenizer_path, - trust_remote_code=True + trust_remote_code=trust_remote_code ) self.logger.success("Tokenizer loaded successfully")diffulex_bench/config.py-60-65 (1)
60-65: ⚠️ Potential issue | 🟠 Major
Rename the loop variable to avoid shadowing `dataclasses.field` (F811).
Both `to_dict` methods use `field` as a loop variable, which shadows the imported `dataclasses.field` and triggers Ruff F811. Use a different name (e.g., `dc_field`).
🛠️ Suggested fix
def to_dict(self) -> Dict[str, Any]: """Convert to dictionary""" return { - field.name: getattr(self, field.name) - for field in self.__dataclass_fields__.values() + dc_field.name: getattr(self, dc_field.name) + for dc_field in self.__dataclass_fields__.values() } @@ def to_dict(self) -> Dict[str, Any]: """Convert to dictionary""" return { - field.name: getattr(self, field.name) - for field in self.__dataclass_fields__.values() + dc_field.name: getattr(self, dc_field.name) + for dc_field in self.__dataclass_fields__.values() }Also applies to: 131-136
diffulex/strategy/fast_dllm_v2/engine/scheduler.py-27-35 (1)
27-35: ⚠️ Potential issue | 🟠 Major
Batch token cap check ignores cached tokens.
You check `num_batched_tokens + projected` but later add `projected - seq.num_cached_tokens`. This can prematurely block prefill and even trigger the "unable to schedule" error despite available capacity. Compute a single `projected_tokens` and use it in both places.
🛠️ Suggested fix
while self.waiting and num_seqs < self.max_num_seqs: seq = self.waiting[0] projected = len(seq) + seq.diffusion_block_size + projected_tokens = projected - seq.num_cached_tokens if ( - num_batched_tokens + projected > self.max_num_batched_tokens + num_batched_tokens + projected_tokens > self.max_num_batched_tokens or not self.block_manager.can_allocate(seq) ): break @@ - num_batched_tokens += projected - seq.num_cached_tokens + num_batched_tokens += projected_tokensdiffulex_bench/metrics.py-66-83 (1)
66-83: ⚠️ Potential issue | 🟠 Major
Return contract mismatch in `humaneval_pass_at_k`.
The function is annotated to return `float` but returns `None`. Any caller doing math or serialization will blow up with a `TypeError`. Prefer to fail fast with `NotImplementedError` (or change the signature to `Optional[float]` and document it).
🛠️ Suggested fix (fail fast)
def humaneval_pass_at_k( results: List[Dict[str, Any]], k: int = 1, ) -> float: @@ - return None + raise NotImplementedError( + "HumanEval pass@k requires code execution; implement evaluator before use." + )diffulex_bench/config.py-67-103 (1)
67-103: ⚠️ Potential issue | 🟠 Major
`get_diffulex_kwargs` returns before adding optional params; `kwargs` is undefined.
The function returns a dict immediately, so the optional quantization fields are never applied, and the later `kwargs[...]` lines reference an undefined name. Build `kwargs` first, then extend it.
🛠️ Suggested fix
def get_diffulex_kwargs(self) -> Dict[str, Any]: """Get arguments to pass to Diffulex engine""" - return { + kwargs = { 'model_name': self.model_name, 'decoding_strategy': self.decoding_strategy, 'mask_token_id': self.mask_token_id, 'tensor_parallel_size': self.tensor_parallel_size, 'data_parallel_size': self.data_parallel_size, 'gpu_memory_utilization': self.gpu_memory_utilization, 'max_model_len': self.max_model_len, 'max_num_batched_tokens': self.max_num_batched_tokens, 'max_num_seqs': self.max_num_seqs, 'use_lora': self.use_lora, 'lora_path': self.lora_path if self.use_lora else "", 'enforce_eager': self.enforce_eager, 'kv_cache_layout': self.kv_cache_layout, 'accept_threshold': self.accept_threshold, 'complete_threshold': self.complete_threshold, 'add_new_block_threshold': self.add_new_block_threshold, 'diffusion_block_size': self.diffusion_block_size, } @@ if self.linear_mlp_act_dtype is not None: kwargs['linear_mlp_act_dtype'] = self.linear_mlp_act_dtype return kwargsdiffulex/utils/quantization/strategies/linear_int8_w8a8.py-39-76 (1)
39-76: ⚠️ Potential issue | 🟠 Major
Scale-shape mismatch between `get_scale_shape` and the quantize output.
`get_scale_shape` returns `(N,)` but `quantize()` returns `scales` shaped `[1, N]` (and the cache comment says `[N]`). This mismatch can break scale buffer allocation/serialization. Align the declared shape with the actually returned tensor.
🛠️ Suggested fix (align to [1, N])
- # Cache: id(weight) -> (qweight_int8 [N,K], w_scales_fp32 [N]) + # Cache: id(weight) -> (qweight_int8 [N,K], w_scales_fp32 [1,N]) self._weight_cache: dict[int, tuple[torch.Tensor, torch.Tensor]] = {} @@ def get_scale_shape(self, original_shape: tuple[int, ...], **kwargs: Any) -> tuple[int, ...]: _ = kwargs if len(original_shape) != 2: raise ValueError(f"Expected 2D weight [N,K], got {original_shape}") - return (original_shape[0],) + return (1, original_shape[0])diffulex/engine/model_runner.py-40-47 (1)
40-47: ⚠️ Potential issue | 🟠 Major
Fix `device_id` usage in process-group init.
`dist.init_process_group` unconditionally indexes `config.device_ids[rank]`, which breaks when `device_ids` is unset and can disagree with the later fallback path. Compute `device_id` once and use it consistently for both the device setup and `init_process_group`.
🐛 Proposed fix
- dist.init_process_group("nccl", init_method, world_size=self.world_size, rank=rank, device_id=config.device_ids[rank]) - # Choose CUDA device for this TP rank. - # config.device_ids is already a list of logical CUDA device indices (respecting CUDA_VISIBLE_DEVICES). - # Do NOT add rank again, otherwise rank 1 with device_ids=[0,1] becomes device 2. - if getattr(config, "device_ids", None): - device_id = config.device_ids[rank] - else: - device_id = (getattr(config, "device_start", 0) or 0) + rank + # Choose CUDA device for this TP rank. + # config.device_ids is already a list of logical CUDA device indices (respecting CUDA_VISIBLE_DEVICES). + # Do NOT add rank again, otherwise rank 1 with device_ids=[0,1] becomes device 2. + if getattr(config, "device_ids", None): + device_id = config.device_ids[rank] + else: + device_id = (getattr(config, "device_start", 0) or 0) + rank + dist.init_process_group("nccl", init_method, world_size=self.world_size, rank=rank, device_id=device_id)diffulex/strategy/fast_dllm_v2/engine/model_runner.py-189-228 (1)
189-228: ⚠️ Potential issue | 🟠 Major
CUDA-graph sizes can miss non-multiple-of-16 batch sizes.
When `max_num_seqs` isn't a multiple of 16 (e.g., 20), `seq_bs_list` tops out at 16, so `run_model` can't find a graph for `num_tokens = 20 * block_size` and raises `StopIteration`. Ensure the list always includes `max_num_seqs`.
🐛 Suggested fix
- seq_bs_list = [1, 2, 4, 8] + list(range(16, max_num_seqs + 1, 16)) + seq_bs_list = [1, 2, 4, 8] + list(range(16, max_num_seqs + 1, 16)) + if max_num_seqs not in seq_bs_list: + seq_bs_list.append(max_num_seqs) + seq_bs_list = sorted(set(seq_bs_list))diffulex/utils/quantization/strategies/linear_fp8_w8a8.py-44-47 (1)
44-47: ⚠️ Potential issue | 🟠 Major
Potential memory leak in the weight cache keyed by `id(weight)`.
Using `id(weight)` as a cache key is risky because:
- If a weight tensor is deallocated and a new tensor is allocated at the same memory address, the cache will return stale quantized data.
- The cache holds strong references to quantized tensors, preventing garbage collection of old weights.
Consider using `weakref` or a bounded cache (e.g., LRU) to avoid unbounded memory growth, or clear the cache when model parameters change.
diffulex/strategy/fast_dllm_v2/engine/sequence.py-59-65 (1)
59-65: ⚠️ Potential issue | 🟠 Major
Type mismatch: `modified_to` is annotated as `int` but `.item()` is called on it.
The method signature shows `modified_to: int`, but line 64 calls `modified_to.item()`, which implies it's a tensor. This will raise an `AttributeError` if an actual `int` is passed.
🐛 Proposed fix
Either fix the type hint or handle both cases:
- def modify_token(self, local_token_id: int, modified_to: int) -> None: + def modify_token(self, local_token_id: int, modified_to: int | torch.Tensor) -> None: if self.seq is None: raise RuntimeError("Diffusion block is not attached to a sequence.") target_id = local_token_id + self.global_start_id assert self.seq.token_ids[target_id] == self.mask_token_id - self.seq.token_ids[target_id] = modified_to.item() # type: ignore[assignment] + value = modified_to.item() if hasattr(modified_to, 'item') else modified_to + self.seq.token_ids[target_id] = value self.seq.new_tokens += 1
🟡 Minor comments (34)
examples/test_bf16_kernel_e2e.py-70-70 (1)
70-70: ⚠️ Potential issue | 🟡 Minor
Remove the unused f-string prefix.
The f-string on line 70 contains no format expressions (no `{...}` placeholders), triggering Ruff F541. Remove the `f` prefix.
🛠️ Proposed fix
- print(f"\n总计:") + print("\n总计:")examples/test_fp8_kernel_e2e.py-72-72 (1)
72-72: ⚠️ Potential issue | 🟡 Minor
Remove the extraneous `f` prefix from a string without placeholders.
This f-string has no placeholders and should be a regular string.
🐛 Proposed fix
- print(f"\n总计:") + print("\n总计:")examples/test_fp8_linear.py-115-122 (1)
115-122: ⚠️ Potential issue | 🟡 Minor
Unused variables: `M`, `mem_bf16`, `mem_fp8`.
These variables are assigned but never used. Either remove them or use them for more detailed memory reporting.
🧹 Proposed fix (remove unused)
device = torch.device("cuda") torch.cuda.empty_cache() torch.cuda.reset_peak_memory_stats() # BF16 baseline - M, K, N = 32, 512, 256 + K, N = 512, 256 weight_bf16 = torch.randn(N, K, dtype=torch.bfloat16, device=device) - mem_bf16 = torch.cuda.memory_allocated() # FP8 quantized strategy = create_linear_strategy(weight_dtype="fp8_e4m3", act_dtype="bf16") weight_fp8, scales = strategy.quantize_weight_for_kernel(weight_bf16, device=device) - mem_fp8 = torch.cuda.memory_allocated()examples/test_gptq_awq_loading.py-52-66 (1)
52-66: ⚠️ Potential issue | 🟡 Minor
Guard `_offline_quant_format` access for compatibility.
Line 54 assumes `_offline_quant_format` exists and is a tensor with `.numel()`/`.item()`. Some layers expose an int-style `_offline_quant_format_py` instead, which would raise an `AttributeError` and break `--list-layers`. Consider a safe fallback.
Suggested fix
- format_val = int(module._offline_quant_format.item()) if module._offline_quant_format.numel() > 0 else 0 + fmt = getattr(module, "_offline_quant_format", None) + if fmt is None: + format_val = int(getattr(module, "_offline_quant_format_py", 0) or 0) + else: + format_val = int(fmt.item()) if fmt.numel() > 0 else 0examples/test_fp8_kv_cache_comprehensive.py-1225-1306 (1)
1225-1306: ⚠️ Potential issue | 🟡 Minor
Fail fast when CUDA isn't available.
Several tests unconditionally allocate CUDA tensors; add an early guard in `main()` to give a clear message instead of stack traces.
Suggested fix
args = parser.parse_args() + + if not torch.cuda.is_available(): + print("CUDA is required for FP8 KV cache tests.") + sys.exit(2)diffulex/model/__init__.py-20-22 (1)
20-22: ⚠️ Potential issue | 🟡 Minor
Add `stacklevel=2` to point warnings at the caller.
Without it, the warning points at this module instead of the import site.
🔧 Suggested change
- warnings.warn(f"Failed to import {module_name}: {e!r}", RuntimeWarning) + warnings.warn( + f"Failed to import {module_name}: {e!r}", + RuntimeWarning, + stacklevel=2, + )diffulex/strategy/d2f/engine/scheduler.py-108-116 (1)
108-116: ⚠️ Potential issue | 🟡 Minor
Guard against silent truncation in `zip()` during token assignment.
If `true_local_ids` and `accepted_ids` diverge, `zip()` will silently drop extras. Use `strict=True` to fail fast.
🔧 Safer iteration
- for true_local_id, accepted_id in zip(true_local_ids, accepted_ids): + for true_local_id, accepted_id in zip(true_local_ids, accepted_ids, strict=True):The project's Python >= 3.12 requirement supports
`zip(..., strict=True)` (available since Python 3.10).
examples/test_fp8_kv_cache_python_dequant.py-72-72 (1)
72-72: ⚠️ Potential issue | 🟡 Minor
Remove the extraneous `f` prefix from a string without placeholders.
This f-string contains no placeholders, making the `f` prefix unnecessary.
🧹 Proposed fix
- print(f"\n总计:") + print("\n总计:")diffulex_profiler/example.py-46-47 (1)
46-47: ⚠️ Potential issue | 🟡 Minor
Potential division by zero if `outputs` is empty.
If `llm.generate()` returns an empty list, dividing by `len(outputs)` will raise a `ZeroDivisionError`.
🛡️ Proposed defensive fix
profiler.record_metric("num_outputs", len(outputs)) - profiler.record_metric("avg_diff_steps", - sum(o['n_diff_steps'] for o in outputs) / len(outputs)) + if outputs: + profiler.record_metric("avg_diff_steps", + sum(o['n_diff_steps'] for o in outputs) / len(outputs))examples/test_fp8_kv_cache_python_dequant.py-3-3 (1)
3-3: ⚠️ Potential issue | 🟡 Minor
Remove an unused import.
The `os` module is imported but never used.
🧹 Proposed fix
-import os import timediffulex_kernel/__init__.py-12-21 (1)
12-21: ⚠️ Potential issue | 🟡 Minor
Tidy up lint warnings (unused `noqa`, unsorted `__all__`).
Ruff is flagging both items; easy cleanup.
Suggested fix
- from diffulex_kernel.python.dllm_flash_attn_kernels import ( # noqa: F401 + from diffulex_kernel.python.dllm_flash_attn_kernels import ( dllm_flash_attn_decode as dllm_flash_attn_decode, dllm_flash_attn_prefill as dllm_flash_attn_prefill, ) - from diffulex_kernel.python.kv_cache_kernels import ( # noqa: F401 + from diffulex_kernel.python.kv_cache_kernels import ( load_kvcache as load_kvcache, store_kvcache_distinct_layout as store_kvcache_distinct_layout, store_kvcache_unified_layout as store_kvcache_unified_layout, ) @@ __all__ = [ "dllm_flash_attn_decode", "dllm_flash_attn_prefill", - "store_kvcache_distinct_layout", - "store_kvcache_unified_layout", "load_kvcache", + "store_kvcache_distinct_layout", + "store_kvcache_unified_layout", ]Also applies to: 48-54
diffulex/utils/quantization/strategies/no_quantization.py-16-26 (1)
16-26: ⚠️ Potential issue | 🟡 Minor
Align the quantize() output with the declared BF16 storage dtype for consistency.
The current implementation returns tensors as-is, creating a mismatch with the advertised BF16 storage dtype. While `quantize()` is not called in the current codebase (only `get_storage_dtype()` is used), this inconsistency conflicts with `KVCacheBF16Strategy`, which does enforce the declared dtype by converting to BF16. For consistency and to avoid surprises if this method is called directly, apply the same pattern:
🔧 Suggested fix
def quantize(self, tensor: torch.Tensor, **kwargs) -> tuple[torch.Tensor, None]: - """No quantization, return tensor as-is.""" - return tensor, None + """No quantization, but normalize to storage dtype for consistency.""" + if tensor.dtype != torch.bfloat16: + tensor = tensor.to(torch.bfloat16) + return tensor, Nonediffulex/sampler/sdar.py-49-49 (1)
49-49: ⚠️ Potential issue | 🟡 Minor
Unused `confidence` variable.
`confidence` is never used after unpacking. Prefix it with `_` (or use it) to avoid lint noise.
Rename to unused placeholder
- confidence, sampled_tokens, initial_confidence = self.sample_tokens( + _confidence, sampled_tokens, initial_confidence = self.sample_tokens(diffulex_bench/report.py-29-31 (1)
29-31: ⚠️ Potential issue | 🟡 Minor
Replace the lambda assignment with a local function.
Ruff E731 flags assigning a lambda; a small `def` keeps lint clean.
Simple refactor
- report_lines = [] - append_line = lambda line: report_lines.append(line) + report_lines = [] + def append_line(line: str) -> None: + report_lines.append(line)diffulex/strategy/d2f/engine/model_runner.py-28-47 (1)
28-47: ⚠️ Potential issue | 🟡 Minor
Remove the suggested import path change; the current import is correct via a backward-compatibility re-export.
The import from `diffulex.utils.kv_cache_dtype` is intentionally valid: this module re-exports from the new location for backward compatibility. No change is needed there.
However, the exception handling concern has merit: `parse_kv_cache_dtype()` can raise `ValueError` (for invalid dtype strings) and `RuntimeError` (for missing torch FP8 dtypes), and catching all exceptions silently masks these issues. If an invalid `kv_cache_dtype` is provided or FP8 dtypes are unavailable, defaulting to "varlen" may not be the desired behavior. Consider either letting these exceptions propagate or logging them explicitly before the fallback.
diffulex_kernel/python/dllm_flash_attn_prefill_tilelang.py-172-178 (1)
172-178: ⚠️ Potential issue | 🟡 Minor
Unused `scale` parameter.
The `scale` parameter is accepted but never used. The kernel computes its own scale at line 39 as `(1.0 / HEAD_DIM) ** 0.5 * 1.44269504`. This is inconsistent with `flash_attn_varlen_func`, which uses the passed `scale` (line 193).
💡 Options to consider
- If the kernel should use the passed scale, modify the kernel to accept it as a parameter.
- If the hardcoded scale is intentional, document why it differs from the passed value or remove the parameter to avoid confusion.
diffulex/logger.py-44-47 (1)
44-47:⚠️ Potential issue | 🟡 MinorRestore
record.levelnameafter formatting to prevent ANSI codes leaking to other handlers.The same
LogRecordobject is shared across all handlers. Whensetup_logger()is called with both a console handler (usingColoredFormatter) and a file handler, the in-place mutation ofrecord.levelnamecauses ANSI color codes to be written to the log file. This occurs in actual usage (e.g.,diffulex_bench/main.py).🔧 Suggested change
def format(self, record): - log_color = self.COLORS.get(record.levelname, '') - record.levelname = f"{log_color}{record.levelname}{self.RESET}" - return super().format(record) + original_levelname = record.levelname + try: + log_color = self.COLORS.get(record.levelname, '') + record.levelname = f"{log_color}{record.levelname}{self.RESET}" + return super().format(record) + finally: + record.levelname = original_levelnamediffulex/logger.py-155-171 (1)
155-171:⚠️ Potential issue | 🟡 MinorRich markup in
success()will appear verbatim when using plain handlers.The
success()method is added to the globallogging.Loggerclass at module import time. When Rich is installed, it always emits Rich markup[green]✓[/green]. However, loggers set up withuse_rich=Falseuse plain text handlers that don't interpret Rich markup, causing the tags to be printed literally. DetectRichHandlerat runtime in thesuccess()method and fall back to colorama/plain formatting when Rich handlers are not present.🔧 Suggested change
- if RICH_AVAILABLE: - def success(self, message: str, *args, **kwargs): - """Log success message with rich formatting""" - self.info(f"[green]✓[/green] {message}", *args, **kwargs) - else: - def success(self, message: str, *args, **kwargs): - """Log success message""" - if COLORAMA_AVAILABLE: - self.info(f"{Fore.GREEN}✓{Style.RESET_ALL} {message}", *args, **kwargs) - else: - self.info(f"✓ {message}", *args, **kwargs) + if RICH_AVAILABLE: + def success(self, message: str, *args, **kwargs): + """Log success message with rich formatting when applicable""" + if any(isinstance(h, RichHandler) for h in self.handlers): + self.info(f"[green]✓[/green] {message}", *args, **kwargs) + elif COLORAMA_AVAILABLE: + self.info(f"{Fore.GREEN}✓{Style.RESET_ALL} {message}", *args, **kwargs) + else: + self.info(f"✓ {message}", *args, **kwargs) + else: + def success(self, message: str, *args, **kwargs): + """Log success message""" + if COLORAMA_AVAILABLE: + self.info(f"{Fore.GREEN}✓{Style.RESET_ALL} {message}", *args, **kwargs) + else: + self.info(f"✓ {message}", *args, **kwargs)diffulex_profiler/metrics.py-69-80 (1)
69-80:⚠️ Potential issue | 🟡 MinorDon’t silently swallow collector errors (S110).
The
try/except/passblocks hide failures and trip Ruff S110. Logging at debug keeps metrics best‑effort while preserving observability.🛠️ Suggested fix (log and continue)
+import logging @@ -import torch +import torch + +logger = logging.getLogger(__name__) @@ - except (ImportError, Exception): - pass + except (ImportError, Exception) as exc: + logger.debug("pynvml metrics unavailable", exc_info=exc) @@ - except Exception: - pass + except Exception as exc: + logger.debug("collect_gpu_metrics failed", exc_info=exc) @@ - except Exception: - return {} + except Exception as exc: + logger.debug("collect_cpu_metrics failed", exc_info=exc) + return {} @@ - except Exception: - return {} + except Exception as exc: + logger.debug("collect_memory_metrics failed", exc_info=exc) + return {}Also applies to: 90-96, 103-112
diffulex/strategy/fast_dllm_v2/engine/scheduler.py-102-110 (1)
102-110:⚠️ Potential issue | 🟡 MinorGuard against mismatched
true_local_ids/accepted_idslengths.
zip()truncates silently; if the lists diverge, some accepted tokens are never applied. Add an explicit length check (or raise) before iterating.🛠️ Suggested fix
sampled_tokens_map = sample_output.sampled_tokens_map.get(seq_id, {}) for block_id, accepted_ids in accepted_ids_map.items(): if not accepted_ids: continue diffusion_block = seq.diffusion_blocks[int(block_id)] sampled_tokens = sampled_tokens_map.get(block_id, []) true_local_ids = true_ids_map.get(block_id, []) + if len(true_local_ids) != len(accepted_ids): + raise ValueError( + f"Mismatch for block {block_id}: " + f"{len(true_local_ids)} true ids vs {len(accepted_ids)} accepted ids" + ) for true_local_id, accepted_id in zip(true_local_ids, accepted_ids): token = sampled_tokens[accepted_id]diffulex/utils/quantization/kv_cache_dtype.py-56-61 (1)
56-61:⚠️ Potential issue | 🟡 MinorHandle both
float8_e4m3fnandfloat8_e4m3fnuzwhen vLLM isn't available.Different PyTorch builds expose different FP8 E4M3 dtypes depending on version and backend:
float8_e4m3fnis the OCP-standard variant (NVIDIA/CUDA builds)float8_e4m3fnuzis the FNUZ variant (AMD/ROCm builds, particularly MI300+)The current code only checks for
float8_e4m3fn, so it would incorrectly raiseRuntimeErroreven whenfloat8_e4m3fnuzis available.🛠️ Suggested fix
def _get_fp8_e4m3_dtype() -> torch.dtype: if current_platform is None: if hasattr(torch, "float8_e4m3fn"): return torch.float8_e4m3fn # type: ignore[attr-defined] + if hasattr(torch, "float8_e4m3fnuz"): + return torch.float8_e4m3fnuz # type: ignore[attr-defined] raise RuntimeError("FP8 requested but vLLM current_platform is unavailable.") return current_platform.fp8_dtype()diffulex/engine/model_runner.py-165-171 (1)
165-171:⚠️ Potential issue | 🟡 MinorEnsure WARMING_UP is reset on failure.
If
_prefill_warmup()raises, the global warming flag remains set. Usetry/finallyto always reset it.🧯 Suggested fix
- set_warming_up(True) - torch.cuda.empty_cache() - torch.cuda.reset_peak_memory_stats() - self._prefill_warmup() - reset_warming_up() + set_warming_up(True) + try: + torch.cuda.empty_cache() + torch.cuda.reset_peak_memory_stats() + self._prefill_warmup() + finally: + reset_warming_up()diffulex/engine/model_runner.py-151-163 (1)
151-163:⚠️ Potential issue | 🟡 MinorGuard warmup against zero sequences.
num_seqsbecomes 0 whenmax_num_batched_tokens < max_model_len, which results in an empty warmup run. Add a guard to avoid a no-op or downstream errors.🛠️ Suggested guard
- num_seqs = min(max_num_batched_tokens // max_model_len, self.config.max_num_seqs) + num_seqs = min(max_num_batched_tokens // max_model_len, self.config.max_num_seqs) + if num_seqs <= 0: + logger.warning("Warmup skipped: max_num_batched_tokens < max_model_len") + returndiffulex/utils/quantization/__init__.py-42-68 (1)
42-68:⚠️ Potential issue | 🟡 Minorall ordering trips RUF022.
Ruff expects
__all__to be sorted; consider sorting to avoid lint failures.🔧 One-line fix
-__all__ = [ +__all__ = sorted([ # Context 'QuantizationContext', 'get_quantization_context', 'set_kv_cache_strategy', 'get_kv_cache_strategy', @@ 'ensure_scale_tensor', 'view_fp8_cache', -] +])diffulex/utils/loader.py-111-112 (1)
111-112:⚠️ Potential issue | 🟡 MinorUnused variable
pack_factor- potential bug or dead code.
pack_factoris calculated at line 111 but never used within_set_offline_gptq_marlin_weight. This is flagged by static analysis (F841). Either remove it if unnecessary, or verify if it should be used in subsequent logic.🔧 Proposed fix if unused
- pack_factor = 32 // int(bits) group_size_norm = in_features if group_size == -1 else group_sizediffulex_kernel/python/kv_cache_kernels.py-1051-1055 (1)
1051-1055:⚠️ Potential issue | 🟡 MinorDebug reference check is overly strict - any mismatch raises RuntimeError.
The FP8 debug reference check at lines 1051-1055 raises a
RuntimeErrorifmax_diff_k > 0 or max_diff_v > 0. Due to floating-point precision differences between the fused Triton kernel and Python reference implementation, small differences are expected and should not cause failures.Consider using a tolerance threshold instead:
🔧 Proposed fix
- # Be strict: any mismatch likely indicates indexing/mask/scale bug. - if max_diff_k > 0 or max_diff_v > 0: + # Allow small numerical differences due to fp32/bf16 conversion order + TOLERANCE = 1e-3 # Adjust based on expected precision + if max_diff_k > TOLERANCE or max_diff_v > TOLERANCE: raise RuntimeError( - f"FP8 fused load mismatch: max_abs_diff k={max_diff_k} v={max_diff_v}. " + f"FP8 fused load mismatch exceeds tolerance: max_abs_diff k={max_diff_k} v={max_diff_v} (tol={TOLERANCE}). " "Set DIFFULEX_DEBUG_FP8_LOAD_REF=0 to disable." )diffulex/utils/quantization/strategies/linear_fp8_w8a8.py-109-117 (1)
109-117:⚠️ Potential issue | 🟡 MinorCache invalidation only checks device, missing shape/dtype validation.
The cache recomputes when
cached[0].device != x.device, but if the originalweighttensor's content, shape, or dtype changes (e.g., during fine-tuning or model surgery), the cached quantized weight becomes stale.Consider adding shape/dtype validation or using a versioning mechanism:
🛡️ Proposed fix
wid = id(weight) cached = self._weight_cache.get(wid) - if cached is None or cached[0].device != x.device: + if (cached is None + or cached[0].device != x.device + or cached[0].shape != (weight.shape[1], weight.shape[0])): # [K,N] from [N,K] q_fp8, meta = self.quantize(weight)diffulex/utils/loader.py-237-240 (1)
237-240:⚠️ Potential issue | 🟡 MinorRemove unused AWQ Marlin variables that have no implementation in the loader.
Variables
want_awq_marlinandis_awq_marlin_ckptare defined but never used within this function. Unlikewant_gptq_marlinandis_gptq_marlin_ckptwhich have corresponding checkpoint loading logic (e.g., qzeros creation at line 442), these AWQ Marlin variables lack any implementation. While AWQ Marlin inference support exists inLinearBaseand strategy classes, the loader itself does not load AWQ Marlin checkpoints. Remove these unused variables or implement the missing AWQ Marlin checkpoint loading logic to match the GPTQ Marlin pattern.diffulex/utils/quantization/strategies/linear_fp8_w8a16.py-37-38 (1)
37-38:⚠️ Potential issue | 🟡 MinorPotential memory leak: weight cache keyed by
id(weight)can grow unbounded.The cache
self._weight_cacheusesid(weight)as keys. Sinceid()returns memory addresses that can be reused after objects are garbage collected, this can lead to:
- Stale entries if weights are replaced
- Unbounded growth if many different weights are processed
Consider using
weakrefor implementing a bounded cache with eviction.🛡️ Proposed fix using WeakValueDictionary or bounded cache
class LinearFP8W8A16Strategy(LinearQuantizationStrategy): def __init__(self, weight_dtype: str = "fp8_e4m3") -> None: super().__init__() self.weight_dtype_str = weight_dtype - # Cache: id(weight) -> (q_fp8_KN [K,N], scale_fp32 [1]) - self._weight_cache: dict[int, tuple[torch.Tensor, torch.Tensor]] = {} + # Cache: id(weight) -> (q_fp8_KN [K,N], scale_fp32 [1]) + # Note: bounded to avoid unbounded growth; consider LRU if needed. + self._weight_cache: dict[int, tuple[torch.Tensor, torch.Tensor]] = {} + self._weight_cache_max_size: int = 64 # Limit cache sizeThen in
linear_forward, add eviction logic:if len(self._weight_cache) > self._weight_cache_max_size: # Simple eviction: clear oldest entries self._weight_cache.clear()diffulex_bench/lm_eval_model.py-267-279 (1)
267-279: ⚠️ Potential issue | 🟡 Minor
Unused local variables: `avg_tokens`, `avg_nfe`, `avg_time`.
These variables are computed but never used. Either use them in the log message or remove them.
♻️ Proposed fix - use in logging or remove
Option 1: Use them in logging:
     avg_tokens = self.total_generated_tokens / self.total_samples
     avg_nfe = self.total_nfe / self.total_samples
     avg_time = self.total_generation_time / self.total_samples
     throughput = num_tokens / total_time if total_time > 0 else 0
     self.logger.info(
         f"Generated {len(results)} samples | "
         f"Tokens: {num_tokens} | "
         f"NFE: {num_nfe} | "
         f"Time: {total_time:.2f}s | "
-        f"Throughput: {throughput:.2f} tok/s"
+        f"Throughput: {throughput:.2f} tok/s | "
+        f"Avg tokens/sample: {avg_tokens:.1f} | "
+        f"Avg NFE/sample: {avg_nfe:.1f}"
     )

Option 2: Remove unused variables:

-    avg_tokens = self.total_generated_tokens / self.total_samples
-    avg_nfe = self.total_nfe / self.total_samples
-    avg_time = self.total_generation_time / self.total_samples
     throughput = num_tokens / total_time if total_time > 0 else 0

diffulex/utils/quantization/strategies/linear_fp8_w8a16.py-119-129 (1)
119-129: ⚠️ Potential issue | 🟡 Minor
Cache invalidation issue: device mismatch check may cause redundant quantization.
When `cached[0].device != x.device`, the code re-quantizes but the old entry keyed by `wid` remains if the weight object hasn't changed. This could lead to repeated quantization if inputs alternate between devices. Consider storing the device as part of the cache key or updating the existing entry properly.
♻️ Suggested improvement
-    wid = id(weight)
-    cached = self._weight_cache.get(wid)
-    if cached is None or cached[0].device != x.device:
+    cache_key = (id(weight), x.device)
+    cached = self._weight_cache.get(cache_key)
+    if cached is None:
         q_fp8, meta = self.quantize(weight)
         q_fp8 = q_fp8.to(device=x.device)
         scales = meta["scales"].to(device=x.device, dtype=torch.float32).reshape(1)
         q_kn = q_fp8
-        self._weight_cache[wid] = (q_fp8, scales)
+        self._weight_cache[cache_key] = (q_fp8, scales)
     else:
         q_kn, scales = cached

diffulex/strategy/fast_dllm_v2/engine/sequence.py-119-127 (1)
119-127: ⚠️ Potential issue | 🟡 Minor
Mutable default argument and incorrect error message.
- `SamplingParams()` as a default argument is evaluated once at function definition time, not per call. This can lead to shared state issues.
- The error message references "BDSequence" but the class is named "FDV2Sequence".
🐛 Proposed fix
 def __init__(
     self,
     token_ids: list[int],
-    sampling_params: SamplingParams = SamplingParams(),
+    sampling_params: SamplingParams | None = None,
     config: Config | None = None,
 ):
-    super().__init__(token_ids, sampling_params)
+    super().__init__(token_ids, sampling_params or SamplingParams())
     if config is None:
-        raise ValueError("BDSequence requires a Config instance.")
+        raise ValueError("FDV2Sequence requires a Config instance.")

diffulex/utils/quantization/strategies/linear_marlin_int8_w8a16.py-73-82 (1)
73-82: ⚠️ Potential issue | 🟡 Minor
Silent exception swallowing in `configure()` hides configuration errors.
The `try-except-pass` pattern here silently ignores all errors, including genuine configuration issues (e.g., invalid config types). Consider logging or at least catching more specific exceptions.
🔧 Proposed fix with logging
+import logging
+
+logger = logging.getLogger(__name__)
+
 def configure(self, *, diffulex_config: Any | None = None) -> None:
     # Prefer explicit config fields over environment-variable based tuning.
     if diffulex_config is None:
         return
     try:
         bn = int(getattr(diffulex_config, "linear_w8a16_quant_block_n", self._quant_block_n))
         self._quant_block_n = max(1, bn)
-    except Exception:
-        pass
+    except (TypeError, ValueError) as e:
+        logger.debug(f"Failed to parse linear_w8a16_quant_block_n: {e}")
     try:
         thr = int(getattr(diffulex_config, "linear_w8a16_allspark_cublas_m_threshold", self._cublas_m_thr))
         self._cublas_m_thr = max(1, thr)
-    except Exception:
-        pass
+    except (TypeError, ValueError) as e:
+        logger.debug(f"Failed to parse linear_w8a16_allspark_cublas_m_threshold: {e}")

diffulex/strategy/fast_dllm_v2/engine/sequence.py-225-249 (1)
225-249: ⚠️ Potential issue | 🟡 Minor
Unreachable `elif` condition when prefix is block-aligned.
When `pad_prefix_len == 0`, `total_num_blocks == num_prefix_blocks`, making the condition `block_id == num_prefix_blocks` unreachable within the loop range. This causes the `elif is_last_prefix_block:` branch (line 233) to never execute in the aligned case.
While the resulting behavior is correct (all blocks legitimately get `TO_CACHE` status when the prefix has no partial block), the unreachable code path suggests the logic could be clearer. Either refactor the condition to avoid the unreachable branch, or add a comment explaining why the `elif` exists despite being unreachable when aligned.
Note: This same pattern exists in both `diffulex/strategy/block_diffusion/engine/sequence.py` and `diffulex/strategy/fast_dllm_v2/engine/sequence.py`.
| dataset = load_dataset("gsm8k", "main", split=split) | ||
|
|
||
| if limit: | ||
| dataset = dataset[:limit] | ||
|
|
||
| results = [] | ||
| for item in dataset: | ||
| question = item["question"] | ||
| answer = item["answer"] |
Bug: Dataset slicing changes iteration behavior when limit is used.
When limit is provided, dataset[:limit] returns a dict of lists (column-wise), not a sliced Dataset. The subsequent for item in dataset: will then iterate over dictionary keys (e.g., 'question', 'answer') instead of individual records.
Use dataset.select(range(limit)) to maintain Dataset iteration behavior.
🐛 Proposed fix
dataset = load_dataset("gsm8k", "main", split=split)
if limit:
- dataset = dataset[:limit]
+ dataset = dataset.select(range(min(limit, len(dataset))))
results = []
for item in dataset:🤖 Prompt for AI Agents
In `@diffulex_bench/datasets.py` around lines 26 - 34, The bug is that slicing
with dataset[:limit] turns the Dataset into a dict-of-lists so the subsequent
loop over dataset iterates keys; replace that slice with
dataset.select(range(limit)) so iteration yields records. Update the code around
load_dataset(..., split=split) and the conditional that checks limit to use
dataset = dataset.select(range(limit)) (referencing the dataset variable and
load_dataset call) and ensure the rest of the loop (for item in dataset,
accessing item["question"], item["answer"]) continues to work with Dataset
records.
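For illustration, a minimal standalone sketch of the behavioral difference (assuming the HuggingFace `datasets` package; the tiny in-memory dataset below is hypothetical):

```python
from datasets import Dataset

# Tiny stand-in for the GSM8K split used above.
ds = Dataset.from_dict({"question": ["q0", "q1", "q2"], "answer": ["a0", "a1", "a2"]})

sliced = ds[:2]                 # slicing returns a dict of column lists
print(list(sliced))             # ['question', 'answer'] -> the loop would iterate column names

selected = ds.select(range(2))  # .select() keeps a Dataset of row records
print([row["question"] for row in selected])  # ['q0', 'q1']
```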
| dataset = load_dataset("openai/humaneval", split="test") | ||
|
|
||
| if limit: | ||
| dataset = dataset[:limit] | ||
|
|
||
| results = [] | ||
| for item in dataset: |
Same slicing bug as in load_gsm8k.
Apply the same fix here to maintain proper Dataset iteration.
🐛 Proposed fix
dataset = load_dataset("openai/humaneval", split="test")
if limit:
- dataset = dataset[:limit]
+ dataset = dataset.select(range(min(limit, len(dataset))))
results = []
for item in dataset:🤖 Prompt for AI Agents
In `@diffulex_bench/datasets.py` around lines 65 - 71, The code incorrectly slices
the HuggingFace Dataset with dataset[:limit], which can convert it to a list and
break iteration; instead, when limiting the humaneval dataset obtained by
load_dataset("openai/humaneval") assign dataset = dataset.select(range(limit))
(or dataset.select(range(limit)).shuffle(...) if needed) so the result stays a
Dataset object and iteration in the subsequent loop over dataset works
correctly; update the block that checks limit to use
dataset.select(range(limit)) rather than dataset[:limit].
assert q.is_cuda and k.is_cuda and v.is_cuda and k_cache.is_cuda and v_cache.is_cuda
assert q.dtype == torch.bfloat16 and k.dtype == torch.bfloat16 and v.dtype == torch.bfloat16
assert attn_metadata.block_tables is not None and attn_metadata.context_lens is not None and attn_metadata.cu_seqlens_q is not None
assert attn_metadata.kv_cache_layout == "unified", f"only unified layout supported, got {attn_metadata.kv_cache_layout}"
Missing kv_cache_layout attribute in AttnMetaDataBase.
This assertion accesses attn_metadata.kv_cache_layout, but according to the AttnMetaDataBase class definition in diffulex/attention/metadata.py, this attribute does not exist. This will raise an AttributeError at runtime, not the intended AssertionError.
Either add kv_cache_layout: str = "unified" to AttnMetaDataBase, or use getattr with a default:
🐛 Proposed fix
- assert attn_metadata.kv_cache_layout == "unified", f"only unified layout supported, got {attn_metadata.kv_cache_layout}"
+ kv_layout = getattr(attn_metadata, "kv_cache_layout", "unified")
+ assert kv_layout == "unified", f"only unified layout supported, got {kv_layout}"

Or add the attribute to AttnMetaDataBase in diffulex/attention/metadata.py:

kv_cache_layout: str = "unified"

🤖 Prompt for AI Agents
In `@diffulex_kernel/python/paged_attn_decode_triton.py` at line 527, The
assertion in paged_attn_decode_triton.py uses attn_metadata.kv_cache_layout
which doesn't exist on the AttnMetaDataBase class and will raise AttributeError;
fix by adding a default attribute kv_cache_layout: str = "unified" to the
AttnMetaDataBase definition in diffulex/attention/metadata.py (so the assertion
in paged_attn_decode_triton.py continues to work), or alternatively change the
assertion to use getattr(attn_metadata, "kv_cache_layout", "unified") to provide
a default — update either the AttnMetaDataBase class (preferred) or the
assertion accordingly.
def stop(self) -> Optional[Dict[str, Any]]:
    """Stop VizTracer and return trace file path."""
    if self.tracer is None:
        return None

    self.tracer.stop()
    output_file = self.tracer.output_file

    result = {
        "backend": "viztracer",
        "output_file": str(output_file),
    }

    self.tracer = None
    return result
🧩 Analysis chain
🌐 Web query:
VizTracer API save() method after stop() documentation
💡 Result:
VizTracer’s documented manual (inline) workflow is:
from viztracer import VizTracer
tracer = VizTracer()
tracer.start()
# code you want to trace
tracer.stop()
tracer.save()  # or tracer.save("my_trace.json")

save(output_file=None, file_info=None, verbose=None) "parse[s] data and save[s] report"; if output_file is None, it writes to the tracer's default output_file path. stop(stop_option=None) stops tracing (and supports stop_option="flush_as_finish" to log unfinished functions). [1] [2]
🌐 Web query:
VizTracer Python profiler correct usage stop save output file
💡 Result:
Command line (profile a whole script)
viztracer -o trace.json my_script.py arg1 arg2
# or
python -m viztracer -o trace.html my_script.py

- Default output is `result.json` if you don't pass `-o/--output_file`. (viztracer.readthedocs.io)
- If your script also has flags that might conflict, use `--` to separate VizTracer args from your script args: (viztracer.readthedocs.io)

viztracer -o trace.json -- my_script.py -o my_script_output.json

Inline (start/stop only around the code you care about)
Context-manager (auto stop+save on exit):
from viztracer import VizTracer
with VizTracer(output_file="trace.json"):
    do_work()

Manual start/stop + save:
from viztracer import VizTracer
tracer = VizTracer(output_file="trace.json") # optional; can also pass in save()
tracer.start()
do_work()
tracer.stop()
tracer.save()  # or tracer.save("other_name.json")

Common "why is my file empty/missing?" pitfalls
- You must call `save()` after `stop()` to write the output file. (viztracer.readthedocs.io)
- It's recommended that `start()` and `stop()` be in the same frame (same call-stack level), otherwise issues can happen. (pypi.org)
- If your program exits via `os._exit()`, VizTracer can't save; prefer `sys.exit()`/normal shutdown. (viztracer.readthedocs.io)
Citations:
- 1: https://viztracer.readthedocs.io/en/latest/basic_usage.html?utm_source=openai
- 2: https://viztracer.readthedocs.io/en/latest/basic_usage.html?utm_source=openai
- 3: https://viztracer.readthedocs.io/en/latest/basic_usage.html?utm_source=openai
- 4: https://viztracer.readthedocs.io/en/latest/basic_usage.html?utm_source=openai
- 5: https://viztracer.readthedocs.io/en/latest/basic_usage.html?utm_source=openai
- 6: https://pypi.org/project/viztracer/0.1.1/?utm_source=openai
- 7: https://viztracer.readthedocs.io/en/latest/limitations.html?utm_source=openai
🏁 Script executed:
# Search the codebase for VizTracer usage and imports
rg -i "viztracer" --type py -A 5 -B 5Repository: SJTU-DENG-Lab/Diffulex
Length of output: 20992
🏁 Script executed:
# Check if there are any tests or examples of VizTracer usage
fd -e py -e txt -e md | xargs grep -l "viztracer" 2>/dev/null | head -20

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 496
Add missing save() call after stop() to persist trace data.
VizTracer's documented API requires calling save() after stop() to write the trace data to the output file. Without it, the trace file will not be created, rendering the profiler non-functional. This is listed as a common pitfall in the official documentation.
🐛 Proposed fix
def stop(self) -> Optional[Dict[str, Any]]:
"""Stop VizTracer and return trace file path."""
if self.tracer is None:
return None
self.tracer.stop()
+ self.tracer.save()
output_file = self.tracer.output_file
result = {
"backend": "viztracer",
"output_file": str(output_file),
}
self.tracer = None
return result

🤖 Prompt for AI Agents
In `@diffulex_profiler/backends/viztracer.py` around lines 53 - 67, The stop()
method in VizTracer backend currently calls self.tracer.stop() but never calls
the required self.tracer.save(), so the trace file is not written; update stop()
(method stop, referencing self.tracer and output_file) to call
self.tracer.save() immediately after self.tracer.stop() and before reading
self.tracer.output_file, then proceed to build the result dict and set
self.tracer = None so the trace is persisted to disk.
def forward(self, seqs: list[SequenceBase], logits: torch.Tensor, temperatures: torch.Tensor,
            top_p=None, top_k=None, margin_confidence=False, neg_entropy=False, threshold=0.95):
    attn_metadata = self.fetch_attn_metadata()
    split_logits = torch.split(
        logits, [len(seq) for seq in seqs] if attn_metadata.is_prefill
        else [attn_metadata.diffusion_block_size] * len(seqs), dim=0
    )

    accepted_ids_map = {}
    sampled_tokens_map = {}
    true_local_ids_map = {}
    for temperature, seq, seq_logits in zip(temperatures, seqs, split_logits):
        true_local_ids_sub_map = {}
        accepted_ids_sub_map = {}
        sampled_tokens_sub_map = {}

        last_logits = self._fetch_last_logits(seq_logits, seq)

        shifted_logits = self._shift_logits(seq_logits, last_logits)

        for block_id, block in enumerate(seq.diffusion_blocks):
            if not block.is_active or sum(block.local_mask_tokens) == 0:
                continue

            if len(block.global_mask_token_ids) == 0:
                continue

            if attn_metadata.is_prefill:
                mask_token_logits = shifted_logits[block.global_mask_token_ids, ...]
            else:
                mask_token_logits = shifted_logits[block.local_mask_token_ids, ...]

            confidence, sampled_tokens, initial_confidence = self.sample_tokens(
                mask_token_logits,
                temperature,
                top_p=top_p,
                top_k=top_k,
                neg_entropy=(neg_entropy == "neg_entropy"),
                margin_confidence=(margin_confidence == "margin_confidence")
            )
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# 1. First, let's look at the actual file to see parameter defaults
cat -n diffulex/sampler/sdar.py | head -60
# 2. Find all calls to forward() method to see how these flags are passed
rg -n "\.forward\(" diffulex -g'*.py' -A 5
# 3. Find sample_tokens definition to understand its signature
rg -n "def sample_tokens" diffulex -g'*.py' -A 10Repository: SJTU-DENG-Lab/Diffulex
Length of output: 3691
🏁 Script executed:
#!/bin/bash
# Find all calls to the forward method on SDAR sampler
rg -n "forward\(" diffulex -g'*.py' -B 2 -A 5 | grep -A 7 -B 2 "margin_confidence\|neg_entropy"
# Also search for any usage of these parameter names
rg -n "margin_confidence|neg_entropy" diffulex -g'*.py' | head -30Repository: SJTU-DENG-Lab/Diffulex
Length of output: 4622
🏁 Script executed:
#!/bin/bash
# Search for tests or documentation
find diffulex -type f \( -name "*test*.py" -o -name "*.md" \) -exec grep -l "margin_confidence\|neg_entropy" {} \;
# Also check for any actual calls to these forward methods
rg -n "\.forward\(" diffulex -g'*.py' -B 5 | grep -E "(forward|margin_confidence|neg_entropy)" | head -40Repository: SJTU-DENG-Lab/Diffulex
Length of output: 48
Flags margin_confidence/neg_entropy are broken across all samplers.
Parameters are declared as booleans (default False) but compared to strings, so passing True never enables the feature. The sample_tokens() method expects booleans, but these comparisons always evaluate to False. This pattern appears in sdar.py, llada.py, dream.py, and fast_dllm_v2.py.
Change the comparisons to accept both bool and legacy string values, or standardize on one type:
Suggested fix
- confidence, sampled_tokens, initial_confidence = self.sample_tokens(
+ confidence, sampled_tokens, initial_confidence = self.sample_tokens(
mask_token_logits,
temperature,
top_p=top_p,
top_k=top_k,
- neg_entropy=(neg_entropy == "neg_entropy"),
- margin_confidence=(margin_confidence == "margin_confidence")
+ neg_entropy=neg_entropy is True or neg_entropy == "neg_entropy",
+ margin_confidence=margin_confidence is True or margin_confidence == "margin_confidence"
)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested replacement for the block above (identical to the original context except for the two normalized flag comparisons at the end):

def forward(self, seqs: list[SequenceBase], logits: torch.Tensor, temperatures: torch.Tensor,
            top_p=None, top_k=None, margin_confidence=False, neg_entropy=False, threshold=0.95):
    attn_metadata = self.fetch_attn_metadata()
    split_logits = torch.split(
        logits, [len(seq) for seq in seqs] if attn_metadata.is_prefill
        else [attn_metadata.diffusion_block_size] * len(seqs), dim=0
    )
    accepted_ids_map = {}
    sampled_tokens_map = {}
    true_local_ids_map = {}
    for temperature, seq, seq_logits in zip(temperatures, seqs, split_logits):
        true_local_ids_sub_map = {}
        accepted_ids_sub_map = {}
        sampled_tokens_sub_map = {}
        last_logits = self._fetch_last_logits(seq_logits, seq)
        shifted_logits = self._shift_logits(seq_logits, last_logits)
        for block_id, block in enumerate(seq.diffusion_blocks):
            if not block.is_active or sum(block.local_mask_tokens) == 0:
                continue
            if len(block.global_mask_token_ids) == 0:
                continue
            if attn_metadata.is_prefill:
                mask_token_logits = shifted_logits[block.global_mask_token_ids, ...]
            else:
                mask_token_logits = shifted_logits[block.local_mask_token_ids, ...]
            confidence, sampled_tokens, initial_confidence = self.sample_tokens(
                mask_token_logits,
                temperature,
                top_p=top_p,
                top_k=top_k,
                neg_entropy=neg_entropy is True or neg_entropy == "neg_entropy",
                margin_confidence=margin_confidence is True or margin_confidence == "margin_confidence"
            )
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 28-28: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
[warning] 49-49: Unpacked variable confidence is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
🤖 Prompt for AI Agents
In `@diffulex/sampler/sdar.py` around lines 17 - 56, In forward(), the boolean
flags margin_confidence and neg_entropy are incorrectly compared to strings when
passed into sample_tokens (e.g., neg_entropy == "neg_entropy"), so True is never
honored; change the calls to normalize these inputs to booleans (accept both
bool and legacy string values) before passing them to sample_tokens — e.g.,
compute normalized_neg_entropy = bool(neg_entropy) or normalized_neg_entropy =
(neg_entropy is True or neg_entropy == "neg_entropy") and similarly for
margin_confidence, then call sample_tokens(...,
neg_entropy=normalized_neg_entropy,
margin_confidence=normalized_margin_confidence); apply the same normalization
pattern wherever these flags are used (including other files llada.py, dream.py,
fast_dllm_v2.py) so sample_tokens always receives a proper bool.
if seq.diffusion_blocks[-1].is_active:
    slot_mapping.extend([-1] * self.diffusion_block_size)
elif seq.diffusion_blocks[-1].is_to_cache:
    need_kv_cache_store = True
    num_pages_storing = seq.num_page_blocks_in_active_diffusion_block
    total_num_pages = len(seq.block_table)
    for i in range(0, num_pages_storing):
        start = seq.block_table[(total_num_pages - 1) - num_pages_storing + i] * self.block_size
        end = start + self.block_size
        slot_mapping.extend(range(start, end))
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect FDV2 block state invariants and diffusion step transitions.
rg -n "class FDV2Block|is_active|is_to_cache|next_diffusion_step|diffusion_blocks" diffulex/strategy/fast_dllm_v2/engine/sequence.pyRepository: SJTU-DENG-Lab/Diffulex
Length of output: 1534
🏁 Script executed:
#!/bin/bash
sed -n '11,100p' diffulex/strategy/fast_dllm_v2/engine/sequence.py

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 2650
🏁 Script executed:
#!/bin/bash
sed -n '110,145p' diffulex/strategy/fast_dllm_v2/engine/model_runner.py

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 2206
🏁 Script executed:
#!/bin/bash
rg -n "def diffusion_decoding_inputs" diffulex/strategy/fast_dllm_v2/engine/sequence.py -A 20Repository: SJTU-DENG-Lab/Diffulex
Length of output: 1175
🏁 Script executed:
#!/bin/bash
rg -n "def active_block_token_ids|active_block_token_ids" diffulex/strategy/fast_dllm_v2/engine/sequence.py -B 2 -A 5Repository: SJTU-DENG-Lab/Diffulex
Length of output: 749
🏁 Script executed:
#!/bin/bash
sed -n '251,280p' diffulex/strategy/fast_dllm_v2/engine/sequence.py

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 1189
Add handling for IN_CACHE state to prevent slot_mapping desync.
The code handles only is_active and is_to_cache states, but diffusion_blocks[-1] can also be IN_CACHE (which neither condition matches). When this occurs, input_ids is extended by diffusion_decoding_inputs() but slot_mapping is not, causing a length mismatch. Add an else branch to extend slot_mapping with [-1]s for consistency with the active case.
🤖 Prompt for AI Agents
In `@diffulex/strategy/fast_dllm_v2/engine/model_runner.py` around lines 123 -
133, The code in model_runner.py fails to handle the IN_CACHE state for
seq.diffusion_blocks[-1], causing slot_mapping to be shorter than input_ids; in
the block that currently checks seq.diffusion_blocks[-1].is_active and
.is_to_cache, add an else branch that mirrors the active case by extending
slot_mapping with [-1] * self.diffusion_block_size so slot_mapping stays aligned
with the input_ids produced by diffusion_decoding_inputs(); update the branch
containing seq.diffusion_blocks[-1].is_active,
seq.diffusion_blocks[-1].is_to_cache, slot_mapping, and
diffusion_decoding_inputs() accordingly.
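A sketch of the suggested guard, following the snippet above (the IN_CACHE state and the surrounding attribute names come from the review context and are assumed, not verified):

```python
last_block = seq.diffusion_blocks[-1]
if last_block.is_active:
    slot_mapping.extend([-1] * self.diffusion_block_size)
elif last_block.is_to_cache:
    ...  # existing page-slot computation shown above
else:
    # Block is already IN_CACHE: nothing new is stored, but slot_mapping must
    # stay the same length as the inputs from diffusion_decoding_inputs().
    slot_mapping.extend([-1] * self.diffusion_block_size)
```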
# NOTE: fp16/fp32 are currently routed to the BF16 kernels in Diffulex.
# Keeping them registered avoids breaking older configs while we add
# true fp16/fp32 KV-cache kernels in the future.
@register_kv_cache_strategy("bf16", "bfloat16", "fp16", "float16", "fp32", "float32")
def _build_kv_cache_bf16() -> KVCacheBF16Strategy:
    return KVCacheBF16Strategy()
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for code that makes assumptions about fp16/fp32 KV cache behavior
rg -n -C3 'kv_cache_dtype.*fp16|kv_cache_dtype.*fp32|kv_cache_dtype.*float16|kv_cache_dtype.*float32' --type=py

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 3155
🏁 Script executed:
#!/bin/bash
# Search for other code that assumes itemsize or precision based on dtype
rg -n 'itemsize|parse_kv_cache_dtype' --type=py -B2 -A2

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 22244
🏁 Script executed:
#!/bin/bash
# Check BF16Strategy implementation
cat -n diffulex/utils/quantization/strategies/kv_cache_bf16.py

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 3208
🏁 Script executed:
#!/bin/bash
# Check the legacy _get_kv_cache_storage_info function
sed -n '28,50p' diffulex_legacy/engine/model_runner.py

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 971
🏁 Script executed:
#!/bin/bash
# Check parse_kv_cache_dtype to understand the enum values
sed -n '1,100p' diffulex/utils/quantization/kv_cache_dtype.py

Repository: SJTU-DENG-Lab/Diffulex
Length of output: 2911
Real precision mismatch between BF16Strategy aliases and downstream size assumptions.
When kv_cache_dtype="fp32" or "float32" are used, they are routed to BF16Strategy (2 bytes), but downstream code makes incorrect assumptions: diffulex_legacy/engine/model_runner.py:34-44 returns 4 bytes for FP32 despite the strategy using 2 bytes, and examples/test_fp8_kv_cache_comprehensive.py:506 calculates itemsize as 4 bytes for fp32. This causes memory miscalculation. Use strategy's get_storage_dtype() method instead of hardcoded dtype-to-size mapping (as done in diffulex/engine/model_runner.py:197).
🤖 Prompt for AI Agents
In `@diffulex/utils/quantization/strategies/kv_cache_bf16.py` around lines 55 -
60, The BF16 alias registration (register_kv_cache_strategy ->
_build_kv_cache_bf16 returning KVCacheBF16Strategy) causes fp32/fp16 strings to
be treated as 2-byte storage but downstream code still uses hardcoded
dtype-to-size lookups; update callers to ask the strategy for its actual storage
dtype: call the strategy's get_storage_dtype() (e.g., on the KVCacheBF16Strategy
instance) and compute sizes via numpy dtype.itemsize instead of mapping strings
to sizes. Replace any hardcoded branches that assume "fp32" => 4 bytes (such as
code that computes itemsize) with a call to strategy.get_storage_dtype() and
np.dtype(...).itemsize so memory calculations match the registered strategy.
@register_linear_strategy(weight_dtype="bf16", act_dtype="bf16")
def _build_linear_bf16() -> LinearQuantizationStrategy:
    return LinearBF16Strategy()


class LinearBF16Strategy(LinearQuantizationStrategy):
Class referenced before definition.
The factory function _build_linear_bf16() references LinearBF16Strategy on line 11, but the class is defined on line 14. This will raise a NameError at import time.
🐛 Proposed fix: Move class definition before the factory function
 from diffulex.utils.quantization.registry import register_linear_strategy
 from diffulex.utils.quantization.strategy import LinearQuantizationStrategy

+class LinearBF16Strategy(LinearQuantizationStrategy):
+    """Default Linear strategy: no quantization (bf16/bf16)."""
+
+    @property
+    def name(self) -> str:
+        return "linear_bf16"
+
+    def get_storage_dtype(self) -> tuple[torch.dtype, int]:
+        # No special storage; keep as-is.
+        return torch.bfloat16, 2
+
+    def quantize(self, tensor: torch.Tensor, **kwargs):
+        _ = kwargs
+        return tensor, None
+
+    def dequantize(self, quantized: torch.Tensor, scale_or_metadata, **kwargs) -> torch.Tensor:
+        _ = scale_or_metadata, kwargs
+        return quantized
+
+    def get_scale_shape(self, original_shape: tuple[int, ...], **kwargs) -> tuple[int, ...]:
+        _ = original_shape, kwargs
+        return tuple()
+
+
 @register_linear_strategy(weight_dtype="bf16", act_dtype="bf16")
 def _build_linear_bf16() -> LinearQuantizationStrategy:
     return LinearBF16Strategy()
-
-
-class LinearBF16Strategy(LinearQuantizationStrategy):
-    """Default Linear strategy: no quantization (bf16/bf16)."""
-
-    @property
-    def name(self) -> str:
-        return "linear_bf16"
-
-    def get_storage_dtype(self) -> tuple[torch.dtype, int]:
-        # No special storage; keep as-is.
-        return torch.bfloat16, 2
-
-    def quantize(self, tensor: torch.Tensor, **kwargs):
-        _ = kwargs
-        return tensor, None
-
-    def dequantize(self, quantized: torch.Tensor, scale_or_metadata, **kwargs) -> torch.Tensor:
-        _ = scale_or_metadata, kwargs
-        return quantized
-
-    def get_scale_shape(self, original_shape: tuple[int, ...], **kwargs) -> tuple[int, ...]:
-        _ = original_shape, kwargs
-        return tuple()

🤖 Prompt for AI Agents
In `@diffulex/utils/quantization/strategies/linear_bf16.py` around lines 9 - 14,
The factory function _build_linear_bf16() calls LinearBF16Strategy before that
class is defined, causing a NameError at import; move the class
LinearBF16Strategy definition above the `@register_linear_strategy`-decorated
_build_linear_bf16 function (or alternatively inline the class reference by
returning an instance via a lambda that imports/defines the class first) so that
LinearBF16Strategy is defined when _build_linear_bf16() is executed.
…eg entropy
- Added normalization for margin_confidence and neg_entropy parameters in DreamSampler, FastdLLMV2Sampler, LLaDASampler, and SDARSampler classes to improve clarity and consistency in handling these options.
- Introduced a new kv_cache_layout attribute in AttnMetaDataBase for better cache management.
- Updated the model_runner to handle slot mapping when blocks are already in cache.
…nto v0.0.1.0209
Actionable comments posted: 13
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
diffulex_legacy/layers/attention/attention_v4.py (1)
57-173: ⚠️ Potential issue | 🟠 Major
Track and reset FP8 running-max on device changes to prevent torch.maximum failures.
The `device` parameter in both `_update_and_compute_fp8_scales` and `_update_and_compute_q_fp8_scale` is unused. Cached max tensors (`k_max_abs`, `v_max_abs`, `q_max_abs`) are initialized from input tensors at first call, so they reside on the device of those inputs. If the model is moved to a different device (e.g., via `.to(device)`) or in multi-GPU scenarios, subsequent calls pass tensors on a different device, causing `torch.maximum` to fail with a device mismatch error. Reset caches by comparing the provided `device` parameter against a newly tracked `kv_cache_device` attribute.
🛠️ Suggested fix (track device and reset on change)
@@
         self.kv_cache_dtype_cache: str | None = None
+        self.kv_cache_device: torch.device | None = None
@@
-        # Reset running max if dtype changed
-        if self.kv_cache_dtype_cache != kv_cache_dtype:
+        # Reset running max if dtype or device changed
+        if self.kv_cache_device != device or self.kv_cache_dtype_cache != kv_cache_dtype:
             self.k_max_abs = None
             self.v_max_abs = None
             self.q_max_abs = None
             self.kv_cache_dtype_cache = kv_cache_dtype
+            self.kv_cache_device = device
@@
-        # Reset running max if dtype changed
-        if self.kv_cache_dtype_cache != kv_cache_dtype:
+        # Reset running max if dtype or device changed
+        if self.kv_cache_device != device or self.kv_cache_dtype_cache != kv_cache_dtype:
             self.q_max_abs = None
             self.kv_cache_dtype_cache = kv_cache_dtype
+            self.kv_cache_device = device

diffulex/model/llada.py (1)
199-199: ⚠️ Potential issue | 🔴 Critical
Typo: `nn.Moduledict` should be `nn.ModuleDict`.
This will cause an `AttributeError` at runtime since PyTorch's class name uses a capital "D".
🐛 Proposed fix
-        self.transformer = nn.Moduledict(
+        self.transformer = nn.ModuleDict(

diffulex/strategy/d2f/engine/kvcache_manager.py (1)
44-58: ⚠️ Potential issue | 🟠 Major
Avoid hashing the wrong block when allocating multiple KV blocks.
With multi-block allocation, `prev_end_token`/`prev_block_idx` stay constant while `last_block` changes each iteration. If `required` spans multiple new blocks, the hash update can be applied to the wrong (newly allocated) block, corrupting `hash_to_block_id`. Gate hash finalization to the block that actually contains `prev_end_token`.
🧩 Proposed fix
     required = self._required_kv_blocks(seq)
+    prev_end_token = seq.cached_or_caching_num_tokens - seq.caching_num_tokens - 1
+    prev_block_idx = prev_end_token // self.block_size if prev_end_token >= 0 else -1
     # Allocate enough KV blocks to cover all cached_or_caching tokens.
     while len(block_table) < required:
         last_block = self.blocks[block_table[-1]]
         # Preserve the existing "finalize previous block hash" behavior before moving on.
-        if last_block.hash == -1:
-            prev_end_token = seq.cached_or_caching_num_tokens - seq.caching_num_tokens - 1
-            prev_block_idx = prev_end_token // self.block_size
-            if prev_block_idx < seq.num_blocks:
+        if last_block.hash == -1 and (len(block_table) - 1) == prev_block_idx:
+            if 0 <= prev_block_idx < seq.num_blocks:
                 token_ids: list[int] = seq.block(prev_block_idx)
                 prefix = self.blocks[block_table[-2]].hash if len(block_table) > 1 else -1
                 h = self.compute_hash(token_ids, prefix)
                 last_block.update(h, token_ids)
                 self.hash_to_block_id[h] = last_block.block_id

diffulex_legacy/engine/model_runner.py (1)
410-421: ⚠️ Potential issue | 🟡 Minor
Potential `UnboundLocalError` in the fallback loop.
If the initial `get_num_kvcache_blocks` call fails (line 411) and enters the except block, `num_kvcache_blocks` is referenced in the while condition (line 415) before being assigned, causing an `UnboundLocalError`.
🐛 Proposed fix
     try:
         num_kvcache_blocks = get_num_kvcache_blocks(config.gpu_memory_utilization)
         assert num_kvcache_blocks > 0
     except:  # noqa: E722
         gpu_memory_utilization = config.gpu_memory_utilization
+        num_kvcache_blocks = get_num_kvcache_blocks(gpu_memory_utilization)
         while num_kvcache_blocks <= 200:

diffulex/strategy/block_diffusion/engine/model_runner.py (1)
188-268: ⚠️ Potential issue | 🟠 Major
Ensure `reset_warming_up()` runs even if capture fails.
If an exception occurs during capture, the global warming flag can stay enabled and affect subsequent runs. Wrap the body in `try/finally`.
🛠️ Suggested fix
-    set_warming_up(True)
-    config = self.config
-    hf_config = config.hf_config
+    set_warming_up(True)
+    try:
+        config = self.config
+        hf_config = config.hf_config
         ...
-    reset_warming_up()
+    finally:
+        reset_warming_up()
🤖 Fix all issues with AI agents
In `@diffulex_bench/config.py`:
- Around line 67-103: The get_diffulex_kwargs function currently returns a
literal dict immediately, then attempts to mutate an undefined kwargs and add
quantization fields; fix by creating a single kwargs variable (e.g., kwargs = {
... } using the current dict contents from get_diffulex_kwargs), remove the
early return, then conditionally set kv_cache_dtype, decode_mode,
linear_attn_weight_dtype, linear_mlp_weight_dtype, linear_attn_act_dtype, and
linear_mlp_act_dtype onto that kwargs object, and finally return kwargs; update
references in this function to avoid the undefined variable and ensure
quantization options are included.
In `@diffulex_bench/lm_eval_model.py`:
- Around line 223-236: The loop collects per-request gen_args but never applies
them; update the code that calls self.runner.generate to pass per-request
SamplingParams by mapping each req's gen_args into a SamplingParams instance
(merging/overriding defaults from self.sampling_params) and pass a list of
SamplingParams instead of a single self.sampling_params; specifically, keep
building gen_args in the for req in requests loop, convert each gen_args entry
into a SamplingParams (honoring fields like max_gen_toks and until) and call
self.runner.generate(prompts, per_request_sampling_params_list, use_tqdm=not
disable_tqdm) so the runner receives list[SamplingParams] and honors per-request
overrides.
In `@diffulex_bench/runner.py`:
- Around line 19-53: The tokenizer is being loaded with
AutoTokenizer.from_pretrained(..., trust_remote_code=True) inside __init__ which
is unsafe; add a new parameter (e.g., trust_remote_code: bool = False and
optional revision: Optional[str] = None) to the Runner __init__ signature, pass
that parameter to AutoTokenizer.from_pretrained and only set trust_remote_code
when explicitly True, and if a mutable remote execution is required encourage
pinning by forwarding revision to from_pretrained; update the __init__'s
tokenizer_path handling and the call site that constructs DiffulexRunner to
opt-in when needed (also apply same pattern to other modules like
diffulex/config.py and diffulex/engine/llm_engine.py where
AutoTokenizer.from_pretrained or model loading uses trust_remote_code).
In `@diffulex_kernel/python/kv_cache_kernels.py`:
- Around line 919-945: store_kvcache_distinct_layout currently doesn't trim
slot_mapping for partial-prefill cases, causing failures when slot_mapping is
longer than the current token slice; update store_kvcache_distinct_layout to
mirror the unified-layout behavior by slicing/trimming slot_mapping to the
actual token count before calling _store_kvcache_distinct_bf16 or
_store_kvcache_distinct_fp8 (i.e., compute the active length from key/value
tensors or attn_metadata and replace slot_mapping with slot_mapping[:active_len]
when it's longer), and then pass the trimmed slot_mapping into those helper
functions.
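A minimal sketch of the described trimming (whether `slot_mapping` is a tensor or a list, and the helpers' exact signatures, are assumptions):

```python
# Trim slot_mapping to the tokens actually present in this partial-prefill call,
# mirroring the unified-layout path, before dispatching to the bf16/fp8 helpers.
num_tokens = key.shape[0]
if slot_mapping.numel() > num_tokens:
    slot_mapping = slot_mapping[:num_tokens]
```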
In `@diffulex_profiler/__init__.py`:
- Around line 12-17: The unconditional imports of VizTracerBackend and
PyTorchProfilerBackend cause ImportError when optional deps are absent; change
the top-level imports so ProfilerBackend and SimpleTimerBackend are imported
normally, but wrap imports of VizTracerBackend and PyTorchProfilerBackend in
try/except ImportError blocks (or use getattr fallback) and only add those names
to the module exports when successfully imported; also update the module's
__all__ to include the optional backend names conditionally so the package
doesn't fail to import if optional dependencies are missing.
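A sketch of the guarded-import pattern (the submodule paths under `diffulex_profiler.backends` are assumptions; only `backends/viztracer.py` is confirmed elsewhere in this review):

```python
from diffulex_profiler.backends.base import ProfilerBackend              # assumed path
from diffulex_profiler.backends.simple_timer import SimpleTimerBackend   # assumed path

__all__ = ["ProfilerBackend", "SimpleTimerBackend"]

try:
    from diffulex_profiler.backends.viztracer import VizTracerBackend
    __all__.append("VizTracerBackend")
except ImportError:  # viztracer not installed
    VizTracerBackend = None

try:
    from diffulex_profiler.backends.pytorch import PyTorchProfilerBackend  # assumed path
    __all__.append("PyTorchProfilerBackend")
except ImportError:  # torch profiler extras not available
    PyTorchProfilerBackend = None
```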
In `@diffulex_profiler/exporters/summary.py`:
- Around line 57-59: The loop is shadowing the module-level/file-level variable
output_file (set earlier around line 19) by reassigning output_file when
handling viztracer backend data; rename the local variable (e.g.,
viztracer_output_file or viz_output_file) inside the if m.backend_data and
m.backend_data.get("backend") == "viztracer" block and update the summary_lines
append to use that new name so the original output_file used for writing the
.txt summary is not overwritten; locate the handling code using symbols
m.backend_data, summary_lines, and output_file to make the change.
In `@diffulex/engine/model_runner.py`:
- Around line 193-197: The code calls strategy.get_storage_dtype() and later
expects strategy.init_scales(), but NoQuantizationStrategy (returned by
get_kv_cache_strategy fallback) doesn't implement init_scales, causing errors;
modify the fallback so get_kv_cache_strategy() never returns
NoQuantizationStrategy for KV-cache use (e.g., default to a KV-capable strategy
like BF16QuantizationStrategy) or add a guard before calling init_scales() to
skip/handle strategies without that method; update the logic around
get_kv_cache_strategy(), NoQuantizationStrategy, get_storage_dtype, and any
subsequent init_scales() calls (also apply the same change to the similar block
around lines 290-303) so only strategies that implement the KV-cache interface
are used for init_scales().
- Around line 165-171: In warmup_model, ensure reset_warming_up() always runs by
wrapping the work between set_warming_up(True) and reset_warming_up() in a
try/finally: call set_warming_up(True), do torch.cuda.empty_cache(),
torch.cuda.reset_peak_memory_stats() and call self._prefill_warmup() inside the
try block, and call reset_warming_up() in the finally block so that any
exception in _prefill_warmup() still clears the warming flag.
- Around line 151-163: In _prefill_warmup, guard against num_seqs resolving to
0: compute num_seqs from max_num_batched_tokens and max_model_len, and if
num_seqs == 0 log a debug/info message and return early so you don't call
self.run([]) or create zero-length seqs; otherwise continue to build seqs via
AutoSequence.create, call self.run(seqs, True), call seq.post_process() for each
seq, and still call torch.cuda.empty_cache() as before.
- Around line 40-49: Compute device_id before calling dist.init_process_group
and use that computed value for both init and torch.cuda.set_device to avoid
indexing a missing/short config.device_ids list; specifically, in
model_runner.py determine device_id by checking getattr(config, "device_ids",
None) and falling back to (getattr(config, "device_start", 0) or 0) + rank,
validate it against torch.cuda.device_count(), then pass device_id to
dist.init_process_group (instead of indexing config.device_ids again) and call
torch.cuda.set_device(device_id).
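A sketch of the device selection described above (`config`, `rank`, and `world_size` come from the surrounding runner; the `device_id=` keyword on `init_process_group` requires a recent PyTorch):

```python
import torch
import torch.distributed as dist

device_ids = getattr(config, "device_ids", None)
if device_ids and rank < len(device_ids):
    device_id = device_ids[rank]
else:
    device_id = (getattr(config, "device_start", 0) or 0) + rank
assert device_id < torch.cuda.device_count(), f"invalid device_id {device_id}"

dist.init_process_group(
    backend="nccl",
    world_size=world_size,
    rank=rank,
    device_id=torch.device(f"cuda:{device_id}"),
)
torch.cuda.set_device(device_id)
```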
In `@diffulex/sampler/fast_dllm_v2.py`:
- Around line 69-72: Update the two schedulers that still call token.item()
(diffulex/strategy/fast_dllm_v2/engine/scheduler.py and
diffulex/strategy/block_diffusion/engine/scheduler.py): find the comparison
using token.item() == self.eos and replace it with a defensive conversion that
accepts either a Tensor or a Python int (e.g., if isinstance(token,
torch.Tensor): value = int(token.item()) else: value = int(token)) and then
compare value == self.eos; ensure this change is applied wherever sampled tokens
from sampled_tokens_sub_map or accepted_ids_sub_map are checked so list values
(already Python ints) and tensors both work correctly.
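A sketch of the defensive conversion (the EOS handling body is elided; `self.eos` is assumed to be a plain int):

```python
import torch

def _token_value(token) -> int:
    """Accept either a 0-d tensor or a plain Python int."""
    if isinstance(token, torch.Tensor):
        return int(token.item())
    return int(token)

# Inside the scheduler loop:
if _token_value(token) == self.eos:
    ...  # existing end-of-sequence handling
```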
In `@diffulex/strategy/fast_dllm_v2/engine/sequence.py`:
- Around line 121-125: The __init__ for the Sequence class currently uses a
shared mutable SamplingParams() as a default; change the signature to use
sampling_params: SamplingParams | None = None and inside Sequence.__init__
create a new instance when None (e.g., sampling_params = SamplingParams() if
sampling_params is None else sampling_params) before calling
super().__init__(token_ids, sampling_params), ensuring each Sequence gets its
own SamplingParams instance and avoiding shared mutable defaults.
In `@diffulex/utils/quantization/strategies/linear_marlin_int8_w8a16.py`:
- Around line 101-132: get_storage_dtype declares torch.uint8 storage but
quantize()/dequantize() use signed int8; change quantize in function
quantize(...) to produce uint8 by biasing signed int8 values (add 128) and
clamping to [0,255] and return dtype torch.uint8, and change dequantize in
dequantize(...) to accept the uint8 storage, convert back to signed by
subtracting 128 (or cast to int8 after subtract) before multiplying by scales;
ensure scales handling (scales.squeeze/unsqueeze) stays the same and types are
converted to float32 for arithmetic then result cast to bfloat16, so
get_storage_dtype, quantize, and dequantize are consistent.
🟡 Minor comments (24)
diffulex/utils/quantization/quantize_model.py-147-201 (1)
147-201: ⚠️ Potential issue | 🟡 Minor
Remove unused `pack_factor` (ruff F841).
🧹 Suggested fix
-    pack_factor = 32 // bits
     qweight = gptq_pack(w_q, bits, size_k, size_n).contiguous()  # [K/pack, N]

diffulex/utils/quantization/quantize_model.py-707-712 (1)
707-712: ⚠️ Potential issue | 🟡 Minor
Remove unused f-string prefixes (ruff F541).
Lines 707 and 712 contain only literal strings with no variable interpolation, making the `f` prefix unnecessary.
🧹 Suggested fix
-    print(f"\n✓ Quantization complete!")
+    print("\n✓ Quantization complete!")
@@
-    print(f"\n You can now use this directory directly as model path:")
+    print("\n You can now use this directory directly as model path:")

diffulex/layer/linear.py-824-824 (1)
824-824: ⚠️ Potential issue | 🟡 Minor
Remove unused variable `dev_key`.
The variable `dev_key` is assigned but never used. This was also flagged by static analysis (F841).
🧹 Proposed fix
-    dev_key = self._device_index(device)

diffulex/layer/linear.py-1298-1300 (1)
1298-1300: ⚠️ Potential issue | 🟡 Minor
Redundant check: `in_features` is already an `int`.
Line 1298 assigns `in_features = int(self._offline_quant_in_features_py)`, so checking `if in_features is None` on line 1299 will never be true since `int()` never returns `None`.
🐛 Proposed fix
     in_features = int(self._offline_quant_in_features_py)
-    if in_features is None or in_features <= 0:
+    if in_features <= 0:
         raise RuntimeError("GPTQ offline 权重已加载,但无法推断 in_features 以计算 weight_bits。")

examples/test_fp8_kv_cache_distinct.py-72-72 (1)
72-72: ⚠️ Potential issue | 🟡 Minor
Remove extraneous `f` prefix from string without placeholders.
The f-string on this line has no placeholders, making the `f` prefix unnecessary.
Proposed fix
-    print(f"\n总计:")
+    print("\n总计:")

diffulex/strategy/fast_dllm_v2/attention/metadata.py-16-18 (1)
16-18: ⚠️ Potential issue | 🟡 Minor
Potential type issue: `sum()` on a tensor doesn't return a comparable scalar.
If `context_lens` is a `torch.Tensor`, `sum(self.context_lens)` will iterate and sum elements but returns a tensor, not a Python scalar. The comparison `> 0` may not behave as expected for zero-dimensional tensors in some contexts.
Proposed fix
 def __post_init__(self):
-    if self.context_lens is not None and sum(self.context_lens) > 0:
+    if self.context_lens is not None and self.context_lens.sum().item() > 0:
         self.total_lens = self.diffusion_block_size + self.context_lens

diffulex_profiler/README.md-163-169 (1)
163-169: ⚠️ Potential issue | 🟡 Minor
Doc mismatch: use the `tokens=` parameter name.
The API reference later lists `record_throughput(tokens: int, ...)`, but this example uses `total_tokens`. Align the example with the API to avoid confusion.
📝 Proposed fix
- profiler.record_throughput(total_tokens=1000)
+ profiler.record_throughput(tokens=1000)

examples/test_bf16_kernel_e2e.py-70-70 (1)
70-70: ⚠️ Potential issue | 🟡 Minor
Remove redundant f-string prefix.
Line 70 uses an f-string without any placeholders; drop the `f` to satisfy F541.
🔧 Proposed fix
-    print(f"\n总计:")
+    print("\n总计:")

examples/test_fp8_kv_cache_python_dequant.py-72-72 (1)
72-72: ⚠️ Potential issue | 🟡 Minor
Remove extraneous `f` prefix.
This f-string has no placeholders.
Proposed fix
-    print(f"\n总计:")
+    print("\n总计:")

examples/test_fastdllmv2_diffulex_gsm8k.py-69-75 (1)
69-75: ⚠️ Potential issue | 🟡 Minor
Create the profiling output directory before writing.
The nested `log/profiles/...` path will fail if the directory doesn't exist.
💡 Proposed fix
 if PROFILE:
     output_file = "log/profiles/perf_dvllm_dream_7B.json"
+    os.makedirs(os.path.dirname(output_file), exist_ok=True)
     if os.path.exists(output_file):
         os.remove(output_file)

#!/bin/bash
# Sanity check: ensure the profiling output directory exists in the repo (if expected).
if [ ! -d log/profiles ]; then
  echo "log/profiles is missing; profiling output may fail unless created at runtime."
fi

examples/test_fp8_linear.py-115-122 (1)
115-122: ⚠️ Potential issue | 🟡 Minor
Remove unused variables to satisfy Ruff F841.
`M`, `mem_bf16`, and `mem_fp8` are unused and will fail linting.
🛠️ Suggested fix
-    M, K, N = 32, 512, 256
+    K, N = 512, 256
     weight_bf16 = torch.randn(N, K, dtype=torch.bfloat16, device=device)
-    mem_bf16 = torch.cuda.memory_allocated()
@@
     strategy = create_linear_strategy(weight_dtype="fp8_e4m3", act_dtype="bf16")
     weight_fp8, scales = strategy.quantize_weight_for_kernel(weight_bf16, device=device)
-    mem_fp8 = torch.cuda.memory_allocated()

diffulex/sampler/sdar.py-51-58 (1)
51-58: ⚠️ Potential issue | 🟡 Minor
Rename unused `confidence` to `_confidence` to satisfy Ruff.
🛠️ Suggested fix
-    confidence, sampled_tokens, initial_confidence = self.sample_tokens(
+    _confidence, sampled_tokens, initial_confidence = self.sample_tokens(
         mask_token_logits,
         temperature,
         top_p=top_p,
         top_k=top_k,
         neg_entropy=normalized_neg_entropy,
         margin_confidence=normalized_margin_confidence,
     )

diffulex_bench/metrics.py-66-83 (1)
66-83: ⚠️ Potential issue | 🟡 Minor
Align HumanEval stub typing and silence unused-arg warnings.
The function is annotated as `float` but returns `None`, and `results`/`k` are unused. Consider `Optional[float]` and a dummy assignment to avoid lint errors.
🛠️ Suggested fix
-def humaneval_pass_at_k(
-    results: List[Dict[str, Any]],
-    k: int = 1,
-) -> float:
+def humaneval_pass_at_k(
+    results: List[Dict[str, Any]],
+    k: int = 1,
+) -> Optional[float]:
@@
-    # Returns None, actual evaluation requires implementing code execution logic
+    _ = results, k
+    # Returns None, actual evaluation requires implementing code execution logic
     return None

diffulex/logger.py-44-47 (1)
44-47: ⚠️ Potential issue | 🟡 Minor
Restore `record.levelname` after formatting to prevent color codes leaking into file logs.
`LogRecord` is shared across all handlers in Python's logging system. When `ColoredFormatter.format()` mutates `record.levelname` to add ANSI color codes, subsequent handlers (like file handlers) receive the modified value, resulting in color codes appearing in log files.
Wrap the mutation in a try/finally block to restore the original value:
🛠️ Suggested fix
 def format(self, record):
     log_color = self.COLORS.get(record.levelname, '')
-    record.levelname = f"{log_color}{record.levelname}{self.RESET}"
-    return super().format(record)
+    original_levelname = record.levelname
+    try:
+        record.levelname = f"{log_color}{record.levelname}{self.RESET}"
+        return super().format(record)
+    finally:
+        record.levelname = original_levelname

examples/test_fp8_linear.py-135-152 (1)
135-152: ⚠️ Potential issue | 🟡 Minor
Add an early CUDA guard to skip FP8 tests when CUDA is unavailable.
The FP8 quantization tests depend on vLLM's `Fp8LinearOp` and custom CUDA kernels, which will fail when running on CPU-only setups or when FP8 kernels aren't available. This follows the same pattern already established in `test_memory_usage()` (lines 106–108), providing early feedback instead of cryptic runtime errors.
Suggested fix
 def main():
     """Run all end-to-end tests."""
     print("=" * 60)
     print("FP8 Linear Quantization End-to-End Tests")
     print("=" * 60)
     print()
+    if not torch.cuda.is_available():
+        print("CUDA not available; skipping FP8 tests.")
+        return 0
     try:

diffulex_kernel/python/paged_attn_decode_triton.py-72-75 (1)
laccumulator to satisfy Ruff E741 and improve clarity.
lis flagged as ambiguous. Renaming tolse/logsumexpavoids lint failures and makes the intent clearer (apply in all kernels).✏️ Example rename (apply across kernels)
- l = tl.zeros([BLOCK_M], dtype=tl.float32) + lse = tl.zeros([BLOCK_M], dtype=tl.float32) @@ - l_new = l * tl.exp(m - m_new) + tl.sum(p, axis=1) + l_new = lse * tl.exp(m - m_new) + tl.sum(p, axis=1) @@ - l = l_new + lse = l_new @@ - out = acc / l[:, None] + out = acc / lse[:, None]Also applies to: 229-232, 395-397
diffulex/strategy/fast_dllm_v2/engine/scheduler.py-90-122 (1)
90-122:⚠️ Potential issue | 🟡 MinorGuard against mismatched accepted/true ID lengths.
zip()will silently truncate if the lists diverge, which can drop token updates without notice. Add a length check (orstrict=True) before iterating.🔧 Suggested guard
sampled_tokens = sampled_tokens_map.get(block_id, []) true_local_ids = true_ids_map.get(block_id, []) - for true_local_id, accepted_id in zip(true_local_ids, accepted_ids): + if len(true_local_ids) != len(accepted_ids): + raise ValueError( + f"Mismatched lengths for block {block_id}: " + f"{len(true_local_ids)} true_ids vs {len(accepted_ids)} accepted_ids" + ) + for true_local_id, accepted_id in zip(true_local_ids, accepted_ids):diffulex_bench/main.py-19-77 (1)
19-77:⚠️ Potential issue | 🟡 MinorFilter out
Nonevalues inmodel_args.If optional fields are unset,
key=Nonegets passed to lm_eval and can be misparsed.🧹 Suggested fix
- args_list = [f"{k}={v}" for k, v in args_dict.items()] + args_list = [f"{k}={v}" for k, v in args_dict.items() if v is not None]diffulex_bench/config.py-60-65 (1)
60-65:⚠️ Potential issue | 🟡 MinorRename the loop variable to avoid shadowing
field.Ruff flags F811 here; use a different loop variable.
🧹 Suggested cleanup
- return { - field.name: getattr(self, field.name) - for field in self.__dataclass_fields__.values() - } + return { + f.name: getattr(self, f.name) + for f in self.__dataclass_fields__.values() + } ... - return { - field.name: getattr(self, field.name) - for field in self.__dataclass_fields__.values() - } + return { + f.name: getattr(self, f.name) + for f in self.__dataclass_fields__.values() + }Also applies to: 131-136
diffulex/utils/quantization/strategies/__init__.py-10-19 (1)
10-19:⚠️ Potential issue | 🟡 MinorRemove unused
# noqa: F401directives.Ruff flags these as unused since F401 isn’t enabled; dropping them keeps lint clean.
🧹 Suggested cleanup
-from diffulex.utils.quantization.strategies.linear_int8_w8a16 import LinearInt8W8A16Strategy # noqa: F401 -from diffulex.utils.quantization.strategies.linear_int4_w4a16 import LinearInt4W4A16Strategy # noqa: F401 -from diffulex.utils.quantization.strategies.linear_int8_w8a8 import LinearInt8W8A8Strategy # noqa: F401 -from diffulex.utils.quantization.strategies.linear_int4_w4a8 import LinearInt4W4A8Strategy # noqa: F401 -from diffulex.utils.quantization.strategies.linear_fp8_w8a16 import LinearFP8W8A16Strategy # noqa: F401 -from diffulex.utils.quantization.strategies.linear_fp8_w8a8 import LinearFP8W8A8Strategy # noqa: F401 -from diffulex.utils.quantization.strategies.linear_gptq_w4a16 import LinearGPTQW4A16Strategy # noqa: F401 -from diffulex.utils.quantization.strategies.linear_gptq_marlin_w4a16 import LinearGPTQMarlinW4A16Strategy # noqa: F401 -from diffulex.utils.quantization.strategies.linear_awq_w4a16 import LinearAWQW4A16Strategy # noqa: F401 -from diffulex.utils.quantization.strategies.linear_awq_marlin_w4a16 import LinearAWQMarlinW4A16Strategy # noqa: F401 +from diffulex.utils.quantization.strategies.linear_int8_w8a16 import LinearInt8W8A16Strategy +from diffulex.utils.quantization.strategies.linear_int4_w4a16 import LinearInt4W4A16Strategy +from diffulex.utils.quantization.strategies.linear_int8_w8a8 import LinearInt8W8A8Strategy +from diffulex.utils.quantization.strategies.linear_int4_w4a8 import LinearInt4W4A8Strategy +from diffulex.utils.quantization.strategies.linear_fp8_w8a16 import LinearFP8W8A16Strategy +from diffulex.utils.quantization.strategies.linear_fp8_w8a8 import LinearFP8W8A8Strategy +from diffulex.utils.quantization.strategies.linear_gptq_w4a16 import LinearGPTQW4A16Strategy +from diffulex.utils.quantization.strategies.linear_gptq_marlin_w4a16 import LinearGPTQMarlinW4A16Strategy +from diffulex.utils.quantization.strategies.linear_awq_w4a16 import LinearAWQW4A16Strategy +from diffulex.utils.quantization.strategies.linear_awq_marlin_w4a16 import LinearAWQMarlinW4A16Strategydiffulex/utils/quantization/strategies/__init__.py-21-37 (1)
21-37:⚠️ Potential issue | 🟡 MinorSort
__all__to satisfy RUF022.This keeps the public export list stable and lint-clean.
🔤 Suggested ordering
__all__ = [ - 'NoQuantizationStrategy', - 'KVCacheBF16Strategy', - 'KVCacheFP8RunningMaxStrategy', - 'LinearBF16Strategy', - 'LinearStubStrategy', - 'LinearInt8W8A16Strategy', - 'LinearInt4W4A16Strategy', - 'LinearInt8W8A8Strategy', - 'LinearInt4W4A8Strategy', - 'LinearFP8W8A16Strategy', - 'LinearFP8W8A8Strategy', - 'LinearGPTQW4A16Strategy', - 'LinearGPTQMarlinW4A16Strategy', - 'LinearAWQW4A16Strategy', - 'LinearAWQMarlinW4A16Strategy', + 'KVCacheBF16Strategy', + 'KVCacheFP8RunningMaxStrategy', + 'LinearAWQMarlinW4A16Strategy', + 'LinearAWQW4A16Strategy', + 'LinearBF16Strategy', + 'LinearFP8W8A16Strategy', + 'LinearFP8W8A8Strategy', + 'LinearGPTQMarlinW4A16Strategy', + 'LinearGPTQW4A16Strategy', + 'LinearInt4W4A16Strategy', + 'LinearInt4W4A8Strategy', + 'LinearInt8W8A16Strategy', + 'LinearInt8W8A8Strategy', + 'LinearStubStrategy', + 'NoQuantizationStrategy', ]diffulex/utils/quantization/strategies/linear_awq_marlin_w4a16.py-124-131 (1)
124-131: ⚠️ Potential issue | 🟡 Minor

Add an explicit dtype guard before the kernel call.

`linear_act_format` is bf16, but `linear_forward` will pass other dtypes through; a fast-fail (or cast) avoids undefined behavior.

🛡️ Suggested fix
-        dtype_id = 1 if reshaped_x.dtype == torch.bfloat16 else (2 if reshaped_x.dtype == torch.float16 else 0)
+        if reshaped_x.dtype not in (torch.bfloat16, torch.float16):
+            raise RuntimeError("awq_marlin expects bf16/fp16 inputs.")
+        dtype_id = 1 if reshaped_x.dtype == torch.bfloat16 else 2

diffulex_legacy/layers/attention/ops/kv_cache_kernels.py-458-480 (1)
458-480: ⚠️ Potential issue | 🟡 Minor

Remove the redundant `KvCacheDType` import inside `load_kvcache`.

It redefines the already-imported symbol and triggers Ruff F811.

🧹 Suggested fix
-    from diffulex.utils.kv_cache_dtype import KvCacheDType
     if out_dtype == torch.bfloat16:
         out_dtype_enum = int(KvCacheDType.BF16)  # 0

diffulex/strategy/fast_dllm_v2/engine/sequence.py-126-127 (1)
126-127: ⚠️ Potential issue | 🟡 Minor

Fix copy-paste error in the exception message.

The message says "BDSequence" but the class is `FDV2Sequence`, which can mislead callers.

✏️ Proposed fix
-            raise ValueError("BDSequence requires a Config instance.")
+            raise ValueError("FDV2Sequence requires a Config instance.")
    def get_diffulex_kwargs(self) -> Dict[str, Any]:
        """Get arguments to pass to Diffulex engine"""
        return {
            'model_name': self.model_name,
            'decoding_strategy': self.decoding_strategy,
            'mask_token_id': self.mask_token_id,
            'tensor_parallel_size': self.tensor_parallel_size,
            'data_parallel_size': self.data_parallel_size,
            'gpu_memory_utilization': self.gpu_memory_utilization,
            'max_model_len': self.max_model_len,
            'max_num_batched_tokens': self.max_num_batched_tokens,
            'max_num_seqs': self.max_num_seqs,
            'use_lora': self.use_lora,
            'lora_path': self.lora_path if self.use_lora else "",
            'enforce_eager': self.enforce_eager,
            'kv_cache_layout': self.kv_cache_layout,
            'accept_threshold': self.accept_threshold,
            'complete_threshold': self.complete_threshold,
            'add_new_block_threshold': self.add_new_block_threshold,
            'diffusion_block_size': self.diffusion_block_size,
        }

        # Add quantization parameters if specified
        if self.kv_cache_dtype is not None:
            kwargs['kv_cache_dtype'] = self.kv_cache_dtype
        if self.decode_mode is not None:
            kwargs['decode_mode'] = self.decode_mode
        if self.linear_attn_weight_dtype is not None:
            kwargs['linear_attn_weight_dtype'] = self.linear_attn_weight_dtype
        if self.linear_mlp_weight_dtype is not None:
            kwargs['linear_mlp_weight_dtype'] = self.linear_mlp_weight_dtype
        if self.linear_attn_act_dtype is not None:
            kwargs['linear_attn_act_dtype'] = self.linear_attn_act_dtype
        if self.linear_mlp_act_dtype is not None:
            kwargs['linear_mlp_act_dtype'] = self.linear_mlp_act_dtype

        return kwargs
get_diffulex_kwargs returns before adding quantization fields.
The function returns a dict immediately, so the quantization options are never applied and kwargs is undefined.
✅ Suggested fix
- return {
+ kwargs = {
'model_name': self.model_name,
'decoding_strategy': self.decoding_strategy,
'mask_token_id': self.mask_token_id,
'tensor_parallel_size': self.tensor_parallel_size,
'data_parallel_size': self.data_parallel_size,
'gpu_memory_utilization': self.gpu_memory_utilization,
'max_model_len': self.max_model_len,
'max_num_batched_tokens': self.max_num_batched_tokens,
'max_num_seqs': self.max_num_seqs,
'use_lora': self.use_lora,
'lora_path': self.lora_path if self.use_lora else "",
'enforce_eager': self.enforce_eager,
'kv_cache_layout': self.kv_cache_layout,
'accept_threshold': self.accept_threshold,
'complete_threshold': self.complete_threshold,
'add_new_block_threshold': self.add_new_block_threshold,
'diffusion_block_size': self.diffusion_block_size,
}
# Add quantization parameters if specified
if self.kv_cache_dtype is not None:
kwargs['kv_cache_dtype'] = self.kv_cache_dtype
if self.decode_mode is not None:
kwargs['decode_mode'] = self.decode_mode
if self.linear_attn_weight_dtype is not None:
kwargs['linear_attn_weight_dtype'] = self.linear_attn_weight_dtype
if self.linear_mlp_weight_dtype is not None:
kwargs['linear_mlp_weight_dtype'] = self.linear_mlp_weight_dtype
if self.linear_attn_act_dtype is not None:
kwargs['linear_attn_act_dtype'] = self.linear_attn_act_dtype
if self.linear_mlp_act_dtype is not None:
kwargs['linear_mlp_act_dtype'] = self.linear_mlp_act_dtype
- return kwargs
+        return kwargs

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    def get_diffulex_kwargs(self) -> Dict[str, Any]:
        """Get arguments to pass to Diffulex engine"""
        kwargs = {
            'model_name': self.model_name,
            'decoding_strategy': self.decoding_strategy,
            'mask_token_id': self.mask_token_id,
            'tensor_parallel_size': self.tensor_parallel_size,
            'data_parallel_size': self.data_parallel_size,
            'gpu_memory_utilization': self.gpu_memory_utilization,
            'max_model_len': self.max_model_len,
            'max_num_batched_tokens': self.max_num_batched_tokens,
            'max_num_seqs': self.max_num_seqs,
            'use_lora': self.use_lora,
            'lora_path': self.lora_path if self.use_lora else "",
            'enforce_eager': self.enforce_eager,
            'kv_cache_layout': self.kv_cache_layout,
            'accept_threshold': self.accept_threshold,
            'complete_threshold': self.complete_threshold,
            'add_new_block_threshold': self.add_new_block_threshold,
            'diffusion_block_size': self.diffusion_block_size,
        }
        # Add quantization parameters if specified
        if self.kv_cache_dtype is not None:
            kwargs['kv_cache_dtype'] = self.kv_cache_dtype
        if self.decode_mode is not None:
            kwargs['decode_mode'] = self.decode_mode
        if self.linear_attn_weight_dtype is not None:
            kwargs['linear_attn_weight_dtype'] = self.linear_attn_weight_dtype
        if self.linear_mlp_weight_dtype is not None:
            kwargs['linear_mlp_weight_dtype'] = self.linear_mlp_weight_dtype
        if self.linear_attn_act_dtype is not None:
            kwargs['linear_attn_act_dtype'] = self.linear_attn_act_dtype
        if self.linear_mlp_act_dtype is not None:
            kwargs['linear_mlp_act_dtype'] = self.linear_mlp_act_dtype
        return kwargs
🧰 Tools
🪛 Ruff (0.14.14)
[error] 91-91: Undefined name kwargs
(F821)
[error] 93-93: Undefined name kwargs
(F821)
[error] 95-95: Undefined name kwargs
(F821)
[error] 97-97: Undefined name kwargs
(F821)
[error] 99-99: Undefined name kwargs
(F821)
[error] 101-101: Undefined name kwargs
(F821)
[error] 103-103: Undefined name kwargs
(F821)
🤖 Prompt for AI Agents
In `@diffulex_bench/config.py` around lines 67 - 103, The get_diffulex_kwargs
function currently returns a literal dict immediately, then attempts to mutate
an undefined kwargs and add quantization fields; fix by creating a single kwargs
variable (e.g., kwargs = { ... } using the current dict contents from
get_diffulex_kwargs), remove the early return, then conditionally set
kv_cache_dtype, decode_mode, linear_attn_weight_dtype, linear_mlp_weight_dtype,
linear_attn_act_dtype, and linear_mlp_act_dtype onto that kwargs object, and
finally return kwargs; update references in this function to avoid the undefined
variable and ensure quantization options are included.
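For completeness, a quick regression sketch for the fixed behavior (hypothetical test helper; assumes `cfg` is a benchmark `Config` instance with the fix above applied):

```python
def check_quantization_kwargs(cfg) -> None:
    """Optional quantization fields must survive into the engine kwargs."""
    kwargs = cfg.get_diffulex_kwargs()
    assert 'model_name' in kwargs
    if cfg.kv_cache_dtype is not None:
        assert kwargs['kv_cache_dtype'] == cfg.kv_cache_dtype
    if cfg.linear_attn_weight_dtype is not None:
        assert kwargs['linear_attn_weight_dtype'] == cfg.linear_attn_weight_dtype
```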
        for req in requests:
            prompt = req.arguments[0]
            if self.add_bos_token and self.tokenizer.bos_token:
                prompt = self.tokenizer.bos_token + prompt
            prompts.append(prompt)
            gen_args.append(req.arguments[1] if len(req.arguments) > 1 else {})

        # Run generation
        start_time = time.time()
        outputs = self.runner.generate(
            prompts,
            self.sampling_params,
            use_tqdm=not disable_tqdm,
        )
🧩 Analysis chain

Verification steps: located `generate_until` in `diffulex_bench/lm_eval_model.py` and traced how `gen_args` is built from `req.arguments`; inspected `BenchmarkRunner.generate` in `diffulex_bench/runner.py` and the engine-side generate path (`diffulex/engine/tp_worker.py`); checked the `SamplingParams` definition in `diffulex/sampling_params.py` for stop-sequence support; and compared against the reference `generate_until` implementations in `examples/eval_dream.py` and `examples/eval_llada.py` for how `max_gen_toks` and `until` are applied.
Honor per-request generation arguments from lm-eval.
`gen_args` is collected but unused. This ignores per-request generation limits (`max_gen_toks`) and stop conditions (`until`), causing results to drift from harness expectations.

The runner already supports `list[SamplingParams]`, so you can apply per-request overrides:
Suggested approach
# Run generation
start_time = time.time()
+ sampling_params_list = []
+ for g in gen_args:
+ sp = self.sampling_params
+ if "max_gen_toks" in g:
+ sp = copy.deepcopy(self.sampling_params)
+ sp.max_tokens = int(g["max_gen_toks"])
+ if "temperature" in g:
+ if sp is self.sampling_params:
+ sp = copy.deepcopy(self.sampling_params)
+ sp.temperature = float(g["temperature"])
+ sampling_params_list.append(sp)
outputs = self.runner.generate(
prompts,
- self.sampling_params,
+ sampling_params_list,
use_tqdm=not disable_tqdm,
)
+ # Post-process to handle "until" stop sequences (SamplingParams doesn't support them)
+ for i, output in enumerate(outputs):
+ text = output.get('text', '')
+ if "until" in gen_args[i]:
+ for stop_seq in gen_args[i]["until"]:
+ if stop_seq in text:
+ text = text.split(stop_seq)[0]
+ output['text'] = text🤖 Prompt for AI Agents
In `@diffulex_bench/lm_eval_model.py` around lines 223 - 236, The loop collects
per-request gen_args but never applies them; update the code that calls
self.runner.generate to pass per-request SamplingParams by mapping each req's
gen_args into a SamplingParams instance (merging/overriding defaults from
self.sampling_params) and pass a list of SamplingParams instead of a single
self.sampling_params; specifically, keep building gen_args in the for req in
requests loop, convert each gen_args entry into a SamplingParams (honoring
fields like max_gen_toks and until) and call self.runner.generate(prompts,
per_request_sampling_params_list, use_tqdm=not disable_tqdm) so the runner
receives list[SamplingParams] and honors per-request overrides.
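One way to structure the fix is a pair of small helpers (a sketch under the assumption that `SamplingParams` exposes `max_tokens` and `temperature` fields, as in the diff above; the helper names are illustrative):

```python
import copy
from typing import Any

def build_sampling_params(defaults, gen_args_list: list[dict[str, Any]]) -> list:
    """Map lm-eval per-request gen_args onto copies of the default SamplingParams."""
    params = []
    for g in gen_args_list:
        sp = copy.deepcopy(defaults)  # never mutate the shared defaults
        if "max_gen_toks" in g:
            sp.max_tokens = int(g["max_gen_toks"])
        if "temperature" in g:
            sp.temperature = float(g["temperature"])
        params.append(sp)
    return params

def trim_until(text: str, stops: list[str]) -> str:
    """Cut generated text at the earliest stop sequence, mirroring lm-eval's `until` semantics."""
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]
    return text
```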
    def __init__(
        self,
        model_path: str,
        tokenizer_path: Optional[str] = None,
        wait_ready: bool = True,
        **diffulex_kwargs
    ):
        """
        Initialize the benchmark runner

        Args:
            model_path: Path to the model
            tokenizer_path: Path to the tokenizer, if None uses model_path
            wait_ready: Whether to wait for engine to be fully initialized before returning
            **diffulex_kwargs: Additional arguments to pass to Diffulex
        """
        self.model_path = model_path
        self.tokenizer_path = tokenizer_path or model_path
        self.logger = get_logger(__name__)

        # Initialize Diffulex engine
        self.logger.info("Initializing Diffulex engine...")
        self.llm = Diffulex(model_path, **diffulex_kwargs)

        # Wait for engine to be ready if requested
        if wait_ready:
            self._wait_for_ready()

        # Load tokenizer
        self.logger.info("Loading tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.tokenizer_path,
            trust_remote_code=True
        )
        self.logger.success("Tokenizer loaded successfully")
🧩 Analysis chain

Verification steps: confirmed the tokenizer-loading code in `diffulex_bench/runner.py` and searched the codebase for other `trust_remote_code` usages.
🌐 Web query:
Hugging Face Transformers AutoTokenizer trust_remote_code security best practices
💡 Result:
What trust_remote_code means (and why it's risky)

- Setting `trust_remote_code=True` allows Transformers "Auto*" classes (including `AutoTokenizer.from_pretrained(...)`) to download and execute Python code from the model repo (typically via `auto_map` / custom classes). Hugging Face explicitly recommends reviewing the code first and pinning a revision to avoid supply-chain updates. [1][2]

Best practices for using AutoTokenizer securely

1) Prefer `trust_remote_code=False` (default) and models/tokenizers implemented natively
- Only enable `trust_remote_code` when you must load a custom tokenizer implementation. If you can switch to a tokenizer that's already supported in Transformers, do that. [2]

2) If you must enable it, pin to an immutable revision
- Always pass a commit hash (or other immutable ref) via `revision=...` when using `trust_remote_code=True`, so you don't execute newly-pushed code later. [1][2]

3) Manually review the repo code you are about to execute
- Review the custom Python files referenced by the repo (e.g., tokenizer/modeling code) before enabling `trust_remote_code=True`. [1][2]

4) Prefer safetensors and block unsafe weight formats
- Use safe weight formats (`safetensors`) and enforce it where applicable (Transformers recommends `use_safetensors` to avoid pickle-based formats). While tokenizers aren't usually pickle-loaded, projects commonly load both tokenizer + weights together, so apply this policy consistently. [1][3]

5) Keep dependencies patched (Transformers + related loaders)
- Upgrade Transformers promptly; it has had deserialization-related RCE vulnerabilities fixed in newer versions (e.g., CVE-2024-3568 fixed in 4.38.0). [4]

6) Add operational guardrails (recommended in production)
- Allowlist trusted orgs/repos; avoid arbitrary user repos.
- Run model/tokenizer loading in a sandbox (container/VM, low privileges, read-only FS where possible, no secrets in env).
- Use network restrictions (egress allowlist) during loading if feasible.
- Consider policies that disallow `trust_remote_code` unless the publisher is verified/trusted (this is how some curated deployments gate models). [3]

Sources: Hugging Face Transformers security policy and guidance on trust_remote_code + pinning revisions [1][2], Hugging Face deployment security guidance disallowing trust_remote_code unless verified [3], and a representative Transformers deserialization CVE advisory underscoring the need to patch quickly [4].
Make trust_remote_code opt-in with default False in tokenizer loading.
Setting trust_remote_code=True allows arbitrary code execution from model/tokenizer repositories. This should default to False and only be enabled when explicitly requested. If the feature is required, also pin to an immutable revision to prevent execution of newly-pushed code.
🔧 Suggested change
@@
def __init__(
self,
model_path: str,
tokenizer_path: Optional[str] = None,
wait_ready: bool = True,
+ trust_remote_code: bool = False,
**diffulex_kwargs
):
@@
model_path: Path to the model
tokenizer_path: Path to the tokenizer, if None uses model_path
wait_ready: Whether to wait for engine to be fully initialized before returning
+ trust_remote_code: Whether to allow remote code execution when loading the tokenizer
**diffulex_kwargs: Additional arguments to pass to Diffulex
@@
self.tokenizer = AutoTokenizer.from_pretrained(
self.tokenizer_path,
- trust_remote_code=True
+ trust_remote_code=trust_remote_code,
        )

Note: This pattern appears throughout the codebase. Consider applying the same fix to other instances in `diffulex/config.py`, `diffulex/engine/llm_engine.py`, and other modules that load tokenizers or models with `trust_remote_code=True`.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    def __init__(
        self,
        model_path: str,
        tokenizer_path: Optional[str] = None,
        wait_ready: bool = True,
        trust_remote_code: bool = False,
        **diffulex_kwargs
    ):
        """
        Initialize the benchmark runner

        Args:
            model_path: Path to the model
            tokenizer_path: Path to the tokenizer, if None uses model_path
            wait_ready: Whether to wait for engine to be fully initialized before returning
            trust_remote_code: Whether to allow remote code execution when loading the tokenizer
            **diffulex_kwargs: Additional arguments to pass to Diffulex
        """
        self.model_path = model_path
        self.tokenizer_path = tokenizer_path or model_path
        self.logger = get_logger(__name__)

        # Initialize Diffulex engine
        self.logger.info("Initializing Diffulex engine...")
        self.llm = Diffulex(model_path, **diffulex_kwargs)

        # Wait for engine to be ready if requested
        if wait_ready:
            self._wait_for_ready()

        # Load tokenizer
        self.logger.info("Loading tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.tokenizer_path,
            trust_remote_code=trust_remote_code,
        )
        self.logger.success("Tokenizer loaded successfully")
🤖 Prompt for AI Agents
In `@diffulex_bench/runner.py` around lines 19 - 53, The tokenizer is being loaded
with AutoTokenizer.from_pretrained(..., trust_remote_code=True) inside __init__
which is unsafe; add a new parameter (e.g., trust_remote_code: bool = False and
optional revision: Optional[str] = None) to the Runner __init__ signature, pass
that parameter to AutoTokenizer.from_pretrained and only set trust_remote_code
when explicitly True, and if a mutable remote execution is required encourage
pinning by forwarding revision to from_pretrained; update the __init__'s
tokenizer_path handling and the call site that constructs DiffulexRunner to
opt-in when needed (also apply same pattern to other modules like
diffulex/config.py and diffulex/engine/llm_engine.py where
AutoTokenizer.from_pretrained or model loading uses trust_remote_code).
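As a usage sketch for callers that genuinely need custom tokenizer code (the repo id and commit hash below are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "some-org/custom-tokenizer",  # placeholder repo id
    trust_remote_code=True,       # opt in only after reviewing the repo's custom code
    revision="0123456789abcdef0123456789abcdef01234567",  # pin an immutable commit
)
```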
def store_kvcache_distinct_layout(key: torch.Tensor, value: torch.Tensor,
                                  k_cache: torch.Tensor, v_cache: torch.Tensor,
                                  slot_mapping: torch.Tensor, attn_metadata: AttnMetaDataBase) -> None:
    """
    Store KV cache (distinct layout).
    Dynamically selects the appropriate kernel based on quantization strategy from context.
    """
    from diffulex.utils.quantization.context import get_kv_cache_strategy
    strategy = get_kv_cache_strategy()
    if strategy is None:
        _store_kvcache_distinct_bf16(key, value, k_cache, v_cache, slot_mapping)
        return

    fmt = getattr(strategy, "kv_cache_format", "bf16")
    if fmt == "bf16":
        _store_kvcache_distinct_bf16(key, value, k_cache, v_cache, slot_mapping)
        return
    if fmt == "fp8":
        if attn_metadata.k_scale is None or attn_metadata.v_scale is None:
            raise ValueError("FP8 quantization requires k_scale and v_scale in metadata")
        _store_kvcache_distinct_fp8(
            key, value, k_cache, v_cache, slot_mapping,
            attn_metadata.k_scale, attn_metadata.v_scale,
            strategy=strategy,
        )
        return
    raise ValueError(f"Unsupported kv_cache_format={fmt!r} for distinct layout (strategy={type(strategy)})")
Mirror the unified-layout slot_mapping alignment in distinct layout.
Unified layout trims slot_mapping when it’s longer than the current token slice, but distinct layout doesn’t—this can fail partial-prefill flows with longer mappings.
🛠️ Suggested fix (align slot_mapping length)
def store_kvcache_distinct_layout(key: torch.Tensor, value: torch.Tensor,
k_cache: torch.Tensor, v_cache: torch.Tensor,
slot_mapping: torch.Tensor, attn_metadata: AttnMetaDataBase) -> None:
"""
Store KV cache (distinct layout).
Dynamically selects the appropriate kernel based on quantization strategy from context.
"""
+ N = int(key.shape[0])
+ if int(slot_mapping.numel()) != N:
+ if int(slot_mapping.numel()) > N:
+ slot_mapping = slot_mapping[-N:]
+ else:
+ raise AssertionError(
+ f"slot_mapping is shorter than key/value tokens: "
+ f"N={N}, slot_mapping.numel()={int(slot_mapping.numel())}"
+ )
from diffulex.utils.quantization.context import get_kv_cache_strategy
    strategy = get_kv_cache_strategy()

🧰 Tools
🪛 Ruff (0.14.14)
[warning] 938-938: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 945-945: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@diffulex_kernel/python/kv_cache_kernels.py` around lines 919 - 945,
store_kvcache_distinct_layout currently doesn't trim slot_mapping for
partial-prefill cases, causing failures when slot_mapping is longer than the
current token slice; update store_kvcache_distinct_layout to mirror the
unified-layout behavior by slicing/trimming slot_mapping to the actual token
count before calling _store_kvcache_distinct_bf16 or _store_kvcache_distinct_fp8
(i.e., compute the active length from key/value tensors or attn_metadata and
replace slot_mapping with slot_mapping[:active_len] when it's longer), and then
pass the trimmed slot_mapping into those helper functions.
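The trimming logic could also live in a small helper shared by the unified and distinct layouts (a sketch; the helper name is illustrative):

```python
import torch

def align_slot_mapping(slot_mapping: torch.Tensor, num_tokens: int) -> torch.Tensor:
    """Keep only the trailing entries that correspond to the tokens being stored."""
    n = int(slot_mapping.numel())
    if n == num_tokens:
        return slot_mapping
    if n > num_tokens:
        return slot_mapping[-num_tokens:]
    raise AssertionError(
        f"slot_mapping is shorter than key/value tokens: N={num_tokens}, slot_mapping.numel()={n}"
    )
```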
diffulex_profiler/__init__.py
Outdated
from diffulex_profiler.backends import (
    ProfilerBackend,
    SimpleTimerBackend,
    VizTracerBackend,
    PyTorchProfilerBackend,
)
Unconditional import of optional backends will cause ImportError.
VizTracerBackend and PyTorchProfilerBackend are conditionally exported in diffulex_profiler/backends/__init__.py (only if their dependencies are available). However, this file imports them unconditionally, which will raise ImportError when the optional dependencies are missing.
Proposed fix: use conditional imports here as well
from diffulex_profiler.backends import (
ProfilerBackend,
SimpleTimerBackend,
- VizTracerBackend,
- PyTorchProfilerBackend,
)
+
+# Optional backends - only import if available
+try:
+ from diffulex_profiler.backends import VizTracerBackend
+except ImportError:
+ VizTracerBackend = None # type: ignore[misc,assignment]
+
+try:
+ from diffulex_profiler.backends import PyTorchProfilerBackend
+except ImportError:
+ PyTorchProfilerBackend = None # type: ignore[misc,assignment]And update __all__ accordingly:
__all__ = [
"DiffulexProfiler",
"ProfilerConfig",
"PerformanceMetrics",
"collect_gpu_metrics",
"collect_cpu_metrics",
"collect_memory_metrics",
"ProfilerBackend",
"SimpleTimerBackend",
- "VizTracerBackend",
- "PyTorchProfilerBackend",
"ProfilerExporter",
"JSONExporter",
"CSVExporter",
"SummaryExporter",
]
+
+if VizTracerBackend is not None:
+ __all__.append("VizTracerBackend")
+if PyTorchProfilerBackend is not None:
+ __all__.append("PyTorchProfilerBackend")🤖 Prompt for AI Agents
In `@diffulex_profiler/__init__.py` around lines 12 - 17, The unconditional
imports of VizTracerBackend and PyTorchProfilerBackend cause ImportError when
optional deps are absent; change the top-level imports so ProfilerBackend and
SimpleTimerBackend are imported normally, but wrap imports of VizTracerBackend
and PyTorchProfilerBackend in try/except ImportError blocks (or use getattr
fallback) and only add those names to the module exports when successfully
imported; also update the module's __all__ to include the optional backend names
conditionally so the package doesn't fail to import if optional dependencies are
missing.
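Call sites can then resolve a backend defensively instead of importing it eagerly (a sketch, assuming the conditional exports above; `pick_backend` is illustrative):

```python
import diffulex_profiler as dp

def pick_backend(prefer_viztracer: bool = True):
    """Prefer VizTracer when its optional dependency is installed, else fall back to the timer backend."""
    viz = getattr(dp, "VizTracerBackend", None)
    if prefer_viztracer and viz is not None:
        return viz
    return dp.SimpleTimerBackend
```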
diffulex/engine/model_runner.py
Outdated
    def warmup_model(self):
        """Model-specific warmup logic."""
        pass
        logger.info("Warming up model...")
        set_warming_up(True)
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        self._prefill_warmup()
        reset_warming_up()
Ensure reset_warming_up() always runs.
If _prefill_warmup() raises, the warming flag stays set. Wrap with try/finally.
✅ Suggested fix
logger.info("Warming up model...")
set_warming_up(True)
- torch.cuda.empty_cache()
- torch.cuda.reset_peak_memory_stats()
- self._prefill_warmup()
- reset_warming_up()
+ try:
+ torch.cuda.empty_cache()
+ torch.cuda.reset_peak_memory_stats()
+ self._prefill_warmup()
+ finally:
+        reset_warming_up()

🤖 Prompt for AI Agents
In `@diffulex/engine/model_runner.py` around lines 165 - 171, In warmup_model,
ensure reset_warming_up() always runs by wrapping the work between
set_warming_up(True) and reset_warming_up() in a try/finally: call
set_warming_up(True), do torch.cuda.empty_cache(),
torch.cuda.reset_peak_memory_stats() and call self._prefill_warmup() inside the
try block, and call reset_warming_up() in the finally block so that any
exception in _prefill_warmup() still clears the warming flag.
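If this pattern shows up in more places, a small context manager keeps the flag handling in one spot (a sketch reusing `set_warming_up`/`reset_warming_up` as called above):

```python
from contextlib import contextmanager

@contextmanager
def warming_up():
    """Set the warm-up flag for the duration of the block, even if it raises."""
    set_warming_up(True)
    try:
        yield
    finally:
        reset_warming_up()

# Usage inside warmup_model():
#     with warming_up():
#         torch.cuda.empty_cache()
#         torch.cuda.reset_peak_memory_stats()
#         self._prefill_warmup()
```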
diffulex/engine/model_runner.py
Outdated
        # Get storage dtype and itemsize from quantization strategy
        strategy = get_kv_cache_strategy()
        if strategy is None:
            strategy = NoQuantizationStrategy()
        storage_dtype, itemsize = strategy.get_storage_dtype()
Fallback strategy lacks init_scales.
NoQuantizationStrategy doesn’t implement the KV-cache interface, so init_scales() will raise when no KV-cache strategy is configured. Default to a KV-cache strategy (e.g., BF16) or guard init_scales.
🔧 Suggested fix (KV-cache default)
-from diffulex.utils.quantization.strategies import NoQuantizationStrategy
+from diffulex.utils.quantization.strategies import KVCacheBF16Strategy
...
strategy = get_kv_cache_strategy()
if strategy is None:
- strategy = NoQuantizationStrategy()
+ strategy = KVCacheBF16Strategy()
...
- k_scale_init, v_scale_init = strategy.init_scales(num_kv_heads, device)
+    k_scale_init, v_scale_init = strategy.init_scales(num_kv_heads, device)

Also applies to: 290-303
🤖 Prompt for AI Agents
In `@diffulex/engine/model_runner.py` around lines 193 - 197, The code calls
strategy.get_storage_dtype() and later expects strategy.init_scales(), but
NoQuantizationStrategy (returned by get_kv_cache_strategy fallback) doesn't
implement init_scales, causing errors; modify the fallback so
get_kv_cache_strategy() never returns NoQuantizationStrategy for KV-cache use
(e.g., default to a KV-capable strategy like BF16QuantizationStrategy) or add a
guard before calling init_scales() to skip/handle strategies without that
method; update the logic around get_kv_cache_strategy(), NoQuantizationStrategy,
get_storage_dtype, and any subsequent init_scales() calls (also apply the same
change to the similar block around lines 290-303) so only strategies that
implement the KV-cache interface are used for init_scales().
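Alternatively, a small guard keeps the allocation path working for strategies that don't expose per-head scales (a sketch; the function name is illustrative):

```python
def resolve_kv_cache_scales(strategy, num_kv_heads: int, device):
    """Return (k_scale, v_scale) when the strategy supports them, else (None, None)."""
    if strategy is not None and hasattr(strategy, "init_scales"):
        return strategy.init_scales(num_kv_heads, device)
    return None, None
```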
            accepted_ids_list = accepted_ids.to(device="cpu").tolist()
            true_local_ids_sub_map[str(block_id)] = [block.local_mask_token_ids[i] for i in accepted_ids_list]
            accepted_ids_sub_map[str(block_id)] = accepted_ids_list
            sampled_tokens_sub_map[str(block_id)] = sampled_tokens.to(device="cpu").tolist()
🧩 Analysis chain

Verification steps: traced the `*_sub_map` structures built in `diffulex/sampler/fast_dllm_v2.py` and their type annotations in `diffulex/sampler/base.py`, then inspected the consumers in the `fast_dllm_v2`, `d2f`, and `block_diffusion` schedulers for tensor-only operations such as `token.item()`.
Fix incompatible consumer code in schedulers that still expect tensor methods.
The maps now store Python lists, but two downstream schedulers still call .item() on tokens:
- `diffulex/strategy/fast_dllm_v2/engine/scheduler.py` (line ~113): `token.item() == self.eos` will fail with AttributeError
- `diffulex/strategy/block_diffusion/engine/scheduler.py` (line ~113): `token.item() == self.eos` will fail with AttributeError
The diffulex/strategy/d2f/engine/scheduler.py has already been correctly updated with defensive handling (isinstance check and int() conversion).
Update the two broken schedulers to match the d2f implementation: add defensive checks for tensor input, use int() conversion for list values, and replace .item() comparison with direct integer comparison.
🤖 Prompt for AI Agents
In `@diffulex/sampler/fast_dllm_v2.py` around lines 69 - 72, Update the two
schedulers that still call token.item()
(diffulex/strategy/fast_dllm_v2/engine/scheduler.py and
diffulex/strategy/block_diffusion/engine/scheduler.py): find the comparison
using token.item() == self.eos and replace it with a defensive conversion that
accepts either a Tensor or a Python int (e.g., if isinstance(token,
torch.Tensor): value = int(token.item()) else: value = int(token)) and then
compare value == self.eos; ensure this change is applied wherever sampled tokens
from sampled_tokens_sub_map or accepted_ids_sub_map are checked so list values
(already Python ints) and tensors both work correctly.
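A defensive helper in the spirit of the d2f scheduler's handling (a sketch; the name is illustrative):

```python
import torch

def token_to_int(token) -> int:
    """Accept either a 0-d tensor or a plain Python int and return an int."""
    if isinstance(token, torch.Tensor):
        return int(token.item())
    return int(token)

# In the schedulers, the EOS check then becomes:
#     if token_to_int(token) == self.eos: ...
```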
        token_ids: list[int],
        sampling_params: SamplingParams = SamplingParams(),
        config: Config | None = None,
    ):
        super().__init__(token_ids, sampling_params)
🧩 Analysis chain

Verification steps: inspected the constructor in `diffulex/strategy/fast_dllm_v2/engine/sequence.py` (around lines 110-130) and the `SamplingParams` definition in `diffulex/sampling_params.py` to confirm it is a mutable (non-frozen) dataclass.
Avoid instantiating SamplingParams in a default argument.
Default-argument instantiation happens at import time and creates a shared mutable instance across all function calls. Since SamplingParams is a non-frozen dataclass, modifications to this instance (whether in super().__init__() or elsewhere) will affect all subsequent sequences that don't explicitly pass sampling_params. Use None as the default and initialize inside __init__.
🔧 Proposed fix
- sampling_params: SamplingParams = SamplingParams(),
+ sampling_params: SamplingParams | None = None,
@@
- super().__init__(token_ids, sampling_params)
+ if sampling_params is None:
+ sampling_params = SamplingParams()
+        super().__init__(token_ids, sampling_params)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
        token_ids: list[int],
        sampling_params: SamplingParams | None = None,
        config: Config | None = None,
    ):
        if sampling_params is None:
            sampling_params = SamplingParams()
        super().__init__(token_ids, sampling_params)
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 122-122: Do not perform function call SamplingParams in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
🤖 Prompt for AI Agents
In `@diffulex/strategy/fast_dllm_v2/engine/sequence.py` around lines 121 - 125,
The __init__ for the Sequence class currently uses a shared mutable
SamplingParams() as a default; change the signature to use sampling_params:
SamplingParams | None = None and inside Sequence.__init__ create a new instance
when None (e.g., sampling_params = SamplingParams() if sampling_params is None
else sampling_params) before calling super().__init__(token_ids,
sampling_params), ensuring each Sequence gets its own SamplingParams instance
and avoiding shared mutable defaults.
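A standalone illustration of why the shared default matters (not project code):

```python
from dataclasses import dataclass

@dataclass
class Params:
    temperature: float = 1.0

def make_seq(params: Params = Params()):  # default evaluated once, at import time
    return params

a = make_seq()
b = make_seq()
a.temperature = 0.2
print(b.temperature)  # 0.2 -- both calls received the same shared Params instance
```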
    def get_storage_dtype(self) -> tuple[torch.dtype, int]:
        # We store qweight as uint8 (bias128 representation).
        return torch.uint8, 1

    # ---- Required abstract methods (for registry/factory instantiation) ----
    def quantize(self, tensor: torch.Tensor, **kwargs: Any) -> tuple[torch.Tensor, Any]:
        """Reference per-output-channel symmetric int8 quantization.

        Returns:
            quantized_int8: [N,K] int8
            scales: [N] bf16
        """
        _ = kwargs
        if tensor.dim() != 2:
            raise ValueError(f"Expected 2D weight [N,K], got shape={tuple(tensor.shape)}")
        if tensor.dtype != torch.bfloat16:
            tensor = tensor.to(dtype=torch.bfloat16)
        abs_max = torch.abs(tensor).max(dim=-1, keepdim=True)[0]  # [N,1]
        scales = (abs_max.clamp(min=1e-8) / 127.0).to(dtype=torch.bfloat16)  # [N,1]
        q = torch.round(tensor.to(torch.float32) / scales.to(torch.float32)).clamp(-128, 127).to(torch.int8)
        return q, scales.squeeze(-1)

    def dequantize(self, quantized: torch.Tensor, scale_or_metadata: Any, **kwargs: Any) -> torch.Tensor:
        """Reference dequantization back to bf16."""
        _ = kwargs
        scales = scale_or_metadata.get("scales") if isinstance(scale_or_metadata, dict) else scale_or_metadata
        if scales is None:
            raise ValueError("scales required for dequantization")
        if scales.dim() == 1:
            scales = scales.unsqueeze(-1)
        return (quantized.to(torch.float32) * scales.to(torch.float32)).to(torch.bfloat16)
Align quantize()/dequantize() with declared uint8 storage.
get_storage_dtype() advertises uint8, but quantize() returns int8 and dequantize() assumes signed values. This mismatch can break storage buffers created from the strategy metadata.
🔧 Suggested alignment (uint8 storage + bias128)
- q = torch.round(tensor.to(torch.float32) / scales.to(torch.float32)).clamp(-128, 127).to(torch.int8)
- return q, scales.squeeze(-1)
+ q_i16 = torch.round(tensor.to(torch.float32) / scales.to(torch.float32)).clamp(-128, 127).to(torch.int16)
+ q_u8 = (q_i16 + 128).to(torch.uint8)
+        return q_u8, scales.squeeze(-1)

-        return (quantized.to(torch.float32) * scales.to(torch.float32)).to(torch.bfloat16)
+ if quantized.dtype == torch.uint8:
+ q = quantized.to(torch.int16) - 128
+ else:
+ q = quantized.to(torch.int16)
+        return (q.to(torch.float32) * scales.to(torch.float32)).to(torch.bfloat16)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    def get_storage_dtype(self) -> tuple[torch.dtype, int]:
        # We store qweight as uint8 (bias128 representation).
        return torch.uint8, 1

    # ---- Required abstract methods (for registry/factory instantiation) ----
    def quantize(self, tensor: torch.Tensor, **kwargs: Any) -> tuple[torch.Tensor, Any]:
        """Reference per-output-channel symmetric int8 quantization.

        Returns:
            quantized_int8: [N,K] int8
            scales: [N] bf16
        """
        _ = kwargs
        if tensor.dim() != 2:
            raise ValueError(f"Expected 2D weight [N,K], got shape={tuple(tensor.shape)}")
        if tensor.dtype != torch.bfloat16:
            tensor = tensor.to(dtype=torch.bfloat16)
        abs_max = torch.abs(tensor).max(dim=-1, keepdim=True)[0]  # [N,1]
        scales = (abs_max.clamp(min=1e-8) / 127.0).to(dtype=torch.bfloat16)  # [N,1]
        q_i16 = torch.round(tensor.to(torch.float32) / scales.to(torch.float32)).clamp(-128, 127).to(torch.int16)
        q_u8 = (q_i16 + 128).to(torch.uint8)
        return q_u8, scales.squeeze(-1)

    def dequantize(self, quantized: torch.Tensor, scale_or_metadata: Any, **kwargs: Any) -> torch.Tensor:
        """Reference dequantization back to bf16."""
        _ = kwargs
        scales = scale_or_metadata.get("scales") if isinstance(scale_or_metadata, dict) else scale_or_metadata
        if scales is None:
            raise ValueError("scales required for dequantization")
        if scales.dim() == 1:
            scales = scales.unsqueeze(-1)
        if quantized.dtype == torch.uint8:
            q = quantized.to(torch.int16) - 128
        else:
            q = quantized.to(torch.int16)
        return (q.to(torch.float32) * scales.to(torch.float32)).to(torch.bfloat16)
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 115-115: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 128-128: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@diffulex/utils/quantization/strategies/linear_marlin_int8_w8a16.py` around
lines 101 - 132, get_storage_dtype declares torch.uint8 storage but
quantize()/dequantize() use signed int8; change quantize in function
quantize(...) to produce uint8 by biasing signed int8 values (add 128) and
clamping to [0,255] and return dtype torch.uint8, and change dequantize in
dequantize(...) to accept the uint8 storage, convert back to signed by
subtracting 128 (or cast to int8 after subtract) before multiplying by scales;
ensure scales handling (scales.squeeze/unsqueeze) stays the same and types are
converted to float32 for arithmetic then result cast to bfloat16, so
get_storage_dtype, quantize, and dequantize are consistent.
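A quick roundtrip check for the bias128 scheme (standalone sketch that re-implements the reference math inline, so it does not depend on the strategy class):

```python
import torch

def quantize_bias128(w: torch.Tensor):
    """Per-output-channel absmax int8 quantization, stored as uint8 with a +128 bias."""
    abs_max = w.abs().max(dim=-1, keepdim=True)[0]
    scales = (abs_max.clamp(min=1e-8) / 127.0).to(torch.float32)
    q = torch.round(w.to(torch.float32) / scales).clamp(-128, 127).to(torch.int16)
    return (q + 128).to(torch.uint8), scales.squeeze(-1)

def dequantize_bias128(q_u8: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    q = q_u8.to(torch.int16) - 128
    return (q.to(torch.float32) * scales.unsqueeze(-1)).to(torch.bfloat16)

w = torch.randn(8, 64).to(torch.bfloat16)
q_u8, scales = quantize_bias128(w)
w_hat = dequantize_bias128(q_u8, scales)
assert q_u8.dtype == torch.uint8
err = (w.to(torch.float32) - w_hat.to(torch.float32)).abs()
assert float(err.max()) <= 2.0 * float(scales.max())  # error stays within a couple of quantization steps
```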
…and revision support

- Added trust_remote_code and revision attributes to Config class for improved model and tokenizer loading flexibility.
- Updated model_runner and tp_worker to utilize new configuration options when loading models and tokenizers.
- Enhanced quantization strategies to handle initialization and storage more robustly.
- Improved error handling and logging for model warmup and KV cache allocation processes.
…ed logits

- Enhanced the _fetch_last_logits method to include error handling for empty logits and out-of-bounds indices.
- Introduced a new _gather_shifted_logits_rows method to efficiently gather shifted logits without materializing the full tensor.
- Updated DreamSampler and FastdLLMV2Sampler classes to utilize the new gathering method for improved performance and memory management.
- Ensured compatibility with cached-prefill scenarios by using query-length splits for logits.
Summary by CodeRabbit
Release Notes
New Features
- `diffulex_bench` for model evaluation and metrics
- `diffulex_profiler` with multiple backend support (VizTracer, PyTorch Profiler)

Refactor