
Feature/kimi milm models #44

Open
foundeachang-bit wants to merge 13 commits into main from feature/kimi-milm-models

Conversation

@foundeachang-bit
Collaborator

@foundeachang-bit foundeachang-bit commented Mar 25, 2026

add kimi model

Summary by CodeRabbit

  • New Features

    • Added single-layer decode inference examples for Qwen3 (multiple tiled variants), Kimi K2, and Xiaomi MiLM
    • Added a batched hash-lookup example with a full multi-stage compilation/optimization pipeline
  • Documentation

    • Added implementation report, test report, and README for model examples
    • Added TILELET tiling reference and usage guides for Qwen3

hengliao1972 and others added 11 commits March 11, 2026 12:48
Introduce a tile-parallel batch hash lookup example with outer probe rounds and generated pass dumps to analyze orchestration/incore lowering behavior.

Made-with: Cursor
- Refactor probe loop to use pl.break_() when round_has_active == 0
- Update pass dumps after run

Made-with: Cursor
Build output for the qwen3_decode_expand_mixed_kernel example showing
each compiler pass stage from frontend through memory allocation.

Made-with: Cursor
BATCH_TILE=4, K_CHUNK=128 simultaneously achieves:
- All 9 matmul weight tiles at 16 KB (100% cube utilisation)
- RMSNorm, down proj, residual vector tiles at 2 KB (100% TILELET)
- Attention vectors at 2 KB via Q_HEAD_BATCH=4 and batched K RoPE
- On-chip Vec buffer at 99.0% (245.6 KB / 248 KB)

Includes qwen3_tilelet.md documenting all tile sizes and constraints.

Made-with: Cursor
- Add Kimi K2 (Moonshot AI) decode implementation with MoE architecture
  * 8 experts + 1 shared expert
  * Sliding window attention (128K+ context support)
  * GQA (32Q/8KV heads)
  * Flash Decoding with online softmax

- Add Xiaomi MiLM decode implementation
  * Llama-style Transformer with SwiGLU
  * GQA (Grouped Query Attention)
  * Optimized for edge deployment
  * BF16 precision

- Add documentation
  * models/README.md: Model comparison and usage guide
  * IMPLEMENTATION_REPORT.md: Implementation details
  * TEST_REPORT.md: Verification results

Both implementations follow the existing Qwen3 pattern and use
PyPTO core operations (matmul, slice, assemble, auto_incore, etc.)

Signed-off-by: OpenClaw Assistant
@coderabbitai

coderabbitai bot commented Mar 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds two major example pipelines (batched hash-table probe and Qwen3 single-layer decode) with staged compiler pass dumps, plus new model decode implementations (Kimi K2, Xiaomi MiLM), TILELET reference material, and supporting docs and test/report files. Multiple IR-lowering stages and memory/memory-space annotations are included.

Changes

Cohort / File(s) | Summary
Batch Hash Lookup (pass-dump)
examples/batch_hash_lookup_dump/passes_dump/00_frontend.py → examples/batch_hash_lookup_dump/passes_dump/13_after_AllocateMemoryAddr.py
Added a multi-stage example of a batched hash-table probe kernel across 14 pass-dump files: frontend → unroll → SSA → flatten/call/loop transforms → incore outlining → memory init/reuse → sync/addr allocation. Introduces orchestration + in-core kernels, loop-carried state, early-exit probe rounds, and MemRef-backed tensor views.
Qwen3 Decode (pass-dump)
examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/00_frontend.py → .../15_after_AllocateMemoryAddr.py
Added a staged, pass-dump implementation of a single-layer Qwen3 decode pipeline across 16 pass files: projection/RMSNorm, RoPE, KV cache writes, numerically-stable blockwise softmax, attention accumulation, gated MLP, and progressively-lowered IR with AIV/in-core kernels and DDR MemRef allocation.
Qwen3 Tilelet reference & example
examples/qwen3/qwen3_32b_decode_tilelet.py, examples/qwen3/qwen3_tilelet.md, examples/qwen3_decode_expand_mixed_kernel.py
New TILELET-aware Qwen3 implementation, tile-constraint documentation, helper tensor-spec builders, and an executable example that compiles/runs the decode layer with optional pass dumping.
Kimi K2 model
examples/models/kimi/kimi_k2_decode.py
New decode-only program builder for Kimi K2: RMSNorm, Q/K/V projections, RoPE, sliding-window flash decoding attention, KV cache update, and an MoE block with shared/expert routing. Exposes build_kimi_k2_decode_program and KimiK2Decode.kimi_k2_decode_layer.
MiLM model
examples/models/milm/milm_decode.py
New decode-only MiLM program builder: RMSNorm, chunked Q/K/V projections, RoPE, GQA-aware flash decoding attention (streaming softmax), SwiGLU MLP, and exposed build_milm_decode_program / MiLMDecode.milm_decode_layer.
Documentation & Reports
examples/models/README.md, examples/models/IMPLEMENTATION_REPORT.md, examples/models/TEST_REPORT.md
Added documentation, implementation report, and test report summarizing new model implementations, verification checklist, usage snippets, and next steps for Ascend NPU deployment.
DeepSeek V3.2 update
examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py
Refactor and formatting updates: moved the `from __future__` import to the top, adjusted numeric literal types, relocated some allocations into pl.auto_incore scopes, refactored RoPE slicing, changed the backend from BackendType.Ascend910B_PTO to BackendType.CCE, and improved work-dir derivation and logging.
New pass-dump infra artifacts
many new files under examples/.../passes_dump/
Multiple lowered IR/artifact files added across both example pipelines documenting each pass stage (unroll, SSA, flatten, chunking, interchange, outline, expand-mixed, convert-to-block/tile ops, init MemRef, memory reuse, sync, allocate addr).

Sequence Diagram(s)

sequenceDiagram
    rect rgba(200,200,255,0.5)
    participant Orchestrator
    end
    rect rgba(200,255,200,0.5)
    participant InCoreKernel
    end
    rect rgba(255,200,200,0.5)
    participant DDR_MemRef
    end

    Orchestrator->>InCoreKernel: init per-(b,ti) zero tiles (incore_0)
    Orchestrator->>InCoreKernel: for probe in 0..7 (yield state)
    InCoreKernel->>DDR_MemRef: read hash_pool / keys
    InCoreKernel->>InCoreKernel: compute mixed hash, scan buckets, select candidate (cand_key,cand_val)
    InCoreKernel->>InCoreKernel: compute active & hit masks, assemble updated tile
    InCoreKernel-->>Orchestrator: yield (round_has_active, updated value_ptr_out)
    Orchestrator->>Orchestrator: break if round_has_active == 0
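The early-exit orchestration in the diagram above can be sketched in plain Python. This is an illustrative analogue, not the PyPTO API: `lookup_batch`, the linear-probe scheme, and the miss sentinel are all assumptions; the real example drives in-core kernels via `pl.range` and `pl.break_()`.

```python
# Plain-Python sketch of the orchestration loop shown above: each probe
# round advances every still-active lane by one bucket, and the driver
# breaks out early once round_has_active == 0. Names are illustrative;
# the real example uses PyPTO (pl.range / pl.break_) primitives.

def lookup_batch(hash_pool, table_size, search_keys, max_probes=8):
    """Linear-probing lookup for a batch of keys; -1 marks a miss."""
    assert table_size & (table_size - 1) == 0, "table_size must be a power of two"
    keys, vals = hash_pool  # parallel arrays: keys[i] pairs with vals[i]
    results = [-1] * len(search_keys)
    active = [True] * len(search_keys)

    for probe in range(max_probes):
        round_has_active = 0
        for lane, key in enumerate(search_keys):
            if not active[lane]:
                continue
            bucket = (hash(key) + probe) & (table_size - 1)
            if keys[bucket] == key:      # hit: record value, retire lane
                results[lane] = vals[bucket]
                active[lane] = False
            elif keys[bucket] is None:   # empty slot: key absent, retire lane
                active[lane] = False
            else:                        # collision: stay active for next round
                round_has_active += 1
        if round_has_active == 0:        # mirrors pl.break_() in the example
            break
    return results
```

The two-level structure (outer probe rounds, inner per-lane work) is what the orchestration/in-core split in the pass dumps lowers: the inner loop body becomes an in-core kernel, and `round_has_active` is the yielded loop-carried state the orchestrator tests.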
sequenceDiagram
    rect rgba(200,200,255,0.5)
    participant Driver
    end
    rect rgba(200,255,200,0.5)
    participant AIV_Kernels
    end
    rect rgba(255,200,200,0.5)
    participant KV_Cache
    end
    rect rgba(255,255,200,0.5)
    participant TensorBuf
    end

    Driver->>AIV_Kernels: compute inv-RMS, project Q/K/V
    AIV_Kernels->>KV_Cache: apply RoPE, write K/V at cache_row
    Driver->>AIV_Kernels: iterate ctx_blocks (streaming softmax)
    AIV_Kernels->>KV_Cache: load K/V tiles with valid_len masking
    AIV_Kernels->>TensorBuf: accumulate attention output (stable max/exp/sum)
    AIV_Kernels->>AIV_Kernels: post-attn proj, RMSNorm, gated MLP
    AIV_Kernels-->>Driver: assemble final BF16 output
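The "streaming softmax" step above is the standard online-softmax recurrence: keep a running max, denominator, and weighted accumulator, and rescale the partial sums whenever a new block raises the max. A minimal pure-Python sketch (the function name and block size are illustrative, not from the example):

```python
import math

def streaming_softmax_weighted_sum(scores, values, block=2):
    """Numerically stable softmax(scores) . values, one block at a time.

    Maintains a running max (m), running denominator (l), and running
    weighted accumulator (acc); when a new block raises the max, the
    old partial sums are rescaled by exp(m_old - m_new).
    """
    m = float("-inf")  # running max of all scores seen so far
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running sum of exp(score - m) * value
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        l = l * scale + sum(math.exp(s - m_new) for s in s_blk)
        acc = acc * scale + sum(math.exp(s - m_new) * v
                                for s, v in zip(s_blk, v_blk))
        m = m_new
    return acc / l
```

The result is identical to computing the full softmax in one pass, which is why the attention loop can consume K/V tiles block by block without ever materializing all scores.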

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Poem

🐰 I hopped through passes, tile by tile,

Keys and kernels raced a merry mile,
Qwen's rotations spun with RoPE delight,
MiLM and Kimi joined the night,
Memory mapped and neatly spun — hooray, the compile's done!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 2.67%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.
Title Check | ✅ Passed | The PR title 'Feature/kimi milm models' directly refers to the main feature additions: the Kimi K2 and Xiaomi MiLM model implementations in the changeset.


@gemini-code-assist

Warning

Gemini encountered an error creating the summary. You can try again by commenting /gemini summary.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 24

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟠 Major comments (14)
examples/models/kimi/kimi_k2_decode.py-359-364 (1)

359-364: ⚠️ Potential issue | 🟠 Major

Post-attention RMSNorm incorrectly uses input_rms_weight instead of a separate post_rms_weight.

The function signature is missing a post_rms_weight parameter. Pre-attention and post-attention RMSNorm layers should have distinct learned weights. The MiLM implementation correctly includes separate weights (see milm_decode.py line 105).

🐛 Proposed fix - add post_rms_weight parameter

Add to function signature after line 100:

            post_rms_weight: pl.Tensor[[1, HIDDEN_CFG], pl.FP32],

Then fix line 362:

-                        gamma = pl.slice(input_rms_weight, [1, K_CHUNK], [0, k0])
+                        gamma = pl.slice(post_rms_weight, [1, K_CHUNK], [0, k0])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/kimi/kimi_k2_decode.py` around lines 359 - 364, The
post-attention RMSNorm loop incorrectly reuses input_rms_weight instead of a
separate post_rms_weight and the function signature lacks that parameter; update
the function signature to accept post_rms_weight: pl.Tensor[[1, HIDDEN_CFG],
pl.FP32] and inside the pl.range(HIDDEN_BLOCKS) loop (where k0, x_chunk, gamma,
normed, post_norm_tile are used) replace references to input_rms_weight with
post_rms_weight so post-attention RMSNorm uses its own learned weights.
examples/qwen3/qwen3_32b_decode_tilelet.py-91-119 (1)

91-119: ⚠️ Potential issue | 🟠 Major

Validate the tiling invariants before computing block counts.

The builder exposes configurable shapes, but the implementation below assumes exact divisibility (hidden_size % K_CHUNK == 0, hidden_size % Q_OUT_CHUNK == 0, intermediate_size % MLP_OUT_CHUNK == 0, batch % BATCH_TILE == 0) plus hidden_size == num_heads * head_dim, num_heads % num_kv_heads == 0, and (num_heads // num_kv_heads) % Q_HEAD_BATCH == 0. Any non-default config currently truncates loops or addresses the wrong head ranges. Fail fast here instead of generating a silently wrong program.

Proposed guardrail
 def build_qwen3_single_layer_decode_program(
     batch: int = BATCH,
     max_seq_len: int = MAX_SEQ,
     hidden_size: int = HIDDEN,
     num_heads: int = NUM_HEADS,
     num_kv_heads: int = NUM_KV_HEADS,
     head_dim: int = HEAD_DIM,
     intermediate_size: int = INTERMEDIATE,
 ):
+    if max_seq_len <= 0:
+        raise ValueError("max_seq_len must be positive")
+    if batch % BATCH_TILE != 0:
+        raise ValueError(f"batch must be divisible by {BATCH_TILE}")
+    if num_kv_heads <= 0 or num_heads % num_kv_heads != 0:
+        raise ValueError("num_heads must be divisible by num_kv_heads")
+    if hidden_size != num_heads * head_dim:
+        raise ValueError("hidden_size must equal num_heads * head_dim")
+    if hidden_size % K_CHUNK != 0 or hidden_size % Q_OUT_CHUNK != 0:
+        raise ValueError("hidden_size must be divisible by K_CHUNK and Q_OUT_CHUNK")
+    if (num_kv_heads * head_dim) % KV_OUT_CHUNK != 0:
+        raise ValueError("num_kv_heads * head_dim must be divisible by KV_OUT_CHUNK")
+    if intermediate_size % MLP_OUT_CHUNK != 0:
+        raise ValueError("intermediate_size must be divisible by MLP_OUT_CHUNK")
+    if (num_heads // num_kv_heads) % Q_HEAD_BATCH != 0:
+        raise ValueError("num_heads / num_kv_heads must be divisible by Q_HEAD_BATCH")
+
     BATCH_CFG = batch
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/qwen3/qwen3_32b_decode_tilelet.py` around lines 91 - 119, Add
explicit validation at the start of build_qwen3_single_layer_decode_program to
check tiling and shape invariants before computing block counts: assert/raise if
hidden_size % K_CHUNK != 0, hidden_size % Q_OUT_CHUNK != 0, intermediate_size %
MLP_OUT_CHUNK != 0, batch % BATCH_TILE != 0, hidden_size != num_heads *
head_dim, num_heads % num_kv_heads != 0, or (num_heads // num_kv_heads) %
Q_HEAD_BATCH != 0; include the offending values in the error messages so callers
fail fast instead of producing truncated loops or wrong head ranges. Ensure
these checks occur before computing HIDDEN_BLOCKS, Q_OUT_BLOCKS, KV_OUT_BLOCKS,
MLP_OUT_BLOCKS, Q_PER_KV_CFG, Q_GROUPS, TOTAL_Q_GROUPS, and ATTN_INIT_BLOCKS.
examples/batch_hash_lookup_dump/passes_dump/00_frontend.py-21-27 (1)

21-27: ⚠️ Potential issue | 🟠 Major

Thread hash_table_size and hash_base_ptr into the addressing logic.

Line 21 hard-codes a 64-bucket mask, and Lines 26-27 always read fixed bucket/value offsets. That makes hash_table_size and hash_base_ptr ineffective, so any caller with non-64 tables or non-zero pool offsets gets incorrect lookups.
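Worth noting when threading `hash_table_size` through: the `& (size - 1)` masking trick the dump hard-codes as `& 63` is only equivalent to `% size` when the table size is a power of two, so the fix should validate that too. A small illustrative sketch (`bucket_index` is a hypothetical helper, not part of the example):

```python
def bucket_index(h, hash_table_size, hash_base_ptr=0):
    """Map a hash value to a bucket, honoring table size and base offset.

    Masking with (size - 1) equals `h % size` only when size is a power
    of two, so validate that before using the AND trick the pass dump
    hard-codes as `& 63`.
    """
    if hash_table_size <= 0 or hash_table_size & (hash_table_size - 1):
        raise ValueError(
            f"hash_table_size must be a power of two, got {hash_table_size}"
        )
    return hash_base_ptr + (h & (hash_table_size - 1))
```

With the size and base offset threaded through like this, a 128-bucket table or a table that does not start at offset 0 addresses the correct slots instead of silently wrapping at 64.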

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup_dump/passes_dump/00_frontend.py` around lines 21 -
27, The code hardcodes 64 buckets and fixed offsets, breaking tables with other
sizes or non-zero base offsets; update the loop and addressing to use
hash_table_size and hash_base_ptr: replace pl.range(0, 64, 1) with pl.range(0,
hash_table_size, 1), compute the mask as pl.tensor.ands(..., hash_table_size -
1) instead of 64 - 1, and when building bucket access use hash_base_ptr and
hash_table_size to compute indices into hash_pool (e.g., bucket_idx =
hash_base_ptr + bucket; val_idx = hash_base_ptr + hash_table_size + bucket) and
pass those to pl.tensor.view for bucket_keys and bucket_vals so lookups honor
both hash_table_size and hash_base_ptr.
examples/batch_hash_lookup_dump/passes_dump/08_after_ExpandMixedKernel.py-41-41 (1)

41-41: ⚠️ Potential issue | 🟠 Major

`and` is a Python reserved keyword and cannot be used as a method name.

The expression pl.tensor.and(...) causes a syntax error because `and` is a reserved keyword in Python. If these compiler pass dumps are intended to be valid, parseable Python, the code generator should emit a different method name (e.g., pl.tensor.and_, pl.tensor.bitwise_and, or pl.tensor.land).
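One way the generator could avoid this class of bug mechanically is to check emitted op names against Python's keyword list and apply the PEP 8 trailing-underscore convention (the same rule behind `operator.and_` / `operator.or_` in the standard library). A sketch, with `emit_tensor_call` as a hypothetical stand-in for the generator's emission helper:

```python
import keyword

def safe_op_name(op: str) -> str:
    """Return a valid Python identifier for an IR op name.

    Appends a trailing underscore when the name collides with a reserved
    keyword, per PEP 8 (cf. operator.and_ / operator.or_ / operator.not_).
    """
    return op + "_" if keyword.iskeyword(op) else op

def emit_tensor_call(op: str, *args: str) -> str:
    # Hypothetical emission helper: renders one pl.tensor.<op>(...) call,
    # e.g. emit_tensor_call("and", "active_mask_0", "key_match_0")
    #   -> "pl.tensor.and_(active_mask_0, key_match_0)"
    return f"pl.tensor.{safe_op_name(op)}({', '.join(args)})"
```

Routing every op-name print through one helper like this fixes all six affected pass dumps at once instead of patching each file.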

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup_dump/passes_dump/08_after_ExpandMixedKernel.py` at
line 41, The generated dump uses the reserved Python keyword as a method
name—pl.tensor.and(...)—which breaks parsing; update the code generator so it
emits a valid method name (e.g., pl.tensor.and_ or pl.tensor.bitwise_and)
wherever logical/bitwise tensor ANDs are printed (specifically change the call
that produces hit_mask_0: pl.Tensor[...] = pl.tensor.and(active_mask_0,
key_match_0) to use the new method name), and ensure any other occurrences
(e.g., similar patterns with active_mask_0 and key_match_0) are replaced
consistently so the dumps are valid Python.
examples/batch_hash_lookup_dump/passes_dump/12_after_InsertSync.py-52-54 (1)

52-54: ⚠️ Potential issue | 🟠 Major

Undefined variable ti_0 — same issue persists.

ti_0 referenced but not defined in batch_hash_lookup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup_dump/passes_dump/12_after_InsertSync.py` around
lines 52 - 54, The loop references an undefined variable ti_0 when calling
batch_hash_lookup_incore_1; fix by initializing ti_0 to the correct initial
value before the pl.range or by replacing ti_0 in the pl.range init_values with
the intended iterator (e.g., use ti_iter_1 or another existing initialized
symbol) so that the call to batch_hash_lookup_incore_1(hash_pool_0, probe_0,
..., ti_0, ti_iter_1, ...) uses a defined variable; update either the variable
declaration for ti_0 or the pl.range init_values tuple (b_0, ti_0,
value_ptr_out_2) to reference the correct existing symbol.
examples/batch_hash_lookup_dump/passes_dump/11_after_MemoryReuse.py-52-54 (1)

52-54: ⚠️ Potential issue | 🟠 Major

Undefined variable ti_0 — same issue persists.

ti_0 referenced but not defined in batch_hash_lookup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup_dump/passes_dump/11_after_MemoryReuse.py` around
lines 52 - 54, The call to self.batch_hash_lookup_incore_1 uses an undefined
variable ti_0; fix by supplying the correct, defined iteration variable or
initializing ti_0 before the loop — either (a) change the argument list to use
the existing iterator variable ti_iter_1 (or another in-scope variable like ti)
instead of ti_0, or (b) add a declaration/initialization for ti_0 before the for
loop so it matches the intended init_values passed into the loop; update the
call site in batch_hash_lookup where batch_hash_lookup_incore_1(hash_pool_0,
probe_0, ..., search_key_0, ti_0, ti_iter_1, ...) is invoked to reference only
defined symbols.
examples/batch_hash_lookup_dump/passes_dump/10_after_InitMemRef.py-41-41 (1)

41-41: ⚠️ Potential issue | 🟠 Major

`and` is a Python reserved keyword — same issue persists.

pl.tensor.and(active_mask_0, key_match_0) is invalid Python syntax.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup_dump/passes_dump/10_after_InitMemRef.py` at line
41, The expression uses the Python reserved keyword `and` as a function name
(pl.tensor.and), which is invalid; update the call that creates hit_mask_0 to
use a valid tensor logical/bitwise operation such as
pl.tensor.logical_and(active_mask_0, key_match_0) or the tensor operator form
(active_mask_0 & key_match_0), ensuring you replace pl.tensor.and with the
correct API (logical_and or bitwise_and) so hit_mask_0 is computed from
active_mask_0 and key_match_0 without syntax errors.
examples/batch_hash_lookup_dump/passes_dump/13_after_AllocateMemoryAddr.py-52-54 (1)

52-54: ⚠️ Potential issue | 🟠 Major

Undefined variable ti_0 — fix needed in the code generator.

This issue appears in all 6 pass dump files. The batch_hash_lookup orchestration function references ti_0 without defining it. Possible fixes:

  1. Add ti_0 as a function parameter
  2. Initialize ti_0 before the loop (e.g., ti_0: pl.Scalar[pl.INDEX] = 0)
  3. If ti_0 should come from the previous loop, restructure the code to capture it properly
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup_dump/passes_dump/13_after_AllocateMemoryAddr.py`
around lines 52 - 54, The loop references an undefined symbol ti_0 used when
calling batch_hash_lookup_incore_1; define ti_0 before the for-loop (e.g.,
initialize ti_0 as a pl.Scalar[pl.INDEX] = 0) or add ti_0 as a parameter to the
surrounding batch_hash_lookup orchestration function so the symbol exists when
invoking batch_hash_lookup_incore_1 (adjust callers accordingly); ensure the
chosen fix is applied consistently across all six pass dump files that contain
the same pattern (look for the for loop using probe_0, b_iter_1, ti_iter_1 and
the call to batch_hash_lookup_incore_1).
examples/batch_hash_lookup_dump/passes_dump/12_after_InsertSync.py-41-41 (1)

41-41: ⚠️ Potential issue | 🟠 Major

`and` is a Python reserved keyword — same issue persists.

pl.tensor.and(active_mask_0, key_match_0) is invalid Python syntax.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup_dump/passes_dump/12_after_InsertSync.py` at line
41, The code uses the Python reserved word `and` as an attribute call
(`pl.tensor.and(active_mask_0, key_match_0)`), which is invalid; replace that
call with the proper tensor logical operator function (for example
`pl.tensor.logical_and(active_mask_0, key_match_0)` or
`pl.tensor.bitwise_and(...)` depending on boolean vs bitwise semantics) wherever
`pl.tensor.and` is used (notably the `hit_mask_0` assignment), ensuring the
function name matches the tensor API.
examples/batch_hash_lookup_dump/passes_dump/11_after_MemoryReuse.py-41-41 (1)

41-41: ⚠️ Potential issue | 🟠 Major

`and` is a Python reserved keyword — same issue persists.

pl.tensor.and(active_mask_0, key_match_0) is invalid Python syntax.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup_dump/passes_dump/11_after_MemoryReuse.py` at line
41, The code uses the reserved Python keyword form pl.tensor.and(active_mask_0,
key_match_0); replace this with the proper API call (e.g.,
pl.tensor.logical_and(active_mask_0, key_match_0) for boolean masks, or
pl.tensor.bitwise_and(...) for bitwise ops) so the expression that produces
hit_mask_0 uses a valid function name and references active_mask_0 and
key_match_0.
examples/batch_hash_lookup_dump/passes_dump/09_after_ConvertTensorToBlockOps.py-52-54 (1)

52-54: ⚠️ Potential issue | 🟠 Major

Undefined variable ti_0 — same issue as previous pass.

ti_0 is used in init_values and as an argument to batch_hash_lookup_incore_1 but is never defined in the batch_hash_lookup function scope.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/batch_hash_lookup_dump/passes_dump/09_after_ConvertTensorToBlockOps.py`
around lines 52 - 54, The loop uses an undefined variable ti_0 in init_values
and as an argument to batch_hash_lookup_incore_1; fix by ensuring ti_0 is
defined in the batch_hash_lookup scope before the pl.range loop (or replace ti_0
with the correct in-scope variable, e.g., ti) so
batch_hash_lookup_incore_1(hash_pool_0, probe_0, ..., ti_0, ti_iter_1, ...) and
the init_values tuple (b_0, ti_0, value_ptr_out_2) refer to a valid, initialized
symbol.
examples/batch_hash_lookup_dump/passes_dump/10_after_InitMemRef.py-52-54 (1)

52-54: ⚠️ Potential issue | 🟠 Major

Undefined variable ti_0 — same issue persists.

ti_0 referenced but not defined in batch_hash_lookup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup_dump/passes_dump/10_after_InitMemRef.py` around
lines 52 - 54, The loop references an undefined variable ti_0 when initializing
the iterator tuple for pl.range; define or initialize ti_0 before the loop (or
replace it with the correct existing initializer) so the init_values tuple
passed to pl.range is valid. Locate the batch_hash_lookup function and ensure
ti_0 is set (e.g., a scalar/index initial value used by pl.range) or change the
init_values to use the existing variable name that holds the initial
thread-index state; also verify batch_hash_lookup_incore_1 signature still
matches the provided arguments after fixing ti_0.
examples/batch_hash_lookup_dump/passes_dump/09_after_ConvertTensorToBlockOps.py-41-41 (1)

41-41: ⚠️ Potential issue | 🟠 Major

`and` is a Python reserved keyword — same issue as previous pass.

The code generator emits pl.tensor.and(...) which is invalid Python syntax. This needs to be fixed at the source (the code generator) rather than in individual pass dump files.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/batch_hash_lookup_dump/passes_dump/09_after_ConvertTensorToBlockOps.py`
at line 41, The generator emits the invalid Python identifier "pl.tensor.and" in
ConvertTensorToBlockOps (seen as pl.tensor.and(...)); update the emission logic
to map the operator name "and" to a valid callable (e.g., "logical_and" or the
appropriate bitwise name) instead of writing the reserved keyword. Locate the
code in the ConvertTensorToBlockOps emission/codegen that constructs calls to
pl.tensor.<op> and change it to emit pl.tensor.logical_and (or another
appropriate safe name) or use getattr(pl.tensor, op_name_mapping[op]) so
generated code never contains the raw "and" keyword.
examples/batch_hash_lookup_dump/passes_dump/13_after_AllocateMemoryAddr.py-41-41 (1)

41-41: ⚠️ Potential issue | 🟠 Major

`and` is a Python reserved keyword — fix needed in the code generator.

This issue appears in all 6 pass dump files (passes 08–13). The code generator should emit a valid Python identifier instead of and. Common conventions include:

  • pl.tensor.and_ (trailing underscore, PEP 8 convention for keyword conflicts)
  • pl.tensor.bitwise_and or pl.tensor.logical_and (descriptive name)
🐛 Suggested fix pattern for all files
-                hit_mask_0: pl.Tensor[[1, 32], pl.INDEX, pl.MemRef(pl.MemorySpace.DDR, -1, 256, 21)] = pl.tensor.and(active_mask_0, key_match_0)
+                hit_mask_0: pl.Tensor[[1, 32], pl.INDEX, pl.MemRef(pl.MemorySpace.DDR, -1, 256, 21)] = pl.tensor.and_(active_mask_0, key_match_0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup_dump/passes_dump/13_after_AllocateMemoryAddr.py`
at line 41, The generated code uses the Python reserved word `and` as a method
name (seen as `pl.tensor.and` producing the assignment to `hit_mask_0` using
`active_mask_0` and `key_match_0`), which is invalid; update the code generator
to emit a valid identifier for this operation (e.g., `pl.tensor.and_` or
`pl.tensor.bitwise_and`/`pl.tensor.logical_and`) wherever it currently emits
`and` (affecting the tensor op emission logic that creates expressions like
`pl.tensor.and(active_mask_0, key_match_0)`), and ensure the same replacement is
applied consistently across all pass dumps 08–13 so generated names are valid
Python identifiers.
🟡 Minor comments (8)
examples/models/README.md-46-46 (1)

46-46: ⚠️ Potential issue | 🟡 Minor

Import paths don't match the actual directory structure.

The actual file locations are examples/models/kimi/... and examples/models/milm/..., but the examples import from models.kimi.... Either update the imports or add a note about the expected working directory.

📝 Suggested fix
 ### Kimi K2 Decode
 ```python
-from models.kimi.kimi_k2_decode import build_kimi_k2_decode_program
+from examples.models.kimi.kimi_k2_decode import build_kimi_k2_decode_program
 ### Xiaomi MiLM Decode
 ```python
-from models.milm.milm_decode import build_milm_decode_program
+from examples.models.milm.milm_decode import build_milm_decode_program

Alternatively, add a note that users should run from the examples/ directory.

Also applies to: 66-66

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/README.md` at line 46, The imports in the README use
incorrect module paths — update the import lines that reference
build_kimi_k2_decode_program and build_milm_decode_program to point to the
actual package location (examples.models.kimi.kimi_k2_decode and
examples.models.milm.milm_decode) or alternatively add a brief note instructing
users to run examples from the examples/ working directory so the original
relative imports resolve.
examples/models/milm/milm_decode.py-191-193 (1)

191-193: ⚠️ Potential issue | 🟡 Minor

Potential division by zero if cache_pos contains invalid values.

If cache_pos[b] is negative, ctx_len becomes ≤ 0, ctx_blocks becomes 0, and the attention loop is skipped entirely. This leaves li = 0, causing division by zero at line 312.

Consider either validating cache_pos at entry or adding a guard:

🛡️ Proposed guard
-                        ctx = pl.row_expand_div(oi, li)
+                        # Guard against div-by-zero when context is empty
+                        li_safe = pl.maximum(li, pl.mul(li, 0.0) + 1e-9)
+                        ctx = pl.row_expand_div(oi, li_safe)

Or validate at function entry that all cache_pos values are non-negative.

Also applies to: 312-312
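The hazard is easy to see with plain integer arithmetic, assuming `ctx_len` is derived as `cache_pos[b] + 1` (which matches the observation that a negative `cache_pos` drives `ctx_len` to zero or below). A sketch of the entry-validation variant; `SEQ_TILE`'s value and the helper name are illustrative:

```python
SEQ_TILE = 128  # assumed tile size for illustration

def ctx_blocks_for(cache_pos_b: int) -> int:
    """Number of context blocks for one batch row, clamped to be safe.

    cache_pos_b is the number of cached tokens; +1 accounts for the
    token being decoded. Without the clamp, a negative cache position
    makes the ceiling division return 0, the attention loop runs zero
    iterations, and the softmax denominator li stays 0.
    """
    ctx_len = max(cache_pos_b, 0) + 1          # clamp invalid cache positions
    return (ctx_len + SEQ_TILE - 1) // SEQ_TILE
```

Clamping at entry keeps the hot loop untouched, whereas the epsilon-denominator guard above changes the per-block math; either closes the divide-by-zero path.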

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/milm/milm_decode.py` around lines 191 - 193, The code
computes ctx_len and ctx_blocks from cache_pos via pl.tensor.read(cache_pos,
[b]) and can produce ctx_blocks == 0 when cache_pos[b] is negative, which later
causes li to be zero and a division-by-zero; fix by validating or clamping
cache_pos values before use (e.g., assert or replace negative cache_pos entries
with 0) and/or adding a guard after computing ctx_blocks to ensure it is at
least 1 before entering the attention loop and before computing li (use
max(ctx_blocks, 1) or skip the division when ctx_blocks == 0), referencing
pl.tensor.read(cache_pos, [b]), ctx_len, ctx_blocks, SEQ_TILE and li to locate
the change.
examples/models/IMPLEMENTATION_REPORT.md-69-69 (1)

69-69: ⚠️ Potential issue | 🟡 Minor

Invalid Python import syntax - hyphens not allowed in module names.

`pypto-lib` contains a hyphen, which is invalid in Python module/package names. These example imports will fail with a SyntaxError.

📝 Suggested fix
 ### Kimi K2
 ```python
-from pypto-lib.examples.models.kimi.kimi_k2_decode import build_kimi_k2_decode_program
+from examples.models.kimi.kimi_k2_decode import build_kimi_k2_decode_program
 ### Xiaomi MiLM
 ```python
-from pypto-lib.examples.models.milm.milm_decode import build_milm_decode_program
+from examples.models.milm.milm_decode import build_milm_decode_program

Also applies to: 84-84

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/IMPLEMENTATION_REPORT.md` at line 69, The import statements
like "from pypto-lib.examples.models.kimi.kimi_k2_decode import
build_kimi_k2_decode_program" use an invalid module name with a hyphen; update
these imports (e.g., the KIMI and MILM import lines referencing
build_kimi_k2_decode_program and build_milm_decode_program) to use the correct
package path without the hyphen (for example start the import from
examples.models...) so the module names are valid Python identifiers; apply the
same change to the other occurrence around the MILM import.
examples/models/TEST_REPORT.md-47-47 (1)

47-47: ⚠️ Potential issue | 🟡 Minor

Remove hardcoded absolute paths.

Same issue as in README.md - these developer-specific paths should be converted to relative paths or removed.

Also applies to: 166-179, 221-224

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/TEST_REPORT.md` at line 47, The markdown contains
developer-specific absolute paths (e.g. the string
'/data/z00885570/miniconda3/envs/py310/lib/python3.10/site-packages/pypto/')
that must be removed or converted to non‑specific references; edit
examples/models/TEST_REPORT.md (and the similar occurrences referenced in
README.md) to replace hardcoded absolute paths with relative paths,
environment-variable placeholders (e.g. $VENV_PATH or {VENV_PATH}), or generic
instructions (e.g. "your Python environment's site‑packages directory") so the
document is portable and does not expose developer machine paths.
examples/models/README.md-116-119 (1)

116-119: ⚠️ Potential issue | 🟡 Minor

Remove hardcoded absolute paths.

These paths are developer-specific and won't work for other users. Use relative paths from the repository root instead.

📝 Suggested fix

```diff
 ## 参考资源
 
-- PyPTO 文档:`/data/z00885570/pypto3.0/pypto-lib/docs/`
-- PTO IR 手册:`/data/z00885570/pypto3.0/PTOAS/docs/PTO_IR_manual.md`
-- Qwen3 实现:`/data/z00885570/pypto3.0/pypto-lib/examples/models/qwen3/`
-- DeepSeek 实现:`/data/z00885570/pypto3.0/pypto-lib/examples/models/deepseek_v3_2/`
+- PyPTO 文档:`docs/`
+- PTO IR 手册:参考 PTOAS 仓库文档
+- Qwen3 实现:`examples/models/qwen3/`
+- DeepSeek 实现:`examples/models/deepseek_v3_2/`
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/README.md` around lines 116 - 119, Replace the four hardcoded
absolute paths in examples/models/README.md (the entries titled "PyPTO 文档", "PTO
IR 手册", "Qwen3 实现", and "DeepSeek 实现") with repository-relative paths so they
work for all users; update each line to point to the corresponding path under
the repo root (for example use docs/... or examples/models/... as appropriate)
and ensure links are valid and use relative Markdown link syntax if needed.
examples/qwen3_decode_expand_mixed_kernel.py-167-167 (1)

167-167: ⚠️ Potential issue | 🟡 Minor

Type inconsistency: ctx_blocks also uses INT32 arithmetic.

Similar to pos, the ctx_blocks calculation should use INDEX type for consistency with the pass dumps:

```python
ctx_blocks = (pl.cast(ctx_len, pl.INDEX) + SEQ_TILE - 1) // SEQ_TILE
```

🔧 Proposed fix

```diff
-            ctx_blocks = (ctx_len + SEQ_TILE - 1) // SEQ_TILE
+            ctx_blocks = (pl.cast(ctx_len, pl.INDEX) + SEQ_TILE - 1) // SEQ_TILE
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/qwen3_decode_expand_mixed_kernel.py` at line 167, The calculation of
ctx_blocks uses INT32 arithmetic; change it to INDEX arithmetic to match pass
dumps by casting ctx_len to pl.INDEX before the division. Replace the current
expression that computes ctx_blocks with one that uses pl.cast(ctx_len,
pl.INDEX) (keeping SEQ_TILE and the same arithmetic) so ctx_blocks =
(pl.cast(ctx_len, pl.INDEX) + SEQ_TILE - 1) // SEQ_TILE; update the ctx_blocks
computation in the same scope where ctx_len and SEQ_TILE are defined (the
ctx_blocks variable) to ensure type consistency with pos and other INDEX-based
values.
examples/qwen3_decode_expand_mixed_kernel.py-165-168 (1)

165-168: ⚠️ Potential issue | 🟡 Minor

Type inconsistency: pos should be cast to INDEX type.

The ctx_len is read as INT32 but pos is used directly as a slice index without casting. The compiler pass dumps show this should be:

```python
pos = pl.cast(ctx_len, pl.INDEX) - 1
```

This matches the pattern in the generated pass dumps (e.g., 00_frontend.py line 58).

🔧 Proposed fix

```diff
             ctx_len = pl.tensor.read(seq_lens, [b])
-            pos = ctx_len - 1
+            pos = pl.cast(ctx_len, pl.INDEX) - 1
             ctx_blocks = (ctx_len + SEQ_TILE - 1) // SEQ_TILE
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/qwen3_decode_expand_mixed_kernel.py` around lines 165 - 168, The
slice index `pos` is computed from `ctx_len` (read via pl.tensor.read) which is
INT32 but used as an INDEX; fix by casting `ctx_len` to pl.INDEX before
subtracting one (i.e., replace the current `pos = ctx_len - 1` with a casted
version using pl.cast(ctx_len, pl.INDEX) - 1) so `pos` has the correct type when
passed to pl.slice(rope_cos, ...); update the computation near where `ctx_len`,
`pos`, `ctx_blocks`, and `cos_row` are defined.
examples/qwen3/qwen3_32b_decode_tilelet.py-542-545 (1)

542-545: ⚠️ Potential issue | 🟡 Minor

Don't print the skip message and the raw error for the same result.

When code_runner is missing, both branches execute today, so the helper prints the special-case success message and then immediately prints the raw error. Make the second branch elif.

Proposed fix

```diff
     if not result.passed and result.error and "code_runner" in result.error:
         print("Result: COMPILE OK — device run skipped (code_runner not found).")
-    if not result.passed and result.error:
+    elif not result.passed and result.error:
         print(f"Result: {result.error}")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/qwen3/qwen3_32b_decode_tilelet.py` around lines 542 - 545, The two
conditional branches in the result handling currently both run when code_runner
is missing, causing both the special-case skip message and the raw error to
print; update the second if that checks result.error to be an elif so it only
prints the raw error when the special-case ("code_runner" in result.error) did
not match—modify the conditionals around result.passed/result.error in the block
containing the print statements to use elif for the second branch.
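
The double-print behaviour is easy to see in isolation; `report` below is a hypothetical stand-in for the example's result handling, not the actual helper:

```python
def report(passed: bool, error: str) -> list[str]:
    # With `elif`, the special-case branch suppresses the raw error;
    # with two independent `if`s, both messages would be collected.
    msgs = []
    if not passed and error and "code_runner" in error:
        msgs.append("COMPILE OK (device run skipped: code_runner not found)")
    elif not passed and error:
        msgs.append(error)
    return msgs

print(report(False, "code_runner not found"))  # one skip message, not two
print(report(False, "segfault in kernel"))     # raw error only
print(report(True, ""))                        # nothing to report
```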
🧹 Nitpick comments (6)
examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py (1)

267-274: Consider extracting a helper for tensor initialization to -1.

The pattern of pl.mul(tensor, 0) followed by pl.add(tensor, -1) to initialize INT32 tensors to -1 is repeated multiple times throughout this file (lines 267-273, 312, 324, 333-337, 349, 361, 370-374). This works correctly but is verbose.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py` around lines 267
- 274, Extract a small helper (e.g., make_neg_one_tensor(shape, dtype)) that
wraps pl.create_tensor and the operations pl.mul(..., 0) and pl.add(..., -1) to
produce an INT32 tensor initialized to -1, then replace the repeated sequences
that build neg_one for topk_idx and blk_topk_idx (and other similar sites) with
calls to this helper; update usages around the loop that assembles
topk_idx/blk_topk_idx (the block using pl.range(INDEX_TOPK_CFG) and the calls to
pl.assemble) to call make_neg_one_tensor([1,1], pl.INT32) instead of the current
mul/add pattern.
examples/models/milm/milm_decode.py (1)

76-76: Consider validating GQA head divisibility.

If num_heads is not evenly divisible by num_kv_heads, the integer division will silently truncate, potentially causing incorrect head mapping in the attention computation.

🛡️ Proposed validation

```diff
     Q_PER_KV_CFG = num_heads // num_kv_heads
+    if num_heads % num_kv_heads != 0:
+        raise ValueError(f"num_heads ({num_heads}) must be divisible by num_kv_heads ({num_kv_heads})")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/milm/milm_decode.py` at line 76, The assignment Q_PER_KV_CFG
= num_heads // num_kv_heads can silently truncate when num_heads is not
divisible by num_kv_heads; add a validation check (e.g., in the constructor or
just before this assignment) that verifies num_heads % num_kv_heads == 0 and
raise a clear ValueError or AssertionError if not, referencing the variables
num_heads, num_kv_heads and the computed Q_PER_KV_CFG to help locate the logic
in milm_decode.py (where Q_PER_KV_CFG is defined) so attention head mapping is
correct.
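
A standalone sketch of the suggested guard (the function name is illustrative; the real code computes `Q_PER_KV_CFG` inline):

```python
def q_per_kv(num_heads: int, num_kv_heads: int) -> int:
    # Integer division truncates silently, so reject configurations where
    # query heads do not divide evenly across KV heads (GQA grouping).
    if num_kv_heads <= 0 or num_heads % num_kv_heads != 0:
        raise ValueError(
            f"num_heads ({num_heads}) must be divisible by num_kv_heads ({num_kv_heads})"
        )
    return num_heads // num_kv_heads

print(q_per_kv(32, 8))  # 4: each KV head serves 4 query heads
```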
examples/models/IMPLEMENTATION_REPORT.md (1)

106-117: File structure shows pypto-lib/ prefix inconsistently.

The file structure diagram shows paths starting with pypto-lib/examples/models/ but the actual files are at examples/models/ relative to the repository root. This inconsistency may confuse users.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/IMPLEMENTATION_REPORT.md` around lines 106 - 117, The
diagram's root prefix `pypto-lib/` is inconsistent with actual repo layout;
update the tree so paths start at the repository root (`examples/models/`) to
match real files (e.g., README.md, kimi/kimi_k2_decode.py, milm/milm_decode.py,
qwen3/qwen3-32b.py, deepseek_v3_2/deepseek_v3_2_*.py), ensuring the example
block uses `examples/models/` as the top-level path and remove the `pypto-lib/`
prefix.
examples/models/kimi/kimi_k2_decode.py (1)

420-422: NUM_ACTIVE_EXPERTS parameter is unused - all experts are processed.

The implementation accepts num_active_experts (default 4) but the comment at line 421 acknowledges this is "simplified" to process all experts. This negates the computational benefits of MoE sparse routing and makes the parameter misleading.

Consider either:

  1. Implementing actual top-K expert selection
  2. Removing the unused parameter to avoid confusion
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/kimi/kimi_k2_decode.py` around lines 420 - 422, The loop
currently iterates all experts (for exp_idx in pl.range(NUM_EXPERTS_CFG))
ignoring the NUM_ACTIVE_EXPERTS parameter; update the routing to perform top-K
selection using NUM_ACTIVE_EXPERTS: compute gating scores, pick the top
NUM_ACTIVE_EXPERTS expert indices per token, and only iterate/send tokens to
those selected experts (use the existing gating tensor and replace the
full-range loop with a loop over selected indices), or alternatively remove
NUM_ACTIVE_EXPERTS and related parameters/comments if you choose to keep dense
routing; reference NUM_ACTIVE_EXPERTS, the for exp_idx in
pl.range(NUM_EXPERTS_CFG) loop, and the gating weight tensor when making the
change.
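
For reference, top-K gating for one token is sketched below in plain Python; this is not the PyPTO DSL and the names are illustrative, but it shows the selection that `num_active_experts` is meant to drive:

```python
import math

def topk_routing(gate_logits: list[float], num_active_experts: int):
    # Pick the top-K expert indices for one token, then softmax-normalise
    # the selected scores so only K of the experts contribute to the output.
    k = num_active_experts
    topk_idx = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i])[-k:]
    scores = [gate_logits[i] for i in topk_idx]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return topk_idx, [e / total for e in exps]

# 8 experts, 4 active, mirroring the defaults discussed above:
logits = [0.3, 2.1, -0.5, 1.7, 0.0, 0.9, -1.2, 1.1]
idx, weights = topk_routing(logits, 4)
print(sorted(idx))  # [1, 3, 5, 7]
```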
examples/models/TEST_REPORT.md (1)

29-33: Add language specifiers to fenced code blocks.

Static analysis flagged these code blocks as missing language identifiers. Adding language specifiers improves syntax highlighting and documentation quality.

📝 Suggested fix

````diff
 ### 语法检查
-```
+```text
 ✓ Kimi K2:    语法检查通过 (26,114 字节)
 ### PyPTO 操作使用
-```
+```text
 ✓ pl.matmul      - 矩阵乘法
 ### PyPTO 环境
-```
+```text
 ✓ PyPTO 已安装 (v0.1.1)
````

Also applies to: 36-42, 45-56

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/TEST_REPORT.md` around lines 29 - 33, Several fenced code
blocks in the TEST_REPORT content (for example the blocks containing "✓ Kimi K2:
语法检查通过 (26,114 字节)", "✓ pl.matmul      - 矩阵乘法", and "✓ PyPTO 已安装 (v0.1.1)") are
missing language specifiers; update each triple-backtick fence to include a
language tag such as text (e.g. change ``` to ```text) for all affected blocks
referenced around the shown snippets and the other ranges (36-42, 45-56) so
syntax highlighters and docs render correctly.
examples/qwen3_decode_expand_mixed_kernel.py (1)

452-456: Error handling flow allows both error messages to print.

When code_runner is not found, both the "COMPILE OK" message and the general error message could print since neither returns early.

🔧 Proposed fix

```diff
     if not result.passed and result.error and "code_runner" in result.error:
         print("Result: COMPILE OK — device run skipped (code_runner not found).")
-    if not result.passed and result.error:
+    elif not result.passed and result.error:
         print(f"Result: {result.error}")
     return result
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/qwen3_decode_expand_mixed_kernel.py` around lines 452 - 456, The two
consecutive ifs allow both messages to print when the error contains
"code_runner"; update the control flow so the "COMPILE OK — device run skipped
(code_runner not found)" branch is exclusive: either convert the first condition
(if not result.passed and result.error and "code_runner" in result.error) to
return result immediately after printing, or change the second check to an elif
(elif not result.passed and result.error) so only one message prints; locate the
branch around the result variable where those two ifs are defined and apply one
of these fixes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7bc3d089-f51b-4882-93aa-97fa42ee14f8

📥 Commits

Reviewing files that changed from the base of the PR and between a5a5940 and 23b963b.

📒 Files selected for processing (39)
  • examples/batch_hash_lookup_dump/passes_dump/00_frontend.py
  • examples/batch_hash_lookup_dump/passes_dump/01_after_UnrollLoops.py
  • examples/batch_hash_lookup_dump/passes_dump/02_after_ConvertToSSA.py
  • examples/batch_hash_lookup_dump/passes_dump/03_after_FlattenCallExpr.py
  • examples/batch_hash_lookup_dump/passes_dump/04_after_SplitChunkedLoops.py
  • examples/batch_hash_lookup_dump/passes_dump/05_after_InterchangeChunkLoops.py
  • examples/batch_hash_lookup_dump/passes_dump/06_after_RunVerifier.py
  • examples/batch_hash_lookup_dump/passes_dump/07_after_OutlineIncoreScopes.py
  • examples/batch_hash_lookup_dump/passes_dump/08_after_ExpandMixedKernel.py
  • examples/batch_hash_lookup_dump/passes_dump/09_after_ConvertTensorToBlockOps.py
  • examples/batch_hash_lookup_dump/passes_dump/10_after_InitMemRef.py
  • examples/batch_hash_lookup_dump/passes_dump/11_after_MemoryReuse.py
  • examples/batch_hash_lookup_dump/passes_dump/12_after_InsertSync.py
  • examples/batch_hash_lookup_dump/passes_dump/13_after_AllocateMemoryAddr.py
  • examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/00_frontend.py
  • examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/01_after_UnrollLoops.py
  • examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/02_after_ConvertToSSA.py
  • examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/03_after_FlattenCallExpr.py
  • examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/04_after_SplitChunkedLoops.py
  • examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/05_after_InterchangeChunkLoops.py
  • examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/06_after_OutlineIncoreScopes.py
  • examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/07_after_OutlineClusterScopes.py
  • examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/08_after_ExpandMixedKernel.py
  • examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/09_after_ConvertTensorToTileOps.py
  • examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/10_after_FlattenTileNdTo2D.py
  • examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/11_after_InferTileMemorySpace.py
  • examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/12_after_ResolveTransposeLayout.py
  • examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/13_after_InitMemRef.py
  • examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/14_after_MemoryReuse.py
  • examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/15_after_AllocateMemoryAddr.py
  • examples/models/IMPLEMENTATION_REPORT.md
  • examples/models/README.md
  • examples/models/TEST_REPORT.md
  • examples/models/deepseek_v3_2/deepseek_v3_2_decode_front.py
  • examples/models/kimi/kimi_k2_decode.py
  • examples/models/milm/milm_decode.py
  • examples/qwen3/qwen3_32b_decode_tilelet.py
  • examples/qwen3/qwen3_tilelet.md
  • examples/qwen3_decode_expand_mixed_kernel.py

```python
if active_count_s != 0:
    round_has_active: pl.Scalar[pl.INDEX] = 1
    key_match: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.cmp(cand_key, keys_tile, cmp_type=0)
    hit_mask: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.and(active_mask, key_match)
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
python - <<'PY'
import ast
from pathlib import Path

root = Path("examples/batch_hash_lookup_dump/passes_dump")
for path in sorted(root.glob("*.py")):
    try:
        ast.parse(path.read_text(), filename=str(path))
    except SyntaxError as exc:
        print(f"{path}:{exc.lineno}:{exc.offset}: {exc.msg}")
PY
```

Repository: hw-native-sys/pypto-lib

Length of output: 1373


🏁 Script executed:

```bash
#!/bin/bash
# Check the actual line 37 in the first problematic file
head -40 examples/batch_hash_lookup_dump/passes_dump/00_frontend.py | tail -5
```

Length of output: 571

🏁 Script executed:

```bash
#!/bin/bash
# Get more context around the issue - let's see the full function/block
sed -n '30,45p' examples/batch_hash_lookup_dump/passes_dump/00_frontend.py
```

Length of output: 1168

🏁 Script executed:

```bash
#!/bin/bash
# Check if these are generated files (look for generator markers/comments)
head -20 examples/batch_hash_lookup_dump/passes_dump/00_frontend.py
```

Length of output: 1387

🏁 Script executed:

```bash
#!/bin/bash
# Look for the source generator/origin of these files
find examples/batch_hash_lookup_dump -type f -name "*.py" | head -20
# Check directory structure
ls -la examples/batch_hash_lookup_dump/
```

Length of output: 1230

🏁 Script executed:

```bash
#!/bin/bash
# Look for the compiler/generator that creates these pass dumps
find examples/batch_hash_lookup_dump -type f ! -name "*.py" | head -20
```

Length of output: 49

🏁 Script executed:

```bash
#!/bin/bash
# Search for references to "batch_hash_lookup" to find the original source
rg "batch_hash_lookup" examples/batch_hash_lookup_dump/ -l --type py
```

Length of output: 1061

🏁 Script executed:

```bash
#!/bin/bash
# Search for code that might generate these dumps or emit pl.tensor.and
rg "pl\.tensor\.and|tensor\.and" --type py | head -20
```

Length of output: 3786

🏁 Script executed:

```bash
#!/bin/bash
# Check if there's a generator script or configuration
ls -la examples/batch_hash_lookup_dump/
```

Length of output: 235

🏁 Script executed:

```bash
#!/bin/bash
# Search for files that mention "passes_dump" to find the generator
rg "passes_dump" --type py | head -20
```

Length of output: 49

🏁 Script executed:

```bash
#!/bin/bash
# Look for the original program source (likely in examples/)
find examples -type f -name "*.py" ! -path "*/passes_dump/*" | head -20
```

Length of output: 928

🏁 Script executed:

```bash
#!/bin/bash
# Search for compiler or generator code
rg "batch_hash_lookup|BatchHashLookup" --type py -l | grep -v passes_dump
```

Length of output: 49

🏁 Script executed:

```bash
#!/bin/bash
# Check if there's a test or runner script
find . -type f -name "*.py" -path "*/examples/*" ! -path "*/passes_dump/*" | xargs grep -l "passes_dump\|batch_hash" 2>/dev/null | head -10
```

Length of output: 49

🏁 Script executed:

```bash
#!/bin/bash
# Search for batch_hash_lookup in all file types
rg "batch_hash_lookup|BatchHashLookup" -t all 2>/dev/null | grep -v "passes_dump" | head -20
```

Length of output: 49

🏁 Script executed:

```bash
#!/bin/bash
# Check if there's a build script or Makefile that generates these
find . -name "Makefile" -o -name "*.mk" -o -name "CMakeLists.txt" -o -name "setup.py" -o -name "build*.py" | xargs grep -l "passes_dump\|batch_hash" 2>/dev/null
```

Length of output: 51

🏁 Script executed:

```bash
#!/bin/bash
# Look for any Python code that generates tensor operations
rg "tensor\.and|tensor\.ands" --type py -B 5 -A 5 | head -50
```

Length of output: 8625

🏁 Script executed:

```bash
#!/bin/bash
# Check if there's a compiler or DSL-to-Python generator
find . -type f \( -name "*.py" -o -name "*.cpp" -o -name "*.cc" -o -name "*.rs" \) -path "*/compiler/*" -o -path "*/codegen/*" | head -20
```

Length of output: 49


Fix the reserved keyword usage in tensor operation calls and regenerate pass dumps.

The source compiler is emitting pl.tensor.and(...) and pl.tensor.ands(...) calls, which use Python's reserved keyword and as a method name. This makes all 13 pass-dump files unparsable: lines 37 (and similar lines in later passes) contain hit_mask: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.and(active_mask, key_match), which is invalid Python syntax.

Verify the issue with:

```python
import ast
from pathlib import Path

root = Path("examples/batch_hash_lookup_dump/passes_dump")
for path in sorted(root.glob("*.py")):
    try:
        ast.parse(path.read_text(), filename=str(path))
    except SyntaxError as exc:
        print(f"{path}:{exc.lineno}:{exc.offset}: {exc.msg}")
```

The problem persists across all compilation stages. Fix the tensor operation names in the code generator (replace and/ands with safe alternatives like bit_and/bit_ands or similar), regenerate all pass-dump artifacts, and add a Python syntax validation check to CI to prevent similar regressions.
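
The root cause is visible with the parser alone; `bit_and` below is just one candidate replacement name, as suggested above:

```python
import ast

# `and` is a reserved keyword, so it can never appear as an attribute name:
try:
    ast.parse("hit_mask = pl.tensor.and(active_mask, key_match)")
    print("parsed")
except SyntaxError:
    print("SyntaxError: keyword used as attribute")

# Any non-keyword spelling parses fine:
ast.parse("hit_mask = pl.tensor.bit_and(active_mask, key_match)")
print("bit_and parses")
```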

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 37-37: Expected an identifier, but found a keyword and that cannot be used here

(invalid-syntax)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup_dump/passes_dump/00_frontend.py` at line 37, The
generated pass-dump uses Python reserved names pl.tensor.and / pl.tensor.ands
(e.g., the expression assigning hit_mask from pl.tensor.and(active_mask,
key_match)), which produces invalid syntax; update the code generator that emits
tensor ops so it emits non-keyword names such as pl.tensor.bit_and and
pl.tensor.bit_ands (or another safe mapping) wherever pl.tensor.and /
pl.tensor.ands are produced, then regenerate all pass-dump artifacts (so lines
assigning hit_mask from active_mask and key_match are fixed) and add a CI step
that validates each generated .py with ast.parse (or a simple python -m
py_compile) to catch syntax errors in future dumps.
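
The CI guard suggested above can be a few lines; `find_syntax_errors` is a hypothetical name, and the helper simply walks a directory tree and parses every `.py` file:

```python
import ast
from pathlib import Path

def find_syntax_errors(root: str) -> list[str]:
    # Parse every generated .py under `root` and collect all syntax
    # errors instead of stopping at the first failure.
    errors = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            ast.parse(path.read_text(), filename=str(path))
        except SyntaxError as exc:
            errors.append(f"{path}:{exc.lineno}: {exc.msg}")
    return errors
```

A CI job would call this on the dump directories and fail when the returned list is non-empty.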

```python
value_ptr_out_4: pl.Tensor[[1024, 64, 32], pl.INT32] = pl.yield_(value_ptr_out_5)
return value_ptr_out_4

@pl.function(type=pl.FunctionType.InCore)
def batch_hash_lookup_incore_1(self, hash_pool_0: pl.Tensor[[64, 128, 32], pl.INT32], probe_0: pl.Scalar[pl.INDEX], round_has_active_0: pl.Scalar[pl.INDEX], search_key_0: pl.Tensor[[1024, 64, 32], pl.INT32], ti_0: pl.Scalar[pl.INDEX], ti_iter_1: pl.Scalar[pl.INDEX], value_ptr_out_2: pl.Tensor[[1024, 64, 32], pl.INT32], value_ptr_out_iter_6: pl.Tensor[[1024, 64, 32], pl.INT32]) -> tuple[pl.Scalar[pl.INDEX], pl.Scalar[pl.INDEX], pl.Scalar[pl.INDEX], pl.Tensor[[1024, 64, 32], pl.INT32]]:
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
python - <<'PY'
import re
from pathlib import Path

path = Path("examples/batch_hash_lookup_dump/passes_dump/07_after_OutlineIncoreScopes.py")
lines = path.read_text().splitlines()

inside = False
for lineno, line in enumerate(lines, 1):
    if re.match(r"\s*def batch_hash_lookup\(", line):
        inside = True
    elif inside and re.match(r"\s*@pl\.function", line):
        break
    if inside and "ti_0" in line:
        print(f"{lineno}: {line}")
PY
```

Repository: hw-native-sys/pypto-lib

Length of output: 477

🏁 Script executed:

```bash
cat -n examples/batch_hash_lookup_dump/passes_dump/07_after_OutlineIncoreScopes.py | head -70
```

Length of output: 6812


Fix undefined ti_0 in orchestration function.

Lines 52 and 54 reference ti_0, which is never defined in the batch_hash_lookup function. This causes a NameError at runtime. Either initialize ti_0 (for example, ti_0 = 0) or remove it from the init_values and function call if unused.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup_dump/passes_dump/07_after_OutlineIncoreScopes.py`
at line 15, In the orchestration function batch_hash_lookup you reference ti_0
but it is never defined; fix by either initializing ti_0 (e.g., ti_0 = 0) before
building init_values and before calling batch_hash_lookup_incore_1, or if ti_0
is unused remove ti_0 from init_values and from the call to
batch_hash_lookup_incore_1 and from the batch_hash_lookup function signature so
no NameError occurs; update all places that construct init_values and the call
site to match the chosen approach.

```python
attn_out_3: pl.Tensor[[16, 5120], pl.FP32] = pl.tensor.assemble(attn_out_iter_1, attn_row_iter_2_outer_rv, [b_0, 0])
attn_out_2, k_cache_2, v_cache_2 = pl.yield_(attn_out_3, k_cache_4, v_cache_4)
with pl.auto_incore():
    for b0_1, (gamma_iter_6, inv_rms_iter_1, k0_iter_13, kb_iter_9, normed_iter_6, ob_iter_2, out_iter_1, sq_sum_iter_5, x_chunk_iter_13) in pl.range(0, 16, 4, init_values=(gamma_iter_1_outer_rv, inv_rms_0, k0_2, kb_2, normed_iter_1_outer_rv, ob_1, out_0, sq_sum_3, x_chunk_2)):
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Verify that ob_1 is not defined before its use at line 185
rg -n "ob_1\s*=" examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/04_after_SplitChunkedLoops.py | head -20
```

Repository: hw-native-sys/pypto-lib

Length of output: 49

🏁 Script executed:

```bash
# Check if the file exists and read the content around line 185
wc -l examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/04_after_SplitChunkedLoops.py
```

Length of output: 160

🏁 Script executed:

```bash
# Read context around line 185
sed -n '180,190p' examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/04_after_SplitChunkedLoops.py
```

Length of output: 1546

🏁 Script executed:

```bash
# Check the beginning of the file to understand what it is
head -30 examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/04_after_SplitChunkedLoops.py
```

Length of output: 3143

🏁 Script executed:

```bash
# Search for where ob_1 is actually used (not just assigned)
rg -n "ob_1" examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/04_after_SplitChunkedLoops.py
```

Length of output: 1218


Critical: Undefined name ob_1 in init_values at line 185.

The variable ob_1 is used in the init_values tuple of the pl.range() call but is never defined in the file. This would cause a NameError at runtime.

The related variables ob_1_out and ob_1_in are defined as loop variables at lines 51-52 in a different scope. This appears to be a bug in the compiler pass that generates this file—the chunked loop splitting transformation has not properly carried through variable references from the prior pass.

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 185-185: Loop control variable inv_rms_iter_1 not used within loop body

Rename unused inv_rms_iter_1 to _inv_rms_iter_1

(B007)


[warning] 185-185: Loop control variable ob_iter_2 not used within loop body

Rename unused ob_iter_2 to _ob_iter_2

(B007)


[warning] 185-185: Loop control variable sq_sum_iter_5 not used within loop body

Rename unused sq_sum_iter_5 to _sq_sum_iter_5

(B007)


[error] 185-185: Undefined name ob_1

(F821)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/04_after_SplitChunkedLoops.py`
at line 185, The init_values tuple in the pl.range(...) call references an
undefined name ob_1; replace it with the correct scoped variable that was meant
to be carried into this chunked loop (likely ob_1_out or ob_1_in from the
earlier loop) or ensure ob_1 is propagated into this scope by renaming the loop
variable consistently. Locate the original producer loop that defines ob_1_out /
ob_1_in and either (A) change the init_values entry in the pl.range call to use
ob_1_out (or ob_1_in) so the symbol exists, or (B) update the prior pass that
produced this file to pass ob_1 through the split (i.e., ensure the
SplitChunkedLoops transform uses the same variable name across scopes), and
verify pl.range init_values now references a defined variable.

```python
out_5: pl.Tensor[[16, 5120], pl.BF16] = pl.tensor.assemble(out_iter_3_inner, _t60, [b0_1, o0_6])
o0_iter_4_inner_rv, out_iter_3_inner_rv = pl.yield_(o0_6, out_5)
o0_iter_4_outer_rv, out_iter_3_outer_rv = pl.yield_(o0_iter_4_inner_rv, out_iter_3_inner_rv)
gamma_7, inv_rms_2, k0_14, kb_10, normed_7, ob_3, out_2, sq_sum_6, x_chunk_14 = pl.yield_(gamma_9, inv_rms_3, k0_27, kb_17, normed_9, ob_6, out_iter_3_outer_rv, sq_sum_10, x_chunk_19)
```

⚠️ Potential issue | 🔴 Critical

Critical: Undefined name ob_6 in yield statement.

Similar to ob_1, the variable ob_6 is used in the pl.yield_() call but is never defined in this scope. This is a second instance of the same compiler pass bug.

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 274-274: Unpacked variable gamma_7 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 274-274: Unpacked variable inv_rms_2 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 274-274: Unpacked variable k0_14 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 274-274: Unpacked variable kb_10 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 274-274: Unpacked variable normed_7 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 274-274: Unpacked variable ob_3 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 274-274: Unpacked variable sq_sum_6 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 274-274: Unpacked variable x_chunk_14 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[error] 274-274: Undefined name ob_6

(F821)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/04_after_SplitChunkedLoops.py`
at line 274, pl.yield_ is referencing an undefined symbol ob_6 in the tuple
(gamma_9, inv_rms_3, k0_27, kb_17, normed_9, ob_6, out_iter_3_outer_rv,
sq_sum_10, x_chunk_19); fix by either introducing/assigning ob_6 in the
surrounding scope before this pl.yield_ call or by replacing ob_6 with the
correct in-scope variable (e.g., ob_3 or the intended buffer name used earlier)
so the names used in the yield tuple match the actual local variables; update
all related occurrences so pl.yield_, the producer assignments, and subsequent
consumers use the same symbol consistently.

```python
with pl.incore():
    attn_out_3: pl.Tensor[[16, 5120], pl.FP32] = pl.tensor.assemble(attn_out_iter_1, attn_row_iter_2_outer_l0_rv, [b_0, 0])
attn_out_2, k_cache_2, v_cache_2 = pl.yield_(attn_out_3, k_cache_4, v_cache_4)
for b0_1, (gamma_iter_6, inv_rms_iter_1, k0_iter_13, kb_iter_9, normed_iter_6, ob_iter_2, out_iter_1, sq_sum_iter_5, x_chunk_iter_13) in pl.range(0, 16, 4, init_values=(gamma_iter_1_outer_l0_rv, inv_rms_0, k0_2, kb_2, normed_iter_1_outer_l0_rv, ob_1, out_0, sq_sum_3, x_chunk_2)):
```

⚠️ Potential issue | 🔴 Critical

Dead loop-carried scalars are seeded from names that do not exist.

ob_iter_2 and ob_3 are never read, but their init/yield tuples still use ob_1 and ob_6, which are undefined in this function. That turns the post-attention phase into a NameError path. Remove those unused scalars from the loop state.

Also applies to: 280-280
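A minimal plain-Python analogue (not the `pl` DSL; all names here are hypothetical) of this failure mode: when loop state is seeded from a name that was never defined, the NameError fires the moment the init tuple is built, before the loop body ever runs, which is why trimming the dead state out of the tuple is the fix rather than renaming it.

```python
# Seeding loop state from an undefined name (like ob_1/ob_6 above) fails
# immediately when the init tuple is constructed.
def seed_dead_state():
    try:
        init_values = (0, ob_1)  # ob_1 is undefined in this scope -> NameError
    except NameError:
        return "NameError"
    return init_values

def seed_trimmed_state():
    # The fix: carry only state the loop body actually reads and yields.
    acc = 0
    for i in range(4):
        acc += i
    return acc
```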

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 188-188: Loop control variable inv_rms_iter_1 not used within loop body

Rename unused inv_rms_iter_1 to _inv_rms_iter_1

(B007)


[warning] 188-188: Loop control variable ob_iter_2 not used within loop body

Rename unused ob_iter_2 to _ob_iter_2

(B007)


[warning] 188-188: Loop control variable sq_sum_iter_5 not used within loop body

Rename unused sq_sum_iter_5 to _sq_sum_iter_5

(B007)


[error] 188-188: Undefined name ob_1

(F821)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/05_after_InterchangeChunkLoops.py`
at line 188, The loop's dead loop-carried scalars ob_iter_2 and ob_3 are unused
but their init values reference undefined names ob_1 and ob_6, causing
NameError; remove those unused scalars from the loop state and init/yield
tuples: update the pl.range unpacking (e.g., remove ob_iter_2 and any other
unused ob_iter_X from the left-hand tuple), remove the corresponding ob_1/ob_6
entries from the init_values tuple passed to pl.range, and make the same removal
in the other occurrence mentioned (the similar loop at the later location) so
the loop state and init/yield tuples remain consistent.

o0_iter_4_outer_l1_rv: pl.Scalar[pl.INDEX] = ret_5[0]
out_iter_3_outer_l1_rv: pl.Tensor[[16, 5120], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 163840, 129)] = ret_5[1]
o0_iter_4_outer_l0_rv, out_iter_3_outer_l0_rv = pl.yield_(o0_iter_4_outer_l1_rv, out_iter_3_outer_l1_rv)
gamma_7, inv_rms_2, k0_14, kb_10, normed_7, ob_3, out_2, sq_sum_6, x_chunk_14 = pl.yield_(gamma_9, inv_rms_3, k0_27, kb_17, normed_9, ob_6, out_iter_3_outer_l0_rv, sq_sum_10, x_chunk_19)

⚠️ Potential issue | 🔴 Critical

Undefined name ob_6 used in pl.yield_.

The variable ob_6 is referenced in the pl.yield_ call but is never defined in this scope. This would cause a NameError at runtime.

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 331-331: Unpacked variable gamma_7 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 331-331: Unpacked variable inv_rms_2 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 331-331: Unpacked variable k0_14 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 331-331: Unpacked variable kb_10 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 331-331: Unpacked variable normed_7 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 331-331: Unpacked variable ob_3 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 331-331: Unpacked variable sq_sum_6 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 331-331: Unpacked variable x_chunk_14 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[error] 331-331: Undefined name ob_6

(F821)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/13_after_InitMemRef.py`
at line 331, The pl.yield_ call is referencing an undefined variable ob_6;
update the yield tuple to use the correct defined variable (likely the intended
buffer variable such as ob_3 or the previously-defined name in this scope) or
add the missing definition for ob_6 before this call; specifically check the
surrounding code for the intended symbol (e.g., ob_3/ob_# variables) and either
replace ob_6 with that existing variable name or create/assign ob_6 prior to the
pl.yield_ invocation so pl.yield_(gamma_9, inv_rms_3, k0_27, kb_17, normed_9,
<correct_ob>, out_iter_3_outer_l0_rv, sq_sum_10, x_chunk_19) uses only defined
names.

Comment on lines +47 to +74
def qwen3_decode_layer_incore_2(self, b0_0: pl.Scalar[pl.INDEX], gamma_iter_1_outer_l0: pl.Tensor[[1, 256], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 1024, 0)], hidden_states_0: pl.Tensor[[16, 5120], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 163840, 1)], input_rms_weight_0: pl.Tensor[[1, 5120], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 20480, 2)], inv_rms_tile_0: pl.Tensor[[4, 1], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 16, 3)], k0_iter_8_outer_l0: pl.Scalar[pl.INDEX], k_proj_iter_3_outer_l0: pl.Tensor[[16, 1024], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 32768, 4)], kb_iter_6_outer_l0: pl.Scalar[pl.INDEX], normed_iter_1_outer_l0: pl.Tensor[[4, 256], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 4096, 5)], ob_1_out: pl.Scalar[pl.INDEX], v_proj_iter_3_outer_l0: pl.Tensor[[16, 1024], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 32768, 6)], wk_0: pl.Tensor[[5120, 1024], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 10485760, 7)], wv_0: pl.Tensor[[5120, 1024], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 10485760, 8)], x_chunk_bf16_iter_1_outer_l0: pl.Tensor[[4, 256], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 2048, 9)], x_chunk_iter_8_outer_l0: pl.Tensor[[16, 256], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 16384, 10)]) -> tuple[pl.Tensor[[1, 256], pl.FP32], pl.Scalar[pl.INDEX], pl.Tensor[[16, 1024], pl.BF16], pl.Scalar[pl.INDEX], pl.Tensor[[4, 256], pl.FP32], pl.Tensor[[16, 1024], pl.BF16], pl.Tensor[[4, 256], pl.BF16], pl.Tensor[[16, 256], pl.FP32]]:
for ob_1_in, (gamma_iter_1_outer_l1, k0_iter_8_outer_l1, k_proj_iter_3_outer_l1, kb_iter_6_outer_l1, normed_iter_1_outer_l1, v_proj_iter_3_outer_l1, x_chunk_iter_8_outer_l1, x_chunk_bf16_iter_1_outer_l1) in pl.parallel(8, init_values=(gamma_iter_1_outer_l0, k0_iter_8_outer_l0, k_proj_iter_3_outer_l0, kb_iter_6_outer_l0, normed_iter_1_outer_l0, v_proj_iter_3_outer_l0, x_chunk_iter_8_outer_l0, x_chunk_bf16_iter_1_outer_l0)):
kv0_0: pl.Scalar[pl.INDEX] = (0 + (ob_1_out * 8 + ob_1_in) * 1) * 32
k_acc_0: pl.Tensor[[4, 32], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 512, 17)] = pl.tensor.create([4, 32], dtype=pl.FP32, layout=pl.TensorLayout.ND)
v_acc_0: pl.Tensor[[4, 32], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 512, 18)] = pl.tensor.create([4, 32], dtype=pl.FP32, layout=pl.TensorLayout.ND)
k_acc_1: pl.Tensor[[4, 32], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 512, 19)] = pl.tensor.muls(k_acc_0, 0.0)
v_acc_1: pl.Tensor[[4, 32], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 512, 20)] = pl.tensor.muls(v_acc_0, 0.0)
for kb_8, (gamma_iter_3, k0_iter_10, k_acc_iter_2, normed_iter_3, v_acc_iter_2, x_chunk_iter_10, x_chunk_bf16_iter_3) in pl.range(20, init_values=(gamma_iter_1_outer_l1, k0_iter_8_outer_l1, k_acc_1, normed_iter_1_outer_l1, v_acc_1, x_chunk_iter_8_outer_l1, x_chunk_bf16_iter_1_outer_l1)):
k0_12: pl.Scalar[pl.INDEX] = kb_8 * 256
x_chunk_bf16_5: pl.Tensor[[4, 256], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 2048, 21)] = pl.tensor.slice(hidden_states_0, [4, 256], [b0_0, k0_12])
x_chunk_12: pl.Tensor[[4, 256], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 4096, 22)] = pl.tensor.cast(x_chunk_bf16_5, target_type=pl.FP32, mode='round')
gamma_5: pl.Tensor[[1, 256], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 1024, 23)] = pl.tensor.slice(input_rms_weight_0, [1, 256], [0, k0_12])
_t9: pl.Tensor[[4, 256], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 4096, 24)] = pl.tensor.row_expand_mul(x_chunk_12, inv_rms_tile_0)
normed_5: pl.Tensor[[4, 256], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 4096, 25)] = pl.tensor.col_expand_mul(_t9, gamma_5)
normed_bf16_0: pl.Tensor[[4, 256], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 2048, 26)] = pl.tensor.cast(normed_5, target_type=pl.BF16, mode='round')
wk_chunk_0: pl.Tensor[[256, 32], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 16384, 27)] = pl.tensor.slice(wk_0, [256, 32], [k0_12, kv0_0])
wv_chunk_0: pl.Tensor[[256, 32], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 16384, 28)] = pl.tensor.slice(wv_0, [256, 32], [k0_12, kv0_0])
_t10: pl.Tensor[[4, 32], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 256, 29)] = pl.tensor.matmul(normed_bf16_0, wk_chunk_0, a_trans=False, b_trans=False, c_matrix_nz=False)
k_acc_4: pl.Tensor[[4, 32], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 512, 30)] = pl.tensor.add(k_acc_iter_2, _t10)
_t11: pl.Tensor[[4, 32], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 256, 31)] = pl.tensor.matmul(normed_bf16_0, wv_chunk_0, a_trans=False, b_trans=False, c_matrix_nz=False)
v_acc_4: pl.Tensor[[4, 32], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 512, 32)] = pl.tensor.add(v_acc_iter_2, _t11)
gamma_4, k0_11, k_acc_3, normed_4, v_acc_3, x_chunk_11, x_chunk_bf16_4 = pl.yield_(gamma_5, k0_12, k_acc_4, normed_5, v_acc_4, x_chunk_12, x_chunk_bf16_5)
_t12: pl.Tensor[[4, 32], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 256, 39)] = pl.tensor.cast(k_acc_3, target_type=pl.BF16, mode='round')
k_proj_5: pl.Tensor[[16, 1024], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 32768, 40)] = pl.tensor.assemble(k_proj_iter_3_outer_l1, _t12, [b0_0, kv0_0])
_t13: pl.Tensor[[4, 32], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 256, 41)] = pl.tensor.cast(v_acc_3, target_type=pl.BF16, mode='round')
v_proj_5: pl.Tensor[[16, 1024], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 32768, 42)] = pl.tensor.assemble(v_proj_iter_3_outer_l1, _t13, [b0_0, kv0_0])
gamma_iter_1_outer_l1_rv, k0_iter_8_outer_l1_rv, k_proj_iter_3_outer_l1_rv, kb_iter_6_outer_l1_rv, normed_iter_1_outer_l1_rv, v_proj_iter_3_outer_l1_rv, x_chunk_iter_8_outer_l1_rv, x_chunk_bf16_iter_1_outer_l1_rv = pl.yield_(gamma_4, k0_11, k_proj_5, kb_8, normed_4, v_proj_5, x_chunk_11, x_chunk_bf16_4)
return gamma_iter_1_outer_l1_rv, k0_iter_8_outer_l1_rv, k_proj_iter_3_outer_l1_rv, kb_iter_6_outer_l1_rv, normed_iter_1_outer_l1_rv, v_proj_iter_3_outer_l1_rv, x_chunk_bf16_iter_1_outer_l1_rv, x_chunk_iter_8_outer_l1_rv

⚠️ Potential issue | 🔴 Critical

Outlined loop state still references locals that no longer exist.

qwen3_decode_layer_incore_2 is still carrying gamma/normed/x_chunk/k0/kb state even though those values are overwritten before their first read. After outlining, the caller no longer has gamma_0, normed_0, or x_chunk_bf16_0, so the first broken path is at Line 216. The same stale-state pattern shows up again at Lines 267, 299, and 331 with ob_1, o0_0, and ob_6. Please collapse the helper signatures and loop tuples to the values that are actually consumed and returned.

Also applies to: 267-331

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 48-48: Loop control variable kb_iter_6_outer_l1 not used within loop body

Rename unused kb_iter_6_outer_l1 to _kb_iter_6_outer_l1

(B007)


[warning] 54-54: Loop control variable gamma_iter_3 not used within loop body

Rename unused gamma_iter_3 to _gamma_iter_3

(B007)


[warning] 54-54: Loop control variable k0_iter_10 not used within loop body

Rename unused k0_iter_10 to _k0_iter_10

(B007)


[warning] 54-54: Loop control variable normed_iter_3 not used within loop body

Rename unused normed_iter_3 to _normed_iter_3

(B007)


[warning] 54-54: Loop control variable x_chunk_iter_10 not used within loop body

Rename unused x_chunk_iter_10 to _x_chunk_iter_10

(B007)


[warning] 54-54: Loop control variable x_chunk_bf16_iter_3 not used within loop body

Rename unused x_chunk_bf16_iter_3 to _x_chunk_bf16_iter_3

(B007)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/14_after_MemoryReuse.py`
around lines 47 - 74, The outlined helper qwen3_decode_layer_incore_2 keeps
carrying loop-state variables (gamma, normed, x_chunk_bf16, k0, kb and similar
ob_*/o0_*/ob_6 states) that are overwritten before use; remove those stale
values from the helper's parameter list, the pl.parallel init_values tuple, the
pl.range/init_values tuples, and all corresponding pl.yield_/assemble return
tuples so only actually consumed/returned symbols remain (e.g. keep
k_proj_iter_3_outer_l1, v_proj_iter_3_outer_l1, k_acc/v_acc accumulators, and
the final returned tuple items). Update the pl.yield_ calls and the
assemble/return sites in qwen3_decode_layer_incore_2 to match the reduced state
(remove gamma_*, normed_*, x_chunk_*, k0_*, kb_* and ob_*/o0_*/ob_6 references)
so caller and callee signatures align.

Comment on lines +47 to +74
def qwen3_decode_layer_incore_2(self, b0_0: pl.Scalar[pl.INDEX], gamma_iter_1_outer_l0: pl.Tensor[[1, 256], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 1024, 0)], hidden_states_0: pl.Tensor[[16, 5120], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 163840, 1)], input_rms_weight_0: pl.Tensor[[1, 5120], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 20480, 2)], inv_rms_tile_0: pl.Tensor[[4, 1], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 16, 3)], k0_iter_8_outer_l0: pl.Scalar[pl.INDEX], k_proj_iter_3_outer_l0: pl.Tensor[[16, 1024], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 32768, 4)], kb_iter_6_outer_l0: pl.Scalar[pl.INDEX], normed_iter_1_outer_l0: pl.Tensor[[4, 256], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 4096, 5)], ob_1_out: pl.Scalar[pl.INDEX], v_proj_iter_3_outer_l0: pl.Tensor[[16, 1024], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 32768, 6)], wk_0: pl.Tensor[[5120, 1024], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 10485760, 7)], wv_0: pl.Tensor[[5120, 1024], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 10485760, 8)], x_chunk_bf16_iter_1_outer_l0: pl.Tensor[[4, 256], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 2048, 9)], x_chunk_iter_8_outer_l0: pl.Tensor[[16, 256], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 16384, 10)]) -> tuple[pl.Tensor[[1, 256], pl.FP32], pl.Scalar[pl.INDEX], pl.Tensor[[16, 1024], pl.BF16], pl.Scalar[pl.INDEX], pl.Tensor[[4, 256], pl.FP32], pl.Tensor[[16, 1024], pl.BF16], pl.Tensor[[4, 256], pl.BF16], pl.Tensor[[16, 256], pl.FP32]]:
for ob_1_in, (gamma_iter_1_outer_l1, k0_iter_8_outer_l1, k_proj_iter_3_outer_l1, kb_iter_6_outer_l1, normed_iter_1_outer_l1, v_proj_iter_3_outer_l1, x_chunk_iter_8_outer_l1, x_chunk_bf16_iter_1_outer_l1) in pl.parallel(8, init_values=(gamma_iter_1_outer_l0, k0_iter_8_outer_l0, k_proj_iter_3_outer_l0, kb_iter_6_outer_l0, normed_iter_1_outer_l0, v_proj_iter_3_outer_l0, x_chunk_iter_8_outer_l0, x_chunk_bf16_iter_1_outer_l0)):
kv0_0: pl.Scalar[pl.INDEX] = (0 + (ob_1_out * 8 + ob_1_in) * 1) * 32
k_acc_0: pl.Tensor[[4, 32], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 512, 17)] = pl.tensor.create([4, 32], dtype=pl.FP32, layout=pl.TensorLayout.ND)
v_acc_0: pl.Tensor[[4, 32], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 512, 18)] = pl.tensor.create([4, 32], dtype=pl.FP32, layout=pl.TensorLayout.ND)
k_acc_1: pl.Tensor[[4, 32], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 512, 19)] = pl.tensor.muls(k_acc_0, 0.0)
v_acc_1: pl.Tensor[[4, 32], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 512, 20)] = pl.tensor.muls(v_acc_0, 0.0)
for kb_8, (gamma_iter_3, k0_iter_10, k_acc_iter_2, normed_iter_3, v_acc_iter_2, x_chunk_iter_10, x_chunk_bf16_iter_3) in pl.range(20, init_values=(gamma_iter_1_outer_l1, k0_iter_8_outer_l1, k_acc_1, normed_iter_1_outer_l1, v_acc_1, x_chunk_iter_8_outer_l1, x_chunk_bf16_iter_1_outer_l1)):
k0_12: pl.Scalar[pl.INDEX] = kb_8 * 256
x_chunk_bf16_5: pl.Tensor[[4, 256], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 2048, 21)] = pl.tensor.slice(hidden_states_0, [4, 256], [b0_0, k0_12])
x_chunk_12: pl.Tensor[[4, 256], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 4096, 22)] = pl.tensor.cast(x_chunk_bf16_5, target_type=pl.FP32, mode='round')
gamma_5: pl.Tensor[[1, 256], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 1024, 23)] = pl.tensor.slice(input_rms_weight_0, [1, 256], [0, k0_12])
_t9: pl.Tensor[[4, 256], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 4096, 24)] = pl.tensor.row_expand_mul(x_chunk_12, inv_rms_tile_0)
normed_5: pl.Tensor[[4, 256], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 4096, 25)] = pl.tensor.col_expand_mul(_t9, gamma_5)
normed_bf16_0: pl.Tensor[[4, 256], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 2048, 26)] = pl.tensor.cast(normed_5, target_type=pl.BF16, mode='round')
wk_chunk_0: pl.Tensor[[256, 32], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 16384, 27)] = pl.tensor.slice(wk_0, [256, 32], [k0_12, kv0_0])
wv_chunk_0: pl.Tensor[[256, 32], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 16384, 28)] = pl.tensor.slice(wv_0, [256, 32], [k0_12, kv0_0])
_t10: pl.Tensor[[4, 32], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 256, 29)] = pl.tensor.matmul(normed_bf16_0, wk_chunk_0, a_trans=False, b_trans=False, c_matrix_nz=False)
k_acc_4: pl.Tensor[[4, 32], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 512, 30)] = pl.tensor.add(k_acc_iter_2, _t10)
_t11: pl.Tensor[[4, 32], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 256, 31)] = pl.tensor.matmul(normed_bf16_0, wv_chunk_0, a_trans=False, b_trans=False, c_matrix_nz=False)
v_acc_4: pl.Tensor[[4, 32], pl.FP32, pl.MemRef(pl.MemorySpace.DDR, -1, 512, 32)] = pl.tensor.add(v_acc_iter_2, _t11)
gamma_4, k0_11, k_acc_3, normed_4, v_acc_3, x_chunk_11, x_chunk_bf16_4 = pl.yield_(gamma_5, k0_12, k_acc_4, normed_5, v_acc_4, x_chunk_12, x_chunk_bf16_5)
_t12: pl.Tensor[[4, 32], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 256, 39)] = pl.tensor.cast(k_acc_3, target_type=pl.BF16, mode='round')
k_proj_5: pl.Tensor[[16, 1024], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 32768, 40)] = pl.tensor.assemble(k_proj_iter_3_outer_l1, _t12, [b0_0, kv0_0])
_t13: pl.Tensor[[4, 32], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 256, 41)] = pl.tensor.cast(v_acc_3, target_type=pl.BF16, mode='round')
v_proj_5: pl.Tensor[[16, 1024], pl.BF16, pl.MemRef(pl.MemorySpace.DDR, -1, 32768, 42)] = pl.tensor.assemble(v_proj_iter_3_outer_l1, _t13, [b0_0, kv0_0])
gamma_iter_1_outer_l1_rv, k0_iter_8_outer_l1_rv, k_proj_iter_3_outer_l1_rv, kb_iter_6_outer_l1_rv, normed_iter_1_outer_l1_rv, v_proj_iter_3_outer_l1_rv, x_chunk_iter_8_outer_l1_rv, x_chunk_bf16_iter_1_outer_l1_rv = pl.yield_(gamma_4, k0_11, k_proj_5, kb_8, normed_4, v_proj_5, x_chunk_11, x_chunk_bf16_4)
return gamma_iter_1_outer_l1_rv, k0_iter_8_outer_l1_rv, k_proj_iter_3_outer_l1_rv, kb_iter_6_outer_l1_rv, normed_iter_1_outer_l1_rv, v_proj_iter_3_outer_l1_rv, x_chunk_bf16_iter_1_outer_l1_rv, x_chunk_iter_8_outer_l1_rv

⚠️ Potential issue | 🔴 Critical

Outlined loop state still references locals that no longer exist.

qwen3_decode_layer_incore_2 is still carrying gamma/normed/x_chunk/k0/kb state even though those values are overwritten before their first read. After outlining, the caller no longer has gamma_0, normed_0, or x_chunk_bf16_0, so the first broken path is at Line 216. The same stale-state pattern shows up again at Lines 267, 299, and 331 with ob_1, o0_0, and ob_6. Please collapse the helper signatures and loop tuples to the values that are actually consumed and returned.

Also applies to: 267-331

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 48-48: Loop control variable kb_iter_6_outer_l1 not used within loop body

Rename unused kb_iter_6_outer_l1 to _kb_iter_6_outer_l1

(B007)


[warning] 54-54: Loop control variable gamma_iter_3 not used within loop body

Rename unused gamma_iter_3 to _gamma_iter_3

(B007)


[warning] 54-54: Loop control variable k0_iter_10 not used within loop body

Rename unused k0_iter_10 to _k0_iter_10

(B007)


[warning] 54-54: Loop control variable normed_iter_3 not used within loop body

Rename unused normed_iter_3 to _normed_iter_3

(B007)


[warning] 54-54: Loop control variable x_chunk_iter_10 not used within loop body

Rename unused x_chunk_iter_10 to _x_chunk_iter_10

(B007)


[warning] 54-54: Loop control variable x_chunk_bf16_iter_3 not used within loop body

Rename unused x_chunk_bf16_iter_3 to _x_chunk_bf16_iter_3

(B007)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/build_qwen3_decode_expand_mixed_kernel/passes_dump/15_after_AllocateMemoryAddr.py`
around lines 47 - 74, The outlined helper qwen3_decode_layer_incore_2 still
threads unused locals (gamma_iter_1..., normed_iter_1..., x_chunk_iter_8...,
x_chunk_bf16_iter_1..., k0_iter_8..., kb_iter_6...) through the pl.parallel
init_values, inner pl.range init_values/yield_ and the function's return tuple
even though those values are overwritten before use; remove these stale state
variables from the parallel/init_values tuple, from the inner loop's init_values
and yield_ tuples, and from the function signature/return so the helper only
carries the actual consumed/produced symbols (e.g., keep k_proj_iter_3_outer_l1,
v_proj_iter_3_outer_l1, k_acc/v_acc assembly outputs and the real loop indices
like ob_1_in/kv0_0), collapsing the helper signature and all
pl.yield_/pl.parallel tuples to match the real live values.

z00885570 added 2 commits March 25, 2026 19:16
Kimi K2 fixes:
- Add post_rms_weight parameter for post-attention RMSNorm (separate from input_rms_weight)
- Fix gate_scores computation to accumulate over HIDDEN_BLOCKS instead of overwriting
- Fix 3D expert weight slices with reshape to 2D before matmul operations

MiLM fixes:
- Reorder post_rms_weight parameter before MLP weights to match Qwen3 pattern

These changes address CI review comments from PR #46
- Remove Chinese characters from Kimi K2 docstring
- Remove Chinese characters from MiLM docstring
- Clean up __pycache__ files

This fixes CI encoding issues.
@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (2)
examples/models/kimi/kimi_k2_decode.py (2)

457-458: ⚠️ Potential issue | 🔴 Critical

Same slice dimension issue for w_down_experts.

The slice shape for w_down_experts also incorrectly uses NUM_EXPERTS_CFG as the first dimension.

🐛 Proposed fix
                             for dob in pl.parallel(0, Q_OUT_BLOCKS, 1, chunk=4):
                                 d0 = dob * Q_OUT_CHUNK
                                 down_prev = pl.slice(expert_out, [BATCH_TILE, Q_OUT_CHUNK], [0, d0])
                                 # Slice and reshape 3D down weights to 2D
-                                w_down_chunk = pl.slice(w_down_experts, [NUM_EXPERTS_CFG, MLP_OUT_CHUNK, Q_OUT_CHUNK], [exp_idx, o0, d0])
+                                w_down_chunk = pl.slice(w_down_experts, [1, MLP_OUT_CHUNK, Q_OUT_CHUNK], [exp_idx, o0, d0])
                                 w_down_chunk = pl.reshape(w_down_chunk, [MLP_OUT_CHUNK, Q_OUT_CHUNK])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/kimi/kimi_k2_decode.py` around lines 457 - 458, The slice for
w_down_experts uses the wrong first-dimension constant; update the pl.slice call
that produces w_down_chunk to use the experts dimension constant matching the
tensor shape (replace NUM_EXPERTS_CFG with the correct experts dimension
constant used elsewhere, e.g. NUM_EXPERTS) so the slice becomes
pl.slice(w_down_experts, [NUM_EXPERTS, MLP_OUT_CHUNK, Q_OUT_CHUNK], [exp_idx,
o0, d0]) before the pl.reshape; ensure the same constant is used consistently
with other slices of w_down_experts and matches the tensor's actual shape.

441-445: ⚠️ Potential issue | 🔴 Critical

Incorrect slice dimensions for expert weights cause out-of-bounds access.

The slice shapes still use NUM_EXPERTS_CFG as the first dimension instead of 1. This causes the slice to attempt extracting all experts starting from exp_idx, which goes out of bounds when exp_idx > 0. Additionally, the subsequent pl.reshape at lines 444-445 expects shape [1, K_CHUNK, MLP_OUT_CHUNK] but receives [NUM_EXPERTS_CFG, K_CHUNK, MLP_OUT_CHUNK].

🐛 Proposed fix
                             for kb in pl.range(HIDDEN_BLOCKS):
                                 k0 = kb * K_CHUNK
                                 post_chunk = pl.slice(post_norm_tile, [BATCH_TILE, K_CHUNK], [0, k0])
                                 # Slice 3D expert weights and reshape to 2D
-                                wg = pl.slice(w_gate_experts, [NUM_EXPERTS_CFG, K_CHUNK, MLP_OUT_CHUNK], [exp_idx, k0, o0])
-                                wu = pl.slice(w_up_experts, [NUM_EXPERTS_CFG, K_CHUNK, MLP_OUT_CHUNK], [exp_idx, k0, o0])
+                                wg = pl.slice(w_gate_experts, [1, K_CHUNK, MLP_OUT_CHUNK], [exp_idx, k0, o0])
+                                wu = pl.slice(w_up_experts, [1, K_CHUNK, MLP_OUT_CHUNK], [exp_idx, k0, o0])
                                 # Reshape from [1, K_CHUNK, MLP_OUT_CHUNK] to [K_CHUNK, MLP_OUT_CHUNK]
                                 wg = pl.reshape(wg, [K_CHUNK, MLP_OUT_CHUNK])
                                 wu = pl.reshape(wu, [K_CHUNK, MLP_OUT_CHUNK])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/kimi/kimi_k2_decode.py` around lines 441 - 445, The slices
for expert weights are using NUM_EXPERTS_CFG as the first dimension and thus
slice out the wrong range; update the pl.slice calls for w_gate_experts and
w_up_experts (symbols: wg, wu, pl.slice, w_gate_experts, w_up_experts, exp_idx,
k0, o0) to request shape [1, K_CHUNK, MLP_OUT_CHUNK] instead of
[NUM_EXPERTS_CFG, K_CHUNK, MLP_OUT_CHUNK], so each slice extracts only the
single expert at exp_idx, and keep the subsequent pl.reshape to [K_CHUNK,
MLP_OUT_CHUNK] to collapse the leading singleton dimension.
🧹 Nitpick comments (1)
examples/models/kimi/kimi_k2_decode.py (1)

80-80: Unused NUM_ACTIVE_CFG variable – all experts are processed.

NUM_ACTIVE_CFG is defined but never used. The current implementation processes all experts (line 426) rather than selecting only the top-K active experts as typical MoE architectures do. This is acceptable for a simplified example but may cause performance degradation at scale.

Consider adding top-K expert selection in a future iteration or removing the unused parameter to avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/kimi/kimi_k2_decode.py` at line 80, NUM_ACTIVE_CFG is defined
from num_active_experts but never used while the code currently processes all
experts; either remove this unused config to avoid confusion (delete
NUM_ACTIVE_CFG and the num_active_experts parameter if unused) or implement
top-K routing: before the expert application loop, compute top-k expert indices
based on gating scores (use num_active_experts/NUM_ACTIVE_CFG) and restrict the
expert iteration to those selected experts so only the top-K experts are
executed during the decode/expert-application phase.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@examples/models/kimi/kimi_k2_decode.py`:
- Around line 457-458: The slice for w_down_experts uses the wrong
first-dimension constant; update the pl.slice call that produces w_down_chunk to
use the experts dimension constant matching the tensor shape (replace
NUM_EXPERTS_CFG with the correct experts dimension constant used elsewhere, e.g.
NUM_EXPERTS) so the slice becomes pl.slice(w_down_experts, [NUM_EXPERTS,
MLP_OUT_CHUNK, Q_OUT_CHUNK], [exp_idx, o0, d0]) before the pl.reshape; ensure
the same constant is used consistently with other slices of w_down_experts and
matches the tensor's actual shape.
- Around line 441-445: The slices for expert weights are using NUM_EXPERTS_CFG
as the first dimension and thus slice out the wrong range; update the pl.slice
calls for w_gate_experts and w_up_experts (symbols: wg, wu, pl.slice,
w_gate_experts, w_up_experts, exp_idx, k0, o0) to request shape [1, K_CHUNK,
MLP_OUT_CHUNK] instead of [NUM_EXPERTS_CFG, K_CHUNK, MLP_OUT_CHUNK], so each
slice extracts only the single expert at exp_idx, and keep the subsequent
pl.reshape to [K_CHUNK, MLP_OUT_CHUNK] to collapse the leading singleton
dimension.

---

Nitpick comments:
In `@examples/models/kimi/kimi_k2_decode.py`:
- Line 80: NUM_ACTIVE_CFG is defined from num_active_experts but never used
while the code currently processes all experts; either remove this unused config
to avoid confusion (delete NUM_ACTIVE_CFG and the num_active_experts parameter
if unused) or implement top-K routing: before the expert application loop,
compute top-k expert indices based on gating scores (use
num_active_experts/NUM_ACTIVE_CFG) and restrict the expert iteration to those
selected experts so only the top-K experts are executed during the
decode/expert-application phase.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8c2ba937-15d0-4600-a628-efdcb6d8b78b

📥 Commits

Reviewing files that changed from the base of the PR and between 23b963b and d2d25ad.

📒 Files selected for processing (2)
  • examples/models/kimi/kimi_k2_decode.py
  • examples/models/milm/milm_decode.py
✅ Files skipped from review due to trivial changes (1)
  • examples/models/milm/milm_decode.py

