Conversation
Introduce a tile-parallel batch hash lookup example with outer probe rounds and generated pass dumps to analyze orchestration/incore lowering behavior. (Made-with: Cursor)

- Refactor probe loop to use pl.break_() when round_has_active == 0
- Update pass dumps after run
📝 Walkthrough

Added a complete PyPTO batch hash lookup example (`examples/batch_hash_lookup.py`) along with its generated compiler pass dumps.
Sequence Diagram

```mermaid
sequenceDiagram
    actor Host
    participant Orchestrator as batch_hash_lookup<br/>(orchestration)
    participant Init as batch_hash_lookup_incore_0<br/>(initialization)
    participant Probe as batch_hash_lookup_incore_1<br/>(probe/update)
    participant Data as Hash Pool &<br/>Search Keys
    Host->>Orchestrator: search_key, hash_pool, value_ptr_out_0
    Orchestrator->>Init: Initialize per-batch output pointers
    loop For each batch b
        Init->>Data: Read search_key slice
        Init->>Init: Create zeroed tile
        Init->>Orchestrator: Return initialized value_ptr_out
    end
    Orchestrator->>Probe: Begin probe rounds (0..7)
    loop For each probe round
        Probe->>Data: Compute mixed hash from keys
        Probe->>Data: Mask & derive bucket indices
        loop For each bucket (0..63)
            Probe->>Data: Select candidate key/value from hash_pool
            Probe->>Probe: Check active status, match keys
            Probe->>Probe: Update output (candidate or previous value)
        end
        Probe->>Orchestrator: Return round_has_active, updated value_ptr_out
        alt No active entries remain
            Orchestrator->>Orchestrator: Break probe loop early
        end
    end
    Orchestrator->>Host: Return final value_ptr_out
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ❌ 3 failed (1 warning, 2 inconclusive)
Code Review
This pull request introduces a batch hash lookup example using the PyPTO DSL, accompanied by its corresponding compiler pass dumps. The implementation leverages tile-parallel operations to perform lookups across multiple tables and batches. Feedback focuses on several high-impact performance and architectural improvements: moving the pl.incore() scope inside parallel loops to enable multi-core execution, replacing synchronous pl.tensor.read() calls within loops to prevent pipeline stalls, optimizing zero-initialization to reduce memory bandwidth, and considering gather-based lookups instead of sequential bucket scans to improve algorithmic complexity.
```python
with pl.incore():
    for b in pl.parallel(0, batch_num, 32):
        for ti in pl.parallel(0, num_tables_i, 32):
```
The pl.incore() scope is placed outside the pl.parallel loops in the probe round block. This indicates to the compiler that the entire grid (all batches and tables) should be processed within a single in-core session, which likely prevents multi-core parallelization and increases memory pressure. It should be moved inside the b loop, consistent with the initialization block at lines 96-98, to allow each batch to be processed independently on different cores.
```python
active_count_s = pl.tensor.read(active_count, [0, 0])
if active_count_s != 0:
    round_has_active = 1
```
Calling pl.tensor.read() inside nested pl.parallel loops is a major performance bottleneck. This operation forces a synchronous data transfer from the device to the host (or scalar unit), stalling the execution pipeline for every tile iteration. To implement early exit across the grid more efficiently, consider using a device-side reduction to a global scalar tensor and performing a single read after the parallel loops have finished.
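The reduce-then-read pattern suggested above can be sketched in plain NumPy (the `pl.*` API in the diff is PyPTO-specific; the names and the readback accounting here are purely illustrative):

```python
import numpy as np

def per_tile_reads(active_counts: np.ndarray) -> tuple[int, int]:
    # Anti-pattern being flagged: one synchronous host readback per tile.
    host_reads = 0
    round_has_active = 0
    for tile_count in active_counts:
        host_reads += 1               # simulated device->host transfer
        if tile_count != 0:
            round_has_active = 1
    return round_has_active, host_reads

def single_read(active_counts: np.ndarray) -> tuple[int, int]:
    # Suggested pattern: reduce on-device into one scalar, read it once
    # after the parallel loops have finished.
    total = int(active_counts.sum())  # stands in for a device-side reduction
    return (1 if total != 0 else 0), 1

counts = np.array([0, 0, 3, 0])
assert per_tile_reads(counts) == (1, 4)   # 4 readbacks, one per tile
assert single_read(counts) == (1, 1)      # same answer, 1 readback
```

Both variants compute the same `round_has_active` flag; only the number of pipeline-stalling transfers differs.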
```python
for ti in pl.parallel(0, num_tables_i, 32):
    zero_src: pl.Tensor[[1, NTJ], pl.INT32] = pl.view(
        search_key, [1, NTJ], [b, ti, 0]
    )
    zero_tile = pl.mul(zero_src, 0)
    value_ptr_out = pl.assemble(value_ptr_out, zero_tile, [b, ti, 0])
```
The initialization loop is inefficient as it reads from search_key and performs an element-wise multiplication by zero just to obtain a zero tile. This incurs unnecessary memory bandwidth and compute overhead. Additionally, multiplying by a Python integer literal 0 may cause the resulting tile to be promoted to pl.INDEX (64-bit), potentially leading to type mismatches when assembling into the pl.INT32 output tensor. It would be better to use a direct zero-initialization function (e.g., pl.zeros) if supported by the DSL.
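A NumPy analogue of the point (the DSL's type-promotion behavior is an assumption in the review; NumPy itself happens to preserve `int32` in this case): building a zero tile by loading real data and multiplying by `0` spends a read plus an elementwise op, while allocating zeros directly with an explicit dtype costs neither.

```python
import numpy as np

NTJ = 32
search_key = np.arange(NTJ, dtype=np.int32).reshape(1, NTJ)

zero_via_mul = search_key * 0                     # needless load + multiply
zero_direct = np.zeros((1, NTJ), dtype=np.int32)  # direct, dtype-controlled

assert np.array_equal(zero_via_mul, zero_direct)
assert zero_direct.dtype == np.int32              # no accidental widening
```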
```python
for bucket in pl.range(TABLE_BUCKET_COUNT):
    bucket_mask = pl.cmps(h_probe, bucket, cmp_type=0)
    bucket_keys = pl.view(
        hash_pool, [1, NTJ], [ti, bucket, 0]
    )
    bucket_vals = pl.view(
        hash_pool,
        [1, NTJ],
        [ti, HASH_PLANE_ROWS + bucket, 0],
    )
    cand_key = pl.sel(bucket_mask, bucket_keys, cand_key)
    cand_val = pl.sel(bucket_mask, bucket_vals, cand_val)
```
The lookup logic uses a sequential loop over all buckets (pl.range(TABLE_BUCKET_COUNT)), resulting in an O(N_buckets) complexity per tile. This approach leads to significant memory bandwidth amplification, as every bucket in the hash pool is read for every input tile. While the docstring notes this as a 'SIMD-friendly' scan to avoid per-lane loops, it is highly inefficient compared to a gather operation. If the hardware and DSL support indirect indexing, a gather-based lookup would be significantly more performant.
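The complexity argument can be illustrated in NumPy (the `pl.*` calls in the diff are PyPTO-specific; this sketch only mirrors their logic): the masked scan touches every bucket row per tile, while a gather performs one indexed load per lane and yields the same result.

```python
import numpy as np

rng = np.random.default_rng(0)
TABLE_BUCKET_COUNT, NTJ = 64, 32
bucket_keys = rng.integers(0, 100, size=(TABLE_BUCKET_COUNT, NTJ), dtype=np.int32)
h_probe = rng.integers(0, TABLE_BUCKET_COUNT, size=NTJ)  # bucket index per lane

def scan_lookup():
    # O(TABLE_BUCKET_COUNT) passes: every bucket row is read for every tile.
    cand = np.zeros(NTJ, dtype=np.int32)
    for bucket in range(TABLE_BUCKET_COUNT):
        mask = h_probe == bucket
        cand = np.where(mask, bucket_keys[bucket], cand)
    return cand

def gather_lookup():
    # One indirect load per lane via advanced indexing.
    return bucket_keys[h_probe, np.arange(NTJ)]

assert np.array_equal(scan_lookup(), gather_lookup())
```

Whether the gather form is actually available depends on the hardware and DSL supporting indirect indexing, as the review notes.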
Actionable comments posted: 3
🧹 Nitpick comments (1)
examples/batch_hash_lookup.py (1)
78-79: Unused parameters: `hash_table_size` and `hash_base_ptr`.

These parameters are declared in the function signature but never used within `batch_hash_lookup`. If they're reserved for future functionality, consider adding a brief comment indicating this. Otherwise, removing them would simplify the API.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/batch_hash_lookup.py` around lines 78-79, the parameters hash_table_size and hash_base_ptr in the batch_hash_lookup function signature are unused; either remove them from the signature (and update any callers, docstring, and tests) or mark them as intentionally reserved by adding a brief inline comment and/or renaming them to _hash_table_size and _hash_base_ptr to avoid linter warnings. Update the batch_hash_lookup docstring to reflect the change so callers know the parameters are removed or intentionally unused.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b1153f1b-0dd8-40c9-807b-e57e570451c2
📒 Files selected for processing (15)
- examples/batch_hash_lookup.py
- examples/batch_hash_lookup_dump/passes_dump/00_frontend.py
- examples/batch_hash_lookup_dump/passes_dump/01_after_UnrollLoops.py
- examples/batch_hash_lookup_dump/passes_dump/02_after_ConvertToSSA.py
- examples/batch_hash_lookup_dump/passes_dump/03_after_FlattenCallExpr.py
- examples/batch_hash_lookup_dump/passes_dump/04_after_SplitChunkedLoops.py
- examples/batch_hash_lookup_dump/passes_dump/05_after_InterchangeChunkLoops.py
- examples/batch_hash_lookup_dump/passes_dump/06_after_RunVerifier.py
- examples/batch_hash_lookup_dump/passes_dump/07_after_OutlineIncoreScopes.py
- examples/batch_hash_lookup_dump/passes_dump/08_after_ExpandMixedKernel.py
- examples/batch_hash_lookup_dump/passes_dump/09_after_ConvertTensorToBlockOps.py
- examples/batch_hash_lookup_dump/passes_dump/10_after_InitMemRef.py
- examples/batch_hash_lookup_dump/passes_dump/11_after_MemoryReuse.py
- examples/batch_hash_lookup_dump/passes_dump/12_after_InsertSync.py
- examples/batch_hash_lookup_dump/passes_dump/13_after_AllocateMemoryAddr.py
```python
if active_count_s != 0:
    round_has_active: pl.Scalar[pl.INDEX] = 1
key_match: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.cmp(cand_key, keys_tile, cmp_type=0)
hit_mask: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.and(active_mask, key_match)
```
Invalid Python syntax: and is a reserved keyword.
pl.tensor.and(...) cannot be parsed because and is a Python keyword. The compiler's dump generator should emit an escaped name (e.g., pl.tensor.and_) to produce valid Python. The main example file correctly uses pl.and_() at this logical position.
This same issue occurs in all subsequent pass dump files (01–06) at the equivalent line.
🧰 Tools
🪛 Ruff (0.15.7)
[warning] 37-37: Expected an identifier, but found a keyword `and` that cannot be used here
(invalid-syntax)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/batch_hash_lookup_dump/passes_dump/00_frontend.py` at line 37, The
hit_mask assignment uses the invalid Python token "pl.tensor.and" (seen in the
expression "hit_mask: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.and(active_mask,
key_match)"); replace this with the escaped/valid API name (e.g., call
pl.and_(active_mask, key_match) or pl.tensor.and_(active_mask, key_match)) so
the generated dump is valid Python, and apply the same replacement for the
equivalent lines in passes 01–06.
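A minimal stdlib reproduction of the parse failure and of both suggested escapes (here `operator.and_` stands in for the DSL's escaped name; `pl.tensor` itself is not involved): attribute access spelled `obj.and(...)` is rejected by the parser before any code runs, while a trailing-underscore alias or `getattr` works.

```python
import operator

# `and` is a keyword, so this line cannot even compile.
try:
    compile("hit = operator.and(a, b)", "<dump>", "exec")
    parse_result = "parsed"
except SyntaxError:
    parse_result = "SyntaxError"
assert parse_result == "SyntaxError"

# Conventional escape: trailing underscore (bitwise and on ints).
assert operator.and_(0b110, 0b011) == 0b010

# getattr() also reaches an attribute literally named "and", if one exists.
ns = type("NS", (), {"and": staticmethod(operator.and_)})
assert getattr(ns, "and")(0b110, 0b011) == 0b010
```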
```python
else:
    round_has_active_6: pl.Scalar[pl.INDEX] = pl.yield_(round_has_active_iter_3)
key_match_0: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.cmp(cand_key_2, keys_tile_0, cmp_type=0)
hit_mask_0: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.and(active_mask_0, key_match_0)
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
python - <<'PY'
from pathlib import Path
bad = False
for path in sorted(Path("examples/batch_hash_lookup_dump/passes_dump").glob("*.py")):
    try:
        compile(path.read_text(), str(path), "exec")
    except SyntaxError as exc:
        bad = True
        print(f"{path}:{exc.lineno}:{exc.offset}: {exc.msg}")
raise SystemExit(1 if bad else 0)
PY
```

Repository: hw-native-sys/pypto-lib
Length of output: 1373

🏁 Script executed:

```bash
head -n 41 examples/batch_hash_lookup_dump/passes_dump/08_after_ExpandMixedKernel.py | tail -n 1
```

Repository: hw-native-sys/pypto-lib
Length of output: 167
pl.tensor.and(...) is invalid Python syntax. and is a Python keyword and cannot be accessed via attribute notation. Line 41 fails to parse. This issue affects all 14 snapshot files in the pass dumps directory. Emit it via getattr(pl.tensor, "and") or a non-keyword alias instead.
Fix

```diff
- hit_mask_0: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.and(active_mask_0, key_match_0)
+ hit_mask_0: pl.Tensor[[1, 32], pl.INDEX] = getattr(pl.tensor, "and")(active_mask_0, key_match_0)
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
hit_mask_0: pl.Tensor[[1, 32], pl.INDEX] = getattr(pl.tensor, "and")(active_mask_0, key_match_0)
```
🧰 Tools
🪛 Ruff (0.15.7)
[warning] 41-41: Expected an identifier, but found a keyword `and` that cannot be used here
(invalid-syntax)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/batch_hash_lookup_dump/passes_dump/08_after_ExpandMixedKernel.py` at
line 41, The expression using the attribute-style call pl.tensor.and in the
assignment to hit_mask_0 is invalid because "and" is a Python keyword; replace
that call with a valid reference such as getattr(pl.tensor,
"and")(active_mask_0, key_match_0) or use a non-keyword alias (e.g.,
pl.tensor.logical_and or the library's provided alternative) wherever
pl.tensor.and is used (notably the hit_mask_0 computation) across the 14
snapshot files so the files parse correctly.
```python
for probe_0, (b_iter_1, ti_iter_1, value_ptr_out_iter_6) in pl.range(0, 8, 1, init_values=(b_0, ti_0, value_ptr_out_2)):
    round_has_active_0: pl.Scalar[pl.INDEX] = 0
    ret: pl.Tuple([pl.Scalar[pl.INDEX], pl.Scalar[pl.INDEX], pl.Scalar[pl.INDEX], pl.Tensor[[1024, 64, 32], pl.INT32]]) = self.batch_hash_lookup_incore_1(hash_pool_0, probe_0, round_has_active_0, search_key_0, ti_0, ti_iter_1, value_ptr_out_2, value_ptr_out_iter_6)
```
batch_hash_lookup uses an unbound ti_0.
Line 52 seeds init_values with ti_0, and Line 54 passes ti_0 into batch_hash_lookup_incore_1, but this method never defines that name. Any runtime path that executes this orchestration function will hit a NameError. Thread a defined scalar here, or remove the dead ti_0 / ti_iter_1 state from the generated signature.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/batch_hash_lookup_dump/passes_dump/08_after_ExpandMixedKernel.py`
around lines 52 - 54, The orchestration loop seeds init_values with an undefined
ti_0 and also passes ti_0 into batch_hash_lookup_incore_1, causing a NameError;
fix by either threading a defined scalar (the intended iteration state) into
init_values and using that named scalar in the call, or remove the unused
ti_0/ti_iter_1 state from the generated signature and callers so
batch_hash_lookup_incore_1 is invoked only with defined symbols; specifically
update the for-loop init_values and the call to batch_hash_lookup_incore_1
(which currently mentions probe_0, b_iter_1, ti_iter_1, value_ptr_out_iter_6) to
use a valid, previously defined scalar (or drop ti_0/ti_iter_1 everywhere) so no
unbound name remains.
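A hypothetical plain-Python reduction of this class of bug (the names mirror the dump but nothing here uses PyPTO): a generated function whose loop state references a never-bound name compiles fine and only fails with NameError on the first call.

```python
generated = """
def orchestrate():
    value_ptr_out_2 = 0
    for probe_0 in range(8):
        state = (ti_0, value_ptr_out_2)   # ti_0 was never bound here
    return state
"""
scope = {}
exec(generated, scope)          # definition succeeds; the bug is latent

try:
    scope["orchestrate"]()
    outcome = "ok"
except NameError as exc:
    outcome = str(exc)

assert "ti_0" in outcome        # fails only at runtime, on the first call
```

This is why such dead state in generated signatures is worth removing even when the snapshot files are never executed directly: syntax checks alone will not catch it.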
No description provided.