
Pr/local examples #68

Open
hengliao1972 wants to merge 3 commits into hw-native-sys:main from hengliao1972:pr/local-examples

Conversation

@hengliao1972
Collaborator

No description provided.

Introduce a tile-parallel batch hash lookup example with outer probe rounds and generated pass dumps to analyze orchestration/incore lowering behavior.

Made-with: Cursor
- Refactor probe loop to use pl.break_() when round_has_active == 0
- Update pass dumps after run

Made-with: Cursor
@coderabbitai

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough


Added a complete PyPTO batch hash lookup example (examples/batch_hash_lookup.py) with supporting implementation, a golden reference function, tensor initializers, and compilation helpers. Included detailed compiler pass dumps demonstrating the full transformation pipeline from frontend through optimization passes.

Changes

Batch Hash Lookup Main Example
examples/batch_hash_lookup.py
Implemented a complete, self-contained PyPTO batch hash lookup program with a factory function returning a @pl.program class. Includes a NumPy/Torch golden reference, a tensor specification builder, and a compilation/execution helper with optional pass dumping.

Compiler Pass Dumps
examples/batch_hash_lookup_dump/passes_dump/00_frontend.py, 01_after_UnrollLoops.py, 02_after_ConvertToSSA.py, 03_after_FlattenCallExpr.py, 04_after_SplitChunkedLoops.py, 05_after_InterchangeChunkLoops.py, 06_after_RunVerifier.py, 07_after_OutlineIncoreScopes.py, 08_after_ExpandMixedKernel.py, 09_after_ConvertTensorToBlockOps.py, 10_after_InitMemRef.py, 11_after_MemoryReuse.py, 12_after_InsertSync.py, 13_after_AllocateMemoryAddr.py
Sequential compiler transformation artifacts showing incremental lowering of the batch hash lookup program through frontend parsing, loop unrolling, SSA conversion, expression flattening, loop splitting/interchange, verification, in-core scope outlining, kernel expansion, tensor-to-block conversion, memory reference initialization, memory reuse optimization, synchronization insertion, and memory address allocation.

Sequence Diagram

sequenceDiagram
    actor Host
    participant Orchestrator as batch_hash_lookup<br/>(orchestration)
    participant Init as batch_hash_lookup_incore_0<br/>(initialization)
    participant Probe as batch_hash_lookup_incore_1<br/>(probe/update)
    participant Data as Hash Pool &<br/>Search Keys

    Host->>Orchestrator: search_key, hash_pool, value_ptr_out_0
    Orchestrator->>Init: Initialize per-batch output pointers
    loop For each batch b
        Init->>Data: Read search_key slice
        Init->>Init: Create zeroed tile
        Init->>Orchestrator: Return initialized value_ptr_out
    end
    
    Orchestrator->>Probe: Begin probe rounds (0..7)
    loop For each probe round
        Probe->>Data: Compute mixed hash from keys
        Probe->>Data: Mask & derive bucket indices
        loop For each bucket (0..63)
            Probe->>Data: Select candidate key/value from hash_pool
            Probe->>Probe: Check active status, match keys
            Probe->>Probe: Update output (candidate or previous value)
        end
        Probe->>Orchestrator: Return round_has_active, updated value_ptr_out
        alt No active entries remain
            Orchestrator->>Orchestrator: Break probe loop early
        end
    end
    
    Orchestrator->>Host: Return final value_ptr_out

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Poem

🐰 A hash lookup hops with glee,
Through probe rounds batched in harmony,
Masked buckets match and keys align,
SIMD lanes dance in tight design!

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (1 warning, 2 inconclusive)

Docstring Coverage — ⚠️ Warning: Docstring coverage is 15.15%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
Title check — ❓ Inconclusive: the title 'Pr/local examples' is vague and generic, and does not convey what the pull request changes or adds. Resolution: use a more descriptive title, such as 'Add batch hash lookup example with PyPTO passes dump'.
Description check — ❓ Inconclusive: no pull request description was provided, making it impossible to assess the relationship between the description and the changeset. Resolution: add a description explaining the purpose, motivation, and contents of the changes, such as the batch hash lookup example and the included compilation passes.





@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a batch hash lookup example using the PyPTO DSL, accompanied by its corresponding compiler pass dumps. The implementation leverages tile-parallel operations to perform lookups across multiple tables and batches. Feedback focuses on several high-impact performance and architectural improvements: moving the pl.incore() scope inside parallel loops to enable multi-core execution, replacing synchronous pl.tensor.read() calls within loops to prevent pipeline stalls, optimizing zero-initialization to reduce memory bandwidth, and considering gather-based lookups instead of sequential bucket scans to improve algorithmic complexity.

Comment on lines +110 to +112
with pl.incore():
    for b in pl.parallel(0, batch_num, 32):
        for ti in pl.parallel(0, num_tables_i, 32):


high

The pl.incore() scope is placed outside the pl.parallel loops in the probe round block. This indicates to the compiler that the entire grid (all batches and tables) should be processed within a single in-core session, which likely prevents multi-core parallelization and increases memory pressure. It should be moved inside the b loop, consistent with the initialization block at lines 96-98, to allow each batch to be processed independently on different cores.
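A minimal sketch of the suggested restructuring, using only the pl names visible in this diff (not verified against the PyPTO API), mirroring the placement used by the initialization block:

```
for b in pl.parallel(0, batch_num, 32):
    with pl.incore():  # one in-core session per batch, as in the init block
        for ti in pl.parallel(0, num_tables_i, 32):
            ...  # probe-round body unchanged
```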

Comment on lines +141 to +143
active_count_s = pl.tensor.read(active_count, [0, 0])
if active_count_s != 0:
    round_has_active = 1


high

Calling pl.tensor.read() inside nested pl.parallel loops is a major performance bottleneck. This operation forces a synchronous data transfer from the device to the host (or scalar unit), stalling the execution pipeline for every tile iteration. To implement early exit across the grid more efficiently, consider using a device-side reduction to a global scalar tensor and performing a single read after the parallel loops have finished.
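The suggested pattern can be illustrated with a NumPy analogue (toy shapes and hypothetical names; the real code would use the DSL's device-side reduction ops): each tile accumulates its active count into one buffer, and the host performs a single read after the parallel loops finish.

```python
import numpy as np

# Per-tile active counts, as if produced by each (b, ti) tile iteration.
tile_active_counts = np.array([[0, 3, 0],
                               [0, 0, 0],
                               [2, 0, 0]])

# Device-side reduction to one scalar, then a single host read,
# instead of one pl.tensor.read() per tile iteration.
total_active = int(tile_active_counts.sum())
round_has_active = 1 if total_active != 0 else 0
print(round_has_active)  # → 1
```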

Comment on lines +98 to +103
for ti in pl.parallel(0, num_tables_i, 32):
    zero_src: pl.Tensor[[1, NTJ], pl.INT32] = pl.view(
        search_key, [1, NTJ], [b, ti, 0]
    )
    zero_tile = pl.mul(zero_src, 0)
    value_ptr_out = pl.assemble(value_ptr_out, zero_tile, [b, ti, 0])


medium

The initialization loop is inefficient as it reads from search_key and performs an element-wise multiplication by zero just to obtain a zero tile. This incurs unnecessary memory bandwidth and compute overhead. Additionally, multiplying by a Python integer literal 0 may cause the resulting tile to be promoted to pl.INDEX (64-bit), potentially leading to type mismatches when assembling into the pl.INT32 output tensor. It would be better to use a direct zero-initialization function (e.g., pl.zeros) if supported by the DSL.
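For comparison, a NumPy sketch of direct zero initialization (NTJ is a placeholder value here); a pl.zeros-style call, if the DSL provides one, would likewise avoid both the extra read and the dtype promotion:

```python
import numpy as np

NTJ = 32  # placeholder tile width for illustration

# Direct zero initialization: no source read, no multiply,
# and the dtype is pinned to 32-bit explicitly.
zero_tile = np.zeros((1, NTJ), dtype=np.int32)
assert zero_tile.dtype == np.int32
assert not zero_tile.any()
```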

Comment on lines +125 to +136
for bucket in pl.range(TABLE_BUCKET_COUNT):
    bucket_mask = pl.cmps(h_probe, bucket, cmp_type=0)
    bucket_keys = pl.view(
        hash_pool, [1, NTJ], [ti, bucket, 0]
    )
    bucket_vals = pl.view(
        hash_pool,
        [1, NTJ],
        [ti, HASH_PLANE_ROWS + bucket, 0],
    )
    cand_key = pl.sel(bucket_mask, bucket_keys, cand_key)
    cand_val = pl.sel(bucket_mask, bucket_vals, cand_val)


medium

The lookup logic uses a sequential loop over all buckets (pl.range(TABLE_BUCKET_COUNT)), resulting in an O(N_buckets) complexity per tile. This approach leads to significant memory bandwidth amplification, as every bucket in the hash pool is read for every input tile. While the docstring notes this as a 'SIMD-friendly' scan to avoid per-lane loops, it is highly inefficient compared to a gather operation. If the hardware and DSL support indirect indexing, a gather-based lookup would be significantly more performant.
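The difference can be demonstrated with a NumPy analogue (toy sizes and a hypothetical layout): the masked O(N_buckets) scan and a per-lane gather produce identical results, but the gather reads each lane's bucket exactly once instead of reading every bucket row.

```python
import numpy as np

rng = np.random.default_rng(0)
TABLE_BUCKET_COUNT, NTJ = 8, 4   # toy sizes for illustration
bucket_vals = rng.integers(0, 100, size=(TABLE_BUCKET_COUNT, NTJ))
h_probe = rng.integers(0, TABLE_BUCKET_COUNT, size=NTJ)  # bucket index per lane

# O(N_buckets) masked scan, as in the example: every bucket row is read.
scanned = np.zeros(NTJ, dtype=bucket_vals.dtype)
for bucket in range(TABLE_BUCKET_COUNT):
    scanned = np.where(h_probe == bucket, bucket_vals[bucket], scanned)

# Gather: each lane reads only from its own bucket row.
gathered = bucket_vals[h_probe, np.arange(NTJ)]
assert np.array_equal(scanned, gathered)
```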


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
examples/batch_hash_lookup.py (1)

78-79: Unused parameters: hash_table_size and hash_base_ptr.

These parameters are declared in the function signature but never used within batch_hash_lookup. If they're reserved for future functionality, consider adding a brief comment indicating this. Otherwise, removing them would simplify the API.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup.py` around lines 78 - 79, The parameters
hash_table_size and hash_base_ptr in the batch_hash_lookup function signature
are unused; either remove them from the signature (and update any callers,
docstring, and tests) or mark them as intentionally reserved by adding a brief
inline comment and/or rename to _hash_table_size and _hash_base_ptr to avoid
linter warnings—update the batch_hash_lookup docstring to reflect the change so
callers know the parameters are removed or intentionally unused.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b1153f1b-0dd8-40c9-807b-e57e570451c2

📥 Commits

Reviewing files that changed from the base of the PR and between 9a4a25f and 925f35e.

📒 Files selected for processing (15)
  • examples/batch_hash_lookup.py
  • examples/batch_hash_lookup_dump/passes_dump/00_frontend.py
  • examples/batch_hash_lookup_dump/passes_dump/01_after_UnrollLoops.py
  • examples/batch_hash_lookup_dump/passes_dump/02_after_ConvertToSSA.py
  • examples/batch_hash_lookup_dump/passes_dump/03_after_FlattenCallExpr.py
  • examples/batch_hash_lookup_dump/passes_dump/04_after_SplitChunkedLoops.py
  • examples/batch_hash_lookup_dump/passes_dump/05_after_InterchangeChunkLoops.py
  • examples/batch_hash_lookup_dump/passes_dump/06_after_RunVerifier.py
  • examples/batch_hash_lookup_dump/passes_dump/07_after_OutlineIncoreScopes.py
  • examples/batch_hash_lookup_dump/passes_dump/08_after_ExpandMixedKernel.py
  • examples/batch_hash_lookup_dump/passes_dump/09_after_ConvertTensorToBlockOps.py
  • examples/batch_hash_lookup_dump/passes_dump/10_after_InitMemRef.py
  • examples/batch_hash_lookup_dump/passes_dump/11_after_MemoryReuse.py
  • examples/batch_hash_lookup_dump/passes_dump/12_after_InsertSync.py
  • examples/batch_hash_lookup_dump/passes_dump/13_after_AllocateMemoryAddr.py

if active_count_s != 0:
    round_has_active: pl.Scalar[pl.INDEX] = 1
key_match: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.cmp(cand_key, keys_tile, cmp_type=0)
hit_mask: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.and(active_mask, key_match)


⚠️ Potential issue | 🟠 Major

Invalid Python syntax: and is a reserved keyword.

pl.tensor.and(...) cannot be parsed because and is a Python keyword. The compiler's dump generator should emit an escaped name (e.g., pl.tensor.and_) to produce valid Python. The main example file correctly uses pl.and_() at this logical position.

This same issue occurs in all subsequent pass dump files (01–06) at the equivalent line.

🧰 Tools
🪛 Ruff (0.15.7)

[warning] 37-37: Expected an identifier, but found a keyword and that cannot be used here

(invalid-syntax)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup_dump/passes_dump/00_frontend.py` at line 37, The
hit_mask assignment uses the invalid Python token "pl.tensor.and" (seen in the
expression "hit_mask: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.and(active_mask,
key_match)"); replace this with the escaped/valid API name (e.g., call
pl.and_(active_mask, key_match) or pl.tensor.and_(active_mask, key_match)) so
the generated dump is valid Python, and apply the same replacement for the
equivalent lines in passes 01–06.
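The parse failure is easy to reproduce in plain Python, independent of any pl installation: compile() rejects attribute access on a keyword, while the getattr spelling parses.

```python
import keyword

assert keyword.iskeyword("and")

# Attribute-style access to a keyword never parses...
try:
    compile("pl.tensor.and(a, b)", "<dump>", "eval")
    parsed = True
except SyntaxError:
    parsed = False
assert parsed is False

# ...but the getattr spelling compiles fine (names stay unresolved until run time).
compile('getattr(pl.tensor, "and")(a, b)', "<dump>", "eval")
```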

else:
    round_has_active_6: pl.Scalar[pl.INDEX] = pl.yield_(round_has_active_iter_3)
key_match_0: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.cmp(cand_key_2, keys_tile_0, cmp_type=0)
hit_mask_0: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.and(active_mask_0, key_match_0)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
python - <<'PY'
from pathlib import Path

bad = False
for path in sorted(Path("examples/batch_hash_lookup_dump/passes_dump").glob("*.py")):
    try:
        compile(path.read_text(), str(path), "exec")
    except SyntaxError as exc:
        bad = True
        print(f"{path}:{exc.lineno}:{exc.offset}: {exc.msg}")

raise SystemExit(1 if bad else 0)
PY

Repository: hw-native-sys/pypto-lib

Length of output: 1373


🏁 Script executed:

head -n 41 examples/batch_hash_lookup_dump/passes_dump/08_after_ExpandMixedKernel.py | tail -n 1

Repository: hw-native-sys/pypto-lib

Length of output: 167


pl.tensor.and(...) is invalid Python syntax. and is a Python keyword and cannot be accessed via attribute notation. Line 41 fails to parse. This issue affects all 14 snapshot files in the pass dumps directory. Emit it via getattr(pl.tensor, "and") or a non-keyword alias instead.

Fix
-                hit_mask_0: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.and(active_mask_0, key_match_0)
+                hit_mask_0: pl.Tensor[[1, 32], pl.INDEX] = getattr(pl.tensor, "and")(active_mask_0, key_match_0)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
hit_mask_0: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.and(active_mask_0, key_match_0)
hit_mask_0: pl.Tensor[[1, 32], pl.INDEX] = getattr(pl.tensor, "and")(active_mask_0, key_match_0)
🧰 Tools
🪛 Ruff (0.15.7)

[warning] 41-41: Expected an identifier, but found a keyword and that cannot be used here

(invalid-syntax)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup_dump/passes_dump/08_after_ExpandMixedKernel.py` at
line 41, The expression using the attribute-style call pl.tensor.and in the
assignment to hit_mask_0 is invalid because "and" is a Python keyword; replace
that call with a valid reference such as getattr(pl.tensor,
"and")(active_mask_0, key_match_0) or use a non-keyword alias (e.g.,
pl.tensor.logical_and or the library's provided alternative) wherever
pl.tensor.and is used (notably the hit_mask_0 computation) across the 14
snapshot files so the files parse correctly.

Comment on lines +52 to +54
for probe_0, (b_iter_1, ti_iter_1, value_ptr_out_iter_6) in pl.range(0, 8, 1, init_values=(b_0, ti_0, value_ptr_out_2)):
    round_has_active_0: pl.Scalar[pl.INDEX] = 0
    ret: pl.Tuple([pl.Scalar[pl.INDEX], pl.Scalar[pl.INDEX], pl.Scalar[pl.INDEX], pl.Tensor[[1024, 64, 32], pl.INT32]]) = self.batch_hash_lookup_incore_1(hash_pool_0, probe_0, round_has_active_0, search_key_0, ti_0, ti_iter_1, value_ptr_out_2, value_ptr_out_iter_6)


⚠️ Potential issue | 🔴 Critical

batch_hash_lookup uses an unbound ti_0.

Line 52 seeds init_values with ti_0, and Line 54 passes ti_0 into batch_hash_lookup_incore_1, but this method never defines that name. Any runtime path that executes this orchestration function will hit a NameError. Thread a defined scalar here, or remove the dead ti_0 / ti_iter_1 state from the generated signature.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup_dump/passes_dump/08_after_ExpandMixedKernel.py`
around lines 52 - 54, The orchestration loop seeds init_values with an undefined
ti_0 and also passes ti_0 into batch_hash_lookup_incore_1, causing a NameError;
fix by either threading a defined scalar (the intended iteration state) into
init_values and using that named scalar in the call, or remove the unused
ti_0/ti_iter_1 state from the generated signature and callers so
batch_hash_lookup_incore_1 is invoked only with defined symbols; specifically
update the for-loop init_values and the call to batch_hash_lookup_incore_1
(which currently mentions probe_0, b_iter_1, ti_iter_1, value_ptr_out_iter_6) to
use a valid, previously defined scalar (or drop ti_0/ti_iter_1 everywhere) so no
unbound name remains.
