
Pr/local examples #68

Open
hengliao1972 wants to merge 3 commits into hw-native-sys:main from hengliao1972:pr/local-examples

Conversation

@hengliao1972
Collaborator

No description provided.

Introduce a tile-parallel batch hash lookup example with outer probe rounds and generated pass dumps to analyze orchestration/incore lowering behavior.

Made-with: Cursor
- Refactor probe loop to use pl.break_() when round_has_active == 0
- Update pass dumps after run

Made-with: Cursor
@coderabbitai

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough


Added a complete PyPTO batch hash lookup example (examples/batch_hash_lookup.py) with supporting implementation, a golden reference function, tensor initializers, and compilation helpers. Included detailed compiler pass dumps demonstrating the full transformation pipeline from frontend through optimization passes.

Changes

Batch Hash Lookup Main Example
examples/batch_hash_lookup.py
Implemented a complete, self-contained PyPTO batch hash lookup program with a factory function returning a @pl.program class. Includes a NumPy/Torch golden reference, a tensor specification builder, and a compilation/execution helper with optional pass dumping.

Compiler Pass Dumps
examples/batch_hash_lookup_dump/passes_dump/00_frontend.py, 01_after_UnrollLoops.py, 02_after_ConvertToSSA.py, 03_after_FlattenCallExpr.py, 04_after_SplitChunkedLoops.py, 05_after_InterchangeChunkLoops.py, 06_after_RunVerifier.py, 07_after_OutlineIncoreScopes.py, 08_after_ExpandMixedKernel.py, 09_after_ConvertTensorToBlockOps.py, 10_after_InitMemRef.py, 11_after_MemoryReuse.py, 12_after_InsertSync.py, 13_after_AllocateMemoryAddr.py
Sequential compiler transformation artifacts showing incremental lowering of the batch hash lookup program through frontend parsing, loop unrolling, SSA conversion, expression flattening, loop splitting/interchange, verification, in-core scope outlining, kernel expansion, tensor-to-block conversion, memory reference initialization, memory reuse optimization, synchronization insertion, and memory address allocation.

Sequence Diagram

sequenceDiagram
    actor Host
    participant Orchestrator as batch_hash_lookup<br/>(orchestration)
    participant Init as batch_hash_lookup_incore_0<br/>(initialization)
    participant Probe as batch_hash_lookup_incore_1<br/>(probe/update)
    participant Data as Hash Pool &<br/>Search Keys

    Host->>Orchestrator: search_key, hash_pool, value_ptr_out_0
    Orchestrator->>Init: Initialize per-batch output pointers
    loop For each batch b
        Init->>Data: Read search_key slice
        Init->>Init: Create zeroed tile
        Init->>Orchestrator: Return initialized value_ptr_out
    end
    
    Orchestrator->>Probe: Begin probe rounds (0..7)
    loop For each probe round
        Probe->>Data: Compute mixed hash from keys
        Probe->>Data: Mask & derive bucket indices
        loop For each bucket (0..63)
            Probe->>Data: Select candidate key/value from hash_pool
            Probe->>Probe: Check active status, match keys
            Probe->>Probe: Update output (candidate or previous value)
        end
        Probe->>Orchestrator: Return round_has_active, updated value_ptr_out
        alt No active entries remain
            Orchestrator->>Orchestrator: Break probe loop early
        end
    end
    
    Orchestrator->>Host: Return final value_ptr_out

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Poem

🐰 A hash lookup hops with glee,
Through probe rounds batched in harmony,
Masked buckets match and keys align,
SIMD lanes dance in tight design!

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (1 warning, 2 inconclusive)

Docstring Coverage — ⚠️ Warning: Docstring coverage is 15.15%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
Title check — ❓ Inconclusive: the title 'Pr/local examples' is vague and generic, and does not convey what the pull request changes or adds. Resolution: use a more descriptive title, such as 'Add batch hash lookup example with PyPTO passes dump'.
Description check — ❓ Inconclusive: no pull request description was provided, making it impossible to assess the relationship between the description and the changeset. Resolution: add a description explaining the purpose, motivation, and contents of the changes, such as the batch hash lookup example and the included compilation passes.





@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a batch hash lookup example using the PyPTO DSL, accompanied by its corresponding compiler pass dumps. The implementation leverages tile-parallel operations to perform lookups across multiple tables and batches. Feedback focuses on several high-impact performance and architectural improvements: moving the pl.incore() scope inside parallel loops to enable multi-core execution, replacing synchronous pl.tensor.read() calls within loops to prevent pipeline stalls, optimizing zero-initialization to reduce memory bandwidth, and considering gather-based lookups instead of sequential bucket scans to improve algorithmic complexity.

Comment on lines +110 to +112
with pl.incore():
    for b in pl.parallel(0, batch_num, 32):
        for ti in pl.parallel(0, num_tables_i, 32):


high

The pl.incore() scope is placed outside the pl.parallel loops in the probe round block. This indicates to the compiler that the entire grid (all batches and tables) should be processed within a single in-core session, which likely prevents multi-core parallelization and increases memory pressure. It should be moved inside the b loop, consistent with the initialization block at lines 96-98, to allow each batch to be processed independently on different cores.
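A minimal sketch of the suggested restructuring, using only the pl names visible in this diff (not verified against the PyPTO API), mirroring the placement used by the initialization block:

```
for b in pl.parallel(0, batch_num, 32):
    with pl.incore():  # one in-core session per batch, as in the init block
        for ti in pl.parallel(0, num_tables_i, 32):
            ...  # probe-round body unchanged
```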

Comment on lines +141 to +143
active_count_s = pl.tensor.read(active_count, [0, 0])
if active_count_s != 0:
    round_has_active = 1


high

Calling pl.tensor.read() inside nested pl.parallel loops is a major performance bottleneck. This operation forces a synchronous data transfer from the device to the host (or scalar unit), stalling the execution pipeline for every tile iteration. To implement early exit across the grid more efficiently, consider using a device-side reduction to a global scalar tensor and performing a single read after the parallel loops have finished.
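The suggested pattern can be illustrated with a NumPy analogue (toy shapes and hypothetical names; the real code would use the DSL's device-side reduction ops): each tile accumulates its active count into one buffer, and the host performs a single read after the parallel loops finish.

```python
import numpy as np

# Per-tile active counts, as if produced by each (b, ti) tile iteration.
tile_active_counts = np.array([[0, 3, 0],
                               [0, 0, 0],
                               [2, 0, 0]])

# Device-side reduction to one scalar, then a single host read,
# instead of one pl.tensor.read() per tile iteration.
total_active = int(tile_active_counts.sum())
round_has_active = 1 if total_active != 0 else 0
print(round_has_active)  # → 1
```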

Comment on lines +98 to +103
for ti in pl.parallel(0, num_tables_i, 32):
    zero_src: pl.Tensor[[1, NTJ], pl.INT32] = pl.view(
        search_key, [1, NTJ], [b, ti, 0]
    )
    zero_tile = pl.mul(zero_src, 0)
    value_ptr_out = pl.assemble(value_ptr_out, zero_tile, [b, ti, 0])


medium

The initialization loop is inefficient as it reads from search_key and performs an element-wise multiplication by zero just to obtain a zero tile. This incurs unnecessary memory bandwidth and compute overhead. Additionally, multiplying by a Python integer literal 0 may cause the resulting tile to be promoted to pl.INDEX (64-bit), potentially leading to type mismatches when assembling into the pl.INT32 output tensor. It would be better to use a direct zero-initialization function (e.g., pl.zeros) if supported by the DSL.
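For comparison, a NumPy sketch of direct zero initialization (NTJ is a placeholder value here); a pl.zeros-style call, if the DSL provides one, would likewise avoid both the extra read and the dtype promotion:

```python
import numpy as np

NTJ = 32  # placeholder tile width for illustration

# Direct zero initialization: no source read, no multiply,
# and the dtype is pinned to 32-bit explicitly.
zero_tile = np.zeros((1, NTJ), dtype=np.int32)
assert zero_tile.dtype == np.int32
assert not zero_tile.any()
```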

Comment on lines +125 to +136
for bucket in pl.range(TABLE_BUCKET_COUNT):
    bucket_mask = pl.cmps(h_probe, bucket, cmp_type=0)
    bucket_keys = pl.view(
        hash_pool, [1, NTJ], [ti, bucket, 0]
    )
    bucket_vals = pl.view(
        hash_pool,
        [1, NTJ],
        [ti, HASH_PLANE_ROWS + bucket, 0],
    )
    cand_key = pl.sel(bucket_mask, bucket_keys, cand_key)
    cand_val = pl.sel(bucket_mask, bucket_vals, cand_val)


medium

The lookup logic uses a sequential loop over all buckets (pl.range(TABLE_BUCKET_COUNT)), resulting in an O(N_buckets) complexity per tile. This approach leads to significant memory bandwidth amplification, as every bucket in the hash pool is read for every input tile. While the docstring notes this as a 'SIMD-friendly' scan to avoid per-lane loops, it is highly inefficient compared to a gather operation. If the hardware and DSL support indirect indexing, a gather-based lookup would be significantly more performant.
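The difference can be demonstrated with a NumPy analogue (toy sizes and a hypothetical layout): the masked O(N_buckets) scan and a per-lane gather produce identical results, but the gather reads each lane's bucket exactly once instead of reading every bucket row.

```python
import numpy as np

rng = np.random.default_rng(0)
TABLE_BUCKET_COUNT, NTJ = 8, 4   # toy sizes for illustration
bucket_vals = rng.integers(0, 100, size=(TABLE_BUCKET_COUNT, NTJ))
h_probe = rng.integers(0, TABLE_BUCKET_COUNT, size=NTJ)  # bucket index per lane

# O(N_buckets) masked scan, as in the example: every bucket row is read.
scanned = np.zeros(NTJ, dtype=bucket_vals.dtype)
for bucket in range(TABLE_BUCKET_COUNT):
    scanned = np.where(h_probe == bucket, bucket_vals[bucket], scanned)

# Gather: each lane reads only from its own bucket row.
gathered = bucket_vals[h_probe, np.arange(NTJ)]
assert np.array_equal(scanned, gathered)
```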


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
examples/batch_hash_lookup.py (1)

78-79: Unused parameters: hash_table_size and hash_base_ptr.

These parameters are declared in the function signature but never used within batch_hash_lookup. If they're reserved for future functionality, consider adding a brief comment indicating this. Otherwise, removing them would simplify the API.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup.py` around lines 78 - 79, The parameters
hash_table_size and hash_base_ptr in the batch_hash_lookup function signature
are unused; either remove them from the signature (and update any callers,
docstring, and tests) or mark them as intentionally reserved by adding a brief
inline comment and/or rename to _hash_table_size and _hash_base_ptr to avoid
linter warnings—update the batch_hash_lookup docstring to reflect the change so
callers know the parameters are removed or intentionally unused.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b1153f1b-0dd8-40c9-807b-e57e570451c2

📥 Commits

Reviewing files that changed from the base of the PR and between 9a4a25f and 925f35e.

📒 Files selected for processing (15)
  • examples/batch_hash_lookup.py
  • examples/batch_hash_lookup_dump/passes_dump/00_frontend.py
  • examples/batch_hash_lookup_dump/passes_dump/01_after_UnrollLoops.py
  • examples/batch_hash_lookup_dump/passes_dump/02_after_ConvertToSSA.py
  • examples/batch_hash_lookup_dump/passes_dump/03_after_FlattenCallExpr.py
  • examples/batch_hash_lookup_dump/passes_dump/04_after_SplitChunkedLoops.py
  • examples/batch_hash_lookup_dump/passes_dump/05_after_InterchangeChunkLoops.py
  • examples/batch_hash_lookup_dump/passes_dump/06_after_RunVerifier.py
  • examples/batch_hash_lookup_dump/passes_dump/07_after_OutlineIncoreScopes.py
  • examples/batch_hash_lookup_dump/passes_dump/08_after_ExpandMixedKernel.py
  • examples/batch_hash_lookup_dump/passes_dump/09_after_ConvertTensorToBlockOps.py
  • examples/batch_hash_lookup_dump/passes_dump/10_after_InitMemRef.py
  • examples/batch_hash_lookup_dump/passes_dump/11_after_MemoryReuse.py
  • examples/batch_hash_lookup_dump/passes_dump/12_after_InsertSync.py
  • examples/batch_hash_lookup_dump/passes_dump/13_after_AllocateMemoryAddr.py

if active_count_s != 0:
    round_has_active: pl.Scalar[pl.INDEX] = 1
key_match: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.cmp(cand_key, keys_tile, cmp_type=0)
hit_mask: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.and(active_mask, key_match)


⚠️ Potential issue | 🟠 Major

Invalid Python syntax: and is a reserved keyword.

pl.tensor.and(...) cannot be parsed because and is a Python keyword. The compiler's dump generator should emit an escaped name (e.g., pl.tensor.and_) to produce valid Python. The main example file correctly uses pl.and_() at this logical position.

This same issue occurs in all subsequent pass dump files (01–06) at the equivalent line.

🧰 Tools
🪛 Ruff (0.15.7)

[warning] 37-37: Expected an identifier, but found a keyword and that cannot be used here

(invalid-syntax)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup_dump/passes_dump/00_frontend.py` at line 37, The
hit_mask assignment uses the invalid Python token "pl.tensor.and" (seen in the
expression "hit_mask: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.and(active_mask,
key_match)"); replace this with the escaped/valid API name (e.g., call
pl.and_(active_mask, key_match) or pl.tensor.and_(active_mask, key_match)) so
the generated dump is valid Python, and apply the same replacement for the
equivalent lines in passes 01–06.
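The parse failure is easy to reproduce in plain Python, independent of any pl installation: compile() rejects attribute access on a keyword, while the getattr spelling parses.

```python
import keyword

assert keyword.iskeyword("and")

# Attribute-style access to a keyword never parses...
try:
    compile("pl.tensor.and(a, b)", "<dump>", "eval")
    parsed = True
except SyntaxError:
    parsed = False
assert parsed is False

# ...but the getattr spelling compiles fine (names stay unresolved until run time).
compile('getattr(pl.tensor, "and")(a, b)', "<dump>", "eval")
```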

else:
    round_has_active_6: pl.Scalar[pl.INDEX] = pl.yield_(round_has_active_iter_3)
key_match_0: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.cmp(cand_key_2, keys_tile_0, cmp_type=0)
hit_mask_0: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.and(active_mask_0, key_match_0)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
python - <<'PY'
from pathlib import Path

bad = False
for path in sorted(Path("examples/batch_hash_lookup_dump/passes_dump").glob("*.py")):
    try:
        compile(path.read_text(), str(path), "exec")
    except SyntaxError as exc:
        bad = True
        print(f"{path}:{exc.lineno}:{exc.offset}: {exc.msg}")

raise SystemExit(1 if bad else 0)
PY

Repository: hw-native-sys/pypto-lib

Length of output: 1373


🏁 Script executed:

head -n 41 examples/batch_hash_lookup_dump/passes_dump/08_after_ExpandMixedKernel.py | tail -n 1

Repository: hw-native-sys/pypto-lib

Length of output: 167


pl.tensor.and(...) is invalid Python syntax. and is a Python keyword and cannot be accessed via attribute notation. Line 41 fails to parse. This issue affects all 14 snapshot files in the pass dumps directory. Emit it via getattr(pl.tensor, "and") or a non-keyword alias instead.

Fix
-                hit_mask_0: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.and(active_mask_0, key_match_0)
+                hit_mask_0: pl.Tensor[[1, 32], pl.INDEX] = getattr(pl.tensor, "and")(active_mask_0, key_match_0)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
hit_mask_0: pl.Tensor[[1, 32], pl.INDEX] = pl.tensor.and(active_mask_0, key_match_0)
hit_mask_0: pl.Tensor[[1, 32], pl.INDEX] = getattr(pl.tensor, "and")(active_mask_0, key_match_0)
🧰 Tools
🪛 Ruff (0.15.7)

[warning] 41-41: Expected an identifier, but found a keyword and that cannot be used here

(invalid-syntax)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup_dump/passes_dump/08_after_ExpandMixedKernel.py` at
line 41, The expression using the attribute-style call pl.tensor.and in the
assignment to hit_mask_0 is invalid because "and" is a Python keyword; replace
that call with a valid reference such as getattr(pl.tensor,
"and")(active_mask_0, key_match_0) or use a non-keyword alias (e.g.,
pl.tensor.logical_and or the library's provided alternative) wherever
pl.tensor.and is used (notably the hit_mask_0 computation) across the 14
snapshot files so the files parse correctly.

Comment on lines +52 to +54
for probe_0, (b_iter_1, ti_iter_1, value_ptr_out_iter_6) in pl.range(0, 8, 1, init_values=(b_0, ti_0, value_ptr_out_2)):
    round_has_active_0: pl.Scalar[pl.INDEX] = 0
    ret: pl.Tuple([pl.Scalar[pl.INDEX], pl.Scalar[pl.INDEX], pl.Scalar[pl.INDEX], pl.Tensor[[1024, 64, 32], pl.INT32]]) = self.batch_hash_lookup_incore_1(hash_pool_0, probe_0, round_has_active_0, search_key_0, ti_0, ti_iter_1, value_ptr_out_2, value_ptr_out_iter_6)


⚠️ Potential issue | 🔴 Critical

batch_hash_lookup uses an unbound ti_0.

Line 52 seeds init_values with ti_0, and Line 54 passes ti_0 into batch_hash_lookup_incore_1, but this method never defines that name. Any runtime path that executes this orchestration function will hit a NameError. Thread a defined scalar here, or remove the dead ti_0 / ti_iter_1 state from the generated signature.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/batch_hash_lookup_dump/passes_dump/08_after_ExpandMixedKernel.py`
around lines 52 - 54, The orchestration loop seeds init_values with an undefined
ti_0 and also passes ti_0 into batch_hash_lookup_incore_1, causing a NameError;
fix by either threading a defined scalar (the intended iteration state) into
init_values and using that named scalar in the call, or remove the unused
ti_0/ti_iter_1 state from the generated signature and callers so
batch_hash_lookup_incore_1 is invoked only with defined symbols; specifically
update the for-loop init_values and the call to batch_hash_lookup_incore_1
(which currently mentions probe_0, b_iter_1, ti_iter_1, value_ptr_out_iter_6) to
use a valid, previously defined scalar (or drop ti_0/ti_iter_1 everywhere) so no
unbound name remains.
