Refactor layer_norm to two-pass column-chunking pattern #31
zhangqi-chen wants to merge 1 commit into hw-native-sys:main
Conversation
Replace the single-pass row-only tiling with a two-pass approach that chunks the hidden dimension, matching the pattern used by rms_norm.py and the production LLM kernels (qwen3/deepseek). Pass 1 accumulates sum(x) and sum(x^2) across hidden chunks, then computes mean and inv_std via E[x^2] - E[x]^2. Pass 2 centres, normalises, and applies gamma/beta per chunk. This enables larger hidden dimensions (HIDDEN bumped from 256 to 512) by avoiding loading the full hidden axis in a single tile.
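The two-pass scheme described above can be sketched as a plain-NumPy reference model (this is an illustration, not the `pl` DSL the kernel actually uses; the function name and chunk loop are assumptions for clarity):

```python
import numpy as np

def layer_norm_two_pass(x, gamma, beta, hidden_chunk=64, eps=1e-5):
    """Reference sketch: pass 1 accumulates sum(x) and sum(x^2) per row
    across hidden chunks, pass 2 centres/normalises/scales each chunk."""
    rows, hidden = x.shape
    assert hidden % hidden_chunk == 0, "sketch assumes an even tiling"
    x_sum = np.zeros(rows)
    sq_sum = np.zeros(rows)
    # Pass 1: chunked accumulation over the hidden axis.
    for h0 in range(0, hidden, hidden_chunk):
        chunk = x[:, h0:h0 + hidden_chunk]
        x_sum += chunk.sum(axis=1)
        sq_sum += (chunk * chunk).sum(axis=1)
    mean = x_sum / hidden
    var = sq_sum / hidden - mean * mean          # E[x^2] - E[x]^2
    inv_std = 1.0 / np.sqrt(var + eps)
    # Pass 2: centre, normalise, apply gamma/beta per chunk.
    y = np.empty_like(x)
    for h0 in range(0, hidden, hidden_chunk):
        chunk = x[:, h0:h0 + hidden_chunk]
        y[:, h0:h0 + hidden_chunk] = (
            (chunk - mean[:, None]) * inv_std[:, None]
            * gamma[h0:h0 + hidden_chunk] + beta[h0:h0 + hidden_chunk]
        )
    return y
```

The point of the split is that no pass ever holds the full hidden axis in one tile; only the per-row scalars (sums, mean, inv_std) persist between chunks.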
Code Review
This pull request refactors layer_norm.py to a more scalable two-pass column-chunking pattern, which is a solid improvement for handling larger hidden dimensions. The implementation correctly uses the E[x^2] - E[x]^2 formula for variance, which is memory-efficient. The code is clear and aligns well with existing patterns in the codebase. I've included a couple of suggestions for minor performance and style enhancements.
```python
x_sum = pl.create_tensor([1, row_chunk], dtype=pl.FP32)
x_sum = pl.mul(x_sum, 0.0)
sq_sum = pl.create_tensor([1, row_chunk], dtype=pl.FP32)
sq_sum = pl.mul(sq_sum, 0.0)
```
The initialization of x_sum and sq_sum can be made more concise by combining the tensor creation and the zeroing operation into a single statement for each tensor. This improves readability and reduces redundancy.
```diff
-x_sum = pl.create_tensor([1, row_chunk], dtype=pl.FP32)
-x_sum = pl.mul(x_sum, 0.0)
-sq_sum = pl.create_tensor([1, row_chunk], dtype=pl.FP32)
-sq_sum = pl.mul(sq_sum, 0.0)
+x_sum = pl.mul(pl.create_tensor([1, row_chunk], dtype=pl.FP32), 0.0)
+sq_sum = pl.mul(pl.create_tensor([1, row_chunk], dtype=pl.FP32), 0.0)
```
```python
centred = pl.row_expand_sub(x_chunk, mean)
normed = pl.row_expand_mul(centred, inv_std)
scaled = pl.col_expand_mul(normed, gamma_chunk)
ones = pl.add(pl.sub(x_chunk, x_chunk), 1.0)
```
The ones tensor is being recreated in every iteration of the for hb in pl.range(hidden_blocks): loop. Since its shape and value are constant within this loop, it can be created once before the loop begins to avoid redundant computation and improve performance. You could create a template tensor of shape [row_chunk, hidden_chunk] before the loop and use that to generate the ones tensor.
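A hedged sketch of the suggested hoist, reusing only `pl` calls that already appear in this diff (the exact `pl` semantics are assumed here, so treat this as pseudocode rather than a verified rewrite):

```python
# Hypothetical: build the constant ones tile once, before the hidden-block loop,
# using the same create-then-zero idiom as x_sum/sq_sum above.
ones = pl.add(pl.mul(pl.create_tensor([row_chunk, hidden_chunk], dtype=pl.FP32), 0.0), 1.0)
for hb in pl.range(hidden_blocks):
    centred = pl.row_expand_sub(x_chunk, mean)
    normed = pl.row_expand_mul(centred, inv_std)
    scaled = pl.col_expand_mul(normed, gamma_chunk)
    # `ones` is loop-invariant: no per-iteration pl.sub/pl.add rebuild needed.
```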
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/layer_norm.py`:
- Around line 36-44: The build_layer_norm_program currently computes
hidden_blocks = hidden // hidden_chunk which silently drops a tail when hidden %
hidden_chunk != 0 (and fails entirely if hidden_chunk > hidden); update
build_layer_norm_program to guard and handle tails: either validate the inputs
and raise a clear exception if hidden % hidden_chunk != 0 or compute
hidden_blocks = math.ceil(hidden / hidden_chunk) and add explicit per-block
logic in the loops (using hidden_chunk for full blocks and computing a
last_block_size = hidden - (hidden_blocks-1)*hidden_chunk for the final partial
block) so mean/variance reductions and writes to y only process the actual
remaining columns and use the correct normalization factor (use hidden or the
per-row actual element count) for the tail; reference the symbols hidden_blocks,
hidden_chunk, hidden_inv, and hidden when making the change.
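The ceil-division option from the prompt above can be sketched as a small helper (an illustration only; the helper name is hypothetical and not part of the PR):

```python
import math

def hidden_block_sizes(hidden: int, hidden_chunk: int) -> list[int]:
    """Split `hidden` into ceil(hidden / hidden_chunk) blocks, where every
    block is hidden_chunk wide except a possibly smaller final tail block."""
    if hidden <= 0 or hidden_chunk <= 0:
        raise ValueError("hidden and hidden_chunk must be positive")
    hidden_blocks = math.ceil(hidden / hidden_chunk)
    sizes = [hidden_chunk] * (hidden_blocks - 1)
    # Tail block: whatever columns remain after the full blocks.
    sizes.append(hidden - (hidden_blocks - 1) * hidden_chunk)
    return sizes
```

Pass 1 and pass 2 would then iterate over these per-block widths, and the normalization factor stays `1.0 / hidden` since the blocks together cover exactly the hidden axis.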
📒 Files selected for processing (1)
examples/layer_norm.py
```python
def build_layer_norm_program(
    rows: int = ROWS,
    hidden: int = HIDDEN,
    row_chunk: int = ROW_CHUNK,
    hidden_chunk: int = HIDDEN_CHUNK,
    eps: float = EPS,
):
    hidden_blocks = hidden // hidden_chunk
    hidden_inv = 1.0 / hidden
```
Guard hidden_chunk values that do not evenly tile hidden.
hidden_blocks = hidden // hidden_chunk truncates. For any hidden % hidden_chunk != 0, pass 1 drops the tail from the mean/variance reduction and pass 2 never writes the tail columns back to y. If hidden_chunk > hidden, both loops skip entirely and the output stays unwritten. Add a precondition here or explicit tail handling before building the loops.
🛠️ Suggested guard

```diff
 def build_layer_norm_program(
     rows: int = ROWS,
     hidden: int = HIDDEN,
     row_chunk: int = ROW_CHUNK,
     hidden_chunk: int = HIDDEN_CHUNK,
     eps: float = EPS,
 ):
+    if hidden <= 0:
+        raise ValueError(f"`hidden` must be > 0, got {hidden}")
+    if hidden_chunk <= 0 or hidden % hidden_chunk != 0:
+        raise ValueError(
+            "`hidden_chunk` must be a positive divisor of `hidden` "
+            f"(got hidden={hidden}, hidden_chunk={hidden_chunk})"
+        )
     hidden_blocks = hidden // hidden_chunk
     hidden_inv = 1.0 / hidden
```
Summary

- Refactor `layer_norm.py` from single-pass row-only tiling to a two-pass row+column chunking pattern
- Pass 1 accumulates `sum(x)` and `sum(x²)` across hidden chunks; Pass 2 centres, normalises, and applies gamma/beta per chunk
- Compute variance via `E[x²] - E[x]²` to avoid materialising the centred tensor during accumulation
- Match the pattern used by `rms_norm.py` and production LLM kernels (qwen3/deepseek)
- Add a `HIDDEN_CHUNK = 64` constant

Testing