Refactor layer_norm to two-pass column-chunking pattern #31

Open

zhangqi-chen wants to merge 1 commit into hw-native-sys:main from zhangqi-chen:refactor/layer-norm-column-chunking

Conversation


@zhangqi-chen zhangqi-chen commented Mar 23, 2026

Summary

  • Refactor layer_norm.py from single-pass row-only tiling to a two-pass row+column chunking pattern
  • Pass 1 accumulates sum(x) and sum(x²) across hidden chunks; Pass 2 centres, normalises, and applies gamma/beta per chunk
  • Variance computed via E[x²] - E[x]² to avoid materialising the centred tensor during accumulation
  • Matches the pattern used by rms_norm.py and production LLM kernels (qwen3/deepseek)
  • Bumps HIDDEN from 256 to 512; adds HIDDEN_CHUNK = 64 constant

Testing

  • Example runs successfully
  • Code follows pypto frontend coding style

Summary by CodeRabbit

  • Refactor
    • Optimized layer normalization example with an enhanced two-stage computation strategy.
    • Added new configurable parameter for finer-grained control over computation tiling.
    • Updated default configuration values for improved baseline performance.

Replace the single-pass row-only tiling with a two-pass approach that
chunks the hidden dimension, matching the pattern used by rms_norm.py
and the production LLM kernels (qwen3/deepseek).

Pass 1 accumulates sum(x) and sum(x^2) across hidden chunks, then
computes mean and inv_std via E[x^2] - E[x]^2. Pass 2 centres,
normalises, and applies gamma/beta per chunk.

This enables larger hidden dimensions (HIDDEN bumped from 256 to 512)
by avoiding loading the full hidden axis in a single tile.
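
The two-pass scheme the commit describes can be sketched in plain NumPy. This is an illustrative sketch only: the actual example is written against the pypto `pl` API, and the function name `layer_norm_two_pass` is hypothetical.

```python
import numpy as np

def layer_norm_two_pass(x, gamma, beta, hidden_chunk=64, eps=1e-5):
    """Two-pass layer norm that chunks the hidden (column) axis.

    Pass 1 accumulates sum(x) and sum(x^2) per row across hidden
    chunks; pass 2 centres, normalises, and applies gamma/beta per
    chunk. Mirrors the structure described in the PR, not the pl API.
    """
    rows, hidden = x.shape
    if hidden % hidden_chunk != 0:
        # A tail chunk would otherwise be silently dropped.
        raise ValueError("hidden must be a multiple of hidden_chunk")

    # Pass 1: accumulate per-row statistics chunk by chunk.
    # (float64 accumulators here for clarity; the pl version uses FP32 tiles.)
    x_sum = np.zeros(rows)
    sq_sum = np.zeros(rows)
    for h0 in range(0, hidden, hidden_chunk):
        chunk = x[:, h0:h0 + hidden_chunk]
        x_sum += chunk.sum(axis=1)
        sq_sum += (chunk * chunk).sum(axis=1)

    mean = x_sum / hidden
    var = sq_sum / hidden - mean * mean  # E[x^2] - E[x]^2
    inv_std = 1.0 / np.sqrt(var + eps)

    # Pass 2: centre, normalise, and scale/shift chunk by chunk,
    # reusing the statistics computed in pass 1.
    y = np.empty_like(x)
    for h0 in range(0, hidden, hidden_chunk):
        chunk = x[:, h0:h0 + hidden_chunk]
        normed = (chunk - mean[:, None]) * inv_std[:, None]
        y[:, h0:h0 + hidden_chunk] = (
            normed * gamma[h0:h0 + hidden_chunk] + beta[h0:h0 + hidden_chunk]
        )
    return y
```

Because no full row of the hidden axis is materialised at once, the chunk width (not `hidden`) bounds the working tile size, which is what allows HIDDEN to grow from 256 to 512.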
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the layer_norm implementation by transitioning from a basic single-pass approach to a sophisticated two-pass row and column chunking strategy. This change improves computational efficiency, particularly for larger hidden dimensions, by optimizing how sums and squared sums are accumulated and how variance is calculated. The refactoring also brings the layer_norm example in line with established best practices seen in other normalization layers and high-performance LLM kernels, ensuring better scalability and performance.

Highlights

  • Refactored Layer Normalization: The layer_norm.py implementation was refactored from a single-pass row-only tiling approach to a more efficient two-pass row and column chunking pattern.
  • Two-Pass Processing: The new pattern involves a first pass to accumulate the sum and squared-sum of input x across hidden chunks, followed by a second pass to center, normalize, and apply gamma/beta per chunk.
  • Optimized Variance Calculation: Variance is now computed using the formula E[x^2] - E[x]^2, which avoids the need to materialize the centered tensor during the accumulation phase, improving efficiency.
  • Alignment with Production Patterns: This refactoring aligns the layer_norm implementation with patterns used in rms_norm.py and production-grade LLM kernels like those found in Qwen3 and DeepSeek.
  • Increased Hidden Dimension and Chunking: The default HIDDEN dimension was increased from 256 to 512, and a new HIDDEN_CHUNK = 64 constant was introduced to manage column chunking.


coderabbitai bot commented Mar 23, 2026

📝 Walkthrough

Walkthrough

The LayerNormProgram implementation was refactored from a single-pass to a two-pass approach with explicit hidden dimension chunking. A new hidden_chunk parameter controls chunk size, and the module constant HIDDEN was increased from 256 to 512. Function signatures were updated to accept the new parameter.

Changes

Cohort / File(s): LayerNorm Implementation — examples/layer_norm.py
Summary: Restructured layer normalization from single-pass row-tiling to a two-pass scheme with hidden dimension chunking. The first pass accumulates sum(x) and sum(x²) across hidden chunks to compute mean and inv_std; the second pass normalizes using the precomputed statistics. Added a hidden_chunk parameter, updated module constants (HIDDEN → 512, added HIDDEN_CHUNK = 64), and revised the function signatures of build_layer_norm_program() and compile_and_run().

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Two passes now where once was one,
Chunks dancing through the hidden sun,
Mean and variance side by side,
Normalization's cleaner ride! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — the title 'Refactor layer_norm to two-pass column-chunking pattern' accurately and concisely describes the main change: converting from a single-pass to a two-pass implementation with column chunking.





@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors layer_norm.py to a more scalable two-pass column-chunking pattern, which is a solid improvement for handling larger hidden dimensions. The implementation correctly uses the E[x^2] - E[x]^2 formula for variance, which is memory-efficient. The code is clear and aligns well with existing patterns in the codebase. I've included a couple of suggestions for minor performance and style enhancements.

Comment on lines +61 to +64
x_sum = pl.create_tensor([1, row_chunk], dtype=pl.FP32)
x_sum = pl.mul(x_sum, 0.0)
sq_sum = pl.create_tensor([1, row_chunk], dtype=pl.FP32)
sq_sum = pl.mul(sq_sum, 0.0)

Severity: medium

The initialization of x_sum and sq_sum can be made more concise by combining the tensor creation and the zeroing operation into a single statement for each tensor. This improves readability and reduces redundancy.

Suggested change
-x_sum = pl.create_tensor([1, row_chunk], dtype=pl.FP32)
-x_sum = pl.mul(x_sum, 0.0)
-sq_sum = pl.create_tensor([1, row_chunk], dtype=pl.FP32)
-sq_sum = pl.mul(sq_sum, 0.0)
+x_sum = pl.mul(pl.create_tensor([1, row_chunk], dtype=pl.FP32), 0.0)
+sq_sum = pl.mul(pl.create_tensor([1, row_chunk], dtype=pl.FP32), 0.0)

centred = pl.row_expand_sub(x_chunk, mean)
normed = pl.row_expand_mul(centred, inv_std)
scaled = pl.col_expand_mul(normed, gamma_chunk)
ones = pl.add(pl.sub(x_chunk, x_chunk), 1.0)

Severity: medium

The ones tensor is being recreated in every iteration of the for hb in pl.range(hidden_blocks): loop. Since its shape and value are constant within this loop, it can be created once before the loop begins to avoid redundant computation and improve performance. You could create a template tensor of shape [row_chunk, hidden_chunk] before the loop and use that to generate the ones tensor.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/layer_norm.py`:
- Around line 36-44: The build_layer_norm_program currently computes
hidden_blocks = hidden // hidden_chunk which silently drops a tail when hidden %
hidden_chunk != 0 (and fails entirely if hidden_chunk > hidden); update
build_layer_norm_program to guard and handle tails: either validate the inputs
and raise a clear exception if hidden % hidden_chunk != 0 or compute
hidden_blocks = math.ceil(hidden / hidden_chunk) and add explicit per-block
logic in the loops (using hidden_chunk for full blocks and computing a
last_block_size = hidden - (hidden_blocks-1)*hidden_chunk for the final partial
block) so mean/variance reductions and writes to y only process the actual
remaining columns and use the correct normalization factor (use hidden or the
per-row actual element count) for the tail; reference the symbols hidden_blocks,
hidden_chunk, hidden_inv, and hidden when making the change.


📥 Commits

Reviewing files that changed from the base of the PR and between 7a8f414 and d734f00.

📒 Files selected for processing (1)
  • examples/layer_norm.py

Comment on lines 36 to 44

def build_layer_norm_program(
    rows: int = ROWS,
    hidden: int = HIDDEN,
    row_chunk: int = ROW_CHUNK,
    hidden_chunk: int = HIDDEN_CHUNK,
    eps: float = EPS,
):
    hidden_blocks = hidden // hidden_chunk
    hidden_inv = 1.0 / hidden

⚠️ Potential issue | 🔴 Critical

Guard hidden_chunk values that do not evenly tile hidden.

hidden_blocks = hidden // hidden_chunk truncates. For any hidden % hidden_chunk != 0, pass 1 drops the tail from the mean/variance reduction and pass 2 never writes the tail columns back to y. If hidden_chunk > hidden, both loops skip entirely and the output stays unwritten. Add a precondition here or explicit tail handling before building the loops.

🛠️ Suggested guard
 def build_layer_norm_program(
     rows: int = ROWS,
     hidden: int = HIDDEN,
     row_chunk: int = ROW_CHUNK,
     hidden_chunk: int = HIDDEN_CHUNK,
     eps: float = EPS,
 ):
+    if hidden <= 0:
+        raise ValueError(f"`hidden` must be > 0, got {hidden}")
+    if hidden_chunk <= 0 or hidden % hidden_chunk != 0:
+        raise ValueError(
+            "`hidden_chunk` must be a positive divisor of `hidden` "
+            f"(got hidden={hidden}, hidden_chunk={hidden_chunk})"
+        )
     hidden_blocks = hidden // hidden_chunk
     hidden_inv = 1.0 / hidden
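
For the ceil-division alternative the comment sketches, the per-block column counts work out as follows. This is a plain-Python illustration; `block_sizes` is a hypothetical helper, not part of the example file.

```python
import math

def block_sizes(hidden: int, hidden_chunk: int) -> list[int]:
    """Column counts per block when `hidden` may not tile evenly.

    Every block is hidden_chunk wide except a possibly smaller final
    block that covers the tail columns, so no columns are dropped.
    """
    if hidden <= 0 or hidden_chunk <= 0:
        raise ValueError("hidden and hidden_chunk must be positive")
    hidden_blocks = math.ceil(hidden / hidden_chunk)
    sizes = [hidden_chunk] * (hidden_blocks - 1)
    # last_block_size = hidden - (hidden_blocks - 1) * hidden_chunk
    sizes.append(hidden - (hidden_blocks - 1) * hidden_chunk)
    return sizes
```

Note that the mean/variance reduction must still divide by `hidden` (the true element count), not by `hidden_blocks * hidden_chunk`.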
