Skip to content

feat: tensor aggregators (mean, std)#39

Open
JakubPekar wants to merge 21 commits intomainfrom
feature/aggregations
Open

feat: tensor aggregators (mean, std)#39
JakubPekar wants to merge 21 commits intomainfrom
feature/aggregations

Conversation

@JakubPekar
Copy link
Collaborator

@JakubPekar JakubPekar commented Mar 7, 2026

Summary by CodeRabbit

  • New Features

    • Added tensor mean and tensor standard-deviation aggregations (flexible axis handling, null-handling, stable parallel variance)
    • Exposed aggregators via the Ray aggregations package
  • Documentation

    • Added reference pages and navigation entries for the new aggregations
  • Tests

    • Added comprehensive tests for global/axis reductions, invalid-axis validation, and group-by scenarios
  • Chores

    • Bumped version to 1.3.0 and tightened Python requirement to >=3.12,<3.14

@JakubPekar JakubPekar requested review from a team March 7, 2026 20:24
@coderabbitai
Copy link

coderabbitai bot commented Mar 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds TensorMean and TensorStd Ray aggregators (with axis validation and Chan-style variance), re-exports them in package init, tightens python bound to <3.14, and adds tests, docs, and mkdocs navigation.

Changes

Cohort / File(s) Summary
Configuration
pyproject.toml
Bumped version to 1.3.0; tightened Python requirement to >=3.12,<3.14.
Aggregators
ratiopath/ray/aggregate/tensor_mean.py, ratiopath/ray/aggregate/tensor_std.py
Added TensorMean and TensorStd classes: axis validation (axis must include 0 when provided), zero_factory, aggregate_block, combine (Chan-style for std), and finalize.
Package Exports
ratiopath/ray/aggregate/__init__.py
New initializer re-exports TensorMean and TensorStd via __all__.
Tests
tests/test_ray_aggregations.py
New tests for global and axis-based reductions, invalid-axis validation, groupby behavior, and Ray setup/teardown.
Docs / Site
docs/reference/ray/aggregations/tensor_mean.md, docs/reference/ray/aggregations/tensor_std.md, mkdocs.yml
Added autodoc pages for the new aggregators and inserted them under Ray → Aggregations in site navigation.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant RayDS as Ray Dataset
    participant Scheduler as Ray Scheduler
    participant Block as Data Block
    participant Agg as Aggregator (TensorMean / TensorStd)
    participant Acc as Accumulator

    Client->>RayDS: request aggregate(Agg)
    RayDS->>Scheduler: schedule block tasks
    Scheduler->>Block: process block
    Block-->>Agg: provide block data
    Agg->>Acc: aggregate_block -> partial accumulator
    Acc-->>Scheduler: return partial accumulator
    Scheduler->>Agg: combine partial accumulators
    Agg->>Acc: merge accumulators (Chan-style for std)
    Scheduler->>Agg: finalize(accumulator)
    Agg-->>Client: return final aggregated result
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • vejtek

Poem

🐇
I nibble tensors row by row,
sum and square in tidy flow.
Axes checked, counts kept bright,
mean and std hop into light.
Carrots, code, and stats delight.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 38.89% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately summarizes the main change: introducing two new tensor aggregators (mean and std) for Ray aggregations.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/aggregations
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can use your project's `pylint` configuration to improve the quality of Python code reviews.

Add a pylint configuration file to your project to customize how CodeRabbit runs pylint.

@gemini-code-assist
Copy link

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the data processing capabilities by introducing new tensor aggregators for Ray Data. Specifically, it adds functionalities to compute the mean and standard deviation of tensor columns, allowing for flexible axis-based aggregation. The changes also include an update to the Python version compatibility and a simplification of dependency declarations, ensuring robustness and maintainability of the project.

Highlights

  • New Tensor Aggregators: Introduced TensorMean and TensorStd classes to calculate the mean and standard deviation of tensor columns within Ray Data, supporting aggregation across specified axes including the batch dimension.
  • Python Version Constraint Update: The project's Python version requirement has been updated to explicitly support Python 3.12 while excluding Python 3.14, reflected in pyproject.toml and uv.lock.
  • Dependency Simplification: Removed platform-specific markers from several dependency entries in uv.lock, streamlining the dependency resolution process.
  • Comprehensive Testing: Added new unit tests to validate the functionality of TensorMean and TensorStd for various aggregation scenarios, including global, batch-only, and grouped aggregations.
Changelog
  • pyproject.toml
    • Updated Python version constraint to '>=3.12,<3.14'.
  • ratiopath/ray/aggregate/init.py
    • Added TensorMean and TensorStd to the module's __all__ export list.
    • Imported TensorMean and TensorStd classes.
  • ratiopath/ray/aggregate/tensor_mean.py
    • Added TensorMean class, an AggregateFnV2 implementation for calculating the mean of tensor columns.
    • Implemented __init__, aggregate_block, combine, and finalize methods for distributed mean calculation.
    • Included detailed docstrings and examples for TensorMean usage.
  • ratiopath/ray/aggregate/tensor_std.py
    • Added TensorStd class, an AggregateFnV2 implementation for calculating the standard deviation of tensor columns using Chan's method for numerical stability.
    • Implemented __init__, aggregate_block, combine, and finalize methods for distributed standard deviation calculation.
    • Included detailed docstrings and examples for TensorStd usage.
  • tests/test_ray_aggregations.py
    • Added a pytest fixture for Ray initialization and shutdown.
    • Added test_tensor_mean_global to verify global mean calculation.
    • Added test_tensor_mean_int_shorthand to verify mean calculation with integer axis shorthand.
    • Added test_tensor_mean_batch_only to verify mean calculation across only the batch dimension.
    • Added test_tensor_std_global to verify global standard deviation calculation.
    • Added test_tensor_std_batch_only to verify standard deviation calculation across only the batch dimension.
    • Added test_invalid_axis_tuple to ensure ValueError is raised for invalid axis configurations.
    • Added test_tensor_aggregate_groupby to verify TensorMean and TensorStd functionality within groupby operations.
  • uv.lock
    • Updated the requires-python constraint to '>=3.12, <3.14'.
    • Removed numerous Python 3.14 specific wheel entries for various packages (e.g., backrefs, charset_normalizer, imagecodecs, mypy, numpy, pillow, pyproj, rpds_py, scikit_learn, scipy).
    • Removed platform-specific marker conditions from dependencies like nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, and triton.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces two useful aggregators for Ray DataFrames, TensorMean and TensorStd, which allow for calculating statistics on columns containing tensors. The implementations are well-documented and follow the AggregateFnV2 interface correctly. The accompanying tests cover the primary use cases, including global, per-axis, and grouped aggregations.

My review includes suggestions to enhance robustness by handling empty data blocks, which could otherwise lead to runtime errors. I've also provided feedback to improve the clarity and completeness of the tests.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/test_ray_aggregations.py (1)

99-105: Cover TensorStd in the invalid-axis test too.

TensorStd duplicates this validation in a separate public class, so only asserting TensorMean still leaves one entry point unguarded.

Suggested change
-def test_invalid_axis_tuple(ray_start):
+@pytest.mark.parametrize("agg_cls", [TensorMean, TensorStd])
+def test_invalid_axis_tuple(ray_start, agg_cls):
     """Verifies that providing a tuple without axis 0 raises ValueError."""
     with pytest.raises(
         ValueError, match=r"Axis 0 \(the batch dimension\) must be included"
     ):
-        TensorMean(on="m", axis=(1, 2))
+        agg_cls(on="m", axis=(1, 2))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_ray_aggregations.py` around lines 99 - 105, The test
test_invalid_axis_tuple currently only asserts that TensorMean(on="m", axis=(1,
2)) raises the ValueError; also instantiate TensorStd(on="m", axis=(1, 2)) in
the same pytest.raises block (or add a separate assertion) to ensure TensorStd
performs the same validation—update the test to call both TensorMean and
TensorStd with axis=(1, 2) and verify they raise the same ValueError about Axis
0 (the batch dimension) being required.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ratiopath/ray/aggregate/tensor_mean.py`:
- Around line 63-75: The constructor of TensorMean allows on: None but later
aggregation assumes a concrete column; in TensorMean.__init__ (and similarly for
TensorStd) either make the on parameter non-optional in the signature or
validate and reject None early: check the on argument in TensorMean.__init__ and
if on is None raise a ValueError with a clear message (e.g., "on must be
provided for TensorMean"), preventing instantiation with None; update the
name/zero_factory logic to use the validated on value thereafter (and apply the
same guard in the TensorStd constructor).

In `@ratiopath/ray/aggregate/tensor_std.py`:
- Around line 52-64: The constructor for the aggregator (__init__) accepts on:
str | None but later assumes a real target column, so either make the parameter
required (change signature to on: str) or validate and reject None up front;
update the __init__ in tensor_std (and the other similar ctor at the indicated
block around lines 79-81) to raise a clear ValueError/TypeError if on is None
(before calling super()) or change the type to non-optional so callers must
provide a column; ensure the error message mentions the parameter name ("on")
and keep alias_name logic intact when constructing name=f"std({on})".

In `@tests/test_ray_aggregations.py`:
- Around line 123-127: Add an assertion that validates the second group's
standard deviation as well: after computing res_std (from
groupby("id").aggregate(TensorStd(on="m", axis=0")) and sorting it, assert that
res_std[1]["std(m)"] equals the expected singleton std (e.g., [0.0, 0.0]) using
np.testing.assert_array_equal so the singleton group B is checked for
regressions.

---

Nitpick comments:
In `@tests/test_ray_aggregations.py`:
- Around line 99-105: The test test_invalid_axis_tuple currently only asserts
that TensorMean(on="m", axis=(1, 2)) raises the ValueError; also instantiate
TensorStd(on="m", axis=(1, 2)) in the same pytest.raises block (or add a
separate assertion) to ensure TensorStd performs the same validation—update the
test to call both TensorMean and TensorStd with axis=(1, 2) and verify they
raise the same ValueError about Axis 0 (the batch dimension) being required.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9bba0655-d2b1-438a-a8f9-f67db49854bd

📥 Commits

Reviewing files that changed from the base of the PR and between b5fcc00 and 9900fee.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • pyproject.toml
  • ratiopath/ray/aggregate/__init__.py
  • ratiopath/ray/aggregate/tensor_mean.py
  • ratiopath/ray/aggregate/tensor_std.py
  • tests/test_ray_aggregations.py

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
ratiopath/ray/aggregate/tensor_mean.py (1)

88-88: Inconsistent axis ordering with TensorStd.

TensorStd uses tuple(sorted(axes)) at line 77, while this uses tuple(axes). Since axes is a set, iteration order is not guaranteed to be sorted, which could lead to subtle inconsistencies in behavior.

♻️ Proposed fix for consistency
-            self.aggregate_axis = tuple(axes)
+            self.aggregate_axis = tuple(sorted(axes))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ratiopath/ray/aggregate/tensor_mean.py` at line 88, TensorMean currently
assigns self.aggregate_axis = tuple(axes) where axes is a set, causing
nondeterministic ordering versus TensorStd which uses tuple(sorted(axes));
change TensorMean to assign self.aggregate_axis = tuple(sorted(axes)) (use the
same sorted normalization as TensorStd) so aggregate_axis ordering is
deterministic and consistent across TensorMean and TensorStd.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ratiopath/ray/aggregate/tensor_std.py`:
- Around line 1-148: The file fails ruff formatting; run the project formatter
(or ruff) to reformat tensor_std.py and fix style issues in class TensorStd and
its methods (zero_factory, aggregate_block, combine, finalize) so the file
passes lint/format checks; specifically run the suggested command (uvx ruff
format --write ratiopath/ray/aggregate/tensor_std.py) or an equivalent
ruff/black formatter and commit the resulting changes.

---

Nitpick comments:
In `@ratiopath/ray/aggregate/tensor_mean.py`:
- Line 88: TensorMean currently assigns self.aggregate_axis = tuple(axes) where
axes is a set, causing nondeterministic ordering versus TensorStd which uses
tuple(sorted(axes)); change TensorMean to assign self.aggregate_axis =
tuple(sorted(axes)) (use the same sorted normalization as TensorStd) so
aggregate_axis ordering is deterministic and consistent across TensorMean and
TensorStd.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 02aebbe7-fe3f-4443-a8c4-137febe0fc41

📥 Commits

Reviewing files that changed from the base of the PR and between 9900fee and f2176f4.

📒 Files selected for processing (3)
  • ratiopath/ray/aggregate/tensor_mean.py
  • ratiopath/ray/aggregate/tensor_std.py
  • tests/test_ray_aggregations.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_ray_aggregations.py

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
ratiopath/ray/aggregate/tensor_std.py (1)

141-149: Consider documenting the ddof=0 (population std) behavior.

The implementation computes population standard deviation (sqrt(ssd/count)) rather than sample standard deviation (sqrt(ssd/(count-1))). This matches numpy's default behavior but users expecting sample std may be surprised.

Consider either:

  1. Adding a note in the docstring that this computes population std (like numpy.std(ddof=0))
  2. Or optionally exposing a ddof parameter for consistency with numpy's API
📝 Suggested docstring addition
     Args:
         on: The name of the column containing tensors or numbers.
         axis: The axis or axes along which the reduction is computed.
             - `None`: Global reduction. Collapses all dimensions (including batch)
               to a single scalar.
             - `int`: Aggregates over both the batch (axis 0) AND the specified
               tensor dimension. For example, `axis=1` collapses the batch and
               the first dimension of the tensors.
             - `tuple`: A sequence of axes that **must** explicitly include `0`.
         ignore_nulls: Whether to ignore null values. Defaults to True.
         alias_name: Optional name for the resulting column. Defaults to "std(<on>)".
+
+    Note:
+        This computes the **population** standard deviation (equivalent to
+        ``numpy.std(ddof=0)``), not the sample standard deviation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ratiopath/ray/aggregate/tensor_std.py` around lines 141 - 149, The finalize
method in tensor_std currently computes the population standard deviation (uses
sqrt(ssd/count), ddof=0) which may surprise users expecting sample std; update
the tensor_std implementation by documenting this behavior on the class or the
finalize method docstring (mentioning "population std, equivalent to
numpy.std(ddof=0)") or optionally add a ddof parameter (propagate ddof through
the accumulator logic and use count - ddof when computing the denominator) so
callers can request sample std; refer to the finalize method and accumulator
keys ("ssd", "count", "shape") when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@ratiopath/ray/aggregate/tensor_std.py`:
- Around line 141-149: The finalize method in tensor_std currently computes the
population standard deviation (uses sqrt(ssd/count), ddof=0) which may surprise
users expecting sample std; update the tensor_std implementation by documenting
this behavior on the class or the finalize method docstring (mentioning
"population std, equivalent to numpy.std(ddof=0)") or optionally add a ddof
parameter (propagate ddof through the accumulator logic and use count - ddof
when computing the denominator) so callers can request sample std; refer to the
finalize method and accumulator keys ("ssd", "count", "shape") when making the
change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ba520392-8419-4b9e-ba91-1fe3c4a89a41

📥 Commits

Reviewing files that changed from the base of the PR and between f2176f4 and eb4b53c.

📒 Files selected for processing (1)
  • ratiopath/ray/aggregate/tensor_std.py

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ratiopath/ray/aggregate/tensor_mean.py`:
- Around line 63-88: The constructor for TensorMean is using
self._aggregate_axis before it's set and allows axis=None which leads to calling
.flatten() on a NumPy scalar; fix __init__ in class TensorMean by setting a
sensible default for self._aggregate_axis when axis is None (e.g., (0,)
representing batch reduction), and when axis is provided convert it to a tuple
only after validating inclusion of 0 (use axes = {0, axis} or set(axis) then
self._aggregate_axis = tuple(axes)); this ensures downstream code never sees an
uninitialized _aggregate_axis or tries to call .flatten() on a scalar.

In `@ratiopath/ray/aggregate/tensor_std.py`:
- Around line 157-165: The finalize method currently returns a scalar np.nan
when count - self._ddof <= 0, collapsing tensor-shaped results; change that
branch to return a NaN array matching the reduced tensor shape stored in
accumulator["shape"]. In the finalize function (method finalize), replace the
scalar return with something like np.full(accumulator["shape"], np.nan,
dtype=float) or np.broadcast_to(np.asarray(np.nan, dtype=float),
accumulator["shape"]) so the returned value preserves shape for axis/tuple
reductions while keeping dtype consistent.
- Around line 65-94: The constructor (__init__) of TensorStd leaves
self._aggregate_axis unset when axis is None and relies later on it and on
array.flatten() which fails for NumPy scalars; set a sensible default for
self._aggregate_axis (e.g., (0,)) when axis is None and ensure the reduction
result is treated as at least 1-D before any .flatten() calls (for example by
using np.atleast_1d or wrapping scalars in an array) so TensorStd(on=...,
axis=None) follows the global-reduction path without error; update the logic in
TensorStd.__init__ (and the related reduction code that uses
self._aggregate_axis and .flatten()) to initialize and normalize these values
accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0be1eb14-3f12-4059-b728-fa3eae864ff3

📥 Commits

Reviewing files that changed from the base of the PR and between eb4b53c and 73eaa0d.

📒 Files selected for processing (2)
  • ratiopath/ray/aggregate/tensor_mean.py
  • ratiopath/ray/aggregate/tensor_std.py

Adames4
Adames4 previously approved these changes Mar 9, 2026
Adames4
Adames4 previously approved these changes Mar 12, 2026
@matejpekar
Copy link
Member

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces new TensorMean and TensorStd aggregators for Ray Data, which is a valuable feature for tensor-based data processing. The implementations are generally well-structured, particularly the use of Chan's algorithm for stable standard deviation calculation. I've identified a critical bug in the TensorStd implementation within the combine method due to a key mismatch in the accumulator dictionary, which would cause distributed aggregations to fail. I have also suggested an improvement to the tests to ensure they are robust enough to catch such distributed computing issues by forcing multi-block processing.

@JakubPekar
Copy link
Collaborator Author

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Mar 12, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@JakubPekar
Copy link
Collaborator Author

/gemini review

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
tests/test_ray_aggregations.py (2)

84-86: Swap pytest.approx to wrap the expected value.

The idiomatic pattern is assert actual == pytest.approx(expected, rel=...). The current form works but is less readable and the second positional arg to pytest.approx is rel, not abs.

Suggested fix
-    assert pytest.approx(result["std(m)"], 0.0001) == expected
+    assert result["std(m)"] == pytest.approx(expected, rel=1e-4)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_ray_aggregations.py` around lines 84 - 86, The assertion uses
pytest.approx with the actual value wrapped incorrectly; update the test in
tests/test_ray_aggregations.py so the actual result from
ds.aggregate(TensorStd(on="m", axis=None, ddof=0)) is compared to
pytest.approx(expected) instead of the other way around — e.g., assert
result["std(m)"] == pytest.approx(expected, rel=0.0001) — ensuring the tolerance
is passed as a keyword/rel argument.

13-14: Remove leftover development comment.

The comment "Adjust imports based on your file structure" appears to be a development note that should be removed before merging.

Suggested fix
-# Adjust imports based on your file structure
 from ratiopath.ray.aggregate import TensorMean, TensorStd
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_ray_aggregations.py` around lines 13 - 14, Remove the leftover
development comment "Adjust imports based on your file structure" from the
imports block in tests/test_ray_aggregations.py so only the actual import
remains; ensure the import line "from ratiopath.ray.aggregate import TensorMean,
TensorStd" is preserved and no other dev notes remain in that file's top-of-file
imports.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyproject.toml`:
- Line 13: The pyproject.toml Python version constraint currently uses the
requires-python key with ">=3.12,<3.14", which prevents installation on Python
3.14+; remove the upper bound or broaden it (for example change requires-python
to ">=3.12" or ">=3.12, <4.0" after testing) so the package supports newer
interpreters, then run your test suite against Python 3.14 to verify
compatibility before finalizing the change.

In `@ratiopath/ray/aggregate/tensor_mean.py`:
- Line 86: The assignment self._aggregate_axis = tuple(axes) in TensorMean
produces a non-deterministic ordering when axes is a set; change it to use a
sorted ordering (e.g., tuple(sorted(axes))) so TensorMean's _aggregate_axis is
deterministic and consistent with TensorStd (which already uses
tuple(sorted(axes))). Locate the assignment in the TensorMean initializer (the
line setting _aggregate_axis) and replace the tuple(axes) construction with a
sorted tuple of axes.

---

Nitpick comments:
In `@tests/test_ray_aggregations.py`:
- Around line 84-86: The assertion uses pytest.approx with the actual value
wrapped incorrectly; update the test in tests/test_ray_aggregations.py so the
actual result from ds.aggregate(TensorStd(on="m", axis=None, ddof=0)) is
compared to pytest.approx(expected) instead of the other way around — e.g.,
assert result["std(m)"] == pytest.approx(expected, rel=0.0001) — ensuring the
tolerance is passed as a keyword/rel argument.
- Around line 13-14: Remove the leftover development comment "Adjust imports
based on your file structure" from the imports block in
tests/test_ray_aggregations.py so only the actual import remains; ensure the
import line "from ratiopath.ray.aggregate import TensorMean, TensorStd" is
preserved and no other dev notes remain in that file's top-of-file imports.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 04bf19aa-00e6-44fe-8c76-93a57c88e27d

📥 Commits

Reviewing files that changed from the base of the PR and between 0dab47a and ec462b7.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • docs/reference/ray/aggregations/tensor_mean.md
  • docs/reference/ray/aggregations/tensor_std.md
  • mkdocs.yml
  • pyproject.toml
  • ratiopath/ray/aggregate/tensor_mean.py
  • ratiopath/ray/aggregate/tensor_std.py
  • tests/test_ray_aggregations.py
✅ Files skipped from review due to trivial changes (1)
  • docs/reference/ray/aggregations/tensor_mean.md

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces new TensorMean and TensorStd aggregators for Ray DataFrames, which is a great feature for tensor-based data processing. The implementation of TensorStd using Chan's method for numerical stability is particularly well-done. I've found a couple of critical correctness issues that could lead to runtime errors. My main feedback focuses on improving null value handling in both aggregators and fixing a bug in TensorStd related to global aggregations. The provided tests are comprehensive and cover the main use cases well.

vejtek
vejtek previously approved these changes Mar 12, 2026
@JakubPekar
Copy link
Collaborator Author

/gemini review

@JakubPekar
Copy link
Collaborator Author

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Mar 13, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces new TensorMean and TensorStd aggregators for Ray Data, which is a great feature for tensor-based data processing. The implementation of TensorStd correctly uses Chan's algorithm for stable parallel standard deviation calculation, which is excellent for numerical stability.

However, I've found a critical issue in the null-handling logic in both new aggregator classes. The call to block_acc.count is incorrect and will cause a runtime error. Additionally, the handling of nulls when ignore_nulls=False is missing, and there's an edge case with blocks containing only nulls when ignore_nulls=True that could lead to a crash. I've left detailed comments with suggestions to fix these issues.

The rest of the code, including documentation and tests for the happy paths, looks good.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ratiopath/ray/aggregate/tensor_mean.py`:
- Line 101: There's a minor typo in the inline comment "# Filter out Nones)" —
remove the extraneous closing parenthesis so the comment reads "# Filter out
Nones"; locate this comment in tensor_mean.py (around the aggregation logic in
the tensor mean function / reduce step) and update it accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b0433680-1596-4952-96c7-6e797191a8cf

📥 Commits

Reviewing files that changed from the base of the PR and between ec462b7 and 5215154.

📒 Files selected for processing (3)
  • ratiopath/ray/aggregate/tensor_mean.py
  • ratiopath/ray/aggregate/tensor_std.py
  • tests/test_ray_aggregations.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • ratiopath/ray/aggregate/tensor_std.py
  • tests/test_ray_aggregations.py

@JakubPekar JakubPekar requested review from Adames4 and vejtek March 13, 2026 09:29
vejtek
vejtek previously approved these changes Mar 13, 2026
Copy link
Member

@Adames4 Adames4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix mypy

@JakubPekar JakubPekar requested review from Adames4 and vejtek March 16, 2026 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants