Conversation
Note: Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior is configurable, and the review can be resumed with CodeRabbit's management commands.
📝 Walkthrough

Adds TensorMean and TensorStd Ray aggregators (with axis validation and Chan-style variance), re-exports them in the package init, tightens the Python bound to <3.14, and adds tests, docs, and mkdocs navigation.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client
    participant RayDS as Ray Dataset
    participant Scheduler as Ray Scheduler
    participant Block as Data Block
    participant Agg as Aggregator (TensorMean / TensorStd)
    participant Acc as Accumulator
    Client->>RayDS: request aggregate(Agg)
    RayDS->>Scheduler: schedule block tasks
    Scheduler->>Block: process block
    Block-->>Agg: provide block data
    Agg->>Acc: aggregate_block -> partial accumulator
    Acc-->>Scheduler: return partial accumulator
    Scheduler->>Agg: combine partial accumulators
    Agg->>Acc: merge accumulators (Chan-style for std)
    Scheduler->>Agg: finalize(accumulator)
    Agg-->>Client: return final aggregated result
```
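The walkthrough and diagram mention Chan-style merging of partial accumulators for the standard deviation. A minimal sketch of that merge step — illustrative only, with hypothetical `accumulate`/`merge` helpers rather than the PR's actual `AggregateFnV2` methods:

```python
import numpy as np

def accumulate(block):
    """Per-block partial statistics: (count, mean, sum of squared deviations)."""
    arr = np.asarray(block, dtype=float)
    n = arr.shape[0]
    mean = arr.mean(axis=0)
    ssd = ((arr - mean) ** 2).sum(axis=0)
    return n, mean, ssd

def merge(a, b):
    """Chan et al.'s parallel update: combine two partial accumulators."""
    n_a, mean_a, ssd_a = a
    n_b, mean_b, ssd_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    ssd = ssd_a + ssd_b + delta**2 * n_a * n_b / n
    return n, mean, ssd

# Two "blocks" as a distributed dataset would produce them.
n, mean, ssd = merge(
    accumulate([[1.0, 2.0], [3.0, 4.0]]),
    accumulate([[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]),
)
std = np.sqrt(ssd / n)  # population std (ddof=0)
```

Finalizing with `sqrt(ssd / count)` then matches a single-pass `numpy.std` over the concatenated data.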
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the data processing capabilities by introducing new tensor aggregators for Ray Data. Specifically, it adds functionality to compute the mean and standard deviation of tensor columns, allowing for flexible axis-based aggregation. The changes also include an update to the Python version compatibility and a simplification of dependency declarations, ensuring robustness and maintainability of the project.
Code Review
This pull request introduces two useful aggregators for Ray DataFrames, TensorMean and TensorStd, which allow for calculating statistics on columns containing tensors. The implementations are well-documented and follow the AggregateFnV2 interface correctly. The accompanying tests cover the primary use cases, including global, per-axis, and grouped aggregations.
My review includes suggestions to enhance robustness by handling empty data blocks, which could otherwise lead to runtime errors. I've also provided feedback to improve the clarity and completeness of the tests.
Actionable comments posted: 3
🧹 Nitpick comments (1)
tests/test_ray_aggregations.py (1)
99-105: Cover TensorStd in the invalid-axis test too.

TensorStd duplicates this validation in a separate public class, so only asserting TensorMean still leaves one entry point unguarded.

Suggested change
```diff
-def test_invalid_axis_tuple(ray_start):
+@pytest.mark.parametrize("agg_cls", [TensorMean, TensorStd])
+def test_invalid_axis_tuple(ray_start, agg_cls):
     """Verifies that providing a tuple without axis 0 raises ValueError."""
     with pytest.raises(
         ValueError, match=r"Axis 0 \(the batch dimension\) must be included"
     ):
-        TensorMean(on="m", axis=(1, 2))
+        agg_cls(on="m", axis=(1, 2))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_ray_aggregations.py` around lines 99 - 105, The test test_invalid_axis_tuple currently only asserts that TensorMean(on="m", axis=(1, 2)) raises the ValueError; also instantiate TensorStd(on="m", axis=(1, 2)) in the same pytest.raises block (or add a separate assertion) to ensure TensorStd performs the same validation—update the test to call both TensorMean and TensorStd with axis=(1, 2) and verify they raise the same ValueError about Axis 0 (the batch dimension) being required.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ratiopath/ray/aggregate/tensor_mean.py`:
- Around line 63-75: The constructor of TensorMean allows on: None but later
aggregation assumes a concrete column; in TensorMean.__init__ (and similarly for
TensorStd) either make the on parameter non-optional in the signature or
validate and reject None early: check the on argument in TensorMean.__init__ and
if on is None raise a ValueError with a clear message (e.g., "on must be
provided for TensorMean"), preventing instantiation with None; update the
name/zero_factory logic to use the validated on value thereafter (and apply the
same guard in the TensorStd constructor).
In `@ratiopath/ray/aggregate/tensor_std.py`:
- Around line 52-64: The constructor for the aggregator (__init__) accepts on:
str | None but later assumes a real target column, so either make the parameter
required (change signature to on: str) or validate and reject None up front;
update the __init__ in tensor_std (and the other similar ctor at the indicated
block around lines 79-81) to raise a clear ValueError/TypeError if on is None
(before calling super()) or change the type to non-optional so callers must
provide a column; ensure the error message mentions the parameter name ("on")
and keep alias_name logic intact when constructing name=f"std({on})".
In `@tests/test_ray_aggregations.py`:
- Around line 123-127: Add an assertion that validates the second group's
standard deviation as well: after computing res_std (from
groupby("id").aggregate(TensorStd(on="m", axis=0)) and sorting it, assert that
res_std[1]["std(m)"] equals the expected singleton std (e.g., [0.0, 0.0]) using
np.testing.assert_array_equal so the singleton group B is checked for
regressions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9bba0655-d2b1-438a-a8f9-f67db49854bd
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (5)
- pyproject.toml
- ratiopath/ray/aggregate/__init__.py
- ratiopath/ray/aggregate/tensor_mean.py
- ratiopath/ray/aggregate/tensor_std.py
- tests/test_ray_aggregations.py
Actionable comments posted: 1
🧹 Nitpick comments (1)
ratiopath/ray/aggregate/tensor_mean.py (1)
88-88: Inconsistent axis ordering with TensorStd.
TensorStd uses tuple(sorted(axes)) at line 77, while this uses tuple(axes). Since axes is a set, iteration order is not guaranteed to be sorted, which could lead to subtle inconsistencies in behavior.

♻️ Proposed fix for consistency
```diff
- self.aggregate_axis = tuple(axes)
+ self.aggregate_axis = tuple(sorted(axes))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ratiopath/ray/aggregate/tensor_mean.py` at line 88, TensorMean currently assigns self.aggregate_axis = tuple(axes) where axes is a set, causing nondeterministic ordering versus TensorStd which uses tuple(sorted(axes)); change TensorMean to assign self.aggregate_axis = tuple(sorted(axes)) (use the same sorted normalization as TensorStd) so aggregate_axis ordering is deterministic and consistent across TensorMean and TensorStd.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ratiopath/ray/aggregate/tensor_std.py`:
- Around line 1-148: The file fails ruff formatting; run the project formatter
(or ruff) to reformat tensor_std.py and fix style issues in class TensorStd and
its methods (zero_factory, aggregate_block, combine, finalize) so the file
passes lint/format checks; specifically run the suggested command (uvx ruff
format --write ratiopath/ray/aggregate/tensor_std.py) or an equivalent
ruff/black formatter and commit the resulting changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 02aebbe7-fe3f-4443-a8c4-137febe0fc41
📒 Files selected for processing (3)
- ratiopath/ray/aggregate/tensor_mean.py
- ratiopath/ray/aggregate/tensor_std.py
- tests/test_ray_aggregations.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_ray_aggregations.py
🧹 Nitpick comments (1)
ratiopath/ray/aggregate/tensor_std.py (1)
141-149: Consider documenting the ddof=0 (population std) behavior.

The implementation computes population standard deviation (sqrt(ssd/count)) rather than sample standard deviation (sqrt(ssd/(count-1))). This matches numpy's default behavior, but users expecting sample std may be surprised. Consider either:

- Adding a note in the docstring that this computes population std (like numpy.std(ddof=0))
- Or optionally exposing a ddof parameter for consistency with numpy's API

📝 Suggested docstring addition
```diff
 Args:
     on: The name of the column containing tensors or numbers.
     axis: The axis or axes along which the reduction is computed.
         - `None`: Global reduction. Collapses all dimensions (including batch)
           to a single scalar.
         - `int`: Aggregates over both the batch (axis 0) AND the specified
           tensor dimension. For example, `axis=1` collapses the batch and the
           first dimension of the tensors.
         - `tuple`: A sequence of axes that **must** explicitly include `0`.
     ignore_nulls: Whether to ignore null values. Defaults to True.
     alias_name: Optional name for the resulting column. Defaults to "std(<on>)".
+
+Note:
+    This computes the **population** standard deviation (equivalent to
+    ``numpy.std(ddof=0)``), not the sample standard deviation.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ratiopath/ray/aggregate/tensor_std.py` around lines 141 - 149, The finalize method in tensor_std currently computes the population standard deviation (uses sqrt(ssd/count), ddof=0) which may surprise users expecting sample std; update the tensor_std implementation by documenting this behavior on the class or the finalize method docstring (mentioning "population std, equivalent to numpy.std(ddof=0)") or optionally add a ddof parameter (propagate ddof through the accumulator logic and use count - ddof when computing the denominator) so callers can request sample std; refer to the finalize method and accumulator keys ("ssd", "count", "shape") when making the change.
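For context, the two conventions differ only in the denominator; a quick numpy illustration, independent of the PR's code:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0])  # mean 4.0, sum of squared deviations 8.0
pop = np.std(x)                # population std: sqrt(8 / 3), numpy default (ddof=0)
samp = np.std(x, ddof=1)       # sample std: sqrt(8 / 2) = 2.0
```

With `ddof=1` the denominator becomes `n - 1`, which is what many statistics libraries report by default — hence the suggestion to document the choice.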
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ba520392-8419-4b9e-ba91-1fe3c4a89a41
📒 Files selected for processing (1)
ratiopath/ray/aggregate/tensor_std.py
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ratiopath/ray/aggregate/tensor_mean.py`:
- Around line 63-88: The constructor for TensorMean is using
self._aggregate_axis before it's set and allows axis=None which leads to calling
.flatten() on a NumPy scalar; fix __init__ in class TensorMean by setting a
sensible default for self._aggregate_axis when axis is None (e.g., (0,)
representing batch reduction), and when axis is provided convert it to a tuple
only after validating inclusion of 0 (use axes = {0, axis} or set(axis) then
self._aggregate_axis = tuple(axes)); this ensures downstream code never sees an
uninitialized _aggregate_axis or tries to call .flatten() on a scalar.
In `@ratiopath/ray/aggregate/tensor_std.py`:
- Around line 157-165: The finalize method currently returns a scalar np.nan
when count - self._ddof <= 0, collapsing tensor-shaped results; change that
branch to return a NaN array matching the reduced tensor shape stored in
accumulator["shape"]. In the finalize function (method finalize), replace the
scalar return with something like np.full(accumulator["shape"], np.nan,
dtype=float) or np.broadcast_to(np.asarray(np.nan, dtype=float),
accumulator["shape"]) so the returned value preserves shape for axis/tuple
reductions while keeping dtype consistent.
- Around line 65-94: The constructor (__init__) of TensorStd leaves
self._aggregate_axis unset when axis is None and relies later on it and on
array.flatten() which fails for NumPy scalars; set a sensible default for
self._aggregate_axis (e.g., (0,)) when axis is None and ensure the reduction
result is treated as at least 1-D before any .flatten() calls (for example by
using np.atleast_1d or wrapping scalars in an array) so TensorStd(on=...,
axis=None) follows the global-reduction path without error; update the logic in
TensorStd.__init__ (and the related reduction code that uses
self._aggregate_axis and .flatten()) to initialize and normalize these values
accordingly.
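Taken together, the constructor and finalize fixes suggested above might look roughly like this — a hedged sketch with hypothetical helper names, following the prompts' suggestions (default `(0,)` for `axis=None`, sorted normalization, shape-preserving NaN):

```python
import numpy as np

def normalize_axis(axis):
    """Normalize the axis argument to a sorted tuple that always
    includes the batch axis 0."""
    if axis is None:
        return (0,)  # default suggested by the prompt: reduce over the batch axis
    if isinstance(axis, int):
        return tuple(sorted({0, axis}))
    axes = set(axis)
    if 0 not in axes:
        raise ValueError("Axis 0 (the batch dimension) must be included")
    return tuple(sorted(axes))

def nan_like_result(shape):
    """Return NaN with the reduced result's shape instead of a bare scalar,
    so tensor-shaped outputs are preserved on empty/underflowing groups."""
    return np.full(shape, np.nan, dtype=float)
```

Normalizing once in `__init__` keeps every downstream method (`aggregate_block`, `combine`, `finalize`) working with a deterministic, always-initialized tuple.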
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0be1eb14-3f12-4059-b728-fa3eae864ff3
📒 Files selected for processing (2)
- ratiopath/ray/aggregate/tensor_mean.py
- ratiopath/ray/aggregate/tensor_std.py
/gemini review
Code Review
This pull request introduces new TensorMean and TensorStd aggregators for Ray Data, which is a valuable feature for tensor-based data processing. The implementations are generally well-structured, particularly the use of Chan's algorithm for stable standard deviation calculation. I've identified a critical bug in the TensorStd implementation within the combine method due to a key mismatch in the accumulator dictionary, which would cause distributed aggregations to fail. I have also suggested an improvement to the tests to ensure they are robust enough to catch such distributed computing issues by forcing multi-block processing.
@coderabbitai review
✅ Actions performed: Review triggered.
/gemini review
Actionable comments posted: 2
🧹 Nitpick comments (2)
tests/test_ray_aggregations.py (2)
84-86: Swap pytest.approx to wrap the expected value.

The idiomatic pattern is assert actual == pytest.approx(expected, rel=...). The current form works but is less readable, and the second positional arg to pytest.approx is rel, not abs.

Suggested fix

```diff
- assert pytest.approx(result["std(m)"], 0.0001) == expected
+ assert result["std(m)"] == pytest.approx(expected, rel=1e-4)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_ray_aggregations.py` around lines 84 - 86, The assertion uses pytest.approx with the actual value wrapped incorrectly; update the test in tests/test_ray_aggregations.py so the actual result from ds.aggregate(TensorStd(on="m", axis=None, ddof=0)) is compared to pytest.approx(expected) instead of the other way around — e.g., assert result["std(m)"] == pytest.approx(expected, rel=0.0001) — ensuring the tolerance is passed as a keyword/rel argument.
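The idiomatic form the comment describes, in isolation (values here are made up for illustration):

```python
from pytest import approx

expected = [1.0, 2.0]
result = [1.00004, 1.99996]

# Wrap the *expected* value; `rel` is approx's second positional parameter.
assert result == approx(expected, rel=1e-4)
```

Wrapping the expected side also gives clearer failure messages, since pytest prints the tolerance alongside the expected value.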
13-14: Remove leftover development comment.

The comment "Adjust imports based on your file structure" appears to be a development note that should be removed before merging.

Suggested fix

```diff
-# Adjust imports based on your file structure
 from ratiopath.ray.aggregate import TensorMean, TensorStd
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_ray_aggregations.py` around lines 13 - 14, Remove the leftover development comment "Adjust imports based on your file structure" from the imports block in tests/test_ray_aggregations.py so only the actual import remains; ensure the import line "from ratiopath.ray.aggregate import TensorMean, TensorStd" is preserved and no other dev notes remain in that file's top-of-file imports.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyproject.toml`:
- Line 13: The pyproject.toml Python version constraint currently uses the
requires-python key with ">=3.12,<3.14", which prevents installation on Python
3.14+; remove the upper bound or broaden it (for example change requires-python
to ">=3.12" or ">=3.12, <4.0" after testing) so the package supports newer
interpreters, then run your test suite against Python 3.14 to verify
compatibility before finalizing the change.
In `@ratiopath/ray/aggregate/tensor_mean.py`:
- Line 86: The assignment self._aggregate_axis = tuple(axes) in TensorMean
produces a non-deterministic ordering when axes is a set; change it to use a
sorted ordering (e.g., tuple(sorted(axes))) so TensorMean's _aggregate_axis is
deterministic and consistent with TensorStd (which already uses
tuple(sorted(axes))). Locate the assignment in the TensorMean initializer (the
line setting _aggregate_axis) and replace the tuple(axes) construction with a
sorted tuple of axes.
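The version-bound change discussed in the pyproject.toml comment above lives under `[project]`; the two options would look like this (illustrative fragment, not the PR's full file):

```toml
[project]
# Current constraint (blocks installation on Python 3.14+):
requires-python = ">=3.12,<3.14"
# Broadened alternative, after verifying the test suite on 3.14:
# requires-python = ">=3.12"
```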
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 04bf19aa-00e6-44fe-8c76-93a57c88e27d
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (7)
- docs/reference/ray/aggregations/tensor_mean.md
- docs/reference/ray/aggregations/tensor_std.md
- mkdocs.yml
- pyproject.toml
- ratiopath/ray/aggregate/tensor_mean.py
- ratiopath/ray/aggregate/tensor_std.py
- tests/test_ray_aggregations.py
✅ Files skipped from review due to trivial changes (1)
- docs/reference/ray/aggregations/tensor_mean.md
Code Review
This pull request introduces new TensorMean and TensorStd aggregators for Ray DataFrames, which is a great feature for tensor-based data processing. The implementation of TensorStd using Chan's method for numerical stability is particularly well-done. I've found a couple of critical correctness issues that could lead to runtime errors. My main feedback focuses on improving null value handling in both aggregators and fixing a bug in TensorStd related to global aggregations. The provided tests are comprehensive and cover the main use cases well.
/gemini review
@coderabbitai review
✅ Actions performed: Review triggered.
Code Review
This pull request introduces new TensorMean and TensorStd aggregators for Ray Data, which is a great feature for tensor-based data processing. The implementation of TensorStd correctly uses Chan's algorithm for stable parallel standard deviation calculation, which is excellent for numerical stability.
However, I've found a critical issue in the null-handling logic in both new aggregator classes. The call to block_acc.count is incorrect and will cause a runtime error. Additionally, the handling of nulls when ignore_nulls=False is missing, and there's an edge case with blocks containing only nulls when ignore_nulls=True that could lead to a crash. I've left detailed comments with suggestions to fix these issues.
The rest of the code, including documentation and tests for the happy paths, looks good.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ratiopath/ray/aggregate/tensor_mean.py`:
- Line 101: There's a minor typo in the inline comment "# Filter out Nones)" —
remove the extraneous closing parenthesis so the comment reads "# Filter out
Nones"; locate this comment in tensor_mean.py (around the aggregation logic in
the tensor mean function / reduce step) and update it accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b0433680-1596-4952-96c7-6e797191a8cf
📒 Files selected for processing (3)
- ratiopath/ray/aggregate/tensor_mean.py
- ratiopath/ray/aggregate/tensor_std.py
- tests/test_ray_aggregations.py
🚧 Files skipped from review as they are similar to previous changes (2)
- ratiopath/ray/aggregate/tensor_std.py
- tests/test_ray_aggregations.py
Summary by CodeRabbit

- New Features
- Documentation
- Tests
- Chores