feat: add one-shot resample function for simplified resampling#78
Conversation
a394304 to
95c51e4
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds a one-shot resample() convenience API for both Rust and Python to resample a batch of timestamp/value pairs without instantiating a Resampler, and clarifies/fixes interval-boundary semantics so first_timestamp only affects output timestamp labeling.
Changes:
- Added one-shot
resample()functions to Rust (src/lib.rs) and Python bindings (src/python.rs), plus exports/stubs. - Standardized resampling interval semantics to
[start, end)regardless offirst_timestamp, updating docs and tests accordingly. - Updated README and release notes to document the new API and the semantics fix.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
src/lib.rs |
Adds Rust resample() convenience function and SimpleSample, re-exports epoch_align. |
src/python.rs |
Adds Python resample() binding and exports it from the extension module. |
src/resampler.rs |
Fixes boundary/grouping behavior to always use [start, end); makes epoch_align() public. |
src/tests.rs |
Updates existing Rust tests and adds coverage for boundary/labeling semantics and one-shot resampling. |
tests/test_resampler.py |
Updates Python tests for new semantics and adds one-shot resample() tests. |
frequenz/resampling/_rust_backend.pyi |
Exposes resample() in stubs and widens input type to Sequence. |
frequenz/resampling/__init__.py |
Re-exports resample() from the Python package. |
README.md |
Documents one-shot vs stateful resampling for Rust and Python. |
RELEASE_NOTES.md |
Notes the first_timestamp semantic fix and the new one-shot API. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Thanks for the review feedback. Regarding the division-by-zero concern for sub-millisecond/zero intervals: this is a pre-existing issue in the codebase, not introduced by this PR. The The scope of this PR is adding the |
95c51e4 to
a2a4745
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| interval: timedelta, | ||
| method: ResamplingFunction, | ||
| *, | ||
| first_timestamp: bool = True, |
There was a problem hiding this comment.
You might consider using the same terminology as in pandas, e.g. this would be label which can be "right" or "left" (see also my other comment).
There was a problem hiding this comment.
added
| resampling_function: The resampling function. | ||
| max_age_in_intervals: The maximum age of a sample in intervals. | ||
| start: The start time of the resampling. | ||
| first_timestamp: Whether the resampled timestamp should be the first |
There was a problem hiding this comment.
Not sure about this change. There are two separate features here, the "closedness" definition of the intervals, and the label. I wonder if this parameter was intended to be for the interval definition, namely to distinguish:
- left-closed:
[t, t+1)(what we typically use in reporting API) - right-closed:
(t, t+1](what we typically use in the microgrid data pipeline in the SDK).
Resampled data cannot be converted from one interval definition to another.
For the other feature, the label, the data can easily be fixed afterwards by shifting the timestamps by the sampling period.
So to me it makes sense to have the original functionality and not change its meaning. Or if so, we better have two parameters.
See also closed and label parameters here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html.
There was a problem hiding this comment.
added
966ba31 to
f9d418a
Compare
malteschaaf
left a comment
There was a problem hiding this comment.
Only two small comments, but also I can't say much about the rust code.
| max_age_in_intervals: int, | ||
| start: datetime, | ||
| first_timestamp: bool = True, | ||
| closed: Literal["left", "right"], |
There was a problem hiding this comment.
One minor thing, because I introduced a similar change for the SDK resampler and was adviced to use an enum instead of a literal here. Did you give this a thought? I know pandas also uses the Literal but it might be benficial here to be a bit more explicit.
There was a problem hiding this comment.
Yes, I liked the literal stuff more, but I get your point.
There was a problem hiding this comment.
changed it
| closed: Which interval edge is closed for sample membership. Use | ||
| `"left"` for `[start, end)` intervals or `"right"` for | ||
| `(start, end]` intervals. | ||
| label: Which interval edge to use for output timestamps. Use `"left"` | ||
| for the interval start or `"right"` for the interval end. |
There was a problem hiding this comment.
Small wording suggestion to make this a bit clearer and more consistent with interval terminology to avoid mapping left/right to start/end:
| closed: Which interval edge is closed for sample membership. Use | |
| `"left"` for `[start, end)` intervals or `"right"` for | |
| `(start, end]` intervals. | |
| label: Which interval edge to use for output timestamps. Use `"left"` | |
| for the interval start or `"right"` for the interval end. | |
| closed: Which interval edge is closed for sample membership. Use `"left"` for | |
| left-closed, right-open intervals `[start, end)` and `"right"` for | |
| right-closed, left-open intervals `(start, end]`. | |
| label: Which interval edge to use for output timestamps. Use `"left"` for the left bin edge and `"right"` for the right bin edge. |
eb5c08c to
89fffb9
Compare
|
|
||
| assert result == [ | ||
| (start + 5 * step, 20.0), | ||
| (start + 10 * step, None), |
There was a problem hiding this comment.
Shouldn't this be
(start, 10.0),
(start + 5 * step, 20),
There was a problem hiding this comment.
Do you mean
(start, 10.0),
(start + 5 * step, 20),
because start*step doesn't work, I think.
There was a problem hiding this comment.
With closed=Closed.Right and label=Label.Right, the buckets are (0s, 5s] and (5s, 10s], labeled at 5s and 10s. That means:
- the sample at start (0s) is excluded from the first bucket
- the sample at start + 5 * step (5s) is included in the first bucket
There was a problem hiding this comment.
It depends on whether your reference is the index or the data. If it's the data, the first bucket would be introduced as (-5s, 0s] to cover your data point.
There was a problem hiding this comment.
E.g. for pandas:
pd.DataFrame([10,20], index=pd.to_datetime(["1970-01-01T00:00:00Z", "1970-01-01T00:00:05Z"])).resample("5s", closed="right", label="right").sum()
0
1970-01-01 00:00:00+00:00 10
1970-01-01 00:00:05+00:00 20
There was a problem hiding this comment.
You can also see the new bucket when changing the label to left:
pd.DataFrame([10,20], index=pd.to_datetime(["1970-01-01T00:00:00Z", "1970-01-01T00:00:05Z"])).resample("5s", closed="right", label="left").sum()
0
1969-12-31 23:59:55+00:00 10
1970-01-01 00:00:00+00:00 20
There was a problem hiding this comment.
I can change it to behave exactly like pandas if you like. What will be more helpful for our use cases?
There was a problem hiding this comment.
Personally I prefer the pandas behavior since it focuses on keeping the data and not the index.
There was a problem hiding this comment.
okay, done
| index=pd.DatetimeIndex([timestamp for timestamp, _ in data]), | ||
| dtype="float64", | ||
| ) | ||
| pandas_result = pandas_series.resample("5s", closed="left", label="right").mean() |
There was a problem hiding this comment.
This can iterate over all combinations of closed/label options to validate consistency between both.
|
|
||
| # First value in each interval is None | ||
| data: list[tuple[dt.datetime, float | None]] = [ | ||
| (start + i * step, None if i in (0, 5) else float(i + 1)) for i in range(10) |
There was a problem hiding this comment.
What if the interval only contains None?
There was a problem hiding this comment.
Complete bucket of only None:
- Average, Sum, Min, Max, First, Last, Coalesce -> None
- Count -> 0
Partially None bucket:
- Average, Sum, Min, Max ignore the Nones
- First and Last do not ignore them; if the first/last sample is None, result can be None
- Coalesce returns the first non-None
- Count counts only non-None
|
|
||
|
|
||
| def test_resample_function_with_none_values() -> None: | ||
| """Test the resample function with None values.""" |
There was a problem hiding this comment.
Complete bucket of only NaN:
- all aggregations except Count return NaN
- Count -> counts them as present
Partially NaN bucket:
- Average, Sum -> NaN
- Min, Max -> effectively ignore NaN when real numbers are present
- First, Last -> return NaN if the chosen edge sample is NaN
- Coalesce -> returns NaN if it is the first non-None
- Count -> counts NaN as present
| # Interval [5, 10): values 7,8,9,10 → avg = 8.5 | ||
| assert result[0] == (start, 3.5) | ||
| assert result[1] == (start + 5 * step, 8.5) | ||
|
|
There was a problem hiding this comment.
Complete bucket of only inf:
- all aggregations except Count return inf
- Count -> counts them as present
Partially inf bucket:
- Average, Sum -> inf
- Min -> finite minimum wins if any smaller finite value exists
- Max -> inf
- First, Last -> return inf if the chosen edge sample is inf
- Coalesce -> returns inf if it is the first non-None
- Count -> counts inf as present
|
@cwasicki pandas treats both None and NaN as missing in numeric series, while our implementation only treats None as missing. Should we adapt it to work exactly like pandas or do we accept this difference? I think our way is better, since None and NaN are two different concepts. None means, nothing arrived. NaN means, NaN arrived. |
|
I added some more tests. |
89fffb9 to
b1ac2c1
Compare
I haven't thought about this enough but your explanation makes sense to me. But I think this should be clearly explained in the docs, maybe even stating the different behavior in pandas. |
b1ac2c1 to
aa4820c
Compare
…_timestamp The `first_timestamp` parameter was incorrectly affecting both output timestamp labeling and interval grouping semantics. When `first_timestamp=false`, intervals used `(start, end]` instead of `[start, end)`, causing samples at exact interval boundaries to be excluded. This fix makes interval grouping consistently use `[start, end)` regardless of `first_timestamp` value. The parameter now only controls the output timestamp labeling as documented: `true` labels with the interval start, `false` with the interval end. Changes include: - Simplified boundary checking functions to remove first_timestamp logic - Updated documentation to clarify the parameter's sole purpose - Fixed test data to start at t=0 to match README example - Added explicit test for first_timestamp=false behavior - Updated RELEASE_NOTES with bug fix details Signed-off-by: Phillip Wenig <phillip.wenig@frequenz.com>
Add tests to strengthen coverage of the first_timestamp fix: Rust tests (src/tests.rs): - test_first_timestamp_same_values: Verifies both settings produce identical aggregated values, differing only in output timestamps - test_sample_at_interval_boundary: Confirms sample at t=5 goes to interval [5, 10), validating [start, end) semantics - test_data_starting_mid_interval: Tests correct behavior when data doesn't start at interval boundary Python tests (tests/test_resampler.py): - Updated all tests to start data at t=0 to match README example - Fixed expected values to match correct [start, end) interval semantics Signed-off-by: Phillip Wenig <phillip.wenig@frequenz.com>
Add a new `resample()` convenience function available in both Rust and Python APIs, allowing users to resample data in a single call without needing to instantiate and manage a `Resampler` instance. The Rust implementation includes: - A new `SimpleSample` type for simple timestamp/value pairs - The public `resample()` function that internally uses `Resampler` - Made `epoch_align()` public to support the new functionality The Python implementation provides the same functionality through PyO3 bindings with matching semantics. Also updated: - README with "One-Shot Resampling" and "Stateful Resampling" sections - Type stubs with function signature using Sequence for flexibility - Comprehensive test coverage in both Rust and Python - RELEASE_NOTES.md Signed-off-by: Phillip Wenig <phillip.wenig@frequenz.com>
… label parameters Replace the single boolean `first_timestamp` parameter with two explicit enums in both `Resampler::new()` and `resample()` functions: - `Closed`: Controls which edge of the interval is closed for sample membership. Use `Closed::Left` for `[start, end)` intervals or `Closed::Right` for `(start, end]` intervals. - `Label`: Controls which edge of the interval is used for output timestamps. Use `Label::Left` for the interval start or `Label::Right` for the interval end. This clarifies the API by making the distinction between interval membership semantics (closed) and output timestamp labeling (label) explicit, rather than bundling them into a single boolean. BREAKING CHANGE: The `first_timestamp` parameter is removed from both Rust and Python APIs. Replace calls using `first_timestamp=true` with `Closed::Left, Label::Left` (Rust) or `closed="left", label="left"` (Python). Replace `first_timestamp=false` with `Closed::Left, Label::Right` (Rust) or `closed="left", label="right"` (Python). Signed-off-by: Phillip Wenig <phillip.wenig@frequenz.com>
Signed-off-by: Phillip Wenig <phillip.wenig@frequenz.com>
aa4820c to
fbae0fc
Compare
Signed-off-by: Phillip Wenig <phillip.wenig@frequenz.com>
32e4b19 to
ba9ed6d
Compare
Summary
Add a new
resample()convenience function available in both Rust and Python APIs, allowing users to resample data in a single call without needing to instantiate and manage aResamplerinstance.Rust API
Python API
Changes
resample()function in Rust (src/lib.rs) withSimpleSampletyperesample()Python binding (src/python.rs)epoch_align()public to support the new functionalitySequenceinput type for flexibilitydepends on #75