feat: add Collector storage that keeps original per-bin values #1133

Open
henryiii wants to merge 3 commits into scikit-hep:develop from henryiii:feat/collector-storage

@henryiii henryiii commented Jun 8, 2026

Member

🤖 AI text below 🤖

Summary

Adds bh.storage.Collector, a storage that keeps the original sample values that fall into each bin (a variable-length list per bin) instead of aggregating them. This finishes the idea started in #378 — the C++ accumulators::collector<> it builds on has since been upstreamed into Boost.Histogram and is already vendored in extern/, so this PR is the Python binding + integration.

```python
import boost_histogram as bh

h = bh.Histogram(bh.axis.Regular(3, 0, 3), storage=bh.storage.Collector())
h.fill([0, 0, 1, 2, 2, 2], sample=[1., 2., 3., 4., 5., 6.])
h.view()   # object array: [array([1., 2.]), array([3.]), array([4., 5., 6.])]
h[0]       # [1.0, 2.0]
```

Each value in sample= is appended to the bin chosen by the corresponding coordinate.
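
To make the fill semantics concrete without needing this branch installed, here is a minimal pure-Python model of what a collector fill does on a regular axis. It is a sketch only: `collect_regular` is a hypothetical helper, not part of boost-histogram, and it assumes the usual layout of one underflow bin, the regular bins, and one overflow bin. NaN coordinates are not handled.

```python
def collect_regular(coords, samples, bins, start, stop):
    """Model of a collector fill on a regular axis with flow bins.

    Hypothetical helper for illustration; not the boost-histogram API.
    """
    if len(coords) != len(samples):
        raise ValueError("coords and samples must have the same length")
    # Index 0 is underflow, 1..bins are the regular bins, bins + 1 is overflow.
    cells = [[] for _ in range(bins + 2)]
    width = (stop - start) / bins
    for x, value in zip(coords, samples):
        if x < start:
            index = 0
        elif x >= stop:
            index = bins + 1
        else:
            # Clamp to guard against rounding just below the upper edge.
            index = 1 + min(int((x - start) / width), bins - 1)
        # Append the sample instead of aggregating it.
        cells[index].append(float(value))
    return cells


cells = collect_regular(
    [0, 0, 1, 2, 2, 2], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], bins=3, start=0, stop=3
)
print(cells[1:-1])  # [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
```

The inner slice matches the `h.view()` result in the example above; the two flow bins stay empty because every coordinate lies inside the axis range.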

Design

  • Storage is dense_storage<accumulators::collector<std::vector<double>>> (double only), filled through a new fill_impl overload for the collector's accumulator_traits_holder<false, const double&> (none of the existing overloads matched).
  • Because per-bin data is ragged, there is no buffer protocol. view() returns a NumPy object-dtype array (shape = axes), each element a 1-D float64 copy of that bin's values. A dedicated register_histogram specialization replaces def_buffer with a bh::indexed-based object-array builder.
  • Works (all via the C++ backend, all concatenating where bins merge): integer/slice __getitem__, project, slicing/factor-rebin reduce, h1 + h2 / +=, sum, pickle, copy/deepcopy.
  • Raises NotImplementedError — the object view returns copies, so anything that writes back through it is unsupported: item assignment, array/scalar arithmetic (h * 2), group rebinning, integer picking on a subset of axes, and list-based selection. Weighted and threaded fills raise too.
  • Pickle stores a per-bin counts array plus a single flat values array — the offsets+content layout a future to_awkward() would need.
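
The "concatenating where bins merge" behaviour listed above can be modelled in a few lines of plain Python. This is an illustrative sketch, not the C++ implementation: the helper names are hypothetical, flow bins are left out, and the order of values inside a merged bin (bin order here) is an assumption rather than something this description specifies.

```python
def rebin_by_factor(cells, factor):
    """Merge each run of `factor` adjacent bins by concatenating their values."""
    if factor < 1 or len(cells) % factor != 0:
        raise ValueError("number of bins must be a multiple of factor")
    return [
        [value for cell in cells[i : i + factor] for value in cell]
        for i in range(0, len(cells), factor)
    ]


def sum_cells(cells):
    """Model of sum(): every bin's values end up in one list."""
    return [value for cell in cells for value in cell]


def add_cells(left, right):
    """Model of h1 + h2: matching bins are concatenated, not added numerically."""
    if len(left) != len(right):
        raise ValueError("histograms must have the same number of bins")
    return [a + b for a, b in zip(left, right)]


cells = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0], []]
print(rebin_by_factor(cells, 2))  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(sum_cells(cells))           # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(add_cells(cells, cells)[1]) # [3.0, 3.0]
```

No operation in this group discards or combines individual samples, which is why they can all be served by the backend without a writable view.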

Out of scope / follow-ups

  • Awkward Array conversion (the serialization layout is the groundwork).
  • A writable ragged view (would re-enable setitem / group rebin).
  • Non-double element types.
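
For the Awkward follow-up, the relationship between the pickled layout (per-bin counts plus one flat values array) and an offsets+content layout can be sketched as follows. The function names are hypothetical and plain lists stand in for the arrays; this shows the idea, not the code in this PR.

```python
from itertools import accumulate


def flatten(cells):
    """Split ragged per-bin data into (counts, flat values)."""
    counts = [len(cell) for cell in cells]
    values = [value for cell in cells for value in cell]
    return counts, values


def counts_to_offsets(counts):
    """Offsets are the running sum of counts with a leading zero."""
    return [0, *accumulate(counts)]


def unflatten(counts, values):
    """Rebuild the ragged per-bin lists from (counts, flat values)."""
    offsets = counts_to_offsets(counts)
    if offsets[-1] != len(values):
        raise ValueError("counts do not add up to the number of values")
    return [values[lo:hi] for lo, hi in zip(offsets, offsets[1:])]


cells = [[1.0, 2.0], [3.0], [], [4.0, 5.0, 6.0]]
counts, values = flatten(cells)
print(counts)                     # [2, 1, 0, 3]
print(counts_to_offsets(counts))  # [0, 2, 3, 3, 6]
print(unflatten(counts, values) == cells)  # True
```

Bin `i` is then the slice `values[offsets[i]:offsets[i + 1]]`, so an empty bin simply repeats an offset.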

Not UHI-serializable, consistent with MultiCell.

Tests

  • New tests/test_collector.py (23 tests): fill, object-array view incl. flow, scalar-arg broadcast, 2D, indexing, sum/project/reduce/factor-rebin concatenation, addition, pickle round-trip, copy/deepcopy, reset, equality, structural match, and all the unsupported-op guards.
  • Full suite: 978 passed, 1 pre-existing xfail. prek -a clean (ruff, clang-format, mypy).

Adds bh.storage.Collector, which stores the original sample values that
fall into each bin (a variable-length list per bin) instead of
aggregating them. It is built on Boost's vendored
accumulators::collector<std::vector<double>> in a dense_storage and is
filled via h.fill(x, sample=values).

Because the per-bin data is ragged, view() returns a NumPy object-dtype
array of per-bin float64 copies (no buffer protocol). Operations that go
through the C++ backend work and concatenate (indexing, project,
slicing/factor-rebin reduce, +/+=, sum, pickle, copy). Operations that
would need to write back through the copy-only view raise
NotImplementedError (item assignment, array/scalar arithmetic, group
rebinning, pick-on-subset, list selection); weighted and threaded fills
also raise.

Pickle serializes a per-bin counts array plus a flat values array, which
also seeds a future Awkward conversion (offsets + content).

Assisted-by: ClaudeCode:claude-opus-4.8
@github-actions github-actions Bot added the "needs changelog" label (Might need a changelog entry) on Jun 8, 2026
pre-commit-ci Bot and others added 2 commits June 8, 2026 20:58
Mark the locals in make_object_view and the collector sum() result const,
as required by clang-tidy --warnings-as-errors.

Assisted-by: ClaudeCode:claude-opus-4.8