feat: add DoubleSparse sparse storage backend by henryiii · Pull Request #1139 · scikit-hep/boost-histogram

henryiii · 2026-06-09T15:14:19Z

🤖 AI text below 🤖

What

Adds a sparse DoubleSparse storage backend, backed by
boost::histogram::storage_adaptor<std::unordered_map<std::size_t, double>>.
Only filled cells are allocated, so histograms over very large or
high-dimensional axis spaces (where a dense grid would not fit in memory)
become feasible. Based on the WIP at pfackeldey/boost-histogram@sparse_hists.

Scope: `double` only, for now

Accumulator-backed sparse storages (Weight, Mean, …) need
boostorg/histogram#421, which
ships in Boost 1.92. This repo vendors Boost 1.90, and a double-backed
sparse storage uses a plain arithmetic value type unaffected by that bug, so it
works today. The siblings can be added after a Boost bump.

Data access — COO-first, not a full drop-in

A hash map has no contiguous buffer, so:

.view() and np.asarray(h) raise TypeError — there is no buffer to
view into, and implicit array conversion of a potentially huge sparse
histogram would be a silent memory hazard.
Histogram.to_coo(flow=False) returns the filled cells as
(indices, values), where indices is a tuple of per-axis index arrays
(numpy-nonzero style, usable directly as a fancy index). flow=True returns
the with-flow grid (underflow at 0).
The explicit copying accessors values() / variances() / counts()
densify into a fresh array, so they keep working (densifying a huge grid is
then the caller's deliberate choice).
at() reads, slicing, projection, sum, and histogram+histogram addition
stay sparse. __setitem__ and scalar h * 2 raise (they need a live buffer).

Reading the filled cells / SciPy

For a 2D histogram, to_coo() maps directly onto a scipy.sparse.coo_array:

(rows, cols), values = h.to_coo()
matrix = coo_array((values, (rows, cols)), shape=h.axes.size)

Serialization

Pickle / copy / deepcopy round-trip via free save/load overloads for
double_sparse (logical size + COO key/value arrays).
UHI to_uhi / from_uhi store the filled cells as a "sparse" COO
payload plus writer_info storage_type="DoubleSparse", reconstructed with a
C++ _from_coo bulk setter. The sparse encoding is boost-histogram specific.

Implementation notes

storage_adaptor publicly inherits map_impl : unordered_map, so
static_cast<unordered_map&>(unsafe_access::storage(h)) iterates only the
filled cells (no densification) for COO extraction and pickling.
COO get/set are bound only on the sparse histogram via a small
is_sparse_storage trait + tag-dispatched register_sparse_coo.

Docs & tests

New Sparse storages section in the storage user guide, including the SciPy
coo_array example above.
New tests/test_sparse_storage.py: fill/read, stays-sparse under
slice/project, to_coo with/without flow, .view()/np.asarray raise while
the copying accessors densify and match a dense Double histogram, repr/str
without densifying, copy/pickle, and UHI round-trips (incl. JSON, empty,
structure-only). Full suite green; prek (ruff/mypy/clang-format/codespell)
clean.

Add a hash-map backed `DoubleSparse` storage that only allocates filled cells, making histograms over very large or high-dimensional axis spaces feasible. Backed by `storage_adaptor<unordered_map<size_t, double>>`. Scoped to `double` only for now: accumulator-backed sparse storages need boostorg/histogram#421, which ships in Boost 1.92 (the repo vendors 1.90). Data access is COO-first and deliberately not a full drop-in: - `.view()` raises (no contiguous buffer to view into). - `Histogram.to_coo(flow=False)` returns the filled cells as `(per-axis index tuple, values)`, numpy-nonzero style. - The copying accessors `values()`/`variances()`/`counts()` and `np.asarray(h)` densify into a fresh array, so they keep working. - `at()` reads, slicing, projection, sum and histogram addition stay sparse. Serialization: pickle round-trips via free save/load on `double_sparse` (COO key/value arrays); UHI to_uhi/from_uhi store COO under a `"sparse"` payload plus writer_info, reconstructed via a C++ `_from_coo` bulk setter. Assisted-by: ClaudeCode:claude-opus-4.8

np.asarray(h) now raises for sparse storage like .view() does, rather than silently densifying — implicit array conversion of a potentially huge sparse histogram is a memory hazard. The explicit copying accessors (values/variances/ counts) still densify on demand. Add a "Sparse storages" section to the storage user guide, including how to build a scipy.sparse.coo_array from to_coo() output. Assisted-by: ClaudeCode:claude-opus-4.8

- Add missing indices.shape(1) vs values.shape(0) validation in histogram_from_coo to prevent out-of-bounds reads. - Hoist std::vector allocation out of the map loop in histogram_to_coo. - Add tests for setitem raising on sparse and sparse+histogram addition. Assisted-by: opencode:NRP/kimi

Regular._repr_args_ built the full edges array just to read the first and last edge; for a very large axis (e.g. the sparse-storage repr test with 1e9 bins) this allocates gigabytes, OOM-killing PyPy CI and overflowing the max array size on 32-bit Windows. Use value() for the endpoints instead, which is O(1). Also pass the coo arrays to histogram_from_coo by const reference to satisfy clang-tidy's performance-unnecessary-value-param check. Assisted-by: ClaudeCode:claude-opus-4.8

Route scalar * and / on sparse storage through the COO get/set path so they scale only the filled cells (0*x == 0 preserves sparsity) instead of densifying via the NumPy view. Scalar +/- are refused with a clear error since they would fill every cell. Document which operations stay sparse, densify, or raise, and note mixed int+slice indexing as a follow-up. Assisted-by: ClaudeCode:claude-opus-4.8

henryiii · 2026-06-09T18:48:42Z

Review from Claude Fable 5:

🤖 AI text below 🤖

I have everything I need — here's the review of PR #1139: feat: add DoubleSparse sparse storage backend (draft, sparse-storage → develop, +858/−13, 15 files).

Overview

Adds a DoubleSparse storage backed by storage_adaptor<std::unordered_map<size_t, double>>, with COO-based data access (to_coo/_from_coo), explicit densifying accessors, pickle/UHI serialization, docs, and a thorough new test file. The design philosophy is sound and consistently applied: keep what can stay sparse sparse, raise loudly where a dense buffer would be required, and make densification an explicit caller choice.

What's good

The COO unravel/ravel logic in histogram.hpp correctly matches Boost's internal layout (underflow at offset 0 when present), and _from_coo validates both index-array dimensions before writing.
I verified in extern/histogram/storage_adaptor.hpp that map_impl::reference erases on zero-assignment and doesn't insert on reads, so _from_coo with zero values and point reads (h[2] == 0.0) both preserve sparsity. The upcast-to-unordered_map trick is legitimate given the public inheritance chain.
Test coverage is genuinely strong: flow/no-flow COO, dense-equivalence checks, all the raise paths, JSON/UHI/pickle round-trips, structure-only, and empty histograms. All 29 pass locally.
Error messages consistently point users at to_coo() — good UX for a non-drop-in storage.

Issues

1. sum(), ==, and therefore repr are O(dense size), which breaks the headline use case. Boost's map_impl iterator walks the logical index space, not the filled cells. I measured on a 10⁹-bin sparse histogram with 2 entries: h.sum(flow=True) takes 0.9 s and h == h2 takes 1.6 s. __repr__ calls sum() twice, so test_repr_and_str_do_not_densify alone takes 6.2 s. The docs' motivating example is a 10¹⁵-cell grid — at this rate, merely displaying that histogram in a REPL would take on the order of a week, and h1 + h2 equality checks or .sum() in user code would hang similarly. Suggested fixes:

Specialize sum for sparse storage (flow=True is just summing the map values; flow=False reuses the to_coo flow-filter logic). Could be done in C++ next to histogram_to_coo, or in Python by routing Histogram.sum through _to_coo for sparse.
== is harder (it's C++ histogram equality, and maps can hold explicit zeros via operator-=), but at minimum the docs' "stays sparse (or returns a scalar) — delegates to the C++ storage" list shouldn't imply .sum() is cheap; today it's a time bomb, not a memory bomb.

I'd treat this as blocking for un-drafting: the feature's whole pitch is huge axis spaces, and the default REPL display hangs on them.

2. UHI payload masquerades as "type": "double". A non-boost-histogram UHI consumer (or an older boost-histogram) reading keep_storage=True output sees type == "double" with a values key present — but it's a 1D COO array, not the dense grid the type promises. Best case it fails with a confusing shape error; worst case something broadcasts it. Since the encoding is admittedly bh-specific anyway, consider a distinct type string (e.g. "sparse_double") so foreign readers fail loudly at the type check, keeping "double" only for the structure-only (keep_storage=False) payload where it's accurate.

3. Concurrent fills can corrupt the map under free-threading. fill.hpp releases the GIL, and map_impl explicitly declares has_threading_support = false — concurrent fill() on one sparse histogram can rehash the unordered_map concurrently, which is a crash/UB rather than the benign lost-update races dense Double has. With 3.14t supported and tests/test_threaded_fill.py in the suite, this deserves at least a docs warning in the new storage section ("do not fill one sparse histogram from multiple threads"), and ideally a thought about whether fill should keep the GIL/lock for this storage.

Minor notes

histogram_from_coo casts indices through int for at() — that's Boost's own index_type limit, so fine, but a comment noting axes >2³¹ bins per axis are unsupported wouldn't hurt.
h *= np.array(...) on sparse takes the array branch and dies with the .view() error ("use .to_coo()") — slightly misleading for a write operation, but acceptable.
The indices.ndim == 1 reshape fallback in from_uhi looks unreachable in practice ([[]] round-trips JSON as 2-D); harmless defensive code.
The Regular._repr_args_ change to self._ax.value(0)/value(self.size) is correct (identical to edges[0]/edges[-1], including for transformed axes) and necessary for huge axes — nice catch.
The "needs changelog" label is already on the PR; a changelog entry is still pending.
Stub additions (any_double_sparse, double_sparse), register_sparse_coo tag dispatch, and the _view_or_dense accessor split all follow existing project patterns cleanly.

Verdict

Architecture, serialization, and test coverage are in good shape. The blocker is finding #1: hidden O(dense-size) iteration in sum/==/repr defeats the feature's core promise and should be fixed (at least sum, which also fixes repr) before this leaves draft. Findings #2 and #3 are smaller, addressable with a type-string change and a docs warning respectively.

The storage iterator of a map-backed histogram visits every logical cell, so bh::algorithm::sum/empty and histogram::operator== were O(dense size) -- repr (which calls sum twice) on the documented 10**15-cell example would never return. Dispatch sum, empty, and ==/!= through overloads that walk only the hash map for sparse storage, treating explicit zeros (left behind by -=) the same as absent cells. Assisted-by: ClaudeCode:claude-fable-5

henryiii · 2026-06-09T19:32:37Z

🤖 Fable 5 text below 🤖

Done — committed as ef75137 on sparse-storage (not pushed).

About the hang you hit: it wasn't the fix itself — uv run doesn't rebuild when only C++ headers change (no [tool.uv] cache-keys configured), so both your run and my first run executed the stale 12:03 binary. Its old O(dense-size) empty() met the new test's 10¹⁵-cell grid and spun forever in C++ with the GIL released, which is why you needed ctrl+\. After forcing a rebuild with uv run --reinstall-package boost-histogram pytest, everything passes. If you want, adding cache-keys = [{file = "include/**"}, {file = "src/**/*.cpp"}] under [tool.uv] would make uv run match what CLAUDE.md promises.

What the fix does:

include/bh_python/histogram.hpp — new histogram_sum, histogram_empty, and histogram_equal overload sets: a generic version delegating to the existing bh::algorithm/operator== paths, and sparse specializations that iterate only the hash map. flow=False reuses the COO unravel logic via a new detail::is_inner_cell helper. Equality compares the two maps treating absent keys as zero, since -= can leave explicit zeros behind (so it's not a structural map compare).
register_histogram.hpp — the sum, empty, __eq__, and __ne__ bindings now route through those overloads; dense storages compile to exactly the old behavior.
repr needed no changes beyond this: it only calls sum() twice (axes repr was already fixed for huge axes, and __str__ already falls back to repr for sparse), so it never touches a dense view or array.

Verification:

On a 10⁹-bin sparse histogram: sum 0.9 s → 0.00 ms, == 1.6 s → 0.01 ms, repr → 0.03 ms.
New tests: a 10¹⁵-cell (100k³) test for sum/empty/==, a sparse-vs-dense sum/empty agreement test, and an explicit-zeros equality test (h -= h.copy() then compare to a fresh histogram). The sparse file went from 6.27 s to 0.07 s for 32 tests.
Full suite: 1012 passed, 1 expected xfail; prek -a --quiet clean.
Docs: the storage guide bullet now states sum/empty/== iterate only the filled cells.

… 32-bit Replace manual early-return loops in histogram_empty and histogram_equal with std::all_of to satisfy the readability-use-anyofallof clang-tidy rule. Skip test_huge_axis_ops_iterate_filled_cells_only on 32-bit Python: a 100000^3 histogram has ~10^15 cells, overflowing 32-bit std::size_t linear keys, so is_inner_cell misclassifies filled cells on those platforms. Assisted-by: ClaudeCode:claude-sonnet-4-6

github-actions Bot added the needs changelog Might need a changelog entry label Jun 9, 2026

pre-commit-ci Bot and others added 5 commits June 9, 2026 15:14

style: pre-commit fixes

a9d5e50

henryiii marked this pull request as ready for review June 9, 2026 21:38

henryiii requested a review from pfackeldey June 9, 2026 21:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add DoubleSparse sparse storage backend#1139

feat: add DoubleSparse sparse storage backend#1139
henryiii wants to merge 8 commits into
scikit-hep:developfrom
henryiii:sparse-storage

henryiii commented Jun 9, 2026 •

edited

Loading

Uh oh!

henryiii commented Jun 9, 2026

Uh oh!

henryiii commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

henryiii commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Scope: double only, for now

Data access — COO-first, not a full drop-in

Reading the filled cells / SciPy

Serialization

Implementation notes

Docs & tests

Uh oh!

henryiii commented Jun 9, 2026

Overview

What's good

Issues

Minor notes

Verdict

Uh oh!

henryiii commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

henryiii commented Jun 9, 2026 •

edited

Loading

Scope: `double` only, for now