Skip to content

feat: add DoubleSparse sparse storage backend#1139

Open
henryiii wants to merge 8 commits into
scikit-hep:developfrom
henryiii:sparse-storage
Open

feat: add DoubleSparse sparse storage backend#1139
henryiii wants to merge 8 commits into
scikit-hep:developfrom
henryiii:sparse-storage

Conversation

@henryiii

@henryiii henryiii commented Jun 9, 2026

Copy link
Copy Markdown
Member

🤖 AI text below 🤖

What

Adds a sparse DoubleSparse storage backend, backed by
boost::histogram::storage_adaptor<std::unordered_map<std::size_t, double>>.
Only filled cells are allocated, so histograms over very large or
high-dimensional axis spaces (where a dense grid would not fit in memory)
become feasible. Based on the WIP at pfackeldey/boost-histogram@sparse_hists.

Scope: double only, for now

Accumulator-backed sparse storages (Weight, Mean, …) need
boostorg/histogram#421, which
ships in Boost 1.92. This repo vendors Boost 1.90, and a double-backed
sparse storage uses a plain arithmetic value type unaffected by that bug, so it
works today. The siblings can be added after a Boost bump.

Data access — COO-first, not a full drop-in

A hash map has no contiguous buffer, so:

  • .view() and np.asarray(h) raise TypeError — there is no buffer to
    view into, and implicit array conversion of a potentially huge sparse
    histogram would be a silent memory hazard.
  • Histogram.to_coo(flow=False) returns the filled cells as
    (indices, values), where indices is a tuple of per-axis index arrays
    (numpy-nonzero style, usable directly as a fancy index). flow=True returns
    the with-flow grid (underflow at 0).
  • The explicit copying accessors values() / variances() / counts()
    densify into a fresh array, so they keep working (densifying a huge grid is
    then the caller's deliberate choice).
  • at() reads, slicing, projection, sum, and histogram+histogram addition
    stay sparse. __setitem__ and scalar h * 2 raise (they need a live buffer).

Reading the filled cells / SciPy

For a 2D histogram, to_coo() maps directly onto a scipy.sparse.coo_array:

(rows, cols), values = h.to_coo()
matrix = coo_array((values, (rows, cols)), shape=h.axes.size)

Serialization

  • Pickle / copy / deepcopy round-trip via free save/load overloads for
    double_sparse (logical size + COO key/value arrays).
  • UHI to_uhi / from_uhi store the filled cells as a "sparse" COO
    payload plus writer_info storage_type="DoubleSparse", reconstructed with a
    C++ _from_coo bulk setter. The sparse encoding is boost-histogram specific.

Implementation notes

  • storage_adaptor publicly inherits map_impl : unordered_map, so
    static_cast<unordered_map&>(unsafe_access::storage(h)) iterates only the
    filled cells (no densification) for COO extraction and pickling.
  • COO get/set are bound only on the sparse histogram via a small
    is_sparse_storage trait + tag-dispatched register_sparse_coo.

Docs & tests

  • New Sparse storages section in the storage user guide, including the SciPy
    coo_array example above.
  • New tests/test_sparse_storage.py: fill/read, stays-sparse under
    slice/project, to_coo with/without flow, .view()/np.asarray raise while
    the copying accessors densify and match a dense Double histogram, repr/str
    without densifying, copy/pickle, and UHI round-trips (incl. JSON, empty,
    structure-only). Full suite green; prek (ruff/mypy/clang-format/codespell)
    clean.

Add a hash-map backed `DoubleSparse` storage that only allocates filled
cells, making histograms over very large or high-dimensional axis spaces
feasible. Backed by `storage_adaptor<unordered_map<size_t, double>>`.

Scoped to `double` only for now: accumulator-backed sparse storages need
boostorg/histogram#421, which ships in Boost 1.92 (the repo vendors 1.90).

Data access is COO-first and deliberately not a full drop-in:
- `.view()` raises (no contiguous buffer to view into).
- `Histogram.to_coo(flow=False)` returns the filled cells as
  `(per-axis index tuple, values)`, numpy-nonzero style.
- The copying accessors `values()`/`variances()`/`counts()` and
  `np.asarray(h)` densify into a fresh array, so they keep working.
- `at()` reads, slicing, projection, sum and histogram addition stay sparse.

Serialization: pickle round-trips via free save/load on `double_sparse`
(COO key/value arrays); UHI to_uhi/from_uhi store COO under a `"sparse"`
payload plus writer_info, reconstructed via a C++ `_from_coo` bulk setter.

Assisted-by: ClaudeCode:claude-opus-4.8
@github-actions github-actions Bot added the needs changelog Might need a changelog entry label Jun 9, 2026
pre-commit-ci Bot and others added 5 commits June 9, 2026 15:14
np.asarray(h) now raises for sparse storage like .view() does, rather than
silently densifying — implicit array conversion of a potentially huge sparse
histogram is a memory hazard. The explicit copying accessors (values/variances/
counts) still densify on demand.

Add a "Sparse storages" section to the storage user guide, including how to
build a scipy.sparse.coo_array from to_coo() output.

Assisted-by: ClaudeCode:claude-opus-4.8
- Add missing indices.shape(1) vs values.shape(0) validation in
  histogram_from_coo to prevent out-of-bounds reads.
- Hoist std::vector allocation out of the map loop in histogram_to_coo.
- Add tests for setitem raising on sparse and sparse+histogram addition.

Assisted-by: opencode:NRP/kimi
Regular._repr_args_ built the full edges array just to read the first
and last edge; for a very large axis (e.g. the sparse-storage repr test
with 1e9 bins) this allocates gigabytes, OOM-killing PyPy CI and
overflowing the max array size on 32-bit Windows. Use value() for the
endpoints instead, which is O(1).

Also pass the coo arrays to histogram_from_coo by const reference to
satisfy clang-tidy's performance-unnecessary-value-param check.

Assisted-by: ClaudeCode:claude-opus-4.8
Route scalar * and / on sparse storage through the COO get/set path so
they scale only the filled cells (0*x == 0 preserves sparsity) instead
of densifying via the NumPy view. Scalar +/- are refused with a clear
error since they would fill every cell. Document which operations stay
sparse, densify, or raise, and note mixed int+slice indexing as a
follow-up.

Assisted-by: ClaudeCode:claude-opus-4.8
@henryiii

henryiii commented Jun 9, 2026

Copy link
Copy Markdown
Member Author

Review from Claude Fable 5:

🤖 AI text below 🤖

I have everything I need — here's the review of PR #1139: feat: add DoubleSparse sparse storage backend (draft, sparse-storagedevelop, +858/−13, 15 files).

Overview

Adds a DoubleSparse storage backed by storage_adaptor<std::unordered_map<size_t, double>>, with COO-based data access (to_coo/_from_coo), explicit densifying accessors, pickle/UHI serialization, docs, and a thorough new test file. The design philosophy is sound and consistently applied: keep what can stay sparse sparse, raise loudly where a dense buffer would be required, and make densification an explicit caller choice.

What's good

  • The COO unravel/ravel logic in histogram.hpp correctly matches Boost's internal layout (underflow at offset 0 when present), and _from_coo validates both index-array dimensions before writing.
  • I verified in extern/histogram/storage_adaptor.hpp that map_impl::reference erases on zero-assignment and doesn't insert on reads, so _from_coo with zero values and point reads (h[2] == 0.0) both preserve sparsity. The upcast-to-unordered_map trick is legitimate given the public inheritance chain.
  • Test coverage is genuinely strong: flow/no-flow COO, dense-equivalence checks, all the raise paths, JSON/UHI/pickle round-trips, structure-only, and empty histograms. All 29 pass locally.
  • Error messages consistently point users at to_coo() — good UX for a non-drop-in storage.

Issues

1. sum(), ==, and therefore repr are O(dense size), which breaks the headline use case. Boost's map_impl iterator walks the logical index space, not the filled cells. I measured on a 10⁹-bin sparse histogram with 2 entries: h.sum(flow=True) takes 0.9 s and h == h2 takes 1.6 s. __repr__ calls sum() twice, so test_repr_and_str_do_not_densify alone takes 6.2 s. The docs' motivating example is a 10¹⁵-cell grid — at this rate, merely displaying that histogram in a REPL would take on the order of a week, and h1 + h2 equality checks or .sum() in user code would hang similarly. Suggested fixes:

  • Specialize sum for sparse storage (flow=True is just summing the map values; flow=False reuses the to_coo flow-filter logic). Could be done in C++ next to histogram_to_coo, or in Python by routing Histogram.sum through _to_coo for sparse.
  • == is harder (it's C++ histogram equality, and maps can hold explicit zeros via operator-=), but at minimum the docs' "stays sparse (or returns a scalar) — delegates to the C++ storage" list shouldn't imply .sum() is cheap; today it's a time bomb, not a memory bomb.

I'd treat this as blocking for un-drafting: the feature's whole pitch is huge axis spaces, and the default REPL display hangs on them.

2. UHI payload masquerades as "type": "double". A non-boost-histogram UHI consumer (or an older boost-histogram) reading keep_storage=True output sees type == "double" with a values key present — but it's a 1D COO array, not the dense grid the type promises. Best case it fails with a confusing shape error; worst case something broadcasts it. Since the encoding is admittedly bh-specific anyway, consider a distinct type string (e.g. "sparse_double") so foreign readers fail loudly at the type check, keeping "double" only for the structure-only (keep_storage=False) payload where it's accurate.

3. Concurrent fills can corrupt the map under free-threading. fill.hpp releases the GIL, and map_impl explicitly declares has_threading_support = false — concurrent fill() on one sparse histogram can rehash the unordered_map concurrently, which is a crash/UB rather than the benign lost-update races dense Double has. With 3.14t supported and tests/test_threaded_fill.py in the suite, this deserves at least a docs warning in the new storage section ("do not fill one sparse histogram from multiple threads"), and ideally a thought about whether fill should keep the GIL/lock for this storage.

Minor notes

  • histogram_from_coo casts indices through int for at() — that's Boost's own index_type limit, so fine, but a comment noting axes >2³¹ bins per axis are unsupported wouldn't hurt.
  • h *= np.array(...) on sparse takes the array branch and dies with the .view() error ("use .to_coo()") — slightly misleading for a write operation, but acceptable.
  • The indices.ndim == 1 reshape fallback in from_uhi looks unreachable in practice ([[]] round-trips JSON as 2-D); harmless defensive code.
  • The Regular._repr_args_ change to self._ax.value(0)/value(self.size) is correct (identical to edges[0]/edges[-1], including for transformed axes) and necessary for huge axes — nice catch.
  • The "needs changelog" label is already on the PR; a changelog entry is still pending.
  • Stub additions (any_double_sparse, double_sparse), register_sparse_coo tag dispatch, and the _view_or_dense accessor split all follow existing project patterns cleanly.

Verdict

Architecture, serialization, and test coverage are in good shape. The blocker is finding #1: hidden O(dense-size) iteration in sum/==/repr defeats the feature's core promise and should be fixed (at least sum, which also fixes repr) before this leaves draft. Findings #2 and #3 are smaller, addressable with a type-string change and a docs warning respectively.

The storage iterator of a map-backed histogram visits every logical
cell, so bh::algorithm::sum/empty and histogram::operator== were
O(dense size) -- repr (which calls sum twice) on the documented
10**15-cell example would never return. Dispatch sum, empty, and
==/!= through overloads that walk only the hash map for sparse
storage, treating explicit zeros (left behind by -=) the same as
absent cells.

Assisted-by: ClaudeCode:claude-fable-5
@henryiii

henryiii commented Jun 9, 2026

Copy link
Copy Markdown
Member Author

🤖 Fable 5 text below 🤖

Done — committed as ef75137 on sparse-storage (not pushed).

About the hang you hit: it wasn't the fix itself — uv run doesn't rebuild when only C++ headers change (no [tool.uv] cache-keys configured), so both your run and my first run executed the stale 12:03 binary. Its old O(dense-size) empty() met the new test's 10¹⁵-cell grid and spun forever in C++ with the GIL released, which is why you needed ctrl+\. After forcing a rebuild with uv run --reinstall-package boost-histogram pytest, everything passes. If you want, adding cache-keys = [{file = "include/**"}, {file = "src/**/*.cpp"}] under [tool.uv] would make uv run match what CLAUDE.md promises.

What the fix does:

  • include/bh_python/histogram.hpp — new histogram_sum, histogram_empty, and histogram_equal overload sets: a generic version delegating to the existing bh::algorithm/operator== paths, and sparse specializations that iterate only the hash map. flow=False reuses the COO unravel logic via a new detail::is_inner_cell helper. Equality compares the two maps treating absent keys as zero, since -= can leave explicit zeros behind (so it's not a structural map compare).
  • register_histogram.hpp — the sum, empty, __eq__, and __ne__ bindings now route through those overloads; dense storages compile to exactly the old behavior.
  • repr needed no changes beyond this: it only calls sum() twice (axes repr was already fixed for huge axes, and __str__ already falls back to repr for sparse), so it never touches a dense view or array.

Verification:

  • On a 10⁹-bin sparse histogram: sum 0.9 s → 0.00 ms, == 1.6 s → 0.01 ms, repr → 0.03 ms.
  • New tests: a 10¹⁵-cell (100k³) test for sum/empty/==, a sparse-vs-dense sum/empty agreement test, and an explicit-zeros equality test (h -= h.copy() then compare to a fresh histogram). The sparse file went from 6.27 s to 0.07 s for 32 tests.
  • Full suite: 1012 passed, 1 expected xfail; prek -a --quiet clean.
  • Docs: the storage guide bullet now states sum/empty/== iterate only the filled cells.

… 32-bit

Replace manual early-return loops in histogram_empty and histogram_equal
with std::all_of to satisfy the readability-use-anyofallof clang-tidy rule.

Skip test_huge_axis_ops_iterate_filled_cells_only on 32-bit Python: a
100000^3 histogram has ~10^15 cells, overflowing 32-bit std::size_t linear
keys, so is_inner_cell misclassifies filled cells on those platforms.

Assisted-by: ClaudeCode:claude-sonnet-4-6
@henryiii henryiii marked this pull request as ready for review June 9, 2026 21:38
@henryiii henryiii requested a review from pfackeldey June 9, 2026 21:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs changelog Might need a changelog entry

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant