Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,7 @@ repos:
rev: v2.4.2
hooks:
- id: codespell
args: ["-L", "hist,nd,circularly,ba,fo", "-w"]
args: ["-L", "hist,nd,circularly,ba,fo,coo", "-w"]
exclude: ^(notebooks/xarray.ipynb|notebooks/BoostHistogramHandsOn.ipynb|.*\.svg)$

- repo: local
Expand Down
108 changes: 108 additions & 0 deletions docs/user-guide/storage.rst
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,114 @@ This storage is like ``Int64()``, but also provides a thread safety guarantee.
You can fill a single histogram from multiple threads.


Sparse storages
---------------

DoubleSparse
^^^^^^^^^^^^

The ``DoubleSparse()`` storage is like ``Double()``, but only the bins that are
actually filled are stored (in a hash map keyed on the bin index). This makes
histograms over very large or high-dimensional axis spaces feasible, where a
dense grid would not fit in memory even though only a small fraction of the bins
are ever filled.

.. code-block:: python3

import boost_histogram as bh

# A 5-dimensional grid; dense storage would need 10**15 cells.
axes = [bh.axis.Regular(1000, 0, 1) for _ in range(5)]
h = bh.Histogram(*axes, storage=bh.storage.DoubleSparse())
h.fill(*data) # only the filled cells are allocated

Because the data is not stored as a dense array, **sparse storage is not a
drop-in replacement for a dense histogram**. Operations that can be expressed
without materializing the full grid keep the result sparse; operations that need
a dense buffer raise a clear ``TypeError`` (pointing you at
:meth:`~boost_histogram.Histogram.to_coo`) rather than silently densifying.

Stays sparse (or returns a scalar) -- delegates to the C++ storage:

* Reading a single bin: ``h[bh.loc(...)]``, ``h[5]``.
* Integer slicing (``h[10:60]``), rebinning (``h[:: bh.rebin(2)]``), integration
(``h[10:60:sum]``), and projection (``h.project(0)``).
* ``.sum()``, ``.empty()``, and ``==`` comparison -- these iterate only the
filled cells, never the dense grid.
* Adding two histograms (``h1 + h2``) and scalar ``*`` / ``/`` (which only
scale the filled cells, so ``0`` stays ``0``).
* ``copy``/``deepcopy``, pickling, and UHI serialization.

Densifies into a fresh array (works, but allocates the full dense shape):

* The copying accessors ``h.values()``, ``h.variances()``, ``h.counts()``. Only
use these when the dense shape fits in memory; otherwise prefer
:meth:`~boost_histogram.Histogram.to_coo`.

Unsupported -- raises ``TypeError``:

* ``h.view()`` and ``np.asarray(h)`` -- there is no contiguous buffer to view.
* Writes: ``h[...] = value`` (filling with ``h.fill(...)`` is the only write path).
* Scalar ``+`` / ``-`` -- these would add to *every* cell, filling the grid.
* Fancy indexing that gathers through a dense view: mixed integer + slice
(``h[2, :]``), categorical list picks (``h[:, [0, 2]]``), and NumPy array
indexing (``h[np.array([...])]``).

.. note::

Mixed integer + slice indexing (``h[2, :]``) is a common pattern and a likely
follow-up to support natively on sparse storage; for now, slice without
collapsing the axis (``h[2:3, :]``) to stay sparse, or densify first.

Only ``double`` sparse storage is available for now; accumulator-backed sparse
storages (the equivalent of ``Weight``, ``Mean``, ...) are waiting on a
Boost.Histogram fix that ships in Boost 1.92.

Reading the filled cells
""""""""""""""""""""""""

:meth:`~boost_histogram.Histogram.to_coo` returns the filled cells in coordinate
(COO) form -- a tuple of per-axis index arrays (in the style of
:func:`numpy.nonzero`, so they can be used directly as a fancy index) and the
array of corresponding values:

.. code-block:: python3

h = bh.Histogram(bh.axis.Regular(5, 0, 5), storage=bh.storage.DoubleSparse())
h.fill([1.5, 1.5, 3.5])

indices, values = h.to_coo()
# indices == (array([1, 3]),) values == array([2., 1.])

Pass ``flow=True`` to include the underflow/overflow bins (the indices then run
over the with-flow grid, with underflow at ``0``).

Converting to a SciPy sparse array
""""""""""""""""""""""""""""""""""

For a 2D histogram, the COO output maps directly onto a
:class:`scipy.sparse.coo_array`. The per-axis index arrays are the row and
column coordinates, and ``h.axes.size`` is the dense shape:

.. code-block:: python3

from scipy.sparse import coo_array

h = bh.Histogram(
bh.axis.Regular(1000, 0, 1000),
bh.axis.Regular(1000, 0, 1000),
storage=bh.storage.DoubleSparse(),
)
h.fill(xs, ys)

(rows, cols), values = h.to_coo()
matrix = coo_array((values, (rows, cols)), shape=h.axes.size)

(SciPy sparse arrays are two-dimensional, so this applies to 2D histograms; for
higher dimensions, use the ``(indices, values)`` pair directly, e.g. with an
N-dimensional sparse library such as ``sparse``.)


Accumulator storages
--------------------

Expand Down
237 changes: 237 additions & 0 deletions include/bh_python/histogram.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -12,11 +12,18 @@
#include <bh_python/accumulators/weighted_sum.hpp>
#include <bh_python/axis.hpp>
#include <bh_python/multi_cell.hpp>
#include <bh_python/storage.hpp>

#include <boost/histogram/algorithm/empty.hpp>
#include <boost/histogram/algorithm/sum.hpp>
#include <boost/histogram/detail/axes.hpp>
#include <boost/histogram/histogram.hpp>
#include <boost/histogram/unsafe_access.hpp>

#include <algorithm>
#include <unordered_map>
#include <vector>

namespace pybind11 {

/// The descriptor for atomic_* is the same as the descriptor for *, as long this uses
Expand Down Expand Up @@ -80,6 +87,17 @@ py::buffer_info make_buffer(bh::histogram<A, bh::dense_storage<T>>& h, bool flow
return detail::make_buffer_impl(axes, flow, &storage[0]);
}

/// Sparse storage keeps only filled cells in a hash map, so there is no
/// contiguous buffer to expose. Callers should use ``to_coo()`` instead. We
/// throw rather than return an empty buffer so ``.view()`` / ``np.asarray`` give
/// a single, clear error.
template <class A, class T>
py::buffer_info make_buffer(bh::histogram<A, storage::sparse_storage<T>>& /* h */,
bool /* flow */) {
throw py::type_error(
"sparse storage does not support .view() or buffer access; use .to_coo()");
}

/// Specialization for unlimited_buffer
template <class A, class Allocator>
py::buffer_info make_buffer(bh::histogram<A, bh::unlimited_storage<Allocator>>& h,
Expand Down Expand Up @@ -123,6 +141,225 @@ py::buffer_info make_buffer(bh::histogram<A, bh::multi_cell<T>>& h, bool flow) {
std_as_const(new_axes), flow, static_cast<double*>(storage.get_buffer()));
}

/// Trait identifying our sparse (hash-map backed) storages.
template <class S>
struct is_sparse_storage : std::false_type {};

template <class T>
struct is_sparse_storage<storage::sparse_storage<T>> : std::true_type {};

namespace detail {
/// Per-axis layout used to (un)ravel a flat sparse storage index. The flat index
/// counts flow bins, with underflow (when present) at position 0, exactly like
/// make_buffer_impl above.
struct axis_layout {
std::size_t extent; // size + has_underflow + has_overflow
std::size_t size; // number of normal (in-range) bins
std::size_t underflow; // 1 if this axis has an underflow bin, else 0
};

template <class Axes>
std::vector<axis_layout> make_axis_layout(const Axes& axes) {
std::vector<axis_layout> layout;
bh::detail::for_each_axis(axes, [&layout](const auto& ax) {
const bool has_underflow
= bh::axis::traits::options(ax) & bh::axis::option::underflow;
layout.push_back({static_cast<std::size_t>(bh::axis::traits::extent(ax)),
static_cast<std::size_t>(ax.size()),
has_underflow ? std::size_t{1} : std::size_t{0}});
});
return layout;
}

/// True if a flat sparse storage key addresses an in-range (non-flow) cell.
inline bool is_inner_cell(std::size_t lin, const std::vector<axis_layout>& layout) {
for(const auto& info : layout) {
const std::size_t iflow = lin % info.extent;
lin /= info.extent;
if(iflow < info.underflow || iflow >= info.underflow + info.size)
return false;
}
return true;
}
} // namespace detail

/// Extract the filled cells of a sparse histogram in COO form. Returns a tuple
/// ``(indices, values)`` where ``indices`` is an ``(ndim, n)`` integer array and
/// ``values`` an ``n``-length array. With ``flow=False`` the flow cells are
/// dropped and indices run 0..size-1; with ``flow=True`` indices run
/// 0..extent-1 (underflow at 0), matching the ``view(flow=True)`` grid.
template <class A, class T>
py::tuple histogram_to_coo(bh::histogram<A, storage::sparse_storage<T>>& h, bool flow) {
const auto& axes = bh::unsafe_access::axes(h);
auto& storage = bh::unsafe_access::storage(h);
const auto layout = detail::make_axis_layout(axes);
const auto rank = layout.size();

// storage_adaptor publicly inherits map_impl<T>, which publicly inherits the
// underlying map, so this upcast exposes only the filled cells.
using map_type = std::unordered_map<std::size_t, T>;
const auto& map = static_cast<const map_type&>(storage);

std::vector<std::vector<py::ssize_t>> idx_cols(rank);
std::vector<T> vals;
vals.reserve(map.size());

std::vector<py::ssize_t> multi(rank);
for(const auto& kv : map) {
std::size_t lin = kv.first;
bool keep = true;
for(std::size_t ax = 0; ax < rank; ++ax) {
const auto& info = layout[ax];
const std::size_t iflow = lin % info.extent;
lin /= info.extent;
if(flow) {
multi[ax] = static_cast<py::ssize_t>(iflow);
} else {
const py::ssize_t real = static_cast<py::ssize_t>(iflow)
- static_cast<py::ssize_t>(info.underflow);
if(real < 0 || real >= static_cast<py::ssize_t>(info.size)) {
keep = false;
break;
}
multi[ax] = real;
}
}
if(!keep)
continue;
for(std::size_t ax = 0; ax < rank; ++ax)
idx_cols[ax].push_back(multi[ax]);
vals.push_back(kv.second);
}

const auto n = static_cast<py::ssize_t>(vals.size());
py::array_t<py::ssize_t> indices({static_cast<py::ssize_t>(rank), n});
if(n > 0) {
auto ind = indices.template mutable_unchecked<2>();
for(std::size_t ax = 0; ax < rank; ++ax)
for(py::ssize_t j = 0; j < n; ++j)
ind(static_cast<py::ssize_t>(ax), j)
= idx_cols[ax][static_cast<std::size_t>(j)];
}

py::array_t<T> values(n);
std::copy(vals.begin(), vals.end(), values.mutable_data());

return py::make_tuple(std::move(indices), std::move(values));
}

/// Inverse of histogram_to_coo: scatter ``(indices, values)`` back into a sparse
/// histogram. Uses the same flow convention as histogram_to_coo. This is the
/// only bulk write path for sparse storage, since the numpy-buffer based
/// __setitem__ cannot densify the map.
template <class A, class T>
void histogram_from_coo(bh::histogram<A, storage::sparse_storage<T>>& h,
const py::array_t<py::ssize_t>& indices,
const py::array_t<T>& values,
bool flow) {
const auto& axes = bh::unsafe_access::axes(h);
const auto layout = detail::make_axis_layout(axes);
const auto rank = layout.size();

auto ind = indices.template unchecked<2>();
auto val = values.template unchecked<1>();
if(static_cast<std::size_t>(ind.shape(0)) != rank)
throw py::value_error("indices first dimension must equal histogram rank");

const auto n = val.shape(0);
if(ind.shape(1) != n)
throw py::value_error("indices second dimension must match values length");
std::vector<int> at_index(rank);
for(py::ssize_t j = 0; j < n; ++j) {
for(std::size_t ax = 0; ax < rank; ++ax) {
const py::ssize_t coord = ind(static_cast<py::ssize_t>(ax), j);
// Convert back to boost's at() signed index (underflow -1, overflow size).
const py::ssize_t signed_idx
= flow ? coord - static_cast<py::ssize_t>(layout[ax].underflow) : coord;
at_index[ax] = static_cast<int>(signed_idx);
}
h.at(at_index) = val(j);
}
}

template <class A, class S>
auto histogram_sum(const bh::histogram<A, S>& h, bool flow) {
return bh::algorithm::sum(h, flow ? bh::coverage::all : bh::coverage::inner);
}

/// The storage iterator of a map-backed histogram visits every *logical* cell
/// (hash lookup per cell), so bh::algorithm::sum would be O(dense size) β€” for
/// the huge axis spaces sparse storage targets, that never finishes. Sum only
/// the filled cells instead; flow=false unravels each key to skip flow cells.
template <class A, class T>
T histogram_sum(const bh::histogram<A, storage::sparse_storage<T>>& h, bool flow) {
using map_type = std::unordered_map<std::size_t, T>;
const auto& map = static_cast<const map_type&>(bh::unsafe_access::storage(h));

T total{};
if(flow) {
for(const auto& kv : map)
total += kv.second;
return total;
}

const auto layout = detail::make_axis_layout(bh::unsafe_access::axes(h));
for(const auto& kv : map)
if(detail::is_inner_cell(kv.first, layout))
total += kv.second;
return total;
}

template <class A, class S>
bool histogram_empty(const bh::histogram<A, S>& h, bool flow) {
return bh::algorithm::empty(h, flow ? bh::coverage::all : bh::coverage::inner);
}

/// Sparse counterpart of bh::algorithm::empty, iterating only the filled cells
/// (see histogram_sum). Explicit zeros (left behind by e.g. -=) count as empty.
template <class A, class T>
bool histogram_empty(const bh::histogram<A, storage::sparse_storage<T>>& h, bool flow) {
using map_type = std::unordered_map<std::size_t, T>;
const auto& map = static_cast<const map_type&>(bh::unsafe_access::storage(h));

if(flow)
return std::all_of(
map.begin(), map.end(), [](const auto& kv) { return kv.second == T{}; });

const auto layout = detail::make_axis_layout(bh::unsafe_access::axes(h));
return std::all_of(map.begin(), map.end(), [&layout](const auto& kv) {
return kv.second == T{} || !detail::is_inner_cell(kv.first, layout);
});
}

template <class A, class S>
bool histogram_equal(const bh::histogram<A, S>& a, const bh::histogram<A, S>& b) {
return a == b;
}

/// Sparse counterpart of histogram::operator==, whose storage comparison walks
/// every logical cell (see histogram_sum). Compare the two maps directly,
/// treating absent keys as zero β€” the maps are not structurally identical for
/// equal histograms, since -= can leave explicit zeros behind.
template <class A, class T>
bool histogram_equal(const bh::histogram<A, storage::sparse_storage<T>>& a,
const bh::histogram<A, storage::sparse_storage<T>>& b) {
if(!bh::detail::axes_equal(bh::unsafe_access::axes(a), bh::unsafe_access::axes(b)))
return false;

using map_type = std::unordered_map<std::size_t, T>;
const auto& ma = static_cast<const map_type&>(bh::unsafe_access::storage(a));
const auto& mb = static_cast<const map_type&>(bh::unsafe_access::storage(b));

for(const auto& kv : ma) {
const auto it = mb.find(kv.first);
if(kv.second != (it == mb.end() ? T{} : it->second))
return false;
}
return std::all_of(mb.begin(), mb.end(), [&ma](const auto& kv) {
return kv.second == T{} || ma.find(kv.first) != ma.end();
});
}

/// Compute the bin of an array from a runtime list
/// For example, [1,3,2] will return that bin of an array
template <class F, int Opt>
Expand Down
Loading
Loading