Skip to content

feat: AnnData.can_write based on AnnData.fold + refactored AnnData.__sizeof__#2372

Draft
ilan-gold wants to merge 5 commits intomainfrom
ig/fold_can_write
Draft

feat: AnnData.can_write based on AnnData.fold + refactored AnnData.__sizeof__#2372
ilan-gold wants to merge 5 commits intomainfrom
ig/fold_can_write

Conversation

@ilan-gold
Copy link
Contributor

@ilan-gold ilan-gold commented Mar 19, 2026

maybe TODOs:

  • fold? reduce?
  • Implement this inside write_{zarr,h5ad} to prevent partial stores from being written?

We introduce AnnData.fold and AnnData.can_write for "crawling" an AnnData and accumulating, implementing a can_write to tell if an AnnData can be written to a given format.

I'm also open to reduce as the name instead of fold.

For #2236, I think we could implement this more complexly by building on fold and accumulating a set of AdRef (or similar) objects pointing at what can't be written. Open to implement this! I reimplemented __sizeof__ to stress-test using reference-accessors and it went ok, but it's not amazingly smooth (like uns and trying to figure out if you are in a "top-level" elem like obs or varm)

If zarr-developers/zarr-python#3613 ever happens (I hope so) or we do #2238, this will be crucial to making things smoother with hdf5 for preventing incompatibilities.

  • Release note not necessary because:

@ilan-gold ilan-gold added this to the 0.13.0 milestone Mar 19, 2026
@codecov
Copy link

codecov bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 86.15385% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.40%. Comparing base (753f058) to head (0f4d1b0).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/anndata/_core/anndata.py 87.75% 6 Missing ⚠️
src/anndata/acc/__init__.py 80.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2372      +/-   ##
==========================================
- Coverage   87.36%   85.40%   -1.97%     
==========================================
  Files          49       49              
  Lines        7787     7830      +43     
==========================================
- Hits         6803     6687     -116     
- Misses        984     1143     +159     
Files with missing lines Coverage Δ
src/anndata/types.py 100.00% <100.00%> (ø)
src/anndata/acc/__init__.py 95.94% <80.00%> (-0.68%) ⬇️
src/anndata/_core/anndata.py 83.54% <87.75%> (+0.19%) ⬆️

... and 8 files with indirect coverage changes

@ilan-gold ilan-gold changed the title feat: AnnData.can_write based on AnnData.fold feat: AnnData.can_write based on AnnData.fold + refactored AnnData.__sizeof__ Mar 19, 2026
}

def predicate(x: RWAble, *, accumulate: bool, ref_acc: AdRef | RefAcc | None):
if isinstance(ref_acc, AdRef) or ref_acc is None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and above uns is ref_acc is None but I feel like there should be a better way to describe this!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, let’s talk about this!

@ilan-gold ilan-gold requested a review from flying-sheep March 19, 2026 14:41
@ilan-gold
Copy link
Contributor Author

Just looking for a first-pass vibe check on this!

Comment on lines +542 to +547
if is_ad_ref:
key = type(ref_acc.acc)
elif is_ref_acc:
key = ref_acc.parent_type
else:
key = None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was discussed during the PR for reference accessors but I think it should be easier to traverse up the tree. open to ideas!

Copy link
Member

@flying-sheep flying-sheep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very exciting! I think this mainly needs some serious documenting and examples, not much else.

For examples, maybe first something that just handles one kind of accessor/reference first to show how one can short-circuit by returning the accumulator early.

Regarding naming: the canonical name in the stdlib is reduce, it even used to be in builtins, but maybe the scientific python world deviates for no good reason again. It’s also fair if you have other objections to reduce!

init: T,
order: Literal["DFS-pre", "DFS-post"] = "DFS-post",
) -> T:
"""Accumulate a value starting from init by iterating over the "elems"/leaf nodes of the AnnData object.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain what this function passes to ref_acc for which elements.

Either

  • link explicitly to FoldFunc (additionally to the typehints auto link) and explain it there, or
  • explain it here and direct people from the FoldFunc docs to here.

Comment on lines +532 to +537
(is_ad_ref := isinstance(ref_acc, AdRef))
or (
is_ref_acc := isinstance(
ref_acc, LayerAcc | MultiAcc | GraphAcc
)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is

isinstance(ref_acc, AdRef | RefAcc) and not isinstance(ref_acc, MetaAcc)

because you want to handle each homogeneous array.

I think there are some ways to make this more ergonomic, e.g.:

  • we can add more marker superclasses that handle this use case, or
  • we could change the implementation of fold/reduce so it descends into things and returning or raising some marker would prevent descent. that might help handling things as early as possible and therefore having less complicated logic.

@katosh
Copy link
Contributor

katosh commented Mar 20, 2026

Hi @ilan-gold, interesting direction! The HTML repr PR (#2236) intersects with fold in a couple of ways, so I wanted to share some observations from the implementation side.

HTML repr (#2236): traversal overlap

The _repr_html_ traverses the same element tree that fold formalizes. The core loop lives in html.py:_render_all_sections and iterates over a SECTION_ORDER tuple:

SECTION_ORDER = ("X", "obs", "var", "uns", "obsm", "varm", "layers", "obsp", "varp", "raw")

for section in SECTION_ORDER:
    parts.append(_render_section(adata, section, context))

This is structurally very similar to fold's inner loop. A few differences worth noting:

  1. raw is missing from fold. The repr includes .raw as a collapsible section (X, var, varm). Would fold eventually cover raw?

  2. Section-level grouping. The repr builds a tree (section headers, then entries, then nested content). Since entries within a section are visited contiguously, the callback could track state in the accumulator to detect section transitions: in DFS-pre mode the parent visit signals "entering," and the next section's parent visit implicitly signals "exiting" the previous one. This is workable, but it would be smoother if fold supported explicit enter/exit events for sections, so consumers don't need to roll their own state tracking for something that fold already knows internally.

  3. Custom/unknown sections. The repr also detects mapping-like attributes not in the standard list (for ecosystem packages like TreeData that add .obst/.vart). fold currently hardcodes the same attribute list. Worth considering whether it should be extensible or whether that's out of scope.

  4. Nested AnnData in uns. The repr handles AnnData objects found inside uns values, with depth tracking to prevent infinite recursion. fold doesn't recurse into nested AnnData within uns, so that use case wouldn't be covered.

On the positive side, the per-column iteration of obs/var via MetaAcc aligns well with how the repr already iterates df.columns to render each column individually.

The repr would naturally use DFS-pre order (render section header first, then its entries), so it's good that option exists. If fold stabilizes, I could likely use it as the backbone of the repr's traversal by accumulating a list[RenderedSection]. The SECTION_ORDER loop would become a fold call. The main prerequisite would be that fold supports distinguishing "entering a section" from "visiting a leaf within it."

Serialization checks: can_write overlap

I understand can_write has a different primary goal: preventing partial writes to zarr/h5 stores by checking upfront whether the whole object is writable. That's valuable on its own. But there's a nice overlap: the repr already performs the same registry lookups at element granularity (utils.py:is_serializable, formatters.py:_check_series_serializability) to show warning icons on entries that can't be written. These checks:

  • Query _REGISTRY.write to test if a type has a registered writer (same approach as can_write's predicate)
  • Additionally check Series backing arrays, key name validity (slashes, non-string keys), and nested container contents
  • Return (is_serializable, reason) tuples so the repr can show why something can't be written

If the underlying element-level check were factored into a shared utility, something like can_write_elem(obj, store_type=) -> tuple[bool, str], both use cases could build on it: can_write would fold with all(...), and the repr would call it per-entry to get the reason string for its warning icon. I'd be happy to contribute to making that work if it's a direction you'd consider.

ref_acc API feedback

From the repr's perspective, the current ref_acc protocol is workable but has a few rough edges:

  • None for uns: I understand uns is intentionally excluded from the accessor system because it's unstructured and not axis-aligned, and that makes sense. But from a fold consumer's perspective, using ref_acc is None as a proxy for "this is a uns element" is easy to get wrong (as the __sizeof__ refactor shows with ref_acc is None and X is not self.uns). Maybe fold could wrap this in a lightweight sentinel (even just a string tag) so callers don't need to reason about the accessor system's internals?
  • Distinguishing leaf vs parent visits requires isinstance chains on ref_acc types. For the repr, something like ref_acc.is_leaf or separate callback signatures for section-enter/element/section-exit would be more ergonomic.

Happy to discuss further, just wanted to flag the touchpoints while the API is still taking shape.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants