[DataOriented] Fastcache, perf, pruning #705
Baseline state of branch is the perf-mitigations work (cache bound callable, opt-in stable_members short-circuit for spec-key and args-hash walks, skip per-call _BoundedDifferentiableMethod alloc) plus a new test file pinning down the failure mode when calling a @qd.func taking a typed-dataclass arg from inside a @qd.data_oriented method that passes self.dataclass_member. The baseline (typed-dataclass kernel arg + @qd.func) passes. The four data_oriented variants all fail.
… data_oriented self
@qd.func helpers with typed-dataclass parameters were unreachable from @qd.data_oriented kernel methods that wanted to pass self.dataclass_member: the caller-side AST expansion in _expand_Call_dataclass_args / _expand_Call_dataclass_kwargs only fired for dataclass *types* attached to bare ast.Name nodes (typed kernel args), not for dataclass *instances* attached to ast.Attribute nodes (self.X access).
Extend both expansion paths to recognise the instance-of-dataclass case and emit per-leaf ast.Attribute children. The positional path additionally threads the callee parameter name and callee_needed set through, so callee-side pruning of unused dataclass fields stays consistent with caller-side emission.
Tests in tests/python/test_data_oriented_qd_func_dataclass.py:
- baseline typed-arg + qd.func call (passes today)
- data_oriented method + qd.func with positional dataclass member
- ... with keyword dataclass member
- ... with stable_members=True
- ... with two dataclass members (Genesis-shaped)
All 5 pass.
Design doc: perso_hugh/doc/data_oriented_qd_func_dataclass.md (Option A chosen).
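The caller-side expansion described above can be illustrated with the standard `ast` module. This is a minimal sketch, not the Quadrants implementation: `ExpandSelfDataclassArgs`, `expand`, `State` and `Solver` are hypothetical names, only positional arguments and one level of dataclass nesting are handled, and callee-side pruning is omitted.

```python
import ast
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class State:
    pos: float
    vel: float


class Solver:
    def __init__(self):
        self.state = State(pos=0.0, vel=1.0)


class ExpandSelfDataclassArgs(ast.NodeTransformer):
    """Rewrite f(self.dc) into f(self.dc.<leaf>, ...) for dataclass-instance members."""

    def __init__(self, self_obj):
        self.self_obj = self_obj

    def visit_Call(self, node):
        self.generic_visit(node)
        new_args = []
        for arg in node.args:
            new_args.extend(self._expand(arg))
        node.args = new_args
        return node

    def _expand(self, arg):
        # Only `self.<name>` attribute nodes are candidates for expansion.
        is_self_attr = (
            isinstance(arg, ast.Attribute)
            and isinstance(arg.value, ast.Name)
            and arg.value.id == "self"
        )
        if not is_self_attr:
            return [arg]
        member = getattr(self.self_obj, arg.attr, None)
        # Expand dataclass *instances* only; dataclass types are left alone.
        if isinstance(member, type) or not is_dataclass(member):
            return [arg]
        return [ast.Attribute(value=arg, attr=f.name, ctx=ast.Load()) for f in fields(member)]


def expand(source, self_obj):
    tree = ExpandSelfDataclassArgs(self_obj).visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))  # requires Python 3.9+
```

With these definitions, `expand("helper(self.state, 3)", Solver())` yields a call whose first argument has been replaced by one attribute access per dataclass field, in field order.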
… self
Adds 4 tests:
- nested dataclass (Outer{Inner{ndarray}}) passed via self.outer, positional
- nested dataclass passed via self.outer, kwarg (stable_members=True)
- two-step @qd.func chain (outer_write -> inner_write) with self.state
- combined: nested dataclass threaded through a 2-step @qd.func chain
All pass. The outermost data_oriented call site uses the new instance-of-dataclass
branch (with recursion threading callee_param); inner qd.func -> qd.func calls
use the original typed-arg expansion path unchanged.
…achinery
When a @qd.data_oriented `self` is passed as a `qd.template()` kernel arg, `_predeclare_struct_ndarrays` walks the entire object graph and registers every reachable ndarray as a kernel parameter. For real-world classes (e.g. Genesis's RigidSolver) that's hundreds of ndarrays per kernel, even when the kernel only touches a few; every extra arg slows down each launch's launch-context population.
Hook into the same 2-pass compile machinery that prunes typed-dataclass arg flat-names:
- Pass 0 (non-enforcing): `_predeclare_struct_ndarrays` registers every reachable ndarray as today. `_promote_ndarray_if_declared` now records `id(ndarray)` in `pruning.used_struct_ndarray_ids` whenever an attribute chain like `self.x.y` resolves to one of these pre-declared ndarrays, both for direct accesses in the kernel body and for accesses inside inlined `@qd.func` bodies.
- Pass 1 (enforcing): `_predeclare_struct_ndarrays` only registers ndarrays whose id was observed in pass 0. Unused ndarrays are dropped from the kernel's parameter list and from `struct_ndarray_launch_info`, so neither compile nor each launch pays for them.
On a Genesis non-batched single-Franka CPU rigid step with `RigidSolver` migrated to `@qd.data_oriented(stable_members=True)`:
- step_1 ndarray-args: 326 -> 217 (-109)
- step_2 ndarray-args: 326 -> 145 (-181)
- steady-state step time: 493 us -> 403 us (FPS 2030 -> 2482)
Fastcache hit (pass-0 skipped) is gated via `pruning.pass_0_ran`: the set is unreliable in that case so we fall back to registering every reachable ndarray, matching historical behavior.
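The two-pass idea can be sketched as a toy model. Everything here is a stand-in: `Pruning`, `FakeNdarray`, `reachable_ndarrays`, `predeclare` and `compile_two_pass` are hypothetical simplifications that borrow only the field names `used_struct_ndarray_ids` and `pass_0_ran` from the description above, and the walk has no cycle detection.

```python
class Pruning:
    """Toy stand-in for the pruning state shared by both compile passes."""

    def __init__(self):
        self.enforcing = False
        self.pass_0_ran = False
        self.used_struct_ndarray_ids = set()


class FakeNdarray:
    """Placeholder for an ndarray; only object identity matters here."""


def reachable_ndarrays(obj, prefix=()):
    # Yield (attribute path, ndarray) for every ndarray reachable from obj.
    for name, value in vars(obj).items():
        if isinstance(value, FakeNdarray):
            yield prefix + (name,), value
        elif hasattr(value, "__dict__"):
            yield from reachable_ndarrays(value, prefix + (name,))


def predeclare(obj, pruning):
    """Return the attribute paths that would be registered as kernel parameters."""
    # Prune only when enforcing AND pass 0 actually ran; otherwise register all.
    prune = pruning.enforcing and pruning.pass_0_ran
    return [
        path
        for path, nd in reachable_ndarrays(obj)
        if not prune or id(nd) in pruning.used_struct_ndarray_ids
    ]


def compile_two_pass(obj, kernel_body_paths):
    pruning = Pruning()
    by_path = dict(reachable_ndarrays(obj))
    # Pass 0 (non-enforcing): register everything, record what the body touches.
    first = predeclare(obj, pruning)
    for path in kernel_body_paths:
        pruning.used_struct_ndarray_ids.add(id(by_path[path]))
    pruning.pass_0_ran = True
    # Pass 1 (enforcing): register only what pass 0 observed.
    pruning.enforcing = True
    second = predeclare(obj, pruning)
    return first, second
```

If `pass_0_ran` stays `False` (the fastcache-hit case), `predeclare` registers every reachable ndarray even in enforcing mode, which mirrors the fallback described in the commit message.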
… in data_oriented walk
Mitigation 1 (perf branch) stashes a per-instance BoundQuadrantsCallable in instance.__dict__ on first instance.method access so subsequent lookups skip __get__ allocation. The fastcache args-hasher's @qd.data_oriented walk iterates over obj.__dict__ and previously fell through to the [FASTCACHE][PARAM_INVALID] warning when it encountered that cached entry, disabling fastcache for the whole call (reproduced by test_fastcache_kernel_parameter).
These descriptor-cache entries are not data; skip them in the walk so the fastcache key only reflects real members.
Mitigation 5's first cut over-conservatively marked every ndarray reachable from a wholesale-passed dataclass: Option A in call_transformer expands func(self.dc) to per-leaf children func(self.dc.x, self.dc.y, ...), build_stmt runs on each, and _promote_ndarray_if_declared was marking the id as used regardless of whether the callee actually touches it. This left ~205 unused ndarray args still registered per step in the Genesis rigid_solver migration.
Two coordinated fixes:
1. Mirror build_Name's expanding_dataclass_call_parameters gate in _promote_ndarray_if_declared. The leaf accesses synthesized by Option A don't represent the kernel body actually touching the ndarray; only the callee body's own accesses (which run with the flag = False) should count.
2. Tag each pre-declared ndarray's AnyArray proxy with _qd_source_ndarray_id. After Option A's expansion, the callee's typed-arg flat-name locals are bound to already-promoted AnyArrays, so when the inlined callee body accesses them, the value reaching _promote_ndarray_if_declared isn't an Ndarray anymore. Tagging lets us mark via the AnyArray too.
On Genesis non-batched single-Franka CPU with rigid_solver migrated to @qd.data_oriented(stable_members=True):
- step_1 ndarray-args: 217 -> 120 (matches baseline exactly)
- step_2 ndarray-args: 145 -> 37 (matches baseline exactly)
- total ndarray-args/step: 644 -> 439 (matches baseline exactly)
- steady-state step time: 403 us -> 337 us (vs baseline 338-345 us)
The migration is now performance-neutral (was -33% FPS, then -22%, now ~0%).
1173 tests pass; the same 8 quadrants-main pre-existing failures remain (4x test_ad_global_data_access_rule_checker, etc.).
…-class
The args_hash data_oriented walker added in a0db648 ([Fix] args_hash invalidates when data_oriented ndarray member is reassigned) ran unconditionally for every arg of every kernel call. Even after 93893e5 cached the per-class attribute paths, the per-call ``is_data_oriented(arg)`` + ``type(arg).__dict__.get`` chain still cost ~15% FPS on small-step CPU benches (anymal_zero CPU bs=0: 7231 -> 5955 FPS = -17.6% vs the pre-branch reference).
Two coordinated optimisations:
1. Only iterate ``self.template_slot_locations`` instead of all args. Typed-dataclass args carry a specific dataclass type by construction and a data_oriented class is never a dataclass, so the only positions where a data_oriented container can appear are the ``qd.template()`` annotated ones, which the kernel decorator already tracks. Genesis main ``kernel_step_1`` has 4 template positions of 16 args; this reduces the per-call work proportionally.
2. Per-``type(arg)`` precomputed dispatch: ``_arg_nd_paths_or_none`` maps each seen type to either the cached path list to walk, or ``None`` (skip: covers primitive templates, non-data_oriented composites, ``_qd_stable_members`` data_oriented, and data_oriented with zero ndarrays). One ``dict.get`` per candidate per call after warmup, replacing the previous ``is_data_oriented`` + ``__dict__.get`` + ``_struct_nd_paths_for`` chain.
Measured on cluster ``rtx-mid`` single process, ``test_speed[anymal_zero-None-None-0-cpu]``, 3-run median, Genesis main + Quadrants branch:
- pre-fix tip (02e5660): 5955 FPS (-17.6% vs a22cc2d reference 7231)
- after this commit: 6935 FPS (-4.1% vs reference)
Recovery: +16.5% FPS on Genesis main; +11.2% on Genesis ``hp/data-oriented-rigid-solver`` (6315 -> 7020). Brings CPU bs=0 within ~3-4% of the pre-branch baseline.
Other Quadrants tests (test_data_oriented_ndarray, test_data_oriented_qd_func_dataclass, test_callable_template_mapper, test_kernel_templates, test_template_typing) still pass.
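A rough sketch of the two optimisations, under stated assumptions: `NdPathDispatch`, `fold_template_ids` and the `classify` callback are hypothetical names, and the real `_arg_nd_paths_or_none` may differ in shape. The point is that after warmup each template-position argument costs one dictionary lookup keyed on `type(arg)`.

```python
_UNSEEN = object()  # distinguishes "type never seen" from a cached None (= skip)


class NdPathDispatch:
    """Per-type dispatch: maps type(arg) to a path list to walk, or None to skip."""

    def __init__(self, classify):
        self._classify = classify  # hypothetical: arg -> list of paths, or None
        self._by_type = {}

    def paths_or_none(self, arg):
        cls = type(arg)
        paths = self._by_type.get(cls, _UNSEEN)
        if paths is _UNSEEN:
            # First time this type is seen: classify once and remember the result.
            paths = self._by_type[cls] = self._classify(arg)
        return paths


def fold_template_ids(args, template_slot_locations, dispatch):
    """Collect id() of every ndarray leaf reachable from template-position args."""
    ids = []
    for slot in template_slot_locations:  # only template slots, not all args
        arg = args[slot]
        paths = dispatch.paths_or_none(arg)
        if paths is None:
            continue
        for path in paths:
            leaf = arg
            for name in path:
                leaf = getattr(leaf, name)
            ids.append(id(leaf))
    return tuple(ids)
```

Reassigning an ndarray member changes the folded ids, so the args hash still invalidates; arguments at non-template positions are never inspected.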
…_oriented
Genesis unit tests on cluster hit RecursionError (118 instances across test_rigid_physics, test_fem, test_hybrid, test_render, ...). Two independent root causes, both in the recursive ndarray-graph walkers used to discover ndarray members of ``@qd.data_oriented`` / ``dataclass`` kernel args:
1. ``is_data_oriented(obj)`` did ``getattr(type(obj), "_data_oriented", False)``. For Genesis containers like ``RigidOptions`` (a ``pydantic.BaseModel`` subclass), the metaclass ``ModelMetaclass.__getattr__`` recurses infinitely on missing class attribute names, blowing the stack on every call. Fix: walk MRO and look up ``_data_oriented`` directly in each class's ``__dict__``. This never goes through ``getattr`` / ``__getattr__``, so it's immune to pathological metaclasses. ``@qd.data_oriented`` sets the flag directly on the decorated class so the MRO walk still finds it.
2. ``_build_struct_nd_paths`` (in ``_template_mapper_hotpath.py``) and ``_walk_obj`` (in ``function_def_transformer.py``) had no cycle detection. Genesis object graphs have cross-references (e.g. ``solver <-> scene <-> sim <-> solver``) so the walkers recurse forever on real workloads. Fix: track ``id(obj)`` in a per-traversal ``seen`` set and skip re-entering a node we've already expanded.
Adds ``test_is_data_oriented_safe_on_pydantic_like_metaclass``, ``test_data_oriented_with_pydantic_like_child``, and ``test_data_oriented_with_cyclic_attr_graph`` to pin both fixes.
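The cycle-detection fix (root cause 2) follows a standard pattern. The sketch below is illustrative only: `collect_leaf_paths` and the `is_leaf` predicate are hypothetical, and the real walkers also handle dataclasses and tensor wrappers.

```python
def collect_leaf_paths(root, is_leaf):
    """Depth-first walk over instance __dict__s that tolerates reference cycles."""
    seen = set()  # id() of every container already expanded in this traversal
    found = []

    def walk(obj, path):
        if is_leaf(obj):
            found.append(path)
            return
        members = getattr(obj, "__dict__", None)
        if not isinstance(members, dict):
            return  # primitives and other objects without an instance dict
        if id(obj) in seen:
            return  # already expanded: a cross-reference or a back-pointer
        seen.add(id(obj))
        for name, child in members.items():
            walk(child, path + (name,))

    walk(root, ())
    return found
```

The `seen` set is created per call, so it only suppresses re-entry within one traversal and holds no references between traversals.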
…ache-stale leaves
Two related robustness fixes surfaced by the Genesis ``hp/data-oriented-rigid-solver``
migration on cluster unit tests.
## Problem 1: ``_uid: UID`` disables fastcache on stable_members classes
After Genesis migrated ``kernel_step_1`` / ``kernel_step_2`` to methods on
``@qd.data_oriented(stable_members=True) class RigidSolver``, the
fastcache args-hasher walks ``RigidSolver.__dict__``, encounters
``_uid`` of type ``genesis.utils.uid.UID``, can't recognise it, and
disables fastcache for the whole call:
[FASTCACHE][PARAM_INVALID] Parameter with path ('0', '_uid') and type
<class 'genesis.utils.uid.UID'> not allowed by fast cache.
[FASTCACHE][INVALID_FUNC] The pure function step_1 could not be fast
cached, because one or more parameter types were invalid
Causes 5 ``test_quadrants.py`` failures (``test_num_envs``,
``test_ndarray_no_compile`` on both backends) that all assert fastcache
fires for ndarray-backend ``RigidSolver`` invocations.
``stable_members=True`` already promises the class's member set / types
don't change after construction. Under that contract, opaque metadata
(``UID``, etc.) is inert from fastcache's perspective: it doesn't affect
kernel codegen. Treat ``stable_members=True`` containers as tolerant —
skip unrecognised members silently and continue, instead of returning
None and killing fastcache.
Also silence the per-member ``[FASTCACHE][PARAM_INVALID]`` log inside a
stable_members walk via a depth counter, so the user doesn't see
warnings for members they explicitly opted out of caring about.
## Problem 2: cached ndarray-path leaves can be stale across instances
``_struct_nd_paths_cache`` is keyed on ``type(arg)`` and assumes the set
of ndarray-reachable attribute chains is stable across instances. That's
the common case but breaks on polymorphic Genesis solvers: ``FEMSolver``
/ ``MPMSolver`` / ``SPHSolver`` can hold a ``qd.Tensor`` whose underlying
impl swaps between an ``Ndarray`` and a ``MatrixField`` between
instances. ``_collect_struct_nd_descriptors`` then walks a cached path
to a ``MatrixField`` and crashes with::
AttributeError: 'MatrixField' object has no attribute 'element_type'
Fix: defensively check ``isinstance(v, Ndarray)`` after the
tensor-wrapper unwrap and skip stale entries silently. ``element_type``
/ ``shape`` / ``_qd_layout`` are Ndarray-only; non-Ndarray leaves can't
contribute a meaningful descriptor anyway, and the per-instance
``weakref(arg)`` part of the spec key still ensures cache discrimination.
Adds ``test_data_oriented_polymorphic_attr_across_instances`` to pin the
cache-stale-leaf behaviour.
…ail on Field/MatrixField
My previous commit ``5add57b6a`` was too loose: it silently skipped *any*
member that ``stringify_obj_type`` returned ``None`` for, including
``Field`` / ``MatrixField``. That broke ``test_quadrants.test_num_envs[False-*]``
(field backend), which pins the contract that fastcache must
fail when an arg's subtree contains a recognised-but-unsupported
tensor-like type (whose value affects kernel codegen).
Differentiate two reasons ``stringify_obj_type`` returns ``None``:
(a) RECOGNISED-BUT-UNSUPPORTED: ``ScalarField`` / ``MatrixField`` (and
any future type explicitly hitting ``_mark_warn_if_not_tensor_annotation``).
These now also call ``_mark_hit_recognised_unsupported()``
to flip a module-level flag. The flag bubbles up
naturally through nested dataclass / data_oriented walkers since
they propagate ``None``.
(b) TRULY-OPAQUE: unknown types falling through to the
``[FASTCACHE][PARAM_INVALID]`` branch (``RigidSolver._uid: UID``,
etc.). These don't set the flag.
The ``stable_members=True`` data_oriented walker snapshots the flag
around each child's recursive call. If a child returned ``None`` AND
the flag was set, fastcache fails (any tensor-like leaf in the subtree
invalidates the hash). If the flag was clear, the child is truly opaque
metadata — skip it silently under the user's stability contract.
``_hit_recognised_unsupported`` is reset at the top of ``hash_args`` and
before each child probe so the snapshot reflects only the just-completed
recursion.
`dataclasses.is_dataclass(obj)` calls `hasattr(type(obj), '__dataclass_fields__')`, which delegates to the metaclass `__getattr__` for missing names. Pydantic's `ModelMetaclass` (and our `RecursingMeta` regression fixture) recurse infinitely on arbitrary lookups and blow the stack, the same class of failure as the previously-fixed `is_data_oriented(obj)` path.
Add `is_dataclass_instance` in `lang/util.py` that walks `type(obj).__mro__` probing `klass.__dict__` directly (never via `getattr`), and use it everywhere the kernel pipeline tests user values for dataclass-ness:
- `_template_mapper_hotpath._build_struct_nd_paths`
- `function_def_transformer._walk_obj` (both branches)
- `function_def_transformer` dataclass-vs-`__dict__` walker dispatch
- `args_hasher.stringify_obj_type`
Annotations/types are untouched (`call_transformer`, `_signature`, `_kernel_impl_dataclass`): those check user-declared dataclass types, not runtime values that can carry pathological metaclasses.
Fixes `test_data_oriented_with_pydantic_like_child` (added in b3457a6 to pin this exact regression but caught only the `is_data_oriented` half of it).
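A minimal sketch of the MRO probe, assuming nothing about the real `lang/util.py` beyond what is described above. The `RecursingMeta` here is a hypothetical reconstruction of the regression fixture, not a copy of it.

```python
import dataclasses


class RecursingMeta(type):
    """Stand-in for a metaclass whose __getattr__ never terminates."""

    def __getattr__(cls, name):
        # Re-enters itself for any class attribute that is genuinely missing.
        return getattr(cls, name)


def is_dataclass_instance(obj):
    """True if obj is an instance (not a type) of a dataclass.

    Probes each class __dict__ along the MRO directly, so a metaclass
    __getattr__ is never invoked for the missing-attribute case.
    """
    if isinstance(obj, type):
        return False
    return any("__dataclass_fields__" in klass.__dict__ for klass in type(obj).__mro__)


class Pathological(metaclass=RecursingMeta):
    pass


@dataclasses.dataclass
class Point:
    x: int = 0


class SubPoint(Point):  # inherits dataclass-ness through the MRO
    pass
```

Because `__mro__` and `__dict__` are resolved on `type` itself, looking them up on `Pathological` does not reach `RecursingMeta.__getattr__`; only lookups of names that are missing everywhere would.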
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 55ecf95d55
-def data_oriented(cls):
+def data_oriented(cls=None, *, stable_members: bool = False):
Document the new data_oriented stable_members API
This changes the public API by allowing @qd.data_oriented(stable_members=True) and by documenting _qd_stable_members as an equivalent class-level knob in the runtime docstring, but the commit has no corresponding docs/ update. The repository's AGENTS.md requires user-facing docs to stay in sync with public API or usage changes; this is especially important here because the existing docs still describe ndarray-member reassignment for @qd.data_oriented containers as supported, while this option makes reassignment undefined behavior.
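For readers unfamiliar with the pattern, the new signature lets the decorator be used both bare and with keyword arguments. The sketch below shows only that dual-use mechanism; the body is an assumption (the flag names `_data_oriented` and `_qd_stable_members` appear elsewhere in this PR, but the real decorator does more than set them).

```python
def data_oriented(cls=None, *, stable_members: bool = False):
    """Usable as @data_oriented or as @data_oriented(stable_members=True)."""

    def decorate(klass):
        # Assumption: the flags are plain class attributes set on the class itself.
        klass._data_oriented = True
        if stable_members:
            klass._qd_stable_members = True
        return klass

    if cls is None:
        # Called with parentheses: return the actual decorator.
        return decorate
    # Called bare: cls is the class being decorated.
    return decorate(cls)


@data_oriented
class Plain:
    pass


@data_oriented(stable_members=True)
class Stable:
    pass
```

Making `stable_members` keyword-only keeps the bare form unambiguous: the only positional argument the function can ever receive is the class.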
Black/ruff reformatted multi-import statements onto multiple lines.
The previous design used a ``stable_members=True`` opt-in flag (or per-class ``_qd_stable_members`` attribute) to tell the fastcache hasher to silently skip opaque-typed members of ``@qd.data_oriented`` containers. Without the opt-in, any unrecognised member type disabled fastcache for the whole call, which made adding a UUID, Pydantic config object, or back-pointer to ``self`` silently kill fastcache.
That contract was brittle: adding any new metadata member to a long-lived ``@qd.data_oriented`` class could disable fastcache without warning, and the opt-in was an "I promise the layout doesn't change" contract that has nothing to do with the actual fastcache invariant.
The actual invariant: opaque Python types cannot affect kernel codegen because the kernel cannot read them. Only recognised types (ndarrays, primitives, enums, dataclasses, nested ``@qd.data_oriented`` objects) can be read by kernel code. So *all* container walkers (``dataclass_to_repr`` and the ``data_oriented`` branch in ``stringify_obj_type``) can safely skip opaque members from the hash, no opt-in needed.
Recognised-but-unsupported types (``qd.field`` / ``qd.Matrix.field``) are distinct: their shape/dtype affect kernel codegen but fastcache doesn't yet know how to hash them. These still disable fastcache for the whole call; behaviour is unchanged.
Top-level kernel-arg opaqueness is also distinct: an opaque top-level arg is a user error (the kernel's argument is uninterpretable to fastcache) and still emits the ``[FASTCACHE][PARAM_INVALID]`` warning.
Implementation: ``stringify_obj_type`` now takes a ``nested: bool`` parameter that suppresses the warning for nested opaque types. Container walkers pass ``nested=True``. Removed the global ``_skip_unknown_warn_depth`` counter and ``_hit_recognised_unsupported`` flag, replaced with a clean ``_FAIL_FASTCACHE`` sentinel distinct from ``None`` (opaque/silent-skip).
The ``@qd.data_oriented(stable_members=True)`` flag and ``_qd_stable_members`` attribute remain: they still gate the launch-context per-call walker optimization in ``_template_mapper`` and ``Kernel.launch_kernel``. Removed from the fastcache hasher's logic only.
Added 3 regression tests pinning the new defaults:
- data_oriented with opaque member: silently hashable.
- data_oriented with nested field: still FastcacheSkip.
- dataclass with opaque field: silently hashable.
All 130 fastcache + data_oriented tests pass on x64.
…le_members scope
Documents the new behaviour committed in 49ffb3b:
- ``compound_types.md`` ``### Fastcache``: explain the three-bucket type-based classification (recognised+valid / opaque / recognised+unsupported) that applies to every ``@qd.data_oriented`` argument by default. Add a separate ``### stable_members=True`` subsection clarifying that the flag is a per-call launch performance hint (template-mapper + launch-context cache), not a fastcache contract.
- ``fastcache.md`` compound-type rules: add the opaque-skip bullet and the opaque-vs-recognised-unsupported note.
- ``kernel_impl.data_oriented`` docstring: narrow ``stable_members`` to its actual scope (per-call walker skip) and explicitly note that fastcache silences opaque members regardless of the flag.
…ify stable_members scope"
This reverts commit 7757907.
This reverts commit 49ffb3b.
…llback
Unrecognised types in fastcache argument hashing previously had two failure modes, both bad:
- Top-level: ``[FASTCACHE][PARAM_INVALID]`` warn + return None, disabling fastcache for the whole call.
Any solver-like object carrying a single opaque metadata field (Genesis ``UID``, Pydantic config,
back-pointer) silently killed the cache.
- Nested under ``@qd.data_oriented(stable_members=True)``: silent skip. Worked for the Genesis case but is
dangerous: if someone later adds a new tensor-like type (e.g. ``BFloat16Tensor``) whose value affects
kernel codegen but forgets to register it in args_hasher's recognised set, the silent skip serves stale
cache results without any indication.
Both paths are replaced with a single ``type(v).__qualname__``-based fallback (``opaque-<module>.<qualname>``)
that emits a one-shot ``[FASTCACHE][UNKNOWN_TYPE]`` warning per type. Properties:
- Cache key stable across instances of the same opaque class (Genesis UID #1 and UID #2 produce the same
key). Kernels cannot read non-recognised Python types so opaque metadata cannot affect codegen, making
type-identity-only hashing correct for genuinely opaque members.
- Loud diagnostic for the dangerous case: any unrecognised type that ever gets hashed prints a warning
pointing at args_hasher.stringify_obj_type so a missed tensor-like registration is impossible to miss.
- ``ScalarField`` / ``MatrixField`` (recognised-but-unsupported tensor-like) still disable fastcache via
a new ``_FAIL_FASTCACHE`` sentinel — their shape/dtype affect codegen but fastcache doesn't yet handle
them. Distinct from the qualname fallback so the field path remains correct.
Also adds ``pruning_paths`` and ``parent_flat`` plumbing through ``stringify_obj_type`` / ``dataclass_to_repr`` /
``hash_args`` for the upcoming pruning-driven narrow walk (L1 cache lookup of kernel-accessed flat names);
the new parameters default to None so this commit alone is the qualname-fallback baseline.
``test_src_ll_cache_arg_warnings`` updated to assert the new ``[UNKNOWN_TYPE]`` warning (instead of the old
``[PARAM_INVALID]`` + ``[INVALID_FUNC]`` dead-end).
The ``_qd_stable_members`` flag is no longer read by args_hasher; its launch-context role
(``_mutable_nd_cached_val`` short-circuit) is unchanged in this commit and will be addressed separately.
Replaces the pre-refactor single-level cache (one key derived from source + config + a *wide* args walk)
with a two-level pruning-driven scheme:
- L1 key (``src_hasher.make_source_config_key``): source + config + version, no args dependence. Stores
``PruningInfo`` — the set of kernel-accessed flat names produced during compile (``Pruning``'s
``used_vars_by_func_id[KERNEL_FUNC_ID]``, folded with data_oriented ndarray attribute chains from
``struct_ndarray_launch_info``). Also persists ``graph_do_while_arg`` (source-deterministic).
- L2 key (``src_hasher.make_full_cache_key``): L1 key + ``narrow_args_hash``. The narrow hash walks only
paths in the L1 pruning set, so unrelated metadata changes on the same kernel-accessed surface no
longer invalidate the cache.
Lookup flow (warm call): L1 lookup → narrow args walk using L1 pruning info → L2 lookup → load artifact.
Cold compile: L1 miss → full compile (pass 0 + pass 1) → store L1 → compute narrow args hash → store L2.
Crucially, "L1 hit but L2 miss" still triggers full pass 0+1 (not just pass 1): pass 0 is what populates
per-callee-func pruning info, and L1 only stores the kernel-level set, so skipping pass 0 is only safe when
the C++ artifact is already loaded (``only_parse_function_def=True``).
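The lookup flow can be modelled with two dictionaries. This is a toy sketch only: `TwoLevelCache`, `narrow_args_hash`, `_digest` and the `compile_fn` callback are hypothetical, arguments are represented as a flat-name dictionary, and the "narrow hash" here uses only the Python type name of each kernel-accessed value as a stand-in for dtype/ndim.

```python
import hashlib


def _digest(*parts):
    h = hashlib.sha256()
    for part in parts:
        h.update(repr(part).encode())
        h.update(b"\0")
    return h.hexdigest()


def narrow_args_hash(args, used):
    # Walk only the flat names the kernel actually accessed.
    return _digest(sorted((name, type(args[name]).__name__) for name in used))


class TwoLevelCache:
    def __init__(self):
        self.l1 = {}  # L1 key (source + config) -> set of kernel-accessed flat names
        self.l2 = {}  # L2 key (L1 key + narrow args hash) -> compiled artifact
        self.compiles = 0

    def get(self, source, config, args, compile_fn):
        l1_key = _digest(source, config)
        used = self.l1.get(l1_key)
        if used is not None:
            l2_key = _digest(l1_key, narrow_args_hash(args, used))
            if l2_key in self.l2:
                return self.l2[l2_key]  # warm path: L1 hit, then L2 hit
        # L1 miss, or L1 hit with L2 miss: run the full compile in both cases.
        artifact, used = compile_fn(source, args)
        self.compiles += 1
        self.l1[l1_key] = frozenset(used)
        self.l2[_digest(l1_key, narrow_args_hash(args, used))] = artifact
        return artifact
```

In this model, changing a value the kernel never reads leaves the L2 key untouched, while changing the type of a kernel-read value produces an L1 hit followed by an L2 miss and a fresh compile.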
Pruning narrowing rules in ``args_hasher.stringify_obj_type``:
- Dataclass children: flat-name pruning is *complete* (every dataclass field is flattened by
``FlattenAttributeNameTransformer``), so narrow walking by ``child_flat in pruning_paths`` is safe.
- Data_oriented children: pruning is only complete for ndarray members (via ``struct_ndarray_launch_info``).
Primitive members (template-position values baked into the kernel) are NOT tracked by pruning. To stay
correct, the data_oriented branch only narrows *ndarray* children; non-ndarray children are always
walked (the recursive call still narrows nested dataclasses). This is why
``test_template_raise_on_data_oriented_floats`` and the dtype-distinct cache-key test both still pass:
primitives keep contributing to the hash, only kernel-unused ndarrays get pruned.
Behavior change: ``test_src_ll_cache_arg_warnings`` and ``test_fastcache_field_warnings_warn_struct_template_field``
updated to reflect that fastcache no longer fires ``[PARAM_INVALID]`` or ``[INVALID_FUNC]`` for unrecognised
types at the *top level* (qualname fallback from previous commit handles them) or for Field-bearing
*unused* dataclass members (narrowing skips them). Tests now exercise the genuinely-live cases.
``test_src_hasher_*`` updated to use the new ``make_source_config_key`` / ``make_full_cache_key`` API.
stable_members is no longer read by the args hasher (handled by previous commit); its launch-context role
in ``Kernel.launch_kernel`` still uses the legacy flag and will be addressed in a follow-up.
…erf-only
After the previous two commits, fastcache is no longer brittle wrt opaque members: the cache key is derived from kernel pruning info, and unrecognised types at kernel-read paths fall back to a deterministic type(v).__qualname__ hash with a one-shot [UNKNOWN_TYPE] warning.
This commit aligns the user-visible docs (fastcache.md, compound_types.md) and the data_oriented(stable_members=...) docstring with that semantic. stable_members is documented as *purely* a launch-time perf hint with no fastcache role; the launch-context comments in kernel.py and _template_mapper.py are updated to call this out explicitly.
Also fixes a pylint no-else-return warning introduced by the refactor.
…name fallback
Three rules now strictly enforced by the args hasher:
1. The cache key may only include contributions from kernel-pruned paths. Never a qualname-based hash for unrecognised types: that captures type identity without type parameters (dtype/shape) and would silently mask value-affecting changes.
2. Unrecognised types at kernel-read paths must not be silently dropped. Fastcache is disabled loudly with a one-shot [UNKNOWN_TYPE] warning plus [INVALID_FUNC] log line.
3. Fastcache works for data_oriented containers: pruning info now covers every attribute chain rooted at a kernel arg, not just ndarrays.
Compiler-side: ASTTransformer.build_Name annotates non-flattened kernel-arg Names with ``_qd_arg_chain``; build_Attribute propagates the annotation through ``self.dofs.x`` chains and records them via the new ``Pruning.mark_kernel_arg_chain_used`` (separate set so they don't poison ``struct_locals`` and break codegen). ``Pruning.record_after_call`` was extended to propagate chain-path entries across @qd.func calls including Attribute args (``f(self.dofs)``). After both compile passes, ``Kernel._fold_kernel_arg_chain_paths_into_pruning`` merges the kernel's chain-paths into ``used_vars_by_func_id[KERNEL_FUNC_ID]`` (same set as ``used_py_dataclass_parameters_by_key_enforcing[key]`` by reference) so the fastcache args-hash narrow walk picks them up.
Args-hasher side: removed the data_oriented ndarray-only carveout; ``_is_path_used(pruning_paths, child_flat)`` now applies to every member. Removed ``_qualname_fallback``; replaced with ``_fail_unknown_type`` which returns _FAIL_FASTCACHE and emits the [UNKNOWN_TYPE] warning.
Tests + docs updated to match. Full x64 suite: 4063 passed.
Five new tests in test_data_oriented_ndarray.py covering the three rules the args hasher now enforces (see fastcache.md "Pruning-driven argument hashing"):
- test_data_oriented_kernel_unused_opaque_member_does_not_affect_cache: rule 1. Two State instances differing only in a uuid member that the kernel never reads share the same compiled artifact across processes.
- test_data_oriented_kernel_read_opaque_member_fails_fastcache: rule 2. When the kernel actually reads an unrecognised-type member, fastcache fails loudly with [UNKNOWN_TYPE]+[INVALID_FUNC]. Kernel still runs.
- test_data_oriented_kernel_read_primitive_distinguishes_cache_key: rule 3. A kernel reading a primitive member (s.n baked in) cold-compiles per value and both values load distinct artifacts on warm start.
- test_data_oriented_kernel_unread_primitive_does_not_affect_cache: rule 1 mirror for primitives. unused_n differences don't perturb the cache key.
- test_data_oriented_qd_func_chain_propagation_distinguishes_cache_key: Pruning.record_after_call propagation through @qd.func(s.dofs). The inner dofs.x dtype must reach the kernel's pruning set so changes invalidate the cache.
…e-mode note
The earlier docstring mentioned a qualname fallback for unrecognised types, which was true at the time but was subsequently removed in the strict-rules refactor. Update the note to match the actual current behaviour: unrecognised types at kernel-read paths fail fastcache loudly with [UNKNOWN_TYPE] + [INVALID_FUNC].
The TemplateMapper's args_hash walk used a per-class cache of attribute paths
populated from the first instance ever seen of each class. That cache was wrong
for @qd.data_oriented classes whose attribute structure varies across instances
(motivating case: Genesis ``DataManager``, which only allocates
``*_adjoint_cache`` members when ``requires_grad=True``).
Two failure modes existed:
- Forward direction (first instance has the attr, second misses it): the walk
crashed with ``AttributeError: 'DataManager' object has no attribute
'dofs_state_adjoint_cache'`` when launching kernels on the second instance.
Observed on Genesis ``test_rigid_mpm_legacy_coupling`` (macos-15 GPU job in
PR genesis-world#2799).
- Inverse direction (first instance lacks the attr, second has it): silently
miscached — the new ndarray's id never made it into args_hash, so a later
reassignment of that attribute wouldn't trigger spec re-derivation.
Fix: stash the walked path list on the *instance* (``arg._qd_nd_paths``) via
``object.__setattr__`` (compatible with frozen dataclasses, mirroring the
existing ``_qd_dc_repr`` pattern in ``args_hasher.dataclass_to_repr``). Each
instance is walked once on first kernel call; subsequent calls fetch the cached
list via instance ``__dict__`` lookup (~30 ns, same order as the previous
class-level ``dict.get``).
Steady-state perf: unchanged on franka cpu single env (one solver instance,
walked once at scene build, fetched per-call thereafter). Startup pays one
walk per instance lifetime — ~10us per scene build for Genesis-shaped
workloads. ``__slots__`` classes that can't accept the instance stash fall back
to per-class caching and retain the legacy polymorphic-instance limitation;
Genesis data_oriented containers don't use ``__slots__``.
``_classify_for_args_hash`` is split into a per-class disposition (``_SKIP`` /
``_PER_INSTANCE``) plus a per-instance ``_struct_nd_paths_for`` call. The
``_qd_stable_members`` flag still short-circuits the entire walk for users who
opt into the "no ndarray reassignment, ever" promise.
Test ``test_data_oriented_polymorphic_attribute_set_across_instances`` covers
both forward and inverse directions on a ``DataManager``-shaped class.
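The per-instance stash can be sketched as follows. `struct_nd_paths_for` takes a hypothetical `build_paths` callback in place of the real walker, and the `__slots__` branch simply recomputes instead of falling back to a per-class cache as the commit describes.

```python
def struct_nd_paths_for(arg, build_paths):
    """Return the path list for this instance, walking it at most once."""
    try:
        # Fast path: plain instance-dict lookup, no descriptor machinery.
        return arg.__dict__["_qd_nd_paths"]
    except (AttributeError, KeyError):
        pass  # no instance dict (e.g. __slots__) or not stashed yet
    paths = build_paths(arg)
    try:
        # object.__setattr__ bypasses a frozen dataclass's __setattr__ guard.
        object.__setattr__(arg, "_qd_nd_paths", paths)
    except AttributeError:
        pass  # cannot stash on this instance; this sketch just recomputes next time
    return paths
```

Because the list lives on the instance, two instances of the same class with different attribute sets each get their own path list, which covers both the forward and the inverse direction described above.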
- ``test_data_oriented_polymorphic_attribute_set_across_instances``: the
inverse-direction case now uses a kernel that *reads* ``s.extra`` (the
conditional attribute) — without the per-instance walk this would
silently miss ``('extra',)`` from the kernel-used path list. Adds a
reassignment step that verifies same-shape ndarray swaps go through the
per-call ``id(v)`` folding cleanly.
- ``test_src_ll_cache_hit_predeclare_struct_ndarrays_pruned``: pins
``710ee4705``. A data_oriented arg with three ndarrays (``a``/``b``/``c``)
but a kernel that only writes ``b``. Cold compile populates the fastcache
with the flat-name pruning set; ``qd.reset()`` + ``qd.init()`` reloads it;
cache-hit branch in ``_predeclare_struct_ndarrays`` must reproduce the
same single-ndarray registration set, otherwise insertion-order
registration would scramble slots and the write would land in
``state.a`` instead of ``state.b``.
…hash
Pins the L2 collision between needs_grad=False (cold) and needs_grad=True (hot) scenes that differ only on the .grad-present flag. ``args_hasher.stringify_obj_type`` stringifies ndarray leaves by (dtype, ndim) only, so the narrow args_hash is the same and the second scene loads the without-grad artifact: the kernel's compiled parameter slot has needs_grad=False baked in but the launch routes the with-grad ndarray through the _QD_ARRAY_WITH_GRAD bucket, mis-aligning the parameter struct (silent wrong results or runtime OOB depending on slot layout).
Test FAILS on this commit (asserts cache_loaded is False after the with-grad launch; observed True with the unfixed args_hasher). Fix to follow in next commit.
``ScalarNdarray``/``VectorNdarray``/``MatrixNdarray`` instances now stringify with an extra ``-g`` tag when their grad buffer is present. needs_grad is part of the compiled parameter-struct layout (``insert_ndarray_param`` bakes the grad pointer into the slot iff needs_grad=True), and the launch path picks between ``_QD_ARRAY`` and ``_QD_ARRAY_WITH_GRAD`` buckets off ``v.grad is not None``. Two scenes that differ only by needs_grad therefore MUST hash distinctly, otherwise L2 returns an artifact whose slots are mismatched at launch (silent miscomputation or runtime OOB depending on slot offset alignment).

This is the root cause of the Genesis ``test_diff_*`` autodiff failures: the non-grad ``kernel_init_link_fields`` artifact landed in L2 first; the ``requires_grad=True`` run loaded that artifact and routed ``links_state.quat`` through ``_QD_ARRAY_WITH_GRAD`` against a slot declared without grad, producing the "Out of bound access to ndarray at arg 44 with indices [0,0,0]" assertion.

Reproducer test was added in the previous commit; it now passes on x64, vulkan and cuda. Full fast_caching + test_data_oriented_ndarray + test_ad_dataclass suite: 257 passed, 6 skipped.
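
A minimal sketch of the fix's idea, under stated assumptions: `FakeNdarray` stands in for the real ndarray classes, and the `nd-<dtype>-<ndim>` key format is invented for illustration. Only the ``-g`` suffix and the ``v.grad is not None`` bucket choice come from the commit message.

```python
# Hypothetical sketch: fold grad presence into the ndarray type key so that
# it agrees with the launch-time bucket choice.
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeNdarray:
    """Stand-in for ScalarNdarray / VectorNdarray / MatrixNdarray."""

    dtype: str
    ndim: int
    grad: Optional["FakeNdarray"] = None


def stringify_ndarray(v: FakeNdarray) -> str:
    """Key on (dtype, ndim) plus a -g tag when a grad buffer is attached."""
    key = f"nd-{v.dtype}-{v.ndim}"  # key format is an assumption
    if v.grad is not None:
        key += "-g"
    return key


def launch_bucket(v: FakeNdarray) -> str:
    """Mirror of the launch-side choice described in the commit message."""
    return "_QD_ARRAY_WITH_GRAD" if v.grad is not None else "_QD_ARRAY"


plain = FakeNdarray("f32", 3)
with_grad = FakeNdarray("f32", 3, grad=FakeNdarray("f32", 3))
```

Because the key and the bucket now branch on the same condition, `plain` and `with_grad` can no longer share a cached artifact.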
Summary
Stacks onto #704. Follow-up fixes uncovered while migrating Genesis `RigidSolver` to `@qd.data_oriented` and benchmarking against real workloads.

Core refactor (latest)
Fastcache opaque-member silencing is now the default, not an opt-in via `stable_members=True`.

The previous design used `@qd.data_oriented(stable_members=True)` (or `_qd_stable_members = True`) to tell the fastcache hasher to silently skip opaque-typed members (UID identifiers, Pydantic `BaseModel`, back-pointers up the object graph). Without the opt-in, any unrecognised member type disabled fastcache for the whole call: adding a UUID to `self` silently killed fastcache.

That contract was brittle. The actual invariant: opaque Python types cannot affect kernel codegen because the kernel cannot read non-recognised Python types. Only ndarrays, primitives, enums, dataclasses, and nested `@qd.data_oriented` objects are readable by kernel code. So all container walkers (`dataclass_to_repr` and the `data_oriented` branch in `stringify_obj_type`) can safely skip opaque members from the hash, no opt-in needed.

Recognised-but-unsupported types (`qd.field` / `qd.Matrix.field`) are distinct: their shape/dtype affect kernel codegen but fastcache doesn't yet know how to hash them. These still disable fastcache for the whole call (unchanged).

Top-level opaque args still emit `[FASTCACHE][PARAM_INVALID]` (unchanged).

Implementation: `stringify_obj_type` now takes `nested: bool` that suppresses the warning for nested opaque types; container walkers pass `nested=True`. Removed global `_skip_unknown_warn_depth` counter and `_hit_recognised_unsupported` flag, replaced with a clean `_FAIL_FASTCACHE` sentinel distinct from `None` (opaque/silent-skip).

The `@qd.data_oriented(stable_members=True)` flag and `_qd_stable_members` attribute remain but their scope is narrowed: they only gate the launch-time perf optimization in `_template_mapper` and `Kernel.launch_kernel`. They no longer affect fastcache.

Other fixes
- `_build_struct_nd_paths` and `_walk_obj` (`seen: set[id]`); `is_data_oriented` walks `type(obj).__mro__` via `__dict__` so Pydantic's `ModelMetaclass.__getattr__` doesn't blow the stack; mirror MRO-safe walk in `is_dataclass_instance` and use it everywhere the kernel pipeline tests user values for dataclass-ness.
- `QuadrantsCallable` / `BoundQuadrantsCallable` entries cached on `instance.__dict__` so they don't poison hash keys.
- `@qd.data_oriented` classes can call `@qd.func`s passing dataclass-instance args.

Perf
- `TemplateMapper.lookup`: only walk template-slot args, with per-class cached classification, recovering the ~15% CPU bs=0 regression in `bench_cluster_wandb.py`.
- … `pruning.used_struct_ndarray_ids`, `pass_0_ran`) to prune unused `@qd.data_oriented` ndarrays.

Tests
- `@qd.func` calls from data_oriented `self`.
- … `test_args_hasher_data_oriented_with_opaque_member_silently_skipped`, `test_args_hasher_data_oriented_nested_field_still_fails`, `test_args_hasher_dataclass_with_opaque_field_silently_skipped`).

Docs
- `compound_types.md`: new `### Fastcache` section with the three-bucket type classification (recognised+valid / opaque / recognised+unsupported); `### stable_members=True` subsection narrowed to launch-perf scope.
- `fastcache.md`: compound-type-cache-keying rules updated with the opaque-skip bullet and the opaque-vs-recognised-unsupported note.

Test plan
- `tests/python` (`fastcache or data_oriented or py_dataclass or pure`) passes locally on x64 (135 passed, 1 skip, 3 xfail).
- `test_args_hasher.py` passes locally on x64 (30 passed, including 3 new).
- `pre-commit run -a` clean.
- `pyright` clean for changed files.

Made with Cursor