[MISC] Migrate rigid_solver.py to data_oriented. #2799

Open
hughperkins wants to merge 15 commits into Genesis-Embodied-AI:main from hughperkins:hp/data-oriented-rigid-solver

@hughperkins
Collaborator

Description

Related Issue

Resolves Genesis-Embodied-AI/Genesis#

Motivation and Context

How Has This Been / Can This Be Tested?

Screenshots (if appropriate):

Checklist:

  • I read the CONTRIBUTING document.
  • I followed the Submitting Code Changes section of CONTRIBUTING document.
  • I tagged the title correctly (including BUG FIX/FEATURE/MISC/BREAKING)
  • I updated the documentation accordingly or no change is needed.
  • I tested my changes and added instructions on how to test it for reviewers.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

…stable_members=True)

Checkpoint of the migration attempt: moves kernel_step_1, kernel_step_2, and
update_geom_aabbs into RigidSolver as @qd.kernel methods, decorates the class
with @qd.data_oriented(stable_members=True), exposes _contact_island_state and
_collider_state as direct attributes, and updates the substep() / detection()
call sites.

This was originally failing against quadrants hp/data-oriented-ndarray-fix with:
  Missing argument __qd_links_state__qd_cinr_inertial.
  Unexpected argument links_state.

The fix lives on quadrants branch hp/data-oriented-qd-func-dataclass (Option A:
extend caller-side dataclass-arg expansion to handle data_oriented self.X
attribute access).
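To make the two error lines easier to read, here is a plain-Python emulation of what caller-side dataclass-argument expansion could look like. This is only a sketch and not Quadrants code: the `__qd_<arg>__qd_<field>` naming is inferred from the error message above, and `LinksState`, `flatten_kwargs`, and `check_call` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class LinksState:
    # Hypothetical stand-in for the real links_state struct; only the field
    # named in the error message is modelled here.
    cinr_inertial: float


def flatten_kwargs(kwargs):
    """Expand each dataclass argument into one flat entry per field.

    The flattened naming scheme is an assumption inferred from the error text.
    """
    flat = {}
    for name, value in kwargs.items():
        if is_dataclass(value):
            for f in fields(value):
                flat[f"__qd_{name}__qd_{f.name}"] = getattr(value, f.name)
        else:
            flat[name] = value
    return flat


def check_call(expected_names, passed):
    """Return (missing, unexpected) argument names, each sorted."""
    missing = sorted(set(expected_names) - set(passed))
    unexpected = sorted(set(passed) - set(expected_names))
    return missing, unexpected


state = LinksState(cinr_inertial=1.0)
# The callee's signature is assumed to be declared in flattened form.
expected = list(flatten_kwargs({"links_state": state}))

# Caller skips the expansion: same shape of failure as in the commit message.
missing, unexpected = check_call(expected, {"links_state": state})

# Caller performs the expansion: names line up and nothing is reported.
ok_missing, ok_unexpected = check_call(expected, flatten_kwargs({"links_state": state}))
```

Under these assumptions the unexpanded call reports the flattened name as missing and `links_state` as unexpected, which matches the pair of messages quoted above.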
@hughperkins
Collaborator Author

Note: this depends on a new Quadrants release built from a new Quadrants branch.

hugh and others added 6 commits May 17, 2026 21:04
The forward path was migrated to @qd.kernel methods on the data_oriented
RigidSolver (self.step_1 / self.step_2). The only remaining caller of
the top-level kernel_step_2 was the backward-grad call site; replace it
with self.step_2.grad(is_backward=True) (the BoundQuadrantsCallable.grad
descriptor uses the kernel's adjoint correctly) and delete both old
top-level kernel definitions.
…ep_1

- test_grad.py::test_diff_solver now calls rigid_solver.step_1(...) (method
  form) instead of the deleted top-level kernel_step_1.
- test_quadrants.py: access RigidSolver.step_1._primal (via the class-level
  QuadrantsCallable descriptor) for the cache observation checks. The
  attribute chain works because QuadrantsCallable directly exposes _primal.
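For readers unfamiliar with the descriptor pattern this relies on, here is a minimal plain-Python sketch. It is not the Quadrants implementation: `SketchCallable`, `BoundSketchCallable`, `_adjoint`, and the toy `Solver` are hypothetical names, and only `_primal` and `.grad` are taken from the text above. The idea is that class-level access returns the descriptor itself (so `_primal` is reachable), while instance access returns a bound object that exposes `grad`.

```python
class BoundSketchCallable:
    """What instance access returns: the callable bound to one object."""

    def __init__(self, owner_callable, instance):
        self._owner = owner_callable
        self._instance = instance

    def __call__(self, *args):
        # Forward pass: delegate to the primal function with the bound instance.
        return self._owner._primal(self._instance, *args)

    def grad(self, *args):
        # Backward pass: delegate to the adjoint function with the bound instance.
        return self._owner._adjoint(self._instance, *args)


class SketchCallable:
    """Class-level descriptor holding a forward and a backward function."""

    def __init__(self, primal, adjoint):
        self._primal = primal
        self._adjoint = adjoint

    def __get__(self, instance, owner=None):
        if instance is None:
            # Class access (Solver.step): hand back the descriptor itself.
            return self
        # Instance access (solver.step): hand back a bound callable.
        return BoundSketchCallable(self, instance)


class Solver:
    def __init__(self, scale):
        self.scale = scale

    step = SketchCallable(
        primal=lambda self, x: self.scale * x,      # forward: y = scale * x
        adjoint=lambda self, dy: self.scale * dy,   # backward: dx = scale * dy
    )


solver = Solver(scale=3.0)
forward = solver.step(2.0)          # method-form call on the instance
backward = solver.step.grad(1.0)    # adjoint reached through the bound object
primal_fn = Solver.step._primal     # class-level access for inspection
```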

Smoke-tested both forward and backward paths (scene.step() + rigid_solver.step_2.grad).

``qd.simt.subgroup`` recently moved to the suffix convention where the
sized form is ``<op>_tiled(v, log2_size)`` and the no-suffix form
``<op>(v)`` operates over the full subgroup
(quadrants commit d07644e4, "rename to _tiled suffix convention").
``solver._kernel_mass_factor_solve_chol_subgroup_16`` was still calling
the old ``reduce_all_add(dot, 4)`` two-arg form, which no longer exists
on the new API (the 1-arg form ignores tile width and reduces over the
full subgroup, which would give wrong results for the 16-thread
parallel dot the kernel implements).  Switch the two call sites to
``reduce_all_add_tiled(dot, 4)`` to preserve the intended 2**4 = 16
lane tile.
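The difference between the two forms can be emulated on a plain list of lane values. This is a sketch, not the `qd.simt.subgroup` API: the two helper functions are hypothetical, the 32-lane subgroup width is an assumption chosen for illustration, and "every lane receives the sum" is my reading of the `reduce_all_` prefix.

```python
def reduce_all_add_full(lanes):
    """Emulate the no-suffix form: every lane gets the sum over the whole subgroup."""
    total = sum(lanes)
    return [total] * len(lanes)


def reduce_all_add_tiled_sketch(lanes, log2_size):
    """Emulate the tiled form: every lane gets the sum over its own 2**log2_size tile."""
    size = 1 << log2_size
    out = []
    for start in range(0, len(lanes), size):
        tile = lanes[start:start + size]
        out.extend([sum(tile)] * len(tile))
    return out


# Assumed 32-lane subgroup holding two independent 16-lane dot products:
# lanes 0..15 contribute 1.0 each, lanes 16..31 contribute 2.0 each.
lanes = [1.0] * 16 + [2.0] * 16

full = reduce_all_add_full(lanes)              # mixes both dot products together
tiled = reduce_all_add_tiled_sketch(lanes, 4)  # keeps each 16-lane tile separate
```

In this emulation the full-subgroup form hands 48.0 to every lane, while the tiled form gives 16.0 to the first tile and 32.0 to the second, which is the per-tile result a 16-thread parallel dot needs.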
…module-level globals

Two qd.static expressions captured module-level mutable globals
(`gs.backend`, `gs.qd_float`) whose values can vary across process runs
but aren't part of the kernel's fastcache key — risking cross-process
cache collisions where a kernel compiled for one backend / precision
gets loaded for another.

Fix by routing both through declared `static_rigid_sim_config` fields:

* `constraint/solver.py::add_frictionloss_constraints`:
  `gs.backend != gs.metal` -> `static_rigid_sim_config.backend != gs.metal`

* `constraint/solver.py::func_cholesky_solve_tiled`:
  `gs.qd_float == qd.f32` -> `static_rigid_sim_config.is_qd_float_f32`

* `array_class.py::RigidSimStaticConfig`: new `is_qd_float_f32: bool`
  field, populated by `RigidSolver._build_static_config` and
  `KinematicSolver._build_static_config`.

All `qd.static(...)` expressions in fastcache-enabled kernels must derive
from declared kernel parameters (now documented in the fastcache user
guide).
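The collision risk can be reproduced with a toy persistent cache. This is a sketch under stated assumptions and not how fastcache is implemented: `Globals`, `compile_unsafe`, `compile_safe`, and the string "artifacts" are all hypothetical, and a shared dict stands in for an on-disk cache that survives across process runs.

```python
class Globals:
    # Hypothetical stand-in for a module-level mutable global such as gs.backend.
    backend = "cuda"


def _key(name, params):
    # The cache key is built only from declared kernel parameters.
    return (name, tuple(sorted(params.items())))


def compile_unsafe(cache, name, params):
    """Reads the backend from a global that is NOT part of the cache key."""
    key = _key(name, params)
    if key not in cache:
        cache[key] = f"{name} compiled for {Globals.backend}"
    return cache[key]


def compile_safe(cache, name, params):
    """Reads the backend from a declared parameter, so it is part of the key."""
    key = _key(name, params)
    if key not in cache:
        cache[key] = f"{name} compiled for {params['backend']}"
    return cache[key]


# Unsafe: the second "run" switches backend but hits the first run's entry.
unsafe_cache = {}
Globals.backend = "cuda"
unsafe_first = compile_unsafe(unsafe_cache, "solve", {"n": 4})
Globals.backend = "metal"
unsafe_second = compile_unsafe(unsafe_cache, "solve", {"n": 4})

# Safe: the backend is a declared parameter, so the two runs get distinct keys.
safe_cache = {}
safe_first = compile_safe(safe_cache, "solve", {"n": 4, "backend": "cuda"})
safe_second = compile_safe(safe_cache, "solve", {"n": 4, "backend": "metal"})
```

Here `unsafe_second` is still the artifact compiled for "cuda", whereas the safe variant ends up with one cache entry per backend.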

Co-authored-by: Cursor <cursoragent@cursor.com>
@hughperkins
Collaborator Author

Benchmark results:

20260518-1644-rigid_body_do_fix_prune

@hughperkins
Collaborator Author

Genesis unit tests passing:

Screenshot 2026-05-18 at 19 21 54

@hughperkins changed the title from "[MISC] migrate rigid_solver.py to data_oriented" to "[MISC] Migrate rigid_solver.py to data_oriented." on May 19, 2026
@hughperkins
Collaborator Author

apparently 0.8.1b was inadvertently built from main...

@hughperkins
Collaborator Author

making a new quadrants release...

@hughperkins
Collaborator Author

Note: for the delta in the benchmarks, I had an AI assistant compare main against this branch using several runs of each, on the same machine. Results:

  For the same-machine anymal_uniform-None-None-30000-gpu benchmark comparison, the run counts were:

  • Branch: 4 runs total (one initial + 3 reps)
    • 10,586,272 / 10,614,827 / 10,649,966 / 10,612,440 FPS → avg 10,615,876
  • Main: 4 runs total (one initial + 3 reps)
    • 10,653,310 / 10,644,604 / 10,656,977 / 10,657,268 FPS → avg 10,653,040
  • Mean diff: -0.35%
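For anyone who wants to re-check the arithmetic, the averages and the relative difference can be recomputed from the FPS figures listed above. This is just a sanity-check sketch; the variable names are mine and the spread calculation is an addition, not part of the original comparison.

```python
from statistics import mean

# FPS figures copied from the lists above (one initial run + 3 reps each).
branch_fps = [10_586_272, 10_614_827, 10_649_966, 10_612_440]
main_fps = [10_653_310, 10_644_604, 10_656_977, 10_657_268]

branch_avg = mean(branch_fps)
main_avg = mean(main_fps)

# Relative difference of the branch mean versus the main mean, in percent.
diff_pct = (branch_avg - main_avg) / main_avg * 100.0

# Run-to-run spread (max minus min) as a percentage of each mean.
branch_spread_pct = (max(branch_fps) - min(branch_fps)) / branch_avg * 100.0
main_spread_pct = (max(main_fps) - min(main_fps)) / main_avg * 100.0

print(f"branch avg: {branch_avg:,.0f} FPS")
print(f"main avg:   {main_avg:,.0f} FPS")
print(f"mean diff:  {diff_pct:.2f}%")
```

The recomputed averages and the -0.35% difference match the figures above. With only four runs per side, the branch's own run-to-run spread (roughly 0.6% of its mean) is larger than the mean difference, so the delta is probably best read as within noise.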

@hughperkins marked this pull request as ready for review on May 20, 2026 12:35