Skip to content

[CFG] Replace lambdas in control_flow_graph.cpp with named helpers#699

Open
hughperkins wants to merge 18 commits into
mainfrom
hp/remove-cfg-lambdas
Open

[CFG] Replace lambdas in control_flow_graph.cpp with named helpers#699
hughperkins wants to merge 18 commits into
mainfrom
hp/remove-cfg-lambdas

Conversation

@hughperkins
Copy link
Copy Markdown
Collaborator

Promote three closures inside CFGNode::get_store_forwarding_data to a file-local free function (may_contain_address) and two private methods (is_visible_at, update_forwarding_result). Promote the in_final_node_live_gen closure in live_variable_analysis to a file-local free function. Replace four std::any_of predicates with explicit loops, the std::sort comparator with a local function-object struct, and inline the FNV mix lambda inside FingerprintHash. No behaviour change.

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

Promote three closures inside CFGNode::get_store_forwarding_data to a
file-local free function (may_contain_address) and two private methods
(is_visible_at, update_forwarding_result). Promote the
in_final_node_live_gen closure in live_variable_analysis to a file-local
free function. Replace four std::any_of predicates with explicit loops,
the std::sort comparator with a local function-object struct, and inline
the FNV mix lambda inside FingerprintHash. No behaviour change.
Break the 450-line monster into a 30-line top-level plus 11 helpers in a
file-local anonymous namespace, each handling exactly one phase of the
AD-stack sizing algorithm: collect_adaptive_ad_stacks,
accumulate_per_stack_per_node_size_deltas, compute_outgoing_node_ids,
tarjan_scc, classify_scc_edges, identify_cyclic_sccs_and_topo,
group_stacks_by_fingerprint, plus four DP helpers
(classify_cyclic_scc_fast_path, spread_max_begin_over_zero_scc,
dp_mixed_sign_cyclic_scc, update_global_max_and_relax_inter_scc) driven
by run_ad_stack_size_dp_for_representative, finishing with
apply_ad_stack_dp_results. No behaviour change; the top-level prose
docstring is preserved verbatim and per-phase rationale is moved to the
helper that owns it.
Break the 190-line CFGNode::dead_store_elimination into a 30-line
top-level driver plus file-local helpers: build_matrix_ptr_alias_maps
(alias tables), is_dse_eligible_pointer (uniform eligibility predicate),
dse_store_destinations (AD-stack-aware store ptrs), is_store_dead,
try_eliminate_dead_store_at (the main eliminate/weaken branch),
record_weakened_atomic_as_load, try_eliminate_identical_load_at, and
mark_loads_live_in_this_node. The DseAliasMaps and DseLiveState structs
replace ad-hoc map pairs threaded through every helper. No behavior
change: the atomic-dead-but-not-weakable case retains its original
no-state-update fall-through.
Break CFGNode::get_store_forwarding_data into three private methods plus
a thin top-level driver: find_intra_block_last_def handles the
intra-block scan with the quant and MatrixInitStmt special cases,
find_cross_block_def folds reach_in / reach_gen definitions through
update_forwarding_result and returns the last_def_position as
optional<int> (nullopt = abort, value = position or -1),
any_aliased_store_breaks_forwarding deduplicates the aliased-store
between-check that previously appeared verbatim at both the intra- and
cross-block exit paths. The top-level body is now a ~25-line driver. No
behavior change.
Break the single-loop CFGNode::store_to_load_forwarding into two private
methods (try_forward_load_at, try_eliminate_identical_store_at) plus a
~15-line driver. Forwarding takes precedence and short-circuits the
identical-store pass for the same stmt; both helpers manage |i| and
|modified| in lockstep with the legacy semantics, including the
replace-with-zero-does-not-flip-modified quirk and the alloca-zero
special case. No behavior change.
Extract two file-local helpers (is_external_input_pointer,
seed_start_node_reach_gen) for the entry-node reach_gen seeding phase
and one (is_reach_in_stmt_killed_at) for the per-stmt kill test inside
the worklist loop. The top-level body is now a ~40-line worklist
algorithm reading top-to-bottom: seed, per-node init, propagate. The
docstring is condensed but preserves the algorithmic contract.
Extract five file-local helpers (format_cfg_node_range_label,
format_neighbor_indices, format_live_var_names, write_cfg_node_header,
write_cfg_node_statements) that each handle one piece of the textual
dump. The top-level body is now a ~20-line driver: open file, build
node->index map, iterate nodes writing header and statements. No
behavior change.
Extract the kernel-wide live-out seeding of the synthetic exit node
(controlled by !after_lower_access and the SFG eliminable-snodes config)
into a file-local helper. The top-level body is now a ~35-line worklist
algorithm: seed, per-node init, propagate. Mirrors the shape of the
sibling reaching_definition_analysis. No behavior change.
The DSE and reaching-definition splits assumed std::vector<Stmt*> with
.front(), but irpass::analysis::get_store_destination /
get_load_pointers return quadrants::stmt_refs (= one_or_more<Stmt*>),
which has no .front() and a non-const begin()/end(). Switch:

- dse_store_destinations now returns stmt_refs (single-element cases
  use stmt_refs(ptr)); mark_loads_live_in_this_node takes stmt_refs &.
- Replace .front() with *begin() in the DSE driver.
- Drop const on local store_ptrs/load_ptrs whose range-for iteration
  needs non-const begin() (DSE driver, is_reach_in_stmt_killed_at).
@github-actions
Copy link
Copy Markdown

@github-actions
Copy link
Copy Markdown

Open the file with a paragraph explaining the two classes (CFGNode is
per-block intra-block work; ControlFlowGraph is the whole-graph driver),
the five analyses/transforms implemented (RD, LV, S2L, DSE,
determine_ad_stack_size), and a glossary defining reach_in/out/gen/kill,
live_in/out/gen/kill, UD-chain, adaptive AD-stack, stmt_refs, and the
MatrixPtrStmt aliasing rule. None of this was previously in-file; a
reader had to bounce through ir.h, analysis.h, statements.h, and
one_or_more.h before the first function.
Header: extend class doc comments on CFGNode and ControlFlowGraph to
spell out that the two classes implement the same five passes at
different scopes (per-block intra-block work on CFGNode, whole-graph
driver on ControlFlowGraph). On each paired method declaration, add a
one-line summary distinguishing the per-node work from the whole-graph
work (RD/LV seed + worklist; S2L/DSE flat per-node loop).

.cpp: add two banner section comments demarcating CFGNode method
definitions from ControlFlowGraph method definitions so a reader
scrolling the file can see the per-node vs whole-graph boundary without
hunting for the class qualifier on each function.
…eamble

Lead with a "What problem this solves" paragraph in user-facing terms
(reverse-mode autodiff records primal writes on per-variable stacks; we
need to size adaptive stacks by walking pushes-minus-pops on every CFG
path), then a "High-level algorithm" section that introduces SCC
condensation and fingerprint-deduplicated DP in narrative form,
explaining the two sign-based fast paths in one bullet each. The dense
implementation/perf notes follow as a "Performance summary" instead of
being the first thing a reader hits. No behavior change.
Three names that did not match what the function/method does, renamed
for self-documentation:

- get_store_forwarding_data -> find_forwardable_store_value
  Returns the forwarded *value* (Stmt *), not "data". "find" reflects
  that it's a UD-chain search that can fail (returns nullptr).

- update_forwarding_result -> fold_definition_into_result
  Folds one UD-chain definition into the running (result,
  result_visible) state; "fold" is the standard reduce-step term.

- reach_kill_variable -> is_reach_killed
  Question-form, signalling it is a query ("does this node kill the
  definition reaching it?") rather than a mutator. Aligns with the
  reach_kill set name.

All call sites are within control_flow_graph.{h,cpp}.
determine_ad_stack_size and its ~20 file-local helpers are a
self-contained ~500-line algorithm (Tarjan SCC condensation +
fingerprint-deduplicated DP) that has no per-node CFGNode counterpart
and is not used by any other pass in control_flow_graph.cpp. Pulling it
into a sibling file drops control_flow_graph.cpp from ~2100 to ~1480
lines and lets a first-time reader see at a glance that the AD-stack
sizer is a distinct concern.

The build picks the file up automatically through the existing
`file(GLOB ... "quadrants/ir/*")` in cmake/QuadrantsCore.cmake; no
CMakeLists change required. The implementation is a verbatim move
(including the rewritten "What problem this solves" docstring); access
to `start_node` still works because the method is still defined on
ControlFlowGraph.
@github-actions
Copy link
Copy Markdown

Add `AdStackAllocaStmt::kAdaptiveSentinel` (=0) and
`AdStackAllocaStmt::is_adaptive()`, and route every "is this stack still
unsized?" check through the predicate. Five call sites updated:
- quadrants/ir/determine_ad_stack_size.cpp (2)
- quadrants/transforms/determine_ad_stack_size.cpp (3)
The local `AdStackDPResult::max_size == 0` check in
quadrants/ir/determine_ad_stack_size.cpp is unrelated (compares a DP
result, not the stmt) and is left alone.

This makes the "0 means adaptive" magic local to one class. A future
change to the sentinel (e.g. switching to std::optional<size_t> or a
typed state enum) now touches only the one predicate. No behavior
change.
The old run_ad_stack_size_dp_for_representative took 11 positional args
of which 7 were vector<vector<int>>& (three of them being
next_ids_intra_fwd / next_ids_intra_back / next_ids_inter, structurally
identical and trivial to swap). Bundle them into two structs:

  AdStackDPGraph     -- per-graph (computed once, reused per stack):
                        start_node, num_sccs, scc_nodes, scc_is_cyclic,
                        scc_topo, next_ids_intra_{fwd,back}, next_ids_inter.
  AdStackPerNodeRow  -- per-stack (varies per rep): increased_size +
                        max_increased_size.

Call site uses designated initializers, so swapping e.g.
`next_ids_intra_fwd` and `next_ids_intra_back` (or `increased_size` and
`max_increased_size`) is now a compile error rather than a silent
positive-cycle-detection regression. No behavior change.

Chose this over an `enum class NodeId/StackId/SccId` strong typedef:
the strong-typedef approach would have forced ~40 static_cast<int>
insertions in the inner loops for ~zero added type safety inside the
helpers, where the actual swap risk is at the function boundary.
New private `ControlFlowGraph::assert_structural_invariants()` checks:
nodes non-empty, start_node and final_node in range, and both endpoints
actually allocated. Called from each whole-graph driver:

- reaching_definition_analysis
- live_variable_analysis
- store_to_load_forwarding
- dead_store_elimination
- gather_loaded_snodes
- simplify_graph
- unreachable_code_elimination
- dump_graph_to_file
- determine_ad_stack_size

These invariants are assumed by every helper in the file (worklist
seeding via `nodes[start_node]`, dump loop indexing by start_node /
final_node, DP scratch buffer indexed by start_node, ...). A violation
used to segfault or silently corrupt; now it asserts at the boundary.

Cross-pass ordering checks (e.g. "did you run RD before S2L?") are not
added because the drivers self-bootstrap: `store_to_load_forwarding`
calls `reaching_definition_analysis()` internally; `dead_store_elimination`
calls `live_variable_analysis()` internally. The ordering is structural,
not a per-call precondition.
Five branches whose behavior is load-bearing for compiler correctness but
where the reason "looks like dead code / obvious cleanup" to a casual
reader (human or AI). Each now opens with an explicit '[SEMANTIC TRAP --
DO NOT REMOVE ...]' header, names the legacy contract being preserved,
and (where I know it) explains the failure mode if the branch is changed.

1. try_eliminate_dead_store_at: the dead-but-unweakable atomic case
   returns false without updating killed/live state. (A previous refactor
   broke exactly this; tests do not cover it directly.)
2. try_forward_load_at: replace-with-zero does NOT flip `modified`.
   Preserved verbatim from the pre-refactor code; reasoning not captured
   in the original, comment now flags that any change needs a benchmark.
3. try_eliminate_identical_store_at: the alloca-init-0 fast path is gated
   on !autodiff_enabled and only fires for const-0 stores. Both are load-
   bearing under autodiff push/pop semantics.
4. find_intra_block_last_def: the is_quant() guard skips forwarding of
   quant-typed stores (implicit cast may truncate, so the forwarded value
   may differ from what would be loaded back).
5. any_aliased_store_breaks_forwarding: the non-tensor-alloca early-out
   skips the aliasing scan; without it, may_contain_address would mis-
   report aliasing for unrelated stores.

No behavior change.
@github-actions
Copy link
Copy Markdown

@github-actions
Copy link
Copy Markdown

@github-actions
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant