perf: direct i1 branch propagation for integer comparisons and compound loop guards#508
Closed
perf: direct i1 branch propagation for integer comparisons and compound loop guards#508
Conversation
Contributor
Benchmark Results (Linux x86-64)
CLI Tool Benchmarks
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Two related lowering fixes for the branch-condition path. Both are correctness-preserving cleanups that kill dead conversions in the emitted LLVM IR — not performance transforms in the usual sense, just removing wasted code the optimizer would have liked to eliminate but couldn't.
Fix 1: wantsI1 fast-path in Uint8Array + charAt comparison optimizations
`BinaryExpressionGenerator` has two fast-path helpers that recognize specific comparison shapes:
Both helpers correctly produce an `i1` from the `icmp`, but neither honored the global `wantsI1` flag that tells a comparison "your result is going directly into a branch condition, skip the i32/double round-trip." So even though the comparison was already a boolean ready to feed `br i1`, they would emit:
```llvm
%81 = icmp eq i8 %80, 1 ; the useful part
%82 = zext i1 %81 to i32 ; ← dead
%83 = sitofp i32 %82 to double ; ← dead
%84 = fcmp one double %83, 0.0 ; ← dead
br i1 %84, label %then, ... ; ← the branch we wanted three instructions ago
```
Every other comparison path in `BinaryExpressionGenerator` checks `getWantsI1()` at the end and returns the raw `i1` when a branch wants it (see `generateNumericComparison` at line ~573). These two fast paths were missed.
Fix is 4 lines in each helper: check `getWantsI1()` before the zext/sitofp chain, and if set, return the raw i1 directly. The `tryOptimizeCharAtComparison` case also required restructuring its three-way `phi` to merge `i1` values instead of `i32`, since the phi operands need to match the new return type.
Post-fix IR for `if (flags[p] === 1) { ... }` in the sieve benchmark:
```llvm
%81 = icmp eq i8 %80, 1 ; single icmp
br i1 %81, label %then12, ... ; direct branch
```
Two instructions, not five.
Fix 2: `isSimpleComparisonForBranch` was too conservative on compound operands
`generateBranchCondition` in `control-flow.ts` has a fast path that recognizes simple comparison expressions and sets `wantsI1=true` before recursing into `generateExpression`, so the final comparison returns a raw `i1` for the branch. The predicate `isSimpleComparisonForBranch` decided which expressions qualified.
It bailed on any nested binary operand — i.e. `while (p * p <= LIMIT)` fell through to the slow path because `p * p` is a `binary` node. The slow path generates the comparison as an expression (returning a `double` from `generateNumericComparison`'s zext+sitofp branch) and then `convertToBool`'s that double back to `i1` via another `fcmp one 0.0`. Six extra instructions per loop iteration for an entirely dead conversion.
Replaced the flat "reject binary" check with a recursive `isPureSubexprForBranch` walker. A sub-expression is pure for the branch path iff it does not consume the `wantsI1` flag itself — which means no nested comparisons, no logical ops (`&&`/`||`/`??`), no conditionals, no calls, no arrow functions, no awaits. Pure arithmetic (mul/add/sub/shifts), unary minus/bitnot, variable reads, member/index accesses, and literals all pass.
The walker is deliberately conservative about `unary !` (which wraps a comparison that would consume wantsI1) and `binary` comparison/logical ops (which would also consume wantsI1 before the outer comparison sees it). For everything else, wantsI1 propagates cleanly.
Post-fix IR for sieve's outer loop `while (p * p <= LIMIT)`:
```llvm
while_cond9:
%65 = load i64, i64* %p.addr.3
%66 = load i64, i64* %p.addr.3
%67 = mul i64 %65, %66
%68 = load double, double* @limit
%69 = sitofp i64 %67 to double
%70 = fcmp ole double %69, %68
br i1 %70, label %while_body10, label %while_end11
```
Previously this had three extra instructions (`zext i1 → i32`, `sitofp i32 → double`, `fcmp one double, 0.0`) between the `fcmp ole` and the `br`. Gone now.
Honest note on measurable impact
Both fixes are structurally correct but the measurable perf impact on the current benchmark suite is small, and I want to be upfront about that rather than oversell:
So: expect sieve to move somewhere between 0% and 3% on the dashboard, probably landing inside noise. The real value of this PR is cleaner IR that makes downstream LLVM optimizations easier and removes a class of "future we'll fix this" footguns where a new comparison fast path gets added and forgets to honor `wantsI1`.
What this PR does NOT fix
These are all on the perf follow-up roadmap; this PR is the small correct cleanup that generalizes most broadly.
Tests
```
✔ 777/777 pass
suites 37
duration 65.7s
```
All three of the existing `tryOptimizeUint8ArrayComparison` / `tryOptimizeCharAtComparison` / `isSimpleComparisonForBranch` code paths are exercised in the test corpus, and none needed updates. The fixes are strictly subtractive — they replace emitted IR with a shorter equivalent that produces the same branch outcome.
Measurement
Scope
Two files:
Diff: +83 / −12 across two files. No new dependencies, no new runtime code.