sparse strips: Lazily push wide tile layer buffers #1414

Open
tomcur wants to merge 14 commits into linebender:main from tomcur:lazy-push-buf
Conversation

@tomcur tomcur commented Feb 3, 2026

This continues on #1403, and is based on/closes #1383.

In `coarse.rs` we are doing O(viewport_width × viewport_height) work for pushing and popping buffers for each layer needing buffers. PR #1403 already ensures we aren't actually sending commands for buffer blending through to rendering anymore for empty buffers. This PR makes command generation lazy, so that we also don't do the `push_buf` work for the entire viewport anymore, but only for the wide tiles that actually get drawn on.

Some operations still fall back to pushing buffers for the full viewport, like filter layers and destructive blends. We may be able to do away with some of that in the future, but for now, those operations require buffers even if they don't have content (doing away or tightening the condition causes various tests to fail or asserts to be hit).
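For readers unfamiliar with the coarse stage, the lazy scheme can be sketched roughly like this (all names are hypothetical and heavily simplified; the real logic lives in vello's `coarse.rs` and additionally handles clips, filter layers, and destructive blends):

```rust
// Simplified sketch of lazy per-tile buffer pushing. Instead of emitting a
// `PushBuf` on every wide tile when a layer opens, we only record the layer
// depth; each tile catches up on its missing `PushBuf`s the first time it is
// actually drawn on. Empty tiles never pay for the layer at all.

#[derive(Debug, PartialEq)]
enum Cmd {
    PushBuf,
    Fill(u32), // payload stands in for real fill data
}

#[derive(Default)]
struct WideTile {
    cmds: Vec<Cmd>,
    n_bufs: usize, // buffers pushed on this tile so far
}

struct Coarse {
    tiles: Vec<WideTile>,
    layer_depth: usize, // layers currently open that need buffers
}

impl Coarse {
    fn new(n_tiles: usize) -> Self {
        Self {
            tiles: (0..n_tiles).map(|_| WideTile::default()).collect(),
            layer_depth: 0,
        }
    }

    /// Opening a layer is now O(1): no per-tile work happens here.
    fn push_layer(&mut self) {
        self.layer_depth += 1;
    }

    fn fill(&mut self, tile_idx: usize, data: u32) {
        let depth = self.layer_depth;
        let tile = &mut self.tiles[tile_idx];
        // Lazily emit the `PushBuf` commands this tile is missing.
        while tile.n_bufs < depth {
            tile.cmds.push(Cmd::PushBuf);
            tile.n_bufs += 1;
        }
        tile.cmds.push(Cmd::Fill(data));
    }
}
```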

@tomcur tomcur added the C-sparse-strips Applies to sparse strips variants of vello in general label Feb 3, 2026
@tomcur tomcur left a comment

I've done a bit of micro-optimizing to keep the cost low for the worst-case where we're actually drawing a (nearly) dense layer.

Benchmarks in the case of layers look as follows. The first group times `wide` when drawing the content three layers deep at the content's natural size. Ghostscript Tiger is nearly completely dense, whereas Paris-30k is quite sparse. The tiger regresses: we now do the `ensure_layer_stack_bufs` checks and end up doing the work for almost all tiles anyway, so it would've been faster to just push all the buffers beforehand. Paris-30k, on the other hand, is much improved.

The second group does the same but draws at 4k. That's much bigger than Tiger's natural size, so this case also improves a lot. (Paris-30k still improves, but at 4k it is relatively denser than at its natural, larger size.)

coarse_with_layer/Ghostscript_Tiger
                        time:   [5.8025 µs 5.8357 µs 5.8685 µs]
                        change: [+5.3922% +5.8851% +6.4095%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 50 measurements (4.00%)
  2 (4.00%) high mild
coarse_with_layer/paris-30k
                        time:   [620.76 µs 623.32 µs 626.41 µs]
                        change: [-54.967% -52.845% -50.795%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 50 measurements (8.00%)
  1 (2.00%) high mild
  3 (6.00%) high severe

coarse_with_layer_4k/Ghostscript_Tiger
                        time:   [107.11 µs 107.33 µs 107.58 µs]
                        change: [-73.934% -73.739% -73.575%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 50 measurements (12.00%)
  5 (10.00%) high mild
  1 (2.00%) high severe
coarse_with_layer_4k/paris-30k
                        time:   [340.78 µs 342.86 µs 345.49 µs]
                        change: [-40.644% -39.706% -38.685%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 50 measurements (8.00%)
  1 (2.00%) high mild
  3 (6.00%) high severe

In case there are no layers at all, we never have to push buffers, so there we pay for the check on every call needlessly. In the sparse Paris-30k case it's not too bad.

coarse/Ghostscript_Tiger
                        time:   [3.2046 µs 3.2230 µs 3.2462 µs]
                        change: [+3.2502% +4.2128% +5.1102%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 50 measurements (6.00%)
  2 (4.00%) high mild
  1 (2.00%) high severe
coarse/paris-30k        time:   [356.05 µs 357.00 µs 358.15 µs]
                        change: [+0.5005% +1.5403% +2.4935%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 50 measurements (6.00%)
  2 (4.00%) high mild
  1 (2.00%) high severe

tomcur commented Feb 19, 2026

(Getting rid of the subtraction (`n_bufs - n_clip`) makes no substantial difference.)

@LaurenzV
Hmm, I guess some of those gains do look pretty nice. If we decide to merge some variant of #1454, the "no layer" case would also become irrelevant since we would completely skip coarse rasterization (at least for vello_hybrid).

However, I'd like to hear what @taj-p or @grebmeg think before reviewing this PR more deeply, as it does add some more complexity.

@laurenz-canva

Wow, I just benchmarked this more carefully and this is a huge win! On a scene with 750 rectangles, each wrapped in an opacity layer, rendering 150 frames gives me the following numbers:

Before: (benchmark screenshot)

After: (benchmark screenshot)

Since the no-layer case will be optimized to completely circumvent coarse rasterization anyway, I think that's a trade-off worth making!

@LaurenzV LaurenzV left a comment

/// This is a small wrapper around [`Vec`] so we can keep the
/// [`LayerKindAndOccupiedTiles::occupied_tiles`] allocations around.
#[derive(Debug, Default)]
struct NeedsBufLayerStack {

Not for this PR, just a drive-by comment: I realized something like this would also be useful for the scratch buffers that are allocated in vello_cpu. It might be worth turning this into a small vector abstraction so that it can be reused in multiple places.
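The suggested abstraction could look roughly like this (a sketch with invented names, not vello API): a stack whose `pop` only moves a logical length, so popped elements keep their heap allocations around for reuse by a later push.

```rust
// Sketch of a reusable "recycling stack": elements are never dropped on pop,
// only logically removed, so their allocations (e.g. inner `Vec`s) can be
// reused. The `init` closure passed to `push_with` receives the recycled
// value as-is and is responsible for resetting it.

#[derive(Default)]
struct RecyclingStack<T> {
    storage: Vec<T>,
    len: usize, // logical length; `storage[len..]` holds recyclable slots
}

impl<T: Default> RecyclingStack<T> {
    /// Logically push an element, reusing a previously popped slot
    /// (and its heap allocations) when one is available.
    fn push_with(&mut self, init: impl FnOnce(&mut T)) -> &mut T {
        if self.len == self.storage.len() {
            self.storage.push(T::default());
        }
        let idx = self.len;
        self.len += 1;
        let slot = &mut self.storage[idx];
        init(slot);
        slot
    }

    /// Logical pop: the element stays in `storage` so a later
    /// `push_with` can recycle it.
    fn pop(&mut self) {
        debug_assert!(self.len > 0, "called `pop` on an empty stack");
        self.len -= 1;
    }

    fn last(&self) -> Option<&T> {
        self.len.checked_sub(1).map(|idx| &self.storage[idx])
    }
}
```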

#[inline(always)]
fn ensure_layer_stack_bufs(&mut self, tile_idx: usize) {
let tile = &mut self.tiles[tile_idx];
let layer_bufs = tile.n_bufs - tile.n_clip;

I think a comment explaining why we need the `- tile.n_clip` here would be good.

Comment on lines 608 to 610
let idx = self.get_idx(wtile_x, strip_y);
self.ensure_layer_stack_bufs(idx);
self.tiles[idx].strip(cmd, current_layer_id);

I think this (as well as other places) is pretty unfortunate, because it's pretty easy to miss that you need to call `ensure_layer_stack_bufs` before calling any method on the wide tile. Would it not be possible to move that method into `WideTile`, and then change each method on the wide tile to take a `&mut NeedsBufLayerStack`, such that it becomes impossible to miss this?

@tomcur tomcur Feb 24, 2026

I've done this refactor, and it should actually be slightly more performant in some cases, because the wide tile itself has a bit more information on whether buffers need pushing (e.g., if it's in a zero clip, a fill doesn't need buffers). One unfortunate ergonomic aspect remains: knowing the wide tile's index is required, so we still need to do a little `let idx = ...; self.tiles[idx].fill(idx, ...)` dance, instead of being able to do `self.get_mut(...).fill(...)`. (A wrapper could work, but perhaps that's just even more mental overhead.)
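The wrapper idea mentioned here could be sketched like so (hypothetical, heavily simplified types; the real tile methods also need shared per-frame state, which complicates the borrows):

```rust
// Sketch of a wrapper that pairs a wide tile with its own index, so call
// sites can write `coarse.tile_mut(idx).fill()` without repeating the index.
// The tile method keeps access to `self.idx`, which a real implementation
// would use to ensure the layer-stack buffers are pushed first.

struct WideTile {
    cmds: Vec<&'static str>,
}

struct Coarse {
    tiles: Vec<WideTile>,
}

/// Borrowed view of one tile plus the index it lives at.
struct TileMut<'a> {
    idx: usize,
    tile: &'a mut WideTile,
}

impl Coarse {
    fn get_idx(&self, x: usize, y: usize, width: usize) -> usize {
        y * width + x
    }

    fn tile_mut(&mut self, idx: usize) -> TileMut<'_> {
        TileMut { idx, tile: &mut self.tiles[idx] }
    }
}

impl<'a> TileMut<'a> {
    fn fill(&mut self) {
        // A real implementation would first ensure the layer-stack buffers
        // for tile `self.idx` are pushed, then record the fill.
        let _ = self.idx;
        self.tile.cmds.push("fill");
    }
}
```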

// Optimization: If no drawing happened since the last `PushBuf`, then we don't
// need to do any masking or buffer-wide opacity work. The same holds for
// blending, unless it is destructive blending.
let has_draw_commands = !matches!(tile.cmds.last().unwrap(), &Cmd::PushBuf(_));

Can this even happen now that layers are pushed lazily?

@tomcur tomcur Feb 24, 2026

Yes; destructive blends, for example, still need to eagerly push buffers (copy blending has to copy `(0,0,0,0)` from empty source tiles to the destination).

I've clarified the comment a bit.

Comment on lines 916 to 920
if self.clipped_filter_layer_depth == 0 {
for tile in &mut self.tiles {
tile.in_clipped_filter_layer = false;
}
}

Am I correct that this isn't related to this PR specifically, but a bug you found as part of your work?

@tomcur

Yes, though it seems this isn't necessary anymore for tests to pass; I believe it was in an earlier version of this PR. It may be best to leave it out for now. Do you know if the bool is intended to be true precisely when the layer itself is a clipped filter layer, or also when it is any other kind of layer nested inside a clipped filter layer?

@tomcur tomcur requested a review from LaurenzV February 24, 2026 10:51

@LaurenzV LaurenzV left a comment


Nice, this looks very solid to me now! However, I'd like to wait for @grebmeg to give final approval, since it's still a somewhat delicate change.

Comment on lines +98 to +127
impl NeedsBufLayerStack {
fn clear(&mut self) {
for idx in 0..self.len {
self.stack[idx].occupied_tiles.clear();
}
self.len = 0;
}

fn push(&mut self, kind: LayerKind) {
if self.len == self.stack.len() {
self.stack.push(LayerKindAndOccupiedTiles {
kind,
occupied_tiles: vec![],
});
} else {
self.stack[self.len].occupied_tiles.clear();
self.stack[self.len].kind = kind;
}
self.len += 1;
}

fn pop(&mut self) {
debug_assert!(self.len > 0, "Called `pop` at the root");
self.len -= 1;
}

fn last(&self) -> Option<&LayerKindAndOccupiedTiles> {
self.len.checked_sub(1).map(|idx| &self.stack[idx])
}
}

Should these be inlined just to be safe?

// function call that is not inlined. That keeps the generated code size small at our call
// sites.
(layer_bufs < layers.len).then(
#[inline(never)]

This is what I get when I turn it into a normal if statement:

coarse_with_layer/Ghostscript_Tiger
                        time:   [3.1904 µs 3.2009 µs 3.2124 µs]
                        change: [-6.1956% -5.3240% -4.3456%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 50 measurements (20.00%)
  7 (14.00%) high mild
  3 (6.00%) high severe

I think we should just do it? It's just a single for loop after all, so I don't think we need to be too worried about code bloat. :D

@tomcur

Interesting... I get these results (on x86) if I turn it into a normal if:

coarse_with_layer/Ghostscript_Tiger
                        time:   [5.8312 µs 5.8537 µs 5.8745 µs]
                        change: [+1.8618% +2.2918% +2.7258%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 50 measurements (6.00%)
  3 (6.00%) high mild
coarse_with_layer/paris-30k
                        time:   [661.21 µs 666.10 µs 673.03 µs]
                        change: [+1.9323% +3.6281% +5.2749%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 50 measurements (8.00%)
  2 (4.00%) high mild
  2 (4.00%) high severe

coarse_with_layer_4k/Ghostscript_Tiger
                        time:   [130.49 µs 133.27 µs 137.10 µs]
                        change: [+8.5701% +10.965% +13.581%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 50 measurements (16.00%)
  1 (2.00%) high mild
  7 (14.00%) high severe
coarse_with_layer_4k/paris-30k
                        time:   [359.92 µs 365.88 µs 373.64 µs]
                        change: [+0.4684% +2.1311% +3.9630%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 50 measurements (14.00%)
  2 (4.00%) high mild
  5 (10.00%) high severe

@LaurenzV

Oh damn... Okay, I suppose let's leave it how it is then.

@tomcur

Maybe this is something where `core::hint::cold_path` would be helpful, but it's nightly-only.

@tomcur

(Or automatic profile-guided optimization.)
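On stable Rust, a related (though not identical) trick is marking the slow path as a `#[cold]`, non-inlined function, which nudges the optimizer to lay out the hot path as the fall-through. A minimal sketch with invented names; whether this matches the codegen of the closure-based `.then(#[inline(never)] ...)` pattern in the PR would need benchmarking:

```rust
// Slow path: push the `PushBuf` commands a tile is missing. `#[cold]` hints
// that this branch is rarely taken; `#[inline(never)]` keeps the call sites
// small, mirroring the intent of the closure trick discussed above.
#[cold]
#[inline(never)]
fn push_missing_bufs(cmds: &mut Vec<u32>, missing: usize) {
    for _ in 0..missing {
        cmds.push(0); // 0 stands in for a PushBuf command
    }
}

// Hot path: a cheap comparison, with the loop tucked away behind a call.
fn ensure_bufs(cmds: &mut Vec<u32>, have: usize, want: usize) {
    if have < want {
        push_missing_bufs(cmds, want - have);
    }
}
```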

/// The number of pushed buffers.
/// The number of pushed buffers, including buffers for clips.
///
/// Note not all layers require their own buffers.

That sentence confuses me a bit. 😄 Why do not all layers require their own buffers?

@LaurenzV

Oh right, I see now, basically if needs_buf returns None then we don't have an explicit layer, right?

@tomcur tomcur Feb 24, 2026

> basically if needs_buf returns None then we don't have an explicit layer, right?

Right! Added a documentation link in the comment to that method.

@tomcur

Doclink doesn't like that the `WideTile` field is public while the method the docs refer to is private :) maybe all `WideTile` fields should just become private; only the module itself uses them.

@LaurenzV LaurenzV requested a review from grebmeg February 24, 2026 12:51
@LaurenzV LaurenzV left a comment

Talked to Alex, and he said it's fine landing if I consider it good. 👍

Comment on lines +100 to +102
for idx in 0..self.len {
self.stack[idx].occupied_tiles.clear();
}

I'm wondering, do we even need this if we clear upon every push? It's not like this will free the memory, right? And it will also only clear the elements up to the current length, not entries that we might have popped in the meanwhile.

Successfully merging this pull request may close these issues: Lazy tile push_buf for blending.