Skip to content

feat: introduce stack-allocated PyBuffer#5894

Open
winstxnhdw wants to merge 17 commits intoPyO3:mainfrom
winstxnhdw:feat/stacked-pybuffer
Open

feat: introduce stack-allocated PyBuffer#5894
winstxnhdw wants to merge 17 commits intoPyO3:mainfrom
winstxnhdw:feat/stacked-pybuffer

Conversation

@winstxnhdw
Copy link
Copy Markdown

@winstxnhdw winstxnhdw commented Mar 19, 2026

Summary

Currently, the overhead of PyBuffer takes a couple of microseconds to allocate on the heap, which may be too much overhead for some workloads.

This PR implements a pinned stack-allocated PyUntypedBuffer variant, PyUntypedBufferView.

Closes #5836

Comment thread src/buffer.rs Outdated
Copy link
Copy Markdown
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for this. Various thoughts around the accessor methods.

Also, out of scope for this PR, but I keep wondering if we should provide iterators for these structures. Especially with the strides / suboffsets etc, it's not necessarily trivial to get this right.

Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs
@winstxnhdw
Copy link
Copy Markdown
Author

winstxnhdw commented Mar 20, 2026

Also, out of scope for this PR, but I keep wondering if we should provide iterators for these structures. Especially with the strides / suboffsets etc, it's not necessarily trivial to get this right.

Yes, I think it would definitely be useful as well. I am not yet sure how this would look like though.

Copy link
Copy Markdown
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the continued work here, looking great!

Implementing Drop directly on PyUntypedBufferView is functionally equivalent to what I had in mind from the "drop guard", so gets a 👍 from me.

Afraid I had quite a few more thoughts around some of the edge cases (plus some ideas we can probably ignore). Hopefully helps us get to the right eventual abstraction!

Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs
@winstxnhdw winstxnhdw force-pushed the feat/stacked-pybuffer branch 3 times, most recently from 5430e96 to 48b5fdd Compare March 22, 2026 18:11
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq bot commented Mar 24, 2026

Merging this PR will degrade performance by 10.73%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

❌ 1 regressed benchmark
✅ 104 untouched benchmarks
⏩ 1 skipped benchmark1

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Benchmark BASE HEAD Efficiency
bench_pyclass_create 3.9 µs 4.4 µs -10.73%

Comparing winstxnhdw:feat/stacked-pybuffer (7f2de53) with main (962a535)

Open in CodSpeed

Footnotes

  1. 1 benchmark was skipped, so the baseline result was used instead. If it was deleted from the codebase, click here and archive it to remove it from the performance reports.

@winstxnhdw winstxnhdw force-pushed the feat/stacked-pybuffer branch from b4f1eab to dd63129 Compare March 27, 2026 15:03
Comment thread src/buffer.rs Outdated
@winstxnhdw winstxnhdw requested a review from davidhewitt March 30, 2026 10:09
Copy link
Copy Markdown
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool!

After reading through this implementation a couple of times, I really like the expressiveness this gives in the type system. I have some ideas which might refine it further (see BufferFlags struct idea).

Main concern is about generic code bloat from explosion of generic parameters. I am unsure if there's a way that we can mitigate that at all.

Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs
Comment thread src/buffer.rs
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs
@davidhewitt
Copy link
Copy Markdown
Member

Thanks for all of this - would be interested to see what you think of these suggestions (none are mandatory, I'm just pushing out ideas of what I think I like, which may not be to others' taste!)

@davidhewitt
Copy link
Copy Markdown
Member

#5870 has now merged, we'll want to add a similar API here.

@winstxnhdw winstxnhdw force-pushed the feat/stacked-pybuffer branch 6 times, most recently from fcb9203 to c1872f3 Compare April 7, 2026 21:31
@winstxnhdw
Copy link
Copy Markdown
Author

winstxnhdw commented Apr 7, 2026

Sorry for the delay. I've been a little burnt out from school + work. I've adapted your suggestions and modified them to be what I think is appropriate.

Also, if you do PyBufferFlags::simple().full(), you won't be able to append let's say format() on it anymore because full() already implies format(). Just a nice DX win that I haven't really seen in other libraries like reqwest or tokio.

This is ok.

let bytes = PyBytes::new(py, b"abcde");
let flags = PyBufferFlags::simple()
    .writable()
    .format()
    .c_contiguous()
    .strides();

PyUntypedBufferView::with_flags(&bytes, flags, |view| {
 ...
})

This is not ok.

let bytes = PyBytes::new(py, b"abcde");
let flags = PyBufferFlags::simple()
    .writable()
    .format()
    .full() // errors here
    .c_contiguous()
    .strides();

PyUntypedBufferView::with_flags(&bytes, flags, |view| {
 ...
})

@winstxnhdw winstxnhdw requested a review from davidhewitt April 9, 2026 06:39
Copy link
Copy Markdown
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is looking really cool. We're obviously venturing into new territory here so please bear with me if we have a few rounds of review while we experiment with what feels good (both to implement and to use).

A bunch of comments applied. I suspect we'll learn from these.

Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
@winstxnhdw winstxnhdw requested a review from davidhewitt April 12, 2026 14:39
Copy link
Copy Markdown
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think this is looking in pretty good shape. Let's see if we get an answer on python/cpython#148431 regarding format / itemsize interactions just in case that encourages us to make some other tweaks.

I would love feedback from some buffer API consumers about how this looks.

Ping @alex @kylebarron, I believe you both use PyO3's buffer API, please forgive the ping. I would love to ask:

  • Would this view type be useful to you for performance benefit over the PyBuffer<T> case which incurs an unconditional heap allocation? (I think it seems worth it for some consumers even if it doesn't affect your use cases, so this won't block merge if you say no.)
  • As consumers of the API, what do you think of the PyBufferFlags type which encodes the request and what fields from the exporter are meaningful in the type system. See e.g. #5894 (comment) - if the buffer request didn't ask for shape information, then .shape() accessor is not available, for example. I think we've cooked up something pretty nice here, though it would be nice to know whether others agree before we merge it 😅

See also docs for this PyUntypedBufferView on this branch at https://69dbb09a02f842688b3dcc9f--pyo3.netlify.app/main/doc/pyo3/buffer/struct.pyuntypedbufferview

Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated

/// Type-safe buffer request flags. The const parameters encode which fields
/// the exporter is required to fill.
pub struct PyBufferFlags<
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A thought: if we wanted to change CONTIGUITY to be an enum later (when possible) or add SUBOFFSETS, we would need users to be unable to directly name this type.

I think we might be able to achieve that by making this struct something like PyBufferFlagsImpl and having the real PyBufferFlags type be a thin re-export around it which uses the PyBufferFlagsImpl as a type parameter, similar to the way the view types do this.

I'm unsure of the merits of doing so, what do you think? It would be nice to have control over the implementation later if it doesn't add tons of complexity for both us and readers.

Copy link
Copy Markdown
Author

@winstxnhdw winstxnhdw Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we shouldn't expose PyBufferFlags to minimise breaking changes. I've made the change, but I think it looks a bit ugly. I've made it a separate commit so that you can easily view the diff, and for me to revert if needed. Let me know what you think.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem too bad to me, what bit do you dislike?

If you don't like the name FlagsImpl, one thought I have is that we could rename PyBufferFlags -> PyBufferRequest and PyBufferFlagsImpl -> PyBufferFlags.

"Compound Requests" is the term given in the Python docs for the flag combinations, after all.

See also my other comment on .format() method which might hide many of the uses of FlagsImpl if applied to all possible patterns.

Comment thread src/buffer.rs Outdated
@kylebarron
Copy link
Copy Markdown
Contributor

kylebarron commented Apr 13, 2026

I'm not sure I have enough context to answer these questions. Do you have a full usage example perhaps?

  • Would this view type be useful to you for performance benefit over the PyBuffer<T> case which incurs an unconditional heap allocation?

What is PyBuffer<T> allocating? It's not copying the entire buffer, right? Do you mean it's essentially Boxing the reference to the Python data? If so, that seems like a small price to pay to be amortized over the entire input buffer (I tend to use the buffer protocol with large inputs, so for a small overhead that's amortized over the entire input I don't mind much).

  • As consumers of the API, what do you think of the PyBufferFlags type which encodes the request and what fields from the exporter are meaningful in the type system. See e.g. #5894 (comment) - if the buffer request didn't ask for shape information, then .shape() accessor is not available, for example. I think we've cooked up something pretty nice here, though it would be nice to know whether others agree before we merge it 😅

That does look cool.

For my usage I tend to want to expose an input Python buffer as either Bytes or an Arrow Buffer. See for example FromPyObject for my PyBytes, which then I (unsafely) implement AsRef<[u8]> so that I can interpret the Python memory region for Bytes::from_owner.

I think this is sound if the Python user doesn't mutate the buffer while the Rust code is running? My use cases usually expect immutable buffers, even though there's no invariant to force that.

I don't think I'd have much benefit from a FnOnce closure that can access the Python region just once, even though that does look safer than my usage.

@winstxnhdw
Copy link
Copy Markdown
Author

winstxnhdw commented Apr 14, 2026

I've also been thinking if we could implement the extractor for the view buffers. It would definitely be nicer to use. I'll investigate.

Copy link
Copy Markdown
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, some responses to the PyBufferFlagsImpl changes.

Comment thread src/buffer.rs Outdated

/// Type-safe buffer request flags. The const parameters encode which fields
/// the exporter is required to fill.
pub struct PyBufferFlags<
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem too bad to me, what bit do you dislike?

If you don't like the name FlagsImpl, one thought I have is that we could rename PyBufferFlags -> PyBufferRequest and PyBufferFlagsImpl -> PyBufferFlags.

"Compound Requests" is the term given in the Python docs for the flag combinations, after all.

See also my other comment on .format() method which might hide many of the uses of FlagsImpl if applied to all possible patterns.

Comment thread src/buffer.rs Outdated
Comment on lines +1123 to +1126
/// A [struct module style](https://docs.python.org/3/c-api/buffer.html#c.Py_buffer.format)
/// string describing the contents of a single item.
#[inline]
pub fn format(&self) -> &CStr {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As part of simplifying the amount of FlagsImpl bouncing around, I came up with this commit: 3a977f3

Two upsides to that commit:

  • The .format() method is now always available, but has a trait bound with a friendly error message if the buffer request does not provide (or imply) format information.
  • There is now no mention of FlagsImpl at all on the .format() method, just a trait bound on Flags generic parameter.

What do you think of that? If you like it, maybe you can pull it and do similar for the other accessor methods?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, I didn't know you could do something like this. That looks good.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is now no mention of FlagsImpl at all on the .format() method, just a trait bound on Flags generic parameter.

I am not sure what's a good middle-ground for this. Your commit only has IncludesFormat, but technically, we can go all the way to include all of these traits as well.

  • CanRequestFormat
  • CanRequestShape
  • CanRequestStrides
  • CanRequestIndirect
  • CanRequestWritable
  • CanRequestContiguity
  • IncludesFormat
  • IncludesShape
  • IncludesStrides
  • IncludesSuboffsets
  • GuaranteesWritable
  • GuaranteesCContiguous
  • GuaranteesFContiguous

Doing this will get rid much of the FlagsImpl, but the cost of that is adding many more lines of code.

Comment thread src/buffer.rs Outdated
Comment thread src/buffer.rs Outdated
@davidhewitt
Copy link
Copy Markdown
Member

What is PyBuffer<T> allocating? It's not copying the entire buffer, right? Do you mean it's essentially Boxing the reference to the Python data? If so, that seems like a small price to pay to be amortized over the entire input buffer (I tend to use the buffer protocol with large inputs, so for a small overhead that's amortized over the entire input I don't mind much).

You're absolutely correct - for large buffers uses the allocation overhead can be irrelevant, and keeping the buffer information on the stack has the downside that you can't keep the buffer export alive for later consumption. OP noticed the performance advantage was relevant to their use case, however.

I can imagine this stack-based form is generally better for short-lived temporary buffers used to read (or write) a small bytes export.

I've also been thinking if we could implement the extractor for the view buffers. It would definitely be nicer to use. I'll investigate.

Can you clarify what you're thinking here? Maybe I can offer ideas :)

@davidhewitt
Copy link
Copy Markdown
Member

I think this is sound if the Python user doesn't mutate the buffer while the Rust code is running? My use cases usually expect immutable buffers, even though there's no invariant to force that.

Yes, that sounds about right. @alex has previously written about how the current buffer API is deficient of guarantees against data races, for now I guess we live with that. I think @ngoldbaum is also thinking about this problem from time to time (with all three of free-threading, numpy, and PyO3 hats on, even).

@winstxnhdw
Copy link
Copy Markdown
Author

Can you clarify what you're thinking here? Maybe I can offer ideas :)

Currently, our API is kinda like this, and it isn't really coherent with the rest of PyO3's API.

#[pyfunction]
fn sum_f32(py: Python<'_>, buf: PyUntypedBuffer) -> PyResult<f32> {
    let obj = buf.obj(py).unwrap();
    PyBufferView::<f32>::with_flags(obj, PyBufferFlags::contig(), |buf| {
        Ok(buf.as_contiguous_slice(py).iter().map(|x| x.get()).sum())
    })?
}

Ideally, I think we would want something like this.

#[pyfunction]
fn sum_f32(
    py: Python<'_>,
    buf: PyUntypedBufferWith<BufferRequest::ContigRo>,
) -> PyResult<f32> {
    let buf = buf.as_typed::<f32>()?;
    Ok(buf
        .as_contiguous_slice(py)
        .iter()
        .map(|x| x.get())
        .sum())
}

@winstxnhdw winstxnhdw force-pushed the feat/stacked-pybuffer branch 7 times, most recently from a6d4f80 to d22c422 Compare April 16, 2026 20:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consider stack-allocated PyBuffer variant

3 participants