Improve reinterpret performance for padded types, with minimal harm to compilation time#60415
Improve reinterpret performance for padded types, with minimal harm to compilation time#60415NHDaly wants to merge 41 commits into
reinterpret performance for padded types, with minimal harm to compilation time#60415Conversation
This reverts commit 32f7597.
Move the complex logic into a preprocessing step, so that the actual recursive function is _dead-simple_.
Apparently julia can fully specialize these kinds of for-loops over types now! Neat.
doing a new padding comparison computation.
Perf improvement on Base.padding (and Base.packedsize): while-loop implementation of base.padding. Makes `padding()` about 4x faster: ``` julia> @Btime Base.padding($(type1(30))); 34.125 μs (1175 allocations: 119.58 KiB) julia> @Btime Base.padding($(type1(30))); 8.070 μs (170 allocations: 10.72 KiB) ```
…on. (About 0.2 seconds)
|
Put an AI to bang a bit on this (so take it for what it is worth), and it came up with this, which passes on master but fails here. Since there is no discussion about that, I guess it is unintended? using Test
struct Inner
x::UInt8
y::UInt8
end
struct Outer1
a::UInt32
b::Inner
end
struct Outer2
a::Inner
b::UInt32
end
@testset "reinterpret padded region order" begin
o1 = Outer1(0x04030201, Inner(0x05, 0x06))
expected_o2 = Outer2(Inner(0x01, 0x02), 0x06050403)
@test reinterpret(Outer2, o1) == expected_o2
o2 = expected_o2
expected_o1 = Outer1(0x04030201, Inner(0x05, 0x06))
@test reinterpret(Outer1, o2) == expected_o1
end🤖 🤖 : |
cbcfcef to
16b20a8
Compare
|
Ah, yep! Thanks @KristofferC!! I actually just discovered exactly the same issue, and pushed up some broken tests. 👍 I'll push up a fix in the next couple days. |
|
Okay, I've pushed a fix! 😊 Thanks for the report. EDIT: Though after that fix, the compilation time went up a bit: julia> const Out = Tuple{UInt8, Int64, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Float64, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8};
julia> @time @eval reinterpret($Out, $((0x01, (0x03, (0x05, (0x07, (0x09, (0x0b, (0x0d, (0x0f, (0x11, (0x13, (0x15, (0x17, (0x19, (0x1b, (0x1d, (0x1f, (0x21, (0x23, (0x25, (0x27, (0x29, (0x2b, (0x2d, (0x2f, (0x31, (0x33, (0x35, (0x37, (0x39, (0x3b, (0x3d, 4991188238874984254, 0x46), 0x3c), 0x3a), 0x38), 0x36), 0x34), 0x32), 0x30), 0x2e), 0x2c), 0x2a), 0x28), 0x26), 0x24), 0x22), 0x20), 0x1e), 0x1c), 0x1a), 0x18), 0x16), 0x14), 0x12), 0x10), 0x0e), 0x0c), 0x0a), 0x08), 0x06), 0x04), 0x02), 1, 0x02, 0x00));
1.950291 seconds (5.81 M allocations: 258.578 MiB, 11.26% gc time, 99.99% compilation time)I'd love to try to figure out how to get that back down again. |
There was a problem hiding this comment.
Pull request overview
This PR significantly improves the performance of reinterpret for types with internal padding by generating specialized memory copy instructions at compile time, rather than using runtime reflection and loops. The key innovation is computing precise packed regions for each type and generating targeted unsafe_copyto! calls for each region.
Key changes:
- Replaced runtime reflection-based copying with compile-time generation of precise memcopy instructions for packed regions
- Optimized
padding()and introduced_packed_regions()to use iterative depth-first traversal instead of recursion, reducing compiler overhead by ~5x - Added comprehensive test coverage for padded-to-padded, padded-to-packed, and packed-to-padded conversions
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| base/reinterpret.jl | New file implementing the optimized reinterpret logic with packed region computation and specialized memory copying |
| base/reinterpretarray.jl | Removed old runtime-reflection-based implementation, optimized padding() to use iterative traversal, simplified ispacked() to read from struct metadata |
| base/Base.jl | Added include for the new reinterpret.jl file |
| test/reinterpretarray.jl | Added comprehensive test cases for various reinterpret scenarios including padded structs, nested tuples, and edge cases |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| end | ||
|
|
||
| # Simple memcopy between two types of the same size. | ||
| @inline function byte_cast(::Type{T}, x::V) where {T,V} |
There was a problem hiding this comment.
I'm having trouble understanding why these three separate methods are necessary. Other than the assert, byte_cast and _byte_cast_smaller_src are identical. And the test code has a test case that hits _byte_cast_smaller_src but doesn't hit byte_cast or _byte_cast_smaller_dst: reinterpret(Tuple{Int64, Int64, Int8}, ntuple(_->0x1, 17))
I do see that the _dst variant is slightly slower due to the source Ref/preserving the pointer, but why is that necessary here but not needed in the smaller and equal cases?
There was a problem hiding this comment.
Yeah, you're right, the first two are identical.
If we were only manually copying bytes between types of the same size, these wouldn't be needed. But if they have different sizes, we have to do things differently, to take care that we don't accidentally do any buffer-overflow:
If the source is smaller, we are guaranteed that the dst pointer will have "room" for the entire source, so we can cast the dst pointer into a SourceType pointer, and simply store the source into that pointer.
Whereas if the dst is smaller, we need to make sure that we are truncating the source value. To do that, we need to cast the source pointer into a DestType pointer, and then read the value from that pointer, and write the value into the dst pointer.
Note that we can't use that same code in the first case, since if the src is smaller, attempting to cast it into a DestType pointer and then read from it, could possibly read past the valid buffer, still causing a segfault.
So since the smaller-dst and smaller-src variants are both needed, i figured it made sense to have all three. Even though the code is the same in the first two, since if they have the same size, you can use either approach, and i picked the "simpler" one.
There was a problem hiding this comment.
Ok. As for the duplication, I suppose for someone else coming to the code it just makes it a bit harder to understand the design if there's code duplication without a clear reason for it.
There was a problem hiding this comment.
Perhaps move to a reinterpret.jl file to mirror the base reorg?
Description
This PR improves the performance of
reinterpret(T, x)for types with internal padding.Support for
reinterpreton padded structs was added in this nice PR #47116 by @BioTurboNick. We have made heavy use of this feature in our code at RelationalAI (thanks @BioTurboNick!). In the process, we've found some opportunities to improve performance for reinterpret on padded structs, quite dramatically in some cases. :)Before this PR, on master, the current implementation of reinterpret specializes to the types involved. However, despite specializing to the types, the generated code still contains runtime reflection and for-loops over the structs' definitions, with possible dynamic dispatches.
The new PR takes advantage of the fact that we are specializing these functions, to generate the precise memcopy instructions for each packed-region of the two types. To illustrate the idea, given these two types --

Tuple{UInt8, UInt16}, Tuple{UInt16, UInt8}:we generate code that looks roughly like this:
(As before, if the two types have identical padding, we can simply memcopy the whole byte range wholesale, i.e. perform a compiler-only typecast.)
Compile Time
Finally, regarding compilation time, I took a lot of care to make sure we aren't increasing compilation time too much in order to compute all of those memcopy ranges described above. In order to achieve that, most of the work to compute the packed-regions is performed in normal user code over vectors, in functions marked
@assume_effects :foldable. Then we convert those vectors to a wide Tuple of regions, and the final code-generation is achieved by recursing over that tuple and callingunsafe_copyto!in each iteration.Additionally, I took care to avoid any recursion in the
:foldablefunctions, since I think this would cause the compiler to cache each MethodInstance during the recursion. To do that, I converted from recursion to an explicit depth-first search. I found that this made an appreciable performance improvement to compile-times, and so I also applied the same optimization toBase.padding()which existed before this PR. That improved compile times by ~5x for that function.Here is one test showing the difference in compile times. This converts between two types with 33 fields, but different padding.
master:
this PR:
Benchmark Results
Here are some of the best-case highlights from the before and after on the reinterpret benchmarks added here: JuliaCI/BaseBenchmarks.jl#339:
Before this PR:
After this PR:
Full before/after here:
Benchmarks on
3b21c7f60d1Benchmarks on this PR