Improve reinterpret performance for padded types, with minimal harm to compilation time #60415

Open
NHDaly wants to merge 41 commits into master from nhd/reinterpret-padded-struct-performance

NHDaly (Member) commented Dec 18, 2025

Description

This PR improves the performance of reinterpret(T, x) for types with internal padding.

Support for reinterpret on padded structs was added in this nice PR #47116 by @BioTurboNick. We have made heavy use of this feature in our code at RelationalAI (thanks @BioTurboNick!). In the process, we've found some opportunities to improve performance for reinterpret on padded structs, quite dramatically in some cases. :)

On master, the current implementation of reinterpret specializes on the types involved. Even so, the generated code still contains runtime reflection and for-loops over the structs' definitions, with possible dynamic dispatches.

This PR takes advantage of the fact that we are specializing these functions to generate the precise memcopy instructions for each packed region of the two types. To illustrate the idea, given these two types, Tuple{UInt8, UInt16} and Tuple{UInt16, UInt8}:
[Image: Screenshot 2025-12-18 at 2 33 01 PM]

we generate code that looks roughly like this:

## Pseudocode for reinterpret(Tuple{UInt8, UInt16}, (0x0001, 0x2)):
#  Source layout bytes:        [a3 a2 a1 (pad)]
#  Result layout bytes:        [b3 (pad) b1 b0]
    memcpy(b + 0, a + 0, 1);  # b3 = a3
    memcpy(b + 2, a + 1, 2);  # b1 = a2, b0 = a1
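
The pseudocode above can be tried out with a small runnable sketch. This is not the PR's implementation: `copy_regions` is a hypothetical helper, the region list is written out by hand, and the expected value assumes a little-endian machine.

```julia
# Hypothetical helper for illustration; not the PR's implementation.
# Copies each (dst_offset, src_offset, nbytes) region from `x` into a fresh `Out`.
function copy_regions(::Type{Out}, x::In, regions) where {Out, In}
    src = Ref{In}(x)    # addressable copy of the source value
    dst = Ref{Out}()    # uninitialized destination; its padding bytes stay undefined
    GC.@preserve src dst begin
        psrc = convert(Ptr{UInt8}, Base.unsafe_convert(Ptr{In}, src))
        pdst = convert(Ptr{UInt8}, Base.unsafe_convert(Ptr{Out}, dst))
        for (dst_off, src_off, nbytes) in regions
            unsafe_copyto!(pdst + dst_off, psrc + src_off, nbytes)
        end
    end
    return dst[]
end

# Tuple{UInt16, UInt8} -> Tuple{UInt8, UInt16}: the two memcpy calls from the pseudocode.
# On a little-endian machine the source bytes are [0x01 0x02 0x03 (pad)].
result = copy_regions(Tuple{UInt8, UInt16}, (0x0201, 0x03), ((0, 0, 1), (2, 1, 2)))
```

Here the regions are a runtime argument and the loop is an ordinary loop; the point of the PR is that the equivalent copies are derived from the types during compilation.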

(As before, if the two types have identical padding, we can simply memcopy the whole byte range wholesale, i.e. perform a compiler-only typecast.)
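
To make "identical padding" concrete, here is a rough sketch that lists the padding gaps of a type from its field offsets. `flat_padding` and `same_flat_padding` are hypothetical names, and the sketch only looks at top-level fields (it does not descend into nested structs), so it is a simplification of what `Base.padding` has to handle.

```julia
# Hypothetical sketch: padding gaps between the top-level fields of an isbits type.
# Returns (offset, length) pairs; nested fields are treated as opaque blobs.
function flat_padding(::Type{T}) where {T}
    gaps = Tuple{Int,Int}[]
    pos = 0                                   # first byte not yet accounted for
    for i in 1:fieldcount(T)
        off = Int(fieldoffset(T, i))
        off > pos && push!(gaps, (pos, off - pos))
        pos = off + sizeof(fieldtype(T, i))
    end
    sizeof(T) > pos && push!(gaps, (pos, sizeof(T) - pos))   # trailing padding
    return gaps
end

# Two types could be copied wholesale when their sizes and padding gaps coincide.
same_flat_padding(::Type{A}, ::Type{B}) where {A, B} =
    sizeof(A) == sizeof(B) && flat_padding(A) == flat_padding(B)
```

For the two example types, `flat_padding(Tuple{UInt8, UInt16})` gives one gap at offset 1 and `flat_padding(Tuple{UInt16, UInt8})` gives one gap at offset 3, so the wholesale copy does not apply and the region-by-region copy is needed.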

Compile Time

Finally, regarding compilation time, I took a lot of care to make sure we aren't increasing compilation time too much in order to compute all of those memcopy ranges described above. In order to achieve that, most of the work to compute the packed-regions is performed in normal user code over vectors, in functions marked @assume_effects :foldable. Then we convert those vectors to a wide Tuple of regions, and the final code-generation is achieved by recursing over that tuple and calling unsafe_copyto! in each iteration.
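
The "recurse over a tuple of regions" step can be sketched as follows. The names `copy_each!` and `copy_demo` are hypothetical and the region tuple is written by hand; in the PR the regions come from the `:foldable` preprocessing, which this sketch does not reproduce.

```julia
# Hypothetical sketch: one unsafe_copyto! per region, by recursion over a tuple.
# Each region is (dst_offset, src_offset, nbytes).
copy_each!(::Ptr{UInt8}, ::Ptr{UInt8}, ::Tuple{}) = nothing   # base case: no regions left

function copy_each!(dst::Ptr{UInt8}, src::Ptr{UInt8}, regions::Tuple)
    dst_off, src_off, nbytes = first(regions)
    unsafe_copyto!(dst + dst_off, src + src_off, nbytes)
    return copy_each!(dst, src, Base.tail(regions))           # recurse on the rest
end

# Small demo on plain byte buffers.
function copy_demo()
    src = UInt8[0x01, 0x02, 0x03, 0x00]
    dst = zeros(UInt8, 4)
    GC.@preserve src dst copy_each!(pointer(dst), pointer(src), ((0, 0, 1), (2, 1, 2)))
    return dst
end
```

Because the tuple's length is part of its type, each recursive call is a separate specialization, which is what lets the compiler emit a flat sequence of copies.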

Additionally, I took care to avoid any recursion in the :foldable functions, since I think this would cause the compiler to cache each MethodInstance during the recursion. To do that, I converted from recursion to an explicit depth-first search. I found that this made an appreciable performance improvement to compile-times, and so I also applied the same optimization to Base.padding() which existed before this PR. That improved compile times by ~5x for that function.
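
As an illustration of replacing recursion with an explicit depth-first search, here is a minimal sketch that computes the packed (non-padding) byte regions of an isbits type. `packed_regions_sketch` is a hypothetical name, the sketch assumes every field type is concrete, and it is not the PR's `_packed_regions`.

```julia
# Hypothetical sketch: iterative DFS over a type's fields using an explicit stack.
# Returns (offset, length) pairs for runs of non-padding bytes, in ascending order.
function packed_regions_sketch(::Type{T}) where {T}
    regions = Tuple{Int,Int}[]
    stack = Tuple{DataType,Int}[(T, 0)]           # (type, absolute byte offset)
    while !isempty(stack)
        S, base = pop!(stack)
        if fieldcount(S) == 0
            sizeof(S) == 0 && continue            # empty struct or Tuple{}: nothing to copy
            if !isempty(regions) && regions[end][1] + regions[end][2] == base
                off, len = regions[end]           # contiguous with the previous run: merge
                regions[end] = (off, len + sizeof(S))
            else
                push!(regions, (base, sizeof(S)))
            end
        else
            # Push fields in reverse so that the first field is popped (visited) first.
            for i in fieldcount(S):-1:1
                push!(stack, (fieldtype(S, i), base + Int(fieldoffset(S, i))))
            end
        end
    end
    return regions
end
```

For example, `packed_regions_sketch(Tuple{UInt8, UInt16})` yields two regions, while `packed_regions_sketch(Tuple{UInt16, UInt8})` yields a single three-byte region because the two fields are adjacent.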

Here is one test showing the difference in compile times. This converts between two types with 33 fields, but different padding.

master:

julia> const Out = Tuple{UInt8, Int64, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Float64, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8};

julia> @time @eval reinterpret($Out, $((0x01, (0x03, (0x05, (0x07, (0x09, (0x0b, (0x0d, (0x0f, (0x11, (0x13, (0x15, (0x17, (0x19, (0x1b, (0x1d, (0x1f, (0x21, (0x23, (0x25, (0x27, (0x29, (0x2b, (0x2d, (0x2f, (0x31, (0x33, (0x35, (0x37, (0x39, (0x3b, (0x3d, 4991188238874984254, 0x46), 0x3c), 0x3a), 0x38), 0x36), 0x34), 0x32), 0x30), 0x2e), 0x2c), 0x2a), 0x28), 0x26), 0x24), 0x22), 0x20), 0x1e), 0x1c), 0x1a), 0x18), 0x16), 0x14), 0x12), 0x10), 0x0e), 0x0c), 0x0a), 0x08), 0x06), 0x04), 0x02), 1, 0x02, 0x00));
  0.924027 seconds (5.81 M allocations: 279.472 MiB, 17.23% gc time, 99.95% compilation time)

this PR:

julia> const Out = Tuple{UInt8, Int64, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Float64, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8};

julia> @time @eval reinterpret($Out, $((0x01, (0x03, (0x05, (0x07, (0x09, (0x0b, (0x0d, (0x0f, (0x11, (0x13, (0x15, (0x17, (0x19, (0x1b, (0x1d, (0x1f, (0x21, (0x23, (0x25, (0x27, (0x29, (0x2b, (0x2d, (0x2f, (0x31, (0x33, (0x35, (0x37, (0x39, (0x3b, (0x3d, 4991188238874984254, 0x46), 0x3c), 0x3a), 0x38), 0x36), 0x34), 0x32), 0x30), 0x2e), 0x2c), 0x2a), 0x28), 0x26), 0x24), 0x22), 0x20), 0x1e), 0x1c), 0x1a), 0x18), 0x16), 0x14), 0x12), 0x10), 0x0e), 0x0c), 0x0a), 0x08), 0x06), 0x04), 0x02), 1, 0x02, 0x00));
  1.230188 seconds (3.50 M allocations: 159.093 MiB, 7.44% gc time, 99.99% compilation time)

Benchmark Results

Here are some of the best-case highlights from the before and after on the reinterpret benchmarks added here: JuliaCI/BaseBenchmarks.jl#339:
Before this PR:

julia> @btime reinterpret(Tuple{Int16, Int8, Int64, Int8, Int64, Int64, Int8}, $((Int64(1), 0x0001, (0x01, Int64(2), 0x01), 0x01, 1.0)));
  2.768 μs (32 allocations: 960 bytes)

julia> @btime reinterpret(Tuple{UInt8, Int64, Vararg{UInt8, 100}}, $(ntuple(_->0x1, 100), 0x2, 3));
  1.146 μs (0 allocations: 0 bytes)

julia> @btime reinterpret(Tuple{UInt8, UInt64}, $(1, 0x2));
  45.121 ns (0 allocations: 0 bytes)

julia> struct ByteString0 end

julia> @btime reinterpret(Tuple{}, $(ByteString0()));
  2.833 ns (0 allocations: 0 bytes)

After this PR:

julia> @btime reinterpret(Tuple{Int16, Int8, Int64, Int8, Int64, Int64, Int8}, $((Int64(1), 0x0001, (0x01, Int64(2), 0x01), 0x01, 1.0)));
  3.500 ns (0 allocations: 0 bytes)

julia> @btime reinterpret(Tuple{UInt8, Int64, Vararg{UInt8, 100}}, $(ntuple(_->0x1, 100), 0x2, 3));
  3.208 ns (0 allocations: 0 bytes)

julia> @btime reinterpret(Tuple{UInt8, UInt64}, $(1, 0x2));
  1.959 ns (0 allocations: 0 bytes)

julia> struct ByteString0 end

julia> @btime reinterpret(Tuple{}, $(ByteString0()));
  1.125 ns (0 allocations: 0 bytes)

Full before/after here:

Benchmarks on 3b21c7f60d1
julia> versioninfo()
Julia Version 1.14.0-DEV.1386
Commit 3b21c7f60d1 (2025-12-18 16:29 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LLVM: libLLVM-20.1.8 (ORCJIT, apple-m2)
  GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 8 virtual cores)
Environment:
  JULIA_SSL_CA_ROOTS_PATH = 

julia> BaseBenchmarks.load!("reinterpret"); run(BaseBenchmarks.SUITE["reinterpret"])
4-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "mixed_tuples" => 4-element BenchmarkTools.BenchmarkGroup:
          tags: []
          (104, 104) => Trial(16.791 μs)
          (228, 228) => Trial(31.916 μs)
          (0, 0) => Trial(4.166 μs)
          (100, 100) => Trial(17.209 μs)
  "padded_to_padded" => 6-element BenchmarkTools.BenchmarkGroup:
          tags: []
          (29, 48, 56) => Trial(3.970 ms)
          (10, 24, 24) => Trial(9.041 μs)
          (29, 56, 48) => Trial(3.756 ms)
          (117, 128, 128) => Trial(17.475 ms)
          (10, 12, 24) => Trial(131.583 μs)
          (0, 0, 0) => Trial(4.250 μs)
  "packed_types" => 5-element BenchmarkTools.BenchmarkGroup:
          tags: []
          17 => Trial(9.459 μs)
          49 => Trial(12.708 μs)
          8 => Trial(12.042 μs)
          128 => Trial(19.292 μs)
          0 => Trial(4.125 μs)
  "padded_types" => 3-element BenchmarkTools.BenchmarkGroup:
          tags: []
          (29, 56) => Trial(3.720 ms)
          (10, 24) => Trial(73.292 μs)
          (117, 128) => Trial(99.875 μs)

Benchmarks on this PR
julia> versioninfo()
Julia Version 1.14.0-DEV.1400
Commit 53e69766b5* (2025-12-18 19:46 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin25.2.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LLVM: libLLVM-20.1.8 (ORCJIT, apple-m2)
  GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 8 virtual cores)
Environment:
  JULIA_SSL_CA_ROOTS_PATH = 

julia> BaseBenchmarks.load!("reinterpret"); run(BaseBenchmarks.SUITE["reinterpret"])
4-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "mixed_tuples" => 4-element BenchmarkTools.BenchmarkGroup:
          tags: []
          (104, 104) => Trial(16.208 μs)
          (228, 228) => Trial(31.167 μs)
          (0, 0) => Trial(958.000 ns)
          (100, 100) => Trial(16.541 μs)
  "padded_to_padded" => 6-element BenchmarkTools.BenchmarkGroup:
          tags: []
          (29, 48, 56) => Trial(15.834 μs)
          (10, 24, 24) => Trial(8.792 μs)
          (29, 56, 48) => Trial(16.584 μs)
          (117, 128, 128) => Trial(23.958 μs)
          (10, 12, 24) => Trial(10.292 μs)
          (0, 0, 0) => Trial(916.000 ns)
  "packed_types" => 5-element BenchmarkTools.BenchmarkGroup:
          tags: []
          17 => Trial(8.750 μs)
          49 => Trial(11.959 μs)
          8 => Trial(11.750 μs)
          128 => Trial(18.333 μs)
          0 => Trial(958.000 ns)
  "padded_types" => 3-element BenchmarkTools.BenchmarkGroup:
          tags: []
          (29, 56) => Trial(12.125 μs)
          (10, 24) => Trial(9.208 μs)
          (117, 128) => Trial(21.791 μs)

Move the complex logic into a preprocessing step, so that the actual
recursive function is _dead-simple_.
Apparently julia can fully specialize these kinds of for-loops over
types now! Neat.
Perf improvement on Base.padding (and Base.packedsize):
While-loop implementation of Base.padding. Makes `padding()` about 4x faster:

```
julia> @btime Base.padding($(type1(30)));
  34.125 μs (1175 allocations: 119.58 KiB)

julia> @btime Base.padding($(type1(30)));
  8.070 μs (170 allocations: 10.72 KiB)
```

KristofferC (Member) commented Dec 18, 2025

Put an AI to bang a bit on this (so take it for what it is worth), and it came up with this, which passes on master but fails here. Since there is no discussion about that, I guess it is unintended?

using Test

struct Inner
    x::UInt8
    y::UInt8
end

struct Outer1
    a::UInt32
    b::Inner
end

struct Outer2
    a::Inner
    b::UInt32
end

@testset "reinterpret padded region order" begin
    o1 = Outer1(0x04030201, Inner(0x05, 0x06))
    expected_o2 = Outer2(Inner(0x01, 0x02), 0x06050403)
    @test reinterpret(Outer2, o1) == expected_o2

    o2 = expected_o2
    expected_o1 = Outer1(0x04030201, Inner(0x05, 0x06))
    @test reinterpret(Outer1, o2) == expected_o1
end

🤖 🤖 :

  - packed_regions emits regions out of offset order for nested composite fields.
    match_packed_regions assumes ascending offsets, so it copies bytes in the wrong order and
    corrupts reinterpret results for nested structs. base/reinterpret.jl:70
  - The new iterative padding traversal can change the order of padding regions vs the old
    recursive walk for nested types. That can break padding(Out) == padding(In) and downstream
    struct_subpadding/array_subpadding decisions. base/reinterpretarray.jl:780
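
The expected values in the test above follow from walking the non-padding bytes in order. A small worked check, assuming a little-endian machine and using only array `reinterpret`:

```julia
# Worked check of the expected values (little-endian assumed).
# Outer1(0x04030201, Inner(0x05, 0x06)) has six non-padding bytes:
a_bytes = collect(reinterpret(UInt8, [0x04030201]))   # bytes of the UInt32 field
packed  = vcat(a_bytes, UInt8[0x05, 0x06])            # followed by Inner's two bytes

# Outer2 consumes the same six bytes as Inner (2 bytes), then UInt32 (4 bytes).
inner_fields = (packed[1], packed[2])
b_field      = reinterpret(UInt32, packed[3:6])[1]
```

This gives `inner_fields == (0x01, 0x02)` and `b_field == 0x06050403`, matching `expected_o2`.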

@NHDaly NHDaly force-pushed the nhd/reinterpret-padded-struct-performance branch from cbcfcef to 16b20a8 Compare December 18, 2025 22:26
NHDaly (Member, Author) commented Dec 18, 2025

Ah, yep! Thanks @KristofferC!! I actually just discovered exactly the same issue, and pushed up some broken tests. 👍
I think I know where the bug is (it's in the order of the traversal in the depth-first search).

I'll push up a fix in the next couple days.
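
For readers unfamiliar with this class of bug, here is a generic sketch of how a stack-based depth-first search can visit leaves in the wrong order. It is only an illustration on nested tuples, with a hypothetical `leaves` function; it does not claim to reproduce the actual bug in this PR.

```julia
# Generic illustration: leaf order of a stack-based DFS depends on push order.
function leaves(tree; reverse_push::Bool)
    out = Int[]
    stack = Any[tree]
    while !isempty(stack)
        node = pop!(stack)
        if node isa Tuple
            kids = reverse_push ? reverse(node) : node
            for k in kids
                push!(stack, k)       # the last child pushed is the first one popped
            end
        else
            push!(out, node)
        end
    end
    return out
end

in_order  = leaves((1, (2, 3), 4); reverse_push = true)    # children pushed last-to-first
backwards = leaves((1, (2, 3), 4); reverse_push = false)   # children pushed first-to-last
```

Pushing children last-to-first visits the leaves left to right; pushing them first-to-last visits them right to left, which for field offsets would mean descending rather than ascending order.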

NHDaly (Member, Author) commented Dec 18, 2025

Okay, I've pushed a fix! 😊 Thanks for the report.

EDIT: Though after that fix, the compilation time went up a bit:

julia> const Out = Tuple{UInt8, Int64, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Tuple{UInt8, Float64, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8}, UInt8};

julia> @time @eval reinterpret($Out, $((0x01, (0x03, (0x05, (0x07, (0x09, (0x0b, (0x0d, (0x0f, (0x11, (0x13, (0x15, (0x17, (0x19, (0x1b, (0x1d, (0x1f, (0x21, (0x23, (0x25, (0x27, (0x29, (0x2b, (0x2d, (0x2f, (0x31, (0x33, (0x35, (0x37, (0x39, (0x3b, (0x3d, 4991188238874984254, 0x46), 0x3c), 0x3a), 0x38), 0x36), 0x34), 0x32), 0x30), 0x2e), 0x2c), 0x2a), 0x28), 0x26), 0x24), 0x22), 0x20), 0x1e), 0x1c), 0x1a), 0x18), 0x16), 0x14), 0x12), 0x10), 0x0e), 0x0c), 0x0a), 0x08), 0x06), 0x04), 0x02), 1, 0x02, 0x00));
  1.950291 seconds (5.81 M allocations: 258.578 MiB, 11.26% gc time, 99.99% compilation time)

I'd love to try to figure out how to get that back down again.

@NHDaly NHDaly requested a review from Copilot December 19, 2025 05:29

Copilot AI left a comment

Pull request overview

This PR significantly improves the performance of reinterpret for types with internal padding by generating specialized memory copy instructions at compile time, rather than using runtime reflection and loops. The key innovation is computing precise packed regions for each type and generating targeted unsafe_copyto! calls for each region.

Key changes:

  • Replaced runtime reflection-based copying with compile-time generation of precise memcopy instructions for packed regions
  • Optimized padding() and introduced _packed_regions() to use iterative depth-first traversal instead of recursion, reducing compiler overhead by ~5x
  • Added comprehensive test coverage for padded-to-padded, padded-to-packed, and packed-to-padded conversions

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| base/reinterpret.jl | New file implementing the optimized reinterpret logic with packed region computation and specialized memory copying |
| base/reinterpretarray.jl | Removed old runtime-reflection-based implementation, optimized padding() to use iterative traversal, simplified ispacked() to read from struct metadata |
| base/Base.jl | Added include for the new reinterpret.jl file |
| test/reinterpretarray.jl | Added comprehensive test cases for various reinterpret scenarios including padded structs, nested tuples, and edge cases |

@NHDaly NHDaly requested a review from BioTurboNick December 23, 2025 19:15
Comment thread base/reinterpret.jl
end

# Simple memcopy between two types of the same size.
@inline function byte_cast(::Type{T}, x::V) where {T,V}
Contributor

I'm having trouble understanding why these three separate methods are necessary. Other than the assert, byte_cast and _byte_cast_smaller_src are identical. And the test code has a test case that hits _byte_cast_smaller_src but doesn't hit byte_cast or _byte_cast_smaller_dst: reinterpret(Tuple{Int64, Int64, Int8}, ntuple(_->0x1, 17))

I do see that the _dst variant is slightly slower due to the source Ref/preserving the pointer, but why is that necessary here but not needed in the smaller and equal cases?

Member Author

Yeah, you're right, the first two are identical.

If we were only manually copying bytes between types of the same size, these wouldn't be needed. But if they have different sizes, we have to do things differently, to make sure we don't accidentally overflow a buffer:

If the source is smaller, we are guaranteed that the dst pointer will have "room" for the entire source, so we can cast the dst pointer into a SourceType pointer, and simply store the source into that pointer.

Whereas if the dst is smaller, we need to make sure that we are truncating the source value. To do that, we need to cast the source pointer into a DestType pointer, and then read the value from that pointer, and write the value into the dst pointer.

Note that we can't use that same code in the first case, since if the src is smaller, attempting to cast it into a DestType pointer and then read from it, could possibly read past the valid buffer, still causing a segfault.

So since the smaller-dst and smaller-src variants are both needed, I figured it made sense to have all three, even though the code is the same in the first two: if the types have the same size, you can use either approach, and I picked the "simpler" one.
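
The two directions described above can be sketched like this. The function names are hypothetical and the bodies are a simplified reading of the explanation, not the PR's `byte_cast` methods; the values in the usage note assume a little-endian machine.

```julia
# Hypothetical sketch of the two cast directions; not the PR's code.

# Source no larger than destination: view the destination as a pointer to the
# source type and store the whole source there. Trailing bytes stay undefined.
function store_into_larger(::Type{Out}, x::In) where {Out, In}
    @assert sizeof(In) <= sizeof(Out)
    dst = Ref{Out}()
    GC.@preserve dst begin
        p = convert(Ptr{In}, Base.unsafe_convert(Ptr{Out}, dst))
        unsafe_store!(p, x)
    end
    return dst[]
end

# Destination no larger than source: view the source as a pointer to the
# destination type and load from it, which truncates the source.
function load_truncated(::Type{Out}, x::In) where {Out, In}
    @assert sizeof(Out) <= sizeof(In)
    src = Ref{In}(x)
    out = GC.@preserve src begin
        p = convert(Ptr{Out}, Base.unsafe_convert(Ptr{In}, src))
        unsafe_load(p)
    end
    return out
end
```

With these definitions, `load_truncated(UInt16, 0x04030201)` keeps the first two bytes, and `store_into_larger(UInt64, 0x04030201)` fills only the first four bytes of the result. Swapping the two would read or write past the smaller value, which is the overflow the comment warns about.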

Contributor

Ok. As for the duplication, I suppose for someone else coming to the code it just makes it a bit harder to understand the design if there's code duplication without a clear reason for it.

Comment thread test/reinterpretarray.jl
Contributor

Perhaps move to a reinterpret.jl file to mirror the base reorg?


Labels

performance Must go faster


7 participants