offload frontend draft impl by Sa4dUs · Pull Request #8 · ZuseZ4/rust_perf

Sa4dUs · 2026-05-20T13:05:57Z

some things that would make this better

maybe expose the index to the user? (but without allowing direct indexing), it just makes things easier
cfg clause to determine whether the code is being run on host or device

Sa4dUs · 2026-05-20T15:12:33Z

+}
+
+impl PartitioningStrategy for Linear1D {
+    fn get_mut<'a, T>(data: &'a mut [T]) -> Option<&'a mut T> {


been thinking of returning just the index here, simplifies the 1 element indexing but for slicing the container in smaller parts (not elements) we need something like this

the end-user is not affected so ig it's fine

Sa4dUs · 2026-05-20T17:51:06Z

+}
+
+struct Region<'a, T, S> {
+    data: &'a mut [T],


also, wondering if we should restrict to slices here, we can do more abstract but i'm afrait of it getting to clunky for the user

Sa4dUs · 2026-05-21T15:10:30Z

+        #[cfg(target_os = "linux")]
+        let i = 0;
+        if i < data.len() {
+            Some(&mut data[i])


Suggested change

Some(&mut data[i])

unsafe { Some(&mut *data.as_mut_ptr().add(i)) }

would this be correct?

Sa4dUs · 2026-05-21T15:20:48Z

+fn main() {
+    let mut x = [0.0f64; 256];
+    let mut reg = Region::<_, Linear1D>::new(&mut x);
+    core::intrinsics::offload::<_, _, ()>(foo, [1, 1, 1], [256, 1, 1], (&mut reg,));


on the offload! macro, maybe not exposing block and grid size and calc it based on mutable args? we can also just leave it as a clippy lint (clippy: you have X threads doing absolutely nothing! xD) though

Sa4dUs · 2026-05-23T11:17:57Z

when trying to implement strided and stencil patterns as johannes suggested on the last meeting, i noticed some flaws on my previous impl, i think we now have a better usage of raw ptrs

the code was verified on my NVIDIA GeForce RTX 4050 laptop

Sa4dUs · 2026-05-23T11:20:22Z

+    type ViewMut<'a, T: 'a> = StencilViewMut<'a, T>;
+
+    unsafe fn get<'a, T>(_: *const T, _: usize, _: Self::Shape) -> Option<Self::View<'a, T>> {
+        unimplemented!()


not sure what to do with this, i guess we can just add raw indexing on the original slice but while using this pattern, the user is not supposed to access those elements right?

ZuseZ4 · 2026-05-23T12:09:51Z

+    Dim3 { x: 0, y: 0, z: 0 }
+}
+
+pub trait PartitioningStrategy {


That trait should be unsafe, a bug here easily will cause UB.

(Still open, in case you forgot)

ZuseZ4 · 2026-05-23T12:12:21Z

This looks great so far. A little different than what I had in mind, but it looks generally nice, I'll check it in more detail later.

ZuseZ4 · 2026-05-23T12:13:28Z

+        }
+    }
+    #[cfg(target_os = "linux")]
+    Dim3 { x: 0, y: 0, z: 0 }


Future extension, run that in a threadpool and under miri. That should catch some bugs in implementations.

ZuseZ4 · 2026-05-23T12:20:45Z

+        1.0, 1.0, 1.0, 1.0,
+    ];
+    let mut reg_stencil = Region::<_, Stencil2D<1>>::new(&mut grid, (4, 4));
+    core::intrinsics::offload::<_, _, ()>(stencil2d, [1, 1, 1], [2, 2, 1], (&mut reg_stencil,));


Nit: I think that's stencil1d, or?
2d would be

1

+1+4+1
+1

ZuseZ4 · 2026-05-24T07:43:36Z

I think your confusion and the unimplemented parts are because you got confused for the stencil operations. We don't need to split inputs, just the output.
You can try to visualize if you write it either with a separate output image, in which case it's just another scalar case.
You can also overwrite the input image, in that case you still need to split it as scalar, but also need a syncthreads, after computing the new value, before writing it (so neighbouring threads get to read the old value).
https://rocm.docs.amd.com/projects/HIP/en/develop/tutorial/programming-patterns/stencil_operations.html#the-convolution-kernel is an AMD example, though I think it's lacking the syncthreads before the final write.

I'd probably just focus on the scalar and batched output case, where e.g. each thread writes to [0..3], [4..7], etc.

Sa4dUs · 2026-05-26T10:52:11Z

+    // core::intrinsics::offload::<_, _, ()>(linear1d, [1, 1, 1], [256, 1, 1], (&mut reg,));
+    offload! {
+        kernel = linear1d,
+        grid_dim = [256, 1, 1],


should be block_dim to match the prev intrinisc call, but just to test wider variety of macro fields

ZuseZ4 · 2026-05-26T12:49:57Z

+        block_dim = [2, 2, 1],
+        args = (&mut reg_stencil,),
+    };
+    // thread (0, 0, 0) will have center on (x, y) = 1 (index = 5), so (1 + 4 + 1) / 3 = 20


(1 + 4 + 1) / 3 = 20
interesting :D

ZuseZ4 · 2026-05-26T13:05:11Z

+        unsafe {
+            &*self
+                .base_ptr
+                .offset((self.center_idx as isize) + (oy * self.cols as isize) + ox)


This is UB.
We have one thread A set its center to field 4.
We have another thread B set its center to field 5.
Now, from A you use get_neighbour, which gives you a const reference to field 5.
You then use B.get_mut() to overwrite field 5. The value under a const ref just changed, that is UB since you don't have any synchronization. We should talk Wednesday, but I think you won't be able to safely and soundly support this.

ZuseZ4 · 2026-05-26T13:08:43Z

+        let mid = *view.get_neighbour(0, 0);
+        let left = *view.get_neighbour(-1, 0);
+        let right = *view.get_neighbour(1, 0);
+        view.set_center((left + mid + right) / 3.0);


This races and is therefore UB. It's unfortunate if that has not shown in tests.

You'd need a barrier right before set_center.
We can talk Wednesday, but it effectively still means all writes like set_center have to be unsafe, which is what we're trying to avoid.

Is this file outdated and intended to be deleted? How does it compare to crates/rustc_offload_frontend/src/main.rs? I find the current PR separation a bit confusing.

outdated, will do all frontend related tasks in crates/rustc_offload_frontend, removing it on the next commit

ZuseZ4 · 2026-05-26T13:23:17Z

That's a good number of strategies ^^ Still, at least a few are unsound.

Can you please move the trait and impl's into a lib, which you then just include in your main? That way, we can mimic how it would look like if we were to include it into the stdlib.

Also, all of your examples only show single-argument kernels, we'll need some demonstration of multi-arg kernels. I think that would also make things simpler, so we have more sound example. E.g. the stencils could have one input, and one output slice, so it's not ub.

Sa4dUs · 2026-05-27T22:10:12Z

doing the tests here now, just not to mix the perf suite with random frontend examples

lmk when it's ready for review.

should be ready for another pass, quick summary of what i did

fix stencil case with separate input and output args so it's not UB

array example (saxpy)

RawRegion so the user doesn't need to worry abt different cases (&mut [T], &mut [T; N], etc)

i'm using Region even for not mutable elements, it doesn't introduce extra complexity to the implementation and it's something the users are gonna need to write kernels (we can also just expose indexes as cuda-oxide, but idk, to be consistent with our setup)

before leaving, i'll use our safe frontend in rajaperf test cases, with that we should have enough variety of examples.

also, when writing those kernels, in case the design has any other flaws (user interface wise) i expect them to arise there

we can also just expose indexes

More the other way around, I don't think we can avoid that. People can always call the underlying intrinsics themselves. But I also don't we should try and hide it, since I don't think we'll cover all cases. If 80% are expressible with our abstractions that's (more than) good enough, people can (and should) add their own ones.

i meant inside the pattern, like Linear1D::index() and returns thread_x + block_x * block_dim_x, but yes, i agree

Sa4dUs · 2026-05-28T10:02:45Z

+    println!("GPU bits: {:064b} value: {:?}", x[0].to_bits(), x[0]);
+    println!("CPU bits: {:064b} value: {:?}", 42.0f64.to_bits(), 42.0);
+    for i in 0..x.len() {
+        assert_eq!(x[i], 42.0 as f64);
+    }


i probably messed things up while modifying Region to accept &mut [T], &[T], &[T; N], &mut [T; N]
but i'd expect out buffer is set to 0.0 or some random trash value if things didn't went as expected. instead, printing and using it says that x[0] = 42.0 but, although i'm not an expert in computer architecture, 0000000000000000000000000000000000000000000000000000000000000000 is not 42.0

GPU bits: 0000000000000000000000000000000000000000000000000000000000000000 value: 42.0 CPU bits: 0100000001000101000000000000000000000000000000000000000000000000 value: 42.0 thread 'main' (21919) panicked at src/main.rs:71:9: assertion `left == right` failed left: 42.0 right: 42.0 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace Aborted (core dumped)

~~maybe the compiler is propagating the constant when seeing the assert statement?~~ no it's not, when removing the assert_eq line it remains the same lol

That looks like something UB would cause :D
I'll have another look in an hour or so.

already solved :)

Sa4dUs · 2026-05-31T15:03:25Z

+            &*self
+                .base_ptr
+                .offset((self.center_idx as isize) + (oy * self.cols as isize) + ox)


miri detected UB here

error: Undefined Behavior: in-bounds pointer arithmetic failed: attempting to offset pointer by -48 bytes, but got alloc52791 which is at the beginning of the allocation --> src/partition.rs:175:15 | 175 | &*self | _______________^ 176 | | .base_ptr 177 | | .offset((self.center_idx as isize) + (oy * self.cols as isize) + ox) | |____________________________________________________________________________________^ Undefined Behavior occurred here |

Glad we got miri. Yeah, you'll need to check if the final offset is in [0, len). But it isn't enough to fix the transfer weirdness, or?

ZuseZ4 · 2026-06-05T06:11:15Z

-            0,
-            (
-                self.phidat as *mut [Real; 390400],
+        let mut phidat = unsafe { &mut *(self.phidat as *mut [Real; 390400]) };


Uuh, this looks scary. Can you please write out the types these unsafe cases, to be a bit more explicit? I think that would have also caught a bug.

In general I think the pattern is ok, but I you got the indirections wrong. I made the same mistake when porting ADBench to rust/autodiff -.-

I think you have
let mut phidat: &mut [Real; 390400] = unsafe { &mut *(self.phidat as *mut [Real; 390400]) };
Therefore you call preload_mut with the type
&mut &mut [Real; 390400] which I think could be part of the reason why it misscompiles in release (?)
Tbf the preload function should also reject it, since it doesn't match what we're expecting.

offload frontend draft impl

25bacad

Sa4dUs commented May 20, 2026

View reviewed changes

ZuseZ4 reviewed May 21, 2026

View reviewed changes

Comment thread src/base/frontend.rs Outdated

Sa4dUs commented May 21, 2026

View reviewed changes

add more indexing patterns and modify design

98cbdfb

Sa4dUs commented May 23, 2026

View reviewed changes

ZuseZ4 reviewed May 23, 2026

View reviewed changes

add offload! macro

154244d

Sa4dUs commented May 26, 2026

View reviewed changes

ZuseZ4 reviewed May 26, 2026

View reviewed changes

Sa4dUs added 4 commits May 26, 2026 18:01

add offload frontend submodule

a70c194

use rustc_offload_frontend crate

47bf051

fix stencil add linear2d and blur example

cc763c0

show rustc_offload_frontend submodule file tree on changes

2361ad7

Sa4dUs commented May 27, 2026

View reviewed changes

strange bug, we should look at this at some point

0d27058

Sa4dUs commented May 28, 2026

View reviewed changes

allow mut and inmut slices and arrays

1b80aba

ZuseZ4 reviewed May 28, 2026

View reviewed changes

Comment thread crates/rustc_offload_frontend/src/partition.rs

ZuseZ4 mentioned this pull request May 28, 2026

tracking issue - needed PR #9

Open

Sa4dUs commented May 31, 2026

View reviewed changes

Sa4dUs added 11 commits May 31, 2026 21:57

fix ub and update to rl/r main

1998105

remove from non-mut args

da2456b

move memory with preload and remove double indirection

30fad32

update intrinsic calls for dyn_cache arg

89d985e

update perf suite to match rustc_offload_frontend style

4ca96bb

move fir to new frontend

aa381b4

move energy to new frontend + format

e3e6c9d

move pressure to new frontend

d319bac

vol3d to frontend

38c4dd5

use macros on del_dot_vec_2d, ltimes and matvec_3d_stencil

67940cf

complete ltimes

d95bb4e

ZuseZ4 reviewed Jun 5, 2026

View reviewed changes

	Some(&mut data[i])
	unsafe { Some(&mut *data.as_mut_ptr().add(i)) }

Conversation

Sa4dUs commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Sa4dUs commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ZuseZ4 commented May 23, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ZuseZ4 commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ZuseZ4 commented May 26, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Sa4dUs May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Sa4dUs commented May 20, 2026 •

edited

Loading

Sa4dUs commented May 23, 2026 •

edited

Loading

ZuseZ4 commented May 24, 2026 •

edited

Loading

Sa4dUs May 28, 2026 •

edited

Loading