Skip to content

offload frontend draft impl#8

Draft
Sa4dUs wants to merge 20 commits into
mainfrom
rustc_frontend
Draft

offload frontend draft impl#8
Sa4dUs wants to merge 20 commits into
mainfrom
rustc_frontend

Conversation

@Sa4dUs
Copy link
Copy Markdown
Collaborator

@Sa4dUs Sa4dUs commented May 20, 2026

some things that would make this better

  • maybe expose the index to the user? (but without allowing direct indexing), it just makes things easier
  • cfg clause to determine whether the code is being run on host or device

cc @ZuseZ4

Comment thread src/base/frontend.rs Outdated
}

impl PartitioningStrategy for Linear1D {
fn get_mut<'a, T>(data: &'a mut [T]) -> Option<&'a mut T> {
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

been thinking of returning just the index here, simplifies the 1 element indexing but for slicing the container in smaller parts (not elements) we need something like this

the end-user is not affected so ig it's fine

Comment thread src/base/frontend.rs Outdated
}

struct Region<'a, T, S> {
data: &'a mut [T],
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, wondering if we should restrict to slices here, we can do more abstract but i'm afrait of it getting to clunky for the user

Comment thread src/base/frontend.rs Outdated
Comment thread src/base/frontend.rs Outdated
#[cfg(target_os = "linux")]
let i = 0;
if i < data.len() {
Some(&mut data[i])
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Some(&mut data[i])
unsafe { Some(&mut *data.as_mut_ptr().add(i)) }

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this be correct?

Comment thread src/base/frontend.rs Outdated
fn main() {
let mut x = [0.0f64; 256];
let mut reg = Region::<_, Linear1D>::new(&mut x);
core::intrinsics::offload::<_, _, ()>(foo, [1, 1, 1], [256, 1, 1], (&mut reg,));
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the offload! macro, maybe not exposing block and grid size and calc it based on mutable args? we can also just leave it as a clippy lint (clippy: you have X threads doing absolutely nothing! xD) though

@Sa4dUs
Copy link
Copy Markdown
Collaborator Author

Sa4dUs commented May 23, 2026

when trying to implement strided and stencil patterns as johannes suggested on the last meeting, i noticed some flaws on my previous impl, i think we now have a better usage of raw ptrs

the code was verified on my NVIDIA GeForce RTX 4050 laptop

Comment thread src/base/frontend.rs Outdated
type ViewMut<'a, T: 'a> = StencilViewMut<'a, T>;

unsafe fn get<'a, T>(_: *const T, _: usize, _: Self::Shape) -> Option<Self::View<'a, T>> {
unimplemented!()
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what to do with this, i guess we can just add raw indexing on the original slice but while using this pattern, the user is not supposed to access those elements right?

Comment thread src/base/frontend.rs Outdated
Dim3 { x: 0, y: 0, z: 0 }
}

pub trait PartitioningStrategy {
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That trait should be unsafe, a bug here easily will cause UB.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Still open, in case you forgot)

@ZuseZ4
Copy link
Copy Markdown
Owner

ZuseZ4 commented May 23, 2026

This looks great so far. A little different than what I had in mind, but it looks generally nice, I'll check it in more detail later.

Comment thread src/base/frontend.rs Outdated
}
}
#[cfg(target_os = "linux")]
Dim3 { x: 0, y: 0, z: 0 }
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future extension, run that in a threadpool and under miri. That should catch some bugs in implementations.

Comment thread src/base/frontend.rs Outdated
1.0, 1.0, 1.0, 1.0,
];
let mut reg_stencil = Region::<_, Stencil2D<1>>::new(&mut grid, (4, 4));
core::intrinsics::offload::<_, _, ()>(stencil2d, [1, 1, 1], [2, 2, 1], (&mut reg_stencil,));
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think that's stencil1d, or?
2d would be

   1

+1+4+1
+1

@ZuseZ4
Copy link
Copy Markdown
Owner

ZuseZ4 commented May 24, 2026

I think your confusion and the unimplemented parts are because you got confused for the stencil operations. We don't need to split inputs, just the output.
You can try to visualize if you write it either with a separate output image, in which case it's just another scalar case.
You can also overwrite the input image, in that case you still need to split it as scalar, but also need a syncthreads, after computing the new value, before writing it (so neighbouring threads get to read the old value).
https://rocm.docs.amd.com/projects/HIP/en/develop/tutorial/programming-patterns/stencil_operations.html#the-convolution-kernel is an AMD example, though I think it's lacking the syncthreads before the final write.

I'd probably just focus on the scalar and batched output case, where e.g. each thread writes to [0..3], [4..7], etc.

Comment thread src/base/frontend.rs Outdated
// core::intrinsics::offload::<_, _, ()>(linear1d, [1, 1, 1], [256, 1, 1], (&mut reg,));
offload! {
kernel = linear1d,
grid_dim = [256, 1, 1],
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be block_dim to match the prev intrinisc call, but just to test wider variety of macro fields

Comment thread src/base/frontend.rs Outdated
block_dim = [2, 2, 1],
args = (&mut reg_stencil,),
};
// thread (0, 0, 0) will have center on (x, y) = 1 (index = 5), so (1 + 4 + 1) / 3 = 20
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1 + 4 + 1) / 3 = 20
interesting :D

Comment thread src/base/frontend.rs Outdated
unsafe {
&*self
.base_ptr
.offset((self.center_idx as isize) + (oy * self.cols as isize) + ox)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is UB.
We have one thread A set its center to field 4.
We have another thread B set its center to field 5.
Now, from A you use get_neighbour, which gives you a const reference to field 5.
You then use B.get_mut() to overwrite field 5. The value under a const ref just changed, that is UB since you don't have any synchronization. We should talk Wednesday, but I think you won't be able to safely and soundly support this.

Comment thread src/base/frontend.rs Outdated
let mid = *view.get_neighbour(0, 0);
let left = *view.get_neighbour(-1, 0);
let right = *view.get_neighbour(1, 0);
view.set_center((left + mid + right) / 3.0);
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This races and is therefore UB. It's unfortunate if that has not shown in tests.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd need a barrier right before set_center.
We can talk Wednesday, but it effectively still means all writes like set_center have to be unsafe, which is what we're trying to avoid.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this file outdated and intended to be deleted? How does it compare to crates/rustc_offload_frontend/src/main.rs? I find the current PR separation a bit confusing.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outdated, will do all frontend related tasks in crates/rustc_offload_frontend, removing it on the next commit

@ZuseZ4
Copy link
Copy Markdown
Owner

ZuseZ4 commented May 26, 2026

That's a good number of strategies ^^ Still, at least a few are unsound.

Can you please move the trait and impl's into a lib, which you then just include in your main? That way, we can mimic how it would look like if we were to include it into the stdlib.

Also, all of your examples only show single-argument kernels, we'll need some demonstration of multi-arg kernels. I think that would also make things simpler, so we have more sound example. E.g. the stencils could have one input, and one output slice, so it's not ub.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doing the tests here now, just not to mix the perf suite with random frontend examples

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lmk when it's ready for review.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be ready for another pass, quick summary of what i did

  • fix stencil case with separate input and output args so it's not UB
  • array example (saxpy)
  • RawRegion so the user doesn't need to worry abt different cases (&mut [T], &mut [T; N], etc)
  • i'm using Region even for not mutable elements, it doesn't introduce extra complexity to the implementation and it's something the users are gonna need to write kernels (we can also just expose indexes as cuda-oxide, but idk, to be consistent with our setup)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before leaving, i'll use our safe frontend in rajaperf test cases, with that we should have enough variety of examples.

also, when writing those kernels, in case the design has any other flaws (user interface wise) i expect them to arise there

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can also just expose indexes

More the other way around, I don't think we can avoid that. People can always call the underlying intrinsics themselves. But I also don't we should try and hide it, since I don't think we'll cover all cases. If 80% are expressible with our abstractions that's (more than) good enough, people can (and should) add their own ones.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i meant inside the pattern, like Linear1D::index() and returns thread_x + block_x * block_dim_x, but yes, i agree

Comment on lines +68 to +72
println!("GPU bits: {:064b} value: {:?}", x[0].to_bits(), x[0]);
println!("CPU bits: {:064b} value: {:?}", 42.0f64.to_bits(), 42.0);
for i in 0..x.len() {
assert_eq!(x[i], 42.0 as f64);
}
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i probably messed things up while modifying Region to accept &mut [T], &[T], &[T; N], &mut [T; N]
but i'd expect out buffer is set to 0.0 or some random trash value if things didn't went as expected. instead, printing and using it says that x[0] = 42.0 but, although i'm not an expert in computer architecture, 0000000000000000000000000000000000000000000000000000000000000000 is not 42.0

GPU bits: 0000000000000000000000000000000000000000000000000000000000000000 value: 42.0
CPU bits: 0100000001000101000000000000000000000000000000000000000000000000 value: 42.0

thread 'main' (21919) panicked at src/main.rs:71:9:
assertion `left == right` failed
  left: 42.0
 right: 42.0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted (core dumped)

Copy link
Copy Markdown
Collaborator Author

@Sa4dUs Sa4dUs May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe the compiler is propagating the constant when seeing the assert statement? no it's not, when removing the assert_eq line it remains the same lol

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks like something UB would cause :D
I'll have another look in an hour or so.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

already solved :)

Comment thread crates/rustc_offload_frontend/src/partition.rs
Comment on lines +175 to +177
&*self
.base_ptr
.offset((self.center_idx as isize) + (oy * self.cols as isize) + ox)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

miri detected UB here

error: Undefined Behavior: in-bounds pointer arithmetic failed: attempting to offset pointer by -48 bytes, but got alloc52791 which is at the beginning of the allocation
  --> src/partition.rs:175:15
   |
175 |               &*self
   |  _______________^
176 | |                 .base_ptr
177 | |                 .offset((self.center_idx as isize) + (oy * self.cols as isize) + ox)
   | |____________________________________________________________________________________^ Undefined Behavior occurred here
   |

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad we got miri. Yeah, you'll need to check if the final offset is in [0, len). But it isn't enough to fix the transfer weirdness, or?

Comment thread src/apps/ltimes.rs
0,
(
self.phidat as *mut [Real; 390400],
let mut phidat = unsafe { &mut *(self.phidat as *mut [Real; 390400]) };
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uuh, this looks scary. Can you please write out the types these unsafe cases, to be a bit more explicit? I think that would have also caught a bug.

In general I think the pattern is ok, but I you got the indirections wrong. I made the same mistake when porting ADBench to rust/autodiff -.-

I think you have
let mut phidat: &mut [Real; 390400] = unsafe { &mut *(self.phidat as *mut [Real; 390400]) };
Therefore you call preload_mut with the type
&mut &mut [Real; 390400] which I think could be part of the reason why it misscompiles in release (?)
Tbf the preload function should also reject it, since it doesn't match what we're expecting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants