offload frontend draft impl#8
Conversation
| } | ||
|
|
||
| impl PartitioningStrategy for Linear1D { | ||
| fn get_mut<'a, T>(data: &'a mut [T]) -> Option<&'a mut T> { |
There was a problem hiding this comment.
been thinking of returning just the index here, simplifies the 1 element indexing but for slicing the container in smaller parts (not elements) we need something like this
the end-user is not affected so ig it's fine
| } | ||
|
|
||
| struct Region<'a, T, S> { | ||
| data: &'a mut [T], |
There was a problem hiding this comment.
also, wondering if we should restrict to slices here, we can do more abstract but i'm afrait of it getting to clunky for the user
| #[cfg(target_os = "linux")] | ||
| let i = 0; | ||
| if i < data.len() { | ||
| Some(&mut data[i]) |
There was a problem hiding this comment.
| Some(&mut data[i]) | |
| unsafe { Some(&mut *data.as_mut_ptr().add(i)) } |
There was a problem hiding this comment.
would this be correct?
| fn main() { | ||
| let mut x = [0.0f64; 256]; | ||
| let mut reg = Region::<_, Linear1D>::new(&mut x); | ||
| core::intrinsics::offload::<_, _, ()>(foo, [1, 1, 1], [256, 1, 1], (&mut reg,)); |
There was a problem hiding this comment.
on the offload! macro, maybe not exposing block and grid size and calc it based on mutable args? we can also just leave it as a clippy lint (clippy: you have X threads doing absolutely nothing! xD) though
|
when trying to implement strided and stencil patterns as johannes suggested on the last meeting, i noticed some flaws on my previous impl, i think we now have a better usage of raw ptrs the code was verified on my NVIDIA GeForce RTX 4050 laptop |
| type ViewMut<'a, T: 'a> = StencilViewMut<'a, T>; | ||
|
|
||
| unsafe fn get<'a, T>(_: *const T, _: usize, _: Self::Shape) -> Option<Self::View<'a, T>> { | ||
| unimplemented!() |
There was a problem hiding this comment.
not sure what to do with this, i guess we can just add raw indexing on the original slice but while using this pattern, the user is not supposed to access those elements right?
| Dim3 { x: 0, y: 0, z: 0 } | ||
| } | ||
|
|
||
| pub trait PartitioningStrategy { |
There was a problem hiding this comment.
That trait should be unsafe, a bug here easily will cause UB.
|
This looks great so far. A little different than what I had in mind, but it looks generally nice, I'll check it in more detail later. |
| } | ||
| } | ||
| #[cfg(target_os = "linux")] | ||
| Dim3 { x: 0, y: 0, z: 0 } |
There was a problem hiding this comment.
Future extension, run that in a threadpool and under miri. That should catch some bugs in implementations.
| 1.0, 1.0, 1.0, 1.0, | ||
| ]; | ||
| let mut reg_stencil = Region::<_, Stencil2D<1>>::new(&mut grid, (4, 4)); | ||
| core::intrinsics::offload::<_, _, ()>(stencil2d, [1, 1, 1], [2, 2, 1], (&mut reg_stencil,)); |
There was a problem hiding this comment.
Nit: I think that's stencil1d, or?
2d would be
1
+1+4+1
+1
|
I think your confusion and the unimplemented parts are because you got confused for the stencil operations. We don't need to split inputs, just the output. I'd probably just focus on the scalar and batched output case, where e.g. each thread writes to [0..3], [4..7], etc. |
| // core::intrinsics::offload::<_, _, ()>(linear1d, [1, 1, 1], [256, 1, 1], (&mut reg,)); | ||
| offload! { | ||
| kernel = linear1d, | ||
| grid_dim = [256, 1, 1], |
There was a problem hiding this comment.
should be block_dim to match the prev intrinisc call, but just to test wider variety of macro fields
| block_dim = [2, 2, 1], | ||
| args = (&mut reg_stencil,), | ||
| }; | ||
| // thread (0, 0, 0) will have center on (x, y) = 1 (index = 5), so (1 + 4 + 1) / 3 = 20 |
There was a problem hiding this comment.
(1 + 4 + 1) / 3 = 20
interesting :D
| unsafe { | ||
| &*self | ||
| .base_ptr | ||
| .offset((self.center_idx as isize) + (oy * self.cols as isize) + ox) |
There was a problem hiding this comment.
This is UB.
We have one thread A set its center to field 4.
We have another thread B set its center to field 5.
Now, from A you use get_neighbour, which gives you a const reference to field 5.
You then use B.get_mut() to overwrite field 5. The value under a const ref just changed, that is UB since you don't have any synchronization. We should talk Wednesday, but I think you won't be able to safely and soundly support this.
| let mid = *view.get_neighbour(0, 0); | ||
| let left = *view.get_neighbour(-1, 0); | ||
| let right = *view.get_neighbour(1, 0); | ||
| view.set_center((left + mid + right) / 3.0); |
There was a problem hiding this comment.
This races and is therefore UB. It's unfortunate if that has not shown in tests.
There was a problem hiding this comment.
You'd need a barrier right before set_center.
We can talk Wednesday, but it effectively still means all writes like set_center have to be unsafe, which is what we're trying to avoid.
There was a problem hiding this comment.
Is this file outdated and intended to be deleted? How does it compare to crates/rustc_offload_frontend/src/main.rs? I find the current PR separation a bit confusing.
There was a problem hiding this comment.
outdated, will do all frontend related tasks in crates/rustc_offload_frontend, removing it on the next commit
|
That's a good number of strategies ^^ Still, at least a few are unsound. Can you please move the trait and impl's into a lib, which you then just include in your main? That way, we can mimic how it would look like if we were to include it into the stdlib. Also, all of your examples only show single-argument kernels, we'll need some demonstration of multi-arg kernels. I think that would also make things simpler, so we have more sound example. E.g. the stencils could have one input, and one output slice, so it's not ub. |
There was a problem hiding this comment.
doing the tests here now, just not to mix the perf suite with random frontend examples
There was a problem hiding this comment.
should be ready for another pass, quick summary of what i did
- fix stencil case with separate input and output args so it's not UB
- array example (saxpy)
RawRegionso the user doesn't need to worry abt different cases (&mut [T],&mut [T; N], etc)- i'm using
Regioneven for not mutable elements, it doesn't introduce extra complexity to the implementation and it's something the users are gonna need to write kernels (we can also just expose indexes as cuda-oxide, but idk, to be consistent with our setup)
There was a problem hiding this comment.
before leaving, i'll use our safe frontend in rajaperf test cases, with that we should have enough variety of examples.
also, when writing those kernels, in case the design has any other flaws (user interface wise) i expect them to arise there
There was a problem hiding this comment.
we can also just expose indexes
More the other way around, I don't think we can avoid that. People can always call the underlying intrinsics themselves. But I also don't we should try and hide it, since I don't think we'll cover all cases. If 80% are expressible with our abstractions that's (more than) good enough, people can (and should) add their own ones.
There was a problem hiding this comment.
i meant inside the pattern, like Linear1D::index() and returns thread_x + block_x * block_dim_x, but yes, i agree
| println!("GPU bits: {:064b} value: {:?}", x[0].to_bits(), x[0]); | ||
| println!("CPU bits: {:064b} value: {:?}", 42.0f64.to_bits(), 42.0); | ||
| for i in 0..x.len() { | ||
| assert_eq!(x[i], 42.0 as f64); | ||
| } |
There was a problem hiding this comment.
i probably messed things up while modifying Region to accept &mut [T], &[T], &[T; N], &mut [T; N]
but i'd expect out buffer is set to 0.0 or some random trash value if things didn't went as expected. instead, printing and using it says that x[0] = 42.0 but, although i'm not an expert in computer architecture, 0000000000000000000000000000000000000000000000000000000000000000 is not 42.0
GPU bits: 0000000000000000000000000000000000000000000000000000000000000000 value: 42.0
CPU bits: 0100000001000101000000000000000000000000000000000000000000000000 value: 42.0
thread 'main' (21919) panicked at src/main.rs:71:9:
assertion `left == right` failed
left: 42.0
right: 42.0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted (core dumped)
There was a problem hiding this comment.
maybe the compiler is propagating the constant when seeing the assert statement? no it's not, when removing the assert_eq line it remains the same lol
There was a problem hiding this comment.
That looks like something UB would cause :D
I'll have another look in an hour or so.
| &*self | ||
| .base_ptr | ||
| .offset((self.center_idx as isize) + (oy * self.cols as isize) + ox) |
There was a problem hiding this comment.
miri detected UB here
error: Undefined Behavior: in-bounds pointer arithmetic failed: attempting to offset pointer by -48 bytes, but got alloc52791 which is at the beginning of the allocation
--> src/partition.rs:175:15
|
175 | &*self
| _______________^
176 | | .base_ptr
177 | | .offset((self.center_idx as isize) + (oy * self.cols as isize) + ox)
| |____________________________________________________________________________________^ Undefined Behavior occurred here
|
There was a problem hiding this comment.
Glad we got miri. Yeah, you'll need to check if the final offset is in [0, len). But it isn't enough to fix the transfer weirdness, or?
| 0, | ||
| ( | ||
| self.phidat as *mut [Real; 390400], | ||
| let mut phidat = unsafe { &mut *(self.phidat as *mut [Real; 390400]) }; |
There was a problem hiding this comment.
Uuh, this looks scary. Can you please write out the types these unsafe cases, to be a bit more explicit? I think that would have also caught a bug.
In general I think the pattern is ok, but I you got the indirections wrong. I made the same mistake when porting ADBench to rust/autodiff -.-
I think you have
let mut phidat: &mut [Real; 390400] = unsafe { &mut *(self.phidat as *mut [Real; 390400]) };
Therefore you call preload_mut with the type
&mut &mut [Real; 390400] which I think could be part of the reason why it misscompiles in release (?)
Tbf the preload function should also reject it, since it doesn't match what we're expecting.
some things that would make this better
cfgclause to determine whether the code is being run on host or devicecc @ZuseZ4