device_vector using usm device memory and becoming allocator aware#1
danhoeflinger wants to merge 23 commits into SYCLomatic from
Conversation
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
The shift from USM shared memory to USM device memory for device_vector will change performance for users of this type. However, when SYCLomatic is run, it injects generated headers into the user's repository rather than relying upon the headers like a library, so users will only experience the performance difference if they re-migrate their code. It is difficult to say whether the performance change will be a net benefit or detriment, because it depends on the usage pattern. With this in mind, would it be good to provide the legacy shared-USM implementation of dpct::device_vector under another name? If so, do we provide it as it was prior to this PR, or work to add AllocatorAwareContainer requirements to it as well?
I've realized I'm missing a big piece here, which is proper handling of [...]. Currently, those merely use [...]. This also means that the tests that are working are not really using the code inside the [...].
```diff
  template <typename OtherT>
  device_reference &operator=(const device_reference<OtherT> &input) {
-   value = input;
+   __assign_from(input.__get_value());
```
All of these operators which use both __assign_from() and __get_value() launch two SYCL kernels, when this should be achievable with a single custom kernel.
I have an implementation which attempts to accomplish this on a branch. It uses the preprocessor macro __SYCL_DEVICE_ONLY__ to either submit a custom single_task, or just run simple commands if already on the device.
However, with this structure based on __SYCL_DEVICE_ONLY__, the compiler does not realize it needs to compile the kernel, and this results in a runtime exception when the custom kernel cannot be found. My hypothesis is that the kernel submission must be present in an active code path when __SYCL_DEVICE_ONLY__ is defined for the compiler to recognize it. This is a somewhat fundamental issue with this strategy.
Actually, there is a bug here which is allowing the __assign_from() and __get_value() to work in the current state of the code: __SYCL_DEVICE_ONLY should be __SYCL_DEVICE_ONLY__, and this typo is what allows the kernels to be compiled. Fixing that bug breaks the code generally, due to the runtime failure described above.
Using __SYCL_DEVICE_ONLY__ in this fashion does not seem to actually be viable.
If users really wanted explicitly device USM memory, we could provide an explicitly [...]. However, the current shared USM implementation provides the seamless experience of host- and device-accessible memory, without requiring the programmer to specify where the data will be used. I don't think that is likely to be a winning trade-off. The allocator-aware functionality can be broken off and added separately.
This PR adds support for thread names in lldb on Windows.
```
(lldb) thr list
Process 2960 stopped
thread #53: tid = 0x03a0, 0x00007ff84582db34 ntdll.dll`NtWaitForMultipleObjects + 20
thread #29: tid = 0x04ec, 0x00007ff845830a14 ntdll.dll`NtWaitForAlertByThreadId + 20, name = 'SPUW.6'
thread #89: tid = 0x057c, 0x00007ff845830a14 ntdll.dll`NtWaitForAlertByThreadId + 20, name = 'PPU[0x1000019] physics[main]'
thread #3: tid = 0x0648, 0x00007ff843c2cafe combase.dll`InternalDoATClassCreate + 39518
thread #93: tid = 0x0688, 0x00007ff845830a14 ntdll.dll`NtWaitForAlertByThreadId + 20, name = 'PPU[0x100501d] uMovie::StreamingThread'
thread #1: tid = 0x087c, 0x00007ff842e7a104 win32u.dll`NtUserMsgWaitForMultipleObjectsEx + 20
thread #96: tid = 0x0890, 0x00007ff845830a14 ntdll.dll`NtWaitForAlertByThreadId + 20, name = 'PPU[0x1002020] HLE Video Decoder'
<...>
```
The upstream test relies on jump-tables, which are lowered in dramatically different ways with later arm64e/ptrauth patches. Concretely, it's failing for at least two reasons:
- ptrauth removes x16/x17 from tcGPR64 to prevent indirect tail-calls from using either register as the callee, conflicting with their usage as scratch for the tail-call LR auth checking sequence. In the 1/2_available_regs_left tests, this causes the MI scheduler to move the load up across some of the inlineasm register clobbers.
- ptrauth adds an x16/x17-using pseudo for jump-table dispatch, which looks somewhat different from the regular jump-table dispatch codegen by itself, but also prevents compression currently.

They seem like sensible changes. But they mean the tests aren't really testing what they're intended to, because there's always an implicit x16/x17 clobber when using jump-tables. This updates the test in a way that should work identically regardless of ptrauth support, with one exception, #1 above, which merely reorders the load/inlineasm w.r.t. each other. I verified the tests still fail the live-reg assertions when applicable.
This reverts commit daebe5c. This commit causes the following asan issue:
```
<snip>/llvm-project/build/bin/mlir-opt <snip>/llvm-project/mlir/test/Dialect/XeGPU/XeGPUOps.mlir | <snip>/llvm-project/build/bin/FileCheck <snip>/llvm-project/mlir/test/Dialect/XeGPU/XeGPUOps.mlir
# executed command: <snip>/llvm-project/build/bin/mlir-opt <snip>/llvm-project/mlir/test/Dialect/XeGPU/XeGPUOps.mlir
# .---command stderr------------
# | =================================================================
# | ==2772558==ERROR: AddressSanitizer: stack-use-after-return on address 0x7fd2c2c42b90 at pc 0x55e406d54614 bp 0x7ffc810e4070 sp 0x7ffc810e4068
# | READ of size 8 at 0x7fd2c2c42b90 thread T0
# |     #0 0x55e406d54613 in operator()<long int const*> /usr/include/c++/13/bits/predefined_ops.h:318
# |     #1 0x55e406d54613 in __count_if<long int const*, __gnu_cxx::__ops::_Iter_pred<mlir::verifyListOfOperandsOrIntegers(Operation*, llvm::StringRef, unsigned int, llvm::ArrayRef<long int>, ValueRange)::<lambda(int64_t)> > > /usr/include/c++/13/bits/stl_algobase.h:2125
# |     #2 0x55e406d54613 in count_if<long int const*, mlir::verifyListOfOperandsOrIntegers(Operation*, ...
```
The problematic program is as follows:
```c
#define pre_a 0
#define PRE(x) pre_##x
void f(void) {
PRE(a) && 0;
}
int main(void) { return 0; }
```
in which after token concatenation (`##`), there's another nested macro
`pre_a`.
Currently only the outer expansion region will be produced. ([compiler
explorer
link](https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename:'1',fontScale:14,fontUsePx:'0',j:1,lang:___c,selection:(endColumn:29,endLineNumber:8,positionColumn:29,positionLineNumber:8,selectionStartColumn:29,selectionStartLineNumber:8,startColumn:29,startLineNumber:8),source:'%23define+pre_a+0%0A%23define+PRE(x)+pre_%23%23x%0A%0Avoid+f(void)+%7B%0A++++PRE(a)+%26%26+0%3B%0A%7D%0A%0Aint+main(void)+%7B+return+0%3B+%7D'),l:'5',n:'0',o:'C+source+%231',t:'0')),k:51.69491525423727,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((g:!((h:compiler,i:(compiler:cclang_assertions_trunk,filters:(b:'0',binary:'1',binaryObject:'1',commentOnly:'0',debugCalls:'1',demangle:'0',directives:'0',execute:'0',intel:'0',libraryCode:'1',trim:'1',verboseDemangling:'0'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:2,lang:___c,libs:!(),options:'-fprofile-instr-generate+-fcoverage-mapping+-fcoverage-mcdc+-Xclang+-dump-coverage-mapping+',overrides:!(),selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'+x86-64+clang+(assertions+trunk)+(Editor+%231)',t:'0')),k:34.5741843594503,l:'4',m:28.903654485049834,n:'0',o:'',s:0,t:'0'),(g:!((h:output,i:(compilerName:'x86-64+clang+(trunk)',editorid:1,fontScale:14,fontUsePx:'0',j:2,wrap:'1'),l:'5',n:'0',o:'Output+of+x86-64+clang+(assertions+trunk)+(Compiler+%232)',t:'0')),header:(),l:'4',m:71.09634551495017,n:'0',o:'',s:0,t:'0')),k:48.30508474576271,l:'3',n:'0',o:'',t:'0')),l:'2',m:100,n:'0',o:'',t:'0')),version:4))
```text
f:
File 0, 4:14 -> 6:2 = #0
Decision,File 0, 5:5 -> 5:16 = M:0, C:2
Expansion,File 0, 5:5 -> 5:8 = #0 (Expanded file = 1)
File 0, 5:15 -> 5:16 = #1
Branch,File 0, 5:15 -> 5:16 = 0, 0 [2,0,0]
File 1, 2:16 -> 2:23 = #0
File 2, 1:15 -> 1:16 = #0
File 2, 1:15 -> 1:16 = #0
Branch,File 2, 1:15 -> 1:16 = 0, 0 [1,2,0]
```
The inner expansion region isn't produced because:
1. In the range-based for loop quoted below, each sloc is processed and
possibly emits a corresponding expansion region.
2. For our sloc in question, its direct parent returned by
`getIncludeOrExpansionLoc()` is a `<scratch space>`, because that's how
`##` is processed.
https://github.com/llvm/llvm-project/blob/88b6186af3908c55b357858eb348b5143f21c289/clang/lib/CodeGen/CoverageMappingGen.cpp#L518-L520
3. This `<scratch space>` cannot be found in the FileID mapping so
`ParentFileID` will be assigned an `std::nullopt`
https://github.com/llvm/llvm-project/blob/88b6186af3908c55b357858eb348b5143f21c289/clang/lib/CodeGen/CoverageMappingGen.cpp#L521-L526
4. As a result, this iteration of the for loop finishes early and no
expansion region is added for the sloc.
This problem gets worse with MC/DC: as the example shows, there's a
branch from File 2 but File 2 itself is missing. This will trigger
assertion failures.
The fix is more or less a workaround and takes a similar approach as
#89573.
~~Depends on #89573.~~ This includes #89573. Kudos to @chapuni!
This and #89573 together fix #87000: I tested locally, both the reduced
program and my original use case (fwiw, Linux kernel) can run
successfully.
---------
Co-authored-by: NAKAMURA Takumi <geek4civic@gmail.com>
Update: There are a pair of fundamental issues discovered in the comments below. I think it may not be possible to achieve this functionality in this way using USM device memory. Read the comments for more details.
dpct::device_vector currently uses `sycl::usm::alloc::shared` memory to provide users with the ability to have seamless usage on host and device. This PR changes the semantics of dpct::device_vector to instead use `sycl::usm::alloc::device` memory, while preserving the ability to seamlessly use the dpct::device_vector from the host.

This PR also upgrades dpct::device_vector to be close to satisfying AllocatorAwareContainer. However, it diverges from that requirement in the following ways:

- `construct()` calls do not use `::std::allocator_traits<allocator_type>::construct(alloc, pointer, args)`, but rather `dpct::device_allocator_traits<allocator_type>::construct(pointer, args)`, which uses a static `construct()` call from the supplied allocator, or provides a default if one is not defined.
- `destroy()` calls do not use `::std::allocator_traits<allocator_type>::destroy(alloc, pointer)`, but rather the static function `dpct::device_allocator_traits<allocator_type>::destroy(pointer)`, which uses a static `destroy()` call from the supplied allocator, or provides a default if one is not defined.

These changes are done to avoid passing the allocator object into the SYCL kernel, as construction and destruction must occur on the device in a kernel. The selection of the call can happen at compile time and must not involve the individual instance of the allocator.
To avoid unnecessary overhead, for dpct::device_vector a kernel is only launched to destroy data if a custom static `destroy()` function is provided in the allocator. Prior to this PR, no construction or destruction was done, and we do not want to add extra kernel launches to "destroy" data when this will generally be a no-op in effect.

Changes to the testing can be found here:
danhoeflinger/SYCLomatic-test#1
note: this PR is meant to be a draft on my fork for discussion prior to introducing it to SYCLomatic itself.