
device_vector using usm device memory and becoming allocator aware#1

Closed
danhoeflinger wants to merge 23 commits into SYCLomatic from dev/dhoeflin/device_vector

Conversation


@danhoeflinger danhoeflinger commented Mar 31, 2023

Update: There are a pair of fundamental issues discovered in the comments below. I now think this functionality may not be achievable with USM device memory in this way. Read the comments for more details.


dpct::device_vector currently uses sycl::usm::alloc::shared memory to provide users with the ability to have seamless usage on host and device.

This PR changes the semantics of dpct::device_vector to instead use sycl::usm::alloc::device memory, while preserving the ability to seamlessly use the dpct::device_vector from the host.

This PR also upgrades dpct::device_vector to be close to satisfying AllocatorAwareContainer. However, it diverges from that requirement in the following ways:

  1. construct() calls do not use ::std::allocator_traits<allocator_type>::construct(alloc, pointer, args), but rather dpct::device_allocator_traits<allocator_type>::construct(pointer, args), which uses a static construct() call from the supplied allocator, or provides a default if one is not defined.
  2. destroy() calls do not use ::std::allocator_traits<allocator_type>::destroy(alloc, pointer), but rather the static function dpct::device_allocator_traits<allocator_type>::destroy(pointer), which uses a static destroy() call from the supplied allocator, or provides a default if one is not defined.

These changes are made to avoid passing the allocator object into the SYCL kernel, as construction and destruction must occur on the device in a kernel. The selection of the call can then happen at compile time and must not involve the individual instance of the allocator.

To avoid unnecessary overhead, dpct::device_vector only launches a kernel to destroy data if the allocator provides a custom static destroy() function. Prior to this PR, no construction or destruction was done, and we do not want to add extra kernel launches to "destroy" data when doing so would generally be a no-op.

Changes to the testing can be found here:
danhoeflinger/SYCLomatic-test#1

Note: this PR is meant as a draft on my fork for discussion prior to introducing it to SYCLomatic itself.

Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>

danhoeflinger commented Mar 31, 2023

The shift from USM shared memory to USM device memory for device_vector will change the performance seen by users of this type. However, when SYCLomatic is run, it injects generated headers into the user's repository, rather than relying upon them like library headers. This means users will only experience the performance difference if they re-migrate their code.

It is difficult to say whether the performance change will be a net benefit or detriment, because it depends on the usage pattern. The USM shared memory implementation lets the runtime decide where to hold the memory based on usage; the new implementation always keeps it on the device, and transfers data to and/or from the device each time it is used on the host.

With this in mind, would it be good to provide the legacy shared USM implementation of dpct::device_vector under another name? If so, do we provide it as it was prior to this PR, or work to add the AllocatorAwareContainer requirements to it as well?

@danhoeflinger

I've realized I'm missing a big piece here, which is proper handling of device_pointer and device_iterator, and a solution does not seem easy.

Currently, those merely use value_type & as their reference type rather than device_reference<value_type>, and making that switch does not play well with the details of oneDPL. Below is an example of a functor used in fill which assumes a raw reference and will not function with a device_reference<value_type>:

```cpp
template <typename _SourceT>
struct fill_functor
{
    _SourceT __value;
    template <typename _TargetT>
    void
    operator()(_TargetT& __target) const
    {
        __target = __value;
    }
};
```

This also means that the tests that pass are not really exercising the code inside the __SYCL_DEVICE_ONLY__ macros to access data within kernels; because we pass device_iterator itself to kernels, they merely depend on device_iterator yielding a raw value_type & when dereferenced.

```diff
 template <typename OtherT>
 device_reference &operator=(const device_reference<OtherT> &input) {
-  value = input;
+  __assign_from(input.__get_value());
```

@danhoeflinger danhoeflinger Apr 4, 2023


All of these operators which use both __assign_from() and __get_value() launch two SYCL kernels, when this should be achievable with a single custom kernel.

I have an implementation which attempts to accomplish this on a branch. It uses the preprocessor macro __SYCL_DEVICE_ONLY__ to choose to either submit a custom single_task, or just run simple commands if already on the device.

However, with this structure based on __SYCL_DEVICE_ONLY__, the compiler does not realize it needs to compile the kernel, and the result is a runtime exception: the custom kernel cannot be found. My hypothesis is that the kernel submission must be present in an active code path when __SYCL_DEVICE_ONLY__ is defined for the compiler to recognize it. This is a somewhat fundamental issue with this strategy.
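The failing structure, roughly (pseudocode sketch; the names and the fragment itself are illustrative, not self-contained code):

```
// Inside device_reference::__assign_from(), conceptually:
#ifdef __SYCL_DEVICE_ONLY__
    // Already executing in a kernel: touch the device memory directly.
    *__ptr = __val;
#else
    // On the host: submit a single_task to perform the assignment.
    __queue.single_task([=] { *__ptr = __val; }).wait();
#endif
// Problem: during the device compilation pass, __SYCL_DEVICE_ONLY__ is
// defined, so the #else branch containing the kernel submission is
// compiled out and the kernel is never generated -- hence the runtime
// failure to find it.
```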


@danhoeflinger danhoeflinger Apr 4, 2023


Actually, there is a bug here which is allowing __assign_from() and __get_value() to work in the current state of the code: __SYCL_DEVICE_ONLY should be __SYCL_DEVICE_ONLY__, and that misspelling is what allows the kernels to be compiled. Fixing the bug breaks the code generally, due to the runtime failure described above.

Using __SYCL_DEVICE_ONLY__ in this fashion does not seem to actually be viable.

@danhoeflinger danhoeflinger marked this pull request as draft April 4, 2023 20:08
@danhoeflinger

If users really wanted explicit device USM memory, we could provide a host_accessible_device_vector, which would function seamlessly on the host with device data.
From this container, begin() and end() would provide device-accessible iterators, and it may also be possible to provide host_begin() and host_end() for host-accessible iterators.

However, the current shared USM implementation already provides the seamless experience of host- and device-accessible memory without requiring the programmer to specify where the data will be used. I don't think that is likely to be a winning trade-off.

The allocator aware functionality can be broken off and added separately.

danhoeflinger pushed a commit that referenced this pull request Jan 12, 2024
This PR adds support for thread names in lldb on Windows.

```
(lldb) thr list
Process 2960 stopped
  thread oneapi-src#53: tid = 0x03a0, 0x00007ff84582db34 ntdll.dll`NtWaitForMultipleObjects + 20
  thread oneapi-src#29: tid = 0x04ec, 0x00007ff845830a14 ntdll.dll`NtWaitForAlertByThreadId + 20, name = 'SPUW.6'
  thread oneapi-src#89: tid = 0x057c, 0x00007ff845830a14 ntdll.dll`NtWaitForAlertByThreadId + 20, name = 'PPU[0x1000019] physics[main]'
  thread oneapi-src#3: tid = 0x0648, 0x00007ff843c2cafe combase.dll`InternalDoATClassCreate + 39518
  thread oneapi-src#93: tid = 0x0688, 0x00007ff845830a14 ntdll.dll`NtWaitForAlertByThreadId + 20, name = 'PPU[0x100501d] uMovie::StreamingThread'
  thread #1: tid = 0x087c, 0x00007ff842e7a104 win32u.dll`NtUserMsgWaitForMultipleObjectsEx + 20
  thread oneapi-src#96: tid = 0x0890, 0x00007ff845830a14 ntdll.dll`NtWaitForAlertByThreadId + 20, name = 'PPU[0x1002020] HLE Video Decoder'
<...>
```
danhoeflinger pushed a commit that referenced this pull request Jan 12, 2024
The upstream test relies on jump-tables, which are lowered in
dramatically different ways with later arm64e/ptrauth patches.

Concretely, it's failing for at least two reasons:
- ptrauth removes x16/x17 from tcGPR64 to prevent indirect tail-calls
  from using either register as the callee, conflicting with their usage
  as scratch for the tail-call LR auth checking sequence.  In the
  1/2_available_regs_left tests, this causes the MI scheduler to move
  the load up across some of the inlineasm register clobbers.

- ptrauth adds an x16/x17-using pseudo for jump-table dispatch, which
  looks somewhat different from the regular jump-table dispatch codegen
  by itself, but also prevents compression currently.

They seem like sensible changes.  But they mean the tests aren't really
testing what they're intended to, because there's always an implicit
x16/x17 clobber when using jump-tables.

This updates the test in a way that should work identically regardless
of ptrauth support, with one exception, #1 above, which merely reorders
the load/inlineasm w.r.t. each other.
I verified the tests still fail the live-reg assertions when
applicable.
danhoeflinger pushed a commit that referenced this pull request Apr 8, 2024
This reverts commit daebe5c.

This commit causes the following asan issue:

```
<snip>/llvm-project/build/bin/mlir-opt <snip>/llvm-project/mlir/test/Dialect/XeGPU/XeGPUOps.mlir | <snip>/llvm-project/build/bin/FileCheck <snip>/llvm-project/mlir/test/Dialect/XeGPU/XeGPUOps.mlir
# executed command: <snip>/llvm-project/build/bin/mlir-opt <snip>/llvm-project/mlir/test/Dialect/XeGPU/XeGPUOps.mlir
# .---command stderr------------
# | =================================================================
# | ==2772558==ERROR: AddressSanitizer: stack-use-after-return on address 0x7fd2c2c42b90 at pc 0x55e406d54614 bp 0x7ffc810e4070 sp 0x7ffc810e4068
# | READ of size 8 at 0x7fd2c2c42b90 thread T0
# |     #0 0x55e406d54613 in operator()<long int const*> /usr/include/c++/13/bits/predefined_ops.h:318
# |     #1 0x55e406d54613 in __count_if<long int const*, __gnu_cxx::__ops::_Iter_pred<mlir::verifyListOfOperandsOrIntegers(Operation*, llvm::StringRef, unsigned int, llvm::ArrayRef<long int>, ValueRange)::<lambda(int64_t)> > > /usr/include/c++/13/bits/stl_algobase.h:2125
# |     #2 0x55e406d54613 in count_if<long int const*, mlir::verifyListOfOperandsOrIntegers(Operation*, 
...
```
danhoeflinger pushed a commit that referenced this pull request Jul 8, 2024
The problematic program is as follows:

```shell
#define pre_a 0
#define PRE(x) pre_##x

void f(void) {
    PRE(a) && 0;
}

int main(void) { return 0; }
```

in which after token concatenation (`##`), there's another nested macro
`pre_a`.

Currently only the outer expansion region will be produced. ([compiler
explorer
link](https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename:'1',fontScale:14,fontUsePx:'0',j:1,lang:___c,selection:(endColumn:29,endLineNumber:8,positionColumn:29,positionLineNumber:8,selectionStartColumn:29,selectionStartLineNumber:8,startColumn:29,startLineNumber:8),source:'%23define+pre_a+0%0A%23define+PRE(x)+pre_%23%23x%0A%0Avoid+f(void)+%7B%0A++++PRE(a)+%26%26+0%3B%0A%7D%0A%0Aint+main(void)+%7B+return+0%3B+%7D'),l:'5',n:'0',o:'C+source+%231',t:'0')),k:51.69491525423727,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((g:!((h:compiler,i:(compiler:cclang_assertions_trunk,filters:(b:'0',binary:'1',binaryObject:'1',commentOnly:'0',debugCalls:'1',demangle:'0',directives:'0',execute:'0',intel:'0',libraryCode:'1',trim:'1',verboseDemangling:'0'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:2,lang:___c,libs:!(),options:'-fprofile-instr-generate+-fcoverage-mapping+-fcoverage-mcdc+-Xclang+-dump-coverage-mapping+',overrides:!(),selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'+x86-64+clang+(assertions+trunk)+(Editor+%231)',t:'0')),k:34.5741843594503,l:'4',m:28.903654485049834,n:'0',o:'',s:0,t:'0'),(g:!((h:output,i:(compilerName:'x86-64+clang+(trunk)',editorid:1,fontScale:14,fontUsePx:'0',j:2,wrap:'1'),l:'5',n:'0',o:'Output+of+x86-64+clang+(assertions+trunk)+(Compiler+%232)',t:'0')),header:(),l:'4',m:71.09634551495017,n:'0',o:'',s:0,t:'0')),k:48.30508474576271,l:'3',n:'0',o:'',t:'0')),l:'2',m:100,n:'0',o:'',t:'0')),version:4))

```text
f:
  File 0, 4:14 -> 6:2 = #0
  Decision,File 0, 5:5 -> 5:16 = M:0, C:2
  Expansion,File 0, 5:5 -> 5:8 = #0 (Expanded file = 1)
  File 0, 5:15 -> 5:16 = #1
  Branch,File 0, 5:15 -> 5:16 = 0, 0 [2,0,0] 
  File 1, 2:16 -> 2:23 = #0
  File 2, 1:15 -> 1:16 = #0
  File 2, 1:15 -> 1:16 = #0
  Branch,File 2, 1:15 -> 1:16 = 0, 0 [1,2,0] 
```

The inner expansion region isn't produced because:

1. In the range-based for loop quoted below, each sloc is processed and
possibly emit a corresponding expansion region.
2. For our sloc in question, its direct parent returned by
`getIncludeOrExpansionLoc()` is a `<scratch space>`, because that's how
`##` is processed.


https://github.com/llvm/llvm-project/blob/88b6186af3908c55b357858eb348b5143f21c289/clang/lib/CodeGen/CoverageMappingGen.cpp#L518-L520

3. This `<scratch space>` cannot be found in the FileID mapping so
`ParentFileID` will be assigned an `std::nullopt`


https://github.com/llvm/llvm-project/blob/88b6186af3908c55b357858eb348b5143f21c289/clang/lib/CodeGen/CoverageMappingGen.cpp#L521-L526

4. As a result this iteration of for loop finishes early and no
expansion region is added for the sloc.

This problem gets worse with MC/DC: as the example shows, there's a
branch from File 2 but File 2 itself is missing. This will trigger
assertion failures.

The fix is more or less a workaround and takes a similar approach as
#89573.

~~Depends on #89573.~~ This includes #89573. Kudos to @chapuni!
This and #89573 together fix #87000: I tested locally, both the reduced
program and my original use case (fwiw, Linux kernel) can run
successfully.

---------

Co-authored-by: NAKAMURA Takumi <geek4civic@gmail.com>
