
device_vector using usm device memory and becoming allocator aware#1

Closed
danhoeflinger wants to merge 23 commits into SYCLomatic from dev/dhoeflin/device_vector

Conversation


@danhoeflinger danhoeflinger commented Mar 31, 2023

Update: There are a pair of fundamental issues discovered in the comments below. I now think this functionality may not be achievable with USM device memory in this way. Read the comments for more details.


dpct::device_vector currently uses sycl::usm::alloc::shared memory to provide users with the ability to have seamless usage on host and device.

This PR changes the semantics of dpct::device_vector to instead use sycl::usm::alloc::device memory, while preserving the ability to seamlessly use the dpct::device_vector from the host.

This PR also upgrades dpct::device_vector to be close to satisfying AllocatorAwareContainer. However, it diverges from that requirement in the following ways:

  1. construct() calls do not use ::std::allocator_traits<allocator_type>::construct(alloc, pointer, args), but rather dpct::device_allocator_traits<allocator_type>::construct(pointer, args), which uses a static construct() call from the supplied allocator, or provides a default if one is not defined.
  2. destroy() calls do not use ::std::allocator_traits<allocator_type>::destroy(alloc, pointer), but rather the static function dpct::device_allocator_traits<allocator_type>::destroy(pointer), which uses a static destroy() call from the supplied allocator, or provides a default if one is not defined.

These changes are made to avoid passing the allocator object into the SYCL kernel, as construction and destruction must occur on the device in a kernel. The selection of the call can then happen at compile time and must not involve the individual instance of the allocator.

To avoid unnecessary overhead, dpct::device_vector only launches a kernel to destroy data if the allocator provides a custom static destroy() function. Prior to this PR, no construction or destruction was done, and we do not want to add extra kernel launches to "destroy" data when doing so would generally be a no-op.

Changes to the testing can be found here:
danhoeflinger/SYCLomatic-test#1

Note: this PR is meant as a draft on my fork for discussion prior to introducing it to SYCLomatic itself.

Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>

danhoeflinger commented Mar 31, 2023

The shift from USM shared memory to USM device memory for device_vector will change the performance seen by users of this type. However, when SYCLomatic is run, it injects generated headers into the user's repository, rather than relying upon them like library headers. This means users will only experience the performance difference if they re-migrate their code.

It is difficult to say whether the performance change will be a net benefit or detriment, because it depends on the usage pattern. The USM shared memory implementation lets the runtime decide where to hold the memory based on usage; the new implementation always keeps it on the device, and transfers data to and/or from the device each time it is used on the host.

With this in mind, would it be good to provide the legacy shared USM implementation of dpct::device_vector under another name? If so, do we provide it as it was prior to this PR, or work to add the AllocatorAwareContainer requirements to it as well?

@danhoeflinger

I've realized I'm missing a big piece here, which is proper handling of device_pointer and device_iterator, and a solution does not seem easy.

Currently, those merely use value_type & as their reference type rather than device_reference<value_type>, and making that switch does not play well with the details of oneDPL. Below is an example of a functor used in fill which assumes a raw reference and will not function with a device_reference<value_type>:

```cpp
template <typename _SourceT>
struct fill_functor
{
    _SourceT __value;
    template <typename _TargetT>
    void
    operator()(_TargetT& __target) const
    {
        __target = __value;
    }
};
```

This also means that the tests that pass are not really exercising the code inside the __SYCL_DEVICE_ONLY__ macros to access data within kernels; because we pass device_iterator itself to kernels, they merely depend on device_iterator yielding a raw value_type & when dereferenced.

```diff
 template <typename OtherT>
 device_reference &operator=(const device_reference<OtherT> &input) {
-  value = input;
+  __assign_from(input.__get_value());
```

@danhoeflinger danhoeflinger Apr 4, 2023


All of these operators which use both __assign_from() and __get_value() launch two SYCL kernels, when this should be achievable with a single custom kernel.

I have an implementation which attempts to accomplish this on a branch. It uses the preprocessor macro __SYCL_DEVICE_ONLY__ to choose to either submit a custom single_task, or just run simple commands if already on the device.

However, with this structure based on __SYCL_DEVICE_ONLY__, the compiler does not realize it needs to compile the kernel, and the result is a runtime exception: the custom kernel cannot be found. My hypothesis is that the kernel submission must be present in an active code path when __SYCL_DEVICE_ONLY__ is defined for the compiler to recognize it. This is a somewhat fundamental issue with this strategy.
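The failing structure, roughly (pseudocode sketch; the names and the fragment itself are illustrative, not self-contained code):

```
// Inside device_reference::__assign_from(), conceptually:
#ifdef __SYCL_DEVICE_ONLY__
    // Already executing in a kernel: touch the device memory directly.
    *__ptr = __val;
#else
    // On the host: submit a single_task to perform the assignment.
    __queue.single_task([=] { *__ptr = __val; }).wait();
#endif
// Problem: during the device compilation pass, __SYCL_DEVICE_ONLY__ is
// defined, so the #else branch containing the kernel submission is
// compiled out and the kernel is never generated -- hence the runtime
// failure to find it.
```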


@danhoeflinger danhoeflinger Apr 4, 2023


Actually, there is a bug here which is allowing __assign_from() and __get_value() to work in the current state of the code: __SYCL_DEVICE_ONLY should be __SYCL_DEVICE_ONLY__, and that misspelling is what allows the kernels to be compiled. Fixing the bug breaks the code generally, due to the runtime failure described above.

Using __SYCL_DEVICE_ONLY__ in this fashion does not seem to actually be viable.

@danhoeflinger danhoeflinger marked this pull request as draft April 4, 2023 20:08
@danhoeflinger

If users really wanted explicit device USM memory, we could provide a host_accessible_device_vector, which would function seamlessly on the host with device data.
From this container, begin() and end() would provide device-accessible iterators, and it may also be possible to provide host_begin() and host_end() for host-accessible iterators.

However, the current shared USM implementation already provides the seamless experience of host- and device-accessible memory without requiring the programmer to specify where the data will be used. I don't think that is likely to be a winning trade-off.

The allocator aware functionality can be broken off and added separately.

danhoeflinger pushed a commit that referenced this pull request Jan 12, 2024
This PR adds support for thread names in lldb on Windows.

```
(lldb) thr list
Process 2960 stopped
  thread oneapi-src#53: tid = 0x03a0, 0x00007ff84582db34 ntdll.dll`NtWaitForMultipleObjects + 20
  thread oneapi-src#29: tid = 0x04ec, 0x00007ff845830a14 ntdll.dll`NtWaitForAlertByThreadId + 20, name = 'SPUW.6'
  thread oneapi-src#89: tid = 0x057c, 0x00007ff845830a14 ntdll.dll`NtWaitForAlertByThreadId + 20, name = 'PPU[0x1000019] physics[main]'
  thread oneapi-src#3: tid = 0x0648, 0x00007ff843c2cafe combase.dll`InternalDoATClassCreate + 39518
  thread oneapi-src#93: tid = 0x0688, 0x00007ff845830a14 ntdll.dll`NtWaitForAlertByThreadId + 20, name = 'PPU[0x100501d] uMovie::StreamingThread'
  thread #1: tid = 0x087c, 0x00007ff842e7a104 win32u.dll`NtUserMsgWaitForMultipleObjectsEx + 20
  thread oneapi-src#96: tid = 0x0890, 0x00007ff845830a14 ntdll.dll`NtWaitForAlertByThreadId + 20, name = 'PPU[0x1002020] HLE Video Decoder'
<...>
```
danhoeflinger pushed a commit that referenced this pull request Jan 12, 2024
The upstream test relies on jump-tables, which are lowered in
dramatically different ways with later arm64e/ptrauth patches.

Concretely, it's failing for at least two reasons:
- ptrauth removes x16/x17 from tcGPR64 to prevent indirect tail-calls
  from using either register as the callee, conflicting with their usage
  as scratch for the tail-call LR auth checking sequence.  In the
  1/2_available_regs_left tests, this causes the MI scheduler to move
  the load up across some of the inlineasm register clobbers.

- ptrauth adds an x16/x17-using pseudo for jump-table dispatch, which
  looks somewhat different from the regular jump-table dispatch codegen
  by itself, but also prevents compression currently.

They seem like sensible changes.  But they mean the tests aren't really
testing what they're intended to, because there's always an implicit
x16/x17 clobber when using jump-tables.

This updates the test in a way that should work identically regardless
of ptrauth support, with one exception, #1 above, which merely reorders
the load/inlineasm w.r.t. each other.
I verified the tests still fail the live-reg assertions when
applicable.
danhoeflinger pushed a commit that referenced this pull request Apr 8, 2024
This reverts commit daebe5c.

This commit causes the following asan issue:

```
<snip>/llvm-project/build/bin/mlir-opt <snip>/llvm-project/mlir/test/Dialect/XeGPU/XeGPUOps.mlir | <snip>/llvm-project/build/bin/FileCheck <snip>/llvm-project/mlir/test/Dialect/XeGPU/XeGPUOps.mlir
# executed command: <snip>/llvm-project/build/bin/mlir-opt <snip>/llvm-project/mlir/test/Dialect/XeGPU/XeGPUOps.mlir
# .---command stderr------------
# | =================================================================
# | ==2772558==ERROR: AddressSanitizer: stack-use-after-return on address 0x7fd2c2c42b90 at pc 0x55e406d54614 bp 0x7ffc810e4070 sp 0x7ffc810e4068
# | READ of size 8 at 0x7fd2c2c42b90 thread T0
# |     #0 0x55e406d54613 in operator()<long int const*> /usr/include/c++/13/bits/predefined_ops.h:318
# |     #1 0x55e406d54613 in __count_if<long int const*, __gnu_cxx::__ops::_Iter_pred<mlir::verifyListOfOperandsOrIntegers(Operation*, llvm::StringRef, unsigned int, llvm::ArrayRef<long int>, ValueRange)::<lambda(int64_t)> > > /usr/include/c++/13/bits/stl_algobase.h:2125
# |     #2 0x55e406d54613 in count_if<long int const*, mlir::verifyListOfOperandsOrIntegers(Operation*, 
...
```
danhoeflinger pushed a commit that referenced this pull request Jul 8, 2024
The problematic program is as follows:

```shell
#define pre_a 0
#define PRE(x) pre_##x

void f(void) {
    PRE(a) && 0;
}

int main(void) { return 0; }
```

in which after token concatenation (`##`), there's another nested macro
`pre_a`.

Currently only the outer expansion region will be produced. ([compiler
explorer
link](https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename:'1',fontScale:14,fontUsePx:'0',j:1,lang:___c,selection:(endColumn:29,endLineNumber:8,positionColumn:29,positionLineNumber:8,selectionStartColumn:29,selectionStartLineNumber:8,startColumn:29,startLineNumber:8),source:'%23define+pre_a+0%0A%23define+PRE(x)+pre_%23%23x%0A%0Avoid+f(void)+%7B%0A++++PRE(a)+%26%26+0%3B%0A%7D%0A%0Aint+main(void)+%7B+return+0%3B+%7D'),l:'5',n:'0',o:'C+source+%231',t:'0')),k:51.69491525423727,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((g:!((h:compiler,i:(compiler:cclang_assertions_trunk,filters:(b:'0',binary:'1',binaryObject:'1',commentOnly:'0',debugCalls:'1',demangle:'0',directives:'0',execute:'0',intel:'0',libraryCode:'1',trim:'1',verboseDemangling:'0'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:2,lang:___c,libs:!(),options:'-fprofile-instr-generate+-fcoverage-mapping+-fcoverage-mcdc+-Xclang+-dump-coverage-mapping+',overrides:!(),selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'+x86-64+clang+(assertions+trunk)+(Editor+%231)',t:'0')),k:34.5741843594503,l:'4',m:28.903654485049834,n:'0',o:'',s:0,t:'0'),(g:!((h:output,i:(compilerName:'x86-64+clang+(trunk)',editorid:1,fontScale:14,fontUsePx:'0',j:2,wrap:'1'),l:'5',n:'0',o:'Output+of+x86-64+clang+(assertions+trunk)+(Compiler+%232)',t:'0')),header:(),l:'4',m:71.09634551495017,n:'0',o:'',s:0,t:'0')),k:48.30508474576271,l:'3',n:'0',o:'',t:'0')),l:'2',m:100,n:'0',o:'',t:'0')),version:4))

```text
f:
  File 0, 4:14 -> 6:2 = #0
  Decision,File 0, 5:5 -> 5:16 = M:0, C:2
  Expansion,File 0, 5:5 -> 5:8 = #0 (Expanded file = 1)
  File 0, 5:15 -> 5:16 = #1
  Branch,File 0, 5:15 -> 5:16 = 0, 0 [2,0,0] 
  File 1, 2:16 -> 2:23 = #0
  File 2, 1:15 -> 1:16 = #0
  File 2, 1:15 -> 1:16 = #0
  Branch,File 2, 1:15 -> 1:16 = 0, 0 [1,2,0] 
```

The inner expansion region isn't produced because:

1. In the range-based for loop quoted below, each sloc is processed and
possibly emit a corresponding expansion region.
2. For our sloc in question, its direct parent returned by
`getIncludeOrExpansionLoc()` is a `<scratch space>`, because that's how
`##` is processed.


https://github.com/llvm/llvm-project/blob/88b6186af3908c55b357858eb348b5143f21c289/clang/lib/CodeGen/CoverageMappingGen.cpp#L518-L520

3. This `<scratch space>` cannot be found in the FileID mapping so
`ParentFileID` will be assigned an `std::nullopt`


https://github.com/llvm/llvm-project/blob/88b6186af3908c55b357858eb348b5143f21c289/clang/lib/CodeGen/CoverageMappingGen.cpp#L521-L526

4. As a result this iteration of for loop finishes early and no
expansion region is added for the sloc.

This problem gets worse with MC/DC: as the example shows, there's a
branch from File 2 but File 2 itself is missing. This will trigger
assertion failures.

The fix is more or less a workaround and takes a similar approach as
#89573.

~~Depends on #89573.~~ This includes #89573. Kudos to @chapuni!
This and #89573 together fix #87000: I tested locally, both the reduced
program and my original use case (fwiw, Linux kernel) can run
successfully.

---------

Co-authored-by: NAKAMURA Takumi <geek4civic@gmail.com>
