[Clang][SYCL] Introduce clang-sycl-link-wrapper to link SYCL offloading device code by asudarsa · Pull Request #1 · asudarsa/llvm-project

asudarsa · 2024-10-07T14:45:29Z

This PR is one of the many PRs in the SYCL upstreaming effort focusing on device code linking during the SYCL offload compilation process. RFC: https://discourse.llvm.org/t/rfc-offloading-design-for-sycl-offload-kind-and-spir-targets/74088

In this PR, we introduce a new tool that will be used to perform device code linking for SYCL offload kind. It accepts SYCL device objects in LLVM IR bitcode format and will generate a fully linked device object that can then be wrapped and linked into the host object.

A primary use case for this tool is to perform device code linking for objects with SYCL offload kind inside the clang-linker-wrapper. It can also be invoked via clang driver as follows:

clang --target=spirv64 --sycl-link input.bc

Device code linking for SYCL offloading kind has a number of known quirks that make it difficult to use in a unified offloading setting. Two of the primary issues are:

1. Several finalization steps must be run on the fully linked LLVM IR bitcode to guarantee conformance to SYCL standards. This step is unique to the SYCL offloading compilation flow.
2. The SPIR-V LLVM Translator tool is an external tool, and hence SPIR-V IR code generation cannot be done as part of LTO. This limitation will be lifted once the SPIR-V backend is available as a viable LLVM backend.

Hence, we introduce this new tool to provide a clean wrapper to perform SYCL device linking.

List of items supported in this PR:

Addition of a new linking tool to link device code for SYCL offloading, and a way to call it from the clang driver. This supports a simple JIT compilation flow.
Currently, the new tool supports linking multiple sources in LLVM IR device code format, linking the SYCL device library files if available, and finally running the llvm-spirv tool to generate a SPIR-V IR device file.
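
The three steps listed above (link the inputs, link device libraries, translate to SPIR-V) can be sketched as a small planner that only builds the command lines. This is an illustration, not the PR's implementation: `planSyclLink`, `Command`, and the intermediate file names are hypothetical, and the only tool options used are `-o` and `-only-needed`. The single-input shortcut mirrors the commit note about not calling llvm-link when only one input is present.

```cpp
#include <string>
#include <vector>

using Command = std::vector<std::string>;

// Hypothetical planner: returns the tool invocations for one link, in order.
// File names such as "linked.bc" are placeholders, not the tool's real names.
std::vector<Command> planSyclLink(const std::vector<std::string> &Inputs,
                                  const std::vector<std::string> &DeviceLibs,
                                  const std::string &Output) {
  std::vector<Command> Plan;
  if (Inputs.empty())
    return Plan;

  // Step 1: merge the input bitcode files; a single input needs no llvm-link.
  std::string Linked = Inputs.front();
  if (Inputs.size() > 1) {
    Command Link = {"llvm-link"};
    Link.insert(Link.end(), Inputs.begin(), Inputs.end());
    Linked = "linked.bc";
    Link.push_back("-o");
    Link.push_back(Linked);
    Plan.push_back(Link);
  }

  // Step 2: link device libraries, if any, pulling in only what is needed.
  if (!DeviceLibs.empty()) {
    Command LinkLibs = {"llvm-link", "-only-needed", Linked};
    LinkLibs.insert(LinkLibs.end(), DeviceLibs.begin(), DeviceLibs.end());
    Linked = "linked-with-libs.bc";
    LinkLibs.push_back("-o");
    LinkLibs.push_back(Linked);
    Plan.push_back(LinkLibs);
  }

  // Step 3: translate the linked bitcode to SPIR-V.
  Command Translate = {"llvm-spirv", Linked, "-o", Output};
  Plan.push_back(Translate);
  return Plan;
}
```

With one input and no device libraries the plan holds a single llvm-spirv command; with two inputs and one library it holds three commands in the order above.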

List of upcoming PRs:

Post-link routine that supports device code splitting and runs several SYCL-specific finalization passes.
Integration of the post-link routine into clang-sycl-link-wrapper.
Invoking the clang driver from inside clang-linker-wrapper to link SYCL device code.
AOT compilation support for SPIR-V targets.
AOT compilation support for non-SPIR-V targets.

Thanks

sarnex

looks great, thanks! just some comments below.

sarnex · 2024-10-07T16:15:32Z

Suggested change

Purpose of this wrapper is to provide an interface to link SYCL device bitcode

The purpose of this wrapper is to provide an interface to link SYCL device bitcode

sarnex · 2024-10-07T16:15:53Z

Suggested change

Device code linking for SYCL offloading kind has a number of known quirks that

Device code linking for SYCL offloading has a number of known quirks that

sarnex · 2024-10-07T16:16:22Z

Suggested change

IR bitcode to gaurantee conformance to SYCL standards. This step is unique to

IR bitcode to guarantee conformance to SYCL standards. This step is unique to

sarnex · 2024-10-07T16:16:32Z

Suggested change

SYCL offloading compilation flow.

the SYCL offloading compilation flow.

sarnex · 2024-10-07T16:16:55Z

Suggested change

2. SPIR-V LLVM Translator tool is an extenal tool and hence SPIR-V IR code

2. SPIR-V LLVM Translator tool is an external tool and hence SPIR-V IR code

sarnex · 2024-10-07T20:41:41Z

i forget, are there plans to have the driver own this list?

I think it's reasonable to hard-code the list here as this is a SYCL specific tool.

Thanks

well with the old offload model the driver also owns a copy, so once we have the new model enabled and remove support for the old model, if this is the single list then that's fine, i just want to avoid two copies of the same thing

sarnex · 2024-10-07T20:42:27Z

is nullopt return valid? should it be an error?

If user does not specify a valid device library location, then we simply do not link any device lib files. It need not be an error.

user here is the driver/clang-linker-wrapper right?
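
The behaviour described in this exchange (no device library location means nothing extra is linked, and that is not an error) can be sketched with `std::optional`. This is a standalone illustration under assumptions: `findDeviceLibDir` and `addDeviceLibs` are hypothetical names, and the directory check uses `std::filesystem` rather than whatever the PR's code does.

```cpp
#include <filesystem>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical helper: return the device library directory if the caller
// supplied one that exists, otherwise std::nullopt (not treated as an error).
std::optional<std::string> findDeviceLibDir(const std::string &UserDir) {
  std::error_code EC;
  if (UserDir.empty() || !std::filesystem::is_directory(UserDir, EC))
    return std::nullopt;
  return UserDir;
}

// Append device library files to the link inputs only when a directory was
// found; with no directory the inputs are returned unchanged.
std::vector<std::string>
addDeviceLibs(std::vector<std::string> Inputs, const std::string &UserDir,
              const std::vector<std::string> &LibNames) {
  if (std::optional<std::string> Dir = findDeviceLibDir(UserDir))
    for (const std::string &Name : LibNames)
      Inputs.push_back(*Dir + "/" + Name);
  return Inputs;
}
```

Whether a missing location should instead be diagnosed depends on who the caller is, which is the question raised above.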

sarnex · 2024-10-07T20:44:42Z

I see the llvm::sys::path namespace has a filename function to extract the filename from the path, can we use that here?
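
For illustration, the suggestion amounts to taking the last component of a path instead of slicing the string by hand. The sketch below uses `std::filesystem::path::filename()` as a standard-library stand-in so it builds without LLVM; `deviceLibFileName` and the example path are hypothetical.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper: return the last path component of a device library
// path. std::filesystem::path::filename() stands in for
// llvm::sys::path::filename() so the sketch builds without LLVM.
std::string deviceLibFileName(const std::string &FullPath) {
  return std::filesystem::path(FullPath).filename().string();
}
```

A call such as `deviceLibFileName("/some/dir/devicelib.bc")` yields the file name without the directory part.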

sarnex · 2024-10-07T20:45:31Z

do we want to note in a comment or something that we will need to add processing of the linked module (sycl-post-link) later?

will add a PR comment about upcoming PRs

sarnex · 2024-10-07T20:47:16Z

if there's an error, i think we won't remove the temp files even if --save-temps is not passed, right? reportError is noreturn, so we probably need something that will always run

reportError is being used in other tools as well.

reportError is fine but checking for save_temps after a reportError won't work I think. If the same thing happens in other tools then it's fine for this PR
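
One way to get cleanup that always runs, even when the error path never returns, is to register it with `std::atexit`, since `std::exit` runs those handlers. The sketch below is standard C++ and only illustrates the concern; `TempFiles`, `SaveTemps`, `removeTempFiles`, and `installTempFileCleanup` are hypothetical, and this `reportError` is a stand-in, not the one used by the LLVM tools.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical globals standing in for the tool's state.
static std::vector<std::string> TempFiles;
static bool SaveTemps = false;

// Remove every registered temporary file unless the user asked to keep them.
static void removeTempFiles() {
  if (SaveTemps)
    return;
  for (const std::string &File : TempFiles)
    std::remove(File.c_str());
  TempFiles.clear();
}

// Register the cleanup once, early in the tool's entry point.
void installTempFileCleanup() { std::atexit(removeTempFiles); }

// Stand-in for a noreturn error reporter: std::exit still runs the handlers
// registered with std::atexit, so the temporary files are removed.
[[noreturn]] void reportError(const std::string &Msg) {
  std::fprintf(stderr, "error: %s\n", Msg.c_str());
  std::exit(EXIT_FAILURE);
}
```

Note that handlers registered this way do not run if the process ends through `std::abort` or `std::_Exit`.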

bader · 2024-10-07T23:27:12Z

A primary use case for this tool is to perform device code linking for objects with SYCL offload kind inside the clang-linker-wrapper. It can also be invoked via clang driver as follows:

clang --target=spirv64 --sycl-link input.bc

Device code linking for SYCL offloading kind has a number of known quirks that make it difficult to use in a unified offloading setting. Two of the primary issues are:

Several finalization steps must be run on the fully linked LLVM IR bitcode to guarantee conformance to SYCL standards. This step is unique to the SYCL offloading compilation flow.

I think it might be easier to communicate SYCL runtime requirements by adding SYCL operating system type or environment type to the target triple.

clang --target=spirv64-intel-sycl input.bc # OS type
clang --target=spirv64-intel--sycl input.bc # Environment type

E.g. CUDA uses CUDA OS type to select the right interface with the driver. See https://llvm.org/docs/NVPTXUsage.html#triples

One concern is that Codeplay is already using the cuda OS type when we compile SYCL for the nvptx target. @Naghasan, what do you think about the idea of having a SYCL-specific OS type in the target triple?
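
To make the difference between the two spellings concrete: a triple has the form arch-vendor-os-environment, so `spirv64-intel-sycl` puts `sycl` in the OS slot, while the double dash in `spirv64-intel--sycl` leaves the OS empty and puts `sycl` in the environment slot. The sketch below is a plain string split with a hypothetical `splitTriple` helper; it does not reproduce `llvm::Triple`'s parsing or normalization.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Split a triple of the form arch-vendor-os-environment on '-', keeping
// empty components. Plain string handling for illustration only; this is
// not llvm::Triple and applies no normalization.
std::array<std::string, 4> splitTriple(const std::string &Triple) {
  std::array<std::string, 4> Parts;
  std::size_t Index = 0;
  for (char C : Triple) {
    if (C == '-' && Index < 3) {
      ++Index; // move on to the next component
      continue;
    }
    Parts[Index] += C;
  }
  return Parts;
}
```

Index 2 of the result is the OS component and index 3 is the environment component.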

bader · 2024-10-07T23:28:15Z

Suggested change

====================

Clang SYCL link Wrapper

====================

=======================

Clang SYCL link Wrapper

=======================

bader · 2024-10-07T23:33:03Z

As I understand it, the purpose of the tool is to provide a standard linker interface and to be able to link any object file format that the 'compile' step can produce for SYCL offload mode. So I would expect this tool to handle linking of fat objects with device code in other IR, such as SPIR-V, as well as in native binary format.

Hi @bader

Thanks for the feedback. I agree with you. This tool is expected to handle linking of fat objects and SPIR-V objects. However, in its current state, it can support only bitcode files. We can extend this tool easily to support linking of other objects. I was planning to update the document as the tool evolves. I will add more details to the doc.

Thanks! Let's clearly state the final goal for the tool, current status of the implementation, and plans for the future work.

bader · 2024-10-07T23:35:29Z

Suggested change

be passed down to downstrea tools like 'llvm-link', 'llvm-spirv', etc.

be passed down to downstrea tools like 'clang', 'llvm-spirv', etc.

We should use clang instead of llvm-link to link llvm bitcode files.

I can add a TODO. I do have a tracker to replace llvm-link with clang. '-only-needed' linking support seems to be missing in clang.

I think llvm-link is not a part of clang distribution, so SYCL linking won't work. We need to resolve that ASAP.

bader · 2024-10-07T23:37:24Z

What about AOT targets?

I was planning to add it as the implementation evolves in near future. But I will add it now and provide a PR comment about upcoming changes.

Thanks

bader · 2024-10-07T23:40:31Z

Please, specify that --sycl-link is the clang driver option, not clang-sycl-link-wrapper option.

bader · 2024-10-07T23:46:46Z

Copy-paste from another source? I expect this comment to reference SPIR-V instead of nvptx and amdgpu.

Yes. Sorry. I thought I had cleaned it up. We do not really need this comment. Will remove it.

bader · 2024-10-07T23:49:33Z

We should either change sycl to spirv in the option name or vice versa in the description. Otherwise, it looks a bit messy.

bader · 2024-10-07T23:50:19Z

Suggested change

// Special option to pass in llvm-spirv options

// Options to pass to llvm-spirv tool

bader · 2024-10-07T23:51:36Z

Suggested change

/// Ssave intermediary results.

/// Save intermediary results.

bader · 2024-10-07T23:52:52Z

Suggested change

SmallString<128> SPIRVDumpDir;

static SmallString<128> SPIRVDumpDir;

Naghasan · 2024-10-09T14:30:24Z

A primary use case for this tool is to perform device code linking for objects with SYCL offload kind inside the clang-linker-wrapper. It can also be invoked via clang driver as follows:
clang --target=spirv64 --sycl-link input.bc
Device code linking for SYCL offloading kind has a number of known quirks that make it difficult to use in a unified offloading setting. Two of the primary issues are:

Several finalization steps must be run on the fully linked LLVM IR bitcode to guarantee conformance to SYCL standards. This step is unique to the SYCL offloading compilation flow.

I think it might be easier to communicate SYCL runtime requirements by adding SYCL operating system type or environment type to the target triple.
clang --target=spirv64-intel-sycl input.bc # OS type
clang --target=spirv64-intel--sycl input.bc # Environment type
E.g. CUDA uses CUDA OS type to select the right interface with the driver. See https://llvm.org/docs/NVPTXUsage.html#triples

One concern is that Codeplay is already using the cuda OS type when we compile SYCL for the nvptx target. @Naghasan, what do you think about the idea of having a SYCL-specific OS type in the target triple?

I think it would be best to avoid that TBH. I briefly looked at the patch, and I don't understand (yet) why we would need extra information. I'll make a review.

If we have to, environment would be best IMO so as to not compete with existing TC logic that we can reuse. But the point of the device compiler is to lower to a specific environment (OpenCL, Level 0, HIP, CUDA), so I don't see why the middle or backend should need to know that to function properly and so I would personally prefer an extra flag.

Naghasan · 2024-10-09T14:58:54Z

Right, I have a better understanding now than when I wrote #1 (comment)

You shouldn't need that flag... if the driver was there. If it was, the condition should be if (JA.isDeviceOffloading(Action::OFK_SYCL)) rather than this global modifier. IMHO, the driver change shouldn't be part of this patch:

there is a better way to do it but can't implement this because it lacks the required sycl offloading driver logic.

looking at the test, you don't need this to test clang-sycl-link-wrapper

Thanks for the feedback @Naghasan
Here, the clang driver is expected to be called from inside clang-linker-wrapper to link device images and we pass --sycl-link (and no -fsycl) to invoke the 'link' stage.
Adding @mdtoguchi for more details here.

Thanks

The use of JA.isDeviceOffloading() is useful when we know we are offloading and determining what kind we are offloading to. From what I understand, the clang call from within the wrapper is not to be considered as an offloading enabling invocation, but rather a target specific call to produce a specific type of binary.

mdtoguchi · 2024-10-10T14:28:16Z

Suggested change

The wrapper will also support the Ahead-Of-Time compilation flow. AOT

The wrapper will also support the Ahead-Of-Time (AOT) compilation flow. AOT

mdtoguchi · 2024-10-10T14:28:43Z

Suggested change

generation is deferred until application runtime time.

generation is deferred until application runtime.

mdtoguchi · 2024-10-10T14:29:07Z

Suggested change

makes it difficult to use in a unified offloading setting. Two of the primary

make it difficult to use in a unified offloading setting. Two of the primary

'makes' goes well with 'Device code linking'. WDYT?

It reads interesting, as the association can also be with 'quirks'. It is the quirks that make it difficult - it's not the device code linking that makes it difficult.

mdtoguchi · 2024-10-10T14:31:05Z

Do we need 'sycl' here? The tool is a SYCL specific tool - seems redundant.

sarnex

no major flags from me assuming my comments are addressed in a separate PR

asudarsa changed the title ~~[Clang][SYCL] Introduce clang-sycl-link-wrapper to link SYCL offloadi…~~ [Clang][SYCL] Introduce clang-sycl-link-wrapper to link SYCL offloading device code Oct 7, 2024

sarnex reviewed Oct 7, 2024

bader reviewed Oct 7, 2024

Naghasan reviewed Oct 9, 2024

asudarsa requested review from Naghasan, bader and sarnex October 10, 2024 14:20

mdtoguchi reviewed Oct 10, 2024

sarnex approved these changes Oct 10, 2024

bader approved these changes Oct 10, 2024

asudarsa marked this pull request as ready for review October 14, 2024 15:09

asudarsa force-pushed the add_clang-sycl-link-wrapper-part-1 branch from 856c205 to eff4a03 Compare October 14, 2024 19:00

asudarsa added 11 commits October 14, 2024 12:17

Update code comment

753edc3

Signed-off-by: Arvind Sudarsanam <arvind.sudarsanam@intel.com>

Renamed the tool to be called clang-sycl-linker

679436a

Signed-off-by: Arvind Sudarsanam <arvind.sudarsanam@intel.com>

Address review comments

307c74e

Signed-off-by: Arvind Sudarsanam <arvind.sudarsanam@intel.com>

Remove unused files

21a7b1c

Signed-off-by: Arvind Sudarsanam <arvind.sudarsanam@intel.com>

Fix documenetation error. Add TODO for --sycl-link usage. Add couple …

e38ded0

…of changes missed out from earlier commit. Signed-off-by: Arvind Sudarsanam <arvind.sudarsanam@intel.com>

Fix minor bug

b78b2ef

Signed-off-by: Arvind Sudarsanam <arvind.sudarsanam@intel.com>

Add clang-sycl-linker to CLANG_TEST_DEPS list

e69ddfd

Signed-off-by: Arvind Sudarsanam <arvind.sudarsanam@intel.com>

Add more testing

bab77c0

Signed-off-by: Arvind Sudarsanam <arvind.sudarsanam@intel.com>

Remove test

23fb5c2

Signed-off-by: Arvind Sudarsanam <arvind.sudarsanam@intel.com>

Some non-functional refactoring and extra check to not call llvm-link…

411bb3a

… when only one input is present Signed-off-by: Arvind Sudarsanam <arvind.sudarsanam@intel.com>

Update test to specify correct inputs to clang-sycl-linker and update…

d127523

… some comments Signed-off-by: Arvind Sudarsanam <arvind.sudarsanam@intel.com>

	Purpose of this wrapper is to provide an interface to link SYCL device bitcode
	The purpose of this wrapper is to provide an interface to link SYCL device bitcode

	IR bitcode to gaurantee conformance to SYCL standards. This step is unique to
	IR bitcode to guarantee conformance to SYCL standards. This step is unique to

	SYCL offloading compilation flow.
	the SYCL offloading compilation flow.

	2. SPIR-V LLVM Translator tool is an extenal tool and hence SPIR-V IR code
	2. SPIR-V LLVM Translator tool is an external tool and hence SPIR-V IR code

-====================
-Clang SYCL link Wrapper
-====================
+=======================
+Clang SYCL link Wrapper
+=======================

	be passed down to downstrea tools like 'llvm-link', 'llvm-spirv', etc.
	be passed down to downstrea tools like 'clang', 'llvm-spirv', etc.

	// Special option to pass in llvm-spirv options
	// Options to pass to llvm-spirv tool

	/// Ssave intermediary results.
	/// Save intermediary results.

	SmallString<128> SPIRVDumpDir;
	static SmallString<128> SPIRVDumpDir;

	The wrapper will also support the Ahead-Of-Time compilation flow. AOT
	The wrapper will also support the Ahead-Of-Time (AOT) compilation flow. AOT

	generation is deferred until application runtime time.
	generation is deferred until application runtime.

	makes it difficult to use in a unified offloading setting. Two of the primary
	make it difficult to use in a unified offloading setting. Two of the primary

Conversation

asudarsa commented Oct 7, 2024 • edited

sarnex left a comment

sarnex Oct 10, 2024 • edited

bader commented Oct 7, 2024

asudarsa Oct 10, 2024 • edited

asudarsa Oct 10, 2024 • edited