Discussion:
[PATCH] [OPENMP] Driver support for OpenMP offloading
Samuel Antao
2015-05-20 17:42:18 UTC
Permalink
Hi ABataev, hfinkel, rsmith, rjmccall, chandlerc,

With a full implementation of OpenMP 3.1 already available upstream, we aim to continue that work and add support for OpenMP 4.0 as well. One important component introduced by OpenMP 4.0 is offloading, which enables the execution of a given structured block to be transferred to a device other than the host.

An implementation of the OpenMP offloading infrastructure in clang is proposed in http://goo.gl/L1rnKJ. This document is already a second iteration that includes contributions from several vendors and members of the LLVM community. It was published at http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-April/084304.html for discussion by the community, and so far no major concerns about the design have been raised.

Unlike other OpenMP components, offloading requires support from the compiler driver, given that for the same source file several (host and target) objects will be generated, using potentially different toolchains. At the same time, the compiler needs a mechanism to relate variables on the host with the ones generated for the target, so communication between toolchains is required. The way this relation is supported by the driver also has implications for code generation.

This patch proposes an implementation of the driver support for offloading. The following summarizes the main changes this patch introduces:

a) clang can be invoked with -fopenmp=libiomp5 -omptargets=triple1,...,tripleN, where each triple is a target triple the user wants to be able to offload to (see the example invocation after this list).

b) The driver detects whether the offloading triples are valid and whether the corresponding toolchain is prepared to offload. This patch only enables offloading for Linux toolchains.

c) Each target compiler phase takes the host IR (the result of the host compiler phase) as a second input. This enables the host compilation to specify, in the form of metadata, the variables that should be emitted for the target, and that metadata can be read by the target frontend.

d) Given that the same host IR result is used by the different toolchains, the driver keeps a cache of results in order to avoid emitting the job that generates a given result twice.

e) Offloading leverages the argument-translation functionality in order to convert host arguments into target arguments. This is currently used to make sure a shared library is always produced by the target toolchain - a library that can be loaded by the OpenMP runtime library.

f) The target shared libraries are embedded into the host binary by using a linker script produced by the driver and passed to the host linker.

g) The driver passes the frontend an offloading option that specifies whether the frontend is producing code for a target. This is required because code generation for the target and the host has to differ.

h) A full path to the original source file is passed to the frontend so it can be used to produce unique IDs that are the same for the host and targets.
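
For illustration, a complete invocation could look like the command below, borrowing the triples used by the regression test added in this patch (the runtime-library name follows item a); the exact flags are a sketch of the intended usage, not a transcript from the patch):

```
clang -fopenmp=libiomp5 -omptargets=x86_64-pc-linux-gnu,powerpc64-ibm-linux-gnu \
      -target powerpc64-ibm-linux-gnu -O2 foo.c -o foo
```

A single command like this compiles foo.c once per requested triple and produces one host binary with the device shared libraries embedded, as described in f).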

Thanks!
Samuel

http://reviews.llvm.org/D9888

Files:
include/clang/Basic/DiagnosticDriverKinds.td
include/clang/Basic/DiagnosticGroups.td
include/clang/Driver/Action.h
include/clang/Driver/Compilation.h
include/clang/Driver/Driver.h
include/clang/Driver/Options.td
include/clang/Driver/ToolChain.h
include/clang/Driver/Types.def
lib/Driver/Action.cpp
lib/Driver/Compilation.cpp
lib/Driver/Driver.cpp
lib/Driver/InputInfo.h
lib/Driver/ToolChain.cpp
lib/Driver/ToolChains.cpp
lib/Driver/ToolChains.h
lib/Driver/Tools.cpp
test/OpenMP/target_driver.c

John McCall
2015-05-21 16:55:42 UTC
Permalink
Hmm. Using the host IR as an implicit line of communication is an interesting approach. Can you expand on what kind of information needs to flow from host to target there, or at least link to a place in the previous discussion?


http://reviews.llvm.org/D9888

Samuel Antao
2015-05-21 21:20:19 UTC
Permalink
Hi John

Thanks for looking into this patch!

Sure, let me expand on the host-target communication. Just a little bit of context before I do that:

During code generation, the target frontend has to decide whether a given declaration or target region has to be emitted or not. Without any information communicated from the host frontend, this decision becomes complicated for cases like:

- #pragma omp target regions in static functions or class members;
- static declarations delimited by #pragma omp declare target regions that end up not being used;
- #pragma omp target in template functions.

In order for the target frontend to correctly identify all the declarations that need to be emitted, it would have to somehow emulate the actions done by the host frontend, which would make code generation messy in places that are not even related to OpenMP.

On top of that, in order to have an efficient mapping between the host and target entries tables (entries being global declarations and target regions; this is discussed in the document, in section 5.1, where __tgt_offload_entry is introduced), the compiler would have to emit the corresponding entries on the host and target sides in the same order. This is useful for devices whose toolchain maintains the order of the symbols, given that the order of the entries in the host and target tables will then be the same after linking, so knowing an index is enough to do the mapping. In order for that to happen, the target frontend would have to know that order, which would also be hard to extract if no information is communicated from the host.
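
For instance, assuming both toolchains preserve the emission order, the host and target entry tables line up after linking and a single index identifies matching entries (the names below are placeholders, not the actual mangling):

```
index   host entries table           target entries table
  0     C                            C
  1     <target region 0 of foo>     <target region 0 of foo>
  2     D                            D
```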

So, the information that needs to be propagated to make what I described above possible is basically i) the declaration mangled names and ii) the order in which they were emitted. This information could be communicated in the form of metadata that is emitted by the host frontend when the module is released and loaded by the target frontend when CGOpenMPRuntime is created. This information has to be encoded in slightly different ways for different kinds of declarations, however. Let me explain this with an example:

//######################################
#pragma omp declare target
struct MyClass {
  ...
  MyClass &add(MyClass &op) {...}
  MyClass &add(int (&op)[N]) {...}
  bool eq(MyClass &op) {...}
  MyClass() {...}
  ~MyClass() {...}
};

MyClass C;
MyClass D;
#pragma omp end declare target

void foo() {
  int AA[N];
  MyClass H, T;
  MyClass HC;
  ...
  #pragma omp target
  {
    MyClass TC;
    T.add(AA).add(HC);
  }

  if (H.eq(T)) {...}

  #pragma omp target
  {
    T.add(AA);
  }
}
//######################################

I was planning the metadata for this example to look more or less like this:

; Named metadata that encloses all the offloading information
!openmp.offloading.info = !{!1, !2, !3, !4, !5, !6, !7, !8, !9, !10}

; Global variables that require a map between host and target:
; Entry 0 -> ID for this type of metadata (0)
; Entry 1 -> Mangled name of the variable
; Entry 2 -> Order it was emitted
!1 = !{i32 0, !"C", i32 0}
!2 = !{i32 0, !"D", i32 2}

; Functions with target regions
; Entry 0 -> ID for this type of metadata (1)
; Entry 1 -> Mangled name of the function that was emitted for the host and encloses target regions
; Entry 2-n -> Order in which the target regions in the function (in the same sequence the statements are found) are emitted
!3 = !{i32 1, !"_Z3foov", i32 4, i32 5}

; Global initializers
; Entry 0 -> ID for this type of metadata (2)
; Entry 1-n -> Order the initializers are emitted in descending order of priority (we will require a target region per set of initializers with the same priority)
!4 = !{i32 2, i32 6}

; Global Dtors
; Entry 0 -> ID for this type of metadata (3)
; Entry 1 -> Mangled name of the variable to be destructed
; Entry 2 -> Order the destructor was emitted (we will have a target region per variable being destroyed - this can probably be optimized)
!5 = !{i32 3, !"C", i32 1}
!6 = !{i32 3, !"D", i32 3}

; Other functions that should be emitted in the target but do not require to be mapped to the host
; Entry 0 -> ID for this type of metadata (4)
; Entry 1 -> Mangled name of the function that has to be emitted.
!7 = !{i32 4, !"_ZN7MyClass3addERA64_i"}
!8 = !{i32 4, !"_ZN7MyClass3addERS_"}
!9 = !{i32 4, !"_ZN7MyClassC2Ev"}
!10 = !{i32 4, !"_ZN7MyClassD2Ev"}

I realize this is the kind of information I should propose as a patch to the codegen part of offloading, but I think it makes sense to discuss it now as the driver has to enable it.

I also foresee the communication between target and host being useful for other cases, like the propagation of alias information from host to target. I don't have a proposal for that at this moment, however.

Hope I haven’t been either too brief or too exhaustive! Let me know if I can clarify anything else for you.

Thanks!
Samuel


http://reviews.llvm.org/D9888

Samuel Antao
2015-06-08 14:50:42 UTC
Permalink
Are there any other comments or questions about this patch?

Many thanks!
Samuel


http://reviews.llvm.org/D9888

Samuel Antao
2015-06-15 17:01:04 UTC
Permalink
I've just noticed that Chad owns the compiler driver, so I believe he should also be added to the list of reviewers of this patch.

Thanks!
Samuel


http://reviews.llvm.org/D9888

Eric Christopher
2015-06-15 20:04:25 UTC
Permalink
Quite a big patch; I'd definitely like to take a look at this as well. It relates to how some of the CUDA work is progressing too.

Thanks!

-eric


http://reviews.llvm.org/D9888

Samuel Antao
2015-06-15 22:09:44 UTC
Permalink
Post by Eric Christopher
Quite a big patch, I'd definitely like to take a look at this as well. It relates to how some of the cuda work is progressing too.
Thanks!
-eric
Thanks Eric,

Please let me know any comments you may have.

I agree the patch is quite big... I had a hard time trying to find a partition that would make sense - this requires small but related changes in several places - and that would map to something meaningful in terms of the regression tests. If you see a good way to partition the patch, let me know and I will gladly do it.

Thanks again!
Samuel


http://reviews.llvm.org/D9888

Jonas Hahnfeld via cfe-commits
2015-09-04 10:26:08 UTC
Permalink
Hahnfeld added a comment.

I think this has to be updated for the current trunk...


http://reviews.llvm.org/D9888
Samuel Antao via cfe-commits
2015-10-01 18:22:54 UTC
Permalink
sfantao updated this revision to Diff 36263.
sfantao added a comment.

This diff refactors the original patch and is rebased on top of the latest offloading changes inserted for CUDA.

Here I don't touch the CUDA support. I tried, however, to have the implementation modular enough so that it could eventually be combined with the CUDA implementation. In my view OpenMP offloading is more general in the sense that it does not refer to a given toolchain; instead it uses existing toolchains to generate code for offloading devices. So, I believe that a toolchain (which I did not include in this patch) targeting NVPTX will be able to handle both the CUDA and OpenMP offloading models.

Chris, Art, I understand you have worked out the latest CUDA changes so any feedback from you is greatly appreciated!

Here are a few more details about this diff:




a) Add a tool to bundle and unbundle corresponding host and device files into a single one.




One of the goals of OpenMP offloading is to enable users to offload with little effort, by annotating the code with a few pragmas. I'd also like to save users the trouble of changing their existing applications' build systems. So having the compiler always return a single file, instead of one for the host and one for each target, even if the user is doing separate compilation, is desirable.

This diff includes a tool named clang-offload-bundler (happy to change the name or even include it in the driver if someone thinks that is the best direction to go) that is used on all input files that are not source files to unbundle them, and on top-level jobs that are not linking jobs to bundle the results obtained for the host and each target.

The format of the bundled files is currently very simple: text formats are concatenated, with comments containing a magic string and the identifying target triple in between, and binary formats have a header that contains the triple and the offset and size of the code for the host and each target.
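
For a textual input such as LLVM IR, a bundled file could conceptually look like the sketch below; the magic string and the kind-triple prefixes are illustrative assumptions, not the final format:

```
; __CLANG_OFFLOAD_BUNDLE____START__ host-powerpc64-ibm-linux-gnu
<host IR>
; __CLANG_OFFLOAD_BUNDLE____END__ host-powerpc64-ibm-linux-gnu
; __CLANG_OFFLOAD_BUNDLE____START__ openmp-x86_64-pc-linux-gnu
<device IR>
; __CLANG_OFFLOAD_BUNDLE____END__ openmp-x86_64-pc-linux-gnu
```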

This tool still has to be improved in the future to deal with archive files, so that each individual file in the archive is properly handled. We see that archives are very commonly used in current applications to combine separate-compilation results, so I'm convinced users would enjoy this feature.




b) The building of the driver actions is unchanged.




I don't create device-specific actions. Instead, only the bundling/unbundling actions are inserted as first or last action if the file type requires that.




c) Add an offloading kind to `ToolChain`.




Offloading does not require a new toolchain to be created. Existing toolchains are used, and the offloading kind is used to drive specific behavior in each toolchain so that valid device code is generated.

This is a major difference from what is currently done for CUDA. But I guess the CUDA implementation easily fits this design, and the Nvidia GPU toolchain could be reused for both CUDA and OpenMP offloading.




d) Use a job results cache to easily use host results in device actions and vice-versa.




An array of the results for each job is kept, so that a device job can use the result previously generated for the host as input, or vice-versa.

In OpenMP the device declarations have to be communicated from the host frontend to the device frontend, so this cache is used to conveniently pass that information. Unlike CUDA, OpenMP doesn't already have outlined functions with "device" attributes that the frontend can rely on to decide what should be emitted.

The result cache can also be updated to keep the information required by the CUDA implementation to decide how host/device binaries are combined (injection is the term used in the code). I don't have a concrete proposal for that, however, given that it is not clear to me what the plans are for CUDA to support separate compilation; I understand that the CUDA binary is inserted directly into the host IR (Art, can you shed some light on this?).
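
As an illustration, the cache could be shaped roughly as in the sketch below. Only the `OffloadingHostResultsTy` name comes from the patch (it shows up in the `BuildJobsForAction` signature discussed later in this review); the element types are my assumption:

```
// Sketch: map each action to the results already generated for it (the host
// result plus one result per offloading target), so that a device job can
// look up the host result and vice-versa. The actual layout may differ.
typedef llvm::DenseMap<const Action *, llvm::SmallVector<InputInfo, 4>>
    OffloadingHostResultsTy;
```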




e) Use a compiler-generated linker script to do the device/host code combining and correctly support separate compilation.




Currently the OpenMP support in the toolchains is only implemented for generic GCC targets, and a linker script is used to embed the resulting device images into ELF sections of the host binary. The linker script also defines the symbols that are emitted during code generation, so that the addresses of the images can be easily retrieved.
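
As an illustration, for a single x86_64 device image the generated script could look roughly like the sketch below; the section and symbol names match the ones that show up later in this thread, but the exact script the driver emits may differ:

```
/* Sketch of a driver-generated linker script passed to the host link. */
TARGET(binary)
INPUT(openmp-device-image.x86_64-unknown-linux-gnu)
SECTIONS
{
  .omp_offloading :
  ALIGN(0x10)
  {
    . = ALIGN(0x10);
    PROVIDE_HIDDEN(.omp_offloading.img_start.x86_64-unknown-linux-gnu = .);
    openmp-device-image.x86_64-unknown-linux-gnu
    PROVIDE_HIDDEN(.omp_offloading.img_end.x86_64-unknown-linux-gnu = .);
  }
}
INSERT BEFORE .data
```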




f) Minor refactoring of the existing code to enable reuse.




I've outlined some of the existing code into static functions so that it can be reused by the new offloading-related hooks.

Any comments/remarks are very welcome!

Thanks!
Samuel


http://reviews.llvm.org/D9888

Files:
include/clang/Basic/DiagnosticDriverKinds.td
include/clang/Driver/Action.h
include/clang/Driver/CC1Options.td
include/clang/Driver/Driver.h
include/clang/Driver/Options.td
include/clang/Driver/ToolChain.h
include/clang/Driver/Types.h
lib/Driver/Action.cpp
lib/Driver/Compilation.cpp
lib/Driver/Driver.cpp
lib/Driver/ToolChain.cpp
lib/Driver/ToolChains.cpp
lib/Driver/ToolChains.h
lib/Driver/Tools.cpp
lib/Driver/Tools.h
lib/Driver/Types.cpp
test/OpenMP/target_driver.c
tools/CMakeLists.txt
tools/Makefile
tools/clang-offload-bundler/CMakeLists.txt
tools/clang-offload-bundler/ClangOffloadBundler.cpp
tools/clang-offload-bundler/Makefile
Jonas Hahnfeld via cfe-commits
2015-10-07 08:25:41 UTC
Permalink
Hahnfeld added a comment.

Currently trying to test, but

1. Offloading to the same target isn't supported (`x86_64-unknown-linux-gnu` as host and device) - this was working with `clang-omp`

The produced IR isn't showing any calls to the target library, and at link time it complains:

undefined reference to `.omp_offloading.img_start.x86_64-unknown-linux-gnu'
undefined reference to `.omp_offloading.img_end.x86_64-unknown-linux-gnu'
undefined reference to `.omp_offloading.entries_begin'
undefined reference to `.omp_offloading.entries_end'
undefined reference to `.omp_offloading.entries_begin'
undefined reference to `.omp_offloading.entries_end'

(btw: `clang-offload-bundler` saves the IR file to `$TMP` with `-S -emit-llvm`; this seems to be a bug - I had to use `--save-temps`)

2. I can't seem to figure out the target triple for NVIDIA GPUs. It should be `nvptx[64]-nvidia-cuda` which gives me

include/llvm/Option/Option.h:101: const llvm::opt::Option llvm::opt::Option::getAlias() const: Assertion `Info && "Must have a valid info!"' failed.

In `clang-omp` it was `nvptxsm_35-nvidia-cuda` but this is now invalid...


http://reviews.llvm.org/D9888
Artem Belevich via cfe-commits
2015-10-07 23:50:02 UTC
Permalink
Samuel Antao via cfe-commits
2015-10-08 00:40:06 UTC
Permalink
sfantao updated this revision to Diff 36816.
sfantao added a comment.

Make the offloading ELF sections consistent with what is in http://reviews.llvm.org/D12614.

Fix bug in AtTopLevel flag, so that the bundling job is considered always top level job.

Fix several typos.


http://reviews.llvm.org/D9888

Files:
include/clang/Basic/DiagnosticDriverKinds.td
include/clang/Driver/Action.h
include/clang/Driver/CC1Options.td
include/clang/Driver/Driver.h
include/clang/Driver/Options.td
include/clang/Driver/ToolChain.h
include/clang/Driver/Types.h
lib/Driver/Action.cpp
lib/Driver/Compilation.cpp
lib/Driver/Driver.cpp
lib/Driver/ToolChain.cpp
lib/Driver/ToolChains.cpp
lib/Driver/ToolChains.h
lib/Driver/Tools.cpp
lib/Driver/Tools.h
lib/Driver/Types.cpp
test/OpenMP/target_driver.c
tools/CMakeLists.txt
tools/Makefile
tools/clang-offload-bundler/CMakeLists.txt
tools/clang-offload-bundler/ClangOffloadBundler.cpp
tools/clang-offload-bundler/Makefile
Samuel Antao via cfe-commits
2015-10-08 00:50:58 UTC
Permalink
sfantao added a comment.

Art, Jonas,

Thanks for the comments!
Post by Jonas Hahnfeld via cfe-commits
Currently trying to test, but
1. Offloading to the same target isn't supported (`x86_64-unknown-linux-gnu` as host and device) - this was working with `clang-omp`. The produced IR isn't showing any calls to the target library and on linkage it complains:

```
undefined reference to `.omp_offloading.img_start.x86_64-unknown-linux-gnu'
undefined reference to `.omp_offloading.img_end.x86_64-unknown-linux-gnu'
undefined reference to `.omp_offloading.entries_begin'
undefined reference to `.omp_offloading.entries_end'
undefined reference to `.omp_offloading.entries_begin'
undefined reference to `.omp_offloading.entries_end'
```
I assume you were trying this using the diff in http://reviews.llvm.org/D12614. There was an inconsistency in the names of the ELF sections and symbols defined by the linker script in these two patches. This is now fixed.

Note that if you are using the libomptarget library from clang-omp, you need to replace `.openmptgt_host_entries` with `.omp_offloading.entries` in the code. I changed the names so that all of them are consistent with what is already in place for other OpenMP directives.

I also changed the file generation so that different files are used even if target and host have the same triple.

Please, let me know if it still does not work for you.
Post by Jonas Hahnfeld via cfe-commits
(btw: `clang-offload-bundler` saves the IR file to `$TMP` with `-S -emit-llvm`, this seems to be a bug - I had to use `--save-temps`)
Yes, the bundling job was not being marked as top level. It is now fixed!
Post by Jonas Hahnfeld via cfe-commits
2. I can't seem to figure out the target triple for NVIDIA GPUs. It should be `nvptx[64]-nvidia-cuda`, which gives me:

```
include/llvm/Option/Option.h:101: const llvm::opt::Option llvm::opt::Option::getAlias() const: Assertion `Info && "Must have a valid info!"' failed.
```

In `clang-omp` it was `nvptxsm_35-nvidia-cuda` but this is now invalid...
I didn't implement the triples logic for the nvptx targets yet. I'll port that from clang-omp once we have the basic functionality working upstream.

I'll address Art's comments in a separate message.

Thanks again,
Samuel


================
Comment at: include/clang/Driver/Driver.h:208
@@ +207,3 @@
+ /// CreateUnbundledOffloadingResult - Create a command to unbundle the input
+ /// and use the resulting input info. If there re inputs already cached in
+ /// OffloadingHostResults for that action use them instead. If no offloading
----------------
Post by Jonas Hahnfeld via cfe-commits
re -> are
Fixed!

================
Comment at: include/clang/Driver/Driver.h:210
@@ +209,3 @@
+ /// OffloadingHostResults for that action use them instead. If no offloading
+ /// is being support just return the provided input info.
+ InputInfo CreateUnbundledOffloadingResult(
----------------
Post by Jonas Hahnfeld via cfe-commits
"If offloading is not supported" perhaps?
Fixed!

================
Comment at: lib/Driver/Driver.cpp:2090
@@ +2089,3 @@
+ dyn_cast<OffloadUnbundlingJobAction>(A)) {
+ // The input of the unbundling job has to a single input non-source file,
+ // so we do not consider it having multiple architectures. We just use the
----------------
Post by Jonas Hahnfeld via cfe-commits
"has to be"
Fixed!


http://reviews.llvm.org/D9888
Samuel Antao via cfe-commits
2015-10-08 01:56:57 UTC
Permalink
sfantao added a comment.
Post by Artem Belevich via cfe-commits
Post by Samuel Antao via cfe-commits
This diff refactors the original patch and is rebased on top of the latests offloading changes inserted for CUDA.
Here I don't touch the CUDA support. I tried, however, to have the implementation modular enough so that it could eventually be combined with the CUDA implementation. In my view OpenMP offloading is more general in the sense that it does not refer to a given tool chain, instead it uses existing toolchains to generate code for offloading devices. So, I believe that a tool chain (which I did not include in this patch) targeting NVPTX will be able to handle both CUDA and OpenMP offloading models.
What do you mean by "does not refer to a given toolchain"? Do you have the toolchain patch available?
I mean not having to create a toolchain for a specific offloading model. OpenMP offloading is meant for any target, and possibly many different targets simultaneously, so having a toolchain for each combination would be overwhelming.

I don't have a patch for the toolchain out for review yet. I'm planning to port what we have in clang-omp for the NVPTX toolchain once I have the host functionality in place. In there (https://github.com/clang-omp/clang_trunk/tree/master/lib/Driver) the driver is implemented in a different way; I guess the version I'm proposing here is much cleaner. However, the ToolChains shouldn't be that different. All the tweaking is moved to the `Tool` itself, and I imagine I can drive that using the `ToolChain` offloading kind I'm proposing here. In https://github.com/clang-omp/clang_trunk/blob/master/lib/Driver/Tools.cpp I basically pick some arguments to forward to the tool and do some tricks to include libdevice in the compilation when required. Do you think something like that could also work for CUDA?
Post by Artem Belevich via cfe-commits
Creating a separate toolchain for CUDA was a crutch that was available to craft appropriate cc1 command line for device-side compilation using existing toolchain. It works, but it's rather rigid arrangement. Creating a NVPTX toolchain which can be parameterized to produce CUDA or OpenMP would be an improvement.
Ideally toolchain tweaking should probably be done outside of the toolchain itself so that it can be used with any combination of {CUDA or OpenMP target tweaks}x{toolchains capable of generating target code}.
I agree. I decided to move all the offloading tweaking to the tools, given that that is what the clang tool already does: it customizes the arguments based on the `ToolChain` that is passed to it.
Post by Artem Belevich via cfe-commits
Post by Samuel Antao via cfe-commits
b ) The building of the driver actions is unchanged.
I don't create device specific actions. Instead only the bundling/unbundling are inserted as first or last action if the file type requires that.
Could you elaborate on that? The way I read it, the driver sees linear chain of compilation steps plus bundling/unbundling at the beginning/end and that each action would result in multiple compiler invocations, presumably per target.
If that's the case, then it may present a bit of a challenge in case one part of compilation depends on results of another. That's the case for CUDA where results of device-side compilation must be present for host-side compilation so we can generate additional code to initialize it at runtime.
That's right. I try to tackle the challenge of passing host/device results to device/host jobs by using a cache of results, as I described in d). The goal here is to add the flexibility required to accommodate different offloading models. In OpenMP we use host compile results in device compile jobs, and device link results in host link jobs, whereas in CUDA the assemble result is used in the compile job. I believe that we can have that cache include whatever information is required to suit all needs.
Post by Artem Belevich via cfe-commits
Post by Samuel Antao via cfe-commits
c) Add offloading kind to `ToolChain`
Offloading does not require a new toolchain to be created. Existent toolchains are used and the offloading kind is used to drive specific behavior in each toolchain so that valid device code is generated.
This is a major difference from what is currently done for CUDA. But I guess the CUDA implementation easily fits this design and the Nvidia GPU toolchain could be reused for both CUDA and OpenMP offloading.
Sounds good. I'd be happy to make the necessary changes so that the CUDA support uses it.
Great! Thanks.
Post by Artem Belevich via cfe-commits
Post by Samuel Antao via cfe-commits
d) Use Job results cache to easily use host results in device actions and vice-versa.
An array of the results for each job is kept so that the device job can use the result previously generated for the host and used it as input or vice-versa.
Nice. That's something that will be handy for CUDA and may help to avoid passing bits of info about other jobs explicitly throughout the driver.
Post by Samuel Antao via cfe-commits
The result cache can also be updated to keep the required information for the CUDA implementation to decide host/device binaries combining (injection is the term used in the code). I don't have a concrete proposal for that however, given that is not clear to me what are the plans for CUDA to support separate compilation, I understand that the CUDA binary is inserted directly in host IR (Art, can you shed some light on this?).
Currently CUDA depends on libcudart which assumes that GPU code and its initialization is done the way nvcc does it. Currently we do include PTX assembly (as in readable text) generated on device side into host-side IR *and* generate some host data structures and init code to register GPU binaries with libcudart. I haven't figured out a way to compile host/device sides of CUDA without a host-side compilation depending on device results.
Long-term we're considering implementing CUDA runtime support based on plain driver interface which would give us more control over where we keep GPU code and how we initialize it. Then we could simplify things and, for example, incorporate GPU code via linker script. Alas for the time being we're stuck with libcudart and sequential device and host compilation phases.
As for separate compilation -- compilation part is doable. It's using the results of such compilation that becomes tricky. CUDA's triple-bracket kernel launch syntax depends on libcudart and will not work, because we would not generate init code. You can still launch kernels manually using raw driver API, but it's quite a bit more convoluted.
Ok, I see. I am not aware of what exactly libcudart does, but I can elaborate on what the OpenMP offloading implementation we have in place does:

We have a descriptor that is registered with the runtime library (we generate a function for that which is called before any global initializers are executed). This descriptor has (among other things) fields that are initialized with the symbols defined by the linker script (so that the runtime library can immediately get the CUDA module) and also the names of the kernels (in OpenMP we don't have user-defined names for these kernels, so we generate some mangling to make sure they are unique). While launching the kernel, the runtime gets a pointer from which it can easily retrieve the name, and the CUDA driver API is used to get the CUDA function to be launched. We have been successfully generating a CUDA module that works well with separate compilation using ptxas and nvlink.
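
To make the registration concrete, here is a C sketch of such a descriptor, following the `__tgt_offload_entry` convention from section 5.1 of the design document; apart from `__tgt_offload_entry` itself, the type and field names are my assumptions for illustration:

```
// One entry per global variable or target region; host and device tables
// hold matching entries in the same order.
struct __tgt_offload_entry {
  void *addr;   // Host address of the global/target-region identifier.
  char *name;   // Mangled (or compiler-generated) name used for the mapping.
  int64_t size; // Size of the global in bytes; 0 for target regions.
};

// One device image; the pointers are initialized with the symbols defined
// by the linker script (e.g. .omp_offloading.img_start.<triple>).
struct __tgt_device_image {
  void *ImageStart;
  void *ImageEnd;
  struct __tgt_offload_entry *EntriesBegin;
  struct __tgt_offload_entry *EntriesEnd;
};

// The descriptor registered with the runtime library before any global
// initializers run.
struct __tgt_bin_desc {
  int32_t NumDeviceImages;
  struct __tgt_device_image *DeviceImages;
  struct __tgt_offload_entry *HostEntriesBegin;
  struct __tgt_offload_entry *HostEntriesEnd;
};

void __tgt_register_lib(struct __tgt_bin_desc *Desc);
```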

Part of my work is also to port the runtime library in clang-omp to the LLVM OpenMP project. I see CUDA as a simplified version of what OpenMP does, given that the user controls the data mappings explicitly, so I am sure we can find some synergies in the runtime library too, and you may be able to use something we already have in there.

Thanks!
Samuel
Post by Artem Belevich via cfe-commits
--Artem
http://reviews.llvm.org/D9888
Jonas Hahnfeld via cfe-commits
2015-10-08 07:23:50 UTC
Permalink
Hahnfeld added a comment.
Post by Samuel Antao via cfe-commits
[...]
I assume you were trying this using the diff in http://reviews.llvm.org/D12614. There was an inconsistency in the names of the ELF sections and symbols defined by the linker script in these two patches. This is now fixed.
Note that if you are using the libomptarget library from clang-omp, you need to replace in the code `.openmptgt_host_entries` by `.omp_offloading.entries`. I changed the names so that all of them are consistent with what is already in place for other OpenMP directives.
I also changed the files generation so that different files are used even if target and host have the same triple.
Please, let me know if it still does not work for you.
Thanks for your help, a small test program now seems to work!
Post by Samuel Antao via cfe-commits
[...]
I didn't implement the triples logic for the nvptx targets yet. I'll port that from clang-omp once we have the basic functionality working upstream.
Ok, I'll wait then. Thanks for your work and finally upstreaming this!
Jonas


http://reviews.llvm.org/D9888
Samuel Antao via cfe-commits
2015-10-15 00:47:58 UTC
Permalink
sfantao added a comment.

Are there any more comments/suggestions about this patch?

Thanks!
Samuel


http://reviews.llvm.org/D9888
Samuel Antao via cfe-commits
2015-10-20 18:59:39 UTC
Permalink
sfantao updated this revision to Diff 37903.
sfantao added a comment.

Move clang-offload-bundler to a separate review: http://reviews.llvm.org/D13909.

This patch depends on http://reviews.llvm.org/D13909.


http://reviews.llvm.org/D9888

Files:
include/clang/Basic/DiagnosticDriverKinds.td
include/clang/Driver/Action.h
include/clang/Driver/CC1Options.td
include/clang/Driver/Driver.h
include/clang/Driver/Options.td
include/clang/Driver/ToolChain.h
include/clang/Driver/Types.h
lib/Driver/Action.cpp
lib/Driver/Compilation.cpp
lib/Driver/Driver.cpp
lib/Driver/ToolChain.cpp
lib/Driver/ToolChains.cpp
lib/Driver/ToolChains.h
lib/Driver/Tools.cpp
lib/Driver/Tools.h
lib/Driver/Types.cpp
test/OpenMP/target_driver.c
Samuel Antao via cfe-commits
2015-11-06 22:35:45 UTC
Permalink
sfantao updated this revision to Diff 39594.
sfantao added a comment.

Rebase.


http://reviews.llvm.org/D9888

Files:
include/clang/Basic/DiagnosticDriverKinds.td
include/clang/Driver/Action.h
include/clang/Driver/CC1Options.td
include/clang/Driver/Driver.h
include/clang/Driver/Options.td
include/clang/Driver/ToolChain.h
include/clang/Driver/Types.h
lib/Driver/Action.cpp
lib/Driver/Compilation.cpp
lib/Driver/Driver.cpp
lib/Driver/ToolChain.cpp
lib/Driver/ToolChains.cpp
lib/Driver/ToolChains.h
lib/Driver/Tools.cpp
lib/Driver/Tools.h
lib/Driver/Types.cpp
test/OpenMP/target_driver.c
Samuel Antao via cfe-commits
2015-11-24 01:15:45 UTC
Permalink
sfantao updated this revision to Diff 41001.
sfantao added a comment.

Rebase.


http://reviews.llvm.org/D9888

Files:
include/clang/Basic/DiagnosticDriverKinds.td
include/clang/Driver/Action.h
include/clang/Driver/CC1Options.td
include/clang/Driver/Driver.h
include/clang/Driver/Options.td
include/clang/Driver/ToolChain.h
include/clang/Driver/Types.h
lib/Driver/Action.cpp
lib/Driver/Compilation.cpp
lib/Driver/Driver.cpp
lib/Driver/ToolChain.cpp
lib/Driver/ToolChains.cpp
lib/Driver/ToolChains.h
lib/Driver/Tools.cpp
lib/Driver/Tools.h
lib/Driver/Types.cpp
test/OpenMP/target_driver.c
Samuel Antao via cfe-commits
2015-12-09 00:03:44 UTC
Permalink
sfantao added a comment.

Any more comments on this patch?

Thanks,
Samuel


http://reviews.llvm.org/D9888
Jonas Hahnfeld via cfe-commits
2016-01-20 06:53:16 UTC
Permalink
Hahnfeld added a comment.

Will this receive a final review and get merged at some point?


http://reviews.llvm.org/D9888
Jonas Hahnfeld via cfe-commits
2016-02-12 14:06:45 UTC
Permalink
Hahnfeld added a comment.

@rsmith could you possibly take a look at this one? It has been around for roughly 8 months now and hasn't received much feedback.


http://reviews.llvm.org/D9888
Richard Smith via cfe-commits
2016-03-18 17:00:57 UTC
Permalink
rsmith added a comment.

@echristo, you asked for time to review this; if you still want to, could you please do so?
@tra, it looks like you're happy with this design (and with moving the CUDA offloading support in this direction); please let us know if not!


================
Comment at: include/clang/Driver/Options.td:1617-1618
@@ -1616,2 +1616,4 @@
HelpText<"Write output to <file>">, MetaVarName<"<file>">;
+def omptargets_EQ : CommaJoined<["-"], "omptargets=">, Flags<[DriverOption, CC1Option]>,
+ HelpText<"Specify comma-separated list of triples OpenMP offloading targets to be supported">;
def pagezero__size : JoinedOrSeparate<["-"], "pagezero_size">;
----------------
This is an unfortunate flag name; `-oblah` already means something. Is this name chosen for compatibility with some other system, or could we change it to, say, `-fopenmp-targets=`?

================
Comment at: lib/Driver/Tools.cpp:316
@@ +315,3 @@
+ // Add commands to embed target binaries. We ensure that each section and
+ // image s 16-byte aligned. This is not mandatory, but increases the
+ // likelihood of data to be aligned with a cache block in several main host
----------------
s -> is


http://reviews.llvm.org/D9888
Samuel Antao via cfe-commits
2016-03-18 17:32:35 UTC
Permalink
sfantao added a comment.

Hi Richard,

Thanks for your review. I partitioned some of the stuff I am proposing here into smaller patches:

http://reviews.llvm.org/D18170
http://reviews.llvm.org/D18171
http://reviews.llvm.org/D18172

These patches already try to incorporate the feedback I got in http://lists.llvm.org/pipermail/cfe-dev/2016-February/047547.html related to the generation of actions.

Thanks again,
Samuel


================
Comment at: include/clang/Driver/Options.td:1617-1618
@@ -1616,2 +1616,4 @@
HelpText<"Write output to <file>">, MetaVarName<"<file>">;
+def omptargets_EQ : CommaJoined<["-"], "omptargets=">, Flags<[DriverOption, CC1Option]>,
+ HelpText<"Specify comma-separated list of triples OpenMP offloading targets to be supported">;
def pagezero__size : JoinedOrSeparate<["-"], "pagezero_size">;
----------------
Post by Richard Smith via cfe-commits
This is an unfortunate flag name; `-oblah` already means something. Is this name chosen for compatibility with some other system, or could we change it to, say, `-fopenmp-targets=`?
You are right; we are now using -fomptargets in codegen exactly because of that. I can change it to `-fopenmp-targets=`; we don't have any compatibility issues at this point.

================
Comment at: lib/Driver/Tools.cpp:316
@@ +315,3 @@
+ // Add commands to embed target binaries. We ensure that each section and
+ // image s 16-byte aligned. This is not mandatory, but increases the
+ // likelihood of data to be aligned with a cache block in several main host
----------------
Post by Richard Smith via cfe-commits
s -> is
I'll fix it.


http://reviews.llvm.org/D9888
Michael Kuron via cfe-commits
2016-03-22 14:01:35 UTC
Permalink
mkuron added a comment.

The three smaller patches into which you divided this one appear to be missing some things. For example, `AddOpenMPLinkerScript` in //lib/Driver/Tools.cpp// from this patch appears to still be necessary to get the desired functionality, but it is not present in any of the three.


http://reviews.llvm.org/D9888
Samuel Antao via cfe-commits
2016-03-22 14:40:13 UTC
Permalink
sfantao added a comment.

Hi Michael,
Post by Michael Kuron via cfe-commits
The three smaller patches into which you divided this one appear to be missing some things. For example, `AddOpenMPLinkerScript` in //lib/Driver/Tools.cpp// from this patch appears to still be necessary to get the desired functionality, but it is not present in any of the three.
Those three patches do not add any OpenMP-specific code yet, so they do not cover the whole implementation I have here. I am doing things in a slightly different way in the new patches, given the feedback I got on the mailing list, and I am waiting for review to see if the approach I have there is acceptable. If so, I'll continue with the OpenMP-related patches afterwards.

Thanks,
Samuel


http://reviews.llvm.org/D9888
Eric Christopher via cfe-commits
2016-03-22 23:50:37 UTC
Permalink
echristo added a comment.

First I'd like to note that the code quality here is really high; most of my comments are about higher-level design decisions for the driver and the implementation rather than the code itself.

One meta comment: offload appears to be something that could be used for CUDA and OpenMP (and OpenACC etc) as a term. I think we should either merge these concepts or pick a different name :)

Thanks for all of your work and patience here! The rest of the comments are inline.

-eric


================
Comment at: include/clang/Driver/Driver.h:210-213
@@ +209,6 @@
+ /// owns all the ToolChain objects stored in it, and will clean them up when
+ /// torn down. We use a different cache for offloading as it is possible to
+ /// have offloading toolchains with the same triple the host has, and the
+ /// implementation has to differentiate the two in order to adjust the
+ /// commands for offloading.
+ mutable llvm::StringMap<ToolChain *> OffloadToolChains;
----------------
Example?

================
Comment at: include/clang/Driver/Driver.h:216-217
@@ +215,4 @@
+
+ /// \brief Array of the toolchains of offloading targets in the order they
+ /// were requested by the user.
+ SmallVector<const ToolChain *, 4> OrderedOffloadingToolchains;
----------------
Any reason?

================
Comment at: include/clang/Driver/Driver.h:427-435
@@ -383,10 +426,11 @@
/// action \p A.
void BuildJobsForAction(Compilation &C,
const Action *A,
const ToolChain *TC,
const char *BoundArch,
bool AtTopLevel,
bool MultipleArchs,
const char *LinkingOutput,
- InputInfo &Result) const;
+ InputInfo &Result,
+ OffloadingHostResultsTy &OffloadingHostResults) const;

----------------
This function is starting to get a little silly. Perhaps we should look into refactoring such that this doesn't need to be "the one function that rules them all". Perhaps a different ownership model for the things that are arguments here?

================
Comment at: lib/Driver/Compilation.cpp:66-67
@@ +65,4 @@
+
+ // Check if there is any offloading specific translation to do.
+ DerivedArgList *OffloadArgs = TC->TranslateOffloadArgs(*Entry, BoundArch);
+ if (OffloadArgs) {
----------------
Hmm?

================
Comment at: lib/Driver/Driver.cpp:224-225
@@ +223,4 @@
+
+/// \brief Dump the job bindings for a given action.
+static void DumpJobBindings(ArrayRef<const ToolChain *> TCs, StringRef ToolName,
+ ArrayRef<InputInfo> Inputs,
----------------
This can probably be done separately? Can you split this out and make it generally useful?

================
Comment at: lib/Driver/Driver.cpp:2045-2051
@@ -1739,11 +2044,9 @@
// checking the backend tool, check if the tool for the CompileJob
- // has an integrated assembler.
- const ActionList *BackendInputs = &(*Inputs)[0]->getInputs();
- // Compile job may be wrapped in CudaHostAction, extract it if
- // that's the case and update CollapsedCHA if we combine phases.
- CudaHostAction *CHA = dyn_cast<CudaHostAction>(*BackendInputs->begin());
- JobAction *CompileJA =
- cast<CompileJobAction>(CHA ? *CHA->begin() : *BackendInputs->begin());
- assert(CompileJA && "Backend job is not preceeded by compile job.");
- const Tool *Compiler = TC->SelectTool(*CompileJA);
- if (!Compiler)
+ // has an integrated assembler. However, if OpenMP offloading is required
+ // the backend and compile jobs have to be kept separate and an integrated
+ // assembler of the backend job will be queried instead.
+ JobAction *CurJA = cast<BackendJobAction>(*Inputs->begin());
+ const ActionList *BackendInputs = &CurJA->getInputs();
+ CudaHostAction *CHA = nullptr;
+ if (!RequiresOpenMPOffloading(TC)) {
+ // Compile job may be wrapped in CudaHostAction, extract it if
----------------
Might be time to make some specialized versions of this function. This may take it from "ridiculously confusing" to "code no one should ever look at" :)

================
Comment at: lib/Driver/Tools.cpp:6032
@@ +6031,3 @@
+ // The (un)bundling command looks like this:
+ // clang-offload-bundler -type=bc
+ // -omptargets=host-triple,device-triple1,device-triple2
----------------
Should we get the offload bundler in first so that the interface is there and testable? (Honest question, no particular opinion here). Though the command lines there will affect how this code is written.

================
Comment at: test/OpenMP/target_driver.c:41-47
@@ +40,9 @@
+
+// CHK-PHASES-LIB-DAG: {{.*}}: linker, {[[L0:[0-9]+]], [[A0:[0-9]+]]}, image
+// CHK-PHASES-LIB-DAG: [[A0]]: assembler, {[[A1:[0-9]+]]}, object
+// CHK-PHASES-LIB-DAG: [[A1]]: backend, {[[A2:[0-9]+]]}, assembler
+// CHK-PHASES-LIB-DAG: [[A2]]: compiler, {[[A3:[0-9]+]]}, ir
+// CHK-PHASES-LIB-DAG: [[A3]]: preprocessor, {[[I:[0-9]+]]}, cpp-output
+// CHK-PHASES-LIB-DAG: [[I]]: input, {{.*}}, c
+// CHK-PHASES-LIB-DAG: [[L0]]: input, "m", object
+
----------------
Do we really think the phases should be a DAG check?

================
Comment at: test/OpenMP/target_driver.c:54
@@ +53,3 @@
+// RUN: echo 'bla' > %t.o
+// RUN: %clang -ccc-print-phases -lm -fopenmp=libomp -target powerpc64-ibm-linux-gnu -omptargets=x86_64-pc-linux-gnu,powerpc64-ibm-linux-gnu %s %t.o 2>&1 \
+// RUN: | FileCheck -check-prefix=CHK-PHASES-OBJ %s
----------------
How do you pass options to individual omptargets? e.g. -mvsx or -mavx2?


http://reviews.llvm.org/D9888
Samuel Antao via cfe-commits
2016-04-07 01:53:31 UTC
Permalink
sfantao marked 8 inline comments as done.
sfantao added a comment.

Hi Eric,

Thanks for the review!

As you are probably aware, I started partitioning this patch following your initial concern about its size and the feedback I got from http://lists.llvm.org/pipermail/cfe-dev/2016-February/047547.html. I am keeping this patch open as it shows the big picture of what I am trying to accomplish, so if you prefer to add other higher-level suggestions here, that's perfectly fine. Let me know if there is a more proper way to link patches.

So, I am incorporating your suggestions here into the partitioned patches as specified in the inline comments. The partitioned patches are http://reviews.llvm.org/D18170, http://reviews.llvm.org/D18171 and http://reviews.llvm.org/D18172.
Post by Eric Christopher via cfe-commits
One meta comment: offload appears to be something that could be used for CUDA and OpenMP (and OpenACC etc) as a term. I think we should either merge these concepts or pick a different name :)
Yes, I agree. I am now using `offloading`. I only refer to the programming-model name if the code relates to something specific to that programming model.

Thanks again,
Samuel


================
Comment at: include/clang/Driver/Driver.h:210-213
@@ +209,6 @@
+ /// owns all the ToolChain objects stored in it, and will clean them up when
+ /// torn down. We use a different cache for offloading as it is possible to
+ /// have offloading toolchains with the same triple the host has, and the
+ /// implementation has to differentiate the two in order to adjust the
+ /// commands for offloading.
+ mutable llvm::StringMap<ToolChain *> OffloadToolChains;
----------------
Post by Eric Christopher via cfe-commits
Example?
I got rid of this extra toolchain cache and I am organizing it in a multimap by offload kind, as Art suggested in http://reviews.llvm.org/D18170. That avoids the multiple containers for the offloading toolchains (this one and the ordered one).

================
Comment at: include/clang/Driver/Driver.h:216-217
@@ +215,4 @@
+
+ /// \brief Array of the toolchains of offloading targets in the order they
+ /// were requested by the user.
+ SmallVector<const ToolChain *, 4> OrderedOffloadingToolchains;
----------------
Post by Eric Christopher via cfe-commits
Any reason?
Currently in OpenMP any directive that relates with offloading supports a `device()` clause that basically specifies which device to use for that region or data transfer. E.g.

```
void foo() {
  ...
}

void bar(int i) {
  #pragma omp target device(i)
  foo();
}
```
... here foo is going to be executed on device `i`. The problem is that the device is an integer - it does not tell which device type it is - so it is up to the implementation to decide how `i` is interpreted. So, if we have a system with two GPUs and two DSP devices, we may bind 0-1 to the GPUs and 2-3 to the DSPs.

My goal with preserving the order of the toolchains was to allow codegen to leverage that information and make a better decision on how to bind devices to integers. Maybe, if the user requests the GPU toolchain first, he is interested in prioritizing its use, so the first IDs would map to GPUs. Making a long story short, this is only about preserving information so that codegen can use it.

In any case, this is going to change in the future, as the OpenMP language committee is working on having a device identifier to use instead of an integer. So, if you prefer to remove the `ordered` from the name, I am not opposed to that.


================
Comment at: include/clang/Driver/Driver.h:427-435
@@ -383,10 +426,11 @@
/// action \p A.
void BuildJobsForAction(Compilation &C,
const Action *A,
const ToolChain *TC,
const char *BoundArch,
bool AtTopLevel,
bool MultipleArchs,
const char *LinkingOutput,
- InputInfo &Result) const;
+ InputInfo &Result,
+ OffloadingHostResultsTy &OffloadingHostResults) const;

----------------
Post by Eric Christopher via cfe-commits
This function is starting to get a little silly. Perhaps we should look into refactoring such that this doesn't need to be "the one function that rules them all". Perhaps a different ownership model for the things that are arguments here?
This has changed a little with the recent CUDA work. In the version http://reviews.llvm.org/D18171 is based on, `Result` is returned instead of being passed by reference, and we have a string/action-result map. I'll have to add the offloading kind to that string eventually, but in the partitioned patches I didn't touch that yet.

Do you suggest having that cache owned by the driver instead of passing it along?

================
Comment at: lib/Driver/Compilation.cpp:66-67
@@ +65,4 @@
+
+ // Check if there is any offloading specific translation to do.
+ DerivedArgList *OffloadArgs = TC->TranslateOffloadArgs(*Entry, BoundArch);
+ if (OffloadArgs) {
----------------
Post by Eric Christopher via cfe-commits
Hmm?
This relates to some extent to your other question: how do we pass device-specific options.

So, right now we are relying on the host options to derive the device-specific options. This hook was meant to tune the host options so that things that do not make sense on the device are filtered out. Also, the resulting device image is usually a shared library so that it can be easily loaded, so this hook is also used to specify the options that result in a shared library, even if the host options don't ask for one.
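
For instance (my reading of the mechanism above, not a literal transcript of the patch), for a host link line such as

```
clang -fopenmp=libomp -omptargets=x86_64-pc-linux-gnu a.o -o app
```

the translated argument list for the device link would effectively gain `-shared` (and drop host-only options), so the device toolchain produces a loadable shared library even though the host produces an executable.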

Can you think of a better way to abstract this?

================
Comment at: lib/Driver/Driver.cpp:224-225
@@ +223,4 @@
+
+/// \brief Dump the job bindings for a given action.
+static void DumpJobBindings(ArrayRef<const ToolChain *> TCs, StringRef ToolName,
+ ArrayRef<InputInfo> Inputs,
----------------
Post by Eric Christopher via cfe-commits
This can probably be done separately? Can you split this out and make it generally useful?
Given the feedback I got in http://lists.llvm.org/pipermail/cfe-dev/2016-February/047547.html, I ended up moving most of the functionality I had in job creation to the creation of actions. Having an action graph that shows the offloading specifics was a desired feature. As a result, what gets more complex is the dumping of the actions.

In http://reviews.llvm.org/D18171 I have an example of what that dump looks like. That patch also proposes a unified offloading action that should be reused by the different offloading programming models. Does this address your concern?

================
Comment at: lib/Driver/Driver.cpp:2045-2051
@@ -1739,11 +2044,9 @@
// checking the backend tool, check if the tool for the CompileJob
- // has an integrated assembler.
- const ActionList *BackendInputs = &(*Inputs)[0]->getInputs();
- // Compile job may be wrapped in CudaHostAction, extract it if
- // that's the case and update CollapsedCHA if we combine phases.
- CudaHostAction *CHA = dyn_cast<CudaHostAction>(*BackendInputs->begin());
- JobAction *CompileJA =
- cast<CompileJobAction>(CHA ? *CHA->begin() : *BackendInputs->begin());
- assert(CompileJA && "Backend job is not preceeded by compile job.");
- const Tool *Compiler = TC->SelectTool(*CompileJA);
- if (!Compiler)
+ // has an integrated assembler. However, if OpenMP offloading is required
+ // the backend and compile jobs have to be kept separate and an integrated
+ // assembler of the backend job will be queried instead.
+ JobAction *CurJA = cast<BackendJobAction>(*Inputs->begin());
+ const ActionList *BackendInputs = &CurJA->getInputs();
+ CudaHostAction *CHA = nullptr;
+ if (!RequiresOpenMPOffloading(TC)) {
+ // Compile job may be wrapped in CudaHostAction, extract it if
----------------
Post by Eric Christopher via cfe-commits
Might be time to make some specialized versions of this function. This may take it from "ridiculously confusing" to "code no one should ever look at" :)
I agree. This function is really messy... :S

In http://reviews.llvm.org/D18171 I am proposing `collapseOffloadingAction`, which drives the collapsing of offload actions and abstracts some of the complexity in `selectToolForJob`. Do you think that goes in the right direction, or do you think I should do something else?

================
Comment at: lib/Driver/Tools.cpp:6032
@@ +6031,3 @@
+ // The (un)bundling command looks like this:
+ // clang-offload-bundler -type=bc
+ // -omptargets=host-triple,device-triple1,device-triple2
----------------
Post by Eric Christopher via cfe-commits
Should we get the offload bundler in first so that the interface is there and testable? (Honest question, no particular opinion here). Though the command lines there will affect how this code is written.
Yes, sure, I proposed an implementation of the bundler, using a generic format, in http://reviews.llvm.org/D13909. Let me know any comments you have about that specific component.

I still need to add testing specific to http://reviews.llvm.org/D13909, which I haven't done yet because I didn't know where it was supposed to live - maybe in the Driver? Do you have an opinion about that?

Also, in http://lists.llvm.org/pipermail/cfe-dev/2016-February/047547.html, the general opinion was that the bundler should use the host object format to bundle whenever possible. So, I also have to add a default behavior to the binary bundler for when the input is an object file. For the other input types, I don't think there were any strong opinions. Do you happen to have one?

In any case, I was planning to add the object-file-specific bundling in a separate patch, which seems to me a natural way to partition the bundler functionality. Does that sound like a good plan?

================
Comment at: test/OpenMP/target_driver.c:41-47
@@ +40,9 @@
+
+// CHK-PHASES-LIB-DAG: {{.*}}: linker, {[[L0:[0-9]+]], [[A0:[0-9]+]]}, image
+// CHK-PHASES-LIB-DAG: [[A0]]: assembler, {[[A1:[0-9]+]]}, object
+// CHK-PHASES-LIB-DAG: [[A1]]: backend, {[[A2:[0-9]+]]}, assembler
+// CHK-PHASES-LIB-DAG: [[A2]]: compiler, {[[A3:[0-9]+]]}, ir
+// CHK-PHASES-LIB-DAG: [[A3]]: preprocessor, {[[I:[0-9]+]]}, cpp-output
+// CHK-PHASES-LIB-DAG: [[I]]: input, {{.*}}, c
+// CHK-PHASES-LIB-DAG: [[L0]]: input, "m", object
+
----------------
Post by Eric Christopher via cfe-commits
Do we really think the phases should be a DAG check?
Using a DAG seemed to me a robust way to test that. I'd have to double-check, but several map containers are used for the inputs and actions, so the order may depend on the implementation of the container. I was just trying to test in a safe way.

Do you prefer to change this to the exact sequence I am getting?

================
Comment at: test/OpenMP/target_driver.c:54
@@ +53,3 @@
+// RUN: echo 'bla' > %t.o
+// RUN: %clang -ccc-print-phases -lm -fopenmp=libomp -target powerpc64-ibm-linux-gnu -omptargets=x86_64-pc-linux-gnu,powerpc64-ibm-linux-gnu %s %t.o 2>&1 \
+// RUN: | FileCheck -check-prefix=CHK-PHASES-OBJ %s
----------------
Post by Eric Christopher via cfe-commits
How do you pass options to individual omptargets? e.g. -mvsx or -mavx2?
Well, currently I don't. In http://lists.llvm.org/pipermail/cfe-dev/2016-February/047547.html I was proposing something to tackle that, but the opinion was that it was somewhat secondary and that the driver design should be settled first.

What I was proposing was some sort of group option associated with the device triple. The idea was to avoid a proliferation of device-specific options and reuse what we already have, just organized in groups so that they could be forwarded to the right toolchain. The goal was to make things like this possible:
```
clang -mcpu=pwr8 -target-offload=nvptx64-nvidia-cuda -fopenmp -mcpu=sm_35 -target-offload=nvptx64-nvidia-cuda -fcuda -mcpu=sm_32 a.c
```
... where -mcpu is used to specify the cpu/gpu for the different toolchains and programming models. This would also be useful to specify include and library paths that only make sense for the device.

Do you have any opinion about that?


http://reviews.llvm.org/D9888
Jonas Hahnfeld via cfe-commits
2016-10-28 06:16:24 UTC
Permalink
Hahnfeld added a comment.

I think these changes have been contributed to trunk in multiple commits so this can be closed?


https://reviews.llvm.org/D9888
Samuel Antao via cfe-commits
2016-10-28 10:53:42 UTC
Permalink
sfantao abandoned this revision.
sfantao marked 8 inline comments as done.
sfantao added a comment.

Hi Jonas,
Post by Jonas Hahnfeld via cfe-commits
I think these changes have been contributed to trunk in multiple commits so this can be closed?
You're right, this can be closed now.

Thanks!
Samuel


https://reviews.llvm.org/D9888
