Compiler Optimization For (OpenMP) Accelerator Offloading

Johannes Doerfert and Hal Finkel
Leadership Computing Facility, Argonne National Laboratory
https://www.alcf.anl.gov/
EuroLLVM, April 8, 2019, Brussels, Belgium


7/14: The Compiler Black Box — Behind the Curtain (of Clang)

    #pragma omp target teams
    {
      foo();
      #pragma omp parallel
        work();   // <- Hotspot
      bar();
    }

With 1 master and N-1 worker teams, worker teams of M threads each:
• Masters execute foo concurrently, workers idle.
• Masters delegate work for concurrent execution.
• Masters execute bar concurrently, workers idle.

Problems:
• a separate master team costs resources
• synchronization has overhead
• currently impossible to optimize
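To make the master/worker scheme concrete, here is a minimal C sketch of how the generic (non-SPMD) lowering conceptually behaves. The names (shared_state, sync_barrier, work_fn_t) are illustrative placeholders, not the actual device runtime API:

    /* Conceptual sketch only: the real device runtime uses GPU synchronization
     * primitives and per-team state, not this plain C structure. */
    typedef void (*work_fn_t)(void);

    struct shared_state {
      work_fn_t next_work;  /* parallel region published by the master */
      int finished;         /* master signals kernel termination       */
    };

    extern void foo(void);
    extern void bar(void);
    extern void sync_barrier(void);  /* placeholder for the required barrier */

    /* Workers spin here while the master runs the target region. */
    void worker_loop(struct shared_state *s) {
      for (;;) {
        sync_barrier();                    /* wait until the master published something */
        if (s->finished) return;           /* kernel is done, workers exit */
        if (s->next_work) s->next_work();  /* execute the published parallel body */
        sync_barrier();                    /* let the master continue sequentially */
      }
    }

    /* The master executes the region and delegates the parallel part. */
    void master_thread(struct shared_state *s, work_fn_t parallel_body) {
      foo();                          /* sequential: workers idle in worker_loop */
      s->next_work = parallel_body;
      sync_barrier();                 /* release workers into parallel_body */
      parallel_body();                /* the master participates as well */
      sync_barrier();                 /* wait for the workers to finish */
      s->next_work = 0;
      bar();                          /* sequential again: workers idle */
      s->finished = 1;
      sync_barrier();                 /* wake the workers so they can exit */
    }

The problems listed above follow directly from this shape: workers are occupied even while only the master makes progress, every hand-off needs synchronization, and because the scheme is hidden behind front-end codegen and runtime calls, the optimizer cannot simplify it.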

8/14: OpenMP Offload — Overview & Directions

The slide shows the offload compilation pipeline as a diagram: OpenMP source is compiled by Clang to LLVM-IR; LLVM splits it into host code and device code (the device code carrying runtime calls plus some glue logic, with the device runtime providing further logic), runs host optimizations, device optimizations, and cross-module IPO, and code-generates everything into a fat binary.

Directions:
1. Offload-specific optimizations on device code
2. Optimize device and host code together

Pending patches "fix" the motivating problem and allow for more to come. Reviewers are needed! Interested? Take a look and contact me :)
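As a rough illustration of the host side of this picture, the sketch below shows what "host code plus runtime calls" can look like for a target region: the host hands the kernel arguments to the offload runtime and falls back to a host version if offloading fails. The names offload_rt_target and target_region_host_fallback are made up for this sketch; they are not the actual libomptarget entry points.

    /* Hypothetical host-side lowering of a '#pragma omp target' region. */
    extern int  offload_rt_target(void *region_id, int num_args, void **args);
    extern void target_region_host_fallback(float *In, float *Out, int N);

    void run_target_region(float *In, float *Out, int N) {
      void *args[] = { In, Out, &N };
      /* The region id selects the device image entry embedded in the fat binary. */
      if (offload_rt_target((void *)&run_target_region, 3, args) != 0) {
        /* Offloading failed (e.g., no device available): run the host fallback. */
        target_region_host_fallback(In, Out, N);
      }
    }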

9/14: Pending Patches — Target Region Interface

The pending patches provide:
• a straightforward #pragma omp target front-end,
• an interface that exposes information and implementation choices, and
• a separated device RT interface & implementation.

⋄ simplified implementation: CGOpenMPRuntimeTRegion.cpp is ~0.5k LOC vs. ~5.0k LOC for CGOpenMPRuntimeNVPTX.cpp
⋄ improved reusability (F18, ...)
⋄ "smartness" is moved into the compiler middle-end
⋄ simplifies analyses and transformations in LLVM
⋄ simplifies generated LLVM-IR
⋄ most LLVM & Clang parts become target agnostic
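To make "an interface that exposes information and implementation choices" more tangible, here is a hypothetical C-level sketch of what such a target region interface could look like. All names, parameters, and semantics below are assumptions for illustration, not the interface defined by the patches:

    /* Hypothetical target region interface (illustrative names only). */

    /* Kernel prologue: returns nonzero for the thread that should execute the
     * region (the master) and zero for threads that served as workers.  The
     * flags expose implementation choices, e.g. whether SPMD execution or a
     * custom state machine may be used, so the middle-end can change them
     * after inter-procedural analysis. */
    int  tregion_kernel_init(int use_spmd_mode, int use_state_machine);
    void tregion_kernel_deinit(void);

    /* Announce a parallel region; the runtime decides how workers pick it up. */
    typedef void (*parallel_fn_t)(void *shared_args);
    void tregion_parallel(parallel_fn_t fn, void *shared_args);

    /* What generated device code for the earlier example might then look like: */
    extern void foo(void), bar(void);
    extern void work_outlined(void *shared_args);

    void target_region_kernel(void *shared_args) {
      if (!tregion_kernel_init(/*use_spmd_mode=*/0, /*use_state_machine=*/1))
        return;                                  /* worker threads leave here */
      foo();
      tregion_parallel(work_outlined, shared_args);
      bar();
      tregion_kernel_deinit();
    }

Because the generated device code contains little more than calls like these, the "smartness" can live in LLVM: an inter-procedural pass can inspect the kernel, prove properties about foo, work_outlined, and bar, and adjust the flags or replace the calls, independent of any particular GPU target.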

10/14: 1. Offload-Specific Optimizations — "SPMD-zation"

• use inter-procedural reasoning to place minimal guards/synchronization
• if legal, switch all boolean UseSPMDMode flags to true
• currently, no (unknown) global side-effects are allowed outside parallel regions
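A minimal sketch of what SPMD-zation means for the running example, assuming the sequential parts only have side effects that can be guarded. thread_id_in_team and team_barrier are illustrative helpers, not real runtime functions:

    /* Hypothetical SPMD-ized kernel: every thread executes the whole region,
     * and the formerly sequential parts are guarded so their side effects
     * happen once per team. */
    extern int  thread_id_in_team(void);
    extern void team_barrier(void);
    extern void foo(void), bar(void);
    extern void work(void);         /* body of the '#pragma omp parallel' */

    void spmd_kernel(void) {
      if (thread_id_in_team() == 0)
        foo();                      /* guarded sequential part */
      team_barrier();

      work();                       /* the hotspot runs on all threads directly */
      team_barrier();

      if (thread_id_in_team() == 0)
        bar();                      /* guarded sequential part */
    }

The inter-procedural reasoning is what proves that foo and bar have no unknown global side effects that these guards and barriers would not cover; only then is it legal to flip UseSPMDMode and drop the separate master/worker scheme entirely.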

11/14: 1. Offload-Specific Optimizations — Custom State Machines

• use optimized state machines when unavoidable
• reachability & post-dominance restrict the set of potential next parallel regions
• reuse already communicated/shared values if possible
• currently, a simple state machine is generated with explicit conditionals for all known parallel regions in the module to work on
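A hedged sketch of the difference in plain C (all names invented for illustration): the current scheme checks every known parallel region in the module, while a custom state machine only checks the regions that reachability and post-dominance say can actually come next in this kernel.

    typedef void (*parallel_fn_t)(void);
    extern parallel_fn_t next_parallel_region;   /* slot written by the master */
    extern void par_region_0(void), par_region_1(void), par_region_2(void);

    /* Current scheme (roughly): explicit conditionals for every known parallel
     * region in the module, plus an indirect-call fallback. */
    void worker_step_generic(void) {
      parallel_fn_t fn = next_parallel_region;
      if      (fn == &par_region_0) par_region_0();
      else if (fn == &par_region_1) par_region_1();
      else if (fn == &par_region_2) par_region_2();
      else if (fn)                  fn();
    }

    /* Custom state machine for one kernel: if analysis proves that only
     * par_region_1 can follow at this point, the dispatch collapses to a
     * single check, the body can be inlined, and values already shared with
     * the workers can be reused instead of re-communicated. */
    void worker_step_specialized(void) {
      if (next_parallel_region)
        par_region_1();
    }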

12/14: 2. Optimize Device and Host Together — Abstract Call Sites

The slide's diagram: IPO passes today operate on CallSite, which wraps CallInst and InvokeInst. The proposal adds AbstractCallSite, which also covers TransitiveCallSite, i.e. callbacks invoked indirectly through a broker such as a runtime call, so the same IPO passes can handle both kinds of call sites.

Functional changes were required for inter-procedural constant propagation.

13/14: Abstract Call Sites — Performance Results (performance charts)

14/14: Conclusion

OpenMP-Aware Optimizations (see IWOMP'18)

I: Attribute Propagation — Bidirectional Information Transfer (in LLVM: Attribute Deduction, an IPO): read/write-only, restrict/noalias, ...
II: Variable Privatization — Limit Variable Lifetimes (in LLVM: Argument Promotion, an IPO): shared(var) ⟶ firstprivate(var) ⟶ private(var)
III: Parallel Region Expansion — Maximize Parallel Contexts ⟹ reduce start/stop overheads and expose barriers
IV: Barrier Elimination — Eliminate Redundant Barriers
V: Communication Optimization — Move Computations Around: sequential compute & result communication vs. operand communication & parallel compute
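As a hedged illustration of II (Variable Privatization), not taken from the talk, the snippet below shows how the data-sharing of a variable can be weakened step by step once the compiler can prove more about its uses:

    #include <omp.h>

    extern int  compute_seed(void);
    extern void use(int);

    void privatization_example(void) {
      int x = compute_seed();

      /* shared(x): one copy whose lifetime spans the whole parallel region. */
      #pragma omp parallel shared(x)
      { use(x); }

      /* If x is only read inside the region, firstprivate(x) gives every
       * thread its own initialized copy and ends the shared lifetime. */
      #pragma omp parallel firstprivate(x)
      { use(x); }

      /* If every use of x inside the region is preceded by a definition
       * inside the region, private(x) removes even the initializing copy. */
      #pragma omp parallel private(x)
      {
        x = omp_get_thread_num();
        use(x);
      }
    }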

OpenMP Input:

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
      Out[i] = In[i] + In[i+N];

Early Outlining (as done by the clang front-end):

    // Parallel region replaced by a runtime call.
    omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);

    // Parallel region outlined in the front-end (clang)!
    static void body_fn(int tid, int *N, float **In, float **Out) {
      int lb = omp_get_lb(tid), ub = omp_get_ub(tid);
      for (int i = lb; i < ub; i++)
        (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
    }

An Abstract Parallel IR:

    // Parallel region replaced by an annotated loop.
    for /* parallel */ (int i = 0; i < N; i++)
      body_fn(i, &N, &In, &Out);

    // Parallel region outlined in the front-end (clang)!
    static void body_fn(int i, int *N, float **In, float **Out) {
      (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
    }

Early Outlined + Transitive Calls:

    // Parallel region replaced by a runtime call.
    omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);
    // Model the transitive call: body_fn(?, &N, &In, &Out);

    // Parallel region outlined in the front-end (clang)!
    static void body_fn(int tid, int *N, float **In, float **Out) {
      int lb = omp_get_lb(tid), ub = omp_get_ub(tid);
      for (int i = lb; i < ub; i++)
        (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
    }

+ valid and executable IR
+ no unintended interactions
− integration cost per IPO
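One concrete payoff of the transitive-call model, sketched here as an assumption rather than a result from the talk, using the same hypothetical omp_rt_parallel_for and body_fn as above: inter-procedural passes can see through the broker call and specialize the outlined body.

    /* Declarations repeated so the sketch is self-contained. */
    extern void omp_rt_parallel_for(int lb, int ub,
                                    void (*body)(int, int *, float **, float **),
                                    int *N, float **In, float **Out);
    extern void body_fn(int tid, int *N, float **In, float **Out);

    void caller(float *In, float *Out) {
      int N = 512;                 /* known value in this caller */
      omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);
      /* Without a transitive call site, the only visible user of body_fn is
       * the opaque runtime call, so IPO passes cannot relate these arguments
       * to body_fn's parameters.  With the modeled call
       *   body_fn(?, &N, &In, &Out)
       * inter-procedural passes (e.g. constant propagation, attribute
       * deduction) can connect &N, &In, &Out to the corresponding parameters
       * and reason about them as for any direct call. */
    }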

IPO in LLVM (diagram): CallInst and InvokeInst are wrapped by CallSite; AbstractCallSite additionally covers TransitiveCallSite, so IPO passes can operate on regular and transitive (callback) call sites alike.
