

  1. Optimizing Indirections, or using abstractions without remorse LLVMDev’18 — October 18, 2018 — San Jose, California, USA Johannes Doerfert, Hal Finkel Leadership Computing Facility Argonne National Laboratory https://www.alcf.anl.gov/

  2. Acknowledgment This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative.

  3. Context & Motivation

  4. Context — Optimizations For Parallel Programs

  Optimizations for sequential aspects
  • Can reuse (improved) existing transformations
  ⇒ Introduce suitable abstractions and transformations to bridge the indirection

  Optimizations for parallel aspects
  • New explicit parallelism-aware transformations (see IWOMP’18 [a])
  ⇒ Introduce a unifying abstraction layer (see EuroLLVM’18 talk [b])

  [a] Compiler Optimizations For OpenMP, J. Doerfert, H. Finkel, IWOMP 2018
  [b] A Parallel IR in Real Life: Optimizing OpenMP, H. Finkel, J. Doerfert, X. Tian, G. Stelle, Euro-LLVM Meeting 2018

  9. Context — Optimizations For Parallel Programs
  Interested? Contact me and come to our BoF!

  11. Context — Compiler Optimization

  Original Program:
      int y = 7;
      for (i = 0; i < N; i++) {
        f(y, i);
      }
      g(y);

  After Optimizations:
      for (i = 0; i < N; i++) {
        f(7, i);
      }
      g(7);

  13. Motivation — Compiler Optimization For Parallelism

  Original Program:
      int y = 7;
      #pragma omp parallel for
      for (i = 0; i < N; i++) {
        f(y, i);
      }
      g(y);

  After Optimizations:
      int y = 7;
      #pragma omp parallel for
      for (i = 0; i < N; i++) {
        f(y, i);
      }
      g(y);

  (The two sides are identical: adding the parallel pragma defeats the optimization that the sequential version received.)

  14. Sequential Performance of Parallel Programs — Why is this important?

  21. Early Outlining

  OpenMP Input:
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
        Out[i] = In[i] + In[i+N];

  // Parallel region replaced by a runtime call.
      omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);

  // Parallel region outlined in the front-end (clang)!
      static void body_fn(int tid, int *N, float **In, float **Out) {
        int lb = omp_get_lb(tid), ub = omp_get_ub(tid);
        for (int i = lb; i < ub; i++)
          (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
      }

  23. An Abstract Parallel IR

  OpenMP Input:
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
        Out[i] = In[i] + In[i+N];

  // Parallel region replaced by an annotated loop.
      parfor (int i = 0; i < N; i++)
        body_fn(i, &N, &In, &Out);

  // Parallel region outlined in the front-end (clang)!
      static void body_fn(int i, int *N, float **In, float **Out) {
        (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
      }

  26. Early Outlining + Transitive Calls

  OpenMP Input:
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
        Out[i] = In[i] + In[i+N];

  // Parallel region replaced by a runtime call.
      omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);
  // Model transitive call:
      body_fn(?, &N, &In, &Out);

  // Parallel region outlined in the front-end (clang)!
      static void body_fn(int tid, int *N, float **In, float **Out) {
        int lb = omp_get_lb(tid), ub = omp_get_ub(tid);
        for (int i = lb; i < ub; i++)
          (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
      }

  + >1k function pointer arguments in LLVM-TS + SPEC
  + no unintended interactions
  + valid and executable IR
  − integration cost per IPO

  27. Call Abstraction in LLVM

  (Diagram: CallInst and InvokeInst are wrapped by CallSite, which the IPO passes consume.)

  28. Call Abstraction in LLVM + Transitive Call Sites

  (Diagram: CallInst and InvokeInst feed CallSite; CallSite and the new TransitiveCallSite feed AbstractCallSite, which the IPO passes consume.)
