
SLIDE 1

Optimizing Indirections, or using abstractions without remorse

LLVMDev’18 — October 18, 2018 — San Jose, California, USA

Johannes Doerfert, Hal Finkel

Leadership Computing Facility, Argonne National Laboratory (https://www.alcf.anl.gov/)

SLIDE 2

Acknowledgment

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative.

SLIDE 3

Context & Motivation

SLIDE 4

Context — Optimizations For Parallel Programs

Optimizations for sequential aspects

  • Can reuse (improved) existing transformations

⇒ Introduce suitable abstractions and transformations to bridge the indirection

Optimizations for parallel aspects

  • New explicit parallelism-aware transformations (see IWOMP’18 [a])

⇒ Introduce a unifying abstraction layer (see EuroLLVM’18 talk [b])

[a] Compiler Optimizations For OpenMP, J. Doerfert, H. Finkel, IWOMP 2018
[b] A Parallel IR in Real Life: Optimizing OpenMP, H. Finkel, J. Doerfert, X. Tian, G. Stelle, Euro-LLVM Meeting 2018


Interested? Contact me and come to our BoF!

SLIDE 11

Context — Compiler Optimization

Original program:

int y = 7;
for (i = 0; i < N; i++) {
  f(y, i);
}
g(y);

After optimizations:

for (i = 0; i < N; i++) {
  f(7, i);
}
g(7);

SLIDE 13

Motivation — Compiler Optimization For Parallelism

Original program:

int y = 7;
#pragma omp parallel for
for (i = 0; i < N; i++) {
  f(y, i);
}
g(y);

After optimizations (unchanged: the constant is not propagated into the parallel region):

int y = 7;
#pragma omp parallel for
for (i = 0; i < N; i++) {
  f(y, i);
}
g(y);

SLIDE 14

Sequential Performance of Parallel Programs

Why is this important?

SLIDE 21

Early Outlining

OpenMP input:

#pragma omp parallel for
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);

// Parallel region outlined in the front-end (clang)!
static void body_fn(int tid, int *N, float **In, float **Out) {
  int lb = omp_get_lb(tid), ub = omp_get_ub(tid);
  for (int i = lb; i < ub; i++)
    (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
}
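To make the outlining concrete, here is a minimal, sequential, runnable sketch of the shape of the code the front-end emits. The runtime names (`omp_rt_parallel_for`, `omp_get_lb`, `omp_get_ub`) follow the slide's simplified runtime, not the real libomp API, and the single-"thread" broker is an assumption for illustration:

```c
#include <stddef.h>

/* Simplified stand-in for the runtime's fork/join entry point: it
   invokes the outlined body once per "thread"; here a single thread
   owns the whole iteration space. */
typedef void (*body_fn_t)(int tid, int *N, float **In, float **Out);

static int g_lb, g_ub; /* per-"thread" bounds, set by the broker */

static int omp_get_lb(int tid) { (void)tid; return g_lb; }
static int omp_get_ub(int tid) { (void)tid; return g_ub; }

static void omp_rt_parallel_for(int lb, int ub, body_fn_t body,
                                int *N, float **In, float **Out) {
  g_lb = lb; g_ub = ub;     /* one thread owns all iterations */
  body(/*tid=*/0, N, In, Out);
}

/* The outlined parallel-region body, as the front-end would emit it. */
static void body_fn(int tid, int *N, float **In, float **Out) {
  int lb = omp_get_lb(tid), ub = omp_get_ub(tid);
  for (int i = lb; i < ub; i++)
    (*Out)[i] = (*In)[i] + (*In)[i + *N];
}

/* What the user wrote: Out[i] = In[i] + In[i+N] for i in [0, N). */
static void run(int N, float *In, float *Out) {
  omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);
}
```

Note how every variable reaches the body through an extra level of pointer indirection, and the body is only reachable through a function pointer: exactly the indirections the talk is about.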


SLIDE 23

An Abstract Parallel IR

OpenMP input:

#pragma omp parallel for
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by an annotated loop.
parfor (int i = 0; i < N; i++)
  body_fn(i, &N, &In, &Out);

// Parallel region outlined in the front-end (clang)!
static void body_fn(int i, int *N, float **In, float **Out) {
  (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
}


SLIDE 25

Early Outlining + Transitive Calls

OpenMP input:

#pragma omp parallel for
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);
// Model transitive call: body_fn(?, &N, &In, &Out);

// Parallel region outlined in the front-end (clang)!
static void body_fn(int tid, int *N, float **In, float **Out) {
  int lb = omp_get_lb(tid), ub = omp_get_ub(tid);
  for (int i = lb; i < ub; i++)
    (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
}


+ valid and executable IR
+ no unintended interactions
+ >1k function-pointer arguments in LLVM-TS + SPEC
− integration cost per IPO
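The pattern the transitive-call model captures can be illustrated outside LLVM: a broker function receives a function pointer plus payload arguments and forwards them. If the compiler knows the broker's forwarding convention, it can reason about the call `callback(?, payload)` even though the IR only shows a call to the broker. A minimal sketch; the `broker`/`times_two_plus_tid` names are illustrative, not from the talk:

```c
/* A broker that, like omp_rt_parallel_for, does not use its payload
   itself but forwards it to the callback. The first callback argument
   (tid) is unknown at the broker call site; the rest are known. */
typedef int (*callback_t)(int tid, int *payload);

static int broker(callback_t cb, int *payload) {
  /* Transitive call site: cb(?, payload). An IPO that understands this
     forwarding convention can propagate facts about `payload` into cb. */
  return cb(/*tid=*/0, payload);
}

static int times_two_plus_tid(int tid, int *payload) {
  return tid + 2 * *payload;
}

static int use(void) {
  int value = 21;
  /* From this site, once the broker's convention is modeled, the
     transitive call times_two_plus_tid(?, &value) becomes visible. */
  return broker(&times_two_plus_tid, &value);
}
```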

SLIDE 27

Call Abstraction in LLVM

(Diagram: CallInst / InvokeInst → CallSite → Passes (IPOs))

SLIDE 28

Call Abstraction in LLVM + Transitive Call Sites

(Diagram: CallInst / InvokeInst / TransitiveCallSite → AbstractCallSite → Passes (IPOs))


Functional Changes for Inter-Procedural Constant Propagation:

SLIDE 31

Inter-Procedural Optimization (IPO) in LLVM

SLIDE 32

IPO — Attribute Inference

-O3 -disable-inlining

static int* internal_ret1_rrw(int *r0, int *r1, int *w0);
static int* internal_ret0_nw(int *n0, int *w0);
static int* internal_ret1_rw(int *r0, int *w0);
int* external_source_ret2_nrw(int *n0, int *r0, int *w0);
int* external_sink_ret2_nrw(int *n0, int *r0, int *w0);
int* external_ret2_nrw(int *n0, int *r0, int *w0);

SLIDE 33

IPO — Attribute Inference

-O3 -disable-inlining

static int* internal_ret1_rrw(int *r0, int *r1, int *w0) {
  if (!*r0)
    return r1;
  internal_ret1_rw(r0, w0);
  *w0 = *r0 + *r1;
  internal_ret1_rw(r1, w0);
  internal_ret0_nw(r0, w0);
  internal_ret0_nw(w0, w0);
  external_ret2_nrw(r0, r1, w0);
  external_ret2_nrw(r1, r0, w0);
  external_sink_ret2_nrw(r0, r1, w0);
  external_sink_ret2_nrw(r1, r0, w0);
  return internal_ret0_nw(r1, w0);
}


Interested? See our RFC: “Properly” Derive Function/Argument/Parameter Attributes


SLIDE 38

IPO — Constant Propagation

-O3 -disable-inlining

static int foo(int a, int b) {
  return a + b; // 5?
}
int bar() { return foo(2, 3); }

struct Pair { int a, b; };
static int foo(struct Pair p) {
  return p.a + p.b; // 5?
}
int bar() {
  struct Pair p = {2, 3};
  return foo(p);
}
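Both variants on the slide compute the same constant at run time; the question is only whether the compiler proves it statically. A runnable version of the two forms, with `foo` renamed per variant so both fit in one file:

```c
/* Scalar arguments: inter-procedural constant propagation readily
   folds foo_scalar(2, 3) into the constant 5. */
static int foo_scalar(int a, int b) { return a + b; }
static int bar_scalar(void) { return foo_scalar(2, 3); }

/* The same values passed as a by-value struct: semantically identical,
   but the aggregate argument makes the propagation harder for IPO. */
struct Pair { int a, b; };
static int foo_pair(struct Pair p) { return p.a + p.b; }
static int bar_pair(void) {
  struct Pair p = {2, 3};
  return foo_pair(p);
}
```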

SLIDE 39

IPO — Constant Propagation

-O3 -disable-inlining

static int foo(int a, int b) { return 5; }
int bar() { return foo(2, 3); }

struct Pair { int a, b; };
static int foo(struct Pair p) { return 5; }
int bar() {
  struct Pair p = {2, 3};
  return foo(p);
}

SLIDE 40

IPO — Constant Propagation

-O3 -disable-inlining

struct Pair { int a, b; };
static int foo(struct Pair *p) {
  return p->a + p->b; // 5?
}
int bar() {
  struct Pair p = {2, 3};
  return foo(&p);
}

struct Tuple { int a, b, c, d; };
static int foo(struct Tuple t) {
  return t.a + t.b + t.c + t.d; // 5?
}
int bar() {
  struct Tuple t = {2, 3, 0, 0};
  return foo(t);
}

SLIDE 41

IPO — Constant Propagation

-O3 -disable-inlining

struct Pair { int a, b; };
static int foo(struct Pair *p) { return p->a + p->b; }
int bar() {
  struct Pair p = {2, 3};
  return foo(&p);
}

struct Tuple { int a, b, c, d; };
static int foo(struct Tuple t) { return t.a + t.b + t.c + t.d; }
int bar() {
  struct Tuple t = {2, 3, 0, 0};
  return foo(t);
}


Why? The pipeline is less tuned for this path, and passes are conservative for IPO.

SLIDE 43

IPO — Object Arguments

-O3 -disable-inlining

struct Tuple { int a, b, c, d, e, f, g, *h; };
static int f(struct Tuple *t) {
  return t->a + t->c + t->e + t->g;
}
int bar(struct Tuple *t) {
  t->a = 3; t->e = 7; /* ... */
  f(t); // ... t->h does escape in f!
}

SLIDE 44

IPO — Object Arguments — 1. Fan Out Early

-O3 -disable-inlining

struct Tuple { int a, b, c, d, e, f, g, *h; };
static int f(struct Tuple *t) {
  return t->a + t->c + t->e + t->g;
}
int bar(struct Tuple *t) {
  t->a = 3; t->e = 7; /* ... */
  f(t); // ... t->h does escape in f!
}

// After fanning out the used fields:
static int f(int a, int c, int e, int g) {
  return a + c + e + g;
}
int bar(struct Tuple *t) {
  t->a = 3; t->e = 7; /* ... */
  f(3, t->c, 7, t->g); // ... t->h does not escape in f!
}

SLIDE 45

IPO — Object Arguments — 2. Optimize

-O3 -disable-inlining

struct Tuple { int a, b, c, d, e, f, g, *h; };
static int f(struct Tuple *t) {
  return t->a + t->c + t->e + t->g;
}
int bar(struct Tuple *t) {
  t->a = 3; t->e = 7; /* ... */
  f(t); // ... t->h does escape in f!
}

// After optimizing the fanned-out version:
static int f(int c, int g) {
  return 3 + c + 7 + g;
}
int bar(struct Tuple *t) {
  t->a = 3; t->e = 7; /* ... */
  f(t->c, t->g); // ... t->h does not escape in f!
}

SLIDE 46

IPO — Object Arguments — 3. Condense Late

-O3 -disable-inlining

struct Tuple { int a, b, c, d, e, f, g, *h; };
static int f(struct Tuple *t) {
  return t->a + t->c + t->e + t->g;
}
int bar(struct Tuple *t) {
  t->a = 3; t->e = 7; /* ... */
  f(t); // ... t->h does escape in f!
}

// After condensing back to the original signature:
static int f(struct Tuple *t) {
  return 3 + t->c + 7 + t->g;
}
int bar(struct Tuple *t) {
  t->a = 3; t->e = 7; /* ... */
  f(t); // ... t->h does escape in f!
}


Aggressively unpack object arguments early and condense arguments late, as an alternative/substitute for inlining.
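The fan-out/condense steps can be mimicked by hand in plain C. In this sketch `f_original` is the slide's callee, `f_fanned` is what steps 1 and 2 produce (used fields as scalar parameters, caller-stored constants folded), and both are kept side by side to show they agree; the names and the `use_fanned` switch are illustrative:

```c
struct Tuple { int a, b, c, d, e, f, g, *h; };

/* Original callee: takes the whole object, so every field (including
   the pointer h) is visible inside f and h may escape. */
static int f_original(struct Tuple *t) {
  return t->a + t->c + t->e + t->g;
}

/* Steps 1+2: fan out the used fields into scalar parameters and fold
   the constants the caller stores (t->a = 3, t->e = 7). h can no
   longer escape because t is not passed at all. */
static int f_fanned(int c, int g) {
  return 3 + c + 7 + g;
}

static int bar(struct Tuple *t, int use_fanned) {
  t->a = 3; t->e = 7;
  return use_fanned ? f_fanned(t->c, t->g) : f_original(t);
}
```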

SLIDE 48

IPO — Additional Proposals/Prototypes

  • track values of fields across function calls, e.g., closure initialization [prototype]

  • determine performance impact of missing static information [ongoing]

  • export attributes for libraries, e.g., add __attribute__((const)) [planned]


Interested? Contact me!

SLIDE 50

Evaluation

SLIDE 51

OpenMP Optimizations

Version | Description                                         | Opt.
--------+-----------------------------------------------------+-----
base    | plain “-O3”, thus no parallel optimizations         |
attr    | attribute propagation through attr. deduction (IPO) | I
argp    | variable privatization through arg. promotion (IPO) | II
n/a     | constant propagation (IPO)                          |

SLIDE 52

OpenMP Optimizations — Performance Results

(Figure: performance result charts.)

SLIDE 58

Array Constant Propagation Example

double gamma[4][8];
gamma[0][0] = 1;
// ... and so on till ...
gamma[3][7] = -1;

Kokkos::parallel_for(
    "CalcFBHourglassForceForElems A", numElem,
    KOKKOS_LAMBDA(const int &i2) {
      // Use gamma[0][0] ... gamma[3][7]
    });
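Stripped of Kokkos, the pattern reduces to a constant-initialized array read inside a parallel body; array constant propagation would let the body use the literal values instead of loads through the capture. A plain-C stand-in, with the `parallel_for` modeled as a sequential loop and the `gamma_tbl`/`body`/`run` names purely illustrative:

```c
/* gamma_tbl is filled with known constants before the "parallel"
   loop; array constant propagation could replace the loads in the
   body with those literals. */
static double gamma_tbl[4][8];

static double body(int i2) {
  /* Each "element" reads some gamma entries. */
  return gamma_tbl[0][0] + i2 * gamma_tbl[3][7];
}

static double run(int numElem) {
  gamma_tbl[0][0] = 1;
  /* ... remaining entries ... */
  gamma_tbl[3][7] = -1;

  double sum = 0;
  for (int i2 = 0; i2 < numElem; i2++)  /* models parallel_for */
    sum += body(i2);
  return sum;
}
```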

SLIDE 59

Array Constant Propagation Performance

SLIDE 60

Conclusion