A Heterogeneous Parallel Framework for Domain-Specific Languages
Kevin J. Brown, Arvind K. Sujeeth, HyoukJoong Lee, Hassan Chafi, Kunle Olukotun (Stanford University), Tiark Rompf
Target hardware: Cray Jaguar, Sun T2, Nvidia Fermi, Altera FPGA
Programming models: MPI, PGAS, Pthreads, OpenMP, CUDA, OpenCL, Verilog, VHDL
Application domains: Virtual Worlds, Personal Robotics, Data Informatics, Scientific Engineering
The classic trade-off: Performance vs. Productivity vs. Generality.
Domain Specific Languages achieve Performance (Heterogeneous Parallelism) and Productivity by giving up Generality.
Benefits of DSLs:
- Productivity: a high-level programming model shields the user from low-level implementation details
- Performance: domain operations expose parallel patterns and parallelism
- Portability and forward scalability: DSL implementations can evolve to exploit new hardware features
A Domain-Specific Approach to Heterogeneous Parallelism:
- A framework for parallel DSL libraries
- Used data-parallel patterns and deferred execution (transparent futures) to execute tasks in parallel
Why write a compiler?
- Static optimizations (both generic and domain-specific)
- All DSL abstractions can be removed from the generated code
- Generate code for hardware not supported by the host language
- Full-program analysis
Building a new DSL
- Design the language (syntax, operations, abstractions, etc.)
- Implement compiler (parsing, type checking, optimizations, etc.)
- Discover parallelism (understand parallel patterns)
- Emit parallel code for different hardware (optimize for low-level architectural details)
- Handle synchronization, multiple address spaces, etc.

Need a DSL infrastructure:
- Embed DSLs in a common host language
- Provide building blocks for common DSL compiler & runtime functionality
Delite: DSL Infrastructure (stack diagram)
- Domain Specific Languages: Physics (Liszt), Machine Learning (OptiML), Data Analytics (OptiQL)
- Domain Embedding Language (Scala): parallel patterns, staged execution
- Delite Compiler: static optimizations, heterogeneous code generation
- Delite Runtime: walk-time optimizations, locality-aware scheduling
- Heterogeneous Hardware: SMP, GPU
How user code maps to domain ops (the DSL user writes the application; the DSL author provides the domain ops, the domain user interface, and domain analysis & optimization):
- s = sum(M) → Matrix Sum
- V1 = exp(V2) → Vector Exp
- M1 = M2 + M3 → Matrix Plus
- C2 = sort(C1) → Collection Quicksort
OptiML: A DSL for machine learning
- Built using Delite
- Supports linear algebra (Matrix/Vector) operations
- DSL methods build the IR as the program runs

```scala
// a, b, c, d : Matrix
val x = a * b + c * d

def infix_+(a: Matrix, b: Matrix) = new MatrixPlus(a, b)
def infix_*(a: Matrix, b: Matrix) = new MatrixTimes(a, b)
```
The resulting IR: MatrixPlus(MatrixTimes(A, B), MatrixTimes(C, D))
- The DSL developer defines how DSL operations create IR nodes
- The implementation of an operation can be specialized for each occurrence
- This technique controls exactly what gets added to the IR
- Use this to apply linear algebra simplification rules
Example rewrite: A*B + A*C ⇒ A*(B + C)
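As a sketch of how such a rewrite can hook into IR construction (illustrative Scala; these node classes are simplified stand-ins, not OptiML's actual IR), the DSL's plus operation pattern-matches on the nodes its arguments were built from and emits the factored form directly:

```scala
// Simplified expression IR
sealed trait Exp
case class Sym(name: String) extends Exp
case class Plus(a: Exp, b: Exp) extends Exp
case class Times(a: Exp, b: Exp) extends Exp

def times(a: Exp, b: Exp): Exp = Times(a, b)

// "Smart constructor": inspect the operands before building the node
def plus(a: Exp, b: Exp): Exp = (a, b) match {
  // A*B + A*C  =>  A*(B + C): one multiply instead of two
  case (Times(x1, y), Times(x2, z)) if x1 == x2 => Times(x1, Plus(y, z))
  case _ => Plus(a, b)
}

val A = Sym("A"); val B = Sym("B"); val C = Sym("C")
val rewritten = plus(times(A, B), times(A, C))
```

Because the rewrite fires at IR-construction time, downstream phases never see the unfactored expression.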
A straightforward translation of the Gaussian Discriminant Analysis (GDA) algorithm computes a sum of outer products. A much more efficient implementation recognizes that this sum is a single matrix multiply:

∑_{j=0}^{n} y_j ∗ z_j ⇒ ∑_{j=0}^{n} Y(:, j) ∗ Z(j, :) = Y ∗ Z

The transformed code was 20.4× faster with 1 thread.
```scala
val sigma = sum(0, m) { i =>
  val a = if (!x.labels(i)) x(i) - mu0 else x(i) - mu1
  a.t ** a
}
```
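A quick numeric check of the sum-of-outer-products identity, using plain Scala arrays (`matmul` and `outerSum` are hypothetical helpers written for this sketch, not OptiML API):

```scala
type M = Array[Array[Double]]

// Standard matrix product: (Y * Z)(i, k) = sum_j Y(i, j) * Z(j, k)
def matmul(y: M, z: M): M =
  Array.tabulate(y.length, z(0).length)((i, k) =>
    (0 until z.length).map(j => y(i)(j) * z(j)(k)).sum)

// Sum over j of the outer product of column j of Y with row j of Z
def outerSum(y: M, z: M): M = {
  val acc = Array.fill(y.length, z(0).length)(0.0)
  for (j <- z.indices; i <- y.indices; k <- z(0).indices)
    acc(i)(k) += y(i)(j) * z(j)(k)
  acc
}

val Y: M = Array(Array(1.0, 2.0), Array(3.0, 4.0))
val Z: M = Array(Array(5.0, 6.0), Array(7.0, 8.0))
```

Both computations accumulate exactly the same terms y(i)(j) * z(j)(k), just grouped differently; the matrix-multiply form lets one tuned kernel replace a loop of rank-1 updates.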
Building a new DSL (revisited)
- Design the language (syntax, operations, abstractions, etc.)
- Implement compiler:
  - Domain-specific analysis and optimization
  - Lexing, parsing, type-checking, generic optimizations
- Discover parallelism (understand parallel patterns)
- Emit parallel code for different hardware (optimize for low-level architectural details)
- Handle synchronization, multiple address spaces, etc.
Delite ops encode known parallel execution patterns:
- Map, filter, reduce, …
- Bulk-synchronous foreach
- Divide & conquer

Delite provides implementations of these patterns for multiple hardware targets (e.g., multi-core, GPU). The DSL author maps each domain operation to the appropriate pattern; Delite handles parallel optimization, code generation, and execution for all DSLs.
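The division of labor can be sketched as follows (illustrative Scala; `DeliteOpMap` here is a simplified stand-in for Delite's actual op hierarchy): the DSL author supplies only the per-element function, and the framework owns how the pattern executes on each target.

```scala
// Generic data-parallel Map pattern owned by the framework
trait DeliteOpMap[A, B] {
  def in: Seq[A]
  def func(a: A): B                   // per-element work, defined by the DSL author
  def exec(): Seq[B] = in.map(func)   // stand-in for the parallel/GPU schedule
}

// OptiML-style element-wise exp, expressed as a Map
case class VectorExp(in: Seq[Double]) extends DeliteOpMap[Double, Double] {
  def func(a: Double): Double = math.exp(a)
}

val v = VectorExp(Seq(0.0, 1.0)).exec()
```

Because VectorExp declares *what* it computes per element rather than *how* the loop runs, the same op definition can be lowered to multi-core or GPU code without changes.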
With Delite ops in the stack, each domain op also maps to a parallel pattern (the DSL author defines the domain ops; Delite provides parallelism analysis & optimization and code generation):
- s = sum(M) → Matrix Sum → Reduce
- V1 = exp(V2) → Vector Exp → Map
- M1 = M2 + M3 → Matrix Plus → ZipWith
- C2 = sort(C1) → Collection Quicksort → Divide & Conquer
Op fusion
- Operates on all loop-based ops
- Reduces op overhead and improves locality
- Eliminates temporary data structures
- Merging loop bodies may enable further optimizations
- Fuses both dependent and side-by-side operations
- Fused ops can have multiple inputs & outputs
- Algorithm: fuse two loops if size(loop1) == size(loop2) and there are no mutual dependencies (which aren't removed by fusing)
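A minimal sketch of the dependent-fusion case (illustrative Scala, not Delite's fusion pass): a Map feeding a Map over the same size collapses into a single pass, eliminating the intermediate collection.

```scala
// Fusing two element-wise stages into one: g after f, one loop body
def fuse[A, B, C](f: A => B, g: B => C): A => C = a => g(f(a))

val expf:  Double => Double = math.exp
val plus1: Double => Double = _ + 1.0

val input = Seq(0.0, 1.0, 2.0)

// Unfused: two loops and one temporary Seq between them
val unfused = input.map(expf).map(plus1)

// Fused: one loop, no temporary; same result
val fused = input.map(fuse(expf, plus1))
```

Both loops here have the same size and the second depends only element-wise on the first, so fusion is legal by the rule above; the payoff is one traversal and no intermediate allocation.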
[Chart: normalized execution time vs. processors (1, 2, 4, 8) for C++, OptiML with fusing, and OptiML without fusing; data labels 0.9/1.8/3.3/5.6, 1.0/1.9/3.4/5.8, and 0.3/0.6/0.9/1.0.]
The full lowering stack: application code (DSL user) → domain ops (DSL author) → Delite ops → generic ops (Delite), with domain, parallelism, and generic analysis & optimization plus code generation at the corresponding layers. For example: s = sum(M) → Matrix Sum → Reduce; V1 = exp(V2) → Vector Exp → Map; M1 = M2 + M3 → Matrix Plus → ZipWith; C2 = sort(C1) → Collection Quicksort → Divide & Conquer.
Generic optimizations
- Common subexpression elimination (CSE)
- Dead code elimination (DCE)
- Constant folding
- Code motion (e.g., loop hoisting)
- Side effect and alias tracking
All performed at the granularity of DSL operations (e.g., MatrixMultiply).
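Because IR nodes are immutable values, op-granularity CSE reduces to sharing: two structurally identical ops (same node class, same inputs) resolve to one node. A minimal sketch (illustrative names, not Delite's implementation):

```scala
// DSL ops as immutable IR nodes
sealed trait Op
case class MatrixMultiply(a: String, b: String) extends Op

// CSE table: structurally equal ops map to one shared node
val seen = scala.collection.mutable.Map[Op, Op]()
def cse(op: Op): Op = seen.getOrElseUpdate(op, op)

val m1 = cse(MatrixMultiply("A", "B"))
val m2 = cse(MatrixMultiply("A", "B"))  // reuses the cached node
```

Working at the granularity of whole DSL ops means an entire redundant MatrixMultiply is eliminated in one step, rather than hoping a low-level optimizer rediscovers the redundancy inside the generated loops.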
Compilation pipeline (diagram): Intermediate Representation (IR)
- A Liszt or OptiML program is embedded via the Scala Embedding Framework
- Domain-specific IR (DS IR): domain analysis & optimization
- Delite IR (Delite Parallelism Framework): parallelism analysis & optimization
- Base IR: generic analysis & optimization
- Code generation emits the Delite Execution Graph, kernels (Scala, C, Cuda), and DSL data structures
Code generation
- Delite can have multiple registered target code generators
- Calls all generators for each op to create kernels; only one generator has to succeed
- Generates an execution graph that enumerates the ops in the program
- Encodes parallelism within the application
- Contains all the information the Delite Runtime requires to execute the program (op dependencies, supported targets, etc.)
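A sketch of the information the execution graph carries per op and how a scheduler consumes it (illustrative structure, not Delite's actual graph format): each node records its dependencies and which targets have a generated kernel.

```scala
// One op in the execution graph: id, dependencies, targets with a kernel
case class OpNode(id: String, deps: Set[String], targets: Set[String])

val graph = Seq(
  OpNode("x1", Set(),             Set("scala", "cuda")),
  OpNode("x2", Set("x1"),         Set("scala")),        // no GPU kernel: CPU only
  OpNode("x3", Set("x1", "x2"),   Set("scala", "cuda"))
)

// An op is ready to dispatch once all of its dependencies have executed
def ready(done: Set[String]): Seq[OpNode] =
  graph.filter(n => !done(n.id) && n.deps.subsetOf(done))
```

Ops with no dependency edge between them (none here, but e.g. two children of x1) could be dispatched concurrently; the `targets` set is what lets the scheduler fall back to a CPU kernel when, as with x2, only one generator succeeded.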
Delite Runtime (diagram): consumes the Delite Execution Graph, the kernels (Scala, C, Cuda), the DSL data structures, and the application inputs.
- Walk-time: scheduler and JIT code generator produce partial schedules and fused, specialized kernels (kernel fusion, specialization, synchronization)
- Execution-time, on the local system (SMP machine + GPU): schedule dispatch, memory management, lazy data transfers
- Compiles the execution graph to an executable for each target resource
- Defers all synchronization to this point and optimizes it
- Kernels are specialized based on the number of processors (e.g., specialize the height of a tree reduction)
- Greatly reduces overhead compared to dynamic interpretation of the graph; ops can be finer-grained with less overhead
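The tree-reduction specialization mentioned above can be sketched as follows (illustrative Scala, assuming the per-processor partial results are already computed): once the processor count p is fixed at walk time, the combining tree has a constant log2(p) levels and can be emitted with no dynamic scheduling.

```scala
// Combine p partial sums with a binary tree of fixed height log2(p).
// With p known at code-generation time, this loop nest can be fully
// specialized (even unrolled) in the emitted kernel.
def treeReduce(partials: Array[Double], p: Int): Double = {
  val a = partials.clone()
  var stride = 1
  while (stride < p) {          // one pass per tree level
    var i = 0
    while (i + stride < p) {
      a(i) += a(i + stride)     // pair-wise combine at this level
      i += 2 * stride
    }
    stride *= 2
  }
  a(0)
}

val total = treeReduce(Array(1.0, 2.0, 3.0, 4.0), 4)
```

For p = 4 this is two levels: (1+2, 3+4) then (3+7), mirroring how a specialized kernel would hard-code the reduction shape for the allocated processors.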
[Chart: GDA with a 64-element input; normalized execution time on 1, 2, 4, 8 processors, compiled vs. interpreted. Compiled: 1.00, 1.62, 2.30, 3.21; interpreted: 0.99, 0.53, 0.62, 0.49.]
GPU execution
- A Cuda host thread launches kernels and automatically manages data transfers
- The compiler provides helper functions to copy DSL data structures to and from the device
- Provides device memory management for kernels:
  - Liveness analysis determines when op inputs and outputs can be freed
  - The runtime frees dead data when it experiences memory pressure
- With a library approach we can only launch pre-written kernels
- Code generation enables kernels containing user-defined functions and opens optimization opportunities (e.g., fusing operations into one kernel and keeping intermediate results in registers)
[Chart: normalized GPU execution time for RBM, NB, and GDA, library-based vs. Delite-generated code; Delite data labels 1.0, 2.3, 5.5.]
Experimental setup
- Machine: two quad-core Nehalem 2.67 GHz processors, NVidia Tesla C2050 GPU
- OptiML + Delite compared against:
  - MATLAB: version 1 parallelized for multi-core; version 2 using the GPU
  - C++: used the Armadillo linear algebra library; algorithmically identical to the OptiML version
Results (normalized execution time for OptiML, parallelized MATLAB, and C++ on 1, 2, 4, 8 CPUs and CPU + GPU; data labels listed in that order):
- GDA: OptiML 1.0, 1.6, 1.8, 1.9, 41.3; MATLAB 0.5, 0.9, 1.4, 1.6, 2.6; C++ 0.6
- Naive Bayes: OptiML 1.0, 1.9, 3.6, 5.8, 1.1; MATLAB 0.1, 0.2, 0.2, 0.3, 1.2 (broken axis: 0.01, 100.00 to 110.00)
- RBM: OptiML 1.0, 1.7, 2.7, 3.5, 11.0; MATLAB 1.0, 1.9, 3.2, 4.7, 8.9; C++ 0.6
- K-means: OptiML 1.0, 2.1, 4.1, 7.1, 2.3; MATLAB 0.3, 0.4, 0.4, 0.4, 0.3; C++ 1.2
- SVM: OptiML 1.0, 1.9, 3.1, 4.2, 1.1; MATLAB 0.9, 1.2, 1.4, 1.4, 0.8 (broken axis: 0.1, 7.00 to 15.00)
- Linear Regression: OptiML 1.0, 1.4, 2.0, 2.3, 1.7; MATLAB 0.5, 0.9, 1.3, 1.1, 0.4; C++ 0.5
Conclusions
- DSLs can provide both productivity and performance
- Need to simplify the process of developing DSLs for parallel hardware
- Delite provides a framework for creating heterogeneous parallel DSLs
- Performs generic, parallel, and domain-specific optimizations

Visit us at ppl.stanford.edu for a link to the GitHub project and related publications & projects.