A Heterogeneous Parallel Framework - PowerPoint PPT Presentation



SLIDE 1

A Heterogeneous Parallel Framework for Domain-Specific Languages

Kevin J. Brown, Arvind K. Sujeeth, HyoukJoong Lee, Hassan Chafi, Kunle Olukotun (Stanford University); Tiark Rompf, Martin Odersky (EPFL)

SLIDE 2

Programmability Chasm

[Diagram] The gap between applications (Virtual Worlds, Personal Robotics, Data Informatics, Scientific Engineering) and today's hardware (Cray Jaguar, Sun T2, Nvidia Fermi, Altera FPGA) with its programming models (MPI, PGAS, Pthreads, OpenMP, CUDA, OpenCL, Verilog, VHDL), to be bridged by a Parallel Programming Language.

SLIDE 3

The Ideal Parallel Programming Language

[Diagram] Performance, Productivity, Generality

SLIDE 4

Successful Languages

[Diagram] Performance, Productivity, Generality

SLIDE 5

Domain-Specific Languages

[Diagram] Domain-specific languages positioned against Performance (Heterogeneous Parallelism), Productivity, and Generality.

SLIDE 6

Benefits of Using DSLs for Parallelism

Productivity

  • Shield most programmers from the difficulty of parallel programming
  • Focus on developing algorithms and applications and not on low-level implementation details

Performance

  • Match high-level domain abstraction to generic parallel execution patterns
  • Restrict expressiveness to more easily and fully extract available parallelism
  • Use domain knowledge for static/dynamic optimizations

Portability and forward scalability

  • DSL & runtime can be evolved to take advantage of the latest hardware features
  • Applications remain unchanged
  • Allows innovative HW without worrying about application portability
SLIDE 7

DSLs: Compiler vs. Library

 A Domain-Specific Approach to Heterogeneous Parallelism, Chafi et al.
 A framework for parallel DSL libraries
 Used data-parallel patterns and deferred execution (transparent futures) to execute tasks in parallel

 Why write a compiler?
 Static optimizations (both generic and domain-specific)
 All DSL abstractions can be removed from the generated code
 Generate code for hardware not supported by the host language
 Full-program analysis

SLIDE 8

Common DSL Framework

 Building a new DSL
 Design the language (syntax, operations, abstractions, etc.)
 Implement compiler (parsing, type checking, optimizations, etc.)
 Discover parallelism (understand parallel patterns)
 Emit parallel code for different hardware (optimize for low-level architectural details)
 Handle synchronization, multiple address spaces, etc.

 Need a DSL infrastructure
 Embed DSLs in a common host language
 Provide building blocks for common DSL compiler & runtime functionality

SLIDE 9

Delite Overview

[Diagram] Delite: DSL Infrastructure. Domain-specific languages — Physics (Liszt), Machine Learning (OptiML), Data Analytics (OptiQL) — are built on a domain embedding language (Scala). The Delite Compiler performs static optimizations and heterogeneous code generation; the Delite Runtime provides staged execution, walk-time optimizations, locality-aware scheduling, and parallel patterns over heterogeneous hardware (SMP, GPU).

SLIDE 10

DSL Intermediate Representation (IR)

[Diagram] The DSL user writes the application; the DSL author defines the domain ops and the domain user interface. Application statements map to domain IR nodes used for domain analysis & optimization: s = sum(M) → Matrix Sum, V1 = exp(V2) → Vector Exp, M1 = M2 + M3 → Matrix Plus, C2 = sort(C1) → Collection Quicksort.

SLIDE 11

Building an IR

 OptiML: A DSL for machine learning
 Built using Delite
 Supports linear algebra (Matrix/Vector) operations
 DSL methods build IR as program runs

// a, b, c, d : Matrix
val x = a * b + c * d

def infix_+(a: Matrix, b: Matrix) = new MatrixPlus(a, b)
def infix_*(a: Matrix, b: Matrix) = new MatrixTimes(a, b)

[Diagram] Resulting IR: Matrix Plus over two Matrix Times nodes, i.e. MatrixPlus(MatrixTimes(A, B), MatrixTimes(C, D)).
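A minimal sketch of the staging idea above, using assumed class names (`Sym`, `MatrixPlus`, `MatrixTimes` are illustrative, not the actual OptiML implementation): the DSL's operator methods construct IR nodes instead of performing arithmetic, so running the user program yields the expression tree from the slide.

```scala
// Sketch: DSL methods build an IR as the program runs.
// All names here are assumed for illustration, not Delite's real API.
sealed trait Exp
case class Sym(name: String) extends Exp           // a symbolic matrix
case class MatrixPlus(a: Exp, b: Exp) extends Exp  // IR node for +
case class MatrixTimes(a: Exp, b: Exp) extends Exp // IR node for *

object StagingDemo {
  // Operator methods create IR nodes; nothing is computed yet.
  implicit class MatrixSyntax(a: Exp) {
    def +(b: Exp): Exp = MatrixPlus(a, b)
    def *(b: Exp): Exp = MatrixTimes(a, b)
  }

  def build(): Exp = {
    val (a, b, c, d) = (Sym("A"), Sym("B"), Sym("C"), Sym("D"))
    a * b + c * d // evaluates to an expression tree, not a matrix value
  }

  def main(args: Array[String]): Unit = println(build())
}
```

Because `*` binds tighter than `+` in Scala, the tree comes out exactly as on the slide: a plus node over two times nodes.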

SLIDE 12

DSL Optimizations

 DSL developer defines how DSL operations create IR nodes
 Specialize implementation of operation for each occurrence by pattern matching on the IR
 This technique can be used both to control what to add to the IR and to perform IR rewrites
 Use this to apply linear algebra simplification rules

[Diagram] Rewrite: A*B + A*C → A*(B + C)
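The rewrite above can be sketched as a "smart constructor" that pattern matches on its operands before deciding which IR node to emit (class names are assumed for illustration, not the real OptiML IR):

```scala
// Sketch: applying a linear-algebra simplification rule while building
// the IR. Names are illustrative, not the actual Delite/OptiML classes.
sealed trait MExp
case class MSym(name: String) extends MExp
case class Plus(a: MExp, b: MExp) extends MExp
case class Times(a: MExp, b: MExp) extends MExp

object RewriteDemo {
  // Smart constructor for +: if both operands are products sharing a
  // left factor (A*B + A*C), emit A*(B + C) instead of the plain sum.
  def plus(x: MExp, y: MExp): MExp = (x, y) match {
    case (Times(a1, b), Times(a2, c)) if a1 == a2 => Times(a1, plus(b, c))
    case _                                        => Plus(x, y)
  }

  def demo(): MExp = {
    val (a, b, c) = (MSym("A"), MSym("B"), MSym("C"))
    plus(Times(a, b), Times(a, c)) // rewritten to A * (B + C)
  }

  def main(args: Array[String]): Unit = println(demo())
}
```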

SLIDE 13

OptiML Linear Algebra Rewrites

 A straightforward translation of the Gaussian Discriminant Analysis (GDA) algorithm from the mathematical description produces the following code:

val sigma = sum(0, m) { i =>
  val a = if (!x.labels(i)) x(i) - mu0 else x(i) - mu1
  a.t ** a
}

 A much more efficient implementation recognizes that

$\sum_{j=0}^{n} y_j * z_j \;\rightarrow\; \sum_{j=0}^{n} Y[:,j] * Z[j,:] = Y * Z$

 Transformed code was 20.4x faster with 1 thread and 48.3x faster with 8 threads.

SLIDE 14

Delite DSL Framework

 Building a new DSL
 Design the language (syntax, operations, abstractions, etc.)
 Implement compiler
 Domain-specific analysis and optimization
 Lexing, parsing, type-checking, generic optimizations
 Discover parallelism (understand parallel patterns)
 Emit parallel code for different hardware (optimize for low-level architectural details)
 Handle synchronization, multiple address spaces, etc.

SLIDE 15

Delite Ops

 Encode known parallel execution patterns
 Map, filter, reduce, …
 Bulk-synchronous foreach
 Divide & conquer
 Delite provides implementations of these patterns for multiple hardware targets
 e.g., multi-core, GPU
 DSL author maps each domain operation to the appropriate pattern
 Delite handles parallel optimization, code generation, and execution for all DSLs
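The division of labor above can be sketched as follows (a sequential stand-in with assumed names, not the real Delite op hierarchy): the framework supplies one implementation of each pattern per hardware target, and the DSL author's domain ops are thin mappings onto those patterns.

```scala
// Sketch of parallel patterns as a framework/DSL-author split.
// Names and sequential bodies are illustrative; real Delite generates
// multi-core and GPU implementations of each pattern.
object PatternDemo {
  // Framework side: generic patterns, written once per target.
  def zipWith(a: Vector[Double], b: Vector[Double])(f: (Double, Double) => Double): Vector[Double] =
    a.zip(b).map { case (x, y) => f(x, y) }

  def reduce(a: Vector[Double], zero: Double)(f: (Double, Double) => Double): Double =
    a.foldLeft(zero)(f)

  // DSL-author side: domain ops are one-line mappings onto patterns.
  def vectorPlus(a: Vector[Double], b: Vector[Double]): Vector[Double] =
    zipWith(a, b)(_ + _)

  def vectorSum(a: Vector[Double]): Double =
    reduce(a, 0.0)(_ + _)

  def main(args: Array[String]): Unit =
    println(vectorSum(vectorPlus(Vector(1.0, 2.0), Vector(3.0, 4.0))))
}
```

Because every domain op reduces to a known pattern, parallel optimization and code generation only have to understand the patterns, not each DSL.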

SLIDE 16

Multiview Delite IR

[Diagram] Each IR node is viewed both as a domain op (domain analysis & optimization, defined by the DSL author) and as a Delite op (parallelism analysis & optimization, code generation, provided by Delite): s = sum(M) → Matrix Sum / Reduce, V1 = exp(V2) → Vector Exp / Map, M1 = M2 + M3 → Matrix Plus / ZipWith, C2 = sort(C1) → Collection Quicksort / Divide & Conquer.

SLIDE 17

Delite Op Fusion

 Operates on all loop-based ops
 Reduces op overhead and improves locality
 Elimination of temporary data structures
 Merging loop bodies may enable further optimizations
 Fuse both dependent and side-by-side operations
 Fused ops can have multiple inputs & outputs
 Algorithm: fuse two loops if
 size(loop1) == size(loop2)
 No mutual dependencies (which aren't removed by fusing)
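On a toy loop representation (assumed and far simpler than Delite's real loop IR), both fusion cases can be sketched: side-by-side fusion merges two equal-size loops into one pass with multiple outputs, and dependent fusion inlines the producer's element into the consumer so the temporary is never materialized.

```scala
// Toy model: an element-wise loop is a size plus a per-index body.
// This is an assumed simplification, not Delite's loop representation.
case class EwLoop[A](size: Int, body: Int => A)

object FusionDemo {
  // Side-by-side fusion: equal sizes (and, trivially here, no mutual
  // dependencies) allow one pass computing both outputs.
  def fuseSideBySide[A, B](l1: EwLoop[A], l2: EwLoop[B]): Option[EwLoop[(A, B)]] =
    if (l1.size == l2.size) Some(EwLoop(l1.size, i => (l1.body(i), l2.body(i))))
    else None

  // Dependent fusion: the consumer reads the producer's element
  // directly, eliminating the temporary data structure.
  def fuseDependent[A, B](producer: EwLoop[A], consume: A => B): EwLoop[B] =
    EwLoop(producer.size, i => consume(producer.body(i)))

  // Materialize a loop's result (the "run" step).
  def run[A](l: EwLoop[A]): Vector[A] = Vector.tabulate(l.size)(l.body)
}
```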

SLIDE 18

Downsampling in OptiML

[Chart] Normalized execution time on 1, 2, 4, and 8 processors for C++, OptiML with fusing, and OptiML without fusing.

SLIDE 19

Multiview Delite IR

[Diagram] The same IR nodes additionally extend a generic Op view used for generic analysis & optimization (provided by Delite), alongside the domain-op view (DSL author) and the Delite-op view: s = sum(M), V1 = exp(V2), M1 = M2 + M3, C2 = sort(C1).

SLIDE 20

Generic IR

 Optimizations
 Common subexpression elimination (CSE)
 Dead code elimination (DCE)
 Constant folding
 Code motion (e.g., loop hoisting)
 Side effects and alias tracking
 All performed at the granularity of DSL operations
 e.g., MatrixMultiply
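One of these generic passes, CSE at DSL-op granularity, can be sketched on a tiny expression IR (assumed representation): structurally identical subtrees, however large the op they denote, are assigned a single shared node.

```scala
// Sketch of common subexpression elimination over a toy IR.
// Representation is assumed for illustration, not Delite's base IR.
sealed trait GExp
case class GSym(name: String) extends GExp
case class Mult(a: GExp, b: GExp) extends GExp // stands for a coarse op
case class Add(a: GExp, b: GExp) extends GExp  // like MatrixMultiply

object GenericOptDemo {
  // Number each distinct node once; structurally equal subtrees (e.g.
  // two occurrences of A*B) map to the same id, so the op runs once.
  def cse(root: GExp): Map[GExp, Int] = {
    var table = Map.empty[GExp, Int]
    def visit(e: GExp): Int = table.getOrElse(e, {
      e match { // visit children first so their ids exist
        case Mult(a, b) => visit(a); visit(b)
        case Add(a, b)  => visit(a); visit(b)
        case _          => ()
      }
      val id = table.size
      table += (e -> id)
      id
    })
    visit(root)
    table
  }
}
```

For `A*B + A*B` the naive tree has seven nodes, but CSE numbers only four: A, B, one shared A*B, and the Add.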

SLIDE 21

Delite DSL Compiler Infrastructure

[Diagram] Liszt and OptiML programs are staged through the Scala embedding framework into a layered intermediate representation (IR): a domain-specific IR (domain analysis & optimization) built on the Delite IR (parallelism analysis, optimization & mapping, via the Delite parallelism framework), built on a base IR (generic analysis & optimization). Code generation then emits the Delite execution graph, kernels (Scala, C, Cuda), and DSL data structures.

SLIDE 22

Heterogeneous Code Generation

 Delite can have multiple registered target code generators (Scala, Cuda, …)
 Calls all generators for each Op to create kernels
 Only 1 generator has to succeed
 Generates an execution graph that enumerates all Delite Ops in the program
 Encodes parallelism within the application
 Contains all the information the Delite Runtime requires to execute the program
 Op dependencies, supported targets, etc.
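The execution-graph record described above can be sketched as follows (field names are assumed): each op carries its dependencies and the set of targets for which some generator succeeded, which is exactly what a scheduler needs.

```scala
// Sketch of the execution graph the compiler emits. Field names are
// illustrative, not the actual Delite Execution Graph format.
case class OpNode(id: String, deps: Set[String], targets: Set[String])

object ExecGraphDemo {
  // An op is usable as long as at least one registered code generator
  // produced a kernel for it ("only 1 generator has to succeed").
  def generatable(op: OpNode): Boolean = op.targets.nonEmpty

  // Ops whose dependencies are all complete may be dispatched in
  // parallel; this is the parallelism the graph encodes.
  def ready(graph: Seq[OpNode], done: Set[String]): Seq[OpNode] =
    graph.filter(op => !done(op.id) && op.deps.subsetOf(done))
}
```

For a two-op chain, only the independent op is ready initially; once it completes, its dependent becomes ready.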

SLIDE 23

Delite Runtime

[Diagram] Inputs: the Delite execution graph, kernels (Scala, C, Cuda), DSL data structures, and application inputs. At walk time, a scheduler and code generator produce partial schedules and fused, specialized kernels (JIT kernel fusion, specialization, synchronization) for the local system (SMP machine + GPU). At execution time: schedule dispatch, memory management, and lazy data transfers.

SLIDE 24

Schedule & Kernel Compilation

 Compile execution graph to executables for each resource after scheduling
 Defer all synchronization to this point and optimize
 Kernels specialized based on the number of processors allocated to them
 e.g., specialize height of tree reduction
 Greatly reduces overhead compared to a dynamic deferred execution model
 Can have finer-grained Ops with less overhead
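The tree-reduction example can be sketched as follows (an assumed simplification of what Delite specializes): with p processors the combining tree has ceil(log2(p)) levels, so once the schedule fixes p, the kernel's combining structure is known statically.

```scala
// Sketch: specializing a tree reduction to the processor count.
// An assumed simplification, not Delite's generated kernel code.
object TreeReduceDemo {
  // Height of the combining tree for p partial results: ceil(log2(p)).
  // Known at schedule time, so the kernel can hard-code it.
  def treeLevels(p: Int): Int =
    if (p <= 1) 0 else 1 + treeLevels((p + 1) / 2)

  // Combine per-processor partial results pairwise, level by level,
  // as the specialized kernel would across p processors.
  def treeReduce(parts: Vector[Int])(f: (Int, Int) => Int): Int =
    if (parts.size == 1) parts.head
    else treeReduce(parts.grouped(2).map(_.reduce(f)).toVector)(f)
}
```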

SLIDE 25

Benefits of Runtime Codegen

 GDA with 64-element input

[Chart] Normalized execution time on 1, 2, 4, and 8 processors, compiled vs. interpreted. Compiled: 1.00, 1.62, 2.30, 3.21; interpreted: 0.99, 0.53, 0.62, 0.49.

SLIDE 26

GPU Management

 Cuda host thread launches kernels and automatically performs data transfers as required by the schedule
 Compiler provides helper functions to
  • copy data structures between address spaces
  • pre-allocate outputs and temporaries
  • select the number of threads & thread blocks
 Provides device memory management for kernels
  • Performs liveness analysis to determine when op inputs and outputs are dead on the GPU
  • Runtime frees dead data when it experiences memory pressure
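The liveness rule above can be sketched as follows (data structures are assumed): the last op that reads a GPU-resident datum defines its last use, after which the datum is dead and may be freed under memory pressure.

```scala
// Sketch of liveness-driven freeing of GPU-resident data.
// Structures are illustrative, not the Delite runtime's bookkeeping.
object GpuMemDemo {
  // reads(i) = the set of data read by op i, in schedule order.
  // Result: for each datum, the index of its last reading op.
  def lastUse(reads: Seq[Set[String]]): Map[String, Int] =
    reads.zipWithIndex
      .flatMap { case (ds, i) => ds.map(_ -> i) }
      .toMap // later (datum -> index) pairs overwrite earlier ones

  // Data that become dead immediately after op i executes, i.e.
  // candidates to free when the GPU is under memory pressure.
  def deadAfter(reads: Seq[Set[String]], i: Int): Set[String] =
    lastUse(reads).collect { case (d, last) if last == i => d }.toSet
}
```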

SLIDE 27

Cuda Code Generation

 With a library approach we can only launch pre-written kernels
 Code generation enables kernels containing user-defined functions and optimization opportunities
 e.g., fuse operations into one kernel and keep intermediate results in registers

[Chart] Normalized execution time for RBM, NB, and GDA: library-based vs. Delite.

SLIDE 28

Performance Results

 Machine
 Two quad-core Nehalem 2.67 GHz processors
 NVidia Tesla C2050 GPU
 Application Versions
 OptiML + Delite
 MATLAB
 version 1: multi-core (parallelization using "parfor" construct and BLAS)
 version 2: GPU
 C++
 used Armadillo linear algebra library for a sequential baseline
 Algorithmically identical to OptiML version

SLIDE 29

OptiML vs. MATLAB vs. Armadillo (C++)

[Charts] Normalized execution time for OptiML, parallelized MATLAB, and C++ on 1, 2, 4, and 8 CPUs and CPU + GPU, for six benchmarks: GDA, Naive Bayes, RBM, K-means, SVM, and Linear Regression.

SLIDE 30

Conclusions

 DSLs can provide both productivity and performance on heterogeneous hardware
 Need to simplify the process of developing DSLs for parallelism
 Delite provides a framework for creating heterogeneous parallel DSLs
 Performs generic, parallel, and domain-specific optimizations in a single system
 Visit us at ppl.stanford.edu
 Link to GitHub project
 Related publications & projects