Building-Blocks for Performance Oriented DSLs



slide-1
SLIDE 1

Building-Blocks for Performance Oriented DSLs

Tiark Rompf, Martin Odersky (EPFL)
Arvind Sujeeth, HyoukJoong Lee, Kevin Brown, Hassan Chafi, Kunle Olukotun (Stanford University)

slide-2
SLIDE 2

DSL Benefits

Make programmers more productive

  • Raise the level of abstraction
  • Easier to reason about programs
  • Maintenance, verification, etc.

slide-3
SLIDE 3

Performance Oriented DSLs

Make the compiler more productive, too!

  • Generate better code
  • Optimize using domain knowledge
  • Target heterogeneous + parallel hardware

slide-4
SLIDE 4

DSLs under Development

  • Liszt (mesh based PDE solvers)
    DeVito et al.: Liszt: A Domain-Specific Language for Building Portable Mesh-based PDE solvers. Supercomputing (SC) 2011
  • OptiML (machine learning)
    Sujeeth et al.: OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning. International Conference on Machine Learning (ICML) 2011
  • OptiQL (data query)
  • all embedded in Scala
  • heterogeneous compilation (multi-core CPU/GPU)
  • good absolute performance and speedups

slide-5
SLIDE 5

Common DSL Infrastructure

  • Don’t start from scratch for each new DSL
      • It’s just too hard …
  • Delite Framework + Runtime
    See also Brown et al.: A Heterogeneous Parallel Framework for Domain-Specific Languages. PACT’11
  • This Talk/Paper: Building blocks that work together in new or interesting ways

slide-6
SLIDE 6

Focus on 2 things:

  • #1: DeliteOps
      • high-level view of common execution patterns (i.e. loops)
      • parallelism and heterogeneous targets
  • #2: Staging
      • DSL programs are program generators
      • move (costly) abstraction to generating stage
  • Case study: SPADE app in OptiML

slide-7
SLIDE 7

#1: DeliteOps

slide-8
SLIDE 8

Heterogeneous Parallel Programming

Hardware: Cray Jaguar, Sun T2, Nvidia Fermi, Altera FPGA

Programming models: MPI, Pthreads, OpenMP, CUDA, OpenCL, Verilog, VHDL

Today: Performance = heterogeneous + parallel

slide-9
SLIDE 9

Heterogeneous Parallel Programming

Hardware: Cray Jaguar, Sun T2, Nvidia Fermi, Altera FPGA

Programming models: MPI, Pthreads, OpenMP, CUDA, OpenCL, Verilog, VHDL

Your favourite Java, Haskell, Scala, C++ compiler will not generate code for these platforms.

Compilers have not kept pace!

slide-10
SLIDE 10

Programmability Chasm

Too many different programming models

Applications: Virtual Worlds, Personal Robotics, Data informatics, Scientific Engineering

Programming models: MPI, Pthreads, OpenMP, CUDA, OpenCL, Verilog, VHDL

Hardware: Cray Jaguar, Sun T2, Nvidia Fermi, Altera FPGA

slide-11
SLIDE 11

DeliteOps

  • Capture common parallel execution patterns
      • map, filter, reduce, … join, bfs, …
  • Map them efficiently to a variety of target platforms
      • Multi-core CPU, GPU
  • Express your DSL as DeliteOps
      • => Parallelism for free!
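A minimal sketch of the idea, not the Delite API: the DSL author only states which pattern an operation is, and the pattern decides how to run it. All names here (`Op`, `MapOp`, `ReduceOp`, `runChunked`) are hypothetical, and the "chunked" mode merely stands in for a parallel target by processing independent chunks.

```scala
object DeliteOpSketch {
  // Hypothetical, simplified stand-ins for parallel patterns; NOT the Delite API.
  sealed trait Op[A] {
    def runSequential: A           // one "target": a plain sequential traversal
    def runChunked(chunks: Int): A // another "target": independent chunks, standing in for parallel workers
  }

  private def chunkSize(n: Int, chunks: Int): Int =
    math.max(1, (n + chunks - 1) / math.max(1, chunks))

  final case class MapOp[A, B](in: Vector[A], f: A => B) extends Op[Vector[B]] {
    def runSequential: Vector[B] = in.map(f)
    def runChunked(chunks: Int): Vector[B] =
      in.grouped(chunkSize(in.length, chunks)).flatMap(_.map(f)).toVector
  }

  // Assumption: `f` is associative and `zero` is its identity, otherwise chunking changes the result.
  final case class ReduceOp[A](in: Vector[A], zero: A, f: (A, A) => A) extends Op[A] {
    def runSequential: A = in.foldLeft(zero)(f)
    def runChunked(chunks: Int): A =
      in.grouped(chunkSize(in.length, chunks)).map(_.foldLeft(zero)(f)).foldLeft(zero)(f)
  }

  // The DSL author expresses operations as patterns and writes no scheduling code.
  def vectorSum(v: Vector[Int]): Op[Int] = ReduceOp[Int](v, 0, _ + _)
  def vectorDouble(v: Vector[Int]): Op[Vector[Int]] = MapOp[Int, Int](v, _ * 2)
}
```

Because `vectorSum` is declared as a reduce, both execution modes are available without any change to the DSL operation itself.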

slide-12
SLIDE 12

Delite DSL Compiler

[Diagram: compilation pipeline]

  • Input: Liszt program, OptiML program
  • Scala Embedding Framework
  • Intermediate Representation (IR), Delite Parallelism Framework
      • Base IR: Generic Analysis & Opt.
      • DS IR: Domain Analysis & Opt.
      • Delite IR: Parallelism Analysis, Opt. & Mapping
  • Code Generation ⇒ Delite Execution Graph, Kernels (Scala, C, Cuda, MPI, Verilog, …), Data Structures (arrays, trees, graphs, …)

slide-13
SLIDE 13

Delite Op Fusion

  • Operates on all loop-based ops
  • Reduces op overhead and improves locality
      • Elimination of temporary data structures
      • Merging loop bodies may enable further optimizations
  • Fuse both dependent and side-by-side operations
      • Fused ops can have multiple inputs + outputs
  • Algorithm: fuse two loops if
      • size(loop1) == size(loop2)
      • No mutual dependencies (which aren’t removed by fusing)
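The two fusion conditions can be sketched as a predicate over a toy loop descriptor. This is a literal reading of the slide under stated assumptions, not the Delite implementation: `Loop`, `reads`, `dependsOn` and `canFuse` are hypothetical names, and the sketch does not model dependencies that fusing itself would remove.

```scala
object FusionCheck {
  // Hypothetical loop descriptor: `reads` holds the ids of loops whose results this loop consumes.
  final case class Loop(id: String, size: Int, reads: Set[String])

  // True if `from` depends on `to`, directly or transitively, through `reads`.
  def dependsOn(from: Loop, to: Loop, all: Map[String, Loop]): Boolean = {
    def go(todo: List[String], seen: Set[String]): Boolean = todo match {
      case Nil => false
      case id :: rest =>
        if (id == to.id) true
        else if (seen(id)) go(rest, seen)
        else go(all.get(id).map(_.reads.toList).getOrElse(Nil) ++ rest, seen + id)
    }
    go(from.reads.toList, Set.empty)
  }

  // Fuse if the iteration spaces match and the loops do not depend on each other in both directions.
  def canFuse(a: Loop, b: Loop, all: Map[String, Loop]): Boolean =
    a.size == b.size && !(dependsOn(a, b, all) && dependsOn(b, a, all))
}
```

A producer and its consumer of equal size pass the check (dependent fusion), as do two unrelated loops of equal size (side-by-side fusion); loops of different sizes do not.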

slide-14
SLIDE 14

Delite Op Fusion

Source program (3+1+(1+1) = 6 traversals, 4 arrays):

    def square(x: Rep[Double]) = x*x
    def mean(xs: Rep[Array[Double]]) = xs.sum / xs.length
    def variance(xs: Rep[Array[Double]]) = xs.map(square).sum / xs.length - square(mean(xs))

    val array1 = Array.fill(n) { i => 1 }
    val array2 = Array.fill(n) { i => 2*i }
    val array3 = Array.fill(n) { i => array1(i) + array2(i) }
    val m = mean(array3)
    val v = variance(array3)
    println(m)
    println(v)

Generated code (1 traversal, 0 arrays):

    // begin reduce x47,x51,x11
    var x47 = 0
    var x51 = 0
    var x11 = 0
    while (x11 < x0) {
      val x44 = 2.0*x11
      val x45 = 1.0+x44
      val x50 = x45*x45
      x47 += x45
      x51 += x50
      x11 += 1
    }
    // end reduce
    val x48 = x47/x0
    val x49 = println(x48)
    val x52 = x51/x0
    val x53 = x48*x48
    val x54 = x52-x53
    val x55 = println(x54)
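For readers who want to check the equivalence by hand, here is a plain (unstaged) Scala sketch of the same computation written both ways. It is an illustration with made-up names (`FusionByHand`, `unfused`, `fused`), not compiler output.

```scala
object FusionByHand {
  def square(x: Double): Double = x * x

  // Unfused: every step is its own traversal, and several steps allocate an array.
  def unfused(n: Int): (Double, Double) = {
    val array1 = Array.tabulate(n)(_ => 1.0)
    val array2 = Array.tabulate(n)(i => 2.0 * i)
    val array3 = Array.tabulate(n)(i => array1(i) + array2(i))
    val mean = array3.sum / n
    val variance = array3.map(square).sum / n - square(mean)
    (mean, variance)
  }

  // Fused: a single loop with two accumulators and no intermediate arrays.
  def fused(n: Int): (Double, Double) = {
    var sum = 0.0
    var sumSq = 0.0
    var i = 0
    while (i < n) {
      val x = 1.0 + 2.0 * i // element i of array3, computed on the fly
      sum += x
      sumSq += x * x
      i += 1
    }
    val mean = sum / n
    (mean, sumSq / n - mean * mean)
  }
}
```

For `n = 4` the elements are 1, 3, 5, 7, so both versions return a mean of 4.0 and a variance of 5.0.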

slide-15
SLIDE 15

#2: Staging

How do we go from DSL source to DeliteOps?

slide-16
SLIDE 16

2 Challenges:

  • #1: generate intermediate representation (IR) from DSL code embedded in Scala
  • #2: do it in such a way that the IR is free from unnecessary abstraction
      • Avoid abstraction penalty!

slide-17
SLIDE 17

Example

DSL program:

    val v = Vector.rand(100)
    println("today’s lucky number is: ")
    println(v.sum)

DSL interface:

    type Rep[T]
    abstract class Vector[T]
    def vector_rand(n: Rep[Int]): Rep[Vector[Double]]
    def infix_sum[T:Numeric](v: Rep[Vector[T]]): Rep[T]

DSL impl.:

    type Rep[T] = Exp[T]
    class Exp[T]
    class Def[T]
    case class VectorRand(n: Exp[Int]) extends Def[Vector[Double]]
    case class VectorSum[T:Numeric](in: Exp[Vector[T]]) extends DeliteOpReduce[Exp[T]] {
      def func = (a,b) => a + b
    }
    def vector_rand(n: Exp[Int]) = new VectorRand(n)
    def infix_sum[T:Numeric](v: Exp[Vector[T]]) = new VectorSum(v)
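The split between interface and implementation can be tried out in a few self-contained lines. This is a simplified sketch, assuming a plain `Int` size argument and a `Double`-only `sum`; `VectorDsl`, `Staged` and `program` are illustrative names and not part of LMS or Delite.

```scala
object StagingSketch {
  // Interface: DSL programs only see an abstract Rep[T].
  trait VectorDsl {
    type Rep[T]
    def vectorRand(n: Int): Rep[Vector[Double]]
    def sum(v: Rep[Vector[Double]]): Rep[Double]
  }

  // A DSL program, written once against the interface.
  def program(dsl: VectorDsl): dsl.Rep[Double] = dsl.sum(dsl.vectorRand(100))

  // Implementation: Rep[T] is an IR node, so running the program builds a tree instead of computing a number.
  sealed trait Exp[T]
  final case class VectorRand(n: Int) extends Exp[Vector[Double]]
  final case class VectorSum(in: Exp[Vector[Double]]) extends Exp[Double]

  object Staged extends VectorDsl {
    type Rep[T] = Exp[T]
    def vectorRand(n: Int): Exp[Vector[Double]] = VectorRand(n)
    def sum(v: Exp[Vector[Double]]): Exp[Double] = VectorSum(v)
  }
}
```

Evaluating `StagingSketch.program(StagingSketch.Staged)` yields the IR value `VectorSum(VectorRand(100))`, which a later phase could analyze or turn into code.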

slide-18
SLIDE 18

  • “Finally Tagless” / Polymorphic embedding
    Carette, Kiselyov, Shan: Finally Tagless, Partially Evaluated: Tagless Staged Interpreters for Simpler Typed Languages. APLAS’07/J. Funct. Prog. 2009.
    Hofer, Ostermann, Rendel, Moors: Polymorphic Embeddings of DSLs. GPCE’08.
  • Lightweight Modular Staging (LMS)
    Rompf, Odersky: Lightweight Modular Staging: A Pragmatic Approach to Runtime Code Generation and Compiled DSLs. GPCE’10.

slide-19
SLIDE 19

  • Can use the full host language to compose DSL program fragments!
  • Move (costly) abstraction to the generating stage

slide-20
SLIDE 20

Example

  • Use higher order functions in DSL programs
  • While keeping the DSL first order!

slide-21
SLIDE 21

Higher-Order functions

DSL program:

    val xs: Rep[Vector[Int]] = …
    println(xs.count(x => x > 7))

DSL implementation:

    def infix_foreach[A](v: Rep[Vector[A]])(f: Rep[A] => Rep[Unit]) = {
      var i: Rep[Int] = 0
      while (i < v.length) {
        f(v(i))
        i += 1
      }
    }
    def infix_count[A](v: Rep[Vector[A]])(f: Rep[A] => Rep[Boolean]) = {
      var c: Rep[Int] = 0
      v foreach { x => if (f(x)) c += 1 }
      c
    }

Generated code:

    val v: Array[Int] = ...
    var c = 0
    var i = 0
    while (i < v.length) {
      val x = v(i)
      if (x > 7) c += 1
      i += 1
    }
    println(c)
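Why the closure disappears can be seen in a toy code generator. In this sketch a staged value is nothing more than the source text that computes it; `Rep`, `gt` and `count` are hypothetical helpers, and real LMS builds a typed IR rather than strings.

```scala
object StagedCount {
  // A staged value, represented here only by the source text that computes it.
  final case class Rep[T](code: String)

  // Staged comparison: emits code for the test instead of performing it.
  def gt(a: Rep[Int], n: Int): Rep[Boolean] = Rep[Boolean](s"${a.code} > $n")

  // `f` is an ordinary Scala function that runs while code is generated,
  // so only its result (the condition text) ends up in the output.
  def count(v: Rep[Array[Int]])(f: Rep[Int] => Rep[Boolean]): String = {
    val cond = f(Rep[Int]("x")).code
    s"""var c = 0
       |var i = 0
       |while (i < ${v.code}.length) {
       |  val x = ${v.code}(i)
       |  if ($cond) c += 1
       |  i += 1
       |}
       |println(c)""".stripMargin
  }
}
```

Calling `StagedCount.count(StagedCount.Rep[Array[Int]]("v"))(x => StagedCount.gt(x, 7))` returns a first-order `while` loop whose condition is the inlined text `x > 7`.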

slide-22
SLIDE 22

Continuations

DSL program:

    val u,v,w: Rep[Vector[Int]] = ...
    nondet {
      val a = amb(u)
      val b = amb(v)
      val c = amb(w)
      require(a*a + b*b == c*c)
      println("found:")
      println(a,b,c)
    }

DSL implementation:

    def amb[T](xs: Rep[Vector[T]]): Rep[T] @cps[Rep[Unit]] = shift { k => xs foreach k }
    def require(x: Rep[Boolean]): Rep[Unit] @cps[Rep[Unit]] = shift { k => if (x) k() else () }

Generated code:

    while (…) {
      while (…) {
        while (…) {
          if (…) {
            println("found:")
            println(a,b,c)
          }
        }
      }
    }
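The same control flow can be written with explicit continuations in plain Scala, which makes the resulting nested loops visible. This sketch deliberately avoids the `@cps` plugin and staging; `requireThat` is a renamed stand-in for the slide's `require`, and `triples` is a hypothetical driver that collects results instead of printing them.

```scala
object AmbSketch {
  // Explicit continuation-passing versions: `k` is "the rest of the program".
  def amb[T](xs: Seq[T])(k: T => Unit): Unit = xs.foreach(k)
  def requireThat(cond: Boolean)(k: () => Unit): Unit = if (cond) k() else ()

  // Nondeterministic search for Pythagorean triples, one candidate from each input.
  def triples(u: Seq[Int], v: Seq[Int], w: Seq[Int]): List[(Int, Int, Int)] = {
    val found = scala.collection.mutable.ListBuffer.empty[(Int, Int, Int)]
    amb(u) { a =>
      amb(v) { b =>
        amb(w) { c =>
          requireThat(a * a + b * b == c * c) { () =>
            found += ((a, b, c))
            ()
          }
        }
      }
    }
    found.toList
  }
}
```

Each `amb` becomes one loop and `requireThat` becomes one conditional, which is the shape of the nested `while`/`if` code on the slide.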

slide-23
SLIDE 23

Result

  • Function values and continuations translated away by staging
  • Control flow strictly first order
  • Much simpler analysis for other optimizations

slide-24
SLIDE 24

Regular Compiler optimizations

  • Common subexpression and dead code elimination
  • Global code motion
  • Symbolic execution / pattern rewrites

Coarse-grained: optimizations can happen on vectors, matrices or whole loops
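Pattern rewrites are often implemented as "smart constructors" that simplify a node while the IR is being built. The sketch below shows that technique on a tiny scalar expression type; the node names and rules are assumptions for illustration, and the same approach would apply to coarser nodes such as vector or matrix operations.

```scala
object RewriteSketch {
  sealed trait Exp
  final case class Const(value: Double) extends Exp
  final case class Sym(name: String) extends Exp
  final case class Plus(a: Exp, b: Exp) extends Exp
  final case class Times(a: Exp, b: Exp) extends Exp

  // Smart constructors: apply algebraic rewrites before creating a node.
  def plus(a: Exp, b: Exp): Exp = (a, b) match {
    case (Const(x), Const(y)) => Const(x + y) // constant folding
    case (Const(0.0), e)      => e            // 0 + e  ==>  e
    case (e, Const(0.0))      => e            // e + 0  ==>  e
    case _                    => Plus(a, b)
  }

  def times(a: Exp, b: Exp): Exp = (a, b) match {
    case (Const(x), Const(y)) => Const(x * y) // constant folding
    case (Const(1.0), e)      => e            // 1 * e  ==>  e
    case (e, Const(1.0))      => e            // e * 1  ==>  e
    case _                    => Times(a, b)
  }
}
```

With these constructors, `times(Const(1.0), plus(Sym("x"), Const(0.0)))` collapses to `Sym("x")` without a separate optimization pass.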

slide-25
SLIDE 25

In the Paper:

  • Removing data structure abstraction
  • Partial evaluation/symbolic execution of staged IR
  • Effect abstractions
  • Extending the framework/modularity

slide-26
SLIDE 26

Case Study: OptiML

A DSL For Machine Learning

slide-27
SLIDE 27

OptiML: A DSL For Machine Learning

  • Provides a familiar (MATLAB-like) language and API for writing ML applications
      • Ex. val c = a * b (a, b are Matrix[Double])
  • Implicitly parallel data structures
      • General data types: Vector[T], Matrix[T], Graph[V,E]
          • Independent from the underlying implementation
      • Specialized data types: Stream, TrainingSet, TestSet, IndexVector, Image, Video ..
          • Encode semantic information & structured, synchronized communication
  • Implicitly parallel control structures
      • sum{…}, (0::end) {…}, gradient { … }, untilconverged { … }
      • Allow anonymous functions with restricted semantics to be passed as arguments of the control structures

slide-28
SLIDE 28

Putting it all together: SPADE

Downsample: L1 distances between all 10^6 events in 13D space… reduce to 50,000 events

    val distances = Stream[Double](data.numRows, data.numRows){ (i,j) => dist(data(i),data(j)) }
    for (row <- distances.rows) {
      if(densities(row.index) == 0) {
        val neighbors = row find { _ < apprxWidth }
        densities(neighbors) = row count { _ < kernelWidth }
      }
    }

slide-29
SLIDE 29

SPADE transformations

    val distances = Stream[Double](data.numRows, data.numRows){ (i,j) => dist(data(i),data(j)) }
    for (row <- distances.rows) {
      row.init // expensive! part of the stream foreach operation
      if(densities(row.index) == 0) {
        val neighbors = row find { _ < apprxWidth }
        densities(neighbors) = row count { _ < kernelWidth }
      }
    }

row is 235,000 elements in one typical dataset, so fusing is a big win!
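What fusing buys here can be sketched in plain Scala: the row of distances is never materialized, and `find` and `count` share a single pass. This is a hand-written illustration under assumptions (dense `Array[Array[Double]]` data, `find` returning matching indices); `SpadeRowSketch` and `findAndCount` are hypothetical names, not OptiML code.

```scala
object SpadeRowSketch {
  // L1 (Manhattan) distance between two points of equal dimension.
  def dist(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var k = 0
    while (k < a.length) { s += math.abs(a(k) - b(k)); k += 1 }
    s
  }

  // One pass over row i: compute each distance on demand and test both predicates,
  // without ever storing the row itself.
  def findAndCount(data: Array[Array[Double]], i: Int,
                   apprxWidth: Double, kernelWidth: Double): (Vector[Int], Int) = {
    val neighbors = Vector.newBuilder[Int]
    var count = 0
    var j = 0
    while (j < data.length) {
      val d = dist(data(i), data(j))
      if (d < apprxWidth) neighbors += j // "find": collect matching indices
      if (d < kernelWidth) count += 1    // "count": tally matches
      j += 1
    }
    (neighbors.result(), count)
  }
}
```

For three 2D points (0,0), (1,0) and (3,3), row 0 has distances 0, 1 and 6, so with `apprxWidth = 0.5` and `kernelWidth = 2.0` the result is `(Vector(0), 2)`.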

slide-30
SLIDE 30

SPADE generated code

    // FOR EACH ELEMENT IN ROW
    while (x155 < x61) {
      val x168 = x155 * x64
      var x180 = 0
      // INITIALIZE STREAM VALUE (dist(i,j))
      while (x180 < x64) {
        val x248 = x164 + x180
        // …
      }
      // VECTOR FIND
      if (x245) x201.insert(x201.length, x155)
      // VECTOR COUNT
      if (x246) {
        val x207 = x208 + 1
        x208 = x207
      }
      x155 += 1
    }

From a ~5 line algorithm description in OptiML
…to an efficient, fused, imperative version that closely resembles a hand-optimized C++ baseline!

slide-31
SLIDE 31

Impact of Op Fusion

[Figure: bar chart of normalized execution time against number of processors (1, 2, 4, 8) for three series: C++, OptiML Fusing, OptiML No Fusing.]

slide-32
SLIDE 32

Experiments on larger apps

[Figure: three bar-chart panels (TM, SPADE, LBP) of normalized execution time on 1, 2, 4 and 8 CPUs, comparing OptiML and C++.]

slide-33
SLIDE 33

Experiments on ML kernels

[Figure: six bar-chart panels (GDA, K-means, RBM, Naive Bayes, Linear Regression, SVM) of normalized execution time on 1 CPU, 2 CPU, 4 CPU, 8 CPU and CPU + GPU, comparing OptiML, Parallelized MATLAB and MATLAB + Jacket.]

slide-34
SLIDE 34

Summary

  • Performance oriented DSLs are a promising parallel programming platform
      • Capable of achieving portability, productivity, and high performance
  • Delite can simplify the task of implementing DSLs
  • OptiML outperforms MATLAB and C++ on a set of well known machine learning applications, with expressive code

slide-35
SLIDE 35

Questions?

slide-36
SLIDE 36

Programming Language Design Space

[Diagram: design space with three axes: Performance, Productivity, Generality.]

slide-37
SLIDE 37

Programming Language Design Space

[Diagram: design space with three axes: Performance, Productivity, Generality.]

slide-38
SLIDE 38

General Purpose Languages

[Diagram: design space with axes Performance, Productivity, Generality; labelled region: Performance oriented DSLs.]

slide-39
SLIDE 39

DSLs Present New Problem

We need to develop all these DSLs

Current DSL methods are unsatisfactory

slide-40
SLIDE 40

Current DSL Development Approaches

  • Stand-alone DSLs
      • Can include extensive optimizations
      • Enormous effort to develop to a sufficient degree of maturity
          • Actual Compiler/Optimizations
          • Tooling (IDE, Debuggers, …)
      • Interoperation between multiple DSLs is very difficult
  • Purely embedded DSLs ⇒ “just a library”
      • Easy to develop (can reuse full host language)
      • Easier to learn DSL
      • Can combine multiple DSLs in one program
      • Can share DSL infrastructure among several DSLs
      • Hard to optimize using domain knowledge
      • Target same architecture as host language

Need to do better

slide-41
SLIDE 41

  • DSLs: trade off generality for productivity and performance
  • DSL embedding:
      • Combine benefits of pure embedding with analyzability of external DSLs