Language Virtualization for Heterogeneous Parallel Computing - PowerPoint PPT Presentation



SLIDE 1

Language Virtualization for Heterogeneous Parallel Computing

Hassan Chafi, Arvind Sujeeth, Zach DeVito, Pat Hanrahan, Kunle Olukotun Stanford University Adriaan Moors, Tiark Rompf, Martin Odersky EPFL

SLIDE 2

Era of Power-Limited Computing

 Mobile
  Battery operated
  Passively cooled
 Data center
  Energy costs
  Infrastructure costs

SLIDE 3

Computing System Power

Power = (Ops / second) × (Energy / Op)
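The identity Power = (Ops/second) × (Energy/Op) can be sanity-checked with illustrative numbers (mine, not from the slides): a chip retiring 10^9 ops per second at 1 nJ per op dissipates 1 W.

```scala
// Worked instance of Power = (Ops/second) × (Energy/Op).
// The numbers below are illustrative only, not from the slides.
object PowerCheck {
  def power(opsPerSecond: Double, joulesPerOp: Double): Double =
    opsPerSecond * joulesPerOp   // watts

  def main(args: Array[String]): Unit =
    println(power(1e9, 1e-9))    // 1 GOPS at 1 nJ/op -> 1.0 W
}
```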

SLIDE 4

Heterogeneous Hardware

Heterogeneous HW for energy efficiency

Multi-core, ILP, threads, data-parallel engines, custom engines

H.264 encode study

[Chart: performance and energy savings (log scale, 1x to 1000x) for 4 cores + ILP + SIMD + custom instructions vs. ASIC]

Source: Understanding Sources of Inefficiency in General-Purpose Chips (ISCA’10)

Future performance gains will mainly come from heterogeneous hardware with different specialized resources

SLIDE 5

D.E. Shaw Research: Anton

  • D. E. Shaw et al. SC 2009, Best Paper and Gordon Bell Prize

Molecular dynamics computer, 100 times more power efficient

SLIDE 6

Apple A4 in the i{Pad|Phone}

Contains CPU and GPU and …

SLIDE 7

Heterogeneous Parallel Computing

Uniprocessor (e.g. Intel Pentium 4): sequential programming, C
CMP / multicore (e.g. Sun T2): threads and locks, C + (Pthreads, OpenMP)
GPU (e.g. Nvidia Fermi): data-parallel programming, C + (Pthreads, OpenMP) + (CUDA, OpenCL)
Cluster (e.g. Cray Jaguar): message passing, C + (Pthreads, OpenMP) + (CUDA, OpenCL) + MPI

Too many different programming models

SLIDE 8

It’s All About Energy (Ultimately: Money)

 Human effort is a resource, just like electrical power
 Aim: reduce development effort, increase performance
 Increasing performance now means:
  reduce energy per op
  increase # of targets
 Need to reduce effort per target!

SLIDE 9
SLIDE 10
SLIDE 11

A Solution for Pervasive Parallelism

 Domain Specific Languages (DSLs)

Programming language with restricted expressiveness for a particular domain

SLIDE 12

The Holy Grail of Performance-Oriented Languages

[Diagram: triangle with vertices Performance, Productivity, Completeness]

SLIDE 13

The Holy Grail of Performance-Oriented Languages

[Diagram: the same Performance/Productivity/Completeness triangle, with DSLs as the target]

SLIDE 14

Benefits of Using DSLs for Parallelism

Productivity
  • Shield average programmers from the difficulty of parallel programming
  • Focus on developing algorithms and applications, not on low-level implementation details

Performance
  • Match generic parallel execution patterns to high-level domain abstractions
  • Restrict expressiveness to more easily and fully extract available parallelism
  • Use domain knowledge for static/dynamic optimizations

Portability and forward scalability
  • DSL & runtime can be evolved to take advantage of the latest hardware features
  • Applications remain unchanged
  • Allows HW vendors to innovate without worrying about application portability

SLIDE 15

New Problem

We need to develop all these DSLs. Current DSL methods are unsatisfactory.

SLIDE 16

Current DSL Development Approaches

 Stand-alone DSLs
  Can include extensive optimizations
  Enormous effort to develop to a sufficient degree of maturity
   Actual compiler/optimizations
   Tooling (IDEs, debuggers, …)
  Interoperation between multiple DSLs is very difficult

 Purely embedded DSLs ⇒ “just a library”
  Easy to develop (can reuse full host language)
  Easier to learn the DSL
  Can combine multiple DSLs in one program
  Can share DSL infrastructure among several DSLs
  Hard to optimize using domain knowledge
  Target the same architecture as the host language

Need to do better

SLIDE 17

Need to Do Better

 Goal: Develop embedded DSLs that perform as well as stand-alone ones
 Intuition: General-purpose languages should be designed with DSL embedding in mind
 Can we make this intuition more tangible?

SLIDE 18

Virtualization Analogy

Want to have a range of differently configured machines
  • Not practical to run as many physical machines
  • Hardware virtualization: run the logical machines on virtualizable physical hardware

Want to have a range of different languages
  • Not practical to implement as many compilers
  • Language virtualization: embed the logical languages into a virtualizable host language

SLIDE 19

Language Virtualization Requirements

Expressiveness
  • Encompasses syntax, semantics, and general ease of use for domain experts

Performance
  • Embedded language must be amenable to extensive static and dynamic analysis, optimization, and code generation

Safety
  • Preserve type safety of embedded language
  • No loosened guarantees about program behavior

Modest effort
  • Virtualization is only useful if it reduces the effort to embed a high-performance DSL

SLIDE 20

Achieving Virtualization: Expressiveness

 OOP allows higher levels of abstraction
  Add your own types and define operations on them
  But what about custom types interacting with language features?
 Overload all relevant embedding-language constructs

for (x <- elems if x % 2 == 0) p(x)

maps to

elems.withFilter(x => x % 2 == 0).foreach(x => p(x))

 DSL developers can control how loops over a domain collection are represented and executed by implementing withFilter and foreach for their DSL type
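This desugaring can be demonstrated in plain Scala. DomainVector below is a hypothetical DSL collection (the name is mine, not from the slides); because it defines withFilter and foreach, the for-comprehension compiles directly against it.

```scala
// A user-defined collection that controls how for-comprehensions over it
// execute, by supplying withFilter and foreach.
class DomainVector(val elems: List[Int]) {
  // A real DSL could record the predicate in an IR node instead of
  // filtering eagerly; here we simply filter.
  def withFilter(p: Int => Boolean): DomainVector =
    new DomainVector(elems.filter(p))
  def foreach(f: Int => Unit): Unit = elems.foreach(f)
}

object ForDesugaring {
  def evens(v: DomainVector): List[Int] = {
    var out = List.empty[Int]
    // The compiler rewrites this loop into
    //   v.withFilter(x => x % 2 == 0).foreach(x => out ::= x)
    for (x <- v if x % 2 == 0) out ::= x
    out.reverse
  }

  def main(args: Array[String]): Unit =
    println(evens(new DomainVector(List(1, 2, 3, 4))))   // List(2, 4)
}
```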

SLIDE 21

Achieving Virtualization: Expressiveness

 For full virtualization, apply similar techniques to all other relevant constructs of the embedding language. For example,

if (cond) something else somethingElse

maps to

__ifThenElse(cond, something, somethingElse)

 DSL developers can control the meaning of conditionals by providing overloaded variants specialized to DSL types
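Stock Scala does not rewrite `if` into `__ifThenElse`; that rewriting required the modified (Scala-Virtualized) compiler. The sketch below therefore calls `__ifThenElse` explicitly, just to show what an overloaded, staged conditional can do: build an IR node instead of branching. The Exp/Const/IfThenElse names mirror the IR style of the later slides; the exact signature is my assumption.

```scala
// Sketch of an overloaded __ifThenElse specialized to DSL types.
object VirtualizedIf {
  abstract class Exp[T]
  case class Const[T](x: T) extends Exp[T]
  case class IfThenElse[T](c: Exp[Boolean], t: Exp[T], e: Exp[T]) extends Exp[T]

  // A staged conditional: instead of branching now, build an IR node
  // that a DSL back end can analyze, optimize, and generate code for.
  def __ifThenElse[T](cond: Exp[Boolean], thenp: => Exp[T], elsep: => Exp[T]): Exp[T] =
    IfThenElse(cond, thenp, elsep)

  def main(args: Array[String]): Unit =
    println(__ifThenElse(Const(true), Const(1), Const(2)))
    // IfThenElse(Const(true),Const(1),Const(2))
}
```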

SLIDE 22

Outline

 Introduction
 Using DSLs for parallel programming
 Language Virtualization
 Enhancing the power of DSL embedding languages
  Polymorphic Embedding and Modular Staging
 Enhancing the power of embedded DSLs
 Example DSLs
  OptiML – targets machine learning applications
  Liszt – targets scientific computing simulations
 Conclusion

SLIDE 23

Lightweight Modular Staging Approach

Typical compiler pipeline: Lexer → Parser → Type checker → Analysis → Optimization → Code gen

 A stand-alone DSL implements everything
 A purely embedded DSL gets it all for free, but can’t change any of it
 Modular staging provides a hybrid approach: DSLs adopt the front end of a highly expressive embedding language but can customize the IR and participate in back-end phases

GPCE’10: Lightweight modular staging: a pragmatic approach to runtime code generation and compiled DSLs

SLIDE 24

Linear Algebra Example

trait TestMatrix {
  def example(a: Matrix, b: Matrix, c: Matrix, d: Matrix) = {
    val x = a*b + a*c
    val y = a*c + a*d
    println(x+y)
  }
}

a*b + a*c + a*c + a*d = a * (b + c + c + d)

SLIDE 25

Abstract Matrix Usage

trait TestMatrix { this: MatrixArith =>
  def example(a: Rep[Matrix], b: Rep[Matrix], c: Rep[Matrix], d: Rep[Matrix]) = {
    val x = a*b + a*c
    val y = a*c + a*d
    println(x+y)
  }
}

 Rep[Matrix]: abstract type constructor ⇒ a range of possible implementations of Matrix
 Operations on Rep[Matrix] are defined in the MatrixArith trait, pulled in by the self-type annotation this: MatrixArith =>

SLIDE 26

Lifting Matrix to an Abstract Representation

 DSL interface building blocks are structured as traits
 Expressions of type Rep[T] represent expressions of type T
  Can plug in different representations
 Need to be able to convert (lift) Matrix to the abstract representation
 Need to define an interface for our DSL type
 Now we can plug in different implementations and representations for the DSL

trait MatrixArith {
  type Rep[T]
  implicit def liftMatrixToRep(x: Matrix): Rep[Matrix]
  def infix_+(x: Rep[Matrix], y: Rep[Matrix]): Rep[Matrix]
  def infix_*(x: Rep[Matrix], y: Rep[Matrix]): Rep[Matrix]
}

SLIDE 27

Now Can Build an IR

 Start with a common IR structure to be shared among DSLs
 Generic optimizations (e.g. common subexpression and dead code elimination) are handled once and for all

trait Expressions {
  // constants/symbols (atomic)
  abstract class Exp[T]
  case class Const[T](x: T) extends Exp[T]
  case class Sym[T](n: Int) extends Exp[T]

  // operations (composite, defined in subtraits)
  abstract class Op[T]

  // additional members for managing encountered definitions
  def findOrCreateDefinition[T](op: Op[T]): Sym[T]
  implicit def toExp[T](d: Op[T]): Exp[T] = findOrCreateDefinition(d)
}

SLIDE 28

Customize IR with Domain Info

 Choose Exp as the representation for the DSL types
 Define a lifting function to create expressions
 Extend the generic IR with domain-specific node types
 DSL methods build the IR as the program runs

trait MatrixArithRepExp extends MatrixArith with Expressions {
  type Rep[T] = Exp[T]
  implicit def liftMatrixToRep(x: Matrix) = Const(x)

  case class Plus(x: Exp[Matrix], y: Exp[Matrix]) extends Op[Matrix]
  case class Times(x: Exp[Matrix], y: Exp[Matrix]) extends Op[Matrix]

  def infix_+(x: Exp[Matrix], y: Exp[Matrix]) = Plus(x, y)
  def infix_*(x: Exp[Matrix], y: Exp[Matrix]) = Times(x, y)
}

SLIDE 29

DSL Optimization

 Use domain-specific knowledge to make optimizations in a modular fashion
  Override IR node creation
  Construct optimized IR nodes if possible
  Construct the default otherwise
 Rewrite rules are a simple yet powerful optimization mechanism
 Access to the full domain-specific IR allows much more complex optimizations

trait MatrixArithRepExpOpt extends MatrixArithRepExp {
  override def infix_+(x: Exp[Matrix], y: Exp[Matrix]) = (x, y) match {
    case (Times(a, b), Times(c, d)) if (a == c) => infix_*(a, infix_+(b, d))
    case _ => super.infix_+(x, y)
  }
}
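The traits on the last few slides can be condensed into one runnable sketch. It is simplified relative to the slides (my simplifications, not theirs): Matrix is a stub value type, and the Plus/Times nodes extend Exp directly, dropping the Sym/findOrCreateDefinition machinery, so the rewrite's pattern match works without a definition lookup.

```scala
// Condensed, runnable version of the staged linear-algebra DSL.
object StagingSketch {
  case class Matrix(name: String)   // stub DSL value type

  trait MatrixArith {
    type Rep[T]
    implicit def liftMatrixToRep(x: Matrix): Rep[Matrix]
    def infix_+(x: Rep[Matrix], y: Rep[Matrix]): Rep[Matrix]
    def infix_*(x: Rep[Matrix], y: Rep[Matrix]): Rep[Matrix]
  }

  trait MatrixArithRepExp extends MatrixArith {
    abstract class Exp[T]
    case class Const[T](x: T) extends Exp[T]
    case class Plus(x: Exp[Matrix], y: Exp[Matrix]) extends Exp[Matrix]
    case class Times(x: Exp[Matrix], y: Exp[Matrix]) extends Exp[Matrix]

    type Rep[T] = Exp[T]
    implicit def liftMatrixToRep(x: Matrix): Rep[Matrix] = Const(x)
    def infix_+(x: Rep[Matrix], y: Rep[Matrix]): Rep[Matrix] = Plus(x, y)
    def infix_*(x: Rep[Matrix], y: Rep[Matrix]): Rep[Matrix] = Times(x, y)
  }

  // Domain-specific rewrite: a*b + a*c  ==>  a*(b + c)
  trait MatrixArithRepExpOpt extends MatrixArithRepExp {
    override def infix_+(x: Rep[Matrix], y: Rep[Matrix]): Rep[Matrix] =
      (x, y) match {
        case (Times(a, b), Times(c, d)) if a == c => infix_*(a, infix_+(b, d))
        case _ => super.infix_+(x, y)
      }
  }

  def main(args: Array[String]): Unit = {
    val dsl = new MatrixArithRepExpOpt {}
    import dsl._
    val (a, b, c) = (liftMatrixToRep(Matrix("a")),
                     liftMatrixToRep(Matrix("b")),
                     liftMatrixToRep(Matrix("c")))
    // Running the "program" builds the IR; the override fires as nodes
    // are constructed, factoring out the common multiplicand.
    println(infix_+(infix_*(a, b), infix_*(a, c)))
    // Times(Const(Matrix(a)),Plus(Const(Matrix(b)),Const(Matrix(c))))
  }
}
```

Evaluating the expression prints the already-factored IR, showing that the rewrite fired during IR construction rather than in a separate pass.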

SLIDE 30

Outline

 Introduction
 Using DSLs for parallel programming
 Language Virtualization
 Enhancing the power of DSL embedding languages
  Polymorphic Embedding and Modular Staging
 Enhancing the power of embedded DSLs
 Example DSLs
  OptiML – targets machine learning applications
  Liszt – targets scientific computing simulations
 Conclusion

SLIDE 31

OptiML: A DSL for Machine Learning

 Learning patterns from data
  Regression
  Classification (e.g. SVMs)
  Clustering (e.g. K-Means)
  Density estimation (e.g. Expectation Maximization)
  Inference (e.g. Loopy Belief Propagation)
  Adaptive (e.g. Reinforcement Learning)

SLIDE 32

Why Machine Learning

 A good domain for studying parallelism
 Many applications and datasets are time-bound in practice
 A combination of regular and irregular parallelism at varying granularities
 At the core of many emerging applications (speech recognition, robotic control, data mining, etc.)

SLIDE 33

OptiML Language Features

 Implicitly parallel data structures
  General linear algebra data types: Vector[T], Matrix[T]
   Independent from the underlying implementation
  Special data types: TrainingSet, TestSet, IndexVector, Image, Video, …
   Encode semantic information
 Implicitly parallel control structures
  sum {…}, (0::end) {…}, gradient {…}, untilconverged {…}
   Encode restricted semantics within the passed-in code block
 Domain-specific optimizations
  Trade off a small amount of accuracy for a large amount of performance
   Relaxed dependencies
   Best-effort computing
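As an illustration of such a control structure, here is a plain-Scala stand-in for untilconverged { … } (the name is from the slide, but the signature and tolerance handling are my assumptions; OptiML's real version stages the block into its IR and can relax dependencies).

```scala
// Sequential stand-in for OptiML's untilconverged control structure:
// iterate a step function until successive values are within tol.
object UntilConverged {
  def untilconverged(x0: Double, tol: Double)(step: Double => Double): Double = {
    var x = x0
    var next = step(x)
    while (math.abs(next - x) > tol) {
      x = next
      next = step(x)
    }
    next
  }

  def main(args: Array[String]): Unit = {
    // Fixed point of cos via iteration: the solution of x = cos(x).
    val r = untilconverged(1.0, 1e-9)(math.cos)
    println(r)
  }
}
```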

SLIDE 34

OptiML Code Example

 Gaussian Discriminant Analysis

// x : TrainingSet[Double]
// mu0, mu1 : Vector[Double]
val sigma = sum(0, x.numSamples) {
  if (x.labels(_) == false)
    (x(_) - mu0).trans.outer(x(_) - mu0)
  else
    (x(_) - mu1).trans.outer(x(_) - mu1)
}

Highlights: ML-specific data types, parallel control structures, restricted index semantics
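OptiML's sum(begin, end) { i => … } construct can be mimicked in plain Scala. The version below is a sequential stand-in of mine; OptiML actually stages the block into a parallel IR node and exploits its restricted index semantics.

```scala
// Sequential sketch of OptiML's sum(begin, end) { i => ... } construct.
object SumSketch {
  def sum(begin: Int, end: Int)(block: Int => Double): Double =
    (begin until end).map(block).sum   // stand-in for the parallel reduction

  def main(args: Array[String]): Unit =
    println(sum(0, 4)(i => i * 2.0))   // 0 + 2 + 4 + 6 = 12.0
}
```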

SLIDE 35

Performance Study (CPU)

Speedups (1 / 2 / 4 / 8 CPUs), OptiML on Delite vs. explicitly parallelized MATLAB, as annotated on the charts:

  K-means:     OptiML 1.0 / 1.8 / 3.6 / 6.3   MATLAB 1.1 / 1.2 / 1.2 / 1.2
  SVM:         OptiML 1.0 / 3.1 / 4.4 / 5.5   MATLAB 0.7 / 1.6 / 2.1 / 2.3
  LBP:         OptiML 1.0 / 1.9 / 3.4 / 5.2   MATLAB 0.1 / 0.1 / 0.1 / 0.1
  RBM:         OptiML 1.0 / 1.9 / 3.1 / 3.0   MATLAB 1.0 / 1.9 / 3.4 / 4.7
  GDA:         OptiML 1.0 / 1.7 / 1.8 / 1.9   MATLAB 0.5 / 1.0 / 1.4 / 1.6
  Naive Bayes: OptiML 1.0 / 2.0 / 3.4 / 4.6   MATLAB 0.6 / 0.8 / 1.0 / 1.1

SLIDE 36

Performance Study (GPU)

[Chart: normalized GPU speedup (log scale, 0.5x to 32x) for GDA, RBM, SVM, K-means, Naive Bayes, and LBP]

SLIDE 37

Domain-Specific Optimizations

Best-effort computation (K-means): 1.0x baseline; 1.8x at 1.2% error; 4.9x at 4.2% error; 12.7x at 7.4% error

Relaxed dependencies (SVM): 1.0x baseline; 1.8x with relaxed dependencies (+1% error)

SLIDE 38

Outline

 Introduction
 Using DSLs for parallel programming
 Language Virtualization
 Enhancing the power of DSL embedding languages
  Polymorphic Embedding and Modular Staging
 Enhancing the power of embedded DSLs
 Example DSLs
  OptiML – targets machine learning applications
  Liszt – targets scientific computing simulations
 Conclusion

SLIDE 39

Liszt: A DSL for PDEs

[Image: engine simulation with regions labeled fuel injection, transition, thermal, turbulence, combustion]

 Mesh-based
 Numeric simulation
 Huge domains
  millions of cells
 Example: unstructured Reynolds-averaged Navier-Stokes (RANS) solver

SLIDE 40

Liszt Language Features

 Built-in mesh interface for arbitrary polyhedra
  Vertex, Edge, Face, Cell
 Collections of mesh elements
  Element sets: faces(c: Cell), edgesCCW(f: Face)
 Mesh-based data storage
  Fields: val vert_position = position(v)
 Parallelizable iteration
  forall statements: for (f <- faces(cell)) { … }

SLIDE 41

Liszt Code Example

Highlights: simple set comprehension; functions and function calls; mesh topology operators; field data storage

for (edge <- edges(mesh)) {
  val flux = flux_calc(edge)
  val v0 = head(edge)
  val v1 = tail(edge)
  Flux(v0) += flux
  Flux(v1) -= flux
}

Code contains possible write conflicts! We use architecture-specific strategies guided by domain knowledge:
  • MPI: ghost cell-based message passing
  • GPU: coloring-based use of shared memory
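The accumulation pattern (and its write conflicts) can be reproduced in plain Scala on a toy mesh. Edge, fluxCalc, and the array-backed field below are stand-ins of mine for Liszt's built-in mesh, flux_calc, and Field types.

```scala
// Toy version of the Liszt flux loop on a 1-D mesh of numbered vertices.
object FluxSketch {
  case class Edge(head: Int, tail: Int)

  def accumulate(nVerts: Int, edges: List[Edge],
                 fluxCalc: Edge => Double): Array[Double] = {
    val flux = Array.fill(nVerts)(0.0)     // stand-in for a Liszt Field
    for (e <- edges) {                     // Liszt would parallelize this loop
      val f = fluxCalc(e)
      flux(e.head) += f                    // += / -= on shared per-vertex data
      flux(e.tail) -= f                    // are the potential write conflicts
    }
    flux
  }

  def main(args: Array[String]): Unit =
    println(accumulate(3, List(Edge(0, 1), Edge(1, 2)), _ => 1.0).toList)
    // List(1.0, 0.0, -1.0)
}
```

With a constant unit flux, the interior vertex's contributions cancel, which is why a parallel schedule must order or privatize these updates.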
SLIDE 42

MPI Performance

 Using 8 cores per node, scaling up to 96 cores (12 nodes, 8 cores per node, all communication using MPI)

[Charts: speedup over scalar and wall-clock runtime (log scale, seconds) vs. number of MPI nodes (20 to 120) on the 750k-cell mesh; series: linear scaling, Liszt, and Joe]

SLIDE 43

GPU Performance

 Scaling mesh size from 50k (unit-sized) cells to 750k (16x) on a Tesla C2050. Comparison is against single-threaded runtime on the host CPU (Core 2 Quad 2.66GHz)

 Single-precision: 31.5x; double-precision: 28x

[Chart: GPU speedup over single-core vs. problem size, single- and double-precision]

SLIDE 44

Conclusions

 DSLs can be an answer to the heterogeneous parallel programming problem
 Need embedding languages to be more virtualizable
  First steps in virtualizing Scala
 Lightweight modular staging allows for more powerful embedded DSLs
 Early embedded DSL results are promising
 No unicorns were harmed during production