SLIDE 1

DATA PARALLELISM IN HASKELL

Manuel M. T. Chakravarty

University of New South Wales

INCLUDES JOINT WORK WITH
Gabriele Keller, Sean Lee, Roman Leshchinskiy, Simon Peyton Jones

Thursday, 11 June 2009

SLIDE 2

My three main points

1. Parallel programming and functional programming are intimately connected
2. Data parallelism is cheaper than control parallelism
3. Two approaches to data parallelism in Haskell

SLIDE 3

Parallel Functional

What is hard about parallel programming?
Why is it easier in a functional language?

SLIDE 4

What is Hard About Parallelism?

SLIDE 5

What is Hard About Parallelism?

Indeterminate execution order!

Other difficulties are arguably a consequence (race conditions, mutual exclusion, and so on).
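
To make the difficulty concrete, here is a minimal Haskell sketch (my example, not from the slides): two threads perform a non-atomic read-modify-write on a shared IORef, so the final value depends on how the threads interleave.

import Control.Concurrent
import Data.IORef

main :: IO ()
main = do
  ref  <- newIORef (0 :: Int)
  done <- newEmptyMVar
  let bump = do n <- readIORef ref        -- read ...
                writeIORef ref (n + 1)    -- ... then write: not atomic
                putMVar done ()
  _ <- forkIO bump
  _ <- forkIO bump
  takeMVar done
  takeMVar done
  readIORef ref >>= print                 -- may print 1 or 2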

SLIDE 6

Why Use a Functional Language?

SLIDE 7

Why Use a Functional Language?

De-emphasises attention to execution order

  • Purity and persistence
  • Focus on data dependencies

Encourages the use of collective operations

  • Wholemeal programming is better for you!

SLIDE 8

Why Use a Functional Language?

De-emphasises attention to execution order

  • Purity and persistence
  • Focus on data dependencies

Encourages the use of collective operations

  • Wholemeal programming is better for parallelism!
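
As a small sketch of the wholemeal style (my example, not from the slides): a dot product written with collective operations states only the data dependencies, leaving the traversal order, and hence the parallelisation strategy, to the implementation.

-- Wholemeal style: collective operations instead of an indexed loop;
-- no execution order is prescribed between the multiplications.
dotp :: [Float] -> [Float] -> Float
dotp xs ys = sum (zipWith (*) xs ys)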

SLIDE 9

Haskell?

SLIDE 10

Haskell?

Laziness prevented bad habits: Haskell programmers are not spoiled by the luxury of a predictable execution order, a luxury that we can no longer afford in the presence of parallelism.

Haskell programming culture and implementations avoid relying on a specific execution order.

SLIDE 11

Haskell?

Laziness prevented bad habits: Haskell programmers are not spoiled by the luxury of a predictable execution order, a luxury that we can no longer afford in the presence of parallelism.

Haskell programming culture and implementations avoid relying on a specific execution order.

Haskell is ready for parallelism!

SLIDE 12

Why should we care about data parallelism?

SLIDE 13

Data parallelism is successful in the large

  • On server farms: CGI rendering, MapReduce, ...
  • Fortran and OpenMP for high-performance computing

SLIDE 14

Data parallelism is successful in the large

  • On server farms: CGI rendering, MapReduce, ...
  • Fortran and OpenMP for high-performance computing

Data parallelism becomes increasingly important in the small!

SLIDE 15

OUR DATA PARALLEL FUTURE

Two competing extremes in current processor design

[Image: quad-core Xeon CPU beside a Tesla T10 GPU; courtesy of NVIDIA]

SLIDE 16

OUR DATA PARALLEL FUTURE

Two competing extremes in current processor design

[Image: quad-core Xeon CPU beside a Tesla T10 GPU; courtesy of NVIDIA]

Why?

SLIDE 17

Reduce power consumption!

  • GPU achieves 20x better performance/Watt (judging by peak performance)
  • Speedups between 20x and 150x have been observed in real applications

SLIDE 18

We need data parallelism

  • GPU-like architectures require data parallelism
  • 4-core CPUs versus 240-core GPUs are the current extremes
  • Intel Larrabee (in 2010): 32 cores x 16 vector units
  • Increasing core counts in CPUs and GPUs

SLIDE 19

We need data parallelism

  • GPU-like architectures require data parallelism
  • 4-core CPUs versus 240-core GPUs are the current extremes
  • Intel Larrabee (in 2010): 32 cores x 16 vector units
  • Increasing core counts in CPUs and GPUs

Data parallelism is good news for functional programming!

SLIDE 20

Data parallelism and functional programming

CUDA Kernel Invocation

seq_kernel<<<N, M>>>(arg1, ..., argn);

SLIDE 21

Data parallelism and functional programming

CUDA Kernel Invocation

seq_kernel<<<N, M>>>(arg1, ..., argn);

FORTRAN 95

FORALL (i = 1:n)
  A(i,i) = pure_function(b, i)
END FORALL

SLIDE 22

Data parallelism and functional programming

Parallel map is essential; reductions are common
Parallel code must be pure

CUDA Kernel Invocation

seq_kernel<<<N, M>>>(arg1, ..., argn);

FORTRAN 95

FORALL (i = 1:n)
  A(i,i) = pure_function(b, i)
END FORALL
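
Both kernels are instances of the same functional idiom: map a pure function over an index space. As a sketch of the correspondence (my rendering, not from the slides), the FORALL loop reads in Haskell as:

-- The FORALL idiom as a pure map over the index space [1..n];
-- the purity of f is what makes evaluating the applications
-- in parallel legal.
forallDiag :: Int -> (Int -> Float) -> [Float]
forallDiag n f = map f [1 .. n]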

SLIDE 23

TWO APPROACHES TO DATA PARALLEL PROGRAMMING IN HASKELL

SLIDE 24

Two forms of data parallelism

flat, regular versus nested, irregular

SLIDE 25

Two forms of data parallelism

flat, regular versus nested, irregular

SLIDE 26

Two forms of data parallelism

flat, regular: limited expressiveness
nested, irregular: covers sparse structures and even divide & conquer

SLIDE 27

Two forms of data parallelism

flat, regular: limited expressiveness; close to the hardware model
nested, irregular: covers sparse structures and even divide & conquer; needs to be turned into flat parallelism for execution

SLIDE 28

Two forms of data parallelism

flat, regular: limited expressiveness; close to the hardware model; well-understood compilation techniques
nested, irregular: covers sparse structures and even divide & conquer; needs to be turned into flat parallelism for execution; highly experimental program transformations

SLIDE 29

Flat data parallelism in Haskell

  • Embedded language of array computations (two-level language)
  • Datatype of multi-dimensional arrays [Gabi's talk]
  • Array elements limited to tuples of scalars (Int, Float, Bool, etc.)
  • Collective array operations: map, fold, scan, zip, permute, etc.

SLIDE 30

Scalar Alpha X Plus Y (SAXPY)

type Vector = Array DIM1 Float

saxpy :: GPU.Exp Float -> Vector -> Vector -> Vector
saxpy alpha xs ys = GPU.run $ do
  xs' <- use xs
  ys' <- use ys
  GPU.zipWith (\x y -> alpha*x + y) xs' ys'

SLIDE 31

Scalar Alpha X Plus Y (SAXPY)

type Vector = Array DIM1 Float

saxpy :: GPU.Exp Float -> Vector -> Vector -> Vector
saxpy alpha xs ys = GPU.run $ do
  xs' <- use xs
  ys' <- use ys
  GPU.zipWith (\x y -> alpha*x + y) xs' ys'

  • GPU.Exp e: expression evaluated on the GPU
  • Monadic code to make sharing explicit
  • GPU.run: compile & execute embedded code
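
A call would look roughly like the sketch below. Note that fromList is a hypothetical constructor for Vector (the slides do not show how host arrays are built), and the literal 2.5 assumes a Num instance for GPU.Exp Float, which the use of alpha*x above suggests.

-- Hypothetical driver code: fromList is assumed, not shown in the talk.
result :: Vector
result = saxpy 2.5 (fromList [1 .. 1024]) (fromList [2, 4 .. 2048])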

SLIDE 32

Limitations of the embedded language

  • First-order, except for a fixed set of higher-order collective operations
  • No recursion
  • No nesting: code is not compositional
  • No arrays of structured data

SLIDE 33

Prototype implementation targeting GPUs

Runtime code generation (computation only)

[Plot: SAXPY running time in milliseconds (log scale) against number of elements in millions, for: Plain Haskell, CPU only (AMD Sempron); Plain Haskell, CPU only (Intel Xeon); Haskell with GPU.gen (GeForce 8800GTS); Haskell with GPU.gen (Tesla S1070 x1)]

SLIDE 34

Prototype implementation targeting GPUs

Runtime code generation (computation only)

[Plot: sparse matrix vector multiplication running time in milliseconds (log scale) against number of non-zero elements in millions, for: Plain Haskell, CPU only (AMD Sempron); Plain Haskell, CPU only (Intel Xeon); Haskell with GPU.gen (GeForce 8800GTS); Haskell with GPU.gen (Tesla S1070 x1)]

SLIDE 35

Prototype implementation targeting GPUs

Runtime code generation (computation only)

[Plot: Black-Scholes call option pricing time in milliseconds (log scale) against number of options in millions, for: Plain Haskell, CPU only (AMD Sempron); Plain Haskell, CPU only (Intel Xeon); Haskell with GPU.gen (GeForce 8800GTS); Haskell with GPU.gen (Tesla S1070 x1); C for CUDA (Tesla S1070 x1)]

SLIDE 36

Nested data parallelism in Haskell

  • Language extension (fully integrated)
  • Data type of nested parallel arrays [:e:], where e can be any type
  • Parallel evaluation semantics
  • Array comprehensions & collective operations (mapP, scanP, etc.)
  • Forthcoming: multi-dimensional arrays [Gabi's talk]
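
The standard illustration of why nesting matters, from the data-parallel Haskell literature (not on this slide), is sparse-matrix vector multiplication, where each row of the matrix has a different number of non-zero elements:

-- A sparse matrix as a parallel array of rows, each row a parallel
-- array of (column index, value) pairs of varying length.
type SparseMatrix = [:[: (Int, Float) :]:]

smvm :: SparseMatrix -> [:Float:] -> [:Float:]
smvm m v = [: sumP [: x * (v !: i) | (i, x) <- row :] | row <- m :]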

SLIDE 37

Parallel Quicksort

qsort :: Ord a => [:a:] -> [:a:]
qsort [::] = [::]
qsort xs   =
  let p       = xs !: 0
      smaller = [: x | x <- xs, x < p :]
      equal   = [: x | x <- xs, x == p :]
      bigger  = [: x | x <- xs, x > p :]
      qs      = [: qsort xs' | xs' <- [: smaller, bigger :] :]
  in qs !: 0 +:+ equal +:+ qs !: 1

SLIDE 38

Parallel Quicksort

qsort :: Ord a => [:a:] -> [:a:]
qsort [::] = [::]
qsort xs   =
  let p       = xs !: 0
      smaller = [: x | x <- xs, x < p :]
      equal   = [: x | x <- xs, x == p :]
      bigger  = [: x | x <- xs, x > p :]
      qs      = [: qsort xs' | xs' <- [: smaller, bigger :] :]
  in qs !: 0 +:+ equal +:+ qs !: 1

  • [: e | x <- xs :]: array comprehension
  • (!:), (+:+): array indexing and append
  • Collective array operations are parallel

SLIDE 39

[Diagram: qsort call tree, one call]

SLIDE 40

[Diagram: qsort call tree after one recursion step: three calls]

SLIDE 41

[Diagram: qsort call tree after two recursion steps: seven calls]

SLIDE 42

[Diagram: qsort call tree, further expanded; the recursive calls fan out unevenly]

SLIDE 43

[Diagram: qsort call tree, further expanded; subarrays reach different recursion depths]

SLIDE 44

[Diagram: fully expanded qsort call tree]

Exploiting both inter- and intra-function parallelism!

SLIDE 45

Properties of the language extension

  • First class
  • Arrays of structured data (e.g., arrays of trees):
      data RTree a = RTree a [:RTree a:]
  • Higher-order (e.g., parallel arrays of functions)
  • Arbitrarily nested parallelism: compositional
  • Much harder to implement!
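
As a sketch of computing over such structured data (sumTree is my illustration, not from the talk): summing the labels of a rose tree of the RTree type above, where the recursion over the subtrees is itself expressed as nested parallelism.

-- Illustrative only: the comprehension over the subtrees runs in
-- parallel, and each recursive call unfolds further parallelism.
sumTree :: RTree Int -> Int
sumTree (RTree x ts) = x + sumP [: sumTree t | t <- ts :]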

SLIDE 46

Implementation

Extension of the Glasgow Haskell Compiler (GHC)

SLIDE 47

Implementation

Extension of the Glasgow Haskell Compiler (GHC)

Stage 1: The Vectoriser
  • Transforms all nested parallelism into flat parallelism

f :: a -> b

SLIDE 48

Implementation

Extension of the Glasgow Haskell Compiler (GHC)

Stage 1: The Vectoriser
  • Transforms all nested parallelism into flat parallelism

f  :: a -> b
f^ :: [:a:] -> [:b:]

SLIDE 49

Implementation

Extension of the Glasgow Haskell Compiler (GHC)

Stage 1: The Vectoriser
  • Transforms all nested parallelism into flat parallelism

f  :: a -> b
f^ :: [:a:] -> [:b:]

Stage 2: Library package DPH
  • High-performance flat array library
  • Communication and array fusion
  • Radical re-ordering of computations
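
As a hand-written illustration of the f to f^ schema above (square is my example; the vectoriser derives the lifted version automatically):

-- Source function:
square :: Float -> Float
square x = x * x

-- Its lifted counterpart: semantically mapP square, but generated as
-- a flat traversal that the DPH library can fuse with its neighbours.
squareL :: [:Float:] -> [:Float:]
squareL xs = [: x * x | x <- xs :]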

SLIDE 50

Current Implementation targeting multicore CPUs

GHC performs vectorisation transformation on Core IL

SLIDE 51

Current Implementation targeting multicore CPUs

GHC performs vectorisation transformation on Core IL

  • 2x quad-core Xeon = 8 cores (8 thread contexts)
  • 1x UltraSPARC T2 = 8 cores (64 thread contexts)

SLIDE 52

Summary

Data parallelism is getting increasingly important

Two approaches to data parallelism in Haskell:
1. Embedded array language for flat parallelism
2. Language extension of parallel arrays supporting nested parallelism

Nested parallelism is much harder to implement, but also much more expressive

Multiple backends (multicore CPUs, GPUs, ...)
