An embedded language for data-parallel programming

SLIDE 1

An embedded language for data-parallel programming

Master of Science Thesis in Computer Science
By Joel Svensson
Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
GÖTEBORGS UNIVERSITY
Göteborg, Sweden

SLIDE 2

Obsidian: an embedded language for data-parallel programming

Data-parallel programming
General-Purpose computations on the GPU (GPGPU)
Lava

NVIDIA 8800 GPU

SLIDE 3

Project Outline

An embedded language for data-parallel programming
Lava programming style using combinators
Generate C code for NVIDIA GPU

SLIDE 4

Data-parallel programming

Single sequential program
Executed by a number of processing elements
Operating on different data

for j := 1 to log(n) do
  for all k in parallel do
    if ((k+1) mod 2^j) = 0 then
      x[k] := x[k - 2^(j-1)] + x[k]
    fi
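
A minimal sequential simulation of this summation network, written as a plain Haskell sketch (the name sumNetwork and the list-based representation are mine, not from the thesis): after ceil(log2 n) steps the total ends up in the last cell.

-- Sequential simulation of the parallel summation loop above.
sumNetwork :: [Int] -> [Int]
sumNetwork xs = foldl step xs [1 .. lg]
  where
    n  = length xs
    lg = ceiling (logBase 2 (fromIntegral n) :: Double) :: Int
    -- one parallel step: every "processing element" k updates its cell
    step x j =
      [ if (k + 1) `mod` (2 ^ j) == 0
          then x !! (k - 2 ^ (j - 1)) + x !! k
          else x !! k
      | k <- [0 .. n - 1] ]

-- e.g. last (sumNetwork [1..8]) == 36, the sum of 1..8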

SLIDE 5

GPGPU

GPUs are relatively cheap
High performance (hundreds of GFLOPS)

Applications:
Physics simulation
Bioinformatics
Sorting

www.gpgpu.org

SLIDE 6

GPU vs CPU GFLOPS chart

SLIDE 7

NVIDIA 8800 GPUs

A set of SIMD multiprocessors
8 SIMD processing elements per multiprocessor
Up to 16 multiprocessors in one GPU
Giving 128 processing elements in total

www.nvidia.com

SLIDE 8

NVIDIA 8800 GPUs

SLIDE 9

NVIDIA Compute Unified Device Architecture (CUDA)

C compiler and libraries for the GPU
GPU as a highly parallel co-processor
For use with NVIDIA's 8800 series GPUs

www.nvidia.com/cuda

SLIDE 10

CUDA Programming model

High number of threads
  Divided into blocks

Thread block
  512 threads
  Divided into warps
  Executed on one multiprocessor

SLIDE 11

CUDA Synchronisation

CUDA supplies a synchronisation primitive, __syncthreads()
  Barrier synchronisation
  Across all the threads of a block

Coordinate communication

SLIDE 12

Obsidian

Embedded in Haskell
Presents a high-level programmer's interface
Parallel computations described using combinators
CUDA C code is generated

SLIDE 13

Obsidian

Describes computations on arrays:
  Length homogeneous

Sorting algorithms
  Integer values

Limitations:
  Currently limited to iterative sorting algorithms

SLIDE 14

Obsidian Programming

Basics
  Sequential composition of programs: ->-
  Parallel composition of programs: parl
  Index operations: rev, riffle, unriffle
  Array operations: halve, conc
  Apply or map: fun
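
A rough plain-list reading of these basics, as a sketch only: the real combinators work on Obsidian's Arr type inside the W monad, and the exact behaviour of parl is an assumption here (taken, Lava-style, to run the two programs on the two halves of the input).

-- Sequential composition, modelled as plain function composition.
(->-) :: (a -> b) -> (b -> c) -> (a -> c)
f ->- g = g . f

-- Apply or map: lift an element-wise function over the whole array.
fun :: (a -> b) -> [a] -> [b]
fun = map

-- Reverse the array.
rev :: [a] -> [a]
rev = reverse

-- Parallel composition (assumed semantics): run one program on the
-- first half of the input and the other on the second half.
parl :: ([a] -> [b]) -> ([a] -> [b]) -> [a] -> [b]
parl f g xs = f top ++ g bot
  where (top, bot) = splitAt (length xs `div` 2) xs

-- e.g. (rev ->- fun (+1)) [1,2,3] == [4,3,2], matching rev_incr on slide 19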

SLIDE 15

Obsidian Programming

Array Operations
  halve
  conc
  oeSplit
  shuffle

SLIDE 16

Obsidian Programming

Index Operations
  rev
  riffle
  unriffle

riffle = halve ->- shuffle

SLIDE 17

unriffle

unriffle = oeSplit ->- conc
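
The equations riffle = halve ->- shuffle and unriffle = oeSplit ->- conc can be mirrored on plain Haskell lists. This is a sketch under the assumption that conc concatenates the two halves and oeSplit separates the even- and odd-indexed elements.

-- Split an array into its two halves.
halve :: [a] -> ([a], [a])
halve xs = splitAt (length xs `div` 2) xs

-- Interleave the two halves element by element.
shuffle :: ([a], [a]) -> [a]
shuffle (xs, ys) = concat (zipWith (\x y -> [x, y]) xs ys)

-- Split into the even-indexed and odd-indexed elements.
oeSplit :: [a] -> ([a], [a])
oeSplit xs = ( [x | (i, x) <- zip [0 :: Int ..] xs, even i]
             , [x | (i, x) <- zip [0 :: Int ..] xs, odd  i] )

-- Concatenate the two halves again.
conc :: ([a], [a]) -> [a]
conc (xs, ys) = xs ++ ys

riffle, unriffle :: [a] -> [a]
riffle   = shuffle . halve      -- riffle   = halve   ->- shuffle
unriffle = conc . oeSplit       -- unriffle = oeSplit ->- conc

-- e.g. riffle [1..8] == [1,5,2,6,3,7,4,8], and unriffle (riffle [1..8]) == [1..8]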

SLIDE 18

Obsidian Programming

Apply or map: fun
Sequential composition of programs: ->-
Parallel composition of programs: parl

SLIDE 19

Obsidian Programming: an example

rev_incr :: Arr (Exp Int) -> W (Arr (Exp Int))
rev_incr = rev ->- fun (+1) ->- sync

*Obsidian> execute rev_incr [1,2,3]
[4,3,2]

SLIDE 20

Obsidian Synchronisation

Synchronisation primitive: sync
  All array elements are updated after a sync
  Only applicable at top-level

Inherits its behaviour from CUDA's __syncthreads()

SLIDE 21

Generating C Code

Generate CUDA C Code for NVIDIA GPU
  Executed as one block of threads

Pros
  Communication and synchronisation possible

Cons
  Upper limit of 512 threads per block
  Does not use entire GPU

SLIDE 22

Generating C Code

Each thread is in charge of calculating one array element
  Limits array size to 512 elements
  Leads to some redundancy

Swap operation performed by two threads in cooperation

SLIDE 23

Generating C Code

__global__ static void reverse(int *values, int n) {
    extern __shared__ int shared[];
    const int tid = threadIdx.x;
    int tmp;
    shared[tid] = values[tid];
    __syncthreads();
    tmp = shared[((n - 1) - tid)];
    __syncthreads();
    shared[tid] = tmp;
    __syncthreads();
    values[tid] = shared[tid];
}

reverse = rev ->- sync

SLIDE 24

Generating C Code

__global__ static void example(int *values, int n) {
    extern __shared__ int shared[];
    const int tid = threadIdx.x;
    int tmp;
    shared[tid] = values[tid];
    __syncthreads();
    tmp = f(shared[i1], ..., shared[in]);
    __syncthreads();
    shared[tid] = tmp;
    __syncthreads();
    values[tid] = shared[tid];
}

SLIDE 25

Generating C Code

__global__ static void example(int *values, int n) {
    extern __shared__ int shared[];
    const int tid = threadIdx.x;
    int tmp;
    shared[tid] = values[tid];
    __syncthreads();
    tmp = f(shared[i1], ..., shared[in]);
    __syncthreads();
    shared[tid] = tmp;
    __syncthreads();
    values[tid] = shared[tid];
}


SLIDE 26

Implementing a sorter

A two-sorter sorts a pair of values:

cmpSwap op (a,b) = ifThenElse (op a b) (a,b) (b,a)

Sort each pair of elements in an array:

sort2 = (pair ->- fun (cmpSwap (<*)) ->- unpair ->- sync)

*Obsidian> execute sort2 [2,3,5,1,6,7]
[2,3,1,5,6,7]
*Obsidian> execute sort2 [2,1,2,1,2,1]
[1,2,1,2,1,2]
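
A plain-list sketch of what sort2 computes, assuming pair groups neighbouring elements and unpair flattens them back; (<) stands in for the expression-level comparison <*.

-- Compare-and-swap on a pair, keeping the smaller element first.
cmpSwap :: (a -> a -> Bool) -> (a, a) -> (a, a)
cmpSwap op (a, b) = if op a b then (a, b) else (b, a)

-- Group neighbouring elements into pairs, and flatten back.
pair :: [a] -> [(a, a)]
pair (x:y:rest) = (x, y) : pair rest
pair _          = []

unpair :: [(a, a)] -> [a]
unpair = concatMap (\(x, y) -> [x, y])

-- List-level counterpart of sort2 (without sync):
sort2 :: Ord a => [a] -> [a]
sort2 = unpair . map (cmpSwap (<)) . pair

-- e.g. sort2 [2,3,5,1,6,7] == [2,3,1,5,6,7], matching the run above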

SLIDE 27

Implementing a sorter

A more efficient pairwise sort:

sortEvens = evens (cmpSwap (<*)) ->- sync

*Obsidian> execute sortEvens [2,3,5,1,6,7]
[2,3,1,5,6,7]
*Obsidian> execute sortEvens [2,1,2,1,2,1]
[1,2,1,2,1,2]

SLIDE 28

Implementing a sorter

evens

SLIDE 29

Implementing a sorter

A close relative of evens is odds:

sortOdds = odds (cmpSwap (<*)) ->- sync

*Obsidian> execute sortOdds [5,3,2,1,4,6]
[5,2,3,1,4,6]
*Obsidian> execute sortOdds [1,2,1,2,1,2]
[1,1,2,1,2,2]
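
Continuing the list-level sketch, the assumed semantics of evens and odds is to apply the two-input operation to adjacent pairs starting at an even index or at an odd index, leaving any leftover element untouched (cmpSwap as in the sketch under slide 26).

-- Apply f to the pairs starting at even indices: (x0,x1), (x2,x3), ...
evens :: ((a, a) -> (a, a)) -> [a] -> [a]
evens f (x:y:rest) = let (x', y') = f (x, y) in x' : y' : evens f rest
evens _ xs         = xs

-- Apply f to the pairs starting at odd indices: (x1,x2), (x3,x4), ...
odds :: ((a, a) -> (a, a)) -> [a] -> [a]
odds f (x:rest) = x : evens f rest
odds _ xs       = xs

-- e.g. evens (cmpSwap (<)) [2,3,5,1,6,7] == [2,3,1,5,6,7]
--      odds  (cmpSwap (<)) [5,3,2,1,4,6] == [5,2,3,1,4,6]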

SLIDE 30

Implementing a sorter

odds
SLIDE 31

Odd Even Transposition Sort

Sorter implemented using odds and evens:

sortOETCore = sortEvens ->- sortOdds
sortOET arr =
  let n = len arr
  in (repE (idiv (n+1) 2) sortOETCore) arr
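
Using the list-level evens, odds and cmpSwap from the sketches above, the whole sort can be mirrored sequentially; repE and idiv are modelled here simply by repeated application and integer division (a sketch, not the thesis code).

-- ceil(n/2) rounds of an even phase followed by an odd phase
-- are enough to sort n elements.
sortOET :: Ord a => [a] -> [a]
sortOET xs = iterate core xs !! ((n + 1) `div` 2)
  where
    n    = length xs
    core = odds (cmpSwap (<)) . evens (cmpSwap (<))

-- e.g. sortOET [5,3,2,1,4,6] == [1,2,3,4,5,6]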

SLIDE 32

Odd Even Transposition Sort

SLIDE 33

VSort

Another iterative sorter
log²(n) depth
Built around a shuffle exchange network:

shex f n = rep n (riffle ->- evens f ->- sync)

SLIDE 34

VSort

Merger implemented using shex:

bmergeIt n = shex (cmpSwap (<*)) n

*Obsidian> execute (shex (cmpSwap (<*)) 3) [2,4,6,8,7,5,3,1]
[1,2,3,4,5,6,7,8]
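
With the earlier list models of riffle, evens and cmpSwap, shex becomes n passes of a riffle followed by a compare-exchange phase; rep and sync are modelled by plain iteration. This sketch reproduces the run shown above.

-- n passes of: riffle, then compare-exchange adjacent pairs.
shex :: ((a, a) -> (a, a)) -> Int -> [a] -> [a]
shex f n xs = iterate (evens f . riffle) xs !! n

-- e.g. shex (cmpSwap (<)) 3 [2,4,6,8,7,5,3,1] == [1,2,3,4,5,6,7,8]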

SLIDE 35

VSort

Sorter implemented using bmergeIt:

vmergeIt n = tblLook tautab ->- sync ->- bmergeIt n
vsortIt n = rep n (vmergeIt n)

SLIDE 36

Comparison of sorters

Six different sorters
  Bitonic sort on CPU
  Odd Even Transposition sort
  Three versions of VSort
  CUDA Bitonic sort on GPU

Data and hardware
  288 MB of random data
  CPU: 2.4 GHz Intel Core 2
  GPU: 1.2 GHz NVIDIA 8800 GTS (shader clock)

SLIDE 37

Comparison of sorters

SLIDE 38

Related work

Pan
  Embedded in Haskell
  Image synthesis
  Generates C code

Vertigo
  Embedded in Haskell
  Describes shaders
  Generates GPU programs

SLIDE 39

Related work

PyGPU
  Embedded in Python
  Uses Python's introspective abilities
  Graphics applications

SLIDE 40

Related work

NESL
  Functional language
  Nested data-parallelism
  Compiles into VCode

Data Parallel Haskell
  Nested data-parallelism in Haskell

SLIDE 41

Future work

Solve the recursion dilemma
  Enable the description of recursive sorters
  Bitonic Sort

Make use of entire GPU
Optimise the generated code
More generality
  Not just sorters

Other target platforms

SLIDE 42

Future work

More generality
  Arr a -> Arr b (not just Arr Int -> Arr Int)
  Matrices
  Pairs of arrays to arrays
  Arrays of pairs to arrays
  Drop the length-homogeneity requirement