Simple Optimizations for Applicative Array Programs for Graphics Processors


SLIDE 1

Simple Optimizations for Applicative Array Programs for Graphics Processors

Bradford Larsen, Tufts University, blarsen@cs.tufts.edu

This work was supported in part by the NASA Space Grant Graduate Fellowship and NSF grants IIS-0082577 and OCI-0749125

SLIDE 2

GPUs are powerful, but difficult to program

• 1 TFLOP/s on modern GPUs; several times greater than CPUs
• Lots of code for simple operations: this C loop

  float sum = 0;
  for (int i = 0; i < n; i += 1)
    sum += arr[i];

  takes ~150 lines of CUDA
• GPU code is data-parallel: you must decompose the problem's data

SLIDE 3

Applicative array programming allows easy GPU use

The core primitives, shown on the slide as diagrams over a vector [a, b, c, d]:

vmap f xs: element-wise transformation, yielding [f(a), f(b), f(c), f(d)]
vreduce ⊕ i xs: element-wise accumulation, yielding i ⊕ a ⊕ b ⊕ c ⊕ d
vslice (1, 2) xs: subvector extraction, yielding [b, c]
vzipWith f xs ys: pairwise combination of two vectors, yielding [f(a, a'), f(b, b'), f(c, c'), f(d, d')]
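As a reference for these semantics, here is a minimal sketch of the four primitives over plain Haskell lists; the definitions are illustrative stand-ins, not Barracuda's GPU implementations:

  vmap :: (a -> b) -> [a] -> [b]
  vmap = map

  vreduce :: (a -> a -> a) -> a -> [a] -> a
  vreduce = foldl

  vslice :: (Int, Int) -> [a] -> [a]   -- inclusive bounds: vslice (1, 2) [a,b,c,d] = [b,c]
  vslice (b, e) xs = take (e - b + 1) (drop b xs)

  vzipWith :: (a -> b -> c) -> [a] -> [b] -> [c]
  vzipWith = zipWith

  main :: IO ()
  main = do
    print (vmap (+ 1) [1, 2, 3, 4 :: Int])      -- [2,3,4,5]
    print (vreduce (+) 0 [1, 2, 3, 4 :: Int])   -- 10
    print (vslice (1, 2) "abcd")                -- "bc"
    print (vzipWith (*) [1, 2] [3, 4 :: Int])   -- [3,8]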

SLIDE 4

The Barracuda language supports these primitives on the GPU

• Applicative: no side effects
• Compositional: primitives can be freely nested
• Deeply embedded within Haskell
• Functions on vectors, matrices, and scalars are the unit of compilation

SLIDE 5

Barracuda code resembles Haskell code on lists

Haskell lists:

  rmse :: [Float] -> [Float] -> Float
  rmse x y = sqrt (sumDiff / fromIntegral (length x))
    where sumDiff = sum (map (^2) (zipWith (-) x y))

Barracuda:

  rmse :: VExp Float -> VExp Float -> SExp Float
  rmse x y = sqrt (sumDiff / fromIntegral (vlength x))
    where sumDiff = vsum (vmap (^2) (vzipWith (-) x y))
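For a quick sanity check, the list version runs as ordinary Haskell; the input vectors here are made up for illustration:

  rmse :: [Float] -> [Float] -> Float
  rmse x y = sqrt (sumDiff / fromIntegral (length x))
    where sumDiff = sum (map (^2) (zipWith (-) x y))

  main :: IO ()
  main = print (rmse [1, 2, 3] [1, 2, 5])  -- sqrt ((0 + 0 + 4) / 3) ≈ 1.1547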

SLIDE 6

Barracuda code resembles Haskell code on lists

(Same rmse example as Slide 5.) Callout: Barracuda code works on GPU vectors, not lists.

SLIDE 7

Barracuda code resembles Haskell code on lists

(Same rmse example as Slide 5.) Callout: Barracuda functions are named differently — vlength, vsum, vmap, and vzipWith in place of length, sum, map, and zipWith.

SLIDE 8

Barracuda functions construct abstract syntax trees

rmse :: VExp Float -> VExp Float -> SExp Float
rmse x y = sqrt (sumDiff / fromIntegral (vlength x))
  where sumDiff = vsum (vmap (^2) (vzipWith (-) x y))

I.e., Barracuda is deeply embedded within Haskell

Prim1 FSqrt
└─ Prim2 FDiv
   ├─ VReduce (+) (FConst 0)
   │  └─ VMap (^2)
   │     └─ VZipWith (-)
   │        ├─ x
   │        └─ y
   └─ Prim1 I2F
      └─ VLength x
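A minimal sketch of what such a deep embedding could look like, as a simplified untyped model; the constructor names mirror the tree above, but these definitions are assumptions, not the actual Barracuda types:

  -- Simplified scalar- and vector-expression ASTs (illustrative)
  data SExp
    = Prim1 Op1 SExp
    | Prim2 Op2 SExp SExp
    | VReduce Op2 SExp VExp     -- operator, initial value, input vector
    | VLength VExp
    | FConst Float
    deriving Show

  data VExp
    = VMap Op1 VExp
    | VZipWith Op2 VExp VExp
    | VVar String
    deriving Show

  data Op1 = FSqrt | FSquare | I2F  deriving Show
  data Op2 = FAdd | FSub | FDiv     deriving Show

  -- The rmse example written out as an explicit tree
  rmseAST :: VExp -> VExp -> SExp
  rmseAST x y =
    Prim1 FSqrt
      (Prim2 FDiv
        (VReduce FAdd (FConst 0)
          (VMap FSquare (VZipWith FSub x y)))
        (Prim1 I2F (VLength x)))

  main :: IO ()
  main = print (rmseAST (VVar "x") (VVar "y"))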

SLIDE 9

Barracuda ASTs are compiled into optimized CUDA code

The user writes the Barracuda functions and the C++ application code; everything else is generated. The pipeline:

Barracuda functions → Barracuda ASTs → Barracuda compiler → CUDA kernels + C++ wrapper functions → nvcc (together with the Barracuda runtime code and the user's C++ code) → GPGPU application

SLIDE 10

Efficient GPU code exploits the memory hierarchy

NVIDIA Tesla C2050: 14 chips, 32 GPU cores per chip. Approximate distance from the cores, in cycles:

48 KB shared memory (per chip): ~1
3 GB device memory: ~100s
Main memory: ~1000s

SLIDE 11

Nested array expressions are potentially troublesome

rmse :: VExp Float -> VExp Float -> SExp Float
rmse x y = sqrt (sumDiff / fromIntegral (vlength x))
  where sumDiff = vsum (vmap (^2) (vzipWith (-) x y))

Naive compilation uses temporaries and makes multiple passes over the data: each vector node in the AST from Slide 8 would become a separate kernel writing its result to a temporary array.

SLIDE 12

CUDA computes on elements, not arrays

CUDA code is data-parallel: kernels describe what happens at one location. Array indexing laws allow for fusion:

(vmap f xs)!i = f (xs!i)
(vzipWith f xs ys)!i = f (xs!i) (ys!i)
(vslice (b, e) xs)!i = xs!(b + i)

(The slice law follows the convention from Slide 3: vslice (b, e) takes the elements at indices b through e inclusive, so element i of the slice is element b + i of xs.)
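A minimal executable check of these laws over Haskell lists; the primed names are local stand-ins for the Barracuda primitives, assuming the inclusive-bounds slice convention above:

  vmap' :: (a -> b) -> [a] -> [b]
  vmap' = map

  vzipWith' :: (a -> b -> c) -> [a] -> [b] -> [c]
  vzipWith' = zipWith

  vslice' :: (Int, Int) -> [a] -> [a]
  vslice' (b, e) xs = take (e - b + 1) (drop b xs)

  main :: IO ()
  main = do
    let xs = [10, 20, 30, 40] :: [Int]
        ys = [1, 2, 3, 4]     :: [Int]
    print (vmap' (* 2) xs !! 2      == (* 2) (xs !! 2))        -- vmap law
    print (vzipWith' (+) xs ys !! 2 == (xs !! 2) + (ys !! 2))  -- vzipWith law
    print (vslice' (1, 2) xs !! 0   == xs !! (1 + 0))          -- vslice law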

SLIDE 13

Barracuda always applies the array indexing laws

Array fusion comes naturally during codegen, e.g.:

(vmap f (vmap g xs))!i → f (g (xs!i))
(vmap f (vzipWith g xs ys))!i → f (g (xs!i) (ys!i))
(vslice (b, e) (vmap f xs))!i → f (xs!(b + i))
(vmap f (vslice (b, e) xs))!i → f (xs!(b + i))
(vslice (b, e) (vslice (b', e') xs))!i → xs!(b + b' + i)
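A minimal sketch of why fusion comes naturally during code generation: compiling "expression ! i" pushes the index through each constructor, so nested operations emit a single scalar expression and no temporary arrays. The tiny AST and C-like output strings here are illustrative assumptions, not Barracuda's actual compiler:

  -- Tiny vector AST with operator names as plain strings
  data VExp
    = VMap String VExp
    | VZipWith String VExp VExp
    | VSlice (Int, Int) VExp
    | VVar String

  -- Generate scalar code for "expression ! i" by pushing the index inward
  index :: VExp -> String -> String
  index (VMap f xs)        i = f ++ "(" ++ index xs i ++ ")"
  index (VZipWith f xs ys) i = f ++ "(" ++ index xs i ++ ", " ++ index ys i ++ ")"
  index (VSlice (b, _) xs) i = index xs (show b ++ " + " ++ i)
  index (VVar x)           i = x ++ "[" ++ i ++ "]"

  main :: IO ()
  main = putStrLn (index (VMap "square" (VZipWith "sub" (VVar "x") (VVar "y"))) "i")
  -- prints: square(sub(x[i], y[i]))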

SLIDE 14

Efficient GPU code exploits the memory hierarchy

(Memory hierarchy figure repeated from Slide 10: NVIDIA Tesla C2050, slow device memory vs. fast 48 KB per-chip shared memory.)

SLIDE 15

Stencil operations involve redundant reads

A data-parallel CUDA kernel is run by many threads on the 14 GPU chips.

[Figure: a block of threads (numbered 1–7) reading from vector locations a–h, with neighboring threads touching overlapping elements]

Stencil operations involve array elements in a neighborhood, resulting in several threads reading the same elements.

SLIDE 16

Barracuda automatically uses shared memory when useful

When multiple array subexpressions overlap, there is read redundancy, e.g.:

xs = a b c d e f g h
ys = vslice (0, 6) xs   -- a b c d e f g
zs = vslice (1, 7) xs   -- b c d e f g h
as = vzipWith (-) zs ys

Elements b–g are read twice in the computation of as.

SLIDE 17

Use of shared memory is only useful when:

• array elements are read at least two times;
• it is known at compile time that elements are read multiple times; and
• there are enough elements to amortize the added indexing costs.
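A minimal sketch of the decision these criteria imply, for slices whose bounds are compile-time constants; the overlap test and the size threshold are illustrative assumptions, not Barracuda's actual heuristic:

  -- Decide whether staging reads through shared memory looks worthwhile,
  -- given the inclusive index ranges each subexpression reads (knowable
  -- at compile time for constant-bound slices) and the element count.
  useSharedMemory :: [(Int, Int)] -> Int -> Bool
  useSharedMemory ranges n = anyOverlap && n >= threshold
    where
      -- some element is read by at least two subexpressions
      anyOverlap = or [ lo <= hi' && lo' <= hi
                      | ((lo, hi), (lo', hi')) <- pairs ranges ]
      -- enough elements to amortize the extra indexing (made-up cutoff)
      threshold  = 256
      pairs xs   = [ (a, b) | (k, a) <- zip [0 ..] xs, b <- drop (k + 1) xs ]

  main :: IO ()
  main = do
    print (useSharedMemory [(0, 1022), (1, 1023)] 1023)   -- True: overlap, enough elements
    print (useSharedMemory [(0, 511), (512, 1023)] 1024)  -- False: no overlap
    print (useSharedMemory [(0, 6), (1, 7)] 8)            -- False: too few elements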

SLIDE 18

Shared memory optimization examples

1. Shared memory used: there are enough elements.
   ys = vslice (0, 1022) xs
   zs = vslice (1, 1023) xs
   as = vzipWith (-) zs ys

2. Not used: no elements are read multiple times.
   ys = vslice (0, 511) xs
   zs = vslice (512, 1023) xs
   as = vzipWith (-) zs ys

3. Not used: no elements are read multiple times.
   ys = vslice (0, 511) xs
   as = vzipWith (-) zs ys

4. Shared memory used: slices use only constant and vector-length expressions, and there are enough elements.
   ys = vslice (0, 1022) xs
   zs = vslice (1, vlength xs) xs
   as = vzipWith (-) zs ys

SLIDE 19

A mix of existing and new benchmarks was used

• BLAS operations and Black-Scholes, as seen in Lee et al. (2009) and Mainland and Morrisett (2010)
• Weighted moving average, RMSE, and forward difference, used to show the impact of the optimizations
• Test system: 512 MB NVIDIA GeForce 8800 GT, CUDA 3.2

SLIDE 20

Barracuda performance is good

[Figure: runtime relative to hand-coded solutions (0.75–1.05; below 1.0 is faster than hand-coded) vs. number of array elements (2^8 to 2^24), for Black-Scholes call options, SDOT, and SAXPY]

SLIDE 21

Array fusion is essential for good performance

[Figure: average RMSE kernel runtime in µs (10^2 to 10^4) vs. number of array elements (2^8 to 2^24), comparing fused against manually unfused code. Callouts: the fused version is 1.7–2.9x faster in one size range and 1.1x faster in another.]

SLIDE 22

Use of shared memory greatly improves performance

[Figure: speedup from the shared memory optimization (1x to 8x) vs. number of array elements (2^8 to 2^24), for weighted moving average, forward difference, and a Jacobi iteration stencil]

SLIDE 23

Speedups are enabled by careful use of declarative programming

Barracuda gets speedups through better use of GPU memory: array fusion and the shared memory optimization keep intermediate data in fast memory instead of device memory. These optimizations are easy to implement because the source language is applicative and has few primitives.
