Simple Optimizations for Applicative Array Programs for Graphics Processors


SLIDE 1

Simple Optimizations for Applicative Array Programs for Graphics Processors

Bradford Larsen, Tufts University, blarsen@cs.tufts.edu

This work was supported in part by the NASA Space Grant Graduate Fellowship and NSF grants IIS-0082577 and OCI-0749125

SLIDE 2

GPUs are powerful, but difficult to program

• 1 TFLOP/s on modern GPUs; several times greater than CPUs
• Lots of code for simple operations: this C loop

  float sum = 0;
  for (int i = 0; i < n; i += 1)
    sum += arr[i];

  takes ~150 lines of CUDA
• GPU code is data-parallel: you must decompose the problem's data

SLIDE 3

Applicative array programming allows easy GPU use

The core primitives, shown on the slide as diagrams over a vector [a, b, c, d]:

vmap f xs: element-wise transformation, yielding [f(a), f(b), f(c), f(d)]
vreduce ⊕ i xs: element-wise accumulation, yielding i ⊕ a ⊕ b ⊕ c ⊕ d
vslice (1, 2) xs: subvector extraction, yielding [b, c]
vzipWith f xs ys: pairwise combination of two vectors, yielding [f(a, a'), f(b, b'), f(c, c'), f(d, d')]
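As a reference for these semantics, here is a minimal sketch of the four primitives over plain Haskell lists; the definitions are illustrative stand-ins, not Barracuda's GPU implementations:

  vmap :: (a -> b) -> [a] -> [b]
  vmap = map

  vreduce :: (a -> a -> a) -> a -> [a] -> a
  vreduce = foldl

  vslice :: (Int, Int) -> [a] -> [a]   -- inclusive bounds: vslice (1, 2) [a,b,c,d] = [b,c]
  vslice (b, e) xs = take (e - b + 1) (drop b xs)

  vzipWith :: (a -> b -> c) -> [a] -> [b] -> [c]
  vzipWith = zipWith

  main :: IO ()
  main = do
    print (vmap (+ 1) [1, 2, 3, 4 :: Int])      -- [2,3,4,5]
    print (vreduce (+) 0 [1, 2, 3, 4 :: Int])   -- 10
    print (vslice (1, 2) "abcd")                -- "bc"
    print (vzipWith (*) [1, 2] [3, 4 :: Int])   -- [3,8]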

SLIDE 4

The Barracuda language supports these primitives on the GPU

• Applicative: no side effects
• Compositional: primitives can be freely nested
• Deeply embedded within Haskell
• Functions on vectors, matrices, and scalars are the unit of compilation

SLIDE 5

Barracuda code resembles Haskell code on lists

Haskell lists:

  rmse :: [Float] -> [Float] -> Float
  rmse x y = sqrt (sumDiff / fromIntegral (length x))
    where sumDiff = sum (map (^2) (zipWith (-) x y))

Barracuda:

  rmse :: VExp Float -> VExp Float -> SExp Float
  rmse x y = sqrt (sumDiff / fromIntegral (vlength x))
    where sumDiff = vsum (vmap (^2) (vzipWith (-) x y))
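For a quick sanity check, the list version runs as ordinary Haskell; the input vectors here are made up for illustration:

  rmse :: [Float] -> [Float] -> Float
  rmse x y = sqrt (sumDiff / fromIntegral (length x))
    where sumDiff = sum (map (^2) (zipWith (-) x y))

  main :: IO ()
  main = print (rmse [1, 2, 3] [1, 2, 5])  -- sqrt ((0 + 0 + 4) / 3) ≈ 1.1547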

SLIDE 6

Barracuda code resembles Haskell code on lists

(Same rmse example as Slide 5.) Callout: Barracuda code works on GPU vectors, not lists.

SLIDE 7

Barracuda code resembles Haskell code on lists

(Same rmse example as Slide 5.) Callout: Barracuda functions are named differently — vlength, vsum, vmap, and vzipWith in place of length, sum, map, and zipWith.

SLIDE 8

Barracuda functions construct abstract syntax trees

rmse :: VExp Float -> VExp Float -> SExp Float
rmse x y = sqrt (sumDiff / fromIntegral (vlength x))
  where sumDiff = vsum (vmap (^2) (vzipWith (-) x y))

I.e., Barracuda is deeply embedded within Haskell

Prim1 FSqrt
└─ Prim2 FDiv
   ├─ VReduce (+) (FConst 0)
   │  └─ VMap (^2)
   │     └─ VZipWith (-)
   │        ├─ x
   │        └─ y
   └─ Prim1 I2F
      └─ VLength x
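A minimal sketch of what such a deep embedding could look like, as a simplified untyped model; the constructor names mirror the tree above, but these definitions are assumptions, not the actual Barracuda types:

  -- Simplified scalar- and vector-expression ASTs (illustrative)
  data SExp
    = Prim1 Op1 SExp
    | Prim2 Op2 SExp SExp
    | VReduce Op2 SExp VExp     -- operator, initial value, input vector
    | VLength VExp
    | FConst Float
    deriving Show

  data VExp
    = VMap Op1 VExp
    | VZipWith Op2 VExp VExp
    | VVar String
    deriving Show

  data Op1 = FSqrt | FSquare | I2F  deriving Show
  data Op2 = FAdd | FSub | FDiv     deriving Show

  -- The rmse example written out as an explicit tree
  rmseAST :: VExp -> VExp -> SExp
  rmseAST x y =
    Prim1 FSqrt
      (Prim2 FDiv
        (VReduce FAdd (FConst 0)
          (VMap FSquare (VZipWith FSub x y)))
        (Prim1 I2F (VLength x)))

  main :: IO ()
  main = print (rmseAST (VVar "x") (VVar "y"))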

SLIDE 9

Barracuda ASTs are compiled into optimized CUDA code

The user writes the Barracuda functions and the C++ application code; everything else is generated. The pipeline:

Barracuda functions → Barracuda ASTs → Barracuda compiler → CUDA kernels + C++ wrapper functions → nvcc (together with the Barracuda runtime code and the user's C++ code) → GPGPU application

SLIDE 10

Efficient GPU code exploits the memory hierarchy

NVIDIA Tesla C2050: 14 chips, 32 GPU cores per chip. Approximate distance from the cores, in cycles:

48 KB shared memory (per chip): ~1
3 GB device memory: ~100s
Main memory: ~1000s

SLIDE 11

Nested array expressions are potentially troublesome

rmse :: VExp Float -> VExp Float -> SExp Float
rmse x y = sqrt (sumDiff / fromIntegral (vlength x))
  where sumDiff = vsum (vmap (^2) (vzipWith (-) x y))

Naive compilation uses temporaries and makes multiple passes over the data: each vector node in the AST from Slide 8 would become a separate kernel writing its result to a temporary array.

SLIDE 12

CUDA computes on elements, not arrays

CUDA code is data-parallel: kernels describe what happens at one location. Array indexing laws allow for fusion:

(vmap f xs)!i = f (xs!i)
(vzipWith f xs ys)!i = f (xs!i) (ys!i)
(vslice (b, e) xs)!i = xs!(b + i)

(The slice law follows the convention from Slide 3: vslice (b, e) takes the elements at indices b through e inclusive, so element i of the slice is element b + i of xs.)
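A minimal executable check of these laws over Haskell lists; the primed names are local stand-ins for the Barracuda primitives, assuming the inclusive-bounds slice convention above:

  vmap' :: (a -> b) -> [a] -> [b]
  vmap' = map

  vzipWith' :: (a -> b -> c) -> [a] -> [b] -> [c]
  vzipWith' = zipWith

  vslice' :: (Int, Int) -> [a] -> [a]
  vslice' (b, e) xs = take (e - b + 1) (drop b xs)

  main :: IO ()
  main = do
    let xs = [10, 20, 30, 40] :: [Int]
        ys = [1, 2, 3, 4]     :: [Int]
    print (vmap' (* 2) xs !! 2      == (* 2) (xs !! 2))        -- vmap law
    print (vzipWith' (+) xs ys !! 2 == (xs !! 2) + (ys !! 2))  -- vzipWith law
    print (vslice' (1, 2) xs !! 0   == xs !! (1 + 0))          -- vslice law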

SLIDE 13

Barracuda always applies the array indexing laws

Array fusion comes naturally during codegen, e.g.:

(vmap f (vmap g xs))!i → f (g (xs!i))
(vmap f (vzipWith g xs ys))!i → f (g (xs!i) (ys!i))
(vslice (b, e) (vmap f xs))!i → f (xs!(b + i))
(vmap f (vslice (b, e) xs))!i → f (xs!(b + i))
(vslice (b, e) (vslice (b', e') xs))!i → xs!(b + b' + i)
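A minimal sketch of why fusion comes naturally during code generation: compiling "expression ! i" pushes the index through each constructor, so nested operations emit a single scalar expression and no temporary arrays. The tiny AST and C-like output strings here are illustrative assumptions, not Barracuda's actual compiler:

  -- Tiny vector AST with operator names as plain strings
  data VExp
    = VMap String VExp
    | VZipWith String VExp VExp
    | VSlice (Int, Int) VExp
    | VVar String

  -- Generate scalar code for "expression ! i" by pushing the index inward
  index :: VExp -> String -> String
  index (VMap f xs)        i = f ++ "(" ++ index xs i ++ ")"
  index (VZipWith f xs ys) i = f ++ "(" ++ index xs i ++ ", " ++ index ys i ++ ")"
  index (VSlice (b, _) xs) i = index xs (show b ++ " + " ++ i)
  index (VVar x)           i = x ++ "[" ++ i ++ "]"

  main :: IO ()
  main = putStrLn (index (VMap "square" (VZipWith "sub" (VVar "x") (VVar "y"))) "i")
  -- prints: square(sub(x[i], y[i]))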

SLIDE 14

Efficient GPU code exploits the memory hierarchy

(Memory hierarchy figure repeated from Slide 10: NVIDIA Tesla C2050, slow device memory vs. fast 48 KB per-chip shared memory.)

SLIDE 15

Stencil operations involve redundant reads

A data-parallel CUDA kernel is run by many threads on the 14 GPU chips.

[Figure: a block of threads (numbered 1–7) reading from vector locations a–h, with neighboring threads touching overlapping elements]

Stencil operations involve array elements in a neighborhood, resulting in several threads reading the same elements.

SLIDE 16

Barracuda automatically uses shared memory when useful

When multiple array subexpressions overlap, there is read redundancy, e.g.:

xs = a b c d e f g h
ys = vslice (0, 6) xs   -- a b c d e f g
zs = vslice (1, 7) xs   -- b c d e f g h
as = vzipWith (-) zs ys

Elements b–g are read twice in the computation of as.

SLIDE 17

Use of shared memory is only useful when:

• array elements are read at least two times;
• it is known at compile time that elements are read multiple times; and
• there are enough elements to amortize the added indexing costs.
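A minimal sketch of the decision these criteria imply, for slices whose bounds are compile-time constants; the overlap test and the size threshold are illustrative assumptions, not Barracuda's actual heuristic:

  -- Decide whether staging reads through shared memory looks worthwhile,
  -- given the inclusive index ranges each subexpression reads (knowable
  -- at compile time for constant-bound slices) and the element count.
  useSharedMemory :: [(Int, Int)] -> Int -> Bool
  useSharedMemory ranges n = anyOverlap && n >= threshold
    where
      -- some element is read by at least two subexpressions
      anyOverlap = or [ lo <= hi' && lo' <= hi
                      | ((lo, hi), (lo', hi')) <- pairs ranges ]
      -- enough elements to amortize the extra indexing (made-up cutoff)
      threshold  = 256
      pairs xs   = [ (a, b) | (k, a) <- zip [0 ..] xs, b <- drop (k + 1) xs ]

  main :: IO ()
  main = do
    print (useSharedMemory [(0, 1022), (1, 1023)] 1023)   -- True: overlap, enough elements
    print (useSharedMemory [(0, 511), (512, 1023)] 1024)  -- False: no overlap
    print (useSharedMemory [(0, 6), (1, 7)] 8)            -- False: too few elements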

SLIDE 18

Shared memory optimization examples

1. Shared memory used: there are enough elements.
   ys = vslice (0, 1022) xs
   zs = vslice (1, 1023) xs
   as = vzipWith (-) zs ys

2. Not used: no elements are read multiple times.
   ys = vslice (0, 511) xs
   zs = vslice (512, 1023) xs
   as = vzipWith (-) zs ys

3. Not used: no elements are read multiple times.
   ys = vslice (0, 511) xs
   as = vzipWith (-) zs ys

4. Shared memory used: slices use only constant and vector-length expressions, and there are enough elements.
   ys = vslice (0, 1022) xs
   zs = vslice (1, vlength xs) xs
   as = vzipWith (-) zs ys

SLIDE 19

A mix of existing and new benchmarks was used

• BLAS operations and Black-Scholes, as seen in Lee et al. (2009) and Mainland and Morrisett (2010)
• Weighted moving average, RMSE, and forward difference, used to show the impact of the optimizations
• Test system: 512 MB NVIDIA GeForce 8800 GT, CUDA 3.2

SLIDE 20

Barracuda performance is good

[Figure: runtime relative to hand-coded solutions (0.75–1.05; below 1.0 is faster than hand-coded) vs. number of array elements (2^8 to 2^24), for Black-Scholes call options, SDOT, and SAXPY]

SLIDE 21

Array fusion is essential for good performance

[Figure: average RMSE kernel runtime in µs (10^2 to 10^4) vs. number of array elements (2^8 to 2^24), comparing fused against manually unfused code. Callouts: the fused version is 1.7–2.9x faster in one size range and 1.1x faster in another.]

SLIDE 22

Use of shared memory greatly improves performance

[Figure: speedup from the shared memory optimization (1x to 8x) vs. number of array elements (2^8 to 2^24), for weighted moving average, forward difference, and a Jacobi iteration stencil]

SLIDE 23

Speedups are enabled by careful use of declarative programming

Barracuda gets speedups through better use of GPU memory: array fusion and the shared memory optimization keep intermediate data in fast memory instead of device memory. These optimizations are easy to implement because the source language is applicative and has few primitives.
