Recursive Filter Code for GPUs Sepideh Maleki and Martin Burtscher - - PowerPoint PPT Presentation

SLIDE 1

Automatic Generation of 1D Recursive Filter Code for GPUs

Sepideh Maleki and Martin Burtscher

SLIDE 2

Based on Fibonacci Sequences

▪ Fibonacci numbers: 0, 1, 1, 2, 3, 5, 8, 13, 21, …

▪ Sum of previous two values (F_i = F_{i-1} + F_{i-2})

▪ Tribonacci numbers: 0, 0, 1, 1, 2, 4, 7, 13, 24, …

▪ Sum of prior three values (F_i = F_{i-1} + F_{i-2} + F_{i-3})

▪ (2, -3, 1)-Fibonacci numbers: 0, 0, 1, 2, 1, -3, -7, -4, …

▪ Weighted sum of prior values (F_i = 2F_{i-1} - 3F_{i-2} + 1F_{i-3})

▪ (w_1, …, w_k)-Fibonacci numbers: 0, …, 0, 1, w_1, w_1^2 + w_2, …

▪ Weighted sum of prior k values with w_j ∈ ℝ (F_i = w_1 F_{i-1} + w_2 F_{i-2} + … + w_k F_{i-k}), called k-nacci numbers
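The sequences listed on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch; the function name `knacci` and its parameters are made up here and do not come from the PLR tool.

```python
def knacci(weights, count):
    """Return the first `count` (w1, ..., wk)-Fibonacci numbers.

    The sequence is seeded with k-1 zeros followed by a one, and each later
    value is w1*F[i-1] + w2*F[i-2] + ... + wk*F[i-k].
    """
    k = len(weights)
    seq = [0] * (k - 1) + [1]
    while len(seq) < count:
        # seq[-1] pairs with w1, seq[-2] with w2, and so on.
        seq.append(sum(w * seq[-j] for j, w in enumerate(weights, start=1)))
    return seq[:count]


fib = knacci([1, 1], 9)            # Fibonacci numbers
trib = knacci([1, 1, 1], 9)        # Tribonacci numbers
weighted = knacci([2, -3, 1], 8)   # (2, -3, 1)-Fibonacci numbers
```

The three calls yield the same values as the three example sequences above.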

http://www.storyofmathematics.com/medieval_fibonacci.html

SLIDE 3

Linear Recurrences

▪ Transform input sequence into output sequence

x_0, …, x_{n-1} → y_0, …, y_{n-1}

▪ Our focus is on order-k homogeneous linear recurrences with constant coefficients

y_i = a_0 x_i + a_{-1} x_{i-1} + … + a_{-p} x_{i-p} + b_{-1} y_{i-1} + b_{-2} y_{i-2} + … + b_{-k} y_{i-k}
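As a point of reference, the recurrence can be evaluated sequentially as sketched below. The function name and argument layout are assumptions for illustration; values before the start of the sequence are taken as 0, which matches the convention stated later in the talk.

```python
def linear_recurrence(x, a, b):
    """Sequentially evaluate the order-k recurrence.

    a = [a_0, a_-1, ..., a_-p] are the feed-forward coefficients and
    b = [b_-1, ..., b_-k] are the feed-back coefficients.
    Values with a negative index are treated as 0.
    """
    y = []
    for i in range(len(x)):
        acc = 0
        for j, coeff in enumerate(a):            # a[j] multiplies x[i-j]
            if i - j >= 0:
                acc += coeff * x[i - j]
        for j, coeff in enumerate(b, start=1):   # b[j-1] multiplies y[i-j]
            if i - j >= 0:
                acc += coeff * y[i - j]
        y.append(acc)
    return y


# With a = [1] and b = [1] the recurrence degenerates to a running sum.
example = linear_recurrence([3, 2, -1, 8], a=[1], b=[1])
```

Each output depends on earlier outputs, so this loop cannot be parallelized as written.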

SLIDE 4

Importance of Linear Recurrences

▪ Linear recurrences appear in many domains

▪ Mathematics

▪ Data compression

▪ Biology

▪ Parallel programming: prefix sums

▪ Telecommunication: digital filters

▪ Random-number generation

▪ Finance and economics

▪ Complexity analysis

gamedsforum.ca

SLIDE 5

Prefix Sums

▪ Prefix sums are fundamental building blocks

▪ Help parallelize many seemingly serial algorithms

▪ Given a sequence of values (integer or real)

▪ Compute the sequence whose values are the sum of all previous values from the original sequence

y_i = x_i + y_{i-1}
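A minimal sequential version of this prefix sum, cross-checked against the standard library's `itertools.accumulate`, might look as follows. The helper name `prefix_sum` and the sample data are illustrative choices.

```python
from itertools import accumulate


def prefix_sum(values):
    """Inclusive prefix sum: y[i] = x[i] + y[i-1], with y[-1] taken as 0."""
    out = []
    running = 0
    for v in values:
        running += v
        out.append(running)
    return out


data = [3, 2, -1, 8]
sums = prefix_sum(data)
# The hand-written loop agrees with the library routine.
assert sums == list(accumulate(data))
```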

SLIDE 6

Digital (Recursive) Filters

▪ IIR filters are fundamental DSP algorithms

▪ Used in telecommunication and audio DSP codes

▪ Digital equivalent to analog RC circuits

▪ Illustration

▪ High-pass filter: y_i = 0.93 x_i - 0.93 x_{i-1} + 0.86 y_{i-1}

The Scientist and Engineer’s Guide to Digital Signal Processing by Steven W. Smith
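To see why this particular recurrence acts as a high-pass filter, it can be applied to a constant input, which contains only the zero frequency. The sketch below assumes zero initial state; the function name and the length of the test signal are arbitrary.

```python
def high_pass(x):
    """Apply y[i] = 0.93*x[i] - 0.93*x[i-1] + 0.86*y[i-1] with zero initial state."""
    y = []
    prev_x = 0.0
    prev_y = 0.0
    for xi in x:
        yi = 0.93 * xi - 0.93 * prev_x + 0.86 * prev_y
        y.append(yi)
        prev_x, prev_y = xi, yi
    return y


step = [1.0] * 60            # constant (zero-frequency) input
response = high_pass(step)   # initial jump, then geometric decay toward 0
```

After the initial jump, the two feed-forward terms cancel for a constant input, so the output shrinks by a factor of 0.86 per sample.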

SLIDE 7

Parallelization Difficulty

▪ Recurrence equation (x_j = 0, y_j = 0, ∀j < 0)

y_i = a_0 x_i + a_{-1} x_{i-1} + … + a_{-p} x_{i-p} + b_{-1} y_{i-1} + b_{-2} y_{i-2} + … + b_{-k} y_{i-k}

▪ Computation of element y_i

[Diagram: y_i is the sum (∑) of x_{i-p}, …, x_i weighted by a_{-p}, …, a_0 and of y_{i-k}, …, y_{i-1} weighted by b_{-k}, …, b_{-1}]

▪ The b_j are the recursion (feed-back) coefficients

▪ The a_j are the non-recursion (feed-forward) coefficients

▪ k denotes the order of the recurrence

▪ Input sequence: given, read-only

▪ Output sequence: written and read

▪ Data dependency!

SLIDE 8

Simplified Notation

▪ Recurrence equation

y_i = a_0 x_i + a_{-1} x_{i-1} + … + a_{-p} x_{i-p} + b_{-1} y_{i-1} + b_{-2} y_{i-2} + … + b_{-k} y_{i-k}

▪ Signature

(a_0, a_{-1}, …, a_{-p} : b_{-1}, b_{-2}, …, b_{-k})

▪ Lists only non-recursion and recursion coefficients in parentheses and separated by a colon

SLIDE 9

Signature Examples

▪ Standard prefix sum

▪ Prefix sum over scalar values

▪ (1 : 1)

▪ Low-pass digital filters

▪ Retain low frequencies but dampen high frequencies

▪ 1-stage (0.2 : 0.8), 2-stage (0.04 : 1.6, -0.64), etc.

▪ High-pass digital filters

▪ Retain high frequencies

▪ 1-stage (0.9, -0.9 : 0.8)

▪ 2-stage (0.81, -1.62, 0.81 : 1.6, -0.64)
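The signature notation is simple enough to split mechanically into its two coefficient lists. The parser below is a hypothetical helper written for illustration; it is not the input format or API of the PLR tool, which the slides do not specify.

```python
def parse_signature(text):
    """Split "(a0, a-1, ... : b-1, ..., b-k)" into two lists of floats."""
    inner = text.strip()
    if not (inner.startswith("(") and inner.endswith(")")):
        raise ValueError("signature must be enclosed in parentheses")
    left, sep, right = inner[1:-1].partition(":")
    if not sep:
        raise ValueError("signature must contain a colon")
    # Left of the colon: non-recursion (feed-forward) coefficients.
    feed_forward = [float(tok) for tok in left.split(",") if tok.strip()]
    # Right of the colon: recursion (feed-back) coefficients.
    feed_back = [float(tok) for tok in right.split(",") if tok.strip()]
    return feed_forward, feed_back


prefix = parse_signature("(1 : 1)")
high_pass_2 = parse_signature("(0.81, -1.62, 0.81 : 1.6, -0.64)")
```

The length of the second list is the order k of the recurrence, so the 2-stage high-pass filter above has order 2.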


by Steven W. Smith

SLIDE 10

Separation into Map + Recurrence

▪ Original recurrence

y_i = a_0 x_i + a_{-1} x_{i-1} + … + a_{-p} x_{i-p} + b_{-1} y_{i-1} + b_{-2} y_{i-2} + … + b_{-k} y_{i-k}

▪ Equivalent map and simpler recurrence

▪ Map operation, signature (a_0, …, a_{-p} : 0)

t_i = a_0 x_i + a_{-1} x_{i-1} + … + a_{-p} x_{i-p}

▪ Recurrence, signature (1 : b_{-1}, …, b_{-k})

y_i = t_i + b_{-1} y_{i-1} + b_{-2} y_{i-2} + … + b_{-k} y_{i-k}

▪ Benefit: easier to parallelize

▪ Recurrence always has (1 : ...) format; map is trivial
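The equivalence of the two formulations can be checked numerically. In this sketch all function names are illustrative; the map stage has no dependence between outputs, while only the second stage is a true recurrence.

```python
def direct(x, a, b):
    """Original recurrence with signature (a : b), evaluated sequentially."""
    y = []
    for i in range(len(x)):
        acc = sum(c * x[i - j] for j, c in enumerate(a) if i - j >= 0)
        acc += sum(c * y[i - j] for j, c in enumerate(b, start=1) if i - j >= 0)
        y.append(acc)
    return y


def map_stage(x, a):
    """Signature (a : 0): every t[i] depends only on inputs, so it is a map."""
    return [sum(c * x[i - j] for j, c in enumerate(a) if i - j >= 0)
            for i in range(len(x))]


def recurrence_stage(t, b):
    """Signature (1 : b): y[i] = t[i] + b_-1*y[i-1] + ... + b_-k*y[i-k]."""
    y = []
    for i in range(len(t)):
        acc = t[i] + sum(c * y[i - j] for j, c in enumerate(b, start=1) if i - j >= 0)
        y.append(acc)
    return y


x = [1.0, -9.0, 5.0, 3.0, 2.0, -1.0, 8.0, -6.0]
a, b = [0.81, -1.62, 0.81], [1.6, -0.64]     # 2-stage high-pass signature
one_pass = direct(x, a, b)
two_pass = recurrence_stage(map_stage(x, a), b)
```

Both paths produce the same output up to floating-point rounding, so a parallel solver only has to handle the (1 : ...) form.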

SLIDE 11

Our PLR Approach

▪ High-level idea

▪ Break input into chunks of size 1 (trivial)

▪ Iteratively combine adjacent chunks into larger chunks

▪ Two phases

1. Merging

2. Pipelining*

SLIDE 12

PLR Merging (1 : d)

▪ Merging two adjacent chunks

▪ v_0, v_1, …, v_{m-1} | v_m, v_{m+1}, …, v_{2m-1}

▪ Correcting element v_m

▪ Per (1 : d), need to add d times prior element v_{m-1}

▪ The correction term is d∙v_{m-1}

▪ Correcting element v_{m+1}

▪ Need to add d times the corrected prior element

▪ Already added d times v_m in an earlier iteration

▪ Only need to add d times the prior correction term

▪ The correction term is d∙d∙v_{m-1}
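The merge step for a first-order recurrence can be sketched as follows. The function is an illustrative reading of the slide, not code from the PLR generator; both chunks are assumed to be already solved locally, as if nothing preceded them.

```python
def merge_first_order(left, right, d):
    """Merge two adjacent, locally solved chunks of the recurrence (1 : d)."""
    carry = left[-1]          # v[m-1], the last value of the prior chunk
    corrected = []
    factor = d                # correction factor for the first element v[m]
    for v in right:
        corrected.append(v + factor * carry)
        factor *= d           # each later element needs one more power of d
    return left + corrected


# Prefix sum (d = 1): [3, 5] and [-1, 7] are the local sums of [3, 2] and [-1, 8].
merged_sum = merge_first_order([3, 5], [-1, 7], d=1)

# (1 : 2) on the input 1, 1, 1, 1: both chunks are locally [1, 3].
merged_d2 = merge_first_order([1, 3], [1, 3], d=2)
```

All corrections in the right-hand chunk use the same carry and are independent of one another, which is what makes the merge parallel.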

[Figure: merging example for the prefix sum (1 : 1): 3 2 -1 8 → 3 5 -1 7 → 3 5 4 12]

SLIDE 13

PLR Merging (1 : d) cont.

▪ Correcting elements v_{m+2}, v_{m+3}, etc.

▪ The correction terms are d^3∙v_{m-1}, d^4∙v_{m-1}, etc.

▪ Correction factor times carry v_{m-1} from prior chunk

▪ Key observation

▪ Carry value depends on input sequence

▪ Correction factors only depend on recurrence

▪ Can be precomputed as they are the same for all inputs

▪ Just the correction factors

▪ d, d^2, d^3, …, d^m

▪ Start with 1, apply recurrence (0 : d) → 1 | d, d^2, d^3, …, d^m

▪ All factors are 1 for d = 1; prefix sum is trivial base case

SLIDE 14

PLR Merging (1 : d, e)

▪ Merging two adjacent chunks

▪ v_0, v_1, …, v_{m-2}, v_{m-1} | v_m, v_{m+1}, …, v_{2m-1}

▪ Correcting element v_m

▪ Per (1 : d, e), need to add d times v_{m-1} plus e times v_{m-2}

▪ The correction term is d∙v_{m-1} + e∙v_{m-2}

▪ Correcting element v_{m+1}

▪ Need to add d times (d∙v_{m-1} + e∙v_{m-2}) plus e times v_{m-1}

▪ The correction term is d∙(d∙v_{m-1} + e∙v_{m-2}) + e∙v_{m-1}, which is (d^2 + e)∙v_{m-1} + (d∙e)∙v_{m-2} after rearranging the terms

SLIDE 15

PLR Merging (1 : d, e) cont.

▪ Correcting elements v_{m+2}, v_{m+3}, etc.

▪ The correction terms are (d^3 + 2de)∙v_{m-1} + (d^2 e + e^2)∙v_{m-2}, (d^4 + 3d^2 e + e^2)∙v_{m-1} + (d^3 e + 2de^2)∙v_{m-2}, etc.

▪ There are two carries v_{m-1} and v_{m-2} from prior chunk

▪ Because the recurrence (1 : d, e) has order 2

▪ Just the correction factors for v_{m-1}

▪ d, d^2 + e, d^3 + 2de, d^4 + 3d^2 e + e^2, …

▪ Just the correction factors for v_{m-2}

▪ e, de, d^2 e + e^2, d^3 e + 2de^2, …

SLIDE 16

PLR Merging (1 : d, e) cont.

▪ Correction factors for v_{m-1}

▪ d, d^2 + e, d^3 + 2de, d^4 + 3d^2 e + e^2, …

▪ Correction factors for v_{m-2}

▪ e, de, d^2 e + e^2, d^3 e + 2de^2, …

▪ Both sequences can be generated by (0 : d, e)

▪ 0, 1 | d, d^2 + e, d^3 + 2de, d^4 + 3d^2 e + e^2, …

▪ 1, 0 | e, de, d^2 e + e^2, d^3 e + 2de^2, …

▪ The “1” indicates the location of the carry in the prior chunk
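The closed-form factors can be checked against the (0 : d, e) recurrence for concrete numbers. In this sketch the helper name `factor_list` and the values d = 3 and e = 2 are arbitrary choices for illustration.

```python
def factor_list(seed, d, e, count):
    """Start from the two seed values and apply (0 : d, e) `count` times.

    seed = (value at position m-2, value at position m-1).
    """
    prev2, prev1 = seed
    out = []
    for _ in range(count):
        nxt = d * prev1 + e * prev2
        out.append(nxt)
        prev2, prev1 = prev1, nxt
    return out


d, e = 3, 2   # arbitrary integer test values
for_vm1 = factor_list((0, 1), d, e, 4)   # the "1" sits at position m-1
for_vm2 = factor_list((1, 0), d, e, 4)   # the "1" sits at position m-2

closed_vm1 = [d, d**2 + e, d**3 + 2*d*e, d**4 + 3*d**2*e + e**2]
closed_vm2 = [e, d*e, d**2*e + e**2, d**3*e + 2*d*e**2]
```

For these values the generated lists and the closed forms coincide, giving 3, 11, 39, 139 for v_{m-1} and 2, 6, 22, 78 for v_{m-2}.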

SLIDE 17

PLR Merging (1 : b-1, b-2, …, b-k)

▪ Correction-factor computation

▪ Recurrence has order k, so k lists of factors needed

▪ Start with k-1 zeros and a one: 0, …, 0, 1, 0, …, 0

▪ “1” is in location of corresponding carry

▪ Compute factors using (0 : b_{-1}, b_{-2}, …, b_{-k})

▪ Correction factors are k-nacci sequences (generalized Fibonacci sequences)
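Putting the pieces together, the merging phase for a general (1 : b_{-1}, …, b_{-k}) recurrence might be sketched as below and compared against a sequential evaluation. This is a serial simulation written for clarity and is not the generated CUDA code: it omits the pipelining phase and recomputes the factors at every level instead of precomputing them once. All names are illustrative.

```python
def sequential(t, b):
    """Reference: y[i] = t[i] + b[0]*y[i-1] + ... + b[k-1]*y[i-k]."""
    y = []
    for i in range(len(t)):
        y.append(t[i] + sum(c * y[i - j] for j, c in enumerate(b, start=1) if i - j >= 0))
    return y


def correction_factors(b, m):
    """lists[c][r] is the weight of carry v[m-1-c] for element r of the next chunk."""
    k = len(b)
    lists = []
    for c in range(k):
        window = [0] * k           # positions m-k .. m-1 of the prior chunk
        window[k - 1 - c] = 1      # the one marks the carry v[m-1-c]
        out = []
        for _ in range(m):         # apply (0 : b) to generate m factors
            nxt = sum(coeff * window[-j] for j, coeff in enumerate(b, start=1))
            out.append(nxt)
            window = window[1:] + [nxt]
        lists.append(out)
    return lists


def plr_merge_all(t, b):
    """Merge chunks of size 1, 2, 4, ... until one chunk covers the input."""
    v, n, k, m = list(t), len(t), len(b), 1
    while m < n:
        factors = correction_factors(b, m)
        for start in range(m, n, 2 * m):            # first index of each right chunk
            # Only the last min(k, m) values of the left chunk act as carries.
            carries = [v[start - 1 - c] for c in range(min(k, m))]
            for r in range(min(m, n - start)):      # independent corrections
                v[start + r] += sum(factors[c][r] * carries[c]
                                    for c in range(len(carries)))
        m *= 2
    return v


t = [1, -9, 5, 3, 2, -1, 8, -6, -1, 5, -8]
merged = plr_merge_all(t, [2, -3, 1])
prefix = plr_merge_all(t, [1])
```

Within one level, every correction reads only carries from a left-hand chunk that is no longer modified, so the two inner loops could run in parallel.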

SLIDE 18

PLR: Proof of Concept Tool

▪ PLR code generator

▪ Compiles signature into CUDA code for GPUs

▪ Performs domain-specific code optimizations

▪ Generated code

▪ Performs map operation (a_0, a_{-1}, …, a_{-p} : 0)

▪ Computes recurrence (1 : b_{-1}, b_{-2}, …, b_{-k})

▪ First five merge steps are done at warp level

▪ Remaining merge steps are done at thread-block level

▪ Pipelining is performed at grid level*

▪ Uses m ≤ 9 ∙ 1024 for floats and m ≤ 11 ∙ 1024 for ints

SLIDE 19

Experimental Methodology

▪ GPU

▪ GeForce GTX Titan X (1.1 GHz cores, 3.5 GHz memory)

▪ 3072 cores, 24 SMs, up to 49,152 active threads

▪ 2 MB L2 cache, 12 GB of global memory (336 GB/s)

▪ Compiler and flags

▪ nvcc 7.5 with “-O3 -arch=sm_52”

▪ Comparison codes

▪ Prefix sums: CUB (Nvidia), SAM (us), Scan (CMU)

▪ Digital filters: Alg3 (IMPA), Rec (Halide/MIT), Scan

▪ All downloaded except Scan (uses CUB’s scan)
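The hardware figures above give a rough ceiling for the throughput charts that follow. The back-of-the-envelope calculation below assumes 4-byte words, 1 GB = 10^9 bytes, and that every element is read once and written once; it is a theoretical bound, so measured memory-copy throughput would be expected to be somewhat lower.

```python
BANDWIDTH_BYTES_PER_S = 336e9   # global-memory bandwidth quoted on the slide
BYTES_PER_WORD = 4              # assumption: 32-bit floats or ints

# A memory copy reads each word once and writes it once.
bytes_per_element = 2 * BYTES_PER_WORD
peak_words_per_s = BANDWIDTH_BYTES_PER_S / bytes_per_element

# Per-SM figures implied by the totals on the slide.
threads_per_sm = 49_152 // 24
cores_per_sm = 3072 // 24

print(f"peak copy throughput: {peak_words_per_s / 1e9:.0f} billion words/s")
print(f"{threads_per_sm} threads and {cores_per_sm} cores per SM")
```

Under these assumptions the ceiling is 42 billion words per second, and a code that touches twice as much data can reach at most half of that.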

SLIDE 20

Different Approaches

▪ CUB: prefix scan on templated objects/operators

▪ Warp/block/grid prefix scans with carry pipelining

▪ SAM: scalar/higher-order/tuple-based prefix sums

▪ Auto-tuned specialized prefix sums (warp/block/grid)

▪ Scan: arbitrary 1D linear recurrences

▪ Elem = vector, recur = matrix, operator = vec/mat mult

▪ Alg3: 2D recursive filters (for image processing)

▪ Framework to facilitate overlap of multi-dim filters

▪ Rec: 2D recursive filters (for image processing)

▪ Specify filters and heuristics, Halide compiles/optimizes

We use scaled-down versions with 1D filters on 2D inputs

SLIDE 21

1-Stage Low-Pass Filter Throughput

[Chart: throughput [billion floats per second] vs. sequence length [floats]; series: memcpy, Alg3, Rec, Scan, PLR]

▪ PLR outperforms other tools on large inputs

▪ Alg3 computes forward and backward recurrence

▪ PLR reaches memory-copy throughput; cannot be exceeded

▪ Rec runs 2D input (many short carry chains)

SLIDE 22

2-Stage Low-Pass Filter Throughput

[Chart: throughput [billion floats per second] vs. sequence length [floats]; series: memcpy, Alg3, Rec, Scan, PLR]

▪ L2 cache capacity exceeded

▪ Alg3 and Rec read data twice

SLIDE 23

High-Pass Filter Throughput

[Chart: throughput [billion floats per second] vs. sequence length [floats]; series: memcpy, Scan1, PLR1, PLR2, PLR3]

▪ Alg3 and Rec do not support multiple a coefficients

▪ Scan yields 50% of throughput because it accesses 2x as much data

SLIDE 24

Prefix-Sum Throughput

[Chart: throughput [billion ints per second] vs. sequence length [ints]; series: memcpy, CUB, SAM, Scan, PLR]

▪ CUB, SAM, and PLR perform roughly equally

SLIDE 25

2-Tuple Prefix-Sum Throughput

[Chart: throughput [billion ints per second] vs. sequence length [ints]; series: memcpy, CUB, SAM, Scan, PLR]

▪ PLR outperforms CUB and SAM on large inputs

▪ Scan accesses six times as much data

SLIDE 26

2nd-Order Prefix-Sum Throughput

[Chart: throughput [billion ints per second] vs. sequence length [ints]; series: memcpy, CUB, SAM, Scan, PLR]

▪ SAM is much faster on medium and large inputs

▪ PLR performs on par with CUB

SLIDE 27

Domain-Specific Code Optimizations

[Chart: throughput [billion words per second] with optimizations on vs. optimizations off, for floating-point signatures and integer signatures; 1,073,741,824 values]

▪ Float: suppressing 0 factors is important

▪ Int: specializing 0 and 1 factors is important

▪ No specialization implemented

SLIDE 28

Try PLR for Yourself

http://cs.txstate.edu/~burtscher/research/PLR/

SLIDE 29

Summary

▪ Linear recurrences are widely used computations

▪ Difficult to parallelize due to data dependencies

▪ PLR: new general parallelization approach

▪ Based on iterative merging (and pipelining)

▪ Input-independent correction factors (k-nacci)

▪ Precomputed and optimized for a given recurrence

▪ Work, space, and communication efficient

▪ Highest GPU performance in many cases

▪ Automatically compiled, parallelized, and optimized

SLIDE 30

Thank you!

▪ Acknowledgments

▪ Diego Nehab, André Maximo, Gaurav Chaurasia, Sahar Azimi, NSF, Nvidia, paper reviewers

▪ Contact information

▪ burtscher@txstate.edu

▪ PLR web page

▪ http://cs.txstate.edu/~burtscher/research/PLR/

▪ Includes link to paper
