Parallel Recursive Filtering of Infinite Input Extensions. Diego Nehab, André Maximo (IMPA, GE Global Research). PowerPoint presentation



slide-1
SLIDE 1

Parallel Recursive Filtering of Infinite Input Extensions

Diego Nehab (IMPA), André Maximo (GE Global Research)

GTC 2017

slide-2
SLIDE 2

Linear time-invariant filters

filter

slide-3
SLIDE 3

Linear

  • Invariant to scale

[Diagram: scaling the input and then filtering gives the same output as filtering and then scaling.]

slide-4
SLIDE 4

Linear

  • Invariant to addition

[Diagram: adding two inputs and then filtering gives the same output as filtering each input and then adding.]

slide-5
SLIDE 5

Time invariant

  • Invariant to shift

[Diagram: shifting the input in time and then filtering gives the same output as filtering and then shifting.]

slide-6
SLIDE 6

Convolution

  • The convolution of sequences $b$ and $h$ is a sequence $a$

$a = b * h, \qquad a_i = \sum_{j=-\infty}^{\infty} b_j \, h_{i-j}, \qquad a, b, h : \mathbb{Z} \to \mathbb{R}$

slide-7
SLIDE 7

Linear shift-invariant filters are convolutions

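[Diagram: filtering a unit impulse yields the filter's impulse response; convolving an input with that impulse response gives the same output as filtering the input.]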
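The claim above can be checked numerically on a small case. The following is a minimal sketch, not taken from the talk: `convolve` and `moving_average` are hypothetical names, and the 3-tap causal moving average is only an assumed example of a linear shift-invariant filter.

```python
def convolve(b, h):
    """Full convolution of two finite sequences (treated as zero elsewhere)."""
    a = [0.0] * (len(b) + len(h) - 1)
    for j, bj in enumerate(b):
        for k, hk in enumerate(h):
            a[j + k] += bj * hk
    return a


def moving_average(x):
    """Assumed example filter: causal 3-tap average, zero input before x[0]."""
    padded = [0.0, 0.0] + list(x)
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
            for i in range(len(x))]


# The impulse response is what the filter outputs for a unit impulse.
h = moving_average([1.0, 0.0, 0.0])

# Filtering an input directly, and convolving it with the impulse response.
x = [3.0, 0.0, 6.0, 9.0]
filtered = moving_average(x)
via_convolution = convolve(x, h)[:len(x)]
```

Under these assumptions `filtered` and `via_convolution` agree up to rounding, which is the sense in which the filter "is" a convolution.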

slide-8
SLIDE 8

Examples

[Figures: example convolutions (∗).]

slide-9
SLIDE 9

Outline of talk

  • Introduction
  • Recursive filters are very useful
  • Initialization at the boundaries is an important problem
  • Exact recursive filtering of infinite input extensions
  • Closed-form formulas available for the first time
  • Enable simple and effective algorithms
  • Parallelization
  • Fastest recursive filtering algorithms to date
  • First to filter infinite extensions exactly
slide-10
SLIDE 10

General model for convolutions

  • Linear difference equations

$A z = B w$, where $A$ and $B$ are infinite banded matrices with constant diagonals ($a_r, \ldots, a_0, \ldots, a_{-r}$ for $A$ and $b_d, \ldots, b_0, \ldots, b_{-d}$ for $B$):

$a_r z_{i-r} + \cdots + a_1 z_{i-1} + a_0 z_i + a_{-1} z_{i+1} + \cdots + a_{-r} z_{i+r} = b_d w_{i-d} + \cdots + b_1 w_{i-1} + b_0 w_i + b_{-1} w_{i+1} + \cdots + b_{-d} w_{i+d}$

[Diagram: the filter takes input $w$ to output $z$.]

slide-11
SLIDE 11

General model for convolutions

  • Decompose into direct and recursive parts

[Diagram: input $w$, direct part, $x$, recursive part, output $z$; $A z = B w$ splits into $x = B w$ and $A z = x$.]

Direct part, $x = B w$, finite impulse response support (FIR), $O(s\,n)$:

$x_i = b_s w_{i-s} + \cdots + b_1 w_{i-1} + b_0 w_i + b_{-1} w_{i+1} + \cdots + b_{-s} w_{i+s}$

The direct part is what we think of as convolution; it is like a matrix multiplication.

Recursive part, $A z = x$, infinite impulse response support (IIR):

$a_r z_{i-r} + \cdots + a_1 z_{i-1} + a_0 z_i + a_{-1} z_{i+1} + \cdots + a_{-r} z_{i+r} = x_i$

The recursive part is the inverse of a convolution; it is like a linear system.

slide-12
SLIDE 12

General model for convolutions

  • Decompose recursive part into causal and anticausal passes

[Diagram: input $w$, direct, $x$, causal, $y$, anticausal, output $z$; $A z = x$ factors into $D y = x$ and $E z = y$.]

Causal part is forward-substitution, $D y = x$, $O(r\,n)$:

$y_i = \frac{1}{\sqrt{a_0}} x_i - d_1 y_{i-1} - \cdots - d_r y_{i-r}$

Anticausal part is back-substitution, $E z = y$, $O(r\,n)$:

$z_i = \frac{1}{\sqrt{a_0}} y_i - e_1 z_{i+1} - \cdots - e_r z_{i+r}$

$D$ is $\sqrt{a_0}$ times a unit lower-triangular banded matrix with constant subdiagonals $d_1, \ldots, d_r$; $E$ is $\sqrt{a_0}$ times a unit upper-triangular banded matrix with constant superdiagonals $e_1, \ldots, e_r$.
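The two recurrences above translate almost directly into code. This is a minimal sequential sketch, assuming zero initial feedback unless `init` is given; the function names and the `init` parameter are hypothetical and not part of the talk or of any library.

```python
def causal_pass(x, d, a0=1.0, init=None):
    """y[i] = x[i]/sqrt(a0) - d[0]*y[i-1] - ... - d[r-1]*y[i-r].

    init, if given, holds the r most recent outputs before x[0],
    nearest first: [y[-1], y[-2], ..., y[-r]]. Defaults to zeros.
    """
    r = len(d)
    prev = list(init) if init is not None else [0.0] * r
    scale = a0 ** -0.5
    y = []
    for xi in x:
        yi = scale * xi - sum(dk * pk for dk, pk in zip(d, prev))
        y.append(yi)
        prev = [yi] + prev[:-1]  # newest feedback goes first
    return y


def anticausal_pass(y, e, a0=1.0, init=None):
    """z[i] = y[i]/sqrt(a0) - e[0]*z[i+1] - ... - e[r-1]*z[i+r].

    Implemented as the causal recurrence run over the reversed sequence;
    init holds [z[n], z[n+1], ..., z[n+r-1]] for an input of length n.
    """
    return causal_pass(y[::-1], e, a0, init)[::-1]
```

For example, `causal_pass([1.0, 0.0, 0.0, 0.0], [-0.5])` produces the decaying impulse response of a first-order filter. The question the rest of the talk addresses is what to pass as `init` so that the result matches filtering an infinitely extended input.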

slide-13
SLIDE 13

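Downsampling with cardinal cubic B-splines

[Figure: an input image and prefiltered and post-processed outputs, with plots over t of the Catmull-Rom, cubic B-spline, and cardinal cubic B-spline kernels. The cardinal cubic B-spline has infinite support and is evaluated with a recursive filter.] [Nehab & Hoppe 2014]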

slide-14
SLIDE 14

Fast image blur

  • Blur with FIR filter (given σ): direct convolution with s = 2σ, 2σn operations
  • Blur with IIR filter (any σ): causal pass with r = 3 and anticausal pass with r = 3, 6n operations [van Vliet et al. 1998]
  • Blur with FFT: n log n operations

slide-15
SLIDE 15

What to do near input boundaries?

slide-16
SLIDE 16

Infinite input extensions

  • Repeat periodically
  • Reflect periodically
  • Constant padding

[Diagram: each extended input is passed through the filter.]
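One way to make the three extensions concrete is a function that returns the sample at any integer index. This is an illustrative sketch with hypothetical names; the reflection convention (the border sample is repeated, giving period 2n) and the single `pad` value used on both sides are assumptions, and the slides also allow a different constant on each side.

```python
def extended_sample(x, i, mode, pad=0.0):
    """Sample i of the infinite extension of the finite sequence x."""
    n = len(x)
    if 0 <= i < n:
        return x[i]
    if mode == "repeat":       # ... x0 x1 x2 | x0 x1 x2 | x0 x1 x2 ...
        return x[i % n]
    if mode == "reflect":      # ... x2 x1 x0 | x0 x1 x2 | x2 x1 x0 ...
        j = i % (2 * n)
        return x[j] if j < n else x[2 * n - 1 - j]
    if mode == "constant":     # ... pad pad  | x0 x1 x2 | pad pad ...
        return pad
    raise ValueError("unknown extension mode: " + mode)


x = [1, 2, 3]
repeat = [extended_sample(x, i, "repeat") for i in range(-2, 5)]
reflect = [extended_sample(x, i, "reflect") for i in range(-2, 5)]
constant = [extended_sample(x, i, "constant") for i in range(-2, 5)]
```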

slide-17
SLIDE 17

Tileable textures

  • Textures designed to be tiled in a certain way
  • Filtering must respect the periodicity

[Figure: original texture, and the filtered, tiled, and shifted result with periodic repetition and with clamp to border.]

slide-18
SLIDE 18

Dealing with boundaries in practice

  • In the frequency domain
  • DFT/DCT imply infinite extensions
  • Computations are exact even for IIR filters
  • In the time domain
  • Direct convolution can define out-of-bounds inputs arbitrarily
  • Recursive filters must define out-of-bounds outputs!
slide-19
SLIDE 19

Approximation by input padding

  • The input is padded, the padded input is filtered, and the result is cropped to give an approximation of the output
  • Padding wastes memory and filtering the padding wastes computation; more padding wastes more of both
  • The amount of padding depends on the impulse response "support"

[Diagram: input, padded input, filtered result, and approximate output, for increasing amounts of padding.]
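How much padding is enough can be seen on a first-order causal filter. The sketch below is an illustration under stated assumptions, not the talk's code: the recurrence is written as y[i] = x[i] + a*y[i-1], the input is extended on the left by repeating its first sample, and all names are hypothetical. For that extension the exact incoming feedback is the steady state x[0]/(1 - a).

```python
def causal_first_order(x, a, y_prev=0.0):
    """y[i] = x[i] + a * y[i-1], starting from the feedback y_prev."""
    y = []
    for xi in x:
        y_prev = xi + a * y_prev
        y.append(y_prev)
    return y


def filter_with_padding(x, a, pad):
    """Approximate the extended-input result: pad, filter, then crop."""
    padded = [x[0]] * pad + list(x)
    return causal_first_order(padded, a)[pad:]


x = [4.0, 1.0, 3.0]
a = 0.9  # slowly decaying impulse response, so a long effective support

# Exact result: start from the steady state reached on a constant input.
exact = causal_first_order(x, a, y_prev=x[0] / (1.0 - a))

# Error in the first output sample for increasing amounts of padding.
errors = [abs(filter_with_padding(x, a, pad)[0] - exact[0])
          for pad in (0, 8, 64)]
```

The error shrinks geometrically, like a to the power of the padding length, so a filter whose impulse response decays slowly needs a lot of padding for a small error.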
slide-20
SLIDE 20

Outline of talk

  • Introduction
  • Recursive filters are very useful
  • Initialization at the boundaries is an important problem
  • Exact recursive filtering of infinite input extensions
  • Closed-form formulas available for the first time
  • Enable simple and effective algorithms
  • Parallelization
  • Fastest recursive filtering algorithms to date
  • First to filter infinite extensions exactly
slide-21
SLIDE 21

Exact recursive filtering of finite input

[Diagram: the finite input x_1 ... x_6 with infinite input extensions on both sides (..., x_-1, x_0 and x_7, x_8, ...); the causal pass produces y over the whole extended sequence, and the anticausal pass produces z, of which only the finite output z_1 ... z_6 is needed.]

slide-22
SLIDE 22

Exact recursive filtering of finite input

  • Instead, first obtain the initial feedbacks
  • To initialize causal pass
  • To initialize subsequent anticausal pass
  • How to obtain these feedbacks in closed form?
  • Depends on the choice of infinite input extension

[Diagram: the finite input x_1 ... x_6 with input extensions on both sides; the initial causal feedback y_0 starts the causal pass y_1 ... y_6, and the initial anticausal feedback z_7 starts the anticausal pass that produces the finite output z_6 ... z_1.]

slide-23
SLIDE 23

Constant padding

[Diagram: the finite input x_1 ... x_6 is extended by the constant x_0 on the left and the constant x_7 on the right. The causal outputs ..., y_-1, y_0 and y_7, y_8, ... and the anticausal outputs z_7, z_8, ... over the extensions are infinite series; the finite output is z_1 ... z_6.]

Initial causal feedback: $y_0 = M_1 x_0$

Initial anticausal feedback: $z_7 = M_2 y_6 + M_3 x_7$

Precomputed $r \times r$ matrices:

$S_F = (I - A_F^r)^{-1}, \qquad M_1 = S_F \bar{A}_F$

$S_R = (I - A_R^r)^{-1}, \qquad S_{RF} - A_R^r S_{RF} A_F^r = \bar{A}_R$

$M_2 = S_{RF} A_F^r, \qquad M_3 = (S_R \bar{A}_R - S_{RF} A_F^r) M_1$
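For a first-order filter the matrices reduce to scalars that can be derived by summing geometric series. The sketch below is such a derivation written for illustration, not the paper's general r × r formulas: it assumes a_0 = 1, the recurrences y[i] = x[i] - d*y[i-1] and z[i] = y[i] - e*z[i+1] with |d| < 1 and |e| < 1, and hypothetical names throughout. A brute-force version with long padding is included only as a reference.

```python
def exact_constant_padding(x, d, e, left, right):
    """Filter x as if extended by `left` before it and `right` after it."""
    m1 = 1.0 / (1.0 + d)                       # steady state of causal pass
    m2 = -d / (1.0 - d * e)                    # weight of last causal output
    m3 = (1.0 / (1.0 + e) + d / (1.0 - d * e)) * m1   # weight of `right`

    y, y_prev = [], m1 * left                  # initial causal feedback
    for xi in x:
        y_prev = xi - d * y_prev
        y.append(y_prev)

    z_next = m2 * y[-1] + m3 * right           # initial anticausal feedback
    z = [0.0] * len(x)
    for i in range(len(x) - 1, -1, -1):
        z_next = y[i] - e * z_next
        z[i] = z_next
    return z


def brute_force(x, d, e, left, right, pad=400):
    """Reference: pad generously, filter with zero feedback, then crop."""
    ext = [left] * pad + list(x) + [right] * pad
    y, prev = [], 0.0
    for v in ext:
        prev = v - d * prev
        y.append(prev)
    z, nxt = [0.0] * len(ext), 0.0
    for i in range(len(ext) - 1, -1, -1):
        nxt = y[i] - e * nxt
        z[i] = nxt
    return z[pad:pad + len(x)]
```

The exact version touches only the finite input, while the brute-force version filters 800 extra samples to reach comparable accuracy for these coefficients.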

slide-24
SLIDE 24

Periodic repetition

[Diagram: the finite input x_1 ... x_6 repeats periodically on both sides as the input extension. Passes over the finite input: 1st causal ($\dot{y}$), 2nd causal ($y$), 1st anticausal ($\dot{z}$), 2nd anticausal ($z$, the finite output).]

Initial causal feedback: $y_6 = M_4\, \dot{y}_6$, with $M_4 = (I - A_F^n)^{-1}$

Initial anticausal feedback: $z_1 = M_5\, \dot{z}_1$, with $M_5 = (I - A_R^n)^{-1}$
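The four passes can be sketched for a first-order filter, where the matrices M4 and M5 become scalars. This is an illustration under assumptions, not the talk's implementation: a_0 = 1, the recurrences are y[i] = x[i] - d*y[i-1] and z[i] = y[i] - e*z[i+1], so the scalar counterparts of A_F and A_R are -d and -e, and the function names are hypothetical.

```python
def recur(seq, coef, prev):
    """out[k] = seq[k] - coef * out[k-1], starting from the feedback prev."""
    out = []
    for v in seq:
        prev = v - coef * prev
        out.append(prev)
    return out


def exact_periodic(x, d, e):
    """Filter x as if it repeated periodically in both directions."""
    n = len(x)
    # 1st causal pass with zero feedback; only its last output is needed.
    y_dot_last = recur(x, d, 0.0)[-1]
    # Periodicity gives y[n-1] = y_dot[n-1] + (-d)^n * y[n-1].
    y_last = y_dot_last / (1.0 - (-d) ** n)
    # 2nd causal pass, started from the exact feedback.
    y = recur(x, d, y_last)
    # 1st anticausal pass with zero feedback; only its first output is needed.
    z_dot_first = recur(y[::-1], e, 0.0)[-1]
    z_first = z_dot_first / (1.0 - (-e) ** n)
    # 2nd anticausal pass, started from the exact feedback.
    return recur(y[::-1], e, z_first)[::-1]
```

Each direction is traversed twice, which is the "twice as much computation" that a sequential implementation may pay for exactness.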

slide-25
SLIDE 25

Periodic reflection

[Diagram: the finite input x_1 ... x_6 is reflected periodically on both sides as the input extension. Passes over the finite input: 1st causal ($\dot{y}$), 1st anticausal ($\ddot{z}$), 2nd causal ($y$), 2nd anticausal ($z$, the finite output).]

Initial causal feedback: $y_{12} = M_6\, \dot{y}_6 + M_7\, \ddot{z}_1$

Initial anticausal feedback: $M_8\, \dot{y}_6 + M_9\, y_{12}$

$M_6 = (I - A_F^{2h})^{-1} (A_F^h - K A_R^h \bar{A}_F^{-1} A_F^r \bar{A}_R)$

$M_7 = (I - A_F^{2h})^{-1} K \bar{A}_R^{-1} (I - A_F^r A_R^r)$

$M_8 = (K - A_R^r)^{-1} \bar{A}_R$

$M_9 = (K - A_R^r)^{-1} \bar{A}_R A_F^h$

slide-26
SLIDE 26

Is this correct? Is this fast?

  • All required $r \times r$ matrices $M_1$ to $M_9$ exist and can be precomputed
  • Proofs in the paper assume only filter stability
  • Exact filtering for all infinite extensions takes $O(r\,n)$
  • May require twice as much computation
  • Real advantage comes with parallelization
  • No additional cost
slide-27
SLIDE 27

Outline of talk

  • Introduction
  • Recursive filters are very useful
  • Initialization at the boundaries is an important problem
  • Exact recursive filtering of infinite input extensions
  • Closed-form formulas available for the first time
  • Enable simple and effective algorithms
  • Parallelization
  • Fastest recursive filtering algorithms to date
  • First to filter infinite extensions exactly
slide-28
SLIDE 28

Modern GPU

  • Many multiprocessors, each supporting many hardware threads
  • 10k threads is typical
  • On-chip shared memory within each multiprocessor
  • 48 KB shared by the threads local to each multiprocessor
  • Global memory
  • High throughput but high latency
  • The challenge is hiding latency and keeping all cores busy

[Diagram: a GPU with several multiprocessors, each with its own threads and shared memory, all connected to global memory.]

slide-29
SLIDE 29

Independent rows then columns

[Diagram: each row of the input x_1 ... x_16 is processed independently, first by a causal pass that produces y_1 ... y_16 and then by an anticausal pass that produces the output z_1 ... z_16.] [Ruijters and Thevenaz 2010]

slide-30
SLIDE 30

Summary of parallel algorithms

r is the filter order, p is the number of processors, h is the image height, w is the image width

Algorithm                          Step complexity   Threads   Bandwidth
independent rows then columns      4r hw/p           h, w      8hw

slide-31
SLIDE 31

Performance on Fermi (GF100)

[Plot: throughput (GiP/s) versus input size (64² to 4096² pixels), 1st-order filter. Series: independent rows then columns.] [Nehab, Maximo, Lima & Hoppe 2011]

slide-32
SLIDE 32

Split rows and columns into blocks

[Diagram: the input x_1 ... x_16 is split into blocks. Each block runs a 1st causal pass ($\dot{y}$); the feedbacks at block boundaries are corrected; a 2nd causal pass gives $y$. The same is done with a 1st anticausal pass ($\dot{z}$), corrected feedbacks, and a 2nd anticausal pass that gives the output $z$.]

Feedback corrections across blocks of size $b$:

$y_{bi} = A_F^b\, y_{b(i-1)} + \dot{y}_{bi}$

$z_{bi} = A_R^b\, z_{b(i+1)} + \dot{z}_{bi}$

[Sung & Mitra 1986; Nehab, Maximo, Lima & Hoppe 2011]
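The block decomposition can be sketched for a first-order causal filter, where the power of A_F applied across a block becomes a scalar power. This is a sequential Python illustration of the data flow only, under assumptions: the recurrence is y[i] = x[i] + a*y[i-1] with zero feedback before the input, the names are hypothetical, and the loops marked "parallel" are the ones that would run concurrently on a GPU.

```python
def run(block, a, prev):
    """y[k] = block[k] + a * y[k-1], starting from the feedback prev."""
    out = []
    for v in block:
        prev = v + a * prev
        out.append(prev)
    return out


def causal_blocked(x, a, b):
    blocks = [x[s:s + b] for s in range(0, len(x), b)]

    # Stage 1 (parallel over blocks): zero-feedback pass, keep the last
    # output of each block (the incomplete feedback, y-dot).
    tails_dot = [run(blk, a, 0.0)[-1] for blk in blocks]

    # Stage 2 (sequential over blocks, but only one value per block):
    # y_tail[i] = a^b * y_tail[i-1] + y_dot_tail[i].
    feedbacks, prev = [], 0.0
    for blk, tail_dot in zip(blocks, tails_dot):
        feedbacks.append(prev)           # feedback entering this block
        prev = a ** len(blk) * prev + tail_dot

    # Stage 3 (parallel over blocks): redo each block with exact feedback.
    y = []
    for blk, fb in zip(blocks, feedbacks):
        y.extend(run(blk, a, fb))
    return y
```

The sequential stage handles one feedback per block rather than every sample, which is what exposes enough parallelism to keep many threads busy, at the cost of reading the input twice.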

slide-33
SLIDE 33

Summary of parallel algorithms

r is the filter order, p is the number of processors, h is the image height, w is the image width

Algorithm                          Step complexity   Threads   Bandwidth
independent rows then columns      4r hw/p           h, w      8hw
+ split rows and columns           ≈ 8r hw/p         hw/b      ≈ 9hw

slide-34
SLIDE 34

Performance on Fermi (GF100)

[Plot: throughput (GiP/s) versus input size (64² to 4096² pixels), 1st-order filter. Series: independent rows then columns; + split rows and columns.] [Nehab, Maximo, Lima & Hoppe 2011]

slide-35
SLIDE 35

Overlap causal-anticausal processing

[Diagram: on each block of the input x_1 ... x_16, the 1st causal pass ($\dot{y}$) is followed immediately by the 1st anticausal pass ($\ddot{z}$); the correct feedbacks $y$ and $z$ at block boundaries are then computed; finally the 2nd causal and 2nd anticausal passes produce the output.]

$y_{bi} = A_F^b\, y_{b(i-1)} + \dot{y}_{bi}$

$z_{bi} = A_R^b\, z_{b(i+1)} + H\, y_{b(i-1)} + \ddot{z}_{bi}$, with $H = A_{RB} A_{FP}$

[Nehab, Maximo, Lima & Hoppe 2011]
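For a first-order filter the extra term H can be written out explicitly. The sketch below is derived here for illustration and is not the paper's code: it assumes the recurrences y[i] = x[i] + a*y[i-1] and z[i] = y[i] + c*z[i+1] with zero feedback outside the input, an input length that is a multiple of the block size b, and hypothetical names. In this scalar case H is the sum of c^(j-1) * a^j for j = 1 ... b, the effect of the incoming causal feedback on the anticausal value at the start of the block.

```python
def run(seq, coef, prev):
    """out[k] = seq[k] + coef * out[k-1], starting from the feedback prev."""
    out = []
    for v in seq:
        prev = v + coef * prev
        out.append(prev)
    return out


def overlapped_blocks(x, a, c, b):
    assert len(x) % b == 0, "sketch assumes the length is a multiple of b"
    blocks = [x[s:s + b] for s in range(0, len(x), b)]
    nb = len(blocks)

    # Stage 1 (parallel): causal then anticausal on each block, both with
    # zero feedback, so the block is read only once.
    y_dot_tail, z_ddot_head = [], []
    for blk in blocks:
        y_dot = run(blk, a, 0.0)
        z_ddot = run(y_dot[::-1], c, 0.0)[::-1]
        y_dot_tail.append(y_dot[-1])
        z_ddot_head.append(z_ddot[0])

    # Scalar counterpart of H for a block of size b.
    h_coef = sum(c ** (j - 1) * a ** j for j in range(1, b + 1))

    # Stage 2 (sequential over blocks): fix causal, then anticausal feedbacks.
    y_tail, prev = [], 0.0
    for i in range(nb):
        prev = a ** b * prev + y_dot_tail[i]
        y_tail.append(prev)
    z_head, nxt = [0.0] * nb, 0.0
    for i in range(nb - 1, -1, -1):
        y_prev = y_tail[i - 1] if i > 0 else 0.0
        nxt = c ** b * nxt + h_coef * y_prev + z_ddot_head[i]
        z_head[i] = nxt

    # Stage 3 (parallel): redo both passes per block with exact feedbacks.
    z = []
    for i, blk in enumerate(blocks):
        y_prev = y_tail[i - 1] if i > 0 else 0.0
        z_next = z_head[i + 1] if i + 1 < nb else 0.0
        y = run(blk, a, y_prev)
        z.extend(run(y[::-1], c, z_next)[::-1])
    return z
```

Because the anticausal feedbacks are corrected without first materializing the corrected causal output, the intermediate image never has to be written to and read back from global memory, which is where the bandwidth saving in the summary table comes from.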

slide-36
SLIDE 36

Summary of parallel algorithms

r is the filter order, p is the number of processors, h is the image height, w is the image width

Algorithm                          Step complexity   Threads   Bandwidth
independent rows then columns      4r hw/p           h, w      8hw
+ split rows and columns           ≈ 8r hw/p         hw/b      ≈ 9hw
+ overlap causal with anticausal   ≈ 8r hw/p         hw/b      ≈ 5hw

slide-37
SLIDE 37

Performance on Fermi (GF100)

[Plot: throughput (GiP/s) versus input size (64² to 4096² pixels), 1st-order filter. Series: independent rows then columns; + split rows and columns; + overlap causal and anticausal.] [Nehab, Maximo, Lima & Hoppe 2011]

slide-38
SLIDE 38

Overlap causal, anticausal, rows, columns

[Nehab, Maximo, Lima & Hoppe 2011]

slide-39
SLIDE 39

Stage 1

slide-40
SLIDE 40

Stage 2

$y_{m,n} = A_F^b\, y_{m-1,n} + \dot{y}_{m,n}$

$z_{m,n} = A_R^b\, z_{m+1,n} + H\, y_{m-1,n} + \ddot{z}_{m,n}$, with $H = A_{RB} A_{FP}$

[Equations for the feedbacks in the other direction: $u_{m,n}$ is obtained from $u_{m,n-1}$ and $v_{m,n}$ from $v_{m,n+1}$, with correction terms that involve $y_{m-1,n}$, $z_{m+1,n}$, and transposed products of $A_F^b$, $A_R^b$, $A_{FB}$, $A_{FP}$, $A_{RB}$, and $A_{RE}$.]

slide-41
SLIDE 41

Stage 3

slide-42
SLIDE 42

Stage 3

  • output
slide-43
SLIDE 43

Summary of parallel algorithms

r is the filter order, p is the number of processors, h is the image height, w is the image width

Algorithm                          Step complexity   Threads   Bandwidth
independent rows then columns      4r hw/p           h, w      8hw
+ split rows and columns           ≈ 8r hw/p         hw/b      ≈ 9hw
+ overlap causal with anticausal   ≈ 8r hw/p         hw/b      ≈ 5hw
+ overlap rows with columns        ≈ 8r hw/p         hw/b      ≈ 3hw

slide-44
SLIDE 44

Performance on Fermi (GF100)

[Plot: throughput (GiP/s) versus input size (64² to 4096² pixels), 1st-order filter. Series: independent rows then columns; + split rows and columns; + overlap causal and anticausal; + overlap rows and columns.] [Nehab, Maximo, Lima & Hoppe 2011]

slide-45
SLIDE 45

Performance on Fermi (GF100)

[Plot: throughput (GiP/s) versus input size (64² to 4096² pixels), 2nd-order filter. Series: overlap causal and anticausal; + overlap rows and columns.] [Nehab, Maximo, Lima & Hoppe 2011]

slide-46
SLIDE 46

เดบ ๐ฏ๐‘›,๐‘œ ๐ฏ๐‘›,๐‘œ ๐ฏ๐‘›,๐‘œโˆ’1

= ๐‘ฉ๐บ

๐‘ ๐‘ข +

๐‘ฉ๐‘†๐น + ๐‘ฉ๐‘†๐ถ๐‘ฉ๐บ๐ถ ๐‘ˆ ๐‘ฉ๐บ๐ถ ๐‘ข +

๐ด๐‘›+1,๐‘œ ๐ณ๐‘›โˆ’1,๐‘œ ๐ณ๐‘›โˆ’1,๐‘œ

๐‘ฉ๐‘†

๐‘ ๐‘ข +

๐ด๐‘›+1,๐‘œ ๐ฐ๐‘›,๐‘œ ๐ฐ๐‘›,๐‘œ+1

= + ๐‘ฉ๐‘†๐ถ๐‘ฉ๐บ๐‘„ ๐‘ฉ๐‘†๐น

๐ฏ๐‘›,๐‘œโˆ’1

๐ผ ๐‘ฉ๐‘†๐ถ ๐‘ฉ๐บ๐ถ ๐‘ข +

แˆทแˆท ๐ฐ๐‘›,๐‘œ

๐ผ ๐‘ฉ๐‘†๐ถ ๐‘ฉ๐บ๐‘„ ๐‘ข + ๐‘ฉ๐‘†

๐‘ ๐‘ข +

๐ฐ๐‘›,๐‘œ ๐ฐ๐‘›,๐‘œ+1

=

๐ฏ๐‘›,๐‘œโˆ’1

๐ผ ๐‘ฉ๐‘†๐ถ ๐‘ฉ๐บ๐‘„ ๐‘ข + เดค

๐ฐ๐‘›,๐‘œ ๐ฏ๐‘›,๐‘œ ๐ฏ๐‘›,๐‘œโˆ’1

= ๐‘ฉ๐บ

๐‘ ๐‘ข + เดฅ

๐ฏ๐‘›,๐‘œ

New trick for better performance

แˆถ ๐ณ๐‘›,๐‘œ ๐ณ๐‘›,๐‘œ ๐ณ๐‘›โˆ’1,๐‘œ

= ๐‘ฉ๐บ

๐‘

+

แˆท ๐ด๐‘›,๐‘œ ๐ด๐‘›,๐‘œ ๐ด๐‘›+1,๐‘œ

= ๐‘ฉ๐‘†

๐‘

+ ๐ผ ๐‘ฉ๐‘†๐ถ ๐‘ฉ๐บ๐‘„ ๐ณ๐‘›โˆ’1,๐‘œ + sequentially find per-block column feedbacks sequentially find per-block row feedbacks

เดฅ ๐ฏ๐‘›,๐‘œ เดบ ๐ฏ๐‘›,๐‘œ

= ๐‘ฉ๐‘†๐น + ๐‘ฉ๐‘†๐ถ๐‘ฉ๐บ๐ถ ๐‘ˆ ๐‘ฉ๐บ๐ถ ๐‘ข +

๐ด๐‘›+1,๐‘œ ๐ณ๐‘›โˆ’1,๐‘œ เดค ๐ฐ๐‘›,๐‘œ ๐ด๐‘›+1,๐‘œ

= + ๐‘ฉ๐‘†๐ถ๐‘ฉ๐บ๐‘„ ๐‘ฉ๐‘†๐น

๐ณ๐‘›โˆ’1,๐‘œ

๐ผ ๐‘ฉ๐‘†๐ถ ๐‘ฉ๐บ๐ถ ๐‘ข +

แˆทแˆท ๐ฐ๐‘›,๐‘œ

new fully parallel intermediate stage (exactly the same as column processing)

slide-47
SLIDE 47

Summary of parallel algorithms

r is the filter order, p is the number of processors, h is the image height, w is the image width

Algorithm                          Step complexity   Threads   Bandwidth
independent rows then columns      4r hw/p           h, w      8hw
+ split rows and columns           ≈ 8r hw/p         hw/b      ≈ 9hw
+ overlap causal with anticausal   ≈ 8r hw/p         hw/b      ≈ 5hw
+ overlap rows with columns        ≈ 8r hw/p         hw/b      ≈ 3hw
+ new trick                        ≈ 8r hw/p         hw/b      ≈ 3hw

slide-48
SLIDE 48

Performance on Kepler (GK110)

[Plot: throughput (GiP/s) versus input size (64² to 8192² pixels), shown for 3rd-order and 4th-order filters. Series: overlap causal and anticausal; + overlap rows and columns; + new trick.]

slide-49
SLIDE 49

Performance on Pascal (GP102)

[Plot: throughput (GiP/s) versus input size (64² to 8192² pixels), 5th-order filter. Series: overlap causal and anticausal; + overlap rows and columns; + new trick.]

slide-50
SLIDE 50

แˆถ ๐ณ2 แˆถ ๐ณ4 แˆถ ๐ณ5 แˆถ ๐ณ6 แˆถ ๐ณ1 แˆถ ๐ณ3 1st causal แˆท ๐ด5 แˆท ๐ด3 แˆท ๐ด2 แˆท ๐ด1 แˆท ๐ด6 แˆท ๐ด4 1st anticausal ๐ด5 ๐ด3 ๐ด2 ๐ด1 ๐ด6 ๐ด4 2nd anticausal ๐ณ2 ๐ณ4 ๐ณ5 ๐ณ6 ๐ณ1 ๐ณ3 2nd causal

Back to infinite extensions

entire finite input ๐ฒ2 ๐ฒ4 ๐ฒ5 ๐ฒ6 ๐ฒ1 ๐ฒ3 แˆถ ๐ณ6 ๐ณ12 = ๐‘ต8 + ๐‘ต9 แˆท ๐ด1 แˆถ ๐ณ6 ๐ณ12 = ๐‘ต6 + ๐‘ต7 ๐‘ต7 = ๐‘ฑ โˆ’ ๐‘ฉ๐บ

2โ„Ž โˆ’1๐‘ณ เดฅ

๐‘ฉ๐‘†

โˆ’1 ๐‘ฑ โˆ’ ๐‘ฉ๐บ ๐‘ ๐‘ฉ๐‘† ๐‘ 

๐‘ต6 = ๐‘ฑ โˆ’ ๐‘ฉ๐บ

2โ„Ž โˆ’1 ๐‘ฉ๐บ โ„Ž โˆ’ ๐‘ณ๐‘ฉ๐‘† โ„Ž เดฅ

๐‘ฉ๐บ

โˆ’1๐‘ฉ๐บ ๐‘  เดฅ

๐‘ฉ๐‘† ๐‘ต8 = ๐‘ณ โˆ’ ๐‘ฉ๐‘†

๐‘  โˆ’๐Ÿเดฅ

๐‘ฉ๐‘† ๐‘ต9 = ๐‘ณ โˆ’ ๐‘ฉ๐‘†

๐‘  โˆ’๐Ÿเดฅ

๐‘ฉ๐‘†๐‘ฉ๐บ

โ„Ž

๐ณ12 initial causal feedback initial anticausal feedback entire finite output input extension

โ€ฆ โ€ฆ

input extension แˆถ ๐ณ6 แˆท ๐ด1

slide-51
SLIDE 51

Back to infinite extensions

  • Side effect of 2nd stage of parallel algorithm
  • Just what we need to compute exact initial feedbacks!
slide-52
SLIDE 52

Performance on Kepler (GK110)

[Plot: throughput (GiP/s) versus input size (64² to 8192² pixels), 1st-order filter (cubic B-spline interpolation). Series: [Chaurasia et al. 2015]; [Nehab et al. 2011]; padding; exact reflect; exact repeat; exact clamp to border.]

slide-53
SLIDE 53

Performance on Kepler (GK110)

[Plot: throughput (GiP/s) versus input size (64² to 8192² pixels), 3rd-order filter (2D Gaussian blur with σ = n/6). Series: conv; cuFFT; [Chaurasia et al. 2015]; [Nehab et al. 2011]; padding; exact reflect; exact repeat; exact clamp to border.]

slide-54
SLIDE 54

Summary

  • Introduction
  • Recursive filters are very useful
  • Initialization at the boundaries is an important problem
  • Exact recursive filtering of infinite input extensions
  • Closed-form formulas available for the first time
  • Enable simple and effective algorithms
  • Parallelization
  • Fastest recursive filtering algorithms to date
  • First to filter infinite extensions exactly
slide-55
SLIDE 55

Thank you!

  • Please download and use our code

https://github.com/andmax/gpufilter

  • Questions?