Generic Polyphase Filterbanks with CUDA

Jan Krämer

DLR German Aerospace Center, Communication and Navigation, Satellite Networks, Weßling

04.02.2017

www.dlr.de · Slide 1 of 27 > Generic Polyphase Filterbanks with CUDA > Jan Krämer > 04.02.2017

Outline

  • 1. Motivation
  • 2. Short introduction to CUDA
  • 3. PFBs and the Channelizer
  • 4. Translation to CUDA
  • 5. Results
  • 6. Release plans and future changes

1. Motivation

Once upon a time in a space project

  • Multicarrier scheme with 15/30/45 carriers
  • So let’s just use a PFB, right?

Early Trouble

  • 45 carriers means 45x the bandwidth
  • Only 12-15 % guard band available
  • At least 3x oversampling needed
  • Filters with up to 1500 taps needed

Early Trouble

  • CPU reference implementation
  • 1000 taps
  • 35 dB rejection
  • Originally 9x oversampling
  • 2 Msamples/second achieved ⇒ 4 Msamples/second needed

2. Short introduction to CUDA

What is CUDA

  • NVIDIA's framework for GPGPU
  • Used mainly to accelerate scientific computing
  • Uses the massive number of compute cores available inside a GPU

GPU Interior

  • A GPU consists of several Streaming Multiprocessors (SMs)
  • Each SM consists of numerous compute (CUDA) cores
  • Single-Instruction Multiple-Threads (SIMT) structure
  • Several kinds of memory:
      • Global memory (GDDR5 RAM): slow
      • On-chip (shared) memory per SM: faster
      • Registers: blazingly fast

CUDA Interior

  • CUDA builds an (up to) 3-dimensional Grid
  • The Grid contains the (up to) 3-dimensional Thread Blocks, which contain the threads
  • Groups of 32 threads inside a Thread Block are grouped together ⇒ Warp
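
As a rough illustration of the grouping described above, the following Python sketch (plain arithmetic, not CUDA code) counts the warps in a thread block and maps a thread's 3-D index to its warp. The x-fastest linearization follows the CUDA programming model; the block shape (45, 3, 1) is a hypothetical example and not a configuration taken from the talk.

```python
import math

WARP_SIZE = 32  # threads per warp, as stated on the slide


def linear_thread_index(thread_idx, block_dim):
    """Flatten a 3-D thread index (x, y, z); x varies fastest."""
    x, y, z = thread_idx
    dim_x, dim_y, _ = block_dim
    return x + y * dim_x + z * dim_x * dim_y


def warps_per_block(block_dim):
    """Number of warps needed to hold all threads of one block."""
    threads = block_dim[0] * block_dim[1] * block_dim[2]
    return math.ceil(threads / WARP_SIZE)


def warp_of_thread(thread_idx, block_dim):
    """Index of the warp (inside the block) that a thread belongs to."""
    return linear_thread_index(thread_idx, block_dim) // WARP_SIZE


block_dim = (45, 3, 1)  # hypothetical block shape, 135 threads
print(warps_per_block(block_dim))            # 5 (the last warp is only partly filled)
print(warp_of_thread((0, 1, 0), block_dim))  # 1 (linear index 45)
```

A block whose thread count is not a multiple of 32 still occupies a whole number of warps, so the last warp runs with some lanes idle.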

Thread Execution

  • Each Block has a unique ID inside the Grid ⇒ each thread has a unique global ID
  • The thread scheduler assigns each Thread Block to one SM, where the blocks are executed concurrently
  • All threads in a Warp are executed concurrently inside the SM
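
The unique global ID can be sketched for the one-dimensional case. The Python model below mirrors the usual CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`; the grid and block sizes are hypothetical values chosen only for illustration.

```python
def global_thread_id(block_idx, block_dim, thread_idx):
    """1-D global thread ID, modelled after blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


grid_dim = 4    # hypothetical number of blocks in the grid
block_dim = 64  # hypothetical number of threads per block

ids = [
    global_thread_id(b, block_dim, t)
    for b in range(grid_dim)
    for t in range(block_dim)
]

# Every (block, thread) pair maps to a distinct ID in 0 .. grid_dim * block_dim - 1.
print(len(ids) == len(set(ids)))  # True
```

Because the ID is unique, each thread can use it directly as the index of the sample it is responsible for.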

Performance Bottlenecks

  • Uncoalesced loads from global memory ⇒ several cache lines have to be loaded
  • Bank conflicts when accessing shared memory
  • Branching ⇒ which instruction should be executed?
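
The first two bottlenecks can be made concrete with a small counting model in Python. It is only a sketch under stated assumptions: 128-byte cache lines, 32 shared-memory banks that are one 4-byte word wide, and 4-byte elements. These values are typical for many NVIDIA GPUs but are not taken from the talk.

```python
WARP_SIZE = 32
CACHE_LINE_BYTES = 128  # assumption, not stated in the talk
NUM_BANKS = 32          # assumption: 32 banks, each one 4-byte word wide


def cache_lines_touched(stride_elems, elem_bytes=4):
    """Cache lines one warp touches when thread t loads element t * stride."""
    lines = {
        (t * stride_elems * elem_bytes) // CACHE_LINE_BYTES
        for t in range(WARP_SIZE)
    }
    return len(lines)


def bank_conflict_degree(stride_words):
    """Largest number of distinct words that one warp requests from a single bank."""
    words_per_bank = {}
    for t in range(WARP_SIZE):
        word = t * stride_words
        words_per_bank.setdefault(word % NUM_BANKS, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


for stride in (1, 2, 32):
    print(stride, cache_lines_touched(stride), bank_conflict_degree(stride))
# 1 1 1    -> coalesced load, conflict-free shared access
# 2 2 2
# 32 32 32 -> one cache line per thread, 32-way bank conflict
```

In this model a stride of 33 words is conflict-free again, which is why padding a shared-memory array by one element is a common remedy.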

3. PFBs and the Channelizer

Why PFBs and Channelizers/Synthesizers?

  • Used to reduce computational complexity for resampling filters
  • Used to separate small-bandwidth channels
  • Used to generate multicarrier ’broadband’ signals

Structure of a PFB Channelizer

Extracting a channel with 1/N of the total bandwidth:

  • Mix the signal to baseband
  • Apply an anti-alias filter
  • Downsample the signal

An N-phase PFB splits the one-dimensional filter into its N different phase partitions
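
The split into phase partitions can be sketched in a few lines of Python. This is a minimal illustration, not the talk's implementation: partition p holds taps p, p+N, p+2N, ... of the prototype filter, and summing the N short branch filters gives the same samples as filtering at the full rate and then keeping every N-th output. The tap and sample values are arbitrary integers chosen so the comparison is exact.

```python
def polyphase_partitions(taps, n_phases):
    """Partition p holds taps p, p+N, p+2N, ... (zero-padded to equal length)."""
    padded = list(taps) + [0] * (-len(taps) % n_phases)
    return [padded[p::n_phases] for p in range(n_phases)]


def fir_decimate(x, taps, n):
    """Reference: FIR filter at the full rate, then keep every n-th output."""
    return [
        sum(h * x[m - k] for k, h in enumerate(taps) if m - k >= 0)
        for m in range(0, len(x), n)
    ]


def polyphase_decimate(x, taps, n):
    """Same outputs, computed as the sum of n short branch filters."""
    parts = polyphase_partitions(taps, n)
    out = []
    for m in range(0, len(x), n):
        acc = 0
        for p, part in enumerate(parts):
            for j, h in enumerate(part):
                idx = m - p - j * n  # sample seen by tap j of branch p
                if idx >= 0:
                    acc += h * x[idx]
        out.append(acc)
    return out


taps = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # arbitrary example prototype filter
x = list(range(1, 17))                  # arbitrary example input

print(polyphase_partitions(taps, 4))  # [[1, 5, 9], [2, 6, 10], [3, 7, 0], [4, 8, 0]]
print(polyphase_decimate(x, taps, 4) == fir_decimate(x, taps, 4))  # True
```

The benefit is that every branch runs at the low output rate, so no output sample is computed only to be discarded by the downsampler.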

Structure of a PFB Channelizer

  • Taps of the regular prototype filter
  • Split into 4 polyphase partitions
  • Newly structured dataflow
  • FFT separates all the channels
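
A small numerical check of this dataflow, written as a Python sketch with a plain DFT sum instead of an FFT library: the polyphase branches followed by a DFT across the branches reproduce, for every channel, what mixing to baseband, filtering and downsampling would give. The taps and the test signal are arbitrary example values. With the commutator order used here the transform appears with a positive exponent (an unnormalized inverse DFT); a forward FFT gives the same result when the branches are fed in reverse order.

```python
import cmath


def split_taps(taps, n_ch):
    """Partition p holds taps p, p+N, p+2N, ... (zero-padded)."""
    padded = list(taps) + [0.0] * (-len(taps) % n_ch)
    return [padded[p::n_ch] for p in range(n_ch)]


def direct_channel(x, taps, n_ch, k):
    """Mix channel k to baseband, low-pass filter, keep every n_ch-th sample."""
    mixed = [s * cmath.exp(-2j * cmath.pi * k * n / n_ch) for n, s in enumerate(x)]
    return [
        sum(h * mixed[m - i] for i, h in enumerate(taps) if m - i >= 0)
        for m in range(0, len(x), n_ch)
    ]


def pfb_channelizer(x, taps, n_ch):
    """Polyphase branches followed by a DFT across the branch outputs."""
    parts = split_taps(taps, n_ch)
    frames = []
    for m in range(0, len(x), n_ch):
        # One output sample per branch for this block of n_ch input samples.
        v = [
            sum(h * x[m - p - j * n_ch]
                for j, h in enumerate(part) if m - p - j * n_ch >= 0)
            for p, part in enumerate(parts)
        ]
        # The DFT across the branches separates all channels at once.
        frames.append([
            sum(v[p] * cmath.exp(2j * cmath.pi * k * p / n_ch) for p in range(n_ch))
            for k in range(n_ch)
        ])
    return frames  # frames[m][k]: output sample m of channel k


n_ch = 4
taps = [0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1, 0.05, 0.05]  # arbitrary example
x = [
    cmath.exp(2j * cmath.pi * 1 * n / n_ch) + 0.5 * cmath.exp(2j * cmath.pi * 3 * n / n_ch)
    for n in range(32)
]  # one tone in channel 1, a weaker one in channel 3

frames = pfb_channelizer(x, taps, n_ch)
max_err = max(
    abs(frames[i][k] - y)
    for k in range(n_ch)
    for i, y in enumerate(direct_channel(x, taps, n_ch, k))
)
print(max_err < 1e-9)  # True
```

The direct route repeats the full filter once per channel, while the channelizer runs the filter once and shares it between all channels through the transform.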

Structure of a PFB Channelizer

  • Oversampling can be achieved by manipulating the input commutator and the FFT input
  • To synthesize several incoming channels, just reorder the operations

4. Translation to CUDA

Identifying necessary operations

The channelizer consists of 4 operations:

  • Shuffle the input stream
  • Polyphase filtering
  • FFT
  • Shuffle the output stream

Input Shuffling

  • Input commutator implemented as matrix traversal
  • The number of threads needs to accommodate the filter history ⇒ the grid dimension takes care of this
  • Input buffer reads are coalesced ⇒ block x-dimension has the same size as a polyphase partition
  • Intermediate buffer writes are therefore not coalesced
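
One way to picture the matrix traversal is the Python sketch below. It is a simplified model and not the talk's kernel: the loop variable `t` stands in for a global thread ID, the filter history is ignored, and the layout of the intermediate buffer (one contiguous run of samples per polyphase partition) is an assumption.

```python
def shuffle_input(stream, n_parts):
    """Distribute a sample stream onto n_parts polyphase partitions.

    The stream is read as a matrix with n_parts columns (one row per frame)
    and written out transposed, so each partition's samples are contiguous.
    """
    n_frames = len(stream) // n_parts
    intermediate = [None] * (n_frames * n_parts)
    for t in range(n_frames * n_parts):  # t plays the role of a global thread ID
        frame, part = divmod(t, n_parts)
        # Read index t: neighbouring threads read neighbouring samples (coalesced).
        # Write index: neighbouring threads land n_frames elements apart.
        intermediate[part * n_frames + frame] = stream[t]
    return intermediate


print(shuffle_input(list(range(12)), 4))
# [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
```

The output shows the trade-off named on the slide: consecutive reads, strided writes.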

Filter Operations

  • Block X dimension computes several input samples
  • Block Y dimension computes oversampled output samples
  • Grid X dimension represents the polyphase partitions
  • Grid Y dimension provides additional concurrency (due to block thread limits)

Filter Operations

  • Each thread block transfers its data from global memory to shared memory
  • Each sample is accessed several times ⇒ shared memory offers faster memory transfers
  • Register and shared memory spills are avoided
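
How often a sample is reused can be estimated with simple arithmetic, sketched here in Python. The 1501 taps and 45 partitions are the figures of the 45-channel case from the talk; the 256 outputs per block are a hypothetical value, and the model looks at a single partition without oversampling.

```python
import math


def reuse_estimate(n_taps, n_parts, outputs_per_block):
    """Reads versus distinct samples for one partition's sliding-window filter."""
    taps_per_part = math.ceil(n_taps / n_parts)
    reads = outputs_per_block * taps_per_part         # one read per tap and output
    distinct = outputs_per_block + taps_per_part - 1  # window slides by one sample
    return taps_per_part, reads, distinct


taps_per_part, reads, distinct = reuse_estimate(1501, 45, 256)
print(taps_per_part, reads, distinct)  # 34 8704 289
```

Under these assumptions each sample is read about 30 times on average, so loading it once from global memory and serving the repeated reads from shared memory saves most of the slow traffic.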

FFT and Output Shuffling

  • The FFT is CUDA's cuFFT
  • Output shuffling is implemented as a double loop on the host CPU (for now)
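
A host-side double loop of this kind could look like the following Python sketch. The buffer layout is an assumption made for illustration: the FFT results are taken to be stored frame by frame (channel k of frame m at index m * N + k), and the goal is one contiguous sample stream per channel. The talk does not specify the actual layout.

```python
def shuffle_output(fft_out, n_ch):
    """Regroup frame-major FFT results into one sample list per channel."""
    n_frames = len(fft_out) // n_ch
    channels = [[None] * n_frames for _ in range(n_ch)]
    for m in range(n_frames):   # outer loop: FFT frames (time)
        for k in range(n_ch):   # inner loop: channels (FFT bins)
            channels[k][m] = fft_out[m * n_ch + k]
    return channels


# Three frames of a 4-channel transform, labelled by their flat buffer index.
print(shuffle_output(list(range(12)), 4))
# [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
```

This is again a transposition, so it maps onto a GPU kernel in the same way as the input shuffle.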

5. Results

32 Channel PFB

  • 32 channels
  • No oversampling
  • Prototype filter with 437 taps

45 Channel PFB

  • 45 channels
  • 3x oversampling
  • Prototype filter with 1501 taps

6. Release plans and future changes

Release Plan

  • Release date: TBD
      • Still some bureaucratic hurdles
      • Still dependent on project code
  • License: LGPL3
  • Platform: GitHub (group KN-SAN)
  • Follow https://github.com/spectrejan for release news

Contact: j.kraemer@dlr.de @JanKrmer

https://github.com/spectrejan
