Hardware Acceleration of Pulsar Search on FPGAs using OpenCL


SLIDE 1

Overview and Task FT Convolution Decomposition High-level Techniques and Implementation Evaluation What’s Next

Hardware Acceleration of Pulsar Search on FPGAs using OpenCL

Oliver Sinnen Haomiao Wang & Prabu Thiagaraj (Manchester Uni)

Parallel and Reconfigurable Computing

Department of Electrical and Computer Engineering University of Auckland

Computing for SKA, 2017

Oliver Sinnen, Haomiao Wang & Prabu Thiagaraj (Manchester Uni) FDAS on FPGA using OpenCL

SLIDE 2

Strong-field Test of Gravity using Pulsars

Image credits: NASA/Tod Strohmayer (GSFC)/Dana Berry (Chandra X-Ray Observatory); NASA

SLIDE 3

Outline

1. Overview and Task
2. FT Convolution Decomposition
3. High-level Techniques and Implementation
4. Evaluation
5. What’s Next

SLIDE 5

Pulsar and Pulsar Search

Observed radiation is a pulse
Binary pulsar (Doppler effect)
Acceleration search: 1) Time-domain 2) Frequency-domain

SLIDE 6

Pulsar and Pulsar Search

Frequency-domain: use the matched filtering technique in the Fourier domain to recover the signal into a single bin.

$$A_{r_0} \simeq \sum_{k=[r_0]-m/2}^{[r_0]+m/2} A_k \, A^{*}_{r_0-k},$$

where the frequency $r_0$ is unknown. The summation is computed at a range of frequencies $r$.
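As a minimal sketch of the summation above (not the authors' implementation), the following Python function evaluates it for one trial bin. The names `matched_filter_bin`, `spectrum` and `template`, the storage of the template response by offset, and the restriction to an integer `r0` are assumptions made for illustration.

```python
def matched_filter_bin(spectrum, template, r0):
    """Sum A_k * conj(T_{r0-k}) over k = r0 - m/2 ... r0 + m/2.

    spectrum: complex Fourier amplitudes A_k.
    template: m + 1 complex responses (odd length); template[d + m // 2]
              holds the response at offset d = r0 - k.
    """
    half = (len(template) - 1) // 2
    total = 0j
    for k in range(r0 - half, r0 + half + 1):
        if 0 <= k < len(spectrum):  # bins outside the spectrum contribute nothing
            total += spectrum[k] * template[(r0 - k) + half].conjugate()
    return total


# Hypothetical signal: power smeared over 5 bins around bin 10, exactly like the template.
template = [0.1 + 0.2j, 0.5 - 0.1j, 1.0 + 0.0j, 0.5 + 0.1j, 0.1 - 0.2j]
spectrum = [0j] * 32
for d in range(-2, 3):
    spectrum[10 - d] = template[d + 2]

recovered = matched_filter_bin(spectrum, template, 10)   # all smeared power in one value
off_target = matched_filter_bin(spectrum, template, 20)  # no signal near this bin
```

When the template matches the smearing, every term becomes a squared magnitude, so the sum collects the spread power at the trial bin; a search repeats this over all bins and all templates.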

SLIDE 7

Block Overview of Pulsar Search Engine

[Figure: block diagram of the pulsar search engine, with "From SDP" and "To SDP" connections. Legend: Common, Single Pulse, Time-domain Acc, Freq-domain Acc.]

Blocks and buffers named in the diagram: Data Receptor (RCPT); Beamformed Data (BFD); Dedispersion Buffer Creator (DDBC); Filterbank Data Chunks (FDC); RFI Mitigation (RFIM); Dedispersion Buffers (DB); Dedispersion Transform (DDTR); Flagged DB (FDB); Periodicity Search Buffer Creator (PSBC); Dedispersed Data Buffer (DDB); Complex Fourier Transform (CXFT); Periodicity Search Buffer (PSB); Birdie Zapping (BRDZ); Dereddening Spectrum (DRED); Inverse Complex Fourier Transform (iCXFT); Single Pulse Detector (SPCT); Single Pulse Sifter (SPSIFT); Single Pulse Optimiser (SPOPT); Candidate Data Output Streamer (CDOS); Fourier Domain Acceleration Search (FDAS); Time Domain Resampler Transform (TDRT); Fourier Transform and Power Spectrum (PWFT); Harmonic Summing (HRMS); Time Domain Candidate Optimisation (TDAO); Candidate Sifting (SIFT); Full Filterbank Buffer Creator (FFBC); Candidate Folding and Optimisation (FLDO); Fourier Domain Candidate Optimisation (FDAO).

Data paths named in the diagram: Filterbank Data for Selected SP Candidates; Filterbank Data for Candidate Folding.

SLIDE 8

Fourier-domain Acceleration Search (FDAS)

The FDAS module is applied to search for (binary) pulsars with constant frequency derivatives in the frequency domain.

[Figure: data flow inside PSS Engine_i, from the beams (Beam_2, ..., Beam_i, ..., Beam_N) through the DM trials (DM_1, DM_2, ..., DM_j, ..., DM_6000) and pre-processing into the FDAS module.]

Over 2,000 beams are formed at 4,096 channels/beam
Beam_i signals are de-dispersed for 6,000 DMs
Pre-processing: RFIM, DDTR, PSBC, CXFT, BRDZ, DRED
FDAS module: FT convolution module (FIR_1, ..., FIR_k, ..., FIR_85) and post-processing
85 FIR filters, maximum length is 421-tap
Other modules: single pulse search modules; time-domain acceleration or harmonic-sum module

SLIDE 9

Specification of Task

Parameter   Description                               Value
B           # of beams                                1000 ~ 2000
DM          # of dispersion measure (DM) trials       6000
T_obs       Observation period                        540 s
t_limit     Time for executing one sample group       88 ms
N           # of complex samples per group            2^22
M           # of templates/filters                    85
K           Average template/filter length            > 200
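To get a feel for these numbers, the following back-of-the-envelope sketch estimates the arithmetic rate a naive time-domain evaluation would need. It reads the N entry as 2^22 samples, takes K = 200 as a lower bound, and assumes 8 real operations per complex multiply-accumulate; the operation count and all variable names are assumptions, not figures from the slides.

```python
# Values from the specification table (N read as 2**22 samples per group).
N = 2 ** 22          # complex samples per group
M = 85               # templates/filters
K = 200              # lower bound on the average filter length ("> 200")
T_LIMIT_S = 0.088    # time limit for one sample group
T_OBS_S = 540.0      # observation period
DM_TRIALS = 6000     # DM trials

# Assumption: one complex multiply-accumulate costs 8 real operations
# (4 multiplications and 2 additions for the product, 2 additions to accumulate).
FLOPS_PER_COMPLEX_MAC = 8

complex_macs = N * M * K
required_gflops = complex_macs * FLOPS_PER_COMPLEX_MAC / T_LIMIT_S / 1e9
seconds_per_dm_trial = T_OBS_S / DM_TRIALS

print(f"complex MACs per sample group: {complex_macs:.3e}")
print(f"required rate (naive time domain): {required_gflops:.0f} GFLOPS")
print(f"observation period / DM trials: {seconds_per_dm_trial * 1e3:.0f} ms")
```

Under these assumptions a naive time-domain approach needs several TFLOPS of sustained throughput per beam. Dividing the observation period by the number of DM trials gives 90 ms, close to the 88 ms limit, although the slides do not say how the limit was derived.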

SLIDE 11

FT Convolution

Complex floating-point operations
Multiple long FIR filters
Large input size
Strict time limit
Number of acceleration devices (CapEx)
Energy consumption (OpEx)

SLIDE 12

Basic Element

Time-domain FIR Filter (TDFIR)

$$y_m[i] = \sum_{k=0}^{K-1} x_m[i-k]\, h_m[k], \quad \text{for } i = 0, 1, \ldots, N-1$$

Frequency-domain FIR Filter (FDFIR)

$$\mathcal{F}\{f * h\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{h\}$$
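The two formulations can be compared on a toy input. The sketch below is illustrative only: it uses a naive O(n^2) DFT from the standard library instead of an FFT, treats samples before the start of the input as zero, and zero-pads both arrays so that the circular convolution implied by the DFT equals the linear one. The function names `tdfir`, `dft` and `fdfir` are hypothetical.

```python
import cmath


def tdfir(x, h):
    """Time domain: y[i] = sum_k x[i - k] * h[k], with x[j] = 0 for j < 0."""
    return [sum(x[i - k] * h[k] for k in range(len(h)) if i - k >= 0)
            for i in range(len(x))]


def dft(a, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT engine)."""
    n = len(a)
    sign = 1 if inverse else -1
    out = [sum(a[t] * cmath.exp(sign * 2j * cmath.pi * t * f / n) for t in range(n))
           for f in range(n)]
    return [v / n for v in out] if inverse else out


def fdfir(x, h):
    """Frequency domain: transform, multiply element-wise, transform back."""
    size = len(x) + len(h) - 1              # padding avoids circular wrap-around
    xp = list(x) + [0j] * (size - len(x))
    hp = list(h) + [0j] * (size - len(h))
    product = [a * b for a, b in zip(dft(xp), dft(hp))]
    return dft(product, inverse=True)[:len(x)]


x = [1 + 1j, 2 + 0j, -1j, 0.5 + 0j, 3 - 2j]
h = [1 + 0j, 0.5j, -0.25 + 0j]
y_time = tdfir(x, h)
y_freq = fdfir(x, h)   # agrees with y_time up to rounding error
```

The time-domain cost grows with N x K, while the frequency-domain cost is dominated by the transforms, which is why long filters favour the FDFIR form once an FFT replaces the naive DFT used here.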

SLIDE 13

Hardware Limitation

Naïve Time Domain
DSP blocks: single precision floating-point (SPF) multiplications
(A + iB) × (C + iD) = (A×C − B×D) + i(A×D + B×C)

Naïve Frequency Domain
Off-chip (global) memory: off-chip memory bandwidth
RAM blocks: on-chip (local) memory size
4-Million elements = 32 MBytes
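Both limits can be checked with a few lines of arithmetic. This is a sketch under two assumptions: "4-Million elements" means 2^22 complex samples, and each sample is stored as two single-precision values of 4 bytes each. The function and variable names are illustrative.

```python
def complex_multiply(a, b, c, d):
    """(a + ib)(c + id) with 4 real multiplications and 2 additions/subtractions."""
    return (a * c - b * d, a * d + b * c)


ELEMENTS = 2 ** 22                 # assumption: "4-Million elements" = 2**22
BYTES_PER_ELEMENT = 2 * 4          # real and imaginary part, single precision
buffer_bytes = ELEMENTS * BYTES_PER_ELEMENT
buffer_mib = buffer_bytes / 2 ** 20

real_part, imag_part = complex_multiply(1.0, 2.0, 3.0, 4.0)
print(f"(1 + 2i)(3 + 4i) = {real_part} + {imag_part}i")
print(f"one full input array: {buffer_mib:.0f} MiB")
```

Each tap of a complex FIR filter therefore consumes four real multipliers, and a single full-length array is far larger than the on-chip memory of a typical FPGA, so a naive frequency-domain filter has to stream through off-chip memory.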

SLIDE 14

Decomposition Algorithms

Overlap-add Algorithm: split the coefficient array -> OLA-TD
Overlap-save Algorithm: split the input array -> OLS-FD

[Figure (a) OLA: the coefficients are split into N groups C_1, C_2, ..., C_N of length Ncoef/N. The input data, zero-padded with Ncoef/N - 1 elements, is convolved with coefficient group i to give Output data_i; Output data_1 ... Output data_N are added to form the output data.]

[Figure (b) OLS: the input data, zero-padded with Ncoef - 1 elements, is split into N small groups ID_1, ID_2, ..., ID_N. Each ID_i is convolved with the FIR filter to give PD_i, the Ncoef - 1 elements are discarded, and PD_1 ... PD_N form the output data.]
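The two decompositions can be sketched in a few lines of Python. This is an illustration, not the kernels from the talk: every partial convolution is computed directly in the time domain (the OLS-FD variant would use FFTs for the per-block filtering), and the function names, the block size and the test data are hypothetical.

```python
def fir(x, h):
    """Direct FIR: y[i] = sum_k x[i - k] * h[k], with x[j] = 0 for j < 0."""
    return [sum(x[i - k] * h[k] for k in range(len(h)) if i - k >= 0)
            for i in range(len(x))]


def ola_split_coefficients(x, h, groups):
    """Split the coefficients into groups, filter with each, add the partial outputs."""
    size = -(-len(h) // groups)            # ceiling division
    y = [0] * len(x)
    for start in range(0, len(h), size):
        partial = fir(x, h[start:start + size])
        # A group that begins at tap `start` contributes with a delay of `start`.
        for i in range(start, len(x)):
            y[i] += partial[i - start]
    return y


def ols_split_input(x, h, block):
    """Split the input into blocks that overlap by len(h) - 1 samples."""
    overlap = len(h) - 1
    padded = [0] * overlap + list(x)       # leading zeros for the first block
    y = []
    for start in range(0, len(x), block):
        segment = padded[start:start + overlap + block]
        filtered = fir(segment, h)
        y.extend(filtered[overlap:])       # discard the first len(h) - 1 outputs
    return y


x = [float(v) for v in range(1, 21)]
h = [1.0, -2.0, 3.0, 0.5, 2.0]
reference = fir(x, h)
y_ola = ola_split_coefficients(x, h, groups=2)
y_ols = ols_split_input(x, h, block=6)
```

Both functions reproduce the undivided filter; splitting the coefficients keeps each partial filter short, whereas splitting the input keeps each block small enough for a fixed-size transform.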

SLIDE 16

High-level Techniques

Maxeler MaxCompiler using Java to develop FPGA (HPCC2016)
Open Computing Language (OpenCL) for FPGAs (Intel FPGA Cards), GPUs, and CPUs (FPT2016, best paper candidate)

[Figure: host with Core 1 ... Core 4 and host memory (DDR3 and SSD), connected over PCIe to FPGA_0 ... FPGA_i. Each FPGA contains a PCIe block, a DDR controller & PHY, global and local memory interconnects, block RAM and several kernel pipelines, with 2 x 2GB DDR3 as global memory.]

SLIDE 17

Kernel Structures–OLA

SLIDE 18

Kernel Structures–OLS

[Figure: two kernel structures, labelled AOLS and TOLS. Kernels communicate through channels and read from and write to global memory (Bank1, Bank2).]

First structure, kernels in order of appearance: FFT Data Fetch Kernel (NDRange); FFT and Multiplication Kernel (Single), which also takes the processed coefficients; FFT Bit-Reverse Kernel (NDRange); IFFT Data Fetch Kernel (NDRange); IFFT Kernel (Single); IFFT Bit-Reverse Kernel (NDRange).

Second structure, kernels in order of appearance: Data Fetch and Multiplication Kernel (NDRange); FFT/IFFT Kernel (Single; 1st launch FFT, 2nd launch IFFT); Bit-Reverse Kernel (NDRange). One global memory connection uses Bank1 in the 1st launch and Bank2 in the 2nd launch; the other uses Bank2 in the 1st launch and Bank1 in the 2nd launch. A switch selects between the input (1st launch) and the processed coefficients (2nd launch).
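The slides name bit-reverse kernels without further detail. Pipelined FFT engines commonly emit results in bit-reversed index order, so a reordering step is needed before the data is used again; the sketch below shows that permutation on a short list. It illustrates the general technique under that assumption and does not describe the actual OpenCL kernels.

```python
def bit_reverse_index(i, bits):
    """Reverse the lowest `bits` bits of i (for example 0b001 -> 0b100 for bits = 3)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def bit_reverse_permute(data):
    """Return data reordered so that element i comes from the bit-reversed index of i."""
    n = len(data)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("length must be a power of two")
    return [data[bit_reverse_index(i, bits)] for i in range(n)]


natural_order = list(range(8))
reordered = bit_reverse_permute(natural_order)
restored = bit_reverse_permute(reordered)   # the permutation is its own inverse
```

Because the permutation is its own inverse, the same routine converts in either direction between natural and bit-reversed order.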

SLIDE 20

Platform

Table: Details of FPGA and GPU Platforms

                          FPGA platform            GPU platform
Device (Board)            Terasic DE5-Net          Sapphire Nitro R7 370
Hardware                  Intel Stratix V 5SGXA7   AMD Radeon R7 370
Technology                28nm                     28nm
Compute resource          622,000 LEs              1024 Stream Processors
                          256 DSP blocks
On-chip memory size       50Mb                     -
Global memory size        2 x 2GB DDR3             3GB GDDR5
Global memory frequency   800MHz                   5,600MHz
Memory interface width    2 x 64-bit               256-bit
Max clock frequency       -                        985MHz
OpenCL                    1.0                      1.2
Max power consumption     -                        150W

SLIDE 21

Latency–TDFIR vs FDFIR

4 TDFIR kernels
Naïve: TD-Naïve-64S, TD-Naïve-64N
OLA: OLA-64S, OLA-64N

5 FDFIR kernels
Naïve: FD-Naïve
OLS, AOLS: AOLS-1024, AOLS-2048, AOLS-4096
OLS, TOLS: TOLS-1024

SLIDE 22

Latency–TDFIR vs FDFIR

Latencies of a single FPGA (Intel Stratix V A7) processing the same input array with 9 different OpenCL kernels:

[Figure: kernel execution latency (ms) versus FIR filter length (64, 128, 256, 421) for TD-Naïve-64S, TD-Naïve-64N, OLA-64S, OLA-64N, FD-Naïve, AOLS-1024, AOLS-2048, AOLS-4096 and TOLS-1024.]

SLIDE 23

Multiple FIR Filters

Even the fastest kernel cannot meet the time limit

=> Implement multiple FIR filters in parallel

Problem: Bandwidth of off-chip memory is the main bottleneck
Solution: Do more processing! Calculate the power of the complex values (needed as input in the next stage)
Problem: Number of DSP blocks limits the number of parallelisable filters
Solution: Downscale the FFT engine input size: 8 points -> 4 points
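The first solution can be illustrated with a short sketch. It assumes single-precision storage, so that a complex sample occupies 8 bytes and its power 4 bytes; the function name and the sample values are hypothetical.

```python
def power_of_complex(values):
    """Return |z|^2 = re^2 + im^2 for every complex sample."""
    return [v.real * v.real + v.imag * v.imag for v in values]


filtered = [3 + 4j, 1j, 0j, -2 + 2j]      # hypothetical filter outputs
powers = power_of_complex(filtered)

BYTES_COMPLEX_SPF = 8                     # assumption: two 4-byte floats per sample
BYTES_REAL_SPF = 4                        # assumption: one 4-byte float per power
bytes_without_power = len(filtered) * BYTES_COMPLEX_SPF
bytes_with_power = len(powers) * BYTES_REAL_SPF
```

Writing one real value instead of a complex pair halves the data sent to off-chip memory for each filter output, which is presumably why the extra computation helps with the bandwidth problem.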

SLIDE 26

Multiple FIR Filters

1) Multiple FIR filters
2) Power of complex values
Becomes an optimisation problem!

[Figure: DSP block usage, split into FFT engine, element-wise multiplications, power and unused DSP blocks, for the configurations TOLS-1024 8points, AOLS-1024 8points, AOLS-1024-P 8points, 2 x AOLS-1024 8points, 2 x AOLS-1024-P 8points, AOLS-1024-P 4points, AOLS-2048-P 4points, 3 x AOLS-1024-P 4points and 3 x AOLS-2048-P 4points.]

SLIDE 27

Multiple FIR Filters and FPGAs

[Figure, four panels: latency of 84 FIR filters (ms); kernel performance (GFLOPS); power efficiency (GFLOPS/watt); energy dissipation (Joule). Each panel compares the OpenCL kernels 3xAOLS-1024-P, 3xAOLS-2048-P and 3xAOLS-4096-P on 1, 2 and 3 FPGAs.]

SLIDE 28

Latency–FPGA vs GPU

Latencies of a single GPU (AMD Radeon R7 370) and 3 FPGAs processing 2 and 4 million points:

[Figure: kernel execution latency (ms) versus total number of FIR filters (9 to 81 in steps of 9) for 3xAOLS-2048-P (4-Million), 3xAOLS-2048-P (2-Million), GPU-FD (4-Million) and GPU-FD (2-Million).]

SLIDE 29

Energy–FPGA vs GPU

Energy dissipation of a single FPGA and a single GPU executing the same task with different kernels:

[Figure: energy dissipation (Joule) for GPU-FD, 3xAOLS-1024-P, 3xAOLS-2048-P, 3xAOLS-4096-P, AOLS-2048 and TOLS-1024.]

SLIDE 31

Harmonic-summing

Input: Filter-output-plane (FOP, 85×2^22 SPF points, ~1.33 GBytes)

Processing flow:
8 harmonic planes are generated based on the FOP and the stretched planes
One threshold for each row of each harmonic plane (overall: 85×8 = 680)
Hundreds of candidates are recorded

Output: Candidates
Each candidate contains the indexes of filter, harmonic and bin, and the amplitude (up to 64-bit)
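The processing flow can be sketched for a single row of the FOP. The slides do not spell out the stretching rule, so this simplified version assumes that harmonic plane k adds, for fundamental bin r, the powers at bins r, 2r, ..., kr. The function names, thresholds and data are hypothetical.

```python
def harmonic_sum_row(power, n_harmonics):
    """Return n_harmonics rows; row k - 1 holds the sum of the first k harmonics."""
    planes = []
    running = [0.0] * len(power)
    for k in range(1, n_harmonics + 1):
        for r in range(len(power)):
            if k * r < len(power):         # harmonics beyond the row are skipped
                running[r] += power[k * r]
        planes.append(list(running))
    return planes


def find_candidates(planes, thresholds, filter_index):
    """Record (filter, harmonic, bin, amplitude) wherever a row exceeds its threshold."""
    candidates = []
    for harmonic, (row, threshold) in enumerate(zip(planes, thresholds), start=1):
        for r, amplitude in enumerate(row):
            if amplitude > threshold:
                candidates.append((filter_index, harmonic, r, amplitude))
    return candidates


# Hypothetical row: fundamental at bin 3 with weaker harmonics at bins 6 and 9.
row = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 4.0, 0.0, 0.0, 3.0, 0.0, 0.0]
planes = harmonic_sum_row(row, n_harmonics=3)
candidates = find_candidates(planes, thresholds=[10.0, 10.0, 10.0], filter_index=0)
```

In this toy row no single bin exceeds the threshold, but the third harmonic plane collects the power from bins 3, 6 and 9 and produces one candidate. The work per element is a single addition, so the cost is dominated by fetching the data.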

SLIDE 32

Harmonic-summing

Problems:
Input data size is too large (~1.33 GBytes)
On-chip memory is too small to hold all planes
The computation itself is very cheap (SPF adds) and easy to parallelise
Off-chip memory bandwidth is the issue

Challenge: optimise data use (and reuse); computation is not an issue
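The quoted input size follows from the plane dimensions. The short check below assumes 85 rows of 2^22 single-precision values at 4 bytes each and reports the size in binary gigabytes; the variable names are illustrative.

```python
FILTERS = 85                 # rows of the filter-output-plane
BINS = 2 ** 22               # assumption: samples per row, read as 2**22
BYTES_PER_SPF = 4            # single-precision floating point

fop_bytes = FILTERS * BINS * BYTES_PER_SPF
fop_gib = fop_bytes / 2 ** 30

print(f"filter-output-plane: {fop_bytes} bytes = {fop_gib:.2f} GiB")
```

With one addition per element fetched, the arithmetic intensity is very low, which is consistent with the slide's point that data movement rather than computation limits harmonic summing.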

SLIDE 33

Conclusion

FPGA-based implementation and optimisation of FT convolution (FIR filter), based on OLA and OLS algorithms
High-level approaches to such tasks work well
Covered large design space
Easy porting and sharing with partners
With multiple FPGAs, the FPGA implementation has an advantage over the GPU in both performance (GFLOPS) and energy efficiency
