Case Study in 3D FFT (Ahmed Sanaullah, Martin Herbordt, Vipin Sachdeva)



SLIDE 1

OpenCL for FPGAs/HPC: Case Study in 3D FFT

Ahmed Sanaullah, Martin Herbordt (Boston University)
Vipin Sachdeva (Silicon Therapeutics)

SLIDE 2

What gives FPGAs high performance?

► Deep pipelines
► Block RAMs
► Flexible on-chip communication/networks
► High utilization

OpenCL for FPGAs/HPC: Case Study in 3D FFT 11/15/2017

SLIDE 3

What gives FPGAs high performance?

► Deep pipelines
► Block RAMs
► Flexible on-chip communication/networks
► High utilization

To sum it up: Application Specific Architecture

SLIDE 4

What gives FPGAs high performance?

► Deep pipelines
► Block RAMs
► Flexible on-chip communication/networks
► High utilization

To sum it up: Application Specific Architecture

But creating these designs in HDL is very complex. How do we solve the programmability problem?

SLIDE 5

IP Cores

► 3rd party solutions
► Highly optimized
► Ease of use
► Reduces implementation timeframes

SLIDE 6

IP Cores

► 3rd party solutions
► Highly optimized
► Ease of use
► Reduces implementation timeframes

But …

► Limited customizability
► Implementation specifics hidden to protect intellectual property

SLIDE 7

IP Cores

► 3rd party solutions
► Highly optimized
► Ease of use
► Reduces implementation timeframes

But …

► Limited customizability
► Implementation specifics hidden to protect intellectual property

Which means: Pseudo Application Specific Architecture

SLIDE 8

How about OpenCL?

► Develop application in C99 and compile to hardware
► Primitives and pragmas further customize hardware translations
  ► e.g. loop unrolling, compute unit replication, single/multiple work items

Doesn't OpenCL generate a complete .aocx file?

► Do not have to complete compilation
► Can obtain generated HDL from kernel_system folder
► Isolate and integrate required modules into existing design

SLIDE 9

3D FFT Case Study

1D FFT (Y dimension) (height)
2D FFT (X dimension) (width)
3D FFT (Z dimension) (depth)
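
The decomposition on this slide (1D FFTs along Y, then X, then Z) can be sketched in software. This is a minimal pure-Python illustration, not the hardware design from the talk: the function names `fft_1d` and `fft_3d` and the `cube[z][y][x]` layout are assumptions made for the example. Because the 3D transform is separable, the order of the three passes does not change the result; the Y, X, Z order below simply follows the slide.

```python
import cmath

def fft_1d(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft_1d(values[0::2])
    odd = fft_1d(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_3d(cube):
    """3D FFT of an N x N x N nested list indexed as cube[z][y][x]."""
    n = len(cube)
    data = [[list(row) for row in plane] for plane in cube]  # work on a copy
    # Pass 1: 1D FFTs along Y (height)
    for z in range(n):
        for x in range(n):
            line = fft_1d([data[z][y][x] for y in range(n)])
            for y in range(n):
                data[z][y][x] = line[y]
    # Pass 2: 1D FFTs along X (width)
    for z in range(n):
        for y in range(n):
            data[z][y] = fft_1d(data[z][y])
    # Pass 3: 1D FFTs along Z (depth)
    for y in range(n):
        for x in range(n):
            line = fft_1d([data[z][y][x] for z in range(n)])
            for z in range(n):
                data[z][y][x] = line[z]
    return data
```

Each pass performs N² independent 1D transforms of length N, which is the parallelism the compute units in the following slides exploit.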

SLIDE 10

3D FFT Compute Units

[Figure: two compute unit designs. IP Core Radix-4/2: FFT IP Core 1 through FFT IP Core N, each labeled with a 1D vector. OpenCL Radix-2: Stage 1, Stage 2, ..., Stage log(N), labeled with 1D vectors and individual complex values.]
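
To make the "Stage 1, Stage 2, ..., Stage log(N)" structure of a radix-2 FFT concrete, here is a minimal software sketch of an iterative radix-2 decimation-in-time FFT. It is an illustration only and is not the generated OpenCL pipeline; the names `bit_reverse` and `fft_radix2_stages` are hypothetical. The outer loop runs once per stage, so a 64-point transform uses 6 stages.

```python
import cmath

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fft_radix2_stages(values):
    """Iterative radix-2 FFT. Returns (spectrum, number_of_stages)."""
    n = len(values)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    stages = n.bit_length() - 1  # log2(N)
    # Reorder the input so every stage can work in place.
    data = [complex(values[bit_reverse(i, stages)]) for i in range(n)]
    for stage in range(1, stages + 1):
        span = 1 << stage        # butterfly group size at this stage
        half = span // 2
        for start in range(0, n, span):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / span)  # twiddle factor
                a = data[start + k]
                b = data[start + k + half] * w
                data[start + k] = a + b
                data[start + k + half] = a - b
    return data, stages
```

In hardware each stage can be a separate pipeline step; in this sketch the stages simply run one after another on the same list.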

SLIDE 11

FPGA: Altera Arria 10-X115

► 427K ALMs
► 1518 DSP blocks
► 53Mb BRAMs

FFT Size: 64³
Throughput Constraint: 64

► Mix of ALMs and DSPs used for FFT IP cores
► Insufficient DSP resources
► DSPs preferred over ALMs

SLIDE 12

Resource and Performance Comparison

OpenCL FFT has:

► ≈ 10x lower ALM usage
► ≈ 25x lower on-chip memory usage
► ≈ 2x higher frequency

► OpenCL FFT can meet the required throughput using DSPs only

SLIDE 13

Conclusion

► OpenCL-based designs can perform better than IP-core-based ones
  ► For 64³ FFT
► FFT IP cores are constrained to a specific computational flow
  ► May not be optimal for all FFT sizes
► OpenCL enables more application-specific designs
  ► with less effort than HDL programming

SLIDE 14

Memory Architecture

► Ping-pong Primary Memory buffers
► Primary Memory Bank: O(N²) complexity (single read, single write)
► Secondary Memory Bank: O(N) complexity (single read, parallel write)
► Transpose
  ► Outputs of Compute Unit write to the same Secondary Memory Bank
  ► Secondary Memory Banks write to Primary Memory Banks
  ► New writes to Secondary Memory Bank every N cycles
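
The ping-pong and transpose ideas can be illustrated with a small software model. This is a sketch under simplifying assumptions, not the memory system from the talk: the class name `PingPongMemory` is hypothetical, each primary buffer is modeled as one flat list of N³ values, and the Secondary Memory Banks that stage writes in hardware are not modeled at all. One pass reads lines of N values from one buffer and writes them, with the coordinate roles rotated, into the other buffer; the two buffers then swap.

```python
class PingPongMemory:
    """Two flat buffers of N**3 values: one is read while the other is written."""

    def __init__(self, initial, n):
        if len(initial) != n ** 3:
            raise ValueError("expected n**3 values")
        self.n = n
        self.read_bank = list(initial)
        self.write_bank = [None] * len(initial)

    def run_pass(self, compute=lambda line: line):
        """Process every line of N values, write transposed, then swap buffers.

        `compute` stands in for a 1D FFT compute unit (identity by default).
        """
        n = self.n
        for buf in range(n):          # slowest-changing address field
            for off in range(n):      # changes once per line
                base = buf * n * n + off * n
                line = compute(self.read_bank[base:base + n])
                for loc, value in enumerate(line):
                    # Rotate roles: (buf, off, loc) -> (loc, buf, off),
                    # so the old buffer field becomes the next offset.
                    self.write_bank[loc * n * n + buf * n + off] = value
        self.read_bank, self.write_bank = self.write_bank, self.read_bank
```

With the identity `compute`, three passes rotate the layout back to where it started, which is a quick way to check the write addressing.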

SLIDE 15

Can this design source and sink data stall-free?

IP Core    Loc    Offset    Buffer#
FFTx       X      Y         Z
FFTy       Y      Z         X
FFTz       Z      X         Y

Index_3D = Buffer# × N² + Offset × N + Loc

► Buffer# varies for a given cycle
► Loc changes every cycle
► Offset changes every N cycles
► Buffer# → Offset for next FFT dimension

SLIDE 16

Can this design source and sink data stall-free?

IP Core    Loc    Offset    Buffer#
FFTx       X      Y         Z
FFTy       Y      Z         X
FFTz       Z      X         Y

Index_3D = Buffer# × N² + Offset × N + Loc

OpenCL Radix-2    Loc    Offset    Buffer#
FFTx              Y      Z         X
FFTy              Z      X         Y
FFTz              X      Y         Z

► Buffer# varies for a given cycle
► Loc changes every cycle
► Offset changes every N cycles
► Buffer# → Offset for next FFT dimension
► Only difference is in initial data locations

► Hence, no stalls
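
The index formula and the two role tables can be checked with a short script. This is a minimal sketch: the formula and the (Loc, Offset, Buffer#) assignments are taken from the slides, while the names `index_3d`, `address`, and `buffer_becomes_offset` are hypothetical helpers introduced for the example.

```python
# Role tables from the slides: pass -> (Loc axis, Offset axis, Buffer# axis)
ROLES_IP_CORE = {
    "FFTx": ("X", "Y", "Z"),
    "FFTy": ("Y", "Z", "X"),
    "FFTz": ("Z", "X", "Y"),
}
ROLES_OPENCL = {
    "FFTx": ("Y", "Z", "X"),
    "FFTy": ("Z", "X", "Y"),
    "FFTz": ("X", "Y", "Z"),
}

def index_3d(buffer_no, offset, loc, n):
    """Index_3D = Buffer# * N^2 + Offset * N + Loc."""
    return buffer_no * n * n + offset * n + loc

def address(coord, roles, n):
    """Address of the point coord = {'X': x, 'Y': y, 'Z': z} under a role row."""
    loc_axis, offset_axis, buffer_axis = roles
    return index_3d(coord[buffer_axis], coord[offset_axis], coord[loc_axis], n)

def buffer_becomes_offset(table):
    """True if each pass's Buffer# axis is the next pass's Offset axis."""
    order = ["FFTx", "FFTy", "FFTz"]
    return all(table[a][2] == table[b][1] for a, b in zip(order, order[1:]))
```

For example, with N = 4 the point (X, Y, Z) = (1, 2, 3) maps to address 57 in the IP core's FFTx layout and to address 30 in the OpenCL design's FFTx layout. Both tables follow the same rotation, so `buffer_becomes_offset` holds for each; only the starting assignment differs.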