SLIDE 1

Exascale Computing for Radio Astronomy: GPU or FPGA?

Kees van Berkel

MPSoC 2016, Nara, Japan, July 14, 2016

SLIDE 2

Radio Astronomy: Hercules A (a.k.a. 3C 348)

Image courtesy of NRAO/AUI

“… optically invisible jets, one-and-a-half million light-years wide, dwarf the visible galaxy from which they emerge.”

SLIDE 3

VLA radio telescope, New Mexico

27 independent antennae (dishes), each with a diameter of 25m

SLIDE 4

NGC6946: where is the NH3? and how cold is it?

  • 20 million light years from Earth (image about 50 arcsec wide)
  • Optical + X-ray combined
  • Radio: 24 GHz (λ = 12.5 mm)
  • 1.76 GB of “radio data” (a few fJ in total, a few billion photons)

SLIDE 5

NGC6946: where is the NH3? and how cold is it?

  • 640 channels of 500 kHz each
  • “image cube”: (256 × 256 pixels) × 640 channels
  • NH3 cloud at 19 K

SLIDE 6

Exascale Computing for Radio Astronomy: GPU or FPGA?

Computing:

  • what kind is needed?
  • how much?
  • in what form?
  • accelerator / node?
  • how to find out?
  • for the Square Kilometre Array (y2022)
  • 2D-FFT, (de-)convolution, filters, dedispersion, and a lot more
  • “exa-scale”: 10^18 FLOP/sec, i.e. 10× the fastest existing computer
  • 10^4.5 nodes × 10^4.5 ALUs × fc = 10^9 Hz?
  • GPU or FPGA?
  • use rooflines as a tool, for modeling and for prediction

SLIDE 7

Interferometry

2-element interferometer (baseline + correlator). The output of the correlator is the correlation of the fields at the two antennae:

V_ν(r1, r2) = ⟨ E_ν(r1) · E_ν*(r2) ⟩

where
  • E_ν(r1) is the electric field at position r1,
  • ν is the observation frequency, and
  • * denotes complex conjugation.

SLIDE 8

Van Cittert–Zernike theorem [1934-38]

Adding geometry (assuming a “narrow field”, quasi-monochromatic):

V_ν(u, v) = ∫∫ I_ν(l, m) e^(−2πi(ul + vm)) dl dm

where (l, m) are sky-image coordinates, (u, v) are the coordinates of the baseline vector separating the two antennae (expressed in wavelengths via the speed of light), I_ν(l, m) is the sky intensity, the integral runs over the solid angle, and V_ν is the correlator output.

A 2D Fourier transform! [Tay99, Cla90, Tho01]

SLIDE 9

Van Cittert–Zernike theorem [1934-38]

In principle: the sampled (u, v) coverage, with amplitude and phase (A, φ) at each (u, v) point, is mapped by an inverse DFT (I-DFT) to the pixel intensities of the (l, m) sky image, and back by a forward DFT.

SLIDE 10

Sampling Lucy in u-v domain with a disc

[Figure: multiplying (×) the full (u, v)-domain data by a disc-shaped sampling function corresponds, via IDFT/DFT, to a convolution (∗) in the (x, y) image domain.]

SLIDE 11

DFT convolution theorem

V_ν(u,v) × S(u,v) = V_ν^S(u,v)      (visibility domain)
I_ν(l,m) ∗ B_ν(l,m) = I_ν^D(l,m)    (image domain; related via DFT / I-DFT)

where
  • V_ν = observed visibility, S = sampling function, V_ν^S = sampled visibility: complex (Hermitian);
  • I_ν = image (map), B_ν = dirty beam (point spread function), I_ν^D = dirty image (map): real;
  • ∗ denotes convolution.

Recovering I_ν from I_ν^D is “de-convolution”; see the numerical sketch below.
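The sampling/convolution duality is easy to reproduce numerically. A minimal numpy sketch under assumed parameters (a 256×256 synthetic point-source “sky” standing in for Lucy, and a random Hermitian-symmetric u-v mask standing in for the disc):

```python
import numpy as np

N = 256
rng = np.random.default_rng(1)

# Stand-in "sky": a few point sources on an N x N grid.
sky = np.zeros((N, N))
sky[rng.integers(0, N, 20), rng.integers(0, N, 20)] = rng.uniform(1, 10, 20)

# Full visibilities V = DFT of the sky.
vis = np.fft.fft2(sky)

# Sampling function S: sparse u-v coverage, symmetrized (S(-u,-v) = S(u,v))
# so that the dirty image comes out real, as on the slide.
S = rng.uniform(size=(N, N)) < 0.15
S = S | np.roll(np.flip(S), (1, 1), axis=(0, 1))

# Dirty image I_D = IDFT(V x S); dirty beam B = IDFT(S).
dirty_image = np.fft.ifft2(vis * S)
dirty_beam = np.fft.ifft2(S.astype(float))

# Convolution theorem: I_D = I * B (circular convolution).
check = np.fft.ifft2(np.fft.fft2(sky) * np.fft.fft2(dirty_beam))
assert np.allclose(dirty_image, check)
print("max |imag(I_D)|:", np.abs(dirty_image.imag).max())  # ~1e-15, i.e. real
```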

SLIDE 12

DFT convolution: Lucy with 2 hours VLA time

[Figure: (u, v)-domain sampling (×) and the corresponding (x, y)-domain convolution (∗), via IDFT/DFT, for 2 hours of VLA coverage.]

SLIDE 13

DFT convolution: Lucy with 12 hours VLA time

[Figure: the same pipeline, for 12 hours of VLA coverage.]

SLIDE 14

DFT convolution: synthetic sky with 2 hours VLA time

[Figure: as above, plus CLEAN “de-convolution” applied to the dirty image (via I-DFT).]

SLIDE 15

De-convolution (“imaging”) based on CLEAN [Hög74]

[Flowchart of the CLEAN major cycle: visibilities V(u,v,w) (complex, 3D) are gridded (G^T(·)×) to V(u,v,w=0) (2D) and I-FFT'ed to a residual image I(l,m) (real); point sources are extracted against γ×PSF (the dirty beam) and accumulated into a point-source sky model; the model is FFT'ed, de-gridded (G~(·)), and subtracted (−) to update the visibilities; finally the model is convolved with the clean beam. 3×10 major iterations, 100× inner; W-projection/snapshot implicit.]
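For reference, the minor cycle of [Hög74] fits in a few lines. A sketch (image-domain only; the loop gain γ = 0.1 and the stopping threshold are assumed parameters, and gridding, W-projection and the major cycle are omitted):

```python
import numpy as np

def hogbom_clean(dirty, psf, gamma=0.1, n_iter=1000, threshold=0.0):
    """Högbom CLEAN minor cycle: repeatedly subtract a scaled, shifted
    dirty beam at the brightest residual pixel.
    `psf` is assumed to peak at the center of the array.
    Returns (point-source model, residual image)."""
    residual = dirty.astype(float).copy()
    model = np.zeros_like(residual)
    cy, cx = psf.shape[0] // 2, psf.shape[1] // 2
    for _ in range(n_iter):
        y, x = np.unravel_index(np.argmax(np.abs(residual)), residual.shape)
        peak = residual[y, x]
        if abs(peak) <= threshold:
            break
        model[y, x] += gamma * peak
        # Circularly shift the beam so its peak lands on (y, x).
        residual -= gamma * peak * np.roll(psf, (y - cy, x - cx), axis=(0, 1))
    return model, residual
```

With dirty_image.real and psf = np.fft.fftshift(dirty_beam.real) from the previous sketch, model + residual approximates the synthetic sky.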

SLIDE 16

SKA1-mid [South Africa]: science in 2020

[Photograph and artist impression; credit: SKA Organisation / Swinburne Astronomy Productions]

Towards a Square Kilometre Array [Dew13]

SLIDE 17

Imaging: compute load for SKA1-mid

  • #operations/visibility/iteration depends on the W-projection method
  • calibration loop (3×) around the imaging loop

| quantity                      | unit | log10 | note                                  |
|-------------------------------|------|-------|---------------------------------------|
| # baselines                   |      | 5+    | ½ × (#dishes)² = ½ × (2×200)²         |
| dump rate                     | s⁻¹  | 1+    | (integration time = 0.08 s)⁻¹         |
| observation time              | s    | 3     |                                       |
| # channels                    |      | 5     | “image cube” for spectral analysis    |
| # visibilities / observation  |      | 14.5  | input to imaging (≈ 10^16 Byte)       |
| # op / visibility / iteration |      | 4.5   | convolution, matrix multiply, (I)FFT  |
| # major iterations            |      | 1.5   | (3× calibration) × (10× major)        |
| # op / observation            |      | 20.5  |                                       |
| # op / sec                    | Hz   | 17.5  | ≈ 1 exaflop/sec                       |

[Jon14, Ver15, Wijn14]
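The budget is a sum of logs; a quick check of the table's arithmetic (values as read off the slide, which rounds to half-decades):

```python
import math

log_baselines = math.log10((2 * 200) ** 2 / 2)  # "5+"  (~4.9)
log_dump_rate = math.log10(1 / 0.08)            # "1+"  (~1.1)
log_obs_time  = 3.0                             # seconds
log_channels  = 5.0

log_vis = log_baselines + log_dump_rate + log_obs_time + log_channels
log_ops = log_vis + 4.5 + 1.5                   # op/vis/iteration + iterations

print(f"visibilities/observation ~ 10^{log_vis:.1f}")                # ~14, slide: 14.5
print(f"op/observation           ~ 10^{log_ops:.1f}")                # ~20, slide: 20.5
print(f"op/sec                   ~ 10^{log_ops - log_obs_time:.1f}")  # ~17, slide: 17.5
```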

SLIDE 18

EXAflop/s in 2015?

  • net SKA1-mid computation load “2020” versus
  • gross (peak) compute performance “2015”

[Chart: Piz Daint (CH) versus SKA1-mid (≈ 20 MW); * Tianhe-2, #1 in 2016. [Gre14]]

SLIDE 19

Piz Daint: a Cray XC30 system

Comprising many kilometers of (optical) cable, ... and 5272 nodes

SLIDE 20

Supercomputers 101

A modern supercomputer = N (10^3 to 10^5) identical nodes connected by a network (system-level interconnect), ignoring storage, peripherals, service nodes, …

[Node diagram: SoC with LOCs and a TOC on a NoC, on-chip Mem, MC, NIC, plus SDRAM per node.]

  • SoC = System on Chip
  • TOC = Throughput-Optimized Core; LOC = Latency-Optimized Core (“TOC, LOC” is Nvidia speak)
  • NIC = Network Interface; MC = Memory Controller
  • Mem = on-chip memory, e.g. L2; NoC = Network on Chip
  • SDRAM = Synchronous DRAM per node

SLIDE 21

Piz Daint: 7.8 PFLOP/s @ 1.8 MW

Cray XC30, N = 5272; TOC = Tesla K20X GPU; LOC = 8× Intel Xeon @ 2.6 GHz; Aries network, Dragonfly router IC.

| quantity        | node         | system       |
|-----------------|--------------|--------------|
| # nodes         | 1            | 5272         |
| Xeon @ 2.6 GHz  | 0.17 TFLOP/s |              |
| Tesla K20X      | 1.31 TFLOP/s |              |
| peak throughput | 1.48 TFLOP/s | 7787 TFLOP/s |
| memory          | 0.06 TB      | 337 TB       |
| power           | 0.33 kW      | 1754 kW      |

SLIDE 22

Nvidia Research [Ore14]: 1 ExaFLOP/s @ 23 MW in 2020

N = 76800 nodes (7 nm CMOS); TOC = 8,192 multiply-add units @ 1 GHz, double precision; Aries-like network.

| quantity        | node         | system       |
|-----------------|--------------|--------------|
| # nodes         | 1            | 76800        |
| peak throughput | 16.4 TFLOP/s | 1258 PFLOP/s |
| memory          | 0.51 TB      | 39322 TB     |
| power           | 0.30 kW      | 23000 kW     |
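In the identical-node model of SLIDE 20, the system column is just the node column scaled by N; a quick sketch of that check (per-node figures from the two tables above):

```python
def system_totals(n_nodes, node_tflops, node_mem_tb, node_kw):
    """Scale per-node figures to system level (N identical nodes)."""
    return {"PFLOP/s": n_nodes * node_tflops / 1e3,
            "memory_TB": n_nodes * node_mem_tb,
            "power_MW": n_nodes * node_kw / 1e3}

print(system_totals(5272, 1.48, 0.06, 0.33))   # Piz Daint: ~7.8 PF/s, ~1.7 MW
print(system_totals(76800, 16.4, 0.51, 0.30))  # [Ore14]:  ~1260 PF/s, ~23 MW
```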

SLIDE 23

Exascale computing for radio astronomy

  • Exascale computing: 10^18 FLOP/s.
  • Radio astronomy: 10^17.5 FLOP/s, with gridding (W-projection) and 2D-FFT as heavy kernels.
  • Let's map the 2D-FFT onto a node. Option 1: FPGA*. Option 2: GPU.

* in the same package (not the same SoC)

SLIDE 24

State-of-the-art GPU and FPGAs

| quantity              | unit    | Nvidia GP100 | Intel/Altera Stratix 10 | Xilinx VU13P |
|-----------------------|---------|--------------|-------------------------|--------------|
| CMOS process          | nm      | 16           | 14                      | 16           |
| clock frequency       | MHz     | 1328         | 800                     | 800*         |
| scalar/DSP processors |         | 3584         | 11520                   | 11904        |
| peak throughput       | GFLOP/s | 9519         | 9216                    | 7619         |
| data type [32b]       |         | float        | float                   | fixed        |
| DRAM interface        |         | HBM2         | HBM2#                   | HBM2#        |
| DRAM bandwidth        | GB/s    | 256          | 256                     | 256          |
| power consumption     | W       | 300          | 126                     |              |
| efficiency            | GFLOP/W | 32           | 73                      |              |

* assumption, no data found.
# HBM2 (High Bandwidth Memory) interface to 3D-stacked DRAM is an option.

SLIDE 25

DFT in matrix-vector form

Let x and X be complex vectors of length N:

X^T = F_N · x^T, i.e. (X_0, X_1, …, X_{N−1})^T = F_N · (x_0, x_1, …, x_{N−1})^T

where F_N is the twiddle-factor matrix, (F_N)_{jk} = ω^{jk} with ω = e^{2πi/N}.

In 2 dimensions:

Y = F_M · X · F_N

where Y and X are matrices of size M×N; F_M · X applies an M-point 1D-DFT to each column of the matrix X.
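This is directly checkable with numpy. A minimal sketch (note that np.fft uses the ω = e^(−2πi/N) sign convention, so the twiddle matrix below does too):

```python
import numpy as np

def dft_matrix(n):
    """Twiddle-factor matrix F_n with (F_n)[j, k] = w^(j*k), w = exp(-2*pi*i/n)."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)

M, N = 8, 16
X = np.random.rand(M, N) + 1j * np.random.rand(M, N)

# 2D-DFT as two matrix products: Y = F_M . X . F_N
Y = dft_matrix(M) @ X @ dft_matrix(N)
assert np.allclose(Y, np.fft.fft2(X))
```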

SLIDE 26

DFT: Cooley-Tukey factorization theorem

Let N = P×M. Then F_N can be factorized as

F_N = P_{N,P} · (I_P ⊗ F_M) · D_N · (F_P ⊗ I_M)

where I_M and I_P are identity matrices, ⊗ denotes the tensor (Kronecker) product, D_N is the diagonal twiddle matrix, and P_{N,P} is a permutation.

Example: P = 2, M = N/P = 4:
  • F_P ⊗ I_M: M radix-2 DFTs (butterflies)
  • D_N: twiddle factors
  • I_P ⊗ F_M: P radix-N/2 DFTs
  • P_{N,P}: omitted (output reordering)

The Cooley-Tukey factorization is the basis of the FFT. [Loa92]
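The factorization can be verified numerically. A sketch (decimation-in-frequency form; the concrete stride reordering used for P_{N,P} below is an assumption consistent with the radix-P structure in [Loa92]):

```python
import numpy as np

def dft_matrix(n):
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)

N, P = 8, 2
M = N // P
w = np.exp(-2j * np.pi / N)

# D_N: diagonal twiddle matrix; block j carries diag(w^(j*0), ..., w^(j*(M-1))).
D = np.diag(np.concatenate([w ** (j * np.arange(M)) for j in range(P)]))

# P_{N,P}: stride permutation; the element at (block b, offset r) is output P*r + b.
perm = np.zeros((N, N))
for b in range(P):
    for r in range(M):
        perm[P * r + b, b * M + r] = 1.0

rhs = perm @ np.kron(np.eye(P), dft_matrix(M)) @ D @ np.kron(dft_matrix(P), np.eye(M))
assert np.allclose(dft_matrix(N), rhs)
```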

SLIDE 27

2D-FFT: arithmetic intensity

The arithmetic intensity I_A = amount of compute per unit problem size:

I_A = (number of operations) / (size of (input + output) [bytes])

For a 2D-FFT of size N×N with complex input and output we have:

I_A(N) = [2 × N × ½N log2(N) butterflies] × (10 ops/butterfly) / [(1 read + 1 write) × (N² pixels) × (8 bytes/pixel)]
       = 0.625 log2(N) ops/byte

With 2^10 ≤ N ≤ 2^15 this amounts to 6.25 ≤ I_A(N) ≤ 9.38.
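The same count in code (10 ops per butterfly, as on the slide):

```python
import math

def arithmetic_intensity(n):
    """I_A for an N x N complex 2D-FFT: ops per byte of (input + output)."""
    butterflies = 2 * n * (n / 2) * math.log2(n)  # 2N 1D-FFTs, (N/2) log2(N) each
    ops = butterflies * 10
    nbytes = 2 * n * n * 8                        # 1 read + 1 write, 8 B/pixel
    return ops / nbytes                           # = 0.625 * log2(N)

for n in (2**10, 2**12, 2**14, 2**15):
    print(n, round(arithmetic_intensity(n), 2))   # 6.25, 7.5, 8.75, 9.38
```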
SLIDE 28

2D-FFT: operational intensity

The arithmetic intensity I_A = amount of compute per unit problem size:

I_A = (number of operations) / (size of (input + output) [bytes])

The operational intensity I_OP = amount of compute per unit DRAM traffic:

I_OP = (number of operations) / (amount of DRAM traffic (input + output) [bytes])

I_OP = I_A only if the entire problem fits in on-chip memory. In practice I_OP << I_A, and it depends on algorithm choices and on the available on-chip memory. [Wil09]

SLIDE 29

Roofline = compute and memory-bandwidth bounds

[Roofline plot: attainable throughput versus operational intensity (op/byte); the memory-bandwidth slope meets the flat compute bound at the ridge point, x = 37 flops/byte; the 2D-FFT data point lies far left of the ridge point, i.e. memory-bound. [Wil09]]
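The roofline model itself is a one-liner; a sketch with illustrative numbers from SLIDE 24 (≈9.5 TFLOP/s peak, 256 GB/s DRAM bandwidth, hence a ridge point near 37 op/byte; the two intensities are the GPU and FPGA 2D-FFT values derived on later slides):

```python
def roofline(peak_ops, mem_bw, intensity):
    """Attainable throughput (op/s) at a given operational intensity (op/byte)."""
    return min(peak_ops, mem_bw * intensity)

peak, bw = 9.5e12, 256e9
print(peak / bw)                         # ridge point: ~37 op/byte
print(roofline(peak, bw, 0.87) / 1e12)   # GPU-style 2D-FFT: ~0.22 TFLOP/s
print(roofline(peak, bw, 4.3) / 1e12)    # FPGA-style 2D-FFT: ~1.1 TFLOP/s
```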

SLIDE 30

2D-FFT: “classical” row-column algorithm

  1. apply 1D-FFT to the individual rows (pass 1)
  2. apply 1D-FFT to the individual columns (pass 2)

During pass 2, with a DRAM transaction size of B pixels, B−1 pixels per transaction are read/written without being used: if B > 1 the memory bandwidth is under-utilized. Effectively 1+B read-write passes to DRAM, hence:

I_op,row-col(N) = 1/(1+B) × I_A(N) << 0.31 log2(N) ops/byte
SLIDE 31

2D-FFT, using matrix transposition

  1. apply 1D-FFT to the individual rows (pass 1)
  2. transpose the matrix block by block (size B×B) in on-chip memory (passes 2, 4)
  3. apply 1D-FFT to the individual transposed columns (pass 3)
  4. transpose the matrix

On-chip memory: 2 × max(B×B, N) pixels. 4 read-write passes to DRAM, hence:

I_op,transpose(N) = ¼ I_A(N) = 0.16 log2(N) ops/byte
SLIDE 32

2D-FFT by processing B rows/columns in parallel

  1. apply 1D-FFT to B rows in parallel (pass 1)
  2. apply 1D-FFT to B columns in parallel (pass 2)

On-chip memory: (±2) × B × N pixels. 2 read-write passes to DRAM, hence:

I_op,B-row-col(N) = ½ I_A(N) = 0.31 log2(N) ops/byte
SLIDE 33

2D-FFT by processing B segmented columns in parallel

Columns: Cooley-Tukey factorized into steps 1b + 2.

  1. a) apply 1D-FFT to N_R rows in parallel (optimal: √B rows) (pass 1a)
     b) apply a partial 1D-FFT to N_C columns in parallel (pass 1b)
  2. apply a partial 1D-FFT to the column segments in parallel (pass 2)

On-chip memory: (±2) × max(N_R, √B) × N pixels. 2 read-write passes to DRAM, hence: [Yu10]

I_op,segm-col(N) = ½ I_A(N) = 0.31 log2(N) ops/byte

A side-by-side comparison of the four variants is sketched below.
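The four operational intensities derived on the last four slides, in one place (the DRAM transaction size B = 32 pixels is an assumed illustrative value):

```python
import math

def i_a(n):                      # arithmetic intensity (SLIDE 27)
    return 0.625 * math.log2(n)

def i_op(n, b=32):
    """Operational intensity (op/byte) of the four 2D-FFT variants."""
    return {"row-col":   i_a(n) / (1 + b),  # 1+B effective passes
            "transpose": i_a(n) / 4,        # 4 passes
            "B-row-col": i_a(n) / 2,        # 2 passes
            "segm-col":  i_a(n) / 2}        # 2 passes [Yu10]

for name, v in i_op(2**14).items():
    print(f"{name:10s} {v:5.2f} op/byte")   # 0.27, 2.19, 4.38, 4.38
```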
SLIDE 34

2D-FFT on FPGA, based on pipelined 1D-FFT

  • off-chip DRAM transactions (read|write) of size B pixels [8 Byte] at a rate of f_B transactions/sec
  • P 1D-FFT pipelines with i/o rates of f_P pixels/sec

Rate-matching equation:

f_B × B = 2 × f_P × P
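A worked dimensioning example under assumed numbers (HBM2-class 256 GB/s from SLIDE 24, 800 MHz pipelines): how many 1D-FFT pipelines can one DRAM interface keep fed?

```python
# Rate matching: f_B x B = 2 x f_P x P   (pixels/sec on both sides)
bytes_per_pixel = 8
dram_bw = 256e9                          # bytes/sec (HBM2-class, SLIDE 24)
fB_times_B = dram_bw / bytes_per_pixel   # 32e9 pixels/sec in and out of DRAM
fP = 800e6                               # pixels/sec per pipeline @ 800 MHz

P = fB_times_B / (2 * fP)
print(f"P = {P:.0f} matched 1D-FFT pipelines")   # -> 20
```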

SLIDE 35

2D-FFT on FPGA: dimensioning

SLIDE 36

2D-FFT on FPGA: dimensioning

[Dimensioning table comparing DDR3 and HBM2 memory interfaces.]

SLIDE 37

2D-FFT on FPGAs

[Plot: 2D-FFT throughput on FPGAs versus FFT size.]
  • Stratix 10: 32b floating point; throughputs based on I_op, 20% margin.
  • [Yu11]: 16b fixed point, hence I_op 2×.
  • [Yu10]: 32b fixed point.

SLIDE 38

2D-FFT on a GPU

Based on [Won10], 2010: “Demystifying GPU microarchitecture through micro-benchmarking”. Nvidia GTX200, Tesla microarchitecture:

  • 30 Streaming Multiprocessors (SM)
  • each SM contains 8 Scalar Processors (SP)
  • each SP: 1 fused multiply-add per clock cycle @ 1.35 GHz
  • the unit of execution flow in the SM is the warp, comprising 32 threads
  • “6 warps (192 threads) needed to hide register read-after-write latencies”
  • register file: 64 kB per SM (max 128 registers per thread)
  • all register files combined: 2 MB, exceeding the on-chip “shared memory” (by 4×) and the on-chip caches!
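The register-file arithmetic behind the last two bullets, spelled out with the numbers quoted from [Won10]:

```python
sms = 30
regfile_bytes_per_sm = 64 * 1024

print(sms * regfile_bytes_per_sm / 2**20)   # ~1.9 MB of registers chip-wide

# Latency hiding needs 6 warps x 32 threads in flight per SM:
threads = 6 * 32                            # 192 threads
regs_per_thread = regfile_bytes_per_sm // 4 // threads   # 32-bit registers
print(threads, regs_per_thread)             # 192, 85 (below the 128 max)
```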

SLIDE 39

2D-FFT on a GPU

Based on the Microsoft 2008 paper [Gov08, about 300 citations]: “High Performance Discrete Fourier Transforms on Graphics Processors”.

Parallelism: 1 thread = 1 butterfly! “To maximize the reuse of data read from DRAM …, it is best to use a large radix R. However, R is limited by the number of registers and the size of the shared memory on the multiprocessors… We use R=8”. With R=8 and N=4k, there are “only” 4k/8 threads per 1D-FFT stage; hence M FFTs are processed in parallel “to achieve full utilization of the SMs or to hide memory latency while accessing DRAMs.” After each radix-8 stage the result is written back into off-chip DRAM, so:

I_op,R8-stage(N) = I_A(N) / (2 ⌈log8(N)⌉) = 0.625 log2(N) / (2 ⌈log8(N)⌉) ≈ 0.87 ops/byte
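Evaluating that expression for the sizes of interest (using log8(N) = log2(N)/3):

```python
import math

def i_op_r8(n):
    """I_op when every radix-8 stage round-trips through DRAM [Gov08]."""
    stages_per_dim = math.ceil(math.log2(n) / 3)   # ceil(log8 N)
    return 0.625 * math.log2(n) / (2 * stages_per_dim)

for n in (2**12, 2**14):
    print(n, round(i_op_r8(n), 2))   # 4096 -> 0.94, 16384 -> 0.88
```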

SLIDE 40

Measured 2D-FFT throughput on GTX280 GPU

FFT size:
  • Small (N ≤ 256): not enough threads.
  • Medium (512 ≤ N ≤ 1024): data fits in on-chip shared memory.
  • Large (2048 ≤ N): on-chip shared memory too small … and throughput is limited by DRAM bandwidth for each 1D-FFT radix-8 stage! [Gov08]

SLIDE 41

2D-FFT on GPUs

[Plot: 2D-FFT throughput on GPUs versus FFT size.]
  • GP100: throughputs based on I_op, 20% margin.
  • [Gov08]: outlier for N=1024: the 1D-FFT just fits in on-chip memory.
SLIDE 42

Parallelism used for FFT on FPGAs vs GPUs

Multi-stage parallelism (pipelined FFT):
  • FPGA: simple and efficient;
  • GPU: impractical (sync overhead, insufficient on-chip memory).

Intra-stage parallelism (multi-butterfly):
  • FPGA: not needed;
  • GPU: essential to obtain sufficiently many threads.

Multiple FFTs in parallel (M):
  • FPGA: used to match the throughput of M pipelines with the memory bandwidth;
  • GPU: needed to obtain sufficiently many threads.

SLIDE 43

Projected 2D-FFT throughputs for GPU and FPGA

[Plot: projected 2D-FFT throughputs; the FPGA curves sit roughly 5× above the GPU curves.]

Y2020 GPU numbers are from the Nvidia paper [Ore14]. Y2020 FPGA: the same “HBMx” and a similar mix of on-chip resources are assumed.

SLIDE 44

Large 2D-FFT: GPU or FPGA?

State-of-the-art FPGAs and GPUs: similar {GFLOP/s, GB/s, ridge points}.

2D-FFT on FPGA: fairly good operational intensity (up to 5 op/byte):
  • FPGAs support pipelined 1D-FFTs and processing B (segmented) columns in parallel.

2D-FFT on GPU: poor operational intensity (< 1 op/byte):
  • requires many threads per scalar processor to hide pipeline and memory latencies; most die area is spent on register files;
  • GPUs support only butterfly and multi-FFT parallelism.

For 2D-FFT, with N in the range 4k to 16k, FPGAs relative to GPUs:
  • require ≈ 5× fewer DRAM read-write passes,
  • offer ≈ 5× more throughput,
  • and require ≈ 10× less energy per 2D-FFT …

… “on paper”.

SLIDE 45

FPGA as accelerator for exascale computing?

FPGA for radio astronomy (science data processing)?
  • “5× more throughput at 10× less power for 2D-FFT” …
  • … needs a demo on HW,
  • … and may just meet the SKA power target (100 GFLOP/s/W).
  • How about other algorithms: gridding, W-projection, coherent dedispersion, …?

FPGA for exascale computing?
  • Top 20 of the Top 500: 5× GPU (incl. #2 = Titan) versus 0× FPGA.
  • “Intel + Altera = Efficient HPC Co-processing” (Altera website).
  • Will a “high-level programming model in OpenCL” deliver?
  • FPGA-for-HPC momentum?
SLIDE 46

Several rooflines and 2D-FFT data points

[Figure only: rooflines with 2D-FFT data points.]

SLIDE 47

References (1)

[Aki12] Berkin Akın et al, “Memory Bandwidth Efficient Two-Dimensional Fast Fourier Transform Algorithm and Implementation for Large Problem Sizes”, 2012 IEEE 20th Annual Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), pp. 188-191.

[Bar13] R.F. Barrett et al, “On the Role of Co-design in High Performance Computing”, Transition of HPC Towards Exascale Computing, IOS Press, 2013, pp. 141-155.

[Cla90] B.G. Clark, “Coherence in Radio Astronomy”, pp. 1-10 in [Tay99].

[Dew13] P.E. Dewdney et al, “SKA1 System Baseline Design”, tech. report SKA-TEL-SKO-DD-001, SKA, Mar. 2013; www.skatelescope.org/?attachment_id=5400.

[fftw16] http://www.fftw.org/speed/CoreDuo-3.0GHz-icc/

[Gov08] N.K. Govindaraju et al, “High Performance Discrete Fourier Transforms on Graphics Processors”, Proc. of the 2008 ACM/IEEE Conference on Supercomputing, article no. 2.

[Gre14] The Green500 List, November 2014, http://www.green500.org.

[Hög74] Jan Högbom, “Aperture Synthesis with a Non-Regular Distribution of Interferometer Baselines”, Astronomy and Astrophysics Supplement, 1974, Vol. 15, pp. 417-426.

[Jon14] R. Jongerius, S. Wijnholds, R. Nijboer, and H. Corporaal, “End-to-end compute model of the Square Kilometre Array”, IEEE Computer, Sept. 2014, pp. 48-54.

[Loa92] C. Van Loan, Computational Frameworks for the Fast Fourier Transform, SIAM, 1992.

[Ore14] Oreste Villa et al, “Scaling the Power Wall: A Path to Exascale”, SC14: Intl. Conf. for High Performance Computing, Networking, Storage and Analysis, pp. 830-841.

SLIDE 48

References (2)

[Tay99] G.B. Taylor, C.L. Carilli, and R.A. Perly (eds.), Synthesis Imaging in Radio Astronomy II, ASP Conf Series, Vol. 180, 1999. [Tho01] Thompson, A., Moran, J., & Swenson, G. 2001, Interferometry and synthesis in radio astronomy, Wiley, New York. [Ver15] Erik Vermij et al, “Challenges in exascale radio astronomy: Can the SKA ride the techn-

  • logy wave? Intl. Journal of High Performance Computing Applications 2015, Vol. 29(1),
  • pp. 37-50.

[Wijn14] S. J. Wijnholds, A.-J. van der Veen, F. De Stefani, E. La Rosa, A. Farina, Signal Processing Challenges for Radio Astronomical Arrays, 2014 IEEE ICASSP, pp. 5382-86. [Wil09] Samuel Williams, Roofline: an insightful visual performance model for multicore architectures, Comm. of the ACM, Volume 52 Issue 4, April 2009, pp. 65-76. [Won10] H. Wong et al, Demystifying GPU microarchitecture through micro-benchmarking, 2010 IEEE Intel. Symp. on Performance Analysis of Systems & Software (ISPASS),

  • pp. 235 – 246.

[Yu10] Chi-Li Yu et al, Bandwidth-intensive FPGA architecture for multi-dimensional DFT, 2010 IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, pp. 1486 – 1489. [Yu11] Chi-Li Yu et al, FPGA Architecture for 2D Discrete Fourier Transform Based on 2D Decomposition for Large-sized Data, Journal of Signal Processing Systems, July 2011, Volume 64, Issue 1, pp. 109-122.