

  1. Exascale Computing for Radio Astronomy: GPU or FPGA? Kees van Berkel, MPSoC 2016, Nara, Japan, July 14, 2016

  2. Radio Astronomy: Hercules A (a.k.a. 3C 348). “… optically invisible jets, one-and-a-half million light-years wide, dwarf the visible galaxy from which they emerge.” Image courtesy of NRAO/AUI. Kees van Berkel page 1

  3. VLA radio telescope, New Mexico: 27 independent antennae (dishes), each with a diameter of 25 m.

  4. NGC 6946: where is the NH3, and how cold is it? Optical + X-ray combined; radio at 24 GHz (λ = 12.5 mm). 20 million light years from Earth; 1.76 GB of “radio data” (image about 50 arcsec wide; a few fJ in total, a few billion photons).

  5. NGC 6946: where is the NH3, and how cold is it? “Image cube”: (256 × 256 pixels) × 640 channels of 500 kHz each. The NH3 cloud shows up at 19 K.

  6. Exascale Computing for Radio Astronomy: GPU or FPGA?
  • What kind of computing is needed? Computing for the Square Kilometer Array (y2022): 2D-FFT, (de-)convolution, filters, dedispersion, and a lot more.
  • How much? “Exa-scale”: 10^18 FLOP/sec, i.e. 10 × the fastest existing computer; 10^4.5 nodes × 10^4.5 ALUs × f_c = 10^9 Hz?
  • In what form? GPU or FPGA? Accelerator per node?
  • How to find out? Use rooflines as a tool, for modeling and for prediction.

  7. Interferometry: a 2-element interferometer. The output of the correlator is the correlation ⟨E_ν(r1) · E*_ν(r2)⟩, where
  • E_ν(r1) is the electric field at position r1,
  • ν the observation frequency, and
  • * denotes complex conjugation.
  The vector connecting the two antennae is the baseline.

  8. Van Cittert–Zernike theorem [1934-38]. For a quasi-monochromatic source, the correlator output is

  V_ν(r1, r2) = ∫ I_ν(s) exp(−2πiν s·(r1 − r2)/c) dΩ

  where I_ν is the sky intensity, c the speed of light, dΩ an element of solid angle, and r1 − r2 the baseline vector separating the 2 antennae. Adding geometry (assuming “narrow field”):

  V_ν(u, v) = ∫∫ I_ν(l, m) exp(−2πi(ul + vm)) dl dm

  a 2D Fourier transform! Here (l, m) are sky-image coordinates and (u, v) are coordinates of the baseline vector [Tay99, Cla90, Tho01].

  9. Van Cittert–Zernike theorem [1934-38], in principle: the (u, v) coverage, with amplitude and phase (A, φ) at each (u, v) point, is the DFT of the (l, m) image pixel intensities; the I-DFT recovers the image.

  10. Sampling Lucy in the u-v domain with a disc: (x, y) image → DFT → (u, v) plane, × disc-shaped sampling mask → IDFT → reconstructed image.

  11. DFT convolution theorem. In the visibility domain, sampling is a multiplication:

  V_ν(u, v) × S(u, v) = V^S_ν(u, v)

  where V_ν is the observed complex visibility function (Hermitian) and S(u, v) the sampling function. Applying the I-DFT, the convolution theorem gives:

  I_ν(l, m) ∗ B_ν(l, m) = I^D_ν(l, m)

  i.e. the image convolved with the dirty beam (point spread function) equals the real-valued dirty image (dirty map). Recovering I_ν(l, m) requires “de-convolution”.
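The sampling/convolution relation above can be checked numerically. A minimal NumPy sketch; the toy point-source sky, the random 20% (u, v) mask, and the grid size are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256

# Toy "sky" I(l, m): a few point sources on an N x N grid.
sky = np.zeros((N, N))
for _ in range(5):
    y, x = rng.integers(0, N, size=2)
    sky[y, x] = rng.uniform(1.0, 10.0)

# Full visibility plane V(u, v): the 2D DFT of the sky.
vis = np.fft.fft2(sky)

# Sampling function S(u, v): 1 where a baseline measures, 0 elsewhere.
# A random 20% mask stands in for real (u, v) coverage; it is made
# Hermitian-symmetric (S(-u, -v) = S(u, v)), as holds for a real sky.
S = (rng.random((N, N)) < 0.2).astype(float)
S = np.maximum(S, np.roll(S[::-1, ::-1], (1, 1), axis=(0, 1)))

# Dirty image = I-DFT of the sampled visibilities;
# dirty beam (PSF) = I-DFT of the sampling function itself.
dirty_image = np.fft.ifft2(vis * S).real
dirty_beam = np.fft.ifft2(S).real

# Convolution theorem: dirty image == sky (circularly) convolved with beam.
conv = np.fft.ifft2(np.fft.fft2(sky) * np.fft.fft2(dirty_beam)).real
assert np.allclose(dirty_image, conv)
```

The assertion is exactly the slide's identity: multiplying by S in the (u, v) domain is convolving with the dirty beam in the (l, m) domain.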

  12. DFT convolution: Lucy with 2 hours of VLA time. (x, y) image → DFT → (u, v) plane, × VLA sampling pattern → IDFT → dirty image.

  13. DFT convolution: Lucy with 12 hours of VLA time. (x, y) image → DFT → (u, v) plane, × VLA sampling pattern → IDFT → dirty image.

  14. DFT convolution: synthetic sky with 2 hours of VLA time. (x, y) image → DFT → (u, v) plane, × sampling → IDFT → dirty image; CLEAN “de-convolution” and an I-DFT then recover the sources.

  15. De-convolution (“imaging”) based on CLEAN [Hög74]. Gridding (G) the 3D visibilities V(u,v,w) onto V(u,v,w=0) and applying a 2D I-FFT yields the (real-valued) dirty image. Minor cycle (100 × per major iteration): extract the strongest point source from the residual image, add it, scaled by loop gain γ, to the sky model I(l,m), and subtract γ × PSF (dirty beam) at its position from the residual image. Major cycle (3 × 10 iterations: 3 × calibration around 10 × major): convolve the updated model with the clean beam, FFT, de-grid (G^T), and subtract the model visibilities from V(u,v,w). (W-projection/snapshot implicit.)
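The minor cycle above (peak extraction plus PSF subtraction) can be sketched in a few lines of NumPy. This is a Högbom-style toy, not the slide's full pipeline: the loop gain, iteration count, grid size, and the synthetic single-source sky are illustrative assumptions, and the major cycle, calibration loops, and W-projection are omitted:

```python
import numpy as np

def hogbom_clean(dirty, beam, gamma=0.2, n_iter=200):
    """CLEAN minor cycle: repeatedly find the residual peak, add a
    gamma-scaled point source to the model, subtract the shifted PSF."""
    residual = dirty.copy()
    model = np.zeros_like(dirty)
    for _ in range(n_iter):
        y, x = np.unravel_index(np.argmax(np.abs(residual)), residual.shape)
        flux = gamma * residual[y, x]
        model[y, x] += flux
        # beam is normalized with its peak at (0, 0); shift it to (y, x).
        residual -= flux * np.roll(beam, (y, x), axis=(0, 1))
    return model, residual

# Synthetic setup: Hermitian random (u, v) mask -> dirty beam, peak at origin.
rng = np.random.default_rng(1)
N = 128
S = (rng.random((N, N)) < 0.3).astype(float)
S = np.maximum(S, np.roll(S[::-1, ::-1], (1, 1), axis=(0, 1)))
beam = np.fft.ifft2(S).real
beam /= beam[0, 0]

# One point source of flux 5.0; dirty image = sky convolved with the beam.
sky = np.zeros((N, N))
sky[40, 70] = 5.0
dirty = np.fft.ifft2(np.fft.fft2(sky) * np.fft.fft2(beam)).real

model, residual = hogbom_clean(dirty, beam)
```

With a single source the residual shrinks by a factor (1 − γ) per iteration, so `model[40, 70]` converges geometrically to the true flux 5.0.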

  16. SKA1-mid [South Africa]: science in 2020. Towards a Square Kilometer Array. Photograph / artist impression: SKA Organisation [Dew13] / Swinburne Astronomy Productions.

  17. Imaging: compute load for SKA1-mid

  quantity                        unit   log10   note
  # base lines                           5+      (2 × #dishes)^2 = (2 × 200)^2
  dump rate                       s^-1   1+      (integration time = 0.08 s)^-1
  observation time                s      3
  # channels                             5       “image cube” for spectral analysis
  # visibilities / observation           14.5    input to imaging (≈ 10^16 Byte)
  # op / visibility / iteration          4.5     convolution, matrix multiply, (I)FFT
  # major iterations                     1.5     (3 × calibration) × (10 × major)
  # op / observation                     20.5
  # op / sec                      Hz     17.5    ≈ 1 exaflop/sec

  • # operations/visibility/iteration depends on the W-projection method
  • calibration loop (3 ×) around the imaging loop [Jon14, Ver15, Wijn14]
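The log10 budget in the table can be re-derived with a few lines of arithmetic. The dish count, integration time, and per-visibility operation count are taken from the table; the slide's “+” entries round some rows up, so the exact products land slightly below the quoted totals:

```python
import math

log_baselines = math.log10((2 * 200) ** 2)   # (2 x #dishes)^2, ~10^5.2
log_dump_rate = math.log10(1 / 0.08)         # per-second dump rate, ~10^1.1
log_obs_time = 3.0                           # 10^3 s observation
log_channels = 5.0                           # 10^5 channels

log_vis = log_baselines + log_dump_rate + log_obs_time + log_channels
log_ops = log_vis + 4.5 + 1.5                # x op/vis/iter x major iterations
log_ops_per_sec = log_ops - log_obs_time     # ~10^17.5 op/s: near-exascale
```

The result is within rounding of the table's 14.5 visibilities and 17.5 op/s exponents.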

  18. EXAflops/sec in 2015? Net SKA1-mid computation load “2020” (at a 20 MW power budget) versus gross (peak) compute performance “2015” [Gre14], e.g. Piz Daint (CH) and Tianhe-2 (#1 until 2016).

  19. Piz Daint: a Cray XC30 system, comprising many kilometers of (optical) cable and 5272 nodes.

  20. Super computers 101. A modern supercomputer = N (10^3 - 10^5) identical nodes connected by a network (system-level interconnect), ignoring storage, peripherals, service nodes, … Each node is a SoC (System on Chip) containing LOCs (Latency Optimized Cores) and a TOC (Throughput Optimized Core), connected by a NoC (Network on Chip) to on-chip memory (Mem, e.g. L2), an MC (Memory Controller) for synchronous DRAM, and an NIC (Network Interface). (“TOC, LOC” is Nvidia speak.)

  21. Piz Daint: 7.8 PFLOP/s @ 1.8 MW. Cray XC30, N = 5272; per node LOC = 8 × Intel Xeon @ 2.6 GHz, TOC = Tesla K20X GPU; Aries network, Dragonfly router IC.

                   node    system
  # nodes          1       5272
  Xeon 2.6 GHz     0.17    (TFLOP/s)
  Tesla K20X       1.31    (TFLOP/s)
  TFLOP/s          1.48    7787
  memory (TB)      0.06    337
  power (kW)       0.33    1754

  22. Nvidia Research [Ore14]: 1 ExaFLOP/s @ 23 MW in 2020. Per node: TOCs = 8,192 multiply-add units @ 1 GHz, 7 nm CMOS, double precision [Ore14]; N = 76,800; Aries-like network.

                   node    system
  # nodes          1       76800
  TFLOP/s          16.4    1258 PF/s
  memory (TB)      0.51    39322
  power (kW)       0.30    23000

  23. Exascale computing for radio astronomy. Exascale computing: 10^18 FLOP/s. Radio astronomy: 10^17.5 FLOP/s, with gridding (W-projection) and 2D-FFT as heavy kernels. Let's map the 2D-FFT onto a node. Option 1: FPGA* (* in the same package, not the same SoC). Option 2: GPU.

  24. State-of-the-art GPU and FPGAs

                                  Nvidia GP100   Intel/Altera Stratix 10   Xilinx VU13P
  CMOS                    nm      16             14                        16
  clock frequency         MHz     1328           800                       800*
  scalar/DSP processors           3584           11520                     11904
  peak throughput         GFLOP/s 9519           9216                      7619
  data type [32b]                 float          float                     fixed
  DRAM interface                  HBM2           HBM2#                     HBM2#
  DRAM bandwidth          GB/s    256            256                       256
  power consumption       W       300            126
  GFLOP/W                         32             73

  * assumption, no data found
  # an HBM2 (High Bandwidth Memory) interface to 3D-stacked DRAM is an option

  25. DFT in matrix-vector form. Let x and X be complex vectors of length N. Then

  (X_0, X_1, …, X_{N−1})^T = F_N · (x_0, x_1, …, x_{N−1})^T, or X^T = F_N · x^T

  where F_N is the twiddle factor matrix, (F_N)_{jk} = ω^{jk} with ω = e^{2πi/N}. In 2 dimensions:

  Y = F_M · X · F_N

  where Y and X are matrices of size M × N, and F_M · X applies an M-point 1D-DFT to each column of matrix X.
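Both the matrix-vector form and the 2D form Y = F_M · X · F_N can be verified against NumPy's FFT. Note that NumPy uses the conjugate sign convention ω = e^{−2πi/N}, so the sketch below builds F_N with that sign; the sizes M, N are arbitrary choices:

```python
import numpy as np

def dft_matrix(N):
    # Twiddle factor matrix F_N with (F_N)[j, k] = w^(j*k),
    # using NumPy's FFT sign convention w = exp(-2*pi*i/N).
    j, k = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.exp(-2j * np.pi * j * k / N)

rng = np.random.default_rng(0)
M, N = 8, 16
X = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))

# 1D: X_hat = F_N . x
x = X[0]
assert np.allclose(dft_matrix(N) @ x, np.fft.fft(x))

# 2D: Y = F_M . X . F_N (M-point DFT per column, N-point DFT per row;
# right-multiplying by F_N works because F_N is symmetric).
Y = dft_matrix(M) @ X @ dft_matrix(N)
assert np.allclose(Y, np.fft.fft2(X))
```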

  26. DFT: Cooley-Tukey factorization theorem. Let N = P × M. Then F_N can be factorized as [Loa92]

  F_N = P_{N,P} (I_P ⊗ F_M) D_N (F_P ⊗ I_M)

  where I_M and I_P are identity matrices, ⊗ denotes the tensor product, and D_N is the diagonal twiddle matrix. Example (P = 2, M = N/P = 4):
  • M radix-2 DFTs (F_P ⊗ I_M)
  • twiddle factors D_N
  • P radix-N/2 DFTs (I_P ⊗ F_M)
  • P_{N,P} omitted
  The Cooley-Tukey factorization is the basis of the FFT.
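The factorization can be checked by multiplying out the four factors, again with NumPy's ω = e^{−2πi/N} sign convention. In this sketch the stride permutation P_{N,P} is realized as a reshape/transpose, and the dense Kronecker products are for verification only (a real FFT would never materialize them):

```python
import numpy as np

def dft_matrix(N):
    j, k = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.exp(-2j * np.pi * j * k / N)

def cooley_tukey(x, P, M):
    # One Cooley-Tukey step for N = P*M, applying
    # F_N = P_{N,P} (I_P (x) F_M) D_N (F_P (x) I_M) right to left.
    N = P * M
    F_P, F_M = dft_matrix(P), dft_matrix(M)
    # (F_P (x) I_M): P-point DFTs over M stride-M subsequences.
    y = np.kron(F_P, np.eye(M)) @ x
    # D_N: diagonal twiddle factors w^(p*m) on block p, element m.
    tw = np.concatenate(
        [np.exp(-2j * np.pi * p * np.arange(M) / N) for p in range(P)])
    y = tw * y
    # (I_P (x) F_M): P independent M-point DFTs.
    y = np.kron(np.eye(P), F_M) @ y
    # P_{N,P}: stride permutation interleaving the P result blocks.
    return y.reshape(P, M).T.reshape(N)

rng = np.random.default_rng(0)
P, M = 2, 4
x = rng.standard_normal(P * M) + 1j * rng.standard_normal(P * M)
assert np.allclose(cooley_tukey(x, P, M), np.fft.fft(x))
```

Applying the same split recursively to the M-point DFTs gives the familiar O(N log N) FFT.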
