
S9306 Extreme Signal-Processing Performance Using Tensor Cores



  1. Tuesday 19th March, 2019, GPU Technology Conference 2019, San Jose, USA. S9306 Extreme Signal-Processing Performance Using Tensor Cores / Astronomical Imaging on GPUs. John Romein and Bram Veenboer, Netherlands Institute for Radio Astronomy. This talk consists of two parts. In the first part, we explain how we use Tensor Cores to obtain extreme signal-processing performance. In the second part of this talk, we explain how we solve the largest computational challenge in the imaging pipeline of modern radio telescopes.

  2. Tensor Cores: Signal Processing at Unprecedented Speeds John Romein ASTRON (Netherlands Institute for Radio Astronomy) GTC'19 / Tensor Core Signal Processing March 18-21, 2019 1

  3. outline ● tensor cores ● complex numbers and matrix multiplications ● signal-processing algorithms – correlations (with performance analysis) – beam forming – ...

  4. tensor cores ● mixed-precision matrix multiplication hardware – Volta, Turing ● V100: peak 112 (!) TFLOPS ● designed for deep learning

  5. how to use tensor cores ● libraries (cuBLAS, CUTLASS, ...) ✘ – insufficient support for complex numbers ● WMMA ✔ – operates directly on 16x16 matrices (+ a few more formats) – usable from a CUDA program

  6. WMMA example ● each warp performs a 16x16 matrix multiplication: load_matrix_sync(a_frag, &a[…][…], K); load_matrix_sync(b_frag, &b[…][…], N); fill_fragment(c_frag, 0); mma_sync(d_frag, a_frag, b_frag, c_frag); // d = a*b + c store_matrix_sync(&d[…][…], d_frag, …);

  7. signal processing: complex numbers ● describes phase & amplitude of a signal ● real and imaginary part (a_r, a_i) ● complex multiply-add = 4 real multiply-adds: c += a·b → c_r += a_r·b_r; c_r += −a_i·b_i (the − sign has no tensor-core support); c_i += a_r·b_i; c_i += a_i·b_r

  8. two complex array/matrix formats 1) split: all real parts in one matrix, all imaginary parts in another: float real[4][4], imag[4][4]; 2) interleaved: real and imaginary part adjacent per element: std::complex<float> matrix[4][4];

  9. 1) complex split matrices [C] = [A][B] → [C_r] = [A_r][B_r] + [−A_i][B_i], [C_i] = [A_r][B_i] + [A_i][B_r] ● maps well to tensor cores – negate the A_i values (for the real part)

  10. 2) interleaved complex matrices ● reorder the right-hand matrix for tensor-core use – duplicate/permute/negate entries: each complex entry b becomes the 2x2 real block (b_r, b_i / −b_i, b_r), so a real matrix product on the interleaved left-hand matrix yields the interleaved complex result

  11. complex formats: split array vs. interleaved ● implemented both ● generally no big performance difference

  12. tensor cores for signal processing ● suitable if – input ≤ 16 bit ✔ – algorithm translates to matrix-matrix multiplication ✔ / ✘ (depends on the algorithm)

  13. algorithm 1: correlations

  14. correlations ● combines telescope data – each pair: multiply & accumulate – ½r(r+1) pairs ● correlation(recv1, recv2) = Σ_{time=0}^{integration time − 1} sample(recv1, time) × conj(sample(recv2, time))

  15. correlator computations ● C ← A · Aᴴ (A: receivers × time; C: receivers × receivers) ● C = Cᴴ → compute & store one triangle

  16. work decomposition ● computeSquares() – thread block: 64x64 receivers – warp: 32x16 receivers ● computeTriangles() – 64x64-receiver tiles on the diagonal – redundant computations above the diagonal

  17. correlator implementation ● cache input: L2 → shared mem → registers – fix the − sign on the fly ● wmma::store_matrix_sync() cannot write to a triangle – copy via shared mem, or – write accumulation registers directly (hack!)

  18. correlator performance [chart: TFLOPS (0–80) vs. # receivers (64–576) for overall, correlateSquares(), correlateTriangles(); measured on Tesla V100]

  19. correlator roofline analysis [roofline plot: TFLOPS vs. FLOPS/byte (1–8192); correlateSquares() and correlateTriangles() lie in the compute-bound region, beyond the memory-bandwidth roof]

  20. correlator energy efficiency [chart: GFLOP/J (0–300) vs. # receivers (64–576) for overall, computeSquares(), computeTriangles(); measured on Titan RTX, not V100]

  21. innovation beyond Moore's law [chart: TFLOPS (0–80) vs. year (2013–2020), Tensor Cores vs. FP32, for Titan RTX (Turing), Tesla V100 (Volta), Titan X (Pascal), Titan X (Maxwell), Tesla K40 (Kepler)]

  22. algorithm 2: beam forming

  23. beam forming ● increase sensitivity in a particular direction ● (weighted) addition of signals (image credit: Jason Hessels) ● bfdata(time, beam) = Σ_{recv=0}^{nr recv − 1} samples(time, recv) × weights(recv, beam)

  24. beam former implementation ● multiple beams: complex matrix-matrix multiplication – bfdata (time × beams) = samples (time × receivers) × weights (receivers × beams) ● bfdata(time, beam) = Σ_{recv=0}^{nr recv − 1} samples(time, recv) × weights(recv, beam)

  25. beam former performance and roofline analysis [left: TFLOPS (0–80) vs. # receivers (64–512) for 512, 256, and 128 beams; right: roofline, TFLOPS vs. FLOPS/byte (16–1024), compute bound vs. memory-bandwidth bound, 64–512 receivers]

  26. other algorithms

  27. other signal-processing algorithms ● nonuniform Fourier transforms ✔ – map well to complex matrix-matrix multiplication – ≤ 80 TFLOPS ● FIR filter ✘ – as matrix multiplication → many zeros – typically memory bandwidth bound → no need for tensor cores ● FFT ✘ – not a matrix multiplication – memory bandwidth bound [roofline plot: TFLOPS vs. FLOPS/byte for nuFT, FIR filter, FFT]

  28. current / future work ● try further optimizations – correlator: near diagonal – beam forming: cublasLtMatmul() (CUDA 10.1) ● support any number of receivers/beams ● 8 bit, 4 bit

  29. conclusions ● tensor cores for signal processing, via matrix-matrix multiplication: – correlating – multi-beam forming – nonuniform Fourier transforms ● unprecedented performance (≤ ~75 TFLOPS, up to 6x speedup) This work is funded by the European Union under grant no. H2020-FETHPC-754304 (DEEP-EST).

  30. Tuesday 19th March, 2019 GPU Technology Conference 2019 San Jose, USA Astronomical Imaging on GPUs Bram Veenboer Netherlands Institute for Radio Astronomy

  31. Outline • Introduction: • Interferometry • The Image-Domain Gridding algorithm • Performance analysis • Analysis and optimization: • Gridder kernel • Imaging application • Performance and energy-efficiency comparison • Results in context of Square Kilometre Array • Summary

  32. Introduction to radio astronomy (image credits: NRAO) • Observe the sky at radio wavelengths → map of radio sources • Dish-based telescopes: VLA, ALMA, MeerKAT, SKA-1 Mid • Size of the telescope is proportional to the wavelength • Use arrays of antennas for low frequencies (LOFAR, MWA, SKA-1 Low)

  33. Radio telescope: astronomical interferometer • Interferometer: array of separate telescopes • Interferometry: combine the signals from separate radio telescopes • Resolution similar to one very large dish (set by the baseline)

  34. Interferometry theory [figure: image plane with (l, m) coordinates and angle Θ toward the N Pole; uv plane with (u, v, w) coordinates, u and v in km, ranging −60 to 60] • Sampling of the 'uv-plane': 'visibilities' • Earth-rotation synthesis
