
S9306 Extreme Signal-Processing Performance Using Tensor Cores



  1. Tuesday 19th March, 2019, GPU Technology Conference 2019, San Jose, USA. S9306 Extreme Signal-Processing Performance Using Tensor Cores / Astronomical Imaging on GPUs. John Romein and Bram Veenboer, Netherlands Institute for Radio Astronomy. This talk consists of two parts. In the first part, we explain how we use Tensor Cores to obtain extreme signal-processing performance. In the second part of this talk, we explain how we solve the largest computational challenge in the imaging pipeline of modern radio telescopes.

  2. Tensor Cores: Signal Processing at Unprecedented Speeds John Romein ASTRON (Netherlands Institute for Radio Astronomy) GTC'19 / Tensor Core Signal Processing March 18-21, 2019 1

  3. outline ● tensor cores ● complex numbers and matrix multiplications ● signal-processing algorithms – correlations (with performance analysis) – beam forming – ...

  4. tensor cores ● mixed-precision matrix multiplication hardware – Volta, Turing ● V100: peak 112 (!) TFLOPS ● designed for deep learning

  5. how to use tensor cores ● libraries (cuBLAS, CUTLASS, ...) ✘ – insufficient support for complex numbers ● WMMA ✔ – operates directly on 16x16 matrices (+ a few more formats) – usable from a CUDA program

  6. WMMA example ● each warp performs a 16x16 matrix multiplication: load_matrix_sync(a_frag, &a[…][…], K); load_matrix_sync(b_frag, &b[…][…], N); fill_fragment(c_frag, 0); mma_sync(d_frag, a_frag, b_frag, c_frag); // d = a*b + c store_matrix_sync(&d[…][…], d_frag, …);

  7. signal processing: complex numbers ● describes phase & amplitude of a signal ● real and imaginary part (a_r, a_i) ● complex multiply-add = 4 real multiply-adds: c += a·b → c_r += a_r·b_r; c_r += −a_i·b_i (the − sign has no tensor-core support); c_i += a_r·b_i; c_i += a_i·b_r

  8. two complex array/matrix formats 1) split: all real parts in one matrix, all imaginary parts in another: float real[4][4], imag[4][4]; 2) interleaved: real and imaginary part adjacent per element: std::complex<float> matrix[4][4];

  9. 1) complex split matrices [C] = [A][B] → [C_r] = [A_r][B_r] + [−A_i][B_i], [C_i] = [A_r][B_i] + [A_i][B_r] ● maps well to tensor cores – negate the A_i values (for the real part)

  10. 2) interleaved complex matrices ● reorder the right-hand matrix for tensor-core use – duplicate/permute/negate entries: each complex entry b becomes the 2x2 real block (b_r, b_i / −b_i, b_r), so a real matrix product on the interleaved left-hand matrix yields the interleaved complex result

  11. complex formats: split array vs. interleaved ● implemented both ● generally no big performance difference

  12. tensor cores for signal processing ● suitable if – input ≤ 16 bit ✔ – algorithm translates to matrix-matrix multiplication ✔ / ✘ (depends on the algorithm)

  13. algorithm 1: correlations

  14. correlations ● combines telescope data – each pair: multiply & accumulate – ½r(r+1) pairs ● correlation(recv1, recv2) = Σ_{time=0}^{integration time − 1} sample(recv1, time) × conj(sample(recv2, time))

  15. correlator computations ● C ← A · Aᴴ (A: receivers × time; C: receivers × receivers) ● C = Cᴴ → compute & store one triangle

  16. work decomposition ● computeSquares() – thread block: 64x64 receivers – warp: 32x16 receivers ● computeTriangles() – 64x64-receiver tiles on the diagonal – redundant computations above the diagonal

  17. correlator implementation ● cache input: L2 → shared mem → registers – fix the − sign on the fly ● wmma::store_matrix_sync() cannot write to a triangle – copy via shared mem, or – write accumulation registers directly (hack!)

  18. correlator performance [chart: TFLOPS (0–80) vs. # receivers (64–576) for overall, correlateSquares(), correlateTriangles(); measured on Tesla V100]

  19. correlator roofline analysis [roofline plot: TFLOPS vs. FLOPS/byte (1–8192); correlateSquares() and correlateTriangles() lie in the compute-bound region, beyond the memory-bandwidth roof]

  20. correlator energy efficiency [chart: GFLOP/J (0–300) vs. # receivers (64–576) for overall, computeSquares(), computeTriangles(); measured on Titan RTX, not V100]

  21. innovation beyond Moore's law [chart: TFLOPS (0–80) vs. year (2013–2020), Tensor Cores vs. FP32, for Titan RTX (Turing), Tesla V100 (Volta), Titan X (Pascal), Titan X (Maxwell), Tesla K40 (Kepler)]

  22. algorithm 2: beam forming

  23. beam forming ● increase sensitivity in a particular direction ● (weighted) addition of signals (image credit: Jason Hessels) ● bfdata(time, beam) = Σ_{recv=0}^{nr recv − 1} samples(time, recv) × weights(recv, beam)

  24. beam former implementation ● multiple beams: complex matrix-matrix multiplication – bfdata (time × beams) = samples (time × receivers) × weights (receivers × beams) ● bfdata(time, beam) = Σ_{recv=0}^{nr recv − 1} samples(time, recv) × weights(recv, beam)

  25. beam former performance and roofline analysis [left: TFLOPS (0–80) vs. # receivers (64–512) for 512, 256, and 128 beams; right: roofline, TFLOPS vs. FLOPS/byte (16–1024), compute bound vs. memory-bandwidth bound, 64–512 receivers]

  26. other algorithms

  27. other signal-processing algorithms ● nonuniform Fourier transforms ✔ – map well to complex matrix-matrix multiplication – ≤ 80 TFLOPS ● FIR filter ✘ – as matrix multiplication → many zeros – typically memory bandwidth bound → no need for tensor cores ● FFT ✘ – not a matrix multiplication – memory bandwidth bound [roofline plot: TFLOPS vs. FLOPS/byte for nuFT, FIR filter, FFT]

  28. current / future work ● try further optimizations – correlator: near diagonal – beam forming: cublasLtMatmul() (CUDA 10.1) ● support any number of receivers/beams ● 8 bit, 4 bit

  29. conclusions ● tensor cores for signal processing, via matrix-matrix multiplication: – correlating – multi-beam forming – nonuniform Fourier transforms ● unprecedented performance (≤ ~75 TFLOPS, up to 6x speedup) This work is funded by the European Union under grant no. H2020-FETHPC-754304 (DEEP-EST).

  30. Tuesday 19th March, 2019 GPU Technology Conference 2019 San Jose, USA Astronomical Imaging on GPUs Bram Veenboer Netherlands Institute for Radio Astronomy

  31. Outline • Introduction: • Interferometry • The Image-Domain Gridding algorithm • Performance analysis • Analysis and optimization: • Gridder kernel • Imaging application • Performance and energy-efficiency comparison • Results in context of Square Kilometre Array • Summary

  32. Introduction to radio astronomy (image credits: NRAO) • Observe the sky at radio wavelengths → map of radio sources • Dish-based telescopes: VLA, ALMA, MeerKAT, SKA-1 Mid • Size of the telescope is proportional to the wavelength • Use arrays of antennas for low frequencies (LOFAR, MWA, SKA-1 Low)

  33. Radio telescope: astronomical interferometer • Interferometer: array of separate telescopes • Interferometry: combine the signals from separate radio telescopes • Resolution similar to one very large dish (set by the baseline)

  34. Interferometry theory [figure: image plane with (l, m) coordinates and angle Θ toward the N Pole; uv plane with (u, v, w) coordinates, u and v in km, ranging −60 to 60] • Sampling of the 'uv-plane': 'visibilities' • Earth-rotation synthesis
