

SLIDE 1

Powering Real-time Radio Astronomy Signal Processing with latest GPU architectures

Harshavardhan Reddy Suda, NCRA, India
Vinay Deshpande, NVIDIA, India
Bharat Kumar, NVIDIA, India

SLIDE 2

What signals are we processing?

GMRT

▪ The Giant Meter-wave Radio Telescope (GMRT) is a world class instrument for studying astrophysical phenomena at low radio frequencies
▪ Located 80 km north of Pune, 160 km east of Mumbai
▪ Array telescope with 30 antennas of 45 m diameter, operating at meter wavelengths

▪ Digitized baseband signals from 30 dual polarized antennas of GMRT

SLIDE 3

GMRT

▪ Supports two modes of operation :

  • Interferometry (correlator)
  • Array mode (beamformer)

▪ Frequency bands :

  • 130 to 260 MHz
  • 250 to 500 MHz
  • 550 to 900 MHz
  • 1050 to 1600 MHz

▪ Maximum instantaneous bandwidth : 400 MHz (Legacy GMRT = 32 MHz)
▪ Effective collecting area (2-3% of SKA) :

  • 30,000 sq m at lower frequencies
  • 20,000 sq m at higher frequencies
SLIDE 4

The Giant Meter-wave Radio Telescope
A Google eye view

SLIDE 5

GMRT receiver chain
Signal processing in digital back-end

Image courtesy : Ajith Kumar, NCRA

SLIDE 6

Computation requirements

Pipeline : Sampler → Fourier Transform O(N log N) → Phase Correction → MAC (M(M+1)/2 baselines)

Antenna Signals (M = 64) | Maximum Bandwidth : 400 MHz | Spectral channels : 16k point

Fourier Transform ~ 3 TFlops | Phase Correction ~ 0.1 TFlops | MAC ~ 6.6 TFlops | Total ~ 10 TFlops
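These stage totals can be cross-checked with a back-of-the-envelope calculation. The sketch below is not from the presentation; it assumes 5·N·log2(N) flops per N-point FFT and 8 flops per complex multiply-accumulate, which reproduces the MAC figure exactly and the FFT figure approximately (the exact operation-count convention behind the ~3 TFlops FFT number may differ).

// flops_estimate.c : rough cross-check of the compute budget (sketch only).
// Assumed conventions : 5*N*log2(N) flops per N-point FFT, 8 flops per complex MAC.
#include <stdio.h>
#include <math.h>

int main(void) {
    const double M    = 64;        /* antenna signals (32 antennas, dual pol) */
    const double bw   = 400e6;     /* maximum bandwidth, Hz                   */
    const double fs   = 2.0 * bw;  /* real sampling rate per signal           */
    const double nfft = 32768;     /* real FFT length -> 16384 channels       */

    /* FFT stage : one N-point transform per N samples, for every signal */
    double fft = M * fs * 5.0 * log2(nfft);
    /* MAC stage : M(M+1)/2 baselines, one complex MAC per channelised sample */
    double mac = (M * (M + 1) / 2.0) * 8.0 * bw;
    /* Phase correction figure taken directly from the slide */
    double phase = 0.1e12;

    printf("FFT   : %.1f TFlops\n", fft / 1e12);   /* ~3.8 with this convention (slide: ~3) */
    printf("MAC   : %.1f TFlops\n", mac / 1e12);   /* ~6.7 (slide: 6.6) */
    printf("Total : %.1f TFlops\n", (fft + mac + phase) / 1e12);
    return 0;
}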

SLIDE 7

Design : Time slicing model

SLIDE 8

Design : Time slicing model

A 4-node example

Ant 1, Ant 2, …, Ant 16 : digitized baseband data of the antennas

SLIDE 9

Implementation

▪ 16 Dell T630 machines as Compute Nodes
▪ 16 ROACH (FPGA) boards with Atmel/e2v based ADCs developed by CASPER group, Berkeley for digitization and packetization
▪ 32 Tesla K40c GPU cards for processing
▪ 36 port Mellanox Infiniband switch for data sharing between Compute Nodes and Host Nodes
▪ Software : C/C++ and CUDA C programming with OpenMPI and OpenMP directives
▪ Developed in collaboration with Swinburne University, Australia
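The listing below is not the production GMRT code; it is a minimal sketch of the per-node FX loop that the software stack above implements (batched cuFFT for the F stage, a custom multiply-accumulate kernel for the X stage), with illustrative names and without the packet capture, corner-turn, OpenMP/MPI and integration logic.

// Minimal per-node FX sketch (illustrative only).
#include <cufft.h>
#include <cuda_runtime.h>

#define NANT  64                      // dual-pol signals
#define NFFT  32768                   // real FFT length
#define NBIN  (NFFT / 2 + 1)          // complex bins per spectrum (cuFFT R2C layout)
#define NBASE (NANT * (NANT + 1) / 2) // baselines, i <= j

// One thread per (baseline, channel); visibilities accumulate across calls.
__global__ void mac_kernel(const cufftComplex* __restrict__ spectra,  // [NANT][NBIN]
                           cufftComplex* __restrict__ vis)            // [NBASE][NBIN]
{
    int chan = blockIdx.x * blockDim.x + threadIdx.x;
    int baseline = blockIdx.y;
    if (chan >= NBIN) return;

    // Recover the antenna pair (i, j), i <= j, from the flattened baseline index.
    int i = 0, b = baseline;
    while (b >= NANT - i) { b -= NANT - i; ++i; }
    int j = i + b;

    cufftComplex a = spectra[i * NBIN + chan];
    cufftComplex c = spectra[j * NBIN + chan];
    vis[baseline * NBIN + chan].x += a.x * c.x + a.y * c.y;   // Re{ a * conj(c) }
    vis[baseline * NBIN + chan].y += a.y * c.x - a.x * c.y;   // Im{ a * conj(c) }
}

void process_block(cufftHandle plan, float* d_samples,
                   cufftComplex* d_spectra, cufftComplex* d_vis)
{
    // plan created once elsewhere : cufftPlan1d(&plan, NFFT, CUFFT_R2C, NANT);
    cufftExecR2C(plan, d_samples, d_spectra);                 // F stage
    dim3 block(256), grid((NBIN + 255) / 256, NBASE);
    mac_kernel<<<grid, block>>>(d_spectra, d_vis);            // X stage
}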

SLIDE 10

Implementation

Image courtesy : Irappa Halagalli, NCRA

SLIDE 11

Sample result

Legacy GMRT, 325 MHz : 350 μJy
Upgraded GMRT, 300 – 500 MHz : 28 μJy
Significantly lower noise RMS and better image quality with upgraded GMRT

Dharam Vir Lal and Ishwar Chandra, NCRA

Image of Coma cluster

SLIDE 12

Computation Performance : K40

Channels    FFT (Gflops)    MAC (Gflops)
2048        620             626
4096        626             620
8192        512             574
16384       498             537

  • No. of antennas : 32 (dual pol)

CUDA 7.5

SLIDE 13

Motivation for next generation GPUs

▪ Adding more compute intensive applications

  • Multi-beamforming
  • Processing on each beam (beam steering)
  • Gated correlator
  • FIR filtering with many taps for narrow-band mode implementation (see the sketch after this list)

▪ The working GMRT system and code provide an excellent testing ground for the features of next generation GPUs
▪ Performance measured and compared on GP100 and V100
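The FIR filtering item above maps naturally onto a simple CUDA kernel. The sketch below (hypothetical names, taps in constant memory, one thread per output sample) only illustrates the structure; a production narrow-band implementation would likely use shared memory and a polyphase arrangement.

// Many-tap FIR filter sketch for the narrow-band mode (illustrative only).
#include <cuda_runtime.h>

#define NTAPS 256

__constant__ float c_taps[NTAPS];      // filter coefficients, uploaded with cudaMemcpyToSymbol

__global__ void fir_kernel(const float* __restrict__ in,   // [nsig][nsamp + NTAPS - 1]
                           float* __restrict__ out,        // [nsig][nsamp]
                           int nsamp)
{
    int sig = blockIdx.y;                                   // which signal (antenna / pol)
    int n   = blockIdx.x * blockDim.x + threadIdx.x;        // which output sample
    if (n >= nsamp) return;

    const float* x = in + (size_t)sig * (nsamp + NTAPS - 1);
    float acc = 0.0f;
    #pragma unroll 8
    for (int k = 0; k < NTAPS; ++k)
        acc += c_taps[k] * x[n + k];                        // dot product of taps with the sliding window

    out[(size_t)sig * nsamp + n] = acc;
}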

SLIDE 14

Computation performance – K40 vs GP100

CUDA 7.5, ECC off

Performance follows cuFFT benchmarks for K40 and P100
Reference for K40 benchmark : CUDA 6.5 Performance Report, September 2014
Reference for P100 benchmark : CUDA 8 Performance Overview, November 2016

SLIDE 15

Computation performance : K40 vs GP100

CUDA 7.5, ECC off

  • No. of antennas : 32 (dual pol)
SLIDE 16

Computation performance : K40 vs GP100

CUDA 7.5, ECC off

Peak Global Memory Bandwidth : K40 – 288 GB/sec ; GP100 – 732 GB/sec
Peak Performance : K40 – 4.3 TFlops ; GP100 – 9.3 TFlops

SLIDE 17

Computation performance as % of Real-time

Bandwidth : 200 MHz

  • No. of antennas : 32 (dual pol)

Spectral Channels : 16384
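Here "% of real-time" is the GPU processing time for a block of data divided by the wall-clock span of that block on the sky. A measurement sketch (assuming real Nyquist sampling at twice the 200 MHz bandwidth; run_block is a placeholder for the FFT + MAC work):

// Sketch : express kernel time as a percentage of real time for one data block.
#include <cuda_runtime.h>

float percent_of_realtime(void (*run_block)(void), long long samples_per_signal)
{
    const double fs = 400e6;                                  // 2 x 200 MHz, samples/s per signal
    const double data_span = samples_per_signal / fs;         // seconds of sky signal in the block

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    run_block();                                              // FFT + MAC for the block
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return 100.0f * (float)(ms * 1e-3 / data_span);           // < 100% means faster than real time
}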

SLIDE 18

Computation performance : GP100 vs V100

GP100 on CUDA 7.5 ; V100 on CUDA 9.1 (using PSG cluster)

SLIDE 19

Computation performance : GP100 vs V100

GP100 on CUDA 7.5 ; V100 on CUDA 9.1 (using PSG cluster)

  • No. of antennas : 32 (dual pol)
SLIDE 20

Computation performance : GP100 vs V100

GP100 on CUDA 7.5 ; V100 on CUDA 9.1 (using PSG cluster)

Peak Global Memory Bandwidth : GP100 – 732 GB/sec ; V100 – 900 GB/sec
Peak Performance : GP100 – 9.3 TFlops ; V100 – 14 TFlops

SLIDE 21

Reasons behind relatively low performance of MAC

▪ Non-contiguous Global Memory access at block level

MAC input data format

▪ Low Arithmetic Intensity
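A back-of-the-envelope estimate (assuming no reuse of loaded spectra between baselines) makes the second point concrete, using the peak figures quoted on the surrounding slides:

  Naive MAC : 8 flops per visibility update, two complex float loads = 16 bytes → 0.5 flop/byte
  Machine balance : K40 ≈ 4.3 TFlops / 288 GB/s ≈ 15 flop/byte ; V100 ≈ 14 TFlops / 900 GB/s ≈ 15.5 flop/byte
  0.5 flop/byte is far below the balance point, so without data reuse the kernel is memory-bandwidth bound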

SLIDE 22

GPU kernel improvements

▪ MAC :
  • Simplified index arithmetic
  • Improved the L2 hit ratio : less than 5% to nearly 86%
  • Vectorized loads – increased ILP (float4)
  • Exposing more parallelism by increasing the occupancy
  • Single precision to half precision floating point – no performance gain
▪ FFT :
  • Single precision to half precision floating point
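The float4 idea can be illustrated with a stripped-down kernel (a sketch, not the actual GMRT MAC kernel): each thread issues 128-bit loads that cover two complex samples per antenna, improving coalescing and exposing more independent arithmetic per thread.

// Vectorized-load MAC sketch : two packed channels per thread via float4 (illustrative only).
__global__ void mac_vec4(const float4* __restrict__ spectra,   // [nant][nchan4], (re, im, re, im)
                         float4* __restrict__ vis,             // [nbase][nchan4]
                         int nchan4, int nant)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;   // index into float4-packed channels
    int baseline = blockIdx.y;
    if (c >= nchan4) return;

    // Recover the antenna pair (i, j), i <= j, from the flattened baseline index.
    int i = 0, b = baseline;
    while (b >= nant - i) { b -= nant - i; ++i; }
    int j = i + b;

    float4 a = spectra[i * nchan4 + c];              // two complex samples of antenna i
    float4 d = spectra[j * nchan4 + c];              // two complex samples of antenna j
    float4 v = vis[baseline * nchan4 + c];

    // v += a * conj(d), element-wise for the two packed channels
    v.x += a.x * d.x + a.y * d.y;
    v.y += a.y * d.x - a.x * d.y;
    v.z += a.z * d.z + a.w * d.w;
    v.w += a.w * d.z - a.z * d.w;

    vis[baseline * nchan4 + c] = v;
}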

SLIDE 23

MAC : Performance gain with optimizations on V100

  • No. of antennas : 32 (dual pol)

V100 on CUDA 9.1 (using PSG cluster)

SLIDE 24

FFT : Performance gain with half precision on V100

V100 on CUDA 9.1 (using PSG cluster)
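Half-precision FFTs are created through the cufftXt plan interface with CUDA_C_16F for the input, output and execution types (power-of-two sizes only, which the 2048-point transforms used here satisfy). A minimal plan-creation sketch:

// Sketch : half-precision complex-to-complex cuFFT plan via the cufftXt API.
#include <cufftXt.h>
#include <cuda_fp16.h>
#include <library_types.h>

cufftHandle make_half_plan(long long nfft, long long batch)
{
    cufftHandle plan;
    cufftCreate(&plan);

    long long n[1] = { nfft };
    size_t worksize = 0;
    // Input, output and execution all as half-precision complex (CUDA_C_16F).
    cufftXtMakePlanMany(plan, 1, n,
                        NULL, 1, nfft, CUDA_C_16F,
                        NULL, 1, nfft, CUDA_C_16F,
                        batch, &worksize, CUDA_C_16F);
    return plan;
}

// Execution on half2-packed complex device arrays :
//   cufftXtExec(plan, d_in, d_out, CUFFT_FORWARD);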

SLIDE 25

FFT : Error analysis with half precision in power spectrum

Spectral Channels : 2048 Batch size : 128

SLIDE 26

FFT : Error analysis with half precision in phase spectrum

Spectral Channels : 2048 Batch size : 128

SLIDE 27

Going forward

▪ Improving MAC using Tensor Cores – potential 2x improvement (see the sketch below)
▪ Implementing the MAC optimizations and half-precision floating point FFT in the GMRT code
▪ Optimized FIR filtering routines in CUDA for narrow-band mode implementation
▪ Implementing multi-beamforming, beam steering and gated correlator
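The Tensor Core direction can be sketched with the CUDA 9 WMMA API: the cross-correlation X = S · S^H is decomposed into real-valued half-precision tile multiplies, each handled by one warp. The fragment below only illustrates the building block, not the planned GMRT kernel.

// WMMA building block sketch : one warp computes a 16x16 output tile, C = A * B.
// Compile for sm_70 or newer and launch with one warp (32 threads) per tile.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_tile(const __half* A, const __half* B, float* C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);                    // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);       // acc += A * B on Tensor Cores
    wmma::store_matrix_sync(C, acc_frag, 16, wmma::mem_row_major);
}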

SLIDE 28

Acknowledgements

▪ Prof. Yashwant Gupta, Centre Director, NCRA

▪ Ajith Kumar B., Back-end group co-ordinator, GMRT, NCRA
▪ Sanjay Kudale, GMRT, NCRA
▪ Shelton Gnanaraj, GMRT, NCRA
▪ Andrew Jameson, Swinburne University, Australia
▪ Benjamin Barsdel, Swinburne University, Australia (now at Nvidia)
▪ CASPER Group, Berkeley
▪ Digital Back-end Group, GMRT, NCRA
▪ Computer Group, GMRT, NCRA
▪ Control Room, GMRT

SLIDE 29

Thank You