

SLIDE 1

Powering Real-time Radio Astronomy Signal Processing with latest GPU architectures

Harshavardhan Reddy Suda, NCRA, India
Vinay Deshpande, NVIDIA, India
Bharat Kumar, NVIDIA, India

SLIDE 2

What signals are we processing?

GMRT

▪ The Giant Meter-wave Radio Telescope (GMRT) is a world class instrument for studying astrophysical phenomena at low radio frequencies
▪ Located 80 km north of Pune, 160 km east of Mumbai
▪ Array telescope with 30 antennas of 45 m diameter, operating at meter wavelengths

▪ Digitized baseband signals from 30 dual polarized antennas of GMRT

SLIDE 3

GMRT

▪ Supports two modes of operation :

  • Interferometry (correlator)
  • Array mode (beamformer)

▪ Frequency bands :

  • 130 to 260 MHz
  • 250 to 500 MHz
  • 550 to 900 MHz
  • 1050 to 1600 MHz

▪ Maximum instantaneous bandwidth : 400 MHz (Legacy GMRT = 32 MHz)
▪ Effective collecting area (2-3% of SKA) :

  • 30,000 sq m at lower frequencies
  • 20,000 sq m at higher frequencies
SLIDE 4

The Giant Meter-wave Radio Telescope
A Google eye view

SLIDE 5

GMRT receiver chain
Signal processing in digital back-end

Image courtesy : Ajith Kumar, NCRA

SLIDE 6

Computation requirements

Pipeline : Sampler → Fourier Transform O(N log N) → Phase Correction → MAC (M(M+1)/2 baselines)

Antenna Signals (M = 64) | Maximum Bandwidth : 400 MHz | Spectral channels : 16k point

Fourier Transform ~ 3 TFlops | Phase Correction ~ 0.1 TFlops | MAC ~ 6.6 TFlops | Total ~ 10 TFlops
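These stage totals can be cross-checked with a back-of-the-envelope calculation. The sketch below is not from the presentation; it assumes 5·N·log2(N) flops per N-point FFT and 8 flops per complex multiply-accumulate, which reproduces the MAC figure exactly and the FFT figure approximately (the exact operation-count convention behind the ~3 TFlops FFT number may differ).

// flops_estimate.c : rough cross-check of the compute budget (sketch only).
// Assumed conventions : 5*N*log2(N) flops per N-point FFT, 8 flops per complex MAC.
#include <stdio.h>
#include <math.h>

int main(void) {
    const double M    = 64;        /* antenna signals (32 antennas, dual pol) */
    const double bw   = 400e6;     /* maximum bandwidth, Hz                   */
    const double fs   = 2.0 * bw;  /* real sampling rate per signal           */
    const double nfft = 32768;     /* real FFT length -> 16384 channels       */

    /* FFT stage : one N-point transform per N samples, for every signal */
    double fft = M * fs * 5.0 * log2(nfft);
    /* MAC stage : M(M+1)/2 baselines, one complex MAC per channelised sample */
    double mac = (M * (M + 1) / 2.0) * 8.0 * bw;
    /* Phase correction figure taken directly from the slide */
    double phase = 0.1e12;

    printf("FFT   : %.1f TFlops\n", fft / 1e12);   /* ~3.8 with this convention (slide: ~3) */
    printf("MAC   : %.1f TFlops\n", mac / 1e12);   /* ~6.7 (slide: 6.6) */
    printf("Total : %.1f TFlops\n", (fft + mac + phase) / 1e12);
    return 0;
}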

SLIDE 7

Design : Time slicing model

SLIDE 8

Design : Time slicing model

A 4-node example

Ant 1, Ant 2, …, Ant 16 : digitized baseband data of the antennas

SLIDE 9

Implementation

▪ 16 Dell T630 machines as Compute Nodes
▪ 16 ROACH (FPGA) boards with Atmel/e2v based ADCs developed by CASPER group, Berkeley for digitization and packetization
▪ 32 Tesla K40c GPU cards for processing
▪ 36 port Mellanox Infiniband switch for data sharing between Compute Nodes and Host Nodes
▪ Software : C/C++ and CUDA C programming with OpenMPI and OpenMP directives
▪ Developed in collaboration with Swinburne University, Australia
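The listing below is not the production GMRT code; it is a minimal sketch of the per-node FX loop that the software stack above implements (batched cuFFT for the F stage, a custom multiply-accumulate kernel for the X stage), with illustrative names and without the packet capture, corner-turn, OpenMP/MPI and integration logic.

// Minimal per-node FX sketch (illustrative only).
#include <cufft.h>
#include <cuda_runtime.h>

#define NANT  64                      // dual-pol signals
#define NFFT  32768                   // real FFT length
#define NBIN  (NFFT / 2 + 1)          // complex bins per spectrum (cuFFT R2C layout)
#define NBASE (NANT * (NANT + 1) / 2) // baselines, i <= j

// One thread per (baseline, channel); visibilities accumulate across calls.
__global__ void mac_kernel(const cufftComplex* __restrict__ spectra,  // [NANT][NBIN]
                           cufftComplex* __restrict__ vis)            // [NBASE][NBIN]
{
    int chan = blockIdx.x * blockDim.x + threadIdx.x;
    int baseline = blockIdx.y;
    if (chan >= NBIN) return;

    // Recover the antenna pair (i, j), i <= j, from the flattened baseline index.
    int i = 0, b = baseline;
    while (b >= NANT - i) { b -= NANT - i; ++i; }
    int j = i + b;

    cufftComplex a = spectra[i * NBIN + chan];
    cufftComplex c = spectra[j * NBIN + chan];
    vis[baseline * NBIN + chan].x += a.x * c.x + a.y * c.y;   // Re{ a * conj(c) }
    vis[baseline * NBIN + chan].y += a.y * c.x - a.x * c.y;   // Im{ a * conj(c) }
}

void process_block(cufftHandle plan, float* d_samples,
                   cufftComplex* d_spectra, cufftComplex* d_vis)
{
    // plan created once elsewhere : cufftPlan1d(&plan, NFFT, CUFFT_R2C, NANT);
    cufftExecR2C(plan, d_samples, d_spectra);                 // F stage
    dim3 block(256), grid((NBIN + 255) / 256, NBASE);
    mac_kernel<<<grid, block>>>(d_spectra, d_vis);            // X stage
}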

SLIDE 10

Implementation

Image courtesy : Irappa Halagalli, NCRA

SLIDE 11

Sample result

Legacy GMRT, 325 MHz : 350 μJy
Upgraded GMRT, 300 – 500 MHz : 28 μJy
Significantly lower noise RMS and better image quality with upgraded GMRT

Dharam Vir Lal and Ishwar Chandra, NCRA

Image of Coma cluster

SLIDE 12

Computation Performance : K40

Channels    FFT (Gflops)    MAC (Gflops)
2048        620             626
4096        626             620
8192        512             574
16384       498             537

  • No. of antennas : 32 (dual pol)

CUDA 7.5

SLIDE 13

Motivation for next generation GPUs

▪ Adding more compute intensive applications

  • Multi-beamforming
  • Processing on each beam (beam steering)
  • Gated correlator
  • FIR filtering with many taps for narrow-band mode implementation (see the sketch after this list)

▪ The working GMRT system and code provide an excellent testing ground for the features of next generation GPUs
▪ Performance measured and compared on GP100 and V100
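The FIR filtering item above maps naturally onto a simple CUDA kernel. The sketch below (hypothetical names, taps in constant memory, one thread per output sample) only illustrates the structure; a production narrow-band implementation would likely use shared memory and a polyphase arrangement.

// Many-tap FIR filter sketch for the narrow-band mode (illustrative only).
#include <cuda_runtime.h>

#define NTAPS 256

__constant__ float c_taps[NTAPS];      // filter coefficients, uploaded with cudaMemcpyToSymbol

__global__ void fir_kernel(const float* __restrict__ in,   // [nsig][nsamp + NTAPS - 1]
                           float* __restrict__ out,        // [nsig][nsamp]
                           int nsamp)
{
    int sig = blockIdx.y;                                   // which signal (antenna / pol)
    int n   = blockIdx.x * blockDim.x + threadIdx.x;        // which output sample
    if (n >= nsamp) return;

    const float* x = in + (size_t)sig * (nsamp + NTAPS - 1);
    float acc = 0.0f;
    #pragma unroll 8
    for (int k = 0; k < NTAPS; ++k)
        acc += c_taps[k] * x[n + k];                        // dot product of taps with the sliding window

    out[(size_t)sig * nsamp + n] = acc;
}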

SLIDE 14

Computation performance – K40 vs GP100

CUDA 7.5, ECC off

Performance follows cuFFT benchmarks for K40 and P100
Reference for K40 benchmark : CUDA 6.5 Performance Report, September 2014
Reference for P100 benchmark : CUDA 8 Performance Overview, November 2016

SLIDE 15

Computation performance : K40 vs GP100

CUDA 7.5, ECC off

  • No. of antennas : 32 (dual pol)
SLIDE 16

Computation performance : K40 vs GP100

CUDA 7.5, ECC off

Peak Global Memory Bandwidth : K40 – 288 GB/sec ; GP100 – 732 GB/sec
Peak Performance : K40 – 4.3 TFlops ; GP100 – 9.3 TFlops

SLIDE 17

Computation performance as % of Real-time

Bandwidth : 200 MHz

  • No. of antennas : 32 (dual pol)

Spectral Channels : 16384
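Here "% of real-time" is the GPU processing time for a block of data divided by the wall-clock span of that block on the sky. A measurement sketch (assuming real Nyquist sampling at twice the 200 MHz bandwidth; run_block is a placeholder for the FFT + MAC work):

// Sketch : express kernel time as a percentage of real time for one data block.
#include <cuda_runtime.h>

float percent_of_realtime(void (*run_block)(void), long long samples_per_signal)
{
    const double fs = 400e6;                                  // 2 x 200 MHz, samples/s per signal
    const double data_span = samples_per_signal / fs;         // seconds of sky signal in the block

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    run_block();                                              // FFT + MAC for the block
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return 100.0f * (float)(ms * 1e-3 / data_span);           // < 100% means faster than real time
}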

SLIDE 18

Computation performance : GP100 vs V100

GP100 on CUDA 7.5 ; V100 on CUDA 9.1 (using PSG cluster)

SLIDE 19

Computation performance : GP100 vs V100

GP100 on CUDA 7.5 ; V100 on CUDA 9.1 (using PSG cluster)

  • No. of antennas : 32 (dual pol)
SLIDE 20

Computation performance : GP100 vs V100

GP100 on CUDA 7.5 ; V100 on CUDA 9.1 (using PSG cluster)

Peak Global Memory Bandwidth : GP100 – 732 GB/sec ; V100 – 900 GB/sec
Peak Performance : GP100 – 9.3 TFlops ; V100 – 14 TFlops

SLIDE 21

Reasons behind relatively low performance of MAC

▪ Non-contiguous Global Memory access at block level

MAC input data format

▪ Low Arithmetic Intensity
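A back-of-the-envelope estimate (assuming no reuse of loaded spectra between baselines) makes the second point concrete, using the peak figures quoted on the surrounding slides:

  Naive MAC : 8 flops per visibility update, two complex float loads = 16 bytes → 0.5 flop/byte
  Machine balance : K40 ≈ 4.3 TFlops / 288 GB/s ≈ 15 flop/byte ; V100 ≈ 14 TFlops / 900 GB/s ≈ 15.5 flop/byte
  0.5 flop/byte is far below the balance point, so without data reuse the kernel is memory-bandwidth bound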

SLIDE 22

GPU kernel improvements

▪ MAC :
  • Simplified index arithmetic
  • Improved the L2 hit ratio : less than 5% to nearly 86%
  • Vectorized loads – increased ILP (float4)
  • Exposing more parallelism by increasing the occupancy
  • Single precision to half precision floating point – no performance gain
▪ FFT :
  • Single precision to half precision floating point
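The float4 idea can be illustrated with a stripped-down kernel (a sketch, not the actual GMRT MAC kernel): each thread issues 128-bit loads that cover two complex samples per antenna, improving coalescing and exposing more independent arithmetic per thread.

// Vectorized-load MAC sketch : two packed channels per thread via float4 (illustrative only).
__global__ void mac_vec4(const float4* __restrict__ spectra,   // [nant][nchan4], (re, im, re, im)
                         float4* __restrict__ vis,             // [nbase][nchan4]
                         int nchan4, int nant)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;   // index into float4-packed channels
    int baseline = blockIdx.y;
    if (c >= nchan4) return;

    // Recover the antenna pair (i, j), i <= j, from the flattened baseline index.
    int i = 0, b = baseline;
    while (b >= nant - i) { b -= nant - i; ++i; }
    int j = i + b;

    float4 a = spectra[i * nchan4 + c];              // two complex samples of antenna i
    float4 d = spectra[j * nchan4 + c];              // two complex samples of antenna j
    float4 v = vis[baseline * nchan4 + c];

    // v += a * conj(d), element-wise for the two packed channels
    v.x += a.x * d.x + a.y * d.y;
    v.y += a.y * d.x - a.x * d.y;
    v.z += a.z * d.z + a.w * d.w;
    v.w += a.w * d.z - a.z * d.w;

    vis[baseline * nchan4 + c] = v;
}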

SLIDE 23

MAC : Performance gain with optimizations on V100

  • No. of antennas : 32 (dual pol)

V100 on CUDA 9.1 (using PSG cluster)

SLIDE 24

FFT : Performance gain with half precision on V100

V100 on CUDA 9.1 (using PSG cluster)
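Half-precision FFTs are created through the cufftXt plan interface with CUDA_C_16F for the input, output and execution types (power-of-two sizes only, which the 2048-point transforms used here satisfy). A minimal plan-creation sketch:

// Sketch : half-precision complex-to-complex cuFFT plan via the cufftXt API.
#include <cufftXt.h>
#include <cuda_fp16.h>
#include <library_types.h>

cufftHandle make_half_plan(long long nfft, long long batch)
{
    cufftHandle plan;
    cufftCreate(&plan);

    long long n[1] = { nfft };
    size_t worksize = 0;
    // Input, output and execution all as half-precision complex (CUDA_C_16F).
    cufftXtMakePlanMany(plan, 1, n,
                        NULL, 1, nfft, CUDA_C_16F,
                        NULL, 1, nfft, CUDA_C_16F,
                        batch, &worksize, CUDA_C_16F);
    return plan;
}

// Execution on half2-packed complex device arrays :
//   cufftXtExec(plan, d_in, d_out, CUFFT_FORWARD);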

SLIDE 25

FFT : Error analysis with half precision in power spectrum

Spectral Channels : 2048 Batch size : 128

SLIDE 26

FFT : Error analysis with half precision in phase spectrum

Spectral Channels : 2048 Batch size : 128

SLIDE 27

Going forward

▪ Improving MAC using Tensor Cores – potential 2x improvement (see the sketch below)
▪ Implementing the MAC optimizations and half-precision floating point FFT in the GMRT code
▪ Optimized FIR filtering routines in CUDA for narrow-band mode implementation
▪ Implementing multi-beamforming, beam steering and gated correlator
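The Tensor Core direction can be sketched with the CUDA 9 WMMA API: the cross-correlation X = S · S^H is decomposed into real-valued half-precision tile multiplies, each handled by one warp. The fragment below only illustrates the building block, not the planned GMRT kernel.

// WMMA building block sketch : one warp computes a 16x16 output tile, C = A * B.
// Compile for sm_70 or newer and launch with one warp (32 threads) per tile.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_tile(const __half* A, const __half* B, float* C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);                    // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);       // acc += A * B on Tensor Cores
    wmma::store_matrix_sync(C, acc_frag, 16, wmma::mem_row_major);
}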

SLIDE 28

Acknowledgements

▪ Prof. Yashwant Gupta, Centre Director, NCRA

▪ Ajith Kumar B., Back-end group co-ordinator, GMRT, NCRA
▪ Sanjay Kudale, GMRT, NCRA
▪ Shelton Gnanaraj, GMRT, NCRA
▪ Andrew Jameson, Swinburne University, Australia
▪ Benjamin Barsdel, Swinburne University, Australia (now at Nvidia)
▪ CASPER Group, Berkeley
▪ Digital Back-end Group, GMRT, NCRA
▪ Computer Group, GMRT, NCRA
▪ Control Room, GMRT

SLIDE 29

Thank You