BASED SIGNAL PROCESSING OF RADIO TELESCOPES VINAY DESHPANDE - PowerPoint PPT Presentation

S5302 - OPTIMIZATION OF GPU-BASED SIGNAL PROCESSING OF RADIO TELESCOPES
VINAY DESHPANDE, DEVELOPER TECHNOLOGY ENGINEER, NVIDIA
HARSHAVARDHAN REDDY, NCRA

INTRODUCTION
NCRA: National Centre for Radio Astrophysics, Pune, India. http://ncra.tifr.res.in/ncra
GMRT: Giant Metrewave Radio Telescope, situated at Khodad near Pune, India. http://gmrt.ncra.tifr.res.in/
  Consists of 30 dish antennas, 45 m in diameter each, spread over 25 km
  Used by radio astronomers worldwide

uGMRT EFFORT
The GMRT backend has been upgraded recently: the "uGMRT"
Key change: bandwidth 32 -> 200/400 MHz
Prototype system with 16 antennas and 8 compute nodes is up and running
GPUs upgraded from Fermi to Kepler
Optimizing the software backend
  For better science, less power and reduced cost
Ongoing work involving NVIDIA and NCRA teams
Contribution towards SKA

GMRT BACKEND
Each antenna has two polarizations
If the antenna is operating at 200 MHz bandwidth
  Sampling frequency needs to be 400 MHz
  Produces 400 million samples/sec per polarization
  800 million samples/sec per antenna
Total 800 M * 32 = 25.6 G samples/sec (30 antennas plus 2 additional signal sources for debug and test)
The signal processing backend needs to process all these samples in real time
[Diagram: Antenna 1, bandwidth 200 MHz, sampling at 400 MHz. Each of the two polarizations passes through "A2D + more" and yields 400 M 8-bit samples/sec.]
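The sample-rate arithmetic on this slide can be restated as a short worked check. The variable names are illustrative and not taken from the GMRT software.

```python
# Worked check of the data-rate figures quoted on the slide.
bandwidth_hz = 200e6                  # operating bandwidth per polarization
sampling_hz = 2 * bandwidth_hz        # Nyquist rate for that bandwidth: 400 MHz
polarizations = 2
signal_sources = 30 + 2               # 30 antennas + 2 debug/test sources

per_antenna = sampling_hz * polarizations    # samples/sec from one antenna
total = per_antenna * signal_sources         # samples/sec into the backend

print(f"{per_antenna / 1e6:.0f} M samples/sec per antenna")
print(f"{total / 1e9:.1f} G samples/sec in total")
```

With 8-bit samples, each sample is one byte, so the same total is also the raw input rate in bytes per second.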

BACKEND: COMPUTE INFRASTRUCTURE
Samples from two antennas are fed to a single compute node
  The number could change for other telescopes
  Can be decided by I/O requirements
16 compute nodes
  Connected over a high-speed network
Each compute node has
  One CPU
  One or two GPUs
[Diagram: Antenna 1 and Antenna 2 feed compute node 1; Antenna 3 and Antenna 4 feed compute node 2; and so on.]

GPU CORRELATOR
Operations involved
  1. Data format conversion (unpacking)
  2. Discrete Fourier Transform (DFT)
  3. Phase rotation
  4. Multiply-and-accumulate (MAC)

1. UNPACKING
Converts each sample: 8-bit read (integer) and 32-bit write (floating point)
Dominated by I/O
Unpacking is immediately followed by the DFT
  The 32-bit data per sample needs to be read again
  This read-after-write round trip can be saved
cuFFT callbacks were introduced in CUDA 6.5
  They can be used to combine unpacking with the FFT operation
Result: overhead of unpacking is reduced by 25%
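The idea of folding unpacking into the transform's load step can be illustrated on the CPU. The sketch below is not cuFFT code: it is a minimal analogy in which a naive DFT reads its input through a `load` function, much as a cuFFT load callback converts data while the FFT reads it. The signed 8-bit encoding and all function names are assumptions made for the illustration.

```python
import cmath

def unpack(raw_byte):
    """Convert one 8-bit sample to float (assumes two's-complement storage)."""
    return float(raw_byte - 256 if raw_byte > 127 else raw_byte)

def dft(n, load):
    """Naive O(n^2) DFT that reads input element i through load(i)."""
    return [sum(load(i) * cmath.exp(-2j * cmath.pi * k * i / n) for i in range(n))
            for k in range(n)]

raw = bytes([0, 127, 0, 129, 0, 127, 0, 129])   # 129 encodes -127

# Two-pass version: write a float buffer, then read it back in the transform.
floats = [unpack(b) for b in raw]
two_pass = dft(len(raw), lambda i: floats[i])

# Fused version: each sample is unpacked as the transform loads it,
# so the intermediate float buffer is never written or re-read.
fused = dft(len(raw), lambda i: unpack(raw[i]))
```

Both versions produce the same spectrum; only the memory traffic differs. A naive DFT reads every input many times, so this shows the structure of the fusion rather than its cost on a GPU, where an FFT library loads each element once.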

2. DISCRETE FOURIER TRANSFORM
DFT is implemented using cuFFT library APIs
cuFFT mode selection
  R2C
  C2C: requires an additional 2x2 butterfly kernel
Several possible combinations of input and output callbacks
  Unpacking, phase rotation, 2x2 butterfly

Mode  No callbacks  Unpacking callback   Phase Rotation  2x2 Butterfly callback
R2C   Tested        Tested, second best  Tested          NA
C2C   Tested        Tested, best         NA              Tested
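The slides do not define the 2x2 butterfly. One common technique that matches the description is to pack two real signals (for example, the two polarizations) into the real and imaginary parts of one complex input, run a single C2C transform, and separate the two spectra afterwards. The sketch below assumes that reading; the function names are hypothetical and a naive DFT stands in for cuFFT.

```python
import cmath

def dft(x):
    """Naive O(n^2) complex DFT, standing in for a C2C FFT."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * k * i / n) for i in range(n))
            for k in range(n)]

def two_real_dfts_via_one_complex(a, b):
    """Transform real sequences a and b with a single complex DFT."""
    n = len(a)
    z = dft([complex(p, q) for p, q in zip(a, b)])   # pack: a -> real, b -> imag
    spec_a, spec_b = [], []
    for k in range(n):
        zk = z[k]
        zc = z[(n - k) % n].conjugate()
        spec_a.append((zk + zc) / 2)      # butterfly output 1: spectrum of a
        spec_b.append((zk - zc) / 2j)     # butterfly output 2: spectrum of b
    return spec_a, spec_b

pol_x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.5]
pol_y = [0.0, 1.0, 4.0, -3.0, 2.0, 2.5, 1.0, -1.0]
spec_x, spec_y = two_real_dfts_via_one_complex(pol_x, pol_y)
```

The separation step combines bin k with the conjugate of bin n-k, which is why it needs its own kernel or an output callback after the C2C transform.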

3. PHASE ROTATION
Essentially multiplication by a constant
  The constant depends on antenna, frequency channel and time slice
The kernel computes each constant on the fly
  Lots of math operations
Redundancies in the computation were identified and removed
  Performance improved by 10%
Switching from CUDA 6.0 to 6.5 boosted performance by 50%
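The slides do not say which redundancies were removed. As a hedged illustration of the general idea, the sketch below assumes a simple phase model, 2*pi*(f0 + k*df)*delay for channel k, where the delay depends on antenna and time slice. Under that assumption, the delay-dependent factors can be computed once per antenna and time slice instead of once per channel. The model, parameter names and values are all hypothetical.

```python
import cmath

def rotate_naive(spectrum, f0, df, delay):
    """Evaluate the full phase expression for every channel."""
    return [x * cmath.exp(-2j * cmath.pi * (f0 + k * df) * delay)
            for k, x in enumerate(spectrum)]

def rotate_hoisted(spectrum, f0, df, delay):
    """Compute the delay-dependent factors once, then step across channels."""
    rot = cmath.exp(-2j * cmath.pi * f0 * delay)    # constant for channel 0
    step = cmath.exp(-2j * cmath.pi * df * delay)   # channel-to-channel increment
    out = []
    for x in spectrum:
        out.append(x * rot)
        rot *= step                                  # one multiply replaces one exp
    return out

spectrum = [complex(k + 1, -k) for k in range(16)]
naive = rotate_naive(spectrum, f0=100.0, df=0.5, delay=0.01)
hoisted = rotate_hoisted(spectrum, f0=100.0, df=0.5, delay=0.01)
```

The recurrence trades transcendental calls for complex multiplies. Rounding error accumulates along the channel axis, so a production kernel would likely re-seed the rotation periodically.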

4. MAC
The most costly operation
  Cost grows in proportion to (number of antennas)^2
Choices for MAC routines
  GMRT: original routine
  xGPU: Mike Clark's highly optimized MAC library
xGPU performs better in almost all cases
  More so for higher numbers of antennas
Side effect: input/output reordering is required
  (antenna, time, frequency) -> (time, frequency, antenna)
  Shared-memory-based implementation achieves a bandwidth of 128 GB/s on K20
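A plain reference version of the reordering and the MAC step makes the quadratic cost visible. This is a minimal sketch, not the GMRT or xGPU code: the nested-list layout, the function names and the choice to include autocorrelations (pairs with i == j) are assumptions.

```python
def reorder(data):
    """(antenna, time, frequency) -> (time, frequency, antenna)."""
    n_ant, n_time, n_freq = len(data), len(data[0]), len(data[0][0])
    return [[[data[a][t][f] for a in range(n_ant)]
             for f in range(n_freq)]
            for t in range(n_time)]

def mac(samples):
    """Accumulate x_i * conj(x_j) over time for every channel and pair i <= j."""
    n_time, n_freq, n_ant = len(samples), len(samples[0]), len(samples[0][0])
    vis = [{(i, j): 0j for i in range(n_ant) for j in range(i, n_ant)}
           for _ in range(n_freq)]
    for t in range(n_time):
        for f in range(n_freq):
            row = samples[t][f]              # all antennas, contiguous after reorder
            for i in range(n_ant):
                for j in range(i, n_ant):    # N * (N + 1) / 2 pairs per channel
                    vis[f][(i, j)] += row[i] * row[j].conjugate()
    return vis

# Hypothetical input: 3 antennas, 2 time slices, 2 channels.
data = [[[complex(a + 1, t) * (f + 1) for f in range(2)]
         for t in range(2)]
        for a in range(3)]
vis = mac(reorder(data))
```

After reordering, the inner pair loop reads one contiguous row of antenna samples, which is the access pattern the (time, frequency, antenna) layout is meant to provide.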

PERFORMANCE OF MAC
xGPU performs ~35% better than GMRT
[Chart: xGPU vs GMRT. Time in ms (0 to 2500) at 1K, 2K, 4K, 8K, 16K and 32K. Series: GMRT MAC, xGPU MAC.]

MAC KERNELS ON K40
[Chart: Performance of GMRT MAC, K20 vs K40. Time in ms (0 to 2500) at 1k, 2k, 4k, 8k and 16k. ~18% improvements.]
[Chart: Performance of xGPU MAC, K20 vs K40. Time in ms (0 to 2500) at 1k, 2k, 4k, 8k, 16k and 32k. 25-27% improvements.]

OVERALL RESULTS

OVERALL IMPROVEMENTS
Overall improvement for 16K channels on a single K20: 25% faster
[Chart: time in ms (0 to 4500) for Unpacking, cuFFT, Phase Rotation, MAC and Total. Series: Baseline, Optimized, Real-Time.]

OVERALL IMPROVEMENTS
Optimized correlator performance: 20-25% better
[Chart: time in ms (0 to 4500) at 1K, 2K, 4K, 8K, 16K and 32K. Series: Baseline, Optimized.]

RFI REJECTION

RFI REJECTION
RFI: Radio Frequency Interference
RFI needs to be removed in real time
GMRT backend has time-domain RFI filtering implemented
Desirable to have RFI filtering in both domains
[Diagram blocks: RFI filter (time-domain), Correlator, RFI filter (frequency-domain)]

RFI REJECTION CODE
GMRT implements Median Absolute Deviation (MAD) based filtering
  MAD is a robust estimator
Stream of input data is divided into fixed-width windows
For each window
  First the MAD is computed
  Then a threshold filter is applied
All the windows can be processed concurrently
GMRT has two implementations of the algorithm
  Optimized for small windows (< 1K)
  Optimized for large windows (> 4K)
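The per-window procedure described above can be sketched in a few lines. This is an illustration under stated assumptions, not the GMRT implementation: the 3-sigma threshold, the 1.4826 factor that scales MAD to a Gaussian standard deviation, and replacing flagged samples with the window median are all choices made here.

```python
from statistics import median

def mad_filter(samples, window, n_sigma=3.0):
    """Flag outliers window by window and replace them with the window median."""
    out = []
    for start in range(0, len(samples), window):
        block = samples[start:start + window]
        med = median(block)
        mad = median(abs(x - med) for x in block)   # median absolute deviation
        limit = n_sigma * 1.4826 * mad              # assumed threshold rule
        out.extend(med if abs(x - med) > limit else x for x in block)
    return out

samples = [1, 2, 3, 2, 1, 2, 100, 2, 5, 6, 5, 4, 5, 6, 5, -90]
cleaned = mad_filter(samples, window=8)
```

Each window depends only on its own samples, which is what allows all windows to be processed concurrently on the GPU.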

IMPROVEMENTS IN RFI FILTERING
Implicit histogram computation
  Second histogram is computed from the first instead of re-fetching samples
Integers instead of floating-point numbers
  $NBE = NBE_1 + \frac{NBE_2}{2}$
  Helps in removing calls to ceil, floor, etc.
Reduced branching
  8 if-else blocks reduced to 4
Reduction in launch latency overhead
  Launching a smaller number of bigger kernels
  Side effect of combining kernels: temporary storage avoided
Single version for all window sizes
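The first two improvements can be illustrated together. For 8-bit samples, the median can be read from a 256-bin histogram, and the histogram of absolute deviations can then be derived bin by bin from the first histogram without touching the samples again. The sketch assumes unsigned values in 0..255 and uses the lower median so that everything stays in integers. These are assumptions for the example, not details confirmed by the slides.

```python
def histogram_mad(block):
    """Median and MAD of 8-bit samples (0..255) using integer histograms."""
    hist = [0] * 256
    for v in block:                      # the only pass over the samples
        hist[v] += 1

    def lower_median(h, n):
        """Smallest bin whose cumulative count reaches the middle element."""
        target = (n + 1) // 2
        running = 0
        for value, count in enumerate(h):
            running += count
            if running >= target:
                return value

    n = len(block)
    med = lower_median(hist, n)

    # Second histogram: absolute deviations, derived from the first histogram.
    dev_hist = [0] * 256
    for value, count in enumerate(hist):
        dev_hist[abs(value - med)] += count
    mad = lower_median(dev_hist, n)
    return med, mad

block = [10, 12, 11, 10, 200, 11, 12, 11, 10]
med, mad = histogram_mad(block)
```

The second pass loops over 256 bins rather than over the window, so its cost does not grow with window size.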

RFI FILTERING RESULTS
RFI rejection performance at small window: 3-20x faster
[Chart: time in ms (0 to 30) against window size 0.5K, 1K, 2K and 4K. Series: Baseline small window, Optimized.]

RFI FILTERING RESULTS
RFI rejection performance at large window: 2-10x faster
[Chart: time in ms (0 to 16) at 4K, 8K, 16K and 32K. Series: Baseline large window, Optimized.]

REFERENCES
S3225 - Powering Real-time Radio Astronomy Signal Processing with GPUs, GTC 2013, Harshavardhan Reddy, Pradeep Gupta
S4538 - Real-Time RFI Rejection Techniques for the GMRT Using GPUs, GTC 2014, Rohini Joshi
NCRA-NVIDIA collaboration work report, phase 1 and phase 2

ACKNOWLEDGEMENT
Team NCRA
  Dr. Yashwant Gupta
  Harshavardhan Reddy
  Rohini Joshi
  Niruj

THANK YOU
