


SLIDE 1

Netherlands Institute for Radio Astronomy

Image-Domain Gridding on Accelerators

Bram Veenboer

Monday 26th March, 2018, GPU Technology Conference 2018, San Jose, USA. ASTRON is part of the Netherlands Organisation for Scientific Research (NWO)

SLIDE 2

Introduction to radio astronomy

Observe the sky at radio wavelengths:

Image credits: NRAO

The size of a telescope is proportional to the observed wavelength. E.g. the Hubble Space Telescope: ~1 um wavelength, 2.4 m mirror. The same resolution at 1 mm wavelength requires a ≈ 2.4 km dish!
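This scaling follows from the diffraction limit, angular resolution θ ≈ λ/D; the sketch below only restates the slide's Hubble example (the variable names are illustrative, not from the talk):

```python
# Diffraction limit: theta ~ lambda / D. Keeping the resolution fixed,
# the required dish diameter grows linearly with the observed
# wavelength (numbers taken from the Hubble example above).
hubble_wavelength = 1e-6   # ~1 um (optical/near-infrared)
hubble_diameter = 2.4      # m
radio_wavelength = 1e-3    # 1 mm
required_diameter = hubble_diameter * radio_wavelength / hubble_wavelength
# required_diameter ~ 2400 m, i.e. a dish roughly 2.4 km across
```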

SLIDE 3

Radio telescope: astronomical interferometer

An array of separate telescopes: an interferometer. Creating an image from it: interferometry. Combine the signals for pairs of telescopes. Resolution similar to one very large dish.

SLIDE 4

Interferometry theory

Sparse sampling of the 'uv-plane': every baseline samples the uv-plane, producing a 'visibility' with a 'uvw-coordinate'. The orientation of a baseline also determines its orientation in the uv-plane.

A sample V(u, v) is the 2D Fourier transform of the brightness on the sky B(l, m). Apply a non-uniform Fourier transform to get a sky image from the uv data.
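This relation (the van Cittert–Zernike theorem, written here in the small-field approximation that neglects the w-term) is:

```latex
V(u, v) = \iint B(l, m)\, e^{-2\pi i (u l + v m)}\, \mathrm{d}l\, \mathrm{d}m
```

Gridding followed by the FFT approximates the inversion of this non-uniform transform.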

SLIDE 5

Sampling using 2 antennas

Every sample (in the uv-domain) shows up as a sinusoidal fringe pattern in the image

Image credits: NRAO

SLIDE 6

Sampling using 4 antennas

Every baseline (pair of two antennas) adds information to the image

SLIDE 7

Sampling using 16 antennas (compact)

Using 16 antennas, the (artificial) source in the center of the image becomes visible

SLIDE 8

Sampling using 16 antennas (extended)

Longer baselines (larger antenna spacings) increase the resolution of the image

SLIDE 9

Sampling using 32 antennas, for 8 hours

Sampling for an extended period of time increases the signal-to-noise ratio; the Earth's rotation also sweeps each baseline through the uv-plane, improving coverage

SLIDE 10

Creating a sky-image

Imager: visibilities → (gridder) → regular grid → (2D FFT) → sky-image

[Diagram: incoming radio waves, distorted by the ionosphere, arrive at the receivers; the correlator combines the signals of every baseline (pair of receivers) into visibilities; the imager performs calibration and imaging, producing a sky-model and sky-image]

Correlator: combines signals into a 'visibility' (with an associated 'uvw-coordinate')
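As a toy illustration of the gridder → 2D FFT step: nearest-neighbour gridding only, with none of the convolution kernels, A-terms, or w-corrections discussed later; the function names are hypothetical, not the IDG API.

```python
import numpy as np

def grid_visibilities(vis, u, v, n):
    # Nearest-neighbour gridding: accumulate every visibility onto the
    # closest cell of an n x n regular grid (negative uv-coordinates
    # wrap around, matching the FFT frequency ordering).
    grid = np.zeros((n, n), dtype=complex)
    iu = np.round(u).astype(int) % n
    iv = np.round(v).astype(int) % n
    np.add.at(grid, (iv, iu), vis)  # unbuffered: repeated cells accumulate
    return grid

def make_dirty_image(grid):
    # The (dirty) sky image is the inverse 2D FFT of the gridded
    # visibilities, shifted so the phase centre lands mid-image.
    return np.fft.fftshift(np.fft.ifft2(grid))

# A point source at the phase centre produces unit visibilities on every
# baseline, so the dirty image should peak at the centre pixel.
n = 64
rng = np.random.default_rng(0)
u = rng.uniform(-n // 4, n // 4, 1000)
v = rng.uniform(-n // 4, n // 4, 1000)
image = make_dirty_image(grid_visibilities(np.ones(1000), u, v, n)).real
peak = np.unravel_index(np.argmax(image), image.shape)  # (32, 32)
```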

SLIDE 11

Gridding using AW-projection (and W-stacking)

W-term: corrects for non-coplanar baselines (the curvature of the Earth). A-term: corrects for direction-dependent effects.

SLIDE 12

W-projection gridding and Image-Domain gridding

W-projection gridding: visibilities are convolved onto the Fourier grid using convolution kernels.

Image-Domain gridding: visibilities are gridded onto small subgrids directly in the image domain; the subgrids are then Fourier-transformed (FFT) and added onto the full Fourier grid by the adder.

[Diagram legend: grid, visibility, convolution, updated pixel; visibilities are partitioned into time × channel blocks]

For more details: Image-Domain Gridding on Graphics Processors, Bram Veenboer, Matthias Petschow and John W. Romein, IPDPS 2017
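A minimal sketch of the subgrid idea (hypothetical names; no A-terms, taper, or w-term): each subgrid pixel is computed directly as a sum of visibilities times their phase terms, and an FFT then takes the subgrid back to the Fourier domain. For integer uv-coordinates this reproduces classical gridding exactly:

```python
import numpy as np

def idg_subgrid(vis, u, v, n):
    # Evaluate each pixel of an n x n image-domain subgrid directly:
    # pix[l, m] = sum_s vis[s] * exp(2*pi*i * (u[s]*m + v[s]*l) / n)
    l = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    pix = np.zeros((n, n), dtype=complex)
    for s in range(len(vis)):
        pix += vis[s] * np.exp(2j * np.pi * (u[s] * m + v[s] * l) / n)
    return pix

# FFT of the subgrid: each visibility ends up on its own grid cell.
n = 8
u = np.array([1, 3])
v = np.array([2, 5])
vis = np.array([1.0 + 0.5j, -0.25j])
grid = np.fft.fft2(idg_subgrid(vis, u, v, n)) / n**2
# grid[2, 1] ~ 1 + 0.5j and grid[5, 3] ~ -0.25j
```

In real IDG the per-pixel multiplications also apply the A-terms and taper, which is what makes the image-domain formulation attractive.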

SLIDE 13

Square Kilometre Array

SKA1-Low, Australia; SKA1-Mid, South Africa

Data rates up to ≈ 10,000,000,000 visibilities/second

SLIDE 14

Results: runtime/throughput

Runtime: time spent in one imaging cycle: gridding, FFT, and degridding. Throughput: number of visibilities processed per second.

[Bar charts: runtime in seconds (gridder, subgrid-ifft, adder, grid-fft, splitter, subgrid-fft, degridder) and throughput in MVisibilities/s (gridding, degridding) on Haswell, KNL, Pascal, and Vega]

Most time is spent in the gridder/degridder. GPUs perform more than an order of magnitude better than the CPU and Xeon Phi. Gridding and degridding reach very similar throughput.

SLIDE 15

Roofline analysis: overview

[Roofline plot: performance (TOp/s) vs. operational intensity (Op/Byte) for the gridder and degridder on Haswell, KNL, Pascal, and Vega]

SLIDE 16

Throughput limitation: host-device transfers

PCIe: ≈ 12 GB/s vs. NVLINK: ≈ 68 GB/s. A fast interconnect is needed to keep the GPU busy computing instead of waiting for host-device transfers.

SLIDE 17

Roofline analysis: overview

[Roofline plot (revisited): performance (TOp/s) vs. operational intensity (Op/Byte) for the gridder and degridder on Haswell, KNL, Pascal, and Vega]

SLIDE 18

Inner loop for (de)gridder kernel: instruction mix

Many fused multiply-adds (FMAs) and one sine/cosine computation per channel:

    α = ...;
    for c = 1, ..., C̃ do                      // channel
        Φ[c] = cos(α) + i sin(α);

        Re(pix11) += Re(vis11[c]) * Re(Φ[c]);
        Im(pix11) += Re(vis11[c]) * Im(Φ[c]);
        Re(pix11) -= Im(vis11[c]) * Im(Φ[c]);
        Im(pix11) += Im(vis11[c]) * Re(Φ[c]);

        // [... same for pix12, pix21 and pix22]
    end

FMA: peak performance on all architectures. Sine/cosine: poor performance on Intel architectures.
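The four real FMAs above implement one complex multiply-accumulate, pix += vis · Φ; a quick check in plain Python (the helper name is hypothetical):

```python
import math

def cmac_fma(pix, vis, phasor):
    # pix += vis * phasor, expanded into the four real fused
    # multiply-adds shown in the inner loop above.
    re, im = pix.real, pix.imag
    re += vis.real * phasor.real
    im += vis.real * phasor.imag
    re -= vis.imag * phasor.imag
    im += vis.imag * phasor.real
    return complex(re, im)

alpha = 0.7
phasor = complex(math.cos(alpha), math.sin(alpha))  # cos(a) + i*sin(a)
acc = cmac_fma(1 + 2j, 3 - 1j, phasor)
# matches the builtin complex arithmetic: (1+2j) + (3-1j) * phasor
```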

SLIDE 19

Roofline analysis: instruction mix

[Roofline plot (instruction-mix ceilings): performance (TOp/s) vs. operational intensity (Op/Byte) for the gridder and degridder on Haswell, KNL, Pascal, and Vega]

SLIDE 20

Roofline analysis: shared memory

[Roofline plot (shared-memory ceiling): performance (TOp/s) vs. operational intensity (Op/Byte, 1/4 to 8) for the gridder and degridder on Pascal and Vega]

SLIDE 21

Results: energy consumption/efficiency

[Bar charts: energy consumption in kJ (gridder, subgrid-ifft, adder, grid-fft, splitter, subgrid-fft, degridder, host) and energy efficiency in GFlop/W (gridding, degridding) on Haswell, KNL, Pascal, and Vega]

Most energy is spent in the gridder/degridder. GPUs perform more than an order of magnitude better than the CPU and Xeon Phi.

SLIDE 22

Results: comparison with AW-projection

[Plot: throughput (visibilities/s, 10⁷ to 10⁸) vs. W-kernel size N_W for IDG, W-projection gridding (WPG), and AW-projection gridding (AWPG) on Pascal]

IDG outperforms W-projection, while it also corrects for the challenging A-terms

SLIDE 23

Results: creating very large images (GPU-only)

[Plot: throughput (MVisibilities/s) vs. image size (pixels², 1024 to 65536) for GPU-only gridding and degridding]

The size of the image is restricted by the amount of GPU device memory

SLIDE 24

Results: creating very large images (GPU+CPU)

[Plot: throughput (MVisibilities/s) vs. image size (pixels²) for hybrid (GPU+CPU) and GPU-only gridding and degridding]

In the hybrid version, the adder kernel is executed by the host CPU

SLIDE 25

Results: creating very large images (Unified Memory)

[Plot: throughput (MVisibilities/s) vs. image size (pixels²) for Unified Memory, hybrid, and GPU-only gridding and degridding; the Unified Memory version uses tiling in the adder/splitter]

Unified Memory (and tiling) enables the GPU to create very large images

SLIDE 26

Image-Domain Gridding for the Square Kilometre Array

                    SKA-1 Low      SKA-1 Mid
# receivers         512            133
# baselines         130,816        8,778
# channels          65,536         65,536
# polarizations     4              4
integration time    0.9 s          0.14 s
data rate           8.3 GVis/s     9.53 GVis/s

Imaging data rate: 200 GVis/s. Compute budget: 50 PFlop/s (DP). Power budget: ≈ 1 MW. (De)gridding: ≈ 60%.
IDG on Tesla V100/NVLINK: ≈ 0.26 GVis/s per GPU → 770 V100s required.
Required compute: 770 × 7.8 ≈ 6 PFlop/s ≪ 15 PFlop/s available.
Power budget: 770 × 300 W ≈ 231 kW ≪ 600 kW available.
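The counts above can be sanity-checked from the table (a sketch using only the slide's own numbers; names are illustrative):

```python
import math

def num_baselines(n_receivers):
    # Every unordered pair of receivers forms one baseline.
    return n_receivers * (n_receivers - 1) // 2

low_baselines = num_baselines(512)   # 130,816 for SKA-1 Low
mid_baselines = num_baselines(133)   # 8,778 for SKA-1 Mid

# GPUs needed to sustain the 200 GVis/s imaging data rate at
# ~0.26 GVis/s per Tesla V100:
gpus = math.ceil(200 / 0.26)         # 770
```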

SLIDE 27

Summary

High-performance gridding and degridding, including AW-term correction. GPUs are much faster and more energy-efficient than CPUs and Xeon Phi. On GPUs, IDG outperforms AW-projection. IDG is able to make very large images (using Unified Memory). One of the most challenging sub-parts of imaging for the SKA is solved!

More details: Image-Domain Gridding on Graphics Processors, Bram Veenboer, Matthias Petschow and John W. Romein, IPDPS 2017. Source available at: https://gitlab.com/astron-idg/idg