Efficient Imaging in Radio Astronomy using GPUs, Bram Veenboer


Netherlands Institute for Radio Astronomy Efficient Imaging in Radio Astronomy using GPUs Bram Veenboer, Matthias Petschow and John W. Romein Tuesday 9 th May, 2017, GTC 2017, San Jose, USA ASTRON is part of the Netherlands Organisation for


slide-1
SLIDE 1

Netherlands Institute for Radio Astronomy

Efficient Imaging in Radio Astronomy using GPUs

Bram Veenboer, Matthias Petschow and John W. Romein

Tuesday 9th May, 2017, GTC 2017, San Jose, USA

ASTRON is part of the Netherlands Organisation for Scientific Research (NWO)

slide-2
SLIDE 2

Radio Astronomy

  • Array of antennas and/or dishes
  • Radio frequencies (30-240 MHz)
  • Map of radio sources

LOFAR, The Netherlands

Boötes field, > 1000 Megapixel

Image credits: Wendy Williams, Reinout van Weeren and Huub Röttgering

slide-3
SLIDE 3

Square Kilometre Array

SKA1 Mid, Africa

SKA1 Low, Australia

slide-4
SLIDE 4

Square Kilometre Array


slide-5
SLIDE 5

Imaging in Radio Astronomy

Convert measurements (visibilities) into a sky-image:

[Figure: incoming radio waves reach the stations; each baseline (pair of stations) is processed by the correlator, which outputs visibilities; the imager performs calibration (sky-model) and imaging (sky-image)]

Measurement equation:

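$$V_{pq} = \int_{l} \int_{m} A_p(l, m) \times B(l, m) \times e^{-2\pi i (u_{pq} l + v_{pq} m + w_{pq} n)} \, dl \, dm$$

  • $V_{pq}$: visibility
  • $l, m$: sky coordinates
  • $B(l, m)$: source brightness
  • $u, v, w$: visibility coordinate
  • $A_p(l, m)$: A-term
  • $e^{-i\phi}$: phasor
  • $w_{pq} n$: W-term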
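For a sky that consists of a few point sources, the integral in the measurement equation reduces to a sum over the sources. The following is a minimal sketch under assumptions that are not on the slide: the A-term is set to 1, n is taken as sqrt(1 - l² - m²) - 1 (one common convention), and the function name `predict_visibility` is made up for illustration.

```python
import cmath
import math

def predict_visibility(sources, u, v, w):
    """Sum the measurement equation over point sources.

    sources: list of (l, m, brightness) tuples; u, v, w in wavelengths.
    Assumptions (not from the slides): A-term = 1 and
    n = sqrt(1 - l^2 - m^2) - 1.
    """
    vis = 0j
    for l, m, brightness in sources:
        n = math.sqrt(1.0 - l * l - m * m) - 1.0
        phase = -2.0 * math.pi * (u * l + v * m + w * n)
        vis += brightness * cmath.exp(1j * phase)  # brightness times phasor
    return vis

# A single source at the phase centre (l = m = 0) has zero phase,
# so its visibility equals its brightness on every baseline.
centre = [(0.0, 0.0, 2.0)]
print(predict_visibility(centre, u=100.0, v=-50.0, w=3.0))

# An off-centre source only rotates the phase; the amplitude is unchanged.
offset = [(0.01, -0.02, 1.5)]
print(abs(predict_visibility(offset, u=100.0, v=-50.0, w=3.0)))
```

Evaluating this sum for every baseline and time step is the "predict" direction; gridding is the reverse direction, from visibilities back towards the image.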

slide-6
SLIDE 6

Fourier sampling

instantaneous u,v-coverage

every baseline contributes one point (visibility)

u,v-coverage for one hour

‘earth rotation synthesis’

slide-7
SLIDE 7

‘Gridding’ visibilities

Place visibilities onto a regular Fourier grid:

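$$V_{pq} = \int_{l} \int_{m} A_p(l, m) \times B(l, m) \times e^{-2\pi i (u_{pq} l + v_{pq} m + w_{pq} n)} \, dl \, dm$$

  • $V_{pq}$: visibility
  • $l, m$: sky coordinates
  • $B(l, m)$: source brightness
  • $u, v, w$: visibility coordinate
  • A-term $A_p(l, m)$: correct for ‘direction-dependent effects’
  • W-term $w_{pq} n$: correct for earth curvature
  • phasor $e^{-i\phi}$: phase correction
  • Further annotation on the slide: floating-point numbers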

Traditional approach: apply ‘convolution’ to each visibility
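The traditional approach can be sketched as follows: each visibility is multiplied by a small convolution kernel and accumulated onto the grid cells around its (u, v) position. This is a simplified illustration, not the authors' code. The function name `grid_visibility`, the grid-cell units and the fixed 3 × 3 toy kernel are assumptions; in practice the kernel would encode the W-term and A-term corrections.

```python
def grid_visibility(grid, kernel, u_pix, v_pix, vis):
    """Add one visibility, spread by a convolution kernel, onto the grid.

    grid: N x N list of lists of complex; kernel: K x K weights (K odd);
    u_pix, v_pix: visibility position in grid cells (hypothetical units).
    """
    half = len(kernel) // 2
    cu, cv = round(u_pix), round(v_pix)  # nearest grid cell
    for dy, row in enumerate(kernel):
        for dx, weight in enumerate(row):
            y, x = cv + dy - half, cu + dx - half
            if 0 <= y < len(grid) and 0 <= x < len(grid[0]):
                grid[y][x] += vis * weight

N = 8
grid = [[0j] * N for _ in range(N)]
# Toy kernel whose weights sum to 1 (a placeholder, not a real W-kernel).
kernel = [[0.0625, 0.125, 0.0625],
          [0.125, 0.25, 0.125],
          [0.0625, 0.125, 0.0625]]
grid_visibility(grid, kernel, u_pix=3.2, v_pix=4.7, vis=1 + 2j)
total = sum(sum(row) for row in grid)
print(total)
```

Because every visibility touches K × K grid cells, the cost grows with the kernel size.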


slide-8
SLIDE 8

Imaging example

Simulated three point sources, observed by 30 stations for 4 hours:

gridded visibilities → 2D FFT → sky image
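The step from a Fourier grid to an image can be illustrated on a toy grid. The sketch below uses a naive 2D inverse DFT from the standard library instead of an FFT library, and it assumes a fully sampled grid for a single point source, which is much simpler than the real u,v-coverage shown earlier. The helper name `idft2` and all sizes are made up.

```python
import cmath
import math

def idft2(grid):
    """Naive 2D inverse DFT of an N x N grid (O(N^4), illustration only)."""
    n = len(grid)
    image = [[0j] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            acc = 0j
            for v in range(n):
                for u in range(n):
                    angle = 2.0 * math.pi * (u * x + v * y) / n
                    acc += grid[v][u] * cmath.exp(1j * angle)
            image[y][x] = acc / (n * n)
    return image

# Fully sampled toy grid for one unit point source at pixel (x0, y0).
N, x0, y0 = 8, 2, 5
grid = [[cmath.exp(-2j * math.pi * (u * x0 + v * y0) / N) for u in range(N)]
        for v in range(N)]
image = idft2(grid)

# Locate the brightest pixel as (amplitude, x, y).
peak = max((abs(image[y][x]), x, y) for y in range(N) for x in range(N))
print(peak[1:])
```

The brightest pixel of the resulting image sits at the position of the simulated source.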

slide-9
SLIDE 9

Efficient Imaging in Radio Astronomy

[Figure: imaging cycle. “image”: measured visibilities − model visibilities → gridding → Fourier grid → iFFT → residual image → CLEAN → bright sources → model image, sky-image. “predict”: model image → FFT → Fourier grid → degridding → model visibilities]

Problem: The ‘gridding’ and ‘degridding’ steps are computationally very expensive

Solution: Use the novel Image-Domain Gridding (IDG) algorithm on accelerators

Algorithm credits: Bas van der Tol

slide-10
SLIDE 10

Placing visibilities onto a regular Fourier grid

Fourier domain gridding (using convolution kernels):

visibilities → gridder kernel → Fourier grid

Image domain gridding (using subgrids):

visibilities → gridder kernel → image subgrids → FFT → Fourier subgrids → adder → Fourier grid

[Figure legend: grid, visibility, convolution, pixel in subgrid]

slide-11
SLIDE 11

Image domain gridding: subgrids

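[Figure: the visibilities $V_j$ of one baseline, with corner indices $(1, 1)$, $(1, \tilde{C})$ and $(\tilde{T}, 1)$, are mapped onto a subgrid inside the grid]

A subset ($\tilde{T} \times \tilde{C}$) of the visibilities from baseline j is placed onto a subgrid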
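A rough sketch of the idea: every pixel of the image-domain subgrid accumulates all visibilities of the subset, each multiplied by a phasor, and a Fourier transform then turns the result into a Fourier subgrid that can be added to the grid. The code below is a simplification under loud assumptions: no W-term, no A-term, no taper, visibility offsets given in whole grid cells relative to the subgrid centre, and a naive DFT in place of an FFT. The function names are hypothetical.

```python
import cmath
import math

def grid_onto_subgrid(visibilities, size):
    """Image-domain gridding of a few visibilities onto one subgrid.

    visibilities: list of (du, dv, value); du, dv are offsets from the
    subgrid centre in grid cells. Assumed: no W-term, A-term or taper.
    """
    subgrid = [[0j] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            l = (x - size // 2) / size
            m = (y - size // 2) / size
            pixel = 0j
            for du, dv, value in visibilities:
                phasor = cmath.exp(2j * math.pi * (du * l + dv * m))
                pixel += value * phasor
            subgrid[y][x] = pixel
    return subgrid

def to_fourier(subgrid):
    """Naive centred DFT: image subgrid -> Fourier subgrid."""
    size = len(subgrid)
    half = size // 2
    out = [[0j] * size for _ in range(size)]
    for kv in range(-half, size - half):
        for ku in range(-half, size - half):
            acc = 0j
            for y in range(size):
                for x in range(size):
                    l, m = (x - half) / size, (y - half) / size
                    acc += subgrid[y][x] * cmath.exp(
                        -2j * math.pi * (ku * l + kv * m))
            out[kv + half][ku + half] = acc / (size * size)
    return out

S = 8
fourier = to_fourier(grid_onto_subgrid([(2, -1, 3 + 1j)], S))
# The visibility reappears at its (du, dv) offset in the Fourier subgrid.
print(abs(fourier[-1 + S // 2][2 + S // 2] - (3 + 1j)) < 1e-9)
```

In this toy setting the transform recovers the visibility at its own grid offset, which is what the adder would then place onto the full Fourier grid.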

slide-12
SLIDE 12

Image domain gridding: work distribution

(1) work (all subgrids for a few baselines)

(2) subset of work (a number of subgrids)

(3) work element (one subgrid)

(4) pixels
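The four levels can be read as a nested loop. The sketch below only illustrates that hierarchy on the CPU; how the levels map onto GPU kernel launches, thread blocks and threads is not stated on the slide, and the names `plan_work`, `process` and `gridder` are hypothetical.

```python
def plan_work(subgrids, subset_size):
    """Split the work (a list of subgrids) into subsets of work."""
    return [subgrids[i:i + subset_size]
            for i in range(0, len(subgrids), subset_size)]

def process(subsets, subgrid_size, gridder):
    """Walk the hierarchy: subset of work -> work element -> pixels."""
    results = []
    for subset in subsets:                       # (2) subset of work
        for subgrid_id in subset:                # (3) work element
            pixels = [[gridder(subgrid_id, x, y)  # (4) pixels
                       for x in range(subgrid_size)]
                      for y in range(subgrid_size)]
            results.append((subgrid_id, pixels))
    return results

# (1) work: ten subgrids, identified here simply by number.
subsets = plan_work(list(range(10)), subset_size=4)
print([len(s) for s in subsets])  # [4, 4, 2]

# Placeholder "gridder" that just encodes its arguments in the pixel value.
out = process(subsets, subgrid_size=2,
              gridder=lambda sid, x, y: sid * 100 + y * 10 + x)
print(out[9])
```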

slide-13
SLIDE 13

Optimizations

General:

  • Coarse-grained parallelism, vectorization, libraries
  • Double buffering, shared memory

Application specific:

  • Fine-grained parallelism
  • Data transpose (visibilities)
  • Data alignment (uvw coordinates)

Architecture specific:

  • Computation of phasor term ($e^{-i\phi}$)
  • Nvidia: one special function unit (SFU) for every four/six cores
  • GCN: one transcendental operation per SIMD per four clock cycles

slide-14
SLIDE 14

GPU implementation

Host side:

  • cpu threads offload subsets of work to GPU
  • subset of work: HtoD queue, execute queue, DtoH queue

Device side:

  • gpu global memory, shared memory, gpu cores, sfu
  • precompute / load from cache
  • preload into cache
  • gpu threads perform computation (in registers)
  • store result
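The three queues (HtoD, execute, DtoH) form a pipeline in which copies and computation for different subsets of work can overlap. The sketch below imitates that structure with Python threads and `queue.Queue`. It is only an analogy for the GPU queues: the "copy" and "compute" stages are placeholder functions, and none of the names come from the actual implementation.

```python
import queue
import threading

def stage(work, inbox, outbox):
    """Run one pipeline stage until the None sentinel arrives."""
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)  # pass the sentinel on to the next stage
            return
        outbox.put(work(item))

def run_pipeline(subsets):
    htod, execute, dtoh, done = (queue.Queue() for _ in range(4))
    stages = [
        # host-to-device copy (placeholder: tag the data)
        (lambda s: ("on-device", s), htod, execute),
        # kernel execution (placeholder: square every value)
        (lambda t: ("result", [x * x for x in t[1]]), execute, dtoh),
        # device-to-host copy (placeholder: strip the tag)
        (lambda t: t[1], dtoh, done),
    ]
    threads = [threading.Thread(target=stage, args=s) for s in stages]
    for t in threads:
        t.start()
    for subset in subsets:      # the host enqueues subsets of work
        htod.put(subset)
    htod.put(None)
    results = []
    while (item := done.get()) is not None:
        results.append(item)
    for t in threads:
        t.join()
    return results

print(run_pipeline([[1, 2], [3, 4], [5]]))  # [[1, 4], [9, 16], [25]]
```

While one subset is in the compute stage, the next can already be copied in and the previous one copied out, which is the point of keeping separate queues.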

slide-15
SLIDE 15

Results: throughput/runtime

Throughput: number of visibilities processed per second

[Figure: Throughput [MVisibilities/s] for gridding and degridding on Haswell, Pascal and Fiji]

[Figure: Runtime [seconds] on Haswell, Pascal and Fiji, broken down into gridder, subgrid-ifft, adder, grid-fft, splitter, subgrid-fft and degridder]

  • Most time spent in gridder/degridder
  • GPUs perform more than an order of magnitude better than the CPU

slide-16
SLIDE 16

Roofline analysis: overview

[Figure: roofline plot of Performance [TOp/s] versus Operational intensity [Op/Byte] for the gridder and degridder on Haswell, Pascal and Fiji]
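A roofline plot bounds the attainable performance of a kernel by the smaller of the peak compute rate and the product of memory bandwidth and operational intensity. A minimal sketch of that bound, using a made-up device (the peak and bandwidth values below are not taken from the slides):

```python
def roofline(peak_ops, bandwidth, intensity):
    """Attainable performance [Op/s] = min(peak, bandwidth * intensity)."""
    return min(peak_ops, bandwidth * intensity)

# Hypothetical device: 10 TOp/s peak, 500 GB/s memory bandwidth.
PEAK, BW = 10e12, 500e9

# Intensity [Op/Byte] above which the kernel is compute bound.
ridge = PEAK / BW

for intensity in (1, 8, 20, 64):
    bound = roofline(PEAK, BW, intensity)
    kind = "memory bound" if intensity < ridge else "compute bound"
    print(intensity, bound / 1e12, kind)
```

Kernels whose operational intensity lies to the right of the ridge point are limited by compute rather than by memory bandwidth.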

slide-17
SLIDE 17

Performance for FMA/sincos instruction mix

[Figure: Performance [GOp/s] versus ρ [fma/sincos] for Pascal, Fiji and Haswell; annotations: 22 %, 59 %, 98 %, 4×]

slide-18
SLIDE 18

Roofline analysis: instruction mix

[Figure: roofline plot of Performance [TOp/s] versus Operational intensity [Op/Byte] for the gridder and degridder on Haswell, Pascal and Fiji, with DRAM and Device Memory ceilings]

slide-19
SLIDE 19

Roofline analysis: shared memory

[Figure: roofline plot of Performance [TOp/s] versus Operational intensity [Op/Byte] for the gridder and degridder on Haswell, Pascal and Fiji, with DRAM, Device Memory and Shared Memory ceilings]

slide-20
SLIDE 20

Results: energy consumption/efficiency

[Figure: Energy consumption [kJ] on Haswell, Pascal and Fiji, broken down into gridder, subgrid-ifft, adder, grid-fft, splitter, subgrid-fft, degridder and host]

[Figure: Energy efficiency [GFlop/W] for gridder and degridder on Haswell, Pascal and Fiji]

  • Most energy spent in gridder/degridder
  • GPUs perform more than an order of magnitude better than the CPU
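The two units relate as follows: energy in kJ is average power times runtime, and GFlop/W is the same as GFlop per joule, so efficiency is the operation count divided by the energy used. A small sketch with made-up numbers (not measurements from the slides); both function names are hypothetical.

```python
def energy_kilojoules(average_power_watts, runtime_seconds):
    """Energy [kJ] = average power [W] * runtime [s] / 1000."""
    return average_power_watts * runtime_seconds / 1000.0

def energy_efficiency_gflop_per_watt(flop_count, energy_joules):
    """GFlop/W = (GFlop/s) / W = GFlop / J."""
    return flop_count / 1e9 / energy_joules

# Hypothetical run: 2e14 floating-point operations in 40 s at 250 W.
energy_kj = energy_kilojoules(250.0, 40.0)
efficiency = energy_efficiency_gflop_per_watt(2e14, energy_kj * 1000.0)
print(energy_kj, efficiency)
```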

slide-21
SLIDE 21

Results: AW-projection at the cost of W-projection

[Figure: Throughput [MVisibilities/s] versus W-kernel size $N_W$ (8 to 64) for W-projection and Image-Domain Gridding; annotated factors: 1.4×, 1.1×, 1.4×, 1.3×, 1.2×, 1.8×, 2.1×, 2.4×]

slide-22
SLIDE 22

Conclusion

  • First implementations of the IDG algorithm on CPUs and GPUs
  • First efficient degridding implementation on GPUs ever
  • A thorough (roofline) analysis of the achieved performance
  • An assessment of energy efficiency

IDG on GPUs is a candidate to meet the demanding computational and energy efficiency constraints imposed by future telescopes such as the Square Kilometre Array (SKA).

Image-Domain Gridding on Graphics Processors, Bram Veenboer, Matthias Petschow and John W. Romein, IPDPS 2017