S9306 Extreme Signal-Processing Performance Using Tensor Cores (PowerPoint presentation transcript)


SLIDE 1

Netherlands Institute for Radio Astronomy

S9306 Extreme Signal-Processing Performance Using Tensor Cores
Astronomical Imaging on GPUs
John Romein and Bram Veenboer

This talk consists of two parts. In the first part, we explain how we use Tensor Cores to obtain extreme signal-processing performance. In the second part, we explain how we solve the largest computational challenge in the imaging pipeline of modern radio telescopes.

Tuesday 19th March, 2019, GPU Technology Conference 2019, San Jose, USA

SLIDE 2

Tensor Cores: Signal Processing at Unprecedented Speeds

John Romein, ASTRON (Netherlands Institute for Radio Astronomy)

SLIDE 3

Outline

  • tensor cores
  • complex numbers and matrix multiplications
  • signal-processing algorithms
    – correlations
    – beam forming
    – ...
  • analyze performance

SLIDE 4

tensor cores

  • mixed-precision matrix-multiplication hardware
    – Volta, Turing
  • V100: peak 112 (!) TFLOPS
  • designed for deep learning
SLIDE 5

how to use tensor cores

  • libraries (cuBLAS, CUTLASS, ...) ✘
    – insufficient support for complex numbers
  • WMMA ✔
    – operates directly on 16x16 matrices (+ a few more formats)
    – usable from a CUDA program

SLIDE 6

WMMA example

  • a warp performs a 16x16 matrix multiplication

    load_matrix_sync(a_frag, &a[…][…], K);
    load_matrix_sync(b_frag, &b[…][…], N);
    fill_fragment(c_frag, 0);
    mma_sync(d_frag, a_frag, b_frag, c_frag);   // d = a*b + c
    store_matrix_sync(&d[…][…], d_frag, …);
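For context, here is the slide's fragment sequence as a minimal, self-contained CUDA kernel. This is our own sketch, not code from the talk: it assumes 16x16 row-major FP16 inputs with an FP32 accumulator, and it is launched with a single warp.

    #include <mma.h>
    using namespace nvcuda;

    // One warp computes one 16x16 tile: d = a*b + 0 (FP16 in, FP32 out).
    __global__ void wmma16(const half *a, const half *b, float *d)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag, d_frag;

        wmma::load_matrix_sync(a_frag, a, 16);            // leading dimension K = 16
        wmma::load_matrix_sync(b_frag, b, 16);            // leading dimension N = 16
        wmma::fill_fragment(c_frag, 0.0f);                // start from zero
        wmma::mma_sync(d_frag, a_frag, b_frag, c_frag);   // d = a*b + c
        wmma::store_matrix_sync(d, d_frag, 16, wmma::mem_row_major);
    }

    // launched as: wmma16<<<1, 32>>>(a, b, d);   (requires sm_70 or newer)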

SLIDE 7

signal processing: complex numbers

  • describe phase & amplitude of a signal
  • real and imaginary part (ar, ai)
  • complex multiply-add = 4 real multiply-adds

    c += a·b:
      cr += ar·br
      cr += −ai·bi   ← the −sign has no tensor-core support
      ci += ar·bi
      ci += ai·br
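Written out in code, the expansion looks as follows; `cmac` is our own illustrative helper name:

    // Complex multiply-accumulate (cr, ci) += (ar, ai) * (br, bi) as 4 real FMAs.
    __device__ void cmac(float &cr, float &ci, float ar, float ai, float br, float bi)
    {
        cr = fmaf( ar, br, cr);   // cr += ar*br
        cr = fmaf(-ai, bi, cr);   // cr += -ai*bi  (the -sign tensor cores cannot apply)
        ci = fmaf( ar, bi, ci);   // ci += ar*bi
        ci = fmaf( ai, br, ci);   // ci += ai*br
    }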

SLIDE 8

two complex array/matrix formats

[figure: the same 4x4 complex matrix in both layouts: split into separate real and imaginary matrices, and as interleaved real/imaginary pairs]

  1) split:        float real[4][4], imag[4][4];
  2) interleaved:  std::complex<float> matrix[4][4];

SLIDE 9

1) complex split matrices

  • maps well to tensor cores
    – negate the Ai values

    [C] = [A][B]  →  [Cr] = [Ar][Br] + [−Ai][Bi]
                     [Ci] = [Ar][Bi] + [Ai][Br]
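In code, the split-format product then amounts to four real matrix multiplications. In this sketch, `real_mma(C, X, Y)` stands for one tensor-core C += X*Y tile product (e.g., the WMMA sequence of SLIDE 6), and `neg_Ai` is the negated copy of Ai, prepared while loading; both names are our own.

    real_mma(Cr, Ar,     Br);   // Cr += Ar*Br
    real_mma(Cr, neg_Ai, Bi);   // Cr += (-Ai)*Bi
    real_mma(Ci, Ar,     Bi);   // Ci += Ar*Bi
    real_mma(Ci, Ai,     Br);   // Ci += Ai*Br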

SLIDE 10

2) interleaved complex matrices

  • reorder the right-hand matrix for tensor-core use
    – duplicate/permute/negate entries

[figure: an interleaved row (r0 i0 r1 i1 ⋯ r7 i7) multiplies a right-hand matrix in which every complex entry is expanded into a 2x2 real block ( r  i / −i  r )]
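A host-side sketch of this reordering (our own illustration, names hypothetical): every complex entry r + i·j of the right-hand matrix becomes the 2x2 real block ( r  i / −i  r ), so an interleaved row of the left matrix times the expanded matrix directly yields interleaved results.

    #include <complex>

    // Expand a K x N interleaved complex matrix B into a 2K x 2N real matrix Bx,
    // such that (interleaved A, viewed as reals) * Bx = interleaved C (as reals).
    void expand(const std::complex<float> *B, float *Bx, int K, int N)
    {
        for (int k = 0; k < K; k++)
            for (int n = 0; n < N; n++) {
                float r = B[k * N + n].real(), i = B[k * N + n].imag();
                Bx[(2 * k    ) * 2 * N + 2 * n    ] =  r;   // row ( r  i)
                Bx[(2 * k    ) * 2 * N + 2 * n + 1] =  i;
                Bx[(2 * k + 1) * 2 * N + 2 * n    ] = -i;   // row (-i  r)
                Bx[(2 * k + 1) * 2 * N + 2 * n + 1] =  r;
            }
    }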

SLIDE 11

complex formats: split array vs. interleaved

  • implemented both
  • generally no big performance difference
SLIDE 12

tensor cores for signal processing

  • suitable if
    – input ≤ 16 bit ✔
    – algorithm translates to matrix-matrix multiplication ✔+✘

SLIDE 13

algorithm 1: correlations

SLIDE 14

correlations

  • combines the data of all telescope pairs
    – each pair: multiply & accumulate
    – ½·r·(r+1) pairs for r receivers

    correlation(recv1, recv2) = Σ_{time=0}^{integrationTime−1} sample(recv1, time) · conj(sample(recv2, time))
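As a reference for what the tensor-core kernels compute, here is a straightforward (non-tensor-core) sketch of the correlation; variable names and sizes are our own examples, and the conjugate follows C = A·A^H on the next slide.

    #include <complex>

    constexpr int NR_RECEIVERS = 576, INTEGRATION_TIME = 768;   // example sizes

    // Multiply-and-accumulate every receiver pair over the integration time;
    // only the lower triangle (r2 <= r1) is computed: ½·r·(r+1) pairs.
    void correlate(const std::complex<float> samples[][INTEGRATION_TIME],
                   std::complex<float> correlations[][NR_RECEIVERS], int nrReceivers)
    {
        for (int r1 = 0; r1 < nrReceivers; r1++)
            for (int r2 = 0; r2 <= r1; r2++) {
                std::complex<float> sum = 0;
                for (int t = 0; t < INTEGRATION_TIME; t++)
                    sum += samples[r1][t] * std::conj(samples[r2][t]);   // conj: C = A*A^H
                correlations[r1][r2] = sum;
            }
    }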

SLIDE 15

correlator computations

  • C ← A·A^H
  • C = C^H → compute & store only a triangle

[figure: A is a receivers × time matrix; C is a receivers × receivers matrix]

SLIDE 16

work decomposition

  • computeSquares()
    – thread block: 64x64 receivers
    – warp: 32x16 receivers
  • computeTriangles()
    – redundant computations above the diagonal

[figure: the receivers × receivers triangle tiled into 64x64-receiver blocks]

SLIDE 17

correlator implementation

  • cache input: L2 → shared memory → registers
    – fix the −sign on the fly
  • wmma::store_matrix_sync() cannot write to a triangle
    – copy via shared memory, or
    – write the accumulation registers directly (hack!)

SLIDE 18

correlator performance

[figure: TFLOPS vs. number of receivers (64-576) for correlateSquares(), correlateTriangles(), and overall; measured on a Tesla V100]

SLIDE 19

correlator roofline analysis

[figure: roofline, TFLOPS vs. FLOPS/byte (log-log), with memory-bandwidth-bound and compute-bound regions; data points for correlateSquares() and correlateTriangles()]

SLIDE 20

correlator energy efficiency

[figure: GFLOP/J vs. number of receivers (64-576) for computeSquares(), computeTriangles(), and overall]

(measured on a Titan RTX, not a V100)

SLIDE 21

innovation beyond Moore's law

[figure: FP32 and Tensor Core TFLOPS by year (2013-2020) for the Tesla K40 (Kepler), Titan X (Maxwell), Titan X (Pascal), Tesla V100 (Volta), and Titan RTX (Turing)]

SLIDE 22

algorithm 2: beam forming

SLIDE 23

beam forming

  • increases sensitivity in a particular direction
  • (weighted) addition of signals

[image credit: Jason Hessels]

    bfdata(time, beam) = Σ_{recv=0}^{nrReceivers−1} samples(time, recv) · weights(recv, beam)

SLIDE 24

beam former implementation

  • multiple beams: complex matrix-matrix multiplication

[figure: bfdata (time × beams) = samples (time × receivers) × weights (receivers × beams)]

    bfdata(time, beam) = Σ_{recv=0}^{nrReceivers−1} samples(time, recv) · weights(recv, beam)
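A scalar sketch of the same computation (our own illustration, example sizes), showing that it is exactly a complex matrix-matrix product of samples (time × receivers) and weights (receivers × beams):

    #include <complex>

    constexpr int NR_TIMES = 3072, NR_RECEIVERS = 576, NR_BEAMS = 512;   // example sizes

    void beamform(const std::complex<float> samples[NR_TIMES][NR_RECEIVERS],
                  const std::complex<float> weights[NR_RECEIVERS][NR_BEAMS],
                  std::complex<float> bfdata[NR_TIMES][NR_BEAMS])
    {
        for (int t = 0; t < NR_TIMES; t++)
            for (int b = 0; b < NR_BEAMS; b++) {
                std::complex<float> sum = 0;
                for (int r = 0; r < NR_RECEIVERS; r++)
                    sum += samples[t][r] * weights[r][b];   // complex multiply-add
                bfdata[t][b] = sum;
            }
    }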

SLIDE 25

beam former performance and roofline analysis

[figure: left, roofline (TFLOPS vs. FLOPS/byte) with points from 64 to 512 receivers spanning the memory-bandwidth-bound and compute-bound regions; right, TFLOPS vs. number of receivers (64-512) for 128, 256, and 512 beams]

SLIDE 26

other algorithms
SLIDE 27

other signal-processing algorithms

  • nonuniform Fourier transforms ✔
    – map well to complex matrix-matrix multiplication
    – ≤ 80 TFLOPS
  • FIR filter ✘
    – as matrix multiplication → many zeros
    – typically memory-bandwidth bound
  • FFT ✘
    – not a matrix multiplication
    – memory-bandwidth bound

[figure: roofline placing the nuFT near the compute roof and the FIR filter and FFT in the memory-bandwidth-bound region: "no need for tensor cores"]

SLIDE 28

current / future work

  • try further optimizations
    – correlator: near the diagonal
    – beam forming: cublasLtMatmul() (CUDA 10.1)
  • support any number of receivers/beams
  • 8-bit and 4-bit inputs
SLIDE 29

conclusions

  • tensor cores for signal processing, via matrix-matrix multiplication:
    – correlating
    – multi-beam forming
    – nonuniform Fourier transforms
  • unprecedented performance (up to ~75 TFLOPS, up to 6x)

This work is funded by the European Union under grant no. H2020-FETHPC-754304 (DEEP-EST).

SLIDE 30

Netherlands Institute for Radio Astronomy

Astronomical Imaging on GPUs

Bram Veenboer

Tuesday 19th March, 2019, GPU Technology Conference 2019, San Jose, USA

SLIDE 31

Outline

  • Introduction:
    – Interferometry
    – The Image-Domain Gridding algorithm
    – Performance analysis
  • Analysis and optimization:
    – Gridder kernel
    – Imaging application
  • Performance and energy-efficiency comparison
  • Results in the context of the Square Kilometre Array
  • Summary

SLIDE 32

Introduction to radio astronomy

  • Observe the sky at radio wavelengths → map of radio sources
  • Dish-based telescopes: VLA, ALMA, MeerKAT, SKA-1 Mid
  • The size of the telescope is proportional to the wavelength
  • Use arrays of antennas for low frequencies (LOFAR, MWA, SKA-1 Low)

[image credits: NRAO]

SLIDE 33

Radio telescope: astronomical interferometer

  • Interferometer: array of separate telescopes
  • Interferometry: combine the signals from separate radio telescopes
  • Resolution similar to one very large dish

[figure: two dishes separated by a baseline]

SLIDE 34

Interferometry theory

[figure: uv-plane geometry (u, v, w) and image plane (l, m) with angle Θ towards the north pole; uv-coverage plot with u and v in km]

  • Sampling of the 'uv-plane': 'visibilities'
  • Earth-rotation synthesis
  • Visibilities −DFT→ image
  • Visibilities −gridding→ grid −FFT→ image

SLIDE 37

Square Kilometre Array

[photos: SKA1 Low, Australia; SKA1 Mid, Africa]

  • Data rates up to ≈ 10,000,000,000 visibilities/second

SLIDE 38

Sky-image creation using Image-Domain Gridding

[figure: incoming radio waves arrive at stations; each pair of stations forms a baseline; the pipeline runs correlation → calibration → imaging. In imaging, gridder kernels turn visibilities into subgrids, the subgrids are FFT'd, and an adder kernel places them on the grid; gridding + iFFT and FFT + degridding convert between visibilities and the image]

SLIDE 39

W-projection gridding and Image-Domain gridding

  • W-projection gridding: using convolution kernels
  • Image-Domain gridding: using subgrids

[figure: both schemes, showing visibilities, convolution kernels, subgrids, and the grid]

See "Image Domain Gridding: a fast method for convolutional resampling of visibilities", S. van der Tol et al., A&A 2018, and "Image-Domain Gridding on Graphics Processors", B. Veenboer et al., IPDPS 2017, for details.

SLIDE 40

Image-Domain Gridding algorithm

    for s = 1 .. S do                                     // subgrids mapped to thread blocks
        complex<float> subgrid[P][N×N];
        for i = 1 .. N×N do                               // pixels mapped to threads
            float offset = compute_offset(s, i);
            for t = 1 .. T do
                float index = compute_index(s, i, t);
                for c = 1 .. C do
                    float scale = scales[c];
                    float phase = offset - (index × scale);                 // 1 fma
                    complex<float> phasor = {cos(phase), sin(phase)};       // 1 cis
                    for p = 1 .. P do                                       // 4 polarizations
                        complex<float> visibility = visibilities[t][c][p];  // load 8 bytes
                        subgrid[p][i] += cmul(phasor, visibility);          // 4 fma
                    end
                end
            end
        end
        apply_corrections(subgrid);                       // aterm, taper, iFFT
        store(subgrid);
    end
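Following the mapping comments in the listing (subgrids to thread blocks, pixels to threads), a compact CUDA rendering of the inner loops might look as follows. This is our own sketch, not the production kernel: compute_offset and compute_index stand for the helpers named in the listing, N and P are example constants, and the a-term/taper/iFFT corrections are omitted.

    constexpr int P = 4;    // polarizations
    constexpr int N = 32;   // subgrid size (example)

    __device__ float compute_offset(int s, int i);          // hypothetical helpers
    __device__ float compute_index(int s, int i, int t);    // from the listing above

    __global__ void idg_gridder(int T, int C, const float *scales,
                                const float2 *visibilities,   // [T][C][P], interleaved complex
                                float2 *subgrids)             // [S][P][N*N]
    {
        int s = blockIdx.x;     // one thread block per subgrid
        int i = threadIdx.x;    // one thread per pixel, 0 .. N*N-1

        float2 pixel[P] = {};                         // per-thread accumulators
        float offset = compute_offset(s, i);
        for (int t = 0; t < T; t++) {
            float index = compute_index(s, i, t);
            for (int c = 0; c < C; c++) {
                float phase = offset - index * scales[c];       // 1 fma
                float2 phasor;
                __sincosf(phase, &phasor.y, &phasor.x);         // 1 cis (SFU, see SLIDE 45)
                for (int p = 0; p < P; p++) {                   // 4 polarizations
                    float2 v = visibilities[(t * C + c) * P + p];   // load 8 bytes
                    pixel[p].x += phasor.x * v.x - phasor.y * v.y;  // 4 fma
                    pixel[p].y += phasor.x * v.y + phasor.y * v.x;
                }
            }
        }
        for (int p = 0; p < P; p++)
            subgrids[(s * P + p) * N * N + i] = pixel[p];
        // apply_corrections (aterm, taper, iFFT) omitted here
    }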

SLIDE 41

Roofline analysis

[figure: roofline for the Pascal GPU, Performance [TFlop/s] vs. Operational intensity [Op/Byte], with memory-bound and compute-bound regions, a measured benchmark line, and two example points]

  • The operational intensity gives an upper bound on performance

SLIDE 42

Kernel v1 (reference) 1/2

[figure: roofline with the reference IDG gridder kernel in the compute-bound region]

  • Throughput: 70 MVis/s (number of visibilities processed per second)

SLIDE 43

Kernel v1 (reference) 2/2

  • High compute utilization, low memory utilization → compute bound
  • High FPU utilization, yet low performance?
  • What instructions are executed?

SLIDE 44

Image-Domain Gridding algorithm

(the algorithm listing of SLIDE 40, repeated for reference)

SLIDE 45

Special Function Units (SFUs) to evaluate sine/cosine

Compile using --use_fast_math, or specify explicitly:

    b.x = cos(a.x)  →  asm("cos.approx.f32 %0, %1;" : "=f"(b.x) : "f"(a.x));
    b.y = sin(a.x)  →  asm("sin.approx.f32 %0, %1;" : "=f"(b.y) : "f"(a.x));
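Equivalently, CUDA's __sincosf intrinsic produces both SFU approximations in one call; a small sketch of our own for the gridder's phasor:

    __device__ float2 phasor(float phase)
    {
        float2 p;
        __sincosf(phase, &p.y, &p.x);   // p.x = cos(phase), p.y = sin(phase), SFU approximations
        return p;
    }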

SLIDE 46

Performance for instruction mix FMA + sine/cosine

[figure: Performance [TFlop/s] vs. ρ [fma/sincos] on Pascal, with curves for the peak, cis on the SFUs (cisSFU), and cis on the FPU (cisFPU); the gridder's ratio is marked, with 95%, 22%, and 25% annotated]

  • Using SFUs, FMAs and sine/cosine are computed simultaneously

SLIDE 47

Kernel v2 (sfu) 1/2

[figure: roofline with the sfu kernel above the reference kernel]

  • 2.5× performance increase, but still below the peak (180 MVis/s)

SLIDE 48

Kernel v2 (sfu) 2/2

  • Many memory operations, medium memory utilization → optimize the memory access pattern
  • Many cache hits → cache input data explicitly using shared memory

SLIDE 49

Kernel v3 (cache) 1/3

[figure: roofline with the cache kernel above the sfu and reference kernels]

  • Getting close to the peak (70%, 250 MVis/s)

SLIDE 50

Kernel v3 (cache) 2/3

  • Memory utilization: texture → shared
  • Memory-operation overhead reduced from 20% to 10%
  • High shared-memory bandwidth → more analysis is needed...

SLIDE 51

Roofline analysis

[figure: roofline for Pascal with measured shared-memory benchmarks added]

  • Add an additional roof for shared memory

SLIDE 52

Kernel v3 (cache) 3/3

[figure: roofline with shared-memory-bound and compute-bound regions; the cache kernel sits near the shared-memory roof]

  • Close to the (measured) shared-memory roof
  • Increase the operational intensity: every thread computes multiple pixels

SLIDE 53

Kernel v4 (unroll) 1/3

[figure: roofline with the unroll kernel above the cache kernel]

  • The new kernel is not bound by shared-memory bandwidth

SLIDE 54

Kernel v4 (unroll) 2/3

[figure: roofline with the unroll, cache, sfu, and reference kernels]

  • The kernel is very close to the theoretical peak (90%, 300 MVis/s)

SLIDE 55

Kernel v4 (unroll) 3/3

  • Very high FPU utilization, low memory utilization
  • FPU usage at the maximum → gridder done!

SLIDE 56

Kernel optimization summary

General optimization steps:

  1. Measure (floating-point) performance and memory bandwidth
  2. Apply the roofline model to assess potential gains
  3. Find optimization opportunities by profiling
  4. Apply optimizations accordingly

In case of the gridder kernel:

  • Reference: 70 MVis/s
  • Use Special Function Units: 180 MVis/s (2.6×)
  • Use a shared-memory cache: 250 MVis/s (1.4×)
  • Increase the operational intensity: 300 MVis/s (1.2×)

SLIDE 58

I/O optimization

  • Kernel throughput ≠ application throughput
  • Data transfers cannot be ignored
  • Overlap PCIe transfers and compute (see the sketch below)
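A minimal double-buffering sketch of this overlap, assuming pinned host buffers (cudaMallocHost) and a hypothetical gridder kernel; all names and sizes are illustrative, not from the slides.

    // Two streams alternate, so job n+1's PCIe copies overlap job n's kernel.
    void process_jobs(int nrJobs, float2 *host_in[], float2 *host_out[],
                      float2 *dev_in[2], float2 *dev_out[2],
                      size_t bytesIn, size_t bytesOut,
                      int nrSubgrids, int nrPixels)
    {
        cudaStream_t streams[2];
        for (int i = 0; i < 2; i++)
            cudaStreamCreate(&streams[i]);

        for (int job = 0; job < nrJobs; job++) {
            int buf = job % 2;
            cudaMemcpyAsync(dev_in[buf], host_in[job], bytesIn,
                            cudaMemcpyHostToDevice, streams[buf]);
            gridder_kernel<<<nrSubgrids, nrPixels, 0, streams[buf]>>>(dev_in[buf], dev_out[buf]);
            cudaMemcpyAsync(host_out[job], dev_out[buf], bytesOut,
                            cudaMemcpyDeviceToHost, streams[buf]);
        }
        cudaDeviceSynchronize();   // wait for both streams to drain
    }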

SLIDE 60

Creating large images: using CPU+GPU

  • Subgrids are added onto a larger grid
  • The grid might not fit in GPU memory (e.g. 40k×40k ≈ 48 GB)
  • Perform the addition on the host:
    (1) copy input, (2) GPU execution, (3) copy output, (4) CPU execution

SLIDE 62

Creating large images: using Unified Memory

  • Allocate memory:
    malloc(size) → cuMemAllocManaged(&ptr, size, CU_MEM_ATTACH_GLOBAL)
  • Apply tiling to keep nearby pixels close in memory:
    grid[Gy][Gx] → grid[Ny][Nx][Ty][Tx]   (Ny×Ty = Gy and Nx×Tx = Gx)
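A sketch of both steps using the runtime-API equivalent cudaMallocManaged and a tiled index helper; the grid and tile sizes are our own example values.

    #include <complex>
    #include <cuda_runtime.h>

    constexpr size_t Gy = 40960, Gx = 40960;   // grid size (example)
    constexpr size_t Ty = 128,   Tx = 128;     // tile size (example)
    constexpr size_t Ny = Gy / Ty, Nx = Gx / Tx;

    // grid[Gy][Gx] viewed as grid[Ny][Nx][Ty][Tx]: pixels of one tile are contiguous
    __host__ __device__ size_t tiled_index(size_t y, size_t x)
    {
        return (((y / Ty) * Nx + (x / Tx)) * Ty + (y % Ty)) * Tx + (x % Tx);
    }

    std::complex<float> *alloc_grid()
    {
        std::complex<float> *grid;
        cudaMallocManaged(&grid, Gy * Gx * sizeof *grid);   // Unified Memory allocation
        return grid;
    }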

SLIDE 63

Creating large images: results

[figure: throughput [MVisibilities/s] vs. grid size (4096-65536) for GPU-only, Unified, and Hybrid imaging, with 10% and 24% differences annotated]

SLIDE 64

Performance and energy-efficiency comparison

  • 2x Xeon E5-2697v3 (Haswell), Xeon Phi 7210X (KNL), NVIDIA GTX Titan X (Pascal), and AMD Vega FE:

[figure: gridding and degridding throughput [MVisibilities/s] and energy efficiency [GFlop/W] for Haswell, KNL, Pascal, and Vega]

  • Pascal is the fastest and most energy-efficient device

SLIDE 65

Image-Domain Gridding for the Square Kilometre Array

  • SKA-1 Low visibility rate: 9.5 GVis/s → imaging data rate: 95 GVis/s
  • Compute: 50 PFlop/s (DP, total)
  • Power cap: ≈ 5 MW (per site)
  • Imaging 'budget' ≈ 60%:
    – 15.3 PFlop/s (DP, per site)
    – 2887 Tesla P100 GPUs
    – 588.5 kW
  • IDG on one Tesla P100: ≈ 0.10 GVis/s
  • IDG meets the SKA-1 Low requirements!
    – 950 P100s required: 33%
    – Power consumption: 171 kW, 29%

SLIDE 68

Summary

  • High-performance astronomical imaging on GPUs with Image-Domain Gridding
  • Compute bound (unlike other imaging algorithms)
  • A-term correction at negligible cost
  • Very large images (Unified Memory)
  • GPUs are much faster and more energy-efficient than CPUs and Xeon Phi
  • The most challenging sub-part of imaging for the SKA is solved!

More details: "Image Domain Gridding: a fast method for convolutional resampling of visibilities", S. van der Tol et al., A&A 2018, and "Image-Domain Gridding on Graphics Processors", B. Veenboer et al., IPDPS 2017.
Source available at: https://gitlab.com/astron-idg/idg