Accelerating Reverse Time Migration application for seismic imaging with GPU architecture
Sergio Orlandini, Cristiano Calonaci, Luca Ferraro @CINECA
Nicola Bienati @ENI
GPU Technology Conference (GTC), San Jose (CA), April 4-7, 2016 (talk given on April 5)
Reverse Time Migration developers
@CINECA: Simone Campagna, Cristiano Calonaci, Marco Comparato, Massimiliano Culpo, Luca Ferraro, Roberto Gori, Chiara Latini, Sergio Orlandini, Stefano Tagliaventi
@ENI: Nicola Bienati, Jacopo Panizzardi
@NVIDIA: Paulius Micikevicius, Peng Wang
Reverse Time Migration algorithm
RTM applies the discretized acoustic wave equation to propagate waves through a given velocity model. It consists of 3 main parts:
1 Forward propagation of the solution of the wave equation to model the source wavefield.
2 Backward propagation, in reverse time, of the data recorded in the field.
3 Calculation of the imaging condition.
Both Forward and Backward propagation consist of a Finite Difference (FD) solution of the wave equation, which is:
Compute intensive: cost grows with the fourth power of the grid dimension (three spatial dimensions plus the time steps)
Memory demanding: requires domain decomposition
Communication intensive: ghost cells are exchanged with first neighbours at every time step
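As a concrete illustration, a minimal sketch of one FD time step follows: a second-order-in-time, 8th-order-in-space (radius 4) update of the acoustic wave equation. All names (p_prev, p_curr, vel2dt2, the coefficient array c) are illustrative, not the actual RTM code.

```cuda
// Minimal sketch of one FD time step (not the production kernel):
//   p_next = 2*p_curr - p_prev + (v^2*dt^2/dx^2) * Laplacian(p_curr)
// with an 8th-order (radius-4) central stencil in space.
#define R 4                    // stencil radius

__constant__ float c[R + 1];   // central-difference coefficients c[0]..c[R]

__global__ void fd_step(const float* __restrict__ p_prev,
                        const float* __restrict__ p_curr,
                        float*       __restrict__ p_next,
                        const float* __restrict__ vel2dt2,  // v^2*dt^2/dx^2, precomputed
                        int nx, int ny, int nz)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix < R || ix >= nx - R || iy < R || iy >= ny - R) return;

    for (int iz = R; iz < nz - R; ++iz) {          // march along z
        size_t i = ((size_t)iz * ny + iy) * nx + ix;
        float lap = 3.0f * c[0] * p_curr[i];       // center point, once per dimension
        for (int r = 1; r <= R; ++r)
            lap += c[r] * (p_curr[i + r]                   + p_curr[i - r]                    // x
                         + p_curr[i + (size_t)r * nx]      + p_curr[i - (size_t)r * nx]       // y
                         + p_curr[i + (size_t)r * nx * ny] + p_curr[i - (size_t)r * nx * ny]); // z
        p_next[i] = 2.0f * p_curr[i] - p_prev[i] + vel2dt2[i] * lap;  // leapfrog in time
    }
}
```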
RTM workflow
3 main tasks:
1 FD propagation: the second-order wave equation is solved with a finite difference approximation
2 Exchange borders: at each time step the domain borders have to be exchanged between first neighbours via MPI communications
3 Imaging: at the imaging frequency, write the source field during Forward, read the source field during Backward, and compute the imaging condition during Backward
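The imaging condition in step 3 is typically a zero-lag cross-correlation of the source and receiver wavefields. A minimal host-side sketch, assuming that standard formulation (the slides do not spell out the exact condition used):

```cuda
// Zero-lag cross-correlation imaging condition (computed on the host in
// this workflow): image(x) += src(x,t) * rcv(x,t), accumulated at every
// imaging step of the backward propagation.
void imaging_condition(float* image,
                       const float* src_snap,   // source field read back from I/O
                       const float* rcv_field,  // current backward wavefield
                       size_t n)
{
    for (size_t i = 0; i < n; ++i)
        image[i] += src_snap[i] * rcv_field[i];
}
```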
Finite Difference propagation kernel performances
Finite Difference propagation is the most compute-intensive task and the most relevant part of the RTM application. It was "easily" ported to GPU: the speed-up vs CPU is roughly 4x (1 node/16 cores vs 1 node with 2 GPUs). NB: this concerns only the stand-alone FD kernel performance, not the entire RTM application.
Isotropic kernel: single GPU, radius = 4, grid 512x512x512, 200 time steps.
TTI kernel: single GPU, radius = 4, grid 512x512x512, 200 time steps.
RTM-GPU workflow
Queue of imaging steps with a dedicated buffer for wavefield snapshots, handled by 2 concurrent host threads:
Kernel thread [KT], device management:
- Computes the wavefields, managing the FD kernel launches
- Exchanges the borders
- Transfers the wavefield snapshots D2H
Imaging thread [IT], host management:
- Computes the imaging: waits for an available snapshot, then computes the imaging condition
- Writes/reads snapshots to/from I/O
- Forks into a pool of threads for computing the imaging
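A minimal sketch of this producer/consumer scheme, assuming a plain mutex/condition-variable queue (SnapshotQueue and all names are illustrative, not the actual code):

```cuda
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

// The kernel thread (KT) produces wavefield snapshots copied D2H; the
// imaging thread (IT) blocks until one is available, then consumes it.
struct SnapshotQueue {
    std::queue<std::vector<float>> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void push(std::vector<float> snap) {               // called by KT
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(snap)); }
        cv.notify_one();
    }
    void finish() {                                    // KT: no more snapshots
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
    }
    bool pop(std::vector<float>& snap) {               // called by IT
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty() || done; });
        if (q.empty()) return false;                   // finished
        snap = std::move(q.front()); q.pop();
        return true;
    }
};

// IT main loop: wait for a snapshot, then compute imaging (in practice
// forked over a pool of host threads; rcv stands for the receiver field
// of the matching time step, as in the cross-correlation sketched earlier).
void imaging_thread(SnapshotQueue& sq, float* image, const float* rcv, size_t n) {
    std::vector<float> snap;
    while (sq.pop(snap))
        for (size_t i = 0; i < n; ++i)
            image[i] += snap[i] * rcv[i];
}
```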
RTM-GPU workflow
Optimization of communications: FD propagation in 2 steps
1 Propagate the halo regions that have to be exchanged
2 Propagate the internal domain; while the internal domain is being propagated, concurrently exchange the halo regions
This overlaps communications with computation: communications are hidden behind the device computation. The exchange-borders task becomes more complex, mixing Device/Host copies, MPI, GPU-Direct and Peer-To-Peer transfers.
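A minimal sketch of the two-stream scheme, assuming CUDA streams plus host-staged MPI; kernel names, buffers and the single-neighbour exchange are illustrative:

```cuda
#include <cuda_runtime.h>
#include <mpi.h>

// Illustrative stand-ins for the real propagators.
__global__ void fd_halo(float* p)  { /* propagate only the halo slabs */ }
__global__ void fd_inner(float* p) { /* propagate the interior points */ }

// One time step: the halo stream computes and downloads the borders first,
// then MPI exchanges them while the inner stream propagates the interior.
void time_step(float* d_field, float* d_halo, float* d_ghost,
               float* h_send, float* h_recv, size_t halo_count, int nbr,
               cudaStream_t s_halo, cudaStream_t s_inner)
{
    size_t halo_bytes = halo_count * sizeof(float);

    // 1) propagate the halo slabs, then start their download
    fd_halo<<<256, 256, 0, s_halo>>>(d_field);
    cudaMemcpyAsync(h_send, d_halo, halo_bytes, cudaMemcpyDeviceToHost, s_halo);

    // 2) propagate the interior concurrently on the other stream
    fd_inner<<<256, 256, 0, s_inner>>>(d_field);

    // 3) exchange halos while the interior kernel is still running;
    //    with GPU-Direct, MPI takes device pointers and the staging copies
    //    disappear; intra-node, Peer-To-Peer (cudaMemcpyPeerAsync) can
    //    replace MPI entirely.
    cudaStreamSynchronize(s_halo);
    MPI_Sendrecv(h_send, (int)halo_count, MPI_FLOAT, nbr, 0,
                 h_recv, (int)halo_count, MPI_FLOAT, nbr, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpyAsync(d_ghost, h_recv, halo_bytes, cudaMemcpyHostToDevice, s_halo);

    cudaDeviceSynchronize();   // both streams complete before the next step
}
```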
RTM-GPU performances: only FD kernels
Simulation details: imaging disabled (no I/O operations), only computational FD throughput measured, from 1 up to 5 nodes, 2 GPUs per node, TTI simulation, prestack mode.
* Normalized on the number of nodes and devices; compared against the stand-alone kernel performance.
Considerations: same trend as the pure FD computational speed-up; no overhead inside the RTM application; complete overlap of the halo communications behind computation.
RTM-GPU performances: with imaging calculation
Simulation details: high-frequency imaging (heavily stressing the I/O), FD + imaging computation, from 1 up to 5 nodes, 2 GPUs per node, TTI simulation, prestack mode.
* Normalized on the number of nodes and devices.
Considerations: the I/O is the bottleneck; the RTM speed-up depends on the I/O bandwidth, not on the GPU throughput; RTM kernel performance is unstable and drops due to the unbalance of the I/O operations across nodes.
RTM-GPU I/O bottleneck
I/O operations are complex processes: Forward writes and Backward reads (from the last step back to the first one).
Write phase:
1 Data are compressed with lossy compression (a multi-step process); the workload differs from process to process
2 Compressed data are written to disk
Read phase:
1 Compressed data are read from disk
2 Data are uncompressed for computing the imaging
Increasing the compression rate exploits the maximum I/O bandwidth and reduces the drop in RTM kernel performance.
NB: the maximum lossy compression used is the one with the maximum acceptable numerical errors.
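A minimal sketch of the write/read idea, assuming a simple normalize-and-quantize lossy compressor; the slides do not describe the actual multi-step compressor, so everything here is illustrative:

```cuda
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Write phase: compress a snapshot by normalizing on the block maximum and
// quantizing to 8 bits (a stand-in for the real multi-step lossy compressor),
// then write the compressed data to disk.
void write_snapshot(FILE* f, const float* snap, size_t n)
{
    float amax = 0.0f;
    for (size_t i = 0; i < n; ++i) amax = fmaxf(amax, fabsf(snap[i]));

    std::vector<int8_t> q(n);
    float s = (amax > 0.0f) ? 127.0f / amax : 0.0f;
    for (size_t i = 0; i < n; ++i)
        q[i] = (int8_t)lrintf(snap[i] * s);          // the lossy step

    fwrite(&amax, sizeof(float), 1, f);              // normalization factor
    fwrite(q.data(), sizeof(int8_t), n, f);          // 4x smaller than FP32
}

// Read phase (backward propagation): read the compressed data from disk and
// uncompress it for computing the imaging condition.
void read_snapshot(FILE* f, float* snap, size_t n)
{
    float amax = 0.0f;
    std::vector<int8_t> q(n);
    if (fread(&amax, sizeof(float), 1, f) != 1) return;
    if (fread(q.data(), sizeof(int8_t), n, f) != n) return;
    float s = amax / 127.0f;
    for (size_t i = 0; i < n; ++i) snap[i] = q[i] * s;
}
```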
RTM-GPU strong/weak points
Strong points:
Fully exploits the device throughput
No overhead is introduced
Communications are hidden behind device computation
Overlap of CPU and GPU computation
Weak points:
Device throughput far exceeds the I/O bandwidth
Complex scheme: sophisticated CPU/GPU synchronization
More complex communications (MPI/H2D/P2P)
Tricky tuning of CPU, GPU, communications and I/O
Limited GPU memory
Ways out of the GPU memory limitation:
1 Increase the domain decomposition
2 Reduce memory utilization using a compact 16-bit representation
3 Increase the order of the stencil
RTM-GPU scalability
RTM performance scaling on more nodes, TTI simulation.
Strong scaling: increase the number of nodes with constant domain size (node = 20 cores + 2 K20x).
Weak scaling: increase both the number of nodes and the problem size, with constant domain size per process (node = 16 cores + 2 K10).
2 types of simulation:
1 Imaging disabled (no I/O operations)
2 Imaging enabled
RTM-GPU 16bits representations
16 bits are used only for storing, not for computing, and only for the velocity fields, not for the wavefields: there are more velocity fields than wavefields (larger memory gain), and keeping the wavefields in FP32 reduces the numerical errors.
16-bit candidates:
1 Floating Point (FP) @16 bits:
FP16 IEEE-754 (aka half-float)
- 1-bit sign, 5-bit exponent, 11-bit mantissa (counting the implicit bit)
- Range: [6.1·10⁻⁵ : 65504]
FP16 custom (bits and bias), normalized
- 1-bit sign, 4-bit exponent, 12-bit mantissa (counting the implicit bit)
- One more bit to the mantissa and one less to the exponent; the bias is changed to fit the normalized range
- Increases accuracy to the detriment of the dynamic range
- Range: [−1 : 1]
2 Fixed Point (FX) @16 bits:
FX16 Q−2.17 (stores v²·Δt²/Δx²)
- Range: [−0.25 : 2.49·10⁻¹], resolution 7.2·10⁻⁶
FX16 Q0.15 (all 16 bits to the fractional part)
- Range: [−1 : 1], resolution 3.05·10⁻⁵
- Data are normalized over a multidimensional normalization domain (1D/2D/3D)
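A minimal sketch of the two fixed-point encodings as pack/unpack helpers, assuming standard Q-format arithmetic (function names are illustrative):

```cuda
#include <cmath>
#include <cstdint>

// FX16 Q0.15: sign bit + 15 fractional bits, range [-1, 1), resolution 2^-15.
__host__ __device__ inline int16_t q015_pack(float x)     // x in [-1, 1)
{
    return (int16_t)lrintf(x * 32768.0f);                 // scale by 2^15
}
__host__ __device__ inline float q015_unpack(int16_t q)
{
    return (float)q * (1.0f / 32768.0f);
}

// FX16 Q-2.17: values stored as x * 2^17 in a signed 16-bit word, giving a
// range of [-0.25, 0.25) with resolution 2^-17. Suited to v^2*dt^2/dx^2,
// which stays small because of the CFL stability condition.
__host__ __device__ inline int16_t qm2_17_pack(float x)   // x in [-0.25, 0.25)
{
    return (int16_t)lrintf(x * 131072.0f);                // scale by 2^17
}
__host__ __device__ inline float qm2_17_unpack(int16_t q)
{
    return (float)q * (1.0f / 131072.0f);
}
```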
RTM-GPU 16bits representations
Least numerical error with the normalized FX16 Q0.15.
FX16 memory gain: roughly 50% for the velocity fields (depends on the normalization domain). Best domain: 2D, 4x4 tiles along the xy-plane.
FX16 usage:
1 Pre-compute data: normalize the FP32 data over the domain, convert the normalized FP32 to FX16, store the FX16 data and the normalization factors @32 bits
2 Read data: read the FX16 data and the normalization factors, convert FX16 to FP32, de-normalize the FP32 data
The FX16 performance loss is about 4%.
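A minimal sketch of the pre-compute step over the 4x4 xy-tile normalization domain mentioned above (TILE and function names are illustrative):

```cuda
#include <cmath>
#include <cstdint>

// Pre-compute for one xy-plane: normalize each 4x4 tile of a velocity field,
// convert to FX16 Q0.15 and store one FP32 normalization factor per tile.
constexpr int TILE = 4;

void precompute_fx16(const float* v, int nx, int ny, int16_t* q, float* factors)
{
    for (int ty = 0; ty < ny / TILE; ++ty)
        for (int tx = 0; tx < nx / TILE; ++tx) {
            // 1) normalization factor = max |v| over the tile
            float amax = 0.0f;
            for (int j = 0; j < TILE; ++j)
                for (int i = 0; i < TILE; ++i)
                    amax = fmaxf(amax, fabsf(v[(ty*TILE + j) * nx + tx*TILE + i]));
            factors[ty * (nx / TILE) + tx] = amax;

            // 2) convert the normalized FP32 data to FX16
            float s = (amax > 0.0f) ? 32767.0f / amax : 0.0f;
            for (int j = 0; j < TILE; ++j)
                for (int i = 0; i < TILE; ++i) {
                    int idx = (ty*TILE + j) * nx + tx*TILE + i;
                    q[idx] = (int16_t)lrintf(v[idx] * s);
                }
        }
}
// Read path: v[idx] = q[idx] / 32767.0f * factors[tile]  (de-normalization).
```

With one FP32 factor per 16 values, storage is 36 bytes for every 64 original bytes, consistent with the "roughly 50%" gain quoted above.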
RTM-GPU higher-order stencils
Using higher-order stencils gives higher accuracy of the solution on a sparser grid:
Lower number of grid points: memory gain of about 85% using a 16th-order stencil
Increased computational intensity and a larger domain to exchange (the exchanges remain hidden behind computation)
The performance loss is roughly 15-30% for the 16th order with respect to the 8th order. RTM-GPU supports stencil computation from 8th up to 16th order.
NB: the same simulation parameters were used for the tests with 8th- and 16th-order stencils.
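A common way to support several stencil orders on GPU is to template the kernel on the radius, so each order gets a fully unrolled inner loop; a minimal sketch of that pattern (not necessarily the production implementation):

```cuda
// Template the kernel on the radius (order/2): the coefficient loop unrolls
// at compile time, one specialization per supported order.
template <int R>   // R = 4 for 8th order ... R = 8 for 16th order
__global__ void laplacian(const float* __restrict__ p, float* __restrict__ out,
                          const float* __restrict__ c,  // R+1 coefficients
                          int nx, int ny, int nz)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int iz = blockIdx.z * blockDim.z + threadIdx.z;
    if (ix < R || ix >= nx - R || iy < R || iy >= ny - R ||
        iz < R || iz >= nz - R) return;

    size_t i = ((size_t)iz * ny + iy) * nx + ix;
    float acc = 3.0f * c[0] * p[i];
    #pragma unroll
    for (int r = 1; r <= R; ++r)
        acc += c[r] * (p[i + r]                   + p[i - r]
                     + p[i + (size_t)r * nx]      + p[i - (size_t)r * nx]
                     + p[i + (size_t)r * nx * ny] + p[i - (size_t)r * nx * ny]);
    out[i] = acc;
}

// Run-time dispatch on the requested order:
// switch (order) { case 8:  laplacian<4><<<g, b>>>(p, out, c, nx, ny, nz); break;
//                  case 16: laplacian<8><<<g, b>>>(p, out, c, nx, ny, nz); break; }
```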
Conclusion
FD propagation of the wavefields is no longer the most expensive task in the GPU version of RTM: the time-consuming part is elaborating the imaging. In RTM-GPU (with high-frequency imaging) the bottleneck is the I/O operations (it is not in RTM-CPU).
In order to reduce the memory allocation on the device:
1 Increase the domain decomposition: RTM-GPU scales with the number of nodes in both strong and weak scaling; minimize the surface/volume ratio
2 Reduce the memory allocation using 16-bit representations: least numerical error with the Fixed-Point representation, with an acceptable loss of performance
3 Use higher-order stencils: large reduction of memory, same accuracy of the solution in less time
Acknowledgements
The authors would like to thank the ENI management and the CINECA management.
Thank you for your attention!
Sergio Orlandini s.orlandini@cineca.it