Accelerating Reverse Time Migration application for seismic imaging with GPU architecture
Sergio Orlandini, Cristiano Calonaci, Luca Ferraro @CINECA
Nicola Bienati @ENI
GPU Technology Conference (GTC), San Jose (CA), April 4-7, 2016 (talk given on April 5)
Reverse Time Migration developers
@CINECA: Simone Campagna, Cristiano Calonaci, Marco Comparato, Massimiliano Culpo, Luca Ferraro, Roberto Gori, Chiara Latini, Sergio Orlandini, Stefano Tagliaventi
@ENI: Nicola Bienati, Jacopo Panizzardi
@NVIDIA: Paulius Micikevicius, Peng Wang
Reverse Time Migration algorithm
RTM applies the discretized acoustic wave equation to propagate waves through a given velocity model. It consists of 3 main parts:
1 Forward propagation of the solution of the wave equation to model the source wavefield.
2 Backward propagation, in reverse time, of the data recorded in the field.
3 Calculation of the imaging condition.
Both Forward and Backward propagation consist of a Finite Difference (FD) solution of the wave equation, which is:
Compute intensive: cost grows with the fourth power of the grid dimension (three spatial dimensions plus the time steps)
Memory demanding: requires domain decomposition
Communication intensive: ghost cells are exchanged with first neighbours at every time step
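As a concrete illustration, a minimal sketch of one FD time step follows: a second-order-in-time, 8th-order-in-space (radius 4) update of the acoustic wave equation. All names (p_prev, p_curr, vel2dt2, the coefficient array c) are illustrative, not the actual RTM code.

```cuda
// Minimal sketch of one FD time step (not the production kernel):
//   p_next = 2*p_curr - p_prev + (v^2*dt^2/dx^2) * Laplacian(p_curr)
// with an 8th-order (radius-4) central stencil in space.
#define R 4                    // stencil radius

__constant__ float c[R + 1];   // central-difference coefficients c[0]..c[R]

__global__ void fd_step(const float* __restrict__ p_prev,
                        const float* __restrict__ p_curr,
                        float*       __restrict__ p_next,
                        const float* __restrict__ vel2dt2,  // v^2*dt^2/dx^2, precomputed
                        int nx, int ny, int nz)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix < R || ix >= nx - R || iy < R || iy >= ny - R) return;

    for (int iz = R; iz < nz - R; ++iz) {          // march along z
        size_t i = ((size_t)iz * ny + iy) * nx + ix;
        float lap = 3.0f * c[0] * p_curr[i];       // center point, once per dimension
        for (int r = 1; r <= R; ++r)
            lap += c[r] * (p_curr[i + r]                   + p_curr[i - r]                    // x
                         + p_curr[i + (size_t)r * nx]      + p_curr[i - (size_t)r * nx]       // y
                         + p_curr[i + (size_t)r * nx * ny] + p_curr[i - (size_t)r * nx * ny]); // z
        p_next[i] = 2.0f * p_curr[i] - p_prev[i] + vel2dt2[i] * lap;  // leapfrog in time
    }
}
```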
RTM workflow
3 main tasks:
1 FD propagation: the second-order wave equation is solved with a finite difference approximation
2 Exchange borders: at each time step the domain borders have to be exchanged between first neighbours via MPI communications
3 Imaging: at the imaging frequency, write the source field during Forward, read the source field during Backward, and compute the imaging condition during Backward
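The imaging condition in step 3 is typically a zero-lag cross-correlation of the source and receiver wavefields. A minimal host-side sketch, assuming that standard formulation (the slides do not spell out the exact condition used):

```cuda
// Zero-lag cross-correlation imaging condition (computed on the host in
// this workflow): image(x) += src(x,t) * rcv(x,t), accumulated at every
// imaging step of the backward propagation.
void imaging_condition(float* image,
                       const float* src_snap,   // source field read back from I/O
                       const float* rcv_field,  // current backward wavefield
                       size_t n)
{
    for (size_t i = 0; i < n; ++i)
        image[i] += src_snap[i] * rcv_field[i];
}
```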
Finite Difference propagation kernel performances
Finite Difference propagation is the most compute-intensive task and the most relevant part of the RTM application. It was "easily" ported to GPU: the speed-up vs CPU is roughly 4x (1 node/16 cores vs 1 node with 2 GPUs). NB: this concerns only the stand-alone FD kernel performance, not the entire RTM application.
Isotropic kernel: single GPU, radius = 4, grid 512x512x512, 200 time steps.
TTI kernel: single GPU, radius = 4, grid 512x512x512, 200 time steps.
RTM-GPU workflow
Queue of imaging steps with a dedicated buffer for wavefield snapshots, handled by 2 concurrent host threads:
Kernel thread [KT], device management:
- Computes the wavefields, managing the FD kernel launches
- Exchanges the borders
- Transfers the wavefield snapshots D2H
Imaging thread [IT], host management:
- Computes the imaging: waits for an available snapshot, then computes the imaging condition
- Writes/reads snapshots to/from I/O
- Forks into a pool of threads for computing the imaging
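A minimal sketch of this producer/consumer scheme, assuming a plain mutex/condition-variable queue (SnapshotQueue and all names are illustrative, not the actual code):

```cuda
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

// The kernel thread (KT) produces wavefield snapshots copied D2H; the
// imaging thread (IT) blocks until one is available, then consumes it.
struct SnapshotQueue {
    std::queue<std::vector<float>> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void push(std::vector<float> snap) {               // called by KT
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(snap)); }
        cv.notify_one();
    }
    void finish() {                                    // KT: no more snapshots
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
    }
    bool pop(std::vector<float>& snap) {               // called by IT
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty() || done; });
        if (q.empty()) return false;                   // finished
        snap = std::move(q.front()); q.pop();
        return true;
    }
};

// IT main loop: wait for a snapshot, then compute imaging (in practice
// forked over a pool of host threads; rcv stands for the receiver field
// of the matching time step, as in the cross-correlation sketched earlier).
void imaging_thread(SnapshotQueue& sq, float* image, const float* rcv, size_t n) {
    std::vector<float> snap;
    while (sq.pop(snap))
        for (size_t i = 0; i < n; ++i)
            image[i] += snap[i] * rcv[i];
}
```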
RTM-GPU workflow
Optimization of communications: FD propagation in 2 steps
1 Propagate the halo regions that have to be exchanged
2 Propagate the internal domain; while the internal domain is being propagated, concurrently exchange the halo regions
This overlaps communications with computation: communications are hidden behind the device computation. The exchange-borders task becomes more complex, mixing Device/Host copies, MPI, GPU-Direct and Peer-To-Peer transfers.
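A minimal sketch of the two-stream scheme, assuming CUDA streams plus host-staged MPI; kernel names, buffers and the single-neighbour exchange are illustrative:

```cuda
#include <cuda_runtime.h>
#include <mpi.h>

// Illustrative stand-ins for the real propagators.
__global__ void fd_halo(float* p)  { /* propagate only the halo slabs */ }
__global__ void fd_inner(float* p) { /* propagate the interior points */ }

// One time step: the halo stream computes and downloads the borders first,
// then MPI exchanges them while the inner stream propagates the interior.
void time_step(float* d_field, float* d_halo, float* d_ghost,
               float* h_send, float* h_recv, size_t halo_count, int nbr,
               cudaStream_t s_halo, cudaStream_t s_inner)
{
    size_t halo_bytes = halo_count * sizeof(float);

    // 1) propagate the halo slabs, then start their download
    fd_halo<<<256, 256, 0, s_halo>>>(d_field);
    cudaMemcpyAsync(h_send, d_halo, halo_bytes, cudaMemcpyDeviceToHost, s_halo);

    // 2) propagate the interior concurrently on the other stream
    fd_inner<<<256, 256, 0, s_inner>>>(d_field);

    // 3) exchange halos while the interior kernel is still running;
    //    with GPU-Direct, MPI takes device pointers and the staging copies
    //    disappear; intra-node, Peer-To-Peer (cudaMemcpyPeerAsync) can
    //    replace MPI entirely.
    cudaStreamSynchronize(s_halo);
    MPI_Sendrecv(h_send, (int)halo_count, MPI_FLOAT, nbr, 0,
                 h_recv, (int)halo_count, MPI_FLOAT, nbr, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpyAsync(d_ghost, h_recv, halo_bytes, cudaMemcpyHostToDevice, s_halo);

    cudaDeviceSynchronize();   // both streams complete before the next step
}
```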
RTM-GPU performances: only FD kernels
Simulation details: imaging disabled (no I/O operations), only computational FD throughput measured, from 1 up to 5 nodes, 2 GPUs per node, TTI simulation, prestack mode.
* Normalized on the number of nodes and devices; compared against the stand-alone kernel performance.
Considerations: same trend as the pure FD computational speed-up; no overhead inside the RTM application; complete overlap of the halo communications behind computation.
RTM-GPU performances: with imaging calculation
Simulation details: high-frequency imaging (heavily stressing the I/O), FD + imaging computation, from 1 up to 5 nodes, 2 GPUs per node, TTI simulation, prestack mode.
* Normalized on the number of nodes and devices.
Considerations: the I/O is the bottleneck; the RTM speed-up depends on the I/O bandwidth, not on the GPU throughput; RTM kernel performance is unstable and drops due to the unbalance of the I/O operations across nodes.
RTM-GPU I/O bottleneck
I/O operations are complex processes: Forward writes and Backward reads (from the last step back to the first one).
Write phase:
1 Data are compressed with lossy compression (a multi-step process); the workload differs from process to process
2 Compressed data are written to disk
Read phase:
1 Compressed data are read from disk
2 Data are uncompressed for computing the imaging
Increasing the compression rate exploits the maximum I/O bandwidth and reduces the drop in RTM kernel performance.
NB: the maximum lossy compression used is the one with the maximum acceptable numerical errors.
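A minimal sketch of the write/read idea, assuming a simple normalize-and-quantize lossy compressor; the slides do not describe the actual multi-step compressor, so everything here is illustrative:

```cuda
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Write phase: compress a snapshot by normalizing on the block maximum and
// quantizing to 8 bits (a stand-in for the real multi-step lossy compressor),
// then write the compressed data to disk.
void write_snapshot(FILE* f, const float* snap, size_t n)
{
    float amax = 0.0f;
    for (size_t i = 0; i < n; ++i) amax = fmaxf(amax, fabsf(snap[i]));

    std::vector<int8_t> q(n);
    float s = (amax > 0.0f) ? 127.0f / amax : 0.0f;
    for (size_t i = 0; i < n; ++i)
        q[i] = (int8_t)lrintf(snap[i] * s);          // the lossy step

    fwrite(&amax, sizeof(float), 1, f);              // normalization factor
    fwrite(q.data(), sizeof(int8_t), n, f);          // 4x smaller than FP32
}

// Read phase (backward propagation): read the compressed data from disk and
// uncompress it for computing the imaging condition.
void read_snapshot(FILE* f, float* snap, size_t n)
{
    float amax = 0.0f;
    std::vector<int8_t> q(n);
    if (fread(&amax, sizeof(float), 1, f) != 1) return;
    if (fread(q.data(), sizeof(int8_t), n, f) != n) return;
    float s = amax / 127.0f;
    for (size_t i = 0; i < n; ++i) snap[i] = q[i] * s;
}
```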
RTM-GPU strong/weak points
Strong points:
Fully exploits the device throughput
No overhead is introduced
Communications are hidden behind device computation
Overlap of CPU and GPU computation
Weak points:
Device throughput far exceeds the I/O bandwidth
Complex scheme: sophisticated CPU/GPU synchronization
More complex communications (MPI/H2D/P2P)
Tricky tuning of CPU, GPU, communications and I/O
Limited GPU memory
Ways out of the GPU memory limitation:
1 Increase the domain decomposition
2 Reduce memory utilization using a compact 16-bit representation
3 Increase the order of the stencil
RTM-GPU scalability
RTM performance scaling on more nodes, TTI simulation.
Strong scaling: increase the number of nodes with constant domain size (node = 20 cores + 2 K20x).
Weak scaling: increase both the number of nodes and the problem size, with constant domain size per process (node = 16 cores + 2 K10).
2 types of simulation:
1 Imaging disabled (no I/O operations)
2 Imaging enabled
RTM-GPU 16bits representations
16 bits are used only for storing, not for computing, and only for the velocity fields, not for the wavefields: there are more velocity fields than wavefields (larger memory gain), and keeping the wavefields in FP32 reduces the numerical errors.
16-bit candidates:
1 Floating Point (FP) @16 bits:
FP16 IEEE-754 (aka half-float)
- 1-bit sign, 5-bit exponent, 11-bit mantissa (counting the implicit bit)
- Range: [6.1·10⁻⁵ : 65504]
FP16 custom (bits and bias), normalized
- 1-bit sign, 4-bit exponent, 12-bit mantissa (counting the implicit bit)
- One more bit to the mantissa and one less to the exponent; the bias is changed to fit the normalized range
- Increases accuracy to the detriment of the dynamic range
- Range: [−1 : 1]
2 Fixed Point (FX) @16 bits:
FX16 Q−2.17 (stores v²·Δt²/Δx²)
- Range: [−0.25 : 2.49·10⁻¹], resolution 7.2·10⁻⁶
FX16 Q0.15 (all 16 bits to the fractional part)
- Range: [−1 : 1], resolution 3.05·10⁻⁵
- Data are normalized over a multidimensional normalization domain (1D/2D/3D)
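A minimal sketch of the two fixed-point encodings as pack/unpack helpers, assuming standard Q-format arithmetic (function names are illustrative):

```cuda
#include <cmath>
#include <cstdint>

// FX16 Q0.15: sign bit + 15 fractional bits, range [-1, 1), resolution 2^-15.
__host__ __device__ inline int16_t q015_pack(float x)     // x in [-1, 1)
{
    return (int16_t)lrintf(x * 32768.0f);                 // scale by 2^15
}
__host__ __device__ inline float q015_unpack(int16_t q)
{
    return (float)q * (1.0f / 32768.0f);
}

// FX16 Q-2.17: values stored as x * 2^17 in a signed 16-bit word, giving a
// range of [-0.25, 0.25) with resolution 2^-17. Suited to v^2*dt^2/dx^2,
// which stays small because of the CFL stability condition.
__host__ __device__ inline int16_t qm2_17_pack(float x)   // x in [-0.25, 0.25)
{
    return (int16_t)lrintf(x * 131072.0f);                // scale by 2^17
}
__host__ __device__ inline float qm2_17_unpack(int16_t q)
{
    return (float)q * (1.0f / 131072.0f);
}
```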
RTM-GPU 16bits representations
Least numerical error with the normalized FX16 Q0.15.
FX16 memory gain: roughly 50% for the velocity fields (depends on the normalization domain). Best domain: 2D, 4x4 tiles along the xy-plane.
FX16 usage:
1 Pre-compute data: normalize the FP32 data over the domain, convert the normalized FP32 to FX16, store the FX16 data and the normalization factors @32 bits
2 Read data: read the FX16 data and the normalization factors, convert FX16 to FP32, de-normalize the FP32 data
The FX16 performance loss is about 4%.
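A minimal sketch of the pre-compute step over the 4x4 xy-tile normalization domain mentioned above (TILE and function names are illustrative):

```cuda
#include <cmath>
#include <cstdint>

// Pre-compute for one xy-plane: normalize each 4x4 tile of a velocity field,
// convert to FX16 Q0.15 and store one FP32 normalization factor per tile.
constexpr int TILE = 4;

void precompute_fx16(const float* v, int nx, int ny, int16_t* q, float* factors)
{
    for (int ty = 0; ty < ny / TILE; ++ty)
        for (int tx = 0; tx < nx / TILE; ++tx) {
            // 1) normalization factor = max |v| over the tile
            float amax = 0.0f;
            for (int j = 0; j < TILE; ++j)
                for (int i = 0; i < TILE; ++i)
                    amax = fmaxf(amax, fabsf(v[(ty*TILE + j) * nx + tx*TILE + i]));
            factors[ty * (nx / TILE) + tx] = amax;

            // 2) convert the normalized FP32 data to FX16
            float s = (amax > 0.0f) ? 32767.0f / amax : 0.0f;
            for (int j = 0; j < TILE; ++j)
                for (int i = 0; i < TILE; ++i) {
                    int idx = (ty*TILE + j) * nx + tx*TILE + i;
                    q[idx] = (int16_t)lrintf(v[idx] * s);
                }
        }
}
// Read path: v[idx] = q[idx] / 32767.0f * factors[tile]  (de-normalization).
```

With one FP32 factor per 16 values, storage is 36 bytes for every 64 original bytes, consistent with the "roughly 50%" gain quoted above.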
RTM-GPU higher-order stencils
Using higher-order stencils gives higher accuracy of the solution on a sparser grid:
Lower number of grid points: memory gain of about 85% using a 16th-order stencil
Increased computational intensity and a larger domain to exchange (the exchanges remain hidden behind computation)
The performance loss is roughly 15-30% for the 16th order with respect to the 8th order. RTM-GPU supports stencil computation from 8th up to 16th order.
NB: the same simulation parameters were used for the tests with 8th- and 16th-order stencils.
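A common way to support several stencil orders on GPU is to template the kernel on the radius, so each order gets a fully unrolled inner loop; a minimal sketch of that pattern (not necessarily the production implementation):

```cuda
// Template the kernel on the radius (order/2): the coefficient loop unrolls
// at compile time, one specialization per supported order.
template <int R>   // R = 4 for 8th order ... R = 8 for 16th order
__global__ void laplacian(const float* __restrict__ p, float* __restrict__ out,
                          const float* __restrict__ c,  // R+1 coefficients
                          int nx, int ny, int nz)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int iz = blockIdx.z * blockDim.z + threadIdx.z;
    if (ix < R || ix >= nx - R || iy < R || iy >= ny - R ||
        iz < R || iz >= nz - R) return;

    size_t i = ((size_t)iz * ny + iy) * nx + ix;
    float acc = 3.0f * c[0] * p[i];
    #pragma unroll
    for (int r = 1; r <= R; ++r)
        acc += c[r] * (p[i + r]                   + p[i - r]
                     + p[i + (size_t)r * nx]      + p[i - (size_t)r * nx]
                     + p[i + (size_t)r * nx * ny] + p[i - (size_t)r * nx * ny]);
    out[i] = acc;
}

// Run-time dispatch on the requested order:
// switch (order) { case 8:  laplacian<4><<<g, b>>>(p, out, c, nx, ny, nz); break;
//                  case 16: laplacian<8><<<g, b>>>(p, out, c, nx, ny, nz); break; }
```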
Conclusion
FD propagation of the wavefields is no longer the most expensive task in the GPU version of RTM: the time-consuming part is elaborating the imaging. In RTM-GPU (with high-frequency imaging) the bottleneck is the I/O operations (it is not in RTM-CPU).
In order to reduce the memory allocation on the device:
1 Increase the domain decomposition: RTM-GPU scales with the number of nodes in both strong and weak scaling; minimize the surface/volume ratio
2 Reduce the memory allocation using 16-bit representations: least numerical error with the Fixed-Point representation, with an acceptable loss of performance
3 Use higher-order stencils: large reduction of memory, same accuracy of the solution in less time
Acknowledgements
The authors would like to thank the ENI management and the CINECA management.
Thank you for your attention!
Sergio Orlandini s.orlandini@cineca.it