SLIDE 1

USING OPENACC TO PARALLELIZE SEISMIC ONE-WAY BASED MIGRATION

Kshitij Mehta (Total E&P R&T), Maxime Hugues (Total E&P R&T), Oscar Hernandez (Oak Ridge National Lab), Henri Calandra (Total E&P R&T). GTC 2016

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. The TOTAL and Oak Ridge National Laboratory collaboration is carried out under CRADA agreement NFE-14-05227.

SLIDE 2

ONE-WAY IMAGING

Using OpenACC to Parallelize Seismic One-Way Based Migration, GTC 2016


  • Classic depth imaging application
  • Uses Fourier Finite Differencing (FFD)
  • Solves the one-way wave equation
  • The approximation contains three terms: lens correction, wide-angle correction, and phase shift
  • Iterative method: the wavefield is computed at every depth z
  • The wavefield approximation takes 75-80% of the total time for a single shot
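The three-term splitting above follows the standard FFD formulation. The deck does not show the equations, so the notation below is a hedged reconstruction of the usual operator splitting (background velocity $v_0$, medium velocity $v$), not the authors' exact expressions:

```latex
% One-way downward continuation of the wavefield P over a depth step \Delta z:
%   P(z + \Delta z) = e^{i k_z \Delta z}\, P(z)
% FFD splits the square-root (single-square-root) operator into three factors:
P(z+\Delta z) \approx
  \underbrace{e^{\,i \frac{\omega}{v_0}\Delta z \sqrt{1 - v_0^2 (k_x^2+k_y^2)/\omega^2}}}_{\text{phase shift (background } v_0\text{)}}
  \cdot
  \underbrace{e^{\,i \omega \left(\frac{1}{v} - \frac{1}{v_0}\right)\Delta z}}_{\text{thin-lens / lens correction}}
  \cdot
  \underbrace{\mathcal{F}_{\mathrm{FD}}}_{\text{wide-angle FD correction}}
  \; P(z)
```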

SLIDE 3

PARALLELIZING ONE-WAY MIGRATION USING OPENACC


  • 1. Optimizing data transfer between CPU and GPU

[Figure: data movement between CPU and GPU]

  • Copy the wavefield and other data to the GPU before migration begins
  • Only copy the image back to the host for writing to file
  • Copy a slice of the velocity model to the GPU in every iteration, if required
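The strategy above maps naturally onto an OpenACC data region. A minimal sketch in C, assuming illustrative names (`migrate_shot`, `image`, `nz`) that are not from the authors' code:

```c
#include <stddef.h>

/* Sketch: the wavefield is copied to the GPU once before the depth loop and
 * stays resident; only the image travels back to the host at the end. */
void migrate_shot(const float *wavefield, float *image, size_t n, int nz) {
    #pragma acc data copyin(wavefield[0:n]) copy(image[0:n])
    {
        for (int iz = 0; iz < nz; iz++) {   /* depth loop: no transfers inside */
            #pragma acc parallel loop
            for (size_t i = 0; i < n; i++)
                image[i] += wavefield[i];   /* stand-in for one depth-step update */
        }
    }   /* image is copied back to the host only here */
}
```

Without the enclosing `data` region, each `parallel loop` would implicitly copy its arrays in and out on every depth iteration, which is exactly the traffic this slide's strategy avoids.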

SLIDE 4

PARALLELIZING ONE-WAY MIGRATION USING OPENACC


  • 2. Computation on the GPU
  • Smaller components, such as adding the signal to the wavefield and applying damping, are simple and straightforward
  • Parallelizing Phase-Shift and Wide-Angle requires more work

SLIDE 5

WIDE-ANGLE ALGORITHM


for each row
  for each frequency
    wavefield(:) --> wave
    create_sparse_matrix()
    tridiagonal_solver()
    rhs --> wavefield
  enddo
enddo
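The body of `tridiagonal_solver()` is not shown in the deck; a likely candidate is the standard Thomas algorithm, sketched below in C. Its forward and backward recurrences depend on the previous element, which is why this step is marked sequential on the next slide:

```c
#include <stddef.h>

/* Thomas algorithm for a tridiagonal system (a sketch, not the authors' code).
 * a = sub-diagonal, b = diagonal, c = super-diagonal, d = right-hand side
 * (overwritten with the solution). b is also modified in place. */
void thomas_solve(size_t n, const double *a, double *b, const double *c, double *d) {
    for (size_t i = 1; i < n; i++) {        /* forward elimination: sequential in i */
        double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    d[n - 1] /= b[n - 1];                   /* back substitution: also sequential */
    for (size_t i = n - 1; i-- > 0; )
        d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
}
```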

SLIDE 6

WIDE-ANGLE ALGORITHM


for each row                  ! parallelizable
  for each frequency          ! parallelizable
    wavefield(:) --> wave     ! parallelizable
    create_sparse_matrix()    ! parallelizable
    tridiagonal_solver()      ! sequential
    rhs --> wavefield         ! parallelizable
  enddo
enddo

SLIDE 7

WIDE-ANGLE OPENACC I

!$acc loop collapse(2)
for each row
  for each frequency
    !$acc loop vector
    wavefield(:) --> wave
    !$acc loop vector
    create_sparse_matrix()
    tridiagonal_solver()
    !$acc loop vector
    rhs --> wavefield
  enddo
enddo

  • Parallelize the outer loops as gang
  • Parallelize the inner loops as vector
  • Performance is very poor
  • Reason: the solver is executed by a single thread


SLIDE 8

WIDE-ANGLE OPENACC II

!$acc loop collapse(2) gang vector
for each row
  for each frequency
    wavefield(:) --> wave
    create_sparse_matrix()
    tridiagonal_solver()
    rhs --> wavefield
  enddo
enddo

  • Parallelize the outermost loops as gang vector
  • Inner loops are sequential
  • Much better than the previous version
  • Solver code is run by multiple threads
  • Still not as good as the 8-core CPU
  • Primary reason: non-coalesced memory access
  • VERY expensive on GPUs
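Why non-coalesced access hurts: in the collapsed gang-vector loop, consecutive threads handle consecutive (row, frequency) pairs, but with w as an outer array dimension those threads touch elements that are far apart in memory. A hedged host-side C model of the two access patterns (row-major here; `nx`, `nw` are illustrative bounds):

```c
#include <stddef.h>

/* Strided: "thread" ix reads wave[ix*nw + iw]; neighbouring threads are
 * nw floats apart, so each warp needs many memory transactions. */
float strided_read(const float *wave, size_t nx, size_t nw, size_t iw) {
    float s = 0.0f;
    for (size_t ix = 0; ix < nx; ix++)   /* models consecutive threads */
        s += wave[ix * nw + iw];
    return s;
}

/* Coalesced: after reordering to wave2[iw*nx + ix], neighbouring threads
 * read neighbouring floats; one transaction can serve a whole warp. */
float coalesced_read(const float *wave2, size_t nx, size_t iw) {
    float s = 0.0f;
    for (size_t ix = 0; ix < nx; ix++)
        s += wave2[iw * nx + ix];
    return s;
}
```

Both functions compute the same sum; only the memory-access pattern differs, which is exactly the distinction the next slides exploit.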


SLIDE 9

WIDE-ANGLE OPENACC III

wavefield(nx,ny,nw,n_wave) --> wavefield_2(nw,nx,ny,n_wave)

!$acc loop
for each row
  !$acc loop
  wavefield_2(1:nw,…) --> wave(1:nw,…)
  !$acc routine (inside)
  create_sparse_matrix(1:nw,…)
  !$acc routine (inside)
  tridiagonal_solver(1:nw,…)
  !$acc loop
  rhs(1:nw,…) --> wavefield_2(1:nw,…)
enddo

wavefield_2 --> wavefield

  • Create a temporary wavefield in wide-angle where the innermost dimension is w (frequencies), and vectorize along w
  • This way we have coalesced memory access
  • Work on this wavefield and copy it back to the original wavefield at the end of the subroutine
  • All local arrays must have w as the inner dimension
  • Requires an acceptable amount of code change
  • This is how directive-based programming models are expected to work
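The layout change above amounts to a transpose that makes w the fastest-varying index. A sketch in C, modeling Fortran column-major indexing explicitly (names `wf`, `wf2` are illustrative; the trailing n_wave dimension is omitted for brevity):

```c
#include <stddef.h>

/* Column-major linear index for element (i,j,k) of an n1 x n2 x n3 array. */
#define IDX3(i, j, k, n1, n2) ((i) + (n1) * ((j) + (n2) * (size_t)(k)))

/* Copy wavefield(nx,ny,nw) into wavefield_2(nw,nx,ny) so that the frequency
 * index w becomes the innermost (fastest-varying) dimension, which gives
 * coalesced access when vectorizing along w. */
void transpose_w_innermost(const float *wf, float *wf2,
                           size_t nx, size_t ny, size_t nw) {
    for (size_t iy = 0; iy < ny; iy++)
        for (size_t ix = 0; ix < nx; ix++)
            for (size_t iw = 0; iw < nw; iw++)
                /* wf(ix,iy,iw) -> wf2(iw,ix,iy) */
                wf2[IDX3(iw, ix, iy, nw, nx)] = wf[IDX3(ix, iy, iw, nx, ny)];
}
```

The inverse copy at the end of the subroutine just swaps source and destination indices.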


SLIDE 10

WIDE-ANGLE OPENACC III CONTD.

  • Pros:
  • Can control the number of gangs if you run out of memory
  • Cons:
  • Inner dimension must be large enough

!$acc parallel num_gangs(100)
do ix=1,nx
  …
enddo


SLIDE 11

WIDE-ANGLE OPENACC IV (BATCHED)

!$acc parallel
for each row
  for each frequency
    wavefield --> wave

!$acc parallel
for each row
  for each frequency
    create_sparse_matrix()

!$acc parallel
for each row
  for each frequency
    tridiagonal_solver()

!$acc parallel
for each row
  for each frequency
    rhs --> wavefield

  • Batching or grouping of operations
  • Break up one loop into many loops performing similar operations
  • Instead of privatizing local arrays, pad the arrays with additional dimensions
  • Create one large system of sparse matrices, solve one large batch of tridiagonal systems, etc.
  • Requires significant code changes
  • On the original wavefield, performance is only as good as the 8-core CPU
  • Again due to non-coalesced memory access
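The "pad arrays instead of privatizing" idea can be sketched as follows: rather than each iteration carrying its own private scratch array inside one fused loop, one padded array holds every row's data at once, so each stage becomes its own parallel region over the whole batch. All names (`build_all`, `apply_all`, `coef`) are illustrative, not from the authors' code:

```c
#include <stddef.h>

/* Stage 1: build per-(row,frequency) coefficients into padded storage
 * (previously a private local array per row). */
void build_all(float *coef, int nrows, int nw) {
    #pragma acc parallel loop collapse(2)
    for (int r = 0; r < nrows; r++)
        for (int w = 0; w < nw; w++)
            coef[(size_t)r * nw + w] = (float)(r + w);   /* placeholder work */
}

/* Stage 2: a separate parallel region consumes the whole batch. */
void apply_all(float *out, const float *coef, int nrows, int nw) {
    #pragma acc parallel loop collapse(2)
    for (int r = 0; r < nrows; r++)
        for (int w = 0; w < nw; w++)
            out[(size_t)r * nw + w] = 2.0f * coef[(size_t)r * nw + w];
}
```

The memory cost is the extra row dimension on every scratch array; the payoff is that each stage is a fully parallel, dependency-free sweep.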


SLIDE 12

WIDE-ANGLE OPENACC V

wavefield(nx,ny,nw,n_wave) --> wavefield_2(nw,nx,ny,n_wave)

!$acc parallel
for each row
  for each frequency
    wavefield_2 --> wave

!$acc parallel
for each row
  for each frequency
    create_sparse_matrix()

!$acc parallel
for each row
  for each frequency
    tridiagonal_solver()

!$acc parallel
for each row
  for each frequency
    rhs --> wavefield_2

wavefield_2 --> wavefield

  • Create a temporary wavefield which has w as the inner dimension, then use batched operations
  • Optimize the usage of local arrays
  • Use the solution of the X direction as input to the Y direction
  • Don't copy the output of the X direction back to the wavefield
  • Now memory access is coalesced
  • Leads to a 2.47x performance improvement over the 8-core CPU case on Titan K20 with PGI 15.7 and CUDA 7.0
  • 2.62x with PGI 15.9


SLIDE 13

WIDE-ANGLE: FURTHER OPTIMIZATIONS


  • Use CUDA code for the transpose and some data copy operations in wide-angle
  • Leads to a ~3x performance improvement over the 8-core CPU case
  • We lose portability here due to the use of CUDA

SLIDE 14

PARALLELIZING PHASE-SHIFT


  • Phase-Shift:
  1. 2D FFT forward
  2. Phase-Shift computation
  3. 2D FFT backward
  4. Thin-lens correction
  • In OpenACC, operations such as FFT require using optimized vendor libraries
  • NVIDIA's cuFFT
  • Modified the code to group operations so that we perform as much work as possible in each FFT call (batched FFT operations)

SLIDE 15

BENCHMARK PERFORMANCE


  • 1 shot of the SEGSALT dataset
  • K20X GPU on Titan vs. an 8-core Intel Sandy Bridge CPU
  • 3x speedup

[Chart: Phase-Shift, Wide-Angle, and Total times in seconds (0-400), CPU vs. GPU]

SLIDE 16

ONE-WAY MIGRATION AT SCALE


  • Titan supercomputer at Oak Ridge National Lab
  • 2nd in the Top500 list of supercomputers as of March 2016
  • 18,688 nodes, each with an NVIDIA K20X GPU
  • Lustre file system
  • PGI 15.10.0 compiler
  • CUDA 7.0
  • K20X GPU:
  • 6 GB memory
  • 14 SMX units, with 192 single-precision FP cores each (2,688 total)


SLIDE 17

LARGE RUN CONFIGURATION


  • Dataset:
  • SEAM Isotropic Phase I
  • 2793 shots
  • Used 99% of nodes on Titan (18,508/18,688 nodes)
  • 28 GPUs per shot (considering memory requirements and load balancing within a group)
  • 661 process groups (= shots) running simultaneously
  • Flat MPI mode: shot distribution to process groups is static
SLIDE 18

RESULTS


  • Large run took 54 minutes to complete

[Images: partial stacked images and the velocity model]

SLIDE 19

SHOT SIZE AND SHOT TIMES


Run on Titan using 18,508 processors

SLIDE 20

POWER MEASUREMENT ON TITAN


Run on Titan using 18,508 processors

SLIDE 21

SUMMARY


  • We have ported One-Way Migration to GPUs using OpenACC
  • Porting to the GPU requires some code modifications, but a directive-based model is highly preferred
  • 3x speedup on the benchmark dataset
  • Ran One-Way Migration at large scale on the Titan supercomputer
  • Processed 2793 shots in less than an hour
  • Running at scale yields interesting points of discussion
  • Future work:
  • How to scale I/O
  • Run more complicated applications such as RTM (Reverse Time Migration)