SLIDE 1

High Performance Asynchronous Execution of the Reverse Time Migration for the Oil & Gas Industry

Issam Said¹ and Hatem Ltaief²

¹ NVIDIA Oil and Gas, Paris, France
² Extreme Computing Research Center, KAUST, Saudi Arabia

NVIDIA GTC Conference at San Jose, CA March 26-29, 2018

SLIDE 2

Outline

1. Background on Seismic Imaging
2. Ubiquitous Matricization and Taskification for Seismic Imaging
3. Matrices Over Runtime Systems
4. Application to Frequency Domain
5. Performance Results
6. Summary and Future Work

SLIDE 4

Seismic Imaging

Energy supply and demand

  • 40% more energy is needed by 2035
  • No choice but oil, gas, and coal
  • Sophisticated seismic methods are required

SLIDE 5

Seismic Imaging

Seismic methods for Oil & Gas exploration

Acquisition → Processing → Interpretation

Shot = source activation + data collection (receivers)

Seismic survey:
  • Air-gun array
  • Hydrophones
  • Shot record

SLIDE 6

Seismic Imaging

Seismic methods for Oil & Gas exploration

Acquisition → Processing → Interpretation

Processing = noise attenuation, demultiple, interpolation, imaging ⇒ subsurface image

SLIDE 8

Seismic Imaging

Seismic methods for Oil & Gas exploration

Acquisition → Processing → Interpretation

Calculate seismic attributes

  • Dip
  • Azimuth
  • Coherence (courtesy of Total)
SLIDE 10

Seismic Imaging

Reverse Time Migration (RTM)

  • The reference computer-based imaging algorithm in the industry
  • Repositions seismic events to their true locations in the subsurface
  • Images sub-salt structures and steep dips
  • Accurate: solves the full (two-way) wave equation
  • Requires massive compute and storage resources

SLIDE 13

Seismic Imaging

RTM workflow

  • Forward modeling (FWD)
  • Backward modeling (BWD)
  • Imaging condition

SLIDE 16

Seismic Imaging

The underlying theory of the RTM algorithm

The RTM operator:

    Img(x) = ∫_H ∫_0^T S_h(x, t) · R_h(x, T − t) dt dh

The Cauchy problem:

    (1/c²) ∂²u(x, t)/∂t² − Δu(x, t) = s(t)   in Ω
    u(x, 0) = 0
    ∂u(x, 0)/∂t = 0

Boundary condition: u = 0 on ∂Ω
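The imaging condition above is a zero-lag cross-correlation of the source and receiver wavefields, summed over time and then over shots. Below is a minimal C sketch of that inner loop; the flat array layout, the function name, and the assumption that both wavefields are stored in forward physical-time order are illustrative choices, not the presentation's code.

    #include <stddef.h>

    /* Zero-lag cross-correlation imaging condition for one shot:
     * img(x) += sum over t of S(x, t) * R(x, T - t).
     * S and R each hold nt snapshots of n grid points, flattened and
     * indexed by forward physical time. */
    void imaging_condition(int n, int nt, const float *S, const float *R, float *img)
    {
        for (int t = 0; t < nt; t++) {
            const float *s = S + (size_t)t * n;             /* source field at time t     */
            const float *r = R + (size_t)(nt - 1 - t) * n;  /* receiver field at time T-t */
            for (int i = 0; i < n; i++)
                img[i] += s[i] * r[i];
        }
    }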

SLIDE 17

Seismic Imaging

Finite Difference Time Domain for RTM

Finite Difference Time Domain (8th order in space, 2nd order in time):
  • Regular grids
  • Perfectly Matched Layers (PML) as the absorbing boundary condition

    U^{n+1}_{i,j,k} = 2 U^n_{i,j,k} − U^{n−1}_{i,j,k} + c²_{i,j,k} Δt² ΔU^n_{i,j,k} + c²_{i,j,k} Δt² s^n

  • Heavy computation (hours to days of processing time)
  • Terabytes of temporary data
  • Requires High Performance Computing
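For concreteness, here is a stripped-down sketch of one time step. It is a sketch only: a 2nd-order Laplacian stands in for the 8th-order stencil, PML and source injection are omitted, and the grid sizes and field names are illustrative.

    #include <stddef.h>

    /* One step of U^{n+1} = 2U^n - U^{n-1} + c^2 dt^2 Laplacian(U^n)
     * on an nx * ny * nz grid with spacing h (source term omitted). */
    void fdtd_step(int nx, int ny, int nz, float h, float dt,
                   const float *c, const float *un, const float *unm1, float *unp1)
    {
        const float inv_h2 = 1.0f / (h * h);
        const size_t sx = 1, sy = (size_t)nx, sz = (size_t)nx * ny;  /* grid strides */

        for (int k = 1; k < nz - 1; k++)
            for (int j = 1; j < ny - 1; j++)
                for (int i = 1; i < nx - 1; i++) {
                    size_t p = i * sx + j * sy + k * sz;
                    float lap = (un[p - sx] + un[p + sx]
                               + un[p - sy] + un[p + sy]
                               + un[p - sz] + un[p + sz]
                               - 6.0f * un[p]) * inv_h2;
                    unp1[p] = 2.0f * un[p] - unm1[p] + c[p] * c[p] * dt * dt * lap;
                }
    }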

SLIDE 18

Seismic Imaging

Frequency Domain

Translate to the frequency domain and solve the Helmholtz equation (acoustic wave equation):

    (−Δ − k²) u(x, ω) = s(x, ω),   with k = ω / v(x),

where ω is the angular frequency, v(x) is the seismic velocity field, and u(x, ω) is the time-harmonic wavefield solution to the forcing term s(x, ω).

w/ S. Zampini

SLIDE 20

Matricize and Taskify

Hardware Trends: Energy Matters!

Energy cost per operation:

                         2011       2018
    DP FLOP              100 pJ     10 pJ
    DP DRAM read         4800 pJ    1920 pJ
    Local interconnect   7500 pJ    2500 pJ
    Cross system         9000 pJ    3500 pJ

John Shalf, LBNL

SLIDE 21

Matricize and Taskify

Welcome DGX-2!

Extremely dense, tightly connected, GPU-based system: strong scaling!

SLIDE 22

Matricize and Taskify

Vendors’ Message ;-)

“You are either compute-bound or compute-irrelevant.”

  • P. Luszczek, ICL@UTK
SLIDE 23

Matricize and Taskify

3D Finite Difference Time Domain

Four main computational phases:
  • Stencil integration: compute-bound
  • Snapshotting: I/O-bound
  • Imaging condition: memory-bound
  • Compression: binary (e.g., gzip), truncation (e.g., brute force), or dense linear algebra (e.g., Tucker decomposition); see the compression sketch below

The 3D stencil domain is a tensor!
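As referenced in the compression item above, here is a minimal sketch of a dense-linear-algebra style compression applied to a single 2D slice of a snapshot, using a truncated SVD. This is a deliberately simplified stand-in for a full Tucker decomposition; the function name, fixed target rank, and use of LAPACKE are assumptions made for illustration.

    #include <lapacke.h>
    #include <stdlib.h>

    /* Compress an m x n column-major snapshot slice A into rank-r factors
     * U (m x r) and V (n x r) such that A ~= U * V^T, keeping the r largest
     * singular values. Returns the rank actually kept, or -1 on failure.
     * Storage drops from m*n to (m + n) * r values. */
    int compress_slice(int m, int n, double *A, int r, double *U, double *V)
    {
        int k = m < n ? m : n;
        double *s      = malloc((size_t)k * sizeof(double));
        double *u      = malloc((size_t)m * k * sizeof(double));
        double *vt     = malloc((size_t)k * n * sizeof(double));
        double *superb = malloc((size_t)(k - 1) * sizeof(double));
        if (!s || !u || !vt || !superb) return -1;

        /* Thin SVD: A = u * diag(s) * vt (A is overwritten). */
        int info = LAPACKE_dgesvd(LAPACK_COL_MAJOR, 'S', 'S', m, n,
                                  A, m, s, u, m, vt, k, superb);
        if (info != 0) return -1;

        if (r > k) r = k;
        for (int j = 0; j < r; j++)
            for (int i = 0; i < m; i++)
                U[i + (size_t)j * m] = u[i + (size_t)j * m] * s[j]; /* fold sigma into U */
        for (int j = 0; j < r; j++)
            for (int i = 0; i < n; i++)
                V[i + (size_t)j * n] = vt[j + (size_t)i * k];       /* V = rows of vt    */

        free(s); free(u); free(vt); free(superb);
        return r;
    }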

SLIDE 24

Matricize and Taskify

Intertwined AI Kernels Throughout the Integration

SLIDE 26

Matrices Over Runtime Systems

LAPACK DPOTRF from last century

Figure: Block algorithms. Three steps of the panel-update sequence; at each step the matrix is split into FINAL (already factorized), PANEL, and UPDATE regions.

SLIDE 27

Matrices Over Runtime Systems

PLASMA/MAGMA/CHAMELEON DPOTRF from this century

Figure: Tile Algorithms.

SLIDE 28

Matrices Over Runtime Systems

LAPACK: Blocked Algorithms

Principles:
  • Panel-Update sequence
  • Transformations are blocked/accumulated within the panel (Level-2 BLAS)
  • Transformations applied at once on the trailing submatrix (Level-3 BLAS)
  • Parallelism hidden inside the BLAS
  • Fork-join model

A broken model!

SLIDE 29

Matrices Over Runtime Systems

Tile Data Layout Format

LAPACK: column-major format
PLASMA/CHAMELEON: tile format

SLIDE 30

Matrices Over Runtime Systems

PLASMA/CHAMELEON: Tile Algorithms

PLASMA ⇒ http://icl.cs.utk.edu/plasma/
CHAMELEON ⇒ https://gitlab.inria.fr/solverstack/chameleon.git

  • Break the bulk synchronous programming model
  • Parallelism is brought to the fore
  • May require the redesign of linear algebra algorithms
  • Tile data layout translation
  • Remove unnecessary synchronization points between Panel-Update sequences
  • DAG execution where nodes represent tasks and edges define dependencies between them
  • Default dynamic runtime system environment: StarPU (but could use Quark, PaRSEC, OmpSs, OpenMP, etc.)

A sequential sketch of the tile Cholesky loop that generates such a DAG follows below.
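To show where the DAG comes from, here is a sequential sketch of the right-looking tile Cholesky loop over column-major nb x nb tiles. The tile-layout macro, the function name, and the use of CBLAS/LAPACKE on the CPU are illustrative assumptions, not the CHAMELEON implementation. Under a task-based runtime such as StarPU, each of the four calls in the loop body becomes a task, and the read/write accesses to the tiles define the DAG edges.

    #include <cblas.h>
    #include <lapacke.h>

    /* A points to nt x nt tiles of size nb x nb, stored tile after tile
     * (column-major inside each tile). TILE(i, j) addresses tile (i, j). */
    #define TILE(i, j) (A + (((size_t)(j) * nt + (i)) * nb * nb))

    /* Right-looking tile Cholesky (lower triangular); returns 0 on success. */
    int tile_cholesky(double *A, int nt, int nb)
    {
        for (int k = 0; k < nt; k++) {
            /* POTRF task: factorize the diagonal tile. */
            if (LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nb, TILE(k, k), nb) != 0)
                return -1;

            /* TRSM tasks: update the tiles below the diagonal tile. */
            for (int i = k + 1; i < nt; i++)
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                            CblasNonUnit, nb, nb, 1.0, TILE(k, k), nb, TILE(i, k), nb);

            /* SYRK/GEMM tasks: update the trailing submatrix. */
            for (int i = k + 1; i < nt; i++) {
                cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, nb, nb,
                            -1.0, TILE(i, k), nb, 1.0, TILE(i, i), nb);
                for (int j = k + 1; j < i; j++)
                    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, nb, nb, nb,
                                -1.0, TILE(i, k), nb, TILE(j, k), nb, 1.0, TILE(i, j), nb);
            }
        }
        return 0;
    }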

SLIDE 31

Matrices Over Runtime Systems

StarPU Runtime System 101

Provides:
  • Task scheduling
  • Memory management
  • Out-of-core execution

Supports:
  • SMP/multicore processors (x86, PPC, …)
  • NVIDIA GPUs (e.g., multi-GPU)
  • Hybrid architectures
  • Shared and distributed memory

SLIDE 32

Matrices Over Runtime Systems

StarPU Runtime System: User Productivity!

Separation of concerns: task-based numerical algorithms vs. hardware complexity
Heterogeneous tasks' orchestration: compute, I/O, compression

Main user API (as written on the slide):

    starpu_Insert_Task(&cl_dpotrf,
        VALUE,    &uplo, sizeof(char),
        VALUE,    &n,    sizeof(int),
        INOUT,    Ahandle(k, k),
        VALUE,    &lda,  sizeof(int),
        OUTPUT,   &info, sizeof(int),
        CALLBACK, profiling ? cl_dpotrf_callback : NULL, NULL,
        0);
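For readers who want to try this, here is a small self-contained sketch of the same insertion style against the public StarPU C API, where the flags are spelled STARPU_VALUE, STARPU_RW, and so on (assuming StarPU 1.2 or later; the vector-scaling codelet is an illustrative stand-in for cl_dpotrf).

    #include <starpu.h>
    #include <stdio.h>

    /* CPU implementation of the task: scale a vector by alpha. */
    static void scal_cpu(void *buffers[], void *cl_arg)
    {
        double *x = (double *)STARPU_VECTOR_GET_PTR(buffers[0]);
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        double alpha;
        starpu_codelet_unpack_args(cl_arg, &alpha);
        for (unsigned i = 0; i < n; i++) x[i] *= alpha;
    }

    static struct starpu_codelet scal_cl = {
        .cpu_funcs = { scal_cpu },   /* a .cuda_funcs entry would add a GPU variant */
        .nbuffers  = 1,
        .modes     = { STARPU_RW },
    };

    int main(void)
    {
        double x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        double alpha = 0.5;
        starpu_data_handle_t xh;

        if (starpu_init(NULL) != 0) return 1;

        /* Register the data so the runtime can track dependencies and move it. */
        starpu_vector_data_register(&xh, STARPU_MAIN_RAM, (uintptr_t)x, 8, sizeof(double));

        /* Asynchronous task submission; arguments mirror the slide's VALUE/INOUT flags. */
        starpu_task_insert(&scal_cl,
                           STARPU_VALUE, &alpha, sizeof(alpha),
                           STARPU_RW,    xh,
                           0);

        starpu_task_wait_for_all();   /* or rely on data dependencies of later tasks */
        starpu_data_unregister(xh);   /* brings the data back to main memory */
        starpu_shutdown();

        printf("x[0] = %g\n", x[0]);  /* expect 0.5 */
        return 0;
    }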

SLIDE 33

Matrices Over Runtime Systems

Back to RTM: The Tucker Decomposition

Generalization of the SVD to tensors (figure courtesy of J. Choi, IBM TJ Watson)

SLIDE 34

Matrices Over Runtime Systems

Back to RTM: Out-of-Core Algorithms to Maximize Memory and Computing Resources Occupancy

SLIDE 35

Matrices Over Runtime Systems

Programming Model and Optimizations

Next step: StarPU as the RTM master of ceremony to ensure asynchronous execution
  • Task-based + MPI + Pthreads + CUDA
  • Various StarPU scheduler policies to further mitigate overheads
  • GPUDirect support: cudaMemcpyPeer between GPUs, RDMA across GPU nodes
  • I/O backends: unistd (read/write), stdio (fread/fwrite), and unistd_o_direct (read/write with O_DIRECT); see the snapshot-writer sketch below
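A minimal sketch of the unistd_o_direct idea for writing a snapshot follows. The 4 KiB alignment, file mode, and function name are illustrative assumptions; O_DIRECT is Linux-specific, bypasses the page cache, and requires aligned buffers and transfer sizes.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Write n floats to `path` through O_DIRECT, padding to the alignment size. */
    int write_snapshot(const char *path, const float *field, size_t n)
    {
        const size_t align = 4096;                       /* conservative alignment   */
        size_t bytes = ((n * sizeof(float) + align - 1) / align) * align;

        void *buf = NULL;
        if (posix_memalign(&buf, align, bytes) != 0) return -1;
        memset(buf, 0, bytes);                           /* zero-pad the tail        */
        memcpy(buf, field, n * sizeof(float));

        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
        if (fd < 0) { free(buf); return -1; }

        ssize_t w = write(fd, buf, bytes);               /* aligned size and buffer  */
        close(fd);
        free(buf);
        return (w == (ssize_t)bytes) ? 0 : -1;
    }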

SLIDE 37

Application to Frequency Domain

Exploiting the hierarchical low-rankness of these matrices

  • Ubiquitous in computational science and engineering
  • Symmetric, positive-definite matrix structure
  • (Apparently) dense matrices
  • Often data-sparse: decay of parameter correlations with distance
  • Hierarchically of low rank

SLIDE 38

Application to Frequency Domain

This Highly Ranked Guy!


SLIDE 39

Application to Frequency Domain

Hierarchical Matrices

Hierarchical low-rank approximation: HSS, H2, HODLR, H, etc…

(Figure: block-wise rank map of a hierarchical matrix; each number gives the numerical rank of the corresponding off-diagonal block.)
SLIDE 40

Application to Frequency Domain

The Cholesky Factorization

The Cholesky factorization of an N × N real symmetric, positive-definite matrix A has the form A = LL^T, where L is an N × N real lower triangular matrix with positive diagonal elements.

SLIDE 41

Application to Frequency Domain

The HiCMA Library

Available at http://github.com/ecrc/hicma

SLIDE 42

Application to Frequency Domain

Matrix Rank X-ray: Hierarchically Low-Rank

Seismic imaging: 3D Helmholtz with N = 128 and k = N × 0.625 (10 grid points per wavelength) on Ω = [0, 1]³

(Figure: rank map of the matrix for tile sizes 256 to 2048 and accuracy thresholds from 1e-2 down to 1e-14; block ranks grow with tile size and with the requested accuracy.)

SLIDE 43

Application to Frequency Domain

Dense Linear Algebra Renaissance: Tile Low-Rank as a Pragmatic Approach

  • T. Mary, PhD Dissertation, Block Low-Rank multifrontal solvers: complexity, performance, and scalability, 2017.
  • C. Weisbecker, PhD Dissertation, Improving multifrontal solvers by means of algebraic Block Low-Rank representations, 2013.

SLIDE 45

Performance Results

Introducing BLAS for batched TLR LA on GPUs: KBLAS

Context:
  • Very small sizes ⇒ arithmetic intensity is low
  • Humongous number of independent operations ⇒ kernel launch overhead is high
  • Limited GPU occupancy ⇒ GPU CUDA cores are idle

Solution: batched executions for TLR LA operations (see the sketch below)
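The KBLAS kernels themselves are not reproduced here, but the batched execution model they build on can be illustrated with stock cuBLAS: one call launches many small GEMMs at once, amortizing the launch overhead listed above. The sizes, batch count, and zero-initialized data are illustrative, and error checking is trimmed.

    /* Build (roughly): nvcc batched_demo.c -lcublas */
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int n = 32, batch = 1024;            /* many tiny n x n products */
        const double alpha = 1.0, beta = 0.0;
        const size_t bytes = (size_t)n * n * sizeof(double);

        /* Host arrays holding one device pointer per GEMM in the batch. */
        double **hA = malloc(batch * sizeof(double *));
        double **hB = malloc(batch * sizeof(double *));
        double **hC = malloc(batch * sizeof(double *));
        for (int b = 0; b < batch; b++) {
            cudaMalloc((void **)&hA[b], bytes); cudaMemset(hA[b], 0, bytes);
            cudaMalloc((void **)&hB[b], bytes); cudaMemset(hB[b], 0, bytes);
            cudaMalloc((void **)&hC[b], bytes); cudaMemset(hC[b], 0, bytes);
        }

        /* Device copies of the pointer arrays, as required by the batched API. */
        double **dA, **dB, **dC;
        cudaMalloc((void **)&dA, batch * sizeof(double *));
        cudaMalloc((void **)&dB, batch * sizeof(double *));
        cudaMalloc((void **)&dC, batch * sizeof(double *));
        cudaMemcpy(dA, hA, batch * sizeof(double *), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, batch * sizeof(double *), cudaMemcpyHostToDevice);
        cudaMemcpy(dC, hC, batch * sizeof(double *), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha,
                           (const double * const *)dA, n,
                           (const double * const *)dB, n, &beta,
                           dC, n, batch);
        cudaDeviceSynchronize();
        printf("launched %d GEMMs of size %dx%d in a single call\n", batch, n, n);

        cublasDestroy(handle);
        /* Device/host cleanup omitted for brevity. */
        return 0;
    }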

SLIDE 46

Performance Results

TLR GEMM with Various K Ranks: update dense, with compression

Input matrices A & B in dense form. Compress A & B using SVD.

Uniform ranks.

Output matrix C in dense form.

(Plot: time in ms vs. matrix size M = N = K from 5,120 to 40,960, comparing cuBLAS-Dense against KBLAS-TLR with ranks 256, 128, 64, 32, 16, and 8.)

Figure: TLR GEMM + SVD on NVIDIA V100.

SLIDE 47

Performance Results

TLR GEMM with Various K Ranks: update dense, no compression

Input matrices A & B in TLR form. Output matrix C in dense form. Practically needed.

(Plot: time in ms vs. matrix size M = N = K from 5,120 to 40,960; KBLAS-TLR reaches speedups of 1.5X to 12.2X over cuBLAS-Dense, with memory footprints of 37.5% for rank 256 down to 0.8% for rank 8.)

Figure: TLR GEMM on NVIDIA V100.

SLIDE 48

Performance Results

TLR GEMM with Various K Ranks: all low-rank

Input matrices A & B in TLR form. Output matrix C in TLR form. Re-compression involved.

(Plot: time in ms vs. matrix size M = N = K from 5,120 to 61,440, comparing cuBLAS-Dense against KBLAS-TLR with ranks 256 down to 8.)

Figure: TLR GEMM + SVD on NVIDIA V100.

SLIDE 49

Performance Results

TLR-GEMM Variants

TLR-GEMM-LLD: A and B in TLR form, update a dense C.

TLR-GEMM-LLL: A, B, and C all in TLR form.

(Figure: both variants accumulate the n-th outer product of the M x K tiles of A with the K x N tiles of B into C; a low-rank update sketch follows below.)
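To make the arithmetic concrete, here is a minimal CPU sketch of the LLD update using CBLAS; the tile size nb, the ranks ka and kb, and the function name are illustrative. With A ≈ Ua·Vaᵀ and B ≈ Ub·Vbᵀ, the product is assembled through small inner GEMMs instead of one full nb³ GEMM, which is where the TLR savings come from.

    #include <cblas.h>
    #include <stdlib.h>

    /* C (nb x nb, dense) += (Ua * Va^T) * (Ub * Vb^T)
     * Ua, Va: nb x ka; Ub, Vb: nb x kb; all column-major.
     * Cost drops from O(nb^3) to O(nb*ka*kb + nb^2*kb). */
    void tlr_gemm_lld(int nb, int ka, int kb,
                      const double *Ua, const double *Va,
                      const double *Ub, const double *Vb,
                      double *C)
    {
        double *W  = malloc((size_t)ka * kb * sizeof(double)); /* W  = Va^T * Ub  (ka x kb) */
        double *UW = malloc((size_t)nb * kb * sizeof(double)); /* UW = Ua * W     (nb x kb) */

        cblas_dgemm(CblasColMajor, CblasTrans,   CblasNoTrans, ka, kb, nb,
                    1.0, Va, nb, Ub, nb, 0.0, W, ka);
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, nb, kb, ka,
                    1.0, Ua, nb, W, ka, 0.0, UW, nb);
        /* C += UW * Vb^T: the only nb x nb sized operation, with inner dimension kb. */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,   nb, nb, kb,
                    1.0, UW, nb, Vb, nb, 1.0, C, nb);

        free(W);
        free(UW);
    }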

  • A. Charara, D. E. Keyes, and H. Ltaief, Batched Tile Low-Rank GEMM on GPUs, submitted to EuroPar, 2018.

SLIDE 50

Performance Results

Batched TLR GEMM: uniform ranks

  • Update dense tile: higher memory footprint.
  • Update low-rank tile: requires re-compression (QR + SVD + GEMM).

(Figure: batches of A, B, and C tiles processed concurrently.)

  • W. H. Boukaram, G. Turkiyyah, H. Ltaief, and D. E. Keyes, Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression, Parallel Computing, vol. 74, pp. 19-33, 2018.

SLIDE 51

Performance Results

TLR POTRF: Uniform Ranks

Tile low-rank Cholesky factorization:
  • Uniform ranks
  • Generate, compress, and factorize on the fly
  • Single NVIDIA Pascal P100 GPU

(Plot: time in seconds vs. matrix size for HiCMA-TLR on 36 CPU cores, MAGMA dense on one GPU, and HiCMA-TLR on one GPU; a 7X gap is highlighted.)

  • A. Charara, D. E. Keyes, and H. Ltaief, Batched Tile Low-Rank GEMM on GPUs, submitted to EuroPar, 2018.

SLIDE 53

Summary and Future Work

Summary

Summary:
  • The task-based programming model is a disruptive approach for the Oil & Gas industry
  • Amenability and versatility of the model

Future work:
  • Task-based full RTM investigation using StarPU features
  • Performance impact of out-of-core algorithms
  • Performance impact of I/O and compression
  • Investigate the Helmholtz equation at high frequency (high ranks!)

SLIDE 54

Acknowledgments

Students/Collaborators/Vendors

  • Extreme Computing Research Center @ KAUST: R. Abdelkhalak, S. Abdullah, K. Akbudak, R. AlOmairy, A. Alonazi, T. Alturkestani, W. Boukaram, A. Charara, N. Doucet, M. Genton, E. Gonzalez-Fisher, D. Keyes, A. Mikhalev, D. Sukkari, and Y. Sun
  • KAUST Supercomputing Lab
  • INRIA/INP/LaBRI Bordeaux, France: Runtime/HiePACS teams
  • Max-Planck Institute @ Leipzig, Germany: R. Kriemann
  • Innovative Computing Laboratory @ UTK: PLASMA/MAGMA/PaRSEC teams
  • American University of Beirut, Lebanon: G. Turkiyyah
  • Tokyo Institute of Technology, Japan: R. Yokota

SLIDE 55

Acknowledgments

Questions?
