SLIDE 1

High Performance Asynchronous Execution of the Reverse Time Migration for the Oil & Gas Industry

Issam Said¹ and Hatem Ltaief²

¹ NVIDIA Oil and Gas, Paris, France
² Extreme Computing Research Center, KAUST, Saudi Arabia

NVIDIA GTC Conference at San Jose, CA March 26-29, 2018

SLIDE 2

Outline

1. Background on Seismic Imaging
2. Ubiquitous Matricization and Taskification for Seismic Imaging
3. Matrices Over Runtime Systems
4. Application to Frequency Domain
5. Performance Results
6. Summary and Future Work

SLIDE 4

Seismic Imaging

Energy supply and demand

  • 40% more energy is needed by 2035
  • No choice but oil, gas, and coal
  • Sophisticated seismic methods are required

SLIDE 5

Seismic Imaging

Seismic methods for Oil & Gas exploration

Acquisition → Processing → Interpretation

Shot = source activation + data collection (receivers)

Seismic survey:
  • Air-gun array
  • Hydrophones
  • Shot record

SLIDE 6

Seismic Imaging

Seismic methods for Oil & Gas exploration

Acquisition → Processing → Interpretation

Processing = noise attenuation, demultiple, interpolation, imaging ⇒ subsurface image

SLIDE 8

Seismic Imaging

Seismic methods for Oil & Gas exploration

Acquisition → Processing → Interpretation

Calculate seismic attributes

  • Dip
  • Azimuth
  • Coherence (courtesy of Total)
SLIDE 10

Seismic Imaging

Reverse Time Migration (RTM)

  • The reference computer-based imaging algorithm in the industry
  • Repositions seismic events to their true locations in the subsurface
  • Images sub-salt structures and steep dips
  • Accurate: solves the full (two-way) wave equation
  • Requires massive compute and storage resources

SLIDE 13

Seismic Imaging

RTM workflow

  • Forward modeling (FWD)
  • Backward modeling (BWD)
  • Imaging condition

SLIDE 16

Seismic Imaging

The underlying theory of the RTM algorithm

The RTM operator:

    Img(x) = ∫_H ∫_0^T S_h(x, t) · R_h(x, T − t) dt dh

The Cauchy problem:

    (1/c²) ∂²u(x, t)/∂t² − Δu(x, t) = s(t)   in Ω
    u(x, 0) = 0
    ∂u(x, 0)/∂t = 0

Boundary condition: u = 0 on ∂Ω
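The imaging condition above is a zero-lag cross-correlation of the source and receiver wavefields, summed over time and then over shots. Below is a minimal C sketch of that inner loop; the flat array layout, the function name, and the assumption that both wavefields are stored in forward physical-time order are illustrative choices, not the presentation's code.

    #include <stddef.h>

    /* Zero-lag cross-correlation imaging condition for one shot:
     * img(x) += sum over t of S(x, t) * R(x, T - t).
     * S and R each hold nt snapshots of n grid points, flattened and
     * indexed by forward physical time. */
    void imaging_condition(int n, int nt, const float *S, const float *R, float *img)
    {
        for (int t = 0; t < nt; t++) {
            const float *s = S + (size_t)t * n;             /* source field at time t     */
            const float *r = R + (size_t)(nt - 1 - t) * n;  /* receiver field at time T-t */
            for (int i = 0; i < n; i++)
                img[i] += s[i] * r[i];
        }
    }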

SLIDE 17

Seismic Imaging

Finite Difference Time Domain for RTM

Finite Difference Time Domain (8th order in space, 2nd order in time):
  • Regular grids
  • Perfectly Matched Layers (PML) as the absorbing boundary condition

    U^{n+1}_{i,j,k} = 2 U^n_{i,j,k} − U^{n−1}_{i,j,k} + c²_{i,j,k} Δt² ΔU^n_{i,j,k} + c²_{i,j,k} Δt² s^n

  • Heavy computation (hours to days of processing time)
  • Terabytes of temporary data
  • Requires High Performance Computing
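For concreteness, here is a stripped-down sketch of one time step. It is a sketch only: a 2nd-order Laplacian stands in for the 8th-order stencil, PML and source injection are omitted, and the grid sizes and field names are illustrative.

    #include <stddef.h>

    /* One step of U^{n+1} = 2U^n - U^{n-1} + c^2 dt^2 Laplacian(U^n)
     * on an nx * ny * nz grid with spacing h (source term omitted). */
    void fdtd_step(int nx, int ny, int nz, float h, float dt,
                   const float *c, const float *un, const float *unm1, float *unp1)
    {
        const float inv_h2 = 1.0f / (h * h);
        const size_t sx = 1, sy = (size_t)nx, sz = (size_t)nx * ny;  /* grid strides */

        for (int k = 1; k < nz - 1; k++)
            for (int j = 1; j < ny - 1; j++)
                for (int i = 1; i < nx - 1; i++) {
                    size_t p = i * sx + j * sy + k * sz;
                    float lap = (un[p - sx] + un[p + sx]
                               + un[p - sy] + un[p + sy]
                               + un[p - sz] + un[p + sz]
                               - 6.0f * un[p]) * inv_h2;
                    unp1[p] = 2.0f * un[p] - unm1[p] + c[p] * c[p] * dt * dt * lap;
                }
    }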

SLIDE 18

Seismic Imaging

Frequency Domain

Translate to the frequency domain and solve the Helmholtz equation (acoustic wave equation):

    (−Δ − k²) u(x, ω) = s(x, ω),   with k = ω / v(x),

where ω is the angular frequency, v(x) is the seismic velocity field, and u(x, ω) is the time-harmonic wavefield solution to the forcing term s(x, ω).

w/ S. Zampini

SLIDE 20

Matricize and Taskify

Hardware Trends: Energy Matters!

Energy cost per operation:

                         2011       2018
    DP FLOP              100 pJ     10 pJ
    DP DRAM read         4800 pJ    1920 pJ
    Local interconnect   7500 pJ    2500 pJ
    Cross system         9000 pJ    3500 pJ

John Shalf, LBNL

SLIDE 21

Matricize and Taskify

Welcome DGX-2!

Extremely dense, tightly connected, GPU-based system: strong scaling!

SLIDE 22

Matricize and Taskify

Vendors’ Message ;-)

“You are either compute-bound or compute-irrelevant.”

  • P. Luszczek, ICL@UTK
SLIDE 23

Matricize and Taskify

3D Finite Difference Time Domain

Four main computational phases:
  • Stencil integration: compute-bound
  • Snapshotting: I/O-bound
  • Imaging condition: memory-bound
  • Compression: binary (e.g., gzip), truncation (e.g., brute force), or dense linear algebra (e.g., Tucker decomposition); see the compression sketch below

The 3D stencil domain is a tensor!
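As referenced in the compression item above, here is a minimal sketch of a dense-linear-algebra style compression applied to a single 2D slice of a snapshot, using a truncated SVD. This is a deliberately simplified stand-in for a full Tucker decomposition; the function name, fixed target rank, and use of LAPACKE are assumptions made for illustration.

    #include <lapacke.h>
    #include <stdlib.h>

    /* Compress an m x n column-major snapshot slice A into rank-r factors
     * U (m x r) and V (n x r) such that A ~= U * V^T, keeping the r largest
     * singular values. Returns the rank actually kept, or -1 on failure.
     * Storage drops from m*n to (m + n) * r values. */
    int compress_slice(int m, int n, double *A, int r, double *U, double *V)
    {
        int k = m < n ? m : n;
        double *s      = malloc((size_t)k * sizeof(double));
        double *u      = malloc((size_t)m * k * sizeof(double));
        double *vt     = malloc((size_t)k * n * sizeof(double));
        double *superb = malloc((size_t)(k - 1) * sizeof(double));
        if (!s || !u || !vt || !superb) return -1;

        /* Thin SVD: A = u * diag(s) * vt (A is overwritten). */
        int info = LAPACKE_dgesvd(LAPACK_COL_MAJOR, 'S', 'S', m, n,
                                  A, m, s, u, m, vt, k, superb);
        if (info != 0) return -1;

        if (r > k) r = k;
        for (int j = 0; j < r; j++)
            for (int i = 0; i < m; i++)
                U[i + (size_t)j * m] = u[i + (size_t)j * m] * s[j]; /* fold sigma into U */
        for (int j = 0; j < r; j++)
            for (int i = 0; i < n; i++)
                V[i + (size_t)j * n] = vt[j + (size_t)i * k];       /* V = rows of vt    */

        free(s); free(u); free(vt); free(superb);
        return r;
    }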

SLIDE 24

Matricize and Taskify

Intertwined AI Kernels Throughout the Integration

SLIDE 26

Matrices Over Runtime Systems

LAPACK DPOTRF from last century

Figure: Block algorithms. Three steps of the panel-update sequence; at each step the matrix is split into FINAL (already factorized), PANEL, and UPDATE regions.

SLIDE 27

Matrices Over Runtime Systems

PLASMA/MAGMA/CHAMELEON DPOTRF from this century

Figure: Tile Algorithms.

SLIDE 28

Matrices Over Runtime Systems

LAPACK: Blocked Algorithms

Principles:
  • Panel-Update sequence
  • Transformations are blocked/accumulated within the panel (Level-2 BLAS)
  • Transformations applied at once on the trailing submatrix (Level-3 BLAS)
  • Parallelism hidden inside the BLAS
  • Fork-join model

A broken model!

SLIDE 29

Matrices Over Runtime Systems

Tile Data Layout Format

LAPACK: column-major format
PLASMA/CHAMELEON: tile format

SLIDE 30

Matrices Over Runtime Systems

PLASMA/CHAMELEON: Tile Algorithms

PLASMA ⇒ http://icl.cs.utk.edu/plasma/
CHAMELEON ⇒ https://gitlab.inria.fr/solverstack/chameleon.git

  • Break the bulk synchronous programming model
  • Parallelism is brought to the fore
  • May require the redesign of linear algebra algorithms
  • Tile data layout translation
  • Remove unnecessary synchronization points between Panel-Update sequences
  • DAG execution where nodes represent tasks and edges define dependencies between them
  • Default dynamic runtime system environment: StarPU (but could use Quark, PaRSEC, OmpSs, OpenMP, etc.)

A sequential sketch of the tile Cholesky loop that generates such a DAG follows below.
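To show where the DAG comes from, here is a sequential sketch of the right-looking tile Cholesky loop over column-major nb x nb tiles. The tile-layout macro, the function name, and the use of CBLAS/LAPACKE on the CPU are illustrative assumptions, not the CHAMELEON implementation. Under a task-based runtime such as StarPU, each of the four calls in the loop body becomes a task, and the read/write accesses to the tiles define the DAG edges.

    #include <cblas.h>
    #include <lapacke.h>

    /* A points to nt x nt tiles of size nb x nb, stored tile after tile
     * (column-major inside each tile). TILE(i, j) addresses tile (i, j). */
    #define TILE(i, j) (A + (((size_t)(j) * nt + (i)) * nb * nb))

    /* Right-looking tile Cholesky (lower triangular); returns 0 on success. */
    int tile_cholesky(double *A, int nt, int nb)
    {
        for (int k = 0; k < nt; k++) {
            /* POTRF task: factorize the diagonal tile. */
            if (LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nb, TILE(k, k), nb) != 0)
                return -1;

            /* TRSM tasks: update the tiles below the diagonal tile. */
            for (int i = k + 1; i < nt; i++)
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                            CblasNonUnit, nb, nb, 1.0, TILE(k, k), nb, TILE(i, k), nb);

            /* SYRK/GEMM tasks: update the trailing submatrix. */
            for (int i = k + 1; i < nt; i++) {
                cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, nb, nb,
                            -1.0, TILE(i, k), nb, 1.0, TILE(i, i), nb);
                for (int j = k + 1; j < i; j++)
                    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, nb, nb, nb,
                                -1.0, TILE(i, k), nb, TILE(j, k), nb, 1.0, TILE(i, j), nb);
            }
        }
        return 0;
    }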

SLIDE 31

Matrices Over Runtime Systems

StarPU Runtime System 101

Provides:
  • Task scheduling
  • Memory management
  • Out-of-core execution

Supports:
  • SMP/multicore processors (x86, PPC, …)
  • NVIDIA GPUs (e.g., multi-GPU)
  • Hybrid architectures
  • Shared and distributed memory

SLIDE 32

Matrices Over Runtime Systems

StarPU Runtime System: User Productivity!

Separation of concerns: task-based numerical algorithms vs. hardware complexity
Heterogeneous tasks' orchestration: compute, I/O, compression

Main user API (as written on the slide):

    starpu_Insert_Task(&cl_dpotrf,
        VALUE,    &uplo, sizeof(char),
        VALUE,    &n,    sizeof(int),
        INOUT,    Ahandle(k, k),
        VALUE,    &lda,  sizeof(int),
        OUTPUT,   &info, sizeof(int),
        CALLBACK, profiling ? cl_dpotrf_callback : NULL, NULL,
        0);
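For readers who want to try this, here is a small self-contained sketch of the same insertion style against the public StarPU C API, where the flags are spelled STARPU_VALUE, STARPU_RW, and so on (assuming StarPU 1.2 or later; the vector-scaling codelet is an illustrative stand-in for cl_dpotrf).

    #include <starpu.h>
    #include <stdio.h>

    /* CPU implementation of the task: scale a vector by alpha. */
    static void scal_cpu(void *buffers[], void *cl_arg)
    {
        double *x = (double *)STARPU_VECTOR_GET_PTR(buffers[0]);
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        double alpha;
        starpu_codelet_unpack_args(cl_arg, &alpha);
        for (unsigned i = 0; i < n; i++) x[i] *= alpha;
    }

    static struct starpu_codelet scal_cl = {
        .cpu_funcs = { scal_cpu },   /* a .cuda_funcs entry would add a GPU variant */
        .nbuffers  = 1,
        .modes     = { STARPU_RW },
    };

    int main(void)
    {
        double x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        double alpha = 0.5;
        starpu_data_handle_t xh;

        if (starpu_init(NULL) != 0) return 1;

        /* Register the data so the runtime can track dependencies and move it. */
        starpu_vector_data_register(&xh, STARPU_MAIN_RAM, (uintptr_t)x, 8, sizeof(double));

        /* Asynchronous task submission; arguments mirror the slide's VALUE/INOUT flags. */
        starpu_task_insert(&scal_cl,
                           STARPU_VALUE, &alpha, sizeof(alpha),
                           STARPU_RW,    xh,
                           0);

        starpu_task_wait_for_all();   /* or rely on data dependencies of later tasks */
        starpu_data_unregister(xh);   /* brings the data back to main memory */
        starpu_shutdown();

        printf("x[0] = %g\n", x[0]);  /* expect 0.5 */
        return 0;
    }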

SLIDE 33

Matrices Over Runtime Systems

Back to RTM: The Tucker Decomposition

Generalization of the SVD to tensors (figure courtesy of J. Choi, IBM TJ Watson)

SLIDE 34

Matrices Over Runtime Systems

Back to RTM: Out-of-Core Algorithms to Maximize Memory and Computing Resources Occupancy

SLIDE 35

Matrices Over Runtime Systems

Programming Model and Optimizations

Next step: StarPU as the RTM master of ceremony to ensure asynchronous execution
  • Task-based + MPI + Pthreads + CUDA
  • Various StarPU scheduler policies to further mitigate overheads
  • GPUDirect support: cudaMemcpyPeer between GPUs, RDMA across GPU nodes
  • I/O backends: unistd (read/write), stdio (fread/fwrite), and unistd_o_direct (read/write with O_DIRECT); see the snapshot-writer sketch below
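A minimal sketch of the unistd_o_direct idea for writing a snapshot follows. The 4 KiB alignment, file mode, and function name are illustrative assumptions; O_DIRECT is Linux-specific, bypasses the page cache, and requires aligned buffers and transfer sizes.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Write n floats to `path` through O_DIRECT, padding to the alignment size. */
    int write_snapshot(const char *path, const float *field, size_t n)
    {
        const size_t align = 4096;                       /* conservative alignment   */
        size_t bytes = ((n * sizeof(float) + align - 1) / align) * align;

        void *buf = NULL;
        if (posix_memalign(&buf, align, bytes) != 0) return -1;
        memset(buf, 0, bytes);                           /* zero-pad the tail        */
        memcpy(buf, field, n * sizeof(float));

        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
        if (fd < 0) { free(buf); return -1; }

        ssize_t w = write(fd, buf, bytes);               /* aligned size and buffer  */
        close(fd);
        free(buf);
        return (w == (ssize_t)bytes) ? 0 : -1;
    }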

SLIDE 37

Application to Frequency Domain

Exploiting the hierarchical low-rankness of these matrices

  • Ubiquitous in computational science and engineering
  • Symmetric, positive-definite matrix structure
  • (Apparently) dense matrices
  • Often data-sparse: decay of parameter correlations with distance
  • Hierarchically of low rank

SLIDE 38

Application to Frequency Domain

This Highly Ranked Guy!


SLIDE 39

Application to Frequency Domain

Hierarchical Matrices

Hierarchical low-rank approximation: HSS, H2, HODLR, H, etc…

(Figure: block-wise rank map of a hierarchical matrix; each number gives the numerical rank of the corresponding off-diagonal block.)
SLIDE 40

Application to Frequency Domain

The Cholesky Factorization

The Cholesky factorization of an N × N real symmetric, positive-definite matrix A has the form A = LL^T, where L is an N × N real lower triangular matrix with positive diagonal elements.

SLIDE 41

Application to Frequency Domain

The HiCMA Library

Available at http://github.com/ecrc/hicma

SLIDE 42

Application to Frequency Domain

Matrix Rank X-ray: Hierarchically Low-Rank

Seismic imaging: 3D Helmholtz with N = 128 and k = N × 0.625 (10 grid points per wavelength) on Ω = [0, 1]³

(Figure: rank map of the matrix for tile sizes 256 to 2048 and accuracy thresholds from 1e-2 down to 1e-14; block ranks grow with tile size and with the requested accuracy.)

SLIDE 43

Application to Frequency Domain

Dense Linear Algebra Renaissance: Tile Low-Rank as a Pragmatic Approach

  • T. Mary, PhD Dissertation, Block Low-Rank multifrontal solvers: complexity, performance, and scalability, 2017.
  • C. Weisbecker, PhD Dissertation, Improving multifrontal solvers by means of algebraic Block Low-Rank representations, 2013.

SLIDE 45

Performance Results

Introducing BLAS for batched TLR LA on GPUs: KBLAS

Context:
  • Very small sizes ⇒ arithmetic intensity is low
  • Humongous number of independent operations ⇒ kernel launch overhead is high
  • Limited GPU occupancy ⇒ GPU CUDA cores are idle

Solution: batched executions for TLR LA operations (see the sketch below)
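The KBLAS kernels themselves are not reproduced here, but the batched execution model they build on can be illustrated with stock cuBLAS: one call launches many small GEMMs at once, amortizing the launch overhead listed above. The sizes, batch count, and zero-initialized data are illustrative, and error checking is trimmed.

    /* Build (roughly): nvcc batched_demo.c -lcublas */
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int n = 32, batch = 1024;            /* many tiny n x n products */
        const double alpha = 1.0, beta = 0.0;
        const size_t bytes = (size_t)n * n * sizeof(double);

        /* Host arrays holding one device pointer per GEMM in the batch. */
        double **hA = malloc(batch * sizeof(double *));
        double **hB = malloc(batch * sizeof(double *));
        double **hC = malloc(batch * sizeof(double *));
        for (int b = 0; b < batch; b++) {
            cudaMalloc((void **)&hA[b], bytes); cudaMemset(hA[b], 0, bytes);
            cudaMalloc((void **)&hB[b], bytes); cudaMemset(hB[b], 0, bytes);
            cudaMalloc((void **)&hC[b], bytes); cudaMemset(hC[b], 0, bytes);
        }

        /* Device copies of the pointer arrays, as required by the batched API. */
        double **dA, **dB, **dC;
        cudaMalloc((void **)&dA, batch * sizeof(double *));
        cudaMalloc((void **)&dB, batch * sizeof(double *));
        cudaMalloc((void **)&dC, batch * sizeof(double *));
        cudaMemcpy(dA, hA, batch * sizeof(double *), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, batch * sizeof(double *), cudaMemcpyHostToDevice);
        cudaMemcpy(dC, hC, batch * sizeof(double *), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha,
                           (const double * const *)dA, n,
                           (const double * const *)dB, n, &beta,
                           dC, n, batch);
        cudaDeviceSynchronize();
        printf("launched %d GEMMs of size %dx%d in a single call\n", batch, n, n);

        cublasDestroy(handle);
        /* Device/host cleanup omitted for brevity. */
        return 0;
    }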

SLIDE 46

Performance Results

TLR GEMM with Various K Ranks: update dense, with compression

Input matrices A & B in dense form. Compress A & B using SVD.

Uniform ranks.

Output matrix C in dense form.

(Plot: time in ms vs. matrix size M = N = K from 5,120 to 40,960, comparing cuBLAS-Dense against KBLAS-TLR with ranks 256, 128, 64, 32, 16, and 8.)

Figure: TLR GEMM + SVD on NVIDIA V100.

SLIDE 47

Performance Results

TLR GEMM with Various K Ranks: update dense, no compression

Input matrices A & B in TLR form. Output matrix C in dense form. Practically needed.

(Plot: time in ms vs. matrix size M = N = K from 5,120 to 40,960; KBLAS-TLR reaches speedups of 1.5X to 12.2X over cuBLAS-Dense, with memory footprints of 37.5% for rank 256 down to 0.8% for rank 8.)

Figure: TLR GEMM on NVIDIA V100.

SLIDE 48

Performance Results

TLR GEMM with Various K Ranks: all low-rank

Input matrices A & B in TLR form. Output matrix C in TLR form. Re-compression involved.

(Plot: time in ms vs. matrix size M = N = K from 5,120 to 61,440, comparing cuBLAS-Dense against KBLAS-TLR with ranks 256 down to 8.)

Figure: TLR GEMM + SVD on NVIDIA V100.

SLIDE 49

Performance Results

TLR-GEMM Variants

TLR-GEMM-LLD: A and B in TLR form, update a dense C.

TLR-GEMM-LLL: A, B, and C all in TLR form.

(Figure: both variants accumulate the n-th outer product of the M x K tiles of A with the K x N tiles of B into C; a low-rank update sketch follows below.)
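To make the arithmetic concrete, here is a minimal CPU sketch of the LLD update using CBLAS; the tile size nb, the ranks ka and kb, and the function name are illustrative. With A ≈ Ua·Vaᵀ and B ≈ Ub·Vbᵀ, the product is assembled through small inner GEMMs instead of one full nb³ GEMM, which is where the TLR savings come from.

    #include <cblas.h>
    #include <stdlib.h>

    /* C (nb x nb, dense) += (Ua * Va^T) * (Ub * Vb^T)
     * Ua, Va: nb x ka; Ub, Vb: nb x kb; all column-major.
     * Cost drops from O(nb^3) to O(nb*ka*kb + nb^2*kb). */
    void tlr_gemm_lld(int nb, int ka, int kb,
                      const double *Ua, const double *Va,
                      const double *Ub, const double *Vb,
                      double *C)
    {
        double *W  = malloc((size_t)ka * kb * sizeof(double)); /* W  = Va^T * Ub  (ka x kb) */
        double *UW = malloc((size_t)nb * kb * sizeof(double)); /* UW = Ua * W     (nb x kb) */

        cblas_dgemm(CblasColMajor, CblasTrans,   CblasNoTrans, ka, kb, nb,
                    1.0, Va, nb, Ub, nb, 0.0, W, ka);
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, nb, kb, ka,
                    1.0, Ua, nb, W, ka, 0.0, UW, nb);
        /* C += UW * Vb^T: the only nb x nb sized operation, with inner dimension kb. */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,   nb, nb, kb,
                    1.0, UW, nb, Vb, nb, 1.0, C, nb);

        free(W);
        free(UW);
    }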

  • A. Charara, D. E. Keyes, and H. Ltaief, Batched Tile Low-Rank GEMM on GPUs, submitted to EuroPar, 2018.

SLIDE 50

Performance Results

Batched TLR GEMM: uniform ranks

  • Update dense tile: higher memory footprint.
  • Update low-rank tile: requires re-compression (QR + SVD + GEMM).

(Figure: batches of A, B, and C tiles processed concurrently.)

  • W. H. Boukaram, G. Turkiyyah, H. Ltaief, and D. E. Keyes, Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression, Parallel Computing, vol. 74, pp. 19-33, 2018.

SLIDE 51

Performance Results

TLR POTRF: Uniform Ranks

Tile low-rank Cholesky factorization:
  • Uniform ranks
  • Generate, compress, and factorize on the fly
  • Single NVIDIA Pascal P100 GPU

(Plot: time in seconds vs. matrix size for HiCMA-TLR on 36 CPU cores, MAGMA dense on one GPU, and HiCMA-TLR on one GPU; a 7X gap is highlighted.)

  • A. Charara, D. E. Keyes, and H. Ltaief, Batched Tile Low-Rank GEMM on GPUs, submitted to EuroPar, 2018.

SLIDE 53

Summary and Future Work

Summary

Summary:
  • The task-based programming model is a disruptive approach for the Oil & Gas industry
  • Amenability and versatility of the model

Future work:
  • Task-based full RTM investigation using StarPU features
  • Performance impact of out-of-core algorithms
  • Performance impact of I/O and compression
  • Investigate the Helmholtz equation at high frequency (high ranks!)

SLIDE 54

Acknowledgments

Students/Collaborators/Vendors

  • Extreme Computing Research Center @ KAUST: R. Abdelkhalak, S. Abdullah, K. Akbudak, R. AlOmairy, A. Alonazi, T. Alturkestani, W. Boukaram, A. Charara, N. Doucet, M. Genton, E. Gonzalez-Fisher, D. Keyes, A. Mikhalev, D. Sukkari, and Y. Sun
  • KAUST Supercomputing Lab
  • INRIA/INP/LaBRI Bordeaux, France: Runtime/HiePACS teams
  • Max-Planck Institute @ Leipzig, Germany: R. Kriemann
  • Innovative Computing Laboratory @ UTK: PLASMA/MAGMA/PaRSEC teams
  • American University of Beirut, Lebanon: G. Turkiyyah
  • Tokyo Institute of Technology, Japan: R. Yokota

SLIDE 55

Acknowledgments

Questions?
