SLIDE 1

High Performance Geo-Computing Group

May 8th, 2017, GTC 2017

GPU Acceleration on the 3D Elastic RTM Method

Lin Gan, Tsinghua University

SLIDE 2

GPU Acceleration on Elastic RTM

About Tsinghua HPGC

High Performance Geo-Computing Group

– Interdisciplinary research group
– High performance, high resolution geo-science acceleration

SLIDE 3

About Tsinghua HPGC

High Performance Geo-Computing Group

– Interdisciplinary research group
– High performance, high resolution geo-science acceleration

[Diagram labels: Climate change; Seismic modeling; High Performance Computing; data; computing]

SLIDE 4

About Tsinghua HPGC

High Performance Geo-Computing Group

– Interdisciplinary research group
– High performance, high resolution geo-science acceleration
– The most advanced HPC platforms

Multi-core CPU, many-core GPU & MIC
Reconfigurable data flow engines

– Maxeler DFEs, IBM OpenPower, Intel Xeon+FPGA

Supercomputer

– Tianhe-1A: 7168 CPU-GPU nodes, 4.7 PFlops Rpeak
– Tianhe-2: 16,000 CPU-3MIC nodes, 54.9 PFlops Rpeak
– Tsinghua Explore100: 740 CPU nodes, 4 TFlops Rpeak

– Cooperation and Sponsorship

SLIDE 5

About This Work

HPGC-SEP Summer Exchange Project

– Advisors: Dr. Haohuan Fu, Dr. Robert Clapp, and Prof. Biondo Biondi
– Special thanks to Gustavo Alves and Ettore Biondi

Achievements on GPU

– 10x speedup accelerating a 2D elastic RTM code over 24 CPU cores
– Implementation of a 3D elastic RTM kernel with adjustable interfaces
– 27x speedup accelerating the 3D RTM kernel over 24 CPU cores

SLIDE 6

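3D Elastic RTM Stencils

State variables (data) and the attributes (model)

Data:
– Particle velocities: $V_x, V_y, V_z$
– Normal stresses: $\sigma_{xx}, \sigma_{yy}, \sigma_{zz}$
– Shear stresses: $\sigma_{xy}, \sigma_{xz}, \sigma_{yz}$

Model:
– Density: $\rho = \dfrac{\text{mass}}{\Delta x\,\Delta y\,\Delta z}$, in kg/m$^3$
– Lambda: $\lambda = \rho\,\dfrac{\partial P}{\partial \rho}$, in GPa
– Mu: $\mu = \dfrac{\text{Force}/\text{Area}}{\Delta x/\text{length}}$, in GPa

[Diagram: Forward and Adjoint operators, each linking Data and Model]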

SLIDE 7

3D Elastic RTM Stencils

Forward and Adjoint

[Diagram: Forward runs from t=0 to t=Nt and Adjoint from t=Nt back to t=0, each linking Data and Model; labels: Memory, Δt]

SLIDE 8

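3D Elastic RTM Stencils

Wave Equations

Particle velocities:

$$\frac{\partial V_x(\mathbf{x},t)}{\partial t} = \frac{1}{\rho(\mathbf{x})}\left[\frac{\partial \sigma_{xx}(\mathbf{x},t)}{\partial x} + \frac{\partial \sigma_{xy}(\mathbf{x},t)}{\partial y} + \frac{\partial \sigma_{xz}(\mathbf{x},t)}{\partial z} + S_x(\mathbf{x},t)\right]$$

$$\frac{\partial V_y(\mathbf{x},t)}{\partial t} = \frac{1}{\rho(\mathbf{x})}\left[\frac{\partial \sigma_{xy}(\mathbf{x},t)}{\partial x} + \frac{\partial \sigma_{yy}(\mathbf{x},t)}{\partial y} + \frac{\partial \sigma_{yz}(\mathbf{x},t)}{\partial z} + S_y(\mathbf{x},t)\right]$$

$$\frac{\partial V_z(\mathbf{x},t)}{\partial t} = \frac{1}{\rho(\mathbf{x})}\left[\frac{\partial \sigma_{xz}(\mathbf{x},t)}{\partial x} + \frac{\partial \sigma_{yz}(\mathbf{x},t)}{\partial y} + \frac{\partial \sigma_{zz}(\mathbf{x},t)}{\partial z} + S_z(\mathbf{x},t)\right]$$

Normal stresses:

$$\frac{\partial \sigma_{xx}(\mathbf{x},t)}{\partial t} = \left[\lambda(\mathbf{x}) + 2\mu(\mathbf{x})\right]\frac{\partial V_x(\mathbf{x},t)}{\partial x} + \lambda(\mathbf{x})\left[\frac{\partial V_y(\mathbf{x},t)}{\partial y} + \frac{\partial V_z(\mathbf{x},t)}{\partial z}\right] + S_{xx}(\mathbf{x},t)$$

$$\frac{\partial \sigma_{yy}(\mathbf{x},t)}{\partial t} = \left[\lambda(\mathbf{x}) + 2\mu(\mathbf{x})\right]\frac{\partial V_y(\mathbf{x},t)}{\partial y} + \lambda(\mathbf{x})\left[\frac{\partial V_x(\mathbf{x},t)}{\partial x} + \frac{\partial V_z(\mathbf{x},t)}{\partial z}\right] + S_{yy}(\mathbf{x},t)$$

$$\frac{\partial \sigma_{zz}(\mathbf{x},t)}{\partial t} = \left[\lambda(\mathbf{x}) + 2\mu(\mathbf{x})\right]\frac{\partial V_z(\mathbf{x},t)}{\partial z} + \lambda(\mathbf{x})\left[\frac{\partial V_x(\mathbf{x},t)}{\partial x} + \frac{\partial V_y(\mathbf{x},t)}{\partial y}\right] + S_{zz}(\mathbf{x},t)$$

Shear stresses:

$$\frac{\partial \sigma_{xy}(\mathbf{x},t)}{\partial t} = \mu(\mathbf{x})\left[\frac{\partial V_y(\mathbf{x},t)}{\partial x} + \frac{\partial V_x(\mathbf{x},t)}{\partial y}\right] + S_{xy}(\mathbf{x},t)$$

$$\frac{\partial \sigma_{xz}(\mathbf{x},t)}{\partial t} = \mu(\mathbf{x})\left[\frac{\partial V_z(\mathbf{x},t)}{\partial x} + \frac{\partial V_x(\mathbf{x},t)}{\partial z}\right] + S_{xz}(\mathbf{x},t)$$

$$\frac{\partial \sigma_{yz}(\mathbf{x},t)}{\partial t} = \mu(\mathbf{x})\left[\frac{\partial V_z(\mathbf{x},t)}{\partial y} + \frac{\partial V_y(\mathbf{x},t)}{\partial z}\right] + S_{yz}(\mathbf{x},t)$$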

SLIDE 9

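3D Elastic RTM Stencils

For time: 2nd ord. F.D. approximation
Based on staggered grid
For space: 10th ord. F.D. approximation

[Figure: Stencil, with extents labeled "4 or 5" and "5 or 4", for Forward and Adjoint]

$$V_x^{t+\frac{\Delta t}{2}}(\mathbf{x}) = V_x^{t-\frac{\Delta t}{2}}(\mathbf{x}) + \frac{\Delta t}{\rho(\mathbf{x})}\left[\frac{\partial \sigma_{xx}^{t}(\mathbf{x})}{\partial x} + \frac{\partial \sigma_{xy}^{t}(\mathbf{x})}{\partial y} + \frac{\partial \sigma_{xz}^{t}(\mathbf{x})}{\partial z}\right] + S_x^{t}(\mathbf{x})$$

$$V_y^{t+\frac{\Delta t}{2}}(\mathbf{x}) = V_y^{t-\frac{\Delta t}{2}}(\mathbf{x}) + \frac{\Delta t}{\rho(\mathbf{x})}\left[\frac{\partial \sigma_{xy}^{t}(\mathbf{x})}{\partial x} + \frac{\partial \sigma_{yy}^{t}(\mathbf{x})}{\partial y} + \frac{\partial \sigma_{yz}^{t}(\mathbf{x})}{\partial z}\right] + S_y^{t}(\mathbf{x})$$

$$V_z^{t+\frac{\Delta t}{2}}(\mathbf{x}) = V_z^{t-\frac{\Delta t}{2}}(\mathbf{x}) + \frac{\Delta t}{\rho(\mathbf{x})}\left[\frac{\partial \sigma_{xz}^{t}(\mathbf{x})}{\partial x} + \frac{\partial \sigma_{yz}^{t}(\mathbf{x})}{\partial y} + \frac{\partial \sigma_{zz}^{t}(\mathbf{x})}{\partial z}\right] + S_z^{t}(\mathbf{x})$$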
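The 10th-order staggered-grid space derivative named on this slide can be illustrated with a minimal Python sketch. It assumes the standard Taylor-series coefficients for a staggered first derivative; the slides do not list the coefficients actually used, and the names `staggered_coeffs` and `staggered_derivative` are hypothetical.

```python
from fractions import Fraction
import math


def staggered_coeffs(half_width):
    """Taylor-series coefficients c_1..c_M of a staggered first derivative.

    The resulting stencil has 2*M points and formal order 2*M
    (M = 5 gives the 10th-order case). This is an assumed, textbook choice.
    """
    coeffs = []
    for m in range(1, half_width + 1):
        a = 2 * m - 1
        prod = Fraction(1)
        for n in range(1, half_width + 1):
            if n != m:
                b = 2 * n - 1
                prod *= Fraction(b * b, abs(b * b - a * a))
        sign = 1 if m % 2 == 1 else -1
        coeffs.append(sign * prod / a)
    return coeffs


def staggered_derivative(f, x, h, coeffs):
    """Approximate f'(x) from samples at x +/- (m - 1/2) * h."""
    total = 0.0
    for m, c in enumerate(coeffs, start=1):
        offset = (m - 0.5) * h
        total += float(c) * (f(x + offset) - f(x - offset))
    return total / h


COEFFS_10TH = staggered_coeffs(5)

if __name__ == "__main__":
    print([float(c) for c in COEFFS_10TH])
    approx = staggered_derivative(math.sin, 0.3, 0.1, COEFFS_10TH)
    print(approx, math.cos(0.3))
```

Each derivative needs five samples on either side of the evaluation point, which is why a wide stencil like this one drives both the register pressure and the halo width discussed on the later slides.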

SLIDE 10

3D Elastic RTM Stencils

GPU Optimizations

– K40 GPU, (200*200*200)*1000ts

Configuration of different blk sizes, reg. per blk
Best: blk ← 20*20; max reg. ← 56
Variable data into L1/SM, Constant data into Read-only Cache

– Dynamic Pointer Switch & Minimum Data Cubes

Only malloc data cubes covering three steps

[Diagram: three data cubes for time steps t-1, t, and t+1, each holding $V_x$, $\sigma_{yz}$, …]
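A rough back-of-the-envelope calculation shows why allocating only three time levels matters. The sketch below assumes single-precision values (4 bytes), nine state variables per grid point (three velocities and six stresses), and no halo padding; none of these figures are stated on the slide, and `wavefield_bytes` is a hypothetical helper.

```python
def wavefield_bytes(nx, ny, nz, n_fields=9, n_levels=3, bytes_per_value=4):
    """Bytes needed to hold n_levels time levels of n_fields wavefields.

    Assumed defaults: 9 fields, 3 time levels, 4-byte floats.
    """
    return nx * ny * nz * n_fields * n_levels * bytes_per_value


if __name__ == "__main__":
    three_levels = wavefield_bytes(200, 200, 200)
    all_steps = wavefield_bytes(200, 200, 200, n_levels=1000)
    print(f"three time levels: {three_levels / 1e6:.0f} MB")
    print(f"all 1000 steps:    {all_steps / 1e9:.0f} GB")
```

Under these assumptions the three-level working set is well under the 12 GB of device memory on a K40, whereas keeping every time step resident would not fit.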

SLIDE 11

3D Elastic RTM Stencils

GPU Optimizations

– Dynamic Pointer Switch & Minimum Data Cubes

[Diagram: the three data cubes are labeled pre, cur, next]

SLIDE 12

3D Elastic RTM Stencils

GPU Optimizations

– Dynamic Pointer Switch & Minimum Data Cubes

[Diagram: the same three data cubes are relabeled cur, next, pre]

SLIDE 13

3D Elastic RTM Stencils

GPU Optimizations

– Dynamic Pointer Switch & Minimum Data Cubes

[Diagram: the same three data cubes are relabeled next, pre, cur; labels: Memory, Δt]
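The pointer switch can be sketched in a few lines of Python. This is only an illustration of the buffer rotation, not the elastic kernel: it uses a 1D scalar wave update as a stand-in, and `run_steps` and `courant_sq` are hypothetical names.

```python
def run_steps(n_cells, n_steps, courant_sq=0.25):
    """Advance a 1D scalar wave (stand-in stencil) with three rotating buffers."""
    # Only three buffers are ever allocated, one per time level.
    pre = [0.0] * n_cells
    cur = [0.0] * n_cells
    nxt = [0.0] * n_cells
    cur[n_cells // 2] = 1.0  # initial impulse in the middle of the domain
    allocated = {id(pre), id(cur), id(nxt)}

    for _ in range(n_steps):
        for i in range(1, n_cells - 1):
            lap = cur[i - 1] - 2.0 * cur[i] + cur[i + 1]
            nxt[i] = 2.0 * cur[i] - pre[i] + courant_sq * lap
        # Pointer switch: no data is copied, only the names rotate.
        pre, cur, nxt = cur, nxt, pre

    reused = allocated == {id(pre), id(cur), id(nxt)}
    return cur, reused


if __name__ == "__main__":
    field, reused = run_steps(21, 5)
    print("same three buffers reused:", reused)
```

The oldest buffer becomes the target of the next update, so memory use stays constant no matter how many time steps are run.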

SLIDE 14

3D Elastic RTM Stencils

GPU Optimizations

– Multiple GPUs

[Figure: x-y-z volume with stencil extents labeled "4 or 5" and "5 or 4"]

SLIDE 16

3D Elastic RTM Stencils

GPU Optimizations

– Multiple GPUs

[Figure: x-y-z volume split into an Internal region with a halo on each side; stencil extents labeled "4 or 5" and "5 or 4"]

SLIDE 17

3D Elastic RTM Stencils

GPU Optimizations

– Multiple GPUs: GPU 0, GPU 1, GPU 2

[Diagram: each GPU holds an Internal region, with a halo on every side that borders a neighboring GPU]

GPU Algorithm per Stencil sweep

For each subdomain:
① Calculate RTM stencil
② Update halo
③ Add source
④ Switch pointers

[Workflow diagram: Stencil Computing, then Updating]

SLIDE 18

3D Elastic RTM Stencils

GPU Optimizations

– Multiple GPUs: GPU 0, GPU 1, GPU 2

[Diagram: each GPU holds an Internal region, with a halo on every side that borders a neighboring GPU]

GPU Algorithm per Stencil sweep

For each subdomain:
① Calculate halo RTM stencil
② Calculate internal RTM stencil
③ Update halo
④ Add source
⑤ Switch pointers

[Overlapping workflow diagram: Halo computing first, then Internal computing overlapped with Updating]
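The reordered sweep can be checked on a toy problem. The Python sketch below is a serial simulation under stated assumptions: a 1D three-point averaging stencil stands in for the RTM stencil, the halo is one cell wide (the 10th-order stencil in the deck would need a wider one), and all function names are hypothetical. It only demonstrates that computing the halo-adjacent cells first makes the halo update independent of the internal computation, which is what would allow the two to overlap on real GPUs.

```python
HALO = 1  # assumed halo width for the 3-point stand-in stencil


def step_single(u):
    """Reference: one sweep over the whole domain with fixed end values."""
    out = u[:]
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def split(u, parts):
    """Split u into equal subdomains, each padded with HALO ghost cells per side."""
    size = len(u) // parts
    subs = []
    for p in range(parts):
        lo, hi = p * size, (p + 1) * size
        left = u[lo - HALO:lo] if p > 0 else [0.0] * HALO
        right = u[hi:hi + HALO] if p < parts - 1 else [0.0] * HALO
        subs.append(left + u[lo:hi] + right)
    return subs


def update_cells(sub, out, cells, is_first, is_last):
    """Apply the stencil to the given owned cells (0-based, excluding ghosts)."""
    n = len(sub) - 2 * HALO
    for i in cells:
        j = i + HALO
        if (is_first and i == 0) or (is_last and i == n - 1):
            out[j] = sub[j]  # fixed physical boundary
        else:
            out[j] = (sub[j - 1] + sub[j] + sub[j + 1]) / 3.0


def step_decomposed(subs):
    parts = len(subs)
    n = len(subs[0]) - 2 * HALO
    outs = [s[:] for s in subs]
    edge = list(range(HALO)) + list(range(n - HALO, n))
    inner = list(range(HALO, n - HALO))
    # Calculate the halo-adjacent cells first.
    for p in range(parts):
        update_cells(subs[p], outs[p], edge, p == 0, p == parts - 1)
    # Update halos. This depends only on the edge cells, so on a GPU the
    # copy could run concurrently with the internal computation below.
    for p in range(parts - 1):
        outs[p][n + HALO:] = outs[p + 1][HALO:2 * HALO]
        outs[p + 1][:HALO] = outs[p][n:n + HALO]
    # Calculate the internal cells.
    for p in range(parts):
        update_cells(subs[p], outs[p], inner, p == 0, p == parts - 1)
    return outs


def gather(subs):
    """Concatenate the owned cells of all subdomains."""
    out = []
    for s in subs:
        out.extend(s[HALO:len(s) - HALO])
    return out
```

Running several sweeps on three subdomains and comparing `gather(...)` against `step_single` gives the same field, so the reordering changes only when the halo exchange can start, not the result.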

SLIDE 19

3D Elastic RTM Stencils

Validation and Performance

– GPU Cluster in SEP

4 K40 GPUs vs. a 24-core CPU (OpenMP)
200*200*200 grid, 1000 time steps (recorded every 100 steps)

[Figure panels: Vx, Vy]

SLIDE 28

3D Elastic RTM Stencils

Validation and Performance

– GPU Cluster in SEP

4 K40 GPUs vs. a 24-core CPU (OpenMP)
200*200*200 grid, 1000 time steps (recorded every 100 steps)

Device    Speedup (Comp.)    Speedup (Comp.+Comm.)
1 GPU     8.85               8.36
2 GPUs    17.70              16.15
4 GPUs    32.67              27.31
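The multi-GPU scaling implied by these speedups can be worked out directly. The Python sketch below uses only the numbers reported on this slide; the efficiency definition (speedup on N GPUs divided by N times the single-GPU speedup) is a common convention assumed here, not one stated in the deck, and `scaling_efficiency` is a hypothetical helper.

```python
# Speedups over the 24-core CPU baseline, as reported on the slide:
# GPUs -> (computation only, computation + communication)
SPEEDUPS = {
    1: (8.85, 8.36),
    2: (17.70, 16.15),
    4: (32.67, 27.31),
}


def scaling_efficiency(speedups):
    """Return {gpus: (comp_efficiency, comp_plus_comm_efficiency)}."""
    base_comp, base_total = speedups[1]
    result = {}
    for gpus, (comp, total) in sorted(speedups.items()):
        result[gpus] = (comp / (gpus * base_comp), total / (gpus * base_total))
    return result


if __name__ == "__main__":
    for gpus, (comp_eff, total_eff) in scaling_efficiency(SPEEDUPS).items():
        print(f"{gpus} GPU(s): comp {comp_eff:.1%}, comp+comm {total_eff:.1%}")
```

With these figures, four GPUs reach roughly 92% efficiency on computation alone and roughly 82% once communication is included.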

SLIDE 29

2D Elastic RTM Stencils

Elastic Wave Equations

– For time: 2nd ord. F.D. approximation
– For space: 10th ord. F.D. approximation

SLIDE 31

2D Elastic RTM Stencils

Elastic Wave Equations

– For time: 2nd ord. F.D. approximation
– For space: 10th ord. F.D. approximation

Performance

– K40 GPU, (300*300)*3525ts

Configuration between SM/L1, blk sizes, reg. per blk
Variable data into SMs / constant data into Read-only Cache
Best: blk ← 20*20; max reg. ← 56

[Chart: CPU single serial, CPU, and GPU; labeled speedups: 101x and 10x]

SLIDE 32

High Performance Geo-Computing Group

GPU Acceleration on the 3D Elastic RTM Method

Lin Gan
Center for Earth System Science, Department of Computer Science, Tsinghua University, Beijing
+86 15810537953
l-gan11@mails.tsinghua.edu.cn
www.thuhpgc.org
