GPU Acceleration on the 3D Elastic RTM Method, Lin Gan, Tsinghua University


slide-1
SLIDE 1

High Performance Geo-Computing Group

May 8th, 2017, GTC 2017

GPU Acceleration on the 3D Elastic RTM Method

Lin Gan, Tsinghua University

slide-2
SLIDE 2

GPU Acceleration on Elastic RTM

About Tsinghua HPGC

  • High Performance Geo-Computing Group

– Interdisciplinary research group
– High performance, high resolution geo-science acceleration

slide-3
SLIDE 3

About Tsinghua HPGC

  • High Performance Geo-Computing Group

– Interdisciplinary research group
– High performance, high resolution geo-science acceleration

[Diagram labels: Climate change, Seismic modeling, High Performance Computing, data computing]

slide-4
SLIDE 4

About Tsinghua HPGC

  • High Performance Geo-Computing Group

– Interdisciplinary research group
– High performance, high resolution geo-science acceleration
– The most advanced HPC platforms

  • Multi-core CPU, many-core GPU & MIC
  • Reconfigurable data flow engines

– Maxeler DFEs, IBM OpenPower, Intel Xeon+FPGA

  • Supercomputer

– Tianhe-1A: 7168 CPU-GPU nodes, 4.7PFlops Rpeak
– Tianhe-2: 16,000 CPU-3MIC nodes, 54.9PFlops Rpeak
– Tsinghua Explore100: 740 CPU nodes, 4TFlops Rpeak

– Cooperation and Sponsorship

slide-5
SLIDE 5

About This Work

  • HPGC-SEP Summer Exchange Project

– Advisors: Dr. Haohuan Fu, Dr. Robert Clapp, and Prof. Biondo Biondi
– Special thanks to Gustavo Alves and Ettore Biondi

  • Achievements on GPU

– 10x speedup for a 2D elastic RTM code over 24 CPU cores
– Implementation of a 3D elastic RTM kernel with adjustable interfaces
– 27x speedup for the 3D RTM kernel over 24 CPU cores

slide-6
SLIDE 6

3D Elastic RTM Stencils

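  • State variables (data) and the attributes (model)

Data (state variables):

– Particle velocities: $v_x, v_y, v_z$
– Normal stresses: $\sigma_{xx}, \sigma_{yy}, \sigma_{zz}$
– Shear stresses: $\sigma_{xy}, \sigma_{xz}, \sigma_{yz}$

Model (attributes):

– Density: $\rho = \dfrac{\text{mass}}{\Delta x\,\Delta y\,\Delta z}$ [kg/m$^3$]
– Lambda: $\lambda = \rho\,\dfrac{\partial P}{\partial \rho}$ [GPa]
– Mu: $\mu = \dfrac{\text{Force}/\text{Area}}{\Delta x/\text{length}}$ [GPa]

[Diagram: Data and Model linked by the Forward and Adjoint operators]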

slide-7
SLIDE 7

3D Elastic RTM Stencils

  • Forward and Adjoint

[Diagram: Data and Model linked by Forward and Adjoint; Forward runs from t=0 to t=Nt and Adjoint from t=Nt to t=0, in steps of $\Delta t$, with Memory between them]
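A common way to check that an adjoint code really is the transpose of its forward code is the dot-product test, which compares the inner products of Forward(m) with d and of m with Adjoint(d). The slides do not describe such a check; the sketch below is only an illustration on a toy first-difference operator, and the names `forward`, `adjoint` and `dot_product_test` are hypothetical.

```python
import random

def forward(model):
    """Toy forward operator: first difference, d[i] = m[i+1] - m[i]."""
    return [model[i + 1] - model[i] for i in range(len(model) - 1)]

def adjoint(data):
    """Transpose of `forward`: scatter each d[i] back with opposite signs."""
    model = [0.0] * (len(data) + 1)
    for i, d in enumerate(data):
        model[i + 1] += d
        model[i] -= d
    return model

def dot(a, b):
    """Plain inner product of two equally long lists."""
    return sum(x * y for x, y in zip(a, b))

def dot_product_test(n=50, seed=0):
    """Return (<F m, d>, <m, F^T d>) for random m and d; the two should agree."""
    rng = random.Random(seed)
    m = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    d = [rng.uniform(-1.0, 1.0) for _ in range(n - 1)]
    return dot(forward(m), d), dot(m, adjoint(d))

lhs, rhs = dot_product_test()
print(abs(lhs - rhs))  # expected to be at round-off level
```

For a real RTM kernel the same comparison would be run on the full wave-propagation operators, which is far more expensive than this toy case.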

slide-8
SLIDE 8

3D Elastic RTM Stencils

  • Wave Equations

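Particle velocities:

$$
\begin{aligned}
\frac{\partial V_x(\mathbf{x},t)}{\partial t} &= \frac{1}{\rho(\mathbf{x})}\left[\frac{\partial \sigma_{xx}(\mathbf{x},t)}{\partial x} + \frac{\partial \sigma_{xy}(\mathbf{x},t)}{\partial y} + \frac{\partial \sigma_{xz}(\mathbf{x},t)}{\partial z} + S_x(\mathbf{x},t)\right] \\
\frac{\partial V_y(\mathbf{x},t)}{\partial t} &= \frac{1}{\rho(\mathbf{x})}\left[\frac{\partial \sigma_{xy}(\mathbf{x},t)}{\partial x} + \frac{\partial \sigma_{yy}(\mathbf{x},t)}{\partial y} + \frac{\partial \sigma_{yz}(\mathbf{x},t)}{\partial z} + S_y(\mathbf{x},t)\right] \\
\frac{\partial V_z(\mathbf{x},t)}{\partial t} &= \frac{1}{\rho(\mathbf{x})}\left[\frac{\partial \sigma_{xz}(\mathbf{x},t)}{\partial x} + \frac{\partial \sigma_{yz}(\mathbf{x},t)}{\partial y} + \frac{\partial \sigma_{zz}(\mathbf{x},t)}{\partial z} + S_z(\mathbf{x},t)\right]
\end{aligned}
$$

Normal stresses:

$$
\begin{aligned}
\frac{\partial \sigma_{xx}(\mathbf{x},t)}{\partial t} &= \left[\lambda(\mathbf{x}) + 2\mu(\mathbf{x})\right]\frac{\partial V_x(\mathbf{x},t)}{\partial x} + \lambda(\mathbf{x})\left[\frac{\partial V_y(\mathbf{x},t)}{\partial y} + \frac{\partial V_z(\mathbf{x},t)}{\partial z}\right] + S_{xx}(\mathbf{x},t) \\
\frac{\partial \sigma_{yy}(\mathbf{x},t)}{\partial t} &= \left[\lambda(\mathbf{x}) + 2\mu(\mathbf{x})\right]\frac{\partial V_y(\mathbf{x},t)}{\partial y} + \lambda(\mathbf{x})\left[\frac{\partial V_x(\mathbf{x},t)}{\partial x} + \frac{\partial V_z(\mathbf{x},t)}{\partial z}\right] + S_{yy}(\mathbf{x},t) \\
\frac{\partial \sigma_{zz}(\mathbf{x},t)}{\partial t} &= \left[\lambda(\mathbf{x}) + 2\mu(\mathbf{x})\right]\frac{\partial V_z(\mathbf{x},t)}{\partial z} + \lambda(\mathbf{x})\left[\frac{\partial V_x(\mathbf{x},t)}{\partial x} + \frac{\partial V_y(\mathbf{x},t)}{\partial y}\right] + S_{zz}(\mathbf{x},t)
\end{aligned}
$$

Shear stresses:

$$
\begin{aligned}
\frac{\partial \sigma_{xy}(\mathbf{x},t)}{\partial t} &= \mu(\mathbf{x})\left[\frac{\partial V_y(\mathbf{x},t)}{\partial x} + \frac{\partial V_x(\mathbf{x},t)}{\partial y}\right] + S_{xy}(\mathbf{x},t) \\
\frac{\partial \sigma_{xz}(\mathbf{x},t)}{\partial t} &= \mu(\mathbf{x})\left[\frac{\partial V_z(\mathbf{x},t)}{\partial x} + \frac{\partial V_x(\mathbf{x},t)}{\partial z}\right] + S_{xz}(\mathbf{x},t) \\
\frac{\partial \sigma_{yz}(\mathbf{x},t)}{\partial t} &= \mu(\mathbf{x})\left[\frac{\partial V_z(\mathbf{x},t)}{\partial y} + \frac{\partial V_y(\mathbf{x},t)}{\partial z}\right] + S_{yz}(\mathbf{x},t)
\end{aligned}
$$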

slide-9
SLIDE 9

3D Elastic RTM Stencils

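  • For time: 2nd-order finite-difference (F.D.) approximation
  • Based on staggered grid
  • For space: 10th-order finite-difference (F.D.) approximation

[Diagram: Stencil for Forward and Adjoint, reaching 4 or 5 points on one side and 5 or 4 on the other]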
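As an illustration of the 10th-order staggered-grid approximation named above, the sketch below derives the five stencil weights by matching Taylor-series terms and then applies them to a test function. The slides do not list the weights or say how they were obtained; the function names `staggered_coeffs` and `staggered_derivative` are hypothetical, and exact rational arithmetic is used only to keep the derivation transparent.

```python
from fractions import Fraction
import math

def staggered_coeffs(half_width=5):
    """Weights c_m for f'(x) ~ (1/h) * sum_m c_m [f(x+(m-1/2)h) - f(x-(m-1/2)h)].

    They satisfy sum_m c_m (2m-1)^(2k-1) = 1 if k == 1 else 0, for k = 1..half_width,
    which cancels every odd Taylor term up to order 2*half_width - 1.
    """
    n = half_width
    # Augmented matrix [A | b], solved by Gauss-Jordan elimination.
    rows = [[Fraction((2 * m - 1) ** (2 * k - 1)) for m in range(1, n + 1)]
            + [Fraction(1 if k == 1 else 0)] for k in range(1, n + 1)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[m][n] for m in range(n)]

def staggered_derivative(f, x, h, coeffs):
    """Apply the staggered stencil to a callable f at position x with spacing h."""
    return sum(float(c) * (f(x + (m + 0.5) * h) - f(x - (m + 0.5) * h))
               for m, c in enumerate(coeffs)) / h

coeffs = staggered_coeffs(5)          # 5 weights per side, 10 samples in total
approx = staggered_derivative(math.sin, 0.3, 0.1, coeffs)
print([float(c) for c in coeffs], approx - math.cos(0.3))
```

The stencil reads 10 samples placed symmetrically around a half-grid position, which is one plausible reading of the "4 or 5 / 5 or 4" reach in the diagram when counted from an integer grid point.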

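$$
\begin{aligned}
V_x^{t+\frac{\Delta t}{2}}(\mathbf{x}) &= V_x^{t-\frac{\Delta t}{2}}(\mathbf{x}) + \frac{\Delta t}{\rho(\mathbf{x})}\left[\frac{\partial \sigma_{xx}^{t}(\mathbf{x})}{\partial x} + \frac{\partial \sigma_{xy}^{t}(\mathbf{x})}{\partial y} + \frac{\partial \sigma_{xz}^{t}(\mathbf{x})}{\partial z}\right] + S_x^{t}(\mathbf{x}) \\
V_y^{t+\frac{\Delta t}{2}}(\mathbf{x}) &= V_y^{t-\frac{\Delta t}{2}}(\mathbf{x}) + \frac{\Delta t}{\rho(\mathbf{x})}\left[\frac{\partial \sigma_{xy}^{t}(\mathbf{x})}{\partial x} + \frac{\partial \sigma_{yy}^{t}(\mathbf{x})}{\partial y} + \frac{\partial \sigma_{yz}^{t}(\mathbf{x})}{\partial z}\right] + S_y^{t}(\mathbf{x}) \\
V_z^{t+\frac{\Delta t}{2}}(\mathbf{x}) &= V_z^{t-\frac{\Delta t}{2}}(\mathbf{x}) + \frac{\Delta t}{\rho(\mathbf{x})}\left[\frac{\partial \sigma_{xz}^{t}(\mathbf{x})}{\partial x} + \frac{\partial \sigma_{yz}^{t}(\mathbf{x})}{\partial y} + \frac{\partial \sigma_{zz}^{t}(\mathbf{x})}{\partial z}\right] + S_z^{t}(\mathbf{x})
\end{aligned}
$$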
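The staggered time update can be illustrated in one dimension. The following is a minimal sketch under strong simplifications that are not taken from the slides: a single velocity and a single stress component, a 2nd-order difference in space instead of the 10th-order one, homogeneous material, no source term, and hypothetical parameter values. The function name `leapfrog_1d` is also hypothetical.

```python
import math

def leapfrog_1d(n=400, steps=200, dx=1.0, dt=0.5, rho=1.0, modulus=1.0):
    """Advance a 1D velocity-stress pair with a staggered leapfrog scheme.

    v[i] lives at x = i*dx and at half-integer time levels;
    s[i] lives at x = (i + 0.5)*dx and at integer time levels.
    `modulus` plays the role of lambda + 2*mu.
    """
    v = [0.0] * n
    centre = 0.5 * n * dx
    # Initial stress: a Gaussian pulse in the middle of the domain.
    s = [math.exp(-(((i + 0.5) * dx - centre) / 10.0) ** 2) for i in range(n - 1)]
    for _ in range(steps):
        # Velocity from t - dt/2 to t + dt/2, driven by the stress gradient at t.
        for i in range(1, n - 1):
            v[i] += dt / rho * (s[i] - s[i - 1]) / dx
        # Stress from t to t + dt, driven by the velocity gradient at t + dt/2.
        for i in range(n - 1):
            s[i] += dt * modulus * (v[i + 1] - v[i]) / dx
    return v, s

v, s = leapfrog_1d()
peak = max(range(len(s)), key=lambda i: s[i])
print(peak, s[peak])
```

With these values the wave speed is sqrt(modulus / rho) = 1 grid cell per unit time, so after 200 steps of 0.5 the initial pulse should have split into two halves roughly 100 cells on either side of the centre.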

slide-10
SLIDE 10

3D Elastic RTM Stencils

  • GPU Optimizations

– K40 GPU, (200*200*200)*1000ts

  • Configuration of different block sizes and registers per block
  • Best: block size ← 20*20; max registers ← 56
  • Variable data into L1/shared memory, constant data into the read-only cache

– Dynamic Pointer Switch & Minimum Data Cubes

  • Only malloc data cubes covering three time steps

[Diagram: three data cubes for t-1, t and t+1, each holding $v_x$, $\sigma_{yz}$, …; the pre/cur/next labels rotate among the cubes every $\Delta t$: (pre, cur, next), (cur, next, pre), (next, pre, cur); Memory]
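The pointer switch can be sketched as a rotation of three buffer references, so that no wavefield data is copied between time steps. This is only an illustration in plain Python lists, not the CUDA code behind the slides; the class `TimeRing` and the helper `footprint_bytes` are hypothetical, and the footprint estimate assumes 9 state fields stored as 4-byte floats.

```python
class TimeRing:
    """Three reusable buffers labelled pre, cur and next."""

    def __init__(self, n):
        self.pre, self.cur, self.next = ([0.0] * n for _ in range(3))

    def rotate(self):
        # Swap references only: old cur becomes pre, old next becomes cur,
        # and the old pre buffer is recycled as the new next.
        self.pre, self.cur, self.next = self.cur, self.next, self.pre

def footprint_bytes(nx, ny, nz, fields, time_levels, bytes_per_value=4):
    """Memory needed to hold `fields` cubes for `time_levels` time steps."""
    return nx * ny * nz * fields * time_levels * bytes_per_value

ring = TimeRing(4)
for step in range(3):
    ring.next[:] = [step + 1.0] * 4   # stand-in for the stencil writing t+1
    ring.rotate()

three_levels = footprint_bytes(200, 200, 200, fields=9, time_levels=3)
all_levels = footprint_bytes(200, 200, 200, fields=9, time_levels=1000)
print(ring.pre, ring.cur, three_levels, all_levels)
```

Under these assumptions, keeping three time levels of a 200*200*200 grid takes well under 1 GB, whereas keeping all 1000 steps resident would take hundreds of GB, which is why only the cubes for three steps are allocated.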

slide-14
SLIDE 14

3D Elastic RTM Stencils

  • GPU Optimizations

– Multiple GPUs

[Diagram: data cube with x, y, z axes, divided into an Internal region and halo regions; stencil reach of 4 or 5 points on one side and 5 or 4 on the other]

slide-17
SLIDE 17

3D Elastic RTM Stencils

  • GPU Optimizations

– Multiple GPUs

[Diagram: GPU 0, GPU 1, GPU 2, each holding an Internal region with halo regions at the shared boundaries]

GPU Algorithm per Stencil sweep

For each subdomain
① Calculate RTM stencil
② Update Halo
③ Add Source
④ Switch Pointer

[Workflow diagram: Stencil Computing, then Updating]

slide-18
SLIDE 18

3D Elastic RTM Stencils

  • GPU Optimizations

– Multiple GPUs

[Diagram: GPU 0, GPU 1, GPU 2, each holding an Internal region with halo regions at the shared boundaries]

GPU Algorithm per Stencil sweep

For each subdomain
① Calculate halo RTM stencil
② Calculate Internal RTM stencil
③ Update Halo
④ Add Source
⑤ Switch Pointers

[Overlapping workflow diagram: Halo computing first, then Internal computing alongside halo Updating]
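The ordering of the overlapped sweep can be sketched in one dimension: each subdomain first updates the cells its neighbours need, then its internal cells, and finally the halos are refreshed. This is a sequential Python illustration only; on GPUs the halo transfer would run concurrently with the internal computation, which plain Python does not show. The averaging stencil, the halo width `R = 2` and all function names are hypothetical, and the source and pointer-switch steps are left out.

```python
R = 2  # halo width (hypothetical; a 10th-order stencil would need about 5)

def reference_sweep(u):
    """Single-domain sweep of a toy (2R+1)-point averaging stencil."""
    padded = [0.0] * R + u + [0.0] * R
    return [sum(padded[i:i + 2 * R + 1]) / (2 * R + 1) for i in range(len(u))]

def exchange_halos(subs):
    """Copy each subdomain's outermost owned cells into its neighbour's halo."""
    for left, right in zip(subs, subs[1:]):
        left[-R:] = right[R:2 * R]
        right[:R] = left[-2 * R:-R]

def split(u, parts):
    """Cut u into equal subdomains, each padded with R halo cells per side."""
    n = len(u) // parts
    subs = [[0.0] * R + u[p * n:(p + 1) * n] + [0.0] * R for p in range(parts)]
    exchange_halos(subs)
    return subs

def overlapped_sweep(subs):
    """One sweep: boundary cells, then internal cells, then halo update."""
    n = len(subs[0]) - 2 * R                   # owned cells per subdomain
    new = [s[:] for s in subs]                 # stand-in for the "next" buffers
    edge = list(range(R, 2 * R)) + list(range(n, n + R))
    inner = range(2 * R, n)

    def update(p, cells):
        for i in cells:
            new[p][i] = sum(subs[p][i - R:i + R + 1]) / (2 * R + 1)

    for p in range(len(subs)):
        update(p, edge)                        # cells the neighbours will need
    for p in range(len(subs)):
        update(p, inner)                       # internal cells
    exchange_halos(new)                        # refresh halos for the next sweep
    return new

def join(subs):
    """Strip the halos and concatenate the owned cells."""
    return [x for s in subs for x in s[R:-R]]

u = [float(i % 7) for i in range(24)]
ref, subs = u[:], split(u, 3)
for _ in range(5):
    ref = reference_sweep(ref)
    subs = overlapped_sweep(subs)
print(max(abs(a - b) for a, b in zip(ref, join(subs))))
```

Because the boundary cells are finished before the internal ones, their values are available for transfer while the bulk of the subdomain is still being computed.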

slide-19
SLIDE 19

  • Validation and Performance

– GPU Cluster in SEP

  • 4 K40 GPUs compared with a 24-core CPU (OpenMP)
  • 200*200*200 grid, 1000 time steps (recorded every 100 steps)

3D Elastic RTM Stencils

[Figure panels: Vx, Vy]

slide-28
SLIDE 28

  • Validation and Performance

– GPU Cluster in SEP

  • 4 K40 GPUs compared with a 24-core CPU (OpenMP)
  • 200*200*200 grid, 1000 time steps (recorded every 100 steps)

3D Elastic RTM Stencils

Device   Speedup (Comp.)   Speedup (Comp.+Comm.)
1 GPU    8.85              8.36
2 GPU    17.70             16.15
4 GPU    32.67             27.31
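The reported speedups can be turned into scaling figures with a little arithmetic. The sketch below is a derived reading of the table rather than a result stated on the slides: it assumes "Comp." means computation only and "Comp.+Comm." means computation plus communication, and the names `parallel_efficiency` and `comm_overhead` are hypothetical.

```python
# Speedups over the 24-core CPU run, copied from the table:
# number of GPUs -> (computation only, computation + communication)
SPEEDUPS = {1: (8.85, 8.36), 2: (17.70, 16.15), 4: (32.67, 27.31)}

def parallel_efficiency(speedups, column):
    """Speedup on g GPUs divided by g times the 1-GPU speedup."""
    base = speedups[1][column]
    return {g: s[column] / (base * g) for g, s in speedups.items()}

def comm_overhead(speedups):
    """Fraction of the computation-only speedup lost once communication is included."""
    return {g: 1.0 - total / comp for g, (comp, total) in speedups.items()}

for g in SPEEDUPS:
    print(g,
          round(parallel_efficiency(SPEEDUPS, 0)[g], 3),
          round(parallel_efficiency(SPEEDUPS, 1)[g], 3),
          round(comm_overhead(SPEEDUPS)[g], 3))
```

Read this way, the computation-only figures scale almost linearly up to 2 GPUs, while on 4 GPUs the efficiency including communication drops to roughly 0.82.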

slide-29
SLIDE 29

2D Elastic RTM Stencils

  • Elastic Wave Equations

– For time: 2nd-order finite-difference (F.D.) approximation
– For space: 10th-order finite-difference (F.D.) approximation

  • Performance

– K40 GPU, (300*300)*3525ts

  • Configuration of the shared memory/L1 split, block sizes, and registers per block
  • Variable data into shared memory, constant data into the read-only cache
  • Best: block size ← 20*20; max registers ← 56

[Figure: CPU and GPU results; GPU speedup labels: 101x over a single serial CPU run, and 10x (over 24 CPU cores, as stated under About This Work)]

slide-32
SLIDE 32

High Performance Geo-Computing Group

Lin Gan
Center for Earth System Science, Department of Computer Science, Tsinghua University, Beijing
+86 15810537953
l-gan11@mails.tsinghua.edu.cn
www.thuhpgc.org

GPU Acceleration on the 3D Elastic RTM Method

slide-33
SLIDE 33

3D Elastic RTM Stencils

  • For time: 2nd-order finite-difference (F.D.) approximation

[Diagram labels: Forward, Adjoint]
