High Performance Geo-Computing Group
GPU Acceleration on the 3D Elastic RTM Method
Lin Gan, Tsinghua University
May 8, 2017, GTC 2017
GPU Acceleration on Elastic RTM
About Tsinghua HPGC
- High Performance Geo-Computing Group
– Interdisciplinary research group
– High performance, high resolution geo-science acceleration
[Diagram: research areas: climate change, seismic modeling, high performance computing, data computing]
– The most advanced HPC platforms
- Multi-core CPU, many-core GPU & MIC
- Reconfigurable data flow engines
– Maxeler DFEs, IBM OpenPower, Intel Xeon+FPGA
- Supercomputer
– Tianhe-1A: 7168 CPU-GPU nodes, 4.7PFlops Rpeak
– Tianhe-2: 16,000 CPU-3MIC nodes, 54.9PFlops Rpeak
– Tsinghua Explore100: 740 CPU nodes, 4TFlops Rpeak
– Cooperation and Sponsorship
About This Work
- HPGC-SEP Summer Exchange Project
– Advisors: Dr. Haohuan Fu, Dr. Robert Clapp, and Prof. Biondo Biondi
– Special thanks to Gustavo Alves and Ettore Biondi
- Achievements on GPU
– 10x speedup accelerating a 2D elastic RTM code over 24 CPU cores
– Implementation of a 3D elastic RTM kernel with adjustable interfaces
– 27x speedup accelerating the 3D RTM kernel over 24 CPU cores
3D Elastic RTM Stencils
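- State variables (data) and the attributes (model)
– Data (state variables)
- Particle velocities: $v_x, v_y, v_z$
- Normal stresses: $\sigma_{xx}, \sigma_{yy}, \sigma_{zz}$
- Shear stresses: $\sigma_{xy}, \sigma_{xz}, \sigma_{yz}$
– Model (attributes)
- Density: $\rho = \dfrac{\text{mass}}{\Delta x\,\Delta y\,\Delta z}$ [kg/m$^3$]
- Lambda: $\lambda = \rho\,\dfrac{\partial P}{\partial \rho}$ [GPa]
- Mu: $\mu = \dfrac{\text{Force}/\text{Area}}{\Delta x/\text{length}}$ [GPa]
[Diagram: Data and Model boxes linked by the Forward and Adjoint operators]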
- Forward and Adjoint
[Diagram: Data and Model under the Forward and Adjoint operators; one time axis runs from t=0 to t=Nt and the other from t=Nt back to t=0, with a Memory block whose entries are spaced by $\Delta t$]
- Wave Equations
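– Particle velocities

$$\frac{\partial V_x(\mathbf{x},t)}{\partial t} = \frac{1}{\rho(\mathbf{x})}\left[\frac{\partial \sigma_{xx}(\mathbf{x},t)}{\partial x} + \frac{\partial \sigma_{xy}(\mathbf{x},t)}{\partial y} + \frac{\partial \sigma_{xz}(\mathbf{x},t)}{\partial z} + S_x(\mathbf{x},t)\right]$$

$$\frac{\partial V_y(\mathbf{x},t)}{\partial t} = \frac{1}{\rho(\mathbf{x})}\left[\frac{\partial \sigma_{xy}(\mathbf{x},t)}{\partial x} + \frac{\partial \sigma_{yy}(\mathbf{x},t)}{\partial y} + \frac{\partial \sigma_{yz}(\mathbf{x},t)}{\partial z} + S_y(\mathbf{x},t)\right]$$

$$\frac{\partial V_z(\mathbf{x},t)}{\partial t} = \frac{1}{\rho(\mathbf{x})}\left[\frac{\partial \sigma_{xz}(\mathbf{x},t)}{\partial x} + \frac{\partial \sigma_{yz}(\mathbf{x},t)}{\partial y} + \frac{\partial \sigma_{zz}(\mathbf{x},t)}{\partial z} + S_z(\mathbf{x},t)\right]$$

– Normal stresses

$$\frac{\partial \sigma_{xx}(\mathbf{x},t)}{\partial t} = [\lambda(\mathbf{x}) + 2\mu(\mathbf{x})]\frac{\partial V_x(\mathbf{x},t)}{\partial x} + \lambda(\mathbf{x})\left[\frac{\partial V_y(\mathbf{x},t)}{\partial y} + \frac{\partial V_z(\mathbf{x},t)}{\partial z}\right] + S_{xx}(\mathbf{x},t)$$

$$\frac{\partial \sigma_{yy}(\mathbf{x},t)}{\partial t} = [\lambda(\mathbf{x}) + 2\mu(\mathbf{x})]\frac{\partial V_y(\mathbf{x},t)}{\partial y} + \lambda(\mathbf{x})\left[\frac{\partial V_x(\mathbf{x},t)}{\partial x} + \frac{\partial V_z(\mathbf{x},t)}{\partial z}\right] + S_{yy}(\mathbf{x},t)$$

$$\frac{\partial \sigma_{zz}(\mathbf{x},t)}{\partial t} = [\lambda(\mathbf{x}) + 2\mu(\mathbf{x})]\frac{\partial V_z(\mathbf{x},t)}{\partial z} + \lambda(\mathbf{x})\left[\frac{\partial V_x(\mathbf{x},t)}{\partial x} + \frac{\partial V_y(\mathbf{x},t)}{\partial y}\right] + S_{zz}(\mathbf{x},t)$$

– Shear stresses

$$\frac{\partial \sigma_{xy}(\mathbf{x},t)}{\partial t} = \mu(\mathbf{x})\left[\frac{\partial V_y(\mathbf{x},t)}{\partial x} + \frac{\partial V_x(\mathbf{x},t)}{\partial y}\right] + S_{xy}(\mathbf{x},t)$$

$$\frac{\partial \sigma_{xz}(\mathbf{x},t)}{\partial t} = \mu(\mathbf{x})\left[\frac{\partial V_z(\mathbf{x},t)}{\partial x} + \frac{\partial V_x(\mathbf{x},t)}{\partial z}\right] + S_{xz}(\mathbf{x},t)$$

$$\frac{\partial \sigma_{yz}(\mathbf{x},t)}{\partial t} = \mu(\mathbf{x})\left[\frac{\partial V_z(\mathbf{x},t)}{\partial y} + \frac{\partial V_y(\mathbf{x},t)}{\partial z}\right] + S_{yz}(\mathbf{x},t)$$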
- For time: 2nd-order finite-difference (F.D.) approximation
- Based on staggered grid
- For space: 10th-order F.D. approximation
[Figure: stencil footprint for the forward and adjoint operators, reaching 4 or 5 points on one side and 5 or 4 on the other]
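The slides do not list the weights of the 10th-order spatial stencil. As a minimal sketch, assuming the standard Taylor-series staggered-grid weights (the function name `staggered_fd_coefficients` is hypothetical, not taken from the presented code), the five coefficients can be derived exactly with rational arithmetic:

```python
from fractions import Fraction


def staggered_fd_coefficients(half_width):
    """Return exact first-derivative weights c_1..c_M for a staggered grid.

    The derivative at x is approximated by
        (1/h) * sum_m c_m * (f(x + (m - 1/2) h) - f(x - (m - 1/2) h)),
    which is accurate to order 2*M in the grid spacing h.
    """
    coeffs = []
    for m in range(1, half_width + 1):
        odd_m = 2 * m - 1
        prod = Fraction(1)
        for n in range(1, half_width + 1):
            if n != m:
                odd_n = 2 * n - 1
                prod *= Fraction(odd_n ** 2, abs(odd_n ** 2 - odd_m ** 2))
        sign = 1 if m % 2 == 1 else -1  # weights alternate in sign
        coeffs.append(sign * prod / odd_m)
    return coeffs


# M = 5 offsets on each side give the 10th-order stencil.
for m, c in enumerate(staggered_fd_coefficients(5), start=1):
    print(f"c{m} = {c} (about {float(c):.10f})")
```

Because each derivative uses five half-offset samples on either side, the footprint on the integer grid reaches 4 points one way and 5 the other, which matches the "4 or 5 / 5 or 4" extents in the figure.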
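$$V_x^{t+\frac{\Delta t}{2}}(\mathbf{x}) = V_x^{t-\frac{\Delta t}{2}}(\mathbf{x}) + \frac{\Delta t}{\rho(\mathbf{x})}\left[\frac{\partial \sigma_{xx}^{t}(\mathbf{x})}{\partial x} + \frac{\partial \sigma_{xy}^{t}(\mathbf{x})}{\partial y} + \frac{\partial \sigma_{xz}^{t}(\mathbf{x})}{\partial z}\right] + S_x^{t}(\mathbf{x})$$

$$V_y^{t+\frac{\Delta t}{2}}(\mathbf{x}) = V_y^{t-\frac{\Delta t}{2}}(\mathbf{x}) + \frac{\Delta t}{\rho(\mathbf{x})}\left[\frac{\partial \sigma_{xy}^{t}(\mathbf{x})}{\partial x} + \frac{\partial \sigma_{yy}^{t}(\mathbf{x})}{\partial y} + \frac{\partial \sigma_{yz}^{t}(\mathbf{x})}{\partial z}\right] + S_y^{t}(\mathbf{x})$$

$$V_z^{t+\frac{\Delta t}{2}}(\mathbf{x}) = V_z^{t-\frac{\Delta t}{2}}(\mathbf{x}) + \frac{\Delta t}{\rho(\mathbf{x})}\left[\frac{\partial \sigma_{xz}^{t}(\mathbf{x})}{\partial x} + \frac{\partial \sigma_{yz}^{t}(\mathbf{x})}{\partial y} + \frac{\partial \sigma_{zz}^{t}(\mathbf{x})}{\partial z}\right] + S_z^{t}(\mathbf{x})$$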
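To make the half-step time staggering concrete, here is a minimal sketch reduced to one dimension. It is not the presented kernel: it assumes a 2nd-order spatial difference instead of the 10th-order one, a single stress component, no source term, and hypothetical material values. Velocity is stored at integer grid points and half time steps, stress at half grid points and integer time steps.

```python
import math


def leapfrog_step(v, sigma, rho, modulus, dt, dx):
    """Advance a 1D velocity-stress system by one time step, in place.

    v[i]     sits at grid point i       and moves from t - dt/2 to t + dt/2.
    sigma[i] sits at grid point i + 1/2 and moves from t to t + dt.
    """
    n = len(v)
    # Velocity update from the stress gradient. The two end points are
    # left untouched, so they act as rigid walls.
    for i in range(1, n - 1):
        v[i] += dt / rho[i] * (sigma[i] - sigma[i - 1]) / dx
    # Stress update from the gradient of the freshly updated velocity.
    for i in range(n - 1):
        sigma[i] += dt * modulus[i] * (v[i + 1] - v[i]) / dx


def run_demo(n=101, steps=50):
    """Propagate a Gaussian stress pulse; all physical values are assumptions."""
    dx = 10.0                                # grid spacing in m
    vp = 3000.0                              # P-wave speed in m/s
    rho = [2000.0] * n                       # density in kg/m^3
    modulus = [2000.0 * vp ** 2] * (n - 1)   # lambda + 2*mu in Pa
    dt = 0.5 * dx / vp                       # Courant number 0.5
    v = [0.0] * n
    center = (n - 2) / 2.0
    sigma = [math.exp(-((i - center) / 5.0) ** 2) for i in range(n - 1)]
    for _ in range(steps):
        leapfrog_step(v, sigma, rho, modulus, dt, dx)
    return v, sigma


v, sigma = run_demo()
print("largest |sigma| after 50 steps:", max(abs(s) for s in sigma))
```

The initial pulse splits into a left-going and a right-going half, which is a quick sanity check that the two updates are coupled with the right signs.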
- GPU Optimizations
– K40 GPU, (200*200*200) grid * 1000 time steps
- Configuration of different block sizes and registers per block
- Best: block size ← 20*20; max registers ← 56
- Variable data into L1/shared memory (SM), constant data into read-only cache
– Dynamic Pointer Switch & Minimum Data Cubes
- Only malloc data cubes covering three time steps
[Diagram: three sets of data cubes ($v_x$, $\sigma_{yz}$, …), one each for time steps t-1, t, and t+1]
[Diagram: the three cube sets are relabeled after each time step, from (pre, cur, next) to (cur, next, pre) to (next, pre, cur), rather than being copied; the diagram also shows a Memory block with entries spaced by $\Delta t$]
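The pointer-switch idea can be sketched on the host side as follows. This is an illustration under assumptions, not the presented implementation: the class name `RotatingCubes`, the attribute `nxt`, and the toy update rule are hypothetical, and plain Python lists stand in for device allocations.

```python
class RotatingCubes:
    """Three preallocated buffers addressed by role: pre, cur, nxt."""

    def __init__(self, n_cells):
        # Allocate exactly three buffers once; every time step reuses them.
        self._buffers = [[0.0] * n_cells for _ in range(3)]
        self._first = 0  # index of the buffer currently in the "pre" role

    @property
    def pre(self):
        return self._buffers[self._first]

    @property
    def cur(self):
        return self._buffers[(self._first + 1) % 3]

    @property
    def nxt(self):
        return self._buffers[(self._first + 2) % 3]

    def switch(self):
        """Rotate roles: cur becomes pre, nxt becomes cur, old pre becomes nxt."""
        self._first = (self._first + 1) % 3


def advance(cubes, steps):
    """Toy three-level update (nxt = 2*cur - pre), followed by a role switch."""
    for _ in range(steps):
        pre, cur, nxt = cubes.pre, cubes.cur, cubes.nxt
        for i in range(len(cur)):
            nxt[i] = 2.0 * cur[i] - pre[i]
        cubes.switch()  # no data is copied, only the role index changes


cubes = RotatingCubes(4)
cubes.cur[:] = [1.0] * 4
advance(cubes, 5)
print(cubes.pre, cubes.cur)
```

The buffer that held the oldest time level is the one overwritten next, so memory use stays at three time levels regardless of how many steps are run.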
– Multiple GPUs
[Diagram: decomposition of the x-y-z grid into subdomains; each subdomain has an internal region plus halo regions 4 or 5 points wide]
[Diagram: GPU 0, GPU 1, and GPU 2 each hold an internal region bordered by halos shared with the neighboring GPUs]
GPU algorithm per stencil sweep, for each subdomain:
① Calculate RTM stencil
② Update halo
③ Add source
④ Switch pointers
[Timeline: basic workflow, stencil computing followed by updating]
GPU algorithm per stencil sweep with overlapping, for each subdomain:
① Calculate halo RTM stencil
② Calculate internal RTM stencil
③ Update halo
④ Add source
⑤ Switch pointers
[Timeline: overlapping workflow, the halo stencil is computed first, then the internal stencil runs while the halo is being updated]
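The ordering of the overlapping workflow can be sketched in one dimension as below. This is a sequential illustration under assumptions: a toy averaging stencil with radius `R = 2` replaces the RTM stencil, lists stand in for per-GPU memory, the source step is omitted, and all function names are hypothetical. On real hardware, the halo transfer would run asynchronously while the internal cells are computed; here the steps simply run in that order.

```python
R = 2  # stencil radius (assumed for brevity; a 10th-order stencil needs more)


def stencil(u, i):
    """Toy stencil: mean of the 2R+1 values centered on i."""
    return sum(u[i - R:i + R + 1]) / (2 * R + 1)


def reference_step(u):
    """Single-domain sweep; the R cells at each end are fixed boundary values."""
    new = list(u)
    for i in range(R, len(u) - R):
        new[i] = stencil(u, i)
    return new


def split(u, parts):
    """Cut the updatable cells into equal subdomains, each padded with R halo cells."""
    m = (len(u) - 2 * R) // parts
    return [list(u[p * m:(p + 1) * m + 2 * R]) for p in range(parts)]


def gather(subs):
    """Reassemble the global array from the owned cells of every subdomain."""
    out = list(subs[0][:R])
    for s in subs:
        out.extend(s[R:-R])
    out.extend(subs[-1][-R:])
    return out


def overlapped_step(subs):
    news = [list(s) for s in subs]
    for cur, new in zip(subs, news):
        m = len(cur) - 2 * R  # number of cells this subdomain owns
        # Step 1: cells next to the halos first, since neighbors need them.
        for i in list(range(R, 2 * R)) + list(range(m, m + R)):
            new[i] = stencil(cur, i)
        # Step 2: internal cells (this is what would overlap with the transfer).
        for i in range(2 * R, m):
            new[i] = stencil(cur, i)
    # Step 3: update halos from the neighbors' freshly computed edge cells.
    for p in range(len(news) - 1):
        left, right = news[p], news[p + 1]
        right[:R] = left[-2 * R:-R]
        left[-R:] = right[R:2 * R]
    # Final step: the caller swaps in the returned buffers (pointer switch).
    return news


u0 = [float((7 * i) % 11) for i in range(2 * R + 18)]
ref, subs = list(u0), split(u0, 3)
for _ in range(4):
    ref = reference_step(ref)
    subs = overlapped_step(subs)
print(max(abs(a - b) for a, b in zip(ref, gather(subs))))
```

Comparing against the single-domain sweep is a cheap way to check that the halo indices line up before moving the same ordering onto multiple devices.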
- Validation and Performance
– GPU cluster at SEP
- 4 K40 GPUs compared against a 24-core CPU (OpenMP)
- 200*200*200 grid, 1000 time steps (recorded every 100 steps)
[Figures: Vx and Vy wavefield panels, shown over successive slides]
Device   Speedup (Comp.)   Speedup (Comp.+Comm.)
1 GPU    8.85              8.36
2 GPU    17.70             16.15
4 GPU    32.67             27.31
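The table can be read in terms of scaling efficiency relative to one GPU. The sketch below only rearranges the reported numbers; the communication-share estimate additionally assumes that the "Comp.+Comm." time is the computation time plus a separate communication time, which the slides do not state.

```python
# Speedups over the 24-core CPU run, copied from the table:
# device count -> (computation only, computation + communication)
speedups = {1: (8.85, 8.36), 2: (17.70, 16.15), 4: (32.67, 27.31)}


def parallel_efficiency(table):
    """Speedup on n GPUs divided by n times the 1-GPU speedup."""
    base_comp, base_total = table[1]
    return {
        n: (comp / (n * base_comp), total / (n * base_total))
        for n, (comp, total) in table.items()
    }


def communication_share(table):
    """Fraction of run time not spent computing (assumes times simply add)."""
    return {n: 1.0 - total / comp for n, (comp, total) in table.items()}


eff = parallel_efficiency(speedups)
share = communication_share(speedups)
for n in sorted(speedups):
    print(f"{n} GPU: efficiency comp {eff[n][0]:.1%}, "
          f"comp+comm {eff[n][1]:.1%}, comm share {share[n]:.1%}")
```

On these numbers, computation alone scales almost linearly to 4 GPUs, and most of the loss in the combined figure comes from communication.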
2D Elastic RTM Stencils
- Elastic Wave Equations
– For time: 2nd-order F.D. approximation
– For space: 10th-order F.D. approximation
- Performance
– K40 GPU, (300*300) grid * 3525 time steps
- Configuration between shared memory (SM) and L1, block sizes, registers per block
- Variable data into shared memory, constant data into read-only cache
- Best: block size ← 20*20; max registers ← 56
[Chart: CPU vs. GPU run time for the 2D code; the GPU is 101x faster than the single-core serial CPU code and 10x faster than the CPU code on 24 cores]
High Performance Geo-Computing Group
Lin Gan
Center for Earth System Science, Department of Computer Science, Tsinghua University, Beijing
+86 15810537953
l-gan11@mails.tsinghua.edu.cn
www.thuhpgc.org
GPU Acceleration on the 3D Elastic RTM Method