Porting Scalable Parallel CFD Application, Krishnababu et al. (PowerPoint presentation)



SLIDE 1

HiFUN on GPU, Krishnababu et al.

Porting Scalable Parallel CFD Application: HiFUN on NVIDIA GPU

  • D. V. Krishnababu, N. Munikrishna, Nikhil Vijay Shende 1
  • N. Balakrishnan 2
  • Thejaswi Rao 3

  • 1. S & I Engineering Solutions Pvt. Ltd., Bangalore, India
  • 2. Aerospace Engineering, Indian Institute of Science, Bangalore, India
  • 3. NVIDIA Graphics Pvt. Ltd., Bangalore, India

GPU Technology Conference, Silicon Valley, March 26–29, 2018

1 / 18

SLIDE 2

Introduction

http://www.sandi.co.in

The HiFUN Software
  • High Resolution Flow Solver on Unstructured Meshes.
  • A Computational Fluid Dynamics (CFD) flow solver.
  • The primary product of the company SandI.
  • A robust, fast, accurate and efficient tool.

About SandI
  • A technology company incubated from the Indian Institute of Science, Bangalore.
  • Promotes high-end CFD technologies with uncompromising quality standards.

2 / 18


SLIDE 4

Features of HiFUN

http://www.sandi.co.in/home/products

General

3 / 18

SLIDE 5

Features of HiFUN

http://www.sandi.co.in/home/products

Well Validated
  • AIAA DPW
  • SPICES
  • AIAA HiLiftPW

4 / 18

SLIDE 6

Features of HiFUN

http://www.sandi.co.in/home/products

Super Scalable (Workload: 165 Million Volumes)

  Simulation   CPU Cores   Time
  RANS         256         30 hours (1.25 days)
  RANS         10000       1 hour
  URANS        256         108 hours (4.5 days)
  URANS        10000       3 hours
  DES          256         525 hours (22 days)
  DES          10000       15 hours

5 / 18

SLIDE 7

SandI–NVIDIA Collaboration

2014 ✲ Joint Development Initiative Kicks Off

2015 ✲ NVIDIA Innovation Award

2016 ✲ GTCx Mumbai; HiFUN in GPU Apps Catalogue; GTC 2016 Poster Presentation

2018 ✲ GTC 2018

Way Ahead ✲ HiFUN on NVIDIA Pascal, Volta GPUs; NVLink with IBM Power CPUs

6 / 18

SLIDE 8

HiFUN on NVIDIA GPU

Hybrid Supercomputers
  • Consist of CPUs and NVIDIA GPUs.
  • Less power to achieve the same FLOPS.
  • Less cooling and space.

GPU
  • Thousands of computing cores sharing the same RAM.
  • Higher memory bandwidth.
  • High data transfer overheads with the CPU.

7 / 18


SLIDE 10

HiFUN on NVIDIA GPU

Parallelization Model on GPU
  • Shared memory.
  • Many FLOPS per byte of data moved from CPU to GPU.
  • Requires a re-look at the parallelization of CFD algorithms.

Parallelization Challenges
  • General purpose algorithms.
  • Implicit schemes: global data dependence.
  • Complex multi-layered unstructured data structures.

8 / 18


SLIDE 12

HiFUN on NVIDIA GPU

Constraints
  • No compromise on distributed memory scalability.
  • Source code maintainability should not suffer.
  • Software portability should not suffer.

Parallel Strategy
  • Accelerate single node performance via an offload model.
  • Hybrid: MPI and OpenACC directives.

Offload Model
  • The computationally intensive part is offloaded to the GPU.
  • Optimal data communication between CPU and GPU.

9 / 18


SLIDE 15

HiFUN on NVIDIA GPU

(Figures: Onera M6, NASA CRM, NASA Trap Wing)

Configurations & Workloads (Million)
  • Onera M6 Wing: 1.1, 9.3, 12.12, 15.4
  • NASA CRM: 6.2, 26.5, 30
  • NASA Trap Wing: 20, 66

Simulation Type
  • Steady RANS simulations.

10 / 18


SLIDE 17

HiFUN on NVIDIA GPU

Computing Platform: NVIDIA PSG

Node configuration
  • Two hexa-deca core (16-core) Intel Xeon Haswell processors.
  • Eight NVIDIA Tesla K80 GPUs.
  • GPU memory = 12 GB.
  • Total CPU memory per node = 256 GB.
  • InfiniBand interconnect.

Software
  • PGI Compiler 16.7
  • OpenMPI 1.10.2
  • OpenACC 2.0

11 / 18


SLIDE 19

HiFUN on NVIDIA GPU

Parallel Performance Parameters

  • Ideal Speed-up: the ratio of the number of nodes used for a given run to the reference number of nodes.
  • Actual Speed-up: the ratio of the time per iteration using the reference number of nodes to the time per iteration using the number of nodes for the given run.
  • Accelerator Speed-up: the ratio of the time per iteration obtained using a given number of CPUs to the time per iteration obtained using the same number of CPUs working in tandem with GPUs.

12 / 18

SLIDE 20

HiFUN on NVIDIA GPU

Single Node Performance

Accelerator Speed-up on 2 GPUs

Observations
  • Increasing the grid size increases GPU utilization and accelerator speed-up.
  • It is important to load the GPU completely.

13 / 18

SLIDE 21

HiFUN on NVIDIA GPU

Single Node Performance

Varying the Number of GPUs: % Increase

Observations
  • Increasing the number of GPUs increases accelerator speed-up.
  • Use of 4 GPUs per node is optimal.

14 / 18

SLIDE 22

HiFUN on NVIDIA GPU

Single Node Performance

Time to RANS Solution (Hours)

Observations
  • Time to solution on a 1 million grid: about 15 minutes.
  • Time to solution on a 30 million grid: about half a day.
  • A single node serves as a desktop supercomputer.

15 / 18

SLIDE 23

HiFUN on NVIDIA GPU

Multi-node Performance

Parallel Speed-up: 66 Million Workload

Observations
  • Near linear speed-up using 2 GPUs per node.
  • Speed-up drops for larger numbers of nodes and/or more GPUs per node, due to lower GPU utilization.

16 / 18

SLIDE 24

HiFUN on NVIDIA GPU

Multi-node Performance

Normalized Time per Iteration: 66 Million Workload

Observations
  • Time per iteration drops as the number of nodes and/or GPUs increases.
  • Time to solution with 8 nodes: about 4 hours.

17 / 18

SLIDE 25

HiFUN on NVIDIA GPU

Concluding Remarks
  • An offload model was used to port HiFUN to the GPU.
  • A GPU based computing node is powerful enough to serve as a desktop supercomputer.
  • HiFUN is ideally suited to solve grand challenge problems on GPU based hybrid supercomputers.
  • An OpenACC directives based offload model is an attractive option for porting legacy CFD codes to GPUs.

18 / 18
