An Agile Approach to Building a GPU-enabled and Performance-portable Global Cloud-resolving Atmospheric Model – PowerPoint PPT Presentation



SLIDE 1

An Agile Approach to Building a GPU-enabled and Performance-portable Global Cloud-resolving Atmospheric Model

  • Dr. Richard Loft*
    Director, Technology Development, CISL/NCAR
    *National Center for Atmospheric Research

GTC, San Jose, CA, March 26, 2018

SLIDE 2

Outline

  • Origins / Backstory
  • The MPAS Model
  • Team
  • Tools and Design
  • Status


SLIDE 3

Project began with research based on student projects

  • Two years of projects in NCAR’s Summer Internships in Parallel Computational Science (SIParCS) program funded student work on architectural inter-comparison.
  • Projects focused on optimizing atmospheric numerical PDE solvers for both CPUs and GPUs with performance portability in mind.

  • Architectures compared:
    • Xeon Broadwell, Haswell
    • Xeon Phi KNL
    • NVIDIA Tesla P100 → V100

SLIDE 4

Benchmark Problem

  • Shallow Water Equations (SWE)

    – A set of non-linear partial differential equations (PDEs)
    – Capture features of atmospheric flow around the Earth

  • Radial basis function-generated finite difference (RBF-FD) methods
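For context, the shallow water equations over a rotating sphere are typically written in the following vector form. This is a sketch of the standard formulation with bottom topography (the "isolated mountain" of the test case), not necessarily the exact variant solved in [1]:

```latex
\begin{aligned}
\frac{\partial \mathbf{u}}{\partial t}
  + (\mathbf{u}\cdot\nabla)\,\mathbf{u}
  + f\,\hat{\mathbf{k}}\times\mathbf{u}
  &= -g\,\nabla\,(h + b), \\
\frac{\partial h}{\partial t}
  + \nabla\cdot(h\,\mathbf{u}) &= 0,
\end{aligned}
```

where u is the horizontal velocity, h the fluid depth, b the bottom topography, f the Coriolis parameter, g gravity, and k̂ the local vertical unit vector.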

RBF-FD solution to the SWE test case “Flow over an isolated mountain” using 655,532 points [1]

(Figure: an example 75-point stencil on a sphere [1]; the differential operator D is evaluated at every point. Legend: stencil points, non-stencil points, cone-shaped mountain; panels show Day 1 and Day 15.)
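The stencil evaluation in the figure can be sketched in code. This is a hypothetical illustration, not the SIParCS implementation: it assumes the RBF-FD weights and each node's stencil neighbor indices have already been precomputed, so applying the differential operator D reduces to a gather plus a weighted sum per node.

```python
import numpy as np

def apply_rbf_fd(field, stencil_idx, weights):
    """Evaluate D(field) at every node: sum_j weights[i, j] * field[stencil_idx[i, j]].

    field:       (N,) values at the N mesh nodes
    stencil_idx: (N, s) neighbor indices, s = stencil size (75 in the talk)
    weights:     (N, s) precomputed RBF-FD weights for operator D
    """
    return np.sum(weights * field[stencil_idx], axis=1)

# Tiny example: 4 nodes on a periodic ring, 2-point stencils whose weights
# (+1 on the next node, -1 on the node itself) implement a forward difference.
field = np.array([1.0, 2.0, 4.0, 8.0])
idx = np.array([[1, 0], [2, 1], [3, 2], [0, 3]])
w = np.array([[1.0, -1.0]] * 4)
print(apply_rbf_fd(field, idx, w))  # forward differences: [ 1.  2.  4. -7.]
```

On GPUs this gather-heavy pattern is what makes stencil ordering and memory layout the main optimization targets.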

Optimizing Stencils for different architectures

SLIDE 5


Directive-based portability in the RBF-FD shallow water equations (2-D unstructured stencil)

  • The CI roofline model generally predicts performance well, even for more complicated algorithms.
  • Xeon performance crashes to the DRAM bandwidth limit when the cache size is exceeded, with some state reuse.
  • Xeon Phi (KNL) HBM memory is less sensitive to problem size than Xeon; performance saturates at the CI-predicted figure.
  • NVIDIA Pascal P100 performance fits the CI model; GPUs require higher levels of parallelism to reach saturation.

(Chart: performance in GFLOPS, 50–350, for Broadwell, KNL, and P100, spanning regimes of insufficient and sufficient workload parallelism.)
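The CI roofline prediction the bullets refer to is simply the lesser of peak compute and memory bandwidth times computational intensity. A minimal sketch; the peak and bandwidth numbers below are rough public figures assumed for illustration, not measurements from this work:

```python
def roofline(peak_gflops, bw_gbs, ci_flops_per_byte):
    """Attainable performance (GFLOPS) under the roofline model:
    compute-bound at peak, else bandwidth-bound at BW * CI."""
    return min(peak_gflops, bw_gbs * ci_flops_per_byte)

# Example: at a low CI of 0.25 flops/byte, all three machines are
# bandwidth-bound, so attainable GFLOPS tracks memory bandwidth.
machines = {
    "Broadwell (DRAM)": (1000.0, 77.0),   # assumed peak GFLOPS, GB/s
    "KNL (MCDRAM)":     (3000.0, 450.0),
    "P100 (HBM2)":      (4700.0, 732.0),
}
for name, (peak, bw) in machines.items():
    print(f"{name}: {roofline(peak, bw, 0.25):.2f} GFLOPS attainable")
```

The slide's observation that Xeon "crashes" to the DRAM limit corresponds to the bandwidth term shrinking once the working set no longer fits in cache.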

SLIDE 6

What is MPAS? – The Model for Prediction Across Scales
NCAR’s global meteorological/climate model; ~100,000 SLOC


Simulation of 2012 Tropical Cyclones at 4 km Resolution – Courtesy of Falko Judt, NCAR

SLIDE 7

Weather and Climate Alliance (WACA):

  • NCAR
  • NVIDIA Corporation
  • IBM Corporation/The Weather Company
  • University of Wyoming, CE&EE Department
  • Korean Institute of Science and Technology Information (KISTI)


SLIDE 8

Initial Divide and Conquer Strategy


(Diagram: MPAS Dynamics and MPAS Physics workstreams, exchanging problem reports and support in one direction and ideas and results in the other.)

SLIDE 9

Weather and Climate Alliance (WACA):

A Collaboration for Earth System Model Acceleration

  • NCAR (2+4)
    • Dr. Rich Loft, Director, TDD
    • Dr. Raghu Raj Kumar, Project Scientist, TDD
    • Clint Olson, TDD
    • Bill Skamarock, Senior Scientist, MMM
    • Michael Duda, Software Engineer, MMM
    • Dave Gill, Software Engineer, MMM
  • KISTI (2+1)
    • Minsu Joh, KISTI Director, Disaster Management Research Center
    • Dr. Ji-Sun Kang, Senior Researcher
    • Jae-Youp Kim, GRA
  • NVIDIA/PGI (1+3)
    • Greg Branch, NVIDIA, Sales
    • Dr. Carl Ponder, Senior Applications Engineer
    • Brent Leback, PGI Compiler Engineering Manager
    • Craig Tierny, Solutions Architect
  • University of Wyoming (1+5)
    • Dr. Suresh Muknahallipatna, Professor, E&CE, UW
    • Supreeth Suresh, Pranay Reddy, Sumathi Lakshmiranganathan, Cena Miller, Bradley Riotto – GRAs


~6 PIs + 13 technical staff
Started in September 2016 (18 months); ~9 FTE-years

SLIDE 10


Since September: added IBM and The Weather Company

IBM/TWC participants (1+2)

  • Jaime Moreno
  • Todd Hutchinson
  • Constantinos Evangelinos
SLIDE 11

Tools for Accelerating Code Optimization

  • Kernel GENerator (KGEN)
    • Extracts kernels from Fortran applications
    • Creates:
      • Standalone source code
      • Input and output state for verification
    • Added support for code coverage and representation
  • Broad user community
    • 8 domestic institutions
    • 5 international institutions
    • 1 company
  • Available on GitHub: https://github.com/NCAR/KGen


KGEN is a useful tool for accelerating code porting and optimization

SLIDE 12

MPAS Synchronous and Asynchronous Execution

(Diagram: MPAS execution timelines over timestep Δt. Synchronous mode: dynamics and physics, land surface, and LW/SW radiation all complete within each step before output to disk. Asynchronous mode: LW and SW radiation is lagged and overlapped with dynamics and physics, and I/O to disk is asynchronous.)
SLIDE 13

Phase 2: pushing on to a full MPAS port

  • Status of GPU-based model components
    • Ported, optimized, verified:
      • Dry dynamical core
      • GPU-direct implementation of MPAS halo exchanges
    • Ported, optimized:
      • Moist dynamics (tracer transport)
      • Xu-Randall cloud fraction
    • Ported, undergoing optimization:
      • WSM6 microphysics
      • YSU boundary-layer scheme
    • Awaiting porting:
      • Scale-insensitive Tiedtke convection scheme
      • Monin-Obukhov surface-layer scheme
  • CPU-based components
    • Overlapping SW and LW RRTMG radiation (lagged radiation)
    • NOAH land surface model (synchronous, remains on CPU)
    • SIONlib I/O subsystem
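For readers unfamiliar with halo exchanges: each MPI rank owns a patch of the mesh plus a ring of ghost ("halo") cells mirroring its neighbors' edges, and the exchange refills those ghosts every step. Below is a minimal serial sketch of the idea on a 1-D periodic decomposition; the real MPAS exchange gathers and scatters unstructured cell lists over MPI, and "GPU-direct" means the exchange buffers stay in device memory rather than being staged through the host.

```python
import numpy as np

def exchange_halos(subdomains, width=1):
    """Fill each subdomain's halo cells from its periodic neighbors.

    Each subdomain is laid out as [halo | interior ... | halo]; after the
    exchange, the halos hold copies of the neighbors' edge interior cells.
    """
    n = len(subdomains)
    for i, sub in enumerate(subdomains):
        left = subdomains[(i - 1) % n]
        right = subdomains[(i + 1) % n]
        sub[:width] = left[-2 * width:-width]   # left halo <- left neighbor's edge interior
        sub[-width:] = right[width:2 * width]   # right halo <- right neighbor's edge interior

# Two "ranks", interior cells marked by rank id, halo width 1.
a = np.array([0.0, 1.0, 1.0, 0.0])  # rank 0
b = np.array([0.0, 2.0, 2.0, 0.0])  # rank 1
exchange_halos([a, b])
print(a, b)  # halos now carry the neighbor's interior edge values
```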

SLIDE 14

IBM/TWC MPAS Objectives

  • MPAS grid with local refinement for 24-hour global forecasts:
    • 12 km global grid
    • 3 km refinement over selected regions
    • 32.8 M horizontal points
    • 56 layers
  • Forecast requirement:
    • Complete a 20-hour simulation in 45 minutes (xRe = 26.7, i.e. 26.7× real-time speed)
    • For Δt = 18 s, the per-timestep budget is 0.674 seconds
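The timestep budget follows directly from the numbers on this slide; a quick check (the step count is implied rather than stated, and the slide's 0.674 s agrees with this up to rounding):

```python
# Values from the slide's forecast requirement.
sim_hours = 20       # simulated hours to complete
wall_minutes = 45    # allowed wall-clock time
dt_seconds = 18      # model timestep Δt

steps = sim_hours * 3600 / dt_seconds     # number of timesteps to run
budget = wall_minutes * 60 / steps        # wall-clock seconds allowed per step
speedup = sim_hours * 60 / wall_minutes   # how much faster than real time

print(f"{steps:.0f} steps, {budget:.3f} s/step, {speedup:.1f}x real time")
# → 4000 steps, 0.675 s/step, 26.7x real time
```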


Refined grids can be generated anywhere desired.

  • Dr. Kumar will show next that as few as 800 V100s could achieve this goal…