GTC ◆ March 26, 2018
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PRES-747146
Modernizing OpenMP for an Accelerated World
Tom Scogland, Bronis de Supinski
Tom Scogland @ GTC LLNL-PRES-747146
[Chart: OpenMP ARB membership by year, 1997-2015, split into Permanent ARB and Auxiliary ARB members]
1997: In spring, seven vendors and the DOE agree on the spelling of parallel loops and form the OpenMP ARB. By October, version 1.0 of the OpenMP specification for Fortran is released.
1998: Version 1.0 of the C/C++ specification follows.
1999: 1.1 makes minor modifications. cOMPunity, the group of OpenMP users, is formed and organizes workshops on OpenMP in North America, Europe, and Asia.
2000-2002: 2.0 for Fortran and for C/C++. The first hybrid applications with MPI and OpenMP appear, and the merge of the Fortran and C/C++ specifications begins.
2005: 2.5 unifies Fortran and C/C++, bigger than both individual specifications. The OpenMP memory model is defined and codified.
2008: 3.0 incorporates tasking.
2011: 3.1 supports min/max reductions in C/C++.
2013: 4.0 supports offloading execution to accelerator and coprocessor devices and SIMD parallelism, taking OpenMP beyond traditional boundaries.
2015: 4.5 adds taskloops, task priorities, doacross loops, and hints for locks; offloading now supports asynchronous execution and dependencies to host execution.
2016-2018: 5.0 is in development.
▪ Components: Mellanox single-plane EDR InfiniBand interconnect (2-to-1 tapered fat tree); IBM POWER9 CPUs; NVIDIA Volta GPUs
▪ Compute node: 2 IBM POWER9 CPUs; 4 NVIDIA Volta GPUs; NVMe-compatible 1.6 TB PCIe SSD; 256 GiB DDR4; 16 GiB of globally addressable HBM2 associated with each GPU; coherent shared memory
▪ Compute rack: standard 19"; warm-water cooling
▪ Compute system: 4,320 nodes; 1.29 PB memory; 240 compute racks; 125 PFLOPS; ~12 MW
▪ Spectrum Scale file system: 154 PB usable storage; 1.54 TB/s R/W bandwidth
                                     Sierra      uSierra
Nodes                                4,320       684
POWER9 processors per node           2           2
GV100 (Volta) GPUs per node          4           4
Node peak (TFLOP/s)                  29.1        29.1
System peak (PFLOP/s)                125         19.9
Node memory (GiB)                    320         320
System memory (PiB)                  1.29        0.209
Interconnect                         2x IB EDR   2x IB EDR
Off-node aggregate b/w (GB/s)        45.5        45.5
Compute racks                        240         38
Network and infrastructure racks     13          4
Storage racks                        24          4
Total racks                          277         46
Peak power (MW)                      ~12         ~1.8
Extensions to the requires directive, and their effects:
— unified_address: guarantees that device pointers are unique across all devices; is_device_ptr is not required
— unified_shared_memory: host pointers are valid device pointers and are considered present by all implicit maps; implies unified_address; memory is synchronized at target task synchronization points
int *arr = new int[50];
#pragma omp target teams distribute parallel for
for (int i = 0; i < 50; ++i) {
  arr[i] = i;
}
#pragma omp requires unified_shared_memory

int *arr = new int[50];
#pragma omp target teams distribute parallel for
for (int i = 0; i < 50; ++i) {
  arr[i] = i;
}
typedef struct myvec {
  size_t len;
  double *data;
} myvec_t;

myvec_t v = init_myvec();
#pragma omp target map(v, v.data[:v.len])
{
  do_something_with_v(&v);
}

Map structure v; map the data array and attach it to v.
typedef struct myvec {
  size_t len;
  double *data;
} myvec_t;

size_t num = 50;
myvec_t *v = alloc_array_of_myvec(num);
#pragma omp target map(v[:50], ??????)
{
  do_something_with_v(&v);
}

Map an array of myvec structures: now we need a loop over the elements, breaking the code up even more!
typedef struct myvec {
  size_t len;
  double *data;
} myvec_t;

#pragma omp declare mapper(myvec_t v) \
    use_by_default map(v, v.data[:v.len])

size_t num = 50;
myvec_t *v = alloc_array_of_myvec(num);
#pragma omp target map(v[:50])
{
  do_something_with_v(&v);
}

No loop required, no extra code at the use site: just map.
typedef struct myvec {
  size_t len;
  double *data;
} myvec_t;

#pragma omp declare mapper(myvec_t v) \
    use_by_default map(v, v.data[:v.len])

typedef struct mypoints {
  struct myvec *x;
  struct myvec scratch;
  double useless_data[500000];
} mypoints_t;

#pragma omp declare mapper(mypoints_t p) \
    use_by_default \
    map( /* self only partially mapped, useless_data can be ignored */ \
        p.x, p.x[:1])     /* map and attach x */ \
    map(alloc: p.scratch) /* never update scratch, including its internal maps */

mypoints_t p = new_mypoints_t();
#pragma omp target
{
  do_something_with_p(&p);
}

No explicit map required! Pick and choose what to map, and re-use the myvec_t mapper.
LLNL-PRES-730445
▪ Pipelining normally requires users to:
— Split their work into multiple chunks
— Add another loop nesting level over the chunks
— Explicitly copy a subset of their data
— Transform accesses to reference that subset
— Ensure all chunks are synchronized
That is a lot of work for the user!
▪ Doing this as an extension to OpenMP requires only:
— A data-motion direction
— The portion of data accessed by each iteration
— Which dimension is being looped over
Simpler and more automatic.
▪ Optionally we can do more with:
— Number of concurrent transfers
— Memory limits
— Schedulers
— Etc.
Greater capabilities!
▪ Default synchronous data motion: each chunk's copy in, compute, and copy out run back to back
▪ Pipelined data motion: the copy in, compute, and copy out stages of successive chunks overlap
#pragma omp target data \
    map(to: A0[0:nz-1][0:ny-1][0:nx-1]) \
    map(from: Anext[0:nz-1][0:ny-1][0:nx-1])
for (k = 1; k < nz-1; k++) {
  #pragma omp target teams distribute parallel for
  for (i = 1; i < nx-1; i++) {
    for (j = 1; j < ny-1; j++) {
      Anext[Index3D(i, j, k)] =
          (A0[Index3D(i, j, k + 1)] + A0[Index3D(i, j, k - 1)] +
           A0[Index3D(i, j + 1, k)] + A0[Index3D(i, j - 1, k)] +
           A0[Index3D(i + 1, j, k)] + A0[Index3D(i - 1, j, k)]) * c1
    }
  }
}
#pragma omp target \
    pipeline(static[1,3]) \
    pipeline_map(to: A0[k-1:3][0:ny-1][0:nx-1]) \
    pipeline_map(from: Anext[k:1][0:ny-1][0:nx-1])
for (k = 1; k < nz-1; k++) {
  #pragma omp target teams distribute parallel for
  for (i = 1; i < nx-1; i++) {
    for (j = 1; j < ny-1; j++) {
      Anext[Index3D(i, j, k)] =
          (A0[Index3D(i, j, k + 1)] + A0[Index3D(i, j, k - 1)] +
           A0[Index3D(i, j + 1, k)] + A0[Index3D(i, j - 1, k)] +
           A0[Index3D(i + 1, j, k)] + A0[Index3D(i - 1, j, k)]) * c1
    }
  }
}
#pragma omp target \
    pipeline(static[1,3]) \
    pipeline_map(to: A0[k-1:3][0:ny-1][0:nx-1]) \
    pipeline_map(from: Anext[k:1][0:ny-1][0:nx-1]) \
    pipeline_mem_limit(MB_256)
for (k = 1; k < nz-1; k++) {
  #pragma omp target teams distribute parallel for
  for (i = 1; i < nx-1; i++) {
    for (j = 1; j < ny-1; j++) {
      Anext[Index3D(i, j, k)] =
          (A0[Index3D(i, j, k + 1)] + A0[Index3D(i, j, k - 1)] +
           A0[Index3D(i, j + 1, k)] + A0[Index3D(i, j - 1, k)] +
           A0[Index3D(i + 1, j, k)] + A0[Index3D(i - 1, j, k)]) * c1
    }
  }
}
Replicating this manually requires ~20 more lines of error-prone boilerplate per loop!
All results with PGI OpenACC on K40 GPUs (LLNL Surface cluster).
[Chart: speedup from pipelining, higher is better. Nearly 2x speedup in one case; only 1.5x in another, why?]
[Chart: memory footprint, lower is better. Buffering reduces memory use by 80%.]