Hierarchical Locality and Parallel Programming in the Extreme Scale Era (PowerPoint Presentation Transcript)

SLIDE 1

Hierarchical Locality and Parallel Programming in the Extreme Scale Era

Tarek El-Ghazawi

The George Washington University

University of Southern California, September 29, 2016

SLIDE 2

Overview

- Fundamental Challenges for Extreme Computing
- Locality and Hierarchical Locality
- Programming Models
- Hardware Support for Productive Locality Exploitation: Address Remapping
- Hierarchical Locality Exploitation
- Concluding Remarks

SLIDE 3

Top Ten Challenges for Exascale: Areas where research and advances are needed!

1. Energy Efficiency
2. Interconnect Technology
3. Memory Technology
4. Scalable System Software
5. Programming Systems
6. Data Management
7. Exascale Algorithms
8. Algorithms for Discovery, Design & Decision
9. Resilience and Correctness
10. Scientific Productivity

Source: DoE ASCAC Subcommittee Report, Feb 2014. Data movement and/or programming related.

SLIDE 4

Technological Challenges: Combined Bandwidth and Energy Challenges for Exascale

- Locality and data movement matter a lot; cost (energy and time) rapidly increases with distance
- Locality and data movement are critical even at short distances, more so at far distances

[Figures: bandwidth density vs. system distance; energy vs. system distance. Source: ASCAC 14]

SLIDE 5

Technological Challenges (2): Bandwidth

- Interconnect is not keeping up with the growth in compute capability
  • Many apps require 1 Byte/FLOP off-chip, which is not possible at 10 TFLOPs per chip and beyond (Intel Knights Landing: 500 GB/s => 1/6 Byte/FLOP)
  • Huge bandwidth density (GB/s/μm) is needed on-chip due to the large number of cores in a small area

Ref: Miller, D. A., Proceedings of the IEEE, 2009.

[Figures: growing manycore bandwidth requirements; the widening gap between available I/O and compute capability, shown as Bytes/FLOP per year (2012-2015) for Xeon Phi (Knights Corner, Knights Landing) and NVIDIA K20/K40/K80.]

SLIDE 6


Overview

- Fundamental Challenges for Extreme Computing
- Locality and Hierarchical Locality
- Programming Models
- Hardware Support for Productive Locality Exploitation: Address Remapping
- Hierarchical Locality Exploitation
- Concluding Remarks

SLIDE 7

Architectural Challenges: Architectures are Becoming Deeply Hierarchical at Extreme Scale – Chips and Systems

Examples: Cray XC40 (system level); Tilera TILE64 (chip level)
SLIDE 13


Overview

- Fundamental Challenges for Extreme Computing
- Locality and Hierarchical Locality
- Programming Models
- Hardware Support for Productive Locality Exploitation: Address Remapping
- Hierarchical Locality Exploitation
- Concluding Remarks

SLIDE 14


Where Do Programming Models Fit into All of That?

- What is a programming model?
  • An abstract virtual machine
  • A view of data and execution
  • The logical interface between architecture and applications

- Why programming models?
  • Decouple applications and architectures
    - Write applications that run effectively across architectures
    - Design new architectures that can effectively support legacy applications

- Programming model design considerations
  • Expose modern architectural features to exploit machine power and improve performance
  • Maintain ease of use
  • Together, the two previous points increase productivity!
SLIDE 15


Current Programming Models and Locality Awareness

Comparison by process/thread view and address space:

- Partitioned Global Address Space: locality-aware
  • One-sided communication
  • Examples: UPC and Chapel
- Shared Memory: not locality-aware (×)
  • One-sided communication
  • Example: OpenMP
- Message Passing: locality-aware
  • Two-sided communication
  • Example: MPI
SLIDE 16


PGAS Languages Include UPC, Chapel and X10
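To make the model concrete, here is a minimal UPC sketch (not from the slides; it assumes any standard UPC compiler such as Berkeley UPC). A shared array is distributed across threads, and any thread can read or write any element with one-sided semantics:

    #include <upc_relaxed.h>
    #include <stdio.h>

    shared int data[THREADS];   /* one element with affinity to each thread */

    int main(void) {
        data[MYTHREAD] = MYTHREAD * 10;   /* write the element local to this thread */
        upc_barrier;
        /* one-sided read of a neighbor's element: no matching send/receive needed */
        int right = data[(MYTHREAD + 1) % THREADS];
        printf("Thread %d read %d\n", MYTHREAD, right);
        return 0;
    }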

SLIDE 17


Overview

- Fundamental Challenges for Extreme Computing
- Locality and Hierarchical Locality
- Programming Models
- Hardware Support for Productive Locality Exploitation: Address Remapping
- Hierarchical Locality Exploitation
- Concluding Remarks

SLIDE 18

Memory Accesses in UPC: Shared Address Translation Overheads

Measurement of the address-space overheads: a set of micro-benchmarks measures the different aspects separately.

[Figures: time (ns) by type of access, broken into network time, address translation, address incrementation, and memory access, with effective rates of 5.25 GB/s, 734 MB/s, and 4.25 MB/s; percentage of time spent in memory access by type of access.]

[Diagram: the UPC memory model, with a shared space partitioned among Thread 0 .. Thread THREADS-1 and per-thread private spaces Private 0 .. Private THREADS-1.]

SLIDE 19


Memory Access Costs in Chapel

- Tested shared-address access costs in Chapel, using Chapel syntax to compare:
  • Local part of a distributed object, un-optimized: accessing local data without saying "local"
  • Local, optimized: local part hand-optimized by saying "local"
  • Local and non-distributed
- Compiler optimization -> 2x faster
- Both compiler and hand optimization -> 70x faster
- Compiler optimization affects remote accesses as well
- Both UPC and Chapel require "unproductive!" hand tuning to improve local shared accesses
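In UPC, the hand tuning in question is typically pointer privatization: casting the shared qualifier away from data known to be local, so each access skips shared-address translation. A minimal sketch using standard UPC (illustrative, not from the slides):

    #include <upc_relaxed.h>

    shared [4] int A[4*THREADS];        /* block-cyclic, blocks of 4 */

    void scale_my_block(void) {
        /* Legal only because A[MYTHREAD*4] has affinity to this thread:
           the private pointer bypasses shared-address translation. */
        int *p = (int *)&A[MYTHREAD * 4];
        for (int i = 0; i < 4; i++)
            p[i] *= 2;
    }

This is exactly the kind of manual, error-prone step the talk calls unproductive; the hardware support discussed next aims to make it unnecessary.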

SLIDE 20


Fast Address Translation for PGAS

- Software solutions
  • Hand tweaking – non-productive
  • Compiler optimizations – reduced arithmetic for some straightforward cases
  • Look-up tables, full and reduced – take memory! [ICPP05]
  • TLBs ...
- Hardware solutions
  • Create hardware that understands how to traverse the PGAS memory model and supports the basic costly operations
  • Avail it through instructions and have the compiler leverage them
- Some work exists for UPC, little for Chapel

SLIDE 21

Hardware Support for PGAS

- Example operations to support in hardware
  • Shared-address incrementation
  • Load/store to/from a PGAS shared address
    - Address translation support: convert a shared address to the system virtual address used to perform the access
  • Locality tests for remote data
    - Can be used to tell whether to call the network subroutines, e.g. by testing the affinity field in a work-sharing construct
- Availed as an ISA extension
- New instructions used directly by the compiler
- Current hardware support and instructions cover only address mapping
- Future support for remote data accesses and various types of synchronization is of interest
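To make this concrete, here is a hedged sketch of what compiler-generated code using such instructions could look like, written as C intrinsics. The pgas_inc/pgas_st mnemonics come from the next slide; the signatures, the pgas_is_local test, and the remote_put fallback are hypothetical:

    /* Hypothetical intrinsics wrapping the proposed ISA extension;
       not the actual hardware interface. */
    typedef unsigned long pgas_addr_t;   /* packed {thread, phase, vaddr} shared pointer */

    extern pgas_addr_t pgas_inc_w(pgas_addr_t a, long n); /* shared-address increment */
    extern void        pgas_st_w(pgas_addr_t a, int v);   /* store via shared address */
    extern int         pgas_is_local(pgas_addr_t a);      /* affinity/locality test   */
    extern void        remote_put(pgas_addr_t a, int v);  /* runtime network fallback */

    void store_elem(pgas_addr_t elem, int value) {
        if (pgas_is_local(elem))
            pgas_st_w(elem, value);   /* hardware translates and stores directly */
        else
            remote_put(elem, value);  /* remote data still goes through the network */
    }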

SLIDE 22

Hardware/Software Co-Design Platform in a Nutshell

- First prototype in FPGAs: supports small core counts and apps
- Second is primarily software: supports bigger core counts and codes

[Diagram: the software stack (GASNet, BUPC, benchmarking kernels) is ported on top of Gem5, with the new instructions inserted into code generation, so UPC code runs out of the box on a runtime system that recognizes and enforces the developed mapping. The same stack is ported on top of Leon3 cores on a Virtex-6 FPGA, extended with the proposed PGAS hardware support for shared addressing. A workstation cluster is planned for the future.]

SLIDE 23

PGAS Hardware Support Overview

 shared [4] int arrayA[32];
 arrayA[10] = 5;

[Diagram: with 4 threads, arrayA is distributed block-cyclically in blocks of 4 elements: Thread 0 holds elements 0-3 and 16-19, Thread 1 holds 4-7 and 20-23, Thread 2 holds 8-11 and 24-27, and Thread 3 holds 12-15 and 28-31.]

A shared pointer is represented as {Thread, Phase, Virtual address}, versus a regular pointer representation. Address incrementation (pgas_inc_{x}) advances, e.g., from {Th=0, Ph=0, Va=0x3f10} to {Th=2, Ph=2, Va=0x3f18}; address translation/store (pgas_st_{x}) then yields the system virtual address used for the access (e.g., 0xfff01203f14).
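The affinity arithmetic behind this example is the standard UPC block-cyclic layout. A small self-contained C sketch reproducing the slide's numbers:

    #include <stdio.h>

    /* Mapping for: shared [B] int arrayA[N] with THREADS UPC threads */
    enum { B = 4, THREADS = 4 };

    int main(void) {
        int i = 10;                         /* arrayA[10] = 5;                  */
        int thread = (i / B) % THREADS;     /* thread owning the block -> 2     */
        int phase  = i % B;                 /* position within the block -> 2   */
        int block  = i / (B * THREADS);     /* local block index on that thread */
        printf("Th=%d Ph=%d local block=%d\n", thread, phase, block);
        return 0;
    }

Running it prints Th=2 Ph=2, matching the {Th=2, Ph=2} shared pointer on the slide.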

SLIDE 24

Early Results: NPB Kernels with HW Support (Gem5, Alpha 21264)

SLIDE 25


Overview

- Fundamental Challenges for Extreme Computing
- Locality and Hierarchical Locality
- Programming Models
- Hardware Support for Productive Locality Exploitation: Address Remapping
- Hierarchical Locality Exploitation
- Concluding Remarks

SLIDE 26


Possible Solutions for Hierarchical Locality Exploitation

- Rewrite your code with low-level tricks to target the underlying hierarchical architecture?
  • Great performance, but not productive and non-portable
- Extend programming models with hierarchical syntax and semantics and ask programmers to worry about all of those hardware details (i.e., make them hierarchical-locality-aware)?
  • Portable, but not productive
SLIDE 27


Productive Division of Responsibilities: The Programmer and the System

- Programmer
  • Use a locality-aware programming paradigm such as MPI or a PGAS language
  • Let the programmer worry about first-order locality: thread-data affinity
- System
  • Understand the system hierarchy and the costs associated with data movement across levels
  • Understand the program characteristics
  • Derive locality exploitation on a level-by-level basis via hierarchical thread grouping/partitioning

SLIDE 28


Motivations and Early Investigations

- Proper placement will
  • Avoid unnecessary data movement by exploiting locality
  • Utilize the shared memory and caches in the neighborhood
  • Utilize the best interconnect for the underlying communication
  • Yield a rising benefit as the size of the system increases: a must for exascale!

[Figure: "Effect of Exploiting Hierarchical Locality (Read Access)": a synthetic benchmark showing the speedup from proper placement (up to the 35-40x band) as the number of threads (24-1008) and the percentage of remote communication (10-90%) vary.]

SLIDE 29


Motivations and Early Investigations

- The response of each level to communication varies with message size
  • Closer is not always faster
- Know and characterize your architecture!

[Figure: "Put/Write Bandwidth – Cray XE6m": bandwidth (GB/s) vs. message size (8 B to 2 MB) for Self, Same Die, Same Chip, Same Node, and Remote placements.]

SLIDE 30

PHLAME Methodology (Parallel Hierarchical Abstraction Model of Execution)

[Diagram: communication benchmarks characterize the target machine into a PHLAME description file; the instrumented program produces the application's communication profile; a placement algorithm combines the two into a placement for the target machine.]

1. Characterize the machine's message costs at each level to generate the PHLAME Description File (PDF)
2. Profile the application's communication
3. Build a placement layout for the threads based on the above
4. Run the application with the layout built in the previous step

SLIDE 31


Characterizing the target machine

- Message cost: the total time for a message to be delivered

Example: time per message (ns) from the machine communication characterization:

Level | 1 B      | 8 B      | 16 B     | 32 B     | 64 B     | 128 B    | ...
1     | 0.516956 | 0.665469 | 1.209482 | 1.986097 | 3.606203 | 7.593014 |
2     | 0.688468 | 1.038422 | 1.54703  | 2.772387 | 5.138746 | 10.86957 |
3     | 0.687853 | 1.033378 | 1.543448 | 2.770083 | 5.128205 | 10.85776 |
4     | 0.706414 | 1.05042  | 1.548707 | 2.77855  | 5.128205 | 11.02536 |

SLIDE 32


Characterizing the application communication

- Instrument the application code to generate the communication activity matrices
- The message-size range is partitioned into bins
  • Each bin corresponds to a sub-range, e.g. 1-64, 64-128, ...
- There are two communication activity matrices for each bin
  • Average message size
  • Number of messages

[Diagram: per-bin matrices (Msg < 64, 64 ≤ Msg < 128, 128 ≤ Msg < 256), indexed by initiating thread and data-affinity thread, holding the average message size and the number of messages.]
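A hedged sketch of the bookkeeping this instrumentation implies, in plain C (the structure names and bin edges are illustrative):

    enum { NTHREADS = 4, NBINS = 3 };
    static const long bin_max[NBINS] = { 64, 128, 256 };  /* illustrative bin edges */

    /* Two activity matrices per bin: message counts and running average size,
       indexed by [initiating thread][data-affinity thread]. */
    static long   num_msgs[NBINS][NTHREADS][NTHREADS];
    static double avg_size[NBINS][NTHREADS][NTHREADS];

    static int bin_of(long bytes) {
        for (int b = 0; b < NBINS; b++)
            if (bytes < bin_max[b]) return b;
        return NBINS - 1;                    /* clamp oversized messages */
    }

    /* Called by the instrumentation on every message from thread src to
       data with affinity to thread dst. */
    void record_msg(int src, int dst, long bytes) {
        int b = bin_of(bytes);
        long n = ++num_msgs[b][src][dst];
        avg_size[b][src][dst] += (bytes - avg_size[b][src][dst]) / n;  /* running mean */
    }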

SLIDE 33


Calculating Level Costs

- Placement decisions require a measure of how well threads fit together
- Repeat for each level:
  • For each pair of threads (i, j), where i ≠ j, calculate the cost of their communication at that level: across all B bins, take the element-wise product (⊙) of the number-of-messages matrix with the per-level cost of a message of the bin's average size, and sum over the bins

Example per-level message costs (ns) by message size:

Level  | 8 B      | 16 B     | 32 B     | 64 B     | 128 B    | 256 B    | ...
Die    | 0.516956 | 0.665469 | 1.209482 | 1.986097 | 3.606203 | 7.593014 |
Chip   | 0.688468 | 1.038422 | 1.54703  | 2.772387 | 5.138746 | 10.86957 |
Node   | 0.687853 | 1.033378 | 1.543448 | 2.770083 | 5.128205 | 10.85776 |
Remote | 0.706414 | 1.05042  | 1.548707 | 2.77855  | 5.128205 | 11.02536 |

[Diagram: per-bin matrices (Msg < 64, 64 ≤ Msg < 256, 256 ≤ Msg < 512) of average message size and number of messages, combined (⊙) with the message costs to produce the level costs.]
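In code form, continuing the illustrative structures from the profiling sketch above (msg_cost stands in for a lookup into the machine characterization table; its body here is a crude placeholder, illustration only):

    /* Illustrative stand-in for a lookup into the PHLAME description file. */
    static double msg_cost(int level, double bytes) {
        static const double base[4] = { 0.52, 0.69, 0.69, 0.71 };  /* ns, from the table */
        return base[level] * (1.0 + bytes / 18.0);   /* crude linear fit */
    }

    /* Level cost for a thread pair (i, j): sum over bins of
       (number of messages) x (cost of a message of the bin's average size). */
    double level_cost(int level, int i, int j) {
        double cost = 0.0;
        for (int b = 0; b < NBINS; b++)
            cost += num_msgs[b][i][j] * msg_cost(level, avg_size[b][i][j]);
        return cost;
    }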

SLIDE 34

Hierarchical Thread Fitness Measure

- The fit measure shows how two threads benefit or lose if scheduled on a given level
- The fit measure is based on the difference of message costs at each level

[Diagram: a pair of threads placed at successive levels of the hierarchy (CPU, Node, Blade, Chassis), comparing the cost of a placement at one level given the placement at another.]
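The formula itself did not survive extraction. One plausible formalization, consistent with "the difference of message costs at each level" but an assumption rather than the paper's exact definition, uses the level costs $C_{i,j}^{(\ell)}$ computed above:

$$\mathrm{HTF}_{i,j}^{(\ell)} = C_{i,j}^{(\ell+1)} - C_{i,j}^{(\ell)}$$

that is, the communication cost saved by co-locating threads $i$ and $j$ at level $\ell$ rather than only at the next level up.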
SLIDE 39


Mapping to Graph Theory

- The application communication pattern can be mapped onto a graph
  • Vertices represent the threads
  • Edges represent interactions between threads
  • The HTF values at each level are the edge weights
- Multiple weights per edge: edge (i, j) carries wij1, wij2, wij3, ... wijL, one weight per level
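A minimal data-structure sketch for such a multi-weighted graph (illustrative, not taken from the paper):

    enum { L_LEVELS = 4 };           /* number of hierarchy levels, L */

    typedef struct {
        int    i, j;                 /* thread ids at the endpoints          */
        double w[L_LEVELS];          /* one HTF weight per level: wij1..wijL */
    } Edge;

    typedef struct {
        int   nthreads;              /* vertices = threads                   */
        int   nedges;
        Edge *edges;                 /* interactions observed in profiling   */
    } CommGraph;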

SLIDE 40


Hierarchical Graph Partitioning

Algorithms can be:
- Bottom-up: form partitions at the lower levels first and recursively group them at higher levels (see the sketch below)
- Top-down: form partitions at the upper levels first and recursively break them down at lower levels

Abstract machine example: Level 1: Width = 4 (number of locales), MaxLocaleSize = 4 (number of cores in each locale)
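As a toy illustration of the bottom-up flavor (a greedy pairing heuristic, not the partitioner used in the work): repeatedly merge the pair of groups with the heaviest edge weight that still fits within the level's capacity, then repeat the procedure on the resulting groups at the next level up.

    #include <stdio.h>

    enum { N = 8, CAP = 2 };   /* 8 threads, capacity 2 per lowest-level group */

    static double w[N][N];     /* toy single-level edge weights */
    static int parent[N], size[N];

    static int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }

    int main(void) {
        /* fabricate a pattern: thread i talks heavily to thread i^1 */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                w[i][j] = (i == j) ? 0.0 : (((i ^ 1) == j) ? 10.0 : 1.0);

        for (int i = 0; i < N; i++) { parent[i] = i; size[i] = 1; }

        /* greedily merge the heaviest pair whose groups still fit in CAP */
        for (;;) {
            int bi = -1, bj = -1; double best = -1.0;
            for (int i = 0; i < N; i++)
                for (int j = i + 1; j < N; j++) {
                    int ri = find(i), rj = find(j);
                    if (ri != rj && size[ri] + size[rj] <= CAP && w[i][j] > best) {
                        best = w[i][j]; bi = ri; bj = rj;
                    }
                }
            if (bi < 0) break;                 /* no mergeable pair remains */
            parent[bj] = bi; size[bi] += size[bj];
        }

        for (int i = 0; i < N; i++)
            printf("thread %d -> group %d\n", i, find(i));
        return 0;
    }

Here the heavily communicating pairs (0,1), (2,3), ... end up co-located in the same lowest-level group.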

SLIDE 41


Testbed

- Cray XE6m/XK7m
  • 24 cores per node: two 12-core AMD Magny-Cours
  • Gemini interconnect: 2D torus
- UPC NPB benchmarks from GWU
  • IS – Class C
  • FT – Class C
  • CG – Class C
  • MG – Class C
  • EP – Class C
- Heat Diffusion

SLIDE 42


Profiling the application communication – Implementation

- TAU was selected to profile the UPC and MPI programs
  • Generates an activity matrix for each bin
- Bins are not supported in TAU profiles
- Modifications were made to the TAU backend and frontends to support bins

SLIDE 43


Customizing GASNet

- The clustering algorithm usually assigns unequal numbers of threads to different nodes
- The Cray Application Level Placement Scheduler (ALPS) does not support this
- A modified GASNet Gemini conduit was used to trick the system into the non-uniform thread count per node
  • Dummy processes are launched
  • Environment variables control how the runtime picks the correct number of processes on each node (see the sketch below)

[Diagram: GASNET_THREAD_MAP and GASNET_NUM_THREADS select which launched processes are live on each node.]
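A hypothetical invocation of the modified conduit. The two variable names appear on the slide, but the exact syntax and semantics shown here are assumptions, not documented GASNet options:

    # launch more processes than needed (what ALPS allows), then let the
    # modified conduit keep only the mapped ranks live on each node
    export GASNET_NUM_THREADS=10
    export GASNET_THREAD_MAP="3 8 1 10 2 5 11 12 6 7"
    aprun -n 12 ./upc_app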

SLIDE 44


Experimental Results

- FT – all-to-all communication

[Figures: gain (%) and relative communication overhead vs. number of threads (64-1024) for Clustering, Splitting, Splitting – Non-Restricted, and PHAST, relative to the Default placement.]

SLIDE 45


Experimental Results – MPI

- FT – all-to-all communication

[Figures: gain (%) and relative communication overhead vs. number of threads (64-1024) for Clustering, Splitting, Splitting – Non-Restricted, and PHAST, relative to the Default placement.]

SLIDE 46


Experimental Results – UPC

- CG – irregular memory access and communication

[Figures: gain (%) and relative communication overhead vs. number of threads (64-1024) for Clustering, Splitting, Splitting – Non-Restricted, and PHAST, relative to the Default placement.]

SLIDE 47


Experimental Results – MPI

- CG – irregular memory access and communication

[Figures: gain (%) and relative communication overhead vs. number of threads (64-1024) for Clustering, Splitting, Splitting – Non-Restricted, and PHAST, relative to the Default placement.]

SLIDE 48


CG – Non-Restricted Explanation

[Diagram: thread placement spanning Node 0 and Node 1, with remote links between them.]

SLIDE 49


Overview

- Fundamental Challenges for Extreme Computing
- Locality and Hierarchical Locality
- Programming Models
- Hardware Support for Productive Locality Exploitation: Address Remapping
- Hierarchical Locality Exploitation
- Concluding Remarks

SLIDE 50


Concluding Remarks

- Due to energy and bandwidth constraints, data movement is becoming too expensive
- Locality exploitation is an obvious target
- Extreme-scale architectures are becoming deeply hierarchical, giving rise to hierarchical locality
- Hierarchical locality exploitation must be done productively, leaving programmers only the necessary minimum of work
- We can expect some programming paradigms to provide explicit solutions
- Locality-aware programming, hardware support, and run-time systems can play a bigger role while keeping programmers productive

SLIDE 51


Publications

Ahmad Anbar, Olivier Serres, Engin Kayraklioglu, Abdel-Hameed Badawy, and Tarek El-Ghazawi, "Exploiting Hierarchical Locality in Deep Parallel Architectures," ACM Transactions on Architecture and Code Optimization, vol. 13, no. 2, June 2016.

Olivier Serres, Abdullah Kayi, Ahmad Anbar, and Tarek El-Ghazawi, "Enabling PGAS Productivity with Hardware Support for Shared Address Mapping: A UPC Case Study," ACM Transactions on Architecture and Code Optimization, vol. 12, no. 4, January 2016.

Ahmad Anbar, Abdel-Hameed Badawy, Olivier Serres, and Tarek El-Ghazawi, "Where Should the Threads Go? Leveraging Hierarchical Data Locality to Solve the Thread Affinity Dilemma," in Proc. 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2014), Hsinchu, Taiwan, Dec. 16-19, 2014.

Ahmad Anbar, Olivier Serres, Engin Kayraklioglu, Abdel-Hameed Badawy, and Tarek El-Ghazawi, "PHLAME: Hierarchical Locality Exploitation Using the PGAS Model," IEEE International Conference on Partitioned Global Address Space Programming Models (PGAS 2015), Washington, DC, September 18-20, 2015.

Olivier Serres, Abdullah Kayi, Ahmad Anbar, and Tarek El-Ghazawi, "Enabling PGAS Productivity with Hardware Support for Shared Address Mapping: A UPC Case Study," in Proc. 16th IEEE International Conference on High Performance Computing and Communications, August 20-22, 2014.

SLIDE 52


Follow-up Work in Hierarchical Locality Exploitation

- Use thread-data affinity from the locality-aware program as a starting point for a hierarchical locality exploitation system (PHLAME: Parallel Hierarchical Abstraction Model of Execution)
- Examine the best graph partitioning methods
- Decentralize the algorithms, and build in fast predictions, to handle exascale
- Consider dynamic solutions
- Consider unprofiled cases, collecting intelligence on runs for later use and optimization
- Consider data-dependent cases
- Consider dynamic parallelism cases
- Investigate hardware support