Hierarchical Locality and Parallel Programming in the Extreme Scale Era (PowerPoint Presentation Transcript)

SLIDE 1

Hierarchical Locality and Parallel Programming in the Extreme Scale Era

Tarek El-Ghazawi

The George Washington University

University of Southern California, September 29, 2016

SLIDE 2

Overview

- Fundamental Challenges for Extreme Computing
- Locality and Hierarchical Locality
- Programming Models
- Hardware Support for Productive Locality Exploitation: Address Remapping
- Hierarchical Locality Exploitation
- Concluding Remarks

SLIDE 3

Top Ten Challenges for Exascale: Areas where research and advances are needed!

1. Energy Efficiency
2. Interconnect Technology
3. Memory Technology
4. Scalable System Software
5. Programming Systems
6. Data Management
7. Exascale Algorithms
8. Algorithms for Discovery, Design & Decision
9. Resilience and Correctness
10. Scientific Productivity

Source: DoE ASCAC Subcommittee Report, Feb 2014. Data movement and/or programming related.

SLIDE 4

Technological Challenges: Combined Bandwidth and Energy Challenges for Exascale

- Locality and data movement matter a lot; cost (energy and time) rapidly increases with distance
- Locality and data movement are critical even at short distances, more so at far distances

[Figures: bandwidth density vs. system distance; energy vs. system distance. Source: ASCAC 14]

SLIDE 5

Technological Challenges (2): Bandwidth

- Interconnect is not keeping up with the growth in compute capability
  • Many apps require 1 Byte/FLOP off-chip, which is not possible at 10 TFLOPs per chip and beyond (Intel Knights Landing: 500 GB/s => 1/6 Byte/FLOP)
  • Huge bandwidth density (GB/s/μm) is needed on-chip due to the large number of cores in a small area

Ref: Miller, D. A., Proceedings of the IEEE, 2009.

[Figures: growing manycore bandwidth requirements; the widening gap between available I/O and compute capability, shown as Bytes/FLOP per year (2012-2015) for Xeon Phi (Knights Corner, Knights Landing) and NVIDIA K20/K40/K80.]

SLIDE 6


Overview

- Fundamental Challenges for Extreme Computing
- Locality and Hierarchical Locality
- Programming Models
- Hardware Support for Productive Locality Exploitation: Address Remapping
- Hierarchical Locality Exploitation
- Concluding Remarks

SLIDE 7

Architectural Challenges: Architectures are Becoming Deeply Hierarchical at Extreme Scale – Chips and Systems

Examples: Cray XC40 (system level); Tilera TILE64 (chip level)
SLIDE 13


Overview

- Fundamental Challenges for Extreme Computing
- Locality and Hierarchical Locality
- Programming Models
- Hardware Support for Productive Locality Exploitation: Address Remapping
- Hierarchical Locality Exploitation
- Concluding Remarks

SLIDE 14


Where Do Programming Models Fit into All of That?

- What is a programming model?
  • An abstract virtual machine
  • A view of data and execution
  • The logical interface between architecture and applications

- Why programming models?
  • Decouple applications and architectures
    - Write applications that run effectively across architectures
    - Design new architectures that can effectively support legacy applications

- Programming model design considerations
  • Expose modern architectural features to exploit machine power and improve performance
  • Maintain ease of use
  • Together, the two previous points increase productivity!
SLIDE 15


Current Programming Models and Locality Awareness

Comparison by process/thread view and address space:

- Partitioned Global Address Space: locality-aware
  • One-sided communication
  • Examples: UPC and Chapel
- Shared Memory: not locality-aware (×)
  • One-sided communication
  • Example: OpenMP
- Message Passing: locality-aware
  • Two-sided communication
  • Example: MPI
SLIDE 16


PGAS Languages Include UPC, Chapel and X10
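To make the model concrete, here is a minimal UPC sketch (not from the slides; it assumes any standard UPC compiler such as Berkeley UPC). A shared array is distributed across threads, and any thread can read or write any element with one-sided semantics:

    #include <upc_relaxed.h>
    #include <stdio.h>

    shared int data[THREADS];   /* one element with affinity to each thread */

    int main(void) {
        data[MYTHREAD] = MYTHREAD * 10;   /* write the element local to this thread */
        upc_barrier;
        /* one-sided read of a neighbor's element: no matching send/receive needed */
        int right = data[(MYTHREAD + 1) % THREADS];
        printf("Thread %d read %d\n", MYTHREAD, right);
        return 0;
    }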

SLIDE 17


Overview

- Fundamental Challenges for Extreme Computing
- Locality and Hierarchical Locality
- Programming Models
- Hardware Support for Productive Locality Exploitation: Address Remapping
- Hierarchical Locality Exploitation
- Concluding Remarks

SLIDE 18

Memory Accesses in UPC: Shared Address Translation Overheads

Measurement of the address-space overheads: a set of micro-benchmarks measures the different aspects separately.

[Figures: time (ns) by type of access, broken into network time, address translation, address incrementation, and memory access, with effective rates of 5.25 GB/s, 734 MB/s, and 4.25 MB/s; percentage of time spent in memory access by type of access.]

[Diagram: the UPC memory model, with a shared space partitioned among Thread 0 .. Thread THREADS-1 and per-thread private spaces Private 0 .. Private THREADS-1.]

SLIDE 19


Memory Access Costs in Chapel

- Tested shared-address access costs in Chapel, using Chapel syntax to compare:
  • Local part of a distributed object, un-optimized: accessing local data without saying "local"
  • Local, optimized: local part hand-optimized by saying "local"
  • Local and non-distributed
- Compiler optimization -> 2x faster
- Both compiler and hand optimization -> 70x faster
- Compiler optimization affects remote accesses as well
- Both UPC and Chapel require "unproductive!" hand tuning to improve local shared accesses
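In UPC, the hand tuning in question is typically pointer privatization: casting the shared qualifier away from data known to be local, so each access skips shared-address translation. A minimal sketch using standard UPC (illustrative, not from the slides):

    #include <upc_relaxed.h>

    shared [4] int A[4*THREADS];        /* block-cyclic, blocks of 4 */

    void scale_my_block(void) {
        /* Legal only because A[MYTHREAD*4] has affinity to this thread:
           the private pointer bypasses shared-address translation. */
        int *p = (int *)&A[MYTHREAD * 4];
        for (int i = 0; i < 4; i++)
            p[i] *= 2;
    }

This is exactly the kind of manual, error-prone step the talk calls unproductive; the hardware support discussed next aims to make it unnecessary.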

SLIDE 20


Fast Address Translation for PGAS

- Software solutions
  • Hand tweaking – non-productive
  • Compiler optimizations – reduced arithmetic for some straightforward cases
  • Look-up tables, full and reduced – take memory! [ICPP05]
  • TLBs ...
- Hardware solutions
  • Create hardware that understands how to traverse the PGAS memory model and supports the basic costly operations
  • Avail it through instructions and have the compiler leverage them
- Some work exists for UPC, little for Chapel

SLIDE 21

Hardware Support for PGAS

- Example operations to support in hardware
  • Shared-address incrementation
  • Load/store to/from a PGAS shared address
    - Address translation support: convert a shared address to the system virtual address used to perform the access
  • Locality tests for remote data
    - Can be used to tell whether to call the network subroutines, e.g. by testing the affinity field in a work-sharing construct
- Availed as an ISA extension
- New instructions used directly by the compiler
- Current hardware support and instructions cover only address mapping
- Future support for remote data accesses and various types of synchronization is of interest
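To make this concrete, here is a hedged sketch of what compiler-generated code using such instructions could look like, written as C intrinsics. The pgas_inc/pgas_st mnemonics come from the next slide; the signatures, the pgas_is_local test, and the remote_put fallback are hypothetical:

    /* Hypothetical intrinsics wrapping the proposed ISA extension;
       not the actual hardware interface. */
    typedef unsigned long pgas_addr_t;   /* packed {thread, phase, vaddr} shared pointer */

    extern pgas_addr_t pgas_inc_w(pgas_addr_t a, long n); /* shared-address increment */
    extern void        pgas_st_w(pgas_addr_t a, int v);   /* store via shared address */
    extern int         pgas_is_local(pgas_addr_t a);      /* affinity/locality test   */
    extern void        remote_put(pgas_addr_t a, int v);  /* runtime network fallback */

    void store_elem(pgas_addr_t elem, int value) {
        if (pgas_is_local(elem))
            pgas_st_w(elem, value);   /* hardware translates and stores directly */
        else
            remote_put(elem, value);  /* remote data still goes through the network */
    }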

SLIDE 22

Hardware/Software Co-Design Platform in a Nutshell

- First prototype in FPGAs: supports small core counts and apps
- Second is primarily software: supports bigger core counts and codes

[Diagram: the software stack (GASNet, BUPC, benchmarking kernels) is ported on top of Gem5, with the new instructions inserted into code generation, so UPC code runs out of the box on a runtime system that recognizes and enforces the developed mapping. The same stack is ported on top of Leon3 cores on a Virtex-6 FPGA, extended with the proposed PGAS hardware support for shared addressing. A workstation cluster is planned for the future.]

SLIDE 23

PGAS Hardware Support Overview

 shared [4] int arrayA[32];
 arrayA[10] = 5;

[Diagram: with 4 threads, arrayA is distributed block-cyclically in blocks of 4 elements: Thread 0 holds elements 0-3 and 16-19, Thread 1 holds 4-7 and 20-23, Thread 2 holds 8-11 and 24-27, and Thread 3 holds 12-15 and 28-31.]

A shared pointer is represented as {Thread, Phase, Virtual address}, versus a regular pointer representation. Address incrementation (pgas_inc_{x}) advances, e.g., from {Th=0, Ph=0, Va=0x3f10} to {Th=2, Ph=2, Va=0x3f18}; address translation/store (pgas_st_{x}) then yields the system virtual address used for the access (e.g., 0xfff01203f14).
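The affinity arithmetic behind this example is the standard UPC block-cyclic layout. A small self-contained C sketch reproducing the slide's numbers:

    #include <stdio.h>

    /* Mapping for: shared [B] int arrayA[N] with THREADS UPC threads */
    enum { B = 4, THREADS = 4 };

    int main(void) {
        int i = 10;                         /* arrayA[10] = 5;                  */
        int thread = (i / B) % THREADS;     /* thread owning the block -> 2     */
        int phase  = i % B;                 /* position within the block -> 2   */
        int block  = i / (B * THREADS);     /* local block index on that thread */
        printf("Th=%d Ph=%d local block=%d\n", thread, phase, block);
        return 0;
    }

Running it prints Th=2 Ph=2, matching the {Th=2, Ph=2} shared pointer on the slide.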

SLIDE 24

Early Results: NPB Kernels with HW Support (Gem5, Alpha 21264)

SLIDE 25


Overview

- Fundamental Challenges for Extreme Computing
- Locality and Hierarchical Locality
- Programming Models
- Hardware Support for Productive Locality Exploitation: Address Remapping
- Hierarchical Locality Exploitation
- Concluding Remarks

SLIDE 26


Possible Solutions for Hierarchical Locality Exploitation

- Rewrite your code with low-level tricks to target the underlying hierarchical architecture?
  • Great performance, but not productive and non-portable
- Extend programming models with hierarchical syntax and semantics and ask programmers to worry about all of those hardware details (i.e., make them hierarchical-locality-aware)?
  • Portable, but not productive
SLIDE 27


Productive Division of Responsibilities: The Programmer and the System

- Programmer
  • Use a locality-aware programming paradigm such as MPI or a PGAS language
  • Let the programmer worry about first-order locality: thread-data affinity
- System
  • Understand the system hierarchy and the costs associated with data movement across levels
  • Understand the program characteristics
  • Derive locality exploitation on a level-by-level basis via hierarchical thread grouping/partitioning

SLIDE 28


Motivations and Early Investigations

- Proper placement will
  • Avoid unnecessary data movement by exploiting locality
  • Utilize the shared memory and caches in the neighborhood
  • Utilize the best interconnect for the underlying communication
  • Yield a rising benefit as the size of the system increases: a must for exascale!

[Figure: "Effect of Exploiting Hierarchical Locality (Read Access)": a synthetic benchmark showing the speedup from proper placement (up to the 35-40x band) as the number of threads (24-1008) and the percentage of remote communication (10-90%) vary.]

SLIDE 29


Motivations and Early Investigations

- The response of each level to communication varies with message size
  • Closer is not always faster
- Know and characterize your architecture!

[Figure: "Put/Write Bandwidth – Cray XE6m": bandwidth (GB/s) vs. message size (8 B to 2 MB) for Self, Same Die, Same Chip, Same Node, and Remote placements.]

SLIDE 30

PHLAME Methodology (Parallel Hierarchical Abstraction Model of Execution)

[Diagram: communication benchmarks characterize the target machine into a PHLAME description file; the instrumented program produces the application's communication profile; a placement algorithm combines the two into a placement for the target machine.]

1. Characterize the machine's message costs at each level to generate the PHLAME Description File (PDF)
2. Profile the application's communication
3. Build a placement layout for the threads based on the above
4. Run the application with the layout built in the previous step

SLIDE 31


Characterizing the target machine

- Message cost: the total time for a message to be delivered

Example: time per message (ns) from the machine communication characterization:

Level | 1 B      | 8 B      | 16 B     | 32 B     | 64 B     | 128 B    | ...
1     | 0.516956 | 0.665469 | 1.209482 | 1.986097 | 3.606203 | 7.593014 |
2     | 0.688468 | 1.038422 | 1.54703  | 2.772387 | 5.138746 | 10.86957 |
3     | 0.687853 | 1.033378 | 1.543448 | 2.770083 | 5.128205 | 10.85776 |
4     | 0.706414 | 1.05042  | 1.548707 | 2.77855  | 5.128205 | 11.02536 |

SLIDE 32


Characterizing the application communication

- Instrument the application code to generate the communication activity matrices
- The message-size range is partitioned into bins
  • Each bin corresponds to a sub-range, e.g. 1-64, 64-128, ...
- There are two communication activity matrices for each bin
  • Average message size
  • Number of messages

[Diagram: per-bin matrices (Msg < 64, 64 ≤ Msg < 128, 128 ≤ Msg < 256), indexed by initiating thread and data-affinity thread, holding the average message size and the number of messages.]
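A hedged sketch of the bookkeeping this instrumentation implies, in plain C (the structure names and bin edges are illustrative):

    enum { NTHREADS = 4, NBINS = 3 };
    static const long bin_max[NBINS] = { 64, 128, 256 };  /* illustrative bin edges */

    /* Two activity matrices per bin: message counts and running average size,
       indexed by [initiating thread][data-affinity thread]. */
    static long   num_msgs[NBINS][NTHREADS][NTHREADS];
    static double avg_size[NBINS][NTHREADS][NTHREADS];

    static int bin_of(long bytes) {
        for (int b = 0; b < NBINS; b++)
            if (bytes < bin_max[b]) return b;
        return NBINS - 1;                    /* clamp oversized messages */
    }

    /* Called by the instrumentation on every message from thread src to
       data with affinity to thread dst. */
    void record_msg(int src, int dst, long bytes) {
        int b = bin_of(bytes);
        long n = ++num_msgs[b][src][dst];
        avg_size[b][src][dst] += (bytes - avg_size[b][src][dst]) / n;  /* running mean */
    }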

SLIDE 33


Calculating Level Costs

- Placement decisions require a measure of how well threads fit together
- Repeat for each level:
  • For each pair of threads (i, j), where i ≠ j, calculate the cost of their communication at that level: across all B bins, take the element-wise product (⊙) of the number-of-messages matrix with the per-level cost of a message of the bin's average size, and sum over the bins

Example per-level message costs (ns) by message size:

Level  | 8 B      | 16 B     | 32 B     | 64 B     | 128 B    | 256 B    | ...
Die    | 0.516956 | 0.665469 | 1.209482 | 1.986097 | 3.606203 | 7.593014 |
Chip   | 0.688468 | 1.038422 | 1.54703  | 2.772387 | 5.138746 | 10.86957 |
Node   | 0.687853 | 1.033378 | 1.543448 | 2.770083 | 5.128205 | 10.85776 |
Remote | 0.706414 | 1.05042  | 1.548707 | 2.77855  | 5.128205 | 11.02536 |

[Diagram: per-bin matrices (Msg < 64, 64 ≤ Msg < 256, 256 ≤ Msg < 512) of average message size and number of messages, combined (⊙) with the message costs to produce the level costs.]
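In code form, continuing the illustrative structures from the profiling sketch above (msg_cost stands in for a lookup into the machine characterization table; its body here is a crude placeholder, illustration only):

    /* Illustrative stand-in for a lookup into the PHLAME description file. */
    static double msg_cost(int level, double bytes) {
        static const double base[4] = { 0.52, 0.69, 0.69, 0.71 };  /* ns, from the table */
        return base[level] * (1.0 + bytes / 18.0);   /* crude linear fit */
    }

    /* Level cost for a thread pair (i, j): sum over bins of
       (number of messages) x (cost of a message of the bin's average size). */
    double level_cost(int level, int i, int j) {
        double cost = 0.0;
        for (int b = 0; b < NBINS; b++)
            cost += num_msgs[b][i][j] * msg_cost(level, avg_size[b][i][j]);
        return cost;
    }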

SLIDE 34

Hierarchical Thread Fitness Measure

- The fit measure shows how two threads benefit or lose if scheduled on a given level
- The fit measure is based on the difference of message costs at each level

[Diagram: a pair of threads placed at successive levels of the hierarchy (CPU, Node, Blade, Chassis), comparing the cost of a placement at one level given the placement at another.]
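The formula itself did not survive extraction. One plausible formalization, consistent with "the difference of message costs at each level" but an assumption rather than the paper's exact definition, uses the level costs $C_{i,j}^{(\ell)}$ computed above:

$$\mathrm{HTF}_{i,j}^{(\ell)} = C_{i,j}^{(\ell+1)} - C_{i,j}^{(\ell)}$$

that is, the communication cost saved by co-locating threads $i$ and $j$ at level $\ell$ rather than only at the next level up.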
SLIDE 39


Mapping to Graph Theory

- The application communication pattern can be mapped onto a graph
  • Vertices represent the threads
  • Edges represent interactions between threads
  • The HTF values at each level are the edge weights
- Multiple weights per edge: edge (i, j) carries wij1, wij2, wij3, ... wijL, one weight per level
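A minimal data-structure sketch for such a multi-weighted graph (illustrative, not taken from the paper):

    enum { L_LEVELS = 4 };           /* number of hierarchy levels, L */

    typedef struct {
        int    i, j;                 /* thread ids at the endpoints          */
        double w[L_LEVELS];          /* one HTF weight per level: wij1..wijL */
    } Edge;

    typedef struct {
        int   nthreads;              /* vertices = threads                   */
        int   nedges;
        Edge *edges;                 /* interactions observed in profiling   */
    } CommGraph;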

SLIDE 40


Hierarchical Graph Partitioning

Algorithms can be:
- Bottom-up: form partitions at the lower levels first and recursively group them at higher levels (see the sketch below)
- Top-down: form partitions at the upper levels first and recursively break them down at lower levels

Abstract machine example: Level 1: Width = 4 (number of locales), MaxLocaleSize = 4 (number of cores in each locale)
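As a toy illustration of the bottom-up flavor (a greedy pairing heuristic, not the partitioner used in the work): repeatedly merge the pair of groups with the heaviest edge weight that still fits within the level's capacity, then repeat the procedure on the resulting groups at the next level up.

    #include <stdio.h>

    enum { N = 8, CAP = 2 };   /* 8 threads, capacity 2 per lowest-level group */

    static double w[N][N];     /* toy single-level edge weights */
    static int parent[N], size[N];

    static int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }

    int main(void) {
        /* fabricate a pattern: thread i talks heavily to thread i^1 */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                w[i][j] = (i == j) ? 0.0 : (((i ^ 1) == j) ? 10.0 : 1.0);

        for (int i = 0; i < N; i++) { parent[i] = i; size[i] = 1; }

        /* greedily merge the heaviest pair whose groups still fit in CAP */
        for (;;) {
            int bi = -1, bj = -1; double best = -1.0;
            for (int i = 0; i < N; i++)
                for (int j = i + 1; j < N; j++) {
                    int ri = find(i), rj = find(j);
                    if (ri != rj && size[ri] + size[rj] <= CAP && w[i][j] > best) {
                        best = w[i][j]; bi = ri; bj = rj;
                    }
                }
            if (bi < 0) break;                 /* no mergeable pair remains */
            parent[bj] = bi; size[bi] += size[bj];
        }

        for (int i = 0; i < N; i++)
            printf("thread %d -> group %d\n", i, find(i));
        return 0;
    }

Here the heavily communicating pairs (0,1), (2,3), ... end up co-located in the same lowest-level group.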

SLIDE 41


Testbed

- Cray XE6m/XK7m
  • 24 cores per node: two 12-core AMD Magny-Cours
  • Gemini interconnect: 2D torus
- UPC NPB benchmarks from GWU
  • IS – Class C
  • FT – Class C
  • CG – Class C
  • MG – Class C
  • EP – Class C
- Heat Diffusion

SLIDE 42


Profiling the application communication – Implementation

- TAU was selected to profile the UPC and MPI programs
  • Generates an activity matrix for each bin
- Bins are not supported in TAU profiles
- Modifications were made to the TAU backend and frontends to support bins

SLIDE 43


Customizing GASNet

- The clustering algorithm usually assigns unequal numbers of threads to different nodes
- The Cray Application Level Placement Scheduler (ALPS) does not support this
- A modified GASNet Gemini conduit was used to trick the system into the non-uniform thread count per node
  • Dummy processes are launched
  • Environment variables control how the runtime picks the correct number of processes on each node (see the sketch below)

[Diagram: GASNET_THREAD_MAP and GASNET_NUM_THREADS select which launched processes are live on each node.]
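A hypothetical invocation of the modified conduit. The two variable names appear on the slide, but the exact syntax and semantics shown here are assumptions, not documented GASNet options:

    # launch more processes than needed (what ALPS allows), then let the
    # modified conduit keep only the mapped ranks live on each node
    export GASNET_NUM_THREADS=10
    export GASNET_THREAD_MAP="3 8 1 10 2 5 11 12 6 7"
    aprun -n 12 ./upc_app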

SLIDE 44


Experimental Results

- FT – all-to-all communication

[Figures: gain (%) and relative communication overhead vs. number of threads (64-1024) for Clustering, Splitting, Splitting – Non-Restricted, and PHAST, relative to the Default placement.]

SLIDE 45


Experimental Results – MPI

- FT – all-to-all communication

[Figures: gain (%) and relative communication overhead vs. number of threads (64-1024) for Clustering, Splitting, Splitting – Non-Restricted, and PHAST, relative to the Default placement.]

SLIDE 46


Experimental Results – UPC

- CG – irregular memory access and communication

[Figures: gain (%) and relative communication overhead vs. number of threads (64-1024) for Clustering, Splitting, Splitting – Non-Restricted, and PHAST, relative to the Default placement.]

SLIDE 47


Experimental Results – MPI

- CG – irregular memory access and communication

[Figures: gain (%) and relative communication overhead vs. number of threads (64-1024) for Clustering, Splitting, Splitting – Non-Restricted, and PHAST, relative to the Default placement.]

SLIDE 48


CG – Non-Restricted Explanation

[Diagram: thread placement spanning Node 0 and Node 1, with remote links between them.]

SLIDE 49


Overview

- Fundamental Challenges for Extreme Computing
- Locality and Hierarchical Locality
- Programming Models
- Hardware Support for Productive Locality Exploitation: Address Remapping
- Hierarchical Locality Exploitation
- Concluding Remarks

SLIDE 50


Concluding Remarks

- Due to energy and bandwidth constraints, data movement is becoming too expensive
- Locality exploitation is an obvious target
- Extreme-scale architectures are becoming deeply hierarchical, giving rise to hierarchical locality
- Hierarchical locality exploitation must be done productively, leaving programmers only the necessary minimum of work
- We can expect some programming paradigms to provide explicit solutions
- Locality-aware programming, hardware support, and run-time systems can play a bigger role while keeping programmers productive

SLIDE 51


Publications

Ahmad Anbar, Olivier Serres, Engin Kayraklioglu, Abdel-Hameed Badawy, and Tarek El-Ghazawi, "Exploiting Hierarchical Locality in Deep Parallel Architectures," ACM Transactions on Architecture and Code Optimization, vol. 13, no. 2, June 2016.

Olivier Serres, Abdullah Kayi, Ahmad Anbar, and Tarek El-Ghazawi, "Enabling PGAS Productivity with Hardware Support for Shared Address Mapping: A UPC Case Study," ACM Transactions on Architecture and Code Optimization, vol. 12, no. 4, January 2016.

Ahmad Anbar, Abdel-Hameed Badawy, Olivier Serres, and Tarek El-Ghazawi, "Where Should the Threads Go? Leveraging Hierarchical Data Locality to Solve the Thread Affinity Dilemma," in Proc. 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2014), Hsinchu, Taiwan, Dec. 16-19, 2014.

Ahmad Anbar, Olivier Serres, Engin Kayraklioglu, Abdel-Hameed Badawy, and Tarek El-Ghazawi, "PHLAME: Hierarchical Locality Exploitation Using the PGAS Model," IEEE International Conference on Partitioned Global Address Space Programming Models (PGAS 2015), Washington, DC, September 18-20, 2015.

Olivier Serres, Abdullah Kayi, Ahmad Anbar, and Tarek El-Ghazawi, "Enabling PGAS Productivity with Hardware Support for Shared Address Mapping: A UPC Case Study," in Proc. 16th IEEE International Conference on High Performance Computing and Communications, August 20-22, 2014.

SLIDE 52


Follow-up Work in Hierarchical Locality Exploitation

- Use thread-data affinity from the locality-aware program as a starting point for a hierarchical locality exploitation system (PHLAME: Parallel Hierarchical Abstraction Model of Execution)
- Examine the best graph partitioning methods
- Decentralize the algorithms, and build in fast predictions, to handle exascale
- Consider dynamic solutions
- Consider unprofiled cases, collecting intelligence on runs for later use and optimization
- Consider data-dependent cases
- Consider dynamic parallelism cases
- Investigate hardware support