Hierarchical Locality and Parallel Programming in the Extreme Scale Era
Tarek El-Ghazawi
The George Washington University
University of Southern California, September 29, 2016
Overview
- Fundamental Challenges for Extreme Computing
- Locality and Hierarchical Locality
- Programming Models
- Hardware Support for Productive Locality
- Hierarchical Locality Exploitation
- Concluding Remarks
DoE ASCAC Subcommittee Report, Feb 2014: many of the identified challenges are data movement and/or programming related.
- Locality and data movement matter a lot in cost (energy and time)
- Locality and data movement are critical even at short distances
[Figure: bandwidth density vs. system distance; energy vs. system distance. Source: ASCAC 2014]
- The interconnect is not keeping up with the growth in compute capability
- Intel Knights Landing: 500 GB/s, i.e. about 1/6 Byte/FLOP (Ref: Miller, D. A. B., Proceedings of the IEEE, 2009)
- Growing manycore bandwidth requirements; widening gap between available I/O and compute capability
[Figure: Bytes/FLOP (0.05 to 0.35) vs. year (2012 to 2015) for Intel Xeon Phi (Knights Corner, Knights Landing) and NVIDIA K20/K40/K80]
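The 1/6 Byte/FLOP figure follows from simple arithmetic. As a sketch, assuming Knights Landing peaks at roughly 3 TFLOP/s double precision (the peak rate is not stated on the slide):

```python
# Back-of-the-envelope check of the slide's 1/6 Byte/FLOP figure for
# Knights Landing. Assumption (not on the slide): ~3 TFLOP/s peak.
bandwidth_gb_s = 500              # bandwidth quoted on the slide, GB/s
peak_gflop_s = 3000               # assumed double-precision peak, GFLOP/s
bytes_per_flop = bandwidth_gb_s / peak_gflop_s
print(round(bytes_per_flop, 3))   # ~0.167, i.e. about 1/6 Byte/FLOP
```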
Locality and Hierarchical Locality
Example manycore: Tilera TILE64
Programming Models
What is a programming model?

Why programming models?
- Write applications that run effectively across architectures
- Design new architectures that can effectively support legacy applications

Programming model design considerations include improving performance.
Programming models compared by process/thread address space:
- Partitioned Global Address Space (e.g., UPC, Chapel): locality-aware, implicit communication
- Shared Memory: not locality-aware, implicit communication
- Message Passing: locality-aware, explicit communication
Hardware Support for Productive Locality
Measurement of the address-space overheads: a set of micro-benchmarks measuring the different aspects separately (network time, address translation, address incrementation, memory access).
[Figure: time (ns) and percentage of time spent in actual memory access, per type of access; observed bandwidths of 5.25 GB/s, 734 MB/s, and 4.25 MB/s]
[Figure: the UPC memory model: one shared space partitioned among threads 0 through THREADS-1, plus per-thread private spaces Private 0 through Private THREADS-1]
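The partitioned shared space pictured above can be sketched as follows; this is an illustrative model, not real UPC:

```python
# Minimal sketch of the PGAS memory model: one logically shared space
# partitioned so each piece has affinity to one thread, plus a strictly
# private space per thread. THREADS = 4 is illustrative.
THREADS = 4

shared_space = {t: {} for t in range(THREADS)}   # shared, partitioned by affinity
private_space = {t: {} for t in range(THREADS)}  # visible only to its owner

def shared_read(reader, affinity, name):
    """Any thread may read shared data; whether the access is remote
    (and therefore costly) depends on the data's affinity."""
    is_remote = (reader != affinity)
    return shared_space[affinity][name], is_remote

shared_space[2]["x"] = 42             # data with affinity to thread 2
val, remote = shared_read(0, 2, "x")  # thread 0 reading it is a remote access
```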
Tested shared-address access costs in Chapel:
- Local part of a distributed object, un-optimized: accessing local data without declaring it local
- Local, optimized: local part accessed with hand tuning
- Local and non-distributed

Findings:
- Compiler optimization gives a 2x speedup; combining compiler and hand optimizations helps further
- Compiler optimization affects remote accesses as well
- Both UPC and Chapel require "unproductive!" hand tuning to improve local shared accesses
Both software and hardware solutions exist; some work for UPC, little for Chapel.

Example operations for support in hardware:
- Address translation support: convert a shared address to the system virtual address used to perform the access
- Can be used to tell whether to call the network subroutines, e.g., by testing the affinity field in a work-sharing construct
- Availed as ISA extensions: new instructions used directly by the compiler
- Current hardware support and instructions only cover local accesses; future support targets remote data accesses and other operations
Two prototypes:
- The first, in FPGAs, supports small core counts and apps: Leon3 cores on a Virtex-6 FPGA, extended with the proposed PGAS hardware support for shared addressing; GASNet/BUPC benchmarking kernels ported on top of Leon3
- The second is primarily software and supports bigger core counts and apps: ported on top of Gem5, with the new instructions inserted into code generation, UPC code running out of the box, and a runtime system that recognizes and enforces the developed mapping; GASNet/BUPC benchmarking kernels on Gem5
- Future: a workstation cluster
shared [4] int arrayA[32];

With THREADS = 4, the elements are distributed block-cyclically with block size 4:
- Thread 0: elements 0-3, 16-19
- Thread 1: elements 4-7, 20-23
- Thread 2: elements 8-11, 24-27
- Thread 3: elements 12-15, 28-31

Shared pointer representation (thread, phase, virtual address) vs. regular pointer representation:
- Address incrementation (pgas_inc_{x}): e.g., (Th=0, Ph=0, Va=0x3f10) to (Th=2, Ph=2, Va=0x3f18)
- Address translation/store (pgas_st_{x}): produces the system virtual address, e.g., 0xfff01203f14
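The block-cyclic address arithmetic that the pgas_inc/pgas_st support accelerates can be sketched in plain Python for the declaration above:

```python
# Sketch of the block-cyclic shared-address mapping for
#   shared [4] int arrayA[32];  with THREADS = 4.
BLOCK, THREADS, ELEM = 4, 4, 4        # block size, threads, sizeof(int)

def locate(i):
    """Return (owning thread, phase within block, byte offset in the
    owner's local partition) for global index i."""
    thread = (i // BLOCK) % THREADS
    phase = i % BLOCK
    course = i // (BLOCK * THREADS)   # completed distribution rounds
    return thread, phase, (course * BLOCK + phase) * ELEM

# arrayA[10] has affinity to thread 2, phase 2, 8 bytes into its local
# partition, matching the slide's (Th=2, Ph=2) 0x3f10 -> 0x3f18 step.
```

Comparing the computed thread with the running thread's ID is exactly the affinity test that decides whether the network subroutines must be called.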
Hierarchical Locality Exploitation
Options:
- Rewrite your code with low-level tricks to target each specific machine
- Extend programming models with hierarchical locality abstractions
[Figure: split of locality-exploitation responsibility between the programmer and the system]
A synthetic benchmark shows the gain of proper placement with varying numbers of threads and percentages of remote communication. Proper placement will:
- reduce data movement by exploiting locality
- use memory and caches in the neighborhood
- use the nearby interconnect for the underlying communication

The gains grow as the size of your system increases: a must for exascale!
[Figure: speedup from proper placement (read access), 0 to 40x, vs. number of threads (24 to 1008) and remote communication percentage (10 to 90%)]
The response of each level of the hierarchy to communication differs. Know and characterize your architecture!
[Figure: put/write bandwidth (GB/s) vs. message size (8 B to 2 MB) on a Cray XE6m, for self, same-die, same-chip, same-node, and remote accesses]
PHLAME (Parallel Hierarchical Abstraction Model of Execution)
The PHLAME workflow (program, communication benchmarks, placement algorithm, target machine):
1. Characterize the machine message costs at each level to generate the PHLAME Description File (PDF)
2. Profile the application communication via an instrumented program
3. Build a placement layout for the threads based on the above
4. Run the application with the layout built in the previous step
Message cost: the total time for a message to be delivered.

Example machine communication characterization, time per message (ns):

Level | 8 B      | 16 B     | 32 B     | 64 B     | 128 B    | 256 B
1     | 0.516956 | 0.665469 | 1.209482 | 1.986097 | 3.606203 | 7.593014
2     | 0.688468 | 1.038422 | 1.547030 | 2.772387 | 5.138746 | 10.86957
3     | 0.687853 | 1.033378 | 1.543448 | 2.770083 | 5.128205 | 10.85776
4     | 0.706414 | 1.050420 | 1.548707 | 2.778550 | 5.128205 | 11.02536
Profiling the application communication:
- Instrument the application code to collect communication activity matrices
- The message-size range is divided into bins, for example Msg<64, 64≤Msg<128, 128≤Msg<256, …
- There are two communication matrices per bin: the average message size and the number of messages
- Matrix axes: initiating thread by data-affinity thread
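The per-bin activity matrices above can be sketched as follows; the exact data layout is an assumption:

```python
# Sketch: per-bin communication activity matrices. For T threads and B
# message-size bins, keep two T x T matrices per bin, indexed
# [initiating thread][data-affinity thread]: the number of messages and
# the total bytes (from which the average message size is derived).
from bisect import bisect_right

BOUNDS = [64, 128, 256]                 # bins: <64, 64-127, 128-255, >=256
T = 4                                   # illustrative thread count
B = len(BOUNDS) + 1
num_msgs = [[[0] * T for _ in range(T)] for _ in range(B)]
tot_bytes = [[[0] * T for _ in range(T)] for _ in range(B)]

def record(src, dst, nbytes):
    """Account one message from initiating thread src to the thread dst
    that holds the data."""
    b = bisect_right(BOUNDS, nbytes)    # size bin this message falls in
    num_msgs[b][src][dst] += 1
    tot_bytes[b][src][dst] += nbytes

def avg_size(b, src, dst):
    n = num_msgs[b][src][dst]
    return tot_bytes[b][src][dst] / n if n else 0.0

record(0, 2, 32)
record(0, 2, 48)    # two small messages from thread 0 to thread 2's data
record(1, 3, 200)   # one medium message from thread 1 to thread 3's data
```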
Placement decisions require a cost model. Repeat for each level: for every pair of threads, calculate the cost of their communication at that level:

Level costs = sum over the B bins of (number of messages in the bin) ⊙ (per-message cost at the bin's average message size)

where B is the number of bins, ⊙ denotes element-wise multiplication over the thread-pair matrices, the per-bin message counts and average sizes come from the application profile, and the per-message cost comes from the machine characterization (time per message, ns):

Level  | 8 B      | 16 B     | 32 B     | 64 B     | 128 B    | 256 B
Die    | 0.516956 | 0.665469 | 1.209482 | 1.986097 | 3.606203 | 7.593014
Chip   | 0.688468 | 1.038422 | 1.547030 | 2.772387 | 5.138746 | 10.86957
Node   | 0.687853 | 1.033378 | 1.543448 | 2.770083 | 5.128205 | 10.85776
Remote | 0.706414 | 1.050420 | 1.548707 | 2.778550 | 5.128205 | 11.02536

Example bins: Msg<64, 64≤Msg<256, 256≤Msg<512
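The per-level cost computation can be sketched as below. The nearest-size-at-or-above lookup rule is an assumption; the real system may interpolate between characterized sizes:

```python
# Sketch of the per-level placement cost. COST_NS holds the Cray
# characterization values (ns per message) from the table on the slide.
SIZES = [8, 16, 32, 64, 128, 256]
COST_NS = {
    "die":    [0.516956, 0.665469, 1.209482, 1.986097, 3.606203, 7.593014],
    "chip":   [0.688468, 1.038422, 1.547030, 2.772387, 5.138746, 10.86957],
    "node":   [0.687853, 1.033378, 1.543448, 2.770083, 5.128205, 10.85776],
    "remote": [0.706414, 1.050420, 1.548707, 2.778550, 5.128205, 11.02536],
}

def msg_cost(level, size):
    """Per-message cost (ns) at a level, using the smallest characterized
    size that is >= the requested size (assumed lookup rule)."""
    for s, c in zip(SIZES, COST_NS[level]):
        if size <= s:
            return c
    return COST_NS[level][-1]

def level_cost(level, num_msgs, avg_size):
    """Sum over bins of (number of messages) x (cost at the bin's average
    message size), element-wise over all thread pairs: the circled-dot
    product from the slide."""
    B, T = len(num_msgs), len(num_msgs[0])
    return sum(num_msgs[b][i][j] * msg_cost(level, avg_size[b][i][j])
               for b in range(B) for i in range(T) for j in range(T)
               if num_msgs[b][i][j])
```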
The fit measure shows how well a candidate grouping of threads fits a given level of the machine hierarchy. The fit measure is based on the cost of the placement at that level, given that level's communication costs.
[Figure: example of threads being placed across a CPU/node/blade/chassis hierarchy]
The application communication pattern can be represented as a graph with multiple weights per edge: an edge between threads i and j carries weights w_ij1, w_ij2, w_ij3, …, w_ijL.
Two placement strategies:
- Clustering: form partitions at lower levels first and recursively group them at higher levels
- Splitting: form partitions at upper levels first and recursively break them at lower levels

Abstract machine example, Level 1: Width = 4 (number of locales), MaxLocaleSize = 4 (number of cores in each locale)
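The bottom-up "clustering" direction can be sketched with a simple greedy merge; this is an illustrative simplification, not the paper's exact algorithm:

```python
# Illustrative bottom-up "clustering" placement: greedily merge the pair
# of groups with the heaviest mutual traffic until each group fills a
# locale, so the heaviest communication stays at the lowest (cheapest)
# level of the hierarchy.
def cluster(w, locale_size):
    """w[i][j] = total traffic from thread i to thread j."""
    groups = [[t] for t in range(len(w))]

    def traffic(a, b):
        return sum(w[i][j] + w[j][i] for i in a for j in b)

    while True:
        best = None
        for x in range(len(groups)):
            for y in range(x + 1, len(groups)):
                if len(groups[x]) + len(groups[y]) <= locale_size:
                    t = traffic(groups[x], groups[y])
                    if best is None or t > best[0]:
                        best = (t, x, y)
        if best is None:            # no mergeable pair left
            return groups
        _, x, y = best
        groups[x] += groups.pop(y)  # merge the heaviest-communicating pair
```

For the Width = 4, MaxLocaleSize = 4 abstract machine above, one would run this with locale_size = 4 and then recursively regroup the resulting locales at the next level; "splitting" performs the analogous computation top-down.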
Experimental testbed:
- Cray XE6m/XK7m
- Two 12-core AMD Magny-Cours processors per node
- 2D torus interconnect
- UPC NPB benchmarks from GWU
- Heat diffusion
- TAU was selected to profile the UPC and MPI codes
- Bins are not supported in TAU profiles, so modifications were made to the TAU backend and related tools
Deploying the layout:
- The clustering algorithm usually yields an uneven number of threads per node
- The Cray Application Level Placement Scheduler (ALPS) and a modified GASNet Gemini conduit let the runtime pick the correct number of processes on each node
- The thread layout is passed to the runtime through GASNET_THREAD_MAP and GASNET_NUM_THREADS
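A hypothetical usage sketch: only the variable names GASNET_THREAD_MAP and GASNET_NUM_THREADS appear on the slide, so the exact value format below is an assumption.

```shell
# Hypothetical sketch (value format assumed): the map lists, in order,
# the UPC thread IDs placed on this node's processes, so the modified
# Gemini conduit can start the right number of processes per node.
export GASNET_NUM_THREADS=10
export GASNET_THREAD_MAP="3 8 1 10 2 5 11 12 6 7"
```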
FT (all-to-all communication):
[Figure: gain (%) up to 12% and relative communication overhead vs. number of threads (64 to 1024), for Clustering, Splitting, Splitting Non-Restricted, and PHAST, relative to the default placement]
FT (all-to-all communication), continued:
[Figure: gain (%) up to 8% and relative communication overhead vs. number of threads (64 to 1024), for Clustering, Splitting, Splitting Non-Restricted, and PHAST, relative to the default placement]
CG (irregular memory access and communication):
[Figure: gain (%) up to 100% and relative communication overhead vs. number of threads (64 to 1024), for Clustering, Splitting, Splitting Non-Restricted, and PHAST, relative to the default placement]
CG (irregular memory access and communication), continued:
[Figure: gain (%) up to 45% and relative communication overhead vs. number of threads (64 to 1024), for the same configurations]
Concluding Remarks
- Due to energy and bandwidth constraints, data movement dominates cost
- Locality exploitation is an obvious target
- Extreme-scale architectures are becoming deeply hierarchical
- Hierarchical locality exploitation must therefore be done productively
- We can expect some programming paradigms to adapt by expressing hierarchical locality
- Locality-aware programming, hardware support, and intelligent placement must work together
1. Ahmad Anbar, Olivier Serres, Engin Kayraklioglu, Abdel-Hameed Badawy, and Tarek El-Ghazawi, "Exploiting Hierarchical Locality in Deep Parallel Architectures," ACM Transactions on Architecture and Code Optimization, Vol. 13, No. 2, June 2016.
2. Olivier Serres, Abdullah Kayi, Ahmad Anbar, and Tarek El-Ghazawi, "Enabling PGAS Productivity with Hardware Support for Shared Address Mapping: A UPC Case Study," ACM Transactions on Architecture and Code Optimization, Vol. 12, No. 4, January 2016.
3. Ahmad Anbar, Abdel-Hameed Badawy, Olivier Serres, and Tarek El-Ghazawi, "Where Should the Threads Go? Leveraging Hierarchical Data Locality to Solve the Thread Affinity Dilemma," in Proc. 20th International Conference on Parallel and Distributed Systems (ICPADS 2014), IEEE, Hsinchu, Taiwan, Dec. 16-19, 2014.
4. Ahmad Anbar, Olivier Serres, Engin Kayraklioglu, Abdel-Hameed Badawy, and Tarek El-Ghazawi, "PHLAME: Hierarchical Locality Exploitation Using the PGAS Model," IEEE International Conference on Partitioned Global Address Space Programming Models (PGAS 2015), Washington, DC, September 18-20, 2015.
5. Olivier Serres, Abdullah Kayi, Ahmad Anbar, and Tarek El-Ghazawi, "Enabling PGAS Productivity with Hardware Support for Shared Address Mapping: A UPC Case Study," in Proc. 16th IEEE International Conference on High Performance Computing and Communications, August 20-22, 2014.
Future work:
- Use thread data-affinity from a locality-aware program as a starting point into a hierarchical locality exploitation system (PHLAME or FLAME: Parallel Hierarchical Abstraction Model of Execution)
- Examine the best graph partitioning methods
- Decentralize the algorithms, and build in fast predictions to handle exascale
- Consider dynamic solutions
- Consider unprofiled cases, collecting intelligence on runs for later use and optimizations
- Consider data-dependent cases
- Consider dynamic-parallelism cases
- Investigate hardware support