Berkeley UPC
Yili Zheng, Lawrence Berkeley National Laboratory
Berkeley UPC Team
- Project Lead: Katherine Yelick
- Team members: Filip Blagojevic, Dan Bonachea, Paul Hargrove, Costin Iancu, Seung-Jai Min, Yili Zheng
- Former members: Christian Bell, Wei Chen, Jason Duell, Parry Husbands, Rajesh Nishtala, Mike Welcome
- A joint project of LBNL and UC Berkeley
Motivation
- Scalable systems have either distributed memory or shared memory without cache coherence
– Clusters: Ethernet, InfiniBand, Cray XT, IBM BlueGene
– Hybrid nodes: CPU + GPU or other kinds of accelerators
– SoC: IBM Cell, Intel Single-chip Cloud Computer (SCC)
- Challenges of message-passing programming models
– Difficult data partitioning for irregular applications
– Memory space starvation due to data replication
– Performance overheads from two-sided communication semantics
Partitioned Global Address Space
- Global data view abstraction for productivity
- Vertical partitions among threads for locality control
- Horizontal partitions between shared and private segments for data placement optimizations
- Friendly to non-cache-coherent architectures
[Figure: four threads side by side, each with a private segment and a shared segment; the shared segments together form the partitioned global address space.]
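As a concrete illustration (not from the original slides), here is a minimal UPC sketch of the shared/private split: each thread owns one element of a shared array plus its own private variables, and any thread can read the other partitions one-sidedly.

    #include <upc.h>

    shared int counter;        /* one shared object, with affinity to thread 0 */
    shared int hist[THREADS];  /* one element in each thread's shared segment */
    int scratch;               /* private: every thread has its own copy */

    int main(void) {
        hist[MYTHREAD] = MYTHREAD;     /* write the element in my partition */
        upc_barrier;
        if (MYTHREAD == 0) {
            int sum = 0;
            for (int i = 0; i < THREADS; i++)
                sum += hist[i];        /* one-sided reads of remote partitions */
            counter = sum;
        }
        return 0;
    }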
PGAS Example: Global Matrix Distribution
[Figure: the same 16 matrix blocks shown two ways: as one contiguous global matrix view, and as distributed matrix storage in which each thread holds a subset of the blocks in its local memory.]
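One hedged sketch of how such a distribution can be declared in UPC (the block sizes and grid shape here are illustrative, not taken from the figure): a layout qualifier deals consecutive blocks of the flattened matrix round-robin across the threads, so the program keeps a single global array while storage is distributed.

    #include <upc.h>

    #define N 1024
    #define B (N / 4)   /* a 4x4 grid of BxB blocks (illustrative choice) */

    /* Block-cyclic layout over the flattened matrix: each group of B*B
       consecutive elements lands in one thread's shared segment. */
    shared [B*B] double A[N*N];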
UPC Overview
- PGAS dialect of ISO C99
- Distributed shared arrays
- Dynamic shared-memory allocation
- One-sided shared-memory communication
- Synchronization: barriers, locks, memory fences
- Collective communication library
- Parallel I/O library
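Most of these features fit in a few lines; a minimal sketch (illustrative, not from the slides), showing a distributed shared array, owner-affinity iteration, a barrier, and dynamic shared allocation:

    #include <upc.h>
    #include <stdio.h>

    #define N 1000

    shared double x[N], y[N];   /* distributed shared arrays (cyclic layout) */

    int main(void) {
        int i;
        /* Each iteration executes on the thread with affinity to y[i]. */
        upc_forall (i = 0; i < N; i++; &y[i])
            y[i] += 2.0 * x[i];

        upc_barrier;            /* synchronization: all threads wait here */

        /* Dynamic shared-memory allocation: one element per thread. */
        shared double *p = (shared double *)upc_all_alloc(THREADS, sizeof(double));
        if (MYTHREAD == 0) printf("ran on %d threads\n", THREADS);
        upc_barrier;
        if (MYTHREAD == 0) upc_free((shared void *)p);
        return 0;
    }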
Key Components for Scalability
- One-sided communication and active messages
- Efficient resource sharing for multi-core systems
- Non-blocking collective communication
Berkeley UPC Software Stack
[Figure: the software stack, top to bottom: UPC applications; the UPC-to-C translator, which emits translated C code with runtime calls; the UPC runtime; the GASNet communication library; and the network driver and OS libraries. The upper layers are language-dependent, the lower layers hardware-dependent.]
Berkeley UPC Features
- Data transfers for complex data types (vector, indexed, strided)
- Non-blocking memory copy
- Point-to-point synchronization
- Remote atomic operations
- Active messages
- Extensions to UPC collectives
- Portable timers
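For instance, the non-blocking memory copy extension lets a put overlap with local work. A minimal sketch, assuming the bupc_* asynchronous memcpy entry points and extension header documented with Berkeley UPC (names here are from that extension interface, not from the slides):

    #include <upc.h>
    #include <bupc_extensions.h>  /* assumed Berkeley UPC extension header */

    #define N 4096
    shared [N] double buf[N*THREADS];  /* one block of N doubles per thread */
    double out[N], work[N];

    int main(void) {
        int peer = (MYTHREAD + 1) % THREADS;
        for (int i = 0; i < N; i++) out[i] = (double)MYTHREAD;

        /* Start a non-blocking put into the peer's block... */
        bupc_handle_t h = bupc_memput_async(&buf[peer*N], out, N * sizeof(double));

        for (int i = 0; i < N; i++)    /* ...overlap with local computation... */
            work[i] = work[i] * 2.0 + 1.0;

        bupc_waitsync(h);              /* ...then wait for completion */
        upc_barrier;
        return 0;
    }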
One-Sided vs. Two-Sided Messaging
- Two-sided messaging
– The message does not contain information about the final destination; it must be looked up on the target node
– Point-to-point synchronization is implied with all transfers
- One-sided messaging
– The message contains information about the final destination
– Synchronization is decoupled from data movement
[Figure: a one-sided put (e.g., UPC) carries the destination address with its data payload, so the network interface can deposit it directly into target memory; a two-sided message (e.g., MPI) carries only a message id with its payload and must be matched on the target by the host CPU.]
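In UPC source, the one-sided half of this picture is simply a put into an address the initiator already knows; a minimal sketch (illustrative, not from the slides):

    #include <upc.h>
    #include <string.h>

    #define N 1024
    shared [N] char inbox[N*THREADS];  /* one N-byte "inbox" per thread */
    char msg[N];

    int main(void) {
        int peer = (MYTHREAD + 1) % THREADS;
        memset(msg, MYTHREAD, N);

        /* One-sided put: the initiator names the final destination address;
           the target posts no matching receive. */
        upc_memput(&inbox[peer*N], msg, N);

        /* Synchronization is decoupled from the data movement. */
        upc_barrier;
        return 0;
    }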
Active Messages
- Active messages = data + action
- A key enabling technology for both one-sided and two-sided communication
– Software implementation of put/get
– Eager and rendezvous protocols
- Remote procedure calls
– Facilitate "owner computes"
– Spawn asynchronous tasks
[Figure: node A sends a request to node B, where the request handler runs; B returns a reply to A, where the reply handler runs.]
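A hedged sketch of this request/reply pattern against the GASNet-1 active-message API (the handler indices and the ping/ack logic are illustrative, not from the slides):

    #include <gasnet.h>

    #define HIDX_REQ 200   /* client handler indices; 128-255 are available */
    #define HIDX_REP 201

    static volatile int done = 0;

    /* Runs on the target node (B) when the request arrives: data + action. */
    static void req_handler(gasnet_token_t token, gasnet_handlerarg_t arg) {
        gasnet_AMReplyShort1(token, HIDX_REP, arg);  /* reply to the initiator */
    }

    /* Runs back on the initiator (A) when the reply arrives. */
    static void rep_handler(gasnet_token_t token, gasnet_handlerarg_t arg) {
        done = 1;
    }

    int main(int argc, char **argv) {
        gasnet_handlerentry_t htable[] = {
            { HIDX_REQ, (void (*)())req_handler },
            { HIDX_REP, (void (*)())rep_handler },
        };
        gasnet_init(&argc, &argv);
        gasnet_attach(htable, 2, gasnet_getMaxLocalSegmentSize(), 0);

        if (gasnet_mynode() == 0) {
            gasnet_AMRequestShort1(1, HIDX_REQ, 42);  /* request to node 1 */
            GASNET_BLOCKUNTIL(done);    /* poll until the reply handler fires */
        }
        gasnet_exit(0);
        return 0;
    }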
GASNet Bandwidth on BlueGene/P
- Torus network
– Each node has six 850 MB/s* bidirectional links
– Vary the number of links used from 1 to 6
- Consecutive non-blocking puts on the links (round-robin)
- Similar bandwidth for large messages
- GASNet outperforms MPI for mid-size messages
– Lower software overhead
– More overlap
* Kumar et al. showed that the maximum achievable bandwidth for DCMF transfers is 748 MB/s per link, so we use this as our peak bandwidth. See "The Deep Computing Messaging Framework: Generalized Scalable Message Passing on the Blue Gene/P Supercomputer", Kumar et al., ICS'08.
See “Scaling Communication Intensive Applications on BlueGene/P Using One-Sided Communication and Overlap”, Rajesh Nishtala, Paul Hargrove, Dan Bonachea, and Katherine Yelick, IPDPS 2009
GASNet Bandwidth on Cray XT4
[Figure: bandwidth of non-blocking put (MB/s) vs. payload size (200 bytes to 2 MB), comparing portals-conduit put, the OSU MPI bandwidth test, and mpi-conduit put; up is good.]
Slide source: Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT, Dan Bonachea, Paul Hargrove, Michael Welcome, Katherine Yelick, CUG 2009
GASNet Latency on Cray XT4
[Figure: latency of blocking put (µs) vs. payload size (1 to 1024 bytes), comparing mpi-conduit put, MPI ping-ack, and portals-conduit put; down is good.]
Slide source: Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT, Dan Bonachea, Paul Hargrove, Michael Welcome, Katherine Yelick, CUG 2009
Execution Models on Multi-core – Process vs. Thread
[Figure: several CPUs over physically shared memory and a virtual address space; UPC threads can be mapped either to OS processes or to pthreads.]
Point-to-Point Performance – Process vs. Thread
[Figure: InfiniBand bandwidth (MB/s) vs. message size (8 bytes to 128 KB) for thread/process mixes from 1T-16P to 16T-1P, compared against MPI.]
Application Performance – Process vs. Thread
[Figure: relative performance of GUPS, MCOP, and SOBEL, benchmarks dominated by fine-grained communication, for thread/process mixes from 1T-16P to 16T-1P.]
NAS Parallel Benchmarks – Process vs. Thread
[Figure: NAS Parallel Benchmarks, Class C (EP, CG, IS, MG, FT, LU, BT-256, SP-256) at 1, 2, 4, 8, and 16 threads per process, with execution time broken down into communication, fence, critical section, and computation.]
Collective Communication for PGAS
- Communication patterns similar to MPI: broadcast, reduce, gather, scatter, and all-to-all
- The global address space enables one-sided collectives
- Flexible synchronization modes provide more opportunities for overlapping communication and computation (see the sketch below)
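The standard UPC collectives library exposes these synchronization modes through flags; a minimal sketch (illustrative, not from the slides) using the loosest MYSYNC modes to leave room for overlap with independent work:

    #include <upc.h>
    #include <upc_collective.h>

    #define N 1024
    shared [N] double src[N*THREADS];  /* broadcast source: thread 0's block */
    shared [N] double dst[N*THREADS];  /* each thread receives into its block */

    int main(void) {
        if (MYTHREAD == 0)
            for (int i = 0; i < N; i++) src[i] = (double)i;
        upc_barrier;

        /* MYSYNC: a thread may enter once its own data is ready and leave
           once its own data has arrived, enabling overlap. */
        upc_all_broadcast(dst, src, N * sizeof(double),
                          UPC_IN_MYSYNC | UPC_OUT_MYSYNC);
        upc_barrier;
        return 0;
    }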
Collective Communication Topologies
[Figure: example collective topologies over about 16 threads: a binomial tree, a binary tree, a fork tree, and a radix-2 dissemination pattern.]
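As a sketch of the last pattern: radix-2 dissemination reaches all THREADS peers in ceil(log2(THREADS)) rounds. The peer schedule below is illustrative (the actual exchange and combine step is elided):

    #include <upc.h>

    /* In round r each thread signals the thread 2^r positions ahead; after
       ceil(log2(THREADS)) rounds, every thread has heard, directly or
       transitively, from every other thread. */
    int main(void) {
        for (int dist = 1; dist < THREADS; dist <<= 1) {
            int to   = (MYTHREAD + dist) % THREADS;           /* I notify */
            int from = (MYTHREAD + THREADS - dist) % THREADS; /* I wait on */
            (void)to; (void)from;
            /* ...put/signal 'to', wait on 'from', combine, next round... */
            upc_barrier;  /* stand-in for the per-round point-to-point sync */
        }
        return 0;
    }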
GASNet Module Organization
[Figure: module organization. UPC collectives and other PGAS collectives sit on top of the GASNet collectives API. An auto-tuner of algorithms and parameters chooses among portable collectives (layered over the point-to-point communication driver), native collectives (layered over the interconnect's collective communication driver), and shared-memory collectives.]
Auto-tuning Collective Communication
Offline tuning
- Optimize for common platform characteristics
- Minimize runtime tuning overhead

Online tuning
- Optimize for the application's runtime characteristics
- Refine offline tuning results
Performance-influencing factors:
- Hardware: CPU, memory system, interconnect
- Software: application, system software
- Execution: process/thread layout, input data set, system workload

Performance tuning space:
- Algorithm selection: eager vs. rendezvous, put vs. get, a collection of well-known algorithms
- Communication topology: tree type, tree fan-out
- Implementation-specific parameters: pipelining depth, dissemination radix
Broadcast Performance
[Figure: non-blocking broadcast performance on Cray XT4 at 1024 cores.]
Matrix-Multiplication on Cray XT4
[Figure: matrix multiplication on Cray XT4, GFLOPS vs. cores (up to 400), with 8K × 8K doubles per node: DGEMM peak, UPC with non-blocking collectives, UPC with flat point-to-point communication, UPC with blocking collectives, and MPI/PBLAS.]
Cholesky Factorization on Sun Constellation (InfiniBand)
[Figure: Cholesky factorization on 2048 Ranger cores, matrix size 240K: naïve UPC (get-based), hand-coded UPC, and UPC team collectives reach 3118, 3757, and 4097 GFLOPS, respectively.]
FFT Performance on Cray XT4 (1024 Cores)
FFT Performance on BlueGene/P
The MPI FFT in the HPC Challenge benchmark, as of July 2009, reaches ~4.5 TFLOPS on 128K cores.
[Figure: FFT GFLOPS vs. number of cores (256 to 32768) on BlueGene/P, comparing Slabs, Slabs (Collective), Packed Slabs (Collective), and MPI Packed Slabs.]
Summary
- PGAS provides programming convenience similar to shared-memory models
- UPC has demonstrated performance comparable to MPI at large scale
- Interoperable with other programming models and languages, including MPI, Fortran, and C++
- Growing UPC community with actively developed and maintained software implementations
– Berkeley UPC and GASNet: http://upc.lbl.gov
– Other UPC compilers: Cray UPC, GNU UPC, HP UPC, and IBM UPC
– Tools: TotalView and Parallel Performance Wizard (PPW)