

SLIDE 1

Yili Zheng Lawrence Berkeley National Laboratory

SLIDE 2

Berkeley UPC Team

  • Project Lead: Katherine Yelick
  • Team members: Filip Blagojevic, Dan Bonachea, Paul Hargrove, Costin Iancu, Seung-Jai Min, Yili Zheng
  • Former members: Christian Bell, Wei Chen, Jason Duell, Parry Husbands, Rajesh Nishtala, Mike Welcome
  • A joint project of LBNL and UC Berkeley

SLIDE 3

Motivation

  • Scalable systems have either distributed memory or shared memory without cache coherency
    – Clusters: Ethernet, InfiniBand, Cray XT, IBM BlueGene
    – Hybrid nodes: CPU + GPU or other kinds of accelerators
    – SoC: IBM Cell, Intel Single-chip Cloud Computer (SCC)
  • Challenges of message-passing programming models
    – Difficult data partitioning for irregular applications
    – Memory space starvation due to data replication
    – Performance overheads from two-sided communication semantics

SLIDE 4

Partitioned Global Address Space

[Figure: PGAS memory layout — Threads 1-4 each have a private segment and a shared segment; the shared segments together form the global address space]

  • Global data view abstraction for productivity
  • Vertical partitions among threads for locality control
  • Horizontal partitions between shared and private segments for data placement optimizations
  • Friendly to non-cache-coherent architectures
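A minimal UPC sketch (illustrative, not from the slides) of this layout: each thread keeps data in its private segment and owns one element of a shared array in its shared segment, while any thread can address any shared element through the global address space.

#include <upc.h>
#include <stdio.h>

shared int counters[THREADS];    /* one element in each thread's shared segment */

int main(void) {
    int local = MYTHREAD;        /* lives in this thread's private segment */
    counters[MYTHREAD] = local;  /* write my own shared element (local affinity) */
    upc_barrier;
    /* Any thread may read any shared element directly through the global address space. */
    if (MYTHREAD == 0)
        printf("thread 0 sees counters[THREADS-1] = %d\n", counters[THREADS-1]);
    return 0;
}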

SLIDE 5

PGAS Example: Global Matrix Distribution

[Figure: a 4×4 block matrix shown in its global matrix view and in its distributed per-thread storage]
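A sketch of how a UPC shared array with a block size expresses this: the declaration gives one global matrix view, while each block is physically stored with the affinity of one thread. The dimensions here are assumptions for illustration.

#include <upc.h>

#define NCOLS 1024
/* Block size NCOLS: row i is stored contiguously in the shared segment of
 * thread i, while A keeps a single global (logical) matrix view. */
shared [NCOLS] double A[THREADS][NCOLS];

void scale_rows(double alpha) {
    int i, j;
    /* The affinity expression &A[i][0] runs iteration i on the owning thread,
     * so every access in the body below is local. */
    upc_forall (i = 0; i < THREADS; i++; &A[i][0]) {
        for (j = 0; j < NCOLS; j++)
            A[i][j] *= alpha;
    }
}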

SLIDE 6

UPC Overview

  • PGAS dialect of ISO C99
  • Distributed shared arrays
  • Dynamic shared-memory allocation
  • One-sided shared-memory communication
  • Synchronization: barriers, locks, memory fences
  • Collective communication library
  • Parallel I/O library
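A minimal UPC sketch (illustrative, not from the slides) touching several of the listed features: a dynamically allocated distributed shared array, one-sided data movement, a lock, and barriers.

#include <upc.h>
#include <stdio.h>

shared int total;                              /* shared scalar, affinity to thread 0 */

int main(void) {
    /* Collective allocation: THREADS blocks of 4 ints; block b lives on thread b. */
    shared [4] int *data = (shared [4] int *)upc_all_alloc(THREADS, 4 * sizeof(int));
    upc_lock_t *lock = upc_all_lock_alloc();   /* one lock shared by all threads */
    int mine[4];

    if (MYTHREAD == 0) total = 0;
    for (int i = 0; i < 4; i++)
        data[4 * MYTHREAD + i] = MYTHREAD + i; /* fill my own (local-affinity) block */
    upc_barrier;                               /* every block is now initialized */

    /* One-sided get: copy thread 0's block into private memory; no matching send. */
    upc_memget(mine, &data[0], 4 * sizeof(int));

    upc_lock(lock);                            /* serialize updates to the shared total */
    total += mine[3];
    upc_unlock(lock);

    upc_barrier;
    if (MYTHREAD == 0) printf("total = %d\n", total);
    return 0;
}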

SLIDE 7

Key Components for Scalability

  • One-sided communication and active messages
  • Efficient resource sharing for multi-core systems
  • Non-blocking collective communication

SLIDE 8

Berkeley UPC Software Stack

[Figure: Berkeley UPC software stack — UPC applications are compiled by the UPC-to-C translator into translated C code with runtime calls, which runs on the UPC runtime, the GASNet communication library, and the network driver and OS libraries; the upper layers are language dependent and the lower layers hardware dependent]

SLIDE 9

Berkeley UPC Features

  • Data transfer for complex data types (vector, indexed, strided)
  • Non-blocking memory copy
  • Point-to-point synchronization
  • Remote atomic operations
  • Active Messages
  • Extension to UPC collectives
  • Portable timers
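A sketch of the non-blocking memory copy extension used to overlap communication with computation. The bupc_memget_async/bupc_waitsync names follow the Berkeley UPC extension API, but the header name and exact signatures here are assumptions; check the Berkeley UPC documentation.

#include <upc.h>
#include <bupc_extensions.h>     /* assumed header name for the bupc_* async extensions */

#define N 4096
shared [N] double grid[THREADS][N];   /* one block of N doubles per thread */

void overlap_example(double *buf, double *acc) {
    int peer = (MYTHREAD + 1) % THREADS;

    /* Start a one-sided get of the neighbor's block; the call returns immediately. */
    bupc_handle_t h = bupc_memget_async(buf, &grid[peer][0], N * sizeof(double));

    /* ... independent local computation, overlapped with the transfer ... */
    for (int i = 0; i < N; i++) acc[i] *= 0.5;

    bupc_waitsync(h);                 /* block until the transfer has completed */
    for (int i = 0; i < N; i++) acc[i] += buf[i];
}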

SLIDE 10

One-Sided vs. Two-Sided Messaging

  • Two-sided messaging
    – Message does not contain information about the final destination; need to look it up on the target node
    – Point-to-point synchronization implied with all transfers
  • One-sided messaging
    – Message contains information about the final destination
    – Decouples synchronization from data movement

[Figure: message formats — a one-sided put (e.g., UPC) carries the destination address and data payload, so the network interface can deposit it directly into memory; a two-sided message (e.g., MPI) carries a message id and data payload that the host CPU must match to a destination]
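An illustrative contrast (not from the slides): the UPC put names the remote destination directly and needs no matching receive, while the MPI equivalent, shown in comments, requires the receiver to post and match a receive.

#include <upc.h>

#define M 16
shared [M] double dst[THREADS][M];   /* landing zone: one block per thread */

/* One-sided put in UPC: the initiator supplies the remote destination address. */
void one_sided_put(const double *src) {
    int peer = (MYTHREAD + 1) % THREADS;
    upc_memput(&dst[peer][0], src, M * sizeof(double));  /* data + destination */
    upc_barrier;   /* synchronization is a separate, explicit step */
}

/* Two-sided equivalent in MPI, for comparison (in comments so the file stays UPC):
 *   sender:   MPI_Send(src, M, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD);
 *   receiver: MPI_Recv(buf, M, MPI_DOUBLE, from, tag, MPI_COMM_WORLD, &status);
 * The receiver must post and match the receive (tag/source lookup on the target),
 * and completion of the matched pair implies point-to-point synchronization. */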

SLIDE 11

Active Messages

  • Active messages = Data + Action
  • Key enabling technology for both one-sided and two-sided communication
    – Software implementation of Put/Get
    – Eager and rendezvous protocols
  • Remote procedure calls
    – Facilitate “owner-computes”
    – Spawn asynchronous tasks

[Figure: node A sends a request to node B, where a request handler runs; the handler sends a reply back to A, where a reply handler runs]
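A sketch of the request/reply pattern in the diagram, written against the GASNet-1 active message API (gasnet_AMRequestShort1 / gasnet_AMReplyShort1). The handler indices, segment size, and overall setup are illustrative assumptions.

#include <gasnet.h>
#include <stdio.h>

#define HIDX_REQ 200   /* client AM handler indices (assumed free in 128..255) */
#define HIDX_REP 201

static volatile int got_reply = 0;

/* Runs on node B when A's request arrives: perform the "action", then reply. */
static void req_handler(gasnet_token_t token, gasnet_handlerarg_t arg) {
    printf("node %d: request handler received %d\n", (int)gasnet_mynode(), (int)arg);
    gasnet_AMReplyShort1(token, HIDX_REP, arg + 1);
}

/* Runs back on node A when B's reply arrives. */
static void rep_handler(gasnet_token_t token, gasnet_handlerarg_t arg) {
    got_reply = (int)arg;
}

int main(int argc, char **argv) {
    gasnet_handlerentry_t htable[] = {
        { HIDX_REQ, (void (*)())req_handler },
        { HIDX_REP, (void (*)())rep_handler },
    };
    gasnet_init(&argc, &argv);
    gasnet_attach(htable, sizeof(htable)/sizeof(htable[0]), GASNET_PAGESIZE, 0);

    if (gasnet_mynode() == 0 && gasnet_nodes() > 1) {
        gasnet_AMRequestShort1(1, HIDX_REQ, 42);   /* A -> B request */
        GASNET_BLOCKUNTIL(got_reply != 0);         /* poll until the reply handler ran */
    }
    /* anonymous barrier so no node exits while messages are in flight */
    gasnet_barrier_notify(0, GASNET_BARRIERFLAG_ANONYMOUS);
    gasnet_barrier_wait(0, GASNET_BARRIERFLAG_ANONYMOUS);
    gasnet_exit(0);
    return 0;
}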

SLIDE 12

GASNet Bandwidth on BlueGene/P

  • Torus network
    – Each node has six 850 MB/s* bidirectional links
    – Vary number of links from 1 to 6
  • Consecutive non-blocking puts on the links (round-robin)
  • Similar bandwidth for large-size messages
  • GASNet outperforms MPI for mid-size messages
    – Lower software overhead
    – More overlapping

* Kumar et al. showed that the maximum achievable bandwidth for DCMF transfers is 748 MB/s per link, so we use this as our peak bandwidth. See “The deep computing messaging framework: generalized scalable message passing on the blue gene/P supercomputer”, Kumar et al., ICS 2008.


See “Scaling Communication Intensive Applications on BlueGene/P Using One-Sided Communication and Overlap”, Rajesh Nishtala, Paul Hargrove, Dan Bonachea, and Katherine Yelick, IPDPS 2009

SLIDE 13

GASNet Bandwidth on Cray XT4

[Figure: bandwidth of non-blocking put (MB/s, up is good) vs. payload size (2 KB to 2 MB) for portals-conduit Put, the OSU MPI BW test, and mpi-conduit Put]

Slide source: Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT, Dan Bonachea, Paul Hargrove, Michael Welcome, Katherine Yelick, CUG 2009

SLIDE 14

GASNet Latency on Cray XT4

[Figure: latency of blocking put (µs, down is good) vs. payload size (1 to 1024 bytes) for mpi-conduit Put, MPI Ping-Ack, and portals-conduit Put]

Slide source: Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT, Dan Bonachea, Paul Hargrove, Michael Welcome, Katherine Yelick, CUG 2009

SLIDE 15

Execution Models on Multi-core – Process vs. Thread

[Figure: a multi-core node (four CPUs, physical shared memory) with two mappings of UPC threads onto the virtual address space: map UPC threads to processes vs. map UPC threads to Pthreads]

SLIDE 16

Point-to-Point Performance – Process vs. Thread

[Figure: InfiniBand bandwidth (MB/s) vs. message size (8 B to 128 KB) for 1T-16P, 2T-8P, 4T-4P, 8T-2P, 16T-1P, and MPI]

SLIDE 17

Application Performance – Process vs. Thread

[Figure: relative performance of GUPS, MCOP, and SOBEL (fine-grained communication benchmarks) under 1T-16P, 2T-8P, 4T-4P, 8T-2P, and 16T-1P configurations]

SLIDE 18

NAS Parallel Benchmarks – Process vs. Thread

[Figure: NPB Class C time breakdown (Comm, Fence, Critical Section, Comp) for EP, CG, IS, MG, FT, LU, BT-256, and SP-256 at configurations of 1, 2, 4, 8, and 16]

SLIDE 19

Collective Communication for PGAS

  • Communication patterns similar to MPI: broadcast, reduce, gather, scatter, and alltoall
  • Global address space enables one-sided collectives
  • Flexible synchronization modes provide more communication/computation overlap opportunities, as in the sketch below
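A sketch of a one-sided PGAS collective using the standard UPC collectives library; the array shapes are assumptions. The UPC_IN_MYSYNC/UPC_OUT_MYSYNC flags request the loosest per-thread synchronization mode, which is what creates the extra overlap opportunities.

#include <upc.h>
#include <upc_collective.h>

#define N 1024
shared [N] double dst[THREADS][N];   /* destination: one block of N doubles per thread */
shared [] double src[N];             /* source block, affinity to thread 0 */

void bcast_from_thread0(void) {
    /* Every thread receives thread 0's block into its own block of dst.
     * The MYSYNC in/out modes avoid a full barrier on entry and exit, so each
     * thread can keep doing unrelated work around the collective. */
    upc_all_broadcast(dst, src, N * sizeof(double),
                      UPC_IN_MYSYNC | UPC_OUT_MYSYNC);
}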

SLIDE 20

Collective Communication Topologies

[Figure: collective communication topologies — binomial tree, binary tree, fork tree, and radix-2 dissemination]
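An illustrative sketch (plain C, not from the slides) of how the binomial broadcast tree above is built: in round k, every rank that already holds the data forwards it to rank + 2^k, so the parent and children of each rank follow from its highest set bit and n ranks are covered in ceil(log2(n)) rounds.

#include <stdio.h>

int main(void) {
    int n = 16;   /* number of ranks (assumed for the example) */
    for (int r = 0; r < n; r++) {
        /* hb = highest power of two <= r; rank r receives from r - hb in round log2(hb) */
        int hb = 0;
        for (int b = 1; b <= r; b <<= 1) hb = b;
        printf("rank %2d: parent %2d, children:", r, r ? r - hb : -1);
        /* children are r + 2^k for every 2^k larger than hb that stays below n */
        for (int step = r ? hb << 1 : 1; r + step < n; step <<= 1)
            printf(" %d", r + step);
        printf("\n");
    }
    return 0;
}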

SLIDE 21

GASNet Module Organization

[Figure: GASNet collectives module organization — UPC collectives and other PGAS collectives call the GASNet collectives API; an auto-tuner of algorithms and parameters chooses among portable collectives (over the point-to-point comm. driver), native collectives (over the collective comm. driver), and shared-memory collectives, all running over the interconnect/memory]

SLIDE 22

Auto-tuning Collective Communication

Offline tuning
  • Optimize for platform-common characteristics
  • Minimize runtime tuning overhead

Online tuning
  • Optimize for application runtime characteristics
  • Refine offline tuning results

Performance influencing factors
  • Hardware: CPU, memory system, interconnect
  • Software: application, system software
  • Execution: process/thread layout, input data set, system workload

Performance tuning space
  • Algorithm selection: eager vs. rendezvous, put vs. get, collection of well-known algorithms
  • Communication topology: tree type, tree fan-out
  • Implementation-specific parameters: pipelining depth, dissemination radix
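An illustrative sketch (invented names and thresholds, not the Berkeley UPC auto-tuner) of the offline/online split: consult an offline-tuned table keyed by message size and thread count, then refine the choice online by timing nearby candidates on the application's actual calls.

#include <stddef.h>

enum bcast_alg { BCAST_BINOMIAL, BCAST_BINARY_TREE, BCAST_FLAT_PUT };

struct bcast_choice {
    enum bcast_alg alg;
    int fanout;            /* tree fan-out */
    int eager;             /* 1 = eager protocol, 0 = rendezvous */
};

/* Offline-tuned defaults for this platform (hypothetical numbers). */
static struct bcast_choice offline_lookup(size_t nbytes, int nthreads) {
    struct bcast_choice c;
    if (nbytes <= 2048)      { c.alg = BCAST_BINOMIAL;    c.fanout = 2;        c.eager = 1; }
    else if (nthreads <= 8)  { c.alg = BCAST_FLAT_PUT;    c.fanout = nthreads; c.eager = 0; }
    else                     { c.alg = BCAST_BINARY_TREE; c.fanout = 2;        c.eager = 0; }
    return c;
}

/* Online refinement: start from the offline choice, time nearby candidates on
 * the first few calls with this (size, threads) signature, and keep the fastest. */
struct bcast_choice tune_bcast(size_t nbytes, int nthreads,
                               double (*benchmark)(struct bcast_choice)) {
    struct bcast_choice best = offline_lookup(nbytes, nthreads);
    double best_t = benchmark(best);
    for (int fanout = 2; fanout <= 8; fanout *= 2) {
        struct bcast_choice cand = best;
        cand.fanout = fanout;
        double t = benchmark(cand);
        if (t < best_t) { best_t = t; best = cand; }
    }
    return best;
}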

SLIDE 23

Broadcast Performance

Cray XT4 Nonblocking Broadcast (1024 Cores)

SLIDE 24

Matrix-Multiplication on Cray XT4

[Figure: matrix multiplication on Cray XT4 — GFlops vs. number of cores (50 to 400) for DGEMM peak, UPC (nonblocking collectives), UPC (flat point-to-point), UPC (blocking collectives), and MPI/PBLAS; matrix size 8K × 8K doubles per node]

SLIDE 25

Cholesky Factorization on Sun Constellation (InfiniBand)

[Figure: Cholesky factorization on 2048 cores of Ranger, matrix size 240K — naïve UPC (get-based) 3118 GFlops, hand-coded UPC 3757 GFlops, UPC team collectives 4097 GFlops]

SLIDE 26

FFT Performance on Cray XT4

(1024 Cores)

SLIDE 27

FFT Performance on BlueGene/P


For reference, the MPI FFT in HPC Challenge as of July 2009 achieves ~4.5 TFlops on 128K cores.

[Figure: FFT GFlops vs. number of cores (256 to 32768) for Slabs, Slabs (Collective), Packed Slabs (Collective), and MPI Packed Slabs]

SLIDE 28

Summary

  • PGAS provides programming convenience similar to shared-memory models
  • UPC has demonstrated performance comparable to MPI at large scale
  • Interoperable with other programming models and languages, including MPI, Fortran, and C++
  • Growing UPC community with actively developed and maintained software implementations
    – Berkeley UPC and GASNet: http://upc.lbl.gov
    – Other UPC compilers: Cray UPC, GNU UPC, HP UPC, and IBM UPC
    – Tools: TotalView and Parallel Performance Wizard (PPW)
