

SLIDE 1

Compilation Techniques for Partitioned Global Address Space Languages

Katherine Yelick
U.C. Berkeley and Lawrence Berkeley National Lab
http://titanium.cs.berkeley.edu | http://upc.lbl.gov

SLIDE 2

HPC Programming: Where are We?

  • The IBM SP at NERSC/LBNL has 6K processors
  • There were 6K transistors in the Intel 8080a implementation
  • BG/L at LLNL has 64K processor cores
  • There were 68K transistors in the MC68000
  • A BG/Q system with 1.5M processors may have more processors than there are logic gates per processor

  • HPC application developers today write programs that are as complex as describing where every single bit must move between the 6,000 transistors of the 8080a

  • We need to at least get to “assembly language” level

Slide source: Horst Simon and John Shalf, LBNL/NERSC

SLIDE 3

Petaflop with ~1M Cores by 2008

[Figure: Top500 performance trend from 1993 to 2014 for the #1 system, the #500 system, and their sum, on a log scale from 10 MFlop/s to 1 EFlop/s. Annotations: a 1 PFlop system in 2008, a 6-8 year lag, and "Common by 2015?". Data from top500.org. Slide source: Horst Simon, LBNL.]

SLIDE 4

Predictions

  • Parallelism will explode
  • Number of cores will double every 12-24 months
  • Petaflop (million-processor) machines will be common in HPC by 2015 (all Top500 machines will have this)
  • Performance will become a software problem
  • Parallelism and locality will be key concerns for many programmers – not just an HPC problem
  • A new programming model will emerge for multicore programming
  • Can one language cover the laptop-to-Top500 space?
SLIDE 5

PGAS Languages: What, Why, and How

SLIDE 6

Partitioned Global Address Space

  • Global address space: any thread/process may directly read/write data allocated by another

  • Partitioned: data is designated as local or global

[Figure: the global address space spans threads p0 … pn; each thread holds private variables (x, y) on its stack plus local (l) and global (g) pointers into the shared heaps.]

By default:
  • Object heaps are shared
  • Program stacks are private

  • SPMD languages: UPC, CAF, and Titanium
  • All three use an SPMD execution model
  • Emphasis in this talk on UPC and Titanium (based on Java)
  • Dynamic languages: X10, Fortress, Chapel and Charm++
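
To make the model concrete, here is a minimal UPC sketch (an illustration added for this write-up, not code from the talk) of exactly these defaults: a shared array whose elements are partitioned across threads, a private stack variable per thread, and a direct read of data another thread wrote.

#include <upc.h>
#include <stdio.h>

/* One element per thread; globally addressable, with element i having
 * affinity to thread i (the "partitioned" part of PGAS). */
shared int counts[THREADS];

int main(void) {
    int mine = 10 * MYTHREAD;            /* private: lives on this thread's stack */
    counts[MYTHREAD] = mine;             /* write the element with local affinity */
    upc_barrier;                         /* make all writes globally visible      */

    /* Directly read data written by another thread: no matching receive needed. */
    int right = counts[(MYTHREAD + 1) % THREADS];
    printf("Thread %d sees neighbor value %d\n", MYTHREAD, right);
    upc_barrier;
    return 0;
}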
SLIDE 7

PGAS Language Overview

  • Many common concepts, although specifics differ
  • Consistent with base language, e.g., Titanium is strongly typed
  • Both private and shared data
  • int x[10]; and shared int y[10];

  • Support for distributed data structures
  • Distributed arrays; local and global pointers/references
  • One-sided shared-memory communication
  • Simple assignment statements: x[i] = y[i]; or t = *p;

  • Bulk operations: memcpy in UPC, array ops in Titanium and CAF
  • Synchronization
  • Global barriers, locks, memory fences
  • Collective Communication, IO libraries, etc.
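
A short UPC sketch (illustrative only; the array names and sizes are made up, not from the talk) of the pieces listed above: a private array, a shared distributed array, a pointer-to-shared, a one-sided assignment, a bulk upc_memget, and a barrier.

#include <upc.h>
#include <stdio.h>

#define N 10
int x[N];                            /* private: one copy per thread            */
shared [N] int y[N * THREADS];       /* shared: thread t owns y[t*N .. t*N+N-1] */

int main(void) {
    shared [N] int *p = &y[MYTHREAD * N];   /* global pointer (pointer-to-shared) */
    int buf[N];

    y[MYTHREAD * N] = MYTHREAD;          /* simple assignment into shared data    */
    upc_barrier;

    int t = *p;                          /* one-sided read through the pointer    */
    x[0] = t;

    /* Bulk operation: fetch a neighbor's whole block in one one-sided transfer. */
    upc_memget(buf, &y[((MYTHREAD + 1) % THREADS) * N], N * sizeof(int));
    upc_barrier;
    printf("Thread %d copied %d..%d\n", MYTHREAD, buf[0], buf[N - 1]);
    return 0;
}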
SLIDE 8

PGAS Language for Multicore

  • PGAS languages are a good fit for shared-memory machines
  • Global address space implemented as reads/writes
  • Current UPC and Titanium implementations use threads
  • Working on System V shared memory for UPC
  • The “competition” on shared memory is OpenMP
  • PGAS has locality information that may be important when we get to >100 cores per chip
  • May also be exploited for processors with an explicit local store rather than a cache, e.g., the Cell processor
  • The SPMD model in current PGAS languages is both an advantage (for performance) and a constraint

SLIDE 9

PGAS Languages on Clusters: One-Sided vs Two-Sided Communication

  • A one-sided put/get message can be handled directly by a network interface with RDMA support
  • Avoids interrupting the CPU or storing data from the CPU (preposts)
  • A two-sided message needs to be matched with a receive to identify the memory address at which to put the data
  • Matching can be offloaded to the network interface in networks like Quadrics
  • Match tables then need to be downloaded to the interface (from the host)

[Figure: a one-sided put message carries the destination address plus the data payload, while a two-sided message carries a message id plus the data payload; the receiving side (host CPU, network interface, memory) must match the two-sided message before the data can be placed.]

Joint work with Dan Bonachea
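
As a concrete contrast (illustrative, not from the slides): in UPC the put below is a complete one-sided transfer, because the initiator already knows the destination address; the equivalent two-sided MPI transfer would additionally require the receiver to post a matching MPI_Recv so the library can discover where to put the data.

#include <upc.h>

#define N 1024
/* Each thread owns one contiguous block of the landing buffer. */
shared [N] double landing[N * THREADS];

int main(void) {
    double local_buf[N];
    for (int i = 0; i < N; i++)
        local_buf[i] = MYTHREAD + 0.001 * i;

    int neighbor = (MYTHREAD + 1) % THREADS;
    /* One-sided put: the remote CPU does not participate; with RDMA support
     * the NIC writes the payload straight into the neighbor's block. */
    upc_memput(&landing[neighbor * N], local_buf, N * sizeof(double));
    upc_barrier;        /* ensure all puts complete before anyone reads */
    return 0;
}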

SLIDE 10

One-Sided vs. Two-Sided: Practice

[Figure: flood bandwidth (MB/s) vs. message size (10 B to 1 MB) for GASNet non-blocking put and MPI, with an inset of relative bandwidth (GASNet/MPI); up is good. Measured on the NERSC Jacquard machine with Opteron processors.]

  • InfiniBand: GASNet vapi-conduit and OSU MVAPICH 0.9.5
  • Half-power point (N½) differs by one order of magnitude
  • This is not a criticism of the implementation!

Joint work with Paul Hargrove and Dan Bonachea

SLIDE 11

GASNet: Portability and High-Performance

GASNet is better for latency across machines (down is good).

[Figure: 8-byte roundtrip latency (usec) for MPI ping-pong vs. GASNet put+sync on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Federation; reported values range from about 4.5 to 24.2 usec, with GASNet lower on each system.]

Joint work with UPC Group; GASNet design by Dan Bonachea

SLIDE 12

GASNet: Portability and High-Performance

GASNet bandwidth is at least as high as (comparable to) MPI for large messages (up is good).

[Figure: flood bandwidth for 2 MB messages as a percent of hardware peak, with absolute MB/s on the bars, for Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; MPI: 1504, 630, 244, 857, 225, 610 MB/s vs. GASNet: 1490, 799, 255, 858, 228, 795 MB/s.]

Joint work with UPC Group; GASNet design by Dan Bonachea

SLIDE 13

GASNet: Portability and High-Performance

GASNet excels at mid-range message sizes, which is important for overlap (up is good).

[Figure: flood bandwidth for 4 KB messages as a percent of hardware peak for Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; MPI: 547, 420, 190, 702, 152, 252 MB/s vs. GASNet: 750, 714, 231, 763, 223, 679 MB/s.]

Joint work with UPC Group; GASNet design by Dan Bonachea

SLIDE 14

Communication Strategies for 3D FFT

Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea

  • Three approaches:
  • Chunk (all rows with the same destination):
  • Wait for the 2nd-dimension FFTs to finish
  • Minimize # of messages
  • Slab (all rows in a single plane with the same destination):
  • Wait for the chunk of rows destined for one processor to finish
  • Overlap with computation
  • Pencil (1 row):
  • Send each row as it completes
  • Maximize overlap and
  • Match the natural layout
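
A structural sketch of the pencil strategy (added for illustration; fft_1d() and the row-to-owner mapping are assumptions, not from the talk): each row is pushed to its destination with a one-sided put as soon as its 1-D FFT finishes. A production version would use non-blocking puts so the transfer of one row overlaps the FFT of the next.

#include <upc.h>

#define NROWS 256
#define NCOLS 256

/* Each thread owns one contiguous landing zone of NROWS*NCOLS doubles. */
shared [NROWS * NCOLS] double recv_buf[NROWS * NCOLS * THREADS];
double rows[NROWS][NCOLS];                 /* this thread's private rows */

extern void fft_1d(double *row, int n);    /* assumed 1-D FFT routine */

void exchange_pencils(void) {
    for (int r = 0; r < NROWS; r++) {
        fft_1d(rows[r], NCOLS);            /* finish this pencil's 1-D FFT        */
        int dest = r % THREADS;            /* assumed row-to-owner mapping        */
        int slot = r / THREADS;            /* assumed position on the destination */
        /* One-sided put of a single row: smallest messages, maximal overlap. */
        upc_memput(&recv_buf[dest * NROWS * NCOLS + slot * NCOLS],
                   rows[r], NCOLS * sizeof(double));
    }
    upc_barrier;                           /* all pencils delivered */
}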

SLIDE 15

NAS FT Variants Performance Summary

  • Slab is always best for MPI; small message cost too high
  • Pencil is always best for UPC; more overlap

[Figure: best MFlop/s per thread for all NAS FT benchmark versions on Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, and Elan4 512, comparing Best NAS Fortran/MPI, Chunk (NAS FT with FFTW), Best MPI (always slabs), and Best UPC (always pencils); an annotation marks 0.5 TFlop/s.]

SLIDE 16

Top Ten PGAS Problems

  • 1. Pointer localization
  • 2. Automatic aggregation of communication
  • 3. Synchronization strength reduction
  • 4. Automatic overlap of communication
  • 5. Collective communication scheduling
  • 6. Data race detection
  • 7. Deadlock detection
  • 8. Memory consistency
  • 9. Global view vs. local view
  • 10. Mixed task and data parallelism

(Categories: optimization, analysis, language.)

SLIDE 17

Optimizations in Titanium

  • Communication optimizations are done
  • Analysis in Titanium is easier than in UPC:
  • Strong typing helps with alias analysis
  • Single analysis identifies global execution points that all threads will reach “together” (in the same synchronization phase)
  • I.e., a barrier would be legal here
  • Allows global optimizations
  • Convert remote reads to remote writes performed by the other side
  • Perform global runtime analysis (inspector-executor); a sketch follows below
  • Especially useful for sparse matrix code with indirection: y[i] = … a[b[i]]

Joint work with Jimmy Su
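
The inspector-executor idea can be sketched roughly as follows (hand-written C/UPC for illustration, not the Titanium compiler's generated code; the "bounding box" strategy is one of the runtime choices named on the next slide): an inspector pass finds, per owning thread, the range of indices that b[] actually touches; the data is then fetched with one bulk one-sided get per owner, and the executor loop runs entirely on local copies.

#include <upc.h>
#include <stdlib.h>

#define BLOCK 1000
/* Hypothetical distributed vector: thread t owns a[t*BLOCK .. t*BLOCK+BLOCK-1]. */
shared [BLOCK] double a[BLOCK * THREADS];

/* Computes y[i] = f(a[b[i]]) with the bounding-box flavor of inspector-executor:
 * one bulk fetch per owning thread instead of one fine-grained read per element. */
void sparse_gather(double *y, const int *b, int n) {
    int     *lo  = malloc(THREADS * sizeof(int));
    int     *hi  = malloc(THREADS * sizeof(int));
    double **box = malloc(THREADS * sizeof(double *));

    /* Inspector: per owner, find the bounding box of the indices we need. */
    for (int t = 0; t < THREADS; t++) { lo[t] = -1; hi[t] = -1; }
    for (int i = 0; i < n; i++) {
        int t = b[i] / BLOCK;
        if (lo[t] < 0 || b[i] < lo[t]) lo[t] = b[i];
        if (b[i] > hi[t])              hi[t] = b[i];
    }

    /* Communication: one bulk one-sided get per owner that is actually needed. */
    for (int t = 0; t < THREADS; t++) {
        box[t] = NULL;
        if (lo[t] >= 0) {
            int len = hi[t] - lo[t] + 1;
            box[t] = malloc(len * sizeof(double));
            upc_memget(box[t], &a[lo[t]], len * sizeof(double));
        }
    }

    /* Executor: purely local computation on the fetched boxes. */
    for (int i = 0; i < n; i++) {
        int t = b[i] / BLOCK;
        y[i] = 2.0 * box[t][b[i] - lo[t]];   /* stand-in for the real update */
    }

    for (int t = 0; t < THREADS; t++) free(box[t]);
    free(lo); free(hi); free(box);
}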

SLIDE 18

Global Communication Optimizations

[Figure: Itanium/Myrinet speedup comparison — sparse matrix-vector multiply, speedup of Titanium over the Aztec library across 22 matrices, showing the average and maximum speedup per matrix (roughly 1.0–1.6x).]

  • Titanium code is written with fine-grained remote accesses
  • Compiler identifies legal “inspector” points
  • Runtime selects (pack, bounding box) per machine / matrix / thread pair

Joint work with Jimmy Su

SLIDE 19

Parallel Program Analysis

  • To perform optimizations, new analyses are needed for parallel languages
  • In a data-parallel or serial (auto-parallelized) language, the semantics are serial
  • Analysis is “easier” but more critical to performance
  • Parallel semantics require:
  • Concurrency analysis: which code sequences may run concurrently
  • Parallel alias analysis: which accesses could conflict between threads
  • Analysis is used to detect races, identify localizable pointers, and ensure memory consistency semantics (if desired)

SLIDE 20

Concurrency Analysis in Titanium

  • Relies on Titanium’s textual barriers and single-valued expressions
  • Titanium has textual barriers: all threads must execute the same textual sequence of barriers, so the following is illegal:

if (Ti.thisProc() % 2 == 0)
    Ti.barrier();   // even ID threads
else
    Ti.barrier();   // odd ID threads

  • Single-valued expressions are used to enforce textual barriers while permitting useful programs:

single boolean allGo = broadcast go from 0;
if (allGo) Ti.barrier();

  • May also be used in loops to ensure the same number of iterations

Joint work with Amir Kamil and Jimmy Su

SLIDE 21

Concurrency Analysis

  • Graph generated from program as follows:
  • Node for each code segment between barriers and single conditionals

  • Edges added to represent control flow between segments
  • Barrier edges removed
  • Two accesses can run concurrently if:
  • They are in the same node, or
  • One access’s node is reachable from the other access’s node

// segment 1
if ([single])
    // segment 2
else
    // segment 3
// segment 4
Ti.barrier()
// segment 5

[Figure: the corresponding graph has nodes for segments 1–5 connected by control-flow edges; the barrier edge into segment 5 is removed.]

Joint work with Amir Kamil and Jimmy Su
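
A minimal sketch (plain C, added for illustration; not the actual compiler implementation) of how this check can be realized: build the segment graph without barrier edges, take its transitive closure, and report two accesses as potentially concurrent when they share a segment or one segment can reach the other.

#include <stdbool.h>
#include <string.h>

#define MAX_SEG 64

/* reach[i][j] = true if segment j is reachable from segment i without
 * crossing a barrier (barrier edges are simply never added). */
static bool reach[MAX_SEG][MAX_SEG];
static int  nseg;

void add_edge(int from, int to) { reach[from][to] = true; }

/* Transitive closure over the barrier-free control-flow edges. */
void close_graph(void) {
    for (int k = 0; k < nseg; k++)
        for (int i = 0; i < nseg; i++)
            for (int j = 0; j < nseg; j++)
                if (reach[i][k] && reach[k][j]) reach[i][j] = true;
}

/* Two accesses may run concurrently if they sit in the same segment
 * or one segment is reachable from the other. */
bool may_run_concurrently(int seg_a, int seg_b) {
    return seg_a == seg_b || reach[seg_a][seg_b] || reach[seg_b][seg_a];
}

/* Example: the five segments from the slide; the barrier edge 4 -> 5 is omitted. */
void build_example(void) {
    nseg = 6;                          /* segments 1..5 (index 0 unused)      */
    memset(reach, 0, sizeof reach);
    add_edge(1, 2); add_edge(1, 3);    /* branches of the single conditional  */
    add_edge(2, 4); add_edge(3, 4);    /* join before the barrier             */
    /* no edge from 4 to 5: it crosses Ti.barrier()                           */
}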

SLIDE 22

Alias Analysis

  • Allocation sites correspond to abstract locations (a-locs)
  • Abstract locations (a-locs) are typed
  • All explicit and implicit program variables have points-to sets
  • Each field of an object has a separate set
  • Arrays have a single points-to set for all elements
  • Thread-aware: two kinds of abstract locations, local and remote
  • Local locations reside in the local thread’s memory
  • Remote locations reside on another thread
  • Generalizes to multiple levels (thread, node, cluster)

Joint work with Amir Kamil

SLIDE 23

Benchmarks

Benchmark      Lines¹   Description
amr-poisson    4700     AMR Poisson (elliptic) solver
amr-gas        8841     Hyperbolic AMR solver for gas dynamics
spmv           1493     Sparse matrix-vector multiply
gsrb           1090     Computational fluid dynamics kernel
3d-fft         614      Fourier transform
lu-fact        420      Dense linear algebra
sample-sort    321      Parallel sort
demv           122      Dense matrix-vector multiply
pi             56       Monte Carlo integration

¹ Line counts do not include the reachable portion of the 37,000-line Titanium/Java 1.0 libraries.

Joint work with Amir Kamil

SLIDE 24

Analysis Levels

  • Analyses of varying levels of precision:

Analysis                        Description
naïve                           All heap accesses
LQI/SQI/Sharing                 Previous constraint-based type analyses by Aiken, Gay, and Liblit (different versions for each client)
concur + multi-level pointer    Concurrency analysis + hierarchical (on- and off-node) thread-aware alias analysis

Joint work with Amir Kamil

SLIDE 25

Local Qualification Inference

[Figure: “Declarations Identified as Local” — percent of declarations per benchmark (3d-fft, amr-poisson, amr-gas, gsrb, lu-fact, pi, pps, sample-sort, demv, spmv) for the old constraint-based analysis (LQI), the thread-aware pointer analysis, and the hierarchical pointer analysis.]

Local pointers are both faster and smaller

Joint work with Amir Kamil

SLIDE 26

Private Qualification Inference

[Figure: “Declarations Identified as Private” — percent of declarations per benchmark (same suite as above) for the old type-based SQI analysis and the thread-aware pointer analysis.]

Private data may be cached and is known not to be in a race

Joint work with Amir Kamil

SLIDE 27

Making PGAS Real: Applications and Portability

SLIDE 28

Coding Challenges: Block-Structured AMR

  • Adaptive Mesh Refinement (AMR) is challenging
  • Irregular data accesses and control from boundaries
  • Mixed global/local view is useful

AMR Titanium work by Tong Wen and Philip Colella

Titanium AMR benchmark available

SLIDE 29

Language Support Helps Productivity

C++/Fortran/MPI AMR

  • Chombo package from LBNL
  • Bulk-synchronous communication:
  • Pack boundary data between processors
  • All optimizations done by the programmer

Titanium AMR

  • Entirely in Titanium
  • Finer-grained communication
  • No explicit pack/unpack code
  • Automated in the runtime system
  • General approach:
  • Language allows programmer optimizations
  • Compiler/runtime does some automatically

Work by Tong Wen and Philip Colella; communication optimizations joint with Jimmy Su

[Figure: lines of code for Titanium vs. C++/F/MPI (Chombo), broken down into AMRElliptic, AMRTools, Util, Grid, AMR, and Array components, on a scale up to 30,000 lines.]

SLIDE 30

Performance of Titanium AMR

[Figure: speedup vs. number of processors (16, 28, 36, 56, 112) for Titanium and Chombo.]

  • Serial: Titanium is within a few % of C++/F; sometimes faster!
  • Parallel: Titanium scaling is comparable with generic optimizations
  • Optimizations (SMP-aware) that are not in the MPI code
  • Additional optimizations (namely overlap) not yet implemented

Comparable parallel performance

Joint work with Tong Wen, Jimmy Su, Phil Colella

SLIDE 31

Particle/Mesh Method: Heart Simulation

  • Elastic structures in an incompressible fluid.
  • Blood flow, clotting, inner ear, embryo growth, …
  • Complicated parallelization
  • Particle/Mesh method, but “Particles” connected into materials (1D or 2D structures)
  • Communication patterns irregular between particles (structures) and mesh (fluid)

Joint work with Ed Givelberg, Armando Solar-Lezama, Charlie Peskin, Dave McQueen

[Figure: 2D Dirac delta function.]

Code size in lines: 8000 (Fortran) vs. 4000 (Titanium). Note: the Fortran code is not parallel.

SLIDE 32

Immersed Boundary Method Performance

[Figure, left — Hand-Optimized (planes, 2004): time (secs) vs. procs (1–128) for 256³ and 512³ on Power3/Colony and 512²×256 on Pentium4/Myrinet. Figure, right — Automatically Optimized (sphere, 2006): time (secs) vs. procs (1–128) for 128³ and 256³ on Power4/Federation.]

Joint work with Ed Givelberg, Armando Solar-Lezama, Charlie Peskin, Dave McQueen

SLIDE 33

SLIDE 34

Beyond the SPMD Model: Mixed Parallelism

  • UPC and Titanium use a static threads (SPMD) programming model
  • General, performance-transparent
  • Criticized as “local view” rather than “global view”
  • “for all my array elements”, or “for all my blocks”
  • Adding an extension for data parallelism
  • Based on a collective model:
  • Threads gang together to do data-parallel operations
  • Or (from a different perspective) a single data-parallel thread can split into P threads when needed
  • Compiler proves that threads are aligned at barriers, reductions, and other collective points
  • Already used for global optimizations: the reads-to-writes transformation
  • Adding support for other data-parallel operations (a sketch follows below)

Joint work with Parry Husbands
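
A small UPC sketch (illustrative; not part of the proposed extension itself) of the collective flavor described above: the SPMD thread gang jointly performs what the programmer can think of as a single data-parallel reduction, with barriers marking the points where the threads are aligned.

#include <upc.h>

shared double partial[THREADS];     /* one contribution slot per thread */
shared double total;

double allreduce_sum(double mine) {
    partial[MYTHREAD] = mine;       /* each thread contributes its piece   */
    upc_barrier;                    /* gang alignment: all threads arrive  */
    if (MYTHREAD == 0) {            /* one thread combines the pieces      */
        double s = 0.0;
        for (int t = 0; t < THREADS; t++) s += partial[t];
        total = s;
    }
    upc_barrier;                    /* everyone sees the combined result   */
    return total;
}

In the envisioned extension, the compiler would prove that the threads are aligned at the two barriers rather than relying on the programmer to reason about it.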

SLIDE 35

Beyond the SPMD Model: Dynamic Threads

  • UPC uses a static threads (SPMD) programming model
  • No dynamic load balancing is built in, although there are some examples (Delaunay mesh generation) of building it on top
  • The Berkeley UPC model extends the basic memory semantics (remote read/write) with active messages
  • AMs have limited functionality (no messages except acks) to avoid deadlock in the network
  • A more dynamic runtime would have many uses
  • Application load imbalance, OS noise, fault tolerance
  • Two extremes are well studied:
  • Dynamic load balancing (e.g., random stealing) without locality
  • Static parallelism (with threads = processors) with locality
  • Charm++ has virtualized processes with locality
  • How much “unnecessary” parallelism can it support?

Joint work with Parry Husbands

SLIDE 36

Task Scheduling Problem Spectrum

  • How important is locality, and what is the locality relationship?
  • Some tasks must run with dependent tasks to re-use state
  • If data is small or the compute:communicate ratio is large, locality is less important
  • Can we build runtimes that work for the hardest case: a general DAG with large data and small compute?

SLIDE 37

Dense and Sparse Matrix Factorization

  • Blocks are 2D block-cyclic distributed
  • Panel factorizations involve communication for pivoting
  • Matrix-matrix multiplication is used for the trailing-matrix update and can be coalesced

[Figure: the matrix mid-factorization, showing the completed parts of L and U, the panel being factored, the trailing matrix to be updated, and blocks A(i,j), A(i,k), A(j,i), A(j,k).]

Joint work with Parry Husbands

SLIDE 38

Parallel Tasks in LU

  • Theoretical and practical problem: memory deadlock
  • Not enough memory for all tasks at once (each update needs two temporary blocks, a green and a blue one in the figure, to run)
  • If updates are scheduled too soon, you will run out of memory
  • If updates are scheduled too late, the critical path will be delayed

[Figure: task graph for the factorization; some edges omitted.]
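
One way to see the trade-off is a scheduler sketch (illustrative C, not the actual Berkeley UPC LU scheduler; the task representation, buffer budget, and run() are assumptions): panel factorizations on the critical path always win, and trailing-matrix updates are admitted only while a budget of temporary blocks remains.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task record: panel factorizations sit on the critical path;
 * each trailing-matrix update needs two temporary blocks while it runs. */
typedef enum { PANEL_FACTOR, TRAILING_UPDATE } task_kind;

typedef struct task {
    task_kind    kind;
    struct task *next;
} task;

static task *ready_panels;          /* critical-path work, highest priority */
static task *ready_updates;         /* deferrable work that consumes memory */
static int   free_temp_blocks = 16; /* budget of temporary blocks (example) */

extern void run(task *t);           /* hypothetical: execute one task */

static task *pop(task **list) {
    task *t = *list;
    if (t) *list = t->next;
    return t;
}

/* One scheduling step: never let updates starve the critical path, and never
 * admit an update unless its two temporary blocks are available. */
void schedule_step(void) {
    task *t = pop(&ready_panels);          /* critical path first            */
    if (!t && free_temp_blocks >= 2) {     /* throttle the "too soon" case   */
        t = pop(&ready_updates);
        if (t) free_temp_blocks -= 2;      /* reserve the two blocks          */
    }
    if (t) {
        run(t);
        if (t->kind == TRAILING_UPDATE) free_temp_blocks += 2;
    }
    /* The "too late" case is limited because updates run whenever memory
     * allows and no panel work is pending. */
}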

SLIDE 39

LU in UPC + Multithreading

  • UPC uses a static threads (SPMD) programming model
  • Multithreading is used to mask latency and to mask dependence delays
  • Three levels of threads:
  • UPC threads (data layout, each runs an event scheduling loop)
  • Multithreaded BLAS (boost efficiency)
  • User level (non-preemptive) threads with explicit yield
  • No dynamic load balancing, but lots of remote invocation
  • Layout is fixed (blocked/cyclic) and tuned for block size
  • Same framework being used for sparse Cholesky
  • Hard problems
  • Block size tuning (tedious) for both locality and granularity
  • Task prioritization (ensure critical path performance)
  • Resource management can deadlock memory allocator if not careful
  • Collectives (asynchronous reductions for pivoting) need high priority

Joint work with Parry Husbands

SLIDE 40

UPC HP Linpack Performance

[Figure: HPL GFlop/s for UPC vs. MPI/HPL on the Cray X1 (64 and 128 processors), an Opteron cluster (64 processors), and an SGI Altix (32 processors).]

  • Faster than ScaLAPACK due to less synchronization
  • Comparable to MPI HPL (numbers from HPCC database)
  • Large scaling of UPC code on Itanium/Quadrics (Thunder)
  • 2.2 TFlops on 512p and 4.4 TFlops on 1024p

Joint work with Parry Husbands

UPC vs. ScaLAPACK

[Figure: GFlop/s for ScaLAPACK vs. UPC on 2x4 and 4x4 processor grids.]

SLIDE 41

HPCS Languages

  • DARPA HPCS languages
  • X10 from IBM, Chapel from Cray, Fortress from Sun
  • Many interesting differences
  • Atomics vs. transactions
  • Remote read/write vs. remote invocation
  • Base language: Java vs. a new language
  • Hierarchical vs. flat space of virtual processors
  • Many interesting commonalities
  • Mixed task and data parallelism
  • Data-parallel operations are “one-sided”, not collective: one thread can invoke a reduction without any help from others

  • Distributed arrays with user-defined distributions
  • Dynamic load balancing built in
SLIDE 42

Conclusions and Open Questions

  • Best time ever for a new parallel language
  • Community is looking for parallel programming solutions
  • Not just an HPC problem
  • Current PGAS Languages
  • Good fit for shared and distributed memory
  • Control over locality and (for better or worse) SPMD
  • Need to break out of strict SPMD model
  • Load imbalance, OS noise, fault tolerance, etc.
  • Managed runtimes like Charm++ add generality
  • Some open language questions
  • Can we get the best of global view (data-parallel) and local view in one efficient parallel language?
  • Will non-SPMD languages have sufficient resource control for applications with complex task graph structures?