 
Compilation Techniques for Partitioned Global Address Space Languages
Katherine Yelick
U.C. Berkeley and Lawrence Berkeley National Lab
http://titanium.cs.berkeley.edu
http://upc.lbl.gov
Charm++ 2007
HPC Programming: Where are We?
• IBM SP at NERSC/LBNL has 6K processors
  • There were 6K transistors in the Intel 8080a implementation
• BG/L at LLNL has 64K processor cores
  • There were 68K transistors in the MC68000
• A BG/Q system with 1.5M processors may have more processors than there are logic gates per processor
• HPC application developers today write programs that are as complex as describing where every single bit must move between the 6,000 transistors of the 8080a
• We need to at least get to “assembly language” level
Slide source: Horst Simon and John Shalf, LBNL/NERSC
Petaflop with ~1M Cores: By 2008, Common by 2015?
[Figure: Top500 performance on a log scale (up to 1 Eflop/s) versus year, 1993 to 2014, with series SUM, #1, and #500. Annotations: "1 PFlop system in 2008" and "6-8 years". Data from top500.org.]
Slide source: Horst Simon, LBNL
Predictions
• Parallelism will explode
  • Number of cores will double every 12-24 months
  • Petaflop (million processor) machines will be common in HPC by 2015 (all top 500 machines will have this)
• Performance will become a software problem
  • Parallelism and locality will be key concerns for many programmers, not just an HPC problem
• A new programming model will emerge for multicore programming
  • Can one language cover laptop to top500 space?
PGAS Languages: What, Why, and How
Partitioned Global Address Space
• Global address space: any thread/process may directly read/write data allocated by another
• Partitioned: data is designated as local or global
• By default:
  • Object heaps are shared
  • Program stacks are private
[Figure: a global address space partitioned across threads p0, p1, ..., pn. Each partition holds x (1, 5, and 7 on the respective threads) and y; each thread also has l and g.]
• SPMD languages: UPC, CAF, and Titanium
  • All three use an SPMD execution model
  • Emphasis in this talk on UPC and Titanium (based on Java)
• Dynamic languages: X10, Fortress, Chapel and Charm++
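The idea of a partitioned global address space can be illustrated with a toy model. This is a minimal sketch, not how UPC or Titanium implement it: `GlobalPtr`, `ToyPGAS`, and `demo` are hypothetical names, and a "remote access" is simply counted rather than sent over a network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobalPtr:
    """Hypothetical global pointer: owning thread plus offset in its partition."""
    owner: int
    offset: int


class ToyPGAS:
    """Toy global address space: one list (partition) per thread."""

    def __init__(self, n_threads, partition_size):
        self.partitions = [[0] * partition_size for _ in range(n_threads)]
        self.remote_accesses = 0  # accesses that cross a partition boundary

    def read(self, me, ptr):
        # Any thread may read any partition; only non-local reads are "remote".
        if ptr.owner != me:
            self.remote_accesses += 1
        return self.partitions[ptr.owner][ptr.offset]

    def write(self, me, ptr, value):
        if ptr.owner != me:
            self.remote_accesses += 1
        self.partitions[ptr.owner][ptr.offset] = value


def demo():
    space = ToyPGAS(n_threads=3, partition_size=4)
    x = [GlobalPtr(owner=t, offset=0) for t in range(3)]
    # Each thread writes its own x (local accesses), as in the slide's figure.
    for t, value in enumerate([1, 5, 7]):
        space.write(t, x[t], value)
    # Thread 0 reads every x: one local read and two remote reads.
    total = sum(space.read(0, p) for p in x)
    return total, space.remote_accesses
```

The point of the sketch is that the same `read` call works for local and remote data, while the owner field keeps the locality information that a PGAS compiler or runtime can exploit.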
PGAS Language Overview
• Many common concepts, although specifics differ
  • Consistent with base language, e.g., Titanium is strongly typed
• Both private and shared data
  • int x[10]; and shared int y[10];
• Support for distributed data structures
  • Distributed arrays; local and global pointers/references
• One-sided shared-memory communication
  • Simple assignment statements: x[i] = y[i]; or t = *p;
  • Bulk operations: memcpy in UPC, array ops in Titanium and CAF
• Synchronization
  • Global barriers, locks, memory fences
• Collective Communication, IO libraries, etc.
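To make the distributed-array bullet concrete, the sketch below computes which thread has affinity to each element of a UPC-style shared array. It models the UPC layout rule (block size 1, i.e. cyclic, when no layout qualifier is given; round-robin blocks otherwise) in Python rather than calling any UPC runtime; `owner` and `local_indices` are hypothetical helper names.

```python
def owner(i, n_threads, block=1):
    """Thread with affinity to element i of a shared array.

    block=1 models the default cyclic layout of `shared int y[N]`;
    a larger block models `shared [block] int y[N]`.
    """
    return (i // block) % n_threads


def local_indices(thread, n, n_threads, block=1):
    """Indices of an n-element shared array that are local to `thread`."""
    return [i for i in range(n) if owner(i, n_threads, block) == thread]
```

For example, with 4 threads, `local_indices(0, 10, 4)` gives the elements of a 10-element cyclic array that thread 0 can access without communication; an assignment such as `x[i] = y[i]` is a remote read whenever `owner(i, ...)` is another thread.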
PGAS Language for Multicore
• PGAS languages are a good fit for shared memory machines
  • Global address space implemented as reads/writes
  • Current UPC and Titanium implementations use threads
  • Working on System V shared memory for UPC
• “Competition” on shared memory is OpenMP
• PGAS has locality information that may be important when we get to >100 cores per chip
  • Also may be exploited for processors with explicit local store rather than cache, e.g., Cell processor
• SPMD model in current PGAS languages is both an advantage (for performance) and constraining
PGAS Languages on Clusters: One-Sided vs Two-Sided Communication
[Figure: a two-sided message carries a message id and a data payload; a one-sided put message carries an address and a data payload. The host consists of a CPU, a network interface, and memory.]
• A one-sided put/get message can be handled directly by a network interface with RDMA support
  • Avoid interrupting the CPU or storing data from CPU (preposts)
• A two-sided message needs to be matched with a receive to identify the memory address where the data should be put
  • Offloaded to Network Interface in networks like Quadrics
  • Need to download match tables to interface (from host)
Joint work with Dan Bonachea
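The difference between the two message kinds can be sketched with a toy receiver. This is an illustrative model only, not the behavior of GASNet, MPI, or any particular network interface; `Node` and its methods are hypothetical. A put carries its destination address, while a send carries only a tag that must be matched against a posted receive before the destination is known.

```python
class Node:
    """Toy receiving node with memory and a two-sided match table."""

    def __init__(self, mem_size):
        self.memory = [0] * mem_size
        self.posted = {}       # tag -> destination address of a posted receive
        self.unexpected = {}   # tag -> payload that arrived before its receive
        self.match_lookups = 0

    def put(self, addr, payload):
        # One-sided: the message names the address, so no matching is needed.
        self.memory[addr:addr + len(payload)] = payload

    def post_recv(self, tag, addr):
        # Two-sided, receiver side: supply the address for messages with `tag`.
        if tag in self.unexpected:
            payload = self.unexpected.pop(tag)
            self.memory[addr:addr + len(payload)] = payload
        else:
            self.posted[tag] = addr

    def deliver_send(self, tag, payload):
        # Two-sided, arrival: look up the tag to find where the data goes.
        self.match_lookups += 1
        if tag in self.posted:
            addr = self.posted.pop(tag)
            self.memory[addr:addr + len(payload)] = payload
        else:
            # No receive posted yet: the data must be buffered and copied later.
            self.unexpected[tag] = list(payload)


def demo():
    node = Node(8)
    node.put(2, [9, 9])            # lands directly at address 2
    node.deliver_send(7, [4, 5])   # arrives early, gets buffered
    node.post_recv(7, 5)           # receive supplies the address afterwards
    return node.memory, node.match_lookups
```

In the sketch, the extra lookup and the buffering of unexpected data are the work that either the host CPU or an offloaded match table has to do for two-sided messages.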
One-Sided vs. Two-Sided: Practice
[Figure: bandwidth (MB/s, up is good) versus message size (bytes, 10 to 1,000,000) for GASNet put (nonblock) and MPI Flood on NERSC Jacquard, a machine with Opteron processors. Inset: relative bandwidth (GASNet/MPI), 1.0 to 2.4, versus size.]
• InfiniBand: GASNet vapi-conduit and OSU MVAPICH 0.9.5
• Half power point (N½) differs by one order of magnitude
• This is not a criticism of the implementation!
Joint work with Paul Hargrove and Dan Bonachea
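The half power point N½ is the message size at which half of the peak bandwidth is reached. A minimal sketch, assuming the common linear cost model T(n) = alpha + n/beta (per-message overhead alpha, peak bandwidth beta), which the slide does not state: under that model N½ = alpha × beta, so a tenfold difference in per-message overhead gives a tenfold difference in N½. The numbers in the usage note are made up for illustration and are not measurements from the talk.

```python
def bandwidth(n_bytes, alpha_s, beta_bytes_per_s):
    """Achieved bandwidth for an n-byte message under T(n) = alpha + n / beta."""
    return n_bytes / (alpha_s + n_bytes / beta_bytes_per_s)


def half_power_point(alpha_s, beta_bytes_per_s):
    """Message size at which bandwidth(n) equals half of the peak beta.

    Solving n / (alpha + n / beta) = beta / 2 gives n = alpha * beta.
    """
    return alpha_s * beta_bytes_per_s
```

With a hypothetical 800 MB/s link, `half_power_point(10e-6, 800e6)` is about 8000 bytes and `half_power_point(1e-6, 800e6)` is about 800 bytes: lower overhead moves the point where the network becomes efficient toward much smaller messages.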
GASNet: Portability and High-Performance
[Figure: 8-byte roundtrip latency (usec, down is good) for MPI ping-pong and GASNet put+sync on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed. Bar labels range from 4.5 to 24.2 usec.]
GASNet better for latency across machines
Joint work with UPC Group; GASNet design by Dan Bonachea
GASNet: Portability and High-Performance
[Figure: flood bandwidth for 2MB messages as percent of hardware peak (BW in MB, up is good) for MPI and GASNet on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed.]
GASNet at least as high (comparable) for large messages
Joint work with UPC Group; GASNet design by Dan Bonachea
GASNet: Portability and High-Performance
[Figure: flood bandwidth for 4KB messages as percent of hardware peak (up is good) for MPI and GASNet on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed.]
GASNet excels at mid-range sizes: important for overlap
Joint work with UPC Group; GASNet design by Dan Bonachea
Communication Strategies for 3D FFT
• Three approaches:
  • Chunk:
    • Wait for 2nd dim FFTs to finish
    • Minimize # messages
  • Slab:
    • Wait for chunk of rows destined for 1 proc to finish
    • Overlap with computation
  • Pencil:
    • Send each row as it completes
    • Maximize overlap and
    • Match natural layout
[Figure labels: chunk = all rows with same destination; slab = all rows in a single plane with same destination; pencil = 1 row]
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea
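The trade-off between the three strategies is message count versus message size. The sketch below works this out under assumptions that are not on the slide: an nx × ny × nz grid, p threads that each own nz/p planes, ny rows of nx elements per plane, rows of a plane divided evenly among the p destinations, and messages to oneself counted like any other. `exchange_plan` is a hypothetical helper, not code from the NAS FT benchmark.

```python
def exchange_plan(nx, ny, nz, p):
    """Per-thread (message count, elements per message) for each strategy.

    Assumes p divides both ny and nz.
    """
    planes = nz // p           # planes owned by one thread
    rows_per_dest = ny // p    # rows of one plane bound for one destination
    return {
        # chunk: all rows with the same destination, across all owned planes
        "chunk": (p, planes * rows_per_dest * nx),
        # slab: all rows in a single plane with the same destination
        "slab": (planes * p, rows_per_dest * nx),
        # pencil: one row per message
        "pencil": (planes * ny, nx),
    }
```

For a 256³ grid on 64 threads, this gives 64 messages per thread for chunk, 256 for slab, and 1024 for pencil, all moving the same total volume. Pencils expose the most opportunities to overlap communication with the remaining FFTs, but only pay off when small messages are cheap.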
NAS FT Variants Performance Summary
[Figure: best MFlop rates for all NAS FT benchmark versions, in MFlops per thread, on Myrinet (64 procs), InfiniBand (256), Elan3 (256 and 512), and Elan4 (256 and 512). Series: Chunk (NAS FT with FFTW), Best NAS Fortran/MPI, Best MPI (always slabs), Best UPC (always pencils). Annotation: .5 Tflops.]
• Slab is always best for MPI; small message cost too high
• Pencil is always best for UPC; more overlap
Top Ten PGAS Problems
1. Pointer localization optimization
2. Automatic aggregation of communication
3. Synchronization strength reduction
4. Automatic overlap of communication
5. Collective communication scheduling analysis
6. Data race detection
7. Deadlock detection
8. Memory consistency language
9. Global view ↔ local view
10. Mixed Task and Data Parallelism
Optimizations in Titanium
• Communication optimizations are done
• Analysis in Titanium is easier than in UPC:
  • Strong typing helps with alias analysis
• Single analysis identifies global execution points that all threads will reach “together” (in same synch phase)
  • I.e., a barrier would be legal here
• Allows global optimizations
  • Convert remote reads to remote writes by other side
  • Perform global runtime analysis (inspector-executor)
  • Especially useful for sparse matrix code with indirection: y[i] = … a[b[i]]
Joint work with Jimmy Su
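The inspector-executor idea for a loop of the form y[i] = a[b[i]] can be sketched as follows. This is a simplified illustration and not the Titanium compiler's actual transformation: it assumes `a` is block-distributed, models each thread's partition as a Python list, and counts one message per fine-grained remote read or per bulk fetch. All function names are hypothetical.

```python
def owner_of(idx, block):
    """Owning thread of a[idx] under a block distribution."""
    return idx // block


def naive(a_parts, b, me, block):
    """Fine-grained version: one remote read per non-local a[b[i]]."""
    messages = 0
    y = []
    for idx in b:
        o = owner_of(idx, block)
        if o != me:
            messages += 1
        y.append(a_parts[o][idx % block])
    return y, messages


def inspector(b, me, block):
    """Scan the index array once and record which remote elements are needed."""
    needed = {}
    for idx in b:
        o = owner_of(idx, block)
        if o != me:
            needed.setdefault(o, set()).add(idx)
    return needed


def executor(a_parts, b, me, block, needed):
    """Fetch the needed elements in one bulk get per owner, then compute locally."""
    ghost = {}
    messages = 0
    for o, idxs in needed.items():
        messages += 1  # one aggregated transfer per remote owner
        for idx in sorted(idxs):
            ghost[idx] = a_parts[o][idx % block]
    y = [ghost[idx] if idx in ghost else a_parts[me][idx % block] for idx in b]
    return y, messages
```

Both versions compute the same `y`; the inspector's extra pass over `b` is paid back when the same index pattern is reused, as in repeated sparse matrix-vector multiplies.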
Global Communication Optimizations
Sparse Matrix-Vector Multiply on Itanium/Myrinet
[Figure: speedup of Titanium over the Aztec library on Itanium/Myrinet, showing average and maximum speedup (y-axis 1 to 1.6) for matrices numbered 1 to 22.]
• Titanium code is written with fine-grained remote accesses
• Compiler identifies legal “inspector” points
• Runtime selects (pack, bounding box) per machine / matrix / thread pair
Joint work with Jimmy Su
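The runtime choice between pack and bounding box can be sketched with a simple volume comparison. The cost model here is an assumption for illustration, not the one used in the Titanium runtime: a bounding box transfers the whole contiguous range covering the needed indices, while pack transfers only the needed elements but pays a hypothetical per-element gathering overhead (`pack_overhead`, standing in for the machine-dependent part of the decision).

```python
def bounding_box_volume(indices):
    """Elements moved when fetching the contiguous range covering all indices."""
    return max(indices) - min(indices) + 1


def pack_volume(indices):
    """Elements moved when only the distinct needed elements are sent."""
    return len(set(indices))


def choose_method(indices, pack_overhead=2.0):
    """Pick the cheaper method for one thread pair under the assumed model.

    pack_overhead is a made-up relative cost of gathering one packed element
    versus copying one element of a contiguous range.
    """
    box_cost = bounding_box_volume(indices)
    pack_cost = pack_volume(indices) * pack_overhead
    return "bounding box" if box_cost <= pack_cost else "pack"
```

Dense index sets such as `[10, 11, 12, 14]` favor the bounding box, while scattered ones such as `[0, 1000]` favor pack. Because the index sets differ per matrix and per thread pair, the choice has to be made at runtime.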