Programming Models for Parallel Computing
Katherine Yelick, U.C. Berkeley and Lawrence Berkeley National Lab
http://titanium.cs.berkeley.edu
http://upc.lbl.gov

Parallel Computing Past
Not long ago, the viability of parallel computing was in doubt: skeptics predicted that
"parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it."
Such predictions rank with the 1977 claim that there is no reason anyone would want a computer in their home.
Slide source: Warfield et al.
Moore's Law
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months: 2x transistors per chip every ~1.5 years.
Slide source: Jack Dongarra
…sells for $20K, whereas a PS3 is about $600 and uses only 7 of the 8 SPEs on its Cell processor.
Transistor density: continuing increase of ~2x every 2 years.
Clock speed is no longer rising; the number of processor cores may double instead.
There is little remaining hidden parallelism (ILP) to be found.
Parallelism must instead be exposed to and managed by software.
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
[Figure: uniprocessor performance relative to the VAX-11/780, 1978-2006, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006. Growth of 25%/year, then 52%/year, now ??%/year; the recent slowdown leaves roughly a 3X gap below the earlier trend line.]
level languages
[Figure: TOP500 projected performance development, 1993-2014 (SUM, #1, and #500 curves), on a log scale from 10 MFlop/s to 1 EFlop/s; the extrapolation predicts a 1 PFlop system in 2008, with roughly a 6-8 year lag between the #1 and #500 curves. Slide source: Horst Simon, LBNL; data from top500.org.]
The memory hierarchy is at least as important: memory (DRAM) performance improves slowly (about 7% per year) compared to processor performance.
Memory hierarchy (approximate size and speed per level):
    Processor (control, datapath, registers, on-chip cache)   ~B    ~1 ns
    Second level cache (SRAM)                                  ~KB   ~10 ns
    Main memory (DRAM)                                         ~MB   ~100 ns
    Secondary storage (Disk)                                   ~GB   ~10 ms
    Tertiary storage (Disk/Tape)                               ~TB   ~10 sec
Partitioned Global Address Space
Global address space: any thread may directly read/write data allocated by another.
[Figure: threads p0, p1, ..., pn, each with private local variables (l:, x:, y:) and globally visible variables (g:) in the shared portion of the address space.]
By default:
Object heaps are shared.
Program stacks are private.
Shared data in UPC:
    shared int y[10];   /* array in the shared address space */
    t = *p;             /* dereferencing a pointer-to-shared may read remote memory */
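A minimal sketch (mine, not from the slides) making the dereference above concrete; UPC distinguishes private pointers from pointers-to-shared:

    #include <upc.h>

    shared int y[10];           /* shared array (default layout: one element per thread, cyclic) */
    shared int *shared p3;      /* shared pointer to shared data (single instance, affinity to thread 0) */

    int main(void) {
        int t;
        int *p1 = &t;           /* private pointer to private data */
        shared int *p2 = &y[3]; /* private pointer to shared data */

        t = *p2;                /* this dereference can generate a remote read */
        (void)p1; (void)p3;
        return t;
    }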
Shared and private variables in UPC:
    shared int ours;   /* a single shared instance, visible to all threads */
    int mine;          /* one private instance per thread */
Shared array layout is controlled by an optional block size:
    shared int x[2*THREADS];       /* cyclic, 1 element per thread, wrapped */
    shared [2] int y[2*THREADS];   /* blocked, with block size 2 */
[Figure: the global address space split into shared and private portions; x (cyclic) and y (blocked) are spread across Thread0 ... Threadn in the shared portion, while each thread holds its own copy of mine in the private portion.]
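To make the layout rules concrete, here is a small sketch (mine, not from the talk) of a UPC loop whose affinity expression runs each iteration on the thread that owns the element, so the writes to x inside the loop are all local:

    #include <upc.h>
    #include <stdio.h>

    shared int x[2*THREADS];        /* cyclic: element i has affinity to thread i % THREADS */
    shared [2] int y[2*THREADS];    /* blocked: consecutive pairs live on the same thread */

    int main(void) {
        int i;
        /* the fourth clause is the affinity expression: iteration i executes
           on the thread that owns x[i] */
        upc_forall (i = 0; i < 2*THREADS; i++; &x[i]) {
            x[i] = MYTHREAD;        /* purely local write */
            y[i] = MYTHREAD;        /* may be remote: y's blocked layout differs from x's cyclic layout */
        }
        upc_barrier;
        if (MYTHREAD == 0)
            printf("x[0]=%d x[THREADS]=%d\n", x[0], x[THREADS]);
        return 0;
    }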
One-sided vs. two-sided communication
A one-sided put/get can be handled entirely by a network interface with RDMA support: the message carries the destination address, so the NIC can identify the memory address to put the data without involving the host CPU.
A two-sided message carries only a message id and the data payload; it must be matched against a posted receive on the host CPU to identify the memory address to put the data.
[Figure: a one-sided message (address + data payload) deposited directly into memory by the network interface, vs. a two-sided message (message id + data payload) that requires the host CPU.]
Joint work with Dan Bonachea
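As a rough sketch (mine, not from the talk) of the one-sided style: thread 0 deposits a buffer directly into memory with affinity to thread 1 via upc_memput, and the target thread never posts a matching receive:

    #include <upc.h>
    #include <string.h>

    #define N 1024
    shared [N] char buf[N*THREADS];   /* one contiguous N-byte block per thread */

    int main(void) {
        char local[N];
        memset(local, MYTHREAD, N);

        /* one-sided put: the target's CPU is not involved in matching the
           transfer (contrast with an MPI_Send/MPI_Recv pair) */
        if (MYTHREAD == 0 && THREADS > 1)
            upc_memput(&buf[N], local, N);   /* &buf[N] has affinity to thread 1 */

        upc_barrier;   /* ensure delivery before the target reads the data */
        return 0;
    }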
[Figure: bandwidth (MB/s) vs. transfer size (10 B to 1 MB) for non-blocking GASNet put vs. MPI flood bandwidth, together with the relative bandwidth (GASNet/MPI), which ranges from about 1.0 to 2.4 over sizes from 10 B to 10 MB.]
Joint work with Paul Hargrove and Dan Bonachea
Measured on the NERSC Jacquard machine with Opteron processors (up is good).
[Figure: 8-byte roundtrip latency in microseconds (down is good) on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Federation, comparing MPI ping-pong with GASNet put+sync; measured latencies range from about 4.5 to 24.2 usec, with GASNet lower than MPI on each network.]
Joint work with UPC Group; GASNet design by Dan Bonachea
[Figure: flood bandwidth for 2 MB messages as a percent of hardware peak (up is good), MPI vs. GASNet, on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Federation; bars are annotated with absolute bandwidth in MB/s.]
Joint work with UPC Group; GASNet design by Dan Bonachea
[Figure: flood bandwidth for 4 KB messages as a percent of hardware peak (up is good), MPI vs. GASNet, on the same six networks; bars are annotated with absolute bandwidth in MB/s, with GASNet reaching a noticeably higher fraction of peak than MPI at this size.]
Joint work with UPC Group; GASNet design by Dan Bonachea
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea
Overlapping communication in the NAS FT transpose, at three granularities:
chunk = all rows with the same destination (wait for all rows bound for 1 proc to finish before sending)
slab = all rows in a single plane with the same destination (send as each plane completes)
pencil = 1 row (send each row as soon as its 1-D FFT completes)
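A rough sketch (mine; the layout, indexing, and the blocking upc_memput are simplifications, since the real benchmark uses non-blocking puts) of the pencil strategy, pushing each row to its destination as soon as its 1-D FFT finishes so the transfer overlaps the remaining computation:

    #include <upc.h>
    #include <stddef.h>

    #define NX 64    /* elements per row (illustrative) */
    #define NZ 64    /* rows per thread; assumes THREADS <= NZ */

    /* destination buffer: one contiguous block with affinity to each thread */
    shared [NZ*NX] double xpose[NZ*NX*THREADS];

    /* stub standing in for a local 1-D FFT over one row */
    static void fft_1d_row(double *row, int n) { (void)row; (void)n; }

    void transpose_pencils(double plane[NZ][NX]) {
        int z;
        for (z = 0; z < NZ; z++) {
            fft_1d_row(plane[z], NX);        /* compute one row */
            /* pencil: send this row immediately so it overlaps the next row's FFT */
            int dest = z % THREADS;          /* illustrative destination mapping */
            upc_memput(&xpose[(size_t)dest*NZ*NX + (size_t)MYTHREAD*NX],
                       plane[z], NX * sizeof(double));
        }
        upc_barrier;   /* all rows delivered before the next transform dimension */
    }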
[Figure: best MFlops per thread for the NAS FT benchmark on Myrinet (64 procs), InfiniBand (256), Elan3 (256 and 512), and Elan4 (256 and 512), comparing the best NAS Fortran/MPI version, the best MPI version (always slabs), the best UPC version (always pencils), and a chunk version of NAS FT using FFTW; the UPC pencil version gives the highest rates.]
Code Size in Lines
                        Titanium    C++/F/MPI (Chombo)
    AMR data structures   2000        35000
    AMR operations        1200         6500
    Elliptic PDE solver   1500         4200*
Roughly a 10X reduction in lines of code!
* Somewhat more functionality in the PDE part of the Chombo code
AMR Work by Tong Wen and Philip Colella
[Figure: speedup vs. number of processors (16 to 112) for the Titanium (Ti) and Chombo AMR codes; the two curves track each other closely.]
Comparable parallel performance
Joint work with Tong Wen, Jimmy Su, Phil Colella
Immersed boundary method: fluid velocities are interpolated into the materials (1D or 2D structures) and elastic forces are spread back onto the fluid; the computation couples particles (structures) and a regular mesh (fluid).
Joint work with Ed Givelberg, Armando Solar-Lezama, Charlie Peskin, Dave McQueen
[Figure: 2D Dirac delta function used for the fluid-structure coupling]
Code Size in Lines: Fortran 8000, Titanium 4000
Note: Fortran code is not parallel
[Figure: time per timestep (seconds) vs. number of processors (1 to 128): the original code at 256^3 and 512^3 on Power3/Colony and 512^2x256 on Pentium4/Myrinet, and the automatically optimized version (sphere benchmark, 2006) at 128^3 and 256^3 on Power4/Federation.]
Joint work with Ed Givelberg, Armando Solar-Lezama, Charlie Peskin, Dave McQueen
Dense LU factorization in UPC:
Blocks are 2D block-cyclic distributed.
Panel factorizations involve communication for pivoting.
Matrix-matrix multiplication is used for the trailing-matrix update; these updates can be coalesced.
[Figure: the matrix divided into the completed part of L, the completed part of U, the panel being factored, and the trailing matrix to be updated, with blocks labeled A(i,j), A(i,k), A(j,i), A(j,k).]
Joint work with Parry Husbands and Esmond Ng
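For context, a generic right-looking blocked LU sketch (mine; sequential and unpivoted, unlike the distributed, pivoted UPC code described here) showing where the panel factorization and the trailing matrix-matrix update fit:

    #include <stddef.h>

    /* Right-looking blocked LU without pivoting on a row-major n x n matrix;
       L and U overwrite A. */
    void blocked_lu(double *A, size_t n, size_t nb) {
        size_t k, i, j, c;
        for (k = 0; k < n; k += nb) {
            size_t b = (n - k < nb) ? (n - k) : nb;

            /* 1. panel factorization: unblocked LU of columns k .. k+b-1 */
            for (j = k; j < k + b; j++)
                for (i = j + 1; i < n; i++) {
                    A[i*n + j] /= A[j*n + j];               /* L multiplier */
                    for (c = j + 1; c < k + b; c++)         /* update within the panel */
                        A[i*n + c] -= A[i*n + j] * A[j*n + c];
                }

            /* 2. row-block update: apply the unit lower triangular panel block to A12 */
            for (j = k; j < k + b; j++)
                for (i = j + 1; i < k + b; i++)
                    for (c = k + b; c < n; c++)
                        A[i*n + c] -= A[i*n + j] * A[j*n + c];

            /* 3. trailing-matrix update: A22 -= L21 * U12 (the matrix-matrix multiply) */
            for (i = k + b; i < n; i++)
                for (j = k; j < k + b; j++)
                    for (c = k + b; c < n; c++)
                        A[i*n + c] -= A[i*n + j] * A[j*n + c];
        }
    }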
Joint work with Parry Husbands
[Figure: Linpack performance (GFlop/s), UPC vs. MPI/HPL, on a Cray X1 at 64 and 128 processors, an Opteron cluster at 64 processors, and an SGI Altix at 32 processors; UPC is competitive with MPI/HPL on each platform.]
Joint work with Parry Husbands
UPC vs. ScaLAPACK
[Figure: LU factorization performance (GFlops), ScaLAPACK vs. UPC, on 2x4 and 4x4 processor grids.]
Also used by gcc/upc
Joint work with Titanium and UPC groups
term towards machine trends