Edge: 1 May 24, 2006
Stream Programming: Explicit Parallelism and Locality
Bill Dally
Edge Workshop, May 24, 2006
Edge: 2 May 24, 2006
Outline
- Technology constraints
- Architecture
- Stream programming
- Imagine and Merrimac
- Other stream processors
- Future directions
Edge: 3 May 24, 2006
ILP is mined out – the end of superscalar processors. Time for a new architecture.
[Chart: processor performance (ps/Inst, log scale) vs. year, 1980–2020, with growth rates of 52%/year, 74%/year, and 19%/year and projected gaps of 30:1, 1,000:1, and 30,000:1]
Dally et al., "The Last Classical Computer", ISAT Study, 2001
Edge: 4 May 24, 2006
Performance = Parallelism
Efficiency = Locality
Edge: 5 May 24, 2006
Arithmetic is cheap, Communication is expensive
- Arithmetic
– Can put 100s of FPUs on a chip
– $0.50/GFLOPS, 50mW/GFLOPS
– Exploit with parallelism
- Communication
– Dominates cost
  - $8/GW/s, 2W/GW/s (off-chip)
– BW decreases (and cost increases) with distance
– Power increases with distance
– Latency increases with distance
  - But can be hidden with parallelism
– Need locality to conserve global bandwidth
[Figure: 90nm chip, $200, 1GHz – a 64-bit FPU drawn to scale (0.5mm) on a 12mm die, with a 1-clock region marked; power increases and bandwidth decreases with distance across the chip]
Edge: 6 May 24, 2006
Cost of data access varies by 1000x
From                    Time    Cost*   Energy
Local Register          1ns     $0.50   10pJ
Chip Region (2mm)       4ns     $2      50pJ
Global on Chip (15mm)   20ns    $10     200pJ
Off chip (node mem)     200ns   $50     1nJ
Global                  1us     $500    5nJ

*Cost of providing 1 GW/s of bandwidth. All numbers approximate.
Edge: 7 May 24, 2006
So we should build chips that look like this
Edge: 8 May 24, 2006
An abstract view
[Diagram: abstract machine – ALUs (A), each with registers (R), grouped behind switches; groups share register memories (RM); further switches connect to CM and LM and up to global memory]
Edge: 9 May 24, 2006
Real question is: How to orchestrate movement of data
Edge: 10 May 24, 2006
Conventional Wisdom: Use caches
[Diagram: the same abstract hierarchy of ALUs, registers, and memories as above]
Edge: 11 May 24, 2006
Caches squander bandwidth – our scarce resource
- Unnecessary data movement
- Poorly scheduled data movement
– Idles expensive resources waiting on data
- More efficient to map programs to an explicit memory hierarchy
Edge: 12 May 24, 2006
Example – Simplified Finite-Element Code
loop over cells
    flux[i] = ...
loop over cells
    ... = f(flux[i], ...)
Edge: 13 May 24, 2006
Explicitly block into SRF
loop over cells
    flux[i] = ...
loop over cells
    ... = f(flux[i], ...)
Flux passed through SRF, no memory traffic
Edge: 14 May 24, 2006
Explicitly block into SRF
loop over cells
    flux[i] = ...
loop over cells
    ... = f(flux[i], ...)
Explicit re-use of cells, no misses (see the sketch below)
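To make the blocking concrete, here is a minimal C++ sketch of the same pattern on a conventional machine (illustrative, not the original StreamC code): the two loops over cells are strip-mined so each block of flux lives in a small buffer that stands in for the SRF and never travels to memory between producer and consumer. The kernel bodies are placeholders.

// Minimal sketch of explicit SRF-style blocking (illustrative, not StreamC).
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t BLOCK = 1024;  // assumed SRF-sized block of cells

void run(std::vector<double>& cells) {
    double flux[BLOCK];  // stands in for the SRF-resident flux stream
    for (std::size_t base = 0; base < cells.size(); base += BLOCK) {
        std::size_t len = std::min<std::size_t>(BLOCK, cells.size() - base);
        // producer: loop over cells, flux[i] = ...
        for (std::size_t i = 0; i < len; ++i)
            flux[i] = 0.5 * cells[base + i];   // placeholder flux computation
        // consumer: loop over cells, ... = f(flux[i], ...)
        for (std::size_t i = 0; i < len; ++i)
            cells[base + i] += flux[i];        // placeholder update
    }
}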
Edge: 15 May 24, 2006
Stream loads/stores (bulk operations) hide latency (1000s of words in flight)
[Diagram: cells gathered from DRAM into the SRF, processed through the LRFs by kernels fn1 and fn2 (producing flux), and the resulting cells scattered back to DRAM]
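The bulk-transfer idea can be sketched in ordinary C++ as double buffering: while the current block is processed, the next block is gathered in the background. gather() and process() below are illustrative placeholders, not Imagine or Merrimac APIs.

// Double-buffered bulk transfer sketch (illustrative only).
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

constexpr std::size_t BLOCK = 4096;  // assumed bulk-transfer size (1000s of words)

std::vector<double> gather(const std::vector<double>& mem, std::size_t base) {
    std::size_t len = std::min(BLOCK, mem.size() - base);
    return {mem.begin() + base, mem.begin() + base + len};  // bulk load of one block
}

void process(std::vector<double>& block) {
    for (double& x : block) x = 2.0 * x + 1.0;  // placeholder kernel
}

void stream_all(std::vector<double>& mem) {
    auto next = std::async(std::launch::async, gather, std::cref(mem), std::size_t{0});
    for (std::size_t base = 0; base < mem.size(); base += BLOCK) {
        std::vector<double> cur = next.get();                   // wait for the prefetched block
        std::size_t nb = base + BLOCK;
        if (nb < mem.size())                                    // start gathering the next block
            next = std::async(std::launch::async, gather, std::cref(mem), nb);
        process(cur);                                           // compute overlaps the gather
        std::copy(cur.begin(), cur.end(), mem.begin() + base);  // scatter results back
    }
}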
Edge: 16 May 24, 2006
Explicit storage enables simple, efficient execution
All needed data and instructions on-chip – no misses
Edge: 17 May 24, 2006
Caches lack predictability (controlled via a “wet noodle”)
Edge: 18 May 24, 2006
Caches are controlled via a “wet noodle”
Even with a 99% hit rate, a single miss costs 100s of cycles – 10,000s of operations
Edge: 19 May 24, 2006
So how do we program an explicit hierarchy?
Edge: 20 May 24, 2006
Stream Programming: Parallelism, Locality, and Predictability
- Parallelism
– Data parallelism across stream elements
– Task parallelism across kernels
– ILP within kernels
- Locality
– Producer/consumer
– Within kernels
- Predictability
– Enables scheduling
[Diagram: kernels K1, K2, K3, K4 connected in a stream graph]
Edge: 21 May 24, 2006
Evolution of Stream Programming
- 1997: StreamC/KernelC
– Break programs into kernels
– Kernels operate only on input/output streams and locals
– Communication scheduling and stream scheduling
- 2001: Brook
– Continues the construct of streams and kernels
– Hides underlying details
– Too "one-dimensional"
- 2005: Sequoia
– Generalizes kernels to "tasks"
– Tasks operate on local data
– Local data "gathered" in an arbitrary way
– "Inner" tasks subdivide, "leaf" tasks compute
– Machine-specific details factored out
Edge: 22 May 24, 2006
StreamC/KernelC
[Diagram: stereo depth extraction pipeline – Image 0 and Image 1 each pass through two convolve kernels; an SAD kernel then produces the depth map]
STREAMPROG(depth) {
  im_stream<pixels> in, tmp;
  …
  for (i=0; i<rows; i++) {
    convolve(in, tmp, …);
    convolve(tmp, conv_row, …);
  }
  …
  for (i=0; i<rows; i++) {
    SAD(conv_row, depth_row, …);
  }
  …
}

KERNEL convolve(istream<int> a,
                ostream<int> y) {
  …
  loop_stream(a) {
    int ai, out;
    a >> ai;
    …
    out = dotproduct(ai, …);
    y << out;
  }
}
Edge: 23 May 24, 2006
Explicit storage enables simple, efficient execution unit scheduling
[Figure: schedule of the ComputeCellInt kernel showing one iteration and the software-pipelined loop, cycles 10–120]
- ComputeCellInt kernel from StreamFEM3D
- Over 95% of peak with simple hardware
- Depends on explicit communication to make delays predictable
Edge: 24 May 24, 2006
Stream scheduling exploits explicit storage to reduce bandwidth demand
- StreamFEM application
- Prefetching, reuse, use/def, limited spilling
[Diagram: StreamFEM dataflow – kernels Compute Flux States, Compute Numerical Flux, Gather Cell, Compute Cell Interior, and Advance Cell connected by streams (Element Faces, Gathered Elements, Numerical Flux, Elements (Current), Elements (New)), with read-only table-lookup data (Master Element, Face Geometry, Cell Orientations, Cell Geometry)]
Edge: 25 May 24, 2006
Sequoia – Generalize Kernels into Leaf Tasks
- Perform actual computation
- Analogous to kernels
- "Small" working set

void __task matmul::leaf( __in    float A[M][P],
                          __in    float B[P][N],
                          __inout float C[M][N] )
{
  for (int i=0; i<M; i++) {
    for (int j=0; j<N; j++) {
      for (int k=0; k<P; k++) {
        C[i][j] += A[i][k] * B[k][j];
      }
    }
  }
}

[Diagram: the matmul leaf task runs on a single FU against its local store (LS 0 … LS 7), below the aggregate LS and node memory]
Edge: 26 May 24, 2006
Inner tasks
- Decompose to smaller subtasks
– Recursively
- "Larger" working sets

void __task matmul::inner( __in    float A[M][P],
                           __in    float B[P][N],
                           __inout float C[M][N] )
{
  tunable unsigned int U, X, V;
  blkset Ablks = rchop(A, U, X);
  blkset Bblks = rchop(B, X, V);
  blkset Cblks = rchop(C, U, V);
  mappar (int i=0 to M/U, int j=0 to N/V)
    mapreduce (int k=0 to P/X)
      matmul(Ablks[i][k], Bblks[k][j], Cblks[i][j]);
}

[Diagram: the matmul inner task operates at the aggregate LS / node memory level and maps matmul leaf tasks onto the FUs and their local stores (LS 0 … LS 7)]
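For comparison, here is roughly what the inner/leaf decomposition expresses, written as plain blocked C++ (an illustrative sketch, not Sequoia itself): U, X, V play the role of the tunables, and mappar/mapreduce become ordinary loops over blocks.

// Blocked matmul sketch mirroring the inner/leaf split (illustrative only).
#include <cstddef>

constexpr std::size_t M = 512, N = 512, P = 512;
constexpr std::size_t U = 64, X = 64, V = 64;   // assumed block sizes; chosen to divide M, N, P

// "leaf": compute on one U x V block of C that fits in local storage
void matmul_leaf(const float* A, const float* B, float* C,
                 std::size_t lda, std::size_t ldb, std::size_t ldc) {
    for (std::size_t i = 0; i < U; ++i)
        for (std::size_t j = 0; j < V; ++j)
            for (std::size_t k = 0; k < X; ++k)
                C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
}

// "inner": decompose the full problem into blocks and map leaf tasks over them
// (C is accumulated into, so the caller is assumed to have zeroed it)
void matmul_inner(const float A[M][P], const float B[P][N], float C[M][N]) {
    for (std::size_t i = 0; i < M; i += U)          // mappar over C row-blocks
        for (std::size_t j = 0; j < N; j += V)      //   and C column-blocks
            for (std::size_t k = 0; k < P; k += X)  // mapreduce over the K dimension
                matmul_leaf(&A[i][k], &B[k][j], &C[i][j], P, N, N);
}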
Edge: 27 May 24, 2006
Stream Processors make communication explicit
Enables optimization
Edge: 28 May 24, 2006
Stream architecture makes communication explicit – exploits parallelism and locality
[Diagram: stream architecture – lanes, each with LRFs, clusters (CL), a switch (SW), and an SRF bank; a memory switch (M SW) connects the lanes to cache banks, chip pins and router, and DRAM banks; approximate distances: 100χ wire, 1kχ switch, 10kχ switch, chip crossing(s). ALU and cluster arrays shown 1D here may be laid out as 2D arrays.]
Edge: 29 May 24, 2006
Imagine VLSI Implementation
- Chip details
– 2.56cm² die, 0.15um process, 21M transistors, 792-pin BGA
– Collaboration with TI ASIC
– Chips arrived on April 1, 2002
- Dual-Imagine test board
Edge: 30 May 24, 2006
Application Performance (cont.)
[Chart: breakdown of execution time (0–100%) for DEPTH, MPEG, QRD, RTSL, and the average – host bandwidth stalls, stream controller overhead, memory stalls, cluster stalls, kernel non-main-loop overhead, and kernel main-loop operations]
Edge: 31 May 24, 2006
Applications match the bandwidth hierarchy
[Chart: bandwidth (GB/s, log scale from 0.1 to 1000) at the LRF, SRF, and DRAM levels for Peak, DEPTH, MPEG, QRD, and RTSL]
Edge: 32 May 24, 2006
Merrimac – Streaming Supercomputer
Scalable from a 2-TFLOP workstation to a 2-PFLOP supercomputer

[Diagram: Merrimac packaging hierarchy – node: a stream processor (64 FPUs, 128 GFLOPS) with 16 x XDR-DRAM (2 GBytes); board: 16 nodes (1K FPUs, 2 TFLOPS, 32 GBytes) on an on-board network; backplane: 32 boards (512 nodes, 32K FPUs, 64 TFLOPS, 1 TBytes) on an intra-cabinet network; 32 backplanes joined by an inter-cabinet network over ribbon fiber with E/O and O/E conversion; link bandwidths of 64 GBytes/s, 12 GBytes/s (32+32 pairs), 48 GBytes/s (128+128 pairs, 6" Teradyne GbX), and 768 GBytes/s (2K+2K links); bisection bandwidth 24 TBytes/s]
Edge: 33 May 24, 2006
Merrimac Application Results
Application                     Sustained GFLOPS   FP Ops / Mem Ref   LRF Refs         SRF Refs      Mem Refs
StreamFEM3D (Euler, quadratic)  31.6               17.1               153.0M (95.0%)   6.3M (3.9%)   1.8M (1.1%)
StreamFEM3D (MHD, constant)     39.2               13.8               186.5M (99.4%)   7.7M (0.4%)   2.8M (0.2%)
StreamMD (grid algorithm)       14.2*              12.1*              90.2M (97.5%)    1.6M (1.7%)   0.7M (0.8%)
StreamFLO                       12.9*              7.4*               234.3M (95.7%)   7.2M (2.9%)   3.4M (1.4%)
GROMACS                         38.8*              9.7*               108M (95.0%)     4.2M (2.9%)   1.5M (1.3%)

Simulated on a machine with 64 GFLOPS peak performance and no fused MADD.
* The low numbers are a result of many divide and square-root operations.

Applications achieve high performance and make good use of the bandwidth hierarchy.
Edge: 34 May 24, 2006
Other Stream Processors
Edge: 35 May 24, 2006
Other Stream Processors
- ClearSpeed CSX600: 96 GFLOP/s, 10W
- STI Cell: ~200 GFLOP/s, 100W
- GPUs: 50-100 GFLOP/s, 10-30W
Edge: 36 May 24, 2006
Other Stream Processors
- Technology pushing many to build stream processors
– GPUs (Nvidia, ATI), Game Processors (Cell), Physics Processors (Ageia), Accelerators (ClearSpeed)
- Many (10s-100s) of FPUs
- Distributed local storage
- Latency hiding on access to external memory
– Block access or deeply multithreaded
- All benefit from stream programming
– But the right architecture makes it easier and more efficient
Edge: 37 May 24, 2006
Architecture Issues
- On-chip memory
– Read and write access to on-chip storage
- Producer-consumer locality demands write access
– Data movement between on-chip memories
- Without going off chip
- Off-chip memory
– No substitute for bandwidth
– Efficient gather and scatter required (sketched below)
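A minimal sketch of what gather and scatter mean here, in plain C++ (illustrative; not a specific memory-system API): bulk indexed loads into, and indexed stores out of, an on-chip buffer.

// Gather/scatter sketch (illustrative only).
#include <cstddef>
#include <vector>

// gather: buf[i] = mem[idx[i]]
void gather(const std::vector<double>& mem, const std::vector<std::size_t>& idx,
            std::vector<double>& buf) {
    buf.resize(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i) buf[i] = mem[idx[i]];
}

// scatter: mem[idx[i]] = buf[i]
void scatter(std::vector<double>& mem, const std::vector<std::size_t>& idx,
             const std::vector<double>& buf) {
    for (std::size_t i = 0; i < idx.size(); ++i) mem[idx[i]] = buf[i];
}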
Edge: 38 May 24, 2006
What’s Next?
Edge: 39 May 24, 2006
ILP is mined out – the end of superscalar processors. Time for a new architecture.
[Chart: processor performance (ps/Inst, log scale) vs. year, 1980–2020, with growth rates of 52%/year, 74%/year, and 19%/year and projected gaps of 30:1, 1,000:1, and 30,000:1]
Dally et al., "The Last Classical Computer", ISAT Study, 2001
Edge: 40 May 24, 2006
Computing landscape is changing
- Many function units
- Deep, distributed storage hierarchy
- Communication limited
- Research is needed to understand how to architect and program these processors
- Not an incremental fix:
– Fundamental rethinking of basic architecture, programming model, and compilers is required
Edge: 41 May 24, 2006
Software Topics
- Exposed Communication Programming Models
– Abstract storage hierarchy and communication costs
– Portable codes with predictable performance
- Compiling bulk operations
– Strategic (vs. tactical) program reorganization
– Scheduling bulk data transfers
– Size and shape of blocking
– Irregular computations
  - Localization of shared neighbors
  - Variable-size results
- Applications
– Communication-efficient algorithms
Edge: 42 May 24, 2006
Hardware Topics
- Close efficiency gap with hard-wired engines
– Gap is 10-100x today (10 for stream processors)
– Efficient data movement is the first step
– Other overheads remain to be removed
- Storage hierarchies that can be abstracted
- Balancing parallelism: ILP x DLP x TLP
- On-chip networks
– To connect within and between levels of the hierarchy
- Communication and synchronization mechanisms
– Drives granularity – which in turn determines available parallelism
- Mechanisms for reuse of irregular data
Edge: 43 May 24, 2006
Summary
- Communication is expensive, arithmetic is cheap
– Parallelism to exploit arithmetic
– Locality to conserve bandwidth
- Architectures evolving toward a deep, broad storage hierarchy
– Storage to hide latency, cover bandwidth taper
– Stream processors > 10x efficiency of conventional CPUs
- Explicitly manage this hierarchy
– Makes efficient use of scarce, expensive resources
– Enables optimization
- Generalized Stream programming