Stream Programming: Explicit Parallelism and Locality
Bill Dally, Edge Workshop, May 24, 2006


SLIDE 1

Stream Programming: Explicit Parallelism and Locality

Bill Dally, Edge Workshop, May 24, 2006

SLIDE 2

Outline

  • Technology Constraints
  • Architecture
  • Stream programming
  • Imagine and Merrimac
  • Other stream processors
  • Future directions
SLIDE 3

ILP is mined out – the end of superscalar processors
Time for a new architecture

[Chart: processor performance (ps/instruction, log scale from 1e-4 to 1e+7) versus year, 1980–2020, annotated with growth rates of 52%/year, 74%/year, and 19%/year and ratios of 30:1, 1,000:1, and 30,000:1]

Dally et al., "The Last Classical Computer", ISAT Study, 2001

SLIDE 4

Performance = Parallelism
Efficiency = Locality

SLIDE 5

Arithmetic is cheap, Communication is expensive

  • Arithmetic
    – Can put 100s of FPUs on a chip
    – $0.50/GFLOPS, 50mW/GFLOPS
    – Exploit with parallelism
  • Communication
    – Dominates cost
      • $8/GW/s, 2W/GW/s (off-chip)
    – BW decreases (and cost increases) with distance
    – Power increases with distance
    – Latency increases with distance
      • But can be hidden with parallelism
    – Need locality to conserve global bandwidth

[Figure: 90nm chip, $200, 1GHz; a 64-bit FPU drawn to scale (0.5mm on the 12mm die); a 1-clock region; power increases and bandwidth decreases with distance across the chip]
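
A quick back-of-the-envelope from the figures above (treating one word as a single 64-bit operand, an assumption made here for illustration):

    energy:  2 W per GW/s = 2 nJ/word,  vs. 50 mW per GFLOPS = 50 pJ/FLOP  =>  one off-chip word costs the energy of about 40 FLOPs
    cost:    $8 per GW/s,  vs. $0.50 per GFLOPS                            =>  one off-chip word costs as much as about 16 FLOPs

In other words, a program needs tens of arithmetic operations per off-chip word before the FPUs, rather than the pins, dominate cost and power.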

SLIDE 6

Cost of data access varies by 1000x

From                     Energy    Cost*    Time
Local Register           10 pJ     $0.50    1 ns
Chip Region (2mm)        50 pJ     $2       4 ns
Global on Chip (15mm)    200 pJ    $10      20 ns
Off chip (node mem)      1 nJ      $50      200 ns
Global                   5 nJ      $500     1 us

*Cost of providing 1 GW/s of bandwidth; all numbers approximate

SLIDE 7

So we should build chips that look like this

SLIDE 8

An abstract view

[Diagram: abstract machine view – arithmetic units (A) and registers (R) connected by switches to memories (RM, CM, LM) and, at the bottom, global memory]

SLIDE 9

The real question is how to orchestrate the movement of data

SLIDE 10

Conventional Wisdom: Use caches

[Diagram: the same abstract machine view as on Slide 8]

SLIDE 11

Caches squander bandwidth – our scarce resource

  • Unnecessary data movement
  • Poorly scheduled data movement
    – Idles expensive resources waiting on data
  • More efficient to map programs to an explicit memory hierarchy

SLIDE 12

Example – Simplified Finite-Element Code

loop over cells
    flux[i] = ...
loop over cells
    ... = f(flux[i], ...)

SLIDE 13

Explicitly block into SRF

loop over cells
    flux[i] = ...
loop over cells
    ... = f(flux[i], ...)

Flux passed through SRF, no memory traffic

SLIDE 14

Explicitly block into SRF

loop over cells
    flux[i] = ...
loop over cells
    ... = f(flux[i], ...)

Explicit re-use of Cells, no misses
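
To make the blocking idea concrete, here is a minimal C sketch (not from the talk; Cell, compute_flux, and advance are placeholder names) of strip-mining the two loops so that each block of flux values lives only in a small on-chip buffer standing in for the SRF:

#include <stddef.h>

enum { BLOCK = 1024 };                  /* number of elements that fit in the SRF */

typedef struct { float a, b; } Cell;    /* placeholder element type */

static float compute_flux(const Cell *c)        { return c->a * c->b; }  /* stand-in for kernel 1 */
static float advance(const Cell *c, float flux) { return c->a + flux; }  /* stand-in for kernel 2 */

void process_cells(const Cell *cells, float *out, size_t ncells) {
    float flux[BLOCK];                               /* SRF-resident buffer, never written to DRAM */
    for (size_t base = 0; base < ncells; base += BLOCK) {
        size_t n = ncells - base < BLOCK ? ncells - base : BLOCK;
        for (size_t i = 0; i < n; i++)               /* first loop over cells: produce flux */
            flux[i] = compute_flux(&cells[base + i]);
        for (size_t i = 0; i < n; i++)               /* second loop over cells: consume flux */
            out[base + i] = advance(&cells[base + i], flux[i]);
    }
}

Because the producer and consumer loops are interleaved block by block, the intermediate flux values are reused while still on chip, which is exactly the memory traffic the slide says is eliminated.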

SLIDE 15

Stream loads/stores (bulk operations) hide latency (1000s of words in flight)

[Diagram: Cells gathered from DRAM into SRF buffers, streamed through the LRFs by kernels fn1 and fn2 (with Flux passed between them), and the resulting Cells scattered back to DRAM]
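
A minimal C sketch of how such bulk transfers hide latency through double buffering (the gather functions below are synchronous stand-ins for an asynchronous DMA; none of this is the talk's actual API): while one block is processed out of one SRF buffer, the gather for the next block is already in flight into the other.

#include <stddef.h>
#include <string.h>

enum { BLOCK = 1024 };

static float dram[1u << 20];                       /* pretend off-chip memory */

/* Stand-ins for an asynchronous bulk gather; a real stream memory system
   would issue a DMA here and let it run in the background. */
static void start_gather(float *dst, size_t first, size_t count) {
    memcpy(dst, dram + first, count * sizeof(float));
}
static void wait_gather(const float *dst) { (void)dst; }

static void process_block(const float *in, float *out, size_t count) {
    for (size_t i = 0; i < count; i++) out[i] = in[i] * 2.0f;   /* placeholder kernel */
}

/* Double buffering; for brevity, assumes ncells is a multiple of BLOCK. */
void stream_blocks(float *out, size_t ncells) {
    static float buf[2][BLOCK];                    /* two SRF-resident buffers */
    start_gather(buf[0], 0, BLOCK);                /* prefetch the first block */
    for (size_t base = 0; base < ncells; base += BLOCK) {
        size_t cur = (base / BLOCK) & 1;
        if (base + BLOCK < ncells)                 /* overlap: start the next gather */
            start_gather(buf[cur ^ 1], base + BLOCK, BLOCK);
        wait_gather(buf[cur]);                     /* current block is now on chip */
        process_block(buf[cur], out + base, BLOCK);
    }
}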

SLIDE 16

Explicit storage enables simple, efficient execution

All needed data and instructions are on chip – no misses

SLIDE 17

Caches lack predictability (controlled via a “wet noodle”)

SLIDE 18

Caches are controlled via a “wet noodle”

Even at a 99% hit rate, each miss costs 100s of cycles – 10,000s of ops
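
A back-of-the-envelope sketch in C (the 1-cycle hit and 300-cycle miss penalties are illustrative, not from the talk) of why even a high hit rate leaves performance hostage to misses:

/* Average memory access time (in cycles) for a given hit rate. */
static double amat(double hit_rate) {
    const double hit_cycles  = 1.0;     /* illustrative */
    const double miss_cycles = 300.0;   /* illustrative */
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles;
}

/* amat(0.99) is about 4 cycles: a 99% hit rate still roughly quadruples the
   average access time, and each individual miss stalls the machine for
   100s of cycles, i.e. 10,000s of operation slots on a machine with tens of FPUs. */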

SLIDE 19

So how do we program an explicit hierarchy?

SLIDE 20

Stream Programming: Parallelism, Locality, and Predictability

  • Parallelism
    – Data parallelism across stream elements
    – Task parallelism across kernels
    – ILP within kernels
  • Locality
    – Producer/consumer
    – Within kernels
  • Predictability
    – Enables scheduling

[Diagram: a stream program represented as a graph of kernels K1–K4]

SLIDE 21

Evolution of Stream Programming

1997 – StreamC/KernelC
  • Break programs into kernels
  • Kernels operate only on input/output streams and locals
  • Communication scheduling and stream scheduling

2001 – Brook
  • Continues the construct of streams and kernels
  • Hides underlying details
  • Too "one-dimensional"

2005 – Sequoia
  • Generalizes kernels to "tasks"
  • Tasks operate on local data
  • Local data "gathered" in an arbitrary way
  • "Inner" tasks subdivide, "leaf" tasks compute
  • Machine-specific details factored out

SLIDE 22

StreamC/KernelC

[Diagram: stereo depth extraction – Image 0 and Image 1 each pass through two convolve kernels; SAD then produces the Depth Map]

STREAMPROG(depth) {
  im_stream<pixels> in, tmp;
  …
  for (i=0; i<rows; i++) {
    convolve(in, tmp, …);
    convolve(tmp, conv_row, …);
  }
  …
  for (i=0; i<rows; i++) {
    SAD(conv_row, depth_row, …);
  }
  …
}

KERNEL convolve(istream<int> a,
                ostream<int> y) {
  …
  loop_stream(a) {
    int ai, out;
    a >> ai;
    …
    out = dotproduct(ai, …);
    y << out;
  }
}

SLIDE 23

Explicit storage enables simple, efficient execution unit scheduling

[Figure: software-pipelined schedule of the ComputeCellInt kernel from StreamFEM3D, one iteration of the software pipeline shown]

  • Over 95% of peak with simple hardware
  • Depends on explicit communication to make delays predictable

SLIDE 24

Stream scheduling exploits explicit storage to reduce bandwidth demand

StreamFEM application

Prefetching, reuse, use/def, limited spilling

[Diagram: StreamFEM dataflow – kernels (Compute Flux States, Compute Numerical Flux, Gather Cell, Compute Cell Interior, Advance Cell) connected by streams (Element Faces, Gathered Elements, Numerical Flux, Elements (Current), Elements (New)), plus read-only table lookup data (Master Element), Face Geometry, Cell Orientations, and Cell Geometry]

SLIDE 25

Sequoia – Generalize Kernels into Leaf Tasks

  • Perform actual computation
  • Analogous to kernels
  • "Small" working set

void __task matmul::leaf( __in    float A[M][P],
                          __in    float B[P][N],
                          __inout float C[M][N] )
{
  for (int i=0; i<M; i++) {
    for (int j=0; j<N; j++) {
      for (int k=0; k<P; k++) {
        C[i][j] += A[i][k] * B[k][j];
      }
    }
  }
}

[Diagram: the matmul leaf task runs on a FU against its local store (LS 0 … LS 7), below the aggregate LS and node memory]

SLIDE 26

Inner tasks

  • Decompose to smaller subtasks
    – Recursively
  • "Larger" working sets

void __task matmul::inner( __in    float A[M][P],
                           __in    float B[P][N],
                           __inout float C[M][N] )
{
  tunable unsigned int U, X, V;
  blkset Ablks = rchop(A, U, X);
  blkset Bblks = rchop(B, X, V);
  blkset Cblks = rchop(C, U, V);
  mappar (int i=0 to M/U, int j=0 to N/V)
    mapreduce (int k=0 to P/X)
      matmul(Ablks[i][k], Bblks[k][j], Cblks[i][j]);
}

[Diagram: the matmul inner task runs at the aggregate-LS / node-memory level and maps matmul leaf tasks onto the FUs and their local stores (LS 0 … LS 7)]
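
For readers unfamiliar with Sequoia's rchop/mappar/mapreduce constructs, here is a rough plain-C sketch (an illustration only, not Sequoia's actual semantics) of the blocked decomposition the inner task expresses, with fixed block sizes standing in for the tunables U, X, V:

enum { M = 512, N = 512, P = 512 };      /* problem size (illustrative) */
enum { U = 64,  V = 64,  X = 64  };      /* fixed block sizes standing in for the tunables */

/* "Leaf": multiply a U x X block of A by an X x V block of B into a
   U x V block of C. In Sequoia these blocks would live in local store;
   here the leaf simply hard-codes the parent row strides P and N. */
static void matmul_leaf(const float *A, const float *B, float *C) {
    for (int i = 0; i < U; i++)
        for (int j = 0; j < V; j++)
            for (int k = 0; k < X; k++)
                C[i * N + j] += A[i * P + k] * B[k * N + j];
}

/* "Inner": iterate over block indices (mappar over i and j, mapreduce
   over k) and invoke the leaf on each block triple. */
void matmul_inner(const float A[M][P], const float B[P][N], float C[M][N]) {
    for (int bi = 0; bi < M / U; bi++)
        for (int bj = 0; bj < N / V; bj++)
            for (int bk = 0; bk < P / X; bk++)
                matmul_leaf(&A[bi * U][bk * X],
                            &B[bk * X][bj * V],
                            &C[bi * U][bj * V]);
}

Sequoia differs in that the block sizes are tunables chosen per machine and, as the slides note, tasks operate on local data gathered into the level below rather than on views of the parent arrays.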

SLIDE 27

Stream processors make communication explicit, which enables optimization

SLIDE 28

Stream architecture makes communication explicit – exploits parallelism and locality

[Diagram: the bandwidth hierarchy – LRFs feeding ALUs within a cluster (CL switch), SRF lanes (SW), a memory switch (M SW), cache banks, and chip pins and router out to DRAM banks, with relative wire/switch costs labeled roughly 100, 1k, and 10k and one or more chip crossings to memory. ALU and cluster arrays are drawn 1D here but may be laid out as 2D arrays.]
SLIDE 29

Imagine VLSI Implementation

  • Chip details
    – 2.56 cm² die, 0.15 µm process, 21M transistors, 792-pin BGA
    – Collaboration with TI ASIC
    – Chips arrived on April 1, 2002
  • Dual-Imagine test board

SLIDE 30

Application Performance (cont.)

[Chart: execution-time breakdown (0–100%) for DEPTH, MPEG, QRD, RTSL, and their average, with categories including host bandwidth stalls, stream controller overhead, memory stalls, cluster stalls, kernel non-main-loop time, and kernel main-loop operations]
SLIDE 31

Applications match the bandwidth hierarchy

[Chart: bandwidth (GB/s, log scale from 0.1 to 1000) delivered at the LRF, SRF, and DRAM levels for Peak, DEPTH, MPEG, QRD, and RTSL]

SLIDE 32

Merrimac – Streaming Supercomputer

Scalable from a 2-TFLOP workstation to a 2-PFLOP supercomputer

[Diagram: Merrimac system hierarchy – a node pairs a 64-FPU, 128-GFLOPS stream processor with 16 XDR-DRAM chips (2 GBytes); a board carries 16 nodes (1K FPUs, 2 TFLOPS, 32 GBytes); a backplane carries 32 boards (512 nodes, 32K FPUs, 64 TFLOPS, 1 TByte); on-board, intra-cabinet (6" Teradyne GbX), and inter-cabinet (ribbon fiber, E/O and O/E conversion) networks; bandwidth figures shown include 24 TBytes/s (bisection), 768 GBytes/s (2K+2K links), 64 GBytes/s, 48 GBytes/s (128+128 pairs), and 12 GBytes/s (32+32 pairs)]

SLIDE 33

Merrimac Application Results

Simulated on a machine with 64 GFLOPS peak performance and no fused MADD.
*The low numbers are a result of many divide and square-root operations.

Application                      Sustained GFLOPS   FP Ops / Mem Ref   LRF Refs         SRF Refs       Mem Refs
StreamFEM3D (Euler, quadratic)   31.6               17.1               153.0M (95.0%)   6.3M (3.9%)    1.8M (1.1%)
StreamFEM3D (MHD, constant)      39.2               13.8               186.5M (99.4%)   7.7M (0.4%)    2.8M (0.2%)
StreamMD (grid algorithm)        14.2*              12.1*              90.2M (97.5%)    1.6M (1.7%)    0.7M (0.8%)
StreamFLO                        12.9*              7.4*               234.3M (95.7%)   7.2M (2.9%)    3.4M (1.4%)
GROMACS                          38.8*              9.7*               108M (95.0%)     4.2M (2.9%)    1.5M (1.3%)

Applications achieve high performance and make good use of the bandwidth hierarchy

SLIDE 34

Other Stream Processors

SLIDE 35

Other Stream Processors

  • ClearSpeed CSX600: 96 GFLOP/s, 10 W
  • STI Cell: ~200 GFLOP/s, 100 W
  • GPUs: 50–100 GFLOP/s, 10–30 W

SLIDE 36

Other Stream Processors

  • Technology pushing many to build stream processors
  • GPUs (Nvidia, ATI), Game Processors (Cell), Physics Processors (Ageia), Accelerators (ClearSpeed)
  • Many (10s–100s) of FPUs
  • Distributed local storage
  • Latency hiding on access to external memory
    – Block access or deeply multithreaded
  • All benefit from stream programming
    – But the right architecture makes it easier and more efficient

SLIDE 37

Architecture Issues

  • On-chip memory
    – Read and write access to on-chip storage
      • Producer-consumer locality demands write access
    – Data movement between on-chip memories
      • Without going off chip
  • Off-chip memory
    – No substitute for bandwidth
    – Efficient gather and scatter required

SLIDE 38

What’s Next?

SLIDE 39

ILP is mined out – the end of superscalar processors
Time for a new architecture

[Chart: processor performance (ps/instruction, log scale from 1e-4 to 1e+7) versus year, 1980–2020, annotated with growth rates of 52%/year, 74%/year, and 19%/year and ratios of 30:1, 1,000:1, and 30,000:1]

Dally et al., "The Last Classical Computer", ISAT Study, 2001

SLIDE 40

Computing landscape is changing

  • Many function units
  • Deep, distributed storage hierarchy
  • Communication limited
  • Research is needed to understand how to architect and program these processors
  • Not an incremental fix:
    – Fundamental rethinking of basic architecture, programming model, and compilers is required

SLIDE 41

Software Topics

  • Exposed Communication Programming Models
    – Abstract storage hierarchy and communication costs
    – Portable codes with predictable performance
  • Compiling bulk operations
    – Strategic (vs tactical) program reorganization
    – Scheduling bulk data transfers
    – Size and shape of blocking
    – Irregular computations
      • Localization of shared neighbors
    – Variable-size results
  • Applications
    – Communication-efficient algorithms

SLIDE 42

Hardware Topics

  • Close the efficiency gap with hard-wired engines
    – Gap is 10–100x today (10x for stream processors)
    – Efficient data movement is the first step
    – Other overheads remain to be removed
  • Storage hierarchies that can be abstracted
  • Balancing parallelism: ILP x DLP x TLP
  • On-chip networks
    – To connect within and between levels of the hierarchy
  • Communication and synchronization mechanisms
    – Drives granularity, which in turn determines available parallelism
  • Mechanisms for reuse of irregular data
SLIDE 43

Summary

  • Communication is expensive, arithmetic is cheap
    – Parallelism to exploit arithmetic
    – Locality to conserve bandwidth
  • Architectures evolving toward a deep, broad storage hierarchy
    – Storage to hide latency, cover bandwidth taper
    – Stream processors > 10x efficiency of conventional CPUs
  • Explicitly manage this hierarchy
    – Makes efficient use of scarce, expensive resources
    – Enables optimization
  • Generalized stream programming
    – Bulk operations: data movement and kernels
    – Parallelism, locality, and predictability