

Slide 1: WaveScalar

Slide 2: Good old days

Slide 3: Good old days ended in Nov. 2002

- Complexity
- Clock scaling
- Area scaling

Slide 4: Chip MultiProcessors

- Low complexity
- Scalable
- Fast

Slide 5: CMP Problems

- Hard to program
- Not practical to scale
  - Only ~8 threads
  - Inflexible allocation
    - Tile = allocation
  - Thread parallelism only

Slide 6: What is WaveScalar?

- WaveScalar is a new, scalable, highly parallel processor architecture
- Not a CMP
  - Different algorithm for executing programs
  - Different hardware organization

Slide 7: WaveScalar Outline

- Dataflow execution model
- Hardware design
- Evaluation
- Exploiting dataflow features
- Beyond WaveScalar: Future work

Slide 8: Execution Models: Von Neumann

- Von Neumann (CMP)
  - Program counter
  - Centralized
  - Sequential

Slide 9: Execution Model: Dataflow

- Not a new idea [Dennis, ISCA'75]
- Programs are dataflow graphs
- Instructions fire when data arrives
  - Instructions act independently
  - All ready instructions can fire at once
  - Massive parallelism
- Where are the dataflow machines?

[Figure: a '+' node fires when its two input tokens (2 and 2) arrive, producing 4]
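To make the firing rule concrete, here is a minimal C sketch of a single dataflow node, mirroring the 2 + 2 example in the figure; the types and names are invented for illustration and are not part of any WaveScalar toolchain.

#include <stdio.h>

/* One dataflow node: it fires as soon as both operands have arrived;
 * no program counter decides when. Invented for illustration. */
typedef struct {
    const char *op;   /* operation name, e.g. "+" */
    int inputs[2];    /* operand values received so far */
    int arrived;      /* count of operands that have arrived */
} Instr;

/* Deliver one operand token; fire when the node has both. */
static void deliver(Instr *n, int value) {
    n->inputs[n->arrived++] = value;
    if (n->arrived == 2)
        printf("%s fires: %d\n", n->op, n->inputs[0] + n->inputs[1]);
}

int main(void) {
    Instr add = { "+", { 0, 0 }, 0 };
    deliver(&add, 2);   /* first token arrives: nothing fires yet */
    deliver(&add, 2);   /* second token arrives: prints "+ fires: 4" */
    return 0;
}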

Slide 10: Von Neumann example

A[j + i*i] = i;
b = A[i*j];

Mul   t1 ← i, j
Mul   t2 ← i, i
Add   t3 ← A, t1
Add   t4 ← j, t2
Add   t5 ← A, t4
Store (t5) ← i
Load  b ← (t3)
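The same computation rendered directly in C, with the compiler's temporaries made explicit to show the fixed sequential order; the wrapper function is added for illustration.

/* Sequential (von Neumann) semantics of the two statements above,
 * with temporaries t1-t5 matching the slide's pseudo-assembly. */
void example(int *A, int i, int j, int *b) {
    int t1 = i * j;      /* Mul t1 <- i, j   */
    int t2 = i * i;      /* Mul t2 <- i, i   */
    int *t3 = A + t1;    /* Add t3 <- A, t1  */
    int t4 = j + t2;     /* Add t4 <- j, t2  */
    int *t5 = A + t4;    /* Add t5 <- A, t4  */
    *t5 = i;             /* Store (t5) <- i  */
    *b = *t3;            /* Load b <- (t3)   */
}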

Slides 11-16: Dataflow example (animation)

A[j + i*i] = i;
b = A[i*j];

[Figure, animated across these slides: the dataflow graph for the two statements. Inputs i, j, and A feed two '*' nodes and three '+' nodes; one adder chain computes the Store address A + (j + i*i), the other computes the Load address A + i*j, producing b.]

Slide 17: Dataflow's Achilles' heel

- No ordering for memory operations
- No imperative languages (C, C++, Java)
- Designers relied on functional languages instead

To be useful, WaveScalar must solve the dataflow memory ordering problem.

Slide 18: WaveScalar's solution

- Order memory operations
- Just enough ordering
- Preserve parallelism

[Figure: the example dataflow graph with its Load and Store highlighted]

Slide 19: Wave-ordered memory

- Compiler annotates memory operations
- Send memory requests in any order
- Hardware reconstructs the correct order

Each operation carries <Predecessor, Sequence #, Successor> annotations; '?' marks a branch or merge:

  Op      Predecessor   Sequence #   Successor
  Load         2             3           4
  Store        3             4           ?    (branch)
  Store        4             5           6
  Load         5             6           8
  Store        4             7           8
  Load         ?             8           9    (merge)
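A small software model of that reconstruction, replaying the annotated requests from the taken path of the table above; the data structures and the fixed-point release loop are assumptions for illustration, not the store-buffer hardware of [MICRO'03].

#include <stdbool.h>
#include <stdio.h>

#define UNKNOWN (-1)   /* the '?' annotation at branches and merges */

/* One request with its <predecessor, sequence #, successor> links. */
typedef struct { int pred, seq, succ; bool issued; } Req;

/* q (already issued) directly precedes r if either side names the other;
 * this resolves '?' successors (branches) and '?' predecessors (merges). */
static bool linked(const Req *q, const Req *r) {
    return q->succ == r->seq || r->pred == q->seq;
}

int main(void) {
    /* Taken-path requests from the table, arriving out of order. */
    Req buf[] = {
        { 3, 4, UNKNOWN, false },    /* Store <3,4,?> */
        { 5, 6, 8, false },          /* Load  <5,6,8> */
        { 2, 3, 4, false },          /* Load  <2,3,4> */
        { UNKNOWN, 8, 9, false },    /* Load  <?,8,9> */
        { 4, 5, 6, false },          /* Store <4,5,6> */
    };
    const int n = sizeof buf / sizeof buf[0];
    Req done = { UNKNOWN, 2, 3, true };   /* pretend seq 2 already issued */

    for (bool progress = true; progress; ) {
        progress = false;
        for (int i = 0; i < n; i++) {
            if (buf[i].issued)
                continue;
            bool ready = linked(&done, &buf[i]);
            for (int j = 0; j < n && !ready; j++)
                ready = buf[j].issued && linked(&buf[j], &buf[i]);
            if (ready) {
                buf[i].issued = true;
                progress = true;
                printf("issue seq %d\n", buf[i].seq);  /* prints 3, 4, 5, 6, 8 */
            }
        }
    }
    return 0;
}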

Slide 20: Wave-ordering Example

[Figure: the annotated Loads and Stores from the previous slide arrive at the store buffer out of order; using the <Predecessor, Sequence #, Successor> links, the store buffer reconstructs the correct order and issues the requests.]

Slide 21: Wave-ordered Memory

- Waves are loop-free sections of the control flow graph
- Each dynamic wave has a wave number
- Each value carries its wave number
- Total ordering
  - Ordering between waves
  - "Linked list" ordering within waves

[MICRO'03]
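A sketch of the resulting total order as a comparison function, assuming a tag that pairs the wave number with the intra-wave sequence number; the field names are invented.

#include <stdint.h>

/* Illustrative memory tag: ordered first by dynamic wave number, then
 * by sequence number. Within a wave the true order is the annotation
 * linked list, but sequence numbers increase along any executed path,
 * so numeric comparison suffices for requests that actually run. */
typedef struct {
    uint32_t wave;  /* dynamic wave number carried by every value */
    uint32_t seq;   /* wave-ordering sequence number within the wave */
} MemTag;

/* strcmp-style comparator: negative if a is ordered before b. */
int memtag_cmp(const MemTag *a, const MemTag *b) {
    if (a->wave != b->wave)
        return a->wave < b->wave ? -1 : 1;
    if (a->seq != b->seq)
        return a->seq < b->seq ? -1 : 1;
    return 0;
}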

Slide 22: Wave-ordered Memory

- Annotations summarize the CFG
- Expressing parallelism
  - Reorder consecutive operations
- Alternative solution: token passing [Beck, JPDC'91]
  - Half the parallelism

Slide 23: WaveScalar's execution model

- Dataflow execution
- Von Neumann-style memory
- Coarse-grain threads
- Lightweight synchronization

Slide 24: WaveScalar Outline

- Execution model
- Hardware design
  - Scalable
  - Low-complexity
  - Flexible
- Evaluation
- Exploiting dataflow features
- Beyond WaveScalar: Future work

Slide 25: Executing WaveScalar

[Figure: the example dataflow graph being mapped onto execution resources]

- Ideally
  - One ALU per instruction
  - Direct communication
- Practically
  - Fewer ALUs
  - Reuse them

Slide 26: WaveScalar processor architecture

- Array of processing elements (PEs)
- Dynamic instruction placement/eviction

Slide 27: Processing Element

- Simple, small
  - 0.5M transistors
- 5-stage pipeline
- Holds 64 instructions

Slide 28: PEs in a Pod

Slide 29: Domain

Slide 30: Cluster

Slide 31: WaveScalar Processor

Slide 32: WaveScalar Processor

- Long-distance communication
  - Dynamic routing
  - Grid-based network
- 32K instructions
- ~400 mm² in 90 nm
- 22 FO4 cycle time (1 GHz)

Slide 33: WaveScalar processor architecture

- Low complexity
- Scalable
- Flexible parallelism
- Flexible allocation

[Figure: multiple threads mapped flexibly across the PE array]

Slide 34: Demo

Slide 35: Previous dataflow architectures

- Many, many previous dataflow machines
  - [Dennis, ISCA'75]
  - TTDA [Arvind, 1980]
  - Sigma-1 [Shimada, ISCA'83]
  - Manchester [Gurd, CACM'85]
  - Epsilon [Grafe, ISCA'89]
  - EM-4 [Sakai, ISCA'89]
  - Monsoon [Papadopoulos, ISCA'90]
  - *T [Nikhil, ISCA'92]

Slide 36: Previous dataflow architectures

(Same machine list as Slide 35, overlaid with the labels "WaveScalar architecture" and "Modern technology": WaveScalar revisits these machines' ideas with modern technology.)

Slide 37: WaveScalar Outline

- Execution model
- Hardware design
- Evaluation
  - Map WaveScalar's design space
  - Scalability
  - CMP comparison
- Exploiting dataflow features
- Beyond WaveScalar: Future work

Slide 38: Performance Methodology

- Cycle-level simulator
- Workloads
  - SpecINT + SpecFP
  - Splash2
  - Mediabench
- Binary translator from Alpha -> WaveScalar
- Alpha Instructions per Cycle (AIPC)
- Synthesizable Verilog model

Slide 39: WaveScalar's design space

- Many, many parameters
  - # of clusters, domains, PEs, instructions/PE, etc.
- Very large design space
  - No intuition about good designs
- How to find good designs?
  - Search by hand
  - Complete, systematic search

Slide 40: WaveScalar's design space

- Constrain the design space
  - Synthesizable RTL model -> area model
  - Fix cycle time (22 FO4) and area budget (400 mm²)
  - Apply some "common sense" rules
  - Focus on area-critical parameters
- There are 201 reasonable WaveScalar designs
- Simulate them all (see the sketch below)
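The enumeration itself is mechanical once the area model exists; a sketch, with invented parameter ranges and a placeholder area formula standing in for the RTL-derived model:

#include <stdio.h>

/* Placeholder area model: the real one came from synthesizable RTL.
 * Parameter ranges below are invented for illustration. */
static double area_mm2(int clusters, int domains, int pes, int instrs) {
    return clusters * domains * pes * (0.4 + 0.002 * instrs);
}

int main(void) {
    int kept = 0;
    for (int clusters = 1; clusters <= 16; clusters *= 2)
        for (int domains = 1; domains <= 8; domains *= 2)
            for (int pes = 2; pes <= 16; pes *= 2)
                for (int instrs = 8; instrs <= 256; instrs *= 2)
                    if (area_mm2(clusters, domains, pes, instrs) <= 400.0) {
                        kept++;   /* ...hand this configuration to the simulator... */
                    }
    printf("%d configurations fit the 400 mm^2 budget\n", kept);
    return 0;
}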

Slide 41: WaveScalar's design space

[ISCA’06]

Slide 42: Pareto Optimal Designs

[ISCA’06]

Slide 43: WaveScalar is Scalable

The smallest and largest designs are 7x apart in both area and performance.

Slide 44: Area efficiency

- Performance per silicon: IPC/mm²
- WaveScalar
  - 1-4 clusters: 0.07
  - 16 clusters: 0.05
- Pentium 4: 0.001-0.013
- Alpha 21264: 0.008
- Niagara (8-way CMP): 0.01
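Worked out from those numbers, even the 16-cluster design delivers 0.05 / 0.01 = 5x Niagara's throughput per unit area, and the 1-4 cluster designs roughly 7x.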

Slide 45: WaveScalar Outline

- Execution model
- Hardware design
- Evaluation
- Exploiting dataflow features
  - Unordered memory
  - Mix-and-match parallelism
- Beyond WaveScalar: Future work

Slide 46: The Unordered Memory Interface

- Wave-ordered memory is restrictive
- Circumvent it
  - Manage (lack of) ordering explicitly
  - Load_Unordered
  - Store_Unordered
- Both interfaces co-exist happily
- Combine with fine-grain threads
  - 10s of instructions
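Load_Unordered and Store_Unordered are ISA operations; how they would surface in source code is not specified here, so the intrinsic-style C below is purely hypothetical, with fallback definitions so the sketch compiles on a conventional machine.

#include <stdint.h>

/* Hypothetical intrinsics for the unordered operations; the names and
 * C-level signatures are invented. Unordered accesses carry no wave-
 * ordering annotations, so the compiler or programmer asserts they do
 * not conflict with ordered accesses in flight. */
static int32_t Load_Unordered(const int32_t *addr) { return *addr; }
static void Store_Unordered(int32_t *addr, int32_t v) { *addr = v; }

/* Copy two fields known not to alias ordered traffic: with no ordering
 * constraints among them, all four accesses may issue in parallel. */
void copy_pair(int32_t *dst, const int32_t *src) {
    Store_Unordered(&dst[0], Load_Unordered(&src[0]));
    Store_Unordered(&dst[1], Load_Unordered(&src[1]));
}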

Slide 47: Exploiting Unordered Memory

- Fine-grain intermingling

typedef struct { int x, y; } Pair;

int foo(Pair *p, int *a, int *b) {
    Pair r;
    *a = 0;
    r.x = p->x;
    r.y = p->y;
    return *b;
}

Slides 48-55: Exploiting Unordered Memory (animation)

- Fine-grain intermingling (same code as Slide 47)

[Figure, animated across these slides: foo()'s memory operations split between the two interfaces.
Ordered (wave-ordered) chain: St *a,0 <0,1,2> -> Mem_nop_ack <1,2,3> -> Mem_nop_ack <2,3,4> -> Ld *b <3,4,5>.
Unordered side: Ld p->x, Ld p->y, St r.x, St r.y, synchronized with the ordered chain through the Mem_nop_ack operations.]

Slide 56: Putting it all together: Equake

- Finite element earthquake simulation
- >90% of execution time is in two functions
  - Sim()
    - Easy-to-parallelize loops
    - Coarse-grain threads (see the sketch below)
  - Smvp()
    - Cross-iteration dependences
    - Fine-grain threads + unordered memory
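For Sim(), the coarse-grain strategy is ordinary loop-level threading; a generic pthreads rendering (not WaveScalar's own threading interface), with a placeholder loop body standing in for Sim()'s real work:

#include <pthread.h>

#define NTHREADS 4

/* One thread's share of the iteration space. */
typedef struct { double *v; int lo, hi; } Chunk;

static void *do_chunk(void *arg) {
    Chunk *c = arg;
    for (int i = c->lo; i < c->hi; i++)
        c->v[i] *= 0.5;   /* placeholder for Sim()'s loop body */
    return NULL;
}

/* Split an easy-to-parallelize loop over n elements across NTHREADS
 * coarse-grain threads. */
void parallel_sim(double *v, int n) {
    pthread_t tid[NTHREADS];
    Chunk chunk[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        chunk[t] = (Chunk){ v, t * n / NTHREADS, (t + 1) * n / NTHREADS };
        pthread_create(&tid[t], NULL, do_chunk, &chunk[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}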
Slide 57: Putting it all together: Equake

[Bar chart: speedup over the single-threaded version (y-axis 1-7). Unoptimized: 1; Sim() only: 1.4; smvp() only: 2.2; Sim() + smvp(): 6. The annotations (3.5) and (11) also appear on the chart.]

Slide 58: Conclusion

- Low-complexity dataflow architecture
- Solves dataflow memory ordering
- Hybrid memory and threading interfaces
- Scalable, high performance
- First ISA to encode program structure

Slide 59: Dataflow performance

Slide 60: Multithreaded Performance

[Bar chart: speedup (y-axis 5-25) for the Splash2 workloads fft, lu, ocean, radix, raytrace, and their average, at 1, 2, 4, 8, 16, 32, 64, and 128 threads]
Slide 61: Single Thread Performance

[Bar chart: AIPC (y-axis 0.2-1.6) for ammp, art, equake, gzip, mcf, twolf, djpeg, mpeg2encode, rawdaudio, and their average; WaveScalar (WS) vs. an out-of-order superscalar (OOO)]

Slide 62: Single Thread Performance/mm²

[Bar chart: AIPC/mm² (y-axis 0.01-0.05) for the same benchmarks; WS vs. OOO]
Slide 63: Wave-ordered Parallelism

[Figure: five operations with <Predecessor, Sequence #, Successor> annotations: Store <1,2,3>, Load <2,3,4>, Load <3,4,5>, Load <4,5,6>, Store <5,6,7>; the three consecutive Loads can issue in parallel]

Slide 64: Scaling to 4 clusters

[Chart: two 4-cluster design points. Max performance: 370 mm², 8.2 AIPC (~0.02 AIPC/mm²). Max AIPC/mm²: 219 mm², 8.6 AIPC (~0.04 AIPC/mm²).]

Slide 65: Scaling to 16 clusters

[Chart: 16-cluster design points. Max AIPC/mm²: 169 mm², 7.8 AIPC (~0.05 AIPC/mm²). Other points: 219 mm², 8.6 AIPC (~0.04); 463 mm², 15 AIPC (~0.03); 828 mm², 17 AIPC (~0.02).]

There is no scalable tile!