Dataflow: The Road Less Complex (PowerPoint presentation by Steven Swanson and Andrew Schwerin)



SLIDE 1

Dataflow: The Road Less Complex

Steven Swanson, Andrew Schwerin, Ken Michelson, Mark Oskin

University of Washington

Sponsored by NSF and Intel

SLIDE 2

Things to keep you up at night (~2016)

Opportunities

- 8 billion transistors; 28GHz
- One DRAM chip will exhaust a 32-bit address space
- 120 P4s OR 200,000 RISC-1s will fit on a die

Challenges

- It will take 36 cycles to cross the die
- For reasonable yields, only 1 transistor in 24 billion may be broken, if one flaw breaks a chip
- 7 years and 10,000 people

Fault tolerance is required. Chips are networks. We need simpler designs and better tools.

SLIDE 3

Outline

- Monolithic von Neumann processing
- WaveScalar
- Results
- Future work and conclusions

SLIDE 4

Monolithic Processing

Von Neumann is simple. We know how to build such machines.

But in 2016?

- Communication
- Fault tolerance
- Complexity
- Performance

SLIDE 5

Decentralized Processing

- ☺ Communication
- ☺ Fault tolerance
- ☺ Complexity
- ☺ Performance

SLIDE 6

The Problem with Von Neumann

- Fundamentally centralized; fetch is the key
- There is only one program counter
- There is no parallelism in the model

The alternative is dataflow.

SLIDE 7

Dataflow has been done before...

- Dataflow is not new
- Operations fire when data is available
- No program counter
- No false control dependencies
- Exposes massive parallelism
- But...
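The firing rule can be illustrated with a toy Python sketch (a hypothetical model, not WaveScalar's hardware scheduler; the `Instr` class and `run` loop are invented for this example). An instruction executes as soon as its last operand arrives; there is no program counter driving execution:

```python
class Instr:
    """A dataflow instruction: fires when all operands are present."""

    def __init__(self, name, op, n_inputs, consumers):
        self.name = name
        self.op = op                # function of the operand values
        self.inputs = {}            # slot -> value, filled as tokens arrive
        self.n_inputs = n_inputs
        self.consumers = consumers  # list of (instr, slot) to send results to

    def receive(self, slot, value, ready):
        self.inputs[slot] = value
        if len(self.inputs) == self.n_inputs:
            ready.append(self)      # all operands present: instruction fires

def run(entry_tokens):
    """Execute a dataflow graph driven purely by data arrival."""
    ready = []
    for instr, slot, value in entry_tokens:
        instr.receive(slot, value, ready)
    results = {}
    while ready:
        instr = ready.pop()
        value = instr.op(*[instr.inputs[i] for i in range(instr.n_inputs)])
        results[instr.name] = value
        for consumer, slot in instr.consumers:
            consumer.receive(slot, value, ready)
    return results

# Example: compute (2 + 3) * 4 as a two-node dataflow graph.
add = Instr("add", lambda x, y: x + y, 2, [])
mul = Instr("mul", lambda x, y: x * y, 2, [])
add.consumers = [(mul, 0)]
results = run([(add, 0, 2), (add, 1, 3), (mul, 1, 4)])
```

Note that the entry tokens can be supplied in any order; execution order falls out of the data dependencies alone.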

SLIDE 8

...it had issues

It never executed mainstream code:

- Special languages: no mutable data structures, no aliasing, functional
- Strange memory semantics

There are scalability concerns:

- Large, slow token stores

SLIDE 9

The WaveScalar Model

WaveScalar is memory-centric dataflow.

Compared to von Neumann:

- There is no fetch

Compared to traditional dataflow:

- Memory ordering is a first-class citizen
- Normal memory semantics
- No I-structures or special languages; we run SPEC

SLIDE 10

What is a wave?

- Maximal loop-free sections of the dataflow graph
- May contain branches and joins
- They are bigger than hyperblocks
- Each dynamic wave has a wave number
- Every value has a wave number
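Operand matching by wave number can be sketched in a few lines of Python (a toy model; the `Matcher` class is invented for this illustration). Because every value carries the wave number of the dynamic wave that produced it, an instruction only pairs operands from the same wave, so values from different loop iterations cannot be confused:

```python
from collections import defaultdict

class Matcher:
    """Pair up operand tokens that carry the same wave number."""

    def __init__(self):
        self.pending = defaultdict(dict)   # wave number -> slot -> value

    def arrive(self, wave, slot, value, n_inputs):
        slots = self.pending[wave]
        slots[slot] = value
        if len(slots) == n_inputs:
            return self.pending.pop(wave)  # full operand set: fire
        return None                        # still waiting for operands

# Tokens from two dynamic waves arrive interleaved; each instruction
# instance fires only when its own wave's operands are complete.
m = Matcher()
m.arrive(0, 0, 10, 2)   # wave 0, first operand: waits
m.arrive(1, 0, 99, 2)   # wave 1, first operand: waits
ops0 = m.arrive(0, 1, 20, 2)   # completes wave 0's operand set
```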

SLIDE 11

Maintaining Memory Order

Loads and stores can issue requests to memory in any order. Each request carries:

- Wave number
- Operation sequence number
- Ordering information (predecessor and successor sequence numbers)

The memory system reconstructs the correct order:

- Wave number + sequence numbers provide a total order
- Your favorite speculative memory system, or a store buffer
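The reconstruction step can be sketched as follows (a toy Python model; the request fields are illustrative, not the actual WaveScalar encoding): sorting by (wave number, sequence number) recovers program order regardless of arrival order.

```python
def replay(requests, memory):
    """Apply memory requests in (wave, seq) order; return load results."""
    loads = {}
    for r in sorted(requests, key=lambda r: (r["wave"], r["seq"])):
        if r["kind"] == "store":
            memory[r["addr"]] = r["val"]
        else:  # load
            loads[(r["wave"], r["seq"])] = memory.get(r["addr"])
    return loads

# The load arrives before the store it logically follows, yet still
# observes the stored value after replay.
memory = {}
reqs = [
    {"wave": 0, "seq": 2, "kind": "load", "addr": 100},
    {"wave": 0, "seq": 1, "kind": "store", "addr": 100, "val": 7},
]
loads = replay(reqs, memory)
```

A real implementation would also use the predecessor/successor sequence numbers to detect gaps (requests still in flight) instead of waiting for a full sort.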

SLIDE 12

WaveScalar benefits

Expose everything about the program:

- Data dependencies
- Memory order

Instructions manipulate wave numbers. Multiple, parallel sequences of operations are possible:

- Synchronization
- Concurrency
- Communication

SLIDE 13

The WaveCache

The I-cache is the processor.

SLIDE 14

WaveCache

[Figure: WaveCache block diagram. A processing element (PE) contains inputs, outputs, decode, a functional unit, flow control, and configuration logic; PEs are grouped into domains; domains plus a D$ and store buffer form a cluster; clusters connect to the L2 cache.]

- Long-distance communication: dynamic routing over a grid-based network, 1-2 cycles/domain
- Traditional cache coherence
- Normal memory hierarchy
- 16K instructions

SLIDE 15

Current results

- Compiled SPEC/MediaBench with the DEC cc compiler (-O4 -unroll 16)
- Binary translator/compiler from Alpha AXP to WaveScalar
- Timing/execution-based simulation
- Results in Alpha instructions per cycle (AIPC)

SLIDE 16

Comparison architectures

Superscalar:

- 16-wide, 16-ported cache, 1024-entry issue window, 1024 registers, gshare branch predictor
- 15-stage pipeline; perfect cache

WaveCache:

- ~2000 processing elements, 16 elements/domain
- Perfect cache

SLIDE 17

WaveScalar vs Superscalar

2.8x faster, not counting clock rate improvements.

[Figure: bar chart of AIPC for WaveScalar vs. superscalar across vpr, twolf, mcf, equake, art, adpcm, mpeg, and fft.]

SLIDE 18

Cache replacement

- Not all the instructions will fit
- WaveCache miss: the destination instruction is not present, so evict/load an instruction (flush/load queues)
- Instructions volunteer for removal
- Location is important: normal hashing won't work

SLIDE 19

Cache size

- Thrashing is dangerous
- Dynamic mapping is a big win

[Figure: normalized performance vs. cache size (log scale, 10 to 10000) for vpr, twolf, mcf, equake, art, adpcm, mpeg, and fft.]

SLIDE 20

Speculation

Speculation helps: 2.4x on average for both. This is gravy!

[Figure: AIPC for WaveScalar with perfect branch prediction, perfect memory disambiguation, and both, across vpr, twolf, mcf, equake, art, adpcm, mpeg, and fft.]

SLIDE 21

Future work

Hardware implementation:

- À la the Bathysphere

Compiler issues:

- Memory parallelism
- More than von Neumann emulation: vector, streaming
- WaveScalar is an ISA for writing architectures

Operating system issues:

- What is a context switch?
- What is a system call?

SLIDE 22

Future work

- Online placement optimization: simulated annealing
- Defect tolerance: hard and soft faults
- WaveCache as a computer system: WaveScalar everything (graphics, IO, CPU, keyboard, hard drive)
- Uniform namespace for a computer
- Adaptation at load time

SLIDE 23

Conclusion

Decentralized computing will let you rest easy in 2016!

WaveScalar and the WaveCache:

- Dataflow with normal memory!
- Outperforms an OOO superscalar by 2.8x
- Feasible now and in 2016

Enormous opportunities for future research.