Dataflow: The Road Less Complex (PowerPoint presentation by Steven Swanson and Andrew Schwerin)



SLIDE 1

Dataflow: The Road Less Complex

Steven Swanson, Andrew Schwerin, Ken Michelson, Mark Oskin

University of Washington

Sponsored by NSF and Intel

SLIDE 2

Things to keep you up at night (~2016)

Opportunities

- 8 billion transistors; 28GHz
- One DRAM chip will exhaust a 32-bit address space
- 120 P4s OR 200,000 RISC-1s will fit on a die

Challenges

- It will take 36 cycles to cross the die
- For reasonable yields, only 1 transistor in 24 billion may be broken, if one flaw breaks a chip
- 7 years and 10,000 people

Fault tolerance is required. Chips are networks. We need simpler designs and better tools.

SLIDE 3

Outline

- Monolithic von Neumann processing
- WaveScalar
- Results
- Future work and conclusions

SLIDE 4

Monolithic Processing

Von Neumann is simple. We know how to build such machines.

But in 2016?

- Communication
- Fault tolerance
- Complexity
- Performance

SLIDE 5

Decentralized Processing

- ☺ Communication
- ☺ Fault tolerance
- ☺ Complexity
- ☺ Performance

SLIDE 6

The Problem with Von Neumann

- Fundamentally centralized; fetch is the key
- There is only one program counter
- There is no parallelism in the model

The alternative is dataflow.

SLIDE 7

Dataflow has been done before...

- Dataflow is not new
- Operations fire when data is available
- No program counter
- No false control dependencies
- Exposes massive parallelism
- But...
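The firing rule can be illustrated with a toy Python sketch (a hypothetical model, not WaveScalar's hardware scheduler; the `Instr` class and `run` loop are invented for this example). An instruction executes as soon as its last operand arrives; there is no program counter driving execution:

```python
class Instr:
    """A dataflow instruction: fires when all operands are present."""

    def __init__(self, name, op, n_inputs, consumers):
        self.name = name
        self.op = op                # function of the operand values
        self.inputs = {}            # slot -> value, filled as tokens arrive
        self.n_inputs = n_inputs
        self.consumers = consumers  # list of (instr, slot) to send results to

    def receive(self, slot, value, ready):
        self.inputs[slot] = value
        if len(self.inputs) == self.n_inputs:
            ready.append(self)      # all operands present: instruction fires

def run(entry_tokens):
    """Execute a dataflow graph driven purely by data arrival."""
    ready = []
    for instr, slot, value in entry_tokens:
        instr.receive(slot, value, ready)
    results = {}
    while ready:
        instr = ready.pop()
        value = instr.op(*[instr.inputs[i] for i in range(instr.n_inputs)])
        results[instr.name] = value
        for consumer, slot in instr.consumers:
            consumer.receive(slot, value, ready)
    return results

# Example: compute (2 + 3) * 4 as a two-node dataflow graph.
add = Instr("add", lambda x, y: x + y, 2, [])
mul = Instr("mul", lambda x, y: x * y, 2, [])
add.consumers = [(mul, 0)]
results = run([(add, 0, 2), (add, 1, 3), (mul, 1, 4)])
```

Note that the entry tokens can be supplied in any order; execution order falls out of the data dependencies alone.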

SLIDE 8

...it had issues

It never executed mainstream code:

- Special languages: no mutable data structures, no aliasing, functional
- Strange memory semantics

There are scalability concerns:

- Large, slow token stores

SLIDE 9

The WaveScalar Model

WaveScalar is memory-centric dataflow.

Compared to von Neumann:

- There is no fetch

Compared to traditional dataflow:

- Memory ordering is a first-class citizen
- Normal memory semantics
- No I-structures or special languages; we run SPEC

SLIDE 10

What is a wave?

- Maximal loop-free sections of the dataflow graph
- May contain branches and joins
- They are bigger than hyperblocks
- Each dynamic wave has a wave number
- Every value has a wave number
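Operand matching by wave number can be sketched in a few lines of Python (a toy model; the `Matcher` class is invented for this illustration). Because every value carries the wave number of the dynamic wave that produced it, an instruction only pairs operands from the same wave, so values from different loop iterations cannot be confused:

```python
from collections import defaultdict

class Matcher:
    """Pair up operand tokens that carry the same wave number."""

    def __init__(self):
        self.pending = defaultdict(dict)   # wave number -> slot -> value

    def arrive(self, wave, slot, value, n_inputs):
        slots = self.pending[wave]
        slots[slot] = value
        if len(slots) == n_inputs:
            return self.pending.pop(wave)  # full operand set: fire
        return None                        # still waiting for operands

# Tokens from two dynamic waves arrive interleaved; each instruction
# instance fires only when its own wave's operands are complete.
m = Matcher()
m.arrive(0, 0, 10, 2)   # wave 0, first operand: waits
m.arrive(1, 0, 99, 2)   # wave 1, first operand: waits
ops0 = m.arrive(0, 1, 20, 2)   # completes wave 0's operand set
```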

SLIDE 11

Maintaining Memory Order

Loads and stores can issue requests to memory in any order. Each request carries:

- Wave number
- Operation sequence number
- Ordering information (predecessor and successor sequence numbers)

The memory system reconstructs the correct order:

- Wave number + sequence numbers provide a total order
- Your favorite speculative memory system, or a store buffer
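The reconstruction step can be sketched as follows (a toy Python model; the request fields are illustrative, not the actual WaveScalar encoding): sorting by (wave number, sequence number) recovers program order regardless of arrival order.

```python
def replay(requests, memory):
    """Apply memory requests in (wave, seq) order; return load results."""
    loads = {}
    for r in sorted(requests, key=lambda r: (r["wave"], r["seq"])):
        if r["kind"] == "store":
            memory[r["addr"]] = r["val"]
        else:  # load
            loads[(r["wave"], r["seq"])] = memory.get(r["addr"])
    return loads

# The load arrives before the store it logically follows, yet still
# observes the stored value after replay.
memory = {}
reqs = [
    {"wave": 0, "seq": 2, "kind": "load", "addr": 100},
    {"wave": 0, "seq": 1, "kind": "store", "addr": 100, "val": 7},
]
loads = replay(reqs, memory)
```

A real implementation would also use the predecessor/successor sequence numbers to detect gaps (requests still in flight) instead of waiting for a full sort.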

SLIDE 12

WaveScalar benefits

Expose everything about the program:

- Data dependencies
- Memory order

Instructions manipulate wave numbers. Multiple, parallel sequences of operations are possible:

- Synchronization
- Concurrency
- Communication

SLIDE 13

The WaveCache

The I-cache is the processor.

SLIDE 14

WaveCache

[Figure: WaveCache block diagram. A processing element (PE) contains inputs, outputs, decode, a functional unit, flow control, and configuration logic; PEs are grouped into domains; domains plus a D$ and store buffer form a cluster; clusters connect to the L2 cache.]

- Long-distance communication: dynamic routing over a grid-based network, 1-2 cycles/domain
- Traditional cache coherence
- Normal memory hierarchy
- 16K instructions

SLIDE 15

Current results

- Compiled SPEC/MediaBench with the DEC cc compiler (-O4 -unroll 16)
- Binary translator/compiler from Alpha AXP to WaveScalar
- Timing/execution-based simulation
- Results in Alpha instructions per cycle (AIPC)

SLIDE 16

Comparison architectures

Superscalar:

- 16-wide, 16-ported cache, 1024-entry issue window, 1024 registers, gshare branch predictor
- 15-stage pipeline; perfect cache

WaveCache:

- ~2000 processing elements, 16 elements/domain
- Perfect cache

SLIDE 17

WaveScalar vs Superscalar

2.8x faster, not counting clock rate improvements.

[Figure: bar chart of AIPC for WaveScalar vs. superscalar across vpr, twolf, mcf, equake, art, adpcm, mpeg, and fft.]

SLIDE 18

Cache replacement

- Not all the instructions will fit
- WaveCache miss: the destination instruction is not present, so evict/load an instruction (flush/load queues)
- Instructions volunteer for removal
- Location is important: normal hashing won't work

SLIDE 19

Cache size

- Thrashing is dangerous
- Dynamic mapping is a big win

[Figure: normalized performance vs. cache size (log scale, 10 to 10000) for vpr, twolf, mcf, equake, art, adpcm, mpeg, and fft.]

SLIDE 20

Speculation

Speculation helps: 2.4x on average for both. This is gravy!

[Figure: AIPC for WaveScalar with perfect branch prediction, perfect memory disambiguation, and both, across vpr, twolf, mcf, equake, art, adpcm, mpeg, and fft.]

SLIDE 21

Future work

Hardware implementation:

- À la the Bathysphere

Compiler issues:

- Memory parallelism
- More than von Neumann emulation: vector, streaming
- WaveScalar is an ISA for writing architectures

Operating system issues:

- What is a context switch?
- What is a system call?

SLIDE 22

Future work

- Online placement optimization: simulated annealing
- Defect tolerance: hard and soft faults
- WaveCache as a computer system: WaveScalar everything (graphics, IO, CPU, keyboard, hard drive)
- Uniform namespace for a computer
- Adaptation at load time

SLIDE 23

Conclusion

Decentralized computing will let you rest easy in 2016!

WaveScalar and the WaveCache:

- Dataflow with normal memory!
- Outperforms an OOO superscalar by 2.8x
- Feasible now and in 2016

Enormous opportunities for future research.