Dataflow: The Road Less Complex

Steven Swanson, Andrew Schwerin, Ken Michelson, Mark Oskin
University of Washington

Sponsored by NSF and Intel
Things to keep you up at night (~2016)

Opportunities:
- 8 billion transistors; 28 GHz
- One DRAM chip will exhaust a 32-bit address space
- 120 Pentium 4s or 200,000 RISC-I cores will fit on a single die

Challenges:
- It will take 36 cycles to cross the die
- For reasonable yields, only 1...
- 7 years and 10,000 people
Outline:
- Monolithic von Neumann processing
- WaveScalar
- Results
- Future work and conclusions
Von Neumann is...
We know how to...
2016?
The von Neumann model is fundamentally centralized: fetch is the key, there is only one program counter, and there is no parallelism in the model.

The alternative is dataflow.
Dataflow is not new:
- Operations fire when data is available
- No program counter
- No false control dependencies
- Exposes massive parallelism

But...
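The firing rule above can be sketched as a toy interpreter. The graph encoding, the node names, and the `run_dataflow` helper are my own illustration for exposition, not part of WaveScalar:

```python
# Minimal dataflow-firing sketch (illustration only, not WaveScalar itself).
# There is no program counter: an operation fires as soon as all of its
# input tokens have arrived, so independent operations may run in any order.

def run_dataflow(nodes, initial_tokens):
    """nodes: {name: (op, [input_names])}; tokens are keyed by node name."""
    tokens = dict(initial_tokens)
    fired = set()
    progress = True
    while progress:
        progress = False
        for name, (op, inputs) in nodes.items():
            # Fire any node whose operands are all available.
            if name not in fired and all(i in tokens for i in inputs):
                tokens[name] = op(*[tokens[i] for i in inputs])
                fired.add(name)
                progress = True
    return tokens

# (a+b) * (c+d): the two additions are independent and can fire in either order.
graph = {
    "s1": (lambda x, y: x + y, ["a", "b"]),
    "s2": (lambda x, y: x + y, ["c", "d"]),
    "m":  (lambda x, y: x * y, ["s1", "s2"]),
}
result = run_dataflow(graph, {"a": 1, "b": 2, "c": 3, "d": 4})["m"]  # 21
```

The scheduler here polls for ready nodes for simplicity; real dataflow hardware matches tokens in parallel.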
But dataflow never executed mainstream code:
- It required special languages: functional, with no mutable data structures, no aliasing, and strange memory semantics
- There are scalability concerns: large, slow token stores
WaveScalar is memory-centric dataflow.

Compared to von Neumann: there is no fetch.

Compared to traditional dataflow: memory ordering is a first-class citizen, with normal memory semantics. No I-structures or special languages; we run SPEC.
Waves are maximal loop-free sections of the dataflow graph:
- They may contain branches and joins
- They are bigger than hyperblocks
- Each dynamic wave has a wave number
- Every value has a wave number
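A toy sketch of why wave numbers matter: operand matching is keyed on (instruction, wave number), so values from different dynamic waves of the same static instruction can never pair up. The `TokenStore` class and its API are my own illustration, not WaveScalar hardware:

```python
# Hypothetical operand-matching sketch (names are mine, not from the talk).
# Every value in flight carries a wave number; tokens from different dynamic
# waves (e.g. different loop iterations) land in different matching slots.
from collections import defaultdict

class TokenStore:
    def __init__(self):
        # (instruction, wave) -> operands received so far, keyed by input port
        self.pending = defaultdict(dict)

    def arrive(self, instr, wave, port, value):
        """Record an operand; return the operand pair once both inputs for
        this (instr, wave) are present, leaving other waves untouched."""
        slot = self.pending[(instr, wave)]
        slot[port] = value
        if len(slot) == 2:
            del self.pending[(instr, wave)]
            return slot[0], slot[1]
        return None

ts = TokenStore()
ts.arrive("add", wave=0, port=0, value=10)       # waiting for port 1
ts.arrive("add", wave=1, port=0, value=99)       # a later iteration; no clash
pair = ts.arrive("add", wave=0, port=1, value=32)  # wave 0 now fires: (10, 32)
```

Without the wave number in the key, the value from iteration 1 could incorrectly complete iteration 0's operand pair.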
Wave-ordered memory: loads and stores can issue requests to memory in any order. Each request carries:
- Its wave number
- An operation sequence number
- Ordering information (predecessor and successor sequence numbers)

The memory system reconstructs the correct order: wave numbers plus sequence numbers provide a total order on memory operations. Behind this interface you can use your favorite speculative memory system, or a store buffer.
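A hedged sketch of the reconstruction step, assuming each request carries its wave number, sequence number, and successor link (predecessor links, used by the real scheme to detect gaps early, are omitted here for brevity); all names are my own:

```python
# Illustration of wave-ordered memory replay (not the actual WaveScalar design).
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MemOp:
    wave: int            # which dynamic wave this op belongs to
    seq: int             # position within the wave
    succ: Optional[int]  # sequence number of the next op; None = last in wave

def commit_order(ops):
    """Replay out-of-order requests in the single total order given by
    (wave number, sequence number), following each op's successor link."""
    by_key = {(op.wave, op.seq): op for op in ops}
    ordered = []
    for wave in sorted({op.wave for op in ops}):
        seq = min(op.seq for op in ops if op.wave == wave)
        while seq is not None:
            ordered.append(by_key[(wave, seq)])
            seq = by_key[(wave, seq)].succ
    return ordered

# Requests arrive out of order; seq 0's successor link (3) tells the memory
# system that ops 1-2 were on an untaken branch path and will never arrive.
arrivals = [MemOp(1, 1, None), MemOp(0, 3, None), MemOp(0, 0, 3), MemOp(1, 0, 1)]
replayed = [(o.wave, o.seq) for o in commit_order(arrivals)]
```

The successor link is what lets the gap between sequence numbers 0 and 3 be committed without waiting forever, which is the key to supporting branches inside a wave.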
The ISA exposes everything about the program: data dependencies and memory order. Instructions manipulate wave numbers, and multiple, parallel sequences of operations are possible, which supports synchronization, concurrency, and communication.
[Figure: the WaveCache, a grid of processing elements, each with inputs/outputs, decode, flow control, a functional unit (FU), and configuration logic, backed by a D$ + store buffer and an L2 cache]
- Long-distance communication: dynamic routing over a grid-based network, 1-2 cycles/domain
- Traditional cache hierarchy; normal memory
- Holds 16K instructions
Methodology:
- Compiled SPEC/Mediabench with the DEC cc compiler (-O4 -unroll 16)
- Binary translated/compiled from Alpha AXP to WaveScalar
- Timing/execution-based simulation; results reported in Alpha instructions per cycle (AIPC)
Configurations:
- Superscalar: 16-wide, 16-ported cache, 1024-entry issue window, 15-stage pipeline, perfect cache
- WaveCache: ~2000 processing elements, 16 elements/domain, perfect cache
WaveScalar is 2.8x faster, not counting clock advantages.

[Figure: AIPC of WaveScalar vs. superscalar on vpr, twolf, mcf, equake, art, adpcm, mpeg, fft]
Not all the instructions will fit, so there are WaveCache misses:
- The destination instruction is not present; evict and load instructions (flush/load queues)
- Instructions volunteer for removal
- Location is important, so normal hashing won't work
Thrashing is mitigated by dynamic mapping.

[Figure: normalized performance vs. WaveCache size (log scale, 10 to 10000) for vpr, twolf, mcf, equake, art, adpcm, mpeg, fft]
Speculation: an additional 2.4x on top of the base design, and this is gravy!

[Figure: AIPC on vpr, twolf, mcf, equake, art, adpcm, mpeg, fft for WaveScalar with perfect branch prediction, perfect memory disambiguation, and both]
Future work:
- Hardware implementation, a la the Bathysphere
- Compiler issues: memory parallelism, and more than von Neumann emulation (vector, streaming); WaveScalar is an ISA for writing architectures
- Operating system issues: What is a context switch? What is a system call?
- Online placement optimization: simulated annealing
- Defect tolerance: hard and soft faults
- The WaveCache as a computer system: WaveScalar everything (graphics, IO, CPU, ...); a uniform namespace for a computer; adaptation at load time
Conclusions:
- Decentralized computing will let you...
- WaveScalar and the WaveCache: dataflow with normal memory! It outperforms an OOO superscalar by 2.8x, and it is feasible now and in 2016
- Enormous opportunities for future work