Executing a Program on the MIT Tagged Token Dataflow Architecture
Arvind and Nikhil
Notes on the paper: this is a Big "A" Architecture paper. It's a PL, an ISA, an execution model, and a dash of hardware.

Execution Models:
3
Von Neumann (CMP)
Program counter: centralized, sequential
Two serialization points: instruction fetch, memory access
4
Not a new idea [Dennis, ISCA'75]. Programs are dataflow graphs; instructions fire when data arrives.
Instructions act independently. All ready instructions can fire at once: massive parallelism.
[Animation: two tokens carrying the value 2 arrive on a + node's input arcs; when both are present, the node fires and emits a token carrying 4.]
5
A[j + i*i] = i; b = A[i*j];

Mul   t1 ← i, j
Mul   t2 ← i, i
Add   t3 ← A, t1
Add   t4 ← j, t2
Add   t5 ← A, t4
Store (t5) ← i
Load  b ← (t3)
6–11

A[j + i*i] = i; b = A[i*j];

[Animated dataflow graph: i and j feed two * nodes; + nodes compute j + i*i and the addresses A + (j + i*i) and A + i*j; a Store writes i and a Load produces b. Successive slides step tokens through the graph.]
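The firing rule the slides animate can be sketched as a tiny interpreter (a minimal sketch with invented names, not the TTDA pipeline): a node fires as soon as tokens are present on all of its input arcs.

```python
# Minimal static-dataflow interpreter sketch (hypothetical; not the TTDA).
# A node fires when tokens are present on all of its input arcs.

def run(graph, tokens):
    """graph: {node: (op, [input names], output name)}; tokens: {name: value}."""
    fired = set()
    progress = True
    while progress:
        progress = False
        for node, (op, ins, out) in graph.items():
            if node not in fired and all(i in tokens for i in ins):
                tokens[out] = op(*[tokens[i] for i in ins])
                fired.add(node)
                progress = True
    return tokens

# Dataflow graph for: b = A[i*j]  (A is a base address, M a toy memory)
graph = {
    "mul":  (lambda a, b: a * b,        ["i", "j"],  "t1"),
    "add":  (lambda a, b: a + b,        ["A", "t1"], "t3"),
    "load": (lambda addr, mem: mem[addr], ["t3", "M"], "b"),
}
mem = {10 + k: k * 100 for k in range(10)}      # toy memory, A = 10
result = run(graph, {"i": 2, "j": 3, "A": 10, "M": mem})
print(result["b"])   # loads mem[10 + 2*3]
```

Note that the `while` loop repeatedly scans for ready nodes; a real machine would instead enqueue a node the moment its last operand arrives.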
Use a switch operator
No wasted work. Natural correspondence to if-then. Can build loops.

Use a gated phi function (a la SSA)
More parallelism -- defer predicate computation. Not suitable for loops. Computing the predicate is tricky (but solved).
12
[Figure: a phi node and a Switch node, each with T and F arcs and a predicate input P.]
Use a "steering" operator.
13
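The two conditional operators can be sketched as token-level rules (a hedged sketch; the helper names are mine, not the paper's): a switch steers its input token onto exactly one output arc, so the untaken branch never fires, while a gated phi lets both branches compute and selects one result once the predicate arrives.

```python
# Sketch of the two conditional operators (hypothetical helper names).

def switch(pred, value):
    """Steer `value` to the T or F output arc; the other arc gets nothing,
    so the untaken branch never receives a token (no wasted work)."""
    return ("T", value) if pred else ("F", value)

def gated_phi(pred, t_val, f_val):
    """Both branches may have computed speculatively in parallel;
    the predicate picks which result flows on."""
    return t_val if pred else f_val

arc, v = switch(True, 42)
print(arc, v)                    # token appears only on the T arc
print(gated_phi(False, 1, 2))    # selects the F branch's result
```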
14
Static dataflow: exactly one token on each dataflow arc at one time.
Finite state (~ the size of the dataflow graph). Scheduling is easy.
Parallelism limited by dataflow graph size (i.e., static instruction count).
No loop parallelism.
15
Dynamic dataflow: multiple tokens on an arc at one time.
Parallelism is possible -- pipeline iterations through the loop's graph.
Unbounded state. Circulation speed mismatch -- mismatched inputs. Tags are required.
16
[Figure: untagged tokens A, B, S on an arc become tagged tokens 1:A, 2:A, 3:A, 1:B, 2:B, 3:B, 1:S, 2:S, 3:S.]
Tags distinguish between different dynamic instances.

Tag management in TTDA:
Tags are the address of an activation record (aka stack frame).
A dynamic instance of an "instruction block" has a tag.
A central manager allocates/reclaims them.
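Tag matching can be sketched as a waiting-matching store keyed by (tag, destination instruction) (a minimal sketch with invented names; the real TTDA uses a hardware waiting-matching section): the first operand of a two-input instruction waits until its partner carrying the same tag arrives, so tokens from different loop iterations never pair up.

```python
# Sketch of tagged-token matching (hypothetical; names are mine).
# Key = (tag, dest instruction). A two-input instruction fires only when
# both operands carrying the SAME tag have arrived.

waiting = {}

def arrive(tag, dest, port, value, fire):
    key = (tag, dest)
    if key in waiting:
        other_port, other_value = waiting.pop(key)   # partner found: match
        operands = {port: value, other_port: other_value}
        fire(tag, dest, operands["L"], operands["R"])
    else:
        waiting[key] = (port, value)                 # first operand: wait

results = []
def fire(tag, dest, left, right):
    results.append((tag, dest, left + right))        # pretend dest is an Add

# Tokens from iterations 1 and 2 interleave but still match correctly.
arrive(1, "add", "L", 10, fire)
arrive(2, "add", "L", 30, fire)
arrive(2, "add", "R", 4, fire)
arrive(1, "add", "R", 7, fire)
print(results)   # [(2, 'add', 34), (1, 'add', 17)]
```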
17
How big should the threads that "fire" be?

Fine-grain: in the limit, each instruction is a thread. Maximum parallelism. Lots of synchronization overhead. Bounded # of inputs.

Coarse-grain: potentially less parallelism (in practice?), less synchronization overhead, and variable inputs.

It's hard to beat straight-line code on a pipelined machine: 5 stages == 5-way parallelism. Pretty good for short threads.
18
Building well-formed graphs: in von Neumann ISAs any sequence of instructions is valid; dataflow graphs have complex rules for well-formedness.
Detecting completion: it is hard to tell when a fully distributed system is "finished".
Preventing tag explosion: k-loop bounding et al.
Executing "normal" languages.
19
[Figure: loop example; without tags, tokens for j can arrive out of order and pile up on an arc.]
20
Id: elegant.
Determinate. Functional. Non-strict. Implicit parallelism. I-structures.
Non-strictness is the least intuitive property: it exposes enormous parallelism, and leads to mind-bending code.
21
I-structures: a sort of dataflow-enabled storage element. Simple rules:
Write/initialize once. A read from an uninitialized I-structure blocks. A read from an initialized I-structure returns. A write to an uninitialized I-structure unblocks reads. A write to an initialized I-structure is an error.
Implementation is tricky: you need a queue for deferred reads.
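The rules above can be sketched directly (a minimal single-threaded sketch with invented names; a real machine blocks hardware reads rather than queuing Python callbacks): each cell is empty or full, and a read of an empty cell is queued as a continuation that runs when the write arrives.

```python
# I-structure sketch (hypothetical names). Each cell: write-once storage
# plus a queue of deferred reads that unblock when the write arrives.

EMPTY = object()

class IStructure:
    def __init__(self, n):
        self.cells = [EMPTY] * n
        self.deferred = [[] for _ in range(n)]   # queued read continuations

    def read(self, i, k):
        """Read cell i; call continuation k(value) now or when written."""
        if self.cells[i] is EMPTY:
            self.deferred[i].append(k)           # block: queue the reader
        else:
            k(self.cells[i])                     # already full: return now

    def write(self, i, v):
        if self.cells[i] is not EMPTY:
            raise RuntimeError("write to initialized I-structure")
        self.cells[i] = v
        for k in self.deferred[i]:               # unblock queued readers
            k(v)
        self.deferred[i].clear()

out = []
s = IStructure(4)
s.read(0, out.append)    # read-before-write: deferred
s.write(0, 99)           # write unblocks the queued read
s.read(0, out.append)    # read-after-write: returns immediately
print(out)               # [99, 99]
```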
22
Id never really went anywhere. This paper is a good snapshot of the late '80s.
Eventually gives rise to OOO execution (a la ...).
Excellent example of vertical co-design: they rethought the whole system. Almost always impractical; often yields great ideas.
23
How do you execute normal languages?
How do you multitask?
How does function linking work?
Top-to-bottom design: where's the data?
Would I-structures be useful today?
24