Chronos: Efficient Speculative Parallelism for Accelerators




SLIDE 1

Chronos: Efficient Speculative Parallelism for Accelerators

MALEEN ABEYDEERA, DANIEL SANCHEZ
ASPLOS 2020

SLIDE 2

Current hardware accelerators are limited to easy parallelism

Current Accelerators
  • Target easy parallelism
  • Tasks and dependences known in advance
  • e.g.: Deep learning, Genomics

Chronos
  • Targets hard parallelism
  • Requires speculative execution
  • e.g.: Graph analytics, simulation, transactional databases

SLIDE 3

Problem and Insight

Problem
  • Prior speculation mechanisms (Transactional Memory, Thread Level Speculation) require global conflict detection
  • Shared memory system → coherence protocol
  • Coherence poorly suited for accelerators

Insight
  • Limit the data that each core can access
  • Divide work into tiny tasks and send them to data
  • Coordinate tasks through order constraints
  • Local conflict detection → No coherence needed

[Figure: Transaction 1 (W, Y) and Transaction 2 (Z, W) on Core 1 and Core 2, with objects W, X, Y, Z in memory and order constraints between the tasks of each transaction.]

SLIDE 4

Contributions

  • SLOT (Spatially Located Ordered Tasks): A new execution model that does not require coherence, but relies on task ordering and spatial task mapping to detect conflicts
  • Chronos: An implementation of SLOT that provides a common framework for acceleration of applications with speculative parallelism

https://chronos-arch.csail.mit.edu/

SLIDE 5

Speculative parallelism with single-object tasks

Discrete Event Simulation (DES) for Digital Circuits

[Figure: a digital circuit of OR, NAND, and XOR gates with per-input delays in ns, and the resulting events O1, N2, X3, X6 placed on a time axis (ns) from 1 to 6.]

If X6 is being speculatively executed

SLIDE 6

Prior techniques rely on global conflict detection

Why? No restriction on where a task can run

Relies on coherence protocol to find conflicts

[Figure: tasks O1, N2, X3, X6 on a time axis (ns) from 1 to 6, spread across Core 1 and Core 2, each with a private cache, above a shared cache / directory.]

SLIDE 7

Insight 1: Leveraging spatial task mapping for local conflict detection

Impose restrictions on where a task can run

Conflict detection is local to a core

[Figure: tasks O1, N2, X3, X6 on a time axis (ns) from 1 to 6, each labeled "Mapped to Core 1" or "Mapped to Core 2"; each core has a private cache above a shared cache / directory.]
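The idea of restricting where a task can run can be illustrated with a small sketch. The slides do not say how an object is assigned to a core, so the hash-modulo mapping below is only an assumption, and `coreForObject` and `mayConflict` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical helper: picks the core that owns an object.
// ASSUMPTION: a plain hash-modulo mapping; the slides do not give the
// mapping function that Chronos uses.
inline std::size_t coreForObject(std::uint64_t objectId, std::size_t numCores) {
    assert(numCores > 0);  // precondition: at least one core
    return std::hash<std::uint64_t>{}(objectId) % numCores;
}

// Two single-object tasks can touch the same data only if they name the
// same object. Because the mapping is a pure function of the object id,
// such tasks always land on the same core, so the check never has to
// look at another core's state.
inline bool mayConflict(std::uint64_t objectA, std::uint64_t objectB) {
    return objectA == objectB;
}
```

Since every task for a given object lands on the same core, that core can compare a new task against its own queue only, which is what makes the detection local.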

SLIDE 8

Insight 2: Leveraging order to ensure atomicity

Banking application: Each transaction decrements the balance of one account and increments another

Account (object)   Balance
W                  $100
X                  $1500
Y                  $200
Z                  $400

Assign a disjoint timestamp range for each coarse transaction

  • Tx. 1: Transfer W Y (timestamp: 1)
  • Tx. 2: Transfer Z W (timestamps: 10, 11)
  • Tx. 3: Transfer X Z (timestamps: 20, 21)
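As a rough sketch of this idea, a transfer can be split into two single-object tasks whose timestamps are consecutive values inside the transaction's own range, as with timestamps 10 and 11 for Tx. 2. The names `BalanceTask`, `splitTransfer`, and `applyInOrder` are hypothetical, and so are the transfer amount, the direction (first account debited, second credited), and the choice to run the debit first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using Timestamp = std::uint64_t;
using AccountId = char;

// One single-object task: adjust one account's balance at a timestamp.
struct BalanceTask {
    Timestamp ts;
    AccountId account;
    long delta;
};

// Hypothetical helper: splits one transfer into a debit at `base` and a
// credit at `base + 1`. Giving each transfer its own base (1, 10, 20, ...)
// keeps the timestamp ranges of different transfers disjoint.
// ASSUMPTION: the debit is ordered before the credit.
inline std::vector<BalanceTask> splitTransfer(Timestamp base, AccountId from,
                                              AccountId to, long amount) {
    return {BalanceTask{base, from, -amount}, BalanceTask{base + 1, to, amount}};
}

// Reference semantics: tasks appear to execute in timestamp order.
inline void applyInOrder(std::vector<BalanceTask> tasks,
                         std::map<AccountId, long>& balances) {
    std::sort(tasks.begin(), tasks.end(),
              [](const BalanceTask& a, const BalanceTask& b) { return a.ts < b.ts; });
    for (const BalanceTask& t : tasks) {
        balances[t.account] += t.delta;
    }
}
```

Because no other transfer owns a timestamp between `base` and `base + 1`, no foreign task can be ordered between the debit and the credit, which is how ordering stands in for atomicity here.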

SLIDE 9

Benefits of fine-grained tasks

✓ Increased data locality
✓ Reduced network traffic
✓ Increased parallelism
✓ Low probability and impact of aborts
✓ Asynchronous communication

[Figure: two panels, "Brings data to compute" and "Sends compute to data", showing Transaction 1 (W, Y) and Transaction 2 on Core 1 and Core 2 over objects W, X, Y, Z in memory, with order constraints between tasks.]

SLIDE 10

SLOT (Spatially Located Ordered Tasks)

SLOT programs consist of tasks

Tasks can create children tasks through a simple API:

    slot::enqueue( fn_ptr, timestamp, object-id, arguments…);

  • Timestamp: Specifies order. Tasks appear to execute in timestamp order
  • Object-id: Specifies dependences. Tasks with same object-id are treated as data-dependent
  • Tasks with different object-ids can only communicate through arguments
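The ordering guarantee can be modeled in plain software with a priority queue. The sketch below is not the Chronos runtime: `slot_sketch::Runtime` is a hypothetical, purely sequential stand-in with no speculation and no spatial mapping, and its `enqueue` takes a closure instead of the `fn_ptr` plus `arguments…` shown on the slide. Breaking timestamp ties by insertion order is also an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

namespace slot_sketch {

using Timestamp = std::uint64_t;
using ObjectId = std::uint64_t;

struct Task {
    Timestamp ts;
    ObjectId object;                      // recorded only; unused by this model
    std::uint64_t seq;                    // insertion order, breaks timestamp ties
    std::function<void(Timestamp)> body;  // arguments are captured in the closure
};

// Orders the priority queue so the earliest timestamp is on top.
struct Later {
    bool operator()(const Task& a, const Task& b) const {
        if (a.ts != b.ts) return a.ts > b.ts;
        return a.seq > b.seq;
    }
};

class Runtime {
public:
    void enqueue(std::function<void(Timestamp)> body, Timestamp ts, ObjectId object) {
        queue_.push(Task{ts, object, nextSeq_++, std::move(body)});
    }

    // Runs tasks one at a time in timestamp order. Tasks may enqueue children.
    void run() {
        while (!queue_.empty()) {
            Task t = queue_.top();  // copy before pop; the body may enqueue more
            queue_.pop();
            t.body(t.ts);
        }
    }

private:
    std::priority_queue<Task, std::vector<Task>, Later> queue_;
    std::uint64_t nextSeq_ = 0;
};

}  // namespace slot_sketch
```

A parallel implementation is free to run these tasks out of order as long as the result matches what this sequential loop would produce.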

SLIDE 11

SLOT programming example (in software)

SLOT version:

    // Simulates an event arriving at a gate
    void simToggle(Time time, GateInput input) {
        auto gate = input.gate;
        auto toggledOutput = updateState(gate, input);
        if (toggledOutput) {
            // Create events for connected gates
            for (GateInput i : gate.connectedInputs()) {
                Time nextTime = time + gate.delay(input, i);
                slot::enqueue(simToggle, nextTime, i.gateID, i);
            }
        }
    }

    enqueueInitialTasks();
    slot::run();

Sequential version:

    PriorityQueue<Time, GateInput> eventQueue;

    // Simulates an event arriving at a gate
    void simToggle(Time time, GateInput input) {
        auto gate = input.gate;
        auto toggledOutput = updateState(gate, input);
        if (toggledOutput) {
            // Create events for connected gates
            for (GateInput i : gate.connectedInputs()) {
                Time nextTime = time + gate.delay(input, i);
                eventQueue.enqueue(nextTime, i);
            }
        }
    }

    enqueueInitialEvents();
    // Event loop: sequentially execute in timestamp order
    while (!eventQueue.empty()) {
        auto [time, input] = eventQueue.dequeue();
        simToggle(time, input);
    }

SLIDE 12

Chronos: An implementation of SLOT

SLIDE 13

Chronos overview

Chronos provides a framework to build accelerators for applications with speculative parallelism

The developer specifies the tasks and how they are implemented

  • Either software routines on soft cores, or specialized Processing Elements (PE)

Framework takes care of task management and speculative execution

[Figure: Tiles 1 to N, each with PEs, a private non-coherent cache, and a task unit, connected by a task traffic interconnect and by a memory traffic interconnect to Mem0, Mem1, Mem2, Mem3. The legend separates the Chronos framework from application-specific RTL.]

SLIDE 14

Task life cycle

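[Figure: task state diagram. States: Idle, Running, Finished. Transitions: Create, Dispatch, Finish, Commit, Abort. A "Parent aborted?" decision (Y/N) follows an abort, with Discard as one outcome.]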
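One plausible reading of the life cycle, written as a small state machine. The edges of the diagram did not survive in this transcript, so the transition table is an assumption: `Committed` and `Discarded` are hypothetical terminal states, an abort is taken to discard the task when its parent also aborted, and otherwise to return it to Idle for re-execution.

```cpp
#include <cassert>

enum class TaskState { Idle, Running, Finished, Committed, Discarded };
enum class TaskEvent { Dispatch, Finish, Commit, Abort };

// Returns the next state, or the unchanged state if the event does not
// apply. A newly created task starts in TaskState::Idle.
// ASSUMPTION: the transition table is inferred from the labels on the
// slide, not copied from the Chronos design.
inline TaskState step(TaskState s, TaskEvent e, bool parentAborted = false) {
    switch (e) {
    case TaskEvent::Dispatch:
        return s == TaskState::Idle ? TaskState::Running : s;
    case TaskEvent::Finish:
        return s == TaskState::Running ? TaskState::Finished : s;
    case TaskEvent::Commit:
        return s == TaskState::Finished ? TaskState::Committed : s;
    case TaskEvent::Abort:
        // Terminal states cannot be aborted.
        if (s == TaskState::Committed || s == TaskState::Discarded) return s;
        // Parent aborted: the task should never have existed, so drop it.
        // Otherwise: requeue it so it can run again.
        return parentAborted ? TaskState::Discarded : TaskState::Idle;
    }
    return s;
}
```

For example, a task that is dispatched, finishes, and is then aborted by a conflicting earlier task goes back to Idle and can later be dispatched again.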

SLIDE 15

Chronos internal dataflow

[Figure: Tile A and Tile B connected by the task interconnect. Each tile has a task queue, a commit queue, a TSB, a cache, and a PE. Tasks are labeled "Mapped to Tile A" or "Mapped to Tile B" and marked IDLE (I), RUNNING (R), or FINISHED (F). Annotations: task creation/dispatch, speculative state of finished tasks, abort messages, requeue task.]

SLIDE 16

Versioning and commit protocol

Eager versioning
  • Updates speculative values in place
  • Store old values in an undo log

Key benefits
  • Makes the common case (commits) fast
  • Makes speculative data available before commit

Commit Protocol (GVT – Global Virtual Time)
  • LVT (Earliest unfinished ts in the tile)
  • GVT (Earliest unfinished ts in the system)
  • GVT = min{LVT0, ..., LVTN}

Key benefits
  • Achieves fast and parallel commits

[Figure: a core with an undo log in front of main memory / cache; Tile 0, Tile 1, ..., Tile N reporting to a GVT arbiter.]
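Both mechanisms are small enough to sketch in software. This is an illustrative model rather than the hardware design: `UndoLog`, `computeGvt`, and `canCommit` are hypothetical names, memory is modeled as a map in which absent addresses read as 0, and the commit rule (a finished task may commit once its timestamp is earlier than the GVT) is an assumption drawn from the GVT definition on the slide.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Timestamp = std::uint64_t;
using Address = std::uint64_t;
using Memory = std::unordered_map<Address, long>;

// Eager versioning: a speculative write updates memory in place and
// saves the old value so it can be restored on abort.
class UndoLog {
public:
    void write(Memory& mem, Address addr, long value) {
        log_.emplace_back(addr, mem[addr]);  // remember the old value
        mem[addr] = value;                   // new value visible at once
    }

    // Commit is the fast path: the new values are already in place.
    void commit() { log_.clear(); }

    // Abort restores old values, newest entry first.
    void abort(Memory& mem) {
        for (auto it = log_.rbegin(); it != log_.rend(); ++it) {
            mem[it->first] = it->second;
        }
        log_.clear();
    }

private:
    std::vector<std::pair<Address, long>> log_;
};

// GVT = min{LVT0, ..., LVTN}: the earliest unfinished timestamp in the system.
inline Timestamp computeGvt(const std::vector<Timestamp>& lvts) {
    assert(!lvts.empty());
    return *std::min_element(lvts.begin(), lvts.end());
}

// ASSUMPTION: a finished task can commit once every earlier task has
// finished, i.e. once its timestamp is earlier than the GVT.
inline bool canCommit(Timestamp taskTs, Timestamp gvt) { return taskTs < gvt; }
```

Each tile can apply `canCommit` to its own finished tasks as soon as it learns the new GVT, which is one way to read the "fast and parallel commits" benefit.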

SLIDE 17

Chronos FPGA implementation

Developed an FPGA implementation of Chronos

  • Up to 16 tiles
  • Running at 125 MHz
  • High task throughput: can enqueue, dequeue, execute and commit 8 tasks per cycle on a 16-tile system

[Figure: FPGA layout labeled with the AWS Shell and 16 tiles.]

SLIDE 18

Experimental methodology

Four accelerators built using Chronos framework running on AWS FPGAs

  • Discrete Event Simulation (DES)
  • Maxflow
  • Single Source Shortest Paths (SSSP)
  • Astar Search

Custom PEs per application: 32-way multithreaded PE, single PE/tile

Baseline: Highly optimized software parallel implementations running on a 40-threaded Xeon AWS instance

Platform       AWS Instance   Price ($/hr)
Baseline CPU   M4.10xlarge    2.00
FPGA           F1.2xlarge     1.65

SLIDE 19

Chronos performance vs. 40-threaded Xeon

App       Max. Concurrent Tasks   FPGA 1t / CPU 1t   Overall Speedup
des       256                     2.45×              15.3×
maxflow   192                     0.11×              4.3×
sssp      512                     0.24×              3.6×
astar     192                     0.58×              3.5×

  • Runs many more tasks in parallel
  • Specialization helps to run a single task efficiently (narrowing the 19× frequency gap with CPU)

SLIDE 20

Chronos performance analysis

[Figure: Breakdown of aggregate PE cycles]

Observation: Most work is ultimately useful (only 11% of cycles result in wasted work)

SLIDE 21

See the paper for more

  • Non-speculative applications
  • Non-rollback applications
  • Chronos with RISC-V cores
  • Projected performance on ASIC Chronos
  • Chronos resource utilization

SLIDE 22

Conclusion

  • Prior speculative parallel systems have relied on cache coherence to detect conflicts, precluding their use in accelerators
  • SLOT (Spatially Located Ordered Tasks): A new execution model that does not require coherence, but relies on task ordering and spatial task mapping to detect conflicts
  • Chronos: An implementation of SLOT that provides a common framework for acceleration of applications with speculative parallelism
  • Use Chronos to build FPGA accelerators for four challenging applications providing up to 15x speedup over a multicore baseline

https://chronos-arch.csail.mit.edu/