Chronos: Efficient Speculative Parallelism for Accelerators
MALEEN ABEYDEERA, DANIEL SANCHEZ
ASPLOS 2020

Current hardware accelerators are limited to easy parallelism
Current accelerators:
• Target easy parallelism
• Tasks and dependences known in advance
• e.g., deep learning, genomics
Chronos:
• Targets hard parallelism
• Requires speculative execution
• e.g., graph analytics, simulation, transactional databases

Problem and Insight
Problem:
• Prior speculation mechanisms (Transactional Memory, Thread-Level Speculation) require global conflict detection
• Shared memory system → coherence protocol
• Coherence is poorly suited for accelerators
Insight:
• Limit the data that each core can access
• Divide work into tiny tasks and send them to the data
• Coordinate tasks through order constraints
• Local conflict detection → no coherence needed
[Figure: Transaction 1 (W, Y) and Transaction 2 (Z, W) on Core 1 and Core 2, contrasting a shared memory holding W, X, Y, Z with per-core data linked by order constraints]

Contributions
• SLOT (Spatially Located Ordered Tasks): a new execution model that does not require coherence, but relies on task ordering and spatial task mapping to detect conflicts
• Chronos: an implementation of SLOT that provides a common framework for accelerating applications with speculative parallelism
https://chronos-arch.csail.mit.edu/

Speculative parallelism with single-object tasks
Discrete Event Simulation (DES) for Digital Circuits
[Figure: circuit of NAND, XOR, and OR gates with per-gate delays in ns, and a timeline (Time, ns, 1 to 6) of the events N2, O1, X3, X6; annotation: "If X6 is being speculatively executed"]

Prior techniques rely on global conflict detection
• Why? No restriction on where a task can run
• Relies on the coherence protocol to find conflicts
[Figure: Core 1 and Core 2 with private caches under a shared cache / directory; tasks N2, O1, X3, X6 run on either core along a timeline (Time, ns, 1 to 6)]

Insight 1: Leveraging spatial task mapping for local conflict detection
• Impose restrictions on where a task can run
• Conflict detection is local to a core
[Figure: Core 1 and Core 2 with private caches under a shared cache / directory; each of the tasks N2, O1, X3, X6 is mapped to Core 1 or to Core 2 along a timeline (Time, ns, 1 to 6)]
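A minimal sketch of how a spatial mapping could keep conflict checks local, under stated assumptions. The names `coreForObject`, `SpecTask`, and `findLocalConflicts` are hypothetical, the modulo mapping is only one possible choice, and the abort rule shown (a task on the same object with a later timestamp ran too early) is an illustration, not the mechanism specified on the slide.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical task record: one timestamp and one object per task.
struct SpecTask {
    std::uint64_t timestamp;
    std::uint64_t objectId;
};

// Assumed mapping: every task on a given object is sent to the same core,
// so two tasks that could conflict always meet on one core.
inline std::uint32_t coreForObject(std::uint64_t objectId, std::uint32_t numCores) {
    assert(numCores > 0);
    return static_cast<std::uint32_t>(objectId % numCores);
}

// Scans only the tasks that already ran speculatively on this core.
// Returns those that touch the same object as `incoming` but carry a later
// timestamp, i.e. tasks that ran too early relative to `incoming`.
inline std::vector<SpecTask> findLocalConflicts(
        const std::vector<SpecTask>& ranOnThisCore, const SpecTask& incoming) {
    std::vector<SpecTask> conflicts;
    for (const SpecTask& t : ranOnThisCore) {
        if (t.objectId == incoming.objectId && t.timestamp > incoming.timestamp) {
            conflicts.push_back(t);
        }
    }
    return conflicts;
}
```

Because the check never looks at tasks held by another core, nothing in this sketch needs a directory or a coherence message.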

Insight 2: Leveraging order to ensure atomicity
Banking application: each transaction decrements the balance of one account and increments another
Tx. 1: Transfer W → Y (timestamps 0, 1)
Tx. 2: Transfer Z → W (timestamps 10, 11)
Tx. 3: Transfer X → Z (timestamps 20, 21)
Account (object)   Balance   Task timestamps
W                  $100      0, 11
X                  $1500     20
Y                  $200      1
Z                  $400      10, 21
Assign a disjoint timestamp range to each coarse transaction
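The timestamp ranges in the banking example can be sketched as follows. This is an illustration under assumptions: `AccountTask`, `splitTransfer`, `applyInOrder`, the stride of 10, and the transfer amounts are all hypothetical. Each transfer is split into two single-object tasks whose timestamps fall inside the range owned by its transaction, so no task of another transaction can be ordered between them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical single-object task: adjusts one account balance.
struct AccountTask {
    std::uint64_t timestamp;
    char account;
    long delta;
};

// Transaction k owns the timestamps [k * stride, (k + 1) * stride).
// The debit gets the first timestamp of the range and the credit the second.
inline std::vector<AccountTask> splitTransfer(std::uint64_t txIndex, char from, char to,
                                              long amount, std::uint64_t stride = 10) {
    assert(stride >= 2);
    const std::uint64_t base = txIndex * stride;
    return {{base, from, -amount}, {base + 1, to, +amount}};
}

// Sequential reference: applies the tasks in timestamp order.
inline void applyInOrder(std::vector<AccountTask> tasks, std::map<char, long>& balances) {
    std::sort(tasks.begin(), tasks.end(),
              [](const AccountTask& a, const AccountTask& b) {
                  return a.timestamp < b.timestamp;
              });
    for (const AccountTask& t : tasks) {
        balances[t.account] += t.delta;
    }
}
```

With a stride of 10, transactions 0, 1, and 2 receive the timestamp pairs (0, 1), (10, 11), and (20, 21), which are the values that appear on the slide.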

Benefits of fine-grained tasks
[Figure: left, Transaction 1 and Transaction 2 run whole on Core 1 and Core 2 against a shared memory holding W, X, Y, Z ("Brings data to compute"); right, Transaction 1 (W, Y) and Transaction 2 (Z, Y) are split into tasks placed next to their data and linked by order constraints ("Sends compute to data")]
✓ Increased data locality
✓ Low probability and impact of aborts
✓ Reduced network traffic
✓ Asynchronous communication
✓ Increased parallelism

SLOT (Spatially Located Ordered Tasks)
• SLOT programs consist of tasks
• Tasks can create child tasks through a simple API:
    slot::enqueue(fn_ptr, timestamp, object-id, arguments…);
• Timestamp: specifies order. Tasks appear to execute in timestamp order
• Object-id: specifies dependences. Tasks with the same object-id are treated as data-dependent
• Tasks with different object-ids can only communicate through arguments
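To make the ordering guarantee concrete, here is a purely sequential emulation of an enqueue/run pair. It is a sketch, not the Chronos API: the namespace `slot_sketch`, the `TaskFn` signature, and the tie-breaking by insertion order are assumptions, and task arguments are carried by lambda captures instead of an argument list. It only reproduces the visible behaviour that tasks appear to execute in timestamp order, without any speculation or parallelism.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

namespace slot_sketch {

// Hypothetical task signature: a task receives its timestamp and object-id.
using TaskFn = std::function<void(std::uint64_t ts, std::uint64_t objectId)>;

struct Task {
    std::uint64_t ts;
    std::uint64_t objectId;
    std::uint64_t seq;  // insertion order, used only to break timestamp ties
    TaskFn fn;
};

// Orders the priority queue so that the earliest timestamp is on top.
struct Later {
    bool operator()(const Task& a, const Task& b) const {
        if (a.ts != b.ts) return a.ts > b.ts;
        return a.seq > b.seq;
    }
};

inline std::priority_queue<Task, std::vector<Task>, Later>& pending() {
    static std::priority_queue<Task, std::vector<Task>, Later> q;
    return q;
}

inline std::uint64_t& nextSeq() {
    static std::uint64_t s = 0;
    return s;
}

// Creates a task; a running task may call this to create children.
inline void enqueue(TaskFn fn, std::uint64_t ts, std::uint64_t objectId) {
    pending().push(Task{ts, objectId, nextSeq()++, std::move(fn)});
}

// Runs every task, always picking the earliest timestamp next.
inline void run() {
    while (!pending().empty()) {
        Task t = pending().top();
        pending().pop();
        t.fn(t.ts, t.objectId);
    }
}

}  // namespace slot_sketch
```

A program written against this emulation gives the reference result that a speculative implementation would have to match.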

SLOT programming example (in software)
Sequential version:
    // Simulates an event arriving at a gate
    void simToggle(Time time, GateInput input) {
      auto gate = input.gate;
      bool toggledOutput = updateState(gate, input);
      if (toggledOutput) {
        // create events for connected gates
        for (GateInput i : gate.connectedInputs()) {
          Time nextTime = time + gate.delay(input, i);
          eventQueue.enqueue(nextTime, i);
        }
      }
    }

    PriorityQueue<Time, GateInput> eventQueue;
    enqueueInitialEvents();
    // event loop: sequentially execute in timestamp order
    while (!eventQueue.empty()) {
      auto [time, input] = eventQueue.dequeue();
      simToggle(time, input);
    }
SLOT version:
    // Simulates an event arriving at a gate
    void simToggle(Time time, GateInput input) {
      auto gate = input.gate;
      bool toggledOutput = updateState(gate, input);
      if (toggledOutput) {
        // create events for connected gates
        for (GateInput i : gate.connectedInputs()) {
          Time nextTime = time + gate.delay(input, i);
          // child task: timestamp = event time, object-id = target gate
          slot::enqueue(simToggle, nextTime, i.gateID, i);
        }
      }
    }

    enqueueInitialTasks();
    slot::run();
[Figure: example circuit with gate delays of 1 ns, 2 ns, and 5 ns]

Chronos: An implementation of SLOT

Chronos overview
Chronos provides a framework to build accelerators for applications with speculative parallelism
[Figure: tiles 0 to N, each with application-specific PEs (RTL), a task unit, and a private, non-coherent cache (Chronos framework); a task traffic interconnect links the tiles and a memory traffic interconnect connects them to Mem0 through Mem3]
• The developer specifies the tasks and how they are implemented
  ◦ Either software routines on soft cores, or specialized Processing Elements (PEs)
• The framework takes care of task management and speculative execution

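Task life cycle
States: Idle, Running, Finished
• Create: a new task starts as Idle
• Dispatch: Idle → Running
• Finish: Running → Finished
• Commit: a Finished task commits and leaves the system
• Abort: check "Parent aborted?"
  ◦ Y: discard the task
  ◦ N: the task is requeued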
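The life cycle above can be written as a small transition function. This is a sketch under assumptions: the terminal states `Committed` and `Discarded`, the names `TaskState`, `TaskEvent`, and `nextState`, and the choice to accept an abort in any of the three live states are illustrative and not taken from the slides.

```cpp
#include <cassert>
#include <stdexcept>

// Idle, Running, and Finished come from the slide; Committed and Discarded
// are hypothetical names for the two ways a task leaves the system.
enum class TaskState { Idle, Running, Finished, Committed, Discarded };
enum class TaskEvent { Dispatch, Finish, Commit, Abort };

// Returns the state that follows `s` on event `e`.
// `parentAborted` only matters for Abort: an aborted task whose parent also
// aborted is discarded, otherwise it goes back to Idle to run again.
inline TaskState nextState(TaskState s, TaskEvent e, bool parentAborted = false) {
    switch (e) {
        case TaskEvent::Dispatch:
            if (s == TaskState::Idle) return TaskState::Running;
            break;
        case TaskEvent::Finish:
            if (s == TaskState::Running) return TaskState::Finished;
            break;
        case TaskEvent::Commit:
            if (s == TaskState::Finished) return TaskState::Committed;
            break;
        case TaskEvent::Abort:
            // Assumption: any task still in the system can be aborted.
            if (s == TaskState::Idle || s == TaskState::Running ||
                s == TaskState::Finished) {
                return parentAborted ? TaskState::Discarded : TaskState::Idle;
            }
            break;
    }
    throw std::logic_error("invalid task transition");
}
```

Rejecting every other combination with an exception makes it easy to spot a transition that the diagram does not allow.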

Chronos internal dataflow
[Figure: Tile A and Tile B, each with a cache, task queue, commit queue, PE, and TSB, connected by the task interconnect. Tasks are mapped to Tile A or Tile B and marked IDLE (I), RUNNING (R), or FINISHED (F). The commit queue holds the speculative state of finished tasks. Arrows show task creation/dispatch, abort messages, and requeued tasks.]

Versioning and commit protocol
Eager versioning:
• Updates speculative values in place (main memory / cache)
• Stores old values in an undo log
• Key benefits:
  ◦ Makes the common case (commits) fast
  ◦ Makes speculative data available before commit
Commit protocol (GVT, Global Virtual Time):
• LVT: earliest unfinished timestamp in the tile
• GVT: earliest unfinished timestamp in the system
• The GVT arbiter computes GVT = min{LVT_0, …, LVT_N} over Tile 0 through Tile N
• Key benefit:
  ◦ Achieves fast and parallel commits
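Both halves of this slide can be sketched in a few lines, with several simplifications that are assumptions rather than Chronos details: `SpeculativeMemory` keeps a single undo log instead of one per task, memory is a vector of `int`, and `committable` assumes that a finished task may commit once its timestamp is below the GVT. All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Eager versioning: writes go to memory in place, old values go to an undo log.
class SpeculativeMemory {
public:
    explicit SpeculativeMemory(std::vector<int> init) : mem_(std::move(init)) {}

    void write(std::size_t addr, int value) {
        undo_.push_back({addr, mem_.at(addr)});  // remember the old value
        mem_.at(addr) = value;                   // update in place
    }
    int read(std::size_t addr) const { return mem_.at(addr); }

    // Commit is the cheap path: the new values are already in place.
    void commit() { undo_.clear(); }

    // Abort restores old values, newest log entry first.
    void abort() {
        for (auto it = undo_.rbegin(); it != undo_.rend(); ++it) {
            mem_.at(it->first) = it->second;
        }
        undo_.clear();
    }

private:
    std::vector<int> mem_;
    std::vector<std::pair<std::size_t, int>> undo_;
};

// GVT = min{LVT_0, ..., LVT_N}: the earliest unfinished timestamp in the system.
inline std::uint64_t computeGvt(const std::vector<std::uint64_t>& lvts) {
    assert(!lvts.empty());
    return *std::min_element(lvts.begin(), lvts.end());
}

// Assumed commit rule: a finished task whose timestamp is below the GVT has no
// earlier unfinished task anywhere, so it can commit.
inline std::vector<std::uint64_t> committable(
        const std::vector<std::uint64_t>& finishedTimestamps, std::uint64_t gvt) {
    std::vector<std::uint64_t> out;
    for (std::uint64_t ts : finishedTimestamps) {
        if (ts < gvt) out.push_back(ts);
    }
    return out;
}
```

Each tile can apply the commit rule to its own finished tasks once it learns the GVT, which is one way to read the "fast and parallel commits" benefit.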

Chronos FPGA implementation
• Developed an FPGA implementation of Chronos with up to 16 tiles, running at 125 MHz
• High task throughput: can enqueue, dequeue, execute, and commit 8 tasks per cycle on a 16-tile system
[Figure: 16 tiles and the AWS Shell]

Experimental methodology
Four accelerators built using the Chronos framework, running on AWS FPGAs:
• Discrete Event Simulation (DES)
• Maxflow
• Single Source Shortest Paths (SSSP)
• Astar Search
Platform       AWS Instance   Price ($/hr)
Baseline CPU   M4.10xlarge    2.00
FPGA           F1.2xlarge     1.65
Custom PEs per application: 32-way multithreaded PE, single PE/tile
Baseline: highly optimized parallel software implementations running on a 40-threaded Xeon AWS instance

Chronos performance vs. 40-threaded Xeon
App       Max. concurrent tasks   FPGA 1t / CPU 1t   Overall speedup
des       256                     2.45×              15.3×
maxflow   192                     0.11×              4.3×
sssp      512                     0.24×              3.6×
astar     192                     0.58×              3.5×
• Runs many more tasks in parallel
• Specialization helps to run a single task efficiently (narrowing the 19× frequency gap with the CPU)

Chronos performance analysis
Observation: most work is ultimately useful (only 11% of cycles result in wasted work)
[Figure: breakdown of aggregate PE cycles]

See the paper for more
• Non-speculative applications
• Non-rollback applications
• Chronos with RISC-V cores
• Projected performance on ASIC Chronos
• Chronos resource utilization

Conclusion
• Prior speculative parallel systems have relied on cache coherence to detect conflicts, precluding their use in accelerators
• SLOT (Spatially Located Ordered Tasks): a new execution model that does not require coherence, but relies on task ordering and spatial task mapping to detect conflicts
• Chronos: an implementation of SLOT that provides a common framework for accelerating applications with speculative parallelism
  ◦ Use Chronos to build FPGA accelerators for four challenging applications, providing up to 15× speedup over a multicore baseline
https://chronos-arch.csail.mit.edu/
