Multi-core Architectures: Interconnect Technology - Virendra Singh (PowerPoint PPT Presentation)

SLIDE 1

Multi-core Architectures

Interconnect Technology

Virendra Singh

Associate Professor, Computer Architecture and Dependable Systems Lab, Department of Electrical Engineering, Indian Institute of Technology Bombay

http://www.ee.iitb.ac.in/~viren/ E-mail: viren@ee.iitb.ac.in

CS-683: Advanced Computer Architecture

Lecture 29 (30 Oct 2013)

slide-2
SLIDE 2

CADSL

Topology Summary

  • First network design decision
  • Critical impact on network latency and throughput
    – Hop count provides a first-order approximation of message latency
    – Bottleneck channels determine saturation throughput

SLIDE 3

Routing Summary

  • Latency is the paramount concern
    – Minimal routing most common for NoC
    – Non-minimal routing can avoid congestion and deliver low latency
  • To date, NoC research favors DOR for simplicity and deadlock freedom
    – On-chip networks often lightly loaded
  • Only unicast routing covered here
    – Recent work extends on-chip routing to support multicast

SLIDE 4

Switching/Flow Control Overview

  • Topology: determines connectivity of the network
  • Routing: determines paths through the network
  • Flow Control: determines allocation of resources to messages as they traverse the network
    – Buffers and links
    – Significant impact on throughput and latency of the network

SLIDE 5

Packets

  • Messages: composed of one or more packets
    – If message size <= maximum packet size, only one packet is created
  • Packets: composed of one or more flits
  • Flit: flow control digit
  • Phit: physical digit
    – Subdivides a flit into chunks equal to the link width
    – In on-chip networks, flit size == phit size, due to very wide on-chip channels
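To make the message/packet/flit hierarchy concrete, here is a minimal sketch in Python; the byte sizes are assumptions for illustration, not values from the lecture.

```python
# Sketch: segmenting a message into packets, and packets into flits
# (sizes are hypothetical).
MAX_PACKET_BYTES = 64   # maximum packet size (assumed)
FLIT_BYTES = 16         # flit size; on-chip, phit size == flit size

def packetize(message: bytes):
    """Split a message into packets, and each packet into flits."""
    packets = [message[i:i + MAX_PACKET_BYTES]
               for i in range(0, len(message), MAX_PACKET_BYTES)]
    # A message no larger than MAX_PACKET_BYTES yields only one packet.
    return [[pkt[j:j + FLIT_BYTES] for j in range(0, len(pkt), FLIT_BYTES)]
            for pkt in packets]

flits = packetize(bytes(100))   # 100-byte message -> 2 packets (64 + 36 bytes)
print([len(p) for p in flits])  # flits per packet: [4, 3]
```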

SLIDE 6

Switching

  • Different flow control techniques based on granularity
  • Circuit-switching: operates at the granularity of messages
  • Packet-based: allocation made to whole packets
  • Flit-based: allocation made on a flit-by-flit basis

SLIDE 7

Virtual Cut Through

  • Packet-based: similar to store-and-forward (SAF)
  • Links and buffers are allocated to entire packets
  • Flits can proceed to the next hop before the tail flit has been received by the current router
    – But only if the next router has enough buffer space for the entire packet
  • Reduces latency significantly compared to SAF
  • But still requires large buffers

SLIDE 8

Virtual Cut Through Example

  • Lower per-hop latency
  • Larger buffering required

SLIDE 9

Flit Level Flow Control

  • Wormhole flow control
  • A flit can proceed to the next router when there is buffer space available for that flit
    – Improves over SAF and VCT by allocating buffers on a per-flit basis
  • Pros
    – More efficient buffer utilization (good for on-chip)
    – Low latency
  • Cons
    – Poor link utilization: if the head flit becomes blocked, the channels the packet holds remain idle
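A compact way to see the contrast among store-and-forward, virtual cut-through, and wormhole is the downstream-buffer check each performs before a flit advances. This sketch is illustrative; the function names are mine, not from the slides.

```python
# Sketch: the condition each switching technique checks before a flit
# advances to the next router (buffer counts in flits).

def can_advance_saf(pkt_flits, free_bufs, pkt_received):
    # Store-and-forward: the whole packet must have been received AND
    # the downstream router must be able to buffer the entire packet.
    return pkt_received and free_bufs >= pkt_flits

def can_advance_vct(pkt_flits, free_bufs):
    # Virtual cut-through: flits may run ahead of the tail, but only if
    # the downstream router can buffer the ENTIRE packet.
    return free_bufs >= pkt_flits

def can_advance_wormhole(free_bufs):
    # Wormhole: per-flit allocation; one free flit buffer suffices.
    return free_bufs >= 1
```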

SLIDE 10

Wormhole Example

  • 6 flit buffers/input port

[Diagram annotations: "Blocked by other packets"; "Channel idle but red packet blocked behind blue"; "Buffer full: blue cannot proceed"; "Red holds this channel: channel remains idle until red proceeds"]

SLIDE 11

Virtual Channel Flow Control

  • Virtual channels used to combat HOL blocking in wormhole
  • Virtual channels: multiple flit queues per input port
    – Share the same physical link (channel)
  • Link utilization improved
    – Flits on different VCs can pass a blocked packet

SLIDE 12

Virtual Channel Example

  • 6 flit buffers/input port
  • 3 flit buffers/VC

[Diagram annotations: "Blocked by other packets"; "Buffer full: blue cannot proceed"]

SLIDE 13

Deadlock

  • Using flow control to guarantee deadlock freedom gives more flexible routing
  • Escape virtual channels
    – If the routing algorithm is not deadlock-free, VCs can break the resource cycle
    – Place a restriction on VC allocation, or require one VC to be DOR
  • Assign different message classes to different VCs to prevent protocol-level deadlock
    – Prevents req-ack message cycles

SLIDE 14

Buffer Backpressure

  • Need a mechanism to prevent buffer overflow
    – Avoid dropping packets
    – Upstream nodes need to know buffer availability at downstream routers
  • Significant impact on the throughput achieved by flow control
  • Credits
  • On-off

SLIDE 15

Credit-Based Flow Control

  • The upstream router stores a credit count for each downstream VC
  • Upstream router:
    – When a flit is forwarded, decrement the credit count
    – Count == 0: downstream buffer full, stop sending
  • Downstream router:
    – When a flit is forwarded and a buffer is freed, send a credit to the upstream router
    – Upstream router increments its credit count
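A minimal sketch of this mechanism in Python (the class names, buffer depth, and link model are assumptions for illustration, not the lecture's code):

```python
# Credit-based backpressure for one VC.

class UpstreamVC:
    def __init__(self, downstream_buf_depth):
        self.credits = downstream_buf_depth  # one credit per downstream slot

    def try_send(self, flit, link):
        if self.credits == 0:
            return False          # count == 0: downstream buffer full, stall
        self.credits -= 1         # decrement on every flit forwarded
        link.append(flit)
        return True

    def on_credit(self):
        self.credits += 1         # credit returned: a downstream slot freed

class DownstreamVC:
    def __init__(self, upstream):
        self.buf, self.upstream = [], upstream

    def receive(self, flit):
        self.buf.append(flit)

    def forward(self):
        flit = self.buf.pop(0)    # flit leaves: buffer entry freed
        self.upstream.on_credit() # send a credit back upstream
        return flit
```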

SLIDE 16

Credit Timeline

  • Round-trip credit delay:
    – Time between when a buffer empties and when the next flit can be processed from that buffer entry
    – With only a single-entry buffer, this would cause significant throughput degradation
    – Important to size buffers to tolerate the credit turnaround time

[Timeline: a flit departs Node 1's router at t1; Node 2 processes it and returns a credit (t2, t3); Node 1 processes the credit and the next flit departs at t5; t1 through t5 is the credit round-trip delay]

SLIDE 17

On-Off Flow Control

  • Credits require upstream signaling for every flit
  • On-off decreases upstream signaling
  • Off signal
    – Sent when the number of free buffers falls below threshold F_off
  • On signal
    – Sent when the number of free buffers rises above threshold F_on
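A sketch of the receiving side's on-off signaling (the buffer depth and threshold values are assumptions for illustration; a real design sizes them around the round-trip wire delay):

```python
# On-off backpressure at the downstream node. F_ON > F_OFF gives
# hysteresis so the signal does not toggle on every flit.
BUF_DEPTH, F_OFF, F_ON = 8, 2, 5

class OnOffReceiver:
    def __init__(self):
        self.buf, self.stop = [], False

    def receive(self, flit):
        self.buf.append(flit)
        free = BUF_DEPTH - len(self.buf)
        if free < F_OFF and not self.stop:
            self.stop = True      # signal "off": free buffers fell below F_off

    def drain(self):
        if self.buf:
            self.buf.pop(0)
        free = BUF_DEPTH - len(self.buf)
        if free > F_ON and self.stop:
            self.stop = False     # signal "on": free buffers rose above F_on
```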

SLIDE 18

On-Off Timeline

  • Less signaling but more buffering
    – On-chip, buffers are more expensive than wires

[Timeline: Node 1 sends flits from t1; the F_off threshold is reached at t2 and Node 2 signals "off"; F_off is set to prevent flits arriving before t4 from overflowing; the F_on threshold is reached later and Node 2 signals "on"; F_on is set so that Node 2 does not run out of flits between t5 and t8]

SLIDE 19

Flow Control Summary

  • On-chip networks require techniques with lower buffering requirements
    – Wormhole or virtual channel flow control
  • Dropping packets is unacceptable in the on-chip environment
    – Requires a buffer backpressure mechanism
  • Complexity of flow control impacts the router microarchitecture (next)

SLIDE 20

Router Microarchitecture Overview

  • Consists of buffers, switches, functional units, and control logic that implement the routing algorithm and flow control
  • Focus here: the microarchitecture of a virtual channel router
  • The router is pipelined to reduce cycle time

SLIDE 21

Virtual Channel Router

[Diagram: virtual channel router microarchitecture: input ports with per-VC buffers (VC 0 through VC x), routing computation, virtual channel allocator, and switch allocator]

SLIDE 22

Baseline Router Pipeline

  • Canonical 5-stage (+link) pipeline
    – BW: Buffer Write
    – RC: Routing Computation
    – VA: Virtual Channel Allocation
    – SA: Switch Allocation
    – ST: Switch Traversal
    – LT: Link Traversal

BW -> RC -> VA -> SA -> ST -> LT

SLIDE 23

Baseline Router Pipeline

  • Routing computation performed once per packet
  • Virtual channel allocated once per packet
  • Body and tail flits inherit this info from the head flit

Cycle:   1   2   3   4   5   6   7   8   9
Head:    BW  RC  VA  SA  ST  LT
Body 1:      BW          SA  ST  LT
Body 2:          BW          SA  ST  LT
Tail:                BW          SA  ST  LT

SLIDE 24

Router Pipeline Optimizations

  • Baseline (no load) delay:

    $T_{\text{no-load}} = (5\,\text{cycles} + \text{link delay}) \times \text{hops} + t_{\text{serialization}}$

  • Ideally, only pay the link delay
  • Techniques to reduce pipeline stages
    – Lookahead routing: at the current router, perform the routing computation for the next router (NRC)
      • Overlaps with BW

BW+NRC -> VA -> SA -> ST -> LT
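As a quick numeric check of this formula (the hop count, link delay, and serialization here are assumed for illustration, not given on the slide): with 3 hops, 1-cycle links, and 4 cycles of serialization, $T = (5 + 1) \times 3 + 4 = 22$ cycles.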

SLIDE 25

Router Pipeline Optimizations

  • Speculation
    – Assume the virtual channel allocation stage will succeed
      • Valid under low to moderate loads
    – Perform VA and SA entirely in parallel
    – If VA is unsuccessful (no virtual channel returned), VA/SA must be repeated in the next cycle

BW+NRC -> VA/SA -> ST -> LT

SLIDE 26

Router Pipeline Optimizations

  • Bypassing: when no flits are in the input buffer
    – Speculatively enter ST
    – On a port conflict, speculation is aborted
    – In the first stage, a free VC is allocated, next-hop routing is performed, and the crossbar is set up

VA/NRC/Setup -> ST -> LT

SLIDE 27

Buffer Organization

  • Single buffer per input
  • Multiple fixed-length queues per physical channel

[Diagram: buffer organization for physical channels vs. virtual channels]

SLIDE 28

Arbiters and Allocators

  • An allocator matches N requests to M resources
  • An arbiter matches N requests to 1 resource
  • Resources are VCs (for virtual channel routers) and crossbar switch ports
  • Virtual channel allocator (VA)
    – Resolves contention for output virtual channels
    – Grants them to input virtual channels
  • Switch allocator (SA)
    – Grants crossbar switch ports to input virtual channels

SLIDE 29

Round Robin Arbiter

  • The last request serviced is given the lowest priority
  • Generate the next priority vector from the current grant vector
  • Exhibits fairness
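A behavioral sketch of this policy (an RTL arbiter would use rotated one-hot priority vectors; the scan loop below is a simplification for illustration):

```python
# Round-robin arbiter: scan for a requester starting just past the
# previous grantee, so the last grantee has lowest priority.

class RoundRobinArbiter:
    def __init__(self, n):
        self.n = n
        self.last = n - 1          # so port 0 has highest priority initially

    def arbitrate(self, requests):  # requests: list of n bools
        for i in range(1, self.n + 1):
            idx = (self.last + i) % self.n
            if requests[idx]:
                self.last = idx     # update priority for the next cycle
                return idx
        return None                 # no requests this cycle

arb = RoundRobinArbiter(4)
print(arb.arbitrate([True, False, True, False]))  # grants 0
print(arb.arbitrate([True, False, True, False]))  # grants 2 (fairness)
```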

SLIDE 30

Crossbar Dimension Slicing

  • Crossbar area and power grow as O((pw)^2)
  • Replace one 5x5 crossbar with two 3x3 crossbars

[Diagram: dimension-sliced crossbar; inputs Inject, E-in, W-in, N-in, S-in; outputs E-out, W-out, N-out, S-out, Eject]
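A back-of-the-envelope check (assuming equal port widths $w$): one 5x5 crossbar has $5 \times 5 = 25$ crosspoints, while two 3x3 crossbars have $2 \times 3 \times 3 = 18$. With area and power growing as $O((pw)^2)$, each 3x3 slice costs roughly $(3/5)^2 = 0.36$ of the full crossbar, so the sliced design totals about 72% of the original.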

SLIDE 31

Crossbar Speedup

  • Increases internal switch bandwidth
  • Simplifies allocation, or gives better performance with a simple allocator
  • Output speedup requires output buffers
    – Multiplex onto the physical link

[Diagram: 10:5 crossbar, 5:10 crossbar, 10:10 crossbar]

SLIDE 32

Evaluating Interconnection Networks

  • Network latency
    – Zero-load latency: average distance * latency per unit distance
  • Accepted traffic
    – Measure the maximum traffic accepted by the network before it reaches saturation
  • Cost
    – Power, area, packaging
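A hedged sketch of the zero-load latency calculation (the router delay, link delay, and packet length are assumed values; the 2k/3 average-hop figure is the standard approximation for uniform random traffic on a k x k mesh, not from the slide):

```python
# Zero-load latency ~ average distance * latency per unit distance,
# plus serialization of the packet's flits.

def zero_load_latency(avg_hops, router_cycles, link_cycles, flits_per_pkt):
    return avg_hops * (router_cycles + link_cycles) + flits_per_pkt

# 8x8 mesh, uniform random traffic: average hop count ~ 2k/3 = 16/3
print(zero_load_latency(16 / 3, router_cycles=5,
                        link_cycles=1, flits_per_pkt=4))  # -> 36.0
```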

SLIDE 33

Interconnection Network Evaluation

  • Trace-based
    – Synthetic traces
      • Injection process: periodic, Bernoulli, bursty
    – Workload traces
  • Full-system simulation

SLIDE 34

Traffic Patterns

  • Uniform random
    – Each source equally likely to send to each destination
    – Does not do a good job of identifying load imbalances in a design
  • Permutation (several variations)
    – Each source sends to one destination
  • Hot-spot traffic
    – All sources send to one (or a small number of) destinations
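Sketches of the three synthetic patterns (the node count, the bit-complement permutation variant, and the hot-spot probability are illustrative assumptions):

```python
import random

N = 16  # number of nodes (assumed power of two)

def uniform_random(src):
    return random.randrange(N)       # destination independent of source

def bit_complement(src):             # one classic permutation variant
    return src ^ (N - 1)             # each source -> one fixed destination

def hot_spot(src, hot=0, p_hot=0.5):
    # with probability p_hot, traffic targets the single hot node
    return hot if random.random() < p_hot else uniform_random(src)
```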

SLIDE 35

Microarchitecture Summary

  • Ties together topology, routing, and flow control design decisions
  • Pipelined for fast cycle times
  • Area and power constraints are important in the NoC design space

SLIDE 36

Interconnection Network Summary

  • Latency vs. Offered Traffic

[Plot: latency (y-axis) vs. offered traffic in bits/sec (x-axis); annotations: minimum latency given by topology; minimum latency given by routing algorithm; zero-load latency (topology + routing + flow control); throughput given by topology; throughput given by routing; throughput given by flow control]

SLIDE 37

Multi-core Designs

  • Use available transistors efficiently
    – Provide better perf, perf/cost, perf/watt
  • Effectively share expensive resources
    – Socket/pins:
      • DRAM interface
      • Coherence interface
      • I/O interface

SLIDE 38

High-Level Design Issues

  • 1. Where to connect cores?
    – Time to market:
      • at off-chip bus (Pentium D)
      • at coherence interconnect (Opteron)
    – Requires substantial (re)design:
      • at L2 (Power 4, Core Duo, Core 2 Duo)
      • at L3 (Opteron, Itanium)

SLIDE 39

High-Level Design Issues

  • 2. Share caches?
    – Yes: all designs that connect at L2 or L3
    – No: all designs that don't
  • 3. Coherence?
    – Private caches? Reuse existing MP/socket coherence
    – Shared caches?
      • Need a new coherence protocol for on-chip caches
      • Often write-through L1 with back-invalidates for other caches

SLIDE 40

High-Level Design Issues

  • 4. How to connect?
    – Off-chip bus? Time-to-market hack, not scalable
    – Existing pt-to-pt coherence interconnect (HyperTransport)
    – Shared L2/L3:
      • Crossbar, up to 3-4 cores (8 weak cores in Niagara)
      • 1D "dancehall" organization
    – On-chip bus? Not scalable (8 weak cores in Piranha)
    – Interconnection network
      • Scalable, but high overhead
      • E.g., 2D tiled organization, mesh interconnect

SLIDE 41

Private vs shared caches

  • Advantages of private:
    – Closer to the core, so faster access
    – Reduces contention
  • Advantages of shared:
    – Threads on different cores can share the same cache data
    – More cache space is available if a single (or a few) high-performance thread runs on the system

SLIDE 42

Cache Coherence

  • Coherence
    – All reads by any processor must return the most recently written value
    – Writes to the same location by any two processors are seen in the same order by all processors
  • Consistency
    – Determines when a written value will be returned by a read
    – If a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A

SLIDE 43


The cache coherence problem

  • Since we have private caches: how do we keep the data consistent across caches?
  • Each core should perceive the memory as a monolithic array, shared by all the cores

SLIDE 44

The cache coherence problem

Suppose variable x initially contains 15213

[Diagram: a multi-core chip with four cores (Core 1 through Core 4), each with one or more levels of cache, sharing main memory; main memory holds x=15213]

SLIDE 45

The cache coherence problem

Core 1 reads x

[Diagram: Core 1's cache now holds x=15213; main memory holds x=15213]

SLIDE 46

The cache coherence problem

Core 2 reads x

[Diagram: Core 1's and Core 2's caches each hold x=15213; main memory holds x=15213]

SLIDE 47

The cache coherence problem

Core 1 writes to x, setting it to 21660

[Diagram: Core 1's cache holds x=21660; Core 2's cache still holds x=15213; main memory holds x=21660 (assuming write-through caches)]

SLIDE 48

The cache coherence problem

Core 2 attempts to read x… gets a stale copy

[Diagram: Core 2 reads x=15213 from its own cache, a stale copy; Core 1's cache and main memory hold x=21660]

SLIDE 49

Solutions for cache coherence

  • This is a general problem with multiprocessors, not limited just to multi-core
  • Many solution algorithms, coherence protocols, etc. exist
  • A simple solution: an invalidation-based protocol with snooping (sketched below)
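As a toy illustration of that idea (the lecture only names the approach; the classes and bus model below are my assumptions for a write-through, write-invalidate sketch, not the lecture's protocol), this reproduces the x=15213 -> 21660 scenario from the previous slides:

```python
# Minimal snoopy write-invalidate sketch with write-through caches.

class Bus:
    def __init__(self):
        self.caches = []

    def broadcast_invalidate(self, writer, addr):
        for c in self.caches:          # every other cache snoops the bus
            if c is not writer:
                c.data.pop(addr, None) # and invalidates its copy

class Cache:
    def __init__(self, bus):
        self.data, self.bus = {}, bus
        bus.caches.append(self)

    def read(self, addr, memory):
        if addr not in self.data:      # miss: fetch from memory
            self.data[addr] = memory[addr]
        return self.data[addr]

    def write(self, addr, value, memory):
        memory[addr] = value           # write-through to main memory
        self.data[addr] = value
        self.bus.broadcast_invalidate(self, addr)

mem = {'x': 15213}
bus = Bus()
c1, c2 = Cache(bus), Cache(bus)
c1.read('x', mem); c2.read('x', mem)   # both caches hold x=15213
c1.write('x', 21660, mem)              # invalidates c2's stale copy
print(c2.read('x', mem))               # 21660: re-fetched from memory
```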
