Multi-core Architectures: Interconnect Technology - Virendra Singh (PowerPoint PPT Presentation)

SLIDE 1

Multi-core Architectures

Interconnect Technology

Virendra Singh

Associate Professor, Computer Architecture and Dependable Systems Lab, Department of Electrical Engineering, Indian Institute of Technology Bombay

http://www.ee.iitb.ac.in/~viren/ E-mail: viren@ee.iitb.ac.in

CS-683: Advanced Computer Architecture

Lecture 29 (30 Oct 2013)

slide-2
SLIDE 2

CADSL

Topology Summary

  • First network design decision
  • Critical impact on network latency and throughput
    – Hop count provides a first-order approximation of message latency
    – Bottleneck channels determine saturation throughput

SLIDE 3

Routing Summary

  • Latency is the paramount concern
    – Minimal routing most common for NoC
    – Non-minimal routing can avoid congestion and deliver low latency
  • To date, NoC research favors DOR for simplicity and deadlock freedom
    – On-chip networks often lightly loaded
  • Only unicast routing covered here
    – Recent work extends on-chip routing to support multicast

SLIDE 4

Switching/Flow Control Overview

  • Topology: determines connectivity of the network
  • Routing: determines paths through the network
  • Flow Control: determines allocation of resources to messages as they traverse the network
    – Buffers and links
    – Significant impact on throughput and latency of the network

SLIDE 5

Packets

  • Messages: composed of one or more packets
    – If message size <= maximum packet size, only one packet is created
  • Packets: composed of one or more flits
  • Flit: flow control digit
  • Phit: physical digit
    – Subdivides a flit into chunks equal to the link width
    – In on-chip networks, flit size == phit size, due to very wide on-chip channels
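To make the message/packet/flit hierarchy concrete, here is a minimal sketch in Python; the byte sizes are assumptions for illustration, not values from the lecture.

```python
# Sketch: segmenting a message into packets, and packets into flits
# (sizes are hypothetical).
MAX_PACKET_BYTES = 64   # maximum packet size (assumed)
FLIT_BYTES = 16         # flit size; on-chip, phit size == flit size

def packetize(message: bytes):
    """Split a message into packets, and each packet into flits."""
    packets = [message[i:i + MAX_PACKET_BYTES]
               for i in range(0, len(message), MAX_PACKET_BYTES)]
    # A message no larger than MAX_PACKET_BYTES yields only one packet.
    return [[pkt[j:j + FLIT_BYTES] for j in range(0, len(pkt), FLIT_BYTES)]
            for pkt in packets]

flits = packetize(bytes(100))   # 100-byte message -> 2 packets (64 + 36 bytes)
print([len(p) for p in flits])  # flits per packet: [4, 3]
```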

SLIDE 6

Switching

  • Different flow control techniques based on granularity
  • Circuit-switching: operates at the granularity of messages
  • Packet-based: allocation made to whole packets
  • Flit-based: allocation made on a flit-by-flit basis

SLIDE 7

Virtual Cut Through

  • Packet-based: similar to store-and-forward (SAF)
  • Links and buffers are allocated to entire packets
  • Flits can proceed to the next hop before the tail flit has been received by the current router
    – But only if the next router has enough buffer space for the entire packet
  • Reduces latency significantly compared to SAF
  • But still requires large buffers

SLIDE 8

Virtual Cut Through Example

  • Lower per-hop latency
  • Larger buffering required

SLIDE 9

Flit Level Flow Control

  • Wormhole flow control
  • A flit can proceed to the next router when there is buffer space available for that flit
    – Improves over SAF and VCT by allocating buffers on a per-flit basis
  • Pros
    – More efficient buffer utilization (good for on-chip)
    – Low latency
  • Cons
    – Poor link utilization: if the head flit becomes blocked, the channels the packet holds remain idle
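A compact way to see the contrast among store-and-forward, virtual cut-through, and wormhole is the downstream-buffer check each performs before a flit advances. This sketch is illustrative; the function names are mine, not from the slides.

```python
# Sketch: the condition each switching technique checks before a flit
# advances to the next router (buffer counts in flits).

def can_advance_saf(pkt_flits, free_bufs, pkt_received):
    # Store-and-forward: the whole packet must have been received AND
    # the downstream router must be able to buffer the entire packet.
    return pkt_received and free_bufs >= pkt_flits

def can_advance_vct(pkt_flits, free_bufs):
    # Virtual cut-through: flits may run ahead of the tail, but only if
    # the downstream router can buffer the ENTIRE packet.
    return free_bufs >= pkt_flits

def can_advance_wormhole(free_bufs):
    # Wormhole: per-flit allocation; one free flit buffer suffices.
    return free_bufs >= 1
```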

SLIDE 10

Wormhole Example

  • 6 flit buffers/input port

[Diagram annotations: "Blocked by other packets"; "Channel idle but red packet blocked behind blue"; "Buffer full: blue cannot proceed"; "Red holds this channel: channel remains idle until red proceeds"]

SLIDE 11

Virtual Channel Flow Control

  • Virtual channels used to combat HOL blocking in wormhole
  • Virtual channels: multiple flit queues per input port
    – Share the same physical link (channel)
  • Link utilization improved
    – Flits on different VCs can pass a blocked packet

SLIDE 12

Virtual Channel Example

  • 6 flit buffers/input port
  • 3 flit buffers/VC

[Diagram annotations: "Blocked by other packets"; "Buffer full: blue cannot proceed"]

SLIDE 13

Deadlock

  • Using flow control to guarantee deadlock freedom gives more flexible routing
  • Escape virtual channels
    – If the routing algorithm is not deadlock-free, VCs can break the resource cycle
    – Place a restriction on VC allocation, or require one VC to be DOR
  • Assign different message classes to different VCs to prevent protocol-level deadlock
    – Prevents req-ack message cycles

SLIDE 14

Buffer Backpressure

  • Need a mechanism to prevent buffer overflow
    – Avoid dropping packets
    – Upstream nodes need to know buffer availability at downstream routers
  • Significant impact on the throughput achieved by flow control
  • Credits
  • On-off

SLIDE 15

Credit-Based Flow Control

  • The upstream router stores a credit count for each downstream VC
  • Upstream router:
    – When a flit is forwarded, decrement the credit count
    – Count == 0: downstream buffer full, stop sending
  • Downstream router:
    – When a flit is forwarded and a buffer is freed, send a credit to the upstream router
    – Upstream router increments its credit count
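A minimal sketch of this mechanism in Python (the class names, buffer depth, and link model are assumptions for illustration, not the lecture's code):

```python
# Credit-based backpressure for one VC.

class UpstreamVC:
    def __init__(self, downstream_buf_depth):
        self.credits = downstream_buf_depth  # one credit per downstream slot

    def try_send(self, flit, link):
        if self.credits == 0:
            return False          # count == 0: downstream buffer full, stall
        self.credits -= 1         # decrement on every flit forwarded
        link.append(flit)
        return True

    def on_credit(self):
        self.credits += 1         # credit returned: a downstream slot freed

class DownstreamVC:
    def __init__(self, upstream):
        self.buf, self.upstream = [], upstream

    def receive(self, flit):
        self.buf.append(flit)

    def forward(self):
        flit = self.buf.pop(0)    # flit leaves: buffer entry freed
        self.upstream.on_credit() # send a credit back upstream
        return flit
```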

SLIDE 16

Credit Timeline

  • Round-trip credit delay:
    – Time between when a buffer empties and when the next flit can be processed from that buffer entry
    – With only a single-entry buffer, this would cause significant throughput degradation
    – Important to size buffers to tolerate the credit turnaround time

[Timeline: a flit departs Node 1's router at t1; Node 2 processes it and returns a credit (t2, t3); Node 1 processes the credit and the next flit departs at t5; t1 through t5 is the credit round-trip delay]

SLIDE 17

On-Off Flow Control

  • Credits require upstream signaling for every flit
  • On-off decreases upstream signaling
  • Off signal
    – Sent when the number of free buffers falls below threshold F_off
  • On signal
    – Sent when the number of free buffers rises above threshold F_on
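A sketch of the receiving side's on-off signaling (the buffer depth and threshold values are assumptions for illustration; a real design sizes them around the round-trip wire delay):

```python
# On-off backpressure at the downstream node. F_ON > F_OFF gives
# hysteresis so the signal does not toggle on every flit.
BUF_DEPTH, F_OFF, F_ON = 8, 2, 5

class OnOffReceiver:
    def __init__(self):
        self.buf, self.stop = [], False

    def receive(self, flit):
        self.buf.append(flit)
        free = BUF_DEPTH - len(self.buf)
        if free < F_OFF and not self.stop:
            self.stop = True      # signal "off": free buffers fell below F_off

    def drain(self):
        if self.buf:
            self.buf.pop(0)
        free = BUF_DEPTH - len(self.buf)
        if free > F_ON and self.stop:
            self.stop = False     # signal "on": free buffers rose above F_on
```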

SLIDE 18

On-Off Timeline

  • Less signaling but more buffering
    – On-chip, buffers are more expensive than wires

[Timeline: Node 1 sends flits from t1; the F_off threshold is reached at t2 and Node 2 signals "off"; F_off is set to prevent flits arriving before t4 from overflowing; the F_on threshold is reached later and Node 2 signals "on"; F_on is set so that Node 2 does not run out of flits between t5 and t8]

SLIDE 19

Flow Control Summary

  • On-chip networks require techniques with lower buffering requirements
    – Wormhole or virtual channel flow control
  • Dropping packets is unacceptable in the on-chip environment
    – Requires a buffer backpressure mechanism
  • Complexity of flow control impacts the router microarchitecture (next)

SLIDE 20

Router Microarchitecture Overview

  • Consists of buffers, switches, functional units, and control logic that implement the routing algorithm and flow control
  • Focus here: the microarchitecture of a virtual channel router
  • The router is pipelined to reduce cycle time

SLIDE 21

Virtual Channel Router

[Diagram: virtual channel router microarchitecture: input ports with per-VC buffers (VC 0 through VC x), routing computation, virtual channel allocator, and switch allocator]

SLIDE 22

Baseline Router Pipeline

  • Canonical 5-stage (+link) pipeline
    – BW: Buffer Write
    – RC: Routing Computation
    – VA: Virtual Channel Allocation
    – SA: Switch Allocation
    – ST: Switch Traversal
    – LT: Link Traversal

BW -> RC -> VA -> SA -> ST -> LT

SLIDE 23

Baseline Router Pipeline

  • Routing computation performed once per packet
  • Virtual channel allocated once per packet
  • Body and tail flits inherit this info from the head flit

Cycle:   1   2   3   4   5   6   7   8   9
Head:    BW  RC  VA  SA  ST  LT
Body 1:      BW          SA  ST  LT
Body 2:          BW          SA  ST  LT
Tail:                BW          SA  ST  LT

SLIDE 24

Router Pipeline Optimizations

  • Baseline (no load) delay:

    $T_{\text{no-load}} = (5\,\text{cycles} + \text{link delay}) \times \text{hops} + t_{\text{serialization}}$

  • Ideally, only pay the link delay
  • Techniques to reduce pipeline stages
    – Lookahead routing: at the current router, perform the routing computation for the next router (NRC)
      • Overlaps with BW

BW+NRC -> VA -> SA -> ST -> LT
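As a quick numeric check of this formula (the hop count, link delay, and serialization here are assumed for illustration, not given on the slide): with 3 hops, 1-cycle links, and 4 cycles of serialization, $T = (5 + 1) \times 3 + 4 = 22$ cycles.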

SLIDE 25

Router Pipeline Optimizations

  • Speculation
    – Assume the virtual channel allocation stage will succeed
      • Valid under low to moderate loads
    – Perform VA and SA entirely in parallel
    – If VA is unsuccessful (no virtual channel returned), VA/SA must be repeated in the next cycle

BW+NRC -> VA/SA -> ST -> LT

SLIDE 26

Router Pipeline Optimizations

  • Bypassing: when no flits are in the input buffer
    – Speculatively enter ST
    – On a port conflict, speculation is aborted
    – In the first stage, a free VC is allocated, next-hop routing is performed, and the crossbar is set up

VA/NRC/Setup -> ST -> LT

SLIDE 27

Buffer Organization

  • Single buffer per input
  • Multiple fixed-length queues per physical channel

[Diagram: buffer organization for physical channels vs. virtual channels]

SLIDE 28

Arbiters and Allocators

  • An allocator matches N requests to M resources
  • An arbiter matches N requests to 1 resource
  • Resources are VCs (for virtual channel routers) and crossbar switch ports
  • Virtual channel allocator (VA)
    – Resolves contention for output virtual channels
    – Grants them to input virtual channels
  • Switch allocator (SA)
    – Grants crossbar switch ports to input virtual channels

SLIDE 29

Round Robin Arbiter

  • The last request serviced is given the lowest priority
  • Generate the next priority vector from the current grant vector
  • Exhibits fairness
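A behavioral sketch of this policy (an RTL arbiter would use rotated one-hot priority vectors; the scan loop below is a simplification for illustration):

```python
# Round-robin arbiter: scan for a requester starting just past the
# previous grantee, so the last grantee has lowest priority.

class RoundRobinArbiter:
    def __init__(self, n):
        self.n = n
        self.last = n - 1          # so port 0 has highest priority initially

    def arbitrate(self, requests):  # requests: list of n bools
        for i in range(1, self.n + 1):
            idx = (self.last + i) % self.n
            if requests[idx]:
                self.last = idx     # update priority for the next cycle
                return idx
        return None                 # no requests this cycle

arb = RoundRobinArbiter(4)
print(arb.arbitrate([True, False, True, False]))  # grants 0
print(arb.arbitrate([True, False, True, False]))  # grants 2 (fairness)
```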

SLIDE 30

Crossbar Dimension Slicing

  • Crossbar area and power grow as O((pw)^2)
  • Replace one 5x5 crossbar with two 3x3 crossbars

[Diagram: dimension-sliced crossbar; inputs Inject, E-in, W-in, N-in, S-in; outputs E-out, W-out, N-out, S-out, Eject]
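A back-of-the-envelope check (assuming equal port widths $w$): one 5x5 crossbar has $5 \times 5 = 25$ crosspoints, while two 3x3 crossbars have $2 \times 3 \times 3 = 18$. With area and power growing as $O((pw)^2)$, each 3x3 slice costs roughly $(3/5)^2 = 0.36$ of the full crossbar, so the sliced design totals about 72% of the original.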

SLIDE 31

Crossbar Speedup

  • Increases internal switch bandwidth
  • Simplifies allocation, or gives better performance with a simple allocator
  • Output speedup requires output buffers
    – Multiplex onto the physical link

[Diagram: 10:5 crossbar, 5:10 crossbar, 10:10 crossbar]

SLIDE 32

Evaluating Interconnection Networks

  • Network latency
    – Zero-load latency: average distance * latency per unit distance
  • Accepted traffic
    – Measure the maximum traffic accepted by the network before it reaches saturation
  • Cost
    – Power, area, packaging
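A hedged sketch of the zero-load latency calculation (the router delay, link delay, and packet length are assumed values; the 2k/3 average-hop figure is the standard approximation for uniform random traffic on a k x k mesh, not from the slide):

```python
# Zero-load latency ~ average distance * latency per unit distance,
# plus serialization of the packet's flits.

def zero_load_latency(avg_hops, router_cycles, link_cycles, flits_per_pkt):
    return avg_hops * (router_cycles + link_cycles) + flits_per_pkt

# 8x8 mesh, uniform random traffic: average hop count ~ 2k/3 = 16/3
print(zero_load_latency(16 / 3, router_cycles=5,
                        link_cycles=1, flits_per_pkt=4))  # -> 36.0
```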

SLIDE 33

Interconnection Network Evaluation

  • Trace-based
    – Synthetic traces
      • Injection process: periodic, Bernoulli, bursty
    – Workload traces
  • Full-system simulation

SLIDE 34

Traffic Patterns

  • Uniform random
    – Each source equally likely to send to each destination
    – Does not do a good job of identifying load imbalances in a design
  • Permutation (several variations)
    – Each source sends to one destination
  • Hot-spot traffic
    – All sources send to one (or a small number of) destinations
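Sketches of the three synthetic patterns (the node count, the bit-complement permutation variant, and the hot-spot probability are illustrative assumptions):

```python
import random

N = 16  # number of nodes (assumed power of two)

def uniform_random(src):
    return random.randrange(N)       # destination independent of source

def bit_complement(src):             # one classic permutation variant
    return src ^ (N - 1)             # each source -> one fixed destination

def hot_spot(src, hot=0, p_hot=0.5):
    # with probability p_hot, traffic targets the single hot node
    return hot if random.random() < p_hot else uniform_random(src)
```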

SLIDE 35

Microarchitecture Summary

  • Ties together topology, routing, and flow control design decisions
  • Pipelined for fast cycle times
  • Area and power constraints are important in the NoC design space

SLIDE 36

Interconnection Network Summary

  • Latency vs. Offered Traffic

[Plot: latency (y-axis) vs. offered traffic in bits/sec (x-axis); annotations: minimum latency given by topology; minimum latency given by routing algorithm; zero-load latency (topology + routing + flow control); throughput given by topology; throughput given by routing; throughput given by flow control]

SLIDE 37

Multi-core Designs

  • Use available transistors efficiently
    – Provide better perf, perf/cost, perf/watt
  • Effectively share expensive resources
    – Socket/pins:
      • DRAM interface
      • Coherence interface
      • I/O interface

SLIDE 38

High-Level Design Issues

  • 1. Where to connect cores?
    – Time to market:
      • at off-chip bus (Pentium D)
      • at coherence interconnect (Opteron)
    – Requires substantial (re)design:
      • at L2 (Power 4, Core Duo, Core 2 Duo)
      • at L3 (Opteron, Itanium)

SLIDE 39

High-Level Design Issues

  • 2. Share caches?
    – Yes: all designs that connect at L2 or L3
    – No: all designs that don't
  • 3. Coherence?
    – Private caches? Reuse existing MP/socket coherence
    – Shared caches?
      • Need a new coherence protocol for on-chip caches
      • Often write-through L1 with back-invalidates for other caches

SLIDE 40

High-Level Design Issues

  • 4. How to connect?
    – Off-chip bus? Time-to-market hack, not scalable
    – Existing pt-to-pt coherence interconnect (HyperTransport)
    – Shared L2/L3:
      • Crossbar, up to 3-4 cores (8 weak cores in Niagara)
      • 1D "dancehall" organization
    – On-chip bus? Not scalable (8 weak cores in Piranha)
    – Interconnection network
      • Scalable, but high overhead
      • E.g., 2D tiled organization, mesh interconnect

SLIDE 41

Private vs shared caches

  • Advantages of private:
    – Closer to the core, so faster access
    – Reduces contention
  • Advantages of shared:
    – Threads on different cores can share the same cache data
    – More cache space is available if a single (or a few) high-performance thread runs on the system

SLIDE 42

Cache Coherence

  • Coherence
    – All reads by any processor must return the most recently written value
    – Writes to the same location by any two processors are seen in the same order by all processors
  • Consistency
    – Determines when a written value will be returned by a read
    – If a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A

SLIDE 43


The cache coherence problem

  • Since we have private caches: how do we keep the data consistent across caches?
  • Each core should perceive the memory as a monolithic array, shared by all the cores

SLIDE 44

The cache coherence problem

Suppose variable x initially contains 15213

[Diagram: a multi-core chip with four cores (Core 1 through Core 4), each with one or more levels of cache, sharing main memory; main memory holds x=15213]

SLIDE 45

The cache coherence problem

Core 1 reads x

[Diagram: Core 1's cache now holds x=15213; main memory holds x=15213]

SLIDE 46

The cache coherence problem

Core 2 reads x

[Diagram: Core 1's and Core 2's caches each hold x=15213; main memory holds x=15213]

SLIDE 47

The cache coherence problem

Core 1 writes to x, setting it to 21660

[Diagram: Core 1's cache holds x=21660; Core 2's cache still holds x=15213; main memory holds x=21660 (assuming write-through caches)]

SLIDE 48

The cache coherence problem

Core 2 attempts to read x… gets a stale copy

[Diagram: Core 2 reads x=15213 from its own cache, a stale copy; Core 1's cache and main memory hold x=21660]

SLIDE 49

Solutions for cache coherence

  • This is a general problem with multiprocessors, not limited just to multi-core
  • Many solution algorithms, coherence protocols, etc. exist
  • A simple solution: an invalidation-based protocol with snooping (sketched below)
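As a toy illustration of that idea (the lecture only names the approach; the classes and bus model below are my assumptions for a write-through, write-invalidate sketch, not the lecture's protocol), this reproduces the x=15213 -> 21660 scenario from the previous slides:

```python
# Minimal snoopy write-invalidate sketch with write-through caches.

class Bus:
    def __init__(self):
        self.caches = []

    def broadcast_invalidate(self, writer, addr):
        for c in self.caches:          # every other cache snoops the bus
            if c is not writer:
                c.data.pop(addr, None) # and invalidates its copy

class Cache:
    def __init__(self, bus):
        self.data, self.bus = {}, bus
        bus.caches.append(self)

    def read(self, addr, memory):
        if addr not in self.data:      # miss: fetch from memory
            self.data[addr] = memory[addr]
        return self.data[addr]

    def write(self, addr, value, memory):
        memory[addr] = value           # write-through to main memory
        self.data[addr] = value
        self.bus.broadcast_invalidate(self, addr)

mem = {'x': 15213}
bus = Bus()
c1, c2 = Cache(bus), Cache(bus)
c1.read('x', mem); c2.read('x', mem)   # both caches hold x=15213
c1.write('x', 21660, mem)              # invalidates c2's stale copy
print(c2.read('x', mem))               # 21660: re-fetched from memory
```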
