ON-CHIP NETWORK INNOVATIONS: Mahdi Nazm Bojnordi, Assistant Professor



SLIDE 1

ON-CHIP NETWORK INNOVATIONS

CS/ECE 7810: Advanced Computer Architecture

Mahdi Nazm Bojnordi

Assistant Professor, School of Computing, University of Utah

SLIDE 2

Overview

¨ Upcoming deadline

¤ Feb. 3rd: project group formation
¤ No groups have sent me emails!

¨ This lecture

¤ Basics of interconnection networks
¤ Network topologies
¤ Flow control
¤ Routing algorithms
¤ Emerging on-chip networks

SLIDE 3

On-chip Interconnection Networks

¨ An infrastructure connecting various components in current and future ICs

[Figure: CPU and Mem nodes attached to an Interconnection Network]

Mesh is the most commonly employed topology because of its scalability.

SLIDE 4

Network Topology

SLIDE 5

Network Topologies

¨ Regular vs. irregular graphs

¤ Examples of regular networks are mesh and ring

¨ Distances in the network

¤ Routing distance: number of links/hops along a route
¤ Network diameter: maximum routing distance over all source-destination pairs
¤ Average distance: average number of links/hops across all valid routes
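The distance metrics above can be checked numerically. The following is a minimal sketch, assuming shortest-path routing and bidirectional links; the helper names (`ring`, `mesh`, `hop_stats`) are illustrative and do not come from the lecture.

```python
from collections import deque
from itertools import product


def ring(n):
    """Adjacency list of an n-node bidirectional ring."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def mesh(k):
    """Adjacency list of a k x k 2D mesh (no wraparound links)."""
    adj = {}
    for x, y in product(range(k), repeat=2):
        nbrs = [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
        adj[(x, y)] = [(a, b) for a, b in nbrs if 0 <= a < k and 0 <= b < k]
    return adj


def hop_stats(adj):
    """Return (diameter, average distance) over all ordered pairs of distinct nodes."""
    longest, total, pairs = 0, 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search yields shortest hop counts
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        longest = max(longest, max(dist.values()))
        total += sum(dist.values())
        pairs += len(dist) - 1
    return longest, total / pairs


if __name__ == "__main__":
    for name, topo in (("16-node ring", ring(16)), ("4x4 mesh", mesh(4))):
        diameter, average = hop_stats(topo)
        print(f"{name}: diameter={diameter}, average distance={average:.2f}")
```

With the same node count (16), the mesh has both a smaller diameter and a smaller average distance than the ring under these assumptions.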

SLIDE 6

Example Topologies

¨ Bus

¤ Simple structure; efficient for a small number of nodes
¤ Not scalable; highly contended
¤ Used in many processors

[Figure: Bus vs. Point to Point]

SLIDE 7

Example Topologies

¨ Crossbar

¤ Complex arbitration
¤ High throughput and fast
¤ Requires a lot of resources
¤ Used in Sun Niagara I/II [UltraSPARC T1]

[Figure: crossbar connecting nodes 1 to 5]

SLIDE 8

Example Topologies

¨ Segmented crossbar

¤ Reduces switching capacitance (~15-30%)
¤ Needs a few additional signals to control tri-states [Wang’03]

SLIDE 9

Example Topologies

¨ Goal: optimize for the common case

¤ Straight-through traffic does not go through tri-state buffers [Wang’03]

¨ Some combinations of turns are not allowed

¤ Why? Read the paper for details.

SLIDE 10

Example Topologies

¨ Express channels to reduce number of hops

¤ like taking the freeway [Wang’03]

SLIDE 11

Example Topologies

¨ Ring

¤ Cheap; long latency
¤ IBM Cell

¨ Mesh

¤ Path diversity, efficient
¤ Tilera 100-core

¨ Torus

¤ More path diversity
¤ Expensive and complex

SLIDE 12

Example Topologies

¨ Tree

¤ Simple and low cost
¤ Easy to lay out
¤ Efficiently handles local traffic
¤ Towards the root, links are heavily contended

[Figure: Fat Tree]

SLIDE 13

Example Topologies

¨ Omega network

¤ Single path from source to destination
¤ Does not support all possible permutations
¤ Proposed to replace costly crossbars as processor-memory interconnect [Gottlieb’82]
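The single-path and blocking properties above can be illustrated with destination-tag routing. This is a minimal sketch, assuming an omega network built from 2x2 switches with a perfect shuffle in front of every stage; `omega_path` and `conflicts` are hypothetical helper names, not code from [Gottlieb’82].

```python
def omega_path(src, dst, n_bits):
    """Positions a packet occupies after each stage of a 2**n_bits-port omega network."""
    mask = (1 << n_bits) - 1
    pos, path = src, []
    for stage in range(n_bits):
        # Perfect shuffle before the stage: rotate the n-bit position left by one.
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask
        # The 2x2 switch selects its upper (0) or lower (1) output from one
        # destination bit, most significant bit first.
        bit = (dst >> (n_bits - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path


def conflicts(permutation, n_bits):
    """Return the (stage, position) switch outputs needed by more than one packet."""
    users = {}
    for src, dst in enumerate(permutation):
        for stage, pos in enumerate(omega_path(src, dst, n_bits)):
            users.setdefault((stage, pos), []).append(src)
    return {link: srcs for link, srcs in users.items() if len(srcs) > 1}


if __name__ == "__main__":
    print(conflicts([0, 1, 2, 3], 2))  # identity permutation on 4 ports
    print(conflicts([0, 2, 1, 3], 2))  # a permutation that collides in stage 0
```

Because each (source, destination) pair has exactly one path, any two packets that need the same switch output cannot be routed in the same pass; the second permutation in the usage example is one such case.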

SLIDE 14

Flow Control

SLIDE 15

Sending Data in Network

¨ Circuit switching

¤ Establish full path; then send data
¤ Everyone else using the same link has to wait
¤ Setup overheads

¨ Packet switching

¤ Route individual packets (possibly via different paths)
¤ More flexible than circuit switching
¤ May be slower than circuit switching

SLIDE 16

Handling Contention

¨ Problem

¤ Two packets want to use the same link at the same time

¨ Possible solutions

¤ Drop one
¤ Misroute one (deflection)
¤ Buffer one

SLIDE 17

Circuit Switching Example

[Figure labels: Configuration Probe, Acknowledgement, Data, Circuit] [Lipasti]

¨ Significant latency overhead prior to data transfer
¨ Other requests forced to wait for resources

SLIDE 18

Store and Forward Example

¨ High per-hop latency
¨ Larger buffering required

[Lipasti]

SLIDE 19

Virtual Cut Through Example

¨ Lower per-hop latency
¨ Larger buffering required

[Lipasti]
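The per-hop latency difference between store-and-forward and virtual cut-through can be made concrete with the usual zero-load (contention-free) latency model. This is a sketch under that assumption; the parameter names and the example values are hypothetical, not taken from the slides.

```python
def store_and_forward_latency(hops, router_delay, packet_bits, link_bits_per_cycle):
    """Each hop waits for the whole packet before forwarding it."""
    serialization = packet_bits / link_bits_per_cycle
    return hops * (router_delay + serialization)


def cut_through_latency(hops, router_delay, packet_bits, link_bits_per_cycle):
    """The header is forwarded immediately; serialization is paid only once."""
    serialization = packet_bits / link_bits_per_cycle
    return hops * router_delay + serialization


if __name__ == "__main__":
    # Assumed example: 5 hops, 2-cycle routers, 512-bit packet, 128-bit links.
    print(store_and_forward_latency(5, 2, 512, 128))
    print(cut_through_latency(5, 2, 512, 128))
```

In this model the serialization delay is multiplied by the hop count only for store-and-forward, which is why its per-hop latency is higher.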

SLIDE 20

Wormhole Example

¨ Allocating buffers on a flit basis

[Figure callouts, Lipasti]
¤ Blocked by other packets
¤ Channel idle but red packet blocked behind blue
¤ Buffer full: blue cannot proceed
¤ Red holds this channel: channel remains idle until red proceeds

SLIDE 21

Virtual Channel Example

¨ Multiple flit queues per input port

[Figure callouts, Lipasti]
¤ Blocked by other packets
¤ Buffer full: blue cannot proceed

SLIDE 22

Virtual Channel Buffers

¨ Single buffer per input
¨ Multiple fixed-length queues per physical channel

[Figure: physical channels vs. virtual channels] [Lipasti]

SLIDE 23

Routing Algorithm

SLIDE 24

Types of Routing Algorithms

¨ Deterministic

¤ Always chooses the same path for a communicating source-destination pair

¨ Oblivious

¤ Chooses different paths, without considering network state

¨ Adaptive

¤ Can choose different paths, adapting to the state of the network

SLIDE 25

Deterministic Routing

¨ All packets between the same (source, destination) pair take the same path

¨ Dimension-order routing

¤ E.g., XY routing (used in Cray T3D, and many on-chip networks)

¨ First traverse dimension X, then traverse dimension Y
¨ Deadlock freedom
¨ Could lead to high contention
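XY routing as described above can be sketched in a few lines. This assumes a 2D mesh whose nodes are addressed by `(x, y)` coordinates; the function name `xy_route` is illustrative.

```python
def xy_route(src, dst):
    """Return the nodes visited from src to dst, finishing X before starting Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:  # first traverse dimension X
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:  # then traverse dimension Y
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    print(xy_route((2, 1), (0, 0)))
```

The route depends only on the source and destination, so every packet between the same pair shares the same links, which is the source of the contention noted above.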

SLIDE 26

Oblivious Routing

¨ Valiant’s Algorithm

¤ Randomly choose an intermediate node d’
¤ Route from s to d’ and from d’ to d

¨ Randomizes any traffic pattern

¤ Balances network load
¤ Non-minimal

[Figure: source s, random intermediate node d’, destination d]
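Valiant's algorithm can be sketched as two back-to-back routing phases. The sketch below assumes a k x k mesh and uses XY routing for each phase; both choices, and the names `xy_route` and `valiant_route`, are assumptions made for illustration.

```python
import random


def xy_route(src, dst):
    """Dimension-order route: X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def valiant_route(src, dst, k, rng=random):
    """Route via a uniformly random intermediate node of a k x k mesh."""
    mid = (rng.randrange(k), rng.randrange(k))
    # Phase 1: s to d'. Phase 2: d' to d (drop the duplicated d').
    return mid, xy_route(src, mid) + xy_route(mid, dst)[1:]


if __name__ == "__main__":
    mid, path = valiant_route((0, 0), (3, 3), 4, random.Random(1))
    print("intermediate:", mid, "hops:", len(path) - 1)
```

The intermediate node is chosen without looking at network state, so the route is oblivious, and it can be longer than the shortest path, which is the non-minimal cost listed above.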

SLIDE 27

Oblivious Routing

¨ Minimal Oblivious

¤ d’ must lie within the minimum quadrant
¤ 6 options for d’
¤ Only 3 different paths

¨ Achieves some load balancing, but uses shortest paths

[Figure: source s and destination d with their minimum quadrant]
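The counts quoted above (6 options for d’, 3 different paths) can be reproduced by enumeration. This sketch assumes a destination 2 hops away in X and 1 hop away in Y, and XY routing in both phases; the original figure is not available, so these choices and the helper names are assumptions.

```python
def xy_route(src, dst):
    """Dimension-order route: X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def minimal_oblivious_paths(src, dst):
    """Map every intermediate node in the minimum quadrant to its two-phase path."""
    xs = range(min(src[0], dst[0]), max(src[0], dst[0]) + 1)
    ys = range(min(src[1], dst[1]), max(src[1], dst[1]) + 1)
    paths = {}
    for mid in ((x, y) for x in xs for y in ys):
        paths[mid] = tuple(xy_route(src, mid) + xy_route(mid, dst)[1:])
    return paths


if __name__ == "__main__":
    paths = minimal_oblivious_paths((0, 0), (2, 1))
    print(len(paths), "choices of d',", len(set(paths.values())), "distinct paths")
```

Every enumerated path has the minimal hop count, so load is spread over several routes without lengthening any of them.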

SLIDE 28

Adaptive Routing

¨ Make decisions according to the current state of the network

¨ Local vs. global information

¤ Local state is readily available
¤ Global information is more expensive to collect

[Figure labels: S, d1, d2]

SLIDE 29

Deadlock

¨ No forward progress
¨ Caused by circular dependencies on resources
¨ Each packet waits for a buffer occupied by another packet downstream

[Glass’92]

SLIDE 30

Handling Deadlock

¨ Analyze directions in which packets can turn in the network
¨ Determine the cycles that such turns can form
¨ Prohibit just enough turns to break possible cycles

[Figure: cycles in a 2D mesh and the 4 allowed turns] [Glass’92]
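The effect of prohibiting turns can be checked by building a channel dependency graph and searching it for cycles. The sketch below assumes a k x k mesh with one channel per link direction; the turn sets for XY routing (only X-to-Y turns) and west-first routing (no turns into the west direction) follow their standard definitions, and all identifiers are illustrative rather than taken from [Glass’92].

```python
from itertools import product

DIRS = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
OPPOSITE = {"E": "W", "W": "E", "N": "S", "S": "N"}
# The eight 90-degree turns of a 2D mesh, as (incoming, outgoing) directions.
ALL_TURNS = {(a, b) for a in DIRS for b in DIRS if a != b and b != OPPOSITE[a]}
XY_TURNS = {(a, b) for (a, b) in ALL_TURNS if a in "EW"}
WEST_FIRST_TURNS = {(a, b) for (a, b) in ALL_TURNS if b != "W"}


def dependency_graph(k, turns):
    """Map each channel (node, direction) to the channels a packet may request next."""
    graph = {}
    for x, y, d in product(range(k), range(k), DIRS):
        nx, ny = x + DIRS[d][0], y + DIRS[d][1]
        if not (0 <= nx < k and 0 <= ny < k):
            continue  # no link leaves the mesh in this direction
        nxt = []
        for d2 in DIRS:
            if d2 == d or (d, d2) in turns:  # going straight is always allowed
                mx, my = nx + DIRS[d2][0], ny + DIRS[d2][1]
                if 0 <= mx < k and 0 <= my < k:
                    nxt.append(((nx, ny), d2))
        graph[((x, y), d)] = nxt
    return graph


def has_cycle(graph):
    """Iterative depth-first search for a cycle in a directed graph."""
    white, grey, black = 0, 1, 2
    color = dict.fromkeys(graph, white)
    for root in graph:
        if color[root] != white:
            continue
        color[root] = grey
        stack = [(root, iter(graph[root]))]
        while stack:
            node, children = stack[-1]
            child = next(children, None)
            if child is None:
                color[node] = black
                stack.pop()
            elif color[child] == grey:
                return True  # back edge: circular channel dependency
            elif color[child] == white:
                color[child] = grey
                stack.append((child, iter(graph[child])))
    return False


if __name__ == "__main__":
    for name, turns in (("all turns", ALL_TURNS), ("XY", XY_TURNS),
                        ("west-first", WEST_FIRST_TURNS)):
        print(name, len(turns), "turns, cycle:", has_cycle(dependency_graph(4, turns)))
```

West-first removes only two of the eight turns yet still yields an acyclic dependency graph in this model, which illustrates the "just enough turns" idea.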

SLIDE 31

A Typical Router Architecture

[Figure: router with input ports 1 to N, each holding virtual-channel buffers VC1, VC2, ..., VCv; routing computation, VC arbiter, and switch arbiter blocks; a scheduler; and an N x N crossbar connecting input channels 1 to N to output channels 1 to N]

SLIDE 32

Buffer-less Routing

¨ Routing buffers

¤ Necessary for high-throughput routing
¤ Consume significant chip area and power

▪ 75% of the on-chip network area in the TRIPS prototype [Gratz’06]

¨ Problem with bufferless routing: packets may be deflected forever (livelock)

[Figure: buffered vs. bufferless router; the losing packet is deflected] [Moscibroda’09]

SLIDE 33

Buffer-less Routing

¨ Significant energy improvements (almost 40%)

[Figure: normalized energy split into BufferEnergy, LinkEnergy, and RouterEnergy for the configurations 4x4, 8x milc; 4x4, 16x milc; and 8x8, 16x milc] [Moscibroda’09]

SLIDE 34

Networks for 3D Architectures

SLIDE 35

3D NOC Architectures

¨ Interconnection networks using die-stacking technology

[Figure: stacked layers, each a 2D mesh network, connected by Through Silicon Vias (TSVs)] [Feero’09]

SLIDE 36

Thermal Challenges

¨ Power consumption is more challenging in 3D chips

¤ Longer heat dissipation paths
¤ More transistors on chip; larger power density

¨ Resultant issues for 3D ICs

¤ Higher temperature; more leakage
¤ New set of reliability issues
¤ Performance degradation

SLIDE 37

Current Flow in TSVs

¨ Current flow is data dependent
¨ Every voltage-level switch in a TSV consumes energy
¨ TSV switching has inductive effects

[Eghbal’14]

Can we reduce the switching activity of TSVs?
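One way to make the question concrete is to count bit flips on a set of vertical links and compare raw transmission with an encoded one. The sketch below uses bus-invert coding, a well-known low-power encoding chosen here only for illustration; it is not claimed to be the scheme of [Eghbal’14], and the function names and the assumption that all lines start at 0 are hypothetical.

```python
def transitions(words):
    """Total number of bit flips between consecutive words on a set of wires."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


def bus_invert(words, width):
    """Encode words so that at most width // 2 data lines flip per transfer.

    Returns (encoded_word, invert_flag) pairs; the flag needs one extra wire.
    """
    mask = (1 << width) - 1
    prev, encoded = 0, []  # assumption: all wires start at 0
    for word in words:
        if bin(prev ^ word).count("1") > width // 2:
            word ^= mask  # send the complement and raise the flag
            flag = 1
        else:
            flag = 0
        encoded.append((word, flag))
        prev = word
    return encoded


def on_wire(encoded, width):
    """Pack the flag as an extra most-significant wire for transition counting."""
    return [(flag << width) | word for word, flag in encoded]


if __name__ == "__main__":
    data = [0x00, 0xFF, 0x00, 0xFF]
    enc = bus_invert(data, 8)
    print("raw flips:", transitions([0] + data))
    print("encoded flips:", transitions([0] + on_wire(enc, 8)))
```

The receiver recovers each word by complementing it whenever the flag is set, so the saving costs one extra wire per link group.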

SLIDE 38

Multi-layer Router Architecture

¨ Observation: many data flits (up to 60% of CMP cache data from real workloads) have frequent patterns such as all zeros or all ones
¨ Split router components (crossbar, buffers, etc.) across the third dimension, accounting for the resulting vertical interconnect (via) design overheads

[Park’08]

SLIDE 39

Summary of Possible Optimizations

¨ Architectural solutions for thermal issues

¤ Thermal-aware application layout
¤ Reducing power by reducing voltage
¤ Data compression to lower dynamic power
¤ Data encoding for reducing switching power
¤ etc.

SLIDE 40

Cache Coherence: Intro

SLIDE 41

Communication in Multiprocessors

¨ How do multiple processor cores communicate?

Shared Memory

§ Multiple threads employ shared memory
§ Easy for programmers (loads and stores)

Message Passing

§ Explicit communication through interconnection network
§ Simple hardware

[Figure: Core 1 ... Core N attached to one Shared Memory (left); Core 1 ... Core N, each with its own Mem, attached to an Interconnection Network (right)]

SLIDE 42

Shared Memory Architectures

Uniform Memory Access

¨ Equal latency for all processors
¨ Simple software control

Non-Uniform Memory Access

¨ Access latency is proportional to proximity

¤ Fast local accesses

[Figure: example UMA with Core 1 ... Core 4 sharing one Memory; example NUMA with Core 1 ... Core 4, each with its own Mem and Router]

SLIDE 43

Network Topologies

Shared Network

¨ Low latency
¨ Low bandwidth
¨ Simple control

¤ e.g., bus

Point to Point Network

¨ High latency
¨ High bandwidth
¨ Complex control

¤ e.g., mesh, ring

[Figure: Core 1 ... Core 4, each with Mem and Router, on a shared network (left) and on a point-to-point network (right)]

SLIDE 44

Challenges in Shared Memories

¨ Correctness of an application is influenced by

¤ Memory consistency

▪ All memory instructions appear to execute in the program order
▪ Known to the programmer

¤ Cache coherence

▪ All the processors see the same data for a particular memory address, as they would if there were no caches in the system
▪ Invisible to the programmer

SLIDE 45

Cache Coherence Problem

¨ Multiple copies of each cache block

¤ In main memory and caches

¨ Multiple copies can get inconsistent when writes happen

¤ Solution: propagate writes from one core to others

[Figure: Core 1 ... Core N, each with a private cache (Cache 1 ... Cache N), above Main Memory]

SLIDE 46

Scenario 1: Loading From Memory

¨ Variable A initially has value 0
¨ P1 stores value 1 into A
¨ P2 loads A from memory and sees old value 0

[Figure: P1 and P2, each with a cache, connected by a bus to memory holding A:0]

SLIDE 47

Scenario 2: Loading From Cache

¨ P1 and P2 both have variable A (value 0) in their caches
¨ P1 stores value 1 into A
¨ P2 loads A from its cache and sees old value 0

[Figure: P1 and P2, each with a cache, connected by a bus to memory holding A:0]
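Both stale-read scenarios can be reproduced with a toy model of private caches that have no coherence mechanism. This is a minimal sketch; the `Cache` class, its `write_through` option, and the scenario functions are hypothetical and model a single variable only.

```python
class Cache:
    """Private cache with no coherence mechanism (toy model)."""

    def __init__(self, memory, write_through=False):
        self.memory = memory
        self.write_through = write_through
        self.lines = {}

    def load(self, addr):
        if addr not in self.lines:  # miss: fetch the block from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        if self.write_through:  # otherwise memory stays stale until eviction
            self.memory[addr] = value


def scenario_1():
    """P2 misses and reads stale memory after P1's write-back store."""
    memory = {"A": 0}
    p1, p2 = Cache(memory), Cache(memory)
    p1.store("A", 1)
    return p2.load("A")


def scenario_2(write_through=False):
    """P2 hits on its own stale copy after P1's store."""
    memory = {"A": 0}
    p1, p2 = Cache(memory, write_through), Cache(memory, write_through)
    p1.load("A")
    p2.load("A")
    p1.store("A", 1)
    return p2.load("A")


if __name__ == "__main__":
    print("scenario 1, P2 sees:", scenario_1())
    print("scenario 2, P2 sees:", scenario_2())
    print("scenario 2 with write-through, P2 sees:", scenario_2(write_through=True))
```

In this model, writing through to memory does not help in the second scenario, because P2 never goes back to memory for a block it already holds.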

SLIDE 48

Cache Coherence

¨ The key operation is update/invalidate sent to all or a subset of the cores

¤ Software based management

▪ Flush: write all of the dirty blocks to memory
▪ Invalidate: make all of the cache blocks invalid

¤ Hardware based management

▪ Update or invalidate other copies on every write
▪ Send data to everyone, or only the ones who have a copy

¨ Invalidation based protocol is better. Why?
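One commonly cited reason is bus traffic when a core writes the same block repeatedly. The sketch below counts messages for a single block under a deliberately simplified bus model; the function `bus_messages`, the one-message-per-event costs, and the trace are assumptions made for illustration, not a full protocol.

```python
def bus_messages(trace, protocol):
    """Count bus messages for one block; trace holds (core, op) with op 'R' or 'W'."""
    sharers, messages = set(), 0
    for core, op in trace:
        if core not in sharers:
            messages += 1  # miss: fetch the block over the bus
            sharers.add(core)
        if op == "W" and sharers - {core}:
            messages += 1  # one broadcast: update or invalidate the other copies
            if protocol == "invalidate":
                sharers = {core}  # other copies are dropped; later writes stay local
    return messages


if __name__ == "__main__":
    # Two cores read the block, core 0 writes it ten times, then core 1 reads it.
    trace = [(0, "R"), (1, "R")] + [(0, "W")] * 10 + [(1, "R")]
    print("update:", bus_messages(trace, "update"))
    print("invalidate:", bus_messages(trace, "invalidate"))
```

Under update, every one of the ten writes is broadcast; under invalidation, only the first write uses the bus, at the price of one extra miss when core 1 reads again.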