SLIDE 1

Circuit-Switched Coherence

Natalie Enright Jerger*, Li-Shiuan Peh+, Mikko Lipasti*

* University of Wisconsin - Madison

+ Princeton University

2nd IEEE International Symposium on Networks-on-Chip

SLIDE 2

Motivation

Network on Chip for general purpose multi-core

Replacing dedicated global wires

Efficient/scalable communication on-chip

Router latency overhead can be significant

Exploit application characteristics to lower latency

Co-design coherence protocol to match network functionality

2 4/11/2008 Natalie Enright Jerger - University of Wisconsin

SLIDE 3

Executive Summary

Hybrid Network

Interleaves circuit-switched and packet-switched flits

Optimize setup latency

Improve throughput over traditional circuit-switching

Reduce interconnect delay by up to 22%

Co-design cache coherence protocol

Improves performance by up to 17%

SLIDE 4

Switching Techniques

Packet Switching

Efficient bandwidth utilization

Router latency overhead

Circuit Switching

Poor bandwidth utilization

Stalled requests due to unavailable resources

Low latency

Avoids router overhead after circuit is established

Best of both worlds?

Efficient bandwidth utilization + low latency

SLIDE 5

Circuit-Switched Coherence

Two key observations

Commercial workloads are very sensitive to communication latency

Significant pair-wise sharing

Construct fast pair-wise circuits?

Commercial Workloads: SpecJBB, SpecWeb, TPC-H, TPC-W

Scientific Workloads: Barnes-Hut, Ocean, Radiosity, Raytrace

SLIDE 6

Traditional Circuit Switching

Traditional circuit-switching hurts performance by up to ~7%

*Data collected for a 16-core chip multiprocessor with in-order cores

SLIDE 7

Circuit Switching Redesigned

Latency is critical

Utilize Circuit Switching for lower latency

A circuit connects resources across multiple hops to avoid router overhead

Traditional circuit-switching performs poorly

My contributions

Novel setup mechanism

Bandwidth stealing

SLIDE 8

Outline

Motivation

Router Design

Setup Mechanism

Bandwidth Stealing

Coherence Protocol Co-design

Pair-wise sharing

3-hop optimization

Region prediction

Results

Conclusions

SLIDE 9

Traditional Circuit Switching Path Setup (with Acknowledgement)

[Figure: path setup diagram. Labels: 5, Configuration Probe, Acknowledgement, Data, Circuit]

Significant latency overhead prior to data transfer

Other requests forced to wait for resources

SLIDE 10

Novel Circuit Setup Policy

[Figure: path setup diagram. Labels: A, 5, Configuration Packet, Data, Circuit]

Overlap circuit setup with 1st data transfer

Reconfigure existing circuits if no unused links available

Allows piggy-backed request to always achieve low latency

Multiple circuit planes prevent frequent reconfiguration
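
The setup policy above can be sketched for a single link as a small table of circuit planes. This is a minimal illustration, not the router's actual logic: the class name LinkPlanes, the string circuit identifiers, and the least-recently-used victim choice are all assumptions. The slides only state that an existing circuit is reconfigured when no unused link is available, and that multiple planes make this less frequent.

```python
from collections import OrderedDict


class LinkPlanes:
    """Circuit planes on one link (hypothetical model, not from the slides)."""

    def __init__(self, num_planes=2):
        self.num_planes = num_planes
        # circuit id -> plane index, ordered from least to most recently used
        self.circuits = OrderedDict()

    def setup(self, circuit_id):
        """Claim a plane; return (plane, evicted circuit id or None)."""
        if circuit_id in self.circuits:
            # Circuit already configured: reuse it and mark it recently used.
            self.circuits.move_to_end(circuit_id)
            return self.circuits[circuit_id], None
        used = set(self.circuits.values())
        free = [p for p in range(self.num_planes) if p not in used]
        if free:
            plane, evicted = free[0], None
        else:
            # No unused plane: reconfigure by tearing down the least recently
            # used circuit (the victim policy is an assumption).
            evicted, plane = self.circuits.popitem(last=False)
        self.circuits[circuit_id] = plane
        return plane, evicted
```

With two planes, a third distinct circuit forces a reconfiguration; with four planes the same three circuits would coexist, which is the effect the last bullet describes.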

SLIDE 11

Setup Network

Light-weight setup network

Narrow

Circuit plane identifier (2 bits) + Destination (4 bits)

Low Load

No virtual channels, so small area footprint

Stores circuit configuration information

Multiple narrow circuit planes prevent frequent reconfiguration

Reconfiguration

Buffered, traverses packet-switched pipeline
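
The narrow setup message described above (2-bit circuit plane identifier plus 4-bit destination) fits in 6 bits. The following sketch packs and unpacks such a message. The field order (plane in the high bits) and the function names are assumptions; the slide gives only the field widths.

```python
PLANE_BITS = 2  # circuit plane identifier width (from the slide)
DEST_BITS = 4   # destination width (from the slide); addresses 16 nodes


def encode_setup(plane, dest):
    """Pack a setup message; plane occupies the high bits (assumed layout)."""
    if not 0 <= plane < (1 << PLANE_BITS):
        raise ValueError("plane does not fit in 2 bits")
    if not 0 <= dest < (1 << DEST_BITS):
        raise ValueError("destination does not fit in 4 bits")
    return (plane << DEST_BITS) | dest


def decode_setup(word):
    """Unpack a 6-bit setup message into (plane, dest)."""
    return word >> DEST_BITS, word & ((1 << DEST_BITS) - 1)
```

For example, plane 2 and destination 5 encode to 0b100101, and decoding that word returns (2, 5).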

SLIDE 12

Packet-Switched Bandwidth Stealing

Remember: the problem with traditional circuit switching is poor bandwidth utilization

Need to overcome this limitation

Hybrid Circuit-Switched Solution: packet-switched messages snoop incoming links

When there are no circuit-switched messages on the link, a waiting packet-switched message can steal idle bandwidth
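
The stealing rule above amounts to a per-cycle choice on each link. A minimal sketch, assuming one flit per link per cycle and a simple FIFO of waiting packet-switched flits (the function name and the queue model are illustrative, not taken from the slides):

```python
from collections import deque


def arbitrate_link(circuit_flit, packet_queue):
    """Return the flit that uses the link this cycle, or None if it idles."""
    if circuit_flit is not None:
        # A circuit-switched flit is present: it keeps its reserved slot.
        return circuit_flit
    if packet_queue:
        # The circuit is idle this cycle: a waiting packet-switched flit
        # steals the slot.
        return packet_queue.popleft()
    return None


waiting = deque(["p0", "p1"])
incoming = ["c0", None, "c1", None, None]  # None marks an idle circuit slot
sent = [arbitrate_link(flit, waiting) for flit in incoming]
print(sent)  # ['c0', 'p0', 'c1', 'p1', None]
```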

SLIDE 13

Hybrid Circuit-Switched Router Design

[Figure: router block diagram. Labels: Allocators; Crossbar; T (x5); ports Inj, Ej, N, S, E, W]

SLIDE 14

HCS Pipeline

Circuit-switched messages: 1 stage

Packet-switched messages: 3 stages

Aggressive Speculation reduces stages

[Figure: pipeline diagrams. Circuit-switched: Switch Traversal (router), Link Traversal (link). Packet-switched: Buffer Write, Virtual Channel/Switch Allocation, Switch Traversal (router), Link Traversal (link)]
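
To see what the stage counts above mean per message, the following back-of-the-envelope sketch computes contention-free head-flit latency. It assumes one cycle per link traversal, no contention, and no serialization delay; the 4-hop path is an arbitrary example, not a figure from the slides.

```python
def zero_load_latency(hops, router_stages, link_cycles=1):
    """Cycles for a head flit to cross `hops` routers with no contention."""
    return hops * (router_stages + link_cycles)


HOPS = 4  # example path length (illustrative choice)
circuit = zero_load_latency(HOPS, router_stages=1)  # circuit-switched: 1 stage
packet = zero_load_latency(HOPS, router_stages=3)   # packet-switched: 3 stages
print(circuit, packet)  # prints: 8 16
```

Under these assumptions the gap grows linearly with hop count, since each hop saves two router cycles.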

SLIDE 15

Outline

Motivation

Router Design

Setup Mechanism

Bandwidth Stealing

Coherence Protocol Co-design

Pair-wise sharing

3-hop optimization

Region prediction

Results

Conclusions

SLIDE 16

Sharing Characterization

Temporal sharing relationship: 67-76% of misses are serviced by the 2 most recently shared-with cores

Commercial Workloads: SpecJBB, SpecWeb, TPC-H, TPC-W

Scientific Workloads: Barnes-Hut, Ocean, Radiosity, Raytrace
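
One way to read the sharing metric above is as a hit rate against a two-entry history per requesting core. The sketch below computes it from a list of (requester, supplier) pairs. The trace format, the function name, and the exact bookkeeping are assumptions; the slides do not say how the 67-76% figure was measured, and the trace here is made up for illustration.

```python
from collections import defaultdict


def fraction_served_by_recent(misses, window=2):
    """Fraction of misses supplied by one of the `window` cores that most
    recently supplied data to the same requester."""
    recent = defaultdict(list)  # requester -> recent suppliers, newest last
    hits = total = 0
    for requester, supplier in misses:
        history = recent[requester]
        total += 1
        if supplier in history:
            hits += 1
            history.remove(supplier)  # re-inserted below as the newest entry
        history.append(supplier)
        del history[:-window]         # keep only the newest `window` entries
    return hits / total if total else 0.0


# Hypothetical trace: core 0 misses, supplied by cores 1, 2, 1, 3, 2, 3.
trace = [(0, 1), (0, 2), (0, 1), (0, 3), (0, 2), (0, 3)]
ratio = fraction_served_by_recent(trace)
```

In this toy trace the third and sixth misses hit the two-entry history, so the ratio is 2 out of 6.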

SLIDE 17

Directory Coherence

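[Figure: read of address A by core 1 while core 2 holds it Exclusive]

Directory (before)

Address  State      Sharers
A        Exclusive  2
B        Shared     1,2

Messages, in order (cores 1 and 2):

1  Read A
2  Forward Read A
3  Data Response A

Directory (after)

Address  State   Sharers
A        Shared  1,2
B        Shared  1,2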

SLIDE 18

Coherence Protocol Co-Design

Goal: Better exploit circuits through coherence protocol

Modifications:

Allow a cache to send a request directly to another cache

Notify the directory in parallel

Prediction mechanism for pair-wise sharers

Directory is sole ordering point

SLIDE 19

Circuit-Switched Coherence Optimization

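[Figure: read of address A by core 1 sent directly to core 2]

Directory (before)

Address  State      Sharers
A        Exclusive  2
B        Shared     1,2

Messages, in order (cores 1 and 2):

1  Read A
1  Update A
2  Data Response A
3  Ack A

Directory (after)

Address  State   Sharers
A        Shared  1,2
B        Shared  1,2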
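
The difference between the baseline directory read and the optimized read can be sketched as two message lists. This is only an illustration of a correctly predicted read of a line held Exclusive by one other core. The message endpoints are assumptions based on a conventional directory protocol (in particular, that the directory sends the Ack to the requester), and mispredictions are not modeled.

```python
def baseline_read(directory, requester, addr):
    """Directory read: request, forward, data (three steps to data)."""
    entry = directory[addr]
    (owner,) = entry["sharers"]  # Exclusive: exactly one holder
    messages = [
        (1, requester, "dir", f"Read {addr}"),
        (2, "dir", owner, f"Forward Read {addr}"),
        (3, owner, requester, f"Data Response {addr}"),
    ]
    entry.update(state="Shared", sharers={owner, requester})
    return messages


def optimized_read(directory, requester, predicted_owner, addr):
    """Direct request to the predicted owner; directory notified in parallel."""
    entry = directory[addr]
    messages = [
        (1, requester, predicted_owner, f"Read {addr}"),
        (1, requester, "dir", f"Update {addr}"),
        (2, predicted_owner, requester, f"Data Response {addr}"),
        (3, "dir", requester, f"Ack {addr}"),  # Ack recipient is assumed
    ]
    entry.update(state="Shared", sharers=entry["sharers"] | {requester})
    return messages


def data_arrival_step(messages):
    """Step number at which the data response is sent."""
    return next(step for step, _src, _dst, text in messages
                if text.startswith("Data Response"))
```

In this sketch the data response moves from step 3 to step 2, while the directory still sees every request and so remains the ordering point.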

SLIDE 20

Region Prediction

Each memory region spans 1KB

Takes advantage of spatial and temporal sharing

[Figure: region prediction example with cores 1 and 2]

Directory (before)

Address  State   Sharers
A[0]     Shared  2
A[1]     Shared  2

Region Table (before)

A  (empty)
B  3

Messages, in order:

1  Miss A[0]
2  Forward Read A[0]
3  Data Response A[0]
4  Region A Update
5  Read A[1]

Region Table (after)

A  2
B  3

Directory (after)

Address  State   Sharers
A[0]     Shared  1,2
A[1]     Shared  2
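
The region table in the example above can be sketched as a map from 1KB region number to the core that last supplied data from that region. The class and method names are hypothetical, and an unbounded dictionary stands in for what would be a finite hardware table; the slides give only the region size.

```python
REGION_BYTES = 1024  # each memory region spans 1KB (from the slide)


class RegionPredictor:
    """Predicts which core to ask, based on the last supplier per region."""

    def __init__(self):
        self.table = {}  # region number -> core id

    def predict(self, addr):
        """Return the predicted sharer for addr, or None if no entry."""
        return self.table.get(addr // REGION_BYTES)

    def update(self, addr, supplier):
        """Record that `supplier` sent data for an address in this region."""
        self.table[addr // REGION_BYTES] = supplier
```

After a data response from core 2 for one line, a later miss to a different line in the same 1KB region predicts core 2, while a miss in the next region has no prediction and would go to the directory as usual.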

SLIDE 21

Simulation Methodology

PHARMSim

Full-system multi-core simulator

Detailed network level model

Cycle accurate router model

Flit-level contention modeled

More results in paper

SLIDE 22

Simulation Workloads

Commercial

SPECjbb      Java server workload, 24 warehouses, 200 requests
SPECweb      Web server, 300 requests
TPC-W        Web e-commerce, 40 transactions
TPC-H        Decision support system

Scientific

Barnes-Hut   8k particles, full run
Ocean        514x514, parallel phase
Radiosity    Parallel phase
Raytrace     Car input, parallel phase

Synthetic

Uniform Random       Destination selected with uniform random distribution
Permutation Traffic  Each node communicates with one other node (pair-wise)
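
The two synthetic patterns above can be generated as follows. This is a sketch under stated assumptions: uniform random traffic never targets the sending node itself, and the pair-wise permutation is a random pairing of an even number of nodes. The slides do not specify either detail, and the function names are illustrative.

```python
import random


def uniform_random_dest(src, num_nodes, rng):
    """Pick a destination uniformly among all nodes other than src."""
    dest = rng.randrange(num_nodes - 1)
    return dest if dest < src else dest + 1


def pairwise_permutation(num_nodes, rng):
    """Pair nodes up so each node communicates with exactly one partner.
    Requires an even number of nodes."""
    nodes = list(range(num_nodes))
    rng.shuffle(nodes)
    partner = {}
    for a, b in zip(nodes[0::2], nodes[1::2]):
        partner[a], partner[b] = b, a
    return partner


rng = random.Random(0)                 # fixed seed for repeatable runs
partners = pairwise_permutation(16, rng)  # 16 nodes, as in a 4x4 mesh
```

Permutation traffic is the favorable case for circuits because every node reuses one circuit, whereas uniform random traffic stresses reconfiguration and bandwidth stealing.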

SLIDE 23

Simulation Configuration

Processors

Cores                     16 in-order general purpose

Memory System

L1 I/D Caches             32 KB, 2-way set associative, 1 cycle
Private L2 caches         512 KB, 4-way set associative, 6 cycles, 64 Byte lines
Shared L3 Cache           16 MB (1MB bank/tile), 4-way set associative, 12 cycles
Main Memory Latency       100 cycles

Interconnect: 4x4 2-D Mesh

Packet-switched baseline  Optimized 1-3 router stages, 4 Virtual channels with 4 Buffers each
Hybrid Circuit Switching  1 router stage, 2 or 4 Circuit planes

SLIDE 24

Network Results

  • Communication latency is key: shave off precious cycles in network latency

SLIDE 25

Flit breakdown

Reduce interconnect latency for a significant fraction of messages

SLIDE 26

HCS + Protocol Optimization

  • Improvement of HCS + Protocol Optimization is greater than the sum of the improvements from HCS and Protocol Optimization alone.

Protocol Optimization drives up circuit reuse, better utilizing HCS

SLIDE 27

Uniform Random Traffic

HCS successfully overcomes bandwidth limitations associated with Circuit Switching

SLIDE 28

Related Work

Router optimizations

Express Virtual Channels [Kumar, ISCA 2007]

Single-cycle router [Mullins, ISCA 2004]

Many more…

Hybrid Circuit-Switching

Wave-switching [Duato, ICPP 1996]

SoCBus [Wiklund, IPDPS 2003]

Coherence Protocols

Significant research in removing overhead of indirection

SLIDE 29

Circuit-Switched Coherence Summary

Replace packet-switched mesh with hybrid circuit-switched mesh

Interleave circuit and packet switched flits

Reconfigurable circuits

Dedicated bandwidth for frequent pair-wise sharers

Low latency and low power

Avoid switching/routing

Devise novel coherence mechanisms to take advantage of benefits of circuit switching

SLIDE 30

Thank you

www.ece.wisc.edu/~pharm

enrightn@cae.wisc.edu

SLIDE 31

Circuit Setup

Novel Setup Policy

Overlap circuit setup with first data transfer

Store circuit information at each router

Reconfigure existing circuits if no unused links available

Allows piggy-backed request to always achieve low latency

Multiple narrow circuit planes prevent frequent reconfiguration

Reconfiguration

Buffered, traverses packet-switched pipeline
