SLIDE 1

SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES

Natalie Enright Jerger, University of Toronto
DATE 2010

SLIDE 2

Interaction of Coherence and Network

• The cache coherence protocol drives network-on-chip traffic
• Scalable coherence protocols are needed for many-core architectures
• Consider interconnection network optimizations to help facilitate scalable coherence

SLIDE 3

Talk Outline

• Introduction
• Network-on-Chip Challenges with a Scalable Coherence Protocol
• SigNet Architecture: a network filtering solution
• Evaluation
• Conclusion

SLIDES 4-10 (animation builds)

Many-Core Cache Coherence Challenges

• Broadcast
  • Good latency
  • Poor scaling due to bandwidth requirements
• Directory
  • Good scalability due to point-to-point communication
  • Storage overheads

(Figure animation: a store miss sends a request to the directory, which forwards it on; a response returns. Build labels: "Store Miss", "ForwardX", "Response".)

SLIDES 11-30 (animation builds)

Scalable Cache Coherence

• Directory protocol storage overheads
  • Single bit per core in the sharing vector (full map)
  • 256 cores ➔ 32 bytes of overhead per cache line!
• Coarse Vector Directories: DiriCVr
  • i: number of pointers
  • r: number of cores per region
  • Example: Dir2CV2 requires 1/2 as much storage

(Figure animation: cores 1, 2, 5 & 15 share a cache line. With only 2 sharers, 1 & 15, the 2 pointers suffice; a 3rd sharer overflows them, and the entry switches to a coarse vector in which each bit represents a region, e.g. one bit for cores 0 & 1. The result: 3 sharers but 6 invalidations.)
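To make the overflow behavior concrete, here is a minimal Python sketch of a DiriCVr entry. The class and method names are illustrative rather than from the paper; it reproduces the slide's Dir2CV2 example of 3 sharers triggering 6 invalidations.

```python
class CoarseVectorEntry:
    """Sharing state with i exact pointers; on overflow, a coarse
    vector where each bit covers a region of r cores (illustrative)."""

    def __init__(self, i, r):
        self.i = i              # number of exact sharer pointers
        self.r = r              # cores per region in coarse mode
        self.pointers = []      # exact sharer IDs while they fit
        self.regions = None     # set of marked regions once overflowed

    def add_sharer(self, core):
        if self.regions is not None:
            self.regions.add(core // self.r)
            return
        if core in self.pointers:
            return
        if len(self.pointers) < self.i:
            self.pointers.append(core)
            return
        # Pointer overflow: switch the entry to coarse mode.
        self.regions = {c // self.r for c in self.pointers + [core]}
        self.pointers = []

    def invalidation_targets(self):
        if self.regions is None:
            return sorted(self.pointers)
        # Every core in every marked region is invalidated, including
        # cores that never cached the line (extraneous invalidations).
        return sorted(c for reg in self.regions
                        for c in range(reg * self.r, (reg + 1) * self.r))

# Slide example, Dir2CV2: sharers 1 and 15 fit in the two pointers;
# adding core 5 overflows into coarse mode with two-core regions.
entry = CoarseVectorEntry(i=2, r=2)
for core in (1, 15, 5):
    entry.add_sharer(core)
print(entry.invalidation_targets())  # [0, 1, 4, 5, 14, 15]: 3 sharers, 6 invalidations
```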

SLIDE 31

Extraneous Invalidations with Coarse Vectors

• For many applications the number of sharers is small: 2 to 3
• When the number of sharers exceeds i, the pointers overflow
  • The directory entry operates in coarse mode
  • 1 bit represents multiple cores ➔ imprecise sharing list
• Extra processors will receive and acknowledge invalidations
  • Consumes network power
  • Requires additional cache lookups

SLIDE 32

Latency Impact of Coarse Vectors

• An increase in coarseness leads to significant contention

(Figure: normalized packet latency as coarseness varies from 8 to 256, with annotated increases of 5%, 10%, 12%, and 15%.)

SLIDE 33

Bandwidth Impact

• Increased load reduces the effectiveness of pipeline optimizations
• Increased dynamic power consumption

(Figure: normalized link traversals as coarseness varies from 8 to 256.)

SLIDE 34

System-Level Impact

• Coarse vectors increase average packet latency
• Invalidation completion time increases
  • All acknowledgments must be received
  • Delays subsequent requests to the pending cache line

(Figure: average cycles for invalidation completion, local, and remote requests across SPECjbb, SPECweb, TPC-H, TPC-W, Barnes, Ocean, Radiosity, and Raytrace.)

SLIDE 35

SigNet Overview

• Coarseness reduces directory storage
  • But with significant potential impact on the network, due to extraneous invalidations
• Safely remove extraneous invalidations
  • Save power and reduce network contention
• Place cache summary information in routers
  • Counting Bloom filters serve as cache summary signatures
  • Use the summary information to filter network packets

SLIDE 36

SigNet Bloom Filters

• Counting Bloom filters summarize cache information in routers
  • Signature of cache contents
• Bloom filter hit
  • Some core between the current node and the destination is caching an address that maps to the same entry
• Bloom filter miss
  • None of the downstream caches hold lines that map to this entry

SLIDE 37

Modifying Bloom Filters

• Bloom filter insertion
  • Cache misses increment counters as they travel to the directory
• Bloom filter deletion
  • Writebacks and invalidation acknowledgments decrement the associated counters at routers between the cache and the directory
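A minimal sketch of such a per-port signature, assuming a simple double-hashing scheme (the deck does not specify the hash functions; `CountingBloomFilter` and its method names are illustrative):

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter summarizing cached block addresses."""

    def __init__(self, num_entries=8192, num_hashes=2):
        self.counters = [0] * num_entries
        self.num_hashes = num_hashes

    def _indexes(self, addr):
        # Derive k counter indexes from the block address
        # (illustrative hashing, not the paper's scheme).
        for k in range(self.num_hashes):
            digest = hashlib.sha256(f"{addr}:{k}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % len(self.counters)

    def insert(self, addr):
        # Cache misses increment counters on the way to the directory.
        for i in self._indexes(addr):
            self.counters[i] += 1

    def delete(self, addr):
        # Writebacks and invalidation acks decrement the same counters.
        for i in self._indexes(addr):
            self.counters[i] = max(0, self.counters[i] - 1)

    def maybe_cached_downstream(self, addr):
        # Hit: some downstream core MAY cache a line mapping to these
        # entries (false positives possible). Miss: definitely no
        # downstream sharer (no false negatives), so an invalidation
        # can be safely filtered.
        return all(self.counters[i] > 0 for i in self._indexes(addr))
```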

SLIDE 38

SigNet Architecture

(Figure: SigNet router microarchitecture. Recoverable labels: route computation, VC allocator, switch allocator, crossbar switch, input buffers with VC 1 to VC n, per-port signatures (Sig blocks) with decode, and ack message formation logic.)

SLIDE 39

SigNet Pipeline

• Header flits traverse a modified pipeline
  • If the packet needs to check or update a signature

(Figure: baseline router pipeline stages BW | RC | VA | SA | ST | LT, extended with a signature check/update; on a signature miss the invalidation is squashed and an ack is generated.)

SLIDES 40-55 (animation builds)

SigNet Example

• Requests follow the X-Y path; responses follow the Y-X path

1. Cache miss to A: send a request to the directory.
2. At each hop, insert the address into the input port signature.
3. The directory is operating in coarse mode and the bit for the region is set; the directory responds to the request.
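A sketch of steps 1 and 2, assuming dimension-ordered X-Y routing on a mesh with +y as north; `xy_route`, `send_request`, and the `signatures` map are illustrative names, and any object with an `insert` method (such as the counting Bloom filter sketched earlier) can serve as a signature:

```python
from collections import defaultdict

def xy_route(src, dst):
    """Yield (router, input_port) hops from src to dst: X first, then Y."""
    x, y = src
    dx, dy = dst
    while x != dx:
        step = 1 if dx > x else -1
        x += step
        # Moving east (+x) means arriving on the next router's west port.
        yield (x, y), "W" if step == 1 else "E"
    while y != dy:
        step = 1 if dy > y else -1
        y += step
        # Moving north (+y) means arriving on the next router's south port.
        yield (x, y), "S" if step == 1 else "N"

def send_request(addr, src, directory, signatures):
    # At each hop, insert the block address into that router's
    # input-port signature (step 2 on the slide).
    for router, in_port in xy_route(src, directory):
        signatures[(router, in_port)].insert(addr)

class _StubSignature:
    """Stand-in for the counting Bloom filter sketch."""
    def __init__(self):
        self.count = 0
    def insert(self, addr):
        self.count += 1

signatures = defaultdict(_StubSignature)
send_request(addr=0xA, src=(0, 0), directory=(2, 1), signatures=signatures)
print(sorted(signatures))  # [((1, 0), 'W'), ((2, 0), 'W'), ((2, 1), 'S')]
```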

SLIDES 56-67 (animation builds)

SigNet Example (2)

4. Write request for A.
5. The directory forwards invalidation requests.
6. North port: filter hit. West port: filter miss.
7. Spawn 2 acks.
8. The remaining invalidations reach the 2 destination cores, and acks are sent to the directory.
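A sketch of the filtering decision in steps 6 and 7; the router fields and helpers (`next_port`, `output_signatures`, `send`, `spawn_ack`) are illustrative names, not the paper's interface:

```python
# Hedged sketch: filtering an invalidation at a router's output port.
# On a signature miss, no downstream cache can hold the line (counting
# Bloom filters admit false positives but no false negatives), so the
# invalidation is dropped and its ack is generated here instead of at
# the non-sharing destination core.

def forward_invalidation(router, inval, send, spawn_ack):
    out_port = router.next_port(inval.dest)         # e.g. X-Y routing
    signature = router.output_signatures[out_port]  # counting Bloom filter
    if signature.maybe_cached_downstream(inval.addr):
        send(out_port, inval)                       # possible sharer: forward
    else:
        spawn_ack(inval.directory, inval.addr)      # extraneous: filter + ack
```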

SLIDE 68

SigNet Implementation

• Recall: a full-map directory requires 32 bytes per sharing vector for 256 cores
  • 50% overhead per cache line
• The evaluation uses 8K-entry Bloom filters at each output port
• Reduces overhead to 12.5%-25% per cache line
  • Depends on the size of the counters and the number of pointers in the coarse vector directory
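A quick arithmetic check of these numbers, assuming 64-byte cache lines (implied by 32 bytes amounting to a 50% per-line overhead):

```python
cores = 256
line_bytes = 64                       # assumed cache line size

full_map_bits = cores                 # one presence bit per core
full_map_bytes = full_map_bits // 8   # 32 bytes per sharing vector
print(full_map_bytes)                             # 32
print(f"{full_map_bytes / line_bytes:.0%}")       # 50% overhead per line

# SigNet + coarse vectors: 12.5%-25% per line, i.e. 8-16 bytes,
# depending on counter width and pointer count.
print(0.125 * line_bytes, 0.25 * line_bytes)      # 8.0 16.0
```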

SLIDE 69

Correctness and Utilization

• All cores caching a block must receive the invalidation request
• The Bloom filter can have false positives
  • Lessens the performance benefit, but remains correct
• It cannot have false negatives
• Utilization differs across filters due to location and memory usage

SLIDE 70

Simulation Methodology

Network configuration:
  Number of Nodes               256
  Topology                      16 x 16 mesh
  Virtual Channels and Buffers  4 VCs/port, 8 buffers/VC
  Link Width                    16 bytes
  Signature Size                8192 entries

SLIDE 71

Simulation Methodology (2)

• Synthetic benchmarks created from the characteristics of 16-core workloads

Workload parameters:
  Name          %Invalidates  Average Sharers
  Database      6.0           2.3
  Web           3.5           3.8
  Java          2.7           2.2
  Scientific A  2.0           2.3
  Scientific B  5.0           3.0

SLIDE 72

Results: 2 Pointers, 16 Cores per Region

• Filters effectively reduce network contention
  • Lower average packet latency
• Additional research is needed to further close the gap with the full-map directory

(Figure: normalized average packet latency for Database, Web, Java, SciA, and SciB under CV Directory, SigNet, and Full Map.)

SLIDE 73

Results: Invalidation Completion Time

• SigNet improves invalidation completion time
• A comparison with Pruning Caches appears in the paper

(Figure: cycles for invalidations to complete for Database, Web, Java, SciA, and SciB under CV Directory, SigNet, and Full Map.)

SLIDE 74

Related Work

• Interconnection network support
  • Pruning Caches
  • In-network coherence filters
• Cache coherence optimizations with Bloom filters
  • Jetty filters: reduce cache snoops
  • Tagless Coherence Directories: reduce storage overheads

SLIDE 75

Conclusions

• Characterized the impact of CV directories
  • Significant power consumption and performance degradation
• Interconnect support to facilitate scalable cache coherence
• SigNet: network filters to reduce extraneous invalidations

SLIDE 76

Thank You

• Questions?