SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES
Natalie Enright Jerger, University of Toronto
DATE 2010
Interaction of Coherence and Network

Cache coherence protocol drives network-on-chip traffic
Scalable coherence protocols needed for many-core architectures
Consider interconnection network optimizations to help facilitate scalable coherence
Talk Outline

Introduction
Network-on-Chip Challenges with Scalable Coherence Protocols
SigNet Architecture: Network filtering solution
Evaluation
Conclusion
Many-Core Cache Coherence Challenges

Broadcast
Good latency
Poor scaling due to bandwidth requirements

Directory
Good scalability due to point-to-point communication
Storage overheads
[Figure: directory protocol handling a store miss: request to directory, forward to sharer, response]
Scalable Cache Coherence

Directory protocol storage overheads
Single bit per core in sharing vector (full map)
256 cores ➔ 32 Bytes of overhead per cache line!

Coarse Vector Directories
DiriCVr (i: # of pointers, r: # of cores in region)
Example: Dir2CV2 requires 1/2 as much storage

Example: Cores 1, 2, 5 & 15 share a cache line
With only 2 sharers (1 & 15), the 2 pointers suffice
A 3rd sharer causes overflow; each coarse bit then represents a region (e.g. cores 0 & 1)
Result: 3 sharers but 6 invalidations
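The overflow behavior above can be sketched in a few lines of Python. The class and method names are illustrative, not from the paper; the parameters match the slide's Dir2CV2 example on a 16-core system.

```python
# Hypothetical sketch of a Dir_i CV_r coarse-vector directory entry.
class CoarseVectorEntry:
    def __init__(self, num_cores, num_pointers, region_size):
        self.num_cores = num_cores
        self.i = num_pointers          # exact sharer pointers before overflow
        self.r = region_size           # cores represented per coarse bit
        self.pointers = []             # precise mode: explicit core IDs
        self.coarse = None             # coarse mode: set of region indices

    def add_sharer(self, core):
        if self.coarse is None and len(self.pointers) < self.i:
            if core not in self.pointers:
                self.pointers.append(core)
        else:
            if self.coarse is None:    # overflow: switch to coarse mode
                self.coarse = {c // self.r for c in self.pointers}
                self.pointers = []
            self.coarse.add(core // self.r)

    def invalidation_targets(self):
        if self.coarse is None:
            return sorted(self.pointers)
        # every core in every marked region receives an invalidation
        return sorted(c for region in self.coarse
                        for c in range(region * self.r, (region + 1) * self.r))

entry = CoarseVectorEntry(num_cores=16, num_pointers=2, region_size=2)
for core in (1, 5, 15):               # a 3rd sharer overflows the 2 pointers
    entry.add_sharer(core)
print(entry.invalidation_targets())   # [0, 1, 4, 5, 14, 15]: 6 invalidations
```

With three sharers in three different regions, six cores receive invalidations, matching the slide's "3 sharers but 6 invalidations."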
Extraneous Invalidations with Coarse Vectors

For many applications, the number of sharers is small (2 to 3)
When number of sharers exceeds i, pointers overflow
Directory entry operates in coarse mode
1 bit represents multiple cores ➔ imprecise sharing list
Extra processors will receive and acknowledge invalidations
Consumes network power
Requires additional cache lookups
Latency Impact of Coarse Vectors

Increase in coarseness leads to significant contention
[Figure: normalized packet latency vs. cores per region (8 to 256); latency increases from 5% up to 15% as coarseness grows]
Bandwidth Impact

Increased load results in decreased effectiveness of pipeline optimizations
Increased dynamic power consumption
[Figure: normalized link traversals vs. cores per region (8 to 256)]
System-Level Impact

Coarse vectors increase average packet latency
Increase completion time for invalidations
All acknowledgments must be received
Delay subsequent requests to pending cache line
[Figure: average cycles (30 to 150) for invalidation completion, local and remote, across SPECjbb, SPECweb, TPC-H, TPC-W, Barnes, Ocean, Radiosity, Raytrace]
SigNet Overview

Coarseness reduces directory storage, but with significant potential impact on the network due to extraneous invalidations
Safely remove extraneous invalidations: save power and reduce network contention
Place cache summary information in routers
Counting Bloom filters used for cache summary signatures
Use summary information to filter network packets
SigNet Bloom Filters

Counting Bloom filters summarize cache information in routers
Signature of cache contents
Bloom Filter Hit: a core exists between the current node and the destination that is caching an address mapping to the same entry
Bloom Filter Miss: none of the downstream caches are caching lines that map to this entry
Modifying Bloom Filters

Bloom Filter Insertion: cache misses increment counters as they travel to the directory
Bloom Filter Deletion: writeback and invalidation acknowledgments decrement the associated counters at routers between cache and directory
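A minimal counting Bloom filter along these lines, in Python. The hash functions and default size are illustrative assumptions, not SigNet's exact design (the evaluation uses 8K-entry filters, so that is used as the default here).

```python
# Minimal counting Bloom filter sketch for a router signature.
class CountingBloomFilter:
    def __init__(self, num_entries=8192, num_hashes=2):
        self.counters = [0] * num_entries
        self.num_entries = num_entries
        self.num_hashes = num_hashes

    def _indexes(self, addr):
        # simple double hashing on the block address (illustrative)
        h1 = addr % self.num_entries
        h2 = (addr * 2654435761) % self.num_entries
        return [(h1 + k * h2) % self.num_entries for k in range(self.num_hashes)]

    def insert(self, addr):       # on a cache miss traveling to the directory
        for idx in self._indexes(addr):
            self.counters[idx] += 1

    def delete(self, addr):       # on a writeback or invalidation ack
        for idx in self._indexes(addr):
            if self.counters[idx] > 0:
                self.counters[idx] -= 1

    def may_contain(self, addr):  # hit: a downstream cache MAY hold the line
        return all(self.counters[idx] > 0 for idx in self._indexes(addr))

sig = CountingBloomFilter()
sig.insert(0xABC0)
print(sig.may_contain(0xABC0))    # True (filter hit)
sig.delete(0xABC0)
print(sig.may_contain(0xABC0))    # False (filter miss after deletion)
```

Counters, rather than single bits, are what make deletion safe: a plain Bloom filter cannot remove an element without risking false negatives, which would break coherence.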
SigNet Architecture

[Figure: SigNet router microarchitecture: route computation, VC and switch allocators, crossbar switch, input buffers with virtual channels, decode logic, and ack message formation, with counting Bloom filter signatures at the input ports]
SigNet Pipeline

Header flits traverse a modified pipeline if the packet needs to check/update the signature
[Figure: baseline router pipeline (BW RC VA SA ST LT) and modified pipeline with signature check/update stages added]
SigNet Example

Requests follow X-Y path; responses follow Y-X path

1. Cache miss to A: send request to directory
2. At each hop, insert address into input port signature
3. Directory operating in coarse mode: bit for region set; directory responds to request
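Step 2 above (per-hop signature insertion) can be sketched as follows; the router and port model here is a toy illustration, not the paper's implementation.

```python
# Toy model: each input port holds a counting filter, here a dict of
# address -> counter for readability.
class PortFilter:
    def __init__(self):
        self.count = {}

    def insert(self, addr):
        self.count[addr] = self.count.get(addr, 0) + 1

def travel_to_directory(addr, path_routers):
    """As a miss request travels hop by hop toward the directory, each
    router inserts the block address into the signature of the input
    port the request arrived on."""
    for router, in_port in path_routers:
        router[in_port].insert(addr)

r1, r2 = {"east": PortFilter()}, {"east": PortFilter()}
travel_to_directory(0xA, [(r1, "east"), (r2, "east")])
print(r1["east"].count)   # {10: 1}  (address 0xA recorded at the first hop)
```

After the request reaches the directory, every router on the path holds a record that a cache past that port may hold the line, which is exactly what later invalidations consult.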
SigNet Example (2)

4. Write request for A
5. Directory forwards invalidation requests
6. North port: filter hit; west port: filter miss
7. Spawn 2 acks
8. Invalidations reach 2 destination cores and acks sent to directory
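The filter-hit/filter-miss decision in steps 6 and 7 can be sketched as below; the function and signature names are illustrative assumptions.

```python
def filter_invalidation(addr, port_signatures):
    """Return (ports_to_forward, ports_to_ack): the invalidation is
    forwarded on ports whose signature hits; on a filter miss the router
    safely drops it and spawns the ack itself, since no downstream cache
    can hold the line."""
    forward, ack = [], []
    for port, may_contain in port_signatures.items():
        (forward if may_contain(addr) else ack).append(port)
    return forward, ack

# Toy signatures mirroring step 6: north port may hold address A, west does not.
signatures = {
    "north": lambda addr: addr == 0xA,   # filter hit for A
    "west":  lambda addr: False,         # filter miss
}
fwd, acks = filter_invalidation(0xA, signatures)
print(fwd, acks)   # ['north'] ['west']
```

Because counting filters never produce false negatives, forwarding only on hits cannot drop an invalidation that a real sharer needs, which preserves correctness.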
SigNet Implementation

Recall: full-map directory requires 32 bytes per sharing vector for 256 cores (50% overhead per cache line)
Evaluation uses 8K-entry Bloom filters at each output port
Reduces overhead to 12.5% to 25% per cache line, depending on size of counters and number of pointers in the coarse vector directory
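The storage arithmetic can be checked directly. The 64-byte cache line is an assumption, but it is what the slide's 50% figure implies.

```python
# Back-of-the-envelope check of the full-map storage figures on this slide.
cores = 256
line_bytes = 64                       # assumed cache line size
full_map_bytes = cores / 8            # one sharing bit per core
print(full_map_bytes)                 # 32.0 bytes per cache line
print(full_map_bytes / line_bytes)    # 0.5 -> 50% overhead
```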
Correctness and Utilization

All cores caching a block must receive the invalidation request
Bloom filter can have false positives: lessens performance benefit but remains correct
Cannot have false negatives
Differences in utilization due to location and memory usage
Simulation Methodology

Network configuration:
Number of Nodes: 256
Topology: 16x16 mesh
Virtual Channels and Buffers: 4 VC/port, 8 buffers/VC
Link Width: 16 Bytes
Signature Size: 8192
Simulation Methodology (2)

Create synthetic benchmarks based on characteristics of 16-core workloads

Workload Parameters:
Name          %Invalidates   Average Sharers
Database      6.0            2.3
Web           3.5            3.8
Java          2.7            2.2
Scientific A  2.0            2.3
Scientific B  5.0            3.0
Results: 2 Pointers, 16 Cores per Region

Filters effectively reduce network contention
Lower average packet latency
Additional research needed to further close gap with Full Map Directory
[Figure: normalized average packet latency (0.25 to 1.00) for CV Directory, SigNet, and Full Map across Database, Web, Java, SciA, SciB]
Results: Invalidation Completion Time

SigNet improves invalidation completion time
Comparison with Pruning Caches in paper
[Figure: cycles for invalidations to complete (30 to 120) for CV Directory, SigNet, and Full Map across Database, Web, Java, SciA, SciB]
Related Work

Interconnection network support: Pruning Caches, in-network coherence filters
Cache coherence optimizations with Bloom filters: Jetty filters (reduce cache snoops), Tagless Coherence Directories (reduce storage overheads)
Conclusions

Characterized impact of CV directories: significant power consumption and performance degradation
Interconnect support to facilitate scalable cache coherence
SigNet: network filters to reduce extraneous invalidations
Thank You

Questions?