Memory Hierarchy for Web Search
Grant Ayers*, Jung Ho Ahn†, Christos Kozyrakis*, Partha Ranganathan‡
Stanford University* Seoul National University† Google‡
Work performed while authors were at Google
Scalability / Hardware Optimizations
(+11%), hardware prefetching (+5%)
excellent software scaling
Web search leaf node CPU utilization
but touched footprint grows with cores and time (little data locality in the shard)
around 1 GiB, suggests sharing and cold structures
removes code misses
the shard
heap
Large shared caches are highly effective for heap accesses.
The L3 cache is in a region of diminishing returns.
[Figure: L3 hit rate and L3 MPKI. Annotations: 16 MiB sufficient for instructions; 1 GiB sufficient for heap; region of diminishing returns]
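The two metrics plotted here can be derived from raw cache counters. A minimal sketch, assuming hypothetical counter names (accesses, misses, instructions) rather than the output of any particular profiler:

```python
def hit_rate(accesses: int, misses: int) -> float:
    """Fraction of cache accesses that hit (hypothetical counter inputs)."""
    if accesses <= 0:
        raise ValueError("accesses must be positive")
    return (accesses - misses) / accesses


def mpki(misses: int, instructions: int) -> float:
    """Misses per kilo-instruction: misses per 1000 retired instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return 1000.0 * misses / instructions
```

Hit rate is normalized to cache accesses while MPKI is normalized to instructions, so the two can move differently as cache capacity changes.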
1 “The Xeon Processor E5-2600 v3: A 22nm 18-core product family” (ISSCC ’15)
Sweep core count and L3 capacity in terms of chip area used
Some L3 transistors could be better used for cores
(9c/2.5 MiB/core worse than 11c/1.23 MiB/core)
Core count is not all that matters!
(All 18c with < 1 MiB/core are bad)
Incorporate the sweep data into a linear model
1 MiB/core of L3 cache allows 5 extra cores and 14% performance improvement
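The cache-for-cores trade can be sketched as an iso-area calculation. This assumes the configurations being compared occupy the same chip area, and that the 5 extra cores are relative to an 18-core baseline at 2.5 MiB/core; neither is stated on the slide. The function name is hypothetical.

```python
def core_area_in_l3_mib(cores_a, l3_per_core_a, cores_b, l3_per_core_b):
    """Area of one core, expressed in MiB of L3, assuming equal total area.

    Solves cores_a * (c + l3_a) == cores_b * (c + l3_b) for c.
    """
    if cores_a == cores_b:
        raise ValueError("configurations must differ in core count")
    return (cores_a * l3_per_core_a - cores_b * l3_per_core_b) / (cores_b - cores_a)


# Sweep points quoted on the slides: 9c/2.5 MiB/core vs 11c/1.23 MiB/core.
sweep_estimate = core_area_in_l3_mib(9, 2.5, 11, 1.23)

# Assumed baseline (not stated on the slide): 18 cores at 2.5 MiB/core,
# traded for 18 + 5 = 23 cores at 1 MiB/core.
rebalance_estimate = core_area_in_l3_mib(18, 2.5, 23, 1.0)
```

Under these assumptions both pairs value one core at roughly 4.4 to 4.5 MiB of L3, which is consistent with some L3 area being better spent on cores.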
Cache-for-Cores Performance
○ eDRAM provides lower latency
○ Multi-chip package allows for existing 128 MiB dies
Proposed L4 Cache based on eDRAM
[Figure: L4 hit rate and L4 MPKI]
additional miss penalty)
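How an L4 and an additional miss penalty interact can be sketched as an average-latency calculation for requests that miss in L3. Every parameter below is hypothetical; the talk's measured latencies and hit rates are not reproduced here.

```python
def avg_latency_after_l3_miss(l4_hit_rate, l4_latency_ns, mem_latency_ns,
                              l4_miss_penalty_ns=0.0):
    """Average latency of an L3 miss when an L4 sits in front of memory.

    Hits are served at L4 latency; misses pay the memory latency plus any
    additional penalty incurred by probing the L4 first.
    """
    if not 0.0 <= l4_hit_rate <= 1.0:
        raise ValueError("l4_hit_rate must be between 0 and 1")
    miss_cost_ns = mem_latency_ns + l4_miss_penalty_ns
    return l4_hit_rate * l4_latency_ns + (1.0 - l4_hit_rate) * miss_cost_ns
```

Setting l4_miss_penalty_ns to zero models an L4 whose lookup adds no cost to a miss; a positive value models the more pessimistic case.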
L4 and Cache for Cores
Web search leaf node CPU utilization