Time for Compressed Memory Hierarchies
Biswabandan Panda
WOCS 2018, December 8th, 2018
Memory in Single-core Systems
DRAM was already critical for performance
Access latency: L1 4 cycles, L2 12 cycles, L3 30 cycles, DRAM 200 cycles
[Figure: Core, L1, L2, L3, DRAM controller]
Latency wall
Memory in Multi-core Systems
[Figure: eight cores sharing one DRAM controller]
Core count doubling every 2 years
DRAM bandwidth doubling every 4 years
Memory wall = Latency wall + Bandwidth wall
DRAM access latency: 200 cycles (single-core) vs. 800 cycles (multi-core)
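The two growth rates on this slide imply that DRAM bandwidth per core shrinks over time. The sketch below works through that arithmetic under idealized assumptions: the doubling periods are taken from the slide, everything is normalized to 1.0 at year 0, and the function name `per_core_bandwidth` is made up for illustration.

```python
def per_core_bandwidth(years, core_doubling=2.0, bw_doubling=4.0):
    """Relative DRAM bandwidth per core after `years` (1.0 at year 0).

    Assumes cores double every `core_doubling` years and DRAM bandwidth
    doubles every `bw_doubling` years, as stated on the slide.
    """
    cores = 2.0 ** (years / core_doubling)
    bandwidth = 2.0 ** (years / bw_doubling)
    return bandwidth / cores


for years in (0, 4, 8):
    print(years, per_core_bandwidth(years))
```

Under these assumptions the per-core share halves every four years, which is the bandwidth half of the memory wall.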
Solution
Cache Compression
Cache Reuse
Cache Compression: 10,000-Foot View
Increases effective cache capacity without increasing physical cache size
Exploits redundancy in data patterns
[Figure: four cores, L1, L2, L3, DRAM controller]
Examples – Data Patterns
A[i][j]=0: 0x00000000 0x00000000 0x00000000 0x00000000
A[i][j]=constant: 0x333333FF 0x333333FF 0x333333FF 0x333333FF
*ptr: 0x888888C0 0x888888C8 0x888888D0 0x888888D8
Narrow values (small value in large data type): 0x00000001 0x00000002 0x00000003 0x00000004
Each value is 4 bytes
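The four patterns above can be told apart with a few comparisons. The sketch below is only an illustration: the function name `classify_block`, the one-byte threshold for "narrow", and the one-byte delta window for pointer-like data are assumptions, not definitions from the talk.

```python
def classify_block(words):
    """Label a list of 32-bit words with one of the slide's four patterns."""
    if all(w == 0 for w in words):
        return "zeros"                 # A[i][j]=0
    if all(w == words[0] for w in words):
        return "repeated"              # A[i][j]=constant
    if all(w < 256 for w in words):
        return "narrow"                # assumption: value fits in one byte
    base = words[0]
    if all(0 <= w - base < 256 for w in words):
        return "pointer-like"          # assumption: shared upper bits, small deltas
    return "other"


print(classify_block([0x888888C0, 0x888888C8, 0x888888D0, 0x888888D8]))
```

Each label corresponds to a different source of redundancy that a compressor can exploit.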
Cache Compression
Block B0: fixed block size (64B) → compressed block size (< 64B)
Compaction
Compacts multiple contiguous compressed blocks of similar compressibility, as in DCC [MICRO ’13] and SCC [MICRO ’14]
[Figure: blocks B0-B3, each compressed to 16B, compacted into one 64B entry]
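One way to read "similar compressibility" is that neighbouring blocks share a 64B entry only when they compress to the same slot size. The sketch below illustrates that idea and is not the DCC or SCC algorithm. The slot sizes (16B, 32B, 64B) and the names `slot_size` and `blocks_per_entry` are assumptions made for the example.

```python
def slot_size(compressed_bytes):
    """Round a compressed size up to the nearest assumed slot (16B, 32B, 64B)."""
    for slot in (16, 32, 64):
        if compressed_bytes <= slot:
            return slot
    raise ValueError("a compressed block cannot exceed 64B")


def blocks_per_entry(compressed_sizes, entry_bytes=64):
    """Number of the given contiguous blocks that can share one data entry.

    Blocks are compacted only if they all round to the same slot size;
    otherwise each block needs its own entry and the function returns 1.
    """
    slots = {slot_size(size) for size in compressed_sizes}
    if len(slots) != 1:
        return 1
    return min(len(compressed_sizes), entry_bytes // slots.pop())


print(blocks_per_entry([16, 16, 16, 16]))  # four 16B blocks share one entry
```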
Compressed Cache Layout
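Conventional address: TAG | SET INDEX | OFFSET (bits 0-5)
Super-block address: TAG | SET INDEX (from bit 8) | BLK-ID (bits 6-7) | OFFSET (bits 0-5)
Conventional mapping: contiguous blocks B0, B1, B2, B3 map to SET 0, SET 1, SET 2, SET 3
Super-block mapping: B0, B1, B2, B3 all map to SET 0 and share one SUPER BLOCK TAG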
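The super-block address split can be sketched as plain bit slicing. The example reads the slide's "6" and "8" markers as bit positions: a 6-bit offset for 64B blocks and a 2-bit BLK-ID for four blocks per super-block. The set-index width `SET_BITS` and the function name are hypothetical values chosen only for illustration.

```python
OFFSET_BITS = 6   # 64B block: bits 0-5
BLK_ID_BITS = 2   # four blocks per super-block: bits 6-7
SET_BITS = 10     # assumption: 1024 sets, not stated in the talk


def split_superblock_address(addr):
    """Return (tag, set_index, blk_id, offset) for a super-block-indexed cache."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    blk_id = (addr >> OFFSET_BITS) & ((1 << BLK_ID_BITS) - 1)
    set_index = (addr >> (OFFSET_BITS + BLK_ID_BITS)) & ((1 << SET_BITS) - 1)
    tag = addr >> (OFFSET_BITS + BLK_ID_BITS + SET_BITS)
    return tag, set_index, blk_id, offset


base = 0x12345600  # hypothetical address, aligned to a 256B super-block
for i in range(4):
    print(split_superblock_address(base + 64 * i))
```

Four contiguous 64B blocks produce the same tag and set index and differ only in BLK-ID, which is what lets them share one super-block tag.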
Compressed Cache Layout
Tag array entry                  | Data array entry
TAG, 0 0, -, ID0                 | B0
TAG, 0 1, ID0, ID2               | B2, B0
TAG, 1 X, ID0, ID1, ID2, ID3     | B3, B2, B1, B0
Observations
[Figure: data entries Block-I to Block-IV holding compressed blocks B0-B3 with compressed sizes of 16B, 32B, 48B, and 64B, with un-occupied space marked]
Observations
Compression and compaction techniques: need for coordination
[Figure: blocks B0 and B1 under compression and compaction, steps ❶ and ❷]
DISH: DICTIONARY SHARING BASED CACHE COMPRESSION [MICRO ‘16]
Scheme-I (Data Content Locality)
64B block as sixteen 4B chunks (index:value): 0:A 1:B 2:A 3:Z 4:C 5:A 6:A 7:D 8:A 9:E 10:A 11:G 12:C 13:H 14:G 15:H
Dictionary of 8 4B entries (32B): 0:A 1:B 2:Z 3:C 4:D 5:E 6:G 7:H
16 3-bit pointers for 16 4B chunks (6B): 0 1 0 2 3 0 0 4 0 5 0 6 3 7 6 7
One block: 32B dictionary + 6B pointers = 38B compressed block
Shared dictionary: 32B dictionary + B0’s, B1’s, B2’s, B3’s PTRs (6B each) = 56B compressed block
Four 64B blocks inside one 64B block
Compression latency of 24 cycles, not in the critical path
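The Scheme-I example can be reproduced with a small dictionary encoder. This is a minimal sketch, not the DISH hardware: letters stand in for 4B chunks as on the slide, and the names `compress_shared` and `compressed_size_bytes` are illustrative.

```python
def compress_shared(blocks, max_entries=8):
    """Encode blocks of 4B chunks against one shared dictionary.

    Returns (dictionary, per-block pointer lists), or None when the blocks
    need more than `max_entries` distinct chunks.
    """
    dictionary, index, pointers = [], {}, []
    for block in blocks:
        ptrs = []
        for chunk in block:
            if chunk not in index:
                if len(dictionary) == max_entries:
                    return None
                index[chunk] = len(dictionary)
                dictionary.append(chunk)
            ptrs.append(index[chunk])
        pointers.append(ptrs)
    return dictionary, pointers


def compressed_size_bytes(num_blocks, entries=8, chunk_bytes=4, chunks=16, ptr_bits=3):
    """Dictionary bytes plus pointer bytes for `num_blocks` blocks sharing it."""
    return entries * chunk_bytes + num_blocks * (chunks * ptr_bits // 8)


block = list("ABAZCAADAEAGCHGH")          # the slide's sixteen chunks
dictionary, pointers = compress_shared([block])
print(dictionary)
print(pointers[0])
print(compressed_size_bytes(1), compressed_size_bytes(4))
```

With one block the encoding takes 38B; with four blocks sharing the dictionary it takes 56B, matching the sizes on the slide.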
Scheme-II (Upper-bits Locality)
64B block as sixteen 4B values: 0x00000021 0x00000030 0x…32 0x…12 … 0x…02
Dictionary of 4 28-bit entries (14B): 0:0x0000002 1:0x…3 2:0x…1 3:0x…0
16 2-bit pointers: 0 1 1 2 … 3
16 4-bit offsets: 1 0 2 2 … 2
Shared dictionary (14B) + B0’s, B1’s, B2’s, B3’s PTRs & Offsets (12B each) = 62B compressed block
Four 64B blocks inside one 64B block
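Scheme-II can be sketched the same way: the upper 28 bits of each word go into a small shared dictionary and the low 4 bits are kept as an offset. The example is only a sketch. The five sample words are hypothetical stand-ins (the slide elides the upper bits of most values), and the function names are illustrative.

```python
def compress_upper_bits(blocks, max_entries=4, offset_bits=4):
    """Split 32-bit words into a shared upper-bits dictionary plus low-bit offsets.

    Returns (dictionary, [(pointers, offsets) per block]), or None when more
    than `max_entries` distinct upper-bit patterns are needed.
    """
    dictionary, index, encoded = [], {}, []
    mask = (1 << offset_bits) - 1
    for block in blocks:
        ptrs, offsets = [], []
        for word in block:
            upper, low = word >> offset_bits, word & mask
            if upper not in index:
                if len(dictionary) == max_entries:
                    return None
                index[upper] = len(dictionary)
                dictionary.append(upper)
            ptrs.append(index[upper])
            offsets.append(low)
        encoded.append((ptrs, offsets))
    return dictionary, encoded


def scheme2_size_bytes(num_blocks, entries=4, entry_bits=28, words=16,
                       ptr_bits=2, offset_bits=4):
    """Shared dictionary bytes plus per-block pointer and offset bytes."""
    per_block = words * ptr_bits // 8 + words * offset_bits // 8
    return entries * entry_bits // 8 + num_blocks * per_block


# Hypothetical words whose upper bits are all zero, chosen for illustration.
sample = [0x00000021, 0x00000030, 0x00000032, 0x00000012, 0x00000002]
upper_dict, enc = compress_upper_bits([sample])
print(upper_dict, enc[0])
print(scheme2_size_bytes(4))
```

The size helper reproduces the slide's arithmetic: a 14B dictionary plus four 12B pointer-and-offset groups gives 62B.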
Decompression
Dictionary (index:value): 0:A 1:B 2:Z 3:C 4:D 5:E 6:G 7:H
Pointers: 0 1 0 2 3 0 0 4 0 5 0 6 3 7 6 7
Decompressed block (index:value): 0:A 1:B 2:A 3:Z 4:C 5:A 6:A 7:D 8:A 9:E 10:A 11:G 12:C 13:H 14:G 15:H
Shared dictionary (32B) with B0’s, B1’s, B2’s, B3’s PTRs (6B each)
One-cycle decompression latency
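Decompression is a table lookup per chunk, which the short sketch below reproduces with the slide's dictionary and pointers. The function name `decompress` is illustrative, and the code models only the data transformation, not the hardware timing.

```python
def decompress(dictionary, pointers):
    """Rebuild a block by looking up each pointer in the shared dictionary."""
    return [dictionary[p] for p in pointers]


shared_dictionary = list("ABZCDEGH")
b0_pointers = [0, 1, 0, 2, 3, 0, 0, 4, 0, 5, 0, 6, 3, 7, 6, 7]
print("".join(decompress(shared_dictionary, b0_pointers)))
```

Each lookup is independent of the others, so all sixteen can in principle proceed in parallel.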
DISH Layout
Tag array entry                  | Data array entry
TAG                              | B0 (one uncompressed block)
TAG, 1, ID3 ID2 ID1 ID0          | B3, B2, B1, B0 (… a block)
No additional meta-data
DISH in Action - Compression
[Figure: DISH compression on an L3 miss, steps ❶ to ❹: L2, L3, DRAM controller, compressor, shared dictionary (DICT) with B0’s to B3’s PTRs, and the tag entry (TAG, ID0-ID3) located through the address fields TAG | SET INDEX | BLK-ID | OFFSET]
DISH in Action - Decompression
[Figure: DISH decompression on an L3 hit, steps ⓿ to ❹: core, L2, L3, the tag entry (TAG, ID0-ID3) matching ID0, and block B0 passing through the decompressor]
Compression Ratio
[Figure: compression ratio (higher is better) for CPACK+Z [TVLSI '10], BDI [PACT '12], and DISH; annotated value 2.3X]
Speedup
[Figure: normalized IPC (higher is better) for CPACK+Z, BDI, DISH, 2X Baseline, and 4X Baseline on astar, bwaves, bzip2, cactusADM, GemsFDTD, gromacs, h264ref, hmmer, lbm, leslie3d, with SPEC, GRAPHS, and GMEAN groups]
12.4% improvement over an uncompressed cache
Summary of Contributions
Case for compaction-aware compression
Inter-block data localities
Leverages the compressed cache layout
What Else Can Be Done with the Layout?
Reuse detection!
Cache Reuse
Reuse Cache [MICRO ’13]
[Figure: tag and data arrays of an 8MB conventional LLC compared with a Reuse LLC (4MB Tag + 1MB Data)]
Reuse Cache: 1st Access
[Figure: L2, Reuse LLC (tag and data arrays), DRAM]
Only tag entry is allocated
Reuse Cache: 2nd Access
[Figure: L2, Reuse LLC (tag and data arrays), DRAM; the second access is a tag hit]
Data entry allocated, block is reused
Highly efficient
4X more tag entries
Decoupled tag/data array
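The two-access behaviour of the Reuse Cache can be sketched as a tiny state machine: the first access allocates only a tag, and a later access that hits the tag allocates the data entry. This is a minimal sketch assuming unlimited capacity and no replacement policy. The class name `ReuseCacheSketch` and the `fetch_from_dram` callback are hypothetical.

```python
class ReuseCacheSketch:
    """Toy model of tag-first allocation in a decoupled tag/data LLC."""

    def __init__(self):
        self.tags = set()   # addresses that have a tag entry only
        self.data = {}      # addresses that also have a data entry

    def access(self, addr, fetch_from_dram):
        if addr in self.data:
            return "data hit", self.data[addr]
        value = fetch_from_dram(addr)       # no data entry yet: go to DRAM
        if addr in self.tags:
            self.data[addr] = value         # reuse detected: allocate data
            return "tag hit, data allocated", value
        self.tags.add(addr)                 # first access: tag only
        return "miss, tag allocated", value


llc = ReuseCacheSketch()
dram = lambda addr: addr * 2                # hypothetical backing store
for _ in range(3):
    print(llc.access(5, dram))
```

Blocks that are never touched a second time never occupy the data array, which is why the data array can be much smaller than the tag array.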
The Question
Can we detect the reusability of LLC blocks without 4X more tags in conventional and compressed caches?
Can we use the existing layout of compressed caches for reuse also?
The Answer: Our Contribution
Reuse Cache + Cache Compression: Synergistic Cache Layout for Reuse and Compression
[Figure: tag and data arrays holding block A0]
SRC: SYNERGISTIC CACHE LAYOUT FOR REUSE AND COMPRESSION [PACT ‘18]
Our Contribution: 10K-Foot View
A single cache layout (super-block tag) for:
(i) Reuse detection of a cache block, even without compression.
(ii) Cache compression: both reuse and compression in compressed caches.
Synergistic Cache Layout for Reuse and Compression: Reuse Only
Dead Blocks at the LLC
[Figure: fraction of non-reused blocks at the LLC for bzip2, cactusADM, libquantum, leslie3d, sjeng, xalancbmk, milc, lbm, soplex, mcf, zeusmp, gromacs, hmmer, h264ref, namd, bwaves, blackscholes, bodytrack, canneal, dedup, freqmine, fluidanimate, facesim, ferret, streamcluster, swaptions, vips, x264, apsp, bc, bfs, community, pagerank, sssp, sgd, lsh, spmv, symgs, knn, and the average]
~60% of the LLC blocks are dead
Compressed Blocks at the LLC
[Figure: fraction of uncompressed blocks at the LLC for bzip2, cactusADM, libquantum, leslie3d, sjeng, xalancbmk, milc, lbm, soplex, mcf, zeusmp, gromacs, hmmer, h264ref, namd, bwaves, blackscholes, bodytrack, canneal, dedup, freqmine, fluidanimate, facesim, ferret, streamcluster, swaptions, vips, x264, apsp, bc, bfs, community, pagerank, sssp, sgd, lsh, spmv, symgs, knn, and the average]
~40% of the LLC blocks can’t be compressed
40% of the super-block tags are underutilized ☺
U-SRC for Reuse Only-I
A2 (Global Miss)
  TAG: V A2
A0 (Tag hit, block miss)
  TAG: V A2 FU
A0 (Tag hit, block reuse)
  TAG: V A2 I
  TAG: A0 V I I I I I I I I I I
U-SRC for Reuse Only-II
A3 (Multiple tag hits, block miss)
  TAG: V A2 I
  TAG: A0 V I I I FU FU
Writeback A3 (Multiple tag hits, block miss, data forwarded to the DRAM)
  TAG: V A2 I
  TAG: A0 V I I I FU FU
Normalized Performance: U-SRC
[Figure: normalized performance (higher is better), series: Uncompressed Reuse Cache, U-SRC Cache, for bzip2, cactusADM, libquantum, leslie3d, sjeng, xalancbmk, milc, lbm, soplex, mcf, zeusmp, gromacs, hmmer, h264ref, namd, bwaves, blackscholes, bodytrack, canneal, dedup, freqmine, fluidanimate, facesim, ferret, streamcluster, swaptions, vips, x264, apsp, bc, bfs, community, pagerank, sssp, sgd, lsh, spmv, symgs, knn, and the geomean]
U-SRC performs close to the Reuse Cache: 6.5% and 8% over the baseline for single-core and multi-core workloads, respectively
Synergistic Cache Layout for Reuse and Compression: Both Reuse and Compression
SRC for Compressed LLC-I
A2 (Global Miss, compressible)
  TAG: V 1 A2
A0 (Tag hit, block miss, co-compressible)
  TAG: V 1 A2 V I I I I I A0
SRC for Compressed LLC - II
A1 (Tag hit, block miss, incompressible)
  TAG: V 1 A2 V FU I A0
A1 (Tag hit, reuse detected, incompressible)
  TAG: V 1 A2 V I I A0
  TAG: I I V I A1
Two Cases on Multiple Tag Hits
A3 (Multiple tag hits, co-compressible)
  TAG: V 1 A2 V I V A0
  TAG: I I V I A1 A3
A3 (Multiple tag hits, incompressible)
  TAG: V 1 A2 V I FU A0
  TAG: I I V FU A1
Special Case: Writeback
A2 (Writeback, not co-compressible)
  TAG: V 1 A2 V I FU A0
  TAG: I 1 V I FU A0
SRC for Reuse+Compression
[Figure: normalized performance (higher is better) of U-SRC Cache, YACC+DISH, 4X Uncompressed Baseline, and SRC Cache on bzip2, cactusADM, libquantum, leslie3d; value labels as extracted: 1.46 1.77 1.64 1.515%]
Journey from YACC to SRC
YACC: an efficient and cost-effective layout for a compressed cache. [TACO ’16]
YACC+DISH: one can leverage the layout to enhance compression. [MICRO ’16]
SRC: one can leverage the YACC layout to enhance LLC reuse along with compression. [PACT ’18]
MBZip [HiPEAC 18]
[Figure: L2, L3, and DRAM controller with a compressed bus and compressed DRAM]
“It takes two to speak the truth -