MORC
A MANYCORE ORIENTED COMPRESSED CACHE
TRI M. NGUYEN, DAVID WENTZLAFF
12/7/2015
Architectures moving toward manycore
Tilera: 64-72 cores (2007)
Intel MIC: 288 threads (2015)
NVIDIA GPGPUs: 3072 threads (2015)
Increasing thread aggregation
Throughput is already bandwidth-bound
Bandwidth-wall will stall practical manycore scaling
Throughput = min(compute_avail, bandwidth_avail)
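The bound above can be read as a simple roofline-style model. The following is a minimal sketch under stated assumptions: the function names, the per-core rate, the off-chip bandwidth, and the bytes of off-chip traffic per operation are all hypothetical values chosen for illustration, not figures from the talk.

```python
def throughput(cores, ops_per_core, bandwidth_bytes, bytes_per_op):
    """Throughput = min(compute_avail, bandwidth_avail), in operations per second."""
    compute_avail = cores * ops_per_core              # grows with the core count
    bandwidth_avail = bandwidth_bytes / bytes_per_op  # fixed by the off-chip link
    return min(compute_avail, bandwidth_avail)


def scaling_table(core_counts, ops_per_core=1e9, bandwidth_bytes=100e9, bytes_per_op=2.0):
    """Throughput for each core count under the assumed (hypothetical) parameters."""
    return [(n, throughput(n, ops_per_core, bandwidth_bytes, bytes_per_op))
            for n in core_counts]


if __name__ == "__main__":
    for n, t in scaling_table([16, 32, 64, 128, 256]):
        print(f"{n:4d} cores: {t:.3g} ops/s")
```

Under these assumed numbers, throughput stops growing once the core count passes the point where compute outruns bandwidth. Lowering `bytes_per_op`, which is what fewer off-chip misses would do, raises that ceiling.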
More on-chip cache correlates with higher performance
More effective cache through compression correlates with performance
MORC: reduce off-chip misses
Insight:
algorithms
Common software data compression algorithms
Sequentially compresses cache lines as a single stream
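The difference between compressing lines one at a time and compressing them as a single stream can be illustrated in software. This is a sketch only: it uses DEFLATE from Python's `zlib` as a stand-in for a stream compressor, and the 64-byte line size, the synthetic pointer-like data, and all function names are assumptions made for the example.

```python
import zlib

LINE_SIZE = 64  # bytes per cache line (assumed)


def make_lines(count=64):
    """Synthetic lines of eight 8-byte values that share their high-order bytes."""
    lines = []
    for i in range(count):
        base = 0x00007F0000000000 + i * LINE_SIZE
        lines.append(b"".join((base + 8 * j).to_bytes(8, "little") for j in range(8)))
    return lines


def deflate(chunks):
    """Raw DEFLATE (no zlib header) over the chunks, fed in order as one stream."""
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)
    return b"".join(comp.compress(c) for c in chunks) + comp.flush()


def block_compressed_size(lines):
    """Each line is compressed on its own, so no history is shared between lines."""
    return sum(len(deflate([line])) for line in lines)


def stream_compressed_size(lines):
    """All lines go through one compressor, so later lines can reuse earlier history."""
    return len(deflate(lines))


if __name__ == "__main__":
    lines = make_lines()
    print("raw bytes:   ", len(lines) * LINE_SIZE)
    print("per-line sum:", block_compressed_size(lines))
    print("one stream:  ", stream_compressed_size(lines))
```

On this synthetic data the single stream comes out smaller than the sum of the per-line results, because redundancy across lines is visible only to the stream compressor.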
Stream-based compression achieves much higher compression
Much prior work uses block-based compression
Two reasons: single-threaded performance and implementability
Decompression is inherently expensive
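One plausible reading of this cost, given that lines are compressed as a single stream, is that reaching a line means decompressing everything stored before it. The sketch below illustrates that under assumptions: `zlib` stands in for the stream compressor, lines are 64 bytes, and `compress_stream` and `read_line` are hypothetical helper names.

```python
import zlib

LINE_SIZE = 64  # bytes per cache line (assumed)


def compress_stream(lines):
    """Compress the lines in order as one zlib stream."""
    comp = zlib.compressobj()
    return b"".join(comp.compress(line) for line in lines) + comp.flush()


def read_line(stream, index):
    """Return (line, bytes_decompressed) for the line at position `index`."""
    needed = (index + 1) * LINE_SIZE
    d = zlib.decompressobj()
    # Decompression must start at the head of the stream; stop once the line is reached.
    out = d.decompress(stream, needed)
    while len(out) < needed and d.unconsumed_tail:
        out += d.decompress(d.unconsumed_tail, needed - len(out))
    if len(out) < needed:
        raise IndexError("line index beyond end of stream")
    return out[-LINE_SIZE:], len(out)
```

In this sketch the work to read a line grows with its position in the stream: the first line costs 64 decompressed bytes, the sixteenth costs 1024.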
Insight:
Memory accesses are expensive!
Implementation: compress each cache set as a compressed stream
Cache sets are unsuited for stream-based compression
Log-based caches organize cache lines by temporal fill order
A log flush happens when there is not enough space
LMT: Line-Map Table (redirection table)
1. Stream compressor
2. LMT
3. Eviction policy (flush)
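The three components listed above could fit together roughly as follows. This is a minimal single-log software sketch, not the hardware design: the class and method names, the 64-byte line size, the log capacity, and the use of `zlib` as the stream compressor are all assumptions for illustration.

```python
import zlib

LINE_SIZE = 64  # bytes per cache line (assumed)


class LogCache:
    """Single-log sketch: lines are appended in fill order and compressed as one stream."""

    def __init__(self, log_capacity=256):
        self.log_capacity = log_capacity  # compressed bytes one log may hold (assumed)
        self.lmt = {}                     # line-map table: address -> position in the log
        self.flushes = 0
        self._reset_log()

    def _reset_log(self):
        self.comp = zlib.compressobj()    # stand-in stream compressor
        self.chunks = []                  # compressed bytes, one chunk per appended line
        self.addrs = []                   # addresses in temporal fill order
        self.used = 0

    def _try_append(self, addr, data):
        trial = self.comp.copy()          # compress on a copy so a failed fit changes nothing
        chunk = trial.compress(data) + trial.flush(zlib.Z_SYNC_FLUSH)
        if self.used + len(chunk) > self.log_capacity:
            return False
        self.comp = trial
        self.chunks.append(chunk)
        self.addrs.append(addr)
        self.used += len(chunk)
        self.lmt[addr] = len(self.addrs) - 1
        return True

    def flush(self):
        """Eviction policy of this sketch: drop the whole log and its LMT entries."""
        for addr in self.addrs:
            self.lmt.pop(addr, None)
        self._reset_log()
        self.flushes += 1

    def fill(self, addr, data):
        if len(data) != LINE_SIZE:
            raise ValueError("expected one full cache line")
        if not self._try_append(addr, data):
            self.flush()                  # not enough space: flush, then retry
            if not self._try_append(addr, data):
                raise ValueError("line does not fit in an empty log")

    def read(self, addr):
        pos = self.lmt.get(addr)
        if pos is None:
            return None                   # miss
        d = zlib.decompressobj()
        data = b"".join(d.decompress(c) for c in self.chunks[: pos + 1])
        return data[-LINE_SIZE:]
```

The LMT is what lets a lookup by address find a line whose location depends only on when it was filled. A design with multiple active logs would additionally choose which log each fill is appended to, which this sketch leaves out.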
Multiple active logs enable content-aware compression
Internal fragmentation in compression blocks
External fragmentation
[1] Alameldeen et al., “Adaptive cache compression for high-performance processors,” ISCA’04
[2] Sardashti et al., “Decoupled compressed cache: exploiting spatial locality for energy-optimized compressed caching,” MICRO’13
[3] Arelakis et al., “SC2: A statistical compression cache scheme,” ISCA’14
Scheme        | Internal fragmentation | External fragmentation | Tags   | Requiring software | Set-based | Algorithm
Adaptive [1]  | Yes                    | Yes                    | Medium | No                 | Yes       | Block
Decoupled [2] | Yes                    | No                     | Low    | No                 | Yes       | Block
SC2 [3]       | Yes                    | Yes                    | High   | Yes                | Yes       | Centralized
MORC          | Very little            | No                     | Low    | No                 | Log-based | Stream
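Internal fragmentation, as compared above, can be made concrete with a small calculation. The sketch assumes a block-based scheme that rounds each compressed line up to whole fixed-size segments. The 8-byte segment size, the example compressed sizes, and the function name are hypothetical.

```python
def internal_fragmentation(compressed_sizes, segment_bytes=8):
    """Bytes wasted when each compressed line is rounded up to whole segments."""
    wasted = 0
    for size in compressed_sizes:
        segments = -(-size // segment_bytes)  # ceiling division
        wasted += segments * segment_bytes - size
    return wasted


if __name__ == "__main__":
    sizes = [20, 33, 8]  # hypothetical compressed line sizes in bytes
    print(internal_fragmentation(sizes))                   # 8-byte segments
    print(internal_fragmentation(sizes, segment_bytes=1))  # byte-granular packing
```

With byte-granular packing, as when compressed lines are written back to back into a stream, this rounding loss disappears in the model.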
Simulator: PriME[1]
SPEC2006 benchmarks
Future manycore system
[1] Y. Fu et al, “PriME: A parallel and distributed simulator for thousand-core chips,” ISPASS 2014
Max average compression ratio: 6x
Arithmetic mean: 3x
Throughput improvements: 40%
Best prior work: 20%
Improvements depends
Two questions:
Expensive DRAM accesses
Negligible compression energy
Small decompression energy
Memory subsystem energy normalized to uncompressed baseline
Stream-based compression compresses much better than block-based compression
Log-based caches efficiently support stream-based compression
Architecture
Results