MORC: A Manycore-Oriented Compressed Cache, by Tri M. Nguyen and David Wentzlaff (PowerPoint PPT Presentation)



SLIDE 1

MORC

A MANYCORE ORIENTED COMPRESSED CACHE
TRI M. NGUYEN, DAVID WENTZLAFF

12/7/2015 1

SLIDE 2

Architectures moving toward manycore


  • Tilera: 64-72 cores (2007)
  • Intel MIC: 288 threads (2015)
  • NVIDIA GPGPUs: 3072 threads (2015)

Increasing thread aggregation

  • Cloud computing
  • Massive warehouse-scale data centers
SLIDE 3

Motivation: off-chip bandwidth scalability

Throughput is already bandwidth-bound

  • Assumption: 1000 threads, 1GB/s per thread
  • Demand: 1000GB/s
  • Supply: 102.4GB/s (four DDR4 channels)
  • Oversubscription ratio: ~10x

The bandwidth wall will stall practical manycore scaling

  • High pin-count packaging is uneconomical
  • Pin size is hard to shrink even in high-cost chips
  • Pin frequency does not scale well


Throughput = min(compute_avail, bandwidth_avail)
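The slide's arithmetic can be checked in a few lines. A per-channel figure of 25.6 GB/s is assumed here only because it is consistent with the quoted 102.4 GB/s total for four DDR4 channels:

```python
# Back-of-envelope check of the bandwidth wall (figures from the slide:
# 1000 threads at 1 GB/s each vs. four DDR4 channels at 25.6 GB/s each).
threads = 1000
demand = threads * 1.0          # GB/s demanded by all threads
supply = 4 * 25.6               # GB/s supplied off-chip = 102.4

oversubscription = demand / supply   # ~9.8x, i.e. roughly 10x

def throughput(compute_avail, bandwidth_avail):
    # Throughput = min(compute, bandwidth): the scarcer resource wins.
    return min(compute_avail, bandwidth_avail)

print(round(oversubscription, 1))   # ~9.8
print(throughput(demand, supply))   # bandwidth-bound at 102.4
```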

SLIDE 4

Compressing LLC as a solution

More on-chip cache correlates with higher performance, so a more effective cache through compression should correlate with performance as well.


SLIDE 5

Compressing LLC as a solution



MORC:

  • Manycore-oriented compressed cache
  • Compresses the LLC (last-level cache) to reduce off-chip misses

Insight:

  • Target throughput over single-threaded performance
  • This makes expensive stream-based compression algorithms affordable

SLIDE 6

Outline

  • Stream compression is great!
  • …but is hard with set-based caches
  • …and is not for single-threaded performance
  • Stream compression with log-based caches
  • Architecture of log-based compressed cache
  • Results
  • Performance
  • Energy

SLIDE 7

What is stream-based compression?

Common software data compression algorithms

  • LZ77, gzip, LZMA

Sequentially compresses cache lines as a single stream

  • Compresses using pointers that copy earlier repeated strings (data)

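As a hedged illustration of the pointer-copy idea (a toy LZ77-style coder, not the exact algorithm used in the talk):

```python
# Toy LZ77-style stream compressor: emit literal bytes, or (offset, length)
# pointers that copy a repeated string from earlier in the stream.
def lz77_compress(data: bytes, window: int = 255, min_match: int = 3):
    out, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        # Search the sliding window for the longest earlier match.
        for j in range(max(0, i - window), i):
            l = 0
            while i + l < len(data) and data[j + l] == data[i + l]:
                l += 1
            if l > best_len:
                best_off, best_len = i - j, l
        if best_len >= min_match:
            out.append(("ptr", best_off, best_len))   # copy from history
            i += best_len
        else:
            out.append(("lit", data[i]))              # raw byte
            i += 1
    return out

def lz77_decompress(tokens) -> bytes:
    buf = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            buf.append(tok[1])
        else:                      # ("ptr", offset, length), copies may overlap
            _, off, length = tok
            for _ in range(length):
                buf.append(buf[-off])
    return bytes(buf)
```

Repetitive data such as `b"abcabcabcabc"` collapses to three literals plus one pointer, which is why long sequential streams compress so well.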
SLIDE 11

Stream compression example

SLIDE 15

Stream vs block-based compression


Stream-based compression achieves much higher compression ratios. Yet many prior works use block-based compression, for two reasons: single-threaded performance and implementability.
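To make the gap concrete, here is a hedged sketch using zlib (an LZ77-family compressor standing in for the stream algorithms above; the 64-byte "cache lines" are synthetic): per-block compression pays a header per line and cannot exploit redundancy across lines, while one stream can.

```python
import zlib

# 64 synthetic 64-byte "cache lines" with heavy cross-line redundancy.
lines = [bytes([i % 8]) * 64 for i in range(64)]

# Block-based: compress each line independently (per-line overhead, no sharing).
per_block = sum(len(zlib.compress(line)) for line in lines)

# Stream-based: compress all lines sequentially as one stream.
stream = len(zlib.compress(b"".join(lines)))

print(per_block, stream)   # the single stream comes out far smaller
```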

SLIDE 19

First reason: Well-matched for throughput

Decompression is inherently expensive


Insight:

  • High latency
  • High energy consumption

…but memory accesses are even more expensive!

SLIDE 22

Second reason: Hard to implement with set-based caches

Cache sets are unsuited for stream-based compression

  • Evictions and write-backs corrupt the compression stream


Implementation: compress each cache set as a compressed stream

SLIDE 23

Introducing log-based caches


Log-based caches organize cache lines by temporal fill order

SLIDE 24

Fill data-path architecture


  • Lines stream to one active log sequentially
  • Record address_1 to log_3 in a table
SLIDE 25

Fill data-path architecture


  • Lines stream to one active log sequentially
  • Record address_2 to log_3 in a table
SLIDE 26

Fill data-path architecture


Log-flush happens when there is not enough space

  • Not in critical-path
  • Only writes back dirty cache lines
SLIDE 27

Fill data-path architecture


  • Lines stream to one active log sequentially
  • Record address_3 to log_4 in a table
SLIDE 29

Request data-path


LMT: Line-Map Table (redirection table)

  • Indexed by addresses
  • Points to logs

1. Stream compressor
2. LMT
3. Eviction policy (flush)
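These components can be sketched in miniature. This is an illustrative model with invented names, and it omits the actual compression (a real log holds a compressed stream, so a lookup decompresses rather than scans):

```python
# Minimal sketch of a log-based cache: fills append lines to one active log,
# a Line-Map Table (LMT) maps addresses to logs, and a full log is flushed
# whole, writing back only its dirty lines.
class LogBasedCache:
    def __init__(self, log_capacity=4, num_logs=4):
        self.logs = [[] for _ in range(num_logs)]  # each log: (addr, data, dirty)
        self.lmt = {}                              # address -> log index
        self.active = 0
        self.log_capacity = log_capacity
        self.num_logs = num_logs
        self.written_back = []                     # stand-in for DRAM write-backs

    def fill(self, addr, data, dirty=False):
        if len(self.logs[self.active]) >= self.log_capacity:
            self.active = (self.active + 1) % self.num_logs
            self._flush(self.active)               # make room, off the critical path
        self.logs[self.active].append((addr, data, dirty))
        self.lmt[addr] = self.active               # record address -> log

    def lookup(self, addr):
        log_idx = self.lmt.get(addr)               # 1. LMT redirection
        if log_idx is None:
            return None                            # miss
        for a, data, _ in self.logs[log_idx]:      # 2. scan (really: decompress)
            if a == addr:
                return data
        return None

    def _flush(self, idx):
        for addr, data, dirty in self.logs[idx]:
            if dirty:
                self.written_back.append((addr, data))  # only dirty lines leave
            self.lmt.pop(addr, None)
        self.logs[idx] = []
```

Because lines are placed in temporal fill order and whole logs are retired together, evictions never splice the middle of a compressed stream, which is what made set-based stream compression impractical.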

SLIDE 30

Content-aware compression with logs

Multiple active logs enable content-aware compression

  • Dynamically choose the best stream based on content similarity
  • Better than strict sequential compression
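A hedged sketch of the selection step, with zlib as a stand-in compressor and invented helper names (`marginal_cost`, `pick_log`): the incoming line goes to whichever active log's stream it extends most cheaply.

```python
import zlib

def marginal_cost(log_bytes: bytes, line: bytes) -> int:
    """Extra compressed bytes if `line` is appended to this log's stream."""
    return len(zlib.compress(log_bytes + line)) - len(zlib.compress(log_bytes))

def pick_log(logs, line):
    """Index of the active log whose stream absorbs `line` most cheaply."""
    return min(range(len(logs)), key=lambda i: marginal_cost(logs[i], line))

logs = [b"AAAA" * 16, b"wxyz" * 16]
best = pick_log(logs, b"AAAA" * 16)   # content-similar to log 0
```

Recompressing a stream per candidate is far too slow for hardware; a real design would use a cheap similarity estimate, but the objective being minimized is the same.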

SLIDE 31

Prior work in LLC compression


Internal fragmentation in compression blocks

  • Decreases the absolute compression ratio by as much as 12.5%

External fragmentation

  • Increases LLC energy by as much as 200% (studied in [2])

[1] Alameldeen et al., “Adaptive cache compression for high-performance processors,” ISCA ’04
[2] Sardashti et al., “Decoupled compressed cache: exploiting spatial locality for energy-optimized compressed caching,” MICRO ’13
[3] Arelakis et al., “SC2: A statistical compression cache scheme,” ISCA ’14

Scheme        Internal frag.   External frag.   Tag overhead   Requires software   Organization   Algorithm
Adaptive[1]   Yes              Yes              Medium         No                  Set-based      Block
Decoupled[2]  Yes              No               Low            No                  Set-based      Block
SC2[3]        Yes              Yes              High           Yes                 Set-based      Centralized
MORC          Very little      No               Low            No                  Log-based      Stream

SLIDE 32

Simulation methodology

Simulator: PriME[1]

  • Execution-driven, in-order x86

SPEC CPU2006 benchmarks

Future manycore system:

  • 1024 cores in a single chip
  • 128MB LLC (128KB per core)
  • 100GB/s off-chip bandwidth (100MB/s per core)


[1] Y. Fu et al, “PriME: A parallel and distributed simulator for thousand-core chips,” ISPASS 2014

SLIDE 34

Compression results


Max average compression ratio: 6x; arithmetic mean: 3x

SLIDE 38

Throughput improvements


  • Max average compression ratio: 6x; arithmetic mean: 3x
  • Throughput improvement: 40% (best prior work: 20%)
  • Improvement depends on working-set sizes
SLIDE 39

Energy

Two questions:

  • Do the DRAM access energy savings materialize?
  • Is compression/decompression energy a concern?


SLIDE 40

Energy

  • Expensive DRAM accesses
  • Negligible compression energy
  • Small decompression energy


Memory subsystem energy normalized to uncompressed baseline

SLIDE 42

Summary

Stream compression compresses much better than block-based compression

  • …but is hard with set-based caches
  • …and is not the right approach for single-threaded performance

Log-based caches efficiently support stream-based compression

  • Sequential cache line placements

Architecture

  • Stream compressor, LMT, eviction policy

Results

  • 50% better compression and 100% greater throughput improvement than the best prior work
  • Better energy efficiency
