Time for Compressed Memory Hierarchies, Biswabandan Panda, WOCS 2018, December 8th, 2018

SLIDE 1

Time for Compressed Memory Hierarchies

Biswabandan Panda

WOCS 2018, December 8th, 2018

SLIDE 2

TIME FOR COMPRESSED MEMORY HIERARCHIES 2

Memory in Single-core Systems

Core -> L1 (4 cycles) -> L2 (12 cycles) -> L3 (30 cycles) -> DRAM controller (200 cycles)

Latency wall: DRAM was already critical for performance

SLIDE 3

Memory in Multi-core Systems

[Figure: eight cores sharing one DRAM controller; latency labels: 200 cycles, 800 cycles]

Core count doubling every 2 years
DRAM bandwidth doubling every 4 years
Memory wall = Latency wall + Bandwidth wall
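A back-of-the-envelope sketch of the bandwidth wall, assuming both trends on this slide are smooth exponentials that start from the same year. The function name and the normalization to 1.0 at year 0 are illustrative assumptions, not from the talk.

```python
def bandwidth_per_core(years):
    """Relative DRAM bandwidth per core after `years` (1.0 at year 0)."""
    cores = 2 ** (years / 2)       # core count doubles every 2 years
    bandwidth = 2 ** (years / 4)   # DRAM bandwidth doubles every 4 years
    return bandwidth / cores

for y in (0, 4, 8):
    print(y, bandwidth_per_core(y))
```

Under these assumptions the bandwidth available to each core halves every 4 years.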

SLIDE 4

Solution

Cache Compression (L3)

Cache Reuse (L3)

SLIDE 5

Cache Compression: 10,000 Feet View

Increases effective cache capacity without increasing the physical cache size

Exploits redundancy in data patterns

[Figure: cores, L1, L2, L3, DRAM controller]

SLIDE 6

Examples – Data Patterns

A[i][j]=0: 0x00000000 0x00000000 0x00000000 0x00000000

A[i][j]=constant: 0x333333FF 0x333333FF 0x333333FF 0x333333FF

*ptr: 0x888888C0 0x888888C8 0x888888D0 0x888888D8

Narrow values (small value in large data type): 0x00000001 0x00000002 0x00000003 0x00000004

Each value is 4 bytes
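The four patterns on this slide can be told apart with a few simple checks. The sketch below is only an illustration: the function name, the category labels, and the thresholds (values below 256 count as narrow, a shared upper 24 bits counts as pointer-like) are assumptions, not definitions from the talk.

```python
def classify(words):
    """Label a list of 32-bit words with the first matching pattern."""
    if all(w == 0 for w in words):
        return "zeros"
    if all(w == words[0] for w in words):
        return "repeated"
    if all(w < 256 for w in words):           # assumed narrow-value threshold
        return "narrow"
    if len({w >> 8 for w in words}) == 1:     # same upper 24 bits (assumed)
        return "shared-upper-bits"
    return "other"

print(classify([0x00000000] * 4))
print(classify([0x333333FF] * 4))
print(classify([0x888888C0, 0x888888C8, 0x888888D0, 0x888888D8]))
print(classify([0x00000001, 0x00000002, 0x00000003, 0x00000004]))
```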

SLIDE 7

Cache Compression

B0 at the fixed block size (64B) -> B0 at the compressed block size (< 64B)

SLIDE 8

Compaction

Compacts multiple contiguous compressed blocks of similar compressibility

DCC [MICRO ’13] and SCC [MICRO ’14]

[Figure: B0, B1, B2, B3, each compressed to 16B, compacted into one 64B block]

SLIDE 9

Compressed Cache Layout

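Conventional cache address: TAG | SET INDEX | OFFSET (the offset occupies the low 6 bits)
Super-block address: TAG | SET INDEX | BLK-ID | OFFSET (field boundaries at bits 6 and 8)

Conventional cache: B0, B1, B2, B3 map to SET 0, SET 1, SET 2, SET 3
Super-block cache: B0, B1, B2, B3 all map to SET 0 under one SUPER BLOCK TAG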
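A minimal sketch of the two address splits, assuming the "6" and "8" markers are bit positions, so that BLK-ID is the 2 bits above a 6-bit offset. The number of set-index bits (`SET_BITS`) and the function names are hypothetical.

```python
OFFSET_BITS = 6   # 64B blocks
BLK_ID_BITS = 2   # assumption: 4 blocks per super-block
SET_BITS = 11     # hypothetical number of set-index bits

def split_conventional(addr):
    """Return (tag, set_index, offset) for a conventional cache."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    set_index = (addr >> OFFSET_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (OFFSET_BITS + SET_BITS)
    return tag, set_index, offset

def split_superblock(addr):
    """Return (tag, set_index, blk_id, offset) for a super-block cache."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    blk_id = (addr >> OFFSET_BITS) & ((1 << BLK_ID_BITS) - 1)
    set_index = (addr >> (OFFSET_BITS + BLK_ID_BITS)) & ((1 << SET_BITS) - 1)
    tag = addr >> (OFFSET_BITS + BLK_ID_BITS + SET_BITS)
    return tag, set_index, blk_id, offset

# Four contiguous 64B blocks (B0..B3)
blocks = [0x1000 + 64 * i for i in range(4)]
print([split_conventional(a)[1] for a in blocks])   # four different sets
print([split_superblock(a)[1:3] for a in blocks])   # one set, four block IDs
```

With this split, four contiguous blocks land in consecutive sets of a conventional cache but in a single set of the super-block cache, distinguished only by their block ID.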

SLIDE 10

Compressed Cache Layout

YACC [Sardashti et al., TACO ’16]

Tag array entry -> data array entry:
TAG | 0 0 | - ID0 -> B0
TAG | 0 1 | ID0 ID2 -> B2 B0
TAG | 1 X | ID0 ID1 ID2 ID3 -> B3 B2 B1 B0

SLIDE 11

Observations

[Figure: four compaction examples (Block-I to Block-IV) for blocks B0 to B3 with compressed sizes of 16B, 32B, 48B, and 64B; blocks of dissimilar size leave un-occupied space in the 64B data entries]

SLIDE 12

Observations

Compression and compaction techniques are oblivious to each other

[Figure: blocks B0 and B1 passing through compression and compaction, steps ❶ and ❷]

Need for coordination

SLIDE 13

DISH: DICTIONARY SHARING BASED CACHE COMPRESSION [MICRO ’16]

SLIDE 14

Scheme-I (Data Content locality)

64B block as 16 4B chunks (index:value):
0:A 1:B 2:A 3:Z 4:C 5:A 6:A 7:D 8:A 9:E 10:A 11:G 12:C 13:H 14:G 15:H

Dictionary of 8 4B entries (index:entry), 32B:
0:A 1:B 2:Z 3:C 4:D 5:E 6:G 7:H

16 3-bit pointers for the 16 4B chunks, 6B:
0 1 0 2 3 0 0 4 0 5 0 6 3 7 6 7

One compressed block: 32B dictionary + 6B pointers = 38B
Shared dictionary: 32B dictionary + 6B pointers each for B0, B1, B2, B3 = 56B compressed block

SLIDE 15

Scheme-I (Data Content locality)

Same 16-chunk block, 8-entry dictionary, and 3-bit pointers as on Slide 14: 38B for one compressed block, 56B for four blocks sharing one dictionary

Four 64B blocks inside one 64B block
Not in the critical path
Compression latency of 24 cycles
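The Scheme-I example can be reproduced with a short software sketch. This is a sequential illustration under stated assumptions, not the hardware design: the function names are hypothetical, and the sketch simply gives up (returns `None`) when the blocks need more than 8 dictionary entries.

```python
def compress_scheme1(blocks, max_entries=8):
    """Build one shared dictionary of 4B chunks and per-block pointer lists.

    blocks: list of blocks, each a list of 16 chunk values.
    Returns (dictionary, pointers), or None if the dictionary overflows.
    """
    dictionary, pointers = [], []
    for block in blocks:
        ptrs = []
        for chunk in block:
            if chunk not in dictionary:
                if len(dictionary) == max_entries:
                    return None          # not compressible with this scheme
                dictionary.append(chunk)
            ptrs.append(dictionary.index(chunk))
        pointers.append(ptrs)
    return dictionary, pointers

def scheme1_size_bytes(num_blocks):
    """32B dictionary (8 x 4B) plus 6B (16 x 3 bits) of pointers per block."""
    return 8 * 4 + num_blocks * (16 * 3 // 8)

block = list("ABAZCAADAEAGCHGH")         # the 16 chunks from the slide
dictionary, pointers = compress_scheme1([block])
print(dictionary)
print(pointers[0])
print(scheme1_size_bytes(1), scheme1_size_bytes(4))
```

The two sizes printed at the end match the slide's 38B (one block) and 56B (four blocks sharing one dictionary).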

SLIDE 16

Scheme-II (Upper-bits Locality)

4B chunks such as 0x00000021, 0x00000030, 0x…32, 0x…12, 0x…02 share their upper 28 bits with a small set of dictionary entries (0x0000002, 0x…3, 0x…1, 0x…0)

DICTIONARY OF 4 28-BIT ENTRIES (14B)
16 2-BIT POINTERS and 16 4-BIT OFFSETS per 64B block (12B)

SHARED DICTIONARY: 14B dictionary + 12B of pointers and offsets each for B0, B1, B2, B3 = 62B compressed block

Four 64B blocks inside one 64B block
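A sketch of Scheme-II along the same lines, assuming each 4B chunk is split into its upper 28 bits (looked up in a 4-entry shared dictionary) and its lower 4 bits (stored as an offset). The function names and the sample block are hypothetical; only the field widths come from the slide.

```python
def compress_scheme2(blocks, max_entries=4):
    """Share a dictionary of upper-28-bit values across blocks.

    Each 32-bit word becomes (2-bit pointer, 4-bit offset).
    Returns (dictionary, encoded), or None if the dictionary overflows.
    """
    dictionary, encoded = [], []
    for block in blocks:
        enc = []
        for word in block:
            upper, offset = word >> 4, word & 0xF
            if upper not in dictionary:
                if len(dictionary) == max_entries:
                    return None
                dictionary.append(upper)
            enc.append((dictionary.index(upper), offset))
        encoded.append(enc)
    return dictionary, encoded

def scheme2_size_bytes(num_blocks):
    """4 x 28-bit dictionary (14B) plus 16 x (2 + 4) bits (12B) per block."""
    return (4 * 28 + num_blocks * 16 * (2 + 4)) // 8

# Hypothetical 16-word block built from the values shown on the slide
words = [0x21, 0x30, 0x32, 0x12, 0x02] * 3 + [0x21]
dictionary, encoded = compress_scheme2([words])
print([hex(u) for u in dictionary])
print(encoded[0][:5])
print(scheme2_size_bytes(4))
```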

SLIDE 17

Decompression

Shared dictionary (index:entry): 0:A 1:B 2:Z 3:C 4:D 5:E 6:G 7:H

Pointers of one block: 0 1 0 2 3 0 0 4 0 5 0 6 3 7 6 7

Decompressed chunks (index:value): 0:A 1:B 2:A 3:Z 4:C 5:A 6:A 7:D 8:A 9:E 10:A 11:G 12:C 13:H 14:G 15:H

Compressed block: 32B shared dictionary + 6B of pointers each for B0, B1, B2, B3

One-cycle decompression latency
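Decompression is a plain table lookup, which is consistent with the short latency quoted on this slide. A minimal sketch, with a hypothetical function name:

```python
def decompress_scheme1(dictionary, pointers):
    """Replace each pointer by the dictionary entry it indexes."""
    return [dictionary[p] for p in pointers]

shared_dictionary = list("ABZCDEGH")
block_pointers = [0, 1, 0, 2, 3, 0, 0, 4, 0, 5, 0, 6, 3, 7, 6, 7]
print("".join(decompress_scheme1(shared_dictionary, block_pointers)))
```

In hardware all 16 lookups can proceed in parallel, since no pointer depends on another; the loop here is only a software stand-in.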

SLIDE 18

DISH Layout

Tag array entry: TAG, ID0 -> data array entry: B0 (one uncompressed block)

Tag array entry: TAG, 1, ID1 ID0 ID3 ID2 -> data array entry: B3 B2 B1 B0 (one or more compressed blocks within a block)

Scheme-I/Scheme-II

No additional meta-data

SLIDE 19

DISH in Action - Compression

[Figure: DISH compression path on an L3 miss, steps ❶ to ❹. Labels: L2, L3, DRAM Contr., Compressor, DICT, B0’s PTR, B1’s PTR, B3’s PTR, tag entry (TAG, 1 0, ID1 ID0 ID3 ID2), blocks B0 to B3, address fields OFFSET | BLK-ID | SET INDEX | TAG]

SLIDE 20

DISH in Action - Decompression

[Figure: DISH decompression path on an L3 hit, steps ⓿ to ❹. Labels: Core, L2, L3, Decompressor, tag entry (TAG, 1 0, ID1 ID0 ID3 ID2), ID0, blocks B0 to B3]

SLIDE 21

Compression Ratio

[Figure: compression ratio (higher is better, axis from 1 to 3) for CPACK+Z [TVLSI '10], BDI [PACT '12], and DISH; DISH: 2.3X]

SLIDE 22

Speedup

[Figure: IPC speedup (higher is better, axis from 0.9 to 1.2) for astar, bwaves, bzip2, cactusADM, GemsFDTD, gromacs, h264ref, hmmer, lbm, leslie3d, plus SPEC, GRAPHS, and GMEAN; bars: CPACK+Z, BDI, DISH, 2X Baseline, 4X Baseline]

12.4% improvement over an uncompressed cache

SLIDE 23

Summary of Contributions

Case for compaction-aware compression
Inter-block data localities
Leverages the compressed cache layout

SLIDE 24

What Else Can Be Done with the Layout? Reuse Detection!

Cache Reuse (L3)

SLIDE 25

Reuse Cache [MICRO ’13]

8MB Conventional LLC: Tags, Data
Reuse LLC (4MB Tag + 1MB Data): Tags, Data

SLIDE 26

Reuse Cache: 1st Access

[Figure: L2, Reuse LLC (Tag, Data), DRAM]

Only tag entry is allocated

SLIDE 27

Reuse Cache: 2nd Access

[Figure: L2, Reuse LLC (Tag, Data), DRAM; tag hit]

Data entry allocated, block is reused
Highly efficient
4X more tag entries
Decoupled tag/data array
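The two-access behaviour described on these slides can be sketched as a tiny state machine. This is a simplification under stated assumptions: the class and method names are hypothetical, capacity limits and replacement are ignored, and `fetch` stands in for a DRAM read.

```python
class ReuseCacheSketch:
    """Tag-only allocation on first access, data allocation on reuse."""

    def __init__(self):
        self.tags = set()   # decoupled tag array
        self.data = {}      # decoupled data array

    def access(self, addr, fetch):
        if addr in self.data:
            return "data hit"
        if addr in self.tags:
            # Second access: reuse detected, so the data entry is allocated.
            self.data[addr] = fetch(addr)
            return "tag hit, data allocated"
        # First access: only the tag entry is allocated.
        self.tags.add(addr)
        return "miss, tag allocated"

llc = ReuseCacheSketch()
for _ in range(3):
    print(llc.access(0x40, fetch=lambda a: b"\x00" * 64))
```

Blocks that are never touched a second time occupy only a tag entry, which is why the data array can be much smaller than the tag array.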

SLIDE 28

The Question

Can we detect the reusability of LLC blocks without 4X more tags in conventional and compressed caches?
Can we use the existing layout of compressed caches for reuse also?

SLIDE 29

The Answer: Our Contribution

Reuse Cache + Cache Compression

[Figure: tag and data arrays holding block A0]

Synergistic Cache Layout for Reuse and Compression

SLIDE 30

SRC: SYNERGISTIC CACHE LAYOUT FOR REUSE AND COMPRESSION [PACT ’18]

SLIDE 31

Our Contribution: 10K Feet View

A single cache layout (super-block tag) for:
(i) Reuse detection of a cache block, even without compression.
(ii) Cache compression: both reuse and compression in compressed caches.

SLIDE 32

Synergistic Cache Layout for Reuse and Compression: Reuse Only

SLIDE 33

Dead Blocks at the LLC

[Figure: fraction of non-reused blocks (axis from 0.1 to 0.9) for bzip2, cactusADM, libquantum, leslie3d, omnetpp, sjeng, xalancbmk, milc, lbm, soplex, mcf, zeusmp, gromacs, hmmer, h264ref, namd, bwaves, blackscholes, bodytrack, canneal, dedup, freqmine, fluidanimate, facesim, ferret, streamcluster, swaptions, vips, x264, apsp, bc, bfs, community, pagerank, sssp, sgd, lsh, spmv, symgs, knn, and the average]

~60% of the LLC blocks are dead

SLIDE 34

Compressed Blocks at the LLC

[Figure: fraction of uncompressed blocks (axis from 0.2 to 1) for bzip2, cactusADM, libquantum, leslie3d, omnetpp, sjeng, xalancbmk, milc, lbm, soplex, mcf, zeusmp, gromacs, hmmer, h264ref, namd, bwaves, blackscholes, bodytrack, canneal, dedup, freqmine, fluidanimate, facesim, ferret, streamcluster, swaptions, vips, x264, apsp, bc, bfs, community, pagerank, sssp, sgd, lsh, spmv, symgs, knn, and the average]

~40% of the LLC blocks can’t be compressed

40% of the super-block tags are underutilized ☺

SLIDE 35

U-SRC for Reuse Only-I

[Figure: super-block tag entries (states V, I, FU) after each access]
1. A2: global miss
2. A0: tag hit, block miss
3. A0: tag hit, block reuse

SLIDE 36

U-SRC for Reuse Only-II

[Figure: super-block tag entries (states V, I, FU) after each access]
1. A3: multiple tag hits, block miss
2. Writeback A3: multiple tag hits, block miss, data forwarded to the DRAM

SLIDE 37

Normalized Performance: U-SRC

[Figure: normalized performance (higher is better, axis from 1 to 1.2) for bzip2, cactusADM, libquantum, leslie3d, omnetpp, sjeng, xalancbmk, milc, lbm, soplex, mcf, zeusmp, gromacs, hmmer, h264ref, namd, bwaves, blackscholes, bodytrack, canneal, dedup, freqmine, fluidanimate, facesim, ferret, streamcluster, swaptions, vips, x264, apsp, bc, bfs, community, pagerank, sssp, sgd, lsh, spmv, symgs, knn, and the geomean; bars: Uncompressed Reuse Cache, U-SRC Cache]

U-SRC performs close to the reuse cache: 6.5%/8% over the baseline for single-core/multi-core workloads

SLIDE 38

Synergistic Cache Layout for Reuse and Compression: Both Reuse and Compression

SLIDE 39

SRC for Compressed LLC - I

[Figure: super-block tag entries (states V, I, FU) after each access]
1. A2: global miss, compressible
2. A0: tag hit, block miss, co-compressible

SLIDE 40

SRC for Compressed LLC - II

[Figure: super-block tag entries (states V, I, FU) after each access]
1. A1: tag hit, block miss, incompressible
2. A1: tag hit, reuse detected, incompressible

SLIDE 41

Two Cases on Multiple Tag Hits

[Figure: super-block tag entries (states V, I, FU) for the two cases]
1. A3: multiple tag hits, co-compressible
2. A3: multiple tag hits, incompressible

SLIDE 42

Special Case: Writeback

[Figure: super-block tag entries (states V, I, FU) before and after the writeback]
A2: writeback, not co-compressible

SLIDE 43

SRC for Reuse+Compression

[Figure: normalized performance (higher is better, axis from 1 to 1.4) for bzip2, cactusADM, libquantum, leslie3d, omnetpp, sjeng, xalancbmk, milc, lbm, soplex, mcf, zeusmp, gromacs, hmmer, h264ref, namd, bwaves, blackscholes, bodytrack, canneal, dedup, freqmine, fluidanimate, facesim, ferret, streamcluster, swaptions, vips, x264, apsp, bc, bfs, community, pagerank, sssp, sgd, lsh, spmv, symgs, knn, and the geomean; bars: U-SRC Cache, YACC+DISH, 4X Uncompressed Baseline, SRC Cache; off-scale bar labels: 1.46, 1.77, 1.64, 1.5; annotation: 15%]

SLIDE 44

Journey from YACC to SRC

YACC: An efficient and cost-effective layout for a compressed cache. [TACO ’16]
YACC+DISH: One can leverage the layout to enhance compression. [MICRO ’16]
SRC: One can leverage the YACC layout to enhance LLC reuse along with compression. [PACT ’18]

SLIDE 45

MBZip [HiPEAC ’18]

[Figure: L2, L3, DRAM controller, compressed bus, compressed DRAM]

SLIDE 46

“It takes two to speak the truth - one to speak and another to hear”
- Henry David Thoreau

Thank You