SLIDE 1

Compress Objects, Not Cache Lines: An Object-Based Compressed Memory Hierarchy

Po-An Tsai and Daniel Sanchez

SLIDE 5

Prior memory compression techniques are limited to compressing cache lines

- Data movement limits performance and efficiency
- A memory access takes 100x the latency and 1000x the energy of a floating-point operation
- Applying hardware-based compression to the memory hierarchy to reduce data movement thus becomes beneficial

[Figure: Core -> private L1/L2 (data uncompressed) -> shared LLC (compressed cache) -> main memory (compressed): more capacity & less traffic]

- To support random accesses, the memory hierarchy transfers cache lines between levels
- Prior techniques are thus limited to compressing cache lines

SLIDE 8

Challenges due to compressing at cache-line granularity

1. Locating the compressed cache line (architecture)
   Fixed-size cache lines become variable-size compressed blocks, so hardware needs to translate uncompressed addresses to compressed block locations.

2. Compressing cache lines (algorithm)
   Cache lines are small and decompression latency is on the critical path, so hardware cannot compress more than 64 B at a time and only low-latency algorithms are practical.

SLIDE 13

Prior compressed memory architectures sacrifice compression ratio for low latency

- They aim to quickly translate uncompressed addresses to compressed addresses
- Example: Linearly compressed pages [LCP, Pekhimenko et al., MICRO'13]
- Other techniques make similar tradeoffs, e.g., allowing 4 different sizes for cache lines in a page

[Figure: a 4 KB page of 64 B lines (uncompressed format) becomes a 2 KB page of 32 B lines (compressed format); the original cache-line address is translated to the compressed block address via the VM system before reaching the shared LLC]

- LCP compresses page by page to leverage the VM system for translation: fast and low overhead
- LCP forces cache lines in the same page to compress to the same size: this sacrifices compression ratio

[RMC, Ekman and Stenstrom, HPCA'06] [DMC, Kim et al., PACT'17] [Compresso, Choukse et al., MICRO'18]
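The page-by-page translation LCP relies on can be sketched in a few lines. This is an illustrative model (the constants and function names are ours, not the paper's): because every line in a compressed page shrinks to the same fixed size, the compressed block address is a simple linear function of the line index.

```python
# Hypothetical sketch of LCP-style address translation, not the actual hardware:
# each 64 B line in a page compresses to one fixed size, so the compressed
# block address is base + line_index * compressed_line_size.

PAGE_SIZE = 4096          # uncompressed page: 4 KB
LINE_SIZE = 64            # uncompressed cache line: 64 B

def lcp_translate(uncompressed_addr, comp_page_base, comp_line_size):
    """Map an uncompressed cache-line address to its compressed block address."""
    line_index = (uncompressed_addr % PAGE_SIZE) // LINE_SIZE
    return comp_page_base + line_index * comp_line_size

# Line 3 of a page whose compressed copy starts at 0x10000 with 32 B lines:
addr = lcp_translate(0x40000 + 3 * LINE_SIZE, 0x10000, 32)
assert addr == 0x10000 + 3 * 32
```

This is exactly why the same-size constraint exists: if lines could compress to different sizes, the address computation would no longer be a single multiply-add.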

SLIDE 17

Prior compression algorithms are limited to exploiting redundancy within a cache line to achieve low latency

- Example: Base-Delta-Immediate compression [BDI, Pekhimenko et al., PACT'12]

[Figure: a 64 B cache line of an int array (100, 100, 102, 101, 103, 103, 102, 104, 108, 109, 109, 111) in the uncompressed layout; the compressed layout stores base 100 plus deltas (+2, +1, +3, +3, +2, +4) and base 108 plus deltas (+1, +1, +3); float arrays (1.1, 1.2, 1.3, ...) and reference arrays (0x18, 0x30, 0x48, ...) are handled similarly]

- These algorithms work well on arrays: homogeneous, regular data [FP-H, Arelakis et al., MICRO'15] [BPC, Kim et al., ISCA'16]

[Chart: compression ratio, no compression vs prior work; FFT: 1.00 vs 1.67; SPMV: 1.00 vs 1.55]
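The base-plus-delta idea above can be sketched as follows. This is a simplified illustration of the encoding, not the paper's circuit: real BDI tries several base and delta widths (and an implicit zero base), while this sketch uses a single base and one delta width.

```python
# Simplified base-delta sketch: store one wide base value plus narrow deltas
# when all values in the line are close to the base.

def bdi_compress(values, delta_bits=8):
    """Compress a line of integers as (base, deltas) if every delta fits."""
    base = values[0]
    deltas = [v - base for v in values]
    limit = 1 << (delta_bits - 1)
    if all(-limit <= d < limit for d in deltas):
        return (base, deltas)   # e.g. one 8 B base + 1 B per delta vs 8 B per value
    return None                 # incompressible with this base/delta width

def bdi_decompress(base, deltas):
    return [base + d for d in deltas]

line = [100, 100, 102, 101, 103, 103, 102, 104]
compressed = bdi_compress(line)
assert compressed is not None
assert bdi_decompress(*compressed) == line
```

The decompressor is a row of parallel adders, which is what keeps the latency low enough for the critical path.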

SLIDE 21

Prior compression algorithms work poorly on objects

- Prior algorithms work poorly on objects: heterogeneous, irregular data
- Little redundancy within a 64 B cache line
- Array-heavy apps: 61% compression ratio; object-heavy apps: 14% compression ratio

[Figure: a 64 B cache line holding fields of object A1 (100, 1.1, 0x18) and object A2 (102, 1.3, 0x48), next to objects B and C]

[Chart: compression ratio, no compression vs prior work; FFT: 1.67; SPMV: 1.55; H2: 1.15; SPECJBB: 1.27; PAGERANK: 1.06; COLORING: 1.07; BTREE: 1.10; GUAVACACHE: 1.15]

SLIDE 25

Objects, not cache lines, are the natural unit of compression

- Insight 1: Object-based applications always follow pointers to access objects
- Idea 1: Point directly to the location of compressed objects to avoid uncompressed-to-compressed address translation!

[Figure: objects A1, B1, A2, C, B2 in the uncompressed layout (0x00 to 0xFF) vs the same objects packed in the compressed layout (0x00 to 0xDF), with pointers pointing directly at the compressed locations]

SLIDE 29

Objects, not cache lines, are the natural unit of compression

- Insight 2: There is significant redundancy across objects of the same type
- Idea 2: Compress across objects, not within cache lines, to leverage more redundancy!

[Figure: the compressed layout (objects A1, B1, A2, C, B2; 0x00 to 0xDF) is further compressed to deltas (delta A1, delta B1, delta A2, delta C, delta B2; 0x00 to 0x8F), where each delta holds only the bytes that differ from a shared base object]

SLIDE 32

Compressing objects would be hard to do on cache hierarchies

- Ideally, we want a memory system that
  - Moves objects, rather than cache lines
  - Transparently updates pointers during compression
- Therefore, we realize our ideas on Hotpads [Tsai et al., MICRO'18], a recent object-based memory hierarchy

SLIDE 39

Baseline system: Hotpads overview

[Figure: Core -> L1 pad -> L2 pad -> L3 pad]

- Data array
  - Managed as a circular buffer using simple sequential allocation
  - Stores variable-sized objects compactly (e.g., Obj. A, then Obj. B, then free space)
  - Can store variable-sized compressed objects compactly too!
- C-Tags: a decoupled tag store
- Metadata (per word/object): pointer? valid? dirty? recently-used?
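The data array's sequential allocation amounts to a bump pointer into a circular buffer. The sketch below is our own software analogy of that policy (names and sizes are illustrative, not the Hotpads hardware): objects of any size are packed back to back, and a full pad signals that a bulk eviction is needed.

```python
# Software analogy (assumed, not from the paper) of a pad's data array:
# variable-sized objects are packed with a bump pointer; when the array
# fills, allocation fails and a bulk eviction would reset the free region.

class Pad:
    def __init__(self, size):
        self.size = size
        self.next_free = 0      # bump pointer into the data array
        self.objects = {}       # offset -> object size (stand-in for C-tags)

    def alloc(self, obj_size):
        """Sequentially allocate; return None to signal a needed bulk eviction."""
        if self.next_free + obj_size > self.size:
            return None         # pad full: trigger bulk eviction
        offset = self.next_free
        self.next_free += obj_size
        self.objects[offset] = obj_size
        return offset

pad = Pad(size=256)
a = pad.alloc(24)   # Obj. A at offset 0
b = pad.alloc(40)   # Obj. B packed right after A
assert (a, b) == (0, 24)
```

Because allocation never rounds objects up to a fixed block size, compressed objects of odd sizes pack just as compactly as uncompressed ones.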
SLIDE 45

Hotpads moves objects instead of cache lines

Example object: class ListNode { int value; ListNode next; }

[Figure: RegFile (r0 to r3), L1 pad, L2 pad, and main memory; objects A and B start in main memory, with free space above the allocated objects]

- Initial state: A and B reside in main memory
- 1. Program code: int v = A.value; so A is copied into the L1 pad
- 2. Program code: v = A.next.value; so B is copied into the L1 pad

Hotpads takes control of the memory layout, hides pointers from software, and encodes object information in pointers.

[Pointer format: bits 63-48 hold the object's size; bits 47-0 hold the object address (48 b)]

Fetching "size" words from the starting address yields the entire object; with compression, the same fields hold the compressed size and compressed address, so fetching "compressed size" words from the compressed address yields the entire compressed object.

SLIDE 50

Hotpads updates pointers among objects on evictions

- Bulk eviction amortizes the cost of finding and updating pointers across objects
- Since updating pointers already happens in Hotpads, there is no extra cost to update them to compressed locations!

- 3. The L1 pad becomes full, because of fetched objects (A, B) and newly allocated objects (C, D), triggering a bulk eviction in HW; the L2 pad still holds a stale copy of the modified A
- 4. After an L1 bulk eviction, pointers are updated to point to the new locations: copied objects (A) go back to their old location, new objects (D) are sequentially allocated, and the L1 pad regains free space
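The pointer-update pass during a bulk eviction can be modeled as one sweep over all pointer fields with a relocation map. This is a hypothetical software model (the dictionary-based layout is ours): the key point it illustrates is that redirecting pointers to compressed locations reuses a pass the hardware performs anyway.

```python
# Hypothetical model of a bulk eviction: surviving objects get new addresses,
# and a single pass rewrites every pointer field via a relocation map. The new
# addresses may be compressed locations at no extra cost.

def bulk_evict(objects, relocation):
    """objects: {addr: list of pointer fields}; relocation: {old: new addr}."""
    moved = {}
    for addr, ptr_fields in objects.items():
        new_addr = relocation.get(addr, addr)   # unmoved objects keep their addr
        # Rewrite each pointer to its target's new (possibly compressed) home.
        moved[new_addr] = [relocation.get(p, p) for p in ptr_fields]
    return moved

# A at 0x10 points to B at 0x20; both are evicted to compressed locations.
pad = {0x10: [0x20], 0x20: []}
out = bulk_evict(pad, {0x10: 0x100, 0x20: 0x108})
assert out == {0x100: [0x108], 0x108: []}
```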

SLIDE 54

Zippads: Locating objects without translations

- Zippads leverages Hotpads to
  - Manipulate and compress objects rather than cache lines
  - Avoid translation by pointing directly to compressed objects during evictions

[Figure: Core -> L1 pad (uncompressed) -> L2 pad -> compress/decompress -> L3 pad -> main memory (compressed)]

- Zippads compresses both on-chip and off-chip memories and is neutral to the compression algorithm

SLIDE 59

Zippads compresses objects when they move

- Objects are compressed during bulk object evictions

Case 1: Newly moved objects
- Objects start their lifetime uncompressed in private levels (e.g., the L2 pad)
- When objects are evicted into a compressed level (e.g., the L3 pad), they are compressed by the compression HW and stored compactly in that level
- Zippads piggybacks on the bulk eviction process to find and update all pointers at once, amortizing update costs

SLIDE 64

Zippads compresses objects when they move

- Objects are compressed during bulk object evictions

Case 2: Dirty writebacks
- An updated object is written back uncompressed from the L2 pad and recompressed by the compression HW in the L3 pad
- The updated compressed object overwrites the old compressed object; if it is smaller, some unused space is left behind, and if it no longer fits, it is written elsewhere and a forwarding thunk is left at the old location
- Periodic compaction reclaims those unused spaces (bulk eviction in on-chip pads, GC in main memory)

SLIDE 68

Zippads uses pointers to accelerate decompression

- Every object access starts with a pointer!
- Pointers are updated to the compressed locations, so no translation is needed
- Prior work shows it is beneficial to use different algorithms for different data patterns
- Zippads encodes compression metadata in pointers to decompress objects quickly
- Zippads thus knows where to locate each object and which decompression algorithm to use when accessing compressed objects through pointers

[Pointer format: bits 63-48 hold the compressed size; the top X bits of the address field are compression encoding bits; the remaining 48-X bits hold the compressed object address]
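The pointer layout above can be sketched with plain bit arithmetic. This is an illustrative model with an assumed X = 2 encoding bits (the paper leaves X as a parameter; the field names are ours): one shift-and-mask recovers everything needed to fetch and decompress an object directly.

```python
# Sketch of the Zippads pointer layout with an assumed X = 2 encoding bits:
# bits 63-48 hold the compressed size, the top 2 bits of the low 48 select
# the decompression algorithm, and the rest is the compressed address.

X = 2                         # compression-encoding bits (assumed value)
ADDR_BITS = 48 - X

def pack(size, algo, addr):
    return (size << 48) | (algo << ADDR_BITS) | addr

def unpack(ptr):
    size = ptr >> 48
    algo = (ptr >> ADDR_BITS) & ((1 << X) - 1)
    addr = ptr & ((1 << ADDR_BITS) - 1)
    return size, algo, addr   # enough to fetch and decompress with no lookup

ptr = pack(size=3, algo=1, addr=0xBEEF)
assert unpack(ptr) == (3, 1, 0xBEEF)
```

Since the metadata rides inside the pointer the program already holds, no separate translation or metadata table stands between the access and the data.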

SLIDE 72

COCO: Cross-object-compression algorithm

- COCO exploits similarity across objects with shared base objects, a collection of representative objects

[Figure: the compression HW diffs an uncompressed object against its base object; the compressed object stores a pointer to the base object plus only the bytes that differ]
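The diff-against-a-base idea can be sketched at byte granularity. This is an illustrative model, not the paper's hardware format: field names are ours, and real COCO stores a pointer to the base object rather than the base bytes themselves.

```python
# Byte-level sketch of cross-object compression: keep a reference to the
# shared base object plus only the (offset, byte) pairs that differ.

def coco_compress(obj, base):
    """obj, base: equal-length byte strings for objects of the same type."""
    diffs = [(i, b) for i, (b, bb) in enumerate(zip(obj, base)) if b != bb]
    return {"base": base, "diffs": diffs}   # real HW stores a base *pointer*

def coco_decompress(compressed):
    out = bytearray(compressed["base"])
    for i, b in compressed["diffs"]:
        out[i] = b                           # patch only the differing bytes
    return bytes(out)

# Two same-type objects that differ in just an int field and a pointer field:
base = bytes([0, 0, 100, 0, 0, 0, 0x18, 0])
obj  = bytes([0, 0, 102, 0, 0, 0, 0x48, 0])
c = coco_compress(obj, base)
assert len(c["diffs"]) == 2                  # only 2 of 8 bytes differ
assert coco_decompress(c) == obj
```

The win comes from typical objects of one type sharing headers, zero fields, and similar values, so most bytes match the base and never get stored.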

SLIDE 75

COCO: Cross-object-compression algorithm

- COCO requires accessing base objects for every compression/decompression
- Caching base objects avoids the extra latency and bandwidth to fetch them
- A small (8 KB) base object cache works well, since few types account for most accesses

SLIDE 76

See paper for additional features and details

- Compressing large objects with subobjects and allocate-on-access
- COCO compression/decompression circuit RTL implementation details
- Details on integrating Zippads and COCO
- Discussion on using COCO with conventional memory hierarchies

SLIDE 84

Evaluation

- We simulate Zippads using MaxSim [Rodchenko et al., ISPASS'17], a simulator combining ZSim and the Maxine JVM
- We compare 4 schemes:
  - Uncomp: conventional 3-level cache hierarchy with no compression
  - CMH: compressed memory hierarchy
    - LLC: VSC [Alameldeen and Wood, ISCA'04]
    - Main memory: LCP [Pekhimenko et al., MICRO'13]
    - Algorithm: HyComp-style hybrid [Arelakis et al., MICRO'15] of BDI [Pekhimenko et al., PACT'12] + FPC [Alameldeen and Wood, ISCA'04]
  - Hotpads: the baseline system we build on
  - Zippads: with and without COCO
- Workloads: 8 Java apps with large memory footprints from different domains

SLIDE 95

Zippads improves compression ratio

1. Both Zippads and CMH work well in array-heavy apps
2. Zippads works much better than CMH in object-heavy apps

[Chart: compression ratios; CMH is only 24% better than Uncomp; Zippads with the same algorithm as CMH is 70% better; Zippads with the CMH algorithm + COCO is 2x better]

SLIDE 101

Zippads reduces memory traffic and improves performance

1. CMH reduces traffic by 15% with data compression
2. Hotpads reduces traffic by 66% with object-based data movement
3. Zippads combines the benefits of both, reducing traffic by 2x (70% less traffic than CMH)

[Charts: memory traffic (lower is better) and performance (higher is better)]

- Similar trend in performance: Zippads is 24% faster than CMH and 30% faster than Uncomp

SLIDE 106

Zippads also provides benefits on compiled code

- We study two object-heavy benchmarks written in C/C++
- Zippads again works much better than CMH in compressing memory footprint
- Zippads improves both memory traffic and performance the most

SLIDE 107

See paper for more evaluation results

- Zippads hardware storage overhead analysis
- COCO RTL implementation results
- Comparison against CMH with hardware support for memory management
- Zippads analysis
  - Base object cache size sensitivity study
  - Overflow frequency

SLIDE 111

We propose the first object-based compressed memory hierarchy

- Prior compressed memory hierarchies focus on compressing cache lines
  - They require address translation and work poorly on object-heavy apps
- Object-based apps provide new opportunities for compression
  - They always access objects through pointers
  - They have significant redundancy across objects, not within cache lines
- We present techniques that compress objects, not cache lines
  - Zippads rewrites pointers to avoid uncompressed-to-compressed address translation
  - COCO compresses across objects to leverage more redundancy

SLIDE 112

Thanks! Questions?
