

slide-1
SLIDE 1

CCHL: Compression-Consolidation Hardware Logging for Efficient Failure-Atomic Persistent Memory Updates

Xueliang Wei, Dan Feng, Wei Tong, Jingning Liu, Chengning Wang, Liuqing Ye
Huazhong University of Science and Technology

slide-2
SLIDE 2

Persistent Memory

  • Provide data persistence at main-memory level
  • Reduce persistence overhead compared with using traditional storage devices

[Figure: Two memory hierarchies. Conventional: CPU with DRAM (fast memory interface, no persistence) and Disk/Flash (slow I/O interface, persistence). With persistent memory: CPU with DRAM and PM (fast memory interface, persistence). PM technologies: PCM, ReRAM, 3D XPoint, etc.]

slide-3
SLIDE 3

Failure-Atomic Updates

  • Example: Insert a node into a linked list in persistent memory

[Figure: Inserting a node into the list A … C takes two operations (Operation 1 and Operation 2). If an unexpected system failure happens before both are persisted, the linked list is broken and data are lost.]

Failure-Atomic Updates: Persist a group of writes in an all-or-nothing manner in the presence of system failures

slide-4
SLIDE 4

Durable Transactions

  • Example: A durable transaction with write-ahead logging

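Transaction: Tx_Begin; Compute new data; Log A; Log C; CLWB; MFENCE; St C, c1; St A, a1; CLWB; MFENCE; Tx_End

[Figure: Cores execute the transaction to insert a node into the list A … C. Components shown: Cores, Caches, PM (Home Region, Log Region). Values shown: c1, a1, Log(C), Log(A); home region: a0, b0, c0.]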
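The write-ahead logging sequence above can be sketched in software. This is a minimal simulation under stated assumptions, not the authors' implementation: `PM`, `run_tx`, `recover`, and the `crash_after` parameter are hypothetical names, a dict stands in for persistent memory, the log is treated as a redo log (as on the following slides), and an explicit commit marker is assumed even though the slide does not show one.

```python
class PM:
    """Toy persistent memory: everything stored here survives a crash."""
    def __init__(self, home):
        self.home = dict(home)   # home region: address -> value
        self.log = []            # log region: redo records (address, new value)
        self.committed = False   # commit marker (an assumption of this sketch)


def run_tx(pm, updates, crash_after=None):
    """Run one transaction; stop after `crash_after` persisted steps to mimic a failure."""
    done = 0

    def crashed():
        nonlocal done
        if crash_after is not None and done >= crash_after:
            return True
        done += 1
        return False

    for addr, val in updates.items():      # Log ...; CLWB; MFENCE
        if crashed():
            return
        pm.log.append((addr, val))
    if crashed():
        return
    pm.committed = True                    # the log is now complete and durable
    for addr, val in updates.items():      # St ...; CLWB; MFENCE
        if crashed():
            return
        pm.home[addr] = val


def recover(pm):
    """Replay a committed log, discard an uncommitted one."""
    if pm.committed:
        for addr, val in pm.log:
            pm.home[addr] = val
    pm.log.clear()
    pm.committed = False
    return pm.home


# Crash at every possible point and check what recovery leaves behind.
for k in range(6):
    pm = PM({"A": "a0", "B": "b0", "C": "c0"})
    run_tx(pm, {"A": "a1", "C": "c1"}, crash_after=k)
    print(k, recover(pm))
```

Whatever the crash point, recovery leaves A and C either both old or both new, which is the all-or-nothing property the transaction is meant to provide.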

slide-5
SLIDE 5

Full/Delayed Transaction Durability

  • Example: Fully/Delayed durable transactions with redo logging

Transaction: Tx_Begin; Compute; Log a1; Log c1; CLWB; MFENCE; St C, c1; St A, a1; CLWB; MFENCE; Tx_End

Steps: ❶ Compute ❷ Write Log ❸ Persist Log ❹ Write Data ❺ Persist Data

  • Full durability: The transaction is persisted during commit
  • Delayed durability: The transaction can be persisted after commit

[Figure: Two timelines, one per durability mode, placing Compute, Write Log, Persist Log, Write Data, and Persist Data relative to Tx_Begin and Tx_End.]

slide-6
SLIDE 6

Software/Hardware Logging

  • Example: Durable transactions with software/hardware redo logging

Software logging: Tx_Begin; Compute; Log a1; Log c1; CLWB; MFENCE; St C, c1; St A, a1; CLWB; MFENCE; Tx_End

Hardware logging: Tx_Begin; Compute; St C, c1; St A, a1; Tx_End

[Figure: Two timelines, one for software logging and one for hardware logging, placing Compute, Write Log, Persist Log, Write Data, and Persist Data relative to Tx_Begin and Tx_End.]

Software logging

Software performs log writes on the critical path of transaction execution, causing up to 70% performance degradation [ATOM, HPCA’17]

Hardware logging

Hardware performs log writes asynchronously to volatile execution

slide-7
SLIDE 7

Overview

  • Motivation: Many log writes are still executed on the critical path in hardware logging, particularly on multi-core systems with many threads.
  • Our Approach: Eliminate unnecessary log writes and enable delayed transaction durability.
    • Intra-Tx Log Compression
      • Observation 1: 29.5% of data updated in transactions are dirty.
      • Avoid redundant log writes by logging only dirty data.
    • Inter-Tx Log Consolidation
      • Observation 2: 53.4% of data are updated by two close transactions (distance < 4).
      • Avoid redundant log writes by combining successive transactions when they update the same data.
  • Evaluation: Improve performance by 47.4%, reduce PM write traffic by 36.1%, and reduce memory dynamic energy by 18.7%.

slide-8
SLIDE 8

Outline

  • Motivation
  • CCHL: Compression-Consolidation Hardware Logging
  • Intra-Tx Log Compression
  • Inter-Tx Log Consolidation
  • Evaluation
  • Conclusion


slide-9
SLIDE 9

Execution Flow with Hardware Logging

  • Example: Transaction execution flow with hardware redo logging

Transaction: Tx_Begin; Compute; St C, c1; St A, a1; Tx_End

[Figure: Components shown: Cores, Caches, Log Buffer, Memory Controller (Write Queue), PM (Home Region, Log Region). Initial state; values shown: a0, c0; home region: a0, b0, c0.]

slide-10
SLIDE 10

Execution Flow with Hardware Logging

  • Example: Transaction execution flow with hardware redo logging

Transaction: Tx_Begin; Compute; St C, c1; St A, a1; Tx_End

[Figure: Same components as on the previous slide. Step ❶ Compute is highlighted; values shown: a0, c0, a1, c1; home region: a0, b0, c0.]

slide-11
SLIDE 11

Execution Flow with Hardware Logging

  • Example: Transaction execution flow with hardware redo logging

Transaction: Tx_Begin; Compute; St C, c1; St A, a1; Tx_End

[Figure: Same components as on the previous slide. Steps so far: ❶ Compute, ❷ Write Data, ❸ Write Log; values shown: a0, c0, a1, c1, Log(C); home region: a0, b0, c0.]

slide-12
SLIDE 12

Execution Flow with Hardware Logging

  • Example: Transaction execution flow with hardware redo logging

Transaction: Tx_Begin; Compute; St C, c1; St A, a1; Tx_End

[Figure: Same components as on the previous slide. Steps so far: ❶ Compute, ❷ Write Data, ❸ Write Log, ❹ Persist Log; values shown: a0, c1, a1, Log(A), Log(C); home region: a0, b0, c0.]

slide-13
SLIDE 13

Execution Flow with Hardware Logging

  • Example: Transaction execution flow with hardware redo logging

Transaction: Tx_Begin; Compute; St C, c1; St A, a1; Tx_End

[Figure: Same components as on the previous slide. Steps so far: ❶ Compute, ❷ Write Data, ❸ Write Log, ❹ Persist Log; values shown: a1, c1, Log(A), Log(C); home region: a0, b0, c0.]

slide-14
SLIDE 14

Execution Flow with Hardware Logging

  • Example: Transaction execution flow with hardware redo logging

Transaction: Tx_Begin; Compute; St C, c1; St A, a1; Tx_End

[Figure: Same components as on the previous slide. All steps shown: ❶ Compute, ❷ Write Data, ❸ Write Log, ❹ Persist Log, ❺ Persist Data; values shown: Log(C), Log(A), a1, c1; home region: a0, b0, c0.]

slide-15
SLIDE 15

Analysis of Hardware Logging Overhead

  • Some log writes are still executed on the critical path
    • Example 1: Evict a cache line when the write queue is full

[Figure: Components shown: Cores, Caches, Log Buffer, Memory Controller (Write Queue), PM (Home Region, Log Region). Values shown: Log(A), Log(B), Log(C), Log(D), a1, b1, c1, d1, d0; home region: a0, b0, c0, d0.]

slide-16
SLIDE 16

Analysis of Hardware Logging Overhead

  • Some log writes are still executed on the critical path
    • Example 2: Commit a transaction when some log entries are buffered

[Figure: Components shown: Cores, Caches, Log Buffer, Memory Controller (Write Queue), PM (Home Region, Log Region). Values shown: Log(A), Log(B), Log(C), Log(D), a1, b1, c1, d1; home region: a0, b0, c0, d0.]

slide-17
SLIDE 17

Analysis of Hardware Logging Overhead

  • Hardware logging overhead increases as the number of threads increases
  • The percentage of log writes increases as the number of threads increases
slide-18
SLIDE 18

Outline

  • Motivation
  • CCHL: Compression-Consolidation Hardware Logging
  • Intra-Tx Log Compression
  • Inter-Tx Log Consolidation
  • Evaluation
  • Conclusion


slide-19
SLIDE 19

Intra-Tx Log Compression

  • Dirty data: Data whose values are modified by transactions
  • Observation 1: Only 29.5% of the bytes in all updated words are dirty
slide-20
SLIDE 20

Intra-Tx Log Compression

  • Only the log data for dirty data are essential for recovery

[Figure: Components shown: Caches, Log Buffer, Memory Controller (Write Queue), PM (Home Region, Log Region). Values shown: a0 = 0x00000000, a1 = 0x01020030, b1 = 0x12345678, Log(A) = 0x01020030, Log(B) = 0x12345678. Recovery restores a1 = 0x01020030.]

slide-21
SLIDE 21

Intra-Tx Log Compression

  • Key idea: Avoid redundant log writes by logging only dirty bytes
    • A (p,q) dirty flag is added in each log entry to track the dirtiness of data
    • (p,q) means the dirtiness of every q-byte data is tracked with p flag bits

[Figure: A log entry Log(A) for the update a0 → a1 consists of metadata, a dirty flag, and log data. With a (1,1) dirty flag, less clean data is logged; with a (1,2) dirty flag, fewer flag bits are needed.]
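The trade-off between flag cost and clean-data cost can be made concrete with a small calculation. The sketch below is illustrative only: `entry_cost` is a hypothetical helper, it covers the p = 1 case, and the eight-byte example values are made up rather than taken from the slides.

```python
def entry_cost(old: bytes, new: bytes, q: int):
    """Return (flag_bits, logged_bytes) for a (1, q) dirty flag.

    One flag bit covers each q-byte chunk; a chunk is logged in full
    if any of its bytes changed.
    """
    assert len(old) == len(new) and len(old) % q == 0
    dirty = [old[i:i + q] != new[i:i + q] for i in range(0, len(old), q)]
    return len(dirty), q * sum(dirty)


old = bytes(8)                           # eight bytes, all zero
new = bytes([1, 0, 0, 0, 2, 0, 3, 0])    # three bytes change

print(entry_cost(old, new, q=1))  # (8, 3)
print(entry_cost(old, new, q=2))  # (4, 6)
```

For this input the (1,1) flag spends 8 flag bits to log 3 bytes, while the (1,2) flag spends 4 flag bits but logs 6 bytes, three of which are clean.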

slide-22
SLIDE 22

Intra-Tx Log Compression

  • How does intra-tx log compression reduce log writes?

[Figure: Four log entries, each with metadata and log data (LD 0 to LD 3), take 8 log writes. With intra-tx log compression and log packing [Jeong+ MICRO’18], the count drops to 5 and then 3 log writes, a reduction of 5 log writes.]

Legend: MD: Metadata; Flag: Dirty flag; LD: Log data; CLD: Compressed log data

slide-23
SLIDE 23

Intra-Tx Log Compression

  • Implementation

[Figure: Components shown: Cores, Caches, Log Buffer, Memory Controller (Write Queue), PM (Home Region, Log Region). The dirty flag is obtained by comparing the old value (a0 = 0x00000000) with the new value (a1 = 0x01020030). The log entry (A, 0x52, 0x01020030) is compressed to (A, 0x52, 0x123).]
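The comparison step can be sketched as follows. The function names `compress` and `apply_log` are hypothetical, and the 4-bit tracking unit is an assumption: it is the granularity at which the slide's values 0x52 and 0x123 come out for a0 = 0x00000000 and a1 = 0x01020030, whereas the (p,q) definition and the evaluated (1,1) flag work on bytes. Hardware would do this comparison in logic, not software.

```python
def compress(old: int, new: int, width_bits: int = 32, unit_bits: int = 4):
    """Compare old and new unit by unit (most significant first).

    Returns (dirty_flag, compressed_data): one flag bit per unit, and the
    new values of the dirty units packed together.
    """
    mask = (1 << unit_bits) - 1
    flag = data = 0
    for shift in range(width_bits - unit_bits, -1, -unit_bits):
        o, n = (old >> shift) & mask, (new >> shift) & mask
        dirty = int(o != n)
        flag = (flag << 1) | dirty
        if dirty:
            data = (data << unit_bits) | n
    return flag, data


def apply_log(old: int, flag: int, data: int,
              width_bits: int = 32, unit_bits: int = 4):
    """Rebuild the new value from the old value and a compressed log entry."""
    mask = (1 << unit_bits) - 1
    n_units = width_bits // unit_bits
    remaining = bin(flag).count("1")      # dirty units still to be consumed
    new = 0
    for i in range(n_units):
        if (flag >> (n_units - 1 - i)) & 1:
            remaining -= 1
            unit = (data >> (remaining * unit_bits)) & mask
        else:
            unit = (old >> (width_bits - unit_bits * (i + 1))) & mask
        new = (new << unit_bits) | unit
    return new


flag, data = compress(0x00000000, 0x01020030)
print(hex(flag), hex(data))                        # 0x52 0x123
print(hex(apply_log(0x00000000, flag, data)))      # 0x1020030
```

Recovery needs the old value from the home region plus the flag to place the compressed units, which is why the flag is stored in the log entry alongside the data.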

slide-24
SLIDE 24

Outline

  • Motivation
  • CCHL: Compression-Consolidation Hardware Logging
  • Intra-Tx Log Compression
  • Inter-Tx Log Consolidation
  • Evaluation
  • Conclusion


slide-25
SLIDE 25

Inter-Tx Log Consolidation

  • Transaction distance: The number of transactions between two transactions that update the same words
  • Observation 2: 53.4% of the updated words are written by two transactions whose distance is less than 4
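One possible way to measure this on a trace is sketched below. It is an assumption-laden illustration, not the authors' methodology: `close_update_fraction` is a hypothetical helper, the distance between transactions i and j (i < j) is taken as j - i - 1, and the example trace is made up.

```python
def close_update_fraction(write_sets, max_distance=4):
    """Fraction of word updates whose previous writer is fewer than
    `max_distance` transactions away.

    `write_sets` is a list of sets of word addresses, one per transaction,
    in commit order.
    """
    last_writer = {}          # word -> index of the last transaction that wrote it
    close = total = 0
    for idx, words in enumerate(write_sets):
        for w in set(words):
            total += 1
            if w in last_writer and idx - last_writer[w] - 1 < max_distance:
                close += 1
            last_writer[w] = idx
    return close / total if total else 0.0


trace = [{"A", "B"}, {"B", "C"}, {"D"}, {"E"}, {"F"}, {"G"}, {"A"}]
print(close_update_fraction(trace))
```

In this trace only the second write to B counts as close (distance 0); the second write to A is five transactions away from the first, so it does not.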

slide-26
SLIDE 26

Inter-Tx Log Consolidation

  • Reduce log writes by avoiding writing unused log entries when several transactions update the same data

Tx1: Tx_Begin; St A, a1; St B, b1; Tx_End
Tx2: Tx_Begin; St C, c2; St B, b2; Tx_End

[Figure: The log region holds Log(A) = a1, Log(B) = b1, Log(C) = c2, Log(B) = b2; the home region holds a0, b0, c0. Recovery produces a1, b2, c2, so Log(B) = b1 is unused.]

slide-27
SLIDE 27

Inter-Tx Log Consolidation

  • Key idea: Combine several successive transactions into a large one if they update the same data, and only log the newest values of the data

Before: Tx_Begin; St A, a1; St B, b1; Tx_End, then Tx_Begin; St C, c2; St B, b2; Tx_End
After inter-tx log consolidation: Tx_Begin; St A, a1; St B, b1; St C, c2; St B, b2; Tx_End

[Figure: Without consolidation the two transactions produce 4 log entries: Log(A) = a1, Log(B) = b1, Log(C) = c2, Log(B) = b2. With consolidation the combined transaction produces 3 log entries: Log(A) = a1, Log(B) = b2, Log(C) = c2.]
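The key idea can be expressed as keeping only the newest value per address across the combined transactions. The sketch below uses hypothetical helpers `log_entries` and `consolidate` and models a transaction as a list of (address, value) stores; it shows the counting only, not the hardware mechanism.

```python
def log_entries(transactions):
    """One redo log entry per store, with no consolidation."""
    return [(addr, val) for tx in transactions for addr, val in tx]


def consolidate(transactions):
    """Treat successive transactions as one and keep the newest value per address."""
    newest = {}
    for tx in transactions:
        for addr, val in tx:
            newest[addr] = val
    return list(newest.items())


tx1 = [("A", "a1"), ("B", "b1")]
tx2 = [("C", "c2"), ("B", "b2")]

print(len(log_entries([tx1, tx2])))   # 4
print(consolidate([tx1, tx2]))        # [('A', 'a1'), ('B', 'b2'), ('C', 'c2')]
```

The entry for b1 disappears because b2 overwrites it before the combined transaction is logged.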

slide-28
SLIDE 28

Inter-Tx Log Consolidation

  • Limitation

Tx1: Tx_Begin; St A, a1; St B, b1; Tx_End
Tx2: Tx_Begin; St C, c2; St B, b2; Tx_End

[Figure: A failure happens during Tx2; the home region holds a0, b0, c0.
Without inter-tx log consolidation: the log region holds Log(A) = a1, Log(B) = b1, Log(C) = c2. Recovery restores a1 and b1, so only the updates of Tx2 are lost.
With inter-tx log consolidation: the log region holds Log(A) = a1, Log(C) = c2. The updates of both Tx1 and Tx2 are lost.]
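The limitation can be reproduced with a toy redo-log recovery routine. This is a sketch under assumptions: the record layout, the explicit commit records, and the name `recover` are hypothetical, and the merged transaction is assumed to have no commit record in PM at the time of the failure.

```python
def recover(home, log):
    """Replay redo records of committed transactions only."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    state = dict(home)
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, _, addr, val = rec
            state[addr] = val
    return state


home = {"A": "a0", "B": "b0", "C": "c0"}

# Without consolidation: Tx1 has committed, the failure hits during Tx2.
separate = [("write", "Tx1", "A", "a1"), ("write", "Tx1", "B", "b1"),
            ("commit", "Tx1"), ("write", "Tx2", "C", "c2")]

# With consolidation: Tx1 and Tx2 form one merged transaction that has
# not committed when the failure hits.
merged = [("write", "Tx1+Tx2", "A", "a1"), ("write", "Tx1+Tx2", "C", "c2")]

print(recover(home, separate))  # {'A': 'a1', 'B': 'b1', 'C': 'c0'}
print(recover(home, merged))    # {'A': 'a0', 'B': 'b0', 'C': 'c0'}
```

Consolidation therefore trades fewer log writes for a larger window of updates that a failure can undo, which is why it is paired with delayed durability.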

slide-29
SLIDE 29

Inter-Tx Log Consolidation

  • Implementation

Tx1: Tx_Begin; St A, a1; St B, b1; Tx_End
Tx2: Tx_Begin; St C, c2; St B, b2; Tx_End

[Figure: A DRAM cache and PM (Home Region, Log Region). Values shown: a1, b1, c2, b2; log entries shown: Log(A) = a1, Log(B) = b1, Log(C) = c2, Log(B) = b2; home region: a0, b0, c0.]

slide-30
SLIDE 30

Outline

  • Motivation
  • CCHL: Compression-Consolidation Hardware Logging
  • Intra-Tx Log Compression
  • Inter-Tx Log Consolidation
  • Evaluation
  • Conclusion


slide-31
SLIDE 31

Experimental Setup

  • gem5 simulator configuration
    • Eight-core processor with private L1 and L2 caches and a shared L3 cache
    • 16-entry log buffer and (1,1) dirty flag
    • Memory parameters from [Ren+ MICRO’15, Lee+ ISCA’09, Ogleari+ HPCA’18]
  • Workloads
    • Micro-benchmarks: Btree, Hash, Queue, RBTree, SDG, SPS
    • Macro-benchmarks [Nalli+ ASPLOS’17]: Echo, YCSB, TPCC
  • Evaluated designs
    • ATOM: Hardware undo logging with full durability
    • FWB: Hardware undo+redo logging with full durability
    • ReDU: Hardware redo logging with full durability
    • CCHL-fd: Proposed design with full durability and intra-tx log compression
    • CCHL-dwN: Proposed design with delayed durability and both log optimizations
slide-32
SLIDE 32

Performance Comparison


  • CCHL-fd outperforms ReDU by 47.8% for the small dataset sizes
  • CCHL-fd outperforms ReDU by 47.0% for the large dataset sizes
slide-33
SLIDE 33

Write Traffic and Energy Consumption

Metric                             Dataset size   ATOM   FWB    ReDU   CCHL-fd
Normalized PM write traffic        Small          1.00   1.49   0.99   0.64
Normalized PM write traffic        Large          1.00   1.45   1.00   0.63
Normalized memory dynamic energy   Small          1.00   1.63   1.00   0.79
Normalized memory dynamic energy   Large          1.00   1.53   0.80   0.67

  • CCHL-fd significantly reduces both PM write traffic and memory dynamic energy
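As a rough cross-check, the table values can be turned into relative reductions. The sketch below assumes ReDU is the baseline (the performance slide compares CCHL-fd against ReDU) and takes a plain mean over the two dataset sizes; the slides do not state how the headline figures were averaged, so this is only an approximation.

```python
# Normalized values for ReDU and CCHL-fd, copied from the table above.
traffic = {"Small": {"ReDU": 0.99, "CCHL-fd": 0.64},
           "Large": {"ReDU": 1.00, "CCHL-fd": 0.63}}
energy = {"Small": {"ReDU": 1.00, "CCHL-fd": 0.79},
          "Large": {"ReDU": 0.80, "CCHL-fd": 0.67}}


def reduction_vs_redu(table):
    """Relative reduction of CCHL-fd against ReDU for each dataset size."""
    return {size: 1 - row["CCHL-fd"] / row["ReDU"] for size, row in table.items()}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


for name, table in (("PM write traffic", traffic),
                    ("memory dynamic energy", energy)):
    red = reduction_vs_redu(table)
    print(name, {k: round(v, 3) for k, v in red.items()},
          "mean", round(mean(red.values()), 3))
```

The means come out at roughly 36% for write traffic and roughly 19% for energy, in line with the 36.1% and 18.7% quoted in the overview; small differences would be expected because the table values are rounded to two decimals.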
slide-34
SLIDE 34

Efficiency of Intra-Tx Log Compression

Type                                  ATOM+C   FWB+C   ReDU+C
Transaction throughput improvement    23.3%    33.1%   29.5%
PM write traffic reduction            34.4%    43.0%   34.4%
Memory dynamic energy reduction       14.7%    22.0%   16.0%

  • “+C” represents the corresponding design with intra-tx log compression
  • All three designs benefit from intra-tx log compression
slide-35
SLIDE 35

Efficiency of Inter-Tx Log Consolidation

  • CCHL-dwN (N=2,4,6,8) improves transaction throughput on all the workloads through delayed durability and inter-tx log consolidation

slide-36
SLIDE 36

Outline

  • Motivation
  • CCHL: Compression-Consolidation Hardware Logging
  • Intra-Tx Log Compression
  • Inter-Tx Log Consolidation
  • Evaluation
  • Conclusion


slide-37
SLIDE 37

Conclusion

  • Motivation: Many log writes are still executed on the critical path in hardware logging, particularly on multi-core systems with many threads.
  • Key Idea: Eliminate unnecessary log writes and enable delayed transaction durability.
    • Intra-Tx Log Compression
      • Observation 1: 29.5% of data updated in transactions are dirty.
      • Avoid redundant log writes by logging only dirty data.
    • Inter-Tx Log Consolidation
      • Observation 2: 53.4% of data are updated by two close transactions (distance < 4).
      • Avoid redundant log writes by combining successive transactions when they update the same data.
  • Evaluation: Improve performance by 47.4%, reduce PM write traffic by 36.1%, and reduce memory dynamic energy by 18.7%.

slide-38
SLIDE 38

Thank you!

Xueliang Wei, Dan Feng, Wei Tong, Jingning Liu, Chengning Wang, Liuqing Ye
Huazhong University of Science and Technology

CCHL: Compression-Consolidation Hardware Logging for Efficient Failure-Atomic Persistent Memory Updates