Proteus: A Flexible and Fast Software supported Hardware Logging

Proteus: A Flexible and Fast Software supported Hardware Logging approach for NVM. Seunghee Shin, Satish Tirukkovalluri, James Tuck, and Yan Solihin, North Carolina State University. The 2018 Non-Volatile Memories Workshop (NVMW 2018).


SLIDE 1

Proteus: A Flexible and Fast Software supported Hardware Logging approach for NVM

Seunghee Shin, Satish Tirukkovalluri, James Tuck, and Yan Solihin

North Carolina State University

The 2018 Non-Volatile Memories Workshop (NVMW 2018)

SLIDE 2

Background

  • Use NVM as storage or main memory?
  • We assume NV main memory (NVMM)
    – Keep important data in memory instead of in files
    – Need to ensure failure safety

DRAM
  • (+) Fast
  • (+) Byte-addressable
  • (−) Volatile

Disk / Flash
  • (+) Non-volatile
  • (−) Slow
  • (−) Block-addressable

NVM
  • (+) Fast
  • (+) Byte-addressable
  • (+) Non-volatile

SLIDE 3

Failure Safety through Durable Transactions

  • Durable transaction
    – Needed to ensure failure safety
    – All updates in a transaction are atomically durable
    – Atomicity can be achieved through HW or SW undo logging

[Figure: Insert Node X into a structure holding nodes A, B, C, D; a system failure during the insert is recovered with undo-logging]

SLIDE 4

Transaction with Software Undo-Logging

  • Step 1: Create the undo log and make it durable
  • Step 2: Set the log-flag and make it durable, indicating transaction start
  • Step 3: Perform the data updates and make them durable
  • Step 4: Unset the log-flag and make it durable, indicating transaction end
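
The four steps can be sketched as a toy model in Python. Everything here is an assumption made for illustration: the `Machine` class, the `LOG_FLAG` location, the `("log", addr)` log layout, and the `crash_in_step` switch are hypothetical, and `persist` merely stands in for a cache-line write-back followed by a fence. It is not the code used in the talk.

```python
LOG_FLAG = "log_flag"          # hypothetical location of the log-flag


class Machine:
    """Toy model: `cache` is volatile, `nvmm` survives a crash."""

    def __init__(self, nvmm=None):
        self.cache = {}
        self.nvmm = dict(nvmm or {})

    def load(self, addr):
        return self.cache.get(addr, self.nvmm.get(addr))

    def store(self, addr, value):
        self.cache[addr] = value                 # visible, but not durable yet

    def persist(self, addr):
        # Stands in for "write back addr, then fence": durable on return.
        self.nvmm[addr] = self.cache[addr]

    def crash(self):
        return Machine(self.nvmm)                # caches lost, NVMM kept


def durable_store(m, addr, value):
    m.store(addr, value)
    m.persist(addr)


def run_transaction(m, updates, crash_in_step=None):
    # Step 1: create the undo log (old values) and make it durable.
    for addr in updates:
        durable_store(m, ("log", addr), m.load(addr))
    if crash_in_step == 1:
        return m.crash()
    # Step 2: set the log-flag, marking the transaction as started.
    durable_store(m, LOG_FLAG, tuple(updates))
    if crash_in_step == 2:
        return m.crash()
    # Step 3: perform the data updates and make them durable.
    for i, (addr, value) in enumerate(updates.items()):
        durable_store(m, addr, value)
        if crash_in_step == 3 and i == 0:
            return m.crash()                     # fail with only one update durable
    # Step 4: unset the log-flag, marking the transaction as ended.
    durable_store(m, LOG_FLAG, None)
    return m


def recover(m):
    logged = m.load(LOG_FLAG)
    if logged:                                   # a transaction was in flight
        for addr in logged:
            durable_store(m, addr, m.load(("log", addr)))
        durable_store(m, LOG_FLAG, None)
    return m


crashed = run_transaction(Machine({"A": 1, "B": 2}), {"A": 10, "B": 20}, crash_in_step=3)
torn = dict(crashed.nvmm)                        # A is updated, B is not
recovered = recover(crashed)                     # undo log restores A
```

In this sketch a failure in the middle of Step 3 leaves a torn state in NVMM, and `recover` uses the durable log and log-flag to roll it back. Every `durable_store` call implies a write-back plus a fence, which hints at where the overhead of software logging comes from.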

SLIDE 5

Memory Persistency

  • Unpredictable persist ordering
    – Persist: an operation that makes NVMM writes durable
    – NVMM persist order is determined by LLC writebacks instead of program order
  • Persistency model
    – Defines when stores become durable (i.e. placed in the persistence domain)
    – E.g. Intel PMEM persistency model, strict persistency model, epoch persistency model, buffered epoch persistency model, strand persistency model, etc.

[Figure: private caches, shared cache (LLC), MC, and NVMM; writebacks from the LLC reach NVMM in an unpredictable order]

SLIDE 6

Intel PMEM Instruction and ADR

  • Asynchronous DRAM Refresh (ADR)
    – Adds the write pending queue (WPQ) in the MC to the persistence domain
    – Flushes data in the WPQ to NVMM automatically on system failure
  • CLWB
    – Writes back a dirty block from the caches to the WPQ
    – A fence is needed for ordering

[Figure: L1 and L2 caches, shared cache, MCs, and NVMM; the persistence domain covers the MCs and NVMM; clwb moves a block from the caches to the MC. Instruction sequences shown include st A, clwb A, sfence, st B]
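
The interaction between cache writebacks, clwb, sfence, and ADR can be illustrated with a small Python model. This is a sketch under stated assumptions, not a description of real hardware: the `PersistModel` class and its `evict` method are hypothetical, and a clwb that has not yet been followed by a fence is treated pessimistically as not having reached the WPQ.

```python
class PersistModel:
    """Toy persistence-domain model: only `wpq` and `nvmm` survive a failure."""

    def __init__(self):
        self.cache = {}      # dirty lines in volatile caches
        self.pending = []    # clwb issued, completion not yet guaranteed
        self.wpq = []        # write pending queue, protected by ADR
        self.nvmm = {}

    def st(self, addr, value):
        self.cache[addr] = value

    def evict(self, addr):
        # Cache replacement may write back any dirty line at any time.
        self.wpq.append((addr, self.cache.pop(addr)))

    def clwb(self, addr):
        if addr in self.cache:
            self.pending.append((addr, self.cache[addr]))

    def sfence(self):
        # After the fence, earlier clwb write-backs are in the WPQ.
        self.wpq.extend(self.pending)
        self.pending.clear()

    def crash(self):
        # ADR drains the WPQ to NVMM; caches and pending clwbs are lost.
        for addr, value in self.wpq:
            self.nvmm[addr] = value
        return dict(self.nvmm)


# Program order is st A, st B, but B happens to be evicted first.
m = PersistModel()
m.st("A", 1)
m.st("B", 2)
m.evict("B")
unordered = m.crash()          # B is durable although A is not

# st A; clwb A; sfence; st B: A is durable before B can be.
m = PersistModel()
m.st("A", 1)
m.clwb("A")
m.sfence()
m.st("B", 2)
ordered = m.crash()
```

In the first run the durable state contains B without A, which is the unpredictable ordering the slide describes; in the second, the clwb and sfence pair guarantees A is in the persistence domain before B is even stored.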

SLIDE 7

Let’s Revisit Software Logging

  • Software logging (SL)
    – Software performs log creation, maintenance, and truncation
    – (+) Flexible (e.g. no OS support needed)
    – (−) High performance overheads (~50% slowdown)
  • Hardware logging (HL)
    – Hardware creates and manages logs automatically (e.g. ATOM [HPCA’17])
    – (+) Low performance overheads
    – (−) Not flexible

[Figure: two timelines in program order. Software logging: log A, log B, log C, log D, FENCE, st A, st B, st C, st D. Hardware logging: log A, log B, log C, log D, st A, st B, st C, st D]

  • a. Memory fence is not required between logging and data modification
  • b. New logging optimizations possible

SLIDE 8

Software Supported Hardware Logging

  • Software Supported Hardware Logging (SSHL)
    – Hardware provides logging instructions
    – Software performs logging operations using the logging instructions
    – Hardware applies optimizations

[Figure: SL is flexible, but not fast; HL is fast, but not flexible; SSHL is fast and flexible]

SLIDE 9

Proteus: SSHL Design

  • Flexibility: software involvement in logging
    – Add instructions that start logging operations in hardware
    – Two instructions are required: log-load and log-flush
  • Performance optimizations
    – Parallel logging: process multiple logging operations concurrently
    – Redundant logging detection and removal
  • Endurance optimization (log write removal)
    – With the introduction of ADR, the WPQ is considered non-volatile
    – Key insight: logs are no longer needed once a transaction commits
    – Remove logs without flushing them to NVMM

SLIDE 10

Proteus: New Logging Instructions

  • log-from address (M1): address of original data
  • log-to address (M2): address of log entry
  • Log data register (LR#): register holding logging data

log-load $LR1 M1     LR1 = Mem[M1]
log-flush $LR1 M2    Mem[M2] = LR1

[Figure: L1 and L2 caches, shared cache, MC, and NVM; log-load $LR1 M1 reads M1 into $LR1, and log-flush $LR1 M2 writes $LR1 to M2]

Code generation

Source:
    tx_begin
    A = …
    B = …
    tx_end

Generated instructions:
    i1: tx_begin
    i2: log-load LR1, A
    i3: log-flush LR1, (LTA)+
    i4: st A
    i5: log-load LR2, B
    i6: log-flush LR2, (LTA)+
    i7: st B
    i8: tx_end
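
The code-generation pattern on this slide can be sketched in Python. The `generate` and `run` helpers below are hypothetical and exist only for illustration: the round-robin reuse of log registers is an assumption (the count of 8 is taken from the LDR size quoted in the methodology), and "(LTA)+" is read here as post-increment addressing of the log-to address.

```python
def generate(stores, num_log_regs=8):
    """Emit the logging sequence for the variables stored in one transaction."""
    code = ["tx_begin"]
    for n, target in enumerate(stores):
        lr = f"LR{n % num_log_regs + 1}"          # assumed round-robin reuse
        code.append(f"log-load {lr}, {target}")   # LR = Mem[M1]
        code.append(f"log-flush {lr}, (LTA)+")    # Mem[M2] = LR
        code.append(f"st {target}")
    code.append("tx_end")
    return [f"i{k}: {ins}" for k, ins in enumerate(code, start=1)]


def run(stores, mem, new_values, lta):
    """Interpret the generated sequence on a dict-based memory."""
    regs = {}
    for line in generate(stores):
        op, _, args = line.split(": ", 1)[1].partition(" ")
        if op == "log-load":
            lr, src = args.split(", ")
            regs[lr] = mem[src]                   # copy the old value into LR
        elif op == "log-flush":
            lr, _ = args.split(", ")
            mem[lta] = regs[lr]                   # write the log entry
            lta += 1                              # post-increment (assumption)
        elif op == "st":
            mem[args] = new_values[args]          # the actual data update
    return mem


listing = generate(["A", "B"])
memory = run(["A", "B"], {"A": 1, "B": 2}, {"A": 10, "B": 20}, lta=100)
```

For the two stores in the slide's example, `listing` reproduces the i1 to i8 sequence above, and `memory` ends with the old values of A and B in the log area and the new values in place.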

SLIDE 11

Proteus Hardware Design

[Figure: block diagram of the Proteus hardware. Labels: Pipeline; Register File (Int, fp, LDR); txID, log-start, log-end, cur-log; LoadQ, StoreQ, and LogQ (from, to, data) with Dep. Check; Cache (tag, LRU, txID); Router (txID, coreID, loginfo); Memory Controller with ADR; WPQ, LPQ, LLT, Arbiter; NVMM]

  • Log data register (LDR)
    – Keeps log data while logging instructions are in the pipeline
  • txID: holds the ID of the transaction currently executing on the core
  • log-start: the start address of the log area
  • log-end: the end address of the log area
  • cur-log: tracks the current free log entry
  • Arbiter
    – Prioritizes writes from the WPQ unless the LPQ has no free entries (fewer than a threshold)
  • Log Queue (LogQ)
    – Maintains log-to-store dependencies
    – Keeps track of logging executions (parallel logging)
  • Log Look-up Table (LLT)
    – Prevents redundant logging within a transaction
  • Log Pending Queue (LPQ)
    – Holds logs until the transaction ends or there are no free entries
    – Keeps logs separate from the WPQ so that incoming read requests do not have to check log entries
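
The roles of the LLT and LPQ can be illustrated with a short Python sketch. It is a simplification under several assumptions that are not from the talk: the `LoggingSketch` class is hypothetical, the LLT is modeled as a fully associative FIFO keyed by transaction ID and address rather than an 8-way table, and a full LPQ spills its oldest entry to NVMM.

```python
from collections import OrderedDict


class LoggingSketch:
    def __init__(self, llt_entries=64, lpq_entries=256):
        self.llt = OrderedDict()     # (txID, from address) already logged
        self.lpq = []                # (txID, from, to, data) held back from NVMM
        self.nvmm_log_writes = 0     # log entries that actually reached NVMM
        self.llt_entries = llt_entries
        self.lpq_entries = lpq_entries

    def log(self, tx_id, from_addr, to_addr, data):
        """Return True if a log entry was created, False if it was redundant."""
        key = (tx_id, from_addr)
        if key in self.llt:
            return False             # old value already logged in this transaction
        if len(self.llt) >= self.llt_entries:
            self.llt.popitem(last=False)   # forget the oldest; a repeat is merely logged again
        self.llt[key] = True
        if len(self.lpq) >= self.lpq_entries:
            self.lpq.pop(0)          # no free entry: oldest log goes to NVMM
            self.nvmm_log_writes += 1
        self.lpq.append((tx_id, from_addr, to_addr, data))
        return True

    def tx_end(self, tx_id):
        # Committed transactions no longer need their logs: drop them unwritten.
        self.lpq = [e for e in self.lpq if e[0] != tx_id]
        for key in [k for k in self.llt if k[0] == tx_id]:
            del self.llt[key]


hw = LoggingSketch(lpq_entries=2)
first = hw.log(1, 0x800, 0x200, "A")     # new address: logged
second = hw.log(1, 0x800, 0x208, "B")    # same address, same transaction: skipped
hw.tx_end(1)
writes_after_commit = hw.nvmm_log_writes
```

With a short transaction the log entry never leaves the queue, so `writes_after_commit` stays at zero; only when the queue overflows does a log write reach NVMM in this model.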

SLIDE 12

Proteus Hardware Design

[Figure: walk-through of one transaction on the Proteus hardware (register file with LDR, txID, log-start, log-end, cur-log; LogQ; StoreQ; cache; router; memory controller with ADR, WPQ, LPQ, and arbiter; NVMM). Values shown include addresses 0x100, 0x200, 0x300, and 0x800, data A and B, and the entry "02 1 0x200 A"]

    tx_begin
    log-load LR1, (0x800)
    log-flush LR1, (LTA)+
    store B, (0x800)
    clwb (0x800)
    sfence
    tx_end

0x800 holds A before the store and B after it.

SLIDE 13

Methodology

  • MarssX86 + DRAMsim2 simulator is used
  • NVM has 50ns read latency and 150ns write latency

System Configuration

Processor      OOO, 3.4GHz, 4 cores
L1 I/D Cache   32KB, 8-way, 64B block, 4 cycles, private per core
L2 Cache       256KB, 8-way, 64B block, 12 cycles, private per core
L3 Cache       8MB, 16-way, 64B block, 42 cycles, shared by all cores
NVM            DDR3 like interface, 800MHz, 8GB, 1 channel, 16 banks per rank, 2KB row-buffer
               tCAS-tRCD-tRP-tRAS-tRC-tWR-tWTR-tRTP-tRRD-tFAW:
               11-29(109)-11-28-39-12-6-6-5-24 (tRCD 29 for Read, 109 for Write)
Proteus        LDR: 8 registers, LogQ: 8 entries, LLT: 64 entries (8-way), LPQ: 256 entries
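
The quoted 50ns and 150ns latencies can be related to the timing parameters with a short calculation. This is one reading that happens to reproduce the figures, not a statement of how the simulator computes latency: it assumes the timing values are counted in cycles of the 800MHz interface clock and approximates access latency as tRCD + tCAS.

```python
CLOCK_MHZ = 800
CYCLE_NS = 1000 / CLOCK_MHZ        # 1.25 ns per cycle at 800MHz

T_CAS = 11                         # cycles, from the timing list
T_RCD_READ = 29                    # cycles, tRCD for reads
T_RCD_WRITE = 109                  # cycles, tRCD for writes


def access_latency_ns(t_rcd, t_cas=T_CAS):
    # Assumed approximation: row activation plus column access.
    return (t_rcd + t_cas) * CYCLE_NS


read_ns = access_latency_ns(T_RCD_READ)     # (29 + 11) * 1.25
write_ns = access_latency_ns(T_RCD_WRITE)   # (109 + 11) * 1.25
```

Under these assumptions `read_ns` is 50.0 and `write_ns` is 150.0, matching the latencies stated for the NVM.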
SLIDE 14

Evaluation (1) - Speedup

  • Baseline: software logging using Intel PMEM instructions
  • Proteus performs 46% better than baseline, 10% better than ATOM

[Figure: speedup over the baseline for Queue, Btree, AvlTree, Hashmap, RB tree, and StringSwap; Proteus is 46% better than baseline and 10% better than ATOM]

SLIDE 15

Evaluation (2) - Number of writes

  • Baseline: no logging (not failure safe)
  • ATOM incurs 350% more writes than baseline
  • Proteus has similar writes to baseline (only 2% higher)

ATOM introduces 3.4x more writes than Proteus
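
The three write figures on this slide are consistent with each other, as a quick check shows. The normalization of the baseline to 1.0 and the reading of "3.4x more" as additional writes relative to Proteus are interpretations, not statements from the talk.

```python
baseline = 1.0                       # writes with no logging, normalized
atom = baseline * (1 + 3.50)         # 350% more writes than baseline
proteus = baseline * (1 + 0.02)      # 2% more writes than baseline

ratio = atom / proteus               # ATOM writes relative to Proteus
extra = ratio - 1                    # additional writes relative to Proteus
```

Here `ratio` is about 4.41 and `extra` about 3.41, so ATOM issues roughly 3.4 times Proteus's write count on top of what Proteus itself writes.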

SLIDE 16

Conclusions

  • Software logging is expensive but flexible
  • Hardware logging is fast but inflexible
  • Proteus: Software Supported Hardware Logging (SSHL)
    – Fast and flexible
    – New logging instructions allow software to manage logging
    – Performance optimizations: parallel logging, redundant logging removal
    – Endurance optimization: remove logs before flushing to NVMM
  • Results
    – Performance: 46% better vs. SW logging (10% better vs. ATOM)
    – Endurance: 2% more writes to NVMM vs. 350% more with ATOM

SLIDE 17

Thank you
