NVthreads: Practical Persistence for Multi-threaded Applications



SLIDE 1

NVthreads: Practical Persistence for Multi-threaded Applications

Terry Hsu*, Purdue University Helge Brügner*, TU München Indrajit Roy*, Google Inc. Kimberly Keeton, Hewlett Packard Labs Patrick Eugster, TU Darmstadt and Purdue University

* Work was done at Hewlett Packard Labs.

NVMW 2018

❖ NVthreads was published in EuroSys 2017
❖ This work was supported by Hewlett Packard Labs, NSF TC-1117065, NSF TWC-1421910, and ERC FP7-617805.

SLIDE 2

What is non-volatile memory (NVM)?

  • Key features: persistence, good performance, byte addressability
  • Persistence
    • Retain data without power
  • Good performance
    • Outperform traditional filesystem interface
  • Byte addressability
    • Allow for pure memory operations
SLIDE 3

Programming interfaces for NVM

  • NVM-aware filesystems: BPFS, PMFS, PMEM
    • Pro: provide good performance
    • Con: require applications to use file-system interfaces and may need hardware modifications
  • Durable transactions and heaps: NV-Heaps, Mnemosyne
    • Pro: allow fine-grained NVM access
    • Con: force programs to use transactions and require non-trivial effort to retrofit transactions in lock-based programs

☞ Problem: Can we provide a simpler programming interface?

SLIDE 4

NVM-aware apps programming

    // Add element to the tail of the list
    pthread_mutex_lock(&m);
    e = malloc(sizeof(*e));
    e->value = 5;
    e->next = NULL;
    head->next = e;   // crash
    tail = e;
    pthread_mutex_unlock(&m);

[Diagram: linked list in NVM: head (value 1) → e (value 5) → NULL, with tail]

Challenges: 1. data consistency (programmability, volatile caches, and performance follow)

SLIDE 5

NVM-aware apps programming

    // Add element to the tail of the list
    pthread_mutex_lock(&m);
    e = malloc(sizeof(*e));
    <save old value of e->value>
    e->value = 5;
    <save old value of e->next>
    e->next = NULL;
    <save old value of head->next>
    head->next = e;
    <save old value of tail>
    tail = e;
    pthread_mutex_unlock(&m);

[Diagram: linked list in NVM: head (value 1) → e (value 5) → NULL, with tail]

Challenges: 1. data consistency 2. programmability (volatile caches and performance follow)

SLIDE 6

NVM-aware apps programming

    // Add element to the tail of the list
    pthread_mutex_lock(&m);
    e = malloc(sizeof(*e));
    <save old value of e->value>
    <flush log entry to NVM>
    e->value = 5;
    <save old value of e->next>
    <flush log entry to NVM>
    e->next = NULL;
    <save old value of head->next>
    <flush log entry to NVM>
    head->next = e;
    <save old value of tail>
    <flush log entry to NVM>
    tail = e;
    pthread_mutex_unlock(&m);

[Diagram: log entries being flushed from the volatile cache to NVM; linked list: head (value 1) → e (value 5) → NULL, with tail]

Challenges: 1. data consistency 2. programmability 3. volatile caches (performance follows)

SLIDE 7

NVM-aware apps programming

(Same listing and diagram as on the previous slide: every store is preceded by saving the old value and flushing the log entry from the cache to NVM.)

Challenges: 1. data consistency 2. programmability 3. volatile caches 4. performance
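
The save-then-flush pattern from the listing can be sketched in C as a small undo log. This is only an illustration of the manual approach the slides argue against, not NVthreads code: `persist()`, `logged_store()`, `undo_rollback()`, and `undo_commit()` are hypothetical names, and `persist()` merely counts calls where real hardware would need a cache-line flush and a fence.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for "flush to NVM". It only counts calls so the
 * sketch stays portable; real code would flush cache lines and fence. */
unsigned long flush_count;
static void persist(const void *addr, size_t len) {
    (void)addr; (void)len;
    flush_count++;
}

/* One undo record: where the old bytes lived and what they were. */
typedef struct {
    void         *addr;
    size_t        len;
    unsigned char old[16];
} undo_entry;

enum { UNDO_CAP = 64 };
typedef struct {
    undo_entry entries[UNDO_CAP];
    size_t     count;
} undo_log;

/* <save old value>, <flush log entry to NVM>, then perform the store. */
int logged_store(undo_log *log, void *dst, const void *src, size_t len) {
    if (log->count == UNDO_CAP || len > sizeof log->entries[0].old) return -1;
    undo_entry *e = &log->entries[log->count];
    e->addr = dst;
    e->len = len;
    memcpy(e->old, dst, len);
    persist(e, sizeof *e);                   /* entry is durable before the store */
    log->count++;
    persist(&log->count, sizeof log->count);
    memcpy(dst, src, len);
    persist(dst, len);
    return 0;
}

/* Recovery after a crash inside the critical section: undo newest first. */
void undo_rollback(undo_log *log) {
    while (log->count > 0) {
        undo_entry *e = &log->entries[--log->count];
        memcpy(e->addr, e->old, e->len);
        persist(e->addr, e->len);
    }
}

/* The critical section completed, so the log can be discarded. */
void undo_commit(undo_log *log) {
    log->count = 0;
    persist(&log->count, sizeof log->count);
}
```

In this sketch every logged store costs three flushes, which is the per-write overhead behind the "performance" challenge.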

SLIDE 8

Challenges of using NVM

  • Data consistency
    • Ensure data consistency even after crash
  • Volatile caches
    • Manage data movement from volatile caches to NVM
  • Programmability
    • Avoid extensive program modifications
  • Performance
    • Minimize runtime overhead

Proposal: NVthreads, a programming model and runtime that adds persistence to multi-threaded C/C++ programs

SLIDE 9

Goals of NVthreads

  • Make existing lock-based C/C++ applications crash tolerant
  • Minimize porting effort
    • Drop-in replacement for pthreads library
    • No need for transactions
  • Advantages of NVthreads
    • Good performance
    • Easier to develop NVM-aware applications

SLIDE 10

Key ideas

  • Use synchronization points to infer consistent regions (cf. Atlas [OOPSLA’14])
    • Does not require applications to use transactions
  • Execute multithreaded program as multi-process program (cf. DThreads [SOSP’11])
    • Process memory buffers uncommitted writes
  • Track data modifications at page granularity
    • Amortizes logging overhead vs fine-grained tracking
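
One common way to track modifications at page granularity is to write-protect a region and record the first write fault on each page. The sketch below illustrates that general technique on a POSIX system. It is not taken from the NVthreads sources, and the names `tracker_init()`, `tracker_collect()`, `tracker_region()`, and `tracker_page_size()` are hypothetical.

```c
#define _DEFAULT_SOURCE 1
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

enum { NPAGES = 8 };                      /* size of the tracked region, in pages */

static unsigned char *region;             /* start of the tracked region */
static size_t page_size;
static volatile sig_atomic_t dirty[NPAGES];

/* The first write to a read-only page lands here: mark the page dirty and
 * make it writable so the faulting store is retried and succeeds. */
static void on_write_fault(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)region;
    if (addr < base || addr >= base + NPAGES * page_size)
        _exit(1);                         /* a genuine fault outside our region */
    size_t idx = (addr - base) / page_size;
    dirty[idx] = 1;
    mprotect(region + idx * page_size, page_size, PROT_READ | PROT_WRITE);
}

/* Map the region read-only and install the handler. Returns 0 on success. */
int tracker_init(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, NPAGES * page_size, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    region = p;
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0) return -1;
    return sigaction(SIGBUS, &sa, NULL);  /* some systems report SIGBUS instead */
}

/* Count the dirty pages and re-arm tracking, as one would do at a
 * synchronization point. */
size_t tracker_collect(void) {
    size_t n = 0;
    for (size_t i = 0; i < NPAGES; i++) {
        if (dirty[i]) { n++; dirty[i] = 0; }
    }
    mprotect(region, NPAGES * page_size, PROT_READ);
    return n;
}

unsigned char *tracker_region(void) { return region; }
size_t tracker_page_size(void) { return page_size; }
```

Only the first store to each page pays for a fault; later stores to the same page run at full speed, which is the amortization the last bullet refers to.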

SLIDE 11

Using NVthreads

  • Ease of use: link the unmodified C/C++ application to the NVthreads library

bash$ gcc foo.c -o foo.out -rdynamic libnvthread.so -ldl

  • Modifications: add recovery code, specify persistent allocations
    • Allocate data in NVM: nvmalloc()
    • Recover data in NVM: nvrecover()

[Architecture diagram, top to bottom]
  • User space
    • Unmodified C/C++ application
    • NVthreads library: multi-process, intercepting synchronization, tracking data, maintaining log
  • Kernel space
    • Operating system: memory allocation and file system interface for both DRAM and NVM
  • Hardware
    • DRAM: volatile main memory, e.g., stacks
    • NVM: persistent regions, e.g., linked list on heap

SLIDE 12

NVthreads: programming model

    void main() {
        if (crashed()) {
            int *c = (int*) nvmalloc(sizeof(int), "c");
            *c = nvrecover(c, sizeof(int), "c");
        }
        else { // normal execution
            int *c = (int*) nvmalloc(sizeof(int), "c");
            ... // thread creation
            m.lock();
            *c = *c + 1;
            ...
            m.unlock();
        }
    }

Locks mark the boundaries of the durable code section (from m.lock() to m.unlock()).

SLIDE 13

NVthreads: programming model

Application-specific recovery code, which the programmer needs to add:

    if (crashed()) {
        int *c = (int*) nvmalloc(sizeof(int), "c");
        *c = nvrecover(c, sizeof(int), "c");
    }

SLIDE 14

Example: linked list

  • NVthreads guarantees that the linked list is atomically appended w.r.t. failures

    // L is a persistent list
    Start threads {T1, T2, T3}
    …
    // Add element to the tail of the list
    pthread_mutex_lock(&m);
    e = nvmalloc(sizeof(*e), "e");
    e->val = localVal;
    tail->next = e;
    e->prev = tail;   // crash!
    tail = e;
    pthread_mutex_unlock(&m);

[Diagram: threads T1, T2, and T3 each run a critical section (add e1, add e2, add e3). The state of the list data structure "L" in NVM goes from L={} to L={e1} to L={e1, e2}; after the crash, the recovery phase executes redo ops.]

SLIDE 15

Implementing atomic durability

  • Convert threads to processes (cf. DThreads [SOSP’11])
    • Each process works on private memory, no undo log
    • At synchronization points, propagate private updates, execute processes sequentially
  • Track dirty pages and log them to NVM for recovery
    • Apply redo log in the event of crash

[Diagram: threads in a shared address space become processes with disjoint address spaces]
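
A minimal POSIX sketch of the threads-to-processes idea, under simplifying assumptions: each worker is a forked process that updates a private copy of the state and publishes it to a shared mapping only at an explicit commit point. The names `shared_create()` and `run_worker()` are hypothetical, and workers run one at a time here, which sidesteps the token passing a real runtime needs.

```c
#define _DEFAULT_SOURCE 1
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Shared state lives in a MAP_SHARED mapping that all workers can see. */
typedef struct { int counter; } state_t;

state_t *shared_create(void) {
    void *p = mmap(NULL, sizeof(state_t), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    memset(p, 0, sizeof(state_t));
    return p;
}

/* Run one worker as a process. It adds `delta` to a private copy; the update
 * reaches the shared mapping only if `commit` is nonzero. A worker that dies
 * before committing leaves the shared state untouched, so nothing has to be
 * undone. Returns 0 on success. */
int run_worker(state_t *shared, int delta, int commit) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        state_t private_copy = *shared;      /* private working copy */
        private_copy.counter += delta;       /* uncommitted write */
        if (commit) *shared = private_copy;  /* synchronization point */
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling `run_worker(s, 5, 0)` leaves `s->counter` unchanged, while `run_worker(s, 5, 1)` makes the update visible to the parent and to later workers.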

SLIDE 16

From threads to processes

[Timeline diagram: T1 and T2 alternate between parallel phases and critical sections. One thread executes its critical section while the other waits, then passes the token. Labels on each critical section: Start, NVM log write, Merge shared state, Track dirty pages, Stop.]

SLIDE 17

Redo logging

[Diagram: T1 moves from a parallel phase into a critical section. Pages are shown as clean or dirtied. Labels: log dirty pages (to the redo log in NVM), sync(), merge updated bytes (into the shared state), write back to NVM.]
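
The two steps named in the diagram, logging dirty pages and merging updated bytes, can be sketched in C as follows. This is a simplified illustration, not the NVthreads implementation: `redo_append()`, `merge_page()`, and `redo_replay()` are hypothetical names, the log lives in ordinary memory rather than NVM, and whole page images are logged on the assumption that a single process wrote the page since the last synchronization point.

```c
#include <stddef.h>
#include <string.h>

enum { PAGE = 4096, LOG_CAP = 16 };

/* One redo record: the page number and a full image of the dirtied page. */
typedef struct { size_t page_no; unsigned char data[PAGE]; } redo_entry;
typedef struct { redo_entry entries[LOG_CAP]; size_t count; int committed; } redo_log;

/* Log a dirty page. A real runtime would then flush the record to NVM. */
int redo_append(redo_log *log, size_t page_no, const unsigned char *page) {
    if (log->count == LOG_CAP) return -1;
    log->entries[log->count].page_no = page_no;
    memcpy(log->entries[log->count].data, page, PAGE);
    log->count++;
    return 0;
}

/* Merge only the bytes this process changed, found by comparing its working
 * page against the unmodified twin taken before the first write.
 * Returns the number of bytes merged. */
size_t merge_page(unsigned char *shared, const unsigned char *twin,
                  const unsigned char *work) {
    size_t changed = 0;
    for (size_t i = 0; i < PAGE; i++) {
        if (work[i] != twin[i]) { shared[i] = work[i]; changed++; }
    }
    return changed;
}

/* Recovery: replay the page images of a committed log over the heap.
 * An uncommitted log is ignored, so partial critical sections vanish. */
void redo_replay(const redo_log *log, unsigned char *heap) {
    if (!log->committed) return;
    for (size_t i = 0; i < log->count; i++)
        memcpy(heap + log->entries[i].page_no * PAGE, log->entries[i].data, PAGE);
}
```

Because private updates never reach the persistent heap before the log is committed, replaying forward is enough and no old values need to be saved.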

SLIDE 18

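Tracking data dependencies

[Diagram: initially X=Y=0. Threads T1 and T2 run critical sections A and B containing X=1, cond_signal(), cond_wait(), and Y=X; an arrow marks the dependence between them. Log entries Log1, Log2, and Log3 are kept in NVM.]

NVthreads maintains metadata for memory pages per lockset to track data dependencies.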

SLIDE 19

Evaluation

  • Environment
    • Ubuntu 14.04 (Linux 3.16.7)
    • Two Intel Xeon X5650 processors (12 cores @ 2.67 GHz)
    • 198 GB RAM and 600 GB SSD
  • Applications
    • PARSEC benchmarks, Phoenix benchmarks, PageRank, K-means
  • NVM emulator
    • Linux tmpfs on DRAM emulating nvmfs (provided by Hewlett Packard Labs)
    • Injected 1000 ns delay to each 4 KB page write via RDTSCP instruction
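
The delay injection can be approximated with a busy-wait after each write. The sketch below is an assumption-laden stand-in for the emulator: it times the spin with `clock_gettime(CLOCK_MONOTONIC)` instead of the RDTSCP instruction so that it stays portable, and `now_ns()` and `nvm_write()` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

enum { PAGE_BYTES = 4096 };

/* Monotonic clock reading in nanoseconds. */
uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Busy-wait for at least `ns` nanoseconds. */
static void spin_ns(uint64_t ns) {
    uint64_t start = now_ns();
    while (now_ns() - start < ns) { /* spin */ }
}

/* Copy `len` bytes to the emulated NVM, then charge `delay_ns` for every
 * 4 KB page touched (length rounded up to whole pages). */
void nvm_write(void *dst, const void *src, size_t len, uint64_t delay_ns) {
    memcpy(dst, src, len);
    size_t pages = (len + PAGE_BYTES - 1) / PAGE_BYTES;
    spin_ns((uint64_t)pages * delay_ns);
}
```

With `delay_ns` set to 1000, a write of three pages plus one byte is charged for four pages, so it takes at least 4000 ns.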

SLIDE 20

Performance vs pthreads

  • Phoenix and PARSEC benchmarks
  • No recovery protocol

[Bar chart: slowdown (x) relative to pthreads, y-axis ticks at 4, 8, 12, 16, for histogram, kmeans, linear_regression, matrix_multiply, pca, reverse_index, string_match, word_count, blackscholes, canneal, dedup, ferret, streamcluster, swaptions. Series: Pthreads, Dthreads, NVthreads (nvmfs 1000ns), Atlas.]

SLIDE 21

Performance vs pthreads

  • 9 out of 14 applications: NVthreads incurs less than 20% overhead vs pthreads
  • Remaining 5 applications: 4x to 7x slowdown vs pthreads

[Bar chart: slowdown (x) relative to pthreads, y-axis ticks at 4, 8, 12, 16, for the 14 Phoenix and PARSEC applications. Series: Pthreads, Dthreads, NVthreads (nvmfs 1000ns), Atlas.]

SLIDE 22

Performance vs Atlas [OOPSLA’14]

  • 10 out of 12 applications: NVthreads is 7% to 100x faster vs Atlas

[Bar chart: slowdown (x) relative to pthreads, y-axis ticks at 4, 8, 12, 16, for the Phoenix and PARSEC applications. Two off-scale bars are labeled 101.96 and 46.92; two entries are marked "x". Series: Pthreads, Dthreads, NVthreads (nvmfs 1000ns), Atlas.]

SLIDE 23

Performance vs Atlas [OOPSLA’14]

  • 10 out of 12 applications: NVthreads is 7% to 100x faster vs Atlas
  • Remaining 2 applications: 7% to 2x slower vs Atlas

[Bar chart: slowdown (x) relative to pthreads, y-axis ticks at 4, 8, 12, 16, for the Phoenix and PARSEC applications; two entries are marked "x". Series: Pthreads, Dthreads, NVthreads (nvmfs 1000ns), Atlas.]

SLIDE 24

Is coarse grained tracking a good fit?

  • 9 out of 14 applications touch more than 55% of each page
    • It is worthwhile to track data at page granularity in these apps

[Bar chart: % of each page modified (y-axis, 10 to 100) per application, with the number shown in parentheses after each name on the slide: linear_regression (25), string_match (37), histogram (44), blackscholes (89), swaptions (483), matrix_multiply (4K), kmeans (1K), pca (11K), word_count (12K), ferret (15K), streamcluster (18K), dedup (2.3M), reverse_index (2.7M), canneal (7.4M).]

SLIDE 25

NVthreads is faster than fine-grained tracking

  • Microbenchmark: 4 threads randomly modify parts of 1000 memory pages
  • Mnemosyne [ASPLOS’11] and Atlas [OOPSLA’14] use word-level tracking
  • NVthreads is 3x to 30x faster than fine-grained tracking

[Chart: slowdown over pthreads (x), y-axis ticks from 25 to 250, against percentage of page modified (5%, 10%, 25%, 50%, 75%, 100%). Series: NVthreads (nvm-1000ns), Atlas (no-clflush), Mnemosyne, Atlas.]

SLIDE 26

Benefits of recovery (K-means)

  • We made K-means crash at synthetic program points, recover, and continue until convergence at the ~160th iteration
  • NVthreads’ K-means provides up to 1.9x speedup vs pthreads
  • NVthreads requires only 4 SLOC changes to make K-means crash tolerant

[Bar chart: speedup over pthreads (x), y-axis from 0.5 to 2, for input sizes 1M, 10M, 20M, 30M, grouped by the iteration when the crash occurred (10, 50, 75, 150). Series: Pthreads, NVthreads (nvm=1000ns).]

SLIDE 27

Summary

  • NVthreads allows programmers to easily leverage NVM with just a few lines of source code changes
  • Recovery requires only a redo log because multi-process execution buffers private updates
  • Coarse-grained page-level tracking amortizes logging overheads
  • NVthreads prototype is publicly available at: https://github.com/HewlettPackard/nvthreads