
SLIDE 1

Distributed Shared Persistent Memory

(SoCC ’17)

Yizhou Shan, Yiying Zhang

SLIDE 2

Persistent Memory (PM/NVM)

  • Byte addressable
  • Persistent
  • Low latency
  • Capacity
  • Cost effective

[Figure: memory hierarchy of CPU, cache, DRAM, and PM]

SLIDE 3

Much PM Work, but All in a Single Machine

  • Local memory models

– NV-Heaps [ASPLOS ’11], Mnemosyne [ASPLOS ’11]
– Memory Persistency [ISCA ’14], Synchronous Ordering [MICRO ’16]

  • Local file systems

– BPFS [SOSP ’09], PMFS [EuroSys ’14], SCMFS [SC ’11], HiNFS [EuroSys ’16]

  • Local transaction/logging systems

– NVWAL [ASPLOS ’16], SCT/DCT [ASPLOS ’16], Kamino-Tx [EuroSys ’17]


SLIDE 4

Moving PM into Datacenters

  • PM fits datacenters: applications require a lot of memory, fast access to persistent data, and low monetary cost
  • Challenges:
  • Handling node failure
  • Ensuring good performance and scalability
  • Providing an easy-to-use abstraction

SLIDE 5

How to Use PM in Distributed Environments?

  • As distributed memory?
  • As distributed storage?
  • Mojim [Zhang et al., ASPLOS ’15]
  • First PM work in distributed environments
  • Efficient PM replication
  • But far from a full-fledged distributed NVM system


SLIDE 6

Resource Allocation in Datacenters

[Figure: Node 1 (8GB main memory, 6 cores) hosts VM1 running App1; Node 2 (4GB main memory, 4 cores) hosts Container1; App2 requests 3GB of memory and one core]

SLIDE 7

Resource Utilization in Production Clusters

Unused resources + waiting/killed jobs because of physical-node constraints

* Google production cluster trace data: https://github.com/google/cluster-data

SLIDE 8


Q1: How to achieve better resource utilization?

Use remote memory

SLIDE 9

Distributed (Remote) Memory

[Figure: same two nodes as slide 6; App2 now runs by using memory and a core on a remote node]

SLIDE 10

Modern Datacenter Applications Have Significant Memory Sharing

  • Examples: PowerGraph, TensorFlow

SLIDE 11


Q2: How to scale out parallel applications?

Distributed shared memory

SLIDE 12

What about persistence?

  • Data persistence is useful
  • Many existing data storage systems ➡ performance
  • Memory-based, long-running applications ➡ checkpointing

SLIDE 13


Q3: How to provide data persistence?

SLIDE 14

Distributed Shared Persistent Memory (DSPM)

A significant step towards using PM in datacenters

[Figure: DSPM combines DSM with persistence]

SLIDE 15

DSPM

  • Native memory load/store interface

– Local or remote (transparent)
– Pointers and in-memory data structures (see the sketch below)

  • Supports memory read/write sharing

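To make the interface concrete, here is a minimal sketch (not from the talk) of a pointer-based structure living directly in DSPM-mapped memory. It assumes a region already mapped as in the Hotpot code example on slide 25; the struct and function names are hypothetical.

#include <stddef.h>

/* Hypothetical node type stored directly in a DSPM-mapped region. */
struct item {
    int value;
    struct item *next;   /* a plain pointer into the mapped region */
};

/*
 * Build a linked list inside a DSPM region ('base' would come from
 * the mmap call shown on slide 25). No serialization or marshaling:
 * loads and stores go straight to (possibly remote) PM pages.
 */
struct item *build_list(void *base, int n)
{
    struct item *arr = base;
    for (int i = 0; i < n; i++) {
        arr[i].value = i;
        arr[i].next = (i + 1 < n) ? &arr[i + 1] : NULL;
    }
    return arr;
}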


SLIDE 17

DSPM

  • Memory load/store interface

– Local or remote (transparent)
– Pointers and in-memory data structures

  • Supports memory read/write sharing
  • Persistent naming
  • Data durability and reliability

DSPM: One Layer Approach

  • Benefits of both (distributed) memory and (distributed) storage
  • No redundant layers
  • No data marshaling/unmarshaling

SLIDE 18

Hotpot: A Kernel-Level RDMA-Based DSPM System

  • Easy to use
  • Native memory interface
  • Fast, scalable
  • Flexible consistency levels
  • Data durability & reliability

SLIDES 19-24

Hotpot Architecture

[Figure: Hotpot architecture, built up step by step across slides 19-24]

SLIDE 25

Hotpot Code Example

/* Open a dataset named 'boilermaker' */
int fd = open("/mnt/hotpot/boilermaker", O_CREAT | O_RDWR, 0644);

/* Map it into the application's virtual address space */
char *base = mmap(0, 40960, PROT_WRITE, MAP_PRIVATE, fd, 0);

/* First access: Hotpot fetches the page from a remote node */
*base = 9;

/* Later accesses: direct memory loads/stores */
memset(base, 0x27, PAGE_SIZE);

/* Commit data: make it coherent, durable, and replicated */
msync(base, PAGE_SIZE, MSYNC_HOTPOT);
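Note how standard calls double as the DSPM interface here: mmap gives the application plain virtual memory backed by (possibly remote) PM, and msync with the Hotpot-specific MSYNC_HOTPOT flag is the commit point discussed on slides 28 and 29, making the given range coherent, durable, and replicated in one call.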


SLIDE 26

How to efficiently add "P" to DSM?

  • Distributed shared memory
  • Caches remote memory on demand for fast local access
  • Keeps multiple redundant copies
  • Distributed storage systems
  • Actively add more redundancy to provide data reliability

One-layer principle: integrate the two forms of redundancy with morphable page states

SLIDE 27

Morphable Page States

  • A PM page can serve different purposes, possibly at different times:
  • as a local cached copy to improve performance
  • as a redundant data page to improve data reliability

[Figure: pages 1-4 spread across Node 1 and Node 2; Node 2 accesses page 3; see the sketch below]
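Below is a minimal sketch of what morphable page states could look like; it is illustrative only, and the enum, struct, and function names are assumptions rather than Hotpot's actual kernel structures.

/* Roles a PM page can play; the same physical page can switch roles
 * over time without copying its contents. */
enum dspm_page_role {
    PAGE_INVALID,     /* no valid copy on this node */
    PAGE_CACHED,      /* cached copy of remote data, speeds up reads */
    PAGE_DIRTY,       /* locally modified, not yet committed */
    PAGE_REDUNDANT,   /* committed replica kept for reliability */
};

struct dspm_page {
    enum dspm_page_role role;
    unsigned long pfn;           /* backing PM page frame number */
};

/* After a successful commit, a cached copy can simply be re-tagged as
 * a redundant replica: one page serves both performance and reliability. */
void morph_to_replica(struct dspm_page *pg)
{
    if (pg->role == PAGE_CACHED || pg->role == PAGE_DIRTY)
        pg->role = PAGE_REDUNDANT;
}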

SLIDE 28
How to efficiently add "P" to DSM?

  • When to make cached copies coherent?
  • When to make data durable and reliable?
  • Observations:
  • Data-store applications have well-defined commit points
  • Commit points: the time to make data persistent
  • Visible to storage devices => visible to other nodes

Exploit application behavior: make data coherent only at commit points

SLIDE 29

Commit Point

[Figure: three nodes, each with a CPU cache and PM; committed data items A, B, C are replicated across nodes as A’, B’, C’]

At a commit point, data becomes:
  • durable
  • coherent
  • reliable

  • Single-node and distributed consistency
  • Two consistency modes: single writer / multiple writer

SLIDE 30

Flexible Coherence Levels

  • Multiple Reader Multiple Writer (MRMW)
  • Allows multiple concurrent dirty copies
  • Great parallelism, but weaker consistency
  • Three-phase commit protocol
  • Multiple Reader Single Writer (MRSW)
  • Allows only one dirty copy
  • Trades parallelism for stronger consistency
  • Single-phase commit protocol

(See the sketch of the two commit paths below.)
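The sketch below contrasts the two commit paths under the stated assumptions; the request type and helper functions are hypothetical stand-ins for Hotpot's RDMA messaging layer, not its real API.

#include <stdbool.h>

struct commit_req { int dummy; /* pages and data to commit (elided) */ };

/* Hypothetical messaging helpers, stubbed for illustration. */
static bool prepare_remote_copies(struct commit_req *r) { (void)r; return true; }
static void propagate_updates(struct commit_req *r)     { (void)r; }
static void confirm_and_unlock(struct commit_req *r)    { (void)r; }

/* MRMW: several nodes may hold dirty copies of the same page, so a
 * commit takes three phases: prepare (lock and validate all copies),
 * update (push committed data and replicas), confirm (ack and unlock). */
int commit_mrmw(struct commit_req *req)
{
    if (!prepare_remote_copies(req))
        return -1;                /* a concurrent commit raced; retry */
    propagate_updates(req);
    confirm_and_unlock(req);
    return 0;
}

/* MRSW: at most one dirty copy exists, so the committer can push
 * committed data and replicas in a single phase. */
int commit_mrsw(struct commit_req *req)
{
    propagate_updates(req);
    return 0;
}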

SLIDE 31

MongoDB Results

  • Modified MongoDB with ~120 LOC, using MRMW mode
  • Compared with tmpfs, PMFS, Mojim, and Octopus using YCSB

SLIDE 32
Conclusion

  • One-layer approach: challenges and benefits
  • Hotpot: a kernel-level RDMA-based DSPM system
  • Hides complexity behind a simple abstraction
  • A call for attention to using PM in datacenters
  • Many open problems in distributed PM!

SLIDE 33

Thank You! Questions?

Get Hotpot at: https://github.com/WukLab/Hotpot

wuklab.io