Effectively Prefetching Remote Memory with Leap



slide-1
SLIDE 1

Effectively Prefetching Remote Memory with Leap

Hasan Al Maruf and Mosharaf Chowdhury

1

slide-2
SLIDE 2

2

Memory-Intensive Applications

slide-3
SLIDE 3

Perform Great!

3

TPC-C on VoltDB
[Chart: TPS (thousands) at 100%, 75%, and 50% in-memory working set: 38.61, 6.61, 1.01]

slide-5
SLIDE 5

Perform Great Until Memory Runs Out

5

TPC-C on VoltDB
[Chart: TPS (thousands) at 100%, 75%, and 50% in-memory working set: 38.61, 6.61, 1.01]

PageRank on PowerGraph
[Chart: completion time (s) at 100%, 75%, and 50% in-memory working set: 116.19, 124.96, 424.47]

slide-6
SLIDE 6

50% Less Memory Causes Slowdown of …

TPC-C on VoltDB
[Chart: TPS (thousands) at 100%, 75%, and 50% in-memory working set: 38.61, 6.61, 1.01]

PageRank on PowerGraph
[Chart: completion time (s) at 100%, 75%, and 50% in-memory working set: 116.19, 124.96, 424.47]
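
As a rough check on the two charts, the slowdown at each working-set size can be computed directly from the plotted values. The sketch below uses only those numbers; the dictionary and function names are illustrative and not part of the talk.

```python
# Values read from the two charts, keyed by in-memory working set (%).
tpcc_tps_thousands = {100: 38.61, 75: 6.61, 50: 1.01}      # higher is better
pagerank_seconds = {100: 116.19, 75: 124.96, 50: 424.47}   # lower is better


def throughput_slowdown(tps, pct):
    """How many times lower the throughput is than with 100% in memory."""
    return tps[100] / tps[pct]


def time_slowdown(seconds, pct):
    """How many times longer the run takes than with 100% in memory."""
    return seconds[pct] / seconds[100]


for pct in (75, 50):
    print(f"{pct}% in memory: "
          f"TPC-C {throughput_slowdown(tpcc_tps_thousands, pct):.1f}x slower, "
          f"PageRank {time_slowdown(pagerank_seconds, pct):.2f}x slower")
```

With half of the working set in memory, this gives roughly 38x for TPC-C and 3.65x for PageRank.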

slide-7
SLIDE 7

Between a Rock and a Hard Place

Overallocation
Leads to underutilization: 30-40% in Google, Alibaba, and Facebook

vs.

Underallocation
Leads to severe performance loss

7

slide-8
SLIDE 8

Memory Disaggregation

[Diagram: Machine 1, Machine 2, Machine 3, …, Machine N, each with used memory and free memory; the free memory across machines forms the disaggregated (remote) memory]

8

slide-11
SLIDE 11

Remote Memory Access

11

User-space Applications
Memory Disaggregation Frameworks

  • Infiniswap (NSDI’17): remote memory paging
  • Remote Regions (ATC’18): remote file abstraction
  • LegoOS (OSDI’18): disaggregated OS

Remote Memory

4KB page access latency, local vs. remote: 100 ns vs. 4 µs

Latency requirement for preferable performance [1]: 3 µs

Existing frameworks can’t achieve this, due to

  • variation in network latency
  • data path overhead

[1] P. X. Gao et al. “Network requirements for resource disaggregation” OSDI’16.

slide-12
SLIDE 12

Life of a Page

[Diagram: path of a page fault from user space to remote memory]

User space: Process 1, Process 2, …, Process N
Kernel space: page fault -> Memory Management Unit (MMU) -> page cache

  • Cache hit: 0.27 us
  • Cache miss: 2.1 us, then through the block layer
  • Generic Block Layer (bio): 10.04 us
  • I/O Scheduler (request queue processing: insertion, merging, sorting, staging and dispatch; then the dispatch queue): 21.88 us
  • Device Mapping Layer and Block Device Driver
  • RDMA to remote memory: 4.3 us

slide-14
SLIDE 14

Where Does the Time Go?

Page request -> in page cache? (0.12 µs)

Fast path (in page cache)

  • Update page table & end I/O: 0.15 µs

Slow path (not in page cache)

  • Allocate cache for page: 2.1 µs
  • Prepare for I/O: 10.04 µs
  • Queue and batch requests: 21.88 µs
  • Execute I/O (RDMA): 4.3 µs
  • Read request? Yes: update page table & end I/O: 0.15 µs
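
To see where the time goes in aggregate, the per-stage numbers from the flowchart can simply be added up. This is a minimal sketch that assumes the stages run one after another and that their latencies are additive; the stage labels follow the flowchart, while the dictionary and function names are illustrative.

```python
# Per-stage latencies from the flowchart, in microseconds.
FAST_PATH_US = {
    "page cache lookup": 0.12,
    "update page table & end I/O": 0.15,
}
SLOW_PATH_US = {
    "page cache lookup": 0.12,
    "allocate cache for page": 2.1,
    "prepare for I/O": 10.04,
    "queue and batch requests": 21.88,
    "execute I/O (RDMA)": 4.3,
    "update page table & end I/O": 0.15,
}


def total_us(stages):
    """End-to-end latency, assuming the stages are sequential and additive."""
    return round(sum(stages.values()), 2)


def software_share(stages, io_stage="execute I/O (RDMA)"):
    """Fraction of the path spent outside the actual I/O transfer."""
    total = sum(stages.values())
    return (total - stages[io_stage]) / total


print("fast path:", total_us(FAST_PATH_US), "us")
print("slow path:", total_us(SLOW_PATH_US), "us")
print(f"slow path spent outside RDMA: {software_share(SLOW_PATH_US):.0%}")
```

Under these assumptions the fast path sums to the 0.27 us cache-hit latency, and most of the slow path is spent in software stages rather than in the RDMA transfer itself.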

14

slide-15
SLIDE 15

Design Goal

1. Increase cache hits

  • the faster path serves more page faults

2. Reduce the latency of the slow path

  • remove unnecessary block-layer operations for RDMA
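
Both goals can be read off a simple expected-latency model: the average page access latency is a hit-ratio-weighted mix of the fast and slow paths. The sketch below is an illustration, not something shown in the talk; the fast-path value is the 0.27 us cache-hit latency from the data-path figure, and the slow-path value is an assumed sum of the per-stage latencies.

```python
FAST_PATH_US = 0.27    # cache hit, from the data-path figure
SLOW_PATH_US = 38.59   # assumption: sum of the slow-path stage latencies


def expected_latency_us(hit_ratio, fast=FAST_PATH_US, slow=SLOW_PATH_US):
    """Average latency when a fraction `hit_ratio` of faults hit the page cache."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * fast + (1.0 - hit_ratio) * slow


# Goal 1: a higher hit ratio moves more faults onto the fast path.
for h in (0.5, 0.9, 0.99):
    print(f"hit ratio {h:.2f}: {expected_latency_us(h):.2f} us")

# Goal 2: a shorter slow path helps at any hit ratio (halving is hypothetical).
print(f"{expected_latency_us(0.5, slow=SLOW_PATH_US / 2):.2f} us")
```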

15

slide-16
SLIDE 16

Leap

Online remote memory prefetcher

Identifies memory access patterns to prefetch pages in a

  • fast,
  • cache-efficient, and
  • resilient manner

without modifying any

  • applications, or
  • hardware

16

slide-17
SLIDE 17

Life of a Page

[Diagram: the default data path from slide 12, shown again for comparison with Leap]

Cache hit: 0.27 us. Cache miss: 2.1 us, then 10.04 us and 21.88 us in the block layer (Generic Block Layer, I/O Scheduler, Device Mapping Layer, Block Device Driver), then RDMA to remote memory: 4.3 us.

slide-19
SLIDE 19

Life of a Page w/ Leap

[Diagram: path of a page fault with Leap; the block layer is no longer on the path]

User space: Process 1, Process 2, …, Process N
Kernel space: page fault -> Memory Management Unit (MMU) -> page cache

  • Cache hit: 0.27 us
  • Cache miss: 2.1 us
  • Leap: 0.34 us
  • RDMA to remote memory: 4.3 us

Leap components: process-specific page access tracker, trend detection, prefetch candidate generation, prefetcher, eager cache eviction

slide-20
SLIDE 20

Prefetching in Linux

  • Reads ahead pages sequentially
  • Based only on the last page access
  • Does not distinguish between processes
  • Cannot detect thread-level access irregularities

On sequential access: too aggressive, causes cache pollution
Off sequential access: too conservative, brings nothing

20

slide-21
SLIDE 21

Prefetching Techniques

                      Low Computational  Low Memory  Unmodified   HW/SW         Temporal  Spatial   Low Cache
Approach              Complexity         Overhead    Application  Independence  Locality  Locality  Pollution
Next N-Line           Yes                Yes         Yes          Yes           No        Yes       No
Stride                Yes                Yes         Yes          Yes           No        Yes       No
Instruction Prefetch  No                 No          No           No            Yes       Yes       No
Linux Read-Ahead      Yes                Yes         Yes          Yes           Yes       Yes       No
Leap                  Yes                Yes         Yes          Yes           Yes       Yes       Yes

21

slide-26
SLIDE 26

Leap Prefetcher

Linear-time and constant memory space

Two main components:

  • Trend detection
  • Prefetch window size detection

[Flowchart]

  • Get the prefetch window size
  • Window size = 0? Yes: read only the requested page
  • No: trend found? Yes: prefetch with the current trend
  • No trend found: prefetch with the previous trend
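
The flowchart can be sketched as a small decision function. This is a simplified illustration under stated assumptions, not Leap's kernel code: pages are plain integers, a trend is a signed page delta, "prefetch with a trend" is taken to mean reading the next `window_size` pages along that delta, and the function name and the fallback when no trend has ever been seen are hypothetical.

```python
def pages_to_read(requested_page, window_size, current_trend, previous_trend):
    """Return the page numbers to read for one fault, requested page first.

    current_trend / previous_trend are signed page deltas, or None if unknown.
    """
    if window_size == 0:
        return [requested_page]            # read only the requested page
    # Use the current trend if one was found, else fall back to the previous one.
    trend = current_trend if current_trend is not None else previous_trend
    if trend is None:
        return [requested_page]            # assumption: nothing to follow yet
    candidates = [requested_page + i * trend for i in range(1, window_size + 1)]
    # Drop candidates that would fall before page 0.
    return [requested_page] + [p for p in candidates if p >= 0]


print(pages_to_read(0x10, 2, 2, None))     # current trend of +2
print(pages_to_read(0x10, 2, None, -3))    # no current trend: reuse previous
print(pages_to_read(0x10, 0, 2, None))     # window closed: demand page only
```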

26

slide-27
SLIDE 27

Trend Detection

Identifies the majority element in access history
Regular trends can be found within recent accesses
Flexible to short-term irregularity

[Flowchart]

1. Start with a smaller window of the access history
2. Run Boyer-Moore on the window
3. Majority found? Yes: return the majority ∆maj
4. No: maximum window size reached? Yes: no trend found
5. No: double the window size and go back to step 2

27

slide-28
SLIDE 28

Trend Detection Example

Access history (page accessed and delta from the previous access). The history buffer holds 8 entries; from t8 on, new accesses overwrite the oldest ones.

Time  Page  Delta
t0    0x48  +72
t1    0x45  -3
t2    0x42  -3
t3    0x3F  -3
t4    0x3C  -3
t5    0x02  -58
t6    0x04  +2
t7    0x06  +2
t8    0x08  +2
t9    0x0A  +2
t10   0x0C  +2
t11   0x10  +4
t12   0x39  +41
t13   0x12  -39
t14   0x14  +2
t15   0x16  +2

(a) at time t3: trend of -3
(b) at time t7: trend of -3 disappears, no major new trend
(c) at time t8: trend of +2 detected
(d) at time t15: trend of +2 detected among irregularities
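
The trend detection flowchart can be replayed on this example in a few lines. The sketch below is a user-space illustration, not Leap's kernel code: it assumes a history of 8 deltas and a smallest window of half that size (both inferred from the example rather than stated on the slides), and the function names are illustrative.

```python
def boyer_moore_majority(deltas):
    """Return the element occurring in more than half of deltas, else None."""
    candidate, count = None, 0
    for d in deltas:                       # voting pass
        if count == 0:
            candidate, count = d, 1
        elif d == candidate:
            count += 1
        else:
            count -= 1
    # Verification pass: the surviving candidate may not be a real majority.
    if candidate is not None and deltas.count(candidate) > len(deltas) // 2:
        return candidate
    return None


def find_trend(history, max_window=8, n_split=2):
    """Look for a majority delta in the most recent accesses, doubling the window."""
    window = max_window // n_split         # start with a smaller window
    while True:
        trend = boyer_moore_majority(history[-window:])
        if trend is not None:
            return trend
        if window >= max_window:
            return None                    # no trend found
        window *= 2


pages = [0x48, 0x45, 0x42, 0x3F, 0x3C, 0x02, 0x04, 0x06,
         0x08, 0x0A, 0x0C, 0x10, 0x39, 0x12, 0x14, 0x16]
deltas = [cur - prev for prev, cur in zip([0] + pages, pages)]


def trend_at(t, history_size=8):
    """Trend seen at time t, using only the last history_size deltas."""
    return find_trend(deltas[max(0, t + 1 - history_size):t + 1])


for t in (3, 7, 8, 15):
    print(f"t{t}: trend = {trend_at(t)}")
```

At t15 the 4-entry window (+41, -39, +2, +2) has no majority, so the window doubles to 8 entries, where +2 accounts for 5 of the 8 deltas.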

28

slide-29
SLIDE 29

Prefetch Window Size Detection

29

Cache hit indicates prefetch utilization

  • High cache hit: increase prefetch window aggressively
  • No cache hit, trend available: increase prefetch window gradually
  • No cache hit, no trend: decrease prefetch window gradually

Gradual slowdown helps during sudden changes
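
One way to turn these rules into an update function is sketched below. The slide gives no exact formula, so the specifics are assumptions made for illustration: growth rounds up to the next power of two above the number of prefetched pages that were hit, a window with no hits restarts at 1 if a trend exists and 0 otherwise, the window never shrinks below half of its previous size, and the maximum window of 8 is arbitrary.

```python
def next_window_size(prev_size, cache_hits, trend_found, max_size=8):
    """Pick the next prefetch window size (in pages).

    cache_hits: how many prefetched pages were used since the last fault.
    All constants and the rounding rule are illustrative assumptions.
    """
    if cache_hits > 0:
        # Aggressive growth: smallest power of two above the hit count.
        size = 1 << cache_hits.bit_length()
    elif trend_found:
        size = 1    # no hits yet, but a trend exists: start prefetching
    else:
        size = 0    # no hits and no trend: nothing to prefetch
    # Gradual slowdown: never shrink below half of the previous window.
    size = max(size, prev_size // 2)
    return min(size, max_size)


# A window of 8 followed by faults with no hits and no trend.
size, trace = 8, []
for _ in range(4):
    size = next_window_size(size, cache_hits=0, trend_found=False)
    trace.append(size)
print(trace)
```

In this sketch the window winds down 4, 2, 1, 0 instead of dropping to 0 at once, so a short burst of irregular accesses does not discard a window that was working well.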

slide-30
SLIDE 30

Evaluation

Memory Disaggregation Frameworks

  • Disaggregated VMM: Infiniswap
  • Disaggregated VFS: Remote Regions

Deploy and evaluate over 56 Gbps InfiniBand network

slide-31
SLIDE 31

Lowers Remote Page Access Latency by…

[Figure: CDFs of remote page access latency (us, log scale from 0.01 to 10000) for Infiniswap and Infiniswap+Leap; panels: Sequential Access, Stride Access]

31

slide-33
SLIDE 33

Efficient Pattern Detection

  • Detects 29.70% more sequential accesses
  • Detects most of the irregularity
  • During irregularities, doing nothing helps the most

33

slide-35
SLIDE 35

Perform Great Even After Memory Runs Out

TPC-C on VoltDB: TPS (thousands) by in-memory working set

In-Memory Working Set  Disk   Infiniswap  Infiniswap + Leap
100%                   38.61  37.00       37
75%                    6.61   27.74       36.3
50%                    1.01   19.33       35.6
25%                    Fails  1.5         15.6
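
The gain from adding Leap to Infiniswap can be read off these numbers directly. The sketch below just divides the plotted throughputs; the variable and function names are illustrative.

```python
# TPS (thousands) from the charts, at 100%, 75%, 50%, and 25% working set.
WORKING_SET_PCT = (100, 75, 50, 25)
INFINISWAP_TPS = (37.00, 27.74, 19.33, 1.5)
INFINISWAP_LEAP_TPS = (37.0, 36.3, 35.6, 15.6)


def speedups(baseline, improved):
    """Element-wise throughput ratio improved / baseline."""
    return [new / old for old, new in zip(baseline, improved)]


for pct, s in zip(WORKING_SET_PCT, speedups(INFINISWAP_TPS, INFINISWAP_LEAP_TPS)):
    print(f"{pct}% in memory: {s:.2f}x the throughput of Infiniswap alone")
```

At a 50% working set this is about 1.84x, and at 25% about 10.4x.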

slide-36
SLIDE 36

Benefit Breakdown of Leap’s Components

  • Data path optimizations: single-μs latency up to the 95th percentile
  • Prefetcher: sub-μs latency up to the 85th percentile
  • Eager cache eviction: improves the 99th percentile latency by 22%

36

slide-37
SLIDE 37

Future Work

1. Thread-specific prefetching for multiple concurrent streams

  • memory is managed at the process level
  • this requires significant changes in the virtual memory subsystem

2. Optimized remote I/O interface

  • load balancing,
  • fault-tolerance,
  • data locality, and
  • application-specific isolation in remote memory

37

slide-38
SLIDE 38

Leap

Lightweight and efficient data path for remote memory

Online prefetcher with a leaner data path and eager cache eviction policy to improve

  • cache hit,
  • remote I/O latency, and
  • application-level performance

without modifying any

  • application, or
  • hardware

source code available at https://github.com/SymbioticLab/leap

slide-39
SLIDE 39

Thank You!

source code available at https://github.com/SymbioticLab/leap

39