SLIDE 1

Rocksteady: Fast Migration for Low-Latency In-memory Storage

Chinmay Kulkarni, Aniraj Kesavan, Tian Zhang, Robert Ricci, Ryan Stutsman

SLIDE 2

Introduction

  • Distributed low-latency in-memory key-value stores are emerging
  • Predictable response times ~10 µs median, ~60 µs 99.9th-tile
  • Problem: Must migrate data between servers
  • Minimize performance impact of migration → go slow?
  • Quickly respond to hot spots, skew shifts, load spikes → go fast?
  • Solution: Fast data migration with low impact
  • Early ownership transfer of data, leverage workload skew
  • Low priority, parallel and adaptive migration
  • Result: Migration protocol for RAMCloud in-memory key-value store
  • Migrates 256 GB in 6 minutes, 99.9th-tile latency less than 250 µs
  • Median latency recovers from 40 µs to 20 µs in 14 s

SLIDE 3


Why Migrate Data?

Poor spatial locality → High multiGet() fan-out → More RPCs

[Figure: Server 1 and Server 2 each hold a mix of keys A–D; Client 1 and Client 2 issue multiGet()s for keys spread across both servers; No Locality, Fan-out = 7; throughput label: 6 Million]
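
To make the fan-out arithmetic concrete, here is a minimal client-side sketch (plain Python, not RAMCloud's actual client API; owner_of and send_rpc are hypothetical): a multiGet() is split into one RPC per server that owns any of the requested keys, so the RPC count equals the number of distinct servers the keys land on.

    # Hypothetical client-side multiGet: one RPC per server that owns any key.
    from collections import defaultdict

    def multi_get(keys, owner_of, send_rpc):
        """owner_of(key) -> server id; send_rpc(server, keys) -> {key: value}."""
        by_server = defaultdict(list)
        for key in keys:
            by_server[owner_of(key)].append(key)       # group keys by owning server
        results = {}
        for server, server_keys in by_server.items():  # fan-out = len(by_server)
            results.update(send_rpc(server, server_keys))
        return results

    # Keys A-D spread over two servers -> fan-out 2; fully colocated -> fan-out 1.
    placement = {"A": "s1", "C": "s1", "B": "s2", "D": "s2"}
    store = {"A": 1, "B": 2, "C": 3, "D": 4}
    rpc = lambda server, keys: {k: store[k] for k in keys}   # stand-in for a network RPC
    print(multi_get(["A", "B", "C", "D"], placement.get, rpc))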

SLIDE 4


Migrate To Improve Spatial Locality

[Figure: keys B and C in flight between Server 1 and Server 2 so that each client's keys become colocated; labels: No Locality, Fan-out = 7; throughput: 6 Million]

SLIDE 5


Spatial Locality Improves Throughput

Better spatial locality → Fewer RPCs → Higher throughput Benefits multiGet(), range scans

[Figure: with full locality (Server 1: A, B; Server 2: C, D) the multiGet() fan-out drops from 7 to 1 and throughput rises from 6 million to 25 million]

SLIDE 6

The RAMCloud Key-Value Store

[Architecture diagram: clients and a coordinator connect to master/backup servers over the data center fabric; all data in DRAM; kernel-bypass networking (DPDK); 10 µs reads]

SLIDE 7

The RAMCloud Key-Value Store

[Diagram: a client issues a Write RPC to a master over the data center fabric]

SLIDE 8

The RAMCloud Key-Value Store

[Diagram: the write is kept once in DRAM at the master and three times on disk across backups]

SLIDE 9

Fault-tolerance & Recovery In RAMCloud

[Architecture diagram: masters, backups, clients, and the coordinator on the data center fabric]

SLIDE 10

Fault-tolerance & Recovery In RAMCloud

[Diagram: a crashed master's data is recovered from backups across the cluster; 2 seconds to recover]

SLIDE 11

Performance Goals For Migration

  • Maintain low access latency
  • 10 µs median latency → System extremely sensitive
  • Tail latency matters at scale → Even more sensitive
  • Migrate data fast
  • Workloads dynamic → Respond quickly
  • Growing DRAM storage: 512 GB per server
  • Slow data migration → Entire day to scale cluster

SLIDE 12

Rocksteady Overview: Early Ownership Transfer

Problem: A loaded source can bottleneck migration
Solution: Instantly shift ownership and all load to the target

[Diagram: before migration, Clients 1–4 send all reads and writes to the Source Server]

SLIDE 13

Rocksteady Overview: Early Ownership Transfer

Problem: A loaded source can bottleneck migration
Solution: Instantly shift ownership and all load to the target

[Diagram: ownership transfers; Clients 1–4 are instantly redirected to the Target Server]
All future operations are serviced at the Target, creating "headroom" to speed migration
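
A minimal sketch of what early ownership transfer means at the coordinator (hypothetical Python, not the RAMCloud implementation): the tablet's owner is flipped to the target before any data moves, so clients immediately route every read and write to the target while the records themselves still sit at the source.

    # Hypothetical coordinator: ownership flips before any data is copied.
    class Coordinator:
        def __init__(self):
            self.owner = {}                          # tablet -> server id

        def start_migration(self, tablet, source, target):
            assert self.owner[tablet] == source
            self.owner[tablet] = target              # clients are redirected immediately;
            return (tablet, source)                  # the records still live at the source

        def owner_of(self, tablet):
            return self.owner[tablet]

    c = Coordinator()
    c.owner["users"] = "source"
    c.start_migration("users", "source", "target")
    print(c.owner_of("users"))                       # "target": all new load lands here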

SLIDE 14

Rocksteady Overview: Leverage Skew

[Diagram: a client Read arrives at the Target, which issues an on-demand Pull of the missing record from the Source]

Problem: Requested data may not have arrived at the target yet
Solution: On-demand migration of unavailable data

SLIDE 15

Rocksteady Overview: Leverage Skew

[Diagram: Clients 1–4 read at the Target; records that have not arrived yet are pulled from the Source on demand]

Problem: Requested data may not have arrived at the target yet
Solution: On-demand migration of unavailable data

Hot keys move early → median latency recovers to 20 µs in 14 s
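
A sketch of the target's read path that produces this behavior (hypothetical names, not the actual RAMCloud code): a read that misses the target's hash table triggers a small, high-priority pull of just that record from the source, so each hot key is migrated on its first access and the median recovers quickly.

    # Hypothetical target-side read path during migration.
    def read(key, local_table, still_migrating, pull_from_source):
        value = local_table.get(key)
        if value is None and key in still_migrating:
            value = pull_from_source(key)            # small, high-priority on-demand pull
            local_table[key] = value                 # hot keys therefore arrive first
            still_migrating.discard(key)
        return value

    table, pending = {"A": 1}, {"B", "C"}            # A already migrated
    print(read("B", table, pending, lambda k: 42))   # pulled on demand -> 42
    print(read("B", table, pending, lambda k: 42))   # now served from the local table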

SLIDE 16

Rocksteady Overview: Adaptive and Parallel

[Diagram: the Target's migration manager issues parallel Pulls from the Source while on-demand Pulls continue]

Problem: The old single-threaded protocol was limited to 130 MB/s
Solution: Pipelined and parallel migration at both source and target

SLIDE 17

Rocksteady Overview: Adaptive and Parallel

[Diagram: parallel Pulls driven by the Target's migration manager, alongside on-demand Pulls for client reads]

Problem: The old single-threaded protocol was limited to 130 MB/s
Solution: Pipelined and parallel migration at both source and target

Target driven; yields to on-demand pulls; parallel pulls move 758 MB/s

SLIDE 18

Rocksteady Overview: Eliminate Sync Replication

[Diagram: during migration, records pulled to the Target are synchronously replicated to backups, alongside parallel and on-demand Pulls]

Problem: Synchronous replication bottleneck at the target
Solution: Safely defer replication until after migration
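
A sketch of the write path this implies at the target (hypothetical structure, not RAMCloud's log code): mutations made during migration are appended to the target's replicated recovery log right away, while records arriving via pulls go only into DRAM and are re-replicated in bulk once migration completes.

    # Hypothetical target-side bookkeeping: mutations are replicated immediately,
    # migrated (pulled) records are re-replicated only after migration completes.
    class Target:
        def __init__(self):
            self.table = {}                          # in-DRAM hash table
            self.pulled = set()                      # migrated keys not yet re-replicated
            self.recovery_log = []                   # stands in for the 3x replicated log

        def apply_mutation(self, key, value):        # client write during migration
            self.table[key] = value
            self.recovery_log.append(("mutation", key, value))

        def apply_pull(self, records):               # records pulled from the source
            self.table.update(records)               # DRAM only; the source log still covers them
            self.pulled.update(records)

        def finish_migration(self):                  # deferred bulk re-replication
            for key in self.pulled:
                self.recovery_log.append(("migrated", key, self.table[key]))
            self.pulled.clear()

    t = Target()
    t.apply_pull({"A": 1, "B": 2}); t.apply_mutation("C", 3); t.finish_migration()
    print(t.recovery_log)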

SLIDE 19

Rocksteady Overview: Eliminate Sync Replication

[Diagram: replication of migrated data at the Target is deferred; clients continue to be served during migration]

Problem: Synchronous replication bottleneck at the target
Solution: Safely defer replication until after migration

SLIDE 20

Rocksteady: Putting it all together

  • Instantaneous ownership transfer
  • Immediate load reduction at overloaded source
  • Creates “headroom” for migration work
  • Leverage skew to rapidly migrate hot data
  • Target comes up to speed with little data movement
  • Adaptive parallel, pipelined at source and target
  • All cores avoid stalls, but yield to client-facing operations
  • Safely defer replication at target
  • Eliminates replication bottleneck and contention

SLIDE 21

Rocksteady

  • Instantaneous ownership transfer
  • Leverage skew to rapidly migrate hot data
  • Adaptive parallel, pipelined at source and target
  • Safely defer synchronous replication at target

SLIDE 22

Evaluation Setup

[Diagram: the Source Server holds 300 million records (45 GB); four clients run YCSB-B (95% reads / 5% writes) with Zipfian skew 0.99; the Target Server starts empty]

SLIDE 23

Evaluation Setup

[Diagram: the 300 million records are split into two halves of 150 million records (22.5 GB) each; one half is migrated to the Target while the four YCSB-B clients keep running]

SLIDE 24

Instantaneous Ownership Transfer

Before migration: Source overloaded, Target underloaded
Ownership transfer creates Source headroom for migration

Source CPU utilization: 80% before ownership transfer, 25% immediately after → 55% of Source CPU freed as headroom

SLIDE 25

Rocksteady

  • Instantaneous ownership transfer
  • Leverage skew to rapidly migrate hot data
  • Adaptive parallel, pipelined at source and target
  • Safely defer synchronous replication at target

SLIDE 26

Leverage Skew To Move Hot Data

After ownership transfer, hot keys are pulled on demand
More skew → median restored faster (fewer hot keys to migrate)

Latency during migration:
  99.9th percentile: 240 µs (Uniform, low skew), 245 µs (Skew=0.99), 155 µs (Skew=1.5, high skew)
  Median: 75 µs (Uniform), 28 µs (Skew=0.99), 17 µs (Skew=1.5)

Before migration: median = 10 µs, 99.9th = 60 µs
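
A back-of-the-envelope illustration (not a result from the paper) of why skew restores the median so quickly: under a Zipfian access distribution a small fraction of keys receives most accesses, so on-demand pulls move the hot set almost immediately. The key count is scaled down from the experiment's 300 million just to keep the computation fast.

    # Fraction of accesses covered by the hottest keys under a Zipfian distribution.
    import numpy as np

    n, s = 10_000_000, 0.99                          # scaled-down key count; Zipf exponent
    weights = 1.0 / np.arange(1, n + 1) ** s         # access probability ∝ 1 / rank^s
    covered = np.cumsum(weights / weights.sum())
    for frac in (0.001, 0.01, 0.1):                  # hottest 0.1%, 1%, 10% of keys
        print(f"top {frac:.1%} of keys -> {covered[int(frac * n) - 1]:.0%} of accesses")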

SLIDE 27

Rocksteady

  • Instantaneous ownership transfer
  • Leverage skew to rapidly migrate hot data
  • Adaptive parallel, pipelined at source and target
  • Safely defer synchronous replication at target

SLIDE 28

Parallel, Pipelined, & Adaptive Pulls

[Diagram of the Target: a dispatch core polls the NIC and dispatches requests to worker cores; the migration manager issues pulls into pull buffers, and worker cores replay pulled records through per-core buffers into the hash table while still serving client reads]

  • Target-driven: the migration manager runs at the Target
  • Co-partitioned hash tables; pull from partitions in parallel
  • Replay pulled data into per-core buffers
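
A rough sketch of this target-side pipeline (hypothetical Python, not the RAMCloud implementation): the migration manager drains the co-partitioned hash-table partitions in parallel, each pull returning a small chunk that a worker then replays into the target's table.

    # Hypothetical target-side migration manager: parallel pulls, replay on workers.
    from concurrent.futures import ThreadPoolExecutor

    def migrate(partitions, pull_chunk, replay, workers=8):
        """pull_chunk(partition, cursor) -> (records, next cursor or None);
        replay(records) inserts the records into the target's hash table."""
        def drain(partition):
            cursor = 0
            while cursor is not None:
                records, cursor = pull_chunk(partition, cursor)  # ~20 KB per pull
                replay(records)                                  # fills per-core buffers
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(drain, partitions))                    # one pipeline per partition

    chunks = {p: [[(f"{p}-{i}", i)] for i in range(3)] for p in ("p0", "p1")}
    table = {}
    migrate(chunks,
            lambda p, c: (chunks[p][c], c + 1 if c + 1 < len(chunks[p]) else None),
            lambda recs: table.update(recs), workers=2)
    print(len(table))                                            # 6 records replayed
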
SLIDE 29

Parallel, Pipelined, & Adaptive Pulls

[Diagram of the Source: a dispatch core polls the NIC; incoming pull() requests scan hash-table partitions and copy record addresses into gather lists, alongside normal client reads]

  • Stateless passive Source
  • Granular 20 KB pulls
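
A corresponding sketch of the source side (hypothetical names): the source keeps no migration state; each pull names a partition and a resume cursor, and the source gathers records from that point until roughly 20 KB is collected, returning the batch and the next cursor.

    # Hypothetical stateless source: each pull gathers ~20 KB starting at a cursor.
    PULL_BYTES = 20 * 1024

    def handle_pull(partition, cursor):
        """partition: list of (key, value) records; returns (batch, next cursor or None).
        The source keeps no per-migration state between calls."""
        batch, size = [], 0
        while cursor < len(partition) and size < PULL_BYTES:
            key, value = partition[cursor]
            batch.append((key, value))               # the real protocol gathers addresses, zero-copy
            size += len(key) + len(value)
            cursor += 1
        return batch, (cursor if cursor < len(partition) else None)

    part = [(f"key{i}", b"x" * 100) for i in range(1000)]        # ~100 B values
    batch, nxt = handle_pull(part, 0)
    print(len(batch), nxt)                                       # roughly 195 records per pull
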
SLIDE 30

Parallel, Pipelined, & Adaptive Pulls

  • Redirect any idle CPU for migration
  • Migration yields to regular requests, on-demand pulls
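
A sketch of the priority order these bullets describe (hypothetical, not RAMCloud's dispatcher): a worker core always prefers client-facing requests, then on-demand pulls, and only then background migration work, so migration soaks up idle CPU without inflating client latency.

    # Hypothetical per-core work selection: clients first, on-demand pulls next,
    # background migration work (parallel pulls, replay) only when otherwise idle.
    from collections import deque

    client_rpcs, on_demand_pulls, migration_work = deque(), deque(), deque()

    def next_task():
        for queue in (client_rpcs, on_demand_pulls, migration_work):
            if queue:
                return queue.popleft()
        return None                                  # core is truly idle

    migration_work.append("replay chunk 7")
    client_rpcs.append("read(B)")
    print(next_task())                               # "read(B)" wins over migration work
    print(next_task())                               # then migration proceeds
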
SLIDE 31

Rocksteady

  • Instantaneous ownership transfer
  • Leverage skew to rapidly migrate hot data
  • Adaptive parallel, pipelined at source and target
  • Safely defer synchronous replication at target

SLIDE 32

Naïve Fault Tolerance During Migration

Each server has a recovery log distributed across backups in the cluster

[Diagram: the Source holds A, B, C; each server's recovery log is replicated across Backup servers]

SLIDE 33

Naïve Fault Tolerance During Migration

Migrated data needs to be triplicated to target’s recovery log

[Diagram: record A is migrated to the Target, but A is not yet on the Target's recovery log]

SLIDE 34

Naïve Fault Tolerance During Migration

Migrated data needs to be triplicated to target’s recovery log

[Diagram: migrating A requires writing three replicas of A to the Target's recovery log on backups]

SLIDE 35

Synchronous Replication Bottlenecks Migration

Synchronous replication cuts migration speed by 34%

[Diagram: each migrated record (A, then B, …) is synchronously triplicated to the Target's recovery log during migration]

SLIDE 36

Rocksteady: Safely Defer Replication At The Target

Replicate at the Target only after all data has been moved over

[Diagram: A, B, and C have all been migrated before anything is written to the Target's recovery log]

SLIDE 37

Writes/Mutations Served By Target

Mutations have to be replicated by the target

[Diagram: a client Write at the Target updates C to C′; the mutation C′ is triplicated to the Target's recovery log immediately]

SLIDE 38

Crash Safety During Migration

  • Need both the Source and Target recovery logs for data recovery
  • Initial table state on the Source recovery log
  • Writes/mutations on the Target recovery log
  • Transfer ownership back to the Source in case of a crash
  • Migration cancelled
  • Recovery involves both recovery logs
  • Source takes a dependency on the Target recovery log at migration start
  • Stored reliably at the cluster coordinator
  • Identifies the log position after which mutations are present
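
A sketch of the recovery rule these bullets imply (hypothetical structure, not RAMCloud's recovery code): at migration start the coordinator durably records the position in the Target's recovery log from which mutations for the migrating data may appear; recovery then replays the Source log for the pre-migration state and the Target log from that position for mutations, whichever server crashed.

    # Hypothetical recovery bookkeeping: the dependency position is recorded durably
    # at the coordinator when migration starts.
    def recover(source_log, target_log, dependency_pos):
        state = {}
        for key, value in source_log:                            # pre-migration table state
            state[key] = value
        for pos, (key, value) in enumerate(target_log):          # mutations since migration start
            if pos >= dependency_pos:
                state[key] = value
        return state                                             # ownership reverts to the Source

    source_log = [("A", 1), ("B", 2), ("C", 3)]
    target_log = [("X", 9), ("C", 30)]                           # "X" predates this migration
    print(recover(source_log, target_log, dependency_pos=1))     # {'A': 1, 'B': 2, 'C': 30}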

SLIDE 39

If The Source Crashes During Migration

Recover the Source: the pre-migration state comes from the Source recovery log, mutations from the Target recovery log

[Diagram: Source crash; A, B, C are recovered from the Source recovery log and the mutation C′ from the Target recovery log]

SLIDE 40

If The Target Crashes During Migration

Recover from the Source and Target recovery logs, then recover the Target

[Diagram: Target crash; migrated data is still covered by the Source recovery log, and the mutation C′ by the Target recovery log]

SLIDE 41

Crash Safety During Migration

  • Need both the Source and Target recovery logs for data recovery
  • Initial table state on the Source recovery log
  • Writes/mutations on the Target recovery log
  • Transfer ownership back to the Source in case of a crash
  • Migration cancelled
  • Recovery involves both recovery logs
  • Source takes a dependency on the Target recovery log at migration start
  • Stored reliably at the cluster coordinator
  • Identifies the log position after which mutations are present

Safely Transfer Ownership At Migration Start
Safely Delay Replication Till All Data Has Been Moved

SLIDE 42

Performance of Rocksteady

[Graph: median access latency (µs) vs. time since experiment start (s), comparing Rocksteady against a baseline where the Source keeps ownership and uses synchronous replication]

YCSB-B, 300 Million objects (30 B key, 100 B value), migrate half

SLIDE 43

Performance of Rocksteady

[Graph: same comparison; Rocksteady's median latency drops below the baseline's after ~14 seconds]

YCSB-B, 300 Million objects (30 B key, 100 B value), migrate half

SLIDE 44

Performance of Rocksteady

[Graph: same comparison; Rocksteady's median latency is better after ~14 seconds and migration completes 28% faster]

YCSB-B, 300 Million objects (30 B key, 100 B value), migrate half

SLIDE 45

Related Work

  • Dynamo: Pre-partition hash keys
  • Spanner: Applications given control over locality (directories)
  • FaRM and DrTM: Re-use in-memory redundancy for migration
  • Squall: Reconfiguration protocol for H-Store
  • Early ownership transfer
  • Paced background migration
  • Fully partitioned, serial execution, no synchronization
  • Each migration pull stalls execution
  • Synchronous replication at the target

SLIDE 46

Conclusion

  • Distributed low-latency in-memory key-value stores are emerging
  • Predictable response times ~10 µs median, ~60 µs 99.9th-tile
  • Problem: Must migrate data between servers
  • Minimize performance impact of migration → go slow?
  • Quickly respond to hot spots, skew shifts, load spikes → go fast?
  • Solution: Fast data migration with low impact
  • Leverage skew: Transfer ownership before data, move hot data first
  • Low priority, parallel and adaptive migration
  • Result: Migration protocol for RAMCloud in-memory key-value store
  • Migrates at 758 MB/s with 99.9th-tile latency < 250 µs

Source Code: https://github.com/utah-scs/RAMCloud/tree/rocksteady-sosp2017

SLIDE 47

Backup Slides

SLIDE 48

Rocksteady Tail Latency Breakdown

[Chart: tail latency (µs, scale 50–300) for Rocksteady]

SLIDE 49

Rocksteady Tail Latency Breakdown

[Chart: tail latency (µs, scale 50–300) breakdown for Rocksteady]

  • Disabling parallel pulls brings tail latency down to 160 µs
SLIDE 50

Rocksteady Tail Latency Breakdown

[Chart: tail latency (µs, scale 50–300) breakdown for Rocksteady]

  • Disabling parallel pulls brings tail latency down to 160 µs
  • Synchronous on-demand pulls further bring tail latency down to 135 µs