RIPQ: Advanced Photo Caching on Flash for Facebook, by Linpeng Tang (PowerPoint presentation transcript)





SLIDE 1

RIPQ: Advanced Photo Caching on Flash for Facebook

Linpeng Tang (Princeton)

Qi Huang (Cornell & Facebook), Wyatt Lloyd (USC & Facebook), Sanjeev Kumar (Facebook), Kai Li (Princeton)

SLIDE 2

Photo Serving Stack: 2 Billion* Photos Shared Daily

* Facebook 2014 Q4 Report

[Diagram: photo serving stack on top of the Storage Backend]

SLIDE 3

Photo Caches

  • Edge Cache: close to users, reduces backbone traffic
  • Origin Cache: co-located with the backend (on flash), reduces backend IO

[Diagram: photo serving stack with Storage Backend, flash-based Origin Cache, and Edge Cache]

SLIDE 4

An Analysis of Facebook Photo Caching [Huang et al., SOSP'13]:

  • Segmented LRU-3: 10% less backbone traffic
  • Greedy-Dual-Size-Frequency-3: 23% fewer backend IOs

Advanced caching algorithms help!

SLIDE 5

In practice, FIFO was still used: there was no known way to implement advanced algorithms efficiently on flash.

SLIDE 6

Theory: advanced caching helps

  • 23% fewer backend IOs
  • 10% less backbone traffic

Practice: difficult to implement on flash

  • FIFO still used

Restricted Insertion Priority Queue (RIPQ): efficiently implement advanced caching algorithms on flash

SLIDE 7

Outline

  • Why are advanced caching algorithms difficult to implement efficiently on flash?
  • How does RIPQ solve this problem?
    – Why use a priority queue?
    – How to implement one efficiently on flash?
  • Evaluation
    – 10% less backbone traffic
    – 23% fewer backend IOs

SLIDE 8

Outline

  • Why are advanced caching algorithms difficult to implement efficiently on flash?
    – Write pattern of FIFO and LRU
  • How does RIPQ solve this problem?
    – Why use a priority queue?
    – How to implement one efficiently on flash?
  • Evaluation
    – 10% less backbone traffic
    – 23% fewer backend IOs

SLIDE 9

FIFO Does Sequential Writes

[Diagram: cache space of FIFO, head to tail]

SLIDE 10

FIFO Does Sequential Writes

[Diagram: a miss is written at the head of the FIFO cache space]

SLIDE 11

FIFO Does Sequential Writes

[Diagram: a hit leaves the FIFO cache space unchanged]

SLIDE 12

FIFO Does Sequential Writes

[Diagram: the object at the tail of the FIFO cache space is evicted]

No random writes needed for FIFO
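The sequential write pattern above can be sketched as a circular log (a minimal Python model for illustration, not Facebook's implementation; all names are hypothetical):

```python
class FifoFlashCache:
    """FIFO caching on flash behaves like a circular log: every miss
    appends at the head, eviction advances the tail, and all writes
    land at sequentially increasing offsets."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.head = 0            # next write offset on flash
        self.index = {}          # key -> offset of the cached copy
        self.write_offsets = []  # trace of flash writes (for illustration)

    def get(self, key):
        return self.index.get(key)   # a hit changes nothing on flash

    def put(self, key, size):
        # Miss: append at the head; wrap around when the device is full.
        off = self.head % self.capacity
        self.index[key] = off
        self.write_offsets.append(off)
        self.head += size
        # (Eviction of overwritten tail objects is omitted for brevity.)
```

Running a few misses through this model shows strictly increasing write offsets, which is exactly why FIFO needs no random writes.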

SLIDE 13

LRU Needs Random Writes

[Diagram: cache space of LRU, head to tail. On a hit, the object moves to the head of the queue, so locations on flash ≠ locations in the LRU queue]

SLIDE 14

LRU Needs Random Writes

[Diagram: cache space of LRU; live objects become non-contiguous on flash]

Random writes needed to reuse space

SLIDE 15

Why Care About Random Writes?

  • Write-heavy workload
    – Long-tail access pattern, moderate hit ratio
    – Each miss triggers a write to the cache
  • Small random writes are harmful for flash
    – e.g., Min et al., FAST'12
    – High write amplification, low write throughput, short device lifetime

SLIDE 16

What Write Size Do We Need?

  • Large writes
    – High write throughput at high utilization
    – 16~32 MiB in Min et al., FAST'12
  • What's the trend since then?
    – Random writes tested on 3 modern devices
    – 128~512 MiB needed now

100 MiB+ writes needed for efficiency

SLIDE 17

Outline

  • Why are advanced caching algorithms difficult to implement efficiently on flash?
  • How does RIPQ solve this problem?
  • Evaluation

SLIDE 18

RIPQ Architecture (Restricted Insertion Priority Queue)

[Diagram: an advanced caching policy (SLRU, GDSF, …) sits on top of the RIPQ priority queue API; RIPQ maintains an approximate priority queue in RAM and issues flash-friendly workloads to flash]

Efficient caching on flash; the caching algorithms are approximated as well.

SLIDE 19

RIPQ Architecture (Restricted Insertion Priority Queue)

[Diagram: the same architecture, annotated with RIPQ's techniques]

  • Restricted insertion
  • Section merge/split
  • Large writes
  • Lazy updates

SLIDE 20

Priority Queue API

  • No single best caching policy
  • Segmented LRU [Karedla '94]
    – Reduces both backend IO and backbone traffic
    – SLRU-3: best algorithm for the Edge so far
  • Greedy-Dual-Size-Frequency [Cherkasova '98]
    – Favors small objects
    – Further reduces backend IO
    – GDSF-3: best algorithm for the Origin so far

SLIDE 21

Segmented LRU

  • Concatenation of K LRU caches

[Diagram: cache space of SLRU-3, head to tail, with segments L3, L2, L1. A miss is about to be inserted]

SLIDE 22

Segmented LRU

  • Concatenation of K LRU caches

[Diagram: the missed object enters at the head of the lowest segment, L1]

SLIDE 23

Segmented LRU

  • Concatenation of K LRU caches

[Diagram: on a hit, the object is promoted to the head of L2]

SLIDE 24

Segmented LRU

  • Concatenation of K LRU caches

[Diagram: on another hit, the object is promoted again, to the head of L3]
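The SLRU dynamics animated in the last few slides can be sketched in a few lines (a simplified in-RAM Python model; the class, method names, and parameters are illustrative, not the talk's code):

```python
from collections import OrderedDict

class SegmentedLRU:
    """K concatenated LRU segments: a miss enters the lowest segment,
    each hit promotes the object one segment toward the head, and
    segment overflow demotes the oldest entry one segment down."""

    def __init__(self, k, seg_capacity):
        self.segments = [OrderedDict() for _ in range(k)]  # segments[0] = L1
        self.seg_capacity = seg_capacity

    def _find(self, key):
        for i, seg in enumerate(self.segments):
            if key in seg:
                return i
        return None

    def access(self, key):
        i = self._find(key)
        if i is None:
            self._insert(0, key)   # miss: enter at the lowest segment
            return False
        del self.segments[i][key]  # hit: promote one segment toward the head
        self._insert(min(i + 1, len(self.segments) - 1), key)
        return True

    def _insert(self, i, key):
        self.segments[i][key] = True
        if len(self.segments[i]) > self.seg_capacity:
            victim, _ = self.segments[i].popitem(last=False)  # oldest entry
            if i > 0:
                self._insert(i - 1, victim)  # demote one segment down
            # at i == 0 the victim is evicted from the cache entirely
```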

SLIDE 25

Greedy-Dual-Size-Frequency

  • Favors small objects

[Diagram: cache space of GDSF-3, head to tail]

SLIDE 26

Greedy-Dual-Size-Frequency

  • Favors small objects

[Diagram: a miss is inserted mid-queue, at a position determined by its priority]

SLIDE 27

Greedy-Dual-Size-Frequency

  • Favors small objects

[Diagram: another miss is inserted at a different mid-queue position]

SLIDE 28

Greedy-Dual-Size-Frequency

  • Favors small objects

[Diagram: cache space of GDSF-3, head to tail]

  • Write workload more random than LRU
  • Operations similar to a priority queue
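For concreteness, here is a sketch of the GDSF priority computation, assuming the classic formulation from Cherkasova's work (priority = clock + frequency × cost / size, with cost taken as 1 per object); the slides do not give the formula, so treat this as background, and the function name is illustrative:

```python
def gdsf_priority(clock, frequency, size, cost=1.0):
    """GDSF priority: smaller and more frequently accessed objects get
    higher priority, so more small objects fit in the cache. The clock
    (priority of the last evicted object) inflates over time so that
    newly inserted objects are not starved by old high-priority ones."""
    return clock + frequency * cost / size

# A 10 KiB photo accessed twice outranks a 100 KiB photo accessed twice:
small = gdsf_priority(clock=0.0, frequency=2, size=10 * 1024)
large = gdsf_priority(clock=0.0, frequency=2, size=100 * 1024)
```

Because the insertion priority depends on size and frequency, new objects land at varying queue positions, which is why GDSF's write workload is even more random than LRU's.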

SLIDE 29

Relative Priority Queue for Advanced Caching Algorithms

[Diagram: cache space from head (priority 1.0) to tail (priority 0.0)]

Miss object: insert(x, p)

SLIDE 30

Relative Priority Queue for Advanced Caching Algorithms

[Diagram: cache space from head (priority 1.0) to tail (priority 0.0)]

Hit object: increase(x, p')

SLIDE 31

Relative Priority Queue for Advanced Caching Algorithms

[Diagram: cache space from head (priority 1.0) to tail (priority 0.0)]

Implicit demotion on insert/increase: objects with lower priorities move towards the tail.

SLIDE 32

Relative Priority Queue for Advanced Caching Algorithms

[Diagram: the object at the queue tail (priority near 0.0) is evicted]

Relative priority queue captures the dynamics of many caching algorithms!
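These operations can be captured in a tiny exact in-RAM model (a Python sketch of the interface; RIPQ itself only approximates this queue on flash, and the class and method bodies here are illustrative):

```python
import bisect

class RelativePriorityQueue:
    """Exact model of the relative priority queue API: priorities are
    relative positions, 1.0 at the head and 0.0 at the tail; every
    insert/increase implicitly demotes lower-priority objects."""

    def __init__(self):
        self._entries = []   # sorted list of (priority, key), tail first

    def insert(self, key, p):
        # Miss: place the object at relative priority p.
        bisect.insort(self._entries, (p, key))

    def increase(self, key, p_new):
        # Hit: move the object to its new, higher relative priority.
        self._entries = [(p, k) for p, k in self._entries if k != key]
        self.insert(key, p_new)

    def evict(self):
        # Evict from the queue tail (lowest priority).
        _, key = self._entries.pop(0)
        return key
```

SLRU and GDSF both reduce to sequences of these calls, which is why this single interface captures so many caching algorithms.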

SLIDE 33

RIPQ Design: Large Writes

  • Need to buffer object writes (tens of KiB) into block writes
  • Once written, blocks are immutable!
  • 256 MiB block size, 90% utilization
  • Large caching capacity
  • High write throughput
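The buffering scheme can be sketched as follows (a simplified Python model; the class, the flush callback, and the parametrized block size are hypothetical, not RIPQ's actual code):

```python
class BlockBuffer:
    """Accumulates tens-of-KiB object writes in a RAM buffer and flushes
    them to flash only as one large, immutable block, keeping the
    device's write pattern sequential."""

    def __init__(self, flush, block_size=256 * 1024 * 1024):  # 256 MiB default
        self.flush = flush            # callback receiving the full block bytes
        self.block_size = block_size
        self.buf = bytearray()

    def write_object(self, data):
        offset = len(self.buf)        # object's offset within the block
        self.buf.extend(data)
        if len(self.buf) >= self.block_size:
            self.flush(bytes(self.buf))  # one large flash write; the block
            self.buf = bytearray()       # is immutable once written
        return offset
```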
SLIDE 34

RIPQ Design: Restricted Insertion Points

  • An exact priority queue can insert into any block in the queue
  • Each block then needs a separate RAM buffer
  • The whole flash space would be buffered in RAM!
SLIDE 35

RIPQ Design: Restricted Insertion Points

Solution: restricted insertion points

SLIDE 36

Section is the Unit for Insertion

[Diagram: the queue, head to tail, split into three sections with priority ranges 1..0.6, 0.6..0.35, and 0.35..0. Each section has one active block with a RAM buffer; the remaining blocks are sealed on flash]

Each section has one insertion point.

SLIDE 37

Section is the Unit for Insertion

[Diagram: insert(x, 0.55) lands in the middle section; the section ranges shift from 1..0.6 / 0.6..0.35 / 0.35..0 to 1..0.62 / 0.62..0.33 / 0.33..0]

Insert procedure:

  • Find the corresponding section
  • Copy the data into its active block
  • Update the section's priority range
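The insert procedure above can be sketched as follows (a simplified Python model; the classes and function names are illustrative, and range adjustment is omitted):

```python
class Section:
    """One section of the queue: a contiguous priority range [lo, hi)
    plus a single active block buffered in RAM. Only K such buffers are
    needed, not one per flash block."""

    def __init__(self, hi, lo):
        self.hi, self.lo = hi, lo
        self.active_block = []   # objects buffered in RAM for this section

def find_section(sections, p):
    # sections are ordered head (highest priorities) to tail
    for s in sections:
        if s.lo <= p:
            return s
    return sections[-1]          # clamp anything below the tail range

def insert(sections, key, p):
    s = find_section(sections, p)
    s.active_block.append(key)   # copy the data into the section's active block
```

With insertions restricted to K points, RAM buffering is K blocks instead of the whole device; the price is that order within a section is only approximate.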

SLIDE 38

Section is the Unit for Insertion

[Diagram: the queue with section ranges 1..0.62, 0.62..0.33, and 0.33..0; active blocks buffered in RAM, sealed blocks on flash]

Relative order within one section is not guaranteed!

SLIDE 39

Trade-off in Section Size

Section size controls the approximation error:

  • More (smaller) sections: lower approximation error
  • More (smaller) sections: more RAM buffer

[Diagram: the queue, head to tail, with section ranges 1..0.62, 0.62..0.33, and 0.33..0]

SLIDE 40

RIPQ Design: Lazy Update

[Diagram: increase(x, 0.9) under the naïve approach copies x to the active block of the head section]

Problem with the naïve approach:

  • Data copying/duplication on flash

SLIDE 41

RIPQ Design: Lazy Update

Solution: use a virtual block to track the updated location!

SLIDE 42

RIPQ Design: Lazy Update

[Diagram: each section gains a virtual block at its insertion point]

Solution: use a virtual block to track the updated location!

SLIDE 43

Virtual Block Remembers Update Location

[Diagram: increase(x, 0.9) adds an entry for x to the virtual block of the head section]

No data is written during a virtual update.

SLIDE 44

Actual Update During Eviction

[Diagram: x's original copy has reached the tail block]

x is now in the tail block.

SLIDE 45

Actual Update During Eviction

[Diagram: x's data is copied to the active block of the section recorded by its virtual entry]

Always exactly one copy of the data on flash.
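The lazy-update lifecycle (virtual move on a hit, materialization at eviction) can be sketched as follows (a simplified Python model; the class and callback names are hypothetical):

```python
class LazyUpdater:
    """Tracks virtual moves: a hit only records the object's new section
    in RAM (no flash write); the data is copied to an active block later,
    when its original block reaches the tail and is evicted. Flash thus
    always holds exactly one copy of each object."""

    def __init__(self):
        self.virtual = {}          # key -> section it was virtually moved to

    def increase(self, key, section):
        self.virtual[key] = section   # no data written during a virtual update

    def evict_block(self, block, write_to_active):
        # block: list of (key, data) pairs in the tail block being evicted.
        for key, data in block:
            if key in self.virtual:
                # Actual update: materialize the virtual move now.
                write_to_active(self.virtual.pop(key), key, data)
            # keys without a virtual entry are simply evicted
```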

SLIDE 46

RIPQ Design

  • Relative priority queue API
  • RIPQ design points
    – Large writes
    – Restricted insertion points
    – Lazy update
    – Section merge/split (balances section sizes and RAM buffer usage)
  • Static caching
    – Photos are static

SLIDE 47

Outline

  • Why are advanced caching algorithms difficult to implement efficiently on flash?
  • How does RIPQ solve this problem?
  • Evaluation

SLIDE 48

Evaluation Questions

  • How much RAM buffer is needed?
  • How good is RIPQ's approximation?
  • What is the throughput of RIPQ?

SLIDE 49

Evaluation Approach

  • Real-world Facebook workloads
    – Origin
    – Edge
  • 670 GiB flash card
    – 256 MiB block size
    – 90% utilization
  • Baselines
    – FIFO
    – SIPQ: Single Insertion Priority Queue

SLIDE 50

RIPQ Needs a Small Number of Insertion Points

[Plot: object-wise hit-ratio (%) vs. number of insertion points (2 to 32), for Exact GDSF-3, RIPQ GDSF-3, Exact SLRU-3, RIPQ SLRU-3, and FIFO; the advanced algorithms gain +6% to +16% hit-ratio over FIFO]

SLIDE 51

RIPQ Needs a Small Number of Insertion Points

[Plot: object-wise hit-ratio (%) vs. number of insertion points]

SLIDE 52

RIPQ Needs a Small Number of Insertion Points

[Plot: object-wise hit-ratio (%) vs. number of insertion points]

You don't need much RAM buffer (2 GiB)!

SLIDE 53

RIPQ Has High Fidelity

[Bar chart: object-wise hit-ratio (%) for SLRU-1/2/3 and GDSF-1/2/3, Exact vs. RIPQ, with FIFO as the baseline]

SLIDE 54

RIPQ Has High Fidelity

[Bar chart: object-wise hit-ratio (%), Exact vs. RIPQ]

SLIDE 55

RIPQ Has High Fidelity

[Bar chart: object-wise hit-ratio (%), Exact vs. RIPQ]

RIPQ achieves ≤0.5% difference for all algorithms

SLIDE 56

RIPQ Has High Fidelity

[Bar chart: object-wise hit-ratio (%), Exact vs. RIPQ]

+16% hit-ratio ➔ 23% fewer backend IOs

SLIDE 57

RIPQ Has High Throughput

[Bar chart: throughput (req./sec) for SLRU-1/2/3 and GDSF-1/2/3, RIPQ vs. FIFO]

RIPQ throughput is comparable to FIFO (≤10% difference)

SLIDE 58

Related Works

  • RAM-based advanced caching: SLRU (Karedla '94), GDSF (Young '94, Cao '97, Cherkasova '01), SIZE (Abrams '96), LFU (Maffeis '93), LIRS (Jiang '02), … RIPQ enables their use on flash
  • Flash-based caching solutions: Facebook FlashCache, Janus (Albrecht '13), Nitro (Li '13), OP-FCL (Oh '12), FlashTier (Saxena '12), Hec (Yang '13), … RIPQ supports advanced algorithms
  • Flash performance: Stoica '09, Chen '09, Bouganim '09, Min '12, … The trend continues for modern flash cards

SLIDE 59

RIPQ

  • First framework for advanced caching on flash
    – Relative priority queue interface
    – Large writes
    – Restricted insertion points
    – Lazy update
    – Section merge/split
  • Enables SLRU-3 & GDSF-3 for Facebook photos
    – 10% less backbone traffic
    – 23% fewer backend IOs