Adaptive Placement for In-memory Storage Functions - PowerPoint PPT Presentation


SLIDE 1

Adaptive Placement for In-memory Storage Functions

Ankit Bhardwaj, Chinmay Kulkarni, and Ryan Stutsman University of Utah

Utah Scalable Computer Systems Lab

SLIDE 2

Introduction

  • Kernel-bypass key-value stores offer < 10µs latency, Mops throughput
  • Fast because they are just dumb
  • Inefficient – Data movement, client stalls
  • Run application logic on the server?
  • Storage server can become a bottleneck; effects propagate back to clients
  • Key idea: Put application logic in decoupled functions
  • Profile invocations & adaptively place to avoid bottlenecks
  • Challenge: efficiently shifting compute at microsecond timescales
SLIDE 3

Disaggregation Improves Utilization and Scaling

Decouple compute and storage using the network: provision at idle capacity, scale independently.

[Diagram: Compute and Storage tiers connected by the network]

SLIDE 4

Disaggregation Improves Utilization and Scaling

Decouple compute and storage using the network: provision at idle capacity, scale independently.

[Diagram: Compute and Storage tiers; FaRM: < 10µs latency, RAMCloud: Mops throughput]

SLIDE 5

But, Data Movement Has a Cost

[Diagram: data movement between Compute and Storage tiers]

Massive data movement destroys efficiency. So, push code to storage?

SLIDE 6

Storage Function Requirements

  • Microsecond-scale -> low invocation cost
  • High-throughput, in-memory -> native code performance
  • Amenable to multi-core processing
  • Solution: Splinter allows loadable compiled extensions of storage functions

Splinter: Bare-Metal Extensions for Multi-Tenant Low-Latency Storage

SLIDE 7

Server-side Placement Can Improve Throughput

[Plot: throughput (millions of tree traversals/sec) vs. traversal depth (operations/invocation), Client-side]

[Diagram: client issues get()/put() calls to a hash table on the server; each traversal step costs another network RTT]

SLIDE 8

Server-side Placement Can Improve Throughput

[Plot: throughput (millions of tree traversals/sec) vs. traversal depth (operations/invocation), Client-side vs. Server-side; 50% gain annotated]

[Diagram: client issues a single invoke() to the hash table on the server]

Reduces (N-1) RPCs and RTTs.
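The gain above is a round-trip count argument: a client-side traversal of depth N issues N dependent get() RPCs (each must wait a full RTT before the next key is known), while a single invoke() ships the whole traversal to the server. A minimal sketch of that accounting (the function names are illustrative, not from the deck):

```python
# Sketch: RPCs needed to traverse an N-level structure in a key-value store.
# Client-side: one dependent get() per level; each step needs the previous
# value to compute the next key, so RPCs cannot be batched.
# Server-side: one invoke() RPC regardless of depth.

def client_side_rpcs(depth: int) -> int:
    """One round trip per traversal level."""
    return depth

def server_side_rpcs(depth: int) -> int:
    """The whole traversal runs inside a single invoke() call."""
    return 1

depth = 8
saved = client_side_rpcs(depth) - server_side_rpcs(depth)
print(saved)  # server-side saves (N - 1) = 7 RPCs and their RTTs
```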

SLIDE 9

Server-side Placement Can Improve Throughput

[Diagram: client issues invoke() over the network; FaRM: 200,000 ops/s/core, Server-side: 400,000 ops/s/core]

Facebook TAO graph operations perform 2x better compared to the state-of-the-art system FaRM.

SLIDE 10

Server-side Placement Can Bottleneck the Server

  • Server-side placement is good for data-intensive functions
  • Compute-intensive functions make the server CPU the bottleneck
  • An overloaded server stops responding even to get()/put() requests
  • Overall system throughput drops
SLIDE 11

Server-side Placement Can Bottleneck the Server

[Plot: throughput (millions of invocations/s) vs. invocation computation (cycles/invocation) at tree depth 2; Server-side ranges from 22% higher to 55% lower than Client-side]

SLIDE 12

What about Rebalancing and Load-Balancing?

  • Workload change can happen in two ways
  • Workload shifts in function call distribution over time
  • Shifts in per-invocation costs
  • Migrate data only when the workload is stable
  • Move load to the client and use the server CPU for migration

[Plot: invocation computation, frequency, and load over time]

SLIDE 13

Key Insight: Decoupled Functions Can Run Anywhere

  • Tenants write logically decoupled functions using the standard get/put interface
  • Clients physically push and run functions server-side
  • Or the clients could run the functions locally
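One way to picture a "logically decoupled" function: it is written purely against the get/put interface, so the same code can bind to a server-local table or a client-side RPC stub. A minimal sketch (the `LocalStore` class and `traverse` function are illustrative, not Splinter's actual API):

```python
# Sketch: a storage function written only against get/put, so the same
# logic can run server-side (local hash table) or client-side (RPC stub).

class LocalStore:
    """Server-side binding: get() hits the in-memory hash table directly.
    A client-side binding would implement the same get() over RPC."""
    def __init__(self, table):
        self.table = table

    def get(self, key):
        return self.table.get(key)

def traverse(store, root_key, depth):
    """Follow a chain of keys: each value names the next key to fetch.
    The function never cares whether store.get is local or remote."""
    key = root_key
    for _ in range(depth):
        key = store.get(key)
    return key

# Server-side execution against a local table.
table = {"root": "a", "a": "b", "b": "leaf"}
print(traverse(LocalStore(table), "root", 3))  # -> leaf
```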
SLIDE 14

Goal: The Best of Both Worlds

[Plot: throughput (millions of invocations/s) vs. invocation computation (cycles/invocation); the Ideal curve tracks Server-side in the data-intensive regime and Client-side in the compute-intensive regime]

SLIDE 15

Adaptive Storage Function Placement (ASFP)

[Diagram: server-side storage function execution; the server runs the Get, Validate, and Compute steps of each invocation]

SLIDE 16

Adaptive Storage Function Placement (ASFP)

Running heavy compute at the client creates room for the remaining work.

[Diagram: server-side execution (Get, Validate, Compute all on the server) vs. pushed-back execution (Compute runs on the client while the server serves Gets and Validates)]

SLIDE 17

Adaptive Storage Function Placement (ASFP)

  • Mechanisms
  • Server-side: Run storage functions, suspend, push back to client
  • Client-side: Runtime, transparent remote data access
  • Consistency and concurrency control
  • Policies
  • Invocation profiling & cost modeling
  • Overload detection
SLIDE 18

Server-side Storage Function Execution

[State machine: Invoke creates a Ready task; Schedule moves it to Running; Get (local) and Yield return it to Ready; Validation moves it to Committed/Aborted and sends the Result; on Server Overload the task is Offloaded to the Pushback state and returned to the client]
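The state machine above can be read as a small task lifecycle. A schematic sketch, with the state names taken from the slide and the transition table assumed for illustration:

```python
from enum import Enum, auto

class TaskState(Enum):
    READY = auto()      # invoked, waiting to be scheduled
    RUNNING = auto()    # executing on a server core
    PUSHBACK = auto()   # offloaded back to the client on overload
    COMMITTED = auto()  # validation succeeded
    ABORTED = auto()    # validation failed

def step(state, event):
    """Transition table for a server-side storage function task."""
    transitions = {
        (TaskState.READY, "schedule"): TaskState.RUNNING,
        (TaskState.RUNNING, "yield"): TaskState.READY,
        (TaskState.RUNNING, "get_local"): TaskState.READY,
        (TaskState.READY, "overload"): TaskState.PUSHBACK,
        (TaskState.RUNNING, "validate_ok"): TaskState.COMMITTED,
        (TaskState.RUNNING, "validate_fail"): TaskState.ABORTED,
    }
    return transitions[(state, event)]

s = TaskState.READY
s = step(s, "schedule")     # Ready -> Running
s = step(s, "yield")        # cooperative yield back to Ready
s = step(s, "schedule")
s = step(s, "validate_ok")  # validation commits the task
print(s)  # TaskState.COMMITTED
```

Under overload, the same table routes a Ready task to `PUSHBACK` instead of ever reaching a core, which is what lets the server shed compute without tracking the task further.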

SLIDE 22

Consistency and Concurrency Control

  • Problem: invoke() tasks run concurrently on each server core, and pushed-back invocations run in parallel with the server tasks
  • Solution: Run invocations as strictly serializable transactions
  • Use optimistic concurrency control (OCC)
  • Read/write set tracking is also used in pushback
  • Pushed-back invocations never generate work for the server
  • The server doesn't need to maintain any state for pushed-back invocations
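A toy version of the OCC scheme described above: each invocation records the version of every key it reads, buffers its writes, and validates at commit time. This is a generic OCC sketch under assumed names, not Splinter's implementation:

```python
# Sketch: optimistic concurrency control with read/write set tracking.
# The store keeps a version per key; commit re-checks the read versions.

class VersionedStore:
    def __init__(self):
        self.data = {}      # key -> value
        self.version = {}   # key -> monotonically increasing version

    def put(self, key, value):
        self.data[key] = value
        self.version[key] = self.version.get(key, 0) + 1

class Txn:
    def __init__(self, store):
        self.store = store
        self.read_set = {}   # key -> version observed at read time
        self.write_set = {}  # key -> buffered value

    def get(self, key):
        if key in self.write_set:            # read-your-writes
            return self.write_set[key]
        self.read_set[key] = self.store.version.get(key, 0)
        return self.store.data.get(key)

    def put(self, key, value):
        self.write_set[key] = value          # buffered until commit

    def commit(self):
        # Validate: abort if any key we read changed since we read it.
        for key, ver in self.read_set.items():
            if self.store.version.get(key, 0) != ver:
                return False
        for key, value in self.write_set.items():
            self.store.put(key, value)
        return True

store = VersionedStore()
store.put("x", 1)
t = Txn(store)
t.put("y", t.get("x") + 1)
store.put("x", 99)   # conflicting write lands before commit
print(t.commit())    # False: the read of "x" no longer validates
```

Because a pushed-back invocation carries its own read/write set, the same validation message is all the server ever sees from it, which is how pushback avoids creating server-side state.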
SLIDE 23

Client-side Execution for Pushed-back Invocations

[State machine: Pushback installs the RW set and Creates a Ready task; Schedule moves it to Running; Get (in local read set) and Yield return it to Ready; a remote Get moves it to Awaiting Data until the response arrives; on completion the client sends Validate, the task waits in Awaiting Validation, and the Validation Result marks it Committed/Aborted]
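The key trick in this state machine is the "Get (in local Read Set)" transition: a pushed-back invocation resumes against the read/write set the server shipped with it, and only misses go back over the network. A schematic sketch (class and parameter names assumed):

```python
# Sketch: client-side data access for a pushed-back invocation.
# Reads that hit the installed RW set are free; misses trigger a remote
# fetch (the task would sit in "Awaiting Data" until the response).

class PushedBackRuntime:
    def __init__(self, installed_read_set, remote_fetch):
        self.read_set = dict(installed_read_set)  # shipped with the pushback
        self.remote_fetch = remote_fetch          # RPC back to the server
        self.remote_gets = 0

    def get(self, key):
        if key in self.read_set:       # served locally, no network round trip
            return self.read_set[key]
        self.remote_gets += 1          # "Awaiting Data" transition happens here
        value = self.remote_fetch(key)
        self.read_set[key] = value     # recorded for commit-time validation
        return value

server_data = {"a": 1, "b": 2, "c": 3}
rt = PushedBackRuntime({"a": 1}, remote_fetch=server_data.get)
total = rt.get("a") + rt.get("b") + rt.get("a")
print(total, rt.remote_gets)  # 4 1 -- only "b" needed a remote fetch
```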

SLIDE 30

Adaptive Storage Function Placement (ASFP)

  • Mechanism
  • Server-side: Storage functions, suspend, move back to client
  • Client-side: Runtime, transparent remote data access
  • Consistency and concurrency control
  • Policy
  • Server overload detection
  • Invocation profiling and classification
SLIDE 31

Server Overload Detection

  • Always run the invocations on the server, if underloaded
  • Guarantees
  • Start pushback only when there are some old tasks and the server receives even more tasks
  • Keep at least u tasks even after pushback, to avoid server idleness
  • Consider only invoke() tasks for overload detection

[Flowchart: PollRecvQueue -> PacketToTask; if #OldTasks > t and #NewTasks > t, Classify&Pushback; otherwise AddTasksToQueue and ExecuteTasks-RR]

Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads
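The flowchart amounts to a queue-depth heuristic: push back only when old tasks linger and new ones keep arriving, and always retain at least u tasks so server cores never idle. A minimal sketch of that decision (the thresholds t and u are from the slide; the exact shedding formula is an illustrative assumption):

```python
# Sketch: pushback decision from the overload-detection flowchart.
# Push back only when both old and newly arrived invoke() tasks exceed t,
# and keep at least u tasks resident so the server never goes idle.

def tasks_to_push_back(old_tasks: int, new_tasks: int, t: int, u: int) -> int:
    if old_tasks <= t or new_tasks <= t:
        return 0                 # not overloaded: run everything locally
    total = old_tasks + new_tasks
    return max(0, total - u)     # shed load, but keep u tasks on the server

# Underloaded: nothing is pushed back.
print(tasks_to_push_back(old_tasks=2, new_tasks=1, t=4, u=8))    # 0
# Overloaded: shed everything beyond the floor of u resident tasks.
print(tasks_to_push_back(old_tasks=10, new_tasks=12, t=4, u=8))  # 14
```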

SLIDE 32

Invocation Profiling and Classification

  • Profile each invocation for time spent in compute and data access
  • Classify an invocation compute-bound if
  • It spent more time in compute than data access
  • It crossed a threshold: c > nD
  • c is the amount of compute done by the invocation
  • n is the total number of data accesses so far
  • D is the CPU cost to process one request
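The compute-bound test c > nD can be written down directly; a sketch with illustrative cycle counts (the threshold form is from the slide, the numbers are made up):

```python
# Sketch: classify an invocation as compute-bound once the compute it has
# consumed (c) exceeds what serving its data accesses would cost (n * D).

def is_compute_bound(compute_cycles: int, data_accesses: int,
                     cycles_per_request: int) -> bool:
    """c > n * D, with c = compute so far, n = data accesses so far,
    D = CPU cost to process one request."""
    return compute_cycles > data_accesses * cycles_per_request

D = 1500  # assumed CPU cycles to serve one get() request
# 3 accesses set the bar at 4500 cycles of compute.
print(is_compute_bound(2000, 3, D))  # False: 2000 <= 4500, keep server-side
print(is_compute_bound(9000, 3, D))  # True: 9000 > 4500, push back to client
```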
SLIDE 33

Evaluation

GAINS AND COSTS · RW-SET EFFECT · APPLICATION MIX

SLIDE 34

Experimental Setup

  • One server and four clients
  • CPU - Ten-core Intel E5-2640v4 at 2.4 GHz
  • RAM - 64 GB memory (4x 16 GB DDR4-2400 DIMMs)
  • NIC - Mellanox CX-4, 25 Gbps Ethernet
  • 15 GB read-write set as 120M records, 30 B keys and 100 B values
SLIDE 35

Does ASFP improve server throughput?

3 data accesses per invocation

[Plot: throughput (millions of invocations/s) vs. invocation computation (cycles/invocation), Client-side vs. Server-side vs. Pushback; 33% gain annotated]

SLIDE 36

What is the cost of using ASFP?

2 data accesses per invocation

[Plot: throughput (millions of invocations/s) vs. invocation computation (cycles/invocation); Pushback is 15% lower than Client-side]

SLIDE 37

What is the cost of using ASFP?

Aggressive overload detection

[Plot: throughput (millions of invocations/s) vs. invocation computation (cycles/invocation); Pushback is 3% lower than Client-side]

SLIDE 38

How do ASFP and OCC interact?

[Plot: throughput (millions of invocations/s) vs. invocation computation (cycles/invocation), Client-side, Server-side, Pushback, and Pushback-wo-rwset; Pushback-wo-rwset is 33% lower than Pushback]

SLIDE 39

Does ASFP improve throughput for an Application Mix?

Solid: Run Server-side; Hashed: Run Client-side

[Bar chart: Data Bound, Compute Bound, and Compute Bound workloads]

SLIDE 40

Does ASFP improve throughput for an Application Mix?

More room on the server to respond to more get()/put()s.

[Bar chart: Data Bound, Compute Bound, and Compute Bound workloads]

SLIDE 41

Does ASFP improve throughput for an Application Mix?

More room on the server to respond to more get()/put()s.

[Bar chart: Data Bound, Compute Bound, and Compute Bound workloads; annotated gains of 160%, 33%, 77%, and 65%]

SLIDE 42

Does ASFP improve throughput for an Application Mix?

TAO ↑ by avoiding data movement; Pushback makes room for TAO

[Bar chart: Data Bound, Compute Bound, and Compute Bound workloads; annotated 36%, 4%, 10%, and 0.5%]

  • 80% higher than Server-side
  • 10% higher than Client-side

SLIDE 43

Related Work

  • Stored Procedures, UDFs
  • SQL – poor fit for specialized computation
  • Redis – extensions provided at server start time
  • Splinter – we build on top of it
  • Offloading and code migration in mobile and edge computing
  • MAUI – different timescales and use-cases
  • Thread and process migration
  • Sprite, Condor – slow and unsuitable for µs scale
SLIDE 44

Conclusion

  • Kernel-bypass key-value stores offer < 10µs latency, Mops throughput
  • Fast because they are just dumb
  • Inefficient – data movement, client stalls
  • Run application logic on the server?
  • Storage server can become a bottleneck; effects propagate back to clients
  • Adaptively place the invocations to avoid bottlenecks
  • Up to 42% gain for low-compute invocations (vs client-side)
  • Comparable performance for high-compute invocations (vs client-side)