SLIDE 1 Adaptive Placement for In-memory Storage Functions
Ankit Bhardwaj, Chinmay Kulkarni, and Ryan Stutsman University of Utah
Utah Scalable Computer Systems Lab
SLIDE 2 Introduction
- Kernel-bypass key-value stores offer < 10µs latency, Mops throughput
- Fast because they are just dumb
- Inefficient – Data movement, client stalls
- Run application logic on the server?
- Storage server can become the bottleneck; the effects propagate back to clients
- Key ideas: Put application logic in decoupled functions
- Profile invocations & adaptively place to avoid bottlenecks
- Challenge: efficiently shifting compute at microsecond-timescales
SLIDE 3 Disaggregation Improves Utilization and Scaling
- Decouple compute & storage using the network
- Provision at idle capacity
- Scale independently
[Diagram: Compute and Storage tiers connected by the network]
SLIDE 4 Disaggregation Improves Utilization and Scaling
- Examples: FaRM, RAMCloud (<10µs latency, Mops throughput)
- Decouple compute & storage using the network
- Provision at idle capacity
- Scale independently
[Diagram: Compute and Storage tiers connected by the network]
SLIDE 5 But, Data Movement Has a Cost
- Massive data movement destroys efficiency
- So, push code to storage?
[Diagram: data movement between Compute and Storage]
SLIDE 6 Storage Function Requirements
- Microsecond-scale -> low invocation cost
- High-throughput, in-memory -> native code performance
- Amenable to multi-core processing
- Solution: Splinter allows loadable compiled extensions of storage functions
Splinter: Bare-Metal Extensions for Multi-Tenant Low-Latency Storage
SLIDE 7 Server-side Placement Can Improve Throughput
[Figure: throughput (millions of tree traversals/sec) vs. traversal depth (operations/invocation, 1 to 8); series: Client-side]
[Diagram: the client issues get()/put() requests to the server's hash table over the network; each request costs one RTT]
SLIDE 8 Server-side Placement Can Improve Throughput
[Figure: throughput (millions of tree traversals/sec) vs. traversal depth (operations/invocation, 1 to 8); series: Client-side, Server-side; annotation: 50%]
[Diagram: the client issues a single invoke() to the server's hash table over the network]
- Reduces (N-1) RPCs and RTTs
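The (N-1) saving can be illustrated with a minimal sketch that only counts round trips. Everything here is hypothetical and not Splinter's API: `CountingStore`, `traverse`, and the pointer chain are stand-ins, and the "network" is just a counter.

```python
class CountingStore:
    """In-memory table that counts how many round trips it served."""

    def __init__(self, table):
        self.table = dict(table)
        self.round_trips = 0

    def get(self, key):
        # Client-side placement: every get() is one RPC, so one round trip.
        self.round_trips += 1
        return self.table[key]

    def invoke(self, func, *args):
        # Server-side placement: one RPC carries the whole call; the
        # function then reads the table locally at no network cost.
        self.round_trips += 1
        return func(self.table.__getitem__, *args)


def traverse(get, root, depth):
    """Follow `depth` child pointers starting from `root`."""
    node = root
    for _ in range(depth):
        node = get(node)
    return node


# Hypothetical chain 0 -> 1 -> 2 -> ... standing in for a tree path.
chain = {i: i + 1 for i in range(8)}

client_side = CountingStore(chain)
client_result = traverse(client_side.get, 0, 4)

server_side = CountingStore(chain)
server_result = server_side.invoke(traverse, 0, 4)

saved = client_side.round_trips - server_side.round_trips
```

With a traversal depth of N = 4, the client-side run pays four round trips and the server-side run pays one, so `saved` is N-1.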
SLIDE 9 Server-side Placement Can Improve Throughput
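[Diagram: the client issues invoke() to the server's hash table over the network]
[Figure: bar chart comparing FaRM and Server-side; axis ticks at 200,000 and 400,000]
- Facebook TAO graph operations perform 2x better server-side than on the state-of-the-art system FaRM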
SLIDE 10 Server-side Placement Can Bottleneck the Server
- Server-side placement is good for data-intensive functions
- Compute-intensive functions make the server CPU the bottleneck
- An overloaded server stops responding even to get()/put() requests
- Overall system throughput drops
SLIDE 11 Server-side Placement Can Bottleneck the Server
[Figure: throughput (millions of invocations/s) vs. invocation computation (cycles/invocation, 1000 to 10000) at tree depth 2; series: Client-side, Server-side; annotations: "22% higher than Client-side", "55% lower than Client-side"]
SLIDE 12 What about Rebalancing and Load-Balancing?
- Workload change can happen in two ways
- Workload shifts in function call distribution over time
- Shifts in per-invocation costs
- Migrate data only when the workload is stable
- Move load to clients and use the server CPU for migration
[Figure: invocation computation, frequency, and load over time]
SLIDE 13 Key Insight: Decoupled Functions Can Run Anywhere
- Tenants write logically decoupled functions using a standard get/put interface
- Clients physically push and run functions server-side
- Or the clients could run the functions locally
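A minimal sketch of what "decoupled" could look like: the tenant function below is written only against `get`/`put`, so the same code runs against a local table or a remote proxy. `LocalStore`, `RemoteStore`, and `transfer` are illustrative names, not part of Splinter, and the remote side merely counts calls instead of sending RPCs.

```python
class LocalStore:
    """Server-side view: get/put touch the in-memory table directly."""

    def __init__(self, table):
        self.table = table

    def get(self, key):
        return self.table.get(key)

    def put(self, key, value):
        self.table[key] = value


class RemoteStore:
    """Client-side view: same interface, each call stands in for an RPC."""

    def __init__(self, server_table):
        self.server_table = server_table
        self.rpcs = 0

    def get(self, key):
        self.rpcs += 1
        return self.server_table.get(key)

    def put(self, key, value):
        self.rpcs += 1
        self.server_table[key] = value


def transfer(db, src, dst, amount):
    """Tenant function: uses only get/put and is unaware of placement."""
    balance = db.get(src)
    if balance is None or balance < amount:
        return False
    db.put(src, balance - amount)
    db.put(dst, (db.get(dst) or 0) + amount)
    return True


table = {"alice": 10, "bob": 0}

# Same function, two placements.
ran_on_server = transfer(LocalStore(table), "alice", "bob", 3)
remote = RemoteStore(table)
ran_on_client = transfer(remote, "alice", "bob", 3)
```

Because placement is hidden behind the interface, a runtime is free to pick either side per invocation without tenant involvement.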
SLIDE 14 Goal: The Best of Both Worlds
[Figure: throughput (millions of invocations/s) vs. invocation computation (cycles/invocation, 1000 to 10000); series: Client-side, Server-side, Ideal; regions: Data Intensive, Compute Intensive]
SLIDE 15 Adaptive Storage Function Placement (ASFP)
[Diagram: server-side storage function execution between Client and Server; steps shown: Get, Get, Compute, Validate]
SLIDE 16 Adaptive Storage Function Placement (ASFP)
- Running heavy compute at the client creates room for the remaining work
[Diagram: two Client/Server timelines side by side, server-side storage function execution and pushed-back storage function execution; steps shown: Get, Compute, Validate]
SLIDE 17 Adaptive Storage Function Placement (ASFP)
- Mechanisms
- Server-side: Run storage functions, suspend, push back to client
- Client-side: Runtime, transparent remote data access
- Consistency and concurrency control
- Policies
- Invocation Profiling & Cost Modeling
- Overload detection
SLIDE 18 Server-side Storage Function Execution
[State diagram: states Ready, Running, Offload, Committed/Aborted; transitions Invoke, Schedule, Yield, Get (Local), Validation, Pushback (on server overload), Result; legend: State Change, Request, Response]
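The Ready/Running/Pushback cycle can be sketched with Python generators standing in for suspendable storage functions. This is a toy under stated assumptions, not the Splinter scheduler: `Task`, `run_server`, and `sum_values` are hypothetical, overload is modeled crudely as a turn budget, and validation is skipped so finished tasks are simply marked committed.

```python
from collections import deque


def sum_values(keys):
    """Hypothetical storage function: yields each key it wants to read
    and receives the value back from the runtime."""
    total = 0
    for key in keys:
        total += yield key
    return total


class Task:
    def __init__(self, func, args):
        self.gen = func(*args)
        self.read_set = {}      # key -> value observed so far
        self.state = "Ready"
        self.result = None


def run_server(table, tasks, overloaded_after):
    """Round-robin scheduler: each turn serves one local get, then the
    task yields. After `overloaded_after` turns the server is treated as
    overloaded and pushes back whatever is still queued."""
    ready = deque(tasks)
    pending = {}                # task -> value to deliver on its next turn
    turns = 0
    while ready:
        task = ready.popleft()
        if turns >= overloaded_after:
            task.state = "Pushback"     # client would resume with read_set
            continue
        task.state = "Running"
        turns += 1
        try:
            key = task.gen.send(pending.pop(task, None))
        except StopIteration as done:
            # Validation is omitted in this sketch.
            task.state, task.result = "Committed", done.value
            continue
        task.read_set[key] = table[key]     # Get (Local)
        pending[task] = table[key]
        task.state = "Ready"                # Yield back to the scheduler
        ready.append(task)


table = {"a": 1, "b": 2, "c": 3}
short = Task(sum_values, (["a", "b"],))
long = Task(sum_values, (["a", "b", "c"],))
run_server(table, [short, long], overloaded_after=5)
```

Here `short` finishes within the turn budget, while `long` is suspended and leaves the server carrying the two records it already read.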
SLIDE 22 Consistency and Concurrency Control
- Problem: invoke() tasks run concurrently on every server core, and pushed-back invocations run in parallel with the server tasks
- Solution: Run invocations in strict serializable transactions
- Use optimistic concurrency control (OCC)
- Read/write set tracking is also used in pushback
- Pushed-back invocations never generate work for the server
- The server doesn't need to maintain any state for pushed-back invocations
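A minimal single-threaded sketch of OCC with read/write set tracking follows. It assumes per-record version numbers, which the slides do not specify; `VersionedTable` and `Transaction` are hypothetical names, and a multi-core server would additionally need to make validation and commit atomic.

```python
class VersionedTable:
    """key -> (version, value); the version grows on each committed write."""

    def __init__(self, items):
        self.rows = {k: (0, v) for k, v in items.items()}

    def read(self, key):
        return self.rows[key]

    def validate_and_commit(self, read_set, write_set):
        # Validation: abort if any record changed since it was read.
        for key, seen_version in read_set.items():
            if self.rows[key][0] != seen_version:
                return False
        # Commit: install buffered writes and bump their versions.
        for key, value in write_set.items():
            version = self.rows[key][0] + 1 if key in self.rows else 0
            self.rows[key] = (version, value)
        return True


class Transaction:
    """Tracks the read set (key -> version) and buffers writes locally."""

    def __init__(self, table):
        self.table = table
        self.read_set = {}
        self.write_set = {}

    def get(self, key):
        if key in self.write_set:
            return self.write_set[key]
        version, value = self.table.read(key)
        # Remember the first version seen; a later change fails validation.
        self.read_set.setdefault(key, version)
        return value

    def put(self, key, value):
        self.write_set[key] = value

    def commit(self):
        return self.table.validate_and_commit(self.read_set, self.write_set)


table = VersionedTable({"x": 1})
t1, t2 = Transaction(table), Transaction(table)
t1.put("x", t1.get("x") + 1)
t2.put("x", t2.get("x") + 10)
first = t1.commit()     # validates and installs x = 2
second = t2.commit()    # aborts: x changed after t2 read it
```

Since a transaction is fully described by its read and write sets, the same two sets are what a pushed-back invocation would need to carry to the client and back.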
SLIDE 23 Client-side Execution for Pushed-back Invocations
[State diagram: states Ready, Running, Awaiting Data, Awaiting Validation, Completed; transitions Create (on Pushback), Install RW Set, Schedule, Yield, Get (in local Read Set), Get (Remote), Validate, Validation Result, Committed/Aborted; legend: State Change, Request, Response]
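The client-side path can be sketched as follows: install the read set shipped with the pushback, serve hits locally, fetch misses remotely, and send one validation request at the end. `Server` and `PushedBackInvocation` are hypothetical, the version-number scheme is an assumption, and the requests are plain method calls rather than RPCs.

```python
class Server:
    def __init__(self, items):
        # key -> (version, value)
        self.rows = {k: (0, v) for k, v in items.items()}

    def get(self, key):
        """Stands in for the "Get (Remote)" request."""
        return self.rows[key]

    def validate(self, read_versions, writes):
        """Stands in for the "Validate" request; it carries everything the
        server needs, so no per-invocation state is kept server-side."""
        if any(self.rows[k][0] != v for k, v in read_versions.items()):
            return "Aborted"
        for key, value in writes.items():
            old_version = self.rows.get(key, (-1, None))[0]
            self.rows[key] = (old_version + 1, value)
        return "Committed"


class PushedBackInvocation:
    def __init__(self, server, read_set):
        self.server = server
        self.read_set = dict(read_set)  # Install RW set: key -> (version, value)
        self.writes = {}
        self.remote_gets = 0

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        if key not in self.read_set:            # miss: ask the server
            self.remote_gets += 1
            self.read_set[key] = self.server.get(key)
        return self.read_set[key][1]            # hit: served locally

    def put(self, key, value):
        self.writes[key] = value

    def finish(self):
        versions = {k: ver for k, (ver, _) in self.read_set.items()}
        return self.server.validate(versions, self.writes)


server = Server({"a": 1, "b": 2, "c": 3})
# Records the server had already read before pushing the invocation back.
shipped = {"a": server.get("a"), "b": server.get("b")}

inv = PushedBackInvocation(server, shipped)
total = inv.get("a") + inv.get("b") + inv.get("c")
inv.put("sum", total)
outcome = inv.finish()
```

Only the read of "c" crosses the network in this run; the two records shipped with the pushback are served from the local read set.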
SLIDE 30 Adaptive Storage Function Placement (ASFP)
- Mechanism
- Server-side: Storage functions, suspend, move back to client
- Client-side: Runtime, transparent remote data access
- Consistency and Concurrency Control
- Policy
- Server Overload Detection
- Invocation Profiling and Classification
SLIDE 31 Server Overload Detection
- Always run the invocations on the server if it is underloaded
- Guarantees
- Start pushback only when there are some old tasks and the server receives even more tasks
- Keep at least $t$ tasks even after pushback, to avoid server idleness
- Consider only invoke() tasks for overload detection
[Flowchart: PollRecvQueue, PacketToTask, decision #OldTasks > t, decision #NewTasks > t, Classify&Pushback (when both are yes), AddTasksToQueue, ExecuteTasks-RR]
Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads
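One pass of the dispatch loop might look like the sketch below. The task dictionaries, the `cycles` field, the `heavy` classifier, and the choice to push back old compute-bound invocations first are all assumptions made for illustration; only the two threshold checks and the "keep at least t tasks" rule come from the slide.

```python
def dispatch(queue, new_tasks, t, is_compute_bound):
    """Apply the overload check once, then enqueue the new arrivals.
    Returns the tasks pushed back to clients."""
    # Only invoke() tasks count toward overload detection.
    old_invokes = [task for task in queue if task["kind"] == "invoke"]
    new_invokes = [task for task in new_tasks if task["kind"] == "invoke"]
    pushed_back = []
    if len(old_invokes) > t and len(new_invokes) > t:
        # Never drop below t queued invocations, to avoid idling.
        budget = len(old_invokes) - t
        for task in old_invokes:
            if budget == 0:
                break
            if is_compute_bound(task):
                queue.remove(task)
                pushed_back.append(task)
                budget -= 1
    queue.extend(new_tasks)
    return pushed_back


def heavy(task):
    # Hypothetical classifier: treat long-running invocations as compute-bound.
    return task["cycles"] > 5000


queue = [
    {"id": 0, "kind": "invoke", "cycles": 9000},
    {"id": 1, "kind": "invoke", "cycles": 100},
    {"id": 2, "kind": "invoke", "cycles": 9000},
    {"id": 3, "kind": "invoke", "cycles": 9000},
    {"id": 4, "kind": "get", "cycles": 50},
]
arrivals = [{"id": 5 + i, "kind": "invoke", "cycles": 200} for i in range(3)]
pushed = dispatch(queue, arrivals, t=2, is_compute_bound=heavy)
```

With t = 2, four old and three new invocations trigger pushback, and two of the three heavy tasks leave the queue so that two old invocations remain.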
SLIDE 32 Invocation Profiling and Classification
- Profile each invocation for time spent in compute and data access
- Classify an invocation as compute-bound if it
- Spent more time in compute than in data access
- Crossed the threshold $c > nD$
- $c$ is the amount of compute done by the invocation
- $n$ is the total number of data accesses so far
- $D$ is the CPU cost to process one request
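The classification rule can be written as a small predicate. The slide does not say whether the two conditions are combined with "and" or "or"; this sketch assumes both must hold, and the cycle counts in the examples are made up.

```python
def is_compute_bound(compute_cycles, data_access_cycles,
                     num_accesses, cost_per_request):
    """c = compute_cycles, n = num_accesses, D = cost_per_request.
    Assumption: both conditions from the slide must hold."""
    more_compute_than_data = compute_cycles > data_access_cycles
    crossed_threshold = compute_cycles > num_accesses * cost_per_request
    return more_compute_than_data and crossed_threshold


# Hypothetical profiles with n = 2 accesses and D = 1500 cycles, so nD = 3000.
heavy = is_compute_bound(6000, 800, 2, 1500)   # 6000 > 800 and 6000 > 3000
light = is_compute_bound(1000, 800, 2, 1500)   # 1000 > 800 but 1000 <= 3000
```

One reading of the threshold is that pushback pays off only when the compute moved off the server exceeds what the server would spend handling the invocation's requests, which is roughly $nD$.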
SLIDE 33 Evaluation
- Gains and costs
- RW-set effect
- Application mix
SLIDE 34 Experimental Setup
- One server and four clients
- CPU - Ten-core Intel E5-2640v4 at 2.4 GHz
- RAM - 64GB Memory (4x 16 GB DDR4-2400 DIMMs)
- NIC - Mellanox CX-4, 25 Gbps Ethernet
- 15 GB read-write set of 120M records, each with a 30 B key and a 100 B value
SLIDE 35 Does ASFP improve server throughput?
- 3 data accesses per invocation
[Figure: throughput (millions of invocations/s) vs. invocation computation (cycles/invocation, 1000 to 10000); series: Client-side, Server-side, Pushback; annotation: 33%]
SLIDE 36 What is the cost of using ASFP?
- 2 data accesses per invocation
[Figure: throughput (millions of invocations/s) vs. invocation computation (cycles/invocation, 1000 to 10000); series: Client-side, Server-side, Pushback; annotation: Pushback 15% lower than Client-side]
SLIDE 37 What is the cost of using ASFP?
- Aggressive overload detection
[Figure: throughput (millions of invocations/s) vs. invocation computation (cycles/invocation, 1000 to 10000); series: Client-side, Server-side, Pushback; annotation: Pushback 3% lower than Client-side]
SLIDE 38 How do ASFP and OCC interact?
[Figure: throughput (millions of invocations/s) vs. invocation computation (cycles/invocation, 1000 to 10000); series: Client-side, Server-side, Pushback, Pushback-wo-rwset; annotation: Pushback-wo-rwset 33% lower than Pushback]
SLIDE 39 Does ASFP improve throughput for an Application Mix?
[Figure: bars for three workloads labeled Data Bound, Compute Bound, Compute Bound; solid: run server-side, hashed: run client-side]
SLIDE 40 Does ASFP improve throughput for an Application Mix?
- More room on the server to respond to more get/puts
[Figure: bars for three workloads labeled Data Bound, Compute Bound, Compute Bound]
SLIDE 41 Does ASFP improve throughput for an Application Mix?
- More room on the server to respond to more get/puts
[Figure: bars for three workloads labeled Data Bound, Compute Bound, Compute Bound; annotations: 160%, 33%, 77%, 65%]
SLIDE 42 Does ASFP improve throughput for an Application Mix?
- TAO throughput rises by avoiding data movement; pushback makes room for TAO
- 80% higher than Server-side
- 10% higher than Client-side
[Figure: bars for three workloads labeled Data Bound, Compute Bound, Compute Bound; annotations: 36%, 4%, 10%, 0.5%]
SLIDE 43 Related Work
- Stored procedures, UDFs
- SQL – Poor fit for specialized computation
- Redis – Extensions provided at server start time
- Splinter – This work builds on top of it
- Offloading and code migration in mobile and edge computing
- MAUI – Different timescales and use cases
- Thread and process migration
- Sprite, Condor – Slow and unsuitable for µs scale
SLIDE 44 Conclusion
- Kernel-bypass key-value stores offer < 10µs latency, Mops throughput
- Fast because they are just dumb
- Inefficient – Data movement, client stalls
- Run application logic on the server?
- Storage server can become the bottleneck; the effects propagate back to clients
- Adaptively place the invocations to avoid bottlenecks
- Up to 42% gain for low-compute invocations (vs. client-side)
- Comparable performance for high-compute invocations (vs. client-side)