User-level Threading: Have Your Cake and Eat It Too (Martin Karsten)

  1. User-level Threading: Have Your Cake and Eat It Too
     Martin Karsten and Saman Barghi
     David R. Cheriton School of Computer Science, University of Waterloo
     SIGMETRICS 2020, June 2020
     Sections: Problem Statement, Fred Runtime, Evaluation, Wrap Up

  5. Motivation
     application programming paradigms
       • network service handling concurrent sessions
     event-based programming
       • explicit state management
       • asynchronous control flow → callback hell
     thread-per-session programming
       • automatic state management
       • synchronous control flow
     ⇒ performance?
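
The contrast between the two paradigms can be illustrated with a small sketch. Everything in it is a hypothetical illustration and not taken from the talk: the length-prefixed protocol (a decimal length, a newline, then that many bytes) and the names `EventSession`, `read_message`, and `session_thread`. The event-based version has to store its parsing state between callbacks; the thread-per-session version keeps that state implicitly on the call stack.

```python
import io


class EventSession:
    """Event-based style: parsing state is stored explicitly between callbacks."""

    def __init__(self):
        self.buf = b""
        self.need = None       # None: waiting for a header; int: expected body length
        self.messages = []

    def on_data(self, chunk):
        # Called by an event loop whenever bytes arrive; a chunk may hold a partial message.
        self.buf += chunk
        while True:
            if self.need is None:
                head, sep, rest = self.buf.partition(b"\n")
                if not sep:
                    return     # header incomplete: return to the event loop
                self.need, self.buf = int(head), rest
            if len(self.buf) < self.need:
                return         # body incomplete: return to the event loop
            self.messages.append(self.buf[:self.need])
            self.buf, self.need = self.buf[self.need:], None


def read_message(stream):
    """Thread-per-session style: blocking reads, state is implicit in the call stack."""
    head = stream.readline()
    if not head:
        return None            # end of stream: the session is over
    return stream.read(int(head))


def session_thread(stream):
    # Body of one session; in a real server each session would run in its own thread.
    messages = []
    while (msg := read_message(stream)) is not None:
        messages.append(msg)
    return messages
```

In this sketch `io.BytesIO` can stand in for a connection; with a real socket, a buffered file object such as the one returned by `sock.makefile("rb")` would play the role of `stream`.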

  8. Background
     parallel hardware → threads & synchronization
     kernel thread caveats
       • limit: typically 10Ks
       • (some) execution overhead
       • complex scheduling for fairness & control
     ⇒ user-level threads!
       • key aspect: scheduling
       • requirement: user-level I/O blocking

  11. Take Away
      user-level threads
        • similar throughput to event-based programming
        • load balancing can sometimes reduce tail latency
      kernel threads not that bad either
        • up to a limit
      Fred Runtime rules!

  12. Table of Contents
      1 Problem Statement
      2 Fred Runtime
      3 Evaluation
      4 Wrap Up

  14. Problem Statement
      minimum overhead of user-level threading?
      roadmap
        • build minimum viable user-level threading runtime
        • compare to state-of-the-art threading runtimes
        • evaluate production-grade application

  16. Approach
      [Diagram: Application on Event Handling vs. Application on Thread Runtime]
      Memcached: in-memory key/value store
        • minimum port to thread-per-session
        • fully preserved state machine
        • no structural benefits

  18. Scheduler
        • performance: simple and lightweight
        • scalability: local queueing
        • effectiveness: load sharing
        • efficiency: idle-sleep
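
A rough, single-threaded sketch of local queueing combined with load sharing follows. It is not the Fred scheduler: the names `Processor` and `pick_next`, the FIFO order, and the policy of probing other processors in ring order are all assumptions made for illustration, and a real runtime would need synchronized queues.

```python
from collections import deque


class Processor:
    """One scheduling unit with its own local ready queue (hypothetical model)."""

    def __init__(self, pid):
        self.pid = pid
        self.ready = deque()


def pick_next(procs, me):
    """Return (task, source_pid); (None, None) means the processor found no work.

    Step 0 checks the local queue, so the common case touches only local state.
    Later steps walk the other processors in ring order and take work from the
    first non-empty queue (load sharing).
    """
    n = len(procs)
    for step in range(n):
        victim = procs[(me + step) % n]
        if victim.ready:
            return victim.ready.popleft(), victim.pid
    return None, None   # no work anywhere: a real runtime would let this processor sleep
```

Returning `(None, None)` marks the point where an idle-sleep mechanism would take over instead of busy-looping.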

  19. Inverse Shared Ready Stack
      [Diagram labels: Processor 1, Processor 2, Processor 3; Ready-Queue 1, Ready-Queue 2, Ready-Queue 3; processor ring (for stealing); Staging-Queue; benaphore counter with P() and V(); fred; waiting processors on the "processor ready-stack"]
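
The diagram names a benaphore with P() and V() operations. As a general illustration of that primitive (not the Fred implementation), a benaphore puts a counter in front of a semaphore so that the semaphore is touched only when a caller actually has to wait or be woken. Python has no atomic fetch-and-add, so the sketch below uses a small lock as a stand-in for it; the class layout and the `count` property are assumptions made for illustration.

```python
import threading


class Benaphore:
    """Counting semaphore with a fast path: a counter in front of a real semaphore."""

    def __init__(self, initial=0):
        self._count = initial
        self._guard = threading.Lock()       # stand-in for an atomic fetch-and-add
        self._sem = threading.Semaphore(0)   # slow path, used only when someone waits

    def _fetch_add(self, delta):
        # Add delta to the counter and return the value it had before.
        with self._guard:
            old = self._count
            self._count += delta
            return old

    def P(self):
        # Previous value below 1 means nothing was available: block on the semaphore.
        if self._fetch_add(-1) < 1:
            self._sem.acquire()

    def V(self):
        # Previous value below 0 means at least one caller of P() needs a wake-up.
        if self._fetch_add(+1) < 0:
            self._sem.release()

    @property
    def count(self):
        with self._guard:
            return self._count
```

A negative counter value equals the number of callers that are waiting or about to wait, which is what lets V() decide whether a wake-up is needed at all.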

  20. I/O Blocking
      automatically suspend thread during I/O wait
      essential for synchronous control flow
      suspend/resume user-level thread
        • user-level synchronization primitives
        • OS-level notifications

  21. I/O Notifications
      [Diagram labels: poller; event loop; OS query; epoll/kqueue interest set; input; output; freds; I/O Synchronization Vector (indexed by FD)]
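
The idea of suspending a user-level thread until the OS reports I/O readiness can be sketched in a few lines. This is a simplified model under stated assumptions, not the Fred design: Python generators stand in for freds, a dictionary keyed by file descriptor stands in for the I/O synchronization vector, the standard `selectors` module stands in for direct epoll/kqueue calls, and the scheduler and poller are merged into one loop. The names `run`, `client`, and `server` are hypothetical, and only one fred may wait on a given FD at a time.

```python
import selectors
import socket
from collections import deque


def run(freds):
    """Run generator-based 'freds'; a fred yields a socket to wait until it is readable."""
    sel = selectors.DefaultSelector()   # picks epoll or kqueue where available
    ready = deque(freds)
    waiting = {}                        # FD -> suspended fred
    while ready or waiting:
        while ready:
            fred = ready.popleft()
            try:
                sock = next(fred)       # fred runs until it needs input
            except StopIteration:
                continue                # fred finished
            waiting[sock.fileno()] = fred
            sel.register(sock, selectors.EVENT_READ)   # add to the interest set
        if waiting:
            for key, _events in sel.select():          # OS query, blocks until readiness
                sel.unregister(key.fileobj)
                ready.append(waiting.pop(key.fd))      # resume the blocked fred
    sel.close()


def client(sock, log):
    sock.sendall(b"ping")
    yield sock                          # reads like a blocking wait for the reply
    log.append(sock.recv(16))


def server(sock, log):
    yield sock                          # suspended until the request arrives
    log.append(sock.recv(16))
    sock.sendall(b"pong")
```

Each fred body reads as straight-line, synchronous code; the waiting is hidden behind the `yield`, which is the role user-level I/O blocking plays in a threading runtime.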

  23. Threading Benchmarks
      comparison of 9 different threading runtimes
      performance & scalability problems
        • Arachne, Mordor, µC++
      efficiency problems
        • Arachne, Boost, Qthreads
        • busy-looping scheduler
      solid results
        • Fred, Libfiber, Pthreads
        • Go: higher constant scheduling overhead

  24. Performance
      [Plot: throughput (x10^7, 32 cores) vs. duration of each work unit (us). Series: Libfiber, Qthreads, Fred, Pthread, Go, Boost, Arachne, Mordor, µC++]

  25. Efficiency
      [Plot: cost of iteration (us) vs. core count. Series: Libfiber, Pthread, Arachne, Qthreads, Go, Mordor, Fred, Boost, µC++]

  26. I/O Benchmarks
      I/O stress test for Fred, Go, Libfiber, Pthread
      compared to best-in-class event-based server
        • Libfiber breaks
        • Go and Pthread limited
        • only Fred competitive

  27. I/O Scalability
      [Plot: request throughput (x1000/sec) vs. cores. Series: ULib, Fred (8 poller freds), Pthread, Go, µC++]

  29. Application Benchmarks
      only Fred competitive with original Memcached
      tail latency results from Arachne paper
        • only apply to special case: #RX queues < #cores
        • performance of Pthread for low connection count!

  30. Throughput
      [Plot: query throughput (x1000/sec) vs. cores. Series: Fred, Vanilla, Pthread, Arachne, Fred (shared RQ)]

  31. Throughput - more connections
      [Plot: query throughput (x1000/sec) vs. cores. Series: Fred, Vanilla, Pthread, Fred (shared RQ), Arachne]

  32. Tail Latency: Arachne Results
      [Plot: read latency (us), 99th percentile, vs. query throughput (x1000). Series: Vanilla (pin/rfs), Fred (pin), Arachne, Pthread (rfs)]

  33. Tail Latency: Explanation
      original experiment: 8 RX queues for 12 cores
      head-of-line blocking?
      modified setup: 16 RX queues for 12 cores
      tail latency discrepancies largely gone...

  34. Tail Latency: Regular
      [Plot: read latency (us), 99th percentile, vs. query throughput (x1000). Series: Vanilla (pin), Fred (pin), Arachne, Pthread]

  35. Tail Latency: Higher Connection Count
      1,536 → 7,680 connections
      [Plot: read latency (us), 99th percentile, vs. query throughput (x1000). Series: Vanilla (pin), Fred (pin), Arachne, Pthread]

  37. Wrap Up
        • Fred: nimble user-level threading runtime
        • comprehensive performance evaluation
        • user-level threading possible at low overhead
        • scenarios with improved performance?
        • Fred currently the best reference platform
