Problem Statement Fred Runtime Evaluation Wrap Up

User-level Threading: Have Your Cake and Eat It Too

Martin Karsten and Saman Barghi

David R. Cheriton School of Computer Science University of Waterloo

June 2020

SIGMETRICS 2020 1/27

Motivation

application programming paradigms

  • network service handling concurrent sessions

event-based programming

  • explicit state management
  • asynchronous control flow → callback hell

thread-per-session programming

  • automatic state management
  • synchronous control flow

⇒ performance?
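
The contrast between the two paradigms can be illustrated with a small sketch. It is not taken from the talk: the toy protocol (a key line followed by a value line), the names `EventSession` and `thread_session`, and the in-memory queue standing in for a network connection are all assumptions made for illustration.

```python
import queue
import threading

# Event-based style: session progress lives in explicit fields, and the
# control flow is split across callback invocations.
class EventSession:
    def __init__(self, reply):
        self.state = "WAIT_KEY"   # explicit state management
        self.key = None
        self.reply = reply        # callback used to send the response

    def on_line(self, line):
        if self.state == "WAIT_KEY":
            self.key = line
            self.state = "WAIT_VALUE"
        elif self.state == "WAIT_VALUE":
            self.reply(f"{self.key}={line}")
            self.state = "WAIT_KEY"

# Thread-per-session style: the same protocol as straight-line code.
# State is held in local variables; recv() blocks only this thread.
def thread_session(recv, reply):
    while True:
        key = recv()
        if key is None:           # None marks end of session in this sketch
            return
        value = recv()
        reply(f"{key}={value}")

def run_event(lines):
    out = []
    session = EventSession(out.append)
    for line in lines:            # stands in for an event loop dispatching input
        session.on_line(line)
    return out

def run_threaded(lines):
    out = []
    inbox = queue.Queue()         # stands in for a blocking socket
    worker = threading.Thread(target=thread_session,
                              args=(inbox.get, out.append))
    worker.start()
    for line in lines:
        inbox.put(line)
    inbox.put(None)
    worker.join()
    return out
```

Both variants produce the same replies for the same input; the difference is where the session state lives and how the control flow reads.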

Background

parallel hardware → threads & synchronization

kernel thread caveats

  • limit: typically 10Ks
  • (some) execution overhead
  • complex scheduling for fairness & control

⇒ user-level threads!

  • key aspect: scheduling
  • requirement: user-level I/O blocking

Take Away

user-level threads

  • similar throughput to event-based programming
  • load balancing can sometimes reduce tail latency

kernel threads not that bad either

  • up to a limit

Fred Runtime rules!

Table of Contents

1. Problem Statement
2. Fred Runtime
3. Evaluation
4. Wrap Up

Problem Statement

minimum overhead of user-level threading?

roadmap

  • build minimum viable user-level threading runtime
  • compare to state-of-the-art threading runtimes
  • evaluate production-grade application

Approach

[Figure: Application + Event Handling vs. Application + Thread Runtime]

Memcached - in-memory key/value store

  • minimum port to thread-per-session
  • fully preserved state machine
  • no structural benefits

Scheduler

  • performance: simple and lightweight
  • scalability: local queueing
  • effectiveness: load sharing
  • efficiency: idle-sleep
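
Local queueing combined with load sharing can be sketched as follows. This is a generic, single-threaded model and not Fred's scheduler: it assumes that load sharing is done by stealing along a ring of processors, and the names `Processor` and `next_task`, as well as the choice of which queue end to steal from, are hypothetical. A real runtime would also need synchronization on each queue.

```python
from collections import deque

class Processor:
    def __init__(self, ident):
        self.ident = ident
        self.ready = deque()      # local ready queue (assumed FIFO)

def next_task(processors, me):
    """Return (task, source_id): local work first, else steal along the ring."""
    n = len(processors)
    own = processors[me].ready
    if own:
        return own.popleft(), me
    # Walk the ring starting at the next neighbour and steal from the
    # first processor that has work (taking from the opposite end).
    for step in range(1, n):
        victim = processors[(me + step) % n]
        if victim.ready:
            return victim.ready.pop(), victim.ident
    return None, None             # nothing to run: candidate for idle-sleep
```

Keeping the common case local avoids contention on a shared queue, while the ring walk only runs when a processor would otherwise go idle.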

Inverse Shared Ready Stack

[Figure: Processors 1 to 3, each with a ready queue (Ready-Queue 1 to 3); a Staging-Queue; a fred counter benaphore with P() and V(); waiting processors on the "processor ready-stack"; processor ring (for stealing)]
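
The figure labels a benaphore with P() and V() operations. As a rough illustration of that general technique (a counter that handles the uncontended case, backed by a blocking primitive that is touched only when someone must wait), here is a minimal sketch. It is not Fred's code: the class name is hypothetical, and because Python has no atomic integers, a small lock stands in for an atomic fetch-and-add.

```python
import threading

class Benaphore:
    """Counting semaphore with a counter-based fast path.

    value >= 0: that many units are available.
    value <  0: -value callers are (or are about to be) blocked.
    """
    def __init__(self, initial=0):
        self._value = initial
        self._guard = threading.Lock()        # stand-in for an atomic counter
        self._slow = threading.Semaphore(0)   # used only on the slow path

    def P(self):
        with self._guard:
            self._value -= 1
            must_wait = self._value < 0
        if must_wait:
            self._slow.acquire()              # block until a matching V()

    def V(self):
        with self._guard:
            self._value += 1
            must_wake = self._value <= 0
        if must_wake:
            self._slow.release()              # hand one unit to a waiter

    def value(self):
        with self._guard:
            return self._value
```

In a scheduler setting, one plausible reading (an assumption here) is that V() is called when a fred becomes ready and P() when a processor looks for work, so a negative counter corresponds to processors that are idle and asleep.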

I/O Blocking

automatically suspend thread during I/O wait

essential for synchronous control flow

suspend/resume user-level thread

  • user-level synchronization primitives
  • OS-level notifications

I/O Notifications

[Figure labels: freds (input, output); poller (event loop, OS query); epoll/kqueue (interest set); I/O Synchronization Vector (indexed by FD)]
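
The interplay between blocked threads, a poller, and a per-FD synchronization table can be sketched as follows. This is an illustration under several assumptions, not the Fred implementation: Python kernel threads stand in for freds, a dict stands in for the FD-indexed vector, and `IOSyncVector`, `Poller`, `blocking_recv`, and `demo` are hypothetical names. `selectors.DefaultSelector` picks epoll or kqueue where the platform provides them.

```python
import selectors
import socket
import threading

class IOSyncVector:
    """Per-FD synchronization slots (a dict stands in for an FD-indexed array)."""
    def __init__(self):
        self._slots = {}
        self._lock = threading.Lock()

    def slot(self, fd):
        with self._lock:
            return self._slots.setdefault(fd, threading.Semaphore(0))

class Poller:
    def __init__(self, vector):
        self.vector = vector
        self.selector = selectors.DefaultSelector()
        self.lock = threading.Lock()
        self.running = True

    def await_readable(self, sock):
        """Called by a session thread: add to interest set, then sleep."""
        sem = self.vector.slot(sock.fileno())
        with self.lock:
            self.selector.register(sock, selectors.EVENT_READ)
        sem.acquire()

    def loop(self):
        """Poller thread: query the OS and wake the thread waiting on each FD."""
        while self.running:
            for key, _ in self.selector.select(timeout=0.05):
                with self.lock:
                    self.selector.unregister(key.fileobj)  # one-shot interest
                self.vector.slot(key.fd).release()

def blocking_recv(poller, sock, nbytes=4096):
    """Synchronous-looking receive on a non-blocking socket."""
    while True:
        try:
            return sock.recv(nbytes)
        except BlockingIOError:
            poller.await_readable(sock)   # suspend until readiness is reported

def demo():
    a, b = socket.socketpair()
    a.setblocking(False)
    poller = Poller(IOSyncVector())
    poll_thread = threading.Thread(target=poller.loop, daemon=True)
    poll_thread.start()
    result = []
    reader = threading.Thread(
        target=lambda: result.append(blocking_recv(poller, a)))
    reader.start()
    b.sendall(b"ping")
    reader.join(timeout=5)
    poller.running = False
    poll_thread.join(timeout=5)
    poller.selector.close()
    a.close()
    b.close()
    return result
```

The caller of `blocking_recv` sees ordinary synchronous control flow; only the calling thread is suspended while the poller waits for the OS notification.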

Threading Benchmarks

comparison of 9 different threading runtimes

performance & scalability problems

  • Arachne, Mordor, µC++

efficiency problems

  • Arachne, Boost, Qthreads
  • busy-looping scheduler

solid results

  • Fred, Libfiber, Pthreads
  • Go: higher constant scheduling overhead

Performance

[Figure: throughput (×10^7, 32 cores) vs. duration of each work unit (µs) for Libfiber, Qthreads, Fred, Pthread, Go, Boost, Arachne, Mordor, µC++]

Efficiency

[Figure: cost of iteration (µs) vs. core count for Libfiber, Qthreads, Fred, Pthread, Go, Boost, Arachne, Mordor, µC++]

I/O Benchmarks

I/O stress test for Fred, Go, Libfiber, Pthread

compared to best-in-class event-based server

  • Libfiber breaks
  • Go and Pthread limited
  • only Fred competitive

I/O Scalability

[Figure: request throughput (×1000/sec) vs. cores for ULib, Fred (8 poller freds), Pthread, Go, µC++]

Application Benchmarks

only Fred competitive with original Memcached

tail latency results from Arachne paper

  • only apply to special case: #RX queues < #cores
  • performance of Pthread for low connection count!

Throughput

[Figure: query throughput (×1000/sec) vs. cores for Fred, Vanilla, Pthread, Arachne, Fred (shared RQ)]

Throughput - more connections

[Figure: query throughput (×1000/sec) vs. cores for Fred, Vanilla, Pthread, Fred (shared RQ), Arachne]

Tail Latency: Arachne Results

[Figure: 99th-percentile read latency (µs) vs. query throughput (×1000) for Vanilla (pin/rfs), Fred (pin), Arachne, Pthread (rfs)]

Tail Latency: Explanation

original experiment: 8 RX queues for 12 cores

head-of-line blocking?

modified setup: 16 RX queues for 12 cores

tail latency discrepancies largely gone...

Tail Latency: Regular

[Figure: 99th-percentile read latency (µs) vs. query throughput (×1000) for Vanilla (pin), Fred (pin), Arachne, Pthread]

Tail Latency: Higher Connection Count

1,536 → 7,680 connections

[Figure: 99th-percentile read latency (µs) vs. query throughput (×1000) for Vanilla (pin), Fred (pin), Arachne, Pthread]

Wrap Up

Fred: nimble user-level threading runtime

comprehensive performance evaluation

user-level threading possible at low overhead

scenarios with improved performance?

Fred currently the best reference platform
