SLIDE 1

Scaling Userspace @ Facebook

Ben Maurer bmaurer@fb.com

SLIDE 2

About Me

▪ At Facebook since 2010
▪ Co-founded reCAPTCHA
▪ Tech-lead of Web Foundation team
▪ Responsible for the overall performance & reliability of Facebook's user-facing products
  ▪ Proactive — Design
  ▪ Reactive — Outages

SLIDE 3

Facebook in 30 Seconds

[Architecture diagram: Load Balancer → Web Tier (HHVM) → Graph Cache (TAO) → Database (MySQL), alongside services such as Newsfeed, Ads, Messages, Ranking, Spam, Search, Payments, Trending, and Timeline]

SLIDE 4

Rapid Change

▪ Code released twice a day
▪ Rapid feature development — e.g. Lookback videos
  ▪ 450 Gbps of egress
  ▪ 720 million videos rendered (9 million / hour)
  ▪ 11 PB of storage
  ▪ Inception to production: 25 days

SLIDE 5

A Stable Environment

▪ All Facebook projects in a single source control repo
▪ Common infrastructure for all projects
  ▪ folly: base C++ library
  ▪ thrift: RPC
▪ Goals:
  ▪ Maximum performance
  ▪ “Bazooka proof”

SLIDE 6

Typical Server Application

[Diagram: a new connection is accept()ed by the acceptor thread (1x), handed to network threads (1 per core) that epoll_wait() and read(), then passed to worker threads (many)]

SLIDE 7

Acceptor Threads

Simple, right?

while (true) {
  epoll_wait();
  accept();
}
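For context, a self-contained sketch of what surrounds that loop in practice (illustrative only, not Facebook's code; the port and backlog are arbitrary choices, and error handling is trimmed):

/* Acceptor-thread sketch: one listening socket registered with epoll;
 * each wakeup accepts one connection. */
#include <netinet/in.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);            /* arbitrary example port */
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);                       /* 128 = example backlog */

    int epfd = epoll_create1(0);
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = lfd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);

    for (;;) {
        struct epoll_event out;
        epoll_wait(epfd, &out, 1, -1);      /* sleep until a connection is pending */
        int cfd = accept(lfd, NULL, NULL);
        if (cfd >= 0)
            close(cfd);                     /* real code hands cfd to a network thread */
    }
}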

SLIDE 8

accept() can be O(N)

▪ Problem: finding lowest available FD is O(open FDs)



__alloc_fd(...) {
  fd = find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd);
}

▪ Userspace solution: avoid connection churn
▪ Kernel solution: could use multi-level bitmap
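The scan exists because POSIX requires that each new descriptor be the lowest-numbered one available; a toy program makes the behavior visible:

/* POSIX requires open()/accept() to return the lowest free descriptor,
 * which is why the kernel scans the open-fd bitmap on every call. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int a = open("/dev/null", O_RDONLY);  /* typically fd 3 */
    int b = open("/dev/null", O_RDONLY);  /* fd 4 */
    close(a);                             /* free the hole at 3 */
    int c = open("/dev/null", O_RDONLY);  /* reuses 3, not 5 */
    printf("a=%d b=%d c=%d\n", a, b, c);
    return 0;
}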

SLIDE 9

EMFILE

▪ Problem: when accept() returns EMFILE it does not discard the pending request, so the listening socket stays readable in epoll_wait() and the loop spins
▪ Userspace solution: sleep for 10+ ms after seeing an EMFILE return code
▪ Kernel solution: don’t wake up epoll
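A hedged sketch of that userspace workaround (the 10 ms back-off is from the slide; the helper name and structure are illustrative):

/* Back off when the process is out of file descriptors; otherwise the
 * still-pending connection keeps waking epoll_wait() in a tight loop. */
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

int accept_with_backoff(int listen_fd) {
    for (;;) {
        int fd = accept(listen_fd, NULL, NULL);
        if (fd >= 0)
            return fd;
        if (errno == EMFILE || errno == ENFILE) {
            usleep(10 * 1000);   /* 10+ ms pause, per the slide */
            continue;
        }
        if (errno == EINTR)
            continue;
        return -1;               /* real error: let the caller handle it */
    }
}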

SLIDE 10

Listen Queue Overflow

[Diagram: a SYN enters the SYN backlog (SYN cookies absorb its overflow); the client's ACK moves the connection to the listen queue, which the app drains with accept(). When the listen queue is full, the ACK is dropped. DROP!?]

SLIDE 11

Listen Queue Overflow

▪ Userspace solution: tcp_abort_on_overflow sysctl
▪ Kernel solution: tuned overflow check
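The sysctl lives at net.ipv4.tcp_abort_on_overflow; for reference, it can be flipped from C by writing the procfs file (a one-line `sysctl -w net.ipv4.tcp_abort_on_overflow=1` does the same from a shell):

/* Enable tcp_abort_on_overflow: an overflowing listen queue then sends
 * RST to the client instead of silently dropping its ACK. Needs root. */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_abort_on_overflow", "w");
    if (!f) { perror("fopen"); return 1; }
    fputs("1\n", f);
    fclose(f);
    return 0;
}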

SLIDE 12

Network Monitoring with Retransmits

▪ Problem: you need to track down issues on your network
▪ Userspace solution:
  ▪ netstat -s | grep retransmited
  ▪ Distribution of request times (e.g. 200 ms = minimum RTO)
▪ Kernel solution:
  ▪ Tracepoint for retransmissions: IP/port info, aggregated centrally
  ▪ Could use better tuning for intra-datacenter TCP (200 ms = forever)
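The counter netstat reports can also be read programmatically; a sketch assuming the standard /proc/net/snmp layout, where a `Tcp:` header line names the fields and the following `Tcp:` line carries their values:

/* Read the TCP RetransSegs counter from /proc/net/snmp by walking the
 * name line and the value line in step. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

long read_retrans_segs(void) {
    FILE *f = fopen("/proc/net/snmp", "r");
    if (!f) return -1;
    char hdr[1024], val[1024];
    long result = -1;
    while (result < 0 && fgets(hdr, sizeof(hdr), f)) {
        if (strncmp(hdr, "Tcp:", 4) != 0 || !fgets(val, sizeof(val), f))
            continue;                        /* not the Tcp name/value pair */
        char *hs, *vs;
        char *h = strtok_r(hdr, " \n", &hs);
        char *v = strtok_r(val, " \n", &vs);
        while (h && v) {                     /* columns line up name-to-value */
            if (strcmp(h, "RetransSegs") == 0) { result = atol(v); break; }
            h = strtok_r(NULL, " \n", &hs);
            v = strtok_r(NULL, " \n", &vs);
        }
    }
    fclose(f);
    return result;
}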

SLIDE 13

Networking Threads

while (true) {
  epoll_wait();
  read();
  send_to_worker();
}

SLIDE 14

Suboptimal Scheduling

▪ Problem: allocating a connection to a specific thread causes delays if other work is happening on that thread

[Diagram: data arrives on a socket allocated to thread 1, which is busy with other work; idle thread 2 could have run the request sooner]

▪ Userspace solution: minimal work on networking threads
▪ Kernel solution: M:N epoll API?
SLIDE 15

Causes of Delay

▪ Holding locks
▪ Compression
▪ Deserialization

SLIDE 16

Disk IO

▪ Problem: writing even a single byte to a file can take 100+ ms
  ▪ Stable pages (fixed)
  ▪ Journal writes to update mtime
  ▪ Debug: perf record -afg -e cs
▪ Userspace solution: avoid write() calls in critical threads
▪ Kernel solution:
  ▪ O_NONBLOCK write() call
  ▪ Guaranteed async writes given buffer space
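One common shape for "avoid write() calls in critical threads" is a background logger fed through an in-memory queue; a simplified sketch (fixed-size slots, messages dropped on overflow, no shutdown path):

/* Critical threads enqueue log lines; one background thread does the
 * blocking write()s, so journal and stable-page stalls never hit the
 * latency-sensitive threads. */
#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define QCAP 1024            /* queue slots */
#define QMSG 256             /* max message length */

static char queue[QCAP][QMSG];
static int head, tail, count;
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

void log_async(const char *msg) {         /* called from critical threads */
    pthread_mutex_lock(&mu);
    if (count < QCAP) {                   /* never block: drop if full */
        strncpy(queue[tail], msg, QMSG - 1);
        queue[tail][QMSG - 1] = '\0';
        tail = (tail + 1) % QCAP;
        count++;
        pthread_cond_signal(&nonempty);
    }
    pthread_mutex_unlock(&mu);
}

void *logger_thread(void *arg) {          /* the only thread that write()s */
    int fd = *(int *)arg;
    for (;;) {
        char msg[QMSG];
        pthread_mutex_lock(&mu);
        while (count == 0)
            pthread_cond_wait(&nonempty, &mu);
        memcpy(msg, queue[head], QMSG);
        head = (head + 1) % QCAP;
        count--;
        pthread_mutex_unlock(&mu);
        write(fd, msg, strlen(msg));      /* may stall 100+ ms; harmless here */
    }
    return NULL;
}

Start it with pthread_create(&t, NULL, logger_thread, &log_fd); critical threads then call log_async() and never touch the disk themselves.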

SLIDE 17

Worker Threads

Avoid the dangers of doing work on the networking thread

while (true) {
  wait_for_work();
  do_work();
  send_result_to_network();
}

SLIDE 18

How to Get Work to Workers?

▪ Obvious solution: pthread_cond_t.

[Chart: time per item (0–14 μs) and context switches for pthread_cond_t as the number of worker threads scales from 50 to 6400]

3 context switches per item?!

SLIDE 19

Multiple Wakeups / Deque

pthread_cond_signal() {
  lock();
  ++futex;
  futex_wake(&futex, 1);
  unlock();
}

pthread_cond_wait() {
  do {
    int futex_val = cond->futex;
    unlock();
    futex_wait(&futex, futex_val);
    lock();
  } while (!my_turn_to_wake_up());
}

Potential context switches: the futex_wake() handoff, reacquiring the lock afterwards, and looping again when it was not this thread's turn to wake up.

SLIDE 20

LIFO vs FIFO

▪ pthread_cond_t is first in, first out
▪ New work is scheduled on the thread that has been idle longest
  ▪ Bad for the CPU cache
  ▪ Bad for the scheduler
  ▪ Bad for memory usage

SLIDE 21

LifoSem

▪ 13x faster. 12x fewer context switches

[Chart: time per item (μs) and context switches, LifoSem vs. pthread_cond_t, as the number of worker threads scales from 50 to 6400]
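folly's LifoSem is the production implementation; purely to illustrate the LIFO policy (not folly's lock-free design), a sketch where each idle worker parks on its own condition variable and post() wakes the most recently parked one:

/* LIFO wakeup sketch: idle workers push themselves onto a stack, so
 * post() wakes the most recently idle (cache-warm) worker first.
 * Assumes at most MAXW worker threads, each owning one struct waiter
 * whose cv is initialized with PTHREAD_COND_INITIALIZER. */
#include <pthread.h>

#define MAXW 64

struct waiter {
    pthread_cond_t cv;   /* private to one thread: a signal wakes exactly it */
    int signaled;
};

static struct waiter *stack[MAXW];  /* parked workers, most recent on top */
static int top;
static int value;                   /* semaphore count when nobody waits */
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;

void lifo_post(void) {
    pthread_mutex_lock(&mu);
    if (top > 0) {
        struct waiter *w = stack[--top];  /* LIFO: last thread to go idle */
        w->signaled = 1;
        pthread_cond_signal(&w->cv);
    } else {
        value++;                          /* no waiter: bank the post */
    }
    pthread_mutex_unlock(&mu);
}

void lifo_wait(struct waiter *self) {
    pthread_mutex_lock(&mu);
    if (value > 0) {
        value--;                          /* work already available */
    } else {
        self->signaled = 0;
        stack[top++] = self;
        while (!self->signaled)
            pthread_cond_wait(&self->cv, &mu);
    }
    pthread_mutex_unlock(&mu);
}

Because each condition variable has exactly one waiter, a signal targets one specific thread, which also sidesteps the multiple-wakeups-per-dequeue problem shown on the previous slides.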

SLIDE 22

Synchronization Performance

▪ pthread_cond_t is not the only slow synchronization method
▪ pthread_mutex_t: can cause contention in the futex spinlock
  http://lwn.net/Articles/606051/
▪ pthread_rwlock_t: uses a mutex. Consider RWSpinLock in folly

SLIDE 23

Over-scheduling

▪ Problem: servers are bad at regulating work under load
▪ Example: ranking feed stories
  ▪ Too few threads: extra latency
  ▪ Too many threads: ranking causes delay in other critical tasks

SLIDE 24

Over-scheduling

▪ Userspace solution:
  ▪ More work != better. Use discipline
  ▪ TASKSTATS_CMD_GET measures CPU delay (getdelays.c)
▪ Kernel solution: only dequeue work when the runqueue is not overloaded
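getdelays.c in the kernel tree shows the full taskstats/netlink dance; a lighter way to watch the same run-queue delay, assuming schedstats are enabled and the documented three-field /proc/<pid>/schedstat format:

/* /proc/self/schedstat holds three fields: time on CPU (ns), time
 * waiting on the runqueue (ns), and timeslice count. The middle field
 * is the CPU delay that reveals over-scheduling. */
#include <stdio.h>

int main(void) {
    unsigned long long on_cpu, waiting, slices;
    FILE *f = fopen("/proc/self/schedstat", "r");
    if (!f) { perror("fopen"); return 1; }
    if (fscanf(f, "%llu %llu %llu", &on_cpu, &waiting, &slices) != 3) {
        fclose(f);
        return 1;
    }
    fclose(f);
    printf("ran %llu ns, waited %llu ns over %llu timeslices\n",
           on_cpu, waiting, slices);
    return 0;
}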

SLIDE 25

NUMA Locality

▪ Problem: cross-node memory access is slow
▪ Userspace solution:
  ▪ 1 thread pool per node
  ▪ Teach malloc about NUMA
  ▪ Need care to balance memory. Hack: numactl --interleave=all cat <binary>
  ▪ Substantial win: 3% HHVM performance improvement
▪ Kernel solution: better integration of scheduling + malloc
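A sketch of the one-thread-pool-per-node idea using libnuma (link with -lnuma; the pool wiring itself is omitted, and the node number and size are example values):

/* Per-node pool setup: pin the calling thread to one NUMA node and
 * hand it node-local memory, so its pool never pays remote-access
 * latency. */
#include <numa.h>
#include <stdio.h>

void *setup_node_local(int node, size_t bytes) {
    if (numa_available() < 0)
        return NULL;                          /* kernel lacks NUMA support */
    numa_run_on_node(node);                   /* restrict this thread to `node` */
    return numa_alloc_onnode(bytes, node);    /* memory on the same node */
}

int main(void) {
    void *buf = setup_node_local(0, 1 << 20); /* node 0, 1 MB */
    printf("node-local buffer: %p\n", buf);
    if (buf)
        numa_free(buf, 1 << 20);
    return 0;
}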

SLIDE 26

Huge Pages

▪ Problem: TLB misses are expensive
▪ Userspace solution:
  ▪ mmap your executable with huge pages
  ▪ PGO using perf + a linker script
  ▪ Combination of huge pages + PGO: over a 10% win for HHVM
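Remapping a running binary's text onto huge pages takes real machinery; to show just the kernel interface involved, a sketch that requests transparent huge pages for an anonymous hot region (MADV_HUGEPAGE is only a hint and assumes THP is enabled; MAP_HUGETLB is the explicit hugetlbfs alternative):

/* Request transparent huge pages for a hot region to cut TLB misses. */
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 4UL << 20;                    /* 4 MB example region */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        perror("madvise");                     /* e.g. THP disabled */
    munmap(p, len);
    return 0;
}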

SLIDE 27

malloc()

▪ GLIBC malloc: does not perform well for large server applications
▪ Slow rate of development: 50 commits in the last year

SLIDE 28

Keeping Malloc Up with the Times

▪ Huge pages
▪ NUMA
▪ Increasing # of threads, CPUs

SLIDE 29

jemalloc

▪ Areas where we have been tuning:
  ▪ Releasing per-thread caches for idle threads
  ▪ Incorporating a sense of wall clock time
  ▪ MADV_FREE usage
  ▪ Better tuning of per-thread caches
▪ 5%+ wins seen from malloc improvements
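On the MADV_FREE point, a sketch of the pattern an allocator can use when returning cached pages to the OS; MADV_FREE needs a sufficiently new kernel, so MADV_DONTNEED is the portable fallback:

/* Return a cached run to the OS lazily: MADV_FREE keeps the mapping but
 * lets the kernel reclaim the pages under memory pressure, so an
 * immediate reuse of the run costs nothing. */
#include <stddef.h>
#include <sys/mman.h>

void release_run(void *addr, size_t len) {
#ifdef MADV_FREE
    if (madvise(addr, len, MADV_FREE) == 0)
        return;                              /* reclaimable, not yet gone */
#endif
    madvise(addr, len, MADV_DONTNEED);       /* eager: next touch faults in zeros */
}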

SLIDE 30

Takeaways

▪ Details matter: Understand the inner workings of your systems
▪ Common libraries are critical: People get caught by the same traps
▪ Some problems are best solved in the kernel

SLIDE 31

Questions
