Scaling Slack
Bing Wei, Infrastructure@Slack


SLIDE 1

Scaling Slack

Bing Wei Infrastructure@Slack

SLIDE 2

SLIDE 3

SLIDE 4

Our Mission: To make people’s working lives simpler, more pleasant, and more productive.

SLIDE 5

From supporting small teams
To serving gigantic organizations of hundreds of thousands of users

SLIDE 6

Slack Scale

◈ 6M+ DAU, 9M+ WAU
○ 5M+ peak simultaneously connected
◈ Avg 10+ hrs/weekday connected
○ Avg 2+ hrs/weekday in active use
◈ 55% of DAU outside of the US

SLIDE 7

Cartoon Architecture

◈ WebApp (PHP/Hack)
◈ Sharded MySQL
◈ Messaging Server (Java)
◈ Job Queue (Redis/Kafka)
◈ Clients connect over HTTP and WebSocket

SLIDE 8

Outline

◈ The slowness problem
◈ Incremental Improvements
◈ Architecture changes
○ Flannel
○ Client Pub/Sub
○ Messaging Server

SLIDE 9

Challenge:

Slowness Connecting to Slack

SLIDE 10

Login Flow in 2015

User ↔ WebApp (backed by MySQL)

1. HTTP POST with user’s token
2. HTTP Response: a snapshot of the team & WebSocket URL

SLIDE 11

Some examples:

users / channels    response size
30 / 10             200K
500 / 200           2.5M
3K / 7K             20M
30K / 1K            60M

SLIDE 12

Login flow in 2015

User ↔ WebApp (backed by MySQL) ↔ Messaging Server

1. HTTP POST with user’s token
2. HTTP Response: a snapshot of the team & WebSocket URL
3. WebSocket: real-time events
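The three-step login flow above can be sketched in a few lines. The handler name and payload shape are illustrative (loosely modeled on an rtm.start-style call), not Slack’s actual API.

```python
# A minimal sketch of the 2015-era login flow described above.
# All names and payload shapes are invented for illustration.

def rtm_start(token, team_db):
    """Steps 1-2: HTTP POST with the user's token; respond with a full
    team snapshot plus a WebSocket URL for real-time events."""
    team = team_db[token]
    return {
        "users": team["users"],        # every user on the team
        "channels": team["channels"],  # every channel on the team
        "ws_url": "wss://example.invalid/websocket/" + token,
    }

team_db = {"tok-1": {"users": ["alice", "bob"], "channels": ["#general"]}}
resp = rtm_start("tok-1", team_db)
# Step 3 would open resp["ws_url"] and stream real-time events.
```

Because the response carries the whole team, its size grows with users and channels, which is exactly the problem the next slide quantifies.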

SLIDE 13

Real-time Events on WebSocket

User

Messaging Server

WebSocket: 100+ types of events

e.g. chat messages, typing indicators, file uploads, file comments, thread replies, user presence changes, user profile changes, reactions, pins, stars, channel creations, app installations, etc.

SLIDE 14

Login Flow in 2015

◈ Client architecture
○ Download a snapshot of the entire team
○ Updates trickle in through the WebSocket
○ Eventually consistent snapshot of the whole team
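The "updates trickle in" step can be sketched as a reducer that applies each WebSocket event to the local snapshot, keeping it eventually consistent. The event shapes here are invented for illustration.

```python
# Sketch: after the snapshot download, each WebSocket event mutates the
# client's local copy so it converges on the server's state.

def apply_event(state, event):
    # Two illustrative event types; a real client handles 100+ (slide 13).
    if event["type"] == "user_change":
        state["users"][event["user"]["id"]] = event["user"]
    elif event["type"] == "channel_created":
        state["channels"][event["channel"]["id"]] = event["channel"]
    return state

state = {"users": {}, "channels": {}}
apply_event(state, {"type": "user_change",
                    "user": {"id": "U1", "name": "alice"}})
apply_event(state, {"type": "channel_created",
                    "channel": {"id": "C1", "name": "general"}})
```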

SLIDE 15

Problems

Initial team snapshot takes time

SLIDE 16

Problems

Initial team snapshot takes time
Large client memory footprint

SLIDE 17

Problems

Initial team snapshot takes time
Large client memory footprint
Expensive reconnections

SLIDE 18

Problems

Initial team snapshot takes time
Large client memory footprint
Expensive reconnections
Reconnect storm

SLIDE 19

Outline

◈ The slowness problem
◈ Incremental Improvements
◈ Architecture changes
○ Flannel
○ Client Pub/Sub
○ Messaging Server

SLIDE 20

Improvements

◈ Smaller team snapshot

○ Client local storage + delta
○ Remove objects out + in parallel loading
○ Simplified objects: e.g. canonical avatars
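"Client local storage + delta" can be sketched as a server that returns only the objects modified after the client’s cached timestamp. The field names are assumptions, not Slack’s wire format.

```python
# Sketch of delta sync: the client keeps its last snapshot locally and
# asks only for what changed since then, shrinking the login payload.

def delta_since(objects, last_cache_ts):
    # Return only objects updated after the client's cached timestamp.
    return [o for o in objects if o["updated"] > last_cache_ts]

objects = [
    {"id": "U1", "updated": 100},
    {"id": "U2", "updated": 250},
    {"id": "U3", "updated": 300},
]
delta = delta_since(objects, last_cache_ts=200)  # only U2 and U3 changed
```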

SLIDE 21

Improvements

◈ Incremental boot
○ Load one channel first

SLIDE 22

Improvements

◈ Rate Limit
◈ POPs
◈ Load Testing Framework
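The rate-limit bullet could be implemented with a standard token bucket; the talk does not say which algorithm Slack uses, so this is a generic sketch.

```python
# Generic token-bucket rate limiter: each client earns `rate` tokens per
# second up to `burst`; a request is allowed only if a token is available.

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at burst.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1, burst=2)
results = [bucket.allow(now=0.0), bucket.allow(now=0.0),
           bucket.allow(now=0.0), bucket.allow(now=1.0)]
# Two requests pass on the initial burst, the third is rejected,
# and one more is allowed after a second of refill.
```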

SLIDE 23

Support New Product Features

Product Launch

SLIDE 24

Cope with New Product Features

Product Launch

SLIDE 25

Still...

Limitations

◈ What if team sizes keep growing?
◈ Outages when clients dump their local storage

SLIDE 26

Outline

◈ The slowness problem
◈ Incremental Improvements
◈ Architecture changes
○ Flannel
○ Client Pub/Sub
○ Messaging Server

SLIDE 27

Client Lazy Loading

◈ Download less data upfront
◈ Fetch more on demand

SLIDE 28

Flannel: Edge Cache Service

A query engine backed by cache, in edge locations
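"A query engine backed by cache" can be sketched as an edge node that answers queries from local state and falls back to the main region on a miss. The `origin_fetch` callback and the key scheme are invented for illustration.

```python
# Sketch of an edge cache in front of the main region: serve queries
# locally, fetch from the origin only on a miss.

class EdgeCache:
    def __init__(self, origin_fetch):
        self.cache = {}
        self.origin_fetch = origin_fetch  # stands in for a cross-region call
        self.misses = 0

    def query(self, key):
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.origin_fetch(key)
        return self.cache[key]

origin = {"user:U1": {"name": "alice"}}
edge = EdgeCache(origin.get)
first = edge.query("user:U1")   # miss: fetched from the main region
second = edge.query("user:U1")  # hit: served from the edge
```

Keeping this cache near users is what makes the on-demand queries of lazy loading fast enough to replace the upfront snapshot.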

SLIDE 29

What is in Flannel’s cache

◈ Support big objects first
○ Users
○ Channel membership
○ Channels

SLIDE 30

Login and Message Flow with Flannel

User ↔ Flannel ↔ WebApp / Messaging Server / MySQL

1. WebSocket connection
2. HTTP POST: download a snapshot of the team
3. WebSocket: stream JSON events

SLIDE 31

A Man in the Middle

User ↔ Flannel ↔ Messaging Server (WebSocket on both sides)

Flannel uses real-time events to update its cache,
e.g. user creation, user profile change, channel creation, user joins a channel, channel converted to private
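The man-in-the-middle role can be sketched as a relay that forwards every event downstream while applying the cache-relevant ones to its own state. The event-type names are illustrative, not Slack’s actual taxonomy.

```python
# Sketch: Flannel sits between the Messaging Server and the client,
# updating its cache from the same event stream it relays.

CACHE_EVENTS = {"user_change", "channel_created", "member_joined_channel"}

def relay(event, cache, client_queue):
    if event["type"] in CACHE_EVENTS:
        cache[event["key"]] = event["payload"]  # keep the edge cache fresh
    client_queue.append(event)                  # always forward downstream

cache, client_queue = {}, []
relay({"type": "user_change", "key": "U1",
       "payload": {"name": "alice"}}, cache, client_queue)
relay({"type": "reaction_added", "key": "M9",
       "payload": {"emoji": "+1"}}, cache, client_queue)
# Both events reach the client; only the first touches the cache.
```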

SLIDE 32

Edge Locations

Mix of AWS & Google Cloud; main region: us-east-1

SLIDE 33

Examples Powered by Flannel

Quick Switcher

SLIDE 34

Examples Powered by Flannel

Mention Suggestion

SLIDE 35

Examples Powered by Flannel

Channel Header

SLIDE 36

Examples Powered by Flannel

Channel Sidebar

SLIDE 37

Examples Powered by Flannel

Team Directory

SLIDE 38

Flannel Results

◈ Launched Jan 2017
○ Loads 200K-user teams
◈ 5M+ simultaneous connections at peak
◈ 1M+ client queries/sec

SLIDE 39

Flannel Results

SLIDE 40

This is not the end of the story

SLIDE 41

Evolution of Flannel

SLIDE 42

Web Client Iterations

Flannel Just-In-Time Annotation

Right before a web client is about to access an object, Flannel pushes that object to the client.

SLIDE 43

A Closer Look

Why does Flannel sit on WebSocket?

SLIDE 44

Old Way of Cache Updates

Users ↔ Flannel ↔ Messaging Server

LOTS of duplicated JSON events

SLIDE 45

Publish/Subscribe (Pub/Sub) to Update Cache

Users ↔ Flannel ↔ Messaging Server

Pub/Sub Thrift events
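A minimal sketch of why pub/sub removes the duplication from the previous slide: each Flannel host subscribes once per topic and receives one copy of each event, regardless of how many users it serves. The broker below is a toy in-process stand-in for the real transport.

```python
# Toy topic broker: one subscription per topic, one delivery per event.

class Broker:
    def __init__(self):
        self.subs = {}  # topic -> list of callbacks

    def subscribe(self, topic, cb):
        self.subs.setdefault(topic, []).append(cb)

    def publish(self, topic, event):
        for cb in self.subs.get(topic, []):
            cb(event)

received = []
broker = Broker()
# One subscription per topic, not one event copy per user connection:
broker.subscribe("team:T1", received.append)
broker.publish("team:T1", {"type": "user_change"})
broker.publish("team:T1", {"type": "channel_created"})
```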

SLIDE 46

Pub/Sub Benefits

◈ Less Flannel CPU
◈ Simpler Flannel code
◈ Schema’d data
◈ Flexibility for cache management

SLIDE 47

Flexibility for Cache Management

Previously
◈ Load when the first user connects
◈ Unload when the last user disconnects

SLIDE 48

Flexibility for Cache Management

With Pub/Sub
◈ Isolate received events from user connections
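The "previously" policy on the prior slide amounts to reference-counting cache entries by user connections, as in this sketch; pub/sub lets Flannel drop that coupling and keep entries warm on its own schedule.

```python
# Sketch of the old, connection-coupled cache policy: load on the first
# user's connect, unload on the last user's disconnect.

class RefCountedCache:
    def __init__(self):
        self.refs = {}
        self.cache = {}

    def connect(self, team):
        if self.refs.get(team, 0) == 0:
            self.cache[team] = {"loaded": True}  # first user: load
        self.refs[team] = self.refs.get(team, 0) + 1

    def disconnect(self, team):
        self.refs[team] -= 1
        if self.refs[team] == 0:
            del self.cache[team]                 # last user: unload

c = RefCountedCache()
c.connect("T1"); c.connect("T1")
c.disconnect("T1")                 # one user still connected: stays loaded
still_loaded = "T1" in c.cache
c.disconnect("T1")                 # last user gone: unloaded
unloaded = "T1" not in c.cache
```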

SLIDE 49

Another Closer Look

◈ With Pub/Sub, does Flannel need to be on WebSocket path?

SLIDE 50

Next Step

Move Flannel out of the WebSocket path

SLIDE 51

Next Step

Move Flannel out of the WebSocket path

Why? Separation & flexibility

SLIDE 52

Evolution with Product Requirements

Grid for Big Enterprise

SLIDE 53

Team Affinity for Cache Efficiency

Before Grid

SLIDE 54

Team Affinity: Grid Aware

Now

SLIDE 55

Grid Awareness Improvements

Flannel Memory

Saves 22GB of memory per host, 1.1TB total

SLIDE 56

Grid Awareness Improvements

DB shard CPU idle: 25% -> 90%
P99 user connect latency: 40s -> 4s
(for our biggest customer)

SLIDE 57

Team Affinity: Grid Aware, Scatter & Gather

Future

SLIDE 58

Outline

◈ The slowness problem
◈ Incremental Improvements
◈ Architecture changes
○ Flannel
○ Client Pub/Sub
○ Messaging Server

SLIDE 59

Expand Pub/Sub to Client Side

Client-side Pub/Sub reduces the events clients have to handle

SLIDE 60

Presence Events

60% of all events

O(N²): a 1000-user team ⇒ 1000 × 1000 = 1M events

SLIDE 61

Presence Pub/Sub

Clients
◈ Track who is in the current view
◈ Subscribe/unsubscribe to the Messaging Server when the view changes
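The subscribe/unsubscribe step can be sketched as a diff between the old subscription set and the users now visible, so the client receives presence for its current view rather than the whole team. The message shapes are invented.

```python
# Sketch of client-side presence pub/sub: on a view change, emit the
# minimal subscribe/unsubscribe messages and return the new subscription set.

def view_changed(subscribed, new_view, send):
    to_sub = new_view - subscribed
    to_unsub = subscribed - new_view
    if to_sub:
        send({"type": "presence_sub", "ids": sorted(to_sub)})
    if to_unsub:
        send({"type": "presence_unsub", "ids": sorted(to_unsub)})
    return set(new_view)

sent = []
subs = view_changed(set(), {"U1", "U2"}, sent.append)   # open a channel
subs = view_changed(subs, {"U2", "U3"}, sent.append)    # switch channels
```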

SLIDE 62

Outline

◈ The slowness problem
◈ Incremental Improvements
◈ Architecture changes
○ Flannel
○ Client Pub/Sub
○ Messaging Server

SLIDE 63

What is Messaging Server

Messaging Server

SLIDE 64

A Message Router

Messaging Server

SLIDE 65

Events Routing and Fanout

WebApp/DB → Messaging Server → Users

1. Events happen on a team
2. Events fan out

SLIDE 66

Limitations

◈ Sharded by team ⇒ single point of failure

SLIDE 67

Limitations

◈ Sharded by team ⇒ single point of failure
◈ Shared channels ⇒ shared state among teams

SLIDE 68

SLIDE 69

Topic Sharding

Everything is a Topic

public/private channel, DM, group DM, user, team, grid
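Topic sharding can be sketched as hashing each topic key to a shard, so a shared channel lives on exactly one shard regardless of which teams participate. The hashing scheme here is a generic illustration, not Slack’s implementation.

```python
# Sketch: every topic (channel, DM, user, team, grid) hashes
# independently to a shard, instead of sharding everything by team.

import hashlib

def shard_for(topic, num_shards):
    digest = hashlib.sha256(topic.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# A channel shared between teams T1 and T2 maps to one shard, keyed by
# the channel topic itself rather than by either team:
s = shard_for("channel:C123", num_shards=16)
same = shard_for("channel:C123", num_shards=16)
```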

SLIDE 70

Topic Sharding

Natural fit for shared channels

SLIDE 71

Topic Sharding

Natural fit for shared channels
Reduces user-perceived failures

SLIDE 72

Other Improvements

Auto failure recovery

SLIDE 73

Other Improvements

Auto failure recovery
Publish/Subscribe

SLIDE 74

Other Improvements

Auto failure recovery
Publish/Subscribe
Fanout at the edge

SLIDE 75

Our Journey

Problem → Incremental Change → Architectural Change → Ongoing Evolution

SLIDE 76

More To Come: Journey Ahead

Get in touch: https://slack.com/jobs

SLIDE 77

Thanks!

Any questions?

@bingw11
