How Slack Works Keith Adams kma@slack-corp.com @keithmadams



slide-1
SLIDE 1

How Slack Works

Keith Adams

kma@slack-corp.com @keithmadams facebook.com/kma

slide-2
SLIDE 2

What is

Slack?

slide-3
SLIDE 3

What is

Slack?

Voice Calls! Platform! Something about Bots!!

slide-4
SLIDE 4

But first it was a Persistent Group Messaging Service

slide-5
SLIDE 5

In this talk

  • How Slack works today

➞ Application logic
➞ Persistence
➞ Real-time messaging
➞ Deferring work for later

  • Problems
  • What we’re doing about them
slide-6
SLIDE 6

Also in this talk

  • Flaws
  • Challenges
  • Mistakes
  • Dead-ends
  • Future directions
slide-7
SLIDE 7

Slack Scale

  • 4M DAU, 5.8M WAU

Peak simultaneous connected: 2.5M

  • > 2H / weekday for each active user

> 10H / weekday connected

  • Half of DAU outside US
slide-8
SLIDE 8

Slack House Style

  • Conservative technical taste

Most supporting technologies are >10 years old

  • Willing to write a little code

Choose low coupling, fitness-to-purpose over DRY

  • Minimalism

Choose something we already operate over something new and tailor-made
Shallow, transparent stack of abstractions

slide-9
SLIDE 9

Cartoon Architecture of Slack

[Diagram: WebApp, Message Server, Job Queue, MySQL]

slide-10
SLIDE 10

Case Study: Login and Receive Messages

slack.com POST /api/rtm.start?token=xoxo--&...

slide-11
SLIDE 11

Slack’s webapp codebase

  • PHP monolith of app logic

< 1M LoC

  • Scaled-out LAMP stack app

Memcache wrapped around sharded MySQL

  • Recently migrated to HHVM

Performance, hacklang

slide-12
SLIDE 12

World’s shortest PHP-at-Slack FAQ

  • Q: I hear/believe/have experienced PHP to be terrible.

A: It sort of is, but it also works well.

  • Q: I’m skeptical.

A: You’re in good company! Check out this blog post. But we should probably get on with the talk at hand ...

  • Q: Sounds good.

A: Right-o.

slide-13
SLIDE 13

Login and Receive Messages: the “mains”

[Diagram: slack.com, main0, main1]

SELECT db_shard FROM teams WHERE domain = %domain

slide-14
SLIDE 14

Login and Receive Messages: the shards

[Diagram: slack.com, main0, main1, Shard123 a, Shard123 b]

SELECT * FROM channels WHERE team_id = 711 ...
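
The two queries on these slides suggest a two-step lookup: ask the mains which shard owns a team, then read the team's data from that shard. The following is a minimal sketch of that flow, not Slack's implementation. It uses in-memory SQLite as a stand-in for MySQL, and the `SHARDS` registry, the `example` domain, the table columns beyond those named in the slides, and `channels_for_domain` are all assumptions made for illustration.

```python
import sqlite3

# Stand-in for the "mains": maps a team's domain to the shard that owns it.
mains = sqlite3.connect(":memory:")
mains.execute("CREATE TABLE teams (id INTEGER, domain TEXT, db_shard TEXT)")
mains.execute("INSERT INTO teams VALUES (711, 'example', 'shard123')")

# Stand-in for one team shard holding that team's channels.
shard123 = sqlite3.connect(":memory:")
shard123.execute("CREATE TABLE channels (id INTEGER, team_id INTEGER, name TEXT)")
shard123.executemany(
    "INSERT INTO channels VALUES (?, ?, ?)",
    [(1, 711, "general"), (2, 711, "random")],
)

SHARDS = {"shard123": shard123}  # hypothetical shard registry


def channels_for_domain(domain):
    """Resolve the shard on the mains, then read channels from that shard."""
    row = mains.execute(
        "SELECT id, db_shard FROM teams WHERE domain = ?", (domain,)
    ).fetchone()
    if row is None:
        raise KeyError(domain)
    team_id, shard_name = row
    rows = SHARDS[shard_name].execute(
        "SELECT name FROM channels WHERE team_id = ? ORDER BY id", (team_id,)
    ).fetchall()
    return [name for (name,) in rows]
```

Every request pays for the mains lookup before it can touch a shard, which is why the mains matter so much for availability.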

slide-15
SLIDE 15

MySQL Shards

  • Source of truth for most customer data

Teams, users, channels, messages, comments, emoji, ...

  • Replication across two DCs

Available for 1-DC failure

  • Sharded by team

For performance, fault isolation, and scalability

slide-16
SLIDE 16

Why MySQL?

  • Many, many thousands of server-years of working
  • The relational model is a good discipline
  • Experience
  • Tooling

Not because of ACID, though

slide-17
SLIDE 17

Master-Master Replication

www1 Shard123 a Shard123 b www17

slide-18
SLIDE 18

MMR Complications

  • Choosing A in CAP terms
  • Conflicts are possible

➞ Most resolved automatically
➞ Some manually, by operator action(!)

  • INSERT ON DUPLICATE KEY UPDATE …
  • Partitioning by team saves us

➞ Team writes cannot overlap
➞ Even teams use “left” head, odd teams use “right” head
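
The even/odd rule above can be sketched as a small routing function. This is an illustration under assumptions: the head names `a` and `b` are borrowed from the Shard123 diagram, mapping "left" to `a` is a guess, and the fallback to the partner head when the preferred one is unhealthy is an assumed behavior, not something the slides specify.

```python
HEADS = ("a", "b")  # assumed: "a" is the left head, "b" is the right head


def write_head(team_id, healthy=HEADS):
    """Pick the head a team's writes go to: even IDs left, odd IDs right."""
    preferred = HEADS[team_id % 2]
    if preferred in healthy:
        return preferred
    # Assumed fallback: use the partner head if the preferred one is down.
    partner = HEADS[(team_id + 1) % 2]
    if partner in healthy:
        return partner
    raise RuntimeError("no healthy head for team %d" % team_id)
```

Because all writes for one team normally land on the same head, two heads rarely receive conflicting writes for the same rows.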

slide-19
SLIDE 19

Case Study: Login and Receive Messages

slack.com

{ "ok": true, "url": "wss:\/\/ms9.slack-msgs.com\/websocket\/7I5yBpcvk", … }

slide-20
SLIDE 20

Rtm.start payload

  • Rtm.start returns an image of the whole team
  • Architecture of clients

➞ Eventually consistent snapshot of whole team
➞ Updates trickle in through the web socket

  • Guarantees responsive clients
  • ...once connection is established
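
The client model described above (start from a snapshot, then fold in updates from the web socket) can be sketched as follows. The snapshot shape and the event types `message` and `channel_created` are hypothetical and chosen only to show the pattern.

```python
def apply_event(state, event):
    """Fold one web socket event into the client's local snapshot."""
    kind = event["type"]
    if kind == "channel_created":
        state["channels"][event["channel"]] = {"name": event["name"], "messages": []}
    elif kind == "message":
        state["channels"][event["channel"]]["messages"].append(event["text"])
    # Unknown event types are ignored in this sketch.
    return state


# Snapshot as it might arrive from rtm.start (shape is hypothetical).
state = {"channels": {"C1": {"name": "general", "messages": []}}}

# Updates that trickle in afterwards over the web socket.
for event in [
    {"type": "message", "channel": "C1", "text": "hello"},
    {"type": "channel_created", "channel": "C2", "name": "random"},
    {"type": "message", "channel": "C2", "text": "first!"},
]:
    apply_event(state, event)
```

Once the snapshot is loaded, every read is served from local state, which is what keeps the client responsive.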
slide-21
SLIDE 21

Cartoon Architecture of Slack

[Diagram: WebApp, Message Server, Job Queue, MySQL]

slide-22
SLIDE 22

Message Delivery

Persist, broadcast messages

[Diagram: WebApp, Message Server]

slide-23
SLIDE 23

Wrinkles in Message Server

  • Race between rtm.start and connection to MS

➞ Event log mechanism

  • Glitches, delays, net partitions while persisting

➞ In-memory queue of pending sends
➞ Queue depth sensitive barometer of system health

  • Most messages are presence
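
The in-memory queue of pending sends, with its depth used as a health signal, might look roughly like this. It is a sketch under assumptions: the `PendingSends` class, the `persist` callback, and the `alert_depth` threshold are hypothetical names, not part of the Message Server.

```python
from collections import deque


class PendingSends:
    """Hold messages in memory until the persistence layer accepts them."""

    def __init__(self, persist, alert_depth=100):
        self._persist = persist  # callable: returns True once a message is stored
        self._queue = deque()
        self.alert_depth = alert_depth  # hypothetical alerting threshold

    def send(self, message):
        self._queue.append(message)
        self.flush()

    def flush(self):
        """Persist queued messages in order; stop at the first failure."""
        while self._queue:
            if not self._persist(self._queue[0]):
                break
            self._queue.popleft()

    @property
    def depth(self):
        return len(self._queue)

    def unhealthy(self):
        """A deep queue means persistence is slow or unreachable."""
        return self.depth >= self.alert_depth
```

A message leaves the queue only after it has been persisted, so a glitch delays delivery without losing or reordering messages.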
slide-24
SLIDE 24

Deferring Work

  • Link unfurling
  • Search indexing
  • Exports/Imports

[Diagram: WebApp, Job Queue (Redis), Job Workers]
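
The deferral pattern on this slide (the webapp enqueues a job, a worker runs it later) can be sketched as below. A `deque` stands in for the Redis-backed queue so the example runs without a server. The job names, the JSON payload format, and the handlers are assumptions made for illustration.

```python
import json
from collections import deque

queue = deque()  # stand-in for a Redis list used as the job queue


def enqueue(job_type, **args):
    """Called by the webapp: record the work and return immediately."""
    queue.append(json.dumps({"type": job_type, "args": args}))


# Hypothetical handlers; real workers would fetch URLs, update indexes, etc.
HANDLERS = {
    "unfurl_link": lambda url: "unfurled " + url,
    "index_message": lambda message_id: "indexed " + str(message_id),
}


def work_one():
    """Called by a job worker: run the oldest job, or return None if idle."""
    if not queue:
        return None
    job = json.loads(queue.popleft())
    return HANDLERS[job["type"]](**job["args"])
```

Serializing jobs as plain JSON keeps the webapp and the workers loosely coupled: either side can be deployed independently as long as the payload shape holds.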

slide-25
SLIDE 25

Putting it all together

[Diagram: WebApp, Message Server, mains, shards]

slide-26
SLIDE 26

Things missing from the cartoon

  • Memcache wrapped around many DB accesses

➞ Case-by-case
➞ Manual

  • Computed data service (CDS)

➞ Provides ML models via Thrift interface

  • Rate-limiting around critical services
  • Search!

➞ Solr
➞ Team-partitioned
➞ Fed from job queue workers
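
The manual, case-by-case Memcache wrapping mentioned in the first bullet usually follows the cache-aside pattern: read the cache first, fall back to the database on a miss, and invalidate on write. The sketch below assumes that pattern. Plain dictionaries stand in for Memcache and MySQL, and the key format and function names are hypothetical.

```python
cache = {}  # stand-in for Memcache
DB = {711: {"name": "example"}}  # stand-in for a team shard
stats = {"db_reads": 0}


def load_team_from_db(team_id):
    stats["db_reads"] += 1
    return dict(DB[team_id])  # copy, so cached values do not alias DB rows


def get_team(team_id):
    """Cache-aside read: try the cache first, then fall back to the database."""
    key = "team:%d" % team_id
    if key in cache:
        return cache[key]
    value = load_team_from_db(team_id)
    cache[key] = value
    return value


def update_team(team_id, **fields):
    """Write to the database, then invalidate the cached copy by hand."""
    DB[team_id].update(fields)
    cache.pop("team:%d" % team_id, None)
```

The invalidation in `update_team` is the "manual" part: every write path has to remember to do it, which is the cost of wrapping accesses case by case.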

slide-27
SLIDE 27

Slack Today: The Good Parts

  • Team-partitioning

➞ Easy scaling to lots of teams
➞ Isolates failures and perf problems
➞ Makes customer complaints easy to field
➞ Natural fit for a paid product

  • Per-team Message Server

➞ Low-latency broadcasts

slide-28
SLIDE 28

Some Hard Cases

slide-29
SLIDE 29

Hard scenarios

  • Mains failures
  • Rtm.start on large teams
  • Mass reconnects
slide-30
SLIDE 30

Mains failure

  • 1 master fails, partner takes over
  • If both fail?

➞ Many users can proceed via memcache
➞ For the rest Slack is down
➞ Quite possible if failure was load-induced

slide-31
SLIDE 31

Rtm.start for large teams

  • Returns image of entire team
  • Channel membership is O(n²) for n users
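
A rough way to see the quadratic growth: if the number of channels grows in proportion to team size, and each channel holds a fixed fraction of the team, the number of membership entries in the payload grows with n². The ratios below (one channel per two users, one tenth of the team in each channel) are invented for illustration and are not figures from the talk.

```python
def membership_entries(n_users):
    """Count (channel, member) pairs under the assumed ratios."""
    channels = n_users // 2  # assumed: one channel per two users
    members_per_channel = n_users // 10  # assumed: 10% of the team per channel
    return channels * members_per_channel


small = membership_entries(100)   # 50 channels x 10 members
large = membership_entries(1000)  # 500 channels x 100 members
```

Under these assumptions a team ten times larger carries one hundred times as many membership entries.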
slide-32
SLIDE 32

Mass reconnects

  • A large team loses, then regains, office Internet connectivity
  • n users perform O(n²) rtm.start operations
  • Can ‘melt’ the team shard
slide-33
SLIDE 33

What are we going to do about it?

slide-34
SLIDE 34

Scale-out mains

  • Replace mains SPOF (single point of failure)
  • With what? We’re not sure yet
  • Kicking the tires carefully on a scary change
slide-35
SLIDE 35

Rtm.start for large teams

  • Incremental work

➞ Current p95, p99: 221 ms, 660 ms

  • Core problem: channel membership is O(n²)
  • Change APIs so clients can load channel members lazily
  • Much harder than it sounds!
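
Lazy loading of channel members, as proposed above, means the initial payload carries only channel identifiers and the member list is fetched the first time a client needs it. The sketch below shows the client side of that idea. `LazyChannel` and the `fetch_members` callback are hypothetical and do not correspond to a real Slack API.

```python
class LazyChannel:
    """Channel whose member list is fetched on first access, then kept."""

    def __init__(self, channel_id, fetch_members):
        self.channel_id = channel_id
        self._fetch_members = fetch_members  # hypothetical per-channel API call
        self._members = None

    @property
    def members(self):
        if self._members is None:
            self._members = self._fetch_members(self.channel_id)
        return self._members
```

Part of what makes this harder than it sounds is that any client code that assumed the member list was always present now has to tolerate a fetch, and possibly a failure, at the point of use.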
slide-36
SLIDE 36

Mass reconnects

  • Introducing flannel
  • Application-level edge cache
slide-37
SLIDE 37

Pre-Flannel

Message Delivery

Message Server WebApp

slide-38
SLIDE 38

Message Server

slide-39
SLIDE 39

Flannel status

  • On for a few teams
  • Rolling out to you soon with any luck
slide-40
SLIDE 40

Phew

slide-41
SLIDE 41

Stuff I had to leave out

  • Lots of client tech!
  • Voice
  • Backups
  • Data warehouse
  • Search
  • Deploying code
  • Monitoring and alerting
slide-42
SLIDE 42

Wrapping up

  • Sketch of how Slack works

➞ Application Logic
➞ Persistence
➞ Real-time messaging
➞ Asynchronous Work

  • Problems
  • What we’re doing about them
slide-43
SLIDE 43

There is a lot left to do

slack.com/jobs

slide-44
SLIDE 44

...

slide-45
SLIDE 45

Deployable Message Server

  • Channel-sharded message bus
  • Flannel discovers Channel servers via Consul

➞ Scatters user writes
➞ Gathers channel reads

  • Failures do not need reconnects
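
One way to picture "scatters user writes, gathers channel reads" is a hash from channel ID to channel server. The sketch below assumes simple modulo hashing and a static server list; the slides say the servers are discovered via Consul, which is not modeled here, and the server names and function names are hypothetical.

```python
import hashlib

SERVERS = ["cs0", "cs1", "cs2"]  # hypothetical channel servers


def server_for(channel_id, servers=SERVERS):
    """Map a channel to one server with a stable hash of its ID."""
    digest = hashlib.sha256(channel_id.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def scatter(writes):
    """Group (channel_id, message) writes by the server that owns the channel."""
    by_server = {}
    for channel_id, message in writes:
        by_server.setdefault(server_for(channel_id), []).append((channel_id, message))
    return by_server


def gather(channel_ids):
    """Return the set of servers to read from for a user's channels."""
    return {server_for(channel_id) for channel_id in channel_ids}
```

Modulo hashing remaps most channels when the server list changes, so a production design would more likely use consistent hashing or an explicit assignment table.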