Scaling Data Infrastructure @ Spotify



SLIDE 1

Scaling Data Infrastructure @ Spotify

matti@spotify.com kalvans@spotify.com

SLIDE 2

Mārtiņš Kalvāns

kalvans@spotify.com

Matti Pehrs

matti@spotify.com

SLIDE 3

Agenda

1. Data at Spotify
2. Summer of 2015
3. Challenges & Victory

○ Datamon
○ Styx
○ GABO

SLIDE 4

Spotify big-data context

  • Over 100 million monthly active users
  • Over 30 million songs
  • Over 2 billion playlists
  • Active in 60 markets
SLIDE 5

Data is at the heart of Spotify

In 2007

  • Monthly Royalty Report

In 2016

  • Monthly Royalty Report
  • Weekly Billboard
  • Daily reports to partners
  • ...
  • A/B testing
  • Discover Weekly
  • Daily Mix
  • ...
SLIDE 6

Our growth in Data

Users

  • +50 TB/day
  • +100M users

Developers

  • +60 TB/day
  • +10k M/R jobs

SLIDE 7

Hadoop

Autonomy & Dependencies

Team A Team B Team C

SLIDE 8

Autonomy & Dependencies

SLIDE 9

Autonomy & Dependencies

SLIDE 10

Autonomy & Dependencies

SLIDE 11

Summer of Incidents

SLIDE 12

Summer of Incidents

  • A string of incidents

SLIDE 13

Summer of Incidents

  • A string of incidents
  • War-room

SLIDE 14

Summer of Incidents

  • A string of incidents
  • War-room
  • Hadoop on its knees

SLIDE 15

Summer of Incidents

  • A string of incidents
  • War-room
  • Hadoop on its knees
  • Event Delivery catch-up

SLIDE 16

Summer of Incidents

  • A string of incidents
  • War-room
  • Hadoop on its knees
  • Event Delivery catch-up
  • Reprocessing of data

SLIDE 17

Summer of Incidents

  • A string of incidents
  • War-room
  • Hadoop on its knees
  • Event Delivery catch-up
  • Reprocessing of data
  • Hard-to-debug data issues

SLIDE 18

Challenges and the path to victory...

SLIDE 19

Challenges and the path to victory...

1. Early Warning
○ Datamon - Data monitoring

SLIDE 20

Challenges and the path to victory...

1. Early Warning
○ Datamon - Data monitoring
2. Debuggability & Control
○ Styx - Scheduling and control

SLIDE 21

Challenges and the path to victory...

1. Early Warning
○ Datamon - Data monitoring
2. Debuggability & Control
○ Styx - Scheduling and control
3. Automate Capacity
○ GABO - Event Delivery

SLIDE 22

Challenges and the path to victory...

1. Early Warning
○ Datamon - Data monitoring
2. Debuggability & Control
○ Styx - Scheduling and control
3. Automate Capacity
○ GABO - Event Delivery

SLIDE 23

Early Warning - Datamon

SLIDE 24

Early Warning - Datamon

  • Unified view

○ Alignment between teams

  • Ownership

○ Clear ownership of data

  • SLA

○ Alert on late data
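
The "SLA: alert on late data" idea can be illustrated with a small sketch. This is not Datamon's actual implementation, which the slides do not describe; the `Dataset` fields, `is_late`, and `alert_message` are hypothetical names chosen for illustration, and the only assumption is that each dataset declares an owner and a maximum delay after a partition closes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Dataset:
    # Hypothetical metadata; the slides do not show Datamon's real schema.
    name: str
    owner: str       # team that owns the dataset
    sla: timedelta   # maximum allowed delay after the partition closes


def is_late(dataset: Dataset, partition_end: datetime,
            arrived_at: Optional[datetime], now: datetime) -> bool:
    """Return True if the partition has missed its SLA deadline."""
    deadline = partition_end + dataset.sla
    if arrived_at is None:
        # Data has not arrived yet: it is late once the deadline has passed.
        return now > deadline
    return arrived_at > deadline


def alert_message(dataset: Dataset, partition_end: datetime) -> str:
    """Build an alert that names the owning team, so ownership stays clear."""
    return (f"[late data] {dataset.name} partition ending "
            f"{partition_end:%Y-%m-%d %H:%M} missed its SLA of {dataset.sla}; "
            f"owner: {dataset.owner}")
```

Keeping the owner next to the SLA in the same record is one way to tie the three bullets together: a single view, a named owner, and an alert that knows whom to notify.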

SLIDE 25

Early Warning - Datamon

  • Define terminology
  • Provide metadata language
  • Implement a Datamon service

SLIDE 26

Challenges and the path to victory...

1. Early Warning
○ Datamon - Data monitoring
2. Debuggability & Control
○ Styx - Scheduling and control
3. Automate Capacity
○ GABO - Event Delivery

SLIDE 27

Debuggability & Control - Styx

  • Execution control

○ Self-service for data users

  • Execution information

○ Expose debug information

  • Execution isolation

○ Docker for data jobs

The river Styx

SLIDE 28

Debuggability & Control - Styx

  • Execution control

○ Centralized execution API

SLIDE 29

Debuggability & Control - Styx

  • Execution control

○ Centralized execution API
○ Backfilling and reprocessing
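
As a rough illustration of what backfilling through a centralized scheduler involves, the sketch below expands a date range into one run per daily partition and groups the runs into bounded batches. It does not use or reproduce Styx's API; `daily_partitions`, `plan_backfill`, the `workflow@date` run label, and the `concurrency` parameter are all hypothetical names for this example.

```python
from datetime import date, timedelta
from typing import Iterator, List


def daily_partitions(start: date, end: date) -> Iterator[date]:
    """Yield each daily partition in [start, end), oldest first."""
    current = start
    while current < end:
        yield current
        current += timedelta(days=1)


def plan_backfill(workflow_id: str, start: date, end: date,
                  concurrency: int) -> List[List[str]]:
    """Group one run per partition into batches of at most `concurrency` runs."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    # Hypothetical run label: "<workflow>@<partition date>".
    runs = [f"{workflow_id}@{day.isoformat()}"
            for day in daily_partitions(start, end)]
    return [runs[i:i + concurrency] for i in range(0, len(runs), concurrency)]
```

Capping the batch size is the point of the sketch: reprocessing a long date range is the kind of load that can push a shared cluster over the edge if every partition is launched at once.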

SLIDE 30

Debuggability & Control - Styx

  • Execution control
  • Execution information

○ Timeline

SLIDE 31

Debuggability & Control - Styx

  • Execution control
  • Execution information

○ Timeline
○ Google Cloud Logging

SLIDE 32

Debuggability & Control - Styx

  • Execution control
  • Execution information
  • Execution isolation

○ Docker

SLIDE 33

Challenges and the path to victory...

1. Early Warning
○ Datamon - Data monitoring
2. Debuggability & Control
○ Styx - Scheduling and control
3. Automate Capacity
○ GABO - Event Delivery

SLIDE 34

Automate Capacity - GABO/Event Delivery

  • Complex and manual config

SLIDE 35

Automate Capacity - GABO/Event Delivery

  • Complex and manual config
  • Pub/Sub & Dataflow streaming

SLIDE 36

Automate Capacity - GABO/Event Delivery

  • Complex and manual config
  • Pub/Sub & Dataflow streaming
  • Pub/Sub at scale

SLIDE 37

Automate Capacity - GABO/Event Delivery

  • Complex and manual config
  • Pub/Sub & Dataflow streaming
  • Pub/Sub at scale
  • Dataflow streaming

SLIDE 38

Automate Capacity - GABO/Event Delivery

  • Complex and manual config
  • Pub/Sub & Dataflow streaming
  • Pub/Sub at scale
  • Dataflow streaming :-(
  • 2 microservices + 1 Map/Reduce job

SLIDE 39

Automate Capacity - GABO/Event Delivery

  • Complex and manual config
  • Pub/Sub & Dataflow streaming
  • Pub/Sub at scale
  • Dataflow streaming :-(
  • 2 microservices + 1 Map/Reduce job
  • Autoscaling & The Stuffer
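
The slides do not say how GABO's autoscaling decides on capacity, so the following is only a generic sketch of replacing a manual capacity config with a computed one: derive the instance count from observed load plus headroom, and clamp it to fixed bounds. The function name, parameters, and default values are assumptions made for illustration.

```python
import math


def desired_instances(events_per_second: float,
                      capacity_per_instance: float,
                      headroom: float = 0.25,
                      min_instances: int = 2,
                      max_instances: int = 200) -> int:
    """Return an instance count for the observed load, clamped to [min, max].

    All parameters are hypothetical: `capacity_per_instance` is the event rate
    one instance can sustain, and `headroom` is the spare fraction kept on top
    of the observed load.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed = events_per_second * (1.0 + headroom) / capacity_per_instance
    return max(min_instances, min(max_instances, math.ceil(needed)))
```

The lower bound keeps the service available when traffic is quiet, and the upper bound limits the damage if the load signal itself is wrong.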

SLIDE 40

GABO - WIP

  • Handles at least 10x our load
  • Darkloading
  • Autoscale everything
  • Self-service

SLIDE 41

Summary

  • Make sure you have the right tools to deal with data incidents

○ Make sure you have time to implement the tools you need

  • Remember that your capacity model can fail at larger scale

○ Keep track of your scale and automate, automate, automate...

SLIDE 42

Thank you!

kalvans@spotify.com matti@spotify.com

Want to join the band? http://spoti.fi/jobs