Scaling Data Infrastructure @ Spotify


  1. Scaling Data Infrastructure @ Spotify matti@spotify.com kalvans@spotify.com

  2. Mārtiņš Kalvāns (kalvans@spotify.com), Matti Pehrs (matti@spotify.com)

  3. Agenda 1. Data at Spotify 2. Summer of 2015 3. Challenges & Victory ○ Datamon ○ Styx ○ GABO

  4. Spotify big-data context ● Over 100 million monthly active users ● Over 30 million songs ● Over 2 billion playlists ● Active in 60 markets

  5. Data is at the heart of Spotify In 2007 In 2016 - Monthly Royalty Report - Monthly Royalty Report - Weekly Billboard - Daily reports to partners - ... - AB-Testing - Discover weekly - Daily Mix - ...

  6. Our growth in Data ● Users: +50 TB/day, +100M users ● Developers: +60 TB/day, +10k M/R jobs

  7. Autonomy & Dependencies (diagram: Team A, Team B, Team C, Hadoop)

  11. Summer of Incidents

  17. Summer of Incidents ● A strain of incidents ● War-room ● Hadoop on its knees ● Event Delivery catch-up ● Reprocessing of data ● Hard to debug data issues

  18. Challenges and the path to victory...

  21. Challenges and the path to victory... 1. Early Warning: Datamon (data monitoring) 2. Debuggability & Control: Styx (scheduling and control) 3. Automate Capacity: GABO (event delivery)

  23. Early Warning - Datamon

  24. Early Warning - Datamon ● Unified view ○ Alignment between teams ● Ownership ○ Clear ownership of data ● SLA ○ Alert on late data

  25. Early Warning - Datamon ● Define terminology ● Provide metadata language ● Implement a Datamon service
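
The slides do not show how Datamon is implemented. As a minimal sketch of the "clear ownership" and "alert on late data" ideas, assuming a hypothetical per-dataset metadata record (`DatasetSla`, `due_after`, and `is_late` are illustrative names, not part of Datamon), a lateness check could look like this:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class DatasetSla:
    """Hypothetical metadata record: who owns a dataset and when it is due."""
    name: str
    owner: str
    due_after: timedelta  # allowed delay after the partition's period ends


def is_late(sla: DatasetSla, period_end: datetime,
            delivered_at: Optional[datetime], now: datetime) -> bool:
    """Return True if the partition missed its deadline.

    A delivered partition is late if it arrived after the deadline;
    an undelivered one is late once the current time passes the deadline.
    """
    deadline = period_end + sla.due_after
    if delivered_at is not None:
        return delivered_at > deadline
    return now > deadline


# Illustrative values only: a dataset due 3 hours after the day closes.
sla = DatasetSla(name="example_dataset", owner="team-a",
                 due_after=timedelta(hours=3))
period_end = datetime(2016, 1, 2, tzinfo=timezone.utc)
now = datetime(2016, 1, 2, 4, 0, tzinfo=timezone.utc)

if is_late(sla, period_end, delivered_at=None, now=now):
    print(f"ALERT {sla.owner}: {sla.name} is late")
```

Keeping the owner next to the deadline in one record is what lets an alert go to the responsible team rather than to a shared channel.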

  27. Debuggability & Control - Styx ● Execution control ○ Self service for data users ● Execution information ○ Expose debug information ● Execution isolation ○ Docker for data jobs (image: the river Styx)

  29. Debuggability & Control - Styx ● Execution control ○ Centralized execution API ○ Backfilling and reprocessing
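
The deck names backfilling and reprocessing without showing an API. As a rough illustration of what a backfill involves, the sketch below enumerates the daily partitions in a date range and groups them into bounded batches so that reprocessing does not flood the cluster. The function names and the `max_concurrent` parameter are assumptions made for this example and are not taken from Styx:

```python
from datetime import date, timedelta
from typing import Iterator, List


def daily_partitions(start: date, end: date) -> Iterator[date]:
    """Yield every daily partition in the half-open range [start, end)."""
    current = start
    while current < end:
        yield current
        current += timedelta(days=1)


def plan_backfill(start: date, end: date,
                  max_concurrent: int) -> List[List[date]]:
    """Group partitions into batches of at most max_concurrent runs."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    parts = list(daily_partitions(start, end))
    return [parts[i:i + max_concurrent]
            for i in range(0, len(parts), max_concurrent)]


# Illustrative range: five days, reprocessed two partitions at a time.
for batch in plan_backfill(date(2015, 7, 1), date(2015, 7, 6), max_concurrent=2):
    print([d.isoformat() for d in batch])
```

Capping concurrency is the design point here: a backfill competes with regular scheduled jobs for the same capacity.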

  31. Debuggability & Control - Styx ● Execution control ● Execution information ○ Timeline ○ Google Cloud Logging

  32. Debuggability & Control - Styx ● Execution control ● Execution information ● Execution isolation ○ Docker

  39. Automate Capacity - GABO/Event Delivery ● Complex and manual config ● Pub/Sub & Dataflow streaming ● Pub/Sub at scale ● Dataflow streaming :-( ● 2 microservices + 1 Map/Reduce job ● Autoscaling & The Stuffer

  40. GABO - WIP ● Handles at least 10x our load ● Darkloading ● Autoscale everything ● Self service
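
The slides do not describe how GABO autoscales. For readers unfamiliar with the idea, here is a generic proportional scaling rule, given as a sketch under stated assumptions: the metric (backlog per worker), the target value, and the function name are all hypothetical and not taken from the talk.

```python
import math


def desired_workers(current_workers: int,
                    backlog_per_worker: float,
                    target_backlog_per_worker: float,
                    min_workers: int = 1,
                    max_workers: int = 100) -> int:
    """Scale the worker count in proportion to observed vs. target load.

    The result is clamped to [min_workers, max_workers] so that a metric
    spike or an idle period cannot drive the pool to an extreme size.
    """
    if current_workers < 1:
        raise ValueError("current_workers must be at least 1")
    if target_backlog_per_worker <= 0:
        raise ValueError("target_backlog_per_worker must be positive")
    raw = math.ceil(
        current_workers * backlog_per_worker / target_backlog_per_worker
    )
    return max(min_workers, min(max_workers, raw))


# Illustrative numbers: backlog is 3x the target, so the pool triples.
print(desired_workers(4, backlog_per_worker=300.0,
                      target_backlog_per_worker=100.0))
```

The clamp reflects the summary's warning that a capacity model can fail at larger scale: the bounds are where a human decision still applies.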

  41. Summary ● Make sure you have the right tools to deal with data incidents ○ Make sure you have time to implement the tools you need ● Remember that your capacity model can fail at larger scale ○ Keep track of your scale and automate, automate, automate...

  42. Thank you! kalvans@spotify.com matti@spotify.com Want to join the band? http://spoti.fi/jobs
