slide-1
SLIDE 1

Building highly reliable data pipelines @ Datadog

Quentin FRANCOIS

Team Lead, Data Engineering

1

DataEng Barcelona ‘18

slide-2
SLIDE 2

2

slide-3
SLIDE 3

3

slide-4
SLIDE 4

4

slide-5
SLIDE 5

5

slide-6
SLIDE 6

Building highly reliable data pipelines @ Datadog

6

Quentin FRANCOIS

Team Lead, Data Engineering

DataEng Barcelona ‘18

slide-7
SLIDE 7

Reliability is the probability that a system will produce correct outputs up to some given time t.

Source: E.J. McCluskey & S. Mitra (2004). "Fault Tolerance" in Computer Science Handbook, 2nd ed., ed. A.B. Tucker. CRC Press.

7

slide-8
SLIDE 8
  • 1. Architecture

Highly reliable data pipelines

8

slide-9
SLIDE 9
  • 1. Architecture
  • 2. Monitoring

Highly reliable data pipelines

9

slide-10
SLIDE 10
  • 1. Architecture
  • 2. Monitoring
  • 3. Failures handling

Highly reliable data pipelines

10

slide-11
SLIDE 11

Historical metric queries

Time series data

metric: system.load.1
timestamp: 1526382440
value: 0.92
tags: host:i-xyz,env:dev,...

11
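A rough mental model of one such point as a small record (the field names mirror the slide, not Datadog's internal storage format):

```python
# Toy representation of a single time series point; field names come from
# the slide, not from Datadog's internal format.
from typing import NamedTuple, Tuple

class Point(NamedTuple):
    metric: str            # e.g. "system.load.1"
    timestamp: int         # Unix seconds
    value: float
    tags: Tuple[str, ...]  # e.g. ("host:i-xyz", "env:dev")

p = Point("system.load.1", 1526382440, 0.92, ("host:i-xyz", "env:dev"))
```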

slide-12
SLIDE 12

Historical metric queries

12

1 point/second

slide-13
SLIDE 13

Historical metric queries

13

1 point/day

slide-14
SLIDE 14

Historical metric queries

14

[Diagram: the rollups pipeline reads high resolution data (1 pt/sec) and writes low resolution data (1 pt/min, 1 pt/hour, 1 pt/day) to AWS S3.]

Rollups pipeline

  • Runs once a day.
  • Dozens of TBs of input data.
  • Trillions of points processed.
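As a rough illustration of what such a rollup job can look like, here is a minimal PySpark sketch; the schema, paths, and aggregations are assumptions for this example, not Datadog's actual pipeline:

```python
# Minimal PySpark sketch of an hourly rollup; paths and column names are
# illustrative assumptions, not Datadog's real schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rollups-sketch").getOrCreate()

# High resolution input: roughly one point per second per time series.
raw = spark.read.parquet("s3://example-bucket/raw/2018-05-21/")

hourly = (
    raw
    # Truncate each timestamp (Unix seconds) down to the start of its hour.
    .withColumn("hour", (F.col("timestamp") / 3600).cast("long") * 3600)
    # One output point per metric/tags/hour.
    .groupBy("metric", "tags", "hour")
    .agg(
        F.avg("value").alias("avg"),
        F.min("value").alias("min"),
        F.max("value").alias("max"),
        F.count("value").alias("count"),
    )
)

# Low resolution output (1 pt/hour), written back to S3.
hourly.write.mode("overwrite").parquet("s3://example-bucket/rollups/1h/2018-05-21/")
```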

slide-15
SLIDE 15
  • 1. Architecture
  • 2. Monitoring
  • 3. Failures handling

Highly reliable data pipelines

15

slide-16
SLIDE 16

Our big data platform architecture

[Diagram: users (web UI, scheduler, CLI) drive Luigi workers, which run Spark jobs on multiple EMR clusters; data lives in S3 and everything reports to Datadog monitoring.]

16

slide-17
SLIDE 17

Our big data platform architecture

[Diagram: users (web UI, scheduler, CLI) drive Luigi workers, which run Spark jobs on multiple EMR clusters; data lives in S3 and everything reports to Datadog monitoring.]

17
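For readers unfamiliar with Luigi, a minimal sketch of how one pipeline step could be declared is shown below; the task, parameter, and paths are illustrative, not Datadog's code:

```python
# Minimal Luigi sketch: one task per day of rollups, marking completion in S3.
# Task name, parameter, and paths are illustrative assumptions.
import datetime
import luigi
from luigi.contrib.s3 import S3Target


class HourlyRollup(luigi.Task):
    day = luigi.DateParameter()

    def output(self):
        # Luigi checks whether this target exists to decide if the task
        # still needs to run.
        return S3Target("s3://example-bucket/rollups/1h/%s/_SUCCESS" % self.day)

    def run(self):
        # In the real pipeline this step would submit a Spark job to an EMR
        # cluster; here we only write the completion marker.
        with self.output().open("w") as marker:
            marker.write("done")


if __name__ == "__main__":
    luigi.build([HourlyRollup(day=datetime.date(2018, 5, 21))], local_scheduler=True)
```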

slide-18
SLIDE 18

Many ephemeral clusters

  • New cluster for every pipeline.
  • Dozens of clusters at a time.
  • Median lifetime of ~3 hours.

18

slide-19
SLIDE 19

Total isolation

We know what is happening and why.

19

slide-20
SLIDE 20

Pick the best hardware for each job

For CPU-bound jobs

c3

For memory-bound jobs

r3

20

slide-21
SLIDE 21

Scale up/down clusters

  • If we are behind.
  • Scale as we grow.
  • No more waiting on loaded clusters.

21
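A hedged sketch of what scaling a running cluster can look like through the EMR API with boto3; the cluster id and target size are placeholders:

```python
# Sketch: resize a running EMR cluster with boto3. The cluster id and the
# target size are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Find the CORE instance group of the cluster we want to grow or shrink.
groups = emr.list_instance_groups(ClusterId="j-EXAMPLE123")["InstanceGroups"]
core = next(g for g in groups if g["InstanceGroupType"] == "CORE")

# Ask EMR to scale that group to the desired number of nodes.
emr.modify_instance_groups(
    ClusterId="j-EXAMPLE123",
    InstanceGroups=[{"InstanceGroupId": core["Id"], "InstanceCount": 40}],
)
```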

slide-22
SLIDE 22

Safer upgrades of EMR/Hadoop/Spark

[Diagram: one cluster runs EMR 5.13 while the other clusters stay on 5.12.]

22

slide-23
SLIDE 23

Spot-instance clusters

  • Ridiculous savings (up to 80% off the on-demand price).
  • Nodes can die at any time.

23

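A hedged sketch of launching an ephemeral, mostly-spot EMR cluster with boto3; the instance types, counts, bid price, roles, and tags are illustrative assumptions, not Datadog's configuration:

```python
# Sketch: launch an ephemeral EMR cluster with spot core nodes via boto3.
# All names, sizes, prices, and roles below are illustrative assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="rollups-2018-05-21",
    ReleaseLabel="emr-5.12.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {
                "Name": "master",
                "InstanceRole": "MASTER",
                "InstanceType": "r3.xlarge",   # memory-bound job -> r3 family
                "InstanceCount": 1,
                "Market": "ON_DEMAND",
            },
            {
                "Name": "core",
                "InstanceRole": "CORE",
                "InstanceType": "r3.2xlarge",
                "InstanceCount": 20,
                "Market": "SPOT",              # big savings, but nodes can die
                "BidPrice": "0.30",
            },
        ],
        # Ephemeral: terminate the cluster once its steps have finished.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Tags=[{"Key": "pipeline", "Value": "rollups"}],
)
print(response["JobFlowId"])
```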
slide-24
SLIDE 24

Spot-instance clusters

  • Ridiculous savings (up to 80% off the on-demand price).
  • Nodes can die at any time.

24

slide-25
SLIDE 25

How can we build highly reliable data pipelines with instances killed randomly all the time?

25


slide-26
SLIDE 26

No long running jobs

  • The longer the job, the more work you lose on average.
  • The longer the job, the longer it takes to recover.

26

slide-27
SLIDE 27

No long running jobs

[Chart: Pipeline A runs as one long 9-hour job; Pipeline B does the same work as a series of smaller jobs. Time in hours.]

27

slide-28
SLIDE 28

No long running jobs

[Chart: after a job failure at hour 7, Pipeline A (one long job) restarts from scratch and finishes at hour 16, while Pipeline B (smaller jobs) reruns only the failed piece and finishes at hour 10.]

28

slide-29
SLIDE 29

Break down jobs into smaller pieces

  • Vertically: persist intermediate data between transformations.
  • Horizontally: partition the input data.

29

slide-30
SLIDE 30

Example

30

[Diagram: raw time series data (input) → rollups pipeline → aggregated time series data in a custom file format (output).]

slide-31
SLIDE 31

Example

31

[Diagram: raw time series data (input) → rollups pipeline → aggregated time series data in a custom file format (output). Steps 1 and 2 are marked.]

  • 1. Aggregate high resolution data.
  • 2. Store the aggregated data in our custom file format.

slide-32
SLIDE 32

Example

32

Vertical split

[Diagram: raw time series data (input) → step 1 → checkpoint data: aggregated time series data in Parquet format → step 2 → aggregated time series data in the custom file format (output).]

  • 1. Aggregate high resolution data.
  • 2. Store the aggregated data in our custom file format.

slide-33
SLIDE 33

Example

33

Horizontal split

[Diagram: the input raw time series data is partitioned into shards A, B, C, D; each shard flows through step 1 (Parquet checkpoint data) and step 2 (custom file format output).]

  • 1. Aggregate high resolution data.
  • 2. Store the aggregated data in our custom file format.

slide-34
SLIDE 34

Example

34

Horizontal split

[Diagram: shards A, B, C, D each run steps 1 and 2 independently, from raw time series data to the Parquet checkpoint to the custom file format output.]

  • 1. Aggregate high resolution data.
  • 2. Store the aggregated data in our custom file format.
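A minimal PySpark sketch of the vertical and horizontal splits described above; the paths, column names, and shard list are assumptions for this example, not Datadog's code:

```python
# Sketch of splitting the rollups job vertically (persist a Parquet
# checkpoint between step 1 and step 2) and horizontally (one run per shard).
# Paths, columns, and shard names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("split-rollups-sketch").getOrCreate()

SHARDS = ["A", "B", "C", "D"]  # horizontal split: independent runs per shard

for shard in SHARDS:
    raw_path = "s3://example-bucket/raw/2018-05-21/shard=%s/" % shard
    checkpoint_path = "s3://example-bucket/checkpoint/2018-05-21/shard=%s/" % shard
    output_path = "s3://example-bucket/rollups/2018-05-21/shard=%s/" % shard

    # Step 1: aggregate high resolution data and persist the intermediate
    # result (vertical split). If step 2 fails, step 1 is not redone.
    aggregated = (
        spark.read.parquet(raw_path)
        .withColumn("hour", (F.col("timestamp") / 3600).cast("long") * 3600)
        .groupBy("metric", "tags", "hour")
        .agg(F.avg("value").alias("avg"))
    )
    aggregated.write.mode("overwrite").parquet(checkpoint_path)

    # Step 2: read the checkpoint back and produce the final output. The
    # real pipeline writes a custom file format; Parquet stands in here.
    spark.read.parquet(checkpoint_path).write.mode("overwrite").parquet(output_path)
```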

slide-35
SLIDE 35

Break down jobs into smaller pieces

35

[Diagram: trade-off between fault tolerance and performance.]

slide-36
SLIDE 36

Lessons

  • Many clusters for better isolation.
  • Break down jobs into smaller pieces.
  • Trade-off between performance and fault tolerance.

36

slide-37
SLIDE 37
  • 1. Architecture
  • 2. Monitoring
  • 3. Failures handling

Highly reliable data pipelines

37

slide-38
SLIDE 38

Cluster tagging

  • #rollups
  • #anomaly detection

38
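A hedged sketch of tagging a cluster through the EMR API so that its metrics can later be filtered by pipeline in Datadog; the cluster id and tag values are placeholders:

```python
# Sketch: tag an EMR cluster so its metrics can be sliced by pipeline.
# The cluster id and tag values are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_tags(
    ResourceId="j-EXAMPLE123",  # the EMR cluster id
    Tags=[
        {"Key": "pipeline", "Value": "rollups"},
        {"Key": "team", "Value": "data-engineering"},
    ],
)
```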

slide-39
SLIDE 39

Monitor cluster metrics

39

slide-40
SLIDE 40

Monitor cluster metrics

40

slide-41
SLIDE 41

Monitor cluster metrics

41

slide-42
SLIDE 42

Monitor cluster metrics

42

slide-43
SLIDE 43

Monitor work metrics

More details: datadoghq.com/blog/monitoring-spark/

43
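One common way to report custom work metrics like these is DogStatsD; a minimal sketch follows, with the metric name, value, and tags made up for illustration:

```python
# Sketch: emit a custom work metric to Datadog via DogStatsD.
# Metric name, value, and tags are illustrative.
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)

# e.g. how many points a job wrote, tagged by pipeline and cluster.
statsd.gauge(
    "pipeline.points_written",
    1250000000,
    tags=["pipeline:rollups", "cluster:j-EXAMPLE123"],
)
```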

slide-44
SLIDE 44

Monitor work metrics

44

slide-45
SLIDE 45

Monitor work metrics

45

slide-46
SLIDE 46

Monitor work metrics

46

slide-47
SLIDE 47

Monitor data lag

47

[Dashboard: data lag for the 1 point/sec data and the 1 point/hour data.]
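Data lag can be thought of as "now minus the newest timestamp that has been fully processed"; a minimal sketch of reporting it per resolution is below (the lookup helper is hypothetical):

```python
# Sketch: compute data lag per resolution and report it as a gauge.
# latest_processed_timestamp() is a hypothetical helper.
import time
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)

def latest_processed_timestamp(resolution):
    # Hypothetical: would query the newest output timestamp for this
    # resolution; a fixed value stands in here.
    return 1526382440

for resolution in ("1s", "1h"):
    lag_seconds = time.time() - latest_processed_timestamp(resolution)
    statsd.gauge(
        "pipeline.data_lag_seconds",
        lag_seconds,
        tags=["resolution:" + resolution],
    )
```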

slide-48
SLIDE 48

Lessons

  • Measure, measure and measure!
  • Alert on meaningful and actionable metrics.
  • High level dashboards.

48

slide-49
SLIDE 49
  • 1. Architecture
  • 2. Monitoring
  • 3. Failures handling

Highly reliable data pipelines

49

slide-50
SLIDE 50

50

slide-51
SLIDE 51

Data pipelines will break

  • Hardware failures
  • Bad code changes
  • Upstream delays
  • Increasing volume of data

51

slide-52
SLIDE 52

Data pipelines will break

  • 1. Recover fast.
  • 2. Degrade gracefully.

52

slide-53
SLIDE 53

Recover fast

  • No long running job.
  • Switch from spot to on-demand clusters.
  • Increase cluster size.

53

slide-54
SLIDE 54

Recover fast: easy way to rerun jobs

  • Needed when jobs run but produce some bad data.
  • Not always trivial.

54

slide-55
SLIDE 55

Example: rerun the rollups pipeline

s3://bucket/ contains one prefix per month: 2018-01, 2018-02, 2018-03, 2018-04, 2018-05

55

slide-56
SLIDE 56

Example: rerun the rollups pipeline

s3://bucket/2018-05/ contains one prefix per run: as-of_2018-05-01, as-of_2018-05-02, ..., as-of_2018-05-21

56

slide-57
SLIDE 57

Example: rerun the rollups pipeline

s3://bucket/2018-05/: as-of_2018-05-01, as-of_2018-05-02, ..., as-of_2018-05-21 — active location

57

slide-58
SLIDE 58

Example: rerun the rollups pipeline

s3://bucket/2018-05/: as-of_2018-05-01, as-of_2018-05-02, ..., as-of_2018-05-21, as-of_2018-05-22 — active location

58

slide-59
SLIDE 59

Example: rerun the rollups pipeline

s3://bucket/2018-05/: as-of_2018-05-01, as-of_2018-05-02, ..., as-of_2018-05-21, as-of_2018-05-22 — active location

59

slide-60
SLIDE 60

Example: rerun the rollups pipeline

s3://bucket/2018-05/: as-of_2018-05-01, as-of_2018-05-02, ..., as-of_2018-05-21, as-of_2018-05-22 — active location

60

slide-61
SLIDE 61

Example: rerun the rollups pipeline

s3://bucket/2018-05/: as-of_2018-05-01, as-of_2018-05-02, ..., as-of_2018-05-21, as-of_2018-05-22, as-of_2018-05-22_run-2 — active location

61
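A hedged sketch of this rerun pattern: every run writes to a fresh as-of prefix, and a small pointer object is flipped to make one prefix the active location. The pointer key name (_ACTIVE) is an assumption for this example, not necessarily what Datadog uses:

```python
# Sketch: publish a rerun by flipping an "active location" pointer in S3.
# The pointer key name "_ACTIVE" is an assumption, not Datadog's convention.
import boto3

s3 = boto3.client("s3")
BUCKET = "bucket"
MONTH_PREFIX = "2018-05/"

def publish_run(run_prefix):
    # Point readers at a freshly written as-of prefix, e.g.
    # "2018-05/as-of_2018-05-22_run-2/".
    s3.put_object(
        Bucket=BUCKET,
        Key=MONTH_PREFIX + "_ACTIVE",
        Body=run_prefix.encode("utf-8"),
    )

def active_location():
    # Readers resolve the pointer before reading any data.
    obj = s3.get_object(Bucket=BUCKET, Key=MONTH_PREFIX + "_ACTIVE")
    return obj["Body"].read().decode("utf-8")

# After rerunning the pipeline into a new prefix, flip the pointer:
publish_run(MONTH_PREFIX + "as-of_2018-05-22_run-2/")
```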

slide-62
SLIDE 62

Degrade gracefully

62

[Diagram: shards A, B, C, D processed independently.]

  • Isolate issues to a limited number of customers.
  • Keep the functionality operational at the cost of performance/accuracy.

slide-63
SLIDE 63

Degrade gracefully: skip corrupted files

  • Job failures caused by a limited amount of corrupted input data.
  • Don’t ignore real widespread issues.

63
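One possible mechanism (the talk does not say this is Datadog's exact one) is Spark's ignoreCorruptFiles setting, kept behind an explicit knob so that widespread corruption still fails the job loudly:

```python
# Sketch: optionally skip a handful of corrupted input files in Spark.
# The knob should only be enabled for known, limited corruption.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rollups-sketch").getOrCreate()

skip_corrupted_files = True  # pipeline-level knob, off by default

if skip_corrupted_files:
    # Spark logs and skips unreadable files instead of failing the job.
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

df = spark.read.parquet("s3://example-bucket/raw/2018-05-21/")
```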

slide-64
SLIDE 64

Lessons

  • Think about potential issues ahead of time.
  • Have knobs ready to recover fast.
  • Have knobs ready to limit the customer facing impact.

64

slide-65
SLIDE 65

Conclusion

65

Building highly reliable data pipelines

slide-66
SLIDE 66

Conclusion

  • Know your time constraints.

66

Building highly reliable data pipelines

slide-67
SLIDE 67

Conclusion

  • Know your time constraints.
  • Break down jobs into small survivable pieces.

67

Building highly reliable data pipelines

slide-68
SLIDE 68

Conclusion

  • Know your time constraints.
  • Break down jobs into small survivable pieces.
  • Monitor cluster metrics, job metrics and data lags.

68

Building highly reliable data pipelines

slide-69
SLIDE 69

Conclusion

  • Know your time constraints.
  • Break down jobs into small survivable pieces.
  • Monitor cluster metrics, job metrics and data lags.
  • Think about failures ahead of time and get prepared.

69

Building highly reliable data pipelines

slide-70
SLIDE 70

Thanks!

We’re hiring! qf@datadoghq.com https://jobs.datadoghq.com

70