slide-1
SLIDE 1

Building highly reliable data pipelines @ Datadog

Quentin FRANCOIS

Team Lead, Data Engineering

1

DataEng Barcelona ‘18

slide-2
SLIDE 2

2

slide-3
SLIDE 3

3

slide-4
SLIDE 4

4

slide-5
SLIDE 5

5

slide-6
SLIDE 6

Building highly reliable data pipelines @ Datadog

6

Quentin FRANCOIS

Team Lead, Data Engineering

DataEng Barcelona ‘18

slide-7
SLIDE 7

Reliability is the probability that a system will produce correct outputs up to some given time t.

Source: E.J. McCluskey & S. Mitra (2004). "Fault Tolerance" in Computer Science Handbook, 2nd ed., ed. A.B. Tucker. CRC Press.

7

slide-8
SLIDE 8
  • 1. Architecture

Highly reliable data pipelines

8

slide-9
SLIDE 9
  • 1. Architecture
  • 2. Monitoring

Highly reliable data pipelines

9

slide-10
SLIDE 10
  • 1. Architecture
  • 2. Monitoring
  • 3. Failures handling

Highly reliable data pipelines

10

slide-11
SLIDE 11

Historical metric queries

Time series data

metric: system.load.1
timestamp: 1526382440
value: 0.92
tags: host:i-xyz,env:dev,...

11
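A rough mental model of one such point as a small record (the field names mirror the slide, not Datadog's internal storage format):

```python
# Toy representation of a single time series point; field names come from
# the slide, not from Datadog's internal format.
from typing import NamedTuple, Tuple

class Point(NamedTuple):
    metric: str            # e.g. "system.load.1"
    timestamp: int         # Unix seconds
    value: float
    tags: Tuple[str, ...]  # e.g. ("host:i-xyz", "env:dev")

p = Point("system.load.1", 1526382440, 0.92, ("host:i-xyz", "env:dev"))
```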

slide-12
SLIDE 12

Historical metric queries

12

1 point/second

slide-13
SLIDE 13

Historical metric queries

13

1 point/day

slide-14
SLIDE 14

Historical metric queries

14

[Diagram: the rollups pipeline reads high resolution data (1 pt/sec) and writes low resolution data (1 pt/min, 1 pt/hour, 1 pt/day) to AWS S3.]

Rollups pipeline

  • Runs once a day.
  • Dozens of TBs of input data.
  • Trillions of points processed.
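As a rough illustration of what such a rollup job can look like, here is a minimal PySpark sketch; the schema, paths, and aggregations are assumptions for this example, not Datadog's actual pipeline:

```python
# Minimal PySpark sketch of an hourly rollup; paths and column names are
# illustrative assumptions, not Datadog's real schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rollups-sketch").getOrCreate()

# High resolution input: roughly one point per second per time series.
raw = spark.read.parquet("s3://example-bucket/raw/2018-05-21/")

hourly = (
    raw
    # Truncate each timestamp (Unix seconds) down to the start of its hour.
    .withColumn("hour", (F.col("timestamp") / 3600).cast("long") * 3600)
    # One output point per metric/tags/hour.
    .groupBy("metric", "tags", "hour")
    .agg(
        F.avg("value").alias("avg"),
        F.min("value").alias("min"),
        F.max("value").alias("max"),
        F.count("value").alias("count"),
    )
)

# Low resolution output (1 pt/hour), written back to S3.
hourly.write.mode("overwrite").parquet("s3://example-bucket/rollups/1h/2018-05-21/")
```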

slide-15
SLIDE 15
  • 1. Architecture
  • 2. Monitoring
  • 3. Failures handling

Highly reliable data pipelines

15

slide-16
SLIDE 16

Our big data platform architecture

[Diagram: users (web UI, scheduler, CLI) drive Luigi workers, which run Spark jobs on multiple EMR clusters; data lives in S3 and everything reports to Datadog monitoring.]

16

slide-17
SLIDE 17

Our big data platform architecture

[Diagram: users (web UI, scheduler, CLI) drive Luigi workers, which run Spark jobs on multiple EMR clusters; data lives in S3 and everything reports to Datadog monitoring.]

17
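For readers unfamiliar with Luigi, a minimal sketch of how one pipeline step could be declared is shown below; the task, parameter, and paths are illustrative, not Datadog's code:

```python
# Minimal Luigi sketch: one task per day of rollups, marking completion in S3.
# Task name, parameter, and paths are illustrative assumptions.
import datetime
import luigi
from luigi.contrib.s3 import S3Target


class HourlyRollup(luigi.Task):
    day = luigi.DateParameter()

    def output(self):
        # Luigi checks whether this target exists to decide if the task
        # still needs to run.
        return S3Target("s3://example-bucket/rollups/1h/%s/_SUCCESS" % self.day)

    def run(self):
        # In the real pipeline this step would submit a Spark job to an EMR
        # cluster; here we only write the completion marker.
        with self.output().open("w") as marker:
            marker.write("done")


if __name__ == "__main__":
    luigi.build([HourlyRollup(day=datetime.date(2018, 5, 21))], local_scheduler=True)
```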

slide-18
SLIDE 18

Many ephemeral clusters

  • New cluster for every pipeline.
  • Dozens of clusters at a time.
  • Median lifetime of ~3 hours.

18

slide-19
SLIDE 19

Total isolation

We know what is happening and why.

19

slide-20
SLIDE 20

Pick the best hardware for each job

For CPU-bound jobs

c3

For memory-bound jobs

r3

20

slide-21
SLIDE 21

Scale up/down clusters

  • If we are behind.
  • Scale as we grow.
  • No more waiting on loaded clusters.

21
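A hedged sketch of what scaling a running cluster can look like through the EMR API with boto3; the cluster id and target size are placeholders:

```python
# Sketch: resize a running EMR cluster with boto3. The cluster id and the
# target size are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Find the CORE instance group of the cluster we want to grow or shrink.
groups = emr.list_instance_groups(ClusterId="j-EXAMPLE123")["InstanceGroups"]
core = next(g for g in groups if g["InstanceGroupType"] == "CORE")

# Ask EMR to scale that group to the desired number of nodes.
emr.modify_instance_groups(
    ClusterId="j-EXAMPLE123",
    InstanceGroups=[{"InstanceGroupId": core["Id"], "InstanceCount": 40}],
)
```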

slide-22
SLIDE 22

Safer upgrades of EMR/Hadoop/Spark

[Diagram: one cluster runs EMR 5.13 while the other clusters stay on 5.12.]

22

slide-23
SLIDE 23

Spot-instance clusters

  • Ridiculous savings (up to 80% off the on-demand price).
  • Nodes can die at any time.

23

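A hedged sketch of launching an ephemeral, mostly-spot EMR cluster with boto3; the instance types, counts, bid price, roles, and tags are illustrative assumptions, not Datadog's configuration:

```python
# Sketch: launch an ephemeral EMR cluster with spot core nodes via boto3.
# All names, sizes, prices, and roles below are illustrative assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="rollups-2018-05-21",
    ReleaseLabel="emr-5.12.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {
                "Name": "master",
                "InstanceRole": "MASTER",
                "InstanceType": "r3.xlarge",   # memory-bound job -> r3 family
                "InstanceCount": 1,
                "Market": "ON_DEMAND",
            },
            {
                "Name": "core",
                "InstanceRole": "CORE",
                "InstanceType": "r3.2xlarge",
                "InstanceCount": 20,
                "Market": "SPOT",              # big savings, but nodes can die
                "BidPrice": "0.30",
            },
        ],
        # Ephemeral: terminate the cluster once its steps have finished.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Tags=[{"Key": "pipeline", "Value": "rollups"}],
)
print(response["JobFlowId"])
```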
slide-24
SLIDE 24

Spot-instance clusters

  • Ridiculous savings (up to 80% off the on-demand price).
  • Nodes can die at any time.

24

slide-25
SLIDE 25

How can we build highly reliable data pipelines with instances killed randomly all the time?

25


slide-26
SLIDE 26

No long running jobs

  • The longer the job, the more work you lose on average.
  • The longer the job, the longer it takes to recover.

26

slide-27
SLIDE 27

No long running jobs

[Chart: Pipeline A runs as one long 9-hour job; Pipeline B does the same work as a series of smaller jobs. Time in hours.]

27

slide-28
SLIDE 28

No long running jobs

[Chart: after a job failure at hour 7, Pipeline A (one long job) restarts from scratch and finishes at hour 16, while Pipeline B (smaller jobs) reruns only the failed piece and finishes at hour 10.]

28

slide-29
SLIDE 29

Break down jobs into smaller pieces

  • Vertically: persist intermediate data between transformations.
  • Horizontally: partition the input data.

29

slide-30
SLIDE 30

Example

30

[Diagram: raw time series data (input) → rollups pipeline → aggregated time series data in a custom file format (output).]

slide-31
SLIDE 31

Example

31

[Diagram: raw time series data (input) → rollups pipeline → aggregated time series data in a custom file format (output). Steps 1 and 2 are marked.]

  • 1. Aggregate high resolution data.
  • 2. Store the aggregated data in our custom file format.

slide-32
SLIDE 32

Example

32

Vertical split

[Diagram: raw time series data (input) → step 1 → checkpoint data: aggregated time series data in Parquet format → step 2 → aggregated time series data in the custom file format (output).]

  • 1. Aggregate high resolution data.
  • 2. Store the aggregated data in our custom file format.

slide-33
SLIDE 33

Example

33

Horizontal split

[Diagram: the input raw time series data is partitioned into shards A, B, C, D; each shard flows through step 1 (Parquet checkpoint data) and step 2 (custom file format output).]

  • 1. Aggregate high resolution data.
  • 2. Store the aggregated data in our custom file format.

slide-34
SLIDE 34

Example

34

Horizontal split

[Diagram: shards A, B, C, D each run steps 1 and 2 independently, from raw time series data to the Parquet checkpoint to the custom file format output.]

  • 1. Aggregate high resolution data.
  • 2. Store the aggregated data in our custom file format.
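A minimal PySpark sketch of the vertical and horizontal splits described above; the paths, column names, and shard list are assumptions for this example, not Datadog's code:

```python
# Sketch of splitting the rollups job vertically (persist a Parquet
# checkpoint between step 1 and step 2) and horizontally (one run per shard).
# Paths, columns, and shard names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("split-rollups-sketch").getOrCreate()

SHARDS = ["A", "B", "C", "D"]  # horizontal split: independent runs per shard

for shard in SHARDS:
    raw_path = "s3://example-bucket/raw/2018-05-21/shard=%s/" % shard
    checkpoint_path = "s3://example-bucket/checkpoint/2018-05-21/shard=%s/" % shard
    output_path = "s3://example-bucket/rollups/2018-05-21/shard=%s/" % shard

    # Step 1: aggregate high resolution data and persist the intermediate
    # result (vertical split). If step 2 fails, step 1 is not redone.
    aggregated = (
        spark.read.parquet(raw_path)
        .withColumn("hour", (F.col("timestamp") / 3600).cast("long") * 3600)
        .groupBy("metric", "tags", "hour")
        .agg(F.avg("value").alias("avg"))
    )
    aggregated.write.mode("overwrite").parquet(checkpoint_path)

    # Step 2: read the checkpoint back and produce the final output. The
    # real pipeline writes a custom file format; Parquet stands in here.
    spark.read.parquet(checkpoint_path).write.mode("overwrite").parquet(output_path)
```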

slide-35
SLIDE 35

Break down jobs into smaller pieces

35

[Diagram: trade-off between fault tolerance and performance.]

slide-36
SLIDE 36

Lessons

  • Many clusters for better isolation.
  • Break down jobs into smaller pieces.
  • Trade-off between performance and fault tolerance.

36

slide-37
SLIDE 37
  • 1. Architecture
  • 2. Monitoring
  • 3. Failures handling

Highly reliable data pipelines

37

slide-38
SLIDE 38

Cluster tagging

  • #rollups
  • #anomaly detection

38
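A hedged sketch of tagging a cluster through the EMR API so that its metrics can later be filtered by pipeline in Datadog; the cluster id and tag values are placeholders:

```python
# Sketch: tag an EMR cluster so its metrics can be sliced by pipeline.
# The cluster id and tag values are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_tags(
    ResourceId="j-EXAMPLE123",  # the EMR cluster id
    Tags=[
        {"Key": "pipeline", "Value": "rollups"},
        {"Key": "team", "Value": "data-engineering"},
    ],
)
```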

slide-39
SLIDE 39

Monitor cluster metrics

39

slide-40
SLIDE 40

Monitor cluster metrics

40

slide-41
SLIDE 41

Monitor cluster metrics

41

slide-42
SLIDE 42

Monitor cluster metrics

42

slide-43
SLIDE 43

Monitor work metrics

More details: datadoghq.com/blog/monitoring-spark/

43
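One common way to report custom work metrics like these is DogStatsD; a minimal sketch follows, with the metric name, value, and tags made up for illustration:

```python
# Sketch: emit a custom work metric to Datadog via DogStatsD.
# Metric name, value, and tags are illustrative.
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)

# e.g. how many points a job wrote, tagged by pipeline and cluster.
statsd.gauge(
    "pipeline.points_written",
    1250000000,
    tags=["pipeline:rollups", "cluster:j-EXAMPLE123"],
)
```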

slide-44
SLIDE 44

Monitor work metrics

44

slide-45
SLIDE 45

Monitor work metrics

45

slide-46
SLIDE 46

Monitor work metrics

46

slide-47
SLIDE 47

Monitor data lag

47

[Dashboard: data lag for the 1 point/sec data and the 1 point/hour data.]
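Data lag can be thought of as "now minus the newest timestamp that has been fully processed"; a minimal sketch of reporting it per resolution is below (the lookup helper is hypothetical):

```python
# Sketch: compute data lag per resolution and report it as a gauge.
# latest_processed_timestamp() is a hypothetical helper.
import time
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)

def latest_processed_timestamp(resolution):
    # Hypothetical: would query the newest output timestamp for this
    # resolution; a fixed value stands in here.
    return 1526382440

for resolution in ("1s", "1h"):
    lag_seconds = time.time() - latest_processed_timestamp(resolution)
    statsd.gauge(
        "pipeline.data_lag_seconds",
        lag_seconds,
        tags=["resolution:" + resolution],
    )
```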

slide-48
SLIDE 48

Lessons

  • Measure, measure and measure!
  • Alert on meaningful and actionable metrics.
  • High level dashboards.

48

slide-49
SLIDE 49
  • 1. Architecture
  • 2. Monitoring
  • 3. Failures handling

Highly reliable data pipelines

49

slide-50
SLIDE 50

50

slide-51
SLIDE 51

Data pipelines will break

  • Hardware failures
  • Bad code changes
  • Upstream delays
  • Increasing volume of data

51

slide-52
SLIDE 52

Data pipelines will break

  • 1. Recover fast.
  • 2. Degrade gracefully.

52

slide-53
SLIDE 53

Recover fast

  • No long running job.
  • Switch from spot to on-demand clusters.
  • Increase cluster size.

53

slide-54
SLIDE 54

Recover fast: easy way to rerun jobs

  • Needed when jobs run but produce some bad data.
  • Not always trivial.

54

slide-55
SLIDE 55

Example: rerun the rollups pipeline

s3://bucket/ contains one prefix per month: 2018-01, 2018-02, 2018-03, 2018-04, 2018-05

55

slide-56
SLIDE 56

Example: rerun the rollups pipeline

s3://bucket/2018-05/ contains one prefix per run: as-of_2018-05-01, as-of_2018-05-02, ..., as-of_2018-05-21

56

slide-57
SLIDE 57

Example: rerun the rollups pipeline

s3://bucket/2018-05/: as-of_2018-05-01, as-of_2018-05-02, ..., as-of_2018-05-21 — active location

57

slide-58
SLIDE 58

Example: rerun the rollups pipeline

s3://bucket/2018-05/: as-of_2018-05-01, as-of_2018-05-02, ..., as-of_2018-05-21, as-of_2018-05-22 — active location

58

slide-59
SLIDE 59

Example: rerun the rollups pipeline

s3://bucket/2018-05/: as-of_2018-05-01, as-of_2018-05-02, ..., as-of_2018-05-21, as-of_2018-05-22 — active location

59

slide-60
SLIDE 60

Example: rerun the rollups pipeline

s3://bucket/2018-05/: as-of_2018-05-01, as-of_2018-05-02, ..., as-of_2018-05-21, as-of_2018-05-22 — active location

60

slide-61
SLIDE 61

Example: rerun the rollups pipeline

s3://bucket/2018-05/: as-of_2018-05-01, as-of_2018-05-02, ..., as-of_2018-05-21, as-of_2018-05-22, as-of_2018-05-22_run-2 — active location

61
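A hedged sketch of this rerun pattern: every run writes to a fresh as-of prefix, and a small pointer object is flipped to make one prefix the active location. The pointer key name (_ACTIVE) is an assumption for this example, not necessarily what Datadog uses:

```python
# Sketch: publish a rerun by flipping an "active location" pointer in S3.
# The pointer key name "_ACTIVE" is an assumption, not Datadog's convention.
import boto3

s3 = boto3.client("s3")
BUCKET = "bucket"
MONTH_PREFIX = "2018-05/"

def publish_run(run_prefix):
    # Point readers at a freshly written as-of prefix, e.g.
    # "2018-05/as-of_2018-05-22_run-2/".
    s3.put_object(
        Bucket=BUCKET,
        Key=MONTH_PREFIX + "_ACTIVE",
        Body=run_prefix.encode("utf-8"),
    )

def active_location():
    # Readers resolve the pointer before reading any data.
    obj = s3.get_object(Bucket=BUCKET, Key=MONTH_PREFIX + "_ACTIVE")
    return obj["Body"].read().decode("utf-8")

# After rerunning the pipeline into a new prefix, flip the pointer:
publish_run(MONTH_PREFIX + "as-of_2018-05-22_run-2/")
```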

slide-62
SLIDE 62

Degrade gracefully

62

[Diagram: shards A, B, C, D processed independently.]

  • Isolate issues to a limited number of customers.
  • Keep the functionality operational at the cost of performance/accuracy.

slide-63
SLIDE 63

Degrade gracefully: skip corrupted files

  • Job failures caused by a limited amount of corrupted input data.
  • Don’t ignore real widespread issues.

63
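One possible mechanism (the talk does not say this is Datadog's exact one) is Spark's ignoreCorruptFiles setting, kept behind an explicit knob so that widespread corruption still fails the job loudly:

```python
# Sketch: optionally skip a handful of corrupted input files in Spark.
# The knob should only be enabled for known, limited corruption.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rollups-sketch").getOrCreate()

skip_corrupted_files = True  # pipeline-level knob, off by default

if skip_corrupted_files:
    # Spark logs and skips unreadable files instead of failing the job.
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

df = spark.read.parquet("s3://example-bucket/raw/2018-05-21/")
```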

slide-64
SLIDE 64

Lessons

  • Think about potential issues ahead of time.
  • Have knobs ready to recover fast.
  • Have knobs ready to limit the customer facing impact.

64

slide-65
SLIDE 65

Conclusion

65

Building highly reliable data pipelines

slide-66
SLIDE 66

Conclusion

  • Know your time constraints.

66

Building highly reliable data pipelines

slide-67
SLIDE 67

Conclusion

  • Know your time constraints.
  • Break down jobs into small survivable pieces.

67

Building highly reliable data pipelines

slide-68
SLIDE 68

Conclusion

  • Know your time constraints.
  • Break down jobs into small survivable pieces.
  • Monitor cluster metrics, job metrics and data lags.

68

Building highly reliable data pipelines

slide-69
SLIDE 69

Conclusion

  • Know your time constraints.
  • Break down jobs into small survivable pieces.
  • Monitor cluster metrics, job metrics and data lags.
  • Think about failures ahead of time and get prepared.

69

Building highly reliable data pipelines

slide-70
SLIDE 70

Thanks!

We’re hiring! qf@datadoghq.com https://jobs.datadoghq.com

70