Crisis to Calm: A Story of Data Validation @ Netflix, by Lavanya Kanchanapalli




slide-1
SLIDE 1

Crisis to Calm: A Story of Data Validation @ Netflix

Lavanya Kanchanapalli

slide-2
SLIDE 2
slide-3
SLIDE 3
  • Rollback data
  • Increase capacity
  • Netflix starts to work
slide-4
SLIDE 4
Root Cause

  • Duplicate objects
  • Apps failed
  • Cascading failures
  • Netflix goes down!
slide-5
SLIDE 5
slide-6
SLIDE 6
slide-7
SLIDE 7
slide-8
SLIDE 8

[Diagram labels: Data Change; Change in Behavior; Netflix Microservices (App1, App2, ..., Appn) in the Cloud]

slide-9
SLIDE 9
slide-10
SLIDE 10
slide-11
SLIDE 11
slide-12
SLIDE 12

Metadata Architecture

[Diagram labels: Source Systems, Video Metadata Service, Amazon S3, Netflix Services, Traffic]

slide-13
SLIDE 13
  • Single publisher
  • Multiple consumers
  • Hollow (hollow.how)
  • Versioned data
  • Fast propagation
slide-14
SLIDE 14

Bad data happens!

slide-15
SLIDE 15

Leaked Content

slide-16
SLIDE 16

Disables Features

slide-17
SLIDE 17

Deletes data

slide-18
SLIDE 18
slide-19
SLIDE 19

Data change = Code Push

slide-20
SLIDE 20

1. Detection  2. Staggering  3. Rollback

slide-21
SLIDE 21

1. Detection  2. Staggering  3. Rollback

slide-22
SLIDE 22

1. Detection  2. Staggering  3. Rollback

slide-23
SLIDE 23

Many Unknowns

slide-24
SLIDE 24

Would it be too slow?

slide-25
SLIDE 25

Would there be too many failures?

slide-26
SLIDE 26

Would it cost too much?

slide-27
SLIDE 27

1. Detection  2. Staggering  3. Rollback

slide-28
SLIDE 28

1. Detection  2. Staggering  3. Rollback

slide-29
SLIDE 29

Chapter 1: Circuit Breakers

slide-30
SLIDE 30

Metadata Architecture

[Diagram labels: Source Systems, Video Metadata Service, Amazon S3, Netflix Services, Traffic]

slide-31
SLIDE 31
  • Integrity checks
  • Duplicate detection
  • Object counts
  • Semantic checks

Circuit Breakers
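
The slide lists the kinds of checks only by name. As a rough illustration, two of them (duplicate detection and object counts) might be sketched as below. This is a minimal sketch under stated assumptions, not the system described in the talk: the record shape, the `id` key, the 10% default threshold, and all function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_duplicates(records, key="id"):
    """Fail if any record key appears more than once in the new data version."""
    counts = Counter(r[key] for r in records)
    dupes = sorted(k for k, n in counts.items() if n > 1)
    return CheckResult("duplicate_detection", not dupes, f"duplicate keys: {dupes}")


def check_object_count(previous_count, records, max_change_ratio=0.1):
    """Fail if the object count moved by more than max_change_ratio (assumed default)."""
    current = len(records)
    if previous_count == 0:
        change = 0.0 if current == 0 else float("inf")
    else:
        change = abs(current - previous_count) / previous_count
    return CheckResult("object_count", change <= max_change_ratio,
                       f"count changed by {change:.1%}")


def should_publish(results):
    """The breaker blocks publication as soon as any single check fails."""
    return all(r.passed for r in results)
```

A publisher could run every check against a candidate version and publish only when `should_publish` returns True, so that a bad version never reaches consumers.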

slide-32
SLIDE 32

Know your data change

slide-33
SLIDE 33

Knobs are key to sanity

  • On/off
  • Threshold
  • Exclusions
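
One way to picture the three knobs is as per-breaker settings consulted before a check is allowed to block a data change. The sketch below is illustrative only: `BreakerKnobs`, `breaker_trips`, and the 5% default threshold are assumptions, not names or values from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class BreakerKnobs:
    enabled: bool = True                          # on/off switch
    threshold: float = 0.05                       # max tolerated fraction of changed objects
    exclusions: set = field(default_factory=set)  # object ids the check ignores


def breaker_trips(changed_ids, total_objects, knobs):
    """Return True if the breaker should block the data change."""
    if not knobs.enabled or total_objects == 0:
        return False
    counted = [i for i in changed_ids if i not in knobs.exclusions]
    return len(counted) / total_objects > knobs.threshold
```

With knobs like these, an operator could silence a noisy breaker, loosen its threshold for an expected large change, or exclude known-odd objects without a code change.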
slide-34
SLIDE 34
slide-35
SLIDE 35
slide-36
SLIDE 36

Business value is the key

slide-37
SLIDE 37
slide-38
SLIDE 38

Efficiency

  • Change Isolation
  • Sampling
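
The slide does not spell out how change isolation and sampling work. One plausible reading is that validation looks only at objects that differ between two versions and, if there are many, checks a sample of them. The sketch below follows that reading; the dict-keyed data shape and both function names are assumptions.

```python
import random

_MISSING = object()  # sentinel so that a stored None is not confused with "absent"


def changed_keys(previous, current):
    """Keys added, removed, or modified between two versions (dicts keyed by object id)."""
    return {
        k for k in previous.keys() | current.keys()
        if previous.get(k, _MISSING) != current.get(k, _MISSING)
    }


def sample_for_validation(keys, max_items, seed=0):
    """Return at most max_items keys, chosen reproducibly from the changed set."""
    ordered = sorted(keys)
    if len(ordered) <= max_items:
        return ordered
    return random.Random(seed).sample(ordered, max_items)
```

Fixing the seed keeps a validation run repeatable, which helps when a failure has to be investigated afterwards.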
slide-39
SLIDE 39

Chapter 2: Canaries

slide-40
SLIDE 40

Traditional Canaries

[Diagram labels: Canary (New Code), Baseline (Old Code), Shadow Traffic, Source Systems, Video Metadata Service, Amazon S3, Netflix Services, Traffic]

slide-41
SLIDE 41

Data Canaries

[Diagram labels: Source Systems, Video Metadata Service, Amazon S3, Netflix Data Canary Service, Netflix Services, Traffic]

slide-42
SLIDE 42

Netflix Data Canary Service

  • Pick key use case(s)
  • Pick data to test
  • Test with latest data

Netflix Service

Netflix Data Canary Service
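
The three steps on this slide (pick a use case, pick data to test, test with the latest data) can be sketched as a small harness. This is a hedged illustration, not the Netflix Data Canary Service itself: `run_data_canary`, `is_playable`, and the `streams` field are hypothetical.

```python
def run_data_canary(use_case, test_ids, latest_data):
    """Run one key use case against the latest data for a chosen set of ids.

    Returns the ids for which the use case raised or returned a falsy result.
    """
    failures = []
    for item_id in test_ids:
        try:
            ok = use_case(latest_data, item_id)
        except Exception:
            ok = False  # a crash on new data counts as a canary failure
        if not ok:
            failures.append(item_id)
    return failures


def is_playable(data, video_id):
    """Hypothetical use case: a video must exist and have at least one stream."""
    return len(data[video_id]["streams"]) > 0
```

An empty failure list would let the new data version proceed. A non-empty one would hold it back before any production service loads it.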

slide-43
SLIDE 43

1. Detection  2. Staggering  3. Rollback

slide-44
SLIDE 44

AWS Global Infrastructure

STAGGERED ROLLOUT
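
A staggered rollout can be pictured as releasing a data version to one region at a time and stopping at the first sign of trouble. The sketch below assumes caller-supplied `deploy` and `is_healthy` callables; these, and the function name, are hypothetical stand-ins for whatever tooling the rollout actually uses.

```python
def staggered_rollout(version, regions, deploy, is_healthy):
    """Release a data version one region at a time, halting on the first unhealthy region.

    Returns (completed_regions, failed_region); failed_region is None on full success.
    """
    completed = []
    for region in regions:
        deploy(region, version)
        if not is_healthy(region):
            return completed, region
        completed.append(region)
    return completed, None
```

Because the loop stops early, a bad version reaches only the failing region and the ones before it, which limits how much has to be rolled back.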

slide-45
SLIDE 45

1. Detection  2. Staggering  3. Rollback

slide-46
SLIDE 46

Keep calm & rollback

  • Pin back
  • Root cause
  • Unpin
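
The pin back, root cause, unpin sequence can be modeled as a pointer that normally follows the latest published version but can be overridden. This is a plain-Python stand-in written for illustration. It is not Hollow's API, and the class and method names are assumptions.

```python
class VersionPointer:
    """Tracks which data version consumers should load."""

    def __init__(self, latest):
        self.latest = latest
        self.pinned = None

    def announce(self, version):
        """The publisher announces a newly produced version."""
        self.latest = version

    def pin(self, version):
        """Pin back: force consumers onto a known-good version."""
        self.pinned = version

    def unpin(self):
        """Unpin once the root cause is fixed; consumers follow the latest version again."""
        self.pinned = None

    def current(self):
        """Version consumers should load right now."""
        return self.latest if self.pinned is None else self.pinned
```

While a pin is in place, newly announced versions are recorded but not served, so the publisher can keep producing fixes during the investigation.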
slide-47
SLIDE 47

Rollback

  • Visibility
  • Traversing
slide-48
SLIDE 48

Data Diff UI

slide-49
SLIDE 49

Circuit Breaker UI

slide-50
SLIDE 50

Pinning UI

slide-51
SLIDE 51
slide-52
SLIDE 52
slide-53
SLIDE 53

Data validation is key to high availability

  • Data change = Code push
  • Circuit breakers & canaries
  • Staggering and rollback
slide-54
SLIDE 54

Thank You