SLIDE 1 Crisis to Calm: A Story of Data Validation @ Netflix
Lavanya Kanchanapalli
SLIDE 2
SLIDE 3
- Rollback data
- Increase capacity
- Netflix starts to work
SLIDE 4
Root Cause
- Apps failed
- Cascading failures
- Netflix goes down!
SLIDE 5
SLIDE 6
SLIDE 7
SLIDE 8 Data Change / Change in Behavior
[Diagram: Netflix Microservices in the Cloud (App1, App2, ... Appn)]
SLIDE 9
SLIDE 10
SLIDE 11
SLIDE 12 Metadata Architecture
[Diagram: Source Systems, Video Metadata Service, Amazon S3, Netflix Services, Traffic]
SLIDE 13
- Single publisher
- Multiple consumers
- Hollow (hollow.how)
- Versioned data
- Fast propagation
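The single-publisher, multiple-consumer flow with versioned data can be sketched roughly as below. This is a toy in-memory model, not the Hollow API: `BlobStore`, `Publisher`, `Consumer`, and the `announced` field are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BlobStore:
    """Hypothetical stand-in for the shared store; maps version -> dataset."""
    snapshots: dict = field(default_factory=dict)
    announced: int = 0  # version that consumers are expected to load


class Publisher:
    """The single writer: stores each dataset under a new version number."""

    def __init__(self, store):
        self.store = store

    def publish(self, dataset):
        version = max(self.store.snapshots, default=0) + 1
        self.store.snapshots[version] = dict(dataset)  # keep one copy per version
        self.store.announced = version
        return version


class Consumer:
    """One of many readers: loads whatever version is currently announced."""

    def __init__(self, store):
        self.store = store
        self.version = 0
        self.data = {}

    def refresh(self):
        if self.store.announced != self.version:
            self.version = self.store.announced
            self.data = self.store.snapshots[self.version]
        return self.version
```

Because old versions stay in the store, moving consumers back to an earlier version only requires changing what is announced.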
SLIDE 14
Bad data happens!
SLIDE 15
Leaked Content
SLIDE 16
Disables Features
SLIDE 17
Deletes data
SLIDE 18
SLIDE 19
Data change = Code push
SLIDE 20
1 Detection, 2 Staggering, 3 Rollback
SLIDE 21
1 Detection, 2 Staggering, 3 Rollback
SLIDE 22
1 Detection, 2 Staggering, 3 Rollback
SLIDE 23
Many Unknowns
SLIDE 24
Would it be too slow?
SLIDE 25
Would there be too many failures?
SLIDE 26
Would it cost too much?
SLIDE 27
1 Detection, 2 Staggering, 3 Rollback
SLIDE 28
1 Detection, 2 Staggering, 3 Rollback
SLIDE 29
Chapter 1: Circuit Breakers
SLIDE 30 Metadata Architecture
[Diagram: Source Systems, Video Metadata Service, Amazon S3, Netflix Services, Traffic]
SLIDE 31 Circuit Breakers
- Integrity checks
- Duplicate detection
- Object counts
- Semantic checks
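The four kinds of circuit breaker checks listed on this slide might look something like the sketch below. The record fields (`id`, `title`, `runtime_minutes`), the semantic rule, and the 10% count threshold are assumptions made for illustration; the talk does not specify them.

```python
def integrity_failures(records):
    """Integrity check: records missing an id or a title (hypothetical fields)."""
    return [r for r in records if r.get("id") is None or not r.get("title")]


def duplicate_ids(records):
    """Duplicate detection: ids that appear more than once."""
    seen, dupes = set(), set()
    for r in records:
        rid = r.get("id")
        if rid is None:
            continue  # missing ids are reported by the integrity check
        if rid in seen:
            dupes.add(rid)
        seen.add(rid)
    return dupes


def count_change(previous, candidate):
    """Object counts: relative change in size between two versions."""
    if not previous:
        return 0.0 if not candidate else 1.0
    return abs(len(candidate) - len(previous)) / len(previous)


def semantic_failures(records):
    """Semantic check (hypothetical rule): a runtime, if present, is positive."""
    return [r for r in records if r.get("runtime_minutes", 1) <= 0]


def run_circuit_breakers(previous, candidate, max_count_change=0.10):
    """Return the names of the breakers that tripped; empty means safe to publish."""
    tripped = []
    if integrity_failures(candidate):
        tripped.append("integrity")
    if duplicate_ids(candidate):
        tripped.append("duplicates")
    if count_change(previous, candidate) > max_count_change:
        tripped.append("object_count")
    if semantic_failures(candidate):
        tripped.append("semantic")
    return tripped
```

In this sketch a non-empty result would block the new version from being announced to consumers.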
SLIDE 32
Know your data change
SLIDE 33 Knobs are key to sanity
- On/off
- Threshold
- Exclusions
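One way to picture the three knobs is as a small per-breaker configuration object. The sketch below is an assumption about how such knobs could be wired up; `BreakerKnobs`, `should_trip`, and the default threshold are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BreakerKnobs:
    enabled: bool = True          # on/off switch for this breaker
    threshold: float = 0.05       # fraction of changed objects that trips it
    exclusions: set = field(default_factory=set)  # ids that never count


def should_trip(changed_ids, total_objects, knobs):
    """Trip when the non-excluded share of changed objects exceeds the threshold."""
    if not knobs.enabled or total_objects == 0:
        return False
    relevant = [i for i in changed_ids if i not in knobs.exclusions]
    return len(relevant) / total_objects > knobs.threshold
```

Exclusions let a known, intentional change through without switching the whole breaker off.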
SLIDE 34
SLIDE 35
SLIDE 36
Business value is the key
SLIDE 37
SLIDE 38 Efficiency
- Change Isolation
- Sampling
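Both efficiency ideas can be illustrated with a short sketch: change isolation validates only the keys whose values differ between two versions, and sampling caps how many of those get checked. The function names and the fixed seed are assumptions for illustration.

```python
import random

_MISSING = object()  # sentinel so that a stored None still compares correctly


def changed_keys(previous, candidate):
    """Change isolation: keys that were added, removed, or modified."""
    keys = previous.keys() | candidate.keys()
    return sorted(
        k for k in keys
        if previous.get(k, _MISSING) != candidate.get(k, _MISSING)
    )


def sample_keys(keys, max_checks, seed=0):
    """Sampling: validate at most max_checks keys, chosen at random."""
    keys = list(keys)
    if len(keys) <= max_checks:
        return keys
    return random.Random(seed).sample(keys, max_checks)
```

A validation run would then call something like `sample_keys(changed_keys(old, new), max_checks)` and run the expensive checks only on that subset.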
SLIDE 39
Chapter 2: Canaries
SLIDE 40 Traditional Canaries
[Diagram: Canary (New Code), Baseline (Old Code), Shadow Traffic, Source Systems, Video Metadata Service, Amazon S3, Netflix Services, Traffic]
SLIDE 41 Data Canaries
[Diagram: Source Systems, Video Metadata Service, Amazon S3, Netflix Data Canary Service, Netflix Services, Traffic]
SLIDE 42 Netflix Data Canary Service
- Pick key use case(s)
- Pick data to test
- Test with latest data
[Diagram: Netflix Service, Netflix Data Canary Service]
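The three steps above (pick a key use case, pick data to test, test with the latest data) can be sketched as a small harness. This is an illustrative assumption, not the actual canary service: `run_data_canary`, `can_play`, and the `streams` field are hypothetical.

```python
def run_data_canary(use_case, test_ids, latest_data):
    """Run one key use case against the latest data for each chosen test id.

    Returns a list of (id, reason) pairs; an empty list means the canary passed.
    """
    failures = []
    for item_id in test_ids:
        try:
            ok = use_case(latest_data, item_id)
        except Exception as exc:  # a crash in the use case is also a failure
            failures.append((item_id, repr(exc)))
        else:
            if not ok:
                failures.append((item_id, "use case returned a failure"))
    return failures


def can_play(data, video_id):
    """Hypothetical key use case: a video is playable if it has any streams."""
    return bool(data[video_id].get("streams"))
```

In this sketch a non-empty result would stop the new data version before it reaches services that take production traffic.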
SLIDE 43
1 Detection, 2 Staggering, 3 Rollback
SLIDE 44 AWS Global Infrastructure
STAGGERED ROLLOUT
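A staggered rollout can be sketched as announcing a data version one region at a time and halting at the first unhealthy region. The `announce` and `is_healthy` callbacks are hypothetical placeholders for whatever deployment and monitoring hooks exist; the region order is up to the caller.

```python
def staggered_rollout(version, regions, announce, is_healthy):
    """Announce `version` region by region; stop at the first unhealthy region.

    Returns (completed_regions, failed_region), where failed_region is None
    when every region stayed healthy.
    """
    completed = []
    for region in regions:
        announce(region, version)
        if not is_healthy(region):
            return completed, region  # remaining regions never see the version
        completed.append(region)
    return completed, None
```

Halting early limits the impact of a bad version to the regions that have already received it, which are then the candidates for rollback.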
SLIDE 45
1 Detection, 2 Staggering, 3 Rollback
SLIDE 46 Keep calm & rollback
- Pin back
- Root cause
- Unpin
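The pin back, root cause, unpin sequence can be modeled with a small announcer that lets an operator override the latest version. This is a minimal sketch under assumed names (`VersionAnnouncer`, `pin`, `unpin`), not the actual tooling shown in the talk.

```python
class VersionAnnouncer:
    """Tracks the latest published version and an optional operator pin."""

    def __init__(self):
        self.latest = 0
        self.pinned = None

    def announce(self, version):
        self.latest = version

    def pin(self, version):
        """Pin back: force consumers onto a known-good version."""
        self.pinned = version

    def unpin(self):
        """Unpin: resume following the latest version once the cause is fixed."""
        self.pinned = None

    def version_to_load(self):
        return self.pinned if self.pinned is not None else self.latest
```

While a pin is in place, new versions can still be published and inspected without reaching consumers, which leaves room to find the root cause.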
SLIDE 48
Data Diff UI
SLIDE 49
Circuit Breaker UI
SLIDE 50
Pinning UI
SLIDE 51
SLIDE 52
SLIDE 53 Data validation is key to high availability
- Data change = Code push
- Circuit breakers & canaries
- Staggering and rollback
SLIDE 54
Thank You