Accept Partial Failures, Minimize Service Loss
Daxin Wang, Baidu
SLIDE 1
SLIDE 2
Baidu SRE
Diversified products Promote Experienced team
SLIDE 3
Too complicated to recover rapidly
- Network & Infrastructure
- Operation Mistake
- Software Bug
SLIDE 4
Basic model for reducing service loss during an incident
[Figure: failed requests over time (x-axis T1 to T12; y-axis ticks 30, 60, 90). Timeline marks: incident happened, operation, recovery (root cause recovery).]
$\mathit{service\_lost} = \int \mathit{lost}_t \, dt$
SLIDE 5
Root cause recovery vs. partial recovery
[Figure: failed requests over time (x-axis T1 to T12; y-axis ticks 30, 60, 90), comparing root cause recovery with partial recovery. Timeline marks: detected, partial recovery, root cause identified, complete recovery.]
$\mathit{service\_lost} = \int \mathit{lost}_t \, dt$
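The comparison on this slide can be made concrete with a small calculation. The sketch below approximates the service lost as the area under the failed-requests curve (trapezoidal rule) and compares waiting for root cause recovery with recovering partially first. The sample values and the function name `service_lost` are assumptions for illustration only; they are not data from the slides.

```python
def service_lost(failed_per_tick, dt=1.0):
    """Approximate the integral of lost_t over time with the trapezoidal rule."""
    total = 0.0
    for a, b in zip(failed_per_tick, failed_per_tick[1:]):
        total += (a + b) / 2.0 * dt
    return total


# Hypothetical failed-request rates sampled at T1..T12 (made-up numbers).
# Both incidents end at the same time; only the partial recovery differs.
root_cause_only = [0, 90, 90, 90, 90, 90, 90, 90, 90, 90, 0, 0]
partial_first = [0, 90, 90, 30, 30, 30, 30, 30, 30, 30, 0, 0]

loss_root = service_lost(root_cause_only)
loss_partial = service_lost(partial_first)
print("root cause recovery only:", loss_root)
print("partial recovery first:  ", loss_partial)
```

With these assumed numbers, the time to full recovery is identical in both cases, yet the partial recovery cuts the accumulated loss by more than half.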
SLIDE 6
Basic principles
- Deployment isolation
  - Limit failures to one cell, shift user queries rapidly
- Module isolation
  - Make the non-essential modules detachable
- User traffic isolation
  - Drop some of the queries to save the important ones
SLIDE 7
Deployment isolation
SLIDE 8
Deployment isolation
SLIDE 9
Deploy isolation -- Global Single Point
ZooKeeper
Third Party Service
SLIDE 10
Deploy isolation -- Service Across Cells
SLIDE 11
Deploy isolation -- Capacity Redundancy
- Real-time capacity measurement
- Periodic stress tests
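Capacity redundancy only helps if the surviving cells can actually take the traffic of a failed one. The following is a minimal sketch of such a check, assuming each cell reports a measured capacity and a current load in the same unit (for example queries per second). The cell names, the numbers, and the function `can_absorb_failure` are hypothetical and not from the talk.

```python
def can_absorb_failure(cells, failed):
    """Shift the failed cell's load to the surviving cells.

    Load is distributed in proportion to each survivor's spare capacity.
    Returns (True, new_load_per_cell) if it fits, otherwise (False, None).
    """
    survivors = {name: c for name, c in cells.items() if name != failed}
    spare = {
        name: max(0.0, c["capacity"] - c["load"])
        for name, c in survivors.items()
    }
    total_spare = sum(spare.values())
    shifted = cells[failed]["load"]
    if total_spare <= 0 or shifted > total_spare:
        return False, None
    new_load = {
        name: c["load"] + shifted * spare[name] / total_spare
        for name, c in survivors.items()
    }
    return True, new_load


# Hypothetical measurements: three equal cells at 60% utilization.
cells = {
    "cell-a": {"capacity": 100.0, "load": 60.0},
    "cell-b": {"capacity": 100.0, "load": 60.0},
    "cell-c": {"capacity": 100.0, "load": 60.0},
}
ok, new_load = can_absorb_failure(cells, "cell-a")
print(ok, new_load)
```

Running a check like this against real-time capacity measurements, for every cell in turn, shows whether a traffic shift is safe before it is needed.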
SLIDE 12
Deploy isolation -- Reduce Change Risks
- This applies not only to deployments, but also to operations
- Do not change all cells at the same time, especially in automation!
- Check system status after every stage of a change, manually if necessary
- Pay attention to the different operation entry points, and set global "locks"
SLIDE 13
Module isolation
- Every service will crash at some point
- Losing some detail is much better than a total outage
- Make every non-essential module detachable, even automatically
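One way to picture a detachable module is a request handler that treats only the core result as essential and skips any optional module that fails. The sketch below is an assumption-laden illustration, not the speaker's implementation: `render_page`, the module names, and the detach-on-first-error policy are all hypothetical.

```python
def render_page(query, core, optional_modules, detached):
    """Build a response from the essential core plus optional modules.

    An optional module that raises is detached (skipped from then on)
    instead of failing the whole request.
    """
    response = {"result": core(query)}  # essential: errors here propagate
    for name, module in optional_modules.items():
        if name in detached:
            continue
        try:
            response[name] = module(query)
        except Exception:
            # Automatic detach; a real system would also alert an operator.
            detached.add(name)
    return response


def broken_recommendations(query):
    raise TimeoutError("recommendation backend did not answer")


modules = {
    "recommendations": broken_recommendations,
    "ads": lambda query: ["ad-1"],
}
detached = set()
page = render_page("sre", lambda q: "results for " + q, modules, detached)
print(page, detached)
```

The page loses its recommendations but is still served, which is the "detail loss" the slide prefers over an outage.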
SLIDE 14
Module isolation -- External Dependencies
CDN DNS HttpDNS
SLIDE 15
User traffic isolation
- When capacity is insufficient, sacrifice part of the requests to save the more important part
- Which part?
  - Real user > Crawler
  - Paid user > Free user
  - Popular request > Long-tail request
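The three preferences above can be turned into a simple load-shedding rule. The sketch below ranks requests and keeps only as many as the capacity allows. Applying the three criteria in lexicographic order (crawler status first, then paid status, then popularity) is an assumption made here for illustration; the slides do not say how the criteria are combined. The request fields and function names are hypothetical.

```python
def priority(request):
    """Sort key: tuples that sort lower are kept first."""
    return (
        request["is_crawler"],      # real user before crawler
        not request["is_paid"],     # paid user before free user
        not request["is_popular"],  # popular request before long tail
    )


def shed_load(requests, capacity):
    """Return (kept, dropped) so that at most `capacity` requests are kept."""
    ranked = sorted(requests, key=priority)  # sorted() is stable
    return ranked[:capacity], ranked[capacity:]


incoming = [
    {"id": "r1", "is_crawler": True, "is_paid": False, "is_popular": True},
    {"id": "r2", "is_crawler": False, "is_paid": False, "is_popular": False},
    {"id": "r3", "is_crawler": False, "is_paid": True, "is_popular": True},
    {"id": "r4", "is_crawler": False, "is_paid": False, "is_popular": True},
]
kept, dropped = shed_load(incoming, capacity=2)
print([r["id"] for r in kept], [r["id"] for r in dropped])
```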
SLIDE 16
User traffic isolation -- Distinguish in Real Time
- Be prepared to drop requests at any time
- Crawlers may disguise their requests as human traffic
- One attempt: machine learning
SLIDE 17
Conclusion
SLIDE 18