Accept Partial Failures, Minimize Service Loss (Daxin Wang, Baidu)




SLIDE 1

Accept Partial Failures, Minimize Service Loss

Daxin Wang

SLIDE 2

Baidu SRE

Diversified products Promote Experienced team

SLIDE 3

Too complicated to recover rapidly

  • Network & Infrastructure
  • Operation Mistake
  • Software Bug

SLIDE 4

Basic model for reducing service loss during an incident

[Figure: failed requests (y-axis ticks 30, 60, 90) over time T1 to T12, marking "incident happened" and "recovery operation"; curve shown: root cause recovery]

$$\text{service\_lost} = \int \text{lost}(t)\,dt$$

SLIDE 5

Root cause recovery vs. partial recovery

[Figure: failed requests (y-axis ticks 30, 60, 90) over time T1 to T12 for two curves, root cause recovery and partial recovery; timeline markers: detected, partial recovery, root cause identified, complete recovery]

$$\text{service\_lost} = \int \text{lost}(t)\,dt$$
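The comparison on this slide can be made concrete by treating service loss as the area under the failed-requests curve. The sketch below approximates that area with a sum over equal time intervals. The function name `service_lost`, the interval length, and both data series are hypothetical illustrations, not figures from the slides.

```python
from typing import Sequence


def service_lost(failed_per_interval: Sequence[float], interval_minutes: float = 1.0) -> float:
    """Approximate the area under the failed-requests curve with equal-width intervals."""
    return sum(failed_per_interval) * interval_minutes


# Hypothetical failed-request counts for intervals T1..T12 (assumed, not from the slides).
# Case 1: nothing improves until the root cause is found and fixed at T11.
root_cause_only = [0, 90, 90, 90, 90, 90, 90, 90, 90, 90, 0, 0]
# Case 2: a partial recovery at T4 cuts failures while the root cause is still being sought.
with_partial_recovery = [0, 90, 90, 30, 30, 30, 30, 30, 30, 30, 0, 0]

loss_root = service_lost(root_cause_only)
loss_partial = service_lost(with_partial_recovery)
print(loss_root, loss_partial)
```

With these assumed numbers both incidents end at the same time, yet the partial recovery removes more than half of the total loss, which is the point the slide makes graphically.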

SLIDE 6

Basic principles

  • Deployment isolation

– Limit failures to one cell and shift user queries away rapidly

  • Module isolation

– Make the non-essential modules detachable

  • User traffic isolation

– Drop some of the queries to save the important ones

SLIDE 7

Deployment isolation

SLIDE 8

Deployment isolation

SLIDE 9

Deploy isolation – Global Single Point

ZooKeeper

Third Party Service

SLIDE 10

Deploy isolation – Service Across Cells

SLIDE 11

Deploy isolation – Capacity Redundancy

Realtime capacity measure Periodic stress test
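One way to read capacity redundancy is as a check that the remaining cells can absorb all traffic when any single cell is lost. The sketch below shows that check under this assumption. The function name, the cell names, and the capacity and load figures are hypothetical; in practice the inputs would come from the real-time measurements and stress tests mentioned on the slide.

```python
from typing import Dict


def survives_single_cell_loss(capacity: Dict[str, float], load: Dict[str, float]) -> bool:
    """Return True if, for every cell, the other cells together can carry the total load."""
    total_load = sum(load.values())
    total_capacity = sum(capacity.values())
    # Losing a cell removes its capacity, but its traffic is shifted to the survivors.
    return all(total_capacity - capacity[cell] >= total_load for cell in capacity)


# Hypothetical queries-per-second figures for three cells.
capacity = {"cell-a": 100.0, "cell-b": 100.0, "cell-c": 100.0}
normal_load = {"cell-a": 60.0, "cell-b": 60.0, "cell-c": 60.0}
peak_load = {"cell-a": 70.0, "cell-b": 70.0, "cell-c": 70.0}

print(survives_single_cell_loss(capacity, normal_load))
print(survives_single_cell_loss(capacity, peak_load))
```

Running such a check continuously against measured capacity would show when traffic growth has eaten the headroom that shifting queries between cells depends on.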

SLIDE 12

Deploy isolation – Reduce Change Risks

  • Not only deploy, but also operation
  • Do not change all cells at the same time, especially in

automation!!!

  • Check system status after every stage of change,

manually if necessary

  • Pay attention to different operation entry, set global

β€œlocks”

SLIDE 13

Module isolation

  • No service is immune to crashing
  • Losing some detail is much better than a total outage
  • Make every non-essential module detachable, even automatically
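A detachable module can be sketched as a wrapper that returns a fallback value when the module fails and stops calling it altogether after repeated failures. The class name `DetachableModule`, the failure threshold, and the `render_page` example are assumptions made for illustration; the slides do not describe a specific mechanism.

```python
from typing import Any, Callable, Dict, List


class DetachableModule:
    """Wraps a non-essential call and detaches it automatically after repeated failures."""

    def __init__(self, call: Callable[[str], Any], fallback: Any, max_failures: int = 3) -> None:
        self.call = call
        self.fallback = fallback
        self.max_failures = max_failures
        self.failures = 0
        self.detached = False

    def __call__(self, query: str) -> Any:
        if self.detached:
            return self.fallback  # module is detached: skip the call entirely
        try:
            result = self.call(query)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.detached = True
            return self.fallback
        self.failures = 0  # a success resets the consecutive-failure counter
        return result


def render_page(query: str, core_search: Callable[[str], Any],
                recommendations: Callable[[str], Any]) -> Dict[str, Any]:
    """The essential result is always computed; the non-essential part may be empty."""
    return {"results": core_search(query), "recommendations": recommendations(query)}


calls: List[str] = []


def broken_recommender(query: str) -> List[str]:
    calls.append(query)
    raise TimeoutError("recommender unavailable")


recs = DetachableModule(broken_recommender, fallback=[], max_failures=2)
pages = [render_page("q", lambda q: [q.upper()], recs) for _ in range(3)]
print(pages[-1], recs.detached, len(calls))
```

The page loses its recommendations but still serves the core results, and after two failures the broken module is no longer called at all. A production version would also need a way to re-attach the module once it recovers.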

SLIDE 14

Module isolation – External Dependencies

CDN, DNS, HttpDNS

SLIDE 15

User traffic isolation

  • When capacity is insufficient, sacrifice some requests to save the more important ones
  • Which ones?

– Real user > Crawler
– Paid user > Free user
– Popular request > Long tail request
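The priority rules above can be sketched as a load-shedding step that ranks requests and keeps only as many as current capacity allows. The slide gives three pairwise orderings but not how they combine, so the sketch assumes they apply in the listed order (real user first, then paid, then popular). The `Request` fields and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    id: str
    is_crawler: bool
    is_paid: bool
    is_popular: bool


def priority(req: Request) -> Tuple[bool, bool, bool]:
    """Higher tuples are more important; the order of the fields is an assumption."""
    return (not req.is_crawler, req.is_paid, req.is_popular)


def shed(requests: List[Request], capacity: int) -> List[Request]:
    """Keep the `capacity` most important requests and drop the rest."""
    ranked = sorted(requests, key=priority, reverse=True)
    return ranked[:capacity]


incoming = [
    Request("crawler", is_crawler=True, is_paid=False, is_popular=True),
    Request("free-tail", is_crawler=False, is_paid=False, is_popular=False),
    Request("paid-tail", is_crawler=False, is_paid=True, is_popular=False),
    Request("free-popular", is_crawler=False, is_paid=False, is_popular=True),
]
kept = [r.id for r in shed(incoming, capacity=2)]
print(kept)
```

Sorting a whole batch keeps the example short; a live system would more likely compare each request's priority against a threshold that rises as spare capacity shrinks.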

SLIDE 16

User traffic isolation – Distinguish Real-time

  • Prepare for dropping at any time
  • Crawlers may disguise the requests as human
  • Machine learning attempt
SLIDE 17

Conclusion

SLIDE 18

Daxin Wang (王达心)

Enjoy the fight against incidents