The Forces That Disrupt Netflix, Haley Tucker, Nov. 7, 2016



slide-1
SLIDE 1

The Forces That Disrupt Netflix

  • Nov. 7, 2016

Haley Tucker

slide-2
SLIDE 2

ACROBAT FLEA

our world

parallel world

slide-3
SLIDE 3

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."

- Leslie Lamport
slide-4
SLIDE 4

ACROBAT FLEA

our world

parallel world computing

ENGINEER

slide-5
SLIDE 5
slide-6
SLIDE 6

PROLOGUE

DISTRIBUTED SYSTEMS

slide-7
SLIDE 7
slide-8
SLIDE 8

Proxy/Routing

DECOMPOSING THE MONOLITH

[Diagram labels: Devices, Netflix Service, Edge Service, Playback Service, Traffic]

slide-9
SLIDE 9

Notes on Distributed Systems for Young Bloods

"Distributed systems are different because they fail often."

- Jeff Hodges
slide-10
SLIDE 10

TABLE OF CONTENTS

CHAPTER 1: THE WEIRD DATA IN THE CATALOG

  • Metadata impacts on availability

CHAPTER 2: THE VANISHING OF CRITICAL SERVICES

  • Crashing services and cascading failures

CHAPTER 3: THE THROTTLE

  • Latency spikes and the impact of fallbacks

FORCES AT WORK

slide-11
SLIDE 11
slide-12
SLIDE 12

Whoops, something went wrong…

Netflix Streaming Error

We’re having trouble playing this title right now. Please try again later or select a different title.

slide-13
SLIDE 13

CHAPTER ONE

THE WEIRD DATA IN THE CATALOG

slide-14
SLIDE 14
slide-15
SLIDE 15

45 MINUTES!!

Clock, by heyyobecky4lyfe, Tumblr

slide-16
SLIDE 16
slide-17
SLIDE 17

VIDEO METADATA

ARCHITECTURE

[Diagram labels: Source Systems, Video Metadata Service, Amazon S3, Netflix Services, Traffic]

slide-18
SLIDE 18

Amazon S3, Netflix Playback Service

{ String msg = "This should never happen!"; throw new IllegalStateException(msg); }

slide-19
SLIDE 19

MITIGATION

BLAST RADIUS

Explosion, CC BY 2.0, Andrew Kuznetsov 2008, Flickr

slide-20
SLIDE 20

AWS Global Infrastructure

slide-21
SLIDE 21

AWS Global Infrastructure

STAGGERED ROLLOUT
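
A staggered rollout limits blast radius by applying a change to one region at a time and stopping as soon as a region looks unhealthy. The sketch below is only an illustration of that idea: the class name, the `deployAndCheck` callback, and the region names used in the usage note are assumptions, not anything shown in the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper; not taken from the slides.
final class StaggeredRollout {
    /**
     * Applies a change region by region, in order.
     * deployAndCheck deploys to one region and returns true if it stays healthy.
     * Returns every region that was touched, including the one that failed.
     */
    static List<String> rollOut(List<String> regions, Predicate<String> deployAndCheck) {
        List<String> touched = new ArrayList<>();
        for (String region : regions) {
            touched.add(region);
            if (!deployAndCheck.test(region)) {
                break; // halt: later regions never see the bad change
            }
        }
        return touched;
    }
}
```

For example, with regions `us-east-1`, `us-west-2`, `eu-west-1` and a check that fails in `us-west-2`, only the first two regions are touched and `eu-west-1` keeps serving the old data.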

slide-22
SLIDE 22

Pager Diagnosis?

slide-23
SLIDE 23
slide-24
SLIDE 24

Canary, CC BY 2.0, Steve P2008 2014, Flickr

PREVENTION

CANARIES

slide-25
SLIDE 25

TRADITIONAL CANARY

[Diagram labels: Source Systems, Video Metadata Service, Amazon S3, Netflix Services, Canary (New Code), Baseline (Old Code), Traffic]

slide-26
SLIDE 26

CONSISTENCY

VALID STATE TRANSITIONS

slide-27
SLIDE 27

DATA CANARY

[Diagram labels: Source Systems, Video Metadata Service, Amazon S3, Data Canary Service, Data Tester, Netflix Services, Traffic]

slide-28
SLIDE 28

Australia with AAT, CC BY-SA 2.0, Ssolbergj 2010, Wikimedia

SEEING RETURNS

slide-29
SLIDE 29

Verify consistency prior to applying state changes.

…one tool is a data canary.
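
One way to picture a data canary is as a gate that inspects a candidate metadata snapshot before it is published to consumers. The following is a minimal sketch under stated assumptions: the snapshot is modeled as a map from title ID to runtime in minutes, and the two rules (no invalid runtimes, no more than 10% catalog shrinkage) are invented for illustration. The talk does not describe the actual checks.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical data canary: names, rules, and thresholds are illustrative assumptions.
final class DataCanary {
    // Assumed rule: refuse a snapshot that drops more than 10% of the current titles.
    static final double MAX_SHRINK = 0.10;

    /** Returns a rejection reason, or Optional.empty() if the candidate may be published. */
    static Optional<String> check(Map<String, Integer> currentRuntimes,
                                  Map<String, Integer> candidateRuntimes) {
        if (candidateRuntimes.isEmpty()) {
            return Optional.of("candidate snapshot is empty");
        }
        for (Map.Entry<String, Integer> e : candidateRuntimes.entrySet()) {
            Integer minutes = e.getValue();
            if (minutes == null || minutes <= 0) {
                return Optional.of("title " + e.getKey() + " has an invalid runtime");
            }
        }
        double shrink = 1.0 - (double) candidateRuntimes.size() / Math.max(1, currentRuntimes.size());
        if (shrink > MAX_SHRINK) {
            return Optional.of("catalog would shrink by " + Math.round(shrink * 100) + "%");
        }
        return Optional.empty();
    }
}
```

The publisher would call `check` first and keep serving the current snapshot whenever a reason comes back, so bad data is caught before any service applies the state change.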

slide-30
SLIDE 30

CHAPTER TWO

THE VANISHING OF CRITICAL SERVICES

slide-31
SLIDE 31
slide-32
SLIDE 32
slide-33
SLIDE 33

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."

- Leslie Lamport
slide-34
SLIDE 34

Proxy/Routing Devices

LOG DATA

[Diagram labels: Edge Service, Playback Service, Log Data Service, Cassandra, Traffic]

slide-35
SLIDE 35
slide-36
SLIDE 36

Proxy/Routing Devices

Proxy

[Diagram labels: Edge Service, Playback Service, Log Data Service, Cassandra, Traffic]

slide-37
SLIDE 37
slide-38
SLIDE 38

Proxy/Routing Devices

CASCADING FAILURE

[Diagram labels: Edge Service, Playback Service, Log Data Service, Cassandra, Traffic]

slide-39
SLIDE 39

{ throw new OutOfMemoryError(); }

Log Data Service Cassandra Playback Service

slide-40
SLIDE 40

PREVENTION

MANAGING RESOURCE CONSTRAINTS

Whatever you ask, CC BY-SA 2.0, Kreg Steppe 2008, Flickr

slide-41
SLIDE 41

Astronomical Clock, CC BY 2.0, Andrew Fleming 2011, Flickr

REDUCE SURFACE AREA

slide-42
SLIDE 42

1

Keep Only Dependencies which are Necessary

SO MANY JARS!!

slide-43
SLIDE 43

2

LIMIT “MAGIC”

Magic, CC BY-ND 2.0, Daniel Lee 2013, Flickr

slide-44
SLIDE 44

3

Medusa Kill Switch, CC BY-NC-ND 2.0, Scott Hart 2013, Flickr

ADD KILL SWITCHES
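
A kill switch can be as small as a runtime flag that lets operators turn off a non-critical code path without a deploy. This sketch assumes the flag lives in process memory; in practice it would likely be backed by a dynamic configuration system, which the slides do not name. `KillSwitch` and its methods are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical in-process kill switch; a real one would read from dynamic configuration.
final class KillSwitch {
    private final AtomicBoolean enabled = new AtomicBoolean(true);

    void disable() { enabled.set(false); }

    void enable() { enabled.set(true); }

    /** Runs the feature while the switch is on; otherwise returns the cheap default. */
    <T> T run(Supplier<T> feature, T whenDisabled) {
        return enabled.get() ? feature.get() : whenDisabled;
    }
}
```

Wrapping a call such as log upload in `run(...)` means that flipping the switch removes the dependency from the request path while the rest of the service keeps working.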

slide-45
SLIDE 45

4

FAVOR IMMUTABILITY

DEV Playback Service, TEST Playback Service, PROD Playback Service

slide-46
SLIDE 46

try {
    remoteService.call();
} catch (Throwable t) {
    // Oops!
    System.exit(1);
}

Log Data Service Cassandra Playback Service Proxy/Routing Traffic

slide-47
SLIDE 47

It's Electric, CC BY-ND 2.0, Alan Hochberg 2008, Flickr

MITIGATION

CIRCUIT BREAKERS
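
The circuit-breaker idea can be sketched in a few lines: after a run of consecutive failures the breaker opens and callers get a fallback immediately, and after a cool-down a single trial call decides whether to close it again. The class below is a simplified illustration with assumed names and parameters. It is not the API of Hystrix or any other library, and the injectable clock is there only to make the behaviour easy to test.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical minimal circuit breaker; not the API of any real library.
final class CircuitBreaker {
    private final int failureThreshold;   // consecutive failures before opening
    private final long coolDownMillis;    // how long to short-circuit once open
    private final LongSupplier clock;     // injectable time source, in milliseconds
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0;

    CircuitBreaker(int failureThreshold, long coolDownMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.coolDownMillis = coolDownMillis;
        this.clock = clock;
    }

    // Synchronized for simplicity; a production breaker would not hold a lock across the remote call.
    synchronized <T> T call(Supplier<T> remote, Supplier<T> fallback) {
        if (open && clock.getAsLong() - openedAt < coolDownMillis) {
            return fallback.get();            // open: skip the remote call entirely
        }
        try {
            T result = remote.get();          // closed, or a single trial call after the cool-down
            consecutiveFailures = 0;
            open = false;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (open || consecutiveFailures >= failureThreshold) {
                open = true;
                openedAt = clock.getAsLong(); // start or restart the cool-down
            }
            return fallback.get();
        }
    }

    synchronized boolean isOpen() { return open; }
}
```

Compared with the `System.exit(1)` handler on the previous slide, a failing dependency here costs the caller one cheap fallback instead of taking the whole process down.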

slide-48
SLIDE 48
slide-49
SLIDE 49

Wrecking Ball in Building, CC BY 2.0, Jason Eppink 2008, Flickr

FAILURE TESTING

slide-50
SLIDE 50

Proxy/Routing Devices

FAILURE TESTING

[Diagram labels: Log Data Service, Cassandra, Playback Service, Traffic]

Automating Chaos Experiments in Production, by Ali Basiri

Applying Failure Testing Research @Netflix, by Kolton Andrus and Peter Alvaro
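
At a much smaller scale than the production experiments referenced above, failure testing can start with a wrapper that deliberately fails some calls to a dependency so that the caller's fallback path actually gets exercised. The sketch below is a hypothetical, deterministic version (it fails every Nth call) and is not how the tooling in those talks works.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical fault injector for tests: fails every Nth call to the wrapped dependency.
final class FaultInjector<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private final int failEvery;
    private final AtomicLong calls = new AtomicLong();

    FaultInjector(Supplier<T> delegate, int failEvery) {
        if (failEvery <= 0) {
            throw new IllegalArgumentException("failEvery must be positive");
        }
        this.delegate = delegate;
        this.failEvery = failEvery;
    }

    @Override
    public T get() {
        if (calls.incrementAndGet() % failEvery == 0) {
            throw new IllegalStateException("injected failure");
        }
        return delegate.get();
    }
}
```

Placing such a wrapper around a client in a test makes it easy to confirm that an injected failure produces a fallback response rather than a crash.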

slide-51
SLIDE 51

Manage resource constraints by reducing surface area.

Leverage circuit breakers and rigorously test failures.

slide-52
SLIDE 52

CHAPTER THREE

THE THROTTLE

slide-53
SLIDE 53
slide-54
SLIDE 54

Proxy/Routing Devices

PLAYBACK ARCHITECTURE

[Diagram labels: Edge Service, Playback Service, URL Service, Traffic]

slide-55
SLIDE 55

NETFLIX CLIENT JARS

Playback Service

URL Service URL Client

  • Circuit-breakers and Fallbacks
  • Metrics
  • Retries and Timeouts
  • RPC
  • Service Discovery

slide-56
SLIDE 56
slide-57
SLIDE 57

Playback Service

[Chart labels: Traffic, Concurrent Requests, Throttled Requests (HTTP 503)]

THROTTLING
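
Throttling on concurrent requests can be sketched with a semaphore: each in-flight request holds a permit, and a request that cannot get one is rejected immediately with HTTP 503 instead of queueing. The class and method names are assumptions for illustration, and the handler is reduced to something that returns a status code.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical concurrency throttle; the handler simply returns an HTTP status code.
final class ConcurrencyThrottle {
    static final int SERVICE_UNAVAILABLE = 503;
    private final Semaphore permits;

    ConcurrencyThrottle(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    /** Runs the handler if a permit is free; otherwise sheds the request with 503. */
    int handle(Supplier<Integer> handler) {
        if (!permits.tryAcquire()) {
            return SERVICE_UNAVAILABLE; // reject fast, do no further work
        }
        try {
            return handler.get();
        } finally {
            permits.release();
        }
    }
}
```

The rejection path is deliberately cheap. That matters for the rest of this chapter: what the caller does on receiving the 503 determines whether throttling protects the system or moves the load elsewhere.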

slide-58
SLIDE 58
slide-59
SLIDE 59

} System.gc(); }

URL Service Playback Service Edge Service Proxy/Routing Traffic

slide-60
SLIDE 60

NETFLIX CLIENT JARS

Playback Service

URL Service URL Client

  • Circuit-breakers
  • Metrics
  • Retries and Timeouts
  • RPC
  • Service Discovery

Heavy Fallback

slide-61
SLIDE 61

FALLBACK TESTING

With 100% fallback, CPU held at 90%: 15 RPS

No fallback, CPU held at 90%: 58 RPS

Siege: https://github.com/JoeDog/siege

slide-62
SLIDE 62

SELECTING FALLBACKS

CACHE, STATIC, FALLBACK SERVICE
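
Two of these options, a cache and a static value, can be combined into a fallback that does almost no work: serve the last known good answer if there is one, otherwise a fixed default. The sketch below is an assumption-laden illustration; `LightFallback`, the title-ID key, and the placeholder default URL are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical light fallback: last known good value first, static default second.
final class LightFallback {
    // Placeholder value for illustration only.
    static final String STATIC_DEFAULT = "https://example.invalid/default";
    private final Map<String, String> lastKnownGood = new ConcurrentHashMap<>();

    /** Calls the primary lookup; on failure answers from memory without extra computation. */
    String resolve(String titleId, Function<String, String> primary) {
        try {
            String url = primary.apply(titleId);
            lastKnownGood.put(titleId, url); // remember the latest good answer
            return url;
        } catch (RuntimeException e) {
            return lastKnownGood.getOrDefault(titleId, STATIC_DEFAULT);
        }
    }
}
```

Both fallback branches are a map lookup or a constant, so a throttled dependency does not translate into extra CPU on the caller.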

slide-63
SLIDE 63

URL Service Playback Service Edge Service Proxy/Routing Traffic

} return Response.status(503).build(); }

slide-64
SLIDE 64

REQUEST BUCKETING

NON-CRITICAL

Experience or Performance Impact

CRITICAL

Customer Streaming Impact

Fire Buckets at Oakworth Station, CC BY 2.0, Tim Greene 2015, Flickr
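
Request bucketing amounts to classifying each request by what breaks if it fails, then routing the two classes to separate shards. The sketch below shows only the classification step, under assumptions: the request type names are invented, and anything not known to affect streaming is treated as non-critical.

```java
import java.util.Set;

// Hypothetical router: request type names are invented for illustration.
final class ShardRouter {
    enum Shard { CRITICAL, NON_CRITICAL }

    // Assumed set of request types whose failure would stop a customer from streaming.
    private static final Set<String> STREAMING_IMPACT = Set.of("startPlayback", "licenseRenewal");

    /** Streaming-impacting requests go to the critical shard; everything else does not. */
    static Shard route(String requestType) {
        return STREAMING_IMPACT.contains(requestType) ? Shard.CRITICAL : Shard.NON_CRITICAL;
    }
}
```

With the buckets separated, a surge or latency spike in non-critical traffic can only exhaust the non-critical shard's capacity.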

slide-65
SLIDE 65

APPLICATION SHARDING

[Diagram labels: Devices, Proxy/Routing, Edge Service, Critical Playback Service, Non-Critical Playback Service, URL Service, Non-Critical URL Service, Traffic]

slide-66
SLIDE 66

CRITICAL

Country Road at Sunrise, CC BY-SA 2.0, Susanne Nilssone 2014, Flickr

slide-67
SLIDE 67

NON-CRITICAL

Traffic, CC BY-NC 2.0, jonbgeme 2008, Flickr

slide-68
SLIDE 68

APPLICATION SHARDING

[Diagram labels: Devices, Proxy/Routing, Edge Service, Critical Playback Service, Non-Critical Playback Service, URL Service, Non-Critical URL Service, Traffic]

slide-69
SLIDE 69

No heavy fallbacks!! Fallbacks should be light and fast.

Shard your application based on operational characteristics.
slide-70
SLIDE 70

EPILOGUE

KEY TAKEAWAYS

slide-71
SLIDE 71

KEY TAKEAWAYS

CHAPTER 1: THE WEIRD DATA IN THE CATALOG

  • Verify consistency prior to applying state changes.
  • One tool is a data canary.

CHAPTER 2: THE VANISHING OF CRITICAL SERVICES

  • Manage resource constraints by reducing surface area.
  • Leverage circuit breakers and rigorously test failures.

CHAPTER 3: THE THROTTLE

  • No heavy fallbacks!! Fallbacks should be light and fast.
  • Shard your application based on operational characteristics.
slide-72
SLIDE 72

The unexpected will happen. Plan to fail.

slide-73
SLIDE 73

PARTING THOUGHT

DISTRIBUTED SYSTEMS SOCIAL

slide-74
SLIDE 74

Questions?

Haley Tucker