Microservices at Netflix Scale: First Principles, Tradeoffs, Lessons Learned



slide-1
SLIDE 1

Microservices at Netflix Scale

First Principles, Tradeoffs, Lessons Learned

Ruslan Meshenberg @rusmeshenberg

slide-2
SLIDE 2

Microservices: all benefits, no costs?

slide-3
SLIDE 3

Netflix is the world’s leading Internet television network with over 81 million members in over 190 countries enjoying more than 125 million hours of TV shows and movies per day, including original series, documentaries and feature films.

slide-4
SLIDE 4

Ruslan Meshenberg Director, Platform Engineering

  • Runtime Systems
  • Container Runtime
  • Persistence and Databases
  • Real Time Data Infrastructure
slide-5
SLIDE 5

Netflix runs on microservices

slide-6
SLIDE 6

Netflix journey to microservices

slide-7
SLIDE 7

Our journey took 7 years

https://media.netflix.com/en/company-blog/completing-the-netflix-cloud-migration

slide-8
SLIDE 8

Data Center - Monolith

RDBMS

slide-9
SLIDE 9

August 2008

slide-10
SLIDE 10

First Principles

slide-11
SLIDE 11

Buy vs. Build

  • Use or contribute to OSS technologies first
  • Only build what you have to
slide-12
SLIDE 12

Services should be stateless*

  • Must not rely on sticky sessions
  • Prove by Chaos testing

*Except the Persistence / Caching layers
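
One way to read the sticky-session rule: any instance must be able to serve any request, so per-user state lives in a shared layer rather than in instance memory. The following is a minimal sketch under that reading. `SessionStore` and `Instance` are hypothetical names, and the in-memory dictionary only stands in for a real persistence or caching layer; none of this is taken from the talk.

```python
class SessionStore:
    """Hypothetical stand-in for the shared persistence/caching layer."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        # Unknown sessions start from a fresh default state.
        return self._data.get(session_id, {"views": 0})

    def put(self, session_id, state):
        self._data[session_id] = state


class Instance:
    """A stateless service instance: it keeps no per-user state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, session_id):
        state = dict(self.store.get(session_id))  # copy, then modify
        state["views"] += 1
        self.store.put(session_id, state)
        return state["views"]


store = SessionStore()
a, b = Instance("a", store), Instance("b", store)

# The same session is served by different instances, as a non-sticky
# load balancer might route it.
counts = [a.handle("s1"), b.handle("s1"), a.handle("s1")]

# Simulate losing an instance: a brand-new one continues from shared state.
replacement = Instance("c", store)
counts.append(replacement.handle("s1"))
print(counts)
```

Because no instance holds the counter, killing any of them (the kind of event chaos testing produces) does not lose the session.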

slide-13
SLIDE 13

Scale out vs. scale up

  • If you keep scaling up, you’ll hit a limit
  • Horizontal scaling gives you a longer runway
slide-14
SLIDE 14

Redundancy and Isolation For Resiliency

  • Make more than one of anything
  • Isolate the blast radius for any given failure
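
One common way to limit the blast radius of a slow or failing dependency is a bulkhead: cap the number of concurrent calls to it and answer with a fallback once the cap is reached. The sketch below is an illustration of that idea only. The `Bulkhead` class and the "ratings" dependency are hypothetical and are not described in the talk.

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one dependency (hypothetical helper)."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, fallback):
        # Do not queue: if no slot is free, degrade immediately.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return fn()
        finally:
            self._slots.release()


ratings = Bulkhead(max_concurrent=1)


def slow_ratings_call():
    # While this call holds the only slot, a second caller is rejected
    # and receives the fallback instead of piling up behind it.
    nested = ratings.call(lambda: "live ratings", fallback=lambda: "cached ratings")
    return ("live ratings", nested)


result = ratings.call(slow_ratings_call, fallback=lambda: "cached ratings")
after = ratings.call(lambda: "live ratings", fallback=lambda: "cached ratings")
print(result, after)
```

The failure stays confined to callers of that one dependency, and the slot is released once the slow call returns.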
slide-15
SLIDE 15

Automate destructive testing

  • Simian Army
  • Started with Chaos Monkey
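
The idea behind automated destructive testing can be sketched in a few lines: pick a running instance at random, terminate it, and check that the service still meets its minimum capacity. This is a toy simulation, not the real Chaos Monkey. The function names, the instance ids, and the health rule are all assumptions made for illustration.

```python
import random


def terminate_random_instance(group, rng):
    """Remove one randomly chosen instance from the group and return its id."""
    victim = rng.choice(sorted(group))  # sorted() gives rng.choice a sequence
    group.remove(victim)
    return victim


def service_is_healthy(group, min_instances=1):
    """Hypothetical health rule: enough instances remain to serve traffic."""
    return len(group) >= min_instances


rng = random.Random(42)                # seeded so the run is repeatable
group = {"i-01", "i-02", "i-03"}       # hypothetical instance ids

victim = terminate_random_instance(group, rng)
print("terminated", victim, "healthy:", service_is_healthy(group, min_instances=2))
```

A service that only passes with `min_instances` equal to its full fleet size has no redundancy to lose.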
slide-16
SLIDE 16

First Principles In Action

slide-17
SLIDE 17

Stateless services

[Diagram: Service A and five instances of Service B]

slide-18
SLIDE 18

Verify stateless

slide-19
SLIDE 19

Data – from RDBMS to Cassandra

  • NoSQL at scale
  • Open Source
  • Multi-Regional
  • Multi-directional
  • Available
  • Partition Tolerance
  • Tunable Consistency*
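
Tunable consistency comes down to simple arithmetic: with N replicas, a read that contacts R replicas is guaranteed to overlap a write acknowledged by W replicas whenever R + W > N. The sketch below works that rule through. The replication factor of 3 is an assumption chosen for the example, and the function names are hypothetical.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


rf = 3                 # assumed replication factor
q = quorum(rf)         # 2 of 3 replicas

print(read_sees_latest_write(rf, q, q))   # quorum writes + quorum reads
print(read_sees_latest_write(rf, 1, 1))   # single-replica writes + reads
```

Quorum on both sides trades some latency for overlap, while single-replica reads and writes are faster but may return stale data.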
slide-20
SLIDE 20

Multi-Regional Replication

[Diagram: Region A and Region B, each with Zones A, B and C and its own client. Labels: Local Quorum (typical), bi-directional replication, 500ms, nightly compare & repair.]

slide-21
SLIDE 21

Last, but not least - Billing

slide-22
SLIDE 22

Microservices – Benefits

slide-23
SLIDE 23

Our Priorities

  • 1. Innovation
  • 2. Reliability
  • 3. Efficiency
slide-24
SLIDE 24

Innovation: tight coupling doesn’t work

Develop

  • Team A
  • Team B
  • Team C

Test

Release

slide-25
SLIDE 25

Innovation: Loose coupling

Team A

Develop, Test, Deploy, Support

Team B

Develop, Test, Deploy, Support

Team C

Develop, Test, Deploy, Support

slide-26
SLIDE 26

Architect

Design Develop Review Test Deploy Run Support

End-end ownership
slide-27
SLIDE 27

End-end ownership + velocity

[Diagram: the Architect, Design, Develop, Review, Test, Deploy, Run, Support lifecycle, repeated 12 times]

slide-28
SLIDE 28

Separation of concerns

UI

Feature A Feature B Feature C

Personalization

Feature D A/B Test E

Mid-tier

A/B Test F Feature H

Infrastructure

Availability Scalability Security Leverage

slide-29
SLIDE 29

Microservices – Costs

slide-30
SLIDE 30

Microservices is an org change!

Org changes are hard!

slide-31
SLIDE 31

Evolving the organization

slide-32
SLIDE 32

Central infrastructure investment

slide-33
SLIDE 33

Migration doesn’t happen overnight

  • Living in the hybrid world
  • Supporting 2 tech stacks
  • Double the maintenance
  • Multi-master data replication
slide-34
SLIDE 34

Microservices - Lessons Learned

slide-35
SLIDE 35

IPC is crucial for loose coupling

  • Common language between the services
  • Establishes the contract of interaction
slide-36
SLIDE 36

Caching to protect DBs

  • 1. Read from Cache
  • 2. On cache miss call service
  • 3. Service calls DB and responds
  • 4. Service updates the cache

[Diagram: a Client Application whose Client Library contains an EVCache Client and a Service Client. Requests go to the Cache and to Service instances (S), which are backed by DBs.]
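
The four steps above can be sketched as follows. This is a minimal illustration, not the EVCache client API: a plain dictionary stands in for the cache, and `CountingDB`, `Service`, and `ClientLibrary` are hypothetical names.

```python
class CountingDB:
    """Hypothetical database that counts how often it is read."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self.rows[key]


class Service:
    def __init__(self, db, cache):
        self.db = db
        self.cache = cache

    def get(self, key):
        value = self.db.read(key)   # 3. service calls DB and responds
        self.cache[key] = value     # 4. service updates the cache
        return value


class ClientLibrary:
    def __init__(self, cache, service):
        self.cache = cache
        self.service = service

    def get(self, key):
        if key in self.cache:           # 1. read from cache
            return self.cache[key]
        return self.service.get(key)    # 2. on cache miss, call the service


cache = {}                                   # stand-in for the shared cache
db = CountingDB({"title:1": "Example Show"})
client = ClientLibrary(cache, Service(db, cache))

first = client.get("title:1")    # miss: goes through the service to the DB
second = client.get("title:1")   # hit: served from the cache
print(first, second, "db reads:", db.reads)
```

Only the first request reaches the database. Repeat reads are absorbed by the cache, which is what protects the DB under load.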

slide-37
SLIDE 37

Operational visibility matters

If you can’t see it, you can’t improve it

slide-38
SLIDE 38

Will your Telemetry scale?

Observe, Orient, Decide, Act

slide-39
SLIDE 39

Edge

ELB Zuul Playback

API Middle Tier & Platform

EVCache Cassandra

slide-40
SLIDE 40

Reliability Matters

  • We strive for four 9s (99.99%) of availability
  • That leaves only 52 minutes of downtime per YEAR
  • Netflix outages lead to…
slide-41
SLIDE 41

Disappointment

slide-42
SLIDE 42

Outrage

slide-43
SLIDE 43

Withdrawal

slide-44
SLIDE 44

Humor

slide-45
SLIDE 45

Cascading failures

99% availability 99% availability 99% availability

0.99^500 ≈ 0.657%

(the combined availability of a request that depends on 500 services, each 99% available)
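
The availability arithmetic in this part of the talk can be checked directly. The sketch below assumes a 365-day year and assumes that services fail independently and that a request needs every one of them to succeed. Both function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability):
    """Yearly downtime budget implied by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR


def chained_availability(per_service, n_services):
    """Availability of a request that needs all n services to succeed."""
    return per_service ** n_services


four_nines_budget = downtime_minutes_per_year(0.9999)
chain_of_500 = chained_availability(0.99, 500)

print(round(four_nines_budget, 1), "minutes of downtime per year")
print(round(chain_of_500 * 100, 3), "% combined availability")
```

Under these assumptions, per-service availability that looks acceptable in isolation collapses once hundreds of dependencies are chained without fallbacks.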

slide-46
SLIDE 46

FIT

Fault-Injection Test Framework

Microservice failure
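
Fault injection can be illustrated with a small sketch: mark a dependency as failing for a test, then verify that the caller degrades to a fallback instead of propagating the error. This is not FIT's actual interface. `FaultInjector`, `InjectedFault`, and the recommendations example are hypothetical names used only to show the idea.

```python
class InjectedFault(Exception):
    """Raised in place of a real dependency failure during a test."""


class FaultInjector:
    """Hypothetical registry of dependencies that should fail on purpose."""

    def __init__(self):
        self._failing = set()

    def fail(self, service_name):
        self._failing.add(service_name)

    def clear(self):
        self._failing.clear()

    def check(self, service_name):
        if service_name in self._failing:
            raise InjectedFault(service_name)


def get_recommendations(user_id, injector):
    try:
        injector.check("recommendations")          # injection point
        return ["personalized-1", "personalized-2"]
    except InjectedFault:
        # Degraded but still useful response when the dependency is down.
        return ["popular-1", "popular-2"]


injector = FaultInjector()
normal = get_recommendations("u1", injector)

injector.fail("recommendations")
degraded = get_recommendations("u1", injector)

injector.clear()
recovered = get_recommendations("u1", injector)
print(normal, degraded, recovered)
```

A test like this passes only when a fallback path exists, which is the property that keeps one microservice failure from cascading.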

slide-47
SLIDE 47

Regional fail-over

slide-48
SLIDE 48

Regional fail-over

slide-49
SLIDE 49

A word on containers

  • Containers change the level of encapsulation from VM to process
  • Containers can help deliver a great developer experience
  • To run containers in production at scale…
slide-50
SLIDE 50

Requires something like this:

[Diagram: Titus container platform. Components shown include Titus UI, Titus API, Titus Master (Job Management & Scheduler), Fenzo, Mesos Master, Zookeeper, Cassandra, S3, Docker Registry, EC2 Autoscaling API, Rhea, and CI/CD integration. On the Amazon VMs: Titus Agent, mesos agent, Titus executor, docker, containers, metrics agent, logging agent, zfs, VPC networking driver, and AWS container metadata proxy.]

slide-51
SLIDE 51

Microservices - Resources

slide-52
SLIDE 52

http://netflix.github.com

slide-53
SLIDE 53

http://netflix.github.com

slide-54
SLIDE 54

http://netflix.github.com

slide-55
SLIDE 55

http://netflix.github.com

slide-56
SLIDE 56

http://netflix.github.com

slide-57
SLIDE 57

http://netflix.github.com

slide-58
SLIDE 58

Wrap up

slide-59
SLIDE 59

Microservices bring great value to development velocity, availability and other dimensions

slide-60
SLIDE 60

Microservices at scale require organizational change and centralized infrastructure investment

slide-61
SLIDE 61

Be aware of your situation and what works for you

slide-62
SLIDE 62

Questions?

Ruslan Meshenberg @rusmeshenberg