Headline Architecture Suudhan Rangarajan (@suudhan) Senior - - PowerPoint PPT Presentation

Netflix Play API: Why we built an Evolutionary Architecture. Suudhan Rangarajan (@suudhan), Senior Software Engineer


slide-1
SLIDE 1

Headline

Suudhan Rangarajan (@suudhan) Senior Software Engineer

Netflix Play API

Why we built an Evolutionary Architecture

slide-2
SLIDE 2
slide-3
SLIDE 3
slide-4
SLIDE 4
slide-5
SLIDE 5

Headline

Suudhan Rangarajan (@suudhan) Senior Software Engineer

Netflix Play API

Why we built an Evolutionary Architecture

slide-6
SLIDE 6

Previous Architecture Workflow

Sign-up Content Discovery Playback API Service

← Services hosted in AWS → Devices

Domain specific Microservices API Proxy Service

slide-7
SLIDE 7

Signup Workflow

← Services hosted in AWS → Devices

Signup API Sign-up Content Discovery Playback Domain specific Microservices API Proxy Service API Service

slide-8
SLIDE 8

Content Discovery Workflow

← Services hosted in AWS → Devices

Discovery API Sign-up Content Discovery Playback Domain specific Microservices API Proxy Service API Service

slide-9
SLIDE 9

Playback Workflow

← Services hosted in AWS → Devices

Play API Sign-up Content Discovery Playback Domain specific Microservices API Proxy Service API Service

slide-10
SLIDE 10

Previous Architecture

← Services hosted in AWS → Devices

Signup API Discovery API Play API Sign-up Content Discovery Playback Domain specific Microservices API Proxy Service API Service

slide-11
SLIDE 11

Identity Type 1/2 Decisions Evolvability

slide-12
SLIDE 12

Identity Type 1/2 Decisions Evolvability

slide-13
SLIDE 13

Start with WHY: Ask why your service exists

slide-14
SLIDE 14

  • Lead the Internet TV revolution to entertain billions of people across the world
  • Maximize customer engagement from signup to streaming
  • Enable acquisition, discovery, playback functionality 24/7

slide-15
SLIDE 15

API Identity: Deliver Acquisition, Discovery and Playback functions with high availability

slide-16
SLIDE 16

Single Responsibility Principle: Be wary of multiple identities rolled up into a single service

slide-17
SLIDE 17

One API Service

Signup API Discovery API Play API Signup API Discovery API Play API

API Service Per function Previous Architecture Current Architecture

slide-18
SLIDE 18

  • Lead the Internet TV revolution to entertain billions of people across the world
  • Maximize user engagement of Netflix customers from signup to streaming
  • Enable non-member, discovery, playback functionality 24/7
  • Deliver Playback Lifecycle 24/7

slide-19
SLIDE 19

Decide best playback experience Track events to measure playback experience Authorize playback experience

Play API

Devices

API Proxy Service

slide-20
SLIDE 20

Decide best playback experience Track events to measure playback experience Authorize playback experience

Devices

API Proxy Service

High Coupling, Low Evolvability

slide-21
SLIDE 21
slide-22
SLIDE 22

Play API Identity: Orchestrate Playback Lifecycle with stable abstractions

slide-23
SLIDE 23

Guiding Principle: We believe in a simple singular identity for our services. The identity relates to and complements the identities of the company, organization, team and its peer services

slide-24
SLIDE 24

Identity Type 1/2 Decisions Evolvability

slide-25
SLIDE 25

“Some decisions are consequential and irreversible or nearly irreversible – one-way doors – and these decisions must be made methodically, carefully, slowly, with great deliberation and consultation [...] We can call these Type 1 decisions…”

Quote from Jeff Bezos

slide-26
SLIDE 26

“...But most decisions aren’t like that – they are changeable, reversible – they’re two-way doors. If you’ve made a suboptimal Type 2 decision, you don’t have to live with the consequences for that long [...] Type 2 decisions can and should be made quickly by high judgment individuals or small groups.”

Quote from Jeff Bezos

slide-27
SLIDE 27

Three Type 1 Decisions to Consider

Synchronous & Asynchronous Data Architecture Appropriate Coupling

slide-28
SLIDE 28

Two types of Shared Libraries

Play API Service

  • Shared Libraries with common functions: Utilities, Cache, Metrics
  • Client Libraries used for inter-service communications: Client 1, Client 2, Client 3

slide-29
SLIDE 29

“Thick” shared libraries with 100s of dependent libraries (e.g. utilities jar)

Previous Architecture

1) Binary Coupling

slide-30
SLIDE 30

Hundreds of shared libraries spanning services across network boundaries

Previous Architecture

Binary coupling => Distributed Monolith

Utilities Utilities Utilities Service1 Service2 Service3

slide-31
SLIDE 31

“The evils of too much coupling between services are far worse than the problems caused by code duplication”

Sam Newman (Building Microservices)

slide-32
SLIDE 32

Play API Service Playback Decision Service

Playback Decision Client Previous Architecture

slide-33
SLIDE 33

Requests Per Second of API Service Increase in Latencies from the API Service Execution of Fallback via Play Decision Client

Clients with heavy Fallbacks

slide-34
SLIDE 34

Play API Service Playback Decision Service

Playback Decision Client Previous Architecture

2) Operational Coupling

slide-35
SLIDE 35

“Operational Coupling” might be an ok choice, if some services/teams are not yet ready to own and operate a highly available service.

slide-36
SLIDE 36

Many of the client libraries had the potential to bring down the API Service

Previous Architecture

Operational Coupling impacts Availability

Play API Service

slide-37
SLIDE 37

Play API Service Playback Decisions Service

client

Java Java

Previous Architecture

3) Language Coupling

slide-38
SLIDE 38

Play API Service

client

REST over HTTP 1.1

Unidirectional (Request/Response type APIs)

Previous Architecture

Playback Decisions Service Jersey Framework

Communication Protocol

slide-39
SLIDE 39

Requirements

Operationally “thin” Clients No or limited shared libraries Auto-generated clients for Polyglot support Bi-Directional Communication

slide-40
SLIDE 40
  • At Netflix, most use-cases were modelled as Request/Response

○ REST was a simple and easy way of communicating between services, so the choice of REST was incidental rather than intentional

  • Most of the services were not following RESTful principles.

○ The URL didn’t represent a unique resource; instead, the parameters passed in the call determined the response, which effectively made each call an RPC call

  • So we were agnostic to REST vs RPC, as long as the choice met our requirements

REST vs RPC

slide-41
SLIDE 41
slide-42
SLIDE 42

Previous Architecture (REST/HTTP1)

Play API Service → Playback Decisions, Playback Authorize, Playback Events

1) Operationally Coupled Clients
2) High Binary Coupling
3) Only Java
4) Unidirectional communication

Current Architecture (gRPC/HTTP2)

Play API Service → Playback Decisions, Playback Authorize, Playback Events

1) Minimal Operational Coupling
2) Limited Binary Coupling
3) Beyond Java
4) Beyond Request/Response

slide-43
SLIDE 43

Type 1 Decision: Appropriate Coupling

Consider “thin” auto-generated clients with bi-directional communication and minimize code reuse across service boundaries
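The contrast between an operationally coupled client and a "thin" one can be sketched in Java. This is only an illustration: the names `PlaybackDecisionApi`, `ThickDecisionClient` and `ThinDecisionClient` are hypothetical and do not come from the slides, and a real thin client would be generated from an interface definition rather than written by hand.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical remote interface, assumed for illustration only.
interface PlaybackDecisionApi {
    String decide(String titleId);
}

// "Thick" client: caching and fallback logic execute inside the caller's
// process, so their CPU, memory and failure modes belong to the caller.
final class ThickDecisionClient implements PlaybackDecisionApi {
    private final PlaybackDecisionApi remote;
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    ThickDecisionClient(PlaybackDecisionApi remote) {
        this.remote = remote;
    }

    @Override
    public String decide(String titleId) {
        try {
            String decision = remote.decide(titleId);
            cache.put(titleId, decision);
            return decision;
        } catch (RuntimeException e) {
            // The fallback runs in the calling service: operational coupling.
            return cache.getOrDefault(titleId, "default-decision");
        }
    }
}

// "Thin" client: forwards the call and surfaces failures unchanged, leaving
// the decision logic on the owning service's side of the network boundary.
final class ThinDecisionClient implements PlaybackDecisionApi {
    private final PlaybackDecisionApi remote;

    ThinDecisionClient(PlaybackDecisionApi remote) {
        this.remote = remote;
    }

    @Override
    public String decide(String titleId) {
        return remote.decide(titleId);
    }
}
```

With the thick client, a slow or failing decision service turns into extra work inside the API service; with the thin client, the caller sees the failure directly and chooses its own response.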

slide-44
SLIDE 44

Three Type 1 Decisions to Consider

Synchronous vs Asynchronous Data Architecture Appropriate Coupling

slide-45
SLIDE 45

PlayData getPlayData(String customerId, String titleId, String deviceId) {
    // Each call blocks the request thread until its result is available.
    CustomerInfo custInfo = getCustomerInfo(customerId);
    DeviceInfo deviceInfo = getDeviceInfo(deviceId);
    PlayData playdata = decidePlayData(custInfo, deviceInfo, titleId);
    return playdata;
}

slide-46
SLIDE 46

Request Handler Thread pool Client Thread pool

Typical Synchronous Architecture

slide-47
SLIDE 47

Request Handler Thread pool Client Thread pool getPlayData getCustomerInfo decidePlayData Return One thread per request

Typical Synchronous Architecture

getDeviceInfo Customer Service Device Service Play Data Decision Service

slide-48
SLIDE 48

Request Handler Thread pool Client Thread pool getPlayData getCustomerInfo decidePlayData Return One thread per request

Typical Synchronous Architecture

getDeviceInfo Customer Service Device Service Play Data Decision Service

Blocking Request Handler Blocking Client I/O

slide-49
SLIDE 49

Request Handler Thread pool Client Thread pool getPlayData getCustomerInfo decidePlayData Return One thread per request

Typical Synchronous Architecture

getDeviceInfo

Blocking Request Handler Blocking Client I/O

Works for Simple Request/Response Works for Limited Clients
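A minimal Java sketch of the one-thread-per-request model, assuming string stand-ins for the three downstream calls (the class name `SyncPlayHandler`, the pool size and the return values are illustrative, not taken from the slides):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SyncPlayHandler {
    // Stand-ins for remote calls; each one blocks the calling thread until done.
    static String getCustomerInfo(String customerId) { return "cust:" + customerId; }
    static String getDeviceInfo(String deviceId) { return "dev:" + deviceId; }
    static String decidePlayData(String cust, String dev, String titleId) {
        return cust + "|" + dev + "|" + titleId;
    }

    // The whole workflow runs top to bottom on a single request-handler thread.
    static String getPlayData(String customerId, String titleId, String deviceId) {
        String cust = getCustomerInfo(customerId);
        String dev = getDeviceInfo(deviceId);
        return decidePlayData(cust, dev, titleId);
    }

    // One pool thread stays occupied for the full duration of the request.
    static String handleOnPool(String customerId, String titleId, String deviceId) {
        ExecutorService requestHandlers = Executors.newFixedThreadPool(4);
        try {
            Callable<String> request = () -> getPlayData(customerId, titleId, deviceId);
            Future<String> result = requestHandlers.submit(request);
            return result.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            requestHandlers.shutdown();
        }
    }
}
```

The code is easy to follow and debug because a request maps to one thread, but the number of in-flight requests is capped by the pool size while threads sit idle waiting on I/O.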

slide-50
SLIDE 50

Beyond Request/Response

One Request - One Response
  • Request Play-data for Title X
  • Receive Play-data for Title X

One Request - Stream Response
  • Request Play-data for Titles X,Y,Z
  • Receive Play-data for Title X
  • Receive Play-data for Title Y
  • Receive Play-data for Title Z

Stream Request - One Response
  • Request Play-data for Title X
  • Request Play-data for Title Y
  • Request Play-data for Title Z
  • Receive Play-data for Titles X,Y,Z

Stream Request - Stream Response
  • Request Play-data for Title X
  • Request Play-data for Title Y
  • Receive Play-data for Title X
  • Get Play-data for Title Z
  • Receive Play-data for Title Y
  • Receive Play-data for Title Z
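The four interaction patterns on this slide can be sketched as method shapes in plain Java. Everything here is an assumption made for illustration: `Sink`, `CollectingSink` and `InMemoryPlayData` are hypothetical names, and the in-memory implementation only imitates what a streaming RPC framework would provide over the network.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal callback type standing in for a stream of messages.
interface Sink<T> {
    void next(T item);
    void done();
}

// Collects everything it receives; used here to observe the patterns.
final class CollectingSink<T> implements Sink<T> {
    final List<T> items = new ArrayList<>();
    boolean completed = false;

    @Override public void next(T item) { items.add(item); }
    @Override public void done() { completed = true; }
}

final class InMemoryPlayData {
    private static String playDataFor(String titleId) {
        return "play-data:" + titleId;
    }

    // One request, one response.
    static String unary(String titleId) {
        return playDataFor(titleId);
    }

    // One request, stream response: one message per requested title.
    static void serverStream(List<String> titleIds, Sink<String> responses) {
        for (String id : titleIds) {
            responses.next(playDataFor(id));
        }
        responses.done();
    }

    // Stream request, one response: the caller writes requests to the
    // returned sink and gets a single batch back when it signals done().
    static Sink<String> clientStream(Sink<List<String>> response) {
        List<String> batch = new ArrayList<>();
        return new Sink<String>() {
            @Override public void next(String titleId) { batch.add(playDataFor(titleId)); }
            @Override public void done() { response.next(batch); response.done(); }
        };
    }

    // Stream request, stream response: each request is answered as it arrives.
    static Sink<String> bidirectional(Sink<String> responses) {
        return new Sink<String>() {
            @Override public void next(String titleId) { responses.next(playDataFor(titleId)); }
            @Override public void done() { responses.done(); }
        };
    }
}
```

Only the first shape fits a plain request/response API; the other three need a protocol that keeps a stream open in at least one direction.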

slide-51
SLIDE 51

Request/Response Event Loop Outgoing Event Loop per client Worker Threads

Asynchronous Architecture

slide-52
SLIDE 52

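// zip stands for a reactive combinator (for example RxJava's Observable.zip);
// the slide does not name the library, so the Observable type is an assumption.
Observable<PlayData> getPlayData(String customerId, String titleId, String deviceId) {
    return Observable.zip(
        getCustomerInfo(customerId),   // started without blocking
        getDeviceInfo(deviceId),       // started without blocking
        (custInfo, deviceInfo) -> decidePlayData(custInfo, deviceInfo, titleId));
}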
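The same zip-style composition can be tried with nothing but the Java standard library. The sketch below uses `CompletableFuture.thenCombine` as a stand-in for the slide's zip; the class name `AsyncPlayData` and the string return values are hypothetical, and `supplyAsync` only simulates a non-blocking remote call.

```java
import java.util.concurrent.CompletableFuture;

final class AsyncPlayData {
    // Stand-ins for non-blocking remote calls (assumed for illustration).
    static CompletableFuture<String> getCustomerInfo(String customerId) {
        return CompletableFuture.supplyAsync(() -> "cust:" + customerId);
    }

    static CompletableFuture<String> getDeviceInfo(String deviceId) {
        return CompletableFuture.supplyAsync(() -> "dev:" + deviceId);
    }

    static String decidePlayData(String cust, String dev, String titleId) {
        return cust + "|" + dev + "|" + titleId;
    }

    // thenCombine plays the role of zip: the combining function runs only
    // after both inputs have completed, and no caller thread blocks meanwhile.
    static CompletableFuture<String> getPlayData(
            String customerId, String titleId, String deviceId) {
        return getCustomerInfo(customerId)
            .thenCombine(getDeviceInfo(deviceId),
                         (cust, dev) -> decidePlayData(cust, dev, titleId));
    }
}
```

The method returns immediately with a future; which thread eventually runs `decidePlayData` depends on which input completes last, which is exactly why the workflow becomes harder to follow.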

slide-53
SLIDE 53

Request/Response Event Loop Outgoing Event Loop per client Workflow spans many worker threads

Asynchronous Architecture

Customer Service Device Service PlayData Service

setup

slide-54
SLIDE 54

Request/Response Event Loop Outgoing Event Loop per client Workflow spans many worker threads

Asynchronous Architecture

Customer Service Device Service PlayData Service

getCustomerInfo

slide-55
SLIDE 55

Request/Response Event Loop Outgoing Event Loop per client Workflow spans many worker threads

Asynchronous Architecture

Customer Service Device Service PlayData Service

getDeviceInfo

slide-56
SLIDE 56

Request/Response Event Loop Outgoing Event Loop per client Workflow spans many worker threads

Asynchronous Architecture

Customer Service Device Service PlayData Service

zip

slide-57
SLIDE 57

Request/Response Event Loop Outgoing Event Loop per client Workflow spans many worker threads

Asynchronous Architecture

Customer Service Device Service PlayData Service

decidePlayData

slide-58
SLIDE 58
  • All context is passed as messages from one processing unit to another.

  • If we need to follow and reason about a request, we need to build tools to capture and reassemble the order of execution units

  • None of the calls can block

  • None of the calls can block

Workflow spans multiple threads
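One way to picture context passed as messages is an immutable context object that each stage receives and extends. This is a sketch under stated assumptions: `RequestContext` and `TracedWorkflow` are hypothetical names, the stages run one after another for brevity (the slides run the customer and device lookups in parallel), and the recorded thread names depend on the runtime.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Immutable context handed from stage to stage as a message.
final class RequestContext {
    final String requestId;
    final List<String> trace;

    RequestContext(String requestId, List<String> trace) {
        this.requestId = requestId;
        this.trace = Collections.unmodifiableList(new ArrayList<>(trace));
    }

    // Returns a new context that records the stage and the thread it ran on.
    RequestContext step(String name) {
        List<String> next = new ArrayList<>(trace);
        next.add(name + "@" + Thread.currentThread().getName());
        return new RequestContext(requestId, next);
    }
}

final class TracedWorkflow {
    // Each stage may run on a different worker thread, so nothing can rely on
    // thread-local state; the context travels with the data instead.
    static CompletableFuture<RequestContext> run(String requestId) {
        RequestContext start = new RequestContext(requestId, new ArrayList<>());
        return CompletableFuture.supplyAsync(() -> start.step("getCustomerInfo"))
            .thenApplyAsync(ctx -> ctx.step("getDeviceInfo"))
            .thenApplyAsync(ctx -> ctx.step("decidePlayData"));
    }
}
```

The `trace` list is the kind of information a tool would have to capture and reassemble in order to reason about a single request after the fact.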

slide-59
SLIDE 59

Request/Response Event Loop Outgoing Event Loop per client Worker Threads

Asynchronous Architecture

Asynchronous Request Handler Non-Blocking I/O

slide-60
SLIDE 60

Synchrony

Ask: Do you really have a need beyond Request/Response?

slide-61
SLIDE 61

Network Event Loop Outgoing Event Loop per client Dedicated thread

Synchronous Execution + Asynchronous I/O Blocking Request Handler Non-Blocking I/O

Current Architecture

getPlayData getCustomerInfo decidePlayData Return getDeviceInfo

slide-62
SLIDE 62

Type 1 Decision: Synchronous vs Asynchronous

If most of your APIs fit the Request/Response pattern, consider a synchronous request handler with non-blocking I/O
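A minimal sketch of that hybrid, assuming `CompletableFuture.supplyAsync` as a stand-in for a non-blocking client (the class `HybridPlayHandler` and its method names are hypothetical): the outgoing calls overlap, but the workflow itself stays on the request thread.

```java
import java.util.concurrent.CompletableFuture;

final class HybridPlayHandler {
    // Stand-ins for clients that perform their I/O without blocking the caller.
    static CompletableFuture<String> getCustomerInfoAsync(String customerId) {
        return CompletableFuture.supplyAsync(() -> "cust:" + customerId);
    }

    static CompletableFuture<String> getDeviceInfoAsync(String deviceId) {
        return CompletableFuture.supplyAsync(() -> "dev:" + deviceId);
    }

    static String decidePlayData(String cust, String dev, String titleId) {
        return cust + "|" + dev + "|" + titleId;
    }

    // Synchronous handler: reads top to bottom and returns a plain value.
    static String getPlayData(String customerId, String titleId, String deviceId) {
        // Both lookups are in flight at the same time.
        CompletableFuture<String> cust = getCustomerInfoAsync(customerId);
        CompletableFuture<String> dev = getDeviceInfoAsync(deviceId);
        // The dedicated request thread waits here, so the decision step and
        // any stack trace stay on the thread that received the request.
        return decidePlayData(cust.join(), dev.join(), titleId);
    }
}
```

The handler keeps the debuggability of the synchronous model while the waiting time is roughly that of the slower lookup rather than the sum of both.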

slide-63
SLIDE 63

Three Type 1 Decisions to Consider

Synchronous vs Asynchronous Data Architecture Appropriate Coupling

slide-64
SLIDE 64

Without an intentional Data Architecture, Data becomes its own monolith
slide-65
SLIDE 65

Previous Architecture

What a Data Monolith looks like

Data Source Data Source Data Source

Service 1 Service 2 Service 3 Service 4

slide-66
SLIDE 66

4 GB 1 GB 2 GB 400 MB 600 MB

API Service ← Multiple Data sources loaded in memory → ← Memory Load →

Previous Architecture

What a Data Monolith looks like

slide-67
SLIDE 67

4 GB 1 GB 2 GB 400 MB 600 MB

API Service Very small percentage of data actually accessed

Previous Architecture

What a Data Monolith looks like

slide-68
SLIDE 68

API Service: each Data Source's model gets coupled across classes and libraries

Previous Architecture

What a Data Monolith looks like

slide-69
SLIDE 69

API Service Unpredictable Performance Characteristics Data Update CPU Utilization

Previous Architecture

What a Data Monolith looks like

slide-70
SLIDE 70

What a Data Monolith looks like

API Service

Potential to bring down the service

Data Update Netflix was down

Previous Architecture

slide-71
SLIDE 71

"All problems in computer science can be solved by another level of indirection." David Wheeler

(World’s first Comp Sci PhD)

slide-72
SLIDE 72

Current Architecture

Data Source Data Source Data Source Data Source Data Source Data Loader Data Service

Play API Service

Data Store

Materialized View

slide-73
SLIDE 73

Current Architecture

Data Source Data Source Data Source Data Source Data Source Data Loader Data Service

Uses only the data it needs Predictable Operational Characteristics Reduced Dependency chain

Data Store

Play API Service Materialized View

slide-74
SLIDE 74

Type 1 Decision: Data Architecture

Isolate Data from the Service. At the very least, ensure that data sources are accessed via a layer of abstraction, which leaves room for extension later
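A layer of abstraction over a data source can be as small as one interface. In this sketch the names `TitleMetadataStore`, `InMemoryTitleStore` and `PlayService` are hypothetical, and the in-memory map merely stands in for whatever view the service reads today.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// The service depends only on this interface, never on a data source's model.
interface TitleMetadataStore {
    Optional<String> find(String titleId);
}

// One possible implementation: a small view held in memory.
final class InMemoryTitleStore implements TitleMetadataStore {
    private final Map<String, String> view = new HashMap<>();

    InMemoryTitleStore put(String titleId, String metadata) {
        view.put(titleId, metadata);
        return this;
    }

    @Override
    public Optional<String> find(String titleId) {
        return Optional.ofNullable(view.get(titleId));
    }
}

// The service code stays the same if the store is later replaced by a client
// of a separate data service.
final class PlayService {
    private final TitleMetadataStore store;

    PlayService(TitleMetadataStore store) {
        this.store = store;
    }

    String describe(String titleId) {
        return store.find(titleId).orElse("unknown title");
    }
}
```

Because `PlayService` sees only `find`, swapping the in-memory view for a remote materialized view is a change to one class rather than to every caller.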

slide-75
SLIDE 75

Three Type 1 Decisions to Consider

Synchrony Data Architecture Appropriate Coupling

slide-76
SLIDE 76

For Type 2 decisions, choose a path, experiment and iterate

slide-77
SLIDE 77

Guiding Principle: Identify your Type 1 and Type 2 decisions; Spend 80% of your time debating and aligning on Type 1 Decisions

slide-78
SLIDE 78

Identity Type 1/2 Decisions Evolvability

slide-79
SLIDE 79

An Evolutionary Architecture supports guided and incremental change as a first principle among multiple dimensions

ThoughtWorks
slide-80
SLIDE 80

Choosing a microservices architecture with appropriate coupling allows us to evolve across multiple dimensions

slide-81
SLIDE 81

How evolvable are the Type 1 decisions

Change Play API Current Architecture Previous Architecture

Asynchronous? Polyglot services? Bidirectional APIs? Additional Data Sources?

Known Unknowns

slide-82
SLIDE 82

Potential Type 1 decisions in the future?

Change Play API Current Architecture Previous Architecture

Containers? Serverless?

? ? And we fully expect that there will be Unknown Unknowns

slide-83
SLIDE 83

As we evolve, how to ensure we are not breaking our original goals?

slide-84
SLIDE 84

Use Fitness Functions to guide change

slide-85
SLIDE 85

High Availability Low Latency Simplicity Reliability High Throughput Observability Developer Productivity Continuous Integration Scalable Evolvability

slide-86
SLIDE 86

High Availability Low Latency Simplicity Reliability High Throughput Observability Developer Productivity Continuous Integration Scalable Evolvability

1 2 3 4

slide-87
SLIDE 87

Why Simplicity over Reliability?

Increase in Operational Complexity Reliable Fallback when service is down

slide-88
SLIDE 88

Why Scalability over Throughput?

New instances were added Increase in Errors due to cache warming

slide-89
SLIDE 89

Why Observability over Latency?

Decrease in latency by using a fully async executor Cost of Async: Loss in Observability

slide-90
SLIDE 90

Four 9s of availability Thin Clients P99 latency Resilience to failures Merge to Deploy Time

1 2 3

slide-91
SLIDE 91

Guiding Principle: Define Fitness functions to act as your guide for architectural evolution
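A fitness function can be expressed as an automated check that runs against measured data. The sketch below covers two of the measures named earlier, P99 latency and four 9s of availability; the class name `FitnessFunctions`, the nearest-rank percentile method and any latency budget passed in are assumptions, since the slides give no thresholds or implementation.

```java
import java.util.Arrays;

final class FitnessFunctions {
    // Nearest-rank percentile: the smallest sample such that at least pct
    // percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMs, double pct) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Passes when the measured P99 latency stays within the agreed budget.
    static boolean latencyFit(long[] samplesMs, long p99BudgetMs) {
        return percentile(samplesMs, 99.0) <= p99BudgetMs;
    }

    // Four 9s: at least 99.99% of requests succeeded (integer arithmetic
    // avoids floating-point rounding at the boundary).
    static boolean availabilityFit(long succeeded, long total) {
        return total > 0 && succeeded * 10_000L >= total * 9_999L;
    }
}
```

Checks like these could run in a build pipeline or against production metrics, so that a proposed architectural change is rejected when it pushes a prioritized measure out of bounds.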

slide-92
SLIDE 92

Previous Architecture Current Architecture

Multiple Identities → Singular Identities
Operational Coupling → Operational Isolation
Binary Coupling → No Binary Coupling
Only Java → Beyond Java
Synchronous communication → Asynchronous communication
Data Monolith → Explicit Data Architecture
Current Architecture only: Guided Fitness Functions

slide-93
SLIDE 93
  • No incidents in a year

  • 4.5 deployments per week

  • Just two rollbacks!
slide-94
SLIDE 94

Identity Type 1/2 Decisions Evolvability

Build an Evolutionary Architecture