Lyft's Envoy: Embracing a Service Mesh, by Matt Klein / @mattklein123



SLIDE 1

@mattklein123

Lyft's Envoy: Embracing a Service Mesh

Matt Klein / @mattklein123, Software Engineer @Lyft

SLIDE 2

Lyft ~5 years ago

Diagram components: Internet, clients, AWS ELB, PHP / Apache monolith, MongoDB.

Simple! No microservices! (but still not that simple)

SLIDE 3

Lyft ~3 years ago

Diagram components: Internet, clients, AWS external ELB, PHP / Apache monolith (+haproxy/nsq), AWS internal ELBs, Python services, MongoDB, DynamoDB.

Not simple! Microservices! With monolith! (and some haproxy/nsq)

SLIDE 4

Lyft’s microservice architecture problems 3 years ago

  • Multiple Languages and frameworks.
  • Many Protocols (HTTP/1, HTTP/2, gRPC, databases, caching, etc.).
  • Black box load balancers (AWS ELB).
  • Lack of consistent Observability (stats, tracing, and logging).
  • Partial or no implementations of retry, circuit breaking, rate limiting, timeouts, and other distributed systems best practices.

  • Minimal Authentication and Authorization.
  • Per language libraries for service calls.
  • Extremely difficult to debug latency and failures.
  • Developers did not trust the microservice architecture.
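To make the bullet on retries and circuit breaking concrete, here is a minimal sketch of the logic each service would otherwise have to reimplement in its own language. All names (`CircuitBreaker`, `call_with_retries`) and defaults are hypothetical and for illustration only; this is not Lyft's code or Envoy's implementation.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then rejects
    calls until `reset_after` seconds have passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Reset window elapsed: allow a trial call (half-open state).
            # One more failure reopens the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def call_with_retries(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry `fn` up to `attempts` times with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # never retry into an open circuit
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Keeping a copy of this in every language and framework is what leads to the partial and inconsistent implementations listed above.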
SLIDE 5

Lyft’s architecture problems 3 years ago

A really big and confusing mess...

SLIDE 6

What is Envoy and the service mesh?

The network should be transparent to applications. When network and application problems do occur, it should be easy to determine the source of the problem.

SLIDE 7

Service mesh refresher

Diagram: instances of services A, B, C, and D, each paired with its own sidecar proxy.

SLIDE 8

Envoy

  • Out of process architecture
  • High performance / low latency code base
  • L3/L4 filter architecture
  • HTTP L7 filter architecture
  • HTTP/2 first
  • Service discovery and active/passive health checking
  • Advanced load balancing
  • Best in class observability (stats, logging, and tracing)
  • Authentication and authorization
  • Edge proxy
SLIDE 9

Observability

  • Observability is by far the most important thing that Envoy and the service mesh provide.
  • Having all traffic transit through Envoy provides a single place to:
    ○ Produce consistent statistics for every hop.
    ○ Create and propagate a stable request ID / tracing context.
    ○ Produce consistent logging.
    ○ Enable distributed tracing.
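As a rough illustration of the first two sub-points, the sketch below attaches a request ID to every request and emits the same counters for every hop. `x-request-id` is the header Envoy uses for request IDs; the function names and stat names are made up for this example and are not Envoy's real stat names. Header keys are assumed to be lower-case.

```python
import uuid
from collections import Counter

REQUEST_ID_HEADER = "x-request-id"  # header Envoy uses for request IDs

stats = Counter()  # stand-in for a real stats sink


def ensure_request_id(headers):
    """Return a copy of `headers` that always carries a request ID."""
    out = dict(headers)
    if REQUEST_ID_HEADER not in out:
        out[REQUEST_ID_HEADER] = str(uuid.uuid4())
    return out


def record_hop(caller, callee, status):
    """Emit the same counters for every hop, whatever the service."""
    prefix = f"{caller}.to.{callee}"  # hypothetical naming scheme
    stats[f"{prefix}.rq_total"] += 1
    stats[f"{prefix}.rq_{status // 100}xx"] += 1
```

Because the ID is only generated when missing, the same value survives across hops and can be used to join logs and traces for one request.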

SLIDE 10

Lyft today

Diagram components: Internet, clients, front / edge, legacy monolith, Python services, Go services, MongoDB, DynamoDB, Redis, external partners, stats / tracing / logging, Envoy manager (xDS server).

Obs, obs, obs, obs, obs, obs...

SLIDE 11

Per service auto-generated panel

  • Links to interesting data
  • Clickable traces from top-level panel
  • Per-caller information

SLIDE 12

Distributed tracing

SLIDE 13

Logging

SLIDE 14

Service to service template dashboard

Template with drop down for every service

SLIDE 15

Edge proxy

  • Per-upstream cluster RPS
  • Per-upstream cluster 5xx
  • Per-upstream cluster timings

SLIDE 16

Global health dashboard

SLIDE 17

Envoy thin clients @Lyft

    from lyft.api_client import EnvoyClient

    switchboard_client = EnvoyClient(service='switchboard')
    msg = {'template': 'breaksignout'}
    headers = {'x-lyft-user-id': 12345647363394}
    switchboard_client.post("/v2/messages", data=msg, headers=headers)

  • Abstract away egress port
  • Request ID/tracing propagation
  • Guide devs into good timeout, retry, etc. policies
  • Similar thin clients for Go and PHP
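What such a thin client might do internally can be sketched as follows. This is not the real `lyft.api_client`: the class name `ThinClient`, the egress address, the default timeout, and the choice to select the destination via the `Host` header are all assumptions made for illustration.

```python
import urllib.request

# Assumptions for illustration only: the sidecar's egress listener address
# and the default timeout are made up, not Lyft's real values.
EGRESS_URL = "http://127.0.0.1:9001"
DEFAULT_TIMEOUT_S = 1.0
PROPAGATED_HEADERS = ("x-request-id",)


class ThinClient:
    def __init__(self, service, incoming_headers=None, timeout=DEFAULT_TIMEOUT_S):
        self.service = service
        self.timeout = timeout  # callers get a sane timeout by default
        incoming = incoming_headers or {}
        # Carry the request ID of the inbound request onto outbound calls.
        self.base_headers = {
            k: incoming[k] for k in PROPAGATED_HEADERS if k in incoming
        }

    def build_request(self, method, path, data=None, headers=None):
        merged = dict(self.base_headers)
        merged.update(headers or {})
        # The caller never sees the egress port; the local proxy is
        # assumed to route on the Host header.
        merged["Host"] = self.service
        return urllib.request.Request(
            EGRESS_URL + path, data=data, headers=merged, method=method
        )

    def post(self, path, data=None, headers=None):
        # `data` must be bytes (or None) for urllib.
        req = self.build_request("POST", path, data=data, headers=headers)
        return urllib.request.urlopen(req, timeout=self.timeout)
```

The application code only names the destination service; the port, propagated headers, and timeout policy live in one place per language.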
SLIDE 18

Envoy config management via xDS APIs

  • Envoy is a universal data plane
  • xDS == * Discovery Service (various configuration APIs), e.g.:
    ○ LDS == Listener Discovery Service
    ○ CDS == Cluster Discovery Service
  • Both gRPC streaming and JSON/YAML REST via proto3!
  • Central management system can control a fleet of Envoys, avoiding per-proxy config file hell
  • Global bootstrap config for every Envoy, rest taken care of by the management server
  • Envoys + xDS + management system == fleet-wide traffic management distributed system
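The division of labor between a tiny bootstrap and a management server can be sketched with a toy in-memory model. This is not the real xDS protocol, which exchanges protobuf discovery requests and responses over gRPC or REST; the class names, the integer versioning scheme, and the string resources below are hypothetical.

```python
class ToyManagementServer:
    """In-memory stand-in for an xDS management server."""

    def __init__(self):
        self._resources = {}  # api name -> (version, list of resources)

    def set_resources(self, api, resources):
        # Every change bumps the version for that API.
        version = self._resources.get(api, (0, []))[0] + 1
        self._resources[api] = (version, list(resources))

    def discover(self, api, known_version=0):
        """Return (version, resources) if newer than `known_version`, else None."""
        version, resources = self._resources.get(api, (0, []))
        if version <= known_version:
            return None
        return version, list(resources)


class ToyProxy:
    """Starts from a tiny bootstrap (just the server) and pulls the rest."""

    def __init__(self, server):
        self.server = server
        self.versions = {"LDS": 0, "CDS": 0}
        self.config = {"LDS": [], "CDS": []}

    def poll(self):
        """Fetch any newer config; return the list of APIs that changed."""
        updated = []
        for api in self.versions:
            update = self.server.discover(api, self.versions[api])
            if update is not None:
                self.versions[api], self.config[api] = update
                updated.append(api)
        return updated
```

A change made once on the server reaches every proxy on its next poll, which is the property that removes per-proxy config files.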

SLIDE 19

Envoy config management via xDS APIs @lyft

Diagram components: cluster manager, listener manager, route manager, legacy discovery service, Envoy manager service, Envoy static config repo, service manifests, S3, registration cron jobs. APIs shown: SDS, CDS, RDS, LDS.

Only need a very tiny bootstrap config for each Envoy...

SLIDE 20

Lyft’s Envoy deployment

  • 100s of services
  • 10Ks of hosts
  • 5-10M mesh RPS
  • Majority h2
  • All edge, StS, and vast majority of external partners
  • MongoDB, DynamoDB, Spanner, Redis
  • Evolving configuration management system as we move to K8s
SLIDE 21

Envoy adoption

And lots more not listed...

SLIDE 22

Why Envoy + Q&A

  • Quality + velocity
  • Extensibility
  • Eventually consistent configuration API
  • No “open core” / paid premium version. It’s all there
  • Community, community, community

Critical mass has nearly been achieved. Becoming too costly to not use?