Microservices: State of the Union - Adrian Cockcroft @adrianco

SLIDE 1

Microservices: State of the Union

Adrian Cockcroft @adrianco
Technology Fellow - Battery Ventures
June 2016

SLIDE 2

What does @adrianco do?

  • Technology Due Diligence on Deals
  • Presentations at Conferences
  • Presentations at Companies
  • Technical Advice for Portfolio Companies
  • Program Committee for Conferences
  • Networking with Interesting People
  • Tinkering with Technologies
  • Maintain Relationship with Cloud Vendors

Previously: Netflix, eBay, Sun Microsystems, CCL, TCU London BSc Applied Physics

SLIDE 3

Developer responsibilities: Faster, cheaper, safer

SLIDE 4

What Happened?

  • Rate of change increased
  • Cost, size, and risk of change reduced

SLIDE 5

Disruptor: Continuous Delivery with Containerized Microservices

SLIDE 6

Microservices

SLIDE 9

A Microservice Definition

Loosely coupled service oriented architecture with bounded contexts

  • If every service has to be updated at the same time, it’s not loosely coupled
  • If you have to know too much about surrounding services, you don’t have a bounded context. See the Domain Driven Design book by Eric Evans.

SLIDE 14

Speeding Up The Platform

AWS Lambda is leading exploration of serverless architectures in 2016

Datacenter Snowflakes

  • Deploy in months
  • Live for years

Virtualized and Cloud

  • Deploy in minutes
  • Live for weeks

Container Deployments

  • Deploy in seconds
  • Live for minutes/hours

Lambda Deployments

  • Deploy in milliseconds
  • Live for seconds
SLIDE 15

State of the Art in Web Scale Microservice Architectures

  • http://www.infoq.com/presentations/Twitter-Timeline-Scalability
  • http://www.infoq.com/presentations/twitter-soa
  • http://www.infoq.com/presentations/Zipkin
  • http://www.infoq.com/presentations/scale-gilt
  • Go-Kit: https://www.youtube.com/watch?v=aL6sd4d4hxk
  • http://www.infoq.com/presentations/circuit-breaking-distributed-systems
  • https://speakerdeck.com/mattheath/scaling-micro-services-in-go-highload-plus-plus-2014
  • AWS Re:Invent: Asgard to Zuul: https://www.youtube.com/watch?v=p7ysHhs5hl0
  • Resiliency at Massive Scale: https://www.youtube.com/watch?v=ZfYJHtVL1_w
  • Microservice Architecture: https://www.youtube.com/watch?v=CriDUYtfrjs
  • New projects for 2015 and Docker Packaging: https://www.youtube.com/watch?v=hi7BDAtjfKY
  • Spinnaker deployment pipeline: https://www.youtube.com/watch?v=dwdVwE52KkU
  • http://www.infoq.com/presentations/spring-cloud-2015

SLIDE 16

Microservice Architectures

  • Configuration, Tooling, Discovery, Routing, Observability
  • Development: Languages and Container
  • Operational: Orchestration and Deployment Infrastructure
  • Datastores
  • Policy: Architectural and Security Compliance

SLIDE 17

Next Generation Applications

Fill in the gaps, rapidly evolving ecosystem choices

  • Configuration: Archaius, LaunchDarkly, Habitat
  • Tooling: Lambda, Docker, Spinnaker
  • Discovery: Etcd, Eureka, Consul
  • Routing: Compose, Linkerd, Weave
  • Observability: Zipkin, Prometheus, Hystrix

  • Development: components, interfaces, languages e.g. Docker Hub, Artifactory, Datawire Quark, Go, Rust
  • Operational: Mesos, Kubernetes, Swarm, Nomad for private clouds. ECS, Mesos, GKS for public
  • Datastores: Orchestrated, Distributed Ephemeral e.g. Cassandra, or DBaaS e.g. DynamoDB
  • Policy: Security compliance e.g. Docker Content Trust. Architecture compliance e.g. Cloud Foundry

SLIDE 18

In Search of Segmentation

Axis: Ops to Dev

  • Datacenters, AD/LDAP Roles, VLAN Networks, Hypervisor, IPtables, Docker Links
  • AWS Accounts, IAM Roles, VPC, Security Groups, Calico Policy, Docker Net/Weave

SLIDE 19

Hierarchical Segmentation

An AWS oriented example…

  • AWS Account - Manage across multiple accounts
  • VPC Z - Manage a small number of network spaces
  • Homepage Team Security Group, Reports Team Security Group
  • Containers (A to F) and links

SLIDE 20

What’s Often Missing?

  • Failure injection testing
  • Versioning, routing
  • Binary protocols and interfaces
  • Timeouts and retries
  • Denormalized data models
  • Monitoring, tracing
  • Simplicity through symmetry

SLIDE 21

Failure Injection Testing

Netflix Chaos Monkey, Simian Army, FIT and Gremlin

  • http://techblog.netflix.com/2011/07/netflix-simian-army.html
  • http://techblog.netflix.com/2014/10/fit-failure-injection-testing.html
  • http://techblog.netflix.com/2016/01/automated-failure-testing.html
  • https://www.infoq.com/presentations/failure-test-research-netflix

SLIDE 22

Trust with Verification

  • Chaos Monkey - enforcing stateless business logic
  • Chaos Gorilla - enforcing zone isolation/replication
  • Chaos Kong - enforcing region isolation/replication
  • Security Monkey - watching for insecure configuration settings
  • FIT & Gremlin - inject errors to enforce robust dependencies
  • See over 100 NetflixOSS projects at netflix.github.com
  • Get “Technical Indigestion” reading techblog.netflix.com
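
The "inject errors to enforce robust dependencies" idea can be illustrated with a small wrapper around a dependency call. This is only a sketch and not how Netflix's FIT or Gremlin are implemented; `FaultInjector`, `failure_rate`, and `call_with_fallback` are hypothetical names chosen for the example.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during a test."""


class FaultInjector:
    """Wraps a dependency call and fails a configurable fraction of calls."""

    def __init__(self, failure_rate, rng=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.failure_rate = failure_rate
        self.rng = rng or random.Random()

    def call(self, func, *args, **kwargs):
        # Decide before calling, so an injected failure does no real work.
        if self.rng.random() < self.failure_rate:
            raise InjectedFailure("injected failure")
        return func(*args, **kwargs)


def call_with_fallback(injector, func, fallback):
    """A robust caller degrades to a fallback instead of propagating errors."""
    try:
        return injector.call(func)
    except InjectedFailure:
        return fallback
```

With `failure_rate=1.0` every call fails, which makes it easy to check that a caller really does fall back rather than crash.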

SLIDE 23

Benefits of version aware routing

  • Immediately and safely introduce a new version
  • Canary test in production
  • Use DIY feature flags or .
  • Route clients to a version so they can’t get disrupted
  • Change client or dependencies but not both at once
  • Eventually remove old versions
  • Incremental or infrequent “break the build” garbage collection

SLIDE 24

Versioning, Routing

Version numbering: Interface.Feature.Bugfix

  • V1.2.3 to V1.2.4 - Canary test then remove old version
  • V1.2.x to V1.3.x - Canary test then remove or keep both
      • Route V1.3.x clients to new version to get new feature
      • Remove V1.2.x only after V1.3.x is found to work for V1.2.x clients
  • V1.x.x to V2.x.x - Route clients to specific versions
      • Remove old server version when all old clients are gone
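
One way to read the Interface.Feature.Bugfix scheme as a routing rule is sketched below. The rule itself is an assumption made for illustration: a client stays pinned to servers with its own Interface.Feature pair (taking the newest bugfix), and only moves to a newer feature level within the same interface once its own level has been removed. `Version` and `route` are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence


class Version(NamedTuple):
    interface: int
    feature: int
    bugfix: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        # Accepts "V1.2.3" or "1.2.3".
        parts = text.lstrip("Vv").split(".")
        if len(parts) != 3:
            raise ValueError(f"expected Interface.Feature.Bugfix, got {text!r}")
        return cls(*(int(p) for p in parts))


def route(client: Version, deployed: Sequence[Version]) -> Optional[Version]:
    """Pick the server version a client built against `client` should use."""
    same_interface = [v for v in deployed if v.interface == client.interface]
    # Prefer the client's own feature level, newest bugfix first.
    pinned = [v for v in same_interface if v.feature == client.feature]
    if pinned:
        return max(pinned)
    # Assumed fallback: a newer feature level of the same interface.
    newer = [v for v in same_interface if v.feature > client.feature]
    return max(newer, default=None)
```

Under this rule a V1.2.x client keeps hitting V1.2.4 while V1.3.0 is being canaried, and moves to V1.3.0 only after the V1.2.x servers are gone; a client whose interface number has no deployed server gets `None`.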

SLIDE 25

Protocols

  • Measure serialization, transmission, deserialization costs
  • Sending a megabyte of XML between microservices will make you sad, but not as sad as 10 years ago with SOAP
  • Use Thrift, Protobuf/gRPC, Avro, SBE internally
  • Use JSON for external/public interfaces

SBE: https://github.com/real-logic/simple-binary-encoding
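
A minimal way to "measure serialization and deserialization costs" is to time a round trip and record the payload size for each candidate encoding. The sketch below compares JSON with a fixed binary layout from Python's standard `struct` module, standing in for the binary formats named above; the record layout and all function names are assumptions for the example, and transmission cost is not measured.

```python
import json
import struct
import time

# Hypothetical record layout: user_id (uint32), score (float64), active (bool).
RECORD = struct.Struct("<Id?")


def encode_json(records):
    return json.dumps(
        [{"user_id": u, "score": s, "active": a} for u, s, a in records]
    ).encode("utf-8")


def decode_json(payload):
    return [(r["user_id"], r["score"], r["active"]) for r in json.loads(payload)]


def encode_binary(records):
    return b"".join(RECORD.pack(u, s, a) for u, s, a in records)


def decode_binary(payload):
    return list(RECORD.iter_unpack(payload))


def measure(encode, decode, records, repeats=20):
    """Return (payload_bytes, seconds_per_round_trip) for one codec."""
    payload = encode(records)
    start = time.perf_counter()
    for _ in range(repeats):
        decode(encode(records))
    elapsed = (time.perf_counter() - start) / repeats
    return len(payload), elapsed
```

Calling `measure(encode_json, decode_json, records)` and `measure(encode_binary, decode_binary, records)` on representative data gives the size and time figures to compare; the absolute numbers depend on the machine and the data.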

SLIDE 26

Interfaces

  • When you build a service, build a “driver” client for it
  • Reference implementation error handling and serialization
  • Release automation stress test using client
  • Validate that service interface is usable!
  • Minimize additional dependencies
  • Swagger - OpenAPI Specification
  • Datawire Quark adds behaviors to API spec

SLIDE 27

Interface Version Pinning

  • Change one thing at a time!
  • Pin the version of everything else
  • Incremental build/test/deploy pipeline
  • Deploy existing app code with new platform
  • Deploy existing app code with new dependencies
  • Deploy new app code with pinned platform/dependencies

SLIDE 37

Interfaces between teams

  • Client Code: Minimal Object Model, Service Driver, Cache Driver, Platform
  • Service Code: Full Object Model, Service Handler, Platform
  • Cache Code: Common Object Model, Cache Handler, Platform
  • Decoupled object models
  • Versioned dependency interfaces
  • Versioned platform interface
  • Versioned routing

SLIDE 38

Timeouts and Retries

  • Connection timeout vs. request timeout confusion
  • Usually set up incorrectly, with global defaults
  • Systems collapse with “retry storms”
  • Timeouts too long, too many retries
  • Services doing work that can never be used

SLIDE 39

Connections and Requests

  • TCP makes a connection, HTTP makes a request
  • HTTP hopefully reuses connections for several requests
  • Both have different timeout and retry needs!
  • TCP timeout is purely a property of one network latency hop
  • HTTP timeout depends on the service and its dependencies

Diagram labels: connection path, request path

SLIDE 42

Timeouts and Retries

Bad config: Every service defaults to 2 second timeout, two retries

Diagram labels: Edge Service, Good Service, Good Service, Failed Service

  • Edge Service not responding
  • Overloaded service not responding
  • If anything breaks, everything upstream stops responding
  • Retries add unproductive work

SLIDE 45

Timeouts and Retries

Bad config: Every service defaults to 2 second timeout, two retries

Diagram labels: Edge service responds slowly, Overloaded service, Partially failed service

  • First request from Edge timed out so it ignores the successful response and keeps retrying.
  • Middle service load increases as it’s doing work that isn’t being consumed

SLIDE 46

Timeout and Retry Fixes

  • Cascading timeout budget
      • Static settings that decrease from the edge
      • or dynamic budget passed with request
  • How often do retries actually succeed?
  • Don’t ask the same instance the same thing
  • Only retry on a different connection
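
The "dynamic budget passed with request" option could look roughly like the sketch below: the edge creates a budget, each service asks it how long a downstream call may take, and a service fails fast once nothing is left. `Budget`, `timeout_for_call`, and the `reserve` parameter are hypothetical names; how the remaining budget travels between processes (for example as a request header) is left out.

```python
import time


class BudgetExceeded(Exception):
    """The request's time budget ran out before the work could finish."""


class Budget:
    """A time budget created at the edge and passed along with the request."""

    def __init__(self, seconds, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + seconds

    def remaining(self):
        return max(0.0, self._deadline - self._clock())

    def timeout_for_call(self, reserve=0.0):
        """Timeout to use for a downstream call, keeping `reserve` seconds
        back so this service can still build its own response."""
        available = self.remaining() - reserve
        if available <= 0:
            raise BudgetExceeded("no budget left; fail fast instead of calling")
        return available
```

Because each hop only ever sees what is left of the original budget, downstream timeouts shrink automatically as the request moves away from the edge.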

SLIDE 48

Timeouts and Retries

Budgeted timeout, one retry

Diagram labels: Edge Service, Good Service, Failed Service; timeouts 3s, 1s, 1s

  • Fast fail response after 2s
  • Upstream timeout must always be longer than the total downstream delay (downstream timeout * number of attempts)
  • No unproductive work while fast failing
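
The budget rule can be checked mechanically for a chain of static settings. The sketch below assumes each hop is described by the timeout and retry count it uses when calling the next service, and that every attempt may run to its full timeout; `worst_case_delay` and `check_budgets` are hypothetical helper names.

```python
def worst_case_delay(timeout_s, retries):
    """Longest time a caller can spend on one dependency before failing fast:
    the first attempt plus each retry may all run to the full timeout."""
    return timeout_s * (1 + retries)


def check_budgets(hops):
    """hops: list of (name, timeout_s, retries), ordered from the edge inward.

    Returns the names of hops whose timeout is not longer than the
    worst-case delay of the hop directly downstream of them.
    """
    violations = []
    for (name, timeout_s, _), (_, down_timeout, down_retries) in zip(hops, hops[1:]):
        if timeout_s <= worst_case_delay(down_timeout, down_retries):
            violations.append(name)
    return violations
```

With the numbers on this slide, a 1s downstream timeout with one retry gives a 2s worst case, which fits inside the 3s edge timeout. The "2 second timeout, two retries everywhere" configuration fails the check at every hop that has something downstream.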

SLIDE 50

Timeouts and Retries

Budgeted timeout, failover retry

Diagram labels: Edge Service, Good Service, Failed Service, Good Service; timeouts 3s, 1s

  • For replicated services with multiple instances never retry against a failed instance
  • No extra retries or unproductive work
  • Successful response delayed 1s
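
A failover retry can be sketched as "walk a list of distinct instances and never revisit one that failed". The function below assumes the caller supplies the instance list and a `send(instance, request)` callable that raises on timeout or error; both, along with `call_with_failover`, are hypothetical.

```python
class AllInstancesFailed(Exception):
    """Every attempted instance raised an error."""


def call_with_failover(instances, request, send, max_attempts=2):
    """Try `send(instance, request)` on distinct instances, at most
    `max_attempts` times, never retrying an instance that already failed."""
    errors = []
    for instance in list(instances)[:max_attempts]:
        try:
            return send(instance, request)
        except Exception as exc:  # sketch: a real client would catch narrower errors
            errors.append((instance, exc))
    raise AllInstancesFailed(errors)
```

With `max_attempts=2` this matches the slide's picture: one timed-out attempt against the failed instance, then a single retry against a different, healthy one.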

SLIDE 51

Manage Inconsistency

  • ACM Paper: "The Network is Reliable"
  • Distributed systems are inconsistent by nature
  • Clients are inconsistent with servers
  • Most caches are inconsistent
  • Versions are inconsistent
  • Get over it and Deal with it

See http://queue.acm.org/detail.cfm?id=2655736

SLIDE 52

Denormalized Data Models

  • Any non-trivial organization has many databases
  • Cross references exist, inconsistencies exist
  • Microservices work best with individual simple stores
  • Scale, operate, mutate, fail them independently
  • NoSQL allows flexible schema/object versions

SLIDE 53

Denormalized Data Models

  • Build custom cross-datasource check/repair processes
  • Ensure all cross references are up to date
  • Read these Pat Helland papers:
      • Immutability Changes Everything: http://highscalability.com/blog/2015/1/26/paper-immutability-changes-everything-by-pat-helland.html
      • Memories, Guesses and Apologies: https://blogs.msdn.microsoft.com/pathelland/2007/05/15/memories-guesses-and-apologies/
      • Standing on the Distributed Shoulders of Giants: http://queue.acm.org/detail.cfm?id=2953944
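
A cross-datasource check/repair process might look like the following sketch, which uses two in-memory dicts in place of real stores. The data shapes (orders carrying a denormalized `customer_name`), the choice of the customer store as the authoritative side, and the function names are all assumptions for illustration.

```python
def check_cross_references(customers, orders):
    """Compare the denormalized customer_name on each order with the
    customer store. Returns (dangling_order_ids, stale_order_ids)."""
    dangling, stale = [], []
    for order_id, order in sorted(orders.items()):
        customer = customers.get(order["customer_id"])
        if customer is None:
            dangling.append(order_id)  # reference to a missing customer
        elif customer["name"] != order["customer_name"]:
            stale.append(order_id)  # copy has drifted from the source
    return dangling, stale


def repair_stale(customers, orders, stale_order_ids):
    """Overwrite stale copies, treating the customer store as authoritative."""
    for order_id in stale_order_ids:
        order = orders[order_id]
        order["customer_name"] = customers[order["customer_id"]]["name"]
```

Dangling references are reported rather than repaired here, since deciding what to do with them usually needs a human or a business rule.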

SLIDE 54

Cloud Native Monitoring and Microservices

SLIDE 55

Low Latency SaaS Based Monitors

  • https://www.datadoghq.com/
  • http://www.instana.com/
  • www.bigpanda.io
  • www.vividcortex.com
  • signalfx.com
  • wavefront.com
  • sysdig.com

See www.battery.com for a list of portfolio investments

SLIDE 57

A Tragic Quadrant

  • Axis: Ability to scale (100s, 1,000s, 10,000s, 100,000s)
  • Axis: Ability to handle rapidly changing microservices (Datacenter, Cloud, Containers, Lambda)
  • Quadrant entries: In-house tools at web scale companies; Most current monitoring & APM tools; Next generation APM; Next generation Monitoring

YMMV: Opinionated approximate positioning only

SLIDE 58

Interesting architectures have a lot of microservices! Flow visualization is a big challenge.

See http://www.slideshare.net/LappleApple/gilt-from-monolith-ruby-app-to-micro-service-scala-service-architecture

SLIDE 59

Simulated Microservices

  • Model and visualize microservices
  • Simulate interesting architectures
  • Generate large scale configurations
  • Eventually stress test real tools

Code: github.com/adrianco/spigo

  • Simulate Protocol Interactions in Go
  • Visualize with D3
  • See for yourself: http://simianviz.surge.sh
  • Follow @simianviz for updates

Diagram labels: ELB Load Balancer; Zuul API Proxy; Karyon Business Logic; Staash Data Access Layer; Priam Cassandra Datastore; Three Availability Zones; Denominator DNS Endpoint

SLIDE 60

Simplicity through symmetry

  • Symmetry
  • Invariants
  • Stable assertions
  • No special cases
  • Single purpose components

SLIDE 61

Serverless

SLIDE 62

Serverless Architectures

  • AWS Lambda getting some early wins
  • Google Cloud Functions, Azure Functions alpha launched
  • IBM OpenWhisk - open sourced
  • Startup activity: iron.io, serverless.com, apex.run toolkit

SLIDE 65

Serverless Architecture

Diagram components: API Gateway, Kinesis, S3, DynamoDB

SLIDE 66

AWS Lambda Reference Arch

http://www.allthingsdistributed.com/2016/05/aws-lambda-serverless-reference-architectures.html

SLIDE 67

Serverless Programming Model

  • Event driven functions
  • Role based permissions
  • Whitelisted API based security
  • Good for simple single threaded code
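
An event driven function in this model is typically a single stateless entry point called once per event. The sketch below follows the `handler(event, context)` shape used by AWS Lambda's Python runtime; the event layout (a `records` list of `key`/`size` items) and the response fields are made up for the example and do not match any specific AWS event source.

```python
import json


def handler(event, context):
    """Entry point invoked once per event; holds no state between calls.

    `event` is assumed to be a dict with a "records" list of
    {"key": str, "size": int} items; this shape is invented for the sketch.
    """
    records = event.get("records", [])
    total = sum(r["size"] for r in records)
    return {
        "statusCode": 200,
        "body": json.dumps({"count": len(records), "total_size": total}),
    }
```

The function can be exercised locally by calling `handler({...}, None)`, which is one simple answer to the testing gap listed later under work in progress.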

SLIDE 68

Serverless Cost Efficiencies

  • 100% useful work, no agents, overheads
  • 100% utilization, no charge between requests
  • No need to size capacity for peak traffic
  • Anecdotal costs ~1% of conventional system
  • Ideal for low traffic, Corp IT, spiky workloads

SLIDE 69

Serverless Work in Progress

  • Tooling for ease of use
  • Multi-region HA/DR patterns
  • Debugging and testing frameworks
  • Monitoring, end to end tracing

SLIDE 70

DIY Serverless Operating Challenges

  • Startup latency
  • Execution overhead
  • Charging model
  • Capacity planning

SLIDE 71

Learn More…

SLIDE 72

“We see the world as increasingly more complex and chaotic because we use inadequate concepts to explain it. When we understand something, we no longer see it as chaotic or complex.”

Jamshid Gharajedaghi - 2011 Systems Thinking: Managing Chaos and Complexity: A Platform for Designing Business Architecture

SLIDE 73

Q&A

Adrian Cockcroft @adrianco
http://slideshare.com/adriancockcroft
Technology Fellow - Battery Ventures

See www.battery.com for a list of portfolio investments

SLIDE 74

  • Security: Palo Alto Networks
  • Enterprise IT: Operations & Management, Big Data, Compute, Networking, Storage

Visit http://www.battery.com/our-companies/ for a full list of all portfolio companies in which all Battery Funds have invested.