Controlled Chaos: Taming Organic, Federated Growth of Microservices



slide-1
SLIDE 1

Controlled Chaos: Taming Organic, Federated Growth of Microservices

slide-2
SLIDE 2

Southwest meltdown

July 20, 2016

  • One of 2,000 routers fails
  • All monitors green
  • Billions of packets in, 0 out
  • 30 min to discovery
  • 12 hours rebooting
  • 5 days to full recovery

Image: goodfreephotos.com

1 Router, 5 Days, $80M Losses, -$3.4B Market Cap

Source: https://www.washingtonpost.com/lifestyle/travel/airline-computer-outages-like-deltas-are-bound-to-repeat-themselves-heres-what-to-know/2016/08/11/578a83cc-5d8d-11e6-9d2f-b1a3564181a1_story.html

slide-3
SLIDE 3

  • Mission control
  • Service landscape
  • Behaviors
  • Operational patterns

Tobias Kunze Co-Founder & CEO tobias@glasnostic.com @tkunze

Image: NASA

slide-4
SLIDE 4

The agile operating model

Containers PaaS SaaS VMs Serverless

Cloud Ecosystem

Gateways Service Mesh APIs Integration

Traffic Control

Monolith Microservices Organic Federated Growth

Service Landscape

Shared Services IT Ops SRE DevOps

Mission Control

Autonomous Teams Rapid Learning & Decision Cycles Hierarchy

Agile Organization

slide-5
SLIDE 5
slide-6
SLIDE 6

  • Loss of perimeter
  • Loss of blueprint
  • Changing identities
  • Volatile behaviors

Fundamentally new challenge

Security: evolving topologies, ephemeral actors

Image: FIXME

slide-7
SLIDE 7

  • Large-scale
  • Complex
  • Non-linear
  • Unpredictable

Fundamentally new challenge

Stability: complex emergent behaviors

slide-8
SLIDE 8

replicas: 3
spec:
  containers:
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: 400Mi
      requests:
        cpu: 200m
        memory: 100Mi
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 3
http:
  timeout: 10s
  retries:
    attempts: 3
    perTryTimeout: 2s
nginx:
  resources:
    limits:
      memory: 1Gi

  • Resource limits
  • Scaling behaviors
  • Request behaviors
  • Pool sizing

Can’t engineer away
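One way to see why such settings interact rather than stay local: with `attempts: 3` at every hop, the number of calls that can reach the deepest service grows exponentially with the length of the call chain. The sketch below is only illustrative. It assumes `attempts` counts total tries per hop (some proxies count retries instead), and the function names and chain depths are hypothetical, not from the talk.

```python
def worst_case_calls(attempts: int, depth: int) -> int:
    """Upper bound on calls hitting the deepest service when each of
    `depth` hops tries up to `attempts` times (assumed semantics)."""
    if attempts < 1 or depth < 0:
        raise ValueError("attempts must be >= 1 and depth >= 0")
    return attempts ** depth


def worst_case_wait_s(attempts: int, per_try_timeout_s: float, timeout_s: float) -> float:
    """Longest a caller waits at one hop if every try times out,
    capped by the overall request timeout."""
    return min(attempts * per_try_timeout_s, timeout_s)


if __name__ == "__main__":
    # attempts=3, perTryTimeout=2s, timeout=10s are the values shown on the slide;
    # the chain depths are made up for illustration.
    for depth in (1, 2, 3, 4):
        print(f"depth={depth}: up to {worst_case_calls(3, depth)} calls")
    print(f"worst-case wait per hop: {worst_case_wait_s(3, 2.0, 10.0)} s")
```

No single service owner sees this amplification in their own configuration, which is one reading of why such behaviors cannot be engineered away per service.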

Image: teachersource.com

slide-9
SLIDE 9
slide-10
SLIDE 10

Environment over code

slide-11
SLIDE 11

Source: nats.aero

slide-12
SLIDE 12

Successful missions are run.

Image: NASA

slide-13
SLIDE 13

Coping strategies

  • Do nothing
  • Monitor nodes
  • Trace requests

slide-14
SLIDE 14

Service mesh

Source: gagliardiphotography.com

Natural landscape evolves faster than baroque YAML

slide-15
SLIDE 15

Golden signals

  • Requests
  • Latency
  • Concurrency
  • Bandwidth
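Three of these signals are linked by Little's law: in steady state, average concurrency equals request rate times average latency. The sketch below shows how a latency increase at constant request rate surfaces as a concurrency increase. The function name and the numbers are illustrative assumptions, not from the talk.

```python
def expected_concurrency(requests_per_s: float, latency_s: float) -> float:
    """Little's law: mean requests in flight = arrival rate * mean latency."""
    if requests_per_s < 0 or latency_s < 0:
        raise ValueError("rate and latency must be non-negative")
    return requests_per_s * latency_s


if __name__ == "__main__":
    # Hypothetical interaction: 200 requests/s, latency degrading tenfold.
    for latency_s in (0.05, 0.5):
        in_flight = expected_concurrency(200.0, latency_s)
        print(f"latency={latency_s}s -> about {in_flight:.0f} requests in flight")
```

Watching concurrency alongside requests and latency therefore gives an early, cheap hint that a downstream dependency is slowing down.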

Image source: bbc.co.uk

slide-16
SLIDE 16

Operational patterns

Control Systemic Failures

  • Bulkhead
  • Backpressure
  • Segmentation

Assure Performance

  • Circuit Breaker
  • Quality of Service

Deploy with Confidence

  • Quarantine
  • Canary

Build Resilience

  • Fault Injection
  • Brownout
  • Blast Radius
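To make one of these patterns concrete, here is a minimal in-process sketch of a bulkhead: a fixed number of slots per dependency, with excess calls rejected immediately instead of queuing. The talk applies such patterns at the environment level rather than in code, so the class and exception names below are hypothetical and only illustrate the idea.

```python
import threading


class BulkheadFull(RuntimeError):
    """Raised when every slot of the bulkhead is in use."""


class Bulkhead:
    """Caps concurrent calls to one dependency so that a slow dependency
    cannot absorb all of the caller's capacity."""

    def __init__(self, max_concurrent: int) -> None:
        if max_concurrent < 1:
            raise ValueError("max_concurrent must be >= 1")
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Fail fast instead of waiting: waiting would just move the pile-up here.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("no free slot")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


if __name__ == "__main__":
    bulkhead = Bulkhead(max_concurrent=2)
    print(bulkhead.call(lambda: "handled"))
```

Rejecting rather than blocking is a design choice: it keeps the failure visible to the caller, which can then shed load or degrade.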

slide-17
SLIDE 17
  • Ex. 1: cache thrashing

Diagram labels: Emergent Behavior, Organic Growth, Service Landscape (markers 1 to 4)

Behavior

  • 1. New organic growth
  • 2. Upstream fan-out changed
  • 3. Shared cache thrashes
  • 4. Wide, unspecific slowness

Remediation

  • 1. See widespread slowness
  • 2. Identify bottleneck
  • 3. Correlate with deployment
  • 4. Quarantine deployment

Quarantine
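The "correlate with deployment" step in the remediation list can be sketched as a simple time-window lookup: which deployments went live shortly before the slowdown began? The data model, the 15-minute window, and all names below are assumptions for illustration, not part of the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    service: str        # hypothetical service name
    deployed_at: float  # seconds since epoch


def quarantine_candidates(deployments, slowdown_started_at, window_s=900.0):
    """Return deployments that went live within `window_s` seconds before
    the slowdown started, newest first."""
    recent = [
        d for d in deployments
        if 0 <= slowdown_started_at - d.deployed_at <= window_s
    ]
    return sorted(recent, key=lambda d: d.deployed_at, reverse=True)


if __name__ == "__main__":
    history = [
        Deployment("billing", 1_000.0),
        Deployment("recommendations", 1_900.0),
        Deployment("search", 2_100.0),  # deployed after the slowdown began
    ]
    for d in quarantine_candidates(history, slowdown_started_at=2_000.0):
        print("quarantine candidate:", d.service)
```

The result is a list of candidates to quarantine, not a root cause; the point of the pattern is to contain first and investigate later.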

slide-18
SLIDE 18
  • Ex. 2: cascading failure at Target

Source: https://medium.com/@daniel.p.woods/on-infrastructure-at-scale-a-cascading-failure-of-distributed-systems-7cff2a3cd2df

OpenStack K8s Cluster Service Kafka Sidecar Maintenance 1 Network outage 2 Intermittently available 3 CPU spike 4 Starvation 5 Unhealthy 6 7 Migrate 8 CPU spike 9 Starvation Unhealthy 10 11 Migrate

Behavior

  • 1. K8s flip-flopping
  • 2. Logging spikes

Remediation

  • 1. See logging spikes
  • 2. Backpressure, circuit-break

Docker Node

Circuit Breaker Backpressure
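As an illustration of the circuit-breaker half of this remediation, here is a minimal single-threaded sketch: after a number of consecutive failures the breaker opens and rejects calls, then lets one trial call through once a cool-down has passed. The thresholds, class names, and injectable clock are assumptions for the example, not details from the talk or from the Target incident.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Not thread-safe; a sketch of the state machine only."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: half-open, let this one call through as a trial.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_after_s=5.0)
    print(breaker.call(lambda: "log line accepted"))
```

Applied to the logging path in this example, an open circuit would shed log traffic instead of letting it starve the node of CPU.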

slide-19
SLIDE 19

Ex. 3: security breach

Org 1 Room 1 Room n Security, Governance Participant Gateway Relay Org m

Behavior

  • 1. DoS
  • 2. Segmentation violation

DoS 1 Segmentation violation 2

Remediation

  • 1. Identify sources
  • 2. Segment

Sources

Segmentation
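Segmentation can be thought of as a default-deny allow-list between groups of services: any observed flow not explicitly permitted is a violation. The sketch below checks observed flows against such a policy. The segment names loosely echo the diagram labels, but the policy contents and function names are hypothetical.

```python
def find_violations(policy, flows):
    """Return the (source, destination) flows not permitted by `policy`.

    `policy` maps a source segment to the set of destination segments it
    may call; anything not listed is denied by default.
    """
    return [
        (src, dst) for src, dst in flows
        if dst not in policy.get(src, frozenset())
    ]


if __name__ == "__main__":
    # Hypothetical policy: participants reach rooms only through the gateway.
    policy = {
        "participant": {"gateway"},
        "gateway": {"room-1", "relay"},
    }
    observed = [
        ("participant", "gateway"),
        ("gateway", "room-1"),
        ("participant", "room-1"),  # bypasses the gateway
    ]
    for src, dst in find_violations(policy, observed):
        print(f"segmentation violation: {src} -> {dst}")
```

In practice the same check would run against live traffic metadata, so that offending sources can be identified and segmented off without touching service code.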

slide-20
SLIDE 20

Runtime control examples

Deploy to Production

slide-21
SLIDE 21

Runtime control examples

Deploy to Production Architect in Real Time

slide-22
SLIDE 22

Architect in real time

Managed Access Data Access Auth Car Access IoT Access Privacy Sanitization Streaming Pipelines Query Planning Diagnostics Stream Processing Analytics Data Synthesis Car Data IoT 100’s of Applications 1,000,000’s of Cars Service Layer

slide-23
SLIDE 23

Runtime control examples

Deploy to Production Architect in Real Time Define Structure

slide-24
SLIDE 24

Summary

  • New reality
  • Agile operating model
  • Service landscape
  • Environment over code
  • Rapid MTTR
  • Operational patterns
  • Golden signals

slide-25
SLIDE 25

Takeaways

Developers

  • 1. Avoid distributed systems
  • 2. Resilient federations
  • 3. Compensate
  • 4. Redundancy, plausibility
  • 5. Debug at unit level
  • 6. Defer design to runtime

Operators

  • 1. Environment over code
  • 2. Rapid detect–react loops
  • 3. Signals, patterns
  • 4. No root causes: remediate

  • 5. No process debugging
  • 6. Architect at runtime

Applies to all decentralized architectures, not just microservices.

slide-26
SLIDE 26

Mission Control for Agile Architectures

glasnostic.com
 Tobias Kunze Co-Founder & CEO tobias@glasnostic.com
 @tkunze

Image: NASA