Istio SRE Hybrid Specialist: Shawn Ho shawnho@google.com 1 - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Building SRE Inside the Enterprise with Istio

Hybrid Specialist: Shawn Ho shawnho@google.com

slide-2
SLIDE 2

1

What is SRE?

slide-3
SLIDE 3

Product Lifecycle

Concept → Business → Development → Operations → Market

  • Agile solves the gap between Business and Development
  • DevOps solves the gap between Development and Operations

slide-4
SLIDE 4

Developers: Agility

Operators: Stability

Dev & Ops’ KPIs aren't Aligned

slide-5
SLIDE 5

What is the relationship between DevOps and SRE?

  • DevOps is more of an abstract concept: guidelines and disciplines for breaking down silos between development and operations.
  • SRE is Google's concrete, realized practice of DevOps.

“class SRE implements DevOps”

slide-6
SLIDE 6

Self-Service Platform Monitoring Automation CI/CD

SRE

Developers

Class SRE = REAL PERSON

slide-7
SLIDE 7

#1. Decision based on data

Every decision is grounded in data.

slide-8
SLIDE 8

#2. Be user centric

Even if every monitoring metric looks normal, the system is unstable if customers feel it is unstable.

slide-9
SLIDE 9

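#3. Blameless culture & Shared responsibility

Breaking down departmental silos starts with sharing responsibility across departments (developers, operators, leaders). A system failure is not only the operators' responsibility: code quality, technical debt, and other factors are all possible causes.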

slide-10
SLIDE 10

2

How to Implement SRE with Istio/Anthos?

slide-11
SLIDE 11

Istio in 2 minutes

[Figure: Istio architecture.
  • Data plane: Service A and Service B, each fronted by a sidecar proxy; HTTP, gRPC, and TCP traffic between proxies is secured with mTLS, with local authz at the proxy.
  • Edge: Ingress Gateway and Egress Gateway apply perimeter security policies (JWT + TLS) for Internal App 1 and External App 1.
  • Istio Control Plane (control plane API on the K8s API server): Pilot (routing + secure naming), Citadel (cert issuance, cert authority plugin), Galley, and policy enforcement + reporting (logging plugin, monitoring plugin).
  • Legend: data flow; control + metrics flow.]

slide-12
SLIDE 12

What does SRE implement on Platform?

  • Metrics & monitoring: SLO, Dashboard, Analytics
  • Capacity planning: Forecasting, Demand-driven, Performance
  • Change management: Release process, Consulting design, Automations
  • Emergency response: Oncall, Incident analysis, Postmortems
  • Culture: Toil management, Blamelessness, Shared responsibility
slide-13
SLIDE 13

What does SRE implement on Platform?

  • Metrics & monitoring: SLO, Dashboard, Analytics
  • Capacity planning: Forecasting, Demand-driven, Performance
  • Change management: Release process, Consulting design, Automations
  • Emergency response: Oncall, Incident analysis, Postmortems
  • Culture: Toil management, Blamelessness, Shared responsibility
slide-14
SLIDE 14

Monitoring and Incident Management

Understand system architecture
  • Understand the system architecture and the deployed topology

System monitoring
  • Monitor the system by gathering blackbox and whitebox metrics
  • SLIs and SLOs are extracted from the metrics and logs
  • The information is visualized through dashboards

Log handling

Managing planned events (release, maintenance)

Incident handling
  • Create an incident ticket
  • Roll back the change to resolve the incident
  • Investigate the root cause with logging, monitoring metrics, and debugging

Postmortem
  • Review the incident in retrospect and prepare a plan to prevent recurrence

slide-15
SLIDE 15

What to Monitor?

SLO = SLI + Target

“99% of REST API calls will complete in less than 100 ms every week”
  • SLI: REST API calls complete in less than 100 ms
  • Target: 99%, measured every week

SLI
service level indicator: a well-defined measure of 'good enough'
  • used to specify SLO/SLA

SLO
service level objective: a top-line target for the fraction of good interactions
  • specifies goals (SLI + Target)

SLA
service level agreement: consequences
  • SLA = (SLO + margin) + consequences = SLI + Target + consequences

Error Budget
Product management & SRE define an availability target.
  • 100% - availability target is a “budget of unreliability” (or the error budget).
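The definitions on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LatencySLO` class, the `evaluate` function, and the sample latencies are hypothetical names and data, not part of the talk or of any Istio/Anthos API. It measures the SLI as the fraction of calls faster than a threshold, compares it with the target, and derives how many bad calls the error budget allows.

```python
from dataclasses import dataclass


@dataclass
class LatencySLO:
    """Hypothetical container for the slide's example SLO."""
    threshold_ms: float  # SLI definition: a call is "good" if it finishes faster than this
    target: float        # required fraction of good calls, e.g. 0.99 for 99%


def evaluate(latencies_ms, slo):
    """Measure the SLI over one window and compare it with the SLO target."""
    total = len(latencies_ms)
    good = sum(1 for ms in latencies_ms if ms < slo.threshold_ms)
    sli = good / total if total else 1.0  # assumption: no traffic counts as fully good
    # Error budget = 100% - target, expressed here as a number of calls.
    allowed_bad_calls = (1.0 - slo.target) * total
    return {
        "sli": sli,
        "slo_met": sli >= slo.target,
        "bad_calls": total - good,
        "allowed_bad_calls": allowed_bad_calls,
    }


if __name__ == "__main__":
    # Made-up week of traffic: 995 fast calls and 5 slow ones.
    week = [42.0] * 995 + [180.0] * 5
    print(evaluate(week, LatencySLO(threshold_ms=100.0, target=0.99)))
```

With the made-up data above, 5 of 1000 calls are slow while the 99% target tolerates about 10, so the SLO is met with roughly half of the error budget left.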

slide-16
SLIDE 16

Error Budget (Availability)

Availability SLO   Allowed unavailability window                    Error budget at
                   per year        per quarter      per 30 days     error rate 1% (% remaining)
90%                36.5 days       9 days           3 days          90
95%                18.25 days      4.5 days         1.5 days        80
99%                3.65 days       21.6 hours       7.2 hours       0
99.5%              1.83 days       10.8 hours       3.6 hours       -100
99.9%              8.76 hours      2.16 hours       43.2 minutes    -900
99.95%             4.38 hours      1.08 hours       21.6 minutes    -1900
99.99%             52.6 minutes    12.96 minutes    4.32 minutes    -9900
99.999%            5.26 minutes    1.30 minutes     25.9 seconds    -99900
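The figures in this table follow from simple arithmetic, sketched below. The function names are hypothetical, and reading the last column as "percentage of the error budget left when the observed error rate is 1%" is an assumption inferred from the numbers rather than something the slide states.

```python
def allowed_downtime_minutes(slo_percent, window_days):
    """Minutes of unavailability an availability SLO allows within a window."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * window_days * 24 * 60


def budget_remaining_percent(slo_percent, error_rate_percent):
    """Share of the error budget left at a given error rate (negative = overspent)."""
    budget_percent = 100.0 - slo_percent
    return (budget_percent - error_rate_percent) / budget_percent * 100.0


if __name__ == "__main__":
    for slo in (90.0, 95.0, 99.0, 99.5, 99.9):
        per_30_days = allowed_downtime_minutes(slo, 30)
        remaining = budget_remaining_percent(slo, 1.0)
        print(f"{slo}%: {per_30_days:.1f} min per 30 days, {remaining:.0f}% budget left at 1% errors")
```

For example, a 99.9% SLO over 30 days allows 0.1% of 43,200 minutes, which is 43.2 minutes, and a 1% error rate against that SLO overspends the budget ninefold.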

slide-17
SLIDE 17

Demo with Anthos: Monitoring+Incident Mgmt

  • Topology
  • SLO/SLI Metrics
  • Blackbox/Whitebox
  • Log Viewer
  • Tracing/Tracing Report
slide-18
SLIDE 18

Demo with Anthos:

Monitoring+Incident Mgmt

Topology Blackbox Whitebox

slide-19
SLIDE 19

Demo with Anthos:

Monitoring+Incident Mgmt

Logging Tracing

slide-20
SLIDE 20

Error Budget Burn Down Rate
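
The slide carries only a title, so the following is a sketch under an assumption: it uses the common definition of burn rate as the observed error rate divided by the error budget (1 - SLO target), where a burn rate of 1 spends the budget exactly over the SLO window. The function names and numbers are hypothetical.

```python
def burn_rate(error_rate, slo_target):
    """How fast the error budget is being spent relative to the SLO.

    error_rate and slo_target are fractions, e.g. 0.002 and 0.999.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("a 100% target leaves no error budget to burn")
    return error_rate / budget


def hours_until_exhausted(rate, window_hours, consumed_fraction=0.0):
    """Hours until the budget runs out if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return (1.0 - consumed_fraction) * window_hours / rate


if __name__ == "__main__":
    rate = burn_rate(error_rate=0.002, slo_target=0.999)  # about 2x burn
    print(rate, hours_until_exhausted(rate, window_hours=30 * 24))
```

At a burn rate of about 2, a 30-day budget would be gone in roughly 15 days, which is the kind of signal an alert can be attached to.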

slide-21
SLIDE 21

Demo with Anthos: Proactive Reduce Error Budget

  • Alert Setting
  • Canary Deployment
  • Cross-Region Deployment

[Figure: Clients reach Cloud Load Balancing, which splits traffic 10 / 90 between two Kubernetes Engine clusters, Taiwan-1 and Singapore.]

slide-22
SLIDE 22

Demo with Anthos: Proactive Reduce Error Budget

  • Alert Setting
  • Canary Deployment
  • Cross-Region Deployment

[Figure: Clients reach Cloud Load Balancing, which splits traffic 50 / 50 between two Kubernetes Engine clusters, Taiwan-1 and Singapore.]
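
The two demo slides show a traffic split moving from 10 / 90 to 50 / 50. One way such a weighted split could be expressed inside the mesh is an Istio `VirtualService` with per-route weights. The sketch below only builds that manifest as a Python dictionary; the service name `frontend` and the subsets `stable` and `canary` are hypothetical, the subsets would also need a matching `DestinationRule`, and the demo itself may have split traffic at the load balancer instead.

```python
def weighted_virtual_service(name, host, weights):
    """Build an Istio VirtualService manifest that splits HTTP traffic by weight.

    weights maps a subset name to an integer percentage; the values must sum to 100.
    """
    if sum(weights.values()) != 100:
        raise ValueError("route weights must sum to 100")
    routes = [
        {"destination": {"host": host, "subset": subset}, "weight": weight}
        for subset, weight in weights.items()
    ]
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": name},
        "spec": {"hosts": [host], "http": [{"route": routes}]},
    }


if __name__ == "__main__":
    # Start the canary at 10%, then widen to 50% once alerts stay quiet.
    first_step = weighted_virtual_service("frontend", "frontend", {"stable": 90, "canary": 10})
    second_step = weighted_virtual_service("frontend", "frontend", {"stable": 50, "canary": 50})
    print(first_step["spec"]["http"][0]["route"])
    print(second_step["spec"]["http"][0]["route"])
```

Keeping the first step small limits how much error budget a bad release can spend before an alert triggers a rollback.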

slide-23
SLIDE 23

What does SRE implement on Platform?

  • Metrics & monitoring: SLO, Dashboard, Analytics
  • Capacity planning: Forecasting, Demand-driven, Performance
  • Change management: Release process, Consulting design, Automations
  • Emergency response: Oncall, Incident analysis, Postmortems
  • Culture: Toil management, Blamelessness, Shared responsibility
slide-24
SLIDE 24

Capacity planning

Plan for organic growth
  • Increased product adoption and usage by customers

Determine inorganic growth
  • Sudden jumps in demand due to feature launches, marketing campaigns, etc.

slide-25
SLIDE 25

Change Management

Roughly 70% of outages are due to changes in a live system

[Figure: Kubernetes Configuration Service and Continuous Deployment. Configuration held in Cloud Source Repositories is deployed to multiple Kubernetes clusters: Kubernetes Engine clusters on GCP (multiple instances, serving clients) and an on-premises Kubernetes cluster (On-Prem1) behind NAT, connected through the Anthos Hub Service.]

slide-26
SLIDE 26

Demo with Anthos: The Power of GitOps

slide-27
SLIDE 27

Summary + Call for Action

  • SRE has 3 key principles:
      ○ Decision Based on Data (meaningful monitoring)
      ○ Be User Centric (black-box testing)
      ○ Blameless Culture & Shared Responsibility (share the responsibility, work together)

  • Kubernetes is a perfect platform to implement SRE
      ○ SLI + SLO + Error Budget
      ○ Watch for the Budget Burn Rate
      ○ Establish CI+CD with GitOps

  • Pick a System and Build your SRE Practices
slide-28
SLIDE 28

Cover images used with permission. These books can be found on shop.oreilly.com.