Building SRE in the Enterprise with Istio
Hybrid Specialist: Shawn Ho shawnho@google.com
What is SRE?
Product Lifecycle
Concept → Business → Development → Operations → Market
Agile solves this. DevOps solves this.
Agility
Stability
Concepts, guidelines, and disciplines to break silos in development,
practice of DevOps.
Self-Service Platform Monitoring Automation CI/CD
SRE
Developers
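All decisions are based on data.
Even if every monitoring metric looks normal, if customers feel the system is unstable, then the system is unstable.
Breaking down departmental silos starts with sharing responsibility across departments (Developers, Operators, Leaders). A system failure is not only the operators' responsibility; code quality, technical debt, and other factors are all possible causes.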
[Figure: Istio architecture. Service A and Service B each run with a sidecar proxy; traffic between proxies and through the Ingress Gateway and Egress Gateway uses mTLS, with JWT + TLS and perimeter security policies at the edge and local authz at the proxies. The Istio Control Plane (Pilot, Galley, Citadel, Policy Enforcement + Reporting) exposes its API on the K8S API Server and has Logging, Monitoring, and Cert Authority plugins. Labels: HTTP, gRPC, TCP Routing + Secure Naming; Cert issuance; Internal App 1; External App 1. Legend: data flow, control + metrics flow.]
Metrics & monitoring Capacity planning Emergency response Change management Culture
Understand system architecture
○ Understand the system architecture and deployed topology
System monitoring
○ Monitor the system by gathering blackbox & whitebox metrics
○ SLIs & SLOs are extracted from the metrics and logs
○ The information is visualized through dashboards
Log handling
○ Manage planned events (release, maintenance)
Incident handling
○ Create an incident ticket
○ Roll back the change to resolve the incident
○ Investigate the root cause with logging, monitoring metrics, and debugging
Postmortem
○ Review the incident in retrospect and prepare a plan to prevent recurrence
“99% of REST API calls will complete in less than 100 ms every week” (SLI + Target)
SLI: service level indicator: a well-defined measure of 'good enough'
SLO: service level objective: a target for the fraction of interactions (SLI + Target)
SLA: service level agreement: SLO + consequences = SLI + Target + consequences
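A minimal sketch of how the example statement above could be checked, assuming the SLI is "a call completes in under 100 ms" and the target is 99% per week. The names `Slo`, `sli`, and `meets_slo` and the sample latencies are hypothetical, not part of the deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """An SLO is an SLI (here: latency under a threshold) plus a target."""
    latency_threshold_ms: float  # SLI: a call is "good enough" below this
    target: float                # target fraction of good calls, e.g. 0.99


def sli(latencies_ms, threshold_ms):
    """Return the fraction of calls that completed under the threshold."""
    if not latencies_ms:
        raise ValueError("no calls observed in the window")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms, slo):
    """True when the measured SLI reaches the SLO target."""
    return sli(latencies_ms, slo.latency_threshold_ms) >= slo.target


# Hypothetical week of traffic: 995 fast calls and 5 slow ones.
week = [42.0] * 995 + [250.0] * 5
weekly_slo = Slo(latency_threshold_ms=100.0, target=0.99)
print(sli(week, weekly_slo.latency_threshold_ms))
print(meets_slo(week, weekly_slo))
```

An SLA would add consequences on top of this check; those are contractual and are not modeled here.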
Product management & SRE define an availability target.
100% minus the availability target is a “budget of unreliability” (or the error budget).
Allowed unavailability window (Error Budget)

Availability SLO   per year       per quarter     per 30 days
90%                36.5 days      9 days          3 days
95%                18.25 days     4.5 days        1.5 days
99%                3.65 days      21.6 hours      7.2 hours
99.5%              1.83 days      10.8 hours      3.6 hours
99.9%              8.76 hours     2.16 hours      43.2 minutes
99.95%             4.38 hours     1.08 hours      21.6 minutes
99.99%             52.6 minutes   12.96 minutes   4.32 minutes
99.999%            5.26 minutes   1.30 minutes    25.9 seconds
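The rows above follow from multiplying the window length by (100% minus the SLO). A small sketch of that arithmetic, assuming a 365-day year and a 90-day quarter as the table's figures imply; the function name `error_budget` is hypothetical.

```python
from datetime import timedelta


def error_budget(slo_percent, window):
    """Allowed unavailability for an availability SLO over a time window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return window * ((100 - slo_percent) / 100)


# 99.9% over 30 days: about 43.2 minutes of allowed unavailability.
budget = error_budget(99.9, timedelta(days=30))
print(round(budget.total_seconds() / 60, 1), "minutes")
```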
[Figure: monitoring views. Panels: Topology, Blackbox, Whitebox, Logging, Tracing]
[Figure: Clients reach two Kubernetes Engine clusters (Taiwan-1 and Singapore) through Cloud Load Balancing; traffic split 10 / 90]
[Figure: same topology; traffic split 50 / 50]
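The two splits shown (10 / 90, then 50 / 50) are weighted traffic distribution across clusters. The sketch below only illustrates the idea with a deterministic mapping; it is not how Cloud Load Balancing or Istio implements weighting. The cluster keys and the assumption that Taiwan-1 receives the smaller share are hypothetical, since the slide does not say which side gets which weight.

```python
from collections import Counter


def pick_cluster(request_number, weights):
    """Deterministically map a request number onto weighted clusters."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    slot = request_number % total
    for cluster, weight in weights.items():
        if slot < weight:
            return cluster
        slot -= weight
    raise AssertionError("unreachable: slot is always below the total weight")


def distribute(requests, weights):
    """Count how many of the first `requests` requests land on each cluster."""
    return Counter(pick_cluster(n, weights) for n in range(requests))


# Assumed assignment of weights to clusters (not stated in the slide).
initial = {"taiwan-1": 10, "singapore": 90}
balanced = {"taiwan-1": 50, "singapore": 50}
print(distribute(1000, initial))
print(distribute(1000, balanced))
```

Shifting from the first weight set to the second moves traffic gradually, which limits how much of the error budget a bad rollout can consume.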
Plan for organic growth Increased product adoption and usage by customers. Determine inorganic growth Sudden jumps in demand due to feature launches, marketing campaigns, etc.
Roughly 70% of outages are due to changes in a live system.
[Figure: Continuous Deployment. Labels: Kubernetes Configuration Service, Cloud Source Repositories, Clients, Kubernetes Engine cluster (multiple instances)]
[Figure: hybrid deployment. Labels: on-premises Kubernetes cluster (On-Prem1), GCP Kubernetes Engine cluster, Anthos Hub Service, NAT]
○ Decisions Based on Data (meaningful monitoring)
○ Be User Centric (blackbox testing)
○ Blameless Culture & Shared Responsibility (share the responsibility, work together)
○ SLI + SLO + Error Budget
○ Watch the Budget Burn Rate
○ Establish CI + CD with GitOps
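One common way to watch the budget burn rate is to divide the observed error rate by the error rate the SLO allows; a value of 1.0 spends the budget exactly over the SLO window. The sketch below assumes that definition. The function names and the sample counts are hypothetical.

```python
def burn_rate(bad_events, total_events, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction in (0, 1)")
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1 - slo_target
    return observed_error_rate / allowed_error_rate


def hours_until_exhausted(rate, window_hours):
    """Hours until a full budget is gone if the current burn rate persists."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate


# Hypothetical sample: 50 failed calls out of 10,000 against a 99.9% SLO.
rate = burn_rate(bad_events=50, total_events=10_000, slo_target=0.999)
print(round(rate, 2))                                   # roughly 5x the allowed rate
print(round(hours_until_exhausted(rate, 30 * 24), 1))   # hours left on a 30-day window
```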
Cover images used with permission. These books can be found on shop.oreilly.com.