[PPT] - Path to Resilient and Observable Microservices Slides: PowerPoint Presentation

SLIDE 1

Path to Resilient and Observable Microservices

Slides: https://slides.peterj.dev

@pjausovec

1 / 56

SLIDE 2

Safe Harbor

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, timing, and pricing of any features or functionality described for Oracle's products may change and remains at the sole discretion of Oracle Corporation.

Statements in this presentation relating to Oracle's future plans, expectations, beliefs, intentions and prospects are "forward-looking statements" and are subject to material risks and uncertainties. A detailed discussion of these factors and other risks that affect our business is contained in Oracle's Securities and Exchange Commission (SEC) filings, including our most recent reports on Form 10-K and Form 10-Q under the heading "Risk Factors." These filings are available on the SEC's website or on Oracle's website at http://www.oracle.com/investor. All information in this presentation is current as of September 2019 and Oracle undertakes no duty to update any statement in light of new information or future events.

SLIDE 3

Introduction

I am Peter (@pjausovec)
Software Engineer at Oracle
Working on "cloud-native" stuff
Books:
  Cloud Native: Using Containers, Functions, and Data to Build Next-Gen Apps
  SharePoint Development
  VSTO For Dummies
Courses:
  Kubernetes Course (https://startkubernetes.com)
  Istio Service Mesh Course (https://learnistio.com)

SLIDE 4

SLIDE 5

SLIDE 6

SLIDE 7

Microservices

Independently deployable services, owning their data and having well-defined interfaces

SLIDE 8

SLIDE 9

SLIDE 10

SLIDE 11

Service Resiliency

SLIDE 12

Resiliency

Ability to recover from failures and continue to function

SLIDE 13

Goal

Return the service to a fully functioning state after failure

SLIDE 14

Resiliency - High availability (1/2)

Healthy 🏦
No significant downtime ⏱
Responsive 🏏
Meeting SLAs/SLOs 💱

SLIDE 15

Resiliency - Disaster Recovery (2/2)

How to recover from incidents
Data backup and archiving
DR starts when HA design can't handle the impact of failures

SLIDE 16

Path to resiliency

Understand the requirements
Define service availability
Design for resiliency
Strategies for detection and recovery
Testing
Monitoring

SLIDE 17

Path to resiliency

Understand the requirements
  How much downtime can you handle/is acceptable?
  More downtime → broken SLAs/SLOs → 😣
Define service availability
  What does it mean for the service to be available?
Design for resiliency
  Identify failure points: what can go wrong, and how?
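
The downtime question can be made concrete with a little arithmetic. A minimal sketch, assuming availability is expressed as a percentage over a 30-day window; the function name and the window length are illustrative choices, not something the slides prescribe:

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Minutes of downtime permitted in the period at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Each extra "nine" cuts the downtime budget by a factor of ten.
for target in (99.0, 99.9, 99.99):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% availability -> {budget:.1f} minutes per 30 days")
```

At 99.9% the budget is roughly 43 minutes per 30 days, which is a useful number to have in hand before agreeing to an SLA/SLO.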

SLIDE 18

Path to resiliency

Strategies for detection and recovery
  How are you detecting and recovering from failures?
Testing and monitoring
  Test for failure conditions so you can detect and recover from them
  Monitor your services, so you know what's happening

SLIDE 19

Resiliency Strategies

Load balancing
Timeouts and retries
Circuit breakers and bulkhead pattern
Data replication
Graceful degradation
Rate limiting

SLIDE 20

Load Balancing

Scale services by adding more instances
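
Once there is more than one instance, something has to choose which instance receives each request. A minimal sketch of round-robin selection, one common strategy; the class name and the instance addresses are hypothetical:

```python
import itertools


class RoundRobinBalancer:
    """Hands out instances in a fixed rotation."""

    def __init__(self, instances):
        instances = list(instances)
        if not instances:
            raise ValueError("at least one instance is required")
        self._cycle = itertools.cycle(instances)

    def next_instance(self):
        return next(self._cycle)


# Hypothetical instance addresses, for illustration only.
balancer = RoundRobinBalancer(["10.0.0.1:8080", "10.0.0.2:8080"])
for _ in range(4):
    print(balancer.next_instance())
```

This sketch ignores instance health; a real balancer would also stop sending traffic to instances that fail.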

SLIDE 21

Timeouts and Retries

Timeouts
  Network latency: how long do you wait for responses?
  Waiting indefinitely == bad
  Always define timeouts!
Retries
  Help handle transient network failures
  Only retry calls that make sense
    Idempotent operations
  Consider appropriate retry counts and intervals between retries
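
A minimal sketch of the retry advice, assuming the wrapped operation is idempotent and enforces its own timeout. `TransientError` and `call_with_retries` are hypothetical names, and exponential backoff with jitter is one reasonable choice for the interval between retries, not the only one:

```python
import random
import time


class TransientError(Exception):
    """Stands in for a failure worth retrying, such as a dropped connection."""


def call_with_retries(operation, attempts=3, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Call an idempotent operation, retrying transient failures with backoff.

    `operation` should apply its own timeout and raise TransientError or
    TimeoutError when the call is worth retrying.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except (TransientError, TimeoutError):
            if attempt == attempts:
                raise  # out of retries: surface the last failure
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Other exceptions are deliberately not retried: a validation error or a non-idempotent write should fail immediately rather than be repeated.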

SLIDE 22

Circuit breaker

Prevents doing an operation that is likely to fail
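
A minimal, single-threaded sketch of the idea: after a number of consecutive failures the breaker "opens" and rejects calls without attempting them, then allows a trial call once a reset timeout has passed. The class name, thresholds, and injected clock are assumptions for illustration:

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: half-open, let a trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast protects both sides: the caller does not wait on a dependency that is known to be struggling, and the dependency gets room to recover.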

SLIDE 23

SLIDE 24

SLIDE 25

SLIDE 26

SLIDE 27

Bulkhead pattern

Isolate resources so that if one fails, it does not affect the others
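
One way to sketch this in code is to give each dependency its own small pool of concurrency slots, so a slow dependency can exhaust only its own compartment. The `Bulkhead` class below is a hypothetical illustration built on a standard-library semaphore:

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Reject immediately instead of queueing, so callers are not tied up.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no capacity left in this compartment")
        try:
            return operation()
        finally:
            self._slots.release()


# Separate compartments: saturating one leaves the other untouched.
payments = Bulkhead(max_concurrent=10)
search = Bulkhead(max_concurrent=50)
```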

SLIDE 28

Data replication

Handle non-transient failures in the data store

SLIDE 29

Graceful degradation

Ability to maintain limited functionality in the face of failures
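
A minimal sketch, assuming a hypothetical recommendations feature: when the personalized backend fails, the caller still gets a usable, clearly flagged fallback instead of an error. The function name and the default fallback are illustrative:

```python
def get_recommendations(user_id, fetch_personalized, fallback=("bestsellers",)):
    """Return personalized items, or a static fallback if the dependency fails."""
    try:
        return {"items": list(fetch_personalized(user_id)), "degraded": False}
    except Exception:
        # Dependency is down: serve reduced functionality instead of an error.
        return {"items": list(fallback), "degraded": True}
```

The `degraded` flag lets the caller, and your monitoring, tell a fallback response apart from a normal one.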

SLIDE 30

Rate limiting

Restrict the number of requests made in a period of time
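
A token bucket is one common way to implement this. The sketch below is a single-process illustration with hypothetical names; `rate` is tokens added per second and `capacity` is the largest burst allowed:

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True if the request may proceed, consuming one token."""
        now = self.clock()
        # Refill according to elapsed time, never exceeding capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller would typically answer a rejected request with HTTP 429 rather than queueing it.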

SLIDE 31

How?

SLIDE 32

Service Mesh

SLIDE 33

Dedicated infrastructure layer to connect, manage, and secure workloads by managing the communication between them

SLIDE 34

Istio service mesh

Open source service mesh
  Google, IBM, Lyft
Well-defined API
Can be deployed on-premise, in the cloud
  Kubernetes
  Mesos

SLIDE 35

SLIDE 36

SLIDE 37

SLIDE 38

SLIDE 39

Source: https://barkpost.com/cute/sidecar-dogs/

SLIDE 40

Service Mesh & Resiliency

SLIDE 41

Resiliency Strategies - Service Mesh

Load balancing
Timeouts and retries
Circuit breakers and bulkhead pattern
Data replication
Graceful degradation
Rate limiting

SLIDE 42

Testing for resiliency

Test
Measure
Analyze (fix the issues)

SLIDE 43

Testing for Resiliency - Service Mesh

Inject failures

Delays
  Example: "For 30% of the requests, wait 5 seconds before responding"
Faults
  Example: "For 50% of the requests, return HTTP 404"
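
The same two kinds of injection can be sketched in application code for cases where no mesh is available, for example in a unit test. This wrapper is a hypothetical illustration, not how a service mesh implements it; handlers are assumed to return a `(status, body)` tuple:

```python
import random
import time


def with_fault_injection(handler, delay_percent=0.0, delay_seconds=0.0,
                         abort_percent=0.0, abort_status=500,
                         rng=random.random, sleep=time.sleep):
    """Wrap `handler` so a share of requests is delayed or aborted."""
    def wrapped(request):
        # rng() is in [0, 1), so 0 percent never triggers and 100 always does.
        if rng() * 100 < delay_percent:
            sleep(delay_seconds)
        if rng() * 100 < abort_percent:
            return abort_status, "injected fault"
        return handler(request)
    return wrapped


def hello(request):
    return 200, "hello " + request


# "For 30% of the requests, wait 5 seconds before responding"
slow_hello = with_fault_injection(hello, delay_percent=30, delay_seconds=5)
# "For 50% of the requests, return HTTP 404"
faulty_hello = with_fault_injection(hello, abort_percent=50, abort_status=404)
```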

SLIDE 44

Failure injection

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: service-b
spec:
  hosts:
    - service-b.default.svc.cluster.local
  http:
    - route:
        - destination:
            host: service-b.default.svc.cluster.local
            subset: v1
      # Abort 60% of the requests with HTTP 404
      fault:
        abort:
          percent: 60
          httpStatus: 404

SLIDE 45

Delay injection

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: service-b
spec:
  hosts:
    - service-b.default.svc.cluster.local
  http:
    - route:
        - destination:
            host: service-b.default.svc.cluster.local
            subset: v1
      # Delay 20% of the requests by 3 seconds
      fault:
        delay:
          percent: 20
          fixedDelay: 3s

SLIDE 46

Timeouts & Retries

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: service-b
spec:
  hosts:
    - service-b.default.svc.cluster.local
  http:
    - route:
        - destination:
            host: service-b.default.svc.cluster.local
            subset: v1
      # Overall deadline for the request, including all retries
      timeout: 5s
      retries:
        attempts: 6
        perTryTimeout: 2s
        retryOn: gateway-error,connect-failure

SLIDE 47

Circuit Breakers

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: service-b
spec:
  host: service-b.default.svc.cluster.local
  trafficPolicy:
    # Connection limits: requests beyond them are rejected
    connectionPool:
      tcp:
        maxConnections: 1
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
    # Eject a host from the pool for 3 minutes after a single error
    outlierDetection:
      consecutiveErrors: 1
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100

SLIDE 48

Observability

SLIDE 49

What is observability?

Act of measuring, collecting, and analyzing metrics, traces, logs, events, … from services

SLIDE 50

Logging

Log granularity: verbose, debug, warning, info, error, ...
Storage is cheap - rather log more than less
Store logs in a central place
Use correlation IDs
Don't log private information
Use a common format
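
A minimal sketch of two of these points, correlation IDs and a common format, using only Python's standard `logging` module. The JSON field names and the `make_logger` helper are assumptions; in practice the correlation ID usually arrives in a request header and is passed on to downstream calls:

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Emit every record as one JSON object so all services share a format."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })


def make_logger(name, stream=None):
    handler = logging.StreamHandler(stream)  # stream=None means stderr
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# One ID per incoming request, attached to every line logged for it.
correlation_id = uuid.uuid4().hex
log = logging.LoggerAdapter(make_logger("orders"),
                            {"correlation_id": correlation_id})
log.info("order %s received", 42)
```

Searching the central log store for one correlation ID then returns every line written for that request, across services.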

SLIDE 51

Monitoring & Alerting

Monitoring
  Know the state of your system
  Collect system/infrastructure metrics (e.g. CPU, memory, ...) and business metrics (instrument services)
Alerting
  Queries that look for failures/failure conditions
  PagerDuty -> alert on-call engineers
  Only fire alerts when something is really wrong
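
One way to read "only fire alerts when something is really wrong" is to require both a high error rate and enough traffic to trust it. A minimal sketch; the function name and the thresholds are illustrative assumptions, not recommended values:

```python
def should_alert(errors, requests, max_error_rate=0.05, min_requests=100):
    """Fire only when the error rate is high and the sample is large enough."""
    if requests < min_requests:
        return False  # too little data: avoid paging on noise
    return errors / requests > max_error_rate
```

With these defaults, 1 error in 10 requests stays quiet, while 100 errors in 1,000 requests pages the on-call engineer.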

SLIDE 52

Tools

Grafana
Jaeger
Kiali
EFK (Elasticsearch + Fluentd + Kibana)
PagerDuty

SLIDE 53

SLIDE 54

Resources

Kubernetes on Oracle Cloud (OKE) - https://cloud.oracle.com
Kubernetes - https://kubernetes.io
Istio - https://istio.io
Oracle Microservices Example MuShop - https://github.com/oracle-quickstart/oci-cloudnative

SLIDE 55

Thank you

Slides: https://slides.peterj.dev
Contact
  @pjausovec
  https://peterj.dev
