Pivotal's best practices for achieving sustainable Cloud Operations



SLIDE 1

Pivotal’s best practices for achieving sustainable Cloud Operations

Konstantin Semenov - Principal Software Engineer

SLIDE 2

About myself

Started career in software engineering 22 years ago
Involved in a diverse range of projects – from database management to 3D modelling
Enjoy playing music in my spare time

SLIDE 3

Agenda

Who are Cloud Ops?
Pivotal Values
Google SRE
Extracted Practices
Questions

SLIDE 4

Who are Cloud Ops?

The reverse of DevOps
50% of time spent on development
Provide feedback to the product teams
Develop best practices for operators

SLIDE 5

Pivotal Core Values

eXtreme Programming
Test-driven development
Small releases ➡ Small updates
Pair programming ➡ Pair operations
Continuous integration ➡ Continuous upgrade
Collective ownership ➡ No superheroes, please
Sustainable pace ➡ Sane working hours

SLIDE 6

HumanOps

Humans are part of the system
Humans impact systems
Humans impact business
Human issues count as system issues
Escalate to humans as a last resort

SLIDE 7

HumanOps

Human metrics
System metrics

SLIDE 8

Cloud Ops EU

Our first experiment – a distributed operations team!
Regularly sharing context within distributed teams is hard

SLIDE 9

Comic Relief

A large-scale tele-marathon in the UK
Collected over £82 million in donations in one night
Deployed across AWS, vSphere and GCP

SLIDE 10

Google Partnership

We were invited to the CRE (Customer Reliability Engineering) trial run
Well-aligned with Pivotal principles

SLIDE 11

Service Levels

What does it mean to have 99.9%?
What is the SLI/SLO/SLA relationship?
How would you choose them?

SLIDE 12

Service Levels

What does it mean to have 99.9%?

Level     Outage per month (30 days)
99%       7.2 hours
99.9%     43.2 minutes
99.95%    21.6 minutes
99.99%    4.32 minutes
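
The outage figures follow directly from the availability level and the length of the window. A minimal sketch of that arithmetic, assuming a 30-day month (the function name and the default window are illustrative, not from the slides):

```python
MINUTES_PER_DAY = 24 * 60


def allowed_outage_minutes(availability_percent: float, window_days: int = 30) -> float:
    """Return the minutes of downtime permitted in the window at this availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return window_days * MINUTES_PER_DAY * unavailable_fraction


# Print the allowance for the levels listed on the slide.
for level in (99, 99.9, 99.95, 99.99):
    print(f"{level}%: {allowed_outage_minutes(level):.2f} minutes per 30 days")
```

At 99% this gives 432 minutes, or 7.2 hours; each additional "nine" cuts the allowance by a factor of ten.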

SLIDE 13

Service Levels

It’s all about risk assessment
Set clear expectations of performance

Issue                                                         MTTD     MTTR    MTBF      Impact  Loss (min/yr)
Containers run without dependent services                     3 min    ?       90 days   10%     1
VM exposed to Internet traffic                                120 min  60 min  365 days  10%     18
Applications can cause collateral damage to log availability  60 min   30 min  90 days   0%
Traffic spike prevents mitigation                             10 min   60 min  180 days  100%    142

The first row lists only one of MTTD/MTTR (3 min); the other value is missing.
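
The yearly loss figures appear consistent with (MTTD + MTTR) × (365 / MTBF) × impact. That formula is inferred from the two fully specified rows, not stated on the slide, so the sketch below should be read as an assumption. The `Risk` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    mttd_min: float   # mean time to detect, in minutes
    mttr_min: float   # mean time to repair, in minutes
    mtbf_days: float  # mean time between failures, in days
    impact: float     # affected fraction, 0.0 to 1.0

    def expected_loss_min_per_year(self) -> float:
        # Assumed model: each incident costs (MTTD + MTTR) minutes,
        # scaled by how often it happens and how much it affects.
        incidents_per_year = 365 / self.mtbf_days
        return (self.mttd_min + self.mttr_min) * incidents_per_year * self.impact


risks = [
    Risk("VM exposed to Internet traffic", 120, 60, 365, 0.10),
    Risk("Traffic spike prevents mitigation", 10, 60, 180, 1.00),
]

# Rank the risks by expected yearly loss, largest first.
for risk in sorted(risks, key=Risk.expected_loss_min_per_year, reverse=True):
    print(f"{risk.name}: {risk.expected_loss_min_per_year():.0f} min/yr")
```

Under this model the two rows come out at roughly 18 and 142 minutes per year, matching the table.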

SLIDE 14

Service Levels

It’s all about risk assessment
Set clear expectations of performance

SLIDE 15

Error budget

Is usually defined within a 30-day rolling window
Helps to prioritise innovation over stability
It’s a budget - it is meant to be spent
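
A rolling-window budget can be tracked with very little code. The sketch below is one possible approach, not the speaker's tooling: it assumes a time-based SLO, counts an outage in full if it started within the last 30 days, and uses made-up incident data.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=30)


def remaining_budget_minutes(slo_percent, outages, now):
    """Return the unspent error budget in minutes.

    outages is a list of (start, duration_minutes) pairs. An outage is
    counted in full when it started within the rolling window (a
    simplification; outages straddling the window edge are not split).
    """
    budget = WINDOW.total_seconds() / 60 * (1 - slo_percent / 100)
    spent = sum(minutes for start, minutes in outages if now - start <= WINDOW)
    return budget - spent


# Hypothetical incident history.
now = datetime(2024, 3, 31)
outages = [
    (datetime(2024, 2, 1), 20.0),   # older than 30 days, no longer counted
    (datetime(2024, 3, 10), 12.0),
    (datetime(2024, 3, 25), 5.5),
]

remaining = remaining_budget_minutes(99.9, outages, now)
print(f"Remaining budget: {remaining:.1f} minutes")
```

A positive remainder is room for riskier changes; a negative one suggests slowing down until older incidents leave the window.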

SLIDE 16

Pivotal Tracker

Web-based project management system
Over 100 000 active users
Runs on commercial distribution of Cloud Foundry
Migrated from AWS to GCP with no downtime

SLIDE 17

Platform updates

Security patches
General support timeframe
Scheduled nighttime/weekend maintenance windows
More error-prone due to human factors
No-one to ask for help when something fails

SLIDE 18

Deployment train

Inspired by agile release engineering
All pending updates are applied every morning

SLIDE 19

Train driver

Controls what changes board the train, and whether the train is allowed to leave
Holds the pager
On duty for a week
Writes deployment reports

SLIDE 20

Fire drills

Keep the teams in check with certain tooling
Can be performed with development teams
Are essential for becoming a train driver

SLIDE 21

Dungeons & Dragons

Helps develop troubleshooting skills
Gets the team more intimately familiar with various parts of the system
Is fun!

SLIDE 22

Toil snake

Toil reduction prioritisation tool
Clearly indicates the biggest pain

[Figure: toil snake with entries, in order: Backup ×3, Tunnel ×4, Backup ×2, Upgrades ×2]
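
The snake's entries can be tallied to show where the toil concentrates. This is an illustrative sketch using the labels visible on the slide; the variable names are hypothetical, and the real snake is a physical or visual board rather than code.

```python
from collections import Counter

# Toil entries in the order they appear on the slide's snake.
toil_snake = [
    "Backup", "Backup", "Backup",
    "Tunnel", "Tunnel", "Tunnel", "Tunnel",
    "Backup", "Backup",
    "Upgrades", "Upgrades",
]

# Count the entries per category, most frequent first.
toil_by_category = Counter(toil_snake).most_common()

for category, count in toil_by_category:
    print(f"{category:<10}{'#' * count} ({count})")

biggest_pain = toil_by_category[0][0]
print(f"Biggest pain: {biggest_pain}")
```

With these entries, backups account for five of the eleven blocks, which makes them the first candidate for automation.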

SLIDE 23

End of General Support

Reviewed weekly
Feedback to product teams

SLIDE 24

Bit Rot

Indicates how long a component has gone without an update
Surfaces update issues
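
A bit-rot indicator boils down to the age of each component's last update. The sketch below shows one way to compute it; the component names and dates are invented for illustration and do not describe the speaker's actual dashboard.

```python
from datetime import date


def bit_rot_days(last_updated: dict[str, date], today: date) -> list[tuple[str, int]]:
    """Return (component, days since last update), stalest component first."""
    ages = {name: (today - updated).days for name, updated in last_updated.items()}
    return sorted(ages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical components and update dates.
last_updated = {
    "stemcell": date(2024, 5, 20),
    "router": date(2024, 3, 1),
    "database": date(2023, 12, 1),
}

report = bit_rot_days(last_updated, today=date(2024, 6, 1))
for component, days in report:
    print(f"{component:<10}{days:>4} days since last update")
```

Sorting stalest-first puts the components most likely to hide update problems at the top of the list.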

SLIDE 25

Questions?

SLIDE 26

References

Cloud Foundry Foundation - https://www.cloudfoundry.org/
HumanOps - https://www.humanops.com/
Google SRE - https://landing.google.com/sre/

SLIDE 27

Thank you!
