OpenStack Operations Quick Ramp-up and Survival Guide

slide-1
SLIDE 1

OpenStack Operations Quick Ramp-up and Survival Guide

Joshua Guan, Operations Engineer, IBM Bluemix Private Cloud, @joshuakwan
Fan He, Architect, IBM Bluemix Private Cloud, @fancyhe

slide-2
SLIDE 2

Joshua Guan, Operations Lead, IBM Bluemix Private Cloud
Fan He, Cloud Architect, IBM Bluemix Private Cloud

slide-3
SLIDE 3

A Little Bit of Background …

  • Bluemix Private Cloud is IBM’s private cloud as a service, based on OpenStack
  • Bluemix Private Cloud landed in China to support IBM’s Cloud business there
  • We were building an OpenStack Operations Team from scratch

slide-4
SLIDE 4

Agenda

  • Define an OpenStack Operations Team
      • Operating Model
      • Processes
      • Tooling
      • Teaming
  • Tooling Integration
  • Cliché: OpenStack upgrade, HA, Live Migration
slide-5
SLIDE 5

Operating OpenStack is like …

You thought you would work like this
And, welcome to the real world

slide-6
SLIDE 6

Define an OpenStack Operations Team

Operating Model

  • How the cloud services are offered
  • What is the SLA
  • Collaboration with Business Partners, Data Centers, backend teams, etc.

Processes

  • Operation Tiers
  • Escalation Levels
  • Incident Management
  • Change Management
  • Shifts
  • Onboard & Offboard

Tooling

  • Monitoring
  • Collaboration
  • Cloud Management
  • Knowledge Base
  • Security
  • Customer Support

Teaming

  • Roles and Responsibilities
  • Shift Model
slide-7
SLIDE 7

Operating Model

[Diagram: Customers, Support Entry Points, OpenStack Operations, OpenStack Service Offering, Service Level Agreement, Business Partner, Development Team, Data Center. Relationship labels: use, consume, complies, operates, route, collaborate/escalate.]

slide-8
SLIDE 8

Processes

Topics: Operation Tiers, Escalation Flows, Incident Management, Change Management, Shifts, Security

Operation Tiers

  • Roles
  • Responsibilities

Tier   Role                    Responsibilities
1      Support                 First line of defense
2      Operations              Deploy, upgrade, admin
3      OpenStack Engineering   Build the product
3      Network Engineering     Undercloud networks

slide-9
SLIDE 9

Processes

Escalation Flows

  • How tickets/alerts/incidents go between different tiers

[Diagram: customer, Tier 1, Tier 2, and three Tier 3 boxes]
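
The tiers and the escalation flow can be sketched as a small routing rule. This is only an illustration, not the team's actual tooling: the `next_tier` and `escalation_path` helpers are hypothetical, and choosing a Tier 3 team by problem domain is an assumption rather than something the slides state.

```python
# Hypothetical sketch of tiered escalation. Tier and team names come from the
# slides; the routing rule itself is an assumption made for illustration.
TIER3_TEAMS = {
    "openstack": "OpenStack Engineering",
    "network": "Network Engineering",
}


def next_tier(current, domain="openstack"):
    """Return who receives a ticket when `current` cannot resolve it."""
    if current == "customer":
        return "Support"             # Tier 1: first line of defense
    if current == "Support":
        return "Operations"          # Tier 2: deploy, upgrade, admin
    if current == "Operations":
        return TIER3_TEAMS[domain]   # Tier 3: picked by domain (assumed)
    return None                      # Tier 3 is the last stop in this sketch


def escalation_path(domain):
    """List every hop from the customer to the final Tier 3 team."""
    path, current = [], "customer"
    while current is not None:
        path.append(current)
        current = next_tier(current, domain)
    return path
```

For example, `escalation_path("network")` walks from the customer through Support and Operations to Network Engineering.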

slide-10
SLIDE 10

Processes

Incident Management

Definition                 Example
Priority Level             P0, P1, P2
Incident Definition        OpenStack node failure, data center network interruption
Management Activities      RFO, Outage Track
Response time              Immediate, 15min, 1hr
Update interval            Every 30min
Communication method       Customer ticket, email, statuspage.io
Escalation to leadership   1hr
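
The timing rules in this table lend themselves to a small policy object. The sketch below is hypothetical: the `Incident` class is not from the talk, and pairing P0/P1/P2 with Immediate/15min/1hr in order is an assumption based on how the slide lists them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: the slide lists "Immediate, 15min, 1hr" beside "P0, P1, P2";
# mapping them in order is a guess made for illustration.
RESPONSE_TIME = {
    "P0": timedelta(0),
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
}
UPDATE_INTERVAL = timedelta(minutes=30)      # "Update interval: every 30min"
LEADERSHIP_ESCALATION = timedelta(hours=1)   # "Escalation to leadership: 1hr"


@dataclass
class Incident:
    priority: str
    opened_at: datetime

    def respond_by(self):
        """Deadline for the first response."""
        return self.opened_at + RESPONSE_TIME[self.priority]

    def next_update_due(self, last_update):
        """When the next customer-facing update is due."""
        return last_update + UPDATE_INTERVAL

    def needs_leadership(self, now):
        """True once the incident has been open long enough to escalate."""
        return now - self.opened_at >= LEADERSHIP_ESCALATION
```

Keeping the thresholds as named constants makes the policy easy to review when the SLA changes.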

slide-11
SLIDE 11

Processes

Change Management

  • Different types of changes
  • How the change will be rolled out
  • When the change will be rolled out
  • Review and approval
  • Customer communication
slide-12
SLIDE 12

Processes

Shifts

[Diagram: shift schedule along a Time axis, with slots labeled at-work, on-call primary, and on-call secondary]

slide-13
SLIDE 13

Processes

Security

  • Security Compliance Activities
  • Health Check
  • Patch Reporting
  • Vulnerability Scanning
  • Continuous Business Need
slide-14
SLIDE 14

Tooling

[Diagram: OpenStack Operations tooling areas: Monitoring, Collaboration, Cloud Management, Knowledge Base, Security, Customer Support]

Monitoring

  • Monitoring
  • Alerting
  • Log Aggregation
  • Dashboard
slide-15
SLIDE 15

Tooling

Collaboration

  • Chat
  • File Sharing
  • Project Kanban
  • Shift Management
slide-16
SLIDE 16

Tooling

Cloud Management

  • CMDB
  • Asset Management
  • Change Management
  • Incident Management
slide-17
SLIDE 17

Tooling

Knowledge Base

  • Internal Wiki/Runbooks
  • Product Documents for Customers

slide-18
SLIDE 18

Tooling

Security

  • Access Management
  • Security Compliance Management
  • Health Checking
  • Patch Reporting
  • Vulnerability Scanning
slide-19
SLIDE 19

Tooling

Customer Support

  • Ticketing System
  • Customer Chat
  • Customer Satisfaction
  • Cloud Level Maintenance Communication
  • Site Level Maintenance Communication

slide-20
SLIDE 20

Teaming

[Diagram: Service Level Agreement, Service Availability, Shift Model]

slide-21
SLIDE 21

Teaming

  • 24x7 Availability
  • Spread the pain
  • Eliminate interruptions as far as possible

[Diagram: shift model along a Time axis. Operators on shift: Triage, at-work, on-call primary, on-call secondary. SME On-call 1, SME On-call 2, SME On-call 3: a primary and a secondary for each.]
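
One way to "spread the pain" is a simple rotation in which every group takes each duty in turn. The sketch below is an assumption about how such a rota could be computed, not the schedule the team used: the `rota` helper, the group names, and the choice of three groups are all hypothetical. Only the three state names come from the slide.

```python
# Hypothetical rotation: each group cycles through the three states named on
# the slide, offset by its position, so no group is on call all the time.
STATES = ["at-work", "on-call primary", "on-call secondary"]


def rota(groups, shift_index):
    """Map each group to its state for the given shift number."""
    n = len(STATES)
    return {
        group: STATES[(position + shift_index) % n]
        for position, group in enumerate(groups)
    }
```

With three groups, every shift has exactly one group in each state, and each group returns to the same state every third shift.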

slide-22
SLIDE 22

Tooling Integration

  • A lot of screens to watch
  • A lot of systems to work on
  • A lot of interruptions
  • Use your tools to “kill” them
slide-23
SLIDE 23

Tooling Integration

As a good start: Kill “context switch” – work on a single platform

slide-24
SLIDE 24

Tooling Integration

As a good start: Kill “context switch” – work on a single platform

slide-25
SLIDE 25

Tooling Integration

What’s next: Kill “all interruptions” – workflow automation across platforms

slide-26
SLIDE 26

Cliché – Where BOOOOOM Happens

  • Implementations & Operations: Change management
  • The Practices of Upgrade
  • The Story of HA
  • The Myth of Live Migration
slide-27
SLIDE 27

Change management

  • “Infrastructure as Code”
  • Incoming change requests
  • Customer initiated requirements
  • Internal enhancements roll out
  • Compliance
  • Change planning for Consistency
  • Priorities
  • Dependencies
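
Planning changes by priority and dependency can be sketched as a topological sort. This is a minimal illustration using Python's standard `graphlib` module; the `plan_changes` helper and the change names in the usage note are hypothetical, and treating a lower number as higher priority is an assumption.

```python
from graphlib import TopologicalSorter


def plan_changes(changes):
    """Order changes so that dependencies always come first.

    `changes` maps a change name to (priority, [names it depends on]).
    Within each batch of changes that are ready at the same time, the lower
    priority number goes first (an assumed convention).
    """
    sorter = TopologicalSorter(
        {name: deps for name, (_, deps) in changes.items()}
    )
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    order = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready(), key=lambda name: changes[name][0])
        for name in ready:
            order.append(name)
            sorter.done(name)
    return order
```

For instance, given `{"patch-kernel": (1, []), "rotate-certs": (0, []), "upgrade-nova": (2, ["patch-kernel"])}`, the two independent changes are ordered by priority and `upgrade-nova` is scheduled only after `patch-kernel`.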
slide-28
SLIDE 28

OpenStack Upgrade

  • Prerequisites: deployment automation
  • Consistency – cloud configurations in CMDB
  • Idempotency – code to run OpenStack upgrade
  • Upgrade process design
  • Upgrade orchestration
  • Repeatable success & minimum disruption

Reference: Upgrading OpenStack: A Best Practices Guide
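
Idempotency here means that running the upgrade code a second time changes nothing. The sketch below illustrates the idea with plain dictionaries and makes no OpenStack or CMDB API calls: `apply_desired_state` is a hypothetical helper, and `desired` stands in for whatever configuration the CMDB records.

```python
def apply_desired_state(node, desired):
    """Bring `node` (a dict of settings) in line with `desired`.

    Returns the keys that had to change. An empty list means the node was
    already consistent, so a repeated run is a no-op.
    """
    changed = []
    for key, value in desired.items():
        if node.get(key) != value:
            node[key] = value
            changed.append(key)
    return changed
```

Comparing before writing is what makes the step safe to rerun after a partial failure: a retry only touches the settings that are still out of line.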

slide-29
SLIDE 29

Let’s talk about High Availability…

  • Architecture decisions for HA
  • Eliminate SPOF; Non-disruptive upgrade; Load Balancing; …
  • Inherent availability = MTTF / (MTTF + MTTR)
  • HA’s “dark side” for cloud operations
  • Recovery with HA resetting
  • Complexity’s impact on recovery time
  • Mitigation plan
  • Built-in monitoring for HA mechanism
  • Recovery automation
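
The availability formula above shows why recovery time matters as much as failure rate. The numbers in this sketch are made up for illustration and are not measurements from the talk.

```python
def inherent_availability(mttf_hours, mttr_hours):
    """Inherent availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Illustrative values only: same failure rate, but recovery takes twice as
# long in the second case (for example, because the HA setup is harder to reset).
simple = inherent_availability(mttf_hours=999, mttr_hours=1)
complex_ha = inherent_availability(mttf_hours=999, mttr_hours=2)
```

If added complexity doubles MTTR without improving MTTF, availability goes down, which is the "dark side" the slide points at.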
slide-30
SLIDE 30

Live Migration?

  • Does “nova live-migration” work?
  • Manage customer expectations
  • Abuse prevention
  • Limited appropriate scenarios
  • Automation with caution
  • Integration with pre- and post-verification routine

Reference: Live Migration is a Perk, not a Panacea

@kiwik http://kiwik.github.io/openstack/2015/05/23/Nova-Live-Migration-Workflow/
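
"Automation with caution" and the pre- and post-verification routine can be sketched as a wrapper that refuses to migrate unless every pre-check passes. This is a hypothetical illustration: `migrate_with_checks` is not a Nova API, and the checks and the migrate step are callables supplied by the caller, so nothing here talks to OpenStack.

```python
def migrate_with_checks(server, pre_checks, migrate, post_checks):
    """Run `migrate(server)` only if every pre-check passes, then verify.

    `pre_checks` and `post_checks` map a check name to a callable that takes
    the server and returns True or False. Returns (ok, failed_check_names).
    """
    failed = [name for name, check in pre_checks.items() if not check(server)]
    if failed:
        return False, failed   # refuse to migrate; nothing was changed
    migrate(server)
    failed = [name for name, check in post_checks.items() if not check(server)]
    return not failed, failed
```

Returning the names of the failed checks gives the operator something concrete to act on instead of a bare failure.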

slide-31
SLIDE 31

11:25 Kickoff with Todd Moore

IBM Vice President, Open Technology

11:30 OpenStack for Beginners

Shamail Tahir • Tyler Britten

12:15 The Open Cloud: A Platform of Possibilities

Jesse Proudman • Azmir Mohamed

2:15 Don’t Just Take Our Word for It: Use Cases from Materna & AT&T

Armin von Dolenga (Materna) • Jacob Caspi (AT&T)

3:05 Part 1 - Designing Effective Microservices

Manuel Silveyra

3:55 Part 2 - Deploying Infrastructure Foundations

Shaun Murikami • Andrew Bodine

5:05 Part 3 - Delivering Application Microservices

Daniel Krook

5:55 Part 4 – Directing Deployments with DevOps

Megan Kostick • Michael Brewer • Manuel Silveyra

Microservices on the Open Cloud
Enterprise Perspectives

4:30 Join Brad Topol and the Interop Challenge Vendors for refreshments

The Open Cloud: Delivering Solutions with Choice

October 26th CCIB Room 116

slide-32
SLIDE 32

Thank You