Beyond DevOps: How Netflix Bridges the Gap - Josh Evans, Director of Operations Engineering


slide-1
SLIDE 1

Josh Evans - Director of Operations Engineering November 16, 2015

Beyond DevOps: How Netflix Bridges the Gap

slide-2
SLIDE 2

Technical Debt

  • Java 6
  • Perforce
  • Single Master Jenkins
  • Ant
  • CentOS
  • Asgard/Mimir

Fall 2013

slide-3
SLIDE 3

How do we drive broad-based change?

slide-4
SLIDE 4

The Paved Road

  • Java 7
  • Stash
  • Jenkins Shards
  • Gradle
  • Ubuntu
slide-5
SLIDE 5

Some said

  • You’re overloading us
  • Too many projects
  • Poor targeting

Others said

  • What took you so long?
  • We’ve moved on
  • Now we need to migrate

That’s great but… We’re paying a high tax

slide-6
SLIDE 6
  • Expectations gap
    – Division of labor
    – Timing of solutions
    – Leadership
  • Affects
    – Reputation
    – Relationships
    – Lost opportunities

Organizational Debt

slide-7
SLIDE 7

How do we bridge the gap?

slide-8
SLIDE 8

“Remember that TIME is money…”

slide-9
SLIDE 9

Time is a form of currency

slide-10
SLIDE 10
  • Product Engineering
  • Operations Engineering
  • Challenges & Strategies

Our time today…

slide-12
SLIDE 12

Product Innovation

winning moments of truth

slide-13
SLIDE 13
slide-14
SLIDE 14
slide-15
SLIDE 15
  • Every facet of the product
  • 1400 AB tests in the last year & accelerating

Continuous Innovation

slide-16
SLIDE 16

But wait, there’s more…

slide-17
SLIDE 17

Build It

  • design
  • code
  • build
  • bake
  • test
  • deploy

Run It

  • configure
  • monitor
  • triage
  • fix

…at scale, globally

You build it, you run it

slide-18
SLIDE 18

Internet

  • 1000s of starts per second
  • 100,000s of requests per second
  • 100,000,000 hours of content / day
  • 3 AWS Regions, 3 AZs per region
slide-19
SLIDE 19

Relentless product innovation

Building & running micro-services at scale, globally

slide-20
SLIDE 20
  • Product Engineering
  • Operations Engineering
  • Challenges & Strategies

Our time today…

slide-21
SLIDE 21

“DevOps is a software development method that emphasizes the roles of both software developers and other information-technology (IT) professionals with an emphasis on IT Operations.”

– Wikipedia

The Gap

slide-22
SLIDE 22

Why? How?

slide-23
SLIDE 23

Quality Velocity

Operational Excellence

slide-24
SLIDE 24

Operational Excellence is the continuous improvement of the management, design, and function of operational environments to achieve greater quality, velocity, and competitive advantage.

slide-25
SLIDE 25
  • Engineering Tools
  • Insight & Real-time Analytics
  • Performance & Reliability

Operations Engineering is the application of software engineering practices to achieve and sustain operational excellence.

slide-26
SLIDE 26

Operations Engineering

  • Service provider
  • Operational excellence driver
  • Cross-cutting solutions
  • Undifferentiated heavy lifting
slide-27
SLIDE 27
  • Product Engineering
  • Operations Engineering
  • Challenges & Strategies

Our time today…

slide-28
SLIDE 28
  • You’re overloading us
  • What took you so long?

Remember that feedback?

  • We made assumptions
    – Requirements: what & when
    – Time for non-product work

slide-29
SLIDE 29
  • Move from assumptions to knowledge
  • Effect change without imposing a tax?
  • Achieve and sustain operational excellence?

How do we…

slide-30
SLIDE 30

Time is a form of currency

slide-31
SLIDE 31

5 strategies for success in time-based economies

software & organizational engineering

slide-32
SLIDE 32
  • 1. Reach out
slide-33
SLIDE 33
  • What are your biggest operational pain points?
  • How can we help?
  • How well are we meeting your needs today?
  • What would you like to see from us in the future?

Listen

Shower, rinse, repeat

Talk to your engineering customers

slide-34
SLIDE 34

Grease the Squeaky Wheels

  • low tolerance for tax
  • more vocal than most
slide-35
SLIDE 35
  • High impact solutions
  • Clarity on deliverables
  • Lower operational tax
  • Leadership, innovation, and partnership

What they wanted

slide-36
SLIDE 36
  • Deliver on solutions
  • Better road map definition & communication
  • A more aggressive stance on automation
  • Deeper investment into leadership, innovation, planning

Our commitments

slide-37
SLIDE 37
  • 2. Make an impact
  • Apply what you’ve learned
  • Deliver what matters
slide-38
SLIDE 38
  • global cloud console
  • end to end delivery
  • automation platform
  • velocity with confidence
slide-39
SLIDE 39
slide-40
SLIDE 40

Pipelines - Automated Global Delivery

slide-41
SLIDE 41
slide-42
SLIDE 42
  • 3. Make it easy to do the right thing
slide-43
SLIDE 43
  • Engineering time is scarce
  • We must do more heavy lifting

Supply & Demand

slide-44
SLIDE 44
  • Spinnaker manual step
  • Automated migrations – Mimir

Provide on-ramps

slide-45
SLIDE 45

Automate proven practices

slide-46
SLIDE 46
  • Alerting and Monitoring
  • Apache & Tomcat Hardening
  • Automated Canary Analysis
  • Autoscaling
  • Chaos Participation
  • Consistent Naming
  • ELB Configuration
  • Healthcheck Configured
  • Red-Black Pipeline
  • Squeeze Testing
  • Timeout & Fallback Tuning
  • Workload Reliability

Production Ready?

slide-48
SLIDE 48

[Diagram: customers reach a load balancer that sends 95% of traffic to the Old Version (v1.0, 100 servers) and 5% to the New Version (v1.1, 5 servers); a Metrics box is also shown.]

Canaries
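
The 95% / 5% split on this slide is what you would get if every server took an equal share of requests: 100 / (100 + 5) is roughly 95%. A minimal sketch of that arithmetic and of weighted routing follows. The helper names `traffic_shares` and `pick_pool` are hypothetical, and the deck does not say how the Netflix load balancer actually assigns requests.

```python
import random


def traffic_shares(pools):
    """Return each pool's share of traffic, assuming every server takes equal load."""
    total = sum(pools.values())
    return {name: count / total for name, count in pools.items()}


def pick_pool(pools, rng):
    """Pick the pool for one request, weighted by server count (illustrative only)."""
    names = list(pools)
    weights = [pools[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Server counts taken from the slide; everything else is an assumption.
pools = {"v1.0": 100, "v1.1": 5}
shares = traffic_shares(pools)

# Simulate a batch of requests to see the split emerge.
rng = random.Random(0)
sample = [pick_pool(pools, rng) for _ in range(10_000)]
canary_fraction = sample.count("v1.1") / len(sample)
```

Keeping the canary pool small limits how many customers see a bad build while still producing enough traffic to compare metrics.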

slide-49
SLIDE 49

[Diagram: customers reach a load balancer that sends 100% of traffic to the New Version (v1.1, 100 servers); the Old Version (v1.0) has 0 servers; a Metrics box is also shown.]

Canaries

slide-50
SLIDE 50

Define

  • Metrics
  • A threshold

Every n minutes

  • Classify metrics
  • Compute score
  • Make a decision

Automated Canary Analysis
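
The loop on this slide (classify metrics, compute a score, make a decision) can be sketched as below. This is an illustration under stated assumptions, not the Netflix implementation: the classification rule (canary within 10% of baseline, lower is better), the metric names, and the `classify`, `canary_score`, and `decide` helpers are all hypothetical. A scheduler would call this every n minutes.

```python
def classify(baseline, canary, tolerance=0.10):
    """Label one metric 'pass' if the canary is at most `tolerance` worse than baseline.

    Assumes lower values are better for every metric (an illustrative simplification).
    """
    if baseline == 0:
        return "pass" if canary == 0 else "fail"
    return "pass" if (canary - baseline) / baseline <= tolerance else "fail"


def canary_score(baseline_metrics, canary_metrics, tolerance=0.10):
    """Return (score, labels) where score is the percentage of metrics that pass."""
    labels = {
        name: classify(baseline_metrics[name], canary_metrics[name], tolerance)
        for name in baseline_metrics
    }
    passed = sum(1 for label in labels.values() if label == "pass")
    return 100.0 * passed / len(labels), labels


def decide(score, threshold):
    """Continue the rollout only if the score meets the predefined threshold."""
    return "continue" if score >= threshold else "abort"


# Hypothetical metric snapshots for one analysis interval.
baseline = {"error_rate": 0.010, "latency_p99_ms": 250.0, "cpu_util": 0.55, "gc_pause_ms": 40.0}
canary = {"error_rate": 0.010, "latency_p99_ms": 260.0, "cpu_util": 0.56, "gc_pause_ms": 70.0}

score, labels = canary_score(baseline, canary)
decision = decide(score, threshold=80.0)
```

Here three of four metrics pass, so the score falls below the 80-point threshold and the sketch aborts the rollout.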

slide-51
SLIDE 51

Canary Analysis Performance Integration Tests Chaos Conformity Static Unit Tests

Make it easy to do the right thing

Static & Functional Testing

slide-52
SLIDE 52
  • 4. Reduce the cost of change
slide-53
SLIDE 53
  • Ongoing migrations
  • Library propagation
  • 100s of micro-services
  • Complex dependencies

Continuous, Broad-based Change

slide-54
SLIDE 54

Change Engineering

  • Locate
  • Communicate
  • Facilitate
slide-55
SLIDE 55
  • Automated forensics
    – Who last touched x?
    – What team?
    – Who was their manager?

Who owns this artifact, repository, service?
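
The three forensic questions on this slide chain together: find the last person to touch something, then look up their team and manager. A minimal sketch, assuming a commit log and an employee directory are already available as in-memory records. The `Commit` and `Employee` types, the `ownership` helper, and all sample names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    path: str
    author: str
    timestamp: int  # seconds since epoch


@dataclass(frozen=True)
class Employee:
    username: str
    team: str
    manager: Optional[str]


def last_toucher(commits, path):
    """Return the author of the most recent commit touching `path`, or None."""
    touching = [c for c in commits if c.path == path]
    if not touching:
        return None
    return max(touching, key=lambda c: c.timestamp).author


def ownership(commits, directory, path):
    """Answer: who last touched it, what team, and who is their manager?"""
    author = last_toucher(commits, path)
    if author is None or author not in directory:
        return None
    person = directory[author]
    return {"who": person.username, "team": person.team, "manager": person.manager}


# Hypothetical sample data.
commits = [
    Commit("svc/build.gradle", "alice", 100),
    Commit("svc/build.gradle", "bob", 200),
    Commit("lib/README", "alice", 300),
]
directory = {
    "alice": Employee("alice", "team-a", "carol"),
    "bob": Employee("bob", "team-b", "dave"),
}
owner = ownership(commits, directory, "svc/build.gradle")
```

Last-touch is only a heuristic for ownership; the person who last edited a file may have moved teams, which is one reason to join against current directory data.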

slide-56
SLIDE 56

Whitepages

  • Workday wrapper
  • App & REST API
  • Organization hierarchy
  • Metadata
  • Change log

(###) ###-####

slide-57
SLIDE 57

Krieger

  • REST-based service
  • Sources
    – Whitepages
    – Stash
    – Edda
    – Jenkins
    – Spinnaker
    – Etc…

{
  "content": {},
  "_links": {
    "employees": { "href": "/api/employees/" },
    "projects": { "href": "/api/projects/" },
    "teams": { "href": "/api/teams/" },
    "applications": { "href": "/api/applications/" },
    "jobs": { "href": "/api/build/jobs" },
    "masters": { "href": "/api/build/masters" },
    "projectDistribution": { "href": "/api/teams/projectDistribution" }
  }
}
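
The root document above advertises its collections through `_links`, so a client can discover endpoints instead of hard-coding them. A minimal sketch of that pattern using only the standard library and no network calls. The `link` and `search_url` helpers are hypothetical, and the `q` query parameter is assumed from the `/api/employees?q=jevans` example on the next slide.

```python
import json
from urllib.parse import urlencode

# Abbreviated copy of the root document shown on the slide.
ROOT_JSON = """
{
  "content": {},
  "_links": {
    "employees": { "href": "/api/employees/" },
    "teams": { "href": "/api/teams/" },
    "jobs": { "href": "/api/build/jobs" }
  }
}
"""


def link(resource, rel):
    """Return the href for link relation `rel`, or None if it is absent."""
    return resource.get("_links", {}).get(rel, {}).get("href")


def search_url(resource, rel, query):
    """Build a search URL for a collection (the `q` parameter is an assumption)."""
    href = link(resource, rel)
    if href is None:
        raise KeyError(rel)
    return href.rstrip("/") + "?" + urlencode({"q": query})


root = json.loads(ROOT_JSON)
employees_href = link(root, "employees")
jevans_search = search_url(root, "employees", "jevans")
```

The same `link` helper works on any resource that carries a `_links` object, such as an individual employee record.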

slide-58
SLIDE 58

/api/employees?q=jevans

{
  "employees": [
    {
      "id": "241",
      "firstName": "Josh",
      "lastName": "Evans",
      "username": "jevans",
      "email": "jevans@netflix.com",
      "jobTitle": "Director of Operations Engineering",
      "isManager": true,
      "isCurrent": true,
      "title": "Josh Evans (jevans) - Operations Engineering",
      "_links": {
        "self": { "href": "/api/employees/241" },
        "manager": { "href": "/api/employees/117890" },
        "team": { "href": "/api/teams/f9134a81" },
        "projects": { "href": "/api/teams/f9134a81/projects" }
      }
    }
  ]
}

slide-59
SLIDE 59
  • Security vulnerabilities
    – Who owns this service?
  • Platform updates
    – Who is using this version of this library?

Today – Targeted Coordination
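
"Who is using this version of this library?" is a reverse lookup over dependency data. A minimal sketch, assuming each service's resolved dependencies are available as a simple mapping. The `build_index` and `users_of` helpers, the service names, and the version numbers are hypothetical, and the deck does not describe how Krieger stores this information.

```python
from collections import defaultdict


def build_index(manifests):
    """Invert {service: {library: version}} into {(library, version): {services}}."""
    index = defaultdict(set)
    for service, deps in manifests.items():
        for library, version in deps.items():
            index[(library, version)].add(service)
    return index


def users_of(index, library, version):
    """Return the sorted list of services that depend on this exact library version."""
    return sorted(index.get((library, version), set()))


# Hypothetical dependency manifests.
manifests = {
    "service-a": {"guava": "14.0.1", "jackson": "2.4.4"},
    "service-b": {"guava": "18.0"},
    "service-c": {"guava": "14.0.1"},
}
index = build_index(manifests)
affected = users_of(index, "guava", "14.0.1")
```

Combined with an ownership lookup, the list of affected services becomes a list of people to contact, which is what makes the coordination targeted.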

slide-60
SLIDE 60

Automated, efficient technical project management

  • Communication
  • Guidance
  • Tracking

Low tax for TPMs & engineers

Security Fix Guava

Future – Change Campaigns
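
The tracking part of a change campaign can be sketched as a per-target status table. This only illustrates the idea; the `Campaign` class, its status values, and the sample targets are hypothetical, and the deck describes change campaigns as future work without giving a design.

```python
from enum import Enum


class Status(Enum):
    NOTIFIED = "notified"
    IN_PROGRESS = "in_progress"
    DONE = "done"


class Campaign:
    """Track one broad-based change (for example, a library upgrade) across targets."""

    def __init__(self, name, targets):
        self.name = name
        # Every target starts as notified once the campaign is announced.
        self.status = {target: Status.NOTIFIED for target in targets}

    def update(self, target, status):
        """Record a new status for a known target."""
        if target not in self.status:
            raise KeyError(target)
        self.status[target] = status

    def progress(self):
        """Return the fraction of targets that are done."""
        if not self.status:
            return 1.0
        done = sum(1 for s in self.status.values() if s is Status.DONE)
        return done / len(self.status)

    def pending(self):
        """Return targets that still need follow-up, sorted by name."""
        return sorted(t for t, s in self.status.items() if s is not Status.DONE)


campaign = Campaign("guava-upgrade", ["service-a", "service-b", "service-c"])
campaign.update("service-b", Status.DONE)
campaign.update("service-c", Status.IN_PROGRESS)
remaining = campaign.pending()
```

Automating this bookkeeping is what keeps the tax low: neither the TPM nor the engineers have to maintain a spreadsheet by hand.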

slide-61
SLIDE 61
  • 5. Develop Partnerships

Beyond supply & demand

slide-62
SLIDE 62
  • Nearing completion
  • Aggressive schedule
  • Unexpected delays
  • Commitment to June delivery

Spinnaker 1.0 – 1H 2015

slide-63
SLIDE 63
  • Built their own continuous delivery solution
  • Not positioned for engineering-wide support
  • Believes in common solutions

Edge Engineering

slide-64
SLIDE 64

Partnership in Action

  • Strong relationship
  • Open discussions about concerns
  • Decision - leaned forward
  • +2 engineers on Spinnaker
  • Successful 1.0 launch
slide-65
SLIDE 65

Moving Forward Together

  • Containers?
  • Achieving alignment
  • Collaborative exploration
    – Edge, Platform, Operations
    – A new paved road?

slide-66
SLIDE 66
  • Paved Road adopted
    – Adding new ones
  • Production Ready ongoing
  • Migrations easier
  • Reputation improving
  • Improved
    – Service uptime
    – Rate of change

Payoffs

slide-67
SLIDE 67

Putting it to the test in 2016

  • Streaming production & test - EC2 Classic to VPC
  • Highly cross-functional
  • Complex dependencies
  • Zero downtime

Stay tuned…

slide-68
SLIDE 68

Five Strategies

1. Reach out
2. Make an impact
3. Make it easy to do the right thing
4. Reduce the cost of change
5. Develop partnerships

slide-69
SLIDE 69

Open Sourced!

https://netflix.github.io/

slide-70
SLIDE 70

Josh Evans

jevans@netflix.com
@ops_engineering

Questions?