Cluster management at Google with Borg - coping with scale (2016-11)




slide-1
SLIDE 1
slide-2
SLIDE 2

Cluster management at Google with Borg - coping with scale

2016-11

john wilkes / johnwilkes@google.com
Principal Software Engineer
Derived from EuroSys'15 paper (http://goo.gl/1C4nuo)
CC-BY-NC-ND Creative Commons license

slide-3
SLIDE 3


the system we internally call Borg

slide-4
SLIDE 4

Borg contributors

Core: Abhishek Rai, Abhishek Verma, Andy Zheng, Ashwin Kumar, Ben Smith, Beng-Hong Lim, Bin Zhang, Bolu Szewczyk, Brad Strand, Brian Budge, Brian Grant, Brian Wickman, Chengdu Huang, Chris Colohan, Cliff Stein, Cynthia Wong, Daniel Smith, Dave Bort, David Oppenheimer, David Wall, Divyesh Shah, Dawn Chen, Eric Haugen, Eric Tune, Eric Wilcox, Ethan Solomita, Gaurav Dhiman, Geeta Chaudhry, Greg Roelofs, Grzegorz Czajkowski, James Eady, Jarek Kusmierek, Jaroslaw Przybylowicz, Jason Hickey, Javier Kohen, Jeff Dean, Jeremy Dion, Jeremy Lau, Jerzy Szczepkowski, Joe Hellerstein, John Wilkes, Jonathan Wilson, Joso Eterovic, Jutta Degener, Kai Backman, Kamil Yurtsever, Ken Ashcraft, Kenji Kaneda, Kevan Miller, Kurt Steinkraus, Leo Landa, Liza Fireman, Madhukar Korupolu, Maricia Scott, Mark Logan, Mark Vandevoorde, Markus Gutschke, Matt Sparks, Maya Haridasan, Michael Abd-El-Malek, Michael Kenniston, Ming-Yee Iu, Monika Henzinger, Mukesh Kumar, Nate Calvin, Onufry Wojtaszczyk, Olcan Sercinoglu, Paul Menage, Patrick Johnson, Pavanish Nirula, Pedro Valenzuela, Percy Liang, Piotr Witusowski, Praveen Kallakuri, Rafal Sokolowski, Rajmohan Rajaraman, Richard Gooch, Rishi Gosalia, Rob Radez, Robert Hagmann, Robert Jardine, Robert Kennedy, Rohit Jnagal, Roy Bryant, Rune Dahl, Scott Garriss, Scott Johnson, Sean Howarth, Sheena Madan, Smeeta Jalan, Stan Chesnutt, Temo Arobelidze, Tim Hockin, Todd Wang, Tomasz Blaszczyk, Tomasz Wozniak, Tomek Zielonka, Victor Marmol, Vish Kannan, Vrigo Gokhale, Walfredo Cirne, Walt Drummond, Weiran Liu, Xiaopan Zhang, Xiao Zhang, Ye Zhao, and Zohaib Maya.

SRE: Adam Rogoyski, Alex Milivojevic, Anil Das, Cody Smith, Cooper Bethea, Folke Behrens, Matt Liggett, James Sanford, John Millikin, Matt Brown, Miki Habryn, Peter Dahl, Robert van Gent, Seppi Wilhelmi, Seth Hettich, Torsten Marek, and Viraj Alankar.

BCL and borgcfg: Marcel van Lohuizen and Robert Griesemer.

Reviewers: Christos Kozyrakis, Eric Brewer, Malte Schwarzkopf, and Tom Rodeheffer.

slide-5
SLIDE 5

http://www.google.com/about/datacenters/inside/locations/index.html

slide-6
SLIDE 6

http://googleasiapacific.blogspot.se/2015/06/growing-our-data-center-in-singapore.html

slide-7
SLIDE 7

Image by Connie Zhou

slide-8
SLIDE 8

job hello_world = {
  runtime = { cell = 'ic' }              // Cell (cluster) to run in
  binary = '.../hello_world_webserver'   // Program to run
  args = { port = '%port%' }             // Command line parameters
  requirements = {                       // Resource requirements (optional)
    ram = 100M
    disk = 100M
    cpu = 0.1
  }
  replicas = 5                           // Number of tasks
}

10000

User view
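As a rough illustration of what the hello_world job specification on this slide asks for, the sketch below multiplies the per-task requirements by the replica count. The `JobSpec` class and its field names are hypothetical stand-ins, not part of BCL or borgcfg, and reading "100M" as megabytes is an assumption; only the numbers come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical mirror of the hello_world job on the slide (not a Borg API)."""
    name: str
    cell: str
    replicas: int
    cpu_cores: float   # per task
    ram_mb: int        # per task; "100M" on the slide, assumed to mean megabytes
    disk_mb: int       # per task; same assumption

    def total_request(self) -> dict:
        """Aggregate request for the whole job: per-task request times replicas."""
        return {
            "cpu_cores": self.cpu_cores * self.replicas,
            "ram_mb": self.ram_mb * self.replicas,
            "disk_mb": self.disk_mb * self.replicas,
        }


hello_world = JobSpec("hello_world", cell="ic", replicas=5,
                      cpu_cores=0.1, ram_mb=100, disk_mb=100)
print(hello_world.total_request())
```

With five replicas the whole job asks for about half a core and 500M each of RAM and disk. Changing `replicas` is the only edit needed to scale the request.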

slide-9
SLIDE 9

User view

slide-10
SLIDE 10

What just happened?

Diagram labels: config file, binary, borgcfg, web browsers, BorgMaster (shown five times, each with a link shard and a read/UI shard), scheduler, persistent store (Paxos), Borglets, Cell

User view

slide-11
SLIDE 11

Hello world! (repeated many times across the slide)

Image by Connie Zhou

User view


slide-12
SLIDE 12

User view

slide-13
SLIDE 13

task-eviction rates and causes


Failures

slide-14
SLIDE 14

Images by Connie Zhou

A 2000-machine service will have >10 task exits per day

This is not a problem: it's normal

Failures
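A back-of-the-envelope reading of the task-exit figure on this slide. It assumes one task per machine and exactly 10 exits per day (the slide only gives a lower bound), so the numbers are illustrative and the function names are made up for this sketch.

```python
def per_task_exit_rate(exits_per_day: float, tasks: int) -> float:
    """Average exits per task per day, assuming exits are spread evenly."""
    return exits_per_day / tasks


def mean_days_between_exits(exits_per_day: float, tasks: int) -> float:
    """Average number of days a single task runs between exits."""
    return tasks / exits_per_day


# Assumptions: one task per machine, and exactly 10 exits/day (slide says ">10").
rate = per_task_exit_rate(10, 2000)
days = mean_days_between_exits(10, 2000)
print(f"{rate:.3f} exits per task per day, one exit every {days:.0f} days per task")
```

Under these assumptions each individual task runs for about 200 days between exits, yet the service as a whole sees an exit every couple of hours, which is why the slide treats exits as normal operation.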

slide-15
SLIDE 15
slide-16
SLIDE 16

Advanced bin-packing algorithms

Experimental placement of production VM workload, July 2014

Efficiency

Figure labels: stranded resources, available resources, one machine
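To make "stranded resources" concrete, here is a minimal sketch that packs tasks with a plain first-fit-decreasing heuristic and then counts CPU left idle on machines that have run out of RAM. First-fit-decreasing is a textbook stand-in chosen for brevity, not the placement algorithm Borg uses, and the machine sizes and task requests are made up.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    cpu_free: float
    ram_free: float

    def fits(self, cpu: float, ram: float) -> bool:
        return cpu <= self.cpu_free and ram <= self.ram_free

    def place(self, cpu: float, ram: float) -> None:
        self.cpu_free -= cpu
        self.ram_free -= ram


def first_fit_decreasing(tasks, machines):
    """Place (cpu, ram) tasks on the first machine with room, largest first.

    Returns the tasks that could not be placed anywhere.
    """
    unplaced = []
    for cpu, ram in sorted(tasks, reverse=True):
        for machine in machines:
            if machine.fits(cpu, ram):
                machine.place(cpu, ram)
                break
        else:
            unplaced.append((cpu, ram))
    return unplaced


def stranded_cpu(machines, min_ram: float) -> float:
    """CPU that is free but unusable because less than min_ram is left on the machine."""
    return sum(m.cpu_free for m in machines if m.ram_free < min_ram)


# Made-up example: two machines with 8 cores and 16 GiB each.
machines = [Machine(cpu_free=8, ram_free=16) for _ in range(2)]
tasks = [(2, 12), (2, 12), (1, 4), (1, 4)]   # (cores, GiB) requests
left_over = first_fit_decreasing(tasks, machines)
print(left_over, stranded_cpu(machines, min_ram=1))
```

Every task is placed, but both machines end up with no RAM and 5 idle cores each, so 10 of the 16 cores are stranded. A better packing policy tries to leave the free resources in usable proportions.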

slide-17
SLIDE 17

tasks per machine

Multiple applications per machine

CPI^2 paper, EuroSys 2013

Efficiency

slide-18
SLIDE 18


Sharing clusters between prod/batch helps

Segregating them would need more machines

Efficiency

Figure: # machines for shared cell (original), shared cell (compacted), non-prod load (compacted), prod-only load (compacted)
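One intuition for why segregation costs machines: if prod and batch load peak at different times, a shared pool is sized for the peak of the sum, while segregated pools are sized for the sum of the peaks. The sketch below uses made-up hourly load profiles and a hypothetical `machines_needed` helper; the slide's own numbers come from compaction experiments on real cells, not from this model.

```python
import math


def machines_needed(load_per_hour, cores_per_machine: int) -> int:
    """Machines required to cover the peak of an hourly load profile (in cores)."""
    return math.ceil(max(load_per_hour) / cores_per_machine)


# Made-up hourly demand in cores: prod peaks in the day, batch fills in at night.
prod = [200, 300, 800, 900, 700, 300]
batch = [700, 600, 100, 100, 200, 600]
combined = [p + b for p, b in zip(prod, batch)]

segregated = machines_needed(prod, 10) + machines_needed(batch, 10)
shared = machines_needed(combined, 10)
print(segregated, shared)
```

With these numbers the segregated setup needs 160 machines and the shared cell needs 100, because the two peaks never coincide.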

slide-19
SLIDE 19

Sharing clusters between prod/batch helps

Segregating them would need more machines

Efficiency

Figure: # machines for shared cell (original), shared cell (compacted), non-prod load (compacted), prod-only load (compacted); additional label: overhead
slide-20
SLIDE 20

Waste

Sharing clusters between prod/batch helps

Segregating them would need more machines

15 production cells from a larger pool, omitting small ones (<5000 machines)

Efficiency

slide-21
SLIDE 21


Efficiency

Smaller cells would need more machines

slide-22
SLIDE 22

Bucketing to next-largest power of 2 would need more machines

prod only, starting from 0.5 cores, 0.5GiB ⇒ GCE Custom machine types


Efficiency
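The cost of bucketing can be sketched by rounding each request up to the next size in the series 0.5, 1, 2, 4, ... and comparing what is granted with what was asked for. The request values below are made up and the helper names are hypothetical; only the starting size of 0.5 comes from the slide.

```python
def bucket_up(request: float, smallest: float = 0.5) -> float:
    """Round a request up to the next bucket in the series smallest * 2**k."""
    size = smallest
    while size < request:
        size *= 2
    return size


def bucketing_overhead(requests, smallest: float = 0.5) -> float:
    """Fraction of extra resource allocated when every request is bucketed."""
    asked = sum(requests)
    granted = sum(bucket_up(r, smallest) for r in requests)
    return (granted - asked) / asked


cpu_requests = [0.3, 0.7, 1.1, 2.5, 3.0, 5.0]   # made-up core requests
print([bucket_up(r) for r in cpu_requests])
print(f"{bucketing_overhead(cpu_requests):.0%} extra cores")
```

A request that sits just above a bucket boundary is nearly doubled, so the overhead depends heavily on how requests are distributed relative to the boundaries.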

slide-23
SLIDE 23

There are no “obvious” resource-bucket sizes

  • cf. cloud VMs

Figure annotations: nice round numbers; gaming the system

Efficiency

slide-24
SLIDE 24

Resource reclamation

Efficiency

  • limit: amount of resource requested
  • usage: actual resource consumption
  • reservation: estimate of future usage

Figure: resource plotted against time, with a region marked as potentially reusable resources
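The three quantities defined on this slide can be related with a small sketch. The slide describes the reservation only as an estimate of future usage; the estimator used here (recent peak usage plus a fixed safety margin, capped at the limit) is an assumption chosen for illustration, not Borg's actual algorithm, and the function names and sample values are made up.

```python
def reservation(limit: float, recent_usage, safety_margin: float = 0.2) -> float:
    """Toy estimate of future usage: recent peak plus a margin, capped at the limit."""
    if not recent_usage:
        return limit          # no data yet: reserve everything that was requested
    estimate = max(recent_usage) * (1.0 + safety_margin)
    return min(limit, estimate)


def reclaimable(limit: float, recent_usage, safety_margin: float = 0.2) -> float:
    """Resource that could be lent to other work: limit minus reservation."""
    return limit - reservation(limit, recent_usage, safety_margin)


usage_samples = [1.0, 1.5, 2.0, 1.8]   # made-up CPU usage in cores
print(reservation(4.0, usage_samples), reclaimable(4.0, usage_samples))
```

A larger safety margin makes reclamation more conservative: fewer resources are lent out, but the task is less likely to be squeezed if its usage rises.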

slide-25
SLIDE 25

Resource reclamation could be more aggressive

Nov/Dec 2013


Efficiency

slide-26
SLIDE 26


slide-27
SLIDE 27

Diagram labels: config file, borgcfg, web browsers, BorgMaster (shown five times, each with a link shard and a read/UI shard), scheduler, persistent store (Paxos), Borglets, Cell

A few other moving parts

slide-28
SLIDE 28

Diagram labels: app, agent, master, job config

A few other moving parts

slide-29
SLIDE 29

Diagram labels: app, agent, master, system config, monitoring, security, accounting/planning, binaries + data distribution, job config, storage

Diagram from an original by Cody Smith.

A few other moving parts

slide-30
SLIDE 30

Diagram labels: app, agent, master, system config, monitoring, security, accounting/billing, binaries + data distribution, job config, storage

A few other moving parts

Diagram from an original by Cody Smith.

slide-31
SLIDE 31

κυβερνήτης: pilot or helmsman of a ship

http://kubernetes.io

  • Top 0.01% of all GitHub projects
  • 800+ unique contributors
  • 15000+ people signed up for k8s meetups

Kubernetes

slide-32
SLIDE 32

Diagram: 7 × Machine Host, each with a Container Agent

Kubernetes

Web server Log roller

slide-33
SLIDE 33

Log roller Web server

Diagram: 7 × Machine Host, each with a Container Agent

Kubernetes master/scheduler

Pods

slide-34
SLIDE 34

Diagram: 5 × FE, 9 × BE, 7 × Machine Host each with a Container Agent

Kubernetes master/scheduler

Labels

slide-35
SLIDE 35

Diagram: 5 × FE, 9 × BE, 7 × Machine Host each with a Container Agent

Kubernetes master/scheduler

Label selectors

labels:
  role: frontend

slide-36
SLIDE 36

Diagram: 5 × FE, 9 × BE, 7 × Machine Host each with a Container Agent

Kubernetes master/scheduler

Label selectors

labels:
  role: frontend
  stage: production
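A minimal sketch of equality-based label selection, the idea behind the two selector slides: a pod is selected when its labels contain every key/value pair in the selector. The pod names and the `matches`/`select` helpers are made up for illustration and are not the Kubernetes API; only the label keys and values echo the slides.

```python
def matches(selector: dict, labels: dict) -> bool:
    """Equality-based selection: every key/value in the selector must be present."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(pods: list, selector: dict) -> list:
    """Names of the pods whose labels satisfy the selector."""
    return [pod["name"] for pod in pods if matches(selector, pod["labels"])]


# Made-up pods; only the label keys and values echo the slides.
pods = [
    {"name": "fe-1", "labels": {"role": "frontend", "stage": "production"}},
    {"name": "fe-2", "labels": {"role": "frontend", "stage": "test"}},
    {"name": "be-1", "labels": {"role": "backend", "stage": "production"}},
]

print(select(pods, {"role": "frontend"}))
print(select(pods, {"role": "frontend", "stage": "production"}))
```

Adding a second label to the selector narrows the set: the first query returns both frontends, the second only the production one.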

slide-37
SLIDE 37

replicas: 3
template: ...
labels:
  role: frontend

Diagram: 3 × FE, 7 × Machine Host each with a Container Agent

Kubernetes - Master/Scheduler

Replica controller

slide-38
SLIDE 38

replicas: 4
template: ...
labels:
  role: frontend

Diagram: 4 × FE, 7 × Machine Host each with a Container Agent

Kubernetes - Master/Scheduler

Replica controller
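The two replica controller slides (replicas: 3, then replicas: 4) show a declarative count that the system converges on. A toy version of one reconciliation pass is sketched below; the `reconcile` function, the pod dictionaries, and the naming scheme are assumptions for illustration, not Kubernetes code.

```python
import itertools

_ids = itertools.count(1)   # hypothetical source of unique pod names


def reconcile(desired_replicas: int, running: list, template_labels: dict) -> list:
    """One pass of a toy reconciliation loop: converge running pods to the desired count."""
    pods = list(running)
    while len(pods) < desired_replicas:      # too few: start pods from the template
        pods.append({"name": f"fe-{next(_ids)}", "labels": dict(template_labels)})
    while len(pods) > desired_replicas:      # too many: stop the surplus
        pods.pop()
    return pods


labels = {"role": "frontend"}
pods = reconcile(3, [], labels)        # replicas: 3
pods = reconcile(4, pods, labels)      # spec edited to replicas: 4
pods = pods[1:]                        # simulate one pod dying
pods = reconcile(4, pods, labels)      # the next pass replaces it
print([p["name"] for p in pods])
```

The same pass handles scaling up, scaling down, and replacing failed pods, because it only compares the observed state with the declared one.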

slide-39
SLIDE 39

id: frontend-service
port: 9000
labels:
  role: frontend

Diagram: frontend-service, 4 × FE, 7 × Machine Host each with a Container Agent

Kubernetes - Master/Scheduler

Service
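A sketch of what the service on this slide provides: a stable name and port in front of whichever pods currently carry the matching label. The `Service` class, the pod addresses, and the round-robin choice of backend are all assumptions made for illustration; the id, port, and label are taken from the slide.

```python
import itertools


class Service:
    """Toy service: a stable name and port in front of whichever pods match its labels."""

    def __init__(self, service_id: str, port: int, selector: dict):
        self.id = service_id
        self.port = port
        self.selector = selector
        self._turn = itertools.count()

    def endpoints(self, pods: list) -> list:
        """Addresses of the pods whose labels contain every selector entry."""
        return [p["ip"] for p in pods
                if all(p["labels"].get(k) == v for k, v in self.selector.items())]

    def route(self, pods: list) -> str:
        """Pick a backend for one request, round-robin over the current endpoints."""
        targets = self.endpoints(pods)
        if not targets:
            raise LookupError(f"no pods match {self.selector}")
        return f"{targets[next(self._turn) % len(targets)]}:{self.port}"


# Made-up pod addresses; id, port and label come from the slide.
pods = [{"ip": f"10.0.0.{i}", "labels": {"role": "frontend"}} for i in range(1, 5)]
pods.append({"ip": "10.0.0.9", "labels": {"role": "backend"}})
frontend = Service("frontend-service", 9000, {"role": "frontend"})
print([frontend.route(pods) for _ in range(5)])
```

Because the endpoints are recomputed from labels on every call, pods can come and go without clients of `frontend-service` changing anything.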

slide-40
SLIDE 40

Kubernetes

Direct Borg analogues:

  • containers
  • pods
  • Kubelet
  • persistent, declarative specs
  • reconciliation loops
slide-41
SLIDE 41

New / improved:

  • labels
  • services
  • composable microservices

    ○ replication controller
    ○ horizontal autoscaler

  • IP per pod

Kubernetes

slide-42
SLIDE 42

Kubernetes & GCP

Kubernetes:

  • Open source container orchestration
  • Supports multiple cloud and bare-metal environments

Google Container Engine:

  • Kubernetes as a service
    ○ runs on GCE, part of GCP
  • Auto-upgrades, scaling, healing, monitoring, backup, ...

slide-43
SLIDE 43

Kubernetes & GCP

App Engine

  • Platform as a service
  • Auto-everything
  • Deploy from code

Container Engine

  • Containers as a service
  • Automation doesn’t limit control
  • Run any app

Compute Engine

  • Infrastructure as a service
  • Roll-your-own automation
  • Use VMs, disks, networks

slide-44
SLIDE 44

johnwilkes@google.com http://kubernetes.io

http://goo.gl/1C4nuo (Borg paper)

Images by Connie Zhou

Observations:

  1. Resiliency is achieved only by ruthless attention to detail
     a. ubiquitous software fault tolerance
     b. persistent, declarative specs
  2. We get efficiency by:
     a. sharing resources
     b. reclaiming unused allocations
  3. Containers make users more productive

slide-45
SLIDE 45