
SLIDE 1

SLIDE 2

Cluster management at Google with Borg - coping with scale

2016-11

John Wilkes / johnwilkes@google.com, Principal Software Engineer

Derived from EuroSys'15 paper (http://goo.gl/1C4nuo)

CC-BY-NC-ND Creative Commons license

SLIDE 3


the system we internally call Borg

SLIDE 4

Borg contributors

Core: Abhishek Rai, Abhishek Verma, Andy Zheng, Ashwin Kumar, Ben Smith, Beng-Hong Lim, Bin Zhang, Bolu Szewczyk, Brad Strand, Brian Budge, Brian Grant, Brian Wickman, Chengdu Huang, Chris Colohan, Cliff Stein, Cynthia Wong, Daniel Smith, Dave Bort, David Oppenheimer, David Wall, Divyesh Shah, Dawn Chen, Eric Haugen, Eric Tune, Eric Wilcox, Ethan Solomita, Gaurav Dhiman, Geeta Chaudhry, Greg Roelofs, Grzegorz Czajkowski, James Eady, Jarek Kusmierek, Jaroslaw Przybylowicz, Jason Hickey, Javier Kohen, Jeff Dean, Jeremy Dion, Jeremy Lau, Jerzy Szczepkowski, Joe Hellerstein, John Wilkes, Jonathan Wilson, Joso Eterovic, Jutta Degener, Kai Backman, Kamil Yurtsever, Ken Ashcraft, Kenji Kaneda, Kevan Miller, Kurt Steinkraus, Leo Landa, Liza Fireman, Madhukar Korupolu, Maricia Scott, Mark Logan, Mark Vandevoorde, Markus Gutschke, Matt Sparks, Maya Haridasan, Michael Abd-El-Malek, Michael Kenniston, Ming-Yee Iu, Monika Henzinger, Mukesh Kumar, Nate Calvin, Onufry Wojtaszczyk, Olcan Sercinoglu, Paul Menage, Patrick Johnson, Pavanish Nirula, Pedro Valenzuela, Percy Liang, Piotr Witusowski, Praveen Kallakuri, Rafal Sokolowski, Rajmohan Rajaraman, Richard Gooch, Rishi Gosalia, Rob Radez, Robert Hagmann, Robert Jardine, Robert Kennedy, Rohit Jnagal, Roy Bryant, Rune Dahl, Scott Garriss, Scott Johnson, Sean Howarth, Sheena Madan, Smeeta Jalan, Stan Chesnutt, Temo Arobelidze, Tim Hockin, Todd Wang, Tomasz Blaszczyk, Tomasz Wozniak, Tomek Zielonka, Victor Marmol, Vish Kannan, Vrigo Gokhale, Walfredo Cirne, Walt Drummond, Weiran Liu, Xiaopan Zhang, Xiao Zhang, Ye Zhao, and Zohaib Maya.

SRE: Adam Rogoyski, Alex Milivojevic, Anil Das, Cody Smith, Cooper Bethea, Folke Behrens, Matt Liggett, James Sanford, John Millikin, Matt Brown, Miki Habryn, Peter Dahl, Robert van Gent, Seppi Wilhelmi, Seth Hettich, Torsten Marek, and Viraj Alankar.

BCL and borgcfg: Marcel van Lohuizen and Robert Griesemer.

Reviewers: Christos Kozyrakis, Eric Brewer, Malte Schwarzkopf, and Tom Rodeheffer.

SLIDE 5

http://www.google.com/about/datacenters/inside/locations/index.html

SLIDE 6

http://googleasiapacific.blogspot.se/2015/06/growing-our-data-center-in-singapore.html

SLIDE 7

Image by Connie Zhou

SLIDE 8

job hello_world = {
  runtime = { cell = 'ic' }            // Cell (cluster) to run in
  binary = '.../hello_world_webserver' // Program to run
  args = { port = '%port%' }           // Command line parameters
  requirements = {                     // Resource requirements (optional)
    ram = 100M
    disk = 100M
    cpu = 0.1
  }
  replicas = 5                         // Number of tasks
}

10000

User view
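The hello_world spec is declarative: it names a cell, a binary, arguments, resource requirements, and a replica count, and the system turns that into running tasks. The Python sketch below shows one way such a spec could be expanded into per-task specs. The `Job`/`Task` classes, `expand_job`, the `--key=value` flag format, and the per-task port numbering are hypothetical illustrations, not Borg's actual mechanism; the binary path is kept elided as in the original.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical mirror of the fields in the hello_world spec."""
    name: str
    cell: str
    binary: str
    args: dict
    ram_mb: int
    disk_mb: int
    cpu: float
    replicas: int


@dataclass
class Task:
    job: str
    index: int
    cell: str
    argv: list


def expand_job(job: Job, base_port: int = 20000) -> list:
    """Turn one declarative job into `replicas` concrete task specs.

    Assumptions (not Borg's real behavior): task i gets port base_port + i,
    and each arg becomes a --key=value flag.
    """
    tasks = []
    for i in range(job.replicas):
        argv = [job.binary]
        for key, value in job.args.items():
            argv.append(f"--{key}={value.replace('%port%', str(base_port + i))}")
        tasks.append(Task(job=job.name, index=i, cell=job.cell, argv=argv))
    return tasks


hello = Job(name="hello_world", cell="ic", binary=".../hello_world_webserver",
            args={"port": "%port%"}, ram_mb=100, disk_mb=100, cpu=0.1, replicas=5)
tasks = expand_job(hello)
for t in tasks:
    print(t.job, t.index, t.argv)
```

Changing the number of tasks only means editing `replicas`; the expansion step stays the same.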

SLIDE 9

User view

SLIDE 10

What just happened?

[Figure: Borg cell architecture. The config file and binary are submitted with borgcfg; a replicated BorgMaster (link shards and read/UI shards, backed by a Paxos-based persistent store) works with a separate scheduler; web browsers connect to the UI shards; each machine runs a Borglet.]

User view

SLIDE 11

[Image: machine racks overlaid with many copies of "Hello world!"]

Image by Connie Zhou

User view


SLIDE 12

User view

SLIDE 13

[Figure: task-eviction rates and causes]

Failures

SLIDE 14

Images by Connie Zhou

A 2000-machine service will have >10 task exits per day

This is not a problem: it's normal

Failures
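A rough way to see why a steady trickle of task exits is normal: treat each task's daily exit as an independent, rare event and use the Poisson approximation to the binomial. The one-task-per-machine assumption and the 1% per-task daily exit probability below are hypothetical, chosen only to illustrate the arithmetic, not measured Borg rates.

```python
import math


def poisson_tail(k: int, lam: float) -> float:
    """P(X > k) for X ~ Poisson(lam), computed as 1 minus the CDF at k."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))
    return 1.0 - cdf


# Hypothetical inputs: one task per machine, and each task independently
# exits (eviction, machine failure, crash, ...) with probability 1% per day.
machines = 2000
p_exit_per_day = 0.01

expected = machines * p_exit_per_day
print(f"expected exits/day: {expected:.1f}")
print(f"P(more than 10 exits in a day): {poisson_tail(10, expected):.3f}")
```

Even with a modest per-task rate, a service of this size should expect several exits every day, so recovering from them has to be routine.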

SLIDE 15

SLIDE 16

Advanced bin-packing algorithms

[Figure: experimental placement of production VM workload, July 2014]

Efficiency

[Figure legend: stranded resources, available resources, per machine]
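The figure contrasts stranded resources (free, but unusable because another resource on the same machine is exhausted) with available ones. The sketch below places tasks with a simple best-fit heuristic and then measures stranding. The machine shape, task sizes, and the rule that a machine unable to fit the smallest task strands all its free resources are assumptions for illustration; this is a baseline, not the bin-packing algorithm Borg uses.

```python
from dataclasses import dataclass

CAP_CPU, CAP_RAM = 16.0, 64.0  # hypothetical machine shape: cores, GiB


@dataclass
class Machine:
    cpu: float = CAP_CPU  # free cores
    ram: float = CAP_RAM  # free GiB


def best_fit(machines, cpu, ram):
    """Index of the feasible machine left with the least normalized free space, or None."""
    best, best_slack = None, float("inf")
    for i, m in enumerate(machines):
        if m.cpu >= cpu and m.ram >= ram:
            slack = (m.cpu - cpu) / CAP_CPU + (m.ram - ram) / CAP_RAM
            if slack < best_slack:
                best, best_slack = i, slack
    return best


def stranded(machines, min_cpu, min_ram):
    """Sum free (cpu, ram) on machines that cannot fit even the smallest task.

    Such a machine has run out of one resource, so whatever it still has
    of the other resource is unusable (stranded).
    """
    s_cpu = s_ram = 0.0
    for m in machines:
        if m.cpu < min_cpu or m.ram < min_ram:
            s_cpu += m.cpu
            s_ram += m.ram
    return s_cpu, s_ram


machines = [Machine() for _ in range(3)]
tasks = [(8, 8), (8, 8), (2, 30), (2, 30), (4, 4)]  # hypothetical (cores, GiB) requests
for cpu, ram in tasks:
    i = best_fit(machines, cpu, ram)
    if i is None:
        raise RuntimeError(f"task ({cpu}, {ram}) does not fit")
    machines[i].cpu -= cpu
    machines[i].ram -= ram

print(stranded(machines, min_cpu=1, min_ram=1))
```

With these inputs one machine runs out of CPU with 48 GiB of RAM still free and another runs out of RAM with 8 cores free, so the sketch reports `(8.0, 48.0)` stranded; better packing aims to shrink exactly this kind of leftover.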

SLIDE 17

[Figure: tasks per machine]

Multiple applications per machine

CPI² paper, EuroSys 2013

Efficiency

SLIDE 18


Sharing clusters between prod/batch helps

Segregating them would need more machines

Efficiency

[Figure: number of machines needed for a shared cell (original), a shared cell (compacted), the non-prod load alone (compacted), and the prod-only load (compacted)]
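Why does sharing help? Serving (prod) and batch (non-prod) tasks often stress different resources, so mixing them lets one workload use what the other leaves behind. The sketch below counts the machines opened by first-fit decreasing for a combined versus a segregated workload. The task shapes and machine size are hypothetical, and this simple machine count is only loosely analogous to the compaction method behind the figure.

```python
CAP = (16.0, 64.0)  # hypothetical machine shape: (cores, GiB)


def machines_needed(tasks):
    """Machines opened by first-fit decreasing over (cpu, ram) requests."""
    free = []  # remaining (cpu, ram) on each opened machine
    # Largest task first, measured by its biggest normalized dimension.
    order = sorted(tasks, key=lambda t: max(t[0] / CAP[0], t[1] / CAP[1]), reverse=True)
    for cpu, ram in order:
        for i, (fc, fr) in enumerate(free):
            if fc >= cpu and fr >= ram:
                free[i] = (fc - cpu, fr - ram)
                break
        else:  # no open machine fits: open a new one
            free.append((CAP[0] - cpu, CAP[1] - ram))
    return len(free)


prod = [(3.0, 4.0)] * 8    # hypothetical CPU-heavy serving tasks
batch = [(1.0, 12.0)] * 8  # hypothetical RAM-heavy batch tasks

shared = machines_needed(prod + batch)
segregated = machines_needed(prod) + machines_needed(batch)
print(f"shared: {shared} machines, segregated: {segregated} machines")
```

Here the shared cell needs 3 machines and the segregated setup 4: the RAM left idle by CPU-heavy prod tasks absorbs some of the RAM-heavy batch work.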

SLIDE 19


[Figure: overhead of segregating prod and non-prod work]

SLIDE 20

Waste

Sharing clusters between prod/batch helps

Segregating them would need more machines

[Figure: 15 production cells from a larger pool, omitting small ones (<5000 machines)]

Efficiency

SLIDE 21


Efficiency

Smaller cells would need more machines

SLIDE 22

Bucketing to next-largest power of 2 would need more machines

prod only, starting from 0.5 cores, 0.5GiB ⇒ GCE Custom machine types


Efficiency
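Bucketing means rounding every request up to a fixed size before placing it. The arithmetic below rounds each dimension up to the next power of 2, starting from 0.5 cores and 0.5 GiB as on the slide, and reports how much extra capacity that reserves. The request list is hypothetical, and rounding each dimension independently is an assumption of this sketch.

```python
import math


def next_pow2_bucket(x: float, smallest: float = 0.5) -> float:
    """Round x up to smallest * 2**k for the least integer k >= 0."""
    if x <= smallest:
        return smallest
    return smallest * 2 ** math.ceil(math.log2(x / smallest))


# Hypothetical prod requests: (cores, GiB)
requests = [(0.7, 1.2), (1.5, 3.0), (2.2, 9.0), (5.0, 6.5)]

asked_cpu = sum(c for c, _ in requests)
asked_ram = sum(r for _, r in requests)
bucket_cpu = sum(next_pow2_bucket(c) for c, _ in requests)
bucket_ram = sum(next_pow2_bucket(r) for _, r in requests)
print(f"CPU overhead: {bucket_cpu / asked_cpu - 1:.0%}")
print(f"RAM overhead: {bucket_ram / asked_ram - 1:.0%}")
```

For these requests the buckets total 15 cores and 30 GiB against 9.4 cores and 19.7 GiB actually asked for, so roughly 60% and 52% more capacity would be reserved.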

SLIDE 23

There are no “obvious” resource-bucket sizes

cf. cloud VMs


[Figure annotations: "nice round numbers", "gaming the system"]

Efficiency

SLIDE 24

Resource reclamation

[Figure: limit, reservation, and usage plotted against time, with a region labelled "potentially reusable resources"]

limit: amount of resource requested
usage: actual resource consumption
reservation: estimate of future usage

Efficiency
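Reclamation relies on the reservation being smaller than the limit: the gap between them can be lent to other work. The estimator below is a minimal sketch under assumed rules (start at the limit, shrink gradually toward observed usage plus a safety margin, jump back up immediately when usage rises); the margin and decay constants are hypothetical and are not Borg's actual parameters.

```python
def update_reservation(reservation, usage, limit, margin=0.1, decay=0.2):
    """One step of a hypothetical reservation estimator.

    reservation: current estimate of future usage
    usage:       latest observed consumption
    limit:       amount of resource requested (the reservation never exceeds it)
    margin:      safety margin as a fraction of usage (assumed value)
    decay:       fraction of the gap closed per step when shrinking (assumed value)
    """
    target = min(limit, usage * (1 + margin))
    if target >= reservation:
        return target  # usage is rising: grow immediately to stay safe
    return reservation - decay * (reservation - target)  # shrink gradually


limit = 4.0
reservation = limit  # start conservatively at the limit
for usage in [1.0, 1.0, 1.0, 1.0, 3.5]:
    reservation = update_reservation(reservation, usage, limit)
    print(f"usage={usage:.1f} reservation={reservation:.2f} reclaimable={limit - reservation:.2f}")
```

After four quiet steps the reservation drops from 4.0 to about 2.29, making roughly 1.7 units reclaimable; the spike to 3.5 pushes it straight back up to 3.85.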

SLIDE 25

Resource reclamation could be more aggressive

Nov/Dec 2013


Efficiency

SLIDE 26


SLIDE 27

[Figure: the Borg cell architecture diagram from slide 10, shown again]

A few other moving parts

SLIDE 28

[Diagram: app, agent, master, job config]

A few other moving parts

SLIDE 29

[Diagram: app, agent, and master, surrounded by system config, monitoring, security, accounting/planning, binaries + data distribution, job config, and storage]

Diagram from an original by Cody Smith.

A few other moving parts

SLIDE 30

[Diagram: app, agent, and master, surrounded by system config, monitoring, security, accounting/billing, binaries + data distribution, job config, and storage]

A few other moving parts

Diagram from an original by Cody Smith.

SLIDE 31

κυβερνήτης: pilot or helmsman of a ship

http://kubernetes.io

Top 0.01% of all GitHub projects
800+ unique contributors
15000+ people signed up for k8s meetups

Kubernetes

SLIDE 32

[Diagram: seven machines, each with Machine, Host, Container, and Agent layers]

Kubernetes

Web server, Log roller

SLIDE 33

Log roller, Web server

[Diagram: seven machines, each with Machine, Host, Container, and Agent layers]

Kubernetes master/scheduler

Pods

SLIDE 34

[Diagram: five FE and nine BE pods spread across seven machines, each with Machine, Host, Container, and Agent layers]

Kubernetes master/scheduler

Labels

SLIDE 35

[Diagram: five FE and nine BE pods spread across seven machines, each with Machine, Host, Container, and Agent layers]

Kubernetes master/scheduler

Label selectors

labels:
  role: frontend

SLIDE 36

[Diagram: five FE and nine BE pods spread across seven machines, each with Machine, Host, Container, and Agent layers]

Kubernetes master/scheduler

Label selectors

labels:
  role: frontend
  stage: production
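Label selectors pick objects by their labels rather than by name. The sketch below implements equality-based matching, where every key/value pair in the selector must be present on the object (pairs are ANDed, as in Kubernetes equality-based selectors). The pod names and labels are hypothetical.

```python
def matches(selector: dict, labels: dict) -> bool:
    """Equality-based selection: every key/value pair in the selector must be present."""
    return all(labels.get(key) == value for key, value in selector.items())


pods = {  # hypothetical pod names and labels
    "fe-1": {"role": "frontend", "stage": "production"},
    "fe-2": {"role": "frontend", "stage": "test"},
    "be-1": {"role": "backend", "stage": "production"},
}

frontends = [n for n, lbl in pods.items() if matches({"role": "frontend"}, lbl)]
prod_frontends = [n for n, lbl in pods.items()
                  if matches({"role": "frontend", "stage": "production"}, lbl)]
print(frontends)       # ['fe-1', 'fe-2']
print(prod_frontends)  # ['fe-1']
```

Adding a second label to the selector narrows the set, which is how one selector can address "all frontends" and another "production frontends" without any naming convention.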

SLIDE 37

replicas: 3
template: ...
labels:
  role: frontend

[Diagram: three FE pods on seven machines, each with Machine, Host, Container, and Agent layers]

Kubernetes - Master/Scheduler

Replica controller

SLIDE 38

replicas: 4
template: ...
labels:
  role: frontend

[Diagram: four FE pods on seven machines, each with Machine, Host, Container, and Agent layers]

Kubernetes - Master/Scheduler

Replica controller
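A replication controller is a reconciliation loop: it compares the number of pods matching its selector with `replicas` and creates or deletes pods to close the gap, as in the step from 3 to 4 frontends above. The function below sketches one pass of such a loop over an in-memory map of pods; the naming scheme, the choice of which pods to delete, and the assumption that the template's labels satisfy the selector are illustrative, not the Kubernetes controller's actual logic.

```python
def matches(selector, labels):
    """Equality-based selector: all key/value pairs must be present."""
    return all(labels.get(k) == v for k, v in selector.items())


def reconcile(replicas, selector, template_labels, pods, prefix="pod"):
    """One pass of a hypothetical replication-controller loop.

    Returns a new name -> labels map in which exactly `replicas` pods match
    `selector`. Assumes template_labels satisfies selector.
    """
    pods = dict(pods)  # leave the caller's map untouched
    matching = sorted(name for name, labels in pods.items() if matches(selector, labels))
    n = 0
    while len(matching) < replicas:  # too few: create pods from the template
        name = f"{prefix}-{n}"
        n += 1
        if name not in pods:
            pods[name] = dict(template_labels)
            matching.append(name)
    for name in matching[replicas:]:  # too many: delete the surplus
        del pods[name]
    return pods


pods = {"fe-0": {"role": "frontend"}, "fe-1": {"role": "frontend"},
        "fe-2": {"role": "frontend"}, "be-0": {"role": "backend"}}
frontend = {"role": "frontend"}
scaled_up = reconcile(4, frontend, frontend, pods, prefix="fe")  # replicas: 3 -> 4
scaled_down = reconcile(2, frontend, frontend, scaled_up, prefix="fe")
print(sorted(scaled_up))    # ['be-0', 'fe-0', 'fe-1', 'fe-2', 'fe-3']
print(sorted(scaled_down))  # ['be-0', 'fe-0', 'fe-1']
```

Running the same pass again on an already-reconciled map changes nothing, which is what makes it safe to repeat continuously in a loop.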

SLIDE 39

id: frontend-service
port: 9000
labels:
  role: frontend

[Diagram: frontend-service in front of four FE pods on seven machines, each with Machine, Host, Container, and Agent layers]

Kubernetes - Master/Scheduler

Service
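A service gives clients one stable name and port while the set of pods behind it, chosen by a label selector, changes. The class below is a hypothetical sketch that re-evaluates the selector on every request and spreads requests round-robin. The IP addresses are invented, round-robin is an assumed policy (real Kubernetes proxies may balance differently), and it assumes pods listen on the service's port.

```python
def matches(selector, labels):
    """Equality-based selector: all key/value pairs must be present."""
    return all(labels.get(k) == v for k, v in selector.items())


class Service:
    """Hypothetical sketch: a stable name and port in front of the pods a selector picks."""

    def __init__(self, name, port, selector):
        self.name, self.port, self.selector = name, port, selector
        self._next = 0  # round-robin cursor

    def endpoints(self, pods):
        # pods maps name -> (ip, labels); recomputed per call, so pods added
        # or removed by a replication controller are picked up automatically.
        return sorted(ip for ip, labels in pods.values() if matches(self.selector, labels))

    def pick(self, pods):
        eps = self.endpoints(pods)
        if not eps:
            raise LookupError(f"no endpoints for {self.name}")
        ep = eps[self._next % len(eps)]
        self._next += 1
        return f"{ep}:{self.port}"


pods = {  # hypothetical pod IPs
    "fe-0": ("10.0.0.1", {"role": "frontend"}),
    "fe-1": ("10.0.0.2", {"role": "frontend"}),
    "be-0": ("10.0.1.1", {"role": "backend"}),
}
svc = Service("frontend-service", 9000, {"role": "frontend"})
picks = [svc.pick(pods) for _ in range(3)]
print(picks)  # ['10.0.0.1:9000', '10.0.0.2:9000', '10.0.0.1:9000']
```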

SLIDE 40

Kubernetes

Direct Borg analogues:

containers
pods
Kubelet
persistent, declarative specs
reconciliation loops

SLIDE 41

New / improved:

labels
services
composable microservices
  ○ replication controller
  ○ horizontal autoscaler

IP per pod

Kubernetes
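The horizontal autoscaler listed above adjusts `replicas` automatically. The function below follows the ratio rule given in the Kubernetes documentation for the Horizontal Pod Autoscaler: desired = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1.0 (the documentation gives 0.1 as the default). The min/max bounds are arbitrary example values, and the sketch omits the real controller's other safeguards.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     tolerance=0.1, min_replicas=1, max_replicas=10):
    """Replica count suggested by the HPA ratio rule, clamped to [min, max].

    min_replicas/max_replicas defaults here are arbitrary example values.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough: avoid churning replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 90, 60))  # 6: load is 1.5x the target
print(desired_replicas(4, 63, 60))  # 4: within the 10% tolerance
print(desired_replicas(4, 20, 60))  # 2: scale down
```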

SLIDE 42

Kubernetes & GCP

Kubernetes:

Open source container orchestration
Supports multiple cloud and bare-metal environments

Google Container Engine:

Kubernetes as a service
  ○ runs on GCE, part of GCP
Auto-upgrades, scaling, healing, monitoring, backup, ...

SLIDE 43

Kubernetes & GCP

App Engine
Platform as a service
Auto-everything
Deploy from code

Container Engine
Containers as a service
Automation doesn't limit control
Run any app

Compute Engine
Infrastructure as a service
Roll-your-own automation
Use VMs, disks, networks

SLIDE 44

johnwilkes@google.com http://kubernetes.io

http://goo.gl/1C4nuo (Borg paper)

Images by Connie Zhou

Observations:

1. Resiliency is achieved only by ruthless attention to detail:
   a. ubiquitous software fault tolerance
   b. persistent, declarative specs

2. We get efficiency by:
   a. sharing resources
   b. reclaiming unused allocations

3. Containers make users more productive

SLIDE 45