Cluster management at Google
2015-02
john wilkes / johnwilkes@google.com
Principal Software Engineer

For the past 15 years, Google has been building out the world’s fastest, most powerful, highest quality cloud infrastructure on the planet.
Images by Connie Zhou

Hello World

    job hello_world = {
      runtime = { cell = 'ic' }             // What cluster should we run in?
      binary = '.../hello_world_webserver'  // What program are we to run?
      args = { port = '%port%' }            // Command line parameters
      requirements = {                      // Resource requirements
        ram = 100M
        disk = 100M
        cpu = 0.1
      }
      replicas = 10000                      // Number of tasks (10000 replaces the original 5 on the slide)
    }
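To get a feel for what this small config asks of the cell, the per-task requirements can be multiplied out across all replicas. The sketch below is only back-of-the-envelope arithmetic: it assumes "100M" means 100 megabytes and uses decimal units, and the names `job_footprint` and `footprint` are made up for illustration. Under those assumptions the job asks for roughly 1,000 cores and about 1 TB each of RAM and disk.

```python
# Per-task requirements copied from the hello_world job.
# Assumption: "100M" means 100 megabytes; decimal units keep the arithmetic simple.
CPU_PER_TASK = 0.1        # cores
RAM_PER_TASK_MB = 100
DISK_PER_TASK_MB = 100
REPLICAS = 10000


def job_footprint(replicas, cpu, ram_mb, disk_mb):
    """Return the total resources requested by all replicas of a job."""
    return {
        "cpu_cores": replicas * cpu,
        "ram_gb": replicas * ram_mb / 1000,
        "disk_gb": replicas * disk_mb / 1000,
    }


footprint = job_footprint(REPLICAS, CPU_PER_TASK, RAM_PER_TASK_MB, DISK_PER_TASK_MB)
```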

Hello World

    > borgcfg .../hello_world_webserver.borg up
    ...
    About to affect 10000 tasks and 1 packages on cell IC.
    Do you wish to continue (yes/no) [no]? yes
    ==== Staging package hello_world_webserver.63ce1b965155c75e/johnwilkes on ic... SUCCESS
    ==== Making package hello_world_webserver.63ce1b965155c75e/johnwilkes on ic... SUCCESS
    ==== Starting job hello_world on ic... SUCCESS

Hello World: what just happened?
[Diagram: borgcfg takes the config file and the binary and talks to the BorgMaster of the cell; web browsers reach the BorgMaster's read/UI shards. The BorgMaster is drawn with a persistent store (Paxos), a scheduler, and link shards that connect to the Borglets.]

Hello World Images by Connie Zhou

Failures
[Figure: task-eviction rates and causes]

A 2000-machine service will have >10 machine crashes per day
• DRAM errors (1% AFR)
• Disk failures (2-10% AFR)
• Machine crashes (~2/year)
• OS upgrades (2-6/year)
This is normal; not a problem
Images by Connie Zhou
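The ">10 machine crashes per day" figure follows directly from the per-machine rate on the slide. A minimal sketch of the arithmetic, assuming crashes are spread evenly over a 365-day year (the function name is made up for illustration):

```python
MACHINES = 2000
CRASHES_PER_MACHINE_PER_YEAR = 2      # "~2/year" from the slide
OS_UPGRADES_PER_MACHINE_PER_YEAR = (2, 6)   # "2-6/year" from the slide


def expected_events_per_day(machines, per_machine_per_year, days_per_year=365):
    """Average number of events per day across the whole fleet."""
    return machines * per_machine_per_year / days_per_year


crashes_per_day = expected_events_per_day(MACHINES, CRASHES_PER_MACHINE_PER_YEAR)
upgrades_per_day = tuple(
    expected_events_per_day(MACHINES, rate) for rate in OS_UPGRADES_PER_MACHINE_PER_YEAR
)
```

With these inputs the crash rate works out to just under 11 per day, and OS upgrades add at least as many planned restarts on top of that.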

Efficiency
Advanced bin-packing algorithms
[Figure: experimental placement of production VM workload, July 2014]

Efficiency
Advanced bin-packing algorithms
There are no obvious bucket sizes (cf. cloud VMs)
[Figure annotations: nice round numbers; gaming the system]
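Because requests do not fall into a few fixed bucket sizes, placement is a multi-dimensional bin-packing problem. The sketch below is a textbook first-fit-decreasing heuristic over CPU and RAM. It is not the algorithm Borg uses (the slides do not describe one); it only illustrates the kind of problem being solved. The `Machine` class, the task tuples and the integer units (millicores, megabytes) are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    free_cpu: int                      # millicores still unallocated
    free_ram: int                      # megabytes still unallocated
    tasks: list = field(default_factory=list)

    def fits(self, cpu, ram):
        return cpu <= self.free_cpu and ram <= self.free_ram

    def place(self, name, cpu, ram):
        self.free_cpu -= cpu
        self.free_ram -= ram
        self.tasks.append(name)


def first_fit_decreasing(tasks, machine_cpu, machine_ram):
    """Pack (name, cpu, ram) tasks onto identical machines, largest tasks first."""
    def size(task):
        # Rank a task by its most demanding dimension, as a fraction of a machine.
        _, cpu, ram = task
        return max(cpu / machine_cpu, ram / machine_ram)

    machines = []
    for name, cpu, ram in sorted(tasks, key=size, reverse=True):
        if cpu > machine_cpu or ram > machine_ram:
            raise ValueError(f"task {name} does not fit on an empty machine")
        # First fit: use the first machine with room, else open a new one.
        target = next((m for m in machines if m.fits(cpu, ram)), None)
        if target is None:
            target = Machine(machine_cpu, machine_ram)
            machines.append(target)
        target.place(name, cpu, ram)
    return machines
```

Integer units are used so that repeated subtraction of capacities stays exact.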

Efficiency
Advanced bin-packing algorithms
Heterogeneous workloads, May 2011
[Figure: CDF of job runtime (log scale) for batch jobs and service jobs]
Omega paper, EuroSys 2013

Efficiency
Utilization: sharing clusters between prod/batch helps

Efficiency
Advanced bin-packing algorithms
Data from a cluster with 12k machines, May 2011
Trace is publicly available
Heterogeneity and dynamicity of clouds at scale: Google trace analysis. SoCC’12

Efficiency
Resource reclamation could be more aggressive
Nov/Dec 2013

Efficiency
Multiple applications per machine
[Figure: tasks/machine and threads/machine]
CPI^2 paper, EuroSys 2013

Efficiency
Multiple applications per machine
1. Gather CPI for all the tasks in a job
2. Find outliers
3. Take action
outliers => victims
[Figure: distribution of task CPI, with markers at μ, μ + σ, μ + 2σ and μ + 3σ]
CPI^2 paper, EuroSys 2013
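Steps 1 and 2 can be sketched as a simple statistical filter: collect the CPI (cycles per instruction) of every task in a job, compute the mean μ and standard deviation σ, and flag tasks whose CPI lies well above the mean. The cutoff of 2σ below is an assumption chosen for illustration (the slide only marks μ + σ through μ + 3σ), and the function and task names are hypothetical. Step 3, taking action, is not modeled.

```python
from statistics import mean, pstdev


def find_cpi_outliers(cpi_by_task, num_sigmas=2.0):
    """Return the tasks whose CPI exceeds mean + num_sigmas * stddev for the job.

    cpi_by_task maps a task name to its measured cycles-per-instruction.
    """
    if not cpi_by_task:
        return {}
    values = list(cpi_by_task.values())
    mu = mean(values)
    sigma = pstdev(values)              # population standard deviation of the job
    threshold = mu + num_sigmas * sigma
    # A high CPI means the task is making slow progress: a likely victim.
    return {task: cpi for task, cpi in cpi_by_task.items() if cpi > threshold}
```

Comparing a task against the other tasks of the same job works because those tasks run the same binary and should have similar CPI when nothing interferes with them.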

Achieving desired behavior
Exposing mechanisms is fragile
Better: declarative intents

Achieving desired behavior: an SLO
Service level objective (SLO)
Examples:
• availability
• obtainability
• reliability
• velocity
• freshness?
• accuracy?
• security?

A few other moving parts
[Diagram: the same cell architecture as before. borgcfg (with the config file) and web browsers talk to the BorgMaster, which is drawn with read/UI shards, a persistent store (Paxos), a scheduler, and link shards that connect to the Borglets.]

A few other moving parts
[Diagram, built up over several slides: master, agent, app and job config; then storage, system config, monitoring, binaries + data distribution, security, and accounting/planning (shown as accounting/billing on the final slide)]
Diagram from an original by Cody Smith.

Containers
Everything at Google runs in a container -- including our VMs
Containers give us:
• resource isolation
• execution isolation
• CPU QoS
We start over 2 billion containers per week.
Image: "Container" glynlowe CC-BY-2.0 https://www.flickr.com/photos/glynlowe/10921733615
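To put "over 2 billion containers per week" into a more familiar unit, it can be converted to an average start rate. This is plain arithmetic on the slide's figure, assuming starts are spread evenly over the week; the constant names are made up for illustration.

```python
CONTAINERS_PER_WEEK = 2_000_000_000      # lower bound from the slide ("over 2 billion")
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

# Average container starts per second, assuming a uniform rate over the week.
starts_per_second = CONTAINERS_PER_WEEK / SECONDS_PER_WEEK
```

That is an average of more than 3,000 container starts every second.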

Kubernetes
κυβερνήτης: Greek for “pilot” or “helmsman of a ship”
The open source cluster manager from Google.
[Diagram: a row of machines]

Kubernetes
[Diagram: a row of machines (hosts), each running an agent and a container; example containers include a web server and a log roller]

Pods
[Diagram: the Kubernetes master/scheduler above a row of machines (hosts), each running an agent and a container; the web server and log roller containers are shown as an example]

Labels
[Diagram: the Kubernetes master/scheduler above a row of machines (hosts), each running an agent and a container; the containers carry FE and BE labels]

Label selectors
labels:
  role: frontend
[Diagram: the same FE- and BE-labeled containers under the Kubernetes master/scheduler]

Label selectors
labels:
  role: frontend
  stage: production
[Diagram: the same FE- and BE-labeled containers under the Kubernetes master/scheduler]
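A label selector of this kind can be read as "every key/value pair in the selector must also appear in the object's labels". The sketch below models that equality-based matching with plain dictionaries. It is an illustration of the idea, not the Kubernetes implementation, and the pod names and label values other than `role: frontend` and `stage: production` are hypothetical.

```python
def matches(selector, labels):
    """True if every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(pods, selector):
    """Return the names of the pods whose labels satisfy the selector."""
    return [name for name, labels in pods.items() if matches(selector, labels)]


# Hypothetical pods: name -> labels.
PODS = {
    "fe-1": {"role": "frontend", "stage": "production"},
    "fe-2": {"role": "frontend", "stage": "test"},
    "be-1": {"role": "backend", "stage": "production"},
}
```

Adding a second key to the selector, as the slide does with `stage: production`, narrows the result because both pairs have to match.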

Replica controller
replicas: 3
template: ...
labels:
  role: frontend
[Diagram: three FE containers running under the Kubernetes master/scheduler]

Replica controller
replicas: 4
template: ...
labels:
  role: frontend
[Diagram: four FE containers running under the Kubernetes master/scheduler]
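Changing `replicas` from 3 to 4 is a declarative request: the controller compares the desired count with the pods that currently match its labels and creates or removes pods to close the gap. A minimal sketch of one such reconciliation pass is shown below, using plain dictionaries. It assumes new pods are labeled exactly with the selector and that `new_names` yields names not already in use; both are simplifications for illustration, not how Kubernetes is implemented.

```python
from itertools import count


def matches(selector, labels):
    """True if every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def reconcile(pods, selector, replicas, new_names):
    """Return a copy of pods with exactly `replicas` pods matching the selector.

    pods maps a pod name to its labels; new_names is an iterator of unused names.
    """
    running = [name for name, labels in pods.items() if matches(selector, labels)]
    result = dict(pods)
    # Too few replicas: start new pods from the template (here, just the labels).
    for _ in range(replicas - len(running)):
        result[next(new_names)] = dict(selector)
    # Too many replicas: stop the surplus pods.
    for name in running[replicas:]:
        del result[name]
    return result


# Hypothetical name generator for newly created pods.
fresh_names = (f"fe-new-{i}" for i in count())
```

Pods that do not match the selector are left alone, which is why the same pass can run repeatedly without side effects once the counts agree.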

Service
id: frontend-service
port: 9000
labels:
  role: frontend
[Diagram: frontend-service in front of four FE containers running under the Kubernetes master/scheduler]
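A service gives a stable name and port to whichever pods currently match its labels. The sketch below models that as a small class that resolves its selector against the current set of pods and hands out backends in turn. The round-robin policy, the `Service` class and the pod names are assumptions made for illustration; the slide only specifies the id, the port and the label.

```python
from itertools import cycle


def matches(selector, labels):
    """True if every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


class Service:
    """A stable front for the pods that match a label selector."""

    def __init__(self, service_id, port, selector):
        self.service_id = service_id
        self.port = port
        self.selector = selector

    def backends(self, pods):
        """Names of the pods that currently match, in a stable (sorted) order."""
        return sorted(name for name, labels in pods.items()
                      if matches(self.selector, labels))

    def round_robin(self, pods):
        """Iterator that cycles over the current backends (assumed policy)."""
        return cycle(self.backends(pods))


frontend_service = Service("frontend-service", 9000, {"role": "frontend"})
```

Because the backends are looked up by label rather than listed by name, pods created or removed by a replica controller are picked up the next time the selector is resolved.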

Kubernetes
The open source cluster manager from Google.
● Pods: groups of containers
● Labels
● Replica controller
● Services
http://kubernetes.io

Pulling it all together
Do it yourself? Sure.
[Figure: resources plotted against offered load]

Pulling it all together
We choose to go to the roof not because it is glamorous, but because it is right there! ... the bulk of our success is the result of the methodical, relentless, persistent pursuit of 1.3-2x opportunities -- what I have come to call "roofshots".
-- Luiz Barroso

Pulling it all together
Porsche doesn't make cars: it designs and assembles them
1H2014:
○ 1.7% (89k) of VW group's vehicles
○ 23% (€1.4b) of its profits
Data: Volkswagen, 2014-07-31
Image: john wilkes

Pulling it all together
Cloud system providers are getting better at everything ...
• capacity management
• monitoring
• storage + networking
• reliability
• software development tooling
• ...
Wouldn't you like to stand on others' shoulders?

Three rules of thumb:
1. Resiliency is more important than performance.
2. Relax. Let go. Build on what others have done.
3. Do more monitoring.
johnwilkes@google.com
http://kubernetes.io
Images by Connie Zhou
