  1. Building Large-Scale Internet Services Jeff Dean jeff@google.com

  2. Plan for Today • Google’s computational environment – hardware – system software stack & its evolution • Techniques for building large-scale systems – decomposition into services – common design patterns • Challenging areas for current and future work

  3. Computing shifting to really small and really big devices: UI-centric devices on one end, large consolidated computing farms on the other

  4. Implications
     • Users have many devices
        – expect to be able to access their data on any of them
        – devices have a wide range of capabilities/capacities
     • Disconnected operation
        – want to provide at least some functionality when disconnected
        – bigger problem in the short to medium term; long term, we'll be able to assume a network connection is (almost) always available
     • Interactive apps require moving at least some computation to the client
        – JavaScript, Java, native binaries, ...
     • Opportunity to build interesting services
        – can use much larger bursts of computational power than strictly client-side apps

  5. Google’s data center at The Dalles, OR

  6. The Machinery: servers (CPUs, DRAM, disks) are grouped into racks (40-80 servers plus an Ethernet switch), and racks into clusters

  7. The Joys of Real Hardware
     Typical first year for a new cluster:
        – ~1 network rewiring (rolling ~5% of machines down over a 2-day span)
        – ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
        – ~5 racks go wonky (40-80 machines see 50% packet loss)
        – ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
        – ~12 router reloads (takes out DNS and external VIPs for a couple of minutes)
        – ~3 router failures (have to immediately pull traffic for an hour)
        – ~dozens of minor 30-second blips for DNS
        – ~1000 individual machine failures
        – ~thousands of hard drive failures
        – slow disks, bad memory, misconfigured machines, flaky machines, etc.
     Long-distance links: wild dogs, sharks, dead horses, drunken hunters, etc.
     • Reliability/availability must come from software!
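
Since any individual machine, rack, or link can fail at any time, the software layer has to mask failures rather than assume them away. As a minimal illustrative sketch of one common pattern (not Google code; the function and parameter names are invented), an operation can be retried against several replicas with backoff until one succeeds:

```python
import random
import time

def call_with_retries(replicas, op, attempts=3, backoff_secs=0.1):
    """Try `op(replica)` against randomly chosen replicas, retrying on failure.

    `replicas` is a list of server addresses and `op` is a callable; this is an
    illustrative pattern, not Google's actual RPC layer.
    """
    last_error = None
    for attempt in range(attempts):
        replica = random.choice(replicas)
        try:
            return op(replica)
        except Exception as err:  # in a real system: RPC errors, timeouts
            last_error = err
            time.sleep(backoff_secs * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"all {attempts} attempts failed") from last_error
```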

  9. Google Cluster Software Environment
     • A cluster is 1000s of machines, typically one or a handful of hardware configurations
     • File system (GFS or Colossus) + cluster scheduling system are core services
     • Typically 100s to 1000s of active jobs (some with 1 task, some with 1000s)
     • Mix of batch and low-latency, user-facing production jobs
     • (Diagram) Each commodity machine runs Linux plus a GFS chunkserver and a scheduling slave that hosts tasks from many different jobs; a GFS master, a scheduling master, and the Chubby lock service coordinate across machines 1..N
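
The diagram's scheduling master packs tasks from many jobs onto shared machines. As a rough, invented illustration of that idea (this is not Borg's policy or API), a greedy scheduler might place each task on the machine with the most free RAM:

```python
from dataclasses import dataclass, field

@dataclass
class Machine:
    name: str
    free_ram_gb: float
    tasks: list = field(default_factory=list)

def schedule(tasks, machines):
    """Greedily place each (task_name, ram_gb) request on the machine with the most free RAM."""
    placements = []
    for task_name, ram_gb in tasks:
        best = max(machines, key=lambda m: m.free_ram_gb)
        if best.free_ram_gb < ram_gb:
            raise RuntimeError(f"no machine can fit {task_name} ({ram_gb} GB)")
        best.free_ram_gb -= ram_gb
        best.tasks.append(task_name)
        placements.append((task_name, best.name))
    return placements

machines = [Machine("machine-1", 64.0), Machine("machine-2", 64.0)]
print(schedule([("job1.task0", 8.0), ("job3.task2", 16.0), ("job7.task1", 4.0)], machines))
```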

  10. Some Commonly Used Systems Infrastructure at Google • GFS & Colossus (next gen GFS) – cluster-level file system (distributed across thousands of nodes) • Cluster scheduling system – assigns resources to jobs made up of tasks • MapReduce – programming model and implementation for large-scale computation • Bigtable – distributed semi-structured storage system – adaptively spreads data across thousands of nodes

  11. MapReduce • A simple programming model that applies to many large-scale computing problems • Hide messy details in MapReduce runtime library: – automatic parallelization – load balancing – network and disk transfer optimizations – handling of machine failures – robustness – improvements to core library benefit all users of library!

  12. Typical problem solved by MapReduce • Read a lot of data • Map: extract something you care about from each record • Shuffle and Sort • Reduce: aggregate, summarize, filter, or transform • Write the results Outline stays the same, map and reduce change to fit the problem
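
A minimal single-process sketch of that outline, using word count as the problem (this shows the shape of the model only, not the production MapReduce library):

```python
import itertools

def word_count_map(record):
    # Map: extract something you care about from each record.
    for word in record.split():
        yield (word.lower(), 1)

def word_count_reduce(key, values):
    # Reduce: aggregate all values seen for a key.
    yield (key, sum(values))

def map_reduce(records, map_fn, reduce_fn):
    # Shuffle and sort: group intermediate pairs by key.
    pairs = sorted(itertools.chain.from_iterable(map_fn(r) for r in records))
    output = []
    for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        output.extend(reduce_fn(key, [v for _, v in group]))
    return output

print(map_reduce(["the quick brown fox", "the lazy dog the fox"],
                 word_count_map, word_count_reduce))
```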

  13. Example: Rendering Map Tiles
     • Input: geographic feature list (I-5, Lake Washington, WA-520, I-90, ...)
     • Map: emit each feature to all overlapping latitude-longitude rectangles (key = rectangle id), e.g., (0, I-5), (0, Lake Wash.), (0, WA-520), (1, I-5), (1, Lake Wash.), (1, I-90), ...
     • Shuffle: sort by key, grouping the features for each rectangle
     • Reduce: render each tile using the data for all enclosed features
     • Output: rendered tiles
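
Sketched as code (the 1-degree tile grid, the feature bounding boxes, and the "render" step are invented stand-ins for the real pipeline), the map emits each feature under every rectangle it overlaps and the reduce renders one tile per rectangle:

```python
def overlapping_tiles(bbox, tile_size_deg=1.0):
    """Yield the (lon, lat) grid cells a feature's bounding box touches."""
    min_lon, min_lat, max_lon, max_lat = bbox
    for lon in range(int(min_lon // tile_size_deg), int(max_lon // tile_size_deg) + 1):
        for lat in range(int(min_lat // tile_size_deg), int(max_lat // tile_size_deg) + 1):
            yield (lon, lat)

def map_phase(features):
    # Emit each feature to all overlapping latitude-longitude rectangles (key = tile id).
    for name, bbox in features:
        for tile_id in overlapping_tiles(bbox):
            yield (tile_id, name)

def reduce_phase(grouped):
    # "Render" each tile from all of its enclosed features (here: just list them).
    for tile_id, names in sorted(grouped.items()):
        yield (tile_id, f"rendered tile with {sorted(names)}")

features = [("I-5", (-122.4, 45.5, -122.2, 48.5)),
            ("Lake Washington", (-122.3, 47.5, -122.2, 47.8))]
grouped = {}
for tile_id, name in map_phase(features):         # map
    grouped.setdefault(tile_id, []).append(name)  # shuffle & sort
print(list(reduce_phase(grouped)))                # reduce
```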

  14. Parallel MapReduce
     • Input data is split across many parallel Map workers
     • A Master coordinates the Map and Reduce tasks
     • The Shuffle phase partitions intermediate data and routes it to the Reduce workers
     • Reduce workers write the partitioned output
     • For large enough problems, it's more about disk and network performance than CPU & DRAM
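
One detail the diagram implies is how intermediate keys get routed from Map workers to Reduce workers. A common approach, sketched here assuming R reduce tasks (a generic technique, not the exact production code), is to hash-partition keys so every occurrence of a key lands on the same reducer:

```python
import hashlib

def partition(key, num_reduce_tasks):
    """Deterministically assign an intermediate key to one of R reduce tasks."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_reduce_tasks

# Every Map worker computes the same partition for the same key.
assert partition("www.cnn.com", 3) == partition("www.cnn.com", 3)
print([partition(k, 3) for k in ("apple", "banana", "cherry")])
```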

  16. MapReduce Usage Statistics Over Time
     |                                | Aug '04 | Mar '06 | Sep '07 | May '10 |
     | Number of jobs                 | 29K     | 171K    | 2,217K  | 4,474K  |
     | Average completion time (secs) | 634     | 874     | 395     | 748     |
     | Machine years used             | 217     | 2,002   | 11,081  | 39,121  |
     | Input data read (TB)           | 3,288   | 52,254  | 403,152 | 946,460 |
     | Intermediate data (TB)         | 758     | 6,743   | 34,774  | 132,960 |
     | Output data written (TB)       | 193     | 2,970   | 14,018  | 45,720  |
     | Average worker machines        | 157     | 268     | 394     | 368     |

  17. MapReduce in Practice
     • Abstract input and output interfaces
        – lots of MR operations don't just read/write simple files: B-tree files, memory-mapped key-value stores, complex inverted index file formats, BigTable tables, SQL databases, etc.
     • Low-level MR interfaces are in terms of byte arrays
        – hardly ever use textual formats, though: slow, hard to parse
        – most input & output is in encoded Protocol Buffer format
     • See “MapReduce: A Flexible Data Processing Tool” (CACM, 2010)
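
The point of the abstract interfaces is that the library iterates over key/value records without caring where they live. A hypothetical sketch of what such an abstraction can look like (the class names here are invented, not the real MapReduce API):

```python
from abc import ABC, abstractmethod

class RecordReader(ABC):
    """Hypothetical input interface: anything that can yield (key, value) byte pairs."""
    @abstractmethod
    def __iter__(self):
        ...

class InMemoryReader(RecordReader):
    """Reads records from a Python dict instead of a file, B-tree, or Bigtable."""
    def __init__(self, table):
        self.table = table

    def __iter__(self):
        for key, value in self.table.items():
            # The low-level interface is byte arrays; real inputs are usually
            # encoded Protocol Buffers rather than text.
            yield key.encode("utf-8"), value

for key, value in InMemoryReader({"www.cnn.com": b"<html>..."}):
    print(key, value)
```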

  18. BigTable: Motivation • Lots of (semi-)structured data at Google – URLs: • Contents, crawl metadata, links, anchors, pagerank, … – Per-user data: • User preference settings, recent queries/search results, … – Geographic locations: • Physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, … • Scale is large – billions of URLs, many versions/page (~20K/version) – Hundreds of millions of users, thousands of q/sec – 100TB+ of satellite image data

  19. Basic Data Model
     • Distributed multi-dimensional sparse map: (row, column, timestamp) → cell contents
     • Example: row “www.cnn.com”, column “contents:”, with cell versions at timestamps t3, t11, and t17 holding “<html>…”
     • Rows are ordered lexicographically
     • Good match for most of our applications
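
A toy in-memory version of that model, just to make the (row, column, timestamp) → contents mapping concrete (Bigtable itself persists this data in files spread across tablet servers, which this sketch ignores):

```python
import bisect

class ToyTable:
    """Sparse (row, column) map where each cell holds multiple timestamped versions."""
    def __init__(self):
        self.cells = {}  # (row, column) -> sorted list of (timestamp, value)

    def put(self, row, column, timestamp, value):
        bisect.insort(self.cells.setdefault((row, column), []), (timestamp, value))

    def get(self, row, column, timestamp=None):
        """Return the newest version at or before `timestamp` (the latest if None)."""
        versions = self.cells.get((row, column), [])
        if not versions:
            return None
        if timestamp is None:
            return versions[-1][1]
        i = bisect.bisect_right([ts for ts, _ in versions], timestamp)
        return versions[i - 1][1] if i else None

table = ToyTable()
table.put("www.cnn.com", "contents:", 3, "<html>...v3")
table.put("www.cnn.com", "contents:", 17, "<html>...v17")
print(table.get("www.cnn.com", "contents:"))      # latest version (t17)
print(table.get("www.cnn.com", "contents:", 11))  # version as of t11 -> t3's contents
```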

  24. Tablets & Splitting
     • Rows (e.g., “aaa.com”, “cnn.com”, “cnn.com/sports.html”, …, “website.com”, …, “yahoo.com/kids.html”, …, “zuppa.com/menu.html”) with columns such as “language:” and “contents:” (e.g., “cnn.com” → EN, “<html>…”)
     • The sorted row space is broken into contiguous row ranges called tablets
     • Tablets are split at row boundaries as they grow, e.g., a new boundary at “yahoo.com/kids.html\0”, immediately after the row “yahoo.com/kids.html”
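
A sketch of the tablet idea (illustrative only, not Bigtable's metadata structures): keep the sorted row space cut into contiguous ranges, find a row's tablet by binary search over the split keys, and add a split point just past a row when a tablet grows too large.

```python
import bisect

class ToyTabletIndex:
    """Maps a row key to the tablet (contiguous row range) that contains it."""
    def __init__(self):
        # Each split key is the exclusive upper bound of a tablet, kept sorted.
        self.split_keys = []

    def tablet_for_row(self, row):
        """Return the index of the tablet responsible for `row`."""
        return bisect.bisect_right(self.split_keys, row)

    def split_after(self, row):
        """Add a boundary immediately after `row`, so `row` stays in the lower tablet."""
        bisect.insort(self.split_keys, row + "\0")

index = ToyTabletIndex()
index.split_after("cnn.com/sports.html")
index.split_after("yahoo.com/kids.html")
print(index.tablet_for_row("aaa.com"))              # tablet 0
print(index.tablet_for_row("website.com"))          # tablet 1
print(index.tablet_for_row("zuppa.com/menu.html"))  # tablet 2
```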

  27. BigTable System Structure
     • A Bigtable cell consists of one Bigtable master and many Bigtable tablet servers
     • The master performs metadata ops + load balancing
     • The tablet servers serve data
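
A toy sketch of the master's role (invented names and a deliberately simple policy, only to show the division of labor): the master tracks which tablet server holds each tablet and assigns new or orphaned tablets to the least-loaded server, while clients go to tablet servers directly for the data itself.

```python
class ToyMaster:
    """Metadata ops + load balancing only; tablet servers, not the master, serve data."""
    def __init__(self, tablet_servers):
        self.assignments = {}                               # tablet id -> server name
        self.load = {server: 0 for server in tablet_servers}

    def assign(self, tablet):
        # Simple load balancing: hand the tablet to the least-loaded tablet server.
        server = min(self.load, key=self.load.get)
        self.assignments[tablet] = server
        self.load[server] += 1
        return server

    def handle_server_failure(self, dead_server):
        # Reassign every tablet the dead server held to the remaining servers.
        orphans = [t for t, s in self.assignments.items() if s == dead_server]
        del self.load[dead_server]
        for tablet in orphans:
            self.assign(tablet)

master = ToyMaster(["tabletserver-1", "tabletserver-2", "tabletserver-3"])
for tablet in ["aaa.com..cnn.com", "cnn.com..website.com", "website.com..zuppa.com"]:
    print(tablet, "->", master.assign(tablet))
master.handle_server_failure("tabletserver-1")
print(master.assignments)
```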
