SLIDE 1 Cluster-Level Storage @ Google
How we use Colossus to improve storage efficiency
Denis Serenyi Senior Staff Software Engineer dserenyi@google.com
November 13, 2017 Keynote at the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems
SLIDE 2
SLIDE 3
What do you call a few PB of free space?
SLIDE 4
What do you call a few PB of free space? An emergency low disk space condition
SLIDE 5
Typical Cluster:
- 10s of thousands of machines
- PB of distributed HDD
- Optional multi-TB local SSD
- 10 GB/s bisection bandwidth
SLIDE 6
Part 1: Transition From GFS to Colossus
SLIDE 7 GFS architectural problems
GFS master
- One machine not large enough for large FS
- Single bottleneck for metadata operations
- Fault tolerant, not HA
Predictable performance
SLIDE 8
Some obvious GFSv2 goals
Bigger!
Faster!
More predictable tail latency
GFS master replaced by Colossus
GFS chunkserver replaced by D
SLIDE 9 Solve an easier problem
A “file system” for Bigtable
- Append-only
- Single-writer (multi-reader)
- No snapshot / rename
- Directories unnecessary
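The contract here is far smaller than a POSIX file system. A minimal sketch of what such an append-only, single-writer file object might look like (the class and method names below are illustrative, not the actual CFS API):

```python
# Sketch of the reduced "file system" contract Bigtable needs:
# append-only, single writer, many readers, no rename/snapshot, no directories.
class AppendOnlyFile:
    def __init__(self, name: str):
        self.name = name          # flat namespace, so no directory tree to manage
        self.finalized = False
        self._data = bytearray()

    def append(self, buf: bytes) -> None:
        # Only the single writer calls this; data is never overwritten in place.
        if self.finalized:
            raise IOError("file is finalized; appends are no longer allowed")
        self._data.extend(buf)

    def finalize(self) -> None:
        # Once finalized the file is immutable (and there is no rename).
        self.finalized = True

    def read(self, offset: int, length: int) -> bytes:
        # Any number of concurrent readers can do ranged reads.
        return bytes(self._data[offset:offset + length])
```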
Where to put metadata?
SLIDE 10
Storage options back then
GFS
Sharded MySQL with local disk & replication
○ Ads databases
Local key-value store with Paxos replication
○ Chubby
Bigtable (sorted key-value store on GFS)
SLIDE 11
Storage options back then
GFS ← lacks useful database features
Sharded MySQL ← poor load balancing, complicated
Local key-value store ← doesn’t scale
Bigtable ← hmmmmmm….
SLIDE 12 Why Bigtable?
Bigtable solves many of the hard problems:
- Automatically shards data across tablets
- Locates tablets via metadata lookups
- Easy to use semantics
- Efficient point lookups and scans
File system metadata kept in an in-memory locality group
SLIDE 13 Metadata in Bigtable (!?!?)
[Diagram: an application Bigtable (XX,XXX tabletservers) stores its data on XX,XXX D chunkservers and its METADATA in a much smaller CFS Bigtable (XXX tabletservers); the CFS Bigtable's own FS metadata and data are in turn held by a GFS master and XXX GFS chunkservers]
Note: GFS still present, storing file system metadata
SLIDE 14 GFS master -> CFS
CFS “curators” run in Bigtable tablet servers
Bigtable row corresponds to a single file
Stripes are replication groups: open, closed, finalized
[Diagram: the Bigtable row keyed by /cfs/ex-d/home/denis/myfile holds file attributes (is-finalized?, mtime, ctime, ..., encoding r=3.2) plus one entry per stripe: stripe 0 and stripe 1 each record a checksum, a length, and chunks chunk0, chunk1, chunk2; stripe 2 is still OPEN with chunks chunk0, chunk1, chunk2]
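A rough way to picture that row layout in code; the field and type names below are paraphrased from the diagram and are assumptions, not the real curator schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Stripe:
    index: int
    chunks: List[str]                 # e.g. ["chunk0", "chunk1", "chunk2"]
    checksum: Optional[int] = None    # recorded once the stripe is closed
    length: Optional[int] = None
    state: str = "OPEN"               # OPEN while the writer is still appending

@dataclass
class FileRow:
    row_key: str                      # the file path, e.g. "/cfs/ex-d/home/denis/myfile"
    encoding: str                     # e.g. "r=3.2"
    is_finalized: bool = False
    mtime: float = 0.0
    ctime: float = 0.0
    stripes: List[Stripe] = field(default_factory=list)

# One Bigtable row per file; the last stripe stays OPEN until it fills up.
row = FileRow(
    row_key="/cfs/ex-d/home/denis/myfile",
    encoding="r=3.2",
    stripes=[
        Stripe(0, ["chunk0", "chunk1", "chunk2"], checksum=0xA1, length=1 << 20, state="FINALIZED"),
        Stripe(1, ["chunk0", "chunk1", "chunk2"], checksum=0xB2, length=1 << 20, state="FINALIZED"),
        Stripe(2, ["chunk0", "chunk1", "chunk2"]),   # still OPEN
    ],
)
```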
SLIDE 15
Colossus for metadata?
Metadata is ~1/10000 the size of data
So if we host a Colossus on Colossus…
100 PB data → 10 TB metadata
10 TB metadata → 1 GB metametadata
1 GB metametadata → 100 KB meta...
And now we can put it into Chubby!
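The arithmetic behind that recursion, assuming the ~1/10000 metadata-to-data ratio holds at each level:

```python
# Each level of metadata is roughly 1/10000 the size of what it describes.
RATIO = 1.0 / 10000

data = 100e15                            # 100 PB of file data
metadata = data * RATIO                  # ~10 TB: fits in a small Colossus
metametadata = metadata * RATIO          # ~1 GB
metametametadata = metametadata * RATIO  # ~100 KB: small enough for Chubby

for label, size in [("data", data), ("metadata", metadata),
                    ("meta-metadata", metametadata), ("meta^3-data", metametametadata)]:
    print(f"{label:>14}: {size:,.0f} bytes")
```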
SLIDE 16
Part 2: Colossus and Efficient Storage
SLIDE 17
Themes
Colossus enables scale, declustering
Complementary applications → cheaper storage
Placement of data, IO balance is hard
SLIDE 18 What’s a cluster look like?
[Diagram: Machine 1 runs YouTube Serving, GMail, Bigtable, and a D Server; Machine 2 runs Ads MapReduce, YouTube Serving, and a D Server; ...; Machine XX000 runs YouTube MapReduce, CFS Bigtable, and a D Server]
SLIDE 19
Let’s talk about money
Total Cost of Ownership
TCO encompasses much more than the retail price of a disk
A denser disk might sell at a premium $/GB but still be cheaper to deploy (power, connection overhead, repairs)
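A toy illustration of why the premium-priced dense disk can still win: per-disk fixed costs (slot, power, networking, repairs) are roughly independent of capacity. All dollar figures here are invented; only the shape of the comparison matters.

```python
# Invented numbers: a denser disk with a *higher* retail $/GB can still have a
# lower total cost per GB once per-slot fixed costs are included.
def storage_cost_per_gb(disk_price, capacity_gb, fixed_cost_per_slot):
    # fixed_cost_per_slot: power, enclosure/connection overhead, repairs, etc.
    return (disk_price + fixed_cost_per_slot) / capacity_gb

small = storage_cost_per_gb(disk_price=120.0, capacity_gb=4_000, fixed_cost_per_slot=200.0)
dense = storage_cost_per_gb(disk_price=480.0, capacity_gb=12_000, fixed_cost_per_slot=200.0)

print(f"small disk ($0.030/GB retail): ${small:.3f}/GB deployed")   # ~$0.080/GB
print(f"dense disk ($0.040/GB retail): ${dense:.3f}/GB deployed")   # ~$0.057/GB
```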
SLIDE 20
The ingredients of storage TCO
Most importantly, we care about storage TCO, not disk TCO
Storage TCO is the cost of data durability and its availability, and the cost of serving it
We minimize total storage TCO if we keep the disk full and busy
SLIDE 21 What disk should I buy?
Which disks should I buy?
We’ll have a mix because we’re growing
We have an overall goal for IOPS and capacity
We select disks to bring the cluster and fleet closer to that goal
SLIDE 22 What we want
Equal amounts of hot data (spindle is busy)
Rest of disk filled with cold data (disks are full)

[Diagram: a small disk and a big disk each hold a similar amount of hot data; the rest of each disk is filled with cold data]
SLIDE 23
How we get it
Colossus rebalances old, cold data ...and distributes newly written data evenly across disks
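A greatly simplified sketch of that placement policy, not the actual Colossus rebalancer: new (likely hot) writes go to the least-busy spindle, while cold data migrates toward the emptiest disks.

```python
class DiskStats:
    """Per-D-server stats: how busy the spindle is and how full the disk is."""
    def __init__(self, name, spindle_util, fraction_full):
        self.name = name
        self.spindle_util = spindle_util      # 0.0 (idle) .. 1.0 (saturated)
        self.fraction_full = fraction_full    # 0.0 (empty) .. 1.0 (full)

def pick_disk_for_new_write(disks):
    # New data is usually hot, so send it to the least-busy spindle
    # to keep IO load even across drives of different sizes.
    return min(disks, key=lambda d: d.spindle_util)

def pick_disk_for_cold_rebalance(disks):
    # Old, cold data migrates toward the disks with the most free bytes,
    # keeping every drive close to full without creating IO hot spots.
    return min(disks, key=lambda d: d.fraction_full)

disks = [DiskStats("d1", spindle_util=0.80, fraction_full=0.60),
         DiskStats("d2", spindle_util=0.30, fraction_full=0.90),
         DiskStats("d3", spindle_util=0.55, fraction_full=0.40)]

print(pick_disk_for_new_write(disks).name)       # d2: idle spindle absorbs new writes
print(pick_disk_for_cold_rebalance(disks).name)  # d3: emptiest disk absorbs cold data
```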
SLIDE 24
When stuff works well
Each box is a D server
Sized by disk capacity
Colored by spindle utilization
SLIDE 25 Rough scheme
Buy flash for caching to bring IOPS/GB into disk range
Buy disks for capacity and fill them up
Hope that the disks are busy
○ otherwise we bought too much flash…
○ but not too busy…
If we buy disks for IOPS, byte improvements don’t help
If cold bytes grow infinitely, we have lots of IO capacity
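A back-of-the-envelope version of this sizing exercise; the cache hit-rate model and every number below are assumptions, not Google's actual methodology.

```python
# Size the flash cache so that read IOPS missing the cache fit within what the
# capacity-purchased HDDs can serve.
def flash_needed_tb(total_read_iops, disk_count, iops_per_disk, hit_rate_per_tb):
    hdd_iops_budget = disk_count * iops_per_disk
    # Fraction of reads that flash must absorb for the HDDs to keep up.
    required_hit_rate = max(0.0, 1.0 - hdd_iops_budget / total_read_iops)
    # Crude model: each TB of flash absorbs a fixed extra slice of reads.
    return required_hit_rate / hit_rate_per_tb

tb = flash_needed_tb(total_read_iops=2_000_000, disk_count=10_000,
                     iops_per_disk=100, hit_rate_per_tb=0.002)
print(f"~{tb:.0f} TB of flash cache")   # ~250 TB under these invented numbers
```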
SLIDE 26 Filling up disks is hard
Filesystem doesn’t work well when 100% full
Can’t remove capacity for upgrades and repairs without empty space
Individual groups don’t want to run near 100% of quota
Administrators are uncomfortable with statistical overcommit
Supply chain uncertainty
SLIDE 27
Applications must change
Unlike almost anything else in our datacenters, disk I/O cost is going up
Applications that want more accesses than HDDs offer probably need to think about making their hot data hotter (so flash works well) and cold data colder
An application written X years ago might cause us to buy smaller disks, increasing storage costs
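In code terms, "hotter hot data and colder cold data" amounts to an application deliberately steering frequently accessed objects toward flash and everything else toward disk. A hypothetical sketch with an invented threshold and tier names:

```python
FLASH_TIER, DISK_TIER = "ssd-cache", "hdd"
HOT_THRESHOLD = 10   # accesses per day; purely an illustrative cut-off

def choose_tier(accesses_per_day: int) -> str:
    # Push frequently read data onto flash and let HDDs hold mostly cold bytes,
    # so the disks can be bought for capacity rather than for IOPS.
    return FLASH_TIER if accesses_per_day >= HOT_THRESHOLD else DISK_TIER

print(choose_tier(500))   # ssd-cache
print(choose_tier(1))     # hdd
```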
SLIDE 28 Conclusion
Colossus has been extremely useful for optimizing our storage efficiency
- Metadata scaling enables declustering of resources
- Ability to combine disks of various sizes and workloads of varying types is very powerful
Looking forward, I/O cost trends will require both applications and storage systems to evolve
SLIDE 29 Thank you!
Denis Serenyi dserenyi@google.com