
CS535 Big Data 2/19/2020 Week 5-B Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University

CS535 BIG DATA
PART B. GEAR SESSIONS
SESSION 1: PETA-SCALE STORAGE SYSTEMS
Google had 2.5 million servers in 2016
Sangmi Lee Pallickara
Computer Science, Colorado State University

FAQs
• Quiz #2
  • 2/21 ~ 2/23
  • Spark and Storm
  • 10 questions
  • 30 minutes
  • Answers will be available at 9PM 2/24

Topics of Today's Class
• GEAR Session I. Peta-Scale Storage Systems
• Lecture 2.
  • GFS I and II
  • Cassandra

GEAR Session 1. Peta-scale Storage Systems

GEAR Session 1. Peta-scale storage systems
Lecture 2. Google File System and Hadoop Distributed File System
3. Relaxed Consistency

Two breaks in the communication lines
[Figure: network of nodes in London, Rome, Boston, Chicago, LA, Paris, Miami, and Sydney]
• A single machine can’t partition
  • So it does not have to worry about partition tolerance
• There is only one node
  • If it’s up, it’s available

Eventually consistent
• At any time nodes may have replication inconsistencies
• If there are no more updates (or updates can be ordered), eventually all nodes will be updated to the same value

GFS has a relaxed consistency model
• Consistent: all clients see the same data
  • On all replicas
• Defined: if it is consistent AND
  • Clients see each mutation's writes in their entirety
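The "eventually consistent" bullets above can be illustrated with a minimal sketch. It assumes each replica stores a hypothetical (version, value) pair, so that updates can be ordered, and that replicas reconcile through pairwise exchanges around a ring. These names and the exchange pattern are illustrative assumptions, not something the slides specify.

```python
def sync(replicas, i, j):
    """Pairwise exchange: both replicas keep the higher-versioned value."""
    newest = max(replicas[i], replicas[j], key=lambda pair: pair[0])
    replicas[i] = replicas[j] = newest


def converge(replicas):
    """Repeat ring-wise exchanges until every replica holds the same pair."""
    rounds = 0
    while len(set(replicas)) > 1:
        for i in range(len(replicas)):
            sync(replicas, i, (i + 1) % len(replicas))
        rounds += 1
    return rounds


# Three replicas that temporarily disagree; no further updates arrive.
replicas = [(1, "a"), (3, "c"), (2, "b")]
rounds = converge(replicas)
print(replicas, rounds)
```

Before the exchanges the replicas are inconsistent; once updates stop and the ordered versions propagate, every replica holds the same value.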

Inconsistent and undefined
[Figure: Operation A, Operation B]

Consistent but undefined
[Figure: Operation A, Operation B]

Defined
[Figure: Operation A, Operation B]

File state region after a mutation

                      Write                      Record Append
Serial success        Defined                    Defined interspersed with inconsistent
Concurrent success    Consistent but undefined   Defined interspersed with inconsistent
Failure               Inconsistent               Inconsistent
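The three region states illustrated above can be expressed as a small classifier. This is a simplified sketch: it assumes a region is modeled as the bytes each replica holds at the same offsets, and that "defined" means the region equals one mutation's payload in its entirety. The function name and the payloads are hypothetical.

```python
def classify_region(replicas, mutations):
    """Classify one file region.

    replicas:  list of bytes, the region's content on each replica
    mutations: list of bytes, the payloads clients wrote to the region
    """
    # Consistent: every replica returns the same bytes for the region.
    if any(r != replicas[0] for r in replicas):
        return "inconsistent"
    # Defined: consistent AND the region reflects one mutation in its entirety.
    if replicas[0] in mutations:
        return "defined"
    return "consistent but undefined"


op_a, op_b = b"AAAA", b"BBBB"

serial = classify_region([b"BBBB", b"BBBB", b"BBBB"], [op_a, op_b])
mingled = classify_region([b"AABB", b"AABB", b"AABB"], [op_a, op_b])
failed = classify_region([b"AAAA", b"BBBB", b"AABB"], [op_a, op_b])
print(serial, mingled, failed, sep=" | ")
```

The mingled case corresponds to concurrent successful writes: all replicas agree, but the region holds fragments of both operations rather than either one whole.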

GEAR Session 1. Peta-scale storage systems
Lecture 2. Google File System and Hadoop Distributed File System
4. Handling write and append to a file

GFS uses leases to maintain consistent mutation order across replicas
• Master grants lease to one of the replicas
  • Primary
• Primary picks serial order
  • For all mutations to the chunk
• Other replicas follow this order
  • When applying mutations

Lease mechanism designed to minimize communications with the master
• Lease has initial timeout of 60 seconds
• As long as chunk is being mutated
  • Primary can request and receive extensions
• Extension requests/grants piggybacked over heartbeat messages

Revocation and transfer of leases
• Master may revoke a lease before it expires
• If communications lost with primary
  • Master can safely give lease to another replica
  • Only after the lease period for old primary elapses
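The lease rules above can be sketched as master-side bookkeeping. This is a minimal illustration, not GFS code: `LeaseTable`, its methods, and the explicit `now` argument (standing in for the master's clock) are hypothetical names. Only the 60-second timeout comes from the slide.

```python
from dataclasses import dataclass

LEASE_TIMEOUT_S = 60.0  # initial lease timeout stated on the slide


@dataclass
class Lease:
    primary: str       # replica currently acting as primary
    expires_at: float  # master-clock time at which the lease lapses


class LeaseTable:
    """Hypothetical master-side table: chunk id -> current lease."""

    def __init__(self):
        self._leases = {}

    def grant(self, chunk, replica, now):
        """Grant a lease unless another replica holds an unexpired one."""
        lease = self._leases.get(chunk)
        if lease is not None and now < lease.expires_at and lease.primary != replica:
            return False
        self._leases[chunk] = Lease(replica, now + LEASE_TIMEOUT_S)
        return True

    def extend(self, chunk, replica, now):
        """Handle an extension request carried on a heartbeat message."""
        lease = self._leases.get(chunk)
        if lease is None or lease.primary != replica or now >= lease.expires_at:
            return False
        lease.expires_at = now + LEASE_TIMEOUT_S
        return True

    def revoke(self, chunk):
        """Drop a lease before it expires (assumes the primary was told)."""
        self._leases.pop(chunk, None)


table = LeaseTable()
granted_a = table.grant("chunk-1", "A", now=0.0)        # A becomes primary
blocked_b = table.grant("chunk-1", "B", now=30.0)       # refused: A's lease is live
extended_a = table.extend("chunk-1", "A", now=50.0)     # lease now lapses at 110
still_blocked = table.grant("chunk-1", "B", now=100.0)  # refused: extension holds
granted_b = table.grant("chunk-1", "B", now=110.0)      # old lease period elapsed
```

The last two calls mirror the transfer rule: if the master cannot reach the primary, it hands the lease to another replica only after the old lease period has elapsed.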

How a write is actually performed
[Figure: Client, Master, Primary Replica, Secondary Replica A, Secondary Replica B]
1. Client → Master: which chunkserver holds the current lease for the chunk, and where are the other replicas?
2. Master → Client: identity of the primary and the locations of the other replicas
3. Client pushes the data to all the replicas
4. Client → Primary: write request
5. Primary → Secondaries: write request
6. Secondaries → Primary: acknowledgement
7. Primary → Client: final reply

Client pushes data to all the replicas [1/2]
• Each chunkserver stores data in an LRU buffer until
  • Data is used
  • Aged out

Client pushes data to all the replicas [2/2]
• When chunkservers acknowledge receipt of data
  • Client sends a write request to primary
• Primary assigns consecutive serial numbers to mutations
  • Forwards to replicas

Data flow is decoupled from the control flow to utilize network efficiently
• Utilize each machine’s network bandwidth
• Avoid network bottlenecks
• Avoid high-latency links
• Leverage network topology
  • Estimate distances from IP addresses
• Pipeline the data transfer
  • Once a chunkserver receives some data, it starts forwarding immediately
• For transferring B bytes to R replicas
  • Ideal elapsed time will be ≈ B/T + RL, where:
    • T is the network throughput
    • L is latency to transfer bytes between two machines
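As a worked example of the B/T + RL estimate above, the sketch below plugs in illustrative numbers: 1 MB of data, 3 replicas, a 100 Mbps link, and 1 ms of latency per hop. These values and the function names are assumptions chosen for the arithmetic, not figures from the slide. The non-pipelined comparison assumes each hop waits for the full payload before forwarding.

```python
def pipelined_time(num_bytes, replicas, throughput_bps, latency_s):
    """Ideal pipelined transfer: B/T + R*L (throughput in bytes per second)."""
    return num_bytes / throughput_bps + replicas * latency_s


def store_and_forward_time(num_bytes, replicas, throughput_bps, latency_s):
    """Assumed baseline: every hop receives all B bytes before forwarding."""
    return replicas * (num_bytes / throughput_bps + latency_s)


B = 1_000_000      # bytes to transfer (assumed)
R = 3              # number of replicas (assumed)
T = 100e6 / 8      # 100 Mbps expressed in bytes per second (assumed)
L = 0.001          # 1 ms per-hop latency (assumed)

pipelined = pipelined_time(B, R, T, L)
sequential = store_and_forward_time(B, R, T, L)
print(f"pipelined: {pipelined * 1000:.0f} ms, store-and-forward: {sequential * 1000:.0f} ms")
```

With these assumed numbers, the transfer term B/T dominates and is paid only once in the pipelined case, while the latency term grows with the number of replicas.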

Append: Record sizes and fragmentation
• Size is restricted to ¼ the chunk size
  • Maximum size
• Minimizes worst-case fragmentation
  • Internal fragmentation in each chunk …

Inconsistent Regions
[Figure: three replicas each hold Data 1 and Data 2. The append of Data 3 succeeds on two replicas and fails on the third. User will re-try to store Data 3. Afterwards the replicas hold Data 1, Data 2, an empty region, and repeated copies of Data 3.]

What if record append fails at one of the replicas
• Client must retry the operation
• Replicas of same chunk may contain
  • Different data
  • Duplicates of the same record
    • In whole or in part
• Replicas of chunks are not bit-wise identical!
  • In most systems, replicas are identical

GFS only guarantees that the data will be written at least once as an atomic unit
• For an operation to return success
  • Data must be written at the same offset on all the replicas
• After the write, all replicas are at least as long as the end of the record
  • Any future record will be assigned a higher offset or a different chunk
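The at-least-once behavior described above can be simulated in a few lines. This is a toy model under stated assumptions: each replica is a Python list whose indices stand in for offsets, `None` marks a padded (empty) slot, a failure is injected by replica index, and readers de-duplicate by treating the record value as its identifier. None of these names come from GFS itself.

```python
PAD = None  # marks an empty (padded) slot on a replica


def record_append(replicas, record, failing=frozenset()):
    """One append attempt; returns the offset on success, None on failure."""
    offset = max(len(r) for r in replicas)   # primary picks the offset
    ok = True
    for i, replica in enumerate(replicas):
        replica.extend([PAD] * (offset - len(replica)))  # pad up to the offset
        if i in failing:
            ok = False                       # this replica did not write
        else:
            replica.append(record)
    return offset if ok else None


def append_at_least_once(replicas, record, failure_plan=()):
    """Client-side retry loop: repeat until one attempt succeeds everywhere."""
    attempts = iter(failure_plan)
    while True:
        offset = record_append(replicas, record, next(attempts, frozenset()))
        if offset is not None:
            return offset


def read_unique(replica):
    """Reader-side filtering: skip padding and duplicate records."""
    seen, out = set(), []
    for rec in replica:
        if rec is not PAD and rec not in seen:
            seen.add(rec)
            out.append(rec)
    return out


replicas = [[], [], []]
append_at_least_once(replicas, "Data 1")
append_at_least_once(replicas, "Data 2")
# First attempt for Data 3 fails on replica 2; the client retries.
offset = append_at_least_once(replicas, "Data 3", failure_plan=[{2}])
for r in replicas:
    print(r, "->", read_unique(r))
```

After the retry, the replicas are not identical: two hold a duplicate of Data 3 and one holds an empty slot, yet the successful attempt wrote the record at the same offset on all three.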

GEAR Session 1. Peta-scale storage systems
Lecture 2. Google File System and Hadoop Distributed File System
Google File System II: Colossus

Storage Software: Colossus (GFS2)
• Next-generation cluster-level file system
• Automatically sharded metadata layer
• Distributed Masters (64MB block size → 1MB)
• Data typically written using Reed-Solomon (1.5x)
• Client-driven replication, encoding and replication
• Metadata space has enabled availability
• Why Reed-Solomon?
  • Cost
    • Especially with cross cluster replication
  • More flexible cost vs. availability choices
• Google File System II: Dawn of the Multiplying Master Nodes, http://www.theregister.co.uk/2009/08/12/google_file_system_part_deux/?page=1
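The cost argument for Reed-Solomon can be made concrete with a little arithmetic. The sketch below assumes a code with 6 data blocks and 3 parity blocks, which is one configuration that yields the 1.5x figure on the slide. The slide does not say which parameters Colossus uses, so the (6, 3) choice and the function names are illustrative assumptions.

```python
def replication_overhead(copies):
    """Raw bytes stored per byte of user data with n full copies."""
    return float(copies)


def reed_solomon_overhead(data_blocks, parity_blocks):
    """Raw bytes stored per byte of user data with a (data, parity) code."""
    return (data_blocks + parity_blocks) / data_blocks


# Assumed parameters: 3 full copies vs. a (6 data, 3 parity) code.
rep = replication_overhead(3)
rs = reed_solomon_overhead(6, 3)

# Losses each scheme survives: all but one copy, or any `parity` blocks.
rep_tolerates = 3 - 1
rs_tolerates = 3

print(f"3x replication: {rep:.1f}x storage, survives {rep_tolerates} losses")
print(f"RS(6,3):        {rs:.1f}x storage, survives {rs_tolerates} losses")
```

Under these assumptions the erasure code stores half as many raw bytes as three full copies, and the gap widens if every cluster in a cross-cluster setup would otherwise keep its own set of full replicas.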
