

CS535 Big Data, Week 5-B (2/19/2020). Part B. GEAR Sessions, Session 1: Peta-Scale Storage Systems ("Google had 2.5 million servers in 2016"). Sangmi Lee Pallickara, Computer Science, Colorado State University. http://www.cs.colostate.edu/~cs535


  1. Title slide: CS535 Big Data, Part B. GEAR Sessions, Session 1: Peta-Scale Storage Systems. Sangmi Lee Pallickara, Computer Science, Colorado State University.
     FAQs
     • Quiz #2: 2/21 ~ 2/23, covering Spark and Storm
     • 10 questions, 30 minutes
     • Answers will be available at 9 PM on 2/24

  2. Topics of Today's Class
     • GEAR Session I. Peta-Scale Storage Systems
     • Lecture 2: GFS I and II, Cassandra

  3. GEAR Session 1. Peta-scale storage systems. Lecture 2. Google File System and Hadoop Distributed File System. 3. Relaxed Consistency
     [Figure: nodes in London, Rome, Boston, Chicago, LA, Paris, Miami, and Sydney with two breaks in the communication lines, illustrating a network partition.]
     • A single machine cannot be partitioned, so it does not have to worry about partition tolerance: there is only one node, and if it is up, it is available.

  4. Eventually consistent
     • At any time, nodes may have replication inconsistencies
     • If there are no more updates (or the updates can be ordered), eventually all nodes will converge to the same value
     GFS has a relaxed consistency model
     • Consistent: all clients see the same data on all replicas
     • Defined: the region is consistent AND clients see each mutation's write in its entirety (a small sketch of this distinction follows)
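A minimal sketch of the consistent vs. defined distinction, assuming illustrative replica contents and a hypothetical region_state helper. This is not GFS code, and the substring test standing in for "written in its entirety" is a deliberate simplification:

```python
# Hypothetical sketch: classify a file region's state across replicas using the
# definitions above. Replica contents and mutation payloads are stand-ins.

def region_state(replicas, mutations):
    """replicas: byte strings holding the same region on each replica.
    mutations: byte strings, one per mutation that wrote this region."""
    consistent = all(r == replicas[0] for r in replicas)
    if not consistent:
        return "inconsistent (and therefore undefined)"
    # Defined: consistent AND the region reflects some mutation in its entirety,
    # i.e. it is not a mingled mix of fragments from several concurrent writes.
    defined = any(m in replicas[0] for m in mutations)
    return "defined" if defined else "consistent but undefined"

# Serial writes: every replica holds mutation B in full -> defined.
print(region_state([b"BBBB"] * 3, [b"AAAA", b"BBBB"]))
# Concurrent writes interleaved the same way everywhere -> consistent but undefined.
print(region_state([b"AABB"] * 3, [b"AAAA", b"BBBB"]))
# A failed write left the replicas differing -> inconsistent.
print(region_state([b"AAAA", b"ABBA", b"AAAA"], [b"AAAA", b"BBBB"]))
```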

  5. [Figure: chunk contents on each replica after two operations, A and B]
     • Inconsistent (and undefined): the replicas end up holding different data for the region
     • Consistent but undefined: all replicas hold the same bytes, but the region mingles fragments of A and B rather than reflecting either write in its entirety

  6. [Figure: a defined region, where every replica reflects one operation's write in its entirety]
     File region state after a mutation
     • Serial write success: defined
     • Concurrent write successes: consistent but undefined
     • Record append success (serial or concurrent): defined, interspersed with inconsistent regions
     • Failure (write or record append): inconsistent

  7. GEAR Session 1. Peta-scale storage systems. Lecture 2. Google File System and Hadoop Distributed File System. 4. Handling write and append to a file
     GFS uses leases to maintain a consistent mutation order across replicas
     • The master grants a lease to one of the replicas, which becomes the primary
     • The primary picks a serial order for all mutations to the chunk
     • The other replicas follow this order when applying mutations (see the sketch below)
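A minimal sketch of primary-chosen serial ordering, with invented class names (not GFS code); the point is that only the primary assigns serial numbers, and every replica applies mutations in that order:

```python
# Hypothetical sketch: the primary alone picks the serial order for mutations
# to a chunk and forwards them to the secondaries in that order.

class PrimaryReplica:
    def __init__(self, secondaries):
        self.serial = 0
        self.secondaries = secondaries
        self.chunk = bytearray()

    def mutate(self, offset, data):
        self.serial += 1                       # primary fixes the global order
        self.apply(self.serial, offset, data)
        for s in self.secondaries:
            s.apply(self.serial, offset, data)  # forwarded in serial order

    def apply(self, serial, offset, data):
        if len(self.chunk) < offset + len(data):
            self.chunk.extend(b"\0" * (offset + len(data) - len(self.chunk)))
        self.chunk[offset:offset + len(data)] = data

class SecondaryReplica(PrimaryReplica):
    def __init__(self):
        super().__init__(secondaries=[])

secondaries = [SecondaryReplica(), SecondaryReplica()]
primary = PrimaryReplica(secondaries)
primary.mutate(0, b"hello")
primary.mutate(5, b" world")
assert all(s.chunk == primary.chunk for s in secondaries)
```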

  8. The lease mechanism is designed to minimize communication with the master
     • A lease has an initial timeout of 60 seconds
     • As long as the chunk is being mutated, the primary can request and receive extensions
     • Extension requests and grants are piggybacked on heartbeat messages
     Revocation and transfer of leases
     • The master may revoke a lease before it expires
     • If communication with the primary is lost, the master can safely grant the lease to another replica, but only after the lease period of the old primary elapses (a sketch of this timing follows)
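A small sketch of the lease timing rules above: 60-second initial lease, extensions riding on heartbeats, and the master waiting out the old lease before re-granting it. The Master class and its methods are illustrative, not GFS interfaces:

```python
import time

LEASE_SECONDS = 60

class Master:
    def __init__(self):
        self.lease_holder = None
        self.lease_expiry = 0.0

    def grant_lease(self, chunkserver, now=None):
        now = time.time() if now is None else now
        # Safe to re-grant only once the previous lease has certainly expired,
        # even if the old primary is unreachable.
        if self.lease_holder is not None and now < self.lease_expiry:
            raise RuntimeError("old lease may still be live; wait it out")
        self.lease_holder = chunkserver
        self.lease_expiry = now + LEASE_SECONDS

    def heartbeat(self, chunkserver, wants_extension, now=None):
        now = time.time() if now is None else now
        # Extension requests ride on the regular heartbeat; no extra round trip.
        if wants_extension and chunkserver == self.lease_holder and now < self.lease_expiry:
            self.lease_expiry = now + LEASE_SECONDS
        return self.lease_expiry
```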

  9. How a write is actually performed
     1. The client asks the master which chunkserver holds the current lease for the chunk and where the other replicas are
     2. The master replies with the identity of the primary and the locations of the secondary replicas
     3. The client pushes the data to all the replicas
     4. Once the replicas acknowledge the data, the client sends a write request to the primary
     5. The primary forwards the write request to the secondary replicas
     6. The secondaries acknowledge completion to the primary
     7. The primary sends the final reply to the client
     Client pushes data to all the replicas [1/2]
     • Each chunkserver stores the data in an LRU buffer until the data is used or aged out
     (A client-side sketch of these steps follows.)
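A minimal client-side sketch of the seven steps above. The master, primary, and secondary objects and their methods are invented stand-ins, not a real GFS or HDFS API:

```python
# Hypothetical sketch of the GFS write path described on the slide.

def gfs_write(master, chunk_handle, offset, data):
    # Steps 1-2: learn who holds the lease (the primary) and where the replicas are.
    primary, secondaries = master.lookup_lease(chunk_handle)

    # Step 3: push the data to every replica; each buffers it in an LRU cache.
    for replica in [primary, *secondaries]:
        replica.push_data(chunk_handle, data)

    # Step 4: once all replicas hold the data, send the write request to the primary,
    # which assigns the mutation a serial number.
    serial = primary.write(chunk_handle, offset, data_id=id(data))

    # Steps 5-6: the primary forwards the request (with its serial number) to the
    # secondaries and collects their acknowledgements.
    acks = [s.apply_write(chunk_handle, serial, offset, data_id=id(data))
            for s in secondaries]

    # Step 7: the final reply reaches the client only if every replica succeeded;
    # otherwise the client retries and the region may be left inconsistent.
    return all(acks)
```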

  10. Client pushes data to all the replicas [2/2]
      • When the chunkservers acknowledge receipt of the data, the client sends a write request to the primary
      • The primary assigns consecutive serial numbers to the mutations and forwards them to the replicas
      Data flow is decoupled from the control flow to utilize the network efficiently
      • Utilize each machine's network bandwidth; avoid network bottlenecks and high-latency links
      • Leverage the network topology: estimate distances from IP addresses
      • Pipeline the data transfer: once a chunkserver receives some data, it starts forwarding immediately
      • For transferring B bytes to R replicas, the ideal elapsed time is ≈ B/T + RL, where T is the network throughput and L is the latency to transfer bytes between two machines (a worked example follows)
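A worked example of the B/T + RL estimate, using assumed numbers (64 MB chunk, 100 MB/s per-link throughput, 1 ms per-hop latency, 3 replicas); these parameters are not from the slide:

```python
B = 64 * 1024 * 1024      # bytes to transfer
T = 100 * 1024 * 1024     # network throughput, bytes/second
L = 0.001                 # latency between two machines, seconds
R = 3                     # number of replicas in the pipeline

pipelined = B / T + R * L            # chunkservers forward data as soon as it arrives
sequential = R * (B / T) + R * L     # for contrast: sending to each replica in turn

print(f"pipelined:  {pipelined:.3f} s")   # ~0.643 s
print(f"sequential: {sequential:.3f} s")  # ~1.923 s
```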

  11. Append: record sizes and fragmentation
      • Record size is restricted to ¼ of the chunk size (the maximum append size)
      • This bounds worst-case internal fragmentation within each chunk (a sketch of the size check follows)
      Inconsistent regions
      [Figure: three replicas of a chunk. Data 1 and Data 2 are appended to all three replicas; the append of Data 3 fails on one replica, leaving an empty (padded) region there. The client retries, so Data 3 is appended again at a new offset on all replicas and ends up duplicated on the replicas where the first attempt succeeded.]
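A minimal sketch of the two size rules above, assuming a 64 MB chunk: appends are capped at ¼ of the chunk size, and a record that does not fit in the space left in the current chunk causes the chunk to be padded and the append to be retried on a new chunk. Constants and names are illustrative, not GFS code:

```python
CHUNK_SIZE = 64 * 1024 * 1024
MAX_APPEND = CHUNK_SIZE // 4   # cap keeps worst-case padding at 1/4 of a chunk

def record_append(chunk_used, record_len):
    """Return (action, new_used) for appending record_len bytes to a chunk
    that already holds chunk_used bytes."""
    if record_len > MAX_APPEND:
        raise ValueError("record larger than 1/4 of the chunk size")
    if chunk_used + record_len > CHUNK_SIZE:
        # Pad out the rest of this chunk (internal fragmentation) and have the
        # client retry the append on the next chunk.
        return "pad_and_retry_on_new_chunk", CHUNK_SIZE
    return "appended", chunk_used + record_len

print(record_append(60 * 1024 * 1024, 8 * 1024 * 1024))  # does not fit -> pad
print(record_append(10 * 1024 * 1024, 8 * 1024 * 1024))  # fits -> appended
```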

  12. What if a record append fails at one of the replicas?
      • The client must retry the operation
      • Replicas of the same chunk may then contain different data, including duplicates of the same record, in whole or in part
      • Replicas of a chunk are therefore not bit-wise identical (in most systems, replicas are identical)
      GFS only guarantees that the data will be written at least once as an atomic unit
      • For an operation to return success, the data must be written at the same offset on all the replicas
      • After the write, all replicas are at least as long as the end of the record
      • Any future record will be assigned a higher offset or a different chunk
      (A sketch of how readers can cope with duplicates and padding follows.)
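Because record append is only at-least-once, readers of a GFS-style file typically tolerate duplicates and padding themselves, for example by giving each record a unique id and a checksum and de-duplicating on read. The record framing below is invented for illustration; it is not a GFS format:

```python
import hashlib

def encode_record(record_id, payload):
    digest = hashlib.md5(payload).hexdigest()
    return f"{record_id}|{digest}|".encode() + payload

def read_valid_records(raw_records):
    seen, out = set(), []
    for raw in raw_records:
        rid, digest, payload = raw.split(b"|", 2)
        if hashlib.md5(payload).hexdigest().encode() != digest:
            continue            # padding / partial write: checksum does not match
        if rid in seen:
            continue            # duplicate left behind by a client retry
        seen.add(rid)
        out.append(payload)
    return out

records = [encode_record("r1", b"alpha"),
           encode_record("r2", b"beta"),
           encode_record("r2", b"beta"),   # duplicate from a retried append
           b"r3|deadbeef|garbage"]         # failed / partial region
print(read_valid_records(records))         # [b'alpha', b'beta']
```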

  13. GEAR Session 1. Peta-scale storage systems. Lecture 2. Google File System and Hadoop Distributed File System. Google File System II: Colossus
      Storage software: Colossus (GFS2)
      • Next-generation cluster-level file system
      • Automatically sharded metadata layer with distributed masters (block size 64 MB → 1 MB)
      • Data typically written using Reed-Solomon encoding (1.5x storage overhead); client-driven replication, encoding, and replication
      • The sharded metadata space has enabled availability improvements
      • Why Reed-Solomon? Cost, especially with cross-cluster replication, and more flexible cost vs. availability choices (a worked overhead comparison follows)
      • Source: "Google File System II: Dawn of the Multiplying Master Nodes", http://www.theregister.co.uk/2009/08/12/google_file_system_part_deux/?page=1
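A worked comparison behind the 1.5x figure, assuming a Reed-Solomon layout of 6 data blocks plus 3 parity blocks, RS(6,3); the slide does not state which parameters Colossus actually uses:

```python
data_blocks, parity_blocks = 6, 3

replication_overhead = 3.0                                   # three full copies
rs_overhead = (data_blocks + parity_blocks) / data_blocks    # 9/6 = 1.5

print(f"3-way replication stores {replication_overhead:.1f}x the data")
print(f"RS({data_blocks},{parity_blocks}) stores {rs_overhead:.1f}x the data")
# RS(6,3) can reconstruct the data after losing any 3 of its 9 blocks, while
# 3-way replication tolerates the loss of 2 of its 3 copies at twice the cost.
```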

