
CS535 Big Data, 2/17/2020, Week 5-A
Sangmi Lee Pallickara, Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

CS535 BIG DATA
PART B. GEAR SESSIONS
SESSION 1: PETA-SCALE STORAGE SYSTEMS
"Google had 2.5 million servers in 2016"


  1. FAQs
     • Quiz 1
       • Pseudocode should be interpretable as a MapReduce program
       • Your code should be interpretable as actual MapReduce code
       • E.g.:
         • Step 1. Read lines
         • Step 2. Tokenize them
         • Step 3. Group records based on the branch
         • Step 4. Sort all of the records of a branch
         • Step 5. Find the top 10 per branch
       • Can this be an effective MapReduce implementation? (One possible mapping is sketched below.)
       • <Key, Value> is the core data structure of communication in MapReduce, without any exception
     • Next quiz: 2/21 ~ 2/23
       • Topics: Spark and Storm
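For reference, here is one way the five steps above collapse into a map phase and a reduce phase. This is a minimal Hadoop-style sketch, not a model answer; the record layout (comma-separated lines whose first field is the branch and whose second field is a numeric amount) and the class names are illustrative assumptions.

```java
// Illustrative sketch only: assumes comma-separated sales records whose
// first field is the branch and whose second field is a numeric amount.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: read and tokenize a line (Steps 1-2), emit <branch, record> (Step 3).
class BranchMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split(",");
    context.write(new Text(fields[0]), line);  // key = branch
  }
}

// Reduce: the framework delivers all records of one branch grouped by key,
// so the per-branch sort and top-10 selection (Steps 4-5) happen here.
class TopTenReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text branch, Iterable<Text> records, Context context)
      throws IOException, InterruptedException {
    List<String> all = new ArrayList<>();
    for (Text record : records) all.add(record.toString());
    all.sort((a, b) -> Double.compare(        // descending by amount
        Double.parseDouble(b.split(",")[1]),
        Double.parseDouble(a.split(",")[1])));
    for (int i = 0; i < Math.min(10, all.size()); i++) {
      context.write(branch, new Text(all.get(i)));
    }
  }
}
```

The key point the quiz is probing: the grouping in Step 3 is not something your code does imperatively; it is what the shuffle does for you, and it only happens because the mapper emits <Key, Value> pairs keyed by branch.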

  2. FAQs
     • How to lead the discussion as a presenter
       • GOAL: Involve your audience in the discussion
       • Remember that at least 10 other students (3 other teams!) have already read the same paper and submitted reviews!
       • Initiate questions
         • "What do you think about this? Do you think that approach XYZ is suitable for ABC?"
       • Provide discussion topics
         • "OK, we will discuss the performance aspect of this project. This project has proposed approaches X, Y, and Z..."
       • Pose questions
         • "We came up with the following questions..."

     Topics of Today's Class
     • Apache Storm vs. Heron
     • GEAR Session I. Peta-Scale Storage Systems

  3. Real-time Streaming Computing Models: Apache Storm and Twitter Heron

     Limitations of the Storm worker architecture
     • Multi-level scheduling and complex interactions
       • Tasks are scheduled using the JVM's preemptive, priority-based scheduling algorithm
       • Each executor thread runs several tasks, and each executor implements its own scheduling algorithm on top of the JVM's (see the topology sketch after this slide)
         [Figure: one JVM worker process hosting Executors 1-3, with Tasks 1-8 multiplexed across them]
       • Hard to isolate resource usage
     • Tasks with different characteristics are scheduled in the same executor
       • E.g., a Kafka spout, a bolt writing output to a key-value store, and a bolt joining data can all end up in a single executor
     • Logs from multiple tasks are written into a single file
       • Hard to debug and track the topology
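To make the task/executor distinction concrete, here is a minimal topology sketch against the standard Storm TopologyBuilder API. SentenceSpout, SplitBolt, and CountBolt are hypothetical components standing in for any IRichSpout/IRichBolt implementations defined elsewhere.

```java
// Sketch of how Storm multiplexes tasks onto executor threads.
import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
  public static void main(String[] args) {
    TopologyBuilder builder = new TopologyBuilder();

    // 2 executor threads for the spout.
    builder.setSpout("sentences", new SentenceSpout(), 2);

    // 2 executor threads but 4 tasks: each executor thread runs 2 tasks,
    // which is exactly the multiplexing the slide describes.
    builder.setBolt("split", new SplitBolt(), 2)
           .setNumTasks(4)
           .shuffleGrouping("sentences");

    builder.setBolt("count", new CountBolt(), 2)
           .fieldsGrouping("split", new Fields("word"));

    Config conf = new Config();
    conf.setNumWorkers(2);  // 2 JVM worker processes host all executors
    // StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
  }
}
```

Because the spout executors, split-bolt tasks, and count-bolt tasks all land inside the same two JVM processes, none of them gets isolated CPU or memory, and all of their logs interleave in the worker's single log file.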

  4. Limitations of the Storm worker architecture (continued)
     • Limitations of Storm Nimbus
       • Nimbus handles scheduling, monitoring, and distributing JARs
       • Topologies are untraceable
       • Nimbus does not support resource reservation or isolation
         • Storm workers that belong to different topologies and run on the same machine interfere with each other
       • ZooKeeper manages heartbeats from the workers and the supervisors, and becomes a bottleneck
       • The Nimbus component is a single point of failure
     • No backpressure: if the receiver component is unable to handle incoming data/tuples, the sender simply drops them (illustrated below)
       • In extreme scenarios, this design causes the topology to make no progress while consuming all of its resources
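The drop-on-overflow behavior can be pictured in a few lines of Java. This is purely illustrative and not Storm's actual transfer code; Storm's real queues are more elaborate, but the failure mode is the same.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Illustrative only: models a sender whose receiver-side buffer is full.
class DroppingSender {
  private final ArrayBlockingQueue<Object> receiverBuffer =
      new ArrayBlockingQueue<>(1024);
  private long dropped = 0;

  void send(Object tuple) {
    if (!receiverBuffer.offer(tuple)) {
      dropped++;  // no backpressure: the tuple is silently discarded
    }
  }
}
```

Note that the sender still burns CPU producing and serializing tuples that go nowhere, which is why an overloaded topology can consume all of its resources while making no progress.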

  5. Apache Heron
     • Maintains compatibility with the Storm API
     • Data processing semantics (see the bolt sketch after this slide)
       • At most once: no tuple is processed more than once, although some tuples may be dropped and thus miss being analyzed by the topology
       • At least once: each tuple is guaranteed to be processed at least once, although some tuples may be processed more than once and may contribute to the result of the topology multiple times

     Aurora Scheduler
     • Aurora is a generic service scheduler that runs on Mesos
     • Topologies 1 through N are all submitted to the Aurora scheduler
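Because Heron keeps the Storm API, the two semantics map onto the familiar anchoring/acking pattern. Below is a minimal bolt sketch using Storm 2.x-style signatures; the component and field names are assumptions for illustration.

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class UppercaseBolt extends BaseRichBolt {
  private OutputCollector collector;

  @Override
  public void prepare(Map<String, Object> conf, TopologyContext ctx,
                      OutputCollector out) {
    this.collector = out;
  }

  @Override
  public void execute(Tuple input) {
    String word = input.getStringByField("word");
    // At-least-once: anchor the output to the input, then ack. If this
    // bolt fails before acking, the spout replays the tuple, so the
    // result may count it more than once.
    collector.emit(input, new Values(word.toUpperCase()));
    collector.ack(input);
    // At-most-once instead: emit(new Values(...)) with no anchor and no
    // ack; a dropped tuple is then never replayed.
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("upper"));
  }
}
```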

  6. Aurora Scheduler (continued)
     • Each topology runs as an Aurora job consisting of several containers
       • Topology Master (TM), with a standby TM in a second container
       • Stream Managers (SM)
       • Heron Instances (HI)
       • Metrics Managers
     • Containers communicate through a messaging system; ZooKeeper coordinates the Topology Masters
       [Figure: two containers, each holding a Stream Manager, Heron Instances, and a Metrics Manager, connected via the messaging system to the TM, the standby TM, and ZooKeeper]

     Topology Backpressure
     • Dynamically adjusts the rate at which data flows through the topology
       • Needed for skewed data flows
     • Strategy 1: TCP backpressure
       • Uses TCP windowing
       • There is a TCP connection between each HI and its SM
       • E.g., for a slow HI, its SM notices that the send buffer is filling up
       • The SM then propagates the backpressure to the other SMs

  7. Topology Backpressure (continued)
     • Strategy 2: Spout backpressure (a sketch of this protocol follows below)
       • SMs clamp down their local spouts to reduce the new data injected into the topology
       • Step 1: The SM identifies the local spouts feeding data to the straggler HIs
       • Step 2: It sends a special "start backpressure" message to the other SMs
       • Step 3: The other SMs clamp down their local spouts
       • Step 4: Once the straggler HI catches up, it sends a "stop backpressure" message to the other SMs
       • Step 5: The other SMs resume consuming data
     • Strategy 3: Stage-by-stage backpressure
       • Gradually propagates the backpressure stage by stage until it reaches the spouts, which represent the first stage in any topology
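A hypothetical sketch of the Strategy 2 protocol follows. The watermark values, message names, and the Peer/Spout abstractions are assumptions for illustration only, not Heron's actual Stream Manager code.

```java
import java.util.List;

// Hypothetical sketch of spout backpressure. Real Stream Managers are
// not written this way; this only mirrors the five steps on the slide.
class StreamManagerSketch {
  enum Message { START_BACKPRESSURE, STOP_BACKPRESSURE }

  interface Peer { void send(Message m); }         // another SM
  interface Spout { void pause(); void resume(); } // a local spout

  private final List<Peer> otherSms;
  private final List<Spout> localSpouts;
  private boolean backpressure = false;

  StreamManagerSketch(List<Peer> otherSms, List<Spout> localSpouts) {
    this.otherSms = otherSms;
    this.localSpouts = localSpouts;
  }

  // Called as the buffer of a straggling local HI fills or drains.
  void onBufferLevel(double fillRatio) {
    if (!backpressure && fillRatio > 0.9) {        // Steps 1-2
      backpressure = true;
      localSpouts.forEach(Spout::pause);
      otherSms.forEach(p -> p.send(Message.START_BACKPRESSURE));
    } else if (backpressure && fillRatio < 0.5) {  // Steps 4-5
      backpressure = false;
      localSpouts.forEach(Spout::resume);
      otherSms.forEach(p -> p.send(Message.STOP_BACKPRESSURE));
    }
  }

  // Step 3: a remote SM told us to clamp down (or release) our spouts.
  void onMessage(Message m) {
    if (m == Message.START_BACKPRESSURE) localSpouts.forEach(Spout::pause);
    else localSpouts.forEach(Spout::resume);
  }
}
```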

  8. GEAR Session 1. Peta-Scale Storage Systems
     • Objectives
       • Understanding large-scale storage systems and their applications
     • Lecture 1 (3/17/2020)
       • Distributed File Systems: Google File System I, II and HDFS
     • Lecture 2 (3/19/2020)
       • Distributed File Systems: Google File System I, II and Apache HDFS
       • Distributed NoSQL DB: Apache Cassandra DB
     • Lecture 3 (3/24/2020)
       • Distributed NoSQL DB: Apache Cassandra DB
     • Workshop (3/26/2020)

     Workshop 3/26/2020
     • [GS-1-A] Corbett, J.C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J.J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P. and Hsieh, W., 2013. Spanner: Google's globally distributed database. ACM Transactions on Computer Systems (TOCS), 31(3), pp. 1-22.
       • Presenters: Team 12 (Miller Ridgeway, William Pickard, and Timothy Garton)
     • [GS-1-B] Xie, D., Li, F., Yao, B., Li, G., Zhou, L. and Guo, M., 2016. Simba: Efficient in-memory spatial analytics. In Proceedings of the 2016 International Conference on Management of Data (pp. 1071-1085).
       • Presenters: Team 2 (Approv Pandey, Poornima Gunhalkar, Prinila Irene Ponnayya, and Saptashi Chatterjee)
     • [GS-1-C] Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M. and Vassilakis, T., 2010. Dremel: Interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2), pp. 330-339.
       • Presenters: Team 9 (Brandt Reutimann, Anthony Feudale, Austen Weaver, and Saloni Choudhary)
