
CS535 Big Data 2/17/2020 Week 5-A Sangmi Lee Pallickara http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 1

CS535 BIG DATA

PART B. GEAR SESSIONS

SESSION 1: PETA-SCALE STORAGE SYSTEMS

Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

Google had 2.5 million servers in 2016

FAQs

  • Quiz 1
  • Pseudocode should be interpretable as a MapReduce job
  • Your code should be interpretable as actual MR code
  • E.g.,
  • Step 1. Read lines
  • Step 2. Tokenize them
  • Step 3. Group records based on the branch
  • Step 4. Sort all of the records of a branch
  • Step 5. Find the top 10 per branch
  • Can this code be an effective MapReduce implementation?
  • <Key, Value> is the core data structure of communication in MR, without any exception
  • Next quiz: 2/21 ~ 2/23
  • Spark and Storm
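To make the <Key, Value> point concrete, here is a minimal sketch (the function names are hypothetical, not a grading reference) of the five steps above expressed strictly as map, shuffle, and reduce over key-value pairs:

```python
from collections import defaultdict

# Hypothetical sketch: top-10 records per branch, expressed strictly as
# <key, value> pairs -- the only communication channel MapReduce allows.

def map_phase(lines):
    # Steps 1-3: read, tokenize, and emit <branch, record> pairs
    for line in lines:
        branch, amount = line.split(",")
        yield branch, float(amount)

def shuffle(pairs):
    # The framework groups values by key between map and reduce
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(branch, amounts):
    # Steps 4-5: sort the records of one branch, keep the top 10
    return branch, sorted(amounts, reverse=True)[:10]

lines = ["A,5", "A,9", "B,3", "A,7", "B,8"]
result = dict(reduce_phase(k, v) for k, v in shuffle(map_phase(lines)))
# result == {"A": [9.0, 7.0, 5.0], "B": [8.0, 3.0]}
```

Note that the grouping in Step 3 only works because the map output is keyed by branch; a version that "groups" without emitting keys is not interpretable as MapReduce.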

CS535 Big Data | Computer Science | Colorado State University

FAQs

  • How to lead the discussion as a presenter
  • GOAL: You should involve your audience in the discussion
  • Please remember that you have at least 10 other students (3 other teams!) who already read the same paper and submitted reviews!!

  • Initiate questions
  • “What do you think about this? Do you think that the approach XYZ is suitable for ABC?”
  • Provide discussion topics
  • “OK. We will discuss the performance aspect of this project. This project has proposed approach X, Y, and Z…”

  • Pose questions
  • “We came up with the following questions…”


Topics of Today's Class

  • Apache Storm vs. Heron
  • GEAR Session I. Peta Scale Storage Systems


  • 4. Real-time Streaming Computing Models: Apache Storm and Twitter Heron

Apache Storm

Apache Heron


Limitation of the Storm worker architecture

  • Multi-level scheduling and complex interaction
  • Tasks are scheduled using the JVM’s preemptive, priority-based scheduling algorithm
  • Each thread runs several tasks
  • The executor implements another scheduling algorithm
  • Hard to isolate its resource usage
  • Tasks with different characteristics are scheduled in the same executor (e.g. a Kafka spout, a bolt writing output to a key-value store, and a bolt joining data can all be in a single executor)
  • Logs from multiple tasks are written into a single file
  • Hard to debug and track the topology

[Figure: a single JVM process hosting Executor 1, Executor 2, and Executor 3, each running several of Tasks 1-8]


Limitation of the Storm worker architecture

  • Limitation of the Storm Nimbus
  • Scheduling, monitoring, and distributing JARs
  • Topologies are untraceable
  • Nimbus does not support resource reservation and isolation
  • Storm workers that belong to different topologies running on the same machine
  • Interfere with each other
  • Zookeeper manages heartbeats from workers and the supervisors
  • Becomes a bottleneck
  • The Nimbus component is a single point of failure


Limitation of the Storm worker architecture

  • If the receiver component is unable to handle incoming data/tuples
  • the sender simply drops tuples
  • In extreme scenarios, this design causes the topology to not make any progress
  • While consuming all its resources


Apache Heron

  • Maintains compatibility with the Storm API
  • Data processing semantics
  • At most once – No tuple is processed more than once, although some tuples may be dropped, and thus may miss being analyzed by the topology
  • At least once – Each tuple is guaranteed to be processed at least once, although some tuples may be processed more than once, and may contribute to the result of the topology multiple times
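The two semantics can be illustrated with a toy simulation (the lossy channel, loss rate, and retry loop are assumptions for illustration, not Heron's implementation):

```python
import random

# Toy simulation of the two delivery semantics. The lossy channel and
# retry loop are illustrative assumptions, not Heron code.

def at_most_once(tuples, lost):
    # Fire-and-forget: a lost tuple is simply never processed
    return [t for t in tuples if not lost(t)]

def at_least_once(tuples, lost):
    # Retry until acknowledged: no drops, but duplicates are possible
    delivered = []
    for t in tuples:
        while True:
            delivered.append(t)       # each attempt reaches the topology
            if not lost(t):           # ack received; stop retrying
                break
    return delivered

rng = random.Random(0)                # deterministic loss for the demo
lost = lambda t: rng.random() < 0.3   # 30% chance a tuple/ack is lost

amo = at_most_once(range(100), lost)
alo = at_least_once(range(100), lost)
# amo may be missing tuples; alo contains every tuple, some repeated
```

This is why at-least-once results can over-count: a retried tuple may contribute to the topology's output more than once.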


Aurora Scheduler

  • Aurora
  • A generic service scheduler that runs on Mesos

[Figure: the Aurora Scheduler managing Topology 1 through Topology N]


Aurora Scheduler

  • Each topology runs as an Aurora job
  • Consisting of several containers
  • Topology Master
  • Stream Manager
  • Heron Instances
  • A generic service scheduler that runs on Mesos

[Figure: a container holding the Topology Master (TM), a standby TM, and containers each holding a Stream Manager, a Metrics Manager, and several Heron Instances, coordinated through ZooKeeper and a messaging system]


Topology Backpressure

  • Dynamically adjusts the rate at which data flows through the topology
  • Skewed data flows
  • Strategy 1: TCP Backpressure
  • Uses TCP windowing
  • TCP connection between an HI (Heron Instance) and its SM (Stream Manager)
  • E.g., for a slow HI, the SM will notice that its send buffer is filling up
  • The SM will propagate this to the other SMs
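The effect of TCP windowing can be mimicked with a bounded queue: when the consumer falls behind, the sender's buffer fills and the sender blocks instead of dropping tuples. A sketch using standard-library stand-ins, not Heron internals:

```python
import queue
import threading
import time

# A bounded queue stands in for a TCP send buffer: when the slow
# consumer falls behind, put() blocks the producer, so backpressure
# propagates "for free" -- the effect TCP windowing gives the SM.

buf = queue.Queue(maxsize=4)          # small "send buffer"
consumed = []

def slow_consumer():
    while True:
        item = buf.get()
        if item is None:              # sentinel: no more data
            return
        time.sleep(0.01)              # a straggler Heron Instance
        consumed.append(item)

t = threading.Thread(target=slow_consumer)
t.start()
for i in range(20):
    buf.put(i)                        # blocks whenever the buffer is full
buf.put(None)
t.join()
# consumed == list(range(20)): nothing was dropped, the sender waited
```

Contrast this with the Storm behavior described earlier, where a full receiver simply causes the sender to drop tuples.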


Topology Backpressure

  • Strategy 2: Spout Backpressure
  • SMs clamp down their local spouts to reduce the new data that is injected into the topology
  • Step 1: Identify the local spouts feeding data to the straggler HIs
  • Step 2: Send a special message (start backpressure) to the other SMs
  • Step 3: The other SMs clamp down their local spouts
  • Step 4: Once the straggler HI catches up → send a stop backpressure message to the other SMs
  • Step 5: The other SMs start consuming data again
  • Strategy 3: Stage-by-stage backpressure
  • Gradually propagates the backpressure stage-by-stage until it reaches the spouts, which represent the 1st stage in any topology
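The five steps of spout backpressure can be sketched as a small message protocol (the class and message handling below are illustrative, not Heron's actual API):

```python
# Sketch of the spout-backpressure handshake. The message strings
# follow the steps above; the class is illustrative, not Heron's API.

class StreamManager:
    def __init__(self, peers=None):
        self.peers = peers or []
        self.spouts_clamped = False

    def on_slow_instance(self):
        # Steps 1-2: a straggler HI is detected; notify every other SM
        for sm in self.peers:
            sm.receive("start backpressure")

    def on_caught_up(self):
        # Step 4: the straggler caught up; release the clamp everywhere
        for sm in self.peers:
            sm.receive("stop backpressure")

    def receive(self, msg):
        # Steps 3 and 5: clamp or resume the local spouts
        self.spouts_clamped = (msg == "start backpressure")

others = [StreamManager() for _ in range(3)]
local = StreamManager(peers=others)
local.on_slow_instance()      # every peer SM now clamps its spouts
local.on_caught_up()          # every peer SM resumes consuming data
```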


GEAR Session 1. Peta-scale Storage Systems


GEAR Session 1. Peta-scale Storage Systems

  • Objectives
  • Understanding large-scale storage systems and their applications
  • Lecture 1. 3/17/2020
  • Distributed File Systems: Google File System I, II and HDFS
  • Lecture 2. 3/19/2020
  • Distributed File Systems: Google File System I, II and Apache HDFS
  • Distributed NoSQL DB: Apache Cassandra DB
  • Lecture 3. 3/24/2020
  • Distributed NoSQL DB: Apache Cassandra DB
  • Workshop 3/26/2020


GEAR Session 1. Peta-scale Storage Systems

  • Workshop 3/26/2020
  • [GS-1-A]
  • Corbett, J.C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J.J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P. and Hsieh, W., 2013. Spanner: Google’s globally distributed database. ACM Transactions on Computer Systems (TOCS), 31(3), pp.1-22.
  • Presenters: Team 12 (Miller Ridgeway, William Pickard, and Timothy Garton)
  • [GS-1-B]
  • Xie, D., Li, F., Yao, B., Li, G., Zhou, L. and Guo, M., 2016, June. Simba: Efficient in-memory spatial analytics. In Proceedings of the 2016 International Conference on Management of Data (pp. 1071-1085).
  • Presenters: Team 2 (Approv Pandey, Poornima Gunhalkar, Prinila Irene Ponnayya, and Saptashi Chatterjee)
  • [GS-1-C]
  • Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M. and Vassilakis, T., 2010. Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2), pp.330-339.
  • Presenters: Team 9 (Brandt Reutimann, Anthony Feudale, Austen Weaver, and Saloni Choudhary)


GEAR Session 1. Peta-scale Storage Systems

Lecture 1. Google File System and Hadoop Distributed File System


This material is based on

  • Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: The Google File System. Proceedings of SOSP 2003: 29-43
  • Andrew Fikes, Storage Architecture and Challenges, Faculty Summit, 2010
  • http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.reverse-proxy.org/en/us/university/relations/facultysummit2010/storage_architecture_and_challenges.pdf
  • Jeff Dean’s SOCC keynote, Building Large-Scale Internet Services
  • http://static.googleusercontent.com/media/research.google.com/en//people/jeff/SOCC2010-keynote-slides.pdf
  • http://sysmagazine.com/posts/206986/
  • Erasure Coding: Backblaze Open-Sources Reed-Solomon
  • https://www.backblaze.com/blog/reed-solomon/
  • An introduction to Reed-Solomon codes
  • http://www.cs.cmu.edu/~guyb/realworld/reedsolomon/reed_solomon_codes.html


The Machinery

Servers
  • CPUs
  • DRAM
  • Disks

Racks
  • 40-80 servers
  • Ethernet switch

Cluster
  • >10,000 nodes


Google Cluster Software Environment

  • Clusters contain 1000s of machines, typically with one or a handful of configurations
  • The file system (GFS or Colossus) and the cluster scheduling system are the core services
  • Typically 100s to 1000s of active jobs
  • A mix of batch and low-latency, user-facing production jobs


The Realistic View of a Data Center

  • Typical first year for a new cluster:
  • ~1 network rewiring (rolling downtimes: ~5% of machines over 2-day span)
  • ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
  • ~5 racks go wonky (40-80 machines see 50% packet loss)
  • ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
  • ~12 router reloads (takes out DNS and external IPs for a couple minutes)
  • ~3 router failures (have to immediately pull traffic for an hour)
  • ~dozens of minor 30-second blips for DNS
  • ~1000 individual machine failures
  • ~thousands of hard drive failures
  • slow disks, bad memory, misconfigured machines, flaky machines, etc.
  • Long distance links
  • Reliability/availability must come from software


Numbers we should know [1/2]

  • Level 1 cache reference: 0.5 ns
  • Branch misprediction: 5 ns
  • Level 2 cache reference: 7 ns
  • Mutex lock/unlock: 25 ns
  • Main memory reference: 100 ns
  • Compress 1 KB with a cheap compression algorithm: 3,000 ns


Numbers we should know [2/2]

  • Read 1 MB sequentially from memory: 250,000 ns
  • Round trip within the same datacenter: 500,000 ns
  • Disk seek: 10,000,000 ns
  • Read 1 MB sequentially from disk: 20,000,000 ns
  • Send packet CA → Netherlands → CA: 150,000,000 ns
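Encoding the two tables as data makes the ratios easy to check (values copied from the numbers above; these are the classic figures from Jeff Dean's talks):

```python
# Latency numbers from the two slides above, in nanoseconds.
NS = {
    "L1 cache reference":          0.5,
    "branch misprediction":        5,
    "L2 cache reference":          7,
    "mutex lock/unlock":           25,
    "main memory reference":       100,
    "compress 1 KB (cheap)":       3_000,
    "read 1 MB from memory":       250_000,
    "datacenter round trip":       500_000,
    "disk seek":                   10_000_000,
    "read 1 MB from disk":         20_000_000,
    "packet CA->Netherlands->CA":  150_000_000,
}

# A disk seek costs as much as ~100,000 main-memory references:
print(NS["disk seek"] / NS["main memory reference"])            # 100000.0
# Reading 1 MB from disk is 80x slower than reading it from memory:
print(NS["read 1 MB from disk"] / NS["read 1 MB from memory"])  # 80.0
```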


Back of the Envelope Calculation

  • How long to generate an image results page (30 thumbnails)?
  • Design 1: Read serially, thumbnail images (256KB) on the fly
  • 30 seeks * 10 ms/seek + 30 * 256K / 30 MB/s = 560 ms
  • Design 2: Issue reads in parallel:
  • 10 ms/seek + 256K read / 30 MB/s = 18 ms
  • Lots of variations:
  • caching (single images? whole sets of thumbnails?)
  • pre-computing thumbnails
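The two designs can be checked numerically (constants taken from the slide; the slide's 560 ms and 18 ms are the same results rounded):

```python
# Reproducing the back-of-the-envelope numbers from the slide.
SEEK = 10e-3            # 10 ms per disk seek
READ_BW = 30e6          # 30 MB/s sequential read bandwidth
THUMB = 256e3           # 256 KB per thumbnail
N = 30                  # thumbnails per results page

serial = N * SEEK + N * THUMB / READ_BW          # Design 1: read serially
parallel = SEEK + THUMB / READ_BW                # Design 2: reads in parallel
print(round(serial * 1000))    # 556 (the slide rounds to 560 ms)
print(round(parallel * 1000))  # 19  (the slide rounds to 18 ms)
```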


Storage Software: GFS

  • Google’s first cluster-level file system (2003)
  • Designed for batch applications with large files
  • Single master for metadata and chunk management
  • Chunks are typically replicated 3x for reliability
  • Lessons
  • Scaled to approximately 50M files, and 10 PB
  • Large files increased application complexity
  • Not appropriate for latency-sensitive applications
  • Scaling limits added management overhead


Storage Software: Colossus (GFS2)

  • Next-generation cluster-level file system
  • Automatically sharded metadata layer
  • Data typically written using Reed-Solomon encoding (1.5x overhead)
  • Client-driven replication, encoding, and recovery
  • Metadata space has enabled availability
  • Why Reed-Solomon?
  • Cost. Especially with cross cluster replication
  • More flexible cost vs. availability choice
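The cost argument is easy to quantify. Assuming a (6, 3) Reed-Solomon layout (6 data + 3 parity blocks; this particular layout is an assumption for illustration, the slide only states the 1.5x figure):

```python
def storage_overhead(data_blocks, parity_blocks):
    # Bytes stored per byte of user data
    return (data_blocks + parity_blocks) / data_blocks

# GFS-style 3x replication: 1 data copy + 2 extra copies
print(storage_overhead(1, 2))   # 3.0
# Assumed Reed-Solomon (6 data, 3 parity): same order of durability
print(storage_overhead(6, 3))   # 1.5
```

Halving the stored bytes per user byte is what makes erasure coding attractive, especially once data is also replicated across clusters.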


Storage Landscape

  • Early Google:
  • US-centric traffic
  • Batch, latency-insensitive indexing processes
  • Document "snippets" serving (single seek)
  • Current day:
  • World-wide traffic
  • Continuous crawl and indexing processes (Caffeine)
  • Seek-heavy, latency-sensitive apps (Gmail)
  • Person-to-person, person-to-group sharing (Docs)


Storage Landscape: Flash (SSDs)

  • Important future directions:
  • More workloads that are increasingly seek heavy
  • 50-150x less expensive than disk per random read
  • Best usage is still being explored
  • Concerns:
  • Availability of devices
  • 17-32x more expensive per GB than disk
  • Endurance not yet proven in the field


GEAR Session 1. Peta-scale Storage Systems

Lecture 1. Google File System and Hadoop Distributed File System

  • 1. Google File System


Demand pulls in GFS (1/2)

  • Files are huge by traditional standards
  • File mutations predominantly through appends
  • Not overwrites
  • Component failures are the norm
  • Applications and File system API designed in lock-step


Demand pulls in GFS (2/2)

  • Hundreds of producers will concurrently append to a file
  • Many-way merging
  • High sustained bandwidth is more important than low latency


The file system interface

  • Does not implement a standard API such as POSIX
  • Supports create, delete, open, close, read, and write
  • snapshot
  • Creates a fast copy of a file or directory tree
  • record append
  • Multiple clients can concurrently append records to the same file
  • Without additional locking


Architecture of GFS

[Figure: multiple Clients communicating with the single GFS Master and with GFS Chunk Servers, each chunk server storing chunks on the local Linux file system]


Chunks

  • Obvious reason
  • The file is too big
  • Sets the stage for computations that operate on this data
  • Parallel I/O
  • I/O seek times are 14 × 10^6 times slower than CPU access times


Chunk size

  • This is fixed at 64 MB (→ now 128 MB)
  • Much larger than typical FS block sizes (512 bytes)
  • Lazy space allocation (delayed space allocation)
  • Each chunk is stored as a plain Linux file
  • Physical allocation of disk space is delayed as long as possible
  • Until data accumulates up to the chunk size
  • Extended only as needed
  • Avoids internal fragmentation
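Lazy allocation is essentially what a sparse file gives you on Linux; a short demonstration (the exact block counts depend on the underlying file system):

```python
import os
import tempfile

# Demonstrating delayed (lazy) allocation with a sparse file: the file
# *appears* to be a full 64 MB chunk, but the OS only allocates blocks
# for the bytes actually written -- avoiding internal fragmentation.

path = os.path.join(tempfile.mkdtemp(), "chunk")
with open(path, "wb") as f:
    f.truncate(64 * 1024 * 1024)   # logical size: a full 64 MB chunk
    f.write(b"header")             # but only a few bytes of real data

st = os.stat(path)
print(st.st_size)                  # 67108864 (apparent size)
print(st.st_blocks * 512)          # far smaller: only allocated blocks
```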


Large chunk size: Advantages

  • Reduces client interaction with the master
  • Can cache info for a multi-TB working set
  • Reduces network overhead
  • With a large chunk, a client performs more operations over a persistent connection
  • Reduces the size of metadata stored in the master
  • 64 bytes of metadata per 64 MB chunk


Large chunk size: Disadvantage

  • Small files (with a small number of chunks)
  • May become hot spots
  • e.g. popular executable files
  • Solution
  • Assign a higher replication factor


GEAR Session 1. Peta-scale Storage Systems

Lecture 1. Google File System and Hadoop Distributed File System

  • 2. Master Operations


Architecture of GFS

[Figure: multiple Clients communicating with the single GFS Master and with GFS Chunk Servers, each chunk server storing chunks on the local Linux file system]


Master operations

  • Single master
  • Manage system metadata
  • Leasing of chunks
  • Garbage collection of orphaned chunks
  • Chunk migrations


ALL system metadata is managed by the Master and stored in Main Memory

  • File and chunk namespaces
  • Mapping from files to chunks
  • Location of chunks

Mutations are logged to a persistent operation log


Size of the file system with 1 TB of RAM (assume file sizes are exact multiples of the chunk size)

  • Assume that the chunk size is 64 MB (2^6 × 2^20 bytes)
  • The file namespace data: less than 64 bytes (2^6) per chunk
  • Number of entries = 1 TB / (size of namespace data) = 2^40 / 2^6 = 2^34
  • MAXIMUM SIZE of the file system = Number of entries × chunk size = 2^34 × (2^6 × 2^20) = 2^60 bytes = 1 EB
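The same derivation, checked in code:

```python
# Re-deriving the capacity estimate for a single GFS master.
RAM = 2**40                      # 1 TB of master memory
META = 2**6                      # <= 64 bytes of metadata per chunk
CHUNK = 2**6 * 2**20             # 64 MB chunk size

entries = RAM // META            # 2**34 chunk entries fit in memory
max_fs = entries * CHUNK         # bytes addressable by one master
print(max_fs == 2**60)           # True: exactly 1 EB
```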


Tracking the chunk servers

  • Master does not keep a persistent copy of the location of chunk servers
  • List maintained via heart-beats
  • Allows list to be in sync with reality despite failures
  • Chunk server has final word on chunks it holds


Simple read example

[Figure: read path between a Client, the GFS Master, and the GFS Chunk Servers (each on the local Linux file system)]

  • Client → Master: (file name, chunk index within a file) [control message]
  • Master → Client: (chunk handle, chunk location) [control message]
  • Client → Chunk Server: (chunk handle, byte range) [control message]
  • Chunk Server → Client: chunk data [data message]
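The read path can be sketched with in-memory stand-ins (class and method names are illustrative, not the real GFS RPC interface):

```python
# Sketch of the GFS read path: the master serves only metadata
# (control messages); chunk data flows directly from a chunk server.

CHUNK = 64 * 1024 * 1024

class Master:
    def __init__(self, mapping):
        self.mapping = mapping                   # metadata only

    def lookup(self, filename, chunk_index):
        handle, locations = self.mapping[(filename, chunk_index)]
        return handle, locations                 # control message

class ChunkServer:
    def __init__(self, chunks):
        self.chunks = chunks                     # handle -> bytes

    def read(self, handle, start, end):
        return self.chunks[handle][start:end]    # data message

cs = ChunkServer({"h1": b"x" * 100})
master = Master({("/logs/a", 0): ("h1", [cs])})

# Client: translate (file, offset) into a chunk index, ask the master
# for the handle and locations, then fetch the byte range directly.
offset = 10
handle, locations = master.lookup("/logs/a", offset // CHUNK)
data = locations[0].read(handle, offset % CHUNK, offset % CHUNK + 5)
# data == b"xxxxx": the payload never flows through the master
```

Keeping data messages off the master is what lets a single master scale: it answers small metadata lookups while chunk servers carry the bandwidth.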


Questions?
