Stratosphere for Hadoop Users. Potsdam, January 03, 2012. Arvid Heise. (PowerPoint presentation)



SLIDE 1

Stratosphere for Hadoop Users

Potsdam, January 03, 2012

Arvid Heise

SLIDE 2

Outline

1. Overview of Stratosphere
2. Dataflow Orientation
3. Tuple-based Data Model
4. Other Differences
5. Seminar Organization

Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

2

SLIDE 3

Stratosphere Stack

[Figure: the Stratosphere stack. Higher-level language layer: Simple script -> Simple parser -> SOPREMO compiler -> SOPREMO plan. Programming model: PACT program -> PACT optimizer. Execution engine: Nephele scheduler -> execution graph.]

SLIDE 4

Pact (PArallelization ConTracts)

- Parallel programming model
- Implementation and generalization of Map/Reduce
- Similar interface to Hadoop
- Defines the parallelization semantics of tasks
- A Pact plan is dataflow-oriented
- Pact optimizes plans and compiles them to execution graphs for Nephele

Alexandrov et al. 2010. MapReduce and PACT - Comparing Data Parallel Programming Models.

SLIDE 5

Hadoop and Stratosphere Job

SELECT * FROM Documents d JOIN Rankings r ON r.url = d.url
WHERE CONTAINS(d.text, [keywords])
  AND r.rank > [rank]
  AND NOT EXISTS (SELECT * FROM Visits v
                  WHERE v.url = d.url AND v.visitDate = CURDATE());

[Figure: the query as a Hadoop M/R job chain vs. a Stratosphere PACT plan: MAP filters (CONTAINS [keywords], rank > [rank], visitDate = CURDATE()), MATCH on url, COGROUP/REDUCE on url; record schemas (url, content), (ip_addr, url, date, ad_revenue, ...), (rank, url, avg_duration).]

Battré et al. 2010. Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing

SLIDE 7

Nephele

- Executes an execution graph
- Decides for each task how many parallel instances are appropriate
- Assigns task instances to computation units
- Manages fault tolerance and adapts to changes

Daniel Warneke and Odej Kao. 2009. Nephele: Efficient Parallel Data Processing in the Cloud.

SLIDE 8

Sopremo and Simple

- High-level language layer
- Simple = query language
- Sopremo = semi-structured data model (JSON) and operators
- Extensible operators for several use cases: text mining, data cleansing, data mining

SLIDE 9

Outline

1. Overview of Stratosphere
2. Dataflow Orientation
3. Tuple-based Data Model
4. Other Differences
5. Seminar Organization

SLIDE 10

Data Analysis Program

Hadoop:
- Driver program + multiple jobs
- Driver program manually connects inputs and outputs
- Fixed pipeline: 1 job = 1 Map + 1 Reduce

Stratosphere:
- Directed acyclic graph of arbitrary Pacts
- Explicit data sources and sinks
- Pacts also support two inputs (for join-like operations)

[Figure: the example job as a Pact DAG with MAP, MATCH, COGROUP, and REDUCE nodes.]

SLIDE 11

Map/Reduce in Stratosphere

- Works the same as in Hadoop, but the semantics are interpreted differently
- Each Pact defines data dependencies
- Map: each tuple can be treated separately
- Reduce: tuples with the same key are grouped and guaranteed to be processed by the same reducer

[Figure: Map and Reduce inputs split into independent subsets of key/value pairs.]
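These data dependencies can be sketched in plain Python (illustrative only; `pact_map` and `pact_reduce` are our names, not Stratosphere API):

```python
from itertools import groupby

def pact_map(records, udf):
    # Map: every record forms its own independent subset
    out = []
    for rec in records:
        out.extend(udf(rec))
    return out

def pact_reduce(records, key, udf):
    # Reduce: records with the same key are grouped and are
    # guaranteed to be handed to the same reducer call
    out = []
    for _, group in groupby(sorted(records, key=key), key=key):
        out.extend(udf(list(group)))
    return out

words = [("a", 1), ("b", 1), ("a", 1)]
counts = pact_reduce(words, key=lambda kv: kv[0],
                     udf=lambda g: [(g[0][0], sum(v for _, v in g))])
assert counts == [("a", 2), ("b", 1)]
```

Because each record (for Map) or each key group (for Reduce) is independent, the subsets can be processed on different machines in parallel.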

SLIDE 12

Two Input Pacts

- Currently three additional Pacts for two inputs:
- Cross: build all possible pairs
- Match: find all matching pairs
- CoGroup: group all matching tuples
- All pairs/groups are treated independently

[Figure: Cross, Match, and CoGroup inputs split into independent subsets.]
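The semantics of the three two-input Pacts can be simulated in plain Python (a sketch of the contracts' pairing/grouping behavior, not the Stratosphere API):

```python
def cross(a, b):
    # Cross: build all possible pairs
    return [(x, y) for x in a for y in b]

def match(a, b, key):
    # Match: all pairs whose keys are equal (an equi-join)
    index = {}
    for y in b:
        index.setdefault(key(y), []).append(y)
    return [(x, y) for x in a for y in index.get(key(x), [])]

def cogroup(a, b, key):
    # CoGroup: one call per key, with the full group from each input
    keys = sorted({key(x) for x in a} | {key(y) for y in b})
    return [(k, [x for x in a if key(x) == k],
                [y for y in b if key(y) == k]) for k in keys]

a = [(1, "d1"), (2, "d2")]
b = [(1, "r1"), (1, "r2"), (3, "r3")]
k = lambda t: t[0]
assert len(cross(a, b)) == 6
assert match(a, b, k) == [((1, "d1"), (1, "r1")), ((1, "d1"), (1, "r2"))]
```

Note that Match only produces output for keys present in both inputs, while CoGroup is also called for keys that occur in only one input (with an empty group on the other side).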

SLIDE 13

Comparison

Hadoop:
- Workarounds are often necessary in a complex M/R program
- Error-prone, recurring manual data partitioning
- Patterns evolved for joins etc. (data mining book)
- Fixed pipeline allows fine-grained tweaks (exploits)

SLIDE 14

Comparison

[Figure: detailed execution plan of a Hadoop job, with operators for the Data Load Phase, Map Phase, Shuffle Phase, and Reduce Phase (fetch, scan, partition, sort, combine, merge, store).]

SLIDE 15

Comparison

Hadoop:
- Workarounds are often necessary in a complex M/R program
- Error-prone, recurring manual data partitioning
- Patterns evolved for joins etc. (data mining book)
- Fixed pipeline allows fine-grained tweaks (exploits)

Stratosphere:
- More expressiveness for complex data operations
- Maintains dataflow semantics to some degree
- Allows optimization and different shipping strategies
- Fewer hooks than Hadoop

SLIDE 16

Optimization

- Inspired by traditional query optimization
- Chooses the best join/shipping strategy for a plan

[Figure: two alternative plans for an example task over orders and lineitem. Alternative 1: repartition both inputs, sort-merge MATCH, combining-sort REDUCE. Alternative 2: broadcast one input and locally forward the other, sort-merge MATCH, COMBINE, combining-sort REDUCE. The optimizer may also reorder Pacts.]
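The choice between repartitioning and broadcasting can be illustrated with a toy cost model (purely hypothetical formulas and numbers, not the actual PACT optimizer):

```python
def repartition_cost(size_a, size_b):
    # Repartition both inputs: every record is shipped once (hashed by key)
    return size_a + size_b

def broadcast_cost(size_a, size_b, nodes):
    # Broadcast the smaller input to every node; the larger one stays local
    return min(size_a, size_b) * nodes

# A tiny build side makes broadcast cheaper ...
assert broadcast_cost(10_000, 10, nodes=8) < repartition_cost(10_000, 10)
# ... while similarly sized inputs favor repartitioning
assert repartition_cost(10_000, 9_000) < broadcast_cost(10_000, 9_000, nodes=8)
```

The real optimizer weighs such shipping costs together with local strategies (e.g. sort-merge vs. hash) when picking among alternatives.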

SLIDE 17

Outline

1. Overview of Stratosphere
2. Dataflow Orientation
3. Tuple-based Data Model
4. Other Differences
5. Seminar Organization

SLIDE 18

Switch to Tuple-based Model

Bleeding edge! Check the pact-examples subproject.
Typical Hadoop job:
- Map: transforms/filters data, sets the key
- Reduce: combines data, unsets the key
- Preceding maps limit reordering

SLIDE 19

Switch to Tuple-based Model

Bleeding edge! Check the pact-examples subproject.
Typical Hadoop job:
- Map: transforms/filters data, sets the key
- Reduce: combines data, unsets the key
- Preceding maps limit reordering

Stratosphere:
- Uses tuples instead of k/v-pairs
- Users should set keys as soon as possible for ALL Pacts
- Each Pact is annotated to define the "key"

SLIDE 20

Reduce Example

public static class CountWords extends ReduceStub {
  public void reduce(Iterator<PactRecord> records, Collector out) {
    PactRecord element = null;
    int sum = 0;
    while (records.hasNext()) {
      element = records.next();
      PactInteger count = element.getField(1, PactInteger.class);
      sum += count.getValue();
    }

    element.setField(1, new PactInteger(sum));
    out.collect(element);
  }
}
...
ReduceContract reducer = new ReduceContract(
    CountWords.class,   // <-- UDF
    PactString.class,   // <-- key class
    0,                  // <-- key index
    mapper);            // <-- input

SLIDE 21

PactRecords Semantics

- List of fields = keys or values
- Fields that are used as keys need to implement Key
- Maintains the serialized form as long as possible (lazy deserialization):
- Fields are only deserialized when read
- Only written fields are serialized
- Needs the type of a field to deserialize it
- Very efficient storage of null values
=> Use a separate field for each key in your code
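The lazy-deserialization idea can be sketched in plain Python (a conceptual model with invented names, not the PactRecord implementation):

```python
class LazyRecord:
    """Sketch of PactRecord semantics: fields stay in serialized form
    until they are read, and the caller must supply the field type."""

    def __init__(self, serialized):
        self._raw = dict(serialized)   # field index -> bytes
        self._cache = {}               # deserialized / written fields

    def get_field(self, index, typ):
        # Deserialize a field only on first access, using the given type
        if index not in self._cache:
            self._cache[index] = typ(self._raw[index].decode())
        return self._cache[index]

    def set_field(self, index, value):
        # Written fields are kept as objects until re-serialization
        self._cache[index] = value

rec = LazyRecord({0: b"word", 1: b"3"})
assert rec.get_field(1, int) == 3      # deserialized on first access
```

Fields that are never read are never deserialized, which is why the record needs the field type at access time rather than storing it itself.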

SLIDE 22

Word Count Example

- Wrap Pact stubs in Pact contracts
- All Pacts except sources specify their input

String dataInput = ...;
FileDataSource source = new FileDataSource(
    LineInFormat.class, dataInput, "Input Lines");
MapContract mapper = new MapContract(
    TokenizeLine.class, source, "Tokenize Lines");
ReduceContract reducer = new ReduceContract(
    CountWords.class, PactString.class, 0, mapper,
    "Count Words");
FileDataSink out = new FileDataSink(
    WordCountOutFormat.class, output, reducer,
    "Word Counts");

SLIDE 23

Word Count Example

- The input format needs to be parsed in any case
- Here we could already split the lines and omit the map
- In this case, we emit a PactRecord with one field
- Reuse objects to minimize garbage collection

public static class LineInFormat extends DelimitedInputFormat {
  private final PactString string = new PactString();

  public boolean readRecord(PactRecord record, byte[] line, int numBytes) {
    this.string.setValueAscii(line, 0, numBytes);
    record.setField(0, this.string);
    return true;
  }
}

SLIDE 24

Word Count Example

Implicit semantics of fields

public static class TokenizeLine extends MapStub {
  private final PactRecord outputRecord = new PactRecord();
  private final PactString string = new PactString();
  private final PactInteger integer = new PactInteger(1);

  private final AsciiUtils.WhitespaceTokenizer tokenizer =
      new AsciiUtils.WhitespaceTokenizer();

  @Override
  public void map(PactRecord record, Collector collector) {
    // get the first field (as type PactString)
    PactString str = record.getField(0, PactString.class);

    // tokenize the line
    this.tokenizer.setStringToTokenize(str);
    while (tokenizer.next(this.string)) {
      // we emit a (word, 1) pair
      this.outputRecord.setField(0, this.string);
      this.outputRecord.setField(1, this.integer);
      collector.collect(this.outputRecord);
    }
  }
}

SLIDE 25

Word Count Example

@Combinable
public static class CountWords extends ReduceStub {
  private final PactInteger theInteger = new PactInteger();

  @Override
  public void reduce(Iterator<PactRecord> records, Collector out) throws Exception {
    PactRecord element = null;
    int sum = 0;
    while (records.hasNext()) {
      element = records.next();
      PactInteger i = element.getField(1, PactInteger.class);
      sum += i.getValue();
    }

    this.theInteger.setValue(sum);
    element.setField(1, this.theInteger);
    out.collect(element);
  }
}

SLIDE 26

Word Count Example

public static class WordCountOutFormat extends FileOutputFormat {
  private final StringBuilder buffer = new StringBuilder();

  @Override
  public void writeRecord(PactRecord record) throws IOException {
    this.buffer.setLength(0);
    this.buffer.append(record.getField(0, PactString.class));
    this.buffer.append(' ');
    this.buffer.append(record.getField(1, PactInteger.class).getValue());
    this.buffer.append('\n');

    byte[] bytes = this.buffer.toString().getBytes();
    this.stream.write(bytes);
  }
}

SLIDE 27

Composite Keys

- Reduce, CoGroup, and Match use keys
- A key may be composed of more than one field
- Useful for block identifiers: separate row and column
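The effect of a composite key can be simulated in plain Python (illustrative only; `groupby` over a `(row, column)` key stands in for the Reduce contract):

```python
from itertools import groupby

# Blocks as (row, column, data) tuples; the composite key is (row, column)
blocks = [(0, 0, 2.0), (0, 1, 3.0), (0, 0, 5.0)]
key = lambda b: (b[0], b[1])

# Merge all blocks that share the same (row, column) identifier
merged = [(k[0], k[1], sum(b[2] for b in g))
          for k, g in groupby(sorted(blocks, key=key), key=key)]
assert merged == [(0, 0, 7.0), (0, 1, 3.0)]
```

Without a composite key, row and column would have to be packed into a single artificial key field by hand.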

SLIDE 28

Composite Keys

- Reduce, CoGroup, and Match use keys
- A key may be composed of more than one field
- Useful for block identifiers: separate row and column

Assuming a block is a three-tuple (row, column, data):

ReduceContract reducer = new ReduceContract(MergeBlocks.class,
    new Class[] { PactInteger.class, PactInteger.class }, // <-- key classes
    new int[] { 0, 1 },                                   // <-- key indices
    mapper);                                              // <-- input

SLIDE 29

Comparison to Key/Value-Pairs

- Key/value-pairs are easier to understand
- Generics may be used to check compatibility
- Implicit type specification through generics

SLIDE 30

Comparison to Key/Value-Pairs

Key/value-pairs:
- Easier to understand
- Generics may be used to check compatibility
- Implicit type specification through generics

Tuple-based model:
- More flexible, especially for optimization
- Pays off quickly in more complex tasks
- More convention-based; field semantics are less clear
- UDFs are more verbose, especially when reusing objects
  => Work on class mapping; the HLL uses schema inference
- A generalization of k/v-pairs; can simulate k/v

SLIDE 31

Outline

1. Overview of Stratosphere
2. Dataflow Orientation
3. Tuple-based Data Model
4. Other Differences
5. Seminar Organization

SLIDE 32

Separate Execution Engine

- Nephele and Pact together are equivalent to Hadoop
- Nephele may be used by itself for fine-grained data management
- However, the decoupling may lose some optimization potential

SLIDE 33

Separate Execution Engine

- Nephele and Pact together are equivalent to Hadoop
- Nephele may be used by itself for fine-grained data management
- However, the decoupling may lose some optimization potential
- The UDF is wrapped in Pact and Nephele code

[Figure: a PACT program with four user functions (map, map, match, reduce) is compiled; each user function is wrapped in PACT code (grouping), e.g. a match UDF inside a hash-join invoke() loop, and in Nephele code (communication), yielding a Nephele DAG that is spanned into a data flow with in-memory and network channels across nodes.]

SLIDE 34

Dynamic Resource Allocation

- Hadoop works best in a cluster environment
- Stratosphere also targets a true cloud environment
- Computation units may be dynamically booked and released

SLIDE 35

Dynamic Resource Allocation

- Hadoop works best in a cluster environment
- Stratosphere also targets a true cloud environment
- Computation units may be dynamically booked and released
- Example task: aggregate the smallest 20% of the numbers

[Figure: average instance utilization (USR/SYS/WAIT) and average network traffic among instances over time, Nephele (left) vs. Hadoop (right).]

SLIDE 36

Continuous Reoptimization

- Stratosphere wants to optimize plans
- However, the environment (cloud or cluster) is not stable
- An a priori optimal plan may become bad after a while

SLIDE 37

Continuous Reoptimization

- Stratosphere wants to optimize plans
- However, the environment (cloud or cluster) is not stable
- An a priori optimal plan may become bad after a while
- Start with a robust initial plan
- Adapt and optimize during runtime

SLIDE 38

Smart Checkpointing

(Work in progress)
- Hadoop materializes all intermediate results
- Very good for fault tolerance
- A crashed computation unit may be replaced immediately

SLIDE 39

Smart Checkpointing

(Work in progress)
- Hadoop materializes all intermediate results: very good for fault tolerance, since a crashed computation unit may be replaced immediately
- However, this is often bad for runtime performance
- Stratosphere materializes data only when it is meaningful, mostly when materialization is cheaper than recalculation:
- Will materialize the result of computation-intensive tasks
- Will not materialize Cartesian products
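The decision rule can be sketched as a one-line cost comparison (a hypothetical model with made-up numbers, not the work-in-progress implementation):

```python
def should_materialize(write_cost, recompute_cost, failure_prob):
    # Materialize only when the certain cost of writing the result is
    # lower than the expected cost of recomputing it after a failure
    return write_cost < failure_prob * recompute_cost

# CPU-heavy task with a small result: worth checkpointing
assert should_materialize(write_cost=5, recompute_cost=1000, failure_prob=0.1)
# Cartesian product: huge result, cheap to redo, so do not materialize
assert not should_materialize(write_cost=10_000, recompute_cost=200, failure_prob=0.1)
```

This captures why Hadoop's materialize-everything policy is safe but slow, while a selective policy trades a little recovery time for faster failure-free runs.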

SLIDE 40

Outline

1. Overview of Stratosphere
2. Dataflow Orientation
3. Tuple-based Data Model
4. Other Differences
5. Seminar Organization

SLIDE 41

Get Stratosphere Running

- Register at https://www.stratosphere.eu/register/register
- Write me an email to be unlocked for stage1
- Install git and Maven 2

mkdir stratosphere && cd stratosphere
git clone https://stratosphere.eu/git/stage1.git .
git checkout -b version02 remotes/origin/version02
mvn install

- See http://www.stratosphere.eu/projects/Stratosphere/wiki/GettingStarted
- Start local mode
- Get word count running

SLIDE 42

Port Hadoop to Stratosphere

- Meeting on 01/10/2012:
- The conceptual port should be completed
- The implementation should have started
- Write an email if you run into trouble
- File tickets for bugs you find
- Help other teams; write to the sdaa mailing list
- Who uses joins or composite keys?

SLIDE 43

Cluster Times

- Proposal: for the next 6 weeks, every group gets a fixed slot (Thursday 6pm to Monday 6pm)
- Time slots are flexibly tradable
- Announce start/end on the cluster mailing list

SLIDE 44

Final presentations

- May be rescheduled if everybody agrees
- The first part should be similar to the Hadoop talk
- The second part should include a detailed benchmark and comparison
- State what you will additionally evaluate until the report

SLIDE 45

Report

- Two-column format (sig-alternate.cls)
- 6-8 pages, max. 2 appendix pages
- How did you change the original algorithms?
- Which design decisions did you make?
- Focus on important, interesting evaluations
- What results did you get? Do they make sense?
- How would you improve the algorithm for better runtime/results?
- No code, conceptual only
