StratoSphere Above the Clouds The Stratosphere Project * Explore the - PowerPoint PPT Presentation

Massively Parallel Analytics beyond Map/Reduce Stephan Ewen Fabian Hüske Odej Kao Volker Markl Daniel Warneke StratoSphere Above the Clouds

The Stratosphere Project * Explore the power of ■ Cloud computing for Use ‐ Cases complex information management applications Scientific Data Life Sciences Linked Data Database-inspired ■ approach Analyze, aggregate, and ■ StratoSphere Query Processor query Above the Clouds Textual and (semi-) ■ Infrastructure as a Service structured data ... Research and prototype a ■ web-scale data analytics infrastructure * publically funded joint project with HU Berlin (C. Freytag, U. Leser) and HPI (F. Naumann) Massively Parallel Analytics beyond Map/Reduce 2

Example: Climate Data Analysis PS,1,1,0,Pa, surface pressure T_2M,11,105,0,K,air_temperature Analysis Tasks on Climate Data Sets TMAX_2M,15,105,2,K,2m maximum temperature TMIN_2M,16,105,2,K,2m minimum temperature � Validate climate models U,33,110,0,ms-1,U-component of wind V,34,110,0,ms-1,V-component of wind � Locate „hot ‐ spots“ in climate models QV_2M,51,105,0,kgkg-1,2m specific humidity CLCT,71,1,0,1,total cloud cover − Monsoon … (Up to 200 parameters) − Drought − Flooding � Compare climate models − Based on different parameter settings Necessary Data Processing Operations � Filter 2km resolution � Aggregation (sliding window) 10TB � Join 1100km, � Multi ‐ dimensional sliding ‐ window operations � Geospatial/Temporal joins � Uncertainty 950km, Massively Parallel Analytics beyond Map/Reduce 3 2km resolution

Further Use-Cases ■ Text Mining in the biosciences ■ Cleansing of linked open data Massively Parallel Analytics beyond Map/Reduce 4 4

Outline ■ Motivation for Stratosphere ■ Architecture of the Stratosphere System ■ The PACT Programming Model ■ The Nephele Execution Engine ■ Parallelizing PACT Programs Massively Parallel Analytics beyond Map/Reduce 5

TPC-H Aggregation Query using MapReduce SELECT l_orderkey, o_shippriority, sum(l_extendedprice) AS revenue FROM orders O, lineitem Li WHERE l_orderkey = o_orderkey AND o_custkey IN [X] AND o_orderdate > [Y] GROUP BY l_orderkey, o_shippriority • Partial Aggregation • Final Aggregation REDUCE • Project for output extendedprice COMBINE • Set key to orderkey, For Orders: shippriority • Filter & Project O MAP • Flag with ‘O’ • Set key to orderkey For Lineitems: • Concatenate REDUCE • Project Li ‘O’ ‐ flagged tuples and • Flag with ‘L’ L’ ‐ flagged tuples • Set key to orderkey MAP • Read file from DFS • Read file from DFS Input O Input Li Massively Parallel Analytics beyond Map/Reduce 6

TPC-H Aggregation Query on Hadoop REDUCE reduce reduce reduce sort sort sort COMBINE shuffle combine combine combine map map map MAP HDFS REDUCE reduce reduce reduce sort sort sort MAP shuffle map map map Input O Input Li ■ Data is shuffled twice ■ Intermediate result is written to HDFS Massively Parallel Analytics beyond Map/Reduce 7

TPC-H Aggregation Query - Alternative Broadcast strategy using Hadoop’s Distributed Cache: REDUCE reduce reduce reduce COMBINE sort sort sort shuffle Dist. MAP combine combine combine Cache map map map Dist. Cache Input O Input Li ■ Only one MapReduce job Data is shuffled once □ No intermediate result is written to HDFS □ Efficient if Orders is comparably small □ ■ Hadoop does not know broadcast shipping strategy Massively Parallel Analytics beyond Map/Reduce 8

Motivation for Stratosphere System ■ Complex data processing must be pushed into Map/Reduce Developer must care about parallelization □ Developer has to know how the execution framework operates □ Framework does not know what is happening □ Examples: □ − Tasks with multiple input data sets (join and cross operations) − Custom partitioning (range partitioning, window operations) ■ Static execution strategy Gives fault-tolerance but not necessarily best performance □ Developer has to hard-code own strategies □ − Broadcast strategy using the distributed cache No automatic optimization can be applied □ Results of research on parallel databases are neglected □ Massively Parallel Analytics beyond Map/Reduce 9

Architecture Overview Higher ‐ Level JAQL, JAQL? Language Scope, Pig, Pig? DryadLINQ Hive Hive? Parallel Programming PACT Map/Reduce Model Programming Programming Model Model Execution Engine Hadoop Dryad Nephele Stratosphere Hadoop Stack Dryad Stack Stack Massively Parallel Analytics beyond Map/Reduce 10

Stratosphere in a Nutshell ■ PACT Programming Model Parallelization Contract (PACT) □ Declarative definition of data parallelism □ PACT Compiler Centered around second-order functions □ Generalization of map/reduce □ ■ Nephele Nephele Dryad-style execution engine □ Evaluates dataflow graphs in parallel □ Data is read from distributed filesystem □ Flexible engine for complex jobs □ ■ Stratosphere = Nephele + PACT Compiles PACT programs to Nephele dataflow graphs □ Combines parallelization abstraction and flexible execution □ Choice of execution strategies gives optimization potential □ Massively Parallel Analytics beyond Map/Reduce 11

An Intuition for Parallelization Contracts (PACTs) ■ Map and reduce are second-order functions Call first-order functions (user code) □ Provide first-order functions with subsets of the input data □ ■ Map and reduce are PACTs in our context Key Value ■ Map □ All pairs are independently processed Independent Input set subsets ■ Reduce □ Pairs with identical key are grouped □ Groups are independently processed Massively Parallel Analytics beyond Map/Reduce 12

What is a PACT? Second-order function that defines properties on the input ■ and output data of its associated first-order function Input First ‐ order function Output Data Data Contract (user code) Contract Input Contract ■ Generates independently processable subsets of data □ Generalization of map/reduce □ Enforced by the system □ Output Contract ■ Generic properties that are preserved or produced by the user code □ Use is optional but enables certain optimizations □ Guaranteed by the developer □ Key-Value data model ■ Massively Parallel Analytics beyond Map/Reduce 13

PACTs beyond Map and Reduce ■ Cross □ Multiple inputs □ Cartesian Product of inputs is built □ All combinations are processed independently ■ Match □ Multiple inputs □ All combinations of pairs with identical key over all inputs are built □ All combinations are processed independently □ Contract resembles an equi-join on the key ■ CoGroup □ Multiple inputs □ Pairs with identical key are grouped for each input □ Groups of all inputs with identical key are processed together Massively Parallel Analytics beyond Map/Reduce 14

TPC-H Aggregation Query using PACTs SELECT l_orderkey, o_shippriority, sum(l_extendedprice) AS revenue FROM orders O, lineitem Li WHERE l_orderkey = o_orderkey AND o_custkey IN [X] AND o_orderdate > [Y] GROUP BY l_orderkey, o_shippriority • Final Aggregate • Project for output • Partial Aggregate REDUCE extendedprice COMBINE • Concat O and Li • Set Key to MATCH (orderkey, shippriority) • Filter O • Project O • Project Li • Set key to orderkey • Set key to orderkey MAP MAP • Read file from DFS • Read file from DFS Input O Input Li Massively Parallel Analytics beyond Map/Reduce 15

K-Means Iteration using PACTs Output Centers (cid,cpos) • Compute new center positions from ppos REDUCE (cid,ppos) • Find nearest cluster center • Set key to cid REDUCE • Compute distance d (pid,(ppos,cid,d)) • Set key to pid CROSS (cid,cpos) (pid,ppos) • Read or generate • Read data points Input Centers Input Data Points cluster centers Massively Parallel Analytics beyond Map/Reduce 16

Nephele Execution Engine ■ Evaluates data flow graphs in parallel Out1 ■ Vertices represent tasks Tasks run user code □ T4 ■ Edges denote communication channels Network, In-Memory, and File Channels □ T3 ■ Rich set of vertex annotations provide fine-grained control over parallelization T1 T2 Number of subtasks (degree of parallelism) □ Number of subtasks per virtual machine □ Type of virtual machine (#CPU cores, RAM…) □ In1 In2 Channel types □ Sharing virtual machines among tasks □ Massively Parallel Analytics beyond Map/Reduce 17

From PACT Programs to Parallel Data Flows PACT code invoke(): while (!input2.eof) (grouping) KVPair p = input2.next(); hash-table.put(p.key, p.value); function match(Key k, Tuple val1, while (!input1.eof) Tuple val2) KVPair p = input1.next(); -> (Key, Tuple) KVPait t = hash-table.get(p.key); { User if (t != null) Tuple res = val1.concat(val2); KVPair[] result = res.project(...); Function UF.match(p.key, p.value, t.value); Key k = res.getColumn(1); output.write(result); Return (k, res); end } Nephele code (communication) V4 V4 In ‐ Memory UF1 Channel V1 V3 V3 V3 V3 (map) UF3 UF4 span V3 V4 compile (match) (reduce) V2 UF2 V1 V2 V1 V2 (map) Network Channel Nephele Schedule Spanned Data Flow PACT Program Massively Parallel Analytics beyond Map/Reduce 18

StratoSphere Above the Clouds The Stratosphere Project * Explore the - PowerPoint PPT Presentation

Massively Parallel Analytics beyond Map/Reduce Stephan Ewen Fabian Hske Odej Kao Volker Markl Daniel Warneke StratoSphere Above the Clouds The Stratosphere Project * Explore the power of Cloud computing for Use Cases complex

Stratosphere for Hadoop Users Potsdam, January 03, 2012 Arvid Heise Outline 2 1 Overview over

Is There Evidence of Convectively Injected Water Vapor in the Lowermost Stratosphere Over Boulder,

StratoSphere Above the Clouds Triangle Enumeration Input Set of undirected edges

Reproduction of stratosphere dynamics with multiscale version of SLAV atmospheric model V.V.

Rapid Desiccation of the Stratosphere in 2016: Connection to an Anomalous Change in the QBO Dale

PageRank and recommenders on very large scale A Big Data perspective through Stratosphere

Iodine Detection in the Lower Stratosphere Prof. Rainer Volkamer Theodore K. Koenig, Alfonso

Stratosphere-Troposphere Coupling and Extratropical-to-Tropical Interactions John R. Albers 1,2 1

3-Way Comparison between AIRS, ECMWF and GPS Temperatures in Upper Troposphere and Stratosphere

The Database Systems and Information Management Group at Technische Universit at Berlin 1

BALLOON-BORNE MEASUREMENTS IN THE UPPER TROPOSPHERE AND LOWER

Kh Khrist ristia ianovich vich Inst stit itute of Theore retica ical l and Ap Applie

Evaluation of WRF performance for depicting orographically-induced gravity waves in the

An overview of recent activities at the Global Modeling and Assimilation Office with focus on the

Analysis of a Mountain Wave Event Observed in AIRS and ECMWF Joan Alexander, NorthWest Research

Weather Unit Weather 101 Video from National Geographic 3:19 Weather Vocabulary 1. Atmosphere

Extracted from Slide 1: Extracted from Slide 2: Extracted from Slide 3: Extracted from Slide 4:

promoters can help to find new drugs. (Practical guide to multi-omics and multi- scale data

RUSZYMAH BT HJ IDRUS MD PhD TISSUE ENGINEERING CENTRE UKM MEDICAL CENTER KUALA LUMPUR MALAYSIA

New WHO Classification of Myeloproliferative Neoplasms Hans Michael Kvasnicka Senckenberg

Below the Knee Interventions Are they ever justified for Claudication? Shant Vartanian, MD

Program Overview T2D Risks Factors Physical Activity Benefits Increasing General Daily

Commodity, Fiat and Crypto: What does history tell us? Barry Eichengreen November 2018 1 My

Visceral aneurysm Diagnosis and Interventions M.NEDEVSKA History 1953 De Bakey and Cooley

Sambuz

Useful Links

Newsletter

Mail Us