Big Data Analytics beyond Map/Reduce


SLIDE 1

GERMAN-FRENCH SUMMER UNIVERSITY FOR YOUNG RESEARCHERS 2011

CLOUD COMPUTING: CHALLENGES AND OPPORTUNITIES

17.7. – 22.7. 2011

Big Data Analytics beyond Map/Reduce

  • Prof. Dr. Volker Markl

TU Berlin

SLIDE 2

Shift Happens! Our Digital World!

Video courtesy of Michael Brodie, Chief Scientist, Verizon. The original "Shift Happens" video is by K. Fisch and S. McLeod and focuses on shifts in society, aimed at teacher education; Michael Brodie's version focuses on the shift in, and caused by, the digital world.

SLIDE 3

Data Growth and Value

■ About data growth:

□ $600 to buy a disk drive that can store all of the world’s music □ 5 billion mobile phones in use in 2010 □ 30 billion pieces of content shared on Facebook every month □ 40% projected growth in global data per year

■ About the value of captured data:

□ €250 billion potential value to Europe's public sector administration
□ 60% potential increase in retailers' operating margins possible with big data
□ 140,000–190,000 more deep analytical talent positions needed

Source: “Big Data: The next frontier for innovation, competition and productivity” (McKinsey)

SLIDE 4

Big Data

■ Data have swept into every industry and business function

□ important factor of production □ exabytes of data stored by companies every year □ much of modern economic activity could not take place without it

■ Big Data creates value in several ways

□ provides transparency □ enables experimentation □ brings about customization and tailored products □ supports human decisions □ triggers new business models

■ Use of Big Data will become a key basis of competition and growth

□ companies failing to develop their analysis capabilities will fall behind

Source: “Big Data: The next frontier for innovation, competition and productivity” (McKinsey)

SLIDE 5

Big Data Analytics

■ Data volume keeps growing

  • Data Warehouse sizes of about 1PB are not uncommon!
  • Some businesses produce >1TB of new data per day!
  • Scientific scenarios are even larger (e.g. LHC experiment results in ~15PB / yr)

■ Some systems are required to support extreme throughput in transaction processing

  • Especially financial institutes

■ Analysis Queries become more and more complex

  • Discovering statistical patterns is compute intensive
  • May require multiple passes over the data

■ Performance of single computing cores or single machines is not increasing substantially enough to cope with this development

SLIDE 6

Trends

■ Massive parallelization ■ Virtualization ■ Service-based computing ■ Web-scale data management

□ Analytics / BI □ Operational □ Multi-tenancy

Claremont Report

■ Re-architecting DBMS

□ Parallelization □ Continuous optimization □ Tight integration

■ Service-based everything

□ Programming Model □ Combining structured and unstructured data □ Media Convergence

Trends and Challenges

SLIDE 7

Overview

■ Introduction ■ Big Data Analytics ■ Map/Reduce/Merge ■ Introducing … the Cloud ■ Stratosphere (PACT and Nephele) ■ Demo (Thomas Bodner, Matthias Ringwald) ■ Mahout and Scalable Data Mining (Sebastian Schelter)

SLIDE 8

BIG DATA ANALYTICS

Map/Reduce Revisited

SLIDE 9

Data Partitioning (I)

■ Partitioning the data means creating a set of disjoint subsets

  • Example: sales data, where every year gets its own partition

■ For shared-nothing, data must be partitioned across nodes

  • If it were replicated, it would effectively become a shared-disk system with the local disks acting like a cache (which must be kept coherent)

■ Partitioning with certain characteristics has further advantages

  • Some queries can be limited to operate on certain sets only, if it is provable that all relevant data (passing the predicates) is in that partition
  • Partitions can simply be dropped as a whole (the data is rolled out) when they are no longer needed (e.g. discard old sales)

SLIDE 10

Data Partitioning (II)

■ How to partition the data into disjoint sets?

  • Round robin: each set gets a tuple in turn; all sets have a guaranteed equal number of tuples, but there is no apparent relationship between the tuples in one set.
  • Hash partitioned: define a set of partitioning columns and generate a hash value over those columns to decide the target set. All tuples with equal values in the partitioning columns end up in the same set.
  • Range partitioned: define a set of partitioning columns and split the domain of those columns into ranges. The range determines the target set. All tuples in one set are in the same range.

SLIDE 11

■ The data model

□ key/value pairs □ e.g. (int, string)

■ Functional programming model with second-order functions

□ map: input is a key-value pair (Km, Vm); output is a list of intermediate key-value pairs (Kr, Vr)*
□ reduce: input is a key together with the list of all its values (Kr, Vr*); output is a key and a single value (Kr, Vr)

■ The framework

□ accepts a list of input key-value pairs □ outputs result pairs

Map/Reduce Revisited

SLIDE 12

Data Flow in Map/Reduce

[Diagram: each MAP task transforms its input pairs (Km, Vm) into lists of intermediate pairs (Kr, Vr)*; the framework groups the intermediate pairs by key into (Kr, Vr*); each REDUCE task turns one such group into a result pair (Kr, Vr)]

SLIDE 13

■ Problem: Counting words in a parallel fashion

□ How many times different words appear in a set of files □ juliet.txt: Romeo, Romeo, wherefore art thou Romeo? □ benvolio.txt: What, art thou hurt? □ Expected output: Romeo (3), art (2), thou (2), hurt (1), wherefore (1), what (1)

■ Solution: Map-Reduce Job

map(filename, line) {
  foreach (word in line)
    emit(word, 1);
}

reduce(word, numbers) {
  int sum = 0;
  foreach (value in numbers) {
    sum += value;
  }
  emit(word, sum);
}
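As a cross-check, here is how the same job looks as a hedged sketch against the (2011-era) Hadoop Java API; the job wiring and paths are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map: (offset, line) -> (word, 1) for every word in the line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce: (word, [1, 1, ...]) -> (word, sum)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation of counts
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}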

Map Reduce Illustrated (1)

SLIDE 14

Map Reduce Illustrated (2)

SLIDE 15

Data Analytics: Relational Algebra

■ Base Operators

□ selection (σ) □ projection (π) □ set/bag union (∪) □ set/bag difference (\ or −) □ Cartesian product (×)

■ Derived Operators

□ join (⋈) □ set/bag intersection (∩) □ division (÷)

■ Further Operators

□ de-duplication □ generalized projection (grouping and aggregation) □ outer-joins and semi-joins □ sort

SLIDE 16

■ Selection / projection / aggregation

□ SQL Query: SELECT year, SUM(price) FROM sales WHERE area_code = 'US' GROUP BY year
□ Map/Reduce job:

map(key, tuple) {
  int year = YEAR(tuple.date);
  if (tuple.area_code == 'US')
    emit(year, { 'year' => year, 'price' => tuple.price });
}

reduce(key, tuples) {
  double sum_price = 0;
  foreach (tuple in tuples) {
    sum_price += tuple.price;
  }
  emit(key, sum_price);
}

Relational Operators as Map/Reduce jobs

SLIDE 17

■ Sorting

□ SQL Query: SELECT * FROM sales ORDER BY year
□ Map/Reduce job (partitioning by decade; each reducer sorts its partition):

map(key, tuple) {
  emit(YEAR(tuple.date) DIV 10, tuple);
}

reduce(key, tuples) {
  emit(key, sort(tuples));
}

Relational Operators as Map/Reduce jobs

SLIDE 18

■ UNION

□ SQL Query: SELECT phone_number FROM employees UNION SELECT phone_number FROM bosses
□ The Map/Reduce job needs two different mappers:

map(key, employees_phonebook_entry) {
  emit(employees_phonebook_entry.number, '');
}

map(key, bosses_phonebook_entry) {
  emit(bosses_phonebook_entry.number, '');
}

reduce(phone_number, tuples) {
  emit(phone_number, '');
}

Relational Operators as Map/Reduce jobs

SLIDE 19

■ INTERSECT

□ SQL Query: SELECT first_name FROM employees INTERSECT SELECT first_name FROM bosses
□ The Map/Reduce job needs two different mappers:

map(key, employee_listing_entry) {
  emit(employee_listing_entry.first_name, 'E');
}

map(key, boss_listing_entry) {
  emit(boss_listing_entry.first_name, 'B');
}

reduce(first_name, markers) {
  if ('E' in markers and 'B' in markers) {
    emit(first_name, '');
  }
}

Relational Operators as Map/Reduce jobs

SLIDE 20

■ Benchmark to test the performance of distributed systems
■ Goal: sort one petabyte of 100-byte records
■ Implementation in Hadoop:

□ a range partitioner splits the data into equal ranges (one for each participating node)

■ The sort is basically the "range-partitioning sort" described earlier

The Petabyte Sort Benchmark

SLIDE 21

Petabyte sorting benchmark

■ Per node: 2 quad-core Xeons @ 2.5 GHz, 4 SATA disks, 8 GB RAM (upgraded to 16 GB before the petabyte sort), 1 gigabit Ethernet
■ Per rack: 40 nodes, 8 gigabit Ethernet uplinks

SLIDE 22

Cluster Utilization during Sort

SLIDE 23

JOINS IN MAP/REDUCE

Map/Reduce Revisited

SLIDE 24

Symmetric Fragment-and-Replicate Join (II)

[Diagram: assignment of the fragment pairs to the nodes in the cluster]

SLIDE 25

■ We can do better if relation S is much smaller than R.
■ Idea: reuse the existing partitioning of R and replicate the whole relation S to each node.
■ Cost: p * B(S) transport + local join cost
■ The asymmetric fragment-and-replicate join is a special case of the symmetric algorithm with m = p and n = 1.
■ The asymmetric fragment-and-replicate join is also called the broadcast join.

Asymmetric Fragment-and-Replicate Join
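An illustrative calculation with assumed numbers: for p = 10 nodes, B(R) = 10,000 blocks and B(S) = 100 blocks, the broadcast join transports p * B(S) = 1,000 blocks and leaves R in place, while a repartition join would move up to B(R) + B(S) = 10,100 blocks. With B(S) = 5,000 the broadcast join would transport 50,000 blocks and lose; as a rule of thumb it pays off roughly when p * B(S) < B(R) + B(S).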

SLIDE 26

■ Equi-join: L(A,X) ⋈ R(X,C)

□ assumption: |L| << |R|

■ Idea

□ broadcast L completely to each node before the map phase begins, either by utilities like Hadoop's distributed cache, or by letting the mappers read L from the cluster filesystem at startup

■ Mapper (runs only over R)

□ step 1: read the assigned input split of R into a hash table (build phase)
□ step 2: scan the local copy of L and find matching R tuples (probe)
□ step 3: emit each such pair
□ alternatively: read L into the hash table, then read R and probe

■ No need for partition / sort / reduce processing

□ the mapper outputs the final join result

Broadcast Join

[Diagram: three mappers, each reading one local split of R plus a complete copy of L]
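A self-contained Java sketch of the build-on-L variant (plain in-memory lists and String[] tuples stand in for the real input formats; this is an illustration, not the Hadoop mapper API):

import java.util.*;

public class BroadcastJoinSketch {

    // Equi-join L(A, X) with R(X, C) on X; a tuple is a String array.
    // broadcastL is the full small relation, localR the node's split of R.
    static List<String[]> broadcastJoin(List<String[]> broadcastL, Iterable<String[]> localR) {
        Map<String, List<String[]>> hashTable = new HashMap<>();
        for (String[] l : broadcastL) {                        // build phase over small L
            hashTable.computeIfAbsent(l[1], k -> new ArrayList<>()).add(l);
        }
        List<String[]> result = new ArrayList<>();
        for (String[] r : localR) {                            // probe phase over the local R split
            for (String[] l : hashTable.getOrDefault(r[0], Collections.emptyList())) {
                result.add(new String[] { l[0], l[1], r[1] }); // emit (A, X, C)
            }
        }
        return result;                                         // final join result, no reduce phase
    }
}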

SLIDE 27

■ Equi-join: L(A,X) ⋈ R(X,C)

□ assumption: |L| < |R|

■ Mapper (over both L(A,X) and R(X,C))

□ identical processing logic for L and R
□ emit each tuple once
□ the intermediate key is a pair of the value of the actual join key X and an annotation identifying which relation the tuple belongs to (L or R)

■ Partition and sort

□ partition by the hash value of the join key
□ the reduce input is ordered first on the join key, then on the relation name
□ output: a sequence of L(i), R(i) blocks of tuples for ascending join key i

■ Reduce

□ collect all L-tuples of the current L(i) block in a hash map
□ combine them with each R-tuple of the corresponding R(i) block

[Diagram: mappers read mixed L/R splits and repartition the tagged tuples by h(key) % n; each reducer builds over its L block and streams the matching R block]
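A minimal Java sketch of the reduce-side logic for one join key, assuming (as the sort order above guarantees) that the tagged L-tuples arrive before the R-tuples:

import java.util.*;

public class RepartitionJoinSketch {

    // taggedTuples: for a single join key, tuples as {tag, attr1, attr2},
    // where tag is "L" for L(A, X) tuples and "R" for R(X, C) tuples.
    static List<String[]> joinOneKey(List<String[]> taggedTuples) {
        List<String[]> lBlock = new ArrayList<>();
        List<String[]> result = new ArrayList<>();
        for (String[] t : taggedTuples) {
            if (t[0].equals("L")) {
                lBlock.add(t);                                 // collect the L(i) block first
            } else {
                for (String[] l : lBlock) {                    // combine each R-tuple with all L-tuples
                    result.add(new String[] { l[1], l[2], t[2] }); // emit (A, X, C)
                }
            }
        }
        return result;
    }
}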

SLIDE 28

■ Equi-join: D1(A,X) ⋈ D2(B,Y) ⋈ F(C,X,Y)

□ star schema with fact table F and dimensions Di

■ Fragment

□ D1 and D2 are partitioned independently
□ the partitions for F are defined as D1 × D2

■ Replicate

□ for an F-tuple f, the partition is uniquely defined as (hash(f.x), hash(f.y))
□ for a D1-tuple d1 there is one degree of freedom (d1.y is undefined); D1-tuples are thus replicated for each possible y value
□ symmetric for D2

■ Reduce

□ find and emit (f, d1, d2) triples
□ depending on the input sorting, different join strategies are possible

Multi-Dimensional Partitioned Join

[Diagram: the partition grid spanned by D1 and D2; F-tuples fall into single cells, dimension tuples are replicated along the rows/columns]

SLIDE 29

Joins in Hadoop

[Chart: join execution times over number of nodes and selectivity; series include the asymmetric (broadcast) join and the multi-dimensional partitioned join]

SLIDE 30

Parallel DBMS vs. Map/Reduce

■ Schema support: Parallel DBMS yes; Map/Reduce no
■ Indexing: Parallel DBMS yes; Map/Reduce no
■ Programming model: stating what you want (declarative: SQL) vs. presenting an algorithm (procedural: C/C++, Java, …)
■ Optimization: Parallel DBMS yes; Map/Reduce no
■ Scaling: 1–500 nodes vs. 10–5,000 nodes
■ Fault tolerance: limited vs. good
■ Execution: pipelines results between operators vs. materializes results between phases

SLIDE 31

MAP-REDUCE-MERGE

Simplified Relational Data Processing on Large Clusters

SLIDE 32

Map-Reduce-Merge

■ Motivation

□ Map/Reduce does not directly support processing multiple related heterogeneous datasets □ difficulties and/or inefficiency when one must implement relational operators like joins

■ Map-Reduce-Merge

□ adds a merge phase whose goal is to efficiently merge data that is already partitioned and sorted (or hashed)
□ Map-Reduce-Merge workflows are comparable to RDBMS execution plans
□ parallel join algorithms can be implemented more easily
□ signatures (reconstructed in the notation of the Map-Reduce-Merge paper):

map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → (k2, [v3])
merge: ((k2, [v3]), (k3, [v4])) → [(k4, v5)]
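As a hedged illustration of what the merge phase buys, a plain-Java sort-merge combine of two reduced outputs on a shared key (assuming unique keys per input; this is not the actual Map-Reduce-Merge API):

import java.util.*;

public class MergeSketch {

    // Both inputs are (key, value) pairs sorted by key, as produced by the
    // reduce phase; matching keys are combined into one output pair.
    static List<String[]> merge(List<String[]> left, List<String[]> right) {
        List<String[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) i++;                        // advance the side with the smaller key
            else if (cmp > 0) j++;
            else {                                   // equal keys: emit the combination
                out.add(new String[] { left.get(i)[0],
                                       left.get(i)[1] + "|" + right.get(j)[1] });
                i++; j++;
            }
        }
        return out;
    }
}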

SLIDE 33

THE CLOUD

Introducing …

SLIDE 34

In the Cloud …

SLIDE 35

"The interesting thing about cloud computing is that we've redefined cloud computing to include everything that we already do. I can't think of anything that isn't cloud computing with all of these announcements. The computer industry is the only industry that is more fashion-driven than women's fashion. Maybe I'm an idiot, but I have no idea what anyone is talking about. What is it? It's complete gibberish. It's insane. When is this idiocy going to stop? "We'll make cloud computing announcements. I'm not going to fight this thing. But I don't understand what we would do differently in the light of cloud."

SLIDE 36

Steve Ballmer’s Vision of Cloud Computing

SLIDE 37

What does Hadoop have to do with Cloud? A few months back, Hamid Pirahesh and I were doing a roundtable with a customer of ours, on cloud and data. We got into a set of standard issues -- data security being the primary -- but when the dialog turned to Hadoop, a person raised his hand and asked: "What has Hadoop got to do with cloud?" I responded, somewhat quickly perhaps, "Nothing specific, and I am willing to have a dialog with you on Hadoop in and out of the cloud context", but it got me thinking. Is there a relationship, or not?

SLIDE 38

Re-inventing the wheel ... or not?
SLIDE 39

STRATOSPHERE

Parallel Analytics in the Cloud beyond Map/Reduce

SLIDE 40

The Stratosphere Project*

■ Explore the power of cloud computing for complex information management applications
■ Database-inspired approach
■ Analyze, aggregate, and query textual and (semi-)structured data
■ Research and prototype a web-scale data analytics infrastructure
■ Infrastructure-as-a-Service use cases: scientific data, life sciences, linked data

[Diagram: the StratoSphere query processor ("above the clouds") running on top of an Infrastructure-as-a-Service layer]

* FOR 1306: DFG-funded collaborative project among TU Berlin, HU Berlin and HPI Potsdam

SLIDE 41

Example: Climate Data Analysis

Sample of the parameter catalog (up to 200 parameters):
PS,1,1,0,Pa,surface pressure
T_2M,11,105,0,K,air_temperature
TMAX_2M,15,105,2,K,2m maximum temperature
TMIN_2M,16,105,2,K,2m minimum temperature
U,33,110,0,ms-1,U-component of wind
V,34,110,0,ms-1,V-component of wind
QV_2M,51,105,0,kgkg-1,2m specific humidity
CLCT,71,1,0,1,total cloud cover
…

Model domains: 950 km and 1100 km at 2 km resolution; ~10 TB of data

Analysis tasks on climate data sets
  • Validate climate models
  • Locate "hot spots" in climate models (monsoon, drought, flooding)
  • Compare climate models based on different parameter settings

Necessary data processing operations
  • Filter
  • Aggregation (sliding window)
  • Join
  • Multi-dimensional sliding-window operations
  • Geospatial/temporal joins
  • Handling of uncertainty

SLIDE 42

■ Text Mining in the biosciences ■ Cleansing of linked open data

Further Use-Cases

SLIDE 43

■ Architecture of the Stratosphere System ■ The PACT Programming Model ■ The Nephele Execution Engine ■ Parallelizing PACT Programs

Outline

SLIDE 44

Architecture Overview

Execution Engine Parallel Programming Model Higher-Level Language Nephele PACT Programming Model JAQL, Pig, Hive Hadoop Dryad Map/Reduce Programming Model Scope, DryadLINQ JAQL? Pig? Hive? Hadoop Stack Dryad Stack Stratosphere Stack

SLIDE 45

Data-Centric Parallel Programming

Map/Reduce
■ Schema free
■ Many semantics hidden inside the user code (tricks required to push operations into map/reduce)
■ Single default way of parallelization

Relational Databases
■ Schema bound (relational model)
■ Well-defined properties and requirements for parallelization
■ Flexible and optimizable

GOAL: Advance the map/reduce programming model

SLIDE 46

Stratosphere in a Nutshell

■ PACT Programming Model

□ Parallelization Contract (PACT)
□ declarative definition of data parallelism
□ centered around second-order functions
□ generalization of map/reduce

■ Nephele

□ Dryad-style execution engine
□ evaluates dataflow graphs in parallel
□ data is read from a distributed filesystem
□ flexible engine for complex jobs

■ Stratosphere = Nephele + PACT

□ compiles PACT programs to Nephele dataflow graphs
□ combines parallelization abstraction and flexible execution
□ the choice of execution strategies gives optimization potential

SLIDE 47

■ Parallelization Contracts (PACTs) ■ The Nephele Execution Engine ■ Compiling/Optimizing Programs ■ Related Work

Overview

SLIDE 48

Intuition for Parallelization Contracts

■ Map and reduce are second-order functions

□ they call first-order functions (user code)
□ they provide the first-order functions with subsets of the input data

■ They define dependencies between the records that must be obeyed when splitting them into subsets

□ i.e., the required partition properties

■ Map

□ all records are independently processable

■ Reduce

□ records with identical key must be processed together

[Diagram: an input set of key/value records split into independent subsets]

SLIDE 49

Contracts beyond Map and Reduce

■ Cross

□ two inputs
□ each combination of records from the two inputs is built and is independently processable

■ Match

□ two inputs; each combination of records with equal key from the two inputs is built
□ each pair is independently processable

■ CoGroup

□ multiple inputs
□ pairs with identical key are grouped for each input
□ groups of all inputs with identical key are processed together

SLIDE 50

Parallelization Contracts (PACTs)

■ A second-order function that defines properties on the input and output data of its associated first-order function

■ Input Contract

□ specifies dependencies between records (a.k.a. "what must be processed together?")
□ generalization of map/reduce
□ logically abstracts a (set of) communication pattern(s): for "reduce" repartition-by-key, for "match" broadcast-one or repartition-by-key

■ Output Contract

□ generic properties preserved or produced by the user code (key property, sort order, partitioning, etc.)
□ relevant to the parallelization of succeeding functions

[Diagram: data flows through the input contract into the first-order function (user code) and out through the output contract]

SLIDE 51

■ For certain PACTs, several distribution patterns exist that fulfill the contract

□ Choice of best one is up to the system

■ Created properties (like a partitioning) may be reused for later operators

□ Need a way to find out whether they still hold after the user code □ Output contracts are a simple way to specify that □ Example output contracts: Same-Key, Super-Key, Unique-Key

■ Using these properties, optimization across multiple PACTs is possible

□ Simple System-R style optimizer approach possible

Optimizing PACT Programs

SLIDE 52

From PACT Programs to Data Flows

function match(Key k, Tuple val1, Tuple val2) -> (Key, Tuple) {
  Tuple res = val1.concat(val2);
  res.project(...);
  Key k = res.getColumn(1);
  return (k, res);
}

invoke():
  while (!input2.eof)
    KVPair p = input2.next();
    hash-table.put(p.key, p.value);
  while (!input1.eof)
    KVPair p = input1.next();
    KVPair t = hash-table.get(p.key);
    if (t != null)
      KVPair[] result = UF.match(p.key, p.value, t.value);
      output.write(result);
end

[Diagram: a PACT program with user functions UF1 (map), UF2 (map), UF3 (match), UF4 (reduce) is compiled into a Nephele DAG (vertices V1–V4 with in-memory and network channels) and spanned into the parallel data flow; the user function combines PACT code (grouping) with Nephele code (communication)]

SLIDE 53

NEPHELE EXECUTION ENGINE

SLIDE 54

■ Executes Nephele schedules

□ compiled from PACT programs

■ Design goals

□ exploit the scalability/flexibility of clouds
□ provide predictable performance
□ efficient execution on 1000+ cores
□ flexible fault-tolerance mechanisms

■ Inherently designed to run on top of an IaaS cloud

□ heterogeneity through different types of VMs
□ knows the cloud's pricing model (VM allocation and de-allocation)
□ network topology inference

Nephele Execution Engine

[Diagram: the stack, with the PACT compiler on top of Nephele on top of Infrastructure-as-a-Service]

SLIDE 55

Nephele Architecture

■ Standard master-worker pattern
■ Workers can be allocated on demand

[Diagram: a client submits jobs over the public network (Internet) to the master in the compute cloud; the master uses the cloud controller to allocate workers as the workload changes over time, with persistent storage attached via a private/virtualized network]

SLIDE 56

Structure of a Nephele Schedule

■ A Nephele schedule is represented as a DAG

□ vertices represent tasks
□ edges denote communication channels

■ Mandatory information for each vertex

□ task program
□ input/output data location (I/O vertices only)

■ Optional information for each vertex

□ number of subtasks (degree of parallelism)
□ number of subtasks per virtual machine
□ type of virtual machine (#CPU cores, RAM, …)
□ channel types
□ sharing of virtual machines among tasks

[Example DAG: Input 1 (LineReaderTask.program, input s3://user:key@storage/input) → Task 1 (MyTask.program) → Output 1 (LineWriterTask.program, output s3://user:key@storage/outp…)]

SLIDE 57

■ The Nephele schedule is converted into an internal representation

■ Explicit parallelization

□ the parallelization range (mpl) is derived from the PACT
□ the wiring of subtasks is derived from the PACT

■ Explicit assignment to virtual machines

□ specified by ID and type
□ the type refers to a hardware profile

Internal Schedule Representation

[Diagram: Input 1 (1 subtask), Task 1 (2 subtasks), Output 1 (1 subtask), assigned to VMs with ID 1 (type m1.small) and ID 2 (type m1.large)]

SLIDE 58

Execution Stages

■ Issues with on-demand allocation:

□ When to allocate virtual machines?
□ When to deallocate virtual machines?
□ No guarantee of resource availability!

■ Stages ensure three properties:

□ VMs of the upcoming stage are available
□ all workers are set up and ready
□ data of previous stages is stored in a persistent manner

[Diagram: the example DAG split into two execution stages (Stage 0 and Stage 1)]

SLIDE 59

Channel Types

■ Network channels (pipeline)

□ vertices must be in the same stage

■ In-memory channels (pipeline)

□ vertices must run on the same VM
□ vertices must be in the same stage

■ File channels

□ vertices must run on the same VM
□ vertices must be in different stages

[Diagram: the example DAG annotated with channel types across Stage 0 and Stage 1]

SLIDE 60

Some Evaluation (1/2)

■ Demonstrates the benefits of dynamic resource allocation
■ Challenge: sort and aggregate

□ sort 100 GB of integer numbers (from the GraySort benchmark)
□ aggregate the top 20% of these numbers (exact result!)

■ First execution as map/reduce jobs with Hadoop

□ three map/reduce jobs on 6 VMs (each with 8 CPU cores, 24 GB RAM)
□ TeraSort code used for sorting
□ custom code for aggregation

■ Second execution as map/reduce jobs with Nephele

□ a map/reduce compatibility layer allows running Hadoop M/R programs
□ Nephele controls the resource allocation
□ idea: adapt the allocated resources to the required processing power

SLIDE 61

Some Evaluation (2/2)

■ M/R jobs on Hadoop: poor resource utilization!
■ M/R jobs on Nephele: automatic VM deallocation

[Charts: average instance utilization in % (USR/SYS/WAIT) and average network traffic among instances in MBit/s over time, phases (a)–(h), for both executions]

SLIDE 62

■ [WK09] D. Warneke, O. Kao: Nephele: Efficient Parallel Data Processing in the Cloud. SC-MTAGS 2009
■ [BEH+10] D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, D. Warneke: Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing. SoCC 2010: 119–130
■ [ABE+10] A. Alexandrov, D. Battré, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl, E. Nijkamp, D. Warneke: Massively Parallel Data Analysis with PACTs on Nephele. PVLDB 3(2): 1625–1628 (2010)
■ [AEH+11] A. Alexandrov, S. Ewen, M. Heimel, F. Hueske, et al.: MapReduce and PACT – Comparing Data Parallel Programming Models. BTW 2011

References

SLIDE 63

■ Adaptive Fault-Tolerance (Odej Kao) ■ Robust Query Optimization (Volker Markl) ■ Parallelization of the PACT Programming Model (Volker Markl) ■ Continuous Re-Optimization (Johann-Christoph Freytag) ■ Validating Climate Simulations with Stratosphere (Volker Markl) ■ Text Analysis with Stratosphere (Ulf Leser) ■ Data Cleansing with Stratosphere (Felix Naumann) ■ JAQL on Stratosphere: Student Project at TUB ■ Open Source Release: Nephele + PACT (TUB, HPI, HU)

Ongoing Work

SLIDE 64

Overview

■ Introduction ■ Big Data Analytics ■ Map/Reduce/Merge ■ Introducing … the Cloud ■ Stratosphere (PACT and Nephele) ■ Demo (Thomas Bodner, Matthias Ringwald) ■ Mahout and Scalable Data Mining (Sebastian Schelter)

SLIDE 65

The Information Revolution

http://mediatedcultures.net/ksudigg/?p=120

SLIDE 66

WEBLOG ANALYSIS QUERY

Demo Screenshots

SLIDE 67

Weblog Query and Plan

SELECT r.url, r.rank, r.avg_duration FROM Documents d JOIN Rankings r ON r.url = d.url WHERE CONTAINS(d.text, [keywords]) AND r.rank > [rank] AND NOT EXISTS (SELECT * FROM Visits v WHERE v.url = d.url AND v.date < [date]);

SLIDE 68

Weblog Query – Job Preview

SLIDE 69

Weblog Query – Optimized Plan

SLIDE 70

Weblog Query – Nephele Schedule in Execution

SLIDE 71

ENUMERATING TRIANGLES FOR SOCIAL NETWORK MINING

Demo Screenshots

SLIDE 72

Enumerating Triangles – Graph and Job

SLIDE 73

Enumerating Triangles – Job Preview

SLIDE 74

Enumerating Triangles – Optimized Plan

SLIDE 75

Enumerating Triangles – Nephele Schedule in Execution

SLIDE 76

APACHE MAHOUT

Scalable Data Mining (Sebastian Schelter)

SLIDE 77

Apache Mahout: Overview

■ What is Apache Mahout?

□ An Apache Software Foundation project aiming to create scalable machine learning libraries under the Apache License □ focus on scalability, not a competitor for R or Weka □ in use at Adobe, Amazon, AOL, Foursquare, Mendeley, Twitter, Yahoo

■ Scalability

□ processing time t is proportional to problem size P divided by resource size R: t ∝ P/R
□ does not imply Hadoop or parallelism, although the majority of implementations use Map/Reduce

SLIDE 78

Apache Mahout: Clustering

■ Clustering

□ Unsupervised learning: assign a set of data points into subsets (called clusters) so that points in the same cluster are similar in some sense

■ Algorithms

□ K-Means □ Fuzzy K-Means □ Canopy □ Mean Shift □ Dirichlet Process □ Spectral Clustering

SLIDE 79

Apache Mahout: Classification

■ Classification

□ supervised learning: learn a decision function that predicts labels y on data points x given a set of training samples {(x,y)}

■ Algorithms

□ Logistic Regression (sequential but fast) □ Naive Bayes / Complementary Naïve Bayes □ Random Forests

SLIDE 80

Apache Mahout: Collaborative Filtering

■ Collaborative Filtering

□ approach to recommendation mining: given a user's preferences for items, guess which other items would be highly preferred

■ Algorithms

□ neighborhood methods: item-based collaborative filtering □ latent factor models: matrix factorization using "Alternating Least Squares"

SLIDE 81

Apache Mahout: Singular Value Decomposition

■ Singular Value Decomposition

□ matrix decomposition technique used to create an optimal low-rank approximation of a matrix □ used for dimensionality reduction, unsupervised feature selection, "Latent Semantic Indexing"

■ Algorithms

□ Lanczos Algorithm □ Stochastic SVD

SLIDE 82

SCALABLE DATA MINING

Comparing implementations of data mining algorithms in Hadoop/Mahout and Nephele/PACT

SLIDE 83

Pairwise row similarity computation ■ Computes the pairwise similarities of the rows (or columns) of a sparse matrix using a predefined similarity function

□ used for computing document similarities in large corpora □ used to precompute item-item similarities for recommendations (collaborative filtering) □ the similarity function can be cosine, Pearson correlation, loglikelihood ratio, Jaccard coefficient, …

Problem description

SLIDE 84

Map/Reduce

■ Map/Reduce – Step 1

□ compute similarity-specific row weights
□ transpose the matrix, thereby creating an inverted index

■ Map/Reduce – Step 2

□ map out all pairs of co-occurring values
□ collect all co-occurring values per row pair and compute the similarity value

■ Map/Reduce – Step 3

□ use secondary sort to keep only the k most similar rows

[Diagram: the equivalent PACT data flow]
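A hedged plain-Java sketch of the map side of step 2: from one row of the inverted index (all non-zero cells of one column of the original matrix), emit every pair of co-occurring values keyed by the row pair (the names and the String pair key are illustrative):

import java.util.*;

public class CooccurrenceSketch {

    // cells: rowId -> value for a single column of the original matrix
    static Map<String, double[]> cooccurringPairs(Map<Integer, Double> cells) {
        List<Integer> rows = new ArrayList<>(cells.keySet());
        Collections.sort(rows);                      // canonical order for the pair key
        Map<String, double[]> pairs = new HashMap<>();
        for (int i = 0; i < rows.size(); i++) {
            for (int j = i + 1; j < rows.size(); j++) {
                pairs.put(rows.get(i) + "," + rows.get(j),   // key: the row pair
                          new double[] { cells.get(rows.get(i)), cells.get(rows.get(j)) });
            }
        }
        return pairs;                                // the reduce side aggregates these per row pair
    }
}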

SLIDE 85

■ Equivalent implementations in Mahout and PACT

□ the problem maps relatively well to the Map/Reduce paradigm
□ insight: standard Map/Reduce code can be ported to Nephele/PACT with very little effort
□ output contracts and memory forwards offer hooks for performance improvements (unfortunately not applicable in this particular use case)

Comparison

SLIDE 86

K-Means

■ Simple iterative clustering algorithm

□ uses a predefined number of clusters (k) □ start with a random selection of cluster centers □ assign points to nearest cluster □ recompute cluster centers, iterate until convergence

Problem description

SLIDE 87

■ Initialization

□ generate k random cluster centers from data points (optional) □ put the centers into the distributed cache

■ Map

□ find nearest cluster for each data point □ emit (cluster id, data point)

■ Combine

□ partially aggregate distances per cluster

■ Reduce

□ compute new centroid for each cluster

■ output converged cluster centers or centers after n iterations ■ optionally output clustered data points

Mahout implementation (map/combine/reduce repeated until convergence)
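A plain-Java sketch of one iteration, mirroring the M/R decomposition above (the assignment loop plays the role of the map step, the per-cluster partial sums that of the combiner, and the centroid recomputation that of the reduce step):

public class KMeansIterationSketch {

    // One k-means iteration: returns the recomputed cluster centers.
    static double[][] iterate(double[][] points, double[][] centers) {
        int k = centers.length, d = centers[0].length;
        double[][] sums = new double[k][d];
        int[] counts = new int[k];
        for (double[] p : points) {                  // "map": find the nearest center
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double dist = 0;
                for (int i = 0; i < d; i++) {
                    double diff = p[i] - centers[c][i];
                    dist += diff * diff;
                }
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            counts[best]++;                          // "combine": partial aggregates per cluster
            for (int i = 0; i < d; i++) sums[best][i] += p[i];
        }
        double[][] newCenters = new double[k][d];    // "reduce": new centroid per cluster
        for (int c = 0; c < k; c++) {
            for (int i = 0; i < d; i++) {
                newCenters[c][i] = counts[c] > 0 ? sums[c][i] / counts[c] : centers[c][i];
            }
        }
        return newCenters;
    }
}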

SLIDE 88

Stratosphere Implementation

Source: www.stratosphere.eu

SLIDE 89

Comparison of the implementations

■ actual execution plan in the underlying distributed systems is nearly equivalent ■ Stratosphere implementation is more intuitive and closer to the mathematical formulation of the algorithm

Code analysis

SLIDE 90

Naïve Bayes

■ Simple classification algorithm based on Bayes' theorem
■ General Naïve Bayes

□ assumes feature independence
□ often yields good results even if this assumption does not hold

■ Mahout's version of Naïve Bayes

□ specialized approach for document classification
□ based on the tf-idf weight metric

Problem description

SLIDE 91

■ Classification

□ straightforward approach that simply reads the complete model into memory
□ classification is done in the mapper; the reducer only sums up statistics for the confusion matrix

■ Trainer

□ much higher complexity
□ needs to count documents, features, features per document, features per corpus
□ Mahout's implementation is optimized by exploiting Hadoop-specific features like secondary sort and reading results into memory from the cluster filesystem

M/R Overview

SLIDE 92

M/R Trainer Overview

[Diagram: training pipeline Feature Extractor → Tf-Idf Calculation → Weight Summer → Theta Normalizer over the train data, with intermediate results wordFreq, termDocC, featureC, docC, tfIdf, vocabC, σk, σj, σkσj, thetaNorm]

SLIDE 93

■ PACT implementation

□ looks even more complex, but PACTs can be combined in a much more fine-grained manner □ since PACT offers local memory forwards, more and higher-level functions such as Cross and Match can be used □ fewer framework-specific tweaks are necessary for a performant implementation □ the visualized execution plan is much more similar to the algorithmic formulation of computing several counts and combining them into a model at the end □ subcalculations can be seen and unit-tested in isolation

Pact Trainer Overview

SLIDE 94

PACT Trainer Overview

SLIDE 95

Hot Path

[Diagram: data volumes along the hot path: 7.4 GB, 14.8 GB, 5.89 GB, 5.89 GB, 3.53 GB, 84 kB, 8 kB, 5 kB]

SLIDE 96

■ Future work: the PACT implementation can still be tuned by

□ sampling the input data
□ more flexible memory management in Stratosphere
□ employing the context concept of PACTs for simpler distribution of computed parameters

PACT Trainer Overview

SLIDE 97

Thank You

Merci Grazie

Gracias

Obrigado Danke

Japanese English French Russian German Italian

Spanish

Brazilian Portuguese

Arabic

Traditional Chinese Simplified Chinese Hindi Tamil

Thai

Korean

SLIDE 98

PARALLEL DATA FLOW LANGUAGES

Programming in a more abstract way

SLIDE 99

■ MapReduce paradigm is too low-level

□ Only two declarative primitives (map + reduce) □ Extremely rigid (one input, two-stage data flow) □ Custom code needed, e.g., for projection and filtering □ → Code is difficult to reuse and maintain □ → Impedes optimization

■ Combination of high-level declarative querying and low-level programming with MapReduce ■ Dataflow Programming Languages

□ Hive □ JAQL □ Pig

Introduction

SLIDE 100

■ Data warehouse infrastructure built on top of Hadoop, providing:

□ Data Summarization □ Ad hoc querying

■ Simple query language: Hive QL (based on SQL) ■ Extensible via custom mappers and reducers ■ Subproject of Hadoop ■ No special "Hive format" ■ http://hadoop.apache.org/hive/

Hive

SLIDE 101

Hive - Example

LOAD DATA INPATH '/data/visits' INTO TABLE visits;

INSERT OVERWRITE TABLE visitCounts
SELECT url, category, count(*) FROM visits GROUP BY url, category;

LOAD DATA INPATH '/data/urlInfo' INTO TABLE urlInfo;

INSERT OVERWRITE TABLE visitCounts
SELECT vc.*, ui.* FROM visitCounts vc JOIN urlInfo ui ON (vc.url = ui.url);

INSERT OVERWRITE TABLE gCategories
SELECT category, count(*) FROM visitCounts GROUP BY category;

INSERT OVERWRITE TABLE topUrls
SELECT TRANSFORM (visitCounts) USING 'top10';

SLIDE 102

■ Higher-level query language for JSON documents ■ Developed at IBM's Almaden research center ■ Supports several operations known from SQL

□ Grouping, Joining, Sorting

■ Built-in support for

□ Loops, Conditionals, Recursion

■ Custom Java methods extend JAQL ■ JAQL scripts are compiled to MapReduce jobs ■ Various I/O

□ Local FS, HDFS, Hbase, Custom I/O adapters

■ http://www.jaql.org/

JAQL

SLIDE 103

JAQL - Example

registerFunction("top10", "de.tuberlin.cs.dima.jaqlextensions.top10");

$visits = hdfsRead("/data/visits");
$visitCounts = $visits
  -> group by $url = $
     into { $url, num: count($) };

$urlInfo = hdfsRead("/data/urlInfo");
$visitCounts = join $visitCounts, $urlInfo
  where $visitCounts.url == $urlInfo.url;

$gCategories = $visitCounts
  -> group by $category = $
     into { $category, num: count($) };

$topUrls = top10($gCategories);
hdfsWrite("/data/topUrls", $topUrls);

SLIDE 104

■ A platform for analyzing large data sets ■ Pig consists of two parts:

□ PigLatin: A Data Processing Language □ Pig Infrastructure: An Evaluator for PigLatin programs □ Pig compiles Pig Latin into physical plans □ Plans are to be executed over Hadoop

■ Interface between the declarative style of SQL and the low-level, procedural style of MapReduce ■ http://hadoop.apache.org/pig/

Pig

SLIDE 105

Pig - Example

visits = load '/data/visits' as (user, url, time);
visitCounts = foreach visits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts, 10);
store topUrls into '/data/topUrls';

Example taken from: "Pig Latin: A Not-So-Foreign Language for Data Processing" talk, SIGMOD 2008

SLIDE 106

■ C. Olston et al.: Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008, pp. 1099–1110
■ Apache Pig: http://wiki.apache.org/pig/FrontPage
■ A. Thusoo, J. Sen Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, R. Murthy: Hive – A Warehousing Solution Over a Map-Reduce Framework
■ Apache Hive: http://wiki.apache.org/hadoop/Hive
■ K. S. Beyer, V. Ercegovac, R. Krishnamurthy, S. Raghavan, J. Rao, F. Reiss, E. J. Shekita, D. E. Simmen, S. Tata, S. Vaithyanathan, H. Zhu: Towards a Scalable Enterprise Content Analytics Platform. IEEE Data Eng. Bull. 32: 28–35 (2009)
■ JAQL: http://code.google.com/p/jaql/wiki/

Literature

SLIDE 107

QUERY COPROCESSING ON GRAPHICS PROCESSORS

SLIDE 108

Query Coprocessing on GPUs

■ Graphics Processors (GPUs) have recently emerged as powerful coprocessors for general purpose computation

□ ~10x the computational power of the CPU □ ~5x the memory bandwidth of the CPU

■ Parallel primitives available for query processing that

□ provide exploitation of GPU hardware features such as high thread parallelism and reduction of memory stalls through the fast local memory □ are scalable to hundreds of processors because of their lock-free design and low synchronization cost through the use of local memory

SLIDE 109

Query Coprocessing on GPUs

■ Map

□ given an array of data tuples and a function, a map applies the function to every tuple □ uses multiple thread groups to scan the relation with each thread group being responsible for a segment of the relation □ the access pattern of the threads in each thread group is designed to exploit the coalesced memory access feature on the GPU

■ Scatter and Gather

□ Scatter: perform indexed writes to a relation (e.g. hashing) defined by a location array □ Gather: perform indexed reads from a relation also defined by a location array □ can be implemented using the multipass optimization scheme to improve their temporal locality

SLIDE 110

Query Coprocessing on GPUs

■ Prefix scan

□ applies a binary operator to the input relation □ example: prefix sum, an important operation in parallel databases

■ Reduce

□ computes a value based on the input relation □ implemented as multipass algorithm by utilizing local memory optimization □ logarithmic number of passes constrained by local memory size per multiprocessor

SLIDE 111

HADOOP DB

An Architectural Hybrid of MapReduce and DBMS

SLIDE 112

Parallel Data Processing Architectures

■ Two major architectures:

1. Parallel databases: "standard" relational databases in a (usually) shared-nothing cluster

2. MapReduce: data analysis via parallel map and reduce jobs in a replicated cluster

■ Both approaches have their pros and cons.

SLIDE 113

Parallel RDBMSs

■ Pros:

□ usually very good and consistent performance
□ flexible and proven interface (SQL)

■ Cons:

□ scaling is rather limited (tens of nodes)
□ does not work well in heterogeneous clusters
□ not very fault-tolerant

SLIDE 114

MapReduce ■ Pros:

□ Very fault-tolerant and automatic load-balancing. □ Operates well in heterogeneous clusters.

■ Cons:

□ Writing map/reduce jobs is more complicated than writing SQL queries. □ Performance depends largely on the skill of the programmer.

SLIDE 115

HadoopDB ■ Both approaches have their strengths and weaknesses. ■ Idea of HadoopDB: Combine them!

□ Traditional relational databases as data storage and data processing nodes. □ MapReduce for Query Parallelization, Job Tracking, etc. □ Automatic “SQL to MapReduce to SQL” (SMS) query rewriter (based on Hive).

■ → Pushing as many operations as possible into the database layer improves data access performance. ■ → MapReduce improves fault tolerance and offers solid cluster management.

SLIDE 116

HadoopDB overview

[Diagram: the user submits an SQL query to the master node, where the SMS planner and the MapReduce job tracker (with a system catalog) turn it into a MapReduce job; each of the n worker nodes runs a task tracker in front of a local Postgres database, with table data replicated across the nodes]

SLIDE 117

HadoopDB Sample Query

SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY YEAR(saleDate);

[Diagram: the SMS planner rewrites this query into a MapReduce job over the local databases]

SLIDE 118

Experimental Findings (I)
■ Compared with: native Hadoop (Hive), Vertica, and a "commercial row-oriented DB".
■ Experiments performed on 10/50/100-node Amazon EC2 clusters.
■ Benchmark used: A. Pavlo et al.: "A Comparison of Approaches to Large-Scale Data Analysis", SIGMOD 2009

SLIDE 119

Experimental Findings (II)
■ In the absence of failures, HadoopDB is usually slower than parallel DBMSs.
■ HadoopDB is consistently faster than Hadoop, but takes ~10 times longer to load data.
■ HadoopDB's performance degrades significantly less than Vertica's in the case of node failures.
■ HadoopDB is not as susceptible to single slow nodes as Vertica.

SLIDE 120

■ A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, A. Silberschatz: HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. PVLDB 2(1): 922–933 (2009)

Literature

SLIDE 121

■ Parallel Speedup

□ Amdahl's Law

■ Levels of Parallelism

□ Instruction-Level, Data, Task

■ Modes of Query Parallelism

□ Inter-Query / Intra-Query □ Pipeline (Inter Operator) / Data (Intra Operator)

■ Parallel Database Operations

Basics of Parallel Processing

SLIDE 122

Parallel Speedup
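The slide itself is a chart; for reference, Amdahl's law (mentioned on the previous slide) bounds the speedup S achievable with p processors when a fraction f of the work is parallelizable:

S(p) = 1 / ((1 - f) + f/p), and S(p) → 1/(1 - f) as p → ∞

e.g. with f = 0.9, even arbitrarily many processors give at most a 10x speedup.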

SLIDE 123

Parallel Speedup

SLIDE 124

■ Instruction-level parallelism

□ single instructions are automatically processed in parallel
□ example: modern CPUs with multiple pipelines and instruction units

■ Data parallelism

□ different data can be processed independently
□ each processor executes the same operations on its share of the input data
□ example: distributing loop iterations over multiple processors, or CPU vector units

■ Task parallelism

□ tasks are distributed among the processors/nodes
□ each processor executes a different thread/process
□ example: threaded programs

Levels of Parallelism on Hardware

SLIDE 125

■ Inter-query parallelism (multiple concurrent queries)

□ necessary for efficient resource utilization: while one query waits (e.g. for I/O), another one executes
□ requires concurrency control (locking mechanisms) to guarantee transactional properties (the "I" in ACID)
□ important for highly transactional scenarios (OLTP)

■ Intra-query parallelism (parallel processing of a single query)

□ I/O parallelism: concurrent reading from multiple disks; hidden: hardware RAID; transparent: spanned tablespaces
□ intra-operator parallelism: multiple threads work on the same operator (example: parallel sort)
□ inter-operator parallelism: multiple pipelined parts of the plan run in parallel
□ important for complex analytical tasks (OLAP)

Modes of Query Parallelism

SLIDE 126

Pipeline Parallelism

[Diagram: plan with Scan(T1) and Scan(T2) feeding hash-join (HS-Join) builds, Scan(T3) probing, then Sort and Return]

Step 1: Two threads scan one base table each and build the hash tables for the joins.
Step 2: One thread scans the table and probes the hash tables; a second thread starts the sort (sorting sub-lists, merging the first lists).
Step 3: One thread returns the result, business as usual.

SLIDE 127

Pipeline Parallelism

■ Pipeline parallelism is also called inter-operator parallelism

□ inter-operator, because the parallelism is between the operators

■ Execute multiple pipelines simultaneously

□ limited in its applicability: only possible if multiple pipelines are present and they are not totally dependent on each other

■ Problems:

□ high synchronization overhead
□ mostly limited to a low degree of parallelism (not too many pipelines per query)
□ only suited for shared-memory architectures
SLIDE 128

Data Parallelism

■ Where pipeline parallelism is not applicable to a large degree → data parallelism
■ The data is divided into several subsets

□ most operations don't need a complete view of the data; e.g. "filter" looks only at a single tuple at a time
□ the subsets can be processed independently and hence in parallel

■ The degree of parallelism is as high as the number of possible subsets

□ for "filter": as high as the number of tuples

■ Some operations need a view of larger portions of the data

□ e.g. a grouping/aggregation operation needs all tuples with the same grouping key
□ are they all in the same set? can we guarantee that?
□ different operators need different sets!

SLIDE 129

■ Levels of Resource Sharing

□ Shared-Memory, Shared-Disk, Shared-Nothing

■ Data Partitioning

□ Round-robin, Hash, Range

■ Parallel Operators and Costs

□ Tuple-at-a-time (i.e. Selection) □ Sorting □ Projection, Grouping, Aggregation □ Join

Basics of Parallel Query Processing

SLIDE 130

Parallel Architectures (I)

■ Shared Memory

  • Several CPUs share a single memory and disk (array)
  • Communication over a single common bus

Source: Garcia-Molina et al., "Database Systems – The Complete Book", Second Edition

SLIDE 131

■ Shared Disk

  • Several nodes with multiple CPUs; each node has its private memory
  • Single attached disk (array): often NAS, SAN, etc.

Parallel Architectures (II)

Source: Garcia-Molina et al., "Database Systems – The Complete Book", Second Edition

SLIDE 132

■ Shared Nothing

  • Each node has its own set of CPUs, memory and attached disks
  • Data needs to be partitioned over the nodes
  • Data is exchanged through direct node-to-node communication

Parallel Architectures (III)

Source: Garcia-Molina et al., "Database Systems – The Complete Book", Second Edition

SLIDE 133

Data Partitioning (I)

■ Partitioning the data means creating a set of disjoint subsets

  • Example: sales data, where every year gets its own partition

■ For shared-nothing, data must be partitioned across nodes

  • If it were replicated, it would effectively become a shared-disk system with the local disks acting like a cache (which must be kept coherent)

■ Partitioning with certain characteristics has further advantages

  • Some queries can be limited to operate on certain sets only, if it is provable that all relevant data (passing the predicates) is in that partition
  • Partitions can simply be dropped as a whole (the data is rolled out) when they are no longer needed (e.g. discard old sales)

SLIDE 134

Data Partitioning (II)

■ How to partition the data into disjoint sets?

  • Round robin: each set gets a tuple in turn; all sets have a guaranteed equal number of tuples, but there is no apparent relationship between the tuples in one set.
  • Hash partitioned: define a set of partitioning columns and generate a hash value over those columns to decide the target set. All tuples with equal values in the partitioning columns end up in the same set.
  • Range partitioned: define a set of partitioning columns and split the domain of those columns into ranges. The range determines the target set. All tuples in one set are in the same range.

SLIDE 135

■ The client sends a SQL query to one of the cluster nodes

□ this node becomes the "coordinator"

■ The coordinator compiles the query

□ parsing, checking, optimization
□ parallelization

■ It sends partial plans to the other cluster nodes that describe their tasks

□ the coordinator also executes a partial plan on its own part of the data

■ It collects the partial results and finalizes them (see next slide)

Data Parallelism Example

[Diagram: the client sends the query to the coordinator, which distributes partial plans to the cluster nodes and combines the partial results into the final result]

SLIDE 136

Data Parallelism Example

■ For shared-nothing & shared-disk

  • Multiple instances of a sub-plan are executed on different computers
  • The instances operate on different splits or partitions of the data
  • At some points, results from the sub-plans are collected
  • For more complex queries, results are not collected but re-distributed for further parallel processing

[Diagram: parallel sub-plan instances (Scan, IX-Scan, Fetch, NL-Join over partitions of T1, T2 and index IX-T2.1, with pre-aggregation via Group and Agg) ship their results through a Queue to a final Sort, Group/Agg and Return]

SLIDE 137

Ideally: operate as much as possible on individual partitions of the data

→ bring the operation to the data
→ no communication needed, ideal parallelism

Easy for simple "per-tuple" operators

→ Scan, IX-Scan, Fetch, Filter

Problematic: some operators need the whole picture

→ e.g. sorts and aggregations can only be preprocessed in parallel and need a final step on a single node, unless they occur in a correlated subplan known to contain only tuples from one partition
→ e.g. joins need matching tuples: either organize the inputs accordingly, or join on the coordinator after the collection of partial results (not parallel any more!)

Parallel Operators

SLIDE 138

■ S: relation S
■ S[i, h]: partition i of relation S according to partitioning scheme h
■ B(S): number of blocks of relation S
■ p: number of nodes
■ We assume a shared-nothing architecture

□ most commercial database vendors use shared-nothing approaches

■ Network transfer is at least as expensive as disk access

□ in some cost models it is still far more expensive
□ today, network bandwidth ≈ disk bandwidth
□ but the network is shared; switches and routers in particular have a throughput limit

■ Partitioning schemes (hash/range) produce partitions of roughly equal size

Notations and Assumptions

SLIDE 139

Selection can be parallelized very efficiently (an embarrassingly parallel problem):

→ each node performs the selection on its existing local partition
→ selection needs no context; the data can be partitioned in an arbitrary way
→ the partial results are unioned afterwards

Cost: B(S)/p

Parallel Selection

SLIDE 140

Parallel Projection, Grouping, Aggregation

SLIDE 141

■ Range-partitioning sort (partition by range, then sort)

→ range-partition the relation according to the sort columns
→ sort the single partitions locally (e.g. by TPMMS)
→ cost: B(S) partitioning + B(S) transfer + B(S)/p local sorting
→ problem: how to find a uniform range-partitioning scheme?
→ the result is already partitioned in the cluster

■ Parallel external sort-merge (sort locally, then merge)

→ reuse an existing data partitioning
→ the partitions are sorted locally (e.g. by TPMMS)
→ the sorted partitions need to be merged, e.g. one node merges two partitions at a time until the whole relation is sorted
→ cost: B(S)/p local sorting + log2(p)*B(S)/2 transfer + log2(p)*B(S) local merging
→ the result is sitting on one machine

Parallel Sorting
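An illustrative calculation with assumed numbers: for B(S) = 1,000 blocks and p = 10 nodes, the range-partitioning sort costs about 1,000 (partitioning) + 1,000 (transfer) + 100 (local sorting per node) block operations, while the parallel external sort-merge costs about 100 (local sorting) + log2(10) * 500 ≈ 1,660 (transfer) + log2(10) * 1,000 ≈ 3,320 (local merging). The merge variant avoids the repartitioning step but pays for it during merging, and its result ends up on a single machine.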

SLIDE 142

■ A special class of Joins that are suited for parallelization are Natural- and Equi-Joins.

□ For Equi-Joins we only look at tuple pairs that share the same join key.

■ Idea: Partition relations R and S using the same partitioning scheme over the join key.

□ All values of R and S with the same join key end up at the same node! □ All joins can be performed locally!

■ Actual implementation depends on how the relations are partitioned:

□ Co-Located Join □ Directed Join □ Re-Partitioning Join

Parallel Equi-Joins (I)

SLIDE 143

1. Both R and S are already partitioned over the join key (and with the same partitioning scheme):

□ "co-located join"
□ no re-partitioning is needed!
□ cost: local join cost only

2. Only one relation is partitioned over the join key:

□ "directed join"
□ re-partition the other relation with the same partitioning scheme
□ cost (assuming R is already partitioned): B(S) partitioning + B(S) transfer + local join cost

3. No relation is partitioned over the join key:

□ "repartition join"
□ re-partition both relations over the join key
□ cost: B(S)+B(R) partitioning + B(S)+B(R) transfer + local join cost

Parallel Equi-Joins (II)

SLIDE 144

Symmetric Fragment-and-Replicate Join

[Diagram: relation R is fragmented into m parts and S into n parts; each of the m × n fragment pairs is joined on one node]

SLIDE 145

Symmetric Fragment-and-Replicate Join (II)

[Diagram: assignment of the fragment pairs to the nodes in the cluster]

SLIDE 146

■ We can do better if relation S is much smaller than R.
■ Idea: reuse the existing partitioning of R and replicate the whole relation S to each node.
■ Cost: p * B(S) transport + local join cost
■ The asymmetric fragment-and-replicate join is a special case of the symmetric algorithm with m = p and n = 1.
■ The asymmetric fragment-and-replicate join is also called the broadcast join.

Asymmetric Fragment-and-Replicate Join

SLIDE 147

■ Database clusters tend to scale up to 64 or 128 nodes

  • beyond that, the speedup curve flattens
  • communication overhead eats the speedup gained through the next node
  • hard limit example: 1,000 nodes for DB2

■ Shared disk: does not scale infinitely; bus and synchronization become the overhead

□ for updates: cache coherency problem
□ for reads: I/O bandwidth limits

■ Shared nothing: cannot easily compensate the loss of a node

  • in large clusters, failures and outages are most common
  • the loss of a node means loss of data, unless the data is replicated
  • but replicated data must be kept consistent, which has a high overhead

Limits in Parallel Databases

SLIDE 148

■ S. Fushimi, M. Kitsuregawa, H. Tanaka: An Overview of The System Software of A Parallel Relational Database Machine GRACE. VLDB 1986
■ D. A. Schneider, D. J. DeWitt: A Performance Evaluation of Four Parallel Join Algorithms in a Shared-Nothing Multiprocessor Environment. SIGMOD 1989
■ D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, M. Muralikrishna: GAMMA – A High Performance Dataflow Database Machine. VLDB 1986
■ J. W. Stamos, H. C. Young: A Symmetric Fragment and Replicate Algorithm for Distributed Joins. IEEE Trans. Parallel Distrib. Syst., 1993

Literature

SLIDE 149

■ OLTP-style applications that are beyond relational databases' capabilities exist as well
■ Some applications still require fast and efficient lookup and retrieval of small amounts of data

□ web index access, mail accounts, warehouse updates for resellers
→ addressed by key/value-pair-based storage systems (e.g. Google BigTable and Megastore)
→ data can only be accessed through a key
→ only an additional filter on columns and timestamps can be applied

■ Some applications do still need updates and certain guarantees about them

□ no hard transactions, especially no multi-record transactions!
→ eventual consistency model (Amazon Dynamo)

■ These techniques require a lecture of their own

Side Note: What about Updates/Transactions?
