Big Data Analytics beyond Map/Reduce


SLIDE 1

GERMAN-FRENCH SUMMER UNIVERSITY FOR YOUNG RESEARCHERS 2011

CLOUD COMPUTING: CHALLENGES AND OPPORTUNITIES

17.7. – 22.7. 2011

Big Data Analytics beyond Map/Reduce

  • Prof. Dr. Volker Markl

TU Berlin

SLIDE 2

Shift Happens! Our Digital World!

Video courtesy of Michael Brodie, Chief Scientist, Verizon. The original "Shift Happens" video is by K. Fisch and S. McLeod and focuses on shifts in society, aimed at teacher education; Michael Brodie's version focuses on the shift in, and caused by, the digital world.

SLIDE 3

Data Growth and Value

■ About data growth:

□ $600 to buy a disk drive that can store all of the world’s music □ 5 billion mobile phones in use in 2010 □ 30 billion pieces of content shared on Facebook every month □ 40% projected growth in global data per year

■ About the value of captured data:

□ €250 billion potential value to Europe's public sector administration
□ 60% potential increase in retailers' operating margins possible with big data
□ 140,000–190,000 more deep analytical talent positions needed

Source: “Big Data: The next frontier for innovation, competition and productivity” (McKinsey)

SLIDE 4

Big Data

■ Data have swept into every industry and business function

□ important factor of production □ exabytes of data stored by companies every year □ much of modern economic activity could not take place without it

■ Big Data creates value in several ways

□ provides transparency □ enables experimentation □ brings about customization and tailored products □ supports human decisions □ triggers new business models

■ Use of Big Data will become a key basis of competition and growth

□ companies failing to develop their analysis capabilities will fall behind

Source: “Big Data: The next frontier for innovation, competition and productivity” (McKinsey)

SLIDE 5

Big Data Analytics

■ Data volume keeps growing

  • Data Warehouse sizes of about 1PB are not uncommon!
  • Some businesses produce >1TB of new data per day!
  • Scientific scenarios are even larger (e.g. LHC experiment results in ~15PB / yr)

■ Some systems are required to support extreme throughput in transaction processing

  • Especially financial institutes

■ Analysis Queries become more and more complex

  • Discovering statistical patterns is compute intensive
  • May require multiple passes over the data

■ Performance of single computing cores or single machines is not increasing substantially enough to cope with this development

SLIDE 6

Trends

■ Massive parallelization ■ Virtualization ■ Service-based computing ■ Web-scale data management

□ Analytics / BI □ Operational □ Multi-tenancy

Claremont Report

■ Re-architecting DBMS

□ Parallelization □ Continuous optimization □ Tight integration

■ Service-based everything

□ Programming Model □ Combining structured and unstructured data □ Media Convergence

Trends and Challenges

SLIDE 7

Overview

■ Introduction ■ Big Data Analytics ■ Map/Reduce/Merge ■ Introducing … the Cloud ■ Stratosphere (PACT and Nephele) ■ Demo (Thomas Bodner, Matthias Ringwald) ■ Mahout and Scalable Data Mining (Sebastian Schelter)

SLIDE 8

BIG DATA ANALYTICS

Map/Reduce Revisited

SLIDE 9

Data Partitioning (I)

■ Partitioning the data means creating a set of disjoint subsets

  • Example: sales data, where every year gets its own partition

■ For shared-nothing, data must be partitioned across nodes

  • If it were replicated, it would effectively become a shared-disk system with the local disks acting like a cache (which must be kept coherent)

■ Partitioning with certain characteristics has further advantages

  • Some queries can be limited to operate on certain sets only, if it is provable that all relevant data (passing the predicates) is in that partition
  • Partitions can simply be dropped as a whole (the data is rolled out) when they are no longer needed (e.g. discard old sales)

SLIDE 10

Data Partitioning (II)

■ How to partition the data into disjoint sets?

  • Round robin: each set gets a tuple in turn; all sets have a guaranteed equal number of tuples, but there is no apparent relationship between the tuples in one set.
  • Hash partitioned: define a set of partitioning columns and generate a hash value over those columns to decide the target set. All tuples with equal values in the partitioning columns end up in the same set.
  • Range partitioned: define a set of partitioning columns and split the domain of those columns into ranges. The range determines the target set. All tuples in one set are in the same range.

SLIDE 11

■ The data model

□ key/value pairs □ e.g. (int, string)

■ Functional programming model with second-order functions

□ map: input is a key-value pair (Km, Vm); output is a list of intermediate key-value pairs (Kr, Vr)*
□ reduce: input is a key together with the list of all its values (Kr, Vr*); output is a key and a single value (Kr, Vr)

■ The framework

□ accepts a list of input key-value pairs □ outputs result pairs

Map/Reduce Revisited

SLIDE 12

Data Flow in Map/Reduce

[Diagram: each MAP task transforms its input pairs (Km, Vm) into lists of intermediate pairs (Kr, Vr)*; the framework groups the intermediate pairs by key into (Kr, Vr*); each REDUCE task turns one such group into a result pair (Kr, Vr)]

SLIDE 13

■ Problem: Counting words in a parallel fashion

□ How many times different words appear in a set of files □ juliet.txt: Romeo, Romeo, wherefore art thou Romeo? □ benvolio.txt: What, art thou hurt? □ Expected output: Romeo (3), art (2), thou (2), hurt (1), wherefore (1), what (1)

■ Solution: Map-Reduce Job

map(filename, line) {
  foreach (word in line)
    emit(word, 1);
}

reduce(word, numbers) {
  int sum = 0;
  foreach (value in numbers) {
    sum += value;
  }
  emit(word, sum);
}
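As a cross-check, here is how the same job looks as a hedged sketch against the (2011-era) Hadoop Java API; the job wiring and paths are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map: (offset, line) -> (word, 1) for every word in the line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce: (word, [1, 1, ...]) -> (word, sum)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation of counts
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}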

Map Reduce Illustrated (1)

SLIDE 14

Map Reduce Illustrated (2)

SLIDE 15

Data Analytics: Relational Algebra

■ Base Operators

□ selection (σ) □ projection (π) □ set/bag union (∪) □ set/bag difference (\ or −) □ Cartesian product (×)

■ Derived Operators

□ join (⋈) □ set/bag intersection (∩) □ division (÷)

■ Further Operators

□ de-duplication □ generalized projection (grouping and aggregation) □ outer-joins and semi-joins □ sort

SLIDE 16

■ Selection / projection / aggregation

□ SQL Query: SELECT year, SUM(price) FROM sales WHERE area_code = 'US' GROUP BY year
□ Map/Reduce job:

map(key, tuple) {
  int year = YEAR(tuple.date);
  if (tuple.area_code == 'US')
    emit(year, { 'year' => year, 'price' => tuple.price });
}

reduce(key, tuples) {
  double sum_price = 0;
  foreach (tuple in tuples) {
    sum_price += tuple.price;
  }
  emit(key, sum_price);
}

Relational Operators as Map/Reduce jobs

SLIDE 17

■ Sorting

□ SQL Query: SELECT * FROM sales ORDER BY year
□ Map/Reduce job (partitioning by decade; each reducer sorts its partition):

map(key, tuple) {
  emit(YEAR(tuple.date) DIV 10, tuple);
}

reduce(key, tuples) {
  emit(key, sort(tuples));
}

Relational Operators as Map/Reduce jobs

SLIDE 18

■ UNION

□ SQL Query: SELECT phone_number FROM employees UNION SELECT phone_number FROM bosses
□ The Map/Reduce job needs two different mappers:

map(key, employees_phonebook_entry) {
  emit(employees_phonebook_entry.number, '');
}

map(key, bosses_phonebook_entry) {
  emit(bosses_phonebook_entry.number, '');
}

reduce(phone_number, tuples) {
  emit(phone_number, '');
}

Relational Operators as Map/Reduce jobs

SLIDE 19

■ INTERSECT

□ SQL Query: SELECT first_name FROM employees INTERSECT SELECT first_name FROM bosses
□ The Map/Reduce job needs two different mappers:

map(key, employee_listing_entry) {
  emit(employee_listing_entry.first_name, 'E');
}

map(key, boss_listing_entry) {
  emit(boss_listing_entry.first_name, 'B');
}

reduce(first_name, markers) {
  if ('E' in markers and 'B' in markers) {
    emit(first_name, '');
  }
}

Relational Operators as Map/Reduce jobs

SLIDE 20

■ Benchmark to test the performance of distributed systems
■ Goal: sort one petabyte of 100-byte records
■ Implementation in Hadoop:

□ a range partitioner splits the data into equal ranges (one for each participating node)

■ The sort is basically the "range-partitioning sort" described earlier

The Petabyte Sort Benchmark

SLIDE 21

Petabyte sorting benchmark

■ Per node: 2 quad-core Xeons @ 2.5 GHz, 4 SATA disks, 8 GB RAM (upgraded to 16 GB before the petabyte sort), 1 gigabit Ethernet
■ Per rack: 40 nodes, 8 gigabit Ethernet uplinks

SLIDE 22

Cluster Utilization during Sort

SLIDE 23

JOINS IN MAP/REDUCE

Map/Reduce Revisited

SLIDE 24

Symmetric Fragment-and-Replicate Join (II)

[Diagram: assignment of the fragment pairs to the nodes in the cluster]

SLIDE 25

■ We can do better if relation S is much smaller than R.
■ Idea: reuse the existing partitioning of R and replicate the whole relation S to each node.
■ Cost: p * B(S) transport + local join cost
■ The asymmetric fragment-and-replicate join is a special case of the symmetric algorithm with m = p and n = 1.
■ The asymmetric fragment-and-replicate join is also called the broadcast join.

Asymmetric Fragment-and-Replicate Join
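An illustrative calculation with assumed numbers: for p = 10 nodes, B(R) = 10,000 blocks and B(S) = 100 blocks, the broadcast join transports p * B(S) = 1,000 blocks and leaves R in place, while a repartition join would move up to B(R) + B(S) = 10,100 blocks. With B(S) = 5,000 the broadcast join would transport 50,000 blocks and lose; as a rule of thumb it pays off roughly when p * B(S) < B(R) + B(S).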

SLIDE 26

■ Equi-join: L(A,X) ⋈ R(X,C)

□ assumption: |L| << |R|

■ Idea

□ broadcast L completely to each node before the map phase begins, either by utilities like Hadoop's distributed cache, or by letting the mappers read L from the cluster filesystem at startup

■ Mapper (runs only over R)

□ step 1: read the assigned input split of R into a hash table (build phase)
□ step 2: scan the local copy of L and find matching R tuples (probe)
□ step 3: emit each such pair
□ alternatively: read L into the hash table, then read R and probe

■ No need for partition / sort / reduce processing

□ the mapper outputs the final join result

Broadcast Join

[Diagram: three mappers, each reading one local split of R plus a complete copy of L]
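A self-contained Java sketch of the build-on-L variant (plain in-memory lists and String[] tuples stand in for the real input formats; this is an illustration, not the Hadoop mapper API):

import java.util.*;

public class BroadcastJoinSketch {

    // Equi-join L(A, X) with R(X, C) on X; a tuple is a String array.
    // broadcastL is the full small relation, localR the node's split of R.
    static List<String[]> broadcastJoin(List<String[]> broadcastL, Iterable<String[]> localR) {
        Map<String, List<String[]>> hashTable = new HashMap<>();
        for (String[] l : broadcastL) {                        // build phase over small L
            hashTable.computeIfAbsent(l[1], k -> new ArrayList<>()).add(l);
        }
        List<String[]> result = new ArrayList<>();
        for (String[] r : localR) {                            // probe phase over the local R split
            for (String[] l : hashTable.getOrDefault(r[0], Collections.emptyList())) {
                result.add(new String[] { l[0], l[1], r[1] }); // emit (A, X, C)
            }
        }
        return result;                                         // final join result, no reduce phase
    }
}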

SLIDE 27

■ Equi-join: L(A,X) ⋈ R(X,C)

□ assumption: |L| < |R|

■ Mapper (over both L(A,X) and R(X,C))

□ identical processing logic for L and R
□ emit each tuple once
□ the intermediate key is a pair of the value of the actual join key X and an annotation identifying which relation the tuple belongs to (L or R)

■ Partition and sort

□ partition by the hash value of the join key
□ the reduce input is ordered first on the join key, then on the relation name
□ output: a sequence of L(i), R(i) blocks of tuples for ascending join key i

■ Reduce

□ collect all L-tuples of the current L(i) block in a hash map
□ combine them with each R-tuple of the corresponding R(i) block

[Diagram: mappers read mixed L/R splits and repartition the tagged tuples by h(key) % n; each reducer builds over its L block and streams the matching R block]
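A minimal Java sketch of the reduce-side logic for one join key, assuming (as the sort order above guarantees) that the tagged L-tuples arrive before the R-tuples:

import java.util.*;

public class RepartitionJoinSketch {

    // taggedTuples: for a single join key, tuples as {tag, attr1, attr2},
    // where tag is "L" for L(A, X) tuples and "R" for R(X, C) tuples.
    static List<String[]> joinOneKey(List<String[]> taggedTuples) {
        List<String[]> lBlock = new ArrayList<>();
        List<String[]> result = new ArrayList<>();
        for (String[] t : taggedTuples) {
            if (t[0].equals("L")) {
                lBlock.add(t);                                 // collect the L(i) block first
            } else {
                for (String[] l : lBlock) {                    // combine each R-tuple with all L-tuples
                    result.add(new String[] { l[1], l[2], t[2] }); // emit (A, X, C)
                }
            }
        }
        return result;
    }
}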

SLIDE 28

■ Equi-join: D1(A,X) ⋈ D2(B,Y) ⋈ F(C,X,Y)

□ star schema with fact table F and dimensions Di

■ Fragment

□ D1 and D2 are partitioned independently
□ the partitions for F are defined as D1 × D2

■ Replicate

□ for an F-tuple f, the partition is uniquely defined as (hash(f.x), hash(f.y))
□ for a D1-tuple d1 there is one degree of freedom (d1.y is undefined); D1-tuples are thus replicated for each possible y value
□ symmetric for D2

■ Reduce

□ find and emit (f, d1, d2) triples
□ depending on the input sorting, different join strategies are possible

Multi-Dimensional Partitioned Join

[Diagram: the partition grid spanned by D1 and D2; F-tuples fall into single cells, dimension tuples are replicated along the rows/columns]

SLIDE 29

Joins in Hadoop

[Chart: join execution times over number of nodes and selectivity; series include the asymmetric (broadcast) join and the multi-dimensional partitioned join]

SLIDE 30

Parallel DBMS vs. Map/Reduce

■ Schema support: Parallel DBMS yes; Map/Reduce no
■ Indexing: Parallel DBMS yes; Map/Reduce no
■ Programming model: stating what you want (declarative: SQL) vs. presenting an algorithm (procedural: C/C++, Java, …)
■ Optimization: Parallel DBMS yes; Map/Reduce no
■ Scaling: 1–500 nodes vs. 10–5,000 nodes
■ Fault tolerance: limited vs. good
■ Execution: pipelines results between operators vs. materializes results between phases

SLIDE 31

MAP-REDUCE-MERGE

Simplified Relational Data Processing on Large Clusters

SLIDE 32

Map-Reduce-Merge

■ Motivation

□ Map/Reduce does not directly support processing multiple related heterogeneous datasets □ difficulties and/or inefficiency when one must implement relational operators like joins

■ Map-Reduce-Merge

□ adds a merge phase whose goal is to efficiently merge data that is already partitioned and sorted (or hashed)
□ Map-Reduce-Merge workflows are comparable to RDBMS execution plans
□ parallel join algorithms can be implemented more easily
□ signatures (reconstructed in the notation of the Map-Reduce-Merge paper):

map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → (k2, [v3])
merge: ((k2, [v3]), (k3, [v4])) → [(k4, v5)]
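As a hedged illustration of what the merge phase buys, a plain-Java sort-merge combine of two reduced outputs on a shared key (assuming unique keys per input; this is not the actual Map-Reduce-Merge API):

import java.util.*;

public class MergeSketch {

    // Both inputs are (key, value) pairs sorted by key, as produced by the
    // reduce phase; matching keys are combined into one output pair.
    static List<String[]> merge(List<String[]> left, List<String[]> right) {
        List<String[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) i++;                        // advance the side with the smaller key
            else if (cmp > 0) j++;
            else {                                   // equal keys: emit the combination
                out.add(new String[] { left.get(i)[0],
                                       left.get(i)[1] + "|" + right.get(j)[1] });
                i++; j++;
            }
        }
        return out;
    }
}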

SLIDE 33

THE CLOUD

Introducing …

SLIDE 34

In the Cloud …

SLIDE 35

"The interesting thing about cloud computing is that we've redefined cloud computing to include everything that we already do. I can't think of anything that isn't cloud computing with all of these announcements. The computer industry is the only industry that is more fashion-driven than women's fashion. Maybe I'm an idiot, but I have no idea what anyone is talking about. What is it? It's complete gibberish. It's insane. When is this idiocy going to stop? "We'll make cloud computing announcements. I'm not going to fight this thing. But I don't understand what we would do differently in the light of cloud."

SLIDE 36

Steve Ballmer’s Vision of Cloud Computing

SLIDE 37

What does Hadoop have to do with Cloud? A few months back, Hamid Pirahesh and I were doing a roundtable with a customer of ours, on cloud and data. We got into a set of standard issues -- data security being the primary -- but when the dialog turned to Hadoop, a person raised his hand and asked: "What has Hadoop got to do with cloud?" I responded, somewhat quickly perhaps, "Nothing specific, and I am willing to have a dialog with you on Hadoop in and out of the cloud context", but it got me thinking. Is there a relationship, or not?

SLIDE 38

Re-inventing the wheel ... or not?
SLIDE 39

STRATOSPHERE

Parallel Analytics in the Cloud beyond Map/Reduce

SLIDE 40

The Stratosphere Project*

■ Explore the power of cloud computing for complex information management applications
■ Database-inspired approach
■ Analyze, aggregate, and query textual and (semi-)structured data
■ Research and prototype a web-scale data analytics infrastructure
■ Infrastructure-as-a-Service use cases: scientific data, life sciences, linked data

[Diagram: the StratoSphere query processor ("above the clouds") running on top of an Infrastructure-as-a-Service layer]

* FOR 1306: DFG-funded collaborative project among TU Berlin, HU Berlin and HPI Potsdam

SLIDE 41

Example: Climate Data Analysis

Sample of the parameter catalog (up to 200 parameters):
PS,1,1,0,Pa,surface pressure
T_2M,11,105,0,K,air_temperature
TMAX_2M,15,105,2,K,2m maximum temperature
TMIN_2M,16,105,2,K,2m minimum temperature
U,33,110,0,ms-1,U-component of wind
V,34,110,0,ms-1,V-component of wind
QV_2M,51,105,0,kgkg-1,2m specific humidity
CLCT,71,1,0,1,total cloud cover
…

Model domains: 950 km and 1100 km at 2 km resolution; ~10 TB of data

Analysis tasks on climate data sets
  • Validate climate models
  • Locate "hot spots" in climate models (monsoon, drought, flooding)
  • Compare climate models based on different parameter settings

Necessary data processing operations
  • Filter
  • Aggregation (sliding window)
  • Join
  • Multi-dimensional sliding-window operations
  • Geospatial/temporal joins
  • Handling of uncertainty

SLIDE 42

■ Text Mining in the biosciences ■ Cleansing of linked open data

Further Use-Cases

SLIDE 43

■ Architecture of the Stratosphere System ■ The PACT Programming Model ■ The Nephele Execution Engine ■ Parallelizing PACT Programs

Outline

SLIDE 44

Architecture Overview

Execution Engine Parallel Programming Model Higher-Level Language Nephele PACT Programming Model JAQL, Pig, Hive Hadoop Dryad Map/Reduce Programming Model Scope, DryadLINQ JAQL? Pig? Hive? Hadoop Stack Dryad Stack Stratosphere Stack

SLIDE 45

Data-Centric Parallel Programming

Map/Reduce
■ Schema free
■ Many semantics hidden inside the user code (tricks required to push operations into map/reduce)
■ Single default way of parallelization

Relational Databases
■ Schema bound (relational model)
■ Well-defined properties and requirements for parallelization
■ Flexible and optimizable

GOAL: Advance the map/reduce programming model

SLIDE 46

Stratosphere in a Nutshell

■ PACT Programming Model

□ Parallelization Contract (PACT)
□ declarative definition of data parallelism
□ centered around second-order functions
□ generalization of map/reduce

■ Nephele

□ Dryad-style execution engine
□ evaluates dataflow graphs in parallel
□ data is read from a distributed filesystem
□ flexible engine for complex jobs

■ Stratosphere = Nephele + PACT

□ compiles PACT programs to Nephele dataflow graphs
□ combines parallelization abstraction and flexible execution
□ the choice of execution strategies gives optimization potential

SLIDE 47

■ Parallelization Contracts (PACTs) ■ The Nephele Execution Engine ■ Compiling/Optimizing Programs ■ Related Work

Overview

SLIDE 48

Intuition for Parallelization Contracts

■ Map and reduce are second-order functions

□ they call first-order functions (user code)
□ they provide the first-order functions with subsets of the input data

■ They define dependencies between the records that must be obeyed when splitting them into subsets

□ i.e., the required partition properties

■ Map

□ all records are independently processable

■ Reduce

□ records with identical key must be processed together

[Diagram: an input set of key/value records split into independent subsets]

SLIDE 49

Contracts beyond Map and Reduce

■ Cross

□ two inputs
□ each combination of records from the two inputs is built and is independently processable

■ Match

□ two inputs; each combination of records with equal key from the two inputs is built
□ each pair is independently processable

■ CoGroup

□ multiple inputs
□ pairs with identical key are grouped for each input
□ groups of all inputs with identical key are processed together

SLIDE 50

Parallelization Contracts (PACTs)

■ A second-order function that defines properties on the input and output data of its associated first-order function

■ Input Contract

□ specifies dependencies between records (a.k.a. "what must be processed together?")
□ generalization of map/reduce
□ logically abstracts a (set of) communication pattern(s): for "reduce" repartition-by-key, for "match" broadcast-one or repartition-by-key

■ Output Contract

□ generic properties preserved or produced by the user code (key property, sort order, partitioning, etc.)
□ relevant to the parallelization of succeeding functions

[Diagram: data flows through the input contract into the first-order function (user code) and out through the output contract]

SLIDE 51

■ For certain PACTs, several distribution patterns exist that fulfill the contract

□ Choice of best one is up to the system

■ Created properties (like a partitioning) may be reused for later operators

□ Need a way to find out whether they still hold after the user code □ Output contracts are a simple way to specify that □ Example output contracts: Same-Key, Super-Key, Unique-Key

■ Using these properties, optimization across multiple PACTs is possible

□ Simple System-R style optimizer approach possible

Optimizing PACT Programs

SLIDE 52

From PACT Programs to Data Flows

function match(Key k, Tuple val1, Tuple val2) -> (Key, Tuple) {
  Tuple res = val1.concat(val2);
  res.project(...);
  Key k = res.getColumn(1);
  return (k, res);
}

invoke():
  while (!input2.eof)
    KVPair p = input2.next();
    hash-table.put(p.key, p.value);
  while (!input1.eof)
    KVPair p = input1.next();
    KVPair t = hash-table.get(p.key);
    if (t != null)
      KVPair[] result = UF.match(p.key, p.value, t.value);
      output.write(result);
end

[Diagram: a PACT program with user functions UF1 (map), UF2 (map), UF3 (match), UF4 (reduce) is compiled into a Nephele DAG (vertices V1–V4 with in-memory and network channels) and spanned into the parallel data flow; the user function combines PACT code (grouping) with Nephele code (communication)]

SLIDE 53

NEPHELE EXECUTION ENGINE

SLIDE 54

■ Executes Nephele schedules

□ compiled from PACT programs

■ Design goals

□ exploit the scalability/flexibility of clouds
□ provide predictable performance
□ efficient execution on 1000+ cores
□ flexible fault-tolerance mechanisms

■ Inherently designed to run on top of an IaaS cloud

□ heterogeneity through different types of VMs
□ knows the cloud's pricing model (VM allocation and de-allocation)
□ network topology inference

Nephele Execution Engine

[Diagram: the stack, with the PACT compiler on top of Nephele on top of Infrastructure-as-a-Service]

SLIDE 55

Nephele Architecture

■ Standard master-worker pattern
■ Workers can be allocated on demand

[Diagram: a client submits jobs over the public network (Internet) to the master in the compute cloud; the master uses the cloud controller to allocate workers as the workload changes over time, with persistent storage attached via a private/virtualized network]

SLIDE 56

Structure of a Nephele Schedule

■ A Nephele schedule is represented as a DAG

□ vertices represent tasks
□ edges denote communication channels

■ Mandatory information for each vertex

□ task program
□ input/output data location (I/O vertices only)

■ Optional information for each vertex

□ number of subtasks (degree of parallelism)
□ number of subtasks per virtual machine
□ type of virtual machine (#CPU cores, RAM, …)
□ channel types
□ sharing of virtual machines among tasks

[Example DAG: Input 1 (LineReaderTask.program, input s3://user:key@storage/input) → Task 1 (MyTask.program) → Output 1 (LineWriterTask.program, output s3://user:key@storage/outp…)]

SLIDE 57

■ The Nephele schedule is converted into an internal representation

■ Explicit parallelization

□ the parallelization range (mpl) is derived from the PACT
□ the wiring of subtasks is derived from the PACT

■ Explicit assignment to virtual machines

□ specified by ID and type
□ the type refers to a hardware profile

Internal Schedule Representation

[Diagram: Input 1 (1 subtask), Task 1 (2 subtasks), Output 1 (1 subtask), assigned to VMs with ID 1 (type m1.small) and ID 2 (type m1.large)]

SLIDE 58

Execution Stages

■ Issues with on-demand allocation:

□ When to allocate virtual machines?
□ When to deallocate virtual machines?
□ No guarantee of resource availability!

■ Stages ensure three properties:

□ VMs of the upcoming stage are available
□ all workers are set up and ready
□ data of previous stages is stored in a persistent manner

[Diagram: the example DAG split into two execution stages (Stage 0 and Stage 1)]

SLIDE 59

Channel Types

■ Network channels (pipeline)

□ vertices must be in the same stage

■ In-memory channels (pipeline)

□ vertices must run on the same VM
□ vertices must be in the same stage

■ File channels

□ vertices must run on the same VM
□ vertices must be in different stages

[Diagram: the example DAG annotated with channel types across Stage 0 and Stage 1]

SLIDE 60

Some Evaluation (1/2)

■ Demonstrates the benefits of dynamic resource allocation
■ Challenge: sort and aggregate

□ sort 100 GB of integer numbers (from the GraySort benchmark)
□ aggregate the top 20% of these numbers (exact result!)

■ First execution as map/reduce jobs with Hadoop

□ three map/reduce jobs on 6 VMs (each with 8 CPU cores, 24 GB RAM)
□ TeraSort code used for sorting
□ custom code for aggregation

■ Second execution as map/reduce jobs with Nephele

□ a map/reduce compatibility layer allows running Hadoop M/R programs
□ Nephele controls the resource allocation
□ idea: adapt the allocated resources to the required processing power

SLIDE 61

Some Evaluation (2/2)

■ M/R jobs on Hadoop: poor resource utilization!
■ M/R jobs on Nephele: automatic VM deallocation

[Charts: average instance utilization in % (USR/SYS/WAIT) and average network traffic among instances in MBit/s over time, phases (a)–(h), for both executions]

SLIDE 62

■ [WK09] D. Warneke, O. Kao: Nephele: Efficient Parallel Data Processing in the Cloud. SC-MTAGS 2009
■ [BEH+10] D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, D. Warneke: Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing. SoCC 2010: 119–130
■ [ABE+10] A. Alexandrov, D. Battré, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl, E. Nijkamp, D. Warneke: Massively Parallel Data Analysis with PACTs on Nephele. PVLDB 3(2): 1625–1628 (2010)
■ [AEH+11] A. Alexandrov, S. Ewen, M. Heimel, F. Hueske, et al.: MapReduce and PACT – Comparing Data Parallel Programming Models. BTW 2011

References

SLIDE 63

■ Adaptive Fault-Tolerance (Odej Kao) ■ Robust Query Optimization (Volker Markl) ■ Parallelization of the PACT Programming Model (Volker Markl) ■ Continuous Re-Optimization (Johann-Christoph Freytag) ■ Validating Climate Simulations with Stratosphere (Volker Markl) ■ Text Analysis with Stratosphere (Ulf Leser) ■ Data Cleansing with Stratosphere (Felix Naumann) ■ JAQL on Stratosphere: Student Project at TUB ■ Open Source Release: Nephele + PACT (TUB, HPI, HU)

Ongoing Work

SLIDE 64

Overview

■ Introduction ■ Big Data Analytics ■ Map/Reduce/Merge ■ Introducing … the Cloud ■ Stratosphere (PACT and Nephele) ■ Demo (Thomas Bodner, Matthias Ringwald) ■ Mahout and Scalable Data Mining (Sebastian Schelter)

SLIDE 65

The Information Revolution

http://mediatedcultures.net/ksudigg/?p=120

SLIDE 66

WEBLOG ANALYSIS QUERY

Demo Screenshots

SLIDE 67

Weblog Query and Plan

SELECT r.url, r.rank, r.avg_duration FROM Documents d JOIN Rankings r ON r.url = d.url WHERE CONTAINS(d.text, [keywords]) AND r.rank > [rank] AND NOT EXISTS (SELECT * FROM Visits v WHERE v.url = d.url AND v.date < [date]);

SLIDE 68

Weblog Query – Job Preview

SLIDE 69

Weblog Query – Optimized Plan

SLIDE 70

Weblog Query – Nephele Schedule in Execution

SLIDE 71

ENUMERATING TRIANGLES FOR SOCIAL NETWORK MINING

Demo Screenshots

SLIDE 72

Enumerating Triangles – Graph and Job

SLIDE 73

Enumerating Triangles – Job Preview

SLIDE 74

Enumerating Triangles – Optimized Plan

SLIDE 75

Enumerating Triangles – Nephele Schedule in Execution

SLIDE 76

APACHE MAHOUT

Scalable Data Mining (Sebastian Schelter)

SLIDE 77

Apache Mahout: Overview

■ What is Apache Mahout?

□ An Apache Software Foundation project aiming to create scalable machine learning libraries under the Apache License □ focus on scalability, not a competitor for R or Weka □ in use at Adobe, Amazon, AOL, Foursquare, Mendeley, Twitter, Yahoo

■ Scalability

□ processing time t is proportional to problem size P divided by resource size R: t ∝ P/R
□ does not imply Hadoop or parallelism, although the majority of implementations use Map/Reduce

SLIDE 78

Apache Mahout: Clustering

■ Clustering

□ Unsupervised learning: assign a set of data points into subsets (called clusters) so that points in the same cluster are similar in some sense

■ Algorithms

□ K-Means □ Fuzzy K-Means □ Canopy □ Mean Shift □ Dirichlet Process □ Spectral Clustering

SLIDE 79

Apache Mahout: Classification

■ Classification

□ supervised learning: learn a decision function that predicts labels y on data points x given a set of training samples {(x,y)}

■ Algorithms

□ Logistic Regression (sequential but fast) □ Naive Bayes / Complementary Naïve Bayes □ Random Forests

SLIDE 80

Apache Mahout: Collaborative Filtering

■ Collaborative Filtering

□ approach to recommendation mining: given a user's preferences for items, guess which other items would be highly preferred

■ Algorithms

□ neighborhood methods: item-based collaborative filtering □ latent factor models: matrix factorization using "Alternating Least Squares"

SLIDE 81

Apache Mahout: Singular Value Decomposition

■ Singular Value Decomposition

□ matrix decomposition technique used to create an optimal low-rank approximation of a matrix □ used for dimensionality reduction, unsupervised feature selection, "Latent Semantic Indexing"

■ Algorithms

□ Lanczos Algorithm □ Stochastic SVD

SLIDE 82

SCALABLE DATA MINING

Comparing implementations of data mining algorithms in Hadoop/Mahout and Nephele/PACT

SLIDE 83

Pairwise row similarity computation ■ Computes the pairwise similarities of the rows (or columns) of a sparse matrix using a predefined similarity function

□ used for computing document similarities in large corpora □ used to precompute item-item similarities for recommendations (collaborative filtering) □ the similarity function can be cosine, Pearson correlation, loglikelihood ratio, Jaccard coefficient, …

Problem description

SLIDE 84

Map/Reduce

■ Map/Reduce – Step 1

□ compute similarity-specific row weights
□ transpose the matrix, thereby creating an inverted index

■ Map/Reduce – Step 2

□ map out all pairs of co-occurring values
□ collect all co-occurring values per row pair and compute the similarity value

■ Map/Reduce – Step 3

□ use secondary sort to keep only the k most similar rows

[Diagram: the equivalent PACT data flow]
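A hedged plain-Java sketch of the map side of step 2: from one row of the inverted index (all non-zero cells of one column of the original matrix), emit every pair of co-occurring values keyed by the row pair (the names and the String pair key are illustrative):

import java.util.*;

public class CooccurrenceSketch {

    // cells: rowId -> value for a single column of the original matrix
    static Map<String, double[]> cooccurringPairs(Map<Integer, Double> cells) {
        List<Integer> rows = new ArrayList<>(cells.keySet());
        Collections.sort(rows);                      // canonical order for the pair key
        Map<String, double[]> pairs = new HashMap<>();
        for (int i = 0; i < rows.size(); i++) {
            for (int j = i + 1; j < rows.size(); j++) {
                pairs.put(rows.get(i) + "," + rows.get(j),   // key: the row pair
                          new double[] { cells.get(rows.get(i)), cells.get(rows.get(j)) });
            }
        }
        return pairs;                                // the reduce side aggregates these per row pair
    }
}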

SLIDE 85

■ Equivalent implementations in Mahout and PACT

□ the problem maps relatively well to the Map/Reduce paradigm
□ insight: standard Map/Reduce code can be ported to Nephele/PACT with very little effort
□ output contracts and memory forwards offer hooks for performance improvements (unfortunately not applicable in this particular use case)

Comparison

SLIDE 86

K-Means

■ Simple iterative clustering algorithm

□ uses a predefined number of clusters (k) □ start with a random selection of cluster centers □ assign points to nearest cluster □ recompute cluster centers, iterate until convergence

Problem description

SLIDE 87

■ Initialization

□ generate k random cluster centers from data points (optional) □ put the centers into the distributed cache

■ Map

□ find nearest cluster for each data point □ emit (cluster id, data point)

■ Combine

□ partially aggregate distances per cluster

■ Reduce

□ compute new centroid for each cluster

■ output converged cluster centers or centers after n iterations ■ optionally output clustered data points

Mahout implementation (map/combine/reduce repeated until convergence)
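A plain-Java sketch of one iteration, mirroring the M/R decomposition above (the assignment loop plays the role of the map step, the per-cluster partial sums that of the combiner, and the centroid recomputation that of the reduce step):

public class KMeansIterationSketch {

    // One k-means iteration: returns the recomputed cluster centers.
    static double[][] iterate(double[][] points, double[][] centers) {
        int k = centers.length, d = centers[0].length;
        double[][] sums = new double[k][d];
        int[] counts = new int[k];
        for (double[] p : points) {                  // "map": find the nearest center
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double dist = 0;
                for (int i = 0; i < d; i++) {
                    double diff = p[i] - centers[c][i];
                    dist += diff * diff;
                }
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            counts[best]++;                          // "combine": partial aggregates per cluster
            for (int i = 0; i < d; i++) sums[best][i] += p[i];
        }
        double[][] newCenters = new double[k][d];    // "reduce": new centroid per cluster
        for (int c = 0; c < k; c++) {
            for (int i = 0; i < d; i++) {
                newCenters[c][i] = counts[c] > 0 ? sums[c][i] / counts[c] : centers[c][i];
            }
        }
        return newCenters;
    }
}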

SLIDE 88

Stratosphere Implementation

Source: www.stratosphere.eu

SLIDE 89

Comparison of the implementations

■ actual execution plan in the underlying distributed systems is nearly equivalent ■ Stratosphere implementation is more intuitive and closer to the mathematical formulation of the algorithm

Code analysis

SLIDE 90

Naïve Bayes

■ Simple classification algorithm based on Bayes' theorem
■ General Naïve Bayes

□ assumes feature independence
□ often yields good results even if this assumption does not hold

■ Mahout's version of Naïve Bayes

□ specialized approach for document classification
□ based on the tf-idf weight metric

Problem description

SLIDE 91

■ Classification

□ straightforward approach that simply reads the complete model into memory
□ classification is done in the mapper; the reducer only sums up statistics for the confusion matrix

■ Trainer

□ much higher complexity
□ needs to count documents, features, features per document, features per corpus
□ Mahout's implementation is optimized by exploiting Hadoop-specific features like secondary sort and reading results into memory from the cluster filesystem

M/R Overview

SLIDE 92

M/R Trainer Overview

[Diagram: training pipeline Feature Extractor → Tf-Idf Calculation → Weight Summer → Theta Normalizer over the train data, with intermediate results wordFreq, termDocC, featureC, docC, tfIdf, vocabC, σk, σj, σkσj, thetaNorm]

SLIDE 93

■ PACT implementation

□ looks even more complex, but PACTs can be combined in a much more fine-grained manner □ since PACT offers local memory forwards, more and higher-level functions such as Cross and Match can be used □ fewer framework-specific tweaks are necessary for a performant implementation □ the visualized execution plan is much more similar to the algorithmic formulation of computing several counts and combining them into a model at the end □ subcalculations can be seen and unit-tested in isolation

Pact Trainer Overview

SLIDE 94

PACT Trainer Overview

SLIDE 95

Hot Path

[Diagram: data volumes along the hot path: 7.4 GB, 14.8 GB, 5.89 GB, 5.89 GB, 3.53 GB, 84 kB, 8 kB, 5 kB]

SLIDE 96

■ Future work: the PACT implementation can still be tuned by

□ sampling the input data
□ more flexible memory management in Stratosphere
□ employing the context concept of PACTs for simpler distribution of computed parameters

PACT Trainer Overview

SLIDE 97

Thank You

Merci Grazie

Gracias

Obrigado Danke

Japanese English French Russian German Italian

Spanish

Brazilian Portuguese

Arabic

Traditional Chinese Simplified Chinese Hindi Tamil

Thai

Korean

SLIDE 98

PARALLEL DATA FLOW LANGUAGES

Programming in a more abstract way

SLIDE 99

■ MapReduce paradigm is too low-level

□ Only two declarative primitives (map + reduce) □ Extremely rigid (one input, two-stage data flow) □ Custom code needed, e.g., for projection and filtering □ → Code is difficult to reuse and maintain □ → Impedes optimization

■ Combination of high-level declarative querying and low-level programming with MapReduce ■ Dataflow Programming Languages

□ Hive □ JAQL □ Pig

Introduction

SLIDE 100

■ Data warehouse infrastructure built on top of Hadoop, providing:

□ Data Summarization □ Ad hoc querying

■ Simple query language: Hive QL (based on SQL) ■ Extensible via custom mappers and reducers ■ Subproject of Hadoop ■ No special "Hive format" ■ http://hadoop.apache.org/hive/

Hive

SLIDE 101

Hive - Example

LOAD DATA INPATH '/data/visits' INTO TABLE visits;

INSERT OVERWRITE TABLE visitCounts
SELECT url, category, count(*) FROM visits GROUP BY url, category;

LOAD DATA INPATH '/data/urlInfo' INTO TABLE urlInfo;

INSERT OVERWRITE TABLE visitCounts
SELECT vc.*, ui.* FROM visitCounts vc JOIN urlInfo ui ON (vc.url = ui.url);

INSERT OVERWRITE TABLE gCategories
SELECT category, count(*) FROM visitCounts GROUP BY category;

INSERT OVERWRITE TABLE topUrls
SELECT TRANSFORM (visitCounts) USING 'top10';

SLIDE 102

■ Higher-level query language for JSON documents ■ Developed at IBM's Almaden research center ■ Supports several operations known from SQL

□ Grouping, Joining, Sorting

■ Built-in support for

□ Loops, Conditionals, Recursion

■ Custom Java methods extend JAQL ■ JAQL scripts are compiled to MapReduce jobs ■ Various I/O

□ Local FS, HDFS, Hbase, Custom I/O adapters

■ http://www.jaql.org/

JAQL

SLIDE 103

JAQL - Example

registerFunction("top10", "de.tuberlin.cs.dima.jaqlextensions.top10");

$visits = hdfsRead("/data/visits");
$visitCounts = $visits
  -> group by $url = $
     into { $url, num: count($) };

$urlInfo = hdfsRead("/data/urlInfo");
$visitCounts = join $visitCounts, $urlInfo
  where $visitCounts.url == $urlInfo.url;

$gCategories = $visitCounts
  -> group by $category = $
     into { $category, num: count($) };

$topUrls = top10($gCategories);
hdfsWrite("/data/topUrls", $topUrls);

SLIDE 104

■ A platform for analyzing large data sets ■ Pig consists of two parts:

□ PigLatin: A Data Processing Language □ Pig Infrastructure: An Evaluator for PigLatin programs □ Pig compiles Pig Latin into physical plans □ Plans are to be executed over Hadoop

■ Interface between the declarative style of SQL and the low-level, procedural style of MapReduce ■ http://hadoop.apache.org/pig/

Pig

SLIDE 105

Pig - Example

visits = load '/data/visits' as (user, url, time);
visitCounts = foreach visits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts, 10);
store topUrls into '/data/topUrls';

Example taken from: "Pig Latin: A Not-So-Foreign Language for Data Processing" talk, SIGMOD 2008

SLIDE 106

■ C. Olston et al.: Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008, pp. 1099–1110
■ Apache Pig: http://wiki.apache.org/pig/FrontPage
■ A. Thusoo, J. Sen Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, R. Murthy: Hive – A Warehousing Solution Over a Map-Reduce Framework
■ Apache Hive: http://wiki.apache.org/hadoop/Hive
■ K. S. Beyer, V. Ercegovac, R. Krishnamurthy, S. Raghavan, J. Rao, F. Reiss, E. J. Shekita, D. E. Simmen, S. Tata, S. Vaithyanathan, H. Zhu: Towards a Scalable Enterprise Content Analytics Platform. IEEE Data Eng. Bull. 32: 28–35 (2009)
■ JAQL: http://code.google.com/p/jaql/wiki/

Literature

SLIDE 107

QUERY COPROCESSING ON GRAPHICS PROCESSORS

SLIDE 108

Query Coprocessing on GPUs

■ Graphics Processors (GPUs) have recently emerged as powerful coprocessors for general purpose computation

□ ~10x the computational power of the CPU □ ~5x the memory bandwidth of the CPU

■ Parallel primitives available for query processing that

□ provide exploitation of GPU hardware features such as high thread parallelism and reduction of memory stalls through the fast local memory □ are scalable to hundreds of processors because of their lock-free design and low synchronization cost through the use of local memory

SLIDE 109

Query Coprocessing on GPUs

■ Map

□ given an array of data tuples and a function, a map applies the function to every tuple □ uses multiple thread groups to scan the relation with each thread group being responsible for a segment of the relation □ the access pattern of the threads in each thread group is designed to exploit the coalesced memory access feature on the GPU

■ Scatter and Gather

□ Scatter: perform indexed writes to a relation (e.g. hashing) defined by a location array □ Gather: perform indexed reads from a relation also defined by a location array □ can be implemented using the multipass optimization scheme to improve their temporal locality

SLIDE 110

Query Coprocessing on GPUs

■ Prefix scan

□ applies a binary operator to the input relation □ example: prefix sum, an important operation in parallel databases

■ Reduce

□ computes a value based on the input relation □ implemented as multipass algorithm by utilizing local memory optimization □ logarithmic number of passes constrained by local memory size per multiprocessor

SLIDE 111

HADOOP DB

An Architectural Hybrid of MapReduce and DBMS

SLIDE 112

Parallel Data Processing Architectures

■ Two major architectures:

1. Parallel databases: "standard" relational databases in a (usually) shared-nothing cluster

2. MapReduce: data analysis via parallel map and reduce jobs in a replicated cluster

■ Both approaches have their pros and cons.

SLIDE 113

Parallel RDBMSs

■ Pros:

□ usually very good and consistent performance
□ flexible and proven interface (SQL)

■ Cons:

□ scaling is rather limited (tens of nodes)
□ does not work well in heterogeneous clusters
□ not very fault-tolerant

SLIDE 114

MapReduce ■ Pros:

□ Very fault-tolerant and automatic load-balancing. □ Operates well in heterogeneous clusters.

■ Cons:

□ Writing map/reduce jobs is more complicated than writing SQL queries. □ Performance depends largely on the skill of the programmer.

SLIDE 115

HadoopDB ■ Both approaches have their strengths and weaknesses. ■ Idea of HadoopDB: Combine them!

□ Traditional relational databases as data storage and data processing nodes. □ MapReduce for Query Parallelization, Job Tracking, etc. □ Automatic “SQL to MapReduce to SQL” (SMS) query rewriter (based on Hive).

■ → Pushing as many operations as possible into the database layer improves data access performance. ■ → MapReduce improves fault tolerance and offers solid cluster management.

SLIDE 116

HadoopDB overview

[Diagram: the user submits an SQL query to the master node, where the SMS planner and the MapReduce job tracker (with a system catalog) turn it into a MapReduce job; each of the n worker nodes runs a task tracker in front of a local Postgres database, with table data replicated across the nodes]

SLIDE 117

HadoopDB Sample Query

SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY YEAR(saleDate);

[Diagram: the SMS planner rewrites this query into a MapReduce job over the local databases]

SLIDE 118

Experimental Findings (I)
■ Compared with: native Hadoop (Hive), Vertica, and a "commercial row-oriented DB".
■ Experiments performed on 10/50/100-node Amazon EC2 clusters.
■ Benchmark used: A. Pavlo et al.: "A Comparison of Approaches to Large-Scale Data Analysis", SIGMOD 2009

SLIDE 119

Experimental Findings (II)
■ In the absence of failures, HadoopDB is usually slower than parallel DBMSs.
■ HadoopDB is consistently faster than Hadoop, but takes ~10 times longer to load data.
■ HadoopDB's performance degrades significantly less than Vertica's in the case of node failures.
■ HadoopDB is not as susceptible to single slow nodes as Vertica.

SLIDE 120

■ A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, A. Silberschatz: HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. PVLDB 2(1): 922–933 (2009)

Literature

SLIDE 121

■ Parallel Speedup

□ Amdahl's Law

■ Levels of Parallelism

□ Instruction-Level, Data, Task

■ Modes of Query Parallelism

□ Inter-Query / Intra-Query □ Pipeline (Inter Operator) / Data (Intra Operator)

■ Parallel Database Operations

Basics of Parallel Processing

SLIDE 122

Parallel Speedup
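The slide itself is a chart; for reference, Amdahl's law (mentioned on the previous slide) bounds the speedup S achievable with p processors when a fraction f of the work is parallelizable:

S(p) = 1 / ((1 - f) + f/p), and S(p) → 1/(1 - f) as p → ∞

e.g. with f = 0.9, even arbitrarily many processors give at most a 10x speedup.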

SLIDE 123

Parallel Speedup

SLIDE 124

■ Instruction-level parallelism

□ single instructions are automatically processed in parallel
□ example: modern CPUs with multiple pipelines and instruction units

■ Data parallelism

□ different data can be processed independently
□ each processor executes the same operations on its share of the input data
□ example: distributing loop iterations over multiple processors, or CPU vector units

■ Task parallelism

□ tasks are distributed among the processors/nodes
□ each processor executes a different thread/process
□ example: threaded programs

Levels of Parallelism on Hardware

SLIDE 125

■ Inter-query parallelism (multiple concurrent queries)

□ necessary for efficient resource utilization: while one query waits (e.g. for I/O), another one executes
□ requires concurrency control (locking mechanisms) to guarantee transactional properties (the "I" in ACID)
□ important for highly transactional scenarios (OLTP)

■ Intra-query parallelism (parallel processing of a single query)

□ I/O parallelism: concurrent reading from multiple disks; hidden: hardware RAID; transparent: spanned tablespaces
□ intra-operator parallelism: multiple threads work on the same operator (example: parallel sort)
□ inter-operator parallelism: multiple pipelined parts of the plan run in parallel
□ important for complex analytical tasks (OLAP)

Modes of Query Parallelism

SLIDE 126

Pipeline Parallelism

[Diagram: plan with Scan(T1) and Scan(T2) feeding hash-join (HS-Join) builds, Scan(T3) probing, then Sort and Return]

Step 1: Two threads scan one base table each and build the hash tables for the joins.
Step 2: One thread scans the table and probes the hash tables; a second thread starts the sort (sorting sub-lists, merging the first lists).
Step 3: One thread returns the result, business as usual.

SLIDE 127

Pipeline Parallelism

■ Pipeline parallelism is also called inter-operator parallelism

□ inter-operator, because the parallelism is between the operators

■ Execute multiple pipelines simultaneously

□ limited in its applicability: only possible if multiple pipelines are present and they are not totally dependent on each other

■ Problems:

□ high synchronization overhead
□ mostly limited to a low degree of parallelism (not too many pipelines per query)
□ only suited for shared-memory architectures
SLIDE 128

Data Parallelism

■ Where pipeline parallelism is not applicable to a large degree → data parallelism
■ The data is divided into several subsets

□ most operations don't need a complete view of the data; e.g. "filter" looks only at a single tuple at a time
□ the subsets can be processed independently and hence in parallel

■ The degree of parallelism is as high as the number of possible subsets

□ for "filter": as high as the number of tuples

■ Some operations need a view of larger portions of the data

□ e.g. a grouping/aggregation operation needs all tuples with the same grouping key
□ are they all in the same set? can we guarantee that?
□ different operators need different sets!

SLIDE 129

■ Levels of Resource Sharing

□ Shared-Memory, Shared-Disk, Shared-Nothing

■ Data Partitioning

□ Round-robin, Hash, Range

■ Parallel Operators and Costs

□ Tuple-at-a-time (i.e. Selection) □ Sorting □ Projection, Grouping, Aggregation □ Join

Basics of Parallel Query Processing

SLIDE 130

Parallel Architectures (I)

■ Shared Memory

  • Several CPUs share a single memory and disk (array)
  • Communication over a single common bus

Source: Garcia-Molina et al., "Database Systems – The Complete Book", Second Edition

SLIDE 131

■ Shared Disk

  • Several nodes with multiple CPUs; each node has its private memory
  • Single attached disk (array): often NAS, SAN, etc.

Parallel Architectures (II)

Source: Garcia-Molina et al., "Database Systems – The Complete Book", Second Edition

SLIDE 132

■ Shared Nothing

  • Each node has its own set of CPUs, memory and attached disks
  • Data needs to be partitioned over the nodes
  • Data is exchanged through direct node-to-node communication

Parallel Architectures (III)

Source: Garcia-Molina et al., "Database Systems – The Complete Book", Second Edition

SLIDE 133

Data Partitioning (I)

■ Partitioning the data means creating a set of disjoint subsets

  • Example: sales data, where every year gets its own partition

■ For shared-nothing, data must be partitioned across nodes

  • If it were replicated, it would effectively become a shared-disk system with the local disks acting like a cache (which must be kept coherent)

■ Partitioning with certain characteristics has further advantages

  • Some queries can be limited to operate on certain sets only, if it is provable that all relevant data (passing the predicates) is in that partition
  • Partitions can simply be dropped as a whole (the data is rolled out) when they are no longer needed (e.g. discard old sales)

SLIDE 134

Data Partitioning (II)

■ How to partition the data into disjoint sets?

  • Round robin: each set gets a tuple in turn; all sets have a guaranteed equal number of tuples, but there is no apparent relationship between the tuples in one set.
  • Hash partitioned: define a set of partitioning columns and generate a hash value over those columns to decide the target set. All tuples with equal values in the partitioning columns end up in the same set.
  • Range partitioned: define a set of partitioning columns and split the domain of those columns into ranges. The range determines the target set. All tuples in one set are in the same range.

SLIDE 135

■ The client sends a SQL query to one of the cluster nodes

□ this node becomes the "coordinator"

■ The coordinator compiles the query

□ parsing, checking, optimization
□ parallelization

■ It sends partial plans to the other cluster nodes that describe their tasks

□ the coordinator also executes a partial plan on its own part of the data

■ It collects the partial results and finalizes them (see next slide)

Data Parallelism Example

[Diagram: the client sends the query to the coordinator, which distributes partial plans to the cluster nodes and combines the partial results into the final result]

SLIDE 136

Data Parallelism Example

■ For shared-nothing & shared-disk

  • Multiple instances of a sub-plan are executed on different computers
  • The instances operate on different splits or partitions of the data
  • At some points, results from the sub-plans are collected
  • For more complex queries, results are not collected but re-distributed for further parallel processing

[Diagram: parallel sub-plan instances (Scan, IX-Scan, Fetch, NL-Join over partitions of T1, T2 and index IX-T2.1, with pre-aggregation via Group and Agg) ship their results through a Queue to a final Sort, Group/Agg and Return]

SLIDE 137

Ideally: operate as much as possible on individual partitions of the data

→ bring the operation to the data
→ no communication needed, ideal parallelism

Easy for simple "per-tuple" operators

→ Scan, IX-Scan, Fetch, Filter

Problematic: some operators need the whole picture

→ e.g. sorts and aggregations can only be preprocessed in parallel and need a final step on a single node, unless they occur in a correlated subplan known to contain only tuples from one partition
→ e.g. joins need matching tuples: either organize the inputs accordingly, or join on the coordinator after the collection of partial results (not parallel any more!)

Parallel Operators

SLIDE 138

■ S: relation S
■ S[i, h]: partition i of relation S according to partitioning scheme h
■ B(S): number of blocks of relation S
■ p: number of nodes
■ We assume a shared-nothing architecture

□ most commercial database vendors use shared-nothing approaches

■ Network transfer is at least as expensive as disk access

□ in some cost models it is still far more expensive
□ today, network bandwidth ≈ disk bandwidth
□ but the network is shared; switches and routers in particular have a throughput limit

■ Partitioning schemes (hash/range) produce partitions of roughly equal size

Notations and Assumptions

SLIDE 139

Selection can be parallelized very efficiently (an embarrassingly parallel problem):

→ each node performs the selection on its existing local partition
→ selection needs no context; the data can be partitioned in an arbitrary way
→ the partial results are unioned afterwards

Cost: B(S)/p

Parallel Selection

SLIDE 140

Parallel Projection, Grouping, Aggregation

SLIDE 141

■ Range-partitioning sort (partition by range, then sort)

→ range-partition the relation according to the sort columns
→ sort the single partitions locally (e.g. by TPMMS)
→ cost: B(S) partitioning + B(S) transfer + B(S)/p local sorting
→ problem: how to find a uniform range-partitioning scheme?
→ the result is already partitioned in the cluster

■ Parallel external sort-merge (sort locally, then merge)

→ reuse an existing data partitioning
→ the partitions are sorted locally (e.g. by TPMMS)
→ the sorted partitions need to be merged, e.g. one node merges two partitions at a time until the whole relation is sorted
→ cost: B(S)/p local sorting + log2(p)*B(S)/2 transfer + log2(p)*B(S) local merging
→ the result is sitting on one machine

Parallel Sorting
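An illustrative calculation with assumed numbers: for B(S) = 1,000 blocks and p = 10 nodes, the range-partitioning sort costs about 1,000 (partitioning) + 1,000 (transfer) + 100 (local sorting per node) block operations, while the parallel external sort-merge costs about 100 (local sorting) + log2(10) * 500 ≈ 1,660 (transfer) + log2(10) * 1,000 ≈ 3,320 (local merging). The merge variant avoids the repartitioning step but pays for it during merging, and its result ends up on a single machine.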

SLIDE 142

■ A special class of Joins that are suited for parallelization are Natural- and Equi-Joins.

□ For Equi-Joins we only look at tuple pairs that share the same join key.

■ Idea: Partition relations R and S using the same partitioning scheme over the join key.

□ All values of R and S with the same join key end up at the same node! □ All joins can be performed locally!

■ Actual implementation depends on how the relations are partitioned:

□ Co-Located Join □ Directed Join □ Re-Partitioning Join

Parallel Equi-Joins (I)

SLIDE 143

1. Both R and S are already partitioned over the join key (and with the same partitioning scheme):

□ "co-located join"
□ no re-partitioning is needed!
□ cost: local join cost only

2. Only one relation is partitioned over the join key:

□ "directed join"
□ re-partition the other relation with the same partitioning scheme
□ cost (assuming R is already partitioned): B(S) partitioning + B(S) transfer + local join cost

3. No relation is partitioned over the join key:

□ "repartition join"
□ re-partition both relations over the join key
□ cost: B(S)+B(R) partitioning + B(S)+B(R) transfer + local join cost

Parallel Equi-Joins (II)

SLIDE 144

Symmetric Fragment-and-Replicate Join

[Diagram: relation R is fragmented into m parts and S into n parts; each of the m × n fragment pairs is joined on one node]

SLIDE 145

Symmetric Fragment-and-Replicate Join (II)

[Diagram: assignment of the fragment pairs to the nodes in the cluster]

SLIDE 146

■ We can do better if relation S is much smaller than R.
■ Idea: reuse the existing partitioning of R and replicate the whole relation S to each node.
■ Cost: p * B(S) transport + local join cost
■ The asymmetric fragment-and-replicate join is a special case of the symmetric algorithm with m = p and n = 1.
■ The asymmetric fragment-and-replicate join is also called the broadcast join.

Asymmetric Fragment-and-Replicate Join

SLIDE 147

■ Database clusters tend to scale up to 64 or 128 nodes

  • beyond that, the speedup curve flattens
  • communication overhead eats the speedup gained through the next node
  • hard limit example: 1,000 nodes for DB2

■ Shared disk: does not scale infinitely; bus and synchronization become the overhead

□ for updates: cache coherency problem
□ for reads: I/O bandwidth limits

■ Shared nothing: cannot easily compensate the loss of a node

  • in large clusters, failures and outages are most common
  • the loss of a node means loss of data, unless the data is replicated
  • but replicated data must be kept consistent, which has a high overhead

Limits in Parallel Databases

SLIDE 148

■ S. Fushimi, M. Kitsuregawa, H. Tanaka: An Overview of The System Software of A Parallel Relational Database Machine GRACE. VLDB 1986
■ D. A. Schneider, D. J. DeWitt: A Performance Evaluation of Four Parallel Join Algorithms in a Shared-Nothing Multiprocessor Environment. SIGMOD 1989
■ D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, M. Muralikrishna: GAMMA – A High Performance Dataflow Database Machine. VLDB 1986
■ J. W. Stamos, H. C. Young: A Symmetric Fragment and Replicate Algorithm for Distributed Joins. IEEE Trans. Parallel Distrib. Syst., 1993

Literature

SLIDE 149

■ OLTP-style applications that are beyond relational databases' capabilities exist as well
■ Some applications still require fast and efficient lookup and retrieval of small amounts of data

□ web index access, mail accounts, warehouse updates for resellers
→ addressed by key/value-pair-based storage systems (e.g. Google BigTable and Megastore)
→ data can only be accessed through a key
→ only an additional filter on columns and timestamps can be applied

■ Some applications do still need updates and certain guarantees about them

□ no hard transactions, especially no multi-record transactions!
→ eventual consistency model (Amazon Dynamo)

■ These techniques require a lecture of their own

Side Note: What about Updates/Transactions?
