Stratosphere v0.4
Stephan Ewen (stephan.ewen@tu-berlin.de)

Release Preview
• Official release coming end of November
• Hands-on sessions today with the latest code snapshot

New Features in a Nutshell
• Declarative Scala Programming API
• Iterative Programs
  o Bulk (batch-to-batch in memory) and Incremental (Delta Updates)
  o Automatic caching and cross-loop optimizations
• Runs on top of YARN (Hadoop Next Gen)
• Various deployment methods
  o VMs, Debian packages, EC2 scripts, ...
• Many usability fixes and bugfixes

Stratosphere System Stack
• APIs: Sky Java API, Sky Scala API, Meteor, ...
• Stratosphere Optimizer
• Stratosphere Runtime
• Cluster Manager: Direct, EC2, YARN
• Storage: Local Files, HDFS, S3, ...

MapReduce is nice and good, but...
• Very verbose and low level. Only usable by system programmers.
• Everything slightly more complex must result in a cascade of jobs. Loses performance and optimization potential.
[Figure: a cascade of chained Map and Reduce jobs]

SQL (or Hive or Pig) is nice and good, but...
• Allows you to do a subset of the tasks efficiently and elegantly
• What about the cases that do not fit SQL?
  o Custom types
  o Custom non-relational functions (they occur a lot!)
  o Iterative Algorithms: machine learning, graph analysis
• How does it look to mix SQL with MapReduce?

SQL (or Hive or Pig) is nice and good, but...
• Program Fragmentation
• Impedance Mismatch
• Breaks optimization

Hive:
    FROM (
      FROM pv_users
      MAP pv_users.userid, pv_users.date
      USING 'map_script'
      AS dt, uid
      CLUSTER BY dt) map_output
    INSERT OVERWRITE TABLE pv_users_reduced
      REDUCE map_output.dt, map_output.uid
      USING 'reduce_script'
      AS date, count;

Pig:
    A = load 'WordcountInput.txt';
    B = MAPREDUCE wordcount.jar store A into 'inputDir' load 'outputDir'
          as (word:chararray, count: int) 'org.myorg.WordCount inputDir outputDir';
    C = order B by count;

Sky Language
• MapReduce style functions (Map, Reduce, Join, CoGroup, Cross, ...)
• Relational Set Operations (filter, map, group, join, aggregate, ...)
• Write like a programming language, execute like a database...
[Figure: stack of Scala Embedded Language, Optimizer, Database / UDF Runtime]

Sky Language
• Add a bit of "languages and compilers" sauce to the database stack

Scala API by Example
• The classical word count example

    val input = TextFile(textInput)
    val words = input flatMap { line => line.split("\\W+") }
    val counts = words groupBy { word => word } count()

Scala API by Example
• The classical word count example (annotated)

    // In-situ data source
    val input = TextFile(textInput)
    // Transformation function
    val words = input flatMap { line => line.split("\\W+") }
    // Group by entire data type (the words), then count per group
    val counts = words groupBy { word => word } count()
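To make the data flow of the word count concrete, here is a minimal sketch of the same three steps on plain Scala collections. It does not use the Stratosphere API: `WordCountSketch` and `wordCount` are hypothetical names, and an in-memory `Seq[String]` stands in for `TextFile`.

```scala
object WordCountSketch {
  // Split every line into words, group identical words, count each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("\\W+").toSeq) // same regex as on the slide
      .filter(_.nonEmpty)                        // split can yield a leading empty token
      .groupBy(word => word)                     // key is the entire value (the word)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

Calling `WordCountSketch.wordCount(Seq("to be or not to be"))` groups the six tokens into four words with their counts.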

Scala API by Example
• Graph Triangles (Friend-of-a-Friend problem)
  o Recommending friends, finding important connections
• 1) Enumerate candidate triads
• 2) Close as triangles

Scala API by Example

    case class Edge(from: Int, to: Int)
    case class Triangle(apex: Int, base1: Int, base2: Int)

    val vertices = DataSource("hdfs:///...", CsvFormat[Edge])
    val byDegree = vertices map { projectToLowerDegree }
    val byID = byDegree map { (x) => if (x.from < x.to) x else Edge(x.to, x.from) }
    val triads = byDegree groupBy { _.from } reduceGroup { buildTriads }
    val triangles = triads join byID
                      where { t => (t.base1, t.base2) }
                      isEqualTo { e => (e.from, e.to) }
                      map { (triangle, edge) => triangle }

Scala API by Example (annotated: custom data types, in-situ data source)

    // Custom data types
    case class Edge(from: Int, to: Int)
    case class Triangle(apex: Int, base1: Int, base2: Int)

    // In-situ data source
    val vertices = DataSource("hdfs:///...", CsvFormat[Edge])
    val byDegree = vertices map { projectToLowerDegree }
    val byID = byDegree map { (x) => if (x.from < x.to) x else Edge(x.to, x.from) }
    val triads = byDegree groupBy { _.from } reduceGroup { buildTriads }
    val triangles = triads join byID
                      where { t => (t.base1, t.base2) }
                      isEqualTo { e => (e.from, e.to) }
                      map { (triangle, edge) => triangle }

Scala API by Example (annotated: non-relational library function, non-relational function, relational join)

    case class Edge(from: Int, to: Int)
    case class Triangle(apex: Int, base1: Int, base2: Int)

    val vertices = DataSource("hdfs:///...", CsvFormat[Edge])
    // Non-relational function
    val byDegree = vertices map { projectToLowerDegree }
    val byID = byDegree map { (x) => if (x.from < x.to) x else Edge(x.to, x.from) }
    val triads = byDegree groupBy { _.from } reduceGroup { buildTriads }
    // Relational join
    val triangles = triads join byID
                      where { t => (t.base1, t.base2) }
                      isEqualTo { e => (e.from, e.to) }
                      map { (triangle, edge) => triangle }

Scala API by Example (annotated: key references)

    case class Edge(from: Int, to: Int)
    case class Triangle(apex: Int, base1: Int, base2: Int)

    val vertices = DataSource("hdfs:///...", CsvFormat[Edge])
    val byDegree = vertices map { projectToLowerDegree }
    val byID = byDegree map { (x) => if (x.from < x.to) x else Edge(x.to, x.from) }
    // Key references: the selector functions passed to groupBy, where, and isEqualTo
    val triads = byDegree groupBy { _.from } reduceGroup { buildTriads }
    val triangles = triads join byID
                      where { t => (t.base1, t.base2) }
                      isEqualTo { e => (e.from, e.to) }
                      map { (triangle, edge) => triangle }
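The two steps of the triangle example (enumerate candidate triads, then close them with a join against the edge set) can be sketched on plain Scala collections. This is an illustration under stated assumptions, not the Stratosphere program: `TrianglesSketch` is a hypothetical name, edges are oriented by vertex id rather than by degree (the slide's `projectToLowerDegree` and `buildTriads` are not shown in the source), and the join is replaced by a set lookup.

```scala
object TrianglesSketch {
  case class Edge(from: Int, to: Int)
  case class Triangle(apex: Int, base1: Int, base2: Int)

  def triangles(edges: Seq[Edge]): Set[Triangle] = {
    // Orient every edge from the smaller to the larger id and drop self-loops.
    // Assumption: id order stands in for the degree-based orientation on the slide.
    val byID = edges
      .filter(e => e.from != e.to)
      .map(e => if (e.from < e.to) e else Edge(e.to, e.from))
      .distinct
    val edgeSet = byID.toSet

    // Step 1: enumerate candidate triads, i.e. pairs of edges sharing the same apex.
    val triads = byID.groupBy(_.from).toSeq.flatMap { case (apex, es) =>
      val ends = es.map(_.to).sorted
      for {
        i <- ends.indices
        j <- (i + 1) until ends.length
      } yield Triangle(apex, ends(i), ends(j))
    }

    // Step 2: close as triangles, keeping triads whose base edge exists.
    triads.filter(t => edgeSet.contains(Edge(t.base1, t.base2))).toSet
  }
}
```

Because the apex is always the smallest id of the three, each triangle is reported once.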

Optimizing Programs
• Program optimization happens in two phases
  1. Data type and function code analysis inside the Scala Compiler
  2. Relational-style optimization of the data flow
[Figure: processing pipeline in three stages. Scala Compiler: Parser, Type Checker, Analyze Data Types, Generate Glue Code, Code Generation, Program. Stratosphere Optimizer: Instantiate, Optimize, Finalize Glue Code. Run Time: Create Schedule, Execution.]

Type Analysis/Code Gen
• Types and Key Selectors are mapped to flat schema
• Generated code for interaction with runtime

Mapping of Scala types to the flat schema:
• Primitive types, arrays, lists (Int, Double, Array[String], ...) → single value
• Tuples / classes → tuples
  o (a: Int, b: Int, c: String) → (a: Int, b: Int, c: String)
  o class T(x: Int, y: Long) → (x: Int, y: Long)
• Nested types → recursively flattened
  o class T(x: Int, y: Long) → (x: Int, y: Long)
  o class R(id: String, value: T) → (id: String, x: Int, y: Long)
• Recursive types → tuples (w/ BLOB for recursion)
  o class Node(id: Int, left: Node, right: Node) → (id: Int, left: BLOB, right: BLOB)
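The flattening rules above can be illustrated with a small sketch. It is not the analysis performed inside the Scala compiler; it only mimics the mapping on a hand-written class table. `FlattenSketch`, `Prim`, `Ref`, and the `Classes` table are all hypothetical names introduced for this illustration.

```scala
object FlattenSketch {
  sealed trait Ty
  case class Prim(name: String) extends Ty      // Int, Long, String, ...
  case class Ref(className: String) extends Ty  // a field whose type is another class

  // Hypothetical class table: class name -> ordered list of (field name, field type).
  type Classes = Map[String, List[(String, Ty)]]

  // Flatten a class into (field name, flat type) pairs.
  // `open` holds the classes currently being expanded, so a recursive
  // reference is detected and mapped to a BLOB instead of looping forever.
  def flatten(classes: Classes, className: String,
              open: Set[String] = Set.empty): List[(String, String)] =
    classes(className).flatMap {
      case (field, Prim(t))                             => List(field -> t)
      case (field, Ref(c)) if open(c) || c == className => List(field -> "BLOB")
      case (_, Ref(c))                                  => flatten(classes, c, open + className)
    }
}
```

With `T(x: Int, y: Long)` and `R(id: String, value: T)` in the table, flattening `R` yields the three fields `id`, `x`, `y`; flattening `Node(id, left: Node, right: Node)` yields `id` plus two BLOB fields, matching the mapping on the slide.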

Optimization

    case class Order(id: Int, priority: Int, ...)
    case class Item(id: Int, price: Double, ...)
    case class PricedOrder(id: Int, priority: Int, price: Double)

    val orders = DataSource(...)
    val items = DataSource(...)
    val filtered = orders filter { ... }
    val prio = filtered join items where { _.id } isEqualTo { _.id }
                 map { (o, li) => PricedOrder(o.id, o.priority, li.price) }
    val sales = prio groupBy { p => (p.id, p.priority) } aggregate ({ _.price }, SUM)

[Figure: two execution plans for this program. In both, Orders passes through Filter and is joined with Items on (0) = (0), followed by Grp/Agg on (0,1). The plans are annotated with partition(0), sort (0), sort (0,1), and (∅) on their edges.]
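For readers who want to trace what the program computes, here is a minimal sketch of the same filter, join, and grouped sum on plain Scala collections. It says nothing about how the Stratosphere optimizer picks a plan. `PricedOrdersSketch`, `sales`, and the `keep` predicate are hypothetical, and the elided fields of `Order` and `Item` are left out.

```scala
object PricedOrdersSketch {
  case class Order(id: Int, priority: Int)
  case class Item(id: Int, price: Double)
  case class PricedOrder(id: Int, priority: Int, price: Double)

  def sales(orders: Seq[Order], items: Seq[Item],
            keep: Order => Boolean): Map[(Int, Int), Double] = {
    val filtered = orders.filter(keep)

    // Equi-join on id: index the items by id, then probe with each order.
    val itemsById = items.groupBy(_.id)
    val prio = for {
      o  <- filtered
      li <- itemsById.getOrElse(o.id, Seq.empty)
    } yield PricedOrder(o.id, o.priority, li.price)

    // Group on (id, priority) and sum the prices per group.
    prio.groupBy(p => (p.id, p.priority))
      .map { case (key, group) => key -> group.map(_.price).sum }
  }
}
```

Note that the grouping key `(id, priority)` starts with the join key `id`, which is the kind of overlap the partition and sort annotations in the plan figure refer to.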

Iterative Programs
• Many programs have a loop and make multiple passes over the data
  o Machine Learning algorithms iteratively refine the model
  o Graph algorithms propagate information hop by hop
[Figure: loop outside the system (the client drives each Step) versus loop inside the system (one Iteration contains the Steps)]

Why Iterations
• Algorithms that need iterations
  o Clustering (K-Means, …)
  o Gradient descent
  o Page-Rank
  o Logistic Regression
  o Path algorithms on graphs (shortest paths, centralities, …)
  o Graph communities / dense sub-components
  o Inference (belief propagation)
  o …
• All the hot algorithms for building predictive models

Two Types of Iterations
• Bulk Iterations
  [Figure: Initial Dataset → Iterative Function → Result]
• Incremental Iterations (aka. Workset Iterations)
  [Figure: Initial Workset and Initial Solutionset → Iterative Function with State → Result]
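The difference between the two iteration types can be sketched with two small driver functions on plain Scala values. This is an assumption-laden illustration, not the Stratosphere runtime: `IterationSketch`, `bulkIterate`, and `worksetIterate` are hypothetical names, the solution set is modelled as a `Map`, and the delta is merged into it by key.

```scala
object IterationSketch {
  // Bulk iteration: the step function recomputes the whole dataset on every pass.
  def bulkIterate[A](initial: A, maxIterations: Int)(step: A => A): A =
    (1 to maxIterations).foldLeft(initial)((state, _) => step(state))

  // Workset iteration: the step sees the solution set (state) and the workset,
  // and returns only the changed entries (delta) plus the next workset.
  // The loop ends when the workset is empty or the iteration budget is used up.
  @annotation.tailrec
  def worksetIterate[K, V, W](solution: Map[K, V], workset: Seq[W], maxIterations: Int)(
      step: (Map[K, V], Seq[W]) => (Map[K, V], Seq[W])): Map[K, V] =
    if (workset.isEmpty || maxIterations == 0) solution
    else {
      val (delta, nextWorkset) = step(solution, workset)
      worksetIterate(solution ++ delta, nextWorkset, maxIterations - 1)(step)
    }
}
```

In the workset variant, the work per pass depends on the size of the workset, not on the size of the solution set.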

Iterations inside the System
[Figure: number of vertices (thousands) processed per iteration, iterations 0 to 34, naïve versus incremental. Caption: computations performed in each iteration for connected communities of a social graph.]
[Figure: runtime (secs) on the Twitter and Webbase (20) datasets.]

Iterative Program (Scala)

    def step = (s: DataSet[Vertex], ws: DataSet[Vertex]) => {
      // smallest candidate component per vertex
      val minNeighbor = ws groupBy { _.id } reduceGroup { x => x.minBy { _.component } }
      // c = candidate, o = old solution entry; keep only improvements
      val delta = minNeighbor join s where { _.id } isEqualTo { _.id }
                    flatMap { (c, o) => if (c.component < o.component) Some(c) else None }
      // propagate changed components along the edges
      val nextWs = delta join edges where { v => v.id } isEqualTo { e => e.from }
                     map { (v, e) => Vertex(e.to, v.component) }
      (delta, nextWs)
    }

    val components = vertices.iterateWithWorkset(initialWorkset, { _.id }, step)

Iterative Program (Scala) (annotated)

    // Define step function
    def step = (s: DataSet[Vertex], ws: DataSet[Vertex]) => {
      val minNeighbor = ws groupBy { _.id } reduceGroup { x => x.minBy { _.component } }
      val delta = minNeighbor join s where { _.id } isEqualTo { _.id }
                    flatMap { (c, o) => if (c.component < o.component) Some(c) else None }
      val nextWs = delta join edges where { v => v.id } isEqualTo { e => e.from }
                     map { (v, e) => Vertex(e.to, v.component) }
      // Return delta and next workset
      (delta, nextWs)
    }

    // Invoke iteration
    val components = vertices.iterateWithWorkset(initialWorkset, { _.id }, step)
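The connected-components step function can be traced on plain Scala collections. The sketch below follows the same three stages (minimum candidate per vertex, delta of improvements, next workset along the edges), but it is only an in-memory stand-in: `ComponentsSketch` and `connectedComponents` are hypothetical names, edges are treated as undirected, every vertex starts in its own component, and every edge endpoint is assumed to appear in `vertexIds`.

```scala
object ComponentsSketch {
  case class Vertex(id: Int, component: Int)
  case class Edge(from: Int, to: Int)

  def connectedComponents(vertexIds: Seq[Int], edges: Seq[Edge]): Map[Int, Int] = {
    // Assumption: the graph is undirected, so add the reverse of every edge.
    val undirected = edges ++ edges.map(e => Edge(e.to, e.from))
    val neighbors: Map[Int, Seq[Int]] =
      undirected.groupBy(_.from).map { case (v, es) => v -> es.map(_.to) }

    @annotation.tailrec
    def loop(solution: Map[Int, Int], workset: Seq[Vertex]): Map[Int, Int] =
      if (workset.isEmpty) solution
      else {
        // Smallest candidate component offered to each vertex.
        val minNeighbor = workset.groupBy(_.id).values.map(_.minBy(_.component))
        // Delta: candidates that improve on the current solution entry.
        val delta = minNeighbor.filter(c => c.component < solution(c.id)).toSeq
        // Next workset: offer each changed component to the vertex's neighbors.
        val nextWs = for {
          v  <- delta
          to <- neighbors.getOrElse(v.id, Seq.empty)
        } yield Vertex(to, v.component)
        loop(solution ++ delta.map(v => v.id -> v.component), nextWs)
      }

    // Initially every vertex is its own component and offers its id to its neighbors.
    val initialSolution = vertexIds.map(id => id -> id).toMap
    val initialWorkset = undirected.map(e => Vertex(e.to, e.from))
    loop(initialSolution, initialWorkset)
  }
}
```

Each pass touches only the vertices whose component changed in the previous pass, which is the effect the incremental curve in the earlier figure describes.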

Iterative Program (Java)

Graph Processing in Stratosphere
