Spark: Resilient Distributed Datasets as Workflow System
- H. Andrew Schwartz
CSE545 Spring 2020
Big Data Analytics, The Class
Goal: Generalizations, i.e., a model or summarization of the data.
Data Frameworks: Hadoop File System, MapReduce, Spark, TensorFlow
Algorithms and Analyses: Similarity Search, Recommendation Systems, Graph Analysis, Deep Learning, Streaming, Hypothesis Testing
MapReduce data flow: DFS -> Map -> LocalFS -> Network -> Reduce -> DFS -> Map -> ...
Spark helps anytime MapReduce would need to write to and read from disk a lot.
Resilient Distributed Datasets (RDDs): a read-only, partitioned collection of records (like a DFS), but with a record of how the dataset was created as a combination of transformations from other dataset(s).

For example:
RDD1: created from dfs://filename (DATA)
RDD2: transformation1 from RDD1 (DATA)
RDD3: transformation2 from RDD2 (DATA)
Earlier RDDs can drop their data; data is kept in memory (and only in memory) if needed and there is enough space. The result is faster communication and I/O.

An RDD is created from either:
“Stable Storage” (e.g., a file in the DFS), or
Other RDDs, via transformations (map, filter, join, ...).
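A minimal PySpark sketch of both creation routes (the file path and app name here are hypothetical, not from the original slides):

Python:
from pyspark import SparkContext

sc = SparkContext(appName="rddCreationDemo")
# From "Stable Storage" (hypothetical DFS path):
lines = sc.textFile("hdfs://namenode/data/input.txt")
# From other RDDs, via transformations:
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda w: (w, 1))
# Nothing has been read or computed yet; Spark has only recorded how to build each RDD.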
Later, calling transformation3() on RDD2 creates RDD4 (DATA). If RDD2 had dropped its data, Spark will recreate it from the recorded transformations.
An RDD's partitions each hold multiple records and are distributed across machines.

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.” NSDI 2012, April 2012.
Common transformations: filter, map, flatMap, reduceByKey, groupByKey
(http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations)
Common actions: collect, count, take
(http://spark.apache.org/docs/latest/rdd-programming-guide.html#actions)
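To make the transformation/action distinction concrete, a small sketch (hypothetical input path; assumes the usual SparkContext sc, as in the slides):

Python:
lines = sc.textFile("hdfs://.../log.txt")          # hypothetical input
longLines = lines.filter(lambda l: len(l) > 80)    # transformation: lazy, returns an RDD
n = longLines.count()                              # action: triggers execution, returns a number
first3 = longLines.take(3)                         # action: returns 3 records to the driver
everything = longLines.collect()                   # action: pulls the whole RDD to the driver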
Count errors in a log file (fields: TYPE, MESSAGE, TIME):
lines -> filter(_.startsWith(“ERROR”)) -> errors -> count()

Pseudocode:
lines = sc.textFile(“dfs:...”)
errors = lines.filter(_.startsWith(“ERROR”))
errors.count()
Collect times of HDFS-related errors (fields: TYPE, MESSAGE, TIME):
lines -> filter(_.startsWith(“ERROR”)) -> errors -> filter(_.contains(“HDFS”)) -> HDFS errors -> map(_.split(‘\t’)(3)) -> time fields -> collect()

Pseudocode:
lines = sc.textFile(“dfs:...”)
errors = lines.filter(_.startsWith(“ERROR”))
errors.persist()
errors.count()
...
Persistence: one can specify that an RDD “persists” in memory so other queries can use it, and one can specify a priority for persistence; a lower priority means the RDD moves to disk (if needed) earlier. The parameters for persist control this, as sketched below.
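A minimal sketch of those persist parameters in PySpark; loosely, the slide's “priority” corresponds to choosing a StorageLevel (the path below is hypothetical):

Python:
from pyspark import StorageLevel

lines = sc.textFile("hdfs://.../log.txt")               # hypothetical path
errors = lines.filter(lambda l: l.startswith("ERROR"))
errors.persist(StorageLevel.MEMORY_ONLY)        # memory only; dropped and recomputed if space runs out
# errors.persist(StorageLevel.MEMORY_AND_DISK)  # lower priority: spills to disk when memory fills
errors.count()  # the first action materializes and caches the partitions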
Completing the example:

Pseudocode:
lines = sc.textFile(“dfs:...”)
errors = lines.filter(_.startsWith(“ERROR”))
errors.persist()
errors.count()
errors.filter(_.contains(“HDFS”))
      .map(_.split(‘\t’)(3))
      .collect()

Note the functional programming style: each step passes a function to a transformation or action.
Each link in this chain of transformations is part of the RDD's “lineage” (MMDSv3).
Spark, as a workflow system, also supports:
○ loops (even though it is not a “cyclic” workflow system; see the sketch below)
○ function libraries
Gupta, Manish. Lightening Fast Big Data Analytics using Apache Spark. UniCom 2014.
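As a sketch of loop support, assuming a SparkContext sc and made-up data (not from the slides):

Python:
data = sc.parallelize(range(100000)).persist()  # cached, so each iteration reuses memory
total = 0
for i in range(10):
    # each pass runs over the in-memory RDD rather than re-reading from disk
    total += data.map(lambda x: x * i).sum()
print(total)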
Word Count

textFile -> flatMap(split(“ ”)) -> (words) -> map((word, 1)) -> tuples of (word, 1) -> reduceByKey(_ + _) -> tuples of (word, count) -> saveAsTextFile
(Apache Spark Examples: http://spark.apache.org/examples.html)

Scala:
val textFile = sc.textFile("hdfs://...")
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Python:
textFile = sc.textFile("hdfs://...")
counts = (textFile
    .flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs://...")
Example dataset: https://data.worldbank.org/data-catalog/poverty-and-equity-database
Spark waits to load data and execute transformations until necessary: lazy. Spark tries to complete actions as quickly as possible: eager. Why? Laziness lets Spark optimize the chain of operations, e.g.:

rdd.map(lambda r: r[1]*r[3]).take(5)  # only executes the map on enough data to return five records
rdd.filter(lambda r: "ERROR" in r[0]).map(lambda r: r[1]*r[3])  # only passes through the data once
Broadcast Variables: read-only objects can be shared across all nodes. The broadcast variable is a wrapper: access the object with .value.

Python:
filterWords = ['one', 'two', 'three', 'four', ...]
fwBC = sc.broadcast(set(filterWords))
textFile = sc.textFile("hdfs:...")
counts = (textFile
    .flatMap(lambda line: line.split(" "))    # one record per word
    .filter(lambda word: word in fwBC.value)  # keep only the broadcast filter words
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:...")
Accumulators: write-only objects that keep a running aggregation. The default Accumulator assumes a sum function:

Python:
initialValue = 0
sumAcc = sc.accumulator(initialValue)
rdd.foreach(lambda i: sumAcc.add(i))
print(sumAcc.value)

Custom Accumulator: inherit from AccumulatorParam as a class and override its methods:

Python:
import numpy as np
from pyspark import AccumulatorParam

class MinAccum(AccumulatorParam):
    def zero(self, zeroValue=np.inf):  # override this
        return zeroValue
    def addInPlace(self, v1, v2):      # override this
        return min(v1, v2)

minAcc = sc.accumulator(np.inf, MinAccum())
rdd.foreach(lambda i: minAcc.add(i))
print(minAcc.value)
Fault tolerance: Spark records the lineage needed to recreate each RDD rather than backing up the data itself. Recomputing lost partitions from lineage is typically much faster. This works because of the read-only nature of RDDs.
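One can actually inspect the recorded lineage; a small sketch with PySpark's toDebugString (hypothetical path; the output format varies by Spark version):

Python:
lines = sc.textFile("hdfs://.../log.txt")
errors = lines.filter(lambda l: l.startswith("ERROR"))
# The chain of dependencies Spark would replay to rebuild lost partitions:
print(errors.toDebugString().decode())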
Spark Architecture
[Diagram: a Driver coordinates many Executors; each Executor has Cores (slots), Working Memory, Storage Memory, and Disk.]
Driver: an eager action sets off the (lazy) chain of transformations.
Executor: technically, a virtual machine with slots for scheduling tasks; in practice, one task is run per slot at a time.
Core: a “slot” for a task on a partition.
Working Memory: for executing tasks.
Storage Memory: for storing persisted RDDs.
Disk: for reading from the DFS; for disk-persisted RDDs; extra space for shuffles.

Transformations come in two types (sketched below):
1) Narrow: record in -> process -> record[s] out
2) Wide: records in -> shuffle: regroup across cluster -> process -> record[s] out
Image from Nguyen: https://trongkhoanguyen.com/spark/understand-rdd-operations-transformations-and-actions/
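A hedged sketch of the two types in PySpark (hypothetical data; the shuffle occurs only at reduceByKey):

Python:
logs = sc.textFile("hdfs://.../logs", minPartitions=4)       # hypothetical path
lower = logs.map(lambda line: line.lower())                   # narrow: partition -> partition
errs = lower.filter(lambda line: line.startswith("error"))    # narrow
counts = (errs.map(lambda line: (line.split("\t")[0], 1))     # narrow
              .reduceByKey(lambda a, b: a + b))               # wide: shuffles across the cluster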
Co-partitions: two RDDs are co-partitioned if their partitions are based on the same hash function and key; joining them then needs no extra shuffle (a minimal sketch follows).
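A minimal sketch with toy data: both pair RDDs use the default hash partitioner with the same partition count, so they are co-partitioned:

Python:
a = sc.parallelize([(1, "x"), (2, "y"), (3, "z")]).partitionBy(8)
b = sc.parallelize([(1, "p"), (3, "q")]).partitionBy(8)
joined = a.join(b)        # co-partitioned inputs: no re-shuffle of a or b
print(joined.collect())   # e.g. [(1, ('x', 'p')), (3, ('z', 'q'))]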
An eager action sets off a (lazy) chain of transformations, which Spark organizes as:
Jobs: a series of transformations (in a DAG) needed for the action.
Stages: 1 or more per job; 1 per set of operations separated by a shuffle.
Tasks: many per stage; each repeats the exact same operation on its own partition.
[Diagram: a Job divides into Stages at each shuffle; within a Stage, one Task runs per partition, each task on its own core/thread.]
Image from Nguyen: https://trongkhoanguyen.com/spark/understand-rdd-operations-transformations-and-actions/
Spark is typically faster than MapReduce because:
○ RDDs live in memory
○ Lazy evaluation enables optimizing the chain of operations.
However: MapReduce handles very large data cheaply and reliably, while Spark needs ample memory (RAM) to fully leverage its speed. Thus, MapReduce may sometimes be more cost-effective for very large data that does not fit in memory.