An Introduction to Apache Spark
Amir H. Payberah
amir@sics.se
SICS Swedish ICT
Feb. 2, 2016
Big Data: from small data to big data.
◮ Scale up, or scale vertically.
◮ Scale out, or scale horizontally.
◮ We have a huge text document.
◮ Count the number of times each distinct word appears in the file.
◮ Application: analyze web server logs to find popular URLs.
◮ The file is too large for memory, but all (word, count) pairs fit in memory.
◮ words(doc.txt) | sort | uniq -c
  where words takes a file and outputs the words in it, one per line.
◮ words(doc.txt) | sort | uniq -c
◮ Sequentially read a lot of data.
◮ Map: extract something you care about.
◮ Group by key: sort and shuffle.
◮ Reduce: aggregate, summarize, filter or transform.
◮ Write the result.
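A minimal single-machine sketch of these steps in plain Scala (this is an illustration of the read / map / group / reduce / write pattern, not Hadoop itself; doc.txt is a placeholder path):

import scala.io.Source

// Read: sequentially read the data, one word at a time.
val words = Source.fromFile("doc.txt").getLines().flatMap(_.split(" "))

// Map + group by key + reduce: count the occurrences of each word.
val counts = words.toSeq.groupBy(identity).map { case (w, ws) => (w, ws.size) }

// Write: print the result.
counts.foreach(println)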
◮ Consider doing a word count of the following file using MapReduce:

Hello World Bye World
Hello Hadoop Goodbye Hadoop
◮ The map function reads in words one at a time and outputs (word, 1) for each parsed input word.
◮ The map function output is:
(Hello, 1) (World, 1) (Bye, 1) (World, 1) (Hello, 1) (Hadoop, 1) (Goodbye, 1) (Hadoop, 1)
◮ The shuffle phase between the map and reduce phases creates a list of values associated with each key.
◮ The reduce function input is:
(Bye, (1)) (Goodbye, (1)) (Hadoop, (1, 1)) (Hello, (1, 1)) (World, (1, 1))
◮ The reduce function sums the numbers in the list for each key and outputs (word, count) pairs.
◮ The output of the reduce function, which is also the output of the MapReduce job, is:
(Bye, 1) (Goodbye, 1) (Hadoop, 2) (Hello, 2) (World, 2)
public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    // Emit (word, 1) for every token in the line.
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}
public static class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    // Sum all the counts emitted for this key.
    for (IntWritable value : values)
      sum += value.get();
    context.write(key, new IntWritable(sum));
  }
}
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "wordcount");

  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);

  job.setMapperClass(MyMap.class);
  job.setCombinerClass(MyReduce.class);
  job.setReducerClass(MyReduce.class);

  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);

  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));

  job.waitForCompletion(true);
}
◮ Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.
◮ Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.
◮ MapReduce greatly simplified big data analysis on large, unreliable clusters.
◮ The MapReduce programming model has not been designed for complex computations.
◮ Very expensive (slow), i.e., it always goes to disk and HDFS.
◮ Extends MapReduce with more operators.
◮ Support for advanced data flow graphs.
◮ In-memory and out-of-core processing.
◮ A distributed memory abstraction.
◮ Immutable collections of objects spread across a cluster.
◮ An RDD is divided into a number of partitions, which are atomic pieces of information.
◮ Partitions of an RDD can be stored on different nodes of a cluster.
◮ Built through coarse-grained transformations, e.g., map, filter, join.
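A small sketch of how partitions show up in the API (the four-partition split and the collection contents are illustrative choices, not anything prescribed by Spark):

// Create an RDD from a local collection, explicitly split into 4 partitions.
val nums = sc.parallelize(1 to 100, 4)
nums.partitions.length                 // 4

// A coarse-grained transformation: the same function is applied to every
// element, partition by partition, possibly on different nodes.
val doubled = nums.map(_ * 2)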
◮ Job description based on directed acyclic graphs (DAG).
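A small sketch of inspecting the DAG that a chain of transformations builds; the HDFS path is a placeholder:

val lines  = sc.textFile("hdfs://...")            // placeholder path
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

// toDebugString prints the chain (DAG) of RDDs behind 'counts'.
println(counts.toDebugString)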
◮ Turn a collection into an RDD.

val a = sc.parallelize(Array(1, 2, 3))

◮ Load a text file from the local FS, HDFS, or S3.

val a = sc.textFile("file.txt")
val b = sc.textFile("directory/*.txt")
val c = sc.textFile("hdfs://namenode:9000/path/file")
◮ RDD operators are higher-order functions.
◮ There are two types of RDD operators: transformations and actions.
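A minimal sketch of the difference: a transformation returns a new RDD and is evaluated lazily, while an action returns a value to the driver and triggers execution.

val nums    = sc.parallelize(Array(1, 2, 3))
val squares = nums.map(x => x * x)    // transformation: returns a new RDD, nothing runs yet
val total   = squares.reduce(_ + _)   // action: triggers the computation and returns 14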
◮ All pairs are independently processed.

// passing each element through a function
val nums = sc.parallelize(Array(1, 2, 3))
val squares = nums.map(x => x * x)          // {1, 4, 9}

// selecting those elements that func returns true
val even = squares.filter(x => x % 2 == 0)  // {4}

// mapping each element to zero or more others
nums.flatMap(x => Range(0, x, 1))           // {0, 0, 1, 0, 1, 2}
◮ Pairs with identical key are grouped.
◮ Groups are independently processed.

val pets = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))
pets.reduceByKey((x, y) => x + y)   // {(cat, 3), (dog, 1)}
pets.groupByKey()                   // {(cat, (1, 2)), (dog, (1))}
◮ Performs an equi-join on the key.
◮ Join candidates are independently processed.

val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),
                                ("about.html", "3.4.5.6"),
                                ("index.html", "1.3.3.1")))
val pageNames = sc.parallelize(Seq(("index.html", "Home"),
                                   ("about.html", "About")))
visits.join(pageNames)
// ("index.html", ("1.2.3.4", "Home"))
// ("index.html", ("1.3.3.1", "Home"))
// ("about.html", ("3.4.5.6", "About"))
◮ Return all the elements of the RDD as an array.

val nums = sc.parallelize(Array(1, 2, 3))
nums.collect()   // Array(1, 2, 3)

◮ Return an array with the first n elements of the RDD.

nums.take(2)     // Array(1, 2)

◮ Return the number of elements in the RDD.

nums.count()     // 3
◮ Aggregate the elements of the RDD using the given function.

nums.reduce((x, y) => x + y)   // 6
nums.reduce(_ + _)             // 6

◮ Write the elements of the RDD as a text file.

nums.saveAsTextFile("hdfs://file.txt")
◮ Main entry point to Spark functionality.
◮ Available in the shell as the variable sc.
◮ In standalone programs, you should make your own.

val sc = new SparkContext(master, appName, [sparkHome], [jars])
val textFile = sc.textFile("hdfs://...")
val words = textFile.flatMap(line => line.split(" "))
val ones = words.map(word => (word, 1))
val counts = ones.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
◮ Lineage: the transformations used to build an RDD.
◮ RDDs are stored as a chain of objects capturing the lineage of each RDD.

val file = sc.textFile("hdfs://...")
val sics = file.filter(_.contains("SICS"))
val cachedSics = sics.cache()
val ones = cachedSics.map(_ => 1)
val count = ones.reduce(_+_)
1. Connects to a cluster manager, which allocates resources across applications.
2. Acquires executors on cluster nodes (worker processes) to run computations and store data.
3. Sends app code to the executors.
4. Sends tasks for the executors to run.
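A hedged sketch of what this looks like in a standalone program; the master URL and application name below are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// 1. Point the driver at a cluster manager (here a standalone master; URL is a placeholder).
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("MyApp")

// 2-4. Creating the SparkContext acquires executors on the workers; the app
// code and the tasks of each job are then shipped to those executors.
val sc = new SparkContext(conf)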
◮ A DataFrame is a distributed collection of rows with a homogeneous schema.
◮ It is equivalent to a table in a relational database.
◮ It can also be manipulated in similar ways to RDDs.
◮ DataFrames are lazy.
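A small sketch of that laziness, assuming the df with name and age columns used in the following examples: transformations only build a logical plan, and nothing is computed until an action such as show() runs.

val adults = df.filter(df("age") > 21)   // transformation: only builds a logical plan
adults.explain()                         // print the plan Spark will run
adults.show()                            // action: now the query actually executes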
◮ Spark + RDD: functional transformations on partitioned collections of objects.
◮ SQL + DataFrame: declarative transformations on partitioned collections of tuples.
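A small sketch contrasting the two styles on the same task, selecting people older than 21; the Person case class and the sample rows are assumptions borrowed from the surrounding examples, and an existing SQLContext is assumed for toDF():

import sqlContext.implicits._   // assumes an existing SQLContext

case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Andy", 30), Person("Justin", 19)))

// RDD style: functional, you spell out how to compute the result.
val adultNamesRDD = people.filter(p => p.age > 21).map(_.name)

// DataFrame style: declarative, Spark's optimizer decides how to run it.
val df = people.toDF()
val adultNamesDF = df.filter(df("age") > 21).select("name")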
◮ The entry point into all functionality in Spark SQL is the SQLContext.

val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json(...)
◮ Domain-specific language for structured data manipulation.

// Show the content of the DataFrame
df.show()
// age  name
// null Michael
// 30   Andy
// 19   Justin

// Print the schema in a tree format
df.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

// Select only the "name" column
df.select("name").show()
// name
// Michael
// Andy
// Justin
◮ Domain-specific language for structured data manipulation.

// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

// Select people older than 21
df.filter(df("age") > 21).show()
// age name
// 30  Andy

// Count people by age
df.groupBy("age").count().show()
// age  count
// null 1
// 19   1
// 30   1
◮ Run SQL queries programmatically and return the result as a DataFrame.
◮ Use the sql function on a SQLContext.

val sqlContext = ... // An existing SQLContext
val df = sqlContext.sql("SELECT * FROM table")
// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile(...).map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext
  .sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are DataFrames.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect()
  .foreach(println)
◮ Many applications must process large streams of live data and provide results in real-time.
◮ Stream Processing Systems (SPS): data-in-motion analytics
◮ Database Management Systems (DBMS): data-at-rest analytics
◮ DBMS: persistent data where updates are relatively infrequent.
◮ SPS: transient data that is continuously updated.
◮ DBMS: runs queries just once to return a complete answer.
◮ SPS: executes standing queries, which run continuously and provide updated answers as new data arrives.
◮ Data source: producer of streaming data.
◮ Data sink: consumer of results.
◮ A data stream is unbounded and broken into a sequence of individual data items, called tuples.
◮ Run a streaming computation as a series of very small, deterministic batch jobs.
◮ The live input stream is chopped into small batches.
◮ DStream: a sequence of RDDs representing a stream of data.
◮ Initializing Spark Streaming:

val ssc = new StreamingContext(master, appName, batchDuration, [sparkHome], [jars])
◮ Transformations: modify data from one DStream to a new DStream.
◮ Window operations: group all the records from a sliding window of past time intervals into one RDD, e.g., window, reduceByKeyAndWindow, ...
  - Window length: the duration of the window.
  - Slide interval: the interval at which the window operation is performed.
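A small sketch of a windowed count; the host, port, 60-second window, and 10-second slide are illustrative values, not anything prescribed by Spark Streaming:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("local[2]", "WindowExample", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)   // placeholder host/port
val words = lines.flatMap(_.split(" "))

// Window length 60s, slide interval 10s: every 10 seconds, count the
// words received during the last 60 seconds.
words.window(Seconds(60), Seconds(10)).countByValue().print()

ssc.start()
ssc.awaitTermination()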
◮ Get hash-tags from Twitter.

val ssc = new StreamingContext("local[2]", "test", Seconds(1))
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
◮ Count the frequency of words received in the last minute.

val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1))
val lines = ssc.socketTextStream(args(1), args(2).toInt)
val words = lines.flatMap(_.split(" "))
val ones = words.map(x => (x, 1))
val freqs_60s = ones.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(1))
◮ How to store and process big data? Scale up vs. scale out.
◮ Cluster programming model: dataflow.
◮ Spark: RDD (transformations and actions).
◮ Spark SQL: DataFrame (RDD + schema).
◮ Spark Streaming: DStream (sequence of RDDs).