

SLIDE 1

Spark RDD

SLIDE 2

Where are we?

Distributed storage in HDFS
MapReduce query execution in Hadoop
High-level data manipulation using Pig

SLIDE 3

A Step Back to MapReduce

Designed in the early 2000s
Machines were unreliable
Focused on fault tolerance
Addressed data-intensive applications
Limited memory

SLIDE 4

MapReduce in Practice

[Diagram: a chain of MapReduce jobs: rows of Map tasks (M M M …) feed a shuffle into rows of Reduce tasks (R R R …), and the output feeds the next job]

Can we improve on that?

SLIDE 5

Pig

Slightly improves disk I/O by consolidating map-only jobs

[Animated diagram: the map-only phase M1 is consolidated into the map phase M2 of the following job, before the shuffle and reduce]


SLIDE 9

Pig (at a higher level)

[Diagram: a Pig dataflow chaining FILTER, FOREACH, JOIN, and GROUP BY operators]

SLIDE 10

RDD

Resilient Distributed Datasets
A distributed query processing engine
The Spark counterpart to MapReduce
Designed for in-memory processing

SLIDE 11

In-memory Processing

The machine specs changed:

More reliable
Bigger memory

And the workload changed:

Analytical queries
Iterative operations (like ML)

The main idea: rather than storing intermediate results to disk, keep them in memory. How about fault tolerance?
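The "keep it in memory" idea can be sketched in plain Java, with the JDK standing in for Spark (the dataset and numbers below are made up for illustration): an intermediate result is computed once, held in memory, and reused by later stages instead of being written to and re-read from disk between jobs.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class InMemoryReuse {
    // Pretend this is an expensive stage whose output MapReduce
    // would have written to disk; here it is simply kept in memory.
    static List<Integer> expensiveStage() {
        return IntStream.rangeClosed(1, 100)
                .filter(x -> x % 2 == 0)
                .boxed()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> intermediate = expensiveStage(); // computed once, held in memory

        // Two later "stages" reuse the in-memory result directly,
        // instead of re-reading it from disk.
        long count = intermediate.size();
        int sum = intermediate.stream().mapToInt(Integer::intValue).sum();

        System.out.println(count); // 50
        System.out.println(sum);   // 2550
    }
}
```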

SLIDE 12

RDD Example

[Diagram: the same dataflow of FILTER, FOREACH, JOIN, and GROUP BY operators, with intermediate results held in memory (Mem) between them]

SLIDE 13

RDD Abstraction

RDD is a pointer to a distributed dataset
It stores information about how to compute the data rather than where the data is
Transformation: converts an RDD to another RDD
Action: returns the answer of an operation over an RDD
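The transformation/action split can be illustrated with java.util.stream, which uses the same lazy design (this is a JDK analogy, not the Spark API): intermediate operations only describe the computation, and nothing runs until a terminal operation asks for a result.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazyDemo {
    static final AtomicInteger evaluations = new AtomicInteger(0);

    // "Transformations": describe the computation; nothing runs yet.
    static Stream<Integer> buildPipeline() {
        return List.of(1, 2, 3, 4, 5).stream()
                .peek(x -> evaluations.incrementAndGet()) // counts actual evaluations
                .filter(x -> x % 2 == 1);
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = buildPipeline();
        System.out.println(evaluations.get()); // 0: nothing evaluated yet

        long odds = pipeline.count();          // the "action" forces evaluation
        System.out.println(odds);              // 3
        System.out.println(evaluations.get()); // 5: every element was visited
    }
}
```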

SLIDE 14

Spark RDD Features

Lazy execution: collect transformations and execute on actions (similar to Pig)
Lineage tracking: keep track of the lineage of each RDD for fault tolerance
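A minimal sketch of lineage-based fault tolerance in plain Java (the MiniRDD class and its methods are invented for illustration; Spark's internals differ): the object remembers how to compute its data rather than the data itself, so a lost in-memory copy can be rebuilt by re-running the lineage.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class LineageDemo {
    // Hypothetical mini-RDD: stores its computation, caches the result lazily.
    static class MiniRDD<T> {
        private final Supplier<List<T>> lineage; // how to compute the data
        private List<T> cached;                  // in-memory copy (may be lost)

        MiniRDD(Supplier<List<T>> lineage) { this.lineage = lineage; }

        List<T> collect() {
            if (cached == null) cached = lineage.get(); // (re)compute from lineage
            return cached;
        }

        void simulateNodeFailure() { cached = null; }   // drop the in-memory data
    }

    public static void main(String[] args) {
        MiniRDD<Integer> squares = new MiniRDD<>(() ->
                List.of(1, 2, 3).stream().map(x -> x * x).collect(Collectors.toList()));

        System.out.println(squares.collect()); // [1, 4, 9]
        squares.simulateNodeFailure();         // memory contents lost
        System.out.println(squares.collect()); // [1, 4, 9], rebuilt from lineage
    }
}
```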

SLIDE 15

RDD

[Diagram: an operation transforms one RDD into another RDD]

SLIDE 16

Filter Operation

[Diagram: a Filter operation computes each partition of the output RDD from the corresponding partition of the input RDD]

The projection operation (FOREACH in Pig) behaves similarly.

Narrow dependency

SLIDE 17

GroupBy (Shuffle) Operation

[Diagram: a GroupBy operation shuffles records, so each partition of the output RDD may depend on every partition of the input RDD]

Join is a similar operation.

Wide dependency

SLIDE 18

Types of Dependencies

Narrow dependencies
Wide dependencies

Credit: https://github.com/rohgar/scala-spark-4/wiki/Wide-vs-Narrow-Dependencies
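The distinction can be sketched in plain Java over a list of partitions (a JDK illustration with made-up data, not Spark code): under a narrow dependency each output partition is computed from a single input partition, while a wide dependency such as GroupBy needs records from every input partition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DependencyDemo {
    // Narrow dependency: output partition i is computed from input partition i only.
    static List<List<Integer>> mapPerPartition(List<List<Integer>> partitions) {
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> p : partitions) {
            List<Integer> mapped = new ArrayList<>();
            for (int x : p) mapped.add(x * 2);   // no data crosses partitions
            out.add(mapped);
        }
        return out;
    }

    // Wide dependency: every output group may need records from ALL partitions.
    static Map<Integer, List<Integer>> groupByParity(List<List<Integer>> partitions) {
        Map<Integer, List<Integer>> groups = new HashMap<>();
        for (List<Integer> p : partitions)
            for (int x : p)
                groups.computeIfAbsent(x % 2, k -> new ArrayList<>()).add(x);
        return groups;
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions =
                List.of(List.of(1, 2), List.of(3, 4), List.of(5, 6));
        System.out.println(mapPerPartition(partitions)); // [[2, 4], [6, 8], [10, 12]]
        System.out.println(groupByParity(partitions));   // parity 0 -> [2, 4, 6], parity 1 -> [1, 3, 5]
    }
}
```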

SLIDE 19

Examples of Transformations

map, flatMap, reduceByKey, filter, sample, join, union, partitionBy
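A few of these transformations have java.util.stream counterparts that show their shapes (a JDK analogy with made-up data; in Spark they run distributed and lazily):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TransformationShapes {
    // flatMap: one input record can produce many output records.
    static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(l -> Arrays.stream(l.split(" ")))
                .collect(Collectors.toList());
    }

    // map + reduceByKey shape: the classic word count.
    static Map<String, Long> wordCount(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> words = words(List.of("a b", "b c"));
        System.out.println(words);            // [a, b, b, c]
        System.out.println(wordCount(words)); // a -> 1, b -> 2, c -> 1

        // filter and union have the obvious stream counterparts:
        List<String> bs = words.stream().filter("b"::equals).collect(Collectors.toList());
        List<String> union = Stream.concat(words.stream(), bs.stream())
                .collect(Collectors.toList());
        System.out.println(union.size());     // 6
    }
}
```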

SLIDE 20

Examples of Actions

count, collect, save(path), persist, reduce
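What makes these "actions" is that each returns a concrete value rather than another RDD. The stream terminal operations below mirror count, collect, and reduce (a JDK analogy with made-up data; save and persist have no stream counterpart):

```java
import java.util.List;
import java.util.stream.Collectors;

public class ActionShapes {
    static final List<Integer> DATA = List.of(1, 2, 3, 4);

    static long count() { return DATA.stream().count(); }                  // like RDD count
    static List<Integer> collected() {
        return DATA.stream().map(x -> x + 1).collect(Collectors.toList()); // like collect
    }
    static int sum() { return DATA.stream().reduce(0, Integer::sum); }     // like reduce

    public static void main(String[] args) {
        System.out.println(count());     // 4
        System.out.println(collected()); // [2, 3, 4, 5]
        System.out.println(sum());       // 10
    }
}
```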

SLIDE 21

How RDD can be helpful

Consolidate operations: combine transformations
Iterative operations: keep the output of an iteration in memory till the next iteration
Data sharing: reuse the same data without having to read it multiple times
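The iterative pattern looks like this in plain Java (a JDK sketch with a made-up computation, not Spark code): each iteration reads the previous iteration's result directly from memory, where a chain of MapReduce jobs would write it to HDFS and read it back.

```java
import java.util.List;
import java.util.stream.Collectors;

public class IterativeDemo {
    // Toy "iterative refinement": halve every value, a fixed number of times.
    // Each iteration consumes the previous result straight from memory.
    static List<Double> iterate(List<Double> values, int iterations) {
        for (int i = 0; i < iterations; i++) {
            values = values.stream().map(v -> v / 2).collect(Collectors.toList());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(iterate(List.of(8.0, 16.0), 3)); // [1.0, 2.0]
    }
}
```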

SLIDE 22

Examples

// Initialize the Spark context
JavaSparkContext spark = new JavaSparkContext("local", "CS226-Demo");

SLIDE 23

Examples

// Initialize the Spark context
JavaSparkContext spark = new JavaSparkContext("local", "CS226-Demo");

// Hello World! example: count the number of lines in the file
JavaRDD<String> textFileRDD = spark.textFile("nasa_19950801.tsv");
long count = textFileRDD.count();
System.out.println("Number of lines is " + count);

SLIDE 24

Examples

// Count the number of OK lines
JavaRDD<String> okLines = textFileRDD.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String s) throws Exception {
        String code = s.split("\t")[5];
        return code.equals("200");
    }
});
long count = okLines.count();
System.out.println("Number of OK lines is " + count);

SLIDE 25

Examples

// Count the number of OK lines
// Shorten the implementation using lambdas (Java 8 and above)
JavaRDD<String> okLines = textFileRDD.filter(s -> s.split("\t")[5].equals("200"));
long count = okLines.count();
System.out.println("Number of OK lines is " + count);

SLIDE 26

Examples

// Make it parametrized by taking the response code as a command-line argument
String inputFileName = args[0];
String desiredResponseCode = args[1];
...
JavaRDD<String> textFileRDD = spark.textFile(inputFileName);
JavaRDD<String> okLines = textFileRDD.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String s) {
        String code = s.split("\t")[5];
        return code.equals(desiredResponseCode);
    }
});

SLIDE 27

Examples

// Count by response code
JavaPairRDD<Integer, String> linesByCode = textFileRDD.mapToPair(
        new PairFunction<String, Integer, String>() {
    @Override
    public Tuple2<Integer, String> call(String s) {
        String code = s.split("\t")[5];
        return new Tuple2<Integer, String>(Integer.valueOf(code), s);
    }
});
Map<Integer, Long> countByCode = linesByCode.countByKey();
System.out.println(countByCode);
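The per-line parsing that these examples share (split on tab, key on field index 5) can be exercised on its own in plain Java; the log lines below are made up in the shape of the NASA file, and groupingBy/counting plays the role of mapToPair + countByKey:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CountByCodeShape {
    // Same field extraction as the Spark examples: tab-separated, code in column 5.
    static int responseCode(String line) {
        return Integer.parseInt(line.split("\t")[5]);
    }

    // Same shape as mapToPair + countByKey, over an in-memory list.
    static Map<Integer, Long> countByCode(List<String> lines) {
        return lines.stream()
                .collect(Collectors.groupingBy(CountByCodeShape::responseCode,
                                               Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up lines with the response code in column index 5.
        List<String> lines = List.of(
                "h\tl\tt\tm\tu\t200\tok",
                "h\tl\tt\tm\tu\t404\tmiss",
                "h\tl\tt\tm\tu\t200\tok");
        System.out.println(countByCode(lines)); // counts: 200 -> 2, 404 -> 1
    }
}
```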

SLIDE 28

Further Reading

Spark home page: http://spark.apache.org/
Quick start: http://spark.apache.org/docs/latest/quick-start.html
RDD documentation: http://spark.apache.org/docs/latest/rdd-programming-guide.html

RDD paper: Matei Zaharia et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." NSDI '12
