

SLIDE 1

Spark RDD

SLIDE 2

Where are we?

Distributed storage in HDFS
MapReduce query execution in Hadoop
High-level data manipulation using Pig

SLIDE 3

A Step Back to MapReduce

Designed in the early 2000s
Machines were unreliable
Focused on fault tolerance
Addressed data-intensive applications
Limited memory

SLIDE 4

MapReduce in Practice

[Diagram: a chain of MapReduce jobs: rows of Map tasks (M M M …) feed a shuffle into rows of Reduce tasks (R R R …), and the output feeds the next job]

Can we improve on that?

SLIDE 5

Pig

Slightly improves disk I/O by consolidating map-only jobs

[Animated diagram: the map-only phase M1 is consolidated into the map phase M2 of the following job, before the shuffle and reduce]


SLIDE 9

Pig (at a higher level)

[Diagram: a Pig dataflow chaining FILTER, FOREACH, JOIN, and GROUP BY operators]

SLIDE 10

RDD

Resilient Distributed Datasets
A distributed query processing engine
The Spark counterpart to MapReduce
Designed for in-memory processing

SLIDE 11

In-memory Processing

The machine specs changed:

More reliable
Bigger memory

And the workload changed:

Analytical queries
Iterative operations (like ML)

The main idea: rather than storing intermediate results to disk, keep them in memory. How about fault tolerance?
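The "keep it in memory" idea can be sketched in plain Java, with the JDK standing in for Spark (the dataset and numbers below are made up for illustration): an intermediate result is computed once, held in memory, and reused by later stages instead of being written to and re-read from disk between jobs.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class InMemoryReuse {
    // Pretend this is an expensive stage whose output MapReduce
    // would have written to disk; here it is simply kept in memory.
    static List<Integer> expensiveStage() {
        return IntStream.rangeClosed(1, 100)
                .filter(x -> x % 2 == 0)
                .boxed()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> intermediate = expensiveStage(); // computed once, held in memory

        // Two later "stages" reuse the in-memory result directly,
        // instead of re-reading it from disk.
        long count = intermediate.size();
        int sum = intermediate.stream().mapToInt(Integer::intValue).sum();

        System.out.println(count); // 50
        System.out.println(sum);   // 2550
    }
}
```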

SLIDE 12

RDD Example

[Diagram: the same dataflow of FILTER, FOREACH, JOIN, and GROUP BY operators, with intermediate results held in memory (Mem) between them]

SLIDE 13

RDD Abstraction

RDD is a pointer to a distributed dataset
It stores information about how to compute the data rather than where the data is
Transformation: converts an RDD to another RDD
Action: returns the answer of an operation over an RDD
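The transformation/action split can be illustrated with java.util.stream, which uses the same lazy design (this is a JDK analogy, not the Spark API): intermediate operations only describe the computation, and nothing runs until a terminal operation asks for a result.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazyDemo {
    static final AtomicInteger evaluations = new AtomicInteger(0);

    // "Transformations": describe the computation; nothing runs yet.
    static Stream<Integer> buildPipeline() {
        return List.of(1, 2, 3, 4, 5).stream()
                .peek(x -> evaluations.incrementAndGet()) // counts actual evaluations
                .filter(x -> x % 2 == 1);
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = buildPipeline();
        System.out.println(evaluations.get()); // 0: nothing evaluated yet

        long odds = pipeline.count();          // the "action" forces evaluation
        System.out.println(odds);              // 3
        System.out.println(evaluations.get()); // 5: every element was visited
    }
}
```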

SLIDE 14

Spark RDD Features

Lazy execution: collect transformations and execute on actions (similar to Pig)
Lineage tracking: keep track of the lineage of each RDD for fault tolerance
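A minimal sketch of lineage-based fault tolerance in plain Java (the MiniRDD class and its methods are invented for illustration; Spark's internals differ): the object remembers how to compute its data rather than the data itself, so a lost in-memory copy can be rebuilt by re-running the lineage.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class LineageDemo {
    // Hypothetical mini-RDD: stores its computation, caches the result lazily.
    static class MiniRDD<T> {
        private final Supplier<List<T>> lineage; // how to compute the data
        private List<T> cached;                  // in-memory copy (may be lost)

        MiniRDD(Supplier<List<T>> lineage) { this.lineage = lineage; }

        List<T> collect() {
            if (cached == null) cached = lineage.get(); // (re)compute from lineage
            return cached;
        }

        void simulateNodeFailure() { cached = null; }   // drop the in-memory data
    }

    public static void main(String[] args) {
        MiniRDD<Integer> squares = new MiniRDD<>(() ->
                List.of(1, 2, 3).stream().map(x -> x * x).collect(Collectors.toList()));

        System.out.println(squares.collect()); // [1, 4, 9]
        squares.simulateNodeFailure();         // memory contents lost
        System.out.println(squares.collect()); // [1, 4, 9], rebuilt from lineage
    }
}
```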

SLIDE 15

RDD

[Diagram: an operation transforms one RDD into another RDD]

SLIDE 16

Filter Operation

[Diagram: a Filter operation computes each partition of the output RDD from the corresponding partition of the input RDD]

The projection operation (FOREACH in Pig) behaves similarly.

Narrow dependency

SLIDE 17

GroupBy (Shuffle) Operation

[Diagram: a GroupBy operation shuffles records, so each partition of the output RDD may depend on every partition of the input RDD]

Join is a similar operation.

Wide dependency

SLIDE 18

Types of Dependencies

Narrow dependencies
Wide dependencies

Credit: https://github.com/rohgar/scala-spark-4/wiki/Wide-vs-Narrow-Dependencies
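The distinction can be sketched in plain Java over a list of partitions (a JDK illustration with made-up data, not Spark code): under a narrow dependency each output partition is computed from a single input partition, while a wide dependency such as GroupBy needs records from every input partition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DependencyDemo {
    // Narrow dependency: output partition i is computed from input partition i only.
    static List<List<Integer>> mapPerPartition(List<List<Integer>> partitions) {
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> p : partitions) {
            List<Integer> mapped = new ArrayList<>();
            for (int x : p) mapped.add(x * 2);   // no data crosses partitions
            out.add(mapped);
        }
        return out;
    }

    // Wide dependency: every output group may need records from ALL partitions.
    static Map<Integer, List<Integer>> groupByParity(List<List<Integer>> partitions) {
        Map<Integer, List<Integer>> groups = new HashMap<>();
        for (List<Integer> p : partitions)
            for (int x : p)
                groups.computeIfAbsent(x % 2, k -> new ArrayList<>()).add(x);
        return groups;
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions =
                List.of(List.of(1, 2), List.of(3, 4), List.of(5, 6));
        System.out.println(mapPerPartition(partitions)); // [[2, 4], [6, 8], [10, 12]]
        System.out.println(groupByParity(partitions));   // parity 0 -> [2, 4, 6], parity 1 -> [1, 3, 5]
    }
}
```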

SLIDE 19

Examples of Transformations

map, flatMap, reduceByKey, filter, sample, join, union, partitionBy
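A few of these transformations have java.util.stream counterparts that show their shapes (a JDK analogy with made-up data; in Spark they run distributed and lazily):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TransformationShapes {
    // flatMap: one input record can produce many output records.
    static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(l -> Arrays.stream(l.split(" ")))
                .collect(Collectors.toList());
    }

    // map + reduceByKey shape: the classic word count.
    static Map<String, Long> wordCount(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> words = words(List.of("a b", "b c"));
        System.out.println(words);            // [a, b, b, c]
        System.out.println(wordCount(words)); // a -> 1, b -> 2, c -> 1

        // filter and union have the obvious stream counterparts:
        List<String> bs = words.stream().filter("b"::equals).collect(Collectors.toList());
        List<String> union = Stream.concat(words.stream(), bs.stream())
                .collect(Collectors.toList());
        System.out.println(union.size());     // 6
    }
}
```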

SLIDE 20

Examples of Actions

count, collect, save(path), persist, reduce
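What makes these "actions" is that each returns a concrete value rather than another RDD. The stream terminal operations below mirror count, collect, and reduce (a JDK analogy with made-up data; save and persist have no stream counterpart):

```java
import java.util.List;
import java.util.stream.Collectors;

public class ActionShapes {
    static final List<Integer> DATA = List.of(1, 2, 3, 4);

    static long count() { return DATA.stream().count(); }                  // like RDD count
    static List<Integer> collected() {
        return DATA.stream().map(x -> x + 1).collect(Collectors.toList()); // like collect
    }
    static int sum() { return DATA.stream().reduce(0, Integer::sum); }     // like reduce

    public static void main(String[] args) {
        System.out.println(count());     // 4
        System.out.println(collected()); // [2, 3, 4, 5]
        System.out.println(sum());       // 10
    }
}
```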

SLIDE 21

How RDD can be helpful

Consolidate operations: combine transformations
Iterative operations: keep the output of an iteration in memory till the next iteration
Data sharing: reuse the same data without having to read it multiple times
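The iterative pattern looks like this in plain Java (a JDK sketch with a made-up computation, not Spark code): each iteration reads the previous iteration's result directly from memory, where a chain of MapReduce jobs would write it to HDFS and read it back.

```java
import java.util.List;
import java.util.stream.Collectors;

public class IterativeDemo {
    // Toy "iterative refinement": halve every value, a fixed number of times.
    // Each iteration consumes the previous result straight from memory.
    static List<Double> iterate(List<Double> values, int iterations) {
        for (int i = 0; i < iterations; i++) {
            values = values.stream().map(v -> v / 2).collect(Collectors.toList());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(iterate(List.of(8.0, 16.0), 3)); // [1.0, 2.0]
    }
}
```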

SLIDE 22

Examples

// Initialize the Spark context
JavaSparkContext spark = new JavaSparkContext("local", "CS226-Demo");

SLIDE 23

Examples

// Initialize the Spark context
JavaSparkContext spark = new JavaSparkContext("local", "CS226-Demo");

// Hello World! example: count the number of lines in the file
JavaRDD<String> textFileRDD = spark.textFile("nasa_19950801.tsv");
long count = textFileRDD.count();
System.out.println("Number of lines is " + count);

SLIDE 24

Examples

// Count the number of OK lines
JavaRDD<String> okLines = textFileRDD.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String s) throws Exception {
        String code = s.split("\t")[5];
        return code.equals("200");
    }
});
long count = okLines.count();
System.out.println("Number of OK lines is " + count);

SLIDE 25

Examples

// Count the number of OK lines
// Shorten the implementation using lambdas (Java 8 and above)
JavaRDD<String> okLines = textFileRDD.filter(s -> s.split("\t")[5].equals("200"));
long count = okLines.count();
System.out.println("Number of OK lines is " + count);

SLIDE 26

Examples

// Make it parametrized by taking the response code as a command-line argument
String inputFileName = args[0];
String desiredResponseCode = args[1];
...
JavaRDD<String> textFileRDD = spark.textFile(inputFileName);
JavaRDD<String> okLines = textFileRDD.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String s) {
        String code = s.split("\t")[5];
        return code.equals(desiredResponseCode);
    }
});

SLIDE 27

Examples

// Count by response code
JavaPairRDD<Integer, String> linesByCode = textFileRDD.mapToPair(
        new PairFunction<String, Integer, String>() {
    @Override
    public Tuple2<Integer, String> call(String s) {
        String code = s.split("\t")[5];
        return new Tuple2<Integer, String>(Integer.valueOf(code), s);
    }
});
Map<Integer, Long> countByCode = linesByCode.countByKey();
System.out.println(countByCode);
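The per-line parsing that these examples share (split on tab, key on field index 5) can be exercised on its own in plain Java; the log lines below are made up in the shape of the NASA file, and groupingBy/counting plays the role of mapToPair + countByKey:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CountByCodeShape {
    // Same field extraction as the Spark examples: tab-separated, code in column 5.
    static int responseCode(String line) {
        return Integer.parseInt(line.split("\t")[5]);
    }

    // Same shape as mapToPair + countByKey, over an in-memory list.
    static Map<Integer, Long> countByCode(List<String> lines) {
        return lines.stream()
                .collect(Collectors.groupingBy(CountByCodeShape::responseCode,
                                               Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up lines with the response code in column index 5.
        List<String> lines = List.of(
                "h\tl\tt\tm\tu\t200\tok",
                "h\tl\tt\tm\tu\t404\tmiss",
                "h\tl\tt\tm\tu\t200\tok");
        System.out.println(countByCode(lines)); // counts: 200 -> 2, 404 -> 1
    }
}
```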

SLIDE 28

Further Reading

Spark home page: http://spark.apache.org/
Quick start: http://spark.apache.org/docs/latest/quick-start.html
RDD documentation: http://spark.apache.org/docs/latest/rdd-programming-guide.html

RDD paper: Matei Zaharia et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." NSDI '12
