Spark RDD


  1. Spark RDD

  2. Where are we?
     - Distributed storage in HDFS
     - MapReduce query execution in Hadoop
     - High-level data manipulation using Pig

  3. A Step Back to MapReduce
     - Designed in the early 2000s
     - Machines were unreliable
     - Focused on fault tolerance
     - Addressed data-intensive applications
     - Limited memory

  4. MapReduce in Practice
     [Diagram: each job is a chain of Map -> Shuffle -> Reduce; a multi-step
     computation runs as a sequence of such jobs]
     Can we improve on that?

  5. Pig: Slightly improves disk I/O by consolidating map-only jobs
     [Diagram: two consecutive map phases M1, M2 followed by the shuffle and reduce R]

  6. Pig: Slightly improves disk I/O by consolidating map-only jobs
     [Diagram: M1 and M2 consolidated into a single map phase feeding the shuffle and reduce R]

  7. Pig: Slightly improves disk I/O by consolidating map-only jobs
     [Diagram: a map-only phase M2 following the reduce phase R1]

  8. Pig: Slightly improves disk I/O by consolidating map-only jobs
     [Diagram: M2 consolidated into the reduce phase, so each task runs M1 -> R1 -> M2]

  9. Pig (at a higher level)
     [Diagram: a dataflow of logical operators: FILTER, FOREACH, JOIN, FOREACH, FILTER, GROUP BY]

  10. RDD: Resilient Distributed Datasets
      - A distributed query processing engine
      - The Spark counterpart to MapReduce
      - Designed for in-memory processing

  11. In-memory Processing
      - The machine specs changed: more reliable, bigger memory
      - And the workload changed: analytical queries, iterative operations (like ML)
      - The main idea: rather than storing intermediate results to disk, keep them in memory
      - How about fault tolerance?
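      A minimal sketch (not from the slides) of the in-memory idea in Spark's
      Java API: cache() marks an RDD to be kept in memory once the first action
      computes it, so later actions reuse it without rereading the file. The
      file name is borrowed from the examples later in this deck, and the
      imports (org.apache.spark.api.java.JavaSparkContext, JavaRDD) are assumed.

         JavaSparkContext spark = new JavaSparkContext("local", "CS226-Demo");
         JavaRDD<String> lines = spark.textFile("nasa_19950801.tsv");
         lines.cache();                 // keep this RDD in memory once computed
         long total = lines.count();    // first action: reads the file, fills the cache
         long nonEmpty = lines.filter(s -> !s.isEmpty()).count(); // reuses cached data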

  12. RDD Example
      [Diagram: the same operator dataflow as the Pig example (FILTER, FOREACH,
      JOIN, FOREACH, FILTER, GROUP BY), with intermediate results kept in memory]

  13. RDD Abstraction
      - An RDD is a pointer to a distributed dataset
      - Stores information about how to compute the data rather than where the data is
      - Transformation: converts an RDD to another RDD
      - Action: returns the result of an operation over an RDD
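      A tiny sketch (not from the slides) of the transformation/action
      distinction, reusing the spark context from above and assuming
      java.util.Arrays is imported:

         JavaRDD<Integer> numbers = spark.parallelize(Arrays.asList(1, 2, 3, 4, 5));
         JavaRDD<Integer> evens = numbers.filter(x -> x % 2 == 0); // transformation: nothing runs yet
         long numEvens = evens.count(); // action: triggers the computation, returns 2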

  14. Spark RDD Features
      - Lazy execution: transformations are collected and executed only when an action is called (similar to Pig)
      - Lineage tracking: keep track of the lineage of each RDD for fault tolerance
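      One way to observe both features (a sketch, not from the slides):
      declaring the filter below runs nothing, and toDebugString() prints the
      lineage Spark would use to recompute lost partitions after a failure.

         JavaRDD<String> okLines = spark.textFile("nasa_19950801.tsv")
             .filter(s -> s.split("\t")[5].equals("200")); // lazily recorded, not run
         System.out.println(okLines.toDebugString());      // prints the lineage chain
         long n = okLines.count();                         // the action finally runs the plan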

  15. RDD Operation
      [Diagram: an operation transforms one RDD into another RDD]

  16. Filter Operation
      [Diagram: each partition of the input RDD is filtered independently into
      the corresponding partition of the output RDD]
      The projection operation (FOREACH in Pig) behaves similarly.
      Narrow dependency.

  17. GroupBy (Shuffle) Operation
      [Diagram: each partition of the output RDD depends on many partitions of
      the input RDD]
      Join is a similar operation.
      Wide dependency.

  18. Types of Dependencies
      - Narrow dependencies
      - Wide dependencies
      Credit: https://github.com/rohgar/scala-spark-4/wiki/Wide-vs-Narrow-Dependencies
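      A short sketch (not from the slides) showing both kinds of dependencies
      on the log file from the later examples; assumes textFileRDD from those
      examples plus the JavaPairRDD and scala.Tuple2 imports:

         // Narrow: each output partition comes from exactly one input partition
         JavaRDD<String> upper = textFileRDD.map(s -> s.toUpperCase());
         // Wide: reduceByKey must shuffle rows so that equal keys meet in one partition
         JavaPairRDD<String, Integer> hitsPerCode = textFileRDD
             .mapToPair(s -> new Tuple2<String, Integer>(s.split("\t")[5], 1)) // narrow
             .reduceByKey((a, b) -> a + b);                                    // wide (shuffle)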

  19. Examples of Transformations
      map, flatMap, reduceByKey, filter, sample, join, union, partitionBy
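      A sketch (not from the slides) chaining several of these transformations;
      none of them executes until an action is called:

         JavaRDD<Integer> nums = spark.parallelize(Arrays.asList(1, 2, 3, 4));
         JavaRDD<Integer> doubled = nums.map(x -> x * 2);     // map
         JavaRDD<Integer> large = doubled.filter(x -> x > 4); // filter
         JavaRDD<Integer> both = doubled.union(large);        // union
         JavaRDD<Integer> sampled = both.sample(false, 0.5);  // sample, no replacement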

  20. Examples of Actions
      count, collect, save(path), persist, reduce
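      Continuing the sketch above (the output path is made up, and
      java.util.List is assumed imported), these actions are what actually
      trigger execution; collect and reduce bring results back to the driver:

         long howMany = both.count();            // count the elements
         List<Integer> local = both.collect();   // copy all elements to the driver
         int sum = both.reduce((a, b) -> a + b); // combine elements pairwise
         both.saveAsTextFile("output-dir");      // write one file per partition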

  21. How RDDs Can Be Helpful
      - Consolidate operations: combine transformations
      - Iterative operations: keep the output of one iteration in memory until the next iteration
      - Data sharing: reuse the same data without having to read it multiple times
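      A toy iterative sketch (not from the slides): each pass builds on the
      cached output of the previous one instead of rereading input from disk,
      which is the pattern iterative ML algorithms rely on.

         JavaRDD<Integer> data = spark.parallelize(Arrays.asList(1, 2, 3)).cache();
         for (int i = 0; i < 10; i++) {
             data = data.map(x -> x + 1).cache(); // keep each iteration's output in memory
         }
         System.out.println(data.collect());      // [11, 12, 13]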

  22. Examples
      // Initialize the Spark context
      JavaSparkContext spark = new JavaSparkContext("local", "CS226-Demo");

  23. Examples
      // Initialize the Spark context
      JavaSparkContext spark = new JavaSparkContext("local", "CS226-Demo");

      // Hello World! example: count the number of lines in the file
      JavaRDD<String> textFileRDD = spark.textFile("nasa_19950801.tsv");
      long count = textFileRDD.count();
      System.out.println("Number of lines is " + count);

  24. Examples
      // Count the number of OK lines
      JavaRDD<String> okLines = textFileRDD.filter(new Function<String, Boolean>() {
        @Override
        public Boolean call(String s) throws Exception {
          String code = s.split("\t")[5];
          return code.equals("200");
        }
      });
      long count = okLines.count();
      System.out.println("Number of OK lines is " + count);

  25. Examples
      // Count the number of OK lines
      // Shorten the implementation using lambdas (Java 8 and above)
      JavaRDD<String> okLines = textFileRDD.filter(s -> s.split("\t")[5].equals("200"));
      long count = okLines.count();
      System.out.println("Number of OK lines is " + count);

  26. Examples
      // Make it parameterized by taking the response code as a command-line argument
      String inputFileName = args[0];
      String desiredResponseCode = args[1];
      ...
      JavaRDD<String> textFileRDD = spark.textFile(inputFileName);
      JavaRDD<String> okLines = textFileRDD.filter(new Function<String, Boolean>() {
        @Override
        public Boolean call(String s) {
          String code = s.split("\t")[5];
          return code.equals(desiredResponseCode);
        }
      });

  27. Examples
      // Count by response code
      JavaPairRDD<Integer, String> linesByCode =
          textFileRDD.mapToPair(new PairFunction<String, Integer, String>() {
        @Override
        public Tuple2<Integer, String> call(String s) {
          String code = s.split("\t")[5];
          return new Tuple2<Integer, String>(Integer.valueOf(code), s);
        }
      });
      Map<Integer, Long> countByCode = linesByCode.countByKey();
      System.out.println(countByCode);

  28. Further Reading
      - Spark home page: http://spark.apache.org/
      - Quick start: http://spark.apache.org/docs/latest/quick-start.html
      - RDD documentation: http://spark.apache.org/docs/latest/rdd-programming-guide.html
      - RDD paper: Matei Zaharia et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." NSDI '12
