Spark: Resilient Distributed Datasets as Workflow System


  1. Spark: Resilient Distributed Datasets as Workflow System H. Andrew Schwartz CSE545 Spring 2020

  2. Big Data Analytics, The Class. Goal: Generalizations, i.e., a model or summarization of the data. [Two-column overview: Data Frameworks (Hadoop File System, Spark, MapReduce, Tensorflow) vs. Algorithms and Analyses (Similarity Search, Hypothesis Testing, Graph Analysis, Streaming, Recommendation Systems, Deep Learning).]

  3. Where is MapReduce Inefficient? [Pipeline diagram: DFS → Map → Local FS → Network → Reduce → DFS → Map → ...] Anywhere MapReduce would need to write to and read from disk a lot.

  4-5. Where is MapReduce Inefficient?
  ● Long pipelines sharing data
  ● Interactive applications
  ● Streaming applications
  ● Iterative algorithms (optimization problems)
  In short: anywhere MapReduce would need to write to and read from disk a lot.

  6. Spark’s Big Idea: Resilient Distributed Datasets (RDDs), a read-only, partitioned collection of records (like a DFS) but with a record of how the dataset was created as a combination of transformations from other dataset(s).

  7. Example lineage, step 1: RDD1 (DATA) is created from dfs://filename.

  8. Step 2: RDD2 (DATA) is created from RDD1 via transformation1(); RDD1 remembers it was created from dfs://filename.

  9. Step 3: RDD3 (DATA) is created from RDD2 via transformation2(). RDD1 can now drop its data, since its lineage records that it can be rebuilt from dfs://filename.

  10. This lineage enables rebuilding datasets on the fly: intermediate datasets are not stored on disk (and are kept in memory only if needed and there is enough space), which means faster communication and I/O.
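  To make this concrete, here is a minimal PySpark sketch (the file path and the particular transformations are illustrative assumptions, not from the slides): transformations only record lineage, and nothing is read or computed until an action runs.

  from pyspark import SparkContext

  sc = SparkContext(appName="rdd-lineage-sketch")

  # Creating an RDD from stable storage only records how to build it;
  # no data is read yet. (The path is a hypothetical placeholder.)
  rdd1 = sc.textFile("hdfs:///data/filename")

  # Transformations build new RDDs and extend the lineage; still no execution.
  rdd2 = rdd1.map(lambda line: line.split("\t"))        # stands in for transformation1
  rdd3 = rdd2.filter(lambda fields: len(fields) == 3)   # stands in for transformation2

  # Only an action forces Spark to run the recorded transformations.
  print(rdd3.count())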

  11. An RDD is created either from “stable storage” (e.g., a file in the distributed file system) or from other RDDs.

  12. The transformations that derive one RDD from another include map, filter, join, ...
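  As a small, hedged illustration of these transformations in PySpark (the data below is made up): map reshapes each record, filter drops records, and join combines two pair RDDs of (key, value) tuples on their keys.

  from pyspark import SparkContext

  sc = SparkContext(appName="transformations-sketch")

  # Toy data, for illustration only.
  raw    = sc.parallelize(["alice,34", "bob,29", "carol,41"])
  cities = sc.parallelize([("alice", "NYC"), ("carol", "SF")])

  ages   = raw.map(lambda s: (s.split(",")[0], int(s.split(",")[1])))  # map: parse into (key, value)
  adults = ages.filter(lambda kv: kv[1] >= 30)                         # filter: keep only some records
  joined = adults.join(cities)                                         # join: (key, (left, right)) pairs

  print(joined.collect())   # e.g., [('alice', (34, 'NYC')), ('carol', (41, 'SF'))]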

  13-14. Continuing the example: RDD2’s data can also be dropped. Only RDD3 keeps its data; RDD1 and RDD2 are represented purely by their lineage (created from dfs://filename and via transformation1 from RDD1, respectively).

  15-17. Later, RDD4 (DATA) is created from RDD2 via transformation3(). Because RDD2’s data was dropped, Spark will recreate it on the fly by replaying its lineage (re-reading dfs://filename and re-applying transformation1) before applying transformation3.
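  As a hedged illustration of this rebuild-from-lineage idea, PySpark can print an RDD’s recorded lineage with toDebugString(); the path and transformations below are assumptions for the sketch.

  from pyspark import SparkContext

  sc = SparkContext(appName="lineage-inspection-sketch")

  rdd1 = sc.textFile("hdfs:///data/filename")        # hypothetical path
  rdd2 = rdd1.map(lambda line: line.strip())         # stands in for transformation1
  rdd3 = rdd2.filter(lambda line: len(line) > 0)     # stands in for transformation2
  rdd4 = rdd2.map(lambda line: line.upper())         # stands in for transformation3 (also from rdd2)

  # Each RDD carries its lineage; if rdd2's partitions are ever dropped,
  # Spark replays these recorded steps from stable storage to rebuild them.
  print(rdd4.toDebugString().decode())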

  18-20. (original) Transformations: RDD → RDD. [Table of the transformations defined in the original RDD paper.]

  21. (original) Actions: RDD → value, object, or storage.

  Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.” NSDI 2012, April 2012.

  22. Current Transformations and Actions
  Transformations: http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations (common: filter, map, flatMap, reduceByKey, groupByKey)
  Actions: http://spark.apache.org/docs/latest/rdd-programming-guide.html#actions (common: collect, count, take)
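  A short PySpark sketch that combines several of these common transformations and actions (a word count over an assumed input file); only the actions at the end trigger execution.

  from pyspark import SparkContext

  sc = SparkContext(appName="wordcount-sketch")

  lines = sc.textFile("hdfs:///data/some_text_file")   # hypothetical path

  counts = (lines
            .flatMap(lambda line: line.split())        # one record per word
            .map(lambda word: (word, 1))               # pair RDD: (word, 1)
            .reduceByKey(lambda a, b: a + b))          # sum the counts per word

  print(counts.count())   # action: number of distinct words
  print(counts.take(5))   # action: a few (word, count) pairs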

  23. Example: count the errors in a log file with fields TYPE, MESSAGE, TIME (example from Zaharia et al., NSDI 2012): lines → filter(_.startsWith(“ERROR”)) → errors → count().

  24. The same example as pseudocode:
  lines = sc.textFile(“dfs:...”)
  errors = lines.filter(_.startsWith(“ERROR”))
  errors.count()
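  A runnable PySpark version of the sketch above (the slides’ pseudocode is Scala-style; the log path here is an assumed placeholder).

  from pyspark import SparkContext

  sc = SparkContext(appName="error-count-sketch")

  lines = sc.textFile("hdfs:///logs/app.log")   # hypothetical path
  errors = lines.filter(lambda line: line.startswith("ERROR"))

  # Nothing runs until this action.
  print(errors.count())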

  25. Example 2: collect the times of HDFS-related errors.
  Pseudocode:
  lines = sc.textFile(“dfs:...”)
  errors = lines.filter(_.startsWith(“ERROR”))
  errors.persist()
  errors.count()
  ...
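  A hedged PySpark sketch of Example 2, assuming tab-separated TYPE/MESSAGE/TIME fields as in the diagram (the path, delimiter, and field order are assumptions). persist() keeps the filtered errors in memory so later passes reuse them rather than re-reading the file.

  from pyspark import SparkContext

  sc = SparkContext(appName="hdfs-error-times-sketch")

  lines = sc.textFile("hdfs:///logs/app.log")   # hypothetical path
  errors = lines.filter(lambda line: line.startswith("ERROR"))

  errors.persist()        # keep the filtered RDD in memory for reuse
  print(errors.count())   # first action: materializes (and caches) errors

  # A second pass reuses the persisted errors instead of replaying the full lineage.
  hdfs_times = (errors
                .filter(lambda line: "HDFS" in line)
                .map(lambda line: line.split("\t")[2]))   # TIME assumed to be the third field
  print(hdfs_times.collect())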
