
[SPARK] Spark: It's all about transformation and actions



  1. CS455: Introduction to Distributed Systems [Spring 2020]
     Dept. of Computer Science, Colorado State University
     Professor: Shrideep Pallickara
     Slides created by: Shrideep Pallickara
     http://www.cs.colostate.edu/~cs455

     [SPARK] Spark: It's all about transformation and actions

       Transformations
       Wrangle with the data
       Consume, and beget, an RDD
       Flock together … to form daisy chains

       But it is actions
       That trigger evaluations
       Providing them potency
       Revealing their expressive power

     Topics covered in this lecture
     - Resilient Distributed Datasets
     - Common Transformations and Actions

  2. RESILIENT DISTRIBUTED DATASET [RDD]

     Resilient Distributed Dataset (RDD)
     - An RDD is an immutable, distributed collection of objects
     - Each RDD is split into multiple partitions
       - These may be computed on different nodes in the cluster
     - An RDD can contain any type of Java, Scala, or Python objects
       - Including user-defined classes

  3. Creation of RDDs
     ① Loading an external dataset
     ② Distributing a collection of objects via the driver program

         >>> lines = sc.textFile("README.md")

     Once created, RDDs offer two types of operations
     - Transformations
       - Construct a new RDD from a previous one
       - E.g.: Filtering data that matches a predicate
     - Actions
       - Compute a result based on an RDD
       - Return the result to the driver program or save it to an external storage system (e.g., HDFS)

  4. Some more about RDDs
     - Although you can define new RDDs at any time
       - Spark computes them in a lazy fashion
       - When?
         - The first time they are used in an action
     - Loading lazily allows transformations to be performed before the action

     Lazy loading allows Spark to see the whole chain of transformations
     - Allows it to compute just the data needed for the result
     - Example:

         lines = sc.textFile("README.md")
         pythonLines = lines.filter(lambda line: "Python" in line)

     - What if Spark were to load and store all the lines in the file as soon as we wrote lines = sc.textFile()?
       - It would waste a lot of storage space, since we immediately filter out many of the lines
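The lazy behaviour described above can be illustrated without a cluster. What follows is a minimal pure-Python sketch, not Spark itself: `ToyRDD`, `text_file`, `stats`, and the in-memory `FAKE_FILE` list are hypothetical stand-ins (assumptions made for illustration) for an RDD, `sc.textFile()`, and the contents of README.md. It only shows that a transformation records work while an action triggers it.

```python
# Toy model of lazy evaluation; this is NOT the Spark API.
FAKE_FILE = ["Spark is fast", "Python API for Spark", "Scala API", "Python rocks"]

stats = {"lines_read": 0}  # how many lines have actually been "loaded"


class ToyRDD:
    """Hypothetical stand-in for an RDD: wraps a function that yields elements."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument callable returning an iterator

    def filter(self, predicate):
        # Transformation: describe a new dataset; nothing is evaluated here.
        return ToyRDD(lambda: (x for x in self._compute() if predicate(x)))

    def count(self):
        # Action: only now is the chain of generators consumed.
        return sum(1 for _ in self._compute())

    def first(self):
        # Action: stops pulling data as soon as one element is available.
        return next(self._compute())


def text_file():
    """Hypothetical stand-in for sc.textFile(): reading is deferred."""
    def read():
        for line in FAKE_FILE:
            stats["lines_read"] += 1
            yield line
    return ToyRDD(read)


lines = text_file()
python_lines = lines.filter(lambda line: "Python" in line)
print(stats["lines_read"])    # 0: defining the datasets read nothing
print(python_lines.first())   # Python API for Spark
print(stats["lines_read"])    # 2: reading stopped at the first match
```

Because the toy `first()` pulls elements one at a time, only two of the four lines are touched; this mirrors, in a very simplified way, the point that seeing the whole chain lets the engine compute just the data needed for the result.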

  5. RDD and actions
     - RDDs are recomputed (by default) every time you run an action on them
     - What if you want to reuse an RDD?
       - Ask Spark to persist it using RDD.persist()
       - After computing it the first time, Spark will store the RDD contents in memory (partitioned across cluster machines)
       - The persisted RDD is used in future actions

     RDDs: memory residency and immutability implications
     - Spark can keep an RDD loaded in memory on the executor nodes throughout the life of a Spark application, for faster access in repeated computations
     - RDDs are immutable, so transforming an RDD returns a new RDD rather than modifying the existing one
     - Cross-cutting implications?
       - Lazy evaluation, in-memory storage, and immutability allow Spark to be easy to use, fault-tolerant, scalable, and efficient
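The difference between recomputing on every action and persisting can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's implementation: `ToyRDD`, `square`, and the `calls` counter are hypothetical names, and the "cache" is just a Python list held in the object.

```python
# Toy model of recomputation vs. persist(); this is NOT the Spark API.
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute   # zero-argument callable returning a list
        self._persist = False     # has persist() been requested?
        self._cache = None        # filled on first computation if persisted

    def map(self, fn):
        # Transformation: the new dataset rebuilds itself from this one.
        return ToyRDD(lambda: [fn(x) for x in self._materialize()])

    def persist(self):
        # Only marks the dataset; nothing is computed or stored yet.
        self._persist = True
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache
        data = self._compute()
        if self._persist:
            self._cache = data
        return data

    def count(self):              # action
        return len(self._materialize())

    def sum(self):                # action
        return sum(self._materialize())


calls = {"square": 0}


def square(x):
    calls["square"] += 1
    return x * x


base = ToyRDD(lambda: [1, 2, 3, 4])

squares = base.map(square)        # not persisted
squares.count()
squares.sum()
print(calls["square"])            # 8: each action recomputed the map

calls["square"] = 0
cached = base.map(square).persist()
cached.count()
cached.sum()
print(calls["square"])            # 4: the second action reused stored data
```

As in the slide, marking the dataset with `persist()` does not compute anything by itself; the contents are stored only after the first action computes them.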

  6. Every Spark program and shell session works as follows
     ① Create some input RDDs from external data
     ② Transform them to define new RDDs using transformations like filter()
     ③ Ask Spark to persist() any intermediate RDDs that need to be reused
     ④ Launch actions such as count() to kick off a parallel computation
       - The computation is optimized and executed by Spark

     A CLOSER LOOK AT RDD OPERATIONS

  7. RDDs support two types of operations
     - Transformations
       - Operations that return a new RDD. E.g.: filter()
     - Actions
       - Operations that return a result to the driver program or write it to storage
       - They kick off a computation. E.g.: count()
     - Distinguishing aspect?
       - Transformations return RDDs
       - Actions return some other data type

     Transformations
     - Many transformations are element-wise
       - They work on only one element at a time
     - But this is not true of all transformations
     - E.g. (element-wise): We have a logfile, log.txt, with several messages, but we only want to select the error messages

         inputRDD = sc.textFile("log.txt")
         errorsRDD = inputRDD.filter(lambda x: "error" in x)

  8. In our previous example …
     - filter() does not mutate inputRDD
       - It returns a pointer to an entirely new RDD
       - inputRDD can still be reused later in the program
     - We could use inputRDD again to search for lines with the word "warning"
       - While we are at it, we will use another transformation, union(), to gather the lines that contain either word, so that we can print how many there are

         errorsRDD = inputRDD.filter(lambda x: "error" in x)
         warningsRDD = inputRDD.filter(lambda x: "warning" in x)
         badLinesRDD = errorsRDD.union(warningsRDD)

     In our previous example
     - Note how union() is different from filter()
       - It operates on 2 RDDs instead of one
     - Transformations can actually operate on any number of RDDs
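One detail of the error/warning example is easy to miss: Spark's union() keeps duplicates, so a line containing both words ends up in the result twice. The sketch below uses plain Python lists as a stand-in for the three RDDs; the log lines and variable names are made up for illustration, and list concatenation only approximates what union() does on a cluster.

```python
# Plain-Python stand-in for the filter/union example; no Spark required.
log_lines = [
    "info: service started",
    "error: disk full",
    "warning: high memory use",
    "error after warning: retry failed",
]

# Filtering builds new lists and leaves log_lines untouched.
error_lines = [x for x in log_lines if "error" in x]
warning_lines = [x for x in log_lines if "warning" in x]

# Concatenation plays the role of union(): duplicates are kept.
bad_lines = error_lines + warning_lines

print(len(log_lines))        # 4: the input is unchanged
print(len(bad_lines))        # 4: the line with both words is counted twice
print(len(set(bad_lines)))   # 3: after removing duplicates
```

In Spark itself, removing the repeated line would take an extra transformation such as distinct() applied to the unioned RDD.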

  9. RDD lineage graphs
     - As new RDDs are derived from each other using transformations, Spark tracks the dependencies
       - Lineage graph
     - Spark uses the lineage graph to
       - Compute each RDD on demand
       - Recover lost data if part of a persistent RDD is lost

     RDD lineage graph for our example

         inputRDD --filter--> errorsRDD
         inputRDD --filter--> warningsRDD
         errorsRDD + warningsRDD --union--> badLinesRDD
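The idea of a lineage graph can be sketched by letting each dataset remember which transformation produced it and from which parents. This is a toy model, not how Spark represents lineage internally: `ToyRDD`, its `name` argument, and the `lineage()` helper are hypothetical and exist only for illustration.

```python
# Toy model of lineage tracking; this is NOT the Spark API.
class ToyRDD:
    def __init__(self, name, op=None, parents=(), compute=None):
        self.name = name
        self.op = op              # transformation that produced this dataset
        self.parents = parents    # datasets this one was derived from
        self._compute = compute   # how to rebuild the data from the parents

    def filter(self, name, predicate):
        return ToyRDD(name, "filter", (self,),
                      lambda: [x for x in self.collect() if predicate(x)])

    def union(self, name, other):
        return ToyRDD(name, "union", (self, other),
                      lambda: self.collect() + other.collect())

    def collect(self):
        # Action: recompute on demand by walking back through the parents.
        return self._compute()

    def lineage(self, depth=0):
        # Describe this dataset and, recursively, everything it depends on.
        label = self.name if self.op is None else f"{self.name} <- {self.op}"
        rows = ["  " * depth + label]
        for parent in self.parents:
            rows.extend(parent.lineage(depth + 1))
        return rows


input_rdd = ToyRDD("inputRDD",
                   compute=lambda: ["error: a", "warning: b", "info: c"])
errors_rdd = input_rdd.filter("errorsRDD", lambda x: "error" in x)
warnings_rdd = input_rdd.filter("warningsRDD", lambda x: "warning" in x)
bad_lines_rdd = errors_rdd.union("badLinesRDD", warnings_rdd)

print("\n".join(bad_lines_rdd.lineage()))
print(bad_lines_rdd.collect())   # ['error: a', 'warning: b']
```

Because every dataset knows how to rebuild itself from its parents, any lost result can be recomputed by replaying the recorded transformations, which is the recovery role the slide assigns to the lineage graph.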
