CS535 BIG DATA
Week 3-B (2/5/2020), Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

FAQs
• PA1
• GEAR Session 1 signup is available
  • See the announcement in Canvas
• Feedback policy
  • Quiz, TP proposal: 1 week
  • Email: 24 hours

PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
SECTION 2: IN-MEMORY CLUSTER COMPUTING

Topics of Today's Class
• 3. Distributed Computing Models for Scalable Batch Computing
  • Introduction to Spark
  • Operations: transformations, actions, persistence

In-Memory Cluster Computing: Apache Spark
RDD (Resilient Distributed Dataset)

RDD (Resilient Distributed Dataset)
• A read-only, memory-resident, partitioned collection of records
• A fault-tolerant collection of elements that can be operated on in parallel
• RDDs are the core unit of data in Spark
• Most Spark programming involves performing operations on RDDs

Creating RDDs [1/3]
• Loading an external dataset:

  val lines = sc.textFile("/path/to/README.md")

• Parallelizing a collection in your driver program (a sketch of a complete driver program follows below):

  val lines = sc.parallelize(List("pandas", "i like pandas"))

https://spark.apache.org/docs/latest/rdd-programming-guide.html
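Both snippets above assume a live SparkContext named sc (the Spark shell provides one automatically). A minimal sketch of a standalone driver program that creates its own context; the application name and the local master URL are illustrative, not part of the original slides:

  import org.apache.spark.{SparkConf, SparkContext}

  object CreateRDDs {
    def main(args: Array[String]): Unit = {
      // "local[*]" runs Spark in-process with one worker thread per core;
      // on a real cluster this would be the cluster manager's master URL.
      val conf = new SparkConf().setAppName("CreateRDDs").setMaster("local[*]")
      val sc = new SparkContext(conf)

      // 1. Load an external dataset (the path is illustrative)
      val lines = sc.textFile("/path/to/README.md")

      // 2. Parallelize an in-memory collection from the driver program
      val pandas = sc.parallelize(List("pandas", "i like pandas"))

      // count() is an action, so it forces both RDDs to be computed
      println(s"lines: ${lines.count()}, pandas: ${pandas.count()}")

      sc.stop()
    }
  }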

Creating RDDs [2/3]

1: val lines = sc.textFile("data.txt")
2: val lineLengths = lines.map(s => s.length)
3: val totalLength = lineLengths.reduce((a, b) => a + b)

• Line 1: defines a base RDD from an external file
  • This dataset is not loaded into memory
• Line 2: defines lineLengths as the result of a map transformation
  • It is not immediately computed
• Line 3: performs reduce and computes the result

Creating RDDs [3/3]

lineLengths.persist()

• Use this if you want to use lineLengths again later

Spark Programming Interface to RDD [1/3]

1: val lines = sc.textFile("data.txt")
2: val lineLengths = lines.map(s => s.length)
3: val totalLength = lineLengths.reduce((a, b) => a + b)

• Transformations (line 2)
  • Operations that create RDDs
  • Return pointers to new RDDs
  • e.g., map, filter, and join
• RDDs can only be created through deterministic operations on either
  • Data in stable storage
  • Other RDDs
• Don't confuse this with "transformation" in the Scala language

Spark Programming Interface to RDD [2/3]

• Actions (line 3)
  • Operations that return a value to the application or export data to a storage system
  • e.g., count: returns the number of elements in the dataset
  • e.g., collect: returns the elements themselves
  • e.g., save: outputs the dataset to a storage system

Spark Programming Interface to RDD [3/3]

lineLengths.persist()

• Persistence
  • Indicates which RDDs you want to reuse in future operations
  • Spark keeps persistent RDDs in memory by default
  • If there is not enough RAM, it can spill them to disk
• Users are allowed to (a sketch of these options in code follows below)
  • store the RDD only on disk
  • replicate the RDD across machines
  • specify a persistence priority on each RDD

In-Memory Cluster Computing: Apache Spark
RDD: Transformations
RDD: Actions
RDD: Persistence
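A short sketch of persistence in code. persist() with no arguments caches in memory only; the explicit StorageLevel values shown are standard Spark storage levels, while the data file is illustrative:

  import org.apache.spark.storage.StorageLevel

  val lines = sc.textFile("data.txt")
  val lineLengths = lines.map(s => s.length)

  // Default: keep the computed partitions in memory only
  lineLengths.persist()

  // Alternatives (pick one per RDD):
  // lineLengths.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk if RAM is short
  // lineLengths.persist(StorageLevel.DISK_ONLY)       // store the RDD only on disk
  // lineLengths.persist(StorageLevel.MEMORY_ONLY_2)   // replicate across two machines

  // The first action computes and caches lineLengths ...
  val totalLength = lineLengths.reduce((a, b) => a + b)
  // ... and later actions reuse the cached partitions instead of re-reading the file
  val longest = lineLengths.reduce((a, b) => math.max(a, b))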

What are the Transformations?
• An RDD's transformations create a new dataset from an existing one
  1. Simple transformations
  2. Transformations with multiple RDDs
  3. Transformations with Pair RDDs

In-Memory Cluster Computing: Apache Spark
RDD: Transformations

Simple Transformations
• These transformations create a new RDD from an existing RDD
• e.g., map(), filter(), flatMap(), sample()

map() vs. filter() [1/2]
• The map(func) transformation takes in a function and applies it to each element in the RDD, with the result of the function becoming the new value of each element in the resulting RDD
• The filter(func) transformation takes in a function and returns an RDD that has only the elements that pass the filter function

  inputRDD {1, 2, 3, 4}
    map(x => x * x)    → mappedRDD {1, 4, 9, 16}
    filter(x => x != 1) → filteredRDD {2, 3, 4}

map() vs. filter() [2/2]
• map() that squares all of the numbers in an RDD:

  val input = sc.parallelize(List(1, 2, 3, 4))
  val result = input.map(x => x * x)
  println(result.collect().mkString(","))   // prints 1,4,9,16

map() vs. flatMap() [1/2]
• As the result of flatMap(), we get an RDD of the elements, instead of an RDD of lists of elements (a runnable sketch follows below):

  RDD1 {"coffee panda", "happy panda", "happiest panda party"}

  RDD1.map(tokenize)
    → mappedRDD {["coffee","panda"], ["happy","panda"], ["happiest","panda","party"]}

  RDD1.flatMap(tokenize)
    → flatMappedRDD {"coffee","panda","happy","panda","happiest","panda","party"}
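The tokenize function on the flatMap() slide is not defined there; a minimal sketch, assuming tokenize simply splits a line on single spaces:

  val rdd1 = sc.parallelize(List("coffee panda", "happy panda", "happiest panda party"))

  // Assumption: tokenize splits a line into whitespace-delimited words
  def tokenize(line: String): Array[String] = line.split(" ")

  val mapped = rdd1.map(tokenize)         // RDD[Array[String]]: one array per input line
  val flatMapped = rdd1.flatMap(tokenize) // RDD[String]: one element per word

  println(mapped.collect().map(_.mkString("[", ",", "]")).mkString(" "))
  // [coffee,panda] [happy,panda] [happiest,panda,party]
  println(flatMapped.collect().mkString(","))
  // coffee,panda,happy,panda,happiest,panda,party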

map() vs. flatMap() [2/2]
• Using flatMap() to split lines into multiple words:

  val lines = sc.parallelize(List("hello world", "hi"))
  val words = lines.flatMap(line => line.split(" "))
  words.first()   // returns "hello"

Example [1/2]
• Input data:

  Colorado State University
  Ohio State University
  Washington State University
  Boston University

• Using map:

  >>> wc = data.map(lambda line: line.split(" "))
  >>> llist = wc.collect()
  >>> for line in llist: print line   # print the list

  Output?

• Using flatMap:

  >>> fm = data.flatMap(lambda line: line.split(" "))
  >>> for line in fm.collect(): print line   # print the list

  Output?

Example -- Answers [2/2]
• Using map, each element is the list of words from one line:

  ['Colorado', 'State', 'University']
  ['Ohio', 'State', 'University']
  ['Washington', 'State', 'University']
  ['Boston', 'University']

• Using flatMap, each element is a single word:

  Colorado
  State
  University
  Ohio
  State
  University
  Washington
  State
  University
  Boston
  University

map() vs. mapPartitions()
• map(func) converts each element of the source RDD into a single element of the result RDD by applying a function
• mapPartitions(func) converts each partition of the source RDD into multiple elements of the result (possibly none)
  • Similar to map, but runs separately on each partition (block) of the RDD
  • func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T

  // AvgCount is a helper class; a sketch of it follows below
  val sc = new SparkContext(master, "BasicAvgMapPartitions", System.getenv("SPARK_HOME"))
  val input = sc.parallelize(List(1, 2, 3, 4))
  val result = input.mapPartitions(partition =>
      Iterator(AvgCount(0, 0).merge(partition)))
    .reduce((x, y) => x.merge(y))
  println(result)

map() vs. mapPartitions(): Performance
• Does map() perform faster than mapPartitions()?
  • Assume that they are performed over the same RDD with the same number of partitions

repartition() vs. coalesce()
• repartition(numPartitions)
  • Reshuffles the data in the RDD randomly to create either more or fewer partitions and balances it across them
  • This always shuffles all data over the network
• coalesce(numPartitions)
  • Decreases the number of partitions in the RDD to numPartitions
  • Useful for running operations more efficiently after filtering down a large dataset
(a code sketch contrasting the two follows at the end of this section)
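The mapPartitions() snippet above uses an AvgCount helper that the slide does not define. A minimal sketch of one way to write it; the field names total and num are assumptions:

  // Hypothetical helper: accumulates a running sum and an element count
  case class AvgCount(var total: Int, var num: Int) {
    // Fold one whole partition of values into this accumulator
    def merge(iter: Iterator[Int]): AvgCount = {
      iter.foreach { x => total += x; num += 1 }
      this
    }
    // Combine two per-partition accumulators during reduce
    def merge(other: AvgCount): AvgCount = {
      total += other.total
      num += other.num
      this
    }
    def avg: Double = total.toDouble / num
  }

With this definition, the snippet builds one AvgCount per partition and reduces them into a single accumulator, so println(result.avg) would print 2.5 for List(1, 2, 3, 4).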

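Finally, a short sketch contrasting repartition() and coalesce(); the element and partition counts are illustrative:

  val big = sc.parallelize(1 to 1000000, numSlices = 100)

  // Keep only a tiny fraction of the data; the 100 partitions are now mostly empty
  val small = big.filter(x => x % 1000 == 0)

  // coalesce() merges existing partitions in place: no full shuffle over the network
  val compact = small.coalesce(4)
  println(compact.getNumPartitions)     // 4

  // repartition() always performs a full shuffle, and can also grow the partition count
  val rebalanced = small.repartition(16)
  println(rebalanced.getNumPartitions)  // 16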