CS535 Big Data
Week 4-A (2/10/2019)
Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
SECTION 2: IN-MEMORY CLUSTER COMPUTING

FAQs
• PA2 description will be posted this week
• Weekly Reading List
  • [W4R1] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy Ryaboy. 2014. Storm@twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, New York, NY, USA, 147-156. DOI: https://doi.org/10.1145/2588555.2595641
  • [W4R2] Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. 2015. Twitter Heron: Stream Processing at Scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 239-250. DOI: https://doi.org/10.1145/2723372.2742788

Topics of Today's Class
• RDD actions and persistence
• Spark cluster
• RDD dependency
• Job scheduling
• Closure

In-Memory Cluster Computing: Apache Spark
RDD: Transformations / RDD: Actions / RDD: Persistence

Actions [1/2]
• An action returns a final value to the driver program, or writes data to an external storage system

Actions [2/2]
• collect()
  • Retrieves the entire RDD to the driver
  • The entire dataset (RDD) must fit in memory on a single machine
  • Useful if the RDD has been filtered down to a very small dataset
• take() retrieves a small number of elements of the RDD at the driver program
  • The driver then iterates over them locally, e.g. to print out information
  • Continuing the log file analysis example:

    println("Input had " + badLinesRDD.count() + " concerning lines")
    println("Here are 10 examples:")
    badLinesRDD.take(10).foreach(println)

• For a very large RDD, store the results in external storage (e.g. S3 or HDFS) with the saveAsTextFile() action (see the sketch below)
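The snippet above assumes badLinesRDD already exists. A minimal end-to-end sketch of these three actions, assuming a local SparkContext, a hypothetical input file log.txt, an "ERROR" filter, and an HDFS output path (all illustrative stand-ins, not from the slides):

    import org.apache.spark.{SparkConf, SparkContext}

    object ActionsDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("actions-demo").setMaster("local[*]"))

        val inputRDD    = sc.textFile("log.txt")               // hypothetical input
        val badLinesRDD = inputRDD.filter(_.contains("ERROR")) // hypothetical filter

        // take(n): pulls only a handful of elements back to the driver
        println("Input had " + badLinesRDD.count() + " concerning lines")
        println("Here are 10 examples:")
        badLinesRDD.take(10).foreach(println)

        // collect(): safe only because the filtered RDD is assumed to be small
        val allBadLines: Array[String] = badLinesRDD.collect()
        println("Collected " + allBadLines.length + " lines at the driver")

        // saveAsTextFile(): for very large RDDs, write to external storage instead
        badLinesRDD.saveAsTextFile("hdfs:///tmp/bad-lines")    // hypothetical path

        sc.stop()
      }
    }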

reduce()
• Takes a function that operates on two elements of the type in your RDD and returns a new element of the same type
• The function should be commutative and associative so that it can be computed correctly in parallel

    val rdd1 = sc.parallelize(List(1, 2, 5))
    val sum = rdd1.reduce{ (x, y) => x + y }
    // result: sum: Int = 8

reduce() vs. fold()
• fold() is similar to reduce(), but it takes a 'zero value' (an initial value):

    def fold(zeroValue: T)(op: (T, T) => T): T

  Aggregates the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.
• The function should be commutative and associative so that it can be computed correctly in parallel

    scala> val rdd1 = sc.parallelize(List( ("maths", 80), ("science", 90) ))
    rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[8] at parallelize at <console>:21

    scala> rdd1.partitions.length
    res8: Int = 8

    scala> val additionalMarks = ("extra", 4)
    additionalMarks: (String, Int) = (extra,4)

    scala> val sum = rdd1.fold(additionalMarks){ (acc, marks) =>
         |   val total = acc._2 + marks._2
         |   ("total", total)
         | }

• What will be the result (sum)?

    // result: sum: (String, Int) = (total,206)
    // The zero value is applied once per partition and once more when the
    // per-partition results are combined: (4 x (8 + 1)) + 80 + 90 = 206

take(n)
• Returns n elements from the RDD and attempts to minimize the number of partitions it accesses
  • It may therefore represent a biased collection
  • It does not return the elements in the order you might expect
• Useful for unit testing

In-Memory Cluster Computing: Apache Spark
RDD: Transformations / RDD: Actions / RDD: Persistence

Persistence
• Caches a dataset across operations
  • Nodes store partitions of results from previous operation(s) in memory and reuse them in other actions
• An RDD to be persisted is specified with persist() or cache()
• A persisted RDD can be stored using a different storage level
  • Using a StorageLevel object
  • Passing the StorageLevel object to persist() (see the sketch below)
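A minimal sketch of persisting an RDD with an explicit storage level, assuming the same SparkContext and the hypothetical log.txt input used above; the choice of MEMORY_AND_DISK is illustrative:

    import org.apache.spark.storage.StorageLevel

    val lengths = sc.textFile("log.txt").map(_.length) // hypothetical input

    // Pass a StorageLevel object to persist(); note that cache() is
    // shorthand for persist(StorageLevel.MEMORY_ONLY).
    lengths.persist(StorageLevel.MEMORY_AND_DISK)

    // The first action computes and stores the partitions; subsequent
    // actions reuse the stored partitions instead of recomputing them.
    println(lengths.count())
    println(lengths.reduce(_ max _))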

Persistence levels

  Level                 Space used   CPU time   In memory / On disk   Comment
  MEMORY_ONLY           High         Low        Y / N
  MEMORY_ONLY_SER       Low          High       Y / N                 Stores the RDD as serialized Java objects (one byte array per partition)
  MEMORY_AND_DISK       High         Medium     Some / Some           Spills to disk if there is too much data to fit in memory
  MEMORY_AND_DISK_SER   Low          High       Some / Some           Spills to disk if there is too much data to fit in memory; stores a serialized representation in memory
  DISK_ONLY             Low          High       N / Y

In-Memory Cluster Computing: Apache Spark
Spark Cluster

Spark cluster [1/3]
[Figure: the driver program (SparkContext) connects through a cluster manager (Hadoop YARN, Mesos, or Standalone) to executors on the worker nodes; each executor holds a cache and runs tasks.]

Spark cluster and resources
• Each application gets its own executor processes
  • Executors must be up and running for the duration of the entire application
  • Executors run tasks in multiple threads
• Applications are isolated from each other
  • On the scheduling side (each driver schedules its own tasks)
  • On the executor side (tasks from different applications run in different JVMs)
• Data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system

Spark cluster [2/3]
• Spark is agnostic to the underlying cluster manager
  • As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run Spark even on a cluster manager that also supports other applications (e.g. Mesos/YARN); see the sketch below

Spark cluster [3/3]
• The driver program must listen for and accept incoming connections from its executors throughout its lifetime
  • It must be network addressable from the worker nodes
• The driver program should run close to the worker nodes
  • Preferably on the same local area network
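To make the cluster-manager discussion concrete, a minimal sketch of pointing an application at a cluster manager; the master URLs, application name, and memory setting are illustrative assumptions, not values from the slides:

    import org.apache.spark.{SparkConf, SparkContext}

    object ClusterDemo {
      def main(args: Array[String]): Unit = {
        // The master URL selects the cluster manager; Spark is agnostic to which one:
        //   "local[*]"            run locally (testing)
        //   "spark://host:7077"   Standalone cluster manager
        //   "mesos://host:5050"   Mesos
        //   "yarn"                Hadoop YARN (details come from the Hadoop configuration)
        val conf = new SparkConf()
          .setAppName("cluster-demo")            // hypothetical application name
          .setMaster("spark://master-node:7077") // hypothetical Standalone master
          .set("spark.executor.memory", "2g")    // illustrative resource request

        val sc = new SparkContext(conf) // the driver acquires its own executors here
        println(sc.parallelize(1 to 1000).sum())
        sc.stop()
      }
    }

In practice the master is usually left out of the code and supplied at launch time instead, e.g.:

    spark-submit --master yarn --class ClusterDemo cluster-demo.jar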
