CS535 Big Data 2/10/2019 Week 4-A Sangmi Lee Pallickara



CS535 Big Data 2/10/2019 Week 4-A Sangmi Lee Pallickara http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 1

CS535 BIG DATA PART A. BIG DATA TECHNOLOGY

  • 3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING

SECTION 2: IN-MEMORY CLUSTER COMPUTING

Sangmi Lee Pallickara Computer Science, Colorado State University http://www.cs.colostate.edu/~cs535

CS535 Big Data | Computer Science | Colorado State University

FAQs

  • PA2 description will be posted this week
  • Weekly Reading List
  • [W4R1] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy Ryaboy. 2014. Storm@twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, New York, NY, USA, 147-156. DOI: https://doi.org/10.1145/2588555.2595641 [Link]
  • [W4R2] Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. 2015. Twitter Heron: Stream Processing at Scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 239-250. DOI: https://doi.org/10.1145/2723372.2742788 [Link]


Topics of Today's Class

  • RDD Actions and Persistence
  • 3. Distributed Computing Models for Scalable Batch Computing
  • Spark cluster
  • RDD dependency
  • Job scheduling
  • Closure


In-Memory Cluster Computing: Apache Spark

RDD: Transformations RDD: Actions RDD: Persistence

Actions [1/2]

println("Input had " + badLinesRDD.count() + " concerning lines")
println("Here are 10 examples:")
badLinesRDD.take(10).foreach(println)

  • An action returns a final value to the driver program
  • Or writes data to an external storage system
  • The log file analysis example is continued above
  • take() retrieves a small number of elements of the RDD to the driver program
  • The driver then iterates over them locally to print out information

Actions [2/2]

  • collect()
  • Retrieves the entire RDD to the driver
  • The entire dataset (RDD) must fit in memory on a single machine
  • Useful if the RDD has been filtered down to a very small dataset
  • For very large RDDs
  • Store them in external storage (e.g. S3 or HDFS)
  • Use the saveAsTextFile() action

reduce()

  • Takes a function that operates on two elements of the type in your RDD and returns a new element of the same type
  • The function should be commutative and associative so that it can be computed correctly in parallel

val rdd1 = sc.parallelize(List(1, 2, 5))
val sum = rdd1.reduce{ (x, y) => x + y }
// result: sum: Int = 8
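Why the commutativity/associativity requirement matters can be seen with a plain-Scala simulation (not Spark code): split the data into local "partitions", reduce each one, then merge the partial results, as Spark does internally.

```scala
// Plain-Scala sketch: reduce runs within each partition first, then merges
// the partial results, so the operator must be associative (and commutative,
// since the merge order is not fixed).

// Associative op (+): any partitioning yields the same answer.
val a1 = List(List(1, 2), List(5)).map(_.reduce(_ + _)).reduce(_ + _)
val a2 = List(List(1), List(2, 5)).map(_.reduce(_ + _)).reduce(_ + _)
println(s"$a1 $a2") // 8 8, matching rdd1.reduce above

// Non-associative op (-): the result now depends on the partitioning.
val s1 = List(List(1, 2), List(5)).map(_.reduce(_ - _)).reduce(_ - _) // (1-2)-5 = -6
val s2 = List(List(1), List(2, 5)).map(_.reduce(_ - _)).reduce(_ - _) // 1-(2-5) = 4
println(s"$s1 $s2")
```

Subtraction gives a partitioning-dependent answer, which is exactly what Spark's requirement rules out.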

reduce() vs. fold()

  • Similar to reduce() but it takes a 'zero value' (initial value)
  • The function should be commutative and associative so that it can be computed correctly in parallel

scala> val rdd1 = sc.parallelize(List(("maths", 80), ("science", 90)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[8] at parallelize at <console>:21

scala> rdd1.partitions.length
res8: Int = 8

scala> val additionalMarks = ("extra", 4)
additionalMarks: (String, Int) = (extra,4)

scala> val sum = rdd1.fold(additionalMarks){ (acc, marks) =>
         val sum = acc._2 + marks._2
         ("total", sum)
       }

What will be the result(sum)?

def fold(zeroValue: T)(op: (T, T) => T): T

Aggregates the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.

reduce() vs. fold()

  • Similar to reduce() but it takes a 'zero value' (initial value)
  • The function should be commutative and associative so that it can be computed correctly in parallel

scala> val rdd1 = sc.parallelize(List(("maths", 80), ("science", 90)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[8] at parallelize at <console>:21

scala> rdd1.partitions.length
res8: Int = 8

scala> val additionalMarks = ("extra", 4)
additionalMarks: (String, Int) = (extra,4)

scala> val sum = rdd1.fold(additionalMarks){ (acc, marks) =>
         val sum = acc._2 + marks._2
         ("total", sum)
       }
// result: sum: (String, Int) = (total,206)
// The zero value is applied once per partition plus once in the final merge:
// (4 x 9) + 80 + 90 = 206
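The 206 can be verified with a plain-Scala simulation (not Spark code): fold applies the zero value once per partition, plus once more when merging the per-partition results — nine times in total for eight partitions.

```scala
// Plain-Scala sketch of fold's semantics; names here are ours, not Spark's.
val zero = ("extra", 4)
val op = (acc: (String, Int), m: (String, Int)) => ("total", acc._2 + m._2)

// 8 partitions, matching rdd1.partitions.length above:
// two hold one record each, six are empty.
val partitions: List[List[(String, Int)]] =
  List(List(("maths", 80)), List(("science", 90))) ++
    List.fill(6)(List.empty[(String, Int)])

val perPartition = partitions.map(_.foldLeft(zero)(op)) // zero used 8 times
val total = perPartition.foldLeft(zero)(op)             // zero used once more
println(total) // (total,206) = (4 x 9) + 80 + 90
```

An empty partition simply returns the zero value itself, which is why even the six empty partitions each contribute 4 to the total.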

take(n)

  • Returns n elements from the RDD, attempting to minimize the number of partitions it accesses
  • The result may therefore be a biased collection
  • It does not return the elements in the order you might expect
  • Useful for unit testing

In-Memory Cluster Computing: Apache Spark

RDD: Transformations RDD: Actions RDD: Persistence


Persistence

  • Caches a dataset across operations
  • Nodes store partitions of results from previous operation(s) in memory and reuse them in other actions
  • An RDD can be marked for persistence with persist() or cache()
  • The persisted RDD can be stored using different storage levels
  • Using a StorageLevel object
  • Passed as an argument to persist()



Persistence levels

Level                 Space used  CPU time  In memory / On disk  Comment
MEMORY_ONLY           High        Low       Y / N
MEMORY_ONLY_SER       Low         High      Y / N                Stores RDD as serialized Java objects (one byte array per partition)
MEMORY_AND_DISK       High        Medium    Some / Some          Spills to disk if there is too much data to fit in memory
MEMORY_AND_DISK_SER   Low         High      Some / Some          Spills to disk if there is too much data to fit in memory; stores serialized representation in memory
DISK_ONLY             Low         High      N / Y


In-Memory Cluster Computing: Apache Spark

Spark Cluster


Spark cluster and resources

[Diagram: a driver program running a SparkContext connects to a cluster manager (Standalone, Hadoop YARN, or Mesos); the cluster manager launches executors on worker nodes, each holding a cache and running multiple tasks.]


Spark cluster [1/3]

  • Each application gets its own executor processes
  • Must be up and running for the duration of the entire application
  • Run tasks in multiple threads
  • Isolate applications from each other
  • Scheduling side (each driver schedules its own tasks)
  • Executor side (tasks from different applications run in different JVMs)
  • Data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system


Spark cluster [2/3]

  • Spark is agnostic to the underlying cluster manager
  • As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run Spark even on a cluster manager that also supports other applications (e.g. Mesos/YARN)


Spark cluster [3/3]

  • The driver program must listen for and accept incoming connections from its executors throughout its lifetime

  • Driver program must be network addressable from the worker nodes
  • Driver program should run close to the worker nodes
  • On the same local area network



Cluster Manager Types

  • Standalone
  • Simple cluster manager included with Spark
  • Mesos
  • Offers a fine-grained sharing option
  • Useful for interactive applications that frequently share objects
  • The Mesos master determines which machines handle which tasks
  • Hadoop YARN
  • Resource manager in Hadoop 2


Dynamic Resource Allocation

  • Dynamically adjust the resources that the applications occupy
  • Based on the workload
  • Your application may give resources back to the cluster if they are no longer used
  • Only available on coarse-grained cluster managers
  • Standalone mode, YARN mode, Mesos coarse grained mode
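As a sketch, dynamic allocation is typically switched on through standard Spark configuration properties such as the following (the property names are from Spark's configuration; the values are illustrative):

```
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   2
spark.dynamicAllocation.maxExecutors   20
# Executors can only be removed safely if their shuffle data remains
# available, e.g. via the external shuffle service:
spark.shuffle.service.enabled          true
```

With these set, Spark requests executors when tasks queue up and releases idle executors back to the cluster.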


In-Memory Cluster Computing: Apache Spark

RDDs in Spark


RDDs in Spark: The Runtime

[Diagram: a driver sends tasks to multiple workers; each worker holds RAM, reads input data, and returns results.] The user's driver program launches multiple workers, which read data blocks from a distributed file system and can persist computed RDD partitions in memory.


Representing RDDs

  • A set of partitions
  • Atomic pieces of the dataset
  • A set of dependencies on parent RDDs
  • A function for computing the dataset based on its parents
  • Metadata about its partitioning scheme
  • Data placement


Interface used to represent RDDs in Spark

Operation                  Meaning
partitions()               Returns a list of Partition objects
preferredLocations(p)      Lists nodes where partition p can be accessed faster due to data locality
dependencies()             Returns a list of dependencies
iterator(p, parentIters)   Computes the elements of partition p given iterators for its parent partitions
partitioner()              Returns metadata specifying whether the RDD is hash/range partitioned
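The interface can be sketched as a toy Scala trait. This is our simplified rendition for illustration, not Spark's actual source; all names below (SimpleRDD, SimplePartition, LocalListRDD) are hypothetical.

```scala
// Simplified, hypothetical rendition of the RDD interface in the table above.
trait Partition { def index: Int }
case class SimplePartition(index: Int) extends Partition

trait SimpleRDD[T] {
  def partitions(): Seq[Partition]                    // atomic pieces of the dataset
  def preferredLocations(p: Partition): Seq[String]   // data-locality hints (node names)
  def dependencies(): Seq[SimpleRDD[_]]               // parent RDDs
  def iterator(p: Partition, parentIters: Seq[Iterator[T]]): Iterator[T]
  def partitioner(): Option[String]                   // e.g. Some("hash") or Some("range")
}

// A leaf RDD over an in-memory, pre-partitioned collection:
class LocalListRDD[T](data: Seq[Seq[T]]) extends SimpleRDD[T] {
  def partitions(): Seq[Partition] = data.indices.map(i => SimplePartition(i))
  def preferredLocations(p: Partition): Seq[String] = Nil  // no locality info
  def dependencies(): Seq[SimpleRDD[_]] = Nil              // no parents
  def iterator(p: Partition, parentIters: Seq[Iterator[T]]): Iterator[T] =
    data(p.index).iterator
  def partitioner(): Option[String] = None
}

val rdd = new LocalListRDD(Seq(Seq(1, 2), Seq(3)))
println(rdd.partitions().length) // 2
```

A transformed RDD (e.g. the result of a map) would implement iterator() in terms of its parents' iterators, which is exactly how lineage-based recomputation works.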



In-Memory Cluster Computing: Apache Spark

Lazy Evaluation


Lazy Evaluation

  • Transformations on RDDs are lazily evaluated
  • Spark will NOT begin to execute until it sees an action
  • Spark internally records metadata to indicate that this operation has been requested
  • Loading data from files into an RDD is lazily evaluated
  • Reduces the number of passes Spark has to take over our data by grouping operations together
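The same deferred-execution pattern can be observed locally with Scala's lazy Iterator; this is a rough analogy to Spark, not Spark itself. The map below is merely recorded, and nothing runs until a terminal operation forces it.

```scala
// Local analogy for lazy evaluation: Iterator.map defers work the way an
// RDD transformation does; toList plays the role of an action.
var evaluated = 0
val it = Iterator(1, 2, 3).map { x => evaluated += 1; x * 2 } // "transformation"
println(evaluated)     // 0: nothing has executed yet
val result = it.toList // "action" forces the whole pipeline
println(evaluated)     // 3: all elements were processed by the "action"
println(result)        // List(2, 4, 6)
```

Spark goes further than this analogy: because it records full lineage metadata, it can also group and pipeline operations before executing them.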


Example: Console Log Mining [3/3]

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()

Lineage graph: lines -> filter(_.startsWith("ERROR")) -> errors -> filter(_.contains("HDFS")) -> map(_.split('\t')(3)) -> time fields of HDFS errors. If a partition of errors is lost, Spark rebuilds it by applying the filter to only the corresponding partition of lines.


Benefits of RDDs as a distributed memory abstraction [1/3]

  • RDDs can only be created (“written”) through coarse-grained transformations
  • Distributed shared memory (DSM) allows reads and writes to each memory location
  • Reads on RDDs can still be fine-grained
  • A large read-only lookup table
  • Applications perform bulk writes
  • More efficient fault tolerance
  • Lineage based bulk recovery


Benefits of RDDs as a distributed memory abstraction [2/3]

  • RDDs' data is immutable
  • The system can mitigate slow nodes (stragglers)
  • By creating backup copies of slow tasks
  • Without the copies contending for the same memory
  • Spark distributes the data over different worker nodes that run computations in parallel
  • It orchestrates communication between nodes to integrate intermediate results and combine them into the final result


Benefits of RDDs as a distributed memory abstraction [3/3]

  • The runtime can schedule tasks based on data locality
  • To improve performance
  • RDDs degrade gracefully when there is insufficient memory
  • Partitions that do not fit in RAM are stored on disk



Applications not suitable for RDDs

  • RDDs are best suited for batch applications that apply the same operations to all elements of a dataset

  • Steps are managed by lineage graph efficiently
  • Recovery is managed effectively
  • RDDs are not suitable for applications that make asynchronous fine-grained updates to shared state
  • e.g. a storage system for a web application or an incremental web crawler


In-Memory Cluster Computing: Apache Spark

RDD Dependency in Spark


Dependency between RDDs [1/4]

[Diagram: side-by-side examples of narrow dependencies (each parent partition feeds at most one child partition) and wide dependencies (each parent partition feeds multiple child partitions).]


Dependency between RDDs [2/4]

  • Narrow dependency
  • Each partition of the parent RDD is used by at most one partition of the child RDD

Examples: map, filter, union, and joins with co-partitioned inputs (both inputs hash/range partitioned with the same partitioner, so matching partitions are stored on the same node)


Dependency between RDDs [3/4]

  • Wide dependency
  • Multiple child partitions may depend on a single partition of parent RDD

Examples: groupByKey, and joins with inputs that are not co-partitioned


Dependency between RDDs [4/4]

  • Narrow dependency
  • Pipelined execution on one cluster node
  • e.g. a map followed by a filter
  • Failure recovery is more straightforward
  • Wide dependency
  • Requires data from all parent partitions to be available and to be shuffled across the nodes
  • Failure recovery could involve a large number of RDDs
  • Complete re-execution may be required



Dependency

  • filter (Narrow/Wide)
  • leftOuterJoin (Narrow/Wide)
  • distinct (Narrow/Wide)
  • mapPartitions (Narrow/Wide)
  • repartition (Narrow/Wide)
  • reduceByKey (Narrow/Wide)


Dependency-Answers

  • filter (Narrow)
  • leftOuterJoin (Wide; Narrow only if the inputs are co-partitioned)
  • distinct (Wide)
  • mapPartitions (Narrow)
  • repartition (Wide)
  • reduceByKey (Wide)


In-Memory Cluster Computing: Apache Spark

Scheduling


Jobs in Spark application

  • “Job”
  • A Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action
  • Within a given Spark application, multiple parallel jobs can run simultaneously
  • If they were submitted from separate threads


Job scheduling

  • Stage is a physical unit of execution
  • A set of parallel tasks
  • User runs an action (e.g. count or save) on an RDD
  • Scheduler examines that RDD’s lineage graph to build a DAG of stages to execute
  • Each stage contains as many pipelined transformations as possible
  • With narrow dependencies
  • The boundaries of the stages are the shuffle operations
  • For wide dependencies
  • Or at any already-computed partitions that can short-circuit the computation of a parent RDD
  • Multiple transformations can thus be processed within a single stage

Example of Spark job stages

[Diagram: a job DAG over RDDs A-G built from groupByKey, map, union, and a join, ending in collect. Stages are split wherever the shuffle phases occur.]

Question: How many stages does this job have?



Example of Spark job stages

[Diagram: the same DAG over RDDs A-G, now split into three stages at the shuffle boundaries: Stage 1 covers the groupByKey branch, Stage 2 the map/union branch, and Stage 3 the final join and collect. Stages are split wherever the shuffle phases occur, so this job has three stages.]
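The stage-splitting rule ("cut the lineage graph at every wide dependency") can be sketched with a tiny local model. MiniRDD and the DAG below are our illustrative stand-ins for the slide's RDDs A-G, not Spark code, and the wide/narrow labels on the edges are assumptions matching the diagram.

```scala
// Toy model: each RDD records its parents and whether each dependency is wide.
case class MiniRDD(name: String, deps: List[(MiniRDD, Boolean)]) // true = wide

// Number of stages = number of shuffle (wide) edges in the DAG,
// plus one for the final stage that runs the action.
def countStages(last: MiniRDD): Int = {
  val seen = scala.collection.mutable.Set[String]()
  def wideEdges(r: MiniRDD): Int =
    if (!seen.add(r.name)) 0 // already visited
    else r.deps.map { case (p, wide) => (if (wide) 1 else 0) + wideEdges(p) }.sum
  wideEdges(last) + 1
}

// A DAG shaped like the slide's example:
val a = MiniRDD("A", Nil)
val b = MiniRDD("B", List((a, true)))              // groupByKey: wide
val c = MiniRDD("C", Nil)
val d = MiniRDD("D", List((c, false)))             // map: narrow
val e = MiniRDD("E", Nil)
val f = MiniRDD("F", List((d, false), (e, false))) // union: narrow
val g = MiniRDD("G", List((b, false), (f, true)))  // join: B co-partitioned, F not
println(countStages(g)) // 3 stages, as in the slide
```

Note the B-to-G join edge is narrow here because B is already hash-partitioned by the groupByKey, which is why B's stage output can be consumed without another shuffle.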


Default FIFO scheduler

  • By default, Spark's scheduler runs jobs in FIFO fashion
  • The first job gets priority on all available resources
  • Then the second job gets priority, etc.
  • As long as resources are available, jobs in the queue will start right away


Fair Scheduler

  • Assigns tasks between jobs in a “round robin” fashion
  • All jobs get a roughly equal share of cluster resources
  • Fair Scheduler Pools
  • Short jobs submitted while a long job is running can start receiving resources right away
  • Good response times, without waiting for the long job to finish
  • Best for multi-user settings


Fair Scheduler Pools

  • Supports grouping jobs into pools
  • With different options (e.g. weights)
  • “high-priority” pool for more important jobs
  • This approach is modeled after the Hadoop Fair Scheduler
  • Default behavior of pools
  • Each pool gets an equal share of the cluster
  • Inside each pool, jobs run in FIFO order
  • If the Spark cluster creates one pool per user
  • Each user will get an equal share of the cluster
  • Each user’s queries will run in order
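A sketch of how pools are configured: pool properties live in an XML allocations file (pointed to by the spark.scheduler.allocation.file property). The pool names and values below are illustrative.

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Higher-weight pool for more important jobs -->
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
  </pool>
  <pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
```

Jobs submitted from a thread are then placed in a pool with sc.setLocalProperty("spark.scheduler.pool", "production"); threads that set no pool fall into the default pool.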


In-Memory Cluster Computing: Apache Spark

Closures


Understanding Closures

  • To execute jobs, Spark breaks up the processing of RDD operations into tasks, each executed by an executor
  • Prior to execution, Spark computes the task's closure
  • The closure is those variables and methods that must be visible for the executor to perform its computations on the RDD

  • This closure is serialized and sent to each executor.



Understanding Closures

  • How many different counters are in this example code?

1: var counter = 0
2: var rdd = sc.parallelize(data)
3:
4: // Wrong: Don't do this!!
5: rdd.foreach(x => counter += x)
6:
7: println("Counter value: " + counter)


Understanding closures

  • The counter (in line 5) referenced within the foreach function is no longer the counter (in line 1) on the driver node
  • counter (in line 1) will still be zero
  • In local mode, in some circumstances, the foreach function will actually execute within the same JVM as the driver
  • In that case counter may actually be updated

1: var counter = 0
2: var rdd = sc.parallelize(data)
3:
4: // Wrong: Don't do this!!
5: rdd.foreach(x => counter += x)
6:
7: println("Counter value: " + counter)


Solutions?

  • Closures (e.g. loops or locally defined methods) should not be used to mutate global state
  • Spark does not define or guarantee the behavior of mutations to objects referenced from outside the closure
  • An Accumulator provides a mechanism for safely updating a variable when execution is split up across worker nodes in a cluster


Accumulators [1/4]

  • Variables that are only “added” to through an associative and commutative operation
  • Efficiently supported in parallel
  • Used to implement counters (as in MapReduce) or sums

scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

scala> accum.value
res2: Long = 10


Accumulators [2/4]

  • Spark natively supports accumulators of type Long, and programmers can add support for new types

// Suppose that we have a MyVector class representing mathematical vectors
class VectorAccumulatorV2 extends AccumulatorV2[MyVector, MyVector] {
  private val myVector: MyVector = MyVector.createZeroVector
  def reset(): Unit = { myVector.reset() }
  def add(v: MyVector): Unit = { myVector.add(v) }
  ...
}
// Then, create an Accumulator of this type:
val myVectorAcc = new VectorAccumulatorV2
// Then, register it into the Spark context:
sc.register(myVectorAcc, "MyVectorAcc1")


Accumulators [3/4]

  • If accumulators are created with a name, they will be displayed in Spark’s UI



Accumulators [4/4]

  • For accumulator updates performed inside actions only, Spark guarantees that each task's update to the accumulator will be applied exactly once
  • Restarted tasks will not update the value again
  • Inside transformations, an update may be applied more than once if a task or stage is re-executed

val accum = sc.longAccumulator
data.map { x => accum.add(x); x }
// Here, accum is still 0 because no actions have caused the map operation to be computed.


Questions?
