CS535 Big Data 2/10/2019 Week 4-A Sangmi Lee Pallickara http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 1
CS535 BIG DATA
PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
SECTION 2: IN-MEMORY CLUSTER COMPUTING
Sangmi Lee Pallickara Computer Science, Colorado State University http://www.cs.colostate.edu/~cs535
CS535 Big Data | Computer Science | Colorado State University
FAQs
- PA2 description will be posted this week
- Weekly Reading List
- [W4R1] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy Ryaboy. 2014. Storm@twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, New York, NY, USA, 147-156. DOI: https://doi.org/10.1145/2588555.2595641
- [W4R2] Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. 2015. Twitter Heron: Stream Processing at Scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 239-250. DOI: https://doi.org/10.1145/2723372.2742788
Topics of Today's Class
- RDD Actions and Persistence
- 3. Distributed Computing Models for Scalable Batch Computing
- Spark cluster
- RDD dependency
- Job scheduling
- Closure
In-Memory Cluster Computing: Apache Spark
- RDD: Transformations
- RDD: Actions
- RDD: Persistence
Actions [1/2]
println(" Input had " + badLinesRDD.count() + " concerning lines")
println(" Here are 10 examples:")
badLinesRDD.take(10).foreach(println)
- Returns a final value to the driver program
- Or writes data to an external storage system
- Log file analysis example is continued
- count() returns the number of elements in the RDD to the driver program
- take(n) returns up to n elements of the RDD to the driver program
- The driver then iterates over them locally to print them
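The snippet above can be fleshed out into a small runnable sketch. Everything outside the three original lines is an assumption made for illustration: the object name `ActionsSketch`, the helper `summarize`, the in-memory sample lines, the filter on "error" and "warning" that stands in for the log file analysis, and the `local[*]` master.

```scala
import org.apache.spark.sql.SparkSession

object ActionsSketch {
  // Hypothetical helper: runs count() and take(n) on the "bad" lines of a sample log.
  // Returns both results so the caller can inspect them on the driver.
  def summarize(spark: SparkSession, lines: Seq[String], n: Int): (Long, Array[String]) = {
    val inputRDD = spark.sparkContext.parallelize(lines)
    // Transformation: only defines badLinesRDD lazily; no job runs here.
    // The filter condition is an assumed stand-in for the lecture's log analysis.
    val badLinesRDD = inputRDD.filter(l => l.contains("error") || l.contains("warning"))
    // Actions: each call triggers a job and returns a value to the driver.
    (badLinesRDD.count(), badLinesRDD.take(n))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ActionsSketch")
      .master("local[*]") // assumption: local mode, for experimentation only
      .getOrCreate()
    val sample = Seq("info: started", "error: disk full", "warning: slow response", "info: done")
    val (total, examples) = summarize(spark, sample, 10)
    println("Input had " + total + " concerning lines")
    println("Here are up to 10 examples:")
    examples.foreach(println) // runs locally on the driver, over a plain Array
    spark.stop()
  }
}
```

Note that count() and take() each launch a separate job here, so badLinesRDD is computed twice unless it is persisted.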
Actions [2/2]
- collect()
- Retrieves entire RDD to the driver
- The entire dataset (RDD) must fit in memory on a single machine
- Useful when the RDD has been filtered down to a very small dataset
- For a very large RDD
- Store it in external storage (e.g., S3 or HDFS)
- Use the saveAsTextFile() action
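The contrast between collect() and saveAsTextFile() can be sketched as follows. This is a minimal illustration under stated assumptions: the object name `CollectOrSaveSketch`, the helpers `smallResult` and `saveAll`, the numeric sample data, and the temporary local output directory are all hypothetical. On a real cluster the output path would typically point at HDFS or S3 instead.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object CollectOrSaveSketch {
  // Small result: filter first, then collect() the few surviving elements to the driver.
  def smallResult(spark: SparkSession): Array[Int] =
    spark.sparkContext.parallelize(1 to 100000).filter(_ % 25000 == 0).collect()

  // Large result: write it out from the executors rather than pulling it to the driver.
  // saveAsTextFile() fails if outDir already exists, so the caller must pass a fresh path.
  def saveAll(spark: SparkSession, outDir: String): Unit =
    spark.sparkContext.parallelize(1 to 100000).saveAsTextFile(outDir)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CollectOrSaveSketch")
      .master("local[*]") // assumption: local mode, for experimentation only
      .getOrCreate()
    println(smallResult(spark).mkString(", "))
    // Assumed output location: a new subdirectory of a temp directory on the local disk.
    val outDir = Files.createTempDirectory("spark-sketch").resolve("numbers").toString
    saveAll(spark, outDir) // writes one part file per partition under outDir
    println("Wrote output to " + outDir)
    spark.stop()
  }
}
```

The design point is where the data ends up: collect() materializes every element in the driver's memory, while saveAsTextFile() lets each partition be written independently, so the driver never holds the full dataset.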