CS535 Big Data 2/5/2020 Week 3-B Sangmi Lee Pallickara http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University
CS535 Big Data | Computer Science | Colorado State University
CS535 BIG DATA
PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
SECTION 2: IN-MEMORY CLUSTER COMPUTING
Sangmi Lee Pallickara Computer Science, Colorado State University http://www.cs.colostate.edu/~cs535
FAQs
- PA1
- GEAR Session 1 signup is available
  - See the announcement in Canvas
- Feedback policy
  - Quiz, TP proposal: 1 week
  - Email: 24 hrs
Topics of Today's Class
- 3. Distributed Computing Models for Scalable Batch Computing
- Introduction to Spark
- Operations: transformations, actions, persistence
In-Memory Cluster Computing: Apache Spark
RDD (Resilient Distributed Dataset)
RDD (Resilient Distributed Dataset)
- Read-only, memory-resident, partitioned collection of records
- A fault-tolerant collection of elements that can be operated on in parallel
- RDDs are the core unit of data in Spark
- Most Spark programming involves performing operations on RDDs
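The properties above can be sketched in a few lines of Scala. This is a minimal illustration, not part of the original slides: it assumes Spark is on the classpath and runs in local mode (`local[2]`), and the names `RddPropertiesSketch`, `run`, `numbers`, and `doubled` are hypothetical. It shows that an RDD is split into partitions and that a transformation such as `map` returns a new RDD rather than modifying the original.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddPropertiesSketch {
  // Returns (partition count, elements of the original RDD, elements of the derived RDD).
  def run(sc: SparkContext): (Int, List[Int], List[Int]) = {
    // Partitioned: ask Spark to split the eight records into four partitions.
    val numbers = sc.parallelize(1 to 8, numSlices = 4)

    // Read-only: map() does not change `numbers`; it describes a new RDD.
    val doubled = numbers.map(_ * 2)

    // collect() is an action that brings the records back to the driver.
    (numbers.getNumPartitions, numbers.collect().toList, doubled.collect().toList)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two worker threads, chosen only for illustration.
    val conf = new SparkConf().setAppName("RddPropertiesSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (parts, original, doubled) = run(sc)
      println(s"partitions = $parts")
      println(s"original   = $original")
      println(s"doubled    = $doubled")
    } finally {
      sc.stop()
    }
  }
}
```

Because `numbers` is never mutated, Spark can recompute a lost partition of `doubled` from `numbers`, which is the basis of the fault tolerance mentioned above.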
Creating RDDs [1/3]
- Loading an external dataset
  val lines = sc.textFile("/path/to/README.md")
- Parallelizing a collection in your driver program
  val lines = sc.parallelize(List("pandas", "i like pandas"))
https://spark.apache.org/docs/latest/rdd-programming-guide.html
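The two ways of creating an RDD assume that a SparkContext named `sc` already exists, as it does in the Spark shell. The sketch below is one way to make both runnable as a standalone program. It is an illustration under assumptions, not the course's reference code: the object name `CreateRddSketch`, the helper names `fromFile` and `fromCollection`, the `local[2]` master setting, and the temporary file that stands in for `/path/to/README.md` are all hypothetical.

```scala
import java.nio.file.{Files, Path}
import java.util.Arrays

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CreateRddSketch {
  // Option 1: load an external dataset; each line of the file becomes one record.
  def fromFile(sc: SparkContext, path: String): RDD[String] =
    sc.textFile(path)

  // Option 2: distribute a collection that already lives in the driver program.
  def fromCollection(sc: SparkContext): RDD[String] =
    sc.parallelize(List("pandas", "i like pandas"))

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two worker threads, chosen only for illustration.
    val conf = new SparkConf().setAppName("CreateRddSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical stand-in for /path/to/README.md so the example is self-contained.
      val tmp: Path = Files.createTempFile("README", ".md")
      Files.write(tmp, Arrays.asList("pandas", "i like pandas"))

      val fileLines = fromFile(sc, tmp.toString)
      val listLines = fromCollection(sc)

      println(s"lines read from file: ${fileLines.count()}")
      println(s"lines from collection: ${listLines.collect().mkString(" | ")}")
    } finally {
      sc.stop()
    }
  }
}
```

Parallelizing a collection is convenient for small experiments, since the whole collection must first fit in the driver's memory; loading an external dataset is the usual route for larger data.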