Final Project M1 CS 327E October 30, 2017

Final Project Overview
Goals:
• Learn the basics of distributed computing and Spark (Spark Core and Spark SQL)
• Gain hands-on exposure to EMR*
• Develop ETL pipelines with PySpark and Postgres
• Enrich IMDB database with new data sources
Format:
• Weekly milestones, 6 total
• Continue working in pairs
• Monday: reading quiz, new concepts, assignment sheet, project work
• Wednesday: project work
• Friday: milestone submission (except for Thanksgiving week)
*EMR is not covered by free tier. Pricing for 1-node cluster on m3.xlarge is $0.34 per hour / $8.2 per day / $246 per month.
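The EMR pricing figures in the footnote can be reproduced with a little arithmetic. The sketch below is an assumption about how the numbers were derived (24 hours per day, a 30-day month, and the daily cost rounded to one decimal before multiplying); only the hourly rate comes from the slide.

```python
# Hourly rate for a 1-node m3.xlarge EMR cluster, as quoted on the slide (USD).
HOURLY_RATE = 0.34

# Assumption: a day is billed as 24 full hours.
daily_cost = round(HOURLY_RATE * 24, 1)

# Assumption: the monthly figure uses the rounded daily cost and a 30-day month.
monthly_cost = round(daily_cost * 30)

# Without the intermediate rounding the monthly total is slightly lower.
monthly_unrounded = round(HOURLY_RATE * 24 * 30, 2)

print(daily_cost, monthly_cost, monthly_unrounded)
```

Under these assumptions the rounded path matches the $8.2 and $246 quoted above, while the unrounded product comes out a little under that.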

Final Project Milestones
• M1: ETL movie ratings data (source: Movielens). Due date: 11/03.
• M2: ETL movie tags data (source: Movielens). Due date: 11/10.
• M3: ETL movie ticket sales data (source: The-Numbers). Due date: 11/17.
• M4: ETL Bollywood data (source: Cinemalytics). Due date: 12/01.
• M5: Group presentations. Week of 12/4 – 12/11.
• M6: Technical reports. Due date: 12/11.

1) The MapReduce programming model consists of a user-provided map function and a user-provided reduce function. A) True B) False

2) The fundamental abstraction in Spark is called: A) Discretized Stream B) Resilient Distributed Dataset C) B+ Tree D) Distributed Hash table

3) What type of operation is the map function in Spark? A) A transformation B) An action C) An event D) All of the above

4) What type of operation is the reduce function in Spark? A) A transformation B) An action C) A sample D) All of the above

5) Which of these AWS services provides a Spark cluster? A) CloudFormation B) Athena C) Kinesis D) Elastic MapReduce

RDD Key Concepts
• RDD = Partitioned collection of records across a Spark cluster
• Operations on RDD = transformations and actions
• Base RDD created from file(s)
• Transformed RDD created by applying transformations to a base RDD; actions return a result to the driver instead of a new RDD
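The concepts above can be illustrated without a cluster. The following is a plain-Python sketch, not Spark itself: the helper `split_into_partitions`, the sample lines, and the round-robin partitioning scheme are all hypothetical stand-ins chosen for illustration.

```python
def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists.

    Hypothetical stand-in for how a cluster spreads an RDD's records
    across workers; Spark's real partitioning logic differs.
    """
    return [records[i::num_partitions] for i in range(num_partitions)]

# "Base RDD": records read from a file, here a few "movie_id,rating" lines.
lines = ["1,4.0", "2,3.5", "3,5.0", "4,2.0", "5,4.5"]
base = split_into_partitions(lines, 2)

# Transformation: applied partition by partition, yields a new partitioned collection.
transformed = [[float(line.split(",")[1]) for line in part] for part in base]

# Action: combines the partitions into a plain value handed back to the caller.
count = sum(len(part) for part in transformed)

print(base)
print(transformed)
print(count)
```

In PySpark the corresponding steps would typically be `sc.textFile(...)` for the base RDD, `map` for the transformation, and `count()` for the action.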

Spark Transformations
• Map: call map on an RDD and pass it a function as a parameter. Map applies the function to each element of the input RDD and returns a new RDD as output.
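A minimal sketch of the map semantics, using an ordinary Python list as a stand-in for an RDD. The sample ratings and the function `to_percent` are made up for illustration; the PySpark call in the comment assumes an RDD named `ratings_rdd`.

```python
# Plain-Python stand-in for an RDD: an ordinary list of records.
ratings = [3.5, 4.0, 2.0, 5.0]

def to_percent(rating):
    """Convert a 0-5 star rating to a 0-100 scale."""
    return rating * 20

# PySpark equivalent (assuming an RDD named ratings_rdd):
#   percent_rdd = ratings_rdd.map(to_percent)
# The function runs once per element and the input collection is left unchanged.
percent = list(map(to_percent, ratings))

print(percent)
```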

Spark Transformations
• Filter: works like a SQL WHERE clause. Filter is called on an RDD and provided a function to filter. Spark calls the function on each element of the RDD. If the function returns true, the element will be passed to the output RDD.
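The same idea for filter, again sketched with a plain list rather than a real RDD. The movie tuples and the predicate `released_after_2000` are hypothetical; the PySpark call in the comment assumes an RDD named `movies_rdd`.

```python
# Plain-Python stand-in for an RDD of (title, year) records.
movies = [("Alien", 1979), ("Up", 2009), ("Heat", 1995), ("Coco", 2017)]

def released_after_2000(movie):
    """Predicate: keep a record only when this returns True."""
    title, year = movie
    return year > 2000

# PySpark equivalent (assuming an RDD named movies_rdd):
#   recent_rdd = movies_rdd.filter(released_after_2000)
# Rough SQL analogue: SELECT * FROM movies WHERE year > 2000
recent = list(filter(released_after_2000, movies))

print(recent)
```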

Spark Actions
• Reduce: calculates a single aggregate over all the elements of an RDD. Requires a function that is both associative and commutative. Spark applies the function to pairs of elements again and again until there is only one output left.
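The sketch below shows why the reduce function needs to be associative and commutative. It uses `functools.reduce` on plain lists; the two-partition split is a hypothetical stand-in for how a cluster might divide the data, and the PySpark call in the comment assumes an RDD named `ratings_rdd`.

```python
from functools import reduce
from operator import add, sub

ratings = [4, 3, 5, 2]

# Hypothetical split of the same data across two partitions.
partitions = [[4, 3], [5, 2]]

def reduce_partitioned(func, parts):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(func, part) for part in parts]
    return reduce(func, partials)

# PySpark equivalent (assuming an RDD named ratings_rdd): ratings_rdd.reduce(add)
total = reduce(add, ratings)
total_partitioned = reduce_partitioned(add, partitions)

# Subtraction is neither associative nor commutative, so the answer
# depends on how the elements happen to be grouped.
diff = reduce(sub, ratings)
diff_partitioned = reduce_partitioned(sub, partitions)

print(total, total_partitioned)
print(diff, diff_partitioned)
```

Addition gives the same total either way, while subtraction gives different answers for the flat list and the partitioned one, which is why a function like it is unsafe to pass to reduce.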

Spark Transformations
• ReduceByKey: works like the SQL GROUP BY. Calculates an aggregate value for each key in a key-value pair RDD. Requires a function that is both associative and commutative. Spark applies the function to pairs of values again and again until there is only one output left for each key. Unlike reduce, reduceByKey is a transformation: it returns a new RDD of (key, aggregate) pairs rather than a single value.
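A plain-Python sketch of the per-key aggregation described above. The helper `reduce_by_key` is hypothetical and only mimics the semantics on a local list; the sample (movie_id, rating) pairs are made up, and the PySpark call in the comment assumes an RDD named `pairs_rdd`.

```python
def reduce_by_key(pairs, func):
    """Combine the values of each key with func, two values at a time.

    Hypothetical local stand-in for the semantics of reduceByKey.
    """
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)
        else:
            result[key] = value
    return result

# (movie_id, rating) pairs; the same key may appear many times.
pairs = [("m1", 4), ("m2", 3), ("m1", 5), ("m2", 1), ("m3", 2)]

# PySpark equivalent (assuming an RDD named pairs_rdd):
#   totals_rdd = pairs_rdd.reduceByKey(lambda a, b: a + b)
# Rough SQL analogue: SELECT movie_id, SUM(rating) FROM ratings GROUP BY movie_id
totals = reduce_by_key(pairs, lambda a, b: a + b)

print(totals)
```

A key that appears only once, such as `m3` here, passes through with its value unchanged because the function is never called for it.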

Spark Programming Guide:
https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#transformations
https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#actions

Milestone 1 http://www.cs.utexas.edu/~scohen/projects/m1-assignment.pdf
