  1. CS535 Big Data 2/12/2020 Week 4-B Sangmi Lee Pallickara
     CS535 BIG DATA
     PART A. BIG DATA TECHNOLOGY
     3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
     SECTION 2: IN-MEMORY CLUSTER COMPUTING
     Sangmi Lee Pallickara, Computer Science, Colorado State University
     http://www.cs.colostate.edu/~cs535
     Spring 2020, Colorado State University

     FAQs
     • Submission deadline for the GEAR Session 1 review: Feb 25
     • Presenters: please upload your slides to Canvas at least 2 hours before the presentation session

  2. Topics of Today's Class
     • 3. Distributed Computing Models for Scalable Batch Computing
       • DataFrames
       • Spark SQL
       • Datasets
     • 4. Real-time Streaming Computing Models: Apache Storm and Twitter Heron
       • Apache Storm model
       • Parallelism
       • Grouping methods

     In-Memory Cluster Computing: Apache Spark SQL, DataFrames and Datasets

  3. What is Spark SQL?
     • A Spark module for structured data processing
     • Spark provides two interfaces to it: SQL and the Dataset API
     • Spark SQL executes SQL queries
       • Available from the command line or over JDBC/ODBC

     What are Datasets?
     • A Dataset is a distributed collection of data
     • A newer interface (added in Spark 1.6) that provides
       • The benefits of RDDs (strong typing, the ability to use lambda functions)
       • The benefits of Spark SQL's optimized execution engine
     • Available in Scala and Java
       • Python does not support the Dataset API

  4. What is a DataFrame?
     • A DataFrame is a Dataset organized into named columns
       • Like a table in a relational database or a data frame in R/Python
     • Benefits from additional optimizations under the hood
     • Available in Scala, Java, Python, and R

     In-Memory Cluster Computing: Apache Spark SQL, DataFrames and Datasets
     Getting Started

  5. Create a SparkSession: Starting Point
     • SparkSession is the entry point into all functionality in Spark

       import org.apache.spark.sql.SparkSession

       val spark = SparkSession
         .builder()
         .appName("Spark SQL basic example")
         .config("spark.some.config.option", "some-value")
         .getOrCreate()

       // For implicit conversions like converting RDDs to DataFrames
       import spark.implicits._

     Find the full example code in the Spark repo:
     examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala

     Creating DataFrames
     • With a SparkSession, applications can create DataFrames from
       • An existing RDD
       • A Hive table
       • Spark data sources

       val df = spark.read.json("examples/src/main/resources/people.json")

       // Displays the content of the DataFrame to stdout
       df.show()
       // +----+-------+
       // | age|   name|
       // +----+-------+
       // |null|Michael|
       // |  30|   Andy|
       // |  19| Justin|
       // +----+-------+

  6. Untyped Dataset Operations (a.k.a. DataFrame Operations)
     • In the Scala and Java APIs, DataFrames are just Datasets of Rows
     • Their transformations are "untyped," in contrast with the "typed" transformations of strongly typed Scala/Java Datasets

       // This import is needed to use the $-notation
       import spark.implicits._

       // Print the schema in a tree format
       df.printSchema()
       // root
       // |-- age: long (nullable = true)
       // |-- name: string (nullable = true)

       // Select only the "name" column
       df.select("name").show()
       // +-------+
       // |   name|
       // +-------+
       // |Michael|
       // |   Andy|
       // | Justin|
       // +-------+
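     The untyped/typed contrast above can be sketched as follows. This is a minimal sketch, not part of the original slides; it assumes the `df` DataFrame loaded from people.json earlier, and it models age as Option[Long] so the null age in the sample data can be deserialized safely.

     ```scala
     // Person is a hypothetical case class for the people.json sample;
     // Option[Long] tolerates the null age in Michael's record.
     case class Person(name: String, age: Option[Long])

     import spark.implicits._

     // Untyped: columns are referenced by name, so a typo such as $"agee"
     // would surface only at runtime, when the query plan is analyzed.
     df.filter($"age" > 21).show()

     // Typed: convert to Dataset[Person]; the lambda below is ordinary
     // Scala code, so field names and types are checked by the compiler.
     val ds = df.as[Person]
     ds.filter(_.age.exists(_ > 21)).show()
     ```

     Both queries select the same rows; the difference is when errors are caught (compile time vs. runtime).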

  7. Untyped Dataset Operations (continued)

       // Select everybody, but increment the age by 1
       df.select($"name", $"age" + 1).show()
       // +-------+---------+
       // |   name|(age + 1)|
       // +-------+---------+
       // |Michael|     null|
       // |   Andy|       31|
       // | Justin|       20|
       // +-------+---------+

       // Select people older than 21
       df.filter($"age" > 21).show()
       // +---+----+
       // |age|name|
       // +---+----+
       // | 30|Andy|
       // +---+----+

  8. Running SQL Queries
     • SELECT * FROM people

       // Register the DataFrame as a SQL temporary view
       df.createOrReplaceTempView("people")

       val sqlDF = spark.sql("SELECT * FROM people")
       sqlDF.show()
       // +----+-------+
       // | age|   name|
       // +----+-------+
       // |null|Michael|
       // |  30|   Andy|
       // |  19| Justin|
       // +----+-------+

     Global Temporary View
     • Temporary views in Spark SQL are session-scoped
       • They disappear when the session that created them terminates
     • A global temporary view is
       • Shared among all sessions and kept alive until the Spark application terminates
       • Tied to a system-preserved database

  9. Global Temporary View
     • SELECT * FROM people

       // Register the DataFrame as a global temporary view
       df.createGlobalTempView("people")

       // Global temporary views are tied to a system preserved database `global_temp`
       spark.sql("SELECT * FROM global_temp.people").show()
       // +----+-------+
       // | age|   name|
       // +----+-------+
       // |null|Michael|
       // |  30|   Andy|
       // |  19| Justin|
       // +----+-------+

       // Global temporary views are cross-session
       spark.newSession().sql("SELECT * FROM global_temp.people").show()
       // +----+-------+
       // | age|   name|
       // +----+-------+
       // |null|Michael|
       // |  30|   Andy|
       // |  19| Justin|
       // +----+-------+

  10. Creating Datasets
     • Datasets are similar to RDDs
     • However, Datasets serialize objects with Spark's Encoder rather than standard Java or Kryo serialization
     • Many Spark Dataset operations can be performed without deserializing objects

       case class Person(name: String, age: Long)

       // Encoders are created for case classes
       val caseClassDS = Seq(Person("Andy", 32)).toDS()
       caseClassDS.show()
       // +----+---+
       // |name|age|
       // +----+---+
       // |Andy| 32|
       // +----+---+

       // Encoders for most common types are automatically provided by importing spark.implicits._
       val primitiveDS = Seq(1, 2, 3).toDS()
       primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

       // DataFrames can be converted to a Dataset by providing a class.
       // Mapping will be done by name
       val path = "examples/src/main/resources/people.json"
       val peopleDS = spark.read.json(path).as[Person]
       peopleDS.show()
       // +----+-------+
       // | age|   name|
       // +----+-------+
       // |null|Michael|
       // |  30|   Andy|
       // |  19| Justin|
       // +----+-------+

  11. In-Memory Cluster Computing: Apache Spark SQL, DataFrames and Datasets
     Interacting with RDDs

     Interoperating with RDDs
     • Converting RDDs into Datasets
       • Case 1: Use reflection to infer the schema of an RDD
       • Case 2: Use a programmatic interface to construct a schema and then apply it to an existing RDD
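     The two cases above can be sketched as follows. This sketch follows the pattern in the Spark SQL programming guide; it assumes an existing SparkSession `spark` and the `people.txt` sample file (one `name, age` pair per line) that ships with the Spark examples.

     ```scala
     import org.apache.spark.sql.Row
     import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
     import spark.implicits._

     // Case 1: reflection. A case class defines the schema; toDF() infers
     // the column names and types from the case class fields.
     case class Person(name: String, age: Long)

     val peopleDF = spark.sparkContext
       .textFile("examples/src/main/resources/people.txt")
       .map(_.split(","))
       .map(attrs => Person(attrs(0), attrs(1).trim.toLong))
       .toDF()

     // Case 2: programmatic. Build a StructType by hand and apply it to an
     // RDD of Rows with createDataFrame; useful when the schema is known
     // only at runtime (e.g. read from a config or a user query).
     val schema = StructType(Seq(
       StructField("name", StringType, nullable = true),
       StructField("age", LongType, nullable = true)))

     val rowRDD = spark.sparkContext
       .textFile("examples/src/main/resources/people.txt")
       .map(_.split(","))
       .map(attrs => Row(attrs(0), attrs(1).trim.toLong))

     val peopleDF2 = spark.createDataFrame(rowRDD, schema)
     ```

     Reflection keeps the code concise when the schema is fixed at compile time; the programmatic route trades verbosity for the ability to construct schemas dynamically.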
