  1. CS535 Big Data 2/12/2020 Week 4-B Sangmi Lee Pallickara
     CS535 BIG DATA
     PART A. BIG DATA TECHNOLOGY
     3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
     SECTION 2: IN-MEMORY CLUSTER COMPUTING
     Sangmi Lee Pallickara, Computer Science, Colorado State University
     http://www.cs.colostate.edu/~cs535
     Spring 2020, Colorado State University

     FAQs
     • Submission deadline for the GEAR Session 1 review: Feb 25
     • Presenters: please upload your slides to Canvas at least 2 hours before the presentation session

  2. Topics of Today's Class
     • 3. Distributed Computing Models for Scalable Batch Computing
       • DataFrames
       • Spark SQL
       • Datasets
     • 4. Real-time Streaming Computing Models: Apache Storm and Twitter Heron
       • Apache Storm model
       • Parallelism
       • Grouping methods

     In-Memory Cluster Computing: Apache Spark SQL, DataFrames and Datasets

  3. What is Spark SQL?
     • A Spark module for structured data processing
     • Spark provides two interfaces to it: SQL and the Dataset API
     • Spark SQL executes SQL queries
       • Available from the command line or over JDBC/ODBC

     What are Datasets?
     • A Dataset is a distributed collection of data
     • A newer interface (added in Spark 1.6) that provides
       • The benefits of RDDs (strong typing, the ability to use lambda functions)
       • The benefits of Spark SQL's optimized execution engine
     • Available in Scala and Java
       • Python does not support the Dataset API

  4. What is a DataFrame?
     • A DataFrame is a Dataset organized into named columns
       • Like a table in a relational database or a data frame in R/Python
     • Benefits from additional optimizations under the hood
     • Available in Scala, Java, Python, and R

     In-Memory Cluster Computing: Apache Spark SQL, DataFrames and Datasets
     Getting Started

  5. Create a SparkSession: Starting Point
     • SparkSession is the entry point into all functionality in Spark

       import org.apache.spark.sql.SparkSession

       val spark = SparkSession
         .builder()
         .appName("Spark SQL basic example")
         .config("spark.some.config.option", "some-value")
         .getOrCreate()

       // For implicit conversions like converting RDDs to DataFrames
       import spark.implicits._

     Find the full example code in the Spark repo:
     examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala

     Creating DataFrames
     • With a SparkSession, applications can create DataFrames from
       • An existing RDD
       • A Hive table
       • Spark data sources

       val df = spark.read.json("examples/src/main/resources/people.json")

       // Displays the content of the DataFrame to stdout
       df.show()
       // +----+-------+
       // | age|   name|
       // +----+-------+
       // |null|Michael|
       // |  30|   Andy|
       // |  19| Justin|
       // +----+-------+

  6. Untyped Dataset Operations (a.k.a. DataFrame Operations)
     • In the Scala and Java APIs, DataFrames are just Datasets of Rows
     • Their transformations are "untyped," in contrast with the "typed" transformations of strongly typed Scala/Java Datasets

       // This import is needed to use the $-notation
       import spark.implicits._

       // Print the schema in a tree format
       df.printSchema()
       // root
       // |-- age: long (nullable = true)
       // |-- name: string (nullable = true)

       // Select only the "name" column
       df.select("name").show()
       // +-------+
       // |   name|
       // +-------+
       // |Michael|
       // |   Andy|
       // | Justin|
       // +-------+
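     The untyped/typed contrast above can be sketched as follows. This is a minimal sketch, not part of the original slides; it assumes the `df` DataFrame loaded from people.json earlier, and it models age as Option[Long] so the null age in the sample data can be deserialized safely.

     ```scala
     // Person is a hypothetical case class for the people.json sample;
     // Option[Long] tolerates the null age in Michael's record.
     case class Person(name: String, age: Option[Long])

     import spark.implicits._

     // Untyped: columns are referenced by name, so a typo such as $"agee"
     // would surface only at runtime, when the query plan is analyzed.
     df.filter($"age" > 21).show()

     // Typed: convert to Dataset[Person]; the lambda below is ordinary
     // Scala code, so field names and types are checked by the compiler.
     val ds = df.as[Person]
     ds.filter(_.age.exists(_ > 21)).show()
     ```

     Both queries select the same rows; the difference is when errors are caught (compile time vs. runtime).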

  7. Untyped Dataset Operations (continued)

       // Select everybody, but increment the age by 1
       df.select($"name", $"age" + 1).show()
       // +-------+---------+
       // |   name|(age + 1)|
       // +-------+---------+
       // |Michael|     null|
       // |   Andy|       31|
       // | Justin|       20|
       // +-------+---------+

       // Select people older than 21
       df.filter($"age" > 21).show()
       // +---+----+
       // |age|name|
       // +---+----+
       // | 30|Andy|
       // +---+----+

  8. Running SQL Queries
     • SELECT * FROM people

       // Register the DataFrame as a SQL temporary view
       df.createOrReplaceTempView("people")

       val sqlDF = spark.sql("SELECT * FROM people")
       sqlDF.show()
       // +----+-------+
       // | age|   name|
       // +----+-------+
       // |null|Michael|
       // |  30|   Andy|
       // |  19| Justin|
       // +----+-------+

     Global Temporary View
     • Temporary views in Spark SQL are session-scoped
       • They disappear when the session that created them terminates
     • A global temporary view is
       • Shared among all sessions and kept alive until the Spark application terminates
       • Tied to a system-preserved database

  9. Global Temporary View
     • SELECT * FROM people

       // Register the DataFrame as a global temporary view
       df.createGlobalTempView("people")

       // Global temporary views are tied to a system preserved database `global_temp`
       spark.sql("SELECT * FROM global_temp.people").show()
       // +----+-------+
       // | age|   name|
       // +----+-------+
       // |null|Michael|
       // |  30|   Andy|
       // |  19| Justin|
       // +----+-------+

       // Global temporary views are cross-session
       spark.newSession().sql("SELECT * FROM global_temp.people").show()
       // +----+-------+
       // | age|   name|
       // +----+-------+
       // |null|Michael|
       // |  30|   Andy|
       // |  19| Justin|
       // +----+-------+

  10. Creating Datasets
     • Datasets are similar to RDDs
     • However, Datasets serialize objects with Spark's Encoder rather than standard Java or Kryo serialization
     • Many Spark Dataset operations can be performed without deserializing objects

       case class Person(name: String, age: Long)

       // Encoders are created for case classes
       val caseClassDS = Seq(Person("Andy", 32)).toDS()
       caseClassDS.show()
       // +----+---+
       // |name|age|
       // +----+---+
       // |Andy| 32|
       // +----+---+

       // Encoders for most common types are automatically provided by importing spark.implicits._
       val primitiveDS = Seq(1, 2, 3).toDS()
       primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

       // DataFrames can be converted to a Dataset by providing a class.
       // Mapping will be done by name
       val path = "examples/src/main/resources/people.json"
       val peopleDS = spark.read.json(path).as[Person]
       peopleDS.show()
       // +----+-------+
       // | age|   name|
       // +----+-------+
       // |null|Michael|
       // |  30|   Andy|
       // |  19| Justin|
       // +----+-------+

  11. In-Memory Cluster Computing: Apache Spark SQL, DataFrames and Datasets
     Interacting with RDDs

     Interoperating with RDDs
     • Converting RDDs into Datasets
       • Case 1: Use reflection to infer the schema of an RDD
       • Case 2: Use a programmatic interface to construct a schema and then apply it to an existing RDD
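     The two cases above can be sketched as follows. This sketch follows the pattern in the Spark SQL programming guide; it assumes an existing SparkSession `spark` and the `people.txt` sample file (one `name, age` pair per line) that ships with the Spark examples.

     ```scala
     import org.apache.spark.sql.Row
     import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
     import spark.implicits._

     // Case 1: reflection. A case class defines the schema; toDF() infers
     // the column names and types from the case class fields.
     case class Person(name: String, age: Long)

     val peopleDF = spark.sparkContext
       .textFile("examples/src/main/resources/people.txt")
       .map(_.split(","))
       .map(attrs => Person(attrs(0), attrs(1).trim.toLong))
       .toDF()

     // Case 2: programmatic. Build a StructType by hand and apply it to an
     // RDD of Rows with createDataFrame; useful when the schema is known
     // only at runtime (e.g. read from a config or a user query).
     val schema = StructType(Seq(
       StructField("name", StringType, nullable = true),
       StructField("age", LongType, nullable = true)))

     val rowRDD = spark.sparkContext
       .textFile("examples/src/main/resources/people.txt")
       .map(_.split(","))
       .map(attrs => Row(attrs(0), attrs(1).trim.toLong))

     val peopleDF2 = spark.createDataFrame(rowRDD, schema)
     ```

     Reflection keeps the code concise when the schema is fixed at compile time; the programmatic route trades verbosity for the ability to construct schemas dynamically.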
