CS535 Big Data | Computer Science | Colorado State University
Week 4-B (2/12/2020), Sangmi Lee Pallickara

FAQs
• Submission deadline for the GEAR Session 1 review: Feb 25
• Presenters: please upload your slides to Canvas at least 2 hours before the presentation session

PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
SECTION 2: IN-MEMORY CLUSTER COMPUTING
Sangmi Lee Pallickara, Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

Topics of Today's Class
• 3. Distributed Computing Models for Scalable Batch Computing
  • DataFrame
  • Spark SQL
  • Datasets
• 4. Real-time Streaming Computing Models: Apache Storm and Twitter Heron
  • Apache Storm model
  • Parallelism
  • Grouping methods

In-Memory Cluster Computing: Apache Spark
SQL, DataFrames and Datasets

What is Spark SQL?
• A Spark module for structured data processing
• Two interfaces are provided by Spark: SQL and the Dataset API
• Spark SQL executes SQL queries
  • Available from the command line or over JDBC/ODBC

What are Datasets?
• A Dataset is a distributed collection of data
• A new interface added in Spark (since v1.6) that provides:
  • the benefits of RDDs (strong typing, the ability to use lambda functions); a short sketch follows below
  • the benefits of Spark SQL's optimized execution engine
• Available in Scala and Java
  • Python does not support the Dataset API
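To make the "strong typing plus lambda functions" benefit concrete, here is a minimal sketch. It is not from the slides: the Sale case class is made up, and a SparkSession named spark (created in the Getting Started section below) is assumed to be in scope.

  case class Sale(item: String, amount: Double)

  import spark.implicits._  // provides Encoders for case classes and toDS()

  // A typed Dataset: the element type is Sale, not a generic Row
  val sales = Seq(Sale("book", 12.0), Sale("pen", 1.5)).toDS()

  // A plain Scala lambda over typed objects; a typo such as s.price
  // would be rejected at compile time rather than at runtime
  val expensive = sales.filter(s => s.amount > 10.0)
  expensive.show()
  // +----+------+
  // |item|amount|
  // +----+------+
  // |book|  12.0|
  // +----+------+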

What is a DataFrame?
• A DataFrame is a Dataset organized into named columns
  • Like a table in a relational database or a data frame in R/Python
• Benefits from a strengthened optimization scheme
• Available in Scala, Java, Python, and R

In-Memory Cluster Computing: Apache Spark
SQL, DataFrames and Datasets: Getting Started

Create a SparkSession: Starting Point
• SparkSession is the entry point into all functionality in Spark

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession
    .builder()
    .appName("Spark SQL basic example")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()

  // For implicit conversions like converting RDDs to DataFrames
  import spark.implicits._

Creating DataFrames
• With a SparkSession, applications can create DataFrames from:
  • an existing RDD (a sketch follows at the end of this section)
  • a Hive table
  • Spark data sources

  val df = spark.read.json("examples/src/main/resources/people.json")

  // Displays the content of the DataFrame to stdout
  df.show()
  // +----+-------+
  // | age|   name|
  // +----+-------+
  // |null|Michael|
  // |  30|   Andy|
  // |  19| Justin|
  // +----+-------+

• Find the full example code in the Spark repo: examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala

Untyped Dataset Operations (a.k.a. DataFrame Operations)
• In the Scala and Java APIs, DataFrames are just Datasets of Rows
• These are untyped transformations, in contrast to the "typed operations" available on strongly typed Scala/Java Datasets

  // This import is needed to use the $-notation
  import spark.implicits._

  // Print the schema in a tree format
  df.printSchema()
  // root
  //  |-- age: long (nullable = true)
  //  |-- name: string (nullable = true)

  // Select only the "name" column
  df.select("name").show()
  // +-------+
  // |   name|
  // +-------+
  // |Michael|
  // |   Andy|
  // | Justin|
  // +-------+
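The slide lists an existing RDD as a DataFrame source but only demonstrates reading JSON. As a minimal sketch (assuming the spark session above with spark.implicits._ imported; the data here is made up), an RDD of tuples can be converted with toDF:

  // Hypothetical data: an RDD of (name, age) tuples
  val rdd = spark.sparkContext.parallelize(Seq(("Michael", 29), ("Andy", 30)))

  // toDF comes from spark.implicits._; the arguments name the columns
  // (without them the columns would default to _1 and _2)
  val fromRDD = rdd.toDF("name", "age")
  fromRDD.show()
  // +-------+---+
  // |   name|age|
  // +-------+---+
  // |Michael| 29|
  // |   Andy| 30|
  // +-------+---+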

Untyped Dataset Operations (a.k.a. DataFrame Operations), continued

  // Select everybody, but increment the age by 1
  df.select($"name", $"age" + 1).show()
  // +-------+---------+
  // |   name|(age + 1)|
  // +-------+---------+
  // |Michael|     null|
  // |   Andy|       31|
  // | Justin|       20|
  // +-------+---------+

  // Select people older than 21
  df.filter($"age" > 21).show()
  // +---+----+
  // |age|name|
  // +---+----+
  // | 30|Andy|
  // +---+----+

Running SQL Queries
• SELECT * FROM people

  // Register the DataFrame as a SQL temporary view
  df.createOrReplaceTempView("people")

  val sqlDF = spark.sql("SELECT * FROM people")
  sqlDF.show()
  // +----+-------+
  // | age|   name|
  // +----+-------+
  // |null|Michael|
  // |  30|   Andy|
  // |  19| Justin|
  // +----+-------+

Global Temporary View
• Temporary views in Spark SQL are session-scoped
  • They disappear if the session that created them terminates
• A global temporary view is shared among all sessions and kept alive until the Spark application terminates
  • Stored in a system-preserved database (global_temp)
• SELECT * FROM global_temp.people

  // Register the DataFrame as a global temporary view
  df.createGlobalTempView("people")

  // A global temporary view is tied to the system preserved database `global_temp`
  spark.sql("SELECT * FROM global_temp.people").show()
  // +----+-------+
  // | age|   name|
  // +----+-------+
  // |null|Michael|
  // |  30|   Andy|
  // |  19| Justin|
  // +----+-------+

  // A global temporary view is cross-session
  spark.newSession().sql("SELECT * FROM global_temp.people").show()
  // (same output as above)
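The session-scoped vs. global distinction is easiest to see side by side. A small sketch (not on the slides; the view name people_global is made up to avoid clashing with the global view registered above):

  // A session-scoped view is invisible to other sessions;
  // a global temporary view is not
  df.createOrReplaceTempView("people")        // session-scoped
  df.createGlobalTempView("people_global")    // application-scoped

  val other = spark.newSession()
  other.sql("SELECT * FROM global_temp.people_global").show()  // works

  // The line below would fail with an AnalysisException
  // ("Table or view not found"), because `people` belongs to
  // the original session only:
  // other.sql("SELECT * FROM people").show()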

Creating Datasets
• Datasets are similar to RDDs
• However, they serialize objects with an Encoder, not standard Java/Kryo serialization
  • Datasets use a non-standard serialization library (Spark's Encoder)
• Many Spark Dataset operations can be performed without deserializing objects

  case class Person(name: String, age: Long)

  // Encoders are created for case classes
  val caseClassDS = Seq(Person("Andy", 32)).toDS()
  caseClassDS.show()
  // +----+---+
  // |name|age|
  // +----+---+
  // |Andy| 32|
  // +----+---+

  // Encoders for most common types are automatically provided by
  // importing spark.implicits._
  val primitiveDS = Seq(1, 2, 3).toDS()
  primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

  // DataFrames can be converted to a Dataset by providing a class;
  // mapping is done by name
  val path = "examples/src/main/resources/people.json"
  val peopleDS = spark.read.json(path).as[Person]
  peopleDS.show()
  // +----+-------+
  // | age|   name|
  // +----+-------+
  // |null|Michael|
  // |  30|   Andy|
  // |  19| Justin|
  // +----+-------+

In-Memory Cluster Computing: Apache Spark
SQL, DataFrames and Datasets: Interoperating with RDDs

Interoperating with RDDs
• Two ways of converting RDDs into Datasets:
  • Case 1: use reflection to infer the schema of an RDD
  • Case 2: use a programmatic interface to construct a schema and then apply it to an existing RDD (see the sketch at the end of this section)

Interoperating with RDDs: 1. Using Reflection
• Automatic conversion of an RDD (containing case classes) to a DataFrame
• The case class defines the schema of the table
  • E.g., the names of the arguments to the case class are read using reflection and become the names of the columns
• Case classes can also be nested or contain complex types such as Seqs or Arrays
• The RDD is implicitly converted to a DataFrame and then registered as a table

  // For implicit conversions from RDDs to DataFrames
  import spark.implicits._

  // Create an RDD of Person objects from a text file, convert it to a DataFrame
  val peopleDF = spark.sparkContext
    .textFile("examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
    .toDF()

  // Register the DataFrame as a temporary view
  peopleDF.createOrReplaceTempView("people")

  // SQL statements can be run by using the sql methods provided by Spark
  val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")
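The slides walk through Case 1 only. For Case 2, here is a minimal sketch in the style of the Spark SQL programming guide (same people.txt input as above; the variable names are illustrative): the schema is built programmatically as a StructType and applied with createDataFrame.

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  // 1. Build the schema from a string of column names
  val schemaString = "name age"
  val fields = schemaString.split(" ")
    .map(fieldName => StructField(fieldName, StringType, nullable = true))
  val schema = StructType(fields)

  // 2. Convert each line of the text file into a Row
  val rowRDD = spark.sparkContext
    .textFile("examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(attributes => Row(attributes(0), attributes(1).trim))

  // 3. Apply the schema to the RDD of Rows
  val peopleDF2 = spark.createDataFrame(rowRDD, schema)
  peopleDF2.createOrReplaceTempView("people")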
