SparkSQL
11/14/2018 1
Where are we?

The big-data stack so far:

- Storage: HDFS
- Execution engines: Hadoop MapReduce, Spark RDD, ...
- High-level languages: Pig Latin (on Pig), HiveQL (on Hive)
- ??? : a SQL interface over Spark is still missing
Shark

- A small side project that aimed to run RDD jobs on Hive data using HiveQL
- Still limited to the data model of Hive
- Tied to the Hadoop world
SparkSQL

- Redesigned around the Spark query model
- Supports all the popular relational operators
- Can be intermixed with RDD operations
- Uses the DataFrame API as an enhancement to the RDD API
DataFrame = RDD + schema

- SparkSQL's counterpart to relations or tables in an RDBMS
- Consists of rows and columns
- A DataFrame is NOT in 1NF: a column may hold a nested or complex value. Why?
- Can be created from various data sources: CSV file, JSON file, MySQL database, Hive
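The sources above all go through the same `read()` entry point. A minimal sketch, assuming hypothetical file names (`events.csv`, `events.json`) and placeholder MySQL connection settings:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameSources {
  // CSV: first line holds column names; types are inferred from a data scan
  static Dataset<Row> readCsv(SparkSession spark, String path) {
    return spark.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(path);
  }

  // JSON: one JSON object per line; the schema is inferred automatically
  static Dataset<Row> readJson(SparkSession spark, String path) {
    return spark.read().json(path);
  }

  // MySQL via JDBC (connection details are placeholders)
  static Dataset<Row> readMysql(SparkSession spark) {
    return spark.read()
        .format("jdbc")
        .option("url", "jdbc:mysql://localhost:3306/weblogs")
        .option("dbtable", "events")
        .option("user", "reader")
        .option("password", "secret")
        .load();
  }
}
```

Whatever the source, the inferred or declared schema travels with the DataFrame, which is what makes the later optimizations possible.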
DataFrame vs. RDD

DataFrame:
- Lazy execution
- Spark is aware of the data model and the query logic
- Spark can optimize the query

RDD:
- Lazy execution
- The data model is hidden from Spark; the transformations and actions are black boxes
- Spark cannot optimize the query
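The contrast shows up in how a filter is written. A sketch, assuming the `log_file` data and its `response` column from the running example:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class FilterContrast {
  // RDD style: the predicate is an arbitrary Java lambda. Spark executes it
  // but cannot look inside it, so no optimization is possible.
  static JavaRDD<Row> rddFilter(JavaRDD<Row> lines) {
    return lines.filter(row -> row.<Integer>getAs("response") == 200);
  }

  // DataFrame style: the predicate is an expression string. Catalyst can
  // parse it, check it against the schema, and push it down or reorder it.
  static Dataset<Row> dataframeFilter(Dataset<Row> lines) {
    return lines.filter("response = 200");
  }
}
```

Both versions are lazy; only the second one gives Spark something it can reason about.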
DataFrame operations

- Filter (selection)
- Select (projection)
- Join
- GroupBy (aggregation)
- Load/store in various formats
- Cache
- Conversion to/from RDD (back and forth)
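A quick tour over these operations, sketched on the running log example; the `hosts` DataFrame with a matching `host` column is hypothetical:

```java
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class DataFrameOps {
  static Dataset<Row> tour(Dataset<Row> log_file, Dataset<Row> hosts) {
    Dataset<Row> ok = log_file.filter("response = 200");   // selection
    Dataset<Row> proj = ok.select("host", "bytes");        // projection
    Dataset<Row> joined = proj.join(hosts, "host");        // equi-join on host
    Dataset<Row> totals = joined.groupBy("host")
        .agg(sum("bytes"));                                // aggregation
    totals.cache();                                        // keep in memory
    totals.write().mode("overwrite").json("totals_json");  // store as JSON
    totals.javaRDD();                                      // convert to an RDD
    return totals;
  }
}
```

Each step returns a new DataFrame, so the operations compose the same way RDD transformations do.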
<!-- In the dependencies section of pom.xml -->
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.2.1</version>
</dependency>
SparkSession sparkS = SparkSession
    .builder()
    .appName("Spark SQL examples")
    .master("local")
    .getOrCreate();

Dataset<Row> log_file = sparkS.read()
    .option("delimiter", "\t")
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("nasa_log.tsv");

log_file.show();
// Select OK lines
Dataset<Row> ok_lines = log_file.filter("response = 200");
long ok_count = ok_lines.count();
System.out.println("Number of OK lines is " + ok_count);

// Grouped aggregation using SQL: register the DataFrame as a
// temporary view first so the query can reference it by name
log_file.createOrReplaceTempView("log_lines");
Dataset<Row> bytesPerCode = log_file.sqlContext().sql(
    "SELECT response, SUM(bytes) FROM log_lines GROUP BY response");
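The same grouped aggregation can be written with the DataFrame API instead of a SQL string. A sketch, assuming the `log_file` DataFrame from the previous slide:

```java
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class GroupedAgg {
  // Equivalent to: SELECT response, SUM(bytes) FROM log_lines GROUP BY response
  static Dataset<Row> bytesPerCode(Dataset<Row> log_file) {
    return log_file.groupBy("response").agg(sum("bytes"));
  }
}
```

Both forms reach the same optimizer, so the choice between SQL strings and API calls is a matter of style.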
SparkSQL features

- Catalyst query optimizer
- Code generation
- Integration with libraries
Catalyst pipeline

SQL AST or DataFrame
  → Unresolved Logical Plan
  → [Analysis, using the Catalog]
  → Logical Plan
  → [Logical Optimization]
  → Optimized Logical Plan
  → [Physical Planning] → Physical Plans → [Cost Model]
  → Selected Physical Plan
  → [Code Generation]
  → RDDs

DataFrames and SQL share the same optimization/execution pipeline. Credits: M. Armbrust
Catalyst

- Extensible rule-based optimizer
- Users can define their own rules
Example: filter push-down

Original plan:            With filter push-down:
  Project name              Project name
  Filter id = 1             Project id, name
  Project id, name          Filter id = 1
  People                    People

The Filter (id = 1) is moved next to the scan of People, so fewer rows flow through the projections above it.
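One way to watch Catalyst at work is `Dataset.explain(true)`, which prints the parsed, analyzed, optimized, and physical plans. A sketch over the running log example; the exact plan text varies by Spark version:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class ShowPlans {
  static void showPushDown(Dataset<Row> log_file) {
    Dataset<Row> q = log_file
        .select("host", "response", "bytes")
        .filter("response = 200");
    // Prints the logical, optimized logical, and physical plans;
    // in the optimized plan the filter is typically pushed toward the scan.
    q.explain(true);
  }
}
```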
Shift from black-box UDFs to expressions

Example:

// Filter: the predicate is an expression string that Catalyst can analyze
Dataset<Row> ok_lines = log_file.filter("response = 200");

// Grouped aggregation (log_lines is a temporary view over log_file)
Dataset<Row> bytesPerCode = log_file.sqlContext().sql(
    "SELECT response, SUM(bytes) FROM log_lines GROUP BY response");
SparkSQL is integrated with other high-level interfaces such as MLlib, PySpark, and SparkR. It is also integrated with the RDD interface: the two APIs can be mixed in one program.
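Crossing the boundary works in both directions. A sketch on the running log example (collecting to the driver here is only for illustration; it would not scale to a large RDD):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MixWithRdds {
  static Dataset<String> mix(SparkSession spark, Dataset<Row> log_file) {
    // DataFrame -> RDD: drop down to arbitrary Java code on each row
    JavaRDD<String> hostsRdd = log_file.javaRDD()
        .map(row -> row.<String>getAs("host"));

    // RDD -> DataFrame/Dataset: re-attach a schema, here via an encoder
    // (for Row data, spark.createDataFrame with a StructType works too)
    return spark.createDataset(hostsRdd.collect(), Encoders.STRING());
  }
}
```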
References

- Documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html
- SparkSQL paper: M. Armbrust et al. "Spark SQL: Relational data processing in Spark." SIGMOD 2015