SparkSQL - PowerPoint PPT Presentation



SLIDE 1

SparkSQL

SLIDE 2

Where are we?

[Stack diagram: HDFS at the bottom; Hadoop MapReduce and Spark RDD running on top of it; Pig (with Pig Latin), Hive (with HiveQL), and an unknown "???" system at the top]

SLIDE 3

Where are we?

[Stack diagram: same stack as before, with the "???" system now shown exposing SQL - the gap that SparkSQL fills]

SLIDE 4

Shark (Spark on Hive)

A small side project that aimed to run RDD jobs on Hive data using HiveQL. It was still limited to the data model of Hive and tied to the Hadoop world.

SLIDE 5

SparkSQL

Redesigned around Spark's query model:
- Supports all the popular relational operators
- Can be intermixed with RDD operations
- Uses the DataFrame API as an enhancement to the RDD API

DataFrame = RDD + schema
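The equation above can be sketched directly in Java: a plain RDD of rows plus an explicit schema yields a DataFrame. This is a minimal illustration with made-up data, not code from the deck.

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class RddPlusSchema {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("DataFrame = RDD + schema").master("local").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // A plain RDD of rows: Spark treats the elements as opaque objects.
        JavaRDD<Row> rdd = jsc.parallelize(Arrays.asList(
                RowFactory.create(200, 1024L),
                RowFactory.create(404, 512L)));

        // Attaching a schema is what turns the RDD into a DataFrame.
        StructType schema = new StructType()
                .add("response", DataTypes.IntegerType)
                .add("bytes", DataTypes.LongType);
        Dataset<Row> df = spark.createDataFrame(rdd, schema);

        df.printSchema();
        System.out.println(df.count());
        spark.stop();
    }
}
```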

SLIDE 6

Dataframes

SparkSQL's counterpart to relations or tables in an RDBMS:
- Consists of rows and columns
- A DataFrame is NOT required to be in 1NF

Why?

Can be created from various data sources:
- CSV file
- JSON file
- MySQL database
- Hive
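Each of the sources above goes through the same `spark.read()` entry point. A minimal sketch, writing its own CSV file so it is self-contained; the JSON and JDBC variants are shown as comments with hypothetical file names and connection details.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataSources {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("DataFrame sources").master("local").getOrCreate();

        // Write a tiny CSV file so the example is self-contained.
        Path csv = Files.createTempFile("events", ".csv");
        Files.write(csv, "response,bytes\n200,1024\n404,512\n".getBytes("UTF-8"));

        Dataset<Row> fromCsv = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv(csv.toString());
        fromCsv.show();

        // The other sources follow the same reader pattern
        // (names and connection details here are hypothetical):
        // spark.read().json("events.json");
        // spark.read().format("jdbc")
        //      .option("url", "jdbc:mysql://localhost:3306/mydb")
        //      .option("dbtable", "log_lines").load();
        spark.stop();
    }
}
```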

SLIDE 7

Dataframe vs. RDD

Dataframe:
- Lazy execution
- Spark is aware of the data model
- Spark is aware of the query logic
- Can optimize the query

RDD:
- Lazy execution
- The data model is hidden from Spark
- The transformations and actions are black boxes
- Cannot optimize the query
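The contrast is visible in code when the same predicate is written both ways: the RDD version hides the logic inside a Java lambda, while the DataFrame version hands Spark an expression it can parse. A sketch with made-up in-memory data:

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class OpaqueVsVisible {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("DataFrame vs RDD").master("local").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        StructType schema = new StructType()
                .add("response", DataTypes.IntegerType)
                .add("bytes", DataTypes.LongType);
        Dataset<Row> log = spark.createDataFrame(
                jsc.parallelize(Arrays.asList(
                        RowFactory.create(200, 1024L),
                        RowFactory.create(404, 512L),
                        RowFactory.create(200, 2048L))),
                schema);

        // RDD style: the predicate is an arbitrary Java lambda.
        // Spark executes it but cannot look inside it.
        JavaRDD<Row> okRdd = log.javaRDD().filter(r -> r.getInt(0) == 200);

        // DataFrame style: the predicate is an expression Spark parses,
        // so the optimizer can reason about it (and, e.g., push it down).
        Dataset<Row> okDf = log.filter("response = 200");

        System.out.println(okRdd.count() + " " + okDf.count());
        spark.stop();
    }
}
```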

SLIDE 8

Built-in operations in SparkSQL

- Filter (selection)
- Select (projection)
- Join
- GroupBy (aggregation)
- Load/store in various formats
- Cache
- Conversion between RDD (back and forth)
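Most of the operations in this list compose into a single chain. A self-contained sketch with two small made-up tables (not from the deck) touching filter, select, join, groupBy, cache, and the RDD conversion:

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class BuiltInOps {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Built-in operations").master("local").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        Dataset<Row> logs = spark.createDataFrame(
                jsc.parallelize(Arrays.asList(
                        RowFactory.create(200, 1024L),
                        RowFactory.create(200, 2048L),
                        RowFactory.create(404, 512L))),
                new StructType().add("response", DataTypes.IntegerType)
                                .add("bytes", DataTypes.LongType));
        Dataset<Row> codes = spark.createDataFrame(
                jsc.parallelize(Arrays.asList(
                        RowFactory.create(200, "OK"),
                        RowFactory.create(404, "Not Found"))),
                new StructType().add("response", DataTypes.IntegerType)
                                .add("label", DataTypes.StringType));

        Dataset<Row> result = logs
                .filter("bytes > 0")              // Filter (selection)
                .select("response", "bytes")      // Select (projection)
                .join(codes, "response")          // Join on the common column
                .groupBy("label")                 // GroupBy (aggregation)
                .sum("bytes");
        result.cache();                           // Cache for reuse
        result.show();

        // Conversion back to an RDD:
        System.out.println(result.javaRDD().count());
        spark.stop();
    }
}
```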

SLIDE 9

SparkSQL Examples

SLIDE 10

Project Setup

In the dependencies section of pom.xml:

<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.2.1</version>
</dependency>

SLIDE 11

Code Setup

SparkSession sparkS = SparkSession
    .builder()
    .appName("Spark SQL examples")
    .master("local")
    .getOrCreate();

Dataset<Row> log_file = sparkS.read()
    .option("delimiter", "\t")
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("nasa_log.tsv");

log_file.show();

SLIDE 12

Filter Example

// Select OK lines
Dataset<Row> ok_lines = log_file.filter("response = 200");
long ok_count = ok_lines.count();
System.out.println("Number of OK lines is " + ok_count);

// Grouped aggregation using SQL
// (the table name must first be registered as a view)
log_file.createOrReplaceTempView("log_lines");
Dataset<Row> bytesPerCode = log_file.sqlContext()
    .sql("SELECT response, sum(bytes) FROM log_lines GROUP BY response");

SLIDE 13

SparkSQL Features

- Catalyst query optimizer
- Code generation
- Integration with libraries

SLIDE 14

SparkSQL Query Plan

[Pipeline diagram: a SQL AST or a DataFrame becomes an Unresolved Logical Plan; Analysis, consulting the Catalog, resolves it into a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning generates candidate Physical Plans; a Cost Model selects the Physical Plan to run; Code Generation compiles it down to RDDs]

DataFrames and SQL share the same optimization/execution pipeline.
Credits: M. Armbrust

SLIDE 15

Catalyst Query Optimizer

Extensible rule-based optimizer; users can define their own rules.

[Diagram: filter push-down example on table People. Original plan: Project(name) and Project(id, name) with Filter(id = 1) applied above them. After filter push-down: Filter(id = 1) is applied directly on People, below the projections]
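One way to see this rule fire is to write the filter above the projections and ask Spark for the plan; the optimized logical plan shows the filter moved next to the scan. A self-contained sketch with a made-up People table:

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class PushDownDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Filter push-down").master("local").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        Dataset<Row> people = spark.createDataFrame(
                jsc.parallelize(Arrays.asList(
                        RowFactory.create(1, "Alice", 30),
                        RowFactory.create(2, "Bob", 25))),
                new StructType().add("id", DataTypes.IntegerType)
                                .add("name", DataTypes.StringType)
                                .add("age", DataTypes.IntegerType));

        // Filter written above a projection, as in the original plan.
        Dataset<Row> q = people.select("id", "name").filter("id = 1").select("name");

        // Prints parsed, analyzed, optimized, and physical plans; the
        // optimized plan has the filter pushed toward the scan of people.
        q.explain(true);
        q.show();
        spark.stop();
    }
}
```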

SLIDE 16

Code Generation

Shift from black-box UDFs to expressions.

Example:

// Filter
Dataset<Row> ok_lines = log_file.filter("response = 200");

// Grouped aggregation
Dataset<Row> bytesPerCode = log_file.sqlContext()
    .sql("SELECT response, sum(bytes) FROM log_lines GROUP BY response");

SparkSQL understands the logic of such queries and can rewrite them into more efficient generated code.

SLIDE 17

Integration

SparkSQL is integrated with other high-level interfaces such as MLlib, PySpark, and SparkR. It is also integrated with the RDD interface, and the two can be mixed in one program.
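Mixing the two interfaces in one program usually means dropping from a DataFrame to its RDD for an arbitrary transformation and reattaching the schema afterwards. A minimal sketch with made-up data:

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class MixRddAndSql {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Mixing RDDs and DataFrames").master("local").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        StructType schema = new StructType()
                .add("response", DataTypes.IntegerType)
                .add("bytes", DataTypes.LongType);
        Dataset<Row> df = spark.createDataFrame(
                jsc.parallelize(Arrays.asList(
                        RowFactory.create(200, 1024L),
                        RowFactory.create(404, 512L))),
                schema);

        // DataFrame -> RDD: use the RDD API for an arbitrary transformation.
        JavaRDD<Row> doubled = df.javaRDD()
                .map(r -> RowFactory.create(r.getInt(0), r.getLong(1) * 2));

        // RDD -> DataFrame: reattach the schema to return to SparkSQL.
        Dataset<Row> back = spark.createDataFrame(doubled, schema);
        back.filter("bytes > 1500").show();
        spark.stop();
    }
}
```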

SLIDE 18

Further Reading

Documentation:
http://spark.apache.org/docs/latest/sql-programming-guide.html

SparkSQL paper:
M. Armbrust et al. "Spark SQL: Relational data processing in Spark." SIGMOD 2015.