

SLIDE 1

SparkSQL

SLIDE 2

Where are we?

[Stack diagram: HDFS at the bottom; Hadoop MapReduce and Spark RDD above it; Pig (Pig Latin) and Hive (HiveQL) on top, with a gap marked "???" above Spark RDD]

SLIDE 3

Where are we?

[Stack diagram: same as before, with SQL marking the "???" gap above Spark RDD]

SLIDE 4

Shark (Spark on Hive)

- A small side project that aimed to run RDD jobs on Hive data using HiveQL
- Still limited to the Hive data model
- Tied to the Hadoop world

SLIDE 5

SparkSQL

- Redesigned around the Spark query model
- Supports all the popular relational operators
- Can be intermixed with RDD operations
- Uses the DataFrame API as an enhancement to the RDD API

Dataframe = RDD + schema

SLIDE 6

Dataframes

- SparkSQL's counterpart to relations or tables in an RDBMS
- Consists of rows and columns
- A dataframe is NOT in 1NF

Why? (Columns may hold nested values such as arrays and structs, so rows need not be atomic.)

Can be created from various data sources:

- CSV file
- JSON file
- MySQL database
- Hive

SLIDE 7

RDD Vs Dataframe

Log table: host, url, method, bytes, …

Q: Find the total number of bytes
SQL: SELECT SUM(bytes) FROM Log;

RDD:

sc.textFile("input")
  .map(line -> line.split("\t"))           // split by tab
  .map(parts -> Long.parseLong(parts[3]))  // keep the bytes column
  .reduce((a, b) -> a + b);

SparkSQL:

spark.read().csv("input")
  .groupBy()
  .sum("bytes");

SLIDE 8

RDD Vs Dataframe

[Diagram: an RDD processed as a stream of records r, r, …, r]

RDD vs SQL execution of the sum:

int sum = 0;
foreach value x:
    sum += x;
return sum;

With the RDD API, the reduce lambda is invoked once per record: millions of function calls. SparkSQL can instead generate the tight loop above.
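To make the cost of per-record calls concrete, here is a plain-Java sketch (class and method names are my own, not Spark's): the first reduction routes every record through a functional-interface call, as an RDD's reduce does with a user lambda; the second is the kind of tight loop a code generator can emit.

```java
import java.util.function.LongBinaryOperator;

public class SumStyles {
    // RDD-style: one call into user code per record
    static long reduceWithLambda(long[] values, LongBinaryOperator op) {
        long acc = 0;
        for (long v : values) {
            acc = op.applyAsLong(acc, v);   // per-record function call
        }
        return acc;
    }

    // Codegen-style: the "generated" tight loop, no per-record calls
    static long reduceGenerated(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] bytes = {100, 200, 300, 400};
        System.out.println(reduceWithLambda(bytes, (a, b) -> a + b)); // 1000
        System.out.println(reduceGenerated(bytes));                   // 1000
    }
}
```

Both return the same sum; the difference is only how many times control crosses into opaque user code per record.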

SLIDE 9

Dataframe Vs RDD

Dataframe:
- Lazy execution
- Spark is aware of the data model
- Spark is aware of the query logic
- Can optimize the query

RDD:
- Lazy execution
- The data model is hidden from Spark
- The transformations and actions are black boxes
- Cannot optimize the query
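Lazy execution means transformations only build up a plan; work happens when an action runs. A minimal toy model in plain Java (illustrative names, not Spark's API) makes the recording-then-executing split visible:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.UnaryOperator;

// Toy model of lazy execution: map() only records the transformation
// in a plan; nothing touches the data until the action collect() runs.
public class LazyPipeline {
    private final List<Integer> source;
    private final List<UnaryOperator<Integer>> plan = new ArrayList<>();

    public LazyPipeline(List<Integer> source) { this.source = source; }

    public LazyPipeline map(UnaryOperator<Integer> f) {
        plan.add(f);                   // recorded, not executed
        return this;
    }

    public List<Integer> collect() {   // the action: execute the plan
        List<Integer> out = new ArrayList<>();
        for (Integer v : source) {
            for (UnaryOperator<Integer> f : plan) v = f.apply(v);
            out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        LazyPipeline p = new LazyPipeline(List.of(1, 2, 3))
            .map(x -> { calls.incrementAndGet(); return x * 10; });
        System.out.println(calls.get());   // 0: nothing has run yet
        System.out.println(p.collect());   // [10, 20, 30]
        System.out.println(calls.get());   // 3: one call per record
    }
}
```

Because the whole plan exists before anything executes, a system that also understands the plan's semantics (as SparkSQL does for dataframes, but not for RDD lambdas) can rewrite it before running it.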

SLIDE 10

Built-in operations in SparkSQL

- Filter (selection)
- Select (projection)
- Join
- GroupBy (aggregation)
- Load/store in various formats
- Cache
- Conversion between RDD and dataframe (back and forth)
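As a conceptual analogue only (plain Java Streams over an in-memory list, not Spark code), the first three operations above look like this on a small log relation; `LogLine` and the helper names are my own:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class RelationalOps {
    record LogLine(String host, int response, long bytes) {}

    // Filter (selection): keep only the OK lines
    static List<LogLine> okLines(List<LogLine> log) {
        return log.stream()
                  .filter(l -> l.response() == 200)
                  .collect(Collectors.toList());
    }

    // Select (projection): keep only the host column
    static List<String> hosts(List<LogLine> log) {
        return log.stream().map(LogLine::host).collect(Collectors.toList());
    }

    // GroupBy (aggregation): total bytes per response code
    static Map<Integer, Long> bytesPerCode(List<LogLine> log) {
        return log.stream().collect(Collectors.groupingBy(
            LogLine::response, Collectors.summingLong(LogLine::bytes)));
    }

    public static void main(String[] args) {
        List<LogLine> log = List.of(
            new LogLine("a.com", 200, 100),
            new LogLine("b.com", 404, 50),
            new LogLine("a.com", 200, 300));
        System.out.println(okLines(log).size());        // 2
        System.out.println(hosts(okLines(log)));        // [a.com, a.com]
        System.out.println(bytesPerCode(log).get(200)); // 400
    }
}
```

The Spark versions have the same shape (`filter`, `select`, `groupBy(...).sum(...)`) but run lazily and distributed.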

SLIDE 11

SparkSQL Examples

SLIDE 12

Project Setup

<!-- In pom.xml, under <dependencies> -->
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.2.1</version>
</dependency>

SLIDE 13

Code Setup

SparkSession sparkS = SparkSession
    .builder()
    .appName("Spark SQL examples")
    .master("local")
    .getOrCreate();

Dataset<Row> log_file = sparkS.read()
    .option("delimiter", "\t")
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("nasa_log.tsv");

log_file.show();

SLIDE 14

Filter Example

// Select OK lines
Dataset<Row> ok_lines = log_file.filter("response = 200");
long ok_count = ok_lines.count();
System.out.println("Number of OK lines is " + ok_count);

// Grouped aggregation using SQL; the dataframe must first be
// registered as a temporary view named "log_lines"
log_file.createOrReplaceTempView("log_lines");
Dataset<Row> bytesPerCode = log_file.sqlContext()
    .sql("SELECT response, SUM(bytes) FROM log_lines GROUP BY response");

SLIDE 15

SparkSQL Features

- Catalyst query optimizer
- Code generation
- Integration with libraries

SLIDE 16

SparkSQL Query Plan

[Pipeline diagram: SQL AST or DataFrame → Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs]

DataFrames and SQL share the same optimization/execution pipeline.
Credits: M. Armbrust

SLIDE 17

Catalyst Query Optimizer

- Extensible rule-based optimizer
- Users can define their own rules

[Two plan trees over the table People, built from Project name, Project id,name, and Filter id = 1: the original plan applies the filter above the projections; after filter push-down, the filter is applied directly on People]
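A rule-based optimizer is a set of pattern-matching rewrites over plan trees. Here is a toy sketch in the spirit of the filter push-down example above; all plan classes and the rule are hypothetical, for illustration only (Catalyst's real rules operate on Scala trees):

```java
import java.util.List;

// A toy Catalyst-style rule: push a Filter below a Project when the
// filtered column survives the projection.
public class PushDownDemo {
    interface Plan {}
    record Scan(String table) implements Plan {
        public String toString() { return "Scan(" + table + ")"; }
    }
    record Project(List<String> cols, Plan child) implements Plan {
        public String toString() { return "Project" + cols + " -> " + child; }
    }
    record Filter(String col, Object value, Plan child) implements Plan {
        public String toString() { return "Filter(" + col + "=" + value + ") -> " + child; }
    }

    // The rewrite rule: Filter(Project(child)) => Project(Filter(child))
    static Plan pushFilterBelowProject(Plan plan) {
        if (plan instanceof Filter f && f.child() instanceof Project p
                && p.cols().contains(f.col())) {
            return new Project(p.cols(), new Filter(f.col(), f.value(), p.child()));
        }
        return plan;   // pattern did not match: leave the plan unchanged
    }

    public static void main(String[] args) {
        Plan original = new Filter("id", 1,
            new Project(List.of("id", "name"), new Scan("People")));
        // Filter(id=1) -> Project[id, name] -> Scan(People)
        System.out.println(original);
        // Project[id, name] -> Filter(id=1) -> Scan(People)
        System.out.println(pushFilterBelowProject(original));
    }
}
```

"Extensible" means users can register additional rules of exactly this shape; the optimizer applies them repeatedly until the plan stops changing.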

SLIDE 18

Code Generation

Shift from black-box UDFs to expressions. Example:

// Filter
Dataset<Row> ok_lines = log_file.filter("response = 200");

// Grouped aggregation (assumes the dataframe was registered
// as the temporary view "log_lines")
Dataset<Row> bytesPerCode = log_file.sqlContext()
    .sql("SELECT response, SUM(bytes) FROM log_lines GROUP BY response");

Because the filter and aggregation are expressions rather than opaque lambdas, SparkSQL understands the logic of user queries and can rewrite them into more efficient generated code.

SLIDE 19

Integration

- SparkSQL is integrated with other high-level interfaces such as MLlib, PySpark, and SparkR
- SparkSQL is also integrated with the RDD interface, and the two can be mixed in one program

SLIDE 20

Further Reading

Documentation:
- http://spark.apache.org/docs/latest/sql-programming-guide.html

SparkSQL paper:
- M. Armbrust et al. "Spark SQL: Relational Data Processing in Spark." SIGMOD 2015.