

SLIDE 1

SparkSQL

SLIDE 2

Where are we?

[Stack diagram: HDFS at the bottom; Hadoop MapReduce and Spark RDD above it; Pig (Pig Latin) and Hive (HiveQL) on top, with a gap marked "???" above Spark RDD]

SLIDE 3

Where are we?

[Stack diagram: same as before, with SQL marking the "???" gap above Spark RDD]

SLIDE 4

Shark (Spark on Hive)

- A small side project that aimed to run RDD jobs on Hive data using HiveQL
- Still limited to the Hive data model
- Tied to the Hadoop world

SLIDE 5

SparkSQL

- Redesigned around the Spark query model
- Supports all the popular relational operators
- Can be intermixed with RDD operations
- Uses the DataFrame API as an enhancement to the RDD API

Dataframe = RDD + schema

SLIDE 6

Dataframes

- SparkSQL's counterpart to relations or tables in an RDBMS
- Consists of rows and columns
- A dataframe is NOT in 1NF

Why? (Columns may hold nested values such as arrays and structs, so rows need not be atomic.)

Can be created from various data sources:

- CSV file
- JSON file
- MySQL database
- Hive

SLIDE 7

RDD Vs Dataframe

Log table: host, url, method, bytes, …

Q: Find the total number of bytes
SQL: SELECT SUM(bytes) FROM Log;

RDD:

sc.textFile("input")
  .map(line -> line.split("\t"))           // split by tab
  .map(parts -> Long.parseLong(parts[3]))  // keep the bytes column
  .reduce((a, b) -> a + b);

SparkSQL:

spark.read().csv("input")
  .groupBy()
  .sum("bytes");

SLIDE 8

RDD Vs Dataframe

[Diagram: an RDD processed as a stream of records r, r, …, r]

RDD vs SQL execution of the sum:

int sum = 0;
foreach value x:
    sum += x;
return sum;

With the RDD API, the reduce lambda is invoked once per record: millions of function calls. SparkSQL can instead generate the tight loop above.
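To make the cost of per-record calls concrete, here is a plain-Java sketch (class and method names are my own, not Spark's): the first reduction routes every record through a functional-interface call, as an RDD's reduce does with a user lambda; the second is the kind of tight loop a code generator can emit.

```java
import java.util.function.LongBinaryOperator;

public class SumStyles {
    // RDD-style: one call into user code per record
    static long reduceWithLambda(long[] values, LongBinaryOperator op) {
        long acc = 0;
        for (long v : values) {
            acc = op.applyAsLong(acc, v);   // per-record function call
        }
        return acc;
    }

    // Codegen-style: the "generated" tight loop, no per-record calls
    static long reduceGenerated(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] bytes = {100, 200, 300, 400};
        System.out.println(reduceWithLambda(bytes, (a, b) -> a + b)); // 1000
        System.out.println(reduceGenerated(bytes));                   // 1000
    }
}
```

Both return the same sum; the difference is only how many times control crosses into opaque user code per record.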

SLIDE 9

Dataframe Vs RDD

Dataframe:
- Lazy execution
- Spark is aware of the data model
- Spark is aware of the query logic
- Can optimize the query

RDD:
- Lazy execution
- The data model is hidden from Spark
- The transformations and actions are black boxes
- Cannot optimize the query
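Lazy execution means transformations only build up a plan; work happens when an action runs. A minimal toy model in plain Java (illustrative names, not Spark's API) makes the recording-then-executing split visible:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.UnaryOperator;

// Toy model of lazy execution: map() only records the transformation
// in a plan; nothing touches the data until the action collect() runs.
public class LazyPipeline {
    private final List<Integer> source;
    private final List<UnaryOperator<Integer>> plan = new ArrayList<>();

    public LazyPipeline(List<Integer> source) { this.source = source; }

    public LazyPipeline map(UnaryOperator<Integer> f) {
        plan.add(f);                   // recorded, not executed
        return this;
    }

    public List<Integer> collect() {   // the action: execute the plan
        List<Integer> out = new ArrayList<>();
        for (Integer v : source) {
            for (UnaryOperator<Integer> f : plan) v = f.apply(v);
            out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        LazyPipeline p = new LazyPipeline(List.of(1, 2, 3))
            .map(x -> { calls.incrementAndGet(); return x * 10; });
        System.out.println(calls.get());   // 0: nothing has run yet
        System.out.println(p.collect());   // [10, 20, 30]
        System.out.println(calls.get());   // 3: one call per record
    }
}
```

Because the whole plan exists before anything executes, a system that also understands the plan's semantics (as SparkSQL does for dataframes, but not for RDD lambdas) can rewrite it before running it.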

SLIDE 10

Built-in operations in SparkSQL

- Filter (selection)
- Select (projection)
- Join
- GroupBy (aggregation)
- Load/store in various formats
- Cache
- Conversion between RDD and dataframe (back and forth)
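As a conceptual analogue only (plain Java Streams over an in-memory list, not Spark code), the first three operations above look like this on a small log relation; `LogLine` and the helper names are my own:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class RelationalOps {
    record LogLine(String host, int response, long bytes) {}

    // Filter (selection): keep only the OK lines
    static List<LogLine> okLines(List<LogLine> log) {
        return log.stream()
                  .filter(l -> l.response() == 200)
                  .collect(Collectors.toList());
    }

    // Select (projection): keep only the host column
    static List<String> hosts(List<LogLine> log) {
        return log.stream().map(LogLine::host).collect(Collectors.toList());
    }

    // GroupBy (aggregation): total bytes per response code
    static Map<Integer, Long> bytesPerCode(List<LogLine> log) {
        return log.stream().collect(Collectors.groupingBy(
            LogLine::response, Collectors.summingLong(LogLine::bytes)));
    }

    public static void main(String[] args) {
        List<LogLine> log = List.of(
            new LogLine("a.com", 200, 100),
            new LogLine("b.com", 404, 50),
            new LogLine("a.com", 200, 300));
        System.out.println(okLines(log).size());        // 2
        System.out.println(hosts(okLines(log)));        // [a.com, a.com]
        System.out.println(bytesPerCode(log).get(200)); // 400
    }
}
```

The Spark versions have the same shape (`filter`, `select`, `groupBy(...).sum(...)`) but run lazily and distributed.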

SLIDE 11

SparkSQL Examples

SLIDE 12

Project Setup

<!-- In pom.xml, under <dependencies> -->
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.2.1</version>
</dependency>

SLIDE 13

Code Setup

SparkSession sparkS = SparkSession
    .builder()
    .appName("Spark SQL examples")
    .master("local")
    .getOrCreate();

Dataset<Row> log_file = sparkS.read()
    .option("delimiter", "\t")
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("nasa_log.tsv");

log_file.show();

SLIDE 14

Filter Example

// Select OK lines
Dataset<Row> ok_lines = log_file.filter("response = 200");
long ok_count = ok_lines.count();
System.out.println("Number of OK lines is " + ok_count);

// Grouped aggregation using SQL; the dataframe must first be
// registered as a temporary view named "log_lines"
log_file.createOrReplaceTempView("log_lines");
Dataset<Row> bytesPerCode = log_file.sqlContext()
    .sql("SELECT response, SUM(bytes) FROM log_lines GROUP BY response");

SLIDE 15

SparkSQL Features

- Catalyst query optimizer
- Code generation
- Integration with libraries

SLIDE 16

SparkSQL Query Plan

[Pipeline diagram: SQL AST or DataFrame → Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs]

DataFrames and SQL share the same optimization/execution pipeline.
Credits: M. Armbrust

SLIDE 17

Catalyst Query Optimizer

- Extensible rule-based optimizer
- Users can define their own rules

[Two plan trees over the table People, built from Project name, Project id,name, and Filter id = 1: the original plan applies the filter above the projections; after filter push-down, the filter is applied directly on People]
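A rule-based optimizer is a set of pattern-matching rewrites over plan trees. Here is a toy sketch in the spirit of the filter push-down example above; all plan classes and the rule are hypothetical, for illustration only (Catalyst's real rules operate on Scala trees):

```java
import java.util.List;

// A toy Catalyst-style rule: push a Filter below a Project when the
// filtered column survives the projection.
public class PushDownDemo {
    interface Plan {}
    record Scan(String table) implements Plan {
        public String toString() { return "Scan(" + table + ")"; }
    }
    record Project(List<String> cols, Plan child) implements Plan {
        public String toString() { return "Project" + cols + " -> " + child; }
    }
    record Filter(String col, Object value, Plan child) implements Plan {
        public String toString() { return "Filter(" + col + "=" + value + ") -> " + child; }
    }

    // The rewrite rule: Filter(Project(child)) => Project(Filter(child))
    static Plan pushFilterBelowProject(Plan plan) {
        if (plan instanceof Filter f && f.child() instanceof Project p
                && p.cols().contains(f.col())) {
            return new Project(p.cols(), new Filter(f.col(), f.value(), p.child()));
        }
        return plan;   // pattern did not match: leave the plan unchanged
    }

    public static void main(String[] args) {
        Plan original = new Filter("id", 1,
            new Project(List.of("id", "name"), new Scan("People")));
        // Filter(id=1) -> Project[id, name] -> Scan(People)
        System.out.println(original);
        // Project[id, name] -> Filter(id=1) -> Scan(People)
        System.out.println(pushFilterBelowProject(original));
    }
}
```

"Extensible" means users can register additional rules of exactly this shape; the optimizer applies them repeatedly until the plan stops changing.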

SLIDE 18

Code Generation

Shift from black-box UDFs to expressions. Example:

// Filter
Dataset<Row> ok_lines = log_file.filter("response = 200");

// Grouped aggregation (assumes the dataframe was registered
// as the temporary view "log_lines")
Dataset<Row> bytesPerCode = log_file.sqlContext()
    .sql("SELECT response, SUM(bytes) FROM log_lines GROUP BY response");

Because the filter and aggregation are expressions rather than opaque lambdas, SparkSQL understands the logic of user queries and can rewrite them into more efficient generated code.

SLIDE 19

Integration

- SparkSQL is integrated with other high-level interfaces such as MLlib, PySpark, and SparkR
- SparkSQL is also integrated with the RDD interface, and the two can be mixed in one program

SLIDE 20

Further Reading

Documentation:
- http://spark.apache.org/docs/latest/sql-programming-guide.html

SparkSQL paper:
- M. Armbrust et al. "Spark SQL: Relational Data Processing in Spark." SIGMOD 2015.