SparkSQL - PowerPoint PPT Presentation



SLIDE 1

SparkSQL

SLIDE 2

Where are we?

[Stack diagram: HDFS at the bottom; Hadoop MapReduce and Spark RDD running on top of it; Pig (with Pig Latin), Hive (with HiveQL), and an unknown "???" system at the top]

SLIDE 3

Where are we?

[Stack diagram: same stack as before, with the "???" system now shown exposing SQL - the gap that SparkSQL fills]

SLIDE 4

Shark (Spark on Hive)

A small side project that aimed to run RDD jobs on Hive data using HiveQL. It was still limited to the data model of Hive and tied to the Hadoop world.

SLIDE 5

SparkSQL

Redesigned around Spark's query model:
- Supports all the popular relational operators
- Can be intermixed with RDD operations
- Uses the DataFrame API as an enhancement to the RDD API

DataFrame = RDD + schema
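The equation above can be sketched directly in Java: a plain RDD of rows plus an explicit schema yields a DataFrame. This is a minimal illustration with made-up data, not code from the deck.

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class RddPlusSchema {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("DataFrame = RDD + schema").master("local").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // A plain RDD of rows: Spark treats the elements as opaque objects.
        JavaRDD<Row> rdd = jsc.parallelize(Arrays.asList(
                RowFactory.create(200, 1024L),
                RowFactory.create(404, 512L)));

        // Attaching a schema is what turns the RDD into a DataFrame.
        StructType schema = new StructType()
                .add("response", DataTypes.IntegerType)
                .add("bytes", DataTypes.LongType);
        Dataset<Row> df = spark.createDataFrame(rdd, schema);

        df.printSchema();
        System.out.println(df.count());
        spark.stop();
    }
}
```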

SLIDE 6

Dataframes

SparkSQL's counterpart to relations or tables in an RDBMS:
- Consists of rows and columns
- A DataFrame is NOT required to be in 1NF

Why?

Can be created from various data sources:
- CSV file
- JSON file
- MySQL database
- Hive
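Each of the sources above goes through the same `spark.read()` entry point. A minimal sketch, writing its own CSV file so it is self-contained; the JSON and JDBC variants are shown as comments with hypothetical file names and connection details.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataSources {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("DataFrame sources").master("local").getOrCreate();

        // Write a tiny CSV file so the example is self-contained.
        Path csv = Files.createTempFile("events", ".csv");
        Files.write(csv, "response,bytes\n200,1024\n404,512\n".getBytes("UTF-8"));

        Dataset<Row> fromCsv = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv(csv.toString());
        fromCsv.show();

        // The other sources follow the same reader pattern
        // (names and connection details here are hypothetical):
        // spark.read().json("events.json");
        // spark.read().format("jdbc")
        //      .option("url", "jdbc:mysql://localhost:3306/mydb")
        //      .option("dbtable", "log_lines").load();
        spark.stop();
    }
}
```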

SLIDE 7

Dataframe vs. RDD

Dataframe:
- Lazy execution
- Spark is aware of the data model
- Spark is aware of the query logic
- Can optimize the query

RDD:
- Lazy execution
- The data model is hidden from Spark
- The transformations and actions are black boxes
- Cannot optimize the query
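The contrast is visible in code when the same predicate is written both ways: the RDD version hides the logic inside a Java lambda, while the DataFrame version hands Spark an expression it can parse. A sketch with made-up in-memory data:

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class OpaqueVsVisible {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("DataFrame vs RDD").master("local").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        StructType schema = new StructType()
                .add("response", DataTypes.IntegerType)
                .add("bytes", DataTypes.LongType);
        Dataset<Row> log = spark.createDataFrame(
                jsc.parallelize(Arrays.asList(
                        RowFactory.create(200, 1024L),
                        RowFactory.create(404, 512L),
                        RowFactory.create(200, 2048L))),
                schema);

        // RDD style: the predicate is an arbitrary Java lambda.
        // Spark executes it but cannot look inside it.
        JavaRDD<Row> okRdd = log.javaRDD().filter(r -> r.getInt(0) == 200);

        // DataFrame style: the predicate is an expression Spark parses,
        // so the optimizer can reason about it (and, e.g., push it down).
        Dataset<Row> okDf = log.filter("response = 200");

        System.out.println(okRdd.count() + " " + okDf.count());
        spark.stop();
    }
}
```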

SLIDE 8

Built-in operations in SparkSQL

- Filter (selection)
- Select (projection)
- Join
- GroupBy (aggregation)
- Load/store in various formats
- Cache
- Conversion between RDD (back and forth)
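Most of the operations in this list compose into a single chain. A self-contained sketch with two small made-up tables (not from the deck) touching filter, select, join, groupBy, cache, and the RDD conversion:

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class BuiltInOps {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Built-in operations").master("local").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        Dataset<Row> logs = spark.createDataFrame(
                jsc.parallelize(Arrays.asList(
                        RowFactory.create(200, 1024L),
                        RowFactory.create(200, 2048L),
                        RowFactory.create(404, 512L))),
                new StructType().add("response", DataTypes.IntegerType)
                                .add("bytes", DataTypes.LongType));
        Dataset<Row> codes = spark.createDataFrame(
                jsc.parallelize(Arrays.asList(
                        RowFactory.create(200, "OK"),
                        RowFactory.create(404, "Not Found"))),
                new StructType().add("response", DataTypes.IntegerType)
                                .add("label", DataTypes.StringType));

        Dataset<Row> result = logs
                .filter("bytes > 0")              // Filter (selection)
                .select("response", "bytes")      // Select (projection)
                .join(codes, "response")          // Join on the common column
                .groupBy("label")                 // GroupBy (aggregation)
                .sum("bytes");
        result.cache();                           // Cache for reuse
        result.show();

        // Conversion back to an RDD:
        System.out.println(result.javaRDD().count());
        spark.stop();
    }
}
```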

SLIDE 9

SparkSQL Examples

SLIDE 10

Project Setup

In the dependencies section of pom.xml:

<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.2.1</version>
</dependency>

SLIDE 11

Code Setup

SparkSession sparkS = SparkSession
    .builder()
    .appName("Spark SQL examples")
    .master("local")
    .getOrCreate();

Dataset<Row> log_file = sparkS.read()
    .option("delimiter", "\t")
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("nasa_log.tsv");

log_file.show();

SLIDE 12

Filter Example

// Select OK lines
Dataset<Row> ok_lines = log_file.filter("response = 200");
long ok_count = ok_lines.count();
System.out.println("Number of OK lines is " + ok_count);

// Grouped aggregation using SQL
// (the table name must first be registered as a view)
log_file.createOrReplaceTempView("log_lines");
Dataset<Row> bytesPerCode = log_file.sqlContext()
    .sql("SELECT response, sum(bytes) FROM log_lines GROUP BY response");

SLIDE 13

SparkSQL Features

- Catalyst query optimizer
- Code generation
- Integration with libraries

SLIDE 14

SparkSQL Query Plan

[Pipeline diagram: a SQL AST or a DataFrame becomes an Unresolved Logical Plan; Analysis, consulting the Catalog, resolves it into a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning generates candidate Physical Plans; a Cost Model selects the Physical Plan to run; Code Generation compiles it down to RDDs]

DataFrames and SQL share the same optimization/execution pipeline.
Credits: M. Armbrust

SLIDE 15

Catalyst Query Optimizer

Extensible rule-based optimizer; users can define their own rules.

[Diagram: filter push-down example on table People. Original plan: Project(name) and Project(id, name) with Filter(id = 1) applied above them. After filter push-down: Filter(id = 1) is applied directly on People, below the projections]
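One way to see this rule fire is to write the filter above the projections and ask Spark for the plan; the optimized logical plan shows the filter moved next to the scan. A self-contained sketch with a made-up People table:

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class PushDownDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Filter push-down").master("local").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        Dataset<Row> people = spark.createDataFrame(
                jsc.parallelize(Arrays.asList(
                        RowFactory.create(1, "Alice", 30),
                        RowFactory.create(2, "Bob", 25))),
                new StructType().add("id", DataTypes.IntegerType)
                                .add("name", DataTypes.StringType)
                                .add("age", DataTypes.IntegerType));

        // Filter written above a projection, as in the original plan.
        Dataset<Row> q = people.select("id", "name").filter("id = 1").select("name");

        // Prints parsed, analyzed, optimized, and physical plans; the
        // optimized plan has the filter pushed toward the scan of people.
        q.explain(true);
        q.show();
        spark.stop();
    }
}
```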

SLIDE 16

Code Generation

Shift from black-box UDFs to expressions.

Example:

// Filter
Dataset<Row> ok_lines = log_file.filter("response = 200");

// Grouped aggregation
Dataset<Row> bytesPerCode = log_file.sqlContext()
    .sql("SELECT response, sum(bytes) FROM log_lines GROUP BY response");

SparkSQL understands the logic of such queries and can rewrite them into more efficient generated code.

SLIDE 17

Integration

SparkSQL is integrated with other high-level interfaces such as MLlib, PySpark, and SparkR. It is also integrated with the RDD interface, and the two can be mixed in one program.
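Mixing the two interfaces in one program usually means dropping from a DataFrame to its RDD for an arbitrary transformation and reattaching the schema afterwards. A minimal sketch with made-up data:

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class MixRddAndSql {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Mixing RDDs and DataFrames").master("local").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        StructType schema = new StructType()
                .add("response", DataTypes.IntegerType)
                .add("bytes", DataTypes.LongType);
        Dataset<Row> df = spark.createDataFrame(
                jsc.parallelize(Arrays.asList(
                        RowFactory.create(200, 1024L),
                        RowFactory.create(404, 512L))),
                schema);

        // DataFrame -> RDD: use the RDD API for an arbitrary transformation.
        JavaRDD<Row> doubled = df.javaRDD()
                .map(r -> RowFactory.create(r.getInt(0), r.getLong(1) * 2));

        // RDD -> DataFrame: reattach the schema to return to SparkSQL.
        Dataset<Row> back = spark.createDataFrame(doubled, schema);
        back.filter("bytes > 1500").show();
        spark.stop();
    }
}
```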

SLIDE 18

Further Reading

Documentation:
http://spark.apache.org/docs/latest/sql-programming-guide.html

SparkSQL paper:
M. Armbrust et al. "Spark SQL: Relational data processing in Spark." SIGMOD 2015.