SparkSQL
Where are we?

    Pig (Pig Latin)    Hive (HiveQL)    ???
    Hadoop MapReduce        Spark RDD
                 HDFS

Where are we?

    Pig (Pig Latin)    Hive (HiveQL)    ??? (SQL)
    Hadoop MapReduce        Spark RDD
                 HDFS

Shark (Spark on Hive): a small side project
DataFrame = RDD + schema
Why?
- CSV file
- JSON file
- MySQL database
- Hive
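All of these sources go through the same read() entry point on the session. A minimal sketch, assuming hypothetical file paths and placeholder JDBC connection details:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameSources {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("DataFrame sources")
                .master("local")
                .getOrCreate();

        // CSV file: header row and schema inference are opt-in
        Dataset<Row> fromCsv = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("people.csv");                       // hypothetical path

        // JSON file: one JSON object per line by default
        Dataset<Row> fromJson = spark.read().json("people.json"); // hypothetical path

        // MySQL database over JDBC (all coordinates below are placeholders)
        Dataset<Row> fromMysql = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mysql://localhost:3306/testdb")
                .option("dbtable", "people")
                .option("user", "user")
                .option("password", "password")
                .load();

        // Hive tables need enableHiveSupport() on the session builder:
        // Dataset<Row> fromHive = spark.table("people");

        fromCsv.printSchema();
        spark.stop();
    }
}
```

Whatever the source, the result is the same Dataset&lt;Row&gt; type, so the rest of the pipeline does not care where the data came from.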
<!-- In the pom.xml dependencies section -->
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.2.1</version>
</dependency>
SparkSession sparkS = SparkSession
    .builder()
    .appName("Spark SQL examples")
    .master("local")
    .getOrCreate();

Dataset<Row> log_file = sparkS.read()
    .option("delimiter", "\t")
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("nasa_log.tsv");

log_file.show();
// Select OK lines
Dataset<Row> ok_lines = log_file.filter("response = 200");
long ok_count = ok_lines.count();
System.out.println("Number of OK lines is " + ok_count);

// Grouped aggregation using SQL: the DataFrame must first be
// registered as a view named log_lines for the query to resolve
log_file.createOrReplaceTempView("log_lines");
Dataset<Row> bytesPerCode = log_file.sqlContext().sql(
    "SELECT response, sum(bytes) FROM log_lines GROUP BY response");
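The same aggregation can also be expressed without SQL, through the DataFrame API's groupBy/agg methods. A self-contained sketch: the response/bytes schema mirrors the log file, but the rows here are made-up sample data rather than the real NASA log:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.sum;

public class BytesPerCode {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Grouped aggregation").master("local").getOrCreate();

        // Made-up rows standing in for the parsed log file
        StructType schema = new StructType()
                .add("response", DataTypes.IntegerType)
                .add("bytes", DataTypes.LongType);
        List<Row> rows = Arrays.asList(
                RowFactory.create(200, 1024L),
                RowFactory.create(200, 2048L),
                RowFactory.create(404, 512L));
        Dataset<Row> log_file = spark.createDataFrame(rows, schema);

        // groupBy + agg is the DataFrame-API twin of GROUP BY + sum(bytes)
        Dataset<Row> bytesPerCode = log_file.groupBy("response").agg(sum("bytes"));
        bytesPerCode.show();

        spark.stop();
    }
}
```

Both forms compile to the same plan, which is the point of the next slides.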
SQL AST / DataFrame
    → Unresolved Logical Plan      (Analysis, using the Catalog)
    → Logical Plan                 (Logical Optimization)
    → Optimized Logical Plan       (Physical Planning)
    → Physical Plans               (Cost Model selects one)
    → Selected Physical Plan       (Code Generation)
    → RDDs

DataFrames and SQL share the same optimization/execution pipeline. Credits: M. Armbrust
Filter push-down on the People table:

    Original plan:          After filter push-down:
      Project name            Project name
      Filter id = 1           Project id, name
      Project id, name        Filter id = 1
      People                  People
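You can watch this rewrite happen with Dataset.explain(true), which prints the parsed, analyzed, optimized, and physical plans. A sketch using a made-up in-memory People table (the id/name columns and sample rows are assumptions): the filter is written on top of the projection, but the optimized plan moves it toward the scan.

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class FilterPushDown {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Filter push-down").master("local").getOrCreate();

        // Tiny stand-in for the People table
        StructType schema = new StructType()
                .add("id", DataTypes.IntegerType)
                .add("name", DataTypes.StringType);
        Dataset<Row> people = spark.createDataFrame(
                Arrays.asList(RowFactory.create(1, "Ann"), RowFactory.create(2, "Bob")),
                schema);

        // Filter stated after the projection, as in the slide's original plan
        Dataset<Row> query = people.select("id", "name").filter("id = 1").select("name");

        // extended=true prints all four plans; compare the position of
        // the Filter in the analyzed vs. optimized logical plan
        query.explain(true);

        spark.stop();
    }
}
```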
// Filter
Dataset<Row> ok_lines = log_file.filter("response = 200");

// Grouped aggregation
Dataset<Row> bytesPerCode = log_file.sqlContext().sql(
    "SELECT response, sum(bytes) FROM log_lines GROUP BY response");
http://spark.apache.org/docs/latest/sql-programming-guide.html

Armbrust, Michael, et al. "Spark SQL: Relational data processing in Spark." SIGMOD 2015.