Query Optimization 2. Instructor: Matei Zaharia (cs245.stanford.edu)



SLIDE 1

Query Optimization 2

Instructor: Matei Zaharia cs245.stanford.edu

SLIDE 2

Recap: Data Statistics

Information about tuples in a table that we can use to estimate costs

» Must be approximated for intermediate tables

We saw one way to do this for 4 statistics:

» T(R) = # of tuples in R
» S(R) = average size of tuples in R
» B(R) = # of blocks to hold R's tuples
» V(R, A) = # of distinct values of attribute A in R
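These four statistics can be computed in one pass over a table. A small illustrative sketch (not from the slides; the record size and block size are hypothetical assumptions):

```python
# Computing T(R), S(R), B(R), V(R, A) for a tiny in-memory relation.
# The fixed record_size and block_size values are illustrative only.
import math

def table_stats(rows, record_size, block_size, attr):
    """rows: list of dicts; attr: the attribute used for V(R, attr)."""
    T = len(rows)                          # T(R): number of tuples
    S = record_size                        # S(R): average tuple size (bytes)
    B = math.ceil(T * S / block_size)      # B(R): blocks needed to hold R
    V = len({row[attr] for row in rows})   # V(R, attr): distinct values of attr
    return T, S, B, V

rows = [{"id": i, "state": s} for i, s in enumerate(["MA", "CA", "MA", "NY"])]
T, S, B, V = table_stats(rows, record_size=100, block_size=4096, attr="state")
# T = 4, S = 100, B = 1 (400 bytes fit in one 4 KB block), V = 3
```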

CS 245 2

SLIDE 3

Another Type of Data Stats: Histograms


(Figure: histogram over attribute A; each bar gives the number of tuples in R with an A value in a given range, with boundaries 10, 20, 30, 40 and counts 5, 10, 15, 12.)

σA≥a(R) = ?
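To estimate σA≥a(R) from such a histogram, sum the counts of buckets entirely above a and interpolate within the bucket that contains a. A sketch (the bucket boundaries and counts are illustrative, and the uniform-within-bucket assumption is stated in the code):

```python
# Estimate |σ_{A >= a}(R)| from an equi-width histogram, assuming values
# are spread uniformly within each bucket.
def estimate_geq(boundaries, counts, a):
    """boundaries[i] and boundaries[i+1] delimit bucket i; counts[i] is the
    number of tuples whose A value falls in that bucket."""
    total = 0.0
    for i, count in enumerate(counts):
        lo, hi = boundaries[i], boundaries[i + 1]
        if a <= lo:
            total += count                         # whole bucket qualifies
        elif a < hi:
            total += count * (hi - a) / (hi - lo)  # interpolate partial bucket
    return total

# Buckets [0,10), [10,20), [20,30), [30,40) with counts 5, 10, 15, 12:
est = estimate_geq([0, 10, 20, 30, 40], [5, 10, 15, 12], a=25)
# half of bucket [20,30) plus all of [30,40): 7.5 + 12 = 19.5
```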

SLIDE 4

Outline

What can we optimize?
Rule-based optimization
Data statistics
Cost models
Cost-based plan selection
Spark SQL

SLIDE 5

Outline

What can we optimize?
Rule-based optimization
Data statistics
Cost models
Cost-based plan selection
Spark SQL

SLIDE 6

Cost Models

How do we measure a query plan’s cost? Many possible metrics:

» Number of disk I/Os
» Number of compute cycles
» Combined time metric
» Memory usage
» Bytes sent on network
» …


(In this class, we'll focus on the number of disk I/Os.)

SLIDE 7

Example: Index vs Table Scan

Our query: σp(R) for some predicate p. Let s = p's selectivity (the fraction of tuples passing).


Table scan:

R has B(R) = T(R) ⨯ S(R) / b blocks on disk, where b is the block size. Cost: B(R) I/Os.

Index search:

The index lookup for p takes L I/Os. We then have to read part of R. With b/S(R) records per block (b = block size):
Pr[read block i] ≈ 1 − Pr[no match]^(records per block) = 1 − (1 − s)^(b/S(R))
Cost: L + (1 − (1 − s)^(b/S(R))) ⨯ B(R) I/Os
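Plugging numbers into the two formulas shows where the crossover lies. A sketch with hypothetical parameter values:

```python
# Compare the two cost formulas above. B = B(R), S = S(R), b = block size,
# L = index lookup cost, s = selectivity; all numbers are hypothetical.
def table_scan_cost(B):
    return B                                    # read every block of R

def index_scan_cost(L, s, b, S, B):
    records_per_block = b / S
    p_match = 1 - (1 - s) ** records_per_block  # Pr[block has >= 1 match]
    return L + p_match * B

B, L, b, S = 1000, 3, 4096, 256                 # 16 records per block
selective = index_scan_cost(L, s=1e-4, b=b, S=S, B=B)   # very few matches
unselective = index_scan_cost(L, s=0.5, b=b, S=S, B=B)  # most blocks match
# selective plan beats the scan; unselective one loses to it
```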

SLIDE 8

What If Results Were Clustered?

We'd need to change our estimate of Cindex: Cindex = L + s ⨯ B(R)


Unclustered: records that match p are spread out uniformly.
Clustered: records that match p are close together in R's file.

In the clustered case, s is the fraction of R's blocks read, so this Cindex is less than the estimate for unclustered data.

SLIDE 9

Join Operators

Join orders and algorithms are often the choices that affect performance the most. For a multi-way join R ⨝ S ⨝ T ⨝ …, where each join is selective, order matters a lot.

» Try to eliminate lots of records early

Even for one join R ⨝ S, algorithm matters

SLIDE 10

Example

SELECT order.date, product.price, customer.name
FROM order, product, customer
WHERE order.product_id = product.product_id
  AND order.cust_id = customer.cust_id
  AND product.type = 'car'
  AND customer.country = 'US'


(Figure: two join trees.)
Plan 1: (order ⨝ product[type=car]) ⨝ customer[country=US]
Plan 2: (order ⨝ customer[country=US]) ⨝ product[type=car]
(The ⨝ nodes carry the join conditions; the filtered leaves carry the selection predicates.)

When is each plan better?

SLIDE 11

Common Join Algorithms

Iteration (nested loops) join
Merge join
Join with index
Hash join

SLIDE 12

Iteration Join

for each r ∈ R1:
  for each s ∈ R2:
    if r.C == s.C then output (r, s)

I/Os: one scan of R1 and T(R1) scans of R2, so cost = B(R1) + T(R1) ⨯ B(R2) reads.
Improvement: read M blocks of R1 into RAM at a time, then scan R2 once per chunk: B(R1) + B(R1) ⨯ B(R2) / M.


Note: cost of writes is always B(R1 ⨝ R2)
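The M-blocks-at-a-time improvement is a block nested loop join. A toy sketch, simulating blocks by chunking Python lists (records_per_block and M are arbitrary illustrative values):

```python
# Block nested loop join: hold M blocks of R1 in RAM, scan R2 once per chunk.
def block_nested_loop_join(R1, R2, key, records_per_block=2, M=2):
    chunk = records_per_block * M            # R1 tuples held in RAM at once
    out = []
    for start in range(0, len(R1), chunk):
        in_ram = R1[start:start + chunk]     # "read M blocks of R1"
        for s in R2:                         # one full scan of R2 per chunk
            for r in in_ram:
                if r[key] == s[key]:
                    out.append((r, s))
    return out

R1 = [{"C": c} for c in [1, 2, 3]]
R2 = [{"C": c} for c in [2, 3, 4]]
result = block_nested_loop_join(R1, R2, key="C")
# matches on C = 2 and C = 3
```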

SLIDE 13

Merge Join

if R1 and R2 not sorted by C then sort them
i, j = 1
while i ≤ T(R1) && j ≤ T(R2):
  if R1[i].C == R2[j].C then outputTuples
  else if R1[i].C > R2[j].C then j += 1
  else if R1[i].C < R2[j].C then i += 1

SLIDE 14

Merge Join

procedure outputTuples:
  while R1[i].C == R2[j].C && i ≤ T(R1):
    jj = j
    while R1[i].C == R2[jj].C && jj ≤ T(R2):
      output (R1[i], R2[jj])
      jj += 1
    i += 1
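The pseudocode above translates into runnable form as follows (0-indexed, with outputTuples inlined), assuming both inputs are already sorted on C:

```python
# Merge join of two relations sorted on `key`; outputs the cross product of
# each run of equal keys (the outputTuples procedure, inlined).
def merge_join(R1, R2, key="C"):
    out, i, j = [], 0, 0
    while i < len(R1) and j < len(R2):
        if R1[i][key] == R2[j][key]:
            jj = j                      # j stays at the start of R2's run
            while jj < len(R2) and R1[i][key] == R2[jj][key]:
                out.append((R1[i], R2[jj]))
                jj += 1
            i += 1
        elif R1[i][key] > R2[j][key]:
            j += 1
        else:
            i += 1
    return out

R1 = [{"C": c} for c in [10, 20, 20, 30, 40]]
R2 = [{"C": c} for c in [5, 20, 20, 30, 30, 50, 52]]
pairs = merge_join(R1, R2)
# 20s: 2 x 2 pairs; 30s: 1 x 2 pairs -> 6 output tuples
```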

SLIDE 15

Example

i | R1[i].C | R2[j].C | j
--|---------|---------|--
1 |   10    |    5    | 1
2 |   20    |   20    | 2
3 |   20    |   20    | 3
4 |   30    |   30    | 4
5 |   40    |   30    | 5
  |         |   50    | 6
  |         |   52    | 7

SLIDE 16

Cost of Merge Join

If R1 and R2 already sorted by C, then cost = B(R1) + B(R2) reads


(+ write cost of B(R1 ⨝ R2))

SLIDE 17

Cost of Merge Join

If Ri is not sorted, can sort it in 4 B(Ri) I/Os:

» Read runs of tuples into memory, sort
» Write each sorted run to disk
» Read from all sorted runs to merge
» Write out results

SLIDE 18

Join with Index

for each r ∈ R1:
  list = index_lookup(R2, C, r.C)
  for each s ∈ list:
    output (r, s)

Read I/Os: 1 scan of R1, T(R1) index lookups on R2, and T(R1) data lookups:
cost = B(R1) + T(R1) ⨯ (Lindex + Ldata)


Can be less when R1 is sorted/clustered by C!

SLIDE 19

Hash Join (R2 Fits in RAM)

hash = load R2 into RAM and hash by C
for each r ∈ R1:
  list = hash_lookup(hash, r.C)
  for each s ∈ list:
    output (r, s)

Read I/Os: B(R1) + B(R2)
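In runnable form, as a sketch with relations as lists of dicts:

```python
# In-memory hash join: build a hash table on R2, probe with each tuple of R1.
from collections import defaultdict

def hash_join(R1, R2, key="C"):
    table = defaultdict(list)
    for s in R2:                          # build phase: read B(R2) blocks
        table[s[key]].append(s)
    out = []
    for r in R1:                          # probe phase: read B(R1) blocks
        for s in table.get(r[key], []):
            out.append((r, s))
    return out

R1 = [{"C": c} for c in [1, 2, 2]]
R2 = [{"C": c} for c in [2, 3]]
result = hash_join(R1, R2)
# both R1 tuples with C = 2 match the single R2 tuple with C = 2
```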

SLIDE 20

Hash Join on Disk

Can be done by hashing both tables to a common set of buckets on disk

» Similar to external merge sort: cost ≈ 4 ⨯ (B(R1) + B(R2)) I/Os

Trick: hash only (key, pointer to record) pairs

» Can then sort the pointers to records that match and fetch them near-sequentially

SLIDE 21

Other Concerns

Join selectivity may affect how many records we need to fetch from each relation

» If very selective, may prefer methods that join pointers or do index lookups

SLIDE 22

Summary

Join algorithms can have different performance in different situations. In general, the following are used:

» Index join if an index exists
» Merge join if at least one table is sorted
» Hash join if both tables are unsorted

SLIDE 23

Outline

What can we optimize?
Rule-based optimization
Data statistics
Cost models
Cost-based plan selection
Spark SQL

SLIDE 24

Complete CBO Process

Generate and compare possible query plans

(Figure: pipeline. Query → Generate Plans → Prune → Estimate Costs → Select (Pick Min).)

SLIDE 25

How to Generate Plans?

Simplest way: recursive search of the options for each planning choice


(Search space: access paths for table 1 ⨯ access paths for table 2 ⨯ algorithms for join 1 ⨯ algorithms for join 2 ⨯ …)

SLIDE 26

How to Generate Plans?

Can limit the search space: e.g., many DBMSes only consider "left-deep" joins.


Often interacts well with conventions for specifying join inputs in asymmetric join algorithms (e.g. assume right argument has index)

SLIDE 27

How to Generate Plans?

Can prioritize searching through the most impactful decisions first

» E.g. join order is one of the most impactful

SLIDE 28

How to Prune Plans?

While computing the cost of a plan, throw it away if it is worse than the best so far. Start with a greedy algorithm to find an "OK" initial plan that will allow lots of pruning.
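This search-with-pruning idea can be sketched in a few lines. The cost model below is a deliberately toy one (the cost of adding a table is the smallest size among the tables joined so far), and the table sizes are hypothetical; the point is the enumerate-and-prune structure, not the estimates.

```python
# Enumerate left-deep join orders recursively, abandoning any partial plan
# whose running cost already exceeds the best complete plan found so far.
def best_join_order(tables, sizes):
    best = [float("inf"), None]          # [best cost, best plan]

    def step_cost(done, t):              # toy cost of joining `done` with t
        return min(sizes[x] for x in done + (t,))

    def search(remaining, done, cost):
        if cost >= best[0]:
            return                       # prune: already worse than best
        if not remaining:
            best[0], best[1] = cost, done
            return
        for t in remaining:
            search(remaining - {t}, done + (t,), cost + step_cost(done, t))

    for t in tables:                     # try each table as the start
        search(tables - {t}, (t,), 0)
    return best[1], best[0]

sizes = {"order": 1000, "product": 10, "customer": 50}
plan, cost = best_join_order(set(sizes), sizes)
# any order that brings in the small `product` table early costs 20
```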

SLIDE 29

Memoization and Dynamic Programming

During a search through plans, many subplans will appear repeatedly. Remember cost estimates and statistics (T(R), V(R, A), etc.) for those: "memoization". We can also pick an order of subproblems that makes it easy to reuse results (dynamic programming).

SLIDE 30

Resource Cost of CBO

It's possible for cost-based optimization itself to take longer than running the query! We need to design the optimizer to not take too long.

» That’s why we have shortcuts in stats, etc

Luckily, a few “big” decisions drive most of the query execution time (e.g. join order)

SLIDE 31

Outline

What can we optimize?
Rule-based optimization
Data statistics
Cost models
Cost-based plan selection
Spark SQL

SLIDE 32

Background

2004: MapReduce published, enables writing large scale data apps on commodity clusters

» Cheap but unreliable "consumer" machines, so the system emphasizes fault tolerance
» Focus on C++/Java programmers

SLIDE 33

Background

2006: Apache Hadoop project formed as an open source MapReduce + distributed FS

» Started in the Nutch open source search engine
» Soon adopted by Yahoo & Facebook

2006: Amazon EC2 service launched as the newest attempt at “utility computing”

SLIDE 34

Background

2007: Facebook starts Hive (later Apache Hive) for SQL on Hadoop

» Other SQL-on-MapReduce engines existed too
» First steps toward the "data lake" architecture

SLIDE 35

Background

2006-2012: Many other cluster programming frameworks proposed to bring MR’s benefits to other apps

(Examples: Pregel, Dremel, Dryad, Impala, …)

SLIDE 36

Background

2010: Spark engine released, built around MapReduce + in-memory computing

» Motivation: interactive queries + iterative algorithms such as graph analytics

Spark then moves to be a general (“unified”) engine, covering existing ones

SLIDE 37

Code Size Comparison (2013)

(Figure: non-test, non-example source lines of code in 2013, comparing Hadoop MapReduce, Impala (SQL), Storm (Streaming), and Giraph (Graph) against Spark together with its Shark, GraphX, and Streaming components.)

SLIDE 38

Background

2012: Shark starts as a port of Hive onto Spark. 2014: Spark SQL starts as a SQL engine built directly on Spark (but interoperable with Hive).

» Also adds two new features: DataFrames, for integrating relational ops into complex programs, and an extensible optimizer

SLIDE 39

Original Spark API

Resilient Distributed Datasets (RDDs)

» Immutable collections of objects that can be stored in memory or on disk across a cluster
» Built with parallel transformations (map, filter, …)
» Automatically rebuilt on failure
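Not Spark itself, but the lineage idea behind "automatically rebuilt on failure" can be sketched in plain Python: each dataset remembers its parent and the transformation that built it, so its contents can always be recomputed on demand. This ToyRDD class is an invented illustration.

```python
# Toy RDD: lazy transformations recorded as lineage, recomputed by collect().
class ToyRDD:
    def __init__(self, data=None, parent=None, transform=None):
        self._data, self._parent, self._transform = data, parent, transform

    def map(self, f):
        return ToyRDD(parent=self,
                      transform=lambda rows: [f(x) for x in rows])

    def filter(self, pred):
        return ToyRDD(parent=self,
                      transform=lambda rows: [x for x in rows if pred(x)])

    def collect(self):                  # action: walk the lineage chain
        rows = self._data if self._parent is None else self._parent.collect()
        return self._transform(rows) if self._transform else list(rows)

lines = ToyRDD(data=["ERROR a\tb\tdisk", "INFO ok", "ERROR x\ty\tnet"])
messages = lines.filter(lambda s: s.startswith("ERROR")) \
                .map(lambda s: s.split("\t")[2])
# messages.collect() -> ["disk", "net"]
```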

SLIDE 40

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")              // base RDD
errors = lines.filter(s => s.startswith("ERROR")) // transformed RDD
messages = errors.map(s => s.split('\t')(2))
messages.cache()

messages.filter(s => s.contains("foo")).count()   // action
messages.filter(s => s.contains("bar")).count()
. . .

(Figure: the driver ships tasks to workers; each worker reads one block of the file, caches its partition of messages in memory, and returns results.)

Result: full-text search of Wikipedia in 1 sec (vs 40 s for on-disk data)

SLIDE 41

Challenges with Spark’s Functional API

Looks high-level, but hides many semantics of the computation from the engine:

» Functions passed in are arbitrary blocks of code
» Data stored is arbitrary Java/Python objects

Users can mix APIs in suboptimal ways

SLIDE 42

Example Problem

pairs = data.map(word => (word, 1))
groups = pairs.groupByKey()
groups.map((k, vs) => (k, vs.sum))

This materializes all groups as lists of integers, then promptly aggregates them.
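The fix is to aggregate incrementally instead of materializing groups, which is what a reduce-style operator (Spark's reduceByKey, or a DataFrame/SQL sum) lets the engine do. A plain-Python illustration of the difference:

```python
# Word count two ways: materializing groups vs keeping a running aggregate.
from collections import defaultdict

def count_grouped(words):
    groups = defaultdict(list)
    for w in words:
        groups[w].append(1)               # every 1 is materialized...
    return {k: sum(vs) for k, vs in groups.items()}   # ...then summed away

def count_reduced(words):
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1                    # constant state per key
    return dict(counts)

words = ["a", "b", "a"]
# both produce {"a": 2, "b": 1}; only the second avoids the group lists
```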

SLIDE 43

Challenge: Data Representation

Java objects are often many times larger than the underlying data

class User(name: String, friends: Array[Int]) User(“Bobby”, Array(1, 2))

(Figure: the in-memory object graph: a User header with pointers to a String "Bobby" and an int[] {1, 2}, each carrying its own object header.)
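The same point can be made in Python: a packed binary encoding of User("Bobby", [1, 2]) is far smaller than the corresponding object graph. The byte layout below is made up for illustration.

```python
# Packed binary encoding vs Python object sizes for ("Bobby", [1, 2]).
import struct
import sys

name, friends = "Bobby", [1, 2]
# 1-byte length + 5 chars + 1-byte count + two 4-byte ints, no padding:
packed = struct.pack("<B5sB2i", len(name), name.encode(), len(friends), *friends)
object_bytes = (sys.getsizeof(name) + sys.getsizeof(friends)
                + sum(sys.getsizeof(i) for i in friends))
# len(packed) == 15; the object graph takes several times that many bytes
```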

SLIDE 44

Spark SQL & DataFrames

Efficient library for working with structured data

» 2 interfaces: SQL for data analysts and external apps, DataFrames for complex programs
» Optimized computation and storage underneath

SLIDE 45

Spark SQL Architecture

(Figure: SQL queries and DataFrame programs are parsed into a Logical Plan; the Optimizer, consulting the Catalog, turns it into a Physical Plan; the Code Generator emits code that runs over RDDs, reading storage through the Data Source API.)

SLIDE 46

DataFrame API

DataFrames hold rows with a known schema and offer relational operations through a DSL

c = HiveContext()
users = c.sql("select * from users")
ma_users = users[users.state == "MA"]
ma_users.count()
ma_users.groupBy("name").avg("age")
ma_users.map(lambda row: row.user.toUpper())

(The expression users.state == "MA" builds an expression AST rather than evaluating eagerly.)
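This is why such code can be optimized: users.state == "MA" doesn't compute a boolean, it builds a tree the optimizer can inspect. A minimal sketch of the trick (the Col class is hypothetical, not Spark's API):

```python
# Overload == to build an AST node instead of evaluating, roughly what
# DataFrame column objects do under the hood.
class Col:
    def __init__(self, name):
        self.name = name

    def __eq__(self, value):             # returns a tree, not a bool
        return ("eq", self.name, value)

ast = (Col("state") == "MA")
# ast is the tuple ("eq", "state", "MA"): data an optimizer can rewrite
```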

SLIDE 47

API Details

Based on data frame concept in R, Python

» Spark is the first to make this declarative

Integrated with the rest of Spark

» ML library takes DataFrames as input/output
» Easily convert RDDs ↔ DataFrames

(Figure: Google Trends for "data frame".)

SLIDE 48

What DataFrames Enable

1. Compact binary representation: columnar, compressed cache; rows for processing
2. Optimization across operators (join reordering, predicate pushdown, etc.)
3. Runtime code generation
SLIDE 49

Performance

(Figure: time for an aggregation benchmark in seconds, comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL.)


SLIDE 51

Data Sources

Uniform way to access structured data

» Apps can migrate across Hive, Cassandra, JSON, Parquet, …
» Rich semantics allows query pushdown into data sources

(Figure: both the DataFrame expression users[users.age > 20] and the SQL query select * from users go through Spark SQL on their way to the data source.)
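Pushdown can be sketched as a source API that accepts an optional predicate, so filtering happens inside the source rather than after a full scan. The ListSource class here is invented for illustration, not Spark's actual data source API.

```python
# A data source that accepts a pushed-down predicate: with pushdown, only
# matching rows ever cross the source boundary.
class ListSource:
    def __init__(self, rows):
        self.rows = rows

    def scan(self, pushed_filter=None):
        return [r for r in self.rows
                if pushed_filter is None or pushed_filter(r)]

users = ListSource([{"age": 15}, {"age": 25}, {"age": 30}])
pushed = users.scan(lambda r: r["age"] > 20)           # 2 rows leave source
unpushed = [r for r in users.scan() if r["age"] > 20]  # 3 rows leave, 2 kept
# same answer either way; pushdown just moves the filter closer to the data
```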

SLIDE 52

Examples

JSON (tweets.json):
{ "text": "hi", "user": { "name": "bob", "id": 15 } }
select user.id, text from tweets

JDBC:
select age from users where lang = 'en'
(Spark SQL pushes select id, age from users where lang = 'en' down into the JDBC source.)

Together:
select t.text, u.age from tweets t, users u
where t.user.id = u.id and u.lang = 'en'

SLIDE 53

Extensible Optimizer

Uses Scala pattern matching (see demo!)

SLIDE 54

Which Spark Components Do People Use?

(2015 survey: MLlib + GraphX 58%, Spark Streaming 58%, DataFrames 62%, Spark SQL 69%; 75% of users use two or more components.)

SLIDE 55

Which Languages Are Used?

(Figure: bar chart of languages used in the 2014 and 2015 surveys.)

SLIDE 56

Extensions to Spark SQL

Tens of data sources using the pushdown API
Interval queries on genomic data
Geospatial package (Magellan)
Approximate queries & other research
