Query Execution 2 and Query Optimization Instructor: Matei Zaharia - PowerPoint PPT Presentation

Query Execution 2 and Query Optimization Instructor: Matei Zaharia cs245.stanford.edu

Query Execution Overview Query representation (e.g. SQL) Logical query plan (e.g. relational algebra) Query optimization Optimized logical plan Physical plan (code/operators to run) CS 245 2

Example SQL Query SELECT title FROM StarsIn WHERE starName IN ( SELECT name FROM MovieStar WHERE birthdate LIKE ‘%1960’ ); (Find the movies with stars born in 1960) CS 245 3

Parse Tree <Query> <SFW> SELECT <SelList> FROM <FromList> WHERE <Condition> <Attribute> <RelName> <Tuple> IN <Query> title StarsIn <Attribute> ( <Query> ) starName <SFW> SELECT <SelList> FROM <FromList> WHERE <Condition> <Attribute> <RelName> <Attribute> LIKE <Pattern> name MovieStar birthDate ‘%1960’ CS 245 4

Logical Query Plan P title s starName=name ´ StarsIn P name s birthdate LIKE ‘%1960’ MovieStar CS 245 5

Improved Logical Query Plan P title starName=name StarsIn P name s birthdate LIKE ‘%1960’ MovieStar CS 245 6

Estimate Result Sizes Need expected size P StarsIn s MovieStar CS 245 7

One Physical Plan Parameters: join order, Hash join memory size, project attributes, ... H Parameters: Seq scan Index scan select condition, ... StarsIn MovieStar CS 245 8

Another Physical Plan Parameters: join order, Hash join memory size, project attributes, ... H Parameters: Index scan Seq scan select condition, ... StarsIn MovieStar CS 245 9

Another Physical Plan Sort-merge join Seq scan Seq scan StarsIn MovieStar CS 245 10

Estimating Plan Costs Logical plan Physical plan P 1 P 2 … P n candidates C 1 C 2 … C n Pick best! CS 245 11

Execution Methods: Once We Have a Plan, How to Run it? Several options that trade between complexity, performance and startup time CS 245 12

Example: Simple Query SELECT quantity * price FROM orders WHERE productId = 75 P quanity*price (σ productId=75 (orders)) CS 245 13

Method 1: Interpretation interface Expression { interface Operator { Tuple next(); Value compute(Tuple in); } } class Attribute: Expression { class TableScan: Operator { String name; String tableName; } } class Times: Expression { class Select: Operator { Expression left, right; Operator parent; } Expression condition; } class Equals: Expression { Expression left, right; class Project: Operator { } Operator parent; Expression[] exprs; } CS 245 14

Example Expression Classes class Attribute: Expression { probably better to use a String name; numeric field ID instead Value compute(Tuple in) { return in.getField(name); } } class Times: Expression { Expression left, right; Value compute(Tuple in) { return left.compute(in) * right.compute(in); } } CS 245 15

Example Operator Classes class TableScan: Operator { String tableName; Tuple next() { // read & return next record from file } } class Project: Operator { Operator parent; Expression[] exprs; Tuple next() { tuple = parent.next(); fields = [expr.compute(tuple) for expr in exprs]; return new Tuple(fields); } } CS 245 16

Running Our Query with Interpretation ops = Project( expr = Times(Attr(“quantity”), Attr(“price”)), parent = Select( expr = Equals(Attr(“productId”), Literal(75)), parent = TableScan(“orders”) ) ); recursively calls Operator.next() while(true) { and Expression.compute() Tuple t = ops.next(); if (t != null) { out.write(t); Pros & cons of this } else { break; approach? } } CS 245 17

Method 2: Vectorization Interpreting query plans one record at a time is simple, but it’s too slow » Lots of virtual function calls and branches for each record (recall Jeff Dean’s numbers) Keep recursive interpretation, but make Operators and Expressions run on batches CS 245 18

Implementing Vectorization class ValueBatch { class TupleBatch { // Efficient storage, e.g. // Efficient storage } // schema + column arrays } interface Expression { ValueBatch compute( interface Operator { TupleBatch next(); TupleBatch in); } } class Times: Expression { class Select: Operator { Expression left, right; Operator parent; Expression condition; } } ... ... CS 245 19

Typical Implementation Values stored in columnar arrays (e.g. int[]) with a separate bit array to mark nulls Tuple batches fit in L1 or L2 cache Operators use SIMD instructions to update both values and null fields without branching CS 245 20

Pros & Cons of Vectorization + Faster than record-at-a-time if the query processes many records + Relatively simple to implement – Lots of nulls in batches if query is selective – Data travels between CPU & cache a lot CS 245 21

Method 3: Compilation Turn the query into executable code CS 245 22

Compilation Example P quanity*price (σ productId=75 (orders)) generated class with the right field types for orders table class MyQuery { void run() { Iterator<OrdersTuple> in = openTable(“orders”); for(OrdersTuple t: in) { if (t.productId == 75) { out.write(Tuple(t.quantity * t.price)); } } } Can also theoretically generate } vectorized code CS 245 23

Pros & Cons of Compilation + Potential to get fastest possible execution + Leverage existing work in compilers – Complex to implement – Compilation takes time – Generated code may not match hand-written CS 245 24

What’s Used Today? Depends on context & other bottlenecks Transactional databases (e.g. MySQL): mostly record-at-a-time interpretation Analytical systems (Vertica, Spark SQL): vectorization, sometimes compilation ML libs (TensorFlow): mostly vectorization (the records are vectors!), some compilation CS 245 25

Query Optimization CS 245 26

Outline What can we optimize? Rule-based optimization Data statistics Cost models Cost-based plan selection CS 245 27

What Can We Optimize? Operator graph: what operators do we run, and in what order? Operator implementation: for operators with several impls (e.g. join), which one to use? Access paths: how to read each table? » Index scan, table scan, C-store projections, … CS 245 29

Typical Challenge There is an exponentially large set of possible query plans Access paths Access paths Algorithms Algorithms ⨯ ⨯ ⨯ ⨯ … for table 1 for table 2 for join 1 for join 2 Result: we’ll need techniques to prune the search space and complexity involved CS 245 30

What is a Rule? Procedure to replace part of the query plan based on a pattern seen in the plan Example: When I see expr OR TRUE for an expression expr, replace this with TRUE CS 245 32

Implementing Rules Each rule is typically a function that walks through query plan to search for its pattern node void replaceOrTrue(Plan plan) { Or for (node in plan.nodes) { node.left node.right if (node instanceof Or) { if (node.right == Literal(true)) { expr TRUE plan.replace(node, Literal(true)); break; } // Similar code if node.left == Literal(true) } } } CS 245 33

Implementing Rules Rules are often grouped into phases » E.g. simplify Boolean expressions, pushdown selects, choose join algorithms, etc Each phase runs rules till they no longer apply plan = originalPlan; while (true) { for (rule in rules) { rule.apply(plan); } if (plan was not changed by any rule) break; } CS 245 34

Result Simple rules can work together to optimize complex query plans (if designed well): SELECT * FROM users WHERE (age>=16 && loc==CA) || (age>=16 && loc==NY) || age>=18 (age>=16) && (loc==CA || loc==NY) || age>=18 (age>=16 && (loc IN (CA, NY)) || age>=18 age>=18 || (age>=16 && (loc IN (CA, NY)) CS 245 35

Example Extensible Optimizer For Thursday, you’ll read about Spark SQL’s Catalyst optimizer » Written in Scala using its pattern matching features to simplify writing rules » >500 contributors worldwide, >1000 types of expressions, and hundreds of rules We’ll also use Spark SQL in assignment 2 CS 245 36

CS 245 37

CS 245 38

Common Rule-Based Optimizations Simplifying expressions in select, project, etc » Boolean algebra, numeric expressions, string expressions, etc » Many redundancies because queries are optimized for readability or generated by code Simplifying relational operator graphs » Select, project, join, etc These relational optimizations have the most impact CS 245 39

Query Execution 2 and Query Optimization Instructor: Matei Zaharia - PowerPoint PPT Presentation

Query Execution 2 and Query Optimization Instructor: Matei Zaharia cs245.stanford.edu Query Execution Overview Query representation (e.g. SQL) Logical query plan (e.g. relational algebra) Query optimization Optimized logical plan Physical

Improve Query Performance with the Query Log Analyzer Kees Vegter Field Engineer Query Log

MASTERING STRATEGY EXECUTION 18 BEST PRACTICES FOR STRATEGY EXECUTION STRATEGY EXECUTION AS

Introduction Query Execution Engine Implements a set of physical operators 2 key

Data Management Systems Query Processing Execution models Views and Schemas

Parallel Query Execution in POLARDB for MySQL ystein Grvlen Benny Wang Alibaba Cloud Agenda

Query Processing Relevance feedback; query expansion; Web Search 1 Overview Indexes Query

Data Management Systems Query Processing Introduction Execution models Optimization I

15-780: Optimization J. Zico Kolter March 14-16, 2015 1 Outline Introduction to optimization

CSE 232A Graduate Database Systems Arun Kumar Topic 4: Query Optimization Chapters 12 and

Query Optimization Through the Looking Glass Some Lessons From Building an LLVM-Based Query

Physical Operators Scanning, sorting, merging, hashing 193 Physical Operators Execution Query

DATABASE SYSTEM IMPLEMENTATION GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ LECTURE #17: QUERY

PostgreSQL Query Optimization Step by step techniques Ilya Kosmodemiansky (ik@dataegret.com)

execution states with swapping Processes, Execution, and State 3F. Execution State Model exit

Module 13: Optimizing Query Performance Overview Introduction to the Query Optimizer

Data Management Systems Query Processing Execution models Optimization heuristics

Fodor & Pylyshyn 1998 Lake & Baroni 2018 Jacob Andreas / MIT 6.884 / Fall 2020 Today 1.

Grammars & Parsing Lecture 12 CS 2112 Fall 2018 Motivation The cat ate the rat. Not

Practical JTAG: : From 0 to 1 HyperChem Tencents Xuanwu Lab http://xlab.tencent.com

DIGITAL GEOMETRY PROCESSING Algorithms for Representing, Analyzing and Comparing 3D shapes Today

Overview CKY algorithm: explores all analyses in parallel bottom-up From well-formed

High performance computational techniques for the simplex method Julian Hall School of

Data Ma Mana nagement for r Vide deo Ana nalyti tics Video data is everywhere. Brandon

CREATE STATISTICS What is it for? Tomas Vondra <tomas.vondra@2ndquadrant.com>

Query Execution 2 and Query Optimization Instructor: Matei Zaharia - PowerPoint PPT Presentation

Query Execution 2 and Query Optimization Instructor: Matei Zaharia cs245.stanford.edu Query Execution Overview Query representation (e.g. SQL) Logical query plan (e.g. relational algebra) Query optimization Optimized logical plan Physical

Improve Query Performance with the Query Log Analyzer Kees Vegter Field Engineer Query Log

MASTERING STRATEGY EXECUTION 18 BEST PRACTICES FOR STRATEGY EXECUTION STRATEGY EXECUTION AS

Introduction Query Execution Engine Implements a set of physical operators 2 key

Data Management Systems Query Processing Execution models Views and Schemas

Parallel Query Execution in POLARDB for MySQL ystein Grvlen Benny Wang Alibaba Cloud Agenda

Query Processing Relevance feedback; query expansion; Web Search 1 Overview Indexes Query

Data Management Systems Query Processing Introduction Execution models Optimization I

15-780: Optimization J. Zico Kolter March 14-16, 2015 1 Outline Introduction to optimization

CSE 232A Graduate Database Systems Arun Kumar Topic 4: Query Optimization Chapters 12 and

Query Optimization Through the Looking Glass Some Lessons From Building an LLVM-Based Query

Physical Operators Scanning, sorting, merging, hashing 193 Physical Operators Execution Query

DATABASE SYSTEM IMPLEMENTATION GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ LECTURE #17: QUERY

PostgreSQL Query Optimization Step by step techniques Ilya Kosmodemiansky (ik@dataegret.com)

execution states with swapping Processes, Execution, and State 3F. Execution State Model exit

Module 13: Optimizing Query Performance Overview Introduction to the Query Optimizer

Data Management Systems Query Processing Execution models Optimization heuristics

Fodor &amp; Pylyshyn 1998 Lake &amp; Baroni 2018 Jacob Andreas / MIT 6.884 / Fall 2020 Today 1.

Grammars &amp; Parsing Lecture 12 CS 2112 Fall 2018 Motivation The cat ate the rat. Not

Practical JTAG: : From 0 to 1 HyperChem Tencents Xuanwu Lab http://xlab.tencent.com

DIGITAL GEOMETRY PROCESSING Algorithms for Representing, Analyzing and Comparing 3D shapes Today

Overview CKY algorithm: explores all analyses in parallel bottom-up From well-formed

High performance computational techniques for the simplex method Julian Hall School of

Data Ma Mana nagement for r Vide deo Ana nalyti tics Video data is everywhere. Brandon

CREATE STATISTICS What is it for? Tomas Vondra &lt;tomas.vondra@2ndquadrant.com&gt;

Fodor & Pylyshyn 1998 Lake & Baroni 2018 Jacob Andreas / MIT 6.884 / Fall 2020 Today 1.

Grammars & Parsing Lecture 12 CS 2112 Fall 2018 Motivation The cat ate the rat. Not

CREATE STATISTICS What is it for? Tomas Vondra <tomas.vondra@2ndquadrant.com>