Automatically Synthesizing SQL Queries from Input-Output Examples - - PowerPoint PPT Presentation

automatically synthesizing sql queries
SMART_READER_LITE
LIVE PREVIEW

Automatically Synthesizing SQL Queries from Input-Output Examples - - PowerPoint PPT Presentation

Automatically Synthesizing SQL Queries from Input-Output Examples Sai Zhang University of Washington Joint work with: Yuyin Sun Goal: making it easier for non-expert users to write correct SQL queries Non-expert database end-users


slide-1
SLIDE 1

Automatically Synthesizing SQL Queries from Input-Output Examples

Sai Zhang University of Washington

Joint work with: Yuyin Sun

slide-2
SLIDE 2

Goal: making it easier for non-expert users to write correct SQL queries

2

  • Non-expert database end-users

– Business analysts, scientists, marketing managers, etc. can describe what the query task is do not know how write a correct query

This paper: bridge the gap!

slide-3
SLIDE 3

An example

SELECT name, MAX(score) FROM student, enrolled WHERE student.stu_id = enrolled.stu_id GROUP BY student.stu_id HAVING COUNT(enrolled.course_id) > 1

name stu_id Alice 1 Bob 2 Charlie 3 Dan 4 stu_id course_id score 1 504 100 1 505 99 2 504 96 3 501 60 3 502 88 3 505 68

Table: student Table: enrolled

name MAX(score) Alice 100 Charlie 88

Output table

Find the name and the maximum course score of each student enrolled in more than 1 course.

The correct SQL query:

slide-4
SLIDE 4

Existing solutions for querying a database

  • General programming languages

+ powerful − learning barriers

  • GUI tools

+ easy to use − limited in customization and personalization − hard to discover desired features in complex GUIs

4

slide-5
SLIDE 5

SELECT name, MAX(score) FROM student, enrolled WHERE student.stu_id = enrolled.stu_id GROUP BY student.stu_id HAVING COUNT(enrolled.course_id) > 1

name stu_id Alice 1 Bob 2 Charlie 3 Dan 4 stu_id course_id score 1 504 100 1 505 99 2 504 96 3 501 60 3 502 88 3 505 68

Table: student Table: enrolled

name MAX(score) Alice 100 Charlie 88

Output table SQLSynthesizer

Our solution: programming by example

slide-6
SLIDE 6

How do end-users use SQLSynthesizer?

6

Real, large database tables Desired output result

SQL?

Small, representative Input-output examples

SQLSynthesizer

slide-7
SLIDE 7

SQLSynthesizer’s advantages

7

  • Fully automated

− Only requires input-output examples − No need of annotations, hints, or specification of any form

  • Support a wide range of SQL queries

− Beyond the “select-from-where” queries [Tran’09]

slide-8
SLIDE 8

Expressiveness Ease

  • f

Use GUI tools SQLSynthesizer Programming languages

Comparison of solutions

slide-9
SLIDE 9

Outline

  • Motivation
  • A SQL Subset
  • Synthesis Approach
  • Evaluation
  • Related Work
  • Conclusion

9

slide-10
SLIDE 10

Designing a SQL subset

10

The full SQL language A SQL Subset

  • 1000+ pages specification
  • PSPace-Completeness [Sarma’10]
  • Some features are rarely used

SQLSynthesizer’s focus: a widely-used SQL subset

slide-11
SLIDE 11

How to design a SQL subset?

11

The full SQL language A SQL Subset

  • Previous approaches:

– Decided by the paper authors [Kandel’11] [Tran’09]

  • Our approach:

– Ask experienced IT professionals for the most widely-used SQL features

?

slide-12
SLIDE 12

Our approach in designing a SQL subset

  • 1. Online survey: eliciting design requirement
  • 2. Designing the SQL subset
  • 3. Follow-up interview: obtaining feedback

− Ask each participant to select 10 most widely-used SQL features − Got 12 responses − Ask each participant to rate the sufficiency of the subset Supported SQL features 1) SELECT.. FROM…WHERE 2) JOIN 3) GROUP BY / HAVING 4) Aggregators (e.g., MAX, COUNT, SUM, etc) 5) ORDER BY Supported in the previous work [Tran’09]

Not sufficient at all Completely sufficient

5 Average rating: 4.5

slide-13
SLIDE 13

Our approach in designing a SQL subset

  • 1. Online survey: eliciting design requirement
  • 1. Designing the SQL subset
  • 2. Follow-up interview: obtaining feedback

− Ask each participant to select 10 most widely-used SQL features − Got 12 respondents − Ask each participant to rate the sufficiency of the subset Supported SQL features

  • SELECT.. FROM…WHERE
  • JOIN
  • GROUP BY / HAVING
  • Aggregators (e.g., MAX, COUNT, SUM, etc)
  • ORDER BY

Supported in the previous work [Tran’09]

Not sufficient at all Completely sufficient

5 Average rating: 4.5

The SQL subset is enough to write most common queries.

slide-14
SLIDE 14

Outline

  • Motivation
  • Language Design
  • Synthesis Approach
  • Evaluation
  • Related Work
  • Conclusion

14

slide-15
SLIDE 15

SQLSynthesizer Workflow

15

Input-Output Examples SQLSynthesizer

Queries

Select the desired query, or provide more examples Input tables Output table

slide-16
SLIDE 16

SQLSynthesizer Workflow

16

Input-Output Examples SQLSynthesizer

Queries

Select the desired query, or provide more examples Input tables Output table A SQL query

slide-17
SLIDE 17

SQLSynthesizer Workflow

17

Input-Output Examples SQLSynthesizer

Queries

Select the desired query, or provide more examples Input tables Output table Filter Project Combine

slide-18
SLIDE 18

SQLSynthesizer Workflow

18

Input-Output Examples SQLSynthesizer

Queries

Select the desired query, or provide more examples Input tables Output table Filter Project SELECT name, MAX(score) FROM student, enrolled WHERE student.stu_id = enrolled.stu_id GROUP BY student.stu_id HAVING COUNT(enrolled.course_id) > 1

Query condition Join condition Projection columns Join condition Query condition Projection columns

A complete SQL: Combine

slide-19
SLIDE 19

Multiple solutions

19

Input tables Output table Query 2 Query 1 Query 3 …

slide-20
SLIDE 20

Multiple solutions

20

Input tables Output table Filter Project Combine

… …

Computes all solutions, ranks them, and shows them to the user.

slide-21
SLIDE 21

Key techniques

21

Input tables Output table Filter Project

Join condition Query condition Projection columns

Combine

  • 1. Combine:

Exhaustive search over legal combinations (e.g., cannot join columns with different types)

  • 2. Filter:

A machine learning approach to infer query conditions

  • 3. Project:

Exhaustive search over legal columns (e.g., cannot apply AVG to a string column)

slide-22
SLIDE 22

22

Cast as a rule learning problem:

Finding rules that can perfectly divide a search space

into a positive part and a negative part

All rows in the joined table Rows contained in the output table The rest of the rows

Learning query conditions

slide-23
SLIDE 23

Search space: the joined table

23

name stu_id Alice 1 Bob 2 Charlie 3 Dan 4 stu_id course_id score 1 504 100 1 505 99 2 504 96 3 501 60 3 502 88 3 505 68

Table: student Table: enrolled

Name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Bob 2 504 96 Charlie 3 501 60 Charlie 3 502 88 Charlie 3 505 68

Join on the stu_id column The joined table

(inferred in the Combine step)

slide-24
SLIDE 24

Finding rules selecting rows contained in the output table

24

name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Bob 2 504 96 Charlie 3 501 60 Charlie 3 502 88 Charlie 4 505 68

The joined table

name MAX(score) Alice 100 Charlie 88

Output table

slide-25
SLIDE 25

Finding rules selecting rows containing the output table

25

name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Bob 2 504 96 Charlie 3 501 60 Charlie 3 502 88 Charlie 4 505 68

The joined table

name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Charlie 3 501 60 Charlie 3 502 88 Charlie 4 505 68 name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Charlie 3 501 60 Charlie 3 502 88 Charlie 4 505 68

slide-26
SLIDE 26

Finding rules selecting rows containing the output table

name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Bob 2 504 96 Charlie 3 501 60 Charlie 3 502 88 Charlie 4 505 68

The joined table No good rules!

name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Charlie 3 501 60 Charlie 3 502 88 Charlie 4 505 68

slide-27
SLIDE 27

Solution: computing additional features

  • Key idea:

– Expand the search space with additional features

  • Enumerate all possibilities that a table can be aggregated
  • Precompute aggregation values as features

name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Bob 2 504 96 Charlie 3 501 60 Charlie 3 502 88 Charlie 3 505 68

Suppose grouping it by stu_id

MAX(score) SUM (score) COUNT (course_id) 100 199 2 100 199 2 96 96 1 88 216 3 88 216 3 88 216 3

additional features ... The joined table

slide-28
SLIDE 28

Finding rules without additional features

28

name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Bob 2 504 96 Charlie 3 501 60 Charlie 3 502 88 Charlie 4 505 68

The joined table No good rules!

name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Charlie 3 501 60 Charlie 3 502 88 Charlie 4 505 68

slide-29
SLIDE 29

Finding rules with additional features

The joined table

name stu_id course_id score COUNT(course_id) MIN(score) Alice 1 504 100 2 99 Alice 1 505 99 2 99 Bob 2 504 96 1 96 Charlie 3 501 60 3 60 Charlie 3 502 88 3 60 Charlie 4 505 68 3 60

COUNT(course_id) > 1 (after groupping by stu_id)

SELECT name, MAX(score) FROM student, enrolled WHERE student.stu_id = enrolled.stu_id GROUP BY student.stu_id HAVING COUNT(enrolled.course_id) > 1

after the table is grouped by stu_id …

name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Charlie 3 501 60 Charlie 3 502 88 Charlie 4 505 68

slide-30
SLIDE 30

Ranking multiple SQL queries

  • Occam’s razor principle: rank simpler queries higher

– A simpler query is less likely to overfit the examples

  • Approximate a query’s complexity by its text length

30

name age score Alice 20 100 Bob 20 99 Charlie 30 99 name Alice Bob

Query 1: select name from student where age < 30 Query 2: select name from student where name = ‘Alice’ || name = ‘Bob’ Input table: student Output table

slide-31
SLIDE 31

Outline

  • Motivation
  • Language Design
  • Synthesis Approach
  • Evaluation
  • Related Work
  • Conclusion

31

slide-32
SLIDE 32

Research Questions

  • Success ratio in synthesizing SQL queries?
  • What is the tool time cost?
  • How much human effort is needed in writing examples?
  • Comparison to existing techniques.

32

slide-33
SLIDE 33

Benchmarks

  • 23 SQL query related exercises from a classic textbook

– All exercises in chapters 5.1 and 5.2

  • 5 forum questions

– Can be answered by using standard SQL (Most questions are vendor-specific) – 2 questions contain example tables

33

slide-34
SLIDE 34

Evaluation Procedure

  • Rank of the correct SQL query
  • Tool time cost
  • Manual cost

– Example size, time cost, and the number of interaction rounds (All experiments are done by the second author)

34

Input-Output Examples SQLSynthesizer

Queries

Select the desired query, or provide more examples

slide-35
SLIDE 35

Results: success ratio

35

28 SQL questions

Fail on 8 questions Succeed on 20 questions

5 forum questions 23 textbook exercises

Succeed on 5 forum questions Succeed on 15 exercises Fail on 8 exercises

Require writing sub-queries, which are not supported in SQLSynthesizer The correct query ranks 1st in all succeeded questions

slide-36
SLIDE 36

Result: tool time cost

  • On average, 8 seconds per benchmark

– Min: 1 second, max: 120 seconds – Roughly proportional to the #table and #column

36

slide-37
SLIDE 37

Results: manual cost

  • Example size

– 22 rows, on average (min: 8 rows, max: 52 rows)

  • Time cost in writing examples

– 3.6 minutes per benchmark, on average (min: 1 minute, max: 7 minutes)

  • Number of interaction rounds

– 2.3 rounds per benchmark, on average (min: 1 round, max: 5 rounds)

37

slide-38
SLIDE 38

Comparison with an existing approach

  • Query-by-Output (QBO) [Tran’09]

– Support simple “select-from-where” queries – Use data values as machine learning features

38

name age score Alice 20 100 Bob 20 99 Charlie 30 80 name Alice Bob select name from student where age < 30

Table: student

age MAX(score) 20 100 30 80 select age, max(score) from student group by age

Output table

slide-39
SLIDE 39

Query-by-Output vs. SQLSynthesizer

39

Query-by-Output SQLSynthesizer

28 SQL questions 28 SQL questions

Fail on 26 questions Succeed on 2 questions Succeed on 20 questions Fail on 8 questions

  • Many realistic SQL queries use aggregation features.
  • Users are unlikely to get stuck on simple “select-from-where” queries.
slide-40
SLIDE 40

Experimental Conclusions

  • Good success ratio (71%)
  • Low tool time cost

– 8 seconds on average

  • Reasonable manual cost

– 3.6 minutes on average – 2.3 interaction rounds

  • Outperform an existing technique

– Success ratio: QBO (7%) vs. SQLSynthesizer (71%)

40

slide-41
SLIDE 41

Outline

  • Motivation
  • Language Design
  • Synthesis Approach
  • Evaluation
  • Related Work
  • Conclusion

41

slide-42
SLIDE 42

Related Work

  • Reverse engineering SQL queries

Query-by-Examples [Zloof’75]

A new GUI with a domain-specific language to write queries Query-by-Output [Tran’09] Uses data values as features, and supports a small SQL subset. View definition Synthesis [Sarma’10] Theoretical analysis, and is limited to 1 input/output table.

  • Automated program synthesis

PADS [Fisher’08], Wrangler [Kandel’11], Excel Macro [Harris’11], SQLShare [Howe’11], SnippetSuggest [Khoussainova’11], SQL Inference from Java code [Cheung’13] − Targets different problems, or requires different input.

− Inapplicable to SQL synthesis

42

slide-43
SLIDE 43

Outline

  • Motivation
  • Language Design
  • Synthesis Approach
  • Evaluation
  • Related Work
  • Conclusion

43

slide-44
SLIDE 44

Contributions

  • A programming-by-example technique

– Synthesize SQL queries from input-output examples – Core idea: using machine learning to infer query conditions

  • Experiments that demonstrate its usefulness

– Accurate and efficient

  • Inferred correct answers for 20 out of 28 SQL questions
  • 8 seconds for each question

– Outperforms an existing technique

  • The SQLSynthesizer implementation

http://sqlsynthesizer.googlecode.com

44

slide-45
SLIDE 45

[Backup Slides]

45

slide-46
SLIDE 46

The most widely-used SQL features

46

2 4 6 8 10 12 RIGHT JOIN NOT EXIST UNION NOT NULL LIKE BETWEEN LEFT JOIN HAVING INNER JOIN FULL JOIN MIN MAX IN SUM AVG DISTINCT COUNT ORDER BY GROUP BY SELECT... FROM..

Number of votes

21 features Aggregation features The standard select .. from.. where.. feature Joining features Existential features Value matching features

slide-47
SLIDE 47

Design a SQL subset

47

2 4 6 8 10 12 RIGHT JOIN NOT EXIST UNION NOT NULL LIKE BETWEEN LEFT JOIN HAVING INNER JOIN FULL JOIN MIN MAX IN SUM AVG DISTINCT COUNT ORDER BY GROUP BY SELECT... FROM..

Special joins Wildcard matching Sub-query Covered features Uncovered features Covered 15 features