Formal Methods for Database Application Evolution I l Dillig - - PowerPoint PPT Presentation

formal methods for database application evolution
SMART_READER_LITE
LIVE PREVIEW

Formal Methods for Database Application Evolution I l Dillig - - PowerPoint PPT Presentation

Formal Methods for Database Application Evolution I l Dillig University of Texas at Austin VSTTE 2020 1 Database Applications DB application: program that interacts with underlying database Database applications are ubiquitous


slide-1
SLIDE 1

Formal Methods for Database Application Evolution

Işıl Dillig University of Texas at Austin

1

VSTTE 2020

slide-2
SLIDE 2

Database Applications

2

DB application: program that interacts with underlying database

  • Enterprise software, web

applications, CRM applications, ….

Database applications are ubiquitous

slide-3
SLIDE 3

Schema Refactoring

3

  • Even more extreme: Switch from relational schema

to non-relational DB Common theme during DB app evolution: schema refactoring

  • Splitting/merging tables

Examples

  • Denormalization
slide-4
SLIDE 4

Implications of Schema Changes

4

Structural schema changes have significant implications!

Data migration: Move data from source to target schema Code migration: Must re-implement parts of program

Attracted lots

  • f attention

Currently done manually!

slide-5
SLIDE 5

Our Recent Research

Program verification and synthesis techniques to help programmers with DB schema refactoring

5

My goal: Convince you that there are lots of open & interesting problems in this space! Just a starting point, not a finished body of work!

slide-6
SLIDE 6

Idealized Model: Parametrized SQL Programs

6

Each method is either an update transaction (i.e., modifies DB), or query transaction

Program is a set of methods (transactions), where each method can take user input, but the body is straight-line SQL code

string getName(int id){ SELECT name FROM users WHERE uid = id } void insertUser (int id, string name) { INSERT INTO users VALUES (id, name) }

slide-7
SLIDE 7

Outline

7

Part I: Equivalence verification for parametrized SQL programs (POPL’18) Part II: Synthesizing new version of parametrized SQL program for a given target schema (PLDI’19) Part III: Open problems & challenges

slide-8
SLIDE 8

Verification Problem

Program1 Database1

Refactor schema

Program2 Database2

Are the two programs equivalent before and after schema change?

8

slide-9
SLIDE 9

Example

9

void createSub(int id, String name, String fltr) INSERT INTO Subscriber VALUES (id, name, fltr); void updateSub(int id, String fltr) UPDATE Subscriber SET filter=fltr WHERE sid=id; List<Tuple> getSubFilter(int id) SELECT filter FROM Subscriber WHERE sid=id;

Subscriber sid sname filter … … …

P

Subscriber’ sid sname fid_fk … … … Filter fid params … …

void createSub(int id, String name, String fltr) INSERT INTO Subscriber’ VALUES (id, name, UID_x); INSERT INTO Filter VALUES (UID_x, fltr); void updateSub(int id, String fltr) UPDATE Filter SET params=fltr WHERE fid IN (SELECT fid_fk FROM Subscriber’ WHERE sid=id); List<Tuple> getSubFilter(int id) SELECT params FROM Filter JOIN Subscriber’ ON fid=fid_fk WHERE sid=id;

P’

Are these implementations equivalent?

slide-10
SLIDE 10

Defining Equivalence

Consider two SQL programs P and P’ that provide same interface but different implementations

P P’: Every query transaction yields the same result after invoking same sequence of update transactions

P P’

updates queries

=

queries

10

slide-11
SLIDE 11

Example Revisited

11

void createSub(int id, String name, String fltr) INSERT INTO Subscriber VALUES (id, name, fltr); void updateSub(int id, String fltr) UPDATE Subscriber SET filter=fltr WHERE sid=id; List<Tuple> getSubFilter(int id) SELECT filter FROM Subscriber WHERE sid=id;

P

void createSub(int id, String name, String fltr) INSERT INTO Subscriber’ VALUES (id, name, UID_x); INSERT INTO Filter VALUES (UID_x, fltr); void updateSub(int id, String fltr) UPDATE Filter SET params=fltr WHERE fid IN (SELECT fid_fk FROM Subscriber’ WHERE sid=id); List<Tuple> getSubFilter(int id) SELECT params FROM Filter JOIN Subscriber’ ON fid=fid_fk WHERE sid=id;

P’

After an arbitrary sequence of createSub and updateSub transactions, getSubFilter should yield same answer

slide-12
SLIDE 12

Proving Equivalence: Methodology

Find bisimulation invariant that relates the two DB states

12

Φ

A = B C

D0 D1 D2 D0

update update update update update update ...

...

' D1 ' D2 '

inv inv inv

S S' Φ Φ Φ

(Φ ∧ ~ x = ~ y) | = JQiK = JQ0

iK

slide-13
SLIDE 13

Inferring Bisimulation Invariants

13

  • Generate a universe of candidate

predicates from a set of templates Π?(?) = Π?(?) Π?(?) = Π?(? o n?)

  • Perform fixed-point computation to find

strongest (conjunctive) bisimulation invariant over this universe

Automatically infer inductive bisimulation invariants using a guess-and-check approach (Houdini)

slide-14
SLIDE 14

Verification Workflow

14

No Yes P \ {ϕ} No Generate predicate universe P

Φ ≡ ⋀

φ∈P

φ

Yes Check if inductive Check if it implies query results are same

Use SMT solver Generate VCs

slide-15
SLIDE 15

Outline

15

Part I: Equivalence verification for parametrized SQL programs (POPL’18) Part II: Synthesizing new version of parametrized SQL program for a given target schema (PLDI’19) Part III: Open problems & challenges

slide-16
SLIDE 16

Synthesis Problem

16

Program P Schema S’

Synthesizer

New program P'

P’ is over new schema S’ and equivalent to P Enumerate-and- verify approach Check equivalence using technique from Part1

slide-17
SLIDE 17

Challenge

17

Search space is very large!

Sketch generation Sketch completion

slide-18
SLIDE 18

Synthesis Methodology

18

Value Correspondence Generator

Target schema Source schema

Sketch Generator

Source prog

Value Corr.

Target prog

Mapping from source attrb’s to target attrb’s

Program with unknown tables and columns

Sketch Solver

Sketch

slide-19
SLIDE 19

Motivating Example for Synthesis

19

update addTA(int id, String name, Binary pic) INSERT INTO TA VALUES (id, name, pic); update deleteTA(int id) DELETE FROM TA WHERE TaId=id; query getTAInfo(int id) SELECT TName, TPic FROM TA WHERE TaId=id;

Inst

InstId IName IPic

TA

TaId TName TPic

Inst’

InstId IName PicId

TA’

TaId TName PicId

Picture

PicId Pic

slide-20
SLIDE 20

Motivating Example for Synthesis

20

update addTA(int id, String name, Binary pic) INSERT INTO TA VALUES (id, name, pic); update deleteTA(int id) DELETE FROM TA WHERE TaId=id; query getTAInfo(int id) SELECT TName, TPic FROM TA WHERE TaId=id;

Inst

InstId IName IPic

TA

TaId TName TPic

Inst’

InstId IName PicId

TA’

TaId TName PicId

Picture

PicId Pic

slide-21
SLIDE 21

Motivating Example for Synthesis

21

update addTA(int id, String name, Binary pic) INSERT INTO TA VALUES (id, name, pic); update deleteTA(int id) DELETE FROM TA WHERE TaId=id; query getTAInfo(int id) SELECT TName, TPic FROM TA WHERE TaId=id;

Inst

InstId IName IPic

TA

TaId TName TPic

Inst’

InstId IName PicId

TA’

TaId TName PicId

Picture

PicId Pic

slide-22
SLIDE 22

Motivating Example for Synthesis

22

update addTA(int id, String name, Binary pic) INSERT INTO TA VALUES (id, name, pic); update deleteTA(int id) DELETE FROM TA WHERE TaId=id; query getTAInfo(int id) SELECT TName, TPic FROM TA WHERE TaId=id;

Inst

InstId IName IPic

TA

TaId TName TPic

Inst’

InstId IName PicId

TA’

TaId TName PicId

Picture

PicId Pic

query getTAInfo(int id) SELECT ??1, ??2 FROM ??3 WHERE ??4=id;

??1=TName ??2=Pic ??4=TaId ??3 ∈ { Picture ⋈ TA, Picture ⋈ TA ⋈ Inst }

slide-23
SLIDE 23

Motivating Example for Synthesis

23

update addTA(int id, String name, Binary pic) INSERT INTO TA VALUES (id, name, pic); update deleteTA(int id) DELETE FROM TA WHERE TaId=id; query getTAInfo(int id) SELECT TName, TPic FROM TA WHERE TaId=id;

Inst

InstId IName IPic

TA

TaId TName TPic

Inst’

InstId IName PicId

TA’

TaId TName PicId

Picture

PicId Pic

update addTA(int id, String name, Binary pic) INSERT INTO ??1 VALUES (id, name, pic); update deleteTA(int id) DELETE ??2 FROM ??3 WHERE TaId = id; query getTAInfo(int id) SELECT TName, Pic FROM ??4 WHERE TaId=id;

??1, ??4 ∈ { Picture ⋈ TA, Picture ⋈ TA ⋈ Inst } ??2 ∈ { [Picture], [TA], …, [Picture, TA, Inst] } ??3 ∈ { Picture ⋈ TA, Picture ⋈ TA ⋈ Inst }

slide-24
SLIDE 24

Solving the Sketch

24

Basic idea: Enumerate all programs in search space

For each completion, check if it is equivalent to original program

Scalability?

15 holes, 3 instantiations: >14 million programs!

Use conflict-driven learning* to prune search space

* Program Synthesis using Conflict-Driven Learning. Feng, Martins, Bastani, Dillig. PLDI’18

slide-25
SLIDE 25

Learning from Conflicts

25

Whenever you enumerate an incorrect program, infer a set of other provably-incorrect programs!

Prior work* shows how to do this for programming-by- example, but not applicable here

Leverage notion of “minimum distinguishing inputs” to learn from failed synthesis attempts!

* Program Synthesis using Conflict-Driven Learning. Feng, Martins, Bastani, Dillig. PLDI’18

slide-26
SLIDE 26

Distinguishing Inputs

26

Recall: Input to DB program is a set of function invocations along with their arguments.

update addTA(int id, String name, Binary pic) INSERT INTO TA VALUES (id, name, UID1); INSERT INTO Picture VALUES (UID1, pic); update deleteTA(int id) DELETE Picture FROM Picture JOIN TA ON Picture.PicId = TA.PicId JOIN Inst ON TA.PicId = Inst.PicId WHERE TaId = id; query getTaInfo(int id) SELECT TName, Pic FROM Picture JOIN TA ON Picture.PicId=TA.PicId JOIN Inst ON TA.PicId = Inst.PicID WHERE TaId=id;

addTA(1, “A”, null); getTAInfo(1);

Input:

Input I is distinguishing for a pair of programs P , P’ iff P(I) P’(I)

slide-27
SLIDE 27

Conflict-Driven Learning from Distinguishing Inputs

27

Consider distinguishing input I for original program P and synthesized (incorrect) program P’

Suppose P , P’ contain N functions but I invokes

  • nly K functions where K

N

Any program P’’ that agrees with P’ on those K functions will also be incorrect!!

Rather than just rejecting a single program, we can reject a whole SET! The smaller the input, the bigger the set, so want minimum distinguishing inputs!

slide-28
SLIDE 28

Distinguishing Inputs in Action

28

addTA(1, “A”, null); getTAInfo(1);

Distinguishing Input:

update addTA(int id, String name, Binary pic) INSERT INTO TA VALUES (id, name, UID1); INSERT INTO Picture VALUES (UID1, pic); update deleteTA(int id) DELETE Picture FROM Picture JOIN TA ON Picture.PicId = TA.PicId JOIN Inst ON TA.PicId = Inst.PicId WHERE TaId = id; query getTaInfo(int id) SELECT TName, Pic FROM Picture JOIN TA ON Picture.PicId=TA.PicId JOIN Inst ON TA.PicId = Inst.PicID WHERE TaId=id;

Assignments to holes in deleteTA are irrelevant!

Can use this information to rule out 14 programs instead of a single one! Synthesized incorrect program:

slide-29
SLIDE 29

Evaluation

29

Implemented tool called Migrator:

https://github.com/utopia-group/migrator

Used Migrator to synthesize new versions of 20 different parametrized SQL programs

slide-30
SLIDE 30

Statistics about Benchmarks

30

Avg functions Avg #tables (source) Avg # tables (target) Avg # cols (source) Avg #cols (target) Text book

10 2 2 8 9

Github

105 12 13 107 110

slide-31
SLIDE 31

Key Results

31

# programs enumerated Average synthesis time Average total time Textbook Benchmarks

3 0.4 5.6

Github benchmarks

19 138 155 Migrator can synthesize new version of SQL programs with over 100 transactions in 2.5 minutes!

Successfully synthesized equivalent versions of all 20 SQL programs!

slide-32
SLIDE 32

Outline

32

Part I: Equivalence verification for parametrized SQL programs (POPL’18) Part II: Synthesizing new version of parametrized SQL program for a given target schema (PLDI’19) Part III: Open problems & challenges

slide-33
SLIDE 33

Future Work

33

Good starting point, but lots of research remains to be done…

slide-34
SLIDE 34

Challenge #1

34

From parametrized SQL programs to real-world DB applications

SQL code dynamically generated Interaction between SQL and other languages Computation

  • n query

inputs & result

slide-35
SLIDE 35

Challenge #2

35

From relational DB applications to no-SQL applications

Larger search space: need to synthesize entire program Verify equivalence between programs written in different languages

slide-36
SLIDE 36

Challenge #3

36

Unified verification & falsification

POPL’18 work cannot disprove equivalence CEGAR-like approaches that search for both proofs & cexs

slide-37
SLIDE 37

Call for New Research

37

Lots of exciting research

  • pportunities for FM

research in automating DB application evolution!

slide-38
SLIDE 38

Acknowledgements

38

Yuepeng Wang PhD student @ UT Austin

slide-39
SLIDE 39

Acknowledgements, cont.

39

James Dong UT Austin undergrad Rushi Shah UT Austin undergrad

slide-40
SLIDE 40

Acknowledgements, cont.

40

Shuvendu Lahiri Principal Researcher @ MSR William Cook Faculty @ UT Austin

slide-41
SLIDE 41

Thank you!

41

slide-42
SLIDE 42

Comparison Results

42

Textbook Benchmarks Synthesis Time (s)

5,000 10,000 15,000 20,000

5437.1 17770.8 0.4

Github Benchmarks

Synthesis Time (s)

25,000 50,000 75,000 100,000

Migrator Enum 26643.5 86400 138.4

Take-away: >190x average speedup compared to enumeration and >750x speed up compared to Sketch!

Sketch Migrator Enum Sketch Migrator Enum

slide-43
SLIDE 43

Key Results

43

# benchmarks verified Avg verification time Textbook Benchmarks

10/10 12 s

Github benchmarks

10/11 47 s

Avg verification time is under 1 min! Mediator can verify all but

  • ne benchmark!

Cause of FP: bisimulation invariant not strong enough

slide-44
SLIDE 44

Theory of Relational Algebra w/ Updates

44

Recall: Bisimulation invariants relate DB states

Subscriber’ sid sname fid_fk … … … Filter fid params … … Subscriber sid sname filter … … …

Axiomatized new theory (Relational Algebra with Updates)

TRA

No existing first-order theory for expressing such invariants

See POPL’18 paper!

slide-45
SLIDE 45

Refinement

Program1 Database1 Refactor/Change +New features Program2 Database2

45

slide-46
SLIDE 46

Refinement

46

Subscriber sid sname filter … … … Subscriber’ sid sname fid_fk email … … … … Filter fid params … …

Correct?

  • New transactions may use more

tables and columns

  • New results “include” old results
slide-47
SLIDE 47

Refinement Checking

  • Simulation invariant Φ holds with empty databases
  • Updates
  • Queries

47

~ x.Ui and ~ y.U 0

i

~ x.Qi and ~ y.Q0

i

Inductive

n Φ ∧ V

xj2~ x xj = yj

  • Ui ; U 0

i

n Φ

  • Sufficient

⇣ Φ ∧ V

xj2~ x xj = yj

⌘ | = ∃L. Qi = ΠL(Q0

i)

New results include the old results

slide-48
SLIDE 48

Generating Value Correspondence

48

To ensure completeness of synthesis algorithm, need to enumerate all possible value correspondences can’t do eagerly! Need to lazily enumerate in decreasing order of likelihood Use MaxSAT

  • based approach

Hard constraints for type compatibility requirements Weighted soft constraints for similarity between attr names

Allows lazy enumeration of most likely value correspondences

slide-49
SLIDE 49

Sketch Completion Algorithm

49

SKETCH

Target Program

Sketch Encoder SAT Solver

𝛀

UNSAT ⊥

Equivalence Verifier

Source program

𝝌

Minimum Differentiating Input

Blocking

Clause Learner

P

SAT