TaCLe Learning constraints in spreadsheets and tabular data Samuel - - PowerPoint PPT Presentation

tacle
SMART_READER_LITE
LIVE PREVIEW

TaCLe Learning constraints in spreadsheets and tabular data Samuel - - PowerPoint PPT Presentation

TaCLe Learning constraints in spreadsheets and tabular data Samuel Kolb, Sergey Paramonov, Tias Guns, Luc De Raedt September 6, 2017 1 / 29 Inspiration: Flash-fill 2 / 29 Inspiration: Flash-fill 2 / 29 Inspiration: Flash-fill 2 / 29


slide-1
SLIDE 1

TaCLe

Learning constraints in spreadsheets and tabular data Samuel Kolb, Sergey Paramonov, Tias Guns, Luc De Raedt September 6, 2017

1 / 29

slide-2
SLIDE 2

Inspiration: Flash-fill

2 / 29

slide-3
SLIDE 3

Inspiration: Flash-fill

2 / 29

slide-4
SLIDE 4

Inspiration: Flash-fill

2 / 29

slide-5
SLIDE 5

Inspiration: Flash-fill

2 / 29

slide-6
SLIDE 6

Inspiration: Flash-fill

2 / 29

slide-7
SLIDE 7

Inspiration: Flash-fill

2 / 29

slide-8
SLIDE 8

Inspiration: Flash-fill

2 / 29

slide-9
SLIDE 9

Inspiration: Flash-fill

2 / 29

slide-10
SLIDE 10

Inspiration: Flash-fill

2 / 29

slide-11
SLIDE 11

Inspiration: Flash-fill

2 / 29

slide-12
SLIDE 12

Inspiration: Flash-fill

2 / 29

slide-13
SLIDE 13

Inspiration: Flash-fill

2 / 29

slide-14
SLIDE 14

Inspiration: Flash-fill

2 / 29

slide-15
SLIDE 15

Inspiration: Flash-fill

2 / 29

slide-16
SLIDE 16

Inspiration: Flash-fill

2 / 29

slide-17
SLIDE 17

What is Flash-fill [Gulwani et al.]?

◮ learns string transformation ◮ classic ML supervised setting ◮ one positive (or very few) example ◮ integrated into Excel

Key Questions:

◮ what if we make it unsupervised? ◮ what if we learn constraints rather than functions?

3 / 29

slide-18
SLIDE 18

Can we recover formulas from a CSV file?

◮ What are the formulas here?

4 / 29

slide-19
SLIDE 19

Can we recover formulas from a CSV file?

◮ What are the formulas here? ◮

T1[:, 6] = SUM(T1[:, 3:5], row)

4 / 29

slide-20
SLIDE 20

Can we recover formulas from a CSV file?

◮ What are the formulas here? ◮

T1[:, 6] = SUM(T1[:, 3:5], row)

T2[:, 2] = SUMIF(T1[:, 1]=T2[:, 1], T1[:, 6])

4 / 29

slide-21
SLIDE 21

What is new here?

◮ Atypical data settings: no variables in columns and transactions

in rows – everything mixed and position matters

◮ Multi-relational learning (across multiple tables, e.g., lookup) ◮ Constraints – specific for spreadsheets (e.g., fuzzy lookup,

sumproduct)

◮ Important details:

◮ None values ◮ Semi-structured data ◮ Numeric and textual data ◮ Numerical precision

5 / 29

slide-22
SLIDE 22

Formalization

Tables (T)

◮ n × m matrix ◮ Headerless ◮ (Optional: orientation)

6 / 29

slide-23
SLIDE 23

Formalization

Blocks (B)

◮ Contiguous group of entire rows or entire columns (vectors) ◮ Type-consistent ◮ Fixed orientation

7 / 29

slide-24
SLIDE 24

Formalization

Block containment (B′ ⊑ B)

◮ B′ subblock of B ◮ B superblock of B′ ◮ Supports different granularities for reasoning

8 / 29

slide-25
SLIDE 25

Formalization

Block properties

type a block is type-consistent, so it has one type table the table that the block belongs to

  • rientation either row-oriented or column-oriented

size the number of vectors a block contains length the length of its vectors; as all vectors are from the same table, they always have the same length; rows the number of rows in the block; in row-oriented blocks this is equivalent to the size; columns the number of columns in the block; in row-oriented blocks this is equivalent to the length.

9 / 29

slide-26
SLIDE 26

Formalization

Constraint templates

◮ Syntax

◮ Syntactic form, e.g. ALLDIFFERENT(Bx) ◮ In logic: relation / predicate

◮ Signature

◮ Requirements on arguments, e.g. discrete(Bx) ◮ In logic: bias (Sigs)

◮ Definition

◮ Actual definition, e.g. i = j: Bx[i] = Bx[j] ◮ In logic: ackground knowledge (Defs)

10 / 29

slide-27
SLIDE 27

Formalization

Row-wise sum

◮ Syntax: Br = SUMrow(Bx) ◮ Signature: Br and Bx are numeric; columns(Bx) ≥ 2; and

rows(Bx) = length(Br)

◮ Definition: Br[i] = columns(Bx) j=1

row(i, Bx)[j]

11 / 29

slide-28
SLIDE 28

Formalization

Row-wise sum

◮ Syntax: Br = SUMrow(Bx) ◮ Signature: Br and Bx are numeric; columns(Bx) ≥ 2; and

rows(Bx) = length(Br)

◮ Definition: Br[i] = columns(Bx) j=1

row(i, Bx)[j]

Lookup

◮ Syntax: Br = LOOKUP(Bfk, Bpk, Bval) ◮ Signature: Bfk and Bpk are discrete; arguments {Bfk, Br} and

{Bpk, Bval} within the same set have the same length, table and

  • rientation; Br and Bval have the same type; and

FOREIGNKEY(Bfk, Bpk).

◮ Definition: Br[i] = Bval[j] where Bpk[j] = Bfk[i]

11 / 29

slide-29
SLIDE 29

High level approach

  • 1. Detect tables (Visual selection)

12 / 29

slide-30
SLIDE 30

High level approach

  • 1. Detect tables (Visual selection)

13 / 29

slide-31
SLIDE 31

High level approach

  • 1. Detect tables (Visual selection)

14 / 29

slide-32
SLIDE 32

High level approach

  • 2. Detect blocks (Automatically, transparant to the user)

15 / 29

slide-33
SLIDE 33

High level approach

  • 3. Learn constraints

SERIES(T1[:, 1]) T1[:, 1] = RANK(T1[:, 5])∗ T1[:, 1] = RANK(T1[:, 6])∗ T1[:, 1] = RANK(T1[:, 10])∗ T1[:, 8] = RANK(T1[:, 7]) T1[:, 8] = RANK(T1[:, 3])∗ T1[:, 8] = RANK(T1[:, 4])∗ T1[:, 7] = SUMrow(T1[:, 3:6]) T1[:, 10] = SUMIF(T3[:, 1], T1[:, 2], T3[:, 2]) T1[:, 11] = MAXIF(T3[:, 1], T1[:, 2], T3[:, 2]) T2[1, :] = SUMcol(T1[:, 3:7]) T2[2, :] = AVERAGEcol(T1[:, 3:7]) T2[3, :] = MAXcol(T1[:, 3:7]), T2[4, :] = MINcol(T1[:, 3:7]) T4[:, 2] = SUMcol(T1[:, 3:6]) T4[:, 4] = PREV(T4[:, 4]) + T4[:, 2] − T4[:, 3] T5[:, 2] = LOOKUP(T5[:, 3], T1[:, 2], T1[:, 1])∗ T5[:, 3] = LOOKUP(T5[:, 2], T1[:, 1], T1[:, 2])

16 / 29

slide-34
SLIDE 34

Approach

Intuition

◮ Example: Y = SUMcol(X) ◮ Step 1: Find superblock assignments (i.e. suitable blocks)

◮ Assignments compatible signature, relax where necessary ◮ e.g. numeric(X), numeric(Y), columns(X) ≥ 2 ◮ e.g. rows(X) = length(Y)

◮ Step 2: Find all constraints over subblocks of the superblock

assignments

◮ Find subassignments that satisfy the signature and the definition ◮ e.g. find subblocks for X and Y such that Y[i] = column i

17 / 29

slide-35
SLIDE 35

Approach

Refinements (similar as in constraint learning - clausal discovery)

◮ Dependencies between constraints ◮ Redundancies in the output constraints ◮ Limited precision

18 / 29

slide-36
SLIDE 36

Refinements

Dependencies

Reuse learned constraints to learn dependent constraints

19 / 29

slide-37
SLIDE 37

Refinements

Redundancies

◮ Hide constraints that are equivalent ◮ Equivalent constraints can be computed from one another ◮ Example: B1 = B2 × B3 and B1 = B3 × B2 ◮ → Use canonical form

Limited precision

◮ Use the result value to deduce required precision ◮ Compute formula on input values and round to precision

20 / 29

slide-38
SLIDE 38

Evaluation

Evaluation questions

◮ Recall (Q1) ◮ Precision (Q2) ◮ Speed (Q3)

21 / 29

slide-39
SLIDE 39

Evaluation

Method

◮ Benchmark spreadsheets collected

◮ Exercise session ◮ Online tutorials ◮ Data spreadsheets (e.g. crime statistics, fincancial data)

22 / 29

slide-40
SLIDE 40

Evaluation

Method

◮ Benchmark spreadsheets collected

◮ Exercise session ◮ Online tutorials ◮ Data spreadsheets (e.g. crime statistics, fincancial data)

◮ Convert all spreadsheets to CSV files

22 / 29

slide-41
SLIDE 41

Evaluation

Method

◮ Benchmark spreadsheets collected

◮ Exercise session ◮ Online tutorials ◮ Data spreadsheets (e.g. crime statistics, fincancial data)

◮ Convert all spreadsheets to CSV files ◮ Manually specify ground-truth: intended constraints

◮ Based on original formulas, context, headers (intuition) ◮ Structural constraints are ignored

22 / 29

slide-42
SLIDE 42

Evaluation

Benchmark

Exercises (9) Tutorials (21) Data (4) Overall Sheet avg Overall Sheet avg Overall Sheet avg Tables 19 2.11 48 2.29 4 1 Cells 1231 137 1889 90 2320 580 Intended Constraints 34 3.78 52 2.48 6 1.50

23 / 29

slide-43
SLIDE 43

Evaluation

Benchmark

Exercises (9) Tutorials (21) Data (4) Overall Sheet avg Overall Sheet avg Overall Sheet avg Recall 0.85 0.83 0.88 0.87 1.00 1.00 Recall Supported 1.00 1.00 1.00 1.00 1.00 1.00 Precision 0.97 0.98 0.70 0.91 1.00 1.00 Speed (s) 1.62 0.18 1.76 0.08 0.81 0.20

24 / 29

slide-44
SLIDE 44

Evaluation

Benchmark

Exercises (9) Tutorials (21) Data (4) Overall Sheet avg Overall Sheet avg Overall Sheet avg Recall 0.85 0.83 0.88 0.87 1.00 1.00 Recall Supported 1.00 1.00 1.00 1.00 1.00 1.00 Precision 0.97 0.98 0.70 0.91 1.00 1.00 Speed (s) 1.62 0.18 1.76 0.08 0.81 0.20

◮ Q1. How many intended constraints are found by TaCLe?

◮ High recall ◮ All supported constraints always found

24 / 29

slide-45
SLIDE 45

Evaluation

Benchmark

Exercises (9) Tutorials (21) Data (4) Overall Sheet avg Overall Sheet avg Overall Sheet avg Recall 0.85 0.83 0.88 0.87 1.00 1.00 Recall Supported 1.00 1.00 1.00 1.00 1.00 1.00 Precision 0.97 0.98 0.70 0.91 1.00 1.00 Speed (s) 1.62 0.18 1.76 0.08 0.81 0.20

◮ Q1. How many intended constraints are found by TaCLe?

◮ High recall ◮ All supported constraints always found

◮ Q2. How precise is TaCLe?

◮ Precise on most spreadsheets ◮ Duplicates / multiple ways to calculate thwart precision

24 / 29

slide-46
SLIDE 46

Evaluation

Benchmark

Exercises (9) Tutorials (21) Data (4) Overall Sheet avg Overall Sheet avg Overall Sheet avg Recall 0.85 0.83 0.88 0.87 1.00 1.00 Recall Supported 1.00 1.00 1.00 1.00 1.00 1.00 Precision 0.97 0.98 0.70 0.91 1.00 1.00 Speed (s) 1.62 0.18 1.76 0.08 0.81 0.20

◮ Q1. How many intended constraints are found by TaCLe?

◮ High recall ◮ All supported constraints always found

◮ Q2. How precise is TaCLe?

◮ Precise on most spreadsheets ◮ Duplicates / multiple ways to calculate thwart precision

◮ Q3. How fast is TaCLe?

◮ Dependencies crucial

24 / 29

slide-47
SLIDE 47

Smart import

TaCLe + constraint translation + cycle breaking

25 / 29

slide-48
SLIDE 48

Auto-completion / dynamic error checking

incremental TaCLe + vector tracking

26 / 29

slide-49
SLIDE 49

Conclusion

Conclusion

◮ Approach that learns constraints in spreadsheet

◮ Accurate ◮ Rather precise ◮ Efficient

Future work

◮ Applications (incremental learning, vector tracking, noise) ◮ Nested constraints ◮ Sub vector-size ◮ Post-processing (heuristic / entailment)

27 / 29

slide-50
SLIDE 50

https://unsplash.com/search/photos/cat?photo=Qmox1MkYDnY

28 / 29

slide-51
SLIDE 51

Questions?

29 / 29