[PPT] - Towards Sustainable Insights Tim Kraska PowerPoint Presentation

SLIDE 1

Towards Sustainable Insights

Tim Kraska <tim_kraska@brown.edu>

SLIDE 2

http://www.huffingtonpost.co.uk/2016/01/08/a-glass-of-red-wine-is-the-equivalent-to-an-hour-at-the-gym-says-new-study_n_7317240.html

A New Study shows: A Glass Of Red Wine Is The Equivalent To An Hour At The Gym [Fox News 02/15 and others]

SLIDE 3

A new study shows: Secret to winning a Nobel Prize? Eat More Chocolate [Time 10/12]


SLIDE 5

Scientists find the secret of longer life for men (The bad news: castration is the key) [Daily Mail UK, 09/12]

http://www.dailymail.co.uk/sciencetech/article-2207981/Scientists-secret-living-life-men-bad-news-Castration-key.html

SLIDE 6

There has been an explosion of (data-driven) discoveries, many of which are questionable.

SLIDE 7

Reasons are manifold, but… the database community (… and many others) works hard not to be left out (again)

SLIDE 8

A note for Reviewer 2: We actually liked your comments and they helped us to sharpen our points. If you feel in any way offended by this talk, this was not my intention and I am more than happy to make it up to you with a lot of whisky.

Just come to me after the talk and say we need to drink. Knowing this crowd, enough people will do it, and I will never even find out your identity if you do not wish me to.

Let me introduce (virtual) Reviewer 2:

The paper's shortcomings are in its motivation, solution, and presentation. The part of the paper that I did like was the examples given in Sec 2.2.2.

SLIDE 9

Outline

Part I: The problem with:
A. Interactive Data Exploration
B. Visualization Recommendation Systems
C. Hypothesis Generator
Part II: Solutions

SLIDE 10

A) Interactive Data Exploration Tools (Vizdom as an Example)

SLIDE 11

Why Visualizations contribute to the problem

If a visualization provides any insight, it is a hypothesis test (just one where you do not necessarily know if it is statistically significant)

Otherwise, visualizations have to be taken as just pretty pictures about (potentially) random facts

[Figure: linked histogram panels A to F showing counts by gender (Male, Female, Other), salary over 50k (True, False), education (HS, Bachelor, Master, PhD), and marital status (Married, Never Married, Not Married, Widowed)]

SLIDE 12

[Figure: linked histogram panels A to F showing counts by gender, salary over 50k, education, marital status, and age (10 to 90); panel D compares two age histograms with a t-test, p = 0.011]

If visualizations are used to find something interesting, the user is doing multiple hypothesis testing
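A minimal sketch of why this matters, assuming every inspected visualization acts as an independent test at level alpha and that no real effect exists in the data. The function names are illustrative and not part of the talk.

```python
import random

def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive among independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** num_tests

def simulated_fwer(num_tests, alpha=0.05, trials=20000, seed=0):
    """Monte Carlo check: under a true null, a p-value is uniform on [0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One "exploration session": did any of the views look significant?
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / trials

for m in (1, 5, 20, 100):
    print(m, round(family_wise_error_rate(m), 3))
```

Under these assumptions, a user who looks at 20 views has roughly a 64% chance of seeing at least one "significant" pattern in pure noise.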

SLIDE 13

Running Example: Survey on Amazon Mechanical Turk

SLIDE 14

Our goal: To find good indicators (correlations) that somebody knows who Mike Stonebraker is.

SLIDE 15

And after searching for a bit, one of my favorites

Pearson correlation significance-level p < 0.05
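A hedged sketch of what "searching for a bit" can produce. The survey below is entirely synthetic random data (not the Mechanical Turk survey from the talk), the column names are hypothetical, and the p-value uses a normal approximation to the t statistic, which is an assumption that is reasonable only for a few hundred respondents.

```python
import math
import random
from statistics import NormalDist

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided p-value for H0: r = 0 (normal approximation, assumes |r| < 1)."""
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

rng = random.Random(42)
n_people, n_questions = 300, 200

# Hypothetical target column: 1.0 if the respondent "knows the person", else 0.0.
target = [1.0 if rng.random() < 0.3 else 0.0 for _ in range(n_people)]
# Hypothetical survey answers, generated independently of the target.
answers = [[rng.gauss(0, 1) for _ in range(n_people)] for _ in range(n_questions)]

significant = [
    q for q in range(n_questions)
    if approx_p_value(pearson_r(answers[q], target), n_people) < 0.05
]
print(f"{len(significant)} of {n_questions} random questions reach p < 0.05")
```

Because every answer column is independent of the target, each "indicator" found this way is a false discovery; around 5% of the columns are expected to pass the threshold anyway.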

SLIDE 16

But Why Does the DB community make the situation worse?

SLIDE 17

So What Did Reviewer 2 say?

Blaming the multiple-comparison problem on fast visualization-generation is like blaming fast cars for child driver casualties due to car accidents…

But…

SLIDE 18

B) Visualization Recommendation Systems (SeeDB as an Example)

[Figure: three bar charts (Target, Reference, Target) of Normalized Aggr(Column A) by Column B with values V1 and V2, on a 0 to 1 scale; two charts are filtered on Column C = V? and one on Column D = V?, and the comparisons are labeled Uninteresting and Interesting]

SLIDE 19

What is different

The system automatically generates thousands of visualizations and ranks them somehow (e.g., based on effect size)
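To illustrate the ranking step, here is a small sketch on synthetic random data. The deviation score (absolute difference between a filtered share and the overall share) is a stand-in chosen for this example and is not SeeDB's actual utility function; all column names are hypothetical.

```python
import random

def rank_filters_by_effect(rows, target, filters):
    """Rank filter columns by |share(target | filter) - share(target)|, largest first."""
    overall = sum(r[target] for r in rows) / len(rows)
    scored = []
    for f in filters:
        subset = [r for r in rows if r[f]]
        if not subset:
            continue  # a filter that selects nobody yields no chart
        share = sum(r[target] for r in subset) / len(subset)
        scored.append((abs(share - overall), f, len(subset)))
    scored.sort(reverse=True)
    return scored

rng = random.Random(7)
n_rows, n_filters = 100, 1000
filters = [f"attr_{i}" for i in range(n_filters)]

# Every column is an independent coin flip, so no filter truly matters.
rows = []
for _ in range(n_rows):
    row = {"likes_chips": rng.random() < 0.5}
    for f in filters:
        row[f] = rng.random() < 0.2
    rows.append(row)

ranking = rank_filters_by_effect(rows, "likes_chips", filters)
top_effect, top_filter, top_size = ranking[0]
print(f"top filter {top_filter}: effect {top_effect:.2f} on {top_size} rows")
```

Even though the data is pure noise, the top-ranked filter shows a large effect simply because it is the maximum over a thousand candidates.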

SLIDE 20

SeeDB on Our Survey Data

[Figure: four bar charts of "% Cheddar & Sour Cream Potato Chips vs Workspace Preference" (Startup vs. Corporation), one per filter: All; Belief in Alien Existence; Disbelief in Alien Existence; Prefer Blow Hair Drying]

…I did like […] the example …

SLIDE 21

What is the Problem?

The user is in the dark about what the system did. The system might have “tested” thousands of potential visualizations just to find something interesting.

SLIDE 22

What did Reviewer 2 say?

These systems are not designed for an average person to run and get insights that they can publish medical articles on! The end users are still analysts. The only difference is that they automate hypotheses generation and NOT hypotheses testing,…

SLIDE 23

My suggestion: in the future, papers should include a warning like

WARNING
After using the tool, throw away the data. It is not safe!¹

¹ To be more precise: you do not have to throw it all away, but you cannot use the same data anymore for significance testing

SLIDE 24

C) Real Hypothesis Generators (Data Polygamy as an Example)

SLIDE 25

(Data) Polygamy is bad, especially if you do not know what is going on.

SLIDE 26

Outline

Part I: The problem with:
A. Interactive Data Exploration
B. Visualization Recommendation Systems
C. Hypothesis Generator
Part II: Solutions

SLIDE 27

Should we stop working on IDE, Recommenders, etc?

Actively inform the user about the risk factors
Try your techniques over random data with different data sizes
If possible, split data into an exploration and a validation set.
  Be aware: this significantly lowers the power
  Everything on the validation data set has to be carefully handled (i.e., use multi-hypothesis control)
If possible, use additional experiments (e.g., A/B testing)
  Requires a small number of hypotheses and careful design
  Might not always be possible or is very expensive
Better: control the multi-hypothesis problem from the start

NO
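For the "multi-hypothesis control" step on a validation set, a minimal sketch of two standard corrections follows. It assumes the hypotheses were picked on the exploration split and that their p-values were then computed on the untouched validation split; the p-values below are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Family-wise error control: reject only if p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """False discovery rate control (step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Remember the largest rank whose p-value is under its threshold.
        if p_values[i] <= rank * alpha / m:
            cutoff_rank = rank
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected

# Hypothetical validation-set p-values for ten hypotheses found during exploration.
validation_p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                       0.060, 0.074, 0.205, 0.212, 0.216]

print("uncorrected:", sum(p < 0.05 for p in validation_p_values))
print("Bonferroni: ", sum(bonferroni(validation_p_values)))
print("BH (FDR):   ", sum(benjamini_hochberg(validation_p_values)))
```

On these made-up values, five hypotheses pass the naive 0.05 threshold, Bonferroni keeps one, and Benjamini-Hochberg keeps two, which shows the power cost mentioned on the slide.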

SLIDE 28

Our Interactive Data Exploration Stack (BIDES)

[Diagram components: QUDE (Quantifying the Uncertainty in Data Exploration); IDEA (Interactive Data Exploration Accelerator); Python; BigDAWG; Legacy Systems; Mlbase2; label "With hypothesis control"]

SLIDE 29

Many Interesting Open Problems

Transparent hypothesis testing
  how to automatically derive which hypothesis the user is testing
How to convey the meaning to the user
  (e.g., FDR vs family-wise error)
Safe recommender techniques
  (we are currently exploring new techniques based on VC-dimension to control the error)
Incremental multiple-hypothesis control techniques
  (for example, see ”Controlling False Discoveries During Interactive Data Exploration”, CoRR abs/1612.01040, for how we use new alpha-investing policies to do that)
Dependencies between hypotheses
  (this can save ”hypothesis budget”)

…

We are just at the beginning
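To give a feel for incremental control, here is a sketch of generic alpha-investing in the style of Foster and Stine. It is not one of the policies from the cited paper: the rule of betting a fixed fraction of the current wealth (`bet_fraction`) is a hypothetical policy chosen only for illustration.

```python
def alpha_investing(p_values, alpha=0.05, bet_fraction=0.1):
    """Test hypotheses one at a time, as they arrive during exploration."""
    wealth = alpha   # initial alpha-wealth
    payout = alpha   # wealth earned for each rejection
    decisions = []
    for p in p_values:
        # Risk a fraction of the wealth; choosing the level this way means a
        # failed test costs exactly `budget`, so wealth never goes negative.
        budget = bet_fraction * wealth
        level = budget / (1.0 + budget)
        reject = p <= level
        if reject:
            wealth += payout
        else:
            wealth -= level / (1.0 - level)
        decisions.append(reject)
    return decisions, wealth

print(alpha_investing([0.0001, 0.008]))
print(alpha_investing([0.008, 0.0001]))
```

The two calls show that order matters: an early discovery earns budget, so 0.008 is rejected when it follows 0.0001 but not when it comes first. The remaining wealth is the "hypothesis budget" the user still has.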

SLIDE 30

A Final Note from Reviewer 2 on

Is the Situation Really So Bad?

.., the systems that are criticized by this paper are essentially three tools [4,6,28] … So the problem is not really as serious as it might seem as none of these systems are used by anyone in practice

SLIDE 31

Tim Kraska <tim_kraska@brown.edu>

Special thanks to:

A last note to Reviewer 2:

1st I sincerely hope you are not one of my letter writers for my tenure case :)
2nd Your comments actually helped us to improve the paper and helped with the talk. So thank you!
3rd I am happy to pay for your drinks tonight to make it up to you.