Agenda Closing the Loop on Data Analysis Smoke Fast Lineage + - - PDF document

agenda
SMART_READER_LITE
LIVE PREVIEW

Agenda Closing the Loop on Data Analysis Smoke Fast Lineage + - - PDF document

10/30/17 Agenda Closing the Loop on Data Analysis Smoke Fast Lineage + Interactions eugenewu.net Precision Interfaces Interface for All Analyses assistant professor Scorpion Explaining Outliers columbia university data science institute 2


slide-1
SLIDE 1

10/30/17 1

Closing the Loop on Data Analysis

eugenewu.net

assistant professor

columbia university data science institute

Agenda

Smoke Fast Lineage + Interactions Precision Interfaces Interface for All Analyses Scorpion Explaining Outliers

2

DB SQL DB Vis BI ML

… “The W

  • rld”

[Unflattening - Nick Sousanis]

DB Vis BI ML

“The W

  • rld”

Data Visualization Management System

DVMS

Co Co-de design gn

end-to-end

human-in-the-loop

data analysis

slide-2
SLIDE 2

10/30/17 2

DB BI ML

“The W

  • rld”

… to create and scale?

How

interfaces to create?

What

Vis

the data?

Prep Why?

What? Why?

Scorpion: Explaining Outliers Ne NeuroFlash: : Explaining Neural Networks Explaining Social Media Popularities

How?

De DeVIL: : Use human limits SMOKE: Lineage for Interactive Vis VI VISTREAM: Prefetching architecture

Prep?

Ac ActiveClean an: : Interactive Cleaning for ML QF QFix: Cleaning past queries Pr PreCog: Quality Push-down [CIDR ’17] [revision] [in progress] [VLDB ’13] [in progress] [in progress] [VLDB ’16] [SIGMOD ’17] [in review] PI: Scalable Interface Generation S4 S4: Spreadsheet-style search PV PVD: D: Physical Visualization Design [HILDA ‘17] [SIGMOD ‘15] [in progress]

DVMS DVMS Projects

Agenda

Smoke Fast Lineage + Interactions Precision Interfaces Interface for All Analyses Scorpion Explaining Outliers

17

Smoke: Fast Lineage + Interactions

18

Result 1 Result 2 backward_trace() forward_trace() view_refresh()

Revenue Profit Price Product Price Product

Smoke: Fast Lineage + Interactions

refresh(backward_trace( ,input))

Revenue Profit

backward_trace() view_refresh()

Revenue Profit Price Product Price Product

Smoke: Fast Lineage + Interactions

backward_trace() view_refresh() refresh(backward_trace( ,input))

Revenue Profit

slide-3
SLIDE 3

10/30/17 3

Revenue Profit Price Product Price Product

Smoke: Fast Lineage + Interactions

backward_trace() view_refresh() refresh(backward_trace( ,input))

Revenue Profit

SPLOT = SELECT 8 AS radius, 'gray' AS stroke, 'gray' AS fill, lscale(revenue, sx) AS center_x, lscale(profit, sy) AS center_y, FROM A, B, sx, sy WHERE …; HIST = SELECT 4 AS width, 'blue' AS fill, hscale(price, hx) AS height FROM B, C, hx WHERE …; render(SELECT * FROM SPLOT); render(SELECT * FROM HIST); BT = BACKWARD TRACE FROM HIST@vnow-1 AS HS, clicked WHERE clicked.id = HS.id TO A; SPLOT = SELECT ..., 'red' AS fill FROM BT, B WHERE … UNION SELECT ..., 'gray' AS fill FROM (A EXCEPT BT), B WHERE … HIST = SELECT ..., 'red' AS fill FROM BT, C WHERE … UNION SELECT ..., 'blue’ AS fill FROM (A EXCEPT BT), C WHERE …

interaction(vis(database))

SQL(Lineage( )) SQL

Fine-grained Lineage Capture

22

id qty $ j1 1 6 40 j2 1 1 40 j3 2 9 5 id qty b 1 1 6 b 2 1 1 b 3 2 9 id $ a1 1 40 a2 2 5

𝛿"#,%&'()*+∗$)(A⨝B) ⨝

id

sum

  • 1

1 280

  • 2

2 45

γ

Fine-grained Lineage Capture

23

id

sum

  • 1

1 280

  • 2

2 45 id qty $ j1 1 6 40 j2 1 1 40 j3 2 9 5 id qty b 1 1 6 b 2 1 1 b 3 2 9 id $ a1 1 40 a2 2 5

Capture lineage graph w/ low-overhead to answer lineage queries efficiently

𝛿"#,%&'()*+∗$)(A⨝B)

How do people capture lineage today?

Lazy aka don’t capture Eager via Query rewrites Eager via Instrumentation

24

Lazy Approach

25

id

sum

  • 1

1 280

  • 2

2 45 id qty $ j1 1 6 40 j2 1 1 40 j3 2 9 5 id qty b 1 1 6 b 2 1 1 b 3 2 9 id $ a1 1 40 a2 2 5

Rewrite lineage qs into SQL

Backward_trace(o1,B) = σid=1(B)

[C [Cui et al. and Ikeda et al.]

No No c capt pture overhead Go Good f d for h high gh-se selectivity Ba Bad fo for low-se selectivity No No su support for non-in invertib ible le op

  • ps

Co Complex rewrite predica tes

CO CONS PR PROS

γ

Eager Logical Denormalized

27

id $ pid $ pid qty

  • 1

1 280 1 40 1 6

  • 2

1 280 1 40 1 1

  • 3

2 45 2 5 2 9 id qty $ j1 1 6 40 j2 1 1 40 j3 2 9 5 id qty b 1 1 6 b 2 1 1 b 3 2 9 id $ a1 1 40 a2 2 5

A B

Rewrite original query into single big query

⨝’

γ’

Le Leverage DB query

  • pts

Fl Flex exibility Us Use existing da tab abas ase In Introduces redundanc y Re Resu sult must st be further processed In Index result to use it Ad Addtl pr project ion t t

  • ge

get r real result

CO CONS PR PROS

[Perm, Gpro m, and DB No tes ]

slide-4
SLIDE 4

10/30/17 4

28

id

sum

  • 1

1 280

  • 2

2 45 id qty $ j1 1 6 40 j2 1 1 40 j3 2 9 5 id qty b 1 1 6 b 2 1 1 b 3 2 9 id $ a1 1 40 a2 2 5

A O 1 1 2 2

γ

Re Reduces s redundanc y Ea Easi sily add annot

  • t

ation

  • ns

Us Use existing da tab abas ase Ex Extra Qs s to mak e lineage tables Ne Need t d to i inde dex l lineage ge tabl bles Li Lineage tracing requires join

CO CONS PR PROS

[T rio, and DBN

  • tes]

Eager Logical Normalized Eager Physical

29

id

sum

  • 1

1 280

  • 2

2 45 id qty $ j1 1 6 40 j2 1 1 40 j3 2 9 5 id qty b 1 1 6 b 2 1 1 b 3 2 9 id $ a1 1 40 a2 2 5

⨝’

γ’

Av Avoid relat ational al overhe a ds Co Control over physic al rep Ea Easi sier to

  • integrate

RP RPC/virtual function

  • n calls

s expensiv e Wr Write-in ineffic icie ient lin lineage storage No No ph physical p plan o

  • ptimizations

CO CONS PR PROS

[Subzero, NewT , Ram p, Clo thia et al., Titian]

Lineage Subsystem

(a1, j1) (j1, o1)

Can Lineage Express Interactions?

Performance Issues

High capture overhead Slow lineage tracing

Issues come from:

Redundant work Inefficient representations Per-pointer overheads

Are these issues necessary?

No. See Smoke

30

4 Design Principles Tight Integration

Operator instrumentation Write efficient lineage idxs

Reuse work

Lineage indexes ≈ Hash tables Intra-plan hash table reuse

Apriori Knowledge

Don’t capture if not used

Lineage Consumption Push computation into lineage capture

Can Lineage Express Interactions?

31

4 Design Principles Tight Integration

Operator instrumentation Write efficient lineage idxs

Reuse work

Lineage indexes ≈ Hash tables Intra-plan hash table reuse

Apriori Knowledge

Don’t capture if not used

Lineage Consumption Push computation into lineage capture

W

  • rkload-based
  • ptimizations

(lineage workload) Make capture fast

Smoke Overview

Eager physical approach Write and read-efficient lineage indexes T wo instrumentation approaches

32

Result 1 Result 2

Lineage Index Representation

33

r1 r2 r3 j1 j2 j3

input r1 r2 r3 r4

  • 1
  • 2
  • 3
  • 4
  • utput

N-to-1 j1 j4 j7 j5 j8 j9 j3 rid index 1-to-1 j2 j4 j2 j1 rid array

Op

slide-5
SLIDE 5

10/30/17 5

Two Capture Approaches

34

id

sum

  • 1

1 2

  • 2

2 1 id qty $ j1 1 6 40 j2 1 1 40 j3 2 9 5

γ

id qty $ j1 1 6 40 j2 1 1 40 j3 2 9 5

1 1 2

id

sum

  • 1

1 2

  • 2

2 1 id qty $ j1 1 6 40 j2 1 1 40 j3 2 9 5

γ’

1 1 2

Defer Inject

Inject Capture for GROUPBY

35

(1, 1)

id

sum

  • 1

1 2

  • 2

2 1 id qty $ j1 1 6 40 j2 1 1 40 j3 2 9 5

γbuild γagg

Inject Capture for GROUPBY

36

(1, 2) (2, 1)

j1 j3 j2

id

sum

  • 1

1 2

  • 2

2 1 id qty $ j1 1 6 40 j2 1 1 40 j3 2 9 5

γbuild γagg

Inject Capture for GROUPBY

37

(1, 2) (2, 1)

j1 j3 j2

id

sum

  • 1

1 2

  • 2

2 1

j1 j3 j2

id

sum

  • 1

1 2

  • 2

2 1 id qty $ j1 1 6 40 j2 1 1 40 j3 2 9 5

γbuild γagg

Experiments

Setup

Custom in-memory query compiled engine Execution comparable with MonetDB Smoke vs Logical vs Subsystem vs Lazy TPC-H, synthetic, and cross-filter

Smoke is fast

Lowest capture overhead Fastest tracing & lineage query perf Interactive capture and tracing speeds

38

Capture Overhead GROUPBY

39 SELECT z, COUNT(*), SUM(v), SUM(v*v), SUM(sqrt(v)), MIN(v), MAX(v) FROM zipf GROUP BY z

Sm Smoke-I I best overall à 0. 0.7x 7x ov

  • verhead

Ar Array ay resizing à ~1 ~1/2 of smoke overhead Wr Write-in ineffic icie ient id idxs à ~4 ~4x ov

  • verhead

Vi Virtual function ca calls à~1 ~1.6x Lo Logical pe penalized d by by de denormalized re repre resenta tati tion

slide-6
SLIDE 6

10/30/17 6

Cross Filtering Experiments

40

Cross Filtering Experiments

41

Stepping Back

Lineage capture for high-throughput workflows need not cripple normal execution Lineage capture can directly create idxs and pre-compute results for future Qs

Smells like cracking. W

  • rking on partial data cubes

Lineage tracing Qs fast enough for interactive vis Extending to other applications e.g., ML

42

Agenda

Smoke Fast Lineage + Interactions Precision Interfaces Interface for All Analyses Scorpion Explaining Outliers

43

SQL

Interfaces Make Life Easy Building Interfaces

45

Specs Engineering

It takes work!

slide-7
SLIDE 7

10/30/17 7

task i # people who do the task

Interfaces for Everybody

SQL

W

  • rth

building Not worth building

task i # people who do the task

Our Vision

An Interface for Every Task

Existing Approach 1 Help Developers Build Interfaces

49

Specs Engineering

Existing Approach 2 Non-programmers Program

50

Specs Engineering

51

Read Minds

  • Gen. Interface

Precision Interfaces PI

52

Mine Logs

  • Gen. Interface

SQL sparQL

Precision Interfaces PI

slide-8
SLIDE 8

10/30/17 8

53

What is an Interface? PI

Vis renders program output Interactions change program Program1 Program2 Program3 …

54

Program1 Program2 Program3 …

PI

Precision Interfaces PI Precision Interfaces

55

Interaction Mining Interface Generation

PI Precision Interfaces

Detect interactions in log

Interaction(P) = P’ Proxy: program differences expressible by interactions

56

Interaction Mining Interface Generation

PI Precision Interfaces

57

Interaction Mining Interface Generation

PI

Subtree transform Ti

Precision Interfaces

58

Interaction Mining Interface Generation

PI

Interaction Graph Subtree transform Ti expressible by interactions

slide-9
SLIDE 9

10/30/17 9

Precision Interfaces

59

Interaction Mining Interface Generation

PI

Interaction Graph Q1 Q2 Q3 Q4 Q6 Q5 Q7 Q8

Visually simpler Less efficient Visually complex More efficient

~Set Cover

60

  • Exp. 1 - Synthetic Data

Simple design More typing Complex design No typing Lots of clicking Happy medium Drag & drop Random walk through OLAP space OnTime Flight Dataset

Stepping Back

In the paper:

More logs e.g.,

Sloan Digital Sky Survey, company X

2 Languages: SQL and SPARQL Running user studies

In the future:

Multi-language Leverage query plans + ASTs General visual log summarization Interfaces generate queries... DVMS generates interfaces

61

Agenda

Smoke Fast Lineage + Interactions Precision Interfaces Interface for All Analyses Scorpion Explaining Outliers

62

Scorpion: Data Explanation

sensor , light, voltage, humidity , temperature 54 sensors 3.2k readings/hour

Example: Data Cleaning

64

slide-10
SLIDE 10

10/30/17 10

Example: Data Cleaning

65

Why are highlighted std(temp) pts so wacky?

sensor , light, voltage, humidity , temperature 54 sensors 3.2k readings/hour

Example: Data Cleaning

66

“sensors with low voltage”

sensor , light, voltage, humidity , temperature 54 sensors 3.2k readings/hour

Other Questions

Q: “Why did Obama’s Oct. campaign spend $millions?” A: company = GMMC

$5M $2M $0

Apr 12 Oct 12 Oct 11

67

Other Questions

Q: “Why did Obama’s Oct. campaign spend $millions?” A: company = GMMC Q: “Why does one district test better than the rest?” A: income > 100k ^ nchildren = 1

68

11 1 12 50 100

Time Sensor Volt Humid Temp 11 1 2.64 0.4 34 11 2 2.65 0.3 40 11 3 2.63 0.3 35 12 1 2.7 0.5 35 12 2 2.7 0.4 38 12 3 2.2 0.3 100 1 1 2.7 0.5 35 1 2 2.65 0.5 38 1 3 2.3 0.5 80 Temp Time

69

11 1 12 50 100

Time Sensor Volt Humid Temp 11 1 2.64 0.4 34 11 2 2.65 0.3 40 11 3 2.63 0.3 35 12 1 2.7 0.5 35 12 2 2.7 0.4 38 12 3 2.2 0.3 100 1 1 2.7 0.5 35 1 2 2.65 0.5 38 1 3 2.3 0.5 80 Temp Time

  • 1. Label inputs of selected outliers

70

slide-11
SLIDE 11

10/30/17 11

11 1 12 50 100

Time Sensor Volt Humid Temp 11 1 2.64 0.4 34 11 2 2.65 0.3 40 11 3 2.63 0.3 35 12 1 2.7 0.5 35 12 2 2.7 0.4 38 12 3 2.2 0.3 100 1 1 2.7 0.5 35 1 2 2.65 0.5 38 1 3 2.3 0.5 80 Temp Time

  • 1. Label inputs of selected outliers
  • 2. Label inputs of normal results

71

11 1 12 50 100

Time Sensor Volt Humid Temp 11 1 2.64 0.4 34 11 2 2.65 0.3 40 11 3 2.63 0.3 35 12 1 2.7 0.5 35 12 2 2.7 0.4 38 12 3 2.2 0.3 100 1 1 2.7 0.5 35 1 2 2.65 0.5 38 1 3 2.3 0.5 80 Temp Time

  • 1. Label inputs of selected outliers
  • 2. Label inputs of normal results
  • 3. Apply rule-learning algorithm

Volt < 2.5 & Sensor = 3

72

Time Sensor Volt Humid Temp 11 1 2.64 0.4 34 11 2 2.65 0.3 40 11 3 2.63 0.3 35 12 1 2.7 0.5 35 12 2 2.7 0.4 38 12 3 2.2 0.3 100 1 1 2.7 0.5 35 1 2 2.65 0.5 38 1 3 2.3 0.5 80

11 1 12 50 100

Temp Time

73

Big data means can’t plot everything

11 1 12 50 100

Time Sensor Volt Humid Temp 11 1 2.64 0.4 34 11 2 2.65 0.3 40 11 3 2.63 0.3 35 12 1 2.7 0.5 35 12 2 2.7 0.4 38 12 3 2.2 0.3 100 1 1 2.7 0.5 35 1 2 2.65 0.5 38 1 3 2.3 0.5 80 A VG(Temp) Time

SELECT time, AVG(Temp) FROM readings GROUP BY time

74

11 1 12 50 100

Time Sensor Volt Humid Temp 11 1 2.64 0.4 34 11 2 2.65 0.3 40 11 3 2.63 0.3 35 12 1 2.7 0.5 35 12 2 2.7 0.4 38 12 3 2.2 0.3 100 1 1 2.7 0.5 35 1 2 2.65 0.5 38 1 3 2.3 0.5 80 A VG(Temp) Time

Confounds the anomalous & normal readings!

Volt > 2.67

75

DB

Process Data

11 1 12 50 100

AVG(Temp) Time

slide-12
SLIDE 12

10/30/17 12

DB

11 1 12 50 100

AVG(Temp) Time

Process Data

DB

IGNORE: V

  • lt < 2.5 & sensor = 3

Process Data

11 1 12 50 100

AVG(Temp) Time

System Data cleaning for systematic errors User Quality and error specification

Stepping Back

79

DB Vis BI ML

Vis + Cleaning is natural step Scorpion: specific type of data cleaning User finds outliers, specifies through vis Generate program to delete culprits Arachnida: generalizes ideas User finds any error, specify through vis Generate broader classes of cleaning programs

From Presentation to Manipulation

DB Vis DB Vis

Why?

Sc Scorpion: n: Explaining Outliers Ne NeuroFlash: : Explaining Neural Networks Explaining Social Media Popularities [VLDB ’13] [in progress] [in progress]

Prep?

Ac ActiveClean: : Interactive Cleaning for ML QF QFix: Cleaning past queries Pr PreCog:Quality Push-down [VLDB ’16] [SIGMOD ’17] [in review]

What?

PI PI: Scalable Interface Generation S4 S4: Spreadsheet-style search PV PVD: Physical Visualization Design [HILDA ‘17] [SIGMOD ‘15] [in progress]

How?

De DeVIL: : Use human limits SM SMOKE: Lineage for Interactive Vis VI VISTREAM: Prefetching architecture [CIDR ’17] [revision] [in progress]

eugenewu.net

W e are hiring!

DVMS