HUMAN-POWERED DATA MANAGEMENT. Aditya Parameswaran, with H. Garcia-Molina, J. Widom, A. Polyzotis, M. Teh.



SLIDE 1

HUMAN-POWERED DATA MANAGEMENT

Aditya Parameswaran
with H. Garcia-Molina, J. Widom, A. Polyzotis, M. Teh

SLIDE 2

Why should we (DM/DB folks) care?

Reason 1: Most data is unstructured

Structured Data vs. Unstructured Data (images, videos, text)
Automated processing: not yet solved. Incorporate xyzabc

SLIDE 3

Why should we (DM/DB folks) care?

Reason 2: Software companies use crowds at scale

Often 10s+ of millions of $ / yr. / company (on crowds + supervisors).
Plenty of startups too! We undertook a survey of industry crowdsourcing users.

[Logos of companies that use crowds]

SLIDE 4

Why should we (DM/DB folks) care?

Reason 3: Marketplaces are growing rapidly

Crowdsourcing marketplaces: 20+ marketplaces; big companies have internal ones.
The size of these marketplaces doubled between 2011 and 2013.

SLIDE 5

Why should we (DM/DB folks) care?

Reason 1: Most data is unstructured
Reason 2: Software companies use crowds at scale
Reason 3: Marketplaces are growing rapidly

SLIDE 6

What is Human-Powered Data Management?

Humans act as "data processors": e.g., compare, label, extract.

It draws on:
  • Data Processing: algorithms, learning accuracies, data processing systems
  • Machine Learning
  • HCI: interfaces, patterns
  • Economics: incentives

SLIDE 7

Efficient Data Processing Algorithms & Systems

Data processing algorithms, data processing systems, and auxiliary plugins (quality, pricing):

Filter [SIGMOD12, VLDB14], Max [SIGMOD12], Clean [KDD12, TKDD13], Categorize [VLDB11], Search [ICDE14], Debugging [NIPS12], Deco [CIKM12, VLDB12, TR12, SIGMOD Record 12], DataSift [HCOMP13, SIGMOD14], HQuery [CIDR11], Confidence [KDD13, TR14], Eviction [TR12], Pricing [VLDB15], Quality [HCOMP14]

i.stanford.edu/~adityagp/scoop.html

SLIDE 8

Data Proc. Sys.: Crowd-Powered Search

Can your search engine handle this?
  • "apartments in a good school district near Urbana, with a bus stop nearby"
  • "type of cable that connects to buildings in the vicinity of xxx"

SLIDE 9

DataSift: Crowd-Powered Search

  • Non-textual content:
    "cables that plug into <img>"
    "funny pictures of cats with hats with captions"
  • Time-consuming:
    "find noise canceling headphones where the battery lasts 13 hrs"
    "apartments in a nice area around urbana"

SLIDE 10

SLIDE 11

Building DataSift: Challenges

Workflow: Gather → Retrieve → Filter (possibly iterated: Gather → Retrieve → Filter → Gather → Retrieve → Filter → …)
Gather: ask for text reformulations of the query. Filter: check if an item satisfies the query.

  • How many reformulations should we gather?
  • How many items should we retrieve at each step?
  • How do we filter items? How many people do we ask?
  • How do we optimize the workflow?
  • How do we guarantee correctness?

SLIDE 12

Fundamental Tradeoffs

Cost: how much am I willing to spend?
Latency: how long can I wait?
Quality: what is my desired quality?

SLIDE 13

DataSift Summary

Gather → Retrieve → Filter, trading off latency, cost, and quality.

Sample applications: education, social media, commerce, journalism, …

[SIGMOD14] DataSift: A Crowd-Powered Search Toolkit (demo)
[HCOMP13] An expressive and accurate crowd-powered search

SLIDE 14

Filtering: The Simplest Version

Dataset of items → Boolean predicate → filtered dataset (answers Y, Y, N, …).
Humans are asked: "Does X satisfy the predicate?", e.g., "Is this image a cat?"
For now, all humans have the same error rates.
Tradeoffs: latency, cost, quality.

SLIDE 15

Our Visualization of Strategies

[Figure: a grid with axes Yes (# Yes answers) and No (# No answers), each running 1 to 5; every grid point is marked decide PASS, continue, or decide FAIL]

A strategy is a Markov Decision Process over this grid.

SLIDE 16

Strategy Examples

[Figure: two example strategies on the (Yes, No) grid, with points marked decide PASS, continue, or decide FAIL]

SLIDE 17

Simplest Version

Given:
— Human error probability (FP/FN): Pr[Yes | 0], Pr[No | 1]
— A-priori probability: Pr[0], Pr[1]
(both obtained via sampling, prior history, or a gold standard)

Find the strategy with minimum expected cost (# of questions), such that:
— Expected error < t (say, 5%)
— Cost per item < m (say, 20 questions), i.e., the grid is bounded by x + y = m

SLIDE 18

Evaluating Strategies

On the (Yes, No) grid, with x = # Yes answers and y = # No answers:

  • Pr[reach (4, 2)] = Pr[reach (4, 1) & get a No] + Pr[reach (3, 2) & get a Yes]
  • Cost = Σ (x + y) · Pr[reach (x, y)], summed over points where the strategy stops (decide PASS / decide FAIL)
  • Error = Pr[reaching a decision of 1 for a 0-item] + Pr[reaching a decision of 0 for a 1-item]
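The reach recurrence and the cost/error formulas above can be turned into a small dynamic program. The sketch below is illustrative: the error rates, prior, budget, and the majority-style `decision` strategy are assumptions of the example, not taken from the talk.

```python
from functools import lru_cache

# Parameters assumed for illustration (not from the talk)
PRIOR1 = 0.5  # a-priori Pr[item satisfies the predicate]
FP = 0.2      # Pr[Yes | item is 0]
FN = 0.2      # Pr[No  | item is 1]
M = 4         # budget: at most M answers per item

def decision(x, y):
    """A majority-style example strategy on the (x = #Yes, y = #No) grid:
    decide once the margin reaches 2, or when the budget runs out."""
    if x - y >= 2:
        return "PASS"
    if y - x >= 2:
        return "FAIL"
    if x + y == M:
        return "PASS" if x >= y else "FAIL"
    return "CONT"

@lru_cache(maxsize=None)
def reach(x, y):
    """(Pr[reach (x,y) & item=1], Pr[reach (x,y) & item=0]),
    following the recurrence on the slide."""
    if (x, y) == (0, 0):
        return PRIOR1, 1 - PRIOR1
    p1 = p0 = 0.0
    if x > 0 and decision(x - 1, y) == "CONT":  # ... & got a Yes
        a1, a0 = reach(x - 1, y)
        p1 += a1 * (1 - FN)
        p0 += a0 * FP
    if y > 0 and decision(x, y - 1) == "CONT":  # ... & got a No
        a1, a0 = reach(x, y - 1)
        p1 += a1 * FN
        p0 += a0 * (1 - FP)
    return p1, p0

cost = err = 0.0
for x in range(M + 1):
    for y in range(M + 1 - x):
        d = decision(x, y)
        if d == "CONT":
            continue
        p1, p0 = reach(x, y)
        cost += (x + y) * (p1 + p0)       # Cost = sum (x+y) Pr[reach (x,y)]
        err += p0 if d == "PASS" else p1  # mass that terminates incorrectly
print(round(cost, 3), round(err, 3))  # -> 2.64 0.104
```

With these numbers the example strategy asks 2.64 questions per item on average and errs on about 10.4% of items; tightening the margin or budget trades cost against error, which is exactly the tradeoff the optimal-strategy search explores.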

SLIDE 19

Naïve Approach

For each grid point, assign decide PASS, continue, or decide FAIL.
For all strategies: evaluate cost & error; return the best.
O(3^g) strategies, where g = O(m²) is the number of grid points. If m = 5, g = 21.

SLIDE 20

Comparison

Strategy               | Computing             | Money
Naïve deterministic    | Not feasible          | $$
Our best deterministic | Exponential; feasible | $$$

SLIDE 21

Probabilistic Strategy Example

[Figure: the (Yes, No) grid; each point carries a probability triple over (decide PASS, continue, decide FAIL), e.g., (0.2, 0.8, 0)]

SLIDE 22

Comparison

Strategy               | Computing                 | Money
Naïve deterministic    | Exponential; not feasible | $$
Our best deterministic | Exponential; feasible     | $$$
The best probabilistic | Polynomial(m); THE BEST   | $

SLIDE 23

Finding the Optimal Strategy

Simple: use Linear Programming
  • Variables: a probabilistic decision per grid point
  • Constraints: probability conservation; boundary conditions

[SIGMOD12] CrowdScreen: Algorithms for filtering data with humans
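One way to write that LP out concretely is as a probability-mass flow over the grid. This is a sketch: the flow-style variables and notation below are mine, not necessarily the exact formulation of the CrowdScreen paper. Since all workers share the same error rates, the posterior at a grid point is fixed by Bayes' rule, π(x,y) = Pr[1 | x Yes, y No], and the next answer there is Yes with probability q(x,y) = π(x,y)(1 − fn) + (1 − π(x,y)) fp, so everything below is linear in the decision masses.

```latex
\begin{align*}
&\text{variables: } P_{x,y},\, F_{x,y},\, C_{x,y} \ge 0
  \quad\text{(mass deciding PASS, FAIL, or continuing at } (x,y)\text{)}\\
&\text{conservation: } P_{x,y} + F_{x,y} + C_{x,y}
  = C_{x-1,y}\, q(x-1,y) + C_{x,y-1}\,\bigl(1 - q(x,y-1)\bigr),
  \quad \text{inflow } 1 \text{ at } (0,0)\\
&\text{boundary: } C_{x,y} = 0 \ \text{whenever } x + y = m\\
&\text{minimize } \sum_{x,y} (x+y)\,(P_{x,y} + F_{x,y})
  \quad\text{s.t.}\quad
  \sum_{x,y} \bigl[P_{x,y}\,(1-\pi(x,y)) + F_{x,y}\,\pi(x,y)\bigr] \le t
\end{align*}
```

The probabilistic decision at a point is recovered from the optimum by normalizing: e.g., the continue probability at (x,y) is C divided by the total inflow there.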

SLIDE 24

Generalizations

  • Multiple answers (ratings, categories)
  • Multiple independent filters
  • Difficulty
  • Different penalty functions
  • Latency
  • Different worker abilities
  • Different worker probes
  • A-priori scores

Some of these are doable; others (such as different worker abilities, next) are hard!

SLIDE 25

Generalization: Worker Abilities

State: (W1 Yes, W1 No, …, Wn Yes, Wn No) → O(m^(2n)) points; with n ≈ 1000, an explosion of state!

       | Item 1 | Item 2 | Item 3
Actual | 0      | 1      | 0
W1     | 0      | 1      | 0
W2     | 1      | 1      | 1
W3     | 1      | 0      | 1

SLIDE 26

A Different Representation

[Figure: instead of the (Yes, No) grid, plot cost on one axis against Pr[1 | Ans] (from 0 to 1) on the other]

SLIDE 27

Worker Abilities: Sufficiency

Instead of the full state (W1 Yes, W1 No, W2 Yes, W2 No, …, Wn Yes, Wn No), recording Pr[1 | Ans] is sufficient: a strategy on the (cost, Pr[1 | Ans]) representation is still optimal.
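A tiny sketch of why the posterior is enough to track: with per-worker error rates (all numbers below are assumed for illustration), Pr[1 | Ans] can be updated answer by answer, so the full answer vector never needs to be stored.

```python
# Illustrative sketch: heterogeneous worker error rates (numbers assumed).
PRIOR1 = 0.5  # a-priori Pr[item is 1]
WORKERS = {   # worker id -> (fp, fn) = (Pr[Yes | 0], Pr[No | 1])
    "w1": (0.10, 0.30),
    "w2": (0.25, 0.25),
    "w3": (0.30, 0.10),
}

def posterior(answers):
    """answers: list of (worker_id, 'Y' or 'N') -> Pr[item is 1 | answers]."""
    p1, p0 = PRIOR1, 1 - PRIOR1  # unnormalized joint masses for item = 1 / 0
    for w, a in answers:
        fp, fn = WORKERS[w]
        if a == "Y":
            p1 *= 1 - fn
            p0 *= fp
        else:
            p1 *= fn
            p0 *= 1 - fp
    return p1 / (p1 + p0)

print(posterior([("w1", "Y"), ("w2", "N")]))  # ~0.7
# Answer order doesn't matter, and a symmetric worker's Y then N cancels:
print(posterior([("w2", "Y"), ("w2", "N")]))  # back to the prior, 0.5
```

Different answer vectors that yield the same Pr[1 | Ans] collapse into one state, which is what tames the O(m^(2n)) explosion from the previous slide.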

SLIDE 28

MOOCs: Application of Filtering

Peer evaluation: required crowdsourcing.
A generalization of boolean filtering to scoring [1-5] (grades A+, A, B+, B-, …).

SLIDE 29

Experiments on MOOCs

Stanford HCI course: 1000 × 5 × 5 parts = 25,000 parts, graded by random peers with known error rates.
To study: how much we can reduce error for fixed cost.

SLIDE 30

Summary: for the same cost, a reduction in error (distance from the correct grade) of:
  • 50% over median
  • 30% over MLE
  • 10-20% over same accuracy

[VLDB14] Optimal Crowd-Powered Rating and Filtering Algorithms

SLIDE 31

Efficient Data Processing Algorithms & Systems

Data processing algorithms, data processing systems, and auxiliary plugins (quality, pricing), trading off latency, cost, and quality:

Filter [SIGMOD12, VLDB14], Max [SIGMOD12], Clean [KDD12, TKDD13], Categorize [VLDB11], Search [ICDE14], Debugging [NIPS12], Deco [CIKM12, VLDB12, TR12, SIGMOD Record 12], DataSift [HCOMP13, SIGMOD14], HQuery [CIDR11], Confidence [KDD13, TR14], Eviction [TR12], Pricing [VLDB15], Quality [HCOMP14]

i.stanford.edu/~adityagp/scoop.html

SLIDE 32

SLIDE 33

VISUAL DATA MANAGEMENT with SeeDB

Aditya Parameswaran
with: Hector Garcia-Molina, Sam Madden, Alkis Polyzotis, Manasi Vartak

SLIDE 34

Simplifying Data Analytics

"Up to a million additional analysts will be needed to address data analytics needs in 2018 in the US alone." --- McKinsey Big Data Report, 2013

How do we make it easier for novice data analysts to get insights from data?

SLIDE 35

Data Analytics Workflow

Query (e.g., select "Staplers" from All Products) → views: "Sales by Year", "Production by Year", "Production by State" (bar charts over MA, CA, IL, NY).

Laborious and tiresome! Can we automate this?

Similar issues with Tableau, ShowMe, Profiler, Spotfire.

SLIDE 36

Potentially Interesting Views (Visualizations)

"Potentially interesting": a trend in a subset that is not in the overall data (e.g., "Sales by Year" for the subset).

Can we automatically highlight potentially interesting views?

Saving: instead of stepping through all views, now only the potentially interesting ones!

SLIDE 37

Our Proposed System: SeeDB

Query Q on database D (many rows, many columns) → DBMS + SeeDB → k potentially interesting visualizations.

[VLDB14] SeeDB: Visualizing Database Queries Efficiently (Vision)
[VLDB14] Automatically Generating Query Visualizations (Demo)

SLIDE 38

SeeDB: Conceptual Workflow

Query Q on D → candidate views V1, V2, …, Vn → score each view V → Visual Engine.

Objective: find the k best-scoring views (or visualizations).

Really expensive!

SLIDE 39

How do we score views?

We are pursuing ways to learn this scoring function using crowds. For now, a proxy that is "good enough": differences in "distribution", e.g., EMD, Euclidean, KL-divergence.

Difference(Distribution of Sales by year overall, Distribution of Sales by year for Staplers)

Our techniques work with any scoring metric. This is a hard, domain-specific question!
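A minimal sketch of the distribution-difference proxy, here with KL-divergence; the metric choice and all histogram counts below are illustrative, not from the talk.

```python
from math import log

# Distribution-difference proxy for scoring a view.
def normalize(counts):
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) with a small epsilon so empty bins don't blow up."""
    return sum(pi * log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# "Sales by year": overall vs. restricted to the "Staplers" subset
overall  = normalize([120, 130, 125, 140, 135])  # roughly flat trend
staplers = normalize([10, 15, 60, 20, 12])       # a spike in one year
flatline = normalize([118, 128, 127, 138, 133])  # mimics the overall trend

print(kl_divergence(staplers, overall))  # large  -> potentially interesting
print(kl_divergence(flatline, overall))  # ~ zero -> not interesting
```

The same skeleton works with EMD or Euclidean distance: only `kl_divergence` changes, which is the sense in which the techniques are metric-agnostic.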

SLIDE 40

How many views to consider?

Star schema; histogram visualizations.
M measure attributes, A dimension attributes, F aggregation functions.
One-dimensional visualizations: M × A × F. If we also consider binning: M × A × F × B.

SLIDE 41

Building SeeDB: Concrete Directions

How do we minimize computation?
  • Sharing computation
  • Approximate visualizations
  • Approximate scoring
  • Visualization pruning

SLIDE 42

Technique 1: Sharing Computation

"Sales by Year" + "Production by Year" → one combined query, "Sales and Production by Year":

SELECT AGG(M1), AGG(M2), D
FROM R
WHERE Prod = "Staplers"
GROUP BY D

Linear speedup!

SLIDE 43

Technique 1: Sharing Computation

"Sales by Year" + "Sales by Region" → "Sales by Year, Region":

SELECT AGG(M), D1, D2
FROM R
WHERE Prod = "Staplers"
GROUP BY D1, D2

Problematic: the number of aggregates grows rapidly. Intractable!

SLIDE 44

Technique 2: Approximate Visualizations

Can we provide visualizations ("Production by Year") that are guaranteed to look similar (e.g., similar order, similar differences) to the actual ones, but at much lower cost?

Analysts are only interested in trends, not absolutes, and are limited by resolution anyway.

SLIDE 45

Technique 2: Approximate Visualizations

The answer is yes! At a high level, the algorithm samples "more" from contentious areas.

  • Orders-of-magnitude savings compared to baselines
  • Optimality guarantees
  • Also of independent interest

[TR14] Generating Rapid Visualizations with Guarantees
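A rough sketch of the "sample more from contentious areas" idea (not the [TR14] algorithm itself; the data, the Hoeffding-style bound, and all constants are assumptions): groups whose confidence intervals still overlap, i.e., whose bar ordering is unresolved, receive more samples, while clearly separated groups stop early.

```python
import random
from math import log, sqrt

random.seed(7)

# Illustrative data: per-group measure values ("Production" per year)
POP = {year: [random.gauss(mu, 1.0) for _ in range(20000)]
       for year, mu in [("2010", 3.0), ("2011", 3.1), ("2012", 4.5)]}

def ci_halfwidth(n, delta=0.05, value_range=8.0):
    # Hoeffding-style half-width for a mean, treating values as roughly
    # bounded within value_range -- an assumption of this sketch
    return value_range * sqrt(log(2 / delta) / (2 * n))

def approx_bars(pop, batch=100, max_n=20000):
    samples = {g: random.sample(pop[g], batch) for g in pop}
    while True:
        means = {g: sum(s) / len(s) for g, s in samples.items()}
        hw = {g: ci_halfwidth(len(samples[g])) for g in samples}
        # contentious pairs: intervals overlap, so bar order is unresolved
        unresolved = [(a, b) for a in pop for b in pop if a < b
                      and abs(means[a] - means[b]) <= hw[a] + hw[b]]
        needy = {g for pair in unresolved for g in pair
                 if len(samples[g]) < max_n}
        if not needy:
            return means, {g: len(s) for g, s in samples.items()}
        for g in needy:  # sample more only where the ordering is contested
            # (re-sampling fresh batches, i.e., with replacement across
            # batches, for simplicity)
            samples[g] += random.sample(pop[g], batch)

means, counts = approx_bars(POP)
print(counts)  # separated groups stop early; contested ones keep sampling
```

Here "2012" separates from the other bars after a few hundred samples, while the near-tied "2010"/"2011" pair absorbs the budget: the ordering-preserving guarantee is bought only where it is actually in doubt.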

SLIDE 46

Building SeeDB: Concrete Directions

How do we minimize computation?
  • Sharing computation
  • Approximate visualizations
  • Approximate utility computation
  • Visualization pruning

Overall, a rich space of questions generalizable beyond SeeDB!

SLIDE 47

Our Current Design

The architecture of our system is shown in Figure 1 below.

SLIDE 48

Interactive Query Builder

SLIDE 49

Top-k Visualizations

SLIDE 50

To summarize…

SeeDB has some ambitious goals: "show me all that's interesting about the query result", i.e., the holy grail of exploratory visual data analysis.

We've barely scratched the surface yet, but that doesn't mean we can't build a useful tool!