SLIDE 1


Using CrowdSourcing for Data Analytics

Hector Garcia-Molina

(work with Steven Whang, Peter Lofgren, Aditya Parameswaran and others)

Stanford University

  • Big Data Analytics
  • CrowdSourcing
SLIDE 2

CrowdSourcing

Real World Examples

  • Image Matching
  • Translation
  • Categorizing Images
  • Search Relevance
  • Data Gathering

SLIDE 3


Many Crowdsourcing Marketplaces! Many Research Projects!


SLIDE 4

[Diagram: data → analytics → results, with humans in the loop]

Example tasks:

  • get missing data
  • verify results
  • analyze data

Key Point:

  • use humans judiciously
SLIDE 5


Today we will illustrate with:

  • Entity Resolution
  • (may cover another topic briefly)


Traditional Entity Resolution

[Diagram: records from System 1 … System n feed into cleansing and analysis: what matches what??]

SLIDE 6


Why is ER Challenging?

  • Huge data sets
  • No unique identifiers
  • Missing data
  • Lots of uncertainty
  • Many ways to skin the cat


Simple ER Example


SLIDE 7


Simple ER Example

[Diagram: records a, b, c, d with similarity edges sim=0.9 and sim=0.8]

SLIDE 8



ER: Exact vs Approximate

[Diagram: products are partitioned into cameras, CDs, books, …; ER runs on each partition to produce resolved cameras, resolved CDs, resolved books, …]

Simple ER Algorithm

  • Compute pairwise similarities
  • Apply threshold
  • Perform transitive closure


SLIDE 9


Simple ER Algorithm

  • Compute pairwise similarities
  • Apply threshold
  • Perform transitive closure

Pairwise similarities: 0.45, 0.63, 0.5, 0.9, 0.95, 0.9, 0.7, 0.9, 0.87, 0.7, 0.4, 0.6, 0.5, 0.63

After applying threshold = 0.7: 0.9, 0.95, 0.9, 0.7, 0.9, 0.87, 0.7

SLIDE 10


After transitive closure, the pairs that survived the threshold (0.9, 0.95, 0.9, 0.7, 0.9, 0.87, 0.7) are merged into clusters.
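The three steps of the simple ER algorithm can be sketched in code. This is a minimal illustration rather than code from the talk; the union-find structure performs the transitive-closure step, and the toy similarity function is made up.

```python
from itertools import combinations

def simple_er(records, sim, threshold=0.7):
    """Simple ER: pairwise similarities -> threshold -> transitive closure."""
    # Union-find parent pointers implement the transitive-closure step.
    parent = {r: r for r in records}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving
            r = parent[r]
        return r

    # Steps 1 and 2: compute each pairwise similarity and keep pairs
    # at or above the threshold.
    for a, b in combinations(records, 2):
        if sim(a, b) >= threshold:
            parent[find(a)] = find(b)      # step 3: merge components

    # Collect the resulting clusters.
    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())

# Toy similarity function echoing the slides' example.
sims = {("a", "b"): 0.9, ("c", "d"): 0.8}
sim = lambda x, y: sims.get((x, y), sims.get((y, x), 0.1))
print(simple_er(["a", "b", "c", "d"], sim))  # two clusters: {a,b} and {c,d}
```

Any clustering routine could replace the union-find here; it is just the most direct way to take the transitive closure of the surviving edges.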

Crowd ER


SLIDE 11


Same as this?


Crowd ER

  • First Cut: For every pair of records, ask workers if they match (i.e., get similarity)


SLIDE 12


Crowd ER

  • First Cut: For every pair of records, ask workers if they match (i.e., get similarity)
  • Too expensive!

Crowd ER

  • Second Cut: Compute similarities; workers verify "critical" pairs

[Diagram: the similarity graph (0.45, 0.63, 0.5, 0.9, 0.95, 0.9, 0.7, 0.9, 0.87, 0.7, 0.4, 0.6, 0.5, 0.63); which pairs are critical??]
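The deck leaves "critical" open at this point. One simple heuristic, purely an assumption here rather than the talk's actual criterion, is to send to the crowd only the pairs whose similarity falls inside an uncertainty band around the threshold:

```python
def critical_pairs(sims, threshold=0.7, band=0.15):
    """Pairs whose similarity is too close to the threshold to trust.

    Pairs well above the threshold are auto-accepted, pairs well below
    are auto-rejected; only the uncertain middle band goes to workers.
    The band width is a tunable assumption.
    """
    return [pair for pair, s in sims.items() if abs(s - threshold) <= band]

sims = {("a", "b"): 0.9, ("b", "c"): 0.63, ("a", "c"): 0.2}
print(critical_pairs(sims))  # [('b', 'c')]
```

The slides that follow develop a more principled answer: pick questions by how much their answers would improve the expected ER result.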

SLIDE 13



SLIDE 14



[Diagram: records → pairwise analysis → clusters; global analysis generates questions for the crowd, whose answers flow back as new evidence]

Key Point:

  • use humans judiciously

Key Issue: Semantics of Crowd Answer


SLIDE 15


[Diagram: records A, B, C, D, E with a "?" question posed to the crowd]

Also issue: Similarities as Probabilities


sim(a,b) → prob(a,b)

SLIDE 16


Strategy

Current state: records a, b, c with prob(a,b)=0.9, prob(b,c)=0.5, prob(a,c)=0.2

  • ER result: use any given ER algorithm on the current state
  • Consider ALL possible questions: Q(a,b), Q(b,c), Q(a,c) (three in this example)

SLIDE 17


Strategy

[Diagram: from the current state, each question Q(a,b), Q(b,c), Q(a,c) branches on answer Y or N into a new state, and each new state yields an ER result; consider all possible outcomes]

Example: if Q(b,c) is answered Y, the new state becomes prob(a,b)=0.9, prob(b,c)=1.0, prob(a,c)=0.2, and the ER result groups a, b, c together.

SLIDE 18


Strategy

[Diagram: the same question tree, now with a score attached to each of the six ER results]

Two Remaining Issues

  • How do we score an ER result? (compare it against a gold standard to compute an F-score)
  • Efficiency?

SLIDE 19


Gold Standard?

Step 1 (sim to prob): map the similarities (0.9, 0.5, 0.2) on pairs (a,b), (b,c), (a,c) to probabilities (1.0, 0.6, 0.2).

Step 2 (possible worlds): the edge probabilities induce four possible worlds, with probabilities 0.12, 0.48, 0.08, 0.32.

SLIDE 20


Step 3 (possible clusterings, via the ER algorithm): the four worlds collapse into two clusterings, {a,b,c} with probability 0.68 and {a,b},{c} with probability 0.32.
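The possible-worlds numbers on these slides can be reproduced directly: treat each pair's probability as an independent edge, enumerate the worlds, and cluster each world by transitive closure. A sketch (the pair-to-probability assignment follows the slide layout and is an inference, not stated explicitly):

```python
from itertools import product

def clustering(nodes, edges):
    """Cluster nodes by transitive closure over the present edges."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return frozenset(frozenset(g) for g in groups.values())

# Edge probabilities after the sim-to-prob step on the slides.
probs = {("a", "b"): 1.0, ("b", "c"): 0.6, ("a", "c"): 0.2}
edges = list(probs)

# Enumerate possible worlds; accumulate the probability of each clustering.
dist = {}
for present in product([True, False], repeat=len(edges)):
    p = 1.0
    world = [e for e, there in zip(edges, present) if there]
    for e, there in zip(edges, present):
        p *= probs[e] if there else 1.0 - probs[e]
    if p > 0:
        cl = clustering("abc", world)
        dist[cl] = dist.get(cl, 0.0) + p

# Matches the slides: 0.68 for {a,b,c}, 0.32 for {a,b},{c}.
for cl, p in dist.items():
    print(sorted(sorted(g) for g in cl), round(p, 2))
```

The worlds containing either the b-c or the a-c edge all close transitively into one cluster, which is why 0.12 + 0.48 + 0.08 = 0.68 ends up on {a,b,c}.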

Strategy

[Diagram: each question's Y or N outcome leads to a new state and an ER result, each scored against the gold standard ("score vs GS?")]
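Structurally, the strategy picks the question whose expected score over its Y/N answers is best. The sketch below assumes the probability of a Yes answer is the pair's current probability and leaves the scoring function abstract; the actual scoring (against the possible-worlds gold standard) and its efficient evaluation are in the Whang, Lofgren, and Garcia-Molina paper cited under "Evaluating Efficiently".

```python
def expected_score(state, question, score):
    """Expected score of asking one question, over its Yes/No answers.

    state: dict mapping a record pair to its current match probability.
    Assumes P(Yes) equals the pair's current probability.
    """
    p_yes = state[question]
    yes_state = {**state, question: 1.0}  # crowd confirms the match
    no_state = {**state, question: 0.0}   # crowd rejects the match
    return p_yes * score(yes_state) + (1 - p_yes) * score(no_state)

def best_question(state, score):
    """Pick the question with the highest expected score."""
    return max(state, key=lambda q: expected_score(state, q, score))

# Current state from the slides.
state = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.2}

# Stand-in scoring function (an assumption, not the paper's F-score
# against the gold standard): reward states whose pairs are decided,
# i.e., far from probability 0.5.
decidedness = lambda st: sum(abs(p - 0.5) for p in st.values())

print(best_question(state, decidedness))  # ('b', 'c'): the most uncertain pair
```

Even with this toy score, the machinery picks Q(b,c), the pair the similarities say least about, which is the intuition the slides are building toward.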

SLIDE 21


Evaluating Efficiently

  • See: Steven E. Whang, Peter Lofgren, and H. Garcia-Molina. Question Selection for Crowd Entity Resolution. To appear in Proc. 39th Int'l Conf. on Very Large Data Bases (PVLDB), Trento, Italy, 2013.


Sample Result


SLIDE 22


Summary

[Diagram: data → analytics → results, with humans in the loop]

Example tasks:

  • get missing data
  • verify results
  • analyze data

Key Point:

  • use humans judiciously

Now for something completely different!

[Diagram: big data → analytics, via a DBMS]

SLIDE 23


DeCo: Declarative CrowdSourcing

[Diagram: end user → DBMS → data, with humans in the loop]

End user: what is the best price for Nikon DSLR cameras?

SLIDE 24


The DBMS holds a product table:

  model   type   brand
  D7100   DSLR   Nikon
  7D      DSLR   Canon
  P5000   comp   Nikon
  ...

To answer the user's question, it asks the Crowd: what is the best price for a Nikon D7100 camera?

SLIDE 25


Example with a bit more detail:

User view:

  restaurant      rating   cuisine
  Chez Panisse    4.9      French
  Chez Panisse    4.9      California
  Bytes           3.8      California
  ...

Behind it are one Anchor table and two Dependent tables:

Anchor:

  restaurant
  Chez Panisse
  Bytes
  ...

Dependent (rating):

  restaurant      rating
  Chez Panisse    4.8
  Chez Panisse    5.0
  Chez Panisse    4.9
  Bytes           3.6
  Bytes           4.0
  ...

Dependent (cuisine):

  restaurant      cuisine
  Chez Panisse    French
  Chez Panisse    California
  Bytes           California
  Bytes           California
  ...

SLIDE 26


Fetch rules pull new data from the crowd: one fetch rule adds restaurants (e.g., Bytes, Chez Panisse) to the Anchor table, and other fetch rules fill missing attribute values (e.g., the cuisine French) into the Dependent tables.

SLIDE 27


Resolution rules reconcile the multiple values fetched for the same entity: the ratings fetched for Chez Panisse (4.8, 5.0, 4.9) resolve to the single rating 4.9, and those for Bytes (3.6, 4.0) resolve to 3.8.

Query processing then proceeds in three steps:

  • 1. Fetch
  • 2. Resolve
  • 3. Join
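The fetch/resolve/join pipeline can be simulated in a few lines. The table contents follow the slides; the averaging-based resolution rule and the round-to-one-decimal choice are illustrative assumptions, not Deco's actual rules.

```python
from statistics import mean

# Anchor table: the entities themselves.
anchor = ["Chez Panisse", "Bytes"]

# Fetch: raw, possibly conflicting crowd answers per dependent table.
raw_rating = {"Chez Panisse": [4.8, 5.0, 4.9], "Bytes": [3.6, 4.0]}
raw_cuisine = {"Chez Panisse": ["French", "California"],
               "Bytes": ["California", "California"]}

# Resolve: reconcile duplicates (here: average ratings, dedupe cuisines).
rating = {r: round(mean(v), 1) for r, v in raw_rating.items()}
cuisine = {r: sorted(set(v)) for r, v in raw_cuisine.items()}

# Join: combine the anchor with the resolved dependents into the user view.
view = [(r, rating[r], c) for r in anchor for c in cuisine[r]]
for row in view:
    print(row)  # e.g. ('Chez Panisse', 4.9, 'French')
```

Averaging reproduces the ratings shown in the slides' user view (4.9 and 3.8), but a real resolution rule could equally take a majority vote or ask the crowd again.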

SLIDE 28
SLIDE 28

28 Fetch [n] Fetch [ln] Fetch [ln,c] Scan D1(n,l) Scan A(n)

55

Join Join AtLeast [8]

SELECT n,l,c FROM country WHERE l = ‘Spanish’ ATLEAST 8

Resolve[m3] Resolve[d.e] Fetch [nl] Scan D2(n,c) Resolve[m3] Fetch [nl,c]

Many Query Processing Challenges

Filter [l=‘Spanish’] Fetch [nl,c]

Deco Prototype V1.0


SLIDE 29


Conclusion

  • Crowdsourcing is important for managing data!
  • Still many challenges ahead!