SLIDE 1

Divesh Srivastava

AT&T Labs-Research

SLIDE 2

SLIDE 3

♦ Big data integration = Big data + data integration

♦ Data integration: easy access to multiple data sources [DHI12]

– Data in large organizations, governments is often siloed
– Multiple sources of data in the same domain exist on the Web

♦ Big data: all about the V’s

– Size: large volume of data, collected and analyzed at high velocity
– Complexity: huge variety of data, of questionable veracity
– Utility: data of considerable value

SLIDE 4

♦ Big data in the context of data integration: still about the V’s

– Size: large volume of sources, changing at high velocity

– Complexity: huge variety of sources, of questionable veracity
– Utility: sources of considerable value

SLIDE 5

♦ Study on two domains

– Belief of good data
– Bad data can have big impact

Domain  #Sources  Period   #Objects  #Local-attrs  #Global-attrs  Considered items
Stock   55        7/2011   1000*20   333           153            16000*20
Flight  38        12/2011  1200*31   43            15             7200*31

SLIDE 6

♦ Is the data consistent?

– Tolerance to 1% value difference, 15 min for time
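The tolerance rule on this slide can be sketched as a pair of checks. This is only an illustration, not the study's actual procedure: the function names are hypothetical, and measuring the 1% difference relative to the larger magnitude is an assumption, since the slide does not say what the difference is relative to. The sample values are the stock ranges and flight times quoted elsewhere in this deck; the calendar date is a placeholder.

```python
from datetime import datetime, timedelta

def numeric_consistent(a, b, rel_tol=0.01):
    """Treat two numeric values as consistent if they differ by at most rel_tol.

    Assumption: the difference is measured relative to the larger magnitude.
    """
    scale = max(abs(a), abs(b))
    if scale == 0:
        return True  # both values are zero
    return abs(a - b) <= rel_tol * scale

def time_consistent(t1, t2, tol=timedelta(minutes=15)):
    """Treat two timestamps as consistent if they are at most tol apart."""
    return abs(t1 - t2) <= tol

# Upper bounds of the two 52-week ranges differ by about 2%: not consistent.
print(numeric_consistent(95.71, 93.72))

# 6:15 PM vs 6:22 PM is within 15 minutes; 9:40 PM vs 8:33 PM is not.
day = (2011, 12, 1)  # placeholder date, not from the source
print(time_consistent(datetime(*day, 18, 15), datetime(*day, 18, 22)))
print(time_consistent(datetime(*day, 21, 40), datetime(*day, 20, 33)))
```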

SLIDE 7

♦ Why such inconsistency?

– Semantic ambiguity

Yahoo! Finance / Nasdaq

52wk Range: 25.38-95.71
52 Wk: 25.38-93.72
Day’s Range: 93.80-95.71

SLIDE 8

♦ Why such inconsistency?

– Instance ambiguity

SLIDE 9

♦ Why such inconsistency?

– Unit errors

76,821,000 vs. 76.82B
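One way to flag a suspected unit error like the pair above is to normalize both strings to plain numbers and test whether they differ by roughly a power of one thousand. The sketch below is an assumption-based illustration: the suffix table, the function names, and the 1% tolerance are hypothetical and do not come from the study.

```python
from decimal import Decimal

# Hypothetical suffix table; the slide only shows the "B" suffix.
SUFFIXES = {"K": Decimal(10) ** 3, "M": Decimal(10) ** 6, "B": Decimal(10) ** 9}

def parse_amount(text):
    """Parse strings such as '76,821,000' or '76.82B' into a Decimal."""
    cleaned = text.replace(",", "").strip()
    if cleaned and cleaned[-1].upper() in SUFFIXES:
        return Decimal(cleaned[:-1]) * SUFFIXES[cleaned[-1].upper()]
    return Decimal(cleaned)

def looks_like_unit_error(a, b, tolerance=0.01):
    """True if a and b differ by roughly a factor of 10^3, 10^6, or 10^9."""
    high, low = max(a, b), min(a, b)
    if low <= 0:
        return False
    ratio = float(high / low)
    return any(abs(ratio / factor - 1) <= tolerance for factor in (1e3, 1e6, 1e9))

reported = parse_amount("76,821,000")
suffixed = parse_amount("76.82B")
print(looks_like_unit_error(reported, suffixed))
```

The two values on the slide differ by a factor very close to 1000, so the check flags them, while two values that differ by only 1% are not flagged.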

SLIDE 10

♦ Why such inconsistency?

– Pure errors

FlightView  FlightAware  Orbitz
6:15 PM     6:15 PM      6:22 PM
9:40 PM     8:33 PM      9:54 PM

SLIDE 11

♦ Why such inconsistency?

– Random sample of 20 data items + 5 items with largest # of values

SLIDE 12

♦ Do sources copy from other sources?

SLIDE 13

♦ Do sources copy from accurate sources?

SLIDE 14

♦ Geo-spatial data fusion

http://axiomamuse.wordpress.com/2011/04/18/

SLIDE 15

♦ Scientific data analysis

http://scienceline.org/2012/01/from-index-cards-to-information-overload/

SLIDE 16

Google knowledge graph

♦ Building web-scale knowledge bases with correct information

SLIDE 17

♦ Data integration = solving lots of jigsaw puzzles

– Each jigsaw puzzle (e.g., Taj Mahal) is an integrated entity
– Each piece of a puzzle comes from some source
– Small data integration = solving small puzzles

SLIDE 18

♦ Data integration = solving lots of jigsaw puzzles

– Big data integration = big, messy puzzles
– E.g., missing, duplicate, damaged pieces

SLIDE 19

♦ Motivation
♦ Record linkage
♦ Data fusion
♦ Emerging topics

SLIDE 20

♦ Volume: dealing with billions of records

– Map-reduce based record linkage [VCL10, KTR12]
– Adaptive record blocking [DNS+12, MKB12, VN12]
– Blocking in heterogeneous data spaces [PIP+12, PKP+13]

♦ Velocity

– Incremental record linkage [WGM10, WGM13, GDS14]

SLIDE 21

♦ Variety

– Matching structured and unstructured data [KGA+11, KTT+12]
– Matching Web tables and catalogs [LSC10]

♦ Veracity

– Linking temporal records [LDM+11]
– Using crowdsourcing oracle [WLK+13, VBD14, FSS16]

SLIDE 22

♦ How many Wei Wang’s are in DBLP, with which publications?

SLIDE 23

♦ Traditional record linkage

– Links records of an entity from multiple sources at a point in time

♦ Record linkage in Long Data

– Links records of an entity over a long time period
– Attribute values of an entity evolve over time
– Different entities across time may have the same attribute value

Adam Smith (1723-1790)
Adam Smith (1965-)

SLIDE 24

Who authored what?

Timeline: 1991, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011

r1: Xin Dong, R. Polytechnic Institute
r2: Xin Dong, University of Washington
r3: Xin Dong, University of Washington
r4: Xin Luna Dong, University of Washington
r5: Xin Luna Dong, AT&T Labs-Research
r6: Xin Luna Dong, AT&T Labs-Research
r7: Dong Xin, University of Illinois
r8: Dong Xin, University of Illinois
r9: Dong Xin, Microsoft Research
r10: Dong Xin, University of Illinois
r11: Dong Xin, Microsoft Research
r12: Dong Xin, Microsoft Research

SLIDE 25

Ground truth (figure: grouping of records r1 to r12 on the 1991-2011 timeline)

SLIDE 26

Traditional solution 1: high value consistency (figure: grouping of records r1 to r12 on the 1991-2011 timeline)

SLIDE 27

Traditional solution 2: using similar names (figure: grouping of records r1 to r12 on the 1991-2011 timeline)

SLIDE 28

♦ Smooth transition in one attribute, despite evolution of another

ID   Name           Affiliation               Co-authors         Year
r1   Xin Dong       R. Polytechnic Institute  Wozny              1991
r2   Xin Dong       University of Washington  Halevy, Tatarinov  2004
r7   Dong Xin       University of Illinois    Han, Wah           2004
r3   Xin Dong       University of Washington  Halevy             2005
r4   Xin Luna Dong  University of Washington  Halevy, Yu         2007
r8   Dong Xin       University of Illinois    Wah                2007
r9   Dong Xin       Microsoft Research        Wu, Han            2008
r10  Dong Xin       University of Illinois    Ling, He           2009
r11  Dong Xin       Microsoft Research        Chaudhuri, Ganti   2009
r5   Xin Luna Dong  AT&T Labs-Research        Das Sarma, Halevy  2009
r6   Xin Luna Dong  AT&T Labs-Research        Naumann            2010
r12  Dong Xin       Microsoft Research        He                 2011

SLIDE 29

♦ Erratic changes in an attribute value are quite unlikely

SLIDE 30

♦ Typically, there is continuity of history, i.e., no big gaps in time

SLIDE 31

♦ High penalty for value disagreement over a short time period

SLIDE 32

♦ Lower penalty for value disagreement over a long time period

SLIDE 33

♦ High reward for value agreement across a small time gap

SLIDE 34

♦ Lower reward for value agreement across a big time gap
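The four intuitions on slides 31 to 34 can be captured by a score whose magnitude shrinks as the time gap grows. The sketch below is only an illustration under stated assumptions: the exponential shape, the `half_life` parameter, and the function name are hypothetical and are not taken from the cited work [LDM+11].

```python
def temporal_score(values_agree, gap_years, half_life=3.0):
    """Evidence that two records describe the same entity, for one attribute.

    Assumption: the weight of both agreement and disagreement halves every
    half_life years; the deck does not specify a functional form.
    Positive scores reward agreement, negative scores penalize disagreement.
    """
    weight = 0.5 ** (gap_years / half_life)
    return weight if values_agree else -weight

# Same affiliation one year apart (e.g., r2 and r3): strong reward.
print(temporal_score(True, 1))
# Different affiliation thirteen years apart (e.g., r1 and r2): weak penalty.
print(temporal_score(False, 13))
```

With this shape, agreement across a small gap scores higher than agreement across a big gap, and disagreement over a short period is penalized more heavily than disagreement over a long one.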

SLIDE 35

♦ Consider records in time order for clustering

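Processing records in time order can be sketched as a single greedy pass that compares each record only with the most recent record of every existing cluster. This is a simplified illustration, not the method of [LDM+11]: the exact-match test on name or affiliation, the `max_gap` threshold, and the function name are all assumptions. The records are the ones from the table on slide 28, with co-authors omitted.

```python
RECORDS = [
    {"id": "r1", "name": "Xin Dong", "affiliation": "R. Polytechnic Institute", "year": 1991},
    {"id": "r2", "name": "Xin Dong", "affiliation": "University of Washington", "year": 2004},
    {"id": "r7", "name": "Dong Xin", "affiliation": "University of Illinois", "year": 2004},
    {"id": "r3", "name": "Xin Dong", "affiliation": "University of Washington", "year": 2005},
    {"id": "r4", "name": "Xin Luna Dong", "affiliation": "University of Washington", "year": 2007},
    {"id": "r8", "name": "Dong Xin", "affiliation": "University of Illinois", "year": 2007},
    {"id": "r9", "name": "Dong Xin", "affiliation": "Microsoft Research", "year": 2008},
    {"id": "r10", "name": "Dong Xin", "affiliation": "University of Illinois", "year": 2009},
    {"id": "r11", "name": "Dong Xin", "affiliation": "Microsoft Research", "year": 2009},
    {"id": "r5", "name": "Xin Luna Dong", "affiliation": "AT&T Labs-Research", "year": 2009},
    {"id": "r6", "name": "Xin Luna Dong", "affiliation": "AT&T Labs-Research", "year": 2010},
    {"id": "r12", "name": "Dong Xin", "affiliation": "Microsoft Research", "year": 2011},
]

def cluster_in_time_order(records, max_gap=5):
    """Greedy clustering: visit records by year, join the first compatible cluster.

    Assumption: a record is compatible with a cluster if it shares the name or
    the affiliation of the cluster's latest record and is at most max_gap years
    later; otherwise it starts a new cluster.
    """
    clusters = []
    for rec in sorted(records, key=lambda r: r["year"]):
        for cluster in clusters:
            last = cluster[-1]
            close_in_time = rec["year"] - last["year"] <= max_gap
            shares_value = (rec["name"] == last["name"]
                            or rec["affiliation"] == last["affiliation"])
            if close_in_time and shares_value:
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

for cluster in cluster_in_time_order(RECORDS):
    print([rec["id"] for rec in cluster])
```

On these twelve records the pass produces three clusters: r1 alone (the 13-year gap blocks a link to r2 despite the identical name), r2 to r6 (the shared affiliation links r3 to r4 across the name change), and r7 to r12.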

SLIDE 36

♦ Motivation
♦ Record linkage
♦ Data fusion
♦ Emerging topics

SLIDE 37

♦ Veracity

– Using source trustworthiness [YHY08, GAM+10, PR11, YT11, GSH11, PR13]
– Combining source accuracy and copy detection [DBS09a, QAH+13]
– Multiple truth values [ZRG+12]
– Erroneous numeric data [ZH12]
– Experimental comparison on deep web data [LDL+13]

SLIDE 38

♦ Volume

– Online data fusion [LDO+11]

♦ Velocity

– Truth discovery for dynamic data [DBS09b, PRM+12]

♦ Variety

– Combining record linkage with data fusion [GDS+10]

SLIDE 39

♦ Supports difference of opinion, allows conflict resolution
♦ Works well for independent sources that have similar accuracy
♦ When sources have different accuracies

– Need to give more weight to votes by knowledgeable sources

♦ When sources copy from other sources

– Need to reduce the weight of votes by copiers
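The difference between plain voting and weighted voting can be shown in a few lines. This is a minimal sketch: the source names, claimed values, and weights are made up for illustration, and how the weights would be derived (from accuracy or copy detection) is left open here.

```python
from collections import defaultdict

def vote(claims, weights=None):
    """Return the value with the highest total vote weight.

    claims: dict mapping source name -> claimed value.
    weights: optional dict mapping source name -> vote weight (default 1.0).
    """
    totals = defaultdict(float)
    for source, value in claims.items():
        totals[value] += 1.0 if weights is None else weights.get(source, 1.0)
    return max(totals, key=totals.get)

# Hypothetical claims: S3 is assumed to be a copier of S2.
claims = {"S1": "x", "S2": "y", "S3": "y"}
print(vote(claims))  # every vote counts equally
print(vote(claims, {"S1": 2.0, "S2": 1.0, "S3": 0.2}))  # down-weight the copier
```

With equal weights the two-vote value "y" wins; once the knowledgeable source S1 is up-weighted and the copier S3 is down-weighted, "x" wins instead.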

SLIDE 40

♦ Need to give more weight to knowledgeable sources
♦ Computing source accuracy: $A(S) = \mathrm{Avg}_{v_i(D) \in S} \Pr(v_i(D) \text{ true} \mid \Phi)$

– $v_i(D) \in S$: S provides value $v_i$ on data item D
– $\Phi$: observations on all data items by sources S
– $\Pr(v_i(D) \text{ true} \mid \Phi)$: probability of $v_i(D)$ being true

♦ How to compute $\Pr(v_i(D) \text{ true} \mid \Phi)$?

SLIDE 41

♦ Input: data item D, $val(D) = \{v_0, v_1, \ldots, v_n\}$, $\Phi$
♦ Output: $\Pr(v_i(D) \text{ true} \mid \Phi)$, for $i = 0, \ldots, n$ (sum = 1)
♦ Based on Bayes Rule, need $\Pr(\Phi \mid v_i(D) \text{ true})$

– Under independence, need $\Pr(\Phi_D(S) \mid v_i(D) \text{ true})$
– If S provides $v_i$: $\Pr(\Phi_D(S) \mid v_i(D) \text{ true}) = A(S)$
– If S does not: $\Pr(\Phi_D(S) \mid v_i(D) \text{ true}) = (1 - A(S))/n$

♦ Challenge:

– Inter-dependence between source accuracy and value probability?
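For a single data item, the Bayesian step on this slide can be sketched as follows. The code assumes independent sources and a uniform prior over values, and it normalizes only over the values that some source actually provides, which is a simplification of the full domain val(D). The function name, source names, and accuracy figures are hypothetical.

```python
def value_probabilities(claims, accuracy, n):
    """Pr(v true | observations) for each observed value of one data item.

    claims: dict source -> value provided for the item.
    accuracy: dict source -> A(S), strictly between 0 and 1.
    n: number of false values in the domain, as in (1 - A(S)) / n.
    """
    likelihood = {}
    for v in set(claims.values()):
        p = 1.0
        for source, claimed in claims.items():
            a = accuracy[source]
            # A(S) if the source provides v, (1 - A(S)) / n otherwise.
            p *= a if claimed == v else (1.0 - a) / n
        likelihood[v] = p
    total = sum(likelihood.values())
    return {v: p / total for v, p in likelihood.items()}

claims = {"S1": "x", "S2": "y", "S3": "y"}
accuracy = {"S1": 0.95, "S2": 0.5, "S3": 0.5}
print(value_probabilities(claims, accuracy, n=10))
```

In this made-up example the single accurate source outweighs two mediocre ones, so "x" ends up more probable than "y" even though "y" has more votes.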

SLIDE 42

Figure: cycle linking Source Accuracy, Source Vote Count, Value Vote Count, and Value Probability

Source accuracy: $A(S) = \mathrm{Avg}_{v(D) \in S} \Pr(v(D) \mid \Phi)$

Source vote count: $A'(S) = \ln \frac{n A(S)}{1 - A(S)}$

Value vote count: $C(v(D)) = \sum_{S \in \bar{S}(v(D))} A'(S)$

Value probability: $\Pr(v(D) \mid \Phi) = \frac{e^{C(v(D))}}{\sum_{v' \in val(D)} e^{C(v'(D))}}$

♦ Continue until source accuracy converges
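The four quantities on this slide can be iterated in a loop until the source accuracies stop changing. The sketch below is an illustration under stated assumptions, not the authors' implementation: the initial accuracy of 0.8, the clamping of accuracies away from 0 and 1 (to keep the logarithm finite), the convergence threshold, and the sample observations are all hypothetical choices.

```python
import math

def truth_discovery(observations, n=10, init_accuracy=0.8, tol=1e-6, max_iter=100):
    """observations: dict item -> dict source -> value. Returns (accuracy, probs)."""
    sources = sorted({s for claims in observations.values() for s in claims})
    accuracy = {s: init_accuracy for s in sources}
    probs = {}
    for _ in range(max_iter):
        # Source vote count: A'(S) = ln(n * A(S) / (1 - A(S)))
        vote = {s: math.log(n * a / (1.0 - a)) for s, a in accuracy.items()}
        probs = {}
        for item, claims in observations.items():
            # Value vote count: C(v) = sum of A'(S) over sources providing v
            count = {}
            for s, v in claims.items():
                count[v] = count.get(v, 0.0) + vote[s]
            # Value probability: exp(C(v)) normalized over the observed values
            top = max(count.values())  # subtract the max for numerical stability
            weights = {v: math.exp(c - top) for v, c in count.items()}
            z = sum(weights.values())
            probs[item] = {v: w / z for v, w in weights.items()}
        # Source accuracy: average probability of the values the source provides
        updated = {}
        for s in sources:
            ps = [probs[item][claims[s]]
                  for item, claims in observations.items() if s in claims]
            updated[s] = min(max(sum(ps) / len(ps), 0.001), 0.999)  # assumed clamp
        change = max(abs(updated[s] - accuracy[s]) for s in sources)
        accuracy = updated
        if change < tol:
            break
    return accuracy, probs

observations = {
    "d1": {"S1": "a", "S2": "a", "S3": "a"},
    "d2": {"S1": "a", "S2": "a", "S3": "b"},
    "d3": {"S1": "a", "S2": "a", "S3": "b"},
    "d4": {"S1": "a", "S2": "a", "S3": "b"},
}
accuracy, probs = truth_discovery(observations)
print(accuracy)
print(probs["d2"])
```

In this toy input S1 and S2 agree everywhere while S3 disagrees on three of four items, so the loop drives S3's estimated accuracy well below that of S1 and S2.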

SLIDE 43

Source 1 on USA Presidents:
1st: George Washington
2nd: John Adams
3rd: Thomas Jefferson
4th: James Madison
…
41st: George H.W. Bush
42nd: William J. Clinton
43rd: George W. Bush
44th: Barack Obama

Source 2 on USA Presidents:
1st: George Washington
2nd: John Adams
3rd: Thomas Jefferson
4th: James Madison
…
41st: George H.W. Bush
42nd: William J. Clinton
43rd: George W. Bush
44th: Barack Obama

Are Source 1 and Source 2 dependent?

Not necessarily

SLIDE 44

Source 1 on USA Presidents:
1st: George Washington
2nd: Benjamin Franklin
3rd: John F. Kennedy
4th: Abraham Lincoln
…
41st: George W. Bush
42nd: Hillary Clinton
43rd: Dick Cheney
44th: Barack Obama

Source 2 on USA Presidents:
1st: George Washington
2nd: Benjamin Franklin
3rd: John F. Kennedy
4th: Abraham Lincoln
…
41st: George W. Bush
42nd: Hillary Clinton
43rd: Dick Cheney
44th: John McCain

Are Source 1 and Source 2 dependent?

Very likely

SLIDE 45

Figure: data items shared by S1 and S2 (S1 ∩ S2), partitioned into Different Values ($O_d$) and Same Values, with Same Values either TRUE ($O_t$) or FALSE ($O_f$)

SLIDE 46

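Figure: data items shared by S1 and S2 (S1 ∩ S2), partitioned into Different Values ($O_d$) and Same Values, with Same Values either TRUE ($O_t$) or FALSE ($O_f$)

Pr     Independence                                  Copying
$O_t$  $A^2$                                     <   $A \cdot c + A^2 (1 - c)$
$O_f$  $\frac{(1 - A)^2}{n}$                     <   $(1 - A) \cdot c + \frac{(1 - A)^2}{n} (1 - c)$
$O_d$  $P_d = 1 - A^2 - \frac{(1 - A)^2}{n}$     >   $P_d (1 - c)$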
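The table of observation probabilities can be turned into a small Bayesian comparison of the two hypotheses. The sketch below is hedged: reading c as the probability that a copier copies a given value, giving both sources the same accuracy A, the 0.5 prior on copying, and the default parameter values are assumptions for illustration. The counts come from the eight positions listed on the two presidents slides (elided positions are ignored).

```python
import math

def observation_probs(A, n, c):
    """Per-item probabilities of (Ot, Of, Od) under independence and copying."""
    p_t = A * A
    p_f = (1.0 - A) ** 2 / n
    p_d = 1.0 - p_t - p_f
    independent = (p_t, p_f, p_d)
    copying = (A * c + p_t * (1.0 - c),
               (1.0 - A) * c + p_f * (1.0 - c),
               p_d * (1.0 - c))
    return independent, copying

def copy_posterior(k_t, k_f, k_d, A=0.8, n=10, c=0.8, prior=0.5):
    """Posterior probability of copying given counts of Ot, Of, Od items.

    Assumes 0 < A < 1, 0 < c < 1, and 0 < prior < 1 so every log is finite.
    """
    independent, copying = observation_probs(A, n, c)
    counts = (k_t, k_f, k_d)
    log_ind = math.log(1.0 - prior) + sum(
        k * math.log(p) for k, p in zip(counts, independent))
    log_cop = math.log(prior) + sum(
        k * math.log(p) for k, p in zip(counts, copying))
    diff = log_ind - log_cop
    if diff > 700:  # avoid overflow in exp; copying is essentially ruled out
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

# Eight shared true values (the accurate presidents lists).
print(copy_posterior(k_t=8, k_f=0, k_d=0))
# One shared true value, six shared false values, one differing value.
print(copy_posterior(k_t=1, k_f=6, k_d=1))
```

Sharing true values moves the posterior only modestly, because accurate independent sources also agree on the truth, whereas sharing several false values pushes it close to 1, matching the "Not necessarily" versus "Very likely" answers.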

SLIDE 47

♦ Typically converges when #objs >> #srcs

Figure: iterative cycle of three steps (Step 1, Step 2, Step 3) over Truth Discovery, Accuracy Computation, and Copy Detection

SLIDE 48

♦ Motivation
♦ Record linkage
♦ Data fusion
♦ Emerging topics

SLIDE 49

♦ How to select sources before integration to balance gain, cost?

Big Data Integration Source Selection
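One simple way to think about balancing gain against cost is a greedy loop that keeps adding the source with the best marginal profit until no source pays for itself. This is only a sketch and not the approach of any cited paper: the greedy strategy, the coverage-based gain function, the cost figures, and all names are hypothetical.

```python
def select_sources(sources, gain, cost):
    """Greedily add the source with the best marginal profit while it is positive.

    sources: iterable of source names.
    gain: function mapping a frozenset of sources to a quality gain (assumed).
    cost: dict source -> integration cost (assumed).
    """
    selected = frozenset()
    remaining = set(sources)
    while remaining:
        def marginal(s):
            return gain(selected | {s}) - gain(selected) - cost[s]
        best = max(sorted(remaining), key=marginal)
        if marginal(best) <= 0:
            break  # no remaining source is worth its cost
        selected = selected | {best}
        remaining.remove(best)
    return selected

# Hypothetical example: gain is the number of distinct items covered.
coverage = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {4}}
cost = {"A": 1.0, "B": 0.5, "C": 0.5}

def coverage_gain(chosen):
    covered = set()
    for s in chosen:
        covered |= coverage[s]
    return len(covered)

print(sorted(select_sources(coverage, coverage_gain, cost)))
```

Here source C is never selected: everything it covers is already provided by A, so its marginal gain does not justify its cost.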

SLIDE 50

♦ Improving progressive quality of linkage using an oracle

SLIDE 51

SLIDE 52

Data.gov

SLIDE 53

SLIDE 54

♦ Big data integration is an important area of research

– Knowledge bases, linked data, geo-spatial fusion, scientific data

♦ Much interesting work has been done in this area

– Challenges due to volume, velocity, variety, veracity

♦ A lot more research needs to be done!
