Entity Resolution: Glue for Middleware, Hector Garcia-Molina



slide-1
SLIDE 1

Entity Resolution: Glue for Middleware

Hector Garcia-Molina Stanford University

slide-2
SLIDE 2

2

Middleware

[Diagram: apps run on a middleware layer that connects System 1 ... System n]

slide-3
SLIDE 3

3

Middleware

[Diagram: apps run on a middleware layer that connects System 1 ... System n]

what matches what??

slide-4
SLIDE 4

Matching

  • Execution Level

– matching ports, calls, parameters, workflows ...

  • Data Level

– matching records, attributes, values ...

4

slide-5
SLIDE 5

Matching

  • Execution Level
  • Data Level

– Ontology
– Schema
– Instance

5

slide-6
SLIDE 6

Example: Stock Options

  • Ontology

6

[Diagram: ontology relating Stock Option, Stock Grant, Deferred Compensation, Strike Price, Date of Option, Taxable Income, Black–Scholes Valuation, Market Valuation]

slide-7
SLIDE 7

Example: Stock Options

  • Schema

7

Stock_Option(Date, Price, Shares, Holder, Plan, ...)
Option(Date, StrikePrice, Shares, Employee, Restrictions, ...)

slide-8
SLIDE 8

Example: Stock Options

  • Instance

8

Option 1: Name: Tom S. Smith; Adr: 123 Main St; Date: ; Shares:
Option 2: Name: Thomas Smith; Adr: 132 Main St; Date: ; Shares:

slide-9
SLIDE 9

This Talk

  • Instance Resolution

– a.k.a. Entity Resolution
– a.k.a. De-Duplication
– a.k.a. Record Linkage

9

slide-10
SLIDE 10

10

Applications

  • comparison shopping
  • mailing lists
  • classified ads
  • customer files
  • counter-terrorism

e1: N: a; A: b; CC#: c; Ph: e
e2: N: a; Exp: d; Ph: e

slide-11
SLIDE 11

Why is ER Challenging?

  • Huge data sets
  • No unique identifiers
  • Lots of uncertainty
  • Many ways to skin the cat

11

slide-12
SLIDE 12

Outline

  • Taxonomy
  • Swoosh Algorithm
  • Distributed ER
  • More on blocking

12

slide-13
SLIDE 13

Taxonomy: Pairwise vs Global

  • Decide if r, s match only by looking at r, s?
  • Or need to consider more (all) records?

13

Nm: Pat Smith; Ad: 123 Main St; Ph: (650) 555-1212
Nm: Patrick Smith; Ad: 132 Main St; Ph: (650) 555-1212
Nm: Patricia Smith; Ad: 123 Main St; Ph: (650) 777-1111
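A pairwise decision for the three records on this slide can be sketched as follows. The rule (same last name, one first name a prefix of the other, and a shared phone or address) and the ids r1..r3 are assumptions made for illustration, not the talk's actual match function.

```python
from itertools import combinations

# The three records from the slide; the ids r1..r3 are hypothetical labels.
records = {
    "r1": {"first": "Pat", "last": "Smith", "addr": "123 Main St", "phone": "(650) 555-1212"},
    "r2": {"first": "Patrick", "last": "Smith", "addr": "132 Main St", "phone": "(650) 555-1212"},
    "r3": {"first": "Patricia", "last": "Smith", "addr": "123 Main St", "phone": "(650) 777-1111"},
}


def names_compatible(a, b):
    """True if one first name is a prefix of the other (e.g. Pat / Patrick)."""
    a, b = a.lower(), b.lower()
    return a.startswith(b) or b.startswith(a)


def pairwise_match(r, s):
    """Assumed rule: looks only at the two records at hand."""
    if r["last"] != s["last"] or not names_compatible(r["first"], s["first"]):
        return False
    return r["phone"] == s["phone"] or r["addr"] == s["addr"]


# One decision per unordered pair of records.
decisions = {
    (a, b): pairwise_match(records[a], records[b])
    for a, b in combinations(records, 2)
}
```

Under this rule "Pat Smith" matches both of the other records, while those two do not match each other. A global approach would have to look at all three records to decide which entity Pat belongs to.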

slide-14
SLIDE 14

Taxonomy: Pairwise vs Global

  • Global matching complicates things a lot!

– e.g., change decision as new records arrive

14

Nm: Pat Smith; Ad: 123 Main St; Ph: (650) 555-1212
Nm: Patrick Smith; Ad: 132 Main St; Ph: (650) 555-1212
Nm: Patricia Smith; Ad: 123 Main St; Ph: (650) 777-1111

slide-15
SLIDE 15

Taxonomy: Outcome

  • Partition of records

– e.g., comparison shopping

  • Merged records

15

Nm: Pat Smith; Ad: 123 Main St; Ph: (650) 555-1212
Nm: Patricia Smith; Ad: 123 Main St; Ph: (650) 555-1212, (650) 777-1111; Hair: Black
Nm: Patricia Smith; Ad: 132 Main St; Ph: (650) 777-1111; Hair: Black

slide-16
SLIDE 16

16

Taxonomy: Outcome

  • Iterate after merging

Nm: Tom; Wk: IBM; Oc: laywer; Sal: 500K
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM
Nm: Thomas; Ad: 123 Maim; Oc: lawyer
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer; Sal: 500K

slide-17
SLIDE 17

Taxonomy: Record Reuse

  • One record related to multiple entities?

17

Nm: Pat Smith Sr.; Ph: (650) 555-1212
Ph: (650) 555-1212; Ad: 123 Main St
Nm: Pat Smith Jr.; Ph: (650) 555-1212

Nm: Pat Smith Sr.; Ph: (650) 555-1212; Ad: 123 Main St
Nm: Pat Smith Jr.; Ph: (650) 555-1212; Ad: 123 Main St

slide-18
SLIDE 18

Taxonomy: Record Reuse

  • Partitions

18

  • Merges

r s t r s t rs st

slide-19
SLIDE 19

Taxonomy: Record Reuse

  • Partitions

19

  • Merges

r s t r s t rs st

  • Record reuse complex and expensive!
slide-20
SLIDE 20

20

Taxonomy: Multiple Entity Types

person 1 person 2 Organization B Organization A brother member business member

slide-21
SLIDE 21

21

Taxonomy: Multiple Entity Types

p1 p2 p5 p7 a1 a2 a3 a5 a4 authors papers same??

slide-22
SLIDE 22

22

Taxonomy: Exact vs Approximate

[Diagram: products are split into cameras, CDs, books, ...; ER runs separately on each category, producing resolved cameras, resolved CDs, resolved books, ...]

slide-23
SLIDE 23

23

Taxonomy: Exact vs Approximate

[Diagram: terrorist records sorted by age; the record "B Cooper, 30" is matched only against ages 25-35]
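The age-window idea can be sketched with a sorted list and binary search. Everything below except "B Cooper, 30" and the 25-35 window is made up for illustration. The approach is approximate: a true duplicate whose recorded age falls outside the window is never compared.

```python
from bisect import bisect_left, bisect_right


def candidates_by_age(sorted_people, age, window=5):
    """Return only the records whose age lies in [age - window, age + window].

    sorted_people must already be sorted by age.
    """
    ages = [person[1] for person in sorted_people]
    lo = bisect_left(ages, age - window)
    hi = bisect_right(ages, age + window)
    return sorted_people[lo:hi]


# Hypothetical (name, age) records, sorted by age as on the slide.
people = sorted(
    [("A Adams", 22), ("B Cooper", 31), ("R Cooper", 35), ("C Doe", 47), ("B Couper", 26)],
    key=lambda person: person[1],
)

# Query record "B Cooper, 30": compare only against ages 25-35.
hits = candidates_by_age(people, 30)
```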

slide-24
SLIDE 24

Taxonomy: Other Variations

  • Managing uncertainty
  • Similarity computation

24

slide-25
SLIDE 25

Outline

  • Taxonomy
  • Swoosh Algorithm
  • Distributed ER
  • More on blocking

25

slide-26
SLIDE 26

Scenario

  • Pairwise matching
  • Record merging
  • No record reuse
  • Single entity type

26

slide-27
SLIDE 27

27

Model

Nm: Tom; Wk: IBM; Oc: laywer; Sal: 500K
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM
Nm: Thomas; Ad: 123 Maim; Oc: lawyer
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer; Sal: 500K

r4 = <r1, r2>, since M(r1, r2); then <r4, r3>, since M(r4, r3)

slide-28
SLIDE 28

28

Correct Answer

r1 r2 r3 r4 r5 r6 s9 s8 s7 s10

ER(R) = all derivable records, minus “dominated” records

slide-29
SLIDE 29

29

Question

  • What is the best sequence of match and merge calls that gives us the right answer?

slide-30
SLIDE 30

30

Brute Force Algorithm

  • Input R:

– r1 = [a:1, b:2]
– r2 = [a:1, c:4, e:5]
– r3 = [b:2, c:4, f:6]
– r4 = [a:7, e:5, f:6]

slide-31
SLIDE 31

31

Brute Force Algorithm

  • Input R:

– r1 = [a:1, b:2]
– r2 = [a:1, c:4, e:5]
– r3 = [b:2, c:4, f:6]
– r4 = [a:7, e:5, f:6]

  • Match all pairs:

– r1 = [a:1, b:2]
– r2 = [a:1, c:4, e:5]
– r3 = [b:2, c:4, f:6]
– r4 = [a:7, e:5, f:6]
– r12 = [a:1, b:2, c:4, e:5]

slide-32
SLIDE 32

32

Brute Force Algorithm

  • Match all pairs:

– r1 = [a:1, b:2]
– r2 = [a:1, c:4, e:5]
– r3 = [b:2, c:4, f:6]
– r4 = [a:7, e:5, f:6]
– r12 = [a:1, b:2, c:4, e:5]

  • Repeat:

– r1 = [a:1, b:2]
– r2 = [a:1, c:4, e:5]
– r3 = [b:2, c:4, f:6]
– r4 = [a:7, e:5, f:6]
– r12 = [a:1, b:2, c:4, e:5]
– r123 = [a:1, b:2, c:4, e:5, f:6]
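The brute-force procedure can be sketched as a fixpoint loop: compare all pairs, add every new merge, repeat until nothing new appears, then drop dominated records. The slides do not state the match rule; the rule below (same `a` value, or same `b` and same `c` values) is an assumption chosen because it reproduces r12 and r123 from the example. Merge is taken to be the union of attribute values, and a record counts as dominated when it is a proper subset of another.

```python
from itertools import combinations


def rec(**attrs):
    """A record is a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())


def vals(r, attr):
    return {v for a, v in r if a == attr}


def match(r, s):
    # Assumed rule: share an a-value, or share both a b-value and a c-value.
    share = lambda attr: bool(vals(r, attr) & vals(s, attr))
    return share("a") or (share("b") and share("c"))


def merge(r, s):
    # Assumed merge: keep every (attribute, value) pair from both records.
    return r | s


def brute_force_er(records):
    derived = set(records)
    while True:
        # Match all pairs; keep only merges that are not already known.
        new = {merge(r, s) for r, s in combinations(derived, 2) if match(r, s)} - derived
        if not new:
            break
        derived |= new
    # Remove dominated records (proper subsets of another derived record).
    return {r for r in derived if not any(r < s for s in derived)}


r1 = rec(a=1, b=2)
r2 = rec(a=1, c=4, e=5)
r3 = rec(b=2, c=4, f=6)
r4 = rec(a=7, e=5, f=6)
result = brute_force_er([r1, r2, r3, r4])
```

On the example input this leaves r123 and r4. Every round re-compares all pairs, including pairs already compared, which is the cost the two questions on the following slides target.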

slide-33
SLIDE 33

33

Question # 1

Brute Force Algorithm

  • Input R:

– r1 = [a:1, b:2]
– r2 = [a:1, c:4, e:5]
– r3 = [b:2, c:4, f:6]
– r4 = [a:7, e:5, f:6]

  • Match all pairs:

– r1 = [a:1, b:2]
– r2 = [a:1, c:4, e:5]
– r3 = [b:2, c:4, f:6]
– r4 = [a:7, e:5, f:6]
– r12 = [a:1, b:2, c:4, e:5]

Can we delete r1, r2?

slide-34
SLIDE 34

34

Question # 2

Brute Force Algorithm

  • Match all pairs:

– r1 = [a:1, b:2]
– r2 = [a:1, c:4, e:5]
– r3 = [b:2, c:4, f:6]
– r4 = [a:7, e:5, f:6]
– r12 = [a:1, b:2, c:4, e:5]

  • Repeat:

– r1 = [a:1, b:2]
– r2 = [a:1, c:4, e:5]
– r3 = [b:2, c:4, f:6]
– r4 = [a:7, e:5, f:6]
– r12 = [a:1, b:2, c:4, e:5]
– r123 = [a:1, b:2, c:4, e:5, f:6]

Can we avoid comparisons?

slide-35
SLIDE 35

35

ICAR Properties

  • Idempotence:

– M(r1, r1) = true
– <r1, r1> = r1

  • Commutativity:

– M(r1, r2) = M(r2, r1)
– <r1, r2> = <r2, r1>

  • Associativity:

– <r1, <r2, r3>> = <<r1, r2>, r3>

slide-36
SLIDE 36

36

More Properties

  • Representativity

– If <r1, r2> = r3, then for any r4 such that M(r1, r4) is true we also have M(r3, r4) = true.

r1 r2 r3 r4
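Whether a given match/merge pair satisfies the four properties can be spot-checked on a finite sample of records. The sketch below is only a test over the sample, not a proof. The match rule and the union merge are the same assumptions used to reproduce the brute-force example, and `violated_properties` is a hypothetical helper name.

```python
from itertools import product


def rec(**attrs):
    return frozenset(attrs.items())


def vals(r, attr):
    return {v for a, v in r if a == attr}


def match(r, s):
    # Assumed rule: share an a-value, or share both a b-value and a c-value.
    share = lambda attr: bool(vals(r, attr) & vals(s, attr))
    return share("a") or (share("b") and share("c"))


def merge(r, s):
    return r | s


def violated_properties(records, match, merge):
    """Return the names of the properties that fail on this sample."""
    bad = set()
    for r in records:
        if not match(r, r) or merge(r, r) != r:
            bad.add("idempotence")
    for r, s in product(records, repeat=2):
        if match(r, s) != match(s, r):
            bad.add("commutativity")
        elif match(r, s) and merge(r, s) != merge(s, r):
            bad.add("commutativity")
    for r, s, t in product(records, repeat=3):
        # Associativity, checked where both inner merges are allowed.
        if match(r, s) and match(s, t):
            if merge(merge(r, s), t) != merge(r, merge(s, t)):
                bad.add("associativity")
        # Representativity: whatever matched r must still match <r, s>.
        if match(r, s) and match(r, t) and not match(merge(r, s), t):
            bad.add("representativity")
    return bad


sample = [rec(a=1, b=2), rec(a=1, c=4, e=5), rec(b=2, c=4, f=6), rec(a=7, e=5, f=6)]
ok = violated_properties(sample, match, merge)
# A lossy merge that keeps only the first record is not commutative.
lossy = violated_properties(sample, match, lambda r, s: r)
```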

slide-37
SLIDE 37

37

ICAR Properties → Efficiency

  • Commutativity
  • Idempotence
  • Associativity
  • Representativity

  • Can discard records
  • ER result independent of processing order
slide-38
SLIDE 38

38

Swoosh Algorithms

  • Record Swoosh

– Merges records as soon as they match
– Optimal in terms of record comparisons

  • Feature Swoosh

– Remembers values seen for each feature
– Avoids redundant value comparisons
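The "merge as soon as they match" idea behind Record Swoosh can be sketched with two collections: records still to process, and records already known not to match each other. This is a minimal sketch under the ICAR assumptions, not the authors' implementation. The match rule and union merge are the assumptions used for the brute-force example, and the comparison counter is added only for illustration.

```python
def rec(**attrs):
    return frozenset(attrs.items())


def vals(r, attr):
    return {v for a, v in r if a == attr}


def match(r, s):
    # Assumed rule: share an a-value, or share both a b-value and a c-value.
    share = lambda attr: bool(vals(r, attr) & vals(s, attr))
    return share("a") or (share("b") and share("c"))


def merge(r, s):
    return r | s


def r_swoosh(records, match, merge):
    pending = list(records)  # records still to be processed
    resolved = []            # records that pairwise do not match
    comparisons = 0
    while pending:
        r = pending.pop()
        buddy = None
        for s in resolved:
            comparisons += 1
            if match(r, s):
                buddy = s
                break
        if buddy is None:
            resolved.append(r)
        else:
            # Merge immediately and discard both source records;
            # the merged record goes back to be compared again.
            resolved.remove(buddy)
            pending.append(merge(r, buddy))
    return resolved, comparisons


r1 = rec(a=1, b=2)
r2 = rec(a=1, c=4, e=5)
r3 = rec(b=2, c=4, f=6)
r4 = rec(a=7, e=5, f=6)
resolved, comparisons = r_swoosh([r1, r2, r3, r4], match, merge)
```

On the brute-force example this ends with r123 and r4 without ever keeping r1, r2, r3 or r12 around, which is what the ICAR properties permit.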
slide-39
SLIDE 39

39

Swoosh Performance

slide-40
SLIDE 40

40

If ICAR Properties Do Not Hold?

r1: [Joe Sr., 123 Main, DL:X]
r2: [Joe, 123 Main, Ph:123]
r3: [Joe Jr., 123 Main, DL:Y]
r12: [Joe Sr., 123 Main, Ph:123, DL:X]
r23: [Joe Jr., 123 Main, Ph:123, DL:Y]

slide-41
SLIDE 41

41

If ICAR Properties Do Not Hold?

r1: [Joe Sr., 123 Main, DL:X]
r2: [Joe, 123 Main, Ph:123]
r3: [Joe Jr., 123 Main, DL:Y]
r12: [Joe Sr., 123 Main, Ph:123, DL:X]
r23: [Joe Jr., 123 Main, Ph:123, DL:Y]

Full Answer: ER(R) = {r12, r23, r1, r2, r3}
Minus Dominated: ER(R) = {r12, r23}

slide-42
SLIDE 42

42

If ICAR Properties Do Not Hold?

r1: [Joe Sr., 123 Main, DL:X]
r2: [Joe, 123 Main, Ph:123]
r3: [Joe Jr., 123 Main, DL:Y]
r12: [Joe Sr., 123 Main, Ph:123, DL:X]
r23: [Joe Jr., 123 Main, Ph:123, DL:Y]

Full Answer: ER(R) = {r12, r23, r1, r2, r3}
Minus Dominated: ER(R) = {r12, r23}
R-Swoosh Yields: ER(R) = {r12, r3} or {r1, r23}
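The order dependence can be reproduced with a small sketch. The match rule here is an assumption: same address, no conflicting name suffix (Sr. vs Jr.), and no conflicting driver's license. It violates representativity, because r2 matches r3 but the merged r12 does not.

```python
def rec(suffix, dl=None, ph=None):
    return {"name": "Joe", "suffix": suffix, "addr": "123 Main", "dl": dl, "ph": ph}


def match(r, s):
    # Assumed rule: same address, and no conflict on suffix or license.
    if r["addr"] != s["addr"]:
        return False
    suffixes = {r["suffix"], s["suffix"]} - {None}
    licenses = {r["dl"], s["dl"]} - {None}
    return len(suffixes) <= 1 and len(licenses) <= 1


def merge(r, s):
    # Fill in missing fields; match() guarantees there is no conflict.
    return {key: r[key] or s[key] for key in r}


def r_swoosh(records):
    pending, resolved = list(records), []
    while pending:
        r = pending.pop(0)  # process in the given order
        buddy = next((s for s in resolved if match(r, s)), None)
        if buddy is None:
            resolved.append(r)
        else:
            resolved.remove(buddy)
            pending.append(merge(r, buddy))
    return resolved


def summary(result):
    return {(x["suffix"], x["dl"], x["ph"]) for x in result}


r1 = rec("Sr.", dl="X")
r2 = rec(None, ph="123")
r3 = rec("Jr.", dl="Y")

forward = summary(r_swoosh([r1, r2, r3]))   # r2 is absorbed by r1
backward = summary(r_swoosh([r3, r2, r1]))  # r2 is absorbed by r3
```

Processing r1 first gives {r12, r3}; processing r3 first gives {r1, r23}. Neither run produces the full answer {r12, r23}, since r2 is discarded after its first merge.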

slide-43
SLIDE 43

43

Swoosh Without ICAR Properties

slide-44
SLIDE 44

44

Distributed Swoosh

Processors: P1 P2 P3
Records: r1 r2 r3 r4 r5 r6 ...

slide-45
SLIDE 45

45

Distributed Swoosh

P1: r1 r3 r4 r6 ...
P2: r1 r2 r4 r5 ...
P3: r2 r3 r5 r6 ...
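The assignment on this slide is consistent with splitting the records round-robin into three groups and giving each processor one pair of groups, so that every pair of records meets on some processor. The sketch below shows that reading; the function name and the group-pair scheme are assumptions, and the processor numbering may differ from the slide.

```python
from itertools import combinations


def assign(records, n_groups=3):
    """Give each processor the union of one pair of record groups (n_groups >= 2)."""
    groups = [records[i::n_groups] for i in range(n_groups)]
    return [sorted(groups[i] + groups[j]) for i, j in combinations(range(n_groups), 2)]


records = ["r1", "r2", "r3", "r4", "r5", "r6"]
processors = assign(records)

# Every pair of records is co-located on at least one processor.
covered = all(
    any(a in p and b in p for p in processors)
    for a, b in combinations(records, 2)
)
```

Each record is stored on two of the three processors, which is the replication cost paid for never missing a comparison.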

slide-46
SLIDE 46

46

DSwoosh Performance

slide-47
SLIDE 47

Outline

  • Swoosh Algorithm
  • Distributed ER
  • More on blocking

47

slide-48
SLIDE 48

48

Iterative Blocking: Example

Record  Name          Addr (zip)  Email
r       John Doe      52139       jdoe@yahoo
s       John Doe      94305
t       J. Foe        94305       jdoe@yahoo
u       Bobbie Brown  12345       bob@gmail
v       Bobbie Brown  12345       bob@gmail

slide-49
SLIDE 49

49

Example

Record  Name          Addr (zip)  Email
r       John Doe      52139       jdoe@yahoo
s       John Doe      94305
t       J. Foe        94305       jdoe@yahoo
u       Bobbie Brown  12345       bob@gmail
v       Bobbie Brown  12345       bob@gmail

Iterative ER:

– r, s → <r, s> = (John Doe, {52139, 94305}, jdoe@yahoo)
– <r, s>, t → <r, s, t>

slide-50
SLIDE 50

50

Blocking

Record  Name          Addr (zip)  Email
r       John Doe      52139       jdoe@yahoo
s       John Doe      94305
t       J. Foe        94305       jdoe@yahoo
u       Bobbie Brown  12345       bob@gmail
v       Bobbie Brown  12345       bob@gmail

Criterion  Partition by        b-,1  b-,2  b-,3
SC1        zip code            r     s, t  u, v
SC2        1st char last name  r, s  t     u, v

slide-51
SLIDE 51

51

Blocking

Record  Name          Addr (zip)  Email
r       John Doe      52139       jdoe@yahoo
s       John Doe      94305
t       J. Foe        94305       jdoe@yahoo
u       Bobbie Brown  12345       bob@gmail
v       Bobbie Brown  12345       bob@gmail

Criterion  Partition by        b-,1  b-,2  b-,3
SC1        zip code            r     s, t  u, v
SC2        1st char last name  r, s  t     u, v

Will miss: < r, s, t >

slide-52
SLIDE 52

52

Blocking

Record  Name          Addr (zip)  Email
r       John Doe      52139       jdoe@yahoo
s       John Doe      94305
t       J. Foe        94305       jdoe@yahoo
u       Bobbie Brown  12345       bob@gmail
v       Bobbie Brown  12345       bob@gmail

Criterion  Partition by        b-,1  b-,2  b-,3
SC1        zip code            r     s, t  u, v
SC2        1st char last name  r, s  t     u, v

Solution: Propagate Matches < r, s >
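Propagating matches across blocks can be sketched as follows. This is a simplified illustration, not the Lego or Duplo algorithm: blocks are rebuilt from the current record set on every pass, so a merged record such as <r, s> lands in every block its values qualify for. The match rule (same name, or same email and same zip) is an assumption chosen to agree with the example.

```python
from collections import defaultdict


def rec(rid, name, zip_code, email=None):
    pairs = {("id", rid), ("name", name), ("zip", zip_code)}
    if email:
        pairs.add(("email", email))
    return frozenset(pairs)


def vals(r, attr):
    return {v for a, v in r if a == attr}


def match(r, s):
    # Assumed rule: same name, or same email and same zip.
    share = lambda attr: bool(vals(r, attr) & vals(s, attr))
    return share("name") or (share("email") and share("zip"))


def core_er(block):
    """Core ER on one block (merge as soon as matched; merge = union)."""
    pending, resolved = list(block), []
    while pending:
        r = pending.pop()
        buddy = next((s for s in resolved if match(r, s)), None)
        if buddy is None:
            resolved.append(r)
        else:
            resolved.remove(buddy)
            pending.append(r | buddy)
    return set(resolved)


def by_zip(r):  # SC1
    return vals(r, "zip")


def by_last_initial(r):  # SC2
    return {name.split()[-1][0] for name in vals(r, "name")}


def build_blocks(records, keys_of):
    blocks = defaultdict(set)
    for r in records:
        for key in keys_of(r):
            blocks[key].add(r)
    return blocks


def block_once(records, keys_of):
    """Plain blocking: resolve each block in isolation, no propagation."""
    out = set()
    for block in build_blocks(records, keys_of).values():
        out |= core_er(block)
    return out


def iterative_blocking(records, criteria):
    records = set(records)
    changed = True
    while changed:
        changed = False
        for keys_of in criteria:
            for block in build_blocks(records, keys_of).values():
                live = block & records  # skip records already merged away
                merged = core_er(live)
                if merged != live:
                    records = (records - live) | merged
                    changed = True
    return records


def ids(result):
    return {frozenset(vals(x, "id")) for x in result}


data = [
    rec("r", "John Doe", "52139", "jdoe@yahoo"),
    rec("s", "John Doe", "94305"),
    rec("t", "J. Foe", "94305", "jdoe@yahoo"),
    rec("u", "Bobbie Brown", "12345", "bob@gmail"),
    rec("v", "Bobbie Brown", "12345", "bob@gmail"),
]
final = ids(iterative_blocking(data, [by_zip, by_last_initial]))
zip_only = ids(block_once(data, by_zip))
name_only = ids(block_once(data, by_last_initial))
```

With either criterion alone, <r, s, t> is missed. Once <r, s> from the last-name blocks is carried into the zip blocks, it meets t in block 94305 and the full merge is found.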

slide-53
SLIDE 53

53

What We Have Done

  • Formal Model for Iterative Blocking

– based on generic “core” ER algorithm

  • Two Algorithms:

– Lego: in memory
– Duplo: disk based

  • Experiments
slide-54
SLIDE 54

54

Sample Results: Accuracy

slide-55
SLIDE 55

55

Sample Results: Run Time

slide-56
SLIDE 56

56

Accuracy vs Run Time

slide-57
SLIDE 57

57

Conclusion

  • ER is an old and important problem
  • Critical for gluing components
slide-58
SLIDE 58

58

Thanks.