The Best of Both Worlds: Combining Hand-Tuned and Word-Embedding-Based Similarity Measures for Entity Resolution Xiao Chen, Gabriel Campero Durand, Roman Zoun, David Broneske, Yang Li, Gunter Saake xiao.chen@ovgu.de Otto-von-Guericke-University


SLIDE 1

The Best of Both Worlds: Combining Hand-Tuned and Word-Embedding-Based Similarity Measures for Entity Resolution

Xiao Chen, Gabriel Campero Durand, Roman Zoun, David Broneske, Yang Li, Gunter Saake
xiao.chen@ovgu.de
Otto-von-Guericke-University of Magdeburg
BTW’19, Rostock, March 7th, 2019

SLIDE 3

Entity Resolution (ER)

Real world vs. Digital world

Definition: Identifying records that refer to the same entity

[Figure: digital-world records and the real-world entities they refer to]

Outline: Introduction, Motivation, Hybrid Similarity Calculation, Evaluation, Conclusion & Future Work

SLIDE 4

Entity Resolution (ER)

Example: person records from two sources (Hospital, Citizen’s office)

Given-name | Surname | city | Postcode | Age | Phone-number | Sex
starab | Kuaririo | brisbane | 1402 | 25 | 03 2867 8172 | f
sarah | Guarino | brisbane | 1402 | 26 | 03 2897 8172 | m

SLIDE 5

Entity Resolution (ER)

Example: product records from two sources (Google, Amazon)

Name | Description | Manufacturer | Price
world book encyclopedia 2006 | the world book encyclopedia 2006 is a truly student-friendly cd reference resource. it's been … | topics entertainment | 19.99
world book 2006 | overview with over 87 years of experience and a global reputation for unsurpassed excellence world book 2006 is firmly established as the premier reference source for ... | (empty) | 17.9

SLIDE 6

Entity Resolution (ER)

Example: bibliography records from two sources (ACM, DBLP)

ID | Title | Author | Venue | Year
conf/sigmod/GrossmanHQ95 | PTool: A Light Weight Persistent Object Manager | David Hanley, Robert L. Grossman, Xiao Qin | SIGMOD Conference | 1995
223901 | PTool: a light weight persistent object manager | R. L. Grossman, D. Hanley, X. Qin | International Conference on Management of Data | 1995

SLIDE 9

Basic Steps of Pair-Wise ER

Input data: records A, B, C, D, E

  • Pair-wise comparison: similarity scores are computed for every record pair:
    (A, B); (A, C); (A, D); (A, E); (B, C); (B, D); (B, E); (C, D); (C, E); (D, E)
  • Classification: each scored pair, e.g. ((A, B), score), ((C, D), score), ((C, E), score), ((D, E), score), …, is judged: Match/Non-match?
  • Clerical review

Results: Matches, Non-matches, Potential matches
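
The steps above can be sketched in a few lines of Python. This is only an illustration: the function names and the two thresholds are hypothetical, and the evaluation in this deck uses learned classifiers rather than fixed thresholds.

```python
from itertools import combinations

def candidate_pairs(records):
    """Enumerate every unordered pair of records (no blocking)."""
    return list(combinations(records, 2))

def classify(score, lower=0.5, upper=0.8):
    """Map a similarity score to one of the three result classes.

    The thresholds are illustrative assumptions, not values from the slides.
    """
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "potential match"  # would be passed on to clerical review

records = ["A", "B", "C", "D", "E"]
pairs = candidate_pairs(records)
print(len(pairs), pairs[0], pairs[-1])
```

For five records this enumerates the ten pairs listed on the slide, from (A, B) to (D, E).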

SLIDE 14

Three Groups of Attributes

Numerical attributes (NA):

  • Do not include numerical strings

Non-semantically related attributes (NRA):

  • Often relatively short strings (including numerical strings)
  • Without semantics
  • Possible reasons for differing values: typos, formats

Semantically related attributes (SRA):

  • Often relatively long strings or sentences
  • With semantics
  • Possible reasons for differing values: different expressions, different names

DBLP-ACM bibliography data:

Title | Author | Venue | Year
PTool: A Light Weight Persistent Object Manager | David Hanley, Robert L. Grossman, Xiao Qin | SIGMOD Conference | 1995
PTool: a light weight persistent object manager | R. L. Grossman, D. Hanley, X. Qin | International Conference on Management of Data | 1995

Amazon-Google product data:

Name | Description | Manufacturer | Price
world book encyclopedia 2006 | the world book encyclopedia 2006 is a truly student-friendly cd reference resource. it's been … | topics entertainment | 19.99
world book 2006 | overview with over 87 years of experience and a global reputation for unsurpassed excellence world book 2006 is firmly established as the premier reference source for students parents teachers and librarians... | (empty) | 17.9

Persons:

Given-name | Surname | city | Postcode | Age | Phone-number | Sex
starab | Kuaririo | brisbane | 1402 | 25 | 03 2867 8172 | f
sarah | Guarino | brisbane | 1402 | 26 | 03 2897 8172 | m

SLIDE 18

Approaches to Calculate Similarities

Pair-wise comparison yields similarity scores for the pairs (A, B); (A, C); (A, D); (A, E); (B, C); (B, D); (B, E); (C, D); (C, E); (D, E).

Traditional approaches:

  • Syntax-based
  • Without considering semantics
  • Require domain experts to select suitable similarity measures

➢ Limited accuracy for SRAs

Recently:

  • Word-embedding-based
  • Considering semantics
  • Applicable to all kinds of data

➢ Negative effects on efficiency
➢ Possibly low accuracy for NAs and NRAs

SLIDE 20

Problems Using A Single Approach

▪ No one-size-fits-all solution

One dataset contains more than one type of attribute:

  • Non-semantically related attributes (NRA)
  • Semantically related attributes (SRA)
  • Numerical attributes (NA)

➢ Hybrid approach to calculate similarity scores

Example datasets: DBLP-ACM bibliography data, Amazon-Google product data, Persons

SLIDE 22

Hybrid Approach

Non-semantically related attributes (NRA):

  • Relatively short strings (including numerical strings)
  • Without semantics

Numerical attributes (NA):

Traditional approaches (for NRAs and NAs):

  • Syntax-based
  • Without considering semantics
  • Choosing suitable functions

Semantically related attributes (SRA):

  • Relatively long strings
  • With semantics

Word-embedding-based (for SRAs):

  • Considering semantics
  • Cosine similarity on transformed vectors

SLIDE 25

Word Embedding Approach for SRAs

Vector for one word:

  • FastText model

Vector for one attribute:

Similarity scores calculated on each attribute vector:

  • Cosine similarity
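
A minimal sketch of the word, attribute, and similarity steps follows. The slides do not preserve how the attribute vector is built, so averaging the word vectors is an assumption here, and `TOY_VECTORS` is a hypothetical three-dimensional stand-in for a trained FastText model.

```python
import math

# Hypothetical toy word vectors; a real setup would look these up in a FastText model.
TOY_VECTORS = {
    "world": [0.9, 0.1, 0.0],
    "book": [0.8, 0.2, 0.1],
    "encyclopedia": [0.7, 0.3, 0.1],
    "train": [0.0, 0.9, 0.4],
}

def attribute_vector(text, vectors, dim=3):
    """Average the vectors of all known words in an attribute value (assumed aggregation)."""
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        return [0.0] * dim  # no known word: fall back to the zero vector
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

a = attribute_vector("world book encyclopedia 2006", TOY_VECTORS)
b = attribute_vector("world book 2006", TOY_VECTORS)
print(cosine_similarity(a, b))
```

Unlike this toy lookup, which simply skips unknown tokens such as "2006", FastText composes vectors from character n-grams and can therefore also embed words it has not seen during training.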

SLIDE 27

Evaluation: Setup

Three datasets:

Datasets | #Pairs (DS1 & DS2) | #Matches
Persons | 551250 (1050 & 1050) | 96
DBLP - ACM | 6001104 (2616 & 2294) | 2224
Amazon - Google | 4400264 (1364 & 3226) | 1300

Attribute counts per group: #SRAs #NRAs #NAs 2 6 5 2 2 3 1
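
The pair counts can be checked against the dataset sizes with a short calculation. The figures below are copied from the table; the helper names are illustrative. The DBLP - ACM and Amazon - Google counts equal the full product |DS1| × |DS2|, while the reported Persons count equals half of 1050 × 1050; the slides do not say why.

```python
# Dataset sizes (DS1, DS2), reported pair counts and match counts from the setup table.
DATASETS = {
    "Persons": {"sizes": (1050, 1050), "pairs": 551250, "matches": 96},
    "DBLP - ACM": {"sizes": (2616, 2294), "pairs": 6001104, "matches": 2224},
    "Amazon - Google": {"sizes": (1364, 3226), "pairs": 4400264, "matches": 1300},
}

def cartesian_size(n_left, n_right):
    """Number of pairs in the full Cartesian product of two datasets."""
    return n_left * n_right

def match_ratio(matches, pairs):
    """Share of true matches among all compared pairs (class imbalance)."""
    return matches / pairs

for name, d in DATASETS.items():
    full = cartesian_size(*d["sizes"])
    ratio = match_ratio(d["matches"], d["pairs"])
    print(f"{name}: |DS1 x DS2| = {full}, reported = {d['pairs']}, match ratio = {ratio:.6f}")
```

In all three datasets true matches make up well under 0.1% of the compared pairs, so the classification problem is strongly imbalanced.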

SLIDE 28

Evaluation: Setup

Approaches for similarity calculations:

  • Traditional similarity functions only:
      • Jaro-Winkler for SRAs and NRAs
      • Euclidean distance for NAs
  • Word embedding and cosine similarity based method only:
      • Word embedding + cosine similarity for all SRAs, NRAs and NAs
  • Hybrid:
      • Jaro-Winkler for NRAs
      • Euclidean distance for NAs
      • Word embedding + cosine similarity for SRAs
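
For reference, the two traditional measures named above can be written in plain Python as follows. This is a textbook-style sketch, not the implementation used in the evaluation; in particular, the slides do not say how the Euclidean distance is turned into a similarity score, so only the distance itself is shown.

```python
import math

def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)  # how far apart matching characters may be
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order in the two strings.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0

def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings sharing a common prefix (up to 4 characters)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)

def euclidean_distance(u, v):
    """Euclidean distance between two numeric vectors of equal length."""
    return math.dist(u, v)

print(jaro_winkler("starab", "sarah"))
print(euclidean_distance([39.99], [29.84]))
```

Jaro-Winkler rewards a shared prefix, which suits short strings with typos such as the given names in the Persons data.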

SLIDE 30

Evaluation: Setup

Classification approach: learning-based classification

  • XGBoost
  • Random forest
  • K-Nearest neighbor

▪ Training & test data:

  • Took all pairs of the Cartesian product;
  • For training, 66% of matches & 66% of non-matches;
  • For testing, the remaining 34% of both.
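
The split described above is a stratified split: matches and non-matches are divided separately so both sets keep the same class ratio. A minimal sketch, assuming a random shuffle within each class (the slides do not say how the 66% were chosen); the function name and seed are illustrative:

```python
import random

def stratified_split(labels, train_fraction=0.66, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test

# Hypothetical toy labels: 100 matches (1) among 1000 pairs.
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting per class matters here because matches are rare; a plain random split could leave very few matches in the test set.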

SLIDE 33

Evaluation: Results

Dataset | Combination | XGBoost | RF | KNN
Persons | Traditional | 100 | 100 | 88.46
Persons | WordEmbedding | 100 | 100 | 100
Persons | Hybrid | 100 | 100 | 58.54
DBLP - ACM | Traditional | 97.04 | 97.7 | 95.17
DBLP - ACM | WordEmbedding | 92.56 | 94.82 | 93.94
DBLP - ACM | Hybrid | 93.69 | 94.28 | 89.31
Amazon - Google | Traditional | 20.19 | 25.35 | 21.11
Amazon - Google | WordEmbedding | 19.10 | 31.09 | 24.1
Amazon - Google | Hybrid | 29.72 | 38.32 | 19.78

Persons:

  • Best: word embedding
  • KNN F-measures

DBLP - ACM bibliography:

  • Best: traditional approach
  • “Title” should belong to NRA

Amazon - Google product:

  • Word embedding outperforms traditional for RF and KNN, and is comparable for XGBoost
  • Hybrid approach is the best for XGBoost and RF
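
Assuming the scores in the results table are F-measures in percent, as the "KNN F-measures" note suggests, the metric can be computed from the confusion counts as below. The balanced F1 variant and the example counts are assumptions for illustration.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall (F1), in [0, 1]."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 8 matches found correctly, 2 false alarms, 4 matches missed.
print(round(100 * f_measure(8, 2, 4), 2))
```

Because non-matches vastly outnumber matches in all three datasets, the F-measure is more informative than plain accuracy: it ignores the large number of correctly rejected non-matching pairs.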

SLIDE 34

Evaluation: Results

A true matching example of a product pair:

Amazon: train sim modeler design studio, with train sim modeler you can create 3d traincars boxcars and engines along with your own custom scenery! create train station stores hills and trees and more scenery set up a virtual cab so you can see from the train driver's view you'll have your own personal railroad cars running the rails in no time!,abacus,39.99

Google: train sim modeler, microsoft train simulator brings the most realistic virtual train experience to the pc. already ms train simulator is the number one selling simulator in europe. and by all indications microsoft train simulator (ts) is a bestseller since it was ..., ,29.84

Similarity scores per attribute:

Combination | Name | Description | Manufacturer | Price
Traditional | 0.6611724 | 0.72039728 | 0.0 | 0.99997712
WordEmbedding | 0.8569186 | 0.87175614 | 0.0 | 0.03565185
Hybrid | 0.8569186 | 0.87175614 | 0.0 | 0.99997712
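
The hybrid row can be reproduced by selecting, per attribute, the score from the approach assigned to that attribute's group. The scores below are taken from the example; treating Manufacturer as an SRA is an assumption (its score is 0.0 under every approach, so the example does not reveal its group), and the function name is illustrative.

```python
# Per-attribute similarity scores from the product-pair example.
traditional = {"Name": 0.6611724, "Description": 0.72039728,
               "Manufacturer": 0.0, "Price": 0.99997712}
word_embedding = {"Name": 0.8569186, "Description": 0.87175614,
                  "Manufacturer": 0.0, "Price": 0.03565185}

# Attribute groups; the Manufacturer entry is an assumption.
groups = {"Name": "SRA", "Description": "SRA", "Manufacturer": "SRA", "Price": "NA"}

def hybrid_scores(traditional, word_embedding, groups):
    """Use word-embedding scores for SRAs and traditional scores for NRAs and NAs."""
    return {
        attr: word_embedding[attr] if group == "SRA" else traditional[attr]
        for attr, group in groups.items()
    }

print(hybrid_scores(traditional, word_embedding, groups))
```

The price is where the approaches diverge most: the word-embedding score is 0.03565185, while the traditional score is 0.99997712, and the hybrid vector keeps the latter.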

SLIDE 35

Evaluation: Results

Note on the results table: lower than published results

SLIDE 36

Evaluation: Results

Word embedding:

  • SRAs: predominantly better
  • NRAs: comparable or worse
  • NAs: not recommended

Hybrid approach:

  • Is able to provide better accuracy for data including different types of attributes

Classifier choices

SLIDE 37

Conclusion

Three groups of attributes:

  • SRAs, NRAs and NAs

Hybrid similarity calculations:

  • SRAs: word embedding + cosine similarity
  • NRAs and NAs: traditional similarity functions

Evaluation:

  • Word embedding performs predominantly better for SRAs, and worse for NAs;
  • The hybrid approach is useful for fixing the similarity scores that word embedding calculates wrongly for numerical attributes.

SLIDE 38

Future Work

  • Evaluate the hybrid approach when using blocking or thresholding techniques
  • Classification algorithms

SLIDE 39

Thank you!

Xiao Chen, Gabriel Campero Durand, Roman Zoun, David Broneske, Yang Li, Gunter Saake
xiao.chen@ovgu.de
Otto-von-Guericke-University of Magdeburg
BTW’19, Rostock, March 7th, 2019
