Entity Matching for Semistructured Data in the Cloud


slide-1
SLIDE 1

Entity Matching for Semistructured Data in the Cloud

Marcus Paradies IBM F2CE Workshop December 1, 2011

1 / 18 Entity Matching for Semistructured Data in the Cloud Marcus Paradies

slide-2
SLIDE 2

Outline

1. Motivation
2. Entity Matching
3. MAXIM: Entity Matching in the Cloud
4. Summary

slide-3
SLIDE 3

Motivation

Enriching/Improving Wikipedia

References from Wikipedia article Hash join


slide-4
SLIDE 4

Motivation

Enriching/Improving Wikipedia

Lookup in the CiteSeer database


slide-5
SLIDE 5

Motivation

Enriching/Improving Wikipedia

Lookup in Google


slide-7
SLIDE 7

Motivation

Wikipedia in a nutshell

Characteristics

- 3.7 million articles (English Wikipedia database)
- Dataset size of about 30 GB of XML (without history)
- 3.6 million references
- References are categorized into books, journals, websites, etc.

Challenges

- Articles in Wikipedia are incomplete
- Articles in Wikipedia are inaccurate
- Articles in Wikipedia are subjective

slide-8
SLIDE 8

Motivation

Problem Statement

Definition

Given two datasets of records, R and S, a set of attributes $a_1, \ldots, a_n$, a set of similarity functions $sim_{a_1}, \ldots, sim_{a_n}$, and a similarity threshold $\tau$, the entity matching task between R and S is defined as finding and combining all pairs of records from R and S where

$$\sum_{i=1}^{n} sim_{a_i}(R.a_i, S.a_i) \geq \tau$$

Wikipedia data set

    {{Cite book
     | last = Mumford
     | first = David
     | authorlink = David Mumford
     | title = The Red Book of Varieties and Schemes
     | publisher = [[Springer]]
     | location = Berlin
     | date = 1999
     | page = 198
     | doi = 10.1007/b62130
     | isbn = 354063293X
    }}

CiteSeer data set

    <record id="6627383">
      <author>David Mumford</author>
      <title>The red book of Varieties and Schemes</title>
      <publisher>Springer</publisher>
      <year>1999</year>
      <doi>10.1007/b62130</doi>
    </record>
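The definition can be sketched in a few lines of Python. This is a minimal illustration, assuming the per-attribute scores are combined by summation and compared against the threshold. The token-level Jaccard similarity, the function names, and the threshold value are assumptions chosen for the example and do not come from the talk.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed, illustrative similarity function)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def is_match(r: dict, s: dict, sims: dict, tau: float) -> bool:
    """Sum sim_a(r[a], s[a]) over all attributes a and compare the total with tau."""
    total = sum(sim(r.get(attr, ""), s.get(attr, "")) for attr, sim in sims.items())
    return total >= tau


# One record per data set, reduced to the attributes both sides share.
wiki = {"title": "The Red Book of Varieties and Schemes",
        "author": "David Mumford", "year": "1999"}
cite = {"title": "The red book of Varieties and Schemes",
        "author": "David Mumford", "year": "1999"}

# One similarity function per attribute; the threshold 2.5 is arbitrary.
sims = {"title": jaccard, "author": jaccard, "year": jaccard}
print(is_match(wiki, cite, sims, tau=2.5))
```

Because each attribute has its own entry in `sims`, a different function (for example an exact comparison for `year`) can be swapped in without touching `is_match`.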

slide-10
SLIDE 10

Entity Matching


slide-12
SLIDE 12

Entity Matching

What is Entity Matching?

Challenges

- Entity Matching has quadratic runtime behavior
- Entity Matching has high CPU and memory demands
- The definition of “what is similar” is domain-dependent

slide-16
SLIDE 16

Entity Matching

Entity Matching Architecture

Figure: Data Source S1 and Data Source S2 feed a Blocking step, which produces the blocks b1, b2, b3, ..., bn; a Matching step processes the blocks and produces the Match Result R.

How can we improve the runtime of an EM task?

- Distributed Blocking
- Parallel Matching
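The blocking-then-matching architecture can be sketched as follows. This is a simplified, single-process illustration: the blocking key (publication year), the similarity test, and the sample records are assumptions made up for the example, not the blocking method used in MAXIM.

```python
from collections import defaultdict
from itertools import product


def block_by(records, key):
    """Group records into blocks b1..bn by a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return blocks


def match(s1, s2, key, similar):
    """Compare only records that share a block instead of all |S1| x |S2| pairs."""
    b1, b2 = block_by(s1, key), block_by(s2, key)
    result = []
    for k in b1.keys() & b2.keys():
        result.extend((r, s) for r, s in product(b1[k], b2[k]) if similar(r, s))
    return result


# Hypothetical sample records for the two data sources.
s1 = [{"title": "Hash join", "year": 1990}, {"title": "Red book", "year": 1999}]
s2 = [{"title": "hash join", "year": 1990}, {"title": "Other", "year": 1999}]

pairs = match(
    s1, s2,
    key=lambda r: r["year"],
    similar=lambda r, s: r["title"].lower() == s["title"].lower(),
)
print(pairs)
```

Since the loop body touches one block at a time and blocks do not depend on each other, each block could be handed to a separate worker, which is what makes parallel matching possible.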

slide-17
SLIDE 17

MAXIM: Entity Matching in the Cloud


slide-19
SLIDE 19

MAXIM: Entity Matching in the Cloud

Requirements and Approach

Requirements

- Efficient processing of semistructured data
- Scalability to large datasets
- Independence from specific similarity functions
- Ability to easily add new similarity functions

Main Idea

- Use MapReduce and ChuQL to process semistructured data
- Use search-based blocking to generate candidate pairs
- Apply similarity functions to candidate pairs within a block

slide-20
SLIDE 20

MAXIM: Entity Matching in the Cloud

ChuQL by example

Wordcount in ChuQL

    (: Count the titles recorded for each contributor (user name or IP address) :)
    mapreduce {
      input  { fn:collection("hdfs://wiki/") }
      rr     { for $rev in $hxml:in//revision
               return {"key": fn:data($rev//username|$rev//ip),
                       "val": $rev//title} }
      map    { $hxml:in }
      reduce { {"key": $hxml:in=>"key", "val": fn:count($hxml:in=>"val")} }
      rw     { <author name="{$hxml:in=>"key"}" count="{$hxml:in=>"val"}"/> }
      output { fn:put("hdfs://outputdir/") }
    }
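For readers unfamiliar with ChuQL, the phases of this job can be mimicked in plain Python. This is only a sketch of the data flow (record reader, map, shuffle, reduce, record writer) over an in-memory list. The sample revisions and all function names are invented for illustration, and nothing here uses Hadoop or ChuQL.

```python
from collections import defaultdict

# Hypothetical sample input standing in for the revisions read from HDFS.
revisions = [
    {"username": "alice", "title": "Hash join"},
    {"username": "bob", "title": "Hash join"},
    {"username": "alice", "title": "Sort-merge join"},
]


def record_reader(revs):
    # rr: emit one (contributor, title) pair per revision
    for rev in revs:
        yield rev["username"], rev["title"]


def map_phase(pairs):
    # map: identity, the pairs are passed through unchanged
    yield from pairs


def shuffle(pairs):
    # group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, val in pairs:
        groups[key].append(val)
    return groups


def reduce_phase(groups):
    # reduce: count the values collected for each key
    return {key: len(vals) for key, vals in groups.items()}


def record_writer(counts):
    # rw: render one <author/> element per key
    return [f'<author name="{k}" count="{v}"/>' for k, v in sorted(counts.items())]


output = record_writer(reduce_phase(shuffle(map_phase(record_reader(revisions)))))
print("\n".join(output))
```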

slide-21
SLIDE 21

MAXIM: Entity Matching in the Cloud

Architecture

Figure: Nodes 1 to N, each running a Hadoop Task Tracker and Data Node, a ChuQL Engine, and a Search Engine with a Full-text Index; all nodes share HDFS.

Architecture

- Hadoop cluster with up to 40 nodes
- Each node runs a search engine with an attached full-text index
- Each node runs an in-memory XQuery processor
- Semistructured data is partitioned and placed on HDFS

slide-25
SLIDE 25

MAXIM: Entity Matching in the Cloud

Processing Stages

Figure: data flow between HDFS and the search engines across the three stages.
- Preparation Stage (Extract Wikipedia references, Index CiteSeerX records): extract references, store references, transform into full-text index XML, build index
- Blocking Stage (Generate Semantic Block): retrieve references, generate query, get query response, store blocks
- Matching Stage (Record pair generation): verify candidate pairs, store record pairs

Three Stages

Stage 1: Preparation Stage

- Extracts references from Wikipedia
- Reads and transforms records from CiteSeerX
- Sends CiteSeerX data to the local full-text index

Stage 2: Blocking Stage

- Reads extracted references from HDFS
- Probes the full-text index to retrieve candidate publications
- Assigns candidate publications to block(s)

Stage 3: Matching Stage

- Reads blocks from HDFS
- Generates candidate pairs and applies similarity functions
- Stores matching pairs and their similarity

slide-31
SLIDE 31

MAXIM: Entity Matching in the Cloud

Stage 1: Preparation Stage

Extracting References

Extraction: cite templates are extracted from the Wikipedia articles.

    {{cite journal
     | author1 = Hansjörg Zeller
     | author2 = Jim Gray
     | title = An Adaptive Hash Join Algorithm for Multi-User Environments
     | journal = Proceedings of the 16th VLDB conference
     | year = 1990
     | pages = 186–197
    }}

Transformation: each template is transformed into an XML reference and stored on HDFS.

    <reference type="journal">
      <author1>Hansjörg Zeller</author1>
      <author2>Jim Gray</author2>
      <title>An Adaptive Hash Join Algorithm for Multi-User Environments</title>
      <journal>Proceedings of the 16th VLDB conference</journal>
      <year>1990</year>
      <pages>186–197</pages>
    </reference>

Indexing Publications

Read and Transformation: publication records are read and transformed into full-text index XML.

    <doc>
      <field name="id">10.1.1.49.2550</field>
      <field name="title">Selecting Tense, Aspect, and Connecting Words In Language Generation</field>
      <field name="author">Bonnie Dorr</field>
      <field name="description">Generating language ...</field>
    </doc>

Indexing: the documents are added to the Lucene index.
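The extraction and transformation steps for a single reference could look roughly like the following Python sketch. It assumes a flat template whose field values contain no `|` characters (wiki links such as `[[A|B]]` would break the naive split) and whose field names are valid XML element names. The function name `cite_to_xml` is hypothetical and not part of MAXIM.

```python
import re
import xml.etree.ElementTree as ET


def cite_to_xml(template: str) -> ET.Element:
    """Convert a {{cite <type> | key = value | ...}} template into a <reference> element."""
    text = template.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        raise ValueError("not a template")
    # Naive split on '|': sufficient for flat templates only.
    head, *fields = [part.strip() for part in text[2:-2].split("|")]
    kind = re.match(r"cite\s+(\w+)", head, flags=re.IGNORECASE)
    if kind is None:
        raise ValueError("not a cite template")
    ref = ET.Element("reference", {"type": kind.group(1).lower()})
    for field in fields:
        if "=" not in field:
            continue  # skip positional or empty parameters
        key, value = field.split("=", 1)
        ET.SubElement(ref, key.strip()).text = value.strip()
    return ref


template = (
    "{{cite journal | author1 = Hansjörg Zeller | author2 = Jim Gray "
    "| title = An Adaptive Hash Join Algorithm for Multi-User Environments "
    "| journal = Proceedings of the 16th VLDB conference "
    "| year = 1990 | pages = 186–197 }}"
)
ref = cite_to_xml(template)
print(ET.tostring(ref, encoding="unicode"))
```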

slide-33
SLIDE 33

MAXIM: Entity Matching in the Cloud

Stage 2: Blocking Stage

Block generation

- Each reference generates a set of candidate publications
- Each candidate publication is inserted into all blocks listed in the reference

Example

    <citation>
      <id>26334893</id>
      <cat>Search engine optimization</cat>
      <cat>Internet search algorithms</cat>
      <cat>Link analysis</cat>
      <ref>
        <type>journal</type>
        <author>Taher Haveliwala</author>
        <year>2003</year>
        <pages>56-70</pages>
        <title>The Second Eigenvalue of the Google Matrix</title>
        <journal>Stanford University Technical Report</journal>
      </ref>
    </citation>

    <citation>
      <id>26334893</id>
      <cat>Hashing</cat>
      <cat>Join algorithms</cat>
      <ref>
        <type>journal</type>
        <author>Hansjörg Zeller</author>
        <author>Jim Gray</author>
        <year>1990</year>
        <pages>186-197</pages>
        <title>An Adaptive Hash Join Algorithm for Multiuser Environments</title>
        <journal>Proceedings of the 16th VLDB conference</journal>
      </ref>
    </citation>

Figure: a query is sent to the search engine and its full-text index (send query); the returned candidate publications (send result), e.g. 10.0.1.1.124, 10.0.7.23.14, and 10.0.1.11.23, are inserted into the blocks Hashing and Join algorithms.
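Block generation as described here can be sketched in Python. The sketch assumes that the blocks of a reference are its `<cat>` entries and that the search engine is reachable through a function returning candidate publication ids. `fake_search` and its contents are stand-ins invented for the example, not the real index.

```python
from collections import defaultdict


def build_blocks(citations, search):
    """Insert the candidates of each reference into every block (category) it lists."""
    blocks = defaultdict(lambda: {"references": [], "candidates": set()})
    for cit in citations:
        candidates = search(cit["title"])  # probe the full-text index
        for cat in cit["categories"]:
            blocks[cat]["references"].append(cit["title"])
            blocks[cat]["candidates"].update(candidates)
    return blocks


def fake_search(title):
    # Stand-in for the search engine; the mapping below is made up.
    index = {
        "An Adaptive Hash Join Algorithm for Multiuser Environments":
            ["10.0.1.1.124", "10.0.7.23.14", "10.0.1.11.23"],
    }
    return index.get(title, [])


citations = [{
    "title": "An Adaptive Hash Join Algorithm for Multiuser Environments",
    "categories": ["Hashing", "Join algorithms"],
}]
blocks = build_blocks(citations, fake_search)
for name, block in sorted(blocks.items()):
    print(name, sorted(block["candidates"]))
```

Note that a candidate is stored once per category, so a reference listing many categories multiplies the work of the matching stage.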

slide-34
SLIDE 34

MAXIM: Entity Matching in the Cloud

Stage 2: Blocking Stage

Distributed Search in MAXIM

Figure: distributed search across Nodes 1 to 5, each running a Hadoop Task Tracker and Data Node, a ChuQL Engine, and a Search Engine with a Full-text Index.

(a) Send HTTP request (query)
(b) HTTP response (partial result)
(c) Collect partial results
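The three steps (a) to (c) follow a scatter-gather pattern, sketched below in Python. The sketch replaces the HTTP calls with a local function and assumes that every node returns scored hits which are merged into a global top-k list. The shard contents, the scoring, and the merge strategy are assumptions for illustration, since the slides do not specify them.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-node index shards: node name -> list of (score, publication id).
SHARDS = {
    "node2": [(0.9, "p1"), (0.4, "p7")],
    "node3": [(0.8, "p3")],
    "node4": [],
    "node5": [(0.95, "p9"), (0.1, "p2")],
}


def query_node(node, query, k):
    # Stand-in for (a) the HTTP request and (b) the partial-result response.
    return heapq.nlargest(k, SHARDS[node])


def distributed_search(query, k):
    # (c) collect the partial results and keep the k best hits overall
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(lambda node: query_node(node, query, k), SHARDS))
    merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(k, merged)


top = distributed_search("adaptive hash join", k=2)
print(top)
```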

slide-36
SLIDE 36

MAXIM: Entity Matching in the Cloud

Stage 3: Matching Stage

- Applies user-defined similarity functions to candidate pairs
- Each attribute can be evaluated by a specific similarity function

Number of candidate pairs

$$CP = \sum_{i=1}^{n} C_i \cdot R_i \qquad (1)$$

- $n$: number of blocks $B_1, \ldots, B_n$
- $R_i$: number of references in block $B_i$
- $C_i$: number of candidate publications in block $B_i$
- $CP$: number of candidate pairs to verify
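Equation (1) is easy to evaluate once the block sizes are known. The following Python lines are a small worked example; the block sizes are hypothetical numbers chosen for the arithmetic, not measurements from the talk.

```python
def candidate_pairs(blocks):
    """CP = sum over all blocks of C_i * R_i, with blocks given as (C_i, R_i) tuples."""
    return sum(c * r for c, r in blocks)


# Hypothetical block sizes: (candidate publications C_i, references R_i).
blocks = [(3, 1), (3, 1), (50, 4)]
cp = candidate_pairs(blocks)
print(cp)  # 3*1 + 3*1 + 50*4 = 206
```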

slide-37
SLIDE 37

Summary

Summary

- Wikipedia provides many opportunities for research
- The need to process semistructured data efficiently is increasing
- Entity Matching is critical for data integration and data cleaning
- Entity Matching is difficult to parallelize due to unbalanced data partitions
- MAXIM parallelizes EM by building blocks of similar records in a classification fashion
- MAXIM lets users define their own similarity functions and computation functions without changing the algorithm

slide-38
SLIDE 38

“Everything that can be invented has been invented.”

(Charles H. Duell, Commissioner, U.S. Office of Patents, 1899)

slide-39
SLIDE 39

Experiments

Scaleup and Speedup

Figure (a): Speedup for all stages. x-axis: number of nodes (5, 10, 20, 40); y-axis: Speedup = Base Time / New Time; series: Ideal, INDEXING-2000, EXTRACTING-2000, BLOCKING, MATCHING.

Figure (b): Scaleup for preparation stage. x-axis: number of nodes (5, 10, 20, 40); y-axis: Scaleup = Base Time / New Time; series: Ideal, EXTRACTING-2000, INDEXING-2000.
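Both metrics are the ratio Base Time / New Time. The Python lines below show how such a curve is computed, assuming the smallest cluster (5 nodes) is the baseline. The timings are invented placeholders, not results from the experiments.

```python
def ratio(base_time: float, new_time: float) -> float:
    """Speedup (or scaleup) as defined on the plot axes: Base Time / New Time."""
    return base_time / new_time


# Hypothetical runtimes in seconds per cluster size, for illustration only.
times = {5: 800.0, 10: 420.0, 20: 230.0, 40: 140.0}
base_nodes = 5

for nodes, runtime in times.items():
    measured = ratio(times[base_nodes], runtime)
    ideal = nodes / base_nodes  # linear speedup relative to the baseline
    print(f"{nodes:>2} nodes: speedup {measured:.2f} (ideal {ideal:.1f})")
```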

slide-40
SLIDE 40

Experiments

Query Performance

Figure: Query performance for different result set sizes and cluster sizes. x-axis: number of nodes (5, 10, 20, 40); y-axis: average query response time (ms); series: RESULTCOUNT-50, RESULTCOUNT-100, RESULTCOUNT-150, RESULTCOUNT-200.

slide-41
SLIDE 41

Experiments

Blocking Accuracy

Figure: Blocking accuracy for different typographical error classes. x-axis: variance (0.25, 0.5, 0.75, 1.0); y-axis: accuracy; series: Ideal, WRONG-ORDER, MISPLACED-END, MISPLACED-ANY, MISSING.

slide-42
SLIDE 42

Experiments

Number of Candidate Pairs

Figure: Number of candidate pair verifications in the matching stage. x-axis: variance (0.0, 0.1, 0.25, 0.5, 0.75, 1.0); y-axis: number of candidate pairs; series: RSCOUNT-50, RSCOUNT-100, RSCOUNT-150, RSCOUNT-200.