Entity Matching for Semistructured Data in the Cloud
Marcus Paradies IBM F2CE Workshop December 1, 2011
Outline
1. Motivation
2. Entity Matching
3. MAXIM: Entity Matching in the Cloud
4. Summary
Motivation
References from Wikipedia article Hash join
Lookup in the CiteSeer database
Lookup in Google
Characteristics
- 3.7 million articles (English Wikipedia database)
- Dataset size: about 30 GB of XML (without history)
- 3.6 million references
- References are categorized into books, journals, websites, etc.
Challenges
- Articles in Wikipedia are incomplete
- Articles in Wikipedia are inaccurate
- Articles in Wikipedia are subjective
Definition
Given two datasets of records, $R$ and $S$, a set of attributes $a_1, \dots, a_n$, a set of similarity functions $sim_{a_1}, \dots, sim_{a_n}$, and a similarity threshold $\tau$, the entity matching task between $R$ and $S$ is defined as finding and combining all pairs of records from $R$ and $S$ where

\[ \sum_{i=1}^{n} sim_{a_i}(R.a_i, S.a_i) \geq \tau \]
Wikipedia data set:
{{Cite book | last = Mumford | first = David | authorlink = David Mumford | title = The Red Book of Varieties and Schemes | publisher = [[Springer]] | location = Berlin | date = 1999 | page = 198 | doi = 10.1007/b62130 | isbn = 354063293X }}

CiteSeer data set:
<record id="6627383">
  <author>David Mumford</author>
  <title>The red book of Varieties and Schemes</title>
  <publisher>Springer</publisher>
  <year>1999</year>
  <doi>10.1007/b62130</doi>
</record>
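The definition can be made concrete with a small sketch. It assumes that the per-attribute similarities are summed and compared against the threshold; the attribute choice, the two similarity functions (`string_sim`, `exact_sim`), and the threshold value are illustrative assumptions, not the functions used in the talk.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_sim(a: str, b: str) -> float:
    """1.0 if both values are identical after trimming, else 0.0."""
    return 1.0 if a.strip() == b.strip() else 0.0


# Hypothetical mapping of attributes a_i to similarity functions sim_{a_i}.
SIMS = {"title": string_sim, "author": string_sim,
        "year": exact_sim, "doi": exact_sim}


def total_similarity(r: dict, s: dict) -> float:
    """Sum of sim_{a_i}(R.a_i, S.a_i) over all configured attributes."""
    return sum(sim(r.get(attr, ""), s.get(attr, ""))
               for attr, sim in SIMS.items())


def is_match(r: dict, s: dict, tau: float) -> bool:
    """A pair matches when the summed similarity reaches the threshold."""
    return total_similarity(r, s) >= tau


# The two Mumford records from the example, reduced to shared attributes.
wiki = {"title": "The Red Book of Varieties and Schemes",
        "author": "David Mumford", "year": "1999", "doi": "10.1007/b62130"}
citeseer = {"title": "The red book of Varieties and Schemes",
            "author": "David Mumford", "year": "1999", "doi": "10.1007/b62130"}

print(is_match(wiki, citeseer, tau=3.5))
```

Because the title comparison ignores case, all four attributes agree and the pair clears the assumed threshold of 3.5.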
Entity Matching
Challenges
- Entity matching has quadratic runtime behavior
- Entity matching has high CPU and memory demands
- The definition of “what is similar” is domain-dependent
Figure: Entity matching workflow. Data sources S1 and S2 are fed into blocking, which produces blocks b1, b2, b3, ..., bn; matching is applied to the blocks and yields the match result R.
How can we improve the runtime of an EM task?
- Distributed blocking
- Parallel matching
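A minimal sketch of the blocking-then-matching workflow follows. It assumes a simple key-based blocking on the publication year and an exact title comparison, with a thread pool standing in for the nodes of a cluster; `block_by_key`, `match_block`, and `match` are illustrative names, not part of MAXIM.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def block_by_key(records, key):
    """Blocking: group records by a blocking key (here a plain attribute)."""
    blocks = {}
    for rec in records:
        blocks.setdefault(rec[key], []).append(rec)
    return blocks


def match_block(block_pair):
    """Matching: compare only records that ended up in the same block."""
    left, right = block_pair
    return [(a["id"], b["id"]) for a, b in product(left, right)
            if a["title"].lower() == b["title"].lower()]


def match(s1, s2, key="year", workers=4):
    """Block both sources, then match the shared blocks in parallel."""
    b1, b2 = block_by_key(s1, key), block_by_key(s2, key)
    shared = [(b1[k], b2[k]) for k in sorted(b1.keys() & b2.keys())]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(match_block, shared)
    return [pair for block_result in results for pair in block_result]


# Hypothetical toy records.
s1 = [{"id": "w1", "title": "A", "year": "1990"},
      {"id": "w2", "title": "B", "year": "1999"}]
s2 = [{"id": "c1", "title": "a", "year": "1990"},
      {"id": "c2", "title": "B", "year": "2003"}]

print(match(s1, s2))  # [('w1', 'c1')]
```

Records `w2` and `c2` share a title but fall into different blocks, so they are never compared. That is the trade-off blocking makes: fewer comparisons in exchange for the risk of missed matches when the blocking key disagrees.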
MAXIM: Entity Matching in the Cloud
Requirements
- Efficient processing of semistructured data
- Scalability to large datasets
- Independence from specific similarity functions
- Ability to easily add new similarity functions
Main Idea
- Use MapReduce and ChuQL to process semistructured data
- Use search-based blocking to generate candidate pairs
- Apply similarity functions to candidate pairs within a block
Wordcount in ChuQL
mapreduce {
  input  { fn:collection("hdfs://wiki/") }
  rr     { (: emit one key/value pair per revision: author and title :)
           for $rev in $hxml:in//revision
           return {"key": fn:data($rev//username|$rev//ip),
                   "val": $rev//title} }
  map    { $hxml:in }
  reduce { (: count the values collected for each author :)
           {"key": $hxml:in=>"key", "val": fn:count($hxml:in=>"val")} }
  rw     { <author name="{$hxml:in=>"key"}" count="{$hxml:in=>"val"}"/> }
}
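For readers unfamiliar with ChuQL, the job above can be mirrored in plain Python. This is only a sketch of the data flow (record reader, identity map, grouping, reduce, record writer) on a single machine; the toy XML and all function names are assumptions and do not reflect the real Wikipedia dump schema or the ChuQL runtime.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Hypothetical toy input; the real dump is laid out differently.
TOY_XML = """
<wiki>
  <revision><title>Hash join</title><username>alice</username></revision>
  <revision><title>Hash join</title><ip>10.0.0.1</ip></revision>
  <revision><title>Sort-merge join</title><username>alice</username></revision>
</wiki>
"""


def record_reader(xml_text):
    """rr: emit one (author, title) pair per <revision> element."""
    root = ET.fromstring(xml_text)
    for rev in root.iter("revision"):
        author = rev.findtext(".//username") or rev.findtext(".//ip")
        yield author, rev.findtext(".//title")


def map_phase(pairs):
    """map: identity, as in the ChuQL job."""
    yield from pairs


def shuffle(pairs):
    """Group values by key, which MapReduce does between map and reduce."""
    groups = defaultdict(list)
    for key, val in pairs:
        groups[key].append(val)
    return groups


def reduce_phase(groups):
    """reduce: count the values collected for each key."""
    return {key: len(vals) for key, vals in groups.items()}


def record_writer(counts):
    """rw: serialize each result as an <author> element."""
    return [f'<author name="{name}" count="{count}"/>'
            for name, count in sorted(counts.items())]


counts = reduce_phase(shuffle(map_phase(record_reader(TOY_XML))))
for line in record_writer(counts):
    print(line)
```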
Figure: MAXIM cluster. Each node (Node 1, Node 2, ..., Node N) runs a Hadoop Task Tracker and Data Node, a ChuQL engine, and a search engine with a full-text index; the nodes share HDFS.
Architecture
- Hadoop cluster with up to 40 nodes
- Each node runs a search engine with an attached full-text index
- Each node runs an in-memory XQuery processor
- Semistructured data is partitioned and placed on HDFS
Three Stages
1. Preparation Stage
2. Blocking Stage
3. Matching Stage
Stage 1: Preparation Stage
- Extracts references from Wikipedia
- Reads and transforms records from CiteSeerX
- Sends CiteSeerX data to the local full-text index
Stage 2: Blocking Stage
- Reads extracted references from HDFS
- Probes the full-text index to retrieve candidate publications
- Assigns candidate publications to block(s)
Figure: MAXIM pipeline across HDFS and the search engines. Preparation stage: extract Wikipedia references (extract references, store references) and index CiteSeerX records (transform into full-text index XML, build index). Blocking stage: generate semantic blocks (retrieve references, generate query, get query response, store blocks). Matching stage: record pair generation (verify candidate pairs, store record pairs).
Stage 3: Matching Stage
- Reads blocks from HDFS
- Generates candidate pairs and applies similarity functions
- Stores matching pairs and their similarity
Extracting References
Extraction (citation template taken from the Wikipedia article):
{{cite journal | author1 = Hansjörg Zeller | author2 = Jim Gray | title = An Adaptive Hash Join Algorithm for Multi-User Environments | journal = Proceedings of the 16th VLDB conference | year = 1990 | pages = 186–197 }}

Transformation (XML representation stored on HDFS):
<reference type="journal">
  <author1>Hansjörg Zeller</author1>
  <author2>Jim Gray</author2>
  <title>An Adaptive Hash Join Algorithm for Multi-User Environments</title>
  <journal>Proceedings of the 16th VLDB conference</journal>
  <year>1990</year>
  <pages>186–197</pages>
</reference>
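The extraction and transformation steps can be sketched in a few lines of Python. This is a simplified illustration, not the ChuQL implementation from the talk: it assumes a flat `{{cite ...}}` template and splits naively on `|`, so templates containing piped wiki links such as `[[target|label]]` would need a real parser. The names `parse_cite_template` and `to_reference_xml` are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_cite_template(text):
    """Split a flat {{cite <type> | key = value | ...}} template."""
    inner = text.strip()
    if not (inner.startswith("{{") and inner.endswith("}}")):
        raise ValueError("not a template")
    parts = [p.strip() for p in inner[2:-2].split("|")]
    name = parts[0].lower()                      # e.g. "cite journal"
    ref_type = name.split(None, 1)[1] if " " in name else name
    fields = {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    return ref_type, fields


def to_reference_xml(text):
    """Turn the template into a <reference type="..."> element."""
    ref_type, fields = parse_cite_template(text)
    ref = ET.Element("reference", {"type": ref_type})
    for key, value in fields.items():
        ET.SubElement(ref, key).text = value
    return ET.tostring(ref, encoding="unicode")


template = ("{{cite journal | author1 = Hansjörg Zeller | author2 = Jim Gray "
            "| title = An Adaptive Hash Join Algorithm for Multi-User Environments "
            "| journal = Proceedings of the 16th VLDB conference "
            "| year = 1990 | pages = 186–197 }}")

print(to_reference_xml(template))
```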
Indexing Publications
Read and transformation (CiteSeerX record read from HDFS and transformed into full-text index XML):
<doc>
  <field name="id">10.1.1.49.2550</field>
  <field name="title">Selecting Tense, Aspect, and Connecting Words In Language Generation</field>
  <field name="author">Bonnie Dorr</field>
  <field name="description">Generating language ...</field>
</doc>

Indexing: the transformed records are added to the Lucene index.
Block generation
- Each reference generates a set of candidate publications
- Each candidate publication is inserted into every block listed in the reference
Example
<citation>
  <id>26334893</id>
  <cat>Search engine optimization</cat>
  <cat>Internet search algorithms</cat>
  <cat>Link analysis</cat>
  <ref>
    <type>journal</type>
    <author>Taher Haveliwala</author>
    <year>2003</year>
    <pages>56-70</pages>
    <title>The Second Eigenvalue [...]</title>
    <journal>Stanford University Technical Report</journal>
  </ref>
</citation>

<citation>
  <id>26334893</id>
  <cat>Hashing</cat>
  <cat>Join algorithms</cat>
  <ref>
    <type>journal</type>
    <author>Hansjörg Zeller</author>
    <author>Jim Gray</author>
    <year>1990</year>
    <pages>186-197</pages>
    <title>An Adaptive Hash Join Algorithm for Multiuser Environments</title>
    <journal>Proceedings of the 16th VLDB conference</journal>
  </ref>
</citation>
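Figure: The reference is sent as a query to the search engine and its full-text index (send query). The result is sent back (send result), and the returned publication ids (10.0.1.1.124, 10.0.7.23.14, 10.0.1.11.23) are placed in the blocks "Hashing" and "Join algorithms".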
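Block generation can be illustrated with a toy in-memory index. The sketch below assumes that the blocks listed in a reference are its `<cat>` categories and that the title is used as the query; `ToyIndex` is a stand-in for the full-text index, and the ids `p1`, `p2`, and `r1` are made up.

```python
from collections import defaultdict


class ToyIndex:
    """Minimal inverted index standing in for the full-text index."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for token in text.lower().split():
            self.postings[token].add(doc_id)

    def search(self, query, limit=10):
        """Rank documents by the number of query tokens they contain."""
        scores = defaultdict(int)
        for token in query.lower().split():
            for doc_id in self.postings.get(token, ()):
                scores[doc_id] += 1
        ranked = sorted(scores, key=lambda d: (-scores[d], d))
        return ranked[:limit]


def generate_blocks(references, index, limit=10):
    """Insert each reference's candidates into every block it lists."""
    blocks = defaultdict(lambda: {"references": [], "candidates": set()})
    for ref in references:
        candidates = index.search(ref["title"], limit)
        for cat in ref["categories"]:
            blocks[cat]["references"].append(ref["id"])
            blocks[cat]["candidates"].update(candidates)
    return blocks


index = ToyIndex()
index.add("p1", "An Adaptive Hash Join Algorithm for Multiuser Environments")
index.add("p2", "The Second Eigenvalue")

refs = [{"id": "r1",
         "title": "An Adaptive Hash Join Algorithm for Multi-User Environments",
         "categories": ["Hashing", "Join algorithms"]}]

blocks = generate_blocks(refs, index)
print(sorted(blocks))  # ['Hashing', 'Join algorithms']
```

The query still retrieves `p1` although "Multi-User" and "Multiuser" differ, because the remaining title tokens overlap. A search-based blocking step tolerates small spelling differences in this way.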
Distributed Search in MAXIM
Figure: Distributed search across five nodes, each running a Task Tracker, a ChuQL engine, a Hadoop Data Node, and a search engine with a full-text index. (a) Send HTTP request (query); (b) HTTP response (partial result); (c) collect partial results.
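The three steps (a) to (c) follow a scatter-gather pattern, which the sketch below imitates on one machine. Plain function calls replace the HTTP requests, a thread pool replaces the remote nodes, and the scoring is a simple token overlap; all of these are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def make_node(local_docs):
    """Return a search function over one node's local index (id -> text)."""
    def search(query, limit):
        terms = set(query.lower().split())
        hits = []
        for doc_id, text in local_docs.items():
            score = len(terms & set(text.lower().split()))
            if score:
                hits.append((score, doc_id))
        return heapq.nlargest(limit, hits)   # partial result of this node
    return search


def distributed_search(nodes, query, limit=3):
    # (a) send the query to every node, (b) receive the partial results
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda node: node(query, limit), nodes))
    # (c) collect the partial results and keep the best hits overall
    merged = heapq.nlargest(limit, (hit for part in partials for hit in part))
    return [doc_id for _, doc_id in merged]


# Hypothetical documents spread over two nodes.
nodes = [make_node({"p1": "adaptive hash join"}),
         make_node({"p2": "hash table", "p3": "link analysis"})]

print(distributed_search(nodes, "adaptive hash join"))  # ['p1', 'p2']
```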
- Applies user-defined similarity functions to candidate pairs
- Each attribute can be evaluated by its own similarity function
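One way to keep the matcher independent of specific similarity functions is a small registry keyed by name. The sketch below is an assumption about how such a plug-in mechanism could look, not MAXIM's actual interface; `similarity`, `REGISTRY`, and `score_pair` are hypothetical names.

```python
from typing import Callable, Dict

SimFn = Callable[[str, str], float]
REGISTRY: Dict[str, SimFn] = {}


def similarity(name: str):
    """Decorator that registers a similarity function under a name."""
    def register(fn: SimFn) -> SimFn:
        REGISTRY[name] = fn
        return fn
    return register


@similarity("jaccard")
def jaccard(a: str, b: str) -> float:
    """Token-set overlap: |A and B| / |A or B|."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


@similarity("exact")
def exact(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


def score_pair(ref: dict, pub: dict, config: Dict[str, str]) -> Dict[str, float]:
    """Evaluate each attribute with the function configured for it."""
    return {attr: REGISTRY[fn_name](ref.get(attr, ""), pub.get(attr, ""))
            for attr, fn_name in config.items()}


config = {"title": "jaccard", "year": "exact"}
ref = {"title": "Adaptive Hash Join", "year": "1990"}
pub = {"title": "adaptive hash join algorithm", "year": "1990"}

print(score_pair(ref, pub, config))  # {'title': 0.75, 'year': 1.0}
```

Adding a new similarity function then only requires another decorated function and a changed `config` entry; `score_pair` itself stays untouched.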
Number of candidate pairs:

\[ CP = \sum_{i=1}^{n} C_i \cdot R_i \qquad (1) \]

- $n$: number of blocks $B_1, \dots, B_n$
- $R_i$: number of references in block $B_i$
- $C_i$: number of candidate publications in block $B_i$
- $CP$: number of candidate pairs to verify
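As a quick worked example of equation (1), the following sketch sums the per-block products. The block sizes are invented for illustration and are not measurements from the experiments.

```python
def candidate_pairs(blocks):
    """CP = sum over all blocks of C_i * R_i.

    blocks: iterable of (C_i, R_i) tuples, i.e. the number of candidate
    publications and the number of references in each block.
    """
    return sum(c * r for c, r in blocks)


# Hypothetical block sizes (C_i, R_i).
example_blocks = [(50, 3), (100, 2), (10, 10)]

print(candidate_pairs(example_blocks))  # 50*3 + 100*2 + 10*10 = 450
```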
Summary
- Wikipedia provides many opportunities for research
- The need for efficient processing of semistructured data is increasing
- Entity matching is critical for data integration and data cleaning
- Entity matching is difficult to parallelize due to unbalanced data partitions
- MAXIM parallelizes entity matching by building blocks of similar records in a classification fashion
- MAXIM lets users define their own similarity functions and computation functions without changing the algorithm
“Everything that can be invented has been invented.”
(Charles H. Duell, Commissioner, U.S. Office of Patents, 1899)
Experiments
Figure (a): Speedup for all stages. Speedup = base time / new time, plotted against the number of nodes (5, 10, 20, 40) for Ideal, INDEXING-2000, EXTRACTING-2000, BLOCKING, and MATCHING.
Figure (b): Scaleup for the preparation stage. Scaleup = base time / new time, plotted against the number of nodes (5, 10, 20, 40) for Ideal, EXTRACTING-2000, and INDEXING-2000.
Figure: Query performance for different result set sizes (RESULTCOUNT-50, RESULTCOUNT-100, RESULTCOUNT-150, RESULTCOUNT-200) and cluster sizes (5, 10, 20, 40 nodes).
Figure: Blocking accuracy for different typographical error classes (WRONG-ORDER, MISPLACED-END, MISPLACED-ANY, MISSING) as a function of variance (0.25 to 1.0), compared with the ideal.
Figure: Number of candidate pair verifications in the matching stage as a function of variance (0.0 to 1.0) for RSCOUNT-50, RSCOUNT-100, RSCOUNT-150, and RSCOUNT-200.