Entity Matching for Semistructured Data in the Cloud



  1. Entity Matching for Semistructured Data in the Cloud. Marcus Paradies, IBM. F2CE Workshop, December 1, 2011.

  2. Outline
     1. Motivation
     2. Entity Matching
     3. MAXIM: Entity Matching in the Cloud
     4. Summary

  3. Motivation: Enriching/Improving Wikipedia. References from the Wikipedia article "Hash join".

  4. Motivation: Enriching/Improving Wikipedia. Lookup in the CiteSeer database.

  5. Motivation: Enriching/Improving Wikipedia. Lookup in Google.


  7. Motivation: Wikipedia in a nutshell
     Characteristics
     - 3.7 million articles (English Wikipedia database)
     - Dataset size about 30 GB of XML (without history)
     - 3.6 million references
     - References are categorized into books, journals, websites, etc.
     Challenges
     - Articles in Wikipedia are incomplete
     - Articles in Wikipedia are inaccurate
     - Articles in Wikipedia are subjective

  8. Motivation: Problem Statement
     Definition: Given two datasets of records, $R$ and $S$, a set of attributes $a_1, \ldots, a_n$, a set of similarity functions $\mathrm{sim}_{a_1}, \ldots, \mathrm{sim}_{a_n}$, and a similarity threshold $\tau$, the entity matching task between $R$ and $S$ is defined as finding and combining all pairs of records from $R$ and $S$ where
     $\sum_{i=1}^{n} \mathrm{sim}_{a_i}(R.a_i, S.a_i) \geq \tau$
     (The aggregation operator is illegible in the source; a sum is assumed here.)
     Wikipedia data set:
         {{Cite book
         | last = Mumford
         | first = David
         | authorlink = David Mumford
         | title = The Red Book of Varieties and Schemes
         | publisher = [[Springer]]
         | location = Berlin
         | date = 1999
         | page = 198
         | doi = 10.1007/b62130
         | isbn = 354063293X
         }}
     CiteSeer data set:
         <record id="6627383">
         <author>David Mumford</author>
         <title>The red book of Varieties and Schemes</title>
         <publisher>Springer</publisher>
         <year>1999</year>
         <doi>10.1007/b62130</doi>
         </record>
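The definition on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not the MAXIM implementation: the sum aggregation, the `token_jaccard` and `exact` similarity functions, the threshold of 2.5, and the second CiteSeer record are all invented for the example. The first record on each side is the Mumford entry from the slide, reduced to attributes both sources share.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
SimFn = Callable[[str, str], float]


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased whitespace tokens (assumed measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def exact(a: str, b: str) -> float:
    """1.0 if both values are identical after trimming, else 0.0."""
    return 1.0 if a.strip() == b.strip() else 0.0


def match(r_records: List[Record], s_records: List[Record],
          sims: Dict[str, SimFn], tau: float) -> List[Tuple[Record, Record, float]]:
    """Naive all-pairs matching: keep (r, s) when the summed similarity >= tau."""
    pairs = []
    for r in r_records:
        for s in s_records:
            score = sum(sim(r.get(attr, ""), s.get(attr, ""))
                        for attr, sim in sims.items())
            if score >= tau:
                pairs.append((r, s, score))
    return pairs


wikipedia = [
    {"title": "The Red Book of Varieties and Schemes",
     "year": "1999", "doi": "10.1007/b62130"},
]
citeseer = [
    {"title": "The red book of Varieties and Schemes",
     "year": "1999", "doi": "10.1007/b62130"},
    # Hypothetical non-matching record, added only for contrast.
    {"title": "An unrelated record", "year": "2001", "doi": ""},
]

# One similarity function per attribute, as in the definition.
sims = {"title": token_jaccard, "year": exact, "doi": exact}
matches = match(wikipedia, citeseer, sims, tau=2.5)
```

The nested loop compares every record of one dataset with every record of the other, which is the quadratic behavior the deck later addresses with blocking.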


  10. Entity Matching


  12. Entity Matching: What is Entity Matching?
      Challenges
      - Entity Matching has quadratic runtime behavior
      - Entity Matching has high CPU and memory demands
      - The definition of "what is similar" is domain-dependent


  14. Entity Matching: Entity Matching Architecture
      [Figure: Data Source S1 and Data Source S2 feed Blocking, which produces blocks b1, b2, b3, ..., bn; Matching produces the Match Result R]
      How can we improve the runtime of an EM task?


  16. Entity Matching: Entity Matching Architecture
      [Figure: Data Source S1 and Data Source S2 feed Blocking, which produces blocks b1, b2, b3, ..., bn; Matching produces the Match Result R]
      - Distributed Blocking
      - Parallel Matching
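A rough sketch of how blocking followed by per-block matching cuts the number of comparisons. Everything here is an assumption made for illustration: the blocking key (publication year), the title-equality matcher, the sample records other than the Mumford title, and the use of a thread pool as a stand-in for cluster workers. In CPython, threads do not give real CPU parallelism; on a cluster each block would go to a different worker.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def block_by(records, key):
    """Group records into blocks that share the same blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return blocks


def match_block(task):
    """Compare all pairs inside one block (assumed matcher: equal titles)."""
    r_block, s_block = task
    return [(r["id"], s["id"]) for r, s in product(r_block, s_block)
            if r["title"].lower() == s["title"].lower()]


def blocked_match(r_records, s_records, key, workers=4):
    """Block both inputs, then match each shared block independently."""
    r_blocks, s_blocks = block_by(r_records, key), block_by(s_records, key)
    shared = sorted(r_blocks.keys() & s_blocks.keys())
    tasks = [(r_blocks[k], s_blocks[k]) for k in shared]
    comparisons = sum(len(rb) * len(sb) for rb, sb in tasks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(match_block, tasks))  # order follows tasks
    pairs = [pair for block_pairs in results for pair in block_pairs]
    return pairs, comparisons


# Sample data; every record except the Mumford title is made up.
R = [
    {"id": "r1", "title": "The Red Book of Varieties and Schemes", "year": "1999"},
    {"id": "r2", "title": "Some Other Book", "year": "2005"},
]
S = [
    {"id": "s1", "title": "The red book of Varieties and Schemes", "year": "1999"},
    {"id": "s2", "title": "Some other book", "year": "2005"},
    {"id": "s3", "title": "Unrelated", "year": "1987"},
]

pairs, comparisons = blocked_match(R, S, key=lambda rec: rec["year"])
all_pairs = len(R) * len(S)  # comparisons needed without blocking
```

With this toy input, blocking reduces the work from `all_pairs` comparisons to `comparisons`, and each block can be matched without looking at any other block.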

  17. MAXIM: Entity Matching in the Cloud


  19. MAXIM: Entity Matching in the Cloud: Requirements and Approach
      Requirements
      - Efficient processing of semistructured data
      - Scalability to large datasets
      - Independence from specific similarity functions
      - Ability to easily add new similarity functions
      Main Idea
      - Use MapReduce and ChuQL to process semistructured data
      - Use search-based blocking to generate candidate pairs
      - Apply similarity functions to candidate pairs within a block
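One way to picture search-based blocking is a keyword lookup against a full-text index: each record from one dataset is turned into a query, and the top hits from the other dataset become its candidate pairs. The slides do not say how MAXIM builds or ranks its queries, so the sketch below is only an assumption: a tiny in-memory inverted index, ranking by the number of shared title tokens, and hypothetical helper names (`build_index`, `search`, `candidate_pairs`). Records `s2` and `s3` are made up.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenization (assumed; real engines do more)."""
    return text.lower().split()


def build_index(records, field):
    """Inverted index: token -> set of record ids containing that token."""
    index = defaultdict(set)
    for rec in records:
        for tok in set(tokenize(rec[field])):
            index[tok].add(rec["id"])
    return index


def search(index, query, top_k=2):
    """Return up to top_k record ids ranked by number of shared tokens."""
    hits = Counter()
    for tok in set(tokenize(query)):
        for rec_id in index.get(tok, ()):
            hits[rec_id] += 1
    # Sort by descending hit count, then by id to keep the order deterministic.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [rec_id for rec_id, _ in ranked[:top_k]]


def candidate_pairs(r_records, index, field, top_k=2):
    """Each search result list acts as the block for one query record."""
    return [(r["id"], s_id)
            for r in r_records
            for s_id in search(index, r[field], top_k)]


S = [
    {"id": "s1", "title": "The red book of Varieties and Schemes"},
    {"id": "s2", "title": "Algebraic Geometry"},
    {"id": "s3", "title": "The book of examples"},
]
R = [{"id": "r1", "title": "The Red Book of Varieties and Schemes"}]

index = build_index(S, "title")
candidates = candidate_pairs(R, index, "title", top_k=2)
```

Only the pairs in `candidates` would then be passed to the similarity functions, so records that share no tokens with the query (here `s2`) are never compared.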

  20. MAXIM: Entity Matching in the Cloud: ChuQL by example
      Wordcount in ChuQL

          mapreduce {
            input  { fn:collection("hdfs://wiki/") }
            rr     { for $rev in $hxml:in//revision
                     return {"key": fn:data($rev//username|$rev//ip),
                             "val": $rev//title } }
            map    { $hxml:in }
            reduce { {"key": $hxml:in=>"key", "val": fn:count($hxml:in=>"val")} }
            rw     { <author name="{$hxml:in=>"key"}" count="{$hxml:in=>"val"}"/> }
            output { fn:put("hdfs://outputdir/") }
          }
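For readers unfamiliar with ChuQL, the data flow of the job above can be approximated in plain Python. This is a loose analogy, not ChuQL semantics: the stage functions, the explicit `shuffle` step, and the simplified sample XML are all assumptions. The sketch mirrors the slide's job by emitting one key/value pair per revision (author name or IP as key, title as value) and then counting values per key.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, heavily simplified input; real Wikipedia dumps are structured differently.
SAMPLE_XML = """
<wiki>
  <revision><title>Hash join</title><username>alice</username></revision>
  <revision><title>Sort-merge join</title><username>alice</username></revision>
  <revision><title>Hash join</title><ip>192.0.2.1</ip></revision>
</wiki>
"""


def record_reader(xml_text):
    """Like the rr clause: one {"key", "val"} record per revision."""
    root = ET.fromstring(xml_text)
    for rev in root.iter("revision"):
        author = rev.find(".//username")
        if author is None:  # anonymous edit: fall back to the IP address
            author = rev.find(".//ip")
        yield {"key": author.text, "val": rev.find(".//title").text}


def map_phase(records):
    """Identity map, as in the slide's map clause."""
    yield from records


def shuffle(records):
    """Group values by key (done by the framework in a MapReduce system)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["key"]].append(rec["val"])
    return groups


def reduce_phase(groups):
    """Count the values collected for each key."""
    for key in sorted(groups):
        yield {"key": key, "val": len(groups[key])}


def record_writer(records):
    """Like the rw clause: serialize each result as an <author/> element."""
    return [f'<author name="{r["key"]}" count="{r["val"]}"/>' for r in records]


output = record_writer(reduce_phase(shuffle(map_phase(record_reader(SAMPLE_XML)))))
```

Despite the slide's "Wordcount" title, the job counts revisions per contributor, which is what `output` holds for the sample input.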

  21. MAXIM: Entity Matching in the Cloud: Architecture
      [Figure: Node 1, Node 2, ..., Node N; each node contains a Search Engine with a Full-text Index, a Data Node, a Hadoop Task Tracker, and a ChuQL Engine, on top of HDFS]
      - Hadoop cluster with up to 40 nodes
      - Each node runs a search engine and an attached full-text index
      - Each node runs an in-memory XQuery processor
      - Semistructured data is partitioned and placed on HDFS

  22. MAXIM: Entity Matching in the Cloud: Processing Stages
      [Figure labels: Search Engines, HDFS]
      Three Stages
      - Preparation Stage
      - Blocking Stage
      - Matching Stage
