
Entity Matching for Semistructured Data in the Cloud
Marcus Paradies
IBM F2CE Workshop
December 1, 2011

Outline
1 Motivation
2 Entity Matching
3 MAXIM: Entity Matching in the Cloud
4 Summary

Motivation
Enriching/Improving Wikipedia
  References from the Wikipedia article "Hash join"
  Lookup in the CiteSeer database
  Lookup in Google


Wikipedia in a nutshell
Characteristics
  3.7 million articles (English Wikipedia database)
  Dataset size about 30 GB of XML (without history)
  3.6 million references
  References are categorized into books, journals, websites, etc.
Challenges
  Articles in Wikipedia are incomplete
  Articles in Wikipedia are inaccurate
  Articles in Wikipedia are subjective

Problem Statement
Definition
  Given two datasets of records, R and S, a set of attributes a_1, ..., a_n, a set of similarity functions sim_{a_1}, ..., sim_{a_n}, and a similarity threshold \tau, the entity matching task between R and S is defined as finding and combining all pairs of records from R and S where
    \sum_{i=1}^{n} sim_{a_i}(R.a_i, S.a_i) \geq \tau
Wikipedia data set
  {{Cite book
  | last = Mumford
  | first = David
  | authorlink = David Mumford
  | title = The Red Book of Varieties and Schemes
  | publisher = [[Springer]]
  | location = Berlin
  | date = 1999
  | page = 198
  | doi = 10.1007/b62130
  | isbn = 354063293X
  }}
CiteSeer data set
  <record id="6627383">
    <author>David Mumford</author>
    <title>The red book of Varieties and Schemes</title>
    <publisher>Springer</publisher>
    <year>1999</year>
    <doi>10.1007/b62130</doi>
  </record>
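
The definition above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the choice of attributes, the use of `difflib` for string similarity, the threshold value, and the `other` record are all invented here and are not taken from the talk.

```python
from difflib import SequenceMatcher

def string_sim(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (difflib ratio, an assumed choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def exact_sim(a: str, b: str) -> float:
    """1.0 if both values are identical, else 0.0."""
    return 1.0 if a == b else 0.0

# One similarity function per attribute a_i (hypothetical assignment).
SIMS = {"title": string_sim, "publisher": string_sim,
        "year": exact_sim, "doi": exact_sim}

def score(r: dict, s: dict) -> float:
    """Sum of sim_{a_i}(r.a_i, s.a_i) over the attributes present in both records."""
    return sum(sim(r[a], s[a]) for a, sim in SIMS.items() if a in r and a in s)

def match(R, S, tau):
    """Naive entity matching: compare every pair, keep those with score >= tau."""
    return [(r, s) for r in R for s in S if score(r, s) >= tau]

# The two records from the slide, reduced to the shared attributes.
wiki = {"title": "The Red Book of Varieties and Schemes",
        "publisher": "Springer", "year": "1999", "doi": "10.1007/b62130"}
cite = {"title": "The red book of Varieties and Schemes",
        "publisher": "Springer", "year": "1999", "doi": "10.1007/b62130"}
# Made-up non-matching record, for contrast only.
other = {"title": "An Unrelated Book", "publisher": "Springer",
         "year": "1977", "doi": "10.0000/example"}

pairs = match([wiki], [cite, other], tau=3.5)
print(len(pairs))  # only the (wiki, cite) pair reaches the threshold
```

Note that `match` compares every record of R with every record of S, which is exactly the quadratic behavior that blocking is meant to avoid.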

Entity Matching


What is Entity Matching?
Challenges
  Entity matching has quadratic runtime behavior
  Entity matching has high CPU and memory demands
  The definition of "what is similar" is domain-dependent

Entity Matching Architecture
[Figure: data sources S1 and S2 feed a blocking step that produces blocks b1, b2, b3, ..., bn; a matching step then produces the match result R.]
How can we improve the runtime of an EM task?
  Distributed blocking
  Parallel matching
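
The blocking and matching steps of this architecture can be sketched in Python as follows. This is a minimal illustration, not the system from the talk: the blocking key (publication year), the match rule (equal titles ignoring case), the sample records, and the use of a thread pool as a stand-in for cluster nodes are all assumptions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def blocking_key(record: dict) -> str:
    """Hypothetical blocking key: the publication year."""
    return record["year"]

def build_blocks(R, S):
    """Group records of both sources by blocking key: key -> (R part, S part)."""
    blocks = defaultdict(lambda: ([], []))
    for r in R:
        blocks[blocking_key(r)][0].append(r)
    for s in S:
        blocks[blocking_key(s)][1].append(s)
    return blocks

def match_block(block):
    """Compare only the pairs inside one block (assumed rule: same title, ignoring case)."""
    rs, ss = block
    return [(r["id"], s["id"]) for r in rs for s in ss
            if r["title"].lower() == s["title"].lower()]

def run(R, S, workers=4):
    """Match all blocks in parallel; threads stand in for worker nodes."""
    blocks = build_blocks(R, S)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(match_block, blocks.values()))
    return sorted(pair for block_result in results for pair in block_result)

# Made-up sample records.
R = [{"id": "w1", "title": "The Red Book of Varieties and Schemes", "year": "1999"},
     {"id": "w2", "title": "Some Other Book", "year": "2001"}]
S = [{"id": "c1", "title": "The red book of Varieties and Schemes", "year": "1999"},
     {"id": "c2", "title": "Some other book", "year": "1987"}]

matches = run(R, S)
# Pairs actually compared after blocking, versus len(R) * len(S) without it.
compared = sum(len(rs) * len(ss) for rs, ss in build_blocks(R, S).values())
print(matches, compared, len(R) * len(S))
```

In this toy data, w2 and c2 land in different blocks and are never compared, which illustrates the trade-off: blocking cuts the number of comparisons but can miss matches when the blocking key itself is inaccurate.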

MAXIM: Entity Matching in the Cloud


Requirements and Approach
Requirements
  Efficient processing of semistructured data
  Scalability to large datasets
  Independence from specific similarity functions
  Ability to easily add new similarity functions
Main Idea
  Use MapReduce and ChuQL to process semistructured data
  Use search-based blocking to generate candidate pairs
  Apply similarity functions to candidate pairs within a block
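
One way to picture search-based blocking is the Python sketch below. It is an assumption-laden illustration, not MAXIM's implementation: `TinyIndex` is a toy inverted index standing in for a real full-text search engine, and the top-k cutoff, the title-only query, the `difflib` similarity, the threshold, and the sample records are all invented for the example.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher

def tokens(text: str) -> set:
    """Lowercased whitespace tokens of a string."""
    return set(text.lower().split())

class TinyIndex:
    """Toy inverted index standing in for a full-text search engine."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of records containing it
        self.docs = {}                    # id -> record

    def add(self, doc_id: str, record: dict) -> None:
        self.docs[doc_id] = record
        for tok in tokens(record["title"]):
            self.postings[tok].add(doc_id)

    def search(self, query: str, k: int = 2) -> list:
        """Return up to k record ids ranked by the number of shared tokens."""
        hits = Counter()
        for tok in tokens(query):
            for doc_id in self.postings.get(tok, ()):
                hits[doc_id] += 1
        return [doc_id for doc_id, _ in hits.most_common(k)]

def title_sim(a: str, b: str) -> float:
    """Pluggable similarity function (assumed: difflib ratio on lowercased titles)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match(R, index, sim=title_sim, tau=0.9, k=2):
    """For each record in R, search for candidates, then apply sim to each candidate pair."""
    found = []
    for r in R:
        for doc_id in index.search(r["title"], k):  # the block of candidates for r
            if sim(r["title"], index.docs[doc_id]["title"]) >= tau:
                found.append((r["id"], doc_id))
    return found

# Made-up sample data.
index = TinyIndex()
index.add("c1", {"title": "The red book of Varieties and Schemes"})
index.add("c2", {"title": "Lectures on Curves on an Algebraic Surface"})
index.add("c3", {"title": "Geometric Invariant Theory"})
R = [{"id": "w1", "title": "The Red Book of Varieties and Schemes"}]

candidates = index.search(R[0]["title"])
matches = match(R, index)
print(candidates, matches)
```

Passing the similarity function as the `sim` parameter mirrors the requirement that new similarity functions should be easy to add without touching the blocking step.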

ChuQL by example
Wordcount in ChuQL
  mapreduce {
    input  { fn:collection("hdfs://wiki/") }
    (: record reader: one key/value pair per revision, keyed by username or IP :)
    rr     { for $rev in $hxml:in//revision
             return { "key": fn:data($rev//username | $rev//ip),
                      "val": $rev//title } }
    (: identity map :)
    map    { $hxml:in }
    (: count the values collected for each key :)
    reduce { { "key": $hxml:in=>"key", "val": fn:count($hxml:in=>"val") } }
    (: record writer: one XML element per key :)
    rw     { <author name="{$hxml:in=>"key"}" count="{$hxml:in=>"val"}"/> }
    output { fn:put("hdfs://outputdir/") }
  }
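
To make the data flow of the ChuQL job easier to follow, here is a rough single-process Python analogue of its phases (record reader, map, shuffle, reduce, record writer). It does not use ChuQL or Hadoop; the function names are hypothetical, and the sample XML is a simplified, made-up layout rather than the real Wikipedia dump format.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Simplified, made-up input; a real Wikipedia dump is structured differently.
SAMPLE = """
<wiki>
  <revision><title>Hash join</title><contributor><username>alice</username></contributor></revision>
  <revision><title>MapReduce</title><contributor><ip>10.0.0.1</ip></contributor></revision>
  <revision><title>XQuery</title><contributor><username>alice</username></contributor></revision>
</wiki>
"""

def record_reader(xml_text):
    """rr: emit one (key, val) pair per revision; key is the username or, failing that, the IP."""
    root = ET.fromstring(xml_text.strip())
    for rev in root.iter("revision"):
        who = rev.find(".//username")
        if who is None:
            who = rev.find(".//ip")
        yield who.text, rev.find(".//title").text

def map_phase(pairs):
    """map: identity, as in the ChuQL job above."""
    yield from pairs

def shuffle(pairs):
    """Group values by key, as the MapReduce framework does between map and reduce."""
    groups = defaultdict(list)
    for key, val in pairs:
        groups[key].append(val)
    return groups

def reduce_phase(groups):
    """reduce: count the values collected for each key."""
    for key, vals in groups.items():
        yield key, len(vals)

def record_writer(counts):
    """rw: serialize each (key, count) pair as an XML element string."""
    return [f'<author name="{name}" count="{n}"/>' for name, n in counts]

output = record_writer(reduce_phase(shuffle(map_phase(record_reader(SAMPLE)))))
for line in output:
    print(line)
```

Despite the slide's "Wordcount" title, the job counts revisions per contributor; the structure is the same as a word count with contributors in place of words.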

Architecture
[Figure: Node 1, Node 2, ..., Node N; each node hosts a search engine with a full-text index, a data node, a Hadoop task tracker, and a ChuQL engine; the nodes share HDFS.]
  Hadoop cluster with up to 40 nodes
  Each node runs a search engine with an attached full-text index
  Each node runs an in-memory XQuery processor
  Semistructured data is partitioned and placed on HDFS

Processing Stages
[Figure: search engines and HDFS]
Three Stages
  Preparation Stage
  Blocking Stage
  Matching Stage
