Schema & Ontology Matching: Current Research Directions
AnHai Doan
Database and Information System Group
University of Illinois, Urbana Champaign
Spring 2004

Road Map
• Schema Matching
  - motivation & problem definition
  - representative current solutions: LSD, iMAP, Clio
  - broader picture
• Ontology Matching
  - motivation & problem definition
  - representative current solution: GLUE
  - broader picture
• Conclusions & Emerging Directions

Motivation: Data Integration
[Figure: a user (a new faculty member) poses the query "Find houses with 2 bedrooms priced under 200K" against three sources: realestate.com, homeseekers.com, homes.com]

Architecture of Data Integration System
[Figure: the query "Find houses with 2 bedrooms priced under 200K" is posed against a mediated schema, which sits above source schema 1, source schema 2, and source schema 3 for the sources realestate.com, homeseekers.com, and homes.com]

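Semantic Matches between Schemas
Mediated schema: price, agent-name, address
homes.com:
  listed-price  contact-name  city     state
  320K          Jane Brown    Seattle  WA
  240K          Mike Smith    Miami    FL
The figure labels two kinds of match:
  - 1-1 match: one mediated-schema element corresponds to one source element (e.g., price and listed-price)
  - complex match: one mediated-schema element corresponds to a combination of source elements (e.g., address and city together with state)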
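The two kinds of match can be sketched as functions from a source row to a mediated-schema value. This is only an illustration: the dictionary layout, the function names, and the pairing of agent-name with contact-name are assumptions read off the column names, not something the slide spells out.

```python
# Hypothetical encoding of the matches on the slide. Each mediated-schema
# element maps to a function that computes its value from a homes.com row.

def concat_city_state(row):
    """Complex match: address is built from two source columns."""
    return f"{row['city']}, {row['state']}"

MATCHES = {
    "price": lambda row: row["listed-price"],       # 1-1 match
    "agent-name": lambda row: row["contact-name"],  # 1-1 match (assumed pairing)
    "address": concat_city_state,                   # complex match
}

def to_mediated(row):
    """Translate one homes.com row into the mediated schema."""
    return {target: fn(row) for target, fn in MATCHES.items()}

# The two data rows shown on the slide.
homes_rows = [
    {"listed-price": "320K", "contact-name": "Jane Brown",
     "city": "Seattle", "state": "WA"},
    {"listed-price": "240K", "contact-name": "Mike Smith",
     "city": "Miami", "state": "FL"},
]

mediated_rows = [to_mediated(r) for r in homes_rows]
```

Representing each match as a function keeps 1-1 and complex matches uniform: a 1-1 match is simply the special case that reads a single column.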

Schema Matching is Ubiquitous!
• Fundamental problem in numerous applications
• Databases
  - data integration
  - data translation
  - schema/view integration
  - data warehousing
  - semantic query processing
  - model management
  - peer data management
• AI
  - knowledge bases, ontology merging, information gathering agents, ...
• Web
  - e-commerce
  - marking up data using ontologies (e.g., on Semantic Web)

Why Schema Matching is Difficult
• Schema & data never fully capture semantics!
  - not adequately documented
  - schema creator has retired to Florida!
• Must rely on clues in schema & data
  - using names, structures, types, data values, etc.
• Such clues can be unreliable
  - same names => different entities: area => location or square-feet
  - different names => same entity: area & address => location
• Intended semantics can be subjective
  - house-style = house-description?
  - military applications require committees to decide!
• Cannot be fully automated, needs user feedback!

Current State of Affairs
• Finding semantic mappings is now a key bottleneck!
  - largely done by hand
  - labor intensive & error prone
  - data integration at GTE [Li&Clifton, 2000]
  - 40 databases, 27000 elements, estimated time: 12 years
• Will only be exacerbated
  - data sharing becomes pervasive
  - translation of legacy data
• Need semi-automatic approaches to scale up!
• Many research projects in the past few years
  - Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U of Leipzig, U Wisconsin, NCSU, UIUC, Washington, ...
  - AI: Stanford, Karlsruhe University, NEC Japan, ...

Road Map
• Schema Matching
  - motivation & problem definition
  - representative current solutions: LSD, iMAP, Clio
  - broader picture
• Ontology Matching
  - motivation & problem definition
  - representative current solution: GLUE
  - broader picture
• Conclusions & Emerging Directions

LSD
• Learning Source Description
• Developed at Univ of Washington 2000-2001
  - with Pedro Domingos and Alon Halevy
• Designed for data integration settings
  - has been adapted to several other contexts
• Desirable characteristics
  - learn from previous matching activities
  - exploit multiple types of information in schema and data
  - incorporate domain integrity constraints
  - handle user feedback
  - achieves high matching accuracy (66-97%) on real-world data

Schema Matching for Data Integration: the LSD Approach
Suppose user wants to integrate 100 data sources
1. User
   - manually creates matches for a few sources, say 3
   - shows LSD these matches
2. LSD learns from the matches
3. LSD predicts matches for remaining 97 sources

Learning from the Manual Matches
Mediated schema:          price         agent-name    agent-phone    office-phone  description
Schema of realestate.com: listed-price  contact-name  contact-phone  office        comments
realestate.com data:
  listed-price  contact-name  contact-phone   office          comments
  $250K         James Smith   (305) 729 0831  (305) 616 1822  Fantastic house
  $320K         Mike Doan     (617) 253 1429  (617) 112 2315  Great location
Learned rule: if “fantastic” & “great” occur frequently in data instances => description
homes.com data:
  sold-at  contact-agent   extra-info
  $350K    (206) 634 9435  Beautiful yard

Must Exploit Multiple Types of Information!
Mediated schema:          price         agent-name    agent-phone    office-phone  description
Schema of realestate.com: listed-price  contact-name  contact-phone  office        comments
realestate.com data:
  listed-price  contact-name  contact-phone   office          comments
  $250K         James Smith   (305) 729 0831  (305) 616 1822  Fantastic house
  $320K         Mike Doan     (617) 253 1429  (617) 112 2315  Great location
Learned rules:
  - if “office” occurs in name => office-phone
  - if “fantastic” & “great” occur frequently in data instances => description
homes.com data:
  sold-at  contact-agent   extra-info
  $350K    (206) 634 9435  Beautiful yard
  $230K    (617) 335 4243  Close to Seattle

Multi-Strategy Learning
• Use a set of base learners
  - each exploits well certain types of information
• To match a schema element of a new source
  - apply base learners
  - combine their predictions using a meta-learner
• Meta-learner
  - uses training sources to measure base learner accuracy
  - weighs each learner based on its accuracy

Base Learners
• Training
  - training examples (X1,C1), (X2,C2), ..., (Xm,Cm), where X is an object and C its observed label
  - training yields a classification model (hypothesis)
• Matching
  - for an object X, the model returns labels weighted by confidence score
• Name Learner
  - training: (“location”, address), (“contact name”, name)
  - matching: agent-name => (name,0.7), (phone,0.3)
• Naive Bayes Learner
  - training: (“Seattle, WA”, address), (“250K”, price)
  - matching: “Kent, WA” => (address,0.8), (name,0.2)
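A minimal sketch of a base learner with the train/match interface described above, assuming a token-level multinomial Naive Bayes with add-one smoothing. The class name, the tokenizer, and the smoothing choice are illustrative assumptions; the slides do not say how LSD's Naive Bayes learner is implemented, and the confidence values it prints on the slide will not be reproduced by this toy.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split into alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayesLearner:
    """Toy base learner: trains on (instance, label) pairs and returns
    labels weighted by a normalized confidence score."""

    def __init__(self):
        self.label_counts = Counter()             # examples seen per label
        self.token_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def match(self, text):
        tokens = tokenize(text)
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            # Add-one (Laplace) smoothing over the training vocabulary.
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        # Normalize log scores into confidences that sum to 1.
        top = max(log_scores.values())
        exp = {lab: math.exp(s - top) for lab, s in log_scores.items()}
        z = sum(exp.values())
        return {lab: v / z for lab, v in exp.items()}

learner = NaiveBayesLearner()
learner.train([
    ("Seattle, WA", "address"),
    ("Miami, FL", "address"),
    ("250K", "price"),
    ("320K", "price"),
])
scores = learner.match("Kent, WA")
```

With this tiny training set, "Kent, WA" scores higher for address than for price only because the token "wa" was seen in an address; a realistic learner would need far more training instances.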

The LSD Architecture
[Figure: two phases. Training Phase: from the mediated schema and source schemas, training data for base learners is used to train Base-Learner 1 ... Base-Learner k, giving Hypothesis 1 ... Hypothesis k; the Meta-Learner learns weights for the base learners. Matching Phase: Base-Learner 1 ... Base-Learner k and the Meta-Learner produce predictions for instances; the Prediction Combiner produces predictions for elements; the Constraint Handler applies domain constraints and outputs mappings.]

Training the Base Learners
Mediated schema: address, price, agent-name, agent-phone, office-phone, description
realestate.com:
  location    price  contact-name  contact-phone   office          comments
  Miami, FL   $250K  James Smith   (305) 729 0831  (305) 616 1822  Fantastic house
  Boston, MA  $320K  Mike Doan     (617) 253 1429  (617) 112 2315  Great location
Name Learner training examples:
  (“location”, address), (“price”, price), (“contact name”, agent-name), (“contact phone”, agent-phone), (“office”, office-phone), (“comments”, description)
Naive Bayes Learner training examples:
  (“Miami, FL”, address), (“$250K”, price), (“James Smith”, agent-name), (“(305) 729 0831”, agent-phone), (“(305) 616 1822”, office-phone), (“Fantastic house”, description), (“Boston,MA”, address)

Meta-Learner: Stacking [Wolpert 92, Ting&Witten 99]
• Training
  - uses training data to learn weights
  - one for each (base-learner, mediated-schema element) pair
  - weight (Name-Learner, address) = 0.2
  - weight (Naive-Bayes, address) = 0.8
• Matching: combine predictions of base learners
  - computes weighted average of base-learner confidence scores
  - example: source element area with instances Seattle, WA; Kent, WA; Bend, OR
      Name Learner: (address, 0.4)
      Naive Bayes:  (address, 0.9)
      Meta-Learner: (address, 0.4*0.2 + 0.9*0.8 = 0.8)
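The matching step on this slide is a weighted combination of base-learner confidences, which can be sketched as below. The function name and data layout are hypothetical, and the sketch takes the weights as given rather than learning them from training data. Because the two weights on the slide sum to 1, the weighted sum already equals the weighted average.

```python
# Weights per (base learner, mediated-schema element) pair, as on the slide.
WEIGHTS = {
    ("name_learner", "address"): 0.2,
    ("naive_bayes", "address"): 0.8,
}

def combine(predictions, weights, label):
    """Weighted combination of base-learner confidences for one label.

    predictions: {learner_name: {label: confidence}}
    weights:     {(learner_name, label): weight}
    Missing weights or confidences count as 0.
    """
    total = 0.0
    for learner, scores in predictions.items():
        total += weights.get((learner, label), 0.0) * scores.get(label, 0.0)
    return total

# Base-learner predictions from the slide's example.
predictions = {
    "name_learner": {"address": 0.4},
    "naive_bayes": {"address": 0.9},
}
score = combine(predictions, WEIGHTS, "address")
print(round(score, 2))  # 0.4*0.2 + 0.9*0.8 = 0.8
```

Keeping one weight per (learner, element) pair lets a learner be trusted for some mediated-schema elements and discounted for others.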

The LSD Architecture
[Figure: two phases. Training Phase: from the mediated schema and source schemas, training data for base learners is used to train Base-Learner 1 ... Base-Learner k, giving Hypothesis 1 ... Hypothesis k; the Meta-Learner learns weights for the base learners. Matching Phase: Base-Learner 1 ... Base-Learner k and the Meta-Learner produce predictions for instances; the Prediction Combiner produces predictions for elements; the Constraint Handler applies domain constraints and outputs mappings.]

Applying the Learners
homes.com schema: area, sold-at, contact-agent, extra-info
[Figure: for the element area with instances Seattle, WA; Kent, WA; Bend, OR, the Name Learner and the Naive Bayes learner each produce predictions, the Meta-Learner combines them for each instance, and the Prediction-Combiner merges the results for the element. Predictions shown: Name Learner (address,0.8), (description,0.2); Naive Bayes (address,0.6), (description,0.4); Meta-Learner (address,0.7), (description,0.3)]
Predictions for homes.com elements:
  area:          (address,0.7), (description,0.3)
  sold-at:       (price,0.9), (agent-phone,0.1)
  contact-agent: (agent-phone,0.9), (description,0.1)
  extra-info:    (address,0.6), (description,0.4)
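One way to picture the Prediction-Combiner step is as a merge of per-instance predictions into a single per-element prediction. The sketch below assumes plain averaging, which the slide does not confirm. The function name is hypothetical, and the three instance-level predictions are made-up values chosen so that their average matches the element-level prediction shown for area.

```python
def average_predictions(instance_predictions):
    """Average per-instance {label: confidence} dicts into one dict.

    Assumes simple averaging; a label missing from an instance counts as 0.
    """
    if not instance_predictions:
        raise ValueError("need at least one instance prediction")
    totals = {}
    for scores in instance_predictions:
        for label, conf in scores.items():
            totals[label] = totals.get(label, 0.0) + conf
    n = len(instance_predictions)
    return {label: total / n for label, total in totals.items()}

# Hypothetical meta-learner outputs for the three instances of "area".
area_instances = [
    {"address": 0.7, "description": 0.3},  # "Seattle, WA" (illustrative)
    {"address": 0.8, "description": 0.2},  # "Kent, WA" (illustrative)
    {"address": 0.6, "description": 0.4},  # "Bend, OR" (illustrative)
]

area_prediction = average_predictions(area_instances)
best_label = max(area_prediction, key=area_prediction.get)
```

In the architecture figure, element-level predictions like this one are then passed to the Constraint Handler, which this sketch does not model.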
