Schema & Ontology Matching: Current Research Directions
AnHai Doan
Database and Information System Group
University of Illinois, Urbana Champaign
Spring 2004
Road Map
Schema Matching
– motivation & problem definition
– representative current solutions: LSD, iMAP, Clio
– broader picture
Ontology Matching
– motivation & problem definition
– representative current solution: GLUE
– broader picture
Conclusions & Emerging Directions
Motivation: Data Integration
New faculty member: Find houses with 2 bedrooms priced under 200K
Sources: homes.com, realestate.com, homeseekers.com
Architecture of Data Integration System
[Figure: the query "Find houses with 2 bedrooms priced under 200K" is posed against the mediated schema, which sits above source schemas 1, 2 and 3 for the sources homes.com, realestate.com and homeseekers.com]
Semantic Matches between Schemas
Mediated-schema: price, agent-name, address
homes.com: listed-price, contact-name, city, state
  320K | Jane Brown | Seattle | WA
  240K | Mike Smith | Miami | FL
Match types illustrated: 1-1 match, complex match
Schema Matching is Ubiquitous!
Fundamental problem in numerous applications
Databases
– data integration
– data translation
– schema/view integration
– data warehousing
– semantic query processing
– model management
– peer data management
AI
– knowledge bases, ontology merging, information gathering agents, ...
Web
– e-commerce
– marking up data using ontologies (e.g., on Semantic Web)
Why Schema Matching is Difficult
Schema & data never fully capture semantics!
– not adequately documented
– schema creator has retired to Florida!
Must rely on clues in schema & data
– using names, structures, types, data values, etc.
Such clues can be unreliable
– same names => different entities: area => location or square-feet
– different names => same entity: area & address => location
Intended semantics can be subjective
– house-style = house-description?
– military applications require committees to decide!
Cannot be fully automated, needs user feedback!
Current State of Affairs
Finding semantic mappings is now a key bottleneck!
– largely done by hand
– labor intensive & error prone
– data integration at GTE [Li&Clifton, 2000]: 40 databases, 27000 elements, estimated time: 12 years
Will only be exacerbated
– data sharing becomes pervasive
– translation of legacy data
Need semi-automatic approaches to scale up!
Many research projects in the past few years
– Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U of Leipzig, U Wisconsin, NCSU, UIUC, Washington, ...
– AI: Stanford, Karlsruhe University, NEC Japan, ...
LSD
Learning Source Description
Developed at Univ of Washington 2000-2001
– with Pedro Domingos and Alon Halevy
Designed for data integration settings
– has been adapted to several other contexts
Desirable characteristics
– learns from previous matching activities
– exploits multiple types of information in schema and data
– incorporates domain integrity constraints
– handles user feedback
– achieves high matching accuracy (66-97%) on real-world data
Schema Matching for Data Integration: the LSD Approach
Suppose user wants to integrate 100 data sources
1. User
– manually creates matches for a few sources, say 3
– shows LSD these matches
2. LSD learns from the matches
3. LSD predicts matches for remaining 97 sources
Learning from the Manual Matches
Mediated schema: price, agent-name, agent-phone, office-phone, description
Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments
  $250K | James Smith | (305) 729 0831 | (305) 616 1822 | Fantastic house
  $320K | Mike Doan | (617) 253 1429 | (617) 112 2315 | Great location
Learned clue: if “fantastic” & “great” occur frequently in data instances => description
Schema of homes.com: sold-at, contact-agent, extra-info
  $350K | (206) 634 9435 | Beautiful yard
Must Exploit Multiple Types of Information!
Mediated schema: price, agent-name, agent-phone, office-phone, description
Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments
  $250K | James Smith | (305) 729 0831 | (305) 616 1822 | Fantastic house
  $320K | Mike Doan | (617) 253 1429 | (617) 112 2315 | Great location
Schema of homes.com: sold-at, contact-agent, extra-info
  $350K | (206) 634 9435 | Beautiful yard
  $230K | (617) 335 4243 | Close to Seattle
Learned clues
– if “fantastic” & “great” occur frequently in data instances => description
– if “office” occurs in name => office-phone
Multi-Strategy Learning
Use a set of base learners
– each exploits well certain types of information
To match a schema element of a new source
– apply base learners
– combine their predictions using a meta-learner
Meta-learner
– uses training sources to measure base learner accuracy
– weighs each learner based on its accuracy
Base Learners
Training: from training examples (X1,C1), (X2,C2), ..., (Xm,Cm), each an object with its observed label, learn a classification model (hypothesis)
Matching: given an object X, return labels weighted by confidence score
Name Learner
– training: (“location”, address), (“contact name”, name)
– matching: agent-name => (name,0.7), (phone,0.3)
Naive Bayes Learner
– training: (“Seattle, WA”, address), (“250K”, price)
– matching: “Kent, WA” => (address,0.8), (name,0.2)
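The train-then-match interface of a base learner can be sketched with a tiny token-level Naive Bayes classifier. This is only an illustrative sketch, not LSD's implementation: the class name `TinyNaiveBayes`, the tokenizer, and the Laplace smoothing are assumptions, and the training pairs are the (instance, label) examples shown on the "Training the Base Learners" slide. The confidences it produces will not reproduce the 0.8 / 0.2 figures on the slide.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Hypothetical tokenizer: words, digit runs, and single symbols.
    return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", text.lower())

class TinyNaiveBayes:
    """Illustrative multinomial Naive Bayes over tokens (assumed design)."""

    def fit(self, examples):
        self.token_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        """Return (label, confidence) pairs, highest confidence first."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            # Laplace smoothing: add one to every token count.
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokenize(text):
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        # Normalize the log scores into confidences that sum to 1.
        top = max(log_scores.values())
        exps = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exps.values())
        return sorted(((l, v / z) for l, v in exps.items()), key=lambda p: -p[1])

TRAINING = [
    ("Miami, FL", "address"), ("$250K", "price"),
    ("James Smith", "agent-name"), ("(305) 729 0831", "agent-phone"),
    ("(305) 616 1822", "office-phone"), ("Fantastic house", "description"),
    ("Boston, MA", "address"),
]

model = TinyNaiveBayes().fit(TRAINING)
best_label, confidence = model.predict("Kent, WA")[0]
print(best_label, round(confidence, 2))
```

With so few examples the comma token is what pushes "Kent, WA" toward address, which illustrates why a learner of this kind exploits frequencies of words and symbols rather than exact values.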
The LSD Architecture
Training Phase
– inputs: mediated schema, source schemas
– training data for base learners => Base-Learner1 ... Base-Learnerk => Hypothesis1 ... Hypothesisk
– Meta-Learner => weights for base learners
Matching Phase
– Base-Learner1 ... Base-Learnerk => Meta-Learner => predictions for instances
– Prediction Combiner => predictions for elements
– Constraint Handler, using domain constraints => mappings
Training the Base Learners
Mediated schema: address, price, agent-name, agent-phone, office-phone, description
realestate.com: location, price, contact-name, contact-phone, office, comments
  Miami, FL | $250K | James Smith | (305) 729 0831 | (305) 616 1822 | Fantastic house
  Boston, MA | $320K | Mike Doan | (617) 253 1429 | (617) 112 2315 | Great location
Name Learner training examples
– (“location”, address), (“price”, price), (“contact name”, agent-name), (“contact phone”, agent-phone), (“office”, office-phone), (“comments”, description)
Naive Bayes Learner training examples
– (“Miami, FL”, address), (“$250K”, price), (“James Smith”, agent-name), (“(305) 729 0831”, agent-phone), (“(305) 616 1822”, office-phone), (“Fantastic house”, description), (“Boston, MA”, address)
Meta-Learner: Stacking [Wolpert 92, Ting&Witten 99]
Training
– uses training data to learn weights
– one for each (base-learner, mediated-schema element) pair
– weight (Name-Learner, address) = 0.2
– weight (Naive-Bayes, address) = 0.8
Matching: combine predictions of base learners
– computes weighted average of base-learner confidence scores
Example: element area with instances Seattle, WA; Kent, WA; Bend, OR
– Name Learner: (address,0.4)
– Naive Bayes: (address,0.9)
– Meta-Learner: (address, 0.4*0.2 + 0.9*0.8 = 0.8)
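The combination step can be sketched in a few lines. This covers only the matching-time weighted combination with the weights from the slide; how stacking learns those weights is not shown, and the function name `combine_predictions` and the dictionary layout are assumptions made for illustration.

```python
def combine_predictions(predictions, weights):
    """Weighted combination of base-learner confidence scores.

    predictions: {learner: {label: confidence}}
    weights:     {(learner, label): weight}, one per (learner, label) pair
    """
    combined = {}
    for learner, scores in predictions.items():
        for label, confidence in scores.items():
            w = weights.get((learner, label), 0.0)
            combined[label] = combined.get(label, 0.0) + w * confidence
    return combined

# Numbers taken from the slide's example for the element "area".
weights = {("Name Learner", "address"): 0.2, ("Naive Bayes", "address"): 0.8}
preds = {"Name Learner": {"address": 0.4}, "Naive Bayes": {"address": 0.9}}

result = combine_predictions(preds, weights)
print(round(result["address"], 2))  # 0.4*0.2 + 0.9*0.8
```

Keeping one weight per (learner, label) pair lets a learner count heavily for the labels it predicts well and lightly for the others.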
Applying the Learners
homes.com schema: area, sold-at, contact-agent, extra-info
For area (instances: Seattle, WA; Kent, WA; Bend, OR)
– Name Learner and Naive Bayes feed the Meta-Learner, giving one prediction per instance: (address,0.8), (description,0.2); (address,0.6), (description,0.4); (address,0.7), (description,0.3)
– Prediction-Combiner merges them into (address,0.7), (description,0.3)
Predictions for the other elements
– sold-at: (price,0.9), (agent-phone,0.1)
– contact-agent: (agent-phone,0.9), (description,0.1)
– extra-info: (address,0.6), (description,0.4)
Domain Constraints
Encode user knowledge about domain
Specified only once, by examining mediated schema
Examples
– at most one source-schema element can match address
– if a source-schema element matches house-id then it is a key
– avg-value(price) > avg-value(num-baths)
Given a mapping combination
– can verify if it satisfies a given constraint
– e.g., area: address, sold-at: price, contact-agent: agent-phone, extra-info: address
The Constraint Handler
Predictions from Prediction Combiner
– area: (address,0.7), (description,0.3)
– sold-at: (price,0.9), (agent-phone,0.1)
– contact-agent: (agent-phone,0.9), (description,0.1)
– extra-info: (address,0.6), (description,0.4)
Domain Constraints
– at most one element matches address
Mapping combinations and their scores
– area: address, sold-at: price, contact-agent: agent-phone, extra-info: address: 0.7 * 0.9 * 0.9 * 0.6 = 0.3402 (violates the constraint)
– area: address, sold-at: price, contact-agent: agent-phone, extra-info: description: 0.7 * 0.9 * 0.9 * 0.4 = 0.2268
– area: description, sold-at: agent-phone, contact-agent: description, extra-info: description: 0.3 * 0.1 * 0.1 * 0.4 = 0.0012
Searches space of mapping combinations efficiently
Can handle arbitrary constraints
Also used to incorporate user feedback
– e.g., sold-at does not match price
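The scoring of mapping combinations can be reproduced with a brute-force sketch. It enumerates every combination, which is only workable for a toy input; the slide states that the real constraint handler searches the space efficiently, and that search is not shown here. The function names are hypothetical; the numbers are those on the slide.

```python
from itertools import product
from math import prod

# Element-level predictions from the Prediction Combiner (slide values).
predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "sold-at": {"price": 0.9, "agent-phone": 0.1},
    "contact-agent": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}

def at_most_one_address(mapping):
    """Domain constraint: at most one element matches address."""
    return sum(1 for label in mapping.values() if label == "address") <= 1

def best_mapping(predictions, constraints):
    """Exhaustively score all combinations; keep the best one that is valid."""
    elements = list(predictions)
    best, best_score = None, 0.0
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue
        # Score of a combination = product of its confidence scores.
        score = prod(predictions[e][label] for e, label in mapping.items())
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score

mapping, score = best_mapping(predictions, [at_most_one_address])
print(mapping, round(score, 4))
```

User feedback such as "sold-at does not match price" fits the same shape: it is one more predicate, for example `lambda m: m["sold-at"] != "price"`, appended to the constraint list.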
The Current LSD System
Can also handle data in XML format
– matches XML DTDs
Base learners
– Naive Bayes [Duda&Hart-93, Domingos&Pazzani-97]: exploits frequencies of words & symbols
– WHIRL Nearest-Neighbor Classifier [Cohen&Hirsh KDD-98]: employs information-retrieval similarity metric
– Name Learner [SIGMOD-01]: matches elements based on their names
– County-Name Recognizer [SIGMOD-01]: stores all U.S. county names
– XML Learner [SIGMOD-01]: exploits hierarchical structure of XML data
Empirical Evaluation
Four domains
– Real Estate I & II, Course Offerings, Faculty Listings
For each domain
– created mediated schema & domain constraints
– chose five sources
– extracted & converted data into XML
– mediated schemas: 14-66 elements, source schemas: 13-48
Ten runs for each domain, in each run:
– manually provided 1-1 matches for 3 sources
– asked LSD to propose matches for remaining 2 sources
– accuracy = % of 1-1 matches correctly identified
High Matching Accuracy
[Figure: bar chart of average matching accuracy (%) for Real Estate I, Real Estate II, Course Offerings, Faculty Listings]
LSD’s accuracy: 71-92%
– best single base learner: 42-72%
– + meta-learner: + 5-22%
– + constraint handler: + 7-13%
– + XML learner: + 0.8-6%
Contribution of Schema vs. Data
[Figure: bar chart of average matching accuracy (%) for Real Estate I, Real Estate II, Course Offerings, Faculty Listings, comparing LSD with only schema info., LSD with only data info., and complete LSD]
More experiments in [Doan et al. SIGMOD-01]
LSD Summary
LSD
– learns from previous matching activities
– exploits multiple types of information by employing multi-strategy learning
– incorporates domain constraints & user feedback
– achieves high matching accuracy
LSD focuses on 1-1 matches
Next challenge: discover more complex matches!
– iMAP (illinois Mapping) system [SIGMOD-04]
– developed at Washington and Illinois, 2002-2004
– with Robin Dhamankar, Yoonkyong Lee, Alon Halevy, Pedro Domingos
The iMAP Approach
Mediated-schema: price, num-baths, address
homes.com: listed-price, agent-id, full-baths, half-baths, city, zipcode
  320K | 53211 | 2 | 1 | Seattle | 98105
  240K | 11578 | 1 | 1 | Miami | 23591
For each mediated-schema element
– searches space of all matches
– finds a small set of likely match candidates
– uses LSD to evaluate them
To search efficiently
– employs a specialized searcher for each element type
– Text Searcher, Numeric Searcher, Category Searcher, ...
The iMAP Architecture [SIGMOD-04]
[Figure: components are the mediated schema, source schema + data, domain knowledge and data, Searcher1, Searcher2, ..., Searcherk producing match candidates, Base-Learner1 ... Base-Learnerk and the Meta-Learner producing a similarity matrix, the match selector producing 1-1 and complex matches, and an explanation module facing the user]
An Example: Text Searcher
Beam search in space of all concatenation matches
Example: find match candidates for address
Mediated-schema: price, num-baths, address
homes.com: listed-price, agent-id, full-baths, half-baths, city, zipcode
  320K | 532a | 2 | 1 | Seattle | 98105
  240K | 115c | 1 | 1 | Miami | 23591
Candidate concatenations and their values
– concat(agent-id,zipcode): 532a 98105; 115c 23591
– concat(city,zipcode): Seattle 98105; Miami 23591
– concat(agent-id,city): 532a Seattle; 115c Miami
Best match candidates for address
– (agent-id,0.7), (concat(agent-id,city),0.75), (concat(city,zipcode),0.9)
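A beam search over concatenations can be sketched as follows. The search loop is generic; the scoring function `toy_address_score` is purely a stand-in (a hand-written pattern check), whereas the slides say iMAP evaluates candidates with LSD's learners. Its scores therefore do not reproduce the 0.7 / 0.75 / 0.9 values above, and `beam_width` and `max_len` are illustrative parameters.

```python
import re

# Source rows from the slide's homes.com example.
ROWS = [
    {"listed-price": "320K", "agent-id": "532a", "full-baths": "2",
     "half-baths": "1", "city": "Seattle", "zipcode": "98105"},
    {"listed-price": "240K", "agent-id": "115c", "full-baths": "1",
     "half-baths": "1", "city": "Miami", "zipcode": "23591"},
]
COLUMNS = list(ROWS[0])

def toy_address_score(combo):
    """Hypothetical scorer: full credit for '<word> <5 digits>', half for a part."""
    total = 0.0
    for row in ROWS:
        text = " ".join(row[col] for col in combo)
        if re.fullmatch(r"[A-Za-z]+ \d{5}", text):
            total += 1.0
        elif re.fullmatch(r"[A-Za-z]+|\d{5}", text):
            total += 0.5
    return total / len(ROWS)

def beam_search_concat(columns, score, beam_width=2, max_len=2):
    """Keep the beam_width best concatenations at each length."""
    beam = sorted(((score((c,)), (c,)) for c in columns), reverse=True)[:beam_width]
    candidates = list(beam)
    for _ in range(max_len - 1):
        expanded = []
        for _, combo in beam:
            for col in columns:
                if col not in combo:
                    new_combo = combo + (col,)
                    expanded.append((score(new_combo), new_combo))
        beam = sorted(expanded, reverse=True)[:beam_width]
        candidates.extend(beam)
    return sorted(candidates, reverse=True)

ranked = beam_search_concat(COLUMNS, toy_address_score)
print(ranked[0])
```

The beam keeps the search tractable because only the best few partial concatenations are extended, rather than every ordered subset of columns.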
Empirical Evaluation
Current iMAP system
– 12 searchers
Four real-world domains
– real estate, product inventory, cricket, financial wizard
– target schema: 19-42 elements, source schema: 32-44
Accuracy: 43-92%
Sample discovered matches
– agent-name = concat(first-name,last-name)
– area = building-area / 43560
– discount-cost = (unit-price * quantity) * (1 - discount)
More detail in [Dhamankar et al. SIGMOD-04]
Observations
Finding complex matches much harder than 1-1 matches!
– require gluing together many components
– e.g., num-rooms = bath-rooms + bed-rooms + dining-rooms + living-rooms
– if missing one component => incorrect match
However, even partial matches are already very useful!
– so are top-k matches => need methods to handle partial/top-k matches
Huge/infinite search spaces
– domain knowledge plays a crucial role!
Matches are fairly complex, hard to know if they are correct
– must be able to explain matches
Human must be fairly active in the loop
– need strong user interaction facilities
Break matching architecture into multiple "atomic" boxes!
Finding Matches is only Half of the Job!
To translate data/queries, need mappings, not matches
Schema S
HOUSES: location | price ($) | agent-id
  Atlanta, GA | 360,000 | 32
  Raleigh, NC | 430,000 | 15
AGENTS: id | name | city | state | fee-rate
  32 | Mike Brown | Athens | GA | 0.03
  15 | Jean Laup | Raleigh | NC | 0.04
Schema T
LISTINGS: area | list-price | agent-address | agent-name
  Denver, CO | 550,000 | Boulder, CO | Laura Smith
  Atlanta, GA | 370,800 | Athens, GA | Mike Brown
Mappings
– area = SELECT location FROM HOUSES
– agent-address = SELECT concat(city,state) FROM AGENTS
– list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
Clio: Elaborating Matches into Mappings
Developed at Univ of Toronto & IBM Almaden, 2000-2003
– by Renee Miller, Laura Haas, Mauricio Hernandez, Lucian Popa, Howard Ho, Ling Yan, Ron Fagin
Given a match
– list-price = price * (1 + fee-rate)
Refine it into a mapping
– list-price = SELECT price * (1 + fee-rate) FROM HOUSES (FULL OUTER JOIN) AGENTS WHERE agent-id = id
Need to discover
– the correct join path among tables, e.g., agent-id = id
– the correct join, e.g., full outer join? inner join?
Use heuristics to decide
– when in doubt, ask users
– employ sophisticated user interaction methods [VLDB-00, SIGMOD-01]
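To make the difference between a match and a mapping concrete, the example tables can be loaded into an in-memory SQLite database and the refined mapping run as a query. This is a sketch, not Clio's output: the underscore column names, the inner join, and the `||` concatenation are choices made here for SQLite; whether an inner or an outer join is right is exactly the question the slide says must be discovered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE houses (location TEXT, price REAL, agent_id INTEGER);
CREATE TABLE agents (id INTEGER, name TEXT, city TEXT, state TEXT, fee_rate REAL);
INSERT INTO houses VALUES ('Atlanta, GA', 360000, 32), ('Raleigh, NC', 430000, 15);
INSERT INTO agents VALUES (32, 'Mike Brown', 'Athens', 'GA', 0.03),
                          (15, 'Jean Laup', 'Raleigh', 'NC', 0.04);
""")

# Mapping: join path agent_id = id, with an inner join assumed for this sketch.
rows = conn.execute("""
SELECT h.location                 AS area,
       h.price * (1 + a.fee_rate) AS list_price,
       a.city || ', ' || a.state  AS agent_address,
       a.name                     AS agent_name
FROM houses AS h
JOIN agents AS a ON h.agent_id = a.id
ORDER BY h.agent_id DESC
""").fetchall()

for row in rows:
    print(row)
```

The first row printed corresponds to the Atlanta, GA listing at 370,800 with agent address Athens, GA in the LISTINGS example, which is what the match alone could not produce without the join condition.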
Broader Picture: Find Matches
Systems
– SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95], DELTA [Clifton et al. 97], AutoMatch, Autoplex [Berlin & Motro, 01-03]
– TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98], CUPID [Madhavan et al. 01]
– LSD [Doan et al., SIGMOD-01], iMAP [Dhamankar et al., SIGMOD-04]
Characteristics shown for groups of systems
– single learner; exploit data; 1-1 matches
– hand-crafted rules; exploit schema; 1-1 matches
– learners + rules, use multi-strategy learning; exploit schema + data; 1-1 + complex matches; exploit domain constraints
Other Important Works
– COMA by Erhard Rahm group
– David Embley group at BYU
– Jaewoo Kang group at NCSU
– Kevin Chang group at UIUC
– Clement Yu group at UIC
More about some of these works soon ....
Broader Picture: From Matches Broader Picture: From Matches to Mappings to Mappings
iMAP [Dhamanka et al., SIGMOD-04] CLIO [Miller et. al., 00] [Yan et al. 01]
Rules Exploit data Powerful user interaction Learners + rules Exploit schema + data 1-1 + complex matches Automate as much as possible
?
Ontology Matching
Increasingly critical for
– knowledge bases, Semantic Web
An ontology
– concepts organized into a taxonomy tree
– each concept has a set of attributes and a set of instances
– relations among concepts
Matching
– concepts
– attributes
– relations
[Figure: taxonomy for CS Dept. US with nodes Entity, Undergrad Courses, Grad Courses, People, Staff, Faculty, Assistant Professor, Associate Professor, Professor; sample instance with name: Mike Burns, degree: Ph.D.]
Matching Taxonomies of Concepts
CS Dept. US: Entity, Undergrad Courses, Grad Courses, People, Staff, Faculty, Assistant Professor, Associate Professor, Professor
CS Dept. Australia: Entity, Courses, Staff, Technical Staff, Academic Staff, Lecturer, Senior Lecturer, Professor
Glue [Doan, Madhavan, Domingos, Halevy; WWW’2002]
Solution
– Use data instances extensively
– Learn classifiers using information within taxonomies
– Use a rich constraint satisfaction scheme
Concept Similarity
Multiple similarity measures in terms of the JPD
Joint Probability Distribution: P(A, S), P(¬A, S), P(A, ¬S), P(¬A, ¬S)
[Figure: Concept A and Concept S drawn as overlapping sets in a hypothetical universe of all examples, with regions (A,¬S), (A,S), (¬A,S), (¬A,¬S)]
$$\mathrm{Sim}(\text{Concept } A, \text{Concept } S) = \frac{P(A \cap S)}{P(A \cup S)} = \frac{P(A,S)}{P(A,\neg S) + P(A,S) + P(\neg A,S)}$$
[Jaccard, 1908]
Machine Learning for Computing Similarities
JPD estimated by counting the sizes of the partitions
[Figure: Taxonomy 1 instances are labeled A / ¬A and Taxonomy 2 instances are labeled S / ¬S; classifier CLS (for S) is applied to Taxonomy 1 and classifier CLA (for A) to Taxonomy 2, splitting each taxonomy's instances into the partitions (A,S), (A,¬S), (¬A,S), (¬A,¬S)]
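Once the four partitions have been counted, the Jaccard similarity follows directly. The sketch below assumes the partition sizes are already available, for example summed over both taxonomies after the classifiers have labeled the instances; the function name and the sample counts are made up for illustration and are not results from the talk.

```python
def jaccard_from_counts(n_a_s, n_a_not_s, n_not_a_s, n_not_a_not_s):
    """Estimate Sim(A, S) = P(A,S) / (P(A,not S) + P(A,S) + P(not A,S)).

    Each argument is the number of instances falling in one partition.
    """
    total = n_a_s + n_a_not_s + n_not_a_s + n_not_a_not_s
    if total == 0:
        return 0.0
    # Joint probabilities estimated as relative partition sizes.
    p_a_s = n_a_s / total
    p_a_not_s = n_a_not_s / total
    p_not_a_s = n_not_a_s / total
    union = p_a_s + p_a_not_s + p_not_a_s
    return p_a_s / union if union > 0 else 0.0

# Hypothetical counts: 30 instances in both concepts, 10 only in A, 20 only in S.
print(round(jaccard_from_counts(30, 10, 20, 40), 3))
```

The (¬A,¬S) count affects only the normalization, which cancels in the ratio, so the measure depends on how much the two concepts overlap relative to their union.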
The Glue System
[Figure: Taxonomy O1 and Taxonomy O2 (tree structure + data instances) feed the Distribution Estimator (Base Learners, Meta Learner), which outputs the Joint Probability Distribution P(A,B), P(A’, B)…; the Similarity Estimator applies a Similarity Function to produce a Similarity Matrix; Relaxation Labeling, using Common Knowledge & Domain Constraints, outputs Matches for O1 and Matches for O2]
Constraints in Taxonomy Matching
Domain-dependent
– at most one node matches department-chair
– a node that matches professor cannot be a child of a node that matches assistant-professor
Domain-independent
– two nodes match if parents & children match
– if all children of X match Y, then X also matches Y
– variations have been exploited in many restricted settings [Melnik&Garcia-Molina,ICDE-02], [Milo&Zohar,VLDB-98], [Noy et al., IJCAI-01], [Madhavan et al., VLDB-01]
Challenge: find a general & efficient approach
Solution: Relaxation Labeling
Relaxation labeling [Hummel&Zucker, 83]
– applied to graph labeling in vision, NLP, hypertext classification
– finds best label assignment, given a set of constraints
– starts with initial label assignment
– iteratively improves labels, using constraints
Standard relaxation labeling not applicable
– extended it in many ways [Doan et al., WWW-02]
Real World Experiments
Taxonomies on the web
– University organization (UW and Cornell): colleges, departments and sub-fields
– Companies (Yahoo and The Standard): industries and sectors
For each taxonomy
– extract data instances: course descriptions, company profiles
– trivial data cleaning
– 100-300 concepts per taxonomy
– 3-4 depth of taxonomies
– 10-90 data instances per concept
Evaluation against manual mappings as the gold standard
Glue’s Performance
[Figure: bar chart of matching accuracy (%) for Name Learner, Content Learner, Meta Learner and Relaxation Labeler on University Depts 1 (Cornell to Wash., Wash. to Cornell), University Depts 2 (Cornell to Wash., Wash. to Cornell) and Company Profiles (Standard to Yahoo, Yahoo to Standard)]
Broader Picture
Ontology matching parallels the development of schema matching
– rule-based & learning-based approaches
– PROMPT family, OntoMorph, OntoMerge, Chimaera, Onion, OBSERVER, FCAMerge, ...
– extensive work by Ed Hovy's group
– ontology versioning (e.g., by Noy et al.)
More powerful user interaction methods
– e.g., iPROMPT, Chimaera
Much more theoretical work in this area
Develop the Theoretical Foundation
Not much is going on, however ...
– see works by Alon Halevy (AAAI-02) and Phil Bernstein (in model management contexts)
– some preliminary work in AnHai Doan's Ph.D. dissertation
– work by Stuart Russell and other AI people on identity uncertainty is potentially relevant
Most likely foundation
– probability framework
Need Much More Domain Knowledge
Where to get it?
– past matches (e.g., LSD, iMAP)
– other schemas in the domain
– holistic matching approach by Kevin Chang group [SIGMOD-02]
– corpus-based matching by Alon Halevy group [IJCAI-03]
– clustering to achieve bridging effects by Clement Yu group [SIGMOD-04]
– external data (e.g., iMAP at SIGMOD-04)
– mass of users (e.g., MOBS at WebDB-03)
How to get it and how to use it?
– no clear answer yet
Employ Multi-Module Architecture
Many "black boxes", each good at doing a single thing
Combine them and tailor them to each application
Examples
– LSD, iMAP, COMA, David Embley's systems
Open issues
– what are these black boxes?
– how to build them?
– how to combine them?
Powerful User Interaction
Minimize user effort, maximize its impact
Make it very easy for users to
– supply domain knowledge
– provide feedback on matches/mappings
Develop powerful explanation facilities
Other Issues
– What to do with partial/top-k matches?
– Meaning negotiation
– Fortifying schemas for interoperability
– Very-large-scale matching scenarios (e.g., the Web)
– What can we do without the mappings?
– Interaction between schema matching and tuple matching?
– Benchmarks, tools?
Summary
Schema/ontology matching: key to numerous data management problems
– much attention in the database, AI, Semantic Web communities
Simple problem definition, yet very difficult to do
– no satisfactory solution yet
– AI complete?
We now understand the problems much better
– still at the beginning of the journey
– will need techniques from multiple fields