Schema & Ontology Matching


Schema & Ontology Matching: Current Research Directions. AnHai Doan, Database and Information System Group, University of Illinois, Urbana Champaign, Spring 2004.


SLIDE 1

Schema & Ontology Matching: Current Research Directions

AnHai Doan

Database and Information System Group, University of Illinois, Urbana Champaign, Spring 2004

SLIDE 2

Road Map

Schema Matching

– motivation & problem definition
– representative current solutions: LSD, iMAP, Clio
– broader picture

Ontology Matching

– motivation & problem definition
– representative current solution: GLUE
– broader picture

Conclusions & Emerging Directions

SLIDE 3

Motivation: Data Integration

New faculty member: "Find houses with 2 bedrooms priced under 200K"

Sources: homes.com, realestate.com, homeseekers.com

SLIDE 4

Architecture of Data Integration System

Query: Find houses with 2 bedrooms priced under 200K

[Figure: the query is posed against the mediated schema, which sits above source schema 1, source schema 2, and source schema 3 for homes.com, realestate.com, and homeseekers.com]

SLIDE 5

Semantic Matches between Schemas

Mediated-schema: price, agent-name, address

homes.com:

listed-price  contact-name  city     state
320K          Jane Brown    Seattle  WA
240K          Mike Smith    Miami    FL

Match types shown in the figure: 1-1 match, complex match
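The two match types can be illustrated on the homes.com rows. The arrows of the original figure are not recoverable from the text, so this sketch assumes that price = listed-price and agent-name = contact-name are the 1-1 matches and that address = concat(city, state) is the complex match. The function name `to_mediated` is hypothetical.

```python
# Rows from homes.com as listed on the slide.
homes_rows = [
    {"listed-price": "320K", "contact-name": "Jane Brown", "city": "Seattle", "state": "WA"},
    {"listed-price": "240K", "contact-name": "Mike Smith", "city": "Miami", "state": "FL"},
]

def to_mediated(row):
    """Translate one homes.com row into the mediated schema (assumed matches)."""
    return {
        "price": row["listed-price"],                 # 1-1 match
        "agent-name": row["contact-name"],            # 1-1 match
        "address": f'{row["city"]}, {row["state"]}',  # complex match: concat(city, state)
    }

mediated_rows = [to_mediated(r) for r in homes_rows]
```

A 1-1 match copies a single source element, while a complex match combines several source elements through an expression.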

SLIDE 6

Schema Matching is Ubiquitous!

Fundamental problem in numerous applications

Databases

– data integration
– data translation
– schema/view integration
– data warehousing
– semantic query processing
– model management
– peer data management

AI

– knowledge bases, ontology merging, information gathering agents, ...

Web

– e-commerce
– marking up data using ontologies (e.g., on Semantic Web)

SLIDE 7

Why Schema Matching is Difficult

Schema & data never fully capture semantics!

– not adequately documented
– schema creator has retired to Florida!

Must rely on clues in schema & data

– using names, structures, types, data values, etc.

Such clues can be unreliable

– same names => different entities: area => location or square-feet
– different names => same entity: area & address => location

Intended semantics can be subjective

– house-style = house-description?
– military applications require committees to decide!

Cannot be fully automated, needs user feedback!

SLIDE 8

Current State of Affairs

Finding semantic mappings is now a key bottleneck!

– largely done by hand
– labor intensive & error prone
– data integration at GTE [Li&Clifton, 2000]
  – 40 databases, 27000 elements, estimated time: 12 years

Will only be exacerbated

– data sharing becomes pervasive
– translation of legacy data

Need semi-automatic approaches to scale up!

Many research projects in the past few years

– Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U of Leipzig, U Wisconsin, NCSU, UIUC, Washington, ...
– AI: Stanford, Karlsruhe University, NEC Japan, ...

SLIDE 9

Road Map

Schema Matching

– motivation & problem definition
– representative current solutions: LSD, iMAP, Clio
– broader picture

Ontology Matching

– motivation & problem definition
– representative current solution: GLUE
– broader picture

Conclusions & Emerging Directions

SLIDE 10

LSD

Learning Source Description

Developed at Univ of Washington 2000-2001

– with Pedro Domingos and Alon Halevy

Designed for data integration settings

– has been adapted to several other contexts

Desirable characteristics

– learn from previous matching activities
– exploit multiple types of information in schema and data
– incorporate domain integrity constraints
– handle user feedback
– achieves high matching accuracy (66 -- 97%) on real-world data

SLIDE 11

Schema Matching for Data Integration: the LSD Approach

Suppose user wants to integrate 100 data sources

  • 1. User

– manually creates matches for a few sources, say 3
– shows LSD these matches

  • 2. LSD learns from the matches
  • 3. LSD predicts matches for remaining 97 sources

SLIDE 12

Learning from the Manual Matches

Mediated schema: price, agent-name, agent-phone, office-phone, description

Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments

realestate.com:

listed-price  contact-name  contact-phone   office          comments
$250K         James Smith   (305) 729 0831  (305) 616 1822  Fantastic house
$320K         Mike Doan     (617) 253 1429  (617) 112 2315  Great location

If “fantastic” & “great” occur frequently in data instances => description

sold-at  contact-agent   extra-info
$350K    (206) 634 9435  Beautiful yard

SLIDE 13

Must Exploit Multiple Types of Information!

Mediated schema: price, agent-name, agent-phone, office-phone, description

Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments

realestate.com:

listed-price  contact-name  contact-phone   office          comments
$250K         James Smith   (305) 729 0831  (305) 616 1822  Fantastic house
$320K         Mike Doan     (617) 253 1429  (617) 112 2315  Great location

homes.com:

sold-at  contact-agent   extra-info
$350K    (206) 634 9435  Beautiful yard
$230K    (617) 335 4243  Close to Seattle

If “fantastic” & “great” occur frequently in data instances => description

If “office” occurs in name => office-phone

SLIDE 14

Multi-Strategy Learning

Use a set of base learners

– each exploits well certain types of information

To match a schema element of a new source

– apply base learners
– combine their predictions using a meta-learner

Meta-learner

– uses training sources to measure base learner accuracy
– weighs each learner based on its accuracy

SLIDE 15

Base Learners

Training: training examples (X1,C1), (X2,C2), ..., (Xm,Cm), each an object with its observed label => classification model (hypothesis)

Matching: object X => labels weighted by confidence score

Name Learner

– training: (“location”, address), (“contact name”, name)
– matching: agent-name => (name,0.7), (phone,0.3)

Naive Bayes Learner

– training: (“Seattle, WA”, address), (“250K”, price)
– matching: “Kent, WA” => (address,0.8), (name,0.2)

SLIDE 16

The LSD Architecture

Training Phase: Mediated schema + Source schemas => Training data for base learners => Base-Learner1 ... Base-Learnerk (Hypothesis1 ... Hypothesisk) => Meta-Learner (Weights for Base Learners)

Matching Phase: Base-Learner1 ... Base-Learnerk => Meta-Learner => Prediction Combiner (Predictions for instances => Predictions for elements) => Constraint Handler (with Domain constraints) => Mappings

SLIDE 17

Training the Base Learners

Mediated schema: address, price, agent-name, agent-phone, office-phone, description

realestate.com:

location    price  contact-name  contact-phone   office          comments
Miami, FL   $250K  James Smith   (305) 729 0831  (305) 616 1822  Fantastic house
Boston, MA  $320K  Mike Doan     (617) 253 1429  (617) 112 2315  Great location

Name Learner training examples

– (“location”, address)
– (“price”, price)
– (“contact name”, agent-name)
– (“contact phone”, agent-phone)
– (“office”, office-phone)
– (“comments”, description)

Naive Bayes Learner training examples

– (“Miami, FL”, address)
– (“$250K”, price)
– (“James Smith”, agent-name)
– (“(305) 729 0831”, agent-phone)
– (“(305) 616 1822”, office-phone)
– (“Fantastic house”, description)
– (“Boston,MA”, address)
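To make the Naive Bayes training step concrete, here is a minimal sketch trained on the seven (instance, label) pairs listed for this slide. It is not the LSD implementation: the tokenizer, the Laplace smoothing, and the class name `NaiveBayesLearner` are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split into lowercase words, digit runs, and single symbols."""
    return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", text.lower())

class NaiveBayesLearner:
    def fit(self, examples):
        """examples: list of (data instance, mediated-schema label)."""
        self.label_counts = Counter(label for _, label in examples)
        self.token_counts = defaultdict(Counter)
        for text, label in examples:
            self.token_counts[label].update(tokenize(text))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        """Return {label: confidence}, with confidences summing to 1."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # class prior
            for tok in tokenize(text):
                score += math.log((counts[tok] + 1) / denom)  # Laplace smoothing
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(exp.values())
        return {l: v / z for l, v in exp.items()}

training = [
    ("Miami, FL", "address"), ("$250K", "price"), ("James Smith", "agent-name"),
    ("(305) 729 0831", "agent-phone"), ("(305) 616 1822", "office-phone"),
    ("Fantastic house", "description"), ("Boston,MA", "address"),
]
learner = NaiveBayesLearner().fit(training)
prediction = learner.predict("Kent, WA")
```

With so little training data the confidences are far from the slide's illustrative numbers, but the comma token and the higher prior already make address the top label for "Kent, WA".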

SLIDE 18

Meta-Learner: Stacking [Wolpert 92, Ting&Witten 99]

Training

– uses training data to learn weights
– one for each (base-learner, mediated-schema element) pair
– weight (Name-Learner,address) = 0.2
– weight (Naive-Bayes,address) = 0.8

Matching: combine predictions of base learners

– computes weighted average of base-learner confidence scores

Example: source element area, with instances Seattle, WA; Kent, WA; Bend, OR

– Name Learner: (address,0.4)
– Naive Bayes: (address,0.9)
– Meta-Learner: (address, 0.4*0.2 + 0.9*0.8 = 0.8)
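The matching-time combination can be sketched as a weighted sum over base learners. The weights and confidence scores below are the ones on the slide. How the weights are learned from training data is not shown here, and the function name `combine` is hypothetical.

```python
# One weight per (base learner, mediated-schema element) pair, as on the slide.
weights = {
    ("Name Learner", "address"): 0.2,
    ("Naive Bayes", "address"): 0.8,
}

# Confidence scores each base learner assigns for the source element "area".
base_predictions = {
    "Name Learner": {"address": 0.4},
    "Naive Bayes": {"address": 0.9},
}

def combine(base_predictions, weights):
    """Weighted sum of base-learner confidences for each label."""
    combined = {}
    for learner, scores in base_predictions.items():
        for label, conf in scores.items():
            w = weights.get((learner, label), 0.0)
            combined[label] = combined.get(label, 0.0) + w * conf
    return combined

meta_prediction = combine(base_predictions, weights)
```

Because the two weights for address sum to 1, the weighted sum is also a weighted average, which reproduces the slide's 0.4*0.2 + 0.9*0.8 = 0.8.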

SLIDE 19

The LSD Architecture

Training Phase: Mediated schema + Source schemas => Training data for base learners => Base-Learner1 ... Base-Learnerk (Hypothesis1 ... Hypothesisk) => Meta-Learner (Weights for Base Learners)

Matching Phase: Base-Learner1 ... Base-Learnerk => Meta-Learner => Prediction Combiner (Predictions for instances => Predictions for elements) => Constraint Handler (with Domain constraints) => Mappings

SLIDE 20

Applying the Learners

homes.com schema: area, sold-at, contact-agent, extra-info

area (instances: Seattle, WA; Kent, WA; Bend, OR)

– Name Learner + Naive Bayes => Meta-Learner, one prediction per instance: (address,0.8), (description,0.2); (address,0.6), (description,0.4); (address,0.7), (description,0.3)
– Prediction-Combiner: (address,0.7), (description,0.3)

Predictions for the remaining elements

– sold-at: (price,0.9), (agent-phone,0.1)
– contact-agent: (agent-phone,0.9), (description,0.1)
– extra-info: (address,0.6), (description,0.4)
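The step from instance-level to element-level predictions can be sketched as follows. The slide does not state the combination rule; plain averaging is an assumption here, chosen because it reproduces the (address,0.7), (description,0.3) result for area. The function name `combine_instances` is hypothetical.

```python
def combine_instances(instance_predictions):
    """Average per-instance label confidences into one element-level prediction."""
    labels = {label for pred in instance_predictions for label in pred}
    n = len(instance_predictions)
    return {
        label: sum(pred.get(label, 0.0) for pred in instance_predictions) / n
        for label in labels
    }

# Meta-learner output for the three instances of "area" shown on the slide.
area_instances = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]
area_prediction = combine_instances(area_instances)
```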

SLIDE 21

Domain Constraints

Encode user knowledge about domain

Specified only once, by examining mediated schema

Examples

– at most one source-schema element can match address
– if a source-schema element matches house-id then it is a key
– avg-value(price) > avg-value(num-baths)

Given a mapping combination

– can verify if it satisfies a given constraint

Example mapping combination

– area: address
– sold-at: price
– contact-agent: agent-phone
– extra-info: address

SLIDE 22

The Constraint Handler

Searches space of mapping combinations efficiently

Can handle arbitrary constraints

Also used to incorporate user feedback

– sold-at does not match price

Predictions from Prediction Combiner

– area: (address,0.7), (description,0.3)
– sold-at: (price,0.9), (agent-phone,0.1)
– contact-agent: (agent-phone,0.9), (description,0.1)
– extra-info: (address,0.6), (description,0.4)

Domain Constraints

– At most one element matches address

Mapping combinations and their scores

– area: address, sold-at: price, contact-agent: agent-phone, extra-info: address => 0.7 * 0.9 * 0.9 * 0.6 = 0.3402 (violates the constraint)
– area: address, sold-at: price, contact-agent: agent-phone, extra-info: description => 0.7 * 0.9 * 0.9 * 0.4 = 0.2268
– lowest-scoring combination => 0.3 * 0.1 * 0.1 * 0.4 = 0.0012
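The scoring of mapping combinations can be reproduced with a short sketch. The slide says the handler searches the space efficiently; the brute-force enumeration below is swapped in only for illustration. The names `best_mapping` and `at_most_one_address` are hypothetical.

```python
from itertools import product
from math import prod

# Element-level predictions from the Prediction Combiner (slide values).
predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "sold-at": {"price": 0.9, "agent-phone": 0.1},
    "contact-agent": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}

def at_most_one_address(mapping):
    """Domain constraint: at most one source element matches address."""
    return sum(1 for label in mapping.values() if label == "address") <= 1

def best_mapping(predictions, constraints):
    """Enumerate every mapping combination and keep the highest-scoring valid one."""
    elements = list(predictions)
    best, best_score = None, -1.0
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue
        score = prod(predictions[e][label] for e, label in mapping.items())
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score

unconstrained, unconstrained_score = best_mapping(predictions, [])
constrained, constrained_score = best_mapping(predictions, [at_most_one_address])
```

Without the constraint, the best combination maps both area and extra-info to address (0.3402). With it, extra-info falls back to description (0.2268). User feedback such as "sold-at does not match price" could be passed as one more constraint function.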

SLIDE 23

The Current LSD System

Can also handle data in XML format

– matches XML DTDs

Base learners

– Naive Bayes [Duda&Hart-93, Domingos&Pazzani-97]
  – exploits frequencies of words & symbols
– WHIRL Nearest-Neighbor Classifier [Cohen&Hirsh KDD-98]
  – employs information-retrieval similarity metric
– Name Learner [SIGMOD-01]
  – matches elements based on their names
– County-Name Recognizer [SIGMOD-01]
  – stores all U.S. county names
– XML Learner [SIGMOD-01]
  – exploits hierarchical structure of XML data

SLIDE 24

Empirical Evaluation

Four domains

– Real Estate I & II, Course Offerings, Faculty Listings

For each domain

– created mediated schema & domain constraints
– chose five sources
– extracted & converted data into XML
– mediated schemas: 14 - 66 elements, source schemas: 13 - 48

Ten runs for each domain, in each run:

– manually provided 1-1 matches for 3 sources
– asked LSD to propose matches for remaining 2 sources
– accuracy = % of 1-1 matches correctly identified

SLIDE 25

High Matching Accuracy

[Figure: bar chart of average matching accuracy (%) for Real Estate I, Real Estate II, Course Offerings, and Faculty Listings]

LSD’s accuracy: 71 - 92%

– Best single base learner: 42 - 72%
– + Meta-learner: + 5 - 22%
– + Constraint handler: + 7 - 13%
– + XML learner: + 0.8 - 6%

SLIDE 26

Contribution of Schema vs. Data

[Figure: bar chart of average matching accuracy (%) for Real Estate I, Real Estate II, Course Offerings, and Faculty Listings, comparing LSD with only schema info., LSD with only data info., and complete LSD]

More experiments in [Doan et al. SIGMOD-01]

SLIDE 27

LSD Summary

LSD

– learns from previous matching activities
– exploits multiple types of information
  – by employing multi-strategy learning
– incorporates domain constraints & user feedback
– achieves high matching accuracy

LSD focuses on 1-1 matches

Next challenge: discover more complex matches!

– iMAP (Illinois Mapping) system [SIGMOD-04]
  – developed at Washington and Illinois, 2002-2004
  – with Robin Dhamankar, Yoonkyong Lee, Alon Halevy, Pedro Domingos

SLIDE 28

The iMAP Approach

For each mediated-schema element

– searches space of all matches
– finds a small set of likely match candidates
– uses LSD to evaluate them

To search efficiently

– employs a specialized searcher for each element type
– Text Searcher, Numeric Searcher, Category Searcher, ...

Mediated-schema: price, num-baths, address

homes.com:

listed-price  agent-id  full-baths  half-baths  city     zipcode
320K          53211     2           1           Seattle  98105
240K          11578     1           1           Miami    23591

SLIDE 29

The iMAP Architecture [SIGMOD-04]

[Figure: Source schema + data and Mediated schema feed Searcher1, Searcher2, ..., Searcherk, which produce Match candidates; Base-Learner1 ... Base-Learnerk and the Meta-Learner produce a Similarity Matrix; the Match selector outputs 1-1 and complex matches; Domain knowledge and data, the User, and the Explanation module also appear in the diagram]

SLIDE 30

An Example: Text Searcher

Beam search in space of all concatenation matches

Example: find match candidates for address

Mediated-schema: price, num-baths, address

homes.com:

listed-price  agent-id  full-baths  half-baths  city     zipcode
320K          532a      2           1           Seattle  98105
240K          115c      1           1           Miami    23591

Candidate concatenations

– concat(agent-id,zipcode): 532a 98105; 115c 23591
– concat(city,zipcode): Seattle 98105; Miami 23591
– concat(agent-id,city): 532a Seattle; 115c Miami

Best match candidates for address

– (agent-id,0.7), (concat(agent-id,city),0.75), (concat(city,zipcode),0.9)
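A beam search over concatenations can be sketched as below, using the homes.com columns from this slide. In iMAP the candidates are evaluated by learners; the scorer `toy_address_score` here is a hypothetical stand-in (half credit for starting with a word, half for ending with a 5-digit number), so the scores do not reproduce the slide's 0.7, 0.75, and 0.9.

```python
import re

def concat_values(columns, attrs):
    """Row-wise, space-separated concatenation of the chosen columns."""
    return [" ".join(row) for row in zip(*(columns[a] for a in attrs))]

def beam_search(columns, score, beam_width=2, max_len=2):
    """Score single columns, then repeatedly extend the best `beam_width`
    candidates by one more column. Returns all scored candidates, best first."""
    scored = {(a,): score(concat_values(columns, (a,))) for a in columns}
    beam = sorted(scored, key=scored.get, reverse=True)[:beam_width]
    for _ in range(max_len - 1):
        expansions = [c + (a,) for c in beam for a in columns if a not in c]
        for cand in expansions:
            scored[cand] = score(concat_values(columns, cand))
        beam = sorted(expansions, key=scored.get, reverse=True)[:beam_width]
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

def toy_address_score(values):
    """Hypothetical scorer, not iMAP's: rewards 'word ... 5-digit number' strings."""
    def one(v):
        starts_with_word = bool(re.match(r"[A-Za-z]+\b", v))
        ends_with_zip = bool(re.search(r"\b\d{5}$", v))
        return 0.5 * starts_with_word + 0.5 * ends_with_zip
    return sum(one(v) for v in values) / len(values)

homes = {
    "listed-price": ["320K", "240K"],
    "agent-id": ["532a", "115c"],
    "full-baths": ["2", "1"],
    "half-baths": ["1", "1"],
    "city": ["Seattle", "Miami"],
    "zipcode": ["98105", "23591"],
}
ranked = beam_search(homes, toy_address_score)
```

Only the extensions of the two best single columns are explored, which is what keeps the search tractable when the space of concatenations is large.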

SLIDE 31

Empirical Evaluation

Current iMAP system

– 12 searchers

Four real-world domains

– real estate, product inventory, cricket, financial wizard
– target schema: 19 -- 42 elements, source schema: 32 -- 44

Accuracy: 43 -- 92%

Sample discovered matches

– agent-name = concat(first-name,last-name)
– area = building-area / 43560
– discount-cost = (unit-price * quantity) * (1 - discount)

More detail in [Dhamankar et al. SIGMOD-04]

SLIDE 32

Observations

Finding complex matches much harder than 1-1 matches!

– require gluing together many components
– e.g., num-rooms = bath-rooms + bed-rooms + dining-rooms + living-rooms
– if missing one component => incorrect match

However, even partial matches are already very useful!

– so are top-k matches => need methods to handle partial/top-k matches

Huge/infinite search spaces

– domain knowledge plays a crucial role!

Matches are fairly complex, hard to know if they are correct

– must be able to explain matches

Human must be fairly active in the loop

– need strong user interaction facilities

Break matching architecture into multiple "atomic" boxes!

SLIDE 33

Road Map

Schema Matching

– motivation & problem definition
– representative current solutions: LSD, iMAP, Clio
– broader picture

Ontology Matching

– motivation & problem definition
– representative current solution: GLUE
– broader picture

Conclusions & Emerging Directions

SLIDE 34

Finding Matches is only Half of the Job!

To translate data/queries, need mappings, not matches

Schema S

HOUSES

location     price ($)  agent-id
Atlanta, GA  360,000    32
Raleigh, NC  430,000    15

AGENTS

id  name        city     state  fee-rate
32  Mike Brown  Athens   GA     0.03
15  Jean Laup   Raleigh  NC     0.04

Schema T

LISTINGS

area         list-price  agent-address  agent-name
Denver, CO   550,000     Boulder, CO    Laura Smith
Atlanta, GA  370,800     Athens, GA     Mike Brown

Mappings

– area = SELECT location FROM HOUSES
– agent-address = SELECT concat(city,state) FROM AGENTS
– list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id

SLIDE 35

Clio: Elaborating Matches into Mappings

Developed at Univ of Toronto & IBM Almaden, 2000-2003

– by Renee Miller, Laura Haas, Mauricio Hernandez, Lucian Popa, Howard Ho, Ling Yan, Ron Fagin

Given a match

– list-price = price * (1 + fee-rate)

Refine it into a mapping

– list-price = SELECT price * (1 + fee-rate) FROM HOUSES (FULL OUTER JOIN) AGENTS WHERE agent-id = id

Need to discover

– the correct join path among tables, e.g., agent-id = id
– the correct join, e.g., full outer join? inner join?

Use heuristics to decide

– when in doubt, ask users
– employ sophisticated user interaction methods [VLDB-00, SIGMOD-01]

SLIDE 36

Clio: Illustrating Examples

Schema S

HOUSES

location     price ($)  agent-id
Atlanta, GA  360,000    32
Raleigh, NC  430,000    15

AGENTS

id  name        city     state  fee-rate
32  Mike Brown  Athens   GA     0.03
15  Jean Laup   Raleigh  NC     0.04

Schema T

LISTINGS

area         list-price  agent-address  agent-name
Denver, CO   550,000     Boulder, CO    Laura Smith
Atlanta, GA  370,800     Athens, GA     Mike Brown

Mappings

– area = SELECT location FROM HOUSES
– agent-address = SELECT concat(city,state) FROM AGENTS
– list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id

SLIDE 37

Road Map

Schema Matching

– motivation & problem definition
– representative current solutions: LSD, iMAP, Clio
– broader picture

Ontology Matching

– motivation & problem definition
– representative current solution: GLUE
– broader picture

Conclusions & Emerging Directions

SLIDE 38

Broader Picture: Find Matches

Hand-crafted rules, exploit schema, 1-1 matches

– TRANSCM [Milo&Zohar98]
– ARTEMIS [Castano&Antonellis99]
– [Palopoli et al. 98]
– CUPID [Madhavan et al. 01]

Single learner, exploit data, 1-1 matches

– SEMINT [Li&Clifton94]
– ILA [Perkowitz&Etzioni95]
– DELTA [Clifton et al. 97]
– AutoMatch, Autoplex [Berlin & Motro, 01-03]

Learners + rules, use multi-strategy learning, exploit schema + data, 1-1 + complex matches, exploit domain constraints

– LSD [Doan et al., SIGMOD-01]
– iMAP [Dhamankar et al., SIGMOD-04]

Other Important Works

– COMA by Erhard Rahm group
– David Embley group at BYU
– Jaewoo Kang group at NCSU
– Kevin Chang group at UIUC
– Clement Yu group at UIC

More about some of these works soon ....

SLIDE 39

Broader Picture: From Matches to Mappings

CLIO [Miller et al., 00] [Yan et al. 01]

– Rules
– Exploit data
– Powerful user interaction

iMAP [Dhamankar et al., SIGMOD-04]

– Learners + rules
– Exploit schema + data
– 1-1 + complex matches
– Automate as much as possible

?

SLIDE 40

Road Map

Schema Matching

– motivation & problem definition
– representative current solutions: LSD, iMAP, Clio
– broader picture

Ontology Matching

– motivation & problem definition
– representative current solution: GLUE
– broader picture

Conclusions & Emerging Directions

SLIDE 41

Ontology Matching

Increasingly critical for

– knowledge bases, Semantic Web

An ontology

– concepts organized into a taxonomy tree
– each concept has
  – a set of attributes
  – a set of instances
– relations among concepts

Matching

– concepts
– attributes
– relations

[Figure: taxonomy for CS Dept. US with nodes Entity, Undergrad Courses, Grad Courses, People, Staff, Faculty, Assistant Professor, Associate Professor, Professor; a sample instance has name: Mike Burns, degree: Ph.D.]

SLIDE 42

Matching Taxonomies of Concepts

CS Dept. US: Entity, Undergrad Courses, Grad Courses, People, Staff, Faculty, Assistant Professor, Associate Professor, Professor

CS Dept. Australia: Entity, Courses, Staff, Technical Staff, Academic Staff, Lecturer, Senior Lecturer, Professor

SLIDE 43

Glue [Doan, Madhavan, Domingos, Halevy; WWW’2002]

Solution

– Use data instances extensively
– Learn classifiers using information within taxonomies
– Use a rich constraint satisfaction scheme

SLIDE 44

Concept Similarity

Multiple similarity measures in terms of the JPD

Joint Probability Distribution: P(A,S), P(¬A,S), P(A,¬S), P(¬A,¬S)

[Figure: the hypothetical universe of all examples, divided by Concept A and Concept S into the regions (A,S), (A,¬S), (¬A,S), (¬A,¬S)]

$$\mathrm{Sim}(\text{Concept } A, \text{Concept } S) = \frac{P(A \cap S)}{P(A \cup S)} = \frac{P(A,S)}{P(A,\neg S) + P(A,S) + P(\neg A,S)}$$

[Jaccard, 1908]
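The Jaccard measure follows directly from the four JPD entries. The sketch below uses made-up probabilities, and both the function name `jaccard_sim` and the `(in_A, in_S)` key convention are assumptions.

```python
def jaccard_sim(jpd):
    """jpd maps (in_A, in_S) -> probability; returns P(A,S) / P(A or S)."""
    union = jpd[(True, True)] + jpd[(True, False)] + jpd[(False, True)]
    return jpd[(True, True)] / union if union > 0 else 0.0

# Illustrative joint distribution (values invented for the example).
example_jpd = {
    (True, True): 0.3,    # P(A, S)
    (True, False): 0.1,   # P(A, not S)
    (False, True): 0.2,   # P(not A, S)
    (False, False): 0.4,  # P(not A, not S)
}
similarity = jaccard_sim(example_jpd)
```

Here the similarity is 0.3 / (0.3 + 0.1 + 0.2) = 0.5. It equals 1 only when A and S cover exactly the same examples, and 0 when they are disjoint.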

SLIDE 45

Machine Learning for Computing Similarities

JPD estimated by counting the sizes of the partitions

[Figure: the instances of Taxonomy 1 are divided into A and ¬A, and the instances of Taxonomy 2 into S and ¬S; classifiers CL_A and CL_S then split each taxonomy's instances into the partitions (A,S), (A,¬S), (¬A,S), (¬A,¬S)]
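The counting step can be sketched as follows. Each taxonomy's instances already carry one membership (A or S), and a classifier supplies the other. The keyword classifiers and the sample instances are hypothetical stand-ins for learned classifiers and real data.

```python
def estimate_jpd(tax1, tax2, cl_a, cl_s):
    """tax1: list of (instance, in_A); tax2: list of (instance, in_S).
    cl_s predicts S membership for tax1 instances, cl_a predicts A membership
    for tax2 instances. Returns {(in_A, in_S): estimated probability}."""
    counts = {(a, s): 0 for a in (True, False) for s in (True, False)}
    for instance, in_a in tax1:
        counts[(in_a, cl_s(instance))] += 1
    for instance, in_s in tax2:
        counts[(cl_a(instance), in_s)] += 1
    total = len(tax1) + len(tax2)
    return {key: n / total for key, n in counts.items()}

# Hypothetical instances: A = Faculty (Taxonomy 1), S = Academic Staff (Taxonomy 2).
taxonomy1 = [
    ("associate professor of databases", True),
    ("professor of AI", True),
    ("systems administrator", False),
]
taxonomy2 = [
    ("senior lecturer in networks", True),
    ("professor of theory", True),
    ("technical support officer", False),
    ("lecturer in graphics", True),
]

# Toy keyword rules standing in for trained classifiers CL_A and CL_S.
cl_a = lambda text: "professor" in text
cl_s = lambda text: "professor" in text or "lecturer" in text

jpd = estimate_jpd(taxonomy1, taxonomy2, cl_a, cl_s)
```

The resulting estimates can be plugged into any similarity function defined over the joint distribution, such as the Jaccard measure.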

SLIDE 46

The Glue System

Taxonomy O1 (tree structure + data instances), Taxonomy O2 (tree structure + data instances)

=> Distribution Estimator (Base Learner, ..., Base Learner, Meta Learner)

=> Joint Probability Distribution P(A,B), P(A’,B), …

=> Similarity Estimator (Similarity Function)

=> Similarity Matrix

=> Relaxation Labeling (Common Knowledge & Domain Constraints)

=> Matches for O1, Matches for O2

SLIDE 47

Constraints in Taxonomy Matching

Domain-dependent

– at most one node matches department-chair
– a node that matches professor cannot be a child of a node that matches assistant-professor

Domain-independent

– two nodes match if parents & children match
– if all children of X match Y, then X also matches Y
– Variations have been exploited in many restricted settings [Melnik&Garcia-Molina,ICDE-02], [Milo&Zohar,VLDB-98], [Noy et al., IJCAI-01], [Madhavan et al., VLDB-01]

Challenge: find a general & efficient approach

SLIDE 48

Solution: Relaxation Labeling

Relaxation labeling [Hummel&Zucker, 83]

– applied to graph labeling in vision, NLP, hypertext classification
– finds best label assignment, given a set of constraints
– starts with initial label assignment
– iteratively improves labels, using constraints

Standard relax. labeling not applicable

– extended it in many ways [Doan et al., WWW-02]
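The iterative idea can be sketched with a simple multiplicative update of the kind commonly associated with relaxation labeling. This is not the extended variant used in GLUE: the update rule, the compatibility values, and the two-node example are all assumptions made for illustration.

```python
def relax(probs, edges, compat, iterations=10):
    """probs: node -> {label: probability}; edges: list of (parent, child);
    compat: (parent_label, child_label) -> compatibility in [-1, 1]."""
    for _ in range(iterations):
        new_probs = {}
        for node, dist in probs.items():
            support = {label: 0.0 for label in dist}
            neighbors = 0
            for parent, child in edges:
                if node == parent:
                    neighbors += 1
                    for label in dist:
                        support[label] += sum(
                            compat.get((label, cl), 0.0) * p
                            for cl, p in probs[child].items())
                elif node == child:
                    neighbors += 1
                    for label in dist:
                        support[label] += sum(
                            compat.get((pl, label), 0.0) * p
                            for pl, p in probs[parent].items())
            scale = max(neighbors, 1)  # average support over neighbors
            unnorm = {l: p * (1.0 + support[l] / scale) for l, p in dist.items()}
            z = sum(unnorm.values())
            new_probs[node] = {l: v / z for l, v in unnorm.items()}
        probs = new_probs  # all nodes are updated simultaneously
    return probs

# Hypothetical initial label assignment for two nodes of one taxonomy.
initial = {
    "Faculty": {"Academic Staff": 0.4, "Technical Staff": 0.4, "Lecturer": 0.2},
    "Assistant Professor": {"Academic Staff": 0.1, "Technical Staff": 0.1, "Lecturer": 0.8},
}
edges = [("Faculty", "Assistant Professor")]
# Assumed constraint: a child labeled Lecturer supports a parent labeled
# Academic Staff and contradicts a parent labeled Technical Staff.
compat = {("Academic Staff", "Lecturer"): 1.0, ("Technical Staff", "Lecturer"): -1.0}

final = relax(initial, edges, compat)
```

The parent node starts undecided between Academic Staff and Technical Staff, and the confident label of its child pulls it toward Academic Staff over the iterations.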

SLIDE 49

Real World Experiments

Taxonomies on the web

– University organization (UW and Cornell)
  – Colleges, departments and sub-fields
– Companies (Yahoo and The Standard)
  – Industries and Sectors

For each taxonomy

– Extract data instances: course descriptions, company profiles
– Trivial data cleaning
– 100-300 concepts per taxonomy
– taxonomy depth of 3-4
– 10-90 data instances per concept

Evaluation against manual mappings as the gold standard

SLIDE 50

Glue’s Performance

[Figure: bar chart of matching accuracy (%) for Name Learner, Content Learner, Meta Learner, and Relaxation Labeler on University Depts 1 (Cornell to Wash., Wash. to Cornell), University Depts 2 (Cornell to Wash., Wash. to Cornell), and Company Profiles (Standard to Yahoo, Yahoo to Standard)]

SLIDE 51

Broader Picture

Ontology matching parallels the development of schema matching

– rule-based & learning-based approaches
– PROMPT family, OntoMorph, OntoMerge, Chimaera, Onion, OBSERVER, FCAMerge, ...
– extensive work by Ed Hovy's group
– ontology versioning (e.g., by Noy et al.)

More powerful user interaction methods

– e.g., iPROMPT, Chimaera

Much more theoretical work in this area

SLIDE 52

Road Map

Schema Matching

– motivation & problem definition
– representative current solutions: LSD, iMAP, Clio
– broader picture

Ontology Matching

– motivation & problem definition
– representative current solution: GLUE
– broader picture

Conclusions & Emerging Directions

SLIDE 53

Develop the Theoretical Foundation

Not much is going on, however ...

– see works by Alon Halevy (AAAI-02) and Phil Bernstein (in model management contexts)
– some preliminary work in AnHai Doan's Ph.D. dissertation
– work by Stuart Russell and other AI people on identity uncertainty is potentially relevant

Most likely foundation

– probability framework

SLIDE 54

Need Much More Domain Knowledge

Where to get it?

– past matches (e.g., LSD, iMAP)
– other schemas in the domain
  – holistic matching approach by Kevin Chang group [SIGMOD-02]
  – corpus-based matching by Alon Halevy group [IJCAI-03]
  – clustering to achieve bridging effects by Clement Yu group [SIGMOD-04]
– external data (e.g., iMAP at SIGMOD-04)
– mass of users (e.g., MOBS at WebDB-03)

How to get it and how to use it?

– no clear answer yet

SLIDE 55

Employ Multi-Module Architecture

Many "black boxes", each good at doing a single thing

Combine them and tailor them to each application

Examples

– LSD, iMAP, COMA, David Embley's systems

Open issues

– what are these black boxes?
– how to build them?
– how to combine them?

SLIDE 56

Powerful User Interaction

Minimize user effort, maximize its impact

Make it very easy for users to

– supply domain knowledge
– provide feedback on matches/mappings

Develop powerful explanation facilities

SLIDE 57

Other Issues

– What to do with partial/top-k matches?
– Meaning negotiation
– Fortifying schemas for interoperability
– Very-large-scale matching scenarios (e.g., the Web)
– What can we do without the mappings?
– Interaction between schema matching and tuple matching?
– Benchmarks, tools?

SLIDE 58

Summary

Schema/ontology matching: key to numerous data management problems

– much attention in the database, AI, Semantic Web communities

Simple problem definition, yet very difficult to do

– no satisfactory solution yet
– AI complete?

We now understand the problems much better

– still at the beginning of the journey
– will need techniques from multiple fields