A Semantic Similarity Measure for Formal Ontologies
Mark Hall
Final presentation for the master thesis 17.03.2006
Overview
A Semantic Similarity Model and Algorithm
  Motivation - Heterogeneous Data
  Ontologies
  Similarity Measures
  Hybrid Model & Similarity Calculation
  Application
Evaluation of the Model and Algorithm
Summary
Heterogeneous data sources are the norm
Integration poses two main problems
  Syntactic differences
  Semantic differences
Hello, is this the train station? Frische T
Je m’appelle Jane et je t’emmerde
Integration depends on finding matches
  Syntactic problems
  Semantic problems
Matching requires similarity and a similarity measure
J’ai 1000 disques
I have a thousand CDs
Forest: A wooded area
Forest: Land that belongs to the forestry commission
Ontology is the study of the existence of entities
Shared specification of domain knowledge
Ontologies are one way to encode semantics
Figure: example ontology with the concepts Houses, Villa, Shack, Industry, Iron Foundry, Textiles
Provide a means of comparing two entities to determine how similar they are
Not based on a cognitive model
Description Logics, Word based, Structure based
Based on a cognitive model
Feature, Network, Cognitive Spaces
Hybrid Semantic Similarity Measure
Combines the approaches of the feature and the network model
Basis is the feature model, but each feature has an inner structure in the form of the network model
Example: Red, Filled, Round vs. Blue, Filled, Square
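The slides do not give the formula used for the feature model. As an illustration only, the sketch below uses Tversky's ratio model, a standard feature-based similarity, on the two example objects. The function name, the weights alpha and beta, and the choice of this particular formula are assumptions, not taken from the thesis.

```python
def tversky_similarity(a, b, alpha=0.5, beta=0.5):
    """Tversky ratio model over two feature sets (illustrative assumption).

    common / (common + alpha * only_in_a + beta * only_in_b)
    """
    a, b = set(a), set(b)
    common = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    denominator = common + alpha * only_a + beta * only_b
    # Two empty feature sets are treated as identical.
    return common / denominator if denominator else 1.0


red_circle = {"Red", "Filled", "Round"}
blue_square = {"Blue", "Filled", "Square"}

# One shared feature (Filled), two distinctive features on each side.
score = tversky_similarity(red_circle, blue_square)
print(f"{score:.3f}")
```

In the hybrid model described above, a pair such as Red and Blue would not simply count as different; its similarity would come from the network model inside the colour feature.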
Every class is represented by a set of properties
Shared vocabulary is structured hierarchically
Property values reference a shared vocabulary
Property value ranges are sets of shared vocabulary
Example: Forest with the properties Has Surface, Has Vegetation, Has Use
Similarity of two classes is the aggregate of the similarities of their properties
Property similarities can be weighted to emphasise certain aspects
Example: Coniferous Forest and Broad-leaved Forest, each with the properties Has Surface, Has Vegetation, Has Use
Properties are matched based on their quantifier and name
Similarity for two matching properties is the similarity
Example: Has Surface is matched with Has Surface, Has Vegetation with Has Vegetation
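The matching and aggregation steps can be sketched as follows. This is a minimal illustration under stated assumptions: properties are matched on (quantifier, name), the class similarity is a weighted average of the property similarities, unmatched properties contribute zero, and a plain Jaccard overlap stands in for the value similarity, which in the thesis comes from the hierarchical shared vocabulary. All names, the quantifier label "some" and the property values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Property:
    quantifier: str        # hypothetical label, e.g. "some"
    name: str              # e.g. "Has Vegetation"
    values: frozenset      # terms from the shared vocabulary


def value_similarity(a, b):
    """Jaccard overlap; a stand-in for the network-model similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def class_similarity(props_a, props_b, weights=None):
    """Weighted average of property similarities.

    Properties match when quantifier and name are equal.
    Unmatched properties contribute 0 (an assumption).
    """
    weights = weights or {}
    index_a = {(p.quantifier, p.name): p for p in props_a}
    index_b = {(p.quantifier, p.name): p for p in props_b}
    total_weight = 0.0
    score = 0.0
    for key in set(index_a) | set(index_b):
        weight = weights.get(key[1], 1.0)   # weights are keyed by name
        total_weight += weight
        if key in index_a and key in index_b:
            score += weight * value_similarity(
                index_a[key].values, index_b[key].values)
    return score / total_weight if total_weight else 1.0


coniferous = [
    Property("some", "Has Surface", frozenset({"Soil"})),
    Property("some", "Has Vegetation", frozenset({"Coniferous Trees"})),
    Property("some", "Has Use", frozenset({"Forestry"})),
]
broad_leaved = [
    Property("some", "Has Surface", frozenset({"Soil"})),
    Property("some", "Has Vegetation", frozenset({"Broad-leaved Trees"})),
    Property("some", "Has Use", frozenset({"Forestry"})),
]

unweighted = class_similarity(coniferous, broad_leaved)
vegetation_heavy = class_similarity(
    coniferous, broad_leaved, weights={"Has Vegetation": 2.0})
print(f"{unweighted:.3f} {vegetation_heavy:.3f}")
```

Doubling the weight of Has Vegetation lowers the score here because vegetation is the only property on which the two toy classes differ. With a hierarchical vocabulary, the two tree types would receive a partial similarity instead of zero.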
Slovenia - Corine
Italy - Moland
Austria - Realraumanalyse
A Semantic Similarity Model and Algorithm
Evaluation of the Model and Algorithm
  Expert evaluation
  Shortcomings of the model
  Modelling errors
  Performance analysis
Summary
Mappings evaluated by domain experts
Realraumanalyse => Corine: 136 total / 116 correct / 20 incorrect
Corine => Realraumanalyse: 64 total / 34 correct / 30 incorrect
Incorrect mappings grouped by reasons
  Shortcomings of the model
  Modelling errors
  Correct but reclassified
Non built-up areas belonging to the public administration
No negation possible
Knee timber partially with rocks and alpine turf
Internal structure and relations between properties can’t be defined
Figure: Knee timber with the properties Vegetation and Surface; values Knee timber, Rocks and Alpine turf; ratios 90% : 10% and 80% : 20%; relation "refers to"
No relations between concepts in the land-use
Workaround via special properties such as “Lies next to”
Example: River and Alluvial Forest
No relations between concepts in the skeleton
Elevation Alpine Greenland and Woods Mountain Pasture Sub alpine higher than
Additional incorrect knowledge specified
Bare Rocks which included a value for the property Vegetation
Knowledge left out or none specified
Green Urban Areas which somehow managed to only have one property specified
Incorrect metadata
Incomplete settlement along a road which in the metadata was specified as belonging to the continuous urban fabric and was thus modelled as such
Correctly mapped to the most similar concept, but would be handled differently by the experts
Sea and Ocean, Olive Groves, Annual crops associated with permanent crops
Suggested strategy for dealing with these
Leave them out. Create no mapping
Reclassify based on additional knowledge
  Some knowledge could be added to the system
  Some knowledge is basically a hunch
Initial evaluation result not too good
Realraumanalyse => Corine: 85% correct
Corine => Realraumanalyse: 53% correct
Analysis of errors revealed (out of a total 200 mappings):
3 erroneous mappings due to model shortcomings
17 erroneous mappings due to modelling errors
30 reclassifications of correct mappings
Modelling errors can be corrected
Reclassifications are not actual errors but differing methodologies
Updated number of correct mappings
Realraumanalyse => Corine: 134 out of 136 (98%)
Corine => Realraumanalyse: 63 out of 64 (98%)
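The figures above can be cross-checked with a few lines of arithmetic. The counts are the ones reported on the slides; the variable and function names are illustrative.

```python
# (total, correct) per mapping direction, as reported on the slides.
initial = {
    "Realraumanalyse => Corine": (136, 116),
    "Corine => Realraumanalyse": (64, 34),
}
updated = {
    "Realraumanalyse => Corine": (136, 134),
    "Corine => Realraumanalyse": (64, 63),
}
# Reasons given for the initially incorrect mappings.
error_breakdown = {
    "model shortcomings": 3,
    "modelling errors": 17,
    "reclassifications": 30,
}


def percent_correct(correct, total):
    return 100.0 * correct / total


total_mappings = sum(total for total, _ in initial.values())
initially_incorrect = sum(total - correct for total, correct in initial.values())
# Only the model shortcomings remain once modelling errors are fixed
# and reclassifications are no longer counted as errors.
remaining_errors = (initially_incorrect
                    - error_breakdown["modelling errors"]
                    - error_breakdown["reclassifications"])
updated_correct = sum(correct for _, correct in updated.values())

for label, table in (("initial", initial), ("updated", updated)):
    for direction, (total, correct) in table.items():
        print(f"{label:8s}{direction}: {percent_correct(correct, total):.1f}%")
```

The 50 initially incorrect mappings split exactly into 3 + 17 + 30, and the 3 remaining errors account for the difference between 200 mappings and the 197 counted as correct after the update.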
Analysis of the evaluations of the other mappings reveals an average error rate between 0 and 5%
Every source concept is matched to each target concept and then the best is selected.
Figure: Corine concepts (Agricultural, Artificial, Arable, Pastures) matched against Realraum concepts (Agricultural, Settlement, Arable, Dense, Transport)
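The exhaustive matching strategy can be sketched as below: every source concept is scored against every target concept and the highest-scoring target is kept. The concept descriptions are toy feature sets and the Jaccard function is a placeholder for the hybrid similarity measure; both are assumptions made for illustration.

```python
def jaccard(a, b):
    """Placeholder similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def best_mappings(source, target, similarity=jaccard):
    """Map each source concept to its most similar target concept.

    Performs len(source) * len(target) similarity computations.
    Ties are broken by target name (an arbitrary choice).
    """
    mappings = {}
    for s_name, s_features in source.items():
        scored = [(similarity(s_features, t_features), t_name)
                  for t_name, t_features in target.items()]
        best_score, best_name = max(scored)
        mappings[s_name] = (best_name, best_score)
    return mappings


# Hypothetical, heavily simplified concept descriptions.
corine = {
    "Arable": {"agricultural", "crops"},
    "Pastures": {"agricultural", "grass"},
}
realraum = {
    "Arable": {"agricultural", "crops"},
    "Settlement": {"artificial", "buildings"},
}

mappings = best_mappings(corine, realraum)
for source_name, (target_name, score) in mappings.items():
    print(f"{source_name} => {target_name} ({score:.2f})")
```

Because every pair is compared, the number of similarity computations grows with the product of the two ontology sizes, which is one reason the calculation becomes expensive for larger ontologies.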
Similarity calculation: Polynomial time (O(N^5))
Loading and hierarchy calculation in Description Logics: Exponential time
Optimisation required for larger ontologies
  Removing the Description Logics reasoning
  Heuristics / Parallelisation for the similarity calculation
Figure: processing pipeline (OWL, Hierarchy, Mapping; DL Reasoning and Similarity Calculation to be replaced by Static Hierarchy and Heuristics)
From / To         Corine   Moland   Realraumanalyse
Corine            5sec     10sec    15sec
Moland            11sec    20sec    31sec
Realraumanalyse   18sec    34sec    52sec

Ontology          # Concepts   Avg # Prop.   Load Time
Corine            64           3             31sec
Moland            96           5             3min 19sec
Realraumanalyse   136          6             5min 16sec
Summary
Cognitive model is capable of describing most real-world situations
Similarity algorithm works sufficiently well to be used in real-world situations (average correctness of above 95%)
Performance is the major bottleneck. Without improvement it is unusable for larger ontologies
Cognitive model needs to be extended in some areas
101 pages (a nice prime number)
  77 pages with actual content
  24 pages of structural padding
6 Chapters (average 12.8 pages per chapter)
29208 words
  Average of 379 words per page
  Most frequent word: similarity (239x)
62 Figures and 3 Tables
65 References