A Semantic Similarity Measure for Formal Ontologies (Mark Hall)



slide-1
SLIDE 1

A Semantic Similarity Measure for Formal Ontologies

Mark Hall

Final presentation for the master thesis 17.03.2006

slide-2
SLIDE 2

Overview I

A Semantic Similarity Model and Algorithm
  Motivation - Heterogeneous Data
  Ontologies
  Similarity Measures
  Hybrid Model & Similarity Calculation
  Application
Evaluation of the Model and Algorithm
Summary

slide-3
SLIDE 3

Heterogeneous Data

Heterogeneous data sources are the norm
Integration poses two main problems
  Syntactic differences
  Semantic differences

[Figure: speech bubbles in different languages]
  Hello, is this the train station?
  Frische Tomaten!
  Je m’appelle Jane et je t’emmerde

slide-4
SLIDE 4

Integration = Matching

Integration depends on finding matches
  Syntactic problems
  Semantic problems
Matching requires similarity and a similarity measure

Syntactic
  J’ai 1000 disques
  I have a thousand CDs

Semantic
  Forest: A wooded area
  Forest: Land that belongs to the forestry commission

slide-5
SLIDE 5

Ontologies

Ontology is the study of the existence of entities
Shared specification of domain knowledge
Ontologies are one way to encode semantics
Tool of choice for the Semantic Web

[Figure: the concepts Houses, Villa, Shack, Industry, Iron Foundry and Textiles, shown twice in different arrangements]

slide-6
SLIDE 6

Similarity Measure

Provide a means of comparing two entities to determine how similar they are

Not based on a cognitive model
  Description Logics, Word based, Structure based

Based on a cognitive model
  Feature, Network, Cognitive Spaces
  Hybrid Semantic Similarity Measure

slide-7
SLIDE 7

Hybrid Cognitive Model I

Combines the approaches of the feature and the network model
Basis is the feature model, but each feature has an inner structure in the form of the network model

[Figure: two example objects described by the features Red, Filled, Round and Blue, Filled, Square]

slide-8
SLIDE 8

Hybrid Cognitive Model II

Every class is represented by a set of properties
Shared vocabulary is structured hierarchically
Property values reference a shared vocabulary
Property value ranges are sets of shared vocabulary

[Figure: class Forest with the properties Has Surface, Has Vegetation, Has Use]
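The structure described on this slide can be pictured as a small data model. The following is a minimal sketch, not the thesis implementation: the class names `Property` and `ConceptClass`, the `VOCABULARY` mapping and every vocabulary term in it are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical shared vocabulary: term -> broader term (None marks a root).
VOCABULARY = {
    "vegetation": None,
    "trees": "vegetation",
    "coniferous trees": "trees",
    "broad-leaved trees": "trees",
}


@dataclass(frozen=True)
class Property:
    name: str
    value_range: frozenset  # a set of terms taken from VOCABULARY


@dataclass
class ConceptClass:
    name: str
    properties: list = field(default_factory=list)


def ancestors(term):
    """Return the broader terms above `term`, nearest first."""
    chain = []
    parent = VOCABULARY[term]
    while parent is not None:
        chain.append(parent)
        parent = VOCABULARY[parent]
    return chain


# A class is a named set of properties whose ranges point into the vocabulary.
forest = ConceptClass(
    "Forest",
    [Property("Has Vegetation", frozenset({"trees"}))],
)
```

Keeping the vocabulary separate from the classes mirrors the slide: classes only reference vocabulary terms, while the hierarchy between terms lives in one shared place.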

slide-9
SLIDE 9

Similarity Calculation I

Similarity of two classes is the aggregate of the similarities of their properties
Property similarities can be weighted to emphasise certain aspects

[Figure: Coniferous Forest and Broad-leaved Forest, each with the properties Has Surface, Has Vegetation, Has Use]
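The slide does not say which aggregate is used. As a sketch under that caveat, the example below assumes a weighted mean of per-property similarities in the range 0 to 1; the function name `class_similarity` and the similarity values are hypothetical.

```python
def class_similarity(property_sims, weights=None):
    """Weighted mean of per-property similarities (an assumed aggregate).

    property_sims maps a property name to a similarity in [0, 1].
    weights maps the same names to non-negative weights (default: all 1).
    """
    if not property_sims:
        return 0.0
    if weights is None:
        weights = {name: 1.0 for name in property_sims}
    total_weight = sum(weights[name] for name in property_sims)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weights[name] * sim for name, sim in property_sims.items())
    return weighted_sum / total_weight


# Hypothetical property similarities between two forest classes.
sims = {"Has Surface": 1.0, "Has Vegetation": 0.5, "Has Use": 1.0}

unweighted = class_similarity(sims)
# Doubling the weight of vegetation emphasises the aspect in which they differ.
vegetation_heavy = class_similarity(
    sims, {"Has Surface": 1.0, "Has Vegetation": 2.0, "Has Use": 1.0}
)
```

With these made-up values, raising the weight of Has Vegetation lowers the overall similarity, which is the kind of emphasis the slide refers to.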

slide-10
SLIDE 10

Similarity Calculation II

Properties are matched based on their quantifier and name
Similarity for two matching properties is the similarity of their ranges

[Figure: matching property pairs Has Surface / Has Surface and Has Vegetation / Has Vegetation]
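The matching step can be sketched as follows. This is an illustration only: the slides do not give the range similarity formula, so Jaccard overlap is swapped in as a stand-in, and the quantifier label `"some"`, the function names and all vocabulary terms are hypothetical.

```python
def range_similarity(range_a, range_b):
    """Jaccard overlap of two value ranges (a stand-in, not the thesis measure)."""
    a, b = set(range_a), set(range_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def match_properties(props_a, props_b):
    """Pair properties that share (quantifier, name) and score their ranges.

    Each argument maps a (quantifier, name) tuple to a set of vocabulary terms.
    Properties present on only one side are left unmatched.
    """
    shared_keys = props_a.keys() & props_b.keys()
    return {
        key: range_similarity(props_a[key], props_b[key])
        for key in shared_keys
    }


coniferous = {
    ("some", "Has Surface"): {"soil"},
    ("some", "Has Vegetation"): {"coniferous trees"},
}
broad_leaved = {
    ("some", "Has Surface"): {"soil"},
    ("some", "Has Vegetation"): {"broad-leaved trees"},
    ("some", "Has Use"): {"forestry"},
}

matched = match_properties(coniferous, broad_leaved)
```

A flat set overlap ignores the hierarchy of the shared vocabulary, so two sibling terms score zero here; a measure that uses the hierarchy would rate them as partially similar.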

slide-11
SLIDE 11

Application - HarmonISA

Slovenia - Corine
Italy - Moland
Austria - Realraumanalyse

slide-12
SLIDE 12

Overview II

A Semantic Similarity Model and Algorithm
Evaluation of the Model and Algorithm
  Expert evaluation
  Shortcomings of the model
  Modelling errors
  Performance analysis
Summary

slide-13
SLIDE 13

Expert Evaluation I

Mappings evaluated by domain experts

Realraumanalyse => Corine
  136 total / 116 correct / 20 incorrect

Corine => Realraumanalyse
  64 total / 34 correct / 30 incorrect

Incorrect mappings grouped by reasons
  Shortcomings of the model
  Modelling errors
  Correct but reclassified

slide-14
SLIDE 14

Model Shortcomings I

Non built-up areas belonging to the public administration

No negation possible

Knee timber partially with rocks and alpine turf

Internal structure and relations between properties can’t be defined

[Figure: Knee timber example with the properties Vegetation and Surface, the values Knee timber, Rocks and Alpine turf, the proportions 90% : 10% and 80% : 20%, and a "refers to" relation]

slide-15
SLIDE 15

Model Shortcomings II

No relations between concepts in the land-use ontologies

Workaround via special properties such as “Lies next to”

[Figure: River, Alluvial Forest]

slide-16
SLIDE 16

Model Shortcomings III

No relations between concepts in the skeleton ontology

[Figure: labels Elevation, Alpine, Sub alpine, Greenland and Woods, Mountain Pasture; relation "higher than"]

slide-17
SLIDE 17

Modelling Errors

Additional incorrect knowledge specified

Bare Rocks which included a value for the property Vegetation

Knowledge left out or none specified

Green Urban Areas which somehow managed to only have one property specified

Incorrect metadata

Incomplete settlement along a road which in the metadata was specified as belonging to the continuous urban fabric and was thus modelled as such

slide-18
SLIDE 18

Reclassification of concepts

Correctly mapped to the most similar concept, but would be handled differently by the experts

Sea and Ocean, Olive Groves, Annual crops associated with permanent crops

Suggested strategy for dealing with these

Leave them out. Create no mapping
Reclassify based on additional knowledge
  Some knowledge could be added to the system
  Some knowledge is basically a hunch

slide-19
SLIDE 19

Expert Evaluation II

Initial evaluation result not too good

Realraumanalyse => Corine: 85% correct
Corine => Realraumanalyse: 53% correct

Analysis of errors revealed (out of a total of 200 mappings):

3 erroneous mappings due to model shortcomings
17 erroneous mappings due to modelling errors
30 reclassifications of correct mappings

slide-20
SLIDE 20

Expert Evaluation III

Modelling errors can be corrected
Reclassifications are not actual errors but differing methodologies
Updated number of correct mappings

Realraumanalyse => Corine: 134 out of 136 (98%)
Corine => Realraumanalyse: 63 out of 64 (98%)

Analysis of the evaluations of the other mappings reveals an average error rate between 0 and 5%
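The percentages on the evaluation slides follow directly from the reported counts. The short check below recomputes them; only the counts come from the slides, while the function and variable names are made up for this sketch.

```python
def percent_correct(correct, total):
    """Share of correct mappings as a percentage."""
    return 100.0 * correct / total


# (correct, total) per mapping direction, as reported on the slides.
initial = {
    "Realraumanalyse => Corine": (116, 136),
    "Corine => Realraumanalyse": (34, 64),
}
updated = {
    "Realraumanalyse => Corine": (134, 136),
    "Corine => Realraumanalyse": (63, 64),
}

for label in initial:
    before = percent_correct(*initial[label])
    after = percent_correct(*updated[label])
    print(f"{label}: {before:.1f}% -> {after:.1f}%")

# Mappings still counted as wrong after the update.
remaining_errors = sum(total - correct for correct, total in updated.values())

# Initially incorrect mappings across both directions.
initial_errors = sum(total - correct for correct, total in initial.values())
```

The initial 50 incorrect mappings equal the 3 + 17 + 30 listed in the error analysis, and the 3 that remain after the update match the 3 mappings attributed to model shortcomings.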

slide-21
SLIDE 21

Performance Analysis I

Every source concept is matched to each target concept and then the best is selected.

[Figure: source ontology Corine (Agricultural, Artificial, Arable, Pastures) matched against target ontology Realraum (Agricultural, Settlement, Arable, Dense, Transport)]
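The exhaustive matching described on this slide can be sketched as a loop over all source and target pairs. This is an illustration under assumptions: `best_matches` is a hypothetical name, and the scores in `toy_scores` are invented values standing in for the real similarity calculation.

```python
def best_matches(sources, targets, similarity):
    """Map every source concept to its most similar target concept.

    Compares each source with each target, so it performs
    len(sources) * len(targets) similarity calculations.
    """
    mapping = {}
    for source in sources:
        scored = [(similarity(source, target), target) for target in targets]
        best_score, best_target = max(scored, key=lambda pair: pair[0])
        mapping[source] = (best_target, best_score)
    return mapping


# Invented similarity scores for a few land-use concepts.
toy_scores = {
    ("Arable", "Arable"): 1.0,
    ("Arable", "Settlement"): 0.1,
    ("Pastures", "Arable"): 0.6,
    ("Pastures", "Settlement"): 0.2,
}

result = best_matches(
    ["Arable", "Pastures"],
    ["Arable", "Settlement"],
    lambda source, target: toy_scores[(source, target)],
)
```

Because every pair is compared, the number of similarity calculations grows with the product of the two ontology sizes, which is one reason larger ontologies become expensive to map.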

slide-22
SLIDE 22

Performance Analysis II

Total complexity of the similarity calculation: Polynomial time (O(N^5))
Loading and hierarchy calculation in Description Logics: Exponential time
Optimisation required for larger ontologies
  Removing the Description Logics reasoning
  Heuristics / Parallelisation for the similarity calculation

[Figure: pipeline diagram with the labels OWL, Hierarchy, Mapping, DL Reasoning, Similarity Calculation, Static Hierarchy, Heuristics]

slide-23
SLIDE 23

Performance Analysis III

From / To        Corine   Moland   Realraumanalyse
Corine           5sec     10sec    15sec
Moland           11sec    20sec    31sec
Realraumanalyse  18sec    34sec    52sec

Ontology         # Concepts  Avg # Prop.  Load Time
Corine           64          3            31sec
Moland           96          5            3min 19sec
Realraumanalyse  136         6            5min 16sec

slide-24
SLIDE 24

Overview III

A Semantic Similarity Model and Algorithm Evaluation of the Model and Algorithm Summary

slide-25
SLIDE 25

Summary

Cognitive model is capable of describing most real-world situations
Similarity algorithm works sufficiently well to be used in real-world situations (average correctness of above 95%)
Performance is the major bottleneck. Without improvement it is unusable for larger ontologies
Cognitive model needs to be extended in some areas

slide-26
SLIDE 26

Statistics

101 pages (a nice prime number)
  77 pages with actual content
  24 pages of structural padding

6 Chapters (average 12.8 pages per chapter)

29208 words
  Average of 379 words per page
  Most frequent word: similarity (239x)

62 Figures and 3 Tables
65 References
Total size: Source: 1.5MB, PDF: 1.4MB

slide-27
SLIDE 27

Questions, Comments, Praise

Thank you for listening