
SLIDE 1

KNOWLEDGE MANAGEMENT AND APPLICATIONS

April 2013

David Sánchez

Department of Computer Science and Mathematics

SLIDE 2

Tarragona


SLIDE 3

The university

- Created in 1991
- 52 programmes of study
- Over 12,000 students

SLIDE 4

The faculty

- Engineering degrees
  - Computer science
  - Telematics
- Masters
  - Computer Security and Intelligent Systems
  - Artificial Intelligence
  - Security of the Information and Communication technologies
- Doctoral program
  - Computer Engineering

SLIDE 5

Research group

- Data privacy and electronic commerce
- Privacy and security in mobile environments
- Private information recovery and codes
- 9 professors and lecturers
- 6 postdoctoral researchers
- 7 Ph.D. students
- 7 research assistants

SLIDE 6

Motivation

- Numerical data is easy to manage and transform
  - 3 < 4 = true
  - (1 + 2) / 2 = 1.5
  - {3, 2, 5} -> {2, 3, 5}
- A plethora of algorithms rely on arithmetical functions to deal with numerical data

SLIDE 8

Motivation

- What about text?
  - car >? bike
  - (apple + orange) / 2 = ??
  - {flu, cold, pneumonia} -> {?, ?, ?}
- Arithmetical functions do not make sense
- Text (words, noun phrases) refers to concepts
- Concepts should be managed according to their formal semantics

SLIDE 9

Ontologies

- Provide a structured representation of a shared conceptualization
- Elements
  - Classes (concepts)
  - Instances (individuals)
- Semantics
  - Properties (semantic relationships)
  - Restrictions (logical definition of meanings)

SLIDE 10

Creating ontologies

- Manually
  - Knowledge formalization is challenging
  - Knowledge can be subjective
  - Time consuming
- Assisted
  - Proactive knowledge modelling tools
    - Wizards
    - Reasoners to check knowledge consistency
  - Knowledge engineering methods
    - 101, METHONTOLOGY, On-To-Knowledge

SLIDE 12

Ontology learning

- Semantics are implicitly referred to in text
- Textual corpora can be analysed to acquire knowledge
  - Discover concepts and individuals
  - Discover and label relations
    - Taxonomic (cancer is a disease)
    - Non-taxonomic (cancer is treated with radiotherapy)
    - Attributes (cancer is non-contagious)
  - Discover restrictions
    - Axioms (Spain borders France -> France borders Spain)

SLIDE 13

Ontology learning from the Web

- Corpora: the Web
  - The largest electronic repository
  - Heterogeneous
  - It approximates the distribution of information at a social scale
- Availability of massive IR tools: Web search engines

SLIDE 14

Knowledge discovery from text

- NL processing tools to identify nouns, noun phrases and named entities
  - Concepts and individuals
- Linguistic patterns to discover semantics
  - Taxonomic
    - “cities such as (Nimes)”, “cancers like (melanoma)”
  - Non-taxonomic
    - “cancer is treated with (surgery)”
  - Attributes
    - “camera has (10MP resolution)”, “camera features (3x zoom)”
  - Axioms (functionality, transitivity, symmetry, reflexivity, etc.)
    - “Spain borders France”, “France borders Spain” -> Symmetry
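The taxonomic "such as" pattern can be sketched with a regular expression. This is only a minimal illustration, assuming plain-text snippets are already available and that instances are capitalised words; the function name `extract_hyponyms` and the sample sentence are hypothetical, and a real system would rely on the NL processing tools mentioned on this slide to delimit noun phrases.

```python
import re

# Hearst-style pattern: "<concept> such as <Instance>, <Instance> and <Instance>".
# Assumption: instances are single capitalised words separated by ", ", " and " or " or ".
SUCH_AS = re.compile(r"(\w+) such as ((?:[A-Z]\w+(?:, | and | or )?)+)")


def extract_hyponyms(text):
    """Return (concept, instance) pairs matched by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        concept = match.group(1).lower()
        for candidate in re.split(r", | and | or ", match.group(2)):
            if candidate:  # skip the empty piece left by a trailing separator
                pairs.append((concept, candidate))
    return pairs


pairs = extract_hyponyms("We visited cities such as Nimes, Tarragona and Lyon last year.")
print(pairs)
```

Each extracted pair is only a candidate; the statistical assessment described later in the deck is what would decide whether to keep it.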

SLIDE 15

Retrieval of suitable corpora

- Create appropriate web search queries
  - Taxonomic: “cities such as” […]
  - Non-taxonomic: “cancer is treated with” […]
  - Attributes: “camera features” […]
  - Axioms: “Spain borders” & “France borders”

SLIDE 16

Statistical assessment

- Statistical assessor
  - WSE (Web search engine) page count approximates query probabilities at a social scale
- Use an association score to filter noisy extractions
  - Point-wise mutual information
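A point-wise mutual information score estimated from page counts could look like the sketch below. The function names, the base-2 logarithm, the threshold and all hit counts are illustrative assumptions, not values from the talk; obtaining real page counts from a search engine is left out.

```python
import math


def pmi(hits_a, hits_b, hits_ab, total_pages):
    """PMI(a, b) = log2( p(a, b) / (p(a) * p(b)) ), with p estimated as hits / total_pages."""
    p_a = hits_a / total_pages
    p_b = hits_b / total_pages
    p_ab = hits_ab / total_pages
    return math.log2(p_ab / (p_a * p_b))


def keep_extraction(hits_a, hits_b, hits_ab, total_pages, threshold=0.0):
    """Keep a candidate only if its terms co-occur more often than chance (hypothetical rule)."""
    if hits_ab == 0:
        return False  # never seen together: treat as noise
    return pmi(hits_a, hits_b, hits_ab, total_pages) > threshold


# Hypothetical counts: each term on 1,000 pages, both together on 100, out of 1,000,000 pages.
score = pmi(1000, 1000, 100, 1_000_000)
print(round(score, 3))
```

A score near zero means the two terms co-occur about as often as independent terms would, which is why a positive threshold is a natural filter in this sketch.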

SLIDE 17

References

- Taxonomic learning
  - David Sánchez, Antonio Moreno: Pattern-based automatic taxonomy learning from the Web. AI Communications 21(1): 27-48 (2008)
- Non-taxonomic learning
  - David Sánchez, Antonio Moreno: Learning non-taxonomic relationships from web documents for domain ontology construction. Data & Knowledge Engineering 64(3): 600-623 (2008)
- Attribute learning
  - David Sánchez: A methodology to learn ontological attributes from the Web. Data & Knowledge Engineering 69(6): 573-597 (2010)
- Axiom learning
  - David Sánchez, Antonio Moreno, Luis Del Vasto Terrientes: Learning relation axioms from text: An automatic Web-based approach. Expert Systems with Applications 39(5): 5792-5805 (2012)

SLIDE 18

Exploiting ontologies

- Structured knowledge enables a semantically-coherent interpretation of textual data by
  - Defining semantically-grounded operators
- Semantic similarity is the most basic operator
  - Similarity(apple, orange) > Similarity(apple, bike)

SLIDE 20

Semantic similarity

- Semantic similarity
  - Degree of taxonomical resemblance
    - e.g., dogs and cats are similar as they are mammals
- Semantic relatedness
  - Other non-taxonomic relationships are also considered
    - e.g., car and wheel or pencil and paper
- Similarity measures can be grouped in several families according to
  - the type of knowledge exploited
  - the principles on which similarity estimation relies

SLIDE 21

Ontology-based similarity


SLIDE 22

Edge-counting measures

$$\mathrm{Distance}(a, b) = |\mathrm{min\_path}(a, b)|$$
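Edge counting can be sketched as a breadth-first search over is-a links. The toy taxonomy `IS_A` and the function name `path_length` are hypothetical, chosen only to echo the apple/orange/bike example used earlier in the deck.

```python
from collections import deque

# Hypothetical toy taxonomy: child -> parent (is-a) links.
IS_A = {
    "apple": "fruit", "orange": "fruit", "fruit": "food", "food": "entity",
    "bike": "vehicle", "vehicle": "artifact", "artifact": "entity",
}


def path_length(a, b, is_a=IS_A):
    """Number of edges on the shortest taxonomic path between a and b (None if unconnected)."""
    neighbours = {}
    for child, parent in is_a.items():
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


print(path_length("apple", "orange"), path_length("apple", "bike"))
```

In this toy taxonomy the shorter path between apple and orange reflects the intuition that Similarity(apple, orange) > Similarity(apple, bike).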

SLIDE 23

IC-based measures

Least Common Subsumer (LCS)

$$\mathrm{Sim}(a, b) = \mathrm{IC}(\mathrm{LCS}(a, b))$$
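A minimal sketch of this measure, assuming a single-inheritance toy taxonomy and made-up concept probabilities: the names `IS_A`, `P`, `subsumers`, `lcs` and `sim` are all hypothetical, and IC is taken as the negative base-2 logarithm of the concept probability.

```python
import math

# Hypothetical taxonomy (child -> parent) and hypothetical concept probabilities.
IS_A = {"dog": "mammal", "cat": "mammal", "mammal": "animal",
        "sparrow": "bird", "bird": "animal"}
P = {"animal": 1.0, "mammal": 0.5, "bird": 0.5,
     "dog": 0.25, "cat": 0.25, "sparrow": 0.5}


def subsumers(c):
    """Concept c followed by its ancestors, from most to least specific."""
    chain = [c]
    while c in IS_A:
        c = IS_A[c]
        chain.append(c)
    return chain


def lcs(a, b):
    """Least Common Subsumer: the most specific concept that subsumes both a and b."""
    ancestors_b = set(subsumers(b))
    for c in subsumers(a):
        if c in ancestors_b:
            return c
    return None


def ic(c):
    return -math.log2(P[c])


def sim(a, b):
    return ic(lcs(a, b))


print(lcs("dog", "cat"), sim("dog", "cat"), sim("dog", "sparrow"))
```

Pairs whose only shared ancestor is the root get a similarity of zero here, because the root has probability 1 and therefore carries no information.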

SLIDE 24

IC-based semantic similarity

- IC calculus relies on probability assessments
  - Based on corpora
    - Requires general and heterogeneous corpora
    - Language ambiguity hampers results
    - Data sparseness produces weak statistics

$$\mathrm{IC}(c) = -\log p(c)$$

SLIDE 25

Ontology-based IC computation

- Assumption: concepts with many hyponyms in an ontology are more likely to appear in corpora
- Concept probabilities are intrinsically approximated according to taxonomic knowledge
  - Number of hyponyms

$$\mathrm{IC}(c) = -\log \frac{\mathrm{hyponyms}(c)}{\mathrm{ontology\_size}}$$
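The intrinsic formula can be tried on a toy taxonomy as below. This is a sketch under stated assumptions: `IS_A` is hypothetical, the logarithm is base 2, and `hyponyms` counts the concept itself so that leaf concepts do not produce log(0); the published models cited in the references may define these terms differently.

```python
import math

# Hypothetical toy taxonomy: child -> parent (is-a) links.
IS_A = {"dog": "mammal", "cat": "mammal", "mammal": "animal",
        "sparrow": "bird", "bird": "animal"}


def hyponyms(c):
    """Count c plus every concept below it (counting c itself is an assumption, see lead-in)."""
    return 1 + sum(hyponyms(child) for child, parent in IS_A.items() if parent == c)


def intrinsic_ic(c):
    """IC(c) = -log2(hyponyms(c) / ontology_size), using only taxonomic knowledge."""
    ontology_size = len(set(IS_A) | set(IS_A.values()))
    return -math.log2(hyponyms(c) / ontology_size)


for concept in ("animal", "mammal", "dog"):
    print(concept, intrinsic_ic(concept))
```

General concepts near the root get a low IC and leaves get the highest, without consulting any corpus.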

SLIDE 26

Feature-based measures

$$\mathrm{Sim}(a, b) = \frac{\mathrm{common\_features}(a, b)}{\mathrm{disjoint\_features}(a, b)}$$
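One way to read this ratio is to take the taxonomic subsumers of each concept as its features. That choice, the toy taxonomy `IS_A`, and the handling of identical concepts (no disjoint features) are assumptions made for this sketch; it illustrates the schematic ratio on the slide rather than the exact measures in the cited papers.

```python
import math

# Hypothetical toy taxonomy: child -> parent (is-a) links.
IS_A = {"dog": "mammal", "cat": "mammal", "mammal": "animal",
        "sparrow": "bird", "bird": "animal"}


def features(c):
    """Assumed feature set: the concept itself plus all of its taxonomic ancestors."""
    feats = {c}
    while c in IS_A:
        c = IS_A[c]
        feats.add(c)
    return feats


def feature_sim(a, b):
    """Ratio of shared features to non-shared features."""
    fa, fb = features(a), features(b)
    common = len(fa & fb)
    disjoint = len(fa ^ fb)  # features held by exactly one of the two concepts
    return math.inf if disjoint == 0 else common / disjoint


print(feature_sim("dog", "cat"), feature_sim("dog", "sparrow"))
```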

SLIDE 27

References

- Feature-based similarity measures
  - Montserrat Batet, David Sánchez, Aïda Valls: An ontology-based measure to compute semantic similarity in biomedicine. Journal of Biomedical Informatics 44(1): 118-125 (2011)
  - David Sánchez, Montserrat Batet, David Isern, Aïda Valls: Ontology-based semantic similarity: A new feature-based approach. Expert Systems with Applications 39(9): 7718-7728 (2012)
- IC-based similarity measures
  - Based on corpora
    - David Sánchez, Montserrat Batet, Aïda Valls, Karina Gibert: Ontology-driven web-based semantic similarity. Journal of Intelligent Information Systems 35(3): 383-413 (2010)
  - Based on ontologies
    - David Sánchez, Montserrat Batet, David Isern: Ontology-based information content computation. Knowledge-Based Systems 24(2): 297-303 (2011)
    - David Sánchez, Montserrat Batet: A New Model to Compute the Information Content of Concepts from Taxonomic Knowledge. International Journal on Semantic Web and Information Systems 8(2): 34-50 (2012)
    - David Sánchez, Montserrat Batet: Semantic similarity estimation in the biomedical domain: An ontology-based information-theoretic perspective. Journal of Biomedical Informatics 44(5): 749-759 (2011)

SLIDE 28

Other semantic operators

- Semantic similarity/distance is the base to develop other semantically-grounded operators over a sample of textual data
  - Aggregation (mean/centroid)

$$\mathrm{Mean}(x_1, x_2, \ldots, x_n) = \arg\min_{c} \left\{ \sum_{i=1}^{n} \mathrm{distance}(c, x_i) \right\}$$

SLIDE 29

Aggregation

Sample: colic, lumbago, lumbago, migraine, pain, appendicitis, gastritis, lumbago, migraine

Mean candidates  colic (1)  lumbago (3)  migraine (2)  appendicitis (1)  gastritis (1)  pain (1)  Sum dist
colic            -          3            3             4                 4              1         24
lumbago          3          -            2             5                 5              2         19
migraine         3          2            -             5                 5              2         21
appendicitis     4          5            5             -                 2              3         34
gastritis        4          5            5             2                 -              3         34
pain             1          2            2             3                 3              -         17
ache             2          1            1             4                 4              1         16
inflammation     3          4            4             1                 1              2         27
symptom          2          3            3             2                 2              1         22
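The "Sum dist" column can be recomputed from the pairwise distances and the sample frequencies, as in the sketch below. The distances are copied from the table on this slide; treating the blank self-distance cells as 0 is an assumption, and the names `COLUMNS`, `ROWS`, `sum_dist` and `centroid` are hypothetical.

```python
from collections import Counter

SAMPLE = ["colic", "lumbago", "lumbago", "migraine", "pain",
          "appendicitis", "gastritis", "lumbago", "migraine"]

# Distance from each mean candidate (row) to each distinct sample value (column).
COLUMNS = ["colic", "lumbago", "migraine", "appendicitis", "gastritis", "pain"]
ROWS = {
    "colic":        [0, 3, 3, 4, 4, 1],
    "lumbago":      [3, 0, 2, 5, 5, 2],
    "migraine":     [3, 2, 0, 5, 5, 2],
    "appendicitis": [4, 5, 5, 0, 2, 3],
    "gastritis":    [4, 5, 5, 2, 0, 3],
    "pain":         [1, 2, 2, 3, 3, 0],
    "ache":         [2, 1, 1, 4, 4, 1],
    "inflammation": [3, 4, 4, 1, 1, 2],
    "symptom":      [2, 3, 3, 2, 2, 1],
}


def sum_dist(candidate, sample):
    """Sum of distances from the candidate to every value in the sample (with repetitions)."""
    counts = Counter(sample)
    return sum(counts[value] * d for value, d in zip(COLUMNS, ROWS[candidate]))


def centroid(sample):
    """Candidate minimising the summed distance, i.e. the arg min in the Mean definition."""
    return min(ROWS, key=lambda c: sum_dist(c, sample))


print(centroid(SAMPLE), sum_dist(centroid(SAMPLE), SAMPLE))
```

Note that the winning candidate does not occur in the sample itself: candidates are drawn from the ontology, not only from the observed values.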

SLIDE 30

Sorting algorithm

Algorithm. Sorting procedure

Inputs: P (dataset)
Output: P’ (P sorted)

Compute the mean of all values in P
Consider the most distant value f to the mean
Add f to P’ and remove it from P
while (|P| > 0) do
    Obtain the least distant value r to f
    Add r to P’ and remove it from P
Output P’
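The procedure can be sketched generically by passing the distance and mean operators as parameters. The function name `semantic_sort` is hypothetical, and the numeric example with absolute difference and arithmetic mean is only a stand-in for a semantic distance and a semantic centroid.

```python
from statistics import mean


def semantic_sort(values, distance, centre_of):
    """Order values by closeness to the value farthest from the mean, as in the procedure above."""
    remaining = list(values)
    centre = centre_of(remaining)
    # Start from the value that is most distant from the mean.
    first = max(remaining, key=lambda v: distance(v, centre))
    remaining.remove(first)
    ordered = [first]
    # Repeatedly take the remaining value closest to that first value.
    while remaining:
        nearest = min(remaining, key=lambda v: distance(v, first))
        remaining.remove(nearest)
        ordered.append(nearest)
    return ordered


# Numeric stand-in: absolute difference as distance, arithmetic mean as centre.
result = semantic_sort([3, 2, 5], distance=lambda a, b: abs(a - b), centre_of=mean)
print(result)
```

With textual data, `distance` would be a semantic distance and `centre_of` a centroid operator such as the one defined on the aggregation slide.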

SLIDE 31

References

- Sergio Martínez, Aïda Valls, David Sánchez: Semantically-grounded construction of centroids for datasets with textual attributes. Knowledge-Based Systems 35: 160-172 (2012)
- Sergio Martínez, David Sánchez, Aïda Valls: A semantic framework to protect the privacy of electronic health records with non-numerical attributes. Journal of Biomedical Informatics 46(2): 294-303
- Josep Domingo-Ferrer, David Sánchez, Guillem Rufian-Torrell: Anonymization of Nominal Data Based on Semantic Marginality. Information Sciences. To appear

SLIDE 32

Statistical Disclosure Control

Statistical offices: collect data -> remove or mask data that could reveal personal identities -> publish anonymized data

Identifiers  Quasi-identifying attributes          Confidential attributes
D.N.I.       Birth place  Birth year  Occupation   Income
12345678     Valls        1962        Lawyer       2,300
87654321     Cambrils     1965        Judge        5,000

- The Statistical Disclosure Control (SDC) discipline aims at protecting statistical data so that it:
  - Can be released and exploited
  - Without publishing information that could identify a concrete individual

SLIDE 34

Statistical Disclosure Control


K-Anonymity

For each combination of quasi-identifiers values, at least k records exist in the dataset with the same combination

Information loss
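The k-anonymity condition can be checked by counting how many records share each combination of quasi-identifier values. In this sketch the two original records come from the earlier Statistical Disclosure Control slide; the field names, the function `is_k_anonymous` and the generalised values in `MASKED` are hypothetical.

```python
from collections import Counter

QUASI_IDENTIFIERS = ["birth_place", "birth_year", "occupation"]

RECORDS = [
    {"birth_place": "Valls", "birth_year": "1962", "occupation": "Lawyer", "income": 2300},
    {"birth_place": "Cambrils", "birth_year": "1965", "occupation": "Judge", "income": 5000},
]

# Hypothetical masked version: quasi-identifiers generalised to a common value.
MASKED = [
    {"birth_place": "Catalonia", "birth_year": "1960-1969", "occupation": "Legal professional", "income": 2300},
    {"birth_place": "Catalonia", "birth_year": "1960-1969", "occupation": "Legal professional", "income": 5000},
]


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(size >= k for size in groups.values())


print(is_k_anonymous(RECORDS, QUASI_IDENTIFIERS, 2), is_k_anonymous(MASKED, QUASI_IDENTIFIERS, 2))
```

The masked records satisfy the condition but are less precise than the originals, which is the information loss named on this slide.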

SLIDE 35

Statistical Disclosure Control

- Many masking methods have been designed to build groups of k-anonymous datasets
  - Focusing mainly on numerical attributes
- Our goal
  - To apply semantically-grounded operators to existing anonymization algorithms
    - Recoding
    - Microaggregation
    - Resampling
  - The anonymized output would preserve the meaning of the input
    - Retain the analytical utility

SLIDE 36

Anonymization via microaggregation

[Figure: microaggregation example with panels "Original dataset", "Clusters", "Centroid" and "Anonymised dataset". An original dataset of activity terms (holidays, sports, nature, relaxation, surfing) is grouped into clusters, and the anonymised dataset contains only the centroid terms nature, calmness and sports.]

SLIDE 37

Document sanitization

- Should
  - Avoid disclosure
  - Preserve semantics
- We rely on the Information Content of terms
  - Highly informative terms (high IC) are removed

SLIDE 38

Document sanitization

- Given an input document
  - Ex: “Peter Greenow, from United States, suffers from pancreatic cancer”
- Fix a sanitization threshold
  - The most concrete concept to be known. (Ex: “Cancer”)
- Extract noun phrases from text
  - “Peter Greenow”, “United States”, “Pancreatic Cancer”

SLIDE 39

Document sanitization

- Compute the IC of each noun phrase and detect sensitive ones
  - IC(Peter Greenow) > IC(Cancer)
  - IC(United States) < IC(Cancer)
  - IC(Pancreatic Cancer) > IC(Cancer)
- Replace sensitive terms by generalized versions (from an ontology) fulfilling the threshold
  - Ex: “[Person], from United States, suffers from [cancer]”
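The thresholding step can be sketched as follows. The IC values and the generalisation table are invented for illustration and only respect the orderings stated on this slide; in practice the IC would be estimated (for example from web page counts) and the generalisations retrieved from an ontology. The function name `sanitize` is hypothetical.

```python
# Hypothetical IC values, chosen only to respect the inequalities on the slide.
IC = {"Peter Greenow": 20.0, "United States": 5.0, "pancreatic cancer": 12.0, "cancer": 8.0}

# Hypothetical generalisations that an ontology lookup might return.
GENERALISATION = {"Peter Greenow": "[Person]", "pancreatic cancer": "[cancer]"}


def sanitize(text, noun_phrases, threshold_concept):
    """Replace every noun phrase that is more informative than the threshold concept."""
    threshold = IC[threshold_concept]
    for phrase in noun_phrases:
        if IC[phrase] > threshold:
            text = text.replace(phrase, GENERALISATION[phrase])
    return text


document = "Peter Greenow, from United States, suffers from pancreatic cancer"
phrases = ["Peter Greenow", "United States", "pancreatic cancer"]
sanitized = sanitize(document, phrases, "cancer")
print(sanitized)
```

"United States" is left untouched because its IC is below the threshold, so the sanitized text keeps as much of the original meaning as the threshold allows.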

SLIDE 40

Publications

- Data anonymization
  - Sergio Martínez, David Sánchez, Aïda Valls: Semantic adaptive microaggregation of categorical microdata. Computers & Security 31(5): 653-672 (2012)
  - Sergio Martínez, David Sánchez, Aïda Valls, Montserrat Batet: Privacy protection of textual attributes through a semantic-based masking method. Information Fusion 13(4): 304-314 (2012)
  - Sergio Martínez, David Sánchez, Aïda Valls: A semantic framework to protect the privacy of electronic health records with non-numerical attributes. Journal of Biomedical Informatics 46(2): 294-303
  - Montserrat Batet, Arnau Erola, David Sánchez, Jordi Castellà-Roca: Utility preserving query log anonymization via semantic resampling. Information Sciences. To appear
- Document sanitization
  - David Sánchez, Montserrat Batet, Alexandre Viejo: Automatic general-purpose sanitization of textual documents. IEEE Transactions on Information Forensics and Security. To appear
