KNOWLEDGE MANAGEMENT AND APPLICATIONS David Sánchez Department of - - PowerPoint PPT Presentation


slide-1
SLIDE 1

KNOWLEDGE MANAGEMENT AND APPLICATIONS

April 2013

David Sánchez

Department of Computer Science and Mathematics

slide-2
SLIDE 2

Tarragona


slide-3
SLIDE 3

The university

 Created in 1991
 52 programmes of study
 Over 12,000 students

slide-4
SLIDE 4

The faculty

 Engineering degrees
   Computer science
   Telematics
 Masters
   Computer Security and Intelligent Systems
   Artificial Intelligence
   Security of the Information and Communication technologies
 Doctoral program
   Computer Engineering

slide-5
SLIDE 5

Research group

 Data privacy and electronic commerce
 Privacy and security in mobile environments
 Private information recovery and codes
 9 professors and lecturers
 6 post doctoral researchers
 7 Ph.D. students
 7 Research assistants

slide-6
SLIDE 6

Contents

 Introduction
 Knowledge acquisition
 Semantic operators
 Applications to privacy

slide-7
SLIDE 7

Motivation

 Numerical data is easy to manage and transform
   3 < 4 = true
   (1+2)/2 = 1.5
   {3, 2, 5} -> {2, 3, 5}
 A plethora of algorithms rely on arithmetical functions to deal with numerical data

slide-8
SLIDE 8

Motivation

 What about text?
   car > bike ?
   (apple + orange) / 2 = ??
   {flu, cold, pneumonia} -> {?, ?, ?}
 Arithmetical functions do not make sense
 Text (words, noun phrases) refers to concepts
 Concepts should be managed according to their formal semantics

slide-9
SLIDE 9

Ontologies

 Provide a structured representation of a shared conceptualization
 Elements
   Classes (concepts)
   Instances (individuals)
 Semantics
   Properties (semantic relationships)
   Restrictions (logical definition of meanings)

slide-10
SLIDE 10

Contents

 Introduction
 Knowledge acquisition
 Semantic operators
 Applications to privacy

slide-11
SLIDE 11

Creating ontologies

 Manually
   Knowledge formalization is challenging
   Knowledge can be subjective
   Time-consuming
 Assisted
   Proactive knowledge modelling tools
     Wizards
     Reasoners to check knowledge consistency
   Knowledge engineering methods
     101, METHONTOLOGY, On-To-Knowledge

slide-12
SLIDE 12

Ontology learning

 Semantics are implicitly conveyed in text
 Textual corpora can be analysed to acquire knowledge
   Discover concepts and individuals
   Discover and label relations
     Taxonomic (cancer is a disease)
     Non-taxonomic (cancer is treated with radiotherapy)
     Attributes (cancer is non-contagious)
   Discover restrictions
     Axioms (Spain borders France -> France borders Spain)

slide-13
SLIDE 13

Ontology learning from the Web

 Corpora: the Web
   The largest electronic repository
   Heterogeneous
   It approximates the distribution of information at a social scale
 Availability of massive IR tools: Web search engines

slide-14
SLIDE 14

Knowledge discovery from text

 NL processing tools to identify nouns, noun phrases and named entities
   Concepts and individuals
 Linguistic patterns to discover semantics
   Taxonomic
     “cities such as (Nimes)”, “cancers like (melanoma)”
   Non-taxonomic
     “cancer is treated with (surgery)”
   Attributes
     “camera has (10MP resolution)”, “camera features (3x zoom)”
   Axioms (functionality, transitivity, symmetry, reflexivity, etc.)
     “Spain borders France”, “France borders Spain” -> Symmetry
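The pattern-based extraction above can be sketched with plain regular expressions. This is only an illustration: the pattern set, the function names and the single-word matching are assumptions, and a real system would rely on NL processing tools to delimit noun phrases.

```python
import re

# Illustrative patterns modelled on the slide's examples (assumed, not the
# author's actual pattern set). Single words stand in for noun phrases.
PATTERNS = {
    "taxonomic": re.compile(r"(\w+) such as (\w+)"),
    "non_taxonomic": re.compile(r"(\w+) is treated with (\w+)"),
    "attribute": re.compile(r"(\w+) (?:has|features) ([\w ]+?)(?:[.,;]|$)"),
}

def extract_relations(text):
    """Return (kind, left, right) triples for every pattern match in text."""
    found = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            left = match.group(1).lower()
            right = match.group(2).strip().lower()
            found.append((kind, left, right))
    return found

def is_symmetric(relation, a, b, sentences):
    """A relation looks symmetric if 'a rel b' and 'b rel a' are both attested."""
    forward = f"{a} {relation} {b}".lower()
    backward = f"{b} {relation} {a}".lower()
    lowered = [s.lower() for s in sentences]
    return any(forward in s for s in lowered) and any(backward in s for s in lowered)
```

For instance, `extract_relations("We visited cities such as Nimes.")` yields a taxonomic candidate pairing "cities" with "nimes", which would still need statistical filtering.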

slide-15
SLIDE 15

Retrieval of suitable corpora

 Create appropriate web search queries
   Taxonomic: “cities such as” […]
   Non-taxonomic: “cancer is treated with” […]
   Attributes: “camera features” […]
   Axioms: “Spain borders” & “France borders”

slide-16
SLIDE 16

Statistical assessment

 Statistical assessor
   Web search engine (WSE) page counts approximate query probabilities at a social scale
 Use an association score to filter noisy extractions
   Point-wise mutual information
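A minimal sketch of such an association score, assuming page counts have already been obtained from a Web search engine. The function names, the `total_pages` normaliser and the threshold are assumptions for illustration; the slides do not specify them.

```python
import math

def pmi(count_xy, count_x, count_y, total_pages):
    """Point-wise mutual information estimated from (assumed) page counts.

    count_xy: pages matching both terms; count_x, count_y: pages matching
    each term alone; total_pages: assumed size of the indexed Web.
    """
    if min(count_xy, count_x, count_y) <= 0:
        return float("-inf")  # no evidence of co-occurrence
    p_xy = count_xy / total_pages
    p_x = count_x / total_pages
    p_y = count_y / total_pages
    return math.log2(p_xy / (p_x * p_y))

def filter_candidates(scores, threshold):
    """Keep the candidate names whose score reaches the threshold."""
    return [name for name, score in scores.items() if score >= threshold]
```

Candidates that co-occur with the pattern less often than chance would predict obtain a low or negative score and are discarded as noise.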

slide-17
SLIDE 17

References

 Taxonomic learning
   David Sánchez, Antonio Moreno: Pattern-based automatic taxonomy learning from the Web. AI Communications 21(1): 27-48 (2008)
 Non-taxonomic learning
   David Sánchez, Antonio Moreno: Learning non-taxonomic relationships from web documents for domain ontology construction. Data & Knowledge Engineering 64(3): 600-623 (2008)
 Attribute learning
   David Sánchez: A methodology to learn ontological attributes from the Web. Data & Knowledge Engineering 69(6): 573-597 (2010)
 Axiom learning
   David Sánchez, Antonio Moreno, Luis Del Vasto Terrientes: Learning relation axioms from text: An automatic Web-based approach. Expert Systems with Applications 39(5): 5792-5805 (2012)

slide-18
SLIDE 18

Contents

 Introduction
 Knowledge acquisition
 Semantic operators
 Applications to privacy

slide-19
SLIDE 19

Exploiting ontologies

 Structured knowledge enables a semantically-coherent interpretation of textual data by
   Defining semantically-grounded operators
 Semantic similarity is the most basic operator
   Similarity(apple, orange) > Similarity(apple, bike)

slide-20
SLIDE 20

Semantic similarity

 Semantic similarity
   Degree of taxonomical resemblance
     e.g., dogs and cats are similar as they are mammals
 Semantic relatedness
   Other non-taxonomic relationships are also considered
     e.g., car and wheel or pencil and paper
 Similarity measures can be grouped in several families according to
   the type of knowledge exploited
   the principles on which similarity estimation relies

slide-21
SLIDE 21

Ontology-based similarity


slide-22
SLIDE 22

Edge-counting measures

$$\mathrm{Distance}(a, b) = |\mathrm{min\_path}(a, b)|$$
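As a sketch of the edge-counting idea, the distance between two concepts can be computed as the number of edges on the shortest path that links them in a taxonomy. The toy taxonomy and the function names below are assumptions made for illustration only.

```python
from collections import deque

# Toy is-a taxonomy (child -> parent); the concepts are illustrative only.
PARENT = {
    "apple": "fruit", "orange": "fruit", "fruit": "food",
    "bike": "vehicle", "car": "vehicle",
    "food": "entity", "vehicle": "entity",
}

def neighbours(concept):
    """Concepts one taxonomic edge away (all children plus the parent)."""
    linked = [child for child, parent in PARENT.items() if parent == concept]
    if concept in PARENT:
        linked.append(PARENT[concept])
    return linked

def distance(a, b):
    """Number of edges on the shortest taxonomic path between a and b."""
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, hops = queue.popleft()
        if node == b:
            return hops
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    raise ValueError(f"no path between {a!r} and {b!r}")
```

In this toy taxonomy apple and orange are two edges apart, whereas apple and bike are five edges apart, so the former pair is the more similar one.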

slide-23
SLIDE 23

IC-based measures

Least Common Subsumer (LCS)

$$\mathrm{Sim}(a, b) = IC(\mathrm{LCS}(a, b))$$

slide-24
SLIDE 24

IC-based semantic similarity

 IC computation relies on probability assessments
   Based on corpora
   Requires general and heterogeneous corpora
   Language ambiguity hampers results
   Data sparseness produces weak statistics

$$IC(c) = -\log p(c)$$

slide-25
SLIDE 25

Ontology-based IC computation

 Assumption: concepts with many hyponyms in an ontology are more probable to appear in corpora
 Concept probabilities are intrinsically approximated according to taxonomic knowledge
   Number of hyponyms

$$IC(c) = -\log \frac{\mathrm{hyponyms}(c)}{\mathrm{ontology\_size}}$$
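The hyponym-based approximation can be sketched as follows, together with the LCS-based similarity Sim(a, b) = IC(LCS(a, b)). The toy taxonomy and function names are assumptions. The code also adds 1 to the hyponym count (counting the concept itself) so that leaf concepts do not yield log 0; this smoothing is an assumption of the sketch, not part of the formula on the slide.

```python
import math

# Toy is-a taxonomy (child -> parent); the concepts are illustrative only.
PARENT = {
    "apple": "fruit", "orange": "fruit", "fruit": "food",
    "bike": "vehicle", "car": "vehicle",
    "food": "entity", "vehicle": "entity",
}
CONCEPTS = set(PARENT) | set(PARENT.values())

def ancestors(concept):
    """The concept itself followed by its subsumers up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def hyponyms(concept):
    """Number of concepts that lie strictly below the given concept."""
    return sum(1 for c in CONCEPTS if c != concept and concept in ancestors(c))

def ic(concept):
    """IC from taxonomic knowledge; the +1 smoothing is an assumption."""
    probability = (hyponyms(concept) + 1) / len(CONCEPTS)
    return -math.log(probability)

def lcs(a, b):
    """Least Common Subsumer: the most specific ancestor shared by a and b."""
    a_chain = ancestors(a)
    return next(c for c in ancestors(b) if c in a_chain)

def sim(a, b):
    """Similarity as the IC of the Least Common Subsumer."""
    return ic(lcs(a, b))
```

With this taxonomy the root has an IC of zero, so `sim("apple", "bike")` is zero while `sim("apple", "orange")` is positive.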

slide-26
SLIDE 26

Feature-based measures

$$\mathrm{Sim}(a, b) = \frac{\mathrm{common\_features}(a, b)}{\mathrm{disjoint\_features}(a, b)}$$
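A rough sketch of this ratio, assuming that the features of a concept are the concept itself plus its taxonomic ancestors. That choice of features, the toy taxonomy and the handling of identical concepts are assumptions; the published measures use more elaborate formulations.

```python
# Toy is-a taxonomy (child -> parent); the concepts are illustrative only.
PARENT = {
    "apple": "fruit", "orange": "fruit", "fruit": "food",
    "bike": "vehicle", "car": "vehicle",
    "food": "entity", "vehicle": "entity",
}

def features(concept):
    """Assumed feature set: the concept plus all of its taxonomic ancestors."""
    found = {concept}
    while concept in PARENT:
        concept = PARENT[concept]
        found.add(concept)
    return found

def feature_similarity(a, b):
    """Ratio of shared features to non-shared features."""
    fa, fb = features(a), features(b)
    common = len(fa & fb)
    disjoint = len(fa ^ fb)  # features owned by only one of the two concepts
    # Identical feature sets have no disjoint features; treat as maximal.
    return float("inf") if disjoint == 0 else common / disjoint
```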

slide-27
SLIDE 27

References

Feature-based similarity measures

 Montserrat Batet, David Sánchez, Aïda Valls: An ontology-based measure to compute semantic similarity in biomedicine. Journal of Biomedical Informatics 44(1): 118-125 (2011)
 David Sánchez, Montserrat Batet, David Isern, Aïda Valls: Ontology-based semantic similarity: A new feature-based approach. Expert Systems with Applications 39(9): 7718-7728 (2012)

IC-based similarity measures

 Based on corpora
   David Sánchez, Montserrat Batet, Aïda Valls, Karina Gibert: Ontology-driven web-based semantic similarity. Journal of Intelligent Information Systems 35(3): 383-413 (2010)
 Based on ontologies
   David Sánchez, Montserrat Batet, David Isern: Ontology-based information content computation. Knowledge-Based Systems 24(2): 297-303 (2011)
   David Sánchez, Montserrat Batet: A New Model to Compute the Information Content of Concepts from Taxonomic Knowledge. International Journal on Semantic Web and Information Systems 8(2): 34-50 (2012)
   David Sánchez, Montserrat Batet: Semantic similarity estimation in the biomedical domain: An ontology-based information-theoretic perspective. Journal of Biomedical Informatics 44(5): 749-759 (2011)

slide-28
SLIDE 28

Other semantic operators

 Semantic similarity/distance is the base to develop other semantically-grounded operators over a sample of textual data
   Aggregation (mean/centroid)

$$\mathrm{Mean}(x_1, x_2, \ldots, x_n) = \arg\min_{c} \left\{ \sum_{i=1}^{n} \mathrm{distance}(c, x_i) \right\}$$

slide-29
SLIDE 29

Aggregation

Sample: colic, lumbago, lumbago, migraine, pain, appendicitis, gastritis, lumbago, migraine

Mean candidates  colic (1)  lumbago (3)  migraine (2)  appendicitis (1)  gastritis (1)  pain (1)  Sum dist
colic            -          3            3             4                 4              1         24
lumbago          3          -            2             5                 5              2         19
migraine         3          2            -             5                 5              2         21
appendicitis     4          5            5             -                 2              3         34
gastritis        4          5            5             2                 -              3         34
pain             1          2            2             3                 3              -         17
ache             2          1            1             4                 4              1         16
inflammation     3          4            4             1                 1              2         27
symptom          2          3            3             2                 2              1         22
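The aggregation example can be reproduced with a short sketch. The distances are those listed on the slide; the zero distance between a term and itself (left blank on the slide) and all identifier names are assumptions.

```python
# Sample from the slide, as value -> number of occurrences.
SAMPLE = {"colic": 1, "lumbago": 3, "migraine": 2,
          "appendicitis": 1, "gastritis": 1, "pain": 1}
VALUES = list(SAMPLE)

# Distances from each mean candidate to the sample values, in VALUES order.
# Self-distances are assumed to be 0 (the slide leaves those cells blank).
ROWS = {
    "colic":        [0, 3, 3, 4, 4, 1],
    "lumbago":      [3, 0, 2, 5, 5, 2],
    "migraine":     [3, 2, 0, 5, 5, 2],
    "appendicitis": [4, 5, 5, 0, 2, 3],
    "gastritis":    [4, 5, 5, 2, 0, 3],
    "pain":         [1, 2, 2, 3, 3, 0],
    "ache":         [2, 1, 1, 4, 4, 1],
    "inflammation": [3, 4, 4, 1, 1, 2],
    "symptom":      [2, 3, 3, 2, 2, 1],
}

def sum_of_distances(candidate):
    """Sum of distances from the candidate to every record of the sample."""
    return sum(d * SAMPLE[v] for d, v in zip(ROWS[candidate], VALUES))

def centroid():
    """The candidate that minimises the sum of distances (the arg min)."""
    return min(ROWS, key=sum_of_distances)
```

Under these distances the minimum sum is 16, obtained by "ache", a term that does not appear in the sample itself: candidates are drawn from the ontology, not only from the data.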

slide-30
SLIDE 30

Sorting algorithm

Algorithm. Sorting procedure

Inputs: P (dataset)
Output: P’ (P sorted)

Compute the mean of all values in P
Consider the most distant value f to the mean
Add f to P’ and remove it from P
while (|P| > 0) do
    Obtain the least distant value r to f
    Add r to P’ and remove it from P
Output P’
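The sorting procedure can be sketched in Python as below. The `distance` and `mean` callables are assumed to be supplied by the caller (for textual data, a semantic distance and a semantic centroid); their names and signatures are hypothetical.

```python
def semantic_sort(values, distance, mean):
    """Order values by their distance to the value farthest from the mean.

    distance(a, b) and mean(values) are caller-supplied operators.
    """
    remaining = list(values)
    if not remaining:
        return []
    centre = mean(remaining)
    # Most distant value f to the mean: it opens the sorted output.
    first = max(remaining, key=lambda v: distance(v, centre))
    remaining.remove(first)
    ordered = [first]
    # Repeatedly take the remaining value that is closest to f.
    while remaining:
        nearest = min(remaining, key=lambda v: distance(v, first))
        remaining.remove(nearest)
        ordered.append(nearest)
    return ordered
```

With numbers, an absolute-difference distance and the arithmetic mean, `semantic_sort([5, 1, 9, 4], lambda a, b: abs(a - b), lambda xs: sum(xs) / len(xs))` starts from 9 and proceeds towards 1.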

slide-31
SLIDE 31

References

 Sergio Martínez, Aïda Valls, David Sánchez: Semantically-grounded construction of centroids for datasets with textual attributes. Knowledge-Based Systems 35: 160-172 (2012)
 Sergio Martínez, David Sánchez, Aïda Valls: A semantic framework to protect the privacy of electronic health records with non-numerical attributes. Journal of Biomedical Informatics 46(2): 294-303
 Josep Domingo-Ferrer, David Sánchez, Guillem Rufian-Torrell: Anonymization of Nominal Data Based on Semantic Marginality. Information Sciences. To appear

slide-32
SLIDE 32

Contents

 Introduction
 Knowledge acquisition
 Semantic operators
 Applications to privacy

slide-33
SLIDE 33

Statistical Disclosure Control

Statistical Offices: Collect data -> Remove or mask data that could reveal personal identities -> Publish anonymized data

Identifiers   Quasi-identifying attributes          Confidential attributes
D.N.I.        Birth place  Birth year  Occupation   Income
12345678      Valls        1962        Lawyer       2,300
87654321      Cambrils     1965        Judge        5,000

 The Statistical Disclosure Control (SDC) discipline aims at protecting statistical data in such a way that it
   Can be released and exploited
   Without publishing information that could identify a specific individual

slide-34
SLIDE 34

Statistical Disclosure Control

K-Anonymity

For each combination of quasi-identifier values, at least k records in the dataset share that same combination

Information loss
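The k-anonymity condition can be checked with a few lines of code. The sketch below uses the two records of the earlier example table; the field names and the generalised values of the masked version are hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(size >= k for size in groups.values())

QUASI_IDENTIFIERS = ("birth_place", "birth_year", "occupation")

# Records from the example table (identifiers already removed).
original = [
    {"birth_place": "Valls", "birth_year": "1962", "occupation": "Lawyer", "income": 2300},
    {"birth_place": "Cambrils", "birth_year": "1965", "occupation": "Judge", "income": 5000},
]

# Hypothetical masked version: quasi-identifiers generalised to shared values.
masked = [
    {"birth_place": "Tarragona province", "birth_year": "1960-1969",
     "occupation": "Legal professional", "income": 2300},
    {"birth_place": "Tarragona province", "birth_year": "1960-1969",
     "occupation": "Legal professional", "income": 5000},
]
```

Here `is_k_anonymous(original, QUASI_IDENTIFIERS, 2)` is false, whereas the masked version satisfies 2-anonymity at the cost of less specific values, which is the information loss mentioned above.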

slide-35
SLIDE 35

Statistical Disclosure Control

 Many masking methods have been designed to build groups of k-anonymous datasets
   Focusing mainly on numerical attributes
 Our goal
   To apply semantically-grounded operators to existing anonymization algorithms
     Recoding
     Microaggregation
     Resampling
   The anonymized output would preserve the meaning of the input
     Retain the analytical utility

slide-36
SLIDE 36

Anonymization via microaggregation

[Figure: microaggregation example. The original dataset (values such as holidays, sports, nature, relaxation and surfing) is partitioned into clusters, and each value is replaced by the centroid of its cluster (nature, calmness or sports) to produce the anonymised dataset.]

slide-37
SLIDE 37

Document sanitization

 Should
   Avoid disclosure
   Preserve semantics
 We rely on the Information Content of terms
   Highly informative terms (high IC) are removed

slide-38
SLIDE 38

Document sanitization

 Given an input document
   Ex: “Peter Greenow, from United States, suffers from pancreatic cancer”
 Fix a sanitization threshold
   The most specific concept that may be disclosed (Ex: “Cancer”)
 Extract noun phrases from text
   “Peter Greenow”, “United States”, “Pancreatic Cancer”

slide-39
SLIDE 39

Document sanitization

 Compute the IC of each noun phrase and detect sensitive ones
   IC(Peter Greenow) > IC(Cancer)
   IC(United States) < IC(Cancer)
   IC(Pancreatic Cancer) > IC(Cancer)
 Replace sensitive terms by generalized versions (from an ontology) fulfilling the threshold
   Ex: “[Person], from United States, suffers from [cancer]”
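The sanitization steps can be sketched as follows. The IC values and the generalisation table are made-up placeholders chosen only to respect the inequalities above; in practice the IC would be estimated from corpora and the generalisations retrieved from an ontology.

```python
# Hypothetical IC values; only their ordering relative to "cancer" matters.
IC = {
    "peter greenow": 12.0,
    "united states": 3.0,
    "pancreatic cancer": 9.0,
    "cancer": 6.0,
}

# Hypothetical generalisations that fulfil the threshold.
GENERALISATION = {
    "peter greenow": "[Person]",
    "pancreatic cancer": "[cancer]",
}

def sanitize(text, noun_phrases, threshold_term):
    """Replace every noun phrase that is more informative than the threshold."""
    threshold = IC[threshold_term.lower()]
    for phrase in noun_phrases:
        if IC[phrase.lower()] > threshold:
            text = text.replace(phrase, GENERALISATION[phrase.lower()])
    return text

document = "Peter Greenow, from United States, suffers from pancreatic cancer"
phrases = ["Peter Greenow", "United States", "pancreatic cancer"]
sanitized = sanitize(document, phrases, "Cancer")
```

With these placeholder values, `sanitized` reproduces the example output: the person name and the specific diagnosis are generalised, while "United States" is kept because its IC is below the threshold.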

slide-40
SLIDE 40

Publications

 Data anonymization
   Sergio Martínez, David Sánchez, Aïda Valls: Semantic adaptive microaggregation of categorical microdata. Computers & Security 31(5): 653-672 (2012)
   Sergio Martínez, David Sánchez, Aïda Valls, Montserrat Batet: Privacy protection of textual attributes through a semantic-based masking method. Information Fusion 13(4): 304-314 (2012)
   Sergio Martínez, David Sánchez, Aïda Valls: A semantic framework to protect the privacy of electronic health records with non-numerical attributes. Journal of Biomedical Informatics 46(2): 294-303
   Montserrat Batet, Arnau Erola, David Sánchez, Jordi Castellà-Roca: Utility preserving query log anonymization via semantic resampling. Information Sciences. To appear
 Document sanitization
   David Sánchez, Montserrat Batet, Alexandre Viejo: Automatic general-purpose sanitization of textual documents. IEEE Transactions on Information Forensics and Security. To appear