KNOWLEDGE MANAGEMENT AND APPLICATIONS
April 2013
David Sánchez
Department of Computer Science and Mathematics
Tarragona
The university
  Created in 1991
  52 programmes of study
  Over 12,000 students
The faculty
Engineering degrees
  Computer science
  Telematics
Masters
  Computer Security and Intelligent Systems
  Artificial Intelligence
  Security of the Information and Communication technologies
Doctoral program
  Computer Engineering
Data privacy and electronic commerce
Privacy and security in mobile environments
Private information recovery and codes
9 professors and lecturers
6 postdoctoral researchers
7 Ph.D. students
7 research assistants
Introduction
Knowledge acquisition
Semantic operators
Applications to privacy
Numerical data is easy to manage and
  3 < 4 = true
  (1 + 2) / 2 = 1.5
  {3, 2, 5} -> {2, 3, 5}
A plethora of algorithms rely on arithmetical
What about text?
  Car >? bike
  (apple + orange) / 2 = ??
  {flu, cold, pneumonia} -> {?, ?, ?}
Arithmetical functions do not make sense
Text (words, noun phrases) refers to concepts
Concepts should be managed according to
Provide a structured representation of a
Elements
  Classes (concepts)
  Instances (individuals)
Semantics
  Properties (semantic relationships)
  Restrictions (logical definition of meanings)
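As a rough illustration of the four ingredients listed above, the following minimal sketch stores classes, instances, properties and one kind of restriction (symmetry) in plain Python containers. The `Ontology` class and its method names are hypothetical and only meant to make the vocabulary concrete; real ontologies are usually written in a dedicated language and checked by a reasoner.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    # class name -> parent class name ("is-a" link); None marks the root
    classes: dict = field(default_factory=dict)
    # instance name -> class name
    instances: dict = field(default_factory=dict)
    # (subject, property, object) triples
    properties: set = field(default_factory=set)
    # names of properties restricted to be symmetric
    symmetric: set = field(default_factory=set)

    def ancestors(self, cls):
        """Return all superclasses of cls, nearest first."""
        chain = []
        parent = self.classes.get(cls)
        while parent is not None:
            chain.append(parent)
            parent = self.classes.get(parent)
        return chain

    def is_a(self, name, cls):
        """True if a class or instance called name falls under cls."""
        start = self.instances.get(name, name)
        return start == cls or cls in self.ancestors(start)

    def holds(self, subject, prop, obj):
        """True if the triple is stated or follows from a symmetry restriction."""
        if (subject, prop, obj) in self.properties:
            return True
        return prop in self.symmetric and (obj, prop, subject) in self.properties


onto = Ontology()
onto.classes.update({"disease": None, "cancer": "disease", "melanoma": "cancer",
                     "country": None})
onto.instances.update({"Spain": "country", "France": "country"})
onto.properties.add(("Spain", "borders", "France"))
onto.symmetric.add("borders")

print(onto.is_a("melanoma", "disease"))          # taxonomic inference
print(onto.holds("France", "borders", "Spain"))  # inferred through symmetry
```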
Introduction
Knowledge acquisition
Semantic operators
Applications to privacy
Manually
  Knowledge formalization is challenging
  Knowledge can be subjective
  Time consuming
Assisted
  Proactive knowledge modelling tools
    Wizards
    Reasoners to check knowledge consistency
  Knowledge engineering methods
    101, METHONTOLOGY, On-To-Knowledge
Semantics are implicitly referred to in text
Textual corpora can be analysed to acquire
  Discover concepts and individuals
  Discover and label relations
    Taxonomic (cancer is a disease)
    Non-taxonomic (cancer is treated with radiotherapy)
    Attributes (cancer is non-contagious)
  Discover restrictions
    Axioms (Spain borders France -> France borders Spain)
Corpora: the Web
  The largest electronic repository
  Heterogeneous
  It approximates the distribution of information at a
Availability of massive IR tools: Web search
NL processing tools to identify nouns, noun phrases
  Concepts and individuals
Linguistic patterns to discover semantics
  Taxonomic
    “cities such as (Nimes)”, “cancers like (melanoma)”
  Non-taxonomic
    “cancer is treated with (surgery)”
  Attributes
    “camera has (10MP resolution)”, “camera features (3x zoom)”
  Axioms (functionality, transitivity, symmetry, reflexivity, etc.)
    “Spain borders France”, “France borders Spain” -> Symmetry
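The linguistic patterns above can be approximated with regular expressions. The sketch below is an assumption-laden simplification: it matches single words instead of parsed noun phrases, and the function names are hypothetical. It only shows how a pattern such as “X such as Y” yields a candidate taxonomic pair and how two mirrored statements suggest a symmetric relation.

```python
import re

# "X such as Y" / "X like Y": Y is a candidate hyponym of X
TAXONOMIC = re.compile(r"\b(\w+) (?:such as|like) (\w+)")
# "X is treated with Y": candidate non-taxonomic relation
TREATED_WITH = re.compile(r"\b(\w+) is treated with (\w+)")


def extract_taxonomic(text):
    """Return (hyponym, hypernym) candidate pairs found in text."""
    return [(m.group(2), m.group(1)) for m in TAXONOMIC.finditer(text)]


def extract_treatments(text):
    """Return (concept, 'is treated with', treatment) candidate triples."""
    return [(m.group(1), "is treated with", m.group(2))
            for m in TREATED_WITH.finditer(text)]


def is_symmetric(facts, relation):
    """True if every (a, relation, b) fact is mirrored by (b, relation, a)."""
    pairs = {(a, b) for a, r, b in facts if r == relation}
    return bool(pairs) and all((b, a) in pairs for a, b in pairs)


text = "cities such as Nimes, cancers like melanoma; cancer is treated with surgery"
print(extract_taxonomic(text))
print(extract_treatments(text))

facts = {("Spain", "borders", "France"), ("France", "borders", "Spain")}
print(is_symmetric(facts, "borders"))
```

Candidates produced this way are noisy, which is why a statistical filter is still needed before accepting them.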
Create appropriate web search queries
  Taxonomic: “cities such as” […]
  Non-taxonomic: “cancer is treated with” […]
  Attributes: “camera features” […]
  Axioms: “Spain borders” & “France borders”
Statistical assessor
WSE page count approximates query
Use an association score to filter noisy
Point-wise mutual information
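A minimal sketch of how page counts could feed a point-wise mutual information score, assuming the standard definition PMI(a, b) = log2(p(a, b) / (p(a) p(b))) with probabilities estimated as page count divided by the total number of indexed pages. The function names, the threshold and all counts are hypothetical; no real search engine API is called.

```python
import math


def pmi(hits_a, hits_b, hits_ab, total_pages):
    """Point-wise mutual information estimated from page counts."""
    if min(hits_a, hits_b, hits_ab) <= 0:
        return float("-inf")  # no evidence of co-occurrence
    # p(a,b) / (p(a) * p(b)) simplifies to hits_ab * total / (hits_a * hits_b)
    return math.log2(hits_ab * total_pages / (hits_a * hits_b))


def filter_candidates(candidates, hits_concept, total_pages, threshold):
    """Keep candidate terms whose PMI with the concept exceeds threshold.

    candidates maps each term to (hits_term, hits_term_and_concept).
    """
    return [term for term, (hits_term, hits_both) in candidates.items()
            if pmi(hits_concept, hits_term, hits_both, total_pages) > threshold]


# Hypothetical page counts for the concept "cancer"
candidates = {
    "melanoma": (2_000, 1_500),   # co-occurs with "cancer" very often
    "weather": (500_000, 300),    # co-occurs rarely: likely noise
}
print(filter_candidates(candidates, hits_concept=100_000,
                        total_pages=10_000_000, threshold=3.0))
```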
Taxonomic learning
  David Sánchez, Antonio Moreno: Pattern-based automatic taxonomy learning from the Web. AI Communications 21(1): 27-48 (2008)
Non-taxonomic learning
  David Sánchez, Antonio Moreno: Learning non-taxonomic relationships from web documents for domain ontology construction. Data & Knowledge Engineering 64(3): 600-623 (2008)
Attribute learning
  David Sánchez: A methodology to learn ontological attributes from the
Axiom learning
  David Sánchez, Antonio Moreno, Luis Del Vasto Terrientes: Learning relation axioms from text: An automatic Web-based approach. Expert Systems with Applications 39(5): 5792-5805 (2012)
Introduction
Knowledge acquisition
Semantic operators
Applications to privacy
Structured knowledge enables a
  Defining semantically-grounded operators
  Semantic similarity is the most basic operator
    Similarity(apple, orange) > Similarity(apple, bike)
Semantic similarity
  Degree of taxonomical resemblance
    e.g., dogs and cats are similar as they are mammals
Semantic relatedness
  Other non-taxonomic relationships are also considered
    e.g., car and wheel or pencil and paper
Similarity measures can be grouped in several
  the type of knowledge exploited
  the principles on which similarity estimation relies
Least Common Subsumer (LCS)
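The Least Common Subsumer of two concepts is their most specific shared ancestor in the taxonomy. A minimal sketch, assuming a single-inheritance taxonomy stored as a child-to-parent dictionary; the toy taxonomy and the function names are hypothetical.

```python
# Hypothetical toy taxonomy: concept -> parent (None marks the root)
PARENT = {
    "animal": None,
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "sparrow": "bird",
}


def path_to_root(concept, parent=PARENT):
    """Return the concept followed by all its ancestors up to the root."""
    path = [concept]
    while parent[concept] is not None:
        concept = parent[concept]
        path.append(concept)
    return path


def lcs(a, b, parent=PARENT):
    """Return the most specific concept that subsumes both a and b."""
    ancestors_a = set(path_to_root(a, parent))
    for candidate in path_to_root(b, parent):  # walks upwards from b
        if candidate in ancestors_a:
            return candidate
    return None


print(lcs("dog", "cat"))      # mammal
print(lcs("dog", "sparrow"))  # animal
```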
IC calculus relies on probability
Assumption: concepts with many hyponyms in
Concept probabilities are intrinsically
Number of hyponyms
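One simple way to turn hyponym counts into information content, shown here only as an assumption and not necessarily the model used in the talk, is to estimate p(c) = (hyponyms(c) + 1) / N, where N is the number of concepts in the taxonomy, and take IC(c) = -log2 p(c). The root then has IC 0 and leaves have the highest IC. The taxonomy and function names are hypothetical.

```python
import math

# Hypothetical toy taxonomy: concept -> parent (None marks the root)
PARENT = {
    "animal": None,
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "sparrow": "bird",
}


def hyponym_counts(parent):
    """Count, for every concept, how many concepts lie below it."""
    counts = {concept: 0 for concept in parent}
    for concept in parent:
        ancestor = parent[concept]
        while ancestor is not None:
            counts[ancestor] += 1
            ancestor = parent[ancestor]
    return counts


def intrinsic_ic(parent):
    """IC(c) = -log2((hyponyms(c) + 1) / N), computed from the taxonomy alone."""
    counts = hyponym_counts(parent)
    total = len(parent)
    return {c: -math.log2((counts[c] + 1) / total) for c in parent}


ic = intrinsic_ic(PARENT)
for concept in ("animal", "mammal", "dog"):
    print(concept, round(ic[concept], 3))
```

No corpus is needed: general concepts with many hyponyms receive low IC, specific ones receive high IC.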
Feature-based similarity measures
  Montserrat Batet, David Sánchez, Aïda Valls: An ontology-based measure to compute semantic similarity in biomedicine. Journal of Biomedical Informatics 44(1): 118-125 (2011)
  David Sánchez, Montserrat Batet, David Isern, Aïda Valls: Ontology-based semantic similarity: A new feature-based approach. Expert Systems with Applications 39(9): 7718-7728 (2012)
IC-based similarity measures
  Based on corpora
    David Sánchez, Montserrat Batet, Aïda Valls, Karina Gibert: Ontology-driven web-based semantic similarity. Journal of Intelligent Information Systems 35(3): 383-413 (2010)
  Based on ontologies
    David Sánchez, Montserrat Batet, David Isern: Ontology-based information content
    David Sánchez, Montserrat Batet: A New Model to Compute the Information Content of Concepts from Taxonomic Knowledge. International Journal on Semantic Web and Information Systems 8(2): 34-50 (2012)
    David Sánchez, Montserrat Batet: Semantic similarity estimation in the biomedical domain: An ontology-based information-theoretic perspective. Journal of Biomedical Informatics 44(5): 749-759 (2011)
Semantic similarity/distance is the base to
Aggregation (mean/centroid)
\[ \mathrm{Mean}(x_1, x_2, \ldots, x_n) = \arg\min_{c} \left\{ \sum_{i=1}^{n} \mathrm{distance}(c, x_i) \right\} \]
Sample: colic, lumbago, lumbago, migraine, pain, appendicitis, gastritis, lumbago, migraine

Mean candidates  colic (1)  lumbago (3)  migraine (2)  appendicitis (1)  gastritis (1)  pain (1)  Sum dist
colic                    -            3             3                 4              4         1        24
lumbago                  3            -             2                 5              5         2        19
migraine                 3            2             -                 5              5         2        21
appendicitis             4            5             5                 -              2         3        34
gastritis                4            5             5                 2              -         3        34
pain                     1            2             2                 3              3         -        17
ache                     2            1             1                 4              4         1        16
inflammation             3            4             4                 1              1         2        27
symptom                  2            3             3                 2              2         1        22
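The semantic mean is the candidate concept with the smallest summed distance to all values of the sample, each weighted by its number of occurrences. The sketch below recomputes the "Sum dist" column. The taxonomy is an assumption: it is a hypothetical tree chosen so that path lengths between concepts reproduce the distances shown in the table, and the function names are illustrative.

```python
from collections import Counter

# Hypothetical taxonomy (concept -> parent) consistent with the table distances
PARENT = {
    "symptom": None,
    "pain": "symptom",
    "inflammation": "symptom",
    "ache": "pain",
    "colic": "pain",
    "lumbago": "ache",
    "migraine": "ache",
    "appendicitis": "inflammation",
    "gastritis": "inflammation",
}


def path_to_root(concept):
    """Return the concept followed by all its ancestors up to the root."""
    path = [concept]
    while PARENT[concept] is not None:
        concept = PARENT[concept]
        path.append(concept)
    return path


def distance(a, b):
    """Number of taxonomic links on the shortest path between a and b."""
    steps_from_a = {c: i for i, c in enumerate(path_to_root(a))}
    for steps_from_b, c in enumerate(path_to_root(b)):
        if c in steps_from_a:  # first shared ancestor found
            return steps_from_a[c] + steps_from_b
    raise ValueError("concepts do not share a root")


def summed_distance(candidate, sample):
    """Sum of distances from candidate to every value in the sample."""
    counts = Counter(sample)
    return sum(n * distance(candidate, value) for value, n in counts.items())


def semantic_mean(sample):
    """Candidate concept minimising the summed distance to the sample."""
    return min(PARENT, key=lambda c: summed_distance(c, sample))


sample = ["colic", "lumbago", "lumbago", "migraine", "pain",
          "appendicitis", "gastritis", "lumbago", "migraine"]
print(semantic_mean(sample), summed_distance(semantic_mean(sample), sample))
```

Note that the selected mean, ache, does not appear in the sample: every concept of the taxonomy is a candidate, not only the observed values.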
Inputs: P (dataset)
Output: P’ (P sorted)
Compute the mean of all values in P
Consider the most distant value f to the mean
Add f to P’ and remove it from P
while (|P| > 0) do
  Obtain the least distant value r to f
  Add r to P’ and remove it from P
Output P’
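The sorting procedure above can be sketched as follows. The `distance` and `mean` callables are parameters so that a semantic distance and a semantic mean could be plugged in; the function name is hypothetical, and the usage example uses plain numbers only as a stand-in because their distances are easy to check by hand.

```python
def semantic_sort(values, distance, mean):
    """Order values by increasing distance to the value farthest from the mean."""
    remaining = list(values)
    if not remaining:
        return []
    centre = mean(remaining)
    # The most distant value to the mean becomes the reference value f
    f = max(remaining, key=lambda v: distance(v, centre))
    remaining.remove(f)
    ordered = [f]
    while remaining:
        # Repeatedly take the value closest to f
        r = min(remaining, key=lambda v: distance(v, f))
        remaining.remove(r)
        ordered.append(r)
    return ordered


# Numeric stand-in: absolute difference as distance, arithmetic mean as mean
result = semantic_sort(
    [5, 1, 9, 4],
    distance=lambda a, b: abs(a - b),
    mean=lambda xs: sum(xs) / len(xs),
)
print(result)  # [9, 5, 4, 1]
```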
Sergio Martínez, Aïda Valls, David Sánchez: Semantically-
Sergio Martínez, David Sánchez, Aïda Valls: A
Josep Domingo-Ferrer, David Sánchez, Guillem Rufian-
Introduction
Knowledge acquisition
Semantic operators
Applications to privacy
Remove or mask data that could reveal personal identities

Statistical Offices

Identifiers  Quasi-identifying attributes         Confidential attributes
D.N.I.       Birth place  Birth year  Occupation  Income
12345678     Valls        1962        Lawyer      2,300
87654321     Cambrils     1965        Judge       5,000

The Statistical Disclosure Control (SDC) discipline aims at protecting statistical data so that it
  Can be released and exploited
  Without publishing information that could identify a specific individual
For each combination of quasi-identifier values, at least k records exist in the dataset with the same combination
Information loss
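The property stated above is commonly called k-anonymity, and it can be checked by counting how many records share each combination of quasi-identifier values. A minimal sketch: the function name is hypothetical, the two original records use the quasi-identifiers of the example table, and the generalised values are invented for illustration.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(record[q] for q in quasi_identifiers)
                     for record in records)
    return all(size >= k for size in groups.values())


QUASI_IDENTIFIERS = ["birth_place", "birth_year", "occupation"]

original = [
    {"birth_place": "Valls", "birth_year": 1962, "occupation": "Lawyer", "income": 2300},
    {"birth_place": "Cambrils", "birth_year": 1965, "occupation": "Judge", "income": 5000},
]

# Hypothetical generalisation: both records now share the same quasi-identifiers
generalised = [
    {"birth_place": "Tarragona province", "birth_year": "1960s",
     "occupation": "Legal professional", "income": 2300},
    {"birth_place": "Tarragona province", "birth_year": "1960s",
     "occupation": "Legal professional", "income": 5000},
]

print(is_k_anonymous(original, QUASI_IDENTIFIERS, k=2))     # False
print(is_k_anonymous(generalised, QUASI_IDENTIFIERS, k=2))  # True
```

The generalised version satisfies the property at the cost of information loss: the exact town, year and occupation can no longer be recovered.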
Many masking methods have been designed to
  Focusing mainly on numerical attributes
Our goal
  To apply semantically-grounded operators to existing
    Recoding
    Microaggregation
    Resampling
  The anonymized output would preserve the meaning
    Retain the analytical utility
[Figure: semantic microaggregation example showing the original dataset (values such as holidays, sports, nature, relaxation and surfing), the clusters with their centroids (nature, calmness, sports), and the resulting anonymised dataset.]
Should
  Avoid disclosure
  Preserve semantics
We rely on the
  Highly informative
Given an input document
Ex: “Peter Greenow, from United States, suffers
Fix a sanitization threshold
The most concrete concept to be known. (Ex:
Extract noun-phrases from text
“Peter Greenow”, “United States”, “Pancreatic
Compute the IC of each noun phrase and
  IC(Peter Greenow) > IC(Cancer)
  IC(United States) < IC(Cancer)
  IC(Pancreatic Cancer) > IC(Cancer)
Replace sensitive terms by generalized
Ex: “[Person], from United States, suffers from
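The sanitization steps can be sketched as follows: every noun phrase whose IC exceeds the IC of the threshold concept is replaced by a generalisation. Everything concrete here is an assumption: the IC values are invented (they only respect the inequalities listed above), the input sentence is an illustrative one assembled from the noun phrases on the slide because the slide's own sentence is cut off, and the function name is hypothetical.

```python
def sanitize(text, ic, generalization, threshold_concept):
    """Replace phrases that are more informative than the threshold concept."""
    limit = ic[threshold_concept]
    # Longest phrases first, so "Pancreatic Cancer" is handled before "Cancer"
    for phrase in sorted(ic, key=len, reverse=True):
        if ic[phrase] > limit:
            text = text.replace(phrase, generalization[phrase])
    return text


# Hypothetical IC values (higher means more specific, hence more disclosive)
IC = {
    "Peter Greenow": 20.0,
    "Pancreatic Cancer": 12.0,
    "Cancer": 8.0,
    "United States": 5.0,
}

# Hypothetical generalisations for the phrases that may need masking
GENERALIZATION = {
    "Peter Greenow": "[Person]",
    "Pancreatic Cancer": "Cancer",
}

document = "Peter Greenow, from United States, suffers from Pancreatic Cancer."
print(sanitize(document, IC, GENERALIZATION, threshold_concept="Cancer"))
```

With these values, United States is kept because it is less informative than the threshold, while the person name and the specific diagnosis are generalised.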
Data anonymization
  Sergio Martínez, David Sánchez, Aïda Valls: Semantic adaptive microaggregation of categorical microdata. Computers & Security 31(5): 653-672 (2012)
  Sergio Martínez, David Sánchez, Aïda Valls, Montserrat Batet: Privacy protection of textual attributes through a semantic-based masking method. Information Fusion 13(4): 304-314 (2012)
  Sergio Martínez, David Sánchez, Aïda Valls: A semantic framework to protect the privacy of electronic health records with non-numerical
  Montserrat Batet, Arnau Erola, David Sánchez, Jordi Castellà-Roca: Utility preserving query log anonymization via semantic resampling. Information Sciences. To appear.
Document sanitization
  David Sánchez, Montserrat Batet, Alexandre Viejo: Automatic general-purpose sanitization of textual documents. IEEE Transactions on Information Forensics and Security. To appear.