Introduction: Knowledge Discovery guided by Domain Knowledge



slide-1
SLIDE 1

Marie-Dominique Devignes Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA) Equipe Orpailleur – INRIA Nancy Grand-Est

1. Introduction: Knowledge Discovery guided by Domain Knowledge
2. Bio-ontologies and data integration
3. Bio-ontologies and data mining
4. Conclusion

2/42 Rennes, 18 October 2011

slide-2
SLIDE 2

[Figure: from Database to Knowledge in three steps. 1. Preparation (selection, integration, formatting); 2. Data mining; 3. Interpretation. Intermediate products: integrated data, dataset, formatted data, rules and patterns. An expert supervises the process.]

A three-step iterative process, interactively controlled by an expert.

[Figure: the three-step process (1. Preparation, 2. Data mining, 3. Interpretation) from Database to Knowledge, under expert control]

Ontologies to assist the expert at each step of the process.

slide-3
SLIDE 3

[Figure: databases DB1, DB2, DB3, etc. feed Wrappers 1 to 3, which produce Data. Steps: 1. Data extraction and formatting; 2. Data mining; 3. Result interpretation and Knowledge Base (KB) enrichment. Domain Knowledge sits at the center.]

  • Data integration: guided by domain knowledge
  • Data mining: guided by domain knowledge

For biologists, « knowledge bases » are rich, integrated databases.

For computer scientists, « knowledge bases » are systems where data are associated with explicit semantics (logic formulas) that can be used by programs.

Concepts and relations are organized as consistent hierarchies, from most general to most specialized.

Reasoners are able to classify instances with respect to the concepts they instantiate.

slide-4
SLIDE 4

A « DL » knowledge base: SO-Pharm (Adrien Coulet)

Structured vocabularies
  • Semantic similarity measure: use for functional classification of genes.
  • Use for dimension reduction in symbolic mining methods.

1. Introduction: Knowledge Discovery guided by Domain Knowledge
2. Bio-ontologies and data integration
3. Bio-ontologies and data mining
4. Conclusion

slide-5
SLIDE 5
  • A challenge for many years
  • Not so many examples of « true » knowledge bases

[Figure: databases DB1, DB2, DB3, etc. feed Wrappers 1 to 3, which produce Data under the guidance of Domain Knowledge]

Data integration = 1st step of KDDK

Description Logics (DL) foundations

In DL, KB = T-Box + A-Box

T-Box (terminology)
  • Atomic concepts (C) or roles (R): simple descriptions
  • Composed concepts or roles: complex descriptions (terminological axioms)

A-Box (assertions)
  • Concept assertion: C(a) -> instance a belongs to concept C
  • Role assertion: R(a,b) -> instances a and b are in relation through role R
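The T-Box / A-Box split can be illustrated with a minimal Python sketch. It is not how a DL reasoner works internally: the T-Box is reduced to plain subsumption between named concepts, and the class name `KnowledgeBase`, its methods, and the concept `PhenotypeItem` as a subconcept of `ClinicalItem` are assumptions made for illustration. The assertions Patient(pa01) and HasClinicalItem(pa01, exa:yes) are the ones quoted later in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    # T-Box (simplified, hypothetical): concept -> list of direct parent concepts
    tbox: dict = field(default_factory=dict)
    # A-Box: concept assertions C(a) and role assertions R(a, b)
    concept_assertions: set = field(default_factory=set)
    role_assertions: set = field(default_factory=set)

    def assert_concept(self, concept, a):
        self.concept_assertions.add((concept, a))

    def assert_role(self, role, a, b):
        self.role_assertions.add((role, a, b))

    def ancestors(self, concept):
        """All concepts that subsume `concept`, including itself."""
        seen, stack = set(), [concept]
        while stack:
            c = stack.pop()
            if c not in seen:
                seen.add(c)
                stack.extend(self.tbox.get(c, ()))
        return seen

    def instances_of(self, concept):
        """Instances asserted in `concept` or in any of its subconcepts."""
        return {a for (c, a) in self.concept_assertions
                if concept in self.ancestors(c)}


# Assumed toy T-Box: PhenotypeItem and GenotypeItem are subconcepts of ClinicalItem
kb = KnowledgeBase(tbox={"PhenotypeItem": ["ClinicalItem"],
                         "GenotypeItem": ["ClinicalItem"]})
kb.assert_concept("Patient", "pa01")
kb.assert_concept("PhenotypeItem", "exa:yes")
kb.assert_role("HasClinicalItem", "pa01", "exa:yes")
```

Here `kb.instances_of("ClinicalItem")` also returns the instance asserted under the subconcept, which is the simplest form of the classification that reasoners perform.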

slide-6
SLIDE 6

Implementation thanks to semantic web technologies

Goal of pharmacogenomics

GenNet Project

Example: SNP variants in gene CYP2D6 (Desmeules et al., 1991)
  • More or less active forms of a given enzyme
  • Fast or slow transformation of codeine into morphine
  • Intoxication or absence of reaction to a given treatment

[Figure: relationships between Genotype, Drug and Phenotype]

slide-7
SLIDE 7

[Figure: T-Box. Concepts shown: Patient, Clinical item, Genotype item, Phenotype item, Drug treatment. Ontologies shown: PATO, MPO, Disease Ontology, ChEBI, MECV, SNP-O.]

Articulation of existing ontologies (15) covering various biological domains (Adrien Coulet, PhD Thesis).

Semantic integration: guided by the global schema of the ontology
  • Set of mappings between each data source and the ontology (Poggi et al., 2008; Coulet PhD Thesis, 2008)
  • Advantages: consistency, lack of redundancy, new properties inferred by reasoners

[Figure: PharmGKB, dbSNP and Gene feed Wrappers 1 to 3, which produce SO-Pharm individuals for the SO-Pharm KB (T-Box + A-Box, in Protégé 2000)]

slide-8
SLIDE 8

Diversity of responses to Montelukast (Singulair)

  • 61 assertions of the concept Patient, e.g. Patient(pa01)
  • 162 assertions of the concept Clinical item and its subconcepts, e.g. ClinicalItem(exa:yes)
  • many assertions of various roles between the concepts, e.g. HasClinicalItem(pa01, exa:yes)

[Figure: genotype items; phenotype items (Exa: asthma exacerbation; Per: % change in Forced Expiratory Volume); drug treatment (Montelukast, 6 months)]

RAA: Role Assertion Analysis (Coulet et al., Advances in Experimental Medicine and Biology, 2011)

  • 1. Exploration of the graph of role assertions
  • 2. Formal Concept Analysis
  • 3. Interpretation, insertion into the ontology

[Figure labels: T-Box, A-Box, formal context, association rules; input: instances of a given concept; output: new concepts, roles, and role assertions]

slide-9
SLIDE 9

Proof of concept but…
  • Data representation for mining: still under study

http://sourceforge.net/apps/mediawiki/bio2rdf/index.php?title=Main_Page

1. Introduction: Knowledge Discovery guided by Domain Knowledge
2. Bio-ontologies and data integration
3. Bio-ontologies and data mining
4. Conclusion

slide-10
SLIDE 10

[Figure: KDDK loop with an annotation vocabulary as domain knowledge. Steps: 1. Data extraction and formatting; 2. Data mining; 3. Result interpretation and KB enrichment.]

Data mining: guided by domain knowledge

Semantic Clustering (1) on GO
slide-11
SLIDE 11

Clustering means grouping the most similar objects together and putting the most dissimilar objects in different clusters.

Clustering functional annotation of genes: functional classification

         GO-t1   GO-t2   GO-t3   …   PfamD1   …
Gene1    X       X       O       …   X        …
Gene2    X       O       X       …   X        …
…

Similarity measure based on counting present and absent features, measured by Kappa statistics => no semantics
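As an illustration of a count-based measure, here is a minimal sketch of Cohen's kappa between two binary annotation profiles. The function name `kappa` and the four-column profiles (GO-t1, GO-t2, GO-t3, PfamD1, read from the visible part of the table) are assumptions; the tool used in the talk may compute its statistic differently.

```python
def kappa(x, y):
    """Cohen's kappa between two equal-length binary profiles (1 = feature present)."""
    n = len(x)
    both = sum(1 for a, b in zip(x, y) if a and b)
    neither = sum(1 for a, b in zip(x, y) if not a and not b)
    observed = (both + neither) / n            # observed agreement
    px, py = sum(x) / n, sum(y) / n            # frequency of "present" in each profile
    expected = px * py + (1 - px) * (1 - py)   # agreement expected by chance
    if expected == 1.0:
        return 1.0  # both profiles are constant and identical (convention chosen here)
    return (observed - expected) / (1 - expected)


# Visible columns of the table: GO-t1, GO-t2, GO-t3, PfamD1
gene1 = [1, 1, 0, 1]
gene2 = [1, 0, 1, 1]
score = kappa(gene1, gene2)
```

The measure only sees that GO-t2 and GO-t3 differ between the two genes; it cannot tell whether those two terms are close or distant in the ontology, which is the point made by "no semantics".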

Clustering is semantic when it relies on a semantic similarity measure.

         GO-t1   GO-t2   GO-t3   …
Gene1    X       X       O       …
Gene2    X       O       X       …
…

Various strategies exist for
(i) capturing feature similarity in a bio-ontology
(ii) aggregating these similarities for comparing biological objects

slide-12
SLIDE 12

To compare two features, members of an ontology

« Node-based » approaches
[Figure labels: annotations (IC, MICA IC, DCAs IC, shared annotations); structure (depth, number of children)]
  • Resnik et al. (1995): Most Informative Common Ancestor (MICA); Information Content (IC)
  • Bodenreider et al. (2005): shared annotations

Pesquita et al., 2009: Semantic similarity in biomedical ontologies, PLoS Comp. Biol., July 2009, Volume 5, Issue 7, e1000443
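A node-based measure in the style of Resnik can be sketched as follows: the similarity of two terms is the Information Content of their Most Informative Common Ancestor. The toy ontology `PARENTS`, the gene annotations and all function names are hypothetical; IC is estimated here as -log of the fraction of genes annotated with a term or one of its descendants.

```python
import math

# Hypothetical toy ontology: term -> list of parents
PARENTS = {"root": [], "A": ["root"], "B": ["root"],
           "t1": ["A"], "t2": ["A"], "t3": ["B"]}


def ancestors(term):
    """The term itself plus all its ancestors."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen


def information_content(annotations):
    """IC(t) = -log p(t), with p(t) the fraction of genes annotated with t or a descendant."""
    n = len(annotations)
    counts = {t: 0 for t in PARENTS}
    for terms in annotations.values():
        covered = set().union(*(ancestors(t) for t in terms))
        for t in covered:
            counts[t] += 1
    return {t: -math.log(c / n) for t, c in counts.items() if c > 0}


def resnik(t1, t2, ic):
    """IC of the Most Informative Common Ancestor (MICA) of t1 and t2."""
    common = ancestors(t1) & ancestors(t2)
    return max(ic[a] for a in common if a in ic)


# Hypothetical annotation corpus: gene -> set of terms
annotations = {"g1": {"t1"}, "g2": {"t2"}, "g3": {"t3"}, "g4": {"t1", "t3"}}
ic = information_content(annotations)
```

With this corpus, t1 and t2 share the ancestor A (annotating 3 genes out of 4), while t1 and t3 only share the root, whose IC is zero.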

To compare two features, members of an ontology

« Edge-based » approaches
[Figure labels: depth of LCA; distance (min/average)]
  • Wu et al. (2006): depth of LCA (Lowest Common Ancestor), Shortest Path Length
  • Pozo et al. (2008): depth of LCA

Hybrid approaches
[Figure labels: weighting edges by node depth]
  • Othman et al. (2007): IC/depth/number of children; distance

slide-13
SLIDE 13

To compare two feature lists annotating genes or gene products

« Pairwise » approaches (all pairs, best pairs)
  • Lord et al. (2003): all pairs / average / Resnik, Lin, Jiang measures
  • Wang et al. (2007): best pairs / average / Wang measure

« Groupwise » approaches (set, graph, vector)
  • Martin et al. (2004): graph / Jaccard on lists enriched with term ancestors
  • Chabalier et al. (2007): vectors compared with the cosine measure
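The difference between these aggregation strategies can be sketched in a few lines. This is only an illustration: the term-to-term similarity table `SIM` holds made-up values, the symmetric best-match average is one common variant and not necessarily the exact formula of the cited papers, and the Jaccard index is computed on plain term sets (without ancestor enrichment).

```python
# Hypothetical term-to-term similarities (any node- or edge-based measure could fill this)
SIM = {frozenset(("t1", "t2")): 0.5,
       frozenset(("t1", "t3")): 0.0,
       frozenset(("t2", "t3")): 0.25}


def sim(a, b):
    return 1.0 if a == b else SIM[frozenset((a, b))]


def all_pairs_average(terms_a, terms_b, sim):
    """Pairwise, all pairs: mean similarity over every (a, b) combination."""
    pairs = [sim(a, b) for a in terms_a for b in terms_b]
    return sum(pairs) / len(pairs)


def best_match_average(terms_a, terms_b, sim):
    """Pairwise, best pairs: each term is matched with its most similar counterpart."""
    best_a = [max(sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))


def jaccard(terms_a, terms_b):
    """Groupwise, set-based: shared terms over all terms."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b)


gene1 = ["t1", "t2"]
gene2 = ["t1", "t3"]
```

On this toy input the all-pairs average is pulled down by unrelated pairs, the best-match average rewards the shared term t1, and the Jaccard index ignores the partial similarity between t2 and t3 altogether.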

Representation of genes in a vector space model

$$\vec{g} = \sum_i \alpha_i \, \vec{e}_i$$

  • $\vec{e}_i$: basis vector, one per feature ($t_i$)
  • $\alpha_i$: coefficient for feature $t_i$

Definition of coefficients

$$\alpha_i = w(g, t_i) \times IAF(t_i)$$

  • $w(g, t_i)$: weight of the evidence code* qualifying the assignment of feature $t_i$ to gene $g$
  • $IAF(t_i)$: « Inverse Annotation Frequency » ~ Information Content of feature $t_i$ in the annotation corpus

* When more than one code applies, take the maximal weight.

Definition of information content

$$IAF(t_i) = \log \frac{N_{TOT}}{N_{t_i}}$$

  • $N_{TOT}$: total number of genes in the corpus
  • $N_{t_i}$: number of genes with feature $t_i$
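The coefficient definition can be sketched in Python as follows. The corpus, the function names and the evidence-code weights in `EVIDENCE_WEIGHTS` are hypothetical (EXP and IEA are real GO evidence codes, but the weights used in the talk are not given here).

```python
import math

# Hypothetical weights for two GO evidence codes
EVIDENCE_WEIGHTS = {"EXP": 1.0, "IEA": 0.5}


def iaf(annotations):
    """IAF(t) = log(N_TOT / N_t): N_TOT genes in the corpus, N_t genes carrying feature t."""
    n_tot = len(annotations)
    counts = {}
    for terms in annotations.values():
        for t in terms:
            counts[t] = counts.get(t, 0) + 1
    return {t: math.log(n_tot / n_t) for t, n_t in counts.items()}


def gene_vector(gene, annotations, iaf_values):
    """alpha_i = w(g, t_i) * IAF(t_i); with several codes, the maximal weight is used."""
    return {t: max(EVIDENCE_WEIGHTS[c] for c in codes) * iaf_values[t]
            for t, codes in annotations[gene].items()}


# Hypothetical corpus: gene -> {feature: [evidence codes]}
annotations = {
    "g1": {"t1": ["EXP"], "t2": ["IEA", "EXP"]},
    "g2": {"t1": ["IEA"]},
    "g3": {"t3": ["IEA"]},
    "g4": {"t1": ["EXP"]},
}
iaf_values = iaf(annotations)
g1 = gene_vector("g1", annotations, iaf_values)
```

A frequent feature such as t1 (3 genes out of 4) gets a low IAF and therefore contributes little to the gene vector, while a rare feature such as t2 weighs more.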

slide-14
SLIDE 14

Method proposed for « tree » hierarchies of terms (MeSH) in document retrieval

Ganesan P, Garcia-Molina H, Widom J (2003) Exploiting hierarchical domain structure to compute similarity. Transactions on Information Systems, 21: 64-93

Principle: consider that the dimensions of the vector space are not orthogonal to each other.

Consequence in dot product:

$$\vec{e}_i \cdot \vec{e}_i = 1 \quad \text{and} \quad \forall i, j,\ i \neq j: \quad \vec{e}_i \cdot \vec{e}_j = \frac{2 \times Depth[LCA(t_i, t_j)]}{Depth(t_i) + Depth(t_j)}$$
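For a tree, where each term has a single parent and thus a unique LCA and a unique depth, the depth-based dot product between basis vectors can be sketched as below. The toy tree `PARENT` and the convention that the root has depth 0 are assumptions made for this example.

```python
# Hypothetical toy tree: term -> single parent (None for the root)
PARENT = {"root": None, "A": "root", "B": "root",
          "t1": "A", "t2": "A", "t3": "B"}


def path_to_root(term):
    """List of nodes from the term up to the root."""
    path = [term]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def depth(term):
    return len(path_to_root(term)) - 1  # root has depth 0 (assumed convention)


def lca(t1, t2):
    """Lowest common ancestor: first node on t2's upward path that is also above t1."""
    above_t1 = set(path_to_root(t1))
    for node in path_to_root(t2):
        if node in above_t1:
            return node


def dot(t1, t2):
    """e_i . e_j = 2 * Depth[LCA] / (Depth(t_i) + Depth(t_j)), and e_i . e_i = 1."""
    if t1 == t2:
        return 1.0
    return 2 * depth(lca(t1, t2)) / (depth(t1) + depth(t2))
```

With this tree, `dot("t1", "t2")` is 0.5 because both terms sit one level below their common parent A, whereas `dot("t1", "t3")` is 0 because their only common ancestor is the root.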

GO is a rDAG (rooted Directed Acyclic Graph). In a rDAG, each term can have several parents and therefore several paths to the Root.

Consequence: LCA is not unique, Depth(ti) is not unique.

[Figure: example rDAG with Root, ancestors A1 to A4 and terms t1, t2, t3]

$$\vec{e}_i \cdot \vec{e}_j = Sim_{IntelliGO}(t_i, t_j) = \frac{2 \times MaxDepth[LCA(t_i, t_j)]}{MinSPL(t_i, t_j) + 2 \times MaxDepth[LCA(t_i, t_j)]}$$

Benabderrahmane S. et al. (2010) BMC Bioinformatics 11:588.
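The rDAG case can be sketched as follows. The node names come from the slide's figure, but the edges in `PARENTS` are invented, and two interpretation choices are assumptions: the LCA is taken as the common ancestor with the greatest maximal depth, and the shortest path length is measured through that LCA.

```python
# Node names from the figure; the edges are hypothetical
PARENTS = {
    "Root": [],
    "A1": ["Root"], "A2": ["Root"],
    "A3": ["A1", "A2"],   # two parents, hence several paths to the Root
    "A4": ["A3"],
    "t1": ["A4"], "t2": ["A3"], "t3": ["A2"],
}


def up_distances(term):
    """Shortest upward distance from the term to each of its ancestors (itself: 0)."""
    dist, frontier = {term: 0}, [term]
    while frontier:
        nxt = []
        for t in frontier:
            for p in PARENTS[t]:
                if p not in dist:
                    dist[p] = dist[t] + 1
                    nxt.append(p)
        frontier = nxt
    return dist


def max_depth(term):
    """Length of the longest path from the Root down to the term."""
    return max((max_depth(p) + 1 for p in PARENTS[term]), default=0)


def sim_intelligo(ti, tj):
    di, dj = up_distances(ti), up_distances(tj)
    common = set(di) & set(dj)
    # Deepest common ancestor; ties broken by the shortest path through it
    lca = max(common, key=lambda a: (max_depth(a), -(di[a] + dj[a])))
    spl = di[lca] + dj[lca]
    d = max_depth(lca)
    return 2 * d / (spl + 2 * d) if spl + 2 * d else 1.0
```

For example, t1 and t2 have A3 as deepest common ancestor (maximal depth 2) and a path of length 3 through it, giving 4 / (3 + 4).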

slide-15
SLIDE 15

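Generalized dot product between two gene vectors

$$\vec{g} \cdot \vec{h} = \sum_{i,j} \alpha_i \times \beta_j \times \vec{e}_i \cdot \vec{e}_j, \quad \text{with } \vec{e}_i \cdot \vec{e}_j \neq 0,\ \forall i, j,\ i \neq j$$

where $\alpha_i$ and $\beta_j$ are the coefficients of $\vec{g}$ and $\vec{h}$ respectively.

Generalized cosine similarity

$$Sim_{IntelliGO}(\vec{g}, \vec{h}) = \frac{\vec{g} \cdot \vec{h}}{\sqrt{\vec{g} \cdot \vec{g}} \times \sqrt{\vec{h} \cdot \vec{h}}}$$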
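A minimal sketch of the generalized dot product and cosine, assuming gene vectors are stored as dictionaries mapping a feature to its coefficient. The function names, the toy vectors and the constant term similarity of 0.5 between distinct terms are illustrative assumptions.

```python
import math


def generalized_dot(g, h, term_sim):
    """Sum over all (i, j) of alpha_i * beta_j * (e_i . e_j)."""
    return sum(a * b * term_sim(ti, tj)
               for ti, a in g.items() for tj, b in h.items())


def generalized_cosine(g, h, term_sim):
    """g . h / (sqrt(g . g) * sqrt(h . h)), using the generalized dot product."""
    norm_g = math.sqrt(generalized_dot(g, g, term_sim))
    norm_h = math.sqrt(generalized_dot(h, h, term_sim))
    return generalized_dot(g, h, term_sim) / (norm_g * norm_h)


def toy_sim(ti, tj):
    # Hypothetical non-orthogonal basis: distinct terms are 0.5 similar
    return 1.0 if ti == tj else 0.5


def orthogonal(ti, tj):
    # Classical vector space model: distinct terms are unrelated
    return 1.0 if ti == tj else 0.0


g = {"t1": 1.0, "t2": 2.0}
h = {"t1": 1.0, "t3": 1.0}
```

With `orthogonal`, the formula reduces to the usual cosine and only the shared term t1 contributes; with `toy_sim`, the cross terms between t2 and t3 also contribute, so the two genes come out more similar.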

[Figure: computation pipeline, with the following labels]

  • NCBI file AnnotationFile: (Tax_ID, Gene_ID, GO_ID, Evid_Code, GO_Def, GO_aspect)
  • Parameter 1: species; Parameter 2: GO aspect; Parameter 3
  • Specific file CuratedAnnotationFile, with Terms (GO_ID, IAF, Array of [Gene_IDs]) and Genes (Gene_ID, Array of [GO_ID, Evid-Code]); computation of the IAF
  • LCA: (GO_ID, GO_ID, LCA_Depth, LCA_ID_List); depths of the CAs and of the LCA (queries on the GO database)
  • SPL: (GO_ID, GO-ID, SPL); computation of the SPL
  • List of evidence code weights; w(g, ti), IAF(ti), Depth[LCA(ti, tj)], SPL(ti, tj)
  • List of genes of interest (Gene_IDs)
  • List of pairwise similarity measures between genes

slide-16
SLIDE 16

Dataset   Species   Source          Number of sets   Total genes
1         Human     KEGG pathways   13               275
2         Yeast     KEGG pathways   13               169
3         Human     Pfam Clans      10               94
4         Yeast     Pfam Clans      10               118

slide-17
SLIDE 17

Dataset (number of sets)   Optimal global F-score   Optimal K number   Optimal global F-score   Optimal K number   Excluded genes
1 (13)                     0.62                     14                 0.67                     10                 21 %
2 (13)                     0.67                     14                 0.68                     9                  18 %
3 (10)                     0.75                     11                 0.64                     11                 27 %
4 (10)                     0.82                     11                 0.70                     10                 41 %

Functional classification is reliable and robust with the IntelliGO measure.

Semantic Clustering (2) on MedDRA
slide-18
SLIDE 18

Study with MedDRA: Medical Dictionary for Regulatory Activities

SIDER: Side Effect Resource at the EMBL (http://sideeffects.embl.de)

Such large matrices are intractable with symbolic methods.

Reduce the number of attributes without losing information.

$$Sim_{IntelliGO}(t_i, t_j) = \frac{2 \times MaxDepth[LCA(t_i, t_j)]}{MinSPL(t_i, t_j) + 2 \times MaxDepth[LCA(t_i, t_j)]}$$

$$Dist_{IntelliGO}(t_i, t_j) = \frac{MinSPL(t_i, t_j)}{MinSPL(t_i, t_j) + 2 \times MaxDepth[LCA(t_i, t_j)]}$$

so that $Dist_{IntelliGO} = 1 - Sim_{IntelliGO}$.

slide-19
SLIDE 19

Pairwise distances calculated for a subset of 1288 terms

Hierarchical clustering (Ward's) + Kelley's optimisation of cluster number

Example: TermCluster T54 Erythema

Cluster terms (T54)                   AvgDist to other cluster terms
Erythema                              0.31
Lichen planus                         0.32
Parapsoriasis                         0.32
Pityriasis alba                       0.32
Rash papular                          0.32
Decubitus ulcer                       0.35
Lupus miliaris disseminatus faciei    0.35
Pruritus                              0.35
Rash                                  0.35
Sunburn                               0.35
Vulvovaginal pruritus                 0.35
Dandruff                              0.37
Rash                                  0.37
Photosensitivity reaction             0.37
Psoriasis                             0.37

Likewise for Anti-infective Agents: 76 drugs, 2 datasets, AIAAll and AIATC

  • All representation: drugs (Drug1, Drug2, …) described by 1288 attributes (terms T1, T2, T3, T4, …)
  • TC representation: drugs (Drug1, Drug2, …) described by 112 attributes (term clusters TC1, TC2, TC3, …)

slide-20
SLIDE 20

Minimal support   50 %    60 %    70 %   80 %   90 %   100 %
                  386     94      41     11     1
                  5,564   1,379   256    62     6
                  178     41      9      2
                  654     154     30     3

  • Use of the Zart algorithm (Coron platform for symbolic data mining)
  • FCI (frequent closed itemset): maximal subsets of drugs sharing similar side effects (All) or similar TCs
  • More FCIs are found with TC representation
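To make the notion of frequent closed itemset concrete, here is a naive enumeration in Python. It is not the Zart algorithm and does not use the Coron platform: it simply tests every itemset, which is only viable on toy data. The drug names and their side-effect sets are invented for illustration.

```python
from itertools import combinations


def frequent_closed_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets that are frequent and equal to their closure."""
    items = sorted(set().union(*transactions.values()))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            # Cover: side-effect sets of the drugs that contain the whole itemset
            cover = [t for t in transactions.values() if itemset <= t]
            if not cover or len(cover) / n < min_support:
                continue
            # Closure: everything shared by all drugs in the cover
            closure = frozenset.intersection(*map(frozenset, cover))
            if closure == itemset:
                result[itemset] = len(cover) / n
    return result


# Hypothetical drugs and side effects
drugs = {
    "d1": {"Nausea", "Vomiting", "Headache"},
    "d2": {"Nausea", "Vomiting"},
    "d3": {"Nausea", "Pruritus"},
    "d4": {"Nausea", "Vomiting", "Pruritus"},
}
fcis = frequent_closed_itemsets(drugs, min_support=0.5)
```

Here {Vomiting} is frequent but not closed, since every drug causing vomiting also causes nausea; only {Nausea, Vomiting} is reported, together with the maximal group of drugs sharing it.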

FCIs obtained with TC representation are more informative.

Example: comparison of top-5 FCIs obtained with AIA datasets

All representation
  • Nausea_and_vomiting_symptoms, Vomiting (82 %)
  • Nausea (80 %)
  • Pruritus (79 %)
  • Nausea_and_vomiting_symptoms, Vomiting, Nausea (78 %)
  • Headache (76 %)

TC representation
  • 54_Erythema (88 %)
  • 64_Nausea_and_vomiting_symptoms (88 %)
  • 99_Neuromyopathy (82 %)
  • 54_Erythema, 64_Nausea_and_vomiting_symptoms (79 %)
  • 65_Blepharitis (78 %)

slide-21
SLIDE 21
  • Bio-ontologies and integration: still difficult
  • Bio-ontologies and mining

LORIA, Equipe Orpailleur, Nancy
Hôpital Saint Antoine, Paris
Harmonic Pharma
PhenoSystems
KIKA medical

http://plateforme-mbi.loria.fr/intelliGO