Investigating bias in semantic similarity measures

SLIDE 1

Investigating bias in semantic similarity measures

Marco Mina

mina@dei.unipd.it
University of Padova, Italy

and Pietro Hiram Guzzi
University Magna Græcia of Catanzaro

September 13, 2011

Marco Mina (University of Padova) Investigating bias in SS measures September 13, 2011 1 / 17

SLIDE 2

What is the Gene Ontology (GO)?

GO is a hierarchical vocabulary of terms describing functions and processes within cells.

Genes and proteins are annotated with GO terms representing their functions, roles and localization.

SLIDE 3

Semantic similarity (SS) measures

Exploit GO structure to measure similarity between GO terms.

Quantify the functional similarity between genes and proteins considering their annotations.

[Figure: proteins A, B and C and the GO terms annotating them.]

Sim(DNA binding, transferase activity) = ?
Sim(nucleotidyltransferase activity, DNA binding) = ?
Sim(nucleotidyltransferase activity, transferase activity) = ?

Sim(protein A, protein B) = ?
Sim(protein B, protein C) = ?
Sim(protein A, protein C) = ?

SLIDE 4

An Example of Semantic similarity

Determine the similarity between terms GO:0006139 and GO:0043283

[Figure: GO subgraph containing the following terms]

GO:0006139 Nucleobase, nucleoside, nucleotide and nucleic acid metabolism
GO:0043283 Biopolymer metabolism
GO:0044237 Cellular metabolism
GO:0043170 Macromolecule metabolism
GO:0044238 Primary metabolism
GO:0008152 Metabolism
GO:0009987 Cellular process
GO:0008150 Biological process

SLIDE 5

SimPL Semantic similarity

SimPL(t1, t2) = length of the shortest path between the two terms

[Figure: GO subgraph containing GO:0006139, GO:0043283 and their ancestors up to GO:0008150 Biological process.]

However, SimPL does not take into account term specificity
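A minimal sketch of SimPL on a toy hierarchy, assuming that the path may follow edges in either direction. The term names (`root`, `a`, `b`, `c`, `d`), the `PARENTS` mapping and the function `sim_pl` are hypothetical and are not taken from GO or from the slides.

```python
from collections import deque

# Hypothetical toy hierarchy (child -> parents); these are not real GO terms.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a"],
    "d": ["a"],
}


def sim_pl(parents, t1, t2):
    """Return the number of edges on the shortest path between t1 and t2.

    Edges are treated as undirected (an assumption); None means no path.
    """
    # Build an undirected adjacency map from the child -> parents mapping.
    neighbours = {term: set() for term in parents}
    for child, term_parents in parents.items():
        for parent in term_parents:
            neighbours[child].add(parent)
            neighbours.setdefault(parent, set()).add(child)

    if t1 == t2:
        return 0
    # Breadth-first search visits terms in order of increasing distance.
    queue = deque([(t1, 0)])
    seen = {t1}
    while queue:
        term, dist = queue.popleft()
        for nxt in neighbours.get(term, ()):
            if nxt == t2:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


generic_pair = sim_pl(PARENTS, "a", "b")   # siblings just below the root
specific_pair = sim_pl(PARENTS, "c", "d")  # siblings one level deeper
print(generic_pair, specific_pair)
```

In this toy graph both pairs are two edges apart, although `c` and `d` are deeper (more specific) than `a` and `b`. This is the limitation stated on the slide: path length alone ignores term specificity.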

SLIDE 6

Applications of Semantic similarity

Assign reliability to protein-protein interactions
Identify functional modules within protein interaction networks
Assess algorithm performance
Alignment of biological networks of different organisms

SLIDE 7

Biasing factors in Semantic similarity

In general, semantic similarity measures are influenced by:

the incompleteness of current annotation corpora
false or imprecise annotations

In this work we focus on the impact of two potentially biasing factors:

the use of electronically inferred annotations (IEA)
the shallow annotation problem

SLIDE 8

Evidence Codes

Several different strategies to infer annotations have been proposed. Need to keep track of the method used to infer an annotation.

Evidence codes (EC)

Describe the way an annotation has been established. Roughly speaking, annotations can be divided into two categories:

Electronically inferred annotations (IEA)   Non-IEA annotations
not as reliable as non-IEA                  much more reliable than IEA
generally rather generic                    generally very specific
many annotations available                  not many annotations available

Ignoring IEA drastically reduces the number of annotations, but raises the reliability of the annotation corpus.

What is the impact of considering IEA? We measured it on the semantic similarity scores between proteins within biological complexes.
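The filtering step can be sketched as follows, assuming each annotation is stored together with its evidence code. The `Annotation` record, the `CORPUS` entries and the protein and term names are hypothetical placeholders; IDA and IMP stand in for two non-electronic evidence codes.

```python
from collections import namedtuple

# One annotation: a protein, a GO term and the evidence code supporting it.
Annotation = namedtuple("Annotation", ["protein", "go_term", "evidence"])

# Hypothetical corpus; protein and term names are placeholders.
CORPUS = [
    Annotation("protein_A", "term_1", "IEA"),
    Annotation("protein_A", "term_2", "IDA"),
    Annotation("protein_B", "term_1", "IEA"),
    Annotation("protein_B", "term_3", "IMP"),
    Annotation("protein_B", "term_4", "IEA"),
]


def without_iea(annotations):
    """Keep only the annotations whose evidence code is not IEA."""
    return [a for a in annotations if a.evidence != "IEA"]


curated = without_iea(CORPUS)
print(len(CORPUS), "annotations with IEA,", len(curated), "without")
```

Running a similarity measure once on `CORPUS` and once on `curated` gives the two score sets whose difference is studied on the following slides.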

SLIDE 9

Our Testbed

Complexes have been extracted from the CYC2008 database [Pu et al.].

408 manually curated yeast protein complexes

[Figure: an example complex whose nodes are the proteins P40302, P21242, P23638, P23724, P32379, P23639, P25451, P30657, P21243, P38624, P25043, P22141, P40303 and P30656.]

We used Resnik BMA [Resnik] and SimGIC [Pesquita et al.] as semantic similarity measures.

Let SS(i, j) represent the semantic similarity between proteins i and j. The average complex semantic similarity SS(C) for complex C is:

\[ SS(C) = \frac{2 \sum_{i,j \in C} |SS(i, j)|}{|C| \, (|C| - 1)} \]

The factor 2 and the denominator |C|(|C| − 1) indicate that the sum runs over unordered pairs of distinct proteins, which makes SS(C) the mean pairwise score.
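A minimal sketch of the average complex semantic similarity, assuming the sum runs over unordered pairs of distinct proteins. The function names, the protein labels `p1` to `p3` and the pairwise scores are hypothetical; any protein-level measure (for example Resnik BMA or SimGIC) could be passed as `ss`.

```python
from itertools import combinations


def average_complex_similarity(members, ss):
    """Mean of |ss(i, j)| over all unordered pairs of distinct proteins."""
    members = list(members)
    n = len(members)
    if n < 2:
        raise ValueError("a complex needs at least two proteins")
    total = sum(abs(ss(i, j)) for i, j in combinations(members, 2))
    # There are n(n-1)/2 unordered pairs, hence the factor 2.
    return 2 * total / (n * (n - 1))


# Hypothetical pairwise scores for a three-protein complex.
TOY_SCORES = {
    frozenset(("p1", "p2")): 0.8,
    frozenset(("p1", "p3")): 0.6,
    frozenset(("p2", "p3")): 0.4,
}


def toy_ss(i, j):
    """Symmetric lookup of the toy scores above."""
    return TOY_SCORES[frozenset((i, j))]


score = average_complex_similarity(["p1", "p2", "p3"], toy_ss)
print(round(score, 3))
```

With three proteins there are three pairs, so the result is simply the mean of the three toy scores.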

SLIDE 10

Evidence Codes

Let annot(i) be the set of annotations involving protein i. The average number of annotations (AC) for a complex C is:

\[ AC = \frac{\sum_{i \in C} |annot(i)|}{|C|} \]

                                           Average number of annotations
Complex                              Size  with IEA  without IEA  Difference
cAMP-dependent Protein Kinase           4      7.75         2.50        5.25
MCM2-7 Complex                          6     11.83         6.33        5.50
NSP1P Complex                           4     17.25        13.00        4.25
Nucleotide-Excision Repair Factor 3     7      7.86         3.57        4.29
Dash Complex                           10     11.60         3.90        7.70

Examples of complexes selected for the analysis.

We focused on protein complexes with the highest difference of average number of annotations.
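The AC statistic and its with/without-IEA difference can be sketched as below, assuming each protein maps to a list of (term, evidence code) pairs. The complex `TOY_COMPLEX`, its proteins and its terms are hypothetical and do not correspond to any row of the table.

```python
def average_annotation_count(annot, use_iea=True):
    """AC: mean number of annotations per protein in a complex.

    annot maps each protein to a list of (go_term, evidence_code) pairs.
    With use_iea=False, annotations carrying the IEA code are ignored.
    """
    if not annot:
        raise ValueError("the complex has no proteins")
    sizes = [
        sum(1 for _term, evidence in entries if use_iea or evidence != "IEA")
        for entries in annot.values()
    ]
    return sum(sizes) / len(sizes)


# Hypothetical two-protein complex.
TOY_COMPLEX = {
    "p1": [("t1", "IEA"), ("t2", "IDA"), ("t3", "IEA")],
    "p2": [("t1", "IEA"), ("t4", "IMP")],
}

ac_with = average_annotation_count(TOY_COMPLEX)
ac_without = average_annotation_count(TOY_COMPLEX, use_iea=False)
print(ac_with, ac_without, ac_with - ac_without)
```

Ranking all complexes by `ac_with - ac_without` and keeping the largest values corresponds to the selection criterion stated above.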

SLIDE 11

Evidence Codes

                                           Avg. annotations         SimGIC  Resnik BMA
Complex                              Size  IEA / no IEA      Delta  Delta   Delta
cAMP-dependent Protein Kinase           4  7.75 / 2.50       5.25   0.01    0.07
MCM2-7 Complex                          6  11.83 / 6.33      5.50   0.10    0.01
NSP1P Complex                           4  17.25 / 13.00     4.25   0.01    0.01
Nucleotide-Excision Repair Factor 3     7  7.86 / 3.57       4.29   0.02    0.00
Dash Complex                           10  11.60 / 3.90      7.70   0.02    0.02

Variation of average SS scores for some selected complexes.

Semantic similarity does not vary significantly when IEA are considered.

These are the complexes with the highest variation in the number of annotations; the same behavior is likely to hold for the other complexes as well.

These results are valid only for protein complexes.

The biasing effect is significant in other cases [Benabderrahmane et al.] [Couto et al.].

SLIDE 12

Evidence Codes

SLIDE 13

Shallow annotations

Many proteins are annotated with very generic terms. These annotations do not identify the specific role or function of the protein. Protein pairs annotated only with the same generic GO terms should not be considered similar. What’s the bias introduced by shallow annotations?

SLIDE 14

Measuring Term specificity - Information Content (IC)

Information Content (IC) is based on the number of annotations involving a term and its descendants:

\[ IC_t(x) = -\log\left( \frac{n_x}{n_{root}} \right) \]

where n_x is the number of annotations involving term x and its descendants, and n_root is the same count for the root term.

[Figure: GO subgraph from GO:0008150 Biological process down to GO:0006139 and GO:0043283, with annotation counts shown on the terms: 5, 15, 15, 3, 5, 3, 8, 2.]

Example for GO:0008152 (metabolism): n_root = 56, n_GO:0008152 = 46, so IC_t(GO:0008152) = -log(46/56).
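A minimal sketch of the IC computation, assuming the natural logarithm (the slide does not state the base) and assuming n_x is obtained by summing the annotations attached to x and to each of its descendants once. The toy DAG `CHILDREN`, the counts in `DIRECT` and both function names are hypothetical; only the values 46 and 56 come from the slide.

```python
import math


def information_content(n_x, n_root):
    """IC_t(x) = -log(n_x / n_root); the natural logarithm is an assumption."""
    if not 0 < n_x <= n_root:
        raise ValueError("expected 0 < n_x <= n_root")
    return -math.log(n_x / n_root)


def cumulative_counts(children, direct):
    """n_x for each term: annotations attached to x or to any descendant."""
    def collect(term, seen):
        # Gather every descendant exactly once, even with multiple parents.
        for child in children.get(term, ()):
            if child not in seen:
                seen.add(child)
                collect(child, seen)
        return seen

    return {t: sum(direct[d] for d in collect(t, {t})) for t in direct}


# Hypothetical toy DAG; "c" has two parents and must be counted only once.
CHILDREN = {"root": ["a", "b"], "a": ["c"], "b": ["c"]}
DIRECT = {"root": 2, "a": 3, "b": 4, "c": 1}

counts = cumulative_counts(CHILDREN, DIRECT)
ic_a = information_content(counts["a"], counts["root"])

# The slide's example: n_root = 56 and n_x = 46 for GO:0008152 (metabolism).
ic_metabolism = information_content(46, 56)
print(counts, round(ic_a, 3), round(ic_metabolism, 3))
```

The root always has IC 0, and IC grows as a term covers a smaller share of the annotations, which is why a generic term such as metabolism scores low.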

SLIDE 15

Shallow annotations

Let annot(i) be the set of terms annotated for protein i. The average complex Information Content ICc(C) for a complex C is:

\[ IC_c(C) = \frac{\sum_{i \in C} \sum_{j \in annot(i)} IC_t(j)}{\sum_{i \in C} |annot(i)|} \]

Complex                        ICs of annotated terms             mean   max    min    var.
Serine C-Palmitoyltransferase  2.070, 6.909, 0.796, 2.070, 6.909  3.751  6.909  0.796  8.582

Example of average IC. Single ICs can vary a lot.

We selected the protein complexes with the lowest IC variance and checked whether their SS scores are correlated with average IC.
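The Serine C-Palmitoyltransferase row can be re-derived with a short sketch. The five IC values are the ones listed on the slide; how they are split between proteins is not given, so the assignment to `protein_1` and `protein_2` is a hypothetical placeholder (the mean does not depend on it). The function name is also an assumption.

```python
import statistics


def average_complex_ic(term_ics_by_protein):
    """ICc(C): sum of the ICs of all annotated terms over the number of annotations."""
    all_ics = [ic for ics in term_ics_by_protein.values() for ic in ics]
    if not all_ics:
        raise ValueError("the complex has no annotations")
    return sum(all_ics) / len(all_ics)


# IC values from the slide; the split between the two proteins is hypothetical.
SERINE_PALMITOYLTRANSFERASE = {
    "protein_1": [2.070, 6.909, 0.796],
    "protein_2": [2.070, 6.909],
}

ics = [ic for v in SERINE_PALMITOYLTRANSFERASE.values() for ic in v]
mean_ic = average_complex_ic(SERINE_PALMITOYLTRANSFERASE)
var_ic = statistics.variance(ics)  # sample variance (n - 1 denominator)
print(round(mean_ic, 3), max(ics), min(ics), round(var_ic, 3))
```

The tabulated variance of 8.582 is reproduced by the sample variance (denominator n − 1) rather than the population variance.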

SLIDE 16

Shallow annotations

Surprisingly, complexes with lower average IC have higher SS scores.

                        IC
Complex                 mean  max   min   Resnik BMA  SimGIC
AP-2 Adaptor Complex    3.29  3.77  1.69  1.000       1.000
SMC5P-SMC6P Complex     3.97  4.48  3.49  0.893       0.934
EIF3                    5.12  6.04  4.01  0.706       0.646
ARP2/3 Protein Complex  6.99  9.95  5.09  0.706       0.592
Signalosome Complex     7.49  8.16  5.64  0.749       0.631
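One way to quantify the trend in the table is a Pearson correlation between mean IC and each SS score. This is only an illustrative sketch: the slides do not say which statistic the authors used, the `pearson` helper is an assumption, and five rows are far too few for a statistical conclusion. The numbers are copied from the table.

```python
import math

# Columns copied from the table (one entry per complex, same order).
MEAN_IC = [3.29, 3.97, 5.12, 6.99, 7.49]
RESNIK_BMA = [1.000, 0.893, 0.706, 0.706, 0.749]
SIMGIC = [1.000, 0.934, 0.646, 0.592, 0.631]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


r_resnik = pearson(MEAN_IC, RESNIK_BMA)
r_simgic = pearson(MEAN_IC, SIMGIC)
print(round(r_resnik, 2), round(r_simgic, 2))
```

Both coefficients come out negative, in line with the observation that complexes with lower average IC have higher SS scores.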

SLIDE 17

Conclusions

Semantic similarity measures are affected by several biasing factors

Electronically Inferred Annotations

Do not influence scores between proteins within the same complex.

Do not have a uniform impact over the proteome: they significantly change results between other pairs of proteins, e.g. proteins sharing the same Pfam domains [Benabderrahmane et al., Couto et al.].

In general, better results are obtained when considering IEA.

Shallow Annotations

Semantic similarity correlates inversely with average complex term specificity.

It is necessary to improve semantic similarity measures to deal with this problem.

SLIDE 18

Thank you for asking. Questions?

SLIDE 19

References

Resnik, P. Using information content to evaluate semantic similarity in a taxonomy. IJCAI (1995) 448-453

Pu, S., Wong, J., Turner, B., Cho, E., Wodak, S.J. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Research 37(3) (February 2009) 825-831

Benabderrahmane, S., Smail-Tabbone, M., Poch, O., Napoli, A., Devignes, M.D. IntelliGO: a new vector-based semantic similarity measure including annotation origin. BMC Bioinformatics 11(1) (December 2010) 588

Couto, F., Silva, M., Coutinho, P. Measuring semantic similarity between Gene Ontology terms. Data & Knowledge Engineering 61(1) (April 2007) 137-152

Pesquita, C., et al. Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics 9 Suppl 5 (January 2008)