Investigating bias in semantic similarity measures Marco Mina - PowerPoint PPT Presentation



  1. Investigating bias in semantic similarity measures
     Marco Mina (mina@dei.unipd.it), University of Padova, Italy
     Pietro Hiram Guzzi, University Magna Græcia of Catanzaro
     September 13, 2011

  2. What is the Gene Ontology (GO)?
     - GO is a hierarchical vocabulary of terms describing functions and processes within cells.
     - Genes and proteins are annotated with GO terms representing their functions, roles and localization.

  3. Semantic similarity (SS) measures
     - Exploit the GO structure to measure similarity between GO terms:
         Sim(DNA binding, transferase activity) = ?
         Sim(nucleotidyltransferase activity, DNA binding) = ?
         Sim(nucleotidyltransferase activity, transferase activity) = ?
     - Quantify the functional similarity between genes and proteins, considering their annotations:
         Sim(protein A, protein B) = ?
         Sim(protein B, protein C) = ?
         Sim(protein A, protein C) = ?

  4. An example of semantic similarity
     Determine the similarity between terms GO:0006139 and GO:0043283.
     [Figure: fragment of the GO biological process hierarchy, listed level by level from the root]
       GO:0008150 Biological process
       GO:0008152 Metabolism; GO:0009987 Cellular process
       GO:0043170 Macromolecule metabolism; GO:0044237 Cellular metabolism; GO:0044238 Primary metabolism
       GO:0006139 Nucleobase, nucleoside, nucleotide and nucleic acid metabolism; GO:0043283 Biopolymer metabolism

  5. SimPL semantic similarity
     SimPL(t1, t2) = length of the shortest path between the two terms.
     [Figure: the same GO fragment as on slide 4, from GO:0008150 Biological process down to GO:0006139 and GO:0043283]
     However, SimPL does not take term specificity into account.
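The shortest-path idea behind SimPL can be sketched with a breadth-first search over the ontology graph. This is only an illustration: the parent links in `PARENTS` are an assumption (the edges of the slide's figure did not survive in this transcript), and the function name `simpl` is hypothetical.

```python
from collections import deque

# Child -> parents map for the GO fragment shown on the slide.
# ASSUMPTION: these is_a edges are illustrative; the figure's edges are not in the transcript.
PARENTS = {
    "GO:0008150": [],
    "GO:0008152": ["GO:0008150"],
    "GO:0009987": ["GO:0008150"],
    "GO:0043170": ["GO:0008152"],
    "GO:0044237": ["GO:0008152", "GO:0009987"],
    "GO:0044238": ["GO:0008152"],
    "GO:0006139": ["GO:0044237", "GO:0044238"],
    "GO:0043283": ["GO:0043170"],
}


def simpl(parents, t1, t2):
    """Return the number of edges on the shortest path between t1 and t2."""
    # Treat every parent/child link as an undirected edge.
    adjacency = {}
    for child, term_parents in parents.items():
        adjacency.setdefault(child, set())
        for parent in term_parents:
            adjacency[child].add(parent)
            adjacency.setdefault(parent, set()).add(child)

    # Plain breadth-first search from t1 until t2 is reached.
    distance = {t1: 0}
    queue = deque([t1])
    while queue:
        current = queue.popleft()
        if current == t2:
            return distance[current]
        for neighbour in adjacency.get(current, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    return None  # the two terms are not connected


distance = simpl(PARENTS, "GO:0006139", "GO:0043283")
print(distance)
```

Under the assumed edges the two example terms are four edges apart. The result depends only on path length, which is why two very generic terms and two very specific terms can receive the same score.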

  6. Applications of semantic similarity
     - Assign reliability to protein-protein interactions
     - Identify functional modules within protein interaction networks
     - Assess algorithm performance
     - Align biological networks of different organisms

  7. Biasing factors in semantic similarity
     In general, semantic similarity measures are influenced by:
     - the incompleteness of current annotation corpora
     - false or imprecise annotations
     In this work we focus on the impact of two potentially biasing factors:
     - use of electronically inferred annotations (evidence code IEA, Inferred from Electronic Annotation)
     - the shallow annotation problem

  8. Evidence codes
     Several different strategies to infer annotations have been proposed, so the method used to infer each annotation needs to be tracked.
     Evidence codes (EC) describe the way an annotation has been established.
     Roughly speaking, annotations can be divided into two categories:

     | IEA (Inferred from Electronic Annotation) | Non-IEA                     |
     |-------------------------------------------|-----------------------------|
     | not as reliable as non-IEA                | much more reliable than IEA |
     | generally rather generic                  | generally very specific     |
     | many annotations available                | not many annotations available |

     Ignoring IEA drastically reduces the number of annotations, but raises the reliability of the annotation corpus.
     What is the impact of considering IEA? We measured the impact of considering IEA on semantic similarity scores between proteins within biological complexes.

  9. Our testbed
     - Complexes were extracted from the CYC2008 database [Pu et al.]: 408 manually curated yeast protein complexes.
     - We used Resnik BMA [Resnik et al.] and SimGIC [Pesquita et al.] as semantic similarity measures.
     [Figure: an example complex drawn as a graph of proteins P32379, P40303, P21242, P30657, P22141, P21243, P23724, P30656, P23639, P40302, P25451, P23638, P38624, P25043]
     Let SS(i, j) represent the semantic similarity between proteins i and j. The average complex semantic similarity SS(C) for complex C is:
     $$SS(C) = \frac{2 \sum_{i,j \in C} |SS(i,j)|}{|C| \, (|C| - 1)}$$
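A minimal sketch of the average complex semantic similarity follows. It assumes that the sum runs over unordered pairs of distinct proteins (which is what makes the factor 2 / (|C|(|C| - 1)) an average) and that SS is symmetric. The protein names and pairwise scores are invented for illustration, and `average_complex_similarity` is a hypothetical helper, not the authors' code.

```python
from itertools import combinations


def average_complex_similarity(proteins, ss):
    """Average pairwise semantic similarity SS(C) of a complex.

    proteins: list of protein identifiers in the complex C
    ss: function (i, j) -> similarity score, assumed symmetric
    """
    n = len(proteins)
    if n < 2:
        raise ValueError("a complex needs at least two proteins")
    # Sum |SS(i, j)| over unordered pairs of distinct proteins.
    total = sum(abs(ss(i, j)) for i, j in combinations(proteins, 2))
    return 2 * total / (n * (n - 1))


# Hypothetical pairwise scores for a toy three-protein complex.
toy_scores = {
    frozenset(("A", "B")): 0.5,
    frozenset(("A", "C")): 1.0,
    frozenset(("B", "C")): 0.0,
}


def toy_ss(i, j):
    return toy_scores[frozenset((i, j))]


toy_average = average_complex_similarity(["A", "B", "C"], toy_ss)
print(toy_average)
```

In practice `ss` would wrap a protein-level measure such as Resnik BMA or SimGIC; the averaging step is independent of which measure is plugged in.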

  10. Evidence codes
      Let annot(i) be the set of annotations involving protein i. The average number of annotations A_C for a complex C is:
      $$A_C = \frac{\sum_{i \in C} |annot(i)|}{|C|}$$

      | Complex                             | Size | Avg. annotations with IEA | Avg. annotations without IEA | Difference |
      |-------------------------------------|------|---------------------------|------------------------------|------------|
      | cAMP-dependent Protein Kinase       | 4    | 7.75                      | 2.50                         | 5.25       |
      | MCM2-7 Complex                      | 6    | 11.83                     | 6.33                         | 5.50       |
      | NSP1P Complex                       | 4    | 17.25                     | 13.00                        | 4.25       |
      | Nucleotide-Excision Repair Factor 3 | 7    | 7.86                      | 3.57                         | 4.29       |
      | Dash Complex                        | 10   | 11.60                     | 3.90                         | 7.70       |

      Examples of complexes selected for the analysis. We focused on the protein complexes with the highest difference in average number of annotations.
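The with/without IEA comparison can be sketched as below. The data layout (each annotation as a `(GO term, evidence code)` pair) and the example annotations are assumptions made for illustration; `average_annotations` is a hypothetical helper.

```python
def average_annotations(complex_annotations, include_iea=True):
    """Average number of annotations per protein, A_C, for one complex.

    complex_annotations: dict mapping protein -> list of (go_term, evidence_code)
    include_iea: when False, annotations with evidence code "IEA" are dropped
    """
    if not complex_annotations:
        raise ValueError("the complex has no proteins")
    counts = []
    for annotations in complex_annotations.values():
        kept = [
            (term, code)
            for term, code in annotations
            if include_iea or code != "IEA"
        ]
        counts.append(len(kept))
    return sum(counts) / len(counts)


# Hypothetical annotations for a toy two-protein complex.
toy_complex = {
    "protein_1": [
        ("GO:0006139", "IEA"),
        ("GO:0043283", "IDA"),
        ("GO:0044237", "IEA"),
    ],
    "protein_2": [("GO:0006139", "IMP")],
}

with_iea = average_annotations(toy_complex, include_iea=True)
without_iea = average_annotations(toy_complex, include_iea=False)
print(with_iea, without_iea, with_iea - without_iea)
```

The last printed value corresponds to the "Difference" column used to pick the complexes most affected by removing IEA.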

  11. Evidence codes

      | Complex                             | Size | Avg. annotations IEA / no IEA | Delta | SimGIC delta | Resnik BMA delta |
      |-------------------------------------|------|-------------------------------|-------|--------------|------------------|
      | cAMP-dependent Protein Kinase       | 4    | 7.75 / 2.50                   | 5.25  | 0.01         | 0.07             |
      | MCM2-7 Complex                      | 6    | 11.83 / 6.33                  | 5.50  | 0.10         | 0.01             |
      | NSP1P Complex                       | 4    | 17.25 / 13.00                 | 4.25  | 0.01         | 0.01             |
      | Nucleotide-Excision Repair Factor 3 | 7    | 7.86 / 3.57                   | 4.29  | 0.02         | 0.00             |
      | Dash Complex                        | 10   | 11.60 / 3.90                  | 7.70  | 0.02         | 0.02             |

      Variation of average SS scores for some selected complexes.
      - Semantic similarity does not vary significantly when IEA are considered.
      - These are the complexes with the highest variation in the number of annotations; the same behavior is likely to hold for the other complexes.
      - These results are valid only for protein complexes. The biasing effect is significant in other cases [Benabderrahmane et al.] [Couto et al.].

  12. Evidence codes

  13. Shallow annotations
      - Many proteins are annotated with very generic terms.
      - These annotations do not identify the specific role or function of the protein.
      - Protein pairs annotated only with the same generic GO terms should not be considered similar.
      What is the bias introduced by shallow annotations?

  14. Measuring term specificity: Information Content (IC)
      Information Content (IC) is based on the number of annotations involving a term and its descendants:
      $$IC_t(x) = -\log\left(\frac{n_x}{n_{root}}\right)$$
      where n_x is the number of annotations involving term x and its descendants.
      [Figure: the GO fragment from slide 4 with an annotation count on each term. Per-term counts in the figure: 2, 8, 3, 5, 15, 3, 15, 5.]
      Example: n_root = 56 and n_{GO:0008152} = 46, so IC(GO:0008152) = -log(46/56).
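The slide's worked example can be reproduced with a few lines. The base of the logarithm is not stated on the slide; the natural logarithm is assumed here, and `information_content` is a hypothetical helper name.

```python
import math


def information_content(n_x, n_root):
    """IC of a term: -log(n_x / n_root).

    n_x: annotations involving the term or any of its descendants
    n_root: annotations involving the root or any of its descendants
    ASSUMPTION: natural logarithm; the slide does not state the base.
    """
    if not 0 < n_x <= n_root:
        raise ValueError("expected 0 < n_x <= n_root")
    return -math.log(n_x / n_root)


# Counts taken from the slide: 46 of the 56 annotations fall under the example term.
ic_example = information_content(46, 56)
print(round(ic_example, 3))
```

A term covering almost all annotations gets an IC close to zero, while a rarely used, specific term gets a large IC; the root itself always has IC 0.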

  15. Shallow annotations
      Let annot(i) be the set of terms annotated for protein i. The average complex Information Content IC_c(C) for a complex C is:
      $$IC_c(C) = \frac{\sum_{i \in C} \sum_{j \in annot(i)} IC_t(j)}{\sum_{i \in C} |annot(i)|}$$

      | Complex                      | ICs of annotated terms            | mean  | max   | min   | var.  |
      |------------------------------|-----------------------------------|-------|-------|-------|-------|
      | Serine C-Palmitoyltransferase | 2.070, 6.909, 0.796, 2.070, 6.909 | 3.751 | 6.909 | 0.796 | 8.582 |

      Example of average IC.
      - Single ICs can vary a lot.
      - We selected the protein complexes with the lowest variance.
      - We checked whether SS scores are correlated with average IC.
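The Serine C-Palmitoyltransferase row can be recomputed as a check on the definition. The five IC values come from the slide; how they are split between the complex's proteins is not given, so the split below is an assumption, as are the protein labels and the helper name `average_complex_ic`. The reported variance matches the sample variance (n - 1 denominator).

```python
from statistics import variance


def average_complex_ic(annotation_ics):
    """Average IC over all annotated terms of all proteins in a complex.

    annotation_ics: dict mapping protein -> list of IC values of its annotated terms
    """
    total = sum(sum(ics) for ics in annotation_ics.values())
    count = sum(len(ics) for ics in annotation_ics.values())
    if count == 0:
        raise ValueError("the complex has no annotations")
    return total / count


# IC values from the slide; the per-protein split is hypothetical.
serine_palmitoyltransferase = {
    "protein_1": [2.070, 6.909, 0.796],
    "protein_2": [2.070, 6.909],
}

all_ics = [ic for ics in serine_palmitoyltransferase.values() for ic in ics]
mean_ic = average_complex_ic(serine_palmitoyltransferase)
var_ic = variance(all_ics)  # sample variance
print(round(mean_ic, 3), max(all_ics), min(all_ics), round(var_ic, 3))
```

Because the numerator and denominator both pool every annotation in the complex, the result does not depend on how the values are distributed across proteins.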

  16. Shallow annotations
      Surprisingly, complexes with lower average IC have higher SS scores.

      | Complex                | IC mean | IC max | IC min | Resnik BMA | SimGIC |
      |------------------------|---------|--------|--------|------------|--------|
      | AP-2 Adaptor Complex   | 3.29    | 3.77   | 1.69   | 1.000      | 1.000  |
      | SMC5P-SMC6P Complex    | 3.97    | 4.48   | 3.49   | 0.893      | 0.934  |
      | EIF3                   | 5.12    | 6.04   | 4.01   | 0.706      | 0.646  |
      | ARP2/3 Protein Complex | 6.99    | 9.95   | 5.09   | 0.706      | 0.592  |
      | Signalosome Complex    | 7.49    | 8.16   | 5.64   | 0.749      | 0.631  |
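One way to quantify the trend in the table is a Pearson correlation between mean IC and each SS score. The slides do not say which statistic the authors used, so this choice is an assumption, and with only five complexes the result is illustrative rather than conclusive. The values are copied from the table; `pearson` is a hypothetical helper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Rows of the table: mean IC, Resnik BMA, SimGIC.
mean_ic = [3.29, 3.97, 5.12, 6.99, 7.49]
resnik_bma = [1.000, 0.893, 0.706, 0.706, 0.749]
simgic = [1.000, 0.934, 0.646, 0.592, 0.631]

r_resnik = pearson(mean_ic, resnik_bma)
r_simgic = pearson(mean_ic, simgic)
print(r_resnik, r_simgic)
```

Both coefficients come out negative for these five rows, which is consistent with the slide's observation that complexes with lower average IC score higher.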

  17. Conclusions
      Semantic similarity measures are affected by several biasing factors.
      - Electronically inferred annotations (IEA)
          - Do not influence scores between proteins within the same complex
          - Their impact is not uniform across the proteome
          - Significantly change results between other pairs of proteins (e.g. proteins sharing the same Pfam domains) [Benabderrahmane et al., Couto et al.]
          - In general, better results are obtained when considering IEA
      - Shallow annotations
          - Semantic similarity correlates inversely with average complex term specificity
          - Semantic similarity measures need to be improved to deal with this problem

  18. Thank you for asking. Questions?
