 
Interspecies gene function prediction using semantic similarity
Guoxian Yu*, Wei Luo, Guangyuan Fu, Jun Wang
Machine Learning and Data Analysis Lab, Southwest University
gxyu@swu.edu.cn
www.mlda.swu.edu.cn

Outline
1. Background
2. Method
3. Experiments
4. Conclusions

Background
- Why predict gene functions?
  a) Understanding gene function supports the study of life processes, pharmacy, and disease analysis.
  b) High-throughput biotechniques produce various kinds of protein data (PPIs, protein sequences).
  c) Humans have about 20,000 proteins, but only 1/3 are well studied. Many others are partially annotated or unannotated.

Background
- Why interspecies?
  a) Homologous species share a large portion of homologous genes, and these genes have similar (or the same) functional annotations.
  b) In practice, these genes are annotated with different terms because of differences in experimental ethics and protocols and in the research interests of biologists.
  c) These different annotations provide complementary functional clues.

Background
- Gene Ontology (GO) annotations of a Human gene and a Mouse gene
  [Figure: GO annotations of a Human gene and a Mouse gene. Legend: currently available annotations; missing annotations.]

Method
- Semantic similarity metrics between terms:
  Edge-based:
    Distance: shortest paths, average paths
    Common ancestor
  Node-based:
    IC
    Common descendants
    Number of ancestors, node depth
- Lin's similarity:

$$\mathrm{sim}_{\mathrm{Lin}}(t_1, t_2) = \frac{2 \times IC(t_A)}{IC(t_1) + IC(t_2)}$$

  where $IC(t) = -\log(p(t))$ is the information content of term $t$, and $t_A$ is the most informative common ancestor of $t_1$ and $t_2$.

Pesquita, C., et al. Semantic similarity in biomedical ontologies. PLoS Computational Biology, 2009, 5(7), e1000443.

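To make Lin's formula concrete, here is a minimal Python sketch. The toy ontology `PARENTS` and the probabilities `P` are invented for illustration and are not real GO terms or frequencies. Returning 0 when both terms have zero information content is also an assumption of this sketch.

```python
import math

# Hypothetical toy ontology (child -> parents); these are not real GO terms.
PARENTS = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "c": {"a"},
    "d": {"a"},
}

# Hypothetical annotation probabilities p(t), invented for illustration.
P = {"root": 1.0, "a": 0.5, "b": 0.5, "c": 0.25, "d": 0.125}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content: IC(t) = -log(p(t))."""
    return -math.log(P[term])


def sim_lin(t1, t2):
    """Lin similarity: 2 * IC(t_A) / (IC(t1) + IC(t2))."""
    common = ancestors(t1) & ancestors(t2)
    ic_a = max(ic(t) for t in common)  # IC of the most informative common ancestor
    denom = ic(t1) + ic(t2)
    # Assumption of this sketch: similarity is 0 when both terms have zero IC.
    return 2.0 * ic_a / denom if denom > 0 else 0.0


print(sim_lin("c", "d"))  # siblings under "a"
print(sim_lin("c", "b"))  # only the root in common
```

With these invented probabilities, `c` and `d` share the ancestor `a`, so their similarity is 2·log 2 / (log 4 + log 8) = 0.4, while `c` and `b` share only the root, whose information content is zero.
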
Method
- Semantic similarity metrics between genes (sets of terms) [1]:
  Pairwise:
    Best pairs: MAX, BMA
    All pairs: AVG
  Groupwise:
    Graph: UI, GIC
    Set: Term Overlap
    Vector: CoSim
- Semantic similarity can be used for gene function prediction [2,3].

[1] Pesquita, C., et al. Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics, 2008, 9(S5), 4.
[2] Tao, Y., et al. Information theory applied to the sparse gene ontology annotation network to predict novel gene function. Bioinformatics, 2007, 23(13), 529-538.
[3] Yu, G., et al. Predicting protein function via downward random walks on a gene ontology. BMC Bioinformatics, 2015, 16: 271.

Method
- Three semantic similarity metrics between genes
  Best Match Average (BMA) [1]
  simGIC [1]
  Term Overlap (TO) [2]

[1] Pesquita, C., Faria, D., Bastos, H., Ferreira, A.E., Falcao, A.O., Couto, F.M. Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics, 2008, 9(S5), 4.
[2] Mistry, M., Pavlidis, P. Gene Ontology term overlap as a measure of gene functional similarity. BMC Bioinformatics, 2008, 9(1): 1.

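The slide names the three gene-level metrics without giving formulas. The Python sketch below uses their commonly cited definitions, which is an assumption rather than the authors' exact implementation:

- simGIC is the IC-weighted Jaccard index of two term sets.
- TO is the number of shared terms.
- BMA is the average of the best term-to-term matches in both directions.

The term sets are assumed to already include ancestor terms. The `ic` values and the `exact_match` term similarity are hypothetical.

```python
def sim_gic(terms1, terms2, ic):
    """IC-weighted Jaccard index of two term sets."""
    total = sum(ic[t] for t in terms1 | terms2)
    if total == 0:
        return 0.0
    return sum(ic[t] for t in terms1 & terms2) / total


def term_overlap(terms1, terms2):
    """Number of terms shared by the two genes."""
    return len(terms1 & terms2)


def bma(terms1, terms2, term_sim):
    """Best Match Average: mean of the best matches in both directions."""
    if not terms1 or not terms2:
        return 0.0
    best1 = sum(max(term_sim(t, u) for u in terms2) for t in terms1) / len(terms1)
    best2 = sum(max(term_sim(t, u) for t in terms1) for u in terms2) / len(terms2)
    return 0.5 * (best1 + best2)


# Hypothetical information content values and term sets.
ic = {"f1": 1.0, "f2": 2.0, "f3": 1.0}
gene1 = {"f1", "f2"}
gene2 = {"f2", "f3"}


def exact_match(t, u):
    """Placeholder term similarity: 1 for identical terms, 0 otherwise."""
    return 1.0 if t == u else 0.0


print(sim_gic(gene1, gene2, ic))        # 2 / (1 + 2 + 1) = 0.5
print(term_overlap(gene1, gene2))       # 1
print(bma(gene1, gene2, exact_match))   # 0.5
```

In practice `term_sim` would be a term-level metric such as Lin's similarity rather than the exact-match placeholder used here.
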
Method
Human GOA:
        f1  f2  f3  f4  f5
    p1   0   0   1   0   1
    p2   0   0   1   0   1
    p3   1   1   0   1   1
    p4   0   0   1   0   0

Mouse GOA:
        f1  f2  f3  f4  f5
    p1   0   1   0   0   0
    p2   1   0   1   0   0
    p3   1   1   0   1   0
    p4   1   0   0   0   0
    p5   0   0   1   1   0

Human_MouseGOA (Human genes p1-p4, followed by Mouse genes renumbered as p5-p9):
        f1  f2  f3  f4  f5
    p1   0   0   1   0   1
    p2   0   0   1   0   1
    p3   1   1   0   1   1
    p4   0   0   1   0   0
    p5   0   1   0   0   0
    p6   1   0   1   0   0
    p7   1   1   0   1   0
    p8   1   0   0   0   0
    p9   0   0   1   1   0

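The Human_MouseGOA matrix in this toy example is the row-wise concatenation of the two species' gene-term matrices over the shared term columns f1 to f5. A minimal Python sketch of that step follows. The variable names are illustrative, and the `species` label list is an added convenience that does not appear on the slide.

```python
# Gene-term matrices from the toy example (rows: genes, columns: f1..f5).
human_goa = [
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
]
mouse_goa = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
]

# Stack the rows: human genes keep indices p1..p4, mouse genes become p5..p9.
human_mouse_goa = human_goa + mouse_goa

# Illustrative bookkeeping: remember which species each row came from.
species = ["human"] * len(human_goa) + ["mouse"] * len(mouse_goa)

for label, row in zip(species, human_mouse_goa):
    print(label, row)
```
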
Method
- Intraspecies semantic similarity (simGIC)

HumanGOA:
        f1  f2  f3  f4  f5
    p1   0   0   1   0   1
    p2   0   0   1   0   1
    p3   1   1   0   1   1
    p4   0   0   1   0   0

Gene semantic similarity:
        p1    p2    p3    p4
    p1  1     1     0.23  0.35
    p2  1     1     0.23  0.35
    p3  0.23  0.23  1     0
    p4  0.35  0.35  0     1

Method
- Interspecies semantic similarity (simGIC)

Human_MouseGOA:
        f1  f2  f3  f4  f5
    p1   0   0   1   0   1
    p2   0   0   1   0   1
    p3   1   1   0   1   1
    p4   0   0   1   0   0
    p5   0   1   0   0   0
    p6   1   0   1   0   0
    p7   1   1   0   1   0
    p8   1   0   0   0   0
    p9   0   0   1   1   0

Gene semantic similarity:
        p1    p2    p3    p4    p5    p6    p7    p8    p9
    p1  1     1     0.23  0.35  0     0.24  0     0     0.21
    p2  1     1     0.23  0.35  0     0.24  0     0     0.21
    p3  0.23  0.23  1     0     0.27  0.17  0.73  0.2   0.23
    p4  0.35  0.35  0     1     0     0.42  0     0     0.35
    p5  0     0     0.27  0     1     0     0.37  0     0
    p6  0.24  0.24  0     0.17  0     1     0.23  0.58  0.24
    p7  0     0     0.73  0     0.37  0.23  1     0.27  0.3
    p8  0     0     0.2   0     0     0.58  0.27  1     0
    p9  0.21  0.21  0.23  0.35  0     0.24  0.3   0     1

Method
- Gene function prediction using kNN

Intraspecies:

$$p(i, t) = \frac{1}{k} \sum_{j \in N_k(i)} A(j, t)$$

$N_k(i)$ consists of the $k$ nearest neighbors of the $i$-th gene from the same species.

Interspecies:

$$p(i, t) = \frac{1}{k_1 + k_2} \sum_{s=1}^{2} \sum_{j \in N_{k_s}(i)} A(j, t), \qquad k = k_1 + k_2$$

$N_{k_1}(i)$ consists of the $k_1$ nearest neighbor genes of the $i$-th gene from its own species ($s = 1$), and $N_{k_2}(i)$ consists of the $k_2$ nearest neighbor genes of the $i$-th gene from the other species ($s = 2$).

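The two kNN scores can be sketched in Python on the toy example from the earlier slides. The sketch makes three assumptions:

- `A` is taken to be the binary gene-term matrix Human_MouseGOA.
- `SIM_P1` is the p1 row of the interspecies simGIC matrix.
- The query gene is excluded from its own neighborhood.

The helper names `top_k` and `knn_scores` are hypothetical.

```python
# Gene-term matrix from the Human_MouseGOA toy example (rows p1..p9, columns f1..f5).
A = [
    [0, 0, 1, 0, 1],  # p1 (human)
    [0, 0, 1, 0, 1],  # p2 (human)
    [1, 1, 0, 1, 1],  # p3 (human)
    [0, 0, 1, 0, 0],  # p4 (human)
    [0, 1, 0, 0, 0],  # p5 (mouse)
    [1, 0, 1, 0, 0],  # p6 (mouse)
    [1, 1, 0, 1, 0],  # p7 (mouse)
    [1, 0, 0, 0, 0],  # p8 (mouse)
    [0, 0, 1, 1, 0],  # p9 (mouse)
]

# simGIC similarities of human gene p1 to p1..p9, copied from the toy example.
SIM_P1 = [1, 1, 0.23, 0.35, 0, 0.24, 0, 0, 0.21]

HUMAN = [0, 1, 2, 3]
MOUSE = [4, 5, 6, 7, 8]


def top_k(sim_row, candidates, k):
    """Indices of the k candidate genes most similar to the query gene."""
    return sorted(candidates, key=lambda j: sim_row[j], reverse=True)[:k]


def knn_scores(annotations, neighbors):
    """p(i, t): average of the neighbors' annotation rows, one score per term."""
    n_terms = len(annotations[0])
    return [
        sum(annotations[j][t] for j in neighbors) / len(neighbors)
        for t in range(n_terms)
    ]


query = 0  # human gene p1
# Assumption: the query gene is not its own neighbor.
own = [j for j in HUMAN if j != query]

own_neighbors = top_k(SIM_P1, own, 2)      # k1 = 2 neighbors from the same species
other_neighbors = top_k(SIM_P1, MOUSE, 2)  # k2 = 2 neighbors from the other species

intra = knn_scores(A, own_neighbors)
inter = knn_scores(A, own_neighbors + other_neighbors)

print(intra)  # [0.0, 0.0, 1.0, 0.0, 0.5]
print(inter)  # [0.25, 0.0, 1.0, 0.25, 0.25]
```

In this toy case the human neighbors of p1 are p2 and p4, and the mouse neighbors are p6 and p9. The interspecies scores give f1 and f4 nonzero support that the intraspecies scores cannot provide.
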
Experiments
- Datasets
  History: historical GOA file (archived date: 2014-01-20)
  Recent: recent GOA file (archived date: 2016-01-04)
- Investigations:
  - The improvement brought by interspecies gene function prediction
  - The impact of semantic similarities
  - The influence of homology between species

Experiments
- Results on archived annotations using TO in the CC branch
  H → H: uses GO annotations of Human genes to predict missing annotations of Human genes.
  M → H: uses GO annotations of Mouse genes to predict missing annotations of Human genes.
  M+H → H: uses GO annotations of both Mouse and Human genes to predict missing annotations of Human genes.

Experiments
- Results on archived annotations using GIC in the CC branch
- Observations:
  - Interspecies gene function prediction works better than single-species prediction.
  - A species with high homology contributes more than one with low homology.
  - The choice of semantic similarity metric does not affect the above observations and results.

Experiments
- Combining annotations in CC, MF, and BP
- Observations:
  - Interspecies gene function prediction still brings an improvement, but the improvement is smaller than within a single branch.
  - The BP, CC, and MF branches provide functional clues for one another.

Experiments
- Results on simulated missing GO annotations
  $q$ is the number of simulated missing annotations of a gene.
- Observation:
  - Interspecies gene function prediction yields more pronounced improvements on simulated missing annotations.

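One way to simulate missing annotations is to hide a fixed number of existing annotations per gene and then check whether the predictor recovers them. The Python sketch below is an assumption about such a protocol, not the authors' exact procedure. It removes up to `q` annotations per gene at random and, as a further assumption, always leaves each gene at least one annotation. The function name `mask_annotations` and the `seed` parameter are hypothetical.

```python
import random


def mask_annotations(annotations, q, seed=0):
    """Hide up to q annotations per gene; return the masked matrix and the hidden pairs."""
    rng = random.Random(seed)
    masked = [row[:] for row in annotations]  # copy so the input stays unchanged
    removed = []
    for i, row in enumerate(masked):
        present = [t for t, value in enumerate(row) if value == 1]
        # Assumption: every gene keeps at least one annotation.
        n_remove = min(q, max(len(present) - 1, 0))
        for t in rng.sample(present, n_remove):
            row[t] = 0
            removed.append((i, t))
    return masked, removed


# Human GOA matrix from the toy example (rows p1..p4, columns f1..f5).
human_goa = [
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
]

masked, removed = mask_annotations(human_goa, q=1)
print(len(removed))  # p4 has a single annotation, so only three pairs are hidden
```

The hidden `(gene, term)` pairs in `removed` can then serve as the ground truth when scoring predictions made from `masked`.
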
Conclusions
- Interspecies gene function prediction using GO annotations of two species with high homology performs better than prediction using a single species or two species without such high homology.
- GO annotations of two homologous species are complementary to each other.
- Future work: combine semantic similarity with other biological data for interspecies gene function prediction.

Any questions? Thanks!