Interspecies gene function prediction using semantic similarity



SLIDE 1

www.mlda.swu.edu.cn

Interspecies gene function prediction using semantic similarity

Guoxian Yu*, Wei Luo, Guangyuan Fu, Jun Wang

Machine Learning and Data Analysis Lab. Southwest University

gxyu@swu.edu.cn

SLIDE 2

Outline

1. Backgrounds
2. Method
3. Experiments
4. Conclusions

SLIDE 3

Backgrounds

Why predict gene functions?

a) To understand life processes and to support pharmacy and disease analysis.
b) High-throughput bio-techniques produce various kinds of protein data (PPIs, protein sequences).
c) Humans have about 20,000 proteins, but only 1/3 are well studied. Many others are partially annotated or unannotated.

SLIDE 4

Backgrounds

Why interspecies?

a) Homologous species share a large portion of homologous genes, and these genes have similar (or the same) functional annotations.
b) In practice, these genes are annotated with different terms because of differences in experimental ethics and protocols and in the research interests of biologists.
c) These different annotations provide complementary functional clues.

SLIDE 5

Backgrounds

[Figure: Gene Ontology (GO) annotations of a human gene and a mouse gene. Legend: currently available annotations; missing annotations.]

SLIDE 6

Method

Semantic similarity metrics

Between terms:

• Edge-based, e.g. distance (shortest paths, average paths)
• Node-based, e.g. IC, common ancestor, common descendants, number of ancestors, node depth

$$\mathrm{sim}_{Lin}(t_1, t_2) = \frac{2 \times IC(t_A)}{IC(t_1) + IC(t_2)}$$

where $IC(t)$ is the information content of term $t$, defined as $-\log p(t)$, and $t_A$ is the most informative common ancestor of $t_1$ and $t_2$.

Pesquita, C., et al. Semantic similarity in biomedical ontologies. PLoS Computational Biology, 2009, 5(7), e1000443.
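As an illustration of the Lin formula on this slide, here is a minimal Python sketch. The toy ontology `PARENTS` and the probabilities `P` are hypothetical values chosen only to make the example runnable; they are not taken from the Gene Ontology or from the presentation.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms (illustrative names only).
PARENTS = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "c": {"a"},
    "d": {"a", "b"},
}

# Hypothetical annotation probabilities p(t). An ancestor is at least as
# frequent as its descendants, so the root has p = 1 and IC = 0.
P = {"root": 1.0, "a": 0.5, "b": 0.4, "c": 0.2, "d": 0.1}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content IC(t) = -log p(t)."""
    return -math.log(P[term])


def lin_similarity(t1, t2):
    """Lin similarity: 2 * IC(t_A) / (IC(t1) + IC(t2))."""
    if t1 == t2:
        return 1.0
    common = ancestors(t1) & ancestors(t2)
    ic_mica = max(ic(t) for t in common)  # most informative common ancestor
    denom = ic(t1) + ic(t2)
    return 2.0 * ic_mica / denom if denom > 0 else 0.0


sim_cd = lin_similarity("c", "d")  # share ancestor "a"
sim_cb = lin_similarity("c", "b")  # share only the root, whose IC is 0
```

In this toy example, terms that share only the root get a similarity of 0, because the root carries no information.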

SLIDE 7

Method

Semantic similarity metrics

Between genes (sets of terms) [1]:

• Pairwise
  • All pairs: AVG
  • Best pairs: MAX, BMA
• Groupwise
  • Set: Term Overlap
  • Vector: CoSim
  • Graph: UI, GIC

Semantic similarity can be used for gene function prediction [2,3].

[1] Pesquita, C. et al. Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics, 2008, 9(S5), 4.
[2] Tao, Y. et al. Information theory applied to the sparse gene ontology annotation network to predict novel gene function. Bioinformatics, 2007, 23(13), i529-i538.
[3] Yu, G. et al. Predicting protein function via downward random walks on a gene ontology. BMC Bioinformatics, 2015, 16: 271.

SLIDE 8

Method

Three semantic similarity metrics between genes

• Best Match Average (BMA) [1]
• simGIC [1]
• Term Overlap (TO) [2]

[1] Pesquita, C., Faria, D., Bastos, H., Ferreira, A.E., Falcao, A.O., Couto, F.M.: Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics, 2008, 9(S5), 4.
[2] Mistry, M., Pavlidis, P.: Gene Ontology term overlap as a measure of gene functional similarity. BMC Bioinformatics, 2008, 9: 327.
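
The slide names the metrics without giving their formulas. The following Python sketch shows one common formulation of Term Overlap and Best Match Average, under stated assumptions: each gene is a set of GO terms that already includes ancestor terms, and the term-level similarities in `TOY_SIM` are hypothetical numbers for illustration only. It may differ in detail from the implementation the authors used.

```python
def term_overlap(terms_a, terms_b, normalized=False):
    """Term Overlap: number of GO terms shared by two genes.

    The normalized variant divides by the size of the smaller term set.
    """
    shared = len(terms_a & terms_b)
    if not normalized:
        return shared
    smaller = min(len(terms_a), len(terms_b))
    return shared / smaller if smaller else 0.0


def best_match_average(terms_a, terms_b, term_sim):
    """BMA: mean of each term's best match in the other gene, in both directions."""
    if not terms_a or not terms_b:
        return 0.0
    best_a = [max(term_sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(term_sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) / len(best_a) + sum(best_b) / len(best_b)) / 2.0


# Hypothetical symmetric term-to-term similarities, for illustration only.
TOY_SIM = {("t1", "t2"): 0.5, ("t1", "t3"): 0.2, ("t2", "t3"): 0.8}


def toy_term_sim(a, b):
    """Look up a toy similarity; identical terms have similarity 1."""
    if a == b:
        return 1.0
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))


gene_x = {"t1", "t2"}
gene_y = {"t2", "t3"}

to_xy = term_overlap(gene_x, gene_y)                    # one shared term
nto_xy = term_overlap(gene_x, gene_y, normalized=True)  # 1 / min(2, 2)
bma_xy = best_match_average(gene_x, gene_y, toy_term_sim)
```

`best_match_average` takes the term-level similarity as a callable, so any term measure (for example Lin's) can be plugged in.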

SLIDE 9

Method

[Figure: three binary gene-term association matrices (rows: genes; columns: GO terms f1-f5; a 1 marks an annotation). Human GOA contains genes p1-p4, Mouse GOA contains genes p1-p5, and Human_Mouse GOA stacks both into genes p1-p9.]
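
A minimal sketch of how two per-species gene-term matrices could be merged into one, as the Human_Mouse GOA panel suggests. The cell values below are made up for illustration (the slide's cell positions are not reproduced here), and the `species` label array is a hypothetical helper for telling the rows apart later.

```python
import numpy as np

# Hypothetical binary gene-term matrices; rows are genes, columns are f1..f5.
# Values are illustrative only and are not the slide's values.
human = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1],
])
mouse = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 1, 1, 0],
])

# Stack the two species row-wise; both must use the same GO term columns.
merged = np.vstack([human, mouse])

# Keep track of which species each row of the merged matrix came from.
species = np.array(["human"] * len(human) + ["mouse"] * len(mouse))
```

Stacking only works when both matrices index the same GO terms in the same column order, so in practice the term vocabulary has to be aligned first.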

SLIDE 10

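Method

Intraspecies semantic similarity

[Figure: the Human GOA gene-term matrix (genes p1-p4, terms f1-f5) is converted by simGIC into a gene semantic similarity matrix.]

Gene semantic similarity (simGIC); "-" marks a cell left empty on the slide:

      p1    p2    p3    p4
p1    1     1     0.23  0.35
p2    1     1     0.23  0.35
p3    0.23  0.23  1     -
p4    0.35  0.35  -     1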
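
The step from a gene-term matrix to a gene similarity matrix can be sketched as follows. This is a minimal illustration of simGIC (summed IC of shared terms divided by summed IC of the union), under two assumptions not stated on the slide: the binary matrix already includes ancestor terms, and p(t) is estimated as the fraction of genes annotated with t. The toy matrix is hypothetical and does not reproduce the slide's numbers.

```python
import numpy as np


def simgic_matrix(annotations):
    """Pairwise simGIC between genes from a binary gene-term matrix."""
    A = np.asarray(annotations, dtype=float)
    n_genes = A.shape[0]
    freq = A.sum(axis=0) / n_genes       # assumed estimate of p(t)
    ic = np.zeros_like(freq)
    used = freq > 0
    ic[used] = -np.log(freq[used])       # IC(t) = -log p(t)
    weighted = A * ic                    # IC carried by each gene's terms
    shared = weighted @ A.T              # summed IC over shared terms
    totals = weighted.sum(axis=1)        # summed IC over each gene's terms
    union = totals[:, None] + totals[None, :] - shared
    sim = np.zeros_like(shared)
    # Genes whose terms all have IC = 0 keep similarity 0 (division skipped).
    np.divide(shared, union, out=sim, where=union > 0)
    return sim


# Hypothetical toy matrix: 4 genes x 3 terms.
toy = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
])
toy_sim = simgic_matrix(toy)
```

The same function can be applied to a merged two-species matrix, in which case the term frequencies, and therefore the IC values, are computed over the genes of both species.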

SLIDE 11

Method

Interspecies semantic similarity

[Figure: the Human_Mouse GOA gene-term matrix (genes p1-p9, terms f1-f5) is converted by simGIC into a 9 × 9 gene semantic similarity matrix.]

Gene semantic similarity (simGIC), non-empty entries of each row in column order (columns p1-p9; empty cells omitted):

p1: 1, 1, 0.23, 0.35, 0.24, 0.21
p2: 1, 1, 0.23, 0.35, 0.24, 0.21
p3: 0.23, 0.23, 1, 0.27, 0.17, 0.73, 0.2, 0.23
p4: 0.35, 0.35, 1, 0.42, 0.35
p5: 0.27, 1, 0.37
p6: 0.24, 0.24, 0.17, 1, 0.23, 0.58, 0.24
p7: 0.73, 0.37, 0.23, 1, 0.27, 0.3
p8: 0.2, 0.58, 0.27, 1
p9: 0.21, 0.21, 0.23, 0.35, 0.24, 0.3, 1

SLIDE 12

Method

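Gene function prediction using kNN

Intraspecies:

$$p(i, t) = \frac{1}{k} \sum_{j \in N_k(i)} A(j, t)$$

where $N_k(i)$ consists of the $k$ nearest neighbors of the $i$-th gene from the same species, and $A(j, t)$ is the entry of the gene-term association matrix for gene $j$ and term $t$.

Interspecies:

$$p(i, t) = \frac{1}{k_1 + k_2} \sum_{s=1}^{2} \sum_{j \in N^{s}_{k_s}(i)} A(j, t), \qquad k = k_1 + k_2$$

where $N^{1}_{k_1}(i)$ consists of the $k_1$ nearest neighbor genes of the $i$-th gene from its own species, $N^{2}_{k_2}(i)$ consists of the $k_2$ nearest neighbor genes of the $i$-th gene from another species, and $s$ = 1 or 2.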
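
The two kNN rules on this slide can be sketched in Python as below. This is a minimal illustration, not the authors' code: the helper names, the toy similarity values, and the choice to exclude a gene from its own neighborhood are assumptions, and `k`, `k1`, `k2` are assumed to be at least 1.

```python
import numpy as np


def knn_scores(sim_row, annotations, candidates, k):
    """Sum A(j, t) over the k candidate genes most similar to the query gene."""
    candidates = np.asarray(candidates)
    if k <= 0 or candidates.size == 0:
        return np.zeros(annotations.shape[1])
    order = np.argsort(-sim_row[candidates], kind="stable")[:k]
    return annotations[candidates[order]].sum(axis=0)


def predict_intraspecies(i, sim, annotations, species, k):
    """p(i, t): average over the k nearest neighbors from the same species."""
    idx = np.arange(len(species))
    own = idx[(species == species[i]) & (idx != i)]  # assumption: exclude gene i
    return knn_scores(sim[i], annotations, own, k) / k


def predict_interspecies(i, sim, annotations, species, k1, k2):
    """p(i, t): k1 neighbors from the own species plus k2 from the other one."""
    idx = np.arange(len(species))
    own = idx[(species == species[i]) & (idx != i)]
    other = idx[species != species[i]]
    total = (knn_scores(sim[i], annotations, own, k1)
             + knn_scores(sim[i], annotations, other, k2))
    return total / (k1 + k2)


# Hypothetical toy data: 3 human genes, 2 mouse genes, 3 GO terms.
species = np.array(["human", "human", "human", "mouse", "mouse"])
annotations = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
])
sim = np.eye(5)
sim[0, 1:] = [0.8, 0.1, 0.6, 0.2]  # similarities of gene 0 to the others
sim[1:, 0] = sim[0, 1:]

intra = predict_intraspecies(0, sim, annotations, species, k=1)
inter = predict_interspecies(0, sim, annotations, species, k1=1, k2=1)
```

For gene 0, the intraspecies rule uses only its most similar human gene, while the interspecies rule also draws on its most similar mouse gene, which contributes evidence for the third term.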

SLIDE 13

Experiments

Datasets

• History: historical GOA file (archived date: 2014-01-20)
• Recent: recent GOA file (archived date: 2016-01-04)

Investigations:

• The improvement brought by interspecies gene function prediction
• The impact of semantic similarities
• The influence of homology between species
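
The slide does not spell out how the two GOA releases are used. One plausible reading, assumed here, is that predictions are made from the historical file and checked against annotations that appear only in the recent file. The sketch below computes those newly added annotations; the gene and term identifiers are hypothetical placeholders.

```python
def newly_added(history, recent):
    """Annotations present in the recent release but absent from the historical one.

    Both arguments map a gene identifier to a set of GO term identifiers.
    """
    added = {}
    for gene, recent_terms in recent.items():
        new_terms = recent_terms - history.get(gene, set())
        if new_terms:
            added[gene] = new_terms
    return added


# Hypothetical toy releases (placeholder identifiers, not real GOA content).
history_goa = {"geneA": {"t1", "t2"}, "geneB": {"t3"}}
recent_goa = {"geneA": {"t1", "t2", "t4"}, "geneB": {"t3"}, "geneC": {"t5"}}

targets = newly_added(history_goa, recent_goa)
```

Genes with no new annotations are left out of the result, so `targets` lists only the genes that could count as hits in such an evaluation.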

SLIDE 14

Experiments

Results on archived annotations using TO in the CC branch.

• H→H uses GO annotations of human genes to predict missing annotations of human genes.
• M→H uses annotations of mouse genes to predict missing annotations of human genes.
• M+H→H uses annotations of both mouse and human genes to predict missing annotations of human genes.

SLIDE 15

Experiments

Results on archived annotations using GIC in the CC branch.

Observations:

• Interspecies gene function prediction works better than single-species prediction.
• A species with high homology contributes more than one with low homology.
• The choice of semantic similarity metric does not affect the above observations and results.

SLIDE 16

Experiments

Combining annotations in CC, MF and BP together

Observations:

• Interspecies gene function prediction still brings an improvement, but a smaller one than within a single branch.
• The BP, CC and MF branches provide functional clues for one another.

SLIDE 17

Experiments

Results on simulated missing GO annotations

$q$ is the number of simulated missing annotations of a gene.

Observation:

• Interspecies gene function prediction yields more pronounced improvements on simulated missing annotations.
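
A minimal sketch of how missing annotations could be simulated: for each gene, a fixed number of its existing annotations is hidden at random, and the hidden ones serve as prediction targets. The function name, the parameter name `q`, the fixed seed, and the choice to hide all annotations of genes that have `q` or fewer are assumptions for illustration; the slide does not describe the masking procedure.

```python
import numpy as np


def mask_annotations(annotations, q, seed=0):
    """Hide up to q randomly chosen annotations per gene.

    Returns the masked matrix and a matrix marking the hidden entries.
    """
    rng = np.random.default_rng(seed)
    masked = np.array(annotations, copy=True)
    hidden = np.zeros_like(masked)
    for i in range(masked.shape[0]):
        present = np.flatnonzero(masked[i])
        n_hide = min(q, len(present))  # genes with <= q annotations lose all
        if n_hide == 0:
            continue
        drop = rng.choice(present, size=n_hide, replace=False)
        masked[i, drop] = 0
        hidden[i, drop] = 1
    return masked, hidden


# Hypothetical toy gene-term matrix.
toy_annotations = np.array([
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])
masked, hidden = mask_annotations(toy_annotations, q=2)
```

Adding `masked` and `hidden` recovers the original matrix, which makes it easy to score predictions against exactly the entries that were removed.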

SLIDE 18

Conclusions

• Interspecies gene function prediction using GO annotations of two species with high homology performs better than prediction using a single species or two species without such high homology.
• GO annotations of two homologous species are complementary to each other.
• Future work: combine semantic similarity with other biological data for interspecies gene function prediction.

SLIDE 19

Any questions? Thanks!