Phylogenomic inference, Hauptseminar Frishman WS2013/2014, Uli Köhler



SLIDE 2

Phylogenomic inference

Hauptseminar Frishman WS2013/2014 Uli Köhler February 3rd 2014

Slide 2 of 27

SLIDE 3

Structure of this talk

◮ Issues of non-phylogenomic function prediction
◮ What is phylogenomic inference?
◮ Phylogenetic tree reconciliation
◮ Phylogenomic inference methodology
◮ Phylogenomic databases and algorithms:
  ◮ SIFTER
  ◮ PhyloFacts
◮ Common problems of phylogenomic predictions
◮ Future of phylogenomics
◮ Seminar conclusion

SLIDE 4

Non-phylogenomic function prediction

◮ High-throughput sequencing
  → Many proteins, little information available: ~90000 PDB structures vs. 5.1 × 10^6 UniProt/TrEMBL sequences
◮ Alignment score does not distinguish between matching domains
◮ Difficult to separate orthologs and paralogs

SLIDE 5

What is phylogenomic inference? I

Phylogenomic inference: evolutionary relationships (phylogenetics) + genome analysis → inferred function

SLIDE 6

What is phylogenomic inference? II

◮ Concept to enhance homology-based function predictions
◮ Can be applied to both genes and proteins
◮ Attempt to separate orthologs and paralogs
  → ortholog = high probability of similar or identical function
◮ Phylogenetic tree reconciliation: identify speciation and duplication events in phylogenetic trees

SLIDE 7

Tree reconciliation

[Figure: gene tree with leaves A, B and C]

Are B and C orthologs or paralogs with respect to A?

SLIDE 8

Tree reconciliation

[Figure: gene tree with leaves A, B and C]

Duplication or speciation?

SLIDE 9

Tree reconciliation

[Figure: gene tree with leaves A, B and C; internal nodes labelled "Speciation" and "Duplication"]

(Example) B: ortholog, C: paralog
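The example can be written down as a small sketch. It assumes the usual definition that two genes are orthologs if their last common ancestor in the reconciled gene tree is a speciation node, and paralogs if it is a duplication node. The tree shape below (duplication at the root, speciation above A and B) is an assumption chosen to reproduce the labels on the slide; the function names are illustrative.

```python
# Hypothetical reconciled gene tree: (event, left_child, right_child).
# Leaves are plain gene names. The shape is assumed, not taken from the slide image.
TREE = ("duplication", ("speciation", "A", "B"), "C")


def leaves(node):
    """Return the set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _event, left, right = node
    return leaves(left) | leaves(right)


def lca_event(node, gene_x, gene_y):
    """Return the event label at the last common ancestor of two distinct genes."""
    event, left, right = node
    for child in (left, right):
        # Descend while a single inner child still contains both genes.
        if not isinstance(child, str) and {gene_x, gene_y} <= leaves(child):
            return lca_event(child, gene_x, gene_y)
    return event


def relation(tree, gene_x, gene_y):
    """Orthologs diverged at a speciation event, paralogs at a duplication."""
    if lca_event(tree, gene_x, gene_y) == "speciation":
        return "ortholog"
    return "paralog"


if __name__ == "__main__":
    for other in ("B", "C"):
        print(other, "is a/an", relation(TREE, "A", other), "of A")
```

The hard part in practice is obtaining the event labels, which is what tree reconciliation is for; the classification itself is a simple lookup once the tree is reconciled.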

SLIDE 10

Phylogenomic inference methodology I

1. Cluster homologous proteins
2. Compute multiple alignment
3. Edit alignment (remove potential non-homologs)
4. Mask less-conserved regions in alignment
5. Construct phylogenetic tree
6. Identify closely related subtrees
7. Overlay with experimental data
8. Differentiate orthologs and paralogs (tree reconciliation)
9. Infer function from orthologs

SLIDE 11

Phylogenomic inference methodology II

1. Cluster homologous proteins
2. Compute multiple alignment
3. Edit alignment
4. Mask less-conserved regions in alignment

◮ Raw alignments would introduce noise
◮ Retain only high-scoring homologs and highly conserved domains
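A minimal sketch of step 4, assuming a simple conservation rule: keep a column only if its most common residue occurs in at least a given fraction of the sequences. The threshold, the gap character and the function name are assumptions for illustration; real masking tools use more refined criteria.

```python
from collections import Counter


def mask_alignment(rows, min_conservation=0.75, gap="-"):
    """Keep only the alignment columns that pass the conservation threshold."""
    if not rows:
        return []
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")

    keep = []
    for col in range(length):
        column = [row[col] for row in rows]
        residues = [c for c in column if c != gap]
        if not residues:
            continue  # gap-only column carries no signal
        top_count = Counter(residues).most_common(1)[0][1]
        # Gaps count against conservation because we divide by all rows.
        if top_count / len(column) >= min_conservation:
            keep.append(col)
    return ["".join(row[c] for c in keep) for row in rows]


if __name__ == "__main__":
    alignment = ["MKVLA", "MKILA", "MR-LA", "MKVQA"]  # toy data
    for row in mask_alignment(alignment):
        print(row)
```

In this toy alignment the third column (V, I, gap, V) falls below the threshold and is dropped, while the other four columns are retained.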

SLIDE 12

Phylogenomic inference methodology III

5. Construct phylogenetic tree

◮ Core problems:
  ◮ No information about actual ancestors is available
  ◮ High computational complexity (finding the optimal solution is NP-hard!)
◮ Use algorithms like maximum parsimony or maximum likelihood
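To give a feeling for the size of the search space: for n labelled leaves there are (2n - 5)!! distinct unrooted binary tree topologies, so scoring every candidate tree becomes infeasible very quickly. The sketch below only counts topologies; it illustrates why exhaustive search is ruled out and is not itself a proof of NP-hardness. The function name is illustrative.

```python
def unrooted_topologies(n_leaves):
    """Number of unrooted binary topologies for n labelled leaves: (2n - 5)!!"""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Double factorial: multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_leaves - 5 + 1, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20, 50):
        print(n, "leaves:", unrooted_topologies(n), "topologies")
```

Ten sequences already give about two million topologies, which is why heuristic search is used in practice.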

SLIDE 13

Phylogenomic inference methodology IV

6. Identify closely related subtrees
7. Overlay with experimental data

◮ More filtering to reduce noise
◮ Given the tree topology, use only closely related subgroups (in addition to filtering distant homologs in step 1)

SLIDE 14

Phylogenomic inference methodology V

8. Differentiate orthologs and paralogs

◮ Computational tree reconciliation, examples:
  ◮ NCBI COG DB: bidirectional top BLAST hits
  ◮ Complex statistical algorithms like RIO (Resampled Inference of Orthologs), Orthostrapper or BETE
◮ Computationally intensive, requires highly filtered input data
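The bidirectional-top-hit idea can be sketched in a few lines. This is only the reciprocal best hit principle, not the actual COG construction procedure. The score tables are hypothetical toy data; in practice they would come from all-against-all BLAST runs between two genomes.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject."""
    return {query: max(hits, key=hits.get) for query, hits in scores.items() if hits}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Hypothetical bit scores: genome A proteins searched against genome B, and back.
    a_vs_b = {"a1": {"b1": 250.0, "b2": 90.0}, "a2": {"b1": 120.0, "b2": 80.0}}
    b_vs_a = {"b1": {"a1": 245.0, "a2": 118.0}, "b2": {"a1": 88.0, "a2": 79.0}}
    print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

Here only a1 and b1 pick each other; a2 also prefers b1, but b1 prefers a1, so a2 is left without a putative ortholog.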

SLIDE 15

SIFTER

9. Infer function from orthologs

◮ Statistical Inference of Function Through Evolutionary Relationships
◮ Predicts protein function (homology-based) given a reconciled tree
  → Tree construction & reconciliation remain a problem
◮ Based on Bayesian statistics
◮ Complex mathematics (not shown here)

SLIDE 16

PhyloFacts I

◮ "Encyclopedia" of "books" for known protein (super)families and structural domains
◮ 92800 families (as of 2013-02-03)
◮ Precomputed phylogenetic trees & phylogenomic family HMMs
  → Reasonably fast, but "Some results can take hours to complete"
◮ Provides structured access to annotated phylogenomic information about protein (super)families

SLIDE 17

PhyloFacts II

◮ FAT-CAT: PhyloFacts web service to predict protein function using phylogenomic methods
◮ Integrates with Pfam and uses HMMs to find the sequence position in the precomputed tree

SLIDE 18

PhyloFacts III

SLIDE 19

Issues of phylogenomic methods I

Legend: in silico vs. involves manual steps

1. Cluster homologous proteins
2. Compute multiple alignment
3. Edit alignment
4. Mask less-conserved regions in alignment
5. Construct phylogenetic tree
6. Identify closely related subtrees
7. Overlay with experimental data
8. Differentiate orthologs and paralogs
9. Infer function from orthologs

SLIDE 20

Issues of phylogenomic methods II

1. Cluster homologous proteins
2. Compute multiple alignment
3. Edit alignment
4. Mask less-conserved regions in alignment

◮ Manual annotation & selection
  → Subjective, error-prone, time- and cost-intensive
◮ Information will be lost: does the annotator just select what they want to see?
◮ Algorithms are too sensitive: are the results always reliable?

SLIDE 21

Issues of phylogenomic methods III

5. Construct phylogenetic tree

◮ Distance-based vs. character-based construction algorithms
◮ Small, highly conserved protein families perform better than large (super)families
◮ Lack of consistency across methods
◮ Algorithms scale poorly → can't be used for large (super)families
◮ Some methods produce millions of equivalently scored topologies
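Tied topologies are easy to reproduce with a character-based method. The sketch below scores trees with Fitch's small-parsimony algorithm (minimum number of state changes per alignment column) and shows that, for columns that do not discriminate between groupings, all three topologies of four sequences get the same score. Trees are written as nested pairs of leaf names; the toy columns and function names are assumptions for illustration.

```python
def fitch(node, states):
    """Return (possible ancestral states, minimum changes) for one column."""
    if isinstance(node, str):
        return {states[node]}, 0
    left_set, left_cost = fitch(node[0], states)
    right_set, right_cost = fitch(node[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one extra change is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, columns):
    """Sum the Fitch change counts over all alignment columns."""
    return sum(fitch(tree, column)[1] for column in columns)


if __name__ == "__main__":
    # The three possible topologies for four leaves.
    trees = {
        "((A,B),(C,D))": (("A", "B"), ("C", "D")),
        "((A,C),(B,D))": (("A", "C"), ("B", "D")),
        "((A,D),(B,C))": (("A", "D"), ("B", "C")),
    }
    # Toy columns: one constant, one where only D differs.
    columns = [
        {"A": "C", "B": "C", "C": "C", "D": "C"},
        {"A": "G", "B": "G", "C": "G", "D": "T"},
    ]
    for name, tree in trees.items():
        print(name, parsimony_score(tree, columns))
```

With only such uninformative columns the data cannot separate the candidates, and the number of tied trees grows with the number of sequences.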

SLIDE 22

Issues of phylogenomic methods IV

7. Overlay with experimental data

◮ Database = experimental data + inferred data
◮ Experimental datasets available ↔ protein function already known
◮ Protein function unknown ↔ few experimental datasets available

SLIDE 23

Issues of phylogenomic methods V

◮ Multiple subsequent filter passes
◮ Huge sets of parameters, impossible to select optimal values
◮ Requires manual annotation & experimental data
◮ Sometimes even orthology is not sufficient for annotation transfer
◮ Doesn't work well with distant homologs, requires highly conserved domains

SLIDE 25

Future of phylogenomic inference

◮ Phylogenomics alone has too many problems and open questions, but...
◮ ...together with other concepts, functional prediction accuracy can be enhanced
◮ Computational complexity: Moore's law and alternative computational hardware
  → Large-scale application feasible in the future?
◮ Phylogenomic inference for DB verification
◮ Can also be applied to other attributes (besides protein function)
◮ PhyloFacts & SIFTER: usable tools, but apparently not widely adopted or actively developed

SLIDE 27

Conclusion (Phylogenomic inference)

◮ Powerful concept for enhancing function prediction accuracy by identifying orthologs
◮ ... if it actually worked in practice
◮ Too complex, too manual, too many parameters
◮ Pure in silico phylogenomics
  → Low-quality results
◮ Manual annotation can't keep up with HTS
◮ PhyloFacts provides a useful database for function prediction using phylogenomic approaches

SLIDE 28

Conclusion (Seminar)

◮ In silico protein function inference is an as yet unsolved problem in computational biology
◮ Combine any information that is available, including:
  ◮ Context-based prediction
  ◮ Alternative splicing
  ◮ SNPs
  ◮ Phylogenomics
  ◮ Experimental results
◮ Only with all this information combined is sufficient accuracy for in silico function prediction achievable

SLIDE 29

References

Kimmen Sjölander: Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics, 2004
Barbara E. Engelhardt et al.: Protein Molecular Function Prediction by Bayesian Phylogenomics. PLoS Computational Biology, 2005
Jonathan A. Eisen & Claire M. Fraser: Phylogenomics: Intersection of Evolution and Genomics. Science, 2003
Duncan Brown, Kimmen Sjölander: Functional Classification using Phylogenomic Inference. PLoS Computational Biology, 2006
Nandini Krishnamurthy et al.: PhyloFacts: an online structural phylogenomic encyclopedia for protein functional and structural classification. Genome Biology, 2006
Barbara E. Engelhardt et al.: A graphical model for predicting protein molecular function. Proceedings of the International Conference on Machine Learning (ICML), 2006

SLIDE 30

Web & image sources

http://phylogenomics.berkeley.edu/

SLIDE 31

Thank you for your attention!

References and sources available at https://github.com/ulikoehler/Hauptseminar

Questions?
