

slide-1
SLIDE 1

BRAKER2: Incorporating Protein Homology Information into Gene Prediction with GeneMark-EP and AUGUSTUS Katharina J. Hoff, Alexandre Lomsadze, Mario Stanke, Mark Borodovsky Gene prediction BRAKER1: RNAseq BRAKER2: proteins

Short evolutionary distance Long evolutionary distance

Summary References


BRAKER2: Incorporating Protein Homology Information into Gene Prediction with GeneMark-EP and AUGUSTUS

A pipeline for fully automated training and prediction Plant and Animal Genomes XXVI, January 14th 2018 Katharina J. Hoff, Alexandre Lomsadze, Mario Stanke, Mark Borodovsky Presenting author: katharina.hoff@uni-greifswald.de

slide-2
SLIDE 2


Contents

1 Gene prediction
2 BRAKER1: RNAseq
3 BRAKER2: proteins
  • Short evolutionary distance
  • Long evolutionary distance
4 Summary
5 References

slide-3
SLIDE 3

Structural genome annotation problem

Input

  • genome assembly
  • extrinsic evidence, e.g. from RNAseq, protein database

Output

  • protein-coding genes: exon-intron structures (.gff)

Example (from Chr I in C. elegans)
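
To make "exon-intron structures" concrete, here is a minimal sketch of how introns could be derived from the CDS features of a GTF-style annotation. The function name, the toy records and the assumption that every CDS line carries a `transcript_id "..."` attribute are illustrative and not taken from the talk.

```python
from collections import defaultdict


def introns_from_gtf(gtf_text):
    """Return {transcript_id: [(intron_start, intron_end), ...]} from CDS features.

    Assumes tab-separated GTF lines whose 9th column contains transcript_id "...".
    Coordinates are 1-based and inclusive, as in GFF/GTF.
    """
    cds = defaultdict(list)
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        start, end = int(fields[3]), int(fields[4])
        transcript_id = fields[8].split('transcript_id "')[1].split('"')[0]
        cds[transcript_id].append((start, end))

    introns = {}
    for transcript_id, segments in cds.items():
        segments.sort()
        # An intron is the gap between two consecutive CDS segments.
        introns[transcript_id] = [
            (left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(segments, segments[1:])
        ]
    return introns


# Toy annotation: one transcript with three CDS segments (hypothetical coordinates).
example = "\n".join([
    'chrI\tdemo\tCDS\t100\t200\t.\t+\t0\tgene_id "g1"; transcript_id "g1.t1";',
    'chrI\tdemo\tCDS\t301\t400\t.\t+\t1\tgene_id "g1"; transcript_id "g1.t1";',
    'chrI\tdemo\tCDS\t551\t600\t.\t+\t0\tgene_id "g1"; transcript_id "g1.t1";',
])
print(introns_from_gtf(example))  # {'g1.t1': [(201, 300), (401, 550)]}
```

The same intron coordinates are what the later slides compare between predictions, protein mapping and RNAseq mapping.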

slide-4
SLIDE 4


BRAKER1: RNAseq integration

  • >4000 downloads
  • 73 citations since 2016 (Google Scholar)
slide-5
SLIDE 5


BRAKER1: RNAseq integration

genome.fa + RNAseq.bam → GeneMark-ET → genemark.gtf → AUGUSTUS training → AUGUSTUS prediction → augustus.gtf

slide-6
SLIDE 6


BRAKER2: Part I - proteins of closely related species

genome.fa + protein.fa → GenomeThreader → AUGUSTUS training → AUGUSTUS prediction → augustus.gtf

slide-7
SLIDE 7


Drosophila melanogaster and relatives

For a given species,

  • the average number of mutations per genomic site was computed from alignments of ortholog gene sequences (including introns).
  • the protein identity was computed as the average of the identity values of the best exonerate hit found for each protein of this species against the D. melanogaster genome.

[Figure: average protein identity (axis 0.75 to 0.95) plotted against average mutations per genomic site (axis 0.0 to 1.2) for dsim, dere, dana, dpse, dwil, dvir and dgri. Image: S. König, L. Romoth, M. Stanke (2018) Comparative Genome Annotation]
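
The protein identity measure described on this slide can be sketched as follows. This is only an illustration: it assumes each hit is available as a (protein, alignment score, identity) triple and that "best hit" means the highest-scoring hit per protein. The slide does not state how the best exonerate hit was chosen, and the toy values are made up.

```python
def average_best_hit_identity(hits):
    """Average, over proteins, of the identity of each protein's best-scoring hit.

    hits: iterable of (protein_id, alignment_score, identity) tuples;
    a protein may appear several times with different hits.
    """
    best = {}  # protein_id -> (score, identity) of the best-scoring hit so far
    for protein_id, score, identity in hits:
        if protein_id not in best or score > best[protein_id][0]:
            best[protein_id] = (score, identity)
    if not best:
        raise ValueError("no hits given")
    return sum(identity for _, identity in best.values()) / len(best)


# Hypothetical hits: p1 has two hits, and only the higher-scoring one counts.
toy_hits = [
    ("p1", 500, 0.90),
    ("p1", 300, 0.95),
    ("p2", 410, 0.80),
    ("p3", 120, 0.85),
]
mean_identity = average_best_hit_identity(toy_hits)
print(round(mean_identity, 3))
```

With these toy values the best hits have identities 0.90, 0.80 and 0.85, so the average is about 0.85.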

slide-8
SLIDE 8


Increasing evolutionary distance leads to decreasing gene prediction accuracy of AUGUSTUS

AUGUSTUS ab initio prediction

[Figure: Gene F1 (axis 40 to 70) for dsim, dere, dana, dpse, dwil, dvir, dgri and drm5; series: BRAKER2 GenomeThreader training, expert training, BRAKER1 RNAseq training]
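
The slides report gene-level accuracy as F1 without giving the formula. A minimal sketch, assuming the usual gene-finding definitions (sensitivity = correct genes / annotated genes, specificity = correct genes / predicted genes, F1 = their harmonic mean); the counts below are invented for illustration and are not results from the talk.

```python
def f1_score(sensitivity, specificity):
    """Harmonic mean of sensitivity and specificity (0.0 if both are zero)."""
    if sensitivity + specificity == 0:
        return 0.0
    return 2 * sensitivity * specificity / (sensitivity + specificity)


def gene_level_accuracy(true_positives, n_annotated, n_predicted):
    """Return (sensitivity, specificity, F1) from gene counts."""
    sensitivity = true_positives / n_annotated
    specificity = true_positives / n_predicted
    return sensitivity, specificity, f1_score(sensitivity, specificity)


# Hypothetical counts: 60 correctly predicted genes, 100 annotated, 80 predicted.
sn, sp, f1 = gene_level_accuracy(true_positives=60, n_annotated=100, n_predicted=80)
print(f"Sn={sn:.2f} Sp={sp:.2f} F1={f1:.3f}")  # Sn=0.60 Sp=0.75 F1=0.667
```

Because F1 is a harmonic mean, it drops quickly when either sensitivity or specificity is low.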

slide-9
SLIDE 9


Increasing evolutionary distance leads to decreasing gene prediction accuracy of AUGUSTUS

AUGUSTUS prediction with training set hints

[Figure: Gene F1 (axis 40 to 70) for dsim, dere, dana, dpse, dwil, dvir, dgri and drm5; series: BRAKER2 GenomeThreader training, BRAKER1 RNAseq training]

slide-10
SLIDE 10


Increasing evolutionary distance leads to decreasing gene prediction accuracy of AUGUSTUS

With increasing distance between query protein and target genome, spliced alignments become

  • less sensitive while keeping a constant level of specificity (e.g. GenomeThreader),
  • or both less sensitive and less specific (e.g. Exonerate).

Therefore, training AUGUSTUS on spliced alignments is suitable only when a very closely related query species is available.

slide-11
SLIDE 11


BRAKER2: Part II - proteins of more remote species

“Standard mapping approach”: proteins to genome

genome.fa + proteins.fa → GenomeThreader → CDS, introns, starts, stops (protein.hints)

→ works well only for closely related species
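
As a rough illustration of what one entry in protein.hints might look like, the sketch below formats a single intron as a tab-separated, nine-column GFF-style line. The column layout, the `src=P` attribute and the `sketch` program name are assumptions based on the common AUGUSTUS-style hints convention. The slides do not show the file contents, so check the details against the AUGUSTUS documentation before relying on them.

```python
def intron_hint_line(seqname, start, end, strand, source="P", program="sketch"):
    """Format one intron as a GFF-style hint line (9 tab-separated columns).

    The 'src=<source>' attribute and the column order are assumptions,
    not taken from the talk.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    if strand not in {"+", "-", "."}:
        raise ValueError("strand must be '+', '-' or '.'")
    columns = [
        seqname,        # sequence (contig or chromosome) name
        program,        # program that produced the hint (hypothetical label)
        "intron",       # hint type
        str(start),     # 1-based inclusive start
        str(end),       # 1-based inclusive end
        ".",            # score (unused here)
        strand,
        ".",            # frame (not applicable to introns)
        f"src={source}",
    ]
    return "\t".join(columns)


print(intron_hint_line("chrI", 201, 300, "+"))
```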

slide-12
SLIDE 12


BRAKER2: Part II - proteins of more remote species

GeneMark-EP protein mapping pipeline:

  • genome.fa → GeneMark-ES → genemark.gtf → predicted proteins
  • predicted proteins + database of orthologous gene clusters (proteins) → BlastP → “hits”
  • For each “hit”: predicted gene nucleotide sequence (seed) → ProSplign → introns (protein.hints)

braker.pl:

  • protein.hints → GeneMark-EP → genemark.gtf → AUGUSTUS training → AUGUSTUS prediction → augustus.gtf

slide-13
SLIDE 13


Protein database for gene prediction in D. melanogaster

Insect portion of EggNOG (inNOG) excluding Drosophila species

  • Acyrthosiphon pisum
  • Aedes aegypti
  • Anopheles darlingi
  • Anopheles gambiae
  • Apis mellifera
  • Atta cephalotes
  • Bombyx mori
  • Culex quinquefasciatus
  • Danaus plexippus
  • Heliconius melpomene
  • Nasonia vitripennis
  • Pediculus humanus
  • Tribolium castaneum
slide-14
SLIDE 14


Intron recovery from protein mapping

Protein mapping with EggNOG (inNOG), Drosophila species excluded

  • 30,996 introns predicted
  • 21,843 matched introns in the CDS part of the annotated genes

[Figure: sensitivity and specificity (%, axis 30 to 90) of introns in CDS for protein mapping and RNAseq mapping]

Mapping of proteins from remote species recovers ∼45% of introns with specificity of ∼70%.
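
The ∼70% specificity follows directly from the two intron counts on this slide. A short check, assuming specificity here means matched introns divided by predicted introns:

```python
predicted_introns = 30996  # introns predicted from protein mapping (from the slide)
matched_introns = 21843    # of those, introns matching the CDS part of annotated genes

specificity = matched_introns / predicted_introns
print(f"intron specificity: {specificity:.1%}")  # intron specificity: 70.5%
```

The sensitivity of ∼45% cannot be recomputed the same way, because the total number of annotated introns is not given on the slide.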

slide-15
SLIDE 15


Intron recovery from protein mapping

Protein mapping with some Drosophila species present as external evidence

  • no_Dro: no Drosophila species
  • w_gvw: with D. grimshawi, D. virilis, D. willistoni
  • w_gvwpa: with D. grimshawi, D. virilis, D. willistoni, D. pseudoobscura, D. ananassae

[Figure: sensitivity and specificity (%, axis 30 to 90) of introns in CDS for no_Dro, w_gvw, w_gvwpa and RNAseq]

→ more introns were detected
→ performance of protein mapping with the addition of 5 fly proteomes came closer to the performance with RNAseq external evidence

slide-16
SLIDE 16


Accuracy of GeneMark-EX with different sources of evidence

  • results are on softmasked genome (strongly recommended!)

[Figure, two panels: exon prediction accuracy and introns in CDS prediction accuracy; sensitivity and specificity (%, axis 60 to 90) for ES, EP−no_Dro, ET−RNAseq and Ideal]

  • GeneMark-EP and GeneMark-ET outperformed GeneMark-ES
  • GeneMark-EP with “remote” proteins was comparable with GeneMark-ET
  • GeneMark-EP and GeneMark-ET came close to the best possible performance, as measured against training with “ideal” introns
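
The results on this slide were obtained on a softmasked genome. By convention, softmasking writes repeat regions in lowercase while leaving the rest of the sequence in uppercase. The sketch below is a quick sanity check that a FASTA file is softmasked at all, by measuring the fraction of lowercase bases. The function name and the toy sequence are illustrative only.

```python
def softmasked_fraction(fasta_text):
    """Fraction of sequence characters that are lowercase (softmasked)."""
    masked = 0
    total = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue  # skip FASTA header lines
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0


# Toy sequence: 24 bases, 12 of them lowercase.
toy_fasta = ">chr_demo\nACGTacgtacgtACGT\nNNNNacgt\n"
print(softmasked_fraction(toy_fasta))  # 0.5
```

A fraction of exactly 0.0 on a real eukaryotic assembly would suggest the genome has not been softmasked.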

slide-17
SLIDE 17

Accuracy of BRAKER2

[Figure: gene prediction accuracy (F1, axis 45 to 65); panels: BRAKER1 (RNAseq), BRAKER2 (no_Dro), BRAKER2 (w_gvw), BRAKER2 (w_gvwpa)]

slide-18
SLIDE 18


Summary

  • BRAKER2 is a novel, fully automatic pipeline that predicts genes in eukaryotic genomes using RNAseq or protein external evidence.
  • Training in BRAKER2 is done by GeneMark-EX, which in particular can use remote proteins as external evidence.
  • Prediction in BRAKER2 is done by AUGUSTUS using RNAseq or proteins as hints.

slide-19
SLIDE 19


Ongoing & future work

  • Optimization of evidence integration in BRAKER2
  • Combining RNAseq and protein information
  • UTR training & integration of RNAseq coverage information

slide-20
SLIDE 20


References

  • Hoff, Katharina J., et al. “BRAKER1: unsupervised RNAseq-based genome annotation with GeneMark-ET and AUGUSTUS.” Bioinformatics 32.5 (2015): 767-769.
  • Stanke, Mario, et al. “Using native and syntenically mapped cDNA alignments to improve de novo gene finding.” Bioinformatics 24.5 (2008): 637-644.
  • Lomsadze, Alexandre, Paul D. Burns, and Mark Borodovsky. “Integration of mapped RNAseq reads into automatic training of eukaryotic gene finding algorithm.” Nucleic acids research 42.15 (2014): e119-e119.
  • Slater, Guy St C., and Ewan Birney. “Automated generation of heuristics for biological sequence comparison.” BMC bioinformatics 6.1 (2005): 31.
  • Gremme, Gordon. “GenomeThreader Gene Prediction Software.” (2014).
  • Dobin, Alexander, et al. “STAR: ultrafast universal RNA-seq aligner.” Bioinformatics 29.1 (2013): 15-21.

BRAKER2 is available for download at

  • http://bioinf.uni-greifswald.de
  • http://exon.gatech.edu
slide-21
SLIDE 21


State of the art: BRAKER with RNAseq & proteins

Close homology

genome.fa + RNAseq.bam → GeneMark-ET → genemark.gtf → AUGUSTUS training → AUGUSTUS prediction → augustus.gtf, with protein.fa → GenomeThreader as additional protein evidence

slide-22
SLIDE 22


State of the art: BRAKER with RNAseq & proteins

AUGUSTUS ab initio prediction

[Figure: Gene F1 (axis 40 to 80) for dsim, dere, dana, dpse, dwil, dvir, dgri and drm5; series: BRAKER2 GenomeThreader training, BRAKER2 GenomeThreader & RNAseq training, expert training, BRAKER1 RNAseq training]

slide-23
SLIDE 23


State of the art: BRAKER with RNAseq & proteins

AUGUSTUS prediction with training set hints

[Figure: Gene F1 (axis 40 to 80) for dsim, dere, dana, dpse, dwil, dvir, dgri and drm5; series: BRAKER2 GenomeThreader training, BRAKER2 GenomeThreader & RNAseq training, BRAKER1 RNAseq training]

slide-24
SLIDE 24


State of the art: BRAKER with RNAseq & proteins

Remote homology

genome.fa + protein.hints + RNAseq.bam → GeneMark-ETP → genemark.gtf → AUGUSTUS training → AUGUSTUS prediction → augustus.gtf

slide-25
SLIDE 25


State of the art: BRAKER with RNAseq & proteins

[Figure: gene prediction accuracy (axis 45 to 65) of GeneMark, AUGUSTUS ab initio and AUGUSTUS hints; panels: BRAKER1 (RNAseq), BRAKER2 (noDro), BRAKER2 (noDro+RNAseq)]