SLIDE 1

A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling

Ari Ugarte, Riccardo Vicedomini, Juliana Silva Bernardes, Alessandra Carbone 9 September, 2018

Laboratory of Computational and Quantitative Biology (LCQB) - Sorbonne University

SLIDE 2

Introduction

SLIDE 3

Metagenomic analysis workflow

Functional profiling: identifying protein domains in metagenomic (MG) and metatranscriptomic (MT) data makes it possible to study which main functions are expressed in a specific environment.

SLIDE 4

Protein domains

Size of individual structural domains:

  • from 36 aa to 692 aa
  • most of them have < 200 residues
  • average of ∼ 100 residues

SLIDE 5

Domain identification in short MG/MT reads

  • Very short fragments (150 - 300 bp)

Two possible approaches:

  • Assembly-based (e.g., HMM-GRASPx)
  • Direct read annotation (e.g., MetaCLADE, UProC)
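
A quick back-of-the-envelope check of why such reads are hard to annotate: a read of L bp encodes at most L / 3 residues, so even the longest reads barely span one average-sized domain. The sketch below is only illustrative arithmetic; the function name is hypothetical, and the 100-residue average is the approximate figure quoted on the protein-domain slide.

```python
AVERAGE_DOMAIN_AA = 100  # approximate average domain length quoted earlier in the deck

def max_residues(read_length_bp: int) -> int:
    """Upper bound on the residues encoded by a read (one residue per 3 nucleotides)."""
    return read_length_bp // 3

for bp in (150, 300):
    aa = max_residues(bp)
    print(f"{bp} bp -> at most {aa} aa ({aa / AVERAGE_DOMAIN_AA:.0%} of an average domain)")
```

In practice a read rarely starts at a domain boundary, so the fraction of a domain actually covered is usually smaller than this upper bound.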

SLIDE 6

How can we represent a domain?

  • Multiple sequence alignment (MSA) of homologous domain sequences
    [Figure: sequence logo of the alignment (WebLogo 3.6.0)]

  • Probabilistic representations:
      • position-specific scoring matrices (PSSMs)
      • profile hidden Markov models (pHMMs)
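
To make the PSSM idea concrete, here is a minimal sketch that turns a toy alignment into log-odds scores and uses them to score a fragment. It assumes a uniform background distribution and a fixed pseudocount; both are simplifications chosen for illustration, and the toy alignment and function names are hypothetical. Real tools such as PSI-BLAST use sequence weighting and empirical background frequencies.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(msa, pseudocount=1.0):
    """Return one dict per alignment column with log2-odds scores (uniform background)."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(msa[0])):
        # Count residues in this column; gap characters are ignored.
        counts = Counter(seq[col] for seq in msa if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm

def score(pssm, fragment):
    """Sum the column scores of an ungapped fragment aligned at the first column."""
    return sum(column[aa] for column, aa in zip(pssm, fragment))

toy_msa = ["MKVL", "MRVL", "MKIL", "MKVF"]  # hypothetical aligned domain sequences
pssm = build_pssm(toy_msa)
print(score(pssm, "MKVL") > score(pssm, "GGGG"))
```

A fragment matching the conserved columns scores higher than an unrelated one; a profile HMM extends this idea with explicit insertion and deletion states.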

SLIDE 7

The Pfam database

  • A large collection of protein domain families
  • Each family is represented by two MSAs (Full and Seed) and a profile HMM

SLIDE 8

CLADE (CLoser sequences for Annotations Directed by Evolution)

SLIDE 9

MetaCLADE

SLIDE 10

MetaCLADE - Main features

  • Extends CLADE to handle MG/MT data
  • Puts all domain hits in a two-dimensional space
  • Uses two-dimensional thresholds (defined with a Naive Bayes classifier) to assign confidence values to predictions

  • Provides data visualization of functional annotation

SLIDE 11

MetaCLADE - General overview

[Figure: MetaCLADE workflow]

Pre-computation of the domain-dependent probability space, for each domain Di in CLADE:
  • Sets of positive and negative sequences
  • Identification of domain-specific separating parameters

Hit identification:
  • Input sequences (e.g., MAKLKVANDKA...): predicted CDS/ORFs in MG/MT reads
  • The conservation profiles of the CLADE model library (CCMs and global-consensus models for domains D1, ..., Di, ..., DN) produce domain hits on the input sequence

D1 D1 D2 D2 D2 D8 D6 D11 D2 D3

Domain prediction:
  • 1. Removal of overlapping hits

D1 D2 D8 D6 D11 D2 D3

  • 2. Selection of hits with prob ≥ 0.9

D1 D2 D8 D11 D3

  • 3. Selection of hits with best bit-score and % identity

D2 D8 D11

SLIDE 12

MetaCLADE - Model construction

[Diagram: each Pfam domain Di (D1, ..., DM) has a seed alignment Seedi and a profile HMM pHMMi.]

  • Inherited from CLADE
  • Several models are built in order to represent each known Pfam family
  • Pfam27: 14 831 domains ⇒ 2 389 235 CCMs (PSSMs) and 14 831 pHMMs
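
As a rough sense of scale, the figures above imply about 161 CCMs per domain on average. The snippet is plain arithmetic on the numbers quoted on the slide; the variable names are illustrative only.

```python
n_domains = 14_831    # Pfam27 domains (from the slide)
n_ccms = 2_389_235    # clade-centered models (from the slide)
n_phmms = 14_831      # one profile HMM per domain (from the slide)

ccms_per_domain = n_ccms / n_domains
total_models = n_ccms + n_phmms
print(f"{ccms_per_domain:.1f} CCMs per domain on average, {total_models:,} models in total")
```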

SLIDE 13

MetaCLADE - Model construction

[Diagram, continued: for each Pfam domain Di, sequences Si^1, ..., Si^ni (ni ≤ 350) from Fulli are shown next to Seedi and pHMMi.]

SLIDE 14

MetaCLADE - Model construction

[Diagram, continued: a PSI-BLAST search against NR turns the first sequence Si^1 from Fulli into the model CCMi^1 (ni ≤ 350 sequences per domain).]

SLIDE 15

MetaCLADE - Model construction

[Diagram, continued: PSI-BLAST against NR is applied to every sequence Si^1, ..., Si^ni (ni ≤ 350) from Fulli, giving the models CCMi^1, ..., CCMi^ni.]

SLIDE 16

MetaCLADE - Model construction

[Diagram, complete: the CLADE library combines the global consensus models pHMM1, ..., pHMMi, ..., pHMMM with the clade-centered models CCMi^1, ..., CCMi^ni of every domain Di (i = 1, ..., M; ni ≤ 350). The clade-centered models are obtained with PSI-BLAST against NR from the sequences Si^1, ..., Si^ni of Fulli.]

SLIDE 17

MetaCLADE - General idea

D1 D1 D2 D2 D2 D8 D6 D11 D2 D3

SLIDE 18

MetaCLADE - Training set

The set of positive sequences for a domain Di is based on suffixes, prefixes and random fragments from Seedi.

The set of negative sequences for a domain Di is based on three different methods:

  • 1. 2-mer shuffling
  • 2. sequence reversal
  • 3. Markov model (probabilities based on 4-mers of Seedi)
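
The three generators of negative sequences can be sketched as follows. This is only one possible reading of the slide, not the authors' implementation: "2-mer shuffling" is taken here to mean permuting consecutive non-overlapping 2-mers, and the Markov model is an order-3 chain whose transition counts come from the 4-mers of the seed sequences. The toy sequences and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

def shuffle_2mers(seq, rng):
    """Permute consecutive non-overlapping 2-mers (assumed meaning of '2-mer shuffling')."""
    chunks = [seq[i:i + 2] for i in range(0, len(seq), 2)]
    rng.shuffle(chunks)
    return "".join(chunks)

def reverse(seq):
    """Sequence reversal."""
    return seq[::-1]

def train_markov(seqs, k=4):
    """Count, for every k-mer, how often its last residue follows its first k-1 residues."""
    transitions = defaultdict(Counter)
    for s in seqs:
        for i in range(len(s) - k + 1):
            transitions[s[i:i + k - 1]][s[i + k - 1]] += 1
    return transitions

def sample_markov(transitions, length, rng):
    """Generate a sequence by sampling residues conditioned on the previous k-1 residues."""
    contexts = sorted(transitions)
    out = list(rng.choice(contexts))
    order = len(out)
    while len(out) < length:
        nxt = transitions.get("".join(out[-order:]))
        if not nxt:
            # Unseen context: restart from a randomly chosen known context.
            out.extend(rng.choice(contexts))
            continue
        residues, weights = zip(*sorted(nxt.items()))
        out.append(rng.choices(residues, weights=weights)[0])
    return "".join(out[:length])

rng = random.Random(0)
seed_seqs = ["MAKLKVANDKAMKLVAN", "MAKLRVANDRAMKLVSN"]  # toy stand-ins for Seedi
negatives = [
    shuffle_2mers(seed_seqs[0], rng),
    reverse(seed_seqs[0]),
    sample_markov(train_markov(seed_seqs), 17, rng),
]
print(negatives)
```

All three generators keep the residue composition close to that of the seed sequences while destroying the positional signal that the domain models are meant to detect.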

SLIDE 19

MetaCLADE - Training set

  • Negative sequences are generated until |negative sequences| ≥ ½ |positive sequences|

  • Generation of a two-dimensional space by considering the bit score and the mean bit score of domain hits

SLIDE 20

MetaCLADE - Hit classification

  • A discrete version of a naive Bayes classifier is used to partition the hit space into regions, each with an associated probability
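
A minimal sketch of how a discrete naive Bayes classifier can assign a probability to each cell of a two-dimensional hit space (bit score, mean bit score). The bin width, number of bins, smoothing constant, and toy training hits are all assumptions made for illustration; the actual discretization used by MetaCLADE is not given on the slide.

```python
from collections import Counter

WIDTH = 10.0   # bin width in bits (hypothetical value)
N_BINS = 20    # bins per axis; larger scores fall into the last bin (hypothetical value)
ALPHA = 1.0    # Laplace smoothing constant (hypothetical value)

def to_bin(value):
    """Map a score to its bin index, clipped to the range [0, N_BINS - 1]."""
    return min(max(int(value // WIDTH), 0), N_BINS - 1)

def train(points, labels):
    """Count, for each class and each axis, how many training hits fall into each bin."""
    class_sizes = Counter(labels)
    counts = {c: (Counter(), Counter()) for c in class_sizes}
    for (bit_score, mean_bit_score), c in zip(points, labels):
        counts[c][0][to_bin(bit_score)] += 1
        counts[c][1][to_bin(mean_bit_score)] += 1
    return class_sizes, counts

def prob_positive(model, bit_score, mean_bit_score):
    """Posterior probability of the 'pos' class for the cell containing the hit."""
    class_sizes, counts = model
    total = sum(class_sizes.values())
    joint = {}
    for c, n_c in class_sizes.items():
        p = n_c / total  # class prior
        for axis, value in enumerate((bit_score, mean_bit_score)):
            # Naive Bayes: the two axes are treated as independent given the class.
            p *= (counts[c][axis][to_bin(value)] + ALPHA) / (n_c + ALPHA * N_BINS)
        joint[c] = p
    return joint["pos"] / sum(joint.values())

# Toy training hits: (bit score, mean bit score)
points = [(80, 70), (90, 85), (75, 72), (85, 65), (20, 15), (25, 30), (15, 10), (30, 22)]
labels = ["pos"] * 4 + ["neg"] * 4
model = train(points, labels)
print(prob_positive(model, 82, 71) > prob_positive(model, 22, 18))
```

Because the scores are binned, every hit falling in the same cell receives the same probability, which is what partitions the hit space into regions.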

SLIDE 21

MetaCLADE - Hit filtering and domain prediction

D1 D1 D2 D2 D2 D8 D6 D11 D2 D3

  • 1. Removal of overlapping hits

D1 D2 D8 D6 D11 D2 D3

  • 2. Selection of hits with prob ≥ 0.9

D1 D2 D8 D11 D3

  • 3. Selection of hits with best bit-score and % identity

D2 D8 D11

Final prediction
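
The three filtering steps above can be sketched as a small pipeline. Several details are assumptions, since the slide does not specify them: two hits overlap when their intervals share at least one position, step 1 collapses overlapping hits of the same domain, step 3 resolves the remaining overlaps between different domains, and hits are ranked by bit-score with % identity as a tie-breaker. The `Hit` class and the toy hits are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    domain: str
    start: int        # first covered position on the input sequence
    end: int          # last covered position (inclusive)
    bitscore: float
    identity: float   # percent identity
    prob: float       # probability assigned by the hit-space partition

def overlaps(a, b):
    """True when the two hits share at least one position."""
    return a.start <= b.end and b.start <= a.end

def keep_best(hits, same_domain_only):
    """Greedy filter: visit hits from best to worst and drop those clashing with a kept hit."""
    kept = []
    for h in sorted(hits, key=lambda x: (x.bitscore, x.identity), reverse=True):
        clash = any(overlaps(h, k) and (k.domain == h.domain or not same_domain_only)
                    for k in kept)
        if not clash:
            kept.append(h)
    return kept

def predict(hits, min_prob=0.9):
    step1 = keep_best(hits, same_domain_only=True)    # 1. removal of overlapping hits
    step2 = [h for h in step1 if h.prob >= min_prob]  # 2. prob >= 0.9
    step3 = keep_best(step2, same_domain_only=False)  # 3. best bit-score and % identity
    return sorted(step3, key=lambda h: h.start)

hits = [
    Hit("D1", 1, 50, 60.0, 40.0, 0.95),
    Hit("D1", 5, 55, 40.0, 35.0, 0.92),
    Hit("D2", 40, 90, 80.0, 55.0, 0.97),
    Hit("D6", 100, 140, 30.0, 25.0, 0.60),
    Hit("D8", 150, 200, 70.0, 50.0, 0.93),
]
print([h.domain for h in predict(hits)])  # ['D2', 'D8']
```

In this toy run the weaker D1 hit is removed in step 1, D6 fails the probability threshold in step 2, and the remaining D1 hit loses to the overlapping D2 hit in step 3.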

SLIDE 22

Results

SLIDE 23

MetaCLADE on metatranscriptomics

  • Marine eukaryotic phytoplankton metatranscriptomes
  • 1.5 M high-quality cDNA sequences with an average length of 242 bp

SLIDE 24

MetaCLADE - Functional Annotation

SLIDE 25

MetaCLADE - MetaCLADE vs HMMER (Ion transport)

SLIDE 26

MetaCLADE - Higher resolution

SLIDE 27

MetaCLADE - Comparison with other methods

Guerrero Negro Hypersaline Microbial Mats (GNHM): 100/200 bp

Reads   Tool              TP        FP       FN        TPR (%)   PPV (%)   F-score (%)
100 bp  UProC             336 302   20 258   448 249   42.9      94.3      58.9
100 bp  MetaCLADE         323 009   12 145   461 542   41.2      96.4      57.7
100 bp  HMMGRASP          328 729   37 224   455 822   41.9      89.8      57.1
100 bp  MetaCLADE+UProC   405 734   20 145   378 817   51.7      95.3      67.0
100 bp  UProC+MetaCLADE   406 370   25 965   378 181   51.8      94.0      66.8
200 bp  UProC             264 787   19 060   220 368   54.6      93.3      68.9
200 bp  MetaCLADE         347 936   18 138   137 219   71.7      95.0      81.7
200 bp  HMMGRASP          290 155   37 189   195 000   59.8      88.6      71.4
200 bp  MetaCLADE+UProC   363 667   21 479   121 488   75.0      94.4      83.6
200 bp  UProC+MetaCLADE   364 641   28 444   120 514   75.2      92.8      83.0
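
The TPR, PPV and F-score columns follow from the TP, FP and FN counts through the standard definitions (TPR = TP / (TP + FN), PPV = TP / (TP + FP), F-score = harmonic mean of the two). The short check below recomputes two rows of the table; the function name is illustrative.

```python
def metrics(tp, fp, fn):
    """Return (TPR, PPV, F-score) as percentages rounded to one decimal."""
    tpr = tp / (tp + fn)                    # recall (sensitivity)
    ppv = tp / (tp + fp)                    # precision
    f_score = 2 * tpr * ppv / (tpr + ppv)   # harmonic mean of recall and precision
    return tuple(round(100 * v, 1) for v in (tpr, ppv, f_score))

print(metrics(336_302, 20_258, 448_249))   # UProC, 100 bp     -> (42.9, 94.3, 58.9)
print(metrics(347_936, 18_138, 137_219))   # MetaCLADE, 200 bp -> (71.7, 95.0, 81.7)
```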

[Figure: precision-recall curve (x-axis: recall, from 0 to 0.76; y-axis: precision, from 0.92 to 1.00). UProC (AUC = 0.529), MetaCLADE (AUC = 0.699), MetaCLADE+UProC (AUC = 0.727)]

SLIDE 28

Future improvements

  • More domains and new models for an improved annotation
  • Constructing a library of conserved small motifs
  • Annotation of longer sequences
  • Reduction of the number of redundant models
  • New criteria to filter overlapping hits

SLIDE 29

Conclusions

  • Learning about the functional activity of the community and its sub-communities is a crucial step in understanding species interactions and large-scale environmental impact
  • Functional annotation methods need to be as precise as possible in identifying remote homology
  • MetaCLADE allows for the discovery of patterns in highly divergent sequences
  • The number of unknown sequences will keep growing, hence probabilistic models are expected to play a major role in the annotation of sequences spanning unrepresented sequence spaces

SLIDE 30

Thank you for your attention!

Acknowledgments

  • Ari Ugarte
  • Juliana Silva Bernardes
  • Alessandra Carbone

References

  • A. Ugarte, R. Vicedomini, J.S. Bernardes, A. Carbone, “A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling,” Microbiome, 2018, 6:149.
  • J.S. Bernardes, C. Vaquero, G. Zaverucha, A. Carbone, “Improvement in protein domain identification is reached by breaking consensus, with the agreement of many profiles and domain co-occurrence,” PLoS Computational Biology, 2016, 12(7):e1005038.