A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling
Ari Ugarte, Riccardo Vicedomini, Juliana Silva Bernardes, Alessandra Carbone
9 September 2018
Laboratory of Computational and
Introduction
Metagenomic analysis workflow
Functional profiling
Protein domain identification in MG/MT data makes it possible to study which main functions are expressed in a specific environment.
Protein domains
Size of individual structural domains:
- from 36 aa to 692 aa
- most of them have < 200 residues
- average of ∼ 100 residues
Domain identification in short MG/MT reads
- Very short fragments (150 - 300 bp)
Two possible approaches:
- Assembly-based (e.g., HMM-GRASPx)
- Direct read annotation (e.g., MetaCLADE, UProC)
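A quick calculation shows why such short fragments are difficult: three nucleotides encode one residue, so a read of 150 - 300 bp covers at most 50 - 100 residues, which is at or below the average domain size of about 100 residues. The helper name `max_residues` is hypothetical and only used for this illustration.

```python
def max_residues(read_length_bp: int) -> int:
    """Upper bound on the residues encoded by a read in one reading frame."""
    return read_length_bp // 3  # one codon = three nucleotides


for bp in (150, 300):
    print(f"{bp} bp -> at most {max_residues(bp)} aa")
```

A read therefore usually spans only part of a domain, which is the case direct read annotation has to handle.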
How can we represent a domain?
- Multiple sequence alignment (MSA) of homologous domain sequences
[Figure: sequence logo of the aligned domain positions]
- Probabilistic representations:
- position-specific scoring matrices (PSSMs)
- profile hidden Markov models (pHMMs)
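As an illustration of the first representation, here is a minimal sketch of how a PSSM could be derived from a gap-free MSA. It assumes a uniform background distribution and a fixed pseudocount, which are simplifications chosen for this example and not the scheme used by PSI-BLAST or CLADE; the names `build_pssm` and `score_fragment` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2-odds score.

    Assumptions of this sketch: uniform background frequencies, and any
    non-standard symbol (for example a gap) is simply ignored.
    """
    if len({len(seq) for seq in msa}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*msa):
        counts = Counter(r for r in column if r in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_fragment(pssm, fragment):
    """Sum the column scores of an ungapped fragment aligned at column 0."""
    return sum(column[res] for column, res in zip(pssm, fragment))


toy_msa = ["MAKL", "MARL", "MSKL"]  # toy alignment, not real Pfam data
toy_pssm = build_pssm(toy_msa)
print(score_fragment(toy_pssm, "MAKL") > score_fragment(toy_pssm, "WWWW"))
```

Residues that are frequent in a column receive positive scores and rare ones negative scores, so a fragment resembling the family scores higher than an unrelated one.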
The Pfam database
- A large collection of protein domain families
- Each family is represented by two MSAs (Full and Seed) and a profile HMM
CLADE (CLoser sequences for Annotations Directed by Evolution)
MetaCLADE
MetaCLADE - Main features
- Extends CLADE to handle MG/MT data
- Puts all domain hits in a two-dimensional space
- Uses two-dimensional thresholds (defined with a Naive Bayes classifier) to assign confidence values to predictions
- Provides data visualization of functional annotation
MetaCLADE - General overview
[Figure: MetaCLADE workflow]
- Input: predicted CDS/ORFs in MG/MT reads (input sequence, e.g. MAKLKVANDKA...)
- Domain-dependent probability space pre-computation: for each domain Di in CLADE, sets of positive and negative sequences are used to identify domain-specific separating parameters
- Hit identification: the CLADE model library (CCMs and global-consensus models for D1 ... Di ... DN) provides conservation profiles that yield domain hits on the input sequence
D1 D1 D2 D2 D2 D8 D6 D11 D2 D3
- Domain prediction:
- 1. Removal of overlapping hits
D1 D2 D8 D6 D11 D2 D3
- 2. Selection of hits with prob ≥ 0.9
D1 D2 D8 D11 D3
- 3. Selection of hits with best bit-score and % identity
D2 D8 D11
MetaCLADE - Model construction
[Figure: construction of the CLADE library from Pfam domains D1 ... Di ... DM]
- Seedi and pHMMi: global consensus model of domain Di
- Fulli: sequences Si^1 ... Si^ni, with ni ≤ 350
- NR, PSI-BLAST: one clade-centered model CCMi^j for each sequence Si^j
- CLADE library: global consensus models pHMM1 ... pHMMM and clade-centered models CCM1^1 ... CCMM^nM
- Inherited from CLADE
- Several models are built in order to represent each known Pfam family
- Pfam27: 14 831 domains ⇒ 2 389 235 CCMs (PSSMs) and 14 831 pHMMs
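To put the library size in perspective, the Pfam27 figures quoted on this slide can be turned into an average number of CCMs per domain. This is plain arithmetic on the two counts; the variable names are illustrative only.

```python
n_domains = 14_831   # Pfam27 domains (one pHMM each)
n_ccms = 2_389_235   # clade-centered models (PSSMs)

ccms_per_domain = n_ccms / n_domains
print(f"{ccms_per_domain:.1f} CCMs per domain on average")  # about 161.1
```

The average of roughly 161 models per domain sits well below the cap of 350 sequences per domain.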
MetaCLADE - General idea
D1 D1 D2 D2 D2 D8 D6 D11 D2 D3
MetaCLADE - Training set
The set of positive sequences for a domain Di is based on suffixes, prefixes and random fragments from Seedi.
The set of negative sequences for a domain Di is based on three different methods:
- 1. 2-mer shuffling
- 2. sequence reversal
- 3. Markov model (probabilities based on 4-mers of Seedi)
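The three negative-sequence generators can be sketched as follows. The slide does not give implementation details, so this example makes two loud assumptions: "2-mer shuffling" is read as shuffling non-overlapping blocks of two residues, and the Markov model is read as an order-3 chain whose transitions are estimated from the 4-mers of the seed sequences. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def reverse_sequence(seq):
    """Negative example obtained by reversing the sequence."""
    return seq[::-1]


def shuffle_2mers(seq, rng):
    """ASSUMPTION: shuffle non-overlapping 2-residue blocks of the sequence."""
    blocks = [seq[i:i + 2] for i in range(0, len(seq), 2)]
    rng.shuffle(blocks)
    return "".join(blocks)


def train_markov_4mers(seqs):
    """ASSUMPTION: count, for each 3-residue context, the residue that follows."""
    counts = defaultdict(Counter)
    for s in seqs:
        for i in range(len(s) - 3):
            counts[s[i:i + 3]][s[i + 3]] += 1
    return counts


def sample_markov(counts, length, rng):
    """Generate a sequence of the given length from the 4-mer statistics."""
    if not counts:
        raise ValueError("the model is empty: seed sequences are too short")
    contexts = sorted(counts)
    out = list(rng.choice(contexts))
    while len(out) < length:
        successors = counts.get("".join(out[-3:]))
        if not successors:
            # Unseen context: restart from a randomly chosen observed context.
            out.extend(rng.choice(contexts))
            continue
        residues = sorted(successors)
        weights = [successors[r] for r in residues]
        out.append(rng.choices(residues, weights=weights)[0])
    return "".join(out[:length])


rng = random.Random(0)
toy_seed = ["MAKLKVANDKALKVA", "MAKLRVANDRALKVS"]  # toy data, not a Pfam seed
model = train_markov_4mers(toy_seed)
print(reverse_sequence(toy_seed[0]))
print(shuffle_2mers(toy_seed[0], rng))
print(sample_markov(model, 20, rng))
```

All three generators keep the residue alphabet of the seed while destroying the positional signal that the domain models are meant to detect.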
MetaCLADE - Training set
- Negative sequences are generated until
|negative sequences| ≥ 1/2 |positive sequences|
- Generation of a two-dimensional space by considering the bit score and the mean bit score of domain hits
MetaCLADE - Hit classification
- A discrete version of a naive Bayes classifier is used to partition the hit space into regions with an associated probability
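A minimal sketch of such a discrete naive Bayes partition is given below. It assumes equal-width bins on each of the two hit features and Laplace smoothing; neither choice is stated on the slide, and the class name `DiscreteNaiveBayes2D` is hypothetical.

```python
from collections import Counter


class DiscreteNaiveBayes2D:
    """Naive Bayes over two binned features (e.g. bit score, mean bit score)."""

    def __init__(self, n_bins=10, alpha=1.0):
        self.n_bins = n_bins  # ASSUMPTION: equal-width bins per feature
        self.alpha = alpha    # ASSUMPTION: Laplace smoothing constant

    def _bin(self, feature, value):
        lo, hi = self.ranges[feature]
        if hi <= lo:
            return 0
        index = int((value - lo) / (hi - lo) * self.n_bins)
        return min(max(index, 0), self.n_bins - 1)

    def fit(self, points, labels):
        """points: (x, y) pairs; labels: 1 for positive hits, 0 for negative."""
        self.ranges = [
            (min(p[f] for p in points), max(p[f] for p in points))
            for f in (0, 1)
        ]
        self.class_counts = Counter(labels)
        self.bin_counts = {(c, f): Counter() for c in (0, 1) for f in (0, 1)}
        for point, label in zip(points, labels):
            for f in (0, 1):
                self.bin_counts[(label, f)][self._bin(f, point[f])] += 1
        return self

    def prob_positive(self, point):
        """Posterior probability that a hit falling in this region is positive."""
        total = sum(self.class_counts.values())
        joint = {}
        for c in (0, 1):
            p = self.class_counts[c] / total
            for f in (0, 1):
                seen = self.bin_counts[(c, f)][self._bin(f, point[f])]
                p *= (seen + self.alpha) / (
                    self.class_counts[c] + self.alpha * self.n_bins
                )
            joint[c] = p
        return joint[1] / (joint[0] + joint[1])


# Toy training data: positives score high, negatives score low.
points = [(50, 0.50), (60, 0.60), (55, 0.55), (10, 0.10), (12, 0.12), (8, 0.09)]
labels = [1, 1, 1, 0, 0, 0]
clf = DiscreteNaiveBayes2D(n_bins=4).fit(points, labels)
print(clf.prob_positive((58, 0.58)) >= 0.9)
```

Every cell of the binned plane receives one posterior probability, so a cut-off such as prob ≥ 0.9 selects a region of the hit space instead of a single score threshold.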
MetaCLADE - Hit filtering and domain prediction
D1 D1 D2 D2 D2 D8 D6 D11 D2 D3
- 1. Removal of overlapping hits
D1 D2 D8 D6 D11 D2 D3
- 2. Selection of hits with prob ≥ 0.9
D1 D2 D8 D11 D3
- 3. Selection of hits with best bit-score and % identity
Final prediction: D2 D8 D11
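The three filtering steps could look roughly like the sketch below. The slide does not say how overlaps are resolved, so the example assumes a greedy rule: step 1 keeps the best-scoring hit among overlapping hits of the same domain, and step 3 keeps the best hit by (bit-score, % identity) among overlapping hits of any domain. The `Hit` record and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    domain: str
    start: int       # inclusive coordinates on the input sequence
    end: int
    bit_score: float
    identity: float  # % identity
    prob: float      # probability assigned by the hit classifier


def overlaps(a, b):
    return a.start <= b.end and b.start <= a.end


def greedy_select(hits, same_domain_only):
    """Keep hits in decreasing (bit_score, identity) order, skipping clashes."""
    kept = []
    ranked = sorted(hits, key=lambda h: (h.bit_score, h.identity), reverse=True)
    for hit in ranked:
        clash = any(
            overlaps(hit, k) and (not same_domain_only or hit.domain == k.domain)
            for k in kept
        )
        if not clash:
            kept.append(hit)
    return kept


def filter_hits(hits, min_prob=0.9):
    step1 = greedy_select(hits, same_domain_only=True)    # 1. overlapping hits
    step2 = [h for h in step1 if h.prob >= min_prob]      # 2. prob >= 0.9
    step3 = greedy_select(step2, same_domain_only=False)  # 3. best score/identity
    return sorted(step3, key=lambda h: h.start)


toy_hits = [
    Hit("D1", 1, 50, 40.0, 35.0, 0.95),
    Hit("D1", 10, 60, 30.0, 30.0, 0.95),
    Hit("D2", 40, 120, 80.0, 45.0, 0.99),
    Hit("D6", 130, 180, 50.0, 40.0, 0.50),
    Hit("D8", 200, 260, 60.0, 50.0, 0.95),
]
print([h.domain for h in filter_hits(toy_hits)])  # ['D2', 'D8']
```

In this toy input the second D1 hit is removed in step 1, D6 in step 2, and the remaining D1 hit in step 3 because it overlaps the stronger D2 hit.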
Results
MetaCLADE on metatranscriptomics
- Marine eukaryotic phytoplankton metatranscriptomes
- 1.5M high-quality cDNA sequences, average length of 242 bp
MetaCLADE - Functional Annotation
MetaCLADE - MetaCLADE vs HMMER (Ion transport)
MetaCLADE - Higher resolution
MetaCLADE - Comparison with other methods
Guerrero Negro Hypersaline Microbial Mats (GNHM): 100/200 bp

100 bp
Tool              TP        FP       FN        TPR (%)  PPV (%)  F-score (%)
UProC             336 302   20 258   448 249   42.9     94.3     58.9
MetaCLADE         323 009   12 145   461 542   41.2     96.4     57.7
HMMGRASP          328 729   37 224   455 822   41.9     89.8     57.1
MetaCLADE+UProC   405 734   20 145   378 817   51.7     95.3     67.0
UProC+MetaCLADE   406 370   25 965   378 181   51.8     94.0     66.8

200 bp
Tool              TP        FP       FN        TPR (%)  PPV (%)  F-score (%)
UProC             264 787   19 060   220 368   54.6     93.3     68.9
MetaCLADE         347 936   18 138   137 219   71.7     95.0     81.7
HMMGRASP          290 155   37 189   195 000   59.8     88.6     71.4
MetaCLADE+UProC   363 667   21 479   121 488   75.0     94.4     83.6
UProC+MetaCLADE   364 641   28 444   120 514   75.2     92.8     83.0
[Figure: Precision-Recall curve (x-axis: Recall, y-axis: Precision). UProC (AUC=0.529), MetaCLADE (AUC=0.699), MetaCLADE+UProC (AUC=0.727)]
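The TPR, PPV and F-score columns of the comparison table can be recomputed from the TP, FP and FN counts. The sketch assumes the standard definitions (TPR as recall, PPV as precision, F-score as their harmonic mean), which reproduce the reported values; the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, fn):
    """Return (TPR, PPV, F-score) as fractions between 0 and 1."""
    tpr = tp / (tp + fn)                   # recall / sensitivity
    ppv = tp / (tp + fp)                   # precision
    f_score = 2 * tpr * ppv / (tpr + ppv)  # harmonic mean of the two
    return tpr, ppv, f_score


# UProC on the 100 bp GNHM dataset (counts taken from the table).
tpr, ppv, f_score = classification_metrics(tp=336_302, fp=20_258, fn=448_249)
print(f"{100 * tpr:.1f} {100 * ppv:.1f} {100 * f_score:.1f}")  # 42.9 94.3 58.9
```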
Future improvements
- More domains and new models for an improved annotation
- Constructing a library of conserved small motifs
- Annotation of longer sequences
- Reduction of the number of redundant models
- New criteria to filter overlapping hits
Conclusions
- Learning about the functional activity of the community and its sub-communities is a crucial step toward understanding species interactions and large-scale environmental impact
- Functional annotation methods need to be as precise as possible in identifying remote homology
- MetaCLADE allows for the discovery of patterns in highly divergent sequences
- The number of unknown sequences will keep growing, hence probabilistic models are expected to play a major role in the annotation of sequences spanning unrepresented sequence spaces
Thank you for your attention!
Acknowledgments
- Ari Ugarte
- Juliana Silva Bernardes
- Alessandra Carbone