Rapid Identification of AMR Determinants from Metagenomic Samples



slide-1
SLIDE 1

Rapid Identification of AMR Determinants from Metagenomic Samples

AMRtime Progress Report

Finlay Maguire June 22, 2018

Faculty of Computer Science, Dalhousie University

slide-2
SLIDE 2

Table of contents

  • 1. Overview
  • 2. Training Data
  • 3. Read filtering
  • 4. Sensitive Homology Search
  • 5. Variant Models
  • 6. Summary
  • 7. Acknowledgements

1

slide-3
SLIDE 3

Overview

slide-13
SLIDE 13

Comprehensive Antibiotic Resistance Database

  • https://card.mcmaster.ca/ (Jia et al., 2016) as of June 2018:
  • Built around Antibiotic Resistance Ontology (ARO): 3996 terms
  • 2536 AMR Detection Models with manually curated criteria:
      • Homology, e.g. NDM beta-lactamases, aminoglycoside acetyltransferase
      • Protein Variant, e.g. GyrA fluoroquinolone mutation, FolP sulfonamide mutation
      • rRNA gene variants, e.g. Mycobacterium aminoglycoside resistance
      • Efflux pump, e.g. AcrAB-TolC, MexAB-OprM mutations
      • Gene cluster, e.g. Van glycopeptide resistance clusters
  • Resistance Gene Identifier (RGI): contigs, predicted genes and merged metagenomic reads
  • CARD Predicted prevalence dataset

2

slide-18
SLIDE 18

Metagenomic Analysis

modified from https://www.gatc-biotech.com/en/expertise/genomics/metagenome-analysis.html

Key difficulties:

  • Variation in abundance and diversity
  • Short fragmentary data
  • Large amounts of data
  • Compositionality
  • Sparse and imbalanced labels

3

slide-19
SLIDE 19

AMRtime Structure

Flowchart, nodes in slide order: Metagenomic Reads; AMR Filtering; Filtered reads; Sensitive Homology Search; CARD; Homology predictions; Variant Identification; Variant predictions; Metamodels; Metamodel predictions.

Legend: Input files, Processes, Intermediate files, Output files.

4

slide-20
SLIDE 20

Training Data

slide-21
SLIDE 21

Dataset Generator

Flowchart, nodes in slide order: Assembled Genomes (*.fna); Resistance Gene Identifier (RGI); CARD; AMR Annotations (*.gff); Abundance/Diversity Resampling; ‘Assembled’ metagenome (.fna); Illumina Simulator (ART); Synthetic metagenome (.fq); Labelling; Read labels (.txt).
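The labelling step can be sketched as an interval-overlap check: a simulated read receives the label of any annotated AMR gene it overlaps on its source contig. This is only an illustration; the function name, the tuple layouts, and the `min_overlap` parameter are assumptions, and the deck does not show how AMRtime implements labelling.

```python
from collections import defaultdict


def label_reads(reads, annotations, min_overlap=1):
    """Assign AMR labels to simulated reads by coordinate overlap.

    reads: iterable of (read_id, contig, start, end), 1-based inclusive.
    annotations: iterable of (contig, start, end, aro_label), same convention.
    Returns {read_id: sorted list of overlapping labels}; reads that overlap
    no annotation map to an empty list.
    """
    # Group annotations by contig so each read is only compared locally.
    by_contig = defaultdict(list)
    for contig, start, end, aro in annotations:
        by_contig[contig].append((start, end, aro))

    labels = {}
    for read_id, contig, start, end in reads:
        hits = set()
        for a_start, a_end, aro in by_contig.get(contig, []):
            # Number of shared bases between the read and the annotation.
            overlap = min(end, a_end) - max(start, a_start) + 1
            if overlap >= min_overlap:
                hits.add(aro)
        labels[read_id] = sorted(hits)
    return labels
```

Raising `min_overlap` would leave reads that only clip the edge of a gene unlabelled, which is one possible way to trade label noise against label sparsity.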

5

slide-22
SLIDE 22

Determinants are scarce

6

slide-23
SLIDE 23

Determinants are imbalanced

7

slide-24
SLIDE 24

AMR sequence space is biased

8

slide-25
SLIDE 25

Read filtering

slide-26
SLIDE 26

Homology Filter Approaches

  • BLASTX (Gish et al., 1993)
  • DIAMOND (Buchfink et al., 2015)
  • PALADIN (Westbrook et al., 2017)
  • MMseqs2 (Steinegger and Söding, 2017)

9

slide-27
SLIDE 27

Performance at defaults?

10

slide-28
SLIDE 28

How computationally efficient are they?

11

slide-29
SLIDE 29

What about in terms of memory?

12

slide-30
SLIDE 30

Is there a cap on overall performance?

13

slide-31
SLIDE 31

What about to hit any ARO?

14

slide-32
SLIDE 32

Performance for best setting per tool

15

slide-33
SLIDE 33

But what about individual ARO performance?

16

slide-34
SLIDE 34

Systematically missing AROs

17

slide-43
SLIDE 43

Why are these 10 always missed?

  • Enterococcus faecalis liaS mutant conferring daptomycin resistance (AE016830.1):
      • Protein 2790824-2789724
      • DNA 1-732
  • OXA-2 (M95287.4):
      • Protein 2456-3280
      • DNA 1-828
  • Acinetobacter OprD conferring resistance to imipenem (CP006768.1):
      • Protein 3513470-3514777
      • DNA 3514887-3515414
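One way to read the three examples above is to compare the length of each recorded span. The sketch below assumes the listed ranges are inclusive coordinates on the named accession (the slide does not state the convention); `span_length`, `compare_spans`, and `ENTRIES` are hypothetical names.

```python
def span_length(start, end):
    """Length of an inclusive coordinate range, whichever strand order is used."""
    return abs(end - start) + 1


# (protein coordinates, DNA coordinates) exactly as listed on the slide.
ENTRIES = {
    "liaS (AE016830.1)": ((2790824, 2789724), (1, 732)),
    "OXA-2 (M95287.4)": ((2456, 3280), (1, 828)),
    "OprD (CP006768.1)": ((3513470, 3514777), (3514887, 3515414)),
}


def compare_spans(entries):
    """Return {name: (protein_span_length, dna_span_length)} for each entry."""
    return {
        name: (span_length(*protein), span_length(*dna))
        for name, (protein, dna) in entries.items()
    }
```

Under that assumption the two spans differ in every case (1101 vs 732, 825 vs 828, and 1308 vs 528), and for OprD the two ranges do not overlap at all.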

18

slide-47
SLIDE 47

CARD Full Length Alignment QC

  • 11 AROs protein not detected from DNA
  • 2 AROs different top protein hit from DNA
  • Warnings: 119 AROs with different top protein but ID% > 99
  • Warnings: 2 AROs with ID% < 99 to correct protein
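The QC categories above could be assigned per ARO with a check along the following lines. This is a sketch under assumptions: the input layout is invented for illustration, the 99% threshold is taken from the slide, and treating a different top hit as a warning only when its identity exceeds the threshold is a guess at how the categories are separated. The deck does not show the actual QC script.

```python
from collections import Counter


def qc_category(expected, top_hit, identity, threshold=99.0):
    """Classify one ARO after searching its DNA sequence against CARD proteins.

    expected: ID of the protein the DNA sequence should encode.
    top_hit: ID of the best-scoring protein, or None if nothing was found.
    identity: percent identity between the DNA translation and top_hit.
    """
    if top_hit is None:
        return "protein not detected"
    if top_hit != expected:
        # Assumption: a near-identical alternative hit is only a warning.
        if identity > threshold:
            return "warning: different top protein, ID% above threshold"
        return "different top protein"
    if identity < threshold:
        return "warning: low ID% to correct protein"
    return "ok"


def summarise(rows):
    """Count QC categories over rows of (expected, top_hit, identity)."""
    return Counter(qc_category(*row) for row in rows)
```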

19

slide-48
SLIDE 48

Sensitive Homology Search

slide-49
SLIDE 49

First attempt at sensitive classification

Filtered Reads → Read encoding → Encoded reads → Torch Classifier → ARO predictions

20

slide-50
SLIDE 50

Revised classifier structure

Flowchart, nodes in slide order: Filtering Scores; Filtered Reads; Read encoding; Encoded reads; AMR Family Classifier; AMR Families; Family 1 Classifier, Family ... Classifier, Family N Classifier; ARO predictions.
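The two-level structure can be sketched as a dispatch: a family-level classifier picks the AMR family, then a classifier specific to that family picks the ARO. The code below uses plain Python callables as stand-ins for trained models; `predict_aro` and its arguments are hypothetical names, not the AMRtime API.

```python
def predict_aro(encoded_read, family_classifier, aro_classifiers):
    """Two-stage prediction: AMR family first, then ARO within that family.

    family_classifier: callable mapping an encoded read to a family name.
    aro_classifiers: dict mapping family name to a callable that maps an
        encoded read to an ARO label.
    Returns (family, aro); aro is None when no per-family model exists.
    """
    family = family_classifier(encoded_read)
    aro_classifier = aro_classifiers.get(family)
    if aro_classifier is None:
        return family, None
    return family, aro_classifier(encoded_read)
```

Splitting the problem this way means each per-family model only has to separate the AROs inside one family, rather than one model separating every ARO at once.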

21

slide-55
SLIDE 55

Encodings

  • Raw sequence
  • Filtering homology search family similarity/dissimilarity
  • Manual feature extraction (GC/TNF/compositional)
  • One-hot K-mer representation
  • K-mer embeddings (DNA2vec/BioVec)
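Three of the encodings listed above (GC content, tetranucleotide frequency, and one-hot k-mers) can be sketched in plain Python. The function names and the choice to skip k-mers containing ambiguous bases are assumptions for illustration; learned embeddings such as DNA2vec are not covered here.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_index(k):
    """Map every k-mer over ACGT to a fixed position (lexicographic order)."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def gc_content(seq):
    """Fraction of G and C bases in the read (0.0 for an empty read)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def kmer_frequencies(seq, k=4):
    """Normalised k-mer counts; k=4 gives the tetranucleotide frequency (TNF)."""
    seq = seq.upper()
    index = kmer_index(k)
    counts = [0] * len(index)
    total = 0
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:  # skip k-mers containing N or other symbols
            counts[pos] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * len(counts)


def one_hot_kmers(seq, k=3):
    """One row per k-mer position, with a single 1 at that k-mer's index."""
    seq = seq.upper()
    index = kmer_index(k)
    rows = []
    for i in range(len(seq) - k + 1):
        row = [0] * len(index)
        pos = index.get(seq[i:i + k])
        if pos is not None:
            row[pos] = 1
        rows.append(row)
    return rows
```

The frequency vector has a fixed length of 4^k regardless of read length, while the one-hot representation keeps positional order at the cost of a much larger input.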

22

slide-56
SLIDE 56

Variant Models

slide-57
SLIDE 57

Ribosomal Variant Models

Metagenomic reads → Ribosomal fragment identification → 5S/16S/23S binned reads → Taxonomic classification → Taxa binned reads → Alignment → SNPs

23

slide-58
SLIDE 58

Identifying Ribosomal Reads

  • MetaRNA (Huang et al., 2009)
  • Ribopicker (Schmieder et al., 2011)
  • SortmeRNA (Kopylova et al., 2012)
  • 77 models
  • Reads simulated from the underlying 30 species reference genomes

24

slide-59
SLIDE 59

Identifying Ribosomal Reads

25

slide-60
SLIDE 60

Identifying Ribosomal Reads

26

slide-61
SLIDE 61

Identifying Ribosomal Reads

27

slide-62
SLIDE 62

Identifying Taxonomy

28

slide-63
SLIDE 63

Some are relatively easy

29

slide-64
SLIDE 64

Others are a mess

30

slide-65
SLIDE 65

Some are group ambiguous

Probably a Mycobacterium?

31

slide-66
SLIDE 66

Others are just a toss-up

32

slide-67
SLIDE 67

Ambiguity in classification

33

slide-68
SLIDE 68

Next Steps

  • Mapping reads to reference to assess presence or absence of mutation-related SNPs
  • Comparison of whole pipeline with just direct mapping to database of ribosomal sequences and SNP calling approaches.
  • Tuning of sensitivity for number of potential SNPs required to make a prediction of AMR.
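The sensitivity tuning in the last step can be pictured as a threshold on how many read-level observations support a resistance-associated SNP. The sketch below is an assumption-laden illustration: the input layout, the function names, the `min_support` parameter, and the example position are all hypothetical, and no real variant caller is used.

```python
def count_snp_support(aligned_bases, resistance_snps):
    """Count observations that match a resistance-associated SNP.

    aligned_bases: iterable of (reference_position, base) pairs taken from
        reads aligned to a ribosomal reference.
    resistance_snps: dict mapping reference_position to the resistant base.
    """
    return sum(
        1
        for position, base in aligned_bases
        if resistance_snps.get(position) == base.upper()
    )


def predict_amr(aligned_bases, resistance_snps, min_support=2):
    """Predict resistance when enough observations support a known SNP."""
    return count_snp_support(aligned_bases, resistance_snps) >= min_support
```

Lowering `min_support` increases sensitivity at the cost of more predictions driven by sequencing errors, which is the trade-off that tuning would need to settle.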

34

slide-69
SLIDE 69

Summary

slide-75
SLIDE 75

Conclusions

  • AMRtime still not a ‘fait accompli’
  • Filtering analysis possibly needs to be redone for fixed CARD
  • False positive analysis pending for best settings
  • Framework and code developed for sensitive homology classification, but optimisation and evaluation work still required
  • Not shown, but preliminary family-level classification shows 100x improvements over previous ARO attempts
  • Ribosomal Variant Model work progressing well, with full pipeline metrics available soon

35

slide-76
SLIDE 76

Acknowledgements

slide-77
SLIDE 77

Acknowledgements

  • Zhou Zhilei
  • Brian Alcock, Amos Raphenya, Kara Tsang
  • Rob Beiko, Fiona Brinkman and Andrew McArthur
  • Funding: Genome Canada and an NSERC Undergraduate Student Research Award

36

slide-78
SLIDE 78

References

Buchfink, B., Xie, C., and Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1):59.

Gish, W. et al. (1993). Identification of protein coding regions by database similarity search. Nature Genetics, 3(3):266.

Huang, Y., Gilna, P., and Li, W. (2009). Identification of ribosomal RNA genes in metagenomic fragments. Bioinformatics, 25(10):1338–1340.

Jia, B., Raphenya, A. R., Alcock, B., Waglechner, N., Guo, P., Tsang, K. K., Lago, B. A., Dave, B. M., Pereira, S., Sharma, A. N., et al. (2016). CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Research, page gkw1004.

37

slide-79
SLIDE 79

Kopylova, E., Noé, L., and Touzet, H. (2012). SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics, 28(24):3211–3217.

Schmieder, R., Lim, Y. W., and Edwards, R. (2011). Identification and removal of ribosomal RNA sequences from metatranscriptomes. Bioinformatics, 28(3):433–435.

Steinegger, M. and Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11):1026.

Westbrook, A., Ramsdell, J., Schuelke, T., Normington, L., Bergeron, R. D., Thomas, W. K., and MacManes, M. D. (2017). PALADIN: protein alignment for functional profiling whole metagenome shotgun data. Bioinformatics, 33(10):1473–1478.

38