SLIDE 1

AMRtime

Precise identification of antimicrobial resistance determinants from metagenomic data

Finlay Maguire (finlaymaguire@gmail.com)

December 3, 2019

Faculty of Computer Science, Dalhousie University

SLIDE 2

Table of contents

  1. Background
  2. AMRtime Overview
  3. Filtering out non-AMR reads
  4. Sensitive Homology Classification

SLIDE 3

Background

SLIDE 4

AMR-metagenomics

Figure: Genomes → (Sequencing) → Reads → (AMR detection) → AMR Genes

SLIDE 5

Comprehensive Antibiotic Resistance Database

card.mcmaster.ca

SLIDE 6

Why is AMR metagenomics difficult?

SLIDE 7

AMR genes are rare genomically

Figure: AMR Reads in Metagenome (0.643%), read counts on a log scale: All (~324M), AMR (~2.1M).

2184 CARD-Prevalence Genomes at 1-10X abundance

SLIDE 8

AMR genes have wildly different abundances

1236 AMR PATRIC genomes

SLIDE 9

AMR sequence space overlaps

Figure: MDS of CARD Proteins BLASTP-%ID, in two panels: Actual Families and Affinity Clusters (Adj. Rand=0.30041).

SLIDE 10

AMRtime Overview

SLIDE 11

AMRtime structure

Figure: pipeline diagram. Legend: Input files, Processes, Intermediate files, Output files.

  • Inputs: Metagenomic Reads, CARD
  • AMR Filtering → Filtered reads
  • Sensitive Homology Classification → Homology predictions
  • Variant Identification → Variant predictions
  • Metamodels → Metamodel predictions

SLIDE 13

AMRtime structure

Figure: pipeline diagram. Legend: Input Files, Processes, Intermediate Files, Output Files.

  • Inputs: Metagenomic Reads, CARD
  • Read Filtering → Filtered Reads, Features
  • Sensitive AMR Classification → ARO Predictions

SLIDE 14

Filtering out non-AMR reads

SLIDE 15

Testing sequence similarity search tools

Figure: benchmark workflow. ESKAPE Genomes, Resistance Gene Identifier + CARD and the ART Read Simulator give a Labeled Simulated Metagenome; ORFM gives Predicted ORF Protein Sequences.

NT Query & NT CARD Database Methods

  • BLASTN
  • bowtie2
  • BWA-MEM
  • biobloom*
  • groot
  • HMMSearch

NT Query & AA CARD Database Methods

  • BLASTX
  • DIAMOND BLASTX
  • PALADIN

AA Query & AA CARD Database Methods

  • BLASTP
  • DIAMOND BLASTP
  • HMMSearch

SLIDE 16

Terminology refresher interlude

https://commons.wikimedia.org/wiki/File:Precisionrecall.svg

SLIDE 17

DNA subject best for precision, Protein subject best for recall

Figure: precision-recall plot. Legend (Domain): DNA Query/DB; DNA Query, Protein DB; Protein Query/DB.

Simulated MiSeq v3 250bp reads, 30.31M reads (7.21M AMR derived)

SLIDE 18

K-mer methods perform poorly

Figure: precision-recall plot. Legend (Paradigm): BWT, BLAST, k-mer, HMM.

BWT: bowtie2, bwa-mem, paladin; BLAST: blast, diamond; HMM: hmmsearch; k-mer: biobloom, groot.

SLIDE 19

DIAMOND-BLASTX best compromise

Figure: precision-recall plot, both axes spanning 0.90 to 1.00. Legend (Tool): blastx, bwa, diamond_blastx, paladin, blastp, diamond_blastp.

DIAMOND-BLASTX ‘more sensitive’ setting (min < 1e−10): 4.926 hours with 2 cores and 8.3 GB of memory. AMR reads: 7.15M detected, 59.26K missed, 1.87M false positives.
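The three read counts above can be turned into precision and recall directly. A minimal sketch, assuming "detected" means true positives and "missed" means false negatives; the variable and function names are illustrative, not taken from AMRtime:

```python
# Counts quoted on the slide for the DIAMOND-BLASTX run, in reads.
true_positives = 7.15e6    # AMR reads detected
false_negatives = 59.26e3  # AMR reads missed
false_positives = 1.87e6   # non-AMR reads reported as AMR


def precision(tp: float, fp: float) -> float:
    """Fraction of reported reads that are truly AMR-derived."""
    return tp / (tp + fp)


def recall(tp: float, fn: float) -> float:
    """Fraction of truly AMR-derived reads that were reported."""
    return tp / (tp + fn)


p = precision(true_positives, false_positives)
r = recall(true_positives, false_negatives)
print(f"precision = {p:.3f}, recall = {r:.3f}")
```

Under that reading the filter keeps almost every AMR read, at the cost of letting a sizeable number of non-AMR reads through to the next stage.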

SLIDE 20

Why not just use these sequence searches?

SLIDE 21

Poor gene-level accuracy

Figure: ARO Accuracy by Tool (proportion of reads per ARO correct, 0.0 to 1.0). Tools: groot, diamond_blastp, diamond_blastx, blastp, blastx, paladin, blastn, bowtie2, bwa.

Performance at optimal settings for ARO accuracy

SLIDE 22

Good family-level accuracy

Figure: Correct Family by Tool (proportion of reads per family correct, 0.0 to 1.0). Tools: groot, hmmsearch_nt, bowtie2, bwa, hmmsearch_aa, blastn, paladin, diamond_blastp, diamond_blastx, blastx, blastp.

Performance at optimal settings for Family accuracy

SLIDE 23

Sensitive Homology Classification

SLIDE 25

Initial classifier

Figure: Training Data → Classifier → ARO predictions

NB 7-mer Average Precision: 0.63
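The slide gives only "NB 7-mer" and a score. A minimal sketch of what such a baseline could look like, assuming NB means multinomial naive Bayes over overlapping 7-mer counts with add-one smoothing; the class name, labels and toy reads are hypothetical, and the real training set and settings are not stated:

```python
import math
from collections import Counter, defaultdict

K = 7  # k-mer length named on the slide


def kmer_counts(seq: str, k: int = K) -> Counter:
    """Count overlapping k-mers in a read."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


class KmerNaiveBayes:
    """Multinomial naive Bayes over k-mer counts with add-one smoothing."""

    def fit(self, reads, labels):
        self.class_counts = Counter(labels)
        self.kmer_totals = defaultdict(Counter)  # label -> k-mer -> count
        for read, label in zip(reads, labels):
            self.kmer_totals[label].update(kmer_counts(read))
        self.vocab = {k for c in self.kmer_totals.values() for k in c}
        return self

    def predict(self, read):
        n = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            totals = self.kmer_totals[label]
            denom = sum(totals.values()) + len(self.vocab)
            score = math.log(count / n)  # log prior
            for kmer, c in kmer_counts(read).items():
                if kmer in self.vocab:  # skip k-mers never seen in training
                    score += c * math.log((totals[kmer] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy reads with hypothetical labels, only to show the calls.
model = KmerNaiveBayes().fit(
    ["ACGTACGTACGTACGT", "TTTTGGGGCCCCAAAA"], ["label_A", "label_B"]
)
print(model.predict("ACGTACGTACG"))
```

A single sequencing error changes up to seven overlapping 7-mers, which is one reason a k-mer baseline can struggle on short, noisy reads.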

SLIDE 27

Revised classifier structure: exploiting the ARO

Figure: classifier diagram.

  • Training Data → AMR Family Classifier → AMR Families
  • Family 1 → SMOTE → Family 1 Data → Family 1 Classifier
  • Family ... → SMOTE → Family ... Data → Family ... Classifier
  • Family N → SMOTE → Family N Data → Family N Classifier
  • Family classifiers → ARO predictions
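The two-level routing in the diagram can be sketched as follows. This is an illustration under stated assumptions, not AMRtime's code: a nearest-centroid model stands in for whatever classifiers are actually used, the SMOTE oversampling step is left out, and all class, family and ARO names are hypothetical:

```python
from collections import defaultdict


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NearestCentroid:
    """Stand-in classifier: predicts the label with the closest mean vector."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        self.centroids = {lab: centroid(rows) for lab, rows in groups.items()}
        return self

    def predict(self, row):
        return min(self.centroids, key=lambda lab: sq_dist(row, self.centroids[lab]))


class HierarchicalClassifier:
    """Stage 1 picks an AMR family; stage 2 picks an ARO within that family."""

    def fit(self, X, families, aros):
        self.family_model = NearestCentroid().fit(X, families)
        per_family = defaultdict(lambda: ([], []))
        for row, fam, aro in zip(X, families, aros):
            per_family[fam][0].append(row)
            per_family[fam][1].append(aro)
        # One second-stage model per family, trained only on that family's reads.
        self.aro_models = {
            fam: NearestCentroid().fit(rows, labels)
            for fam, (rows, labels) in per_family.items()
        }
        return self

    def predict(self, row):
        fam = self.family_model.predict(row)
        return fam, self.aro_models[fam].predict(row)


# Toy two-feature data, purely illustrative.
X = [[9.0, 1.0], [7.0, 3.0], [1.0, 9.0], [3.0, 7.0]]
families = ["family_1", "family_1", "family_2", "family_2"]
aros = ["aro_a", "aro_b", "aro_c", "aro_d"]
clf = HierarchicalClassifier().fit(X, families, aros)
print(clf.predict([8.5, 1.5]))
```

Each second-stage model only has to separate the AROs inside one family, which is a much smaller problem than separating every ARO at once.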

SLIDE 28

Read encoding

Sequence bitscore matrix: one row per read (read_1 ... read_i), one column per gene (gene_1, gene_2, ..., gene_{j-1}, gene_j). Example entries as shown on the slide:

  read_1      1256 ... 63
  read_2      ...
  ...         ...
  read_{i-1}  512 ...
  read_i      ... 785 129

Advantages: read length invariant, low dimensionality, uses filtering data
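One way to build such a matrix from the filtering step's alignments is sketched below. It assumes the hits are in the standard 12-column BLAST tabular layout (query id first, subject id second, bitscore last); the function name, the choice to keep the best bitscore per read and gene, and the toy hits are illustrative assumptions:

```python
import csv


def bitscore_matrix(tabular_hits, genes):
    """Return (read_ids, matrix): one row per read, one column per gene.

    `tabular_hits` is an iterable of tab-separated lines in the 12-column
    BLAST tabular layout. Cells with no hit stay at 0.0, and repeated hits
    for the same read and gene keep the highest bitscore.
    """
    col = {gene: j for j, gene in enumerate(genes)}
    rows = {}  # read id -> list of bitscores, in first-seen order
    for fields in csv.reader(tabular_hits, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        read_id, gene, bitscore = fields[0], fields[1], float(fields[11])
        if gene not in col:
            continue  # ignore subjects outside the chosen gene list
        row = rows.setdefault(read_id, [0.0] * len(genes))
        row[col[gene]] = max(row[col[gene]], bitscore)
    read_ids = list(rows)
    return read_ids, [rows[r] for r in read_ids]


# Toy hits; only the first, second and last columns are used.
hits = [
    "read1\tgene1\t98.8\t83\t1\t0\t1\t249\t10\t92\t1e-40\t150.0",
    "read1\tgene2\t60.0\t80\t32\t0\t1\t240\t5\t84\t1e-05\t40.5",
    "read2\tgene2\t90.0\t83\t8\t0\t1\t249\t20\t102\t1e-20\t88.0",
    "read1\tgene1\t95.0\t60\t3\t0\t1\t180\t10\t69\t1e-25\t110.0",
]
read_ids, matrix = bitscore_matrix(hits, ["gene1", "gene2"])
print(read_ids, matrix)
```

Because each row has one entry per reference gene, its width does not depend on read length, and the scores come for free from the search already run for filtering.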

SLIDE 29

Held-out test results

Figure: Family Test Performance. Precision and Recall shown as a proportion (0.00 to 1.00).

Model: Normalised Bitscore Random Forest

Mean Precision: 0.995, Mean Recall: 0.985

SLIDE 30

ARO level classification more variable

Figure: Median Precision-Recall Within Families. Axes: Ordered AMR Family Index (ticks at 25 to 225) and Proportion (0.00 to 1.00). Series: Precision, Recall.

SLIDE 31

On-going work

  • Soft-threshold (i.e. propagating probabilities through layers)
  • Multiset labels based on sequence redundancy within families.
  • Threshold identification for variant model counts.
  • Metamodel rule parsing.
  • Galaxy bindings (CARD/IRIDA integration).
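The soft-threshold item in the list above is only named on the slide. One possible reading, sketched here as an assumption rather than the planned implementation, is to weight each family classifier's ARO probabilities by the family probability instead of committing to a single family; the function name and toy probabilities are hypothetical:

```python
def soft_aro_scores(family_probs, aro_probs_by_family):
    """Combine the two layers: P(ARO) = P(family) * P(ARO | family)."""
    scores = {}
    for family, p_family in family_probs.items():
        for aro, p_aro in aro_probs_by_family.get(family, {}).items():
            scores[aro] = scores.get(aro, 0.0) + p_family * p_aro
    return scores


# Toy probabilities for a single read.
family_probs = {"family_1": 0.6, "family_2": 0.4}
aro_probs_by_family = {
    "family_1": {"aro_a": 0.5, "aro_b": 0.5},
    "family_2": {"aro_c": 0.9, "aro_d": 0.1},
}
scores = soft_aro_scores(family_probs, aro_probs_by_family)
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

In this toy case a hard threshold would route the read to family_1 and could never report aro_c, whereas the combined score prefers aro_c because family_2's classifier is far more confident.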

SLIDE 32

Summary

SLIDE 37

Conclusions

  • Direct homology searches are surprisingly poor for AMR metagenomics.
  • K-mer based approaches fall flat with sequencing error, low coverage and sparse labels.
  • Direct homology search results ARE useful when combined with machine learning.
  • The Antibiotic Resistance Ontology provides useful structure to improve predictions.
  • AMRtime: coming soon to CARD and your local government genomic epidemiology platform.

SLIDE 38

Acknowledgements

SLIDE 39

Acknowledgements

  • McMaster University: Brian Alcock and Andrew McArthur
  • Simon Fraser University: Fiona Brinkman
  • Dalhousie University: Robert Beiko
  • Funding: Donald Hill Family Fellowship, Genome Canada Grant.

SLIDE 40

Questions?

SLIDE 41

Insufficient Intrafamily Signal

Figure: Intra-Family Shared 250mers. Axes: Number of Shared 250mers (ticks at 200 to 800) and AMR Family.

AMR families shown: TEM, SHV, OCH, MIR, LEN, GES, PDC, NDM, GOB, KPC, SME, GIM, TMB, BEL, CfxA and VEB beta-lactamase.

SLIDE 42

Interfamily Collisions
