SLIDE 1

AMR and machine-learning

Prediction of AMR from metagenomes among other things

Finlay Maguire finlaymaguire@gmail.com December 3, 2019

Faculty of Computer Science, Dalhousie University

SLIDE 2

Genomic Phenotype Prediction

SLIDE 4

Antibiotic Susceptibility Testing

Bradley et al. (2015)


SLIDE 5

AAFC Salmonella Data-set

[Figure: isolates of the data-set labelled by isolate number and group code, for example 3193 (V2), 3125 (G), 3146 (I), 1760 (P1), 3168 (Q1); scale 0.056229.]

SLIDE 6

Genomic RGI Predictions


SLIDE 7

Linking AMR determinants to Phenotype

McArthur et al. (2013)


SLIDE 8

Logistic Regression

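RGI is an I × J matrix with one row per genome (genome1 ... genomeI) and one column per AMR determinant (amr1 ... amrJ); an entry of 1 marks a determinant that RGI reports in that genome.

AST is the matching I × K matrix of susceptible (S) or resistant (R) calls per antibiotic (abx1 ... abxK):

$$
\mathrm{AST} =
\begin{array}{c|cccc}
 & \mathrm{abx}_1 & \mathrm{abx}_2 & \cdots & \mathrm{abx}_K \\ \hline
\mathrm{genome}_1 & S & S & \cdots & R \\
\mathrm{genome}_2 & R & R & \cdots & S \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mathrm{genome}_I & S & S & \cdots & S
\end{array}
$$

Model:

$$
\beta\,\mathrm{RGI} = \mathrm{AST}
$$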
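A minimal sketch of the logistic regression idea on this slide, assuming one model is fitted per antibiotic and that S/R calls are coded as 0/1. The toy matrix, the function names (`fit_logistic`, `predict`) and the hyperparameters are invented for illustration and are not taken from the talk.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=2000, l2=0.01):
    """Batch gradient descent; the last weight is an unpenalised intercept."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for row, target in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + w[-1])
            err = p - target
            for j, xj in enumerate(row):
                grad[j] += err * xj
            grad[-1] += err
        for j in range(d):
            w[j] -= lr * (grad[j] / n + l2 * w[j])
        w[-1] -= lr * grad[-1] / n
    return w

def predict(w, row):
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + w[-1])
    return "R" if p >= 0.5 else "S"

# Hypothetical toy data: rows are genomes, columns are amr1..amr3.
rgi = [
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
ast_abx1 = ["R", "S", "R", "S", "R", "S"]  # resistant exactly when amr1 is present
y = [1 if call == "R" else 0 for call in ast_abx1]

weights = fit_logistic(rgi, y)
predictions = [predict(weights, row) for row in rgi]
```

In this toy case the weight on amr1 dominates the other two, which is the kind of signal the later slide on learnt features and weights refers to.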

SLIDE 9

Set-Covering Machines

Genomes -> Decompose into K-mers -> Genomic K-mers
Genomic K-mers + AST -> Set-Covering Machine -> Boolean K-mer Rules
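The flow above can be sketched as follows. This is a simplified greedy learner in the spirit of a set-covering machine, not the implementation used in the talk: it builds a conjunction of k-mer presence/absence rules, and the scoring rule, the `penalty` parameter and the toy sequences are assumptions made for illustration.

```python
def kmers(sequence, k):
    """Set of all overlapping k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def satisfies(profile, rule):
    """A rule is (kmer, present); it holds when k-mer presence equals `present`."""
    kmer, present = rule
    return (kmer in profile) == present

def train_scm(genomes, labels, k=4, max_rules=3, penalty=1.0):
    """Greedily pick rules that reject susceptible genomes but keep resistant ones."""
    profiles = [kmers(g, k) for g in genomes]
    vocabulary = sorted(set().union(*profiles))
    candidates = [(kmer, present) for kmer in vocabulary for present in (True, False)]
    resistant = {i for i, label in enumerate(labels) if label == "R"}
    susceptible = {i for i, label in enumerate(labels) if label == "S"}
    rules = []
    while susceptible and len(rules) < max_rules:
        best, best_score = None, 0.0
        for rule in candidates:
            rejected_s = sum(not satisfies(profiles[i], rule) for i in susceptible)
            rejected_r = sum(not satisfies(profiles[i], rule) for i in resistant)
            score = rejected_s - penalty * rejected_r
            if score > best_score:
                best, best_score = rule, score
        if best is None:
            break
        rules.append(best)
        # Keep only the genomes that still pass every rule chosen so far.
        susceptible = {i for i in susceptible if satisfies(profiles[i], best)}
        resistant = {i for i in resistant if satisfies(profiles[i], best)}
    return rules

def predict_scm(rules, genome, k=4):
    """Predict resistant only when every rule in the conjunction holds."""
    profile = kmers(genome, k)
    return "R" if all(satisfies(profile, rule) for rule in rules) else "S"

# Hypothetical toy genomes; the two resistant ones share the 4-mer GATT.
genomes = ["ACGTGATTACGT", "TTGATTCCGTAA", "ACGTACGTACGT", "TTCCGTAACCGG"]
labels = ["R", "R", "S", "S"]
rules = train_scm(genomes, labels)
```

The learnt model is a short, readable Boolean rule over k-mers, which is the property the slide highlights.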

SLIDE 10

AST Prediction Performance

[Figure: four panels, A to D.]

A: RGI, B: RGI-efflux, C: Logistic Regression, D: Set-Covering Machines. Major Disagreement is overprediction of resistance; Very Major Disagreement is underprediction.
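A small sketch of how the two disagreement rates in the caption could be computed from paired calls. It assumes the common convention that major disagreements are counted over phenotypically susceptible isolates and very major disagreements over phenotypically resistant ones; the talk does not state its denominators, and the example calls are invented.

```python
def disagreement_rates(observed, predicted):
    """Compare phenotypic (observed) and genotypic (predicted) S/R calls."""
    pairs = list(zip(observed, predicted))
    for_susceptible = [p for o, p in pairs if o == "S"]
    for_resistant = [p for o, p in pairs if o == "R"]
    major = sum(p == "R" for p in for_susceptible)      # resistance overpredicted
    very_major = sum(p == "S" for p in for_resistant)   # resistance underpredicted
    return {
        "major": major / len(for_susceptible) if for_susceptible else 0.0,
        "very_major": very_major / len(for_resistant) if for_resistant else 0.0,
    }

# Hypothetical calls for six isolates and one antibiotic.
observed = ["S", "S", "S", "S", "R", "R"]
predicted = ["S", "R", "S", "S", "R", "S"]
rates = disagreement_rates(observed, predicted)
```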

SLIDE 11

Learnt features/weights

[Figure: two panels, A and B.]

SLIDE 12

Extending beyond Salmonella

ARO Predictions (Kara Tsang)


SLIDE 13

Extending beyond Salmonella

Logistic Regression


SLIDE 16

Genomic AST Prediction

Using direct annotations works very poorly across different organisms and resistance mechanisms.
Even very simple logistic regression models greatly improve predictions.
Investigation of learnt weights and features can be very scientifically informative.

SLIDE 17

Non-Bioinformatics Interlude

SLIDE 21

Non-profits have data and lots of contextualising knowledge.
No time or resources to analyse or use it.
Informaticians have the skills and resources but no specific understanding of the context.
Many low-hanging fruit that can make big differences.

SLIDE 22

Refugee Women’s Health Clinic


SLIDE 23

Staff Scheduling


SLIDE 24

Language Development in Autism

Qualitative Social Media Analysis (Tamara Sorenson-Duncan)


SLIDE 25

Alpha Diversity of Posting Activity


SLIDE 26

Beta Diversity of Posting Activity


SLIDE 27

Other on-going Projects

Halifax Community Learning Network
Shelter Nova Scotia
211 Nova Scotia


SLIDE 28

AMRtime

SLIDE 29

AMR-metagenomics

Genomes -> Sequencing -> Reads -> AMR detection -> AMR Genes

SLIDE 30

Why is this difficult?

SLIDE 31

AMR genes are rare genomically

[Figure: AMR Reads in Metagenome (0.643%). Read counts on a log scale (10^7 to 10^8): All (~324M) against AMR (~2.1M).]

2184 CARD-Prevalence Genomes at 1-10X abundance

SLIDE 32

AMR genes have wildly different abundances

1236 AMR PATRIC genomes


SLIDE 33

AMR genes have highly variable diversity


SLIDE 34

AMR sequence space overlaps

[Figure: MDS of CARD Proteins BLASTP-%ID, two panels: Actual Families and Affinity Clusters (Adj. Rand = 0.30041).]
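The "Adj. Rand" figure quoted for this slide is read here as the adjusted Rand index between the clustering and the actual family labels. A minimal sketch of that index, using only the standard library; the function name and the toy labels are illustrative assumptions.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same items."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions put every item in one group.
        return 1.0
    return (sum_joint - expected) / (maximum - expected)

# Hypothetical family labels and cluster assignments for four proteins.
families = ["fam1", "fam1", "fam2", "fam2"]
same_grouping = [7, 7, 3, 3]
crossed_grouping = [0, 1, 0, 1]
```

A value of 1 means the clusters reproduce the families exactly and values near 0 mean agreement no better than chance, so 0.30041 indicates that sequence-similarity clusters only partly recover the curated families.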

SLIDE 35

Insufficient Signal in 250bp Fragments

NDM Multiple Sequence Alignment


SLIDE 37

Other constraints

No point doing what we do if people can’t use it.
Limited hardware requirements (a standard workstation or instance: < 8-12 GB, 1-8 cores).

Fast enough (< 12 hours).
Easy to install/configure.
Easy to use.
Easy to update.


SLIDE 38

AMRtime

SLIDE 39

AMRtime structure

[Diagram legend: Input files, Processes, Intermediate files, Output files.]
Metagenomic Reads -> AMR Filtering -> Filtered reads
Sensitive Homology Classification (CARD) -> Homology predictions
Variant Identification -> Variant predictions
Metamodels -> Metamodel predictions

SLIDE 40

Read filtering

SLIDE 41

Homology Filter Approaches

[Figure: Relative Computational Demands. Elapsed Time (hours, 10 to 50) and Max Resident Memory (GB, 2 to 8) per tool: blastn, biobloom, groot, bwa, bowtie2, hmmsearch_nt, blastx, diamond_blastx, paladin, blastp, diamond_blastp, hmmsearch_aa.]

SLIDE 42

Precision-Recall of Homology Search

[Figure: Precision against Recall, coloured by search paradigm: BWT, BLAST, k-mer, HMM.]

SLIDE 43

Optimising for recall

[Figure: Precision against Recall, both axes from 0.90 to 1.00, for blastx, bwa, diamond_blastx, paladin, blastp, diamond_blastp.]

SLIDE 44

Sensitive Homology Classification

SLIDE 45

Dealing with imbalanced training data

Simulated AMR Reads (.fq) -> Encoding -> Encoded Reads
Encoded Reads + Labels (.tsv) -> Stratified Test-Train (20%) Split -> Training Data, Testing Data
Training Data -> SMOTE -> Resampled Training Data -> Stratified 5-fold CV -> Training Data Folds
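A rough standard-library sketch of the first two stages of this flow: a stratified 20% hold-out, followed by SMOTE-style oversampling applied to the training portion only. It is a simplified stand-in for a library implementation; the function names, the neighbour count `k` and the toy data are assumptions, and the 5-fold cross-validation step is not shown.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Hold out roughly test_fraction of every class (at least one item each)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

def smote(X, y, k=2, seed=0):
    """Oversample each minority class by interpolating towards near neighbours."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            neighbours = sorted(
                (r for r in rows if r is not base),
                key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
            )[:k]
            mate = rng.choice(neighbours) if neighbours else base
            gap = rng.random()
            # The synthetic point lies on the segment between base and mate.
            X_out.append([a + gap * (b - a) for a, b in zip(base, mate)])
            y_out.append(label)
    return X_out, y_out

# Hypothetical encoded reads: ten of class "a" and five of class "b".
X = [[float(i), float(i % 3)] for i in range(15)]
y = ["a"] * 10 + ["b"] * 5

train_idx, test_idx = stratified_split(y)
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
X_resampled, y_resampled = smote(X_train, y_train)
class_counts = Counter(y_resampled)
```

Resampling after the split keeps synthetic reads out of the test set, which matches the order shown in the diagram.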

SLIDE 46

What is balance?

Different gene lengths within families (coverage vs read number)?
Different family sizes?
Different family diversity?
Using a generator to improve on SMOTE.


SLIDE 49

Initial classifier

Training Data -> Classifier -> ARO predictions
Naive Bayes (NB) on 7-mers: Average Precision 0.63

SLIDE 50

Revised classifier structure: exploiting the ARO

Training Data -> AMR Family Classifier -> AMR Families
For each family (Family 1 ... Family N): SMOTE -> Family Data -> Family Classifier
Family Classifiers -> ARO predictions
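The two-level structure can be sketched as below: one model assigns a read to an AMR family, then a per-family model assigns the ARO term. A nearest-centroid classifier stands in for whatever model is actually used at each level, the per-family SMOTE step is omitted, and all names and data are hypothetical.

```python
from collections import defaultdict

def fit_centroids(X, y):
    """Stand-in classifier: remember the mean feature vector of every label."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        previous = sums.get(label, [0.0] * len(row))
        sums[label] = [s + v for s, v in zip(previous, row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def nearest(centroids, row):
    """Label whose centroid is closest to the row (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(centroids[label], row)),
    )

def fit_hierarchy(X, families, aros):
    """Fit one family-level model plus one ARO-level model per family."""
    family_model = fit_centroids(X, families)
    grouped = defaultdict(lambda: ([], []))
    for row, family, aro in zip(X, families, aros):
        grouped[family][0].append(row)
        grouped[family][1].append(aro)
    aro_models = {
        family: fit_centroids(rows, labels) for family, (rows, labels) in grouped.items()
    }
    return family_model, aro_models

def predict_hierarchy(model, row):
    family_model, aro_models = model
    family = nearest(family_model, row)
    return family, nearest(aro_models[family], row)

# Hypothetical encoded reads with family and ARO labels.
X = [[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
families = ["famA", "famA", "famA", "famA", "famB", "famB"]
aros = ["A1", "A1", "A2", "A2", "B1", "B1"]
model = fit_hierarchy(X, families, aros)
prediction = predict_hierarchy(model, [1.8, 0.5])
```

Each family-level model only has to separate the ARO terms inside its own family, so balancing can be handled within a family rather than across the whole ontology.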

SLIDE 51

Sequence similarity encoding

Sequence bitscore matrix: one row per read (read1 ... readi) and one column per gene (gene1 ... genej); each entry is the bitscore of that read against that gene. Example entries on the slide include 1256, 63, 512, 785 and 129.

Advantages: read length invariant, low dimensionality, reuses the computation already done for filtering.
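A minimal sketch of building such a matrix from homology-search hits, assuming the hits are already available as (read, gene, bitscore) triples. Filling missing pairs with 0 and keeping the best score per pair are assumptions, as are the identifiers and values in the example.

```python
def bitscore_matrix(hits, reads, genes):
    """Dense read-by-gene matrix; pairs with no reported hit stay at 0."""
    row_of = {read: i for i, read in enumerate(reads)}
    column_of = {gene: j for j, gene in enumerate(genes)}
    matrix = [[0.0] * len(genes) for _ in reads]
    for read, gene, bitscore in hits:
        i, j = row_of[read], column_of[gene]
        # Keep the best-scoring alignment when a pair is reported more than once.
        matrix[i][j] = max(matrix[i][j], float(bitscore))
    return matrix

# Hypothetical hits, as might be parsed from tabular search output.
reads = ["read1", "read2", "read3"]
genes = ["geneA", "geneB", "geneC"]
hits = [
    ("read1", "geneA", 120.0),
    ("read1", "geneC", 35.5),
    ("read2", "geneB", 88.0),
    ("read1", "geneA", 95.0),
]
encoded = bitscore_matrix(hits, reads, genes)
```

Every read becomes a vector whose length is the number of reference genes, whatever the read length.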

SLIDE 52

Cross-Validation

Encodings:
Raw sequence
Filtering homology search family similarity/dissimilarity
Manual feature extraction (GC/TNF/compositional)
One-hot K-mer representation
K-mer embeddings (DNA2vec/BioVec)
Classifiers:
Random Forests
Naive Bayes
Logistic Regression
Neural Networks of varying architecture (Torch)
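Two of the listed encodings are simple enough to sketch directly: GC content and a k-mer frequency vector, which for k = 4 gives tetranucleotide frequencies (the reading of TNF assumed here). The function names and the choice to skip k-mers containing ambiguous bases are illustrative assumptions.

```python
from itertools import product

def gc_content(sequence):
    """Fraction of G and C bases in a read."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

def kmer_frequencies(sequence, k=4):
    """Relative frequency of every DNA k-mer, in lexicographic A, C, G, T order."""
    sequence = sequence.upper()
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    total = 0
    for start in range(len(sequence) - k + 1):
        position = index.get(sequence[start:start + k])
        if position is not None:  # k-mers with N or other symbols are skipped
            counts[position] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * len(counts)

# Hypothetical read; with k=4 the vector always has 4**4 = 256 entries.
read = "ACGTACGT"
tnf = kmer_frequencies(read)
gc = gc_content(read)
```

Like the bitscore encoding, these vectors have a fixed length regardless of read length, so they can be passed to any of the classifiers listed above.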


SLIDE 53

Cross-validation Model

[Figure: Family Cross-Validation Performance. Metrics: Precision and Recall, as a proportion from 0.0 to 1.0.]

SLIDE 54

Held-out test results

[Figure: Family Test Performance. Precision and Recall, as a proportion from 0.00 to 1.00, for the Normalised Bitscore Random Forest.]

SLIDE 55

ARO level classification more variable

[Figure: Median Precision-Recall Within Families. Precision and Recall (proportion, 0.00 to 1.00) against Ordered AMR Family Index (25 to 225).]

SLIDE 56

Family diversity as explanation?

[Figure: Precision (0.0 to 1.0) against AMR Family Cardinality (100 to 300).]

SLIDE 57

Within family label imbalance

[Figure: Precision (0.0 to 1.0) against ARO Proportion of Family Size (0.0 to 1.0).]

SLIDE 58

On-going Problems

Multiset prediction when insufficient signal.
Systematic benchmarking.
Full end-to-end comparisons with other approaches (soliciting ideas!)
rRNA and variant models (not discussed here).
Integration into CARD platform and IRIDA.


SLIDE 59

Acknowledgements

SLIDE 60

Acknowledgements

McMaster University: Kara Tsang, Brian Alcock, and Andrew McArthur
Simon Fraser University: Fiona Brinkman
Dalhousie University: Robert Beiko
Funding: Genome Canada, NSERC Undergraduate Student Research Award, Donald Hill Family Fellowship

SLIDE 61

References

Bradley, P., Gordon, N. C., Walker, T. M., Dunn, L., Heys, S., Huang, B., Earle, S., Pankhurst, L. J., Anson, L., De Cesare, M., et al. (2015). Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis. Nature Communications, 6:10063.

Huang, Y., Gilna, P., and Li, W. (2009). Identification of ribosomal RNA genes in metagenomic fragments. Bioinformatics, 25(10):1338–1340.

Kopylova, E., Noé, L., and Touzet, H. (2012). SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics, 28(24):3211–3217.

SLIDE 62

McArthur, A. G., Waglechner, N., Nizam, F., Yan, A., Azad, M. A., Baylay, A. J., Bhullar, K., Canova, M. J., De Pascale, G., Ejim, L., et al. (2013). The comprehensive antibiotic resistance database. Antimicrobial Agents and Chemotherapy, 57(7):3348–3357.

Schmieder, R., Lim, Y. W., and Edwards, R. (2011). Identification and removal of ribosomal RNA sequences from metatranscriptomes. Bioinformatics, 28(3):433–435.

SLIDE 63

Backup


SLIDE 64

Variant Models

SLIDE 65

Ribosomal Variant Models

Metagenomic reads -> Ribosomal fragment identification -> 5S/16S/23S binned reads -> Taxonomic classification -> Taxa binned reads -> Alignment -> SNPs

SLIDE 66

Identifying Ribosomal Reads

MetaRNA (Huang et al., 2009)
Ribopicker (Schmieder et al., 2011)
SortmeRNA (Kopylova et al., 2012)
77 models
Reads simulated from the underlying 30 species reference genomes


SLIDE 70

Identifying Taxonomy


SLIDE 71

Some are relatively easy


SLIDE 72

Others are a mess


SLIDE 73

Some are group ambiguous

Probably a Mycobacterium?


SLIDE 74

Others are just a toss-up


SLIDE 75

Ambiguity in classification


SLIDE 76

Meta-models

SLIDE 77

Meta-models

Efflux Pump
Gene Cluster

[Diagram nodes: Predicted components; Phylogenetic Placement; Phylogenetic groups; Sequence feature extraction; Sequence features; Grouping classifier; Multicomponent AMR hits; CARD; Predicted Phylogenies.]
