AMR and machine-learning: Prediction of AMR from metagenomes among other things


SLIDE 1

AMR and machine-learning

Prediction of AMR from metagenomes among other things

Finlay Maguire finlaymaguire@gmail.com December 3, 2019

Faculty of Computer Science, Dalhousie University

SLIDE 2

Table of contents

  • 1. Genomic Phenotype Prediction
  • 2. Non-Bioinformatics Interlude
  • 3. AMRtime

SLIDE 3

Genomic Phenotype Prediction

SLIDE 4

Antibiotic Susceptibility Testing

Bradley et al. (2015)

SLIDE 5

AAFC Salmonella Data-set

[Figure: tree of the data-set isolates, with tip labels such as 3193 (V2), 3125 (G), 3146 (I); scale bar: 0.056229]

SLIDE 6

Genomic RGI Predictions

SLIDE 7

Linking AMR determinants to Phenotype

McArthur et al. (2013)

SLIDE 8

Logistic Regression

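RGI predictions (genomes by AMR determinants):

$$
\mathrm{RGI} =
\begin{array}{l|cccc}
 & \mathrm{amr}_1 & \mathrm{amr}_2 & \cdots & \mathrm{amr}_J \\ \hline
\mathrm{genome}_1 & x_{11} & x_{12} & \cdots & x_{1J} \\
\mathrm{genome}_2 & x_{21} & x_{22} & \cdots & x_{2J} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mathrm{genome}_I & x_{I1} & x_{I2} & \cdots & x_{IJ}
\end{array},
\qquad x_{ij} \in \{0, 1\}
$$

where $x_{ij} = 1$ when AMR determinant $j$ is detected in genome $i$.

AST results (genomes by antibiotics; S = susceptible, R = resistant):

$$
\mathrm{AST} =
\begin{array}{l|cccc}
 & \mathrm{abx}_1 & \mathrm{abx}_2 & \cdots & \mathrm{abx}_K \\ \hline
\mathrm{genome}_1 & S & S & \cdots & R \\
\mathrm{genome}_2 & R & R & \cdots & S \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mathrm{genome}_I & S & S & \cdots & S
\end{array}
$$

Model:

$$
\beta \, \mathrm{RGI} = \mathrm{AST}
$$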
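As a rough illustration of linking a binary RGI matrix to AST labels, the sketch below fits one logistic regression for a single antibiotic using plain batch gradient descent. The toy matrices, the learning rate, the epoch count and the function names (`fit_logistic`, `predict`) are all assumptions made for this example; the slides do not specify the fitting procedure or data layout actually used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the mean log-loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this genome under the current weights.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    """Return 'R' when the modelled probability of resistance is at least 0.5."""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return "R" if p >= 0.5 else "S"


# Toy RGI matrix (made up): rows are genomes, columns are AMR determinants.
RGI = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
]
# Toy AST column for one antibiotic (made up): resistance tracks determinant 1.
AST_abx1 = ["R", "R", "S", "S", "S", "R"]

y = [1 if label == "R" else 0 for label in AST_abx1]
w, b = fit_logistic(RGI, y)
predictions = [predict(w, b, row) for row in RGI]
```

With one such model per antibiotic, the learnt weights `w` indicate which determinants drive each resistance call, which is the kind of inspection the later "Learnt features/weights" slide refers to.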

SLIDE 9

Set-Covering Machines

Genomes → Decompose into K-mers → Genomic K-mers
Genomic K-mers + AST → Set-Covering Machine → Boolean K-mer Rules
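The following is a much-simplified sketch of the idea behind this slide: genomes are decomposed into k-mers, and a greedy learner builds a conjunction of k-mer presence/absence rules that separates resistant from susceptible genomes. It is written in the spirit of a set-covering machine but is not the implementation used in the talk; the utility function, the penalty `p`, the rule cap and the toy sequences are assumptions for illustration.

```python
def kmers(seq, k):
    """Set of all k-mers occurring in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train_scm(genomes, labels, k=4, max_rules=3, p=1.0):
    """Greedily build a conjunction of k-mer presence/absence rules.

    labels: True for resistant genomes, False for susceptible ones.
    A genome is predicted resistant only if every rule in the model holds.
    """
    profiles = [kmers(g, k) for g in genomes]
    all_kmers = sorted(set().union(*profiles))
    candidates = [(km, present) for km in all_kmers for present in (True, False)]

    def holds(rule, profile):
        km, present = rule
        return (km in profile) == present

    neg = {i for i, lab in enumerate(labels) if not lab}
    pos = {i for i, lab in enumerate(labels) if lab}
    model = []
    while neg and len(model) < max_rules:
        best, best_utility = None, 0
        for rule in candidates:
            # Susceptible genomes correctly excluded by this rule.
            covered = sum(1 for i in neg if not holds(rule, profiles[i]))
            # Resistant genomes that this rule would wrongly exclude.
            lost = sum(1 for i in pos if not holds(rule, profiles[i]))
            utility = covered - p * lost
            if utility > best_utility:
                best, best_utility = rule, utility
        if best is None:
            break
        model.append(best)
        # Keep only the examples that still satisfy every rule so far.
        neg = {i for i in neg if holds(best, profiles[i])}
        pos = {i for i in pos if holds(best, profiles[i])}
    return model


def predict_scm(model, genome, k=4):
    profile = kmers(genome, k)
    return all((km in profile) == present for km, present in model)


# Toy sequences (made up): the two resistant genomes share the k-mer TTGC.
genomes = ["ACGTTGCA", "GGTTGCAT", "ACGTAGCA", "GGTAGCAT"]
labels = [True, True, False, False]
model = train_scm(genomes, labels)
```

The resulting model is a short list of `(k-mer, present)` pairs, so each prediction can be read directly as a boolean rule over the genome.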

SLIDE 10

AST Prediction Performance

[Figure panels] A: RGI; B: RGI-efflux; C: Logistic Regression; D: Set-Covering Machines.

A Major Disagreement is an overprediction of resistance; a Very Major Disagreement is an underprediction of resistance.
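To make the two disagreement categories concrete, here is a small sketch that counts them from paired observed and predicted S/R calls. The choice of denominators (susceptible isolates for major, resistant isolates for very major) is an assumption following common practice, not something stated on the slide, and the example calls are invented.

```python
def disagreement_rates(observed, predicted):
    """Count major and very major disagreements between AST and predictions.

    Major: observed susceptible (S) but predicted resistant (R).
    Very major: observed resistant (R) but predicted susceptible (S).
    """
    pairs = list(zip(observed, predicted))
    major = sum(1 for o, p in pairs if o == "S" and p == "R")
    very_major = sum(1 for o, p in pairs if o == "R" and p == "S")
    n_susceptible = observed.count("S")
    n_resistant = observed.count("R")
    return {
        "major": major,
        "very_major": very_major,
        # Assumed denominators: isolates for which that error is possible.
        "major_rate": major / n_susceptible if n_susceptible else 0.0,
        "very_major_rate": very_major / n_resistant if n_resistant else 0.0,
    }


# Invented example: 4 susceptible and 4 resistant isolates.
observed = ["S", "S", "S", "S", "R", "R", "R", "R"]
predicted = ["S", "R", "S", "S", "R", "S", "S", "R"]
rates = disagreement_rates(observed, predicted)
```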

SLIDE 11

Learnt features/weights

SLIDE 12

Extending beyond Salmonella

ARO Predictions (Kara Tsang)

SLIDE 13

Extending beyond Salmonella

Logistic Regression

SLIDE 16

Genomic AST Prediction

  • Using direct annotations works very poorly across different organisms and resistance mechanisms.
  • Even very simple logistic regression models greatly improve predictions.
  • Investigation of learnt weights and features can be very scientifically informative.

SLIDE 17

Non-Bioinformatics Interlude

SLIDE 21
  • Non-profits have data and lots of contextualising knowledge.
  • No time or resources to analyse or use it.
  • Informaticians have the skills and resources but no specific understanding of the context.
  • Many low-hanging fruit that can make big differences.

SLIDE 22

Refugee Women’s Health Clinic

SLIDE 23

Staff Scheduling

SLIDE 24

Language Development in Autism

Qualitative Social Media Analysis (Tamara Sorenson-Duncan)

SLIDE 25

Alpha Diversity of Posting Activity

SLIDE 26

Beta Diversity of Posting Activity

SLIDE 27

Other on-going Projects

  • Halifax Community Learning Network
  • Shelter Nova Scotia
  • 211 Nova Scotia

SLIDE 28

AMRtime

SLIDE 29

AMR-metagenomics

Genomes → Sequencing → Reads → AMR detection → AMR Genes

SLIDE 30

Why is this difficult?

SLIDE 31

AMR genes are rare genomically

[Figure: AMR Reads in Metagenome (0.643%); axis: log(Read Count), ticks at 10^7 and 10^8; categories: All (~324M), AMR (~2.1M)]

2184 CARD-Prevalence Genomes at 1-10X abundance

SLIDE 32

AMR genes have wildly different abundances

1236 AMR PATRIC genomes

SLIDE 33

AMR genes have highly variable diversity

SLIDE 34

AMR sequence space overlaps

[Figure: MDS of CARD Proteins BLASTP-%ID; two panels: Actual Families, Affinity Clusters (Adj. Rand=0.30041)]
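For readers unfamiliar with the adjusted Rand index quoted in this figure, the sketch below computes it from two labelings of the same items using the standard pair-counting formula. The family and cluster labels are invented toy values; they are not the CARD families or the affinity clusters from the talk.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same items."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions trivially agree
    # Pairs of items placed together in both partitions.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values()
    )
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Invented toy labels: two families, imperfectly recovered as three clusters.
families = ["A", "A", "A", "B", "B", "B"]
clusters = [0, 0, 1, 1, 2, 2]
ari = adjusted_rand_index(families, clusters)
```

A value of 1 means the clustering reproduces the families exactly, while values near 0 mean agreement no better than chance, so a score around 0.3 indicates that clusters in sequence space only partly follow family boundaries.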

SLIDE 35

Insufficient Signal in 250bp Fragments

NDM Multiple Sequence Alignment

SLIDE 37

Other constraints

  • No point doing what we do if people can’t use it.
  • Limited hardware requirements (a standard workstation or instance: < 8-12 GB memory, 1-8 cores).

  • Fast enough (< 12 hours).
  • Easy to install/configure.
  • Easy to use.
  • Easy to update.

SLIDE 38

AMRtime

SLIDE 39

AMRtime structure

Diagram legend: Input files; Processes; Intermediate files; Output files.

Diagram nodes, in order: Metagenomic Reads; AMR Filtering; Filtered reads; Sensitive Homology Classification; CARD; Homology predictions; Variant Identification; Variant predictions; Metamodels; Metamodel predictions.

SLIDE 40

Read filtering

SLIDE 41

Homology Filter Approaches

[Figure: Relative Computational Demands; axes: Elapsed Time (hours, ticks 10-50) and Max Resident Memory (GB, ticks 2-8); tools: blastn, biobloom, groot, bwa, bowtie2, hmmsearch_nt, blastx, diamond_blastx, paladin, blastp, diamond_blastp, hmmsearch_aa]

SLIDE 42

Precision-Recall of Homology Search

[Figure: axes: Recall (0.0-1.0) and Precision (0.2-1.0); points grouped by Paradigm: BWT, BLAST, k-mer, HMM]

SLIDE 43

Optimising for recall

[Figure: axes: Recall (0.90-1.00) and Precision (0.90-1.00); tools: blastx, bwa, diamond_blastx, paladin, blastp, diamond_blastp]

SLIDE 44

Sensitive Homology Classification

SLIDE 45

Dealing with imbalanced training data

Simulated AMR Reads (.fq) → Encoding → Encoded Reads
Encoded Reads + Labels (.tsv) → Stratified Test-Train (20%) Split → Training Data, Testing Data
Training Data → SMOTE → Resampled Training Data → Stratified 5-fold CV → Training Data Folds
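The two central steps of this workflow can be sketched in plain Python: a stratified 20% hold-out split, followed by SMOTE-style oversampling applied to the training portion only. This is a minimal illustration under stated assumptions (numeric feature vectors, Euclidean nearest neighbours, `k=3`, fixed seeds, made-up data and family labels) and not the pipeline's actual code; the cross-validation fold step is omitted for brevity.

```python
import random
from collections import defaultdict


def stratified_split(y, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), holding out a fraction of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_fraction))
        test_idx += idxs[:n_test]
        train_idx += idxs[n_test:]
    return train_idx, test_idx


def smote(X, y, k=3, seed=0):
    """Oversample each smaller class up to the size of the largest class.

    Each synthetic point is interpolated between a random sample and one of
    its k nearest same-class neighbours.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            others = [r for r in rows if r is not base]
            if not others:
                # A single-member class can only be duplicated.
                X_out.append(list(base))
                y_out.append(label)
                continue
            nearest = sorted(
                others,
                key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
            )[:k]
            neighbour = rng.choice(nearest)
            gap = rng.random()
            X_out.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
            y_out.append(label)
    return X_out, y_out


# Made-up encoded reads: 20 from a large family and 5 from a small one.
X = [[float(i), 0.0] for i in range(20)] + [[float(i), 10.0] for i in range(5)]
y = ["famA"] * 20 + ["famB"] * 5

train_idx, test_idx = stratified_split(y)
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
X_res, y_res = smote(X_train, y_train)
```

Splitting before resampling matters: if synthetic points were generated first, interpolated copies of test reads could leak into the training data and inflate the held-out scores.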

SLIDE 46

What is balance?

  • Different gene lengths within families (coverage vs read number)?
  • Different family sizes?
  • Different family diversity?
  • Using a generator to improve on SMOTE.

SLIDE 49

Initial classifier

Training Data → Classifier → ARO predictions

NB 7-mer Average Precision: 0.63 %
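A naive Bayes classifier over 7-mers can be sketched as a multinomial model with Laplace smoothing, as below. The class name `KmerNaiveBayes`, the smoothing constant, the handling of unseen k-mers and the toy reads and labels are assumptions for illustration; the slide does not say which naive Bayes variant or implementation was used.

```python
import math
from collections import Counter, defaultdict


def kmer_counts(seq, k=7):
    """Count every overlapping k-mer in a read."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


class KmerNaiveBayes:
    """Multinomial naive Bayes over k-mer counts with Laplace smoothing."""

    def __init__(self, k=7, alpha=1.0):
        self.k = k
        self.alpha = alpha

    def fit(self, reads, labels):
        self.class_counts = Counter(labels)
        self.kmer_totals = defaultdict(Counter)
        for read, label in zip(reads, labels):
            self.kmer_totals[label].update(kmer_counts(read, self.k))
        self.vocab = set()
        for counts in self.kmer_totals.values():
            self.vocab.update(counts)
        return self

    def log_scores(self, read):
        n_reads = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_label in self.class_counts.items():
            total = sum(self.kmer_totals[label].values())
            score = math.log(n_label / n_reads)  # log prior
            for kmer, count in kmer_counts(read, self.k).items():
                if kmer not in self.vocab:
                    continue  # assumption: k-mers unseen in training are ignored
                likelihood = (self.kmer_totals[label][kmer] + self.alpha) / (
                    total + self.alpha * vocab_size
                )
                score += count * math.log(likelihood)
            scores[label] = score
        return scores

    def predict(self, read):
        scores = self.log_scores(read)
        return max(scores, key=scores.get)


# Made-up reads and placeholder labels (not real ARO terms).
reads = [
    "ATGGCGTACGTTAGC",
    "GGCGTACGTTAGCAA",
    "TTTCCCGGGAAATTT",
    "CCCGGGAAATTTCCC",
]
labels = ["label_A", "label_A", "label_B", "label_B"]
nb = KmerNaiveBayes(k=7).fit(reads, labels)
```

Calling `nb.predict(read)` on a new read returns the label whose k-mer profile gives the highest log score.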

SLIDE 50

Revised classifier structure: exploiting the ARO

Training Data → AMR Family Classifier → AMR Families
For each family n = 1 ... N: Family n → SMOTE → Family n Data → Family n Classifier
Family classifiers → ARO predictions
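The two-level structure on this slide can be sketched as a family-level model that routes each read to a per-family model, which then predicts the specific ARO term. The base model below (`KmerOverlapClassifier`, a simple shared-k-mer vote) is a stand-in chosen to keep the example self-contained; it, the class names, the toy reads and the placeholder labels are assumptions, and the per-family SMOTE step is left out.

```python
from collections import defaultdict


def kmers(seq, k=5):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class KmerOverlapClassifier:
    """Stand-in base model: predict the label sharing the most k-mers."""

    def __init__(self, k=5):
        self.k = k
        self.index = defaultdict(set)

    def fit(self, reads, labels):
        for read, label in zip(reads, labels):
            self.index[label] |= kmers(read, self.k)
        return self

    def predict(self, read):
        query = kmers(read, self.k)
        # Sorting the labels makes tie-breaking deterministic.
        return max(sorted(self.index), key=lambda lab: len(query & self.index[lab]))


class HierarchicalClassifier:
    """Family-level model followed by one ARO-level model per family."""

    def __init__(self, make_model=KmerOverlapClassifier):
        self.make_model = make_model

    def fit(self, reads, families, aros):
        self.family_model = self.make_model().fit(reads, families)
        grouped = defaultdict(lambda: ([], []))
        for read, family, aro in zip(reads, families, aros):
            grouped[family][0].append(read)
            grouped[family][1].append(aro)
        # Each family gets its own model trained only on that family's reads.
        self.aro_models = {
            family: self.make_model().fit(fam_reads, fam_aros)
            for family, (fam_reads, fam_aros) in grouped.items()
        }
        return self

    def predict(self, read):
        family = self.family_model.predict(read)
        return family, self.aro_models[family].predict(read)


# Made-up reads with placeholder family and ARO labels.
reads = ["ACGTACGTAAAC", "ACGTACGTTTTG", "GGCCGGCCATAT"]
families = ["family_1", "family_1", "family_2"]
aros = ["aro_1a", "aro_1b", "aro_2a"]
hier = HierarchicalClassifier().fit(reads, families, aros)
```

One consequence of this design is that a wrong family call cannot be recovered at the second level, so family-level precision and recall bound the ARO-level results.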

SLIDE 51

Sequence similarity encoding

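Sequence bitscore matrix (reads by genes):

$$
B =
\begin{array}{l|ccccc}
 & \mathrm{gene}_1 & \mathrm{gene}_2 & \cdots & \mathrm{gene}_{j-1} & \mathrm{gene}_j \\ \hline
\mathrm{read}_1 & b_{1,1} & b_{1,2} & \cdots & b_{1,j-1} & b_{1,j} \\
\mathrm{read}_2 & b_{2,1} & b_{2,2} & \cdots & b_{2,j-1} & b_{2,j} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\mathrm{read}_{i-1} & b_{i-1,1} & b_{i-1,2} & \cdots & b_{i-1,j-1} & b_{i-1,j} \\
\mathrm{read}_i & b_{i,1} & b_{i,2} & \cdots & b_{i,j-1} & b_{i,j}
\end{array}
$$

Each entry is the bitscore of a read against a gene; the slide shows example values such as 1256, 63, 512, 785 and 129.

Advantages: read-length invariant, low dimensionality, reuses the computation already done during filtering.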
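One way to build such a matrix is directly from the tabular output of the homology search. The sketch below assumes BLAST-style 12-column tabular hits (query, subject, ..., bitscore in the last column), keeps the best bitscore per read and gene, and fills missing pairs with 0. The input format, the function name `bitscore_matrix` and the example hits are assumptions; the slides do not state the format used in AMRtime.

```python
def bitscore_matrix(hit_lines, genes):
    """Build {read: [bitscore per gene]} from 12-column tabular hits.

    Columns are assumed to follow the BLAST tabular layout, in which the
    query is column 1, the subject is column 2 and the bitscore is column 12.
    """
    column = {gene: j for j, gene in enumerate(genes)}
    rows = {}
    for line in hit_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        read, gene, bitscore = fields[0], fields[1], float(fields[11])
        if gene not in column:
            continue  # hit against a gene outside the chosen feature set
        row = rows.setdefault(read, [0.0] * len(genes))
        # Keep only the best-scoring hit for each read/gene pair.
        row[column[gene]] = max(row[column[gene]], bitscore)
    return rows


# Invented hits for illustration only.
hits = [
    "read1\tgeneA\t99.0\t250\t2\t0\t1\t250\t1\t250\t1e-120\t450.0",
    "read1\tgeneB\t80.0\t200\t40\t0\t1\t200\t5\t204\t1e-30\t120.5",
    "read2\tgeneB\t95.0\t250\t12\t0\t1\t250\t1\t250\t1e-100\t400.0",
    "read1\tgeneA\t90.0\t100\t10\t0\t1\t100\t300\t399\t1e-20\t90.0",
]
genes = ["geneA", "geneB", "geneC"]
matrix = bitscore_matrix(hits, genes)
```

Because every read maps to a fixed-length vector regardless of its own length, the rows can be fed straight into the classifiers compared in cross-validation.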

SLIDE 52

Cross-Validation

  • Encodings:
      • Raw sequence
      • Filtering homology search family similarity/dissimilarity
      • Manual feature extraction (GC/TNF/compositional)
      • One-hot K-mer representation
      • K-mer embeddings (DNA2vec/BioVec)
  • Classifiers:
      • Random Forests
      • Naive Bayes
      • Logistic Regression
      • Neural Networks of varying architecture (Torch)
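As an example of the manual compositional encodings listed above, the sketch below computes GC content and a tetranucleotide frequency (TNF) vector for a read. It is a simplified illustration: reverse-complement k-mers are not merged, windows containing non-ACGT characters are skipped, and the function names are assumptions rather than names from the talk.

```python
from itertools import product

# All 256 tetranucleotides in a fixed order, so every read gets the same layout.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def tetranucleotide_frequencies(seq):
    """Relative frequency of each 4-mer among the valid 4-mers in the read."""
    seq = seq.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skips windows containing N or other symbols
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]


def compositional_features(seq):
    """Concatenate GC content and TNF into one 257-element feature vector."""
    return [gc_content(seq)] + tetranucleotide_frequencies(seq)


features = compositional_features("ACGTACGT")
```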

SLIDE 53

Cross-validation Model

[Figure: Family Cross-Validation Performance; axis: Proportion (0.0-1.0); Metric: Precision, Recall]

SLIDE 54

Held-out test results

[Figure: Family Test Performance for the Normalised Bitscore Random Forest; axis: Proportion (0.00-1.00); metrics: Precision, Recall]

SLIDE 55

ARO level classification more variable

[Figure: Median Precision-Recall Within Families; axes: Ordered AMR Family Index (25-225) and Proportion (0.00-1.00); series: Precision, Recall]

SLIDE 56

Family diversity as explanation?

[Figure: axes: AMR Family Cardinality (100-300) and Precision (0.0-1.0)]

SLIDE 57

Within family label imbalance

[Figure: axes: ARO Proportion of Family Size (0.0-1.0) and Precision (0.0-1.0)]

SLIDE 58

On-going Problems

  • Multiset prediction when insufficient signal.
  • Systematic benchmarking.
  • Full end-to-end comparisons with other approaches (soliciting ideas!)
  • rRNA and variant models (not discussed here).
  • Integration into CARD platform and IRIDA.

SLIDE 59

Acknowledgements

SLIDE 60

Acknowledgements

  • McMaster University: Kara Tsang, Brian Alcock, and Andrew McArthur
  • Simon Fraser University: Fiona Brinkman
  • Dalhousie University: Robert Beiko
  • Funding: Genome Canada, NERC Undergraduate Student Research Award, Donald Hill Family Fellowship

SLIDE 61

References

Bradley, P., Gordon, N. C., Walker, T. M., Dunn, L., Heys, S., Huang, B., Earle, S., Pankhurst, L. J., Anson, L., De Cesare, M., et al. (2015). Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis. Nature Communications, 6:10063.

Huang, Y., Gilna, P., and Li, W. (2009). Identification of ribosomal RNA genes in metagenomic fragments. Bioinformatics, 25(10):1338–1340.

Kopylova, E., Noé, L., and Touzet, H. (2012). SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics, 28(24):3211–3217.

SLIDE 62

McArthur, A. G., Waglechner, N., Nizam, F., Yan, A., Azad, M. A., Baylay, A. J., Bhullar, K., Canova, M. J., De Pascale, G., Ejim, L., et al. (2013). The comprehensive antibiotic resistance database. Antimicrobial Agents and Chemotherapy, 57(7):3348–3357.

Schmieder, R., Lim, Y. W., and Edwards, R. (2011). Identification and removal of ribosomal RNA sequences from metatranscriptomes. Bioinformatics, 28(3):433–435.

SLIDE 63

Backup

SLIDE 64

Variant Models

SLIDE 65

Ribosomal Variant Models

Metagenomic reads → Ribosomal fragment identification → 5S/16S/23S binned reads → Taxonomic classification → Taxa binned reads → Alignment → SNPs

SLIDE 66

Identifying Ribosomal Reads

  • MetaRNA (Huang et al., 2009)
  • riboPicker (Schmieder et al., 2011)
  • SortMeRNA (Kopylova et al., 2012)
  • 77 models
  • Reads simulated from the underlying 30 species reference genomes

SLIDE 70

Identifying Taxonomy

SLIDE 71

Some are relatively easy

SLIDE 72

Others are a mess

SLIDE 73

Some are group ambiguous

Probably a Mycobacterium?

SLIDE 74

Others are just a toss-up

SLIDE 75

Ambiguity in classification

SLIDE 76

Meta-models

SLIDE 77

Meta-models

  • Efflux Pump
  • Gene Cluster

Diagram nodes, in order: Predicted components; Phylogenetic Placement; Phylogenetic groups; Sequence feature extraction; Sequence features; Grouping classifier; Multicomponent AMR hits; CARD; Predicted Phylogenies.
