AMR and machine learning

AMR and machine-learning: Prediction of AMR from metagenomes among other things. Finlay Maguire (finlaymaguire@gmail.com), December 3, 2019, Faculty of Computer Science, Dalhousie University.


  1. AMR and machine-learning: Prediction of AMR from metagenomes among other things. Finlay Maguire (finlaymaguire@gmail.com), December 3, 2019, Faculty of Computer Science, Dalhousie University

  2. Table of contents: 1. Genomic Phenotype Prediction; 2. Non-Bioinformatics Interlude; 3. AMRtime

  3. Genomic Phenotype Prediction

  4. Antibiotic Susceptibility Testing (Bradley et al., 2015)

  5. AAFC Salmonella Data-set [Figure: isolate IDs with parenthesised group labels, e.g. 3132 (Q2), 3314 (S1), 2003 (A1); scale value 0.056229]

  6. Genomic RGI Predictions

  7. Linking AMR determinants to Phenotype (McArthur et al., 2013)

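  8. Logistic Regression

     \[
     \mathrm{RGI} =
     \begin{array}{c|cccc}
                     & \text{amr}_1 & \text{amr}_2 & \cdots & \text{amr}_J \\ \hline
     \text{genome}_1 & 1      & 0      & \cdots & 1      \\
     \text{genome}_2 & 0      & 1      & \cdots & 1      \\
     \vdots          & \vdots & \vdots & \ddots & \vdots \\
     \text{genome}_I & 0      & 0      & \cdots & 1
     \end{array}
     \qquad
     \mathrm{AST} =
     \begin{array}{c|cccc}
                     & \text{abx}_1 & \text{abx}_2 & \cdots & \text{abx}_K \\ \hline
     \text{genome}_1 & S      & S      & \cdots & R      \\
     \text{genome}_2 & R      & R      & \cdots & S      \\
     \vdots          & \vdots & \vdots & \ddots & \vdots \\
     \text{genome}_I & S      & S      & \cdots & S
     \end{array}
     \]

     \[
     \beta \, \mathrm{RGI} = \mathrm{AST}
     \]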
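A minimal sketch of the idea on this slide: learn weights that map a binary matrix of detected AMR determinants (rows are genomes) to resistant/susceptible calls for one antibiotic. The toy data, the function names `train_logistic` and `predict`, and the plain gradient-descent fit are all assumptions for illustration; the slides do not say how the model was actually fitted.

```python
import math

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(resistant)
            err = p - label
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(w, b, row):
    """Call 'R' when the linear score is positive (P > 0.5), else 'S'."""
    z = b + sum(wj * xj for wj, xj in zip(w, row))
    return "R" if z > 0 else "S"

# Made-up toy data: rows = genomes, columns = AMR determinants (1 = detected).
rgi = [[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 0]]
# Made-up AST calls for a single antibiotic: 1 = resistant, 0 = susceptible.
ast = [1, 0, 0, 1]

w, b = train_logistic(rgi, ast)
preds = [predict(w, b, row) for row in rgi]
```

With one such model per antibiotic, the learnt weights can be inspected per determinant, which is the kind of inspection the later "Learnt features/weights" slide refers to.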

  9. Set-Covering Machines [Diagram labels: Genomes; AST; Decompose into K-mers; Genomic K-mers; Set-Covering Machine; Boolean K-mer Rules]
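To make the pipeline on this slide concrete, here is a deliberately simplified sketch: genomes are decomposed into k-mer sets, and a conjunction of k-mer presence/absence rules is built greedily. It is not the full Set Covering Machine algorithm; the utility function, the `penalty` parameter, the function names and the toy sequences are assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of k-mers present in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def greedy_conjunction(genomes, labels, k=4, max_rules=3, penalty=1.0):
    """Greedily build a conjunction of (k-mer, must_be_present) rules.

    A genome is predicted resistant only if every rule holds for it.
    """
    profiles = [kmers(g, k) for g in genomes]
    all_kmers = sorted(set().union(*profiles))
    candidates = [(km, present) for km in all_kmers for present in (True, False)]
    pos = {i for i, y in enumerate(labels) if y}
    neg = {i for i, y in enumerate(labels) if not y}
    rules = []
    while neg and len(rules) < max_rules:
        def holds_for(rule):
            km, present = rule
            return {i for i, p in enumerate(profiles) if (km in p) == present}

        def utility(rule):
            holds = holds_for(rule)
            rejected_neg = len(neg - holds)   # susceptible genomes ruled out
            rejected_pos = len(pos - holds)   # resistant genomes wrongly ruled out
            return rejected_neg - penalty * rejected_pos

        best = max(candidates, key=utility)
        if utility(best) <= 0:
            break
        holds = holds_for(best)
        rules.append(best)
        neg &= holds   # keep only negatives not yet ruled out
        pos &= holds   # keep only positives still predicted resistant
    return rules

def predict(rules, genome, k=4):
    profile = kmers(genome, k)
    return all((km in profile) == present for km, present in rules)

# Made-up toy sequences: True = resistant, False = susceptible.
genomes = ["ACGTGGCA", "TTGTGGAA", "ACGTAACA", "TTGAAACC"]
labels = [True, True, False, False]
rules = greedy_conjunction(genomes, labels)
```

The output is a short, human-readable rule list, which is what makes this family of models attractive for interpretation.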

  10. AST Prediction Performance [Figure panels: A: RGI, B: RGI-efflux, C: Logistic Regression, D: Set Covering Machines]. Major Disagreement is overprediction of resistance; Very Major Disagreement is underprediction.
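The two error types defined on this slide can be counted directly from paired calls. A small sketch, assuming predictions and laboratory AST results are given as parallel lists of "R"/"S" strings (the function name and toy values are illustrative, and only raw counts are returned because the slides do not state which denominators were used):

```python
def disagreement_counts(predicted, observed):
    """Count agreements, major and very major disagreements.

    major      = predicted resistant, observed susceptible (overprediction)
    very_major = predicted susceptible, observed resistant (underprediction)
    """
    pairs = list(zip(predicted, observed))
    major = sum(1 for p, o in pairs if p == "R" and o == "S")
    very_major = sum(1 for p, o in pairs if p == "S" and o == "R")
    return {
        "agreement": len(pairs) - major - very_major,
        "major": major,
        "very_major": very_major,
    }

# Made-up calls for five isolates.
predicted = ["R", "S", "R", "S", "S"]
observed = ["R", "S", "S", "R", "S"]
counts = disagreement_counts(predicted, observed)
```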

  11. Learnt features/weights [Figure panels: A, B]

  12. Extending beyond Salmonella: ARO Predictions (Kara Tsang)

  13. Extending beyond Salmonella: Logistic Regression

  16. Genomic AST Prediction
      • Using direct annotations works very poorly across different organisms and resistance mechanisms.
      • Even very simple logistic regression models greatly improve predictions.
      • Investigation of learnt weights and features can be very scientifically informative.

  17. Non-Bioinformatics Interlude

  21. • Non-profits have data and lots of contextualising knowledge.
      • No time or resources to analyse or use it.
      • Informaticians have the skills and resources but no specific understanding of the context.
      • Many low-hanging fruit that can make big differences.

  22. Refugee Women’s Health Clinic

  23. Staff Scheduling

  24. Language Development in Autism: Qualitative Social Media Analysis (Tamara Sorenson-Duncan)

  25. Alpha Diversity of Posting Activity

  26. Beta Diversity of Posting Activity

  27. Other on-going Projects
      • Halifax Community Learning Network
      • Shelter Nova Scotia
      • 211 Nova Scotia

  28. AMRtime

  29. AMR-metagenomics [Diagram labels: Genomes; Sequencing; Reads; AMR detection; AMR Genes]

  30. Why is this difficult?

  31. AMR genes are rare genomically [Figure: AMR Reads in Metagenome (0.643%); y-axis: log(Read Count); bars: All (~324M), AMR (~2.1M); 2184 CARD-Prevalence Genomes at 1-10X abundance]

  32. AMR genes have wildly different abundances [Figure: 1236 AMR PATRIC genomes]

  33. AMR genes have highly variable diversity

  34. AMR sequence space overlaps [Figure: MDS of CARD Proteins by BLASTP %ID; panels: Actual Families, Affinity Clusters (Adj. Rand = 0.30041)]
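The "Adj. Rand" value on this slide is an adjusted Rand index, which scores agreement between two labelings of the same items (here, curated families versus clusters) with 1.0 for identical partitions and values near 0 for chance-level agreement. A self-contained sketch of the standard formula follows; the function name and the toy labels are illustrative, and at least two items are assumed.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items (n >= 2)."""
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))   # contingency table cells
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in a_counts.values())
    sum_b = sum(comb(c, 2) for c in b_counts.values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0   # degenerate case, e.g. both labelings put everything together
    return (sum_cells - expected) / (max_index - expected)

# Same partition with swapped label names, and a maximally crossed partition.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A value of about 0.3, as reported on the slide, therefore indicates that sequence-similarity clusters only partly recover the curated families.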

  35. Insufficient Signal in 250bp Fragments [Figure: NDM Multiple Sequence Alignment]

  36. Insufficient Signal in 250bp Fragments [Figure: NDM Multiple Sequence Alignment]

  37. Other constraints
      • No point doing what we do if people can’t use it.
      • Limited hardware requirements (a standard workstation or instance: < 8-12 GB, 1-8 cores).
      • Fast enough (< 12 hours).
      • Easy to install/configure.
      • Easy to use.
      • Easy to update.

  38. AMRtime

  39. AMRtime structure [Diagram. Legend: Input files, Processes, Intermediate files, Output files. Nodes: Metagenomic Reads; AMR Filtering; Filtered reads; CARD; Sensitive Homology Classification; Homology predictions; Variant Identification; Variant predictions; Metamodels; Metamodel predictions]

  40. Read filtering

  41. Homology Filter Approaches [Figure: Relative Computational Demands; x-axis: Elapsed Time (hours); y-axis: Max Resident Memory (GB); tools: blastn, biobloom, groot, bwa, bowtie2, hmmsearch_nt, blastx, diamond_blastx, paladin, blastp, diamond_blastp, hmmsearch_aa]

  42. Precision-Recall of Homology Search [Figure: x-axis: Recall; y-axis: Precision; paradigms: BWT, BLAST, k-mer, HMM]
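For a read filter, precision and recall can be computed from the set of reads the tool keeps and the set of reads that truly derive from AMR genes (known when reads are simulated). A minimal sketch with hypothetical read IDs and an illustrative function name:

```python
def precision_recall(predicted_amr, true_amr):
    """Precision and recall of a read filter from sets of read IDs."""
    predicted_amr, true_amr = set(predicted_amr), set(true_amr)
    true_positives = len(predicted_amr & true_amr)
    precision = true_positives / len(predicted_amr) if predicted_amr else 0.0
    recall = true_positives / len(true_amr) if true_amr else 0.0
    return precision, recall

# Hypothetical read IDs: the filter keeps four reads, three reads are truly AMR.
kept = {"r1", "r2", "r3", "r4"}
truth = {"r1", "r2", "r5"}
precision, recall = precision_recall(kept, truth)
```

For a first-pass filter, recall is the quantity to protect: a read discarded here cannot be recovered by any later stage, whereas false positives can still be rejected downstream.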

  43. Optimising for recall [Figure: x-axis: Recall; y-axis: Precision (both 0.90-1.00); tools: blastx, bwa, diamond_blastx, paladin, blastp, diamond_blastp]

  44. Sensitive Homology Classification

  45. Dealing with imbalanced training data [Diagram labels: Simulated AMR Reads (.fq); Encoding; Encoded Reads; Labels (.tsv); Stratified Test-Train (20%) Split; Training Data; Testing Data; SMOTE; Resampled Training Data; Stratified 5-fold CV; Training Data Folds]
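The order of operations in this diagram matters: the stratified split comes first and resampling is applied to the training portion only, so no synthetic sample leaks information from the test set. The sketch below illustrates that order with a simplified SMOTE-style oversampler that interpolates towards the single nearest same-class neighbour (standard SMOTE samples among k neighbours). Function names, parameters and toy data are assumptions, not the AMRtime implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), holding out test_fraction of each class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    train, test = [], []
    for idx in by_label.values():
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)

def smote_like(X, y, seed=0):
    """Oversample each minority class up to the size of the largest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row, label in zip(X, y):
        by_label[label].append(row)
    target = max(len(rows) for rows in by_label.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_label.items():
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            others = [r for r in rows if r is not base]
            if others:
                # Interpolate between the sample and its nearest class-mate.
                nearest = min(
                    others,
                    key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
                )
                t = rng.random()
                synthetic = [a + t * (b - a) for a, b in zip(base, nearest)]
            else:
                synthetic = list(base)   # single-member class: duplicate it
            X_out.append(synthetic)
            y_out.append(label)
    return X_out, y_out

# Made-up imbalanced data: 10 samples of class "a", 5 of class "b".
labels = ["a"] * 10 + ["b"] * 5
X = [[float(i), float(i % 3)] for i in range(15)]
train, test = stratified_split(labels)
X_res, y_res = smote_like([X[i] for i in train], [labels[i] for i in train])
```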

  46. What is balance?
      • Different gene lengths within families (coverage vs read number)?
      • Different family sizes?
      • Different family diversity?
      • Using a generator to improve on SMOTE.

  49. Initial classifier [Diagram labels: Training Data; Classifier; ARO predictions]. NB 7-mer Average Precision: 0.63

  50. Revised classifier structure: exploiting the ARO [Diagram labels: Training Data; AMR Family Classifier; AMR Families; per family (Family 1 ... Family N): SMOTE, Data, Classifier; ARO predictions]
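The two-level structure in this diagram can be sketched as one classifier that assigns a read to an AMR family, followed by a per-family classifier that picks the specific ARO term. In the sketch below a nearest-centroid model stands in for whatever classifier is used at each level, per-family SMOTE is omitted, and all class names, label names and feature values are hypothetical.

```python
from collections import defaultdict

class NearestCentroid:
    """Stand-in model; any classifier with fit/predict_one would do."""

    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, value in enumerate(row):
                acc[j] += value
            counts[label] += 1
        self.centroids = {
            label: [v / counts[label] for v in acc] for label, acc in sums.items()
        }
        return self

    def predict_one(self, row):
        def sq_dist(label):
            return sum((a - b) ** 2 for a, b in zip(row, self.centroids[label]))
        return min(self.centroids, key=sq_dist)

class TwoStageClassifier:
    """Family-level classifier followed by one ARO-level classifier per family."""

    def __init__(self, aro_to_family):
        self.aro_to_family = aro_to_family

    def fit(self, X, aro_labels):
        families = [self.aro_to_family[a] for a in aro_labels]
        self.family_model = NearestCentroid().fit(X, families)
        grouped = defaultdict(lambda: ([], []))
        for row, aro, family in zip(X, aro_labels, families):
            grouped[family][0].append(row)
            grouped[family][1].append(aro)
        self.aro_models = {
            family: NearestCentroid().fit(rows, aros)
            for family, (rows, aros) in grouped.items()
        }
        return self

    def predict_one(self, row):
        family = self.family_model.predict_one(row)
        return family, self.aro_models[family].predict_one(row)

# Hypothetical labels: family "famA" has two ARO terms, "famB" has one.
aro_to_family = {"A1": "famA", "A2": "famA", "B1": "famB"}
X = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
aro = ["A1", "A2", "B1", "B1"]
model = TwoStageClassifier(aro_to_family).fit(X, aro)
```

Splitting the problem this way means each second-stage model only has to separate the members of one family, rather than every ARO term at once.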

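  51. Sequence similarity encoding

      \[
      \text{Sequence bitscore matrix} =
      \begin{array}{c|ccccc}
                       & \text{gene}_1 & \text{gene}_2 & \cdots & \text{gene}_{j-1} & \text{gene}_j \\ \hline
      \text{read}_1     & 1256   & 0      & \cdots & 0      & 63     \\
      \text{read}_2     & 0      & 0      & \cdots & 0      & 0      \\
      \vdots            & \vdots & \vdots & \ddots & \vdots & \vdots \\
      \text{read}_{i-1} & 0      & 512    & \cdots & 0      & 0      \\
      \text{read}_i     & 0      & 0      & \cdots & 785    & 129
      \end{array}
      \]

      Advantages: invariant to read length, low dimensionality, reuses the computation from the filtering step.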
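A sketch of how such a matrix could be assembled from homology-search hits, assuming each hit is a `(read_id, gene_id, bitscore)` tuple (for example parsed from tabular search output). The function name, the tuple layout and the choice to keep the best score per read-gene pair are assumptions; the scores reuse the example values from the slide.

```python
def bitscore_matrix(hits, reads, genes):
    """Build a reads x genes matrix of bitscores; missing pairs stay 0."""
    row_of = {read_id: i for i, read_id in enumerate(reads)}
    col_of = {gene_id: j for j, gene_id in enumerate(genes)}
    matrix = [[0.0] * len(genes) for _ in reads]
    for read_id, gene_id, bitscore in hits:
        i, j = row_of[read_id], col_of[gene_id]
        # Keep the best-scoring hit if a read matches the same gene twice.
        matrix[i][j] = max(matrix[i][j], bitscore)
    return matrix

reads = ["read1", "read2", "read3"]
genes = ["gene1", "gene2", "gene3"]
hits = [
    ("read1", "gene1", 1256.0),
    ("read1", "gene3", 63.0),
    ("read3", "gene2", 512.0),
    ("read3", "gene2", 100.0),   # weaker duplicate hit, ignored
]
matrix = bitscore_matrix(hits, reads, genes)
```

Each row has one column per reference gene regardless of how long the read is, which is where the read-length invariance comes from.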

  52. Cross-Validation
      • Encodings:
          • Raw sequence
          • Filtering homology search family similarity/dissimilarity
          • Manual feature extraction (GC/TNF/compositional)
          • One-hot K-mer representation
          • K-mer embeddings (DNA2vec/BioVec)
      • Classifiers:
          • Random Forests
          • Naive Bayes
          • Logistic Regression
          • Neural Networks of varying architecture (Torch)
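As an illustration of the compositional encodings in this list, the sketch below computes GC content and a fixed-length k-mer frequency vector; with `k=4` the latter is a tetranucleotide frequency (TNF) profile. The function names and the choice to skip k-mers containing ambiguous bases are assumptions.

```python
from itertools import product

def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0

def kmer_frequencies(seq, k=4, alphabet="ACGT"):
    """Fixed-length vector of k-mer frequencies in lexicographic k-mer order."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:        # skip k-mers with ambiguous bases such as N
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

gc = gc_content("GGCCAATT")
freqs = kmer_frequencies("ACGTACG", k=2)
```

The vector length depends only on `k` (4^k entries for DNA), so reads of different lengths map to features of the same dimension.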

  53. Cross-validation [Figure: Family Cross-Validation Performance; x-axis: Model; y-axis: Proportion; metrics: Precision, Recall]

  54. Held-out test results [Figure: Family Test Performance, Normalised Bitscore Random Forest; y-axis: Proportion; bars: Precision, Recall]

  55. ARO level classification more variable [Figure: Median Precision-Recall Within Families; x-axis: Ordered AMR Family Index; y-axis: Proportion; series: Precision, Recall]

  56. Family diversity as explanation? [Figure: x-axis: AMR Family Cardinality; y-axis: Precision]

  57. Within family label imbalance [Figure: x-axis: ARO Proportion of Family Size; y-axis: Precision]
