HIV-1 tropism prediction



  1. HIV-1 tropism prediction
     Mattia CF Prosperi, ahnven@yahoo.it
     University of “Roma TRE”, Faculty of Computer Science Engineering, Dept of Computer Science and Automation (DIA)
     via della Vasca Navale, 79, 00149 Rome, Italy

  2. Summary
     • State of the art
       – From charge rule to structural descriptors
     • Roma TRE modelling
       – Data collection
         • Sequence manipulation
         • Enhanced domain coding
       – Univariable analysis and clustering
       – Model technologies
         • Logistic regression and feature selection
         • Validation and comparison with other models
         • Interpretation of relevant features

  3. State of the art
     • Charge rule (De Jong, 1992)
     • Neural Networks, Decision Trees, Support Vector Machines (Resch, Pillai) on 200-300 examples
     • Position Specific Scoring Matrices (Jensen)
     • Support Vector Machines (Sing) on 1,100 examples, with AUC maximisation and CD4+ cell count as an additional input variable
     • Support Vector Machines + Structural Analysis (Sander, 2007) with AUC maximisation
     • Neural Networks for dual-tropism prediction (Lamers, 2008)
     • All models use only the V3 loop
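
The charge rule listed above is simple enough to write down directly. The sketch below assumes its common "11/25" form (a positively charged residue, R or K, at V3 position 11 or 25 suggests CXCR4 usage) and an ungapped 35-residue V3 loop. The function name and both example sequences are illustrative and are not taken from the presentation.

```python
# Assumption: the "11/25" form of the charge rule, with V3 positions numbered from 1.
POSITIVE_RESIDUES = {"R", "K"}

def charge_rule_predicts_x4(v3_loop: str) -> bool:
    """Return True if V3 position 11 or 25 carries a positively charged residue."""
    if len(v3_loop) < 25:
        raise ValueError("V3 loop sequence is too short to apply the 11/25 rule")
    return v3_loop[10] in POSITIVE_RESIDUES or v3_loop[24] in POSITIVE_RESIDUES

# Illustrative sequences: the second differs from the first only by S -> R at position 11.
r5_like = "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"
x4_like = r5_like[:10] + "R" + r5_like[11:]

r5_prediction = charge_rule_predicts_x4(r5_like)
x4_prediction = charge_rule_predicts_x4(x4_like)
```

A rule of this kind looks at only two positions, which is one reason the later models in the list use the whole V3 loop.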

  4. State of the art (2)
     • SVM + Structural Analysis (Sander, 2007) seems to be the best-performing model at present
       – 91.56% accuracy
       – 0.93 AUC
       – Minor criticisms concerning sample collection (all distinct sequences are used, regardless of patient, without accounting for the real sequence population distribution)
       – Improvements gained with the structural analysis over a reference SVM trained only on the V3 dummy variable encoding
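
For readers unfamiliar with the two figures quoted above, here is a minimal sketch of how accuracy and AUC can be computed from predicted X4 probabilities. The labels and scores are toy values, not data from any of the cited studies, and the 0.5 decision threshold is an assumption.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the true label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 1 = X4, 0 = R5.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.4, 0.2]
toy_accuracy = accuracy(toy_labels, toy_scores)
toy_auc = auc(toy_labels, toy_scores)
```

Accuracy depends on the chosen threshold, while AUC summarises ranking quality over all thresholds, which is why both are usually reported together.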

  5. Roma TRE approach
     • Data: collection of samples from the “Los Alamos” database
       – Only one sequence per patient (the longest available, no clones), except for sequences with different tropism
       – No problematic sequences
       – All subtypes
       – At least the V3 loop, possibly the whole envelope gene
       – Clinical markers recorded
     • Goal: prediction of CXCR4 usage probability (regardless of CCR5 usage; dual-tropic strains are pooled with X4 strains)
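
The selection rules above can be sketched as a small filter. The record layout (`patient_id`, `tropism`, `sequence`) and the label strings are hypothetical and do not reflect the Los Alamos schema. The sketch also assumes that "different tropism" is judged after dual-tropic strains have been pooled with X4, which the slide does not state.

```python
# Hypothetical label strings; dual-tropic ("R5X4") is pooled with X4 as the positive class.
CXCR4_LABEL = {"R5": 0, "X4": 1, "R5X4": 1}

def select_sequences(records):
    """Keep the longest sequence for each (patient, pooled tropism label) pair."""
    best = {}
    for rec in records:
        key = (rec["patient_id"], CXCR4_LABEL[rec["tropism"]])
        if key not in best or len(rec["sequence"]) > len(best[key]["sequence"]):
            best[key] = rec
    return list(best.values())

# Toy records: p1 has two R5 sequences, p2 has one R5 and one dual-tropic sequence.
records = [
    {"patient_id": "p1", "tropism": "R5", "sequence": "CTRPN"},
    {"patient_id": "p1", "tropism": "R5", "sequence": "CTRPNNNT"},
    {"patient_id": "p2", "tropism": "R5", "sequence": "CTRPNN"},
    {"patient_id": "p2", "tropism": "R5X4", "sequence": "CTRPNNK"},
]
selected = select_sequences(records)
targets = [CXCR4_LABEL[r["tropism"]] for r in selected]
```

Here p1 contributes only its longest sequence, while p2 contributes two because its sequences carry different labels.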

  6. Sequence manipulation
     • Previous works used multiple alignment (ClustalW or MUSCLE) on either nucleotide or amino acid sequences
     • We used local pairwise alignment (Smith-Waterman-Gotoh), with handling of ambiguities and detection/correction of frameshifts, against the HXB2 strain (which is X4)
       – Minor differences with the output of other models
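
As a rough illustration of the alignment step, the sketch below computes the best local alignment score with affine gap penalties (the Gotoh extension of Smith-Waterman). The scoring parameters are arbitrary assumptions, and the sketch returns only the score: it does no traceback and does not handle the ambiguity codes or frameshifts mentioned above.

```python
NEG_INF = float("-inf")

def sw_gotoh_score(a, b, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Best local alignment score; a gap of length L costs gap_open + (L - 1) * gap_extend."""
    best = 0
    prev_h = [0] * (len(b) + 1)        # H: best score of an alignment ending at (i-1, j)
    prev_f = [NEG_INF] * (len(b) + 1)  # F: best score ending with a gap in b
    for i in range(1, len(a) + 1):
        cur_h = [0] * (len(b) + 1)
        cur_f = [NEG_INF] * (len(b) + 1)
        e = NEG_INF                    # E: best score ending with a gap in a
        for j in range(1, len(b) + 1):
            e = max(e - gap_extend, cur_h[j - 1] - gap_open)
            cur_f[j] = max(prev_f[j] - gap_extend, prev_h[j] - gap_open)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 option lets a local alignment restart anywhere.
            cur_h[j] = max(0, prev_h[j - 1] + sub, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best

identical_score = sw_gotoh_score("ACGT", "ACGT")
one_gap_score = sw_gotoh_score("ACGTACGT", "ACGTTACGT")
unrelated_score = sw_gotoh_score("AAAA", "TTTT")
```

Aligning each sequence against a single reference in this pairwise fashion keeps position numbering tied to that reference, whereas a multiple alignment depends on which other sequences are in the set.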

  7. Domain coding
     • Binary dummy variables for specific amino acid changes (plus insertions, deletions and “any” substitution) in the V3 loop and in the envelope
     • Physico-chemical coding for position changes
     • Subtype
     • Clinical markers (HIV RNA load, CD4 and CD8 cell counts)
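
A minimal sketch of the binary dummy coding, assuming a gapped pairwise alignment against the reference and feature names in the style used elsewhere in the deck (for example 306del, ins317, 311I). The function name, the convention that an insertion is named after the preceding reference position, and the toy alignment are all assumptions. Physico-chemical, subtype and clinical variables are left out.

```python
def encode_mutations(aligned_ref, aligned_query, first_position=296):
    """Return the set of binary features that are 'on' for one aligned query sequence."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have the same length")
    features = set()
    pos = first_position - 1  # reference position of the last consumed reference residue
    for r, q in zip(aligned_ref, aligned_query):
        if r == "-":
            # Gap in the reference: the query carries an insertion after position `pos`.
            if q != "-":
                features.add(f"ins{pos}")
            continue
        pos += 1
        if q == "-":
            features.add(f"{pos}del")
        elif q != r:
            features.add(f"{pos}{q}")   # specific amino acid change
            features.add(f"{pos}any")   # "any" substitution at this position

    return features

def to_vector(features, vocabulary):
    """Turn a feature set into a 0/1 vector over a fixed, ordered vocabulary."""
    return [1 if name in features else 0 for name in vocabulary]

# Toy alignment: substitution at 298, insertion after 300, deletion at 302.
toy_features = encode_mutations("CTRPN-NT", "CTKPNGN-")
toy_vector = to_vector(toy_features, ["298K", "298any", "299any", "ins300", "302del"])
```

The default `first_position=296` follows the numbering in the deck, where envelope position 306 corresponds to V3 position 11.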

  8. Univariable analysis
     • CD4+ cell counts are significantly associated with tropism (low CD4+ → X4)
     • Subtype B and D isolates are predominantly X4
     • Subtype A, C and 02_AG isolates are predominantly R5

  9. Univariable analysis
     • Highly significant positions in the V3 loop (V3 loop numbering in parentheses)
       306 (11), 302 (7), 303 (8), 323 (28), 301 (6), 313 (18), 321 (26), 322 (27), 300 (5), 315 (20), 320 (25), 307 (12), 316 (21), 325 (30), 304 (9), …
     • A few positions outside the V3 loop are nominally significant but fall slightly above the Benjamini-Hochberg adjusted threshold (adj. p < 0.1)
       440, 192, 169
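
The Benjamini-Hochberg adjustment referred to above can be sketched in a few lines. The p-values are toy numbers, not results from the presentation. Only the 0.1 cut-off is taken from the slide.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

toy_p = [0.001, 0.02, 0.03, 0.5]
toy_adjusted = benjamini_hochberg(toy_p)
toy_significant = [p < 0.1 for p in toy_adjusted]
```

With one test per position and per mutation, many hypotheses are tested at once, so raw p-values are compared only after this kind of correction.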

  10. Hierarchical Clustering
     • Threshold of 0.35: {318A, ins317}, {311I, 308S, 306del, 307del}, {322I, 320 hydrophilic, 326I}
     • Mutations positively associated with X4 viruses tend to behave more independently (306S, 303I, 308K, 300Y and 307T)

  11. Machine Learning
     • Logistic Regression (LR)
     • Feature selection via filter and embedded methods (univariable analysis, AIC selection, CFS, ridge shrinkage)
     • Comparison with other (non-linear) machine learning techniques
       – SVM (same settings as Sander, 2007)
       – Random Forests and Decision Trees (RF, DT)
       – Rule Bases (RIPPER, JRIP)
       – Instance Based Reasoning (IBR)
     • Multiple 10-fold cross validation for model performance assessment and model comparison
       – Adjusted Student’s t-test (Bengio and Nadeau) for sample overlap and multiple comparisons over 10 independent runs
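
The adjustment for sample overlap can be illustrated with the corrected resampled t-statistic usually attributed to Nadeau and Bengio, which inflates the variance term by the test/train size ratio. This is a sketch of that general formula, not necessarily the exact procedure used here. The per-fold differences are toy numbers, and the multiple-comparison correction is not shown.

```python
import math
from statistics import mean, variance

def corrected_resampled_t(differences, test_fraction):
    """t-statistic for paired per-fold score differences, corrected for overlapping training sets."""
    k = len(differences)
    var = variance(differences)  # sample variance (k - 1 in the denominator)
    if var == 0:
        raise ValueError("differences have zero variance")
    ratio = test_fraction / (1 - test_fraction)  # n_test / n_train
    return mean(differences) / math.sqrt((1 / k + ratio) * var)

# Toy differences in accuracy between two models over 10 folds (test_fraction = 0.1).
diffs = [0.01, 0.03, -0.01, 0.02, 0.00, 0.02, 0.01, -0.02, 0.03, 0.01]
corrected_t = corrected_resampled_t(diffs, test_fraction=0.1)
naive_t = mean(diffs) / math.sqrt(variance(diffs) / len(diffs))
```

The statistic is compared against a Student's t distribution with k - 1 degrees of freedom. Because cross-validation folds share most of their training data, the naive statistic overstates significance, and the corrected one is always smaller in magnitude.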

  12. Results
     • Logistic Regression
       – High accuracy (92.76%) and AUC (0.93)
       – Enhanced domain coding performs significantly better than naïve variable encoding and than the V3 loop alone
       – Performs as well as the reference SVM
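
To make the model class concrete, here is a minimal ridge-penalised logistic regression fitted by batch gradient descent on binary dummy variables. It is a toy sketch and not the authors' implementation: the penalty strength, learning rate, number of epochs, feature names and data are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Predicted probability of CXCR4 usage for one feature vector."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def fit_ridge_logistic(X, y, l2=0.1, lr=0.5, epochs=2000):
    """Minimise mean log-loss plus (l2 / 2) * ||w||^2; the intercept is not penalised."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (g / n + l2 * wj) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy data with two hypothetical dummy variables; only the first tracks the label (1 = X4).
X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 1]]
y = [1, 1, 0, 0, 1, 0]
w, b = fit_ridge_logistic(X, y)
p_with_mutation = predict_proba([1, 0], w, b)
p_without_mutation = predict_proba([0, 0], w, b)
```

Each fitted weight is a log-odds contribution of one dummy variable, which is what makes the model straightforward to interpret.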

  13. Results (2)

  14. Conclusions
     • Logistic Regression is a powerful and interpretable tool for tropism prediction
       – Importance of envelope region analysis
       – Importance of enhanced variable encoding
       – Importance of feature selection techniques
       – Importance of robust validation and comparison statistics
       – We have a linear model: in the comparison analysis, non-linear models do not seem to improve performance
     • The modelling technique is also suitable for combination with structure-based methods
