Mattia CF Prosperi ahnven@yahoo.it University of “Roma TRE” Faculty of Computer Science Engineering Dept of Computer Science and Automation (DIA) via della vasca navale, 79 – 00149 – Rome, ITALY
HIV-1 tropism prediction
– From charge rule to structural descriptors
– Data collection
– Univariable analysis and clustering
– Model technologies
(Resch, Pillai) on 200-300 examples
maximisation adding CD4+ cell count as additional input variable
AUC maximisation
be the best performing model at present
– 91.56% accuracy
– 0.93 AUC
– Minor criticisms concerning sample collection (all distinct sequences were used, regardless of patient, without accounting for the real sequence population distribution)
– Improvements gained with the structural analysis over a reference SVM trained only on the V3 dummy variable encoding
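As an illustration of the "dummy variable encoding" of the V3 loop mentioned above, here is a minimal sketch. It assumes the sequences are already aligned to a fixed length and that one binary indicator is created per position and amino acid; the function name `dummy_encode`, the alphabet ordering, and the toy input are hypothetical and are not taken from the slides.

```python
# Minimal sketch of a position-wise dummy (one-hot) encoding.
# Assumption: sequences are pre-aligned; gaps or unknown symbols
# are encoded as an all-zero block for that position.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, arbitrary order


def dummy_encode(aligned_seq, alphabet=AMINO_ACIDS):
    """Return a flat list of 0/1 indicators, len(alphabet) per position."""
    features = []
    for residue in aligned_seq.upper():
        # One indicator per (position, amino acid) pair.
        features.extend(1 if residue == aa else 0 for aa in alphabet)
    return features


# Toy aligned fragment (hypothetical, not a real V3 sequence).
vector = dummy_encode("CT-")
print(len(vector), sum(vector))
```

Each aligned position contributes 20 columns, so a full-length V3 loop yields a sparse binary vector that a linear model or an SVM can consume directly.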
– Only one sequence per patient (the longest available, no clones), except for sequences with different tropism
– No problematic sequences
– All subtypes
– At least the V3 loop, possibly the whole envelope gene
– Clinical markers recorded
tropic strains are pooled into X4 strains)
– Minor differences with the output of other models
(low CD4+ → X4)
the V3 loop
V3 loop are significant, but slightly over the Benjamini-Hochberg adjusted threshold (adj. p < 0.1)
hydrophilic, 326I}
independently (306S, 303I, 308K, 300Y and 307T)
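The Benjamini-Hochberg adjustment referred to in the univariable analysis can be sketched as follows. This is a generic implementation of the standard step-up procedure, not the authors' code; the function name and the example p-values are hypothetical.

```python
# Minimal sketch of Benjamini-Hochberg adjusted p-values.
# Assumption: the input is a list of raw p-values, one per tested variable.
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four candidate variables.
raw = [0.01, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(raw)
# Variables kept at the adj. p < 0.1 threshold quoted on the slide.
selected = [i for i, p in enumerate(adj) if p < 0.1]
print(adj, selected)
```

A variable whose adjusted p-value falls just above the chosen threshold is the situation the slide describes as "slightly over" the adjusted threshold.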
(univariable analysis, AIC selection, CFS, ridge shrinkage)
techniques
– SVM (same settings as Sander, 2007)
– Random Forests and Decision Trees (RF, DT)
– Rule Bases (RIPPER, JRIP)
– Instance Based Reasoning (IBR)
assessment and model comparison
– Student’s t-test, adjusted for sample overlap (Bengio and Nadeau) and for multiple comparisons, over 10 independent runs
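The overlap-adjusted t-test can be sketched with the commonly used corrected resampled t statistic, which inflates the variance by the test/train size ratio. This is a hedged illustration: the slides do not give the exact formula or sample sizes, so the function name, the per-run differences, and the train/test sizes below are all hypothetical.

```python
# Minimal sketch of a corrected resampled t statistic for comparing
# two models over repeated train/test runs with overlapping samples.
# Assumption: diffs[j] is the performance difference (model A - model B)
# on run j, and every run uses the same train/test sizes.
import math
import statistics


def corrected_resampled_t(diffs, n_train, n_test):
    """Return (t statistic, degrees of freedom)."""
    k = len(diffs)
    mean_d = statistics.mean(diffs)
    var_d = statistics.variance(diffs)  # sample variance (k - 1 denominator)
    # The plain t-test would use var_d / k; the correction adds
    # n_test / n_train to account for overlapping training sets.
    t = mean_d / math.sqrt((1.0 / k + n_test / n_train) * var_d)
    return t, k - 1


# Hypothetical accuracy differences over five runs.
t_stat, dof = corrected_resampled_t([0.02, 0.01, 0.03, 0.02, 0.02],
                                    n_train=900, n_test=100)
print(t_stat, dof)
```

The statistic is then compared against a Student's t distribution with the returned degrees of freedom; a further adjustment is still needed when several models are compared at once.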
Regression
– High accuracy (92.76%) and AUC (0.93)
– Enhanced domain coding performs significantly better than naïve variable encoding and than the V3 loop alone
– Performs as well as the reference SVM
– Importance of envelope region analysis
– Importance of enhanced variable encoding
– Importance of feature selection techniques
– Importance of robust validation and comparison statistics
– We have a linear model: in the comparison analysis, non-linear models do not seem to improve performance