HIV-1 tropism prediction



slide-1
SLIDE 1

Mattia CF Prosperi ahnven@yahoo.it University of “Roma TRE” Faculty of Computer Science Engineering Dept of Computer Science and Automation (DIA) via della vasca navale, 79 – 00149 – Rome, ITALY

HIV-1 tropism prediction

slide-2
SLIDE 2

Summary

  • State of the art

– From charge rule to structural descriptors

  • Roma TRE modelling

– Data collection

  • Sequence manipulation
  • Enhanced domain coding

– Univariable analysis and clustering
– Model technologies

  • Logistic regression and feature selection
  • Validation and comparison with other models
  • Interpretation of relevant features
slide-3
SLIDE 3

State of the art

  • Charge rule (De Jong, 1992)
  • Neural Networks, Decision Trees, Support Vector Machines (Resch, Pillai) on 200-300 examples
  • Position Specific Scoring Matrices (Jensen)
  • Support Vector Machines (Sing) on 1,100 examples, with AUC maximisation and CD4+ cell count as an additional input variable
  • Support Vector Machines + Structural Analysis (Sander, 2007) with AUC maximisation
  • Neural Networks for dual-tropism prediction (Lamers, 2008)
  • All of these models use sequence information from the V3 loop only
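
The charge rule is the simplest of these predictors and can be sketched in a few lines. The sketch below assumes the rule in its commonly cited "11/25" form (a positively charged residue, K or R, at V3 loop position 11 or 25 suggests X4) and an ungapped 35-residue V3 loop; the function name and the example sequences are illustrative and are not taken from the slides.

```python
# Assumption: the "11/25" form of the charge rule on an ungapped V3 loop.
POSITIVE_RESIDUES = {"K", "R"}


def charge_rule_predicts_x4(v3_loop, positions=(11, 25)):
    """Return True if any of the given 1-based V3 positions carries K or R."""
    return any(
        v3_loop[p - 1] in POSITIVE_RESIDUES
        for p in positions
        if p <= len(v3_loop)
    )


# Illustrative 35-residue V3 loop sequences (hypothetical inputs).
example_r5 = "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"   # S at 11, E at 25
example_x4 = "CTRPNNNTRKRIHIGPGRAFYTTGEIIGDIRQAHC"   # R at 11

print(charge_rule_predicts_x4(example_r5))
print(charge_rule_predicts_x4(example_x4))
```

The rule looks at two positions only, which is why the later models in the list add more positions, other feature types and learned weights.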
slide-4
SLIDE 4

State of the art (2)

  • SVM + Structural Analysis (Sander, 2007) seems to be the best-performing model at present
– 91.56% accuracy
– 0.93 AUC
– Minor criticisms concerning sample collection (all distinct sequences were used, regardless of patient, without accounting for the real sequence population distribution)
– Improvements gained with the structural analysis over a reference SVM trained only on the V3 dummy-variable encoding

slide-5
SLIDE 5

Roma TRE approach

  • Data: collection of samples from the Los Alamos database
– Only one sequence per patient (the longest available, no clones), except for sequences with different tropism
– No problematic sequences
– All subtypes
– At least the V3 loop, possibly the whole envelope gene
– Clinical markers recorded
  • Goal: prediction of CXCR4 usage probability (regardless of CCR5 usage; dual-tropic strains are pooled with X4 strains)
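
The selection and labelling rules above can be sketched as follows. This is only an illustration: the record fields (`patient`, `sequence`, `tropism`), the tropism codes (`"R5"`, `"X4"`, `"R5X4"`) and the choice to keep one sequence per patient and pooled label are assumptions, not details given in the slides.

```python
def pooled_label(tropism):
    """Pool dual-tropic strains with X4: anything that uses CXCR4 counts as X4."""
    return "R5" if tropism == "R5" else "X4"


def select_sequences(records):
    """Keep the longest sequence per (patient, pooled label) pair."""
    best = {}
    for rec in records:
        label = pooled_label(rec["tropism"])
        key = (rec["patient"], label)
        if key not in best or len(rec["sequence"]) > len(best[key]["sequence"]):
            best[key] = {**rec, "label": label}
    return list(best.values())


# Hypothetical toy records, for illustration only.
toy_records = [
    {"patient": "p1", "sequence": "CTR", "tropism": "R5"},
    {"patient": "p1", "sequence": "CTRPN", "tropism": "R5"},
    {"patient": "p1", "sequence": "CTRP", "tropism": "R5X4"},
    {"patient": "p2", "sequence": "CTRPNN", "tropism": "X4"},
]
selected = select_sequences(toy_records)
```

Patient `p1` contributes two sequences here because they differ in tropism, matching the exception stated in the data collection rule.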

slide-6
SLIDE 6

Sequence manipulation

  • Previous works used multiple alignment (clustalw or muscle) on either nucleotides or amino acids
  • We used local pairwise alignment (Smith-Waterman-Gotoh) with ambiguity and frameshift detection/correction against the HXB2 strain (which is X4)
– Only minor differences from the output of other models
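
For reference, the core of a Smith-Waterman alignment with Gotoh's affine gap penalties can be sketched as below. It computes only the best local score and its end coordinates; the scoring values are arbitrary assumptions, and the ambiguity handling and frameshift detection/correction mentioned above are not reproduced.

```python
def smith_waterman_gotoh(a, b, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Best local alignment score of a vs b with affine gaps.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    Returns (score, (end_in_a, end_in_b)) with 1-based end positions.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]        # best score ending at (i, j)
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in a
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in b
    best, best_end = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] - gap_extend, H[i][j - 1] - gap_open)
            F[i][j] = max(F[i - 1][j] - gap_extend, H[i - 1][j] - gap_open)
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, E[i][j], F[i][j])
            if H[i][j] > best:
                best, best_end = H[i][j], (i, j)
    return best, best_end


# Illustrative query and reference fragments (hypothetical inputs).
score, end = smith_waterman_gotoh("GPGRAF", "CTRPNNNTRKSIHIGPGRAFYTTGE")
```

Because the alignment is local and pairwise, each sequence is placed against the reference independently of the others, so the coordinates do not depend on which other sequences are in the data set.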

slide-7
SLIDE 7

Domain coding

  • Binary dummy variables for specific amino acid changes (plus insertions/deletions and “any” substitution) in the V3 loop and in the envelope
  • Physico-chemical coding for position changes
  • Subtype
  • Clinical markers (HIV RNA load, CD4 and CD8 cell counts)
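
A rough sketch of how such a coding could be derived from a pairwise alignment is given below. Everything specific in it is an assumption made for illustration: the gap character `-`, the feature names (`306S`, `306any`, `306del`, `ins317` style), the default first position, and the use of a simple charge class as the physico-chemical property.

```python
CHARGE_CLASS = {"K": "positive", "R": "positive", "D": "negative", "E": "negative"}


def charge_class(residue):
    return CHARGE_CLASS.get(residue, "neutral")


def encode_changes(ref_aligned, seq_aligned, first_position=296):
    """Return the set of binary features that are 'on' for one aligned sequence.

    Positions are numbered along the reference (gaps in the reference do not
    advance the counter); first_position is an assumed starting coordinate.
    """
    features = set()
    pos = first_position - 1
    for ref_res, seq_res in zip(ref_aligned, seq_aligned):
        if ref_res == "-":                      # extra residue relative to the reference
            if seq_res != "-":
                features.add(f"ins{pos}")       # insertion after reference position pos
            continue
        pos += 1
        if seq_res == "-":
            features.add(f"{pos}del")
        elif seq_res != ref_res:
            features.add(f"{pos}{seq_res}")     # specific substitution
            features.add(f"{pos}any")           # "any" substitution at this position
            if charge_class(seq_res) != charge_class(ref_res):
                features.add(f"{pos}_{charge_class(seq_res)}")
    return features


# Hypothetical aligned fragments, for illustration only.
toy_features = encode_changes("CT-RSN", "CTGKR-")
```

Each distinct feature name becomes one column of the design matrix, with value 1 when the feature is in the set and 0 otherwise; subtype and clinical markers would be appended as further columns.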

slide-8
SLIDE 8

Univariable analysis

  • CD4+ cell counts are significantly associated with tropism (low CD4+ → X4)
  • Subtype B and D isolates are predominantly X4
  • Subtype A, C and 02_AG isolates are predominantly R5
slide-9
SLIDE 9

Univariable analysis

  • Highly significant positions in the V3 loop, given as envelope position (V3 loop position)
– 306 (11), 302 (7), 303 (8), 323 (28), 301 (6), 313 (18), 321 (26), 322 (27), 300 (5), 315 (20), 320 (25), 307 (12), 316 (21), 325 (30), 304 (9)
  • A few positions outside the V3 loop are significant, but only just pass the Benjamini-Hochberg adjusted threshold (adj. p < 0.1)
– 440, 192, 169
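
The Benjamini-Hochberg adjustment used for the threshold above can be sketched as follows. This is the standard step-up procedure written in plain Python; the p-values in the example are made up and are not results from the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four hypothetical positions.
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [p < 0.1 for p in toy_adjusted]
```

A position is then reported as significant when its adjusted p-value falls below the chosen level, 0.1 in the slide above.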
slide-10
SLIDE 10

Hierarchical Clustering

  • Threshold of 0.35: {318A, ins317}, {311I, 308S, 306del, 307del}, {322I, 320 hydrophilic, 326I}
  • Mutations positively associated with X4 viruses tend to behave more independently (306S, 303I, 308K, 300Y and 307T)
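
How a distance threshold turns a hierarchy into groups of co-occurring mutations can be sketched as below. The slides do not state the distance measure or linkage, so the Jaccard distance on carrier sets and average linkage used here are assumptions, and the toy data are invented.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of sample indices."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage(c1, c2, carriers):
    total = sum(jaccard_distance(carriers[x], carriers[y]) for x in c1 for y in c2)
    return total / (len(c1) * len(c2))


def cluster_mutations(carriers, threshold=0.35):
    """Agglomerative clustering, stopping once the closest pair exceeds threshold."""
    clusters = [[name] for name in sorted(carriers)]
    while len(clusters) > 1:
        d, i, j = min(
            (average_linkage(clusters[i], clusters[j], carriers), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Invented example: which samples (by index) carry each mutation.
toy_carriers = {
    "318A": {0, 1, 2, 3},
    "ins317": {0, 1, 2, 3, 4},
    "306S": {7, 8},
    "303I": {9},
}
toy_clusters = cluster_mutations(toy_carriers)
```

Mutations that are rarely seen together stay in singleton clusters, which is the pattern described above for the X4-associated mutations.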

slide-11
SLIDE 11

Machine Learning

  • Logistic Regression (LR)
  • Feature selection via filter and embedded methods (univariable analysis, AIC selection, CFS, ridge shrinkage)
  • Comparison with other (non-linear) machine learning techniques
– SVM (same settings as Sander, 2007)
– Random Forests and Decision Trees (RF, DT)
– Rule Bases (RIPPER, JRIP)
– Instance Based Reasoning (IBR)
  • Multiple 10-fold cross-validation for model performance assessment and model comparison
– Student’s t-test adjusted (Bengio and Nadeau) for sample overlap and multiple comparisons, over 10 independent runs
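
The overlap adjustment of the t-test can be illustrated with the corrected resampled t-statistic usually attributed to Nadeau and Bengio. The sketch assumes that form of the correction, in which the variance term 1/k is inflated by the test/train size ratio; it covers only the test statistic, not the multiple-comparison adjustment, and the example differences are made up.

```python
import math
import statistics


def corrected_resampled_t(diffs, test_train_ratio):
    """Corrected resampled t-statistic for paired performance differences.

    diffs: one performance difference per train/test split (k values).
    test_train_ratio: n_test / n_train, e.g. 1/9 for 10-fold cross-validation.
    Compare the result against a t distribution with k - 1 degrees of freedom.
    """
    k = len(diffs)
    mean_diff = statistics.fmean(diffs)
    var_diff = statistics.variance(diffs)          # sample variance
    return mean_diff / math.sqrt((1.0 / k + test_train_ratio) * var_diff)


def naive_t(diffs):
    """Ordinary paired t-statistic, which ignores the overlap between splits."""
    k = len(diffs)
    return statistics.fmean(diffs) / math.sqrt(statistics.variance(diffs) / k)


# Made-up differences between two models over four splits.
toy_diffs = [1.0, 2.0, 3.0, 4.0]
t_corrected = corrected_resampled_t(toy_diffs, 1 / 9)
t_naive = naive_t(toy_diffs)
```

With 10 runs of 10-fold cross-validation there would be k = 100 differences. The corrected statistic is always smaller in magnitude than the naive one, so it is less likely to declare a difference significant merely because the training sets overlap.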

slide-12
SLIDE 12

Results

  • Logistic Regression
– High accuracy (92.76%) and AUC (0.93)
– Enhanced domain coding performs significantly better than naïve variable encoding and than the V3 loop alone
– Performs on par with the reference SVM
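
For readers unfamiliar with the two metrics, the sketch below computes accuracy at a 0.5 cut-off and AUC as the fraction of (X4, R5) pairs ranked correctly by the predicted probability, with ties counted as one half. The labels and scores are invented, and the 0.5 cut-off is an assumption.

```python
def accuracy(labels, scores, cutoff=0.5):
    """Fraction of samples whose thresholded score matches the 0/1 label."""
    predictions = [1 if s >= cutoff else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented example: 1 = X4, 0 = R5, scores = predicted CXCR4 usage probability.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_accuracy = accuracy(toy_labels, toy_scores)
toy_auc = auc(toy_labels, toy_scores)
```

Accuracy depends on the chosen cut-off, whereas AUC summarises ranking quality over all cut-offs, which is why both are reported.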

slide-13
SLIDE 13

Results (2)

slide-14
SLIDE 14

Conclusions

  • Logistic Regression is a powerful and interpretable tool for tropism prediction
– Importance of envelope region analysis
– Importance of enhanced variable encoding
– Importance of feature selection techniques
– Importance of robust validation and comparison statistics
– We have a linear model: in the comparison analysis, non-linear models do not seem to improve performance
  • The modelling technique is also suitable for combination with structure-based methods
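
The interpretability claim for logistic regression can be made concrete with a small sketch: each fitted coefficient, once exponentiated, is the odds ratio for X4 associated with that binary feature. The code below is a plain gradient-descent fit with a ridge (L2) penalty on invented data; the penalty strength, learning rate and feature meanings are assumptions, and this is not the authors' fitting procedure.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_ridge(X, y, l2=0.1, learning_rate=0.5, epochs=2000):
    """Fit weights and intercept by batch gradient descent with an L2 penalty."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(d):
            w[j] -= learning_rate * (grad_w[j] / n + l2 * w[j])   # ridge shrinkage
        b -= learning_rate * grad_b / n                           # intercept not penalised
    return w, b


# Invented data: two hypothetical binary mutation features, label 1 = X4.
toy_X = [[1, 0], [1, 0], [1, 0], [1, 1], [1, 1],
         [0, 0], [0, 0], [0, 1], [0, 1], [0, 0]]
toy_y = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
weights, intercept = fit_logistic_ridge(toy_X, toy_y)
odds_ratios = [math.exp(wj) for wj in weights]
```

An odds ratio above 1 marks a feature associated with X4 and one below 1 a feature associated with R5, which is the kind of reading a non-linear model such as an SVM does not offer directly.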