COMET: A Novel approach to HIV-1 subtype prediction (Context-based Modeling for Expeditious Typing)

SLIDE 1

comet.retrovirology.lu

COMET: A Novel approach to HIV-1 subtype prediction

(Context-based Modeling for Expeditious Typing)

Daniel Struck
CRP-SANTÉ Laboratory of Retrovirology
(daniel.struck@crp-sante.lu)

SLIDE 2

Background

  • HIV-1 subtype is often used for epidemiological studies
  • Many different subtyping tools exist:

– jpHMM, RIP (LANL), NCBI genotyping, STAR, REGA Subtyping Tool, …

  • Subtyping remains a controversial topic → compare the results from different approaches

SLIDE 3

COMET HIV-1 subtyping tool

  • Context-based modeling for classification of HIV-1 sequences, adapted from the PPM (prediction by partial match) compression algorithm

– takes ambiguities from population sequencing into consideration

  • Software written in Java (Linux, Windows, Apple, …)
  • Core algorithm fits in approx. 300 lines of code
  • Does not require any external analysis tool (muscle / mafft / clustal, paup / raxml / phyml)
  • Multi-threaded (takes advantage of all available CPU cores)
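
To make the idea of a context-based model concrete, here is a minimal Python sketch (COMET itself is written in Java and its source is not shown in these slides). It counts which base follows each context of a fixed length and scores ambiguous bases by summing over the bases they may stand for. The function names, the add-one smoothing and the ambiguity rule are assumptions made for illustration; a real PPM model falls back to shorter contexts through escape probabilities, which this sketch leaves out.

```python
from collections import defaultdict

# Standard IUPAC nucleotide codes: each code maps to the bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def train(sequences, order):
    """Count, for every context of `order` bases, which base follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts


def prob(counts, context, symbol):
    """P(symbol | context) with add-one smoothing over the four bases.

    An ambiguous symbol (hypothetical rule) receives the summed
    probability of all bases it may represent.
    """
    following = counts.get(context, {})
    total = sum(following.values()) + 4
    return sum((following.get(base, 0) + 1) / total for base in IUPAC[symbol])


model = train(["ACGTACGTACGT"], order=2)
print(prob(model, "AC", "G"))  # unambiguous base
print(prob(model, "AC", "R"))  # R = A or G
```

One such model would be trained per subtype from its reference sequences; ambiguity codes inside the context itself are not handled here.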

SLIDE 4

Algorithm

  • Training of the model with the subtype reference sequences from Los Alamos National Lab (LANL) from 2008 and 30 additional near full-length sequences from LANL.
  • Slide over the sequence and determine the probabilities for each subtype.

Simplified example with a model 4:

C T A G C A A C A

Subtype A   0.5  0.5  0.1  0.2  0.3
Subtype B   0.5  0.5  0.4  0.6  0.8
Subtype C   0.3  0.2  0.1  0.2  0.1

  • Determine the most probable subtype.
  • Then slide over the table of probabilities with a window size of 250 bp and a step size of 2 bp to detect possible recombination events.
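
The two steps above (pick the most probable subtype, then re-score in sliding windows) can be sketched in Python using the probabilities from the simplified example. Summing log-probabilities to combine positions is an assumption, as are the function names; the slides only state that the most probable subtype is determined and that a 250 bp window with a 2 bp step is used.

```python
import math

# Per-position probabilities copied from the simplified example.
table = {
    "A": [0.5, 0.5, 0.1, 0.2, 0.3],
    "B": [0.5, 0.5, 0.4, 0.6, 0.8],
    "C": [0.3, 0.2, 0.1, 0.2, 0.1],
}


def best_subtype(table, start=0, end=None):
    """Subtype with the highest summed log-probability over [start, end)."""
    scores = {
        subtype: sum(math.log(p) for p in probs[start:end])
        for subtype, probs in table.items()
    }
    return max(scores, key=scores.get)


def window_calls(table, window, step):
    """Best subtype per window; a change of call along the sequence
    would hint at a possible recombination breakpoint."""
    length = len(next(iter(table.values())))
    return [
        (start, best_subtype(table, start, start + window))
        for start in range(0, length - window + 1, step)
    ]


print(best_subtype(table))                   # whole sequence
print(window_calls(table, window=3, step=1))  # toy window; the slides use 250 bp / 2 bp
```

For the example values, subtype B has the highest product of probabilities (0.048, against 0.0015 for A and 0.00012 for C), so it is the call for the whole sequence.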

SLIDE 5

Analysis of 27017 prot-RT sequences from LANL

  • Dataset for analysis:

– 27017 prot-RT sequences downloaded from LANL.

  • Query parameters:

– HXB2 start point: 2253, end point: 3450 (prot-RT region)
– Sequence length < 1700 bp

  • Download subtype results from the STAR and REGA subtyping tools.

– STAR: all PURE, CRF: 01_AE - 02_AG
– REGA v2: all PURE, CRF: 01_AE - 14_BG

SLIDE 6

Subtype distribution of the dataset

(27017 prot-RT sequences)

             STAR    REGA    COMET
B            19988   19722   20282
C            1329    1334    1329
A            672     1200    1194
D            555     186     499
G            246     441     393
F            205     206     193
H            19      21      19
J            3       6       2
CRF02_AG     867     787     829
CRF01_AE     414     419     416
other CRF            653     806
unassigned   2719    2042    1055
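
As a quick consistency check on the distribution above, the following Python sketch sums each tool's column and computes per-category shares. The counts are copied from the table; treating STAR's missing "other CRF" entry as 0 is an assumption (STAR results were restricted to CRF 01_AE and 02_AG), and the helper names are illustrative.

```python
# Row order of the subtype distribution table.
CATEGORIES = ["B", "C", "A", "D", "G", "F", "H", "J",
              "CRF02_AG", "CRF01_AE", "other CRF", "unassigned"]

# Counts copied from the table; STAR "other CRF" is assumed to be 0.
COUNTS = {
    "STAR":  [19988, 1329, 672, 555, 246, 205, 19, 3, 867, 414, 0, 2719],
    "REGA":  [19722, 1334, 1200, 186, 441, 206, 21, 6, 787, 419, 653, 2042],
    "COMET": [20282, 1329, 1194, 499, 393, 193, 19, 2, 829, 416, 806, 1055],
}


def column_totals(counts):
    """Total number of sequences classified by each tool."""
    return {tool: sum(values) for tool, values in counts.items()}


def share(counts, tool, category):
    """Percentage of a tool's calls that fall into `category`."""
    index = CATEGORIES.index(category)
    return 100 * counts[tool][index] / sum(counts[tool])


print(column_totals(COUNTS))
for tool in COUNTS:
    print(tool, round(share(COUNTS, tool, "unassigned"), 1))
```

Each column adds up to the 27017 sequences of the dataset, and the unassigned share works out to roughly 10.1% for STAR, 7.6% for REGA and 3.9% for COMET.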

SLIDE 7

Comparison of STAR, REGA & COMET

(27017 prot-RT sequences)

  • All 3 tools agreed in 88.3% of cases (23854)

– 22352 PURE, 777 CRF, 725 unassigned

  • All 3 tools disagreed in only 0.1% of cases (30).
  • COMET & REGA agreed in 6.4% of cases (1722); STAR disagreed

– 1034 PURE, 582 CRF, 106 unassigned

  • COMET & STAR agreed in 4.0% of cases (1090); REGA disagreed

– 910 PURE, 40 CRF, 140 unassigned

  • REGA & STAR agreed in 1.2% of cases (321); COMET disagreed

– 77 PURE, 8 CRF, 236 unassigned
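
The percentages on this slide follow directly from the counts. A short Python sketch (the dictionary keys and the function name are illustrative) reproduces them and confirms that the five groups cover the whole dataset:

```python
TOTAL = 27017  # prot-RT sequences in the dataset

# Counts copied from the comparison above.
agreement = {
    "all three agree": 23854,
    "all three disagree": 30,
    "COMET & REGA agree, STAR disagrees": 1722,
    "COMET & STAR agree, REGA disagrees": 1090,
    "REGA & STAR agree, COMET disagrees": 321,
}


def percentages(groups, total):
    """Share of each group in percent, rounded to one decimal."""
    return {label: round(100 * n / total, 1) for label, n in groups.items()}


print(sum(agreement.values()) == TOTAL)
for label, pct in percentages(agreement, TOTAL).items():
    print(f"{label}: {pct}%")
```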

SLIDE 8

Comparison of REGA & COMET to LANL

Of the 27017 sequences in the dataset, 24735 had a subtype (PURE, CRF, URF) assigned in the LANL database. For the comparison, 24576 sequences were analyzed (PURE, CRF: 01_AE → 14_BG, URF).

  • REGA & LANL agreed in 93.9% of cases (23077) and disagreed in 6.1% of cases (1499). Fleiss kappa = 0.84
  • COMET & LANL agreed in 96.9% of cases (23818) and disagreed in 3.1% of cases (758). Fleiss kappa = 0.92

“The Fleiss kappa measure calculates the degree of agreement in classification over that which would be expected by chance and is scored as a number between 0 and 1.”
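
For readers unfamiliar with kappa statistics, here is a minimal Python sketch of Cohen's kappa, the two-rater form, which corrects the observed agreement for the agreement expected by chance. The label lists are made-up toy data, not taken from the dataset, and the slides do not show how the reported values were computed. Note that kappa drops below 0 when agreement is worse than chance.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long lists of category labels."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both raters pick the same label
    # if each drew independently from its own label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical subtype calls for eight sequences.
tool = ["B", "B", "B", "C", "C", "A", "A", "B"]
lanl = ["B", "B", "B", "C", "C", "A", "B", "B"]
print(cohen_kappa(tool, lanl))
```

In the toy data the raw agreement is 7 of 8 (87.5%), but the kappa is lower (15/19, about 0.79) because part of that agreement is expected by chance.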

SLIDE 9

Cohen Kappa

Subtype   REGA ↔ LANL   COMET ↔ LANL   Training set
01_AE     0.98          0.98           5
02_AG     0.92          0.93           6
03_AB                                  2
04_CPX    0.86          0.86           4
05_DF     1 (single value, column unclear)   3
06_CPX    0.83          0.77           5
07_BC     1             0.98           4
08_BC     0.97          0.97           2
09_CPX    0.8 (single value, column unclear)   4
10_CD     -1.09E-04 (single value, column unclear)   2
11_CPX    0.64          0.64           3
12_BF     0.65          0.61           5
13_CPX    0.8           0.8            3
14_BG                                  2
A         0.96          0.96           7 A1, 2 A2
B         0.92          0.98           7
C         0.99          0.99           6
D         0.41          0.94           6
F         0.94          0.92           6 F1, 2 F2
G         0.9           0.91           4
H         0.97          0.91           4
J         0.5           0.5            3
K                                      2
URF       0.38          0.55

SLIDE 10

Benchmark

Analysis of the 27017 prot-RT sequences:

  • 392 +/- 2 seconds (6 ½ minutes) on Opteron server (2 x Quad-core, 2.5 GHz) => 68 prot-RT sequences / second
  • 144 +/- 0 seconds (2 ½ minutes) on new Intel server (2 x Quad-core, newest generation, 2.93 GHz) => 187 prot-RT sequences / second
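
The throughput figures can be recomputed from the run times with a few lines of Python (the function name is illustrative). The quotients are about 68.9 and 187.6 sequences per second, so the values quoted on the slide appear to be rounded down.

```python
def throughput(n_sequences, seconds):
    """Average number of sequences processed per second."""
    return n_sequences / seconds


N = 27017  # prot-RT sequences analyzed
opteron = throughput(N, 392)
intel = throughput(N, 144)
print(f"Opteron: {opteron:.1f} seq/s, Intel: {intel:.1f} seq/s")
print(f"Speed-up: {392 / 144:.2f}x")
```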

SLIDE 11

Ultra-deep sequencing (UDS) application

In-house UDS (454) software:

  • alignment, trimming
  • filtering
  • compressing
  • automatic correction of homopolymer count & “carry forward” errors
  • added adapted COMET module with bootstrap analysis (100 values per sequence, threshold 75%)
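
One way a bootstrap analysis with 100 values per sequence and a 75% threshold could work is sketched below in Python. The resampling scheme (drawing sequence positions with replacement and re-scoring each replicate) is an assumption, since the slides give only the number of replicates and the threshold; the probability table holds made-up toy values and the function name is hypothetical.

```python
import math
import random
from collections import Counter


def bootstrap_call(table, replicates=100, threshold=0.75, seed=0):
    """Return (call, support): the subtype winning most replicates, or
    "unassigned" if its share of replicates is below `threshold`."""
    rng = random.Random(seed)
    length = len(next(iter(table.values())))
    logs = {s: [math.log(p) for p in probs] for s, probs in table.items()}
    wins = Counter()
    for _ in range(replicates):
        # Resample positions with replacement (assumed scheme).
        sample = [rng.randrange(length) for _ in range(length)]
        scores = {s: sum(values[i] for i in sample) for s, values in logs.items()}
        wins[max(scores, key=scores.get)] += 1
    subtype, count = wins.most_common(1)[0]
    support = count / replicates
    return (subtype if support >= threshold else "unassigned"), support


# Hypothetical per-position probabilities for a short read.
toy = {
    "A1": [0.2, 0.3, 0.1, 0.2, 0.3],
    "C":  [0.5, 0.5, 0.4, 0.6, 0.8],
}
print(bootstrap_call(toy))
```

In the toy table subtype C scores higher at every position, so it wins every replicate; reads whose best subtype changes between replicates would fall below the threshold and be reported as unassigned.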

SLIDE 12

UDS application, dataset:

64 patients from Rwanda AMATA study

454 Sequence length: 333 bp (454, RT, AA 88 → 198)

Total sequences analyzed: 267749 (seq. with frameshifts excluded)

Time needed for analysis (100 bootstraps / seq. ): 5 ½ minutes

Sanger (prot-RT)

(URF: 2 AC, 5 CA, 1 CAC, 1 AD, 2 CD, 1 DC, 1 GH)

SLIDE 13

UDS application, results:

21 out of 64 patients (32.81%) seem to be dually infected by two different subtypes

Columns "major subtype" to "minority %": COMET subtype. Columns "REGA" to "man. align. insp.": confirmation.

patient  major subtype  number  minor subtype  number  unassigned  minority %  REGA  STAR  jpHMM  man. align. insp.  Sanger
5        A1             4312    C              1                   0.02        ok    A1/u  ok     ok                 URF_CA
8        D              6853    A1             1       57          0.01        ok    ok    ok     ok                 D
9        C              6603    A1             14      28          0.21        u/A1  u/A1  H/A1   C-H?/A1            URF_GH
17       A1             5727    C              3                   0.05        ok    ok    ok     ok                 A1
18       C              3279    A1             5                   0.15        ok    ok    ok     ok                 C
21       A1             2856    C              4                   0.14        u/u   ok    ok     ok                 A1
22       C              5995    A1             5                   0.08        u/A1  ok    ok     ok                 C
24       A1             6361    C              13                  0.2         u/C   ok    ok     ok                 A1
25       C              6412    A1             15                  0.23        C/u   ok    ok     ok                 URF_CD
26       A1             7350    C              1                   0.01        u/C   ok    ok     ok                 C
32       C              6094    A1             11                  0.18        C/u   ok    ok     ok                 URF_DC
33       A1             2226    C              1                   0.04        ok    ok    ok     ok                 A1
35       A1             4864    C              4                   0.08        A1/u  ok    ok     ok                 A1
36       A1             670     C              1                   0.15        ok    ok    ok     ok                 A1
47       A1             3290    C              2                   0.06        u/C   ok    ok     ok                 A1
48       A1             4120    C              1                   0.02        u/C   ok    ok     ok                 A1
49       C              5279    A1             58                  1.09        ok    ok    ok     ok                 C
64       C              1695    A1             9                   0.53        ok    ok    ok     ok                 URF_CA
65       A1             6346    C              8                   0.13        A1/u  ok    ok     ok                 A1
73       C              3335    A1             1                   0.03        ok    ok    ok     ok                 C
79       A1             3244    C              3                   0.09        ok    ok    ok     ok                 A1
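
The "minority %" column is consistent with the share of minor-subtype reads among the reads assigned to either subtype. The Python sketch below uses that formula as an assumption (the slides do not define the column, and unassigned reads are left out of the denominator here); the function name is illustrative and the read counts are copied from the table.

```python
def minority_percent(major_reads, minor_reads):
    """Minor-subtype reads as a percentage of all assigned reads."""
    return round(100 * minor_reads / (major_reads + minor_reads), 2)


# (patient, major-subtype reads, minor-subtype reads) from the results table.
rows = [(5, 4312, 1), (36, 670, 1), (49, 5279, 58), (64, 1695, 9)]
for patient, major, minor in rows:
    print(patient, minority_percent(major, minor))

# Share of patients with a second subtype detected: 21 of 64.
print(round(100 * 21 / 64, 2))
```

Most minority populations in the table are far below 1% of the reads; only patient 49 exceeds that level.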

SLIDE 14

Summary

  • Reliable prediction of HIV-1 subtype
  • Generally it is best to compare the results of different approaches to define the subtype of a sequence
  • High performance and scalability

– suitable for deep sequencing (454) analysis

  • In preparation: stand-alone desktop version with possibility to inspect the recombination pattern

SLIDE 15

http://comet.retrovirology.lu

subtype results can be downloaded in CSV format

SLIDE 16

Acknowledgements

CRP-Santé, Laboratory of Retrovirology

  • Jean-Claude Schmit
  • Carole Devaux
  • Danielle Perez Bercoff
  • Jean-Claude Karasi

CRP-Santé, Laboratory of Cardiovascular Research

  • Francisco Azuaje