COMET: A novel approach to HIV-1 subtype prediction (Context-based Modeling for Expeditious Typing)
Daniel Struck, CRP-Santé, Laboratory of Retrovirology (daniel.struck@crp-sante.lu)
comet.retrovirology.lu
Background
- HIV-1 subtype is often used for epidemiological studies
- Many different subtyping tools exist:
– jpHMM, RIP (LANL), NCBI genotyping, STAR, REGA Subtyping Tool, …
- Subtyping remains a controversial topic → compare the results from
different approaches
COMET HIV-1 subtyping tool
- Context-based modeling for classification of HIV-1 sequences, adapted from the PPM compression algorithm (prediction by partial matching)
– takes ambiguities from population sequencing into consideration
- Software written in Java (Linux, Windows, Apple, …)
- Core algorithm fits in approx. 300 lines of code
- Does not require any external analysis tool (muscle / mafft / clustal, paup /
raxml / phyml)
- Multi-threaded (takes advantage of all available CPU cores)
Algorithm
- Training of the model with the subtype reference sequences from Los Alamos National Lab
(LANL) from 2008 and 30 additional near full length sequences from LANL.
- Slide over the sequence and determine the probabilities for each subtype.
Simplified example with a model 4:
C T A G C A A C A

Subtype A   0.5  0.5  0.1  0.2  0.3
Subtype B   0.5  0.5  0.4  0.6  0.8
Subtype C   0.3  0.2  0.1  0.2  0.1
- Determine the most probable subtype.
- Then slide over the table of probabilities with a window size of 250 bp and a step size of 2 bp to detect possible recombination events.
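The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the COMET implementation: "model 4" is read here as a context length of 4 nucleotides, add-one smoothing stands in for the escape mechanism of real PPM, ambiguity codes are not handled, and the names `train`, `position_log_probs`, `best_subtype` and `window_calls` are hypothetical.

```python
import math
from collections import defaultdict

ORDER = 4  # assumed context length; the slide's "model 4" is read this way


def train(references, order=ORDER):
    """Count, per subtype, how often each nucleotide follows each context."""
    counts = {}
    for subtype, seqs in references.items():
        table = defaultdict(lambda: defaultdict(int))
        for seq in seqs:
            for i in range(order, len(seq)):
                table[seq[i - order:i]][seq[i]] += 1
        counts[subtype] = table
    return counts


def position_log_probs(counts, query, order=ORDER, alphabet="ACGT"):
    """Per-position log P(nucleotide | preceding context) for each subtype."""
    result = {}
    for subtype, table in counts.items():
        row = []
        for i in range(order, len(query)):
            seen = table.get(query[i - order:i], {})
            total = sum(seen.values())
            # Add-one smoothing (an assumption) keeps unseen events non-zero.
            p = (seen.get(query[i], 0) + 1) / (total + len(alphabet))
            row.append(math.log(p))
        result[subtype] = row
    return result


def best_subtype(log_probs, start=0, end=None):
    """Subtype with the highest summed log-probability over [start, end)."""
    return max(log_probs, key=lambda s: sum(log_probs[s][start:end]))


def window_calls(log_probs, window=250, step=2):
    """Most probable subtype per window; a change hints at recombination."""
    n = len(next(iter(log_probs.values())))
    # A sequence shorter than one window yields a single call.
    starts = range(0, max(n - window, 0) + 1, step)
    return [(s, best_subtype(log_probs, s, s + window)) for s in starts]
```

With toy references, `best_subtype(position_log_probs(train(refs), query))` gives the overall call, and `window_calls(...)` returns one call per window start; a run of windows switching from one subtype to another is the kind of signal the recombination step looks for.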
Analysis of 27017 prot-RT sequences from LANL
- Dataset for analysis:
– 27017 prot-RT sequences downloaded from LANL.
- Query parameters:
– HXB2 start point: 2253, end point: 3450 (prot-RT region)
– Sequence length < 1700 bp
- Download subtype results from the STAR and REGA subtyping tools.
– STAR: all PURE, CRF: 01_AE - 02_AG
– REGA v2: all PURE, CRF: 01_AE - 14_BG
Subtype distribution of the dataset
(27017 prot-RT sequences)

              STAR    REGA   COMET
B            19988   19722   20282
C             1329    1334    1329
A              672    1200    1194
D              555     186     499
G              246     441     393
F              205     206     193
H               19      21      19
J                3       6       2
CRF02_AG       867     787     829
CRF01_AE       414     419     416
other CRF        -     653     806
unassigned    2719    2042    1055
Comparison of STAR, REGA & COMET
(27017 prot-RT sequences)
- All 3 tools agreed in 88.3% of cases (23854)
– 22352 PURE, 777 CRF, 725 unassigned
- All 3 tools disagreed in only 0.1% of cases (30)
- COMET & REGA agreed in 6.4% of cases (1722); STAR disagreed
– 1034 PURE, 582 CRF, 106 unassigned
- COMET & STAR agreed in 4.0% of cases (1090); REGA disagreed
– 910 PURE, 40 CRF, 140 unassigned
- REGA & STAR agreed in 1.2% of cases (321); COMET disagreed
– 77 PURE, 8 CRF, 236 unassigned
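The breakdown above can be reproduced from per-sequence calls with a small helper. The sketch below is only an illustration: the function names `agreement_pattern` and `summarize` are hypothetical, and the per-sequence calls themselves are not part of this deck, so only the reported counts are re-checked against the dataset size.

```python
from collections import Counter


def agreement_pattern(star, rega, comet):
    """Label which of the three calls agree for one sequence."""
    if star == rega == comet:
        return "all agree"
    if comet == rega:
        return "COMET & REGA agree"
    if comet == star:
        return "COMET & STAR agree"
    if rega == star:
        return "REGA & STAR agree"
    return "all disagree"


def summarize(calls):
    """Return {pattern: (count, percent)} for a list of (star, rega, comet)."""
    counts = Counter(agreement_pattern(*c) for c in calls)
    total = len(calls)
    return {k: (n, round(100 * n / total, 1)) for k, n in counts.items()}


# Counts reported on the slide; they add up to the dataset size of 27017.
reported = {"all agree": 23854, "all disagree": 30,
            "COMET & REGA agree": 1722, "COMET & STAR agree": 1090,
            "REGA & STAR agree": 321}
reported_pct = {k: round(100 * n / 27017, 1) for k, n in reported.items()}
```

The five categories are mutually exclusive and exhaustive, which is why the reported counts sum to exactly 27017.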
Comparison of REGA & COMET to LANL
Of the 27017 sequences in the dataset, 24735 had a subtype (PURE, CRF, URF) assigned in the LANL database.
For the comparison, 24576 sequences were analyzed (PURE, CRF: 01_AE → 14_BG, URF).
- REGA & LANL agreed in 93.9% of cases (23077) and disagreed in 6.1% (1499). Fleiss kappa = 0.84
- COMET & LANL agreed in 96.9% of cases (23818) and disagreed in 3.1% (758). Fleiss kappa = 0.92
“The Fleiss kappa measure calculates the degree of agreement in classification over that which would be expected by chance and is scored as a number between 0 and 1.”
Cohen Kappa
Columns: subtype, REGA ↔ LANL, COMET ↔ LANL, training set. Rows listing fewer values had empty cells on the slide.

01_AE 0.98 0.98 5
02_AG 0.92 0.93 6
03_AB 2
04_CPX 0.86 0.86 4
05_DF 1 3
06_CPX 0.83 0.77 5
07_BC 1 0.98 4
08_BC 0.97 0.97 2
09_CPX 0.8 4
10_CD -1.09E-04 2
11_CPX 0.64 0.64 3
12_BF 0.65 0.61 5
13_CPX 0.8 0.8 3
14_BG 2
A 0.96 0.96 7 A1, 2 A2
B 0.92 0.98 7
C 0.99 0.99 6
D 0.41 0.94 6
F 0.94 0.92 6 F1, 2 F2
G 0.9 0.91 4
H 0.97 0.91 4
J 0.5 0.5 3
K 2
URF 0.38 0.55
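For readers who want to compute such values themselves, here is a minimal sketch of Cohen's kappa for two label lists, plus a one-vs-rest variant per subtype. The slides do not state how the per-subtype values were derived, so the one-vs-rest collapsing is an assumption, and the names `cohen_kappa` and `per_subtype_kappa` are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long lists of categorical labels."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two raters' label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def per_subtype_kappa(labels_a, labels_b, subtype):
    """One-vs-rest kappa: collapse every label to 'subtype' or 'other'."""
    def collapse(labels):
        return [x if x == subtype else "other" for x in labels]
    return cohen_kappa(collapse(labels_a), collapse(labels_b))
```

For example, `cohen_kappa(["B", "B", "C", "C"], ["B", "B", "C", "B"])` has 75% observed and 50% chance agreement, giving a kappa of 0.5. Kappa can also fall slightly below 0 when observed agreement is no better than chance.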
Benchmark
Analysis of the 27017 prot-RT sequences:
- 392 ± 2 seconds (6 ½ minutes) on an Opteron server (2 x quad-core, 2.5 GHz) → 68 prot-RT sequences / second
- 144 ± 0 seconds (2 ½ minutes) on a new Intel server (2 x quad-core, newest generation, 2.93 GHz) → 187 prot-RT sequences / second
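The throughput figures follow directly from the run times. A short check, with the hypothetical helper name `throughput`, suggests the slide truncates rather than rounds the per-second rates:

```python
def throughput(n_sequences, seconds):
    """Sequences processed per second."""
    return n_sequences / seconds


opteron = throughput(27017, 392)  # roughly 68.9 sequences per second
intel = throughput(27017, 144)    # roughly 187.6 sequences per second
speedup = 392 / 144               # roughly 2.7x between the two servers
```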
Ultra-deep sequencing (UDS) application
In-house UDS (454) software:
- alignment, trimming
- filtering
- compressing
- automatic correction of homopolymer count & “carry forward” errors
- …
- added adapted COMET module with bootstrap analysis (100 values per
sequence, threshold 75%)
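The slides do not describe how the bootstrap is carried out, so the following is only one plausible sketch: positions of a read are resampled with replacement, each replicate votes for the subtype with the highest summed log-probability, and the read is left unassigned when the winning subtype gets less than 75% of 100 votes. The function name `bootstrap_call`, the input layout and the fixed seed are assumptions.

```python
import random
from collections import Counter


def bootstrap_call(log_probs, replicates=100, threshold=0.75, seed=0):
    """Resample positions with replacement and vote on the best subtype.

    log_probs maps subtype -> list of per-position log-probabilities.
    Returns (subtype, support), or ("unassigned", support) below threshold.
    """
    rng = random.Random(seed)
    n = len(next(iter(log_probs.values())))
    votes = Counter()
    for _ in range(replicates):
        # One bootstrap replicate: n positions drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        winner = max(log_probs, key=lambda s: sum(log_probs[s][i] for i in idx))
        votes[winner] += 1
    subtype, count = votes.most_common(1)[0]
    support = count / replicates
    return (subtype if support >= threshold else "unassigned", support)
```

A read whose positions all favour one subtype gets full support, while a read with evenly mixed signal tends to fall below the threshold and stays unassigned.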
UDS application, dataset:
64 patients from Rwanda AMATA study
454 Sequence length: 333 bp (454, RT, AA 88 → 198)
Total sequences analyzed: 267749 (seq. with frameshifts excluded)
Time needed for analysis (100 bootstraps / seq. ): 5 ½ minutes
Sanger (prot-RT)
(URF: 2 AC, 5 CA, 1 CAC, 1 AD, 2 CD, 1 DC, 1 GH)
UDS application, results:
21 out of 64 patients (32.81%) appear to be dually infected with two different subtypes
Columns: COMET subtype (patient, major subtype and number, minor subtype and number, unassigned, minority %), confirmation (REGA, STAR, jpHMM, man. align. insp.), Sanger. Empty cells are shown as "-".

patient  major  number  minor  number  unassigned  minority %  REGA   STAR   jpHMM  man. align. insp.  Sanger
5        A1     4312    C      1       -           0.02        ok     A1/u   ok     ok                 URF_CA
8        D      6853    A1     1       57          0.01        ok     ok     ok     ok                 D
9        C      6603    A1     14      28          0.21        u/A1   u/A1   H/A1   C-H?/A1            URF_GH
17       A1     5727    C      3       -           0.05        ok     ok     ok     ok                 A1
18       C      3279    A1     5       -           0.15        ok     ok     ok     ok                 C
21       A1     2856    C      4       -           0.14        u/u    ok     ok     ok                 A1
22       C      5995    A1     5       -           0.08        u/A1   ok     ok     ok                 C
24       A1     6361    C      13      -           0.2         u/C    ok     ok     ok                 A1
25       C      6412    A1     15      -           0.23        C/u    ok     ok     ok                 URF_CD
26       A1     7350    C      1       -           0.01        u/C    ok     ok     ok                 C
32       C      6094    A1     11      -           0.18        C/u    ok     ok     ok                 URF_DC
33       A1     2226    C      1       -           0.04        ok     ok     ok     ok                 A1
35       A1     4864    C      4       -           0.08        A1/u   ok     ok     ok                 A1
36       A1     670     C      1       -           0.15        ok     ok     ok     ok                 A1
47       A1     3290    C      2       -           0.06        u/C    ok     ok     ok                 A1
48       A1     4120    C      1       -           0.02        u/C    ok     ok     ok                 A1
49       C      5279    A1     58      -           1.09        ok     ok     ok     ok                 C
64       C      1695    A1     9       -           0.53        ok     ok     ok     ok                 URF_CA
65       A1     6346    C      8       -           0.13        A1/u   ok     ok     ok                 A1
73       C      3335    A1     1       -           0.03        ok     ok     ok     ok                 C
79       A1     3244    C      3       -           0.09        ok     ok     ok     ok                 A1
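The minority percentages in the table are consistent with the minor-subtype read count divided by all reads of the patient. The sketch below assumes that unassigned reads are included in the denominator; the two rows with unassigned reads round to the same value either way, so this cannot be confirmed from the slide. The helper name `minority_percent` is hypothetical.

```python
def minority_percent(major, minor, unassigned=0):
    """Share of minor-subtype reads among all reads of a patient, in percent."""
    return round(100 * minor / (major + minor + unassigned), 2)


# Share of the 64 patients with reads from two subtypes, as on the slide.
dual_share = round(100 * 21 / 64, 2)
```

For patient 49, `minority_percent(5279, 58)` reproduces the 1.09% listed above, the largest minority fraction in the table.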
Summary
- Reliable prediction of HIV-1 subtype
- Generally it is best to compare the results of different approaches to
define the subtype of a sequence
- High performance and scalability
– suitable for deep sequencing (454) analysis
- In preparation: stand-alone desktop version with possibility to inspect the