Retroviruses integrate into a shared, non-palindromic motif
Paul Kirk
MASAMB 2016, Cambridge October 4, 2016
Central dogma of molecular biology (Crick, 1956)
General transfers of biological sequential information:
DNA → DNA (replication); DNA → RNA (transcription); RNA → Protein (translation)
MRC | Medical Research Council
There are also special transfers of sequential information.
For example: retroviruses
A retrovirus:
[Figure: a retrovirus particle, labelled with viral RNA, reverse transcriptase, integrase and protease]
Retroviruses are obligate parasites: they require a host cell to complete their “life”-cycle. Examples: HIV, HTLV-1, . . . .
For example: retroviruses
[Figure: reverse transcriptase copies the viral RNA into viral DNA; integrase then inserts the viral DNA into the host DNA, forming a provirus]
Characterising retroviral integration sites
[Figure: integration as cut and paste: the host DNA is cut ("CUT!") and the retrovirus DNA intermediate is pasted in ("PASTE!"), forming the provirus within the host DNA]
We would like to characterise the target integration site.
Aligning integration sites
Given a collection of integration sites, we can align them according to the position of the provirus, and then ignore/remove/mask the provirus sequence, so that we just look at the target sites.
[Figure: integration sites 1 to 5, aligned at the position of the provirus]
Summarising a collection of target sites
Example (5 sequences)
Sequences: ...ATC... ...TTA... ...AAC... ...TTC... ...AGC...
Complements: ...TAG... ...AAT... ...TTG... ...AAG... ...TCG...
Reverse complements: ...GAT... ...TAA... ...GTT... ...GAA... ...GCT...
Consensus sequence
Just take the most frequent letter at each position: ...ATC...
Position probability matrix (PPM), P
Estimate the probability of each letter at each position:
P =
    A  ...  3/5  1/5  1/5  ...
    T  ...  2/5  3/5   0   ...
    C  ...   0    0   4/5  ...
    G  ...   0   1/5   0   ...
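As a minimal sketch of the two summaries above, the following Python computes the consensus sequence and the PPM for the five example sequences. The function names (`consensus`, `position_probability_matrix`) are illustrative choices and are not taken from the talk or from its accompanying Matlab code.

```python
from collections import Counter
from fractions import Fraction

BASES = "ATCG"

def consensus(seqs):
    """Most frequent letter at each aligned position.

    Ties are broken by whichever letter appears first in the column.
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def position_probability_matrix(seqs):
    """Return {base: [estimated probability at each position]}."""
    n = len(seqs)
    columns = [Counter(col) for col in zip(*seqs)]
    return {b: [Fraction(c[b], n) for c in columns] for b in BASES}

seqs = ["ATC", "TTA", "AAC", "TTC", "AGC"]
P = position_probability_matrix(seqs)

print(consensus(seqs))               # ATC
print([str(p) for p in P["A"]])      # ['3/5', '1/5', '1/5']
```

Exact fractions are used here only so that the output can be compared directly with the matrix on the slide; floating-point proportions would serve equally well for real data.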
Reverse complement PPM, P(RC)
The PPM for the reverse complement sequences:
P(RC) =
    A  ...   0   3/5  2/5  ...
    T  ...  1/5  1/5  3/5  ...
    C  ...   0   1/5   0   ...
    G  ...  4/5   0    0   ...
Note: we can get P(RC) from P (and vice versa) by swapping the rows A ↔ T and C ↔ G, and reversing the order of the columns.
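The row-swap and column-reversal rule in the note can be sketched in a few lines of Python. The dictionary layout for the PPM and the name `reverse_complement_ppm` are assumptions made for this illustration; the values of `P` are those of the five-sequence example.

```python
from fractions import Fraction as F

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement_ppm(ppm):
    """Swap rows A<->T and C<->G, then reverse the column order."""
    return {COMPLEMENT[base]: row[::-1] for base, row in ppm.items()}

# PPM of the five example sequences (ATC, TTA, AAC, TTC, AGC).
P = {
    "A": [F(3, 5), F(1, 5), F(1, 5)],
    "T": [F(2, 5), F(3, 5), F(0)],
    "C": [F(0), F(0), F(4, 5)],
    "G": [F(0), F(1, 5), F(0)],
}

P_rc = reverse_complement_ppm(P)
print([str(p) for p in P_rc["G"]])           # ['4/5', '0', '0']
print(reverse_complement_ppm(P_rc) == P)     # True
```

Applying the transformation twice returns the original matrix, which is why P(RC) can be obtained from P and vice versa.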
Palindromic consensus sequences for HTLV-1 and HIV-1 target integration sites
From 4,521 HTLV-1 target integration sites, we find the consensus:
From 13,442 HIV-1 target integration sites, we find the consensus:
The target integration sites are palindromic (as already known!)
Palindromic PPMs for HTLV-1 and HIV-1 target integration sites
For both HTLV-1 and HIV-1, we have P(RC) ≈ P.
[Figure: for HTLV-1 and for HIV-1, scatter plots of the entries of the PPM, P, and the entries of the reverse-complement PPM, P(RC) (both axes 0.1 to 0.6), with the line P(RC) = P and a 95% credible region]
Palindromic sequence logos
[Figure: sequence logos (information content in bits, 0.0 to 0.4) for HTLV-1 target sites (positions 1 to 13) and HIV-1 target sites (positions 1 to 12)]
An attack of aibohphobia
Are the target sites palindromic at the level of individual sequences, or just at the level of these summaries?
To answer this, we need a measure of how “palindromic” each sequence is.
The palindrome index
Write a sequence of length 2n (here n = 8) as
$$S = s_{-8}\, s_{-7} \cdots s_{-1}\, s_{1}\, s_{2} \cdots s_{8}.$$
Define
$$\rho(S) = \frac{1}{n} \sum_{i=1}^{n} I\big(s_i = c(s_{-i})\big),$$
where 2n is the sequence length, I is the indicator function, and c(x) is the complement of x (e.g. c(T) = A). (In practice, we use an “adjusted for chance” version, which is maximally 1, and is 0 if S is no more palindromic than expected by chance.)
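A minimal Python sketch of the palindrome index follows. The slide does not give the formula for the “adjusted for chance” version, so `adjusted_palindrome_index` below is an assumption: it uses the common chance-correction form (ρ − e) / (1 − e), with a hypothetical default e = 0.25 corresponding to independent, uniformly distributed bases. It has the stated properties (maximum 1, and 0 at chance level) but may differ from the version used in the talk.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def palindrome_index(seq):
    """rho(S): fraction of the n mirrored pairs (s_i, s_-i) that are complementary."""
    if len(seq) % 2:
        raise ValueError("sequence length must be even (2n)")
    n = len(seq) // 2
    left = seq[:n][::-1]   # s_-1, s_-2, ..., s_-n
    right = seq[n:]        # s_1, s_2, ..., s_n
    return sum(r == COMPLEMENT[l] for l, r in zip(left, right)) / n

def adjusted_palindrome_index(seq, expected=0.25):
    """Assumed chance adjustment: 1 for a perfect palindrome, 0 at chance level."""
    return (palindrome_index(seq) - expected) / (1 - expected)

print(palindrome_index("GAATTC"))   # 1.0 (a perfect reverse-complement palindrome)
print(palindrome_index("AAAAAA"))   # 0.0
```

In a real analysis the chance level would more plausibly be estimated from the base composition of the data rather than fixed at 0.25.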
Observed palindrome indices
[Figure: histograms of Adjusted Palindrome Index (API) scores for HTLV-1 and HIV-1 target sites (x-axis: API; y-axis: frequency), each showing the distribution of API scores and the API for the consensus; annotated values 0.88 and 0.87]
Where do the palindromes come from?
number of sequences?
Where do the palindromes come from?
The sequences may be a mixture of “forward” and “reverse complement” sequence orientations, e.g. in the noiseless case:
Sequence 1: AATTTAAGTGGAT (Forward)
Sequence 2: ATCCACTTAAATT (Reverse complement)
Sequence 3: ATCCACTTAAATT (Reverse complement)
Sequence 4: AATTTAAGTGGAT (Forward)
Sequence 5: ATCCACTTAAATT (Reverse complement)
Sequence 6: AATTTAAGTGGAT (Forward)
P =
         1    2    3    4    5    6    7    8    9    10   11   12   13
    A    1    0.5  0    0    0.5  0.5  0.5  0    0.5  0.5  0.5  0.5  0
    T    0    0.5  0.5  0.5  0.5  0    0.5  0.5  0.5  0    0    0.5  1
    C    0    0    0.5  0.5  0    0.5  0    0    0    0    0    0    0
    G    0    0    0    0    0    0    0    0.5  0    0.5  0.5  0    0
  = P(RC)
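The noiseless example can be checked numerically. The sketch below (helper names are illustrative, not from the talk) builds the six sequences, confirms that the motif itself is not its own reverse complement, and shows that the PPM of the equal mixture nevertheless satisfies P = P(RC).

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def ppm(seqs):
    """Return {base: [proportion at each position]}."""
    n = len(seqs)
    columns = [Counter(col) for col in zip(*seqs)]
    return {b: [c[b] / n for c in columns] for b in "ATCG"}

def reverse_complement_ppm(p):
    """Swap rows A<->T and C<->G, then reverse the column order."""
    return {COMPLEMENT[b]: row[::-1] for b, row in p.items()}

motif = "AATTTAAGTGGAT"
flipped = reverse_complement(motif)            # ATCCACTTAAATT
seqs = [motif, flipped, flipped, motif, flipped, motif]

P = ppm(seqs)
P_rc = reverse_complement_ppm(P)

print(motif == flipped)   # False: no single sequence is palindromic
print(P == P_rc)          # True: the summary is palindromic
```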
If we have a sample of many real numbers, and we take their mean and find it to be exactly zero, one possibility is that this mean is representative of the sample.
Another possibility is that we have 2 symmetric components, one positive and one negative.
Mixture modelling
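We model the target sites as a mixture of two subpopulations:
◮ one with PPM P; and
◮ one with reverse complement PPM P(RC).
The mixture density is
π(S) = ωπ(S|P) + (1 − ω)π(S|P(RC)),
where ω is the proportion of sequences drawn from the subpopulation with PPM P.
The model can be fitted in numerous ways. I will show results from using an EM algorithm, but identical results are obtained by: (i) maximum profile likelihood; (ii) Gibbs sampling; (iii) greedy Gibbs.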
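To make the mixture concrete, here is a small, self-contained EM sketch in Python. It is not the Matlab implementation referred to in the talk; the function names, the pseudocount, the number of iterations and the symmetry-breaking initialisation (starting P from the first sequence) are all assumptions made for illustration. It relies on the identity π(S | P(RC)) = π(rc(S) | P), where rc(S) is the reverse complement of S, and assumes positions are independent given the PPM.

```python
import math

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
BASES = "ACGT"

def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def log_lik(seq, ppm):
    """log pi(S | P), treating positions as independent."""
    return sum(math.log(ppm[j][b]) for j, b in enumerate(seq))

def weighted_ppm(seqs, weights, pseudo):
    """PPM as a list of per-position dicts, from weighted counts plus a pseudocount."""
    ppm = []
    for j in range(len(seqs[0])):
        counts = {b: pseudo for b in BASES}
        for s, w in zip(seqs, weights):
            counts[s[j]] += w
        total = sum(counts.values())
        ppm.append({b: counts[b] / total for b in BASES})
    return ppm

def em_orientation_mixture(seqs, n_iter=50, pseudo=0.01):
    rcs = [reverse_complement(s) for s in seqs]
    # Assumed initialisation: a uniform start is a fixed point of EM here,
    # so break the symmetry by treating the first sequence as "forward".
    ppm = weighted_ppm([seqs[0]], [1.0], pseudo=0.5)
    omega, resp = 0.5, [0.5] * len(seqs)
    for _ in range(n_iter):
        # E-step: posterior probability that each sequence is in the P component.
        resp = []
        for s, rc in zip(seqs, rcs):
            a = math.log(omega) + log_lik(s, ppm)
            b = math.log(1 - omega) + log_lik(rc, ppm)
            m = max(a, b)
            resp.append(math.exp(a - m) / (math.exp(a - m) + math.exp(b - m)))
        # M-step: each sequence counts towards P as itself (weight r)
        # and as its reverse complement (weight 1 - r).
        ppm = weighted_ppm(seqs + rcs, resp + [1 - r for r in resp], pseudo)
        omega = min(max(sum(resp) / len(resp), 1e-6), 1 - 1e-6)
    return ppm, omega, resp

motif = "AATTTAAGTGGAT"
flipped = reverse_complement(motif)
seqs = [motif, flipped, flipped, motif, flipped, motif]
ppm, omega, resp = em_orientation_mixture(seqs)
recovered = "".join(max(col, key=col.get) for col in ppm)
print(recovered, round(omega, 3))
```

On the noiseless six-sequence example, the responsibilities separate the two orientations and the fitted P has the non-palindromic motif as its consensus. Which orientation ends up labelled “forward” depends only on the initialisation, since the model is symmetric under swapping P and P(RC).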
Unmixing the forward and reverse sequences
[Figure: sequence logos (bits by position) of the two inferred subpopulations, Subpopulation 1 and Subpopulation 2, for HIV-1 (13442 seqs) and HTLV-1 (4521 seqs)]
[Figure: sequence logos (bits by position) of the unmixed motifs for HIV-1, HTLV-1, ASLV and MLV]
Summary
The palindromic summaries arise from sequences that contain a non-palindromic motif in approximately equal proportions in “forward” and “reverse complement” orientations.
The motif is shared across 4 retroviruses: 5’-T(N1/2)[C(N0/1)T|(W1/2)C]CW-3’
[…] retroviral intasomes.
Availability
◮ Kirk, Huvet, Melamed, Maertens & Bangham (2015). Retroviruses integrate into a shared, non-palindromic motif. bioRxiv.
Matlab code (and the HTLV-1 dataset) are available online: http://www.mrc-bsu.cam.ac.uk/software/bioinformatics-and-statistical-genomics/
Just click on retroCode to download!
Acknowledgements
Charles Bangham, Maxime Huvet, Anat Melamed, Goedele Maertens, Sylvia Richardson, MRC Biostatistics Unit, Michael Stumpf, Imperial College Theoretical Systems Biology group
Thanks for listening!
@pauldwkirk http://www.mrc-bsu.cam.ac.uk/people/paul-kirk/