Guideline Introduction Methods Results Outlook
Determining coding CpG islands as regions significant for Markov chain based counting statistics
Alexander Schönhuth, Centrum Wiskunde & Informatica, Amsterdam
Joint work with Meromit
Guideline
Introduction
- Cytosine Deamination
- Problem Definition
Methods
- The Null Model
- The Algorithm
Results
- Epigenetic Association
- New Findings
Outlook
Introduction
Cytosine Deamination
- Degradation of CpG dinucleotides is more frequent than for other dinucleotides.
- Methylated cytosines mutate to thymine through deamination (C → T).
- CpG islands: substrings in the genome with unusually high CpG content.
[Figure: a methylated CG (CH3) becomes TG through deamination; a CpG island is shown as a CG-rich stretch of sequence.]
- CpG islands are not affected by neutral mutation rates due to epigenetic constraint ☞ computational inference is possible.
- Still the most popular criteria:
  - Gardiner-Garden / Frommer: length ≥ 200 bp, GC fraction ≥ 0.5, CpG Obs/Exp ≥ 0.6
  - Takai / Jones: length ≥ 500 bp, GC fraction ≥ 0.55, CpG Obs/Exp ≥ 0.65
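The threshold criteria above can be sketched in a few lines. This is a minimal illustration, not the original authors' code: the function names `cpg_stats` and `passes_criteria` are hypothetical, and the Obs/Exp ratio is assumed to be the usual (CpG count × length) / (C count × G count).

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (GC fraction, CpG observed/expected ratio) of a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    # Count CG dinucleotides at every position.
    cpg = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if neighbouring bases were independent: c * g / n.
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def passes_criteria(seq: str, min_len: int = 200, min_gc: float = 0.5,
                    min_obs_exp: float = 0.6) -> bool:
    """Check the three thresholds; the defaults follow Gardiner-Garden / Frommer."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and obs_exp >= min_obs_exp
```

Calling `passes_criteria(seq, min_len=500, min_gc=0.55, min_obs_exp=0.65)` applies the Takai / Jones thresholds instead.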
Generic Motivation
Computation of CpG Islands
Input: A genome $G$ or a set of genomic sequences $G_i$ (exons in the following).
Output: A set of non-overlapping substrings $G_1, ..., G_L$ which are “most significant” in terms of their CpG content.
- Thereby one would like to control the false discovery rate $E(V/L)$, where $V$ is the number of false positives among the $L$ reported substrings; that is, the expected fraction of false positives.
Methods
Definitions
- Let $\Sigma = \{A, C, G, T\}$, let $G \in \Sigma^n$ be an n-mer, and let $|G|$ and $\#(G, CG)$ denote the length of $G$ and the number of CG occurrences in $G$.
- For example, $G = CGACG$: $|G| = 5$, $\#(G, CG) = 2$.
- Let $Z_n$ be the random variable defined by $Z_n : \Sigma^n \to \mathbb{N}$, $G \mapsto \#(G, CG)$.
- Let $G$ be a genomic substring of length $n$ and let $m := \#(G, CG)$.
- Consider the tail probability $p(G) := p_{n,m} := P(\{Z_n \ge m\})$, that is, the probability that a randomly drawn n-mer contains at least $m$ CGs.
Wanted: Genomic substrings $G$ with significantly small $p(G)$.
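The definitions of $\#(G, CG)$ and the tail probability can be checked by brute force for small $n$. The sketch below assumes, purely for illustration, that every n-mer is equally likely; the slides leave the distribution $P$ open at this point, and the function names are hypothetical.

```python
from itertools import product


def count_cg(seq: str) -> int:
    """Number of CG occurrences in seq, i.e. #(G, CG)."""
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")


def tail_probability_uniform(n: int, m: int) -> float:
    """P(Z_n >= m) when all 4**n n-mers are equally likely (illustrative assumption)."""
    hits = sum(
        1
        for letters in product("ACGT", repeat=n)
        if count_cg("".join(letters)) >= m
    )
    return hits / 4 ** n
```

Enumeration costs $4^n$ steps, so this is only usable as a sanity check for very short sequences.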
Methods
Problem Specification
Computation of CpG Islands
Input: A genome $G$ or a set of genomic sequences $G_i$ (exons in the following), and a user-specified threshold $\alpha \in [0, 1]$.
Output: A set of non-overlapping substrings $G_1, ..., G_L$ in $G$ (or in the $G_i$) which minimize
$\sum_{l=1}^{L} p(G_l) = \sum_{l=1}^{L} p_{n_l, m_l}$
where $n_l := |G_l|$ and $m_l := \#(G_l, CG)$, such that $E(V/L) \le \alpha$.
- Some additional, biologically reasonable constraints will apply.
- Still missing: Specification of P.
Null Model
Markov Chains
Standard hidden Markov model for CpG island detection.
Issue: Specification of an “island model” is necessary.
Null Model
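Markov Chains
Parameter estimation for a null model alone is straightforward: collect dinucleotide frequencies into the Markov transition probability matrix
$M = \begin{pmatrix} p_{AA} & p_{AC} & p_{AG} & p_{AT} \\ p_{CA} & p_{CC} & p_{CG} & p_{CT} \\ p_{GA} & p_{GC} & p_{GG} & p_{GT} \\ p_{TA} & p_{TC} & p_{TG} & p_{TT} \end{pmatrix}$.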
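Collecting dinucleotide frequencies into $M$ might look as follows. This is a sketch under assumptions: the pseudocount (to avoid empty rows) and the skipping of non-ACGT symbols are choices made here for illustration, not something the slides specify.

```python
ALPHABET = "ACGT"
INDEX = {base: i for i, base in enumerate(ALPHABET)}


def estimate_transition_matrix(sequences, pseudocount=1.0):
    """Row-normalised dinucleotide counts: M[a][b] estimates P(next = b | current = a)."""
    counts = [[pseudocount] * 4 for _ in range(4)]
    for seq in sequences:
        seq = seq.upper()
        for a, b in zip(seq, seq[1:]):
            # Skip dinucleotides containing ambiguous symbols such as N.
            if a in INDEX and b in INDEX:
                counts[INDEX[a]][INDEX[b]] += 1.0
    return [[c / sum(row) for c in row] for row in counts]
```

Rows and columns are ordered A, C, G, T, so `M[1][2]` corresponds to $p_{CG}$.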
Methods
Computation of Probabilities
- Consider the probability vectors
$\pi_{n,m} = [\pi_{n,m}(A), \pi_{n,m}(C), \pi_{n,m}(G), \pi_{n,m}(T)] \in [0, 1]^4$
where $\pi_{n,m}(x)$ is the probability that the Markov chain generates a sequence of length $n$ which contains at least $m$ CGs and which ends in the nucleotide $x \in \{A, C, G, T\}$.
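- For all $n \in \mathbb{N}$ initialize $\pi_{n,0} = \pi$, where $\pi$ with $\pi^T M = \pi^T$ is the stationary distribution (left eigenvector) of the Markov chain.
- Recursively compute
$(\pi_{n,m})^T = (\pi_{n-1,m})^T \cdot \begin{pmatrix} p_{AA} & p_{AC} & p_{AG} & p_{AT} \\ p_{CA} & p_{CC} & 0 & p_{CT} \\ p_{GA} & p_{GC} & p_{GG} & p_{GT} \\ p_{TA} & p_{TC} & p_{TG} & p_{TT} \end{pmatrix} + (\pi_{n-1,m-1})^T \cdot \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & p_{CG} & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}$
The first matrix is $M$ with the entry $p_{CG}$ set to zero; the second contains only the entry $p_{CG}$.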
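The recursion can be sketched as a small dynamic program. The code below is an illustration under assumptions, not the authors' implementation: it takes $M$ as a 4×4 list of lists in the order A, C, G, T, approximates the stationary distribution by power iteration (which presumes an irreducible, aperiodic chain), uses the base case $\pi_{1,m} = 0$ for $m \ge 1$, and returns the tail probability as $p_{n,m} = \sum_x \pi_{n,m}(x)$.

```python
A, C, G, T = range(4)


def stationary_distribution(M, iterations=1000):
    """Approximate pi with pi^T M = pi^T by repeated multiplication."""
    pi = [0.25] * 4
    for _ in range(iterations):
        pi = [sum(pi[y] * M[y][x] for y in range(4)) for x in range(4)]
    return pi


def tail_probability(M, n, m):
    """P(Z_n >= m) for a first-order chain started in its stationary distribution."""
    pi = stationary_distribution(M)
    # prev[j][x] holds pi_{length, j}(x); a length-1 sequence contains no CG.
    prev = [pi[:]] + [[0.0] * 4 for _ in range(m)]
    for _ in range(2, n + 1):
        cur = [pi[:]]  # pi_{length, 0} = pi for every length
        for j in range(1, m + 1):
            row = []
            for x in range(4):
                # Transitions that do not create a new CG keep the count at >= j.
                total = sum(
                    prev[j][y] * M[y][x]
                    for y in range(4)
                    if not (y == C and x == G)
                )
                # The transition C -> G adds one CG to a prefix with >= j - 1.
                if x == G:
                    total += prev[j - 1][C] * M[C][G]
                row.append(total)
            cur.append(row)
        prev = cur
    return sum(prev[m])
```

The running time is proportional to $n \cdot m$ with a constant factor of 16, in contrast to the $4^n$ cost of enumerating all n-mers.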
Bona Fide Islands
Significance Vs. Epigenetic Score
[Figure: ROC plots (hit rate vs. false alarm rate), panels A and B, each comparing episcore, p-value and CpG Obs/Exp.]
ROC Plots: p-values vs. epigenetic score vs. CpG Obs/Exp on bona fide islands for prediction of open chromatin and differential methylation
Exonic CpG Islands
Coding Constraint vs. Epigenetic Constraint
- In exons, CpGs are preserved due to both coding and epigenetic constraint.
- Coding CpG island: exonic substring with significant CpG content due to epigenetic constraint.
[Figure: CG dinucleotides under coding constraint and under epigenetic constraint; table of the genetic code.]
Null Model
5-th order Markov chain
Context   A              C              G              T
AAAAA     P(A^5 → A)     P(A^5 → C)     P(A^5 → G)     P(A^5 → T)
AAAAC     P(A^4C → A)    P(A^4C → C)    P(A^4C → G)    P(A^4C → T)
AAAAG     P(A^4G → A)    P(A^4G → C)    P(A^4G → G)    P(A^4G → T)
AAAAT     P(A^4T → A)    P(A^4T → C)    P(A^4T → G)    P(A^4T → T)
...       ...            ...            ...            ...
TTTTA     P(T^4A → A)    P(T^4A → C)    P(T^4A → G)    P(T^4A → T)
TTTTC     P(T^4C → A)    P(T^4C → C)    P(T^4C → G)    P(T^4C → T)
TTTTG     P(T^4G → A)    P(T^4G → C)    P(T^4G → G)    P(T^4G → T)
TTTTT     P(T^5 → A)     P(T^5 → C)     P(T^5 → G)     P(T^5 → T)
- $4^6 = 4096$ transition probabilities ($4^5$ contexts × 4 successors) to be learned from data
- Needed: Dinucleotide counting statistics on 5-th order Markov chains
- Goal: Determine significance of exonic substrings
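Learning such a table from exon sequences could be sketched as follows. The function name, the pseudocount and the handling of ambiguous symbols are assumptions made for this illustration; only contexts that actually occur in the training data receive a row.

```python
from collections import defaultdict


def estimate_kth_order(sequences, k=5, pseudocount=1.0):
    """Map each observed length-k context to a distribution over the next base."""
    counts = defaultdict(lambda: {base: pseudocount for base in "ACGT"})
    for seq in sequences:
        seq = seq.upper()
        for i in range(k, len(seq)):
            context, nxt = seq[i - k:i], seq[i]
            # Ignore windows that contain symbols other than A, C, G, T.
            if set(context + nxt) <= set("ACGT"):
                counts[context][nxt] += 1.0
    return {
        context: {base: c / sum(row.values()) for base, c in row.items()}
        for context, row in counts.items()
    }
```

With `k=5`, a lookup such as `model["AAAAC"]["G"]` plays the role of the table entry P(A^4C → G).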
Coding Vs. Non Coding Model
Differences
[Figure: Histogram of log_ratio_cd_wg (x-axis: log_ratio_cd_wg, y-axis: frequency).]
Figure: Fold change in the log p-value for genome null model versus coding null model. 82,505 exons have p-values that differ by more than a factor of 2.
The Algorithm
Determining Island Candidates
- Order all ($K$ many) exonic substrings $G^e_s$ which are bounded by CG at both ends by their tail probabilities $p_{cg \to cg}(G^e_s)$:
$p_{cg \to cg}(G_1) \le \cdots \le p_{cg \to cg}(G_K)$.
- Fix a false discovery rate $\alpha$ and determine
$k^* := \max\{i : p_i \le \frac{i}{K} \alpha\}$.
Declare the $G_i$, $1 \le i \le k^*$, candidate CCGIs. ☞ Benjamini-Hochberg procedure
[Figure: selection of candidate substrings bounded by CGs within an exon / chromosome.]
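The selection step is the standard Benjamini-Hochberg step-up rule, which can be sketched as below. The function name and the representation of the candidates as a plain list of p-values are assumptions for illustration.

```python
def benjamini_hochberg(p_values, alpha):
    """Return the indices of the hypotheses rejected at false discovery rate alpha."""
    K = len(p_values)
    # Sort indices by increasing p-value, so rank i carries the i-th smallest p-value.
    order = sorted(range(K), key=lambda i: p_values[i])
    k_star = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / K * alpha:
            k_star = rank  # keep the largest rank that satisfies the bound
    return order[:k_star]
```

Note that all hypotheses up to rank $k^*$ are rejected, even those whose own p-value exceeds its individual bound, because $k^*$ is defined as a maximum.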
The Algorithm
Computing Islands: The Greedy Algorithm
- Let $N(G_i)$ be the set of all candidate islands which overlap with $G_i$ (including $G_i$ itself).
- For the candidate islands $G_i$, $i \le K$, replace $p_{cg \to cg}(G_i)$ by $p(G_i)$ and reorder if necessary.
- GREEDYCCGI
    CCGI ← ∅, CAND ← {G_i, 1 ≤ i ≤ K}
    while CAND ≠ ∅ do
        i* ← argmin_i {p(G_i) | G_i ∈ CAND}
        CCGI ← CCGI ∪ {G_i*}
        CAND ← CAND \ N(G_i*)
    end while
    Output CCGI as the set of coding CpG islands.
[Figure: computation of islands from overlapping candidates.]
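A minimal Python sketch of the greedy selection follows. It assumes, for illustration only, that each candidate is a tuple `(start, end, p)` with a half-open interval `[start, end)` on a single sequence; the slides do not prescribe a data layout.

```python
def greedy_islands(candidates):
    """Repeatedly pick the candidate with the smallest p and drop everything it overlaps."""
    remaining = sorted(candidates, key=lambda cand: cand[2])
    islands = []
    while remaining:
        best = remaining[0]
        islands.append(best)
        # Keep only candidates lying entirely to the left or right of the chosen one.
        remaining = [
            cand
            for cand in remaining[1:]
            if cand[1] <= best[0] or cand[0] >= best[1]
        ]
    return islands
```

Sorting once and filtering is equivalent to taking the argmin in every round, since removing candidates does not change the order of those that remain.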
The Algorithm
The Greedy Algorithm
Theorem:
- Let $(G_i)_{1 \le i \le K}$ be possibly overlapping substrings of a superstring $G$.
- Let $S(G_i)$ be a score for a substring $G_i$.
- Applying GREEDYCCGI yields $L$ non-overlapping substrings $(H_l)_{1 \le l \le L}$ such that
  - $\sum_{l=1}^{L} S(H_l)$ is minimized
  - subject to the constraint $S(G) \ge \min_{l \le j \le k} S(H_j)$ for all $1 \le l, k \le L$ and $G \subset \overleftrightarrow{H_l H_k}$,
where $\overleftrightarrow{H_l H_k}$ is the substring covering everything in between $H_{l-1}$ and $H_{k+1}$.
Remark: The constraint says that the most significant substring in the area between any two islands is also an island (or there is no island).
Results
Overlap with Functional Regions
Set          Value      Overall   OC     17-Cons   1st Ex.   Alt. TSS
All CCGIs    No.        12445     2734   11041     5539      7248
             Fraction   -         0.21   0.88      0.44      0.58
No GFCGIs    No.        3000      189    2687      433       1248
             Fraction   -         0.06   0.89      0.14      0.41
No HMMCGIs   No.        802       25     706       82        286
             Fraction   -         0.03   0.87      0.10      0.35
Overlap of CCGI sets with different functional regions
- OC: Open Chromatin
- 17-Cons: UCSC 17-Way Conservation Track
- 1st Ex.: Overlap with Initial Exons
- Alt. TSS: Alternative Transcription Start Site
Results
Differential Methylation
          Coding CGIs                 HMM CGIs                    GFCGIs
Dist.     No      R / P / F-M         No      R / P / F-M         No      R / P / F-M
          12934   22.5 / 3.1 / 5.5    26848   53.5 / 3.6 / 6.7    16320   31.1 / 3.4 / 6.2
No Cov    12923   22.5 / 3.1 / 5.5    11809   26.3 / 4.0 / 7.0    5465    14.2 / 4.7 / 7.0
1         12324   21.9 / 3.2 / 5.6    3785    7.3 / 3.4 / 4.7     1213    3.4 / 5.0 / 4.1
15        4957    12.1 / 4.1 / 6.2    3138    6.5 / 3.5 / 4.6     946     2.9 / 5.2 / 3.7
30        2870    9.0 / 5.0 / 6.5     2561    5.8 / 3.6 / 4.5     790     2.6 / 5.2 / 3.4
45        1899    6.4 / 5.1 / 5.6     2084    5.4 / 3.9 / 4.5     671     2.3 / 5.2 / 3.1
60        1365    4.9 / 5.0 / 4.9     1753    5.3 / 4.2 / 4.7     602     2.3 / 5.3 / 3.2
90        859     3.9 / 5.4 / 4.5     1365    5.3 / 4.5 / 4.9     468     2.3 / 5.8 / 3.3
120       632     3.9 / 6.2 / 4.8     1098    5.0 / 4.6 / 4.8     357     2.2 / 6.2 / 3.3
Differential methylation relative to exon location
R: recall, P: precision, F-M: F-measure
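Assuming that F-M denotes the usual F1 score, that is, the harmonic mean of recall and precision (the slides do not state the formula), the third number of each triple can be recomputed from the first two:

```python
def f_measure(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision (F1); 0.0 if both are zero."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)
```

Because the table shows recall and precision already rounded to one decimal, recomputed values can differ from the printed F-measure in the last digit.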
Results
Alternatively Spliced Exons
       Exons    CCGIs (12445)          HMMCGIs (26020)        GFCGIs (15821)
∩      No       No     Pr     p        No     Pr     p        No     Pr     p
≥ 2    4,092    443    3.56   2e-25    767    2.95   3e-20    453    2.86   4e-10
≥ 3    266      41     0.33   3e-7     56     0.22   6e-4     41     0.26   1e-4
≥ 4    31       3      0.02   0.33     2      0.01   0.94     2      0.01   0.74
Overlap with alternatively spliced exons
Pr: precision (in %), p: hypergeometric tail probability
Overall Exon Count: 190181
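A hypergeometric tail probability of the kind reported in the column p can be computed exactly with integer arithmetic. The sketch below is generic; which population size, success count and number of draws the authors plugged in is not stated on the slide, so the parameter names are hypothetical.

```python
from fractions import Fraction
from math import comb


def hypergeom_tail(population: int, successes: int, draws: int, observed: int) -> float:
    """P(X >= observed) when drawing `draws` items without replacement
    from `population` items of which `successes` are marked."""
    upper = min(successes, draws)
    total = comb(population, draws)
    numerator = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(max(observed, 0), upper + 1)
    )
    # Exact rational arithmetic avoids overflow and cancellation for large counts.
    return float(Fraction(numerator, total))
```

For example, `hypergeom_tail(10, 4, 3, 2)` evaluates (6·6 + 4·1) / 120 = 1/3.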
Results
HOX genes
[Figure: genome browser view of chr7, positions 27100000 to 27210000, showing the track "CCGIs within exons (total 12,849)" with p-values between 5.83e-04 and 8.70e-36 above the RefSeq genes HOXA1, HOXA2, HOXA3, HOXA4, HOXA5, HOXA6, HOXA7, HOXA9, MIR196B, HOXA10, HOXA11, HOXA11AS and HOXA13.]
Fig.: The HOX gene cluster was studied in [Branciamore et al., 2010]: CpGs in exons of HOX genes are subject to pro-epigenetic selection.
Results
Differential Methylation / Alternative Splicing
[Figure: two genome browser views at 500-base scale: chr14, positions 20625100 to 20626100 (FLJ10357, FLJ00056, ARHGEF40; CCGI p-value 1.92e-04), and chr1, positions 44255000 to 44256500 (SLC6A9; CCGI p-value 2.75e-05). Tracks include CCGIs, UCSC CpG islands, HMM islands, RefSeq genes, spliced ESTs, ENCODE histone marks (H3K4Me1, H3K4Me3), DNaseI hypersensitivity clusters, transcription factor ChIP-seq and RepeatMasker.]
(a) Differentially methylated CCGI in an alternatively spliced exon [methylated in embryonic stem cells, unmethylated in fetal lung fibroblasts]
(b) CCGI located in a gene with alternative transcription start sites
Outlook
Whole Genome
[Figure: two genome browser views showing the track "GSI (Greedy Stat-Islands) e11 (26,946 total)": chr1, positions 39261000 to 39270000 (NDUFS5; island p-value 3.852632e-16), and chr9, positions 106792000 to 106797000 (island p-value 1.984208e-23). Further tracks: UCSC genes, RefSeq genes, human ESTs, ENCODE histone marks (H3K4Me1, H3K4Me3), DNaseI hypersensitivity clusters, transcription factor ChIP-seq, CpG islands and RepeatMasker.]
Fig.: Whole Genome Statistically Significant Islands