Determining coding CpG islands as regions significant for Markov chain based counting statistics



slide-1
SLIDE 1

Guideline Introduction Methods Results Outlook

Determining coding CpG islands as regions significant for Markov chain based counting statistics

Alexander Schönhuth, Centrum Wiskunde & Informatica, Amsterdam

Joint work with Meromit Singer, Alexander Engström and Lior Pachter, UC Berkeley

Rutgers University, New Jersey, October 12, 2011

slide-2
SLIDE 2

Guideline

  • Introduction: Cytosine Deamination, Problem Definition
  • Methods: The Null Model, The Algorithm
  • Results: Epigenetic Association, New Findings
  • Outlook

slide-4
SLIDE 4

Introduction

Cytosine Deamination

  • Degradation of CpG dinucleotides is more frequent than for other constellations.
  • Methylated cytosines mutate to thymine through deamination (C → T).
  • CpG islands: substrings in the genome with unusually high CpG content.

[Figure: deamination of a methylated CG (CH3), shown on an example sequence with its CG sites marked.]

  • CpG islands are not affected by neutral mutation rates due to epigenetic constraint ☞ computational inference possible
  • Still most popular:
  • Gardiner-Garden / Frommer: length ≥ 200 bp, GC content ≥ 0.5, CpG Obs/Exp ≥ 0.6
  • Takai / Jones: length ≥ 500 bp, GC content ≥ 0.55, CpG Obs/Exp ≥ 0.65
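
The threshold criteria above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the Obs/Exp ratio is assumed to be the commonly used (#CpG × N) / (#C × #G), which the slides do not spell out.

```python
def cpg_stats(seq):
    """Return (length, GC fraction, CpG observed/expected ratio) of a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so str.count is exact
    gc_fraction = (c + g) / n if n else 0.0
    # Assumed definition of Obs/Exp: (#CpG * N) / (#C * #G)
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return n, gc_fraction, obs_exp


def meets_criteria(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Check a sequence against length, GC content and Obs/Exp thresholds."""
    n, gc_fraction, obs_exp = cpg_stats(seq)
    return n >= min_len and gc_fraction >= min_gc and obs_exp >= min_obs_exp
```

The defaults correspond to the Gardiner-Garden / Frommer thresholds; calling `meets_criteria(seq, 500, 0.55, 0.65)` applies the Takai / Jones thresholds instead.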
slide-6
SLIDE 6

Generic Motivation

Computation of CpG Islands

Input: A genome $G$ or a set of genomic sequences $G_i$ (exons in the following).

Output: A set of non-overlapping substrings $G_1, \dots, G_L$ which are “most significant” in terms of their CpG content.

  • One would like to control the false discovery rate $E(V/L)$, where $V$ = # false positives, that is, the expected fraction of false positives among the $L$ reported substrings.

slide-10
SLIDE 10

Methods

Definitions

  • Let $\Sigma = \{A, C, G, T\}$, let $G \in \Sigma^n$ be an $n$-mer, and let $|G|$ and $\#(G, CG)$ denote the length of $G$ and the number of CG occurrences in $G$.
  • For example, $G = CGACG$: $|G| = 5$, $\#(G, CG) = 2$.
  • Let $Z_n$ be the random variable defined by $Z_n : \Sigma^n \to \mathbb{N}$, $G \mapsto \#(G, CG)$.
  • Let $G$ be a genomic substring of length $n$ and $m := \#(G, CG)$.
  • Consider the tail probability $p(G) := p_{n,m} := P(\{Z_n \ge m\})$, the probability that a randomly drawn $n$-mer contains at least $m$ CGs.

Wanted: Genomic substrings $G$ with significantly small $p(G)$.
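
As a small illustration of these definitions, the sketch below counts CG occurrences and evaluates the tail probability by brute-force enumeration. It assumes, purely for illustration, that every $n$-mer is equally likely; the slides specify the actual distribution $P$ only later. Both function names are hypothetical.

```python
from itertools import product


def count_cg(seq):
    """#(G, CG): number of CG occurrences in the sequence."""
    return seq.upper().count("CG")


def tail_probability_uniform(n, m):
    """P(Z_n >= m) when all 4**n n-mers are equally likely (illustrative assumption)."""
    hits = sum(
        1
        for kmer in product("ACGT", repeat=n)
        if count_cg("".join(kmer)) >= m
    )
    return hits / 4 ** n
```

Enumeration is only feasible for very small $n$; it serves as a reference against which a faster computation can be checked.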

slide-12
SLIDE 12

Methods

Problem Specification

Computation of CpG Islands

Input: A genome $G$ or a set of genomic sequences $G_i$ (exons in the following) and a user-specified threshold $\alpha \in [0, 1]$.

Output: A set of non-overlapping substrings $G_1, \dots, G_L$ in $G$ (or in the $G_i$) which minimize the tail probabilities $p(G_l) = p_{n_l, m_l}$, aggregated over $l = 1, \dots, L$, where $n_l := |G_l|$ and $m_l := \#(G_l, CG)$, such that $E(V/L) \le \alpha$.

  • Some additional, biologically reasonable constraints will apply.
  • Still missing: Specification of $P$.
slide-13
SLIDE 13

Null Model

Markov Chains

Standard hidden Markov model for CpG island detection

Issue: Specification of an “island model” necessary.

slide-14
SLIDE 14

Null Model

Markov Chains

Parameter estimation for a null model alone is straightforward: collect the dinucleotide frequencies into the Markov transition probability matrix

$$M = \begin{pmatrix} p_{AA} & p_{AC} & p_{AG} & p_{AT} \\ p_{CA} & p_{CC} & p_{CG} & p_{CT} \\ p_{GA} & p_{GC} & p_{GG} & p_{GT} \\ p_{TA} & p_{TC} & p_{TG} & p_{TT} \end{pmatrix}.$$
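
A minimal sketch of this estimation step, assuming the training data is given as a list of plain DNA strings. The function name and the `pseudocount` parameter are hypothetical and not taken from the slides.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def estimate_transition_matrix(sequences, pseudocount=0.0):
    """Estimate a 4x4 first-order transition matrix (rows and columns in order A, C, G, T)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Count every adjacent pair, skipping pairs with ambiguous symbols such as N
        for a, b in zip(seq, seq[1:]):
            if a in NUCLEOTIDES and b in NUCLEOTIDES:
                counts[a, b] += 1
    matrix = []
    for a in NUCLEOTIDES:
        row = [counts[a, b] + pseudocount for b in NUCLEOTIDES]
        total = sum(row)
        # Normalise the row; fall back to a uniform row if nothing was observed
        matrix.append([x / total if total else 0.25 for x in row])
    return matrix
```

Each row is normalised separately, so row `C` holds the estimates of $p_{CA}, p_{CC}, p_{CG}, p_{CT}$.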

slide-16
SLIDE 16

Methods

Computation of Probabilities

  • Consider the probability vectors $\pi_{n,m} = [\pi_{n,m}(A), \pi_{n,m}(C), \pi_{n,m}(G), \pi_{n,m}(T)] \in [0, 1]^4$, where $\pi_{n,m}(x)$ is the probability that the Markov chain generates a sequence of length $n$ which contains at least $m$ CGs and which ends in the nucleotide $x \in \{A, C, G, T\}$.
  • For all $n \in \mathbb{N}$ initialize $\pi_{n,0} = \pi$, where $\pi^T M = \pi^T$ is the stationary eigenvector associated with the Markov chain.
  • Recursively compute

$$(\pi_{n,m})^T = (\pi_{n-1,m})^T \cdot \begin{pmatrix} p_{AA} & p_{AC} & p_{AG} & p_{AT} \\ p_{CA} & p_{CC} & 0 & p_{CT} \\ p_{GA} & p_{GC} & p_{GG} & p_{GT} \\ p_{TA} & p_{TC} & p_{TG} & p_{TT} \end{pmatrix} + (\pi_{n-1,m-1})^T \cdot \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & p_{CG} & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$
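
The recursion can be sketched as a small dynamic program. The following is an illustrative implementation under stated assumptions, not the authors' code: the matrix `M` is a 4x4 list of lists in the order A, C, G, T, the stationary vector is approximated by power iteration, and the boundary case $\pi_{1,m} = 0$ for $m \ge 1$ (a single nucleotide cannot contain a CG) is added here because the slide does not state it.

```python
C, G = 1, 2  # row/column indices in the order A, C, G, T


def stationary_distribution(M, iterations=1000):
    """Approximate pi with pi^T M = pi^T by power iteration."""
    pi = [0.25] * 4
    for _ in range(iterations):
        pi = [sum(pi[i] * M[i][j] for i in range(4)) for j in range(4)]
    return pi


def tail_probability(M, n, m):
    """P(Z_n >= m) for a stationary first-order chain with transition matrix M."""
    if n < 1:
        raise ValueError("n must be at least 1")
    pi = stationary_distribution(M)
    # vec[j][x]: probability of a sequence of the current length with
    # at least j CGs that ends in nucleotide x (current length starts at 1)
    vec = [pi[:]] + [[0.0] * 4 for _ in range(m)]
    for _ in range(n - 1):
        new = [pi[:]]  # pi_{k,0} = pi for every length k
        for j in range(1, m + 1):
            row = []
            for x in range(4):
                # Any transition except C -> G leaves the CG count unchanged
                stay = sum(
                    vec[j][y] * M[y][x]
                    for y in range(4)
                    if not (y == C and x == G)
                )
                # A C -> G transition adds one CG, so j - 1 CGs suffice before it
                gain = vec[j - 1][C] * M[C][G] if x == G else 0.0
                row.append(stay + gain)
            new.append(row)
        vec = new
    return sum(vec[m])
```

The running time is proportional to $n \cdot m$ for a fixed alphabet, in contrast to enumerating all $4^n$ sequences. Power iteration is used only to keep the sketch dependency-free; it may not converge for periodic chains.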

slide-17
SLIDE 17

Bona Fide Islands

Significance Vs. Epigenetic Score

[Figure: two ROC plots, panels A and B. Axes: False Alarm Rate and Hit Rate, each from 0.0 to 1.0. Curves: episcore, p-value, obs/exp CG.]

ROC Plots: p-values vs. epigenetic score vs. CpG Obs/Exp on bona fide islands for prediction of open chromatin and differential methylation

slide-18
SLIDE 18

Exonic CpG Islands

Coding Constraint vs. Epigenetic Constraint

  • In exons, CpGs are preserved due to both coding and epigenetic constraint.
  • Coding CpG island: exonic substring with significant CpG content due to epigenetic constraint

[Figure: example sequence with CGs attributed to coding constraint and to epigenetic constraint, next to a table of the genetic code.]

slide-19
SLIDE 19

Null Model

5-th order Markov chain

$$\begin{array}{c|cccc} & A & C & G & T \\ \hline AAAAA & P(A^5 \to A) & P(A^5 \to C) & P(A^5 \to G) & P(A^5 \to T) \\ AAAAC & P(A^4C \to A) & P(A^4C \to C) & P(A^4C \to G) & P(A^4C \to T) \\ AAAAG & P(A^4G \to A) & P(A^4G \to C) & P(A^4G \to G) & P(A^4G \to T) \\ AAAAT & P(A^4T \to A) & P(A^4T \to C) & P(A^4T \to G) & P(A^4T \to T) \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ TTTTA & P(T^4A \to A) & P(T^4A \to C) & P(T^4A \to G) & P(T^4A \to T) \\ TTTTC & P(T^4C \to A) & P(T^4C \to C) & P(T^4C \to G) & P(T^4C \to T) \\ TTTTG & P(T^4G \to A) & P(T^4G \to C) & P(T^4G \to G) & P(T^4G \to T) \\ TTTTT & P(T^5 \to A) & P(T^5 \to C) & P(T^5 \to G) & P(T^5 \to T) \end{array}$$

  • $4^6 = 4096$ transition probabilities ($4^5 = 1024$ contexts with 4 outcomes each) to be learned from data
  • Needed: Dinucleotide counting statistics on 5-th order Markov chains
  • Goal: Determine significance of exonic substrings
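
Learning such a table from data amounts to counting 6-mers. The sketch below is one possible way to do this; the function name, the dictionary layout and the `pseudocount` smoothing are assumptions made for illustration and are not described in the slides.

```python
from collections import Counter, defaultdict


def estimate_higher_order_chain(sequences, order=5, pseudocount=1.0):
    """Map each observed context of length `order` to next-nucleotide probabilities."""
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - order):
            context, nxt = seq[i:i + order], seq[i + order]
            # Skip windows containing ambiguous symbols such as N
            if set(context + nxt) <= set("ACGT"):
                counts[context][nxt] += 1
    model = {}
    for context, nxt_counts in counts.items():
        total = sum(nxt_counts.values()) + 4 * pseudocount
        model[context] = {
            b: (nxt_counts[b] + pseudocount) / total for b in "ACGT"
        }
    return model
```

Contexts that never occur in the training data are absent from the returned dictionary, so a caller has to decide how to treat them, for example by falling back to a lower-order model.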
slide-20
SLIDE 20

Coding Vs. Non Coding Model

Differences

[Figure: histogram of log_ratio_cd_wg. x-axis: log_ratio_cd_wg; y-axis: Frequency.]

Figure: Fold change in the log p-value for genome null model versus coding null model. 82,505 exons have p-values that differ by more than a factor of 2.

slide-21
SLIDE 21

The Algorithm

Determining Island Candidates

  • Order all ($K$ many) exonic substrings $G^e_s$ which are bounded by CG at both ends by their tail probabilities $p_{cg \to cg}(G^e_s)$: $p_{cg \to cg}(G_1) \le \dots \le p_{cg \to cg}(G_K)$.
  • Fix a false discovery rate $\alpha$ and determine $k^* := \max\{i : p_i \le \frac{i}{K} \alpha\}$. Declare the $G_i$, $1 \le i \le k^*$, candidate CCGIs. ☞ Benjamini-Hochberg procedure

[Figure: selection of candidate substrings bounded by CGs along an exon / chromosome.]
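
The selection rule for $k^*$ is the Benjamini-Hochberg step-up procedure, which can be sketched as follows. The function name and the return value (indices into the input list) are illustrative choices.

```python
def benjamini_hochberg(p_values, alpha):
    """Return the indices of the hypotheses selected at false discovery rate alpha."""
    K = len(p_values)
    # Indices sorted by increasing p-value: order[0] has the smallest p
    order = sorted(range(K), key=lambda i: p_values[i])
    k_star = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank i with p_(i) <= (i / K) * alpha
        if p_values[idx] <= rank / K * alpha:
            k_star = rank
    return order[:k_star]
```

Because $k^*$ is the largest rank that passes, a substring can be selected even if its own p-value exceeds its own threshold, as long as a later rank passes. For example, with p-values `[0.02, 0.03, 0.035]` and `alpha = 0.05` all three are selected although the first one fails its threshold of about 0.0167.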

slide-22
SLIDE 22

The Algorithm

Computing Islands: The Greedy Algorithm

  • Let $N(G_i)$ be the set of all candidate islands which overlap with $G_i$ (including $G_i$ itself).
  • For the candidate islands $G_i$, $i \le K$, replace $p_{cg \to cg}(G_i)$ by $p(G_i)$; reorder if necessary.
  • GREEDYCCGI

    CCGI ← ∅, CAND ← {G_i, 1 ≤ i ≤ K}
    while CAND ≠ ∅ do
        i* ← argmin_i {p(G_i) | G_i ∈ CAND}
        CCGI ← CCGI ∪ {G_i*}
        CAND ← CAND \ N(G_i*)
    end while
    Output CCGI as the set of coding CpG islands.

[Figure: computation of islands from overlapping candidates bounded by CGs; the selected island is highlighted.]
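
A runnable sketch of GREEDYCCGI, assuming each candidate is represented as a tuple `(start, end, p)` with half-open coordinates `[start, end)` on a common sequence. This representation is a choice made here for illustration.

```python
def greedy_islands(candidates):
    """Greedily select non-overlapping candidates in order of increasing p-value."""
    # Sorting once keeps the smallest remaining p-value at the front
    remaining = sorted(candidates, key=lambda c: c[2])
    islands = []
    while remaining:
        best = remaining[0]
        islands.append(best)
        # Drop N(best): every candidate overlapping best, including best itself
        remaining = [
            c for c in remaining if c[1] <= best[0] or c[0] >= best[1]
        ]
    return islands
```

For the candidates `[(0, 10, 1e-3), (5, 15, 1e-5), (12, 20, 1e-4), (30, 40, 1e-2)]` the sketch first picks `(5, 15, 1e-5)`, which removes its two overlapping neighbours, and then `(30, 40, 1e-2)`.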

slide-23
SLIDE 23

The Algorithm

The Greedy Algorithm

Theorem:

  • Let $(G_i)_{1 \le i \le K}$ be possibly overlapping substrings of a superstring $G$.
  • Let $S(G_i)$ be a score for a substring $G_i$.
  • Applying GREEDYCCGI yields $L$ non-overlapping substrings $(H_l)_{1 \le l \le L}$ such that the scores $S(H_l)$, aggregated over $l = 1, \dots, L$, are minimized subject to the constraint

$$S(G) \ge \min_{l \le j \le k} S(H_j) \quad \text{for all } 1 \le l, k \le L,\ G \subset \overleftrightarrow{H_l H_k},$$

where $\overleftrightarrow{H_l H_k}$ is the substring covering everything between $H_{l-1}$ and $H_{k+1}$.

Remark: The constraint says that the most significant substring in the area between any two islands is also an island (or there is no island).

slide-24
SLIDE 24

Results

Overlap with Functional Regions

                     Overall   OC      17-Cons   1st Ex.   Alt. TSS
All CCGIs     No.    12445     2734    11041     5539      7248
              %      -         0.21    0.88      0.44      0.58
No GFCGIs     No.    3000      189     2687      433       1248
              %      -         0.06    0.89      0.14      0.41
No HMMCGIs    No.    802       25      706       82        286
              %      -         0.03    0.87      0.10      0.35

Overlap of CCGI sets with different functional regions

OC: Open Chromatin
17-Cons: UCSC 17-Way Conservation Track
1st Ex.: Overlap with Initial Exons
Alt. TSS: Alternative Transcription Start Site
slide-25
SLIDE 25

Results

Differential Methylation

          Coding CGIs                 HMM CGIs                    GFCGIs
Dist.     No      R / P / F-M         No      R / P / F-M         No      R / P / F-M
-         12934   22.5 / 3.1 / 5.5    26848   53.5 / 3.6 / 6.7    16320   31.1 / 3.4 / 6.2
No Cov    12923   22.5 / 3.1 / 5.5    11809   26.3 / 4.0 / 7.0    5465    14.2 / 4.7 / 7.0
1         12324   21.9 / 3.2 / 5.6    3785    7.3 / 3.4 / 4.7     1213    3.4 / 5.0 / 4.1
15        4957    12.1 / 4.1 / 6.2    3138    6.5 / 3.5 / 4.6     946     2.9 / 5.2 / 3.7
30        2870    9.0 / 5.0 / 6.5     2561    5.8 / 3.6 / 4.5     790     2.6 / 5.2 / 3.4
45        1899    6.4 / 5.1 / 5.6     2084    5.4 / 3.9 / 4.5     671     2.3 / 5.2 / 3.1
60        1365    4.9 / 5.0 / 4.9     1753    5.3 / 4.2 / 4.7     602     2.3 / 5.3 / 3.2
90        859     3.9 / 5.4 / 4.5     1365    5.3 / 4.5 / 4.9     468     2.3 / 5.8 / 3.3
120       632     3.9 / 6.2 / 4.8     1098    5.0 / 4.6 / 4.8     357     2.2 / 6.2 / 3.3

Differential methylation relative to exon location. R: recall, P: precision, F-M: F-measure
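
The slides do not define the F-measure; the sketch below assumes the usual harmonic mean of recall and precision. The function name is hypothetical.

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision (result is in the units of the inputs)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)
```

Under this assumption, the first HMM CGIs entry (R = 53.5, P = 3.6) gives a value of about 6.7, in line with the reported F-M. Other entries can differ in the last digit, presumably because the table was computed from unrounded recall and precision.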

slide-26
SLIDE 26

Results

Alternatively Spliced Exons

                  CCGIs (12445)           HMMCGIs (26020)         GFCGIs (15821)
∩      Exons No   No     Pr     p         No     Pr     p         No     Pr     p
≥ 2    4,092      443    3.56   2e-25     767    2.95   3e-20     453    2.86   4e-10
≥ 3    266        41     0.33   3e-7      56     0.22   6e-4      41     0.26   1e-4
≥ 4    31         3      0.02   0.33      2      0.01   0.94      2      0.01   0.74

Overlap with alternatively spliced exons. Pr: precision (in %), p: hypergeometric tail probability

Overall Exon Count: 190181
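
A hypergeometric tail probability of the kind reported in column p can be computed exactly with integer arithmetic. The sketch below is generic; how the authors mapped exons and islands to the population size, the number of marked items and the number of draws is not stated on the slide, so the parameter names are assumptions.

```python
from fractions import Fraction
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(X >= k) when n items are drawn without replacement from N items, K of them marked."""
    upper = min(K, n)
    # math.comb returns 0 when more items are requested than available,
    # so impossible combinations drop out of the sum automatically
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return Fraction(favourable, comb(N, n))
```

The result is an exact `Fraction`; wrapping it in `float(...)` gives a decimal approximation.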

slide-27
SLIDE 27

Results

HOX genes

[Figure: genome browser view of chr7:27,100,000-27,210,000. Tracks: CCGIs within exons (total 12,849), 17 islands shown with p-values between 5.83e-04 and 8.70e-36; RefSeq Genes HOXA1, HOXA2, HOXA3, HOXA4, HOXA5, HOXA6, HOXA7, HOXA9, MIR196B, HOXA10, HOXA11, HOXA11AS, HOXA13.]

Fig.: The HOX gene cluster was studied in [Branciamore et al., 2010]: CpGs in exons of HOX genes are subject to pro-epigenetic selection.
slide-28
SLIDE 28

Results

Differential Methylation / Alternative Splicing

[Figure: two genome browser views. Panel (a): chr14:20,625,100-20,626,100 (FLJ10357, FLJ00056, ARHGEF40) with a CCGI of p-value 1.92e-04, shown alongside UCSC CpG islands, HMM islands, spliced human ESTs, UCSC and RefSeq genes, ENCODE histone marks (H3K4Me1, H3K4Me3), DNaseI hypersensitivity clusters, transcription factor ChIP-seq and RepeatMasker tracks. Panel (b): chr1:44,255,000-44,256,500 (SLC6A9) with a CCGI of p-value 2.75e-05, shown alongside UCSC CpG islands, HMM islands and RefSeq genes.]

(a) Differentially methylated CCGI in an alternatively spliced exon [methylated in embryonic stem cells, unmethylated in fetal lung fibroblasts]

(b) CCGI located in a gene with alternative transcription start sites

slide-29
SLIDE 29

Outlook

Whole Genome

[Figure: two genome browser views with the track "GSI (Greedy Stat-Islands) e11 (26,946 total)". First view: chr1:39,261,000-39,270,000 (NDUFS5), island p-value 3.852632e-16. Second view: chr9:106,792,000-106,797,000, island p-value 1.984208e-23. Both are shown alongside UCSC and RefSeq genes, human ESTs, ENCODE histone marks (H3K4Me1, H3K4Me3), DNaseI hypersensitivity clusters, transcription factor ChIP-seq, CpG islands and RepeatMasker tracks.]

Fig.: Whole Genome Statistically Significant Islands

slide-30
SLIDE 30

Thanks for the attention!