Determining coding CpG islands as regions significant for Markov chain based counting statistics

Alexander Schönhuth
Centrum Wiskunde & Informatica, Amsterdam

Joint work with Meromit Singer, Alexander Engström and Lior Pachter (UC Berkeley)

Rutgers University, New Jersey, October 12, 2011

Guideline
• Introduction
  • Cytosine Deamination
  • Problem Definition
• Methods
  • The Null Model
  • The Algorithm
• Results
  • Epigenetic Association
  • New Findings
• Outlook

Introduction: Cytosine Deamination
• Degradation of CpG dinucleotides is more frequent than for other constellations.
• Methylated cytosines mutate to thymine through deamination (C → T).
• CpG islands: substrings in the genome with unusually high CpG content.
[Figure: deamination of a methylated (CH3) CG turns ATATGTTGGACG into ATATGTTGGATG; CpG island example: AAGCTTGCGCGCGCG]
• CpG islands are not affected by neutral mutation rates due to epigenetic constraint ☞ computational inference possible.
• Still most popular:
  • Gardiner-Garden / Frommer: length ≥ 200 bp, GC content ≥ 0.5, CpG Obs/Exp ≥ 0.6
  • Takai / Jones: length ≥ 500 bp, GC content ≥ 0.55, CpG Obs/Exp ≥ 0.65
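The threshold criteria listed above can be checked with a few lines of code. The following is a minimal sketch, not the authors' implementation. The slides do not spell out the Obs/Exp formula, so the sketch assumes the commonly used definition Obs/Exp = (#CG × N) / (#C × #G) for a window of length N. The function names `gc_fraction`, `cpg_obs_exp` and `is_classical_island` are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of C and G nucleotides in seq (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("C") + seq.count("G")) / len(seq) if seq else 0.0


def cpg_obs_exp(seq):
    """CpG observed/expected ratio.

    Assumption: Obs/Exp = (#CG * N) / (#C * #G), the commonly used
    definition; the slides only state the thresholds.
    """
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count finds every occurrence.
    return seq.count("CG") * len(seq) / (c * g)


def is_classical_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Threshold test; the defaults are the first rule listed above."""
    return (len(seq) >= min_len
            and gc_fraction(seq) >= min_gc
            and cpg_obs_exp(seq) >= min_obs_exp)
```

Passing `min_len=500, min_gc=0.55, min_obs_exp=0.65` gives the stricter Takai / Jones rule. Such fixed thresholds carry no significance statement, which is the gap the counting statistics in this talk address.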

Generic Motivation: Computation of CpG Islands
Input: A genome $G$ or a set of genomic sequences $G_i$ (exons in the following).
Output: A set of non-overlapping substrings $G_1, \ldots, G_L$ which are "most significant" in terms of their CpG content.
• One would like to control the false discovery rate $E(V/L)$, where $V$ = # false positives, that is, the expected fraction of false positives among the $L$ reported substrings.

Methods: Definitions
• Let $\Sigma = \{A, C, G, T\}$, let $\tilde{G} \in \Sigma^n$ be an $n$-mer, and let $|\tilde{G}|$ and $\#(\tilde{G}, CG)$ be the length of $\tilde{G}$ and the number of CG occurrences in $\tilde{G}$.
• For example, $\tilde{G} = CGACG$: $|\tilde{G}| = 5$, $\#(\tilde{G}, CG) = 2$.
• Let $Z_n$ be the random variable defined by
  $$Z_n : \Sigma^n \to \mathbb{N}, \qquad \tilde{G} \mapsto \#(\tilde{G}, CG).$$
• Let $\tilde{G}$ be a genomic substring of length $n$, $m := \#(\tilde{G}, CG)$.
• Consider the tail probability
  $$p(\tilde{G}) := p_{n,m} := P(\{Z_n \ge m\}),$$
  the probability that a randomly drawn $n$-mer contains at least $m$ CGs.
Wanted: genomic substrings $\tilde{G}$ with significantly small $p(\tilde{G})$.
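As a small illustration of the two quantities defined above, the length of a substring and its number of CG occurrences, here is a minimal sketch. The function name `count_cg` is hypothetical and not taken from the talk.

```python
def count_cg(seq):
    """Number of CG dinucleotide occurrences in seq, i.e. #(seq, CG)."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")


example = "CGACG"
n = len(example)       # length of the n-mer
m = count_cg(example)  # observed CG count
print(n, m)
```

For the example from the slide this reproduces length 5 and two CG occurrences.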

Methods: Problem Specification
Computation of CpG Islands
Input: A genome $G$ or a set of genomic sequences $G_i$ (exons in the following) and a user-specified threshold $\alpha \in [0, 1]$.
Output: A set of non-overlapping substrings $\tilde{G}_1, \ldots, \tilde{G}_L$ in $G$ or in the $G_i$ which minimize
$$\prod_{l=1}^{L} p(\tilde{G}_l) = \prod_{l=1}^{L} p_{n_l, m_l},$$
where $n_l := |\tilde{G}_l|$ and $m_l := \#(\tilde{G}_l, CG)$, such that $E(V/L) \le \alpha$.
• Some additional, biologically reasonable constraints will apply.
• Still missing: specification of $P$.

Null Model: Markov Chains
• Standard hidden Markov model for CpG island detection
• Issue: specification of an "island model" necessary.

Null Model: Markov Chains
Parameter estimation for only a null model is straightforward: collect dinucleotide frequencies into the Markov transition probability matrix
$$M = \begin{pmatrix}
p_{AA} & p_{AC} & p_{AG} & p_{AT} \\
p_{CA} & p_{CC} & p_{CG} & p_{CT} \\
p_{GA} & p_{GC} & p_{GG} & p_{GT} \\
p_{TA} & p_{TC} & p_{TG} & p_{TT}
\end{pmatrix}.$$
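One way to collect dinucleotide frequencies into such a matrix is sketched below, with rows and columns ordered A, C, G, T. This is an illustration under stated assumptions rather than the authors' estimation procedure. The `pseudocount` parameter is an addition that keeps rows well defined when a nucleotide never occurs as a predecessor, and the name `estimate_transition_matrix` is hypothetical.

```python
NUCS = "ACGT"
IDX = {x: i for i, x in enumerate(NUCS)}


def estimate_transition_matrix(sequences, pseudocount=1.0):
    """Row-normalized dinucleotide counts: M[a][b] estimates P(b | a).

    Assumption: a pseudocount is added to every cell; with
    pseudocount=0 a row without observations would divide by zero.
    """
    counts = [[pseudocount] * 4 for _ in range(4)]
    for seq in sequences:
        seq = seq.upper()
        for a, b in zip(seq, seq[1:]):
            # Skip dinucleotides containing ambiguous symbols such as N.
            if a in IDX and b in IDX:
                counts[IDX[a]][IDX[b]] += 1
    return [[c / sum(row) for c in row] for row in counts]


M = estimate_transition_matrix(["ACGT"])
print(M[IDX["C"]][IDX["G"]])  # estimated p_CG
```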

Methods: Computation of Probabilities
• Consider the probability vectors
  $$\pi_{n,m} = [\pi_{n,m}(A), \pi_{n,m}(C), \pi_{n,m}(G), \pi_{n,m}(T)] \in [0, 1]^4,$$
  where $\pi_{n,m}(x)$ is the probability that the Markov chain generates a sequence of length $n$ which contains at least $m$ CGs and which ends in the nucleotide $x \in \{A, C, G, T\}$.
• For all $n \in \mathbb{N}$ initialize $\pi_{n,0} = \pi$, where $\pi^T M = \pi^T$ is the stationary eigenvector associated with the Markov chain.
• Recursively compute
  $$(\pi_{n,m})^T = (\pi_{n-1,m})^T \cdot
  \begin{pmatrix}
  p_{AA} & p_{AC} & p_{AG} & p_{AT} \\
  p_{CA} & p_{CC} & 0 & p_{CT} \\
  p_{GA} & p_{GC} & p_{GG} & p_{GT} \\
  p_{TA} & p_{TC} & p_{TG} & p_{TT}
  \end{pmatrix}
  + (\pi_{n-1,m-1})^T \cdot
  \begin{pmatrix}
  0 & 0 & 0 & 0 \\
  0 & 0 & p_{CG} & 0 \\
  0 & 0 & 0 & 0 \\
  0 & 0 & 0 & 0
  \end{pmatrix}.$$
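The recursion above can be sketched in plain Python as follows. This is a sketch under assumptions the slide does not state. The base case is taken to be $\pi_{1,m} = 0$ for $m \ge 1$, since a single nucleotide contains no dinucleotide. The stationary vector is approximated by power iteration, which presumes an ergodic chain. The tail probability $p_{n,m}$ is obtained by summing $\pi_{n,m}(x)$ over the four end nucleotides, which follows from the definition of $\pi_{n,m}$. The names `M0`, `M1`, `stationary` and `tail_probability` are hypothetical.

```python
NUCS = "ACGT"
C, G = NUCS.index("C"), NUCS.index("G")


def vec_mat(v, M):
    """Row vector times 4x4 matrix."""
    return [sum(v[i] * M[i][j] for i in range(4)) for j in range(4)]


def stationary(M, iterations=1000):
    """Approximate pi with pi^T M = pi^T by power iteration.

    Assumption: the chain is ergodic, so the iteration converges.
    """
    pi = [0.25] * 4
    for _ in range(iterations):
        pi = vec_mat(pi, M)
    return pi


def tail_probability(M, n, m):
    """P(Z_n >= m) for a stationary first-order chain with matrix M."""
    # M1 keeps only the C -> G transition; M0 is M with p_CG set to 0.
    M1 = [[M[i][j] if (i, j) == (C, G) else 0.0 for j in range(4)]
          for i in range(4)]
    M0 = [[M[i][j] - M1[i][j] for j in range(4)] for i in range(4)]
    pi = stationary(M)
    # prev[k] holds pi_{length, k}; start at length 1.
    # Assumption: pi_{1, k} = 0 for k >= 1 (no dinucleotide in a 1-mer).
    prev = [pi] + [[0.0] * 4 for _ in range(m)]
    for _ in range(n - 1):
        cur = [pi]  # pi_{length, 0} = pi for every length
        for k in range(1, m + 1):
            no_cg = vec_mat(prev[k], M0)        # last step is not C -> G
            new_cg = vec_mat(prev[k - 1], M1)   # last step completes a CG
            cur.append([x + y for x, y in zip(no_cg, new_cg)])
        prev = cur
    return sum(prev[m])


uniform = [[0.25] * 4 for _ in range(4)]
print(tail_probability(uniform, 5, 2))
```

The running time is proportional to n · m with a constant number of 4 × 4 products per step, so tail probabilities for all substring lengths of an exon can be tabulated once and reused.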

Bona Fide Islands: Significance vs. Epigenetic Score
[Figure: ROC plots, panels A and B. Axes: False Alarm Rate (x) vs. Hit Rate (y). Curves: episcore, p-value, obs/exp cg.]
ROC plots: p-values vs. epigenetic score vs. CpG Obs/Exp on bona fide islands for prediction of open chromatin and differential methylation.

Exonic CpG Islands: Coding Constraint vs. Epigenetic Constraint
• In exons, CpGs are preserved due to both coding and epigenetic constraint.
[Figure: sequence AAGCTTGCGCGCGCG annotated with "Epigenetic Constraint" and "Coding Constraint"; table of the genetic code.]
• Coding CpG island: exonic substring with significant CpG content due to epigenetic constraint.

Null Model: 5th-order Markov chain
$$\begin{array}{c|cccc}
 & A & C & G & T \\ \hline
AAAAA & P(A^5 \to A) & P(A^5 \to C) & P(A^5 \to G) & P(A^5 \to T) \\
AAAAC & P(A^4C \to A) & P(A^4C \to C) & P(A^4C \to G) & P(A^4C \to T) \\
AAAAG & P(A^4G \to A) & P(A^4G \to C) & P(A^4G \to G) & P(A^4G \to T) \\
AAAAT & P(A^4T \to A) & P(A^4T \to C) & P(A^4T \to G) & P(A^4T \to T) \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
TTTTA & P(T^4A \to A) & P(T^4A \to C) & P(T^4A \to G) & P(T^4A \to T) \\
TTTTC & P(T^4C \to A) & P(T^4C \to C) & P(T^4C \to G) & P(T^4C \to T) \\
TTTTG & P(T^4G \to A) & P(T^4G \to C) & P(T^4G \to G) & P(T^4G \to T) \\
TTTTT & P(T^5 \to A) & P(T^5 \to C) & P(T^5 \to G) & P(T^5 \to T)
\end{array}$$
• $4^6 = 4096$ parameters to be learned from data
• Needed: dinucleotide counting statistics on 5th-order Markov chains
• Goal: determine significance of exonic substrings
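Learning such a table from exon sequences amounts to counting, for every 5-mer context, which nucleotide follows it. The sketch below illustrates this under assumptions. The pseudocount, which avoids empty rows for contexts never seen in the training data, is an addition and not something the slides specify, and the name `estimate_kth_order` is hypothetical.

```python
from itertools import product

NUCS = "ACGT"
IDX = {x: i for i, x in enumerate(NUCS)}


def estimate_kth_order(sequences, k=5, pseudocount=1.0):
    """Map each k-mer context to [P(A|ctx), P(C|ctx), P(G|ctx), P(T|ctx)].

    Assumption: every cell starts at `pseudocount` so that unseen
    contexts still yield a valid probability row.
    """
    counts = {"".join(ctx): [pseudocount] * 4
              for ctx in product(NUCS, repeat=k)}
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k):
            ctx, nxt = seq[i:i + k], seq[i + k]
            # Skip windows with ambiguous symbols such as N.
            if ctx in counts and nxt in IDX:
                counts[ctx][IDX[nxt]] += 1
    return {ctx: [c / sum(row) for c in row] for ctx, row in counts.items()}


model = estimate_kth_order(["AAAAAC"])
print(model["AAAAA"])  # one row per 5-mer context
```

The same dictionary layout works for any order `k`, with `k=1` giving the first-order dinucleotide model.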
