Whole Genome Analysis and Annotation
Adam Siepel, Biological Statistics & Computational Biology, Cornell University

2 The Challenge Whole Genome Analysis

3 Genome Browsers

6 Comparative Analysis of Complete Mammalian Genomes
Species shown: human, chimp, macaque, mouse, rat, cow, dog, opossum, platypus, chicken, zebrafish, Tetraodon, Fugu

7 Detection of Functional Elements
Species shown: human, mouse, rat, dog, chicken, Fugu

8 Conservation Track
Siepel, Bejerano, Pedersen, et al., Genome Res, 2005

9 Conservation Track: GAL1
Siepel, Bejerano, Pedersen, et al., Genome Res, 2005

10 Solanaceae Browser

12 Possible Positive Selection
Chondrosarcoma associated gene 1 isoform a

13 "Human Accelerated Region 1" (HAR1)
Pollard, Salama, et al., Nature, 2006

14 New Human RNA Structure
[Figure: RNA structure panels labeled Human and Chimp, with a base legend (U, G, C, A, -) and position ticks]
Pollard, Salama, et al., Nature, 2006

15 Exon Predictions
Data from E. Green & colleagues (Thomas et al., Nature 2003)

16 Whole Mount in situ Hybridizations to Zebrafish Embryos
[Figure: panels for ch1.5081.18 and OTP at 48 hpf and 72 hpf, with labeled regions: telencephalon, diencephalon, hindbrain]
Bruce Roe & colleagues

17 [Figure: multiple alignment around an exon; column labels in the original: non-coding, 3' splice, codon positions (1, 2, 3), 5' splice]
    human    C T C A G C A G … G A G G T A A G
    mouse    T T A A G C A G … G A A G T G T G
    rat      T T A A G C A G … G A A G T G T T
    dog      A A T A G C A A … G A G G T C C A
    chicken  C A C A G C A A … G A G G T C A A

18 Phylo-HMM Used by PhastCons
Siepel, Bejerano, Pedersen, et al., Genome Res, 2005

Introduction to Hidden Markov Models, Phylogenetic Models, and Phylo-HMMs

2 A Markov Model (Chain)
• Suppose $Z = (Z_1, \ldots, Z_L)$ is a sequence of cloudy ($Z_i = 0$) or sunny ($Z_i = 1$) days
• We could assume days are iid with probability $\theta$ of sun, but cloudy and sunny days occur in runs
• We can capture the correlation between successive days by assuming a first-order Markov model:
  $P(Z_1, \ldots, Z_L) = P(Z_1) \, P(Z_2 \mid Z_1) \, P(Z_3 \mid Z_2) \cdots P(Z_L \mid Z_{L-1})$
  instead of complete independence:
  $P(Z_1, \ldots, Z_L) = P(Z_1) \cdots P(Z_L)$
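The contrast between the two factorizations can be made concrete with a small sketch. The parameter values below (`theta`, `init`, `trans`) are hypothetical and chosen only for illustration; the slides do not give numbers for this chain.

```python
from math import prod

def iid_prob(z, theta):
    """P(z) if each day is independently sunny (1) with probability theta."""
    return prod(theta if zi == 1 else 1.0 - theta for zi in z)

def markov_prob(z, init, trans):
    """P(z) under a first-order Markov chain.

    init[s] = P(Z_1 = s); trans[c][d] = P(Z_i = d | Z_{i-1} = c).
    """
    p = init[z[0]]
    for prev, cur in zip(z, z[1:]):
        p *= trans[prev][cur]
    return p

# Hypothetical parameter values (assumptions, not from the slides).
theta = 0.5
init = [0.5, 0.5]
trans = [[0.9, 0.1],   # cloudy -> cloudy, cloudy -> sunny
         [0.1, 0.9]]   # sunny -> cloudy, sunny -> sunny

runs = [0, 0, 0, 0, 1, 1, 1, 1]         # weather in long runs
alternating = [0, 1, 0, 1, 0, 1, 0, 1]  # weather flips every day

print(iid_prob(runs, theta), iid_prob(alternating, theta))
print(markov_prob(runs, init, trans), markov_prob(alternating, init, trans))
```

With these assumed values the iid model cannot tell the two sequences apart, while the Markov chain assigns far more probability to the sequence with runs.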

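3 Three Views
1. $P(z) = P(z_1) \prod_{i=2}^{L} a_{z_{i-1}, z_i}$, where $a_{c,d} = P(z_i = d \mid z_{i-1} = c)$
2. [Figure: chain $Z_1, Z_2, \ldots, Z_L$ with successive edges labeled $a_{z_1,z_2}$, $a_{z_2,z_3}$, ..., $a_{z_{L-1},z_L}$]
3. [Figure: state diagram with begin state B and states 0 and 1; initial probabilities $P(z_1 = 0)$ and $P(z_1 = 1)$; transitions $a_{0,0}$, $a_{0,1}$, $a_{1,0}$, $a_{1,1}$]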

4 Process Interpretation
• Let's add an end state and cap the sequence with $z_0 = B$, $z_{L+1} = E$, e.g. $z = B000011000E$
[Figure: state diagram with begin state B, states 0 and 1, end state E, and transitions $a_{B,0}$, $a_{B,1}$, $a_{0,0}$, $a_{0,1}$, $a_{1,0}$, $a_{1,1}$, $a_{0,E}$, $a_{1,E}$]
• This is a probabilistic machine that generates sequences of any length. It is a stochastic finite state machine and defines a grammar.
• We can now simply say: $P(z) = \prod_{i=0}^{L} a_{z_i, z_{i+1}}$
• $P(z)$ is a probability distribution over all sequences (for given alphabet).
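A minimal sketch of this generative view, assuming hypothetical transition probabilities (the table `A` below is not from the slides): start in B, follow transitions at random until E is reached, and score a capped path with the product formula.

```python
import random

# Hypothetical transition probabilities a[s][t]; each row sums to 1.
A = {
    "B": {"0": 0.5, "1": 0.5},
    "0": {"0": 0.8, "1": 0.1, "E": 0.1},
    "1": {"0": 0.1, "1": 0.8, "E": 0.1},
}

def sample_path(a, rng):
    """Walk from B until E is reached; return the full path including B and E."""
    path = ["B"]
    while path[-1] != "E":
        nxt = a[path[-1]]
        states = list(nxt)
        weights = [nxt[s] for s in states]
        path.append(rng.choices(states, weights=weights)[0])
    return path

def path_prob(a, path):
    """P(z) = product over i = 0..L of a[z_i][z_{i+1}], with z_0 = B and z_{L+1} = E."""
    p = 1.0
    for s, t in zip(path, path[1:]):
        p *= a[s].get(t, 0.0)  # transitions absent from the table have probability 0
    return p

rng = random.Random(0)
for _ in range(3):
    z = sample_path(A, rng)
    print("".join(z), path_prob(A, z))
```

Because every state can move to E, sampled sequences end after a finite number of steps, which is what lets one model cover sequences of any length.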

5 A Hidden Markov Model
• Let $X = (X_1, \ldots, X_L)$ indicate whether AS bikes on day $i$ ($X_i = 1$) or not ($X_i = 0$)
• Suppose AS bikes on day $i$ with probability $\theta_0 = 0.25$ if it is cloudy ($Z_i = 0$) and with probability $\theta_1 = 0.75$ if it is sunny ($Z_i = 1$)
• Further suppose the $Z_i$s are hidden; we see only $X = (X_1, \ldots, X_L)$
• This hidden Markov model is a mixture model in which the $Z_i$s are correlated
• We call $Z = (Z_1, \ldots, Z_L)$ the path

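6 HMM, cont.
• $Z$ is determined by the Markov chain:
[Figure: state diagram with begin state B, states 0 and 1, end state E, and transitions $a_{B,0}$, $a_{B,1}$, $a_{0,0}$, $a_{0,1}$, $a_{1,0}$, $a_{1,1}$, $a_{0,E}$, $a_{1,E}$]
• The joint probability of $X$ and $Z$ is:
  $P(x, z) = P(z) \, P(x \mid z) = a_{B,z_1} \prod_{i=1}^{L} e_{z_i,x_i} \, a_{z_i,z_{i+1}}$
  where $e_{z_i,x_i} = P(x_i \mid z_i)$ and $z_{L+1} = E$
• The $X_i$s are conditionally independent given the $Z_i$s
[Figure: graphical model with the chain $Z_1, Z_2, Z_3, \ldots, Z_L$ and each $X_i$ attached to its $Z_i$]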
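The joint probability can be evaluated directly from the product formula. In this sketch the emission table uses the biking probabilities from the slides, while the transition table `A` is a hypothetical assumption, since the slides leave the transition values unspecified.

```python
# Emissions from the slides: P(x_i = 1 | z_i = 0) = 0.25, P(x_i = 1 | z_i = 1) = 0.75.
E = {0: {0: 0.75, 1: 0.25},
     1: {0: 0.25, 1: 0.75}}

# Hypothetical transitions (assumed; not given in the slides).
A = {"B": {0: 0.5, 1: 0.5},
     0: {0: 0.8, 1: 0.1, "E": 0.1},
     1: {0: 0.1, 1: 0.8, "E": 0.1}}

def joint_prob(x, z, a, e):
    """P(x, z) = a[B][z_1] * prod_{i=1..L} e[z_i][x_i] * a[z_i][z_{i+1}], with z_{L+1} = E."""
    if len(x) != len(z) or not x:
        raise ValueError("x and z must be non-empty and of equal length")
    p = a["B"][z[0]]
    following = list(z[1:]) + ["E"]  # z_{i+1} for each position i
    for zi, xi, zn in zip(z, x, following):
        p *= e[zi][xi] * a[zi][zn]
    return p

# Observations and path taken from the example slides.
x = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
z = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
print(joint_prob(x, z, A, E))
```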

7 Parameters of the Model
• Transition parameters: $a_{s_1,s_2}$ for all $s_1, s_2 \in S \cup \{B, E\}$
• Emission parameters: $e_{s,x}$ for all $s \in S$, $x \in A$
• The transition parameters define conditional distributions for state $s_2$ at position $i$ given state $s_1$ at position $i-1$
• The emission parameters define conditional distributions over observation $x$ given state $s$, both at position $i$
• The observations can be anything!
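One possible way to hold these parameters is a pair of nested dictionaries, one row per conditioning state. The sketch below is an illustration only: the transition values are hypothetical, and `is_valid` is a helper name introduced here to check that every row is a proper conditional distribution.

```python
import math

STATES = [0, 1]     # S
ALPHABET = [0, 1]   # A, the observation alphabet

# Hypothetical transition values; only the emission probabilities appear in the slides.
transitions = {
    "B": {0: 0.5, 1: 0.5},
    0: {0: 0.8, 1: 0.1, "E": 0.1},
    1: {0: 0.1, 1: 0.8, "E": 0.1},
}
emissions = {
    0: {0: 0.75, 1: 0.25},
    1: {0: 0.25, 1: 0.75},
}

def is_valid(transitions, emissions, states, alphabet):
    """Each row must be a conditional distribution: nonnegative and summing to 1."""
    rows = [transitions["B"]]
    rows += [transitions[s] for s in states]
    rows += [emissions[s] for s in states]
    nonneg = all(p >= 0 for row in rows for p in row.values())
    normalised = all(math.isclose(sum(row.values()), 1.0) for row in rows)
    covers_alphabet = all(set(emissions[s]) == set(alphabet) for s in states)
    return nonneg and normalised and covers_alphabet

print(is_valid(transitions, emissions, STATES, ALPHABET))
```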

8 Key Questions
• Given the model (parameter values) and a sequence $x$, what is the most likely path? $\hat{z} = \operatorname{argmax}_z P(x, z)$
• What is the likelihood of the sequence? $P(x) = \sum_z P(x, z)$
• What is the posterior probability of $Z_i$ given $X$?
• What is the maximum likelihood estimate of all parameters?
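The second and third questions are usually answered with the forward and backward recursions, which the slides do not spell out. The sketch below is one standard formulation, shown under the assumption of hypothetical transition values; the emission values are the ones from the slides. It sums over all paths in time linear in $L$ rather than enumerating them.

```python
STATES = [0, 1]
E = {0: {0: 0.75, 1: 0.25}, 1: {0: 0.25, 1: 0.75}}   # emissions from the slides
A = {"B": {0: 0.5, 1: 0.5},                           # hypothetical transitions
     0: {0: 0.8, 1: 0.1, "E": 0.1},
     1: {0: 0.1, 1: 0.8, "E": 0.1}}

def forward(x):
    """f[i][j] = P(x[0..i], state at position i is j), using 0-based positions."""
    f = [{j: A["B"][j] * E[j][x[0]] for j in STATES}]
    for xi in x[1:]:
        prev = f[-1]
        f.append({j: E[j][xi] * sum(prev[k] * A[k][j] for k in STATES)
                  for j in STATES})
    return f

def backward(x):
    """b[i][j] = P(x[i+1..], then end | state at position i is j)."""
    b = [{j: A[j]["E"] for j in STATES}]
    for xi in reversed(x[1:]):
        nxt = b[0]
        b.insert(0, {j: sum(A[j][k] * E[k][xi] * nxt[k] for k in STATES)
                     for j in STATES})
    return b

def likelihood(x):
    """P(x) = sum over final states k of f[L-1][k] * a[k][E]."""
    f = forward(x)
    return sum(f[-1][k] * A[k]["E"] for k in STATES)

def posteriors(x):
    """P(Z_i = j | x) = f[i][j] * b[i][j] / P(x) for every position i."""
    f, b, px = forward(x), backward(x), likelihood(x)
    return [{j: f[i][j] * b[i][j] / px for j in STATES} for i in range(len(x))]

x = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # observations from the example slides
print(likelihood(x))
print([round(p[1], 3) for p in posteriors(x)])  # posterior probability of sun per day
```

For long sequences these products underflow, so practical implementations work in log space or rescale each column.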

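9 Graph Interpretation of Most Likely Path
[Figure: trellis from begin state B to end state E, with observations $x_i$ = 0 0 1 0 0 1 0 0 across the columns and states $z_i \in \{0, 1\}$ as rows]

10 Graph Interpretation of Probability of x
[Figure: trellis from begin state B to end state E, with observations $x_i$ = 0 0 1 0 0 1 0 0 across the columns and states $z_i \in \{0, 1\}$ as rows]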

11 Viterbi Algorithm for Most Likely Path
• Let $v_{i,j}$ be the weight of the most likely path for $(x_1, \ldots, x_i)$ that ends in state $j$
• Base case: $v_{0,B} = 1$, $v_{i,B} = 0$ for $i > 0$
• Recurrence: $v_{i,j} = e_{j,x_i} \max_k v_{i-1,k} \, a_{k,j}$
• Termination: $P(x, \hat{z}) = \max_k v_{L,k} \, a_{k,E}$
• Keep back-pointers for traceback, as in alignment
• See Durbin et al. for algorithm
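The recurrence, termination, and traceback steps can be sketched as follows. This is an illustrative implementation rather than the one in Durbin et al.; the transition table is a hypothetical assumption, and the emission values are those given in the slides.

```python
def viterbi(x, states, a, e):
    """Return (P(x, z_hat), z_hat) for a most likely path z_hat."""
    # First column: the base case v_{0,B} = 1 leaves only the transition out of B.
    v = {j: a["B"][j] * e[j][x[0]] for j in states}
    back = []  # back[i][j] = best predecessor of state j at position i + 1
    for xi in x[1:]:
        ptr, new_v = {}, {}
        for j in states:
            best = max(states, key=lambda k: v[k] * a[k][j])
            ptr[j] = best
            new_v[j] = e[j][xi] * v[best] * a[best][j]
        back.append(ptr)
        v = new_v
    # Termination: include the transition into the end state E.
    last = max(states, key=lambda k: v[k] * a[k]["E"])
    best_prob = v[last] * a[last]["E"]
    # Traceback through the stored pointers.
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return best_prob, path

STATES = [0, 1]
E = {0: {0: 0.75, 1: 0.25}, 1: {0: 0.25, 1: 0.75}}   # emissions from the slides
A = {"B": {0: 0.5, 1: 0.5},                           # hypothetical transitions
     0: {0: 0.8, 1: 0.1, "E": 0.1},
     1: {0: 0.1, 1: 0.8, "E": 0.1}}

x = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # observations from the example slides
prob, z_hat = viterbi(x, STATES, A, E)
print(prob)
print(z_hat)
```

The decoded path depends on the assumed transition values, so it need not match the path shown on the example slide. As with the forward recursion, a practical version would add log probabilities instead of multiplying to avoid underflow.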

12 Example
[Figure: state diagram with begin state B, states 0 and 1, end state E, and transitions $a_{B,0}$, $a_{B,1}$, $a_{0,0}$, $a_{0,1}$, $a_{1,0}$, $a_{1,1}$, $a_{0,E}$, $a_{1,E}$]
$P(x_i = 1 \mid z_i = 0) = 0.25$
$P(x_i = 1 \mid z_i = 1) = 0.75$
Z = ? ? ? ? ? ? ? ? ? ? ? ?
X = 0 1 0 0 1 1 0 1 0 0 1 0

13 Example
[Figure: same state diagram as on the previous slide]
$P(x_i = 1 \mid z_i = 0) = 0.25$
$P(x_i = 1 \mid z_i = 1) = 0.75$
Z = 0 0 0 0 1 1 1 1 0 0 0 0
X = 0 1 0 0 1 1 0 1 0 0 1 0

14 Why HMMs Are Cool
• Extremely general and flexible models for sequence modeling
• Efficient tools for parsing sequences
• Also proper probability models: allow maximum likelihood parameter estimation, likelihood ratio tests, etc.
• Inherently modular, accommodating of complexity
• In many cases, strike an ideal balance between simplicity and expressiveness
