SLIDE 1 Crash course
for Computer Scientists
Bartek Wilczyński bartek@mimuw.edu.pl http://regulomics.mimuw.edu.pl PhD Open lecture series 17-19 XI 2016
SLIDE 2 Topics for the course
- Sequences in Biology – what do we study?
- Sequence comparison and searching – how to quickly find
relatives in large sequence banks
- Tree-of-life and its construction(s)
- Short sequence mapping – where did this word come from?
- DNA sequencing and assembly – puzzles for experts
- Sequence segmentation – finding modules by flipping coins
- Data storage and compression – from DNA to bits and back
again
- Structures in Biology – small and smaller
SLIDE 3
Markov Models
SLIDE 4 Hidden Markov Models
- The state of the underlying Markov chain is not observable
- Instead we observe some emitted signals, probabilistically depending on the chain state
- In addition to the transition matrix, we have an emission matrix
SLIDE 5 Trajectories of HMMs
- The Markov model changes states (Xs) over
time using the transition matrix
- In each state a random symbol is emitted
based on the emission probabilities
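The two bullets above can be sketched as a small simulation. This is only an illustration: the two-state, three-symbol model and all its probabilities (`START`, `TRANS`, `EMIT`) are made up for the example and do not come from the lecture.

```python
import random

# Hypothetical toy HMM: 2 hidden states, 3 possible emitted symbols.
START = [0.6, 0.4]                          # START[i]   = P(X_0 = i)
TRANS = [[0.7, 0.3], [0.4, 0.6]]            # TRANS[i][j] = P(X_{t+1} = j | X_t = i)
EMIT = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]   # EMIT[i][k]  = P(Y_t = k | X_t = i)


def draw(probabilities, rng):
    """Draw an index according to a discrete probability vector."""
    return rng.choices(range(len(probabilities)), weights=probabilities)[0]


def sample_trajectory(length, rng):
    """Return (hidden_states, emitted_symbols) for one run of the HMM."""
    states, symbols = [], []
    state = draw(START, rng)
    for _ in range(length):
        states.append(state)
        symbols.append(draw(EMIT[state], rng))  # emit a symbol in the current state
        state = draw(TRANS[state], rng)         # then move the chain one step
    return states, symbols


rng = random.Random(0)  # fixed seed only to make the run repeatable
hidden, observed = sample_trajectory(10, rng)
print("hidden  :", hidden)
print("observed:", observed)
```

In a real application only the `observed` list would be available; the `hidden` list is what the following slides try to reconstruct.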
SLIDE 6
HMM example
SLIDE 7
Reconstructing trajectory states
SLIDE 8
Viterbi algorithm
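A minimal sketch of the Viterbi recursion, assuming a discrete HMM given as plain Python lists. The toy parameters are hypothetical, and the code assumes all probabilities are strictly positive so that logarithms are defined.

```python
import math

# Hypothetical toy HMM (2 hidden states, 3 symbols); not from the lecture.
START = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]


def viterbi(observations, start, trans, emit):
    """Most probable hidden state path for a sequence of symbol indices."""
    n_states = len(start)
    # score[i] = best log-probability of any path that ends in state i
    score = [math.log(start[i]) + math.log(emit[i][observations[0]])
             for i in range(n_states)]
    backpointers = []
    for symbol in observations[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda i: score[i] + math.log(trans[i][j]))
            pointers.append(best_prev)
            new_score.append(score[best_prev] + math.log(trans[best_prev][j])
                             + math.log(emit[j][symbol]))
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    best_log_prob = score[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_log_prob


path, log_prob = viterbi([0, 1, 2], START, TRANS, EMIT)
print(path, math.exp(log_prob))  # path [0, 0, 1], probability about 0.01512
```

Working in log space avoids numerical underflow on long sequences; the running time is proportional to the sequence length times the square of the number of states.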
SLIDE 9
The forward and backward probabilities of trajectories
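The forward and backward probabilities can be sketched as two short recursions. The toy model below is hypothetical, and the code skips the rescaling that long sequences would need to avoid underflow.

```python
# Hypothetical toy HMM (2 hidden states, 3 symbols); not from the lecture.
START = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]


def forward(obs, start, trans, emit):
    """alpha[t][i] = P(obs[0..t], X_t = i)."""
    n = len(start)
    alpha = [[start[i] * emit[i][obs[0]] for i in range(n)]]
    for symbol in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * trans[i][j] for i in range(n)) * emit[j][symbol]
                      for j in range(n)])
    return alpha


def backward(obs, trans, emit):
    """beta[t][i] = P(obs[t+1..] | X_t = i)."""
    n = len(trans)
    beta = [[1.0] * n]
    for symbol in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(trans[i][j] * emit[j][symbol] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


obs = [0, 1, 2]
alpha = forward(obs, START, TRANS, EMIT)
beta = backward(obs, TRANS, EMIT)

# Both recursions give the same total probability of the observations.
likelihood_fwd = sum(alpha[-1])
likelihood_bwd = sum(START[i] * EMIT[i][obs[0]] * beta[0][i] for i in range(2))
print(likelihood_fwd, likelihood_bwd)  # both about 0.03628
```

The product `alpha[t][i] * beta[t][i]`, divided by the total likelihood, gives the probability of being in state `i` at time `t` given all observations.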
SLIDE 10
Where were we at time t?
Given the sequence of emitted symbols, we can estimate the likely states of the hidden system
SLIDE 11
The emission matrix can then be estimated
SLIDE 12
As well as the transition matrix
SLIDE 13
Baum-Welch algorithm
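The re-estimation steps from the previous slides (state posteriors, then the emission and transition matrices) can be combined into one iteration as sketched below. This is a bare-bones illustration for a single short sequence without rescaling; the starting parameters and the observation sequence are made up.

```python
import math


def forward(obs, start, trans, emit):
    n = len(start)
    alpha = [[start[i] * emit[i][obs[0]] for i in range(n)]]
    for symbol in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * trans[i][j] for i in range(n)) * emit[j][symbol]
                      for j in range(n)])
    return alpha


def backward(obs, trans, emit):
    n = len(trans)
    beta = [[1.0] * n]
    for symbol in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(trans[i][j] * emit[j][symbol] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(obs, start, trans, emit):
    """One EM iteration; returns updated parameters and the log-likelihood
    of the observations under the parameters passed in."""
    n, m, T = len(start), len(emit[0]), len(obs)
    alpha = forward(obs, start, trans, emit)
    beta = backward(obs, trans, emit)
    likelihood = sum(alpha[-1])
    # E-step: gamma[t][i] = P(X_t = i | obs)
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[t][i][j] = P(X_t = i, X_{t+1} = j | obs)
    xi = [[[alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
            / likelihood for j in range(n)] for i in range(n)]
          for t in range(T - 1)]
    # M-step: expected counts, normalized row by row.
    new_start = gamma[0][:]
    new_trans = [[sum(xi[t][i][j] for t in range(T - 1))
                  / sum(gamma[t][i] for t in range(T - 1))
                  for j in range(n)] for i in range(n)]
    new_emit = [[sum(gamma[t][i] for t in range(T) if obs[t] == k)
                 / sum(gamma[t][i] for t in range(T))
                 for k in range(m)] for i in range(n)]
    return new_start, new_trans, new_emit, math.log(likelihood)


# Made-up observations and a made-up initial guess of the parameters.
obs = [0, 0, 1, 2, 2, 1, 0, 0, 2, 2, 2, 1, 0]
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

log_likelihoods = []
for _ in range(10):
    start, trans, emit, ll = baum_welch_step(obs, start, trans, emit)
    log_likelihoods.append(ll)
print(log_likelihoods)
```

The log-likelihood should never decrease from one iteration to the next, which is the usual sanity check for an Expectation-Maximization procedure. The result depends on the initial guess, since only a local optimum is found.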
SLIDE 14
Expectation-Maximization
SLIDE 15
Protein structure
SLIDE 16
Protein domains
SLIDE 17
Profile HMMs
SLIDE 18
Finding a domain in a longer protein sequence
SLIDE 19
PFAM sequence annotation
SLIDE 20 What is the chromatin state?
UCSF School of Medicine
SLIDE 21
ChIP data from the ENCODE project
SLIDE 22 Chromatin Immunoprecipitation data
SLIDE 23 HMM model
(Ji & Wong 2005, Bioinformatics)
- A model for segmentation of ChIP data with 2 states:
– 0 – no enrichment
– 1 – enrichment
- Gaussian emissions
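A two-state segmentation with Gaussian emissions can be sketched as follows. This is not TileMap's actual emission model: the means, standard deviations, transition probabilities and the signal values are all invented for the illustration.

```python
import math

# Hypothetical parameters: state 0 = no enrichment, state 1 = enrichment.
MEANS = [0.0, 3.0]
SDS = [1.0, 1.0]
LOG_START = [math.log(0.5), math.log(0.5)]
LOG_TRANS = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]


def log_gaussian(x, mean, sd):
    """Log-density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def segment(signal):
    """Viterbi path through the two-state HMM with Gaussian emissions."""
    states = (0, 1)
    score = [LOG_START[s] + log_gaussian(signal[0], MEANS[s], SDS[s]) for s in states]
    back = []
    for x in signal[1:]:
        pointers, new_score = [], []
        for s in states:
            prev = max(states, key=lambda p: score[p] + LOG_TRANS[p][s])
            pointers.append(prev)
            new_score.append(score[prev] + LOG_TRANS[prev][s]
                             + log_gaussian(x, MEANS[s], SDS[s]))
        score = new_score
        back.append(pointers)
    state = max(states, key=lambda s: score[s])
    labels = [state]
    for pointers in reversed(back):
        state = pointers[state]
        labels.append(state)
    return labels[::-1]


# Made-up probe-level signal: background, an enriched stretch, background.
signal = [0.1, -0.2, 0.3, 2.9, 3.2, 3.1, 0.0, -0.1]
labels = segment(signal)
print(labels)  # [0, 0, 0, 1, 1, 1, 0, 0]
```

The high self-transition probability (0.9 here) is what makes the model prefer contiguous segments over labelling each probe independently.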
SLIDE 24
Emission model in TileMap
SLIDE 25 Using Gaussian HMM for Stock Market
From the scikit-learn documentation
SLIDE 26 Filion et al., Cell 2010
You can use HMMs for chromatin
SLIDE 27 Using PCA to limit the emission space dimension
- Principal component analysis is a method of
identifying orthogonal vectors with maximal variance in the multidimensional data
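As a rough illustration of the definition above, the direction of maximal variance can be found by power iteration on the covariance matrix. The sketch below uses only the standard library, and the four two-dimensional data points are made up; real analyses would normally rely on a numerical library instead.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (unit vector, variance along it, covariance matrix)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalize; the vector converges to the leading eigenvector.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    return v, variance, cov


# Made-up measurements of two correlated signals.
data = [(2.0, 4.0), (-2.0, -4.0), (-0.2, 0.1), (0.2, -0.1)]
direction, variance, cov = first_principal_component(data)
print(direction, variance)
```

Projecting each observation onto the first few such directions gives a lower-dimensional signal that can then serve as the emission space.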
SLIDE 28 Independent multidimensional emissions
- ChromHMM takes a different approach
- One can assume that all of the different ChIP
measurements are independent of each other, given the hidden state
- Then, instead of an exponential explosion of the emission space, we
have a matrix of emission probabilities (hidden states × ChIP measurements)
- For each observable ChIP measurement we need a vector of
probabilities, one for each hidden state
- This is even extendable to Gaussian emissions
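The independence assumption can be sketched for binarized (present/absent) measurements as a product of per-measurement probabilities. The states, tracks and numbers in `MARK_PROB` are hypothetical and only illustrate the bookkeeping.

```python
from itertools import product

# Hypothetical example: 2 hidden states, 3 binarized ChIP tracks.
# MARK_PROB[state][track] = P(track is present | state)
MARK_PROB = [[0.9, 0.1, 0.5],
             [0.2, 0.8, 0.5]]


def emission_probability(state, marks):
    """P(observed 0/1 vector | state), assuming independent tracks."""
    p = 1.0
    for prob, present in zip(MARK_PROB[state], marks):
        p *= prob if present else 1.0 - prob
    return p


n_states, n_tracks = len(MARK_PROB), len(MARK_PROB[0])
# Free parameters of a full table over all 0/1 combinations vs. the factorized model.
full_table_size = n_states * (2 ** n_tracks - 1)
factorized_size = n_states * n_tracks

example = emission_probability(0, (1, 0, 1))  # 0.9 * (1 - 0.1) * 0.5
total = sum(emission_probability(0, marks) for marks in product((0, 1), repeat=n_tracks))
print(example, total, full_table_size, factorized_size)
```

With many tracks the difference becomes dramatic: the full table grows as 2 to the power of the number of tracks, while the factorized model grows linearly.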
SLIDE 29 Ernst & Kellis, 2012, Nat Biotech
SLIDE 30 Emission matrix for Drosophila
modENCODE, Roy et al., Science 2010
SLIDE 31 Bayesian Networks and Dynamic Bayesian Networks
SLIDE 32 Segway Dynamic Bayesian Network
Hoffman et al. Nat. Methods 2012
SLIDE 33
SLIDE 34
SLIDE 35
SLIDE 36
SLIDE 37 Protein structure prediction
- We can predict the protein sequence
from reading DNA, but we do not know how it will fold to perform its function
SLIDE 38 Protein structure energy function
- Given our understanding of molecular dynamics, we
should be able to score different conformations of the same protein chain
- This is expensive, as proteins contain thousands of
atoms
SLIDE 39 Simplified Computational models of protein structure
SLIDE 40 Anfinsen's "conjecture"
- Since proteins can fold in the real world, the
energy landscape should have a very strong global optimum
SLIDE 41 Computationally this is difficult
- Even in a very simple model:
– hydrophobic/polar representation of residues
– on a rectangular lattice
- the problem of finding the optimal configuration is computationally hard
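A small sketch of such a hydrophobic/polar lattice model, assuming a 2D square lattice and the common convention that each contact between two non-consecutive hydrophobic (H) residues lowers the energy by one. The exhaustive search is only feasible for very short chains, which is exactly the point of the slide.

```python
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def energy(sequence, coords):
    """Minus the number of H-H lattice contacts between non-consecutive residues."""
    position = {xy: i for i, xy in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dx, dy in MOVES:
            j = position.get((x + dx, y + dy))
            # j > i + 1 counts each pair once and skips chain neighbours
            if j is not None and j > i + 1 and sequence[j] == "H":
                contacts += 1
    return -contacts


def best_fold(sequence):
    """Enumerate all self-avoiding walks; exponential in len(sequence)."""
    best = (1, None)  # any real energy is <= 0, so the first fold replaces this

    def extend(coords, occupied):
        nonlocal best
        if len(coords) == len(sequence):
            e = energy(sequence, coords)
            if e < best[0]:
                best = (e, list(coords))
            return
        x, y = coords[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:
                coords.append(nxt)
                occupied.add(nxt)
                extend(coords, occupied)
                coords.pop()
                occupied.remove(nxt)

    extend([(0, 0)], {(0, 0)})
    return best


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
straight = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(energy("HPPH", square), energy("HPPH", straight))  # -1 0
print(best_fold("HPPH"))
```

The number of self-avoiding walks grows exponentially with chain length, so practical methods use heuristics or sampling rather than enumeration.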
SLIDE 42 CASP experiment
- Critical Assessment of Structure Prediction methods
- Experimentalists solve structures and release sequences to scientists so that they can make blind predictions
SLIDE 43
Gamification of protein folding
SLIDE 44
SLIDE 45
Solving new HIV protein structure
SLIDE 46
Finding new algorithms
SLIDE 47
Making improved enzymes
SLIDE 48 Kryder's law
- The cost of magnetic storage used to follow Kryder's law of exponential reduction
- This is no longer the case
- This creates problems for
storing all the sequencing data
SLIDE 49 Storing data in DNA
- A text file, a few images and a sound file were stored in
DNA
SLIDE 50
Encoding a binary stream in sequenceable DNA
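The slide does not spell out the encoding, so the following is only one possible sketch. It assumes a scheme in which bytes are written in base 3 and each base-3 digit picks one of the three nucleotides that differ from the previous one, so that no nucleotide is ever repeated (repeats are harder to sequence reliably). The fixed six digits per byte and the virtual starting base "A" are simplifications chosen for this example.

```python
BASES = "ACGT"
TRITS_PER_BYTE = 6  # 3**6 = 729 >= 256, so six base-3 digits cover one byte


def bytes_to_trits(data):
    """Write each byte as six base-3 digits, most significant first."""
    trits = []
    for byte in data:
        digits = []
        for _ in range(TRITS_PER_BYTE):
            digits.append(byte % 3)
            byte //= 3
        trits.extend(reversed(digits))
    return trits


def trits_to_dna(trits, prev="A"):
    """Each digit selects one of the three bases different from the previous base."""
    out = []
    for t in trits:
        choices = [b for b in BASES if b != prev]
        prev = choices[t]
        out.append(prev)
    return "".join(out)


def dna_to_trits(dna, prev="A"):
    """Invert trits_to_dna by replaying the same choice lists."""
    trits = []
    for base in dna:
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))
        prev = base
    return trits


def trits_to_bytes(trits):
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for t in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + t
        out.append(value)
    return bytes(out)


message = b"HMM"
dna = trits_to_dna(bytes_to_trits(message))
decoded = trits_to_bytes(dna_to_trits(dna))
print(dna, decoded)
```

A practical scheme would additionally split the stream into short overlapping fragments with index and error-detection information, since only short DNA molecules can be synthesized and sequenced reliably.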
SLIDE 51
Cost of storing data in DNA
SLIDE 52
Cost of retrieving DNA-stored data
SLIDE 53
Cost comparison with tape storage
SLIDE 54
DNA is not only small, it's also extremely durable
SLIDE 55
But they were not first to publish
SLIDE 56
This is all a petty dispute about months...