Crash course on Computational Biology for Computer Scientists




slide-1
SLIDE 1

Crash course on Computational Biology for Computer Scientists

Bartek Wilczyński
bartek@mimuw.edu.pl
http://regulomics.mimuw.edu.pl
PhD Open lecture series, 17-19 XI 2016

slide-2
SLIDE 2

Topics for the course

  • Sequences in Biology – what do we study?
  • Sequence comparison and searching – how to quickly find relatives in large sequence banks
  • Tree-of-life and its construction(s)
  • Short sequence mapping – where did this word come from?
  • DNA sequencing and assembly – puzzles for experts
  • Sequence segmentation – finding modules by flipping coins
  • Data storage and compression – from DNA to bits and back again
  • Structures in Biology – small and smaller
slide-3
SLIDE 3

Markov Models

slide-4
SLIDE 4

Hidden Markov Models

  • Now the Markov chain is not observable
  • We only observe some emitted signals, which depend probabilistically on the chain state
  • So in addition to the transition matrix, we have an emission matrix

slide-5
SLIDE 5

Trajectories of HMMs

  • The Markov model changes states (Xs) over time according to the transition matrix
  • In each state a random symbol is emitted according to the emission probabilities
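
A minimal sketch of how such a trajectory could be generated. The two-state "fair/loaded coin" model, its probabilities, and all function names are illustrative assumptions, not taken from the slides.

```python
import random

# Hypothetical two-state HMM; every number below is made up for illustration.
STATES = ("fair", "loaded")
START = {"fair": 0.5, "loaded": 0.5}
TRANSITION = {
    "fair": {"fair": 0.95, "loaded": 0.05},
    "loaded": {"fair": 0.10, "loaded": 0.90},
}
EMISSION = {
    "fair": {"H": 0.5, "T": 0.5},
    "loaded": {"H": 0.9, "T": 0.1},
}


def draw(distribution, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample_trajectory(length, seed=0):
    """Return (hidden_states, emitted_symbols), each a list of `length` items."""
    rng = random.Random(seed)
    states, symbols = [], []
    state = draw(START, rng)
    for _ in range(length):
        states.append(state)
        # The emitted symbol depends only on the current hidden state.
        symbols.append(draw(EMISSION[state], rng))
        # The next hidden state depends only on the current hidden state.
        state = draw(TRANSITION[state], rng)
    return states, symbols


if __name__ == "__main__":
    xs, ys = sample_trajectory(20)
    print("hidden:  ", " ".join(x[0] for x in xs))
    print("observed:", " ".join(ys))
```

Only the second printed line would be visible to an observer; the first is the hidden trajectory that the following slides try to reconstruct.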

slide-6
SLIDE 6

HMM example

slide-7
SLIDE 7

Reconstructing trajectory states

slide-8
SLIDE 8

Viterbi algorithm
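
The slide names the algorithm without spelling it out, so here is a minimal sketch of Viterbi decoding in log space. It assumes a non-empty observation list and strictly positive probabilities; the coin model and all names are hypothetical.

```python
import math

# Hypothetical two-state model (illustrative numbers only).
STATES = ("fair", "loaded")
START = {"fair": 0.5, "loaded": 0.5}
TRANSITION = {
    "fair": {"fair": 0.95, "loaded": 0.05},
    "loaded": {"fair": 0.10, "loaded": 0.90},
}
EMISSION = {
    "fair": {"H": 0.5, "T": 0.5},
    "loaded": {"H": 0.9, "T": 0.1},
}


def viterbi(observations, states, start, transition, emission):
    """Return the single most probable hidden-state path for the observations."""
    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: math.log(start[s]) + math.log(emission[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states,
                       key=lambda p: best[t - 1][p] + math.log(transition[p][s]))
            best[t][s] = (best[t - 1][prev] + math.log(transition[prev][s])
                          + math.log(emission[s][observations[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi(list("HHTHHHHHHHTT"), STATES, START, TRANSITION, EMISSION))
```

Working with logarithms avoids numerical underflow on long sequences; the running time is proportional to the sequence length times the square of the number of states.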

slide-9
SLIDE 9

The forward and backward probabilities of trajectories

slide-10
SLIDE 10

Where were we at time t?

Given the sequence of emitted symbols, we can estimate the likely states of the hidden system
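
One way to make this concrete is the forward-backward computation of P(state at time t | all observations). The sketch below is unscaled, so it is only suitable for short sequences; the model and all names are illustrative assumptions.

```python
# Hypothetical two-state model (illustrative numbers only).
STATES = ("fair", "loaded")
START = {"fair": 0.5, "loaded": 0.5}
TRANSITION = {
    "fair": {"fair": 0.95, "loaded": 0.05},
    "loaded": {"fair": 0.10, "loaded": 0.90},
}
EMISSION = {
    "fair": {"H": 0.5, "T": 0.5},
    "loaded": {"H": 0.9, "T": 0.1},
}


def forward(obs, states, start, trans, emit):
    """f[t][s] = P(obs[0..t], state at time t is s)."""
    f = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for y in obs[1:]:
        prev = f[-1]
        f.append({s: emit[s][y] * sum(prev[p] * trans[p][s] for p in states)
                  for s in states})
    return f


def backward(obs, states, trans, emit):
    """b[t][s] = P(obs[t+1..] | state at time t is s)."""
    b = [{s: 1.0 for s in states}]
    for y in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {s: sum(trans[s][n] * emit[n][y] * nxt[n] for n in states)
                     for s in states})
    return b


def posterior(obs, states, start, trans, emit):
    """Return, for every time t, a dict {state: P(state at t | obs)}."""
    f = forward(obs, states, start, trans, emit)
    b = backward(obs, states, trans, emit)
    likelihood = sum(f[-1].values())
    return [{s: f[t][s] * b[t][s] / likelihood for s in states}
            for t in range(len(obs))]


if __name__ == "__main__":
    for t, p in enumerate(posterior(list("HHHHTTHT"), STATES, START,
                                    TRANSITION, EMISSION)):
        print(t, {s: round(v, 3) for s, v in p.items()})
```

Unlike Viterbi, which returns one best path, this gives a probability for every state at every position.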

slide-11
SLIDE 11

The emission matrix can then be estimated

slide-12
SLIDE 12

As well as the transition matrix

slide-13
SLIDE 13

Baum-Welch algorithm
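
A minimal sketch of a single Baum-Welch (EM) iteration for one observation sequence, combining the posterior state estimates with the re-estimation of the emission and transition matrices. It is unscaled and assumes strictly positive starting parameters and at least two observations; the starting model and all names are hypothetical.

```python
# Hypothetical starting guess and data (illustrative only).
STATES = ("fair", "loaded")
SYMBOLS = ("H", "T")
OBS = list("HHHHHTTHTHHHHHHHTTTHTH")
START = {"fair": 0.5, "loaded": 0.5}
TRANS = {"fair": {"fair": 0.8, "loaded": 0.2},
         "loaded": {"fair": 0.3, "loaded": 0.7}}
EMIT = {"fair": {"H": 0.5, "T": 0.5},
        "loaded": {"H": 0.8, "T": 0.2}}


def forward_backward(obs, states, start, trans, emit):
    """Return unscaled forward table, backward table and the likelihood."""
    T = len(obs)
    f = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for t in range(1, T):
        f.append({s: emit[s][obs[t]] * sum(f[t - 1][p] * trans[p][s] for p in states)
                  for s in states})
    b = [None] * T
    b[T - 1] = {s: 1.0 for s in states}
    for t in range(T - 2, -1, -1):
        b[t] = {s: sum(trans[s][n] * emit[n][obs[t + 1]] * b[t + 1][n] for n in states)
                for s in states}
    return f, b, sum(f[T - 1].values())


def baum_welch_step(obs, states, symbols, start, trans, emit):
    """One EM iteration; returns re-estimated (start, trans, emit)."""
    f, b, like = forward_backward(obs, states, start, trans, emit)
    T = len(obs)
    # E-step: gamma[t][s] = P(state at t is s | obs)
    gamma = [{s: f[t][s] * b[t][s] / like for s in states} for t in range(T)]
    # xi[t][p][s] = P(state at t is p and state at t+1 is s | obs)
    xi = [{p: {s: f[t][p] * trans[p][s] * emit[s][obs[t + 1]] * b[t + 1][s] / like
               for s in states} for p in states} for t in range(T - 1)]
    # M-step: expected counts, normalised row by row.
    new_start = dict(gamma[0])
    new_trans = {}
    for p in states:
        visits = sum(gamma[t][p] for t in range(T - 1))
        new_trans[p] = {s: sum(xi[t][p][s] for t in range(T - 1)) / visits
                        for s in states}
    new_emit = {}
    for s in states:
        total = sum(gamma[t][s] for t in range(T))
        new_emit[s] = {y: sum(gamma[t][s] for t in range(T) if obs[t] == y) / total
                       for y in symbols}
    return new_start, new_trans, new_emit


if __name__ == "__main__":
    model = (START, TRANS, EMIT)
    for i in range(5):
        print(i, forward_backward(OBS, STATES, *model)[2])
        model = baum_welch_step(OBS, STATES, SYMBOLS, *model)
```

The printed likelihood should not decrease from one iteration to the next, which is the usual EM guarantee; the procedure converges to a local optimum that depends on the starting guess.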

slide-14
SLIDE 14

Expectation-Maximization

slide-15
SLIDE 15

Protein structure

slide-16
SLIDE 16

Protein domains

slide-17
SLIDE 17

Profile HMMs

slide-18
SLIDE 18

Finding a domain in a longer protein sequence

slide-19
SLIDE 19

PFAM sequence annotation

slide-20
SLIDE 20

What is the chromatin state?

UCSF School of Medicine

slide-21
SLIDE 21

ChIP data from ENCODE project

slide-22
SLIDE 22

Chromatin Immunoprecipitation data

  • Considerable noise level
slide-23
SLIDE 23

HMM model

  • TileMap method (Ji & Wong 2005, Bioinformatics)
  • Hidden Markov model for segmentation of ChIP data with 2 states:
    – 0 – no enrichment
    – 1 – enrichment
  • Emissions are Gaussian
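
A minimal sketch of a two-state HMM with Gaussian emissions used to segment a noisy signal. This is not the TileMap implementation: the means, standard deviations, the `stay` probability, and the function names are all assumptions chosen for illustration.

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log-density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def segment(signal, means=(0.0, 2.0), sds=(1.0, 1.0), stay=0.9):
    """Viterbi-decode a non-empty signal into 0 (no enrichment) / 1 (enrichment).

    Hypothetical parameters: state s emits N(means[s], sds[s]); the chain keeps
    its state with probability `stay` and starts in either state with 0.5.
    """
    log_stay, log_switch = math.log(stay), math.log(1.0 - stay)
    score = [math.log(0.5) + gaussian_logpdf(signal[0], means[s], sds[s])
             for s in (0, 1)]
    back = []
    for x in signal[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            candidates = [score[p] + (log_stay if p == s else log_switch)
                          for p in (0, 1)]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best_prev)
            new_score.append(candidates[best_prev]
                             + gaussian_logpdf(x, means[s], sds[s]))
        score = new_score
        back.append(pointers)
    # Trace back from the better final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    probes = [0.1, -0.2, 0.0, 2.1, 1.9, 2.2, 2.0, 0.1, -0.1]
    print(segment(probes))
```

The `stay` probability acts as a smoothing penalty: an isolated high value is not called enriched unless the gain in emission likelihood outweighs the cost of switching state twice.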

slide-24
SLIDE 24

Emission model in TileMap

slide-25
SLIDE 25

Using Gaussian HMM for Stock Market

From the scikit-learn documentation

slide-26
SLIDE 26

Filion et al., Cell 2010

You can use HMMs for chromatin

slide-27
SLIDE 27

Using PCA to limit the emission space dimension

  • Principal component analysis is a method of identifying orthogonal vectors with maximal variance in multidimensional data
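
A minimal sketch of finding the first principal component by power iteration on the sample covariance matrix, using only the standard library. The function names and toy data are assumptions; in practice a library routine would be used, and the fixed starting vector fails in the rare case where it is orthogonal to the leading component.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (means, unit vector of maximal variance) for a list of data rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # arbitrary starting direction (an assumption of this sketch)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def project(row, means, component):
    """Coordinate of one data row along the given component."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
    means, pc1 = first_principal_component(data)
    print(pc1, [project(r, means, pc1) for r in data])
```

Replacing each high-dimensional measurement vector with its coordinates along a few leading components keeps the emission model small.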

slide-28
SLIDE 28

Independent multidimensional emissions

  • ChromHMM takes a different approach
  • One can assume that all of the different ChIP measurements are independent of each other (given the hidden state)
  • Then, instead of an exponential explosion of the emission space, we have a single matrix of emission probabilities: one entry per hidden state and ChIP measurement
  • For each observable ChIP signal we need a vector of probabilities, one for each hidden state
  • This can even be extended to Gaussian emissions
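
A minimal sketch of the independence assumption: the probability of a whole vector of binary ChIP calls is the product of per-mark Bernoulli probabilities. The state names, mark names, and numbers are invented for illustration and are not taken from ChromHMM.

```python
# Hypothetical emission matrix: P(mark is present | hidden state).
EMISSION = {
    "active": {"H3K4me3": 0.8, "H3K27me3": 0.1},
    "repressed": {"H3K4me3": 0.1, "H3K27me3": 0.7},
}


def emission_probability(observed, mark_probs):
    """P(observed vector | state), assuming marks are independent given the state.

    observed:   {mark: 0 or 1}
    mark_probs: {mark: probability that the mark is present in this state}
    """
    p = 1.0
    for mark, present in observed.items():
        q = mark_probs[mark]
        p *= q if present else (1.0 - q)
    return p


def parameter_counts(n_marks):
    """Free emission parameters per state: full joint table vs independent marks."""
    return 2 ** n_marks - 1, n_marks


if __name__ == "__main__":
    window = {"H3K4me3": 1, "H3K27me3": 0}
    for state, probs in EMISSION.items():
        print(state, emission_probability(window, probs))
    print(parameter_counts(10))
```

The second function shows why the assumption matters: a full joint distribution over the presence or absence of every mark grows exponentially with the number of marks, while the independent model grows linearly.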
slide-29
SLIDE 29

Ernst&Kellis, 2012, Nat Biotech

slide-30
SLIDE 30

Emission matrix for Drosophila

modENCODE, Roy et al., Science 2010

slide-31
SLIDE 31

Bayesian Networks and Dynamic Bayesian Networks

slide-32
SLIDE 32

Segway Dynamic Bayesian Network

Hoffman et al. Nat. Methods 2012

slide-33
SLIDE 33
slide-34
SLIDE 34
slide-35
SLIDE 35
slide-36
SLIDE 36
slide-37
SLIDE 37

Protein structure prediction

  • We can predict the protein sequence from reading DNA, but we do not know how it will fold to perform its function

slide-38
SLIDE 38

Protein structure energy function

  • Given our understanding of molecular dynamics, we should be able to score different conformations of the same protein chain
  • This is expensive, as proteins contain thousands of atoms

slide-39
SLIDE 39

Simplified computational models of protein structure

slide-40
SLIDE 40

Anfinsen's „conjecture”

  • Since proteins can fold in the real world, the energy landscape should have a very strong global optimum

slide-41
SLIDE 41

Computationally this is difficult

  • Even the simplest model:
    – hydrophobic/polar representation of residues
    – on a rectangular lattice
  • leads to an NP-hard problem of finding the optimal configuration
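
A minimal sketch of the hydrophobic/polar (HP) model solved by brute force. It assumes a 2D square lattice and the common scoring of -1 per pair of H residues that are lattice neighbours but not consecutive in the chain; the function name is hypothetical. The exhaustive search is exponential in the chain length, so it is only usable for very short, non-empty sequences.

```python
def best_hp_fold(sequence):
    """Return (energy, coordinates) of an optimal fold of an HP string.

    Enumerates every self-avoiding walk on the 2D square lattice.
    Energy = -(number of H-H lattice contacts between non-consecutive residues).
    """
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    best = [1, None]  # any real fold has energy <= 0, so 1 is a safe sentinel

    def contacts(path):
        position = {xy: i for i, xy in enumerate(path)}
        count = 0
        for i, (x, y) in enumerate(path):
            if sequence[i] != "H":
                continue
            for dx, dy in moves:
                j = position.get((x + dx, y + dy))
                # j > i + 1 counts each non-consecutive pair exactly once.
                if j is not None and j > i + 1 and sequence[j] == "H":
                    count += 1
        return count

    def extend(path, occupied):
        if len(path) == len(sequence):
            energy = -contacts(path)
            if energy < best[0]:
                best[0], best[1] = energy, list(path)
            return
        x, y = path[-1]
        for dx, dy in moves:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:
                path.append(nxt)
                occupied.add(nxt)
                extend(path, occupied)
                path.pop()
                occupied.remove(nxt)

    extend([(0, 0)], {(0, 0)})
    return best[0], best[1]


if __name__ == "__main__":
    print(best_hp_fold("HPHPPHHPHH"))
```

Practical methods replace the exhaustive enumeration with heuristics such as Monte Carlo search, since no efficient exact algorithm is known.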

slide-42
SLIDE 42

CASP experiment

  • Critical Assessment of Structure Prediction methods
  • Crystallographers solve structures and release the sequences to scientists so that they can make blind predictions

slide-43
SLIDE 43

Gamification of protein folding

slide-44
SLIDE 44
slide-45
SLIDE 45

Solving new HIV protein structure

slide-46
SLIDE 46

Finding new algorithms

slide-47
SLIDE 47

Making improved enzymes

slide-48
SLIDE 48

Kryder's law

  • For a long time the cost of magnetic storage followed Kryder's law of exponential reduction
  • This is no longer the case
  • This creates problems for storing all the sequencing data

slide-49
SLIDE 49

Storing data in DNA

  • A text file, a few images and a sound file were stored in DNA

slide-50
SLIDE 50

Encoding of a binary stream in sequenceable DNA
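
The figure for this slide is not preserved in the text, so the sketch below is an assumption rather than the scheme shown: bytes are written as base-3 digits, and each digit selects one of the three nucleotides that differ from the previous one, so the same base never appears twice in a row (such repeats are hard to sequence reliably). All names are hypothetical.

```python
BASES = "ACGT"
TRITS_PER_BYTE = 6  # 3**6 = 729 >= 256, so six base-3 digits hold one byte


def bytes_to_trits(data):
    """Write every byte as six base-3 digits, most significant first."""
    trits = []
    for byte in data:
        digits = []
        for _ in range(TRITS_PER_BYTE):
            digits.append(byte % 3)
            byte //= 3
        trits.extend(reversed(digits))
    return trits


def trits_to_bytes(trits):
    """Inverse of bytes_to_trits."""
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for t in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + t
        out.append(value)
    return bytes(out)


def encode(data, previous="A"):
    """Encode bytes as DNA in which no base is immediately repeated."""
    dna = []
    for trit in bytes_to_trits(data):
        choices = [b for b in BASES if b != previous]  # three allowed bases
        previous = choices[trit]
        dna.append(previous)
    return "".join(dna)


def decode(dna, previous="A"):
    """Inverse of encode; `previous` must match the value used for encoding."""
    trits = []
    for base in dna:
        choices = [b for b in BASES if b != previous]
        trits.append(choices.index(base))
        previous = base
    return trits_to_bytes(trits)


if __name__ == "__main__":
    strand = encode(b"Hello, DNA!")
    print(strand)
    print(decode(strand))
```

A real storage system would also need addressing, redundancy, and error correction across many short strands, none of which is modelled here.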

slide-51
SLIDE 51

Cost of storing data in DNA

slide-52
SLIDE 52

Cost of retrieving DNA stored data

slide-53
SLIDE 53

Cost comparison with tape storage

slide-54
SLIDE 54

DNA is not only small, it's also extremely durable

slide-55
SLIDE 55

But they were not first to publish

slide-56
SLIDE 56

This is all a petty dispute about months...