Crash course on Computational Biology for Computer Scientists

Bartek Wilczyński, bartek@mimuw.edu.pl, http://regulomics.mimuw.edu.pl. PhD Open lecture series, 17-19 XI 2016.


slide-1
SLIDE 1

Crash course on Computational Biology for Computer Scientists

Bartek Wilczyński
bartek@mimuw.edu.pl
http://regulomics.mimuw.edu.pl
PhD Open lecture series, 17-19 XI 2016

slide-2
SLIDE 2

Topics for the course

  • Sequences in Biology – what do we study?
  • Sequence comparison and searching – how to quickly find relatives in large sequence banks

  • Tree-of-life and its construction(s)
  • DNA sequencing – puzzles for experts
  • Short sequence mapping – where did this word come from
  • Sequence segmentation – finding modules by flipping coins
  • Data storage and compression – from DNA to bits and back again

  • Structures in Biology – small and smaller
slide-3
SLIDE 3

How to make it efficient

  • Diverse audience, I don’t know what you know
  • Please do interrupt me if you have a question!
  • I will not go very deeply into biological details, so if you want more, please ask me later for links to more materials
  • I will not go deeply into proofs or derivations, so if you want more, please ask me later for links to more materials

  • If you need to ask later: bartek@mimuw.edu.pl
slide-4
SLIDE 4

Homework

  • I will post a few (>= 5) questions at the end, depending on how far we get in the lectures
  • Their nature will be diverse: derivations, proofs, computation, data analysis
  • If you want to pass the course and get credit, I’d ask you to solve N-1 questions to get grade N
  • E-mail your solutions to me at bartek@mimuw.edu.pl

slide-5
SLIDE 5

Alan Turing (1912 - 1954)

  • Very influential mathematician

  • Turing machine
  • Turing test
  • Enigma cracking
  • Why is he here?
slide-6
SLIDE 6
slide-7
SLIDE 7

“morphogen” in publications

slide-8
SLIDE 8

Molecular morphogens

[Figure panels: skin pattern; molecular level]

slide-9
SLIDE 9

The foundation of molecular biology

  • Watson and Crick publish the DNA structure in 1953 (using data from Franklin and Wilkins)
  • That leads to understanding of the nature of information storage in DNA
  • Now it is possible to have a vastly simplified model of a DNA sequence, just a sequence of letters over the DNA alphabet, that captures most of the heritable information
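
The "sequence of letters" model can be made concrete with a few lines of Python. This is only a minimal sketch: the names `DNA_ALPHABET`, `is_dna` and `reverse_complement` are illustrative, and the complement rule (A pairs with T, C pairs with G) comes from the base pairing in the double helix.

```python
# The DNA alphabet: four letters, one per nucleotide.
DNA_ALPHABET = frozenset("ACGT")

# Base pairing in the double helix: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_dna(seq: str) -> bool:
    """Return True if seq uses only letters of the DNA alphabet."""
    return set(seq.upper()) <= DNA_ALPHABET


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read in its own direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(is_dna("GATTACA"))
    print(reverse_complement("GATTACA"))
```

Treating DNA as a plain string is what lets the rest of the course reuse standard string algorithms.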
slide-10
SLIDE 10

DNA structure

slide-11
SLIDE 11

The DNA is not the only sequence

slide-12
SLIDE 12

Another idea ahead of its time

  • Gregor Mendel (1822 - 1884)
  • Introduced the idea of “factors” that we now (since the early 20th century) call genes
  • Smallest units of heritable information
  • Now we know they reside in DNA

slide-13
SLIDE 13

Where are the genes?

slide-14
SLIDE 14

The really big picture - evolution

[Diagram labels: Genome, Organism (phenotype), Environment, Selection, Reproduction, Time; regulation and epigenetics]

slide-15
SLIDE 15

Sequence evolution

  • Conceptually simple model: reproduction with mutation
  • Mutation rate very small, but considerable given genome sizes and cell numbers
  • Mutation on the DNA level, selection on the protein level
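
The "reproduction with mutation" model can be sketched as a toy simulation. It assumes that every position mutates independently with the same probability and that only substitutions occur. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
import random


def mutate(seq, rate, rng):
    """Copy seq, substituting each base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            # Substitute with one of the three other bases, chosen uniformly.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def evolve(seq, generations, rate, seed=0):
    """Apply `generations` rounds of reproduction with mutation."""
    rng = random.Random(seed)
    for _ in range(generations):
        seq = mutate(seq, rate, rng)
    return seq


def expected_mutations(rate, genome_length):
    """Expected number of substitutions in one copy of the genome."""
    return rate * genome_length


if __name__ == "__main__":
    print(evolve("ACGTACGTACGT", generations=100, rate=0.01))
    # Illustrative numbers only: a tiny per-base rate still adds up
    # once it is multiplied by a large genome.
    print(expected_mutations(1e-9, 3_000_000_000))
```

The last function illustrates the second bullet: a very small per-base rate multiplied by a large genome (and many cell divisions) gives a number of mutations that cannot be ignored.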

slide-16
SLIDE 16

Fundamental problem

slide-17
SLIDE 17

Lack of data on ancestral DNA

slide-18
SLIDE 18

Time reversibility

slide-19
SLIDE 19

Naive approach

slide-20
SLIDE 20

A more reasonable model – Jukes-Cantor (JC69)

  • Since 1969, many more models: K80, F81, T92, etc., all generalizing JC69 with more than just one parameter
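
As a worked example of the one-parameter model, the standard JC69 distance correction turns the observed fraction p of differing sites into an estimated number of substitutions per site, d = -(3/4) ln(1 - (4/3) p). The sketch below assumes two aligned sequences of equal length without gaps; the function names are illustrative.

```python
import math


def p_distance(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jc69_distance(p):
    """Jukes-Cantor estimate of substitutions per site from p-distance."""
    if not 0 <= p < 0.75:
        # At p >= 3/4 the sequences look random and the estimate diverges.
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
    print(p, jc69_distance(p))
```

The corrected distance is always at least as large as p, because it accounts for positions that mutated more than once.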
slide-21
SLIDE 21

Genetic code is degenerate

  • 64 DNA triplets encode only 20 amino acids
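
The degeneracy can be counted directly from the standard genetic code. The sketch below writes the code as a 64-letter string in T, C, A, G codon order, with `*` marking stop codons; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map every triplet to its amino acid (or '*' for stop).
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons encode each amino acid.
DEGENERACY = Counter(CODON_TABLE.values())

if __name__ == "__main__":
    print(len(CODON_TABLE), "codons")
    print(len(DEGENERACY) - 1, "amino acids plus stop")
    for aa, count in sorted(DEGENERACY.items()):
        print(aa, count)
```

Because several codons map to the same amino acid, some DNA substitutions leave the protein unchanged.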
slide-22
SLIDE 22

Question?

slide-23
SLIDE 23

Evolution models based on protein alphabet

slide-24
SLIDE 24

Hamming distance
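
A minimal sketch of the Hamming distance: the number of positions at which two sequences of equal length differ. The function name is illustrative.

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    print(hamming("GATTACA", "GACTATA"))
```

It only models substitutions, so it is undefined as soon as the sequences differ in length.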

slide-25
SLIDE 25

Errors in DNA are not just substitutions

slide-26
SLIDE 26

Edit distance
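
A minimal sketch of the edit (Levenshtein) distance, assuming unit cost for each substitution, insertion and deletion. It uses the usual dynamic programming table but keeps only two rows in memory; the function name is illustrative.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions
    needed to turn sequence a into sequence b."""
    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGT"))
```

The running time is proportional to the product of the two sequence lengths.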

slide-27
SLIDE 27

Sequence alignment

slide-28
SLIDE 28

Simple sequence comparison by dot-plotting
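
A dot plot can be sketched as a text matrix: one row per letter of the first sequence, one column per letter of the second, and a mark wherever the letters are equal. This minimal version compares single letters; practical tools usually compare short windows to reduce noise. The function name is illustrative.

```python
def dot_plot(a, b):
    """Return one string per letter of a; '*' marks positions where
    that letter equals the corresponding letter of b."""
    return ["".join("*" if x == y else "." for y in b) for x in a]


if __name__ == "__main__":
    for row in dot_plot("GATTACA", "GCATGCG"):
        print(row)
```

Similar regions show up as diagonal runs of marks.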

slide-29
SLIDE 29

Needleman-Wunsch dynamic programming algorithm

Images adapted from Durbin et al.
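
A minimal sketch of Needleman-Wunsch global alignment. The scoring scheme (match +1, mismatch -1, linear gap penalty -1) is an assumption made for this example, not taken from the slides, and the function name is illustrative.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # F[i][j] = best score of aligning a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,    # gap in b
                          F[i][j - 1] + gap)    # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, x, y = needleman_wunsch("GCATGCG", "GATTACA")
    print(score)
    print(x)
    print(y)
```

When several alignments share the best score, the traceback returns just one of them; which one depends on the order in which the three cases are checked.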

slide-30
SLIDE 30

Smith-Waterman – local version of alignment

  • If we add 0 as an option in the maximum of the dynamic programming formula
  • We get a local version of the algorithm, giving us the best matching substrings
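
A minimal sketch of the local variant. The only changes relative to global alignment are the extra 0 in the maximum, starting the traceback at the best-scoring cell, and stopping it at the first 0. The scores used here (match +2, mismatch -1, gap -1) are an assumption for the example, and the function name is illustrative.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment; returns (score, aligned_part_of_a, aligned_part_of_b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # start a new local alignment
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to 0.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGGACGGGG"))
```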

slide-31
SLIDE 31

Inconsistencies in pairwise alignments

slide-32
SLIDE 32

A consistent alignment of many sequences
slide-33
SLIDE 33

Scoring multiple sequence alignments (MSAs)
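
One common way to score an MSA is the sum-of-pairs score: every column contributes the sum of the pairwise scores of its letters. The sketch below is an assumption about the scoring scheme, not necessarily the one shown on the slide. It uses match +1, mismatch -1, gap against a letter -1, and ignores gap-gap pairs; the function name is illustrative.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score of an alignment given as equal-length rows."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have equal length")
    score = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue          # two gaps carry no information
            elif x == "-" or y == "-":
                score += gap
            elif x == y:
                score += match
            else:
                score += mismatch
    return score


if __name__ == "__main__":
    print(sum_of_pairs(["A-G", "ACG", "ACG"]))
```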

slide-34
SLIDE 34

Complexity of finding the optimal multiple alignment

slide-35
SLIDE 35

Can we overcome the complexity issue?

  • Theoretically, we could try to prove that P=NP, and then solve MSA
  • In practice, we are not (usually) making multiple alignments of random sequences. Usually we know they are related
  • Can we use the knowledge that they originated from an evolutionary process to guide our search for optimal MSA?

slide-36
SLIDE 36

Back to how evolution works

  • Tree-like model of sequence evolution
  • Common ancestor – root
  • Internal nodes – ancestral sequences
  • Leaves – currently available sequence pool or dead-ends

slide-37
SLIDE 37

The tree of life hypothesis

Interactive Tree of Life http://itol.embl.de/

slide-38
SLIDE 38

Evolution of species and within species

slide-39
SLIDE 39

Finding the phylogenetic tree

slide-40
SLIDE 40

Bifurcating or multifurcating trees

  • Even though real evolution might very well include multifurcating nodes (i.e. speciation events producing more than two species at once)
  • It is enough to consider binary trees (which may lead to multiple binary tree topologies)

slide-41
SLIDE 41

How many different binary trees?

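  • How many different binary trees can there be for n given sequences?
  • If the left-to-right order of the leaves is fixed, the answer is the Catalan number C(n-1) = (2(n-1))!/((n-1)! n!)
  • If the n leaves are labeled and may appear in any order, there are (2n-3)!! rooted binary trees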
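
To get a feeling for how fast these numbers grow, the counts can be computed directly. The sketch evaluates the Catalan formula from the slide, which counts binary trees whose n leaves are in a fixed left-to-right order. For comparison it also evaluates the standard counts for n labeled leaves: (2n-3)!! rooted and (2n-5)!! unrooted binary trees. The function names are illustrative.

```python
from math import comb


def catalan(k):
    """k-th Catalan number, (2k)! / (k! (k+1)!)."""
    return comb(2 * k, k) // (k + 1)


def odd_double_factorial(m):
    """m!! = m * (m-2) * ... * 3 * 1 for odd m; 1 if m < 1."""
    result = 1
    for k in range(m, 1, -2):
        result *= k
    return result


def ordered_binary_trees(n):
    """Binary trees with n leaves in fixed order: (2(n-1))! / ((n-1)! n!)."""
    return catalan(n - 1)


def rooted_labeled_binary_trees(n):
    """Rooted binary trees on n >= 2 labeled leaves: (2n-3)!!."""
    return odd_double_factorial(2 * n - 3)


def unrooted_labeled_binary_trees(n):
    """Unrooted binary trees on n >= 3 labeled leaves: (2n-5)!!."""
    return odd_double_factorial(2 * n - 5)


if __name__ == "__main__":
    for n in (3, 4, 5, 10, 20):
        print(n, ordered_binary_trees(n),
              rooted_labeled_binary_trees(n),
              unrooted_labeled_binary_trees(n))
```

Either way the growth is super-exponential, so exhaustive search over tree topologies is hopeless beyond a handful of sequences.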

slide-42
SLIDE 42

Rooted vs unrooted trees

  • Many different rooted trees actually correspond to the same unrooted tree topology
  • This unrooted tree with branch lengths can correspond to a distance matrix

slide-43
SLIDE 43

Reconstructing a tree from distance matrix

slide-44
SLIDE 44

Non-ultrametric vs Ultrametric trees

slide-45
SLIDE 45

Ultrametric vs metric

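  • Any metric d requires: d(x,y) >= 0, d(x,y) = 0 iff x = y, d(x,y) = d(y,x), and the triangle inequality d(x,z) <= d(x,y) + d(y,z)
  • If it is ultrametric, it also satisfies that any 3 leaves can be renamed x, y, z so that: d(x,y) <= d(x,z) = d(y,z)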
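
The ultrametric condition is usually stated as the three-point condition: among the three pairwise distances of any three leaves, the two largest are equal. A minimal sketch of a checker, assuming a symmetric distance matrix given as a list of lists; the function name and the tolerance are illustrative.

```python
from itertools import combinations


def is_ultrametric(d, tol=1e-9):
    """Check the three-point condition on a symmetric distance matrix."""
    n = len(d)
    for i, j, k in combinations(range(n), 3):
        # Sort the three pairwise distances; the two largest must coincide.
        _, middle, largest = sorted((d[i][j], d[i][k], d[j][k]))
        if abs(largest - middle) > tol:
            return False
    return True


if __name__ == "__main__":
    print(is_ultrametric([[0, 2, 6], [2, 0, 6], [6, 6, 0]]))
    print(is_ultrametric([[0, 2, 6], [2, 0, 5], [6, 5, 0]]))
```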

slide-46
SLIDE 46

UPGMA (Unweighted Pair Group Method with Arithmetic Mean)

slide-47
SLIDE 47

How does it work?

  • We start from a matrix and finish with an ultrametric tree
  • If the matrix is not ultrametric, the result might not be optimal
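
A minimal sketch of UPGMA: repeatedly merge the two closest clusters, place the new node at half their distance, and set the distance from the merged cluster to every other cluster to the size-weighted average. The output format, nested tuples `(left, right, height)`, is a choice made for this example, and the function name is illustrative.

```python
def upgma(names, dist):
    """Build an ultrametric tree from a symmetric distance matrix.
    Leaves are names; internal nodes are (left, right, height) tuples."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Pick the closest pair of clusters.
        (i, j), dij = min(d.items(), key=lambda kv: kv[1])
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        # Distance from the merged cluster to each remaining cluster.
        new_d = {}
        for k in clusters:
            dik = d[min(i, k), max(i, k)]
            djk = d[min(j, k), max(j, k)]
            new_d[(k, next_id)] = (size_i * dik + size_j * djk) / (size_i + size_j)
        # Drop all distances that involve the two merged clusters.
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_d)
        clusters[next_id] = ((tree_i, tree_j, dij / 2), size_i + size_j)
        next_id += 1
    (tree, _), = clusters.values()
    return tree


if __name__ == "__main__":
    print(upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]]))
```

On an ultrametric matrix the node heights reproduce the input distances exactly; on other matrices the tree is still ultrametric but only approximates the input.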

slide-48
SLIDE 48

Neighbor-joining
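
A minimal sketch of a single neighbor-joining step, following the standard formulation of the method. With n current nodes it picks the pair (i, j) that minimizes Q(i,j) = (n-2) d(i,j) - sum_k d(i,k) - sum_k d(j,k), then computes the branch lengths from i and j to their new parent. The function name and the example matrix are illustrative; a full implementation would replace i and j by the new node and repeat.

```python
def nj_pick_pair(dist):
    """One neighbor-joining step on a symmetric matrix with n >= 3 nodes.
    Returns (i, j, length_i, length_j) for the pair to join."""
    n = len(dist)
    totals = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # Branch lengths from i and j to the newly created internal node.
    length_i = dist[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    length_j = dist[i][j] - length_i
    return i, j, length_i, length_j


if __name__ == "__main__":
    example = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    print(nj_pick_pair(example))
```

Unlike UPGMA, the pair with the smallest raw distance is not necessarily the one that gets joined, which is why the method also works on trees that are not ultrametric.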

slide-49
SLIDE 49

Properties of NJ algorithm

slide-50
SLIDE 50

Further tree-related problems

  • Gene-species tree reconciliation
  • Tree refinement
  • Horizontal gene transfer - Phylogenetic networks
  • Comparison of large trees
  • Optimality measures for phylogenetic trees
  • True Ancestral sequence reconstruction
  • Etc...
slide-51
SLIDE 51

Gene-species tree reconciliation

slide-52
SLIDE 52

Horizontal gene transfer

slide-53
SLIDE 53

Now back to multiple alignments

  • Theoretically, we could try to prove that P=NP, and then solve MSA
  • In practice, we are not (usually) making multiple alignments of random sequences. Usually we know they are related
  • Can we use the knowledge that they originated from an evolutionary process to guide our search for optimal MSA?

slide-54
SLIDE 54

Feng-Doolittle approach

slide-55
SLIDE 55

Score for profile alignment

slide-56
SLIDE 56

A first proper approach - CLUSTALW

slide-57
SLIDE 57

Practical issues with the simple incremental approach

slide-58
SLIDE 58

T-Coffee algorithm (Notredame 2000)

  • Create one library of global pairwise alignments
  • And one library of local pairwise alignments
  • Use the signals in both for improvement of the progressive alignment

slide-59
SLIDE 59

T-Coffee in action

slide-60
SLIDE 60

MUSCLE method (Edgar 2004)

slide-61
SLIDE 61

Books to read more