Assembly Assembly Computational Challenge: assemble individual short - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Assembly

slide-2
SLIDE 2

Assembly

Computational Challenge: assemble individual short fragments (reads) into a single genomic sequence (“superstring”)

slide-3
SLIDE 3

Problem: Given a set of strings, find a shortest string that contains all of them
Input: Strings s1, s2, …, sn
Output: A string s that contains all strings s1, s2, …, sn as substrings, such that the length of s is minimized

Shortest common superstring
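To make the problem statement concrete, here is a minimal brute-force sketch. It is not taken from the slides: the function names `overlap` and `shortest_common_superstring` are hypothetical, and the approach (try every ordering and merge neighbours by their longest overlap) is only feasible for a handful of strings.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_common_superstring(strings):
    """Exact search over all orderings; exponential, for tiny inputs only."""
    # Drop duplicates and any string already contained in another one.
    uniq = list(dict.fromkeys(strings))
    reads = [s for s in uniq if not any(s != t and s in t for t in uniq)]
    if not reads:
        return ""
    best = None
    for order in permutations(reads):
        merged = order[0]
        for nxt in order[1:]:
            # Append only the part of nxt that is not already covered.
            merged += nxt[overlap(merged, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best
```

The factorial number of orderings is what makes this approach impractical for real read sets and motivates the graph formulations on the following slides.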

slide-4
SLIDE 4

Shortest common superstring

slide-5
SLIDE 5

Overlap Graph

slide-6
SLIDE 6

De Bruijn Graph

slide-7
SLIDE 7

Nodes: AT, GT, CG, CA, GC, TG, GG
The path visits every EDGE once

Overlap graph vs De Bruijn graph
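A minimal sketch of how a De Bruijn graph can be built from k-mers: each k-mer becomes one edge from its (k-1)-prefix to its (k-1)-suffix. The function name `de_bruijn_graph` and the list of 3-mers are assumptions made for illustration; the 3-mers were chosen so that the resulting nodes match the two-letter labels on the slide.

```python
from collections import defaultdict


def de_bruijn_graph(kmers):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it points to."""
    graph = defaultdict(list)
    for kmer in kmers:
        # One edge per k-mer: prefix node -> suffix node.
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Hypothetical input: the 3-mers of the string ATGGCGTGCA.
KMERS = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
GRAPH = de_bruijn_graph(KMERS)
```

In the overlap graph the reads are the nodes and a solution visits every node once; here the k-mers are the edges, so a solution visits every edge once.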

slide-8
SLIDE 8

Some definitions

slide-9
SLIDE 9

Eulerian walk/path

zero or

slide-10
SLIDE 10
  • a. Start with an arbitrary vertex v and form an arbitrary cycle with unused edges until a dead end is reached. Since the graph is Eulerian, this dead end is necessarily the starting point, i.e., vertex v.

Assume all nodes are balanced

slide-11
SLIDE 11
  • b. If the cycle from (a) is not an Eulerian cycle, it must contain a vertex w that has untraversed edges. Perform step (a) again, using vertex w as the starting point. Once again, we will end up in the starting vertex w.

slide-12
SLIDE 12
  • c. Combine the cycles from (a) and (b) into a single cycle and iterate step (b).
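Steps (a) to (c) are commonly known as Hierholzer's algorithm. The sketch below uses the usual iterative, stack-based form, in which the sub-cycles of steps (b) and (c) are spliced together implicitly as the stack unwinds. The function name `eulerian_cycle` and the adjacency-list representation are assumptions, and the graph is assumed to be connected with all nodes balanced.

```python
def eulerian_cycle(graph, start):
    """Return an Eulerian cycle as a list of nodes, beginning and ending at start.

    graph maps each node to a list of successor nodes (directed edges).
    Assumes every node is balanced and all edges are reachable from start.
    """
    unused = {u: list(vs) for u, vs in graph.items()}  # copy, edges get consumed
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if unused.get(v):
            # Step (a): keep walking along unused edges.
            stack.append(unused[v].pop())
        else:
            # Dead end: v has no unused edges left, so it is final in the cycle.
            cycle.append(stack.pop())
    return cycle[::-1]
```

Every edge is pushed and popped exactly once, so the running time is linear in the number of edges.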

slide-13
SLIDE 13
  • A vertex v is semibalanced if |in-degree(v) - out-degree(v)| = 1

  • If a graph has an Eulerian path starting from s and ending at t, then all its vertices are balanced with the possible exception of s and t

  • Add an edge from t to s, the two semibalanced vertices: now all vertices should be balanced (assuming there was an Eulerian path to begin with). Find the Eulerian cycle, and remove the edge you had added. You now have the Eulerian path you wanted.

Eulerian path
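The reduction from Eulerian path to Eulerian cycle can be sketched as follows. The names `eulerian_cycle` and `eulerian_path` are hypothetical, the graph is assumed to actually contain an Eulerian path, and the example graph is an assumed De Bruijn graph over two-letter nodes.

```python
from collections import Counter


def eulerian_cycle(graph, start):
    """Stack-based cycle search; assumes all nodes are balanced."""
    unused = {u: list(vs) for u, vs in graph.items()}
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if unused.get(v):
            stack.append(unused[v].pop())
        else:
            cycle.append(stack.pop())
    return cycle[::-1]


def eulerian_path(graph):
    """Return an Eulerian path as a list of nodes (assumes one exists)."""
    out_deg = Counter({u: len(vs) for u, vs in graph.items()})
    in_deg = Counter(v for vs in graph.values() for v in vs)
    nodes = set(out_deg) | set(in_deg)
    starts = [v for v in nodes if out_deg[v] - in_deg[v] == 1]
    ends = [v for v in nodes if in_deg[v] - out_deg[v] == 1]
    if not starts and not ends:
        # Already balanced: any Eulerian cycle is also an Eulerian path.
        return eulerian_cycle(graph, next(iter(graph)))
    s, t = starts[0], ends[0]
    # Add the extra edge t -> s so that every node becomes balanced.
    augmented = {u: list(vs) for u, vs in graph.items()}
    augmented.setdefault(t, []).append(s)
    cycle = eulerian_cycle(augmented, s)
    body = cycle[:-1]  # drop the repeated final node
    for i in range(len(body)):
        # Cut the cycle at the added edge t -> s and rotate it into a path.
        if body[i] == t and cycle[i + 1] == s:
            return body[i + 1:] + body[:i + 1]


# Hypothetical example graph over two-letter nodes.
GRAPH = {"AT": ["TG"], "TG": ["GG", "GC"], "GG": ["GC"],
         "GC": ["CG", "CA"], "CG": ["GT"], "GT": ["TG"]}
PATH = eulerian_path(GRAPH)
# Spell the sequence: first node plus the last letter of each following node.
SEQUENCE = PATH[0] + "".join(node[-1] for node in PATH[1:])
```

More than one Eulerian path can exist in the same graph, so the spelled sequence is not necessarily unique.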

slide-14
SLIDE 14

Complexity?

slide-15
SLIDE 15

Hidden Markov Models

slide-16
SLIDE 16

Markov Model (Finite State Machine with Probs)

Modeling a sequence of weather observations

slide-17
SLIDE 17

Hidden Markov Models

Assume the states in the machine are not observed and we can observe some output at certain states.

slide-18
SLIDE 18

Hidden Markov Models

Assume the states in the machine are not observed and we can observe some output at certain states.

Hidden: Sunny, Rainy
Observation: Walk, Shop, Clean

slide-19
SLIDE 19

Hidden: s(i-1) → s(i) → s(i+1), with transitions p(s(i)|s(i-1)) and p(s(i+1)|s(i))

Observed: x(i-1), x(i), x(i+1), with emissions p(x(i-1)|s(i-1)), p(x(i)|s(i)), p(x(i+1)|s(i+1))

Generate a sequence from an HMM
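Generating a sequence alternates two draws: emit an observation from the current hidden state, then move to the next hidden state. The sketch below assumes dictionary-based parameters; the function name `sample_hmm` and all probability values are made up for illustration and do not come from the slides.

```python
import random


def sample_hmm(init, trans, emit, length, seed=None):
    """Draw (states, observations) of the given length from an HMM."""
    rng = random.Random(seed)

    def draw(dist):
        # dist maps outcome -> probability.
        outcomes, weights = zip(*dist.items())
        return rng.choices(outcomes, weights=weights)[0]

    states, observations = [], []
    state = draw(init)                          # s(1) from the initial distribution
    for _ in range(length):
        states.append(state)
        observations.append(draw(emit[state]))  # x(i) given s(i)
        state = draw(trans[state])              # s(i+1) given s(i)
    return states, observations


# Hypothetical parameters for the weather example.
INIT = {"Sunny": 0.5, "Rainy": 0.5}
TRANS = {"Sunny": {"Sunny": 0.8, "Rainy": 0.2},
         "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
EMIT = {"Sunny": {"Walk": 0.6, "Shop": 0.3, "Clean": 0.1},
        "Rainy": {"Walk": 0.1, "Shop": 0.4, "Clean": 0.5}}
```

Calling `sample_hmm(INIT, TRANS, EMIT, 10, seed=0)` returns two lists of length 10; only the second one would be visible to an observer.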

slide-20
SLIDE 20

Hidden (temperature): HHHHHHCCCCCCCHHHHHH
Observed (number of ice creams): 3323332111111233332

Generate a sequence from an HMM

slide-21
SLIDE 21

Speech recognition
Action recognition

Hidden Markov Models: Applications

slide-22
SLIDE 22

Motif Finding

Problem: Find frequent motifs of length L in a sequence dataset
Assumption: the motifs are very similar to each other but look very different from the rest of the sequences

ATCGCGCGGCGCGGAATCGDTATCGCGCGCCCAGGTAAGT
GCGCGCGCAGGTAAGGTATTATGCGAGACGATGTGCTATT
GTAGGCTGATGTGGGGGGAAGGTAAGTCGAGGAGTGCATG
CTAGGGAAACCGCGCGCGCGCGATAAGGTGAGTGGGAAAG

slide-23
SLIDE 23

Motif: a first approximation

Assumption 1: lengths of motifs are fixed to L
Assumption 2: states at different positions of the sequence are independently distributed

$$p(x) = \prod_{i=1}^{L} p_i(x(i))$$

$$p_i(A) = \frac{N_i(A)}{N_i(A) + N_i(T) + N_i(G) + N_i(C)}$$
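A minimal sketch of this independent-position model, assuming the motif instances are already aligned and all have length L. The function names and the four example motifs are hypothetical. No pseudocounts are used, so a letter never seen at a position gets probability 0.

```python
from collections import Counter


def position_probabilities(motifs, alphabet="ATGC"):
    """Estimate p_i(letter) = N_i(letter) / sum of N_i over the alphabet."""
    length = len(motifs[0])
    profile = []
    for i in range(length):
        counts = Counter(m[i] for m in motifs)   # N_i(.) at position i
        total = sum(counts[a] for a in alphabet)
        profile.append({a: counts[a] / total for a in alphabet})
    return profile


def motif_probability(x, profile):
    """p(x) as the product of the per-position probabilities."""
    p = 1.0
    for i, letter in enumerate(x):
        p *= profile[i][letter]
    return p


# Hypothetical aligned motif instances.
MOTIFS = ["AGGT", "AGGT", "ACGT", "TGGT"]
PROFILE = position_probabilities(MOTIFS)
```

With these four instances, "A" occurs in three of four first positions and "G" in three of four second positions, so `motif_probability("AGGT", PROFILE)` is 0.75 × 0.75 × 1 × 1.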

slide-24
SLIDE 24

Motif: (Hidden) Markov models

Assumption 1: lengths of motifs are fixed to L
Assumption 2: future letters depend only on the present letter

$$p(x) = p_1(x(1)) \prod_{i=2}^{L} p_i(x(i) \mid x(i-1))$$

$$p_i(A \mid G) = \frac{N_{i-1,i}(G, A)}{N_{i-1}(G)}$$
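The first-order model can be estimated from aligned motif instances by counting adjacent letter pairs at each position. This is a sketch under that assumption; the function names and example motifs are hypothetical, and unseen letters or pairs are given probability 0 because no pseudocounts are added.

```python
from collections import Counter


def train_markov_motif(motifs):
    """Estimate p_1 and the position-specific conditionals p_i(a | g)."""
    length, n = len(motifs[0]), len(motifs)
    first = Counter(m[0] for m in motifs)
    p1 = {a: c / n for a, c in first.items()}
    trans = []
    for i in range(1, length):
        pair = Counter((m[i - 1], m[i]) for m in motifs)  # N_{i-1,i}(g, a)
        prev = Counter(m[i - 1] for m in motifs)          # N_{i-1}(g)
        trans.append({(g, a): c / prev[g] for (g, a), c in pair.items()})
    return p1, trans


def markov_probability(x, p1, trans):
    """p(x) = p_1(x(1)) times the product of p_i(x(i) | x(i-1))."""
    p = p1.get(x[0], 0.0)
    for i in range(1, len(x)):
        p *= trans[i - 1].get((x[i - 1], x[i]), 0.0)
    return p


# Hypothetical aligned motif instances.
MOTIFS = ["AGGT", "AGGT", "ACGT", "TGGT"]
P1, TRANS = train_markov_motif(MOTIFS)
```

Here "G" follows "A" in two of the three instances that start with "A", so `markov_probability("AGGT", P1, TRANS)` is 0.75 × 2/3 × 1 × 1.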

slide-25
SLIDE 25

Motif Finding

Problem: We don’t know the exact locations of motifs in the sequence dataset
Assumption: the motifs are very similar to each other but look very different from the rest of the sequences

ATCGCGCGGCGCGGAATCGDTATCGCGCGCCCAGGTAAGT
GCGCGCGCAGGTAAGGTATTATGCGAGACGATGTGCTATT
GTAGGCTGATGTGGGGGGAAGGTAAGTCGAGGAGTGCATG
CTAGGGAAACCGCGCGCGCGCGATAAGGTGAGTGGGAAAG

slide-26
SLIDE 26

Hidden state space

States: null, start, end

slide-27
SLIDE 27

Hidden Markov Model (HMM)

[Figure: state diagram over the states null, start, and end, with edge probabilities 0.9, 0.08, 0.95, 0.05, 0.01, 0.99, 0.02]

slide-28
SLIDE 28

How to build HMMs?

slide-29
SLIDE 29

Computational problems in HMMs

slide-30
SLIDE 30

Hidden Markov Models

slide-31
SLIDE 31

Hidden Markov Model

Hidden: q(i-1), q(i), q(i+1)
Observed: o(i-1), o(i), o(i+1)

slide-32
SLIDE 32

Conditional Probability of Observations

Example:

slide-33
SLIDE 33

Joint and marginal probabilities

Joint: Marginal:
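The formulas on this slide did not survive extraction. As a hedged reminder, the standard HMM factorization, written with hidden states q and observations o as on the earlier slide and assuming a sequence of length T, is:

```latex
% Joint probability of observations O = (o_1, ..., o_T) and hidden states Q = (q_1, ..., q_T)
P(O, Q) = P(q_1) \prod_{i=2}^{T} P(q_i \mid q_{i-1}) \prod_{i=1}^{T} P(o_i \mid q_i)

% Marginal probability of the observations: sum the joint over all hidden state sequences
P(O) = \sum_{Q} P(O, Q)
```

The sum in the marginal runs over every possible hidden state sequence, which is exponential in T if evaluated term by term.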

slide-34
SLIDE 34

How to compute the probability of observations

slide-35
SLIDE 35

Forward algorithm

slide-36
SLIDE 36

Forward algorithm

slide-37
SLIDE 37

Forward algorithm
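The slide content for the forward algorithm was lost, so the following is a sketch of the textbook recursion rather than the author's exact formulation. It computes the probability of an observation sequence by carrying forward, for each state, the probability of the observations so far ending in that state. The function name `forward` and all parameter values are assumptions.

```python
def forward(obs, states, init, trans, emit):
    """Return P(obs) by summing over all hidden state paths."""
    # alpha[s]: probability of the observations so far, ending in state s.
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical parameters for the weather example.
STATES = ["Rainy", "Sunny"]
INIT = {"Rainy": 0.6, "Sunny": 0.4}
TRANS = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
EMIT = {"Rainy": {"Walk": 0.1, "Shop": 0.4, "Clean": 0.5},
        "Sunny": {"Walk": 0.6, "Shop": 0.3, "Clean": 0.1}}

P_OBS = forward(["Walk", "Shop", "Clean"], STATES, INIT, TRANS, EMIT)
```

The cost is proportional to the sequence length times the square of the number of states, instead of the exponential cost of enumerating every hidden path.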

slide-38
SLIDE 38

Decoding: finding the most probable states

Similar to the forward algorithm, we can define the following value:

slide-39
SLIDE 39
slide-40
SLIDE 40

Viterbi algorithm
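A sketch of the textbook Viterbi recursion, which mirrors the forward algorithm but replaces the sum over previous states with a maximum and remembers which previous state achieved it. The function name `viterbi` and the parameter values are assumptions, not taken from the slides.

```python
def viterbi(obs, states, init, trans, emit):
    """Return (most probable hidden state path, its joint probability)."""
    # delta[s]: probability of the best path so far that ends in state s.
    delta = {s: init[s] * emit[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        prev, delta, pointer = delta, {}, {}
        for s in states:
            # Best predecessor of s at this step.
            best = max(states, key=lambda r: prev[r] * trans[r][s])
            delta[s] = prev[best] * trans[best][s] * emit[s][o]
            pointer[s] = best
        backpointers.append(pointer)
    # Trace back from the best final state.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1], delta[last]


# Hypothetical parameters for the weather example.
STATES = ["Rainy", "Sunny"]
INIT = {"Rainy": 0.6, "Sunny": 0.4}
TRANS = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
EMIT = {"Rainy": {"Walk": 0.1, "Shop": 0.4, "Clean": 0.5},
        "Sunny": {"Walk": 0.6, "Shop": 0.3, "Clean": 0.1}}

BEST_PATH, BEST_PROB = viterbi(["Walk", "Shop", "Clean"], STATES, INIT, TRANS, EMIT)
```

For long sequences the products underflow, so practical implementations usually work with sums of log probabilities instead.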