SLIDE 1
Genomics Sequencing tech Sequencing tech: next generation What do we get from sequencing? How to analyze these reads? Mutation identification: Mapping Cancer Heart Disease Brain Disease Genome projects: Assembly Use sequencing for other
SLIDE 2
SLIDE 3
Sequencing tech: next generation
SLIDE 4
SLIDE 5
What do we get from sequencing?
SLIDE 6
How to analyze these reads?
SLIDE 7
Cancer Heart Disease Brain Disease
Mutation identification: Mapping
SLIDE 8
Genome projects: Assembly
SLIDE 9
Use sequencing for other types of data
X-seq technology
SLIDE 10
RNA-seq
SLIDE 11
Assembly
SLIDE 12
Assembly
Computational Challenge: assemble individual short fragments (reads) into a single genomic sequence (“superstring”)
SLIDE 13
Problem: Given a set of strings, find a shortest string that contains all of them
Input: Strings s1, s2, …, sn
Output: A string s that contains all strings s1, s2, …, sn as substrings, such that the length of s is minimized
Shortest common superstring
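To make the problem statement concrete, here is a minimal brute-force sketch in Python. It tries every ordering of the input strings and merges neighbours on their longest suffix/prefix overlap. The function names (overlap, shortest_common_superstring) are illustrative and not from the slides, and the approach is only usable for a handful of strings because the number of orderings grows factorially.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def shortest_common_superstring(strings):
    """Exhaustive search: only feasible for very small inputs."""
    uniq = sorted(set(strings))
    # Strings already contained in another input string add nothing.
    strs = [s for s in uniq if not any(s != t and s in t for t in uniq)]
    if not strs:
        return ""
    best = None
    for order in permutations(strs):
        merged = order[0]
        for nxt in order[1:]:
            # Append only the part of nxt that is not already overlapped.
            merged += nxt[overlap(merged, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


if __name__ == "__main__":
    print(shortest_common_superstring(["ABC", "BCD", "CDE"]))
```

Real reads number in the millions, so this exhaustive search only illustrates the definition; it is not an assembly method.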
SLIDE 14
Shortest common superstring
SLIDE 15
Any ideas?
SLIDE 16
Directed Graph
SLIDE 17
Overlap Graph
SLIDE 18
Example
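A small Python sketch of an overlap graph, assuming the usual construction: one node per read, and a directed edge from read a to read b whenever a suffix of a equals a prefix of b. The three reads and the min_len threshold are hypothetical choices for illustration, not taken from the slides.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b, min_len=1):
    """Longest suffix of a matching a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if n > 0 and a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads, min_len=1):
    """Return {(a, b): overlap_length} for every ordered pair of distinct reads."""
    edges = {}
    for a, b in permutations(sorted(set(reads)), 2):
        n = suffix_prefix_overlap(a, b, min_len)
        if n:
            edges[(a, b)] = n
    return edges


if __name__ == "__main__":
    for (a, b), n in build_overlap_graph(["ATG", "TGC", "GCA"]).items():
        print(f"{a} -> {b} (overlap {n})")
```

Raising min_len removes weak overlaps and leaves a sparser graph. A superstring then corresponds to a path that visits every node (read) once.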
SLIDE 19
Shortest common superstring problem is hard
SLIDE 20
Shortest common superstring problem is hard
SLIDE 21
Is there a better or more feasible way?
SLIDE 22
Matching a superstring to a set of short reads
Assume we have a set S of reads of length k (k-mers).
Goal: Find a string that can be split exactly into the set S.
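A minimal Python sketch of what "split exactly" can mean, assuming that S is compared as a multiset (repeated k-mers count) and that the split uses every overlapping window of length k. The example string and the helper names are hypothetical.

```python
def kmer_composition(text, k):
    """All overlapping substrings of length k, in order of position."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def can_be_split_into(text, reads):
    """True if the k-mers of text are exactly the given reads (as a multiset)."""
    if not reads:
        return False
    k = len(reads[0])
    return sorted(kmer_composition(text, k)) == sorted(reads)


if __name__ == "__main__":
    reads = kmer_composition("ATGGCGTGCA", 3)
    print(reads)
    print(can_be_split_into("ATGGCGTGCA", reads))
```

Checking a candidate string is easy; the assembly problem is the reverse direction, recovering the string from S alone.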
SLIDE 23
Overlap graph approach
Assume we have a set S of reads of length k (k-mers).
Goal: Find a string that can be split exactly into the set S.
SLIDE 24
Overlap graph approach is hard
Assume we have a set S of reads of length k (k-mers).
Goal: Find a string that can be split exactly into the set S.
SLIDE 25
There is an alternative way
SLIDE 26
De Bruijn Graph
SLIDE 27
De Bruijn Graph
SLIDE 28
What is the goal now?
SLIDE 29
Nodes: AT, GT, CG, CA, GC, TG, GG
The path visits every EDGE once
Overlap graph vs De Bruijn graph
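A minimal Python sketch of the De Bruijn construction, assuming the standard definition: each k-mer becomes an edge from its (k-1)-length prefix to its (k-1)-length suffix. The 3-mer list below is a hypothetical input, chosen because its 2-mer nodes match the node labels on this slide; the slide's actual reads are not given.

```python
from collections import defaultdict


def de_bruijn_graph(kmers):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it leads to.

    Repeated k-mers produce repeated entries, i.e. multi-edges.
    """
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def nodes(graph):
    """All vertices, including those that only have incoming edges."""
    found = set(graph)
    for targets in graph.values():
        found.update(targets)
    return found


if __name__ == "__main__":
    kmers = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
    g = de_bruijn_graph(kmers)
    for src in sorted(g):
        print(src, "->", ", ".join(g[src]))
    print(sorted(nodes(g)))
```

In the overlap graph the reads are nodes and the target is a path through every node. Here the reads are edges, so the target becomes a path through every edge.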
SLIDE 30
MultiEdge
SLIDE 31
MultiGraph
SLIDE 32
Some definitions
SLIDE 33
Eulerian walk/path
zero or
SLIDE 34
Eulerian walk/path
SLIDE 35
Proof? Algorithm?
SLIDE 36
- a. Start with an arbitrary vertex v and form an arbitrary cycle with unused edges until a dead end is reached. Since the graph is Eulerian, this dead end is necessarily the starting point, i.e., vertex v.
Assume all nodes are balanced
SLIDE 37
- b. If the cycle from (a) is not an Eulerian cycle, it must contain a vertex w that has untraversed edges. Perform step (a) again, using vertex w as the starting point. Once again, we will end up in the starting vertex w.
SLIDE 38
- c. Combine the cycles from (a) and (b) into a single cycle and iterate step (b).
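Steps (a) to (c) can be sketched in Python as follows. This is a minimal illustration, assuming a directed graph stored as a dict from each vertex to a list of successors (repeated entries stand for multi-edges) in which every vertex is balanced and all edges are reachable from the start vertex. The function name and the example graph are hypothetical.

```python
def eulerian_cycle(graph, start):
    """Return a list of vertices tracing every edge exactly once, start to start."""
    # Work on a copy so the caller's graph is left untouched.
    unused = {u: list(vs) for u, vs in graph.items()}

    def walk(v):
        # Step (a): follow unused edges until stuck. With all vertices
        # balanced, the walk can only get stuck back at v.
        cycle = [v]
        cur = v
        while unused.get(cur):
            cur = unused[cur].pop()
            cycle.append(cur)
        return cycle

    cycle = walk(start)
    i = 0
    while i < len(cycle):
        w = cycle[i]
        if unused.get(w):
            # Step (b): w still has untraversed edges, so walk a new cycle from w.
            sub = walk(w)
            # Step (c): splice the new cycle in place of this occurrence of w.
            cycle[i:i + 1] = sub
        else:
            i += 1
    return cycle


if __name__ == "__main__":
    g = {"A": ["B"], "B": ["C", "A"], "C": ["B"]}
    print(" -> ".join(eulerian_cycle(g, "A")))
```

The list splice keeps the code close to the three steps. A stack-based variant avoids the repeated list copying and runs in time linear in the number of edges.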
SLIDE 39
- A vertex v is semibalanced if |in-degree(v) - out-degree(v)| = 1
- If a graph has an Eulerian path starting from s and ending at t, then all its vertices are balanced, with the possible exception of s and t
- Add an edge between two semibalanced vertices:
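A minimal Python sketch of the balance test and of the extra edge, assuming the same dict-of-successor-lists representation of a directed multigraph. It assumes the added edge runs from the path's end t (one more incoming than outgoing edge) back to its start s (one more outgoing than incoming), which makes every vertex balanced. The function names and the example graph are hypothetical, and connectivity is not checked.

```python
from collections import defaultdict


def degree_balance(graph):
    """Return {v: out-degree(v) - in-degree(v)} for every vertex."""
    balance = defaultdict(int)
    for u, targets in graph.items():
        balance[u] += len(targets)
        for v in targets:
            balance[v] -= 1
    return dict(balance)


def balancing_edge(graph):
    """Return the edge (t, s) that balances the graph, or None if already balanced.

    Raises ValueError when the degrees rule out an Eulerian path.
    """
    bal = degree_balance(graph)
    starts = [v for v, d in bal.items() if d == 1]    # semibalanced, extra out-edge
    ends = [v for v, d in bal.items() if d == -1]     # semibalanced, extra in-edge
    if any(abs(d) > 1 for d in bal.values()):
        raise ValueError("a vertex is neither balanced nor semibalanced")
    if not starts and not ends:
        return None
    if len(starts) != 1 or len(ends) != 1:
        raise ValueError("expected exactly one start and one end vertex")
    return ends[0], starts[0]


if __name__ == "__main__":
    g = {"AT": ["TG"], "TG": ["GG", "GC"], "GG": ["GC"],
         "GC": ["CG", "CA"], "CG": ["GT"], "GT": ["TG"]}
    print(balancing_edge(g))
```

Once the edge (t, s) is added, an Eulerian cycle of the augmented graph can be cut at that edge to give an Eulerian path from s to t.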