Genome Reassembly From Fragments 7 January 2019 OSU CSE 1 Genome - - PowerPoint PPT Presentation



SLIDE 1

Genome Reassembly From Fragments

7 January 2019 OSU CSE 1

SLIDE 2

Genome

  • A genome is the encoding of hereditary information for an organism in its DNA

  • The mathematical model of a genome is a string of character, where each character is one of 'A', 'C', 'G', or 'T', which stand for the names of the four nucleotides that occur on a DNA backbone
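The string model above can be checked mechanically. The following is a minimal Java sketch, assuming a genome is held in a plain String; the class name GenomeModelSketch and the method isGenomeModel are hypothetical names chosen for illustration, not part of the slides.

```java
final class GenomeModelSketch {

    /** Reports whether s contains only the four nucleotide letters A, C, G, T. */
    static boolean isGenomeModel(String s) {
        for (int i = 0; i < s.length(); i++) {
            // indexOf returns -1 when the character is not one of the four letters
            if ("ACGT".indexOf(s.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isGenomeModel("TCTAAGCCTA")); // true
        System.out.println(isGenomeModel("TCTAXGCCTA")); // false
    }
}
```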

SLIDE 3

Quoted from Wikipedia:

  • An analogy to the human genome stored on DNA is that of instructions stored in a book:

– The book (genome) contains 23 chapters (chromosomes);
– Each chapter contains 48 to 250 million letters (A,C,G,T) without spaces;
– Hence, the book contains over 3.2 billion letters total;
– The book fits into a cell nucleus the size of a pinpoint;
– At least one copy of the book (all 23 chapters) is contained in most cells of our body.


This is what we care about for the next project...

SLIDE 5

Genome Sequencing

  • The Human Genome Project was designed to determine the entire sequence of human DNA and to map its mathematical model (genotype) to physical and functional manifestations in a person (phenotype)

  • Sequencing is done “piece-by-piece” because it is effectively impossible to do anything directly with 3.2 billion nucleotides

[Figure label: length = 3.2 billion]

SLIDE 6

Genome Sequencing: Step 1

  • Use enzymes that can cut up many strands of the same DNA (each a string of length about 3.2 billion letters or “bases”) into pieces at different locations, creating a “soup” of fragments each of much smaller length (on the order of 1000)
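Step 1 can be imitated in software on a short string. The following Java sketch is a simulation under stated assumptions, not a description of how the enzymes work: the names FragmentSoupSketch and makeSoup, the seed parameter, and the rule for choosing piece lengths are all hypothetical. Each copy of the string is cut at different random places, and all pieces go into one set.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

final class FragmentSoupSketch {

    /**
     * Cuts `copies` copies of `genome` into consecutive pieces and returns
     * the set of all pieces. Requires fragmentLength > 0.
     */
    static Set<String> makeSoup(String genome, int copies, int fragmentLength, long seed) {
        Random rng = new Random(seed);
        Set<String> soup = new HashSet<>();
        for (int c = 0; c < copies; c++) {
            int start = 0;
            while (start < genome.length()) {
                // Piece lengths vary randomly, so the cuts land at
                // different locations in each copy.
                int len = 1 + fragmentLength / 2 + rng.nextInt(fragmentLength);
                int end = Math.min(genome.length(), start + len);
                soup.add(genome.substring(start, end));
                start = end;
            }
        }
        return soup;
    }

    public static void main(String[] args) {
        // Prints a small soup; the exact pieces depend on the seed.
        System.out.println(makeSoup("AGTAGAACGTCTAAGCCTACGAGGTAGT", 3, 6, 42L));
    }
}
```

Because the cuts fall at different places in each copy, pieces from different copies tend to overlap, which is what reassembly relies on.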

SLIDE 7

Genome Sequencing: Step 2

  • Use machines that can physically sequence each of these fragments to determine their mathematical models

– Example: "TCTAAGCCTA..."


– Example: "AGTAGAACG..."


SLIDE 10

Genome Sequencing: Step 3

  • Use computer algorithms to reassemble the original very long string model from the models of its fragments, by combining fragments based on their overlaps


How would you do it — at all, never mind doing it efficiently?


SLIDE 12

Greedy Reassembly: Step 1

  • A naïve (but still interesting) idea is to pick two fragments with the most overlap and to combine them into a longer fragment


SLIDE 15

Finding Overlaps

  • Given two strings, what is the longest string that is a prefix of one and a suffix of the other?

  • Example of one pair of strings:

s1 = "AGTAGAACG"
s2 = "CGAGGTAGT"


The longest string that is a prefix of one and a suffix of the other is "AGT".
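One way to compute this answer is to try candidate lengths from longest to shortest, in both directions. The following is a minimal Java sketch, not the course's reference solution; the names OverlapSketch, overlap, and longestOverlap are hypothetical. It uses String.regionMatches to compare a suffix of one string with a prefix of the other.

```java
final class OverlapSketch {

    /** Length of the longest suffix of left that is also a prefix of right. */
    static int overlap(String left, String right) {
        for (int n = Math.min(left.length(), right.length()); n > 0; n--) {
            // Compare the last n characters of left with the first n of right.
            if (left.regionMatches(left.length() - n, right, 0, n)) {
                return n;
            }
        }
        return 0; // no non-empty overlap in this direction
    }

    /** Longest string that is a prefix of one argument and a suffix of the other. */
    static String longestOverlap(String s1, String s2) {
        int a = overlap(s1, s2); // suffix of s1, prefix of s2
        int b = overlap(s2, s1); // suffix of s2, prefix of s1
        return a >= b ? s2.substring(0, a) : s1.substring(0, b);
    }

    public static void main(String[] args) {
        System.out.println(longestOverlap("AGTAGAACG", "CGAGGTAGT")); // AGT
    }
}
```

For the pair above, the suffix of s1 and prefix of s2 share "CG" (length 2), while the suffix of s2 and prefix of s1 share "AGT" (length 3), so the second direction wins.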

SLIDE 19

Combine

  • If these two strings have the most overlap of any pair in the “soup”, then we remove these two strings from the “soup”:

"AGTAGAACG"
"CGAGGTAGT"

and replace them by this one:

"CGAGGTAGTAGAACG"


The idea is that both the shorter strings could have been fragments of this longer string.


Notice that the math model of the “soup” is a finite set of string of character, so in a Java program it can be of type Set<String>.
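The combine step on a Set<String> can be sketched as below. This is a minimal sketch that assumes java.util.Set and java.util.HashSet (the course may use a different Set type); the names CombineSketch, combination, and mergeInSoup are hypothetical, and the overlap length is passed in rather than computed.

```java
import java.util.HashSet;
import java.util.Set;

final class CombineSketch {

    /** Joins left and right, writing their n overlapping characters only once. */
    static String combination(String left, String right, int n) {
        return left + right.substring(n);
    }

    /** Removes left and right from the soup and adds their combination. */
    static void mergeInSoup(Set<String> soup, String left, String right, int n) {
        soup.remove(left);
        soup.remove(right);
        soup.add(combination(left, right, n));
    }

    public static void main(String[] args) {
        Set<String> soup = new HashSet<>();
        soup.add("AGTAGAACG");
        soup.add("CGAGGTAGT");
        // "AGT" (length 3) is both a suffix of the left string
        // and a prefix of the right string.
        mergeInSoup(soup, "CGAGGTAGT", "AGTAGAACG", 3);
        System.out.println(soup); // [CGAGGTAGTAGAACG]
    }
}
```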


SLIDE 23

Greedy Reassembly: Step 2

  • Continue the process until there is only one fragment in the “soup” (declare success), or until no two fragments overlap at all (too bad)
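Putting the two steps together, the whole greedy process might look like the Java sketch below. It is an illustration under assumptions, not the project solution: the names GreedyAssemblySketch, overlap, and assemble are hypothetical, java.util.Set stands in for whatever Set type the course uses, and the sketch assumes no fragment is a substring of another. It checks every ordered pair on every pass, so it makes no attempt at efficiency.

```java
import java.util.HashSet;
import java.util.Set;

final class GreedyAssemblySketch {

    /** Length of the longest suffix of left that is also a prefix of right. */
    static int overlap(String left, String right) {
        for (int n = Math.min(left.length(), right.length()); n > 0; n--) {
            if (left.regionMatches(left.length() - n, right, 0, n)) {
                return n;
            }
        }
        return 0;
    }

    /**
     * Repeatedly replaces the pair with the most overlap by its combination.
     * Stops when one fragment remains or when no two fragments overlap.
     */
    static void assemble(Set<String> soup) {
        while (soup.size() > 1) {
            String bestLeft = null;
            String bestRight = null;
            int best = 0;
            for (String a : soup) {
                for (String b : soup) {
                    if (!a.equals(b)) {
                        int n = overlap(a, b);
                        if (n > best) {
                            best = n;
                            bestLeft = a;
                            bestRight = b;
                        }
                    }
                }
            }
            if (best == 0) {
                return; // no two fragments overlap at all (too bad)
            }
            soup.remove(bestLeft);
            soup.remove(bestRight);
            soup.add(bestLeft + bestRight.substring(best));
        }
    }

    public static void main(String[] args) {
        Set<String> soup = new HashSet<>();
        soup.add("GETTYS");
        soup.add("TYSBUR");
        soup.add("SBURG");
        assemble(soup);
        System.out.println(soup); // [GETTYSBURG]
    }
}
```

In the usage example, "TYSBUR" and "SBURG" overlap by 4 and are merged first into "TYSBURG"; that result then overlaps "GETTYS" by 3, leaving the single fragment "GETTYSBURG".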

SLIDE 24

Success?

  • Even if there is only one fragment left, it might not be the original long string that was chopped up, but it’s a good guess!

– And after all, we are just guessing; critical information is lost when the long strand is chopped up into fragments, but we can reassemble it from fragments with high probability if enough copies of the original string are chopped up into fragments

SLIDE 25

Project

  • The project is to do greedy reassembly, not for a genome of length 3.2 billion, but rather for a reasonably short piece of text (e.g., the Gettysburg Address), many copies of which have been chopped up into random fragments for you to reassemble

SLIDE 26

Resources

  • Wikipedia: Genome
– http://en.wikipedia.org/wiki/Genome

  • Wikipedia: Human Genome Project
– http://en.wikipedia.org/wiki/Human_Genome_Project

  • Wikipedia: Whole Genome Sequencing
– http://en.wikipedia.org/wiki/Genome_sequencing

  • Wikipedia: Sequence Assembly
– http://en.wikipedia.org/wiki/Sequence_assembly