CS Research for The Tree of Life Tandy Warnow The Tree of Life - - PowerPoint PPT Presentation



slide-1
SLIDE 1

CS Research for The Tree of Life

Tandy Warnow

slide-2
SLIDE 2


“The Tree of Life”

Fundamental science: Molecular biology, Genetics, Ecology, Behavior, etc. Applications: Drug design, Forensics, Human migrations, etc.

slide-3
SLIDE 3
slide-4
SLIDE 4
slide-5
SLIDE 5

Estimating evolutionary trees

slide-6
SLIDE 6

Easy cases: use morphology

slide-7
SLIDE 7

DNA Sequence Evolution

Figure: a tree of DNA sequences evolving over time, with time points labeled -3 mil yrs, -2 mil yrs, -1 mil yrs, and today.

Sequences shown: AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, TAGCCCA, TAGACTT, AGCGCTT, AGCACAA

slide-8
SLIDE 8

TAGCCCA TAGACTT TGCACAA TGCGCTT AGGGCAT

U V W X Y

slide-9
SLIDE 9

Harder problems!

slide-10
SLIDE 10

Harder problems need DNA!

slide-11
SLIDE 11
slide-12
SLIDE 12


Many, Many Trees

# of Species    # of Unrooted Trees
4               3
5               15
6               105
7               945
8               10,395
9               135,135
10              2,027,025
20              2.2 x 10^20
100             4.5 x 10^190
1000            2.7 x 10^2900

slide-13
SLIDE 13

8+ million species NP-hard problems

slide-14
SLIDE 14

Today (this lecture)

  • What is a computational problem?
  • What is an algorithm?
  • How to design and analyze algorithms
  • What NP-hardness means (and what to do about it)
  • My research (phylogeny estimation)
slide-15
SLIDE 15

Some computational problems

1. Given a list of numbers, put it into sorted order.
2. Given a map and a collection of cities, find the shortest tour that visits every city.
3. Given a collection of people, find the largest subset of them that all know each other.
4. Given a collection of people, find the smallest number of groups so that no two people in the same group know each other.

slide-16
SLIDE 16

Some computational problems

1. Given a list of numbers, put it into sorted order.
2. Given a map and a collection of cities, find the shortest tour that visits every city.
3. Given a collection of people, find the largest subset of them that all know each other.
4. Given a collection of people, find the smallest number of groups so that no two people in the same group know each other.

Which ones can be solved in polynomial time?

slide-17
SLIDE 17

Sorting

  • Given a list of n numbers, put it into sorted order.
  • Algorithm: find the smallest number and put it at the front of the list. Repeat the process on the remaining n-1 numbers.
  • Running time: O(n^2) (polynomial time)
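
The sorting procedure described on this slide is usually called selection sort. A minimal Python sketch follows; the function name `selection_sort` is an illustrative choice, not something taken from the slides.

```python
def selection_sort(numbers):
    """Return a sorted copy: repeatedly move the smallest remaining item to the front."""
    items = list(numbers)  # work on a copy so the caller's list is untouched
    n = len(items)
    for front in range(n):
        # Find the position of the smallest number in items[front:].
        smallest = front
        for i in range(front + 1, n):
            if items[i] < items[smallest]:
                smallest = i
        # Put it at the front of the unsorted part, then repeat on the rest.
        items[front], items[smallest] = items[smallest], items[front]
    return items


print(selection_sort([5, 2, 9, 1, 5, 6]))
```

The two nested loops perform roughly n(n-1)/2 comparisons, which is where the quadratic running time comes from.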
slide-18
SLIDE 18

Some computational problems

1. Given a list of numbers, put it into sorted order.
2. Given a map and a collection of cities, find the shortest tour that visits every city.
3. Given a collection of people, find the largest subset of them that all know each other.
4. Given a collection of people, find the smallest number of groups so that no two people in the same group know each other.

Which ones can be solved in polynomial time?

slide-19
SLIDE 19

Some computational problems

1. Given a list of numbers, put it into sorted order.
2. Given a map and a collection of cities, find the shortest tour that visits every city.
3. Given a collection of people, find the largest subset of them that all know each other.
4. Given a collection of people, find the smallest number of groups so that no two people in the same group know each other.

Which ones can be solved in polynomial time?

slide-20
SLIDE 20

Is this problem polynomial?

Problem: Given a collection of people, determine if they can be put into 2 groups so that no two people in the same group know each other.

Graph-theoretic representation: Create a graph with vertices for the people, and edges between vertices if the two people know each other!

Mary Sue Tom Henry Carol

slide-21
SLIDE 21

2-coloring

  • 2-colorability: Given graph G = (V,E), determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color.
  • Greedy Algorithm. Start with one vertex and make it red, and then make all its neighbors blue, and keep going. If you succeed in coloring the graph without making two nodes of the same color adjacent, the graph can be 2-colored.
  • Running time: O(n+m) time, where n is the number of vertices and m is the number of edges.
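
The greedy procedure above can be sketched as a breadth-first traversal that alternates colors. This is one possible implementation under stated assumptions, not necessarily the one used in the lecture: the function name `two_color` and the example edges are hypothetical.

```python
from collections import deque


def two_color(vertices, edges):
    """Return a red/blue coloring as a dict, or None if the graph is not 2-colorable."""
    adjacency = {v: [] for v in vertices}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    color = {}
    for start in vertices:  # restart in every connected component
        if start in color:
            continue
        color[start] = "red"
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adjacency[u]:
                if w not in color:
                    # Neighbors always get the opposite color.
                    color[w] = "blue" if color[u] == "red" else "red"
                    queue.append(w)
                elif color[w] == color[u]:
                    return None  # an edge joins two same-colored vertices
    return color


# Hypothetical acquaintance graphs (the edges are made up for illustration).
print(two_color(["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]))
print(two_color(["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")]))
```

Each vertex enters the queue once and each edge is examined twice, which matches the O(n+m) bound. A triangle, like the second example, cannot be 2-colored.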

slide-22
SLIDE 22

2-coloring

  • 2-colorability: Given graph G = (V,E), determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color.
  • Greedy Algorithm. Start with one vertex and make it red, and then make all its neighbors blue, and keep going. If you succeed in coloring the graph without making two nodes of the same color adjacent, the graph can be 2-colored.
  • Running time: O(n+m) time, where n is the number of vertices and m is the number of edges.

slide-23
SLIDE 23

2-coloring

  • 2-colorability: Given graph G = (V,E), determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color.
  • Greedy Algorithm. Start with one vertex and make it red, and then make all its neighbors blue, and keep going. If you succeed in coloring the graph without making two nodes of the same color adjacent, the graph can be 2-colored.
  • Running time: O(n^2) time, where n is the number of vertices.

slide-24
SLIDE 24

Can we divide this set into two groups so that no two people in the same group know each other? Or: can we 2-color the graph?

Mary Sue Tom Henry Carol

slide-25
SLIDE 25

Can we divide this set into two groups so that no two people in the same group know each other? Or: can we 2-color the graph?

Mary Sue Tom Henry Carol

slide-26
SLIDE 26

Can we divide this set into two groups so that no two people in the same group know each other? Or: can we 2-color the graph?

Mary Sue Tom Henry Carol

slide-27
SLIDE 27

Can we divide this set into two groups so that no two people in the same group know each other? Or: can we 2-color the graph?

Mary Sue Tom Henry Carol

No! We cannot!

slide-28
SLIDE 28

What about this?

  • 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.

slide-29
SLIDE 29

What about this?

  • 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.

A brute-force solution seems to require O(3^n) time, where n is the number of vertices.
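
To make the brute-force idea concrete, here is a minimal Python sketch that tries every assignment of three colors to the vertices. The function name `three_color_brute_force` and the example graphs are illustrative assumptions.

```python
from itertools import product


def three_color_brute_force(vertices, edges):
    """Try all 3^n color assignments; return a valid one as a dict, or None."""
    colors = ("red", "blue", "green")
    for assignment in product(colors, repeat=len(vertices)):
        coloring = dict(zip(vertices, assignment))
        # A coloring is valid when no edge joins two same-colored vertices.
        if all(coloring[u] != coloring[v] for u, v in edges):
            return coloring
    return None


# A triangle is 3-colorable; the complete graph on four vertices is not.
triangle = [("a", "b"), ("b", "c"), ("c", "a")]
k4 = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]
print(three_color_brute_force(["a", "b", "c"], triangle) is not None)
print(three_color_brute_force(["a", "b", "c", "d"], k4) is None)
```

The loop can visit up to 3^n assignments, so adding one vertex can triple the work. This is the exponential growth the slide refers to.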

slide-30
SLIDE 30
  • Some decision problems can be solved in polynomial time:

– Can graph G be 2-colored?

  • Some decision problems seem not to be solvable in polynomial time:

– Can graph G be 3-colored?
– Does graph G have a Hamiltonian cycle (a cycle that visits every vertex exactly once)?

slide-31
SLIDE 31

In fact, some problems are “NP-hard”

  • 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.
  • 3-colorability is provably NP-hard.

What does this mean?

slide-32
SLIDE 32

Most computer scientists are willing to bet that no NP-hard problem can be solved in polynomial time. Therefore, the options are:

– Solve the problem exactly (but use lots of time on some inputs)
– Use heuristics which may not solve the problem correctly (and which might be computationally expensive, anyway)

slide-33
SLIDE 33

Computational problems in Biology are almost always NP-hard! In particular, inferring evolutionary trees generally involves trying to solve NP-hard problems.

slide-34
SLIDE 34

My research

Methods that produce accurate phylogenetic trees on hard-to-analyze datasets (thousands of sequences) within reasonable times

Problem: all the “good” methods require finding “good” solutions to NP-hard optimization problems!

slide-35
SLIDE 35

Maximum Parsimony

  • Given a set of DNA sequences
  • Find a tree for the sequences with the minimum total number of changes

slide-36
SLIDE 36

Maximum parsimony (example)

  • Input: Four sequences

– ACT
– ACA
– GTT
– GTA

  • Question: which of the three trees has the best MP score?

slide-37
SLIDE 37

Maximum Parsimony

ACT GTT ACA GTA
ACA ACT GTA GTT
ACT ACA GTT GTA

slide-38
SLIDE 38

Maximum Parsimony

ACT GTT GTT GTA ACA GTA 1 2 2 MP score = 5
ACA ACT GTA GTT ACA ACT 3 1 3 MP score = 7
ACT ACA GTT GTA ACA GTA 1 2 1 MP score = 4 (Optimal MP tree)

slide-39
SLIDE 39

Maximum Parsimony

ACT ACA GTT GTA ACA GTA 1 2 1 MP score = 4

Finding the optimal MP tree is NP-hard.

Optimal labeling can be computed in polynomial time using Dynamic Programming.
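
One well-known dynamic program for labeling a fixed tree is Fitch's algorithm, sketched below in Python. The slides do not name the method, so treat this as an assumption. The nested-tuple tree encoding and the function names are also illustrative, and each quartet is rooted on its middle edge, which does not change the score.

```python
def fitch_site(tree, site):
    """Return (candidate states, number of changes) for one alignment column."""
    if isinstance(tree, str):  # a leaf is just its sequence
        return {tree[site]}, 0
    left, right = tree
    left_states, left_cost = fitch_site(left, site)
    right_states, right_cost = fitch_site(right, site)
    common = left_states & right_states
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # Otherwise keep both options and pay for one change.
    return left_states | right_states, left_cost + right_cost + 1


def first_leaf(tree):
    """Follow left children until a leaf (a string) is reached."""
    while not isinstance(tree, str):
        tree = tree[0]
    return tree


def parsimony_score(tree):
    """Minimum total number of changes over all labelings of the internal nodes."""
    length = len(first_leaf(tree))
    return sum(fitch_site(tree, site)[1] for site in range(length))


# The three possible trees on the four example sequences.
print(parsimony_score((("ACT", "GTT"), ("ACA", "GTA"))))
print(parsimony_score((("ACA", "GTT"), ("ACT", "GTA"))))
print(parsimony_score((("ACT", "ACA"), ("GTT", "GTA"))))
```

The work is proportional to the number of nodes times the sequence length, so labeling one tree is fast; the hard part is searching over trees. For the four example sequences the sketch gives 5, 6, and 4. Fitch's procedure minimizes over all internal labelings, so its score for a topology can be lower than the cost of one particular labeling drawn for it. That happens here: the labeling shown for the second tree costs 7, while the minimum for that topology is 6.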

slide-40
SLIDE 40

Solving NP-hard problems exactly is … unlikely

  • The number of (unrooted) binary trees on n leaves is (2n-5)!!

#leaves    #trees
4          3
5          15
6          105
7          945
8          10395
9          135135
10         2027025
20         2.2 x 10^20
100        4.5 x 10^190
1000       2.7 x 10^2900
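
The double factorial (2n-5)!! is the product of the odd numbers 1, 3, 5, ..., 2n-5. A short Python sketch can evaluate it exactly, since Python integers have arbitrary precision. The function name `count_unrooted_binary_trees` is an illustrative choice.

```python
def count_unrooted_binary_trees(n):
    """Return (2n-5)!!, the number of unrooted binary trees on n labeled leaves."""
    if n < 3:
        raise ValueError("the formula applies to n >= 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (the factor 1 changes nothing).
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


for leaves in (4, 5, 6, 7, 8, 9, 10, 20):
    print(leaves, count_unrooted_binary_trees(leaves))
```

Evaluating the formula is cheap. The difficulty lies in the size of the result: an exhaustive search would have to score that many trees.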

slide-41
SLIDE 41

Problems with techniques for MP and ML

Shown here is the performance of a TNT heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%.

Performance of TNT with time

slide-42
SLIDE 42

Research: we try to develop better heuristics

Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset

Plot: average MP score above optimal, shown as a percentage of the optimal (y-axis, 0.02 to 0.2), against hours (x-axis, 4 to 24).

Series: current best techniques; DCM boosted version of best techniques

slide-43
SLIDE 43

Other problems I study

  • Multiple sequence alignment
  • Detecting Horizontal Gene Transfers (and hybrid species)
  • Whole genome evolution
  • Evolution of languages and human origins

And more!

slide-44
SLIDE 44

Possible Indo-European tree

(Ringe, Warnow and Taylor 2000)

slide-45
SLIDE 45

Possible IE Phylogenetic Network

(Nakhleh et al. 2005)

slide-46
SLIDE 46

Computational biology research is fun, multi-disciplinary, and collaborative!

  • Software development
  • Mathematics
  • Probability and Statistics
  • Biology
  • Chemistry
  • Linguistics

Plus, you will get to travel to far away lands

slide-47
SLIDE 47

My research group

  • Tandy Warnow (UT-Austin)
  • Randy Linder (UT-Austin)
  • UT PhD Students: Serita Nelesen, Kevin Liu, Sindhu Raghavan, Shel Swenson

  • Collaborators at many other universities around the world