Algorithms for the validation and correction of gene relations



slide-1
SLIDE 1

Manuel Lafond, Université de Montréal

Algorithms for the validation and correction of gene relations

slide-2
SLIDE 2

Introduction

  • Gene trees, species trees
  • Duplication, speciation
  • Orthologs, paralogs, and why?

Validation of relations

  • Cograph (P4-free) characterization of valid relations
  • Relations consistent with a species tree

Relation correction

Open theoretical and practical problems

slide-3
SLIDE 3

Take some gene, say my favorite, RPGR: Retinitis pigmentosa GTPase regulator.
Participates in eye coloring.
What is the history of RPGR?
Almost all vertebrates have a copy of this gene. Some have more than one. Some don’t have it.
What happened exactly?
A gene can be:

  • Transmitted to descending species by speciation
  • Duplicated
  • Lost
slide-4
SLIDE 4

RPGR RPGR1 RPGR2 Gibbon Orangutan Orangutan Human Mouse Rat Rat Duplication Speciation

Here’s what happened: History = gene tree labeled with duplications and speciations

slide-5
SLIDE 5

Super-mammal Super-primate Super-rodent Mouse Rat Human Orangutan Gibbon Humanutan

slide-6
SLIDE 6

Super-mammal Super-primate Super-rodent Mouse Rat Human Orangutan Gibbon Humanutan

slide-7
SLIDE 7

RPGR Super-mammal Super-primate Super-rodent Mouse Rat Human Orangutan Gibbon Humanutan

slide-8
SLIDE 8

RPGR RPGR1 RPGR2 Super-mammal Super-primate Super-rodent Mouse Rat Human Orangutan Gibbon Humanutan

slide-9
SLIDE 9

RPGR RPGR1 RPGR2 Super-mammal Super-primate Super-rodent Mouse Rat Human Orangutan Gibbon Humanutan

slide-10
SLIDE 10

RPGR RPGR1 RPGR2 Super-mammal Super-primate Super-rodent Mouse Rat Human Orangutan Gibbon Humanutan

slide-11
SLIDE 11

RPGR RPGR1 RPGR2 Super-mammal Super-primate Super-rodent Mouse Rat Human Orangutan Gibbon Humanutan

slide-12
SLIDE 12

RPGR RPGR1 RPGR2 Super-mammal Super-primate Super-rodent Mouse Rat Human Orangutan Gibbon Humanutan

slide-13
SLIDE 13

RPGR RPGR1 RPGR2 Super-mammal Super-primate Super-rodent Mouse Rat Human Orangutan Gibbon Humanutan

slide-14
SLIDE 14

RPGR RPGR1 RPGR2 Super-mammal Super-primate Super-rodent Mouse Rat Human Orangutan Gibbon Humanutan

slide-15
SLIDE 15

RPGR RPGR1 RPGR2 Super-mammal Super-primate Super-rodent Mouse Rat Human Orangutan Gibbon Humanutan

slide-16
SLIDE 16

RPGR RPGR1 RPGR2 G2 O1 O2 H2 M1 R1 R1’ Duplication Speciation

slide-17
SLIDE 17

RPGR RPGR1 RPGR2 G2 O1 O2 H2 R1 R1’ Duplication Speciation M1

slide-18
SLIDE 18

RPGR RPGR1 RPGR2 G2 O1 O2 H2 R1 R1’ Duplication Speciation M1

slide-19
SLIDE 19

RPGR RPGR1 RPGR2 G2 O1 O2 H2 R1 R1’ Duplication Speciation M1

slide-20
SLIDE 20

RPGR RPGR1 RPGR2 G2 O1 O2 H2 R1 R1’ Duplication Speciation M1

slide-21
SLIDE 21

Orthologs and paralogs

Two genes are:

  • Orthologs if their lowest common ancestor underwent speciation
  • Paralogs if their lowest common ancestor underwent duplication
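The definition above can be sketched in a few lines of Python. The tree encoding and the toy tree below are assumptions made for illustration (the exact tree drawn on the slides is not recoverable from the text); the toy tree is only chosen to agree with the two statements on the following slides, namely that O1 and M1 are orthologs and that O1 and G2 are paralogs.

```python
# Assumed encoding (not from the slides): a leaf is a gene name (str);
# an internal node is a tuple (event, left, right), event in {"dup", "spec"}.
TOY_TREE = (
    "spec",
    ("dup", "O1", ("spec", "G2", ("spec", "O2", "H2"))),
    ("spec", "M1", ("dup", "R1", "R1'")),
)


def leaves(node):
    """Return the set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def relation(node, g1, g2):
    """Classify two distinct genes by the label of their lowest common ancestor."""
    event, left, right = node
    for child in (left, right):
        # If one child subtree holds both genes, the lca lies deeper.
        if not isinstance(child, str) and {g1, g2} <= leaves(child):
            return relation(child, g1, g2)
    # Otherwise this node separates g1 from g2: it is their lca.
    return "orthologs" if event == "spec" else "paralogs"


print(relation(TOY_TREE, "O1", "M1"))  # orthologs
print(relation(TOY_TREE, "O1", "G2"))  # paralogs
```

The function assumes both genes occur in the tree and are distinct; it recomputes leaf sets at every level, which is fine for a small example but not for large trees.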

slide-22
SLIDE 22

RPGR1 RPGR2 G2 O1 O2 H2 M1 R1 R1’ Duplication Speciation

slide-23
SLIDE 23

RPGR1 RPGR2 G2 O1 O2 H2 M1 R1 R1’ Duplication Speciation

O1 and M1 are orthologs (lca is a speciation)

slide-24
SLIDE 24

RPGR1 RPGR2 G2 O1 O2 H2 M1 R1 R1’ Duplication Speciation

O1 and G2 are paralogs (lca is a duplication)

slide-25
SLIDE 25

Why bother?

Orthology/paralogy relations are related to gene functionality.
Some gene functional annotation databases assume that orthologs share the same functionality (e.g. COG, eggNOG databases).

slide-26
SLIDE 26

Why bother?

Orthologs conjecture: orthologous genes tend to be similar in sequence and function, whereas paralogous genes tend to differ.

  • Any hope of proving or disproving this conjecture first requires computational tools that can accurately infer gene relations.

slide-27
SLIDE 27

Why bother?

Orthologs conjecture: orthologous genes tend to be similar in sequence and function, whereas paralogous genes tend to differ.

  • Any hope of proving or disproving this conjecture first requires computational tools that can accurately infer gene relations.

Quest For Orthologs consortium: "a joint effort to benchmark, improve and standardize orthology predictions through collaboration, the use of shared reference datasets, and evaluation of emerging new methods".
slide-28
SLIDE 28

Traditional inference method

Clustering genes into groups of orthologs:

  • If g1 and g2 are "similar enough" in terms of sequence, we say that g1 and g2 are putative orthologs.
  • Make a graph G of putative orthologs.
  • Partition G into clusters, i.e. highly connected components. Otherwise, too many false positives occur.
  • OrthoMCL, InParanoid, proteinortho, …
slide-29
SLIDE 29

Traditional inference method

These methods are very often inaccurate: they produce false positives or false negatives.

In (Lafond & El-Mabrouk, 2014), we found that >70% of inferred sets of relations were unsatisfiable, i.e. they corresponded to no possible gene tree.

slide-30
SLIDE 30

What we want to do

Given a set of orthologs / paralogs:

  • Verify that they "make sense"

Satisfiable: can some gene tree display the relations?
Consistent: does it agree with our species tree?

  • If they don't make sense, correct them in a minimal way

Everything is NP-Complete.
Approximation algorithms.

slide-31
SLIDE 31

Validation of gene relations

slide-32
SLIDE 32

Orthology/paralogy graph

Orthologs = (a, b), (a, c), (c, d)
Paralogs = (a, d), (b, c), (b, d)

Orthologs Paralogs a b c d

slide-33
SLIDE 33

G2 O1 O2 H2 S1 R1 R1’ O1 S1 R1 R1’ G2 O2 H2

R

slide-34
SLIDE 34

G2 O1 O2 H2 S1 R1 R1’ O1 S1 R1 R1’ G2 O2 H2

??? R

slide-35
SLIDE 35

O1 S1 R1 R1’ G2 O2 H2

??? R

slide-36
SLIDE 36

Problem: Given a relation graph R, is R satisfiable? Does there exist a gene tree G that displays the relations of R?

O1 S1 R1 R1’ G2 O2 H2

??? R

slide-37
SLIDE 37

Let's say it exists…what is the first split then ?

O1 S1 R1 R1’ G2 O2 H2

??? R

??? ???

slide-38
SLIDE 38

O1 S1 R1 R1’ G2 O2 H2

???

O1 S1 R1 R1’ G2 O2 H2

R

slide-39
SLIDE 39

O1 S1 R1 R1’ G2 O2 H2

???

O1 S1 R1 R1’ G2 O2 H2 Monochromatic edge-cut

R

slide-40
SLIDE 40

O1 S1 R1 R1’ G2 O2 H2

???

O1 S1 R1 R1’ G2 O2 H2

slide-41
SLIDE 41

O1 S1 R1 R1’ G2 O2 H2

slide-42
SLIDE 42

G2 O2 H2 O1 S1 R1 R1’

slide-43
SLIDE 43

G2 O2 H2 O1 S1 R1 R1’

slide-44
SLIDE 44

Lemma: If each subgraph of the relation graph R has a monochromatic edge-cut, we can build a gene tree from R.
Conversely? If R has a subgraph with no such cut, does it mean that we can't build a gene tree?

slide-45
SLIDE 45

Lemma: If each subgraph of the relation graph R has a monochromatic edge-cut, we can build a gene tree from R.
Conversely? If R has a subgraph with no such cut, does it mean that we can't build a gene tree?
YES, the converse also holds.

slide-46
SLIDE 46

a b c d

Every cut has 2 colors => No possible rooting

a c b d a c b d

Misses the (c, b) paralogy.

slide-47
SLIDE 47

a b c d

Every cut has 2 colors => No possible rooting

a c b d a c b d

Misses the (a, b) orthology.

slide-48
SLIDE 48

Theorem: A relation graph R is satisfiable if and only if each subgraph has a monochromatic edge-cut. Can we test that easily (in polynomial time) ?

slide-49
SLIDE 49

Theorem: A relation graph R is satisfiable if and only if each subgraph has a monochromatic edge-cut.
Theorem (restated): A relation graph R is satisfiable if and only if for each subgraph R', one of R'BLACK or R'BLUE is disconnected.

a b c d a b c d a b c d

RBLACK RBLUE R

slide-50
SLIDE 50

Theorem: A relation graph R is satisfiable if and only if each subgraph has a monochromatic edge-cut.
Theorem (restated): A relation graph R is satisfiable if and only if for each subgraph R', one of R'BLACK or R'BLUE is disconnected.
Theorem (again): A relation graph R is satisfiable if and only if for each subgraph R', either R'BLACK or its complement is disconnected.
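The disconnection criterion suggests a direct recursive construction, sketched below under stated assumptions: black edges are taken to be orthologies (the slides do not say which color is which; swapping the convention swaps the two labels), trees are nested tuples, and the function names are hypothetical. If the orthology graph is disconnected, every edge between its components is a paralogy, so the root is a duplication; if the paralogy graph is disconnected, the root is a speciation; if both are connected, no gene tree exists.

```python
from itertools import combinations


def components(vertices, edges):
    """Connected components of the graph (vertices, edges)."""
    vertices = set(vertices)
    adj = {v: set() for v in vertices}
    for e in edges:
        u, v = tuple(e)
        if u in vertices and v in vertices:
            adj[u].add(v)
            adj[v].add(u)
    seen, comps = set(), []
    for start in sorted(vertices):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def build_tree(vertices, orthologs):
    """Return a labeled gene tree displaying the relations, or None."""
    vertices = sorted(vertices)
    if len(vertices) == 1:
        return vertices[0]
    all_pairs = {frozenset(p) for p in combinations(vertices, 2)}
    black = all_pairs & orthologs   # orthology edges (assumed black)
    blue = all_pairs - black        # paralogy edges
    for event, edges in (("dup", black), ("spec", blue)):
        comps = components(vertices, edges)
        if len(comps) > 1:          # a monochromatic edge-cut exists
            children = [build_tree(c, orthologs) for c in comps]
            if any(child is None for child in children):
                return None
            return (event, *children)
    return None                     # both colors connected: unsatisfiable


# The four-gene example from the relation graph slide: no tree exists.
example = {frozenset("ab"), frozenset("ac"), frozenset("cd")}
print(build_tree("abcd", example))  # None
```

The returned tree may have nodes with more than two children, since all components are split off at once.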

slide-51
SLIDE 51

Theorem (again): A relation graph R is satisfiable if and only if for each subgraph R', either R'BLACK or its complement is disconnected. These graphs are well-known! They are called cographs, aka P4-free graphs.

slide-52
SLIDE 52

Theorem (finally): A relation graph R is satisfiable if and only if RBLACK is P4-free (no induced path on 4 vertices, i.e. of length 3).

a b c d a b c d

RBLACK R

a b c d a b c d

RBLACK R NO YES
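For small inputs, the P4-free condition can be checked by brute force. The sketch below is an illustration only (the function name is hypothetical, and it runs in O(n^4) time, whereas dedicated cograph recognition algorithms are much faster). Since the complement of a P4 is again a P4, the answer is the same whether the edge set passed in is the orthology graph or the paralogy graph.

```python
from itertools import combinations, permutations


def has_induced_p4(vertices, edges):
    """Return an induced path w-x-y-z if one exists, else None."""
    edges = {frozenset(e) for e in edges}

    def adjacent(u, v):
        return frozenset((u, v)) in edges

    for quad in combinations(sorted(vertices), 4):
        for w, x, y, z in permutations(quad):
            # Path edges must be present, all other pairs absent.
            if (adjacent(w, x) and adjacent(x, y) and adjacent(y, z)
                    and not adjacent(w, y) and not adjacent(w, z)
                    and not adjacent(x, z)):
                return (w, x, y, z)
    return None


# Orthologs = (a, b), (a, c), (c, d) from the relation graph slide.
print(has_induced_p4("abcd", [("a", "b"), ("a", "c"), ("c", "d")]))
# ('b', 'a', 'c', 'd'): an induced P4, so the relations are not satisfiable
```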

slide-53
SLIDE 53

S-Consistency

What if we want our relations to agree with a given species tree?

R A B C S

a = gene from species A b = gene from species B c = gene from species C c a b

slide-54
SLIDE 54

S-Consistency

What if we want our relations to agree with a given species tree S?

c a b

R A B C S a b c G satisfied by

slide-55
SLIDE 55

S-Consistency

What if we want our relations to agree with a given species tree S?

c a b

R A B C satisfied by a b c G

slide-56
SLIDE 56

S-Consistency

What if we want our relations to agree with a given species tree S?

A B C a b c G

slide-57
SLIDE 57

S-Consistency

What if we want our relations to agree with a given species tree S?

A B C a b c G

slide-58
SLIDE 58

S-Consistency

What if we want our relations to agree with a given species tree S?

A B C a b c G

slide-59
SLIDE 59

S-Consistency

What if we want our relations to agree with a given species tree S?

A B C a b c G Inconsistent speciation

slide-60
SLIDE 60

Theorem: A relation graph R is S-Consistent if and only if R is satisfiable, and every 3-vertex subgraph of R "agrees" with S.
Agreement only adds a requirement on the speciations.
Only a black P3 can possibly disagree with S.

A B C S

c a b
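The slides do not spell out what "agrees" means, so the sketch below rests on an assumed reading: black edges are orthologies, and a black P3 x-m-z (x and z paralogs, both orthologous to m) forces the gene tree ((x, z) duplication, m) speciation, so S must group the species of x and z apart from the species of m. The encoding of S as nested tuples and all function names are hypothetical, and only P3s on three distinct species are checked.

```python
from itertools import permutations


def leaf_set(node):
    """Species below a node of S (a leaf is a species name)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def smallest_clade(node, species):
    """Leaf set of the lowest node of S whose subtree contains `species`."""
    if not isinstance(node, str):
        for child in node:
            if species <= leaf_set(child):
                return smallest_clade(child, species)
    return leaf_set(node)


def displays_triplet(S, x, y, z):
    """True if S groups x and y together, apart from z."""
    return z not in smallest_clade(S, {x, y})


def disagreeing_p3s(genes, orthologs, species_of, S):
    """List black P3s (x, middle, z) that contradict S (assumed criterion)."""
    orth = {frozenset(e) for e in orthologs}
    bad = []
    for x, m, z in permutations(sorted(genes), 3):
        if (x < z and frozenset((x, m)) in orth and frozenset((m, z)) in orth
                and frozenset((x, z)) not in orth):
            X, M, Z = species_of[x], species_of[m], species_of[z]
            if len({X, M, Z}) == 3 and not displays_triplet(S, X, Z, M):
                bad.append((x, m, z))
    return bad


S = (("A", "B"), "C")
species_of = {"a": "A", "b": "B", "c": "C"}
# P3 a-b-c needs A and C grouped apart from B, which S does not display.
print(disagreeing_p3s("abc", [("a", "b"), ("b", "c")], species_of, S))
```

Under this reading, a relation graph would be S-Consistent when it is P4-free and this list is empty.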

slide-61
SLIDE 61

Experiments

We looked at 265 inferred families from ProteinOrtho, under 5 parameter sets {-2, -1, 0, +1, +2}.

Looser => More orthologies
Stricter => Fewer orthologies

Parameter sets: -2, -1, Default, +1, +2

slide-62
SLIDE 62

Experiments

Looser => More orthologies
Stricter => Fewer orthologies

Parameter sets: -2, -1, Default, +1, +2

slide-63
SLIDE 63

Experiments

Looser => More orthologies
Stricter => Fewer orthologies

Parameter sets: -2, -1, Default, +1, +2

Satisfiable?
Consistent?

slide-64
SLIDE 64

Experiments

Looser => More orthologies
Stricter => Fewer orthologies

Parameter sets: -2, -1, Default, +1, +2

Satisfiable? NO (~90% of families)
Consistent? NO (~96% of families)

slide-65
SLIDE 65

Experiments

Looser => More orthologies
Stricter => Fewer orthologies

Parameter set     -2    -1    Default   +1    +2
NOT Satisfiable   80%   82%   90%       83%   70%
NOT Consistent    93%   95%   96%       95%   89%

slide-66
SLIDE 66

Gene relation correction

slide-67
SLIDE 67

Gene relation correction

Make R satisfiable by changing a minimum number of relations. That is, change as few edge colors as possible to make RBLACK P4-free

a b c d

slide-68
SLIDE 68

Gene relation correction

Make R satisfiable by changing a minimum number of relations. That is, change as few edge colors as possible to make RBLACK P4-free

a b c d a b c d

slide-69
SLIDE 69

Gene relation correction

Make R satisfiable by changing a minimum number of relations.
That is, change as few edge colors as possible to make RBLACK P4-free.
NP-Complete (El-Mallah & Colbourn, 1988)

a b c d a b c d
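Because the problem is NP-Complete, an exact answer is only practical on tiny instances. The sketch below is an exhaustive search written for illustration (it is not a method from the talk, and the function names are hypothetical): it tries sets of color flips of increasing size until the black graph becomes P4-free.

```python
from itertools import combinations, permutations


def is_p4_free(vertices, edges):
    """True if the graph has no induced path on 4 vertices."""
    def adj(u, v):
        return frozenset((u, v)) in edges

    for quad in combinations(vertices, 4):
        for w, x, y, z in permutations(quad):
            if (adj(w, x) and adj(x, y) and adj(y, z)
                    and not adj(w, y) and not adj(w, z) and not adj(x, z)):
                return False
    return True


def min_correction(vertices, orthologs):
    """Smallest set of pairs whose color must flip (exponential time)."""
    vertices = sorted(vertices)
    black = {frozenset(e) for e in orthologs}
    pairs = [frozenset(p) for p in combinations(vertices, 2)]
    for k in range(len(pairs) + 1):
        for flips in combinations(pairs, k):
            # Symmetric difference flips the color of each chosen pair.
            if is_p4_free(vertices, black ^ set(flips)):
                return set(flips)


flips = min_correction("abcd", [("a", "b"), ("a", "c"), ("c", "d")])
print(len(flips))  # 1: a single relation change suffices on this example
```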

slide-70
SLIDE 70

Gene relation correction

Make R S-Consistent by changing a minimum number of relations.
That is, change as few edge colors as possible so that R is P4-free, and every P3 agrees with S.
(Hey, maybe S can help reduce the complexity.)

slide-71
SLIDE 71

Gene relation correction

Make R S-Consistent by changing a minimum number of relations.
That is, change as few edge colors as possible so that R is P4-free, and every P3 agrees with S.
(Hey, maybe S can help reduce the complexity.)
NO: NP-Complete (Lafond & El-Mabrouk, 2014)

slide-72
SLIDE 72

Gene relation correction

Make R S-Consistent by removing a minimum number of genes.
That is, delete as few vertices from R as possible so that R is P4-free, and every P3 agrees with S.

slide-73
SLIDE 73

Gene relation correction

Make R S-Consistent by removing a minimum number of genes.
That is, delete as few vertices from R as possible so that R is P4-free, and every P3 agrees with S.
NP-Hard to approximate within an n^(1-ε) factor. (Lafond, Dondi, & El-Mabrouk, 2016)

slide-74
SLIDE 74

Weighted gene relation correction

To make things easier: give each edge a weight, representing some degree of confidence over the inferred orthology/paralogy.

This weight represents the cost for changing the edge's color.

a b c d a b c d 0.8 1 0.75 0.75 0.5 0.6 0.5

slide-75
SLIDE 75

Weighted gene relation correction

Something we can handle: if all edges have weights of 0 or 1
0 = don't care, 1 = don't touch
We can tell in polynomial time if there is an edge editing of weight 0.

a b c d a b c d 1 1 1 1

slide-76
SLIDE 76

Weighted gene relation correction

If weights are arbitrary, NP-Hardness follows from the unweighted version (for both satisfiability and consistency). Worse than that, there is no constant factor approximation assuming the unique games conjecture.

a b c d a b c d 0.8 1 0.75 0.75 0.5 0.6 0.5

slide-77
SLIDE 77

Min-cut approximation for satisfiability

Recall:
Theorem (again): A relation graph R is satisfiable if and only if for each subgraph R', one of R'BLACK or R'BLUE is disconnected.

In particular, RBLACK or its complement RBLUE must be disconnected.
So we'll disconnect it then.

slide-78
SLIDE 78

Min-cut approximation for satisfiability

In particular, RBLACK or its complement RBLUE must be disconnected.

  • Find a min-cut on RBLACK
  • Find a min-cut on RBLUE
  • Take the best of the two and apply.
  • Repeat on the resulting components.

(min-cut = minimum weight edge set that disconnects R; can be found in time O(n^3))

slide-79
SLIDE 79

Min-cut approximation for satisfiability

In particular, RBLACK or its complement RBLUE must be disconnected.

  • Find a min-cut on RBLACK
  • Find a min-cut on RBLUE
  • Take the best of the two and apply.
  • Repeat on the resulting components.

Gives a solution that is at most n times worse than optimal. (Not great, but shows that approximability is bounded.)

(min-cut = minimum weight edge set that disconnects R; can be found in time O(n^3))
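The recursive scheme can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name and data layout are hypothetical, and the min-cut step is done by enumerating every bipartition, which is exponential and merely stands in for a polynomial min-cut routine such as the O(n^3) one mentioned on the slide. Cutting RBLACK along a bipartition means recoloring the black edges that cross it, so that the cut becomes monochromatic (and likewise for RBLUE).

```python
from itertools import combinations


def min_cut_correction(vertices, black, weight):
    """Return (total cost, set of recolored pairs) for the recursive heuristic.

    black:  set of frozenset pairs colored black; all other pairs are blue.
    weight: dict mapping each frozenset pair to the cost of recoloring it.
    """
    vertices = sorted(vertices)
    if len(vertices) < 2:
        return 0.0, set()
    first, rest = vertices[0], vertices[1:]
    best = None
    # Enumerate bipartitions (X, Y) with `first` fixed in X (brute force).
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            X = {first, *extra}
            Y = set(vertices) - X
            crossing = [frozenset((u, v)) for u in X for v in Y]
            for cut_black in (True, False):
                # Edges of the chosen color that cross the bipartition.
                cut = {e for e in crossing if (e in black) == cut_black}
                cost = sum(weight[e] for e in cut)
                if best is None or cost < best[0]:
                    best = (cost, cut, X, Y)
    cost, cut, X, Y = best
    # Recolored edges all cross (X, Y), so both sides can be solved separately.
    cost_x, flips_x = min_cut_correction(X, black, weight)
    cost_y, flips_y = min_cut_correction(Y, black, weight)
    return cost + cost_x + cost_y, cut | flips_x | flips_y


pairs = [frozenset(p) for p in combinations("abcd", 2)]
unit = {p: 1 for p in pairs}  # unit weights, chosen for the example
black = {frozenset("ab"), frozenset("ac"), frozenset("cd")}
cost, flips = min_cut_correction("abcd", black, unit)
print(cost, len(flips))
```

On this four-gene example with unit weights the heuristic recolors a single edge, which matches the optimum; in general the slide only promises a factor of n.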

slide-80
SLIDE 80

Theoretical and practical problems

slide-81
SLIDE 81

Theoretical problems

Unweighted case: can we approximate satisfiability? Consistency?
Weighted case: gap in approximability results. Is there better than an n-factor approximation? Somewhere in between constant and n.
Self-consistency: we don't know the species tree S, but we want the relations to be consistent with some species tree.
HGT, ILS, etc.: how can we handle other events such as horizontal gene transfer or incomplete lineage sorting? What is their impact on relation graphs?
slide-82
SLIDE 82

Practical problems

We don't even know how to test our correction methods. Gold standard datasets are extremely rare, if not nonexistent.
Most software focuses on forming clusters of orthologs. How do we compare with others?
slide-83
SLIDE 83

Practical problems

Faster approximations and heuristics are still needed. The Min-Cut algorithm takes time O(n^3), and our implementation is too slow for, say, 1000 genes.
How to handle other events? How can we distinguish species tree disagreement with HGT or ILS? Beyond graph theory, what is their practical impact in the orthology/paralogy inference process?