Error Detection and Correction of Gene Trees Using Gene Order



slide-1
SLIDE 1

Error Detection and Correction of Gene Trees Using Gene Order

Manuel Lafond, Krister M. Swenson and Nadia El-Mabrouk
Université de Montréal

slide-2
SLIDE 2

Introduction

• Gene trees reflect the evolutionary history of a family of homologous genes
  • Genes that all descend from a common ancestor

[Figure: gene tree G with leaves g1, g2, g3, g4, g5]

slide-3
SLIDE 3

Introduction

• Ancestral genes may have undergone speciation or duplication

[Figure: gene tree G with leaves g1 to g5; internal nodes marked as duplication or speciation]

slide-4
SLIDE 4

Introduction

• Relationships between modern genes
  • Orthologs: the LCA (lowest common ancestor) is a speciation
    • g1 and g5 are orthologs
  • Paralogs: the LCA is a duplication
    • g1 and g3 are paralogs

[Figure: gene tree G with leaves g1 to g5; internal nodes marked as duplication or speciation]

slide-5
SLIDE 5

Introduction

• Speciations and duplications are typically inferred by reconciling G with its corresponding species tree S
  • Idea: map each modern gene to the species containing it, and add duplications to make G “agree” with S

[Figure: gene tree G with leaves a1, a2, b1, c1, d1; species tree S with leaves a, b, c, d]

slide-6
SLIDE 6

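Introduction

• An internal node g of V(G) is a speciation when there is a node s in V(S) such that
  • the leaves in the left subtree of g all map to leaves in the left subtree of s
  • the leaves in the right subtree of g all map to leaves in the right subtree of s

[Figure: gene tree G with leaves a1, a2, b1, c1, d1 and a highlighted node g; species tree S with leaves a, b, c, d and a highlighted node s]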


slide-8
SLIDE 8

Introduction

• Otherwise, g is a duplication
  • In this case, the duplication is apparent: two copies of the same gene ended up in species a
  • Non-apparent duplications are also possible (we will see them later)

[Figure: gene tree G with leaves a1, a2, b1, c1, d1 and a highlighted node g; species tree S with leaves a, b, c, d and a highlighted node s]
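
The speciation/duplication rule from the last slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: trees are written as nested 2-tuples, the species of a gene is taken from the first letter of its name, and the test used is the usual LCA-mapping formulation (a node is a duplication when it maps to the same node of S as one of its children), which is assumed here to be equivalent to the left/right rule above. The tree shapes are hypothetical, since the drawings on the slides are not recoverable, and all function names are made up for this sketch.

```python
def leaf_set(tree):
    """Set of leaf names of a tree written as nested 2-tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaf_set(tree[0]) | leaf_set(tree[1])


def clades(species_tree):
    """Leaf set of every node of the species tree (leaves included)."""
    if isinstance(species_tree, str):
        return [frozenset([species_tree])]
    left, right = species_tree
    return clades(left) + clades(right) + [leaf_set(species_tree)]


def lca_mapping(species, s_clades):
    """Smallest clade of S that contains all the given species."""
    return min((c for c in s_clades if species <= c), key=len)


def label_nodes(gene_tree, species_tree, species_of):
    """Label every internal node of the gene tree, listed in post-order."""
    s_clades = clades(species_tree)
    labels = []

    def visit(node):
        if isinstance(node, str):
            return frozenset([species_of(node)])
        left, right = visit(node[0]), visit(node[1])
        here = lca_mapping(left | right, s_clades)
        # Duplication: the node maps to the same node of S as one of its children.
        is_dup = here in (lca_mapping(left, s_clades), lca_mapping(right, s_clades))
        labels.append((sorted(leaf_set(node)), "duplication" if is_dup else "speciation"))
        return left | right

    visit(gene_tree)
    return labels


# Hypothetical trees (assumed shapes, not taken from the slides).
S = ((("a", "b"), "c"), "d")
G = (((("a1", "a2"), "b1"), "c1"), "d1")           # apparent duplication at (a1, a2)
G_non_apparent = ((("a1", ("a2", "b1")), "c1"), "d1")

for genes, kind in label_nodes(G, S, species_of=lambda gene: gene[0]):
    print(genes, kind)
```

In `G_non_apparent`, the node above a1 and (a2, b1) is also labeled a duplication even though its two subtrees do not hold two genes of exactly the same species set, which is the kind of non-apparent duplication mentioned on the slide.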

slide-9
SLIDE 9

Introduction

• Suppose we are given the orthology/paralogy relationships
  • For instance, some deity lets us know that a1 and b1 are orthologous
  • Then this gene tree is wrong!

[Figure: gene tree G with leaves a1, a2, b1, c1, d1; species tree S with leaves a, b, c, d]

slide-10
SLIDE 10

Introduction

• How can we make a1 and b1 orthologous?

[Figure: gene tree G with leaves a1, a2, b1, c1, d1; species tree S with leaves a, b, c, d]


slide-14
SLIDE 14

Introduction

• How can we make a1 and b1 orthologous?
• And change G as little as possible?
• What if we are given many orthology constraints?

[Figure: gene tree G with leaves a1, a2, b1, c1, d1; species tree S with leaves a, b, c, d]

slide-15
SLIDE 15

Problem statement

• Given: a gene tree G, a species tree S, and a set P of pairs of genes that are required to be orthologous
• Find: a corrected gene tree G’ in which every pair (g1, g2) in P is orthologous, such that the Robinson-Foulds distance between G and G’ is minimized

[Figure: gene tree G with leaves a1, a2, b1, c1, d1; species tree S with leaves a, b, c, d]
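
To make the objective concrete, here is a minimal sketch of a Robinson-Foulds distance between two rooted gene trees on the same leaves. It assumes the rooted variant (the number of clades present in exactly one of the two trees); the slides do not say which variant is used. Trees are nested tuples and the two example trees are hypothetical.

```python
def clade_set(tree):
    """Set of clades (frozensets of leaf names) of a rooted tree given as nested tuples."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    visit(tree)
    return found


def rf_distance(tree1, tree2):
    """Number of clades that appear in exactly one of the two trees."""
    return len(clade_set(tree1) ^ clade_set(tree2))


# Hypothetical example: moving b1 next to a1 changes exactly one clade on each side.
G = (((("a1", "a2"), "b1"), "c1"), "d1")
G_prime = (((("a1", "b1"), "a2"), "c1"), "d1")
print(rf_distance(G, G_prime))
```

Here the clade {a1, a2} exists only in `G` and the clade {a1, b1} only in `G_prime`, so the distance is 2.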

slide-16
SLIDE 16

Introduction

• Two copies of the same gene (g1, g2) were found in the same species => we need to infer a duplication

[Figure: species tree S with leaves a, b, c, d; gene tree G with leaves mapped to a, a, b, c, d]

slide-17
SLIDE 17

Accuracy of gene trees

• A few misplaced leaves in G can lead to a completely different reconciliation

[Figure: gene tree G with leaves g1:a, g2:a, g3:b, g4:c, g5:d; species tree S with leaves a, b, c, d]

slide-18
SLIDE 18

Accuracy of gene trees

• A few misplaced leaves in G can lead to a completely different reconciliation

[Figure: gene trees G and G’, each with leaves g1:a, g2:a, g3:b, g4:c, g5:d; species tree S with leaves a, b, c, d]


slide-20
SLIDE 20

Accuracy of gene trees

• Inaccuracies in gene trees lead to
  • Erroneous topologies
  • Erroneous orthology/paralogy relationships
• We use gene order to detect and correct such errors

[Figure: species tree S with leaves a, b, c, d; gene tree G with leaves g1:a, g2:a, g3:b, g4:c, g5:d]

slide-21
SLIDE 21

Gene tree inference and correction

• Some available information to infer and correct gene trees
  • Sequences (MP, ML, Bayesian, …)
  • Species tree topology (GIGA)
  • Branch/clade support (LSM)
  • Speciation/duplication events inferred by reconciliation (TreeBeST)
  • Gene synteny (SYNERGY)
  • Gene position and order on the genome

slide-22
SLIDE 22

Gene order

• Genome: a string of genes, giving the order in which genes are found in a given species
  • Genome of a species X: “a b c d e f g …”
• Region: a subsequence of a genome
  • Pick a subset of a genome’s genes, maintaining their order
  • Example: genome a b c d e f g h ... => region b c e g
• Typically, we impose a limit on the size of a region and on the genome distance between its members
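
The definition of a region can be illustrated with a small check. This is a sketch under assumptions: gene names are unique within a genome, the "genome distance" between two consecutive members is taken to be the difference of their positions, and the parameter names `max_size` and `max_gap` are invented for the example.

```python
def is_region(genome, genes, max_size, max_gap):
    """True if `genes` is a subsequence of `genome` within the size and gap limits.

    Assumes every gene name occurs at most once in `genome`.
    """
    if not genes or len(genes) > max_size:
        return False
    position = -1
    previous = None
    for gene in genes:
        try:
            # Search strictly to the right of the previous member to keep the order.
            position = genome.index(gene, position + 1)
        except ValueError:
            return False
        if previous is not None and position - previous > max_gap:
            return False
        previous = position
    return True


genome = "a b c d e f g h".split()
print(is_region(genome, ["b", "c", "e", "g"], max_size=4, max_gap=2))
print(is_region(genome, ["b", "c", "e", "g"], max_size=4, max_gap=1))
```

With the genome of the slide, "b c e g" is accepted when consecutive members may be up to two positions apart and rejected when they must be adjacent.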


slide-26
SLIDE 26

Region homology

• Two genes are homologous if they descend from a common ancestral gene, which has undergone speciation or duplication
• Can we define region homology similarly?
• Two regions are homologous if they descend from a common ancestral region, which has undergone speciation or duplication
  • What does that even mean?


slide-28
SLIDE 28

Region homology

• Common ancestral region
  • For two given regions R1, R2
    • Subdivide their genes into gene families F1, F2, …, Fn
    • In the example, there are four families (a, b, c, d)
    • Look at the roots of the gene trees for all the Fi’s

[Figure: region R1 = a1 b1 c1 d1 on genome X and region R2 = a2 b2 c2 d2 on genome Y, with the gene trees of families a, b, c, d]

slide-29
SLIDE 29

Region homology

• Common ancestral region
  • If all these ancestral genes are in the same ancestral genome, R1 and R2 share a common ancestral region RA

[Figure: ancestral region RA = a b c d above region R1 = a1 b1 c1 d1 on genome X and region R2 = a2 b2 c2 d2 on genome Y]

slide-30
SLIDE 30

Region homology

• Region speciation
  • All the roots are speciations

[Figure: ancestral region RA = a b c d above region R1 = a1 b1 c1 d1 on genome X and region R2 = a2 b2 c2 d2 on genome Y]

slide-31
SLIDE 31

Region homology

• Region duplication
  • All the roots are duplications
  • Corresponds to a segmental duplication (or “region duplication”) in the ancestral genome

[Figure: ancestral region RA = a b c d above regions R1 = a1 b1 c1 d1 and R2 = a2 b2 c2 d2; genomes X and Y]

slide-32
SLIDE 32

Region homology

• Non-homologous regions

[Figure: ancestral region RA = a b c d above region R1 = a1 b1 c1 d1 on genome X and region R2 = a2 b2 c2 d2 on genome Y]

slide-33
SLIDE 33

No convergent evolution hypothesis

• Hypothesis: similar regions are homologous

[Figure: ancestral region RA = a b c d above region R1 = a1 b1 c1 d1 on genome X and region R2 = a2 b2 c2 d2 on genome Y]

slide-34
SLIDE 34

Homology contradiction

• If we find two similar regions and look at the roots of the gene family trees, we expect them all to be of the same type

[Figure: region R1 = a1 b1 c1 d1 on genome X and region R2 = a2 b2 c2 d2 on genome Y, with the roots of the gene trees of families a, b, c, d]


slide-37
SLIDE 37

Homology contradiction

• Otherwise, there is a homology contradiction (an error in one of the gene trees)

[Figure: region R1 = a1 b1 c1 d1 on genome X and region R2 = a2 b2 c2 d2 on genome Y]


slide-40
SLIDE 40

Homology contradiction

• Why not accept roots of different types?
  • If bA duplicated, the copy typically went somewhere else on the ancestral genome
  • It would then have had to end up, during evolution, in a region similar to R1, mostly by chance

[Figure: region R1 = a1 b1 c1 d1 on genome X and region R2 = a2 b2 c2 d2 on genome Y, with the ancestral gene bA and its copy bA’]

slide-41
SLIDE 41

Strong no convergent evolution

• Hypothesis: similarity is inherited from the common ancestral region, and is preserved during the course of evolution

[Figure: G, the gene tree for the g family, with leaves g1, g2, g3, g4; regions a1 g1 b1 and a2 g4 b2]


slide-44
SLIDE 44

Strong no convergent evolution

• Otherwise, we must assume that g1 and g2 gained their region similarity by chance

[Figure: gene tree with leaves g1, g2, g3, g4; regions a1 g1 b1 and a2 g4 b2]


slide-46
SLIDE 46

Region overlapping

• Two ancestral genes may belong to two different region families simultaneously

[Figure: gene tree of the g family with ancestral regions aA gA bA, aB gB bB, aC gC bC and xA gA yA, xB gB yB, xC gC yC; modern regions a1 g1 b1, a2 g4 b2, x1 g2 y1, x2 g3 y2]

slide-47
SLIDE 47

Results

• We looked for homology contradictions and context overlapping in ~6000 Ensembl gene trees
• All trees contained genes for the Zebrafish, Medaka, Stickleback and Tetraodon species, and we included Human and Mouse as outgroups

slide-48
SLIDE 48

Results

• Each gene was assigned a region of size 3
  • A triplet containing the gene and its left/right adjacencies
• The central gene is the gene of interest
• Two regions (a g1 b) and (x g2 y) are homologous if a and x are in the same family, as well as b and y
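
The size-3 regions and the homology test above can be sketched as follows. The assumptions are labeled in the comments: genomes are plain lists of gene names, the family of a gene is read from its name by dropping trailing digits, and genes at the end of a genome (with only one neighbour) are skipped. The slide does not say how reversed regions are treated, so this sketch compares left with left and right with right only.

```python
def triplets(genome):
    """Map each gene that has two neighbours to its (left, gene, right) region."""
    return {genome[i]: (genome[i - 1], genome[i], genome[i + 1])
            for i in range(1, len(genome) - 1)}


def family_of(gene):
    """Assumed naming convention: 'a1' and 'a2' both belong to family 'a'."""
    return gene.rstrip("0123456789")


def homologous_regions(region1, region2):
    """Regions (a g1 b) and (x g2 y) are homologous if a ~ x and b ~ y."""
    (a, _, b), (x, _, y) = region1, region2
    return family_of(a) == family_of(x) and family_of(b) == family_of(y)


# Hypothetical genomes for illustration.
genome_x = ["a1", "g1", "b1", "x1", "g2", "y1"]
genome_y = ["a2", "g4", "b2"]
regions_x = triplets(genome_x)
regions_y = triplets(genome_y)
print(homologous_regions(regions_x["g1"], regions_y["g4"]))
print(homologous_regions(regions_x["g2"], regions_y["g4"]))
```

In this example the regions around g1 and g4 share both flanking families, while the region around g2 does not match the one around g4.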

slide-49
SLIDE 49

Results

• Paralogy contradiction
  • gA should not be a duplication

[Figure: ancestral region aA gA bA above regions a1 g1 b1 and a2 g2 b2]

slide-50
SLIDE 50

Results

• Orthology contradiction
  • gA should not be a speciation

[Figure: ancestral region aA gA bA above regions a1 g1 b1 and a2 g2 b2]


slide-52
SLIDE 52

Results

Table 1: Number of Ensembl gene trees with errors

Number of trees               6241
Paralogy contradiction        22.5 % (1407 trees)
Orthology contradiction       10.8 % (677 trees)
Region overlap                3.4 % (210 trees)
At least one contradiction    31.3 % (1959 trees)

• 77% of paralogy contradictions correspond to duplications marked as “dubious” by Ensembl (dubious duplications are non-apparent duplications)


slide-55
SLIDE 55

Gene tree correction

• How should such errors be corrected?
• We need to find an error-free gene tree with equal or better statistical support
  • Explore the original gene tree’s neighborhood
  • Algorithmically free the gene tree from errors, minimizing some criterion
    • Distance from the original tree (NNI, SPR, TBR, RF, …)
    • Reconciliation cost
    • Get rid of dubious duplications
• Does an error-free tree even exist?

slide-56
SLIDE 56

Gene tree correction

• For homology contradictions
  • R: a set of gene pairs that must be orthologs
  • P: a set of gene pairs that must be paralogs


slide-58
SLIDE 58

Gene tree correction

  • R: a set of gene pairs that must be orthologs
  • P: a set of gene pairs that must be paralogs

R = {(a1, b1)}, P = {(a2, c1)}

[Figure: gene tree G and corrected gene tree G’, each with leaves a1, a2, b1, c1, c2, d1; species tree S with leaves a, b, c, d]

slide-59
SLIDE 59

Gene tree correction

  • R: a set of gene pairs that must be orthologs
  • P: a set of gene pairs that must be paralogs
  • It is possible to have R and P such that no gene tree can satisfy all constraints
    • Deciding whether R and P are satisfiable: complexity unknown
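
Checking whether one given gene tree satisfies R and P is straightforward, unlike deciding whether any tree does. The sketch below is an illustration only: it reuses the LCA-mapping rule (assumed, as a standard way to label nodes), writes trees as nested 2-tuples, reads the species from the first letter of a gene name, and assumes the two genes of a pair are distinct. The trees are hypothetical, not those drawn on the slides, and only R = {(a1, b1)} and P = {(a2, c1)} come from the slides.

```python
def leaf_set(tree):
    """Set of leaf names of a tree written as nested 2-tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaf_set(tree[0]) | leaf_set(tree[1])


def clades(species_tree):
    """Leaf set of every node of the species tree (leaves included)."""
    if isinstance(species_tree, str):
        return [frozenset([species_tree])]
    return clades(species_tree[0]) + clades(species_tree[1]) + [leaf_set(species_tree)]


def lca_mapping(species, s_clades):
    """Smallest clade of S that contains all the given species."""
    return min((c for c in s_clades if species <= c), key=len)


def gene_lca(tree, x, y):
    """Smallest subtree of the gene tree that contains both genes x and y."""
    if not isinstance(tree, str):
        for child in tree:
            if {x, y} <= leaf_set(child):
                return gene_lca(child, x, y)
    return tree


def are_orthologs(gene_tree, species_tree, x, y, species_of):
    """True if the LCA of x and y (x != y) in the gene tree is a speciation."""
    node = gene_lca(gene_tree, x, y)
    s_clades = clades(species_tree)
    left, right = (frozenset(species_of(g) for g in leaf_set(child)) for child in node)
    here = lca_mapping(left | right, s_clades)
    return here not in (lca_mapping(left, s_clades), lca_mapping(right, s_clades))


def violated(gene_tree, species_tree, R, P, species_of):
    """List the constraints of R (orthologs) and P (paralogs) that the tree breaks."""
    bad = [("R", pair) for pair in R
           if not are_orthologs(gene_tree, species_tree, *pair, species_of)]
    bad += [("P", pair) for pair in P
            if are_orthologs(gene_tree, species_tree, *pair, species_of)]
    return bad


# Hypothetical trees (assumed shapes).
S = ((("a", "b"), "c"), "d")
G = ((("a1", ("a2", "b1")), "c1"), "d1")
G_fixed = (("a2", (("a1", "b1"), "c1")), "d1")
R = [("a1", "b1")]
P = [("a2", "c1")]
first_letter = lambda gene: gene[0]
print(violated(G, S, R, P, first_letter))
print(violated(G_fixed, S, R, P, first_letter))
```

In this example `G` breaks both constraints and `G_fixed` breaks none; the sketch says nothing about how close `G_fixed` is to `G`, which is the optimization part of the problem.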

slide-60
SLIDE 60

Correction of paralogy contradictions

• Input: a gene tree G, a species tree S, and a set R of gene pairs that must be orthologs
• Output: a corrected gene tree G’ in which
  • every pair of required orthologs in R is orthologous in G’
  • the Robinson-Foulds distance between G and G’ is minimized (among all possible solutions)
• Feasible in polynomial time

slide-61
SLIDE 61

Conclusion

• 3 types of errors in gene trees
  • Paralogy contradiction
  • Orthology contradiction
  • Context overlap
• How can we free a gene tree from such errors in order to get more accurate trees?
• Do unsatisfiable constraints exist in real data? If so, how can we interpret them?