The Potential of Family-Free Genome Comparison
Marília D. V. Braga, Cedric Chauve, Daniel Doerr, Katharina Jahn, Jens Stoye, Annelyse Thévenin, Roland Wittler
(Bielefeld, Bordeaux, Rio de Janeiro, Vancouver)
MAGE, 26 August 2013
Introduction
Comparative genomics
Two levels of genome evolution:
◮ Small scale mutations: point mutations
◮ Large scale mutations: rearrangements, duplications, insertions, deletions
Structural organization provides insights into:
◮ phylogeny and evolution
◮ gene function and interactions
The Potential of Family-Free Genome Comparison (5 / 27) Jens Stoye
[Figure: two gene orders, 1 2 3 4 5 6 7 8 and 6 5 4 3 2 1 7 8]
Comparative genomics with gene families
Picture with gene families:
◮ Simple and powerful data type
◮ Many databases and tools available
◮ Produce reasonable results
The Family-free Principle
More realistic picture:
◮ Computational prediction of gene families is (mostly) unsupervised
◮ Predicted families do not always correspond to biological gene families
◮ Wrong gene family assignments may produce incorrect results in subsequent analyses
Gene family assignments not necessary:
◮ If subsequent analyses can deal with original data
◮ For example gene similarity scores
We may even invert the scenario:
◮ Integrated analysis: ortholog assignments and gene order analysis
◮ Gene family assignment based on positional orthology
Family-free Principle
[Diagram: gene similarities as common input to
◮ Conserved structures: pairwise proximities, gene set proximities
◮ Rearrangements: single operation models, combined operations
◮ Ancestral genome reconstruction: median-of-three, whole genome duplication
◮ Other applications: contig layouting, gene family prediction, phylogenetic distances]
Combined methods for conserved structure detection, ancestral genome reconstruction and gene family prediction.
Conserved structures
Gene similarity graph
[Figure: gene similarity graph of 3 genomes]
Partial k-(dimensional) matching
Given a gene similarity graph $B = (G_1, \ldots, G_k, E)$, a partial k-matching $M \subseteq E$ is a selection of edges such that for each connected component $C \subseteq B_M := (G_1, \ldots, G_k, M)$ no two genes in $C$ belong to the same genome.
For k = 3: $2^k - 1 = 7$ valid components
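The definition can be checked mechanically. The following is a minimal sketch, not code from the talk: it assumes genes are given as strings, that a hypothetical dictionary `genome_of` maps each gene to its genome, and that edges are plain pairs of genes. A union-find structure collects the connected components of the selected edges, and the selection is rejected as soon as one component contains two genes of the same genome.

```python
from collections import defaultdict

def is_partial_k_matching(genome_of, edges):
    """Return True if every connected component induced by `edges`
    contains at most one gene per genome."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the two endpoints of every selected edge.
    for u, v in edges:
        parent[find(u)] = find(v)

    # Record which genomes occur in each component.
    genomes_seen = defaultdict(set)
    for gene in list(parent):
        root = find(gene)
        if genome_of[gene] in genomes_seen[root]:
            return False  # two genes of one genome in the same component
        genomes_seen[root].add(genome_of[gene])
    return True

# Hypothetical toy data: genome A has two genes, B and C one each.
genome_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
valid = is_partial_k_matching(genome_of, [("a1", "b1"), ("b1", "c1")])
invalid = is_partial_k_matching(genome_of, [("a1", "b1"), ("b1", "a2")])
print(valid, invalid)
```

In the second call the path a1, b1, a2 joins two genes of genome A into one component, so that selection is not a partial k-matching.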
Gene similarity graph of 3 genomes: how to construct such a matching?
Assessing matching properties
Adjacency: proximity relation between two genes
Adjacency score for consecutive genes (g, g′) in genome G and (h, h′) in genome H:
$$ s(g, g', h, h') = \begin{cases} w(e_{g,h}) \cdot w(e_{g',h'}) & \text{if } (g, g'), (h, h') \text{ form a conserved adjacency} \\ 0 & \text{otherwise} \end{cases} $$
Adjacency measure in M:
$$ \mathrm{adj}(M) = \sum_{G,H} \; \sum_{\substack{g \text{ left of } g' \text{ in } G \\ h, h' \text{ in } H}} s(g, g', h, h') $$
Similarity measure in M:
$$ \mathrm{edg}(M) = \sum_{e \in M} w(e) $$
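Both measures are easy to compute for a single pair of genomes. The sketch below rests on assumptions that the slides do not spell out: genomes are unsigned lists of gene names, the matching is a hypothetical dictionary from edges (g, h) to weights with every gene in at most one edge, and "conserved adjacency" means that g, g′ are consecutive in G while their matched partners h, h′ are consecutive in H in either order. For k genomes, the outer sum over G, H would add up these values over all genome pairs.

```python
def edg(matching):
    """edg(M): sum of the weights w(e) of all edges e in M."""
    return sum(matching.values())

def adj(genome_g, genome_h, matching):
    """adj(M) for one genome pair: sum of w(e_{g,h}) * w(e_{g',h'})
    over all conserved adjacencies (assumed definition, see lead-in)."""
    position_in_h = {gene: i for i, gene in enumerate(genome_h)}
    partner = {g: h for (g, h) in matching}
    total = 0.0
    # g is the left neighbour of g_next in G.
    for g, g_next in zip(genome_g, genome_g[1:]):
        if g in partner and g_next in partner:
            h, h_next = partner[g], partner[g_next]
            # The partners must be direct neighbours in H.
            if abs(position_in_h[h] - position_in_h[h_next]) == 1:
                total += matching[(g, h)] * matching[(g_next, h_next)]
    return total

# Hypothetical toy instance.
genome_g = ["g1", "g2", "g3"]
genome_h = ["h1", "h2", "h3"]
matching = {("g1", "h2"): 0.9, ("g2", "h1"): 0.8, ("g3", "h3"): 0.5}
print(adj(genome_g, genome_h, matching), edg(matching))
```

Here only the pair (g1, g2) contributes to adj, because its partners h2 and h1 are neighbours in H, while the partners of (g2, g3) are two positions apart.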
Family-free Adjacencies Problem
Find matching M that maximizes the following formula:
$$ F_\alpha(M) = \alpha \cdot \mathrm{adj}(M) + (1 - \alpha) \cdot \mathrm{edg}(M) $$
[Figure: scale for α, from similarity (α = 0) to synteny (α = 1)]
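To see how α shifts the optimum between similarity and synteny, a tiny instance can be solved by exhaustive search. This is an illustrative sketch only, not the method of the talk: it enumerates every subset of the candidate edges, which takes exponential time. It assumes two unsigned genomes, a hypothetical dictionary `edges` of weighted candidate edges, and the same reading of "conserved adjacency" as consecutive genes whose matched partners are consecutive in either order.

```python
from itertools import combinations

def f_alpha(genome_g, genome_h, matching, alpha):
    """F_alpha(M) = alpha * adj(M) + (1 - alpha) * edg(M) for two genomes."""
    pos_h = {gene: i for i, gene in enumerate(genome_h)}
    partner = {g: h for (g, h) in matching}
    adj = 0.0
    for g, g_next in zip(genome_g, genome_g[1:]):
        if g in partner and g_next in partner:
            h, h_next = partner[g], partner[g_next]
            if abs(pos_h[h] - pos_h[h_next]) == 1:
                adj += matching[(g, h)] * matching[(g_next, h_next)]
    edg = sum(matching.values())
    return alpha * adj + (1 - alpha) * edg

def best_matching(genome_g, genome_h, edges, alpha):
    """Try every sub-matching of `edges` and keep the best-scoring one."""
    best, best_score = {}, 0.0
    edge_list = list(edges.items())
    for size in range(1, len(edge_list) + 1):
        for subset in combinations(edge_list, size):
            g_side = {g for (g, _), _ in subset}
            h_side = {h for (_, h), _ in subset}
            if len(g_side) < size or len(h_side) < size:
                continue  # some gene is used twice: not a matching
            candidate = dict(subset)
            score = f_alpha(genome_g, genome_h, candidate, alpha)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score

# Hypothetical instance: g2 is most similar to h3, but h2 preserves gene order.
genome_g = ["g1", "g2"]
genome_h = ["h1", "h2", "h3"]
edges = {("g1", "h1"): 0.6, ("g2", "h2"): 0.6, ("g2", "h3"): 0.9}
by_similarity, _ = best_matching(genome_g, genome_h, edges, alpha=0.0)
by_synteny, _ = best_matching(genome_g, genome_h, edges, alpha=1.0)
print(by_similarity)
print(by_synteny)
```

With α = 0 the search pairs g2 with its most similar gene h3, while with α = 1 it prefers h2, because only that choice yields a conserved adjacency.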
Gene set proximities: gene clusters
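Relaxation: conserved neighborhood up to θ > 0 genes
Scoring θ-adjacencies:
$$ s_\theta(g, g', h, h') = \begin{cases} w(e_{g,h}) \cdot w(e_{g',h'}) & \text{if } (g, g') \text{ and } (h, h') \text{ form a } \theta\text{-adjacency} \\ 0 & \text{otherwise} \end{cases} $$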
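The relaxed score can be sketched in the same way as the strict one. The slides do not define a θ-adjacency formally, so the code below assumes one plausible reading: two genes of the same genome are θ-adjacent if their positions differ by at least 1 and at most θ, and (g, g′), (h, h′) form a θ-adjacency if both pairs are θ-adjacent and the edges (g, h) and (g′, h′) are in the matching. Genomes are unsigned lists and `matching` is a hypothetical dictionary from edges to weights.

```python
def theta_adj(genome_g, genome_h, matching, theta):
    """Sum of s_theta over all gene pairs of one genome pair,
    under the assumed definition of a theta-adjacency."""
    pos_h = {gene: i for i, gene in enumerate(genome_h)}
    partner = {g: h for (g, h) in matching}
    total = 0.0
    for i, g in enumerate(genome_g):
        # g2 ranges over the genes at most theta positions to the right of g.
        for g2 in genome_g[i + 1 : i + 1 + theta]:
            if g in partner and g2 in partner:
                h, h2 = partner[g], partner[g2]
                if 1 <= abs(pos_h[h] - pos_h[h2]) <= theta:
                    total += matching[(g, h)] * matching[(g2, h2)]
    return total

# Hypothetical instance: an extra gene x separates h1 and h2 in H.
genome_g = ["g1", "g2"]
genome_h = ["h1", "x", "h2"]
matching = {("g1", "h1"): 1.0, ("g2", "h2"): 0.5}
strict = theta_adj(genome_g, genome_h, matching, theta=1)
relaxed = theta_adj(genome_g, genome_h, matching, theta=2)
print(strict, relaxed)
```

With θ = 1 this reduces to the strict adjacency score and the inserted gene x breaks the adjacency. With θ = 2 the neighborhood is still counted as conserved.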
Based on θ-adjacencies we can define gene clusters as pairs of intervals with large maximum weight matching M:
Gene set proximities: consimilar intervals
Calculating a maximum matching for all pairs of intervals is expensive. Therefore, use the unweighted gene similarity graph.
Consimilar interval: many edges inside, no edges to neighbors.
Algorithm: $O(n^3)$ time
Ranking by score of maximum weight matching inside the intervals.
Rearrangements
DCJ – Double Cut and Join
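DCJ accounts for rearrangement events: inversion, translocation, fusion, fission, transposition, block interchange
Adjacency graph: distance
$$ d_{DCJ} = N - C - \frac{I}{2} $$
where N is the number of genes, C the number of cycles and I the number of odd paths in the adjacency graph.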
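For two genomes with identical gene content, the distance formula can be evaluated directly. The sketch below is one possible implementation under stated assumptions, not code from the talk: a genome is a list of `(genes, is_circular)` chromosomes, genes are nonzero signed integers, and every chromosome is non-empty. Instead of building the adjacency graph explicitly, it links gene extremities that are adjacent in either genome. Each edge of the adjacency graph corresponds to one extremity, so cycles correspond to cycles and odd paths to path components with an odd number of extremities.

```python
def extremity_neighbors(genome):
    """Map each gene extremity to its adjacency partner (None at a telomere)."""
    neighbor = {}
    for genes, is_circular in genome:
        ends = []
        for gene in genes:
            name = abs(gene)
            tail, head = (name, "t"), (name, "h")
            # A gene on the forward strand is read tail -> head.
            ends += [tail, head] if gene > 0 else [head, tail]
        # ends = [l1, r1, l2, r2, ...]; adjacencies are (r1, l2), (r2, l3), ...
        for x, y in zip(ends[1:-1:2], ends[2::2]):
            neighbor[x], neighbor[y] = y, x
        if is_circular:
            neighbor[ends[-1]], neighbor[ends[0]] = ends[0], ends[-1]
        else:
            neighbor[ends[0]] = neighbor[ends[-1]] = None
    return neighbor

def dcj_distance(genome_a, genome_b):
    """d_DCJ = N - C - I/2 for two genomes over the same set of genes."""
    nb_a, nb_b = extremity_neighbors(genome_a), extremity_neighbors(genome_b)
    if nb_a.keys() != nb_b.keys():
        raise ValueError("genomes must have the same gene content")
    seen, cycles, odd_paths = set(), 0, 0
    for start in nb_a:
        if start in seen:
            continue
        # Collect the connected component containing `start`.
        component, stack = [], [start]
        seen.add(start)
        while stack:
            x = stack.pop()
            component.append(x)
            for y in (nb_a[x], nb_b[x]):
                if y is not None and y not in seen:
                    seen.add(y)
                    stack.append(y)
        if all(nb_a[x] is not None and nb_b[x] is not None for x in component):
            cycles += 1          # no telomere in the component: a cycle
        elif len(component) % 2 == 1:
            odd_paths += 1       # path with an odd number of extremities
    n_genes = len(nb_a) // 2
    return n_genes - cycles - odd_paths // 2

# One linear chromosome each; genes 1..6 are inverted in the second genome.
genome_a = [([1, 2, 3, 4, 5, 6, 7, 8], False)]
genome_b = [([-6, -5, -4, -3, -2, -1, 7, 8], False)]
print(dcj_distance(genome_a, genome_b))
```

The example pair differs by a single inversion of genes 1 to 6 (with signs added here as an assumption), which gives six cycles and two odd paths among eight genes.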
From the gene similarity graph to the weighted adjacency graph (WAG):
Family-free Rearrangement Problem
Find matching $M_{GH}$ that maximizes the following formula:
$$ F^{DCJ}_\alpha(M_{GH}) = \alpha \cdot \mathrm{cyc}(M_{GH}) + (1 - \alpha) \cdot \mathrm{edg}(M_{GH}) $$
where
$$ \mathrm{cyc}(M_{GH}) = \sum_{C \in \mathcal{C}(M_{GH})} \left( \frac{1}{|C|} \sum_{e \in C} w(e) \right) $$
and $\mathcal{C}(M_{GH})$ := set of connected components in $WAG(M_{GH})$
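The cyc term rewards each connected component by the mean weight of its edges. The short sketch below only evaluates the formula; it does not construct the weighted adjacency graph, whose definition is given by a figure not reproduced here. It assumes that the components are already available as lists of edge weights, that |C| counts the edges of C, and that the matching weights are passed separately. All parameter names are hypothetical.

```python
def cyc(components):
    """Sum over components C of (1/|C|) * sum of w(e) for e in C."""
    return sum(sum(weights) / len(weights) for weights in components if weights)

def f_dcj(components, matching_weights, alpha):
    """F^DCJ_alpha = alpha * cyc + (1 - alpha) * edg."""
    return alpha * cyc(components) + (1 - alpha) * sum(matching_weights)

# Hypothetical WAG with one 2-edge component and one 4-edge component.
components = [[1.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
matching_weights = [1.0, 1.0, 0.5, 0.5]
print(cyc(components), f_dcj(components, matching_weights, alpha=0.5))
```

Because every component contributes its mean weight once, a matching that produces many components with heavy edges scores higher, which mirrors the role of the cycle count C in the unweighted DCJ distance.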
Ancestral genome reconstruction
Reconstruction of Ancestral Adjacencies
Emphasize adjacencies that are conserved in closely related genomes.
Phylogeny Aware Optimization Problem
Given an additive distance matrix $D^T$, find matching M that maximizes the following formula:
$$ F_{\alpha,T}(M) = \sum_{G,H} \left( D^T_{max} - D^T_{GH} \right) \left( \alpha \cdot \mathrm{adj}(M_{GH}) + (1 - \alpha) \cdot \mathrm{edg}(M_{GH}) \right) $$
where $D^T_{max} = \max_{G,H} \{ D^T_{GH} \}$
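The effect of the weighting factor is easiest to see on numbers. The sketch below evaluates the objective for given per-pair values; it assumes that adj and edg have already been computed for every genome pair, and the dictionaries `dist` and `pair_scores` are hypothetical inputs, not data from the talk.

```python
def phylogeny_aware_score(dist, pair_scores, alpha):
    """Evaluate F_{alpha,T}(M).

    dist[(G, H)]        : tree distance D^T_GH between genomes G and H
    pair_scores[(G, H)] : tuple (adj(M_GH), edg(M_GH)) for that pair
    """
    d_max = max(dist.values())
    return sum(
        (d_max - dist[pair]) * (alpha * adj + (1 - alpha) * edg)
        for pair, (adj, edg) in pair_scores.items()
    )

# Hypothetical additive distances: A and B are close, C is distant from both.
dist = {("A", "B"): 1.0, ("A", "C"): 3.0, ("B", "C"): 3.0}
pair_scores = {pair: (2.0, 4.0) for pair in dist}
print(phylogeny_aware_score(dist, pair_scores, alpha=0.5))
```

In this example only the closely related pair (A, B) contributes, since the factor $D^T_{max} - D^T_{GH}$ is zero for the two most distant pairs. Adjacencies conserved between close relatives therefore dominate the objective.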
Conclusion and outlook
[Figure: overview diagram of the Family-free Principle, from gene similarities to conserved structures, rearrangements, ancestral genome reconstruction and other applications]
Combined methods for conserved structure detection, ancestral genome reconstruction and gene family prediction.
Thanks to:
Marília D. V. Braga, Cedric Chauve, Daniel Doerr, Katharina Jahn, Annelyse Thévenin, Roland Wittler
Andreas Dress: Happy birthday!
You!