Hierarchical organization of syntenic blocks in large genomic datasets


Hierarchical organization of syntenic blocks in large genomic datasets. Daniel Doerr, Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University. Workshop on Data Structures in Bioinformatics, February 4, 2020.


slide-1
SLIDE 1

Hierarchical organization of syntenic blocks in large genomic datasets

Daniel Doerr Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University

Workshop on Data Structures in Bioinformatics, February 4, 2020

slide-2
SLIDE 2

Hierarchical organization of syntenic blocks in large genomic datasets 1

Outline:

  • Introduction
  • Synteny hierarchies for permutations
  • Synteny hierarchies for sequences
  • PSyCHO

slide-4
SLIDE 4

Data structures for large-scale comparisons

Objective: multi-species whole-genome comparisons
Solution: pan-genome data structures

  • only suitable for very similar genomes

slide-5
SLIDE 5

Abstraction by decomposition

  • genomes decomposed into syntenic blocks
  • essential for studying genome evolution between distant species
  • current studies restricted to protein-coding genes
    • omission of many other conserved genomic regions

[Figure: a syntenic block marked on a double-stranded DNA sequence]

slide-6
SLIDE 6

What is synteny?

A zoo of definitions:

  • “the same ribbon” (Renwick, 1971): set of markers co-located on the same chromosome
  • markers must be collinear
  • local rearrangements allowed
  • mostly tool-centric: FISH, GRIMM/DRIMM-Synteny, Cyntenator, i-ADHoRe, Sibelia, CoGe, Satsuma, etc.

[Figure: syntenic blocks A and B in genomes G and H]

slide-7
SLIDE 7

What is synteny?

homology assignment: set ℋ of pairwise (equivalence) relations

Definition [Ghiurcuta and Moret, 2014]
Given two genomes G, H and a homology assignment ℋ, two SBs A ⊆ G and B ⊆ H are homologous if for each

  • a ∈ A: ∃ (a, h) ∈ ℋ, h ∈ H ⟹ (a, b′) ∈ ℋ, b′ ∈ B
  • b ∈ B: ∃ (b, g) ∈ ℋ, g ∈ G ⟹ (a′, b) ∈ ℋ, a′ ∈ A

syntenic block (SB): a single marker or a set of contiguous syntenic blocks
dilemma: there is no one true decomposition of genomes into syntenic blocks

[Figure: two alternative decompositions of genomes G and H into homologous blocks A and B]
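
The two conditions of the definition can be checked mechanically. The following is a minimal Python sketch, not taken from PSyCHO: the marker names, the `symmetric` helper, and the dictionary encoding of the homology assignment are assumptions made purely for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

Marker = str
Homology = Dict[Marker, Set[Marker]]


def symmetric(pairs: Iterable[Tuple[Marker, Marker]]) -> Homology:
    """Hypothetical helper: store each homology pair in both directions."""
    rel: Homology = {}
    for x, y in pairs:
        rel.setdefault(x, set()).add(y)
        rel.setdefault(y, set()).add(x)
    return rel


def homologs_in(marker: Marker, region: Set[Marker], homology: Homology) -> Set[Marker]:
    """Homologs of `marker` that lie inside `region`."""
    return homology.get(marker, set()) & region


def are_homologous(A: Set[Marker], B: Set[Marker],
                   G: Set[Marker], H: Set[Marker],
                   homology: Homology) -> bool:
    """Check both conditions of the definition for blocks A in G and B in H."""
    # Every a in A that has a homolog somewhere in H needs one inside B.
    for a in A:
        if homologs_in(a, H, homology) and not homologs_in(a, B, homology):
            return False
    # Symmetric condition for every b in B.
    for b in B:
        if homologs_in(b, G, homology) and not homologs_in(b, A, homology):
            return False
    return True


# Toy genomes as marker sets (made-up data).
G = {"g1", "g2", "g3", "g4"}
H = {"h1", "h2", "h3", "h4"}
homology = symmetric([("g1", "h2"), ("g2", "h1"), ("g3", "h4")])
```

With this toy data, `are_homologous({"g1", "g2"}, {"h1", "h2"}, G, H, homology)` holds, while shrinking B to `{"h1"}` breaks the first condition because the homolog of `g1` falls outside B. A marker without any homolog, such as `g4`, never violates either condition.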

slide-9
SLIDE 9

Synteny hierarchy

What are the homologous SBs of G and H?

[Figure: genomes G and H]

slide-10
SLIDE 10

Synteny hierarchy

G and H are covered by one homologous SB pair

[Figure: genomes G and H]

slide-11
SLIDE 11

Synteny hierarchy

... but contain several other homologous SB pairs

[Figure: genomes G and H]

slide-12
SLIDE 12

Synteny hierarchies for permutations

slide-13
SLIDE 13

Common intervals in permutations

Definition
A pair of intervals of two permutations is common if they share the same set of elements.
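
As an illustration of the definition, the sketch below enumerates the common intervals of two permutations with a straightforward quadratic scan. It is a simple textbook-style approach chosen for readability, not the algorithm used in the talk; the function name and the 0-based, inclusive position pairs are assumptions.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # 0-based, inclusive positions


def common_intervals(p1: Sequence[int], p2: Sequence[int]) -> List[Tuple[Interval, Interval]]:
    """Return all pairs (interval in p1, interval in p2) of length >= 2
    that contain the same set of elements."""
    pos_in_p1 = {x: i for i, x in enumerate(p1)}
    # Rewrite p2 as positions in p1; a window of p2 is a common interval
    # exactly when its positions in p1 form a contiguous range.
    q = [pos_in_p1[x] for x in p2]
    found = []
    for i in range(len(q)):
        lo = hi = q[i]
        for j in range(i + 1, len(q)):
            lo, hi = min(lo, q[j]), max(hi, q[j])
            if hi - lo == j - i:
                found.append(((lo, hi), (i, j)))
    return found


example = common_intervals([1, 2, 3, 4], [2, 1, 4, 3])
# example == [((0, 1), (0, 1)), ((0, 3), (0, 3)), ((2, 3), (2, 3))]
```

Singletons are trivially common and are left out of the result to keep the output short.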

slide-14
SLIDE 14

Synteny hierarchy

PQ-tree [Booth and Lueker, 1976]:

  • “Q”-node: collinear
  • “P”-node: permute freely

[Figure: PQ-tree with P- and Q-nodes over genomes G and H]

slide-15
SLIDE 15

Booth and Lueker

PQ tree construction:

  • linear time w.r.t. input size, i.e., the number of 1s of an n × m matrix
    • number of markers: n
    • number of common intervals: m ∈ O(n²)
  • ... but cubic w.r.t. output size: the PQ tree has only O(n) nodes!
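
A small worked count makes the gap concrete. This is an added illustration, not from the slides: it assumes the input matrix has one row per marker and one column per common interval, and it uses the identity permutation compared with itself, where every interval is common.

```latex
% Added worked example; the matrix layout (rows = markers,
% columns = common intervals) is an assumption.
\begin{align*}
m &= \binom{n}{2} + n = \frac{n(n+1)}{2} \in \Theta(n^2)
  && \text{(common intervals, singletons included)} \\
\#\text{1s} &= \sum_{\ell=1}^{n} \ell\,(n-\ell+1) = \frac{n(n+1)(n+2)}{6} \in \Theta(n^3)
  && \text{(an interval of length } \ell \text{ contributes } \ell \text{ ones)} \\
\#\text{nodes} &= n + 1 \in O(n)
  && \text{(a single Q-node plus } n \text{ leaves)}
\end{align*}
```

In this extreme case the construction reads a cubic number of matrix entries to produce a tree of linear size.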

slide-16
SLIDE 16

Intervals of a PQ tree

Definition [Bergeron et al., 2008]
The frontier of a node is the set of labels of the leaves of the subtree rooted at this node, or a singleton comprising a leaf label.
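
To make the notion of a frontier concrete, here is a minimal sketch that collects the frontier of every node of a small tree. The nested-tuple encoding `(kind, children)` with plain leaf labels is an assumption chosen for brevity, not a representation prescribed by the talk.

```python
from typing import FrozenSet, List, Tuple, Union

# Assumed encoding: an inner node is ("P" | "Q", [children]); a leaf is its label.
Node = Union[int, Tuple[str, list]]


def frontiers(node: Node) -> Tuple[FrozenSet[int], List[FrozenSet[int]]]:
    """Return the frontier of `node` and the frontiers of all nodes in its subtree."""
    if not isinstance(node, tuple):
        leaf = frozenset([node])  # a leaf contributes the singleton of its label
        return leaf, [leaf]
    _kind, children = node
    own: FrozenSet[int] = frozenset()
    collected: List[FrozenSet[int]] = []
    for child in children:
        child_frontier, below = frontiers(child)
        own |= child_frontier
        collected.extend(below)
    collected.append(own)
    return own, collected


tree = ("Q", [("P", [1, 2, 3]), 4, 5])
root_frontier, all_frontiers = frontiers(tree)
```

For this tree the root frontier is {1, 2, 3, 4, 5}, the P-node contributes {1, 2, 3}, and each leaf contributes its own singleton.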

slide-17
SLIDE 17

Sets of common intervals in permutations

Definition [Bergeron et al., 2008]
A set of intervals I is closed if (1), …, (n) ∈ I, (1..m) ∈ I, and for each pair of intervals (i..k), (j..l) ∈ I s.t. i < j ≤ k < l, also (i..j), (j..k), (k..l), (i..l) ∈ I.

[Figure: two overlapping intervals with boundaries i, j, k, l]
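
The closure condition can be tested directly on a small set of intervals. The sketch below follows the definition as written on the slide, with two assumptions: intervals are encoded as 1-based inclusive pairs `(i, j)`, and the full interval, written (1..m) on the slide, is taken to span all n positions.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive; (i, i) is a singleton


def is_closed(intervals: Iterable[Interval], n: int) -> bool:
    """Check the closure condition for intervals over positions 1..n."""
    I = set(intervals)
    # All singletons and the full interval must be present
    # (assumption: the full interval is (1, n)).
    if any((x, x) not in I for x in range(1, n + 1)) or (1, n) not in I:
        return False
    # Every overlapping pair must bring its four derived intervals along.
    for (i, k) in I:
        for (j, l) in I:
            if i < j <= k < l:
                if not {(i, j), (j, k), (k, l), (i, l)} <= I:
                    return False
    return True


all_intervals = {(i, j) for i in range(1, 5) for j in range(i, 5)}
sparse = {(1, 1), (2, 2), (3, 3), (4, 4), (1, 4), (1, 3), (2, 4)}
```

Here `is_closed(all_intervals, 4)` holds, while `is_closed(sparse, 4)` fails because the overlapping pair (1..3), (2..4) would require (1..2), (2..3), and (3..4) as well.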

slide-19
SLIDE 19

Commuting sets

Definition [Bergeron et al., 2008]
Two intervals A, B commute if A ⊆ B or B ⊆ A or A ∩ B = ∅.

... and a set of intervals I is commuting if all pairs of its intervals commute.

slide-20
SLIDE 20

Strong intervals

Definition [Bergeron et al., 2008]
Given a set of intervals I, an interval A is strong if it commutes with all intervals B ∈ I.

The strong intervals of a closed set of intervals I are the frontier of the PQ tree of I.
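
Both notions, commuting and strong, reduce to simple comparisons of interval boundaries. The following sketch filters the strong intervals from a given set by brute force; the encoding of intervals as 1-based inclusive pairs and the function names are assumptions for illustration.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive


def commute(a: Interval, b: Interval) -> bool:
    """Two intervals commute if one contains the other or they are disjoint."""
    (i, j), (k, l) = a, b
    nested = (i <= k and l <= j) or (k <= i and j <= l)
    disjoint = j < k or l < i
    return nested or disjoint


def strong_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Intervals that commute with every interval of the set (quadratic scan)."""
    I = list(intervals)
    return sorted(a for a in I if all(commute(a, b) for b in I))


all_intervals = [(i, j) for i in range(1, 4) for j in range(i, 4)]
strong = strong_intervals(all_intervals)
# strong == [(1, 1), (1, 3), (2, 2), (3, 3)]
```

In this example (1..2) and (2..3) overlap without nesting, so neither is strong. What remains are the three singletons and the full interval, which matches the frontiers of a single Q-node with three leaves.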

slide-21
SLIDE 21

Synteny hierarchies for sequences

slide-22
SLIDE 22

SB hierarchy

Context-dependency: two sets of common intervals intersect only if all their intervals intersect in the corresponding sequences.

[Figure: genomes G, H, I]

slide-24
SLIDE 24

Sets of common intervals in sequences

Definition
A set of intervals I is near-closed if (1), …, (n) ∈ I, (1..m) ∈ I, and for each pair of intervals (i..k), (j..l) ∈ I s.t. i < j ≤ k < l, also (i..l) ∈ I.

Lemma
Let I be a near-closed set of intervals. Then there exists a unique PQ-tree with frontier F such that for the set of strong intervals I′ ⊆ I it holds that I′ ⊆ F and |I| ≥ ⌈1/2 · |F|⌉.

[Figure: two overlapping intervals with boundaries i, j, k, l]
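
The near-closed condition only asks for the union (i..l) of each overlapping pair. A minimal check, under the assumptions that intervals are 1-based inclusive pairs and that the full interval spans all n positions, could look like this:

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive; (i, i) is a singleton


def is_near_closed(intervals: Iterable[Interval], n: int) -> bool:
    """Check the near-closed condition for intervals over positions 1..n."""
    I = set(intervals)
    # Singletons and the full interval are required
    # (assumption: the full interval is (1, n)).
    if any((x, x) not in I for x in range(1, n + 1)) or (1, n) not in I:
        return False
    # Only the union of each overlapping pair has to be present.
    return all((i, l) in I
               for (i, k) in I for (j, l) in I
               if i < j <= k < l)


singletons5 = {(x, x) for x in range(1, 6)}
good = {(1, 1), (2, 2), (3, 3), (4, 4), (1, 4), (1, 3), (2, 4)}
bad = singletons5 | {(1, 5), (1, 3), (2, 4)}
```

The set `good` is near-closed over four positions because the union of (1..3) and (2..4) is the full interval. The set `bad` is not near-closed over five positions, since the union (1..4) is missing.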

slide-25
SLIDE 25

PSyCHO

slide-26
SLIDE 26

PSyCHO

PSyCHO: Principled Synteny using Common Intervals and Hierarchical Organization
http://github.com/danydoerr/PSyCHO

slide-27
SLIDE 27

Construction of a synteny hierarchy

[Figure: workflow for genomes G, H, I, from raw genomic sequences to the synteny hierarchy]

  • 1. genome segmentation: raw genomic sequences → marker-order sequences
  • 2. discovery of homologous SBs (marker similarity graph)
  • 3. synteny hierarchy construction

slide-28
SLIDE 28

Similarity graph, syntenic contexts, homologous SBs

[Figure: Reference, G2, G3; annotation: subject to indel handling]

  • 1. reference-based reconstruction of syntenic contexts
    computational problem: finding δ-teams in sequences
  • 2. handling of insertions/deletions (work in progress)
  • 3. reference-based discovery of homologous syntenic blocks in each context
    computational problem: enumerating common intervals in k sequences

slide-31
SLIDE 31

Analysis of simulated genomes

Simulation setup: 5 species, 1000 markers of length 300; point mutations, rearrangements, insertions, deletions, and duplications

slide-32
SLIDE 32

Analysis of simulated genomes

Weighted Synteny Score: fraction of markers in a homologous set of syntenic blocks that have at least one homologous counterpart in each block or have no homologous counterpart at all in the respective genomes.
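
One way to read this score is sketched below. It is an interpretation, not the reference implementation of Ghiurcuta and Moret or of PSyCHO: a marker counts as consistent if, for every other block of the set, it either has a homolog inside that block or has no homolog anywhere in that block's genome. The data layout (one block per genome, homology as a symmetric dictionary) and all names are assumptions.

```python
from typing import Dict, List, Set

Marker = str


def weighted_synteny_score(blocks: List[Set[Marker]],
                           genomes: List[Set[Marker]],
                           homology: Dict[Marker, Set[Marker]]) -> float:
    """Fraction of consistent markers; blocks[i] is assumed to lie in genomes[i]."""
    total = 0
    consistent = 0
    for i, block in enumerate(blocks):
        for marker in block:
            total += 1
            homologs = homology.get(marker, set())
            ok = True
            for j, other in enumerate(blocks):
                if j == i:
                    continue
                # Violation: homologs exist in genome j, but none inside block j.
                if homologs & genomes[j] and not homologs & other:
                    ok = False
                    break
            consistent += ok
    return consistent / total if total else 0.0


# Toy data (made up): g2's only homolog h3 lies outside the block in H.
genomes = [{"g1", "g2", "g3"}, {"h1", "h2", "h3"}]
homology = {"g1": {"h1"}, "h1": {"g1"}, "g2": {"h3"}, "h3": {"g2"}}
score = weighted_synteny_score([{"g1", "g2"}, {"h1", "h2"}], genomes, homology)
# score == 0.75
```

Three of the four block markers are consistent here: `g1` and `h1` match each other, `h2` has no homolog at all, and only `g2` points outside the partner block.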

slide-34
SLIDE 34

Analysis of Drosophila genomes

species          ID     scaffolds  size (Mbp)  CDSs    markers
D. melanogaster  D.mel  7          120.3       30,443  98,214
D. simulans      D.sim  6          118.2       24,119  100,549
D. yakuba        D.yak  6          119.5       23,304  100,774

slide-35
SLIDE 35

Analysis of Drosophila genomes

          genome  PSyCHO                     i-ADHoRe
coverage  D.mel   0.782                      0.682
          D.mel   0.823                      0.840
          D.yak   0.783                      0.763
#SBs              top: 10, int. nodes: 2090  80

slide-36
SLIDE 36

Weighted Synteny Score [Ghiurcuta and Moret, 2014]

[Figure: histogram for i-ADHoRe; x-axis: weighted synteny score (0.0 to 1.0), y-axis: count]

PSyCHO: 1 (by def.)

slide-37
SLIDE 37

Thank you!

slide-38
SLIDE 38

References

Bergeron, A., Chauve, C., de Montgolfier, F., and Raffinot, M. (2008). Computing common intervals of K permutations, with applications to modular decomposition of graphs. SIAM Journal on Discrete Mathematics, 22(3):1022–1039.

Booth, K. S. and Lueker, G. S. (1976). Testing for the Consecutive Ones Property, Interval Graphs, and Graph Planarity Using PQ-Tree Algorithms. JCSS.

Brejová, B., Burger, M., and Vinař, T. (2011). Automated Segmentation of DNA Sequences with Complex Evolutionary Histories. In Proc. of WABI 2011, volume 6833, pages 1–13.

Ghiurcuta, C. G. and Moret, B. M. E. (2014). Evaluating synteny for improved comparative studies. Bioinformatics, 30(12):i9–18.

Meidanis, J. and Munuera, E. G. (1996). A theory for the consecutive ones property. Proceedings of WSP.

Visnovská, M., Vinař, T., and Brejová, B. (2013). DNA Sequence Segmentation Based on Local Similarity. In Proc. of ITAT, volume 1003, pages 36–43.