SLIDE 1

An Algorithmic View on Multi-related-segments: a unifying model for approximate common interval

X. Yang, F. Sikora, G. Blin, S. Hamel, R. Rizzi, S. Aluru

GSAP, Broad Institute of MIT & Harvard – USA
Université Paris-Est, LIGM, UMR 8049 – France
DIRO - Université de Montréal - QC – Canada
DIMI - Università di Udine - Udine – Italy
Lehrstuhl für Bioinformatik, Friedrich-Schiller-Universität Jena – Germany
DECE, Iowa State University – USA

May 2012

Guillaume Blin An Algorithmic View on MRS

SLIDE 2

Comparing genomes

◮ A set of genes located close to one another on multiple chromosomes often implies that they originate from the same ancestral genomic segment or are involved in the same biological process

◮ . . . hence the search for gene clusters between genomes.

◮ A gene cluster = a set of genes appearing in spatial proximity along chromosomes.

SLIDE 3

Key properties for modeling gene proximity

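◮ Hoberman and Durand 2005: based on observing the co-occurrence of a gene set A (ancestral genes) in different chromosomal segments

◮ A is subject to evolutionary constraints

[Figure: three segments labeled 1, 2, 3 and the ancestral gene set A, with $\beta = 2$, $\epsilon_m = 3$, $\alpha = 1$, $\epsilon_l = 4$, $\epsilon_t = 19$]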

SLIDE 4

Key properties for modeling gene proximity


◮ evidence of any gene of interest as being ancestral: observing a minimum of $\beta$ occurrences of any gene of A

⇒ reducing the possibility of misinterpreting what is in fact a chance occurrence

SLIDE 5

Key properties for modeling gene proximity


◮ evidence of any gene of interest as being ancestral: $\beta$

◮ sufficient contribution of each segment to A: each segment contains at least $\epsilon_m$ different ancestral genes

SLIDE 6

Key properties for modeling gene proximity


◮ evidence of any gene of interest as being ancestral: $\beta$

◮ sufficient contribution of each segment to A: $\epsilon_m$

◮ local and global ancestral gene density: at most $\alpha$ interleaving genes between two consecutive ancestral genes, at most $\epsilon_l$ gene losses per segment, and at most $\epsilon_t$ total gene losses among all segments
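The five constraints above can be checked mechanically on a candidate set of segments. The following is only a minimal sketch under stated assumptions: segments are lists of gene names, a "gene loss" is taken to mean an ancestral gene absent from a segment, and the function names `max_gap` and `satisfies_constraints` are hypothetical, not taken from the talk.

```python
from typing import List, Set


def max_gap(segment: List[str], ancestral: Set[str]) -> int:
    """Largest number of non-ancestral genes between two consecutive ancestral genes."""
    positions = [i for i, g in enumerate(segment) if g in ancestral]
    return max((b - a - 1 for a, b in zip(positions, positions[1:])), default=0)


def satisfies_constraints(segments: List[List[str]], ancestral: Set[str],
                          beta: int, eps_m: int, alpha: int,
                          eps_l: int, eps_t: int) -> bool:
    total_losses = 0
    for seg in segments:
        present = ancestral & set(seg)
        # Assumption: a gene loss is an ancestral gene missing from the segment.
        losses = len(ancestral) - len(present)
        if len(present) < eps_m or losses > eps_l or max_gap(seg, ancestral) > alpha:
            return False
        total_losses += losses
    if total_losses > eps_t:
        return False
    # Every ancestral gene must be observed in at least beta segments.
    return all(sum(g in seg for seg in segments) >= beta for g in ancestral)
```

With three segments over A = {a, b, c, d}, each missing one ancestral gene, the check passes for $\beta = 2$, $\epsilon_m = 3$, $\alpha = 1$, $\epsilon_l = 1$, $\epsilon_t = 3$ and fails as soon as $\alpha$ is lowered to 0.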

SLIDE 7

Existing models

◮ Gene cluster definitions

  ◮ conserved segments, common intervals, conserved intervals, gene teams, approximate common intervals

◮ Conserved segments – which require full conservation

SLIDE 8

Existing models


◮ Common intervals – genes must occur consecutively, regardless of their order
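As an illustration of the common interval definition, here is a small sketch that tests whether a gene set forms a contiguous block on every chromosome. It assumes chromosomes are permutations (each gene occurs at most once); the function name `is_common_interval` is hypothetical.

```python
from typing import Hashable, Sequence, Set


def is_common_interval(genes: Set[Hashable],
                       chromosomes: Sequence[Sequence[Hashable]]) -> bool:
    """True if `genes` occupies a contiguous block on every chromosome, in any order."""
    for chrom in chromosomes:
        positions = [i for i, g in enumerate(chrom) if g in genes]
        if len(positions) != len(genes):
            return False  # some gene of the set is missing on this chromosome
        if positions and positions[-1] - positions[0] + 1 != len(genes):
            return False  # a gene from outside the set interrupts the block
    return True
```

For the permutations (1, 2, 3, 4, 5) and (5, 3, 2, 4, 1), the set {2, 3, 4} is a common interval while {1, 2} is not.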

SLIDE 9

Existing models


◮ Conserved intervals – common intervals, framed by the same two genes

SLIDE 10

Existing models


◮ Gene teams – genes in a cluster must not be interrupted by long stretches of genes not belonging to the cluster
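The gene team condition can be sketched in the same spirit. The bound on interruptions is not named on the slide, so the parameter `delta` below is an assumption, as is measuring an interruption by the number of intervening genes; the function name `is_gene_team` is hypothetical, and maximality of the team is not checked.

```python
from typing import Hashable, Sequence, Set


def is_gene_team(genes: Set[Hashable],
                 chromosomes: Sequence[Sequence[Hashable]],
                 delta: int) -> bool:
    """True if, on every chromosome, consecutive genes of the set are
    separated by at most `delta` genes from outside the set."""
    for chrom in chromosomes:
        positions = [i for i, g in enumerate(chrom) if g in genes]
        if len(positions) != len(genes):
            return False  # assumes each gene of the set occurs exactly once
        if any(b - a - 1 > delta for a, b in zip(positions, positions[1:])):
            return False  # interruption longer than allowed
    return True
```

Setting `delta = 0` in this sketch forbids any interruption, which reduces the test to the common interval condition on permutations.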

SLIDE 11

Existing models


◮ Approximate common intervals – common intervals that may contain a few genes from outside the cluster

SLIDE 12

MULTI-RELATED-SEGMENTS model

◮ A unified model to capture approximate common intervals

◮ An MRS is a set of maximal segments capturing the previously mentioned key properties ($\{\beta, \epsilon_m, \alpha, \epsilon_l, \epsilon_t\}$)

◮ It captures existing models:

  ◮ MRS = CI (common intervals) when $\beta = k$, $\epsilon_m = |A|$ and $\alpha = 0$
  ◮ MRS = GT (gene teams) when $\alpha \geq 0$
  ◮ MRS further captures gene loss events without strong pairwise similarity information

SLIDE 13

Finding MRS

◮ The problem then consists in identifying the MRS in a set of k chromosomes

◮ Considering the ancestral gene set A as known a priori, the problem, termed LOCATEMRS, corresponds to locating, given k chromosomes $S = \{S_1, S_2, \ldots, S_k\}$ represented as strings, a feasible MRS originating from A.

◮ LOCATEMRS is NP-hard even in the restricted case where the $S_i$'s are permutations and no gene insertions are allowed ($\alpha = 0$) ⇒ reduction from Exact Cover by 3-Sets

◮ LOCATEMRS is fixed-parameter tractable with respect to the parameter |A| when $\alpha = 0$

SLIDE 14

Finding MRS

◮ When A is unknown, identifying all MRS is hard to approximate (APX-hard by reduction from Minimum Set Cover) even in the restricted case where the $S_i$'s are permutations

◮ With the removal of the maximum number of gene losses constraint (i.e. $\epsilon_t = \infty$) and of the maximum number of substrings per input sequence constraint (i.e. $\alpha = \infty$), a polynomial algorithm can be derived.

SLIDE 15

LOCATEMRS is FPT

◮ To show this, we provide a dynamic programming solution.

[Figure: three segments labeled 1, 2, 3 and the ancestral gene set A]

◮ Segments can be pruned considering $\epsilon_l$ and A. Since $\alpha = 0$, one has to select exactly one substring of interest in each sequence $S_j$.

SLIDE 16

LOCATEMRS is FPT


◮ A naive algorithm = try all such combinations and check the parameters ⇒ an exponential running time.
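The naive strategy can be sketched as an exhaustive search over one candidate substring per sequence. This is an illustration only: it checks the $\beta$ constraint alone (the other parameters are assumed to have been handled when pruning candidates), and the name `naive_locate` and the input layout are hypothetical.

```python
from itertools import product
from typing import List, Optional, Sequence, Set


def naive_locate(candidates: Sequence[Sequence[List[str]]],
                 ancestral: Set[str], beta: int) -> Optional[List[List[str]]]:
    """candidates[j] lists the candidate substrings (gene lists) of sequence S_j.

    Tries every way of picking one substring per sequence and returns the
    first pick in which each ancestral gene occurs in at least beta substrings.
    """
    for choice in product(*candidates):
        if all(sum(g in sub for sub in choice) >= beta for g in ancestral):
            return list(choice)
    return None
```

The number of combinations explored is the product of the numbers of candidates per sequence, hence exponential in the number k of sequences.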

SLIDE 19

LOCATEMRS is FPT


◮ By using an efficient dynamic programming strategy, one may confine the exponential factor to the size of the ancestral gene set.

SLIDE 20

LOCATEMRS is FPT


◮ No need to compute the exact number of times each character occurs; it suffices to ensure that it occurs in at least $\beta$ (usually $\beta = 2$) substrings of the solution.

SLIDE 21

LOCATEMRS is FPT


◮ Considering a fixed ordering $(a_1, a_2, \ldots, a_{|A|})$ of the characters of A, one has to store a count vector $C = (c_1, c_2, \ldots, c_{|A|})$, where $c_i \in \{0, 1, \ldots, \beta\}$ denotes the number of substrings containing $a_i$. Here, $C = (2, 2, 1, 0, 2, 1, 0)$
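A count vector of this kind might be computed as follows. This is a sketch under the assumption that counts are capped at $\beta$, which is what restricting $c_i$ to $\{0, 1, \ldots, \beta\}$ suggests; the name `count_vector` and the gene labels are hypothetical.

```python
from typing import List, Sequence, Tuple


def count_vector(substrings: Sequence[List[str]],
                 order: Sequence[str], beta: int) -> Tuple[int, ...]:
    """c_i = number of chosen substrings containing the i-th ancestral gene,
    capped at beta since larger counts carry no extra information."""
    return tuple(min(beta, sum(g in sub for sub in substrings)) for g in order)


order = ["a1", "a2", "a3", "a4", "a5", "a6", "a7"]
chosen = [["a1", "a2", "a3"], ["a3", "a6", "a7"]]
vector = count_vector(chosen, order, beta=2)
```

Here `a3` is the only gene shared by both substrings, so its count is 2, while `a4` and `a5` remain at 0.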

SLIDE 22

LOCATEMRS is FPT

[Figure: segments labeled 1, 2, 3, the ancestral gene set A, and candidate substrings $S_j^i$ for $i, j \in \{1, 2, 3\}$]

◮ The main property of this representation is that, given A, there are only $(\beta + 1)^{|A|}$ possible vectors, since each $c_i$ takes one of $\beta + 1$ values.

SLIDE 23

LOCATEMRS is FPT

◮ We define a boolean dynamic table D indexed by the last substring added to the solution and the vector C for the current solution: $D(S_j^i, (c_1, \ldots, c_{|A|}))$

SLIDE 24

LOCATEMRS is FPT

◮ We then proceed by row

$D(S_1^1, (1, 1, 1, 0, 0, 0, 0)) = 1$

$D(S_1^2, (0, 0, 0, 1, 1, 1, 0)) = 1$

$D(S_1^3, (0, 0, 0, 0, 1, 0, 0)) = 1$

SLIDE 25

LOCATEMRS is FPT

◮ We then proceed by row

$D(S_2^1, (0, 0, 1, 0, 0, 1, 1)) = 1$

SLIDE 26

LOCATEMRS is FPT

$D(S_2^1, (1, 1, 2, 0, 0, 1, 1)) = 1$

SLIDE 27

LOCATEMRS is FPT

$D(S_2^1, (0, 0, 1, 2, 1, 2, 0)) = 1$

SLIDE 28

LOCATEMRS is FPT

◮ In the end, any entry $D(S_j^i, (\beta, \ldots, \beta)) = 1$ corresponds to an MRS being found. Note that we may also store $\epsilon_t$ for completeness.
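The table-filling procedure can be sketched as below. This is not the authors' implementation: it assumes that each candidate substring is summarized by a 0/1 content vector over A, that extending a partial solution adds vectors component-wise with a cap at $\beta$, and that exactly one substring is picked per sequence ($\alpha = 0$). The names `extend` and `locate_mrs` are hypothetical, and $\epsilon_t$ is not tracked.

```python
from typing import List, Sequence, Set, Tuple

Vector = Tuple[int, ...]


def extend(vector: Vector, content: Vector, beta: int) -> Vector:
    """Component-wise sum of two count vectors, capped at beta."""
    return tuple(min(beta, c + x) for c, x in zip(vector, content))


def locate_mrs(rows: Sequence[Sequence[Vector]], beta: int) -> bool:
    """rows[j][i] is the 0/1 content vector of the i-th candidate substring of S_j.

    Returns True if one substring per row can be chosen so that every
    ancestral gene is covered at least beta times.
    """
    size = len(rows[0][0])
    # reachable[i] holds the vectors v with D(S_j^i, v) = 1 for the current row j.
    reachable: List[Set[Vector]] = [{extend((0,) * size, c, beta)} for c in rows[0]]
    for row in rows[1:]:
        previous = set().union(*reachable)  # every vector reachable in the previous row
        reachable = [{extend(v, c, beta) for v in previous} for c in row]
    target = (beta,) * size
    return any(target in vectors for vectors in reachable)
```

Under these assumptions, `extend((1, 1, 1, 0, 0, 0, 0), (0, 0, 1, 0, 0, 1, 1), 2)` yields (1, 1, 2, 0, 0, 1, 1), the vector obtained when a substring with content (0, 0, 1, 0, 0, 1, 1) follows one with content (1, 1, 1, 0, 0, 0, 0). Because vectors are capped, each set in `reachable` holds at most $(\beta + 1)^{|A|}$ elements.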

SLIDE 29

LOCATEMRS is FPT

◮ In order to fill out D, one has to compute $|S| \times (\beta + 1)^{|A|}$ entries. The main recursion needs, for each entry, to browse at most $|S| \times (\beta + 1)^{|A|}$ other entries of D (previous row). This leads to an overall $O((|S| \times (\beta + 1)^{|A|})^2)$ running time.
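To give a feel for the bound, here is a worked instance under assumed values: $|S| = 9$ candidate substrings (three per sequence, as the figure labels suggest), $\beta = 2$ and $|A| = 7$ (the length of the example count vector). These values are illustrative, not results from the talk.

```latex
\begin{align*}
|S| \times (\beta + 1)^{|A|} &= 9 \times 3^{7} = 9 \times 2187 = 19683 \\
\left(|S| \times (\beta + 1)^{|A|}\right)^{2} &= 19683^{2} = 387\,420\,489
\end{align*}
```

The exponential growth comes only from $|A|$ in the exponent; for fixed $\beta$ and $|A|$, the bound is quadratic in $|S|$.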

SLIDE 30

LOCATEMRS is FPT

◮ The problem is FPT with respect to |A| when $\beta$ is a constant
