SLIDE 1

The relation between indel length and functional divergence

A formal study

Alexander Schönhuth

Raheleh Salari, Fereydoun Hormozdiari, Artem Cherkasov, Cenk Sahinalp

Computational Biology Lab, School of Computing Science, Simon Fraser University, Canada

Division of Infectious Diseases, Faculty of Medicine, University of British Columbia

September 17, 2008

SLIDE 2

Guideline

SLIDE 4

Goals and Motivation

Develop computational solutions to identify truly evolutionary insertions and deletions (indels) in alignments.

Examine whether significant indels are correlated with more severe functional divergence between homologous proteins.

Improve the assessment of functional similarity of homologous proteins.

SLIDE 7

Indel Facts

Evolutionary processes behind insertions and deletions are not well understood.

Truly evolutionary distributions of indel length:

Mixtures of exponentials [Qian and Goldstein, 2003], [Pang et al., 2005]

Zipfian distribution [Chang and Benner, 2004]

Classical alignment procedures with affine gap penalties: geometric distribution.

Moreover, from small-scale studies:

Indels often occur in the proteins' loop regions and cause significant structural changes [Fechteler et al., 1995]

Indels occur in disease-causing mutational hot spots [Kondrashov et al., 2004]

Thanks to the structural changes: novel approaches to antibacterial drug design [Cherkasov et al. 2005, 2006], [Nandan et al. 2007]
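
The link between affine gap penalties and a geometric length distribution stated on this slide can be made explicit. The following is a sketch, assuming an affine penalty with gap-open cost d and gap-extension cost e (both symbols are introduced here for illustration) and a gap state that is re-entered with probability q:

```latex
% Affine gap penalty for a gap of length \ell \geq 1 (d, e are assumed names):
g(\ell) = d + (\ell - 1)\, e
% Geometric gap length distribution with continuation probability q:
P(\text{length} = \ell) = (1 - q)\, q^{\ell - 1}
% Taking logarithms gives a cost that is linear in \ell, as for g(\ell):
-\log P(\text{length} = \ell) = -\log(1 - q) + (\ell - 1)\,(-\log q)
```

Under this reading, the negative log-probability of a gap is affine in its length, which is why affine penalties implicitly assume geometrically distributed indel lengths rather than a mixture of exponentials or a Zipfian law.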

SLIDE 11

Basic Idea and Workflow

Large-scale correlation study on indel occurrence and functional similarity for paralogous proteins.

1. Compute all paralogous protein pairs in an organism.

2. Collect “indel” and “non-indel” pairs.

3. Compare functional similarity of “indel” and “non-indel” protein pairs.

Requirements for each step:

1. An interesting organism and a sound but efficient alignment method are needed.

2. How to identify true indels? Idea: identify alignments that contain statistically significant indels and neglect “alignment noise”. Sound indel statistics are needed.

3. A definition of functional similarity is needed.

SLIDE 12

Data and Methods

SLIDE 13

Organism: E. coli K12

E. coli K12 is a well-known organism.

In prokaryotes: horizontal gene transfer, insertion of genetic material from other prokaryotic organisms.

In bacteria: genomic islands, regions that accumulate inserted genetic material.

Novel classes of antibacterial drug targets are needed.

SLIDE 15

Paralogs

Paralogous proteins (paralogs) can assume different functions in the organism [Gerlt and Babbitt, 2000].

Global alignments: Needleman-Wunsch algorithm with affine gap penalties. Tool: GGSEARCH [Pearson and Lipman, 1988]

Paralogous pairs: E-value below 10^-6, at least 50% sequence similarity and 20% sequence identity.

Above 40% sequence identity: aligned proteins are structurally similar [Rost, 1999].

Between 20% and 40% sequence identity: twilight zone, structural similarity cannot straightforwardly be inferred.
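
The selection criteria on this slide can be sketched as a simple filter. This is a minimal illustration, assuming the 50% threshold refers to sequence similarity; the `AlignmentHit` record and its field names are hypothetical and do not reflect the output format of GGSEARCH.

```python
from dataclasses import dataclass


@dataclass
class AlignmentHit:
    """Hypothetical record for one global alignment between two proteins."""
    query: str
    subject: str
    evalue: float
    similarity: float  # percent of similar aligned positions
    identity: float    # percent of identical aligned positions


def is_paralogous_pair(hit, max_evalue=1e-6, min_similarity=50.0, min_identity=20.0):
    """Apply the thresholds from the slide; excluding self-hits is an assumption."""
    return (
        hit.query != hit.subject
        and hit.evalue < max_evalue
        and hit.similarity >= min_similarity
        and hit.identity >= min_identity
    )


def in_twilight_zone(hit):
    """Between 20% and 40% identity, structure cannot be inferred directly."""
    return 20.0 <= hit.identity < 40.0
```

A pair with E-value 1e-10, 62% similarity and 31% identity would pass the filter and fall into the twilight zone.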

SLIDE 18

GO-based Computation of Functional Similarity

Gene Ontology (GO):

Structured description of functional annotation.

Three subcategories:
Molecular Function
Biological Process
Cellular Component

Organized as a directed acyclic graph (DAG).

Proteins are identified with subsets of nodes in a DAG.

Protein similarity can be measured based on a reasonable comparison of “subDAGs”.

Method of choice: extension of the semantic similarity measure by [Resnik, 1999] to protein similarity, as described by [Schlicker, 2006], suggested by Francisco Couto and Daniel Faria.
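
To illustrate the idea of comparing "subDAGs", here is a small sketch of Resnik's term similarity (information content of the most informative common ancestor) on a toy DAG. The protein-level aggregation shown is a best-match average, chosen here for illustration; it is not the exact measure of [Schlicker, 2006], and all term and protein names are made up.

```python
import math


def ancestors(term, parents):
    """All ancestors of a term in the DAG, including the term itself."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(annotations, parents):
    """IC(t) = log(N / n_t), where n_t proteins carry t or one of its descendants."""
    counts = {}
    for terms in annotations.values():
        covered = set()
        for term in terms:
            covered |= ancestors(term, parents)
        for term in covered:
            counts[term] = counts.get(term, 0) + 1
    total = len(annotations)
    return {term: math.log(total / c) for term, c in counts.items()}


def resnik(t1, t2, parents, ic):
    """Information content of the most informative common ancestor."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max((ic.get(t, 0.0) for t in common), default=0.0)


def protein_similarity(terms_a, terms_b, parents, ic):
    """Best-match average over the two annotation sets (an assumed aggregation)."""
    if not terms_a or not terms_b:
        return 0.0
    row = [max(resnik(a, b, parents, ic) for b in terms_b) for a in terms_a]
    col = [max(resnik(a, b, parents, ic) for a in terms_a) for b in terms_b]
    return (sum(row) + sum(col)) / (len(row) + len(col))


# Toy ontology: child -> list of parents (hypothetical terms).
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
}
ANNOTATIONS = {
    "p1": {"dna_binding"},
    "p2": {"rna_binding"},
    "p3": {"catalysis"},
    "p4": {"dna_binding"},
}
IC = information_content(ANNOTATIONS, PARENTS)
```

In this toy example, two proteins annotated with different kinds of binding share the ancestor "binding" and therefore score higher than a binding protein compared with a catalytic one, whose only common ancestor is the root. The raw Resnik score is unbounded, so a normalized variant would be needed to obtain values in the range from 0 to 1.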

SLIDE 21

Indel Statistics: Problem Definition

Definition

If A is an alignment algorithm, let L_A(x, y) be the length of the alignment and I_A(x, y) the length of the largest indel in the alignment of x = x_1...x_m and y = y_1...y_n, as computed by A.

If k is an integer, let

P_{n,T}(I_A(x, y) ≥ k) := P(I_A(x, y) ≥ k | L_A(x, y) = n, (x, y) ∈ T)

be the probability that the largest indel in the alignment of x and y (with (x, y) drawn from a pool of pairs T such that L_A(x, y) = n) has length at least k.
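
As a concrete reading of the two quantities in this definition, the sketch below computes the alignment length and the largest indel from a pair of gapped alignment rows. The representation (two equal-length strings with "-" as the gap symbol) is an assumption made for illustration, as is the choice to treat a gap run in one row followed directly by a gap run in the other row as two separate indels.

```python
def largest_indel(row_x, row_y, gap="-"):
    """Return (alignment length, length of the longest gap run in either row)."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    longest = 0
    run = 0
    prev = None  # which row carried the gap in the previous column, if any
    for a, b in zip(row_x, row_y):
        cur = "x" if a == gap else "y" if b == gap else None
        if cur is not None and cur == prev:
            run += 1   # the current gap run continues in the same row
        elif cur is not None:
            run = 1    # a new gap run starts
        else:
            run = 0    # aligned column, no gap
        longest = max(longest, run)
        prev = cur
    return len(row_x), longest
```

For the rows `AC--GT-A` and `ACTTG-CA`, the function returns an alignment length of 8 and a largest indel of length 2.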

SLIDE 22

Problem Definition

Indel Length Probability (ILP) Problem

Computation of the probabilities P_{n,T}(I_A(x, y) ≥ k).

Input: A pair of sequences (x, y), a pool T with (x, y) ∈ T, an alignment algorithm A and an integer k.

Output: P_{n,T}(I_A(x, y) ≥ k).

Remark

Replacing I_A(x, y) by S_A(x, y), the score of an alignment of x and y, yields the classical problem of score statistics.

For A the Smith-Waterman algorithm restricted to ungapped local alignments, an exact solution is given by the Altschul-Dembo-Karlin statistics [Karlin and Altschul, 1990], [Dembo and Karlin, 1991].

SLIDE 25

Solution Strategy: Pair HMMs

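[Figure: pair HMM with states Begin, M, X, Y and End. M emits an aligned pair (x_i, y_j) with probability P_{x_i y_j}, X emits x_i with probability q_{x_i}, and Y emits y_j with probability q_{y_j}. Transitions: M→M with probability 1-2p, M→X and M→Y with probability p each, X→X and Y→Y with probability q, X→M and Y→M with probability 1-q.]

P_{n,T}(I_A(x, y) ≥ k) corresponds to the probability that a Viterbi path contains a consecutive run of either 'X' or 'Y' states of length at least k. Hard problem!

[Figure: two-state Markov chain with states Match and Indel. Match→Match with probability 1-2p, Match→Indel with probability 2p, Indel→Indel with probability q, Indel→Match with probability 1-q.]

We write I for 'Indel' and M for 'Match'.

Let C_{n,k} be the set of sequences over the alphabet {M, I} of length n that contain a consecutive I stretch of length at least k.

COMPUTATION OF P_{n,T}(I_A(x, y) ≥ k):
1: Align all sequence pairs from T.
2: Infer p and q by training the pair HMM with the alignments.
3: n ← length of the alignment of x and y
4: Compute P(C_{n,k}), the probability that the Markov chain generates a sequence from C_{n,k}.
5: Output P(C_{n,k}) as an approximation for P_{n,T}(I_A(x, y) ≥ k).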
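
Step 4 of this procedure can be sketched with a small dynamic program over the current length of the I-run. Note that this is a run-length formulation chosen for brevity, not the inclusion-exclusion recursion developed on the following slides, and it assumes that the chain behaves as if the symbol before position 1 were M (the slides do not state the initial distribution). The function and parameter names are illustrative.

```python
def prob_long_indel(n, k, stay_match, q):
    """P(C_{n,k}) for the two-state Match/Indel chain.

    stay_match is 1 - 2p (Match -> Match), q is Indel -> Indel.
    Assumption: the chain starts as if the preceding symbol were M.
    """
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    # state[r]: probability that the current I-run has length r (r = 0 means
    # the last symbol was M) and no run of length k has occurred so far.
    state = [0.0] * k
    state[0] = 1.0
    hit = 0.0  # probability that a run of length k has already occurred
    for _ in range(n):
        nxt = [0.0] * k
        # Emitting M resets the run, from Match or from any unfinished I-run.
        nxt[0] = state[0] * stay_match + sum(state[1:]) * (1.0 - q)
        start_run = state[0] * (1.0 - stay_match)  # Match -> Indel
        if k == 1:
            hit += start_run
        else:
            nxt[1] = start_run
            for r in range(1, k - 1):
                nxt[r + 1] = state[r] * q  # Indel -> Indel extends the run
            hit += state[k - 1] * q        # the run reaches length k
        state = nxt
    return hit
```

For example, `prob_long_indel(300, 10, 0.935, 0.638)` would give the probability of an indel of length at least 10 in an alignment of length 300 under the assumed start condition. The running time is O(n·k).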

SLIDE 27

Computation of P(Cn,k): Naive Approach

Consider first B_{t,k}, the set of sequences of the type (M is for 'Match' and I for 'Indel'; Z can be either M or I):

\[ Z_1 \ldots Z_{t-1} \; I_t \ldots I_{t+k-1} \; Z_{t+k} \ldots Z_n . \]

It holds that

\[ C_{n,k} = \bigcup_{t=1}^{n-k+1} B_{t,k} . \]

However, proceeding by inclusion-exclusion

\[ P(C_{n,k}) = \sum_{m=1}^{n-k+1} (-1)^{m+1} \sum_{1 \leq t_1 < \ldots < t_m \leq n-k+1} P(B_{t_1,k} \cap \ldots \cap B_{t_m,k}) \]

results in the computation of

\[ \sum_{m=1}^{n-k+1} \binom{n-k+1}{m} = 2^{n-k+1} - 1 \]

terms of the type P(B_{t_1,k} ∩ ... ∩ B_{t_m,k}); hence computation of P(C_{n,k}) would be exponential in n.
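
A small worked instance of the term count may help. The values n = 5 and k = 3 below are chosen purely for illustration:

```latex
% n = 5, k = 3, so t ranges over 1, ..., n - k + 1 = 3.
\sum_{m=1}^{3} \binom{3}{m} = \binom{3}{1} + \binom{3}{2} + \binom{3}{3}
                            = 3 + 3 + 1 = 7 = 2^{3} - 1
% For a more realistic alignment, e.g. n = 300 and k = 10,
% the same formula gives 2^{291} - 1 terms.
```

Already for moderate alignment lengths the number of intersection probabilities is far beyond what can be enumerated, which motivates the reformulation with the sets D_{t,k}.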

SLIDE 29

Efficient Computation of P(Cn,k)

Therefore, for 1 ≤ t ≤ n − k, consider D_{t,k}, the set of sequences of the type

\[ Z_1 \ldots Z_{t-1} \; I_t \ldots I_{t+k-1} \; M_{t+k} \; Z_{t+k+1} \ldots Z_n . \]

Here as well

\[ C_{n,k} = \bigcup_{t=1}^{n-k+1} D_{t,k} \]

where D_{n-k+1,k} := B_{n-k+1,k} as before.

Inclusion-exclusion

\[ P(C_{n,k}) = \sum_{m=1}^{n-k+1} (-1)^{m+1} \sum_{1 \leq t_1 < \ldots < t_m \leq n-k+1} P(D_{t_1,k} \cap \ldots \cap D_{t_m,k}) \]

followed by noting that D_{t,k} ∩ D_{s,k} = ∅ for 0 < |t − s| ≤ k, which cancels a large number of terms of the type P(D_{t_1,k} ∩ ... ∩ D_{t_m,k}), and applying Markov chain theory allows P(C_{n,k}) to be computed efficiently by dynamic programming, recursively in n.
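
The two set-theoretic facts used on this slide (the sets D_{t,k} cover C_{n,k}, and sets whose start positions differ by at most k are disjoint) can be checked by brute force for small n. The sketch below enumerates all sequences over {M, I}; it is a sanity check only, not part of the efficient algorithm, and the function names are illustrative.

```python
from itertools import product


def all_sequences(n):
    """All strings of length n over the alphabet {M, I}."""
    return ["".join(letters) for letters in product("MI", repeat=n)]


def d_set(n, k, t):
    """D_{t,k} for 1-based t; D_{n-k+1,k} is defined as B_{n-k+1,k}."""
    members = set()
    for s in all_sequences(n):
        run_ok = s[t - 1:t - 1 + k] == "I" * k
        # The closing M at position t+k is only required if that position exists.
        closed = (t == n - k + 1) or s[t - 1 + k] == "M"
        if run_ok and closed:
            members.add(s)
    return members


def check(n, k):
    """Return (union equals C_{n,k}, sets with 0 < |t - s| <= k are disjoint)."""
    sets = {t: d_set(n, k, t) for t in range(1, n - k + 2)}
    c_nk = {s for s in all_sequences(n) if "I" * k in s}
    union_ok = set().union(*sets.values()) == c_nk
    disjoint_ok = all(
        not (sets[t] & sets[s])
        for t in sets
        for s in sets
        if 0 < abs(t - s) <= k
    )
    return union_ok, disjoint_ok
```

Calling `check(6, 2)` enumerates all 64 sequences of length 6 and confirms both properties for that case. Because non-empty intersections require start positions more than k apart, only a small fraction of the inclusion-exclusion terms survive.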

SLIDE 30

Results

SLIDE 32

Inference of Markov Chain Parameters

Paralogs were grouped into ten different pools, according to their alignment similarity scores.

[Figure: two-state Markov chain with states Match and Indel. Match→Match with probability 1-2p, Match→Indel with probability 2p, Indel→Indel with probability q, Indel→Match with probability 1-q.]

Parameters 1 − 2p and q of the respective ten different Markov chains were inferred.

Sim. (%)   No. Al.   1 − 2p   q
95 - 100   26        0.999    0.846
90 - 95    40        0.996    0.665
85 - 90    49        0.994    0.614
80 - 85    62        0.989    0.652
75 - 80    102       0.987    0.632
70 - 75    186       0.983    0.619
65 - 70    345       0.975    0.642
60 - 65    709       0.969    0.667
55 - 60    1793      0.956    0.659
50 - 55    20317     0.935    0.638
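
One simple way to obtain such parameters is to count transitions in the Match/Indel state sequences of the alignments in a pool. The sketch below shows this maximum-likelihood counting; it is an assumption about how the training could be done, since the slides only say that the pair HMM was trained with the alignments, and the helper names are illustrative.

```python
def to_states(row_x, row_y, gap="-"):
    """Collapse a gapped pairwise alignment to a string over {M, I}."""
    return "".join("I" if gap in (a, b) else "M" for a, b in zip(row_x, row_y))


def estimate_chain(paths):
    """Estimate (1 - 2p, q) from M/I strings by counting transitions."""
    counts = {"MM": 0, "MI": 0, "IM": 0, "II": 0}
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[a + b] += 1
    from_m = counts["MM"] + counts["MI"]
    from_i = counts["IM"] + counts["II"]
    # Undefined estimates (no transitions observed) are reported as NaN.
    stay_match = counts["MM"] / from_m if from_m else float("nan")
    q = counts["II"] / from_i if from_i else float("nan")
    return stay_match, q
```

Under such a chain, the length of an indel run is geometrically distributed with mean 1 / (1 − q), so a value of q around 0.64 corresponds to an expected indel length of slightly under three positions.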

SLIDE 34

Determination of Indel- and Non-Indel Pairs

n: length of the alignment; k: maximal indel length; θ_I, θ_NI: significance levels.

Indel (I) pairs: P(C_{n,k}) ≤ θ_I

Non-Indel (NI) pairs: 1 − P(C_{n,k+1}) ≤ θ_NI

We found that θ_I = 0.045, θ_NI = 0.25 yielded sufficiently good discrimination while keeping a sufficient number of alignments.
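
The two criteria on this slide translate directly into a small decision rule. The sketch below takes the two probabilities as inputs (however they were computed) and uses the thresholds reported above as defaults; the function name and the `None` return value for unclassified pairs are illustrative choices.

```python
def classify_pair(p_run_k, p_run_k_plus_1, theta_i=0.045, theta_ni=0.25):
    """Classify an alignment whose largest indel has length k.

    p_run_k is P(C_{n,k}); p_run_k_plus_1 is P(C_{n,k+1}).
    Returns "I" (indel pair), "NI" (non-indel pair) or None (neither).
    """
    if p_run_k <= theta_i:
        return "I"    # an indel this long is unlikely to arise by chance
    if 1.0 - p_run_k_plus_1 <= theta_ni:
        return "NI"   # a longer indel would have been likely, yet none occurred
    return None
```

Since C_{n,k+1} is a subset of C_{n,k}, the two conditions cannot hold at the same time with these thresholds, so the order of the checks does not matter.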

SLIDE 35

The Twilight Zone: Function

              No. Alignments      GO Similarity             T-test
Identity (%)  I     NI    O       I       NI      O         I vs. O  NI vs. O  I vs. NI
≥ 20.0        841   647   4413    0.6724  0.7507  0.7069    3.6844   3.9313    4.9547
≥ 25.0        468   314   2219    0.6871  0.7783  0.7356    4.1953   2.9257    4.4796
≥ 30.0        183   159   1030    0.7347  0.8132  0.7736    2.0633   2.0135    2.6321
≥ 35.0        75    98    546     0.7629  0.8515  0.8304    2.2001   0.9201    2.1015
≥ 40.0        42    63    328     0.8522  0.8610  0.8796    0.7731   0.7300    0.1808

[Figure: GO similarity (Function) plotted against sequence identity (%), with the series "Indels", "NonIndels" and "Overall".]

SLIDE 36

The Twilight Zone: Process

              No. Alignments      GO Similarity             T-test
Identity (%)  I     NI    O       I       NI      O         I vs. O  NI vs. O  I vs. NI
≥ 20.0        813   670   4500    0.5609  0.7521  0.6699    8.4619   6.2435    9.5150
≥ 25.0        446   326   2235    0.5443  0.7602  0.6756    7.5987   4.4952    7.6959
≥ 30.0        177   160   1030    0.6069  0.8030  0.7261    4.3747   3.1777    4.9147
≥ 35.0        76    97    545     0.7140  0.8193  0.8053    2.3570   0.4757    1.9635
≥ 40.0        45    62    333     0.8122  0.8372  0.8653    1.3118   0.8248    0.4209

[Figure: GO similarity (Process) plotted against sequence identity (%), with the series "Indels", "NonIndels" and "Overall".]

SLIDE 37

The Twilight Zone: Component

              No. Alignments      GO Similarity             T-test
Identity (%)  I     NI    O       I       NI      O         I vs. O  NI vs. O  I vs. NI
≥ 20.0        542   381   2986    0.8625  0.9376  0.9139    4.7200   2.2012    4.4590
≥ 25.0        281   182   1479    0.8644  0.9483  0.9208    3.9673   2.0667    3.9138
≥ 30.0        123   106   750     0.8698  0.9746  0.9438    3.2578   2.6746    3.6899
≥ 35.0        60    63    411     0.8390  0.9843  0.9472    2.9245   3.2622    3.3570
≥ 40.0        36    39    257     0.8443  0.9746  0.9554    2.2478   1.2842    2.2405
≥ 45.0        19    25    175     0.8832  0.9604  0.9689    1.3316   0.4529    1.0406

[Figure: GO similarity (Component) plotted against sequence identity (%), with the series "Indels", "NonIndels" and "Overall".]

SLIDE 39

Outlook

Other organisms / orthologs.

Extending and refining the statistical model (e.g. for local alignments).

Indels vs. protein interactions.

Drug targets (from studying human-pathogen orthologs).

SLIDE 40

Acknowledgments

Francisco Couto, Daniel Faria
SFU Community Trust Endowment Fund: “Bioinformatics for Combatting Infectious Diseases” Project

Pacific Institute for the Mathematical Sciences
Computational Biology Lab Members, SFU
Michael Coons
Peter Höfl

SLIDE 41