On the Length of the Longest Common Subsequence, Peter Rabinovitch



slide-1
SLIDE 1

On the Length of the Longest Common Subsequence

Peter Rabinovitch

slide-2
SLIDE 2

Summary

  • Consider two sequences of coin tosses, and from these two sequences, extract the longest common subsequence. It is known that as the length of the sequences increases, the ratio of the length of the longest common subsequence to the length of the sequence converges to a limit in expectation that is about 0.81, but the exact value of the limit is not known.
  • In this talk, we will survey some key results related to the problem, as well as look at several potential approaches to determining the limit.

slide-3
SLIDE 3

A Simple Example

H T H H T T T H T T

slide-4
SLIDE 4

Applications

  • DNA (alphabet size=4)
  • Proteins (alphabet size=20)
  • Computer security (alphabet size=256)
  • And all these are more complicated, and more interesting, and more useful with more than two strings.

slide-5
SLIDE 5

Formally

  • Let $X=(X_1,X_2,\ldots,X_n)$, $Y=(Y_1,Y_2,\ldots,Y_n)$ be two sequences of iid Bernoulli r.v.s
  • $P[X_i=H]=P[X_i=T]=P[Y_i=H]=P[Y_i=T]=1/2$
  • $L_n$ = length of a longest common subsequence of $X$ and $Y$

We seek to understand the r.v. $L_n$, in particular $\lim_{n\to\infty} E[L_n]/n$

slide-6
SLIDE 6

For small n, things can be calculated explicitly

  • By explicit enumeration we find

– $E[L_2]=9/8$, $V[L_2]=23/64$
– $E[L_3]=29/16$, $V[L_3]=119/256$

  • But it gets messy for larger n
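The small cases can be checked by brute force. The following is a minimal sketch, assuming the fair-coin model of the previous slide; the function names `lcs_length` and `exact_moments` are made up for this illustration. It enumerates all $4^n$ equally likely pairs of sequences and computes the mean and variance of $L_n$ as exact fractions.

```python
from fractions import Fraction
from itertools import product


def lcs_length(x, y):
    """Length of a longest common subsequence, by the standard dynamic program."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            # Extend a match diagonally, otherwise carry the best neighbour.
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def exact_moments(n):
    """Exact mean and variance of L_n over all 4**n equally likely pairs."""
    words = list(product("HT", repeat=n))
    values = [lcs_length(x, y) for x in words for y in words]
    total = len(values)
    mean = Fraction(sum(values), total)
    second = Fraction(sum(v * v for v in values), total)
    return mean, second - mean * mean


for n in (2, 3):
    mean, var = exact_moments(n)
    print(n, mean, var)
```

The cost grows like $4^n$, which is one concrete sense in which things get messy for larger $n$.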
slide-7
SLIDE 7

Properties

[Figure: $E[L_n]/n$ and $\mathrm{Sd}[L_n]/n$ plotted against $n$ for $n$ up to 100.]

[Figure: histogram of $L_{100}$, with axis ticks from 70 to 85.]

Appears monotonic, but not yet proved. Could be Gaussian.

$L_n$, as a function of $X$ and $Y$, satisfies several symmetries

  • Globally switch H & T
  • Reverse both sequences
  • Etc
slide-8
SLIDE 8

An Algorithm (1)

T T H T T T H H T H

slide-9
SLIDE 9

An Algorithm (2)

T T H T T T H H T H

slide-10
SLIDE 10

An Algorithm (3)

T T H T T T H H T H

[Figure: grid annotated with the values 1, 2 and 3]
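The slides do not spell the algorithm out in text, so the following is only a sketch of the standard dynamic program for the LCS, which may or may not be the exact procedure pictured. It assumes the ten flips shown on the slide split into two length-5 sequences, TTHTT and THHTH; that split, and the function names, are assumptions made for illustration.

```python
def lcs_table(x, y):
    """table[i][j] = LCS length of the prefixes x[:i] and y[:j]."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, a in enumerate(x, start=1):
        for j, b in enumerate(y, start=1):
            if a == b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def one_lcs(x, y):
    """Walk back through the table to recover one longest common subsequence."""
    table = lcs_table(x, y)
    i, j, out = len(x), len(y), []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


# Assumed split of the ten flips on the slide into two length-5 sequences.
x, y = "TTHTT", "THHTH"
print(lcs_table(x, y)[-1][-1], one_lcs(x, y))
```

The table costs time and memory proportional to the product of the two lengths, which is fine for illustration but already limits large simulations.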

slide-11
SLIDE 11

Subadditivity, etc.

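  • A sequence $\{a_n\}$ is subadditive if $a_{m+n} \le a_m + a_n$ for all positive integers $m$ and $n$
  • A sequence $\{a_n\}$ is superadditive if $\{-a_n\}$ is subadditive
  • Fekete's lemma: if $\{a_n\}$ is subadditive then

$$\lim_{n\to\infty} \frac{a_n}{n} = \inf_{n} \frac{a_n}{n} = \gamma, \quad \text{where } -\infty \le \gamma < \infty$$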

slide-12
SLIDE 12

Fekete's Lemma (γ>-∞)

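  • For any $\varepsilon>0$ we can find a $k$ s.t. $a_k \le (\gamma+\varepsilon)k$, because $\gamma=\inf_n a_n/n$.
  • Any $m>0$ can be written $m=nk+j$ for the same $k$ with $0\le j<k$, and it follows that

$$a_m = a_{nk+j} \le a_{nk} + a_j \le n a_k + a_j \le n(\gamma+\varepsilon)k + \max_{0\le l<k} a_l$$

so

$$\limsup_m \frac{a_m}{m} \le \limsup_m \frac{n(\gamma+\varepsilon)k}{m} + \limsup_m \frac{\max_{0\le l<k} a_l}{m}$$

and then $\limsup_m a_m/m \le \gamma+\varepsilon$, since $nk/m \to 1$ and the last term tends to 0.

  • Also, since $\gamma$ is the infimum,

$$\gamma+\varepsilon \le \liminf_m \frac{a_m}{m} + \varepsilon$$

  • So

$$\limsup_m \frac{a_m}{m} \le \gamma+\varepsilon \le \liminf_m \frac{a_m}{m} + \varepsilon$$

  • As $\varepsilon>0$ was arbitrary, we have

$$\lim_n \frac{a_n}{n} = \inf_n \frac{a_n}{n} = \gamma$$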
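A numerical illustration may help fix ideas. The sequence $a_n = n + \sqrt{n}$ below is not from the talk; it is chosen here only because it is easy to see that it is subadditive (since $\sqrt{m+n} \le \sqrt{m}+\sqrt{n}$) and that $\inf_n a_n/n = 1$. The sketch checks subadditivity on a finite range and prints the ratios $a_n/n$.

```python
import math


def a(n):
    """Example sequence, chosen only for illustration: a_n = n + sqrt(n)."""
    return n + math.sqrt(n)


def is_subadditive(seq, limit):
    """Check a_{m+n} <= a_m + a_n for all 1 <= m, n <= limit."""
    return all(
        seq(m + n) <= seq(m) + seq(n) + 1e-12  # small slack for rounding error
        for m in range(1, limit + 1)
        for n in range(1, limit + 1)
    )


print(is_subadditive(a, 50))
for n in (1, 10, 100, 10_000, 1_000_000):
    print(n, a(n) / n)  # ratios decrease towards the infimum, which is 1
```

Here the infimum is not attained by any finite $n$, yet the ratios still converge to it, as the lemma asserts.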

slide-13
SLIDE 13

Existence of the Limit

  • $a_n=E[L_n]$ is superadditive (by concatenation)
  • So applying Fekete's lemma to $\{-a_n\}$ shows that the limit of $E[L_n]/n$ exists (Chvátal & Sankoff, 1975)
  • Deken (1979) shows that $L_n/n$ converges a.s.
slide-14
SLIDE 14

Other Results

  • Arratia & Steele conjecture that $c=2\sqrt{2}-2\approx 0.8284$, where $c=\lim E[L_n]/n$
  • Alexander (1994) proves a rate of convergence using methods of percolation
  • Steele (1997?) proves a concentration of measure result using the Azuma-Hoeffding inequality
  • Bundschuh (2001) shows that $c\approx 0.812653$ using simulation, demonstrating that A&S were wrong
  • Lueker (2005) bounds $0.788071\le c\le 0.826280$
slide-15
SLIDE 15

A Heuristic (1)

  • What is the longest sequence of heads you will see in $n$ tosses of a fair coin?

slide-16
SLIDE 16

A Heuristic (1)

  • What is the longest sequence of heads you will see in $n$ tosses of a fair coin?
  • The probability of a length $m$ run is $p^m$ (where $p$ is the probability of heads), and there are approximately $n$ places where this run could start, so $E[\#\text{ of length } m \text{ head runs}] \approx np^m$
  • If the longest one is unique, then $1=np^m$, so the length of the longest head run is $m=\log_{1/p} n$
  • Note: this can be made precise, e.g. Durrett's book has a proof.
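The heuristic is easy to compare with simulation. The sketch below is an illustration only: the function names, the seed and the sample sizes are arbitrary choices. It measures the longest head run in $n$ simulated tosses and prints the average next to the logarithm of $n$ to base $1/p$.

```python
import math
import random


def longest_head_run(tosses):
    """Length of the longest block of consecutive 'H' in a sequence of tosses."""
    best = current = 0
    for t in tosses:
        current = current + 1 if t == "H" else 0
        best = max(best, current)
    return best


def average_longest_run(n, trials, p=0.5, seed=0):
    """Monte Carlo average of the longest head run in n tosses with P[H] = p."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        tosses = ("H" if rng.random() < p else "T" for _ in range(n))
        total += longest_head_run(tosses)
    return total / trials


n = 4096
# Simulated average versus the heuristic value log base 1/p of n (here p = 1/2).
print(average_longest_run(n, trials=200), math.log2(n))
```

The heuristic only predicts the order of magnitude, so the two printed numbers should be close without agreeing exactly.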

slide-17
SLIDE 17

A Heuristic (2)

  • What is the largest red square in an $n$ by $n$ grid where each square is coloured red or black by flipping a coin?

slide-18
SLIDE 18

A Heuristic (3) Arratia & Steele's Conjecture

  • Call any pair of subsequences of length $k$ where the $X_i$ and $Y_i$ agree a 'good $k$ pair'
  • Let $Z$ be the total number of good $k$ pairs of the two length $n$ strings.
  • Then $E[Z]=\binom{n}{k}^2/2^k$ because there are $\binom{n}{k}$ ways to choose each of the subsequences, which have to agree in $k$ places.
  • The mode of this sequence (in $k$) is approximately $n/(1+\sqrt{2})$
  • Since every length $k$ common subsequence yields a good $k$ pair, there are at least $\binom{L_n}{k}$ such good $k$ pairs. This sequence has mode $L_n/2$.
  • Now equate the two to get $L_n/n \sim 2\sqrt{2}-2$
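The mode claimed in the heuristic can be checked numerically. This is a small sketch; the function name and the choice $n=1000$ are arbitrary. It finds the $k$ that maximises $\binom{n}{k}^2/2^k$ using exact integer arithmetic, then compares it with $n/(1+\sqrt{2})$ and the implied ratio with $2\sqrt{2}-2$.

```python
import math


def mode_of_expected_pairs(n):
    """k maximising C(n, k)**2 / 2**k, compared exactly via integers."""
    # Multiply by 2**n so every term is an integer: C(n, k)**2 * 2**(n - k).
    return max(range(n + 1), key=lambda k: math.comb(n, k) ** 2 * 2 ** (n - k))


n = 1000
k_star = mode_of_expected_pairs(n)
print(k_star, n / (1 + math.sqrt(2)))        # mode versus n / (1 + sqrt 2)
print(2 * k_star / n, 2 * math.sqrt(2) - 2)  # implied L_n / n versus 2 sqrt 2 - 2
```

This only confirms the arithmetic of the heuristic; as the simulation and bounds cited in "Other Results" show, the resulting value is not the true limit.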
slide-19
SLIDE 19

Solution Methods

  • We'll focus on two
– Patience sorting
  • Which has connections to the symmetric group, Young tableaux, the Tracy-Widom distribution (see Aldous & Diaconis' AMS paper “Longest Increasing Subsequences: From Patience Sorting to the Baik-Deift-Johansson Theorem”)
– Directed last passage percolation in a disordered medium
  • Which has connections to percolation (see Grimmett's book) as well as (in a suitably relaxed version of the problem) the Tracy-Widom distribution

slide-20
SLIDE 20

Aside on the Tracy Widom Distribution ('94)

  • Arises in many new places
– LIS of a uniform random permutation
– Largest eigenvalue of a random matrix in the Gaussian Unitary Ensemble (GUE), i.e. complex Hermitian matrices
– Growth models in the plane (one of our “Other Related Models”, later)

$$F(s)=\exp\left(-\int_s^\infty (x-s)\,q^2(x)\,dx\right)$$

$$q''(s)=s\,q(s)+2q^3(s)$$

slide-21
SLIDE 21

Patience Sorting

  • (3,5,2,1,7,8,9,4,6)
  • Put the next element at the bottom of the first

column it is less than or equal to.

  • If no such column exists, start a new column
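The two rules above translate almost line for line into code. The following is a minimal sketch (the function name `patience_piles` is made up here); each column is a list whose last entry is its bottom element.

```python
def patience_piles(seq):
    """Deal seq into columns using the rule stated on the slide."""
    piles = []
    for value in seq:
        for pile in piles:
            if value <= pile[-1]:  # first column whose bottom element is >= value
                pile.append(value)
                break
        else:
            piles.append([value])  # no column accepts the element: start a new one
    return piles


piles = patience_piles([3, 5, 2, 1, 7, 8, 9, 4, 6])
print(piles, len(piles))
```

With this rule the number of columns equals the length of a longest strictly increasing subsequence, which is why the final column count is the quantity of interest.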
slide-22
SLIDE 22

Patience Sorting

(3,5,2,1,7,8,9,4,6)

Columns: (3)

slide-23
SLIDE 23

Patience Sorting

(3,5,2,1,7,8,9,4,6)

Columns: (3) (5)

slide-24
SLIDE 24

Patience Sorting

(3,5,2,1,7,8,9,4,6)

Columns: (3,2) (5)

slide-25
SLIDE 25

Patience Sorting

(3,5,2,1,7,8,9,4,6)

Columns: (3,2,1) (5)

slide-26
SLIDE 26

Patience Sorting

(3,5,2,1,7,8,9,4,6)

Columns: (3,2,1) (5) (7)

slide-27
SLIDE 27

Patience Sorting

(3,5,2,1,7,8,9,4,6)

Columns: (3,2,1) (5) (7) (8)

slide-28
SLIDE 28

Patience Sorting

(3,5,2,1,7,8,9,4,6)

Columns: (3,2,1) (5) (7) (8) (9)

slide-29
SLIDE 29

Patience Sorting

(3,5,2,1,7,8,9,4,6)

Columns: (3,2,1) (5,4) (7) (8) (9)

slide-30
SLIDE 30

Patience Sorting

(3,5,2,1,7,8,9,4,6)

Columns: (3,2,1) (5,4) (7,6) (8) (9)

slide-31
SLIDE 31

Patience Sorting

(3,5,2,1,7,8,9,4,6)

Columns: (3,2,1) (5,4) (7,6) (8) (9)

Thus we see that a LIS is of length 5

slide-32
SLIDE 32

Patience Sorting Applied to LCS

  • Let $X=(HTHT)$, $Y=(THHT)$
  • Form $y_T=(0,3)$ and $y_H=(1,2)$, the positions (counting from 0) of T and H in $Y$
  • Reverse them: $y_T=(3,0)$ and $y_H=(2,1)$
  • Replace the $i$th element of $X$ with $y_T$ or $y_H$ depending on the value of $X_i$. Call this list $z$. Here $z=(2,1,3,0,2,1,3,0)$. Do patience sorting on $z$.

Columns: (2,1,0,0) (3,2,1) (3)
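The construction on this slide can be sketched in code as follows. The function names are made up for this illustration, and the column search uses binary search (`bisect_left`) on the bottom elements in place of the left-to-right scan; that is an implementation choice, not something the talk specifies.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of a longest strictly increasing subsequence (patience sorting)."""
    bottoms = []  # bottoms[c] = bottom element of column c; always increasing
    for value in seq:
        c = bisect_left(bottoms, value)  # first column whose bottom is >= value
        if c == len(bottoms):
            bottoms.append(value)
        else:
            bottoms[c] = value
    return len(bottoms)


def lcs_via_lis(x, y):
    """Build the list z from x and y, and return it with the length of its LIS."""
    positions = {}
    for j, symbol in enumerate(y):
        positions.setdefault(symbol, []).append(j)
    z = []
    for symbol in x:
        z.extend(reversed(positions.get(symbol, [])))  # positions in decreasing order
    return z, lis_length(z)


print(lcs_via_lis("HTHT", "THHT"))
```

Listing each symbol's positions in decreasing order is what prevents a strictly increasing subsequence of $z$ from using the same element of $X$ twice, so an increasing subsequence of $z$ corresponds to a common subsequence of $X$ and $Y$.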

slide-33
SLIDE 33

Patience Sorting Applied to LCS

  • So why is this interesting?
– LIS has been solved.
  • Why isn't this a solution?
– In the LIS case, the distribution is uniform over all possible permutations
– In the LCS case, we don't have permutations, but rather words (i.e. repeated elements)
  • The work on LIS has been largely extended to the random word case
– In the LCS case, the distribution is NOT uniform: there are forbidden words, etc.

slide-34
SLIDE 34

Patience Sorting Applied to LCS

  • But...
  • This seems likely to be true
slide-35
SLIDE 35

Patience Sorting Applied to LCS

  • But...
  • This is unknown, simulations are slooooooowwwww...
slide-36
SLIDE 36

Percolation

  • Percolation is a huge area of probability
– See, for example, books by Grimmett, as well as Bollobás

slide-37
SLIDE 37

Directed Last Passage Percolation

  • At each vertex, there is a passage time (or weight)
– typically iid exponential or geometric rvs
  • There is a set of allowed paths
– typically up-right, or strictly up-right
  • The question is what is the maximum time (or weight) path from the origin to (x,y)

slide-38
SLIDE 38

Directed Last Passage Percolation

[Figure: grid of coloured squares, with weight 1 marked on the green squares]

Last passage time = 4

  • Strictly up-right paths
  • Weights chosen by flipping a coin on each square
  • H->Green, weight = 1
  • T->red, weight = -∞
slide-39
SLIDE 39

Directed Last Passage Percolation

  • Strictly up-right paths
  • Weights chosen by flipping a coin on each axis
  • Coordinate flips agree->Green, weight = 1
  • Coordinate flips disagree->red, weight = -∞

[Figure: grid whose axes are labelled with the coin flips T T H T T T H H T H; weight 1 is marked on the squares where the coordinate flips agree]

Last passage time = 3

This is LCS
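A short sketch of this correspondence follows, under stated assumptions: a path is taken to be any chain of squares whose row and column indices both strictly increase, it may start and end anywhere in the grid, and the ten flips on the slide are split into the two axes as TTHTT and THHTH. The function names are hypothetical.

```python
NEG_INF = float("-inf")


def last_passage_time(weights):
    """Maximum total weight over strictly up-right paths through a weight grid.

    Assumption: a path is a chain of squares whose row and column indices both
    strictly increase, and it may start and end at any square.
    """
    rows, cols = len(weights), len(weights[0])
    # best[i][j] = best path using only squares in the first i rows and j columns
    best = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            ending_here = weights[i - 1][j - 1] + best[i - 1][j - 1]
            best[i][j] = max(best[i - 1][j], best[i][j - 1], ending_here)
    return best[rows][cols]


def agreement_grid(x, y):
    """Weight 1 where the coordinate flips agree, minus infinity where they differ."""
    return [[1 if a == b else NEG_INF for b in y] for a in x]


# Assumed split of the ten flips on the slide into the two axes.
print(last_passage_time(agreement_grid("TTHTT", "THHTH")))
```

The same `last_passage_time` function accepts any grid of weights, including one coloured by an independent coin flip per square; only `agreement_grid` builds in the dependence between squares that is specific to LCS.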

slide-40
SLIDE 40

Directed Last Passage Percolation

[Figure: grid annotated with last passage times, which take the values 1, 2 and 3]

slide-41
SLIDE 41

Directed Last Passage Percolation

[Figure: grid annotated with last passage times, which take the values 1, 2 and 3]

slide-42
SLIDE 42

Directed Last Passage Percolation

[Figure: grid annotated with last passage times, which take the values 1, 2 and 3]

slide-43
SLIDE 43

Directed Last Passage Percolation

[Figure: grid annotated with last passage times, which take the values 1, 2 and 3]

slide-44
SLIDE 44

Directed Last Passage Percolation

[Figure: grid annotated with last passage times, which take the values 1, 2 and 3]

slide-45
SLIDE 45

Directed Last Passage Percolation

[Figure: grid annotated with last passage times, which take the values 1, 2 and 3]

slide-46
SLIDE 46

Why doesn't it work?

  • Most (all?) of the DLPP results are for iid (in fact geometric and exponential) weights. In the LCS case, the weights are dependent.
  • So apply techniques from the statistical mechanics of disordered systems, i.e. spin glasses?
slide-47
SLIDE 47

Other Related Models

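  • Bernoulli Model
– Seppäläinen's result: $\lim E[L_n]/n=2\sqrt{2}-2$
– Majumdar's result: $L_n\approx(2\sqrt{2}-2)n+2^{1/6}(\sqrt{2}-1)^{4/3}n^{1/3}\,TW$, where $TW$ is a Tracy-Widom random variable
  • Large alphabet
– Kiwi et al: $\gamma_k\sqrt{k}\to 2$ as $k\to\infty$, where $\gamma_k$ is the limiting constant for alphabet size $k$
  • Many strings
– Not much other than Dancik showing that the limit exists (same proof as in 2 string case)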

slide-48
SLIDE 48

“We have lots of bricks, but we don't know what the building looks like yet.”