On the Length of the Longest Common Subsequence Peter Rabinovitch

Summary ● Consider two sequences of coin tosses, and from these two sequences, extract the longest common subsequence. It is known that as the length of the sequences increases, the ratio of the expected length of the longest common subsequence to the length of the sequences converges to a limit that is about 0.81, but the exact value of the limit is not known. ● In this talk, we will survey some key results related to the problem, as well as look at several potential approaches to determining the limit.

A Simple Example [Figure: two coin-toss sequences, H T H H T and T T H T T]

Applications ● DNA (alphabet size=4) ● Proteins (alphabet size=20) ● Computer security (alphabet size=256) ● And all these are more complicated, and more interesting, and more useful with more than two strings.

Formally ● Let X = (X_1, X_2, ..., X_n), Y = (Y_1, Y_2, ..., Y_n) be two sequences of iid Bernoulli r.v.s ● P[X_i = H] = P[X_i = T] = P[Y_i = H] = P[Y_i = T] = 1/2 ● L_n = length of a longest common subsequence of X and Y ● We seek to understand the r.v. L_n, in particular lim E[L_n]/n

For small n, things can be calculated explicitly ● By explicit enumeration we find – E[L_2] = 9/8, V[L_2] = 23/64 – E[L_3] = 29/16, V[L_3] = 119/256 ● But it gets messy for larger n
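A brute-force sketch of the explicit enumeration, for checking small-n values exactly. The function names are illustrative, and the standard dynamic program is used here only as a convenient way to get each LCS length; the slides do not say how the enumeration was done.

```python
from fractions import Fraction
from itertools import product

def lcs_length(x, y):
    """Length of a longest common subsequence via the standard DP."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def exact_moments(n):
    """Exact mean and variance of L_n over all 4**n equally likely pairs."""
    total = total_sq = 0
    for x in product("HT", repeat=n):
        for y in product("HT", repeat=n):
            length = lcs_length(x, y)
            total += length
            total_sq += length * length
    count = 4 ** n
    mean = Fraction(total, count)
    var = Fraction(total_sq, count) - mean ** 2
    return mean, var

for n in (2, 3):
    print(n, *exact_moments(n))
```

The cost grows like 4^n pairs, which is one concrete sense in which this gets messy for larger n.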

Properties ● E[L_n]/n appears monotonic, but this is not yet proved ● L_n could be Gaussian ● L_n, as a function of X and Y, satisfies several symmetries – Globally switch H & T – Reverse both sequences – Etc. [Figures: E[L_n]/n and Sd[L_n]/n plotted against n for n up to 100; histogram of L_100, frequency against values from about 70 to 85]

An Algorithm (1) [Figure: grid whose axes are labelled with the two sequences T H H T H and T T H T T]

An Algorithm (2) [Figure: the same grid]

An Algorithm (3) [Figure: the same grid with cells labelled 1, 2 and 3]

Subadditivity, etc. ● A sequence {a_n} is subadditive if a_{m+n} ≤ a_m + a_n for all positive integers m & n ● A sequence {a_n} is superadditive if {-a_n} is subadditive ● Fekete's lemma: if {a_n} is subadditive then lim a_n/n = inf a_n/n = γ, where −∞ ≤ γ < ∞

Fekete's Lemma (γ>-∞) ● For any ε>0 we can find a k s.t. a_k ≤ (γ+ε)k, because γ = inf a_n/n. ● Any m>0 can be written m = nk+j for the same k, with 0 ≤ j < k. It follows that a_m = a_{nk+j} ≤ a_{nk} + a_j ≤ n a_k + a_j ≤ n(γ+ε)k + max_{0≤l<k} a_l (with the convention a_0 = 0), so limsup_m a_m/m ≤ limsup_m n(γ+ε)k/m + limsup_m max_{0≤l<k} a_l/m, and then limsup_m a_m/m ≤ γ+ε ● Also γ+ε ≤ liminf_m a_m/m + ε, since γ ≤ a_m/m for every m ● So limsup_m a_m/m ≤ γ+ε ≤ liminf_m a_m/m + ε ● As ε>0 was arbitrary, we have lim_n a_n/n = inf_n a_n/n = γ

Existence of the Limit ● a_n = E[L_n] is superadditive (by concatenation), and a_n ≤ n ● So applying Fekete's lemma to {-a_n} shows that lim E[L_n]/n exists (Chvatal & Sankoff, 1975) ● Deken (1979) shows that L_n/n converges a.s.
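The concatenation argument can be written out as follows. This is a sketch: the notation L(u; v) for the LCS length of two strings u and v is an assumption introduced here, not taken from the slides.

```latex
% Gluing a common subsequence of the first m symbols to one of the last n symbols
% gives a common subsequence of the full strings:
L(X_1 \dots X_{m+n};\, Y_1 \dots Y_{m+n})
  \;\ge\; L(X_1 \dots X_m;\, Y_1 \dots Y_m)
  \;+\; L(X_{m+1} \dots X_{m+n};\, Y_{m+1} \dots Y_{m+n}).
% The second term has the same distribution as L_n because the tosses are iid, so
\mathbb{E}[L_{m+n}] \;\ge\; \mathbb{E}[L_m] + \mathbb{E}[L_n].
% Fekete's lemma applied to -\mathbb{E}[L_n], together with L_n \le n, then gives
\lim_{n\to\infty} \frac{\mathbb{E}[L_n]}{n} \;=\; \sup_{n} \frac{\mathbb{E}[L_n]}{n} \;\le\; 1.
```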

Other Results ● Arratia & Steele conjecture that c=2√2-2~0.8284 ● Alexander (1994) proves a rate of convergence using methods of percolation ● Steele (1997?) proves a concentration of measure result using the Azuma-Hoeffding inequality ● Bundschuh (2001) shows that c~0.812653 using simulation, demonstrating that A&S were wrong ● Lueker (2005) bounds 0.788071≤c≤0.826280

A Heuristic (1) ● What is the longest sequence of heads you will see in n tosses of a fair coin? ● The probability of a length m run is p^m (with p = P[H], so p = 1/2 here), and there are approximately n places where this run could start, so E[# of length m head runs] ≈ n p^m ● If the longest one is unique, then 1 = n p^m, so the length of the longest head run is about log_{1/p} n ● Note: this can be made precise, e.g. Durrett's book has a proof.
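A small simulation can be used to compare the heuristic with the average longest head run. This is only a sketch: the function names, the number of trials and the seed are arbitrary choices made for illustration.

```python
import math
import random

def longest_head_run(n, p=0.5, rng=random):
    """Longest run of heads in n tosses of a coin with P[H] = p."""
    best = run = 0
    for _ in range(n):
        if rng.random() < p:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best

def mean_longest_run(n, trials=200, seed=0):
    """Average longest head run over independent fair-coin experiments."""
    rng = random.Random(seed)
    return sum(longest_head_run(n, rng=rng) for _ in range(trials)) / trials

# Compare the simulated average with the heuristic value log_2 n.
for n in (2**6, 2**10, 2**14):
    print(n, math.log2(n), mean_longest_run(n))
```

The heuristic only claims the right order of growth, so the simulated average should track log_2 n up to an additive constant rather than match it exactly.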

A Heuristic (2) ● What is the largest red square in an n by n grid where each square is coloured red or black by flipping a coin?

A Heuristic (3) Arratia & Steele's Conjecture ● Call any pair of length k subsequences, one from X and one from Y, that agree in every position a 'good k pair' ● Let Z be the total number of good k pairs of the two length n strings ● Then E[Z] = C(n,k)^2 / 2^k, because there are C(n,k) ways to choose each of the subsequences, which have to agree in k places ● As a function of k, this has its mode at approximately n/(1+√2) ● Since every length k subsequence of a longest common subsequence yields a good k pair, there are at least C(L_n, k) such good k pairs. As a function of k, this has mode L_n/2 ● Now equate the two modes to get L_n/n ~ 2/(1+√2) = 2√2-2
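The location of the first mode can be checked numerically. The sketch below uses a hypothetical helper name and n = 1000 as an arbitrary choice; it finds the k that maximises C(n,k)^2 / 2^k in exact integer arithmetic and compares it with n/(1+√2).

```python
from math import comb, sqrt

def mode_of_expected_good_pairs(n):
    """k maximising E[Z] = C(n, k)**2 / 2**k for strings of length n."""
    # Multiply by 2**n so that everything stays an exact integer;
    # this does not change which k attains the maximum.
    return max(range(n + 1), key=lambda k: comb(n, k) ** 2 * 2 ** (n - k))

n = 1000
k_star = mode_of_expected_good_pairs(n)
print(k_star, n / (1 + sqrt(2)))        # the two should be close
print(2 * k_star / n, 2 * sqrt(2) - 2)  # heuristic value of L_n / n
```

Setting the ratio of consecutive terms, ((n-k)/(k+1))^2 / 2, equal to 1 gives the same answer analytically.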

Solution Methods ● We'll focus on two – Patience sorting ● Which has connections to the symmetric group, Young tableaux, the Tracy Widom distribution (see Aldous & Diaconis' AMS paper “Longest Increasing Subsequences: From Patience Sorting to the Baik-Deift-Johansson Theorem”) – Directed last passage percolation in a disordered medium ● Which has connections to percolation (see Grimmett's book) as well as (in a suitably relaxed version of the problem) the Tracy Widom distribution

Aside on the Tracy Widom Distribution ('94) ● Arises in many new places – LIS of a uniform random permutation – Largest eigenvalue of a random matrix in the Gaussian Unitary Ensemble (GUE), i.e. complex Hermitian matrices – Growth models in the plane (one of our “Other Related Models”, later) ● $F(s) = \exp\left(-\int_s^\infty (x-s)\, q(x)^2 \, dx\right)$, where $q''(s) = s\,q(s) + 2\,q(s)^3$

Patience Sorting ● (3,5,2,1,7,8,9,4,6) ● Put the next element at the bottom of the leftmost column whose bottom element is greater than or equal to it. ● If no such column exists, start a new column to the right
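The dealing rule can be sketched in a few lines. The function name is illustrative, and the binary search over the bottom elements is an implementation choice made here, not something the slides specify.

```python
from bisect import bisect_left

def patience_sort(seq):
    """Deal seq into columns; each column is listed top to bottom."""
    columns = []  # columns[i] is the i-th column from the left
    bottoms = []  # bottoms[i] == columns[i][-1]; increasing from left to right
    for x in seq:
        i = bisect_left(bottoms, x)  # leftmost column whose bottom is >= x
        if i == len(columns):
            columns.append([x])      # no such column: start a new one
            bottoms.append(x)
        else:
            columns[i].append(x)
            bottoms[i] = x
    return columns

columns = patience_sort((3, 5, 2, 1, 7, 8, 9, 4, 6))
print(columns, len(columns))
```

The number of columns is the length of a longest strictly increasing subsequence of the input.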

Patience Sorting (3,5,2,1,7,8,9,4,6), one element at a time (columns shown left to right, each listed top to bottom)
● After 3: (3)
● After 5: (3) (5)
● After 2: (3,2) (5)
● After 1: (3,2,1) (5)
● After 7: (3,2,1) (5) (7)
● After 8: (3,2,1) (5) (7) (8)
● After 9: (3,2,1) (5) (7) (8) (9)
● After 4: (3,2,1) (5,4) (7) (8) (9)
● After 6: (3,2,1) (5,4) (7,6) (8) (9)
● There are 5 columns, thus we see that a LIS is of length 5

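Patience Sorting Applied to LCS ● Let X=(HTHT), Y=(THHT) ● Form the lists of (0-indexed) positions of T and of H in Y: y_T=(0,3) and y_H=(1,2) ● Reverse them: y_T=(3,0) and y_H=(2,1) ● Replace the i-th element of X with y_T or y_H depending on the value of X_i. Call this list z. Here z=(2,1,3,0,2,1,3,0) ● Do patience sorting on z. The columns, left to right and each listed top to bottom, are (2,1,0,0) (3,2,1) (3), so there are 3 columns and the LCS has length 3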
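The reduction can be sketched as follows. The function name is hypothetical, and only the bottom element of each column is tracked, since the number of columns is all that is needed for the length.

```python
from bisect import bisect_left

def lcs_length_by_patience(x, y):
    """Return (z, LCS length) using the position-list reduction and patience sorting."""
    # Positions (0-indexed) of each symbol in y, stored in decreasing order.
    positions = {}
    for j in reversed(range(len(y))):
        positions.setdefault(y[j], []).append(j)
    # Replace each symbol of x by the reversed position list of that symbol in y.
    z = [j for symbol in x for j in positions.get(symbol, [])]
    # Patience sorting on z, keeping only the bottom element of each column.
    bottoms = []
    for value in z:
        i = bisect_left(bottoms, value)  # leftmost column whose bottom is >= value
        if i == len(bottoms):
            bottoms.append(value)
        else:
            bottoms[i] = value
    return z, len(bottoms)

z, length = lcs_length_by_patience("HTHT", "THHT")
print(z, length)
```

Reversing each position list is what prevents two positions of Y from being matched to the same element of X: within one block of z the values decrease, so a strictly increasing subsequence can use at most one of them.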

Patience Sorting Applied to LCS ● So why is this interesting? – LIS has been solved. ● Why isn't this a solution? – In the LIS case, the distribution is uniform over all possible permutations – In the LCS case, we don't have permutations, but rather words (i.e. repeated elements) ● The work on LIS has been largely extended to the random word case – In the LCS case, the distribution is NOT uniform – there are forbidden words, etc.

Patience Sorting Applied to LCS ● But... ● This seems likely to be true

Patience Sorting Applied to LCS ● But... ● This is unknown, simulations are slooooooowwwww...

Percolation ● Percolation is a huge area of probability – See, for example, books by Grimmett, as well as Bollobas

Directed Last Passage Percolation ● At each vertex, there is a passage time (or weight) – typically iid exponential or geometric rvs ● There is a set of allowed paths – typically up-right, or strictly up-right ● The question is what is the maximum time (or weight) path from the origin to (x,y)
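For up-right paths the last passage time satisfies a simple recursion: the best path into a vertex arrives from the vertex below or the vertex to the left. A minimal sketch, assuming a rectangular grid of weights indexed from the origin (0,0), with a hypothetical function name and iid exponential weights chosen only as an example:

```python
import random

def last_passage_time(weights):
    """Maximum total weight over up-right paths from (0, 0) to the far corner.

    weights[i][j] is the weight at vertex (i, j); each step increases
    exactly one of the two coordinates by one.
    """
    rows, cols = len(weights), len(weights[0])
    G = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            previous = []
            if i > 0:
                previous.append(G[i - 1][j])
            if j > 0:
                previous.append(G[i][j - 1])
            G[i][j] = weights[i][j] + (max(previous) if previous else 0)
    return G[-1][-1]

rng = random.Random(1)
w = [[rng.expovariate(1.0) for _ in range(50)] for _ in range(50)]
print(last_passage_time(w))
```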

Directed Last Passage Percolation ● Strictly up-right paths ● Weights chosen by flipping a coin on each square ● H->Green, weight = 1 ● T->red, weight = -∞ ● Last passage time = 4 [Figure: grid of green squares (weight 1) and red squares (weight -∞)]

Directed Last Passage Percolation ● Strictly up-right paths ● Weights chosen by flipping a coin for each coordinate on each axis ● Coordinate flips agree->Green, weight = 1 ● Coordinate flips disagree->red, weight = -∞ ● Last passage time = 3 ● This is LCS [Figure: grid of weights 1 and -∞, with the axes labelled by the two coin-toss sequences]
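The correspondence with LCS can be sketched directly from the definition: a strictly up-right path that avoids every -∞ square is a chain of agreeing squares in which both coordinates strictly increase. The function name is hypothetical, and the quadratic inner search is chosen for clarity rather than speed.

```python
def strict_last_passage_time(x, y):
    """Most weight-1 squares on a strictly up-right path through the grid.

    Square (i, j) has weight 1 if x[i] == y[j] and -infinity otherwise,
    so a path of finite weight visits only squares where the flips agree.
    """
    n, m = len(x), len(y)
    # chain[i][j]: best path ending exactly at square (i, j); 0 if the flips disagree.
    chain = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if x[i] == y[j]:
                # The previous square must be strictly below and strictly to the left.
                before = max(
                    (chain[a][b] for a in range(i) for b in range(j)),
                    default=0,
                )
                chain[i][j] = before + 1
    return max((v for row in chain for v in row), default=0)

print(strict_last_passage_time("HTHT", "THHT"))
```

Each square on such a path matches one symbol of one sequence to one symbol of the other, which is why the last passage time equals the LCS length.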

Directed Last Passage Percolation [Figures: a sequence of grids with cells labelled by values from 0 to 3]
