Evolutionary Search Techniques for the Lyndon Factorization of Biosequences
Workshop on Evolutionary Computation for Permutation Problems @ GECCO 2019
Amanda Clare, Jacqueline W. Daykin, Thomas Mills, Christine Zarges
Department of Computer Science, Aberystwyth University, Aberystwyth, Wales, UK
c.zarges@aber.ac.uk
July 13, 2019

Overview
  1 The Problem
  2 The Algorithm
  3 Results
  4 Conclusions
C. Zarges, GECCO 2019, July 13, 2019

Motivation: Stringology Meets Bioinformatics
Goal: Investigate structures in strings and permutations of the string alphabet, with application to factoring genomes for sequence alignment.
Notation and Terminology
  Σ: an ordered alphabet
  word: a finite sequence of symbols over Σ
  π: a permutation defining the ordering of the alphabet
Typical Alphabets
  Standard English alphabet (26 letters)
  DNA alphabet (4 letters)
  Protein alphabet (20 letters)

Lyndon Words
Given: an ordered alphabet Σ
Lyndon Word: A finite word $x \in \Sigma^+$ is a Lyndon word if it is strictly least alphabetically amongst all cyclic rotations of its letters.
Example (English alphabet with standard lexicographical ordering): ATOM is a Lyndon word since ATOM < MATO < OMAT < TOMA.
Other examples: Evolution, Christine, Aberystwyth, Abstract, Amazing, Chicken, Moon
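
The definition can be checked directly by comparing a word with each of its proper rotations. The following is a minimal sketch, not code from the talk: the function name `is_lyndon` and the optional `order` argument (a string listing the alphabet from smallest to largest symbol) are assumptions made for illustration.

```python
def is_lyndon(word, order=None):
    """Return True if `word` is strictly smaller than every proper rotation.

    `order` (hypothetical parameter) lists the alphabet from smallest to
    largest symbol; without it, the usual character order is used.
    """
    if not word:
        return False
    rank = {c: i for i, c in enumerate(order)} if order else {c: ord(c) for c in word}
    encoded = [rank[c] for c in word]
    # Compare against all rotations other than the word itself.
    return all(encoded < encoded[i:] + encoded[:i] for i in range(1, len(encoded)))


print(is_lyndon("ATOM"))  # True
print(is_lyndon("TOMA"))  # False
```

The brute-force comparison takes quadratic time in the word length, which is fine for checking short examples such as those on the slide.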

Lyndon Factorisation
Lyndon Factorisation: A factorisation of $x \in \Sigma^+$ into $x = \ell_1 \ell_2 \cdots \ell_n$, where each $\ell_i$ is a Lyndon word and $\ell_1 \geq \ell_2 \geq \cdots \geq \ell_n$.
Example (English alphabet with standard lexicographical ordering): w = UNIVERSITY → U ≥ N ≥ IV ≥ ERSITY
Fact: Any word $x \in \Sigma^+$ can be uniquely factored into a Lyndon factorisation.
Research Questions
  What impact does the manipulation of the alphabet ordering have on the resulting Lyndon factorisation, specifically the number of factors?
  Determine an optimal ordering for a number of different objectives.
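
One standard way to compute this factorisation is Duval's algorithm, which the talk names on a later slide. The sketch below is an illustrative implementation under assumptions: the function name `lyndon_factorisation` and the `order` argument (the alphabet listed from smallest to largest symbol) are not from the slides.

```python
def lyndon_factorisation(word, order=None):
    """Duval's algorithm: return the Lyndon factors of `word` as a list of strings."""
    rank = {c: i for i, c in enumerate(order)} if order else {c: ord(c) for c in word}
    s = [rank[c] for c in word]
    n, i, factors = len(s), 0, []
    while i < n:
        j, k = i + 1, i
        # Extend the current run while it remains a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit copies of the Lyndon word of length j - k found at position i.
        while i <= k:
            factors.append(word[i:i + j - k])
            i += j - k
    return factors


print(lyndon_factorisation("UNIVERSITY"))  # ['U', 'N', 'IV', 'ERSITY']
```

Passing a different `order` string re-ranks the symbols before factoring, which is the manipulation the research questions refer to.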

Applications
Sequence factorisation facilitates useful approaches such as parallelism and block compression to deal with huge volumes of data.
  Bioinformatics: STAR, an algorithm to search for tandem repeats (approximate and adjacent repetitions of a DNA motif)
  Musicology: enumerating periodic musical sequences
  Digital geometry
  Two-way string matching
  Compression: in suffix arrays and the Burrows-Wheeler transform

On the Number of Factors
Example: $w = 0 1^j \, 0^2 1^{j-1} \cdots 0^j 1$ for $j > 1$
  $0 < 1$: $j$ factors, $(0 1^j)(0^2 1^{j-1}) \cdots (0^j 1)$
  $1 < 0$: 3 factors, $(0)(1^j 0^2 1^{j-1} \cdots 0^j)(1)$
How can we minimise the number of factors?
  Existing approach: greedy algorithm by Clare & Daykin
How can we maximise the number or balance the length of factors?
Observation: Alphabet sizes differ, and there is usually no general pattern of characters.
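
The effect of the ordering on this family of words can be reproduced numerically. The sketch below builds the word for a given j and counts its Lyndon factors under both orderings using Duval's algorithm; the helper names `build_word` and `count_factors` are assumptions for illustration.

```python
def count_factors(word, order):
    """Count Lyndon factors of `word` under `order` (smallest symbol first)."""
    rank = {c: r for r, c in enumerate(order)}
    s = [rank[c] for c in word]
    n, i, count = len(s), 0, 0
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            count += 1
            i += j - k
    return count


def build_word(j):
    """Build 0 1^j 0^2 1^(j-1) ... 0^j 1 as a string over {0, 1}."""
    return "".join("0" * i + "1" * (j + 1 - i) for i in range(1, j + 1))


for j in range(2, 6):
    w = build_word(j)
    print(j, w, count_factors(w, "01"), count_factors(w, "10"))
```

For each j the first count grows with j while the second stays at 3, so the gap between the two orderings can be made arbitrarily large.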

Objectives
Example: bacdbdabbcdbbddbdbdabbacbabacbc
  Minimise the number of factors (a < c < d < b): (b) (acdbdabbcdbbddbdbdabbacbabacbc)
  Maximise the number of factors (a < b < c < d): (b) (acdbd) (abbcdbbddbdbd) (abbacb) (abacbc)
  Balance the length of the factors (b < a < c < d): (bacdbda) (bbcdbbddbdbda) (bbacbabacbc)
    – Standard deviation of the factor length
    – Difference between maximum and minimum length
  Find a specific number of factors (if possible)
The number of factors is computed with Duval’s linear-time, constant-space algorithm.
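
The objectives can be evaluated for the example string as follows. This is a sketch under assumptions: the names `lyndon_factors` and `objectives` are hypothetical, and the population standard deviation (`statistics.pstdev`) is used because the slides do not say which variant was chosen; it is also defined when there is only one factor.

```python
from statistics import pstdev


def lyndon_factors(word, order):
    """Duval's algorithm under `order` (alphabet listed smallest symbol first)."""
    rank = {c: r for r, c in enumerate(order)}
    s = [rank[c] for c in word]
    n, i, factors = len(s), 0, []
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(word[i:i + j - k])
            i += j - k
    return factors


def objectives(word, order):
    """Return the factor count and the two balance measures from the slide."""
    lengths = [len(f) for f in lyndon_factors(word, order)]
    return {
        "factors": len(lengths),
        "length_std": pstdev(lengths),  # assumption: population standard deviation
        "length_range": max(lengths) - min(lengths),
    }


w = "bacdbdabbcdbbddbdbdabbacbabacbc"
for order in ("acdb", "abcd", "bacd"):
    print(order, lyndon_factors(w, order), objectives(w, order))
```

The three orderings give 2, 5, and 3 factors respectively, matching the factorisations listed on the slide.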

Evolutionary Algorithm
1 Initialisation: random + based on order of first appearance
2 While exit criteria not met do
    Evaluate alphabet orderings
    Parent selection: select uniformly at random from the top half of the population
    Create offspring using crossover and mutation
    Replacement: offspring replace the lower half of the population
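
The loop structure above can be sketched in a few lines. This is a simplified illustration rather than the authors' implementation: it minimises the number of factors, uses a fixed generation budget as the exit criterion, applies only an unbiased swap mutation, and omits crossover. All function names and default parameter values are assumptions.

```python
import random


def count_factors(word, order):
    """Number of Lyndon factors of `word` under `order`, via Duval's algorithm."""
    rank = {c: r for r, c in enumerate(order)}
    s = [rank[c] for c in word]
    n, i, count = len(s), 0, 0
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            count += 1
            i += j - k
    return count


def first_appearance_order(word):
    """Alphabet ordered by first appearance of each symbol in `word`."""
    return list(dict.fromkeys(word))


def evolve(word, generations=100, pop_size=16, seed=0):
    rng = random.Random(seed)
    alphabet = first_appearance_order(word)
    # Initialisation: one heuristic ordering plus random permutations.
    population = [alphabet[:]]
    while len(population) < pop_size:
        population.append(rng.sample(alphabet, len(alphabet)))

    def fitness(order):
        return count_factors(word, order)

    for _ in range(generations):
        population.sort(key=fitness)      # evaluate; fewest factors first
        top = population[:pop_size // 2]  # survivors and parent pool
        offspring = []
        for _ in range(pop_size - len(top)):
            child = rng.choice(top)[:]    # parent chosen uniformly from the top half
            if len(child) > 1:
                a, b = rng.sample(range(len(child)), 2)
                child[a], child[b] = child[b], child[a]  # swap mutation
            offspring.append(child)
        population = top + offspring      # offspring replace the lower half
    return min(population, key=fitness)


w = "bacdbdabbcdbbddbdbdabbacbabacbc"
best = evolve(w)
print("".join(best), count_factors(w, best))
```

Because the top half always survives, the best fitness in the population never gets worse from one generation to the next.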

Mutation
Swap Mutation (the elements at positions p1 and p2 are exchanged)
  x: 0 1 2 3 4 5 6 7 8 9
  y: 0 1 5 3 4 2 6 7 8 9
Insert Mutation (the element at position p1 is moved to position p2)
  x: 0 1 2 3 4 5 6 7 8 9
  y: 0 1 5 2 3 4 6 7 8 9
Observation: Changes to low ordered characters have higher impact → bias the selection of elements towards low ordered characters
Observation: Changing the order of two elements has higher impact → select Swap Mutation with higher probability
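
Both operators are short to write down. The sketch below takes the two positions as explicit arguments so that the slide's examples can be reproduced; the function names and the argument convention (element at `p1` moved to `p2` for the insert operator) are assumptions inferred from the example permutations.

```python
def swap_mutation(perm, p1, p2):
    """Return a copy of `perm` with the elements at p1 and p2 exchanged."""
    child = list(perm)
    child[p1], child[p2] = child[p2], child[p1]
    return child


def insert_mutation(perm, p1, p2):
    """Return a copy of `perm` with the element at p1 removed and reinserted at p2."""
    child = list(perm)
    child.insert(p2, child.pop(p1))
    return child


x = list(range(10))
print(swap_mutation(x, 2, 5))    # [0, 1, 5, 3, 4, 2, 6, 7, 8, 9]
print(insert_mutation(x, 5, 2))  # [0, 1, 5, 2, 3, 4, 6, 7, 8, 9]
```

Swap changes exactly two positions, whereas insert shifts every element between the two positions by one place.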

Crossover
Observation: Need an operator that preserves large parts of the ordering
Partially Mapped Crossover
Example (crossover points p1 and p2 delimit the segment 4 5 6 7 of x1)
  x1: 1 2 3 4 5 6 7 8 9
  x2: 9 3 7 8 2 6 5 1 4
  Copy the segment from x1:            _ _ _ 4 5 6 7 _ _
  Place displaced elements of x2:      _ _ 2 4 5 6 7 _ 8
  Fill the remaining positions from x2:
  y:  9 3 2 4 5 6 7 1 8
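
A compact version of partially mapped crossover that reproduces the example above is sketched here. The function name `pmx` and the convention that `p1` and `p2` are inclusive zero-based indices are assumptions for illustration.

```python
def pmx(x1, x2, p1, p2):
    """Partially mapped crossover: segment x1[p1..p2] is kept, the rest comes from x2."""
    n = len(x1)
    child = [None] * n
    child[p1:p2 + 1] = x1[p1:p2 + 1]
    segment = set(x1[p1:p2 + 1])
    pos_in_x2 = {v: i for i, v in enumerate(x2)}
    # Elements of x2 displaced by the copied segment follow the mapping
    # x2[i] -> x1[i] until they land outside the segment.
    for i in range(p1, p2 + 1):
        v = x2[i]
        if v in segment:
            continue
        pos = i
        while p1 <= pos <= p2:
            pos = pos_in_x2[x1[pos]]
        child[pos] = v
    # All remaining positions are inherited directly from x2.
    for i in range(n):
        if child[i] is None:
            child[i] = x2[i]
    return child


x1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
x2 = [9, 3, 7, 8, 2, 6, 5, 1, 4]
print(pmx(x1, x2, 3, 6))  # [9, 3, 2, 4, 5, 6, 7, 1, 8]
```

The child keeps a contiguous block of one parent's ordering and as many absolute positions of the other parent as possible, which is why it preserves large parts of both orderings.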

Experimental Setup
Parameters
  Generations: 1000
  Population size: 16
  Mutation bias:
    – Select one of the 3 lowest ordered elements with probability at least 0.3.
    – Select Insert Mutation with probability 0.9
Experiments
  Random sequences: 10 random sequences of length 300 over an alphabet of size 20
  Biosequences: 573 protein sequences from a bacterial genome (Buchnera aphidicola)
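
The slides do not say how the mutation bias is implemented. One possible reading, shown here purely as an assumption, is a mixture: with probability 0.3 a position is drawn from the 3 lowest ordered elements, otherwise uniformly from all positions, which makes the overall probability of hitting a low ordered element at least 0.3. The function names and parameters are hypothetical.

```python
import random


def pick_position(n, rng, low=3, p_low=0.3):
    """Pick an index in range(n), favouring the `low` lowest ordered positions.

    Assumed scheme: with probability `p_low` choose among the first `low`
    positions, otherwise choose uniformly among all n positions.
    """
    if rng.random() < p_low:
        return rng.randrange(min(low, n))
    return rng.randrange(n)


def pick_operator(rng, p_insert=0.9):
    """Choose the mutation operator; 0.9 is the insert probability from the slide."""
    return "insert" if rng.random() < p_insert else "swap"


rng = random.Random(0)
samples = [pick_position(20, rng) for _ in range(10000)]
print(sum(s < 3 for s in samples) / len(samples))
```

For an alphabet of size 20 this scheme selects one of the 3 lowest ordered elements with probability 0.3 + 0.7 · 3/20 = 0.405.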

Random Sequences: Minimisation
[Figure: three panels plotting Fitness Value (1.5 to 3.5) against Generation (0 to 300)]
  The best individual in the initial population already has good fitness → the heuristic provides good results.
  Fitness converges to 2 for all random sequences considered.

Random Sequences: Maximisation
[Figure: three panels plotting Fitness Value (10 to 25) against Generation (0 to 1000)]
  The maximisation problem appears to be more difficult.
  The maximal fitness reached is very similar across different sequences.

Random Sequences: Balanced
[Figure: two rows of three panels plotting Fitness Value against Generation (0 to 1000); fitness axis 10 to 40 in the first row and 25 to 125 in the second]
  The balance problem also appears to be more difficult.

Random Sequences: Specific
[Figure: three panels plotting Fitness Value (0 to 4) against Generation (0 to 25)]
  Target 12 seems to be relatively easy to reach.
  More investigations are needed to understand how the target influences the difficulty.

Biosequences
  Lexicographic: 4053 factors in total (mean 7, standard deviation 2.25)
  Minimisation: in most cases just 1 factor, at most 2 factors
  Maximisation: appears to follow a normal distribution, with a mean of 22.7
  Balanced: range of factors from 2 to 31
  Specific: achieved for all sequences

Conclusions and Future Work
Evolutionary algorithm for finding an optimal alphabet ordering for the Lyndon factorisation problem
Future Work
  Consider different ways to initialise the population
  More detailed analysis of different operators for permutation problems and the underlying fitness landscape
  Investigate the solutions for the minimisation problem, as they capture information about the protein sequences
