Expressive pattern matching with LOGOL: Application to the modelling of -1 Ribosomal Frameshift events



slide-1
SLIDE 1

Expressive pattern matching with LOGOL
Application to the modelling of -1 Ribosomal Frameshift events

X XXY YYZ

Catherine Belleannée - Dyliss team, Rennes 1 University
Olivier Sallou - GenOuest platform, Rennes 1 University
Jacques Nicolas - Dyliss team, Inria

LOGOL, C. Belleannée, O. Sallou, J. Nicolas, Jobim, 3 July 2012

slide-2
SLIDE 2

What is Logol?

  • A new tool for pattern matching on DNA, RNA and proteins

[Figure: a pattern (model) is searched in a set of sequences such as attccggtctacc, ctttgtcacg, taggctggcttcggatt, tcggcattggattcgga, cggatcgattcttttac; Logol reports the matches of the pattern in the sequences.]

slide-3
SLIDE 3

Why a new tool?

  • Towards more expressive patterns (Logol language)

beyond motifs: TAT-[A|T]-T-xxx-AATTCCC
towards real biological models: X XXY YYZ

  • While remaining practicable (Logol tool)

  • accept real sequences (e.g. full genomes)
  • in reasonable time

slide-4
SLIDE 4

Outline

1. Logol language

  • Foundations
  • Some elements

2. Logol tool

  • Availability
  • Design of a pattern
  • Specifications of the tool

3. An example: modelling « -1 frameshifting sites »

4. Conclusion

slide-5
SLIDE 5
  • 1. Logol language


slide-6
SLIDE 6

Foundations of the language

  • Make the structure of motifs explicit -> Grammatical models

Describe « the language of genes », cf. David Searls

  • with an accurate level of grammar -> String Variable Grammars (SVG)

String Variables:
X…X direct copy: atcgttatgtatgttatga
X…~X reverse complement: atcgttatgtatataacga

SVG: beyond context-free grammars, « mildly context-sensitive »

regular grammars: motif (TAT-[A|T]-T-xxx-AATTCCC)
context-free grammars: + palindrome (stem-loops)
SVG: + copy, repeat
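The two string-variable examples above (direct copy and reverse complement) can be illustrated with a small Python sketch. This is not Logol code and not how Logol searches; the function names `find_copy` and `find_revcomp` and the fixed size of 6 are assumptions made for the illustration. The direct copy uses a regex backreference, a feature that already goes beyond regular grammars; the reverse complement needs an explicit search.

```python
import re

def revcomp(s):
    """Reverse complement of a lowercase DNA string."""
    return s.translate(str.maketrans("acgt", "tgca"))[::-1]

def find_copy(seq, size):
    """Leftmost string X of the given size that occurs twice (X...X)."""
    m = re.search(r"(.{%d}).*?\1" % size, seq)
    return m.group(1) if m else None

def find_revcomp(seq, size):
    """Leftmost pair (X, ~X): X followed later by its reverse complement."""
    for i in range(len(seq) - 2 * size + 1):
        x = seq[i:i + size]
        j = seq.find(revcomp(x), i + size)   # search only after X
        if j != -1:
            return x, seq[j:j + size]
    return None

print(find_copy("atcgttatgtatgttatga", 6))
print(find_revcomp("atcgttatgtatataacga", 6))
```

On the two slide sequences, the first call finds the repeated word `gttatg` and the second finds `tcgtta` followed by its reverse complement `taacga`.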

  • Previous languages/tools using String Variables

Patscan [Dsouza & al., 97], Patsearch [Pesole & al., 00], Genlang [Dong & Searls, 94], Stan [Nicolas & al., 05]: limited expressivity, or no longer maintained

  • > Logol: in the lineage of Genlang
slide-7
SLIDE 7

Some elements of the language 1/3

  • A first grammar: looking for « aaaaa » anywhere in the sequence

mod1()==>"aaaaa"
mod1()==*>SEQ1
actaaaaatggaaaaagta

  • Inexact matches: 2 counters -> mismatch ($), indel ($$)

mod1()==>"aaaaa":{$[0,1]}      actagaaatgga     cost=1 (mismatch)
mod1()==>"aaaaa":{$$[0,1]}     actaaaacatgga    distance=1 (insert)
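To make the two counters concrete, here is a minimal Python sketch, assuming that the cost ($) behaves like a count of substitutions and the distance ($$) like a count of insertions and deletions. The helper names `mismatch_hits` and `indel_distance` are hypothetical; this does not reproduce Logol's search strategy.

```python
def mismatch_hits(pattern, seq, max_cost):
    """Start positions where pattern matches with at most max_cost substitutions."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1)
            if sum(a != b for a, b in zip(pattern, seq[i:i + n])) <= max_cost]

def indel_distance(a, b):
    """Minimum number of insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1])                    # characters agree
            else:
                cur.append(min(prev[j], cur[j - 1]) + 1)   # delete or insert
        prev = cur
    return prev[-1]

print(mismatch_hits("aaaaa", "actagaaatgga", 1))   # window "agaaa"
print(indel_distance("aaaaa", "aaaaca"))           # one inserted 'c'
```

With the slide examples, the only window within one mismatch of "aaaaa" starts at index 3 (0-based), and "aaaaca" is at indel distance 1 from "aaaaa".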

  • String Variables: looking for 2 copies of a string (X1) separated by a gap (.*)

mod1()==>X1:{#[5,8]}, .* , X1     actatcaatggatcaagta

  • Morphisms: to convert a string into another string (personal morphisms allowed)

"wc": Watson-Crick complement, "-": reverse string, "wobble": wobble complement
-"wc": reverse complement

mod1()==>-"wc" "aacccc"     acttggggttggatcaagta
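A morphism can be thought of as a character-to-character table, optionally combined with string reversal. The sketch below is a hypothetical Python rendering of the `-"wc"` example above; the names `WC` and `apply_morphism` are assumptions for the illustration, and one-to-many morphisms such as wobble pairing are not covered.

```python
# Watson-Crick complement table for DNA (lowercase).
WC = {"a": "t", "c": "g", "g": "c", "t": "a"}

def apply_morphism(table, s, reverse=False):
    """Map every character through the table, then optionally reverse."""
    out = "".join(table[ch] for ch in s)
    return out[::-1] if reverse else out

target = apply_morphism(WC, "aacccc", reverse=True)   # reverse complement
print(target, "acttggggttggatcaagta".find(target))
```

The reverse complement of "aacccc" is "ggggtt", which occurs at index 4 (0-based) of the example sequence.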

slide-8
SLIDE 8

Some elements of the language 2/3

  • A constraint approach

Constraints: begin (@), end (@@), content (?), length (#), cost ($), distance ($$), composition (%)

mod1()==>X1:{%"cg":70}     X1 must contain at least 70% of 'c' and 'g'

  • Variables can denote instances. Instance = string + components
  • > Mark an instance (_VARNAME) and reuse it (?VARNAME, $VARNAME...)

The second string must exactly match the previous instance:
mod1()==>"aaaaa":{$[0,1], _SAVE1}, ?SAVE1     actaaTaaaaTaactacct

'acgt' must begin 50 to 100 nt after the beginning of 'aaaaa':
mod1()==>"aaaaa":{_SAVE1}, .*, "acgt":{@[@SAVE1+50, @SAVE1+100]}

Looking for 3 strings, successively deriving from each other:
mod1()==>X1:{#[5,8],_S1}, ?S1:{_S2}:{$[1,1]}, ?S2:{$[1,1]}     actaaaaaaaaTaaCata

Looking for a stem-loop, with sizes of: stem in [5,11], loop in [1,9], stem strands linked by Watson-Crick pairing, 2 mismatches + 1 indel allowed in the stem:
mod1()==>STEM5:{#[5,11],_S5}, .*:{#[1,9]}, -"wc" ?S5:{$[0,2],$$[0,1]}
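The stem-loop model above can be approximated by a brute-force Python search. This is only a sketch under simplifying assumptions: it handles mismatches but not the indel allowance, it enumerates every candidate instead of using an index, and the name `stem_loops` and its parameters are invented for the illustration.

```python
def revcomp(s):
    """Reverse complement of a lowercase DNA string."""
    return s.translate(str.maketrans("acgt", "tgca"))[::-1]

def mismatches(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def stem_loops(seq, stem=(5, 11), loop=(1, 9), max_mismatch=2):
    """Yield (start, stem5, loop, stem3) for every candidate stem-loop."""
    for i in range(len(seq)):
        for s in range(stem[0], stem[1] + 1):
            for l in range(loop[0], loop[1] + 1):
                j = i + s + l                      # start of the 3' strand
                if j + s > len(seq):
                    break                          # longer loops cannot fit
                stem5, stem3 = seq[i:i + s], seq[j:j + s]
                if mismatches(revcomp(stem5), stem3) <= max_mismatch:
                    yield i, stem5, seq[i + s:j], stem3

seq = "aaggcacttttgtgccaa"   # made-up sequence: ggcac + tttt + gtgcc
for hit in stem_loops(seq, max_mismatch=0):
    print(hit)
```

The triple loop mirrors the three size ranges of the grammar; a real tool needs pruning and indexing to stay practicable on full genomes.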

slide-9
SLIDE 9

Some elements of the language 3/3

  • Repeats: looking for "acgt" repeated between 0 and 5 times. The instances may be separated by a spacer of 0 to 2 nt

mod1()==>repeat("acgt",[0,2])+[0,5]     actacgtggacgtcacgtccta

  • Negative content constraints (!): looking for a string with length between 2 and 5 which is not "ag"

mod1()==> !"ag":{#[2,5]}

  • Put constraints on several strings
  • VIEW: constraints on consecutive segments

The total size of X1::X2::X3 must be between 8 and 20:
(X1:{#[1,10]}, X2:{#[1,10]}, X3:{#[1,10]}):{#[8,20]}

  • Control panel: constraints on non-consecutive segments
  • Superposition of complementary models: multiple models as 'points of view'
  • > The sequence must match all the models

Note: parameters may be transferred from one model to another

mod1(VAR1).mod2(VAR1,VAR2).mod3()==*>SEQ1
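For the exact-repeat case on this slide, a regular expression gives a rough Python analogue. The sketch assumes at least one occurrence (Logol's lower bound of 0 would also accept the empty match), and `find_repeat` is a hypothetical helper name.

```python
import re

# "acgt", then up to 4 more copies, each preceded by a spacer of 0 to 2 nt.
REPEAT = re.compile(r"acgt(?:.{0,2}acgt){0,4}")

def find_repeat(seq):
    """Return (start, matched text) of the leftmost repeat run, or None."""
    m = REPEAT.search(seq)
    return (m.start(), m.group()) if m else None

print(find_repeat("actacgtggacgtcacgtccta"))
```

On the slide sequence, the run covers three copies of "acgt", separated by "gg" and "c".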

slide-10
SLIDE 10
  • 2. Logol tool


slide-11
SLIDE 11

Availability

  • On the web: Logol can be used on the GenOuest web site (with restrictions) http://webapps.genouest.org/LogolDesigner/
  • Via the Linux command line on the GenOuest platform, with a GenOuest account
  • Download on your own computer (NEW!)

Logol software is free and open source, under the CeCILL license
It includes a Linux command-line tool and a graphical designer
Logol is a fully maintained tool (development manager: Olivier SALLOU)

Main Logol page:

http://logol.genouest.org/web/app.php/logol

slide-12
SLIDE 12

Design of a pattern

  • Grammatical model

text file mymodel.lgg

mod4()==>(("aaa")|("ccc")|("uuu")|("ggg"))
mod2()==>mod4(),(("aaa")|("uuu")),! "g":{#[1,1]}
…

  • Graphical model

with a graphical designer: mymodel.lgd

http://webapps.genouest.org/LogolDesigner/

slide-13
SLIDE 13

Specifications of the tool

  • Input: a Logol model (graphical or grammatical model) and a Fasta sequence
  • Runs on a computer or a grid (Linux)

Configurable to support multi-core architectures and to use multiple nodes to parallelize treatments when possible. Sequences may be split for more parallelization

  • Output: a compressed XML file containing all matches of the model

With the details of each match (position of each word, size, number of errors compared to the model…)
Possibility to convert it to Fasta (sequence only) or GFF output

Main pipeline

  • a Java program transforms the model file into a Prolog program
  • the Prolog program parses the sequence (to find the instances of the model); it uses
  • > a Prolog library (with predicates to operate morphisms, % calculus…)
  • > a suffix array indexation (with "Vmatch" or the home-made "Cassiopee")
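The role of the suffix array in the pipeline can be illustrated with a naive Python sketch. It is not how Vmatch or Cassiopee are implemented (both also support errors); the construction below simply sorts all suffixes, which is only reasonable for short sequences, and the function names are assumptions.

```python
def build_suffix_array(seq):
    """Start positions of all suffixes, in lexicographic order (naive build)."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])

def find_all(seq, sa, pattern):
    """All exact occurrences of pattern, found by binary search on the array."""
    n = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                                   # first suffix >= pattern
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + n] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:                                   # first suffix > pattern
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + n] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

seq = "actaaaaatggaaaaagta"
sa = build_suffix_array(seq)
print(find_all(seq, sa, "aaaaa"))
```

Once the array is built, each lookup needs only a logarithmic number of comparisons, which is why an index pays off when the content to search for is already known.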
slide-14
SLIDE 14

  • 3. An example: modelling « -1 frameshifting sites »

slide-15
SLIDE 15

Programmed -1 ribosomal frameshifting

A translational recoding strategy

  • one mRNA may produce two distinct proteins

The ribosome may switch from the translation of the standard ORF (in the 0 frame) to an overlapping ORF (in the -1 frame)

[Figure: an mRNA with start0, a slippery site, stop0 (0 frame) and stop-1 (-1 frame). At the slippery site, the ribosome may slip by 1 nucleotide to the left.]

standard protein: built from the 0 frame
alternative protein:

  • > beginning: built from the 0 frame
  • > end: built from the -1 frame
slide-16
SLIDE 16

Typical frameshifting site

Located in messenger RNA

  • slippery motif: heptamer X XXY YYZ
  • spacer
  • stimulatory structure: mainly an H-type pseudo-knot (two overlapping stem-loops)

[Figure: layout of a frameshifting site on the mRNA, from 5' to 3': start (AUG), slippery motif (X XXY YYZ), spacer (NNNNNNN), pseudo-knot with strands Stem1.5', Stem2.5', Stem1.3', Stem2.3'.]

slide-17
SLIDE 17

Modelling slippery motif

Heptamer XXXYYYZ with X: any nucleotide, Y: 'a' or 'u', Z: not a 'g'

Graphical model / Grammatical model

mod4()==>(("aaa")|("ccc")|("uuu")|("ggg"))     Alternative
mod2()==>mod4(),(("aaa")|("uuu")),! "g":{#[1,1]}     Negation
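Because the slippery motif only uses alternatives and a one-character negation, it stays within regular expressions. The Python sketch below is an illustration, assuming a lowercase RNA alphabet (a, c, g, u), so that "not g" becomes `[acu]`; `slippery_sites` is a hypothetical helper name.

```python
import re

# XXX (one of four triplets), YYY (aaa or uuu), Z (any nucleotide but g).
# The lookahead lets overlapping sites be reported.
SLIPPERY = re.compile(r"(?=((?:aaa|ccc|ggg|uuu)(?:aaa|uuu)[acu]))")

def slippery_sites(rna):
    """List of (position, heptamer) for every slippery motif in rna."""
    return [(m.start(), m.group(1)) for m in SLIPPERY.finditer(rna)]

print(slippery_sites("gcaaauuuacg"))   # heptamer aaauuua
print(slippery_sites("gcaaauuugcg"))   # rejected: Z is 'g'
```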

slide-18
SLIDE 18

Modelling spacer

Spacer: a string that contains from 1 to 10 nucleotides

Grammatical model

mod5()==>SPACER1:{#[1,10]}     String variable with a size constraint

slide-19
SLIDE 19

Modelling pseudo-knot 1/3: a first model

Two overlapping stem-loops, with

  • sizes of: stem1 in [4,16], loop1 in [1,5], stem2 in [3,8], loop2 in [0,4], loop3 in [4,40]
  • stem strands linked by Watson-Crick pairing
  • 4 mismatches allowed in stem1, 2 allowed in stem2

Graphical model: [figure showing the loops L1, L2 and L3]

Grammatical model

STEM15:{#[4,16],_S15}, .*:{#[1,5]}, STEM25:{#[3,8],_S25}, .*:{#[0,4]},
-"wc" ?S15:{$[0,4]}, .*:{#[4,40]}, -"wc" ?S25:{$[0,2]}

Logol operators: string variable, size constraint, cost constraint, morphism wc

BUT: test on data => too many false positives. The model is not selective enough

slide-20
SLIDE 20

Modelling pseudo-knot 2/3

A more realistic model at a glance

slide-21
SLIDE 21

Modelling pseudo-knot 3/3

Logol components of a more realistic model

  • Precise error and constraint handling

prohibit mismatches at stem extremities: a stem is divided in 3 parts <first nt, main part of the stem, last nt>
and accept wobble pairing (g-u) (wobble morphism definition)

  • Check global statistical constraints. Allows describing stem stability

A majority of 'gc' pairing is required

  • > counting 'gc' in a strand (at least 50% of 'g' and 'c' on <A5,STEM5',Z5>)

use a content constraint on adjacent strings (i.e. on a view)

mod3()==>(A5:{#[1,1],_SA5}, STEM15:{#[2,14],_S15}, Z5:{#[1,1],_SZ5}):{%"gc":50}

  • > counting 'c' in both strands (at least 25% of 'c' on <A5,STEM5',Z5> + <Z3,STEM3',A3>) because 'g' may be involved in a (weak) wobble pairing

use a content constraint on non-adjacent strings (stem5' + stem3')
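The two composition checks described on this slide amount to computing a percentage over a concatenation of matched pieces, whether the pieces are adjacent (a view) or not (a control). A minimal Python sketch, with a hypothetical helper `percent` and made-up strand contents:

```python
def percent(chars, *parts):
    """Percentage of characters from `chars` in the concatenation of parts."""
    text = "".join(parts)
    if not text:
        return 0.0
    return 100.0 * sum(text.count(c) for c in chars) / len(text)

# Made-up 5' strand pieces <A5, STEM5', Z5> and 3' strand <Z3, STEM3', A3>.
strand5 = ("g", "cauc", "c")
strand3 = ("g", "gaug", "c")

print(percent("gc", *strand5) >= 50)            # view on adjacent strings
print(percent("c", *strand5, *strand3) >= 25)   # control on both strands
```

Counting only 'c' across both strands reflects the remark above: a 'g' may be engaged in a weak g-u wobble pair, so it is a less reliable sign of a stable stem.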

slide-22
SLIDE 22

Modelling frame alignments 1/3: specifications

  • 2 alternative translations

[Figure: two readings of the same mRNA from the same start. All translation in the 0 frame: start ... X XXY YYZ ... stop0. Shifting to the -1 frame at the slippery site: start ... X XXY YYZ ... stop-1.]

=> Superposition of 3 mandatory models

  • SlipperySite model: a start, then a slippery site in the -1 frame
  • ORF0 model: from the same start, a stop is needed in the 0 frame
  • ORFminus model: from the same start, a stop is needed in the -1 frame

Between start and stop, intermediate codons should not contain a stop codon

slide-23
SLIDE 23

Modelling frame alignments 2/3: in Logol

  • Superposition of the 3 models
  • > using multiple models in the Logol main 'rule'
  • Alignment on the same start
  • > mark variable START and reuse its position (@START)
  • > transfer parameters between models (the variable START)

[Figure annotations: save the variable START; position constraint = @START]

slide-24
SLIDE 24

Modelling frame alignments 3/3: in Logol

  • A nonstop codon: a string of size 3 nt, which is not a stop
  • > stop is defined as a model by an alternative (uga | uag | uaa)
  • > then use a view with a size constraint (size = 3) and a negative content constraint (≠ stop) => Nonstop model
  • Set of successive nonstop codons
  • > consecutive repeat (from 5 to 300 times) of the nonstop model

[Figure: a run of nonstop codons followed by a stop]
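The nonstop-codon construction (a 3 nt string that is not a stop, repeated 5 to 300 times, then a stop) maps naturally onto a regular expression with a negative lookahead. The Python sketch below is an illustration only; the lowercase RNA alphabet and the helper name `stop_after` are assumptions, and reading from position `pos - 1` is used here to stand for the -1 frame.

```python
import re

STOP = "(?:uga|uag|uaa)"
# A nonstop codon: 3 nucleotides that are not a stop codon.
NONSTOP = f"(?:(?!{STOP})[acgu]{{3}})"
# 5 to 300 consecutive nonstop codons followed by a stop codon.
ORF_TAIL = re.compile(f"{NONSTOP}{{5,300}}{STOP}")

def stop_after(seq, pos):
    """End index of the first in-frame stop when reading codons from pos, or None."""
    m = ORF_TAIL.match(seq, pos)
    return m.end() if m else None

seq = "aug" + "gcu" * 5 + "uaa"
print(stop_after(seq, 3))   # 0 frame, just after the start codon
print(stop_after(seq, 2))   # -1 frame: no in-frame stop in this toy sequence
```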

slide-25
SLIDE 25

Test

  • Test on 30 positive sequences

30 « validated -1 frameshift » sequences
From the database Recode2 http://recode.ucc.ie/
Size: from 5 000 nt to 30 000 nt

  • Hits: about 3 sites and 100 hits per sequence, including the good one

with an additional post-filter ordering the hits according to stem quality, the official hit is ranked 1st for 20 sequences

  • Time (on Intel X5550, 144 GB RAM): 1 min 30 s for the biggest sequence

immediate response on the dedicated KnotInFrame site

  • Test on the Bacillus subtilis complete genome

Reference: str. 168, NC_000964.3
Size: 4 215 606 nt

  • Hits: about 7 000 hits (with a looser model)
  • Time (on Intel X5550, 144 GB RAM): 2 hours

no response on the KnotInFrame site

slide-26
SLIDE 26
  • 4. Conclusion


slide-27
SLIDE 27

Logol:

  • A general-purpose modelling language, for every type of sequence
  • > with considerable expressivity
  • > practicable and maintained
  • A new birth for String Variable Languages
  • An ongoing project:
  • still in progress
  • your feedback is welcome, to keep improving the language and the tool

A single address: http://logol.genouest.org/web/app.php/logol

Test it!

slide-28
SLIDE 28
  • 5. Supplementary data


slide-29
SLIDE 29

Job sequence

[Figure: job pipeline. Inputs: a Logol grammar file and Fasta sequences. The Multi Sequence Manager creates 1 job per sequence (parallel jobs if on a cluster, sequential if local). In each job, the Grammar interpreter converts the grammar to Prolog, the Job Manager splits the sequence if multi-core, and the Sequence Parser executes the Prolog program, using Vmatch or Cassiopee. Each job produces a result file; the result files are then merged.]

slide-30
SLIDE 30

Grammar analysis

  • The analyzer is a Java program used to parse a Logol grammar file or (graphical) model and transform it into a Prolog program
  • The Prolog program uses a Prolog library that contains predicates to match expressions (fixed content, morphisms,…)
  • Logic programming: the generated program tries to match all expressions up to a final match, else it backtracks to test the next possibility.
  • The analyzer extracts as much information as possible about a variable to limit the search range (max size, base content defined later on in the grammar,…)
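The backtracking behaviour described above (match each expression in turn, and on failure go back to try the next possibility) can be mimicked with Python generators. This is a sketch of the idea only, not the generated Prolog; the element encoding (`"lit"`, `"var"`, `"gap"` tuples) and the function names are invented for the illustration.

```python
def match(elements, seq, pos=0, env=None):
    """Yield (end, bindings) for every way elements can match seq from pos."""
    env = env or {}
    if not elements:
        yield pos, dict(env)
        return
    kind, *args = elements[0]
    rest = elements[1:]
    if kind == "lit":                       # fixed content
        (text,) = args
        if seq.startswith(text, pos):
            yield from match(rest, seq, pos + len(text), env)
    elif kind == "var":                     # string variable with a size range
        name, lo, hi = args
        if name in env:                     # already bound: must be identical
            if seq.startswith(env[name], pos):
                yield from match(rest, seq, pos + len(env[name]), env)
        else:
            for size in range(lo, hi + 1):  # each size is a choice point
                if pos + size > len(seq):
                    break
                bound = {**env, name: seq[pos:pos + size]}
                yield from match(rest, seq, pos + size, bound)
    elif kind == "gap":                     # unconstrained spacer
        lo, hi = args
        for size in range(lo, hi + 1):
            if pos + size > len(seq):
                break
            yield from match(rest, seq, pos + size, env)

def first_match(elements, seq):
    """Leftmost match as (start, end, bindings), or None."""
    for start in range(len(seq)):
        for end, env in match(elements, seq, start):
            return start, end, env
    return None

seq = "actatcaatggatcaagta"
model = [("var", "X1", 5, 8), ("gap", 0, len(seq)), ("var", "X1", 5, 8)]
print(first_match(model, seq))
```

The model encodes `X1:{#[5,8]}, .*, X1`. Every `for` loop is a choice point: when the rest of the model fails, the generator resumes and tries the next size, which is what Prolog does implicitly. Narrowing the size ranges, as the analyzer does, directly shrinks these loops.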

slide-31
SLIDE 31

Execution flow

  • Left-to-right file reading, but a variable may be used before it is determined ( _R1,"acgt":{?R1} )
  • Parse the sequence file by position and try to match an expression (content with errors, constraints on a group of variables,…). When all expressions are matched, record it as a result.
  • A match is recorded as an object. This object records all the information of the match all along the parsing: Object = [pattern1, pattern2,…]

– patternX is itself an object holding match information (position, errors,…) but also sub-patterns if applicable (models, repeats,…).

  • In case of a gap (not a position+1):

– Calls Vmatch (suffix array) or Cassiopee when the content is known (possibly with errors)

  • Vmatch is an efficient suffix array search tool with distance/error support
  • Cassiopee is a basic Ruby library with distance/error/ambiguity support, but not optimized for large sequences.

– Continue by position with all possibilities when looking for content not defined at this time

  • Optional final step: filtering to keep optimal results only.
slide-32
SLIDE 32
  • Zoom on a specificity of Logol variables

X:: $[0,1], X:: $[0,1]     tandem repeat

  • > It accepts ATTA with X = "AA" (or X = "TT"). This is impossible with Genlang (X, X:: $1). So Logol makes it possible to distinguish between entity (abstract) and instance (concrete)

  • Control panel: to put constraints on non-adjacent strings

Counting 'c' in both strands (at least 25% of 'c' on <A5,STEM5',Z5> + <Z3,STEM3',A3>) because 'g' may be involved in a (weak) wobble pairing

use a content constraint on non-adjacent strings (stem5' + stem3')

controls:{ %"c"[mod3.SA5,mod3.S15,mod3.SZ5,mod3.SZ3,mod3.S13,mod3.SA3] >=25 }

slide-33
SLIDE 33

A final grammatical model for -1 Frameshifting
/ without frame alignment /

# definition of the wobble morphism
def:{
morphism(wcw,a,t)
morphism(wcw,t,a)
morphism(wcw,c,g)
morphism(wcw,g,c)
morphism(wcw,g,t)
morphism(wcw,t,g)
}
# definition of the controls
controls:{
% "c"[mod3.SA5,mod3.STEM15,mod3.SZ5,mod3.SZ3,mod3.STEM13,mod3.SA3]>=25
% "c"[mod3.SX5,mod3.STEM25,mod3.SY5,mod3.SY3,mod3.STEM23,mod3.SX3]>=25
}
# definition of the Logol model
mod4()==>(("aaa")|("ccc")|("ttt")|("ggg"))
mod2()==>mod4(),(("aaa")|("ttt")),! "g":{#[1,1]}
mod3()==>(A5:{#[1,1],_SA5},S15:{#[2,14],_STEM15},Z5:{#[1,1],_SZ5}):{%"gc":50},
L1:{#[1,5]},
(X5:{#[1,1],_SX5},S25:{#[1,6],_STEM25},Y5:{#[1,1],_SY5}):{%"gc":50},
L11:{#[0,4]},
-"wcw" Z3:{?SZ5,_SZ3},-"wc" ?STEM15:{_STEM13}:{p$[0,34]},-"wcw" A3:{?SA5,_SA3},
L2:{#[4,40]},
-"wcw" Y3:{?SY5,_SY3},-"wc" ?STEM25:{_STEM23}:{p$[0,34]},-"wcw" X3:{?SX5,_SX3}
mod1()==>CC1:{#[2,2]},mod2(),SPACER2:{#[1,10]},mod3()
mod1()==*>SEQ1