Motifs distribution in DNA sequences

MOTIFS DISTRIBUTION IN DNA SEQUENCES, Stéphane ROBIN

MOTIFS DISTRIBUTION IN DNA SEQUENCES. Stéphane ROBIN, robin@inapg.inra.fr. UMR INA-PG / INRA, Paris, Mathématique et Informatique Appliquées. Bio-Info-Math Workshop, Tehran, April 2005. S. Robin (Motif statistics in DNA)


  1. MOTIFS DISTRIBUTION IN DNA SEQUENCES. Stéphane ROBIN, robin@inapg.inra.fr. UMR INA-PG / INRA, Paris, Mathématique et Informatique Appliquées. Bio-Info-Math Workshop, Tehran, April 2005. S. Robin (Motif statistics in DNA)

  2. Biological interest of motif statistics: four examples. Ex 1: Promoter motifs = structured motifs where polymerase binds to DNA. (Figure: two boxes v and w separated by a distance d, with 16 bps ≤ d ≤ 18 bps, located ≃ 100 bps upstream of the gene.) Which structured motifs occur almost (too?) systematically in upstream regions of the genes of a given species?
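A minimal sketch of how a structured motif of this shape could be located in a sequence. The slide does not name the boxes v and w, so `V_BOX` and `W_BOX` below are hypothetical placeholders, and the function name is an assumption made for illustration. Only the spacer constraint 16 ≤ d ≤ 18 comes from the slide.

```python
import re

# Hypothetical boxes, for illustration only: the slide does not give v and w.
V_BOX = "ttgaca"
W_BOX = "tataat"

def find_structured_motif(seq, v=V_BOX, w=W_BOX, d_min=16, d_max=18):
    """Return (start, spacer_length) for each start position where box v is
    followed by box w after a spacer of d_min..d_max arbitrary bases."""
    # A lookahead is used so that matches starting at overlapping positions
    # are all reported; at most one spacer length is reported per start.
    pattern = re.compile(
        "(?=(%s)([acgt]{%d,%d})(%s))"
        % (re.escape(v), d_min, d_max, re.escape(w))
    )
    return [(m.start(), len(m.group(2))) for m in pattern.finditer(seq.lower())]

# Toy sequence: v, then a 17-base spacer, then w.
example = "cc" + V_BOX + "a" * 17 + W_BOX + "gg"
hits = find_structured_motif(example)
```

Counting such hits in the upstream regions of many genes gives the observed frequency that the statistical model is then asked to judge.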

  3. Ex 2: CHI motifs in bacterial genomes. Chi = Crossover Hot-spot Initiator: defense function of the genome against the degradation activity of an enzyme. Known in several bacterial genomes: E. coli: gctggtgg; H. influenzae: gNtggtgg. (Figure 9.1: model of the interaction between RecBCD and Chi; Schbath, 95.) Text of the thesis page reproduced in the figure, translated from French: "To study the frequency of the motif Chi = GCTGGTGG in the sequence of E. coli, we successively fitted each of the models M0, M1, ..., M6 and computed the corresponding statistics U, which are asymptotically standard Gaussian (Part I); model M0 is the one in which the bases of the sequence are assumed independent. Table 9.1 shows that Chi is the most over-represented 8-letter word when any of the three models M0, M1 and M2 is fitted to the sequence. Moreover, it remains among the most over-represented 8-letter words when the order of the Markov model is increased. The fact that Chi is exceptionally frequent under every model therefore reflects a strong constraint with respect to all its subwords of length 2 to 7, since the number of occurrences of GCTGGTGG is always larger than the one predicted by the di-, tri-, tetra-nucleotides, etc." Is this motif unexpectedly frequent in some regions of the genome? If so, these regions may contain crucial functions.

  4. Ex 3: Palindromes = self-complementary words, e.g. g t t a a c paired with the complementary strand c a a t t g (the two strands read the same in opposite directions). Palindromes of length 6 are restriction sites (i.e. fragile sites) of the genome of E. coli. If they are especially avoided in some regions, these regions may be of major importance for the organism.
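The definition of a DNA palindrome can be checked mechanically: a word is a palindrome when it equals its own reverse complement. The sketch below illustrates this on the slide's example; the function names are assumptions introduced here, not taken from the talk.

```python
from itertools import product

COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def reverse_complement(word):
    """Complement every base, then read the result backwards."""
    return "".join(COMPLEMENT[base] for base in reversed(word.lower()))

def is_dna_palindrome(word):
    """True when the word equals its own reverse complement."""
    return word.lower() == reverse_complement(word)

def dna_palindromes(length):
    """Enumerate all DNA palindromes of the given length."""
    words = ("".join(letters) for letters in product("acgt", repeat=length))
    return [w for w in words if is_dna_palindrome(w)]

six_letter_palindromes = dna_palindromes(6)
```

For an even length 2k the first k letters determine the rest, so there are 4^k palindromes: 64 of the 4096 words of length 6.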

  5. Ex 4: Detection of unknown motifs. – Motifs with favorable functions should be unexpectedly frequent; – motifs with damaging functions should be unexpectedly rare. Even when we know nothing about them (except their length), such motifs may be detected solely because they have unexpected frequencies.

  6. A model: what for? Model = Reference. To be able to decide if something is unexpected, we first need to know what to expect. To avoid artifacts, the model should typically account for • the frequencies of nucleotides, or di-, or tri-nucleotides in the sequence, • the overlapping structure of the word, • possibly, the overall frequency of the word in the sequence. The choice of the model (Markov chain / compound Poisson process) depends on the question. (R., Rodolphe & Schbath; 05)
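As an illustration of "knowing what to expect", the sketch below computes the usual plug-in estimate of the expected count of a word under an order-1 Markov model, which accounts for the nucleotide and dinucleotide frequencies of the sequence. The slide does not give this formula; the estimator (product of dinucleotide counts divided by the product of the counts of the inner letters) and the function names are assumptions made for this example.

```python
def count_overlapping(seq, word):
    """Number of (possibly overlapping) occurrences of word in seq."""
    n = len(word)
    return sum(1 for i in range(len(seq) - n + 1) if seq[i:i + n] == word)

def expected_count_m1(seq, word):
    """Plug-in estimate of the expected count of word under an order-1
    Markov model fitted to seq:
        prod_i N(w_i w_{i+1}) / prod_{inner i} N(w_i)."""
    numerator = 1.0
    for i in range(len(word) - 1):
        numerator *= count_overlapping(seq, word[i:i + 2])
    denominator = 1.0
    for letter in word[1:-1]:
        denominator *= count_overlapping(seq, letter)
    return numerator / denominator if denominator else 0.0

toy_sequence = "acgacgacg"
observed = count_overlapping(toy_sequence, "acg")
expected = expected_count_m1(toy_sequence, "acg")
```

Comparing `observed` with `expected` is only the first step: deciding whether the gap is significant also requires the variance of the count, which depends on the overlapping structure of the word.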

  7. Overlapping structure of the word. Some words can overlap themselves (see Conway (Gardner, 74); Guibas & Odlyzko, 81). Such words tend to occur in clumps and have a less regular distribution along the sequence. Cdf of the distance Y between two occurrences under model M00 (figure, two panels): for w = (gatc), E(Y) = 256 bps and V(Y) = (256.2 bps)²; for w = (aaaa), E(Y) = 256 bps and V(Y) = (326.7 bps)².
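The effect of self-overlap on the spread of Y can be reproduced with a short computation. The sketch assumes that M00 means independent, equiprobable letters (consistent with E(Y) = 4^4 = 256) and uses a closed form derived from the classical generating function of the distance between occurrences for independent letters: E(Y) = 1/μ and V(Y) = (2A − 1)/μ² + (1 − 2ℓ)/μ, where μ is the probability of the word, ℓ its length and A the sum, over all self-overlaps, of the probability of the letters added by the overlapping copy. This formula and the function names are not taken from the slides.

```python
from fractions import Fraction

def overlap_weight(word, probs):
    """A = 1 + sum over shifts k at which the word overlaps itself of the
    probability of the k letters that complete the second copy."""
    n = len(word)
    total = Fraction(1)  # the trivial overlap (shift 0)
    for k in range(1, n):
        if word[k:] == word[:n - k]:  # a copy can start k letters later
            extra = Fraction(1)
            for letter in word[n - k:]:
                extra *= probs[letter]
            total += extra
    return total

def distance_moments(word, probs):
    """Mean and variance of the distance between successive occurrences,
    assuming independent letters with the given probabilities."""
    mu = Fraction(1)
    for letter in word:
        mu *= probs[letter]
    a = overlap_weight(word, probs)
    mean = 1 / mu
    variance = (2 * a - 1) / mu ** 2 + (1 - 2 * len(word)) / mu
    return mean, variance

UNIFORM = {base: Fraction(1, 4) for base in "acgt"}
mean_gatc, var_gatc = distance_moments("gatc", UNIFORM)
mean_aaaa, var_aaaa = distance_moments("aaaa", UNIFORM)
sd_gatc = float(var_gatc) ** 0.5
sd_aaaa = float(var_aaaa) ** 0.5
```

Under these assumptions both words have mean 256, and `sd_aaaa` is about 326.7, matching the slide. For gatc the sketch gives about 252.5, close to but not identical with the 256.2 transcribed on the slide, so the slide's exact setting may differ from the one assumed here.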

  8. Probabilities and distributions of interest: positions, distances, counts. (Figure 2.1: occurrences of a word w along a sequence S; $X_1$ and $X_4$ are the positions of its first and fourth occurrences, $Y$ is the distance between two successive occurrences and $Y^4$ the cumulated distance of order 4; in the example, $N(w) = 6$.)
     • Probability for a motif to occur in a sequence: $X_1$ → promoter motifs
     • Distribution of the number of occurrences: $N$
     • Distribution of the occurrences along the sequence: $Y^r$, $N(x) - N(x - y)$ → CHI motifs, palindromes
     Text of the thesis page (Chapter 2, Occurrences of motifs) reproduced behind the slide, translated from French: "[...] and the distances $Y$ that separate them. Figure 2.1 illustrates these definitions. By convention, the position of an occurrence is defined by the position of its last letter in the sequence. This convention is convenient, but arbitrary and by no means general. Our objective is to determine the exact distributions of the random variables (r.v.) $X$, $X_n$, $Y$ and $Y^r$. The interest of cumulated distances will appear in chapter 3.
     Method for obtaining the distribution. The distribution of the position $X = X_1$ of the first occurrence is obtained from its (probability) generating function. Writing $p(x) = \Pr\{X = x\}$, the generating function of $X$ is defined by
     $$\Phi_X(t) = \sum_{x \ge 1} p(x)\, t^x .$$
     This generating function is obtained by a two-step argument. (i) A recurrence is established on the probabilities $p(x)$: $p(x) = f[p(1), \dots, p(x-1)]$ (theorem 1, section 4.2.1). (ii) The generating function $\Phi_X(t)$ is deduced by multiplying this recurrence by $t^x$ and summing over $x \ge 1$ (theorem 2, section 4.2.2). The generating function $\Phi_Y$ of the distance $Y$ is obtained according to the same principle. This distance is meaningful in model M1 because, in this model, the distances separating successive occurrences are independent and identically distributed (i.i.d.). The generating functions of the later positions $X_n$ and of the cumulated distances $Y^r$ then follow directly from the independence of the distances:
     $$\Phi_{X_n}(t) = \Phi_X(t)\,[\Phi_Y(t)]^{n-1}, \qquad \Phi_{Y^r}(t) = [\Phi_Y(t)]^{r}. \qquad (2.2)$$"
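The probabilities p(x) = Pr{X_1 = x} of the first occurrence, and hence the probability that a motif occurs at all in a sequence of a given length, can be computed numerically. The sketch below does not implement the generating-function recurrence mentioned in the text; it swaps in a plain dynamic program over a prefix-matching automaton, and it assumes independent letters. Function names are introduced for illustration. As in the text, the position of an occurrence is that of its last letter.

```python
from fractions import Fraction

def build_automaton(word, alphabet):
    """delta[state][letter] = length of the longest prefix of word that is
    a suffix of word[:state] + letter (state = number of letters matched)."""
    delta = []
    for state in range(len(word)):
        row = {}
        for letter in alphabet:
            candidate = word[:state] + letter
            k = len(candidate)
            while k > 0 and candidate[-k:] != word[:k]:
                k -= 1
            row[letter] = k
        delta.append(row)
    return delta

def first_occurrence_pmf(word, probs, max_pos):
    """Return {x: Pr(first occurrence of word ends at position x)} for
    x = 1..max_pos, assuming independent letters with probabilities probs."""
    n = len(word)
    delta = build_automaton(word, sorted(probs))
    mass = [Fraction(0)] * n
    mass[0] = Fraction(1)  # nothing matched before the sequence starts
    pmf = {}
    for x in range(1, max_pos + 1):
        new_mass = [Fraction(0)] * n
        hit = Fraction(0)
        for state, weight in enumerate(mass):
            for letter, p in probs.items():
                target = delta[state][letter]
                if target == n:
                    hit += weight * p  # word completed for the first time
                else:
                    new_mass[target] += weight * p
        pmf[x] = hit
        mass = new_mass
    return pmf

UNIFORM = {base: Fraction(1, 4) for base in "acgt"}
pmf_aaaa = first_occurrence_pmf("aaaa", UNIFORM, 6)
pmf_gatc = first_occurrence_pmf("gatc", UNIFORM, 6)
# Probability that the word occurs at least once in a sequence of length 5.
prob_aaaa_in_5 = sum(pmf_aaaa[x] for x in range(1, 6))
```

Summing the values up to a length n gives the probability that the motif occurs in a sequence of length n, which is the quantity the slide associates with promoter motifs.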
