Normalized maximum likelihood models in genomics, Ioan Tabus



SLIDE 1

1 NML models in genomics 8.7.2008

Department of Signal Processing

Normalized maximum likelihood models in genomics Ioan Tabus

Department of Signal Processing Tampere University of Technology

SLIDE 2

Universal distributions

Optimality of the Normalized Maximum Likelihood model

The universal distribution q(.) is the solution of two minimax problems:

  • 1. Best distribution for the worst string:

$$\min_{q(\cdot)} \; \max_{x_1,\ldots,x_n} \; \frac{P(x_1,\ldots,x_n, \hat{\theta}(x_1,\ldots,x_n))}{q(x_1,\ldots,x_n)}$$

  • 2. Best distribution q(.) for the average regret of the worst generating distribution g(.):

$$\min_{q(\cdot)} \; \max_{g(\cdot)} \; E_g \, \frac{P(x_1,\ldots,x_n, \hat{\theta}(x_1,\ldots,x_n))}{q(x_1,\ldots,x_n)}$$
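As a small illustration of the first minimax problem, the sketch below builds the NML distribution for binary strings under the Bernoulli model class, which is an assumption chosen here only because it is the simplest case. It uses the standard closed form q(x) = P(x, θ̂(x)) / Σ_y P(y, θ̂(y)) and checks numerically that the regret, the base-2 logarithm of the ratio P(x, θ̂(x)) / q(x), is the same for every string. The function names `ml_prob`, `nml_normalizer`, and `nml_prob` are hypothetical and do not come from the slides.

```python
from itertools import product
from math import comb, log2


def ml_prob(k, n):
    """Maximized Bernoulli likelihood of a binary string with k ones out of n."""
    if k == 0 or k == n:
        return 1.0  # the ML estimate is 0 or 1, so the string has probability 1
    p = k / n
    return p**k * (1 - p) ** (n - k)


def nml_normalizer(n):
    """Sum of maximized likelihoods over all 2**n binary strings of length n."""
    # All strings with the same number of ones share the same maximized likelihood.
    return sum(comb(n, k) * ml_prob(k, n) for k in range(n + 1))


def nml_prob(x):
    """NML probability of the binary string x (Bernoulli class, an assumption)."""
    n, k = len(x), sum(x)
    return ml_prob(k, n) / nml_normalizer(n)


n = 6
strings = list(product((0, 1), repeat=n))

# The NML assignment is a proper distribution over all strings of length n.
total = sum(nml_prob(x) for x in strings)

# Regret in bits for each string; NML makes it identical for every string.
regrets = [log2(ml_prob(sum(x), n) / nml_prob(x)) for x in strings]

print(f"sum of NML probabilities: {total:.6f}")
print(f"regret range: {min(regrets):.6f} .. {max(regrets):.6f} bits")
```

Because the regret is constant, no other distribution can lower the worst-case ratio: reducing it on one string would require raising q there, and therefore lowering q on some other string.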

SLIDE 3

Two related goals: DNA compression and DNA modelling

SLIDE 4

Finding the regressor in the past

SLIDE 5

The NML model for approximate matching

SLIDE 6

Memoryless model

SLIDE 7

The NML for memoryless discrete regression

SLIDE 8

SLIDE 9

Memory model

SLIDE 10

NML for the class of memory models

SLIDE 11

NML-1 Encoding algorithm

SLIDE 12

SLIDE 13

Compression of Human Genome (average 1.45 bit/base)

SLIDE 14

Compression ratio in bits/base when compressing the human genome (only A, C, G, T alphabet) with a 10 MB window size. Blue: bzip2, Red: GeNML.

[Figure: bits/base (vertical axis) versus chromosome (horizontal axis).]
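The bits/base figure quoted above can be reproduced in spirit with a short sketch. The code below is only an illustration of how the metric is computed, using Python's standard `bz2` module on two synthetic sequences; it does not implement GeNML, does not use real genome data, and the helper name `bits_per_base` is hypothetical. For reference, a plain fixed-length code for the four-letter alphabet costs 2 bits/base.

```python
import bz2
import random


def bits_per_base(seq: str) -> float:
    """Compressed size in bits divided by the number of bases in seq."""
    compressed = bz2.compress(seq.encode("ascii"))
    return 8 * len(compressed) / len(seq)


random.seed(0)

# Synthetic data (an assumption for illustration, not genomic sequences).
random_seq = "".join(random.choice("ACGT") for _ in range(100_000))
repeat_seq = "ACGTTGCA" * 12_500  # same length, highly repetitive

print(f"random sequence:     {bits_per_base(random_seq):.3f} bits/base")
print(f"repetitive sequence: {bits_per_base(repeat_seq):.3f} bits/base")
```

A general-purpose compressor gains nothing on the uniformly random sequence, while repeated content compresses far below 2 bits/base, which is the kind of redundancy a DNA-specific model aims to exploit.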

SLIDE 15

Approximate matching in DNA analysis

Use a universal coding of the binary mask resulting from matching two candidate sequences

Normalized maximum likelihood models

  • For memoryless sources (Bernoulli)
  • For sources with memory

Example: the DNA locus HUMGHCSA (about 65000 bases)

  • The genes GH-1 and GH-2 are human growth hormone genes
  • CS-5, CS-1, and CS-2 are chorionic somatomammotropin genes expressed either in pituitary gland or in placenta
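The idea of coding the binary mask with a memoryless NML model can be sketched as follows. This is a minimal illustration under stated assumptions: the mask marks mismatching positions with 1, the model class is Bernoulli, and only the mask is costed (the substituted bases and the pointer to the matching sequence are ignored). The names `mismatch_mask`, `ml_prob`, and `nml_code_length`, and the two toy sequences, are hypothetical.

```python
from math import comb, log2


def mismatch_mask(a: str, b: str) -> list[int]:
    """Binary mask of two aligned sequences: 0 where they agree, 1 where they differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [0 if x == y else 1 for x, y in zip(a, b)]


def ml_prob(k: int, n: int) -> float:
    """Maximized Bernoulli likelihood of a mask with k ones out of n."""
    if k == 0 or k == n:
        return 1.0
    p = k / n
    return p**k * (1 - p) ** (n - k)


def nml_code_length(mask: list[int]) -> float:
    """Code length in bits of the mask under the Bernoulli NML model."""
    n, k = len(mask), sum(mask)
    normalizer = sum(comb(n, j) * ml_prob(j, n) for j in range(n + 1))
    return -log2(ml_prob(k, n) / normalizer)


block = "ACGTACGTACGTACGT"      # toy sequence to be described
candidate = "ACGTACGTACCTACGT"  # toy earlier sequence, one substitution

mask = mismatch_mask(block, candidate)
bits = nml_code_length(mask)

print("mask:", "".join(map(str, mask)))
print(f"NML cost of the mask: {bits:.2f} bits (direct coding: {2 * len(block)} bits)")
```

Candidates with fewer mismatches yield cheaper masks, so the code length can serve as a score for ranking approximate matches.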

SLIDE 16

DNA duplications and their role in evolution and disease

SLIDE 17

Traditional approach to gene duplication

SLIDE 18

Hamming versus NML 0 (memoryless)

SLIDE 19

Optimizing the overall cost for duplication analysis

SLIDE 20

Encoding the pointers and the mask

SLIDE 21

Dynamic programming problem

SLIDE 22

SLIDE 23

NML with memory (orders 1 and 2)

SLIDE 24

SLIDE 25

Summarizing the duplication scenarios

SLIDE 26

NML Encoding

SLIDE 27

Efficient search of regressor

SLIDE 28

SLIDE 29

Renormalization to account for contiguous matches

SLIDE 30

Accounting for contiguous perfect matches

Unconstrained case: Constrained case

SLIDE 31

Accounting for contiguous perfect matches: efficient approach

SLIDE 32

Comparison of NML 1 and NML 2

SLIDE 33

Open avenues

  • Universal models provide efficient representation tools for genomic sequences
  • More refined model order selection procedures may better account for non-stationarity along sequences
  • The techniques are easy to extend to more adaptive tools
  • Derive the exact NML model for more structured classes of models with memory