 
              1 Normalized maximum likelihood models in genomics Ioan Tabus Department of Signal Processing Tampere University of Technology Department of Signal Processing NML models in genomics 8.7.2008
2 Universal distributions: optimality of the Normalized Maximum Likelihood model
The universal distribution q(.) is the solution of two minmax problems:
1. Best distribution for the worst string:
$$\min_{q(\cdot)} \max_{x_1,\dots,x_n} \frac{P(x_1,\dots,x_n,\hat{\theta}(x_1,\dots,x_n))}{q(x_1,\dots,x_n)}$$
2. Best distribution q(.) for the average regret of the worst generating distribution g(.):
$$\min_{q(\cdot)} \max_{g(\cdot)} E_g \frac{P(x_1,\dots,x_n,\hat{\theta}(x_1,\dots,x_n))}{q(x_1,\dots,x_n)}$$
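As an illustration of problem 1 (not taken from the slides), the sketch below assumes the Bernoulli model class on binary strings of a toy length n = 8. For that class the normalized maximum likelihood distribution is q(x) = P(x, θ̂(x)) / Σ_y P(y, θ̂(y)). The code checks numerically that q sums to one and that the ratio P/q is identical for every string, so no string is a worst case; a uniform q is included only for comparison. All function names are hypothetical.

```python
from itertools import product
from math import comb, log2


def max_likelihood(k: int, n: int) -> float:
    """Maximized Bernoulli likelihood of a binary string with k ones in n symbols."""
    p = k / n  # maximum likelihood estimate of P(symbol = 1)
    return p**k * (1 - p) ** (n - k)  # Python evaluates 0.0**0 as 1.0


def nml_normalizer(n: int) -> float:
    """Sum of maximized likelihoods over all 2**n strings, grouped by number of ones."""
    return sum(comb(n, k) * max_likelihood(k, n) for k in range(n + 1))


def nml_probability(x: tuple) -> float:
    """NML probability of one binary string under the Bernoulli class."""
    n = len(x)
    return max_likelihood(sum(x), n) / nml_normalizer(n)


n = 8  # assumed toy length, small enough to enumerate every string
strings = list(product((0, 1), repeat=n))

total = sum(nml_probability(x) for x in strings)
ratios = [max_likelihood(sum(x), n) / nml_probability(x) for x in strings]
worst_ratio_nml = max(ratios)
# Comparison only: a uniform q pays the most on the all-zero and all-one strings.
worst_ratio_uniform = max(max_likelihood(sum(x), n) / 2.0**-n for x in strings)

print(total, log2(worst_ratio_nml), log2(worst_ratio_uniform))
```

The base-2 logarithm of the worst-case ratio is the worst-case regret in bits; for the NML distribution it equals the logarithm of the normalizer.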
3 Two related goals: DNA compression and DNA modelling
4 Finding the regressor in the past
5 The NML model for approximate matching
6 Memoryless model
7 The NML for memoryless discrete regression
9 Memory model
10 NML for the class of memory models
11 NML-1 Encoding algorithm
13 Compression of Human Genome (average 1.45 bit/base)
14 Compression ratio in bits/base when compressing the human genome (only A, C, G, T alphabet) with a 10 MB window size. Blue: bzip2, Red: GeNML.
[Figure: bits/base (vertical axis, 0 to 2.5) versus chromosome (horizontal axis, 0 to 25).]
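The bits/base measure used for the bzip2 curve can be sketched as follows. This is a minimal illustration, assuming synthetic sequences and Python's standard `bz2` module; it does not reproduce the experiment in the figure (no human genome data, no 10 MB window), and `bits_per_base` is a hypothetical helper name.

```python
import bz2
import random


def bits_per_base(seq: str) -> float:
    """Compressed size in bits divided by the number of bases in the sequence."""
    compressed = bz2.compress(seq.encode("ascii"))
    return 8 * len(compressed) / len(seq)


random.seed(0)  # fixed seed so the synthetic sequence is reproducible
random_seq = "".join(random.choice("ACGT") for _ in range(100_000))
repetitive_seq = "ACGT" * 25_000  # same length, but highly redundant

rate_random = bits_per_base(random_seq)
rate_repetitive = bits_per_base(repetitive_seq)
print(rate_random, rate_repetitive)
```

A sequence of independent, uniformly distributed bases carries 2 bits/base, so a general-purpose compressor cannot do better than that on the random input, while the repetitive input compresses to a small fraction of a bit per base.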
15 Approximate matching in DNA analysis
Use a universal coding of the binary mask resulting from matching two candidate sequences.
Normalized maximum likelihood models:
• For memoryless sources (Bernoulli)
• For sources with memory
Example: the DNA locus HUMGHCSA (about 65000 bases)
• The genes GH-1 and GH-2 are human growth hormone genes
• CS-5, CS-1, and CS-2 are chorionic somatomammotropin genes
• expressed either in pituitary gland or in placenta
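The memoryless (Bernoulli) case can be sketched as below. This is a minimal illustration under stated assumptions: the two sequences are made up, they are assumed to be already aligned and of equal length, and only the binary match mask is coded (the bases at mismatch positions would need additional bits, which this sketch ignores). The function names are hypothetical.

```python
from math import comb, log2


def match_mask(a: str, b: str) -> list:
    """Binary mask: 1 where the two aligned sequences agree, 0 where they differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [1 if x == y else 0 for x, y in zip(a, b)]


def bernoulli_nml_bits(mask: list) -> float:
    """Code length in bits of a binary mask under the Bernoulli NML distribution."""
    n, k = len(mask), sum(mask)
    if n == 0:
        return 0.0

    def max_likelihood(j: int) -> float:
        p = j / n  # maximum likelihood estimate for a mask with j ones
        return p**j * (1 - p) ** (n - j)

    # Normalizer: maximized likelihood summed over all binary masks of length n.
    normalizer = sum(comb(n, j) * max_likelihood(j) for j in range(n + 1))
    return -log2(max_likelihood(k) / normalizer)


# Made-up candidate sequences that differ in two positions.
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "ACGTACGAACGTACGTACCT"

mask = match_mask(seq_a, seq_b)
hamming_distance = len(mask) - sum(mask)
mask_bits = bernoulli_nml_bits(mask)
perfect_bits = bernoulli_nml_bits([1] * len(mask))
print(hamming_distance, mask_bits, perfect_bits)
```

A perfect match costs fewer bits than a mask with mismatches, which is what lets the code length serve as a score for ranking candidate matches.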
16 DNA duplications and their role in evolution and disease
17 Traditional approach to gene duplication
18 Hamming versus NML 0 (memoryless)
19 Optimizing the overall cost for duplication analysis
20 Encoding the pointers and the mask
21 Dynamic programming problem
23 NML with memory (orders 1 and 2)
25 Summarizing the duplication scenarios
26 NML Encoding
27 Efficient search of regressor
29 Renormalization to account for contiguous matches
30 Accounting for contiguous perfect matches
• Unconstrained case:
• Constrained case
31 Accounting for contiguous perfect matches: efficient approach
32 Comparison of NML 1 and NML 2
33 Open avenues
• Universal models provide efficient representation tools for genomic sequences
• More refined model order selection procedures may better account for non-stationarity along sequences
• The techniques are easy to extend to more adaptive tools
• Derive the exact NML model for more structured classes of models with memory