Minimum Description Length Principle in Model Selection
Bono Nonchev
Faculty of Mathematics and Informatics
Contents

1. Information Theory
2. The MDL Principle
3. Model Selection
4. Model Complexity
Knowledge = compression
Regularities in data lead to compression. Examples:

  0101010101010101010101010101010101010...
  1101100111111101111110110011111111111...
  1010101000111010001110100011101011111...

Denote by x^n an n-tuple of real numbers: data generated by some process. We make inference about the generating process by finding a way to encode the data using the patterns it exhibits.

Kolmogorov complexity formalizes this idea, but has two problems:

- uncomputability
- arbitrariness (the complexity depends on the choice of universal machine, up to a constant)
Equivalence between code and distribution
Let x^n be a realization of the random vector X^n on (Ω, F, P). We encode x^n as a uniquely decodable string of 0s and 1s of length L(x^n). The shortest code length (in the expected sense) is achieved by a code that, for a given observation x^n, has length (Shannon-Fano coding)

  L(x^n) = −log P(x^n)

A probability distribution defines a code and vice versa. The requirement for integer code lengths is not essential.
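The code/distribution correspondence is easy to see numerically. A minimal sketch, assuming a hypothetical four-symbol distribution and ideal (non-integer) code lengths:

```python
import math

def code_length(p: float) -> float:
    """Ideal Shannon-Fano code length, in bits, for an outcome of probability p."""
    return -math.log2(p)

# A distribution over four symbols defines a code: likely symbols get short words.
dist = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = {s: code_length(p) for s, p in dist.items()}       # a:1, b:2, c:3, d:3 bits
expected_len = sum(p * lengths[s] for s, p in dist.items())  # = entropy = 1.75 bits
```

Conversely, a set of code lengths {l_s} defines the distribution p_s = 2^(−l_s).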
The MDL principle I
Restrict the class of models and codes to probability distributions. Define a set of candidate models H, e.g. N(µ, σ). Encode the data "optimally" with a code of length L(x^n | H) for each point hypothesis H ∈ H, e.g. H = {X^n ~ N(1.42, 0.443)}. Encode the point hypothesis itself "optimally" with code length L(H). The optimal point hypothesis H ∈ H is the one for which

  L(x^n | H) + L(H)

is minimal. It is not yet clear how to find the code lengths for H and for x^n | H.
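A toy sketch of the two-part criterion, assuming a hypothetical class of three candidate coin biases with the hypothesis itself coded uniformly (log2 |H| bits):

```python
import math

data = "0010010001000010"            # 16 observed coin tosses, 4 ones

hypotheses = [0.25, 0.5, 0.75]       # candidate values of P(heads)
L_H = math.log2(len(hypotheses))     # uniform code length for the hypothesis itself

def L_data_given(theta: float, x: str) -> float:
    """Code length of x, in bits, under the code induced by Bernoulli(theta)."""
    k, n = x.count("1"), len(x)
    return -(k * math.log2(theta) + (n - k) * math.log2(1 - theta))

# MDL picks the hypothesis minimizing the total two-part code length
best = min(hypotheses, key=lambda th: L_data_given(th, data) + L_H)
```

With 4 ones in 16 tosses, the bias 0.25 yields the shortest total description.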
The MDL principle II
L is called a universal code with respect to a family of codes 𝓛 if

  (1/n) [ L(x^n) − min_{L* ∈ 𝓛} L*(x^n) ] → 0  as n → ∞
Examples:

- Two-step coding: code H ∈ H "uniformly", then code x^n using the code corresponding to P(x^n | H).
- Bayesian approach (Minimum Message Length): define a prior probability P_H(H), then code using L(x^n) = −log P(x^n | H) − log P_H(H).

Instead of a set of codes 𝓛, we can examine a set of distributions M. An M ∈ M that corresponds to a universal code is called a universal model.
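Universality can be checked numerically. A hypothetical example with a finite grid of Bernoulli models: the Bayesian mixture code's per-symbol regret is bounded by log2 |H| / n, so it vanishes as n grows:

```python
import math

thetas = [i / 10 for i in range(1, 10)]   # a finite grid of Bernoulli models
prior = 1.0 / len(thetas)                 # uniform prior over the grid

def nll_bits(theta: float, k: int, n: int) -> float:
    """-log2 P(x^n | theta) for a binary sequence with k ones out of n."""
    return -(k * math.log2(theta) + (n - k) * math.log2(1 - theta))

def regret_per_symbol(k: int, n: int) -> float:
    # Bayes mixture code length: -log2 of sum over H of prior(H) * P(x^n | H)
    mix = -math.log2(sum(prior * 2.0 ** (-nll_bits(t, k, n)) for t in thetas))
    best = min(nll_bits(t, k, n) for t in thetas)   # best code in hindsight
    return (mix - best) / n

# per-symbol regret shrinks as n grows (here for sequences with 30% ones)
r_small = regret_per_symbol(3, 10)
r_large = regret_per_symbol(300, 1000)
```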
Measures of Goodness
For a model M and a distribution P̃, the regret measures how much we lose if we encode the data using P̃ instead of the best distribution in M:

  R_M(P̃, x^n) = −log P̃(x^n) − min_{P ∈ M} {−log P(x^n)}

To remove the dependence on x^n we take the maximum:

  R^max_M(P̃) = max_{x^n ∈ X^n} R_M(P̃, x^n)

For a parametric family M_θ we use the maximum likelihood estimate θ̂(x^n) and define the model complexity as

  COMP_n(M_θ) = Σ_{x^n ∈ X^n} P(x^n | θ̂(x^n))

Also called stochastic complexity.
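For a concrete feel for COMP_n, assume as an example the Bernoulli model class on binary strings of length n; the sum over all 2^n sequences collapses to a sum over counts:

```python
from math import comb

def bernoulli_comp(n: int) -> float:
    """COMP_n = sum over all binary x^n of P(x^n | theta_hat(x^n))."""
    total = 0.0
    for k in range(n + 1):                # group the 2^n sequences by their count k
        th = k / n                        # MLE for a sequence with k ones
        p = th ** k * (1 - th) ** (n - k) if 0 < k < n else 1.0
        total += comb(n, k) * p           # comb(n, k) sequences share this value
    return total
```

For n = 1 both sequences attain maximized likelihood 1, so COMP_1 = 2; for n = 2, COMP_2 = 2.5.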
Normalized Maximum Likelihood (NML) Distribution
Find a distribution for x^n that is universal w.r.t. M_θ. Idea: use P(x^n | θ̂(x^n)). This is not a probability distribution, so do the next best thing and normalize it:

  P̃_NML(x^n) = P(x^n | θ̂(x^n)) / Σ_{y^n ∈ X^n} P(y^n | θ̂(y^n))

The NML distribution achieves constant regret:

  R_{M_θ}(P̃_NML, x^n) = log COMP_n(M_θ)

NML is a universal model with code length

  L(x^n) = −log P(x^n | θ̂(x^n)) + log COMP_n(M_θ)
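A small numerical check (assumed example: Bernoulli model, n = 5) that the regret of the NML distribution is the same constant, log COMP_n, for every sequence:

```python
import math
from itertools import product

n = 5
seqs = list(product([0, 1], repeat=n))

def max_lik(x):
    """Maximized likelihood P(x | theta_hat(x)) under the Bernoulli family."""
    k, m = sum(x), len(x)
    th = k / m
    return th ** k * (1 - th) ** (m - k) if 0 < k < m else 1.0

comp = sum(max_lik(x) for x in seqs)             # COMP_n(M_theta)
p_nml = {x: max_lik(x) / comp for x in seqs}     # the NML distribution

# regret at x: -log2 p_nml(x) - (-log2 max_lik(x)); collect its distinct values
regrets = {round(math.log2(max_lik(x) / p_nml[x]), 10) for x in seqs}
```

The set `regrets` contains a single value, log2 COMP_n, confirming the constant-regret property.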
Properties of Model Complexity
- COMP_n(M_θ) is a measure of the complexity, or flexibility, of the family of distributions M_θ.
- When M_θ is discrete, COMP_n(M_θ) can be interpreted as the number of "essentially different" models in the family.
- COMP_n(M_θ) is invariant under a change of parametrization.
- Under suitable regularity conditions,

    log COMP_n(M_θ) = (k/2) log(n / 2π) + log ∫_{θ ∈ Θ} √|I(θ)| dθ + o(1),  as n → ∞

- It is possible that COMP_n(M_θ) = ∞.
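As a sanity check of the asymptotic formula, assume again the Bernoulli family (k = 1), where ∫√|I(θ)| dθ = ∫₀¹ dθ/√(θ(1−θ)) = π:

```python
import math
from math import comb, log

def log_comp_bernoulli(n: int) -> float:
    """Exact log COMP_n for the Bernoulli family on binary strings of length n."""
    total = 0.0
    for k in range(n + 1):
        th = k / n
        p = th ** k * (1 - th) ** (n - k) if 0 < k < n else 1.0
        total += comb(n, k) * p
    return log(total)

n = 1000
exact = log_comp_bernoulli(n)
approx = 0.5 * log(n / (2 * math.pi)) + log(math.pi)  # (k/2) log(n/2pi) + log ∫√I
# exact and approx agree up to the o(1) term
```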
Model Selection Problem
We observe a sample X_1, ..., X_n of a random variable and must decide from which of a myriad of distributions the sample originates.

Example:

- y = a x^b + Z (Stevens' model)
- y = a ln(x + b) + Z (Fechner's model)

More data patterns can be explained by Stevens' model than by Fechner's (see [Grünwald, Rissanen, 2007]). Our goal is to select between:

(N) the sample is from a distribution in N(µ, σ²I);
(T) the sample is from a distribution in T_ν(µ, σ²I).
Model Selection Solution
Ordinary information criteria (AIC, BIC, GIC, DIC) do not account for complexity beyond the number of free parameters. Using the MDL principle:

1. Calculate COMP_n(M_θ) for both models.
2. Calculate the MLEs of µ and σ and the log-likelihood of the data.
3. Select the model with the smallest total description length

     L(x^n) = −log f(x^n | µ̂, σ̂) + log COMP_n(M_θ)
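The same recipe can be sketched on a toy discrete pair (a fair coin vs. the full Bernoulli family, not the Gaussian/Student-t pair of the talk); note how the complexity penalty can overturn a raw likelihood advantage:

```python
import math
from itertools import product

n = 8
x = (1, 1, 0, 1, 1, 1, 0, 1)        # observed binary sample: 6 ones out of 8

def max_lik(seq):
    """Maximized likelihood P(seq | theta_hat) under the Bernoulli family."""
    k, m = sum(seq), len(seq)
    th = k / m
    return th ** k * (1 - th) ** (m - k) if 0 < k < m else 1.0

# Model A: a fair coin -- no free parameters, COMP = 1, so no penalty
L_A = n * 1.0                        # -log2 (1/2)^n = n bits

# Model B: the full Bernoulli family, penalized by its parametric complexity
comp_B = sum(max_lik(s) for s in product([0, 1], repeat=n))
L_B = -math.log2(max_lik(x)) + math.log2(comp_B)

chosen = "fair coin" if L_A <= L_B else "Bernoulli family"
```

Even though the Bernoulli family fits 6/8 ones better, its complexity term makes its total description length longer here.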
Solution for Gaussian Model with Known Variance
[Barron et al., 1998] show that when a (jointly) sufficient statistic T for θ exists,

  COMP_n(M_θ) = ∫_{X^n} P(x^n | θ̂(x^n)) dx^n = ∫_T P(t | θ̂(t)) dt

For the normal model with known variance, COMP_n(M_θ) = ∞. They propose using the conditional complexity instead, restricting the sample mean to A = { x^n : |(1/n) Σ x_i| ≤ R }:

  COMP_n(M_θ | x^n ∈ A) = ∫_A P(x^n | θ̂(x^n)) dx^n

In that case

  log COMP_n(M_θ | x^n ∈ A) = −(1/2) log π − log σ + (1/2) log n + log 2R
Solution for Gaussian Model with Unknown Variance
Use the sufficient statistics x̄ and s². Compute the conditional complexity for A = { |x̄| ≤ R, D ≤ s² }:

  COMP_n(M_θ | x^n ∈ A) = [ 2 n^{n/2} e^{−n/2} / ( √π Γ((n−1)/2) ) ] × 2R D^{−1}

The unconditional complexity is again infinite. An idea emerges: extract the last factor, 2R D^{−1}, and ignore it when comparing models.
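The "extract and ignore" idea works because the 2R D^{−1} factor is common to the models being compared, so it cancels in the difference of description lengths. A minimal sketch (the intrinsic values below are purely hypothetical):

```python
import math

def log_comp_conditional(intrinsic: float, R: float, D: float) -> float:
    """Hypothetical split: model-specific term plus the common log(2R/D) factor."""
    return intrinsic + math.log(2 * R / D)

# the difference between two models does not depend on the cutoffs R and D
diff_a = log_comp_conditional(3.1, R=10.0, D=0.1) - log_comp_conditional(2.4, R=10.0, D=0.1)
diff_b = log_comp_conditional(3.1, R=99.0, D=0.5) - log_comp_conditional(2.4, R=99.0, D=0.5)
```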
Complexity of Absolutely Continuous Location-Scale Family I
Define the p.d.f. of a multivariate location-scale family as

  f(x^n | µ, σ) = σ^{−n} g( (x^n − µ) / σ )

The conditional complexity, conditional on x^n ∈ A with A = { |µ̂(x^n)| ≤ R, D ≤ σ̂²(x^n) }, is the integral

  COMP_n(M_θ | x^n ∈ A) = ∫_A ( σ̂(x^n) )^{−n} g( (x^n − µ̂(x^n)) / σ̂(x^n) ) dx^n
Complexity of Absolutely Continuous Location-Scale Family II
Theorem. If δ is the Dirac delta function, then

  COMP_n(M_θ | x^n ∈ A) = 2R D^{−1} × ∫ δ(µ̂(y^n)) δ(1 − σ̂(y^n)) g(y^n) dy^n
                        = 2R D^{−1} × E_{Y^n}[ δ(µ̂(Y^n)) δ(1 − σ̂(Y^n)) ]

Corollary. The unconditional parametric complexity of an absolutely continuous location-scale family is either zero or infinity.

Note: any dependence structure within the sample is treated inside the integral.
Future work
- Calculate the parametric complexity of the multivariate Student-t using the theorem and the fact that x̄ and s² are the ML estimators of the parameters.
- Calculate the stochastic complexity of linear regression with Student-t distributed residuals.
- Extend the theorem and corollary to settings where the sample statistics are not ML estimators:
  - an i.i.d. sample with Student-t marginals;
  - stable and CTS models for time dependence.
Bibliography
- Barron, A., Rissanen, J. & Yu, B., "The Minimum Description Length Principle in Coding and Modeling", IEEE Transactions on Information Theory, vol. 44, no. 6, 1998.
- Stine, R. A. & Foster, D. P., "The Competitive Complexity Ratio", Conference on Information Sciences and Systems, Princeton University, March 15-17, 2000.
- Lanterman, A. D., "Schwarz, Wallace, and Rissanen: Intertwining Themes in Theories of Model Selection", International Statistical Review, vol. 69, no. 2, 2001, pp. 185-212.
- Grünwald, P. D. & Rissanen, J., "The Minimum Description Length Principle", The MIT Press, 2007.