Algorithm-Independent Learning Issues
Selim Aksoy, Bilkent University, Department of Computer Engineering, saksoy@cs.bilkent.edu.tr
CS 551, Spring 2005
Introduction: We have seen many learning algorithms and techniques for pattern recognition.
We want to estimate the generalization performance of a trained classifier for at least two reasons:
◮ to see if the classifier performs well enough to be useful,
◮ to compare its performance with that of a competing design.
Approaches for estimating generalization error and selecting models include:
◮ Parametric models
◮ Cross-validation
◮ Jackknife and bootstrap estimation
◮ Maximum likelihood model selection
◮ Bayesian model selection
◮ Minimum description length principle
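As an illustration of the cross-validation idea in this list, here is a minimal k-fold cross-validation sketch (NumPy only; the `train_fn`/`predict_fn` interface is a placeholder for any learning algorithm, not something from the slides):

```python
import numpy as np

def k_fold_cv_error(X, y, train_fn, predict_fn, k=10, seed=0):
    """Estimate a classifier's error rate by k-fold cross-validation.

    train_fn(X_train, y_train) -> model
    predict_fn(model, X_test)  -> predicted labels
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)                      # k disjoint validation folds
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])            # train on the other k-1 folds
        y_hat = predict_fn(model, X[val])               # test on the held-out fold
        errors.append(np.mean(y_hat != y[val]))
    return float(np.mean(errors))                       # average validation error
```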
Figure 1: The data set D is split into two parts for validation. The first part (e.g., 90% of the patterns) is used as a standard training set for learning; the other (i.e., 10%) is used as the validation set. For most problems, the training error decreases monotonically during training. Typically, the error on the validation set decreases at first, but then increases, an indication that the classifier may be overfitting the training data. Training is stopped at the first minimum of the validation error.
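A minimal sketch of this hold-out procedure, assuming an incrementally trainable classifier (scikit-learn's SGDClassifier is used here only as a convenient stand-in; the 90/10 split follows the caption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic two-class data; 90% for training, 10% held out for validation.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
n_train = int(0.9 * len(y))
X_tr, y_tr, X_val, y_val = X[:n_train], y[:n_train], X[n_train:], y[n_train:]

clf = SGDClassifier(random_state=0)
classes = np.unique(y)
best_err, best_epoch = 1.0, 0
for epoch in range(1, 51):
    clf.partial_fit(X_tr, y_tr, classes=classes)   # one more pass over the training set
    val_err = 1.0 - clf.score(X_val, y_val)        # error on the held-out validation set
    if val_err < best_err:                         # remember the first minimum
        best_err, best_epoch = val_err, epoch
print(f"stop at epoch {best_epoch}, validation error {best_err:.3f}")
```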
Figure 2: The evidence is shown for three models of different expressive power or complexity. For the particular data set D observed, Bayesian model selection states that we should choose h2, which has the highest evidence.
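For reference, the quantities behind this comparison (standard Bayesian model selection, written out here rather than copied from the slides): the posterior probability of a model hi and its evidence are

```latex
P(h_i \mid \mathcal{D}) \;\propto\; p(\mathcal{D} \mid h_i)\, P(h_i),
\qquad
p(\mathcal{D} \mid h_i) \;=\; \int p(\mathcal{D} \mid \theta, h_i)\, p(\theta \mid h_i)\, d\theta ,
```

and, assuming roughly equal model priors P(hi), we choose the model with the largest evidence p(D|hi).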
The evidence can be approximated as p(D|hi) ≃ p(D|θ̂, hi) · (∆θ/∆0θ), i.e., best-fit likelihood × Occam factor, where θ̂ is the maximum likelihood estimate of the parameters.
Figure 3: In the absence of training data, a particular model hi has available a large range of possible values of its parameters, denoted ∆0θ. In the presence of a particular training set D, a smaller range is available. The Occam factor, ∆θ/∆0θ, measures the fractional decrease in the volume of the model’s parameter space due to the presence of training data D.
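A toy illustration (the numbers are invented for this example, not taken from the slides): if the prior range of a parameter is ∆0θ = 10 and the data shrink the plausible range to ∆θ = 0.1, the Occam factor is

```latex
\frac{\Delta\theta}{\Delta^{0}\theta} \;=\; \frac{0.1}{10} \;=\; 10^{-2},
```

so a flexible model must fit the data about a factor of 100 better in best-fit likelihood before its evidence exceeds that of a simpler model whose Occam factor is close to 1.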
◮ Example (Figure 4): a d-dimensional Gaussian mixture with m components is fit to n training samples; both the number of components m and the covariance structure must be chosen.
◮ The candidate covariance models are Σj = σ²I, Σj = σj²I, Σj = diag({σjk²}, k = 1, ..., d), a common Σj = Σ, and an arbitrary Σj.
◮ The description length accounts for the n data points encoded with the model and for the model parameters: the mixing weights {αj}, the means {µj} and the covariance matrices {Σj}, j = 1, ..., m.
[Figure 4 panels: (a) true mixture; (b) Σj = σ²I; (c) Σj = σj²I; (d) Σj = diag({σjk²}); (e) Σj = Σ; (f) Σj arbitrary. Each panel pairs a plot of description length vs. the number of mixture components with a plot of the fitted Gaussians over the data (dim 1 vs. dim 2).]
Figure 4: Example fits for a sample from a mixture of three bivariate Gaussians. For each covariance model, the description length vs. the number of components (left) and the fitted Gaussians drawn as ellipses at one standard deviation (right) are shown. MDL with the arbitrary covariance matrix gave the smallest description length and also recovered the true number of components.
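A rough modern analogue of this experiment, assuming scikit-learn is available (its covariance types only loosely correspond to the models above, and BIC is used as a stand-in for MDL: BIC equals twice the two-part description length, so it selects the same model):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Sample from a mixture of three bivariate Gaussians (stand-in for the slide's data).
X = np.vstack([
    rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], 200),
    rng.multivariate_normal([4, 1], [[0.5, 0.0], [0.0, 2.0]], 200),
    rng.multivariate_normal([1, 4], [[1.5, -0.7], [-0.7, 0.7]], 200),
])

best = None
for cov_type in ["spherical", "diag", "tied", "full"]:    # analogous to the covariance models above
    for m in range(1, 8):                                 # candidate numbers of components
        gmm = GaussianMixture(n_components=m, covariance_type=cov_type,
                              random_state=0).fit(X)
        bic = gmm.bic(X)                                  # BIC = 2 x (two-part MDL description length)
        if best is None or bic < best[0]:
            best = (bic, cov_type, m)
print("selected model:", best)
```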
Reasons for combining multiple classifiers:
◮ Access to different classifiers, each developed in a different context and for a different representation of the same problem.
◮ Availability of multiple training sets, each collected at a different time or in a different environment, possibly using different features.
◮ Local performances of different classifiers, where each classifier may have its own region of the feature space in which it works best.
◮ Different performances due to different initializations and randomness inherent in the training procedure.
Combination architectures:
◮ Parallel: all classifiers are invoked independently and then their individual results are combined.
◮ Serial (cascading): individual classifiers are invoked in a linear sequence, so that the number of candidate classes for a given pattern is gradually reduced.
◮ Hierarchical (tree): individual classifiers are combined into a structure similar to a decision tree, with specialized classifiers at the nodes.
◮ A classifier combination is especially useful if the individual classifiers are largely independent, i.e., if they make different errors.
◮ This can be explicitly forced by using different training sets, different feature sets, or different classification algorithms.
◮ Some combiners are static, with no training required, while others are trainable.
◮ Some are adaptive, where the decisions of individual classifiers are weighted depending on the input pattern; non-adaptive combiners treat all input patterns the same.
◮ Different combiners expect different types of output from the individual classifiers: confidences (posterior probabilities), class rankings, or abstract decisions (a single class label).
Common combination schemes (a small sketch of voting and the Borda count follows this list):
◮ Majority voting: each classifier makes a binary decision (vote) for each class, and the final decision is made in favor of the class with the largest number of votes.
◮ Sum, product, maximum, minimum and median of the posterior probabilities computed by the individual classifiers.
◮ Class ranking: each class receives m ranks from m classifiers, and the class with the best overall rank is chosen.
◮ Borda count: for each class, the numbers of classes ranked below it by the individual classifiers are summed, and the class with the largest count wins.
◮ Weighted combination of classifiers.
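A minimal sketch of majority voting and the Borda count, assuming each classifier reports either a single label or a full ranking of the c classes (NumPy only; the function names are illustrative):

```python
import numpy as np

def majority_vote(labels):
    """labels: length-n array of class indices, one vote per classifier."""
    return int(np.argmax(np.bincount(labels)))     # class with the most votes

def borda_count(rankings, n_classes):
    """rankings: (n_classifiers, n_classes) array; rankings[i] lists the classes
    of classifier i from best to worst.  Each class scores the number of classes
    ranked below it, summed over classifiers."""
    scores = np.zeros(n_classes)
    for ranking in rankings:
        for position, cls in enumerate(ranking):
            scores[cls] += n_classes - 1 - position
    return int(np.argmax(scores))                  # class with the largest Borda count

# Example: 3 classifiers, 4 classes.
votes = np.array([2, 2, 1])
rankings = np.array([[2, 0, 1, 3],
                     [2, 1, 0, 3],
                     [1, 2, 3, 0]])
print(majority_vote(votes))                        # -> 2
print(borda_count(rankings, 4))                    # -> 2
```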
◮ Bayesian framework: suppose classifier i uses its own feature representation xi of the input pattern, i = 1, ..., n. The pattern is assigned to the class wk whose posterior given all representations is largest,
p(wk | x1, ..., xn) = max_{j=1,...,c} p(wj | x1, ..., xn),
where, by the Bayes rule,
p(wj | x1, ..., xn) = p(x1, ..., xn | wj) p(wj) / Σ_{j'=1}^{c} p(x1, ..., xn | wj') p(wj').
◮ Assuming the representations x1, ..., xn are conditionally independent given the class, p(x1, ..., xn | wj) = Π_{i=1}^{n} p(xi | wj), substituting into the Bayes rule above gives the product rule: assign the pattern to wk if
p(wk) Π_{i=1}^{n} p(xi | wk) = max_{j=1,...,c} p(wj) Π_{i=1}^{n} p(xi | wj).
◮ Further assuming that the posteriors do not deviate much from the class priors (and taking equal priors), the product rule can be approximated by the sum rule: assign the pattern to wk if
Σ_{i=1}^{n} p(wk | xi) = max_{j=1,...,c} Σ_{i=1}^{n} p(wj | xi).
◮ Approximating the sum by the maximum (using (1/n) Σ_{i=1}^{n} ai ≤ max_{i=1,...,n} ai for ai ∈ R) gives the max rule: assign the pattern to wk if
max_{i=1,...,n} p(wk | xi) = max_{j=1,...,c} max_{i=1,...,n} p(wj | xi).
◮ Similarly, bounding the product by the minimum (using Π_{i=1}^{n} ai ≤ min_{i=1,...,n} ai for 0 ≤ ai ≤ 1) gives the decision rule known as the min rule: assign the pattern to wk if
min_{i=1,...,n} p(wk | xi) = max_{j=1,...,c} min_{i=1,...,n} p(wj | xi).
◮ Since the sum rule with equal priors amounts to comparing the mean posteriors, a more robust alternative is the median rule: assign the pattern to wk if
med_{i=1,...,n} p(wk | xi) = max_{j=1,...,c} med_{i=1,...,n} p(wj | xi).
◮ Finally, hardening the posteriors into votes, ∆ki = 1 if p(wk | xi) = max_{j=1,...,c} p(wj | xi) and ∆ki = 0 otherwise, gives the majority vote rule: assign the pattern to wk if
Σ_{i=1}^{n} ∆ki = max_{j=1,...,c} Σ_{i=1}^{n} ∆ji.
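A compact sketch of these rules, assuming each of the n classifiers outputs a full posterior vector over the c classes and the class priors are equal, as in the simplified forms above (NumPy only):

```python
import numpy as np

def combine_posteriors(P, rule="sum"):
    """P: (n_classifiers, n_classes) array of posteriors p(wj|xi).
    Returns the index of the winning class under the chosen combination rule."""
    if rule == "sum":
        scores = P.sum(axis=0)
    elif rule == "product":
        scores = P.prod(axis=0)
    elif rule == "max":
        scores = P.max(axis=0)
    elif rule == "min":
        scores = P.min(axis=0)
    elif rule == "median":
        scores = np.median(P, axis=0)
    elif rule == "vote":                               # harden posteriors, then count votes
        votes = np.argmax(P, axis=1)
        scores = np.bincount(votes, minlength=P.shape[1])
    else:
        raise ValueError(rule)
    return int(np.argmax(scores))

# Example: 3 classifiers, 4 classes.
P = np.array([[0.10, 0.60, 0.20, 0.10],
              [0.05, 0.50, 0.40, 0.05],
              [0.30, 0.20, 0.45, 0.05]])
for rule in ["sum", "product", "max", "min", "median", "vote"]:
    print(rule, combine_posteriors(P, rule))
```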
For the combination to improve on the individual classifiers, the classifiers should make different errors (see the small calculation below):
◮ they should not agree with each other when they misclassify a pattern,
◮ or at least they should not assign the same incorrect class to a given pattern.
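To see why this matters, here is an illustrative calculation (the error rate 0.3 and the ensemble size 5 are invented for the example): if five classifiers err independently with probability 0.3 each, the majority vote errs only when at least three of them err at once.

```python
from math import comb

p, n = 0.3, 5                      # assumed per-classifier error rate and ensemble size
majority_error = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                     for k in range(n // 2 + 1, n + 1))
print(f"{majority_error:.3f}")     # ~0.163, well below the individual error of 0.3
```

If instead the five classifiers always made the same mistakes, the vote would err with the same probability 0.3, which is why diversity among the classifiers is essential.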
◮ In a weighted combination of classifiers, input-dependent weights can be obtained by a softmax normalization of gating values gi, e^{gi} / Σ_{i=1}^{n} e^{gi}, so that the weights are nonnegative and sum to one.
Two widely used resampling-based approaches (sketches follow this list):
◮ bagging, where multiple classifiers are built by training on bootstrap replicates of the original training set and their outputs are combined, typically by voting or averaging;
◮ boosting, where a sequence of classifiers is built by training each new classifier on a reweighted (or resampled) training set that emphasizes the patterns misclassified by the previous classifiers.
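A minimal bagging sketch, assuming scikit-learn decision trees as the base classifier and integer class labels (any unstable learner would do):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=25, seed=0):
    """Train n_estimators trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)          # sample n points with replacement
        models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the component classifiers by majority vote."""
    votes = np.stack([m.predict(X) for m in models])          # (n_estimators, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```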
◮ At each boosting iteration, the weights on the training patterns are updated and then normalized so that Σ_{i=1}^{n} αt+1(xi) = 1, i.e., they always form a distribution over the training set.
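A minimal AdaBoost-style sketch with decision stumps, assuming binary labels in {-1, +1} and scikit-learn for the base learner (the pattern weights follow the normalization above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=20):
    """y must be in {-1, +1}.  Returns (stumps, stump_weights)."""
    n = len(y)
    alpha = np.full(n, 1.0 / n)                       # pattern weights alpha_1(x_i) = 1/n
    stumps, betas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=alpha)
        pred = stump.predict(X)
        err = max(alpha[pred != y].sum(), 1e-10)      # weighted error (clipped to avoid log(0))
        if err >= 0.5:                                # no better than chance: stop
            break
        beta = 0.5 * np.log((1 - err) / err)          # confidence of this stump
        alpha *= np.exp(-beta * y * pred)             # emphasize the misclassified patterns
        alpha /= alpha.sum()                          # normalize: sum_i alpha_{t+1}(x_i) = 1
        stumps.append(stump)
        betas.append(beta)
    return stumps, betas

def adaboost_predict(stumps, betas, X):
    score = sum(b * s.predict(X) for s, b in zip(stumps, betas))
    return np.sign(score)                             # weighted vote of the stumps
```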
Table 1: Classifier combination methods.