Algorithm-Independent Learning Issues
Selim Aksoy
Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr
CS 551, Spring 2010
© 2010, Selim Aksoy (Bilkent University)
◮ We have seen many learning algorithms and techniques for pattern classification.
◮ Some of these algorithms may be preferred because of their lower computational complexity.
◮ Others may be preferred because they take into account some prior knowledge about the form of the data.
◮ Even though the Bayes error rate is the theoretical limit for the achievable error, it is rarely known in practice because it requires knowledge of the true class-conditional densities.
◮ Given practical constraints such as finite training data, we cannot expect any single learning algorithm to be uniformly best; we have to evaluate how well each candidate matches the problem at hand.
◮ We will explore several ways to quantify and adjust the match between a learning algorithm and the problem it addresses.
◮ We will also study techniques for integrating the outputs of individual classifiers with the goal of improving the overall decision.
◮ Are there any reasons to prefer one classifier or learning algorithm over another?
◮ Can we even find an algorithm that is overall superior to random guessing?
◮ The no free lunch theorem states that the answer to these questions is no.
◮ There are no context-independent or usage-independent reasons to favor one learning or classification method over another.
◮ If one algorithm seems to outperform another in a particular situation, it is a consequence of its fit to that particular problem, not evidence of its general superiority.
◮ It is the type of problem, prior distribution, and other information about the task that determine which form of classifier should provide the best performance.
◮ Therefore, we should focus on important aspects such as prior information, the data distribution, the amount of training data, and the cost functions, rather than searching for an overall best algorithm.
◮ To compare learning algorithms, we should use test data that are sampled independently of the training set, because the error on the training points is an optimistically biased estimate.
◮ Using the error on points not in the training set (also called the off-training-set or test error) gives a more reliable measure of generalization performance.
◮ There are at least two reasons for wanting to know the generalization rate of a classifier on a given problem:
◮ to see if the classifier performs well enough to be useful,
◮ to compare its performance with that of a competing design.
◮ Estimating the final generalization performance invariably requires making assumptions about the classifier or the problem.
◮ Occasionally, our assumptions are explicit (such as in parametric models), but often they are implicit and difficult to identify or relate to the final estimate.
◮ We will study the following methods for evaluation:
◮ Parametric models
◮ Cross-validation
◮ Jackknife and bootstrap estimation
◮ Maximum likelihood model selection
◮ Bayesian model selection
◮ Minimum description length principle
◮ One approach for estimating the generalization rate is to compute it from the assumed parametric model.
◮ Estimates for the probability of error can then be computed from the estimated parameters, for example using bounds such as the Bhattacharyya or Chernoff bounds for Gaussian models.
◮ However, such estimates are often overly optimistic.
◮ Finally, it is often very difficult to compute the error rate exactly even when the full probabilistic structure is known.
◮ In simple cross-validation, we randomly split the set of labeled training samples into two parts: a traditional training set and a validation set.
◮ We use one set as the traditional training set for adjusting the model parameters of the classifier.
◮ Since our goal is to obtain low generalization error, we train only until the error on the validation set reaches its minimum, using that error as a proxy for performance on unseen data.
◮ A simple generalization of this method is m-fold cross-validation, where the training set is randomly divided into m disjoint sets of equal size.
◮ Then, the classifier is trained m times, each time with a different one of these sets held out as the validation set.
◮ The estimated error is the average of these m errors.
◮ This technique can be applied to virtually every classification method; a short sketch follows.
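As a rough illustration of the procedure, here is a minimal m-fold cross-validation sketch in Python; the nearest-neighbor classifier, scikit-learn, and the synthetic data are assumptions made for the example, not part of the original slides.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def m_fold_error(X, y, build_classifier, m=5, seed=0):
    """Average validation error over m folds: train on m-1 folds, test on the held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), m)   # m disjoint index sets
    errors = []
    for i in range(m):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(m) if j != i])
        clf = build_classifier()
        clf.fit(X[train], y[train])
        errors.append(np.mean(clf.predict(X[val]) != y[val]))
    return float(np.mean(errors))

# Illustrative usage with synthetic two-class data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(m_fold_error(X, y, lambda: KNeighborsClassifier(n_neighbors=3), m=5))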
◮ Once we train a classifier using cross-validation, the validation error provides an estimate of its performance on unseen test data.
◮ If the true but unknown error rate of the classifier is p, and if k of the n independent, randomly drawn test samples are misclassified, then k has a binomial distribution, and the fraction of misclassified samples is the maximum likelihood estimate p̂ = k/n.
◮ The properties of this estimate for the parameter p of a binomial distribution are well known, so it can also be used to derive confidence intervals for the true error rate. A small numerical sketch follows.
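The sketch below just evaluates p̂ = k/n together with the binomial standard error; the counts are made-up values used only for illustration.

import math

def error_rate_estimate(k, n):
    """ML estimate of the true error rate p from k errors in n test samples,
    with the binomial standard error sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = k / n
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, std_err

# Made-up numbers: 23 misclassified out of 400 test samples.
p_hat, se = error_rate_estimate(23, 400)
print(f"estimated error rate = {p_hat:.3f} +/- {se:.3f}")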
◮ In the jackknife approach, we estimate the accuracy of a given algorithm by training the classifier n separate times, each time on the training set from which a different single sample has been deleted.
◮ This is the m = n limit of m-fold cross-validation, also called leave-one-out estimation.
◮ Each resulting classifier is tested on the single deleted sample, and the jackknife estimate of the accuracy is the mean of these leave-one-out accuracies (sketched below).
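A minimal leave-one-out sketch, assuming a generic classifier factory with scikit-learn-style fit/predict methods (an assumption for the example):

import numpy as np

def leave_one_out_accuracy(X, y, build_classifier):
    """Jackknife (leave-one-out) estimate: train on n-1 samples, test on the
    single deleted sample, and average over all n choices of the deleted sample."""
    n = len(y)
    correct = 0
    for i in range(n):
        mask = np.arange(n) != i                      # delete sample i
        clf = build_classifier()
        clf.fit(X[mask], y[mask])
        correct += int(clf.predict(X[i:i + 1])[0] == y[i])
    return correct / n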
◮ A bootstrap data set is created by randomly selecting n points from the training set with replacement, so some samples may appear more than once while others may not appear at all.
◮ Bootstrap estimation of classification accuracy consists of training B classifiers, each with a different bootstrap data set, and testing each of them on data not used for its training.
◮ The final estimate is the average of these B bootstrap accuracies (see the sketch below).
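A minimal sketch of bootstrap accuracy estimation; testing each classifier on its out-of-bag points is one common variant and is used here for concreteness, and the decision-tree component and B = 100 are arbitrary illustrative choices.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bootstrap_accuracy(X, y, build_classifier=DecisionTreeClassifier, B=100, seed=0):
    """Average accuracy over B bootstrap replicates: each replicate draws n points
    with replacement for training and is evaluated on the left-out (out-of-bag) points."""
    rng = np.random.default_rng(seed)
    n = len(y)
    accuracies = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)              # sample n points with replacement
        oob = np.setdiff1d(np.arange(n), idx)         # out-of-bag indices
        if len(oob) == 0:
            continue
        clf = build_classifier()
        clf.fit(X[idx], y[idx])
        accuracies.append(np.mean(clf.predict(X[oob]) == y[oob]))
    return float(np.mean(accuracies))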
◮ The goal of maximum likelihood model selection is to choose the model that best explains the observed training data.
◮ Let Mi represent a candidate model and let D represent the training data.
◮ The posterior probability of any given model can be computed using Bayes rule as P(Mi|D) ∝ p(D|Mi) P(Mi), where p(D|Mi) is the evidence for Mi and P(Mi) is our subjective prior over the space of models.
◮ In practice, the data-dependent evidence term p(D|Mi) dominates, and the prior P(Mi) over the models is often neglected.
◮ Therefore, in maximum likelihood model selection, we find the maximum likelihood parameters for each of the candidate models, calculate the resulting likelihoods of the data, and select the model with the largest such likelihood, as illustrated below.
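As a sketch of the procedure only: the candidate models below are Gaussian mixtures with different covariance structures, fitted with scikit-learn's GaussianMixture; the candidate set and the synthetic data are assumptions made for this example.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
D = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])

# Candidate models M_i: two-component mixtures with different covariance structures.
candidates = {
    "spherical": GaussianMixture(n_components=2, covariance_type="spherical", random_state=0),
    "diag": GaussianMixture(n_components=2, covariance_type="diag", random_state=0),
    "full": GaussianMixture(n_components=2, covariance_type="full", random_state=0),
}

# Fit the ML parameters of each model and record the resulting log-likelihood of D.
log_likelihoods = {}
for name, model in candidates.items():
    model.fit(D)
    log_likelihoods[name] = model.score(D) * len(D)   # score() is the mean per-sample log-likelihood

best = max(log_likelihoods, key=log_likelihoods.get)
print(log_likelihoods, "-> selected model:", best)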
◮ Bayesian model selection uses the full information over the parameter priors when computing the posterior probabilities P(Mi|D).
◮ In particular, the evidence for a particular model is the integral p(D|Mi) = ∫ p(D|θ, Mi) p(θ|Mi) dθ, where θ denotes the parameters of the candidate model.
◮ The posterior p(θ|D, Mi) is often strongly peaked at the maximum likelihood estimate θ̂, so the evidence can be approximated as p(D|Mi) ≃ p(D|θ̂, Mi) p(θ̂|Mi) ∆θ, i.e., the best-fit likelihood multiplied by an Occam factor.
◮ The Occam factor is the ratio of the volume in the parameter space that is consistent with the data D to the volume accessible under the prior.
◮ The Occam factor has magnitude less than 1; it is the factor by which the accessible hypothesis space shrinks as a result of observing the data.
◮ The more training data we have, the smaller the range of parameters that remain consistent with it, and hence the smaller the Occam factor.
◮ Once the posteriors for different models have been computed, we select the model with the largest posterior.
◮ The Minimum Description Length (MDL) principle tries to find a compromise between the complexity of the model and its fit to the data.
◮ Under the MDL principle, the best model is the one that minimizes the sum of the model's own description length and the description length of the training data when encoded with respect to that model.
◮ According to Shannon, the shortest code length needed to encode data D distributed according to a probability model p(D) is −log2 p(D) bits.
◮ The model complexity is measured as the number of bits needed to encode the model parameters.
◮ According to Rissanen, the code length needed to encode κM parameters, each estimated from n data points, is (κM/2) log2 n bits, so the total description length of a model M becomes −log2 p(D|θ̂) + (κM/2) log2 n.
◮ Once the description lengths for different models have been computed, we select the model with the smallest description length.
◮ It can be shown theoretically that classifiers designed with the minimum description length principle are guaranteed to converge to the ideal or true model in the limit of more and more data.
◮ As an example, let’s derive the description lengths for mixtures of Gaussians with different numbers of components m and different covariance structures.
◮ The total number of free parameters κ is the sum of the numbers of mixture weights, mean components, and covariance parameters, and depends on the covariance structure: Σj = σ²I, Σj = σj²I, Σj = diag({σjk²}, k = 1, ..., d), Σj = Σ (common), or Σj arbitrary (full).
◮ The first term describes the mixture weights {αj}, j = 1, ..., m, the second describes the means {µj}, j = 1, ..., m, and the third describes the covariance parameters {Σj}, j = 1, ..., m.
◮ Hence, the best m can be found as
m* = arg min over m of [ −log2 p(D | θ̂m) + (κm/2) log2 n ],
where θ̂m are the maximum likelihood estimates and κm is the number of free parameters of the m-component mixture (a rough sketch of this computation follows).
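The sketch below evaluates this description length for full-covariance mixtures; the parameter count for the full-covariance case, the use of scikit-learn's GaussianMixture, and the synthetic three-component data are assumptions made for this illustration.

import numpy as np
from sklearn.mixture import GaussianMixture

def description_length(X, n_components):
    """-log2 p(D | theta_hat) + (kappa / 2) * log2(n) for a full-covariance mixture."""
    n, d = X.shape
    gm = GaussianMixture(n_components=n_components, covariance_type="full",
                         random_state=0).fit(X)
    log_lik = gm.score(X) * n                      # total natural-log likelihood of the data
    # Free parameters: (m - 1) weights + m*d means + m*d*(d+1)/2 covariance entries.
    kappa = (n_components - 1) + n_components * d + n_components * d * (d + 1) // 2
    return -log_lik / np.log(2) + 0.5 * kappa * np.log2(n)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, (150, 2)) for m in (-2.0, 0.0, 2.0)])   # 3 true components
lengths = {m: description_length(X, m) for m in range(1, 7)}
print(min(lengths, key=lengths.get), "components minimize the description length")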
Figure: Description length versus the number of mixture components, together with plots of the data and the estimated covariances, for (a) the true mixture, (b) Σj = σ²I, (c) Σj = σj²I, (d) Σj = diag({σjk²}, k = 1, ..., d), (e) Σj = Σ (common covariance), and (f) Σj arbitrary (full covariance).
◮ Just like different features capture different properties of a pattern, different classifiers also capture different structures and relationships of these patterns in the feature space.
◮ An empirical comparison of different classifiers can help us choose one of them as the best classifier for the problem at hand.
◮ However, although most of the classifiers may have similar error rates, the sets of patterns misclassified by different classifiers do not necessarily overlap.
◮ Not relying on a single decision but rather combining the advantages of different classifiers is therefore intuitively promising for improving the overall accuracy.
◮ Such combinations are variously called combined classifiers, classifier ensembles, mixtures of experts, or pooled classifiers.
◮ Some reasons for combining multiple classifiers to solve a given classification problem:
◮ Access to different classifiers, each developed in a different context and for a different representation or description of the same problem.
◮ Availability of multiple training sets, each collected at a different time or in a different environment, possibly using different features.
◮ Local performances of different classifiers, where each classifier may have its own region in the feature space in which it performs best.
◮ Different performances due to different initializations and the randomness inherent in the training procedure.
◮ In summary, we may have different feature sets, different training sets, different classification methods, or different training sessions, all resulting in a set of classifiers whose outputs can be combined.
◮ Combination architectures can be grouped as:
◮ Parallel: all classifiers are invoked independently and their results are then fused by a combiner.
◮ Serial (cascading): individual classifiers are invoked in a linear sequence, where each stage reduces the number of possible classes considered by the next one.
◮ Hierarchical (tree): individual classifiers are combined into a structure similar to a decision tree.
◮ Selecting and training of individual classifiers:
◮ Combination of classifiers is especially useful if the individual classifiers are largely independent.
◮ This can be explicitly forced by using different training sets, different feature sets, or different classifier types.
◮ Combiner:
◮ Some combiners are static, with no training required, while others are trainable.
◮ Some are adaptive, where the decisions of individual classifiers are evaluated (weighted) depending on the input pattern.
◮ Different combiners use different types of output from the individual classifiers: confidence values (e.g., posterior probabilities), rank orders, or abstract class labels.
◮ Examples of classifier combination schemes are:
◮ Majority voting (each classifier makes a binary decision (vote) for each class, and the class with the largest number of votes is selected).
◮ Sum, product, maximum, minimum and median of the posterior probabilities estimated by the individual classifiers.
◮ Class ranking (each class receives m ranks from m classifiers, and these ranks are combined into a final ranking).
◮ Borda count (sum of the number of classes ranked below a given class by each classifier).
◮ Weighted combination of classifier outputs.
◮ We will study different combination schemes within a Bayesian framework.
◮ Given c classes w1, ..., wc with prior probabilities P(wj), and n feature vectors x1, ..., xn produced by n different classifiers (or representations), the Bayesian framework assigns a pattern to the class wk with the highest posterior, i.e.,
P(wk|x1, ..., xn) = max over j = 1, ..., c of P(wj|x1, ..., xn),
which is equivalent to maximizing p(x1, ..., xn|wj) P(wj).
◮ In a practical situation with limited training data, computing the joint density p(x1, ..., xn|wj) is difficult.
◮ However, we can train a separate classifier for each feature vector xi and combine their individual outputs.
◮ Product rule: Assuming that the feature vectors x1, ..., xn are conditionally statistically independent given the class, the decision rule becomes: assign the pattern to wk if
P(wk)^(1−n) ∏ i=1..n P(wk|xi) = max over j = 1, ..., c of P(wj)^(1−n) ∏ i=1..n P(wj|xi).
◮ Under the assumption of equal priors, the rule becomes: assign to wk if
∏ i=1..n P(wk|xi) = max over j = 1, ..., c of ∏ i=1..n P(wj|xi)
(see the sketch below).
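A small sketch of the product rule, assuming each of the n classifiers reports a posterior vector over the c classes; the posterior values and priors below are made up for illustration.

import numpy as np

# posteriors[i, j] = P(w_j | x_i) reported by classifier i; made-up values.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.5, 0.3, 0.2],
                       [0.6, 0.1, 0.3]])
priors = np.array([1 / 3, 1 / 3, 1 / 3])
n = posteriors.shape[0]

# Product rule: maximize P(w_j)^(1-n) * prod_i P(w_j | x_i) over the classes.
scores = priors ** (1 - n) * np.prod(posteriors, axis=0)
print("product rule decides class", int(np.argmax(scores)))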
◮ Sum rule: Assuming that the posterior probabilities from the individual classifiers do not deviate greatly from the prior probabilities, we can write P(wj|xi) ≈ P(wj)(1 + δji) with δji small.
◮ Substituting this approximation in the product rule and neglecting the higher-order terms gives the decision rule: assign to wk if
(1 − n) P(wk) + ∑ i=1..n P(wk|xi) = max over j = 1, ..., c of [ (1 − n) P(wj) + ∑ i=1..n P(wj|xi) ].
◮ Under the assumption of equal priors, the rule becomes: assign to wk if
∑ i=1..n P(wk|xi) = max over j = 1, ..., c of ∑ i=1..n P(wj|xi).
◮ Max rule: Approximating the sum in the sum rule by the maximum (using (1/n) ∑ i=1..n ai ≤ max i=1..n ai for ai ∈ R) gives the decision rule: assign to wk if
(1 − n) P(wk) + n max i=1..n P(wk|xi) = max over j = 1, ..., c of [ (1 − n) P(wj) + n max i=1..n P(wj|xi) ].
◮ Under the assumption of equal priors, the rule becomes: assign to wk if
max i=1..n P(wk|xi) = max over j = 1, ..., c of max i=1..n P(wj|xi).
◮ Min rule: Approximating the product in the product rule by the minimum (using ∏ i=1..n ai ≤ min i=1..n ai for 0 ≤ ai ≤ 1) gives the decision rule: assign to wk if
P(wk)^(1−n) min i=1..n P(wk|xi) = max over j = 1, ..., c of P(wj)^(1−n) min i=1..n P(wj|xi).
◮ Under the assumption of equal priors, the rule becomes: assign to wk if
min i=1..n P(wk|xi) = max over j = 1, ..., c of min i=1..n P(wj|xi).
◮ Median rule: Using the fact that the median is a robust estimate of the mean, the sum rule under equal priors can be modified as: assign to wk if
med i=1..n P(wk|xi) = max over j = 1, ..., c of med i=1..n P(wj|xi).
◮ Harmonic mean rule: Another measure of central tendency, the harmonic mean, can be used in the same way: assign to wk if
n / ∑ i=1..n (1 / P(wk|xi)) = max over j = 1, ..., c of n / ∑ i=1..n (1 / P(wj|xi)).
◮ Majority vote rule: If we set each classifier to make a binary decision (vote), ∆ji = 1 if P(wj|xi) = max over j = 1, ..., c of P(wj|xi) and ∆ji = 0 otherwise, the rule becomes: assign to wk if
∑ i=1..n ∆ki = max over j = 1, ..., c of ∑ i=1..n ∆ji.
◮ The votes received by each class from the individual classifiers are thus counted, and the class with the largest number of votes is selected (the sketch below illustrates these rules).
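The remaining rules can be sketched on the same kind of posterior matrix as before, again assuming equal priors and made-up posterior values.

import numpy as np

# posteriors[i, j] = P(w_j | x_i) from classifier i; made-up values, equal priors.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.4, 0.5, 0.1],
                       [0.6, 0.1, 0.3]])

decisions = {
    "sum":    int(np.argmax(posteriors.sum(axis=0))),
    "max":    int(np.argmax(posteriors.max(axis=0))),
    "min":    int(np.argmax(posteriors.min(axis=0))),
    "median": int(np.argmax(np.median(posteriors, axis=0))),
    # Majority vote: each classifier votes for its top class, and the votes are counted per class.
    "vote":   int(np.argmax(np.bincount(posteriors.argmax(axis=1),
                                        minlength=posteriors.shape[1]))),
}
print(decisions)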
◮ All of these combination methods are based on the assumption that the individual classifiers use conditionally independent representations (features).
◮ Furthermore, some additional conditions need to be satisfied for the combination to improve the overall performance.
◮ For example, the individual classifiers should not be highly correlated in their errors:
◮ they should not agree with each other when they misclassify a pattern, or
◮ at least they should not assign the same incorrect class to a given pattern.
◮ This requirement can be satisfied to a certain extent by using different feature sets, different training sets, and different classification methods.
◮ These combination algorithms require the output of the individual classifiers in the form of posterior probabilities.
◮ When the classifiers provide non-probabilistic output, these outputs must first be converted to probability estimates.
◮ Analog output: If the outputs of a classifier are analog values gj (such as the activations of a neural network), they can be converted using the softmax transformation
P̃(wj|x) = e^(gj) / ∑ i=1..c e^(gi)
(a one-line sketch follows).
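A minimal sketch of the softmax conversion; the activation values are invented for illustration.

import numpy as np

def softmax(g):
    """Convert analog classifier outputs g_j into probability estimates e^(g_j) / sum_i e^(g_i)."""
    e = np.exp(g - np.max(g))    # subtracting the maximum improves numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 0.5, -1.0])))   # made-up activations for three classes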
◮ Rank order output: If the output is a rank order list, the conversion can be done by assuming that the posterior probability of a class is inversely related to its rank in the list and normalizing the resulting values.
◮ One-of-c output: If the output is a one-of-c representation, in which a single category is identified, the conversion can be done by assigning the full probability mass to the chosen class (possibly discounted by the classifier's estimated accuracy).
◮ We have studied the use of resampling for generating training and validation sets when estimating the accuracy of a classifier.
◮ Resampling methods can also be used to build classifier ensembles.
◮ We will study:
◮ bagging, where multiple classifiers are built by bootstrapping the original training set, and
◮ boosting, where a sequence of classifiers is built by training each new classifier on data that emphasize the patterns misclassified by the previous ones.
◮ Bagging (bootstrap aggregating) uses multiple versions of the training set, each created by drawing samples from the original training set with replacement.
◮ Each of these bootstrap data sets is used to train a different component classifier.
◮ The final classification decision is based on the vote of the component classifiers (see the sketch after this list).
◮ Traditionally, the component classifiers are of the same general form, e.g., all decision trees or all neural networks.
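A minimal bagging sketch, assuming decision trees as the component classifiers, integer class labels 0, ..., c-1, and simple unweighted voting; the number of components is an arbitrary choice for the example.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit_predict(X, y, X_test, n_classifiers=25, seed=0):
    """Train component classifiers on bootstrap replicates and combine them by majority vote.
    Labels in y are assumed to be integers 0, ..., c-1."""
    rng = np.random.default_rng(seed)
    n = len(y)
    components = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, n, size=n)              # bootstrap sample (drawn with replacement)
        components.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    votes = np.array([clf.predict(X_test) for clf in components])   # shape (n_classifiers, n_test)
    # Majority vote across the component classifiers for each test pattern.
    return np.array([np.bincount(col).argmax() for col in votes.T])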
◮ A classifier/learning algorithm is informally called unstable if small changes in the training data lead to significantly different classifiers and relatively large changes in accuracy.
◮ Decision trees and neural networks are examples of unstable classifiers.
◮ In general, bagging improves recognition for unstable classifiers because it effectively averages over such discontinuities.
◮ In boosting, each training pattern receives a weight that determines its probability of being selected for the training set of an individual component classifier.
◮ If a training pattern is accurately classified, its chance of being used again in a subsequent component classifier is reduced.
◮ Conversely, if the pattern is not accurately classified, its chance of being used again is raised.
◮ The final classification decision is based on the weighted votes of the individual component classifiers.
◮ The popular AdaBoost (adaptive boosting) algorithm allows the designer to continue adding weak learners until some desired low training error has been achieved.
◮ Let αt(xi) denote the weight of pattern xi at trial t, where the weights are initialized uniformly, α1(xi) = 1/n for every xi.
◮ At each trial t = 1, ..., T, a classifier Ct is constructed from the training samples drawn according to the weight distribution αt.
◮ The error ǫt of this classifier is also measured with respect to these weights: ǫt is the sum of the weights αt(xi) of the patterns xi that Ct misclassifies.
◮ If ǫt is greater than 0.5, the trials terminate and T is set to t − 1.
◮ Conversely, if Ct correctly classifies all patterns so that ǫt is zero, the trials also terminate and T becomes t.
◮ Otherwise, the weights αt+1 for the next trial are generated by multiplying the weights of the patterns that Ct classifies correctly by βt = ǫt/(1 − ǫt) and renormalizing so that ∑ i=1..n αt+1(xi) = 1 (misclassified patterns thus receive relatively larger weights).
◮ The boosted classifier C* is obtained by summing the votes of the classifiers C1, ..., CT, where the vote of Ct is weighted by log(1/βt); a compact sketch follows.
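A compact sketch of this boosting-by-resampling procedure (sampling according to αt, βt = ǫt/(1 − ǫt), votes weighted by log(1/βt)); the decision-stump component classifier, the ±1 label encoding, and the handling of the ǫt = 0 case are assumptions of the sketch, not details given in the slides.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=20, seed=0):
    """AdaBoost by resampling: y must be in {-1, +1}; the components are decision stumps."""
    rng = np.random.default_rng(seed)
    n = len(y)
    alpha = np.full(n, 1.0 / n)                  # initial pattern weights alpha_1(x_i) = 1/n
    classifiers, votes = [], []
    for t in range(T):
        idx = rng.choice(n, size=n, p=alpha)     # draw a training set according to alpha_t
        C_t = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X[idx], y[idx])
        wrong = C_t.predict(X) != y
        eps = float(np.sum(alpha[wrong]))        # weighted error epsilon_t
        if eps >= 0.5:                           # too weak: stop and discard this classifier
            break
        if eps == 0.0:                           # perfect classifier: keep it and stop
            classifiers.append(C_t)
            votes.append(10.0)                   # arbitrary large finite vote (log(1/beta) is infinite)
            break
        beta = eps / (1.0 - eps)
        alpha[~wrong] *= beta                    # shrink weights of correctly classified patterns
        alpha /= alpha.sum()                     # renormalize so the weights sum to 1
        classifiers.append(C_t)
        votes.append(np.log(1.0 / beta))         # vote of C_t weighted by log(1/beta_t)
    return classifiers, votes

def adaboost_predict(classifiers, votes, X):
    """Weighted vote of the component classifiers; returns labels in {-1, 0, +1}."""
    scores = sum(v * clf.predict(X) for v, clf in zip(votes, classifiers))
    return np.sign(scores)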
◮ Provided that ǫt is always less than 0.5, it can be shown that the error of the boosted classifier on the training data approaches zero exponentially fast as T increases.
◮ A succession of weak classifiers {Ct} can thus be boosted to a strong classifier that is at least as accurate as, and usually much more accurate than, the best weak classifier on the training data.
◮ However, note that there is no guarantee of a corresponding improvement in the generalization performance on unseen test data.