APPLIED MACHINE LEARNING – 2011-2012
MACHINE LEARNING Overview

Exam Format
The exam lasts a total of 3 hours:
- Upon entering the room, you must leave your bags in a corner of the room; you are allowed to keep a couple of pens/pencils/erasers and a few blank sheets of paper.
- Bring your student card with you to write your SCIPER number on your exam sheet, as we will check your card.
- The exam is closed book, but you can bring one A4 page with personal handwritten notes, written recto-verso.
Formalism / Taxonomy:
- Know the core formalism, e.g. the notion of likelihood.
- Know the taxonomy of learning problems (e.g. supervised vs. unsupervised learning) and be able to give examples of algorithms in each case.

Principles of evaluation:
- Training vs. testing sets, cross-validation, ground truth.
- Know which method of evaluation to apply where (F-measure in clustering vs. classification, BIC, etc.).
For each algorithm seen in class, you should know:
– what it can do: classification, regression, structure discovery / reduction of dimensionality;
– what one should be careful about (limitations of the algorithm, choice of hyperparameters);
– the key steps of the algorithm, its hyperparameters, the variables it takes as input and the variables it outputs.
SVM
– What it can do: performs binary classification; can be extended to multi-class classification; can be extended to regression (SVR).
– What one should be careful about: e.g. the choice of kernel; too small a kernel width in Gaussian kernels may lead to over-fitting.
– Know the key steps of the algorithm, its hyperparameters, the variables it takes as input and the variables it outputs.
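As a minimal illustration of the kernel-width point above, the sketch below uses scikit-learn (an assumption of this example, not the course's toolchain); gamma plays the role of an inverse squared kernel width, so a large gamma means a small kernel width.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                      # 2-D inputs
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # non-linearly separable labels

for gamma in (0.1, 1.0, 100.0):  # large gamma = small kernel width
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(gamma, clf.score(X, y), len(clf.support_))

A very large gamma typically drives the training score toward 1 while the number of support vectors grows: a symptom of over-fitting.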
This overview is meant to highlight similarities and differences across the different methods presented in class. To be well prepared for the exam, read the slides, the exercises and their solutions carefully.
This class has presented groups of methods for structure discovery, classification and non-linear regression:
- Classification: SVM, GMM + Bayes
- Regression: SVR, GMR
- Structure discovery: PCA & clustering techniques (K-means, soft K-means, GMM)
Techniques for finding structure in data proceed by projecting or grouping the data from the original space into another space of lower dimension. The projected space is chosen so as to highlight particular features common to subsets of datapoints. Pre-processing step: the structure found may be exploited in a second stage by another algorithm for regression, classification, etc., as in the sketch below.
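A minimal sketch of that two-stage use, assuming scikit-learn (not necessarily the course's toolchain): PCA projects the data down to two dimensions before an SVM classifies it.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))      # 100 datapoints in a 10-D space
y = (X[:, 0] > 0).astype(int)       # illustrative labels

# Stage 1: structure discovery (PCA); Stage 2: classification (SVM)
model = make_pipeline(PCA(n_components=2), SVC(kernel="rbf"))
model.fit(X, y)
print(model.score(X, y))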
Principal Component Analysis (PCA)
[Figure: projection of the data from the original N-dimensional space onto q principal directions, q < N.]
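A minimal PCA sketch in numpy, under the usual conventions (X stored as M rows of N-dimensional datapoints; an illustration, not the course's reference code):

import numpy as np

def pca_project(X, q):
    Xc = X - X.mean(axis=0)            # center the data
    C = np.cov(Xc, rowvar=False)       # N x N covariance matrix
    w, V = np.linalg.eigh(C)           # eigenvalues in ascending order
    V = V[:, ::-1][:, :q]              # keep the q leading eigenvectors
    return Xc @ V                      # M x q projection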
All three methods for clustering seen in class (K-means, soft K-means, GMM) are solved through E-M (expectation-maximization). You should be able to spell out the similarities and differences across K-means, soft K-means and GMM: e.g. how they are similar in their representation of the clusters, how they differ in the optimization problem they solve, in their hyper-parameters, etc. A K-means sketch written as an E-M loop follows below.
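A minimal K-means sketch in numpy, phrased as an E-M loop (illustrative only; it assumes no cluster ever empties):

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]  # initial centers
    for _ in range(iters):
        # E-step: hard-assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        z = d.argmin(axis=1)
        # M-step: recompute each center as the mean of its points
        mu = np.array([X[z == k].mean(axis=0) for k in range(K)])
    return mu, z

Soft K-means replaces the hard assignment with responsibilities proportional to exp(-beta d^2), normalized over clusters; GMM additionally updates covariances and priors in the M-step.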
All clustering methods depend on choosing a good metric of similarity, which measures how similar subgroups of datapoints are. You should be able to list which metric of similarity can be used in each case and how this choice may impact the clustering:
- K-means: Lp-norm.
- Soft K-means: exponentially decreasing function of the distance, modulated by the stiffness; approximately an isotropic rbf (unnormalized Gauss) function.
- GMM: likelihood of each Gauss function; can use isotropic, diagonal and full covariance matrices.
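The three measures side by side for one datapoint and one center (a sketch; beta is the soft K-means stiffness, and the exp(-beta d^2) form is one common convention):

import numpy as np
from scipy.stats import multivariate_normal

x = np.array([1.0, 2.0])                      # datapoint
mu = np.array([0.0, 0.0])                     # cluster center

d = np.linalg.norm(x - mu, ord=2)             # K-means: Lp-norm (here p = 2)
beta = 1.0
w = np.exp(-beta * d ** 2)                    # soft K-means: unnormalized rbf
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])    # GMM: full covariance matrix
p = multivariate_normal(mu, Sigma).pdf(x)     # GMM: Gauss-function likelihood
print(d, w, p)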
Fundamental difference between clustering and classification: classification is supervised (class labels are available at training time), whereas clustering is unsupervised. Both are evaluated with an F-measure, but not in the same way: the clustering F-measure assumes a semi-supervised setting, in which only a subset of the points is labelled.
Clustering F1-measure:
(careful: similar to, but not the same as, the F-measure we will see for classification!)
Trade-off between clustering all datapoints of the same class into the same cluster and making sure that each cluster contains points of only one class.
- Picks, for each class, the cluster with the maximal F1-measure.
- Recall: proportion of the datapoints of the class that are correctly clustered.
- Precision: proportion of the datapoints in the cluster that belong to the class.
With M the number of labeled datapoints, C the set of classes, K the number of clusters, and n_{ik} the number of members of both class c_i and cluster k:

F(C, K) = \sum_{c_i \in C} \frac{|c_i|}{M} \max_k F(c_i, k)

F(c_i, k) = \frac{2 \, R(c_i, k) \, P(c_i, k)}{R(c_i, k) + P(c_i, k)}

R(c_i, k) = \frac{n_{ik}}{|c_i|}, \qquad P(c_i, k) = \frac{n_{ik}}{|k|}

The factor |c_i| / M weighs each class by the fraction of the labeled points it contains.
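A direct transcription of the formula in numpy (illustrative; y holds the true class of each labeled point, z its cluster index):

import numpy as np

def clustering_f_measure(y, z):
    M = len(y)
    total = 0.0
    for c in np.unique(y):
        in_c = (y == c)                      # members of class c
        best = 0.0
        for k in np.unique(z):
            n_ik = np.sum(in_c & (z == k))   # in class c and cluster k
            if n_ik == 0:
                continue
            R = n_ik / in_c.sum()            # recall R(c, k)
            P = n_ik / np.sum(z == k)        # precision P(c, k)
            best = max(best, 2 * R * P / (R + P))
        total += in_c.sum() / M * best       # weigh by class fraction
    return total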
Classification F-measure:
(careful: similar to, but not the same as, the F-measure we saw for clustering!)
Trade-off between classifying correctly all datapoints of the same class and making sure that each class contains points of only one class.
- True Positives (TP): number of datapoints of class 1 that are correctly classified.
- False Negatives (FN): number of datapoints of class 1 that are incorrectly classified.
- False Positives (FP): number of datapoints of class 2 that are incorrectly classified.
- Recall = TP / (TP + FN): proportion of datapoints of class 1 that are correctly classified.
- Precision = TP / (TP + FP): proportion of datapoints classified in class 1 that truly belong to class 1.
- F = 2 · Precision · Recall / (Precision + Recall).
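The same computation as a tiny sketch (the counts are made up for illustration):

def f_measure(tp, fn, fp):
    recall = tp / (tp + fn)          # TP / (TP + FN)
    precision = tp / (tp + fp)       # TP / (TP + FP)
    return 2 * precision * recall / (precision + recall)

print(f_measure(tp=40, fn=10, fp=10))   # 0.8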
GMM + Bayes vs. SVM: a non-linear boundary in both cases. Compute the number of parameters required for the same fit. [Figure: original two-class dataset; GMM fit with 1 Gauss function per class, but with a full covariance matrix; SVM fit with 7 support vectors.]
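One possible tally, assuming 2-D data (the slides may count slightly differently): a full-covariance Gaussian in d dimensions has d mean parameters and d(d+1)/2 covariance parameters, so for d = 2 each class costs 2 + 3 = 5, i.e. 10 parameters for the GMM. The SVM stores each support vector (d values) plus its weight \alpha_i and one bias b, i.e. 7 \times (2 + 1) + 1 = 22 parameters.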
We have seen two examples of kernel methods: SVM and SVR. Kernel methods implicitly search for structure in the data prior to performing another computation (classification or regression). The kernel trick exploits the observation that all linear methods for finding structure in data are based on computing an inner product across variables. This inner product can be replaced by a kernel function, a metric of similarity across datapoints that corresponds to an inner product after a mapping \phi of the data:

k : X \times X \to \mathbb{R}, \qquad k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle
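A minimal numerical check of that identity for the polynomial kernel k(x, x') = (x^T x')^2, whose feature map \phi can be written out explicitly in 2-D (a textbook example, not taken from the slides):

import numpy as np

def phi(x):
    # explicit feature map of the 2-D quadratic kernel (x . x')^2
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(x, xp) ** 2)        # kernel value: 16.0
print(np.dot(phi(x), phi(xp)))   # same value via the inner product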
SVR and GMR lead to a regressive model that computes a weighted combination of local predictors. In SVR, the computation is reduced to summing only over the support vectors. In GMR, the sum is over the set of Gaussians, whose centers are usually not located on any particular datapoint. In both cases the predictors m(x) are local!
SVR solution: y(x) = \sum_{i=1}^{M} \alpha_i^* \, k(x, x_i) + b (a sum over the M support vectors)

GMR solution: y(x) = \sum_{i=1}^{K} h_i(x) \, m_i(x) (a sum over the K Gaussians, with responsibilities h_i(x) and local predictors m_i(x))
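The two prediction rules side by side as a sketch (all numbers, and the Gaussian kernel, are made-up illustrations):

import numpy as np

k = lambda a, b: np.exp(-np.sum((a - b) ** 2))     # Gaussian kernel

def svr_predict(x, sv, alpha, b):
    # weighted sum over the support vectors only
    return sum(a * k(x, s) for a, s in zip(alpha, sv)) + b

def gmr_predict(hs, ms):
    # hs = responsibilities h_i(x) (sum to 1); ms = local predictions m_i(x)
    return float(np.dot(hs, ms))

x = np.array([0.5])
print(svr_predict(x, sv=[np.array([0.0]), np.array([1.0])],
                  alpha=[0.7, -0.2], b=0.1))
print(gmr_predict(hs=np.array([0.8, 0.2]), ms=np.array([0.3, 1.1])))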
SVR and GMR lead to the following regressive models. GMR solution: y(x) = \sum_{i=1}^{K} h_i(x) \, m_i(x). [Figure: fit obtained with 8 Gauss functions with full covariance matrices.]
SVR solution: y(x) = \sum_{i=1}^{M} \alpha_i^* \, k(x, x_i) + b. [Figure: the same fit obtained with 27 support vectors.]
SVR and GMR are based on the same probabilistic regressive model, but do not optimize the same objective function.
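Concretely, in their standard formulations (notation not taken from the slides): SVR minimizes a regularized \epsilon-insensitive loss, while GMR fits the joint density by maximizing the likelihood and then conditions on the input:

\min_{w, b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{M} \max\big(0, |y_i - f(x_i)| - \epsilon\big) \qquad \text{(SVR)}

\max_{\{\pi_k, \mu_k, \Sigma_k\}} \; \sum_{i=1}^{M} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}\big((x_i, y_i); \mu_k, \Sigma_k\big) \qquad \text{(GMM, then condition to obtain GMR)}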
This course covered a variety of topics that are core to machine learning. It gives you the basis to go and read recent advances in each of these topics. We hope that you will find this material useful and that you will use some of these techniques in your own work. If you do so, drop us a note; we would be glad to include your application as an example in future lectures!