
Semi-supervised Object Detector Learning from Minimal Labels

Sudeep Pillai December 12, 2012

Abstract While traditional machine learning approaches to classification involve a substantial training phase with a significant number of training examples, in a semi-supervised setting the focus is on learning the trends in the data from a limited training set and simultaneously using the trends learned to label unlabeled data. The specific scenario that semi-supervised learning (SSL) focuses on is when labels are expensive or difficult to obtain. Furthermore, with the availability of large amounts of unlabeled data, SSL focuses on bootstrapping knowledge from training examples to predict labels for the unlabeled data, and propagating that labeling in a well-formulated manner. This report focuses on a particular semi-supervised learning technique, Graph-based Regularized Least Squares (LapRLS), that can learn from both labeled and unlabeled data as long as the data satisfies a limited set of assumptions. This report compares the performance of traditional supervised learning algorithms against LapRLS and demonstrates that LapRLS outperforms the supervised classifiers on several datasets, especially when the number of training examples is minimal. As a particular application, LapRLS performs considerably well on Caltech-101, an object recognition dataset. This report also describes the methods used for feature selection and dimensionality reduction to build a robust object detector capable of learning purely from a single training example and a reasonably large set of unlabeled examples.

1 Introduction

In a setting where labeled data is hard to find or expensive to attain, we can formulate the notion of learning from the vast amounts of unlabeled instances in the data, given a few minimal labels per class instance. Formally, semi-supervised learning addresses this problem by using a large amount of unlabeled data along with labeled data to make better predictions of the class of the unlabeled data. Fundamentally, the goal of semi-supervised classification is to train a classifier f from both the labeled and unlabeled data, such that it is better than the supervised classifier trained on the original labeled data alone. Semi-supervised learning has tremendous practical value in several domains [8], including speech recognition, protein 3D structure prediction, and video surveillance. In this report, the primary focus is to learn trends from labeled image data that may be readily available from human annotation or some external source, and to use the learned knowledge to label the ever-growing deluge of unlabeled images on the internet. In particular, the focus is on utilizing these semi-supervised techniques on Caltech-101 [4], [3], an object recognition dataset, while also providing convincing results on toy datasets.

2 Background

Before we delve into the details of the motivation and implementation behind semi-supervised learning, it is important to differentiate two distinct forms of semi-supervised learning settings. In semi-supervised classification, the training dataset contains some unlabeled data, unlike in the supervised setting. Therefore, there are two distinct goals: one is to predict the labels on future test data, and the other is to predict the labels on the unlabeled instances in the training dataset. The former is called inductive semi-supervised learning and the latter transductive learning [9].

2.1 Inductive semi-supervised learning

Given training data $\{(x_i, y_i)\}_{i=1}^{l}$ and $\{x_j\}_{j=l+1}^{l+u}$, inductive semi-supervised learning learns a function $f : \mathcal{X} \to \mathcal{Y}$ so that f is expected to be a good predictor on future data, beyond $\{x_j\}_{j=l+1}^{l+u}$.



2.2 Transductive learning

Given training data $\{(x_i, y_i)\}_{i=1}^{l}$ and $\{x_j\}_{j=l+1}^{l+u}$, transductive learning trains a function $f : \mathcal{X}^{l+u} \to \mathcal{Y}^{l+u}$ so that f is expected to be a good predictor on the unlabeled data $\{x_j\}_{j=l+1}^{l+u}$.

2.3 Assumptions

While it is reasonable that semi-supervised learning can use additional unlabeled data to learn a better predictor f, the key lies in the model assumptions about the relation between the marginal distribution P(x) and the conditional distribution P(y|x). Thus it is important to realize that an arbitrarily chosen semi-supervised learning technique will not always perform better than the supervised case, even with minimal labels.

Figure 1: Plots visualizing the cluster assumption (left column), manifold assumption (middle column), and the cluster/manifold assumption (right column)

2.3.1 Cluster & Manifold Assumption

The cluster assumption states that points which can be connected via multiple paths through high-density regions are likely to have the same labels. Figure 1 shows the high-density regions in red for the 2 moons dataset, where each cluster is defined by points that have multiple pathways between them and belong to a single moon. The manifold assumption dictates that each class lies on a different continuous manifold; in the case of figure 1, each of the moon clusters lies on a different manifold, which is apparent from the figure.

Keeping both these assumptions in mind, we can state the combined cluster/manifold assumption as follows: points which can be connected via a path through high-density regions on the data manifold are likely to have the same label. In a semi-supervised setting, the idea is therefore to use a regularizer that prefers functions which vary smoothly along the manifold and do not vary in high-density regions, as depicted in figure 1 (right column).

3 Graph-based Semi-Supervised Learning

To motivate the use of graphs in a semi-supervised setting, we refer back to the background section 2. Via the cluster/manifold assumption, we pick a regularizer that prefers functions that are differentiable, vary smoothly along the manifold, and do not vary in high-density regions. In order to label points that are similar to each other with the same label, we create a graph where the nodes represent the data points L ∪ U (labeled and unlabeled; this is discussed in further detail in later sub-sections) and the edges represent the similarity measure between data points.

Figure 2: The plots depict the motivation for graph-based learning where one can leverage the information from both labeled and unlabeled data


Figure 2 shows an example of a continuous manifold on which features lie, and the data correlation represented by edges of the graph. One can think of using the graph as a way to approximate the manifold (and density) on which the data lies. With this motivation in mind, we formalize the graph representation in our semi-supervised learning framework in the following sections.

3.1 Graph representation

Let $L = \{(x_1, y_1), \ldots, (x_l, y_l)\}$ be the labeled data with $y \in \{1, \ldots, C\}$, and $U = \{x_{l+1}, \ldots, x_{l+u}\}$ the unlabeled data, with usually $l \ll u$, and $n = l + u$. The formulation of semi-supervised learning in a graph setting can be defined as follows: given training data $\{(x_i, y_i)\}_{i=1}^{l}$ and $\{x_j\}_{j=l+1}^{l+u}$, the vertices represent the labeled and unlabeled instances L ∪ U. Obviously, this graph turns out to be fairly large. The semi-supervised classification setting requires learning what y value each of the nodes in the graph takes. This is accomplished by first connecting edges between each of the labeled instance nodes in the graph and the unlabeled instances. As expected, these edges represent the similarity between the nodes xi, xj involved in the edge eij. Let wij represent the edge weight, in other words, a similarity measure between nodes xi, xj.

3.1.1 Fully connected graphs

Here, each node (data point) xi is connected to every other data point xj with an edge weight wij that expresses the similarity between the nodes. An example weighting metric is:

$$w_{ij} = \begin{cases} \exp\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right) & e_{ij} \in E \\ 0 & e_{ij} \notin E \end{cases} \qquad (1)$$

where σ is a bandwidth parameter that controls how quickly the weight decreases. This weight has exactly the same form as a Gaussian function, hence it is called the Gaussian kernel or a Radial Basis Function (RBF). One advantage of a fully connected graph is the ability to learn the weights in closed form when a differentiable weight function (such as the one above) is used. The clear disadvantage, however, is the computational cost associated with instantiating an edge for every pair of points. In a continuous weighting scheme such as the one above, with $\|x_i - x_j\|^2$ denoting the squared Euclidean distance, the bandwidth parameters are different per feature dimension. This will also be discussed in detail later.

3.1.2 kNN graphs and ε-graphs

kNN graphs and ε-graphs are alternative forms of graph construction, specifically determining the final edge set E of the graph. In a kNN graph, each node finds its k nearest neighbor nodes via some distance metric, for instance Euclidean distance. Nodes xi, xj are connected by eij if either xi is one of the k nearest neighbors of xj or vice versa. In such a graph representation, the edge weight wij is 1 if the nodes are connected, and 0 otherwise. For ε-graphs, the nodes xi, xj are connected by eij if $\|x_i - x_j\| \le \varepsilon$. Similar to kNN graphs, the edges may be unweighted, with wij = 1 if connected and 0 otherwise.
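The two graph constructions above can be sketched in a few lines of numpy. This is an illustrative sketch, not the report's implementation; the function names (`rbf_weights`, `knn_graph`) are my own.

```python
import numpy as np

def rbf_weights(X, sigma=1.0):
    """Fully connected graph of eqn. (1): w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-edges
    return W

def knn_graph(X, k=3):
    """Unweighted kNN graph: w_ij = 1 if x_i is among the k nearest
    neighbors of x_j, or vice versa; 0 otherwise."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(sq, np.inf)                 # a point is not its own neighbor
    nn = np.argsort(sq, axis=1)[:, :k]           # indices of the k nearest neighbors
    W = np.zeros_like(sq)
    W[np.repeat(np.arange(len(X)), k), nn.ravel()] = 1.0
    return np.maximum(W, W.T)                    # symmetrize: "or vice versa"
```

Both functions return a symmetric n × n weight matrix, which is the form assumed by the graph Laplacian in the next section.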

3.2 The Graph Laplacian

The above-mentioned graph can be represented by an n × n weight matrix W, with wij as defined in eqn. 1. As expected, the weights are non-negative and symmetric. An important quantity from graph theory, the combinatorial graph Laplacian, is the matrix representation of a graph. The graph Laplacian is defined as

$$L = D - W \qquad (2)$$

where $D_{ii} = \sum_j W_{ij}$ is the degree of node i and W is the weight matrix as defined above.

In an attempt to define a continuous random field on the graph, the notion of a Gaussian random field is introduced. First, a real function over the nodes is defined by f : L ∪ U → R. A quadratic energy function is chosen to ensure that similar points are labeled with the same labels. An important point to note is that f is constrained to take the values f(i) = yi, i ∈ L on the labeled data. Using the Laplacian, one can define the energy function for a graph as:

$$E(f) = \frac{1}{2} \sum_{i,j} w_{ij} \left(f(i) - f(j)\right)^2 = f^T L f \qquad (3)$$
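The identity in eqn. 3 is easy to verify numerically. The sketch below (numpy assumed, helper names my own) builds L = D − W and checks that the pairwise energy sum matches $f^T L f$.

```python
import numpy as np

def graph_laplacian(W):
    """Combinatorial graph Laplacian L = D - W (eqn. 2), D_ii = sum_j W_ij."""
    D = np.diag(W.sum(axis=1))
    return D - W

def energy(f, W):
    """E(f) = 1/2 * sum_ij w_ij (f(i) - f(j))^2, the left-hand side of eqn. 3."""
    return 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)
```

On any symmetric W, `energy(f, W)` and `f @ graph_laplacian(W) @ f` agree, which is the identity the smoothness regularizer rests on.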


3.3 Overall Error Minimization

Minimization of the smoothness term $f^T L f$ can be achieved by the trivial solution $f = \mathbf{1}$, but in a semi-supervised setting such as this one, the minimization is a combination of the smoothness and the training loss (label agreement). For the squared training loss, this is defined as

$$J(f) = f^T L f + \sum_{i=1}^{l} \lambda \left(f(i) - y_i\right)^2 = \underbrace{f^T L f}_{\text{smoothness}} + \underbrace{(f - y)^T \Lambda (f - y)}_{\text{label agreement}} \qquad (4)$$

where Λ is the diagonal matrix whose diagonal elements are Λii = 0 for i ∈ U (unlabeled points) and Λii = λ for i ∈ L (labeled points). One can think of the λ term as a regularization term which penalizes the objective if the final solution at the labeled data points does not agree with the original training labels. The minimizer has a closed form, the solution to $(L + \Lambda)f = \Lambda y$:

$$f = (L + \Lambda)^{-1} \Lambda y \qquad (5)$$

Although the solution can be given in closed form for the squared error loss, it requires solving an n × n linear system, which poses serious problems as n grows large. Since this report does not focus on larger datasets, we shall limit ourselves to a reasonable dataset size. However, it is important to note that various approximations to the final solution exist for larger n, specifically as suggested in [5].
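A minimal sketch of the closed-form solve in eqn. 5, assuming numpy and ±1 labels (function name my own). Note that L + Λ is invertible only when every connected component of the graph contains at least one labeled point; otherwise the constant vector on an unlabeled component lies in the null space.

```python
import numpy as np

def solve_labels(W, y, labeled, lam=100.0):
    """f = (L + Lambda)^{-1} Lambda y  (eqn. 5).

    W       : n x n symmetric weight matrix
    y       : length-n vector, +/-1 on labeled points, 0 on unlabeled ones
    labeled : length-n boolean mask of labeled points
    """
    L = np.diag(W.sum(axis=1)) - W               # graph Laplacian, eqn. (2)
    Lam = np.diag(lam * labeled.astype(float))   # Lambda_ii = lam on L, 0 on U
    return np.linalg.solve(L + Lam, Lam @ y)     # solve (L + Lambda) f = Lambda y
```

On a chain graph labeled +1 at one end and −1 at the other, the unlabeled interior interpolates smoothly between the two labels, which is exactly the behavior the smoothness term $f^T L f$ enforces.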

3.4 Manifold regularization

Most of the discussion so far has focused on the transductive paradigm of semi-supervised learning. In section 3.3, we focused on learning a function f that is restricted to the labeled and unlabeled nodes in the graph. More importantly, there is no direct way to predict the label of a new test point x∗ unless it was included in the graph construction and learning process. Furthermore, with the training labels y fixed, the overall optimization does not allow for errors in the original training set. This has motivated the need for an inductive semi-supervised learning algorithm, called manifold regularization [1], that addresses both of the aforementioned issues. Here, f is defined over the whole feature space, f : X → R. f is regularized (via λ2) to be smooth with respect to the graph via the graph Laplacian; this regularizer controls the values of f on the l + u training instances. To further regularize f outside the training samples, another regularization term is introduced, $\|f\|^2 = \int_{x \in \mathcal{X}} f(x)^2 \, dx$. Putting these together, the objective for manifold regularization becomes

$$J(f) = \lambda_1 \|f\|^2 + \lambda_2 f^T L f + \sum_{i=1}^{l} \lambda \left(f(i) - y_i\right)^2 \qquad (6)$$

While an implementation of the above regularization was attempted, much progress could not be made before the conclusion of this report; the section is mentioned here since considerable effort was involved in the attempt. In the following sections, we employ the final solution in eqn. 5 to perform Laplacian RLS SSL on the datasets mentioned.

4 Feature Selection and Classification

An important aspect of learning from images is the representation of an image, whether global or local. The following sections describe the methodology and motivation behind global descriptors for images that allow for robust matching and inference on unlabeled datasets.

4.1 Feature Selection

For the experiments in this writeup, a global image descriptor is used to represent the entire image. It is important to note that there is no attempt to localize the object(s) within the image. The GIST descriptor was originally proposed in [6]. The main idea of the GIST descriptor is to develop a low-dimensional representation of the scene in the image, without any form of segmentation. The GIST descriptor extracts a set of perceptual dimensions (naturalness, openness, roughness, expansion, ruggedness) that represent the dominant spatial structure of a scene. The image is divided into a 4-by-4 grid for which orientation histograms (8 orientations) are extracted. This is done at 4 different image scales, which makes the descriptor a 512-dimensional (4 × 4 × 8 × 4) vector. The descriptor works purely on a


global scale, extracting the “gist” of an image; each image thus yields a single 512-dimensional descriptor.

Figure 3: The GIST descriptor for the corresponding image on the left. The image on the right shows average filter energy in each bin once oriented gabor filters are applied to the image.

4.1.1 Dimensionality Reduction

Most of these global image descriptors extract information from the images at multiple scales, which produces a lot of redundant information in the full high-dimensional representation. Specifically for the GIST descriptor, the 512-dimensional vector represents information from multiple scales and orientations, which makes the mutual information between its dimensions across images redundant. To eliminate this redundancy, or in essence reduce the dimensionality of the feature vector, principal component analysis (PCA) is performed on the training data. The resulting low-dimensional vector after dimensionality reduction should still be capable of representing the original high-dimensional vector with relatively good accuracy. A common technique is to reduce the dimensionality such that the reduced set of features captures 98% of the variance in the original high-dimensional feature set. By doing so, a reasonable k = 64 was estimated, and further results assume that the features have already been pre-processed to R64.

4.1.2 Feature space and distance metric

As expected, GIST features extracted from images lie on a feature space that is specific to the feature extraction technique, in this case GIST. As explained earlier, semi-supervised learning assumes that these features lie on a continuous manifold. [6], [2] suggest that Euclidean distance is a sufficient metric to compute the similarity measure between images. While other distance metrics may work equally well, the results look promising regardless of the distance metric used [2].
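The 98%-variance reduction described in section 4.1.1 can be sketched with a plain SVD-based PCA. This is an illustrative sketch; the report does not specify which PCA implementation was used, and the function name is my own.

```python
import numpy as np

def pca_reduce(X, var_kept=0.98):
    """Project the rows of X onto the smallest number of principal components
    whose cumulative explained variance reaches `var_kept`
    (98% in the report, which gave k = 64 for 512-d GIST descriptors)."""
    Xc = X - X.mean(axis=0)                        # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s ** 2 / (s ** 2).sum()            # variance fraction per component
    k = int(np.searchsorted(np.cumsum(explained), var_kept)) + 1
    return Xc @ Vt[:k].T, k
```

The returned k depends on how concentrated the variance is: data dominated by one direction reduces to a single component, while nearly isotropic data keeps almost all dimensions.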

4.2 Classification

Since the classification scheme discussed in the earlier sections was binary, we need to extend the classifier to handle multiple classes. While it is possible to formulate eqn. 5 in such a way that multi-class classification is possible, it is unclear whether this would have a closed-form solution. To this end, a one-versus-all classifier was employed to classify between the K classes.

4.2.1 One-versus-all classifier

One approach is to employ a one-versus-all scheme in which K different binary classifiers are built. Each classifier fk, a solution to eqn. 5, is responsible for predicting as positive those examples that fall within the kth class and classifying the rest of the dataset as negative. The one-versus-all prediction is then given by f(x) = arg maxi fi(x), breaking ties arbitrarily. While one-versus-all seems to fit the problem at hand well, there are a few scenarios in which it can be troublesome. In cases where there are multiple objects within the same image, one-versus-all classification determines that the point could belong to either class, but arbitrarily picks one of the classes.
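Combining eqn. 5 with the one-versus-all scheme gives a sketch like the following (numpy assumed, names my own): one binary solve per class, sharing the factorizable system matrix L + Λ, followed by an argmax over the K responses.

```python
import numpy as np

def one_vs_all_laprls(W, labels, labeled, lam=100.0):
    """One binary solve of eqn. (5) per class (+1 = class k, -1 = rest),
    then f(x) = argmax_k f_k(x), breaking ties arbitrarily."""
    L = np.diag(W.sum(axis=1)) - W
    Lam = np.diag(lam * labeled.astype(float))
    A = L + Lam                                    # system matrix shared by all K solves
    classes = np.unique(labels[labeled])
    F = np.empty((len(classes), len(labels)))
    for i, k in enumerate(classes):
        y = np.where(labels == k, 1.0, -1.0) * labeled  # +/-1 on labeled, 0 on unlabeled
        F[i] = np.linalg.solve(A, Lam @ y)         # one-vs-all response f_k
    return classes[np.argmax(F, axis=0)]
```

On a graph with two well-separated clusters and one labeled point per class, the argmax assigns every unlabeled node to the cluster its labeled point belongs to.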


5 Experiments and Results

In the following section, the performance of the Graph-based Laplacian Regularized Least Squares (LapRLS) algorithm is evaluated on a variety of toy datasets and on Caltech-101 [4], [3], an object recognition dataset. To further understand and appreciate the utility of semi-supervised learning, the graph-based LapRLS is compared against traditional supervised learning algorithms: Regularized Least Squares (RLS), SVMs with linear kernels, and SVMs with an RBF kernel. The real motivation behind using semi-supervised learning algorithms is to leverage the availability of unlabeled data and the ability to learn from a minimally labeled training set. The performance of these algorithms, including LapRLS, is therefore also tested on a minimally labeled dataset to truly appreciate the capabilities of SSL techniques.

5.1 Toy Datasets

In order to gauge and visualize the performance of the aforementioned algorithms, a few toy datasets were generated. These toy datasets could conceivably be a simplified model of the high-dimensional manifolds of the GIST feature space. Since the data is 2D, we use it as-is to build the weight matrix and Laplacian matrix. The edge weights are determined using an RBF with σ = 0.35 for both the SVM-RBF kernel and LapRLS.

5.2 2 Moons Dataset

A classical dataset for classification is the 2 moons dataset, where the data points can easily be classified by visual inspection, but are not separable by linear-kernel SVMs or RLS. The plots in figure 4 below show the performance of each of the classifiers discussed, with LapRLS outperforming the rest, especially when the number of training examples is minimal.


Figure 4: Plots comparing the performance of RLS (1st column), Linear-SVM (2nd column), RBF-SVM (3rd column), and LapRLS (last column) with a RBF kernel with increasing number of training examples. The decision boundaries clearly indicate that the LapRLS outperforms in each of the scenarios and predicts decision boundaries even with a minimal training set (colored points are labeled training examples Nt).


As expected, RLS and the linear SVM fail to classify the data correctly. While this is attributed solely to the fact that the data is not linearly separable, it is interesting to note that for the linear SVM/RLS case, if the classes are sparsely located, the linear SVM performs equally well with minimal labels. This is discussed in another toy dataset introduced later.

5.2.1 Performance compared to kNN classifiers

It is of particular interest to understand why the performance of LapRLS with an RBF kernel is significant compared to the other classifiers in this case. The first observation is that the data lies predominantly on continuous manifolds; here, two such manifolds are represented, one for each class. This is exactly the assumption made while formulating the graph-based Laplacian RLS framework. Since there is continuity in the data manifold, one can propagate labels accordingly by minimizing the overall objective function while keeping the regularity of the manifold consistent. Compared to a kNN classifier, LapRLS avoids labeling the second manifold (data from a different class) with the same label, since there are high derivatives near regions where the manifolds come close to each other. Figure 5 compares a traditional kNN classifier with the graph-based LapRLS for different values of k. The plots clearly show that the RBF-LapRLS outperforms the kNN classifier in most cases (both when Nt is low and when it is high). In the case of kNN classifiers, the high misclassification error rate can be attributed to outliers in the data that tend to propagate labels undesirably, leading to overall misclassification of the data.

5.2.2 Performance with Minimal Labels

When only a few training examples are available, both linear SVMs and linear RLS tend to find a half-plane that best classifies the training examples without utilizing the full information available (in this case, the unlabeled data). In figure 4, columns 1 and 2 show the final performance of the RLS classifier and the linear-SVM classifier. The linear-SVM classifier was modified to include a slack variable for datasets that are not linearly separable. Both classifiers show similar results, but overall poor performance on the 2 moons dataset. While RLS and the linear SVM have relatively low misclassification error with fewer training examples Nt in this case, as Nt increases, the misclassification rates tend to increase. In the case of the RBF-SVM (3rd column in figure 4), performance tends to increase with increasing Nt. This is expected, as the classifier tends to overfit the training data, and as Nt approaches N (the full test/train dataset size), the performance should converge to optimal. As a side note, in order to compare the performance of the RBF-SVM and the RBF-LapRLS, both variants were provided with the same kernel parameters. The interesting performance measures are when the training size Nt is minimal (1st and 2nd rows). Here we see that LapRLS outperforms the RBF-SVM by correctly classifying all the unlabeled points in the dataset. It does this by propagating the labels within each of the manifolds (regularizing the labels) while keeping the labels in agreement with the original training set. The RBF-SVM, on the other hand, uses purely the training examples to make decisions on the classification boundaries. This results in poor performance and a lack of generalization for this dataset, as seen in rows 1, 2, and 3 of figure 4 (columns 3 and 4, comparing RBF-SVM and LapRLS). LapRLS performs substantially better than any of the classifiers compared, whether with minimal data or with the full training data Nt.


Figure 5: Plots showing the decreasing misclassification error rate with increasing number of training examples Nt. The figure to the left compares various kNN classifiers with the LapRLS classifier. The figure to the right compares the RLS, Linear-SVM, RBF-SVM and the LapRLS misclassification error rate. Both plots show how the LapRLS outperforms other classifiers in both cases.


In short, utilizing unlabeled data does not necessarily hurt the classification as long as the data belongs to one of the classes in question. If, however, there are spurious data points corresponding to a class that is not well represented or labeled in the training set, they tend to bias the classification, especially when their manifolds overlap or approach each other.

5.2.3 Swiss Roll and Lines Dataset

Figure 6 shows how the RBF-LapRLS outperforms the previously mentioned classifiers on the Swiss Roll dataset. In this particular dataset, the label assignment is unclear close to the middle but becomes more apparent as you backtrack toward the exteriors. The same experiments and performance measures were run for the swiss roll dataset, and figure 7 summarizes the results.


Figure 6: Plots showing the LapRLS classification outperforming k-NN classifiers, RLS, linear-SVM, and RBF-SVM classifiers

It is interesting to note that in this particular dataset, even LapRLS fails to classify the test set completely correctly. This is due to the ambiguity in the label assignment discussed previously, and to the overlap in the class-specific manifolds that results in false labeling in the overall optimization. Nevertheless, LapRLS still performs better than each of the classifiers discussed.


Figure 7: Plots showing the outperformance of the LapRLS algorithm against kNN classifiers, RLS, Linear-SVM, and RBF-SVM on the swiss roll dataset. LapRLS performs significantly better even with minimal labels and continues to perform well with increasing training labels Nt


As a final remark, the same set of algorithms was tested on a simple linearly separable Gaussian dataset. From figure 8, it is evident that the performance of the linear SVM and RLS is on par with the RBF-SVM and LapRLS. This is peculiar, since one would not expect a linear model to perform so well with limited training data. It is purely a consequence of the classes being sparsely located while the class-specific data has high density. In figure 8, each of the methods (RLS, Linear-SVM, RBF-SVM, and LapRLS) performs equally well without making any assumptions on the data. Since the performance differences are obvious in this case, the misclassification error plots have been omitted.


Figure 8: Plots showing similar performance of RLS, Linear-SVM, RBF-SVM and LapRLS with limited training data Nt

5.3 Caltech-101

In order to test the true performance and utility of the Graph-based Laplacian Regularized Least Squares (LapRLS) approach, the algorithm was tested on Caltech-101, an object recognition dataset. The dataset consists of 101 classes, each of which contains approximately 800 images per object class. Since we limit ourselves to understanding the trends in the data using minimal labels, only 7 classes (Faces, Airplanes, Motorbikes, Cars, Sunflower, Chair, Emu) were considered in our experiments and tests.

5.3.1 Feature extraction

As previously discussed in section 4, each image in the dataset is considered a node/vertex in the graph Laplacian framework. The GIST descriptor is extracted from each of the images in the dataset, and it is the collection of these features that is used to represent the dataset. The GIST descriptor being a 512-dimensional vector with a lot of mutual information between descriptors, its dimensionality is reduced to 64 dimensions, still retaining over 98% of the variance in the data.

5.3.2 Graph-based Laplacian RLS

Once the features are extracted and reduced to R64, the L2 distance is computed between each pair of nodes in the dataset to build an n × n weight matrix, as discussed in section 3. LapRLS again uses an RBF kernel to account for the non-linear but continuous manifolds on which these R64 GIST features lie. Figure 9 shows the performance of LapRLS versus kNN classifiers (k = 2, . . . , 6) with increasing number of training examples Nt. Once again, LapRLS outperforms each of the kNN classifiers in most cases (at Nt = 20, average per-class misclassification error 0.06%), except for Nt = 1 (average per-class misclassification error 0.43%). This may be attributed to the fact that the class-specific manifolds these GIST features lie on may not be entirely continuous (or may have high derivatives), causing the regularization of the label assignment in LapRLS to perform poorly. It could also be the case that the specific training example provided for learning was not close to the edge of a manifold, making the discrimination between classes difficult as the labels are propagated farther away from the original training set.


[Figure 9 plot data: left panel, "Misclassification rate (%) vs. k (k-NN)" comparing LapRLSC against the k-NN classifier; right panel, "Misclassification rate (%) vs. number of training examples" comparing LapRLSC against 2-, 3-, 4-, 5- and 6-NN classifiers.]

Figure 9: Plots showing the performance of LapRLS on Caltech-101 using GIST descriptors, compared against kNN classifiers with increasing number of training examples. Once again, LapRLS seems to perform better for both small and large numbers of training samples Nt.
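The graph construction and LapRLS solve described in section 5.3.2 can be sketched on toy data. This is a minimal illustration of the closed-form solution from Belkin et al. [1]; the two 2-D Gaussian blobs, the kernel width sigma, and the regularization weights gamma_A and gamma_I are illustrative stand-ins, not the settings used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: two Gaussian blobs standing in for the reduced R^64 GIST features.
n_per = 30
X = np.vstack([rng.normal(-2.0, 0.5, size=(n_per, 2)),
               rng.normal(+2.0, 0.5, size=(n_per, 2))])
y_true = np.hstack([-np.ones(n_per), np.ones(n_per)])
n = 2 * n_per

# Only one labeled example per class (minimal labels).
labeled = np.array([0, n_per])
J = np.zeros((n, n)); J[labeled, labeled] = 1.0   # selects labeled nodes
Y = np.zeros(n); Y[labeled] = y_true[labeled]      # zero on unlabeled nodes

# RBF kernel / weight matrix from pairwise L2 distances (section 3).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
sigma = 1.0
K = np.exp(-sq / (2.0 * sigma ** 2))
W = K.copy()
L = np.diag(W.sum(axis=1)) - W                     # unnormalized graph Laplacian

# Closed-form LapRLS solution (Belkin et al. [1]):
#   alpha = (J K + gamma_A * l * I + gamma_I * l / (l + u)^2 * L K)^(-1) J Y
l, u = len(labeled), n - len(labeled)
gamma_A, gamma_I = 1e-3, 1e-2
A = J @ K + gamma_A * l * np.eye(n) + (gamma_I * l / (l + u) ** 2) * (L @ K)
alpha = np.linalg.solve(A, Y)                      # J @ Y == Y since Y is zero off the labeled set

f = K @ alpha                                      # evaluate the learned function on all nodes
accuracy = (np.sign(f) == y_true).mean()
print(accuracy)
```

With well-separated blobs, the labels propagate from the two labeled nodes to the rest of the graph and nearly every point is classified correctly.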

5.3.3 Minimal labels

Figure 10 shows the performance of the LapRLS algorithm with increasing number of training examples. The figure on the left shows the Receiver Operating Characteristic (ROC) curves for the object recognition task using GIST descriptors (67% recognition accuracy). The curves imply that with increasing number of training examples Nt, the performance continues to improve. More importantly, LapRLS performs considerably well even with Nt = 1 (blue curve), and continues to improve at Nt = 20 (orange curve). The figure on the right shows the confusion matrix for the object recognition task. This gives a rough idea of which objects are mistaken for others, if any, and which classifiers work best. As expected, the confusion matrix is strongly biased toward the diagonal, implying that the objects are recognized correctly.
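A confusion matrix of the kind shown in Figure 10 can be tallied with a few lines of NumPy. The labels below are synthetic stand-ins for the real LapRLS predictions, included only to illustrate the bookkeeping:

```python
import numpy as np

classes = ["Faces", "airplanes", "Motorbikes", "car_side",
           "sunflower", "chair", "emu"]
k = len(classes)

rng = np.random.default_rng(2)
# Hypothetical ground-truth and predicted labels standing in for LapRLS output:
# roughly 70% of predictions agree with the ground truth, the rest are random.
y_true = rng.integers(0, k, size=500)
y_pred = np.where(rng.random(500) < 0.7, y_true, rng.integers(0, k, size=500))

# Row i, column j counts images of true class i predicted as class j.
conf = np.zeros((k, k), dtype=int)
np.add.at(conf, (y_true, y_pred), 1)

accuracy = np.trace(conf) / conf.sum()     # overall accuracy = diagonal mass
per_class = conf.diagonal() / conf.sum(axis=1)
print(accuracy)
```

A diagonally dominant `conf`, as in Figure 10, corresponds to most images being assigned their true class.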

[Figure 10 plot data: left panel, "ROC curve for increasing training examples using GIST" (true positive rate/recall vs. true negative rate) with curves for n = 1, 2, 4, 8, 10, 20 and a random baseline; right panel, "Confusion matrix using GIST descriptor (67.14% accuracy)" over the classes Faces, airplanes, Motorbikes, car_side, sunflower, chair and emu.]

Figure 10: Left: ROC curves showing the performance of LapRLS on Caltech-101 with increasing number of training examples Nt. The figure shows that LapRLS performs considerably well even with a single training example. Right: Confusion matrix of the multi-class classification on Caltech-101, showing reasonably good performance when training with a global descriptor such as GIST and a single training example.

To give a bit more perspective on the power of semi-supervised learning in an object recognition framework: using reasonably simple feature descriptors in combination with a powerful algorithm such as LapRLS, one can build a robust object classifier capable of multi-class classification even with a minimal set of labels. Figure 11 shows the results of applying the Graph-based Laplacian Regularized Least Squares framework in an object recognition setting.


Figure 11: Final results of multi-class classification on Caltech-101 using a single training example with the Graph-based Laplacian Regularized Least Squares framework. Column 1 shows the original training datapoint per object class. Column 2 shows the set of images that have been correctly classified via the LapRLS approach discussed. Column 3 shows the set of false positives obtained for the corresponding object class. Column 4 shows the set of images within the object class that could not be classified correctly.

The LapRLS works as expected, with promising performance even with limited training samples. It is of particular interest to examine the regions of poor performance for LapRLS. In the case of faces or motorbikes in Figure 11 above, the faces and motorbikes that could not be classified correctly (column 4) seem to have a significantly different background than the training set (column 1) or the images that were classified correctly (column 2). This may be due to the fact that the feature descriptor (GIST, in this case) used to discriminate between images captures the background information as well (or possibly looks at contrasting foreground/background scenes). That said, it would make sense that the objects in the 4th column do not correlate well with the original training example, where the background seems to be less cluttered. This may be accounted for in the training process, where more informative training labels can be actively suggested to the user for annotation while simultaneously optimizing the overall classification objective function. This sub-domain is referred to as active learning [7] and has received considerable attention from the computer vision community in recent years.
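As a rough sketch of how such an active learning step might look, the margin-sampling heuristic queries the unlabeled image whose top two class scores are closest, i.e. the one the current classifier is least certain about. The score matrix below is a random stand-in for real multi-class LapRLS outputs:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical per-class scores f_c(x) for 100 unlabeled images and 7 classes,
# standing in for the real multi-class LapRLS decision values.
scores = rng.normal(size=(100, 7))

# Margin sampling: the gap between the best and second-best class score.
sorted_scores = np.sort(scores, axis=1)
margins = sorted_scores[:, -1] - sorted_scores[:, -2]

# Suggest the most ambiguous image to the user for annotation.
query = int(np.argmin(margins))
print(query, margins[query])
```

After the user labels the queried image, it moves to the labeled set and LapRLS is re-solved, repeating until the labeling budget is spent.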

6 Conclusion

To summarize, it has been demonstrated that a semi-supervised framework can be very beneficial in many applications, specifically object recognition, especially when the amount of training data is limited. Furthermore, from the results it is evident that the aforementioned Graph-based Laplacian Regularized Least Squares can leverage the availability of unlabeled data to build a better classifier than traditional supervised classifiers such as RLS, Linear-SVM and RBF-SVM. While this report predominantly focuses on the performance of such classifiers when the training data is limited, the results also generalize quite well with increasing number of training examples, remaining the lower bound on misclassification error rate in Figures 5, 7 and 9. As a result, graph-based methods work very well when the underlying assumptions about the feature space manifolds hold. As a final note, this report incorporates a novel Graph-based Regularized Least Squares semi-supervised learning technique for multi-class object recognition, providing convincing recognition performance despite a minimal set of training examples.


References

[1] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.

[2] Matthijs Douze, Hervé Jégou, Harsimrat Sandhawalia, Laurent Amsaleg, and Cordelia Schmid. Evaluation of GIST descriptors for web-scale image search. In International Conference on Image and Video Retrieval. ACM, July 2009.

[3] L. Fei-Fei, R. Fergus, and P. Perona. A Bayesian approach to unsupervised one-shot learning of object categories. Volume 2, page 1134, 2003.

[4] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Comput. Vis. Image Underst., 106(1):59–70, April 2007.

[5] Rob Fergus, Yair Weiss, and Antonio Torralba. Semi-supervised learning in gigantic image collections. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 522–530. 2009.

[6] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145–175, 2001.

[7] Burr Settles. Active learning literature survey. Technical report, 2010.

[8] Xiaojin Zhu. Semi-supervised learning literature survey, 2006.

[9] Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi-Supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2009.