CMU-Q 15-381
Lecture 23: Supervised Learning 1
Teacher: Gianni A. Di Caro
2
[Diagram: Machine Learning / Data Science scheme — data → (Model) → (Targets), obtained via inductive learning]
3
§ ML Design: use the right features (description language) to build the right model that achieves the task according to the desired performance § Learning from examples: look at some data, guess a general scientific hypothesis, and make statements — predictions on test data — based on this hypothesis § Inductive learning (from evidence) ≠ deductive learning (logical, from facts)
Data in the domain is described in the language of selected Features Task: define an appropriate mapping from data to the Outputs
Learning Problem: Obtaining such a mapping from training data
4
Data: Labeled / Unlabeled · Training set: Given / Not given · Feedback: Errors / Rewards
Performance criteria · Hypotheses space · Hypothesis function
5
§ Supervised (inductive) learning (labeled data) § A training data set is given § Training data include target outputs (labels put by a teacher / supervisor) § Using the labels a precise error measure for a prediction can be derived § Aims to find out models that explain and generalize observed data
Data: Labeled · Training set: Given · Feedback: Errors
Performance criteria · Hypotheses space · Hypothesis function
6
§ Classification, categorical target: given k possible classes/categories, to which class does each (new) data item belong? § Task mapping is f: ℱᵈ → {1, …, k} § Binary (k = 2): Dog or cat? Rich or poor? Hot or cold? § Multi-class (k > 2): Cloudy, or snowing, or mostly clear? Dog, or cat,
7
§ Regression, numerical target: which function best describes the (conditional expectation) relation between the dependent variables (outputs) and the d independent variables (predictors)? § Task mapping is f: ℱᵈ → ℝᵐ § Univariate (d = 1): What is the expected relation between temperature (predictor) and peak electricity usage in Doha (target output)? What is the expected relation between age and diabetes in Qatar? § Multivariate (d > 1): What is the expected relation between (temperature, hour of day, day of the week) and peak electricity usage in Doha? What is the expected relation between advertising in (TV, radio) and sales of a product?
8
§ Unsupervised (associative) learning (unlabeled data) § A (training) data set is given § Training data does not include target outputs / labels § Aims to find hidden structure, association relationships in data
Data: Unlabeled · Training set: Given · Feedback: Similarity measures
Performance criteria · Hypotheses space · Hypothesis function
9
§ Clustering, hidden target: based on some measure of similarity/dissimilarity, group data items into k clusters; k is (usually) not known in advance § Task mapping is f: ℱᵈ → {1, …, k} § Given a set of photos and similarity features (e.g., colors, geometric objects), how can the photos be clustered? § Group these people into different clusters
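As a concrete illustration of the clustering task, here is a minimal 1-D k-means sketch; the toy data, k, and seed are hypothetical choices for illustration (the slides do not prescribe a specific clustering algorithm):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its cluster (1-D points)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as seeds
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # nearest centroid by squared distance
            j = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[j].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# two obvious groups around 1.0 and 10.0
centroids, clusters = kmeans([1.0, 1.2, 0.8, 10.0, 10.5, 9.5], k=2)
```

Note that k-means only finds a local optimum of the within-cluster distances, which is why the choice of initial centroids (here via the seed) matters in general.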
10
EXAMPLE OF UNSUPERVISED LEARNING TASK: DISCOVER RELATIONS
§ Finding underlying structure, hidden target: discover relations and correlations among data. § Given a set of photos, find possible relations among sub-groups of them. § Given shopping data of customers at gas stations spread across the country, discover relations that could suggest what to stock in the shops and where to locate the items on display.
11
§ Reinforcement (direct interaction) learning (reward model) § Explore, to gather the training data on your own § Gathered data does not include target outputs, but is associated with (possibly sparse) rewards/costs (advisory signals) § Aims to learn an optimal action policy or make optimal predictions § Sequential decision-making vs. one-shot decision-making
RL is hard(er)!
Data: Unlabeled · Training set: Not given · Feedback: Rewards
Performance criteria · Hypotheses space · Hypothesis function
12
Data set: cars
Labeling: + family car (positive example), − not a family car (negative example)
Task: learn the class C ≡ Family car, i.e., predict C(car)
Possible features: price, engine power, color, shape, traction, consumption, …
13
§ A car is represented as a numeric vector of two features: x = (price, engine power)
§ The label r of a car denotes its type: r = 1 if x is a positive example (family car), r = 0 if x is a negative example
§ In the data set D, each car example t is represented by an ordered pair (x^(t), r^(t)), and there are N examples in the data set: D = {(x^(t), r^(t))}, t = 1, …, N
14
§ Plot of the dataset in the two-dimensional feature space
What is the relationship between (price, power) and the class C? We need a hypothesis about the form of the searched mapping: to which class of functions does it belong? We have to make a choice: the inductive bias. We will explain the data according to the hypothesis class ℋ that we choose (which sets a bias)
15
Target of learning: find a particular hypothesis h ∈ ℋ to approximate the (true) class C as closely as possible
Hypothesis class from which we believe C is drawn: ℋ = set of axis-aligned rectangles, (p1 ≤ price ≤ p2) ⋀ (e1 ≤ engine power ≤ e2)
Learning a vector θ = (p1, p2, e1, e2) of four parameters: h = h_θ(x)
Problem: we don't know C, we only have a training set
How do we evaluate how good h = h_θ(x) is?
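A minimal sketch of the axis-aligned-rectangle hypothesis h_θ in code; the feature ranges and the parameter values θ below are hypothetical, chosen only to illustrate the idea:

```python
def h(theta, x):
    """Axis-aligned rectangle hypothesis: theta = (p1, p2, e1, e2).
    Returns 1 (family car) iff price and engine power fall in the box."""
    p1, p2, e1, e2 = theta
    price, power = x
    return 1 if (p1 <= price <= p2) and (e1 <= power <= e2) else 0

theta = (10_000, 25_000, 80, 150)  # hypothetical parameter values
inside = h(theta, (18_000, 120))   # point inside the rectangle -> 1
outside = h(theta, (40_000, 200))  # point outside the rectangle -> 0
```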
16
§ The empirical error on the training dataset of N labeled pairs is:
E_emp(θ) = (1/N) Σ_{t=1..N} ℓ(h_θ(x^(t)), r^(t))
§ Loss function: quantifies the prediction/classification error made by our hypothesis function h_θ(x) on a training or test example; in general, ℓ: ℱᵈ × ℝᵐ → ℝ
§ In the example, the 0-1 loss: ℓ(h_θ(x), r) = 1(h_θ(x) ≠ r), i.e., ℓ: ℝ × {0,1} → {0,1}
§ Do we aim to minimize the empirical error? To a certain extent, yes, but we really aim to minimize the generalization error: the loss on new examples, not in the training set!
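A minimal sketch of the empirical error with the 0-1 loss; for brevity it uses a hypothetical 1-D threshold classifier (not the rectangle from the slides), and the training pairs below are made up:

```python
def h(theta, x):
    # hypothetical 1-D threshold classifier, used only to illustrate the loss
    return 1 if x >= theta else 0

def empirical_error(theta, data):
    """E_emp(theta) = (1/N) * sum of 0-1 losses over the labeled pairs."""
    return sum(h(theta, x) != r for x, r in data) / len(data)

train = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
e_good = empirical_error(0.5, train)  # 0.0: all four examples correct
e_bad = empirical_error(0.7, train)   # 0.25: the pair (0.6, 1) is misclassified
```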
17
§ Fundamental problem: we're looking for the parameter values θ that minimize the prediction error resulting from the hypothesis function h_θ
§ This "seems" to be equivalent to finding: θ* = arg min_θ Σ_{t=1..N} ℓ(h_θ(x^(t)), r^(t))
§ … but actually, what we really care about is the loss of predictions on new examples (x̃, r̃) → generalization error
§ Expected loss over all the possible input-output pairs the learning machine may see
§ To quantify this expectation, we need to define a prior probability distribution P(x, r) over the examples, which we assume is stationary (P doesn't change)
§ The expected generalization loss is: E_gen(θ) = Σ_{(x,r)} ℓ(h_θ(x), r) P(x, r)
18
§ But P(x, r) is not known! Therefore it is only possible to estimate the generalization error, which is the true error for the considered population
§ How can we make a sound estimate? Two general ways:
§ Theoretical: derive statistical bounds on the difference between the true error and the expected empirical error (PAC learning, VC dimension)
§ Empirical (practical): compute the expected empirical error on the training dataset as a local indicator of performance, then use a separate data set to test the learned model, and use the expected empirical error on the test set to estimate the generalization error
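The empirical route can be sketched as a simple hold-out split; the split fraction, seed, and toy dataset below are arbitrary illustrative choices, not prescribed by the slides:

```python
import random

def train_test_split(data, test_frac=0.25, seed=0):
    """Hold out a fraction of the labeled data; the empirical error on the
    held-out part estimates the generalization error of the learned model."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

# hypothetical labeled data: (feature, label) pairs
data = [(x / 10, int(x >= 5)) for x in range(10)]
train, test = train_test_split(data)
```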
19
§ In any case, we need to estimate the generalization loss with the expected empirical loss on a set of N ≪ M examples, computed as the sample average of losses:
E_emp(θ) = (1/N) Σ_{t=1..N} ℓ(h_θ(x^(t)), r^(t))
§ E_emp(θ) is an approximation of the risk associated with the use of the hypothesis h_θ for the learning task (i.e., the risk of incurring prediction losses when classifying samples that are not in the training set)
§ The empirical risk minimization principle states that the learning algorithm should choose the hypothesis h_θ that minimizes the empirical risk:
θ* = arg min_θ (1/N) Σ_{t=1..N} ℓ(h_θ(x^(t)), r^(t))
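A toy illustration of the empirical risk minimization principle, assuming a hypothetical 1-D threshold hypothesis class and a small candidate set of parameter values (a stand-in for a real optimizer):

```python
def empirical_risk(theta, data):
    """Average 0-1 loss of the threshold hypothesis h_theta(x) = 1 iff x >= theta."""
    return sum((1 if x >= theta else 0) != r for x, r in data) / len(data)

def erm(data, candidates):
    """ERM: choose the candidate theta with the lowest empirical risk."""
    return min(candidates, key=lambda t: empirical_risk(t, data))

# hypothetical labeled pairs and candidate thresholds
data = [(0.1, 0), (0.3, 0), (0.6, 1), (0.8, 1)]
best = erm(data, [0.2, 0.5, 0.7])
```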
20
The true class C; for a choice of h ∈ ℋ: S = the most specific consistent hypothesis, G = the most general consistent hypothesis
21
Ø Given a collection of input features and outputs (x^(t), y^(t)), t = 1, …, N, and a hypothesis function h_θ, find the parameter values θ that minimize the average empirical error:
minimize_θ (1/N) Σ_{t=1..N} ℓ(h_θ(x^(t)), y^(t))
Ø Since 1/N is a constant, the problem is equivalent to minimizing the sum of prediction losses:
minimize_θ Σ_{t=1..N} ℓ(h_θ(x^(t)), y^(t))
Ø In some cases (e.g., when using quadratic losses), it can be convenient to use the factor 1/2N instead
Ø Virtually all supervised learning algorithms can be described in this form, where we need to specify: the hypothesis function h_θ, the loss function ℓ, and how to solve the minimization problem
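This generic form can be sketched directly: plug in a hypothesis function, a loss, and a minimizer (here a brute-force search over a hypothetical candidate set, standing in for a real optimizer; the data are made up):

```python
def fit(data, h, loss, thetas):
    """Generic supervised-learning recipe: among candidate parameters,
    pick the one minimizing the sum of per-example losses."""
    def total_loss(theta):
        return sum(loss(h(theta, x), y) for x, y in data)
    return min(thetas, key=total_loss)

# one instantiation: linear hypothesis h_theta(x) = theta * x, quadratic loss
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
theta_best = fit(data,
                 h=lambda th, x: th * x,
                 loss=lambda yhat, y: (yhat - y) ** 2,
                 thetas=[1.0, 1.5, 2.0, 2.5])
```

Swapping in a different hypothesis class, loss, or optimizer changes the algorithm but not the overall scheme.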
22
23
Simple example: predicting electricity use based on temperature measurements as predictors
24
Several days of peak demand vs. high temperature in Pittsburgh
25
Hypothesis class ℋ = set of linear functions, y = θ1·x + θ2
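A minimal closed-form least-squares fit of this linear hypothesis; the temperature/demand numbers below are hypothetical, for illustration only:

```python
def fit_line(xs, ys):
    """Least-squares fit of y ≈ theta1 * x + theta2 (one predictor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    theta1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
              / sum((x - mx) ** 2 for x in xs))
    theta2 = my - theta1 * mx  # the fitted line passes through (mx, my)
    return theta1, theta2

# hypothetical temperature (°C) vs peak demand (GW) readings
temps = [25.0, 30.0, 35.0, 40.0]
demand = [1.5, 1.8, 2.1, 2.4]
t1, t2 = fit_line(temps, demand)
```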
26
27
28
In classification:
29
30
31
Ø Gradient-descent update for the quadratic loss:
θ ← θ − α Σ_{t=1..N} x^(t) (x^(t)ᵀ θ − y^(t))
Ø If the averaging factor 1/2N is used, then the update action becomes:
θ ← θ − (α/N) Σ_{t=1..N} x^(t) (x^(t)ᵀ θ − y^(t))
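A sketch of the batch update with the 1/N averaging factor, reduced to a scalar linear model y ≈ θ·x; the data and step size α are hypothetical:

```python
def gd_step(theta, data, alpha):
    """One batch gradient-descent step for y ≈ theta * x with quadratic
    loss, using the 1/N averaging factor in the gradient."""
    n = len(data)
    grad = sum(x * (theta * x - y) for x, y in data) / n
    return theta - alpha * grad

# hypothetical (x, y) pairs generated by y = 2x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
theta = 0.0
for _ in range(100):
    theta = gd_step(theta, data, alpha=0.1)
# theta converges toward the true slope 2.0
```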
32
33
34
§ Move in the direction opposite to the gradient vector, that is, ⊥ to isocontours: the direction of maximal change of the function § Isocontours / sublevel sets: C_α = {x : f(x) ≤ α}
35
§ A GD run ~ the motion of a mass in a potential field towards the minimum energy; the step size α scales the force, defining the next point
36
§ Adapting the step size may be necessary to avoid either too slow progress or overshooting the target minimum
[Plots: a small, good α gives convergence; a too large α gives divergence]
37
[Plots: α too low gives slow convergence; α too large gives divergence]
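The effect of the step size can be seen on the toy objective f(x) = x² (gradient 2x), where each gradient-descent step scales the iterate by (1 − 2α):

```python
def gd(alpha, steps=20, x0=10.0):
    """Minimize f(x) = x^2 by gradient descent: x <- x - alpha * 2x.
    Convergence requires |1 - 2*alpha| < 1, i.e., 0 < alpha < 1."""
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x
    return x

small = gd(alpha=0.1)  # contracts toward the minimum at x = 0
large = gd(alpha=1.1)  # |1 - 2*alpha| = 1.2 > 1: the iterates diverge
```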
38
§ If the (loss) function is very anisotropic, the problem is said to be ill-conditioned, since the gradient vector doesn't point toward the local minimum, resulting in a zig-zagging trajectory § Ill-conditioning can be determined by computing the ratio between the largest and smallest eigenvalues of the Hessian matrix (the matrix of the second partial derivatives)
Ill-conditioned problem Well-conditioned problem
39
§ If the chosen features have significantly different ranges, the resulting loss function will be ill-conditioned § For instance, to learn a regression model for house pricing we could use as predicting features 3-component vectors: x = (square meters, floor, number of bedrooms) § We can easily have the value ranges {[0,500], [1,50], [1,5]}, which would result in an ill-conditioned problem, stretched along the square-meters dimension § One way to overcome the problem is to rescale the feature values, accounting for their min/max ranges and possibly making all features range in [0,1] § Since feature values are stochastic variables, they can also be rescaled to have zero mean and a fixed range or standard deviation § All these rescaling operations can be performed using the available dataset
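A minimal min-max rescaling sketch for the house-pricing example; the three feature rows below are hypothetical:

```python
def min_max_scale(rows):
    """Rescale each feature column to [0, 1] using the dataset's min/max."""
    cols = list(zip(*rows))            # transpose: one tuple per feature
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) for v, l, h in zip(row, lo, hi)]
            for row in rows]

# hypothetical (square meters, floor, bedrooms) rows with very different ranges
houses = [[50.0, 1.0, 1.0], [250.0, 10.0, 3.0], [500.0, 50.0, 5.0]]
scaled = min_max_scale(houses)
```

After scaling, every column spans [0, 1], so no single feature dominates the loss landscape.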
40
Very effective with training sets containing redundant / duplicate data
41
42
Example of the on-line evolution of parameter learning as more examples (black dots) are considered in the course of SGD
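A sketch of SGD for a scalar linear model y ≈ θ·x, updating the parameter after every single (shuffled) example; the data, step size, and epoch count below are hypothetical:

```python
import random

def sgd(data, alpha=0.05, epochs=50, seed=0):
    """Stochastic gradient descent with quadratic loss: unlike batch GD,
    theta is updated per example, on a freshly shuffled pass each epoch."""
    rng = random.Random(seed)
    theta = 0.0
    examples = data[:]
    for _ in range(epochs):
        rng.shuffle(examples)
        for x, y in examples:
            theta -= alpha * x * (theta * x - y)  # per-example gradient step
    return theta

# hypothetical noiseless pairs generated by y = 3x
theta = sgd([(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)])
```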
43
44
Issues with differentiability and, therefore, with computing gradients
45