CMU-Q 15-381, Lecture 23: Supervised Learning 1. Teacher: Gianni A. Di Caro. (PowerPoint presentation transcript)



SLIDE 1

CMU-Q 15-381

Lecture 23: Supervised Learning 1

Teacher: Gianni A. Di Caro

SLIDE 2

MACHINE LEARNING?

[Diagram: "Data Science!?" — inductive learning: from Data and Targets to a Model]

SLIDE 3

GENERAL ML SCHEME

§ ML Design: use the right features (description language) to build the right model that achieves the task according to the desired performance
§ Learning by examples: look at some data, guess a general scientific hypothesis, and make statements (predictions on test data) based on this hypothesis
§ Inductive learning (from evidence) ≠ Deductive learning (logical, from facts)

Data in the domain is described in the language of the selected Features.
Task: define an appropriate mapping from the data to the Outputs.

Learning Problem: Obtaining such a mapping from training data

SLIDE 4

GENERAL ML SCHEME

Labeled / Unlabeled · Given / Not given · Errors / Rewards · Performance criteria · Hypotheses space · Hypothesis function

SLIDE 5

SUPERVISED LEARNING

§ Supervised (inductive) learning (labeled data)
§ A training data set is given
§ Training data include target outputs (labels assigned by a teacher / supervisor)
§ Using the labels, a precise error measure for a prediction can be derived
§ Aims to find models that explain and generalize the observed data

Labeled · Given · Errors (performance criteria, hypotheses space, hypothesis function)

SLIDE 6

EXAMPLE OF SUPERVISED LEARNING TASK: CLASSIFICATION

§ Classification, categorical target: given k possible classes/categories, to which class does each (new) data item belong?
§ Task mapping is f: ℱᵈ → {1, …, k}
§ Binary (k = 2): dog or cat? Rich or poor? Hot or cold?
§ Multi-class (k > 2): cloudy, snowing, or mostly clear? Dog, cat, fox, or …?
SLIDE 7

EXAMPLE OF SUPERVISED LEARNING TASK: REGRESSION

§ Regression, numerical target: which function best describes the (conditional expectation) relation between m dependent variables (outputs) and d independent variables (predictors)?
§ Task mapping is f: ℱᵈ → ℝᵐ
§ Univariate (d = 1): what is the expected relation between temperature (predictor) and peak electricity usage in Doha (target output)? What is the expected relation between age and diabetes in Qatar?
§ Multi-variate (d > 1): what is the expected relation between (temperature, hour of day, day of the week) and peak electricity usage in Doha? What is the expected relation between advertising in (TV, radio) and sales of a product?

SLIDE 8

UNSUPERVISED LEARNING

§ Unsupervised (associative) learning (unlabeled data)
§ A (training) data set is given
§ Training data do not include target outputs / labels
§ Aims to find hidden structure and association relationships in the data

Unlabeled · Given · Similarity measures (performance criteria, hypotheses space, hypothesis function)

SLIDE 9

EXAMPLE OF UNSUPERVISED LEARNING TASK: CLUSTERING

§ Clustering, hidden target: based on some measure of similarity/dissimilarity, group data items into k clusters, where k is (usually) not known in advance.
§ Task mapping is f: ℱᵈ → {1, …, k}
§ Given a set of photos and similarity features (e.g., colors, geometric objects), how can the photos be clustered?
§ Group these people into different clusters
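The grouping step can be sketched with a minimal k-means-style loop; the data, k = 2, and the naive "first k points" initialization are illustrative assumptions, not from the slides:

```python
import numpy as np

# Minimal k-means-style sketch; data, k, and initialization are illustrative.
def kmeans(X, k, iters=20):
    centers = X[:k].astype(float)          # naive init: first k points
    for _ in range(iters):
        # assign every point to its nearest center
        labels = np.argmin(np.abs(X[:, None] - centers[None, :]), axis=1)
        # recompute each center as the mean of its assigned points
        centers = np.array([X[labels == j].mean() for j in range(k)])
    return labels, centers

X = np.array([1.0, 1.2, 0.8, 10.0, 10.2, 9.8])  # two well-separated groups
labels, centers = kmeans(X, k=2)
```

Each iteration alternates nearest-center assignment with recomputing the centers as cluster means; on this toy data the two groups separate after a couple of iterations.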

SLIDE 10

EXAMPLE OF UNSUPERVISED LEARNING TASK: DISCOVER RELATIONS

§ Finding underlying structure, hidden target: discover relations and correlations among the data.
§ Given a set of photos, find possible relations among sub-groups of them.
§ Given shopping data of customers at gas stations spread across the country, discover relations that could suggest what to stock in the shops and where to locate the items on display.

SLIDE 11

REINFORCEMENT LEARNING

§ Reinforcement (direct interaction) learning (reward model)
§ Explore, to find the training data on your own
§ Gathered data do not include target outputs, but are associated with (possibly sparse) rewards/costs (advisory signals)
§ Aims to learn an optimal action policy or make optimal predictions
§ Sequential decision-making vs. one-shot decision-making

RL is hard(er)!

Unlabeled · Not given · Rewards (performance criteria, hypotheses space, hypothesis function)

SLIDE 12

SUPERVISED LEARNING: CLASSIFICATION

Data set: cars
Labeling: + Family car (positive example), − Not a family car (negative example)
Task: learn the class C ≡ Family car, i.e., f(car) = + / −

Features? Price, engine power, color, shape, traction, consumption, …

SLIDE 13

CLASSIFICATION EXAMPLE

§ A car is represented as a numeric vector of two features:

x = (x₁, x₂), x ∈ ℝ²

§ The label r of a car denotes its type:

r = 1 if x is a positive example (family car), r = 0 if x is a negative example

§ In the data set D, each car example t is represented by an ordered pair (x⁽ᵗ⁾, r⁽ᵗ⁾), and there are N examples in the data set:

D = {(x⁽ᵗ⁾, r⁽ᵗ⁾)}, t = 1, …, N

SLIDE 14

CLASSIFICATION EXAMPLE: HYPOTHESIS

§ Plot of the dataset in the two-dimensional feature space

What is the relationship between (price, power) and the class C?
Hypothesis about the form of the searched mapping: to which class of functions does it belong?
We have to make a choice: the inductive bias. We will explain the data according to the hypothesis class ℋ that we choose (which sets a bias).

SLIDE 15

CLASSIFICATION EXAMPLE: HYPOTHESIS CLASS

Target of learning: find a particular hypothesis h ∈ ℋ that approximates the (true) class C as closely as possible.
Hypothesis class from which we believe C is drawn: ℋ = set of axis-aligned rectangles, (p₁ ≤ price ≤ p₂) ⋀ (e₁ ≤ engine power ≤ e₂)
Learning = finding a vector θ of four parameters: h = h_θ(x)
Problem: we don't know C, we only have a set of examples drawn from C.
How do we evaluate how good h = h_θ(x) is?
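The four-parameter rectangle hypothesis can be sketched directly; the price / engine-power bounds below are hypothetical values, not from the lecture:

```python
# Sketch of the axis-aligned-rectangle hypothesis h_theta;
# theta = (p1, p2, e1, e2) and the test cars are illustrative.
def h_theta(x, theta):
    """Predict 1 (family car) iff price and engine power fall inside the rectangle."""
    p1, p2, e1, e2 = theta
    price, power = x
    return 1 if (p1 <= price <= p2 and e1 <= power <= e2) else 0

theta = (10_000, 30_000, 60, 150)        # hypothetical bounds (p1, p2, e1, e2)
inside = h_theta((20_000, 100), theta)   # point inside the rectangle
outside = h_theta((50_000, 300), theta)  # point outside the rectangle
```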

SLIDE 16

EMPIRICAL AND GENERALIZATION ERRORS

§ Loss function ℓ: quantifies the prediction/classification error made by our hypothesis function h_θ(x) on a training or test example. In the example (0/1 loss):

ℓ(h_θ(x), r) = 1(h_θ(x) ≠ r) ∈ {0, 1}

§ The empirical error on the training dataset of N labeled pairs is:

E(h_θ | D) = (1/N) Σ_{t=1..N} ℓ(h_θ(x⁽ᵗ⁾), r⁽ᵗ⁾)

§ Do we aim to minimize the empirical error? To a certain extent, yes, but we really aim to minimize the generalization error: the loss on new examples, not in the training set!
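A minimal sketch of the empirical error as the average 0/1 loss; the tiny dataset and the 1-D threshold hypothesis (standing in for the rectangle) are made up for illustration:

```python
# Empirical error = average 0/1 loss over the labeled training set.
# The dataset D and the threshold hypothesis h are illustrative.
def h(x, theta):
    return 1 if x >= theta else 0

def empirical_error(theta, data):
    losses = [1 if h(x, theta) != r else 0 for (x, r) in data]
    return sum(losses) / len(data)

D = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
e_good = empirical_error(0.5, D)   # threshold separates the labels perfectly
e_bad = empirical_error(0.8, D)    # misclassifies the example x = 0.6
```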

SLIDE 17

EMPIRICAL AND GENERALIZATION ERRORS

§ Fundamental problem: we're looking for the parameter values θ that minimize the prediction error resulting from the hypothesis function h_θ
§ This "seems" to be equivalent to finding:

θ* = arg min_θ Σ_{t=1..N} ℓ(h_θ(x⁽ᵗ⁾), y⁽ᵗ⁾)

§ … but actually, what we really care about is the loss of prediction on new examples (x′, y′) → Generalization error
§ Expected loss over all possible input-output pairs the learning machine may see
§ To quantify this expectation, we need to define a prior probability distribution P(x, y) over the examples, which we assume to be stationary (P doesn't change)
§ The expected generalization loss is:

E_gen(h_θ) = Σ_{(x, y)} ℓ(h_θ(x), y) P(x, y)

SLIDE 18

HOW TO ASSESS GENERALIZATION ERROR?

§ But P(x, y) is not known! Therefore it is only possible to estimate the generalization error, which is the true error for the considered population of data examples, given the chosen hypothesis.

§ How can we make a sound estimate? Two general ways:
§ Theoretical: derive statistical bounds on the difference between the true error and the expected empirical error (PAC learning, VC dimension)
§ Empirical (practical): compute the expected empirical error on the training dataset as a local indicator of performance, then use a separate data set to test the learned model, and use the expected empirical error on the test set to estimate the generalization error
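The empirical route can be sketched as a hold-out split; the dataset, the 80/20 ratio, and the fixed hypothesis h are all illustrative assumptions:

```python
import random

# Hold out a test set and use its error as an estimate of the
# generalization error. Data, split ratio, and h are illustrative.
def h(x):
    return 1 if x >= 0.5 else 0

data = [(i / 100, 1 if i >= 50 else 0) for i in range(100)]
random.seed(0)          # reproducible shuffle
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

def error(dataset):
    return sum(1 for x, r in dataset if h(x) != r) / len(dataset)

train_err = error(train)   # local indicator of performance
test_err = error(test)     # estimate of the generalization error
```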

SLIDE 19

EMPIRICAL RISK MINIMIZATION

§ In any case, we need to estimate the generalization loss with the expected empirical loss on a set of N ≪ M examples, which can be computed as the sample average of the losses:

R_emp(θ) = (1/N) Σ_{t=1..N} ℓ(h_θ(x⁽ᵗ⁾), y⁽ᵗ⁾)

§ R_emp is an approximation of the risk associated with using the hypothesis h_θ for the learning task (i.e., the risk of incurring prediction losses when classifying samples that are not in the training set)
§ The empirical risk minimization principle states that the learning algorithm should choose the hypothesis h_θ* that minimizes the empirical risk:

θ* = arg min_θ (1/N) Σ_{t=1..N} ℓ(h_θ(x⁽ᵗ⁾), y⁽ᵗ⁾)
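The ERM principle can be sketched over a small, made-up set of candidate threshold hypotheses: evaluate the empirical risk of each candidate and keep the minimizer.

```python
# ERM sketch: pick the candidate theta with minimal empirical risk.
# The dataset D and the candidate thresholds are illustrative.
def h(x, theta):
    return 1 if x >= theta else 0

D = [(0.1, 0), (0.3, 0), (0.45, 0), (0.55, 1), (0.7, 1), (0.95, 1)]
candidates = [0.2, 0.35, 0.5, 0.65, 0.8]

def emp_risk(theta):
    return sum(1 for x, r in D if h(x, theta) != r) / len(D)

theta_star = min(candidates, key=emp_risk)   # the ERM choice
```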

SLIDE 20

HOW TO CHOOSE THE HYPOTHESIS?

The true class C. For a choice of ℋ: S = the most specific consistent hypothesis, G = the most general consistent hypothesis. Consistent hypotheses: those that classify all training examples correctly.

SLIDE 21

THE CANONICAL SUPERVISED ML PROBLEM

Ø Given a collection of input features and outputs (x⁽ᵗ⁾, y⁽ᵗ⁾), t = 1, …, N, and a hypothesis function h_θ, find parameter values θ that minimize the average empirical error:

minimize_θ (1/N) Σ_{t=1..N} ℓ(h_θ(x⁽ᵗ⁾), y⁽ᵗ⁾)

Ø Since 1/N is a constant that depends on the size of the dataset, it may be omitted, making the problem equivalent to minimizing the sum of prediction losses:

minimize_θ Σ_{t=1..N} ℓ(h_θ(x⁽ᵗ⁾), y⁽ᵗ⁾)

In some cases (e.g., when using quadratic losses), it can be convenient to use a 1/2 factor.

Ø Virtually all supervised learning algorithms can be described in this form, where we need to specify:
  1. The hypothesis class ℋ, h_θ ∈ ℋ
  2. The loss function ℓ
  3. The algorithm for solving the optimization problem (often approximately)
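The three ingredients can be made concrete in a toy sketch: a linear hypothesis class, a squared loss, and a crude grid-search optimizer. All three choices (and the data) are illustrative assumptions:

```python
# The canonical form: specify (1) hypothesis class, (2) loss, (3) optimizer.
def h(theta, x):                        # 1. hypothesis class: y = theta1*x + theta0
    theta0, theta1 = theta
    return theta1 * x + theta0

def loss(pred, y):                      # 2. loss function: squared error
    return (pred - y) ** 2

def optimize(data, grid):               # 3. optimizer: exhaustive grid search
    return min(grid, key=lambda t: sum(loss(h(t, x), y) for x, y in data))

data = [(0, 1), (1, 3), (2, 5)]         # lies exactly on y = 2x + 1
grid = [(a, b) for a in (0, 1, 2) for b in (0, 1, 2)]
theta = optimize(data, grid)
```

Swapping any of the three components (e.g., a richer grid, a different loss, gradient descent instead of grid search) changes the learner but not the form of the problem.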
SLIDE 22

BIG PICTURE (PATTERN RECOGNITION)

SLIDE 23

A REGRESSION TASK

Simple example: predicting electricity use from temperature measurements used as predictors

SLIDE 24

PREDICTING ELECTRICITY USE

Several days of peak demand vs. high temperature in Pittsburgh

SLIDE 25

LINEAR REGRESSION MODEL

Hypothesis class ℋ = set of linear functions, y = θ₁x + θ₀
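A sketch of fitting this hypothesis class by least squares; np.polyfit minimizes the squared error, and the noiseless data points are illustrative:

```python
import numpy as np

# Least-squares fit of the linear hypothesis y = theta1*x + theta0.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0                        # points on the line y = 2x + 1
theta1, theta0 = np.polyfit(x, y, deg=1)  # highest-degree coefficient first
```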

SLIDE 26

LINEAR REGRESSION MODEL

SLIDE 27

LINEAR REGRESSION MODEL

SLIDE 28

WHAT ABOUT CHOOSING A DIFFERENT HYPOTHESIS?

In classification:

SLIDE 29

HYPOTHESIS SPACE

SLIDE 30

POWER DEMAND FORECASTING PROBLEM

SLIDE 31

HOW DO WE SOLVE IT?

θ ← θ − α Σ_{t=1..N} x⁽ᵗ⁾ (θᵀx⁽ᵗ⁾ − y⁽ᵗ⁾)

Ø If the averaging factor 1/(2N) is used in the loss, then the update action becomes:

θ ← θ − (α/N) Σ_{t=1..N} x⁽ᵗ⁾ (θᵀx⁽ᵗ⁾ − y⁽ᵗ⁾)
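The averaged update can be sketched as batch gradient descent for least-squares linear regression; the step size, iteration count, and toy data are illustrative choices:

```python
import numpy as np

# Batch GD for least-squares linear regression.
def gradient_descent(X, y, alpha=0.1, iters=5000):
    theta = np.zeros(X.shape[1])
    N = len(y)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / N  # gradient of (1/2N) * sum of squared errors
        theta -= alpha * grad
    return theta

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column + x
y = np.array([1.0, 3.0, 5.0, 7.0])                              # y = 2x + 1
theta = gradient_descent(X, y)
```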

SLIDE 32

GRADIENT DESCENT FOR GENERAL ML PROBLEMS

SLIDE 33

ANALYTICAL SOLUTION

• The analytical solution can still face computational issues for large datasets
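For least squares, the analytical solution solves the normal equations θ = (XᵀX)⁻¹Xᵀy; a sketch using a linear solve instead of an explicit inverse, on the same toy data:

```python
import numpy as np

# Analytical least-squares solution via the normal equations.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column + x
y = np.array([1.0, 3.0, 5.0, 7.0])                              # y = 2x + 1
theta = np.linalg.solve(X.T @ X, X.T @ y)
```

Forming XᵀX costs O(Nd²) and the solve O(d³), which is where very large datasets start to hurt.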
SLIDE 34

RECAP: GRADIENT DESCENT

§ Move in the direction opposite to the gradient vector, i.e., ⊥ to the isocontours: the direction of maximal change of the function
§ Isocontours / sublevel sets: S_c = {x : f(x) ≤ c}

SLIDE 35

RECAP: STEP SIZE (LEARNING RATE)

§ A GD run ~ the motion of a mass in a potential field towards the minimum-energy configuration. At each point the gradient defines the attraction force, while the step size α scales the force to define the next point.

SLIDE 36

RECAP: STEP SIZE (LEARNING RATE)

§ Adapting the step size may be necessary to avoid either too-slow progress or overshooting the target minimum. Small, good α: convergence · α too large: divergence
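The two regimes can be seen on the simplest possible loss f(x) = x², whose GD update is x ← x − α·2x; the two step sizes are illustrative:

```python
# Effect of the step size on GD for f(x) = x**2 (gradient 2x).
def gd(alpha, x0=1.0, iters=50):
    x = x0
    for _ in range(iters):
        x -= alpha * 2 * x      # x <- x * (1 - 2*alpha)
    return x

x_small = gd(0.1)   # |1 - 0.2| = 0.8 < 1: contracts towards the minimum at 0
x_large = gd(1.1)   # |1 - 2.2| = 1.2 > 1: the iterates blow up (divergence)
```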

SLIDE 37

RECAP: STEP SIZE (LEARNING RATE)

α too low: slow convergence · α too large: divergence

SLIDE 38

RECAP: ILL-CONDITIONED PROBLEMS

§ If the (loss) function is very anisotropic, the problem is said to be ill-conditioned: the gradient vector doesn't point towards the local minimum, resulting in a zig-zagging trajectory
§ Ill-conditioning can be detected by computing the ratio between the largest and smallest eigenvalues of the Hessian matrix (the matrix of second partial derivatives)

Ill-conditioned problem vs. well-conditioned problem
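A sketch of the eigenvalue-ratio check for a quadratic loss f(x) = ½ xᵀAx, whose Hessian is A itself; the two matrices are illustrative:

```python
import numpy as np

# Condition number = ratio of the largest to smallest Hessian eigenvalue.
def condition_number(H):
    eig = np.linalg.eigvalsh(H)     # eigenvalues of a symmetric matrix, ascending
    return eig[-1] / eig[0]

A_ill = np.array([[100.0, 0.0], [0.0, 1.0]])   # strongly stretched bowl
A_well = np.array([[1.0, 0.0], [0.0, 1.0]])    # round bowl
```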

SLIDE 39

RESCALING FEATURES

§ If the chosen features have significantly different ranges, then the resulting loss function will be ill-conditioned
§ For instance, if we want to learn a regression model for house pricing, we could use as predicting features 3-component vectors: x = (square meters, floor, number of bedrooms)
§ We can easily have the value ranges {[0,500], [1,50], [1,5]}, which would result in an ill-conditioned problem, stretched along the sqm dimension
§ One way to overcome the problem is to rescale the feature values, accounting for their min/max ranges and possibly making all features range in [0,1]
§ Since feature values are stochastic variables, they can also be rescaled to have zero mean and a fixed range or standard deviation
§ All these rescaling operations can be performed using the available dataset
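Both rescaling operations can be sketched directly from the dataset; the feature matrix (sqm, floor, bedrooms) below is made up for illustration:

```python
import numpy as np

# Min-max rescaling and standardization, computed column-wise from the data.
X = np.array([[400.0, 10.0, 3.0],
              [100.0,  2.0, 1.0],
              [250.0, 30.0, 5.0]])

# every feature column mapped to [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# every feature column rescaled to zero mean and unit standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```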

SLIDE 40

STOCHASTIC (SEQUENTIAL) GRADIENT DESCENT

Very effective with training sets that contain redundant / duplicate data
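A sketch of SGD for the same linear-regression setting: one update per training example rather than one per full pass; the step size, epoch count, and data are illustrative choices:

```python
import numpy as np

# Stochastic (sequential) gradient descent: update after each example.
def sgd(X, y, alpha=0.05, epochs=2000):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):                    # one (x, y) pair at a time
            theta -= alpha * x_t * (x_t @ theta - y_t)
    return theta

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column + x
y = np.array([1.0, 3.0, 5.0, 7.0])                              # y = 2x + 1
theta = sgd(X, y)
```

Because each update uses a single example, duplicated examples add no extra cost per useful gradient step, which is why SGD shines on redundant data.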

SLIDE 41

RESULT OF LEARNING

SLIDE 42

STOCHASTIC GRADIENT IN ACTION

Example of the online evolution of the learned parameters as more examples (black dots) are considered in the course of SGD

SLIDE 43

ALTERNATIVE LOSS FUNCTION?

SLIDE 44

ALTERNATIVE LOSS FUNCTIONS

Some alternative loss functions raise issues with differentiability and, therefore, with computing gradients

SLIDE 45

EFFECT IN THE POWER PROBLEM