Linear Models (Machine Learning)



SLIDE 1

Machine Learning

Linear Models


SLIDE 2

Checkpoint: The bigger picture

  • Supervised learning: instances, concepts, and hypotheses
  • Specific learners
    – Decision trees
  • General ML ideas
    – Features as high-dimensional vectors
    – Overfitting

[Diagram: Labeled data → Learning algorithm → Hypothesis / Model h; New example → h → Prediction]

Questions?


SLIDE 6

Lecture outline

  • Linear models
  • What functions do linear classifiers express?


SLIDE 7

Where are we?

  • Linear models

    – Introduction: Why linear classifiers and regressors?
    – Geometry of linear classifiers
    – A notational simplification
    – Learning linear classifiers: The lay of the land

  • What functions do linear classifiers express?



SLIDE 9

Is learning possible at all?

  • There are 2¹⁶ = 65536 possible Boolean functions over 4 inputs
    – Why? There are 2⁴ = 16 possible inputs, so a truth table has 16 output slots. Each way to fill these 16 slots is a different function, giving 2¹⁶ functions.

  • We have seen only 7 outputs
  • We cannot know what the rest are without seeing them
    – Think of an adversary filling in the labels every time you make a guess at the function


How could we possibly learn anything?
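To make the counting concrete, here is a minimal Python sketch (the observed labels are hypothetical, chosen only for illustration): it enumerates all 2¹⁶ truth tables over 4 Boolean inputs and counts how many remain consistent after seeing 7 labeled examples.

```python
from itertools import product

# The 2^4 = 16 possible inputs over 4 Boolean variables
inputs = list(product([0, 1], repeat=4))

# A Boolean function is one way to fill the 16 output slots, so
# encode each function as a 16-bit integer: 2^16 = 65536 of them.
num_functions = 2 ** len(inputs)

# Hypothetical training set: labels for 7 of the 16 inputs.
observed = {inputs[i]: i % 2 for i in range(7)}

# Count the functions that agree with everything seen so far.
consistent = 0
for table in range(num_functions):
    if all(((table >> inputs.index(x)) & 1) == y for x, y in observed.items()):
        consistent += 1

print(num_functions, consistent)  # 65536 512: 2^(16-7) functions still fit the data
```

Every unseen slot can be filled either way, so 512 distinct functions still match the data perfectly; the adversary can make any of them the "true" one.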

SLIDE 10

Solution: Restrict the search space

  • A hypothesis space is the set of possible functions we consider
    – We were looking at the space of all Boolean functions
    – Instead, choose a hypothesis space that is smaller than the space of all functions:
  • Only simple conjunctions (with four variables, there are only 16 conjunctions without negations)
  • Simple disjunctions
  • m-of-n rules: Fix a set of n variables; at least m of them must be true (see the sketch below)
  • Linear functions

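The m-of-n idea is easy to state in code. Below is a minimal sketch (the function name and data are ours, not from the slides); note that an m-of-n rule is itself a linear function: it sums the chosen features and compares the sum to a threshold.

```python
# An m-of-n rule over Boolean features: fix a set of n feature
# indices and predict +1 when at least m of them are true.
def m_of_n(x, indices, m):
    # The rule is linear: a 0/1-weighted sum compared to a threshold.
    return +1 if sum(x[i] for i in indices) >= m else -1

# Example: a 2-of-3 rule over features 0, 2, and 3.
print(m_of_n([1, 0, 1, 0], indices=[0, 2, 3], m=2))  # +1 (features 0 and 2 are true)
print(m_of_n([0, 1, 1, 0], indices=[0, 2, 3], m=2))  # -1 (only feature 2 is true)
```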


SLIDE 15

Which is the better classifier?

Suppose this is our training set and we have to separate the blue circles from the red triangles.

[Figure: the training points with two candidate separators: a wiggly curve labeled A and a straight line labeled B.]

Think about overfitting: which classifier runs the risk of overfitting?

Simplicity versus accuracy


SLIDE 18

Similar argument for regression

[Figure: data points in the (x, F(x)) plane fit by a wiggly curve (A) and a straight line (B).]

Linear regression might make smaller errors on new points.

SLIDE 19

Recall: Regression vs. Classification

  • Linear regression is about predicting real-valued outputs
  • Linear classification is about predicting a discrete class label
    – +1 or −1
    – SPAM or NOT-SPAM
    – Or more than two categories


SLIDE 23

Linear classifiers: An example

Suppose we want to determine whether a robot arm is defective or not, using two measurements:
1. The maximum distance the arm can reach, d
2. The maximum angle it can rotate, a

Suppose we use a linear decision rule that predicts not defective if 2d + 0.01a ≥ 7. We can apply this rule whenever we have the two measurements.

For example, for a certain arm, if d = 3 and a = 200, then 2d + 0.01a = 8 ≥ 7. The arm would be labeled as not defective.

This rule is an example of a linear classifier

Features are weighted and added up; the sum is checked against a threshold.
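As a sanity check, here is the rule as a few lines of Python (the function name is ours; the weights and threshold come from the slide):

```python
# Decision rule from the example: weight the two measurements,
# add them up, and compare the sum against the threshold 7.
def is_defective(d, a):
    score = 2 * d + 0.01 * a
    return score < 7  # the rule predicts "not defective" once the score reaches 7

print(is_defective(d=3, a=200))  # score = 8 >= 7, so False: not defective
```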


SLIDE 26

Linear Classifiers

Inputs are d-dimensional feature vectors, denoted 𝐱
Output is a label y ∈ {−1, +1}

Linear threshold units classify an example 𝐱 using parameters 𝐰 (a d-dimensional vector) and b (a real number) according to the following classification rule:

Output = sgn(𝐰ᵀ𝐱 + b) = sgn(∑ᵢ wᵢxᵢ + b)

if 𝐰ᵀ𝐱 + b ≥ 0, then predict y = +1
if 𝐰ᵀ𝐱 + b < 0, then predict y = −1

b is called the bias term
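A minimal sketch of this rule as code (a NumPy version; the particular numbers are the robot-arm example recast into this template, with the threshold 7 moved into a bias of −7):

```python
import numpy as np

# Linear threshold unit: predict +1 if w.x + b >= 0, else -1.
def predict(w, x, b):
    return 1 if np.dot(w, x) + b >= 0 else -1

w = np.array([2.0, 0.01])   # weight vector
b = -7.0                    # bias term: "... >= 7" becomes "... - 7 >= 0"
x = np.array([3.0, 200.0])  # one example: d = 3, a = 200
print(predict(w, x, b))     # 2*3 + 0.01*200 - 7 = 1 >= 0, so +1
```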


SLIDE 34

The geometry of a linear classifier

sgn(b + w1x1 + w2x2)

In higher dimensions, a linear classifier represents a hyperplane that separates the space into two half-spaces.

[Figure: an illustration in two dimensions: positive (+) and negative (−) points in the (x1, x2) plane separated by the line b + w1x1 + w2x2 = 0, with the weight vector [w1 w2] drawn perpendicular to it.]

We only care about the sign, not the magnitude.

Questions?
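One way to see the "sign, not magnitude" point: scaling the parameters (w, b) by any positive constant rescales every score but never flips its sign, so every prediction is unchanged. A small sketch with made-up numbers:

```python
import numpy as np

w, b = np.array([1.0, -2.0]), 0.5  # hypothetical parameters
points = np.array([[0.0, 0.0], [3.0, 1.0], [-1.0, 2.0]])

# Rescaling (w, b) by any positive c rescales every score by c,
# so the signs, and hence the predictions, are identical.
for c in (1.0, 10.0, 0.01):
    print(np.sign(points @ (c * w) + c * b))  # same signs every time
```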


SLIDE 40

Simplifying notation

We can stop writing b at each step using notational sugar:

The prediction function is sgn(𝐰ᵀ𝐱 + b) = sgn(∑ᵢ wᵢxᵢ + b)

Rewrite 𝐱 as [𝐱; 1]. Call this 𝐱′. Rewrite 𝐰 as [𝐰; b]. Call this 𝐰′.

Note that 𝐰ᵀ𝐱 + b is the same as 𝐰′ᵀ𝐱′, so the prediction function is now sgn(𝐰′ᵀ𝐱′).

In the higher-dimensional space, the separating hyperplane defined by 𝐰′ passes through the origin.

We sometimes hide the bias b and fold it into the weights by adding an extra constant feature. But remember that it is there.

This trick increases dimensionality by one. It is equivalent to adding a feature that is a constant: always 1.

Questions?
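A short check of the folding trick (made-up numbers): appending a constant 1 to 𝐱 and the bias b to 𝐰 reproduces the original score exactly.

```python
import numpy as np

w = np.array([2.0, 0.01])     # weights
b = -7.0                      # bias term
x = np.array([3.0, 200.0])    # one example

w_prime = np.append(w, b)     # w' = [w; b]
x_prime = np.append(x, 1.0)   # x' = [x; 1], the constant feature

print(np.dot(w, x) + b)          # 1.0
print(np.dot(w_prime, x_prime))  # 1.0: the same score, with no separate bias
```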

SLIDE 41

Coming up (next several weeks): Linear classification

  • Perceptron: Error-driven learning; updates the hypothesis if there is an error
  • Support Vector Machines: Define a different cost function that includes an error term and a term that targets future performance
  • Naïve Bayes classifier: A simple linear classifier with a probabilistic interpretation
  • Logistic regression: Another probabilistic linear classifier; bears similarity to support vector machines


In all cases, the prediction will be done with the same rule:

𝐰ᵀ𝐱 + b ≥ 0 ⇒ y = +1
𝐰ᵀ𝐱 + b < 0 ⇒ y = −1