The Perceptron Algorithm (Machine Learning). Some slides based on lectures from Dan Roth, Avrim Blum, and others. PowerPoint presentation transcript.



SLIDE 1

Machine Learning

The Perceptron Algorithm

Some slides based on lectures from Dan Roth, Avrim Blum, and others

SLIDE 2

Outline

  • The Perceptron Algorithm
  • Variants of Perceptron
  • Perceptron Mistake Bound

SLIDE 3

Where are we?

  • The Perceptron Algorithm
  • Variants of Perceptron
  • Perceptron Mistake Bound

SLIDE 4

Recall: Linear Classifiers

Inputs are n-dimensional vectors, denoted by x. Output is a label y ∈ {−1, 1}.

Linear threshold units classify an example x using parameters w (an n-dimensional vector) and b (a real number) according to the following classification rule:

Output = sign(w^T x + b) = sign(Σ_i w_i x_i + b)

w^T x + b ≥ 0 ⇒ y = +1
w^T x + b < 0 ⇒ y = −1

b is called the bias term
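The classification rule above can be sketched directly. The weights w, bias b, and inputs below are hypothetical values chosen for illustration:

```python
# A minimal sketch of a linear threshold unit. The weights, bias, and inputs
# are toy values, not from the slides.
import numpy as np

def predict(w, b, x):
    """Classify x as +1 or -1 via sign(w^T x + b); a score of exactly 0 maps to +1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

w = np.array([2.0, -1.0])
b = 0.5
print(predict(w, b, np.array([1.0, 1.0])))   # score 2 - 1 + 0.5 = 1.5, so +1
print(predict(w, b, np.array([-1.0, 2.0])))  # score -2 - 2 + 0.5 = -3.5, so -1
```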

SLIDE 5

Recall: Linear Classifiers

(Same content as Slide 4.)

[Figure: a perceptron unit diagram. Inputs x_1 … x_8 are multiplied by weights w_1 … w_8, a constant input 1 carries the bias weight b, the products feed a sum Σ, and the sum passes through a sgn threshold.]

b is called the bias term

SLIDE 6

The geometry of a linear classifier

sgn(b + w_1 x_1 + w_2 x_2)

In higher dimensions, a linear classifier represents a hyperplane that separates the space into two half-spaces.

[Figure: a 2D plot with axes x_1 and x_2. Positive (+) points lie on one side and negative (−) points on the other side of the decision boundary b + w_1 x_1 + w_2 x_2 = 0; the weight vector [w_1 w_2] is normal to the boundary.]

We only care about the sign, not the magnitude.

SLIDE 7

The Perceptron

SLIDE 8

The Perceptron algorithm

  • Rosenblatt 1958

– (Though there were some hints of a similar idea earlier, e.g., Agmon 1954)

  • The goal is to find a separating hyperplane

– For separable data, guaranteed to find one

  • An online algorithm

– Processes one example at a time

  • Several variants exist

– We will see these briefly towards the end

SLIDE 9

The Perceptron algorithm

Input: A sequence of training examples (x_1, y_1), (x_2, y_2), ⋯ where all x_i ∈ ℝ^n, y_i ∈ {−1, 1}

1. Initialize w_0 = 0 ∈ ℝ^n
2. For each training example (x_i, y_i):
   1. Predict ŷ = sgn(w_t^T x_i)
   2. If ŷ ≠ y_i: update w_{t+1} ← w_t + r (y_i x_i)
3. Return final weight vector
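The online loop above can be sketched as follows. The tiny example stream and the learning rate r = 1 are illustrative, not from the slides:

```python
# A sketch of the online Perceptron step. The example stream and learning
# rate are toy choices for illustration.
import numpy as np

def perceptron_step(w, x, y, r=1.0):
    """One online step: update w only if (x, y) is misclassified (y w^T x <= 0)."""
    if y * np.dot(w, x) <= 0:        # mistake (a score of 0 counts as a mistake)
        w = w + r * y * x            # w_{t+1} <- w_t + r y x
    return w

w = np.zeros(2)                      # w_0 = 0
stream = [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), -1)]
for x, y in stream:
    w = perceptron_step(w, x, y)
# Both examples trigger updates (their scores start at 0), leaving w = [1, -1].
```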

SLIDE 10

The Perceptron algorithm

(Algorithm as on Slide 9.)

Remember: Prediction = sgn(w^T x). There is typically a bias term also (w^T x + b), but the bias may be treated as a constant feature and folded into w.

SLIDE 11

The Perceptron algorithm

(Algorithm as on Slide 9.)

Remember: Prediction = sgn(w^T x). There is typically a bias term also (w^T x + b), but the bias may be treated as a constant feature and folded into w.

Footnote: For some algorithms it is mathematically easier to represent False as −1, and at other times, as 0. For the Perceptron algorithm, treat −1 as false and +1 as true.

SLIDE 12

The Perceptron algorithm

(Algorithm as on Slide 9.)

Mistake on positive: w_{t+1} ← w_t + r x_i
Mistake on negative: w_{t+1} ← w_t − r x_i

SLIDE 13

The Perceptron algorithm

(Algorithm as on Slide 9.)

r is the learning rate, a small positive number less than 1.

Mistake on positive: w_{t+1} ← w_t + r x_i
Mistake on negative: w_{t+1} ← w_t − r x_i

SLIDE 14

The Perceptron algorithm

(Algorithm as on Slide 9.)

r is the learning rate, a small positive number less than 1. Update only on error: a mistake-driven algorithm.

Mistake on positive: w_{t+1} ← w_t + r x_i
Mistake on negative: w_{t+1} ← w_t − r x_i

SLIDE 15

The Perceptron algorithm

(Algorithm as on Slide 9.)

r is the learning rate, a small positive number less than 1. Update only on error: a mistake-driven algorithm.

A mistake can be written as y_i w_t^T x_i ≤ 0.

Mistake on positive: w_{t+1} ← w_t + r x_i
Mistake on negative: w_{t+1} ← w_t − r x_i

SLIDE 16

The Perceptron algorithm

(Algorithm as on Slide 9.)

r is the learning rate, a small positive number less than 1. Update only on error: a mistake-driven algorithm. This is the simplest version; we will see more robust versions shortly.

A mistake can be written as y_i w_t^T x_i ≤ 0.

Mistake on positive: w_{t+1} ← w_t + r x_i
Mistake on negative: w_{t+1} ← w_t − r x_i

SLIDE 17

Intuition behind the update

Suppose we have made a mistake on a positive example. That is, y = +1 and w_t^T x ≤ 0.

Call the new weight vector w_{t+1} = w_t + x (say r = 1). The new dot product is

w_{t+1}^T x = (w_t + x)^T x = w_t^T x + x^T x ≥ w_t^T x

For a positive example, the Perceptron update will increase the score assigned to the same input. Similar reasoning holds for negative examples.

Mistake on positive: w_{t+1} ← w_t + r x_i
Mistake on negative: w_{t+1} ← w_t − r x_i
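A quick numeric check of this argument, with a hypothetical weight vector and positive example:

```python
# Numeric check: after a mistake on a positive example (y = +1, r = 1), the
# new score w^T x grows by exactly x^T x. The vectors are toy values.
import numpy as np

w = np.array([-1.0, 2.0])
x = np.array([2.0, 0.5])          # a positive example that w misclassifies
old_score = np.dot(w, x)          # -2 + 1 = -1, so sgn gives -1: a mistake
w_new = w + x                     # the update for a mistake on a positive example
new_score = np.dot(w_new, x)
assert new_score == old_score + np.dot(x, x)   # exactly x^T x larger
assert new_score > old_score                   # since x^T x > 0 for nonzero x
```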

SLIDES 18-24

Geometry of the perceptron update

For a mistake on a positive example (x, +1):

[Figure, built up across these slides: the old weight vector w_old and its decision boundary; the positive point x predicted negative; the update w ← w + y x adds x to w_old; the resulting w_new tilts toward x, and after the update x lies on the positive side.]

Mistake on positive: w_{t+1} ← w_t + r x_i
Mistake on negative: w_{t+1} ← w_t − r x_i

SLIDES 25-30

Geometry of the perceptron update

For a mistake on a negative example (x, −1):

[Figure, built up across these slides: the old weight vector w_old and its decision boundary; the negative point x predicted positive; the update w ← w + y x subtracts x from w_old; the resulting w_new tilts away from x, and after the update x lies on the negative side.]

Mistake on positive: w_{t+1} ← w_t + r x_i
Mistake on negative: w_{t+1} ← w_t − r x_i

SLIDE 31

Where are we?

  • The Perceptron Algorithm
  • Variants of Perceptron
  • Perceptron Mistake Bound

SLIDE 32

Practical use of the Perceptron algorithm

  1. Using the Perceptron algorithm with a finite dataset
  2. Voting and Averaging
  3. Margin Perceptron

SLIDE 33

1. The "standard" algorithm

Given a training set D = {(x_i, y_i)} where x_i ∈ ℝ^n, y_i ∈ {−1, 1}

1. Initialize w = 0 ∈ ℝ^n
2. For epoch in 1 ⋯ T:
   1. Shuffle the data
   2. For each training example (x_i, y_i) ∈ D:
      • If y_i w^T x_i ≤ 0, then: update w ← w + r y_i x_i
3. Return w

Prediction on a new example with features x: sgn(w^T x)
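A sketch of this finite-dataset variant, assuming a toy linearly separable dataset and illustrative values for T and r:

```python
# A sketch of the "standard" variant: T epochs over a finite dataset,
# shuffling each epoch. The dataset and the values of T and r are toy choices.
import numpy as np

def perceptron_train(D, T=10, r=1.0, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(len(D[0][0]))             # initialize w = 0
    for _ in range(T):                     # T epochs
        rng.shuffle(D)                     # shuffle the data each epoch
        for x, y in D:
            if y * np.dot(w, x) <= 0:      # error: update
                w = w + r * y * x
    return w

# Linearly separable toy data: the label is the sign of the first coordinate.
D = [(np.array([2.0, 1.0]), 1), (np.array([1.0, -1.0]), 1),
     (np.array([-2.0, 0.5]), -1), (np.array([-1.0, -1.0]), -1)]
w = perceptron_train(D)
# With separable data, the mistake bound guarantees the returned w satisfies
# y w^T x > 0 on every training pair once updates stop.
```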

SLIDE 34

1. The "standard" algorithm

(Algorithm as on Slide 33.)

T is a hyper-parameter to the algorithm. The condition y_i w^T x_i ≤ 0 is another way of writing that there is an error.

SLIDE 35

2. Voting and Averaging

  • So far: we return the final weight vector
  • Voted perceptron

– Remember every weight vector in your sequence of updates.
– At final prediction time, each weight vector gets to vote on the label. The number of votes it gets is the number of iterations it survived before being updated.
– Comes with strong theoretical guarantees about generalization, but is impractical because of storage issues.

  • Averaged perceptron

– Instead of using all weight vectors, use the average weight vector (i.e., longer-surviving weight vectors get more say).
– A more practical alternative, and widely used.
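The voted perceptron described above can be sketched by storing each weight vector with its survival count. The dataset and single-pass structure here are illustrative:

```python
# A sketch of the voted perceptron: keep every weight vector together with
# its survival count c, and let each cast c votes at prediction time.
# The toy dataset is illustrative.
import numpy as np

def voted_perceptron_train(D, T=10):
    w = np.zeros(len(D[0][0]))
    vectors = []                         # (weight vector, survival count) pairs
    c = 0                                # examples the current w has survived
    for _ in range(T):
        for x, y in D:
            if y * np.dot(w, x) <= 0:    # mistake: retire w with its count
                vectors.append((w, c))
                w = w + y * x
                c = 0
            else:
                c += 1
    vectors.append((w, c))               # the final vector also votes
    return vectors

def voted_predict(vectors, x):
    # Each stored vector casts c votes for its own prediction sgn(w^T x).
    s = sum(c * np.sign(np.dot(w, x)) for w, c in vectors)
    return 1 if s >= 0 else -1

D = [(np.array([2.0, 1.0]), 1), (np.array([-2.0, 0.5]), -1)]
vs = voted_perceptron_train(D)
print(voted_predict(vs, np.array([3.0, 0.0])))  # a test point deep on the + side
```

The storage cost is visible here: one vector per mistake, which is what makes averaging the practical alternative.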

SLIDE 36

2. Voting and Averaging

(Same content as Slide 35.)

SLIDE 37

Averaged Perceptron

Given a training set D = {(x_i, y_i)} where x_i ∈ ℝ^n, y_i ∈ {−1, 1}

1. Initialize w = 0 ∈ ℝ^n and a = 0 ∈ ℝ^n
2. For epoch in 1 ⋯ T:
   1. Shuffle the data
   2. For each training example (x_i, y_i) ∈ D:
      • If y_i w^T x_i ≤ 0, then: update w ← w + r y_i x_i
      • a ← a + w
3. Return a

Prediction on a new example with features x: sgn(a^T x)
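The averaged variant above can be sketched by carrying the running sum a alongside w. The toy dataset and hyper-parameter values are illustrative:

```python
# A sketch of the averaged perceptron: alongside w, keep a running sum a
# that is updated after every example, and predict with a.
import numpy as np

def averaged_perceptron_train(D, T=10, r=1.0, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(len(D[0][0]))
    a = np.zeros(len(D[0][0]))         # running sum of weight vectors
    for _ in range(T):
        rng.shuffle(D)                 # shuffle the data each epoch
        for x, y in D:
            if y * np.dot(w, x) <= 0:  # error: update w
                w = w + r * y * x
            a = a + w                  # accumulate after every example
    return a

D = [(np.array([2.0, 1.0]), 1), (np.array([-2.0, 0.5]), -1)]
a = averaged_perceptron_train(D)
# Prediction on a new example x is sgn(a^T x).
```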

SLIDE 38

Averaged Perceptron

(Algorithm as on Slide 37.)

This is the simplest version of the averaged perceptron. There are some easy programming tricks to make sure that a is also updated only when there is an error.
SLIDE 39

Averaged Perceptron

(Algorithm as on Slide 37.)

This is the simplest version of the averaged perceptron. There are some easy programming tricks to make sure that a is also updated only when there is an error.

If you want to use the Perceptron algorithm, use averaging.

SLIDE 40

3. Margin Perceptron

  • Perceptron makes updates only when the prediction is incorrect: y_i w^T x_i ≤ 0
  • What if the prediction is close to being incorrect? That is, pick a small positive θ and update when y_i w^T x_i ≤ θ
  • Can generalize better, but we need to choose θ

Exercise: Why is the margin perceptron a good idea?
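The margin-driven update can be sketched as a one-line change to the mistake condition. The values of θ and r below are illustrative:

```python
# A sketch of the margin perceptron step: update not only on mistakes but
# also on barely-correct predictions with y w^T x <= theta. The values of
# theta, r, and the example are toy choices.
import numpy as np

def margin_perceptron_step(w, x, y, r=1.0, theta=0.1):
    """Update whenever the (signed) score falls inside the margin theta."""
    if y * np.dot(w, x) <= theta:
        w = w + r * y * x
    return w

w = np.array([1.0, 0.0])
x = np.array([0.05, 1.0])
# y w^T x = 0.05: the prediction is correct, but it lies inside the margin
# theta = 0.1, so the margin perceptron still updates (w becomes [1.05, 1.0]).
w = margin_perceptron_step(w, x, y=1)
```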

SLIDE 41

The Perceptron

SLIDE 42

The hype

The New Yorker, December 6, 1958, p. 44; The New York Times, July 8, 1958

SLIDE 43

The hype

The New Yorker, December 6, 1958, p. 44; The New York Times, July 8, 1958. [Photo: the IBM 704 computer.]

SLIDE 44

What you need to know

  • The Perceptron algorithm
  • The geometry of the update
  • What it can represent
  • Variants of the Perceptron algorithm