

SLIDE 1

Machine Learning

The Perceptron Mistake Bound


Some slides based on lectures from Dan Roth, Avrim Blum and others

SLIDE 2

Where are we?

  • The Perceptron Algorithm
  • Variants of Perceptron
  • Perceptron Mistake Bound

SLIDE 3

Convergence

Convergence theorem

– If there exists a set of weights consistent with the data (i.e., the data is linearly separable), the perceptron algorithm will converge.

SLIDE 4

Convergence

Convergence theorem

– If there exists a set of weights consistent with the data (i.e., the data is linearly separable), the perceptron algorithm will converge.

Cycling theorem

– If the training data is not linearly separable, then the learning algorithm will eventually repeat the same set of weights and enter an infinite loop.
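A minimal sketch of the perceptron loop with both theorems in action, in Python with NumPy (the datasets, the epoch cap, and all names here are illustrative, not from the slides). On linearly separable data such as AND it converges; on XOR it keeps making mistakes until the epoch cap, consistent with the cycling theorem:

```python
import numpy as np

def perceptron(X, z, max_epochs=100):
    """Run the perceptron algorithm; return (weights, converged flag)."""
    x = np.zeros(X.shape[1])              # initial weight vector: all zeros
    for _ in range(max_epochs):
        mistakes = 0
        for y_i, z_i in zip(X, z):
            if np.sign(x @ y_i) != z_i:   # prediction disagrees with label
                x = x + z_i * y_i         # additive update
                mistakes += 1
        if mistakes == 0:                 # a full pass with no mistakes
            return x, True
    return x, False                       # hit the cap: no separator found

# AND (with a bias feature in column 0) is linearly separable; XOR is not.
X = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]])
z_and = np.array([-1, -1, -1, 1])
z_xor = np.array([-1, 1, 1, -1])
print(perceptron(X, z_and))   # converges
print(perceptron(X, z_xor))   # never converges within the cap
```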

SLIDE 5

Perceptron Learnability

  • Obviously Perceptron cannot learn what it cannot represent

– Only linearly separable functions

  • Minsky and Papert (1969) wrote an influential book demonstrating Perceptron’s representational limitations

– Parity functions can’t be learned (XOR)

  • We have already seen that XOR is not linearly separable

– In vision, if patterns are represented with local features, can’t represent symmetry, connectivity

SLIDE 6

Margin

The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

[Figure: positively (+) and negatively (−) labeled points on either side of a hyperplane; the margin with respect to this hyperplane is indicated.]
SLIDE 7

Margin

The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it. The margin of a data set (𝛿) is the maximum margin possible for that dataset using any weight vector.
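As a concrete companion to this definition, here is a small sketch (function and variable names are mine, not from the slides) that computes the margin of a dataset with respect to a given weight vector; the margin of the data itself would be the maximum of this quantity over all weight vectors:

```python
import numpy as np

def margin_wrt_hyperplane(X, z, w):
    """Distance from the hyperplane w·y = 0 to the nearest data point.

    Positive iff w separates the data: this is the 'margin with respect
    to this hyperplane' from the slide."""
    return np.min(z * (X @ w)) / np.linalg.norm(w)

# Illustrative dataset: two points on each side of the hyperplane w = (1, 1).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
z = np.array([1, 1, -1, -1])
print(margin_wrt_hyperplane(X, z, np.array([1.0, 1.0])))  # ≈ 2.12
```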

[Figure: positive (+) and negative (−) points; the margin of the data is indicated.]
SLIDE 8

Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let (𝐲₁, 𝑧₁), (𝐲₂, 𝑧₂), ⋯ be a sequence of training examples such that every feature vector 𝐲ᵢ ∈ ℝⁿ with ‖𝐲ᵢ‖ ≤ 𝑆 and the label 𝑧ᵢ ∈ {−1, 1}.

SLIDE 9

Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let (𝐲₁, 𝑧₁), (𝐲₂, 𝑧₂), ⋯ be a sequence of training examples such that every feature vector 𝐲ᵢ ∈ ℝⁿ with ‖𝐲ᵢ‖ ≤ 𝑆 and the label 𝑧ᵢ ∈ {−1, 1}.


We can always find such an 𝑆. Just look for the farthest data point from the origin.
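In code (a tiny check, with names of my choosing), 𝑆 is just the largest norm in the dataset:

```python
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
S = np.linalg.norm(X, axis=1).max()   # the farthest point from the origin
print(S)                              # √5 ≈ 2.236
```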

SLIDE 10

Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let (𝐲₁, 𝑧₁), (𝐲₂, 𝑧₂), ⋯ be a sequence of training examples such that every feature vector 𝐲ᵢ ∈ ℝⁿ with ‖𝐲ᵢ‖ ≤ 𝑆 and the label 𝑧ᵢ ∈ {−1, 1}. Suppose there is a unit vector 𝐯 ∈ ℝⁿ (i.e., ‖𝐯‖ = 1) such that for some positive number 𝛿 ∈ ℝ, 𝛿 > 0, we have 𝑧ᵢ𝐯ᵀ𝐲ᵢ ≥ 𝛿 for every example (𝐲ᵢ, 𝑧ᵢ).

SLIDE 11

Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let (𝐲₁, 𝑧₁), (𝐲₂, 𝑧₂), ⋯ be a sequence of training examples such that every feature vector 𝐲ᵢ ∈ ℝⁿ with ‖𝐲ᵢ‖ ≤ 𝑆 and the label 𝑧ᵢ ∈ {−1, 1}. Suppose there is a unit vector 𝐯 ∈ ℝⁿ (i.e., ‖𝐯‖ = 1) such that for some positive number 𝛿 ∈ ℝ, 𝛿 > 0, we have 𝑧ᵢ𝐯ᵀ𝐲ᵢ ≥ 𝛿 for every example (𝐲ᵢ, 𝑧ᵢ).


The data has a margin 𝛿. Importantly, the data is separable. 𝛿 is the complexity parameter that defines the separability of data.

SLIDE 12

Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let (𝐲₁, 𝑧₁), (𝐲₂, 𝑧₂), ⋯ be a sequence of training examples such that every feature vector 𝐲ᵢ ∈ ℝⁿ with ‖𝐲ᵢ‖ ≤ 𝑆 and the label 𝑧ᵢ ∈ {−1, 1}. Suppose there is a unit vector 𝐯 ∈ ℝⁿ (i.e., ‖𝐯‖ = 1) such that for some positive number 𝛿 ∈ ℝ, 𝛿 > 0, we have 𝑧ᵢ𝐯ᵀ𝐲ᵢ ≥ 𝛿 for every example (𝐲ᵢ, 𝑧ᵢ). Then, the perceptron algorithm will make no more than 𝑆²/𝛿² mistakes on the training sequence.

SLIDE 13

Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let (𝐲₁, 𝑧₁), (𝐲₂, 𝑧₂), ⋯ be a sequence of training examples such that every feature vector 𝐲ᵢ ∈ ℝⁿ with ‖𝐲ᵢ‖ ≤ 𝑆 and the label 𝑧ᵢ ∈ {−1, 1}. Suppose there is a unit vector 𝐯 ∈ ℝⁿ (i.e., ‖𝐯‖ = 1) such that for some positive number 𝛿 ∈ ℝ, 𝛿 > 0, we have 𝑧ᵢ𝐯ᵀ𝐲ᵢ ≥ 𝛿 for every example (𝐲ᵢ, 𝑧ᵢ). Then, the perceptron algorithm will make no more than 𝑆²/𝛿² mistakes on the training sequence.


If 𝐯 hadn’t been a unit vector, then we could rescale 𝛿 in the mistake bound. This would change the final mistake bound to (‖𝐯‖𝑆/𝛿)².
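Spelling out that rescaling: if 𝐯 is not a unit vector, apply the theorem to 𝐯/‖𝐯‖. The margin condition 𝑧ᵢ𝐯ᵀ𝐲ᵢ ≥ 𝛿 becomes 𝑧ᵢ(𝐯/‖𝐯‖)ᵀ𝐲ᵢ ≥ 𝛿/‖𝐯‖, so the bound is 𝑆²/(𝛿/‖𝐯‖)² = (‖𝐯‖𝑆/𝛿)².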

SLIDE 14

Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let (𝐲₁, 𝑧₁), (𝐲₂, 𝑧₂), ⋯ be a sequence of training examples such that every feature vector 𝐲ᵢ ∈ ℝⁿ with ‖𝐲ᵢ‖ ≤ 𝑆 and the label 𝑧ᵢ ∈ {−1, 1}. Suppose there is a unit vector 𝐯 ∈ ℝⁿ (i.e., ‖𝐯‖ = 1) such that for some positive number 𝛿 ∈ ℝ, 𝛿 > 0, we have 𝑧ᵢ𝐯ᵀ𝐲ᵢ ≥ 𝛿 for every example (𝐲ᵢ, 𝑧ᵢ). Then, the perceptron algorithm will make no more than 𝑆²/𝛿² mistakes on the training sequence.


Suppose we have a binary classification dataset with n-dimensional inputs. If the data is separable, then the Perceptron algorithm will find a separating hyperplane after making a finite number of mistakes.
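An end-to-end check of the theorem on a toy dataset (all names and numbers here are illustrative, not from the slides): compute 𝑆 and a margin 𝛿 certified by some unit vector 𝐯, run the perceptron, and confirm the mistake count stays below 𝑆²/𝛿²:

```python
import numpy as np

# Toy separable data in 2D with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
z = np.array([1, 1, -1, -1])

v = np.array([1.0, 1.0]) / np.sqrt(2)     # a unit vector separating the data
S = np.linalg.norm(X, axis=1).max()       # radius of the data: sqrt(5)
delta = np.min(z * (X @ v))               # margin certified by v
bound = (S / delta) ** 2                  # the mistake bound S²/δ²

x, mistakes = np.zeros(2), 0
while True:                               # loop until a mistake-free pass
    errs = [i for i in range(len(X)) if np.sign(x @ X[i]) != z[i]]
    if not errs:
        break
    i = errs[0]
    x, mistakes = x + z[i] * X[i], mistakes + 1

print(mistakes, "<=", bound)              # here: 1 <= 1.11...
```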

SLIDE 15

Proof (preliminaries)

The setting

  • Initial weight vector 𝐱₀ is all zeros
  • Learning rate = 1

– Effectively scales inputs, but does not change the behavior

  • All training examples are contained in a ball of size 𝑆.

– That is, for every example (𝐲ᵢ, 𝑧ᵢ), we have ‖𝐲ᵢ‖ ≤ 𝑆

  • The training data is separable by margin 𝛿 using a unit vector 𝐯.

– That is, for every example (𝐲ᵢ, 𝑧ᵢ), we have 𝑧ᵢ𝐯ᵀ𝐲ᵢ ≥ 𝛿


  • Receive an input (𝐲ᵢ, 𝑧ᵢ)
  • if sgn(𝐱ₜᵀ𝐲ᵢ) ≠ 𝑧ᵢ:
      Update 𝐱ₜ₊₁ ← 𝐱ₜ + 𝑧ᵢ𝐲ᵢ

SLIDE 16

Proof (1/3)

  • 1. Claim: After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿


  • Receive an input (𝐲ᵢ, 𝑧ᵢ)
  • if sgn(𝐱ₜᵀ𝐲ᵢ) ≠ 𝑧ᵢ:
      Update 𝐱ₜ₊₁ ← 𝐱ₜ + 𝑧ᵢ𝐲ᵢ

SLIDE 17

Proof (1/3)

  • 1. Claim: After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿


Because the data is separable by a margin 𝛿

  • Receive an input (𝐲ᵢ, 𝑧ᵢ)
  • if sgn(𝐱ₜᵀ𝐲ᵢ) ≠ 𝑧ᵢ:
      Update 𝐱ₜ₊₁ ← 𝐱ₜ + 𝑧ᵢ𝐲ᵢ

SLIDE 18

Proof (1/3)

  • 1. Claim: After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿

Because 𝐱₀ = 𝟎 (that is, 𝐯ᵀ𝐱₀ = 0), straightforward induction gives us 𝐯ᵀ𝐱ₜ ≥ t𝛿


  • Receive an input (𝐲ᵢ, 𝑧ᵢ)
  • if sgn(𝐱ₜᵀ𝐲ᵢ) ≠ 𝑧ᵢ:
      Update 𝐱ₜ₊₁ ← 𝐱ₜ + 𝑧ᵢ𝐲ᵢ

Because the data is separable by a margin 𝛿
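The induction step the slide references, spelled out: an update happens only on a mistake, and then

𝐯ᵀ𝐱ₜ₊₁ = 𝐯ᵀ(𝐱ₜ + 𝑧ᵢ𝐲ᵢ) = 𝐯ᵀ𝐱ₜ + 𝑧ᵢ𝐯ᵀ𝐲ᵢ ≥ 𝐯ᵀ𝐱ₜ + 𝛿

so each mistake grows 𝐯ᵀ𝐱ₜ by at least 𝛿; starting from 𝐯ᵀ𝐱₀ = 0, after t mistakes 𝐯ᵀ𝐱ₜ ≥ t𝛿.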

SLIDE 19
Proof (2/3)

  • 2. Claim: After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²


  • Receive an input (𝐲ᵢ, 𝑧ᵢ)
  • if sgn(𝐱ₜᵀ𝐲ᵢ) ≠ 𝑧ᵢ:
      Update 𝐱ₜ₊₁ ← 𝐱ₜ + 𝑧ᵢ𝐲ᵢ

SLIDE 20

Proof (2/3)


  • 2. Claim: After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²

The weight is updated only when there is a mistake, that is, when 𝑧ᵢ𝐱ₜᵀ𝐲ᵢ < 0.

‖𝐲ᵢ‖ ≤ 𝑆, by definition of 𝑆

  • Receive an input (𝐲ᵢ, 𝑧ᵢ)
  • if sgn(𝐱ₜᵀ𝐲ᵢ) ≠ 𝑧ᵢ:
      Update 𝐱ₜ₊₁ ← 𝐱ₜ + 𝑧ᵢ𝐲ᵢ

SLIDE 21

Proof (2/3)

  • 2. Claim: After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²

Because 𝐱₀ = 𝟎 (that is, ‖𝐱₀‖² = 0), straightforward induction gives us ‖𝐱ₜ‖² ≤ t𝑆²


  • Receive an input (𝐲ᵢ, 𝑧ᵢ)
  • if sgn(𝐱ₜᵀ𝐲ᵢ) ≠ 𝑧ᵢ:
      Update 𝐱ₜ₊₁ ← 𝐱ₜ + 𝑧ᵢ𝐲ᵢ
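Spelling out this induction step as well: on a mistake, 𝑧ᵢ𝐱ₜᵀ𝐲ᵢ ≤ 0 and 𝑧ᵢ² = 1, so

‖𝐱ₜ₊₁‖² = ‖𝐱ₜ + 𝑧ᵢ𝐲ᵢ‖² = ‖𝐱ₜ‖² + 2𝑧ᵢ𝐱ₜᵀ𝐲ᵢ + ‖𝐲ᵢ‖² ≤ ‖𝐱ₜ‖² + 𝑆²

Each mistake grows ‖𝐱ₜ‖² by at most 𝑆²; starting from ‖𝐱₀‖² = 0, after t mistakes ‖𝐱ₜ‖² ≤ t𝑆².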

SLIDE 22

Proof (3/3)

What we know:

1. After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿
2. After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²

SLIDE 23

Proof (3/3)

What we know:

1. After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿
2. After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²


From (2): ‖𝐱ₜ‖ ≤ √t 𝑆

SLIDE 24

Proof (3/3)

What we know:

1. After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿
2. After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²

From (2): ‖𝐱ₜ‖ ≤ √t 𝑆

𝐯ᵀ𝐱ₜ = ‖𝐯‖ ‖𝐱ₜ‖ cos(angle between them). But ‖𝐯‖ = 1 and cosine is at most 1, so 𝐯ᵀ𝐱ₜ ≤ ‖𝐱ₜ‖

SLIDE 25

Proof (3/3)

What we know:

1. After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿
2. After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²

From (2): ‖𝐱ₜ‖ ≤ √t 𝑆

𝐯ᵀ𝐱ₜ = ‖𝐯‖ ‖𝐱ₜ‖ cos(angle between them). But ‖𝐯‖ = 1 and cosine is at most 1, so 𝐯ᵀ𝐱ₜ ≤ ‖𝐱ₜ‖ (the Cauchy–Schwarz inequality)

SLIDE 26

Proof (3/3)

What we know:

1. After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿
2. After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²

From (1) and (2), together with 𝐯ᵀ𝐱ₜ ≤ ‖𝐱ₜ‖: t𝛿 ≤ 𝐯ᵀ𝐱ₜ ≤ ‖𝐱ₜ‖ ≤ √t 𝑆

SLIDE 27

Proof (3/3)

What we know:

1. After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿
2. After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²

t𝛿 ≤ √t 𝑆, so t ≤ 𝑆²/𝛿² (t = number of mistakes)

SLIDE 28

Proof (3/3)

What we know:

1. After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿
2. After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²

t ≤ 𝑆²/𝛿² bounds the total number of mistakes!
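Putting the two claims together in one chain:

t𝛿 ≤ 𝐯ᵀ𝐱ₜ ≤ ‖𝐯‖ ‖𝐱ₜ‖ = ‖𝐱ₜ‖ ≤ √t 𝑆  ⟹  √t ≤ 𝑆/𝛿  ⟹  t ≤ 𝑆²/𝛿²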

SLIDE 29

Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let (𝐲₁, 𝑧₁), (𝐲₂, 𝑧₂), ⋯ be a sequence of training examples such that every feature vector 𝐲ᵢ ∈ ℝⁿ with ‖𝐲ᵢ‖ ≤ 𝑆 and the label 𝑧ᵢ ∈ {−1, 1}. Suppose there is a unit vector 𝐯 ∈ ℝⁿ (i.e., ‖𝐯‖ = 1) such that for some positive number 𝛿 ∈ ℝ, 𝛿 > 0, we have 𝑧ᵢ𝐯ᵀ𝐲ᵢ ≥ 𝛿 for every example (𝐲ᵢ, 𝑧ᵢ). Then, the perceptron algorithm will make no more than 𝑆²/𝛿² mistakes on the training sequence.

SLIDE 30

The Perceptron Mistake bound

  • 𝑆 is a property of the dimensionality. How?

– For Boolean functions with n attributes, show that 𝑆² = n.

  • 𝛿 is a property of the data
  • Exercises:

– How many mistakes will the Perceptron algorithm make for disjunctions with n attributes?

  • What are 𝑆 and 𝛿?

– How many mistakes will the Perceptron algorithm make for k-disjunctions with n attributes?
– Find a sequence of examples that will force the Perceptron algorithm to make O(n) mistakes for a concept that is a k-disjunction.

SLIDE 31

Beyond the separable case

  • Good news

– Perceptron makes no assumption about the data distribution; the data could even be chosen adversarially
– After a fixed number of mistakes, you are done; you don’t even need to see any more data

  • Bad news: Real world is not linearly separable

– Can’t expect to never make mistakes again
– What can we do: add more features, try to make the data linearly separable if you can, use averaging
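One of the variants mentioned above, averaging, in a minimal sketch (this is one common formulation, not necessarily the exact variant the course defines): instead of the final weight vector, return the average of the weight vectors seen during training, which tends to be more stable when the data is not separable.

```python
import numpy as np

def averaged_perceptron(X, z, epochs=10):
    """Averaged perceptron: returns the mean of all intermediate weights."""
    x = np.zeros(X.shape[1])
    total = np.zeros(X.shape[1])   # running sum of weight vectors
    count = 0
    for _ in range(epochs):
        for y_i, z_i in zip(X, z):
            if np.sign(x @ y_i) != z_i:
                x = x + z_i * y_i  # usual perceptron update
            total += x             # accumulate after every example
            count += 1
    return total / count           # averaged weights

# Illustrative usage on a tiny dataset:
w = averaged_perceptron(np.array([[1.0, 1.0], [-1.0, -1.0]]), np.array([1, -1]))
print(w)
```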

SLIDE 32

What you need to know

  • What is the perceptron mistake bound?
  • How to prove it

SLIDE 33

Summary: Perceptron

  • Online learning algorithm, very widely used, easy to implement
  • Additive updates to weights
  • Geometric interpretation
  • Mistake bound
  • Practical variants abound
  • You should be able to implement the Perceptron algorithm and its variants, and also prove the mistake bound theorem
