The Perceptron Mistake Bound


  1. The Perceptron Mistake Bound. Machine Learning. Some slides based on lectures from Dan Roth, Avrim Blum, and others.

  2. Where are we? • The Perceptron Algorithm • Variants of Perceptron • Perceptron Mistake Bound

  3. Convergence. Convergence theorem: If there exists a set of weights consistent with the data (i.e., the data is linearly separable), the perceptron algorithm will converge.

  4. Convergence. Convergence theorem: If there exists a set of weights consistent with the data (i.e., the data is linearly separable), the perceptron algorithm will converge. Cycling theorem: If the training data is not linearly separable, then the learning algorithm will eventually repeat the same set of weights and enter an infinite loop.

  5. Perceptron Learnability • Obviously the Perceptron cannot learn what it cannot represent: only linearly separable functions. • Minsky and Papert (1969) wrote an influential book demonstrating the Perceptron's representational limitations: parity functions (e.g., XOR) cannot be learned, and we have already seen that XOR is not linearly separable (a short demonstration follows below). In vision, if patterns are represented with local features, the Perceptron cannot represent symmetry or connectivity.
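As a quick check of the XOR claim above, here is a minimal worked argument (the encoding of the four inputs as $(0,0), (1,0), (0,1), (1,1)$ with labels $-, +, +, -$ is an assumption made for illustration, not something fixed by the slides). A linear separator $w_1 x_1 + w_2 x_2 + b$ would have to satisfy

$$ b < 0, \qquad w_1 + b > 0, \qquad w_2 + b > 0, \qquad w_1 + w_2 + b < 0. $$

Adding the two middle inequalities gives $w_1 + w_2 + 2b > 0$, i.e., $w_1 + w_2 + b > -b > 0$, which contradicts the last inequality. Hence no choice of weights and bias separates XOR, which is why the Perceptron cannot learn it.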

  6. Margin. The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it. [Figure: positive and negative points on either side of a hyperplane, with the margin with respect to this hyperplane marked.]
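In symbols, and assuming (as the theorem below does) a hyperplane through the origin with normal vector $\mathbf{w}$: the distance from a point $\mathbf{x}$ to the hyperplane $\{\mathbf{z} : \mathbf{w}^T\mathbf{z} = 0\}$ is $|\mathbf{w}^T\mathbf{x}| / \|\mathbf{w}\|$, so the margin of a hyperplane that separates the data is

$$ \min_i \; \frac{y_i\,\mathbf{w}^T\mathbf{x}_i}{\|\mathbf{w}\|}. $$

This is a sketch of the definition; the slide states it only in words and with the figure.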

  7. Margin. The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it. The margin of a dataset ($\gamma$) is the maximum margin possible for that dataset using any weight vector. [Figure: the same points with the maximum-margin hyperplane drawn and the margin of the data marked.]
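The dataset margin on this slide can be written as a max-min (a sketch; restricting to hyperplanes through the origin matches the theorem that follows):

$$ \gamma \;=\; \max_{\mathbf{w} \ne \mathbf{0}} \; \min_i \; \frac{y_i\,\mathbf{w}^T\mathbf{x}_i}{\|\mathbf{w}\|}. $$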

  8. Mistake Bound Theorem [Novikoff 1962, Block 1962]. Let $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots$ be a sequence of training examples such that every feature vector $\mathbf{x}_i \in \Re^n$ with $\|\mathbf{x}_i\| \le R$ and the label $y_i \in \{-1, 1\}$.

  9. Mistake Bound Theorem [Novikoff 1962, Block 1962]. Let $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots$ be a sequence of training examples such that every feature vector $\mathbf{x}_i \in \Re^n$ with $\|\mathbf{x}_i\| \le R$ and the label $y_i \in \{-1, 1\}$. We can always find such an $R$: just look for the farthest data point from the origin.
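As a minimal sketch of "look for the farthest data point from the origin", assuming the examples are stacked row-wise in a NumPy array `X` of shape (m, n) (the array and its values are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical toy data: m = 3 examples with n = 2 features each.
X = np.array([[0.5, 1.0], [-2.0, 0.3], [1.5, -1.5]])

# R is the largest Euclidean norm among the rows of X,
# i.e., the distance of the farthest point from the origin.
R = np.linalg.norm(X, axis=1).max()
print(R)  # about 2.121 for this toy data
```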

  10. Mistake Bound Theorem [Novikoff 1962, Block 1962]. Let $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots$ be a sequence of training examples such that every feature vector $\mathbf{x}_i \in \Re^n$ with $\|\mathbf{x}_i\| \le R$ and the label $y_i \in \{-1, 1\}$. Suppose there is a unit vector $\mathbf{u} \in \Re^n$ (i.e., $\|\mathbf{u}\| = 1$) such that for some positive number $\gamma \in \Re$, $\gamma > 0$, we have $y_i\,\mathbf{u}^T\mathbf{x}_i \ge \gamma$ for every example $(\mathbf{x}_i, y_i)$.

  11. Mistake Bound Theorem [Novikoff 1962, Block 1962]. Let $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots$ be a sequence of training examples such that every feature vector $\mathbf{x}_i \in \Re^n$ with $\|\mathbf{x}_i\| \le R$ and the label $y_i \in \{-1, 1\}$. Suppose there is a unit vector $\mathbf{u} \in \Re^n$ (i.e., $\|\mathbf{u}\| = 1$) such that for some positive number $\gamma \in \Re$, $\gamma > 0$, we have $y_i\,\mathbf{u}^T\mathbf{x}_i \ge \gamma$ for every example $(\mathbf{x}_i, y_i)$. The data has a margin $\gamma$. Importantly, the data is separable; $\gamma$ is the complexity parameter that defines the separability of the data.

  12. Mistake Bound Theorem [Novikoff 1962, Block 1962]. Let $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots$ be a sequence of training examples such that every feature vector $\mathbf{x}_i \in \Re^n$ with $\|\mathbf{x}_i\| \le R$ and the label $y_i \in \{-1, 1\}$. Suppose there is a unit vector $\mathbf{u} \in \Re^n$ (i.e., $\|\mathbf{u}\| = 1$) such that for some positive number $\gamma \in \Re$, $\gamma > 0$, we have $y_i\,\mathbf{u}^T\mathbf{x}_i \ge \gamma$ for every example $(\mathbf{x}_i, y_i)$. Then, the perceptron algorithm will make no more than $R^2 / \gamma^2$ mistakes on the training sequence.
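To make the bound concrete, here is a small worked example with made-up numbers (not from the slides): if every example satisfies $\|\mathbf{x}_i\| \le R = 10$ and the data is separable with margin $\gamma = 0.5$, then the perceptron makes at most

$$ \frac{R^2}{\gamma^2} = \frac{10^2}{0.5^2} = \frac{100}{0.25} = 400 $$

mistakes, no matter how many examples the sequence contains or in what order they arrive.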

  13. Mistake Bound Theorem [Novikoff 1962, Block 1962]. Let $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots$ be a sequence of training examples such that every feature vector $\mathbf{x}_i \in \Re^n$ with $\|\mathbf{x}_i\| \le R$ and the label $y_i \in \{-1, 1\}$. Suppose there is a unit vector $\mathbf{u} \in \Re^n$ (i.e., $\|\mathbf{u}\| = 1$) such that for some positive number $\gamma \in \Re$, $\gamma > 0$, we have $y_i\,\mathbf{u}^T\mathbf{x}_i \ge \gamma$ for every example $(\mathbf{x}_i, y_i)$. Then, the perceptron algorithm will make no more than $R^2 / \gamma^2$ mistakes on the training sequence. If $\mathbf{u}$ had not been a unit vector, then we could scale $\gamma$ in the mistake bound; this would change the final mistake bound to $(\|\mathbf{u}\| R / \gamma)^2$.
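A one-line sketch of why the scaled bound has this form: if a (not necessarily unit) vector $\mathbf{u} \ne \mathbf{0}$ satisfies $y_i\,\mathbf{u}^T\mathbf{x}_i \ge \gamma$, then the unit vector $\mathbf{u}/\|\mathbf{u}\|$ satisfies

$$ y_i \left(\frac{\mathbf{u}}{\|\mathbf{u}\|}\right)^{T} \mathbf{x}_i \;\ge\; \frac{\gamma}{\|\mathbf{u}\|}, $$

so applying the theorem with margin $\gamma/\|\mathbf{u}\|$ gives the bound $R^2 / (\gamma/\|\mathbf{u}\|)^2 = (\|\mathbf{u}\| R / \gamma)^2$.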

  14. Mistake Bound Theorem [Novikoff 1962, Block 1962]. Let $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots$ be a sequence of training examples such that every feature vector $\mathbf{x}_i \in \Re^n$ with $\|\mathbf{x}_i\| \le R$ and the label $y_i \in \{-1, 1\}$. (In plain words: suppose we have a binary classification dataset with n-dimensional inputs.) Suppose there is a unit vector $\mathbf{u} \in \Re^n$ (i.e., $\|\mathbf{u}\| = 1$) such that for some positive number $\gamma \in \Re$, $\gamma > 0$, we have $y_i\,\mathbf{u}^T\mathbf{x}_i \ge \gamma$ for every example $(\mathbf{x}_i, y_i)$. (If the data is separable…) Then, the perceptron algorithm will make no more than $R^2 / \gamma^2$ mistakes on the training sequence. (…then the Perceptron algorithm will find a separating hyperplane after making a finite number of mistakes.)

  15. Proof (preliminaries). The algorithm: receive an input $(\mathbf{x}_i, y_i)$; if $\mathrm{sgn}(\mathbf{w}_t^T\mathbf{x}_i) \ne y_i$, update $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_i\,\mathbf{x}_i$. The setting: • The initial weight vector $\mathbf{w}_0$ is all zeros. • Learning rate = 1: this effectively scales the inputs, but does not change the behavior. • All training examples are contained in a ball of size $R$; that is, for every example $(\mathbf{x}_i, y_i)$, we have $\|\mathbf{x}_i\| \le R$. • The training data is separable by margin $\gamma$ using a unit vector $\mathbf{u}$; that is, for every example $(\mathbf{x}_i, y_i)$, we have $y_i\,\mathbf{u}^T\mathbf{x}_i \ge \gamma$. (A runnable sketch of this mistake-driven update appears below.)
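The following is a minimal sketch of the mistake-driven update described above, not the slides' own code; the toy dataset, the helper name, and the use of NumPy are assumptions. It counts mistakes and compares the count against the $R^2/\gamma^2$ bound for a known separator.

```python
import numpy as np

def perceptron_mistakes(X, y, epochs=1000):
    """Mistake-driven perceptron: learning rate 1, w0 = 0 (as in the slides).

    X: (m, n) array of feature vectors; y: (m,) array of +1/-1 labels.
    Returns the final weight vector and the total number of mistakes."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(epochs):
        made_mistake = False
        for x_i, y_i in zip(X, y):
            if np.sign(w @ x_i) != y_i:   # prediction disagrees with the label
                w = w + y_i * x_i         # update only on a mistake
                mistakes += 1
                made_mistake = True
        if not made_mistake:              # a full pass with no mistakes: converged
            break
    return w, mistakes

# Hypothetical separable toy data: labels given by the unit vector u = (1, 0).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
X = X[np.abs(X[:, 0]) > 0.1]              # enforce a margin around the line x1 = 0
y = np.sign(X[:, 0])

R = np.linalg.norm(X, axis=1).max()       # radius of the data (farthest point)
gamma = np.min(y * X[:, 0])               # margin of the data w.r.t. u = (1, 0)
w, mistakes = perceptron_mistakes(X, y)
print(mistakes, "<=", (R / gamma) ** 2)   # the theorem guarantees this inequality
```

Since $\mathbf{u} = (1, 0)$ is a unit vector that separates this toy data with margin `gamma`, the printed mistake count must be at most $(R/\gamma)^2$.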

  16. Proof (1/3). Reminder: on input $(\mathbf{x}_i, y_i)$, if $\mathrm{sgn}(\mathbf{w}_t^T\mathbf{x}_i) \ne y_i$, update $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_i\,\mathbf{x}_i$. Claim 1: After $t$ mistakes, $\mathbf{u}^T\mathbf{w}_t \ge t\gamma$.

  17. Proof (1/3). Reminder: on input $(\mathbf{x}_i, y_i)$, if $\mathrm{sgn}(\mathbf{w}_t^T\mathbf{x}_i) \ne y_i$, update $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_i\,\mathbf{x}_i$. Claim 1: After $t$ mistakes, $\mathbf{u}^T\mathbf{w}_t \ge t\gamma$. This holds because the data is separable by a margin $\gamma$.

  18. Proof (1/3). Reminder: on input $(\mathbf{x}_i, y_i)$, if $\mathrm{sgn}(\mathbf{w}_t^T\mathbf{x}_i) \ne y_i$, update $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_i\,\mathbf{x}_i$. Claim 1: After $t$ mistakes, $\mathbf{u}^T\mathbf{w}_t \ge t\gamma$. This holds because the data is separable by a margin $\gamma$. Because $\mathbf{w}_0 = \mathbf{0}$ (that is, $\mathbf{u}^T\mathbf{w}_0 = 0$), straightforward induction gives us $\mathbf{u}^T\mathbf{w}_t \ge t\gamma$.
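For completeness, here is a sketch of the single induction step the slide alludes to: if mistake number $t+1$ happens on example $(\mathbf{x}_i, y_i)$, then

$$ \mathbf{u}^T\mathbf{w}_{t+1} \;=\; \mathbf{u}^T(\mathbf{w}_t + y_i\,\mathbf{x}_i) \;=\; \mathbf{u}^T\mathbf{w}_t + y_i\,\mathbf{u}^T\mathbf{x}_i \;\ge\; \mathbf{u}^T\mathbf{w}_t + \gamma, $$

where the inequality uses the margin assumption. Starting from $\mathbf{u}^T\mathbf{w}_0 = 0$, each of the $t$ mistakes adds at least $\gamma$, giving $\mathbf{u}^T\mathbf{w}_t \ge t\gamma$.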

  19. Proof (2/3). Reminder: on input $(\mathbf{x}_i, y_i)$, if $\mathrm{sgn}(\mathbf{w}_t^T\mathbf{x}_i) \ne y_i$, update $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_i\,\mathbf{x}_i$. Claim 2: After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$.

  20. Proof (2/3). Reminder: on input $(\mathbf{x}_i, y_i)$, if $\mathrm{sgn}(\mathbf{w}_t^T\mathbf{x}_i) \ne y_i$, update $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_i\,\mathbf{x}_i$. Claim 2: After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$. Here $\|\mathbf{x}_i\| \le R$, by definition of $R$, and the weight is updated only when there is a mistake, that is, when $y_i\,\mathbf{w}_t^T\mathbf{x}_i \le 0$.

  21. Proof (2/3). Reminder: on input $(\mathbf{x}_i, y_i)$, if $\mathrm{sgn}(\mathbf{w}_t^T\mathbf{x}_i) \ne y_i$, update $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_i\,\mathbf{x}_i$. Claim 2: After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$. Because $\mathbf{w}_0 = \mathbf{0}$ (that is, $\|\mathbf{w}_0\|^2 = 0$), straightforward induction gives us $\|\mathbf{w}_t\|^2 \le tR^2$.
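Again, a sketch of the induction step behind this claim: if mistake number $t+1$ happens on example $(\mathbf{x}_i, y_i)$, then

$$ \|\mathbf{w}_{t+1}\|^2 \;=\; \|\mathbf{w}_t + y_i\,\mathbf{x}_i\|^2 \;=\; \|\mathbf{w}_t\|^2 + 2\,y_i\,\mathbf{w}_t^T\mathbf{x}_i + \|\mathbf{x}_i\|^2 \;\le\; \|\mathbf{w}_t\|^2 + R^2, $$

because the middle term is non-positive on a mistake ($y_i\,\mathbf{w}_t^T\mathbf{x}_i \le 0$), $y_i^2 = 1$, and $\|\mathbf{x}_i\|^2 \le R^2$. Starting from $\|\mathbf{w}_0\|^2 = 0$, after $t$ mistakes we get $\|\mathbf{w}_t\|^2 \le tR^2$.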

  22. Proof (3/3). What we know: 1. After $t$ mistakes, $\mathbf{u}^T\mathbf{w}_t \ge t\gamma$. 2. After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$.

  23. Proof (3/3). What we know: 1. After $t$ mistakes, $\mathbf{u}^T\mathbf{w}_t \ge t\gamma$. 2. After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$. From (2): $\|\mathbf{w}_t\| \le \sqrt{t}\,R$.

  24. Proof (3/3). What we know: 1. After $t$ mistakes, $\mathbf{u}^T\mathbf{w}_t \ge t\gamma$. 2. After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$. From (2): $\|\mathbf{w}_t\| \le \sqrt{t}\,R$. Also, $\mathbf{u}^T\mathbf{w}_t = \|\mathbf{u}\|\,\|\mathbf{w}_t\|\cos(\text{angle between them})$; but $\|\mathbf{u}\| = 1$ and the cosine is at most 1, so $\mathbf{u}^T\mathbf{w}_t \le \|\mathbf{w}_t\|$.

  25. Proof (3/3). What we know: 1. After $t$ mistakes, $\mathbf{u}^T\mathbf{w}_t \ge t\gamma$. 2. After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$. From (2): $\|\mathbf{w}_t\| \le \sqrt{t}\,R$. Also, $\mathbf{u}^T\mathbf{w}_t = \|\mathbf{u}\|\,\|\mathbf{w}_t\|\cos(\text{angle between them})$; but $\|\mathbf{u}\| = 1$ and the cosine is at most 1, so $\mathbf{u}^T\mathbf{w}_t \le \|\mathbf{w}_t\|$ (Cauchy-Schwarz inequality).

  26. Proof (3/3). What we know: 1. After $t$ mistakes, $\mathbf{u}^T\mathbf{w}_t \ge t\gamma$. 2. After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$. From (2): $\|\mathbf{w}_t\| \le \sqrt{t}\,R$. Also, $\mathbf{u}^T\mathbf{w}_t = \|\mathbf{u}\|\,\|\mathbf{w}_t\|\cos(\text{angle between them})$; but $\|\mathbf{u}\| = 1$ and the cosine is at most 1, so $\mathbf{u}^T\mathbf{w}_t \le \|\mathbf{w}_t\|$ (Cauchy-Schwarz inequality). From (1): $t\gamma \le \mathbf{u}^T\mathbf{w}_t$, so $t\gamma \le \mathbf{u}^T\mathbf{w}_t \le \|\mathbf{w}_t\| \le \sqrt{t}\,R$.

  27. Proof (3/3). What we know: 1. After $t$ mistakes, $\mathbf{u}^T\mathbf{w}_t \ge t\gamma$. 2. After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$. Combining the two gives $t\gamma \le \sqrt{t}\,R$. Number of mistakes: $t \le R^2/\gamma^2$.

  28. Proof (3/3). What we know: 1. After $t$ mistakes, $\mathbf{u}^T\mathbf{w}_t \ge t\gamma$. 2. After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$. Combining the two gives $t\gamma \le \sqrt{t}\,R$. Number of mistakes: $t \le R^2/\gamma^2$. This bounds the total number of mistakes!
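Putting the whole argument on one line (a compact recap of slides 22 through 28):

$$ t\gamma \;\le\; \mathbf{u}^T\mathbf{w}_t \;\le\; \|\mathbf{u}\|\,\|\mathbf{w}_t\| \;=\; \|\mathbf{w}_t\| \;\le\; \sqrt{t}\,R \quad\Longrightarrow\quad \sqrt{t} \;\le\; \frac{R}{\gamma} \quad\Longrightarrow\quad t \;\le\; \frac{R^2}{\gamma^2}. $$

Note that the bound depends only on the radius $R$ of the data and its margin $\gamma$, not on the number of training examples or the dimensionality of the inputs.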
