

SLIDE 1

Machine Learning

The Perceptron Mistake Bound


Some slides based on lectures from Dan Roth, Avrim Blum and others

SLIDE 2

Where are we?

  • The Perceptron Algorithm
  • Variants of Perceptron
  • Perceptron Mistake Bound

SLIDE 3

Convergence

Convergence theorem

– If there exists a set of weights consistent with the data (i.e., the data is linearly separable), the perceptron algorithm will converge.

SLIDE 4

Convergence

Convergence theorem

– If there exists a set of weights consistent with the data (i.e., the data is linearly separable), the perceptron algorithm will converge.

Cycling theorem

– If the training data is not linearly separable, then the learning algorithm will eventually repeat the same set of weights and enter an infinite loop.
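A minimal sketch of the perceptron loop with both theorems in action, in Python with NumPy (the datasets, the epoch cap, and all names here are illustrative, not from the slides). On linearly separable data such as AND it converges; on XOR it keeps making mistakes until the epoch cap, consistent with the cycling theorem:

```python
import numpy as np

def perceptron(X, z, max_epochs=100):
    """Run the perceptron algorithm; return (weights, converged flag)."""
    x = np.zeros(X.shape[1])              # initial weight vector: all zeros
    for _ in range(max_epochs):
        mistakes = 0
        for y_i, z_i in zip(X, z):
            if np.sign(x @ y_i) != z_i:   # prediction disagrees with label
                x = x + z_i * y_i         # additive update
                mistakes += 1
        if mistakes == 0:                 # a full pass with no mistakes
            return x, True
    return x, False                       # hit the cap: no separator found

# AND (with a bias feature in column 0) is linearly separable; XOR is not.
X = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]])
z_and = np.array([-1, -1, -1, 1])
z_xor = np.array([-1, 1, 1, -1])
print(perceptron(X, z_and))   # converges
print(perceptron(X, z_xor))   # never converges within the cap
```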

SLIDE 5

Perceptron Learnability

  • Obviously Perceptron cannot learn what it cannot represent

– Only linearly separable functions

  • Minsky and Papert (1969) wrote an influential book demonstrating Perceptron’s representational limitations

– Parity functions can’t be learned (XOR)

  • We have already seen that XOR is not linearly separable

– In vision, if patterns are represented with local features, can’t represent symmetry, connectivity

SLIDE 6

Margin

The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

[Figure: positively (+) and negatively (−) labeled points on either side of a hyperplane; the margin with respect to this hyperplane is indicated.]
SLIDE 7

Margin

The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it. The margin of a data set (𝛿) is the maximum margin possible for that dataset using any weight vector.
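As a concrete companion to this definition, here is a small sketch (function and variable names are mine, not from the slides) that computes the margin of a dataset with respect to a given weight vector; the margin of the data itself would be the maximum of this quantity over all weight vectors:

```python
import numpy as np

def margin_wrt_hyperplane(X, z, w):
    """Distance from the hyperplane w·y = 0 to the nearest data point.

    Positive iff w separates the data: this is the 'margin with respect
    to this hyperplane' from the slide."""
    return np.min(z * (X @ w)) / np.linalg.norm(w)

# Illustrative dataset: two points on each side of the hyperplane w = (1, 1).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
z = np.array([1, 1, -1, -1])
print(margin_wrt_hyperplane(X, z, np.array([1.0, 1.0])))  # ≈ 2.12
```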

[Figure: positive (+) and negative (−) points; the margin of the data is indicated.]
SLIDE 8

Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let (𝐲₁, 𝑧₁), (𝐲₂, 𝑧₂), ⋯ be a sequence of training examples such that every feature vector 𝐲ᵢ ∈ ℝⁿ with ‖𝐲ᵢ‖ ≤ 𝑆 and the label 𝑧ᵢ ∈ {−1, 1}.

SLIDE 9

Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let (𝐲₁, 𝑧₁), (𝐲₂, 𝑧₂), ⋯ be a sequence of training examples such that every feature vector 𝐲ᵢ ∈ ℝⁿ with ‖𝐲ᵢ‖ ≤ 𝑆 and the label 𝑧ᵢ ∈ {−1, 1}.


We can always find such an 𝑆. Just look for the farthest data point from the origin.
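In code (a tiny check, with names of my choosing), 𝑆 is just the largest norm in the dataset:

```python
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
S = np.linalg.norm(X, axis=1).max()   # the farthest point from the origin
print(S)                              # √5 ≈ 2.236
```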

SLIDE 10

Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let (𝐲₁, 𝑧₁), (𝐲₂, 𝑧₂), ⋯ be a sequence of training examples such that every feature vector 𝐲ᵢ ∈ ℝⁿ with ‖𝐲ᵢ‖ ≤ 𝑆 and the label 𝑧ᵢ ∈ {−1, 1}. Suppose there is a unit vector 𝐯 ∈ ℝⁿ (i.e., ‖𝐯‖ = 1) such that for some positive number 𝛿 ∈ ℝ, 𝛿 > 0, we have 𝑧ᵢ𝐯ᵀ𝐲ᵢ ≥ 𝛿 for every example (𝐲ᵢ, 𝑧ᵢ).

SLIDE 11

Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let (𝐲₁, 𝑧₁), (𝐲₂, 𝑧₂), ⋯ be a sequence of training examples such that every feature vector 𝐲ᵢ ∈ ℝⁿ with ‖𝐲ᵢ‖ ≤ 𝑆 and the label 𝑧ᵢ ∈ {−1, 1}. Suppose there is a unit vector 𝐯 ∈ ℝⁿ (i.e., ‖𝐯‖ = 1) such that for some positive number 𝛿 ∈ ℝ, 𝛿 > 0, we have 𝑧ᵢ𝐯ᵀ𝐲ᵢ ≥ 𝛿 for every example (𝐲ᵢ, 𝑧ᵢ).


The data has a margin 𝛿. Importantly, the data is separable. 𝛿 is the complexity parameter that defines the separability of data.

SLIDE 12

Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let (𝐲₁, 𝑧₁), (𝐲₂, 𝑧₂), ⋯ be a sequence of training examples such that every feature vector 𝐲ᵢ ∈ ℝⁿ with ‖𝐲ᵢ‖ ≤ 𝑆 and the label 𝑧ᵢ ∈ {−1, 1}. Suppose there is a unit vector 𝐯 ∈ ℝⁿ (i.e., ‖𝐯‖ = 1) such that for some positive number 𝛿 ∈ ℝ, 𝛿 > 0, we have 𝑧ᵢ𝐯ᵀ𝐲ᵢ ≥ 𝛿 for every example (𝐲ᵢ, 𝑧ᵢ). Then, the perceptron algorithm will make no more than 𝑆²/𝛿² mistakes on the training sequence.

SLIDE 13

Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let (𝐲₁, 𝑧₁), (𝐲₂, 𝑧₂), ⋯ be a sequence of training examples such that every feature vector 𝐲ᵢ ∈ ℝⁿ with ‖𝐲ᵢ‖ ≤ 𝑆 and the label 𝑧ᵢ ∈ {−1, 1}. Suppose there is a unit vector 𝐯 ∈ ℝⁿ (i.e., ‖𝐯‖ = 1) such that for some positive number 𝛿 ∈ ℝ, 𝛿 > 0, we have 𝑧ᵢ𝐯ᵀ𝐲ᵢ ≥ 𝛿 for every example (𝐲ᵢ, 𝑧ᵢ). Then, the perceptron algorithm will make no more than 𝑆²/𝛿² mistakes on the training sequence.


If 𝐯 hadn’t been a unit vector, then we could rescale 𝛿 in the mistake bound. This would change the final mistake bound to (‖𝐯‖𝑆/𝛿)².
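Spelling out that rescaling: if 𝐯 is not a unit vector, apply the theorem to 𝐯/‖𝐯‖. The margin condition 𝑧ᵢ𝐯ᵀ𝐲ᵢ ≥ 𝛿 becomes 𝑧ᵢ(𝐯/‖𝐯‖)ᵀ𝐲ᵢ ≥ 𝛿/‖𝐯‖, so the bound is 𝑆²/(𝛿/‖𝐯‖)² = (‖𝐯‖𝑆/𝛿)².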

SLIDE 14

Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let (𝐲₁, 𝑧₁), (𝐲₂, 𝑧₂), ⋯ be a sequence of training examples such that every feature vector 𝐲ᵢ ∈ ℝⁿ with ‖𝐲ᵢ‖ ≤ 𝑆 and the label 𝑧ᵢ ∈ {−1, 1}. Suppose there is a unit vector 𝐯 ∈ ℝⁿ (i.e., ‖𝐯‖ = 1) such that for some positive number 𝛿 ∈ ℝ, 𝛿 > 0, we have 𝑧ᵢ𝐯ᵀ𝐲ᵢ ≥ 𝛿 for every example (𝐲ᵢ, 𝑧ᵢ). Then, the perceptron algorithm will make no more than 𝑆²/𝛿² mistakes on the training sequence.


Suppose we have a binary classification dataset with n-dimensional inputs. If the data is separable, then the Perceptron algorithm will find a separating hyperplane after making a finite number of mistakes.
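An end-to-end check of the theorem on a toy dataset (all names and numbers here are illustrative, not from the slides): compute 𝑆 and a margin 𝛿 certified by some unit vector 𝐯, run the perceptron, and confirm the mistake count stays below 𝑆²/𝛿²:

```python
import numpy as np

# Toy separable data in 2D with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
z = np.array([1, 1, -1, -1])

v = np.array([1.0, 1.0]) / np.sqrt(2)     # a unit vector separating the data
S = np.linalg.norm(X, axis=1).max()       # radius of the data: sqrt(5)
delta = np.min(z * (X @ v))               # margin certified by v
bound = (S / delta) ** 2                  # the mistake bound S²/δ²

x, mistakes = np.zeros(2), 0
while True:                               # loop until a mistake-free pass
    errs = [i for i in range(len(X)) if np.sign(x @ X[i]) != z[i]]
    if not errs:
        break
    i = errs[0]
    x, mistakes = x + z[i] * X[i], mistakes + 1

print(mistakes, "<=", bound)              # here: 1 <= 1.11...
```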

SLIDE 15

Proof (preliminaries)

The setting

  • Initial weight vector 𝐱₀ is all zeros
  • Learning rate = 1

– Effectively scales inputs, but does not change the behavior

  • All training examples are contained in a ball of size 𝑆.

– That is, for every example (𝐲ᵢ, 𝑧ᵢ), we have ‖𝐲ᵢ‖ ≤ 𝑆

  • The training data is separable by margin 𝛿 using a unit vector 𝐯.

– That is, for every example (𝐲ᵢ, 𝑧ᵢ), we have 𝑧ᵢ𝐯ᵀ𝐲ᵢ ≥ 𝛿


  • Receive an input (𝐲ᵢ, 𝑧ᵢ)
  • if sgn(𝐱ₜᵀ𝐲ᵢ) ≠ 𝑧ᵢ:
      Update 𝐱ₜ₊₁ ← 𝐱ₜ + 𝑧ᵢ𝐲ᵢ

SLIDE 16

Proof (1/3)

  • 1. Claim: After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿


  • Receive an input (𝐲ᵢ, 𝑧ᵢ)
  • if sgn(𝐱ₜᵀ𝐲ᵢ) ≠ 𝑧ᵢ:
      Update 𝐱ₜ₊₁ ← 𝐱ₜ + 𝑧ᵢ𝐲ᵢ

SLIDE 17

Proof (1/3)

  • 1. Claim: After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿


Because the data is separable by a margin 𝛿

  • Receive an input (𝐲ᵢ, 𝑧ᵢ)
  • if sgn(𝐱ₜᵀ𝐲ᵢ) ≠ 𝑧ᵢ:
      Update 𝐱ₜ₊₁ ← 𝐱ₜ + 𝑧ᵢ𝐲ᵢ

SLIDE 18

Proof (1/3)

  • 1. Claim: After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿

Because 𝐱₀ = 𝟎 (that is, 𝐯ᵀ𝐱₀ = 0), straightforward induction gives us 𝐯ᵀ𝐱ₜ ≥ t𝛿


  • Receive an input (𝐲ᵢ, 𝑧ᵢ)
  • if sgn(𝐱ₜᵀ𝐲ᵢ) ≠ 𝑧ᵢ:
      Update 𝐱ₜ₊₁ ← 𝐱ₜ + 𝑧ᵢ𝐲ᵢ

Because the data is separable by a margin 𝛿
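The induction step the slide references, spelled out: an update happens only on a mistake, and then

𝐯ᵀ𝐱ₜ₊₁ = 𝐯ᵀ(𝐱ₜ + 𝑧ᵢ𝐲ᵢ) = 𝐯ᵀ𝐱ₜ + 𝑧ᵢ𝐯ᵀ𝐲ᵢ ≥ 𝐯ᵀ𝐱ₜ + 𝛿

so each mistake grows 𝐯ᵀ𝐱ₜ by at least 𝛿; starting from 𝐯ᵀ𝐱₀ = 0, after t mistakes 𝐯ᵀ𝐱ₜ ≥ t𝛿.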

SLIDE 19
Proof (2/3)

  • 2. Claim: After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²


  • Receive an input (𝐲ᵢ, 𝑧ᵢ)
  • if sgn(𝐱ₜᵀ𝐲ᵢ) ≠ 𝑧ᵢ:
      Update 𝐱ₜ₊₁ ← 𝐱ₜ + 𝑧ᵢ𝐲ᵢ

SLIDE 20

Proof (2/3)


  • 2. Claim: After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²

The weight is updated only when there is a mistake, that is, when 𝑧ᵢ𝐱ₜᵀ𝐲ᵢ < 0.

‖𝐲ᵢ‖ ≤ 𝑆, by definition of 𝑆

  • Receive an input (𝐲ᵢ, 𝑧ᵢ)
  • if sgn(𝐱ₜᵀ𝐲ᵢ) ≠ 𝑧ᵢ:
      Update 𝐱ₜ₊₁ ← 𝐱ₜ + 𝑧ᵢ𝐲ᵢ

SLIDE 21

Proof (2/3)

  • 2. Claim: After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²

Because 𝐱₀ = 𝟎 (that is, ‖𝐱₀‖² = 0), straightforward induction gives us ‖𝐱ₜ‖² ≤ t𝑆²


  • Receive an input (𝐲ᵢ, 𝑧ᵢ)
  • if sgn(𝐱ₜᵀ𝐲ᵢ) ≠ 𝑧ᵢ:
      Update 𝐱ₜ₊₁ ← 𝐱ₜ + 𝑧ᵢ𝐲ᵢ
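Spelling out this induction step as well: on a mistake, 𝑧ᵢ𝐱ₜᵀ𝐲ᵢ ≤ 0 and 𝑧ᵢ² = 1, so

‖𝐱ₜ₊₁‖² = ‖𝐱ₜ + 𝑧ᵢ𝐲ᵢ‖² = ‖𝐱ₜ‖² + 2𝑧ᵢ𝐱ₜᵀ𝐲ᵢ + ‖𝐲ᵢ‖² ≤ ‖𝐱ₜ‖² + 𝑆²

Each mistake grows ‖𝐱ₜ‖² by at most 𝑆²; starting from ‖𝐱₀‖² = 0, after t mistakes ‖𝐱ₜ‖² ≤ t𝑆².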

SLIDE 22

Proof (3/3)

What we know:

1. After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿
2. After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²

SLIDE 23

Proof (3/3)

What we know:

1. After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿
2. After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²


From (2): ‖𝐱ₜ‖ ≤ √t 𝑆

SLIDE 24

Proof (3/3)

What we know:

1. After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿
2. After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²

From (2): ‖𝐱ₜ‖ ≤ √t 𝑆

𝐯ᵀ𝐱ₜ = ‖𝐯‖ ‖𝐱ₜ‖ cos(angle between them). But ‖𝐯‖ = 1 and cosine is at most 1, so 𝐯ᵀ𝐱ₜ ≤ ‖𝐱ₜ‖

SLIDE 25

Proof (3/3)

What we know:

1. After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿
2. After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²

From (2): ‖𝐱ₜ‖ ≤ √t 𝑆

𝐯ᵀ𝐱ₜ = ‖𝐯‖ ‖𝐱ₜ‖ cos(angle between them). But ‖𝐯‖ = 1 and cosine is at most 1, so 𝐯ᵀ𝐱ₜ ≤ ‖𝐱ₜ‖ (the Cauchy–Schwarz inequality)

SLIDE 26

Proof (3/3)

What we know:

1. After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿
2. After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²

From (1) and (2), together with 𝐯ᵀ𝐱ₜ ≤ ‖𝐱ₜ‖: t𝛿 ≤ 𝐯ᵀ𝐱ₜ ≤ ‖𝐱ₜ‖ ≤ √t 𝑆

SLIDE 27

Proof (3/3)

What we know:

1. After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿
2. After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²

t𝛿 ≤ √t 𝑆, so t ≤ 𝑆²/𝛿² (t = number of mistakes)

SLIDE 28

Proof (3/3)

What we know:

1. After t mistakes, 𝐯ᵀ𝐱ₜ ≥ t𝛿
2. After t mistakes, ‖𝐱ₜ‖² ≤ t𝑆²

t ≤ 𝑆²/𝛿² bounds the total number of mistakes!
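Putting the two claims together in one chain:

t𝛿 ≤ 𝐯ᵀ𝐱ₜ ≤ ‖𝐯‖ ‖𝐱ₜ‖ = ‖𝐱ₜ‖ ≤ √t 𝑆  ⟹  √t ≤ 𝑆/𝛿  ⟹  t ≤ 𝑆²/𝛿²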

SLIDE 29

Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let (𝐲₁, 𝑧₁), (𝐲₂, 𝑧₂), ⋯ be a sequence of training examples such that every feature vector 𝐲ᵢ ∈ ℝⁿ with ‖𝐲ᵢ‖ ≤ 𝑆 and the label 𝑧ᵢ ∈ {−1, 1}. Suppose there is a unit vector 𝐯 ∈ ℝⁿ (i.e., ‖𝐯‖ = 1) such that for some positive number 𝛿 ∈ ℝ, 𝛿 > 0, we have 𝑧ᵢ𝐯ᵀ𝐲ᵢ ≥ 𝛿 for every example (𝐲ᵢ, 𝑧ᵢ). Then, the perceptron algorithm will make no more than 𝑆²/𝛿² mistakes on the training sequence.

SLIDE 30

The Perceptron Mistake bound

  • 𝑆 is a property of the dimensionality. How?

– For Boolean functions with n attributes, show that 𝑆² = n.

  • 𝛿 is a property of the data
  • Exercises:

– How many mistakes will the Perceptron algorithm make for disjunctions with n attributes?

  • What are 𝑆 and 𝛿?

– How many mistakes will the Perceptron algorithm make for k-disjunctions with n attributes?
– Find a sequence of examples that will force the Perceptron algorithm to make O(n) mistakes for a concept that is a k-disjunction.

SLIDE 31

Beyond the separable case

  • Good news

– Perceptron makes no assumption about the data distribution; the data could even be chosen adversarially
– After a fixed number of mistakes, you are done; you don’t even need to see any more data

  • Bad news: Real world is not linearly separable

– Can’t expect to never make mistakes again
– What can we do: add more features, try to make the data linearly separable if you can, use averaging
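One of the variants mentioned above, averaging, in a minimal sketch (this is one common formulation, not necessarily the exact variant the course defines): instead of the final weight vector, return the average of the weight vectors seen during training, which tends to be more stable when the data is not separable.

```python
import numpy as np

def averaged_perceptron(X, z, epochs=10):
    """Averaged perceptron: returns the mean of all intermediate weights."""
    x = np.zeros(X.shape[1])
    total = np.zeros(X.shape[1])   # running sum of weight vectors
    count = 0
    for _ in range(epochs):
        for y_i, z_i in zip(X, z):
            if np.sign(x @ y_i) != z_i:
                x = x + z_i * y_i  # usual perceptron update
            total += x             # accumulate after every example
            count += 1
    return total / count           # averaged weights

# Illustrative usage on a tiny dataset:
w = averaged_perceptron(np.array([[1.0, 1.0], [-1.0, -1.0]]), np.array([1, -1]))
print(w)
```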

SLIDE 32

What you need to know

  • What is the perceptron mistake bound?
  • How to prove it

SLIDE 33

Summary: Perceptron

  • Online learning algorithm, very widely used, easy to implement
  • Additive updates to weights
  • Geometric interpretation
  • Mistake bound
  • Practical variants abound
  • You should be able to implement the Perceptron algorithm and its variants, and also prove the mistake bound theorem
