Perceptron

10-601 Introduction to Machine Learning
Machine Learning Department
School of Computer Science
Carnegie Mellon University

Matt Gormley
Lecture 6, Sep. 17, 2018
Q&A

Q: We pick the best hyperparameters by learning on the data and evaluating error on the validation set. For our final model, should we then learn from training + validation?
Yes. Let's assume that {train-original} is the original training data, and {test} is the provided test dataset.
1. Split {train-original} into {train-subset} and {validation}.
2. Pick the hyperparameters that, when training on {train-subset}, give the lowest error on {validation}. Call these hyperparameters {best-hyper}.
3. Retrain a new model using {best-hyper} on {train-original} = {train-subset} ∪ {validation}.
4. Report test error by evaluating on {test}.
Alternatively, you could replace Step 1/2 with the following: Pick the hyperparameters that give the lowest cross-validation error on {train-original}.
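A minimal sketch of this recipe in Python (not from the lecture): scikit-learn's Perceptron stands in for whatever model is being tuned, the data are synthetic, and the alpha grid is hypothetical.

from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

# {train-original} and {test}
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train_orig, X_test, y_train_orig, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 1: split {train-original} into {train-subset} and {validation}
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_orig, y_train_orig, test_size=0.25, random_state=0)

# Step 2: pick {best-hyper} as the setting with lowest validation error
best_hyper, best_err = None, float("inf")
for alpha in [1e-5, 1e-4, 1e-3, 1e-2]:  # hypothetical grid
    model = Perceptron(penalty="l2", alpha=alpha, random_state=0).fit(X_tr, y_tr)
    err = 1.0 - model.score(X_val, y_val)
    if err < best_err:
        best_hyper, best_err = alpha, err

# Step 3: retrain on {train-original} = {train-subset} ∪ {validation}
final = Perceptron(penalty="l2", alpha=best_hyper, random_state=0)
final.fit(X_train_orig, y_train_orig)

# Step 4: report test error by evaluating on {test}
print("test error:", 1.0 - final.score(X_test, y_test))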
Imagine you are trying to build a new machine learning technique… your name is Frank Rosenblatt… and the year is 1957.
Looking ahead: commonly used Linear Classifiers
– Perceptron
– Logistic Regression
– Naïve Bayes (under certain conditions)
– Support Vector Machines
Batch Learning: Learn from all the examples at once.
Online Learning: Gradually learn as each example is received.
Slide adapted from Nina Balcan
Example: run the online Perceptron on the following sequence of labeled points:
(−1,2) −, (1,0) +, (1,1) +, (−1,0) −, (−1,−2) −, (1,−1) +

§ Set t=1, start with the all-zeroes weight vector w_1.
§ Given example x, predict positive iff w_t · x ≥ 0.
§ On a mistake, update as follows:
   Mistake on positive: w_{t+1} ← w_t + x
   Mistake on negative: w_{t+1} ← w_t − x

The resulting sequence of weight vectors:
w_1 = (0,0)
w_2 = w_1 − (−1,2) = (1,−2)
w_3 = w_2 + (1,1) = (2,−1)
w_4 = w_3 − (−1,−2) = (3,1)
Slide adapted from Nina Balcan
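The following short Python script (a sketch, not from the slides) replays the trace above with the update rule just given, and reproduces w_2, w_3, and w_4:

import numpy as np

# the labeled examples from the slide, in order
examples = [((-1, 2), -1), ((1, 0), +1), ((1, 1), +1),
            ((-1, 0), -1), ((-1, -2), -1), ((1, -1), +1)]

w = np.zeros(2)  # w_1 = (0, 0)
for point, label in examples:
    x = np.array(point, dtype=float)
    prediction = +1 if w @ x >= 0 else -1  # predict positive iff w · x >= 0
    if prediction != label:                # mistake: add on +, subtract on -
        w = w + label * x
        print("mistake on", point, "-> new w =", w)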
Notation Trick: fold the bias b and the weights w into a single vector θ by prepending a constant to x and increasing dimensionality by one!
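In code, this trick is a one-line preprocessing step. A minimal sketch (the values of x, w, and b are illustrative):

import numpy as np

x = np.array([2.0, -1.0])          # original input of length M
w, b = np.array([0.5, 0.3]), 1.0   # weights and bias

x_prime = np.hstack(([1.0], x))    # prepend a constant 1 to x
theta = np.hstack(([b], w))        # fold b and w into a single vector θ
assert np.isclose(theta @ x_prime, w @ x + b)  # same score, one dot product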
Data: Inputs are continuous vectors of length M. Outputs are discrete: y ∈ {+1, −1}.
Prediction: Output determined by a hyperplane:
ŷ = hθ(x) = sign(θᵀx), where sign(a) = +1 if a ≥ 0, and −1 otherwise.
Learning: Iterative, mistake-driven procedure (the Perceptron algorithm below).
Implementation Trick: the update θ ← θ + y(i) x(i) has the same behavior as our “add on positive mistake and subtract on negative mistake” version, because y(i) takes care of the sign.
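A tiny sanity check of this trick (a sketch with illustrative values): the single signed update adds x on a positive mistake and subtracts x on a negative one.

import numpy as np

theta = np.array([2.0, -1.0])
x = np.array([1.0, 3.0])

assert np.allclose(theta + (+1) * x, theta + x)  # positive mistake: add x
assert np.allclose(theta + (-1) * x, theta - x)  # negative mistake: subtract x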
Learning for Perceptron also works if we have a fixed training dataset, D. We call this the “batch” setting in contrast to the “online” setting that we’ve discussed so far.
Algorithm 1 Perceptron Learning Algorithm (Batch)
1: procedure PERCEPTRON(D = {(x(1), y(1)), . . . , (x(N), y(N))})
2:   θ ← 0                           ▷ Initialize parameters
3:   while not converged do
4:     for i ∈ {1, 2, . . . , N} do  ▷ For each example
5:       ŷ ← sign(θᵀ x(i))           ▷ Predict
6:       if ŷ ≠ y(i) then            ▷ If mistake
7:         θ ← θ + y(i) x(i)         ▷ Update parameters
8:   return θ
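A runnable Python version of Algorithm 1 (a sketch: the “no mistakes in a full pass” convergence test and the max_epochs cap are implementation choices not specified on the slide, and the bias is assumed folded into θ as above):

import numpy as np

def sign(a):
    return 1.0 if a >= 0 else -1.0           # sign(a) = +1 if a >= 0, else -1

def perceptron_batch(X, y, max_epochs=100):
    """X: (N, M) inputs; y: (N,) labels in {+1, -1}."""
    theta = np.zeros(X.shape[1])             # initialize parameters
    for _ in range(max_epochs):              # while not converged
        mistakes = 0
        for i in range(len(y)):              # for each example
            y_hat = sign(theta @ X[i])       # predict
            if y_hat != y[i]:                # if mistake
                theta = theta + y[i] * X[i]  # update parameters
                mistakes += 1
        if mistakes == 0:                    # converged: a perfect pass
            break
    return theta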
Discussion: The Batch Perceptron Algorithm can be derived in two ways:
1. By extending the online Perceptron algorithm to the batch setting (as mentioned above)
2. By applying stochastic gradient descent (SGD) to minimize a so-called Hinge Loss on a linear separator
Extensions of Perceptron:
Voted Perceptron
– generalizes better than (standard) perceptron
– memory intensive (keeps around every weight vector seen during training, so each one can vote)
Averaged Perceptron (see the sketch after this list)
– empirically similar performance to voted perceptron
– can be implemented in a memory efficient way (running averages are efficient)
Kernel Perceptron
– Choose a kernel K(x’, x)
– Apply the kernel trick to Perceptron
– Resulting algorithm is still very simple
Structured Perceptron
– Basic idea can also be applied when y ranges over an exponentially large set
– Mistake bound does not depend on the size of that set
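As an illustration of the memory-efficiency point, here is a sketch of an averaged perceptron that keeps a single running-sum vector instead of every weight vector seen during training (an illustration, not the course’s reference implementation):

import numpy as np

def averaged_perceptron(X, y, epochs=10):
    """Return the average of the weight vectors held at every step."""
    theta = np.zeros(X.shape[1])
    theta_sum = np.zeros(X.shape[1])  # running sum replaces storing all vectors
    steps = 0
    for _ in range(epochs):
        for i in range(len(y)):
            if y[i] * (theta @ X[i]) <= 0:   # mistake
                theta = theta + y[i] * X[i]
            theta_sum += theta               # accumulate after every example
            steps += 1
    return theta_sum / steps                 # the averaged weights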
Definition: The margin of example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0 (or the negative if on the wrong side).

[Figure: margins of a positive example x_1 and a negative example x_2 w.r.t. separator w]

Definition: The margin δ_w of a set of examples T w.r.t. a linear separator w is the smallest margin over points x ∈ T.

Definition: The margin δ of a set of examples T is the maximum δ_w over all linear separators w.

Slide from Nina Balcan
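These definitions translate directly into code. A minimal numpy sketch (the function names are mine; labels are assumed to be in {+1, −1}):

import numpy as np

def example_margin(w, x, y):
    """Signed distance from x to the plane w · x = 0 (negative if misclassified)."""
    return y * (w @ x) / np.linalg.norm(w)

def set_margin(w, X, Y):
    """Margin δ_w of a set: the smallest example margin over all points."""
    return min(example_margin(w, x, y) for x, y in zip(X, Y))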
Def: For a binary classification problem, a set of examples T is linearly separable if there exists a linear decision boundary that can separate the points.
Slide adapted from Nina Balcan
(Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn’t change the number of mistakes; algo is invariant to scaling.)
Perceptron Mistake Bound
Guarantee: If the data has margin γ and all points lie inside a ball of radius R, then Perceptron makes ≤ (R/γ)² mistakes.
[Figure: the data lie inside a ball of radius R and are separated with margin γ]
Def: We say that the (batch) perceptron algorithm has converged if it stops making mistakes on the training data (perfectly classifies the training data).

Main Takeaway: For linearly separable data, if the perceptron algorithm cycles repeatedly through the data, it will converge in a finite number of steps.
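A quick empirical illustration of the takeaway (a sketch on a hypothetical dataset where γ and R are easy to bound: the label is the sign of the first coordinate and |x₁| ≥ 0.5, so θ* = (1, 0) gives margin γ ≥ 0.5):

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
X[:, 0] += np.where(X[:, 0] >= 0, 0.5, -0.5)  # push points away from x1 = 0
y = np.sign(X[:, 0])

gamma = 0.5
R = np.linalg.norm(X, axis=1).max()

theta, mistakes = np.zeros(2), 0
for _ in range(100):                     # cycle repeatedly through the data
    made_mistake = False
    for i in range(len(y)):
        if y[i] * (theta @ X[i]) <= 0:   # mistake
            theta += y[i] * X[i]
            mistakes += 1
            made_mistake = True
    if not made_mistake:                 # converged: a perfect pass
        break

print(f"mistakes = {mistakes}, bound (R/gamma)^2 = {(R / gamma) ** 2:.1f}")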
Figure from Nina Balcan
Perceptron Mistake Bound
[Figure: positively and negatively labeled points inside a ball of radius R, separated by θ* with margin γ]

Theorem 0.1 (Block (1962), Novikoff (1962)). Given dataset: D = {(x(i), y(i))}, i = 1, . . . , N.
Suppose:
1. Finite size inputs: ||x(i)|| ≤ R, ∀i
2. Linearly separable data: ∃ θ* s.t. ||θ*|| = 1 and y(i)(θ* · x(i)) ≥ γ, ∀i
Then: The number of mistakes made by the Perceptron algorithm on this dataset is k ≤ (R/γ)².
Proof of Perceptron Mistake Bound:
We will show that there exist constants A and B s.t.
Ak ≤ ||θ(k+1)|| ≤ B√k
Algorithm 1 Perceptron Learning Algorithm (Online)
1: procedure PERCEPTRON(D = {(x(1), y(1)), (x(2), y(2)), . . .})
2:   θ ← 0, k = 1                     ▷ Initialize parameters
3:   for i ∈ {1, 2, . . .} do         ▷ For each example
4:     if y(i)(θ(k) · x(i)) ≤ 0 then  ▷ If mistake
5:       θ(k+1) ← θ(k) + y(i) x(i)    ▷ Update parameters
6:       k ← k + 1
7:   return θ
Proof of Perceptron Mistake Bound:
Part 1: for some A, Ak ≤ ||θ(k+1)||

θ(k+1) · θ* = (θ(k) + y(i) x(i)) · θ*       by Perceptron algorithm update
            = θ(k) · θ* + y(i)(θ* · x(i))
            ≥ θ(k) · θ* + γ                 by assumption

⇒ θ(k+1) · θ* ≥ kγ     by induction on k, since θ(1) = 0
⇒ ||θ(k+1)|| ≥ kγ      since ||u|| ||v|| ≥ u · v (Cauchy-Schwarz inequality) and ||θ*|| = 1
Proof of Perceptron Mistake Bound:
Part 2: for some B, ||θ(k+1)|| ≤ B√k

||θ(k+1)||² = ||θ(k) + y(i) x(i)||²                               by Perceptron algorithm update
            = ||θ(k)||² + (y(i))² ||x(i)||² + 2 y(i)(θ(k) · x(i))
            ≤ ||θ(k)||² + (y(i))² ||x(i)||²     since kth mistake ⇒ y(i)(θ(k) · x(i)) ≤ 0
            ≤ ||θ(k)||² + R²                    since (y(i))² ||x(i)||² = ||x(i)||² ≤ R² by assumption and (y(i))² = 1

⇒ ||θ(k+1)||² ≤ kR²    by induction on k, since ||θ(1)||² = 0
⇒ ||θ(k+1)|| ≤ √k R
Proof of Perceptron Mistake Bound:
Part 3: Combining the bounds finishes the proof:

kγ ≤ ||θ(k+1)|| ≤ √k R  ⇒  k ≤ (R/γ)²

The total number of mistakes must be less than this. ∎
What if the data is not linearly separable?
1. Perceptron will not converge in this case (it can’t!)
2. However, Freund & Schapire (1999) show that by projecting the points (hypothetically) into a higher dimensional space, we can achieve a similar bound on the number of mistakes made on one pass through the sequence of examples.
Theorem 2. Let ⟨(x_1, y_1), . . . , (x_m, y_m)⟩ be a sequence of labeled examples with ||x_i|| ≤ R. Let u be any vector with ||u|| = 1 and let γ > 0. Define the deviation of each example as d_i = max{0, γ − y_i(u · x_i)}, and define D = √(Σ_{i=1}^m d_i²). Then the number of mistakes of the online perceptron algorithm on this sequence is bounded by ((R + D)/γ)².
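The quantities in Theorem 2 are easy to compute. A small numpy helper (a sketch; the function name is mine, and u is assumed to be unit-norm):

import numpy as np

def fs_mistake_bound(X, y, u, gamma):
    """Freund & Schapire (1999) bound ((R + D) / gamma)^2 for one pass."""
    R = np.linalg.norm(X, axis=1).max()
    d = np.maximum(0.0, gamma - y * (X @ u))  # per-example deviations d_i
    D = np.sqrt(np.sum(d ** 2))
    return ((R + D) / gamma) ** 2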
You should be able to…
– explain the difference between online learning and batch learning
– implement the perceptron algorithm for binary classification [CIML]
– determine whether the perceptron algorithm will converge based on properties of the dataset, and the limitations of the convergence guarantees
– describe the inductive bias of perceptron and the limitations of linear models