Regularization + Perceptron - Matt Gormley

10-601 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Regularization + Perceptron
Matt Gormley
Lecture 10, February 20, 2016
Perceptron Readings: Murphy 8.5.4; Bishop 4.1.7; HTF --; Mitchell 4.4.0

Reminders
• Homework 3: Linear / Logistic Regression
  – Release: Mon, Feb. 13
  – Due: Wed, Feb. 22 at 11:59pm
• Homework 4: Perceptron / Kernels / SVM (1 week for HW4)
  – Release: Wed, Feb. 22
  – Due: Wed, Mar. 01 at 11:59pm
• Midterm Exam (Evening Exam)
  – Tue, Mar. 07 at 7:00pm – 9:30pm
  – See Piazza for details about location

Outline
• Regularization
  – Motivation: Overfitting
  – L2, L1, L0 Regularization
  – Relation between Regularization and MAP Estimation
• Perceptron
  – Online Learning
  – Margin Definitions
  – Perceptron Algorithm
  – Perceptron Mistake Bound
• Generative vs. Discriminative Classifiers

REGULARIZATION

Overfitting
Definition: The problem of overfitting is when the model captures the noise in the training data instead of the underlying structure.
Overfitting can occur in all the models we've seen so far:
  – KNN (e.g. when k is small)
  – Naïve Bayes (e.g. without a prior)
  – Linear Regression (e.g. with basis functions)
  – Logistic Regression (e.g. with many rare features)

Motivation: Regularization
Example: Stock Prices
• Suppose we wish to predict Google's stock price at time t+1
• What features should we use? (putting all computational concerns aside)
  – Stock prices of all other stocks at times t, t-1, t-2, …, t-k
  – Mentions of Google with positive / negative sentiment words in all newspapers and social media outlets
• Do we believe that all of these features are going to be useful?

Motivation: Regularization
• Occam's Razor: prefer the simplest hypothesis
• What does it mean for a hypothesis (or model) to be simple?
  1. small number of features (model selection)
  2. small number of "important" features (shrinkage)

Regularization
Whiteboard
  – L2, L1, L0 Regularization
  – Example: Linear Regression
  – Probabilistic Interpretation of Regularization

Regularization
Don't Regularize the Bias (Intercept) Parameter!
• In our models so far, the bias / intercept parameter is usually denoted by $\theta_0$, that is, the parameter for which we fixed $x_0 = 1$
• Regularizers typically avoid penalizing this bias / intercept parameter
• Why? Because otherwise the learning algorithms wouldn't be invariant to a shift in the y-values
Whitening Data
• It's common to whiten (standardize) each feature by subtracting its mean and dividing by its standard deviation
• For regularization, this helps all the features be penalized in the same units (e.g. convert both centimeters and kilometers to z-scores)
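The two pieces of advice on this slide can be sketched together for L2-regularized linear regression. The following is a minimal sketch, not the course's reference implementation: the function names `standardize` and `ridge_fit`, the parameter `lam`, and the synthetic data are all assumptions introduced here. It converts features to z-scores and then solves the regularized least-squares problem in closed form while leaving the intercept unpenalized.

```python
import numpy as np

def standardize(X):
    """Convert each column to z-scores: subtract its mean, divide by its standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std, mean, std

def ridge_fit(X, y, lam):
    """Closed-form L2-regularized least squares with an unpenalized intercept."""
    n, k = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # column 0 is the constant feature x_0 = 1
    penalty = lam * np.eye(k + 1)
    penalty[0, 0] = 0.0                    # do not penalize theta_0
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

# Synthetic data (assumed for illustration): two features in very different units.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2)) * np.array([1.0, 1000.0])
y = 3.0 + X[:, 0] + 0.002 * X[:, 1] + 0.1 * rng.normal(size=50)

Z, _, _ = standardize(X)
theta = ridge_fit(Z, y, lam=10.0)
theta_shifted = ridge_fit(Z, y + 100.0, lam=10.0)  # same data, y-values shifted by 100
```

Because the intercept is excluded from the penalty, shifting every y-value by 100 moves only `theta_shifted[0]`; the remaining weights match those in `theta`, which is the invariance the slide refers to.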

Regularization: [figure omitted] (Slide courtesy of William Cohen)

Polynomial Coefficients: [table omitted; columns by regularization strength: none, exp(18), huge] (Slide courtesy of William Cohen)

Over Regularization: [figure omitted] (Slide courtesy of William Cohen)

Regularization Exercise
In-class Exercise
1. Plot train error vs. # features (cartoon)
2. Plot test error vs. # features (cartoon)
[Axes: error (vertical) vs. # features (horizontal)]

Example: Logistic Regression
• [Figure: training data]
• [Figure: test data]
• [Figure: error vs. 1/lambda]

Takeaways
1. Nonlinear basis functions allow linear models (e.g. Linear Regression, Logistic Regression) to capture nonlinear aspects of the original input
2. Nonlinear features require no changes to the model (i.e. just preprocessing)
3. Regularization helps to avoid overfitting
4. Regularization and MAP estimation are equivalent for appropriately chosen priors
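The last takeaway can be made concrete with a short derivation. The sketch below assumes an independent zero-mean Gaussian prior with variance $\sigma^2$ on each parameter; the symbols $\sigma^2$ and $\lambda$ are notation introduced here, not taken from the slides. Under that assumption, the MAP objective reduces to the L2-regularized negative log-likelihood.

```latex
\begin{align*}
\theta_{\text{MAP}}
  &= \arg\max_{\theta} \; \log p(\mathcal{D} \mid \theta) + \log p(\theta) \\
  &= \arg\max_{\theta} \; \log p(\mathcal{D} \mid \theta)
     - \frac{1}{2\sigma^2} \sum_{k} \theta_k^2 + \text{const}
     && \text{with } p(\theta) = \prod_{k} \mathcal{N}(\theta_k \mid 0, \sigma^2) \\
  &= \arg\min_{\theta} \; -\log p(\mathcal{D} \mid \theta)
     + \lambda \lVert \theta \rVert_2^2,
     && \lambda = \frac{1}{2\sigma^2}
\end{align*}
```

By the same argument, a zero-mean Laplace prior with scale $b$ on each parameter contributes $-\lVert \theta \rVert_1 / b$ to the log-posterior, which gives L1 regularization with $\lambda = 1/b$.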

THE PERCEPTRON ALGORITHM

Background: Hyperplanes
Why don't we drop the generative model and try to learn this hyperplane directly?

Background: Hyperplanes
Hyperplane (Definition 1): $H = \{x : w^T x = b\}$
Hyperplane (Definition 2): $H = \{x : w^T x = 0 \text{ and } x_0 = 1\}$
Half-spaces:
$H^+ = \{x : w^T x > 0 \text{ and } x_0 = 1\}$
$H^- = \{x : w^T x < 0 \text{ and } x_0 = 1\}$

Background: Hyperplanes
Directly modeling the hyperplane would use a decision function:
$h(x) = \mathrm{sign}(\theta^T x)$ for $y \in \{-1, +1\}$

Online Learning
For i = 1, 2, 3, …:
• Receive an unlabeled instance $x^{(i)}$
• Predict $y' = h(x^{(i)})$
• Receive true label $y^{(i)}$
• Check for correctness ($y' == y^{(i)}$)
Goal:
• Minimize the number of mistakes
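The protocol above can be sketched as a loop that counts mistakes. This is an illustrative sketch only: `MajorityLearner`, `run_online`, and the tiny stream are hypothetical names and data introduced here, and the learner is a deliberately trivial baseline rather than anything from the lecture.

```python
class MajorityLearner:
    """Hypothetical baseline: predicts whichever label has been more frequent so far."""

    def __init__(self):
        self.label_sum = 0  # running sum of the +1 / -1 labels seen so far

    def predict(self, x):
        return +1 if self.label_sum >= 0 else -1

    def update(self, x, y):
        self.label_sum += y


def run_online(learner, stream):
    """Run the receive / predict / reveal loop and return the number of mistakes."""
    mistakes = 0
    for x, y in stream:
        y_pred = learner.predict(x)  # predict before the true label is revealed
        if y_pred != y:              # check for correctness
            mistakes += 1
        learner.update(x, y)         # the learner may adapt once y is known
    return mistakes


# Assumed toy stream of (instance, label) pairs.
stream = [((0.0,), +1), ((1.0,), -1), ((2.0,), -1), ((3.0,), -1), ((4.0,), +1)]
mistakes = run_online(MajorityLearner(), stream)
```

Any object with `predict` and `update` methods can be dropped into `run_online`, which keeps the protocol separate from the choice of learner.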

Online Learning: Motivation
Examples
1. Email classification (distribution of both spam and regular mail changes over time, but the target function stays fixed - last year's spam still looks like spam).
2. Recommendation systems. Recommending movies, etc.
3. Predicting whether a user will be interested in a new news article or not.
4. Ad placement in a new market.
Slide from Nina Balcan

Perceptron Algorithm
Data: Inputs are continuous vectors of length K. Outputs are discrete. Each example is a pair $(x^{(i)}, y^{(i)})$, where $x \in \mathbb{R}^K$ and $y \in \{+1, -1\}$.
Prediction: Output determined by hyperplane.
$\hat{y} = h_\theta(x) = \mathrm{sign}(\theta^T x)$, where $\mathrm{sign}(a) = 1$ if $a \ge 0$ and $-1$ otherwise.
Learning: Iterative procedure:
• while not converged
  – receive next example $(x^{(i)}, y^{(i)})$
  – predict $y' = h(x^{(i)})$
  – if positive mistake (true label $+1$, predicted $-1$): add $x^{(i)}$ to parameters
  – if negative mistake (true label $-1$, predicted $+1$): subtract $x^{(i)}$ from parameters
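The iterative procedure can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: "converged" is taken to mean a full pass over the data with no mistakes, `max_epochs` is an assumed safety cap, and the toy data (with a constant first feature standing in for the bias) is made up for illustration.

```python
import numpy as np

def sign(a):
    """sign(a) = +1 if a >= 0, otherwise -1, as defined on the slide."""
    return 1 if a >= 0 else -1

def perceptron_train(X, y, max_epochs=100):
    """Cycle through the data until one full pass makes no mistakes (or the cap is hit)."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if sign(theta @ x_i) != y_i:
                # y_i = +1 adds x_i (positive mistake); y_i = -1 subtracts it (negative mistake)
                theta = theta + y_i * x_i
                mistakes += 1
        if mistakes == 0:
            break
    return theta

# Assumed toy data; column 0 is the constant feature for the bias.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 1.0, 3.0],
              [1.0, -1.0, -2.0],
              [1.0, -2.0, -1.0]])
y = np.array([1, 1, -1, -1])

theta = perceptron_train(X, y)
predictions = [sign(theta @ x_i) for x_i in X]
```

Writing the update as `theta + y_i * x_i` folds the two mistake cases into one line; it is equivalent to the add / subtract rule because the labels are +1 and -1.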


Perceptron Algorithm: Example
Example (examples in order, with their labels):
1. (−1, 2), negative: mistake
2. (1, 0), positive: correct
3. (1, 1), positive: mistake
4. (−1, 0), negative: correct
5. (−1, −2), negative: mistake
6. (1, −1), positive: correct
Algorithm:
• Set t = 1, start with the all-zeroes weight vector $\theta_1$.
• Given example $x$, predict positive iff $\theta_t \cdot x \ge 0$.
• On a mistake, update as follows:
  – Mistake on positive, update $\theta_{t+1} \leftarrow \theta_t + x$
  – Mistake on negative, update $\theta_{t+1} \leftarrow \theta_t - x$
Updates:
$\theta_1 = (0, 0)$
$\theta_2 = \theta_1 - (-1, 2) = (1, -2)$
$\theta_3 = \theta_2 + (1, 1) = (2, -1)$
$\theta_4 = \theta_3 - (-1, -2) = (3, 1)$
Slide adapted from Nina Balcan
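The trace on this slide can be checked with a single pass over the six examples. The sketch below assumes the examples are presented in the order they are listed on the slide; the variable names are introduced here for illustration.

```python
import numpy as np

# (x, y) pairs from the slide, in the order assumed above.
examples = [((-1, 2), -1), ((1, 0), +1), ((1, 1), +1),
            ((-1, 0), -1), ((-1, -2), -1), ((1, -1), +1)]

theta = np.zeros(2)                   # all-zeroes starting weight vector
history = [tuple(theta.tolist())]     # weight vectors after each update
for x, y in examples:
    x = np.array(x, dtype=float)
    y_pred = 1 if theta @ x >= 0 else -1   # predict positive iff theta . x >= 0
    if y_pred != y:
        theta = theta + y * x              # add x on a positive mistake, subtract on a negative one
        history.append(tuple(theta.tolist()))
```

After the pass, `history` holds the four weight vectors (0, 0), (1, −2), (2, −1), (3, 1), with mistakes on the first, third, and fifth examples.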

Geometric Margin
Definition: The margin of example $x$ w.r.t. a linear separator $w$ is the distance from $x$ to the plane $w \cdot x = 0$ (or the negative of that distance if $x$ is on the wrong side).
[Figure: margin of positive example $x_1$ and margin of negative example $x_2$, relative to $w$]
Definition: The margin $\gamma_w$ of a set of examples $S$ w.r.t. a linear separator $w$ is the smallest margin over points $x \in S$.
[Figure: positive and negative points on either side of the separator $w$, with the margin $\gamma_w$ marked]
Slide from Nina Balcan
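Both definitions translate directly into code. The following is a small sketch, assuming labels in {+1, −1} so that multiplying by the label makes the distance negative for points on the wrong side; the function names and the example vectors are hypothetical.

```python
import numpy as np

def margin(w, x, y):
    """Signed distance from x to the plane w . x = 0; negative if x is on the wrong side."""
    return y * (w @ x) / np.linalg.norm(w)

def margin_of_set(w, X, y):
    """Smallest margin over all examples in the set."""
    return min(margin(w, x_i, y_i) for x_i, y_i in zip(X, y))

# Assumed example separator and points.
w = np.array([3.0, 4.0])
X = np.array([[3.0, 4.0], [1.0, 0.0], [-2.0, -1.0]])
y = np.array([1, 1, -1])

gamma = margin_of_set(w, X, y)                      # smallest of 5.0, 0.6, 2.0
wrong_side = margin(w, np.array([1.0, 0.0]), -1)    # same point, labeled negative
```

Dividing by the norm of `w` is what makes this a geometric distance: rescaling `w` leaves every margin unchanged.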
