Linear Classifiers (HTF: Ch 4). R Greiner, CMPUT 466/551

Outline
- Framework
- Exact
  - Minimize Mistakes (Perceptron Training)
  - Matrix inversion (LMS)
- Logistic Regression
  - Max Likelihood Estimation (MLE) of P( y | x )
  - Gradient descent (MSE; MLE)
  - Newton-Raphson
- Linear Discriminant Analysis
  - Max Likelihood Estimation (MLE) of P( y, x )
  - Direct Computation
  - Fisher's Linear Discriminant

Diagnosing Butterfly-itis
[figure: example insects to be classified as having / not having butterfly-itis]

Classifier: Decision Boundaries
- A classifier partitions the input space X into "decision regions"
  [figure: decision regions plotted over #wings vs. #antennae]
- A linear threshold unit has a linear decision boundary
- Defn: a set of points that can be separated by a linear decision boundary is "linearly separable"

Linear Separators
- Draw a "separating line"
  [figure: #wings vs. #antennae, with labeled points and a query point "?"]
- If #antennae ≤ 2, then butterfly-itis
- So "?" is NOT butterfly-itis.

Can be "angled"…
[figure: separating line 2.3 × #w – 7.5 × #a + 1.2 = 0]
- If 2.3 × #wings – 7.5 × #antennae + 1.2 > 0, then butterfly-itis

Linear Separators, in General
- Given data (many features):

  Temp (F1) | Press (F2) | … | Color (Fn) | diseaseX?
  35        | 95         | …  | Pale       | No
  22        | 80         | …  | Clear      | Yes
  :         | :          | …  | :          | :
  10        | 50         | …  | Pale       | No

- find "weights" { w1, w2, …, wn, w0 } such that
  w1 × x1 + w2 × x2 + … + wn × xn + w0 > 0 means diseaseX = Yes

Linear Separator
[figure: perceptron unit: inputs x1 … xn, weights w1 … wn, summation Σ wi × xi, threshold, output o]

Linear Separator
[figure: same perceptron unit]
- Performance: given {wi} and the values for an instance, compute the response
- Learning: given labeled data, find the "correct" {wi}
- Linear Threshold Unit … "Perceptron"

Geometric View
- Consider 3 training examples:
  ( [1.0, 1.0]; 1 )
  ( [0.5, 3.0]; 1 )
  ( [2.0, 2.0]; 0 )
- Want a classifier that looks like. . .
  [figure: the three points in the plane with a separating line]

Linear Equation is a Hyperplane
- The equation w · x = Σi wi·xi = 0 defines a plane
- y(x) = 1 if w · x > 0, 0 otherwise

Linear Threshold Unit: "Perceptron"
- Squashing function: the step ("Heaviside") function
  sgn(r) = 1 if r > 0, 0 otherwise
  (some formulations instead use output labels {–1, +1}; only the labels differ)
- Actually want w · x > b, but. . . create an extra input x0 fixed at 1; the corresponding weight w0 corresponds to –b
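The bias trick above can be sketched in a few lines. This is a minimal illustration, assuming the {0, 1} output convention; the helper name `predict` and the example weights are mine:

```python
import numpy as np

def predict(w, x):
    """Heaviside step on the augmented dot product: 1 if w . [1, x] > 0, else 0."""
    x_aug = np.concatenate(([1.0], x))   # x0 fixed at 1; w[0] plays the role of -b
    return 1 if np.dot(w, x_aug) > 0 else 0

# With w0 = -b, the test w . x > b becomes w_aug . [1, x] > 0.
# Example: w encodes the rule "x1 + x2 > 2":
w = np.array([-2.0, 1.0, 1.0])
print(predict(w, np.array([2.0, 2.0])))  # 2 + 2 > 2, so 1
print(predict(w, np.array([0.5, 1.0])))  # 1.5 > 2 is false, so 0
```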

Learning Perceptrons
- Can represent any linearly-separated surface . . . any hyper-plane between two half-spaces…
- Remarkable learning algorithm [Rosenblatt 1960]: if a function f can be represented by a perceptron, then ∃ a learning algorithm guaranteed to quickly converge to f!
- Enormous popularity, early / mid 60's
- But some simple functions (e.g., XOR) cannot be represented … killed the field temporarily!

Perceptron Learning
- Hypothesis space is. . .
  - Fixed size: ∃ O(2^(n²)) distinct perceptrons over n boolean features
  - Deterministic
  - Continuous parameters
- Learning algorithm:
  - Various: local search, direct computation, . . .
  - Eager
  - Online / batch

Task
- Input: labeled data { [x(i), y(i)] }, transformed to augmented form [1, x(i)]
- Output: w ∈ ℝ^(r+1)
- Goal: want w s.t. ∀i sgn( w · [1, x(i)] ) = y(i)
- . . . minimize mistakes wrt data . . .

Error Function
Given data { [x(i), y(i)] } i=1..m, optimize. . .
1. Classification error:  err_Class(w) = (1/m) Σi I[ y(i) ≠ o_w(x(i)) ]
   (Perceptron Training; Matrix Inversion)
2. Mean-squared error (LMS):  err_MSE(w) = (1/2m) Σi [ y(i) – o_w(x(i)) ]²
   (Matrix Inversion; Gradient Descent)
3. (Log) Conditional Probability (LR):  LCL(w) = (1/m) Σi log P_w( y(i) | x(i) )
   (MSE Gradient Descent; LCL Gradient Descent)
4. (Log) Joint Probability (LDA; FDA):  LL(w) = (1/m) Σi log P_w( y(i), x(i) )
   (Direct Computation)
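The first three objectives can be written out directly. A sketch; the function names are mine, and the logistic model used for P_w(y | x) anticipates the Logistic Regression section:

```python
import numpy as np

def classification_error(w, X, y):
    # err_Class(w) = (1/m) sum_i I[y(i) != o_w(x(i))], with o_w(x) = step(w . x)
    o = (X @ w > 0).astype(float)
    return np.mean(o != y)

def mse(w, X, y):
    # err_MSE(w) = (1/2m) sum_i (y(i) - w . x(i))^2, on the unthresholded output (LMS)
    return 0.5 * np.mean((y - X @ w) ** 2)

def log_conditional_likelihood(w, X, y):
    # LCL(w) = (1/m) sum_i log P_w(y(i) | x(i)), with P_w(y=1 | x) = sigmoid(w . x)
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Tiny example: rows of X are augmented instances [1, x]; labels in {0, 1}
X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, 0.0])
w = np.array([0.0, 1.0])
print(classification_error(w, X, y))   # this w classifies both rows correctly: 0.0
```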

#1: Optimal Classification Error
- For each labeled instance [x, y]:  Err = y – o_w(x)
  y = f(x) is the target value; o_w(x) = sgn(w · x) is the perceptron output
- Idea: move the weights in the appropriate direction, to push Err → 0
- If Err > 0 (error on a POSITIVE example): need to increase sgn(w · x), so need to increase w · x
- If Err < 0 (error on a NEGATIVE example): need to decrease w · x
- Input j contributes wj · xj to w · x:
  - if xj > 0, increasing wj will increase w · x
  - if xj < 0, decreasing wj will increase w · x
  In both cases, wj ← wj + xj increases w · x
- So on a positive mistake: wj ← wj + xj; on a negative mistake: wj ← wj – xj

Local Search via Gradient Descent
[figure: error surface over weight space, with descent steps toward the minimum]

#1a: Mistake Bound Perceptron Algorithm
  Initialize w = 0
  Do until bored:
    Predict "+" iff w · x > 0, else "–"
    Mistake on y = +1: w ← w + x
    Mistake on y = –1: w ← w – x

Trace:
  Weights    Instance  Action
  [0 0 0]    #1        +x
  [1 0 0]    #2        –x
  [0 –1 0]   #3        +x
  [1 0 1]    #1        OK
  [1 0 1]    #2        –x
  [0 –1 1]   #3        +x
  [1 0 2]    #1        OK
  [1 0 2]    #2        –x
  [0 –1 2]   #3        OK
  [0 –1 2]   #1        +x
  [1 –1 2]   #2        OK
  [1 –1 2]   #3        OK
  [1 –1 2]   #1        OK
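The algorithm above is a few lines of code. This sketch uses three augmented examples, x(1) = [1, 0, 0] with y = +1, x(2) = [1, 1, 0] with y = –1, x(3) = [1, 1, 1] with y = +1, which I inferred from the weight trace (the slide itself does not list them):

```python
import numpy as np

def perceptron(examples, max_passes=10):
    """Mistake-driven perceptron: w += x on a false negative, w -= x on a false positive."""
    w = np.zeros(len(examples[0][0]))
    for _ in range(max_passes):
        mistakes = 0
        for x, y in examples:
            pred = 1 if w @ x > 0 else -1   # predict "+" iff w . x > 0
            if pred != y:
                w += y * x                  # +x for y = +1, -x for y = -1
                mistakes += 1
        if mistakes == 0:                   # a full pass with no mistakes: done
            break
    return w

# Augmented examples reconstructed from the trace (an inference, not from the slide):
examples = [(np.array([1.0, 0.0, 0.0]), +1),
            (np.array([1.0, 1.0, 0.0]), -1),
            (np.array([1.0, 1.0, 1.0]), +1)]
print(perceptron(examples))   # converges to the trace's final weights [ 1. -1.  2.]
```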

… Mistake Bound Theorem
- Theorem [Rosenblatt 1960]: if the data is consistent with some linear threshold w, then the number of mistakes is ≤ (1/Δ)², where Δ measures the "wiggle room" available:
  if |x| = 1, then Δ is the max, over all consistent planes, of the minimum distance of an example to that plane
- w is ⊥ to the separator, since w · x = 0 at the boundary
- So |w · x| is the projection of x onto the direction perpendicular to the boundary … i.e., the distance from x to that boundary (once normalized)

Proof of Convergence
- x is wrong wrt w iff w · x < 0
  (fold the label into the example: replace each negative example x by –x, so every example should satisfy w · x > 0)
- Let w* be the unit vector representing the target plane; Δ = min_x { w* · x }
- Let w be the hypothesis plane; on each mistake, add x to w, so
  w = Σ_{x : x·w < 0} x

Proof (con't)
With Δ = min_x { w* · x } and w the sum of the M mistake examples:
- w* · w grows by at least Δ on each mistake, so after M mistakes, w* · w ≥ M Δ
- |w|² grows by at most 1 per mistake: |w + x|² = |w|² + 2 w · x + |x|² ≤ |w|² + 1, since |x| = 1 and w · x < 0 when x is a mistake. So |w|² ≤ M
- Since w* is a unit vector, w* · w ≤ |w| ≤ √M
- Combining: M Δ ≤ √M, so M ≤ (1/Δ)²  ∎

#1b: Perceptron Training Rule
For each labeled instance [x, y]:  Err([x, y]) = y – o_w(x) ∈ { –1, 0, +1 }
- If Err([x, y]) = 0: correct! … do nothing!  Δw = 0 ≡ Err([x, y]) · x
- If Err([x, y]) = +1: mistake on a positive! Increment by +x:  Δw = +x ≡ Err([x, y]) · x
- If Err([x, y]) = –1: mistake on a negative! Increment by –x:  Δw = –x ≡ Err([x, y]) · x
In all cases:  Δw(i) = Err([x(i), y(i)]) · x(i) = [y(i) – o_w(x(i))] · x(i)
Batch mode: do ALL updates at once!
  Δwj = Σi Δwj(i) = Σi x(i)j ( y(i) – o_w(x(i)) );  wj += η Δwj

0. Fix w; Δw := 0
1. For each row i, compute:
   a. E(i) := y(i) – o_w( x(i) )
   b. Δw += E(i) x(i)   [ … Δwj += E(i) x(i)j, for each feature j … ]
2. Increment w += η Δw

Correctness
- The rule is intuitive: climbs in the correct direction. . .
- Thrm: converges to the correct answer, if. . .
  - the training data is linearly separable
  - η is sufficiently small
- Proof: weight space has EXACTLY 1 minimum! (no non-global minima)
  So with enough examples, it finds the correct function!
- Explains early popularity
- If η is too large, it may overshoot; if η is too small, it takes too long
  So often η = η(k) … which decays with the number of iterations, k
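A decaying step size is one line; this particular schedule is a conventional choice, not one specified on the slide:

```python
def eta(k, eta0=1.0):
    """Step size decaying with the iteration count k: eta(k) = eta0 / (1 + k)."""
    return eta0 / (1.0 + k)

# Large early steps, small late steps:
print(eta(0), eta(3), eta(99))   # 1.0 0.25 0.01
```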

#1c: Matrix Version?
- Task: given { x(i), y(i) }i, where y(i) ∈ { –1, +1 } is the label, find {wi} s.t. the linear equalities y = X w hold
- Solution: w = X⁻¹ y

Issues
1. Why restrict to only y(i) ∈ { –1, +1 }?
   - If from a discrete set y(i) ∈ { 0, 1, …, m }: general (non-binary) classification
   - If ARBITRARY y(i) ∈ ℝ: regression
2. What if NO w works? … X is singular; the system is overconstrained …
   Could try to minimize the residual:
   - Σi I[ y(i) ≠ w · x(i) ]  …  NP-hard!
   - || y – X w ||₁ = Σi | y(i) – w · x(i) |
   - || y – X w ||₂² = Σi ( y(i) – w · x(i) )²  …  easy!
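The easy ‖y – Xw‖₂ case has a closed-form fix even when X cannot be inverted. A sketch with made-up data (four augmented examples, so the system is overconstrained):

```python
import numpy as np

# Four labeled examples (invented for illustration), each augmented with a leading 1.
# X is 4x3, so y = Xw has no exact solution and X has no inverse.
X = np.array([[1.0, 1.0, 1.0],
              [1.0, 0.5, 3.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Minimize ||y - Xw||_2 instead of computing X^{-1} y:
w, residual, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(np.sign(X @ w))   # on this data, the least-squares w also separates: [ 1.  1. -1. -1.]
```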

L2 Error vs. 0/1-Loss
- The "0/1 loss function" is not smooth or differentiable
- The MSE error is smooth and differentiable … and upper-bounds the 0/1 loss. . .
[figure: 0/1 loss vs. squared error, as functions of w · x]

Gradient Descent for Perceptron?
- Why not gradient descent for the THRESHOLDed perceptron? It needs a gradient (derivative), but the step function has zero derivative everywhere except at the threshold, where it is not differentiable
- Gradient descent is a GENERAL approach. Requires:
  + a continuously parameterized hypothesis
  + an error that is differentiable wrt the parameters
  But. . .
  – can be slow (many iterations)
  – may only find a LOCAL optimum
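For the UNthresholded linear output, the MSE gradient does exist, which gives the LMS-style descent the outline mentions. A sketch; the toy data, step size, and iteration count are my own choices:

```python
import numpy as np

def lms_gradient_descent(X, y, eta=0.1, steps=2000):
    """Gradient descent on err_MSE(w) = (1/2m) sum_i (y(i) - w . x(i))^2,
       using the unthresholded output w . x so the gradient exists."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(steps):
        grad = -(X.T @ (y - X @ w)) / m   # d err_MSE / d w = -(1/m) X^T (y - Xw)
        w -= eta * grad
    return w

# Toy data (invented): four augmented examples with +/-1 targets.
X = np.array([[1.0, 1.0, 1.0],
              [1.0, 0.5, 3.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = lms_gradient_descent(X, y)   # approaches the least-squares minimizer of ||y - Xw||_2
```

For this convex quadratic error, a small enough η makes the iterates converge to the unique minimum, matching the closed-form least-squares solution.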

Linear Separators – Facts
- GOOD NEWS: if the data is linearly separable, then a FAST ALGORITHM finds the correct {wi}!
- BAD NEWS: some "data sets" are NOT linearly separable!
- Stay tuned!
