Linear Classifiers (HTF: Ch 4). R Greiner, CMPUT 466/551

Outline
- Framework
- Exact
  - Minimize Mistakes (Perceptron Training)
  - Matrix inversion (LMS)
- Logistic Regression
  - Max Likelihood Estimation (MLE) of P( y | x )
  - Gradient descent (MSE; MLE)
  - Newton-Raphson
- Linear Discriminant Analysis
  - Max Likelihood Estimation (MLE) of P( y, x )
  - Direct Computation
  - Fisher's Linear Discriminant

Diagnosing Butterfly-itis
[figure: example insects to be classified as having / not having butterfly-itis]

Classifier: Decision Boundaries
- A classifier partitions the input space X into "decision regions"
  [figure: decision regions plotted over #wings vs. #antennae]
- A linear threshold unit has a linear decision boundary
- Defn: a set of points that can be separated by a linear decision boundary is "linearly separable"

Linear Separators
- Draw a "separating line"
  [figure: #wings vs. #antennae, with labeled points and a query point "?"]
- If #antennae ≤ 2, then butterfly-itis
- So "?" is NOT butterfly-itis.

Can be "angled"…
[figure: separating line 2.3 × #w – 7.5 × #a + 1.2 = 0]
- If 2.3 × #wings – 7.5 × #antennae + 1.2 > 0, then butterfly-itis

Linear Separators, in General
- Given data (many features):

  Temp (F1) | Press (F2) | … | Color (Fn) | diseaseX?
  35        | 95         | …  | Pale       | No
  22        | 80         | …  | Clear      | Yes
  :         | :          | …  | :          | :
  10        | 50         | …  | Pale       | No

- find "weights" { w1, w2, …, wn, w0 } such that
  w1 × x1 + w2 × x2 + … + wn × xn + w0 > 0 means diseaseX = Yes

Linear Separator
[figure: perceptron unit: inputs x1 … xn, weights w1 … wn, summation Σ wi × xi, threshold, output o]

Linear Separator
[figure: same perceptron unit]
- Performance: given {wi} and the values for an instance, compute the response
- Learning: given labeled data, find the "correct" {wi}
- Linear Threshold Unit … "Perceptron"

Geometric View
- Consider 3 training examples:
  ( [1.0, 1.0]; 1 )
  ( [0.5, 3.0]; 1 )
  ( [2.0, 2.0]; 0 )
- Want a classifier that looks like. . .
  [figure: the three points in the plane with a separating line]

Linear Equation is a Hyperplane
- The equation w · x = Σi wi·xi = 0 defines a plane
- y(x) = 1 if w · x > 0, 0 otherwise

Linear Threshold Unit: "Perceptron"
- Squashing function: the step ("Heaviside") function
  sgn(r) = 1 if r > 0, 0 otherwise
  (some formulations instead use output labels {–1, +1}; only the labels differ)
- Actually want w · x > b, but. . . create an extra input x0 fixed at 1; the corresponding weight w0 corresponds to –b
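The bias trick above can be sketched in a few lines. This is a minimal illustration, assuming the {0, 1} output convention; the helper name `predict` and the example weights are mine:

```python
import numpy as np

def predict(w, x):
    """Heaviside step on the augmented dot product: 1 if w . [1, x] > 0, else 0."""
    x_aug = np.concatenate(([1.0], x))   # x0 fixed at 1; w[0] plays the role of -b
    return 1 if np.dot(w, x_aug) > 0 else 0

# With w0 = -b, the test w . x > b becomes w_aug . [1, x] > 0.
# Example: w encodes the rule "x1 + x2 > 2":
w = np.array([-2.0, 1.0, 1.0])
print(predict(w, np.array([2.0, 2.0])))  # 2 + 2 > 2, so 1
print(predict(w, np.array([0.5, 1.0])))  # 1.5 > 2 is false, so 0
```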

Learning Perceptrons
- Can represent any linearly-separated surface . . . any hyper-plane between two half-spaces…
- Remarkable learning algorithm [Rosenblatt 1960]: if a function f can be represented by a perceptron, then ∃ a learning algorithm guaranteed to quickly converge to f!
- Enormous popularity, early / mid 60's
- But some simple functions (e.g., XOR) cannot be represented … killed the field temporarily!

Perceptron Learning
- Hypothesis space is. . .
  - Fixed size: ∃ O(2^(n²)) distinct perceptrons over n boolean features
  - Deterministic
  - Continuous parameters
- Learning algorithm:
  - Various: local search, direct computation, . . .
  - Eager
  - Online / batch

Task
- Input: labeled data { [x(i), y(i)] }, transformed to augmented form [1, x(i)]
- Output: w ∈ ℝ^(r+1)
- Goal: want w s.t. ∀i sgn( w · [1, x(i)] ) = y(i)
- . . . minimize mistakes wrt data . . .

Error Function
Given data { [x(i), y(i)] } i=1..m, optimize. . .
1. Classification error:  err_Class(w) = (1/m) Σi I[ y(i) ≠ o_w(x(i)) ]
   (Perceptron Training; Matrix Inversion)
2. Mean-squared error (LMS):  err_MSE(w) = (1/2m) Σi [ y(i) – o_w(x(i)) ]²
   (Matrix Inversion; Gradient Descent)
3. (Log) Conditional Probability (LR):  LCL(w) = (1/m) Σi log P_w( y(i) | x(i) )
   (MSE Gradient Descent; LCL Gradient Descent)
4. (Log) Joint Probability (LDA; FDA):  LL(w) = (1/m) Σi log P_w( y(i), x(i) )
   (Direct Computation)
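The first three objectives can be written out directly. A sketch; the function names are mine, and the logistic model used for P_w(y | x) anticipates the Logistic Regression section:

```python
import numpy as np

def classification_error(w, X, y):
    # err_Class(w) = (1/m) sum_i I[y(i) != o_w(x(i))], with o_w(x) = step(w . x)
    o = (X @ w > 0).astype(float)
    return np.mean(o != y)

def mse(w, X, y):
    # err_MSE(w) = (1/2m) sum_i (y(i) - w . x(i))^2, on the unthresholded output (LMS)
    return 0.5 * np.mean((y - X @ w) ** 2)

def log_conditional_likelihood(w, X, y):
    # LCL(w) = (1/m) sum_i log P_w(y(i) | x(i)), with P_w(y=1 | x) = sigmoid(w . x)
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Tiny example: rows of X are augmented instances [1, x]; labels in {0, 1}
X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, 0.0])
w = np.array([0.0, 1.0])
print(classification_error(w, X, y))   # this w classifies both rows correctly: 0.0
```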

#1: Optimal Classification Error
- For each labeled instance [x, y]:  Err = y – o_w(x)
  y = f(x) is the target value; o_w(x) = sgn(w · x) is the perceptron output
- Idea: move the weights in the appropriate direction, to push Err → 0
- If Err > 0 (error on a POSITIVE example): need to increase sgn(w · x), so need to increase w · x
- If Err < 0 (error on a NEGATIVE example): need to decrease w · x
- Input j contributes wj · xj to w · x:
  - if xj > 0, increasing wj will increase w · x
  - if xj < 0, decreasing wj will increase w · x
  In both cases, wj ← wj + xj increases w · x
- So on a positive mistake: wj ← wj + xj; on a negative mistake: wj ← wj – xj

Local Search via Gradient Descent
[figure: error surface over weight space, with descent steps toward the minimum]

#1a: Mistake Bound Perceptron Algorithm
  Initialize w = 0
  Do until bored:
    Predict "+" iff w · x > 0, else "–"
    Mistake on y = +1: w ← w + x
    Mistake on y = –1: w ← w – x

Trace:
  Weights    Instance  Action
  [0 0 0]    #1        +x
  [1 0 0]    #2        –x
  [0 –1 0]   #3        +x
  [1 0 1]    #1        OK
  [1 0 1]    #2        –x
  [0 –1 1]   #3        +x
  [1 0 2]    #1        OK
  [1 0 2]    #2        –x
  [0 –1 2]   #3        OK
  [0 –1 2]   #1        +x
  [1 –1 2]   #2        OK
  [1 –1 2]   #3        OK
  [1 –1 2]   #1        OK
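The algorithm above is a few lines of code. This sketch uses three augmented examples, x(1) = [1, 0, 0] with y = +1, x(2) = [1, 1, 0] with y = –1, x(3) = [1, 1, 1] with y = +1, which I inferred from the weight trace (the slide itself does not list them):

```python
import numpy as np

def perceptron(examples, max_passes=10):
    """Mistake-driven perceptron: w += x on a false negative, w -= x on a false positive."""
    w = np.zeros(len(examples[0][0]))
    for _ in range(max_passes):
        mistakes = 0
        for x, y in examples:
            pred = 1 if w @ x > 0 else -1   # predict "+" iff w . x > 0
            if pred != y:
                w += y * x                  # +x for y = +1, -x for y = -1
                mistakes += 1
        if mistakes == 0:                   # a full pass with no mistakes: done
            break
    return w

# Augmented examples reconstructed from the trace (an inference, not from the slide):
examples = [(np.array([1.0, 0.0, 0.0]), +1),
            (np.array([1.0, 1.0, 0.0]), -1),
            (np.array([1.0, 1.0, 1.0]), +1)]
print(perceptron(examples))   # converges to the trace's final weights [ 1. -1.  2.]
```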

… Mistake Bound Theorem
- Theorem [Rosenblatt 1960]: if the data is consistent with some linear threshold w, then the number of mistakes is ≤ (1/Δ)², where Δ measures the "wiggle room" available:
  if |x| = 1, then Δ is the max, over all consistent planes, of the minimum distance of an example to that plane
- w is ⊥ to the separator, since w · x = 0 at the boundary
- So |w · x| is the projection of x onto the direction perpendicular to the boundary … i.e., the distance from x to that boundary (once normalized)

Proof of Convergence
- x is wrong wrt w iff w · x < 0
  (fold the label into the example: replace each negative example x by –x, so every example should satisfy w · x > 0)
- Let w* be the unit vector representing the target plane; Δ = min_x { w* · x }
- Let w be the hypothesis plane; on each mistake, add x to w, so
  w = Σ_{x : x·w < 0} x

Proof (con't)
With Δ = min_x { w* · x } and w the sum of the M mistake examples:
- w* · w grows by at least Δ on each mistake, so after M mistakes, w* · w ≥ M Δ
- |w|² grows by at most 1 per mistake: |w + x|² = |w|² + 2 w · x + |x|² ≤ |w|² + 1, since |x| = 1 and w · x < 0 when x is a mistake. So |w|² ≤ M
- Since w* is a unit vector, w* · w ≤ |w| ≤ √M
- Combining: M Δ ≤ √M, so M ≤ (1/Δ)²  ∎

#1b: Perceptron Training Rule
For each labeled instance [x, y]:  Err([x, y]) = y – o_w(x) ∈ { –1, 0, +1 }
- If Err([x, y]) = 0: correct! … do nothing!  Δw = 0 ≡ Err([x, y]) · x
- If Err([x, y]) = +1: mistake on a positive! Increment by +x:  Δw = +x ≡ Err([x, y]) · x
- If Err([x, y]) = –1: mistake on a negative! Increment by –x:  Δw = –x ≡ Err([x, y]) · x
In all cases:  Δw(i) = Err([x(i), y(i)]) · x(i) = [y(i) – o_w(x(i))] · x(i)
Batch mode: do ALL updates at once!
  Δwj = Σi Δwj(i) = Σi x(i)j ( y(i) – o_w(x(i)) );  wj += η Δwj

0. Fix w; Δw := 0
1. For each row i, compute:
   a. E(i) := y(i) – o_w( x(i) )
   b. Δw += E(i) x(i)   [ … Δwj += E(i) x(i)j, for each feature j … ]
2. Increment w += η Δw

Correctness
- The rule is intuitive: climbs in the correct direction. . .
- Thrm: converges to the correct answer, if. . .
  - the training data is linearly separable
  - η is sufficiently small
- Proof: weight space has EXACTLY 1 minimum! (no non-global minima)
  So with enough examples, it finds the correct function!
- Explains early popularity
- If η is too large, it may overshoot; if η is too small, it takes too long
  So often η = η(k) … which decays with the number of iterations, k
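A decaying step size is one line; this particular schedule is a conventional choice, not one specified on the slide:

```python
def eta(k, eta0=1.0):
    """Step size decaying with the iteration count k: eta(k) = eta0 / (1 + k)."""
    return eta0 / (1.0 + k)

# Large early steps, small late steps:
print(eta(0), eta(3), eta(99))   # 1.0 0.25 0.01
```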

#1c: Matrix Version?
- Task: given { x(i), y(i) }i, where y(i) ∈ { –1, +1 } is the label, find {wi} s.t. the linear equalities y = X w hold
- Solution: w = X⁻¹ y

Issues
1. Why restrict to only y(i) ∈ { –1, +1 }?
   - If from a discrete set y(i) ∈ { 0, 1, …, m }: general (non-binary) classification
   - If ARBITRARY y(i) ∈ ℝ: regression
2. What if NO w works? … X is singular; the system is overconstrained …
   Could try to minimize the residual:
   - Σi I[ y(i) ≠ w · x(i) ]  …  NP-hard!
   - || y – X w ||₁ = Σi | y(i) – w · x(i) |
   - || y – X w ||₂² = Σi ( y(i) – w · x(i) )²  …  easy!
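The easy ‖y – Xw‖₂ case has a closed-form fix even when X cannot be inverted. A sketch with made-up data (four augmented examples, so the system is overconstrained):

```python
import numpy as np

# Four labeled examples (invented for illustration), each augmented with a leading 1.
# X is 4x3, so y = Xw has no exact solution and X has no inverse.
X = np.array([[1.0, 1.0, 1.0],
              [1.0, 0.5, 3.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Minimize ||y - Xw||_2 instead of computing X^{-1} y:
w, residual, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(np.sign(X @ w))   # on this data, the least-squares w also separates: [ 1.  1. -1. -1.]
```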

L2 Error vs. 0/1-Loss
- The "0/1 loss function" is not smooth or differentiable
- The MSE error is smooth and differentiable … and upper-bounds the 0/1 loss. . .
[figure: 0/1 loss vs. squared error, as functions of w · x]

Gradient Descent for Perceptron?
- Why not gradient descent for the THRESHOLDed perceptron? It needs a gradient (derivative), but the step function has zero derivative everywhere except at the threshold, where it is not differentiable
- Gradient descent is a GENERAL approach. Requires:
  + a continuously parameterized hypothesis
  + an error that is differentiable wrt the parameters
  But. . .
  – can be slow (many iterations)
  – may only find a LOCAL optimum
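For the UNthresholded linear output, the MSE gradient does exist, which gives the LMS-style descent the outline mentions. A sketch; the toy data, step size, and iteration count are my own choices:

```python
import numpy as np

def lms_gradient_descent(X, y, eta=0.1, steps=2000):
    """Gradient descent on err_MSE(w) = (1/2m) sum_i (y(i) - w . x(i))^2,
       using the unthresholded output w . x so the gradient exists."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(steps):
        grad = -(X.T @ (y - X @ w)) / m   # d err_MSE / d w = -(1/m) X^T (y - Xw)
        w -= eta * grad
    return w

# Toy data (invented): four augmented examples with +/-1 targets.
X = np.array([[1.0, 1.0, 1.0],
              [1.0, 0.5, 3.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = lms_gradient_descent(X, y)   # approaches the least-squares minimizer of ||y - Xw||_2
```

For this convex quadratic error, a small enough η makes the iterates converge to the unique minimum, matching the closed-form least-squares solution.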

Linear Separators – Facts
- GOOD NEWS: if the data is linearly separable, then a FAST ALGORITHM finds the correct {wi}!
- BAD NEWS: some "data sets" are NOT linearly separable!
- Stay tuned!
