SLIDE 1

CS 6316 Machine Learning

Linear Predictors

Yangfeng Ji

Department of Computer Science University of Virginia

SLIDE 2

Overview

  • 1. Review: Linear Functions
  • 2. Perceptron
  • 3. Logistic Regression
  • 4. Linear Regression

SLIDE 3

Review: Linear Functions


SLIDE 5

Linear Predictors

Linear predictors discussed in this course:

◮ halfspace predictors
◮ logistic regression classifiers
◮ linear SVMs (lecture on support vector machines)
◮ naive Bayes classifiers (lecture on generative models)
◮ linear regression predictors

A common core form of these linear predictors:

h_{w,b}(x) = ⟨w, x⟩ + b = Σ_{i=1}^d w_i x_i + b    (1)

where w is the weight vector and b is the bias.

SLIDE 6

Alternative Form

Given the original definition of a linear function

h_{w,b}(x) = ⟨w, x⟩ + b = Σ_{i=1}^d w_i x_i + b,    (2)

we can redefine it in a more compact form by absorbing the bias:

w ← (w_1, w_2, . . . , w_d, b)^T
x ← (x_1, x_2, . . . , x_d, 1)^T

and then

h_{w,b}(x) = ⟨w, x⟩    (3)
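The equivalence of the two forms can be checked numerically. A minimal NumPy sketch (the weights, bias, and input below are illustrative, not from the slides):

```python
import numpy as np

# Absorbing the bias b into the weight vector: append b to w and a
# constant 1 to x, so h_{w,b}(x) = <w, x> + b becomes one inner product.
w = np.array([1.0, 1.0])
b = -0.5
x = np.array([0.3, 0.4])

h_original = w @ x + b

w_aug = np.append(w, b)      # (w1, ..., wd, b)
x_aug = np.append(x, 1.0)    # (x1, ..., xd, 1)
h_compact = w_aug @ x_aug

print(h_original, h_compact)  # both give the same value
```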

SLIDE 7

Linear Functions

Consider a two-dimensional case with w = (1, 1, −0.5):

f(x) = w^T x = x_1 + x_2 − 0.5    (4)

Different values of f(x) map to different areas of this 2-D space. For example, the following equation defines the blue line L:

f(x) = w^T x = 0    (5)

SLIDE 8

Properties of Linear Functions (II)

For any two points x and x′ lying on the line,

f(x) − f(x′) = w^T x − w^T x′ = 0    (6)

[Friedman et al., 2001, Section 4.5]

SLIDE 9

Properties of Linear Functions (III)

Furthermore,

f(x) = x_1 + x_2 − 0.5 = 0    (7)

separates the 2-D space R² into two half-spaces: f(x) > 0 and f(x) < 0.


SLIDE 11

Properties of Linear Functions (IV)

From the perspective of linear projection, f(x) = 0 defines the vectors in this 2-D space whose projections onto the direction (1, 1) all have the same magnitude:

x_1 + x_2 − 0.5 = 0  ⇒  (x_1, x_2) · (1, 1)^T = 0.5    (8)

This idea can be generalized to compute the distance between a point and a line. [Friedman et al., 2001, Section 4.5]

SLIDE 12

Properties of Linear Functions (IV)

The distance of a point x to the line L : f(x) = ⟨w, x⟩ = 0 is given by

f(x) / ‖w‖_2 = ⟨w, x⟩ / ‖w‖_2 = ⟨w / ‖w‖_2, x⟩    (9)

[Friedman et al., 2001, Section 4.5]
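Eq. 9 is easy to verify numerically; a small sketch with made-up values for w and x:

```python
import numpy as np

# Signed distance of a point x to the line <w, x> = 0 (Eq. 9):
# divide <w, x> by the Euclidean norm of w.
w = np.array([1.0, 1.0])
x = np.array([2.0, 1.0])

dist = (w @ x) / np.linalg.norm(w)
print(round(dist, 6))  # 3 / sqrt(2) ≈ 2.12132
```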

SLIDE 13

Perceptron

SLIDE 14

Halfspace Hypothesis Class

◮ X = R^d
◮ Y = {−1, +1}
◮ Halfspace hypothesis class

H_half = {sign(⟨w, x⟩) : w ∈ R^d}    (10)

which is an infinite hypothesis class. The sign function y = sign(x) returns +1 for positive inputs and −1 for negative inputs.

SLIDE 15

Linearly Separable Cases

The algorithm can find a hyperplane that separates all positive examples from negative examples.

The definition of linearly separable cases is with respect to the training set S instead of the distribution D.


SLIDE 17

Prediction Rule

The prediction rule of a half-space predictor is based on the sign of ⟨w, x⟩:

h(x) = sign(⟨w, x⟩) = +1 if ⟨w, x⟩ > 0; −1 if ⟨w, x⟩ < 0    (11)

Or, equivalently,

h(x) = y′ if y′ ∈ {−1, +1} and y′⟨w, x⟩ > 0    (12)


SLIDE 21

Perceptron Algorithm

The perceptron algorithm is defined as

1: Input: S = {(x_1, y_1), . . . , (x_m, y_m)}
2: Initialize w(0) = (0, . . . , 0)
3: for t = 1, 2, · · · , T do
4:   i ← t mod m
5:   if y_i⟨w(t), x_i⟩ ≤ 0 then
6:     w(t+1) ← w(t) + y_i x_i  // updating rule
7:   end if
8: end for
9: Output: w(T)

Exercise: Implementing this algorithm with a simple example
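The pseudocode above translates directly into code; a sketch for the exercise on a made-up linearly separable toy set (with the constant-1 feature from the alternative form appended):

```python
import numpy as np

# Perceptron algorithm from the slide, cycling through the examples
# and updating only on mistakes (or zero margin).
def perceptron(X, y, T=1000):
    m, d = X.shape
    w = np.zeros(d)                      # line 2: w(0) = (0, ..., 0)
    for t in range(T):                   # line 3
        i = t % m                        # line 4: cycle through examples
        if y[i] * (w @ X[i]) <= 0:       # line 5: y_i <w, x_i> <= 0
            w = w + y[i] * X[i]          # line 6: updating rule
    return w

# Toy data (illustrative): constant-1 feature appended, separable by x1
X = np.array([[0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
y = np.array([-1, 1, 1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))  # matches y on this separable set
```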


SLIDE 23

Two Questions

The updating rule can be broken down into two cases:

w(t+1) ← w(t) + y_i x_i    (13)

◮ For y_i = +1, w(t+1) ← w(t) + x_i
◮ For y_i = −1, w(t+1) ← w(t) − x_i

Two questions:

◮ How can the updating rule help?
◮ How many updating steps does the algorithm need?

SLIDE 24

The Updating Rule

At time step t, given the training example (x_i, y_i) and the current weight w(t):

y_i⟨w(t+1), x_i⟩ = y_i⟨w(t) + y_i x_i, x_i⟩    (14)
                = y_i⟨w(t), x_i⟩ + ‖x_i‖²    (15)

◮ w(t+1) gives a higher value of y_i⟨w(t+1), x_i⟩ on predicting x_i than w(t)
◮ the update is affected by the norm of x_i, ‖x_i‖²
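Eq. 15 says a single update increases y_i⟨w, x_i⟩ by exactly ‖x_i‖²; a quick numerical check with illustrative values:

```python
import numpy as np

# One perceptron update increases y_i <w, x_i> by exactly ||x_i||^2 (Eq. 15).
w = np.array([0.5, -1.0])
x_i = np.array([2.0, 1.0])
y_i = -1

before = y_i * (w @ x_i)
w_new = w + y_i * x_i          # updating rule
after = y_i * (w_new @ x_i)

print(after - before)          # equals ||x_i||^2 = 5.0
```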

SLIDE 25

Theorem

Assume that {(x_i, y_i)}_{i=1}^m is separable. Let

◮ B = min{‖w‖ : ∀i ∈ [m], y_i⟨w, x_i⟩ ≥ 1}, and
◮ R = max_i ‖x_i‖.

Then, the Perceptron algorithm stops after at most (RB)² iterations, and when it stops it holds that

∀i ∈ [m], y_i⟨w(t), x_i⟩ > 0    (16)

◮ A realizable case with an infinite hypothesis space
◮ Finishes training in a finite number of steps

SLIDE 26

Example

[Bishop, 2006, Page 195]


SLIDE 30

The XOR Example: a Non-separable Case

◮ X_1, X_2 ∈ {0, 1}
◮ the XOR operation is defined as Y = X_1 ⊕ X_2, where Y = 1 if X_1 ≠ X_2 and Y = 0 otherwise
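The non-separability claim can be checked by brute force; a sketch (my own construction, not from the slides) that searches a coarse grid of weight vectors and finds no linear separator for XOR:

```python
import itertools
import numpy as np

# XOR points with a constant-1 feature; labels mapped to {-1, +1}.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])  # Y = X1 xor X2

# Search (w1, w2, b) on a grid: no choice classifies all four correctly.
grid = np.linspace(-3, 3, 25)
found = any(
    np.all(y * (X @ np.array([w1, w2, b])) > 0)
    for w1, w2, b in itertools.product(grid, grid, grid)
)
print(found)  # False: no linear separator exists
```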

SLIDE 31

The XOR Example: Further Comment

SLIDE 32

Logistic Regression

SLIDE 33

Hypothesis Class

◮ The hypothesis class of logistic regression is defined as

H_LR = {σ(⟨w, x⟩) : w ∈ R^d}    (17)

◮ The sigmoid function σ(a) with a ∈ R:

σ(a) = 1 / (1 + exp(−a))    (18)


SLIDE 35

Unified Form for Logistic Predictors

◮ A unified form for y ∈ {−1, +1}:

h(x, y) = 1 / (1 + exp(−y⟨w, x⟩))    (19)

which is similar to the half-space predictors

◮ Prediction
  • 1. Compute the values from Eq. 19 with y ∈ {−1, +1}
  • 2. Pick the y that has the bigger value:

ŷ = +1 if h(x, +1) > h(x, −1); −1 if h(x, +1) < h(x, −1)    (20)
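Eqs. 19 and 20 in code; a sketch with hypothetical weights and input (note that h(x, +1) + h(x, −1) = 1, which the next slide derives):

```python
import numpy as np

# Unified logistic predictor h(x, y) (Eq. 19) and prediction rule (Eq. 20).
def h(w, x, y):
    return 1.0 / (1.0 + np.exp(-y * (w @ x)))

def predict(w, x):
    return 1 if h(w, x, +1) > h(w, x, -1) else -1

w = np.array([1.0, 1.0, -0.5])    # illustrative weights, bias absorbed
x = np.array([0.8, 0.1, 1.0])
print(h(w, x, +1) + h(w, x, -1))  # 1.0: the two scores are complementary
print(predict(w, x))              # <w, x> = 0.4 > 0, so predicts +1
```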

SLIDE 36

A Predictor

Take a closer look at the unified definition of h(x, y)

◮ When y = +1:

h_w(x, +1) = 1 / (1 + exp(−⟨w, x⟩))

◮ When y = −1:

h_w(x, −1) = 1 / (1 + exp(⟨w, x⟩))
           = exp(−⟨w, x⟩) / (1 + exp(−⟨w, x⟩))
           = 1 − 1 / (1 + exp(−⟨w, x⟩))
           = 1 − h_w(x, +1)

SLIDE 37

A Linear Classifier?

To justify that this is a linear classifier, let us look at the decision boundary given by

h(x, +1) = h(x, −1)    (21)

Specifically, we have

1 / (1 + exp(−⟨w, x⟩)) = 1 / (1 + exp(⟨w, x⟩))
exp(−⟨w, x⟩) = exp(⟨w, x⟩)
−⟨w, x⟩ = ⟨w, x⟩
2⟨w, x⟩ = 0

◮ The decision boundary ⟨w, x⟩ = 0 is a straight line

SLIDE 38

Risk/Loss Function

For a given training example (x, y), the risk/loss function is defined as the negative log of h(x, y):

L(h_w, (x, y)) = − log( 1 / (1 + exp(−y⟨w, x⟩)) )
              = log(1 + exp(−y⟨w, x⟩))    (22)

Intuitively, minimizing the risk will increase the value of h(x, y).

SLIDE 39

ERM

The Empirical Risk Minimization (ERM) problem: given the training set S = {(x_1, y_1), . . . , (x_m, y_m)}, minimize the following objective function with respect to w:

L(h_w, S) = (1/m) Σ_{i=1}^m log(1 + exp(−y_i⟨w, x_i⟩))    (23)

◮ L(h_w, S) is a convex function with respect to w
◮ Estimation of w: ŵ ← argmin_{w′} L(h_{w′}, S)
◮ Minimization can be done with gradient-based optimization¹

¹More detail will be covered in the lecture on optimization methods


SLIDE 42

Gradient Descent

◮ The gradient of L(h_w, S) with respect to w:

dL(h_w, S)/dw = (1/m) Σ_{i=1}^m [exp(−y_i⟨w, x_i⟩) / (1 + exp(−y_i⟨w, x_i⟩))] · (−y_i x_i)    (24)

◮ Gradient-based learning:

w(new) = w(old) − η dL(h_w, S)/dw
       = w(old) + (η/m) Σ_{i=1}^m [exp(−y_i⟨w, x_i⟩) / (1 + exp(−y_i⟨w, x_i⟩))] · (y_i x_i)

where η is the updating step size.

◮ Exercise: prove Eq. 24
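Eq. 24 and the update rule can be sketched as batch gradient descent; the toy data, step size, and iteration count below are illustrative:

```python
import numpy as np

# Batch gradient descent for logistic regression (Eqs. 23-24).
def train_logistic(X, y, eta=0.5, T=2000):
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        margins = y * (X @ w)                        # y_i <w, x_i>
        coef = np.exp(-margins) / (1 + np.exp(-margins))
        grad = -(1 / m) * (coef * y) @ X             # Eq. 24
        w = w - eta * grad                           # w(new) = w(old) - eta * grad
    return w

# Toy 1-D data with a constant-1 feature for the bias
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([-1, -1, 1, 1])
w = train_logistic(X, y)
print(np.sign(X @ w))  # signs match the labels y
```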

SLIDE 43

More Analysis on Gradient Descent

Gradient-based learning:

w(new) = w(old) + (η/m) Σ_{i=1}^m [exp(−y_i⟨w, x_i⟩) / (1 + exp(−y_i⟨w, x_i⟩))] · (y_i x_i)    (25)

For each (x_i, y_i), the update is
(1) directed by the true label y_i (the factor y_i x_i), as in the Perceptron algorithm
(2) proportional to the prediction value of the opposite label (the fraction), unlike the Perceptron algorithm


SLIDE 45

Updating Rules

Consider the case where the learning algorithm takes only one training example at a time

◮ Logistic regression:

w(new) = w(old) + η · [exp(−y_i⟨w, x_i⟩) / (1 + exp(−y_i⟨w, x_i⟩))] · (y_i x_i)    (26)

◮ Perceptron algorithm:

w(new) = w(old) + y_i x_i    (27)

which only applies when the prediction is wrong


SLIDE 47

A Probabilistic View of Logistic Regression

◮ From a probabilistic view, logistic regression defines the probability of a possible label y given the input x:

p_w(Y = y | x) = h(x, y) = 1 / (1 + exp(−y⟨w, x⟩))    (28)

where Y is a random variable with Y ∈ {−1, +1}

◮ The previous prediction rule is equivalent to

ŷ = +1 if p(Y = +1 | x) > p(Y = −1 | x); −1 if p(Y = +1 | x) < p(Y = −1 | x)    (29)


SLIDE 49

Parameter Estimation: Likelihood Function

Given the training set S = {(x_1, y_1), . . . , (x_m, y_m)}, the likelihood function is defined as

Lik(w) = Π_{i=1}^m p_w(y_i | x_i)    (30)

Likelihood Principle: All the information about w is contained in the likelihood function for w given S. [Berger and Wolpert, 1988]


SLIDE 51

Parameter Estimation: Maximum Likelihood

Given the training set S,

◮ Log-likelihood function:

ℓ(w) = Σ_{i=1}^m log p_w(y_i | x_i)
     = Σ_{i=1}^m log( 1 / (1 + exp(−y_i⟨w, x_i⟩)) )
     = −Σ_{i=1}^m log(1 + exp(−y_i⟨w, x_i⟩))    (31)

◮ Maximize the log-likelihood function:

argmax_w ℓ(w) = argmin_w −ℓ(w) = argmin_w L(h_w, S)

Learning with ERM is equivalent to Maximum Likelihood Estimation (MLE) in statistics

SLIDE 52

Gradient Descent, revisited

Recall the gradient-based learning on the previous slide:

w(new) = w(old) + (η/m) Σ_{i=1}^m [exp(−y_i⟨w, x_i⟩) / (1 + exp(−y_i⟨w, x_i⟩))] · (y_i x_i)
       = w(old) + (η/m) Σ_{i=1}^m (1 − p(y_i | x_i)) · y_i x_i    (32)

◮ If p(y_i | x_i) → 0, wrong prediction, maximal update
◮ If p(y_i | x_i) → 1, correct prediction, minimal update

SLIDE 53

Linear Regression

SLIDE 54

Hypothesis Class

◮ The hypothesis class of linear regression predictors is defined as

H_reg = {⟨w, x⟩ : w ∈ R^d}    (33)

◮ One example hypothesis h ∈ H_reg:

h(x) = ⟨w, x⟩    (34)

SLIDE 55

Problem Statement

Given the training set S, in this case {(x_1, y_1), . . . , (x_5, y_5)}, find h ∈ H_reg such that h(x) gives the best (linear) relation between x and y.


SLIDE 57

Loss Function

◮ Loss function:

L(h, (x, y)) = (h(x) − y)² = (w^T x − y)²    (35)

◮ Given the training set S, the corresponding empirical risk function of linear regression is defined as

L(h, S) = (1/m) Σ_{i=1}^m (h(x_i) − y_i)²    (36)

which is called the Mean Squared Error (MSE).

SLIDE 58

Visualization

For a 1-D case, the loss function

L(h, S) = (1/m) Σ_{i=1}^m (h(x_i) − y_i)²    (37)

can be visualized as (figure: squared residuals between the fitted line and the data points in the x-y plane)

SLIDE 59

Empirical Risk Minimization

◮ The ERM problem:

argmin_w L_S(h_w) = argmin_w (1/m) Σ_{i=1}^m (⟨w, x_i⟩ − y_i)²    (38)

◮ Compute the gradient and set it to zero:

(2/m) Σ_{i=1}^m (⟨w, x_i⟩ − y_i) x_i = 0  ⇒  Σ_{i=1}^m ⟨w, x_i⟩ x_i = Σ_{i=1}^m y_i x_i

SLIDE 60

Empirical Risk Minimization (II)

To isolate w and solve for it, we have

◮ ⟨w, x_i⟩ x_i = (w^T x_i) x_i = (x_i x_i^T) w, so

Σ_{i=1}^m (x_i x_i^T) w = Σ_{i=1}^m y_i x_i    (39)

◮ then, rewrite it as

A w = b    (40)

with

A = Σ_{i=1}^m x_i x_i^T,   b = Σ_{i=1}^m y_i x_i    (41)


SLIDE 62

Solution

◮ If A is invertible, the solution of the ERM problem is

w = A⁻¹ b    (42)

◮ If A is not invertible, consider the eigendecomposition A = U D U^T, compute the generalized inverse A⁺ = U D⁺ U^T, and then

ŵ = A⁺ b    (43)

with D = diag(d_1, . . . , d_i, 0, . . . , 0), where D⁺ is defined as

D⁺ = diag(1/d_1, . . . , 1/d_i, 0, . . . , 0)    (44)
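Eqs. 41-43 in code; a sketch on made-up data where y depends linearly on x, using the pseudo-inverse so the singular case is covered too:

```python
import numpy as np

# Closed-form ERM solution: build A = sum x_i x_i^T and b = sum y_i x_i,
# then solve A w = b via the generalized (pseudo-)inverse A^+,
# which equals A^{-1} whenever A is invertible.
X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])  # constant-1 feature for bias
y = np.array([2.0, 3.0, 4.0])                        # y = x + 1 exactly

A = X.T @ X                 # sum_i x_i x_i^T
b = X.T @ y                 # sum_i y_i x_i
w = np.linalg.pinv(A) @ b   # A^+ b
print(np.round(w, 6))       # [1. 1.]: slope 1, intercept 1
```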

SLIDE 63

Verification of Generalized Inverse

D = diag(d_1, . . . , d_i, 0, . . . , 0),  D⁺ = diag(1/d_1, . . . , 1/d_i, 0, . . . , 0)

◮ A = U D U^T
◮ A⁺ = U D⁺ U^T

A A⁺ = U D U^T U D⁺ U^T = U diag(1, . . . , 1, 0, . . . , 0) U^T    (45)

SLIDE 64

ℓ2 Regularization

◮ Another common way of addressing the non-invertibility issue is to add a constraint on w:

L_{S,ℓ2}(h_w) = (1/m) Σ_{i=1}^m (h_w(x_i) − y_i)² + λ‖w‖²    (46)

where λ is the regularization parameter

◮ Gradient of the new L_S(h_w):

dL_{S,ℓ2}(h_w)/dw = (2/m) Σ_{i=1}^m (⟨w, x_i⟩ − y_i) x_i + λw    (47)


SLIDE 67

ℓ2 Regularization

◮ Solution: with the notations A and b defined in Eq. (41),

w = (A + λI)⁻¹ b    (48)

◮ A + λI is invertible when d_i + λ ≠ 0, ∀i:

A + λI = U D U^T + λI = U (D + λI) U^T    (49)

◮ Regularization will be further discussed in the following two lectures
  ◮ Model selection
  ◮ Regularization and stability

◮ Exercise: Prove Eq. (48)
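Eq. 48 on a deliberately singular A; a sketch with illustrative data (two identical feature columns, so A itself is not invertible, but A + λI is for λ > 0):

```python
import numpy as np

# Ridge solution w = (A + lambda I)^{-1} b (Eq. 48) on a rank-1 design.
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])  # duplicated feature
y = np.array([1.0, 2.0, 3.0])
lam = 0.1

A = X.T @ X                  # singular: both rows/columns identical
b = X.T @ y
w = np.linalg.solve(A + lam * np.eye(2), b)
print(np.round(w, 4))        # both weights equal by symmetry
```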


SLIDE 71

ℓ2 Regularization: Illustration

Consider a 2-D case, where x = (x_1, x_2) and w = (w_1, w_2):

L_{S,ℓ2}(h_w) = (1/m) Σ_{i=1}^m (h_w(x_i) − y_i)² + λ‖w‖²    (50)

Visualization of both components with their contour plots (figure: contours of the MSE term and of the ℓ2 regularization term in the (w_1, w_2) plane)

Minimizing L_{S,ℓ2}(h_w) is to find a tradeoff between these two components.


SLIDE 75

A Probabilistic View of Linear Regression

Consider the loss function L_S(h) defined in equation 46:

exp(−L_S(h)) = exp( −(1/m) Σ_{i=1}^m (h(x_i) − y_i)² − λ‖w‖² )
∝ exp( −Σ_{i=1}^m (h(x_i) − y_i)² ) · exp( −‖w‖² )
= Π_{i=1}^m exp( −(h(x_i) − y_i)² ) · exp( −‖w‖² )
∝ Π_{i=1}^m N(y_i | h(x_i), 1/2) · N(w | 0, 1/2)

SLIDE 76

A Probabilistic View of Linear Regression (II)

Minimizing the loss function L_S(h) is equivalent to maximizing the following objective function:

exp(−L_S(h)) ∝ Π_{i=1}^m N(y_i | h(x_i), 1/2) · N(w | 0, 1/2)    (51)

◮ Π_{i=1}^m N(y_i | h(x_i), 1/2): likelihood function of the data S
◮ N(w | 0, 1/2): prior distribution of w
◮ Maximizing equation 51 is equivalent to maximum a posteriori (MAP) estimation

SLIDE 77

Polynomial Regression

Some learning tasks require nonlinear predictors. With a single variable x ∈ R:

f_w(x) = w_0 + w_1 x + · · · + w_n x^n    (52)

where w = (w_0, w_1, . . . , w_n) is a vector of coefficients of size n + 1.
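Eq. 52 reduces to linear regression on the feature map (1, x, . . . , x^n); a sketch with made-up data generated from f(x) = 2 + 3x²:

```python
import numpy as np

# Polynomial regression as linear regression on (1, x, ..., x^n).
def poly_features(x, n):
    return np.vander(x, n + 1, increasing=True)  # columns 1, x, ..., x^n

x = np.array([-1.0, 0.0, 1.0, 2.0])
y = 2 + 3 * x**2                     # data from f(x) = 2 + 3x^2
Phi = poly_features(x, 2)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.round(w, 6))                # [2. 0. 3.]: recovers w0, w1, w2
```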

SLIDE 78

Reference

Berger, J. O. and Wolpert, R. L. (1988). The Likelihood Principle. IMS.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Friedman, J., Hastie, T., and Tibshirani, R. (2001). The Elements of Statistical Learning. Springer.