NPFL129, Lecture 3
Perceptron and Logistic Regression
Milan Straka
October 19, 2020
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Cross-Validation

We already talked about a train set and a test set. Given that the main goal of machine learning is to perform well on unseen data, the test set must not be used during training or hyperparameter selection; ideally, it is hidden from us altogether. Therefore, to evaluate a machine learning model (for example, to select a model architecture, features, or a hyperparameter value), we normally need a validation or development set. However, using a single development set might give us noisy results. To obtain less noisy results (i.e., with smaller variance), we can use cross-validation. In cross-validation, we choose multiple validation sets from the training data, and for every one, we train a model on the rest of the training data and evaluate on the chosen validation set. A commonly used strategy to choose the validation sets is called k-fold cross-validation. Here the training set is partitioned into $k$ subsets of approximately the same size, and each subset takes a turn playing the role of the validation set.
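The k-fold strategy can be sketched as follows — a minimal example using sklearn's KFold splitter; the dataset and the ridge model are chosen only for illustration, the lecture does not prescribe them:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

X, t = load_diabetes(return_X_y=True)

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    # Train on the rest of the training data, evaluate on the chosen fold.
    model = Ridge(alpha=1.0).fit(X[train_idx], t[train_idx])
    scores.append(model.score(X[val_idx], t[val_idx]))

# The mean of the per-fold scores is a less noisy estimate than a single split.
print(np.mean(scores))
```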
An extreme case of k-fold cross-validation is leave-one-out cross-validation, where every single example is considered a separate validation set. Computing leave-one-out cross-validation is usually extremely inefficient for larger training sets, but in the case of linear regression with L2 regularization, it can be evaluated efficiently. If you are interested, see: Ryan M. Rifkin and Ross A. Lippert: Notes on Regularized Least Squares, http://cbcl.mit.edu/publications/ps/MIT-CSAIL-TR-2007-025.pdf It is implemented by sklearn.linear_model.RidgeCV.
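A small sketch of the mentioned sklearn class; the candidate regularization strengths are arbitrary illustrative values:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV

X, t = load_diabetes(return_X_y=True)

# By default, RidgeCV uses the efficient leave-one-out cross-validation
# to select the regularization strength from the given candidates.
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, t)
print(model.alpha_)
```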
Binary classification is classification into two classes. To extend linear regression
$$y(x; w) = x^T w + b$$
to binary classification, we might seek a threshold and then classify an input as negative/positive depending on whether $y(x; w)$ is smaller/larger than the threshold. Zero is usually used as the threshold, both because of symmetry and because the bias parameter $b$ acts as a trainable threshold anyway.
Consider two points $x_1, x_2$ on the decision surface. We have $y(x_1; w) = y(x_2; w) = 0$, and so $(x_1 - x_2)^T w = 0$ — therefore, $w$ is a normal of the boundary.

Consider $x$ and let $x_\perp$ be the orthogonal projection of $x$ to the boundary, so we can write $x = x_\perp + r \frac{w}{||w||}$. Multiplying both sides by $w^T$ and adding $b$, we get that the distance of $x$ to the boundary is
$$r = \frac{y(x)}{||w||}.$$

The distance of the decision boundary from the origin is therefore $\frac{|b|}{||w||}$.
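A small numeric check of the distance formula $r = y(x)/||w||$; the weights, bias, and point are arbitrary illustrative values:

```python
import numpy as np

w = np.array([3.0, 4.0])
b = -5.0
x = np.array([2.0, 1.0])

y = x @ w + b                      # y(x; w) = x^T w + b
r = y / np.linalg.norm(w)          # signed distance of x to the boundary

# Moving x by -r * w/||w|| projects it exactly onto the boundary.
x_proj = x - r * w / np.linalg.norm(w)
print(r, x_proj @ w + b)
```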
The perceptron algorithm is probably the oldest one for training weights of a binary classifier. Assuming targets $t_i \in \{-1, +1\}$, the goal is to find weights $w$ such that for all train data,
$$\mathrm{sign}(y(x_i; w)) = \mathrm{sign}(x_i^T w) = t_i,$$
or equivalently,
$$t_i y(x_i; w) = t_i x_i^T w > 0.$$
Note that a set is called linearly separable if there exists a weight vector $w$ such that the above equation holds.
The perceptron algorithm was invented by Rosenblatt in 1958.

Input: Linearly separable dataset ($X \in \mathbb{R}^{N \times D}$, $t \in \{-1, +1\}^N$).
Output: Weights $w \in \mathbb{R}^D$ such that $t_i x_i^T w > 0$ for all $i$.

- $w \leftarrow 0$
- until all examples are classified correctly, process example $i$:
  - $y \leftarrow x_i^T w$
  - if $t_i y \leq 0$ (incorrectly classified example):
    - $w \leftarrow w + t_i x_i$

We will prove that the algorithm always arrives at some correct set of weights if the training set is linearly separable.
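The algorithm above can be implemented directly — a minimal sketch, with an epoch limit added only as a safety measure (the algorithm itself assumes linear separability) and a tiny illustrative dataset whose last feature is a constant 1 playing the role of the bias:

```python
import numpy as np

def perceptron(X, t, max_epochs=1000):
    """The perceptron algorithm; X is N x D, t is in {-1, +1}^N."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        updated = False
        for x_i, t_i in zip(X, t):
            if t_i * (x_i @ w) <= 0:      # incorrectly classified example
                w += t_i * x_i            # w <- w + t_i x_i
                updated = True
        if not updated:                   # all examples classified correctly
            return w
    return w

X = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
t = np.array([1, 1, -1, -1])
w = perceptron(X, t)
print(np.sign(X @ w))
```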
Consider the main part of the perceptron algorithm:
- $y \leftarrow x_i^T w$
- if $t_i y \leq 0$ (incorrectly classified example):
  - $w \leftarrow w + t_i x_i$

We can derive the algorithm using on-line gradient descent, using the following loss function:
$$L(y(x; w), t) \stackrel{\mathrm{def}}{=} \begin{cases} -t x^T w & \text{if } t x^T w \leq 0, \\ 0 & \text{otherwise} \end{cases} \;=\; \max(0, -t x^T w) = \mathrm{ReLU}(-t x^T w).$$

In this specific case, the value of the learning rate does not actually matter, because multiplying $w$ by a positive constant does not change the prediction.
Let $w_*$ be some weights separating the training data and let $w_k$ be the weights after $k$ non-trivial updates of the perceptron algorithm, with $w_0$ being 0.

We will prove that the angle $\alpha$ between $w_*$ and $w_k$ decreases at each step. Note that
$$\cos(\alpha) = \frac{w_*^T w_k}{||w_*|| \cdot ||w_k||}.$$
Assume that the maximum norm of any training example is bounded by $R$, i.e., $||x|| \leq R$, and that $\gamma$ is the minimum margin of $w_*$, so
$$t x^T w_* \geq \gamma.$$

First consider the dot product of $w_*$ and $w_k$:
$$w_*^T w_k = w_*^T (w_{k-1} + t_k x_k) \geq w_*^T w_{k-1} + \gamma.$$
By iteratively applying this equation, we get
$$w_*^T w_k \geq k\gamma.$$

Now consider the length of $w_k$:
$$||w_k||^2 = ||w_{k-1} + t_k x_k||^2 = ||w_{k-1}||^2 + 2 t_k x_k^T w_{k-1} + ||x_k||^2.$$
Because $x_k$ was misclassified, we know that $t_k x_k^T w_{k-1} \leq 0$, so
$$||w_k||^2 \leq ||w_{k-1}||^2 + R^2.$$
When applied iteratively, we get $||w_k||^2 \leq k R^2$.
Putting everything together, we get
$$\cos(\alpha) = \frac{w_*^T w_k}{||w_*|| \cdot ||w_k||} \geq \frac{k\gamma}{||w_*|| \sqrt{k R^2}}.$$
Therefore, the $\cos(\alpha)$ increases during every update. Because the value of $\cos(\alpha)$ is at most 1, we get
$$1 \geq \frac{\sqrt{k}\,\gamma}{R\,||w_*||}, \quad\text{so}\quad k \leq \frac{R^2 ||w_*||^2}{\gamma^2}.$$
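The bound can be checked empirically — a sketch with an arbitrary separable dataset and an arbitrary choice of separating weights $w_*$:

```python
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
t = np.array([1, 1, -1, -1])
w_star = np.array([1.0, 1.0])          # some weights separating the data

R = max(np.linalg.norm(x) for x in X)  # bound on example norms
gamma = min(t_i * (x_i @ w_star) for x_i, t_i in zip(X, t))  # minimum margin

# Run the perceptron algorithm, counting the non-trivial updates k.
w, k = np.zeros(2), 0
while any(t_i * (x_i @ w) <= 0 for x_i, t_i in zip(X, t)):
    for x_i, t_i in zip(X, t):
        if t_i * (x_i @ w) <= 0:
            w += t_i * x_i
            k += 1

bound = R ** 2 * (w_star @ w_star) / gamma ** 2
print(k, bound)
```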
The perceptron algorithm has several drawbacks:
- If the input set is not linearly separable, the algorithm never finishes.
- The algorithm cannot be easily extended to classification into more than two classes.
- The algorithm performs only prediction; it is not able to return probabilities of the predictions.
- Most importantly, the perceptron algorithm finds some solution, not necessarily a good one, because once it finds a separating one, it cannot perform any more updates.
The Bernoulli distribution is a distribution over a binary random variable. It has a single parameter $\varphi \in [0, 1]$, which specifies the probability of the random variable being equal to 1:
$$P(x) = \varphi^x (1 - \varphi)^{1-x}, \quad E[x] = \varphi, \quad \mathrm{Var}(x) = \varphi(1 - \varphi).$$

The categorical distribution is an extension of the Bernoulli distribution to random variables taking one of $k$ different discrete outcomes. It is parametrized by $p \in [0, 1]^k$ such that $\sum_{i=1}^k p_i = 1$:
$$P(x) = \prod_i p_i^{x_i}, \quad E[x_i] = p_i, \quad \mathrm{Var}(x_i) = p_i(1 - p_i).$$
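A quick numeric check of the Bernoulli mean and variance formulas; the value of $\varphi$ is arbitrary:

```python
import numpy as np

phi = 0.3
rng = np.random.default_rng(42)
samples = rng.binomial(n=1, p=phi, size=1_000_000)

print(samples.mean())  # should be close to phi
print(samples.var())   # should be close to phi * (1 - phi)
```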
Self-information is the amount of surprise when a random variable is sampled. It should be zero for events with probability 1; less likely events are more surprising; independent events should have additive information:
$$I(x) \stackrel{\mathrm{def}}{=} -\log P(x) = \log \frac{1}{P(x)}.$$
Entropy is the amount of surprise in the whole distribution:
$$H(P) \stackrel{\mathrm{def}}{=} E_{x \sim P}[I(x)] = -E_{x \sim P}[\log P(x)],$$
for discrete $P$: $H(P) = -\sum_x P(x) \log P(x)$,
for continuous $P$: $H(P) = -\int P(x) \log P(x) \,dx$.

Note that in the continuous case, the entropy (also called differential entropy) has slightly different semantics — for example, it can be negative. From now on, all logarithms are natural logarithms with base $e$.
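The discrete entropy can be computed directly from the formula — a minimal sketch using natural logarithms as in the lecture (zero probabilities are excluded, since $0 \log 0$ is taken as 0):

```python
import numpy as np

def entropy(p):
    """Entropy of a discrete distribution, using natural logarithms."""
    p = np.asarray(p)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# A fair coin has the largest entropy among binary distributions: log 2.
print(entropy([0.5, 0.5]))
print(entropy([0.9, 0.1]))
```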
Cross-entropy is defined as
$$H(P, Q) \stackrel{\mathrm{def}}{=} -E_{x \sim P}[\log Q(x)].$$

Gibbs Inequality
$$H(P, Q) \geq H(P), \qquad H(P) = H(P, Q) \Leftrightarrow P = Q.$$

Proof: Consider
$$H(P) - H(P, Q) = \sum_x P(x) \log \frac{Q(x)}{P(x)}.$$
Using the fact that $\log x \leq (x - 1)$ with equality only for $x = 1$, we get
$$\sum_x P(x) \log \frac{Q(x)}{P(x)} \leq \sum_x P(x) \left(\frac{Q(x)}{P(x)} - 1\right) = \sum_x Q(x) - \sum_x P(x) = 0.$$
For the equality to hold, $\frac{Q(x)}{P(x)}$ must be 1 for all $x$, i.e., $P = Q$.
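A numeric illustration of the Gibbs inequality $H(P, Q) \geq H(P)$, with equality exactly when $P = Q$; the two distributions are arbitrary illustrative values:

```python
import numpy as np

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) of two discrete distributions (natural log)."""
    p, q = np.asarray(p), np.asarray(q)
    return float(-np.sum(p * np.log(q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.4])

print(cross_entropy(p, q))  # H(P, Q)
print(cross_entropy(p, p))  # H(P, P) = H(P)
```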
Corollary: For a categorical distribution with $n$ outcomes, $H(P) \leq \log n$, because for $Q(x) = 1/n$ we get
$$H(P) \leq H(P, Q) = -\sum_x P(x) \log Q(x) = \log n.$$

Note that generally $H(P, Q) \neq H(Q, P)$.
The Kullback–Leibler divergence, sometimes also called relative entropy, is defined as
$$D_{\mathrm{KL}}(P || Q) \stackrel{\mathrm{def}}{=} H(P, Q) - H(P) = E_{x \sim P}[\log P(x) - \log Q(x)].$$

- As a consequence of the Gibbs inequality: $D_{\mathrm{KL}}(P || Q) \geq 0$.
- Generally $D_{\mathrm{KL}}(P || Q) \neq D_{\mathrm{KL}}(Q || P)$.
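Both properties are easy to check numerically — non-negativity and, in general, asymmetry; the two distributions are arbitrary illustrative values:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions (natural log)."""
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.4])
print(kl_divergence(p, q), kl_divergence(q, p))
```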
The normal (Gaussian) distribution is a distribution over real numbers, parametrized by a mean $\mu$ and variance $\sigma^2$:
$$N(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$
For standard values $\mu = 0$ and $\sigma^2 = 1$, we get
$$N(x; 0, 1) = \sqrt{\frac{1}{2\pi}}\, e^{-\frac{x^2}{2}}.$$
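The density can be implemented directly from the formula — a minimal sketch; the evaluation points are arbitrary:

```python
import math

def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Density of N(x; mu, sigma^2) implemented directly from the formula."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# The standard normal density at 0 is 1/sqrt(2*pi), roughly 0.3989.
print(normal_pdf(0.0))
print(normal_pdf(1.0, mu=1.0, sigma2=4.0))
```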
Central limit theorem: the (suitably normalized) sum of independent identically distributed random variables with finite variance converges to the normal distribution.

Given a set of constraints, a distribution with maximal entropy fulfilling the constraints can be considered the most general one, containing as few additional assumptions as possible. Considering distributions with a given mean and variance, it can be proven (using variational inference) that the distribution with maximum entropy is exactly the normal distribution.
Let $X = \{x_1, x_2, \ldots, x_N\}$ be training data drawn independently from the data-generating distribution $p_{\mathrm{data}}$. We denote the empirical data distribution as $\hat p_{\mathrm{data}}$.

Let $p_{\mathrm{model}}(x; w)$ be a family of distributions. The maximum likelihood estimation of $w$ is:
$$\begin{aligned}
w_{\mathrm{MLE}} &= \arg\max_w p_{\mathrm{model}}(X; w) \\
&= \arg\max_w \prod_{i=1}^N p_{\mathrm{model}}(x_i; w) \\
&= \arg\min_w \sum_{i=1}^N -\log p_{\mathrm{model}}(x_i; w) \\
&= \arg\min_w E_{x \sim \hat p_{\mathrm{data}}}[-\log p_{\mathrm{model}}(x; w)] \\
&= \arg\min_w H(\hat p_{\mathrm{data}}, p_{\mathrm{model}}(x; w)) \\
&= \arg\min_w D_{\mathrm{KL}}(\hat p_{\mathrm{data}} || p_{\mathrm{model}}(x; w)) + H(\hat p_{\mathrm{data}})
\end{aligned}$$
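As a concrete instance, for the Bernoulli distribution the MLE of $\varphi$ is the sample mean — a small numeric illustration minimizing the negative log likelihood over a grid (the data is arbitrary):

```python
import numpy as np

def bernoulli_nll(phi, samples):
    """Negative log likelihood of Bernoulli samples under parameter phi."""
    return float(-np.sum(samples * np.log(phi) + (1 - samples) * np.log(1 - phi)))

samples = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
candidates = np.linspace(0.01, 0.99, 99)
nlls = [bernoulli_nll(phi, samples) for phi in candidates]

# The NLL is minimized at the sample mean (0.7 here).
best = candidates[int(np.argmin(nlls))]
print(best, samples.mean())
```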
MLE can be easily generalized to the conditional case, where our goal is to predict $t$ given $x$:
$$\begin{aligned}
w_{\mathrm{MLE}} &= \arg\max_w p_{\mathrm{model}}(T | X; w) \\
&= \arg\max_w \prod_{i=1}^m p_{\mathrm{model}}(t_i | x_i; w) \\
&= \arg\min_w \sum_{i=1}^m -\log p_{\mathrm{model}}(t_i | x_i; w)
\end{aligned}$$
The resulting loss function is called negative log likelihood, cross-entropy, or Kullback–Leibler divergence.
Assume that the true data-generating distribution $p_{\mathrm{data}}$ lies within the model family $p_{\mathrm{model}}(\cdot; w)$. Furthermore, assume there exists a unique $w_{p_{\mathrm{data}}}$ such that $p_{\mathrm{data}} = p_{\mathrm{model}}(\cdot; w_{p_{\mathrm{data}}})$.

- MLE is a consistent estimator. If we denote $w_m$ to be the parameters found by MLE for a training set with $m$ examples generated by the data-generating distribution, then $w_m$ converges in probability to $w_{p_{\mathrm{data}}}$. Formally, for any $\varepsilon > 0$, $P(||w_m - w_{p_{\mathrm{data}}}|| > \varepsilon) \to 0$ as $m \to \infty$.

- MLE is in a sense the most statistically efficient. For any consistent estimator, we might consider the average distance of $w_m$ and $w_{p_{\mathrm{data}}}$, formally $E_{x_1, \ldots, x_m \sim p_{\mathrm{data}}}[||w_m - w_{p_{\mathrm{data}}}||^2]$. It can be shown (Rao 1945, Cramér 1946) that no consistent estimator has lower mean squared error than the maximum likelihood estimator.

Therefore, for reasons of consistency and efficiency, maximum likelihood is often considered the preferred estimator for machine learning.
Logistic regression is an extension of the perceptron, which models the conditional probabilities $p(C_0|x)$ and $p(C_1|x)$. Logistic regression can in fact handle also more than two classes, which we will see shortly.

Logistic regression employs the following parametrization of the conditional class probabilities:
$$\begin{aligned}
p(C_1 | x) &= \sigma(x^T w + b), \\
p(C_0 | x) &= 1 - p(C_1 | x),
\end{aligned}$$
where $\sigma$ is the sigmoid function
$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$
It can be trained using the SGD algorithm.
The sigmoid function
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
has values in the range $(0, 1)$, is monotonically increasing, and has a derivative of $\frac{1}{4}$ at $x = 0$. Its derivative can be expressed using the sigmoid itself:
$$\sigma'(x) = \sigma(x)(1 - \sigma(x)).$$
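Both stated properties are easy to verify numerically — $\sigma(0) = \frac{1}{2}$ and $\sigma'(0) = \sigma(0)(1 - \sigma(0)) = \frac{1}{4}$:

```python
import math

def sigmoid(x):
    """The sigmoid function sigma(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Central finite difference approximating the derivative at 0.
eps = 1e-6
derivative_at_0 = (sigmoid(eps) - sigmoid(-eps)) / (2 * eps)
print(sigmoid(0.0), derivative_at_0)
```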
To give some meaning to the sigmoid function, starting with
$$p(C_1 | x) = \sigma(y(x; w)) = \frac{1}{1 + e^{-y(x;w)}},$$
we can arrive at
$$y(x; w) = \log\left(\frac{p(C_1|x)}{p(C_0|x)}\right),$$
where the prediction $y(x; w)$ of the model is called a logit, and it is a logarithm of the odds of the two classes' probabilities.
To train the logistic regression $y(x; w) = x^T w$, we use MLE (the maximum likelihood estimation). Note that $p(C_1 | x; w) = \sigma(y(x; w))$.

Therefore, the loss for a batch $X = \{(x_1, t_1), (x_2, t_2), \ldots, (x_N, t_N)\}$ is
$$L(X) = \frac{1}{N} \sum_i -\log(p(C_{t_i} | x_i; w)).$$

Input: Input dataset ($X \in \mathbb{R}^{N \times D}$, $t \in \{0, 1\}^N$), learning rate $\alpha \in \mathbb{R}^+$.

- $w \leftarrow 0$
- until convergence (or until patience is over), process a batch of $N$ examples:
  - $g \leftarrow \frac{1}{N} \sum_i \nabla_w -\log(p(C_{t_i} | x_i; w))$
  - $w \leftarrow w - \alpha g$
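The algorithm above can be sketched in numpy — a minimal version where each batch is the whole dataset, using the fact that the gradient of the mean negative log likelihood is $\frac{1}{N} X^T(\sigma(Xw) - t)$; the tiny dataset is illustrative only, with a constant 1 feature playing the role of the bias:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_regression_sgd(X, t, alpha=0.5, epochs=200):
    """Gradient descent for logistic regression following the slide."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # g = 1/N * X^T (sigma(Xw) - t), the gradient of the mean NLL.
        g = X.T @ (sigmoid(X @ w) - t) / len(X)
        w -= alpha * g
    return w

X = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
t = np.array([1, 1, 0, 0])
w = logistic_regression_sgd(X, t)
print((sigmoid(X @ w) > 0.5).astype(int))
```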