

  1. A Unified View of Loss Functions in Supervised Learning
  Shuiwang Ji, Department of Computer Science & Engineering, Texas A&M University

  2. Linear Classifier
  1 For a binary classification problem, we are given an input dataset $X = [x_1, x_2, \ldots, x_n]$ with the corresponding labels $Y = [y_1, y_2, \ldots, y_n]$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$.
  2 For a given sample $x_i$, a linear classifier computes the linear score $s_i$ as a weighted summation of all features:
  $$s_i = w^T x_i + b, \qquad (1)$$
  where $w$ is the weight vector and $b$ is the bias.
  3 We can predict the label of $x_i$ based on the sign of the linear score $s_i$. By employing an appropriate loss function, we can train and obtain a linear classifier.
  4 We describe and compare a variety of loss functions used in supervised learning, including the zero-one loss, perceptron loss, hinge loss, log loss (also known as logistic regression loss or cross-entropy loss), exponential loss, and square loss.
  5 We describe these loss functions in the context of linear classifiers, but they can also be used for nonlinear classifiers.
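
To make the score computation of Equation (1) concrete, here is a minimal NumPy sketch; the data, weights, and bias below are made-up placeholders, not values from the slides. The later loss sketches reuse these `y` and `s` arrays.

```python
import numpy as np

# Toy dataset: n = 4 samples, d = 3 features (made-up values).
X = np.array([[ 1.0,  2.0, -1.0],
              [ 0.5, -1.5,  2.0],
              [-2.0,  0.0,  1.0],
              [ 1.5,  1.0,  0.5]])
y = np.array([1, -1, -1, 1])     # labels in {+1, -1}

w = np.array([0.3, -0.2, 0.5])   # weight vector (assumed, not learned)
b = 0.1                          # bias

s = X @ w + b                    # linear scores s_i = w^T x_i + b
y_pred = np.sign(s)              # predicted labels
margins = y * s                  # y_i * s_i > 0 means a correct prediction
```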

  3. Zero-one Loss
  1 The zero-one loss measures the number of prediction errors of a classifier. For a given input $x_i$, the classifier makes a correct prediction if $y_i s_i > 0$; otherwise, it makes a wrong prediction.
  2 Therefore, the total zero-one loss can be written as $\frac{1}{n} \sum_{i=1}^{n} L_{0/1}(y_i, s_i)$, where $L_{0/1}$ is the zero-one loss defined as
  $$L_{0/1}(y_i, s_i) = \begin{cases} 1 & \text{if } y_i s_i < 0, \\ 0 & \text{otherwise}. \end{cases} \qquad (2)$$
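
A minimal sketch of Equation (2), assuming `y` and `s` are the label and score arrays from the sketch above:

```python
import numpy as np

def zero_one_loss(y, s):
    # 1 for each wrong prediction (y_i * s_i < 0), 0 otherwise; averaged over n.
    return np.mean(y * s < 0)
```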

  4. Perceptron loss
  1 The zero-one loss incurs the same loss value of 1 for all wrong predictions, no matter how far a wrong prediction is from the hyperplane.
  2 The perceptron loss addresses this by penalizing each wrong prediction by the extent of violation. The perceptron loss function is defined as $\frac{1}{n} \sum_{i=1}^{n} L_p(y_i, s_i)$, where $L_p$ is the perceptron loss:
  $$L_p(y_i, s_i) = \max(0, -y_i s_i). \qquad (3)$$
  3 Note that the loss is 0 when the input example is correctly classified. When the input example is incorrectly classified, the loss equals the extent of violation, $-y_i s_i$.
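
A corresponding NumPy sketch of Equation (3), under the same assumed `y` and `s` arrays:

```python
import numpy as np

def perceptron_loss(y, s):
    # Zero for correct predictions; -y_i * s_i (the extent of violation) otherwise.
    return np.mean(np.maximum(0.0, -y * s))
```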

  5. Square loss
  1 The square loss function is commonly used for regression problems.
  2 It can also be used for binary classification problems as
  $$\frac{1}{n} \sum_{i=1}^{n} L_s(y_i, s_i), \qquad (4)$$
  where $L_s$ is the square loss, defined as
  $$L_s(y_i, s_i) = (1 - y_i s_i)^2. \qquad (5)$$
  3 Note that the square loss tends to penalize wrong predictions excessively. In addition, even when $y_i s_i$ is large and the classifier is making correct predictions, the square loss still incurs a large loss value.
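
A NumPy sketch of Equations (4)-(5), under the same assumptions:

```python
import numpy as np

def square_loss(y, s):
    # (1 - y_i * s_i)^2, averaged; large both for bad mistakes and for
    # very confident correct predictions (y_i * s_i >> 1).
    return np.mean((1.0 - y * s) ** 2)
```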

  6. Log loss (cross entropy)
  1 Logistic regression employs the log loss (cross entropy) to train classifiers.
  2 The loss function used in logistic regression can be expressed as
  $$\frac{1}{n} \sum_{i=1}^{n} L_{\log}(y_i, s_i), \qquad (6)$$
  where $L_{\log}$ is the log loss, defined as
  $$L_{\log}(y_i, s_i) = \log(1 + e^{-y_i s_i}). \qquad (7)$$
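
A NumPy sketch of Equations (6)-(7); `np.logaddexp` is used here so that $\log(1 + e^{-y_i s_i})$ is evaluated without overflow for very negative margins:

```python
import numpy as np

def log_loss(y, s):
    # log(1 + exp(-y_i * s_i)), averaged; logaddexp(0, x) = log(e^0 + e^x).
    return np.mean(np.logaddexp(0.0, -y * s))
```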

  7. Hinge loss (support vector machines)
  1 Support vector machines employ the hinge loss to obtain a classifier with "maximum margin".
  2 The loss function in support vector machines is defined as follows:
  $$\frac{1}{n} \sum_{i=1}^{n} L_h(y_i, s_i), \qquad (8)$$
  where $L_h$ is the hinge loss:
  $$L_h(y_i, s_i) = \max(0, 1 - y_i s_i). \qquad (9)$$
  3 Unlike the zero-one loss and the perceptron loss, a data sample may be penalized even if it is predicted correctly, namely when $0 < y_i s_i < 1$, i.e., when it falls inside the margin.
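
A NumPy sketch of Equations (8)-(9), under the same assumed arrays:

```python
import numpy as np

def hinge_loss(y, s):
    # max(0, 1 - y_i * s_i), averaged; nonzero even for correctly
    # classified samples with margin y_i * s_i < 1.
    return np.mean(np.maximum(0.0, 1.0 - y * s))
```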

  8. Exponential Loss
  1 The log term in the log loss makes the loss grow only slowly for negative values of $y_i s_i$, making it less sensitive to wrong predictions.
  2 There is a more aggressive loss function, known as the exponential loss, which grows exponentially for negative values of $y_i s_i$ and is thus very sensitive to wrong predictions. The AdaBoost algorithm employs the exponential loss to train its models.
  3 The exponential loss function can be expressed as $\frac{1}{n} \sum_{i=1}^{n} L_{\exp}(y_i, s_i)$, where $L_{\exp}$ is the exponential loss, defined as
  $$L_{\exp}(y_i, s_i) = e^{-y_i s_i}. \qquad (10)$$
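
A NumPy sketch of Equation (10), under the same assumptions; note that the value can overflow for very negative margins, which reflects how aggressive this loss is:

```python
import numpy as np

def exp_loss(y, s):
    # exp(-y_i * s_i), averaged; grows exponentially for wrong predictions.
    return np.mean(np.exp(-y * s))
```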

  9. Convexity
  1 Mathematically, a function $f(\cdot)$ is convex if $f(tx_1 + (1-t)x_2) \le t f(x_1) + (1-t) f(x_2)$ for all $t \in [0, 1]$.
  2 A function $f(\cdot)$ is strictly convex if $f(tx_1 + (1-t)x_2) < t f(x_1) + (1-t) f(x_2)$ for all $t \in (0, 1)$ and $x_1 \neq x_2$.
  3 Intuitively, a function is convex if the line segment between any two points on the function never lies below the function.
  4 A function is strictly convex if the line segment between any two distinct points on the function lies strictly above the function, except at the two endpoints themselves.
  https://en.wikipedia.org/wiki/Convex_function
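
As an illustrative sanity check (not a proof), the sketch below samples random margins $m = y_i s_i$ and verifies the convexity inequality for each loss viewed as a function of $m$; the dictionary keys and tolerance are my own choices:

```python
import numpy as np

losses = {
    "perceptron": lambda m: np.maximum(0.0, -m),
    "square":     lambda m: (1.0 - m) ** 2,
    "log":        lambda m: np.logaddexp(0.0, -m),
    "hinge":      lambda m: np.maximum(0.0, 1.0 - m),
    "exp":        lambda m: np.exp(-m),
}

rng = np.random.default_rng(0)
for name, f in losses.items():
    for _ in range(1000):
        m1, m2 = rng.uniform(-3.0, 3.0, size=2)
        t = rng.uniform(0.0, 1.0)
        # Convexity: f(t*m1 + (1-t)*m2) <= t*f(m1) + (1-t)*f(m2).
        assert f(t * m1 + (1 - t) * m2) <= t * f(m1) + (1 - t) * f(m2) + 1e-9, name
```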

  10. Comparison of loss functions
  1 In the zero-one loss, if a data sample is predicted correctly ($y_i s_i > 0$), it incurs zero penalty; otherwise, it incurs a penalty of one. Every misclassified sample receives the same loss.
  2 For the perceptron loss, the penalty for each wrong prediction is proportional to the extent of violation. For the other losses, a data sample can still incur a penalty even if it is classified correctly.
  3 The log loss is similar to the hinge loss, but it is a smooth function that can be optimized with gradient descent.
  4 While the log loss grows slowly for negative values of $y_i s_i$, the exponential loss and square loss are more aggressive.
  5 Note that, among all of these loss functions, the square loss is the only one that penalizes correct predictions severely when $y_i s_i$ is large.
  6 In addition, the zero-one loss is not convex, while the other loss functions are convex. Note that the hinge loss and perceptron loss are not strictly convex.

  11. Comparison of different loss functions in a unified view
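
Slide 11 presumably shows all six losses plotted against the margin $y_i s_i$; a matplotlib sketch that reproduces such a unified figure (the axis range is assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

m = np.linspace(-2.0, 2.0, 400)            # margin m = y_i * s_i
plt.plot(m, (m < 0).astype(float),    label="zero-one")
plt.plot(m, np.maximum(0.0, -m),      label="perceptron")
plt.plot(m, (1.0 - m) ** 2,           label="square")
plt.plot(m, np.logaddexp(0.0, -m),    label="log")
plt.plot(m, np.maximum(0.0, 1.0 - m), label="hinge")
plt.plot(m, np.exp(-m),               label="exponential")
plt.xlabel("$y_i s_i$"); plt.ylabel("loss"); plt.legend()
plt.show()
```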

  12. THANKS!
