Supervised Learning
Liyao Xiang http://xiangliyao.cn/ Shanghai Jiao Tong University
EE226 Big Data Mining Lecture 4
Reference and Acknowledgement: Most of the course materials are credited to Andrew Ng's CS229 lecture notes.
Running example: predict the price of an apartment from its size.
[Figure: scatter plot of price (million RMB) against size (m²) for the training data.]
Notation:
- x^{(i)} ∈ X: input variables
- y^{(i)} ∈ Y: output variables
- (x^{(i)}, y^{(i)}): a training example
- {(x^{(i)}, y^{(i)}); i = 1, …, m}: the training set
- h: X ↦ Y: the hypothesis (假设函数), which maps a testing example to a predicted value
Let's consider a richer dataset in which we also know the number of bedrooms in each apartment:

Size (m²)    #bedrooms    Price (million ¥)
40                        1.2
65           1            1.9
80           2            2.2
89           2            3.3
120          3            5.3
…            …            …

x_1^{(i)}: the size of the i-th apartment in the training set
x_2^{(i)}: the number of bedrooms of the i-th apartment in the training set
Linear function: h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2

More generally, with the convention x_0 = 1 for the intercept term,

h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x

Why a linear function?
Given the training set, how do we pick the parameters θ? One reasonable approach: make h(x) close to y for the training examples!
h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x

Define the (least-squares) cost function

J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2

Why a least-squares cost?
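As a concrete illustration, here is a minimal NumPy sketch of the hypothesis and the least-squares cost; the function and variable names are illustrative, and X is assumed to be a design matrix whose first column is all ones (x_0 = 1).

```python
import numpy as np

def h(theta, X):
    """Linear hypothesis h_theta(x) = theta^T x, applied to every row of X."""
    return X @ theta

def cost(theta, X, y):
    """Least-squares cost J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residual = h(theta, X) - y
    return 0.5 * np.dot(residual, residual)

# Toy data in the spirit of the apartment example (size in m^2 -> price in million RMB).
X = np.array([[1.0, 65.0], [1.0, 80.0], [1.0, 89.0], [1.0, 120.0]])
y = np.array([1.9, 2.2, 3.3, 5.3])
print(cost(np.zeros(2), X, y))  # cost of the all-zero initial guess
```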
Start with some initial "guess" for θ, and repeatedly apply the gradient descent (梯度下降) algorithm to make J(θ) smaller:

\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}

Here α is the learning rate, and the negative partial derivative is the direction of steepest decrease of J. For a single training example this gives the least mean square (LMS) update rule

\theta_j := \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}

where (y^{(i)} − h_θ(x^{(i)})) is the error term.
Batch gradient descent: scans every example in the entire training set before taking a single step.
Stochastic gradient descent: updates θ using the gradient of the error w.r.t. a single training example at a time.
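A sketch of the two variants for the least-squares cost, under the same assumptions as the snippet above (design matrix with a leading column of ones); the learning rate and iteration counts are placeholders that would need tuning in practice.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=1e-5, iters=10000):
    """Batch GD: every update uses the gradient over the whole training set."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)       # gradient of J(theta)
        theta -= alpha * grad
    return theta

def stochastic_gradient_descent(X, y, alpha=1e-5, epochs=100):
    """SGD with the LMS rule: each update uses a single training example."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):
            error = y[i] - X[i] @ theta    # error term for example i
            theta += alpha * error * X[i]  # theta_j := theta_j + alpha * error * x_j^(i)
    return theta
```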
Unlike many optimization problems, linear regression has only one global minimum, to which gradient descent always converges. This is because the cost function J is a convex quadratic function (二次凸函数).
[Figure: contour plot (等高线) of the cost J over θ; the gradient descent trajectory reaches the global minimum.]
Normal equations: instead of an iterative algorithm, take the derivatives of J with respect to θ, set them to 0, and solve the equations! Using matrix derivative notation, for f: R^{m×n} ↦ R and an m × n matrix A, ∇_A f(A) denotes the matrix of partial derivatives ∂f/∂A_{ij}. Applying the properties of matrix derivatives and traces to J(θ) = \frac{1}{2}(X\theta - y)^T(X\theta - y) gives

\nabla_\theta J(\theta) = X^T X \theta - X^T y = 0

so θ = (X^T X)^{-1} X^T y.
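A minimal NumPy sketch of this closed-form solution (names are illustrative); solving the linear system is preferable to forming an explicit inverse.

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: solve X^T X theta = X^T y.
    Assumes X^T X is invertible; np.linalg.lstsq is more robust in practice."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```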
Probabilistic interpretation: assume the targets and inputs are related via

y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}

where ε^{(i)} is an error term capturing unmodeled effects and random noise. Assume the ε^{(i)} are i.i.d. (independently and identically distributed, 独立同分布) and ε^{(i)} ∼ N(0, σ²). Then

p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)

Maximum likelihood: choose θ so as to give the observed data as high probability as possible. Maximizing the log-likelihood is equivalent to minimizing \frac{1}{2}\sum_{i=1}^{m}(y^{(i)} - \theta^T x^{(i)})^2, which is exactly the least-squares cost, so we can be minimizing this term instead!
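To make the equivalence explicit, the log-likelihood can be expanded directly under the i.i.d. Gaussian noise assumption above:

\ell(\theta) = \log \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta)
= \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)
= m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2

Hence maximizing ℓ(θ) over θ is the same as minimizing the least-squares cost J(θ).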
Consider fitting hypotheses of increasing complexity:

y = \theta_0 + \theta_1 x \qquad y = \theta_0 + \theta_1 x + \theta_2 x^2 \qquad y = \sum_{j=0}^{5} \theta_j x^j

The more features we add, the better we can fit the training data. However, there is also a risk in adding too many features: a hypothesis that is too simple underfits the data, while one with too many features can overfit it.
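As an illustration of this trade-off, one can fit polynomials of different degrees to the same small dataset; this sketch uses NumPy and the toy apartment prices (degree 1 tends to underfit, degree 5 to overfit).

```python
import numpy as np

def polynomial_fit(x, y, degree):
    """Least-squares fit of y = sum_{j=0}^{degree} theta_j * x^j."""
    X = np.vander(x, N=degree + 1, increasing=True)  # columns: 1, x, x^2, ...
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

x = np.array([40.0, 65.0, 80.0, 89.0, 120.0])
y = np.array([1.2, 1.9, 2.2, 3.3, 5.3])
for degree in (1, 2, 5):
    print(degree, polynomial_fit(x, y, degree))
```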
Locally weighted regression: when predicting at a query point x, weight the error term of each training example and fit θ to minimize

\sum_{i} w^{(i)} \left(y^{(i)} - \theta^T x^{(i)}\right)^2, \qquad w^{(i)} = \exp\!\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)

then output θ^T x. This gives a higher weight to the training examples close to the testing point x. It is a non-parametric algorithm: we keep the entire training dataset when making predictions.
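A minimal sketch of locally weighted linear regression, assuming a design matrix X with a leading column of ones and a bandwidth tau picked by hand (names are illustrative):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=10.0):
    """Locally weighted linear regression prediction at x_query.

    Solves min_theta sum_i w^(i) (y^(i) - theta^T x^(i))^2 with Gaussian
    weights, then returns theta^T x_query."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted normal equations
    return x_query @ theta
```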
Recall linear regression:

h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x, \qquad J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
Binary classification: if y = 1, the example belongs to the positive class; otherwise, it is a member of the negative class. A linear classifier predicts the label depending on the sign of θ^T x, i.e., sign(θ^T x) = y ⟺ y θ^T x > 0. The larger θ^T x is in magnitude (more negative or more positive), the stronger the belief that y is negative (or positive).

To learn θ, choose a loss function that measures how well the classifier fits the training data. The loss value should be small if y^{(i)} θ^T x^{(i)} > 0 and large if y^{(i)} θ^T x^{(i)} < 0 (and the loss should be convex, so that gradient descent can converge to the global minimum!), with Loss(y^{(i)} θ^T x^{(i)}) → ∞ as y^{(i)} θ^T x^{(i)} → −∞. Common choices:

Loss_{logistic}(z) = \log(1 + e^{-z})   (logistic regression)
Loss_{exp}(z) = e^{-z}   (boosting)
Loss_{hinge}(z) = \max\{1 - z, 0\}   (support vector machines)
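These margin-based losses are simple to write down; a small sketch (function names are mine), where z stands for the margin y θ^T x:

```python
import numpy as np

def logistic_loss(z):
    return np.log(1.0 + np.exp(-z))    # logistic regression

def exp_loss(z):
    return np.exp(-z)                  # boosting

def hinge_loss(z):
    return np.maximum(1.0 - z, 0.0)    # support vector machines

# Each loss is small when the margin z >> 0 and grows as z -> -infinity.
```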
Logistic regression minimizes the average logistic loss over the training set, which hopefully yields a θ such that y^{(i)} θ^T x^{(i)} > 0 for most training examples:

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Loss}_{logistic}(y^{(i)} \theta^T x^{(i)}) = \frac{1}{m} \sum_{i=1}^{m} \log\!\left(1 + \exp(-y^{(i)} \theta^T x^{(i)})\right)

The sigmoid (logistic) function g(z) = \frac{1}{1 + e^{-z}} satisfies g(z) → 1 as z → ∞ and g(z) → 0 as z → −∞.
The sigmoid function can also be used to define the probability model for binary classification, and to refine the hypothesis class as

p(Y = y \mid x; \theta) = g(y x^T \theta) = \frac{1}{1 + e^{-y x^T \theta}}, \qquad h_\theta(x) = \frac{1}{1 + e^{-x^T \theta}}

Under this model, maximizing the likelihood of the training labels is equivalent to minimizing the logistic loss.
Gradient of the logistic loss:

\mathrm{Loss}_{logistic}(z) = \log(1 + e^{-z}), \qquad \frac{d}{dz}\mathrm{Loss}_{logistic}(z) = \frac{1}{1 + e^{-z}} \cdot \frac{d}{dz} e^{-z} = -\frac{e^{-z}}{1 + e^{-z}} = -g(-z)

where g is the sigmoid function. Hence

\frac{\partial}{\partial \theta_k} \mathrm{Loss}_{logistic}(y x^T \theta) = -g(-y x^T \theta) \frac{\partial}{\partial \theta_k}(y x^T \theta) = -g(-y x^T \theta)\, y x_k

and the stochastic gradient descent update is

\theta^{t+1} = \theta^{t} - \alpha_t \cdot \nabla_\theta \mathrm{Loss}_{logistic}(y^{(i)} x^{(i)T} \theta^{t})

Note that g(−y x^T θ) is the model's probability of the incorrect label, so the update moves θ further on examples the model is more likely to get wrong.
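A sketch of this stochastic gradient update for labels y ∈ {−1, +1}, with a fixed learning rate and illustrative names:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sgd(X, y, alpha=0.1, epochs=100):
    """SGD on the average logistic loss; labels y must be in {-1, +1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(X.shape[0]):
            margin = y[i] * (X[i] @ theta)
            # gradient of Loss_logistic(y_i x_i^T theta) is -g(-margin) * y_i * x_i
            theta += alpha * sigmoid(-margin) * y[i] * X[i]
    return theta
```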
Alternatively, with labels y ∈ {0, 1}:

P(y = 1 \mid x; \theta) = h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}, \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)

so p(y \mid x; \theta) = (h_\theta(x))^{y} (1 - h_\theta(x))^{1-y}. Maximizing the log-likelihood by gradient ascent gives an update similar to the least mean square update rule, except that h is now non-linear!
Newton's method: minimizing the loss corresponds to finding the points where its first derivative is 0. Newton's method typically converges faster than gradient descent, and requires many fewer iterations to get very close to the minimum. For a scalar parameter,

\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}

and in the vector case

\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)

where H is the Hessian matrix of second derivatives.
Recap of logistic regression: minimize the average logistic loss \mathrm{Loss}_{logistic}(z) = \log(1 + e^{-z}), either by stochastic gradient descent, \theta^{t+1} = \theta^{t} - \alpha_t \cdot \nabla_\theta \mathrm{Loss}_{logistic}(y^{(i)} x^{(i)T} \theta^{t}), or by Newton's method, \theta := \theta - \ell'(\theta)/\ell''(\theta).
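A sketch of Newton's method applied to the average logistic loss for labels in {−1, +1}; the Hessian expression follows from differentiating the gradient once more, and the names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, iters=10):
    """Newton's method for the average logistic loss; labels y in {-1, +1}."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        margins = y * (X @ theta)
        p_wrong = sigmoid(-margins)            # probability of the incorrect label
        grad = -(X.T @ (p_wrong * y)) / m      # gradient of J(theta)
        s = p_wrong * (1.0 - p_wrong)
        H = (X.T * s) @ X / m                  # Hessian of J(theta)
        theta -= np.linalg.solve(H, grad)      # theta := theta - H^{-1} grad
    return theta
```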
Generalized linear models: is there a general recipe for constructing the hypothesis? Both models we have seen are linear models:

regression: y \mid x; \theta \sim \mathcal{N}(\mu, \sigma^2), \quad h_\theta(x) = \theta^T x
classification: y \mid x; \theta \sim \mathrm{Bernoulli}(\phi), \quad h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}

Both the Gaussian and the Bernoulli are exponential family distributions:

p(y; \eta) = b(y) \exp\!\left(\eta^T T(y) - a(\eta)\right)

A fixed choice of T, a, and b defines a family of distributions parameterized by the natural parameter η.
Classification (Bernoulli):

p(y; \phi) = \phi^{y}(1 - \phi)^{1-y} = \exp\!\left(y \log \phi + (1 - y)\log(1 - \phi)\right) = \exp\!\left(\left(\log \frac{\phi}{1 - \phi}\right) y + \log(1 - \phi)\right)

Regression (Gaussian, with σ² = 1):

p(y; \mu) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{1}{2}(y - \mu)^2\right) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{1}{2}y^2\right) \times \exp\!\left(\mu y - \frac{1}{2}\mu^2\right)

In the Gaussian case, the first factor plays the role of b(y) and the exponent μy matches η^T T(y) with η = μ and T(y) = y. Gaussian, Bernoulli, and many other distributions belong to the exponential family.
To construct a GLM, we make three assumptions:

Assumption 1: y | x; θ follows an exponential family distribution with natural parameter η.
Assumption 2: given x, the goal is to predict the expected value of T(y), i.e., h(x) = E[T(y) | x]. For instance, in logistic regression h_\theta(x) = p(y = 1 \mid x; \theta) = 0 \cdot p(y = 0 \mid x; \theta) + 1 \cdot p(y = 1 \mid x; \theta) = E[y \mid x], so indeed h_\theta(x) = E[y \mid x; \theta].
Assumption 3: the natural parameter depends linearly on the inputs, η = θ^T x.
Linear regression as a GLM: choose the Gaussian to model the conditional distribution, y | x; θ ∼ N(μ, σ²). Then

h_\theta(x) = E[y \mid x; \theta]   (Assumption 2)
= \mu   (mean of the Gaussian)
= \eta   (Assumption 1: writing the Gaussian in the exponential family form p(y; \mu) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2}y^2) \times \exp(\mu y - \frac{1}{2}\mu^2) shows η = μ)
= \theta^T x   (Assumption 3)

which recovers the hypothesis of ordinary least squares.
Logistic regression as a GLM: choose the Bernoulli exponential family distribution to model the conditional distribution, y | x; θ ∼ Bernoulli(φ). Then

h_\theta(x) = E[y \mid x; \theta]   (Assumption 2)
= \phi   (mean of the Bernoulli distribution)
= \frac{1}{1 + e^{-\eta}}   (Assumption 1: from p(y; \phi) = \exp\!\left(\left(\log \frac{\phi}{1 - \phi}\right) y + \log(1 - \phi)\right), the natural parameter is η = \log \frac{\phi}{1 - \phi}, so φ = 1/(1 + e^{-η}))
= \frac{1}{1 + e^{-\theta^T x}}   (Assumption 3)

which recovers the sigmoid hypothesis of logistic regression.
Softmax regression (multiclass classification): y can take one of k values, with φ_1, …, φ_k denoting the probability of each outcome and \sum_{i=1}^{k} \phi_i = 1. Define T(y) by (T(y))_i = 1\{y = i\}, so that E[(T(y))_i] = P(y = i) = \phi_i. The multinomial distribution can be written as

p(y; \phi) = \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{1\{y=k\}} = \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{\,1 - \sum_{i=1}^{k-1} 1\{y=i\}}

using \sum_{i=1}^{k} \phi_i = 1. Writing this in exponential family form and inverting the natural parameters gives

\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}}

the probability of class i! This mapping from η to the class probabilities is known as the softmax function.
Combining Assumption 2, the multinomial distribution expressed in exponential family form, and Assumption 3 (η_i = θ_i^T x), our hypothesis outputs the estimated probability p(y = i | x; θ) for every i ∈ {1, …, k}.
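A small sketch of the softmax response function and the resulting hypothesis, where Theta is a matrix holding one parameter vector θ_i per class (the names and toy values are illustrative):

```python
import numpy as np

def softmax(eta):
    """phi_i = exp(eta_i) / sum_j exp(eta_j), computed in a numerically stable way."""
    z = eta - np.max(eta)
    e = np.exp(z)
    return e / e.sum()

def softmax_hypothesis(Theta, x):
    """Estimated probabilities p(y = i | x; theta), i = 1..k.
    Theta has shape (k, n): one parameter vector per class, eta_i = theta_i^T x."""
    return softmax(Theta @ x)

Theta = np.array([[0.5, -1.0],
                  [0.0,  0.2],
                  [-0.3, 0.8]])   # 3 classes, 2 features (toy values)
x = np.array([1.0, 2.0])
print(softmax_hypothesis(Theta, x))  # three probabilities summing to 1
```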
Summary: linear regression and least squares; logistic regression, gradient methods, and Newton's method; generalized linear models built from exponential family distributions, where the hypothesis follows from the relationship between the expected response variable and the natural parameter η (with T(y) = y in most cases).