

  1. EE226 Big Data Mining, Lecture 4: Supervised Learning. Liyao Xiang (http://xiangliyao.cn/), Shanghai Jiao Tong University

  2. Reference and Acknowledgement • Most of the course materials are credited to Andrew Ng’s CS229 lecture notes.

  3. Outline • Linear Regression ( 线性回归 ) • Classification and Logistic Regression ( 逻辑回归 ) • Generalized Linear Models

  4. Outline • Linear Regression ( 线性回归 ) • Classification and Logistic Regression ( 逻辑回归 ) • Generalized Linear Models

  5. Supervised Learning Example Revisited
  • (x^(i), y^(i)): a training example
  • {(x^(i), y^(i)); i = 1, …, m}: training set
  • x^(i) ∈ X: input variables
  • y^(i) ∈ Y: output variables
  • h: X ↦ Y: hypothesis ( 假设函数 )
  [Figure: predicted price (million RMB) vs. size (m²); a testing example at 75 m²]

  6. Supervised Learning Example Revisited
  Let's consider a richer dataset in which we also know the number of bedrooms in each apartment:

    Size (m²)  #bedrooms  Price (million ¥)
       40          0            1.2
       65          1            1.9
       80          2            2.2
       89          2            3.3
      120          3            5.3
       …           …             …

  • x: two-dimensional vectors in R²
  • x_1^(i): the size of the i-th apartment in the training set
  • x_2^(i): the number of bedrooms of the i-th apartment in the training set
  • We choose the hypothesis h as a linear function: h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2
  • θ_i: parameters/weights of h
  • By letting x_0 = 1, we rewrite h as h(x) = Σ_{i=0}^{n} θ_i x_i = θ^T x
  • Why a linear function?
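The linear hypothesis above is a one-line dot product in code. A minimal NumPy sketch, using the toy apartment dataset from the slide (sizes, bedroom counts, prices), with the intercept term x_0 = 1 prepended so that h(x) = θ^T x:

```python
import numpy as np

# Toy data from the slide: columns are [size, #bedrooms]; targets are
# prices in million RMB.
X = np.array([[40, 0], [65, 1], [80, 2], [89, 2], [120, 3]], dtype=float)
y = np.array([1.2, 1.9, 2.2, 3.3, 5.3])

# Prepend the intercept term x_0 = 1 to each example.
X = np.hstack([np.ones((X.shape[0], 1)), X])

def h(theta, x):
    """Linear hypothesis h_theta(x) = theta^T x (works row-wise on a matrix)."""
    return x @ theta

theta = np.zeros(3)   # theta_0, theta_1, theta_2
preds = h(theta, X)   # with all-zero weights, every prediction is 0
```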

  7. Supervised Learning Example Revisited (same dataset as the previous slide)
  • By letting x_0 = 1, we rewrite h as h(x) = Σ_{i=0}^{n} θ_i x_i = θ^T x
  • How can we learn θ? By making h(x) close to y for the training examples!
  • Cost function ( 损失函数 ): J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
  • Why a least-squares cost?
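The least-squares cost J(θ) can be sketched directly from its definition, assuming a design matrix X whose first column is the intercept term x_0 = 1 (the data below is a made-up fragment of the apartment example):

```python
import numpy as np

def lstsq_cost(theta, X, y):
    """Least-squares cost J(theta) = 1/2 * sum_i (theta^T x_i - y_i)^2."""
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)

# Sanity check: with theta = 0 every prediction is 0, so the cost is
# half the sum of squared targets.
X = np.array([[1.0, 40.0], [1.0, 65.0], [1.0, 80.0]])
y = np.array([1.2, 1.9, 2.2])
J0 = lstsq_cost(np.zeros(2), X, y)
```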

  8. Least-Mean-Square Algorithm
  • How to choose θ to minimize J(θ)? Start with some "initial guess" for θ, and repeatedly apply gradient descent ( 梯度下降 ) to make J(θ) smaller:
    θ_j := θ_j − α ∂J(θ)/∂θ_j  (the direction of steepest decrease of J; α is the learning rate)
  • What is the partial derivative ( 偏导数 ) term? For a single example it gives the least-mean-square update rule:
    θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i), where (y^(i) − h_θ(x^(i))) is the error term
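The single-example LMS update rule is a one-liner. A minimal sketch (NumPy assumed; x carries the intercept term x_0 = 1, and the learning rate alpha is an arbitrary small value):

```python
import numpy as np

def lms_step(theta, x, y, alpha):
    """Single-example LMS update:
    theta := theta + alpha * (y - theta^T x) * x."""
    return theta + alpha * (y - x @ theta) * x

# One update moves the prediction for (x, y) toward the target y = 3.
x = np.array([1.0, 2.0])              # x_0 = 1, then one feature
theta = lms_step(np.zeros(2), x, 3.0, alpha=0.1)
```

Starting from θ = 0 the error is 3, so the step adds 0.1 · 3 · x = [0.3, 0.6] to θ.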

  9. Least-Mean Square Alg • Two ways to modify the method: • batch gradient descent: scan through the entire training set before taking a single step • stochastic gradient descent: update parameters according to the gradient of the error w.r.t. a single training example
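The batch/stochastic distinction described above can be sketched as two small functions (NumPy assumed; X is a design matrix with an intercept column, and alpha is taken small enough for stability on the toy data):

```python
import numpy as np

def batch_gd_step(theta, X, y, alpha):
    # Batch: accumulate the gradient over ALL m examples, then update once.
    return theta + alpha * X.T @ (y - X @ theta)

def sgd_epoch(theta, X, y, alpha):
    # Stochastic: update theta after looking at EACH example in turn.
    for x_i, y_i in zip(X, y):
        theta = theta + alpha * (y_i - x_i @ theta) * x_i
    return theta

# Both variants reduce the squared error on a tiny linear dataset.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])
t_batch = batch_gd_step(np.zeros(2), X, y, alpha=0.1)
t_sgd = sgd_epoch(np.zeros(2), X, y, alpha=0.1)
```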

  10. Convergence
  • In most cases, gradient descent converges to a local minimum. Linear regression has a single global minimum, to which gradient descent always converges. This is because the cost function J is a convex quadratic function ( 二次凸函数 ).
  [Figure: contour plot ( 等高线 ) of the cost over θ; the gradient-descent trajectory reaches the global minimum]

  11. Normal Equations ( 标准方程 )
  • Gradient descent gives one way of minimizing J. How about others?
  • We minimize J by explicitly taking its derivatives w.r.t. θ, setting them to zero, and solving the resulting equations.
  • Notation: for a function f: R^{m×n} ↦ R and an m×n matrix A, the gradient ∇_A f(A) is the m×n matrix whose (i, j) entry is ∂f/∂A_ij.

  12. Normal Equations ( 标准方程 )
  Some facts about the trace ( 迹 ) operator:
  1. tr a = a, i.e., the trace of a real number is itself
  2. tr A = tr A^T, i.e., the trace of a matrix equals the trace of its transpose ( 转置矩阵 )
  3. tr AB = tr BA, and tr ABC = tr CAB = tr BCA
  4. ∇_A tr AB = B^T
  5. ∇_A tr ABA^T C = CAB + C^T A B^T

  13. Normal Equations ( 标准方程 )
  Write J(θ) = (1/2)(Xθ − y)^T(Xθ − y). Then
  ∇_θ J(θ) = (1/2) ∇_θ tr(θ^T X^T X θ − θ^T X^T y − y^T X θ + y^T y)  (Property 1)
           = (1/2)(X^T X θ + X^T X θ − 2 X^T y)                      (Properties 2-5)
           = X^T X θ − X^T y = 0
  Solving gives the normal equations X^T X θ = X^T y, i.e., θ = (X^T X)^{−1} X^T y.
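The closed-form solution of the normal equations is a single linear solve. A sketch on synthetic data (NumPy assumed; the data and the "true" parameter vector are fabricated for the check, and the data is noise-free so recovery is exact):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
true_theta = np.array([1.0, 2.0, -0.5])
y = X @ true_theta                       # noise-free targets

# Solve X^T X theta = X^T y (a linear solve is preferable to forming
# the explicit inverse (X^T X)^{-1}).
theta = np.linalg.solve(X.T @ X, X.T @ y)
```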

  14. Probabilistic View
  • The target variables and the inputs are related by y^(i) = θ^T x^(i) + ε^(i), where ε^(i) is an error term
  • Assume the ε^(i) are distributed IID (independently and identically distributed, 独立同分布 ) with ε^(i) ∼ N(0, σ²)
  • This implies p(y^(i) | x^(i); θ) = (1/(√(2π) σ)) exp( −(y^(i) − θ^T x^(i))² / (2σ²) )
  • Given X and θ, what is the distribution of the y^(i)'s? The likelihood function: L(θ) = Π_{i=1}^{m} p(y^(i) | x^(i); θ)

  15. Probabilistic View
  • Maximum likelihood: we should choose θ to make the data as probable as possible
  • Equivalently, we maximize the log-likelihood:
    ℓ(θ) = log L(θ) = m log(1/(√(2π) σ)) − (1/σ²) · (1/2) Σ_{i=1}^{m} (y^(i) − θ^T x^(i))²
  • Maximizing ℓ(θ) is therefore the same as minimizing (1/2) Σ_{i=1}^{m} (y^(i) − θ^T x^(i))², the original least-squares cost!

  16. Underfitting & Overfitting
  • Fitting different hypotheses to the same data:
    y = θ_0 + θ_1 x  (underfitting)
    y = θ_0 + θ_1 x + θ_2 x²
    y = Σ_{j=0}^{5} θ_j x^j  (overfitting)
  • Adding more features can only improve the fit on the training data. However, there is also a risk in adding too many features: the model may fit the noise rather than the structure.
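The trade-off above can be seen numerically: fitting polynomials of growing degree to roughly linear data, the training error only shrinks with degree, even though the high-degree fit is chasing noise. A sketch with fabricated data (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 8)
y = 1.0 + 2.0 * x + rng.normal(scale=0.05, size=x.size)  # noisy linear data

train_err = {}
for degree in (1, 2, 5):
    # Least-squares polynomial fit of the given degree.
    coef = np.polynomial.polynomial.polyfit(x, y, degree)
    fit = np.polynomial.polynomial.polyval(x, coef)
    train_err[degree] = np.sum((fit - y) ** 2)
# Training error is non-increasing in the degree (nested model classes),
# but low training error alone says nothing about generalization.
```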

  17. Locally Weighted Linear Regression
  • The choice of features is important to learning performance!
  • Locally weighted linear regression, to predict at a query point x:
    1. Fit θ to minimize Σ_i w^(i) (y^(i) − θ^T x^(i))²
    2. Output θ^T x
  • A larger w^(i) means trying harder to make (y^(i) − θ^T x^(i))² small; otherwise, the corresponding error term is mostly ignored
  • Standard choice for the weight: w^(i) = exp( −(x^(i) − x)² / (2τ²) ), i.e., θ gives a higher weight to the training examples close to the test input x
  • Non-parametric algorithm: the entire training dataset is kept when making predictions
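The two steps above can be sketched with a weighted version of the normal equations. A minimal sketch (NumPy assumed; the design matrix X is assumed to carry an intercept column x_0 = 1 followed by the raw inputs, and τ = 0.5 is an arbitrary bandwidth):

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Locally weighted linear regression prediction at x_query.
    Fits theta by solving the weighted normal equations
    X^T W X theta = X^T W y, with Gaussian weights
    w_i = exp(-||x_i - x||^2 / (2 tau^2)), then outputs theta^T x_query."""
    diffs = X[:, 1:] - x_query[1:]          # distances of inputs to the query
    w = np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * tau ** 2))
    theta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return x_query @ theta

# Exactly linear data: the local fit recovers the global line y = 2x + 1,
# so the prediction at x = 2 is 5.
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = 2.0 * np.arange(5.0) + 1.0
pred = lwr_predict(np.array([1.0, 2.0]), X, y)
```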

  18. Summary
  • Linear regression
    • Linear hypothesis class: h(x) = Σ_{i=0}^{n} θ_i x_i = θ^T x
    • Cost function: J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
    • Least-mean-square algorithm: batch/stochastic gradient descent
  • Probabilistic view:
    • Errors ∼ IID Gaussian distribution
    • Maximum likelihood
  • Overfitting & underfitting
  • Locally weighted linear regression

  19. Outline • Linear Regression ( 线性回归 ) • Classification and Logistic Regression ( 逻辑回归 ) • Generalized Linear Models

  20. Binary Classification
  • The target y can only take two values: y ∈ {−1, +1}. y = 1 if the example belongs to the positive class; otherwise, it is a member of the negative class
  • Hypothesis: h(x) = θ^T x. Given x, we classify it as positive or negative depending on the sign of θ^T x, i.e., sign(θ^T x) = y ⟺ y θ^T x > 0
  • Margin for the example (x, y): y θ^T x. The more negative (or positive) θ^T x is, the stronger the belief that y is negative (or positive)
  • Loss function: should penalize a θ for which y^(i) θ^T x^(i) < 0 frequently on the training data. The loss value should be small if y^(i) θ^T x^(i) > 0 and large if y^(i) θ^T x^(i) < 0
  • We expect the loss function to be continuous and convex (easy to converge to the global minimum!)

  21. Binary Classification
  • Expect the loss to satisfy: Loss(y^(i) θ^T x^(i)) → 0 as y^(i) θ^T x^(i) → ∞, and Loss(y^(i) θ^T x^(i)) → ∞ as y^(i) θ^T x^(i) → −∞
  • Logistic regression: Loss_logistic(z) = log(1 + e^(−z))
  • Support vector machines: Loss_hinge(z) = max{1 − z, 0}
  • Boosting: Loss_exp(z) = e^(−z)
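The three margin-based losses can be sketched as one-liners (NumPy assumed), all decreasing in the margin z = y θ^T x as the slide requires:

```python
import numpy as np

def loss_logistic(z):
    return np.log(1.0 + np.exp(-z))      # logistic regression

def loss_hinge(z):
    return np.maximum(1.0 - z, 0.0)      # support vector machines

def loss_exp(z):
    return np.exp(-z)                    # boosting

# All three vanish for large positive margins and grow without bound
# for large negative margins (the hinge grows linearly).
```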

  22. Logistic Regression
  • Choose θ to minimize
    J(θ) = (1/m) Σ_{i=1}^{m} Loss_logistic(y^(i) θ^T x^(i)) = (1/m) Σ_{i=1}^{m} log(1 + exp(−y^(i) θ^T x^(i)))
    which hopefully yields a θ such that y^(i) θ^T x^(i) > 0 for most training examples
  • Alternative view: the logistic (sigmoid) function g(z) = 1/(1 + e^(−z))
  • g(z) → 1 as z → ∞ and g(z) → 0 as z → −∞
  • Since g(z) + g(−z) = 1, we can use it to define the probability model for binary classification
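The sigmoid and the average logistic cost translate directly into code. A minimal sketch for labels y ∈ {−1, +1} (NumPy assumed; X is a design matrix with one example per row):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """J(theta) = 1/m * sum_i log(1 + exp(-y_i * theta^T x_i)), y_i in {-1,+1}."""
    margins = y * (X @ theta)
    return np.mean(np.log(1.0 + np.exp(-margins)))

# At theta = 0 every margin is 0, so the cost is exactly log 2.
X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, -1.0])
J0 = logistic_cost(np.zeros(2), X, y)
```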

  23. Probabilistic View
  • For y ∈ {−1, +1}, we define the logistic model as p(Y = y | x; θ) = g(y x^T θ) = 1/(1 + e^(−y x^T θ)), and refine the hypothesis class as h_θ(x) = 1/(1 + e^(−x^T θ))
  • The likelihood of the training data is L(θ) = Π_{i=1}^{m} p(Y = y^(i) | x^(i); θ)
  • The log-likelihood is ℓ(θ) = −Σ_{i=1}^{m} log(1 + exp(−y^(i) x^(i)T θ))
  • Maximizing the likelihood in the logistic model = minimizing the average logistic loss

  24. Gradient Descent
  • For Loss_logistic(z) = log(1 + e^(−z)), the derivative (in terms of the sigmoid function) is
    d/dz Loss_logistic(z) = −e^(−z) / (1 + e^(−z)) = −g(−z)
  • For a single training example (x, y):
    ∂/∂θ_k Loss_logistic(y x^T θ) = −g(−y x^T θ) · ∂(y x^T θ)/∂θ_k = −g(−y x^T θ) y x_k
  • Update rule for stochastic gradient descent:
    θ^(t+1) = θ^t − α_t · ∇_θ Loss_logistic(y^(i) x^(i)T θ^t)
    An incorrectly classified example (y^(i) θ^T x^(i) < 0) gives g(−y^(i) x^(i)T θ^t) close to 1, and hence a large update.

  25. Update Rule when y ∈ {0, 1}
  • P(y = 1 | x; θ) = h_θ(x) = 1/(1 + e^(−θ^T x)), and P(y = 0 | x; θ) = 1 − h_θ(x)
  • Compactly: p(y | x; θ) = (h_θ(x))^y (1 − h_θ(x))^(1−y)
  • Gradient ascent on the log-likelihood gives θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)
  • Similar to the least-mean-square update rule, but h is non-linear!
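The gradient-ascent update for y ∈ {0, 1} can be sketched in batch form (NumPy assumed; the tiny separable dataset, step size, and iteration count below are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent_step(theta, X, y, alpha):
    """One batch gradient-ascent step on the log-likelihood for y in {0,1}:
    theta_j := theta_j + alpha * sum_i (y_i - h_theta(x_i)) * x_ij."""
    return theta + alpha * X.T @ (y - sigmoid(X @ theta))

# Separable 1-D data (with an intercept column): repeated ascent steps
# drive the predicted probabilities toward the labels.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.zeros(2)
for _ in range(200):
    theta = gradient_ascent_step(theta, X, y, alpha=0.5)
preds = (sigmoid(X @ theta) > 0.5).astype(float)
```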

  26. Another Update Rule to Maximize ℓ(θ)
  • Newton's method for finding a zero of a function: solve f(θ) = 0
  • Update rule: θ := θ − f(θ)/f′(θ)

  27. Another Update Rule to Maximize ℓ(θ)
  • Newton's method finds a zero of a function: f(θ) = 0
  • What if we want to maximize the log-likelihood ℓ(θ)? Its maxima correspond to points where the first derivative is 0, so apply Newton's method to ℓ′(θ) = 0
  • Update rule: θ := θ − ℓ′(θ)/ℓ″(θ)
  • Multidimensional setting: θ := θ − H^(−1) ∇_θ ℓ(θ), where H is the Hessian matrix
  • Advantage: Newton's method typically enjoys faster convergence than gradient descent, and requires far fewer iterations to get very close to the optimum
  • Disadvantage: each iteration is more expensive
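The multidimensional Newton update is one linear solve per iteration. A minimal sketch (NumPy assumed; the quadratic objective below is fabricated to make the behavior easy to check, since on a quadratic a single Newton step lands exactly on the optimum):

```python
import numpy as np

def newton_step(theta, grad, hess):
    """One Newton update: theta := theta - H^{-1} * gradient
    (computed as a linear solve rather than an explicit inverse)."""
    return theta - np.linalg.solve(hess(theta), grad(theta))

# Quadratic objective f(theta) = 1/2 theta^T A theta - b^T theta:
# gradient A theta - b, Hessian A, optimum A^{-1} b = [1, 1].
A = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 4.0])
theta = newton_step(np.zeros(2), lambda t: A @ t - b, lambda t: A)
```

This illustrates the convergence claim on the slide: where gradient descent would take many small steps, Newton's method uses curvature information to jump straight to the quadratic's minimizer.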

  28. Summary
  • Logistic regression
    • Hypothesis: h(x) = θ^T x
    • Cost function: Loss_logistic(z) = log(1 + e^(−z))
    • Update rule: θ^(t+1) = θ^t − α_t · ∇_θ Loss_logistic(y^(i) x^(i)T θ^t)
    • Newton's method: θ := θ − ℓ′(θ)/ℓ″(θ)
  • Probabilistic view:
    • Maximizing the likelihood in the logistic model = minimizing the average logistic loss
