Regression
Machine Learning and Pattern Recognition
Chris Williams
School of Informatics, University of Edinburgh
September 2014
(All of the slides in this course have been adapted from previous versions by Charles Sutton, Amos Storkey, David Barber.)

Classification or Regression?
◮ Classification: want to learn a discrete target variable
◮ Regression: want to learn a continuous target variable
◮ Linear regression, linear-in-the-parameters models
◮ Linear regression is a conditional Gaussian model
◮ Maximum likelihood solution: ordinary least squares
◮ Can use nonlinear basis functions
◮ Ridge regression
◮ Full Bayesian treatment
◮ Reading: Murphy chapter 7 (not all sections needed), Barber (17.1, 17.2, 18.1.1)

One Dimensional Data
[Scatter plot of the one-dimensional example data.]

Linear Regression
◮ Simple example: one-dimensional linear regression.
◮ Suppose we have data of the form (x, y), and we believe the data should follow a straight line: the data should have a straight line fit of the form y = w_0 + w_1 x.
◮ However we also believe the target values y are subject to measurement error, which we will assume to be Gaussian. So y = w_0 + w_1 x + η, where η is a Gaussian noise term, mean 0, variance σ_η^2.
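To make the noise model concrete, here is a minimal numpy sketch (not part of the original slides) that generates data of exactly this form; the weights, noise level and sample size are arbitrary illustrative choices.

```python
# A minimal sketch (not from the slides): generating data from the model
# y = w_0 + w_1*x + eta, with eta ~ Gaussian(0, sigma_eta^2).  The weight
# values, noise level and sample size are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

w0, w1 = 0.5, 0.4           # assumed "true" intercept and slope
sigma_eta = 0.2             # assumed noise standard deviation
N = 30                      # number of data points

x = rng.uniform(-2.0, 3.0, size=N)         # inputs
eta = rng.normal(0.0, sigma_eta, size=N)   # Gaussian measurement noise
y = w0 + w1 * x + eta                      # noisy targets

print(np.column_stack([x, y])[:5])         # first few (x, y) pairs
```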

Generated Data
[Plot of data generated from the one-dimensional linear model. Figure credit: http://jedismedicine.blogspot.co.uk/2014/01/]
◮ Linear regression is just a conditional version of estimating a Gaussian (conditional on the input x)

Multivariate Case
◮ Consider the case where we are interested in y = f(x) for D-dimensional x: y = w_0 + w_1 x_1 + ... + w_D x_D + η, where η ∼ Gaussian(0, σ_η^2).
◮ Examples? Final grade depends on time spent on work for each tutorial.
◮ We set w = (w_0, w_1, ..., w_D)^T and introduce φ = (1, x^T)^T, then we can write y = w^T φ + η instead
◮ This implies p(y | φ, w) = N(y; w^T φ, σ_η^2)
◮ Assume that training data is iid, i.e., p(y_1, ..., y_N | x_1, ..., x_N, w) = ∏_{n=1}^N p(y_n | x_n, w)
◮ Given data {(x_n, y_n), n = 1, 2, ..., N}, the log likelihood is

  L(w) = log P(y_1, ..., y_N | x_1, ..., x_N, w)
       = -1/(2σ_η^2) Σ_{n=1}^N (y_n - w^T φ_n)^2 - (N/2) log(2πσ_η^2)

Minimizing Squared Error
◮ The log likelihood can be rewritten as

  L(w) = -1/(2σ_η^2) Σ_{n=1}^N (y_n - w^T φ_n)^2 - (N/2) log(2πσ_η^2)
       = -C_1 Σ_{n=1}^N (y_n - w^T φ_n)^2 - C_2

  where C_1 > 0 and C_2 don't depend on w. Now:
◮ Multiplying by a positive constant doesn't change the maximum.
◮ Adding a constant doesn't change the maximum.
◮ Σ_{n=1}^N (y_n - w^T φ_n)^2 is the sum of squared errors made if you use w.
◮ So maximizing the likelihood is the same as minimizing the total squared error of the linear predictor.
◮ So you don't have to believe the Gaussian assumption. You can simply believe that you want to minimize the squared error.
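A small numerical check of the argument above (my own illustration, not from the slides): since the log likelihood is a positive constant times the negative sum of squared errors, plus a constant, ranking candidate weight vectors by likelihood and by squared error gives the same answer.

```python
# Sketch: with Gaussian noise, the log likelihood is
# -SSE(w)/(2*sigma_eta^2) - (N/2)*log(2*pi*sigma_eta^2), so ranking weight
# vectors by log likelihood is the same as ranking them by squared error.
# Data and the two candidate weight vectors are illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
sigma_eta = 0.2
N = 50
x = rng.uniform(-2.0, 3.0, size=N)
y = 0.5 + 0.4 * x + rng.normal(0.0, sigma_eta, size=N)

Phi = np.column_stack([np.ones(N), x])     # rows are phi_n = (1, x_n)^T

def sse(w):
    """Sum of squared errors for weights w."""
    r = y - Phi @ w
    return r @ r

def log_lik(w):
    """Gaussian log likelihood L(w) from the slide above."""
    return -sse(w) / (2 * sigma_eta**2) - (N / 2) * np.log(2 * np.pi * sigma_eta**2)

w_a = np.array([0.5, 0.4])    # close to the generating weights
w_b = np.array([0.0, 1.0])    # a worse candidate

print(sse(w_a) < sse(w_b))          # True: w_a has smaller squared error...
print(log_lik(w_a) > log_lik(w_b))  # True: ...and therefore higher likelihood
```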

Maximum Likelihood Solution I
◮ Write Φ = (φ_1, φ_2, ..., φ_N)^T, and y = (y_1, y_2, ..., y_N)^T.
◮ Φ is called the design matrix; it has N rows, one for each example.
◮ In this notation the log likelihood is

  L(w) = -1/(2σ_η^2) (y - Φw)^T (y - Φw) - C_2

◮ Take derivatives of the log likelihood:

  ∇_w L(w) = -(1/σ_η^2) Φ^T (Φw - y)

Maximum Likelihood Solution II
◮ Setting the derivatives to zero to find the minimum gives Φ^T Φ ŵ = Φ^T y.
◮ This means the maximum likelihood ŵ is given by

  ŵ = (Φ^T Φ)^{-1} Φ^T y

  The matrix (Φ^T Φ)^{-1} Φ^T is called the pseudo-inverse.
◮ This is the ordinary least squares (OLS) solution for w.
◮ The MLE for the variance is

  σ̂_η^2 = (1/N) Σ_{n=1}^N (y_n - w^T φ_n)^2

  i.e. the average of the squared residuals.

Generated Data
[Plot of the generated data with the fitted line. The black line is the maximum likelihood fit to the data.]

Nonlinear regression
◮ All this just used φ.
◮ We chose to put the x values in φ, but we could have put anything in there, including nonlinear transformations of the x values.
◮ In fact we can choose any useful form for φ, so long as the model remains linear in the parameters w. We can even change the size (the number of features).
◮ We already have the maximum likelihood solution in the case of Gaussian noise: the pseudo-inverse solution.
◮ Models of this form are called general linear models or linear-in-the-parameters models.
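The OLS solution and the variance estimate translate directly into a few lines of numpy. The sketch below uses synthetic data with illustrative generating weights; in practice one would use a linear solver, np.linalg.lstsq or np.linalg.pinv rather than forming the inverse explicitly.

```python
# Sketch of the OLS / maximum-likelihood computation above, on synthetic
# data (generating weights and noise level are illustrative choices).
import numpy as np

rng = np.random.default_rng(2)
N = 100
x = rng.uniform(-2.0, 3.0, size=N)
y = 0.5 + 0.4 * x + rng.normal(0.0, 0.2, size=N)

# Design matrix: N rows, one per example, each row phi_n = (1, x_n)^T.
Phi = np.column_stack([np.ones(N), x])

# w_hat = (Phi^T Phi)^{-1} Phi^T y.  Solving the normal equations (or using
# np.linalg.lstsq / np.linalg.pinv) is preferable to forming the inverse.
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# MLE of the noise variance: the average of the squared residuals.
residuals = y - Phi @ w_hat
sigma2_hat = np.mean(residuals**2)

print("w_hat =", w_hat, " sigma2_hat =", sigma2_hat)
```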

Example: polynomial fitting
◮ Model y = w_1 + w_2 x + w_3 x^2 + w_4 x^3.
◮ Set φ = (1, x, x^2, x^3)^T and w = (w_1, w_2, w_3, w_4)^T.
◮ Can immediately write down the ML solution: w = (Φ^T Φ)^{-1} Φ^T y, where Φ and y are defined as before.
◮ Could use any features we want: e.g. features that are only active in certain local regions (radial basis functions, RBFs).
[Figure credit: David Barber, BRML Fig 17.6]

Dimensionality issues
◮ How many radial basis functions do we need?
◮ Suppose we need only three per dimension.
◮ Then we would need 3^D for a D-dimensional problem.
◮ This becomes large very fast: this is commonly called the curse of dimensionality.
◮ Gaussian processes (see later) can help with these issues.

Higher dimensional outputs
◮ Suppose the target values are vectors.
◮ Then we introduce a different w_i for each output y_i.
◮ Then we can do regression independently in each of those cases.

Adding a Prior
◮ Put a prior over the parameters, e.g.

  p(y | φ, w) = N(y; w^T φ, σ_η^2)
  p(w) = N(w; 0, τ^2 I)

◮ I is the identity matrix.
◮ The log posterior is

  log p(w | D) = const - 1/(2σ_η^2) Σ_{n=1}^N (y_n - w^T φ_n)^2 - (N/2) log(2πσ_η^2)
                 - 1/(2τ^2) w^T w - (D/2) log(2πτ^2)

  where the -1/(2τ^2) w^T w term is a penalty on large weights.
◮ The MAP solution can be computed analytically; the derivation is almost the same as with MLE. With λ = σ_η^2/τ^2,

  w_MAP = (Φ^T Φ + λI)^{-1} Φ^T y

  This is called ridge regression.
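Returning to the basis-function slides above, the sketch below builds two alternative feature maps, a cubic polynomial and a small bank of Gaussian RBFs, and fits each with the same pseudo-inverse solution. The target function, RBF centres and widths are assumptions made purely for illustration.

```python
# Sketch: the same pseudo-inverse solution applies for any choice of phi,
# here a cubic polynomial and a set of Gaussian radial basis functions.
# The data, RBF centres and widths are illustrative, not from the lecture.
import numpy as np

rng = np.random.default_rng(3)
N = 60
x = rng.uniform(-2.0, 2.0, size=N)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.1, size=N)   # arbitrary nonlinear target

def poly_features(x, degree=3):
    """Rows are phi(x) = (1, x, x^2, ..., x^degree)."""
    return np.column_stack([x**d for d in range(degree + 1)])

def rbf_features(x, centres, width=0.5):
    """Rows of Gaussian bumps exp(-(x - c)^2 / (2 width^2)), one per centre."""
    return np.exp(-(x[:, None] - centres[None, :])**2 / (2.0 * width**2))

for Phi in (poly_features(x), rbf_features(x, centres=np.linspace(-2, 2, 9))):
    w = np.linalg.pinv(Phi) @ y                 # (Phi^T Phi)^{-1} Phi^T y
    print(Phi.shape[1], "features, training SSE:", float(np.sum((y - Phi @ w)**2)))
```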

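The MAP formula on the "Adding a Prior" slide is equally short in code. The sketch below fits a deliberately flexible degree-14 polynomial with and without the ridge penalty, in the spirit of the regularized fit discussed next; the noise level and τ are illustrative guesses.

```python
# Sketch of the ridge / MAP solution from the "Adding a Prior" slide:
# w_MAP = (Phi^T Phi + lambda*I)^{-1} Phi^T y with lambda = sigma_eta^2/tau^2.
# The degree, noise level and tau are illustrative guesses.
import numpy as np

rng = np.random.default_rng(4)
N = 25
x = rng.uniform(-1.0, 1.0, size=N)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.1, size=N)

degree = 14                                       # deliberately flexible model
Phi = np.column_stack([x**d for d in range(degree + 1)])

sigma_eta2 = 0.1**2
tau2 = 1.0
lam = sigma_eta2 / tau2                           # ridge parameter lambda

D = Phi.shape[1]
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)
w_ml = np.linalg.lstsq(Phi, y, rcond=None)[0]     # unregularized fit, for comparison

print("||w_ml||_2  =", np.linalg.norm(w_ml))
print("||w_map||_2 =", np.linalg.norm(w_map))     # typically much smaller
```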
Effect of Ridge Regression
◮ Collecting constant terms from the log posterior on the last slide:

  log p(w | D) = const - 1/(2σ_η^2) Σ_{n=1}^N (y_n - w^T φ_n)^2 - 1/(2τ^2) w^T w

  where the second term is the penalty term on ||w||_2^2.
◮ This is called ℓ2 regularization or weight decay. The second term is the squared Euclidean (also called ℓ2) norm of w.
◮ The idea is to reduce overfitting by forcing the function to be simple. The simplest possible function is the constant w = 0, so encourage ŵ to be closer to that.
◮ τ is a parameter of the method. It trades off between how well you fit the training data and how simple the method is. Most commonly it is set via cross validation.
◮ Regularization is a general term for adding a "second term" to an objective function to encourage simple models.

Effect of Ridge Regression (Graphic)
[Degree-14 polynomial fit with and without regularization (ln lambda = -20.135 vs. ln lambda = -8.571). Figure credit: Murphy Fig 7.7]

Why Ridge Regression Works (Graphic)
[Contours in weight space (axes u_1, u_2) showing the ML estimate, the MAP estimate, and the prior mean. Figure credit: Murphy Fig 7.9]

Bayesian Regression
◮ Bayesian regression model:

  p(y | φ, w) = N(y; w^T φ, σ_η^2)
  p(w) = N(w; 0, τ^2 I)

◮ Possible to compute the posterior distribution analytically, because linear Gaussian models are jointly Gaussian (see Murphy § 7.6.1 for details):

  p(w | Φ, y, σ_η^2) ∝ p(w) p(y | Φ, w, σ_η^2) = N(w | w_N, V_N)
  w_N = (1/σ_η^2) V_N Φ^T y
  V_N = σ_η^2 (σ_η^2/τ^2 I + Φ^T Φ)^{-1}
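A sketch of computing the posterior mean and covariance from the formulas on the Bayesian Regression slide, again on synthetic data with assumed hyperparameters:

```python
# Sketch of the Bayesian posterior over weights:
# V_N = sigma_eta^2 * (sigma_eta^2/tau^2 * I + Phi^T Phi)^{-1},
# w_N = (1/sigma_eta^2) * V_N Phi^T y.
# Data and hyperparameters are illustrative choices, not from the lecture.
import numpy as np

rng = np.random.default_rng(5)
N = 30
x = rng.uniform(-2.0, 3.0, size=N)
y = 0.5 + 0.4 * x + rng.normal(0.0, 0.2, size=N)
Phi = np.column_stack([np.ones(N), x])

sigma_eta2 = 0.2**2      # assumed known noise variance
tau2 = 1.0               # prior variance of the weights

D = Phi.shape[1]
V_N = sigma_eta2 * np.linalg.inv((sigma_eta2 / tau2) * np.eye(D) + Phi.T @ Phi)
w_N = V_N @ Phi.T @ y / sigma_eta2

print("posterior mean w_N:", w_N)
print("posterior covariance V_N:\n", V_N)
```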

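Before turning to predictions, one way to visualise this posterior (as in the "functions sampled from posterior" panels shown below) is to draw weight vectors from N(w_N, V_N) and evaluate the corresponding lines; a minimal self-contained sketch:

```python
# Sketch: drawing weight vectors from the posterior N(w_N, V_N) and
# evaluating the corresponding regression lines, in the spirit of the
# "functions sampled from posterior" figure panels.  Data and
# hyperparameters are the same illustrative choices as above.
import numpy as np

rng = np.random.default_rng(6)
N = 30
x = rng.uniform(-2.0, 3.0, size=N)
y = 0.5 + 0.4 * x + rng.normal(0.0, 0.2, size=N)
Phi = np.column_stack([np.ones(N), x])

sigma_eta2, tau2 = 0.2**2, 1.0
D = Phi.shape[1]
V_N = sigma_eta2 * np.linalg.inv((sigma_eta2 / tau2) * np.eye(D) + Phi.T @ Phi)
w_N = V_N @ Phi.T @ y / sigma_eta2

x_grid = np.linspace(-8.0, 8.0, 9)
Phi_grid = np.column_stack([np.ones_like(x_grid), x_grid])

w_samples = rng.multivariate_normal(w_N, V_N, size=5)   # 5 plausible weight vectors
for w in w_samples:
    print(np.round(Phi_grid @ w, 2))    # one sampled regression line per row
```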
Making predictions
◮ For a new test point x* with corresponding feature vector φ*, we have that

  f(x*) = w^T φ* + η,  where w ∼ N(w_N, V_N).

◮ Hence

  p(y* | x*, D) = N(y*; w_N^T φ*, (φ*)^T V_N φ* + σ_η^2)

Example of Bayesian Regression
[Likelihood, prior/posterior and data-space panels for a two-parameter model (w_0, w_1), updated as data arrive. Figure credit: Murphy Fig 7.11]

Another Example
[Fitting a quadratic: plugin approximation (MLE) vs. posterior predictive (known variance), and functions sampled from the plugin approximation vs. functions sampled from the posterior. Notice how the error bars get larger further away from training data. Figure credit: Murphy Fig 7.12]

Summary
◮ Linear regression is a conditional Gaussian model
◮ Maximum likelihood solution: ordinary least squares
◮ Can use nonlinear basis functions
◮ Ridge regression
◮ Full Bayesian treatment
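Finally, a sketch of the predictive formula from the "Making predictions" slide. It also shows the error bars growing for test points far from the training inputs, as noted under the figures above; the data and hyperparameters are again illustrative choices.

```python
# Sketch of the predictive distribution:
# p(y* | x*, D) = N(w_N^T phi*, phi*^T V_N phi* + sigma_eta^2).
# Training inputs lie in [-2, 3], so the predictive standard deviation
# grows for test points far from that range.
import numpy as np

rng = np.random.default_rng(7)
N = 30
x = rng.uniform(-2.0, 3.0, size=N)
y = 0.5 + 0.4 * x + rng.normal(0.0, 0.2, size=N)
Phi = np.column_stack([np.ones(N), x])

sigma_eta2, tau2 = 0.2**2, 1.0
D = Phi.shape[1]
V_N = sigma_eta2 * np.linalg.inv((sigma_eta2 / tau2) * np.eye(D) + Phi.T @ Phi)
w_N = V_N @ Phi.T @ y / sigma_eta2

for x_star in [0.5, 3.0, 8.0]:             # increasingly far from the data
    phi_star = np.array([1.0, x_star])
    mean = w_N @ phi_star
    var = phi_star @ V_N @ phi_star + sigma_eta2
    print(f"x* = {x_star:4.1f}   mean = {mean:6.3f}   std = {np.sqrt(var):.3f}")
```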
