
ECE 6254 - Spring 2020 - Lecture 24 v1.0 - revised April 11, 2020

Regression and regularization

Matthieu R. Bloch

We now turn our attention to the problem of regression, which corresponds to the supervised learning setting when $\mathcal{Y} = \mathbb{R}$. Said differently, we no longer attempt to learn a discrete label, as in classification, but a continuously varying one. Classification is a special case of regression, but the discrete nature of labels lends itself to specific insights and analysis, which is why we studied it separately. Looking at regression will require the introduction of new concepts and will allow us to obtain new insights into the learning problem.

1 From classification to regression

As a refresher, the supervised learning problem we are interested in consists in using a labeled dataset $\{(x_i, y_i)\}_{i=1}^N$, $x_i \in \mathbb{R}^d$, to predict the labels of unseen data. In classification, $y_i \in \mathcal{Y} \subset \mathbb{R}$ with $|\mathcal{Y}| \triangleq K < \infty$, while in regression $y_i \in \mathcal{Y} = \mathbb{R}$. Our regression model is that the relation between label and data is of the form $y = f(x) + n$ with $f \in \mathcal{H}$, where $\mathcal{H}$ is a class of functions (polynomials, splines, kernels, etc.) and $n$ is some random noise.

Definition 1.1 (Linear regression). Linear regression corresponds to the situation in which $\mathcal{H}$ is the set of affine functions
$$f(x) \triangleq \beta^\top x + \beta_0 \quad\text{with}\quad \beta \triangleq [\beta_1, \cdots, \beta_d]^\top. \tag{1}$$

Definition 1.2 (Least square regression). Least square regression corresponds to the situation in which the loss function is the sum of square errors
$$\mathrm{SSE}(\beta, \beta_0) \triangleq \sum_{i=1}^N (y_i - \beta^\top x_i - \beta_0)^2. \tag{2}$$

Linear least square regression is a widely used technique in applied mathematics, which can be traced back to the work of Legendre in Nouvelles méthodes pour la détermination des orbites des comètes (1805) and Gauss in Theoria Motus (1809, but claim to discovery in 1795).

We will make a change of notation to simplify our analysis moving forward. We set
$$\theta \triangleq \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{bmatrix} \in \mathbb{R}^{d+1}, \qquad y \triangleq \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \in \mathbb{R}^{N}, \qquad X \triangleq \begin{bmatrix} 1 & x_1^\top \\ 1 & x_2^\top \\ \vdots & \vdots \\ 1 & x_N^\top \end{bmatrix} \in \mathbb{R}^{N \times (d+1)}, \tag{3}$$
which allows us to rewrite the sum of square errors as
$$\mathrm{SSE}(\theta) \triangleq \|y - X\theta\|_2^2. \tag{4}$$
One of the reasons that makes linear least square regression so popular is the existence of a closed-form analytical solution.


Lemma 1.3 (Linear least square solution). If $X^\top X$ is nonsingular, the minimizer of the SSE is
$$\hat\theta = (X^\top X)^{-1} X^\top y. \tag{5}$$

Proof. See annotated slides.
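As a quick numerical illustration of Lemma 1.3, the sketch below builds the matrix $X$ of (3) for synthetic data and compares the closed form (5) with NumPy's `np.linalg.lstsq`, which solves the same minimization without explicitly forming $X^\top X$. The data, the seed, and the helper names `design_matrix` and `least_squares_normal_eq` are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def design_matrix(x):
    """Prepend a column of ones to the N x d data matrix, as in (3)."""
    x = np.asarray(x, dtype=float)
    return np.hstack([np.ones((x.shape[0], 1)), x])

def least_squares_normal_eq(X, y):
    """Closed form (5), assuming X^T X is nonsingular."""
    # Solve (X^T X) theta = X^T y rather than forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical synthetic data: y = 1 + 2 x_1 - 3 x_2 + noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(50, 2))
y = 1.0 + x @ np.array([2.0, -3.0]) + 0.1 * rng.standard_normal(50)

X = design_matrix(x)
theta_closed = least_squares_normal_eq(X, y)
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("closed form (5):", theta_closed)
print("np.linalg.lstsq:", theta_lstsq)
print("cond(X^T X)    :", np.linalg.cond(X.T @ X))
```

On a well-conditioned problem such as this one the two estimates should agree to numerical precision; the condition number of $X^\top X$ is the square of that of $X$, which is one reason a solver working on $X$ directly is often preferred in practice.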

The existence of this solution is a bit misleading because computing $\hat\theta$ can be extremely numerically unstable: the matrix $X^\top X$ can be ill-conditioned, in which case computing $(X^\top X)^{-1}$ amplifies numerical errors.

As for classification, linear methods have their limits, and one can create a non-linear estimator using a non-linear feature map $\Phi : \mathbb{R}^d \to \mathbb{R}^\ell : x \mapsto \Phi(x)$. The regression model becomes
$$y = \beta^\top \Phi(x) + \beta_0 \quad\text{with}\quad \beta \in \mathbb{R}^\ell. \tag{6}$$

Example 1.4. To obtain a least square estimate of a cubic polynomial $f$ with $d = 1$, one can use the non-linear map
$$\Phi : \mathbb{R} \to \mathbb{R}^4 : x \mapsto \begin{bmatrix} 1 \\ x \\ x^2 \\ x^3 \end{bmatrix}. \tag{7}$$

2 Overfitting and regularization

Overfitting is the problem that happens when fitting the data well no longer ensures that the out-of-sample error is small, i.e., the learned model generalizes poorly. This happens not only when there are too many degrees of freedom in the model, so that one "learns the noise," but also when the hypothesis set contains simpler functions than the target function $f$ but the number of sample points $N$ is too small. In general, overfitting occurs as the number of features $d$ begins to approach the number of observations $N$.

To illustrate this, consider the following example, in which data is generated as $y = x^2 + n$ with $x \in [-1, 1]$ and $n \sim \mathcal{N}(0, \sigma^2)$ with $\sigma = 0.1$. We perform regression with polynomials of degree $d$.

Fig. 1a shows the true underlying model and five samples obtained independently and uniformly at random. Fig. 1b shows the resulting predictor obtained by fitting the data to a polynomial of degree $d = 4$. Since we only have five points, there exists a degree-four polynomial that predicts exactly the value of all five training points. This is an example where our regression is effectively learning the noise in the model.

[Figure 1: Illustration of overfitting. (a) True model and sample points (curves: true model, noisy data). (b) Regression fit with d = 4 (curves: true model, degree 4, noisy data). Axes: x from -1 to 1, y from 0 to 1.]

To fully appreciate the consequence of overfitting, Fig. 2a shows the regression results for twenty randomly sampled sets of five points.

[Figure 2: Regression with too few data points. (a) Many regressions with d = 4 on five randomly sampled points. (b) Many regressions with d = 1 on five randomly sampled points. Each panel shows the true model and the predictions; axes: x, y.]

As you can see, there is a huge variance in the resulting predictor, suggesting that we have an unstable prediction that does not generalize well. Perhaps surprisingly, one observes a similar variance when trying to fit the data to a polynomial of degree $d = 1$. In the latter situation, the degree of the polynomial is one less than that of the true model so that the model cannot fit the noise; however, the variance stems from the fact that there are few sample points. As shown in Fig. 3a and Fig. 3b, overfitting disappears once we have enough data points.

[Figure 3: Regression with enough data points. (a) Many regressions with d = 4 on fifty randomly sampled points. (b) Many regressions with d = 1 on fifty randomly sampled points. Each panel shows the true model and the predictions; axes: x, y.]

In practice though, we are often interested in limiting overfitting even when the number of data points is small. The key solution is a technique called regularization.
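The experiment of this section can be reproduced approximately with the sketch below, which repeatedly draws training sets from $y = x^2 + n$, fits polynomials by least squares, and measures how much the fitted curves vary from one draw to the next. The helper names, the seed, the grid, and the "spread" summary statistic are assumptions made for illustration; the original figures may have been produced differently.

```python
import numpy as np

def poly_features(x, degree):
    """Map scalar inputs to [1, x, ..., x^degree] (cf. Example 1.4)."""
    return np.vander(x, degree + 1, increasing=True)

def fit_poly(x, y, degree):
    """Least-square polynomial fit of the given degree."""
    theta, *_ = np.linalg.lstsq(poly_features(x, degree), y, rcond=None)
    return theta

def prediction_spread(n_points, degree, n_trials=20, sigma=0.1, seed=0):
    """Standard deviation of the fitted curves across independently drawn
    training sets, averaged over a grid of x values."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-1.0, 1.0, 101)
    curves = []
    for _ in range(n_trials):
        x = rng.uniform(-1.0, 1.0, n_points)
        y = x**2 + sigma * rng.standard_normal(n_points)
        curves.append(poly_features(grid, degree) @ fit_poly(x, y, degree))
    return float(np.mean(np.std(np.array(curves), axis=0)))

for degree in (4, 1):
    for n_points in (5, 50):
        spread = prediction_spread(n_points, degree)
        print(f"degree={degree}, N={n_points}: spread={spread:.3f}")

# With N = 5 and degree 4 the fit interpolates the training points.
rng = np.random.default_rng(1)
x5 = rng.uniform(-1.0, 1.0, 5)
y5 = x5**2 + 0.1 * rng.standard_normal(5)
residual = poly_features(x5, 4) @ fit_poly(x5, y5, 4) - y5
print("max training residual:", np.max(np.abs(residual)))
```

The spread should shrink markedly when going from five to fifty points for both degrees, while the near-zero training residual for $d = 4$ on five points reflects the interpolation discussed above.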


3 Tikhonov regularization

The key idea behind regularization is to introduce a penalty term to "regularize" the vector $\theta$:
$$\hat\theta = \operatorname*{argmin}_{\theta} \|y - X\theta\|_2^2 + \|\Gamma\theta\|_2^2, \tag{8}$$
where $\Gamma \in \mathbb{R}^{(d+1)\times(d+1)}$.

Lemma 3.1 (Tikhonov regularization solution). The minimizer of the least-square problem with Tikhonov regularization is
$$\hat\theta = (X^\top X + \Gamma^\top \Gamma)^{-1} X^\top y. \tag{9}$$

Proof. See annotated slides.
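For readers without access to the annotated slides, one possible derivation (a sketch, assuming that $X^\top X + \Gamma^\top\Gamma$ is nonsingular) expands the objective of (8) and sets its gradient to zero:

```latex
\begin{align*}
J(\theta) &\triangleq \|y - X\theta\|_2^2 + \|\Gamma\theta\|_2^2
  = y^\top y - 2\theta^\top X^\top y
    + \theta^\top \left(X^\top X + \Gamma^\top\Gamma\right)\theta, \\
\nabla_\theta J(\theta) &= -2 X^\top y
    + 2\left(X^\top X + \Gamma^\top\Gamma\right)\theta = 0
  \;\Longleftrightarrow\;
  \left(X^\top X + \Gamma^\top\Gamma\right)\theta = X^\top y .
\end{align*}
```

Since $J$ is convex, this stationary point is a minimizer, which gives (9). Setting $\Gamma = 0$ recovers the unregularized solution (5) whenever $X^\top X$ is nonsingular.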

For the special case $\Gamma = \sqrt{\lambda} I$ for some $\lambda > 0$, we obtain
$$\hat\theta = (X^\top X + \lambda I)^{-1} X^\top y. \tag{10}$$
This simple change has many benefits, including improving numerical stability when computing $\hat\theta$, since $X^\top X + \lambda I$ is better conditioned than $X^\top X$. Ridge regression is a slight variant of the above that does not penalize $\beta_0$ and corresponds to
$$\Gamma = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & \sqrt{\lambda} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{\lambda} \end{bmatrix}. \tag{11}$$
To appreciate the effect of regularization, Fig. 4 shows the resulting regressions with $\lambda = 1$ in the same situation as earlier. Notice how the variance of the regression is substantially reduced.
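The closed forms (9) to (11) translate directly into a few lines of NumPy. The sketch below is only illustrative: the data, the seed, and the helper names `ridge_gamma` and `tikhonov` are assumptions, and the polynomial features stand in for the setting of Section 2.

```python
import numpy as np

def ridge_gamma(dim, lam):
    """Matrix Gamma of (11): sqrt(lam) on the diagonal, except a zero
    in the first entry so that the offset beta_0 is not penalized."""
    gamma = np.sqrt(lam) * np.eye(dim)
    gamma[0, 0] = 0.0
    return gamma

def tikhonov(X, y, gamma):
    """Solution (9) of the Tikhonov-regularized least-square problem."""
    return np.linalg.solve(X.T @ X + gamma.T @ gamma, X.T @ y)

# Hypothetical data mimicking the earlier example: five noisy samples
# of x^2, fitted with a degree-4 polynomial.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 5)
y = x**2 + 0.1 * rng.standard_normal(5)
X = np.vander(x, 5, increasing=True)   # columns 1, x, x^2, x^3, x^4

lam = 1.0
theta_ridge = tikhonov(X, y, ridge_gamma(X.shape[1], lam))
theta_tik = tikhonov(X, y, np.sqrt(lam) * np.eye(X.shape[1]))  # case (10)

cond_plain = np.linalg.cond(X.T @ X)
cond_reg = np.linalg.cond(X.T @ X + lam * np.eye(X.shape[1]))
print("cond(X^T X)          :", cond_plain)
print("cond(X^T X + lam * I):", cond_reg)
print("ridge coefficients   :", theta_ridge)
```

The drop in condition number illustrates the numerical-stability benefit mentioned above; the choice $\lambda = 1$ simply mirrors the value used for Fig. 4.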

[Figure 4: Ridge regression. (a) Many ridge regressions with d = 4 on five randomly sampled points. (b) Many ridge regressions with d = 1 on five randomly sampled points. Each panel shows the true model and the predictions; axes: x, y.]

It is also useful to understand Tikhonov regularization as a constrained optimization problem.


Lemma 3.2 (Tikhonov regularization solution revisited). The minimizer of the least-square problem with Tikhonov regularization is the solution of
$$\operatorname*{argmin}_{\theta} \|y - X\theta\|_2^2 \quad\text{such that}\quad \|\Gamma\theta\|_2^2 \leq \tau \tag{12}$$
for some $\tau > 0$.

Proof. See annotated slides.

Fig. 5 illustrates the effect of Tikhonov regularization in $\mathbb{R}^2$ assuming that $\Gamma = I$. The Tikhonov solution is shrunk towards the zero vector to satisfy the constraint. Intuitively, the regularized solution corresponds to the point where a level set of $\|y - X\theta\|_2^2$ first intersects the boundary $\|\theta\|_2^2 = \tau$ of the feasible region.

[Figure 5: Illustration of Tikhonov regularization. Axes: $\theta_1$, $\theta_2$; the sketch shows the ball $\|\theta\|_2^2 \leq \tau$, the least-square solution $\hat\theta_{\mathrm{LS}}$, and the regularized solution $\hat\theta_{\mathrm{reg}}$.]

4 Shrinkage estimators

The Tikhonov regularization previously introduced is a shrinkage estimator, in the sense that it shrinks a naive estimate towards some guess. As illustrated in Fig. 5, one can think of the regularization as shrinking the least-square estimate $\hat\theta_{\mathrm{LS}}$ towards zero. Shrinkage estimators are arguably a bit strange, especially because it may not be clear a priori how biasing an estimate towards a guess would bring any benefit. The intuition you should have is that the shrinkage often leads to a lower variance of the estimator, perhaps at the expense of an increase in the bias. The example below illustrates this idea in a simple situation.

Example 4.1 (Estimating the variance). Let $\{x_i\}_{i=1}^N$ be i.i.d. samples drawn according to an unknown distribution with variance $\sigma^2$. Consider two estimators of the variance
$$\hat\sigma^2_{\mathrm{biased}} \triangleq \frac{1}{N} \sum_{i=1}^N (x_i - \hat\mu)^2, \qquad \hat\sigma^2_{\mathrm{unbiased}} \triangleq \frac{1}{N-1} \sum_{i=1}^N (x_i - \hat\mu)^2, \tag{13}$$
where $\hat\mu \triangleq \frac{1}{N}\sum_{i=1}^N x_i$ is the sample mean. As the names suggest, it is not hard to show that
$$\mathbb{E}\left[\hat\sigma^2_{\mathrm{biased}}\right] = \frac{N-1}{N}\sigma^2, \qquad \mathbb{E}\left[\hat\sigma^2_{\mathrm{unbiased}}\right] = \sigma^2. \tag{14}$$
Perhaps surprisingly, one can also show that the biased estimate has a lower variance than the unbiased one: since $\hat\sigma^2_{\mathrm{biased}} = \frac{N-1}{N}\hat\sigma^2_{\mathrm{unbiased}}$, its variance is smaller by a factor $\left(\frac{N-1}{N}\right)^2$.
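Example 4.1 is easy to check by simulation. The sketch below assumes Gaussian samples with $\sigma^2 = 4$ and $N = 10$ (the example itself leaves the distribution unspecified) and estimates the bias, variance, and mean-squared error of both estimators over many trials.

```python
import numpy as np

# Assumed setup for illustration: Gaussian samples, sigma^2 = 4, N = 10.
rng = np.random.default_rng(0)
sigma2, n_samples, n_trials = 4.0, 10, 100_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(n_trials, n_samples))
var_biased = samples.var(axis=1, ddof=0)     # divides by N
var_unbiased = samples.var(axis=1, ddof=1)   # divides by N - 1

for name, est in (("biased", var_biased), ("unbiased", var_unbiased)):
    bias = est.mean() - sigma2
    variance = est.var()
    mse = np.mean((est - sigma2) ** 2)
    print(f"{name:>8}: bias={bias:+.3f}  variance={variance:.3f}  mse={mse:.3f}")
```

The empirical means should be close to the values in (14), and the biased estimator should show the smaller variance. In this Gaussian setting it also has the smaller mean-squared error, which is the bias-variance trade-off that shrinkage exploits.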


Tikhonov regularization is not the only regularizer that one can use, and statisticians have developed many alternative regularizers, such as

  • Akaike information criterion (AIC): penalty that is an increasing function of the number of estimated parameters;
  • Bayesian information criterion (BIC): also an increasing function of the number of estimated parameters.

One regularizer that has become increasingly popular is the least absolute shrinkage and selection operator (LASSO) estimator
$$\hat\theta = \operatorname*{argmin}_{\theta} \|y - X\theta\|_2^2 + \lambda\|\theta\|_1. \tag{15}$$
With LASSO, coordinates are shrunk by the same amount. The main benefit of LASSO is that it promotes sparsity, i.e., solutions tend to have a small number of non-zero components. The problem is also much more computationally tractable than directly enforcing sparsity through a $\|\theta\|_0$ constraint. In constrained form, the LASSO estimation problem becomes
$$\hat\theta = \operatorname*{argmin}_{\theta} \|y - X\theta\|_2^2 \quad\text{such that}\quad \|\theta\|_1 \leq \tau, \tag{16}$$
for which there is no closed-form solution but which is very powerful when $N \ll d$ (susceptible to overfitting) or $X$ has a non-trivial null space and there is no obvious way to find the best solution. The intuition behind LASSO is illustrated in Fig. 6: the norm 1 ball $\{x : \|x\|_1 \leq \tau\}$ is more "pointy" than the norm 2 ball, which increases the likelihood that the regularization yields an extreme point with few non-zero components.
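Since (15) has no closed-form solution, it has to be solved iteratively. The sketch below uses iterative soft-thresholding (ISTA), a standard proximal-gradient method that is not discussed in the lecture and is chosen here only because it makes the "shrink every coordinate by the same amount" behavior explicit. The problem sizes, the value of $\lambda$, and all function names are assumptions for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink every coordinate of v towards zero by t, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_objective(X, y, theta, lam):
    """Objective of (15)."""
    return np.sum((y - X @ theta) ** 2) + lam * np.sum(np.abs(theta))

def lasso_ista(X, y, lam, n_iter=5000):
    """Minimize ||y - X theta||_2^2 + lam * ||theta||_1 by proximal gradient."""
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ theta - y)           # gradient of the SSE term
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

# Hypothetical sparse problem with fewer observations than features.
rng = np.random.default_rng(0)
n_obs, n_feat = 20, 50
X = rng.standard_normal((n_obs, n_feat))
theta_true = np.zeros(n_feat)
theta_true[[3, 17, 31]] = [2.0, -1.5, 1.0]
y = X @ theta_true + 0.05 * rng.standard_normal(n_obs)

lam = 2.0
theta_lasso = lasso_ista(X, y, lam)
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ y)

print("non-zeros (LASSO):", np.count_nonzero(theta_lasso))
print("non-zeros (ridge):", np.count_nonzero(theta_ridge))
print("LASSO objective  :", lasso_objective(X, y, theta_lasso, lam))
```

Comparing the number of non-zero coefficients of the two estimates gives a concrete feel for the sparsity-promoting effect of the $\ell_1$ penalty relative to the $\ell_2$ penalty of (10).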

[Figure 6: Illustration of LASSO regularization. Axes: $\theta_1$, $\theta_2$; the sketch shows the ball $\|\theta\|_1 \leq \tau$, the least-square solution $\hat\theta_{\mathrm{LS}}$, and the LASSO solution $\hat\theta_{\mathrm{LASSO}}$.]

5 General approach to regression

Consider now a general approach to regression, in which we define our estimate as
$$\hat\theta = \operatorname*{argmin}_{\theta} L(\theta, X, y) + \lambda r(\theta), \tag{17}$$
where $L : \mathbb{R}^{d+1} \times \mathbb{R}^{N\times(d+1)} \times \mathbb{R}^N \to \mathbb{R}$ is a loss function and $r : \mathbb{R}^{d+1} \to \mathbb{R}$ is a regularizer. Intuitively, the role of the loss function is to ensure data fidelity while the role of the regularizer is to limit the complexity of the solution. Among the many possible choices of loss functions, we highlight three that are fairly popular:


  • the mean absolute error $L_{\mathrm{AE}}(r) \triangleq |r|$;
  • the Huber loss $L_{\mathrm{H}}(r) = \begin{cases} \frac{1}{2} r^2 & \text{if } |r| \leq c \\ c|r| - \frac{c^2}{2} & \text{else} \end{cases}$;
  • the $\epsilon$-insensitive loss $L_{\epsilon}(r) = \begin{cases} 0 & \text{if } |r| \leq \epsilon \\ |r| - \epsilon & \text{else.} \end{cases}$

These losses are illustrated in Fig. 7.

[Figure 7: Illustration of loss functions. Plot of $L_{\mathrm{AE}}(r)$, $L_{\mathrm{H}}(r)$ (with marks at $\pm c$), and $L_{\epsilon}(r)$ (with marks at $\pm\epsilon$) as functions of $r$.]

Let us now investigate one specific example called regularized robust regression, which combines the $\epsilon$-insensitive loss with an $\ell_2$ regularizer as
$$\hat\beta, \hat\beta_0 = \operatorname*{argmin}_{\beta, \beta_0} \sum_{i=1}^N L_{\epsilon}\left(y_i - (\beta^\top x_i + \beta_0)\right) + \frac{\lambda}{2}\|\beta\|_2^2. \tag{18}$$
The $\epsilon$-insensitive loss does not incur a penalty as long as the prediction is within a margin $\epsilon$ of the true value. This is suspiciously similar to the problem we were solving when looking at Support Vector Machines (SVMs), and one can therefore wonder if we could kernelize the regression problem. The answer turns out to be affirmative, as shown by the characterization of the dual problem of (18).
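The three losses listed above and the objective of (18) can be written compactly as follows. This is a minimal sketch: the default thresholds `c=1.0` and `eps=0.5` and the function names are arbitrary choices made for illustration.

```python
import numpy as np

def absolute_loss(r):
    """L_AE(r) = |r|."""
    return np.abs(r)

def huber_loss(r, c=1.0):
    """Quadratic for |r| <= c, linear with slope c beyond."""
    r = np.abs(r)
    return np.where(r <= c, 0.5 * r**2, c * r - 0.5 * c**2)

def eps_insensitive_loss(r, eps=0.5):
    """Zero inside the tube |r| <= eps, linear outside."""
    return np.maximum(np.abs(r) - eps, 0.0)

def robust_regression_objective(beta, beta0, X, y, lam=1.0, eps=0.5):
    """Objective of (18); here X holds the raw feature vectors x_i as rows."""
    residuals = y - (X @ beta + beta0)
    return np.sum(eps_insensitive_loss(residuals, eps)) + 0.5 * lam * beta @ beta

r = np.array([-2.0, -0.25, 0.0, 0.25, 2.0])
print("absolute       :", absolute_loss(r))
print("Huber (c=1)    :", huber_loss(r))
print("eps-insensitive:", eps_insensitive_loss(r))
```

Evaluating the losses on a few residuals shows the qualitative differences: the Huber loss is quadratic near zero and linear in the tails, while the $\epsilon$-insensitive loss ignores small residuals entirely.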

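Lemma 5.1 (Dual of regularized robust regression). Robust regression is the solution of
$$\operatorname*{argmin}_{\alpha, \alpha^*} \sum_{i=1}^N \left( (\epsilon - y_i)\alpha_i + (\epsilon + y_i)\alpha_i^* \right) + \frac{1}{2} \sum_{i,j} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, x_j^\top x_i \tag{19}$$
$$\text{such that}\quad 0 \leq \alpha_i^*, \alpha_i \leq \frac{1}{\lambda}, \qquad \sum_{i=1}^N (\alpha_i^* - \alpha_i) = 0, \qquad \alpha_i^* \alpha_i = 0. \tag{20}$$
The solution is then $\hat y = \sum_{i=1}^N (\alpha_i - \alpha_i^*)\, x_i^\top x + \beta_0$, which only involves the data through inner products.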


Proof. See for yourselves!

The previous kernelization of the $\epsilon$-robust regression through the formulation of the dual problem is a bit misleading because kernelization is in general not straightforward. We highlight next two examples for which the kernelization can be worked out.

6 Kernelized LASSO

The LASSO regression is given by
$$\hat\theta = \operatorname*{argmin}_{\theta} \|y - X\theta\|_2^2 + \lambda\|\theta\|_1,$$
which is not easy to kernelize directly. Instead, we modify the problem and enforce the kernelization by looking for a specific $\theta$ of the form $\tilde\theta \triangleq \sum_i \alpha_i x_i$ for $\{\alpha_i\}_{i=1}^N$ and solve the optimization problem
$$\hat\alpha = \operatorname*{argmin}_{\alpha} \|y - X\tilde\theta\|_2^2 + \lambda\|\alpha\|_1.$$
This can be kernelized but promotes sparsity in $\alpha$ instead of $\theta$, which is not quite the original LASSO problem.

7 Kernel ridge regression

One example we can explore in much more detail is kernel ridge regression. The first step towards kernelization consists in reformulating the objective of ridge regression as follows.

Lemma 7.1. The ridge regression problem is equivalent to
$$\hat\beta, \hat\beta_0 = \operatorname*{argmin}_{\beta, \beta_0} \sum_{i=1}^N \left\| y_i - \bar y - \beta^\top (x_i - \bar x) \right\|_2^2 + \lambda\|\beta\|_2^2 \tag{21}$$
with $\bar x = \frac{1}{N}\sum_i x_i$ and $\bar y = \frac{1}{N}\sum_i y_i$. The prediction for an unseen $x$ is then computed as
$$\hat y = \bar y + \hat\beta^\top (x - \bar x). \tag{22}$$

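Proof. Recall that the original ridge regression problem is
$$\hat\beta, \hat\beta_0 = \operatorname*{argmin}_{\beta, \beta_0} \underbrace{\frac{1}{N} \sum_{i=1}^N \left\| y_i - \beta^\top x_i - \beta_0 \right\|_2^2 + \lambda\|\beta\|_2^2}_{\triangleq L}. \tag{23}$$
Applying the stationarity condition of the Karush-Kuhn-Tucker (KKT) conditions to this problem, we must have
$$\frac{\partial L}{\partial \beta_0} = 0 \;\Leftrightarrow\; -\frac{2}{N} \sum_{i=1}^N (y_i - \beta^\top x_i - \beta_0) = 0, \tag{24}$$
so that
$$\beta_0 = \frac{1}{N}\sum_{i=1}^N y_i - \beta^\top \left( \frac{1}{N} \sum_{i=1}^N x_i \right) \triangleq \bar y - \beta^\top \bar x. \tag{25}$$
Substituting this optimal value of $\beta_0$ from (25) into the optimization problem (23), we obtain (21), up to the factor $\frac{1}{N}$, which can be absorbed into $\lambda$. The evaluation $\hat y = \hat\beta^\top x + \hat\beta_0$ with (25) yields (22).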

Note that the optimization problem in Lemma 7.1 is a ridge regression problem for a centered data set, in which the values $y_i$ are replaced by $y_i - \bar y$ and in which the feature vectors $x_i$ are replaced by $x_i - \bar x$. Using the known analytical solution for ridge regression developed in a previous lecture, we immediately obtain
$$\hat\beta = (A^\top A + \lambda I)^{-1} A^\top \tilde y \quad\text{where}\quad A \triangleq \begin{bmatrix} (x_1 - \bar x)^\top \\ (x_2 - \bar x)^\top \\ \vdots \\ (x_N - \bar x)^\top \end{bmatrix}, \qquad \tilde y \triangleq \begin{bmatrix} y_1 - \bar y \\ y_2 - \bar y \\ \vdots \\ y_N - \bar y \end{bmatrix}.$$
Unfortunately, one cannot directly kernelize this solution because the matrix $A^\top A$ does not contain inner products between feature vectors. One can, however, rewrite the solution using the Woodbury matrix inversion identity given below.

Lemma 7.2 (Woodbury matrix inversion identity).
$$(P + QRS)^{-1} = P^{-1} - P^{-1} Q \left( R^{-1} + S P^{-1} Q \right)^{-1} S P^{-1}$$

The main benefit of this identity is to switch the order in which the matrices are multiplied. In our context, we use this to introduce inner products between feature vectors in the expression of the solution for $\beta$.

Lemma 7.3.
$$\hat\beta = \frac{1}{\lambda} \left( A^\top - A^\top (\lambda I + K)^{-1} K \right) \tilde y \quad\text{with}\quad K \triangleq A A^\top = \left[ (x_i - \bar x)^\top (x_j - \bar x) \right]_{i,j}$$
and
$$\hat y = \bar y + \frac{1}{\lambda} \tilde y^\top \left( I - K (\lambda I + K)^{-1} \right) k(x) \quad\text{with}\quad k(x) = \begin{bmatrix} (x_1 - \bar x)^\top (x - \bar x) \\ \vdots \\ (x_N - \bar x)^\top (x - \bar x) \end{bmatrix}.$$

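Proof. We use Lemma 7.2 with the choice of matrices
$$P \triangleq \lambda I, \qquad Q \triangleq A^\top, \qquad R \triangleq I, \qquad S \triangleq A, \tag{26}$$
for which we obtain
$$(\lambda I + A^\top A)^{-1} = \frac{1}{\lambda} I - \frac{1}{\lambda} I A^\top \left( I + A \frac{1}{\lambda} A^\top \right)^{-1} A \frac{1}{\lambda} I \tag{27}$$
$$= \frac{1}{\lambda} \left( I - A^\top (\lambda I + A A^\top)^{-1} A \right). \tag{28}$$
Consequently,
$$(\lambda I + A^\top A)^{-1} A^\top \tilde y = \frac{1}{\lambda} \left( A^\top - A^\top (\lambda I + A A^\top)^{-1} A A^\top \right) \tilde y \tag{29}$$
$$= \frac{1}{\lambda} A^\top \left( I - (\lambda I + K)^{-1} K \right) \tilde y, \tag{30}$$
and the evaluation $\hat y - \bar y = \hat\beta^\top (x - \bar x)$ from (22) yields
$$\hat y - \bar y = \frac{1}{\lambda} \tilde y^\top \left( I - K (\lambda I + K)^{-1} \right) A (x - \bar x), \tag{31}$$
where we have used the fact that $K^\top = K$. The result follows from the definition of $k(x)$, since $A(x - \bar x) = k(x)$.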
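Lemma 7.3 can be sanity-checked numerically by comparing the "primal" expression, which inverts a $d \times d$ matrix, with the expression obtained from the Woodbury identity, which inverts an $N \times N$ matrix built from inner products only. The random data, sizes, and variable names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_feat, lam = 8, 3, 0.5            # illustrative sizes and lambda
x = rng.standard_normal((n_obs, n_feat))
y = rng.standard_normal(n_obs)

x_bar, y_bar = x.mean(axis=0), y.mean()
A = x - x_bar                              # rows (x_i - x_bar)^T
y_tilde = y - y_bar
K = A @ A.T                                # Gram matrix of centered features
I_d, I_n = np.eye(n_feat), np.eye(n_obs)

# Primal solution: inverts a d x d matrix.
beta_primal = np.linalg.solve(A.T @ A + lam * I_d, A.T @ y_tilde)
# Lemma 7.3: inverts an N x N matrix that only involves inner products.
beta_dual = (A.T - A.T @ np.linalg.solve(lam * I_n + K, K)) @ y_tilde / lam

# Prediction at a new point through k(x) = A (x - x_bar).
x_new = rng.standard_normal(n_feat)
k_new = A @ (x_new - x_bar)
y_hat_primal = y_bar + beta_primal @ (x_new - x_bar)
y_hat_dual = y_bar + y_tilde @ (I_n - K @ np.linalg.inv(lam * I_n + K)) @ k_new / lam

print("max |beta_primal - beta_dual|:", np.max(np.abs(beta_primal - beta_dual)))
print("predictions:", y_hat_primal, y_hat_dual)
```

Both routes should give the same coefficients and the same prediction up to rounding; the second route is the one that survives when inner products are replaced by kernel evaluations.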

Notice that the evaluation of $\hat y$ in Lemma 7.3 only involves inner products between centered feature vectors in $K$ and $k(x)$. One kernelizes the expression by substituting every inner product $(x_j - \bar x)^\top (x - \bar x)$ with $k(x_j - \bar x, x - \bar x)$ for some inner product kernel $k(\cdot, \cdot)$.

Remark 7.4. For many kernels, the map $\Phi(x)$ already contains a constant component so that we often omit $\beta_0$. The expression of kernel ridge regression without offset then simplifies to
$$\hat y = y^\top (\lambda I + K)^{-1} k(x) \quad\text{where}\quad k(x) = \begin{bmatrix} k(x_1, x) & \cdots & k(x_N, x) \end{bmatrix}^\top, \qquad K_{i,j} = k(x_i, x_j).$$

Example 7.5 (Gaussian kernel). A widely used kernel is the Gaussian kernel defined as
$$k(u, v) \triangleq \exp\left( -\frac{\|u - v\|_2^2}{2\sigma^2} \right). \tag{32}$$
The parameter $\sigma$ is called the width of the kernel and can be thought of as controlling how smooth we want the learned function to be. To visualize this, note that $y^\top (\lambda I + K)^{-1}$ is merely a row vector $\alpha^\top = [\alpha_1, \cdots, \alpha_N]$, so that $\hat y = \sum_{i=1}^N \alpha_i k(x_i, x)$. One can interpret this as writing our approximation $\hat y$ as a weighted sum of bell-shaped curves placed around each data point, as shown in Fig. 8.

[Figure 8: Illustration of Gaussian kernel ridge regression. Plot of $\hat y$ against $x$, with bell-shaped curves centered at the data points $x_0, \dots, x_4$.]
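The offset-free expression of Remark 7.4 with the Gaussian kernel (32) takes only a few lines. The sketch below is a minimal illustration under assumed values of $\lambda$ and $\sigma$ (not tuned), with hypothetical helper names and synthetic data reusing the model $y = x^2 + n$ of Section 2.

```python
import numpy as np

def gaussian_kernel(U, V, sigma):
    """Gram matrix with entries k(u_i, v_j) from (32)."""
    sq_dists = np.sum((U[:, None, :] - V[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def krr_fit(X, y, lam, sigma):
    """Weights alpha = (lam * I + K)^{-1} y of Remark 7.4 (K is symmetric)."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(lam * np.eye(len(y)) + K, y)

def krr_predict(X_train, alpha, X_new, sigma):
    """y_hat(x) = sum_i alpha_i k(x_i, x)."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

# Hypothetical data: noisy samples of x^2 on [-1, 1].
rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(30, 1))
y_train = X_train[:, 0] ** 2 + 0.1 * rng.standard_normal(30)

lam, sigma = 0.1, 0.5   # illustrative values, not tuned
alpha = krr_fit(X_train, y_train, lam, sigma)

X_grid = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
y_grid = krr_predict(X_train, alpha, X_grid, sigma)
for xg, yg in zip(X_grid[:, 0], y_grid):
    print(f"x={xg:+.2f}  y_hat={yg:.3f}  x^2={xg**2:.3f}")
```

Varying `sigma` in this sketch is a simple way to see the smoothness effect described in Example 7.5: small widths give bumpy predictors that hug the data, large widths give smoother ones.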


8 Regularization and classification

Since classification is a special case of regression, the viewpoint of loss function and regularization also applies to classification. As a simple application, the logistic regression $\min_\theta -\ell(\theta)$ can be regularized as $\min_\theta -\ell(\theta) + \lambda\|\theta\|_2^2$ to make the Hessian matrix well conditioned in a Newton method. This is also very useful when the number of observations is small and the data is not separable.

Most importantly, the loss function and regularization framework offers a new way to think about classification. In fact, when trying to separate two classes with a hyperplane, we would ideally like to solve the optimization problem
$$\operatorname*{argmin}_{w, b} \frac{1}{N} \sum_{i=1}^N \mathbb{1}\{ y_i (w^\top x_i + b) < 0 \},$$
which assigns a penalty of one to every misclassified point and zero otherwise. Unfortunately, this is a hard problem to solve not only analytically but also numerically because the indicator function $\mathbb{1}\{\cdot\}$ is not well behaved. Viewing the indicator function as a $\{0, 1\}$-loss, we now know that we can use other loss functions; in particular, we can choose a loss function $\phi(t) \geq \mathbb{1}\{t < 0\}$ and a regularization to solve a relaxed but well-behaved modified optimization
$$\operatorname*{argmin}_{w, b} \frac{1}{N} \sum_{i=1}^N \phi\left( y_i (w^\top x_i + b) \right) + \lambda\|w\|_2^2.$$
There are many possible choices, but a popular choice is the hinge loss $\phi(t) \triangleq \max(0, 1 - t) \triangleq (1 - t)_+$ illustrated in Fig. 9.

[Figure 9: Hinge loss. Plot of the $\{0, 1\}$-loss and of the hinge loss $(1 - t)_+$ as functions of $t$.]

The optimization problem with the hinge loss then becomes
$$\operatorname*{argmin}_{w, b} \frac{1}{N} \sum_{i=1}^N \left( 1 - y_i (w^\top x_i + b) \right)_+ + \lambda\|w\|_2^2.$$
We can introduce slack variables $\xi_i \geq 0$ satisfying $\xi_i \geq 1 - y_i (w^\top x_i + b)$ to rewrite the problem equivalently as
$$\operatorname*{argmin}_{w, b, \xi \geq 0} \frac{1}{N} \sum_{i=1}^N \xi_i + \lambda\|w\|_2^2 \quad\text{such that}\quad \forall i \in \{1, \dots, N\} \quad y_i (w^\top x_i + b) \geq 1 - \xi_i, \tag{33}$$
which is exactly the soft-margin hyperplane seen in previous lectures.
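To make the relaxation concrete, the sketch below evaluates the $\{0,1\}$-loss, the hinge loss, and the corresponding slack variables of (33) for a fixed hyperplane. The two-class data, the hyperplane $(w, b)$, and the value of $\lambda$ are arbitrary assumptions; no optimization is performed.

```python
import numpy as np

def zero_one_loss(t):
    """Indicator 1{t < 0} applied to the margins t = y (w^T x + b)."""
    return (t < 0).astype(float)

def hinge_loss(t):
    """Hinge loss (1 - t)_+ = max(0, 1 - t)."""
    return np.maximum(0.0, 1.0 - t)

def soft_margin_objective(w, b, X, y, lam):
    """Regularized empirical hinge loss; labels y must be in {-1, +1}."""
    margins = y * (X @ w + b)
    return np.mean(hinge_loss(margins)) + lam * w @ w

# Hypothetical two-class data and an arbitrary (not optimized) hyperplane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 1.0, size=(20, 2)),
               rng.normal(-1.0, 1.0, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
w, b, lam = np.array([1.0, 1.0]), 0.0, 0.1

margins = y * (X @ w + b)
slacks = hinge_loss(margins)   # smallest feasible xi_i in (33) for this (w, b)

print("empirical 0-1 loss  :", zero_one_loss(margins).mean())
print("empirical hinge loss:", slacks.mean())
print("objective           :", soft_margin_objective(w, b, X, y, lam))
```

For any fixed $(w, b)$, the smallest slack variables satisfying the constraints of (33) are exactly the hinge losses, which is why the two formulations coincide, and the hinge loss upper bounds the $\{0,1\}$-loss point by point.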