
Notes on Linear Least Squares Model, COMP24111

Tingting Mu

tingtingmu@manchester.ac.uk
School of Computer Science, University of Manchester, Manchester M13 9PL, UK
Editor: NA

1. Notations

In a regression (or classification) task, we are given N training samples. Each training sample is characterised by a total of d features. We store the feature values of these training samples in an N × d matrix, denoted by

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Nd} \end{bmatrix}, \tag{1}$$

where $x_{ij}$ denotes the ij-th element of this matrix. Usually, we use the simplified notation $X = [x_{ij}]$ to denote this matrix, and use the d-dimensional column vector $\mathbf{x}_i$ to denote the feature vector of the i-th training sample, such that

$$\mathbf{x}_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{bmatrix}. \tag{2}$$

As you can see, $\mathbf{x}_i$ contains the elements of the i-th row of the feature matrix $X$.

In the single-output case, each training sample is associated with one target output. The following column vector

$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \tag{3}$$

is used to store the outputs of all the training samples. Each element $y_i$ corresponds to the single-variable output of the i-th training sample. In a regression task, the target output is a real-valued number ($y_i \in \mathbb{R}$). In a binary classification task, the target output is often set as a binary integer, e.g., $y_i \in \{-1, +1\}$ or $y_i \in \{0, 1\}$.

In the multi-output case, each training sample is associated with c different output variables. We use the N × c matrix $Y = [y_{ij}]$ to store the output variables of all the training samples:

$$Y = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1c} \\ y_{21} & y_{22} & \cdots & y_{2c} \\ \vdots & \vdots & \ddots & \vdots \\ y_{N1} & y_{N2} & \cdots & y_{Nc} \end{bmatrix}. \tag{4}$$

We use the c-dimensional column vector

$$\mathbf{y}_i = \begin{bmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{ic} \end{bmatrix} \tag{5}$$

to store the c output variables of the i-th training sample.
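To make the notation concrete, here is a minimal sketch (using NumPy, on a made-up toy dataset) of how X, y and Y might be stored; the shapes follow Eqs. (1)–(5):

```python
import numpy as np

N, d, c = 4, 3, 2          # toy sizes: 4 samples, 3 features, 2 outputs

X = np.arange(N * d, dtype=float).reshape(N, d)  # N x d feature matrix, Eq. (1)
x_2 = X[1]                 # feature vector of the 2nd sample (2nd row of X), Eq. (2)

y = np.array([1.0, -1.0, 1.0, -1.0])             # single-output targets, Eq. (3)
Y = np.ones((N, c))        # N x c multi-output targets, Eq. (4)
y_2 = Y[1]                 # the c output variables of the 2nd sample, Eq. (5)
```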

2. Linear Model

In machine learning, building a linear model refers to employing a linear function to estimate a desired output. The general formulation of a linear function that takes n input variables is

$$f(x_1, x_2, \ldots, x_n) = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_n x_n, \tag{6}$$

where $a_0, a_1, a_2, \ldots, a_n$ are often referred to as the linear combination coefficients (weights), or linear model weights.

2.1 Single-output Case

We use one linear function to estimate the single output variable of a given sample based on its input features $\mathbf{x} = [x_1, x_2, \ldots, x_d]^T$. The estimated output is given by

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = w_0 + \sum_{i=1}^{d} w_i x_i = \mathbf{w}^T \tilde{\mathbf{x}}, \tag{7}$$

where the column vector $\mathbf{w} = [w_0, w_1, w_2, \ldots, w_d]^T$ stores the model weights. The modified notation

$$\tilde{\mathbf{x}} = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \tag{8}$$

is introduced to simplify the writing of the linear model formulation, and it is called the expanded feature vector.
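As a quick illustration of Eqs. (7)–(8), a minimal sketch (NumPy, with hypothetical weights and features) of the single-output prediction:

```python
import numpy as np

w = np.array([0.5, 1.0, -2.0, 0.3])   # w = [w0, w1, w2, w3]^T, here d = 3
x = np.array([2.0, 0.1, 4.0])         # input features x = [x1, x2, x3]^T

x_tilde = np.concatenate(([1.0], x))  # expanded feature vector, Eq. (8)
y_hat = w @ x_tilde                   # y_hat = w^T x_tilde, Eq. (7)
```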


2.2 Multi-output Case

In this case, each target output is estimated using one linear function. We seek c different functions to predict the c outputs for a sample $\mathbf{x} = [x_1, x_2, \ldots, x_d]^T$:

$$\hat{y}_1 = w_{01} + w_{11} x_1 + w_{21} x_2 + \cdots + w_{d1} x_d = \mathbf{w}_1^T \tilde{\mathbf{x}}, \tag{9}$$
$$\hat{y}_2 = w_{02} + w_{12} x_1 + w_{22} x_2 + \cdots + w_{d2} x_d = \mathbf{w}_2^T \tilde{\mathbf{x}}, \tag{10}$$
$$\vdots$$
$$\hat{y}_c = w_{0c} + w_{1c} x_1 + w_{2c} x_2 + \cdots + w_{dc} x_d = \mathbf{w}_c^T \tilde{\mathbf{x}}, \tag{11}$$

where the vector

$$\mathbf{w}_i = \begin{bmatrix} w_{0i} \\ w_{1i} \\ w_{2i} \\ \vdots \\ w_{di} \end{bmatrix} \tag{12}$$

stores the linear model weights for predicting the i-th target output. By collecting all the estimated outputs in a vector, a neat expression of the multi-output linear model can be obtained:

$$\hat{\mathbf{y}} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_c \end{bmatrix} = \begin{bmatrix} w_{01} + w_{11} x_1 + w_{21} x_2 + \cdots + w_{d1} x_d \\ w_{02} + w_{12} x_1 + w_{22} x_2 + \cdots + w_{d2} x_d \\ \vdots \\ w_{0c} + w_{1c} x_1 + w_{2c} x_2 + \cdots + w_{dc} x_d \end{bmatrix} = \begin{bmatrix} w_{01} & w_{11} & w_{21} & \cdots & w_{d1} \\ w_{02} & w_{12} & w_{22} & \cdots & w_{d2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ w_{0c} & w_{1c} & w_{2c} & \cdots & w_{dc} \end{bmatrix} \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_d \end{bmatrix} = W^T \tilde{\mathbf{x}}, \tag{13}$$

where the (d + 1) × c matrix

$$W = \begin{bmatrix} w_{01} & w_{02} & \cdots & w_{0c} \\ w_{11} & w_{12} & \cdots & w_{1c} \\ w_{21} & w_{22} & \cdots & w_{2c} \\ \vdots & \vdots & \ddots & \vdots \\ w_{d1} & w_{d2} & \cdots & w_{dc} \end{bmatrix} \tag{14}$$

stores all the model weights.
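A minimal sketch of Eq. (13) (NumPy, hypothetical values), showing that the multi-output prediction is just a matrix–vector product:

```python
import numpy as np

d, c = 3, 2
W = np.ones((d + 1, c))               # (d+1) x c weight matrix, Eq. (14)
x = np.array([2.0, 0.1, 4.0])

x_tilde = np.concatenate(([1.0], x))  # expanded feature vector
y_hat = W.T @ x_tilde                 # c-dimensional prediction, Eq. (13)
```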

3. Least Squares

Training a linear model refers to the process of finding the optimal values of the model weights by utilising information provided by the training samples. The least squares approach refers to the method of finding the optimal model weights by minimising the sum-of-squares error function.

3.1 Sum-of-squares Error

The sum-of-squares error function is computed as the sum of the squared differences between the true target outputs and their estimations. In the single-output case, the error function computed using N training samples is given as

$$O(\mathbf{w}) = \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 = \sum_{i=1}^{N} \left( \left( w_0 + \sum_{k=1}^{d} w_k x_{ik} \right) - y_i \right)^2 = \sum_{i=1}^{N} (\mathbf{w}^T \tilde{\mathbf{x}}_i - y_i)^2, \tag{15}$$


where $\tilde{\mathbf{x}}_i = [1, x_{i1}, x_{i2}, \ldots, x_{id}]^T$ is the expanded feature vector for the i-th training sample.

In the multi-output case, each sample is associated with multiple output variables (e.g., $y_{i1}, y_{i2}, \ldots, y_{ic}$ for the i-th training sample). The error function is computed by examining the squared difference over each target output of each training sample, resulting in the following sum:

$$O(W) = \sum_{i=1}^{N} \sum_{j=1}^{c} (\hat{y}_{ij} - y_{ij})^2 = \sum_{i=1}^{N} \sum_{j=1}^{c} \left( \left( w_{0j} + \sum_{k=1}^{d} w_{kj} x_{ik} \right) - y_{ij} \right)^2 = \sum_{i=1}^{N} \sum_{j=1}^{c} (\mathbf{w}_j^T \tilde{\mathbf{x}}_i - y_{ij})^2. \tag{16}$$

3.2 Normal Equations

The normal equations provide a way to find the model weights that minimise the sum-of-squares error function. They are derived by setting the partial derivatives of the error function with respect to the weights to zero. We first look at the single-output case, and use $\mathbf{w}^*$ to denote the optimal weight vector that minimises the sum-of-squares error function. The normal equations are

$$\mathbf{w}^* = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \mathbf{y} = \tilde{X}^+ \mathbf{y}, \tag{17}$$

where

$$\tilde{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{Nd} \end{bmatrix} \tag{18}$$

is the expanded feature matrix. The quantity $\tilde{X}^+ = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T$ is called the Moore-Penrose pseudo-inverse of the matrix.

To compute the optimal weight matrix $W^*$ for the multi-output case, the normal equations possess a similar form to Eq. (17):

$$W^* = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T Y = \tilde{X}^+ Y. \tag{19}$$

When implementing the normal equations, you can seek help from existing linear algebra libraries, e.g., "inv()" and "pinv()" in MATLAB, to compute the inverse or pseudo-inverse of a given matrix. If you are interested in how to derive the normal equations, you can read the optional reading materials in Section 4.
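The notes point to MATLAB's inv() and pinv(); as an illustration only, here is a minimal sketch of the same computation in NumPy on a made-up dataset (np.linalg.pinv and np.linalg.lstsq are the library routines assumed here):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))                    # N x d feature matrix
y = 2.0 + X @ np.array([1.0, -3.0, 0.5]) + 0.1 * rng.normal(size=N)

X_tilde = np.hstack([np.ones((N, 1)), X])      # expanded feature matrix, Eq. (18)

# Normal equations via the pseudo-inverse, Eq. (17).
w_star = np.linalg.pinv(X_tilde) @ y

# Equivalent, numerically preferable route: a least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)

# Sum-of-squares error of the fitted model, Eq. (15).
error = np.sum((X_tilde @ w_star - y) ** 2)
```

For the multi-output case, Eq. (19), the same code works with y replaced by the N × c target matrix Y.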

3.3 Regularised Least Squares Model

The regularised least squares model finds its model weights by minimising the following modified error function:

$$O(\mathbf{w}) = \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 + \lambda \left( w_0^2 + \sum_{i=1}^{d} w_i^2 \right) \tag{20}$$

for the single-output case, and

$$O(W) = \sum_{i=1}^{N} \sum_{j=1}^{c} (\hat{y}_{ij} - y_{ij})^2 + \lambda \sum_{j=1}^{c} \left( w_{0j}^2 + \sum_{i=1}^{d} w_{ij}^2 \right) \tag{21}$$


for the multi-output case. Here, λ > 0 is the regularisation parameter. The normal equations for the regularised least squares model are given as

$$\mathbf{w}^* = (\tilde{X}^T \tilde{X} + \lambda I)^{-1} \tilde{X}^T \mathbf{y} \quad \text{(single-output)}, \tag{22}$$

$$W^* = (\tilde{X}^T \tilde{X} + \lambda I)^{-1} \tilde{X}^T Y \quad \text{(multi-output)}. \tag{23}$$

Eq. (22) is derived by setting the gradient of $O(\mathbf{w})$ with respect to $\mathbf{w}$ to zero. Eq. (23) is derived by setting the gradient of $O(W)$ with respect to $W$ to zero.
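A minimal sketch of Eq. (22) in NumPy (the function name and data shapes are illustrative), using np.linalg.solve rather than forming the inverse explicitly:

```python
import numpy as np

def ridge_weights(X_tilde, y, lam):
    """Regularised least squares, Eq. (22): (X~^T X~ + lam I)^{-1} X~^T y."""
    k = X_tilde.shape[1]                      # k = d + 1
    A = X_tilde.T @ X_tilde + lam * np.eye(k)
    return np.linalg.solve(A, X_tilde.T @ y)  # solve A w = X~^T y

# Eq. (23) is the same call with the N x c target matrix Y in place of y.
```

Note that, as written in Eq. (20), the bias weight $w_0$ is regularised along with the other weights; some presentations exclude it from the penalty term.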

4. Derive Normal Equations (Optional Reading)

4.1 Single-output Case

The sum-of-squares error function in Eq. (15) can be expressed in matrix form. One way to do this is shown below:

$$\begin{aligned} O(\mathbf{w}) &= \sum_{i=1}^{N} (\mathbf{w}^T \tilde{\mathbf{x}}_i - y_i)^2 \\ &= \sum_{i=1}^{N} \left( (\mathbf{w}^T \tilde{\mathbf{x}}_i)^2 - 2 \mathbf{w}^T \tilde{\mathbf{x}}_i y_i + y_i^2 \right) \\ &= \sum_{i=1}^{N} \left( \mathbf{w}^T \tilde{\mathbf{x}}_i \tilde{\mathbf{x}}_i^T \mathbf{w} - 2 y_i \tilde{\mathbf{x}}_i^T \mathbf{w} + y_i^2 \right) \\ &= \mathbf{w}^T \left( \sum_{i=1}^{N} \tilde{\mathbf{x}}_i \tilde{\mathbf{x}}_i^T \right) \mathbf{w} - 2 \left( \sum_{i=1}^{N} y_i \tilde{\mathbf{x}}_i^T \right) \mathbf{w} + \sum_{i=1}^{N} y_i^2 \\ &= \mathbf{w}^T \tilde{X}^T \tilde{X} \mathbf{w} - 2 \mathbf{y}^T \tilde{X} \mathbf{w} + \mathbf{y}^T \mathbf{y}. \end{aligned} \tag{24}$$

Another way to derive the matrix form is to utilise the l2-norm (see Section 1.2 in the maths notes):

$$\begin{aligned} O(\mathbf{w}) &= \sum_{i=1}^{N} (\mathbf{w}^T \tilde{\mathbf{x}}_i - y_i)^2 = \|\tilde{X} \mathbf{w} - \mathbf{y}\|_2^2 \\ &= (\tilde{X} \mathbf{w} - \mathbf{y})^T (\tilde{X} \mathbf{w} - \mathbf{y}) = (\mathbf{w}^T \tilde{X}^T - \mathbf{y}^T)(\tilde{X} \mathbf{w} - \mathbf{y}) \\ &= \mathbf{w}^T \tilde{X}^T \tilde{X} \mathbf{w} - \mathbf{y}^T \tilde{X} \mathbf{w} - \mathbf{w}^T \tilde{X}^T \mathbf{y} + \mathbf{y}^T \mathbf{y} \\ &= \mathbf{w}^T \tilde{X}^T \tilde{X} \mathbf{w} - 2 \mathbf{y}^T \tilde{X} \mathbf{w} + \mathbf{y}^T \mathbf{y}, \end{aligned} \tag{25}$$

where the last step uses the fact that the scalar $\mathbf{w}^T \tilde{X}^T \mathbf{y}$ equals its own transpose $\mathbf{y}^T \tilde{X} \mathbf{w}$.

The error function $O(\mathbf{w})$ contains three terms: $\mathbf{w}^T \tilde{X}^T \tilde{X} \mathbf{w}$ is a quadratic function of $\mathbf{w}$, $2 \mathbf{y}^T \tilde{X} \mathbf{w}$ is a linear function of $\mathbf{w}$, and $\mathbf{y}^T \mathbf{y}$ is a constant term. Utilising the gradient formulations for linear and quadratic functions (see Section 3 in the maths notes), it is straightforward to derive the gradient of $O(\mathbf{w})$ with respect to $\mathbf{w}$:

$$\nabla_{\mathbf{w}} O(\mathbf{w}) = (\tilde{X}^T \tilde{X})^T \mathbf{w} + \tilde{X}^T \tilde{X} \mathbf{w} - (2 \mathbf{y}^T \tilde{X})^T = 2 \tilde{X}^T \tilde{X} \mathbf{w} - 2 \tilde{X}^T \mathbf{y}. \tag{26}$$


When the minimum of $O(\mathbf{w})$ is reached, its gradient has to be equal to zero: $\nabla_{\mathbf{w}} O(\mathbf{w}) = \mathbf{0}$. Therefore

$$\tilde{X}^T \tilde{X} \mathbf{w}^* = \tilde{X}^T \mathbf{y}, \tag{27}$$

based on which the normal equations $\mathbf{w}^* = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \mathbf{y}$ are derived.
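A quick numerical sanity check of the gradient in Eq. (26) is easy to write; below is a minimal sketch (the random data and finite-difference step h are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 20, 4                                   # k = d + 1
X_tilde = np.hstack([np.ones((N, 1)), rng.normal(size=(N, k - 1))])
y = rng.normal(size=N)
w = rng.normal(size=k)

def O(w):
    """Sum-of-squares error, Eq. (15)."""
    return np.sum((X_tilde @ w - y) ** 2)

grad_analytic = 2 * X_tilde.T @ X_tilde @ w - 2 * X_tilde.T @ y  # Eq. (26)

# Central finite differences, one coordinate at a time.
h = 1e-6
grad_numeric = np.array([
    (O(w + h * e) - O(w - h * e)) / (2 * h)
    for e in np.eye(k)
])

assert np.allclose(grad_analytic, grad_numeric, atol=1e-4)
```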

4.2 Multi-output Case

The sum-of-squares error function in Eq. (16) can be re-written as

$$O(W) = \sum_{i=1}^{N} \sum_{j=1}^{c} (\mathbf{w}_j^T \tilde{\mathbf{x}}_i - y_{ij})^2 = \sum_{i=1}^{N} \|W^T \tilde{\mathbf{x}}_i - \mathbf{y}_i\|_2^2 = \|\tilde{X} W - Y\|_F^2. \tag{28}$$

You can check that the above equation holds using the definitions of the l2-norm and the Frobenius norm in Section 1.2 of the maths notes. This gives

$$\begin{aligned} O(W) &= \|\tilde{X} W - Y\|_F^2 \\ &= \mathrm{tr}[(\tilde{X} W - Y)^T (\tilde{X} W - Y)] = \mathrm{tr}[(W^T \tilde{X}^T - Y^T)(\tilde{X} W - Y)] \\ &= \mathrm{tr}(W^T \tilde{X}^T \tilde{X} W - Y^T \tilde{X} W - W^T \tilde{X}^T Y + Y^T Y) \\ &= \mathrm{tr}(W^T \tilde{X}^T \tilde{X} W) - \mathrm{tr}(Y^T \tilde{X} W) - \mathrm{tr}(W^T \tilde{X}^T Y) + \mathrm{tr}(Y^T Y). \end{aligned} \tag{29}$$

Based on the trace property shown in Eq. (15) of the maths notes, we have

$$\mathrm{tr}(W^T \tilde{X}^T Y) = \mathrm{tr}[(W^T \tilde{X}^T Y)^T] = \mathrm{tr}(Y^T \tilde{X} W). \tag{30}$$

Therefore,

$$O(W) = \mathrm{tr}(W^T \tilde{X}^T \tilde{X} W) - 2 \mathrm{tr}(W^T \tilde{X}^T Y) + \mathrm{tr}(Y^T Y). \tag{31}$$

We can use the following readily given trace derivative rules to compute the gradient:

$$\frac{\partial \, \mathrm{tr}(Z^T A)}{\partial Z} = A, \tag{32}$$

$$\frac{\partial \, \mathrm{tr}(Z^T B Z)}{\partial Z} = B Z + B^T Z. \tag{33}$$

In our case, we have $Z \leftarrow W$. We also have $B \leftarrow \tilde{X}^T \tilde{X}$ for the first term in $O(W)$, and $A \leftarrow \tilde{X}^T Y$ for the second term in $O(W)$. Therefore, the gradient of $O(W)$ with respect to $W$ is given by

$$\nabla_W O(W) = \tilde{X}^T \tilde{X} W + (\tilde{X}^T \tilde{X})^T W - 2 \tilde{X}^T Y = 2 \tilde{X}^T \tilde{X} W - 2 \tilde{X}^T Y. \tag{34}$$

Setting the gradient to zero, $\nabla_W O(W) = \mathbf{0}$, we have

$$\tilde{X}^T \tilde{X} W^* = \tilde{X}^T Y, \tag{35}$$

which gives the normal equations $W^* = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T Y$.
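Since Eq. (35) holds column by column, the multi-output solution is simply the single-output solution applied to each column of Y. A minimal sketch verifying this on hypothetical random data:

```python
import numpy as np

rng = np.random.default_rng(2)
N, k, c = 30, 4, 2
X_tilde = np.hstack([np.ones((N, 1)), rng.normal(size=(N, k - 1))])
Y = rng.normal(size=(N, c))

W_star = np.linalg.pinv(X_tilde) @ Y           # Eq. (19), all outputs at once

# Solving each output independently, Eq. (17), gives the same columns.
for j in range(c):
    w_j = np.linalg.pinv(X_tilde) @ Y[:, j]
    assert np.allclose(W_star[:, j], w_j)
```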


4.3 Regularised Least Squares Model

The modified error function of the regularised least squares model in Eq. (20) can be re-written in matrix form as below:

$$O(\mathbf{w}) = \|\tilde{X} \mathbf{w} - \mathbf{y}\|_2^2 + \lambda \|\mathbf{w}\|_2^2. \tag{36}$$

For the multi-output case, the modified error function in Eq. (21) can be re-written as

$$O(W) = \|\tilde{X} W - Y\|_F^2 + \lambda \|W\|_F^2. \tag{37}$$

Based on these, the gradients of $O(\mathbf{w})$ with respect to $\mathbf{w}$ and of $O(W)$ with respect to $W$ can be derived by following a similar procedure as explained above. You can give it a go as a practice.
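For checking your own working: this result is not derived in the notes, but it follows directly from Eq. (26) together with the gradient of the penalty term $\lambda \|\mathbf{w}\|_2^2$, which is $2 \lambda \mathbf{w}$ (and likewise $2 \lambda W$ for the Frobenius-norm penalty):

$$\nabla_{\mathbf{w}} O(\mathbf{w}) = 2 \tilde{X}^T \tilde{X} \mathbf{w} - 2 \tilde{X}^T \mathbf{y} + 2 \lambda \mathbf{w}, \qquad \nabla_W O(W) = 2 \tilde{X}^T \tilde{X} W - 2 \tilde{X}^T Y + 2 \lambda W.$$

Setting these gradients to zero recovers the regularised normal equations, Eqs. (22) and (23).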