SLIDE 1

Optimization MS Maths Big Data

Alexandre Gramfort alexandre.gramfort@telecom-paristech.fr

Telecom ParisTech

M2 Maths Big Data

SLIDE 2

Plan

1. Notations
2. Ridge regression and quadratic forms
3. SVD
4. Woodbury
5. Dense Ridge
6. Sparse Ridge

Alexandre Gramfort - Telecom ParisTech Optimization MS Maths Big Data 2

SLIDE 3

Optimization problem

Definition (Optimization problem (P)):

min f(x), x ∈ C,

where f : Rn → R ∪ {+∞} is called the objective function.

C = {x ∈ Rn : g(x) ≤ 0 and h(x) = 0} is the feasible set.
g(x) ≤ 0 represents the inequality constraints, with g(x) = (g1(x), . . . , gp(x)), i.e. p constraints.
h(x) = 0 represents the equality constraints, with h(x) = (h1(x), . . . , hq(x)), i.e. q constraints.
An element x ∈ C is said to be feasible.
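As a quick illustration of these notations (a toy example, not from the slides), here is how f, g and h might look for a small problem in R², with a helper to test membership in the feasible set C:

```python
import numpy as np

# Toy problem: minimize f(x) = ||x||^2 subject to one inequality
# constraint g(x) <= 0 (p = 1) and one equality constraint h(x) = 0
# (q = 1).  The names f, g, h mirror the definition above.

def f(x):
    return float(np.dot(x, x))

def g(x):
    return np.array([1.0 - x[0]])  # g(x) <= 0  <=>  x[0] >= 1

def h(x):
    return np.array([x[1]])        # h(x) = 0   <=>  x[1] = 0

def is_feasible(x, tol=1e-9):
    return bool(np.all(g(x) <= tol) and np.all(np.abs(h(x)) <= tol))

x_feasible = np.array([1.0, 0.0])    # on the boundary of C
x_infeasible = np.array([0.0, 0.0])  # violates g(x) <= 0
```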


SLIDE 4

Taylor at order 2

Assuming f is twice differentiable, the Taylor expansion at order 2 of f at x reads:

∀h ∈ Rn,  f(x + h) = f(x) + ∇f(x)ᵀh + (1/2) hᵀ∇2f(x) h + o(‖h‖2)

∇f(x) ∈ Rn is the gradient, ∇2f(x) ∈ Rn×n the Hessian matrix.
Remark: this gives a local quadratic approximation of f around x.
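A quick numerical sanity check (not from the slides): the error of the order-2 expansion should vanish faster than ‖h‖². For the polynomial test function below the leading error term is cubic in the step size:

```python
import numpy as np

def f(x):
    return np.sum(x ** 4)

def grad(x):
    return 4 * x ** 3          # gradient of sum(x^4)

def hess(x):
    return np.diag(12 * x ** 2)  # Hessian is diagonal here

x = np.array([1.0, -2.0, 0.5])
rng = np.random.default_rng(0)
d = rng.standard_normal(3)     # fixed direction

errors = []
for t in [1e-1, 1e-2, 1e-3]:
    h = t * d
    taylor2 = f(x) + grad(x) @ h + 0.5 * h @ hess(x) @ h
    errors.append(abs(f(x + h) - taylor2))
# errors should shrink roughly like t^3 (faster than t^2)
```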



SLIDE 6

Ridge regression

We consider problems with n samples (observations) and p features (variables).

Definition (Ridge regression): Let y ∈ Rn be the n targets to predict and (xi)i the n samples in Rp. Ridge regression consists in solving the following problem:

min_{w,b}  (1/2)‖y − Xw − b‖² + (λ/2)‖w‖²,  λ > 0,

where w ∈ Rp is called the weight vector, b ∈ R is the intercept (a.k.a. bias) and the i-th row of X is xi.

Remark: note that the intercept is not penalized by λ.


SLIDE 7

Taking care of the intercept

There are different ways to deal with the intercept.

Option 1: center the target y and each feature column. After centering the problem reads:

min_w  (1/2)‖y − Xw‖² + (λ/2)‖w‖²,  λ > 0

Option 2: add a column of ones to X and try not to penalize it (too much).

Exercise: denote by ȳ ∈ R the mean of y and by x̄ ∈ Rp the vector of column means of X. Show that b̂ = −x̄ᵀŵ + ȳ.
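A small numpy check of the exercise's claim (a sketch, not the slides' code): solving the centered problem of Option 1 and setting b̂ = ȳ − x̄ᵀŵ satisfies the first-order optimality conditions of the full problem with intercept:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 40, 5, 0.5
X = rng.standard_normal((n, p)) + 3.0   # non-zero column means on purpose
y = X @ rng.standard_normal(p) + 2.0

x_bar = X.mean(axis=0)
y_bar = y.mean()

# Option 1: center, solve for w on centered data, recover the intercept.
Xc = X - x_bar
yc = y - y_bar
w_hat = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
b_hat = y_bar - x_bar @ w_hat   # the exercise's formula
```

At the optimum of the full problem, the residual r = y − Xŵ − b̂ must sum to zero (optimality in b) and satisfy Xᵀr = λŵ (optimality in w); both hold here.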


SLIDE 8

Ridge regression

Definition (Quadratic form): A quadratic form reads

f(x) = (1/2) xᵀAx + bᵀx + c,

where x ∈ Rp, A ∈ Rp×p, b ∈ Rp and c ∈ R.
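Anticipating the next slide's first question, here is a numerical sketch (not from the slides) identifying the ridge objective (without intercept) with a quadratic form, with A = XᵀX + λIp, b = −Xᵀy and c = (1/2)yᵀy:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 30, 4, 1.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Ridge (no intercept) as a quadratic form:
#   1/2 ||y - Xw||^2 + lam/2 ||w||^2 = 1/2 w^T A w + b^T w + c
A = X.T @ X + lam * np.eye(p)
b = -X.T @ y
c = 0.5 * y @ y

def ridge(w):
    r = y - X @ w
    return 0.5 * r @ r + 0.5 * lam * w @ w

def quad(w):
    return 0.5 * w @ A @ w + b @ w + c
```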


SLIDE 9

Ridge regression

Questions:
- Show that ridge regression boils down to the minimization of a quadratic form.
- Propose a closed-form solution.
- Show that the solution is obtained by solving a linear system.
- Is the objective function strongly convex?
- Assuming n < p, what is the value of the strong convexity constant?



SLIDE 11

Singular value decomposition (SVD)

The SVD is a factorization of a (here real) matrix:

M = UΣVᵀ,  where M ∈ Rn×p, U ∈ Rn×n, Σ ∈ Rn×p, V ∈ Rp×p

- UᵀU = UUᵀ = In (U is orthogonal)
- VᵀV = VVᵀ = Ip (V is orthogonal)
- Σ is diagonal; the Σi,i are called the singular values
- the columns of U are the left-singular vectors
- the columns of V are the right-singular vectors
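These properties are easy to verify numerically (a sketch, not the slides' code); numpy returns the singular values as a 1-D array and Vᵀ directly, so Σ has to be embedded back into an n × p matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 6, 4
M = rng.standard_normal((n, p))

# Full SVD: M = U @ S @ Vt with U (n x n) and V (p x p) orthogonal
# and S (n x p) diagonal.
U, s, Vt = np.linalg.svd(M, full_matrices=True)
S = np.zeros((n, p))
S[:p, :p] = np.diag(s)   # s holds the min(n, p) singular values
```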


SLIDE 12

Singular value decomposition (SVD)

- U contains the eigenvectors of MMᵀ, associated with the eigenvalues Σi,i²
- V contains the eigenvectors of MᵀM, associated with the eigenvalues Σi,i²
- we set Σi,i = 0 for i > min(n, p)
- the SVD is particularly useful to find the rank, null-space, image and pseudo-inverse of a matrix
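A small numpy sketch (not from the slides) of the last point: reading the rank off the singular values of a rank-deficient matrix and building its pseudo-inverse from the SVD:

```python
import numpy as np

rng = np.random.default_rng(4)
# Build a 5 x 4 matrix of rank 2 by construction.
B = rng.standard_normal((5, 2))
C2 = rng.standard_normal((2, 4))
M = B @ C2

U, s, Vt = np.linalg.svd(M)   # s has min(5, 4) = 4 entries, two ~ 0
tol = 1e-10
rank = int(np.sum(s > tol))

# Pseudo-inverse: transpose the factorization and invert only the
# non-zero singular values.
s_inv = np.array([1.0 / x if x > tol else 0.0 for x in s])
M_pinv = Vt.T @ np.diag(s_inv) @ U.T[:len(s), :]
```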



SLIDE 14

Matrix inversion lemma

Proposition (Matrix inversion lemma): also known as the Sherman–Morrison–Woodbury formula, it states that

(A + UCV)⁻¹ = A⁻¹ − A⁻¹U (C⁻¹ + VA⁻¹U)⁻¹ VA⁻¹,

where A ∈ Rn×n, U ∈ Rn×k, C ∈ Rk×k, V ∈ Rk×n.
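The identity is easy to check numerically on random matrices (a sketch, not the slides' code; A and C are shifted towards the identity so that all the inverses involved exist comfortably):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 8, 3
A = rng.standard_normal((n, n)) + 10 * np.eye(n)   # well-conditioned
U = rng.standard_normal((n, k))
C = np.eye(k) + 0.1 * rng.standard_normal((k, k))  # invertible k x k block
V = rng.standard_normal((k, n))

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + U @ C @ V)
rhs = Ainv - Ainv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ Ainv @ U) @ V @ Ainv
```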


SLIDE 15

Matrix inversion lemma (proof)

Just check that (A + UCV) times the right-hand side of the Woodbury identity gives the identity matrix:

(A + UCV)[A⁻¹ − A⁻¹U(C⁻¹ + VA⁻¹U)⁻¹VA⁻¹]
  = I + UCVA⁻¹ − (U + UCVA⁻¹U)(C⁻¹ + VA⁻¹U)⁻¹VA⁻¹
  = I + UCVA⁻¹ − UC(C⁻¹ + VA⁻¹U)(C⁻¹ + VA⁻¹U)⁻¹VA⁻¹
  = I + UCVA⁻¹ − UCVA⁻¹
  = I

Question: using the matrix inversion lemma, show that if n < p the ridge regression problem can be solved by inverting a matrix of size n × n rather than p × p.



SLIDE 17

Primal and dual implementation

The solution of the ridge regression problem (without intercept) is obtained by solving the problem in the primal form:

ŵ = (XᵀX + λIp)⁻¹Xᵀy

or in the dual form:

ŵ = Xᵀ(XXᵀ + λIn)⁻¹y

In the dual formulation the matrix to invert is in Rn×n. What if X is sparse, n is 1e5 and p is 1e6?
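A quick numpy check (not from the slides) that the two forms agree, in a regime with n < p where the dual form inverts the much smaller matrix:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, lam = 20, 100, 0.1   # n < p: the dual form works with an n x n matrix
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Primal: solve a p x p system.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
# Dual: solve an n x n system, then map back with X^T.
w_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)
```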




SLIDE 20

Conjugate gradient: solve Ax = b, with A ∈ Rn×n symmetric positive definite and b ∈ Rn

1:  x0 ∈ Rn, g0 = Ax0 − b
2:  for k = 0 to n do
3:    if gk = 0 then
4:      break
5:    end if
6:    if k = 0 then
7:      wk = g0
8:    else
9:      αk = −⟨gk, Awk−1⟩ / ⟨wk−1, Awk−1⟩
10:     wk = gk + αk wk−1
11:   end if
12:   ρk = ⟨gk, wk⟩ / ⟨wk, Awk⟩
13:   xk+1 = xk − ρk wk
14:   gk+1 = A xk+1 − b
15: end for
16: return xk+1
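A direct Python transcription of the pseudocode above (a sketch; names follow the slide, with g the gradient of the quadratic ½xᵀAx − bᵀx and w the conjugate direction):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10):
    """Solve Ax = b for symmetric positive definite A,
    following the slide's pseudocode."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    g = A @ x - b                      # line 1: g0 = A x0 - b
    w = g.copy()                       # line 7: w0 = g0
    for k in range(n + 1):             # line 2: for k = 0 to n
        if np.linalg.norm(g) <= tol:   # line 3: stop when gk = 0
            break
        if k > 0:                      # lines 9-10: update direction
            alpha = -(g @ (A @ w)) / (w @ (A @ w))
            w = g + alpha * w
        rho = (g @ w) / (w @ (A @ w))  # line 12: step size
        x = x - rho * w                # line 13
        g = A @ x - b                  # line 14
    return x

# Usage on a random SPD system.
rng = np.random.default_rng(7)
Q = rng.standard_normal((10, 10))
A = Q @ Q.T + 10 * np.eye(10)   # symmetric positive definite
b = rng.standard_normal(10)
x_cg = conjugate_gradient(A, b)
```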


SLIDE 21

Sparse ridge with CG

  • cf. Notebook
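The notebook itself is not reproduced here; the following is a hypothetical sketch of what it might do, using scipy's CG with a LinearOperator so that the p × p matrix XᵀX + λI is never formed explicitly (only matrix-vector products with the sparse X are needed):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(8)
n, p, lam = 200, 500, 1.0
X = sp.random(n, p, density=0.01, random_state=42, format="csr")
y = rng.standard_normal(n)

# Solve (X^T X + lam I) w = X^T y without ever materializing X^T X:
def matvec(w):
    return X.T @ (X @ w) + lam * w

A = LinearOperator((p, p), matvec=matvec)
w_hat, info = cg(A, X.T @ y)   # info == 0 means CG converged
```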


SLIDE 22

Logistic regression with CG

  • cf. Notebook
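Again the notebook is not included; a minimal sketch (assumed synthetic data and regularization strength, labels in {−1, +1}) of fitting ℓ2-regularized logistic regression with nonlinear CG via scipy.optimize.minimize:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(9)
n, p = 200, 5
X = rng.standard_normal((n, p))
w_true = rng.standard_normal(p)
y = np.where(X @ w_true > 0, 1.0, -1.0)   # labels in {-1, +1}

def loss(w):
    # l2-regularized logistic loss; logaddexp(0, -z) = log(1 + e^{-z})
    z = y * (X @ w)
    return np.logaddexp(0.0, -z).sum() + 0.5 * w @ w

def grad(w):
    # d/dz log(1 + e^{-z}) = -expit(-z)
    z = y * (X @ w)
    return X.T @ (-expit(-z) * y) + w

res = minimize(loss, np.zeros(p), jac=grad, method="CG")
```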


SLIDE 23

Warm starts and paths

In machine learning it is common to solve a problem that is very similar to a previous one:
- You train a model every day and only need to "update" the model.
- You look for the best hyperparameter and evaluate the model on a grid of values, for example a grid of λ.
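A minimal sketch (illustrative, not the course's code) of warm-starting along a λ path: solve the ridge normal equations with scipy's CG for each λ, initializing each solve at the previous solution, and compare the total iteration count with cold starts:

```python
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(10)
n, p = 100, 50
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
G = X.T @ X
b = X.T @ y
lambdas = np.logspace(2, -2, 20)  # from strong to weak regularization

def solve_path(warm_start):
    total_iters = 0
    w = np.zeros(p)
    for lam in lambdas:
        n_iter = [0]
        def callback(xk):            # called once per CG iteration
            n_iter[0] += 1
        x0 = w if warm_start else np.zeros(p)
        w, info = cg(G + lam * np.eye(p), b, x0=x0, callback=callback)
        total_iters += n_iter[0]
    return total_iters, w

cold_iters, w_cold = solve_path(False)
warm_iters, w_warm = solve_path(True)
# Warm starts should not need more iterations than cold starts.
```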
