SLIDE 1

Matrix differential calculus

10-725 Optimization Geoff Gordon Ryan Tibshirani

SLIDE 2

Review

  • Matrix differentials: sol’n to matrix-calculus pain
  • compact way of writing Taylor expansions, or …
  • definition: df = a(x; dx) [+ r(dx)]
  • a(x; ·) linear in 2nd arg
  • r(dx)/‖dx‖ → 0 as dx → 0
  • d(·) is linear: passes through + and scalar ×
  • Generalizes Jacobian, Hessian, gradient, velocity

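As a quick numerical illustration of the definition (my example, not from the slides): for f(X) = tr(XᵀX), the linear part is a(X; dX) = tr((2X)ᵀdX), and the remainder r(dX) vanishes faster than ‖dX‖.

```python
import numpy as np

# Numerical check (illustrative example, not from the slides) of
# df = a(x; dx) + r(dx) for f(X) = tr(X^T X), where a(X; dX) = tr((2X)^T dX).
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))
dX = rng.standard_normal((3, 3))

for eps in [1e-1, 1e-3, 1e-5]:
    df = np.trace((X + eps * dX).T @ (X + eps * dX)) - np.trace(X.T @ X)
    a = np.trace((2 * X).T @ (eps * dX))                # linear part a(X; eps*dX)
    print(eps, abs(df - a) / np.linalg.norm(eps * dX))  # r(dX)/||dX|| -> 0
```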

slide-3
SLIDE 3

Geoff Gordon—10-725 Optimization—Fall 2012

Review

  • Chain rule
  • Product rule
  • Bilinear functions: cross product, Kronecker, Frobenius, Hadamard, Khatri-Rao, …
  • Identities
  • rules for working with …, tr()
  • trace rotation
  • Identification theorems

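For instance, trace rotation, tr(ABC) = tr(BCA) = tr(CAB), is what lets us collect every differential onto one side of a tr(·); a quick numerical sanity check (my example, not from the slides):

```python
import numpy as np

# Trace rotation: tr(ABC) = tr(BCA) = tr(CAB), even for non-square factors.
rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 2))

print(np.trace(A @ B @ C), np.trace(B @ C @ A), np.trace(C @ A @ B))
```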

SLIDE 4

Finding a maximum

  • …or minimum, or saddle point


ID for df(x):   scalar x       vector x        matrix X
scalar f:       df = a dx      df = aᵀdx       df = tr(AᵀdX)


SLIDE 6

And so forth…

  • Can’t draw it for X a matrix, tensor, …
  • But same principle holds: set the coefficient of dX to 0 to find a min, max, or saddle point:
  • if df = c(A; dX) [+ r(dX)], then A plays the role of the gradient
  • so: max/min/saddle point iff A = 0
  • for c(·; ·) any “product” (see the worked example below)

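A worked example of this identification (mine, not from the slides): take f(X) = ½‖X − B‖²_F. Then df = tr((X − B)ᵀdX) [+ r(dX)], so the coefficient of dX is A = X − B, and the unique stationary point, here a minimum, is X = B.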

SLIDE 7

Ex: Infomax ICA

  • Training examples xᵢ ∈ ℝᵈ, i = 1:n
  • Transformation yᵢ = g(Wxᵢ)
  • W ∈ ℝᵈˣᵈ
  • g(z) =
  • Want:

[Figure: scatter plots of the inputs xᵢ, the transformed points Wxᵢ, and the outputs yᵢ = g(Wxᵢ)]

SLIDE 8

Volume rule

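The slide body is blank in this transcript; the rule it names is presumably the standard change-of-variables formula used on the next slides: if y = f(x) is invertible with Jacobian J = ∂y/∂x, then the densities relate by P(y) = P(x) / |det J|.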

SLIDE 9

Ex: Infomax ICA

  • yᵢ = g(Wxᵢ)
  • dyᵢ = Jᵢ dxᵢ
  • Method: max_W Σᵢ −ln P(yᵢ)
  • where P(yᵢ) = P(xᵢ) / |det Jᵢ|   (volume rule)


SLIDE 10

Gradient

  • L = Σᵢ ln |det Jᵢ|,  where yᵢ = g(Wxᵢ) and dyᵢ = Jᵢ dxᵢ
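The step from here to the gradient uses the standard log-determinant identity d ln|det J| = tr(J⁻¹ dJ), so dL = Σᵢ tr(Jᵢ⁻¹ dJᵢ).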

SLIDE 11

Gradient

  • Jᵢ = diag(uᵢ) W
  • dJᵢ = diag(uᵢ) dW + diag(vᵢ) diag(dW xᵢ) W
  • dL =
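One way to fill in the blank (my reconstruction, using d ln|det J| = tr(J⁻¹ dJ) and trace rotation, with ⊘ for elementwise division): tr(Jᵢ⁻¹ dJᵢ) = tr(W⁻¹ dW) + tr(xᵢ (vᵢ ⊘ uᵢ)ᵀ dW), so dL = tr(Gᵀ dW) with G = Σᵢ [W⁻ᵀ + (vᵢ ⊘ uᵢ) xᵢᵀ], which is the [W⁻ᵀ + C] form that appears on the natural-gradient slides.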

SLIDE 12

Natural gradient

  • L(W): ℝᵈˣᵈ → ℝ,  dL = tr(Gᵀ dW)
  • step S = arg max_S M(S) = tr(GᵀS) − ‖S W⁻¹‖²_F / 2
  • scalar case: M = gs − s²/(2w²)
  • M =
  • dM =
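Filling in the blanks (my reconstruction): d(½‖S W⁻¹‖²_F) = tr((S W⁻¹W⁻ᵀ)ᵀ dS), so dM = tr((G − S W⁻¹W⁻ᵀ)ᵀ dS), and dM = 0 iff S = G WᵀW. This matches the scalar case, where dM/ds = g − s/w² = 0 gives s = g w², and the ICA step on the next slide.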

SLIDE 13

ICA natural gradient

  • step S = [W⁻ᵀ + C] WᵀW = W + C WᵀW

  • start with W₀ = I

[Figure: scatter plots of Wxᵢ and yᵢ over the course of the run]


SLIDE 15

ICA on natural image patches


SLIDE 16

ICA on natural image patches


SLIDE 17

More info

  • Minka’s cheat sheet: http://research.microsoft.com/en-us/um/people/minka/papers/matrix/
  • Magnus & Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. 2nd ed., Wiley, 1999. http://www.amazon.com/Differential-Calculus-Applications-Statistics-Econometrics/dp/047198633X
  • Bell & Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, v7, 1995.

SLIDE 18

Newton’s method

10-725 Optimization Geoff Gordon Ryan Tibshirani

SLIDE 19

Nonlinear equations

  • x ∈ ℝᵈ,  f: ℝᵈ → ℝᵈ, differentiable
  • solve: f(x) = 0
  • Taylor: f(x + dx) = f(x) + J dx [+ r(dx)]
  • J: the Jacobian of f
  • Newton: x ← x − J⁻¹ f(x)   (runnable sketch below)

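A minimal runnable sketch of the Newton iteration (the example system is mine, not from the slides):

```python
import numpy as np

def newton(f, jac, x, iters=20, tol=1e-12):
    """Newton's method for f(x) = 0: repeatedly solve J dx = -f(x)."""
    for _ in range(iters):
        dx = np.linalg.solve(jac(x), -f(x))
        x = x + dx
        if np.linalg.norm(dx) < tol:
            break
    return x

# Example: intersect the unit circle with the line x = y.
f = lambda v: np.array([v[0]**2 + v[1]**2 - 1, v[0] - v[1]])
J = lambda v: np.array([[2 * v[0], 2 * v[1]], [1.0, -1.0]])
print(newton(f, J, np.array([1.0, 0.5])))   # -> [0.7071..., 0.7071...]
```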

SLIDE 20

Error analysis

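The analysis itself is blank in this transcript; its standard conclusion, visible in the digit-doubling on the next slide, is local quadratic convergence: with eₖ = xₖ − x*, Taylor-expanding f around x* gives ‖eₖ₊₁‖ = O(‖eₖ‖²), provided x₀ is close enough to x* and J(x*) is invertible.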

SLIDE 21

dx = x*(1-x*phi)   (Newton update for f(x) = 1/x − φ, converging to 1/φ)

0: 0.7500000000000000
1: 0.5898558813281841
2: 0.6167492604787597
3: 0.6180313181415453
4: 0.6180339887383547
5: 0.6180339887498948
6: 0.6180339887498949
7: 0.6180339887498948
8: 0.6180339887498949
*: 0.6180339887498948
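A few lines of Python (mine, not from the slides) reproduce the table:

```python
phi = (1 + 5 ** 0.5) / 2        # golden ratio
x = 0.75                        # good initialization
for k in range(9):
    print(f"{k}: {x:.16f}")
    x = x + x * (1 - phi * x)   # Newton update dx = x(1 - phi x)
print(f"*: {1 / phi:.16f}")
```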

SLIDE 22

Bad initialization


0: 1.3000000000000000
1: -0.1344774409873226
2: -0.2982157033270080
3: -0.7403273854022190
4: -2.3674743431148597
5: -13.8039236412225819
6: -335.9214859516196157
7: -183256.0483360671496484
8: -54338444778.1145248413085938

(the iterates diverge: x₀ = 1.3 lies outside Newton’s basin of attraction)

SLIDE 23

Minimization

  • x ∈ ℝᵈ,  f: ℝᵈ → ℝ, twice differentiable
  • find: min_x f(x), i.e. a stationary point f′(x) = 0
  • Newton: x ← x − (f″(x))⁻¹ f′(x)   (Newton’s method on the equations f′(x) = 0)


SLIDE 24

Descent

  • Newton step: d = −(f″(x))⁻¹ f′(x)
  • Gradient step: −g = −f′(x)
  • Taylor: df = gᵀdx [+ r(dx)]
  • Let t > 0, set dx = t d
  • df = −t gᵀ(f″(x))⁻¹ g [+ r]
  • So: if f″(x) ≻ 0, then df < 0 for small t, i.e. the Newton direction is a descent direction


SLIDE 25

Steepest descent

[Figure: a point x with the Newton step x + ∆x_nt and the normalized steepest-descent step x + ∆x_nsd]

g = f′(x),  H = f″(x),  ‖d‖_H = (dᵀHd)^{1/2}
(the Newton step is the steepest-descent direction measured in the norm ‖·‖_H)

SLIDE 26

Newton w/ line search

  • Pick x₁
  • For k = 1, 2, …
  • gₖ = f′(xₖ);  Hₖ = f″(xₖ)   (gradient & Hessian)
  • dₖ = −Hₖ \ gₖ   (Newton direction)
  • tₖ = 1
  • while f(xₖ + tₖ dₖ) > f(xₖ) + tₖ gₖᵀdₖ / 2   (backtracking line search)
  • tₖ = β tₖ   (shrink step, β < 1)
  • xₖ₊₁ = xₖ + tₖ dₖ
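A runnable sketch of the loop above (the test function and β = 0.5 are my choices, not from the slides):

```python
import numpy as np

def damped_newton(f, grad, hess, x, beta=0.5, iters=50, tol=1e-10):
    """Newton's method with backtracking line search (Armijo constant 1/2)."""
    for _ in range(iters):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        d = -np.linalg.solve(H, g)                    # Newton direction: d = -H \ g
        t = 1.0
        while f(x + t * d) > f(x) + t * (g @ d) / 2:  # backtracking
            t *= beta
        x = x + t * d
    return x

# Example: minimize f(x, y) = e^x + e^-x + y^2 (minimum at the origin).
f = lambda v: np.exp(v[0]) + np.exp(-v[0]) + v[1] ** 2
grad = lambda v: np.array([np.exp(v[0]) - np.exp(-v[0]), 2 * v[1]])
hess = lambda v: np.diag([np.exp(v[0]) + np.exp(-v[0]), 2.0])
print(damped_newton(f, grad, hess, np.array([3.0, 1.0])))   # -> ~[0, 0]
```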

SLIDE 27

Properties of damped Newton

  • Affine invariant: suppose g(x) = f(Ax + b)
  • x₁, x₂, … from Newton on g(·)
  • y₁, y₂, … from Newton on f(·)
  • If y₁ = Ax₁ + b, then yₖ = Axₖ + b for all k
  • Convergent:
  • if f bounded below, f(xₖ) converges
  • if f strictly convex with bounded level sets, xₖ converges
  • typically quadratic rate in a neighborhood of x*
