

  1. Matrix differential calculus. 10-725 Optimization. Geoff Gordon & Ryan Tibshirani.

  2. Review
  • Matrix differentials: solution to matrix-calculus pain
   ‣ compact way of writing Taylor expansions, or …
   ‣ definition: df = a(x; dx) [+ r(dx)]
   ‣ a(x; ·) linear in its second argument
   ‣ r(dx)/||dx|| → 0 as dx → 0
  • d(·) is linear: passes through +, scalar ×
  • Generalizes Jacobian, Hessian, gradient, velocity
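The definition is easy to sanity-check numerically. A minimal sketch (my example, not from the slides): for f(X) = tr(X^T X) the differential is a(X; dX) = 2 tr(X^T dX), and the remainder r(dX) = tr(dX^T dX) vanishes faster than ||dX||.

```python
import numpy as np

# Check df = a(X; dX) [+ r(dX)] for f(X) = tr(X^T X), where a(X; dX) = 2 tr(X^T dX).
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))
dX = rng.standard_normal((3, 3))
for eps in [1e-1, 1e-2, 1e-3]:
    df = np.trace((X + eps * dX).T @ (X + eps * dX)) - np.trace(X.T @ X)
    a = 2 * np.trace(X.T @ (eps * dX))
    print(eps, (df - a) / (eps * np.linalg.norm(dX)))  # r(dX)/||dX|| -> 0
```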

  3. Review
  • Chain rule
  • Product rule
  • Bilinear functions: cross product, Kronecker, Frobenius, Hadamard, Khatri-Rao, …
  • Identities
   ‣ rules for working with ⊗, tr()
   ‣ trace rotation: tr(ABC) = tr(BCA) = tr(CAB)
  • Identification theorems
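Both the product rule and trace rotation are cheap to verify with random matrices; a sketch (my example):

```python
import numpy as np

# Product rule: d(XY) = dX Y + X dY; trace rotation: tr(ABC) = tr(CAB).
rng = np.random.default_rng(1)
X, Y, dX, dY = (rng.standard_normal((3, 3)) for _ in range(4))
eps = 1e-6
lhs = (X + eps * dX) @ (Y + eps * dY) - X @ Y   # exact change in XY
rhs = eps * (dX @ Y + X @ dY)                   # the differential
print(np.linalg.norm(lhs - rhs))                # O(eps^2), i.e. tiny

A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))
print(np.trace(A @ B @ C), np.trace(C @ A @ B))  # equal
```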

  4. Finding a maximum or minimum, or saddle point
  • ID for df(x), scalar f:
   ‣ scalar x: df = a dx
   ‣ vector x: df = a^T dx
   ‣ matrix X: df = tr(A^T dX)
  [Figure: scalar example plotted on −3 ≤ x ≤ 3]
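The matrix entry of the table is the identification theorem df = tr(A^T dX) ⟺ ∇f(X) = A; a finite-difference sketch with f(X) = tr(B^T X) (my example, not from the slides):

```python
import numpy as np

# f(X) = tr(B^T X) gives df = tr(B^T dX), so the identified gradient is B.
rng = np.random.default_rng(2)
B = rng.standard_normal((3, 3))
X = rng.standard_normal((3, 3))
G = np.zeros((3, 3))
eps = 1e-6
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3)); E[i, j] = eps
        G[i, j] = (np.trace(B.T @ (X + E)) - np.trace(B.T @ X)) / eps
print(np.allclose(G, B, atol=1e-5))   # True
```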


  6. And so forth…
  • Can't draw it for X a matrix, tensor, …
  • But the same principle holds: set the coefficient of dX to 0 to find a min, max, or saddle point:
   ‣ if df = c(A; dX) [+ r(dX)]
   ‣ then x is a max/min/saddle point iff A = 0
   ‣ for c(·; ·) any “product”
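For example (mine, not the slides'): f(x) = ||Ax − b||^2 has df = 2(Ax − b)^T A dx, so the coefficient of dx vanishes exactly at the normal equations A^T A x = A^T b.

```python
import numpy as np

# Setting the coefficient of dx to zero: 2 A^T (A x - b) = 0 => normal equations.
rng = np.random.default_rng(3)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)
x = np.linalg.solve(A.T @ A, A.T @ b)   # stationary point
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True: the minimizer
```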

  7. Ex: Infomax ICA
  • Training examples x_i ∈ ℝ^d, i = 1:n
  • Transformation y_i = g(Wx_i)
   ‣ W ∈ ℝ^{d×d}
   ‣ g(z) = 1/(1 + e^{−z}) elementwise (the logistic sigmoid, as in Bell & Sejnowski)
  • Want: outputs y_i with maximum entropy, i.e. independent components
  [Figures: scatter plots of x_i and Wx_i on [−10, 10]^2; y_i lands in the unit square]

  8. Volume rule
  • y = g(x) invertible, dy = J dx ⇒ P(y) = P(x) / |det J|
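A quick numeric check of the rule on a linear map of a standard Gaussian (my example, not from the slides): y = Ax has density N(0, AA^T), which must equal P(x)/|det A|.

```python
import numpy as np

# Volume rule: for y = A x with invertible A, p_y(y) = p_x(x) / |det A|.
A = np.array([[2.0, 1.0], [0.0, 1.0]])
x = np.array([0.3, -0.5])
y = A @ x
px = np.exp(-x @ x / 2) / (2 * np.pi)            # standard normal density in 2-D
Sigma = A @ A.T                                  # y ~ N(0, A A^T)
py_direct = np.exp(-y @ np.linalg.solve(Sigma, y) / 2) / (
    2 * np.pi * np.sqrt(np.linalg.det(Sigma)))
py_volume = px / abs(np.linalg.det(A))
print(py_direct, py_volume)                      # should agree
```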

  9. Ex: Infomax ICA
  • y_i = g(Wx_i)
   ‣ dy_i = J_i dx_i
  • Method: max_W Σ_i −ln(P(y_i))
   ‣ where P(y_i) = P(x_i) / |det J_i| by the volume rule
  [Figures: scatter of x_i, Wx_i, and y_i, as on slide 7]

  10. Gradient
  • L = Σ_i ln |det J_i|
  • y_i = g(Wx_i), dy_i = J_i dx_i

  11. Gradient
  • J_i = diag(u_i) W, with u_i = g′(Wx_i) and v_i = g″(Wx_i) elementwise
  • dJ_i = diag(u_i) dW + diag(v_i) diag(dW x_i) W
  • dL = Σ_i tr(J_i^{-1} dJ_i)
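A finite-difference check of this gradient (a sketch, not from the slides: it assumes g is the logistic sigmoid, for which u = g′ = y(1−y) and g″/g′ = 1−2y, giving the analytic gradient G = n W^{-T} + Σ_i (1−2y_i) x_i^T):

```python
import numpy as np

# L(W) = sum_i ln|det J_i| = n ln|det W| + sum_{i,k} ln g'(z_ik), with z_i = W x_i.
# For logistic g, ln g'(z) = ln y + ln(1-y), and G = n W^{-T} + (1 - 2Y) X^T.
rng = np.random.default_rng(0)
d, n = 3, 50
X = rng.standard_normal((d, n))                  # columns are the x_i
W = np.eye(d) + 0.1 * rng.standard_normal((d, d))

def L(W):
    Y = 1.0 / (1.0 + np.exp(-W @ X))
    return n * np.linalg.slogdet(W)[1] + np.sum(np.log(Y) + np.log(1.0 - Y))

Y = 1.0 / (1.0 + np.exp(-W @ X))
G = n * np.linalg.inv(W).T + (1.0 - 2.0 * Y) @ X.T

E = np.zeros((d, d)); E[0, 1] = 1e-6             # finite-difference probe
print((L(W + E) - L(W - E)) / 2e-6, G[0, 1])     # should agree to ~6 digits
```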

  12. Natural gradient
  • L(W): ℝ^{d×d} → ℝ, dL = tr(G^T dW)
  • step S = arg max_S M(S), M(S) = tr(G^T S) − ||S W^{-1}||_F^2 / 2
   ‣ scalar case: M = gs − s^2/(2w^2)
  • M = tr(G^T S) − tr(W^{-T} S^T S W^{-1}) / 2
  • dM = tr([G − S W^{-1} W^{-T}]^T dS); setting this to 0 gives S = G W^T W
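A random-instance check that S = G W^T W maximizes the subproblem (a sketch of the slide's derivation, on data I made up):

```python
import numpy as np

# M(S) = tr(G^T S) - ||S W^{-1}||_F^2 / 2 is strictly concave in S,
# so S* = G W^T W should beat any perturbation of itself.
rng = np.random.default_rng(1)
d = 4
G = rng.standard_normal((d, d))
W = np.eye(d) + 0.1 * rng.standard_normal((d, d))
M = lambda S: np.trace(G.T @ S) - np.linalg.norm(S @ np.linalg.inv(W), 'fro')**2 / 2
S_star = G @ W.T @ W
for _ in range(3):
    S = S_star + 0.1 * rng.standard_normal((d, d))
    print(M(S_star) >= M(S))   # True every time
```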

  13. ICA natural gradient
  • step: S = [W^{-T} + C] W^T W = W + C W^T W (no inverse of W needed)
  • start with W_0 = I
  [Figures: scatter of Wx_i and y_i as training proceeds]
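A runnable sketch of the whole loop, under the same assumptions as before (logistic g, so the gradient summed over examples is G = n W^{-T} + C with C = (1 − 2Y) X^T; then G W^T W = [n I + (1 − 2Y)(WX)^T] W, inverse-free):

```python
import numpy as np

# Natural-gradient infomax ICA on synthetic mixed sources (my toy data).
rng = np.random.default_rng(0)
d, n = 2, 1000
sources = rng.laplace(size=(d, n))          # independent heavy-tailed sources
X = rng.standard_normal((d, d)) @ sources   # mixed observations
W, eta = np.eye(d), 0.01 / n                # start with W_0 = I
for _ in range(500):
    Z = W @ X
    Y = 1.0 / (1.0 + np.exp(-Z))
    W = W + eta * (n * np.eye(d) + (1.0 - 2.0 * Y) @ Z.T) @ W
print(W)   # rows should align with unmixing directions (up to scale/permutation)
```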


  15. ICA on natural image patches [figure]

  16. ICA on natural image patches [figure]

  17. More info
  • Minka's cheat sheet: http://research.microsoft.com/en-us/um/people/minka/papers/matrix/
  • Magnus & Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. 2nd ed., Wiley, 1999. http://www.amazon.com/Differential-Calculus-Applications-Statistics-Econometrics/dp/047198633X
  • Bell & Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, vol. 7, 1995.

  18. Newton's method. 10-725 Optimization. Geoff Gordon & Ryan Tibshirani.

  19. Nonlinear equations
  • x ∈ ℝ^d, f: ℝ^d → ℝ^d, differentiable
   ‣ solve: f(x) = 0
  • Taylor: f(x + Δx) ≈ f(x) + J Δx
   ‣ J: the d × d Jacobian ∂f/∂x
  • Newton: x ← x − J^{-1} f(x)
  [Figure: 1-D example; the Newton step lands at the zero of the tangent line]
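A minimal sketch of the update on a toy system (the particular f is my example, not from the slides):

```python
import numpy as np

# Newton for f(x) = 0: repeat x <- x - J(x)^{-1} f(x).
f = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
J = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])
x = np.array([1.0, 0.5])
for _ in range(8):
    x = x - np.linalg.solve(J(x), f(x))
print(x)   # converges to (1/sqrt(2), 1/sqrt(2))
```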

  20. Error analysis
  • near a root with nonsingular Jacobian, Newton converges quadratically: e_{k+1} = O(e_k^2), so the number of correct digits roughly doubles per iteration (next slide)

  21. dx = x(1 − xφ)
  0: 0.7500000000000000
  1: 0.5898558813281841
  2: 0.6167492604787597
  3: 0.6180313181415453
  4: 0.6180339887383547
  5: 0.6180339887498948
  6: 0.6180339887498949
  7: 0.6180339887498948
  8: 0.6180339887498949
  *: 0.6180339887498948
  (the limit is 1/φ = 0.6180339887…; correct digits roughly double each iteration)
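The update x ← x + x(1 − xφ) is Newton's method applied to f(x) = 1/x − φ, whose root is 1/φ; note it needs no division. A few lines reproduce the table:

```python
# Reproduce the slide's iterates: Newton for f(x) = 1/x - phi.
phi = (1.0 + 5.0 ** 0.5) / 2.0
x = 0.75                      # set x = 1.3 to reproduce the next slide instead
for k in range(9):
    print(f"{k}: {x:.16f}")
    x = x + x * (1.0 - x * phi)
```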

  22. Bad initialization
  1.3000000000000000
  -0.1344774409873226
  -0.2982157033270080
  -0.7403273854022190
  -2.3674743431148597
  -13.8039236412225819
  -335.9214859516196157
  -183256.0483360671496484
  -54338444778.1145248413085938
  • started at x_0 = 1.3, the same iteration diverges: since 1 − φx^+ = (1 − φx)^2, convergence requires 0 < x_0 < 2/φ ≈ 1.236

  23. Minimization
  • x ∈ ℝ^d, f: ℝ^d → ℝ, twice differentiable
   ‣ find: min_x f(x), i.e. solve f′(x) = 0
  • Newton: x ← x − (f″(x))^{-1} f′(x)

  24. Descent
  • Newton step: d = −(f″(x))^{-1} f′(x)
  • Gradient step: −g = −f′(x)
  • Taylor: df = f′(x)^T dx [+ r(dx)]
  • Let t > 0, set dx = t d
   ‣ df = −t f′(x)^T (f″(x))^{-1} f′(x)
  • So: if f″(x) ≻ 0, then df < 0, and the Newton direction is a descent direction

  25. Steepest descent
  • g = f′(x), H = f″(x)
  • Hessian norm: ||d||_H = (d^T H d)^{1/2}
  [Figure: level sets of f near x, comparing the steepest-descent step x + Δx_nsd (in the H-norm) with the Newton step x + Δx_nt]

  26. Newton w/ line search
  • Pick x_1
  • For k = 1, 2, …
   ‣ g_k = f′(x_k); H_k = f″(x_k) (gradient & Hessian)
   ‣ d_k = −H_k \ g_k (Newton direction)
   ‣ t_k = 1 (backtracking line search:)
   ‣ while f(x_k + t_k d_k) > f(x_k) + t_k g_k^T d_k / 2: t_k = β t_k, with β < 1
   ‣ x_{k+1} = x_k + t_k d_k (step)
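A runnable version of this loop (a sketch; the quadratic test problem is my example, and pure Newton solves it in one damped step):

```python
import numpy as np

# Damped Newton with backtracking, following the slide's pseudocode.
def newton(f, grad, hess, x, beta=0.5, iters=20):
    for _ in range(iters):
        g, H = grad(x), hess(x)
        d = -np.linalg.solve(H, g)                    # d_k = -H_k \ g_k
        t = 1.0
        while f(x + t * d) > f(x) + t * (g @ d) / 2:  # backtracking
            t *= beta
        x = x + t * d
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])                # positive definite
b = np.array([1.0, -1.0])
x = newton(lambda x: 0.5 * x @ A @ x - b @ x,
           lambda x: A @ x - b,
           lambda x: A,
           np.zeros(2))
print(x, np.linalg.solve(A, b))                       # should agree
```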

  27. Properties of damped Newton
  • Affine invariant: suppose g(x) = f(Ax + b)
   ‣ x_1, x_2, … from Newton on g()
   ‣ y_1, y_2, … from Newton on f()
   ‣ if y_1 = Ax_1 + b, then y_k = Ax_k + b for every k
  • Convergent:
   ‣ if f bounded below, f(x_k) converges
   ‣ if f strictly convex with bounded level sets, x_k converges
   ‣ typically quadratic rate in a neighborhood of x*
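Affine invariance is easy to confirm numerically; a sketch with a toy strictly convex f (my example) and the pure Newton step:

```python
import numpy as np

# Run pure Newton on g(x) = f(Ax + b) and on f; iterates keep y_k = A x_k + b.
A = np.array([[2.0, 0.5], [0.0, 1.0]]); b = np.array([1.0, -2.0])
f_grad = lambda y: np.array([y[0] + y[0]**3, 2.0 * y[1]])  # f = y1^2/2 + y1^4/4 + y2^2
f_hess = lambda y: np.diag([1.0 + 3.0 * y[0]**2, 2.0])
g_grad = lambda x: A.T @ f_grad(A @ x + b)                  # chain rule
g_hess = lambda x: A.T @ f_hess(A @ x + b) @ A
x = np.array([0.1, 0.2]); y = A @ x + b
for _ in range(5):
    x = x - np.linalg.solve(g_hess(x), g_grad(x))
    y = y - np.linalg.solve(f_hess(y), f_grad(y))
print(np.allclose(y, A @ x + b))                            # True
```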
