 
              Matrix differential calculus 10-725 Optimization Geoff Gordon Ryan Tibshirani
Review
• Matrix differentials: solution to matrix calculus pain
  ‣ compact way of writing Taylor expansions, or …
  ‣ definition:
    ‣ df = a(x; dx) [+ r(dx)]
    ‣ a(x; .) linear in 2nd arg
    ‣ r(dx)/||dx|| → 0 as dx → 0
• d(…) is linear: passes through +, scalar *
• Generalizes Jacobian, Hessian, gradient, velocity
Geoff Gordon—10-725 Optimization—Fall 2012
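The definition above can be checked numerically. The following is a minimal sketch, assuming a hypothetical example that is not in the slides: f(x) = x^T x, whose linear part is a(x; dx) = 2 x^T dx. The helper names `f`, `a`, and `remainder_ratio` are made up for illustration. The ratio r(dx)/||dx|| should shrink as dx shrinks.

```python
import math

def f(x):
    # Hypothetical example function: f(x) = x^T x
    return sum(xi * xi for xi in x)

def a(x, dx):
    # Linear part of the differential of f: a(x; dx) = 2 x^T dx
    return 2.0 * sum(xi * di for xi, di in zip(x, dx))

def remainder_ratio(x, dx):
    # r(dx) / ||dx||, where r(dx) = f(x + dx) - f(x) - a(x; dx)
    x_new = [xi + di for xi, di in zip(x, dx)]
    r = f(x_new) - f(x) - a(x, dx)
    return abs(r) / math.sqrt(sum(di * di for di in dx))

x = [1.0, -2.0, 0.5]
direction = [0.3, 0.1, -0.2]
scales = (1e-1, 1e-2, 1e-3)
ratios = [remainder_ratio(x, [s * di for di in direction]) for s in scales]
for s, ratio in zip(scales, ratios):
    print(s, ratio)
```

For this particular f the remainder is exactly ||dx||^2, so the printed ratio should be close to ||dx|| itself.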
Review
• Chain rule
• Product rule
• Bilinear functions: cross product, Kronecker, Frobenius, Hadamard, Khatri-Rao, …
• Identities
  ‣ rules for working with ⊗, tr()
  ‣ trace rotation
• Identification theorems
Finding a maximum or minimum, or saddle point
ID for df(x)   scalar x     vector x      matrix X
scalar f       df = a dx    df = a^T dx   df = tr(A^T dX)
[Figure: plot of a scalar function, horizontal axis from −3 to 3, vertical axis from −1 to 2]
And so forth…
• Can’t draw it for X a matrix, tensor, …
• But same principle holds: set coefficient of dX to 0 to find min, max, or saddle point:
  ‣ if df = c(A; dX) [+ r(dX)] then
  ‣ so: max/min/sp iff A = 0
  ‣ for c(.; .) any “product”
Ex: Infomax ICA
• Training examples x_i ∈ ℝ^d, i = 1:n
• Transformation y_i = g(W x_i)
  ‣ W ∈ ℝ^{d×d}
  ‣ g(z) =
• Want:
[Figure: three scatter plots, labeled x_i, W x_i, and y_i]
Volume rule
Ex: Infomax ICA
• y_i = g(W x_i)
  ‣ dy_i = J_i dx_i
• Method: max_W Σ_i –ln(P(y_i))
  ‣ where P(y_i) =
Gradient
• L = Σ_i ln |det J_i|
  y_i = g(W x_i)
  dy_i = J_i dx_i
Gradient
J_i = diag(u_i) W
dJ_i = diag(u_i) dW + diag(v_i) diag(dW x_i) W
dL =
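One way the blank dL could be filled in is sketched here. It assumes u_i = g'(W x_i) and v_i = g''(W x_i) applied elementwise (the slide does not define them, but this choice is consistent with the stated J_i and dJ_i), that W and diag(u_i) are invertible, and that v_i / u_i denotes elementwise division. It uses d ln|det J| = tr(J^{-1} dJ) and trace rotation.

```latex
\begin{align*}
dL &= \sum_i \operatorname{tr}\!\left(J_i^{-1}\, dJ_i\right),
   \qquad J_i^{-1} = W^{-1}\operatorname{diag}(u_i)^{-1} \\
   &= \sum_i \operatorname{tr}\!\left(W^{-1}\, dW\right)
     + \operatorname{tr}\!\left(W^{-1}\operatorname{diag}(u_i)^{-1}
       \operatorname{diag}(v_i)\operatorname{diag}(dW\, x_i)\, W\right) \\
   &= \sum_i \operatorname{tr}\!\left(W^{-1}\, dW\right)
     + (v_i / u_i)^T\, dW\, x_i \\
   &= \sum_i \operatorname{tr}\!\left(\left[W^{-1} + x_i\,(v_i/u_i)^T\right] dW\right)
\end{align*}
```

Matching this against dL = tr(G^T dW) would give G = Σ_i [W^{-T} + (v_i / u_i) x_i^T] under these assumptions.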
Natural gradient
• L(W): ℝ^{d×d} → ℝ, dL = tr(G^T dW)
• step S = arg max_S M(S) = tr(G^T S) – ||S W^{-1}||_F^2 / 2
  ‣ scalar case: M = gs – s^2 / (2w^2)
• M =
• dM =
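A possible completion of the M and dM blanks, written as a sketch rather than as the lecture's own derivation. It assumes W is invertible and uses ||X||_F^2 = tr(X^T X) together with trace rotation.

```latex
\begin{align*}
M  &= \operatorname{tr}(G^T S)
      - \tfrac{1}{2}\operatorname{tr}\!\left(W^{-T} S^T S\, W^{-1}\right) \\
dM &= \operatorname{tr}(G^T dS)
      - \operatorname{tr}\!\left(W^{-1} W^{-T} S^T dS\right)
    = \operatorname{tr}\!\left(\left[G^T - W^{-1}W^{-T}S^T\right] dS\right) \\
dM &= 0 \ \text{for all } dS
   \iff G = S\, W^{-1} W^{-T}
   \iff S = G\, W^T W
\end{align*}
```

In the scalar case this reduces to s = g w^2, which is the maximizer of gs – s^2 / (2w^2).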
ICA natural gradient
• [W^{-T} + C] W^T W = W + C W^T W
• start with W_0 = I
[Figure: scatter plots labeled y_i and W x_i]
ICA on natural image patches
More info
• Minka’s cheat sheet:
  ‣ http://research.microsoft.com/en-us/um/people/minka/papers/matrix/
• Magnus & Neudecker. Matrix Differential Calculus. Wiley, 1999. 2nd ed.
  ‣ http://www.amazon.com/Differential-Calculus-Applications-Statistics-Econometrics/dp/047198633X
• Bell & Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, v7, 1995.
Newton’s method 10-725 Optimization Geoff Gordon Ryan Tibshirani
Nonlinear equations
• x ∈ ℝ^d, f: ℝ^d → ℝ^d, diff’ble
  ‣ solve: f(x) = 0
• Taylor: f(x + dx) = f(x) + J dx + r(dx)
  ‣ J: Jacobian of f at x
• Newton: dx = –J^{-1} f(x)
[Figure: plot with horizontal axis from 0 to 2 and vertical axis from −1 to 1.5]
Error analysis
dx = x*(1-x*phi)
0: 0.7500000000000000
1: 0.5898558813281841
2: 0.6167492604787597
3: 0.6180313181415453
4: 0.6180339887383547
5: 0.6180339887498948
6: 0.6180339887498949
7: 0.6180339887498948
8: 0.6180339887498949
*: 0.6180339887498948
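The iteration above can be rerun with a few lines of Python. This is a minimal sketch, assuming that phi is the golden ratio and that the update comes from applying Newton's method to f(x) = 1/x − phi. The slide does not state either, but the starred value equals 1/phi and the update x*(1 − x*phi) is what −f(x)/f'(x) gives for that f. The function name `newton_reciprocal` is hypothetical.

```python
import math

PHI = (1.0 + math.sqrt(5.0)) / 2.0  # assumed: phi is the golden ratio

def newton_reciprocal(x0, phi=PHI, steps=8):
    """Iterate x <- x + x*(1 - x*phi), i.e. Newton on f(x) = 1/x - phi."""
    xs = [x0]
    x = x0
    for _ in range(steps):
        x = x + x * (1.0 - x * phi)  # dx = x*(1 - x*phi), as on the slide
        xs.append(x)
    return xs

good = newton_reciprocal(0.75)
for k, x in enumerate(good):
    print(f"{k}: {x:.16f}")
print(f"*: {1.0 / PHI:.16f}")
```

The number of correct digits should roughly double per step until rounding error takes over, which is the pattern visible in the table.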
Bad initialization
1.3000000000000000
-0.1344774409873226
-0.2982157033270080
-0.7403273854022190
-2.3674743431148597
-13.8039236412225819
-335.9214859516196157
-183256.0483360671496484
-54338444778.1145248413085938
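Why 0.75 works and 1.3 does not can be seen from a short error recursion. This is a sketch under the same assumption as the update dx = x*(1 − x*phi): the iteration is x_{k+1} = x_k + x_k (1 − φ x_k), with φ taken to be the golden ratio (about 1.618). The symbol e_k is introduced here for illustration.

```latex
\begin{align*}
e_k &:= 1 - \varphi x_k \\
e_{k+1} &= 1 - \varphi x_k\,(2 - \varphi x_k) = (1 - \varphi x_k)^2 = e_k^2 \\
e_k &= e_0^{\,2^k} \to 0
  \iff |e_0| < 1
  \iff 0 < x_0 < 2/\varphi \approx 1.236
\end{align*}
```

With x_0 = 0.75 the initial error is e_0 ≈ −0.214, so the error is squared at every step (quadratic convergence). With x_0 = 1.3 it is e_0 ≈ −1.103, and squaring makes it grow without bound.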
Minimization
• x ∈ ℝ^d, f: ℝ^d → ℝ, twice diff’ble
  ‣ find: min_x f(x), i.e. solve f’(x) = 0
• Newton: dx = –(f’’(x))^{-1} f’(x)
Descent
• Newton step: d = –(f’’(x))^{-1} f’(x)
• Gradient step: –g = –f’(x)
• Taylor: df =
• Let t > 0, set dx =
  ‣ df =
• So:
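One plausible way to complete the argument on this slide, offered as a sketch. It assumes H = f''(x) is positive definite and g = f'(x) ≠ 0, and it writes r for the Taylor remainder from the definition of the differential.

```latex
\begin{align*}
df &= g^T dx + r(dx) \\
dx &= t\,d = -t\,H^{-1} g, \qquad t > 0 \\
df &= -t\, g^T H^{-1} g + r(t\,d) \\
\frac{r(t\,d)}{t} &\to 0 \ \text{as } t \to 0
  \quad\Longrightarrow\quad df < 0 \ \text{for all sufficiently small } t
\end{align*}
```

Under these assumptions the Newton direction is a descent direction, because g^T H^{-1} g > 0 whenever H is positive definite and g is nonzero.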
Steepest descent
g = f’(x)
H = f’’(x)
||d||_H = sqrt(d^T H d)
[Figure: point x with the normalized steepest descent step x + Δx_nsd and the Newton step x + Δx_nt]
Newton w/ line search
• Pick x_1
• For k = 1, 2, …
  ‣ g_k = f’(x_k); H_k = f’’(x_k)        gradient & Hessian
  ‣ d_k = –H_k \ g_k                     Newton direction
  ‣ t_k = 1                              backtracking line search
  ‣ while f(x_k + t_k d_k) > f(x_k) + t_k g_k^T d_k / 2
    ‣ t_k = β t_k                        β < 1
  ‣ x_{k+1} = x_k + t_k d_k              step
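The loop above can be sketched in Python for the scalar case, where H \ g is just a division. Everything named here is an assumption for illustration: the function `damped_newton`, the parameter `alpha` (set to 0.5 to mirror the "/ 2" in the slide's test; textbook treatments commonly use a value below 0.5), the stopping tolerance, and the test function f(x) = sqrt(1 + x^2), which is not from the slides.

```python
import math

def damped_newton(f, grad, hess, x, beta=0.5, alpha=0.5, tol=1e-10, max_iter=100):
    """Scalar Newton's method with backtracking line search (a sketch)."""
    for _ in range(max_iter):
        g, h = grad(x), hess(x)
        if abs(g) < tol:      # assumed stopping rule, not on the slide
            break
        d = -g / h            # scalar version of d = -H \ g
        t = 1.0
        # Shrink t until f decreases by at least alpha * t * |g * d|
        while f(x + t * d) > f(x) + alpha * t * g * d:
            t *= beta
        x = x + t * d
    return x

# Hypothetical test function: f(x) = sqrt(1 + x^2), minimized at x = 0
f = lambda x: math.sqrt(1.0 + x * x)
grad = lambda x: x / math.sqrt(1.0 + x * x)
hess = lambda x: (1.0 + x * x) ** -1.5

x_min = damped_newton(f, grad, hess, 3.0)
print(x_min)
```

For this f the undamped Newton update maps x to −x^3, so a full step from x = 3 would jump to about −27. The line search is what keeps the iterates moving toward the minimizer.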
Properties of damped Newton
• Affine invariant: suppose g(x) = f(Ax+b)
  ‣ x_1, x_2, … from Newton on g()
  ‣ y_1, y_2, … from Newton on f()
  ‣ If y_1 = A x_1 + b, then: y_k = A x_k + b for all k
• Convergent:
  ‣ if f bounded below, f(x_k) converges
  ‣ if f strictly convex, bounded level sets, x_k converges
  ‣ typically quadratic rate in neighborhood of x*
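The affine invariance property can be checked with a short calculation. This sketch assumes A is square and invertible and writes y = Ax + b, H = ∇²f(y), d_x for the Newton direction on g, and d_y for the Newton direction on f. These symbols are introduced here for illustration.

```latex
\begin{align*}
\nabla g(x) &= A^T \nabla f(y), \qquad \nabla^2 g(x) = A^T H A \\
d_x &= -\left(A^T H A\right)^{-1} A^T \nabla f(y)
     = -A^{-1} H^{-1} \nabla f(y) = A^{-1} d_y \\
g(x + t\,d_x) &= f(y + t\,d_y), \qquad
\nabla g(x)^T d_x = \nabla f(y)^T d_y
\end{align*}
```

The last line says the backtracking test sees identical numbers for both problems, so the same step size t is accepted and y + t d_y = A(x + t d_x) + b. By induction the two iterate sequences stay related by the affine map.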