  1. Satyen Kale (Yahoo! Research) Joint work with Elad Hazan (IBM Almaden) and Manfred Warmuth (UCSC)

  2. [Figure: input pairs of unit vectors $x_1 \mapsto y_1, \ldots, x_T \mapsto y_T$.]
  - Input: pairs of unit vectors in $\mathbb{R}^n$: $(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$
  - Assumption: $y_t = Rx_t + \text{noise}$, where $R$ is an unknown rotation matrix
  - Problem: find the "best-fit" rotation matrix for the data, i.e. $\arg\min_R \sum_t \|Rx_t - y_t\|^2$

  3.
  - $\|Rx_t - y_t\|^2 = \|Rx_t\|^2 + \|y_t\|^2 - 2(y_t x_t^\top) \bullet R = 2 - 2(y_t x_t^\top) \bullet R$, where $A \bullet B = \mathrm{Tr}(A^\top B) = \sum_{ij} A_{ij} B_{ij}$, so the loss is linear in $R$
  - Hence $\arg\min_R \sum_t \|Rx_t - y_t\|^2 = \arg\max_R \left(\sum_t y_t x_t^\top\right) \bullet R$
  - Computing $\arg\max_R M \bullet R$ is "Wahba's problem"; it can be solved using the SVD of $M$
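
A minimal sketch of the SVD solution to Wahba's problem (the function name and the sign-correction detail are my own; the slides only state that an SVD suffices):

```python
import numpy as np

def wahba_rotation(M):
    """Maximize M . R = Tr(M^T R) over rotation matrices R.

    With M = U diag(s) V^T, the maximizer over SO(n) is
    R = U diag(1, ..., 1, det(U V^T)) V^T; the sign correction on the
    last singular direction enforces det(R) = +1.
    """
    U, _, Vt = np.linalg.svd(M)
    d = np.ones(M.shape[0])
    d[-1] = np.sign(np.linalg.det(U @ Vt))
    return U @ np.diag(d) @ Vt

# Best-fit rotation for data (x_t, y_t): maximize (sum_t y_t x_t^T) . R
# M = sum(np.outer(y, x) for x, y in data); R = wahba_rotation(M)
```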

  4. Online version: in each round $t = 1, \ldots, T$:
  - Choose a rotation matrix $R_t$ and predict $R_t x_t$
  - Suffer loss $L_t(R_t) = \|R_t x_t - y_t\|^2$
  Goal: minimize regret, $\text{Regret} = \sum_t L_t(R_t) - \min_R \sum_t L_t(R)$.
  Open problem from COLT 2008 [Smith, Warmuth].
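
A sketch of this protocol as code (the learner interface is hypothetical; `wahba_rotation` is the sketch above):

```python
import numpy as np

def run_protocol(learner, data):
    """Run the online rotation game and return the realized regret."""
    n = data[0][0].shape[0]
    total_loss, M = 0.0, np.zeros((n, n))
    for x, y in data:
        R = learner.choose()                    # commit to R_t before seeing y_t
        total_loss += np.sum((R @ x - y) ** 2)  # L_t(R_t) = ||R_t x_t - y_t||^2
        learner.update(x, y)                    # reveal (x_t, y_t)
        M += np.outer(y, x)
    best_R = wahba_rotation(M)                  # best fixed rotation in hindsight
    best_loss = sum(np.sum((best_R @ x - y) ** 2) for x, y in data)
    return total_loss - best_loss
```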

  5.
  - Rotation matrix $\equiv$ orthogonal matrix of determinant 1; the set of rotation matrices is $SO(n)$
  - Non-convex: so online convex optimization techniques like gradient descent, exponentiated gradient, etc. don't apply directly
  - Lie group, with Lie algebra = the set of all skew-symmetric matrices
  - This Lie group gives a universal representation for all Lie groups via a conformal embedding

  6.
  - [Arora, NIPS '09]: uses the Lie group/Lie algebra structure
  - Based on matrix exponentiated gradient: the matrix exponential maps the Lie algebra to the Lie group
  - Deterministic algorithm
  - $\Omega(T)$ lower bound on any such deterministic algorithm, so randomization is crucial
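
A quick illustration of the map mentioned above (just a demo that the matrix exponential sends skew-symmetric matrices into $SO(n)$, not a sketch of [Arora '09]'s algorithm):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
S = A - A.T                  # skew-symmetric (S^T = -S): an element of the Lie algebra
R = expm(S)                  # matrix exponential lands in the Lie group SO(4)
print(np.allclose(R.T @ R, np.eye(4)))    # True: R is orthogonal
print(np.isclose(np.linalg.det(R), 1.0))  # True: det(R) = 1, since tr(S) = 0
```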

  7. Adversary can compute $R_t$ since the algorithm is deterministic.
  - Assume for convenience that $n$ is even.
  - Bad example: $x_t = e_1$, $y_t = -R_t x_t$.
  - $L_t(R_t) = \|R_t x_t - y_t\|^2 = \|2y_t\|^2 = 4$, so total loss $= 4T$.
  - Since $n$ is even, both $I$ and $-I$ are rotation matrices, and $\sum_t L_t(I) + L_t(-I) = \sum_t 2\|y_t\|^2 + 2\|x_t\|^2 = 4T$.
  - Hence $\min_R \sum_t L_t(R) \le 2T$, so $\text{Regret} \ge 2T$.

  8.
  - Randomized algorithm with expected regret $O(\sqrt{nL})$, where $L = \min_R \sum_t L_t(R)$
  - Lower bound of $\Omega(\sqrt{nT})$ on the regret of any online learning algorithm for choosing rotation matrices
  - Uses Hannan/Kalai-Vempala's Follow-The-Perturbed-Leader (FPL) technique, based on the linearity of the loss function

  9.
  - Sample a noise matrix $N$ with i.i.d. entries distributed uniformly in $[-1/\epsilon, 1/\epsilon]$
  - In round $t$, use $R_t = \arg\min_R \sum_{i=1}^{t-1} L_i(R) - N \bullet R$, computed using the SVD solution to Wahba's problem
  - Thm [KV'05]: Regret $\le O(n^{5/4}\sqrt{T})$.
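
A sketch of one round of this FPL variant (function name is mine; `wahba_rotation` is the sketch above). Since $L_i(R) = 2 - 2(y_i x_i^\top) \bullet R$, minimizing $\sum_{i<t} L_i(R) - N \bullet R$ is the same as maximizing $(\sum_{i<t} y_i x_i^\top + N/2) \bullet R$:

```python
import numpy as np

def fpl_round(past_pairs, n, eps, rng):
    """Choose R_t = argmin_R sum_{i<t} L_i(R) - N . R, with cube noise."""
    N = rng.uniform(-1.0 / eps, 1.0 / eps, size=(n, n))  # i.i.d. uniform entries
    M = N / 2.0
    for x, y in past_pairs:                              # rounds 1, ..., t-1
        M += np.outer(y, x)
    return wahba_rotation(M)                             # solve Wahba's problem
```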

  10.
  - Sample $n$ numbers $\sigma_1, \sigma_2, \ldots, \sigma_n$ i.i.d. from the exponential distribution of density $\epsilon \exp(-\epsilon\sigma)$
  - Sample 2 orthogonal matrices $U, V$ from the uniform Haar measure
  - Set $N = U\Sigma V^\top$, where $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$
  - In round $t$, use $R_t = \arg\min_R \sum_{i=1}^{t-1} L_i(R) - N \bullet R$

  11.
  - Sample $n$ numbers $\sigma_1, \sigma_2, \ldots, \sigma_n$ i.i.d. from the exponential distribution of density $\epsilon \exp(-\epsilon\sigma)$
  - Sample 2 orthogonal matrices $U, V$ from the uniform Haar measure, e.g. using the QR-decomposition of a matrix with i.i.d. standard Gaussian entries
  - Set $N = U\Sigma V^\top$, where $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$
  - In round $t$, use $R_t = \arg\min_R \sum_{i=1}^{t-1} L_i(R) - N \bullet R$

  12.
  - Sample $n$ numbers $\sigma_1, \sigma_2, \ldots, \sigma_n$ i.i.d. from the exponential distribution of density $\epsilon \exp(-\epsilon\sigma)$
  - Sample 2 orthogonal matrices $U, V$ from the uniform Haar measure
  - Set $N = U\Sigma V^\top$, where $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$
  - Effectively, we choose $N$ w.p. $\propto \exp(-\epsilon\|N\|_*)$, where $\|N\|_*$ = trace norm, i.e. the sum of the singular values of $N$
  - In round $t$, use $R_t = \arg\min_R \sum_{i=1}^{t-1} L_i(R) - N \bullet R$
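
A sketch of this sampler (the sign fix after QR is my own addition; plain QR of a Gaussian matrix is Haar-distributed only after normalizing the signs of R's diagonal):

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Haar-uniform orthogonal matrix via QR of an i.i.d. Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))  # sign fix so Q is exactly Haar-distributed

def sample_noise(n, eps, rng):
    """Sample N with density proportional to exp(-eps * ||N||_*)."""
    sigma = rng.exponential(scale=1.0 / eps, size=n)  # density eps * exp(-eps * s)
    U, V = haar_orthogonal(n, rng), haar_orthogonal(n, rng)
    return U @ np.diag(sigma) @ V.T
```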

  13.
  - Stability Lemma [KV'05]: $E[\text{Regret}] \le \sum_t E[L_t(R_t)] - E[L_t(R_{t+1})] + 2E[\|N\|_*]$, where $\sum_t E[L_t(R_t)] - E[L_t(R_{t+1})] \le 2\epsilon L$ and $2E[\|N\|_*] = 2n/\epsilon$
  - Choose $\epsilon = \sqrt{n/L}$, and we get $E[\text{Regret}] \le O(\sqrt{nL})$.
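
The arithmetic behind this choice of $\epsilon$, spelled out:

$$E[\text{Regret}] \;\le\; 2\epsilon L + \frac{2n}{\epsilon} \;\overset{\epsilon = \sqrt{n/L}}{=}\; 2\sqrt{nL} + 2\sqrt{nL} \;=\; 4\sqrt{nL} \;=\; O(\sqrt{nL}).$$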

  14.
  - $R_t = \arg\max_R \left(\sum_{i=1}^{t-1} y_i x_i^\top + N\right) \bullet R$
  - $R_{t+1} = \arg\max_R \left(\sum_{i=1}^{t} y_i x_i^\top + N'\right) \bullet R$
  - Re-randomization doesn't change expected regret

  15.
  - $R_t = \arg\max_R \left(\sum_{i=1}^{t-1} y_i x_i^\top + N\right) \bullet R$, $R_{t+1} = \arg\max_R \left(\sum_{i=1}^{t} y_i x_i^\top + N'\right) \bullet R$
  - First sample $N$, then set $N' = N - y_t x_t^\top$.
  - Then $R_t = R_{t+1}$, and so $E_D[L_t(R_t)] - E_{D'}[L_t(R_{t+1})] = 0$, where $D$ = distribution of $N$, $D'$ = distribution of $N'$.

  16.
  - $R_t = \arg\max_R \left(\sum_{i=1}^{t-1} y_i x_i^\top + N\right) \bullet R$, $R_{t+1} = \arg\max_R \left(\sum_{i=1}^{t} y_i x_i^\top + N'\right) \bullet R$
  - First sample $N$, then set $N' = N - y_t x_t^\top$.
  - Then $R_t = R_{t+1}$, and so $E_D[L_t(R_t)] - E_{D'}[L_t(R_{t+1})] = 0$.
  - However, $\|D' - D\|_1 \le \epsilon$.
  - So $E_{D'}[L_t(R_{t+1})] - E_D[L_t(R_{t+1})] \le 2\epsilon$.

  17.
  - $R_t = \arg\max_R \left(\sum_{i=1}^{t-1} y_i x_i^\top + N\right) \bullet R$, $R_{t+1} = \arg\max_R \left(\sum_{i=1}^{t} y_i x_i^\top + N'\right) \bullet R$
  - First sample $N$, then set $N' = N - y_t x_t^\top$.
  - Then $R_t = R_{t+1}$, and so $E_D[L_t(R_t)] - E_{D'}[L_t(R_{t+1})] = 0$.
  - However, $\|D' - D\|_1 \le \epsilon$.
  - So $E_{D'}[L_t(R_{t+1})] - E_D[L_t(R_{t+1})] \le 2\epsilon$.
  - $\Pr_{D'}[N]/\Pr_D[N] \approx \exp(\pm\epsilon\|y_t x_t^\top\|_*) \approx 1 \pm \epsilon$.
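
Spelling out that density ratio (using the trace-norm triangle inequality and $\|y_t x_t^\top\|_* = \|y_t\|\,\|x_t\| = 1$ for unit vectors):

$$\frac{\Pr_{D'}[N]}{\Pr_D[N]} = \exp\!\big(\epsilon\,(\|N\|_* - \|N + y_t x_t^\top\|_*)\big) \in [e^{-\epsilon}, e^{\epsilon}] \approx [1 - \epsilon, 1 + \epsilon].$$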

  18. $E[\|N\|_*] = E[\sum_i \sigma_i] = \sum_i E[\sigma_i] = n/\epsilon$, because each $\sigma_i$ is drawn from the exponential distribution of density $\epsilon \exp(-\epsilon\sigma)$, which has mean $1/\epsilon$.

  19.
  - Bad example: $x_t = e_{t \bmod n}$, $y_t = \pm x_t$ w.p. ½ each
  - Optimal rotation matrix: $R^* = \mathrm{diag}(\mathrm{sgn}(X_1), \ldots, \mathrm{sgn}(X_n))$*, where $X_i$ = the sum of the $\pm$ signs over all $t$ s.t. $(t \bmod n) = i$
  (* ignoring the $\det(R^*) = 1$ issue)

  20.
  - Bad example: $x_t = e_{t \bmod n}$, $y_t = \pm x_t$ w.p. ½ each; optimal rotation matrix $R^* = \mathrm{diag}(\mathrm{sgn}(X_1), \ldots, \mathrm{sgn}(X_n))$*
  - Expected total loss of $R^*$ is $2T - 2\sum_i E[|X_i|] \le 2T - n \cdot \Omega(\sqrt{T/n}) = 2T - \Omega(\sqrt{nT})$
  - But for any $R_t$, $E[L_t(R_t)] = 2 - 2E[(y_t x_t^\top) \bullet R_t] = 2$, and hence the total expected loss of the algorithm $= 2T$
  - So $E[\text{Regret}] \ge \Omega(\sqrt{nT})$
  (* ignoring the $\det(R^*) = 1$ issue)
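
A quick Monte-Carlo sketch of this instance (the function name and setup are mine), checking that the comparator beats the forced per-round loss of 2 by about $\sqrt{nT}$:

```python
import numpy as np

def regret_gap(n, T, seed=0):
    """Simulate x_t = e_{t mod n}, y_t = +/- x_t; return 2T minus the
    sign-diagonal comparator's loss, i.e. 2 * sum_i |X_i| = Omega(sqrt(nT))."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=T)
    X = np.zeros(n)
    for t in range(T):
        X[t % n] += signs[t]        # X_i = sum of the signs with t mod n = i
    comparator_loss = 2 * T - 2 * np.abs(X).sum()  # ignoring det(R*) = 1
    return 2 * T - comparator_loss  # = 2 * sum_i |X_i|
```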

  21.
  - Optimal algorithm for online learning of rotations with regret $O(\sqrt{nL})$, based on FPL
  - Open questions:
  - Other applications for FPL? Matrix Hedge? Faster algorithms for SDPs? More details in Manfred's open problem talk.
  - Any other example of natural problems where FPL is the only known technique that works?
  Thank you!
