 
Satyen Kale (Yahoo! Research). Joint work with Elad Hazan (IBM Almaden) and Manfred Warmuth (UCSC).
[Figure: unit vectors $x_1, x_2, \ldots, x_T$ and $y_1, y_2, \ldots, y_T$.]
- Input: pairs of unit vectors in $\mathbb{R}^n$: $(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$.
- Assumption: $y_t = R x_t + \text{noise}$, where $R$ is an unknown rotation matrix.
- Problem: find the "best-fit" rotation matrix for the data, i.e. $\arg\min_R \sum_t \|R x_t - y_t\|^2$.
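- $\|R x_t - y_t\|^2 = \|R x_t\|^2 + \|y_t\|^2 - 2 (y_t x_t^\top) \bullet R = 2 - 2 (y_t x_t^\top) \bullet R$, since $x_t$ and $y_t$ are unit vectors. This is linear in $R$.
- Here $A \bullet B = \mathrm{Tr}(A^\top B) = \sum_{ij} A_{ij} B_{ij}$.
- Hence $\arg\min_R \sum_t \|R x_t - y_t\|^2 = \arg\max_R \big(\sum_t y_t x_t^\top\big) \bullet R$.
- Computing $\arg\max_R M \bullet R$ is "Wahba's problem".
- It can be solved using the SVD of $M$.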
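The SVD solution mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `best_fit_rotation` and `random_rotation` are hypothetical, and the sign fix on the last singular direction is the standard way to force determinant +1 (it assumes NumPy's convention of returning singular values in descending order).

```python
import numpy as np

def best_fit_rotation(M):
    """Return a rotation R maximizing M . R = Tr(M^T R), via the SVD of M."""
    U, _, Vt = np.linalg.svd(M)
    d = np.ones(M.shape[0])
    # If U V^T is a reflection, flip the direction of the smallest
    # singular value so that the result has determinant +1.
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0
    return (U * d) @ Vt  # same as U @ diag(d) @ Vt

def random_rotation(n, rng):
    """Hypothetical helper: some rotation matrix, used only for the demo."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]
    return Q

# Noise-free demo: y_t = R_true x_t, so the best fit should recover R_true.
rng = np.random.default_rng(0)
n, T = 4, 20
R_true = random_rotation(n, rng)
X = rng.standard_normal((T, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # rows are unit vectors x_t
Y = X @ R_true.T                               # rows are y_t = R_true x_t
M = Y.T @ X                                    # sum_t y_t x_t^T
R_hat = best_fit_rotation(M)
print(np.allclose(R_hat, R_true))
```

In the noise-free case $M = R_{\text{true}} \sum_t x_t x_t^\top$, so with at least $n$ generic points the maximizer is unique and equals $R_{\text{true}}$.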
[Figure: in each round, the prediction $R_t x_t$ shown against $y_t$, for $t = 1, 2, \ldots, T$.]
- Round $t$ (for $t = 1, 2, \ldots, T$): choose a rotation matrix $R_t$, predict $R_t x_t$, and incur loss $L_t(R_t) = \|R_t x_t - y_t\|^2$.
- Goal: minimize regret, where Regret $= \sum_t L_t(R_t) - \min_R \sum_t L_t(R)$.
- Open problem from COLT 2008 [Smith, Warmuth].
- Rotation matrix $\equiv$ orthogonal matrix of determinant 1.
- Set of rotation matrices, $SO(n)$:
  - Non-convex, so online convex optimization techniques like gradient descent, exponentiated gradient, etc. don't apply directly.
  - Lie group with Lie algebra = set of all skew-symmetric matrices.
  - Lie group gives universal representation for all Lie groups via a conformal embedding.
- [Arora, NIPS '09]: uses the Lie group/Lie algebra structure.
- Based on matrix exponentiated gradient: the matrix exponential maps the Lie algebra to the Lie group.
- Deterministic algorithm.
- $\Omega(T)$ lower bound on the regret of any such deterministic algorithm, so randomization is crucial.
- The adversary can compute $R_t$ since the algorithm is deterministic.
- Assume for convenience that $n$ is even.
- Bad example: $x_t = e_1$, $y_t = -R_t x_t$.
- $L_t(R_t) = \|R_t x_t - y_t\|^2 = \|2 y_t\|^2 = 4$, so the total loss is $4T$.
- Since $n$ is even, both $I$ and $-I$ are rotation matrices, and $\sum_t \big(L_t(I) + L_t(-I)\big) = \sum_t \big(2\|y_t\|^2 + 2\|x_t\|^2\big) = 4T$.
- Hence $\min_R \sum_t L_t(R) \le 2T$.
- So Regret $\ge 2T$.
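The adversarial argument above can be replayed numerically. The sketch below is an illustration under stated assumptions: the deterministic learner is an unperturbed follow-the-leader rule (a stand-in chosen here, not the algorithm of [Arora, NIPS '09]), and the names `best_fit_rotation` and `play_against_adversary` are hypothetical.

```python
import numpy as np

def best_fit_rotation(M):
    """Rotation maximizing Tr(M^T R), via the SVD of M (Wahba's problem)."""
    U, _, Vt = np.linalg.svd(M)
    d = np.ones(M.shape[0])
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0
    return (U * d) @ Vt

def play_against_adversary(n, T):
    """Return (learner's total loss, min of the total losses of I and -I)."""
    assert n % 2 == 0, "n must be even so that -I is a rotation"
    M = np.zeros((n, n))            # running sum of y_t x_t^T
    e1 = np.zeros(n)
    e1[0] = 1.0
    alg_loss = loss_I = loss_neg_I = 0.0
    for _ in range(T):
        R = best_fit_rotation(M)    # deterministic, so the adversary knows it
        x = e1
        y = -R @ x                  # adversary's reply
        alg_loss += np.sum((R @ x - y) ** 2)
        loss_I += np.sum((x - y) ** 2)
        loss_neg_I += np.sum((-x - y) ** 2)
        M += np.outer(y, x)
    return alg_loss, min(loss_I, loss_neg_I)

alg_loss, comparator_loss = play_against_adversary(n=4, T=50)
print(alg_loss, comparator_loss, alg_loss - comparator_loss)
```

Any other deterministic rule could be substituted for `best_fit_rotation(M)` in the loop; the learner's loss stays at 4 per round because the adversary reacts to whatever $R_t$ is chosen.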
- Randomized algorithm with expected regret $O(\sqrt{nL})$, where $L = \min_R \sum_t L_t(R)$.
- Lower bound of $\Omega(\sqrt{nT})$ on the regret of any online learning algorithm for choosing rotation matrices.
- Uses Hannan/Kalai-Vempala's Follow-The-Perturbed-Leader technique, based on linearity of the loss function.
- Sample a noise matrix $N$ with i.i.d. entries distributed uniformly in $[-1/\varepsilon, 1/\varepsilon]$.
- In round $t$, use $R_t = \arg\min_R \sum_{i=1}^{t-1} L_i(R) - N \bullet R$, computed using the SVD solution to Wahba's problem.
- Thm [KV'05]: Regret $\le O(n^{5/4} \sqrt{T})$.
- Sample $n$ numbers $\sigma_1, \sigma_2, \ldots, \sigma_n$ i.i.d. from the exponential distribution of density $\varepsilon \exp(-\varepsilon\sigma)$.
- Sample 2 orthogonal matrices $U$, $V$ from the uniform Haar measure, e.g. using the QR-decomposition of a matrix with i.i.d. standard Gaussian entries.
- Set $N = U \Sigma V^\top$, where $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$.
- Effectively, we choose $N$ w.p. $\propto \exp(-\varepsilon \|N\|_*)$, where $\|N\|_*$ is the trace norm, i.e. the sum of singular values of $N$.
- In round $t$, use $R_t = \arg\min_R \sum_{i=1}^{t-1} L_i(R) - N \bullet R$.
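The sampling steps above can be sketched in NumPy as follows. The function names `haar_orthogonal` and `sample_spectral_noise` are hypothetical. One detail is an addition not stated on the slide: the column signs of Q are adjusted by the signs of R's diagonal, a commonly used correction so that the QR output follows the Haar measure regardless of the sign convention of the QR routine.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Orthogonal matrix from the QR-decomposition of a Gaussian matrix."""
    Z = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(Z)
    # Sign correction (assumption added here, see lead-in): rescale the
    # columns of Q so the corresponding diagonal entries of R are positive.
    return Q * np.sign(np.diag(R))

def sample_spectral_noise(n, eps, rng):
    """Sample N = U diag(sigma) V^T with sigma_i i.i.d. exponential of rate eps."""
    sigma = rng.exponential(scale=1.0 / eps, size=n)  # density eps*exp(-eps*s)
    U = haar_orthogonal(n, rng)
    V = haar_orthogonal(n, rng)
    return (U * sigma) @ V.T                          # U @ diag(sigma) @ V^T

rng = np.random.default_rng(1)
N = sample_spectral_noise(n=5, eps=2.0, rng=rng)
print(N.shape, np.linalg.norm(N, ord="nuc"))
```

By construction the singular values of `N` are the sampled `sigma` values, so its trace norm equals their sum.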
- Stability Lemma [KV'05]: $E[\text{Regret}] \le \sum_t \big(E[L_t(R_t)] - E[L_t(R_{t+1})]\big) + 2E[\|N\|_*] \le 2\varepsilon L + 2n/\varepsilon$.
- Choose $\varepsilon = \sqrt{n/L}$, and we get $E[\text{Regret}] \le O(\sqrt{nL})$.
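The choice of $\varepsilon$ can be checked with a short calculation. The block below treats the bound $2\varepsilon L + 2n/\varepsilon$ as a function $f(\varepsilon)$ (a name introduced here only for this sketch) and minimizes it over $\varepsilon > 0$.

```latex
\begin{align*}
f(\varepsilon) &= 2\varepsilon L + \frac{2n}{\varepsilon}, \\
f'(\varepsilon) &= 2L - \frac{2n}{\varepsilon^{2}} = 0
  \;\Longrightarrow\; \varepsilon = \sqrt{n/L}, \\
f\bigl(\sqrt{n/L}\bigr) &= 2\sqrt{nL} + 2\sqrt{nL} = 4\sqrt{nL}.
\end{align*}
```

Both terms are equal at the minimizer, which gives the $O(\sqrt{nL})$ rate.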
- $R_t = \arg\max_R \big(\sum_{i=1}^{t-1} y_i x_i^\top + N\big) \bullet R$.
- $R_{t+1} = \arg\max_R \big(\sum_{i=1}^{t} y_i x_i^\top + N'\big) \bullet R$. Re-randomization doesn't change the expected regret.
- First sample $N$, then set $N' = N - y_t x_t^\top$. Let $D$ = distribution of $N$, $D'$ = distribution of $N'$.
- Then $R_t = R_{t+1}$, and so $E_D[L_t(R_t)] - E_{D'}[L_t(R_{t+1})] = 0$.
- However, $\|D' - D\|_1 \le \varepsilon$, since $\Pr_{D'}[N] / \Pr_D[N] \approx \exp(\pm\varepsilon \|y_t x_t^\top\|_*) \approx 1 \pm \varepsilon$.
- So $E_{D'}[L_t(R_{t+1})] - E_D[L_t(R_{t+1})] \le 2\varepsilon$.
$E[\|N\|_*] = E[\sum_i \sigma_i] = \sum_i E[\sigma_i] = n/\varepsilon$, because each $\sigma_i$ is drawn from the exponential distribution of density $\varepsilon \exp(-\varepsilon\sigma)$.
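A quick Monte Carlo check of this identity is sketched below. The function name `nuclear_norm_of_sample` and the parameter values are illustrative assumptions. The orthogonal factors come from a plain QR-decomposition without any sign correction, which is enough here because the trace norm is the same for any orthogonal $U$ and $V$.

```python
import numpy as np

def nuclear_norm_of_sample(n, eps, rng):
    """Return (trace norm of N, sum of sampled sigma) for one draw of N."""
    sigma = rng.exponential(scale=1.0 / eps, size=n)
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    N = (U * sigma) @ V.T
    return np.linalg.norm(N, ord="nuc"), sigma.sum()

rng = np.random.default_rng(2)
n, eps, trials = 4, 2.0, 2000
norms = np.array([nuclear_norm_of_sample(n, eps, rng)[0] for _ in range(trials)])
# The empirical mean should be close to n / eps.
print(norms.mean(), n / eps)
```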
- Bad example: $x_t = e_{t \bmod n}$, $y_t = \pm x_t$ w.p. ½ each.
- Optimal rotation matrix* $R^* = \mathrm{diag}(\mathrm{sgn}(X_1), \ldots, \mathrm{sgn}(X_n))$, where $X_i$ = sum of the $\pm$ signs over all $t$ s.t. $(t \bmod n) = i$.
- Expected total loss of $R^*$ $= 2T - 2\sum_i E[|X_i|] \le 2T - n \cdot \Omega(\sqrt{T/n}) = 2T - \Omega(\sqrt{nT})$.
- But for any $R_t$, $E[L_t(R_t)] = 2 - 2E[(y_t x_t^\top) \bullet R_t] = 2$, and hence the total expected loss of the algorithm is $2T$.
- So $E[\text{Regret}] \ge \Omega(\sqrt{nT})$.
* ignoring the $\det(R^*) = 1$ issue.
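The comparator's loss in this construction can be checked on one random draw. The sketch below is illustrative only: `comparator_loss` is a hypothetical name, coordinates are indexed from 0, the sign of $X_i = 0$ is taken as $+1$, and the determinant constraint is ignored as in the footnote.

```python
import numpy as np

def comparator_loss(n, T, rng):
    """Return (directly summed loss of R*, closed form 2T - 2*sum_i |X_i|)."""
    signs = rng.choice([-1.0, 1.0], size=T)   # y_t = signs[t] * x_t
    coords = np.arange(T) % n                 # x_t = e_{t mod n}
    X = np.zeros(n)
    np.add.at(X, coords, signs)               # X_i = sum of signs on coordinate i
    s = np.where(X >= 0, 1.0, -1.0)           # sgn(X_i), with sgn(0) taken as +1
    R_star = np.diag(s)
    direct = 0.0
    for t in range(T):
        x = np.zeros(n)
        x[coords[t]] = 1.0
        y = signs[t] * x
        direct += np.sum((R_star @ x - y) ** 2)
    closed_form = 2 * T - 2 * np.abs(X).sum()
    return direct, closed_form

rng = np.random.default_rng(4)
direct, closed_form = comparator_loss(n=8, T=4000, rng=rng)
# The gap 2T - closed_form is the advantage of R* over the 2T expected loss
# that any algorithm incurs on this input.
print(direct, closed_form, 2 * 4000 - closed_form)
```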
- Optimal algorithm for online learning of rotations with regret $O(\sqrt{nL})$.
- Based on FSPL.
- Open questions:
  - Other applications for FSPL? Matrix Hedge? Faster algorithms for SDPs? More details in Manfred's open problem talk.
  - Any other example of natural problems where FPL is the only known technique that works?
Thank you!