
Liege University: Francqui Chair 2011-2012 Lecture 3: Huge-scale optimization problems - PowerPoint PPT Presentation



  1. Liege University: Francqui Chair 2011-2012. Lecture 3: Huge-scale optimization problems. Yurii Nesterov, CORE/INMA (UCL). March 9, 2012.

  2. Outline: 1. Problem sizes; 2. Random coordinate search; 3. Confidence level of solutions; 4. Sparse optimization problems; 5. Sparse updates for linear operators; 6. Fast updates in computational trees; 7. Simple subgradient methods; 8. Application examples.

  3. Nonlinear Optimization: problem sizes

     Class         Operations   Dimension       Iter. cost    Memory
     Small-size    All          10^0 - 10^2     n^4 -> n^3    Kilobyte: 10^3
     Medium-size   A^{-1}       10^3 - 10^4     n^3 -> n^2    Megabyte: 10^6
     Large-scale   Ax           10^5 - 10^7     n^2 -> n      Gigabyte: 10^9
     Huge-scale    x + y        10^8 - 10^12    n -> log n    Terabyte: 10^12

     Sources of huge-scale problems: Internet (new), telecommunications (new), finite-element schemes (old), partial differential equations (old).

  4. Very old optimization idea: Coordinate Search
     Problem: min_{x ∈ R^n} f(x)  (f is convex and differentiable).
     Coordinate relaxation algorithm. For k ≥ 0 iterate:
       1. Choose an active coordinate i_k.
       2. Update x_{k+1} = x_k − h_k ∇_{i_k} f(x_k) e_{i_k}, ensuring f(x_{k+1}) ≤ f(x_k).
     (e_i is the i-th coordinate vector in R^n.)
     Main advantage: very simple implementation.
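
A minimal sketch of this scheme on a least-squares test function (the quadratic objective, the fixed step h = 1/L, and all names below are illustrative assumptions, not part of the slides):

```python
import numpy as np

def coordinate_search(grad_i, x0, L, n_iters=5000, rule="cyclic", rng=None):
    """Coordinate relaxation: x_{k+1} = x_k - h * grad_i(x_k, i) * e_i with h = 1/L."""
    rng = rng or np.random.default_rng(0)
    x, n, h = x0.copy(), x0.size, 1.0 / L
    for k in range(n_iters):
        i = k % n if rule == "cyclic" else rng.integers(n)   # active coordinate i_k
        x[i] -= h * grad_i(x, i)                             # move along coordinate i only
    return x

# Test on f(x) = 0.5 * ||A x - b||^2, where grad_i(x) = <A[:, i], A x - b>
rng = np.random.default_rng(1)
A, b = rng.standard_normal((30, 10)), rng.standard_normal(30)
L = np.linalg.norm(A.T @ A, 2)                               # Lipschitz constant of ∇f
x = coordinate_search(lambda x, i: A[:, i] @ (A @ x - b), np.zeros(10), L)
```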

  5. Possible strategies
       1. Cyclic moves. (Difficult to analyze.)
       2. Random choice of coordinate. (Why?)
       3. Choose the coordinate with the maximal directional derivative.
     Complexity estimate: assume ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖ for all x, y ∈ R^n, and choose h_k = 1/L. Then
       f(x_k) − f(x_{k+1}) ≥ (1/(2L)) |∇_{i_k} f(x_k)|² ≥ (1/(2nL)) ‖∇f(x_k)‖² ≥ (1/(2nLR²)) (f(x_k) − f*)².
     Hence f(x_k) − f* ≤ 2nLR²/k for k ≥ 1. (For the gradient method, drop the factor n.)
     This is the only known theoretical result for CDM!
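
For completeness, the 1/k rate follows from the per-iteration decrease by a standard telescoping argument; writing δ_k := f(x_k) − f*:

```latex
\frac{1}{\delta_{k+1}} - \frac{1}{\delta_k}
  = \frac{\delta_k - \delta_{k+1}}{\delta_k\,\delta_{k+1}}
  \;\ge\; \frac{\delta_k - \delta_{k+1}}{\delta_k^2}
  \;\ge\; \frac{1}{2nLR^2}
\quad\Longrightarrow\quad
\frac{1}{\delta_k} \;\ge\; \frac{k}{2nLR^2}
\quad\Longrightarrow\quad
\delta_k \;\le\; \frac{2nLR^2}{k}.
```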

  6. Criticism
     Theoretical justification: complexity bounds are not known for most of the schemes. The only justified scheme requires computation of the whole gradient (so why not use the gradient method?).
     Computational complexity: by fast differentiation, if a function is defined by a sequence of operations, then C(∇f) ≤ 4 C(f). Can we do anything without computing the function's values?
     Result: CDM have almost disappeared from computational practice.

  7. Google problem
     Let E ∈ R^{n×n} be an incidence matrix of a graph. Denote e = (1, ..., 1)^T and Ē = E · diag(E^T e)^{-1}. Thus Ē^T e = e.
     Our problem is as follows: find x* ≥ 0 such that Ē x* = x*.
     Optimization formulation:
       f(x) := (1/2) ‖Ē x − x‖² + (γ/2) [⟨e, x⟩ − 1]²  →  min over x ∈ R^n.
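
A small sketch of this formulation on a toy matrix (the dense construction, the value of γ, and all names are assumptions for illustration only):

```python
import numpy as np

def google_objective(E, gamma=1.0):
    """Return Ē = E diag(E^T e)^{-1} and f(x) = 0.5 ||Ē x - x||^2 + (γ/2) (<e, x> - 1)^2."""
    n = E.shape[0]
    e = np.ones(n)
    E_bar = E / (E.T @ e)                     # divide each column j by its sum (E^T e)_j
    def f(x):
        g = E_bar @ x - x                     # residual Ē x - x
        return 0.5 * g @ g + 0.5 * gamma * (e @ x - 1.0) ** 2
    return E_bar, f

# Toy 3-node link matrix; a solution x* satisfies Ē x* = x* and <e, x*> = 1, so f(x*) = 0
E = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
E_bar, f = google_objective(E)
print(f(np.ones(3) / 3))                      # ≈ 0 for this symmetric example
```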

  8. Huge-scale problems
     Main features: the size is very big (n ≥ 10^7); the data is distributed in space; the requested parts of the data are not always available; the data is changing in time.
     Consequences: even the simplest operations become expensive or infeasible: update of the full vector of variables, matrix-vector multiplication, computation of the objective function's value, etc.

  9. Structure of the Google problem
     Let us look at the gradient of the objective:
       ∇_i f(x) = ⟨a_i, g(x)⟩ + γ [⟨e, x⟩ − 1],   i = 1, ..., n,
     where g(x) = Ē x − x ∈ R^n and Ē = (a_1, ..., a_n).
     Main observations:
       - The coordinate move x_+ = x − h_i ∇_i f(x) e_i needs O(p_i) arithmetic operations, where p_i is the number of nonzero elements in a_i.
       - The diagonal Hessian entries d_i := ∇²_{ii} f = (diag(Ē^T Ē + γ ee^T))_i = γ + 1/p_i are available. We can use them for choosing the step sizes (h_i = 1/d_i).
     Reasonable coordinate choice strategy? Random!
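
A sketch of the O(p_i) coordinate move, maintaining the residual g(x) = Ē x − x and the sum ⟨e, x⟩ incrementally. The SciPy CSC storage, the toy matrix, the fact that the derivative is computed directly from the objective above (including the identity part of Ē − I), and taking d_i as the exact Hessian diagonal are illustrative assumptions:

```python
import numpy as np
import scipy.sparse as sp

def coordinate_move(E_bar, x, g, s, i, gamma, d):
    """One step x_+ = x - (1/d_i) ∇_i f(x) e_i for the Google objective.

    Maintains g = Ē x - x and s = <e, x>; touches only the p_i nonzeros of column i."""
    lo, hi = E_bar.indptr[i], E_bar.indptr[i + 1]
    rows, vals = E_bar.indices[lo:hi], E_bar.data[lo:hi]      # sparse column a_i
    grad_i = vals @ g[rows] - g[i] + gamma * (s - 1.0)        # <a_i - e_i, g> + γ(<e,x> - 1)
    delta = -grad_i / d[i]                                    # step size h_i = 1/d_i
    x[i] += delta
    g[rows] += delta * vals                                   # g += δ a_i
    g[i] -= delta                                             # g -= δ e_i
    return x, g, s + delta

# Toy setup: column-normalized Ē and the Hessian diagonal d_i = γ + ||a_i - e_i||^2
rng = np.random.default_rng(0)
n, gamma = 200, 1.0
A = (rng.random((n, n)) < 0.03).astype(float) + 1e-3 * np.eye(n)   # avoid empty columns
E_bar = sp.csc_matrix(A / A.sum(axis=0))
M = E_bar - sp.eye(n, format="csc")
d = gamma + np.asarray(M.multiply(M).sum(axis=0)).ravel()
x = np.full(n, 1.0 / n); g = E_bar @ x - x; s = x.sum()
for _ in range(5000):
    x, g, s = coordinate_move(E_bar, x, g, s, rng.integers(n), gamma, d)
```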

  10. Random coordinate descent methods (RCDM)
      min_{x ∈ R^N} f(x)   (f is convex and differentiable).
      Main assumption: |f'_i(x + h_i e_i) − f'_i(x)| ≤ L_i |h_i| for all h_i ∈ R, i = 1, ..., N, where e_i is a coordinate vector. Then
        f(x + h_i e_i) ≤ f(x) + f'_i(x) h_i + (L_i/2) h_i²,   x ∈ R^N, h_i ∈ R.
      Define the coordinate steps: T_i(x) := x − (1/L_i) f'_i(x) e_i. Then
        f(x) − f(T_i(x)) ≥ (1/(2 L_i)) [f'_i(x)]²,   i = 1, ..., N.
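
The last inequality is just the quadratic upper bound evaluated at the step h_i = −f'_i(x)/L_i:

```latex
f(T_i(x)) \;\le\; f(x) + f'_i(x)\Big(-\frac{f'_i(x)}{L_i}\Big) + \frac{L_i}{2}\Big(\frac{f'_i(x)}{L_i}\Big)^{2}
          \;=\; f(x) - \frac{1}{2L_i}\,[f'_i(x)]^{2}.
```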

  11. Random coordinate choice
      We need a special random counter R_α, α ∈ R:
        Prob[i] = p_α^(i) = L_i^α · [Σ_{j=1}^N L_j^α]^{-1},   i = 1, ..., N.
      Note: R_0 generates the uniform distribution.
      Method RCDM(α, x_0). For k ≥ 0 iterate:
        1) Choose i_k = R_α.
        2) Update x_{k+1} = T_{i_k}(x_k).
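
A compact sketch of RCDM(α, x_0) for a generic smooth convex function (the least-squares test problem, the choice L_i = ‖A[:, i]‖², and all names are assumptions for illustration):

```python
import numpy as np

def rcdm(grad_i, L, x0, alpha=0.0, n_iters=20000, rng=None):
    """RCDM(α, x_0): sample i_k with Prob[i] ∝ L_i^α, then apply T_i(x) = x - f'_i(x)/L_i * e_i."""
    rng = rng or np.random.default_rng(0)
    p = L ** alpha
    p = p / p.sum()                           # the random counter R_α
    x = x0.copy()
    for _ in range(n_iters):
        i = rng.choice(L.size, p=p)           # i_k = R_α
        x[i] -= grad_i(x, i) / L[i]           # x_{k+1} = T_{i_k}(x_k)
    return x

# Example: f(x) = 0.5 ||A x - b||^2 has coordinate Lipschitz constants L_i = ||A[:, i]||^2
rng = np.random.default_rng(2)
A, b = rng.standard_normal((300, 50)), rng.standard_normal(300)
L = (A * A).sum(axis=0)
x = rcdm(lambda x, i: A[:, i] @ (A @ x - b), L, np.zeros(50), alpha=1.0)
```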

  12. Complexity bounds for RCDM
      We need the following norms for x, g ∈ R^N:
        ‖x‖_α = [Σ_{i=1}^N L_i^α (x^(i))²]^{1/2},   ‖g‖*_α = [Σ_{i=1}^N (1/L_i^α) (g^(i))²]^{1/2}.
      After k iterations, RCDM(α, x_0) generates a random output x_k, which depends on ξ_k = {i_0, ..., i_k}. Denote φ_k = E_{ξ_{k−1}} f(x_k).
      Theorem. For any k ≥ 1 we have
        φ_k − f* ≤ (2/k) · [Σ_{j=1}^N L_j^α] · R²_{1−α}(x_0),
      where R_β(x_0) = max_x { max_{x* ∈ X*} ‖x − x*‖_β : f(x) ≤ f(x_0) }.
      (In what follows, S_α := Σ_{j=1}^N L_j^α denotes the constant appearing in this bound.)

  13. Interpretation 1: α = 0
      Then S_0 = N, and we get
        φ_k − f* ≤ (2N/k) · R²_1(x_0).
      Note: we use the metric ‖x‖²_1 = Σ_{i=1}^N L_i (x^(i))². A matrix with diagonal {L_i}_{i=1}^N can have its norm equal to N. Hence, for GM we can guarantee the same bound, but its cost per iteration is much higher!

  14. Interpretation 2: α = 1/2
      Denote
        D_∞(x_0) = max_x { max_{y ∈ X*} max_{1 ≤ i ≤ N} |x^(i) − y^(i)| : f(x) ≤ f(x_0) }.
      Then R²_{1/2}(x_0) ≤ S_{1/2} D²_∞(x_0), and we obtain
        φ_k − f* ≤ (2/k) · [Σ_{i=1}^N L_i^{1/2}]² · D²_∞(x_0).
      Note: for first-order methods, the worst-case complexity of minimizing over a box depends on N. Since S_{1/2} can be bounded, RCDM can be applied in situations where the usual GM fails.
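
The bound R²_{1/2}(x_0) ≤ S_{1/2} D²_∞(x_0) follows directly from the definition of the norm ‖·‖_{1/2}:

```latex
\|x - y\|_{1/2}^{2} \;=\; \sum_{i=1}^{N} L_i^{1/2}\,\big(x^{(i)} - y^{(i)}\big)^{2}
\;\le\; \Big(\sum_{i=1}^{N} L_i^{1/2}\Big)\,\max_{1\le i\le N}\big(x^{(i)} - y^{(i)}\big)^{2}
\;\le\; S_{1/2}\,D_\infty^{2}(x_0).
```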

  15. Interpretation 3: α = 1
      Then R_0(x_0) is the size of the initial level set in the standard Euclidean norm. Hence,
        φ_k − f* ≤ (2/k) · [Σ_{i=1}^N L_i] · R²_0(x_0) ≡ (2N/k) · [(1/N) Σ_{i=1}^N L_i] · R²_0(x_0).
      The rate of convergence of GM can be estimated as
        f(x_k) − f* ≤ (γ/k) R²_0(x_0),
      where γ satisfies f''(x) ⪯ γ·I for all x ∈ R^N.
      Note: the maximal eigenvalue of a symmetric matrix can reach its trace (e.g., for a rank-one Hessian). In the worst case, the rate of convergence of GM is the same as that of RCDM.

  16. Minimizing strongly convex functions
      Theorem. Let f(x) be strongly convex with respect to ‖·‖_{1−α} with convexity parameter σ_{1−α} > 0. Then, for {x_k} generated by RCDM(α, x_0), we have
        φ_k − f* ≤ (1 − σ_{1−α}/S_α)^k (f(x_0) − f*).
      Proof. Let x_k be generated by RCDM after k iterations. Let us estimate the expected result of the next iteration:
        f(x_k) − E_{i_k}[f(x_{k+1})] = Σ_{i=1}^N p_α^(i) [f(x_k) − f(T_i(x_k))]
          ≥ Σ_{i=1}^N p_α^(i) (1/(2 L_i)) [f'_i(x_k)]² = (1/(2 S_α)) (‖f'(x_k)‖*_{1−α})²
          ≥ (σ_{1−α}/S_α) (f(x_k) − f*).
      It remains to compute the expectation in ξ_{k−1}.
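
The final inequality of the proof uses the standard consequence of strong convexity with respect to ‖·‖_{1−α}, stated here for completeness:

```latex
f(x) - f^{*} \;\le\; \frac{1}{2\,\sigma_{1-\alpha}}\,\big(\|f'(x)\|^{*}_{1-\alpha}\big)^{2},
\qquad x \in \mathbb{R}^{N}.
```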

  17. Confidence level of the answers
      Note: we have proved that the expected values of the random f(x_k) are good. Can we guarantee anything after a single run?
      Confidence level: the probability β ∈ (0, 1) that some statement about the random output is correct.
      Main tool: Chebyshev inequality for ξ ≥ 0:
        Prob[ξ ≥ T] ≤ E(ξ)/T.
      Our situation:
        Prob[f(x_k) − f* ≥ ε] ≤ (1/ε) [φ_k − f*] ≤ 1 − β.
      We need φ_k − f* ≤ ε·(1 − β). Too expensive for β → 1?

  18. Regularization technique
      Consider f_μ(x) = f(x) + (μ/2) ‖x − x_0‖²_{1−α}. It is strongly convex. Therefore, we can obtain φ_k − f*_μ ≤ ε·(1 − β) in
        O( (S_α/μ) · ln( 1/(ε·(1 − β)) ) )  iterations.
      Theorem. Define α = 1, μ = ε/(4 R²_0(x_0)), and choose
        k ≥ 1 + (8 S_1 R²_0(x_0)/ε) · [ ln(2 S_1 R²_0(x_0)/ε) + ln(1/(1 − β)) ].
      Let x_k be generated by RCDM(1, x_0) as applied to f_μ. Then
        Prob(f(x_k) − f* ≤ ε) ≥ β.
      Note: β = 1 − 10^{-p} ⇒ ln(1/(1 − β)) = p ln 10 ≈ 2.3 p.
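
A tiny numeric illustration of the iteration count in the theorem as reconstructed above (the values of S_1, R²_0(x_0), ε, and β are made up for illustration):

```python
import math

def rcdm_confidence_iterations(S1, R0_sq, eps, beta):
    """k >= 1 + (8 S_1 R_0^2 / eps) * ( ln(2 S_1 R_0^2 / eps) + ln(1 / (1 - beta)) )."""
    c = S1 * R0_sq / eps
    return 1 + 8 * c * (math.log(2 * c) + math.log(1.0 / (1.0 - beta)))

# Pushing the confidence from beta = 0.9 to beta = 1 - 1e-9 only adds a term ln(1/(1-beta)) ≈ 20.7
print(rcdm_confidence_iterations(S1=1e4, R0_sq=1.0, eps=1e-3, beta=0.9))
print(rcdm_confidence_iterations(S1=1e4, R0_sq=1.0, eps=1e-3, beta=1 - 1e-9))
```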
