SLIDE 1

Liege University: Francqui Chair 2011-2012 Lecture 3: Huge-scale optimization problems

Yurii Nesterov, CORE/INMA (UCL) March 9, 2012

SLIDE 2

Outline

1. Problem sizes
2. Random coordinate search
3. Confidence level of solutions
4. Sparse optimization problems
5. Sparse updates for linear operators
6. Fast updates in computational trees
7. Simple subgradient methods
8. Application examples

SLIDE 3

Nonlinear optimization: problem sizes

Class         Operations   Dimension       Iter. cost    Memory
Small-size    All          10^0 − 10^2     n^4 → n^3     Kilobyte: 10^3
Medium-size   A^{-1}       10^3 − 10^4     n^3 → n^2     Megabyte: 10^6
Large-scale   Ax           10^5 − 10^7     n^2 → n       Gigabyte: 10^9
Huge-scale    x + y        10^8 − 10^12    n → log n     Terabyte: 10^12

Sources of Huge-Scale problems

• Internet (New)
• Telecommunications (New)
• Finite-element schemes (Old)
• Partial differential equations (Old)

SLIDE 4

Very old optimization idea: Coordinate Search

Problem: min_{x∈R^n} f(x) (f is convex and differentiable).

Coordinate relaxation algorithm

For k ≥ 0 iterate:
1. Choose the active coordinate i_k.
2. Update x_{k+1} = x_k − h_k ∇_{i_k} f(x_k) e_{i_k}, ensuring f(x_{k+1}) ≤ f(x_k).

(e_i is the i-th coordinate vector in R^n.)

Main advantage: very simple implementation.
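
A minimal sketch of this scheme in Python (mine, not from the lecture; the cyclic coordinate rule, the step size h, and the quadratic test function are illustrative assumptions):

```python
import numpy as np

def coordinate_search(f, grad_i, x0, h=0.1, iters=1000):
    """Coordinate relaxation: move along one coordinate at a time,
    accepting the step only if it does not increase f."""
    x = x0.copy()
    n = len(x)
    for k in range(iters):
        i = k % n                        # cyclic choice of the active coordinate i_k
        x_new = x.copy()
        x_new[i] -= h * grad_i(x, i)     # x_{k+1} = x_k - h_k * df/dx_i * e_i
        if f(x_new) <= f(x):             # ensure f(x_{k+1}) <= f(x_k)
            x = x_new
    return x

# Illustrative test: f(x) = 1/2 <Ax, x> - <b, x>, so df/dx_i = <A_i, x> - b_i
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 1.0])
x = coordinate_search(lambda x: 0.5 * x @ A @ x - b @ x,
                      lambda x, i: A[i] @ x - b[i],
                      np.zeros(2))
```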

SLIDE 5

Possible strategies

1. Cyclic moves. (Difficult to analyze.)
2. Random choice of the coordinate. (Why?)
3. Choose the coordinate with the maximal directional derivative.

Complexity estimate (for strategy 3): assume ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y ∈ R^n, and choose h_k = 1/L. Then

f(x_k) − f(x_{k+1}) ≥ (1/(2L)) |∇_{i_k} f(x_k)|² ≥ (1/(2nL)) ‖∇f(x_k)‖² ≥ (1/(2nLR²)) (f(x_k) − f*)².

Hence f(x_k) − f* ≤ 2nLR²/k, k ≥ 1. (For the gradient method, drop the factor n.)

This is the only known theoretical result for CDM!

SLIDE 6

Criticism

Theoretical justification: complexity bounds are not known for most of the schemes. The only justified scheme needs computation of the whole gradient. (Why not use the gradient method then?)

Computational complexity: fast differentiation: if a function is defined by a sequence of operations, then C(∇f) ≤ 4C(f). Can we do anything without computing the function's values?

Result: CDM are almost out of computational practice.

SLIDE 7

Google problem

Let E ∈ R^{n×n} be an incidence matrix of a graph. Denote e = (1, …, 1)^T and Ē = E · diag(E^T e)^{-1}. Thus Ē^T e = e. Our problem is as follows:

Find x* ≥ 0: Ē x* = x*.

Optimization formulation:

f(x) := (1/2) ‖Ē x − x‖² + (γ/2) [⟨e, x⟩ − 1]² → min_{x∈R^n}.
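
A small sketch of this construction (mine; the 3-page link matrix is a made-up example, and column j of Ē is the normalized outgoing-link pattern of node j):

```python
import numpy as np

# Hypothetical 3-node link matrix: E[i, j] = 1 if node j links to node i.
E = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 1., 0.]])
e = np.ones(E.shape[0])

E_bar = E / (E.T @ e)        # divide column j by its sum, so that E_bar^T e = e

gamma = 1.0
def f(x):
    """Objective from the slide: 1/2 ||E_bar x - x||^2 + gamma/2 (<e, x> - 1)^2."""
    return 0.5 * np.linalg.norm(E_bar @ x - x) ** 2 + 0.5 * gamma * (e @ x - 1.0) ** 2
```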

SLIDE 8

Huge-scale problems

Main features

• The size is very big (n ≥ 10^7).
• The data is distributed in space.
• The requested parts of the data are not always available.
• The data is changing in time.

Consequences

The simplest operations are expensive or infeasible:
• update of the full vector of variables;
• matrix-vector multiplication;
• computation of the objective function's value, etc.

SLIDE 9

Structure of the Google Problem

Let us look at the gradient of the objective:

∇_i f(x) = ⟨a_i, g(x)⟩ + γ[⟨e, x⟩ − 1], i = 1, …, n, where g(x) = Ē x − x ∈ R^n and Ē = (a_1, …, a_n).

Main observations:
• The coordinate move x⁺ = x − h_i ∇_i f(x) e_i needs O(p_i) a.o. (p_i is the number of nonzero elements in a_i).
• The diagonal entries d_i := diag(∇²f)_i = γ + 1/p_i of ∇²f := Ē^T Ē + γ e e^T are available. We can use them for choosing the step sizes (h_i = 1/d_i).

Reasonable coordinate choice strategy? Random!

SLIDE 10

Random coordinate descent methods (RCDM)

min_{x∈R^N} f(x) (f is convex and differentiable).

Main Assumption: |f′_i(x + h_i e_i) − f′_i(x)| ≤ L_i |h_i|, h_i ∈ R, i = 1, …, N, where e_i is a coordinate vector. Then

f(x + h_i e_i) ≤ f(x) + f′_i(x) h_i + (L_i/2) h_i², x ∈ R^N, h_i ∈ R.

Define the coordinate steps: T_i(x) := x − (1/L_i) f′_i(x) e_i. Then

f(x) − f(T_i(x)) ≥ (1/(2L_i)) [f′_i(x)]², i = 1, …, N.
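
A compact Python sketch of the resulting method (mine; it uses the counter R_α defined formally on the next slide, implemented here by direct sampling, and an illustrative quadratic):

```python
import numpy as np

def rcdm(grad_i, L, x0, alpha=0.0, iters=10000, seed=0):
    """RCDM(alpha, x0): pick coordinate i with probability ~ L_i^alpha,
    then apply the step T_i(x) = x - f'_i(x)/L_i * e_i."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    p = L ** alpha / np.sum(L ** alpha)   # the random counter R_alpha
    for _ in range(iters):
        i = rng.choice(len(x), p=p)       # i_k = R_alpha
        x[i] -= grad_i(x, i) / L[i]       # x_{k+1} = T_{i_k}(x_k)
    return x

# Illustrative quadratic: f'_i(x) = <A_i, x> - b_i, with L_i = A_ii
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 1.0])
x = rcdm(lambda x, i: A[i] @ x - b[i], np.diag(A).copy(), np.zeros(2), alpha=1.0)
```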

SLIDE 11

Random coordinate choice

We need a special random counter R_α, α ∈ R:

Prob[i] = p_α^{(i)} = L_i^α · [Σ_{j=1}^N L_j^α]^{-1}, i = 1, …, N.

Note: R_0 generates the uniform distribution.

Method RCDM(α, x_0): for k ≥ 0 iterate:
1) Choose i_k = R_α.
2) Update x_{k+1} = T_{i_k}(x_k).

SLIDE 12

Complexity bounds for RCDM

We need to introduce the following norms for x, g ∈ R^N:

‖x‖_α = [Σ_{i=1}^N L_i^α (x^{(i)})²]^{1/2},  ‖g‖*_α = [Σ_{i=1}^N (1/L_i^α)(g^{(i)})²]^{1/2}.

After k iterations, RCDM(α, x_0) generates a random output x_k, which depends on ξ_k = {i_0, …, i_k}. Denote φ_k = E_{ξ_{k−1}} f(x_k).

Theorem. For any k ≥ 1 we have

φ_k − f* ≤ (2/k) · [Σ_{j=1}^N L_j^α] · R²_{1−α}(x_0),

where R_β(x_0) = max_x { max_{x*∈X*} ‖x − x*‖_β : f(x) ≤ f(x_0) }.
SLIDE 13

Interpretation

1. α = 0. Then S_0 = N, and we get

φ_k − f* ≤ (2N/k) · R²_1(x_0).

Note: we use the metric ‖x‖²_1 = Σ_{i=1}^N L_i (x^{(i)})². A matrix with diagonal {L_i}_{i=1}^N can have its norm in this metric as large as N. Hence, for GM we can only guarantee the same bound, but its cost of iteration is much higher!

SLIDE 14

Interpretation

2. α = 1/2. Denote

D_∞(x_0) = max_x { max_{y∈X*} max_{1≤i≤N} |x^{(i)} − y^{(i)}| : f(x) ≤ f(x_0) }.

Then R²_{1/2}(x_0) ≤ S_{1/2} D²_∞(x_0), and we obtain

φ_k − f* ≤ (2/k) · [Σ_{i=1}^N L_i^{1/2}]² · D²_∞(x_0).

Note: for first-order methods, the worst-case complexity of minimizing over a box depends on N. Since S_{1/2} can be bounded, RCDM can be applied in situations where the usual GM fails.

SLIDE 15

Interpretation

3. α = 1. Then R_0(x_0) is the size of the initial level set in the standard Euclidean norm. Hence,

φ_k − f* ≤ (2/k) · [Σ_{i=1}^N L_i] · R²_0(x_0) ≡ (2N/k) · [(1/N) Σ_{i=1}^N L_i] · R²_0(x_0).

The rate of convergence of GM can be estimated as

f(x_k) − f* ≤ (γ/k) · R²_0(x_0),

where γ satisfies the condition f″(x) ⪯ γ·I, x ∈ R^N. Note: the maximal eigenvalue of a symmetric matrix can reach its trace. So, in the worst case, the rate of convergence of GM is the same as that of RCDM.

SLIDE 16

Minimizing strongly convex functions

Theorem. Let f(x) be strongly convex with respect to ‖·‖_{1−α} with convexity parameter σ_{1−α} > 0. Then, for {x_k} generated by RCDM(α, x_0), we have

φ_k − f* ≤ (1 − σ_{1−α}/S_α)^k · (f(x_0) − f*).

Proof: let x_k be generated by RCDM after k iterations, and let us estimate the expected result of the next iteration:

f(x_k) − E_{i_k}[f(x_{k+1})] = Σ_{i=1}^N p_α^{(i)} · [f(x_k) − f(T_i(x_k))]
 ≥ Σ_{i=1}^N (p_α^{(i)}/(2L_i)) · [f′_i(x_k)]² = (1/(2S_α)) · (‖f′(x_k)‖*_{1−α})²
 ≥ (σ_{1−α}/S_α) · (f(x_k) − f*).

It remains to compute the expectation in ξ_{k−1}.

SLIDE 17

Confidence level of the answers

Note: we have proved that the expected values of the random f(x_k) are good. Can we guarantee anything after a single run?

Confidence level: the probability β ∈ (0, 1) that some statement about the random output is correct.

Main tool: Chebyshev inequality (for ξ ≥ 0): Prob[ξ ≥ T] ≤ E(ξ)/T.

Our situation: Prob[f(x_k) − f* ≥ ε] ≤ (1/ε)·[φ_k − f*] ≤ 1 − β. Thus we need φ_k − f* ≤ ε·(1 − β). Too expensive for β → 1?

SLIDE 18

Regularization technique

Consider f_μ(x) = f(x) + (μ/2)‖x − x_0‖²_{1−α}. It is strongly convex. Therefore, we can obtain φ_k − f*_μ ≤ ε·(1 − β) in O((S_α/μ) · ln [1/(ε·(1 − β))]) iterations.

Theorem. Define α = 1, μ = ε/(4R²_0(x_0)), and choose

k ≥ 1 + (8 S_1 R²_0(x_0)/ε) · [ln (2 S_1 R²_0(x_0)/ε) + ln (1/(1 − β))].

Let x_k be generated by RCDM(1, x_0) as applied to f_μ. Then Prob(f(x_k) − f* ≤ ε) ≥ β.

Note: β = 1 − 10^{−p} ⇒ ln 10^p = 2.3p.

SLIDE 19

Implementation details: Random Counter

Given the values L_i, i = 1, …, N, generate efficiently a random i ∈ {1, …, N} with probabilities Prob[i = k] = L_k / Σ_{j=1}^N L_j.

Solution:
a) Trivial ⇒ O(N) operations.
b) Assume N = 2^p. Define p + 1 vectors S_k ∈ R^{2^{p−k}}, k = 0, …, p:

S_0^{(i)} = L_i, i = 1, …, N;
S_k^{(i)} = S_{k−1}^{(2i)} + S_{k−1}^{(2i−1)}, i = 1, …, 2^{p−k}, k = 1, …, p.

Algorithm: make the choice in p steps, from top to bottom. If element i of S_k is chosen, then choose in S_{k−1} either 2i or 2i − 1 with probabilities S_{k−1}^{(2i)} / S_k^{(i)} or S_{k−1}^{(2i−1)} / S_k^{(i)}.

Difference: for N = 2^20 > 10^6 we have p = log₂ N = 20.
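
A sketch of this counter in Python (class and method names are my additions; the sampling walk follows the slide, and the O(log N) update of a single weight is the natural companion operation):

```python
import random

class RandomCounter:
    """Sample i with Prob[i] = L[i]/sum(L) in O(log N) per draw.
    Sketch of the binary-tree counter; assumes N is a power of two."""
    def __init__(self, L):
        self.S = [list(L)]                     # S_0 holds the weights L_i
        while len(self.S[-1]) > 1:             # S_k: pairwise sums of S_{k-1}
            prev = self.S[-1]
            self.S.append([prev[2*i] + prev[2*i + 1] for i in range(len(prev) // 2)])

    def sample(self):
        i = 0
        for k in range(len(self.S) - 1, 0, -1):   # from the root down to the leaves
            left = self.S[k - 1][2 * i]           # weight of child 2i
            total = self.S[k][i]                  # = left + right
            i = 2 * i if random.random() * total < left else 2 * i + 1
        return i

    def update(self, i, new_L):                   # changing one L_i also costs O(log N)
        self.S[0][i] = new_L
        for k in range(1, len(self.S)):
            i //= 2
            self.S[k][i] = self.S[k - 1][2 * i] + self.S[k - 1][2 * i + 1]
```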

SLIDE 20

Sparse problems

Problem: min_{x∈Q} f(x), where Q is closed and convex in R^N, and f(x) = Ψ(Ax), where Ψ is a simple convex function:

Ψ(y₁) ≥ Ψ(y₂) + ⟨Ψ′(y₂), y₁ − y₂⟩, y₁, y₂ ∈ R^M,

and A: R^N → R^M is a sparse matrix.

Let p(x) := number of nonzeros in x. Sparsity coefficient: γ(A) := p(A)/(MN).

Example 1: matrix-vector multiplication. Computation of the vector Ax needs p(A) operations: the dense complexity MN is reduced by the factor γ(A).

SLIDE 21

Gradient Method

x_0 ∈ Q, x_{k+1} = π_Q(x_k − h f′(x_k)), k ≥ 0.

Main computational expenses:
• Projection onto the simple set Q needs O(N) operations.
• The displacement x_k → x_k − h f′(x_k) needs O(N) operations.
• f′(x) = A^T Ψ′(Ax): if Ψ is simple, then the main effort is spent on two matrix-vector multiplications: 2p(A).

Conclusion: as compared with full matrices, we accelerate by the factor γ(A). Note: for large- and huge-scale problems, we often have γ(A) ≈ 10^{−4} … 10^{−6}. Can we get more?

SLIDE 22

Sparse updating strategy

Main idea: after the update x⁺ = x + d we have y⁺ := A x⁺ = Ax + Ad = y + Ad. What happens if d is sparse? Denote σ(d) = {j : d^{(j)} ≠ 0}. Then

y⁺ = y + Σ_{j∈σ(d)} d^{(j)} · A e_j.

Its complexity, κ_A(d) := Σ_{j∈σ(d)} p(A e_j), can be VERY small:

κ_A(d) = M Σ_{j∈σ(d)} γ(A e_j) = γ(d) · [(1/p(d)) Σ_{j∈σ(d)} γ(A e_j)] · MN ≤ γ(d) · max_j γ(A e_j) · MN.

If γ(d) ≤ c·γ(A) and γ(A e_j) ≤ c·γ(A), then κ_A(d) ≤ c²·γ²(A)·MN.

Expected acceleration: (10^{−6})² = 10^{−12}, i.e., one second of sparse-update work replaces about 32,000 years of dense work (10^{12} sec ≈ 32,000 years).
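
A sketch of the sparse update y⁺ = y + Ad (mine; it uses SciPy's column-compressed storage so that each column A e_j is directly addressable, and the tiny matrix is a made-up example):

```python
import numpy as np
from scipy.sparse import csc_matrix

def sparse_update(A, y, sigma, d_vals):
    """y := y + A d, touching only the columns j in sigma(d).
    Cost: kappa_A(d) = sum over j in sigma(d) of p(A e_j) operations."""
    for j, dj in zip(sigma, d_vals):
        lo, hi = A.indptr[j], A.indptr[j + 1]
        y[A.indices[lo:hi]] += dj * A.data[lo:hi]   # y += d_j * (A e_j)
    return y

# Illustrative sparse matrix in CSC form
A = csc_matrix(np.array([[1., 0., 2.],
                         [0., 3., 0.],
                         [0., 0., 4.]]))
y = A @ np.ones(3)
y = sparse_update(A, y, sigma=[2], d_vals=[0.5])    # d = 0.5 * e_3 (0-based index 2)
```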

SLIDE 23

When it can work?

Simple methods: no full-vector operations! (Is it possible?)
Simple problems: functions with sparse gradients.

Examples:
1. Quadratic function f(x) = (1/2)⟨Ax, x⟩ − ⟨b, x⟩. The gradient f′(x) = Ax − b, x ∈ R^N, is not sparse even if A is sparse.
2. Piecewise linear function g(x) = max_{1≤i≤m} [⟨a_i, x⟩ − b^{(i)}]. Its subgradient f′(x) = a_{i(x)}, with i(x) such that f(x) = ⟨a_{i(x)}, x⟩ − b^{(i(x))}, can be sparse if a_{i(x)} is sparse!

But: we need a fast procedure for updating the max-operations.

SLIDE 24

Fast updates in short computational trees

Def: a function f(x), x ∈ R^n, is short-tree representable if it can be computed by a short binary tree with height ≈ ln n. Let n = 2^k and let the tree have k + 1 levels: v_{0,i} = x^{(i)}, i = 1, …, n. The size of every next level is half the size of the previous one:

v_{i+1,j} = ψ_{i+1,j}(v_{i,2j−1}, v_{i,2j}), j = 1, …, 2^{k−i−1}, i = 0, …, k − 1,

where the ψ_{i,j} are some bivariate functions.

[Figure: the binary tree, from the leaves v_{0,1}, …, v_{0,n} through v_{1,1}, …, v_{1,n/2} up to the root v_{k,1}.]

SLIDE 25

Main advantages

Important examples (symmetric functions):

f(x) = ‖x‖_p, p ≥ 1: ψ_{i,j}(t₁, t₂) ≡ [|t₁|^p + |t₂|^p]^{1/p};
f(x) = ln(Σ_{i=1}^n e^{x^{(i)}}): ψ_{i,j}(t₁, t₂) ≡ ln(e^{t₁} + e^{t₂});
f(x) = max_{1≤i≤n} x^{(i)}: ψ_{i,j}(t₁, t₂) ≡ max{t₁, t₂}.

The binary tree requires only n − 1 auxiliary cells. Computing its value needs n − 1 applications of ψ_{i,j}(·,·) (≡ operations). If x⁺ differs from x in one entry only, then re-computing f(x⁺) needs only k ≡ log₂ n operations. Thus, we can have pure subgradient minimization schemes with sublinear iteration cost.
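
A sketch for the third example, f(x) = max_i x^{(i)} (mine; the class name is an invention, and n is assumed to be a power of two):

```python
class MaxTree:
    """Short computational tree for f(x) = max_i x_i with O(log n) updates.
    Here psi(t1, t2) = max(t1, t2)."""
    def __init__(self, x):
        self.v = [list(x)]                    # level 0: the leaves v_{0,i}
        while len(self.v[-1]) > 1:            # each level halves the previous one
            prev = self.v[-1]
            self.v.append([max(prev[2*j], prev[2*j+1]) for j in range(len(prev) // 2)])

    def value(self):
        return self.v[-1][0]                  # the root v_{k,1} holds f(x)

    def update(self, i, xi):                  # change one entry: log2(n) re-computations
        self.v[0][i] = xi
        for level in range(1, len(self.v)):
            i //= 2
            self.v[level][i] = max(self.v[level-1][2*i], self.v[level-1][2*i+1])
```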

SLIDE 26

Simple subgradient methods

I. Problem: f* := min_{x∈Q} f(x), where Q is closed and convex, ‖f′(x)‖ ≤ L(f) for x ∈ Q, and the optimal value f* is known.

Consider the following optimization scheme (B. Polyak, 1967):

x_0 ∈ Q, x_{k+1} = π_Q( x_k − [(f(x_k) − f*) / ‖f′(x_k)‖²] · f′(x_k) ), k ≥ 0.

Denote f*_k = min_{0≤i≤k} f(x_i). Then for any k ≥ 0 we have:

f*_k − f* ≤ L(f) · ‖x_0 − π_{X*}(x_0)‖ / (k + 1)^{1/2},
‖x_k − x*‖ ≤ ‖x_0 − x*‖, ∀x* ∈ X*.
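
A minimal sketch of Polyak's scheme (mine; the stopping test and the ℓ1 example are illustrative assumptions, and proj defaults to the identity, i.e. Q = R^n):

```python
import numpy as np

def polyak(f, subgrad, f_star, x0, proj=lambda x: x, iters=1000):
    """Polyak's 1967 step: x_{k+1} = pi_Q(x_k - (f(x_k) - f*) / ||g||^2 * g)."""
    x, record = x0.copy(), np.inf
    for _ in range(iters):
        fx, g = f(x), subgrad(x)
        record = min(record, fx)                  # f*_k = min_{0<=i<=k} f(x_i)
        if fx <= f_star or not np.any(g):
            break                                 # reached the optimal value
        x = proj(x - (fx - f_star) / (g @ g) * g)
    return x, record

# Illustrative run: f(x) = ||x||_1 with f* = 0 and Q = R^n
x, record = polyak(lambda x: np.abs(x).sum(), np.sign, 0.0, np.array([3.0, -2.0]))
```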

SLIDE 27

Proof:

Let us fix x* ∈ X*. Denote r_k(x*) = ‖x_k − x*‖. Then

r²_{k+1}(x*) ≤ ‖x_k − [(f(x_k) − f*)/‖f′(x_k)‖²] f′(x_k) − x*‖²
 = r²_k(x*) − 2[(f(x_k) − f*)/‖f′(x_k)‖²] ⟨f′(x_k), x_k − x*⟩ + (f(x_k) − f*)²/‖f′(x_k)‖²
 ≤ r²_k(x*) − (f(x_k) − f*)²/‖f′(x_k)‖²
 ≤ r²_k(x*) − (f*_k − f*)²/L²(f).

From this reasoning, ‖x_{k+1} − x*‖² ≤ ‖x_k − x*‖², ∀x* ∈ X*.

Corollary: assume X* has a recession direction d*. Then

‖x_k − π_{X*}(x_0)‖ ≤ ‖x_0 − π_{X*}(x_0)‖, ⟨d*, x_k⟩ ≥ ⟨d*, x_0⟩.

(Proof: consider x* = π_{X*}(x_0) + α d*, α ≥ 0.)

SLIDE 28

Constrained minimization (N.Shor (1964) & B.Polyak)

II. Problem: min_{x∈Q} {f(x) : g(x) ≤ 0}, where Q is closed and convex, and f, g have uniformly bounded subgradients.

Consider the following method, with step-size parameter h > 0:

If g(x_k) > h‖g′(x_k)‖, then (A): x_{k+1} = π_Q( x_k − [g(x_k)/‖g′(x_k)‖²] · g′(x_k) ),
else (B): x_{k+1} = π_Q( x_k − [h/‖f′(x_k)‖] · f′(x_k) ).

Let F_k ⊆ {0, …, k} be the set of (B)-iterations, and f*_k = min_{i∈F_k} f(x_i).

Theorem: if k > ‖x_0 − x*‖²/h², then F_k ≠ ∅ and

f*_k − f* ≤ h·L(f), max_{i∈F_k} g(x_i) ≤ h·L(g).
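
A sketch of the switching scheme (mine; function names are inventions, and proj again defaults to Q = R^n):

```python
import numpy as np

def switching_scheme(f, fsub, g, gsub, x0, h, proj=lambda x: x, iters=1000):
    """Shor/Polyak switching method for min{f(x) : g(x) <= 0, x in Q}."""
    x = x0.copy()
    record_x, record_f = None, np.inf
    for _ in range(iters):
        gx, dg = g(x), gsub(x)
        if gx > h * np.linalg.norm(dg):              # (A): push toward feasibility
            x = proj(x - gx / (dg @ dg) * dg)
        else:                                        # (B): decrease the objective
            if f(x) < record_f:                      # track f*_k over the set F_k
                record_x, record_f = x.copy(), f(x)
            df = fsub(x)
            x = proj(x - h / np.linalg.norm(df) * df)
    return record_x, record_f
```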

SLIDE 29

Computational strategies

1. Constants L(f), L(g) are known (e.g., Linear Programming). We can take h = ε/max{L(f), L(g)}. Then we need to decide only on the number of steps N (easy!). Note: the standard advice is h = R/(N + 1)^{1/2} (much more difficult!).

2. Constants L(f), L(g) are not known. Start from a guess. Restart from scratch each time we see the guess is wrong. The guess is doubled after each restart.

3. Tracking the record value f*_k: double run. Other ideas are welcome!

SLIDE 30

Application examples

Observations:

1. Very often, large- and huge-scale problems have repetitive sparsity patterns and/or limited connectivity:
   ◮ Social networks.
   ◮ Mobile phone networks.
   ◮ Truss topology design (local bars).
   ◮ Finite-element models (2D: four neighbors, 3D: six neighbors).
2. For p-diagonal matrices, κ(A) ≤ p².

SLIDE 31

Nonsmooth formulation of Google Problem

Main property of the spectral radius (A ≥ 0): if A ∈ R^{n×n}_+, then

ρ(A) = min_{x≥0} max_{1≤i≤n} (1/x^{(i)}) ⟨e_i, Ax⟩.

The minimum is attained at the corresponding eigenvector. Since ρ(Ē) = 1, our problem is as follows:

f(x) := max_{1≤i≤N} [⟨e_i, Ēx⟩ − x^{(i)}] → min_{x≥0}.

Interpretation: maximizing the self-esteem! Since f* = 0, we can apply Polyak's method with sparse updates.

Additional features: the optimal set X* is a convex cone. If x_0 = e, then the whole sequence is separated from zero:

⟨x*, e⟩ ≤ ⟨x*, x_k⟩ ≤ ‖x*‖₁ · ‖x_k‖_∞ = ⟨x*, e⟩ · ‖x_k‖_∞.

Goal: find x̄ ≥ 0 such that ‖x̄‖_∞ ≥ 1 and f(x̄) ≤ ε. (The first condition is satisfied automatically.)
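
A dense sketch of Polyak's method on this objective (mine; a genuinely huge-scale run would combine the sparse update and the max-tree from the earlier sketches, and E_bar is assumed built as in the slide-7 sketch):

```python
import numpy as np

def google_polyak(E_bar, iters=500, eps=1e-9):
    """Polyak's method (f* = 0) on f(x) = max_i [<e_i, E_bar x> - x_i], Q = {x >= 0}."""
    n = E_bar.shape[0]
    x = np.ones(n)                        # x0 = e keeps the sequence separated from zero
    for _ in range(iters):
        r = E_bar @ x - x                 # residuals <e_i, E_bar x> - x_i
        i = int(np.argmax(r))             # active index of the max
        fx = r[i]
        if fx <= eps:
            break                         # f(x) <= eps: done
        g = E_bar[i].copy()
        g[i] -= 1.0                       # subgradient: i-th row of (E_bar - I)
        x = np.maximum(x - fx / (g @ g) * g, 0.0)   # projection onto x >= 0
    return x

# Usage with E_bar from the slide-7 sketch:
# x = google_polyak(E_bar)
```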

SLIDE 32

Computational experiments: Iteration Cost

We compare Polyak's GM with sparse updates (GMs) with the standard one (GM).

Setup: each agent has exactly p random friends, so κ(A) ≈ p².
Iteration cost: GMs ≈ p² log₂ N, GM ≈ pN.

Time for 10⁴ iterations (p = 32):

N       κ(A)   GMs    GM
1024    1632   3.00   2.98
2048    1792   3.36   6.41
4096    1888   3.75   15.11
8192    1920   4.20   139.92
16384   1824   4.69   408.38

Time for 10³ iterations (p = 16):

N        κ(A)   GMs    GM
131072   576    0.19   213.9
262144   592    0.25   477.8
524288   592    0.32   1095.5
1048576  608    0.40   2590.8

At N ≈ 10⁶, one second of GMs corresponds to about 100 minutes of GM: 1 sec ≈ 100 min!
