 
Path algorithms
Geoff Gordon & Ryan Tibshirani
Optimization 10-725 / 36-725
Path algorithms

In this lecture we consider problems of the form

$$\min_{x \in \mathbb{R}^n} \; g(x) + \lambda \cdot h(x)$$

where $\lambda \geq 0$. In particular, we’ll look at path algorithms, which deliver $\{ x^\star(\lambda) : \lambda \in [0, \infty] \}$, called the solution path.

Path algorithms start at one end (either $\lambda = 0$ or $\lambda = \infty$) where computing the solution is easy, and essentially trace the solution path by successively satisfying the KKT conditions.

Properties:
• They deliver the exact solution at all values of $\lambda$ (no iterative approximation, so no error bound guarantees are needed)
• They provide a useful platform for tuning parameter selection and statistical analysis
1d fused lasso (total variation denoising)

Given $y \in \mathbb{R}^n$, consider the 1d fused lasso or 1d total variation denoising problem:

$$\min_{x \in \mathbb{R}^n} \; \frac{1}{2} \sum_{i=1}^{n} (y_i - x_i)^2 + \lambda \sum_{i=1}^{n-1} |x_i - x_{i+1}|$$

[Figure: example with $n = 100$; the solution $x^\star$ at $\lambda = 5$ is plotted in red]

The solution is piecewise constant with adaptively chosen break points; the larger $\lambda$, the fewer the breaks.
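To make the two terms of the criterion concrete, here is a minimal sketch that evaluates the 1d fused lasso objective for a candidate $x$. The function name `fused_lasso_objective` is a hypothetical helper introduced for illustration; it is not part of the lecture.

```python
import numpy as np

def fused_lasso_objective(x, y, lam):
    """Evaluate 1/2 * sum (y_i - x_i)^2 + lam * sum |x_i - x_{i+1}|."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    fit = 0.5 * np.sum((y - x) ** 2)             # squared-error loss
    penalty = lam * np.sum(np.abs(np.diff(x)))   # total variation of x
    return fit + penalty
```

Taking `x = y` makes the loss zero and leaves only the penalty, while a constant `x` makes the penalty zero and leaves only the loss; the solution path moves between these two extremes as $\lambda$ grows.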
At $\lambda = 0$, the solution is simply $x^\star(0) = y$.

[Figure: six panels showing the solution at $\lambda$ = 0.100, 0.299, 0.352, 0.460, 0.648, and 1.102]

Strategy: construct a sequence of $\lambda$ values $\lambda_1 \leq \lambda_2 \leq \ldots$ at which adjacent coordinates of $x^\star(\lambda)$ become equal, or “fused” (Hoefling, 2009).
In between these critical values $\lambda_1 \leq \lambda_2 \leq \ldots$, the solution $x^\star(\lambda)$ is simply linear: if $\lambda \in [\lambda_k, \lambda_{k+1}]$, then

$$x^\star(\lambda) = \alpha \cdot x^\star(\lambda_k) + (1 - \alpha) \cdot x^\star(\lambda_{k+1}), \quad \text{where } \alpha = \frac{\lambda_{k+1} - \lambda}{\lambda_{k+1} - \lambda_k}$$

How many critical values are there? Can there be more than $n - 1$? We will rely on the following useful fact:

Lemma (Friedman et al., 2007): For any coordinate $i$ of the 1d fused lasso solution, if $x_i^\star(\lambda) = x_{i+1}^\star(\lambda)$, then $x_i^\star(\lambda') = x_{i+1}^\star(\lambda')$ for all $\lambda' > \lambda$.

I.e., once two coordinates fuse, they can never unfuse. So there are at most $n - 1$ critical points (exactly $n - 1$ when the fusions happen one at a time).
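The interpolation formula can be sketched in a few lines. This is a minimal illustration, assuming the critical values and the solutions at those values have already been computed; the name `interpolate_path` and its argument layout are hypothetical. It also assumes the path is constant beyond the last critical value, which holds for the 1d fused lasso once everything has fused.

```python
import numpy as np

def interpolate_path(lam, knots, solutions):
    """Evaluate x*(lam) from increasing knots and solutions[k] = x*(knots[k])."""
    knots = np.asarray(knots, dtype=float)
    solutions = np.asarray(solutions, dtype=float)
    if lam < knots[0]:
        raise ValueError("lam lies before the first knot")
    if lam >= knots[-1]:
        # Assumption: the path is constant after the last critical value.
        return solutions[-1].copy()
    # k is the last index with knots[k] <= lam, so knots[k + 1] > lam.
    k = int(np.searchsorted(knots, lam, side="right")) - 1
    alpha = (knots[k + 1] - lam) / (knots[k + 1] - knots[k])
    return alpha * solutions[k] + (1 - alpha) * solutions[k + 1]
```

Storing only the solutions at the critical values is therefore enough to recover the solution at every $\lambda$.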
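Our problem:

$$\min_{x \in \mathbb{R}^n} \; \frac{1}{2} \sum_{i=1}^{n} (y_i - x_i)^2 + \lambda \sum_{i=1}^{n-1} |x_i - x_{i+1}|$$

The KKT conditions:

$$0 = x_i - y_i + \lambda (s_i - s_{i-1}), \quad i = 1, \ldots, n$$

where $s_i \in \partial |x_i - x_{i+1}|$, $i = 1, \ldots, n-1$ (and $s_0 = s_n = 0$).

At $\lambda = 0$, $x^\star(0) = y$. For a small value $\lambda > 0$, consider taking

$$x_i^\star(\lambda) = y_i - \lambda \big( \mathrm{sign}(y_i - y_{i+1}) - \mathrm{sign}(y_{i-1} - y_i) \big), \quad i = 1, \ldots, n$$

where, matching $s_0 = s_n = 0$, the sign terms that would involve $y_0$ or $y_{n+1}$ are taken to be zero.

This satisfies the KKT conditions, and is hence a valid solution at $\lambda$, as long as $\mathrm{sign}(x_i^\star(\lambda) - x_{i+1}^\star(\lambda)) = \mathrm{sign}(y_i - y_{i+1})$ for all $i = 1, \ldots, n-1$.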
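The small-$\lambda$ candidate and its KKT check can be tried out numerically. The sketch below assumes adjacent entries of $y$ are distinct (so every sign is $\pm 1$); the helper names `small_lambda_solution` and `kkt_holds` are hypothetical and introduced only for this illustration.

```python
import numpy as np

def small_lambda_solution(y, lam):
    """Candidate x_i = y_i - lam * (s_i - s_{i-1}) with s_i = sign(y_i - y_{i+1})."""
    y = np.asarray(y, dtype=float)
    s = np.sign(y[:-1] - y[1:])                   # s_1, ..., s_{n-1}
    s_pad = np.concatenate(([0.0], s, [0.0]))     # s_0 = s_n = 0
    x = y - lam * (s_pad[1:] - s_pad[:-1])
    return x, s

def kkt_holds(x, y, lam, s, tol=1e-10):
    """Check stationarity and that each s_i is a subgradient of |x_i - x_{i+1}|."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    s_pad = np.concatenate(([0.0], s, [0.0]))
    stationarity = x - y + lam * (s_pad[1:] - s_pad[:-1])
    d = x[:-1] - x[1:]
    # s_i must equal sign(d_i) where d_i != 0, and lie in [-1, 1] where d_i == 0.
    subgradient_ok = np.where(np.abs(d) > tol, s == np.sign(d), np.abs(s) <= 1)
    return bool(np.allclose(stationarity, 0.0, atol=tol) and np.all(subgradient_ok))
```

For example, with `y = [1, 3, 2, 5]` the candidate passes the check at `lam = 0.1` but fails at `lam = 1.0`, because by then some adjacent coordinates have crossed and the signs no longer match those of $y$.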
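I.e., $x_i^\star(\lambda) = a_i - \lambda \cdot b_i$, a linear function of $\lambda$, for $i = 1, \ldots, n$ (here $a_i = y_i$ and $b_i$ is the difference of signs from the previous formula).

This is a valid solution as long as no adjacent pair $x_i^\star(\lambda)$ and $x_{i+1}^\star(\lambda)$ has crossed. The critical value of $\lambda$ at which the first crossing happens is

$$\lambda_1 = \min_{i = 1, \ldots, n-1} \; \frac{a_i - a_{i+1}}{b_i - b_{i+1}}$$

[Figure: coordinates of $x$ plotted against $\lambda$, for $\lambda$ from 0 to 0.5]

Therefore we have computed the solution path $x^\star(\lambda)$ exactly for all $\lambda \in [0, \lambda_1]$.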
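As an illustration of the crossing-time formula, the following sketch computes $\lambda_1$ directly from $y$. It assumes adjacent entries of $y$ are distinct; pairs whose paths are parallel ($b_i = b_{i+1}$) never cross and are given an infinite crossing time. The name `first_critical_value` is hypothetical.

```python
import numpy as np

def first_critical_value(y):
    """Return (lambda_1, i): the first crossing time and the 0-based index i
    of the pair (i, i + 1) that fuses there."""
    y = np.asarray(y, dtype=float)
    s = np.sign(y[:-1] - y[1:])
    s_pad = np.concatenate(([0.0], s, [0.0]))     # s_0 = s_n = 0
    a = y
    b = s_pad[1:] - s_pad[:-1]                    # b_i = s_i - s_{i-1}
    da = a[:-1] - a[1:]
    db = b[:-1] - b[1:]
    with np.errstate(divide="ignore", invalid="ignore"):
        t = np.where(db != 0, da / db, np.inf)    # parallel paths never cross
    i = int(np.argmin(t))
    return float(t[i]), i
```

With `y = [1, 3, 2, 5]` the three crossing times are 2/3, 1/4 and 1, so the second and third coordinates fuse first, at $\lambda_1 = 0.25$.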
Now invoke our lemma: coordinates 4 and 5 (light blue and green) will remain fused for all $\lambda \geq \lambda_1$.

Hence, we can define the fused groups $g_1 = \{1\}, \ldots, g_3 = \{3\}, g_4 = \{4, 5\}, g_5 = \{6\}, \ldots, g_{n-1} = \{n\}$ and rewrite our problem:

$$\min_{x_{g_1}, \ldots, x_{g_{n-1}}} \; \frac{1}{2} \sum_{i=1}^{n-1} \sum_{j \in g_i} (y_j - x_{g_i})^2 + \lambda \sum_{i=1}^{n-2} |x_{g_i} - x_{g_{i+1}}|$$

and the KKT conditions:

$$0 = |g_i| \cdot x_{g_i} - \sum_{j \in g_i} y_j + \lambda (s_{g_i} - s_{g_{i-1}}), \quad i = 1, \ldots, n-1$$

where $s_{g_i} \in \partial |x_{g_i} - x_{g_{i+1}}|$, $i = 1, \ldots, n-2$ (and, as before, $s_{g_0} = s_{g_{n-1}} = 0$).
We know that these KKT conditions are satisfied at $\lambda_1$ by $x^\star(\lambda_1)$ (simply a reparametrization).

Hence for $\lambda > \lambda_1$, we propose

$$x_{g_i}^\star(\lambda) = a_i - \lambda \cdot b_i, \quad i = 1, \ldots, n-1$$

where

$$a_i = \frac{1}{|g_i|} \sum_{j \in g_i} y_j$$

$$b_i = \frac{1}{|g_i|} \Big[ \mathrm{sign}\big( x_{g_i}^\star(\lambda_1) - x_{g_{i+1}}^\star(\lambda_1) \big) - \mathrm{sign}\big( x_{g_{i-1}}^\star(\lambda_1) - x_{g_i}^\star(\lambda_1) \big) \Big]$$

This will satisfy the KKT conditions, and hence be a valid solution, so long as $\mathrm{sign}(x_{g_i}^\star(\lambda) - x_{g_{i+1}}^\star(\lambda)) = \mathrm{sign}(x_{g_i}^\star(\lambda_1) - x_{g_{i+1}}^\star(\lambda_1))$ for all $i$.
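To see why this proposal satisfies the grouped KKT conditions, here is a short worked substitution. It fixes $s_{g_i} = \mathrm{sign}(x_{g_i}^\star(\lambda_1) - x_{g_{i+1}}^\star(\lambda_1))$, which is a valid subgradient as long as the signs are unchanged, and uses $|g_i| \, a_i = \sum_{j \in g_i} y_j$ and $|g_i| \, b_i = s_{g_i} - s_{g_{i-1}}$:

```latex
\begin{aligned}
|g_i| \, x_{g_i}^\star(\lambda) - \sum_{j \in g_i} y_j + \lambda (s_{g_i} - s_{g_{i-1}})
  &= |g_i| (a_i - \lambda b_i) - \sum_{j \in g_i} y_j + \lambda (s_{g_i} - s_{g_{i-1}}) \\
  &= -\lambda \, |g_i| \, b_i + \lambda (s_{g_i} - s_{g_{i-1}}) \\
  &= 0 .
\end{aligned}
```

So stationarity holds for every $\lambda$, and the only thing that can fail is the sign condition, which is exactly the crossing of two adjacent group paths.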
I.e., this is a valid solution as long as two adjacent coordinate paths don’t cross. The next critical value of $\lambda$ at which this happens is

$$\lambda_2 = \min_{i = 1, \ldots, n-2} \; \frac{a_i - a_{i+1}}{b_i - b_{i+1}}$$

(minimization here is only over values $\geq \lambda_1$)

[Figure: coordinates of $x$ plotted against $\lambda$, for $\lambda$ from 0 to 0.5]

We have now computed the path $x^\star(\lambda)$ exactly for all $\lambda \in [0, \lambda_2]$.

This strategy can be repeated until all coordinates are fused into one single group, which is the end of the path.
Summary of 1d fused lasso path algorithm:
• Start with $\lambda_0 = 0$, $G = n$, $g_i = \{i\}$, $i = 1, \ldots, n$, $x^\star(0) = y$
• For $k = 1, \ldots, n-1$:
  ◮ Compute $a_i = \frac{1}{|g_i|} \sum_{j \in g_i} y_j$ and

    $$b_i = \frac{1}{|g_i|} \Big[ \mathrm{sign}\big( x_{g_i}^\star(\lambda_{k-1}) - x_{g_{i+1}}^\star(\lambda_{k-1}) \big) - \mathrm{sign}\big( x_{g_{i-1}}^\star(\lambda_{k-1}) - x_{g_i}^\star(\lambda_{k-1}) \big) \Big]$$

  ◮ With $x_{g_i}^\star(\lambda) = a_i - \lambda \cdot b_i$, $i = 1, \ldots, G$, increase $\lambda$ until the next critical value:

    $$\lambda_k = \min_{i = 1, \ldots, G-1} \; \frac{a_i - a_{i+1}}{b_i - b_{i+1}}$$

    (minimization is only over values $\geq \lambda_{k-1}$)
  ◮ Merge the appropriate groups, and decrement $G$
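The summary above can be sketched as a short naive implementation. This is an illustrative sketch rather than a reference implementation: it assumes adjacent entries of $y$ are distinct and that fusions happen one at a time (no ties between crossing times), and the name `fused_lasso_path` and its return layout are hypothetical.

```python
import numpy as np

def fused_lasso_path(y):
    """Return (knots, sols) with sols[k] = x*(knots[k]) as a length-n vector."""
    y = np.asarray(y, dtype=float)
    sums = [float(v) for v in y]     # sum of y_j over each group
    sizes = [1] * len(y)             # group sizes |g_i|
    vals = y.copy()                  # group values at the current knot
    lam = 0.0
    knots, sols = [0.0], [y.copy()]
    while len(sizes) > 1:
        size = np.array(sizes, dtype=float)
        a = np.array(sums) / size                       # group means
        s = np.sign(vals[:-1] - vals[1:])               # signs between neighbours
        s_pad = np.concatenate(([0.0], s, [0.0]))       # boundary signs are zero
        b = (s_pad[1:] - s_pad[:-1]) / size
        da, db = a[:-1] - a[1:], b[:-1] - b[1:]
        with np.errstate(divide="ignore", invalid="ignore"):
            t = np.where(db != 0, da / db, np.inf)      # crossing times
        t[t < lam - 1e-12] = np.inf                     # only values >= current knot
        i = int(np.argmin(t))
        lam = max(lam, float(t[i]))
        vals = a - lam * b                              # move every group to the knot
        sums[i] += sums.pop(i + 1)                      # merge groups i and i + 1
        sizes[i] += sizes.pop(i + 1)
        vals = np.delete(vals, i + 1)
        knots.append(lam)
        sols.append(np.repeat(vals, sizes))             # expand groups to coordinates
    return np.array(knots), np.array(sols)
```

For `y = [1, 3, 2, 5]` this produces the knots 0, 0.25, 1.5 and 2.25, and the final solution is the constant vector equal to the mean of `y`, 2.75. Each pass scans all adjacent pairs, which is the naive $O(n)$ per-iteration cost.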
[Figure: coordinates of $x$ plotted against $\lambda$, for $\lambda$ from 0 to 5]
Computational complexity

Naive implementation: each iteration takes $O(n)$ operations (a scan over the crossing times of all adjacent fused groups), and there are $O(n)$ iterations (the number of fused groups decreases by one at each iteration), so the total complexity is $O(n^2)$.

Tree-based implementation (Hoefling, 2009): note that, after a fusion, crossing times only change for the neighbors of the modified group. Therefore we can organize the computation in a tree in which each leaf holds the crossing time for a pair of adjacent groups; after a fusion we can update the tree in a constant number of operations, and finding the next (minimum) crossing time requires $O(\log n)$ operations. Hence $O(n \log n)$ total operations.

Source: Hoefling (2009), A path algorithm for the fused lasso signal approximator
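The idea that only the neighbors of a merged group need new crossing times can be sketched with a priority queue. Note the substitution: this uses Python's `heapq` binary heap with lazy deletion of stale entries, not the tree structure from Hoefling (2009), so each update costs $O(\log n)$ rather than a constant; the total is still $O(n \log n)$. The sketch assumes adjacent entries of $y$ are distinct, and the name `fusion_times` is hypothetical.

```python
import heapq
import numpy as np

def fusion_times(y):
    """Critical values of the 1d fused lasso path, via a heap of crossing times."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    total = [float(v) for v in y]       # sum of y_j over each group
    size = [1] * n                      # group sizes
    left = list(range(-1, n - 1))       # left neighbour (-1 means none)
    right = list(range(1, n + 1))       # right neighbour (n means none)
    alive = [True] * n
    # sgn[i] = sign(x_i - x_right(i)); it stays fixed until the two groups fuse.
    sgn = [float(np.sign(y[i] - y[i + 1])) for i in range(n - 1)] + [0.0]
    heap = []

    def slope(i):
        s_in = sgn[left[i]] if left[i] >= 0 else 0.0
        return (sgn[i] - s_in) / size[i]

    def push(i):
        # Queue the crossing time of group i with its right neighbour, if any.
        if i < 0 or right[i] >= n:
            return
        j = right[i]
        da = total[i] / size[i] - total[j] / size[j]
        db = slope(i) - slope(j)
        if db != 0:
            heapq.heappush(heap, (da / db, i, j, size[i], size[j]))

    for i in range(n - 1):
        push(i)
    knots = []
    while heap:
        t, i, j, si, sj = heapq.heappop(heap)
        fresh = alive[i] and alive[j] and right[i] == j
        if not (fresh and size[i] == si and size[j] == sj):
            continue                    # stale entry: a group changed since the push
        knots.append(t)
        total[i] += total[j]            # group i absorbs group j
        size[i] += size[j]
        alive[j] = False
        sgn[i] = sgn[j]
        right[i] = right[j]
        if right[j] < n:
            left[right[j]] = i
        push(left[i])                   # only the two neighbouring pairs change
        push(i)
    return knots
```

Each fusion pushes at most two new entries, so the heap holds $O(n)$ entries over the whole run. For `y = [1, 3, 2, 5]` the returned critical values are 0.25, 1.5 and 2.25.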
Extensions

Computing the exact solution path in $O(n \log n)$ operations is very fast. Are there extensions beyond the 1d case?

• For the fused lasso over arbitrary graphs (e.g., consider a 2d grid, as in image denoising) the key lemma does not hold: as we increase $\lambda$, fused groups can unfuse. Counterintuitive! We now have to check, at each iteration, for both groups fusing and unfusing; each iteration can be reduced to solving a max flow problem, which is unfortunately more costly
• Alternatively, we could have derived a path algorithm for the dual problem, which is arguably simpler for the fused lasso over an arbitrary graph
• Extensions to the regression loss $\| y - Ax \|^2$ (before we were considering $A = I$) are also possible
Path algorithms in statistics and ML

Exact path algorithms can be derived for, e.g., the lasso, the fused lasso over an arbitrary graph, trend filtering, locally adaptive regression splines, SVMs and kernel SVMs, 1-norm SVMs, and the relaxed maximum entropy problem.

In all these examples, the solution path is piecewise linear in $\lambda$, so the problem reduces to finding the critical values $\lambda_1, \lambda_2, \ldots, \lambda_T$.

Unfortunately, for the majority of the above examples, tight bounds for the number of critical values $T$ are not known.

Empirically, the number of critical values grows large with increasing problem size, so path algorithms are not scalable to huge problems.

Approximate path algorithms can be derived for problems in which the path is not piecewise linear, e.g., lasso GLMs.
Path algorithms and tuning parameter selection

Recall the general problem form

$$\min_{x \in \mathbb{R}^n} \; g(x) + \lambda \cdot h(x)$$

where $\lambda \geq 0$. This parameter balances the effective importance of the two terms, controlling the amount of underfitting or overfitting.

Path algorithms trace out the solution as a function of the tuning parameter $\lambda$; hence they provide a complete description of this tradeoff, and often aid in the statistical understanding of the optimization problem.

On a more practical note, the choice of $\lambda$ is critical for essentially any statistical application; when computable, path algorithms can be helpful for this task.