Path algorithms
Geoff Gordon & Ryan Tibshirani
Optimization 10-725 / 36-725
In this lecture we consider problems of the form

min_{x∈Rn} g(x) + λ · h(x)

where λ ≥ 0. In particular, we'll look at path algorithms, which deliver the entire set {x⋆(λ) : λ ∈ [0, ∞]}, called the solution path. Path algorithms start at one end (either λ = 0 or λ = ∞), where computing the solution is easy, and essentially trace out the solution path by successively satisfying the KKT conditions.

Properties:
- They deliver the exact solution (no iterations, no error bound guarantees needed) at all values of λ
- They provide a useful platform for tuning parameter selection and statistical analysis
1d fused lasso (total variation denoising)
Given y ∈ Rn, consider the 1d fused lasso or 1d total variation denoising problem:

min_{x∈Rn} (1/2) Σ_{i=1}^{n} (yi − xi)² + λ Σ_{i=1}^{n−1} |xi − xi+1|

Example: n = 100; plotted in red is the solution x⋆ at λ = 5. The solution is piecewise constant with adaptively chosen break points; the larger λ, the fewer breaks.

[Figure: noisy data, n = 100, with the piecewise constant solution x⋆ at λ = 5 drawn in red]
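To make the objective concrete, here is a minimal numpy sketch; the function name fused_lasso_objective is ours, not from the lecture:

```python
import numpy as np

def fused_lasso_objective(x, y, lam):
    """1d fused lasso objective: squared-error fit plus a total
    variation penalty on adjacent differences."""
    fit = 0.5 * np.sum((y - x) ** 2)
    tv = np.sum(np.abs(np.diff(x)))   # sum_i |x_i - x_{i+1}|
    return fit + lam * tv
```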
At λ = 0, the solution is simply x⋆(0) = y
[Figure: snapshots of the solution x⋆(λ) on a small example, at λ = 0.100, 0.299, 0.352, 0.460, 0.648, and 1.102; adjacent coordinates successively fuse as λ grows]
Strategy: construct a sequence of λ values λ1 ≤ λ2 ≤ . . . at which adjacent coordinates of x⋆(λ) are equal or “fused” (Hoefling, 2009)
In between these critical values λ1 ≤ λ2 ≤ . . ., the solution x⋆(λ) is simply linear: if λ ∈ [λk, λk+1], then

x⋆(λ) = α · x⋆(λk) + (1 − α) · x⋆(λk+1), where α = (λk+1 − λ)/(λk+1 − λk)

How many critical values are there? Can there be more than n − 1? We will rely on the following useful fact:

Lemma (Friedman et al., 2007): For any coordinate i of the 1d fused lasso solution, if x⋆i(λ) = x⋆i+1(λ), then x⋆i(λ′) = x⋆i+1(λ′) for all λ′ > λ.

I.e., once two coordinates fuse, they can never unfuse. So there are exactly n − 1 critical points
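Because the path is piecewise linear, the critical values and the solutions at them fully determine the path. A minimal sketch, assuming knots holds the sorted critical values (starting at 0) and sols the corresponding solutions; both names are ours:

```python
import bisect
import numpy as np

def path_at(lam, knots, sols):
    """Evaluate the piecewise-linear solution path at an arbitrary lam
    by interpolating between the two bracketing critical values."""
    k = max(bisect.bisect_right(knots, lam) - 1, 0)
    if k >= len(knots) - 1:
        return np.asarray(sols[-1])   # past the last knot the path is constant
    alpha = (knots[k + 1] - lam) / (knots[k + 1] - knots[k])
    return alpha * np.asarray(sols[k]) + (1 - alpha) * np.asarray(sols[k + 1])
```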
Our problem:

min_{x∈Rn} (1/2) Σ_{i=1}^{n} (yi − xi)² + λ Σ_{i=1}^{n−1} |xi − xi+1|

The KKT conditions:

0 = xi − yi + λ(si − si−1), i = 1, . . . n

where si ∈ ∂|xi − xi+1|, i = 1, . . . n − 1 (and s0 = sn = 0)

At λ = 0, x⋆(0) = y. For a small value λ > 0, consider taking

x⋆i(λ) = yi − λ · ( sign(yi − yi+1) − sign(yi−1 − yi) ), i = 1, . . . n

This satisfies the KKT conditions, and is hence a valid solution at λ, as long as sign(x⋆i(λ) − x⋆i+1(λ)) = sign(yi − yi+1) for all i
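In code, the proposed solution and its validity check are only a few lines (a sketch with our own function names, using sign(0) = 0 and the boundary convention s0 = sn = 0):

```python
import numpy as np

def small_lambda_solution(y, lam):
    """x_i(lam) = y_i - lam * (sign(y_i - y_{i+1}) - sign(y_{i-1} - y_i))."""
    y = np.asarray(y, dtype=float)
    s = np.concatenate(([0.0], np.sign(y[:-1] - y[1:]), [0.0]))  # s_0 = s_n = 0
    return y - lam * (s[1:] - s[:-1])

def still_valid(x, y):
    """KKT sign condition: no adjacent pair of coordinates has crossed."""
    return np.all(np.sign(x[:-1] - x[1:]) == np.sign(y[:-1] - y[1:]))
```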
I.e., x⋆i(λ) = ai − λ · bi, a linear function of λ, for i = 1, . . . n. This is a valid solution as long as x⋆i(λ) and x⋆i+1(λ) don't cross for some i

[Figure: coordinates of x⋆(λ) plotted against λ ∈ [0, 0.5]; each coordinate path is linear until two adjacent paths meet]

The critical value of λ at which this happens is

λ1 = min_{i=1,...n−1} (ai − ai+1)/(bi − bi+1)

Therefore we have computed the solution path x⋆(λ) exactly for all λ ∈ [0, λ1]
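Continuing the sketch above, λ1 can be computed directly from the intercepts ai = yi and the slopes bi (again our own code, assuming adjacent entries of y are distinct):

```python
import numpy as np

def first_critical_value(y):
    """Smallest lam > 0 at which two adjacent lines a_i - lam*b_i cross."""
    y = np.asarray(y, dtype=float)
    s = np.concatenate(([0.0], np.sign(y[:-1] - y[1:]), [0.0]))
    b = s[1:] - s[:-1]                  # slopes b_i
    num = y[:-1] - y[1:]                # a_i - a_{i+1}
    den = b[:-1] - b[1:]                # b_i - b_{i+1}
    with np.errstate(divide="ignore", invalid="ignore"):
        cross = num / den
    return cross[(den != 0) & (cross > 0)].min()   # only future crossings count
```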
Now invoke our lemma: coordinates 4 and 5 (light blue and green) will remain fused for all λ ≥ λ1. Hence, we can define the fused groups g1 = {1}, . . . g3 = {3}, g4 = {4, 5}, g5 = {6}, . . . gn−1 = {n}, and rewrite our problem:

min_{xg1,...xgn−1} (1/2) Σ_{i=1}^{n−1} Σ_{j∈gi} (yj − xgi)² + λ Σ_{i=1}^{n−2} |xgi − xgi+1|

and the KKT conditions:

0 = |gi| · xgi − Σ_{j∈gi} yj + λ(sgi − sgi−1), i = 1, . . . n − 1

where sgi ∈ ∂|xgi − xgi+1|, i = 1, . . . n − 2
We know that these KKT conditions are satisfied at λ1 by x⋆(λ1) (simply a reparametrization). Hence for λ > λ1, we propose

x⋆gi(λ) = ai − λ · bi, i = 1, . . . n − 1

where

ai = (1/|gi|) Σ_{j∈gi} yj
bi = (1/|gi|) · ( sign(x⋆gi(λ1) − x⋆gi+1(λ1)) − sign(x⋆gi−1(λ1) − x⋆gi(λ1)) )

This will satisfy the KKT conditions, and hence be a valid solution, so long as sign(x⋆gi(λ) − x⋆gi+1(λ)) = sign(x⋆gi(λ1) − x⋆gi+1(λ1)) for all i
I.e., this is a valid solution as long as two adjacent coordinate paths don't cross

[Figure: coordinates of x⋆(λ) plotted against λ ∈ [0, 0.5], after the first fusion]

The next critical value of λ at which this happens is

λ2 = min_{i=1,...n−2} (ai − ai+1)/(bi − bi+1)

(minimization here is only over values ≥ λ1). Now we have computed the path x⋆(λ) exactly for all λ ∈ [0, λ2]. This strategy can be repeated until all coordinates are fused into one single group, the end of the path
Summary of the 1d fused lasso path algorithm (a Python sketch follows below):

- Start with λ0 = 0, G = n, gi = {i}, i = 1, . . . n, and x⋆(0) = y
- For k = 1, . . . n − 1:
  ◮ Compute ai = (1/|gi|) Σ_{j∈gi} yj and
    bi = (1/|gi|) · ( sign(x⋆gi(λk−1) − x⋆gi+1(λk−1)) − sign(x⋆gi−1(λk−1) − x⋆gi(λk−1)) )
  ◮ With x⋆gi(λ) = ai − λ · bi, i = 1, . . . G, increase λ until the next critical value:
    λk = min_{i=1,...G−1} (ai − ai+1)/(bi − bi+1)
    (minimization is only over values ≥ λk−1)
  ◮ Merge the appropriate groups, and decrement G
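Here is a naive O(n²) version of the whole algorithm in numpy. This is our own illustrative code, not Hoefling's implementation; it assumes generic data (ties in crossing times are handled by merging coinciding groups at the start of each iteration):

```python
import numpy as np

def fused_lasso_path(y, tol=1e-12):
    """Trace the 1d fused lasso solution path by successive fusions.
    Returns the critical values [0, lam_1, ...] and the solutions there;
    the path is linear in between (see path_at above)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    groups = [[i] for i in range(n)]    # fused groups, left to right
    x = list(y)                         # one value per group, at current lam
    lam, knots, sols = 0.0, [0.0], [y.copy()]

    while len(groups) > 1:
        # merge adjacent groups whose values coincide (they just fused)
        i = 0
        while i < len(groups) - 1:
            if abs(x[i] - x[i + 1]) <= tol:
                groups[i] += groups.pop(i + 1)
                x.pop(i + 1)
            else:
                i += 1
        G = len(groups)
        if G == 1:
            break
        a = [y[g].mean() for g in groups]                 # intercepts a_i
        s = [np.sign(x[i] - x[i + 1]) for i in range(G - 1)]
        b = [((s[i] if i < G - 1 else 0.0)
              - (s[i - 1] if i > 0 else 0.0)) / len(groups[i])
             for i in range(G)]                           # x_gi(lam) = a_i - lam*b_i
        # next critical value: first future crossing of adjacent group paths
        lam_next = np.inf
        for i in range(G - 1):
            den = b[i] - b[i + 1]
            if s[i] * den > 0:          # the two paths are approaching
                lam_next = min(lam_next, (a[i] - a[i + 1]) / den)
        if not np.isfinite(lam_next):
            break
        lam = lam_next
        x = [a[i] - lam * b[i] for i in range(G)]
        sol = np.empty(n)
        for val, g in zip(x, groups):
            sol[g] = val
        knots.append(lam)
        sols.append(sol)
    return knots, sols
```

Together with path_at from earlier, fused_lasso_path(y) recovers x⋆(λ) at any λ ≥ 0.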
[Figure: coordinates of x⋆(λ) plotted against λ over the full path, traced until a single group remains]
Computational complexity
Naive implementation: each iteration takes O(n) operations (scan over the crossing times of all adjacent fused groups), and there are O(n) iterations (the number of fused groups decreases by one at each iteration), so the total complexity is O(n²)

Tree-based implementation¹: note that, after a fusion, crossing times only change for neighbors of the modified group. Therefore we can store the solution path in a tree; each leaf tells us the crossing time for an adjacent group. After a fusion we can update the tree in a constant number of operations, and finding the next (minimum) crossing time requires O(log n) operations. Hence O(n log n) total operations (a heap-based alternative is sketched below)

¹ From Hoefling (2009), A path algorithm for the fused lasso signal approximator
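One practical way to get the same O(log n) per-step behavior is a min-heap of crossing times with lazy invalidation: after each fusion, re-push the affected entries and skip stale ones on pop. A generic sketch of that bookkeeping pattern (ours, not Hoefling's tree structure):

```python
import heapq

heap = []        # entries (crossing_time, pair_id, version)
current = {}     # pair_id -> latest version number

def push_crossing(t, pair_id):
    """(Re)schedule a crossing time; older entries for pair_id become stale."""
    current[pair_id] = current.get(pair_id, 0) + 1
    heapq.heappush(heap, (t, pair_id, current[pair_id]))

def pop_next_crossing():
    """Return the smallest still-valid crossing time, skipping stale entries."""
    while heap:
        t, pair_id, version = heapq.heappop(heap)
        if current.get(pair_id) == version:
            return t, pair_id
    return None
```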
Extensions
Computing the exact solution path in O(n log n) operations is very fast. Are there extensions beyond the 1d case?

- For the fused lasso over arbitrary graphs (e.g., a 2d grid, as in image denoising), the key lemma does not hold: as we increase λ, fused groups can unfuse. Counterintuitive! We now have to check, at each iteration, for groups both fusing and unfusing; each iteration can be reduced to solving a max flow problem, which is unfortunately more costly
- Alternatively, we could have derived a path algorithm for the dual problem, which is arguably simpler for the fused lasso over an arbitrary graph
- Extensions to the regression loss ‖y − Ax‖² (before we were considering A = I) are also possible
Path algorithms in statistics and ML
Exact path algorithms can be derived for, e.g., the lasso, the fused lasso over an arbitrary graph, trend filtering, locally adaptive regression splines, SVMs and kernel SVMs, 1-norm SVMs, and the relaxed maximum entropy problem

In all these examples, the solution path is piecewise linear in λ, so the problem reduces to finding the critical values λ1, λ2, . . . λT

Unfortunately, for the majority of the above examples, tight bounds on the number of critical values T are not known. Empirically, the number of critical values grows large with increasing problem size, so path algorithms are not scalable to huge problems

Approximate path algorithms can be derived for problems in which the path is not piecewise linear, e.g., lasso GLMs
Path algorithms and tuning parameter selection
Recall the general problem form

min_{x∈Rn} g(x) + λ · h(x)

where λ ≥ 0. This parameter balances the effective importance of the two terms, controlling the amount of underfitting or overfitting

Path algorithms trace out the solution as a function of the tuning parameter λ; hence they provide a complete description of this tradeoff, and often aid in the statistical understanding of the optimization problem

On a more practical note, the choice of λ is critical for essentially any statistical application; when computable, path algorithms can be helpful for this task
Example: 1d fused lasso problem over various choices of λ
[Figure: four panels showing the 1d fused lasso solution on the n = 100 example at λ = 0.1, 0.5, 4, and 25]
At a high level, here is one such way to do this: start off by assuming that we observe

yi = μi + εi, i = 1, . . . n

where μ ∈ Rn is an unknown signal to be estimated, and the εi are i.i.d. errors with E[εi] = 0 and Var[εi] = σ²

From y, we compute an estimate ŷ; e.g., this can be the solution x⋆ of an optimization problem. Define the associated prediction error

PE(ŷ) = E‖ŷ − y′‖²

where y′ is an i.i.d. copy of y
Note the expansion

E‖ŷ − y′‖² = E‖ŷ − y‖² + 2 Σ_{i=1}^{n} Cov(ŷi, yi)

where the quantity

df(ŷ) = (1/σ²) Σ_{i=1}^{n} Cov(ŷi, yi)

is called the degrees of freedom of ŷ. Think of it as the effective number of parameters used by ŷ

Suppose that we knew an unbiased estimate d̂f(ŷ) of the degrees of freedom of ŷ, i.e., E[d̂f(ŷ)] = df(ŷ). Then

P̂E(ŷ) = ‖ŷ − y‖² + 2σ² d̂f(ŷ)

is an unbiased estimate for PE(ŷ), i.e., E[P̂E(ŷ)] = PE(ŷ)
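The expansion above is stated without proof on the slide; it follows by centering both quantities at the mean μ:

E‖ŷ − y′‖² = E‖ŷ − μ‖² + nσ² (the cross term vanishes since y′ is independent of ŷ)
E‖ŷ − y‖² = E‖ŷ − μ‖² + nσ² − 2 Σ_{i=1}^{n} Cov(ŷi, yi)

and subtracting the second identity from the first.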
Now suppose that our estimate ŷ depends on the tuning parameter λ, written ŷλ. If we could compute d̂f(ŷλ), then we could choose λ to minimize

P̂E(ŷλ) = ‖ŷλ − y‖² + 2σ² d̂f(ŷλ)

our best estimate for the prediction error. Note the two regimes:

- Overfitting: the training error ‖ŷλ − y‖² is small, but the degrees of freedom d̂f(ŷλ) is large
- Underfitting: the degrees of freedom d̂f(ŷλ) is small, but the training error ‖ŷλ − y‖² is large

I.e., choosing λ to minimize P̂E balances performance in the training sample with model complexity

So, how to compute d̂f(ŷλ)? Stein's formula (Stein, 1981) provides a way to do this: under some regularity conditions, we can use the estimate

d̂f(ŷλ) = Σ_{i=1}^{n} ∂ŷλ,i/∂yi
How do path algorithms fit in?
- Because they trace out the exact solution path, it is often easier to compute d̂f(ŷλ), as λ varies, using a path algorithm. In many cases, this presents no extra work
- To choose λ, we are faced with the nonconvex problem

  min_{λ≥0} ‖ŷλ − y‖² + 2σ² d̂f(ŷλ)

  But in many examples, d̂f(ŷλ) is piecewise constant with respect to λ, i.e., constant in between critical values; and ‖ŷλ − y‖² is monotone in between critical values. Therefore our problem reduces to

  min_{λ∈{λ1,...λT}} ‖ŷλ − y‖² + 2σ² d̂f(ŷλ)

  so the minimizing value of λ can be found by simply checking each critical value λi visited by the path algorithm (a sketch follows below)
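For the 1d fused lasso this selection step is a few lines on top of the path output; a sketch using our fused_lasso_path from earlier, with df counting fused groups (as on the next slide):

```python
import numpy as np

def select_lambda(y, knots, sols, sigma2):
    """Choose lam among the critical values by minimizing the unbiased
    prediction error estimate ||y_hat - y||^2 + 2*sigma2*df."""
    best_pe, best_lam = np.inf, None
    for lam, x in zip(knots, sols):
        df = 1 + np.count_nonzero(np.diff(x))   # number of fused groups
        pe = np.sum((x - y) ** 2) + 2 * sigma2 * df
        if pe < best_pe:
            best_pe, best_lam = pe, lam
    return best_lam
```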
Example with the 1d fused lasso: here d̂f(ŷλ) is simply the number of fused groups in ŷλ. Choice of tuning parameter: λ = 1.62

[Figure: left, P̂E, training error, and degrees of freedom plotted against λ (log scale); right, the selected solution at λ = 1.62 on the n = 100 example]
References
- M. Dubiner, M. Gavish, and Y. Singer (2012), The maximum entropy relaxation path
- B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani (2004), Least angle regression
- T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu (2004), The entire regularization path for the support vector machine
- H. Hoefling (2009), A path algorithm for the fused lasso signal approximator
- M. Park and T. Hastie (2006), ℓ1 regularization path algorithm for generalized linear models
- S. Rosset and J. Zhu (2007), Piecewise linear regularized solution paths
- R. J. Tibshirani and J. Taylor (2011), The solution path of the generalized lasso
- J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani (2003), 1-norm support vector machines