 
SDCA-Powered Inexact Dual Augmented Lagrangian Method for Fast CRF Learning
Guillaume Obozinski, Swiss Data Science Center
Joint work with Shell Xu Hu
Imaging and Machine Learning workshop, IHP, April 2nd 2019

Outline
1. Motivation and context
2. Formulation for CRF learning
3. Relaxing and reformulating in the dual
4. Dual augmented Lagrangian formulation and algorithm
5. Convergence results
6. Experiments
7. Conclusions

A motivating example: semantic segmentation
Cityscapes dataset (Cordts et al., 2016)
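
Recent fast algorithms for large sums of functions
$\min_w F(w) + \frac{\lambda}{2}\|w\|_2^2$ with $F(w) = \sum_{s=1}^n F_s(w)$,
and typically $F_s(w) = f_s(w^\top \varphi(x_s)) = \ell(w^\top \varphi(x_s), y_s)$.

Stochastic gradient methods with variance reduction
Iterate: pick $s$ at random and update $w^{t+1} = w^t - \eta\, g^t$, where $g_i^{t-1}$ is the last gradient stored for term $i$ and $\bar g^{t-1} = \frac{1}{n}\sum_{i=1}^n g_i^{t-1}$:
(SVRG) $g^t = \nabla F_s(w^t) - \nabla F_s(\tilde w) + \frac{1}{n}\nabla F(\tilde w)$, where $\tilde w$ is the iterate stored at the start of the current epoch
(SAG) $g^t = \bar g^{t-1} + \frac{1}{n}\big(\nabla F_s(w^t) - g_s^{t-1}\big)$ and $g_s^t = \nabla F_s(w^t)$
(SAGA) $g^t = \nabla F_s(w^t) - g_s^{t-1} + \bar g^{t-1}$ and $g_s^t = \nabla F_s(w^t)$

Stochastic Dual Coordinate Ascent (implicit variance reduction)
$\max_{\alpha_1,\dots,\alpha_n} -\Big[\sum_{s=1}^n f_s^*(\alpha_s) + \frac{1}{2\lambda}\Big\|\sum_{s=1}^n \varphi(x_s)\,\alpha_s\Big\|_2^2\Big]$
Iterate: $\alpha_s^{t+1} \leftarrow \mathrm{Prox}_{\frac{\lambda}{L_s} f_s^*}\Big(\alpha_s^t - \frac{1}{L_s}\varphi(x_s)^\top w^t\Big)$, and $\alpha_i^{t+1} \leftarrow \alpha_i^t$ for all $i \neq s$.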
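To make the SAGA update concrete, here is a minimal sketch on a toy ridge-regularized least-squares problem. The function name `saga_ridge`, the `1/n` normalization of the loss, the step size and the data are illustrative assumptions, not taken from the talk; the regularizer gradient is added exactly rather than stored.

```python
import random

def saga_ridge(X, y, lam=0.1, eta=0.03, epochs=200, seed=0):
    """SAGA for min_w (1/n) sum_s 0.5*(w.x_s - y_s)^2 + (lam/2)*||w||^2 (toy setting)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    # Stored per-sample loss gradients g_s and their running average.
    table = [[0.0] * d for _ in range(n)]
    avg = [0.0] * d
    for _ in range(epochs * n):
        s = rng.randrange(n)
        resid = sum(wj * xj for wj, xj in zip(w, X[s])) - y[s]
        fresh = [resid * xj for xj in X[s]]
        # SAGA direction: fresh gradient - stored gradient + average of stored gradients.
        g = [fresh[j] - table[s][j] + avg[j] + lam * w[j] for j in range(d)]
        w = [w[j] - eta * g[j] for j in range(d)]
        # Keep the average consistent with the updated table entry.
        avg = [avg[j] + (fresh[j] - table[s][j]) / n for j in range(d)]
        table[s] = fresh
    return w

# Hypothetical 1-D data; the closed-form minimizer is sum(x*y) / (sum(x^2) + n*lam).
w_hat = saga_ridge([[1.0], [2.0], [3.0]], [1.0, 2.0, 3.0])
```

The stored-gradient table is what distinguishes SAG/SAGA from SVRG, which instead recomputes a full gradient at a reference point once per epoch.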
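
Variance reduction techniques yield improved rates
$\kappa$: condition number; $d$: ambient dimension
Running times to have $\mathrm{Obj}(w) - \mathrm{Obj}(w^*) \le \varepsilon$:
Stochastic GD: $d\,\kappa\,\frac{1}{\varepsilon}$
GD: $d\,n\,\kappa\,\log\frac{1}{\varepsilon}$
Accelerated GD: $d\,n\,\sqrt{\kappa}\,\log\frac{1}{\varepsilon}$
SAG(A), SVRG, SDCA, MISO: $d\,(n + \kappa)\,\log\frac{1}{\varepsilon}$
Accelerated variants: $d\,(n + \sqrt{n\kappa})\,\log\frac{1}{\varepsilon}$
Exploiting sum structure yields faster algorithms...
$\min_w \sum_{s=1}^n \ell(w^\top \varphi(x_s), y_s) + \frac{\lambda}{2}\|w\|_2^2$
[Figure: graphical model with independent pairs $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$]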

Conditional Random Fields
Input image $x$
Features at pixel $s$: $\varphi_s(x)$
Encoding of class at pixel $s$: $y_s = (y_{s1}, \dots, y_{sK})$ with
◮ $y_{sk} = 1$ if in class $k$
◮ $y_{sk} = 0$ else.
Options:
1. Predict each pixel class individually: multiclass logistic regression
$p(y_s \mid x) \propto \exp\Big(\sum_{k=1}^K y_{sk}\, w_k^\top \varphi_s(x)\Big)$
2. View the image as a grid graph with vertices $\mathcal{V}$ and edges $\mathcal{E}$, and predict all pixel classes jointly while accounting for dependencies: CRF
$p(y_1, \dots, y_S \mid x) \propto \exp\Big(\sum_{s \in \mathcal{V}} \sum_{k=1}^K y_{sk}\, w_{\tau_1,k}^\top \varphi_s(x) + \sum_{\{s,t\} \in \mathcal{E}} \sum_{k,l=1}^K w_{\tau_2,kl}\, y_{sk}\, y_{tl}\Big)$

Trick: log-likelihood as log-partition
$-\log p(y^o \mid x^o) = -\langle w, \phi(y^o, x^o)\rangle + A_{x^o}(w)$
$= -\langle w, \phi(y^o, x^o)\rangle + \log \sum_y \exp\langle w, \phi(y, x^o)\rangle$
$= \log \sum_y \exp\langle w, \phi(y, x^o) - \phi(y^o, x^o)\rangle$
$= \log \sum_y \exp\Big(\sum_{c \in \mathcal{C}} \langle w_{\tau_c}, \phi_c(y_c, x^o) - \phi_c(y_c^o, x^o)\rangle\Big)$
$= \log \sum_y \exp\Big(\sum_{c \in \mathcal{C}} \langle y_c, \theta_{(c)}\rangle\Big)$
with $\theta_{(c)} = \big(\langle w_{\tau_c}, \psi_c(y'_c)\rangle\big)_{y'_c \in \mathcal{Y}_c}$ and $\psi_c(y'_c) := \phi_c(y'_c, x^o) - \phi_c(y_c^o, x^o)$.
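The identity above can be checked by brute force on a tiny model. The sketch below assumes a hypothetical 3-node chain with 2 classes, scalar node features and made-up weights; it enumerates all labelings, which is only feasible at this toy size.

```python
import itertools
import math

# Hypothetical toy CRF: 3 nodes on a chain, K = 2 classes, scalar node features.
feat = [0.5, -1.0, 2.0]
edges = [(0, 1), (1, 2)]
w_unary = [0.3, -0.2]                  # one unary weight per class
w_pair = [[0.4, -0.1], [-0.1, 0.4]]    # pairwise weight per pair of classes

def score(y):
    """<w, phi(y, x)>: sum of unary and pairwise terms for labeling y."""
    unary = sum(w_unary[y[s]] * feat[s] for s in range(len(feat)))
    pairwise = sum(w_pair[y[s]][y[t]] for s, t in edges)
    return unary + pairwise

y_obs = (0, 1, 1)
labelings = list(itertools.product(range(2), repeat=len(feat)))

# Direct form: -<w, phi(y_obs)> + log-partition.
log_Z = math.log(sum(math.exp(score(y)) for y in labelings))
nll_direct = -score(y_obs) + log_Z

# Shifted form: log-sum-exp of score differences, with no separate log-partition term.
nll_shifted = math.log(sum(math.exp(score(y) - score(y_obs)) for y in labelings))
```

Both quantities agree up to floating-point error, which is what allows the negative log-likelihood to be written as a single log-partition-like function of shifted features.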
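
Abstract CRF model
$p(y^o \mid x^o) \propto \exp\Big(\sum_{s \in \mathcal{V}} \sum_{k=1}^K y^o_{sk}\, w_{\tau_1,k}^\top \varphi_s(x^o) + \sum_{\{s,t\} \in \mathcal{E}} \sum_{k,l=1}^K w_{\tau_2,kl}\, y^o_{sk}\, y^o_{tl}\Big)$
$p(y^o \mid x^o) \propto \exp\Big(\sum_{s \in \mathcal{V}} \langle w_{\tau_1}, \phi_s(y^o_s, x^o)\rangle + \sum_{\{s,t\} \in \mathcal{E}} \langle w_{\tau_2}, \phi_{st}(y^o_s, y^o_t, x^o)\rangle\Big)$
Let $\mathcal{C} = \mathcal{V} \cup \mathcal{E}$. Then
$\log p_w(y^o \mid x^o) = \sum_{c \in \mathcal{C}} \langle w_{\tau_c}, \phi_c(x^o, y^o_c)\rangle - \log Z(x^o, w)$,
with $y_{\{s,t\}} = y_s y_t^\top$ and $Z(x^o, w) = \sum_{y_1} \cdots \sum_{y_S} \exp\Big(\sum_{c \in \mathcal{C}} \langle w_{\tau_c}, \phi_c(x^o, y_c)\rangle\Big)$.
In fact
$-\log p_w(y^o \mid x^o) = \log \sum_y \exp\Big(\sum_{c \in \mathcal{C}} \langle w_{\tau_c}, \phi_c(x^o, y_c) - \phi_c(x^o, y^o_c)\rangle\Big)$
$= \log \sum_y \exp\Big(\sum_{c \in \mathcal{C}} \langle \Psi_{(c)}^\top w, y_c\rangle\Big)$
$=: f(\Psi^\top w)$ with $f(\theta) = \log \sum_y \exp\Big(\sum_{c \in \mathcal{C}} \langle \theta_{(c)}, y_c\rangle\Big)$.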

Regularized maximum likelihood estimation
The regularized maximum likelihood estimation problem
$\min_w -\log p_w(y^o \mid x^o) + \frac{\lambda}{2}\|w\|_2^2$
is reformulated as
$\min_w f(\Psi^\top w) + \frac{\lambda}{2}\|w\|_2^2$ with $f(\theta) = \log \sum_y \exp\Big(\sum_{c \in \mathcal{C}} \langle \theta_{(c)}, y_c\rangle\Big)$,
where $f$ is essentially another way of writing the log-partition function $A$.
Major issue: NP-hardness of inference in graphical models
$f$ and its gradient are NP-hard to compute.
⇒ The maximum likelihood estimator is intractable.
$f$ or $\nabla f$ can be estimated using MCMC methods to perform approximate inference.
Approximate inference can also be cast as an optimization problem and solved with variational methods.

Compare with the “disconnected graph” case
$\min_w -\sum_{s=1}^S \log p_w(y^o_s \mid x^o) + \frac{\lambda}{2}\|w\|_2^2$
$\min_w \sum_{s=1}^S f_s(\psi_s^\top w) + \frac{\lambda}{2}\|w\|_2^2$ with $f_s(\theta_{(s)}) := \log \sum_{y_s} \exp\langle \theta_{(s)}, y_s\rangle$.
$f_s$ is easy to compute: it is the log of a sum of $K$ terms.
The objective is a sum of a large number of terms
⇒ Very fast randomized algorithms can be used to solve this problem:
SAG: Le Roux et al. (2012)
SVRG: Johnson and Zhang (2013)
SAGA: Defazio et al. (2014), etc.
SDCA: Shalev-Shwartz and Zhang (2016)
$\max_{\alpha_1,\dots,\alpha_S} -\Big[\sum_{s=1}^S f_s^*(\alpha_s) + \frac{1}{2\lambda}\Big\|\sum_{s=1}^S \psi_s \alpha_s\Big\|_2^2\Big]$
Could we do the same for CRFs? With SDCA?

Fenchel conjugate of the log-partition function
$f(\theta) := \log \sum_y \exp\Big(\sum_{c \in \mathcal{C}} \langle \theta_{(c)}, y_c\rangle\Big) = \max_{\mu \in \mathcal{M}} \langle \mu, \theta\rangle + H_{\mathrm{Shannon}}(\mu)$
The marginal polytope $\mathcal{M}$ is the set of all realizable moment vectors:
$\mathcal{M} := \big\{\mu = (\mu_c)_{c \in \mathcal{C}} \mid \exists\, Y \text{ s.t. } \forall c \in \mathcal{C},\ \mu_c = \mathbb{E}[Y_c]\big\}$.
$H_{\mathrm{Shannon}}(\mu)$ is the Shannon entropy of the maximum entropy distribution with moments $\mu$.
$P^\#(w) := f(\Psi^\top w) + \frac{\lambda}{2}\|w\|_2^2$
$D^\#(\mu) := H_{\mathrm{Shannon}}(\mu) - \iota_{\mathcal{M}}(\mu) - \frac{1}{2\lambda}\|\Psi\mu\|_2^2$
$\min_w P^\#(w)$ and $\max_\mu D^\#(\mu)$ form a pair of primal and dual optimization problems.
Both $H_{\mathrm{Shannon}}$ and $\mathcal{M}$ are intractable → NP-hard problem in general
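For a single node, where the marginal polytope reduces to the probability simplex, the variational identity can be verified numerically: the maximizer is the softmax of $\theta$ and the maximum equals the log-sum-exp. The following sketch uses made-up values of `theta` and illustrative helper names.

```python
import math

def logsumexp(theta):
    """Numerically stable log(sum_k exp(theta_k))."""
    m = max(theta)
    return m + math.log(sum(math.exp(t - m) for t in theta))

def softmax(theta):
    lse = logsumexp(theta)
    return [math.exp(t - lse) for t in theta]

def entropy(mu):
    """Shannon entropy of a discrete distribution (0 log 0 treated as 0)."""
    return -sum(p * math.log(p) for p in mu if p > 0)

def variational_value(mu, theta):
    """<mu, theta> + H(mu), the objective maximized over the simplex."""
    return sum(p * t for p, t in zip(mu, theta)) + entropy(mu)

theta = [1.0, -0.5, 2.0]          # hypothetical natural parameters for one node
mu_star = softmax(theta)           # candidate maximizer
value_at_softmax = variational_value(mu_star, theta)
value_at_uniform = variational_value([1.0 / 3] * 3, theta)
```

Here `value_at_softmax` matches `logsumexp(theta)`, while any other point of the simplex, such as the uniform distribution, gives a smaller value. For a general graph the same statement holds over $\mathcal{M}$, but neither the entropy nor the polytope is tractable.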
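
Relaxing the marginal polytope into the local polytope
A classical relaxation for $\mathcal{M}$: the local polytope $\mathcal{L}$
For $\mathcal{C} = \mathcal{E} \cup \mathcal{V}$, node and edge simplex constraints:
$\forall s \in \mathcal{V},\quad \triangle_s := \big\{\mu_s \in \mathbb{R}_+^{K} \mid \mu_s^\top \mathbf{1} = 1\big\}$
$\forall \{s,t\} \in \mathcal{E},\quad \triangle_{\{s,t\}} := \big\{\mu_{st} \in \mathbb{R}_+^{K \times K} \mid \mathbf{1}^\top \mu_{st} \mathbf{1} = 1\big\}$
$\mathcal{I} := \big\{\mu = (\mu_c)_{c \in \mathcal{C}} \mid \forall c \in \mathcal{C},\ \mu_c \in \triangle_c\big\}$
$\mathcal{L} := \big\{\mu \in \mathcal{I} \mid \forall \{s,t\} \in \mathcal{E},\ \mu_{st}\mathbf{1} = \mu_s,\ \mu_{st}^\top \mathbf{1} = \mu_t\big\}$
$\mathcal{L} = \mathcal{I} \cap \{\mu \mid A\mu = 0\}$ for an appropriate definition of $A$...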
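The local polytope constraints are simple to check directly: each node marginal lies in a simplex, and each edge pseudo-marginal is nonnegative with row and column sums matching the two node marginals. The function name and the dictionary-based data layout below are illustrative assumptions.

```python
def in_local_polytope(mu_nodes, mu_edges, tol=1e-9):
    """Check membership in the local polytope.

    mu_nodes: {s: list of K floats}
    mu_edges: {(s, t): K x K nested list}, rows indexed by the class of s.
    """
    for mu_s in mu_nodes.values():
        if min(mu_s) < -tol or abs(sum(mu_s) - 1.0) > tol:
            return False
    for (s, t), m in mu_edges.items():
        if min(min(row) for row in m) < -tol:
            return False
        row_sums = [sum(row) for row in m]          # mu_st 1, must equal mu_s
        col_sums = [sum(col) for col in zip(*m)]    # mu_st^T 1, must equal mu_t
        if any(abs(a - b) > tol for a, b in zip(row_sums, mu_nodes[s])):
            return False
        if any(abs(a - b) > tol for a, b in zip(col_sums, mu_nodes[t])):
            return False
    return True

nodes = {0: [0.5, 0.5], 1: [0.5, 0.5]}
consistent = in_local_polytope(nodes, {(0, 1): [[0.5, 0.0], [0.0, 0.5]]})
inconsistent = in_local_polytope(nodes, {(0, 1): [[0.5, 0.5], [0.0, 0.0]]})
```

The marginalization checks are exactly the linear constraints collected in $A\mu = 0$; the edge normalization $\mathbf{1}^\top \mu_{st} \mathbf{1} = 1$ then follows from the node simplex constraint.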

Surrogates for the entropy
Various entropy surrogates exist, e.g.:
◮ Bethe entropy (non-concave),
◮ Tree-reweighted entropy (TRW) (concave on $\mathcal{L}$ but not on $\mathcal{I}$)
Separable surrogates $H_{\mathrm{approx}}$
We consider surrogates of the form $H_{\mathrm{approx}}(\mu) = \sum_{c \in \mathcal{C}} h_c(\mu_c)$, such that
◮ each function $h_c$ is smooth (i.e. has Lipschitz gradients) and concave on $\triangle_c$,
◮ $H_{\mathrm{approx}}$ is strongly concave on $\mathcal{L}$.
In particular we propose to use the Gini entropy, $h_c(\mu_c) = 1 - \|\mu_c\|_F^2$, a quadratic counterpart of the oriented tree-reweighted entropy.
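One practical consequence of the Gini entropy is that the per-clique maximization $\max_{\mu_c \in \triangle_c} \langle \theta, \mu_c\rangle + 1 - \|\mu_c\|^2$ has a simple solution: completing the square gives $\langle \theta, \mu\rangle - \|\mu\|^2 = \|\theta\|^2/4 - \|\mu - \theta/2\|^2$, so the maximizer is the Euclidean projection of $\theta/2$ onto the simplex. The sketch below uses the standard sort-based projection; this routine and the example values are assumptions for illustration, not taken from the slides.

```python
def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0:      # holds exactly for the first rho sorted entries
            tau = t
    return [max(x - tau, 0.0) for x in v]

def gini_objective(mu, theta):
    """<theta, mu> + h(mu) with the Gini entropy h(mu) = 1 - ||mu||^2."""
    return sum(t * m for t, m in zip(theta, mu)) + 1.0 - sum(m * m for m in mu)

theta = [1.0, 0.6, -2.0]                             # hypothetical clique parameters
mu_gini = project_simplex([t / 2.0 for t in theta])  # maximizer over the simplex
best = gini_objective(mu_gini, theta)
at_vertex = gini_objective([1.0, 0.0, 0.0], theta)
at_uniform = gini_objective([1.0 / 3] * 3, theta)
```

Unlike the softmax obtained with the Shannon entropy, this maximizer can be sparse: in the example the third class receives exactly zero mass.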
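
Relaxed dual problem
$\mathcal{M}$ → relax to → $\mathcal{L} = \mathcal{I} \cap \{\mu \mid A\mu = 0\}$
$H_{\mathrm{Shannon}}$ → relax to → $H_{\mathrm{approx}}(\mu) := \sum_{c \in \mathcal{C}} h_c(\mu_c)$
Problem relaxation
$D^\#(\mu) := H_{\mathrm{Shannon}}(\mu) - \iota_{\mathcal{M}}(\mu) - \frac{1}{2\lambda}\|\Psi\mu\|_2^2$
↓ relax to
$D(\mu) := H_{\mathrm{approx}}(\mu) - \iota_{\mathcal{I}}(\mu) - \iota_{\{A\mu = 0\}}(\mu) - \frac{1}{2\lambda}\|\Psi\mu\|_2^2$
so that with $f_c^*(\mu_c) := -h_c(\mu_c) + \iota_{\triangle_c}(\mu_c)$ and $g^*(\mu) = \frac{1}{2\lambda}\|\Psi\mu\|_2^2$
we have $D(\mu) = -\sum_{c \in \mathcal{C}} f_c^*(\mu_c) - g^*(\mu) - \iota_{\{A\mu = 0\}}(\mu)$.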

A dual augmented Lagrangian formulation
$D(\mu) = -\sum_{c \in \mathcal{C}} f_c^*(\mu_c) - g^*(\mu) - \iota_{\{A\mu = 0\}}(\mu)$
Idea: without the linear constraint, we could exploit the form of the objective to use a fast algorithm such as stochastic dual coordinate ascent.
$D_\rho(\mu, \xi) = -\sum_{c \in \mathcal{C}} f_c^*(\mu_c) - g^*(\mu) - \langle \xi, A\mu\rangle - \frac{1}{2\rho}\|A\mu\|_2^2$
By strong duality, we need to solve $\min_\xi d(\xi)$ with $d(\xi) := \max_\mu D_\rho(\mu, \xi)$.
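
The algorithm
Need to solve $\min_\xi d(\xi)$ with $d(\xi) := \max_\mu D_\rho(\mu, \xi)$,
with $D_\rho(\mu, \xi) = -\sum_{c \in \mathcal{C}} f_c^*(\mu_c) - g^*(\mu) - \langle \xi, A\mu\rangle - \frac{1}{2\rho}\|A\mu\|_2^2$.
Note that, since $D_\rho$ depends on $\xi$ only through $-\langle \xi, A\mu\rangle$, we have $\nabla d(\xi) = -A\mu_\xi$ with $\mu_\xi = \arg\max_\mu D_\rho(\mu, \xi)$.
Combining an inexact dual augmented Lagrangian method with a subsolver $\mathcal{A}$
At epoch $t$:
◮ Maximize $D_\rho$ partially w.r.t. $\mu$ using a fixed number of steps of a (stochastic) linearly convergent algorithm $\mathcal{A}$ to get $\hat\mu^t$ from $\hat\mu^{t-1}$.
◮ Take an inexact gradient step on $d$, using $-A\hat\mu^t$ in place of $\nabla d(\xi^t)$: $\xi^{t+1} = \xi^t + \frac{1}{L} A\hat\mu^t$.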
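The two-level scheme can be sketched on a toy problem. The example below is an assumption-laden illustration, not the method of the talk: it replaces $-\sum_c f_c^*(\mu_c) - g^*(\mu)$ by the quadratic $-\frac{1}{2}\|\mu - b\|^2$ in two dimensions, takes $A\mu = \mu_1 - \mu_2$, uses plain gradient ascent as a stand-in for the SDCA subsolver, and sets $L = \rho$, which is a valid smoothness bound for $d$ in this toy case. With the $-\langle \xi, A\mu\rangle$ convention of $D_\rho$, a descent step on $d$ moves $\xi$ along $+A\hat\mu$.

```python
def inexact_dual_al(b, rho=1.0, epochs=50, inner_steps=5):
    """Toy inexact dual augmented Lagrangian loop.

    Maximizes -0.5*||mu - b||^2 subject to mu[0] - mu[1] = 0, using
    D_rho(mu, xi) = -0.5*||mu - b||^2 - xi*(A mu) - (A mu)^2 / (2*rho).
    """
    a = (1.0, -1.0)                          # A mu = a[0]*mu[0] + a[1]*mu[1]
    mu = [0.0, 0.0]
    xi = 0.0
    inner_step = 1.0 / (1.0 + 2.0 / rho)     # 1 / smoothness of D_rho in mu
    for _ in range(epochs):
        # Partial maximization in mu, warm-started from the previous epoch.
        for _ in range(inner_steps):
            r = mu[0] - mu[1]
            grad = [-(mu[i] - b[i]) - xi * a[i] - (r / rho) * a[i] for i in range(2)]
            mu = [mu[i] + inner_step * grad[i] for i in range(2)]
        # Inexact gradient step on d with step 1/L, L = rho (toy assumption).
        xi = xi + (mu[0] - mu[1]) / rho
    return mu, xi

mu_hat, xi_hat = inexact_dual_al(b=(1.0, 3.0))
```

For `b = (1, 3)` the constrained maximizer is the average `(2, 2)` with multiplier `-1`. The warm start is the important design choice: each epoch only needs a few inner steps because the previous $\hat\mu$ is already close to the new partial maximizer.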