SDCA-Powered Inexact Dual Augmented Lagrangian Method for Fast CRF Learning
Guillaume Obozinski
Swiss Data Science Center
Joint work with Shell Xu Hu
Imaging and Machine Learning workshop, IHP, April 2nd 2019
Outline
1. Motivation and context
2. Formulation for CRF learning
3. Relaxing and reformulating in the dual
4. Dual augmented Lagrangian formulation and algorithm
5. Convergence results
6. Experiments
7. Conclusions
[Figure: Cityscapes dataset (Cordts et al., 2016)]
$$\min_w\; F(w) + \frac{\lambda}{2}\|w\|_2^2 \qquad\text{with}\qquad F(w) = \sum_{s=1}^n F_s(w)$$
and typically $F_s(w) = f_s(w^\top \phi(x_s)) = \ell(w^\top \phi(x_s), y_s)$.
Stochastic gradient methods with variance reduction

Iterate: pick $s$ at random and update $w^{t+1} = w^t - \eta\, g^t$ with

(SVRG) $g^t = \nabla F_s(w^t) - \nabla F_s(\tilde w) + \frac{1}{n}\nabla F(\tilde w)$, with $\tilde w = w_{\text{epoch}}$

(SAG) $g^t = \frac{1}{n}\big(\nabla F_s(w^t) - g^{t-1}_s\big) + \bar g^{\,t-1}$ and $g^t_s = \nabla F_s(w^t)$

(SAGA) $g^t = \nabla F_s(w^t) - g^{t-1}_s + \frac{1}{n}\sum_i g^{t-1}_i$ and $g^t_s = \nabla F_s(w^t)$
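As an illustration of these update rules (a minimal sketch of my own, not code from the talk), here is a SAGA-style epoch for a generic finite sum $F(w) = \sum_s F_s(w)$, where `grad_Fs(w, s)` is a hypothetical per-term gradient oracle and `table` stores the last gradient computed for each term:

```python
import numpy as np

def saga_epoch(w, grad_Fs, n, eta, table):
    """One pass of SAGA-style updates; table has shape (n, dim)."""
    table_mean = table.mean(axis=0)              # average of the stored gradients
    for _ in range(n):
        s = np.random.randint(n)
        g_new = grad_Fs(w, s)
        g = g_new - table[s] + table_mean        # variance-reduced direction
        w = w - eta * g
        table_mean += (g_new - table[s]) / n     # keep the running average in sync
        table[s] = g_new
    return w, table
```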
Stochastic Dual Coordinate Ascent (implicit variance reduction)

$$\max_{\alpha_1,\dots,\alpha_n}\; -\sum_{s=1}^n f^*_s(\alpha_s) - \frac{1}{2\lambda}\Big\|\sum_{s=1}^n \phi(x_s)\,\alpha_s\Big\|_2^2$$

Iterate: pick $s$ at random and update
$$\alpha^{t+1}_s \leftarrow \mathrm{Prox}_{\frac{\lambda}{L_s} f^*_s}\Big(\alpha^t_s - \tfrac{1}{L_s}\,\phi(x_s)^\top w^t\Big), \qquad \alpha^{t+1}_i \leftarrow \alpha^t_i, \;\; \forall i \neq s.$$
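For concreteness (a sketch of my own under one sign/scaling convention, not the talk's code), an SDCA epoch for scalar dual variables could look as follows, where `prox_fstar(s, z, tau)` is a hypothetical callback returning $\mathrm{Prox}_{\tau f^*_s}(z)$ and $w$ is kept equal to $-\frac{1}{\lambda}\sum_s \phi(x_s)\alpha_s$:

```python
import numpy as np

def sdca_epoch(alpha, w, phi, lam, L, prox_fstar):
    """One pass of SDCA updates; phi[s] is the feature vector of example s."""
    n = len(alpha)
    for _ in range(n):
        s = np.random.randint(n)
        z = alpha[s] - phi[s].dot(w) / L[s]         # gradient-like step on coordinate s
        alpha_new = prox_fstar(s, z, lam / L[s])    # proximal step on f_s^*
        w -= phi[s] * (alpha_new - alpha[s]) / lam  # maintain w = -(1/lam) * sum_s phi(x_s) alpha_s
        alpha[s] = alpha_new
    return alpha, w
```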
$\kappa$: condition number, $d$: ambient dimension. Running times to have $\mathrm{Obj}(w) - \mathrm{Obj}(w^*) \le \varepsilon$:

Stochastic GD: $d\,\kappa\,\frac{1}{\varepsilon}$
GD: $d\, n\kappa \log\frac{1}{\varepsilon}$
Accelerated GD: $d\, n\sqrt{\kappa} \log\frac{1}{\varepsilon}$
SAG(A), SVRG, SDCA, MISO: $d\,(n + \kappa)\log\frac{1}{\varepsilon}$
Accelerated variants: $d\,(n + \sqrt{n\kappa})\log\frac{1}{\varepsilon}$
Exploiting the sum structure yields faster algorithms for
$$\min_w \sum_{s=1}^n \ell(w^\top\phi(x_s), y_s) + \frac{\lambda}{2}\|w\|_2^2.$$
[Figure: graphical model over outputs $y_1, y_2, \dots, y_n$ with inputs $x_1, x_2, \dots, x_n$]
Input image $x$. Features at pixel $s$: $\phi_s(x)$. Encoding of the class at pixel $s$: $y_s = (y_{s1}, \dots, y_{sK})$ with $y_{sk} = 1$ if pixel $s$ is in class $k$ and $y_{sk} = 0$ otherwise.

Options:
1. Predict each pixel class individually: multiclass logistic regression
$$p(y_s \mid x) \propto \exp\Big(\sum_{k=1}^K y_{sk}\, w_k^\top \phi_s(x)\Big)$$
2. Predict all pixel classes jointly while accounting for dependencies: CRF
$$p(y_1, \dots, y_S \mid x) \propto \exp\Big(\sum_{s\in V}\sum_{k=1}^K y_{sk}\, w_{\tau_1,k}^\top \phi_s(x) + \sum_{\{s,t\}\in E}\sum_{k,l=1}^K w_{\tau_2,kl}\, y_{sk}\, y_{tl}\Big)$$
(a toy computation of the CRF score is sketched right after this list)
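Toy sketch (my own naming, not the talk's code) of the score inside the exponential of option 2, assuming a dense one-hot label array `Y` of shape (S, K), per-pixel features stacked as rows of `Phi` (shape (S, d)), `w1` of shape (d, K), `w2` of shape (K, K), and an edge list `edges`:

```python
import numpy as np

def crf_score(Y, Phi, edges, w1, w2):
    """sum_s sum_k y_sk w1_k^T phi_s + sum_{(s,t) in E} sum_{k,l} w2_kl y_sk y_tl."""
    unary = np.sum(Y * (Phi @ w1))                      # unary (data) terms
    pairwise = sum(Y[s] @ w2 @ Y[t] for s, t in edges)  # pairwise coupling terms
    return unary + pairwise
```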
$$-\log p(y^o \mid x^o) = -\langle w, \phi(y^o, x^o)\rangle + A_{x^o}(w) = -\langle w, \phi(y^o, x^o)\rangle + \log \sum_y \exp\langle w, \phi(y, x^o)\rangle$$
$$= \log \sum_y \exp\langle w, \phi(y, x^o) - \phi(y^o, x^o)\rangle = \log \sum_y \exp \sum_{c\in\mathcal{C}} \langle w_{\tau_c},\, \phi_c(y_c, x^o) - \phi_c(y^o_c, x^o)\rangle$$
$$= \log \sum_y \exp \sum_{c\in\mathcal{C}} \langle y_c, \theta_{(c)}\rangle, \qquad \theta_{(c)} = \big(\langle w_{\tau_c},\, \phi_c(y_c, x^o) - \phi_c(y^o_c, x^o)\rangle\big)_{y_c \in \mathcal{Y}_c}.$$
$$p(y^o \mid x^o) \propto \exp\Big(\sum_{s\in V}\sum_{k=1}^K y^o_{sk}\, w_{\tau_1,k}^\top \phi_s(x^o) + \sum_{\{s,t\}\in E}\sum_{k,l=1}^K w_{\tau_2,kl}\, y^o_{sk}\, y^o_{tl}\Big)$$
$$= \exp\Big(\sum_{s\in V} \langle w_{\tau_1}, \phi_s(y^o_s, x^o)\rangle + \sum_{\{s,t\}\in E} \langle w_{\tau_2}, \phi_{st}(y^o_s, y^o_t, x^o)\rangle\Big),$$
so that
$$\log p_w(y^o \mid x^o) = \sum_{c\in\mathcal{C}} \langle w_{\tau_c}, \phi_c(x^o, y^o_c)\rangle - \log Z(x^o, w),$$
with $y_{\{s,t\}} = y_s y_t^\top$ and $Z(x^o, w) = \sum_{y_1}\cdots\sum_{y_S} \exp \sum_{c\in\mathcal{C}} \langle w_{\tau_c}, \phi_c(x^o, y_c)\rangle$, and therefore
$$-\log p_w(y^o \mid x^o) = \log \sum_y \exp \sum_{c\in\mathcal{C}} \langle w_{\tau_c},\, \phi_c(x^o, y_c) - \phi_c(x^o, y^o_c)\rangle = \log \sum_y \exp \sum_{c\in\mathcal{C}} \langle \Psi_{(c)}^\top w,\, y_c\rangle =: f(\Psi^\top w) = \log \sum_y \exp \sum_{c\in\mathcal{C}} \langle \theta_{(c)}, y_c\rangle.$$
The regularized maximum likelihood estimation problem
$$\min_w\; -\log p_w(y^o \mid x^o) + \frac{\lambda}{2}\|w\|_2^2$$
is reformulated as
$$\min_w\; f(\Psi^\top w) + \frac{\lambda}{2}\|w\|_2^2 \qquad\text{with}\qquad f(\theta) = \log \sum_y \exp \sum_{c\in\mathcal{C}} \langle \theta_{(c)}, y_c\rangle;$$
$f$ is essentially another way of writing the log-partition function $A$.
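To make $f$ concrete, and to see why it becomes the bottleneck, here is a brute-force evaluation of $f(\theta) = \log\sum_y \exp\sum_c \langle\theta_{(c)}, y_c\rangle$ for a tiny pairwise model (a toy sketch of my own; it enumerates all $K^S$ labelings, which is exactly what becomes intractable on real graphs):

```python
import itertools
import numpy as np

def f_bruteforce(theta_nodes, theta_edges, edges, K):
    """theta_nodes: (S, K) array; theta_edges: dict {(s, t): (K, K) array}."""
    S = theta_nodes.shape[0]
    scores = []
    for y in itertools.product(range(K), repeat=S):      # all K**S labelings
        score = sum(theta_nodes[s, y[s]] for s in range(S))
        score += sum(theta_edges[(s, t)][y[s], y[t]] for s, t in edges)
        scores.append(score)
    m = max(scores)                                       # stable log-sum-exp
    return m + np.log(sum(np.exp(v - m) for v in scores))
```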
Major issue: NP-hardness of inference in graphical models

$f$ and its gradient are NP-hard to compute ⇒ the maximum likelihood estimator is intractable. $f$ or $\nabla f$ can be estimated with MCMC methods that perform approximate inference. Approximate inference can also be cast as an optimization problem and tackled with variational methods.
For the pixel-wise model (option 1 above), the learning problem
$$\min_w\; \sum_{s=1}^S -\log p_w(y^o_s \mid x^o) + \frac{\lambda}{2}\|w\|_2^2$$
takes the form
$$\min_w\; \sum_{s=1}^S f_s(\psi_s^\top w) + \frac{\lambda}{2}\|w\|_2^2 \qquad\text{with}\qquad f_s(\theta_{(s)}) := \log\sum_{y_s}\exp\langle\theta_{(s)}, y_s\rangle.$$
$f_s$ is easy to compute: it is a sum of $K$ terms. The objective is a sum of a large number of terms ⇒ very fast randomized algorithms can be used to solve this problem:
SAG (Roux et al., 2012), SVRG (Johnson and Zhang, 2013), SAGA (Defazio et al., 2014), etc., SDCA (Shalev-Shwartz and Zhang, 2016).
$$\max_{\alpha_1,\dots,\alpha_S}\; -\sum_{s=1}^S f^*_s(\alpha_s) - \frac{1}{2\lambda}\Big\|\sum_{s=1}^S \psi_s\,\alpha_s\Big\|_2^2$$
Could we do the same for CRFs? With SDCA?
$$f(\theta) := \log \sum_y \exp \sum_{c\in\mathcal{C}} \langle \theta_{(c)}, y_c\rangle = \max_{\mu \in \mathcal{M}}\; \langle \mu, \theta\rangle + H_{\text{Shannon}}(\mu),$$
where the marginal polytope $\mathcal{M}$ is the set of all realizable moment vectors,
$$\mathcal{M} := \big\{\mu \;\big|\; \exists\, \text{a distribution on } \mathcal{Y} \text{ s.t. } \forall c \in \mathcal{C},\; \mu_c = \mathbb{E}[Y_c]\big\},$$
and $H_{\text{Shannon}}(\mu)$ is the Shannon entropy of the maximum-entropy distribution with moments $\mu$. With
$$P^\#(w) := f(\Psi^\top w) + \frac{\lambda}{2}\|w\|_2^2, \qquad D^\#(\mu) := H_{\text{Shannon}}(\mu) - \iota_{\mathcal{M}}(\mu) - \frac{1}{2\lambda}\|\Psi\mu\|_2^2,$$
$\min_w P^\#(w)$ and $\max_\mu D^\#(\mu)$ form a pair of primal and dual optimization problems. Both $H_{\text{Shannon}}$ and $\mathcal{M}$ are intractable → NP-hard problem in general.
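As a sanity check on this variational identity (a standard fact, not specific to the talk), for a single node with $K$ classes the marginal polytope is just the probability simplex and the identity reduces to the classical conjugacy between log-sum-exp and the negative Shannon entropy:
$$\log\sum_{k=1}^K e^{\theta_k} \;=\; \max_{\mu \in \triangle_K}\; \langle\mu, \theta\rangle - \sum_{k=1}^K \mu_k\log\mu_k, \qquad \text{attained at } \mu_k = \frac{e^{\theta_k}}{\sum_l e^{\theta_l}}.$$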
A classical relaxation for $\mathcal{M}$: the local polytope $\mathcal{L}$

For $\mathcal{C} = E \cup V$, node and edge simplex constraints:
$$\forall s \in V,\; \triangle_s := \{\mu_s \in \mathbb{R}^K_+ \mid \mu_s^\top \mathbf{1} = 1\}, \qquad \forall \{s,t\} \in E,\; \triangle_{\{s,t\}} := \{\mu_{st} \in \mathbb{R}^{K\times K}_+ \mid \mathbf{1}^\top \mu_{st}\, \mathbf{1} = 1\},$$
$$\mathcal{I} := \{\mu = (\mu_c)_{c\in\mathcal{C}} \mid \mu_c \in \triangle_c\}, \qquad \mathcal{L} := \{\mu \in \mathcal{I} \mid \forall \{s,t\} \in E,\; \mu_{st}\mathbf{1} = \mu_s,\; \mu_{st}^\top \mathbf{1} = \mu_t\} = \mathcal{I} \cap \{\mu \mid A\mu = 0\},$$
for an appropriate definition of $A$ (one possible construction is sketched below).
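The talk leaves $A$ abstract; as an illustration only (my own ordering and construction, not necessarily the paper's), here is one way to build a matrix $A$ such that $A\mu = 0$ encodes the marginalization constraints above, when $\mu$ stacks the $S$ node blocks of size $K$ first and then one row-major $K\times K$ block per edge:

```python
import numpy as np

def marginalization_matrix(S, K, edges):
    """A with A mu = 0 iff mu_st 1 = mu_s and mu_st^T 1 = mu_t for every edge {s, t}."""
    n_mu = S * K + len(edges) * K * K
    rows = []
    for e_idx, (s, t) in enumerate(edges):
        off = S * K + e_idx * K * K          # start of this edge's K*K block
        for k in range(K):
            r = np.zeros(n_mu)               # row constraint: sum_l mu_st[k, l] = mu_s[k]
            r[off + k * K: off + (k + 1) * K] = 1.0
            r[s * K + k] = -1.0
            rows.append(r)
            r = np.zeros(n_mu)               # column constraint: sum_l mu_st[l, k] = mu_t[k]
            r[off + k: off + K * K: K] = 1.0
            r[t * K + k] = -1.0
            rows.append(r)
    return np.vstack(rows)
```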
Various entropy surrogates exist, e.g. the Bethe entropy (nonconvex) and the tree-reweighted entropy (TRW) (concave on $\mathcal{L}$ but not on $\mathcal{I}$).

Separable surrogates $H_{\text{approx}}$: we consider surrogates of the form $H_{\text{approx}}(\mu) = \sum_{c} h_c(\mu_c)$, such that each function $h_c$ is smooth (i.e. has Lipschitz gradients) and concave on $\triangle_c$, and $H_{\text{approx}}$ is strongly concave on $\mathcal{L}$. In particular we propose to use the Gini entropy $h_c(\mu_c) = 1 - \|\mu_c\|_F^2$, a quadratic counterpart of the oriented tree-reweighted entropy.
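To see why this quadratic, separable surrogate keeps the per-clique subproblems cheap, here is a small computation (my own sketch; the actual inner updates in the paper may involve additional linear and quadratic coupling terms): with $h_c(\mu_c) = 1 - \|\mu_c\|_F^2$, the term $-h_c + \iota_{\triangle_c}$ (denoted $f^*_c$ below) has a proximal operator that reduces to a Euclidean projection onto the simplex,
$$\mathrm{Prox}_{\tau(-h_c + \iota_{\triangle_c})}(z) \;=\; \operatorname*{arg\,min}_{\mu\in\triangle_c}\; \tau\|\mu\|_F^2 + \tfrac12\|\mu - z\|_F^2 \;=\; \Pi_{\triangle_c}\!\Big(\frac{z}{1+2\tau}\Big),$$
which can be computed in $O(|\mathcal{Y}_c|\log|\mathcal{Y}_c|)$ time by sorting.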
$$\mathcal{M} \;\xrightarrow{\text{relax to}}\; \mathcal{L} = \mathcal{I} \cap \{\mu \mid A\mu = 0\}, \qquad H_{\text{Shannon}} \;\xrightarrow{\text{relax to}}\; H_{\text{approx}}(\mu) := \sum_c h_c(\mu_c).$$

Problem relaxation
$$D^\#(\mu) := H_{\text{Shannon}}(\mu) - \iota_{\mathcal{M}}(\mu) - \frac{1}{2\lambda}\|\Psi\mu\|_2^2 \quad\xrightarrow{\text{relax to}}\quad D(\mu) := H_{\text{approx}}(\mu) - \iota_{\mathcal{I}}(\mu) - \iota_{\{A\mu=0\}}(\mu) - \frac{1}{2\lambda}\|\Psi\mu\|_2^2$$
so that, with
$$f^*_c(\mu_c) := -h_c(\mu_c) + \iota_{\triangle_c}(\mu_c) \qquad\text{and}\qquad g^*(\mu) = \frac{1}{2\lambda}\|\Psi\mu\|_2^2,$$
we have
$$D(\mu) = -\sum_{c\in\mathcal{C}} f^*_c(\mu_c) - g^*(\mu) - \iota_{\{A\mu=0\}}(\mu).$$
Idea: without the linear constraint $\{A\mu = 0\}$, $D$ has exactly the structure that SDCA can exploit (a sum of per-clique conjugate terms plus a quadratic). We therefore handle the constraint via a dual augmented Lagrangian:
$$D_\rho(\mu, \xi) = -\sum_{c\in\mathcal{C}} f^*_c(\mu_c) - g^*(\mu) - \langle \xi, A\mu\rangle - \frac{1}{2\rho}\|A\mu\|_2^2.$$
By strong duality, we need to solve $\min_\xi d(\xi)$ with $d(\xi) := \max_\mu D_\rho(\mu, \xi)$.
Note that $\nabla d(\xi) = A\mu_\xi$ with $\mu_\xi = \arg\max_\mu D_\rho(\mu, \xi)$.
Combining an inexact dual Lagrangian method with a subsolver $\mathcal{A}$

At epoch $t$:
- Maximize $D_\rho$ partially w.r.t. $\mu$, using a fixed number of steps of a (stochastic) linearly convergent algorithm $\mathcal{A}$, to get $\hat\mu^t$ from $\hat\mu^{t-1}$.
- Take an inexact gradient step on $d$ with $\xi^{t+1} = \xi^t - \frac{1}{L} A\hat\mu^t$.

(A schematic version of this loop is sketched below.)
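Schematically (a sketch of the loop just described, not the paper's implementation; `inner_solver` is a hypothetical callback running $T_{\mathrm{in}}$ stochastic ascent steps on $\mu \mapsto D_\rho(\mu, \xi)$ from a warm start, and `L` is a step-size constant for the outer gradient step):

```python
def idal(mu, xi, A, inner_solver, T_in, L, n_epochs):
    """Inexact dual augmented Lagrangian loop."""
    for _ in range(n_epochs):
        mu = inner_solver(mu, xi, n_steps=T_in)   # partial maximization, warm-started at mu
        xi = xi - (A @ mu) / L                    # inexact gradient step, using grad d = A mu
    return mu, xi
```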
Let $\xi^t$ (resp. $\hat\mu^t$) be the value of $\xi$ (resp. $\mu$) at the end of epoch $t$. Let $\hat\Delta_t := \max_\mu D_\rho(\mu, \xi^t) - D_\rho(\hat\mu^t, \xi^t)$ and $\Gamma_t := d(\xi^t) - d(\xi^*)$. Let $\Delta^0_t := \max_\mu D_\rho(\mu, \xi^t) - D_\rho(\mu^t_0, \xi^t)$.

If the algorithm $\mathcal{A}$ used at epoch $t$ to maximize $D_\rho(\mu, \xi^t)$ w.r.t. $\mu$ is such that $\exists \beta \in (0,1)$ with $\mathbb{E}[\hat\Delta_t] \le \beta\,\mathbb{E}[\Delta^0_t]$, then $\exists\,\kappa \in (0,1)$ characterizing $d$ and $\exists\, C > 0$ such that, if $\mu^t_0 = \hat\mu^{t-1}$, after $T_{\mathrm{ex}}$ epochs
$$\big(\mathbb{E}[\hat\Delta_{T_{\mathrm{ex}}}],\; \mathbb{E}[\Gamma_{T_{\mathrm{ex}}}]\big) \;\le\; C\,\lambda_{\max}(\beta)^{T_{\mathrm{ex}}}\; \big(\mathbb{E}[\hat\Delta_0],\; \mathbb{E}[\Gamma_0]\big) \quad\text{(componentwise)},$$
where $\lambda_{\max}(\beta)$ is the largest eigenvalue of a $2\times 2$ matrix $M(\beta)$ whose entries involve $3\beta$, $1$ and $1-\kappa$ (the exact matrix is given in the paper).
Let $\mathcal{A}$ be an iterative algorithm used to partially solve $\max_\mu D_\rho(\mu, \xi)$. Let $\xi^t$ (resp. $\hat\mu^t$) be the value of $\xi$ (resp. $\mu$) at the end of epoch $t$. Let $\hat\Delta_t := \max_\mu D_\rho(\mu, \xi^t) - D_\rho(\hat\mu^t, \xi^t)$ and $\Gamma_t := d(\xi^t) - d(\xi^*)$.

Proposition: If
- $\mathcal{A}$ is a linearly convergent algorithm,
- at each epoch $t$, $\mathcal{A}$ is initialized with $\hat\mu^{t-1}$ (→ use of warm starts),
- $\mathcal{A}$ is run for a number $T_{\mathrm{in}}$ of iterations fixed in advance at each epoch,

then
- $\hat\Delta_t, \Gamma_t \to 0$ linearly a.s.,
- the residuals $\|A\hat\mu^t\|_2^2 \to 0$ linearly a.s.,
- the smooth part of the objective converges linearly a.s.
Let $P$ be the relaxed primal objective $P(w) := F_{\mathcal{L}}(\Psi^\top w) + \frac{\lambda}{2}\|w\|_2^2$, with $F_{\mathcal{L}}(\theta) := \max_{\mu\in\mathcal{L}}\; \langle\theta, \mu\rangle + H_{\text{approx}}(\mu)$.

Corollary
Let $\hat w^t = -\frac{1}{\lambda}\Psi\hat\mu^t$. If $\mathcal{A}$ is a linearly convergent algorithm and the function $\mu \mapsto -H_{\text{approx}}(\mu) + \frac{1}{2\rho}\|A\mu\|_2^2$ is strongly convex, then $P(\hat w^t) - P(w^\star)$ converges to $0$ linearly a.s. Since a fixed number of inner iterations is performed at each epoch, the linear convergence holds as a function of the total number of clique updates.
A lot of work on approximate inference for CRFs:
Komodakis et al. (2007); Sontag et al. (2008); Savchynskyy et al. (2011)
Learning methods going beyond saddle-point formulations:
Meshi et al. (2010); Hazan and Urtasun (2010); Lacoste-Julien et al. (2013)
Learning in the dual for structured SVMs with only clique-wise updates:
- with relaxation + smoothing of the linear constraints, Meshi et al. (2015), using block-coordinate Frank-Wolfe (BCFW) or block-coordinate ascent;
- with multipliers and a greedy primal-dual algorithm, Yen et al. (2016), who show a global linear convergence result in the dual.
Convergence rates for approximate gradient descent
Schmidt et al. (2011); Devolder et al. (2014); Lan and Monteiro (2016); Lin et al. (2017)
Related work on BCFW with linear constraints: Gidel et al. (2018)
Compared methods:
- SoftBCFW: stochastic block-coordinate Frank-Wolfe + penalty method (Meshi et al., 2015)
- SoftSDCA: stochastic block-coordinate prox ascent + penalty method
- GDMM: dual decomposed learning with factorwise oracle (Yen et al., 2016)
- IDAL: our algorithm
Gaussian mixture Potts model: 10 × 10 grid graph with 5 classes; Gaussian features in $\mathbb{R}^{10}$ ($w_{\tau_1} \in \mathbb{R}^{10\times 5}$, $w_{\tau_2} \in \mathbb{R}^{5\times 5}$); 50 training grids.
Semantic segmentation of images: MSRC-21 dataset (Shotton et al., 2006); 21 classes; 50 features ($w_{\tau_1} \in \mathbb{R}^{50\times 21}$, $w_{\tau_2} \in \mathbb{R}^{21\times 21}$); 335 training images.
[Figure: Gaussian mixture Potts model experiments: bound on duality gap, gap on marginalization constraints, dual objective, and accuracy on test data, comparing SoftBCFW, SoftSDCA, GDMM and IDAL.]
[Figure: MSRC-21 experiments: bound on duality gap, gap on marginalization constraints, dual objective, and accuracy on test data, comparing SoftBCFW, SoftSDCA, GDMM and IDAL.]
We proposed an algorithm combining SDCA and an inexact dual Lagrangian method that obtains:
- global linear convergence for the relaxed objective, both in the primal and for the dual augmented Lagrangian formulation;
- good practical performance.

Other contributions:
- computable duality gaps to track convergence in the primal;
- a representer theorem in the structured learning case, "inside the graph";
- a unified derivation connecting the formulations of previous work;
- SDCA can accommodate linear constraints on the dual parameter.

Open questions:
- Use a better approximation of the entropy, such as OTRW (→ non-Lipschitz gradients)?
- Be stochastic on ξ as well?

Paper:
SDCA-Powered Inexact Dual Augmented Lagrangian Method for Fast CRF Learning, X. Hu, G. Obozinski, AIStats, 2018.
References

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. (2016). The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Defazio, A., Bach, F., and Lacoste-Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654.
Devolder, O., Glineur, F., and Nesterov, Y. (2014). First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75.
Gidel, G., Pedregosa, F., and Lacoste-Julien, S. (2018). Frank-Wolfe splitting via augmented Lagrangian method. In AIStats.
Hazan, T. and Urtasun, R. (2010). A primal-dual message-passing algorithm for approximated large scale structured prediction. In NIPS, pages 838–846.
Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323.
Komodakis, N., Paragios, N., and Tziritas, G. (2007). MRF optimization via dual decomposition: Message-passing revisited. In ICCV, pages 1–8.
Lacoste-Julien, S., Jaggi, M., Schmidt, M., and Pletscher, P. (2013). Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML, pages 53–61.
Lan, G. and Monteiro, R. D. (2016). Iteration-complexity of first-order augmented Lagrangian methods for convex programming. Mathematical Programming, 155(1-2):511–547.
Lin, H., Mairal, J., and Harchaoui, Z. (2017). QuickeNing: A generic quasi-Newton algorithm for faster gradient-based optimization. arXiv preprint arXiv:1610.00960.
Meshi, O., Sontag, D., Globerson, A., and Jaakkola, T. S. (2010). Learning efficiently with approximate inference via dual losses. In ICML, pages 783–790.
Meshi, O., Srebro, N., and Hazan, T. (2015). Efficient training of structured SVMs via soft constraints. In AIStats.
Roux, N. L., Schmidt, M., and Bach, F. R. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2663–2671.
Savchynskyy, B., Kappes, J., Schmidt, S., and Schnörr, C. (2011). A study of Nesterov's scheme for Lagrangian decomposition and MAP labeling. In CVPR, pages 1817–1823.
Schmidt, M., Le Roux, N., and Bach, F. R. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. In NIPS, pages 1458–1466.
Shalev-Shwartz, S. and Zhang, T. (2016). Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105–145.
Shotton, J., Winn, J., Rother, C., and Criminisi, A. (2006). TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In ECCV. Springer.
Sontag, D., Meltzer, T., Globerson, A., Jaakkola, T., and Weiss, Y. (2008). Tightening LP relaxations for MAP using message passing. In UAI, pages 503–510.
Yen, I. E.-H., Huang, X., Zhong, K., Zhang, R., Ravikumar, P. K., and Dhillon, I. S. (2016). Dual decomposed learning with factorwise oracle for structural SVM of large output domain. In NIPS, pages 5024–5032.