Conditional gradient algorithms for machine learning



1. Conditional gradient algorithms for machine learning. Zaid Harchaoui, LEAR and LJK, INRIA. Joint work with A. Juditsky (Grenoble U., France) and A. Nemirovski (GeorgiaTech), and with Matthijs Douze, Miro Dudik, Jerome Malick, Mattis Paulin. Gargantua day, Grenoble, Nov. 26th, 2013.

  2. The advent of large-scale datasets and “big learning”. From “The Promise and Perils of Benchmark Datasets and Challenges”, D. Forsyth, A. Efros, F.-F. Li, A. Torralba and A. Zisserman, talk at “Frontiers of Computer Vision”.

  3. Large-scale supervised learning. Let (x_1, y_1), ..., (x_n, y_n) ∈ R^d × Y be i.i.d. labelled training data, and R_emp(·) the empirical risk for any W ∈ R^{d×k}.
     Constrained formulation: minimize R_emp(W) subject to Ω(W) ≤ ρ.
     Penalized formulation: minimize λ Ω(W) + R_emp(W).
     Problem: minimize such objectives in the large-scale setting, i.e. #examples ≫ 1, #features ≫ 1, #classes ≫ 1.

  4. Large-scale supervised learning. Let (x_1, y_1), ..., (x_n, y_n) ∈ R^d × Y be i.i.d. labelled training data, and R_emp(·) the empirical risk for any W ∈ R^{d×k}.
     Constrained formulation: minimize R_emp(W) subject to Ω(W) ≤ ρ.
     Penalized formulation: minimize λ Ω(W) + R_emp(W).
     Problem: minimize such objectives in the large-scale setting, i.e. n ≫ 1, d ≫ 1, k ≫ 1.

  5. Machine learning cuboid: a figure depicting the data as a cuboid with axes n (examples), d (features), and k (classes).

  6. Motivating example: multi-class classification with trace-norm penalty.
     Motivating the trace-norm penalty. Embedding assumption: classes may be embedded in a low-dimensional subspace of the feature space. Computational efficiency: training-time and test-time efficiency require sparse matrix regularizers.
     Trace-norm: the trace-norm, aka nuclear norm, is defined as ||σ(W)||_1 = Σ_{p=1}^{min(d,k)} σ_p(W), where σ_1(W), ..., σ_{min(d,k)}(W) denote the singular values of W.
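As a quick numerical illustration of this definition (not part of the slides), the following Python sketch computes ||σ(W)||_1 for a random matrix by summing its singular values and checks the result against NumPy's built-in nuclear norm; the dimensions d and k are arbitrary choices.

```python
# Illustrative sketch: the trace-norm as the sum of singular values.
import numpy as np

d, k = 100, 20                      # hypothetical feature and class dimensions
rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))

singular_values = np.linalg.svd(W, compute_uv=False)
trace_norm = singular_values.sum()  # ||sigma(W)||_1

# Cross-check against NumPy's nuclear-norm implementation.
assert np.isclose(trace_norm, np.linalg.norm(W, ord="nuc"))
print(trace_norm)
```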

  7. Large-scale supervised learning: multi-class classification with trace-norm regularization. Let (x_1, y_1), ..., (x_n, y_n) ∈ R^d × Y be i.i.d. labelled training data, and R_emp(·) the empirical risk for any W ∈ R^{d×k}.
     Constrained formulation: minimize R_emp(W) subject to ||σ(W)||_1 ≤ ρ.
     Penalized formulation: minimize λ ||σ(W)||_1 + R_emp(W).
     Trace-norm reg. penalty (Amit et al., 2007; Argyriou et al., 2007): enforces a low-rank structure of W (sparsity of the spectrum σ(W)). Both problems are convex.

  8. About the different formulations. “Alleged” equivalence: for a particular set of examples and for any value ρ of the constraint in the constrained formulation, there exists a value of λ in the penalized formulation such that the solutions of the constrained and the penalized formulations coincide. Statistical learning theory: theoretical results on penalized estimators and constrained estimators are of a different nature, so no rigorous comparison is possible; the equivalence is frequently called to the rescue, depending on the theoretical tools available, to jump from one formulation to the other.

  9. Summary. In practice, recall that eventually the “hyperparameters” (λ, ρ, ε, ...) will have to be tuned. Choose the formulation in which you can most easily incorporate prior knowledge.
     Constrained formulation I: Minimize_{W ∈ R^{d×k}} (1/n) Σ_{i=1}^n Loss_i subject to ||σ(W)||_1 ≤ ρ.
     Penalized formulation: Minimize_{W ∈ R^{d×k}} (1/n) Σ_{i=1}^n Loss_i + λ ||σ(W)||_1.
     Constrained formulation II: Minimize_{W ∈ R^{d×k}} λ ||σ(W)||_1 subject to |(1/n) Σ_{i=1}^n Loss_i − R_emp^target| ≤ ε.

  10. Learning with trace-norm penalty: a convex problem. Supervised learning with trace-norm regularization penalty: let (x_1, y_1), ..., (x_n, y_n) ∈ R^d × Y be a set of i.i.d. labelled training data, with Y = {0, 1}^k for multi-class classification.
      Penalized formulation: Minimize_{W ∈ R^{d×k}} (1/n) Σ_{i=1}^n Loss_i + λ ||σ(W)||_1 (convex).
      Trace-norm reg. penalty (Amit et al., 2007; Argyriou et al., 2007): enforces a low-rank structure of W (sparsity of the spectrum σ(W)). Convex, but non-differentiable.

  11. Possible approaches. Generic approaches: “blind” approach (subgradient, bundle methods) → slow convergence rate; other approaches (alternating optimization, iteratively reweighted least-squares, etc.) → no finite-time convergence guarantees.

  12. Learning with trace-norm penalty: convex but non-smooth. Supervised learning with trace-norm regularization penalty: let (x_1, y_1), ..., (x_n, y_n) ∈ R^d × Y be a set of i.i.d. labelled training data, with Y = {0, 1}^k for multi-class classification.
      Minimize_{W ∈ R^{d×k}} λ ||σ(W)||_1 (nonsmooth) + (1/n) Σ_{i=1}^n Loss_i (smooth),
      where Loss_i is e.g. the multinomial logistic loss of the i-th example:
      Loss_i = log(1 + Σ_{ℓ ∈ Y \ {y_i}} exp(w_ℓ^T x_i − w_{y_i}^T x_i)).
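A minimal Python sketch (not from the slides) of this per-example multinomial logistic loss; the function name and the log-sum-exp shift for numerical stability are illustrative choices.

```python
# Multinomial logistic loss of one example, as written on the slide:
# Loss_i = log(1 + sum_{l != y_i} exp(w_l^T x_i - w_{y_i}^T x_i)).
import numpy as np

def multinomial_logistic_loss(W, x_i, y_i):
    """W: (d, k) weight matrix, x_i: (d,) feature vector, y_i: integer class label."""
    scores = W.T @ x_i                  # w_l^T x_i for every class l
    margins = scores - scores[y_i]      # w_l^T x_i - w_{y_i}^T x_i
    margins = np.delete(margins, y_i)   # the sum runs over l != y_i
    # Shift by the max for a numerically stable log-sum-exp.
    m = max(margins.max(), 0.0)
    return m + np.log(np.exp(-m) + np.exp(margins - m).sum())
```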

  13. Learning with trace-norm penalty: a convex problem. Supervised learning with trace-norm regularization penalty: let (x_1, y_1), ..., (x_n, y_n) ∈ R^d × Y be a set of i.i.d. labelled training data, with Y = {0, 1}^k for multi-class classification.
      Penalized formulation: Minimize_{W ∈ R^{d×k}} λ ||σ(W)||_1 + (1/n) Σ_{i=1}^n Loss_i.

  14. Composite minimization for the penalized formulation. Strengths of composite minimization (aka proximal gradient): attractive algorithms when the proximal operator is cheap, e.g. for the vector ℓ_1-norm; accurate, with medium-accuracy finite-time guarantees.

  15. Proximal gradient. Algorithm:
      Initialize: W = 0.
      Iterate: W_{t+1} = Prox_{(λ/L) Ω(·)}(W_t − (1/L) ∇R_emp(W_t)),
      with Prox_{(λ/L) Ω(·)}(U) := argmin_W (1/2) ||U − W||^2 + (λ/L) Ω(W).
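A hedged Python sketch of this iteration, specialized to the trace-norm penalty, where the prox is singular value thresholding (this specialization is discussed on a later slide). The names `grad_emp_risk`, `lam` (λ) and `L` (the gradient's Lipschitz constant) are hypothetical placeholders supplied by the caller.

```python
# Proximal-gradient sketch for the penalized trace-norm problem.
import numpy as np

def prox_trace_norm(U, tau):
    """Prox of tau * ||sigma(.)||_1: soft-threshold the singular values of U."""
    Uu, s, Vt = np.linalg.svd(U, full_matrices=False)
    return Uu @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def proximal_gradient(grad_emp_risk, d, k, lam, L, n_iter=100):
    W = np.zeros((d, k))                        # Initialize: W = 0
    for _ in range(n_iter):
        G = grad_emp_risk(W)                    # gradient of the empirical risk at W_t
        W = prox_trace_norm(W - G / L, lam / L) # W_{t+1} = Prox_{lam/L}(W_t - (1/L) grad)
    return W
```

Note that each iteration calls a full SVD inside the prox, which is exactly the cost issue raised on the next slide.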

  16. Composite minimization for the penalized formulation. Strengths of composite minimization (aka proximal gradient): attractive algorithms when the proximal operator is cheap, e.g. for the vector ℓ_1-norm; accurate, with medium-accuracy finite-time guarantees.
      Weaknesses of composite minimization: inappropriate when the proximal operator is expensive to compute; too sensitive to the conditioning of the design matrix (correlated features).
      Situation with the trace-norm, i.e. Prox_{μ Ω(·)}(·) with Ω(·) = ||·||_{σ,1}: the proximal operator corresponds to singular value thresholding, requiring an SVD running in O(k rk(W)^2) time → impractical for large-scale problems.

  17. Alternative approach: conditional gradient. We want an algorithm with no SVD, i.e. without any projection or proximal step. Let us get some inspiration from the constrained setting.
      Problem: Minimize_{W ∈ R^{d×k}} (1/n) Σ_{i=1}^n Loss_i subject to W ∈ ρ · convex hull({M_t}_{t ≥ 1}).
      Gauge/atomic decomposition of the trace-norm:
      ||σ(W)||_1 = inf_θ { Σ_{i=1}^N θ_i | ∃ N, θ_i > 0, M_i ∈ M with W = Σ_{i=1}^N θ_i M_i },
      where M = { u v^T | u ∈ R^d, v ∈ R^Y, ||u||_2 = ||v||_2 = 1 }.
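A quick numerical sanity check (illustrative, not from the slides) of this atomic view: a unit-norm atom u v^T has trace-norm exactly 1, and a positive combination W = Σ_i θ_i M_i of such atoms has trace-norm at most Σ_i θ_i.

```python
# Numerical check of the gauge/atomic view of the trace-norm.
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 50, 10, 5
nuc = lambda A: np.linalg.norm(A, ord="nuc")   # ||sigma(A)||_1

# A single unit atom u v^T has trace-norm 1.
u = rng.standard_normal(d); u /= np.linalg.norm(u)
v = rng.standard_normal(k); v /= np.linalg.norm(v)
assert np.isclose(nuc(np.outer(u, v)), 1.0)

# A positive combination of atoms has trace-norm at most the sum of weights.
theta = rng.uniform(0.1, 1.0, size=N)
atoms = []
for _ in range(N):
    ui = rng.standard_normal(d); ui /= np.linalg.norm(ui)
    vi = rng.standard_normal(k); vi /= np.linalg.norm(vi)
    atoms.append(np.outer(ui, vi))
W = sum(t * M for t, M in zip(theta, atoms))
assert nuc(W) <= theta.sum() + 1e-10
```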

  18. Conditional gradient descent. Algorithm:
      Initialize: W = 0.
      Iterate: find M_t ∈ ρ · convex hull(M) such that M_t = Argmax_{M_ℓ ∈ M} ⟨M_ℓ, −∇R_emp(W_t)⟩ (linear minimization oracle);
      perform a line-search between W_t and M_t: W_{t+1} = (1 − δ) W_t + δ M_t.
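A generic Python sketch of this iteration, under the assumption that the caller supplies the gradient of the empirical risk, a linear minimization oracle `lmo`, and a `line_search` routine returning δ ∈ [0, 1]; none of these names come from the slides.

```python
# Generic conditional-gradient (Frank-Wolfe) iteration.
import numpy as np

def conditional_gradient(grad_emp_risk, lmo, line_search, d, k, n_iter=100):
    W = np.zeros((d, k))                   # Initialize: W = 0
    for _ in range(n_iter):
        G = grad_emp_risk(W)
        M = lmo(-G)                        # linear minimization oracle over rho * conv(M)
        delta = line_search(W, M)          # step size in [0, 1] from the line-search
        W = (1.0 - delta) * W + delta * M  # W_{t+1} = (1 - delta) W_t + delta M_t
    return W
```

In place of the line-search, a classical choice is the pre-set step size δ_t = 2 / (t + 2).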

  19. Conditional gradient descent: example with trace-norm constraint. Algorithm:
      Initialize: W = 0.
      Iterate: find M_t ∈ ρ · convex hull(M) such that
      M_t = Argmax_ℓ ⟨u_ℓ v_ℓ^T, −∇R_emp(W_t)⟩ = Argmax_{||u||_2 = ||v||_2 = 1} u^T (−∇R_emp(W_t)) v,
      i.e. compute the top pair of singular vectors of −∇R_emp(W_t);
      perform a line-search between W_t and M_t: W_{t+1} = (1 − δ) W_t + δ M_t.
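For the trace-norm ball, the oracle therefore reduces to one leading singular pair. The sketch below is an assumed implementation using SciPy's partial SVD `svds` (only the top pair is computed, no full SVD); it returns the scaled atom ρ u_1 v_1^T to plug into the generic loop above.

```python
# Linear minimization oracle for the trace-norm ball of radius rho.
import numpy as np
from scipy.sparse.linalg import svds

def trace_norm_lmo(neg_grad, rho):
    """Top singular pair of -grad R_emp(W_t), scaled to the boundary of the rho-ball."""
    u, s, vt = svds(neg_grad, k=1)          # partial SVD: only the leading singular pair
    return rho * np.outer(u[:, 0], vt[0, :])
```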
