

  1. Non-Uniform Stochastic Average Gradient for Training Conditional Random Fields
     Mark Schmidt, Reza Babanezhad, Mohamed Ahmed, Ann Clifton, Anoop Sarkar
     University of British Columbia, Simon Fraser University
     NIPS Optimization Workshop, 2014

  2-4. Motivation: Structured Prediction
     Classical supervised learning: predict a single label y from an input x.
     Structured prediction: predict a structured output y (e.g., a whole word or sequence) from x.
     Other structured prediction tasks: labelling all people/places in Wikipedia, finding coding regions in DNA sequences, labelling all voxels in an MRI as normal or tumor, predicting protein structure from sequence, weather forecasting, translating from French to English, etc.

  5-8. Motivation: Structured Prediction
     Naive approaches to predicting letters y given images x:
     Multinomial logistic regression to predict the word:
       $$p(y \mid x, w) = \frac{\exp(w_y^T F(x))}{\sum_{y'} \exp(w_{y'}^T F(x))}.$$
     This requires a parameter vector w_k for all possible words k.
     Multinomial logistic regression to predict each letter:
       $$p(y_j \mid x_j, w) = \frac{\exp(w_{y_j}^T F(x_j))}{\sum_{y_j'} \exp(w_{y_j'}^T F(x_j))}.$$
     This works if you are really good at predicting individual letters. But it ignores the dependencies between letters.
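A minimal sketch of the per-letter model above (not from the slides), assuming each letter image x_j has already been summarized as a feature vector F(x_j); the names `letter_softmax`, `W`, and `feat` are illustrative:

```python
import numpy as np

def letter_softmax(W, feat):
    """p(y_j | x_j, w): an independent softmax over the 26 letter classes.

    W    : (26, d) array, one weight vector w_y per letter class
    feat : (d,) feature vector F(x_j) for a single letter image
    """
    scores = W @ feat          # w_y^T F(x_j) for every class y
    scores -= scores.max()     # shift for numerical stability (does not change the result)
    p = np.exp(scores)
    return p / p.sum()         # divide by the sum over y' of exp(w_{y'}^T F(x_j))

# toy usage with random parameters and a random 128-dimensional feature vector
rng = np.random.default_rng(0)
W = rng.normal(size=(26, 128))
probs = letter_softmax(W, rng.normal(size=128))
print(probs.argmax(), probs.sum())   # predicted letter index; probabilities sum to 1
```

Predicting a whole word this way just multiplies independent per-letter probabilities, which is exactly the independence assumption the following slides point out is too strong.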

  9-10. Motivation: Structured Prediction
     What letter is this? What are these letters?
     (The same ambiguous handwritten letter is easy to read once its neighbouring letters are shown: context resolves the ambiguity.)

  11-17. Conditional Random Fields
     Conditional random fields model targets y given inputs x using
       $$p(y \mid x, w) = \frac{\exp(w^T F(y, x))}{\sum_{y'} \exp(w^T F(y', x))} = \frac{\exp(w^T F(y, x))}{Z},$$
     where w are the parameters.
     Examples of features F(y, x):
       F(y_j, x): these features lead to a logistic model for each letter.
       F(y_{j-1}, y_j, x): dependency between adjacent letters ('q-u').
       F(y_{j-1}, y_j, j, x): position-based dependency (French: 'e-r' ending).
       F(y_{j-2}, y_{j-1}, y_j, j, x): third-order and position (English: 'i-n-g' ending).
       F(y ∈ D, x): is y in a dictionary D?
     CRFs are a ubiquitous tool in natural language processing: part-of-speech tagging, semantic role labelling, information extraction, shallow parsing, named-entity recognition, etc.
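For the chain-structured features above (unary F(y_j, x) and pairwise F(y_{j-1}, y_j, x)), the global score w^T F(y, x) decomposes over positions. A minimal sketch of that decomposition, assuming the features have already been collapsed into per-position and transition score arrays (`unary` and `pairwise` are illustrative names, not from the talk):

```python
import numpy as np

def crf_log_score(unary, pairwise, y):
    """Unnormalized log-score w^T F(y, x) of a labelling y under a chain CRF.

    unary    : (T, K) array, unary[j, k]    = score of label k at position j
    pairwise : (K, K) array, pairwise[a, b] = score of the transition a -> b
    y        : length-T list of label indices
    """
    score = unary[0, y[0]]
    for j in range(1, len(y)):
        score += pairwise[y[j - 1], y[j]] + unary[j, y[j]]
    return score

# toy usage: score the 5-letter word 'q-u-e-e-n' over a 26-letter alphabet
rng = np.random.default_rng(1)
T, K = 5, 26
unary, pairwise = rng.normal(size=(T, K)), rng.normal(size=(K, K))
print(crf_log_score(unary, pairwise, [16, 20, 4, 4, 13]))
```

Higher-order or global features such as F(y ∈ D, x) add terms that do not fit this per-position decomposition, which is what makes inference with richer dependency structures more expensive.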

  18-22. Optimization Formulation and Challenge
     Typically train using the ℓ2-regularized negative log-likelihood:
       $$\min_w \; f(w) = \frac{\lambda}{2}\|w\|^2 - \frac{1}{n}\sum_{i=1}^n \log p(y^i \mid x^i, w).$$
     Good news: ∇f(w) is Lipschitz-continuous and f is strongly convex.
     Bad news: evaluating log p(y^i | x^i, w) and its gradient is expensive.
       Chain structures: run forward-backward on each example.
       General features: exponential in the tree-width of the dependency graph.
     There is a lot of work on approximate evaluation, but this optimization problem remains a bottleneck.
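For chain structures, the forward pass of the forward-backward algorithm mentioned above is enough to get the normalizer in log p(y^i | x^i, w); the backward pass is additionally needed for the gradient. A log-space sketch, reusing the same illustrative `unary`/`pairwise` score arrays as the earlier snippet (an assumption about representation, not the authors' code):

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_partition(unary, pairwise):
    """log Z(x, w) for a chain CRF via the forward recursion in log space.

    unary    : (T, K) per-position label scores
    pairwise : (K, K) transition scores
    """
    alpha = unary[0].copy()                                   # log alpha_1(k)
    for j in range(1, unary.shape[0]):
        # log alpha_j(b) = logsumexp_a [ log alpha_{j-1}(a) + pairwise[a, b] ] + unary[j, b]
        alpha = logsumexp(alpha[:, None] + pairwise, axis=0) + unary[j]
    return logsumexp(alpha)                                   # sum over the final label

def crf_log_likelihood(unary, pairwise, y):
    """log p(y | x, w) = w^T F(y, x) - log Z(x, w)."""
    score = unary[0, y[0]] + sum(pairwise[y[j - 1], y[j]] + unary[j, y[j]]
                                 for j in range(1, len(y)))
    return score - crf_log_partition(unary, pairwise)
```

Each evaluation costs O(T K^2) for a length-T sequence with K labels, and the full objective sums over all n training sequences, which is the bottleneck that motivates the stochastic methods discussed next.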

  23-26. Current Optimization Methods
     Lafferty et al. [2001] proposed an iterative scaling approach.
     It is outperformed by the L-BFGS quasi-Newton algorithm [Wallach, 2002; Sha & Pereira, 2003], which has a linear convergence rate: O(log(1/ε)) iterations are required, but each iteration needs log p(y^i | x^i, w) for all n examples.
     To scale to large n, stochastic gradient methods were examined [Vishwanathan et al., 2006]. Their iteration cost is independent of n, but the convergence rate is sublinear: O(1/ε) iterations are required. With a constant step size you instead get a linear rate, but only up to a fixed tolerance.
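A sketch of the stochastic gradient baseline described in the last bullet, not the non-uniform SAG method of the title; the helper `grad_logp(w, i)`, which would run forward-backward on example i and return ∇_w log p(y^i | x^i, w), and the step size are assumptions for illustration:

```python
import numpy as np

def sgd_constant_step(grad_logp, w0, n, lam, alpha=1e-3, num_iters=100_000, seed=0):
    """Stochastic gradient on f(w) = (lam/2)||w||^2 - (1/n) sum_i log p(y^i | x^i, w).

    grad_logp(w, i): assumed to return the gradient of log p(y^i | x^i, w)
                     (one forward-backward pass, so the cost per iteration
                     is independent of n).
    """
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(num_iters):
        i = rng.integers(n)                  # pick one training example uniformly
        g = lam * w - grad_logp(w, i)        # unbiased estimate of grad f(w)
        w -= alpha * g                       # constant step size: linear rate only
                                             # up to a fixed tolerance around w*
    return w
```

The SAG approach behind the talk's title instead keeps the most recent gradient of each example in memory and steps along their average, sampling examples non-uniformly rather than uniformly as above, which is how it aims for a full linear rate while keeping the per-iteration cost independent of n.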
