A majorization-minimization algorithm for (multiple) hyperparameter learning
Chuan-Sheng Foo, Chuong B. Do, Andrew Y. Ng (Stanford University)
ICML 2009, Montreal, Canada, 17 June 2009
Supervised learning
- Training set of m IID examples {(x^(i), y^(i))}, i = 1, ..., m
- Probabilistic model p(y | x; w)
- Estimate the parameters w
  - Labels may be real-valued, discrete, or structured
- Regularized maximum likelihood estimation
  - Equivalently, maximum a posteriori (MAP) estimation
  - Regularization prevents overfitting
  - The objective combines the data log-likelihood with a log-prior over the model parameters (written out below)
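A sketch of this objective in its MAP form, using the two labelled terms and the notation above:

```latex
\hat{w} \;=\; \arg\max_{w}\;
\underbrace{\sum_{i=1}^{m} \log p\!\left(y^{(i)} \mid x^{(i)}; w\right)}_{\text{data log-likelihood}}
\;+\;
\underbrace{\log p(w)}_{\text{log-prior over model parameters}}
```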
L2-regularized logistic regression
- The regularization hyperparameter C controls the strength of the penalty
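For example, a sketch of the L2-regularized logistic regression objective; the exact scaling of the hyperparameter C is an assumption here:

```latex
\min_{w}\; -\sum_{i=1}^{m} \log p\!\left(y^{(i)} \mid x^{(i)}; w\right) \;+\; \frac{C}{2}\,\|w\|_{2}^{2},
\qquad
p(y = 1 \mid x; w) \;=\; \frac{1}{1 + e^{-w^{\top} x}}
```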
How to select the hyperparameter(s)?
- Grid search
  + Simple to implement
  − Scales exponentially with the number of hyperparameters
- Gradient-based algorithms
  + Scale well with the number of hyperparameters
  − Non-trivial to implement
Can we get the best of both worlds?
Our contribution
- Striking ease of implementation
  - Simple, closed-form updates for C
  - Leverages existing solvers
- Scales well to the multiple-hyperparameter case
- Applicable to a wide range of models
Outline
- 1. Problem definition
- 2. The “integrate out” strategy
- 3. The Majorization-Minimization algorithm
- 4. Experiments
- 5. Discussion
The “integrate out” strategy
- Treat hyperparameter C as a random variable
- Analytically integrate out C
- Need a convenient prior p(C)
Integrating out a single hyperparameter
- For L2 regularization, p(w | C) is a zero-mean Gaussian with precision C
- A convenient prior: a Gamma prior on C, with parameters α and β
- The result (see the sketch after this list):
  - 1. C is gone
  - 2. The new objective is neither convex nor concave in w
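A sketch of the integration, assuming p(w | C) is a zero-mean Gaussian over n parameters with precision C and p(C) ∝ C^α e^(−βC); the paper's exact parameterization may differ by constants:

```latex
p(w) \;=\; \int_{0}^{\infty} p(w \mid C)\, p(C)\, dC
\;\propto\; \int_{0}^{\infty} C^{\,n/2 + \alpha}\, e^{-C\left(\beta + \frac{1}{2}\|w\|^{2}\right)}\, dC
\;\propto\; \left(\beta + \tfrac{1}{2}\|w\|^{2}\right)^{-\left(\alpha + n/2 + 1\right)}
```

Up to a constant, the penalty added to the negative log-likelihood is therefore (α + n/2 + 1) · log(β + ||w||²/2): C no longer appears, but the log makes the term neither convex nor concave in w.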
The Majorization-Minimization Algorithm
- Replace the hard problem by a series of easier ones
- EM-like, with two steps (the generic guarantee is sketched below):
  - 1. Majorization: upper-bound the objective function
  - 2. Minimization: minimize the upper bound
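For a minimization problem, the guarantee behind these two steps is the standard MM descent property: a surrogate that upper-bounds the objective and touches it at the current iterate can only decrease the objective when minimized:

```latex
g(w; w^{(t)}) \;\ge\; f(w)\ \ \forall w,
\qquad g(w^{(t)}; w^{(t)}) \;=\; f(w^{(t)}),
\qquad w^{(t+1)} \;=\; \arg\min_{w}\, g(w; w^{(t)})
\;\;\Longrightarrow\;\;
f(w^{(t+1)}) \;\le\; g(w^{(t+1)}; w^{(t)}) \;\le\; g(w^{(t)}; w^{(t)}) \;=\; f(w^{(t)})
```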
MM1: Upper-bounding the new prior
- New prior: the integrated-out penalty from the previous slide, a log of a quadratic in w
- Linearize the log at the current iterate (the bound is written out after the figure)
[Figure: log(x) together with its tangent-line upper bounds from first-order expansions at x = 1, x = 1.5, and x = 2]
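Since log is concave, its first-order expansion is a global upper bound; applying it to the integrated-out penalty at the current iterate w^(t) (same parameterization as before), with equality at w = w^(t):

```latex
\log z \;\le\; \log z_{0} + \frac{z - z_{0}}{z_{0}}
\quad\Longrightarrow\quad
\log\!\left(\beta + \tfrac{1}{2}\|w\|^{2}\right)
\;\le\;
\log\!\left(\beta + \tfrac{1}{2}\|w^{(t)}\|^{2}\right)
+ \frac{\tfrac{1}{2}\|w\|^{2} - \tfrac{1}{2}\|w^{(t)}\|^{2}}{\beta + \tfrac{1}{2}\|w^{(t)}\|^{2}}
```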
MM2: Solving the resulting optimization problem
- The linearized prior gives standard L2 regularization, plus terms independent of w
- Use existing solvers! (the effective hyperparameter is written out below)
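Collecting the terms that depend on w, the majorized problem is ordinary L2-regularized estimation with a closed-form effective hyperparameter; a sketch under the parameterization assumed above:

```latex
C^{(t)} \;=\; \frac{\alpha + n/2 + 1}{\beta + \tfrac{1}{2}\,\|w^{(t)}\|^{2}},
\qquad
w^{(t+1)} \;=\; \arg\min_{w}\; -\sum_{i=1}^{m} \log p\!\left(y^{(i)} \mid x^{(i)}; w\right) + \frac{C^{(t)}}{2}\,\|w\|_{2}^{2}
```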
Visualization of the upper bound
[Figure: two panels; left, log(0.5x² + 1) with its upper bounds from expansions at x = 1, 1.5, and 2; right, log(x) with its tangent lines at the same points]
Overall algorithm
- 1. Closed-form update for C (majorization step)
- 2. Leverage existing solvers for the L2-regularized problem (minimization step)
- Converges to a local minimum (a code sketch follows)
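A minimal code sketch of the loop for binary logistic regression, assuming the Gamma parameterization above and using scikit-learn's LogisticRegression as the existing solver (its C argument is an inverse regularization strength, hence the 1/C below); the function name and defaults are illustrative, not the authors' implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mm_hyperparameter_learning(X, y, alpha=0.0, beta=1.0, n_iters=20):
    """Alternate closed-form updates of C with L2-regularized fits (a sketch)."""
    n = X.shape[1]            # number of model parameters
    C = 1.0                   # initial regularization strength
    w = np.zeros(n)
    for _ in range(n_iters):
        # Minimization step: standard L2-regularized logistic regression at fixed C.
        clf = LogisticRegression(C=1.0 / C, solver="lbfgs", max_iter=1000)
        clf.fit(X, y)
        w = clf.coef_.ravel()
        # Majorization step: closed-form update of C from linearizing
        # log(beta + ||w||^2 / 2) at the current iterate.
        C = (alpha + n / 2 + 1) / (beta + 0.5 * np.dot(w, w))
    return w, C

# Usage (hypothetical data): w, C = mm_hyperparameter_learning(X_train, y_train)
```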
What about multiple hyperparameters?
- Partition the weights into regularization groups, e.g. w = (w1, w2, w3, w4, w5) with C = (C1, C2)
  - NLP: unigram feature weights vs. bigram feature weights
  - RNA secondary structure prediction: hairpin loop features vs. bulge loop features
- Each group of weights maps to its own hyperparameter ("To C or not to C. That is the question...")
- Separately update each regularization group (see the sketch after this list)
  - Sum the squared weights within each group
  - Gives a weighted L2 regularizer
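A sketch of the grouped updates, assuming the weights are partitioned into groups G_1, ..., G_K with n_k weights in group k and one hyperparameter C_k per group (same parameterization as before):

```latex
C_{k}^{(t)} \;=\; \frac{\alpha + n_{k}/2 + 1}{\beta + \tfrac{1}{2}\sum_{j \in G_{k}} \bigl(w_{j}^{(t)}\bigr)^{2}},
\qquad
w^{(t+1)} \;=\; \arg\min_{w}\; -\sum_{i=1}^{m} \log p\!\left(y^{(i)} \mid x^{(i)}; w\right)
\;+\; \sum_{k=1}^{K} \frac{C_{k}^{(t)}}{2} \sum_{j \in G_{k}} w_{j}^{2}
```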
Experiments
- 4 probabilistic models
  – Linear regression (too easy, not shown)
  – Binary logistic regression
  – Multinomial logistic regression
  – Conditional log-linear model
- 3 competing algorithms
  – Grid search
  – Gradient-based algorithm (Do et al., 2007)
  – Direct optimization of the new objective
- Our algorithm run with α = 0, β = 1
Results: Binary Logistic Regression
[Bar chart: test accuracy (50 to 100%) of Grid, Grad, Direct, and MM on australian, breast-cancer, diabetes, german-numer, heart, ionosphere, liver-disorders, mushrooms, sonar, splice, and w1a]
Results: Multinomial Logistic Regression
[Bar chart: test accuracy (30 to 100%) of Grid, Grad, Direct, and MM on connect-4, dna, glass, iris, letter, mnist1, satimage, segment, svmguide2, usps, vehicle, vowel, and wine]
Results: Conditional Log-Linear Models
- RNA secondary structure prediction
- Multiple hyperparameters
[Bar chart: ROC area (0.58 to 0.65) for single vs. grouped hyperparameters under the Gradient, Direct, and MM methods]
- Example input sequence and predicted structure (dot-bracket notation):
  AGCAGAGUGGCGCAGUGGAAGCGUGCUGGUCCCAUAACCCAGAGGUCCGAGGAUCGAAACCUUGCUCUGCUA
  (((((((((((((.......))))..((((((....(((....)))....))))))......))))))))).
Discussion
- How to choose α, β in Gamma prior?
  – Sensitivity experiments
  – A simple choice is reasonable
  – Further investigation required
- Simple assumptions sometimes wrong
- But competitive performance with Grid, Grad
- Suited for ‘Quick-and-dirty’ implementations