A majorization-minimization algorithm for (multiple) hyperparameter learning
Chuan-Sheng Foo, Chuong B. Do, Andrew Y. Ng (Stanford University)
ICML 2009, Montreal, Canada, 17 June 2009
Supervised learning
- Training set of m IID examples {(x^(i), y^(i))}, i = 1, ..., m
- Probabilistic model p(y | x; w)
- Estimate the parameters w
  - Labels may be real-valued, discrete, or structured
- Regularized maximum likelihood estimation
  - Equivalently, maximum a posteriori (MAP) estimation
  - Regularization prevents overfitting
  - The objective combines the data log-likelihood with a log-prior over the model parameters (written out below)
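A sketch of this objective in its MAP form, using the two labelled terms and the notation above:

```latex
\hat{w} \;=\; \arg\max_{w}\;
\underbrace{\sum_{i=1}^{m} \log p\!\left(y^{(i)} \mid x^{(i)}; w\right)}_{\text{data log-likelihood}}
\;+\;
\underbrace{\log p(w)}_{\text{log-prior over model parameters}}
```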
L2-regularized logistic regression
- The regularization hyperparameter C controls the strength of the penalty
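For example, a sketch of the L2-regularized logistic regression objective; the exact scaling of the hyperparameter C is an assumption here:

```latex
\min_{w}\; -\sum_{i=1}^{m} \log p\!\left(y^{(i)} \mid x^{(i)}; w\right) \;+\; \frac{C}{2}\,\|w\|_{2}^{2},
\qquad
p(y = 1 \mid x; w) \;=\; \frac{1}{1 + e^{-w^{\top} x}}
```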
How to select the hyperparameter(s)?
- Grid search
  + Simple to implement
  − Scales exponentially with the number of hyperparameters
- Gradient-based algorithms
  + Scale well with the number of hyperparameters
  − Non-trivial to implement
Can we get the best of both worlds?
Our contribution
- Striking ease of implementation
  - Simple, closed-form updates for C
  - Leverages existing solvers
- Scales well to the multiple-hyperparameter case
- Applicable to a wide range of models
Outline
- 1. Problem definition
- 2. The “integrate out” strategy
- 3. The Majorization-Minimization algorithm
- 4. Experiments
- 5. Discussion
The “integrate out” strategy
- Treat hyperparameter C as a random variable
- Analytically integrate out C
- Need a convenient prior p(C)
Integrating out a single hyperparameter
- For L2 regularization, p(w | C) is a zero-mean Gaussian with precision C
- A convenient prior: a Gamma prior on C, with parameters α and β
- The result (see the sketch after this list):
  - 1. C is gone
  - 2. The new objective is neither convex nor concave in w
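A sketch of the integration, assuming p(w | C) is a zero-mean Gaussian over n parameters with precision C and p(C) ∝ C^α e^(−βC); the paper's exact parameterization may differ by constants:

```latex
p(w) \;=\; \int_{0}^{\infty} p(w \mid C)\, p(C)\, dC
\;\propto\; \int_{0}^{\infty} C^{\,n/2 + \alpha}\, e^{-C\left(\beta + \frac{1}{2}\|w\|^{2}\right)}\, dC
\;\propto\; \left(\beta + \tfrac{1}{2}\|w\|^{2}\right)^{-\left(\alpha + n/2 + 1\right)}
```

Up to a constant, the penalty added to the negative log-likelihood is therefore (α + n/2 + 1) · log(β + ||w||²/2): C no longer appears, but the log makes the term neither convex nor concave in w.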
The Majorization-Minimization Algorithm
- Replace the hard problem by a series of easier ones
- EM-like, with two steps (the generic guarantee is sketched below):
  - 1. Majorization: upper-bound the objective function
  - 2. Minimization: minimize the upper bound
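For a minimization problem, the guarantee behind these two steps is the standard MM descent property: a surrogate that upper-bounds the objective and touches it at the current iterate can only decrease the objective when minimized:

```latex
g(w; w^{(t)}) \;\ge\; f(w)\ \ \forall w,
\qquad g(w^{(t)}; w^{(t)}) \;=\; f(w^{(t)}),
\qquad w^{(t+1)} \;=\; \arg\min_{w}\, g(w; w^{(t)})
\;\;\Longrightarrow\;\;
f(w^{(t+1)}) \;\le\; g(w^{(t+1)}; w^{(t)}) \;\le\; g(w^{(t)}; w^{(t)}) \;=\; f(w^{(t)})
```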
MM1: Upper-bounding the new prior
- New prior: the integrated-out penalty from the previous slide, a log of a quadratic in w
- Linearize the log at the current iterate (the bound is written out after the figure)
[Figure: log(x) together with its tangent-line upper bounds from first-order expansions at x = 1, x = 1.5, and x = 2]
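Since log is concave, its first-order expansion is a global upper bound; applying it to the integrated-out penalty at the current iterate w^(t) (same parameterization as before), with equality at w = w^(t):

```latex
\log z \;\le\; \log z_{0} + \frac{z - z_{0}}{z_{0}}
\quad\Longrightarrow\quad
\log\!\left(\beta + \tfrac{1}{2}\|w\|^{2}\right)
\;\le\;
\log\!\left(\beta + \tfrac{1}{2}\|w^{(t)}\|^{2}\right)
+ \frac{\tfrac{1}{2}\|w\|^{2} - \tfrac{1}{2}\|w^{(t)}\|^{2}}{\beta + \tfrac{1}{2}\|w^{(t)}\|^{2}}
```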
MM2: Solving the resulting optimization problem
- The linearized prior gives standard L2 regularization, plus terms independent of w
- Use existing solvers! (the effective hyperparameter is written out below)
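Collecting the terms that depend on w, the majorized problem is ordinary L2-regularized estimation with a closed-form effective hyperparameter; a sketch under the parameterization assumed above:

```latex
C^{(t)} \;=\; \frac{\alpha + n/2 + 1}{\beta + \tfrac{1}{2}\,\|w^{(t)}\|^{2}},
\qquad
w^{(t+1)} \;=\; \arg\min_{w}\; -\sum_{i=1}^{m} \log p\!\left(y^{(i)} \mid x^{(i)}; w\right) + \frac{C^{(t)}}{2}\,\|w\|_{2}^{2}
```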
Visualization of the upper bound
[Figure: two panels; left, log(0.5x² + 1) with its upper bounds from expansions at x = 1, 1.5, and 2; right, log(x) with its tangent lines at the same points]
Overall algorithm
- 1. Closed-form update for C (majorization step)
- 2. Leverage existing solvers for the L2-regularized problem (minimization step)
- Converges to a local minimum (a code sketch follows)
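A minimal code sketch of the loop for binary logistic regression, assuming the Gamma parameterization above and using scikit-learn's LogisticRegression as the existing solver (its C argument is an inverse regularization strength, hence the 1/C below); the function name and defaults are illustrative, not the authors' implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mm_hyperparameter_learning(X, y, alpha=0.0, beta=1.0, n_iters=20):
    """Alternate closed-form updates of C with L2-regularized fits (a sketch)."""
    n = X.shape[1]            # number of model parameters
    C = 1.0                   # initial regularization strength
    w = np.zeros(n)
    for _ in range(n_iters):
        # Minimization step: standard L2-regularized logistic regression at fixed C.
        clf = LogisticRegression(C=1.0 / C, solver="lbfgs", max_iter=1000)
        clf.fit(X, y)
        w = clf.coef_.ravel()
        # Majorization step: closed-form update of C from linearizing
        # log(beta + ||w||^2 / 2) at the current iterate.
        C = (alpha + n / 2 + 1) / (beta + 0.5 * np.dot(w, w))
    return w, C

# Usage (hypothetical data): w, C = mm_hyperparameter_learning(X_train, y_train)
```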
What about multiple hyperparameters?
- Partition the weights into regularization groups, e.g. w = (w1, w2, w3, w4, w5) with C = (C1, C2)
  - NLP: unigram feature weights vs. bigram feature weights
  - RNA secondary structure prediction: hairpin loop features vs. bulge loop features
- Each group of weights maps to its own hyperparameter ("To C or not to C. That is the question...")
- Separately update each regularization group (see the sketch after this list)
  - Sum the squared weights within each group
  - Gives a weighted L2 regularizer
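A sketch of the grouped updates, assuming the weights are partitioned into groups G_1, ..., G_K with n_k weights in group k and one hyperparameter C_k per group (same parameterization as before):

```latex
C_{k}^{(t)} \;=\; \frac{\alpha + n_{k}/2 + 1}{\beta + \tfrac{1}{2}\sum_{j \in G_{k}} \bigl(w_{j}^{(t)}\bigr)^{2}},
\qquad
w^{(t+1)} \;=\; \arg\min_{w}\; -\sum_{i=1}^{m} \log p\!\left(y^{(i)} \mid x^{(i)}; w\right)
\;+\; \sum_{k=1}^{K} \frac{C_{k}^{(t)}}{2} \sum_{j \in G_{k}} w_{j}^{2}
```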
Experiments
- 4 probabilistic models
  – Linear regression (too easy, not shown)
  – Binary logistic regression
  – Multinomial logistic regression
  – Conditional log-linear model
- 3 competing algorithms
  – Grid search
  – Gradient-based algorithm (Do et al., 2007)
  – Direct optimization of the new objective
- Our algorithm run with α = 0, β = 1
Results: Binary Logistic Regression
[Bar chart: test accuracy (50 to 100%) of Grid, Grad, Direct, and MM on australian, breast-cancer, diabetes, german-numer, heart, ionosphere, liver-disorders, mushrooms, sonar, splice, and w1a]
Results: Multinomial Logistic Regression
[Bar chart: test accuracy (30 to 100%) of Grid, Grad, Direct, and MM on connect-4, dna, glass, iris, letter, mnist1, satimage, segment, svmguide2, usps, vehicle, vowel, and wine]
Results: Conditional Log-Linear Models
- RNA secondary structure prediction
- Multiple hyperparameters
[Bar chart: ROC area (0.58 to 0.65) for single vs. grouped hyperparameters under the Gradient, Direct, and MM methods]
- Example input sequence and predicted structure (dot-bracket notation):
  AGCAGAGUGGCGCAGUGGAAGCGUGCUGGUCCCAUAACCCAGAGGUCCGAGGAUCGAAACCUUGCUCUGCUA
  (((((((((((((.......))))..((((((....(((....)))....))))))......))))))))).
Discussion
- How to choose α, β in Gamma prior?
  – Sensitivity experiments
  – A simple choice is reasonable
  – Further investigation required
- Simple assumptions sometimes wrong
- But competitive performance with Grid, Grad
- Suited for ‘Quick-and-dirty’ implementations