

  1. A majorization-minimization algorithm for (multiple) hyperparameter learning. Chuan-Sheng Foo, Chuong B. Do, Andrew Y. Ng (Stanford University). ICML 2009, Montreal, Canada, 17th June 2009

  2. Supervised learning
     • Training set of m IID examples; labels may be real-valued, discrete, or structured
     • Probabilistic model p(y | x; w)
     • Estimate the parameters w from the training data

  3. Regularization prevents overfitting
     • Regularized maximum likelihood estimation: the data log-likelihood plus a regularization term weighted by a hyperparameter C; e.g., L2-regularized logistic regression solves $\min_w \; -\sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; w) + \frac{C}{2}\|w\|_2^2$
     • Also maximum a posteriori (MAP) estimation: data log-likelihood plus a log-prior over the model parameters
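To make the MAP reading concrete, here is a short derivation sketch (assuming the zero-mean Gaussian prior that an L2 penalty corresponds to; the symbols w, C, and m follow the slides):

```latex
% MAP estimation with a Gaussian prior recovers L2-regularized MLE.
% Assumed prior: p(w | C) = N(0, C^{-1} I), i.e. log p(w | C) = -(C/2)||w||^2 + const.
\begin{align*}
\hat{w}_{\text{MAP}}
  &= \arg\max_w \; \log p(w \mid C) + \sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; w\big) \\
  &= \arg\min_w \; \frac{C}{2}\,\|w\|_2^2 \;-\; \sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; w\big).
\end{align*}
```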

  4. How to select the hyperparameter(s)?
     • Grid search: (+) simple to implement; (−) scales exponentially with the number of hyperparameters
     • Gradient-based algorithms: (+) scale well with the number of hyperparameters; (−) non-trivial to implement
     • Can we get the best of both worlds?

  5. Our contribution
     • Striking ease of implementation
     • Simple, closed-form updates for C
     • Leverages existing solvers
     • Scales well to the multiple-hyperparameter case
     • Applicable to a wide range of models

  6. Outline
     1. Problem definition
     2. The “integrate out” strategy
     3. The majorization-minimization algorithm
     4. Experiments
     5. Discussion

  7. The “integrate out” strategy
     • Treat the hyperparameter C as a random variable
     • Analytically integrate out C
     • Need a convenient prior p(C)

  8. Integrating out a single hyperparameter
     • For L2 regularization, the prior on w given C is Gaussian: $p(w \mid C) \propto C^{n/2} \exp(-\frac{C}{2}\|w\|_2^2)$
     • A convenient prior: a Gamma(α, β) prior on C, $p(C) \propto C^{\alpha - 1} e^{-\beta C}$
     • The result: $-\log p(w) = (\alpha + \frac{n}{2}) \log(\beta + \frac{1}{2}\|w\|_2^2) + \text{const}$
       1. C is gone
       2. Neither convex nor concave in w
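For reference, a sketch of that integration (assuming the Gaussian and Gamma forms above, consistent with the Gamma prior and the log(0.5x² + 1) plot appearing later in the deck; n denotes the number of weights):

```latex
% Integrating the hyperparameter C out of the joint prior p(w, C) = p(w | C) p(C).
% Assumed forms: p(w | C) \propto C^{n/2} \exp(-(C/2)\|w\|^2)   (Gaussian / L2 prior)
%                p(C)     \propto C^{\alpha - 1} \exp(-\beta C) (Gamma prior)
\begin{align*}
p(w) \;\propto\; \int_0^\infty C^{\alpha + n/2 - 1}
        \exp\!\Big(-C\big(\beta + \tfrac{1}{2}\|w\|_2^2\big)\Big)\, dC
     \;=\; \frac{\Gamma\!\big(\alpha + \tfrac{n}{2}\big)}
               {\big(\beta + \tfrac{1}{2}\|w\|_2^2\big)^{\alpha + n/2}},
\end{align*}
so that $-\log p(w) = \big(\alpha + \tfrac{n}{2}\big)\log\big(\beta + \tfrac{1}{2}\|w\|_2^2\big) + \text{const}$, which is neither convex nor concave in $w$.
```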

  9. The Majorization-Minimization algorithm
     • Replace the hard problem by a series of easier ones
     • EM-like; two steps:
       1. Majorization: upper-bound the objective function
       2. Minimization: minimize the upper bound

  10. MM1: Upper-bounding the new prior
     • New prior: the log term $\log(\beta + \frac{1}{2}\|w\|_2^2)$ obtained by integrating out C
     • Linearize the log: since log is concave, its first-order expansion at any point lies above it and gives an upper bound
     [Figure: plot of log(x) together with its first-order expansions at x = 1, 1.5, and 2]
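The bound being used is the tangent-line inequality for the concave logarithm (a minimal statement; $z_0$ is a symbol introduced here for the value of $\beta + \frac{1}{2}\|w\|_2^2$ at the previous iterate):

```latex
% Tangent-line (first-order Taylor) upper bound on the concave function log:
% for all z, z_0 > 0,
\log z \;\le\; \log z_0 + \frac{z - z_0}{z_0}, \qquad \text{with equality iff } z = z_0.
% Applying it with z = \beta + \tfrac{1}{2}\|w\|_2^2 and z_0 = \beta + \tfrac{1}{2}\|w^{(t)}\|_2^2
% turns the log-regularizer into a quadratic (L2) penalty in w plus a constant.
```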

  11. MM2: Solving the resultant optimization problem
     • Resultant linearized prior: a quadratic penalty in w (an effective C times $\frac{1}{2}\|w\|_2^2$), plus terms independent of w
     • Get standard L2 regularization! Use existing solvers!

  12. Visualization of the upper bound
     [Figure: two panels showing log(0.5x² + 1) with its upper bounds from expansions at x = 1, 1.5, and 2, alongside log(x) with its first-order expansions at the same points]

  13. Overall algorithm
     1. Closed-form updates for C
     2. Leverage existing solvers
     • Converges to a local minimum
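A minimal sketch of the resulting loop in Python, assuming α = 0 and β = 1 as in the experiments slide; `fit_l2_regularized` is a hypothetical placeholder for any off-the-shelf L2-regularized solver, and the closed-form update follows from the linearization sketched above:

```python
import numpy as np

def mm_hyperparameter_learning(X, y, fit_l2_regularized,
                               alpha=0.0, beta=1.0, n_iters=50, tol=1e-6):
    """Alternate between a closed-form update for the effective C and a
    standard L2-regularized fit, in the spirit of the MM algorithm.

    fit_l2_regularized(X, y, C) is assumed to return the weight vector that
    minimizes  -log-likelihood(w) + (C / 2) * ||w||^2  for the chosen model.
    """
    n = X.shape[1]                      # number of parameters / weights
    w = np.zeros(n)
    for _ in range(n_iters):
        # Majorization: linearizing log(beta + ||w||^2 / 2) at the current w
        # yields an ordinary L2 penalty with this effective hyperparameter.
        C_eff = (alpha + n / 2.0) / (beta + 0.5 * np.dot(w, w))
        # Minimization: solve the standard L2-regularized problem.
        w_new = fit_l2_regularized(X, y, C_eff)
        if np.linalg.norm(w_new - w) < tol:
            w = w_new
            break
        w = w_new
    return w, C_eff
```

Each iteration solves one ordinary L2-regularized problem, which is why the slides emphasize leveraging existing solvers.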

  14. What about multiple hyperparameters?
     • Regularization groups: partition the weight vector, e.g. w = (w1, w2, w3, w4, w5), with a mapping from weights to groups and one hyperparameter per group, C = (C1, C2)
     • NLP example (“To C or not to C. That is the question…”): unigram feature weights vs. bigram feature weights
     • RNA secondary structure prediction example: hairpin-loop weights vs. bulge-loop weights

  15. What about multiple hyperparameters?
     • Separately update each regularization group
     • Sum the weights in each group
     • Weighted L2 regularization
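By analogy with the single-hyperparameter update, a sketch of the grouped version (assuming a separate Gamma(α, β) prior per group; `groups` and `fit_weighted_l2` are hypothetical names: the former maps groups to weight indices, the latter stands for a solver accepting a per-weight penalty):

```python
import numpy as np

def mm_multiple_hyperparameters(X, y, groups, fit_weighted_l2,
                                alpha=0.0, beta=1.0, n_iters=50):
    """Per-group closed-form updates for C_k, alternated with a weighted
    L2-regularized fit.

    groups          : list of index arrays, assumed to partition the weights
    fit_weighted_l2 : callable (X, y, c_per_weight) -> w, minimizing
                      -log-likelihood(w) + 0.5 * sum_j c_j * w_j^2
    """
    n = X.shape[1]
    w = np.zeros(n)
    for _ in range(n_iters):
        c_per_weight = np.zeros(n)
        for idx in groups:
            n_k = len(idx)                           # parameters in group k
            sq = 0.5 * np.dot(w[idx], w[idx])        # half the group's squared norm
            C_k = (alpha + n_k / 2.0) / (beta + sq)  # closed-form group update
            c_per_weight[idx] = C_k
        w = fit_weighted_l2(X, y, c_per_weight)      # standard weighted-L2 solve
    return w
```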

  16. Experiments
     • 4 probabilistic models
       – Linear regression (too easy, not shown)
       – Binary logistic regression
       – Multinomial logistic regression
       – Conditional log-linear model
     • 3 competing algorithms
       – Grid search
       – Gradient-based algorithm (Do et al., 2007)
       – Direct optimization of the new objective
     • Algorithm run with α = 0, β = 1

  17. Accuracy results: binary logistic regression
     [Figure: bar chart of test accuracy (50-100%) for Grid, Grad, Direct, and MM on the australian, breast-cancer, diabetes, german-numer, heart, ionosphere, liver-disorders, mushrooms, sonar, splice, and w1a datasets]

  18. Accuracy results: multinomial logistic regression
     [Figure: bar chart of test accuracy (30-100%) for Grid, Grad, Direct, and MM on the connect-4, dna, glass, iris, letter, mnist1, satimage, segment, svmguide2, usps, vehicle, vowel, and wine datasets]

  19. Results: conditional log-linear models
     • RNA secondary structure prediction (the slide shows an example RNA sequence and its predicted structure in dot-bracket notation)
     • Multiple hyperparameters
     [Figure: bar chart of ROC area (0.58-0.65) comparing Gradient, Direct, and MM, each with single and grouped hyperparameters]

  20. Discussion
     • How to choose α, β in the Gamma prior?
       – Sensitivity experiments
       – Simple choice reasonable
       – Further investigation required
     • Simple assumptions sometimes wrong
     • But competitive performance with Grid, Grad
     • Suited for ‘quick-and-dirty’ implementations

  21. Thank you!
