A majorization-minimization algorithm for (multiple) hyperparameter learning


SLIDE 1

A majorization-minimization algorithm for (multiple) hyperparameter learning

Chuan-Sheng Foo, Chuong B. Do, Andrew Y. Ng (Stanford University)

ICML 2009, Montreal, Canada, 17th June 2009

SLIDE 2

Supervised learning

  • Training set of m IID examples (x(i), y(i))
  • Probabilistic model p(y | x; w)
  • Estimate the parameters w

Labels y may be real-valued, discrete, or structured

SLIDE 3
  • Regularized maximum likelihood estimation
  • Equivalently, maximum a posteriori (MAP) estimation

Regularization prevents overfitting: the objective combines the data log-likelihood with a log-prior over the model parameters

Example: L2-regularized logistic regression, where C is the regularization hyperparameter
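In symbols (a standard reconstruction; the Gaussian-prior convention here matches the closed-form updates later in the deck):

```latex
w^\star \;=\; \arg\min_{w}\;
\underbrace{-\sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; w\big)}_{\text{negative data log-likelihood}}
\;+\;
\underbrace{\frac{C}{2}\,\|w\|^2}_{\text{negative log-prior}}
```

The L2 penalty is the negative log of a Gaussian prior p(w | C) ∝ exp(−C‖w‖²/2).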

SLIDE 4

How to select the hyperparameter(s)?

  • Grid search

+ Simple to implement
− Scales exponentially with the number of hyperparameters

  • Gradient-based algorithms

+ Scales well with the number of hyperparameters
− Non-trivial to implement

Can we get the best of both worlds?

SLIDE 5

Our contribution

  • Striking ease of implementation: simple, closed-form updates for C
  • Leverages existing solvers
  • Scales well to the multiple-hyperparameter case
  • Applicable to a wide range of models

SLIDE 6

Outline

  1. Problem definition
  2. The “integrate out” strategy
  3. The Majorization-Minimization algorithm
  4. Experiments
  5. Discussion
SLIDE 7

The “integrate out” strategy

  • Treat the hyperparameter C as a random variable
  • Analytically integrate out C (in symbols below)
  • Needs a convenient prior p(C)
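In symbols, the marginal prior (a reconstruction of this step):

```latex
p(w) \;=\; \int_0^{\infty} p(w \mid C)\, p(C)\, dC
```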
SLIDE 8

Integrating out a single hyperparameter

  • For L2 regularization, the conditional prior p(w | C) is Gaussian
  • A convenient prior p(C): the Gamma distribution
  • The result (reconstructed below):
  • 1. C is gone
  • 2. The marginal is neither convex nor concave in w
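A reconstruction of the computation for an n-dimensional w, using a Gaussian conditional prior and a Gamma(α, β) prior on C; the resulting log(β + ‖w‖²/2) form matches the curve plotted on slide 12:

```latex
p(w \mid C) \propto C^{n/2} e^{-\frac{C}{2}\|w\|^2},
\qquad
p(C) \propto C^{\alpha-1} e^{-\beta C}
```

```latex
p(w) \;\propto\; \int_0^{\infty} C^{\alpha + n/2 - 1}\,
e^{-C\left(\beta + \frac{1}{2}\|w\|^2\right)}\, dC
\;=\; \frac{\Gamma(\alpha + n/2)}{\left(\beta + \frac{1}{2}\|w\|^2\right)^{\alpha + n/2}}
```

So the new regularizer is −log p(w) = (α + n/2) log(β + ‖w‖²/2) + const: C has been eliminated, at the price of a term that is neither convex nor concave in w.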
SLIDE 9

The Majorization-Minimization Algorithm

  • Replace a hard problem with a series of easier ones
  • EM-like, with two steps:
  • 1. Majorization: upper-bound the objective function
  • 2. Minimization: minimize the upper bound
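In general terms (standard MM, stated here for reference): at iterate w_t, build a surrogate g that upper-bounds the objective f and touches it at w_t, then minimize the surrogate:

```latex
g(w; w_t) \ge f(w) \;\;\forall w,
\qquad g(w_t; w_t) = f(w_t),
\qquad w_{t+1} = \arg\min_{w} g(w; w_t)
```

This guarantees monotone descent: f(w_{t+1}) ≤ g(w_{t+1}; w_t) ≤ g(w_t; w_t) = f(w_t).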

SLIDE 10

MM1: Upper-bounding the new prior

  • New prior term: (α + n/2) log(β + ‖w‖²/2)
  • Linearize the log at the current iterate (tangent-line upper bound; see below)

[Figure: log(x) and its tangent-line upper bounds, expanded at x = 1, 1.5, and 2]
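The linearization rests on the tangent-line bound for the concave logarithm, which is what the figure plots:

```latex
\log t \;\le\; \log t_0 + \frac{t - t_0}{t_0}
\qquad \text{for all } t, t_0 > 0, \text{ with equality at } t = t_0
```

Substituting t = β + ‖w‖²/2 gives an upper bound on the new prior term that is linear in ‖w‖², i.e., an ordinary L2 penalty.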

SLIDE 11

MM2: Solving the resultant optimization problem

  • Substitute the linearized prior; terms independent of w can be dropped
  • The result is standard L2 regularization!

Use existing solvers!
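Concretely, expanding the bound at the current iterate w_t (a reconstruction consistent with the derivation above), the minimization step is a standard L2-regularized fit:

```latex
w_{t+1} = \arg\min_{w}\; -\sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; w\big) + \frac{C_t}{2}\|w\|^2,
\qquad
C_t = \frac{\alpha + n/2}{\beta + \frac{1}{2}\|w_t\|^2}
```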

SLIDE 12

Visualization of the upper bound

[Figure: left, log(0.5x² + 1) with its upper bounds from expansions at x = 1, 1.5, and 2; right, log(x) with the corresponding tangent lines]

SLIDE 13

Overall algorithm

  • 1. Closed-form update for C
  • 2. Solve a standard L2-regularized problem with existing solvers

Alternate the two steps; converges to a local minimum (sketch below)
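A minimal runnable sketch of the loop for binary logistic regression, assuming scikit-learn's LogisticRegression as the inner solver (the function name and solver choice are illustrative, not the authors' implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mm_hyperparameter_fit(X, y, alpha=0.0, beta=1.0, n_iters=20):
    """Alternate a closed-form update of C with a standard L2-regularized fit."""
    n = X.shape[1]   # number of model parameters
    C = 1.0          # initial regularization hyperparameter
    w = np.zeros(n)
    for _ in range(n_iters):
        # Minimization step: standard L2-regularized logistic regression.
        # scikit-learn's C is the *inverse* penalty strength, hence 1/C.
        clf = LogisticRegression(C=1.0 / C, fit_intercept=False)
        clf.fit(X, y)
        w = clf.coef_.ravel()
        # Majorization step: closed-form update of C under a Gamma(alpha, beta)
        # prior on C; this is the tangent-line bound from the previous slides.
        C = (alpha + n / 2.0) / (beta + 0.5 * w @ w)
    return w, C
```

Each iteration only calls an off-the-shelf L2 solver, which is the point: no gradient code for the hyperparameter is needed.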

SLIDE 14

What about multiple hyperparameters?

  • Regularization groups

w = ( w1, w2, w3, w4, w5 )

Example groupings: NLP (unigram feature weights, bigram feature weights); RNA secondary structure prediction (hairpin loops, bulge loops)

C = ( C1 , C2 )

Mapping from weights to groups: “To C or not to C. That is the question…”
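With groups, each C_k can be integrated out independently; a sketch of the factored marginal, assuming a shared Gamma(α, β) prior and writing n_k for the number of weights in group k:

```latex
p(w) \;=\; \prod_{k} \int_0^{\infty} p(w_k \mid C_k)\, p(C_k)\, dC_k
\;\propto\; \prod_{k} \left(\beta + \tfrac{1}{2}\|w_k\|^2\right)^{-(\alpha + n_k/2)}
```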

SLIDE 15

What about multiple hyperparameters?

Separately update each regularization group: sum the squared weights within each group, giving a weighted L2 regularization (see below)
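The single-hyperparameter update then applies group by group (extending the earlier derivation; n_k is the number of weights in group k):

```latex
C_k \;\leftarrow\; \frac{\alpha + n_k/2}{\beta + \frac{1}{2}\|w_k\|^2},
\qquad
\text{penalty} \;=\; \sum_{k} \frac{C_k}{2}\,\|w_k\|^2
```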

SLIDE 16

Experiments

  • 4 probabilistic models

– Linear regression (too easy, not shown)
– Binary logistic regression
– Multinomial logistic regression
– Conditional log-linear model

  • 3 competing algorithms

– Grid search
– Gradient-based algorithm (Do et al., 2007)
– Direct optimization of the new objective

  • Algorithm run with α = 0, β = 1
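With α = 0 and β = 1, the closed-form update specializes to

```latex
C_t \;=\; \frac{n/2}{1 + \frac{1}{2}\|w_t\|^2}
```

and the marginal regularizer is proportional to log(1 + ‖w‖²/2), the curve shown on slide 12.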
SLIDE 17

Results: Binary Logistic Regression

[Figure: test accuracy (%) of Grid, Grad, Direct, and MM on australian, breast-cancer, diabetes, german-numer, heart, ionosphere, liver-disorders, mushrooms, sonar, splice, and w1a; y-axis from 50 to 100]

SLIDE 18

Results: Multinomial Logistic Regression

[Figure: test accuracy (%) of Grid, Grad, Direct, and MM on connect-4, dna, glass, iris, letter, mnist, satimage, segment, svmguide2, usps, vehicle, vowel, and wine; y-axis from 30 to 100]

SLIDE 19

Results: Conditional Log-Linear Models

  • RNA secondary structure prediction
  • Multiple hyperparameters

[Figure: ROC area, from 0.58 to 0.65, for Single, Grouped, Gradient, Direct, and MM]

[Example: an RNA sequence and its predicted secondary structure in dot-bracket notation]

SLIDE 20

Discussion

  • How to choose α, β in the Gamma prior?
    – Sensitivity experiments
    – A simple choice appears reasonable
    – Further investigation required
  • The simple modeling assumptions are sometimes wrong
  • But performance is competitive with Grid and Grad
  • Well suited for quick-and-dirty implementations
SLIDE 21

Thank you!