

SLIDE 1

Empirical Risk Minimization

October 29, 2015

SLIDE 2

Outline

  • Empirical risk minimization view

– Perceptron
– CRF

SLIDE 3

Notation for Linear Models

  • Training data: {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}
  • Testing data: {(x_{N+1}, y_{N+1}), …, (x_{N+N′}, y_{N+N′})}
  • Feature function: g
  • Weights: w
  • Decoding: ŷ = argmax_y w · g(x, y)  (sketched in code after this list)
  • Learning: choosing w using the training data
  • Evaluation: comparing predictions ŷ to the gold y on the testing data
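
As a concrete reading of this notation, here is a minimal sketch of decoding, assuming a toy problem whose output space is small enough to enumerate; w, g, x, and the candidate set are all illustrative placeholders, not anything prescribed by the slides.

```python
import numpy as np

def decode(w, g, x, candidates):
    """Decoding: return the y maximizing the linear score w . g(x, y)."""
    scores = [np.dot(w, g(x, y)) for y in candidates]
    return candidates[int(np.argmax(scores))]
```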
SLIDE 4

Structured Perceptron

  • Described as an online algorithm.
  • On each iteration, take one example and update the weights according to:
        w ← w + g(x_i, y_i) − g(x_i, ŷ_i),  where ŷ_i = argmax_y w · g(x_i, y)
    (a runnable sketch of this update follows this list)
  • Not discussing today: the theoretical guarantees this gives, separability, and the averaged and voted versions.
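
A minimal sketch of this update in code, reusing the toy enumeration assumption from above; a real structured perceptron would replace the enumeration with a dynamic-programming decoder, and all names here are illustrative.

```python
import numpy as np

def perceptron_epoch(w, data, g, candidates):
    """One online pass of the structured perceptron.

    data: list of (x, y) pairs with hashable labels y;
    g(x, y): feature vector; candidates(x): enumerable output space.
    """
    for x, y in data:
        # Decode with the current weights.
        y_hat = max(candidates(x), key=lambda yc: np.dot(w, g(x, yc)))
        if y_hat != y:
            # On a mistake, move toward the gold features and away
            # from the features of the wrong prediction.
            w = w + g(x, y) - g(x, y_hat)
    return w
```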

SLIDE 5

Empirical Risk Minimization

  • A unifying framework for many learning algorithms: choose w to minimize
        Σ_{i=1}^{N} L(x_i, y_i, w) + R(w)
  • Many options for the loss function L and the regularization function R. (A code sketch of the objective follows below.)
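
In code, the objective is just a sum; the sketch below assumes loss(x, y, w) is any per-example loss L and uses the squared L2 norm (scaled by lam) as one illustrative choice of R.

```python
import numpy as np

def empirical_risk(w, data, loss, lam=0.1):
    """Regularized empirical risk: sum_i L(x_i, y_i, w) + R(w)."""
    return sum(loss(x, y, w) for x, y in data) + lam * np.dot(w, w)
```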

SLIDE 6

Solving the Minimization Problem

  • In some friendly cases, there is a closed-form solution for the minimizing w.
– E.g., the maximum likelihood estimator for HMMs.
  • Usually, we have to use an iterative algorithm which amounts to progressively finding better versions of w (a generic sketch follows below).
– This involves hard/soft inference with each improved value of w, on either part or all of the training set.
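
A generic sketch of such an iterative algorithm, assuming risk_grad(w) returns a (sub)gradient of the regularized empirical risk; the eta0/sqrt(t+1) step size is one common schedule, not something the slides prescribe.

```python
import numpy as np

def subgradient_descent(w0, risk_grad, steps=100, eta0=0.1):
    """Progressively find better versions of w.

    Each evaluation of risk_grad typically requires hard/soft
    inference with the current w, as noted above.
    """
    w = w0.copy()
    for t in range(steps):
        w = w - (eta0 / np.sqrt(t + 1)) * risk_grad(w)
    return w
```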

SLIDE 7

Loss Functions You May Know

Name                           Expression of L(x_i, y_i, w)
Log loss (joint)               −log p_w(x_i, y_i)
Log loss (conditional)         −log p_w(y_i | x_i)
Zero-one loss                  1{ŷ(x_i) ≠ y_i}
Expected zero-one loss         Σ_y p_w(y | x_i) · 1{y ≠ y_i}
SLIDE 9

Loss Functions You May Know

Name                           Expression of L(x_i, y_i, w)
Log loss (joint)               −log p_w(x_i, y_i)
Log loss (conditional)         −log p_w(y_i | x_i)
Cost                           cost(x_i, y_i, ŷ(x_i))
Expected cost, a.k.a. “risk”   Σ_y p_w(y | x_i) · cost(x_i, y_i, y)

SLIDE 10

CRFs and Loss

  • Plugging in the log-linear form (and not worrying at this level about locality of features):
        p_w(y | x) = exp(w · g(x, y)) / Σ_y′ exp(w · g(x, y′))
    the conditional log loss becomes:
        L(x_i, y_i, w) = −w · g(x_i, y_i) + log Σ_y exp(w · g(x_i, y))
    (sketched in code below)
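
A sketch of this loss in code, again enumerating a small output space for clarity; real CRFs compute the log-partition term with dynamic programming over local features. The max is subtracted before exponentiating for numerical stability.

```python
import numpy as np

def crf_log_loss(w, g, x, y, candidates):
    """Conditional log loss: -w.g(x, y) + log sum_y' exp(w.g(x, y'))."""
    scores = np.array([np.dot(w, g(x, yc)) for yc in candidates])
    m = scores.max()  # stabilize the log-sum-exp
    log_z = m + np.log(np.exp(scores - m).sum())
    return -np.dot(w, g(x, y)) + log_z
```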

SLIDE 12

Training CRFs and Other Linear Models

  • Early days: iterative scaling (a specialized method for log-linear models only)
  • ~2002: quasi-Newton methods
– (using L-BFGS, which dates from the late 1980s; see the sketch after this list)
  • ~2006: stochastic gradient descent
  • ~2010: adaptive gradient methods
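
For the quasi-Newton entry, SciPy's L-BFGS implementation is one readily available option; the sketch below assumes objective(w) returns the regularized loss and its gradient as a pair, and the function names are placeholders rather than a specific toolkit's API.

```python
import numpy as np
from scipy.optimize import minimize

def train_lbfgs(objective, dim):
    """Batch training with L-BFGS; objective maps w to (loss, gradient)."""
    result = minimize(objective, np.zeros(dim), jac=True, method="L-BFGS-B")
    return result.x
```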
SLIDE 13

Perceptron and Loss

  • Not immediately clear what L is, but the “gradient” of L should be:
        g(x_i, ŷ_i) − g(x_i, y_i),  where ŷ_i = argmax_y w · g(x_i, y)
  • The vector of the above quantities is actually a subgradient of:
        L(x_i, y_i, w) = −w · g(x_i, y_i) + max_y w · g(x_i, y)

SLIDE 14

Compare

  • CRF (log loss):  −w · g(x_i, y_i) + log Σ_y exp(w · g(x_i, y))
  • Perceptron:      −w · g(x_i, y_i) + max_y w · g(x_i, y)

The only difference is how competitor scores are aggregated: log-sum-exp (a soft max) versus a hard max; see the side-by-side sketch below.
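
A hedged sketch of the comparison, reusing the toy enumeration from the earlier examples:

```python
import numpy as np

def compare_losses(w, g, x, y, candidates):
    """Return (CRF log loss, perceptron loss) for one example."""
    scores = np.array([np.dot(w, g(x, yc)) for yc in candidates])
    gold = np.dot(w, g(x, y))
    m = scores.max()
    log_sum_exp = m + np.log(np.exp(scores - m).sum())  # "soft" max
    return -gold + log_sum_exp, -gold + scores.max()    # hard max
```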

SLIDE 15

Loss Functions

SLIDE 16

Loss Functions You Know

Name                           Expression of L(x_i, y_i, w)              Convex?
Log loss (joint)               −log p_w(x_i, y_i)                        yes
Log loss (conditional)         −log p_w(y_i | x_i)                       yes
Cost                           cost(x_i, y_i, ŷ(x_i))                    no
Expected cost, a.k.a. “risk”   Σ_y p_w(y | x_i) · cost(x_i, y_i, y)      no
Perceptron loss                −w · g(x_i, y_i) + max_y w · g(x_i, y)    yes


SLIDE 17

Loss Functions You Know

Name                           Expression of L(x_i, y_i, w)              Continuous?
Log loss (joint)               −log p_w(x_i, y_i)                        yes
Log loss (conditional)         −log p_w(y_i | x_i)                       yes
Cost                           cost(x_i, y_i, ŷ(x_i))                    no
Expected cost, a.k.a. “risk”   Σ_y p_w(y | x_i) · cost(x_i, y_i, y)      yes
Perceptron loss                −w · g(x_i, y_i) + max_y w · g(x_i, y)    yes


SLIDE 18

Loss Functions You Know

Name                           Expression of L(x_i, y_i, w)              Cost-aware?
Log loss (joint)               −log p_w(x_i, y_i)                        no
Log loss (conditional)         −log p_w(y_i | x_i)                       no
Cost                           cost(x_i, y_i, ŷ(x_i))                    yes
Expected cost, a.k.a. “risk”   Σ_y p_w(y | x_i) · cost(x_i, y_i, y)      yes
Perceptron loss                −w · g(x_i, y_i) + max_y w · g(x_i, y)    no


SLIDE 19

The Ideal Loss Function

For computational convenience:

  • Convex
  • Continuous

For good performance:

  • Cost-aware
  • Theoretically sound
SLIDE 20

On Regularization

  • In principle, this choice is independent of the choice of the loss function.
  • The squared L2 norm, R(w) = λ‖w‖₂², is the most common starting place.
  • L1 regularization, R(w) = λ‖w‖₁, and other sparsity-inducing regularizers, as well as structured regularizers, are also of interest. (A sketch of both appears below.)
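A minimal sketch of the two regularizers named above, with lam playing the role of λ:

```python
import numpy as np

def l2_reg(w, lam):
    """Squared L2 norm: the most common starting place."""
    return lam * np.dot(w, w)

def l1_reg(w, lam):
    """L1 norm: a sparsity-inducing alternative."""
    return lam * np.abs(w).sum()
```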

SLIDE 21

Practical Advice

  • Features are still more important than the loss function.
– But general, easy-to-implement algorithms are quite useful!
  • The perceptron is easiest to implement.
  • CRFs and max-margin techniques usually do better.
  • Tune the regularization constant, λ.
– Never on the test data.