SLIDE 1

Classification II

Sachin Kumar (CMU)
Slides: Dan Klein (UC Berkeley), Taylor Berg-Kirkpatrick (CMU)

Algorithms for NLP

SLIDE 2

Minimize Training Error?

▪ A loss function declares how costly each mistake is

▪ E.g. 0 loss for correct label, 1 loss for wrong label
▪ Can weight mistakes differently (e.g. false positives worse than false negatives, or Hamming distance over structured labels)

▪ We could, in principle, minimize training loss directly (sketched below)
▪ This is a hard, discontinuous optimization problem
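A hedged sketch of what "minimize training loss" means here, using the weight vector $w$ and feature function $f(x, y)$ that appear throughout the deck:

$$\min_w\ \sum_i \mathbf{1}\big[\, y_i \ne \arg\max_y\ w^\top f(x_i, y) \,\big].$$

The objective is piecewise constant in $w$, which is why it is hard and discontinuous.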

SLIDE 3

Objective Functions

▪ What do we want from our weights?

▪ Depends!
▪ So far: minimize (training) errors - this is the "zero-one loss"

▪ Discontinuous, minimizing is NP-complete

▪ Maximum entropy and SVMs have other objectives related to zero-one loss
SLIDE 4

Linear Models: Maximum Entropy

▪ Maximum entropy (logistic regression)

▪ Use the scores as probabilities:

  $P(y \mid x; w) = \dfrac{\exp(w^\top f(x, y))}{\sum_{y'} \exp(w^\top f(x, y'))}$

  (the exponential makes the scores positive; the denominator normalizes them)

▪ Maximize the (log) conditional likelihood of training data (see the sketch below)
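A minimal numpy sketch (not from the slides; names are illustrative) of the two steps above, exponentiate then normalize, plus the log conditional likelihood that training maximizes:

```python
import numpy as np

def maxent_probs(w, feats):
    """feats has one row per candidate label: feats[y] = f(x, y). Returns P(y | x; w)."""
    scores = feats @ w                    # linear scores w . f(x, y)
    scores = scores - scores.max()        # subtract max for numerical stability
    exp_scores = np.exp(scores)           # make positive
    return exp_scores / exp_scores.sum()  # normalize

def log_likelihood(w, data):
    """Sum of log P(y_i | x_i; w) over examples given as (feature_matrix, gold_index) pairs."""
    return sum(np.log(maxent_probs(w, feats)[gold]) for feats, gold in data)
```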

SLIDE 5

Maximum Entropy II

▪ Motivation for maximum entropy:

▪ Connection to maximum entropy principle (sort of)
▪ Might want to do a good job of being uncertain on noisy cases…
▪ … in practice, though, posteriors are pretty peaked

▪ Regularization (smoothing)

SLIDE 6

Log-Loss

▪ If we view maxent as a minimization problem, it minimizes the "log loss" on each example
▪ One view: log loss is an upper bound on zero-one loss
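A hedged reconstruction of the missing formula: as a minimization, maxent training solves

$$\min_w\ \sum_i -\log P(y_i \mid x_i; w),$$

and each term $-\log P(y_i \mid x_i; w)$ is the log loss for example $i$. In the binary case a misclassified example has $P(y_i \mid x_i) \le \tfrac{1}{2}$, so its base-2 log loss is at least 1, which is one sense in which log loss upper-bounds zero-one loss.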

SLIDE 7

Maximum Margin

▪ Non-separable SVMs

▪ Add slack to the constraints
▪ Make objective pay (linearly) for slack (see the sketch below)
▪ C is called the capacity of the SVM (the smoothing knob)

▪ Learning:

▪ Can still stick this into Matlab if you want
▪ Constrained optimization is hard; better methods exist!
▪ We'll come back to this later

Note: other choices of how to penalize slack exist!
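A hedged sketch of the objective described above (the slide's own formula is not in the extracted text), written in the multiclass-margin form used later for structured outputs:

$$\min_{w,\ \xi \ge 0}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i \qquad \text{s.t.}\quad w^\top f(x_i, y_i) \ \ge\ w^\top f(x_i, y) + \ell(y_i, y) - \xi_i \quad \forall i,\ \forall y,$$

where $\ell(y_i, y)$ is the loss function from the earlier slides and the slacks $\xi_i$ are paid for linearly, with $C$ controlling how much.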

SLIDE 8

Remember SVMs…

▪ We had a constrained minimization
▪ …but we can solve for ξi
▪ Giving the unconstrained form sketched below
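Sketching the substitution the bullets describe: at the optimum each slack is as small as the constraints allow, $\xi_i = \max_y \big[ w^\top f(x_i, y) + \ell(y_i, y) \big] - w^\top f(x_i, y_i)$ (never negative, assuming $\ell(y_i, y_i) = 0$), giving the unconstrained problem

$$\min_w\ \frac{1}{2}\lVert w \rVert^2 + C \sum_i \Big( \max_y \big[ w^\top f(x_i, y) + \ell(y_i, y) \big] - w^\top f(x_i, y_i) \Big).$$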

SLIDE 9

Hinge Loss

▪ This is called the "hinge loss"

▪ Unlike maxent / log loss, you stop gaining objective once the true label wins by enough
▪ You can start from here and derive the SVM objective
▪ Can solve directly with subgradient descent (e.g. Pegasos: Shalev-Shwartz et al., 2007)

▪ Consider the per-instance objective (sketched in code below)

(Note: the hinge-loss plot on the original slide is really only right in the binary case.)
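A minimal Pegasos-style sketch of a subgradient step on this per-instance objective for the simple multiclass case; the 0/1 loss, the feature layout, and the 1/(λt) step size are standard assumptions, not details taken from the slides.

```python
import numpy as np

def pegasos_step(w, feats, gold, lam, t):
    """One subgradient step on the per-instance structured hinge objective.

    feats[y] = f(x, y) for each candidate label y, gold = index of the true label,
    lam = regularization strength, t = current step number.
    """
    augmented = feats @ w + 1.0      # scores plus 0/1 loss ...
    augmented[gold] -= 1.0           # ... which is 0 for the gold label
    y_hat = int(np.argmax(augmented))
    eta = 1.0 / (lam * t)            # standard Pegasos step size
    w = (1.0 - eta * lam) * w        # shrink: subgradient of the L2 regularizer
    if y_hat != gold:                # hinge active: move toward gold, away from y_hat
        w = w + eta * (feats[gold] - feats[y_hat])
    return w
```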

SLIDE 10

Subgradient Descent

▪ Recall gradient descent
▪ Doesn't work for non-differentiable functions
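One clarifying definition (standard, not taken verbatim from the slides): $g$ is a subgradient of a convex function $h$ at $w$ if $h(v) \ge h(w) + g^\top (v - w)$ for all $v$. At non-differentiable points such as the kink of the hinge, any such $g$ can stand in for the gradient, and the update $w \leftarrow w - \eta_t\, g$ still converges for suitable step sizes $\eta_t$.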

SLIDE 11

Subgradient Descent

SLIDE 12

Subgradient Descent

▪ Example

SLIDE 13

Subgradient Descent

▪ Example

SLIDE 14

Subgradient Descent

SLIDE 15

Structure

SLIDE 16

CFG Parsing

Example: x = "The screen was a sea of red" (the sentence), y = its parse tree (a recursive structure)

SLIDE 17

Generative vs Discriminative

  • Generative Models have many advantages

○ Can model both p(x) and p(y|x)
○ Learning is often clean and analytical: relative frequency estimation on the Penn Treebank

  • Disadvantages?

○ Force us to make rigid independence assumptions (the context-free assumption)

SLIDE 18

Generative vs Discriminative

  • We get more freedom in defining features - no independence assumptions required

  • Disadvantages?

○ Computationally intensive
○ Use of more features can make decoding harder

SLIDE 19

Structured Models

Assumption: the score is a sum of local "part" scores (parts = nodes, edges, productions), with outputs ranging over the space of feasible outputs (see the sketch below)
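A hedged sketch of the factorization stated above:

$$\mathrm{score}(x, y; w) \ =\ \sum_{p \,\in\, \mathrm{parts}(x, y)} w^\top f(x, p), \qquad y \in \mathcal{Y}(x),$$

where the parts are nodes, edges, or productions and $\mathcal{Y}(x)$ is the space of feasible outputs for input $x$.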

SLIDE 20

Efficient Decoding

▪ Common case: you have a black box which computes the highest-scoring output, at least approximately, and you want to learn w
▪ The easiest option is the structured perceptron [Collins 01] (sketched below)

▪ Structure enters here in that the search for the best y is typically a combinatorial algorithm (dynamic programming, matchings, ILPs, A*, …)
▪ Prediction is structured; the learning update is not
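A minimal sketch of the structured perceptron loop described above; `decode` stands in for the black-box (possibly approximate) argmax and `features` for the joint feature function, both assumed here rather than taken from the slides.

```python
def structured_perceptron(train, features, decode, epochs=5):
    """train: list of (x, gold_y). decode(x, w) returns the (approx.) best y under w."""
    w = {}                                           # sparse weight vector
    for _ in range(epochs):
        for x, gold_y in train:
            pred_y = decode(x, w)                    # structured prediction (DP, ILP, A*, ...)
            if pred_y != gold_y:                     # simple additive perceptron update
                for feat, val in features(x, gold_y).items():
                    w[feat] = w.get(feat, 0.0) + val
                for feat, val in features(x, pred_y).items():
                    w[feat] = w.get(feat, 0.0) - val
    return w
```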

SLIDE 21

Max-Ent, Structured, Global

  • Assumption: Score is sum of local “part” scores
SLIDE 22

Max-Ent, Structured, Global

  • What do we need to compute the gradients?

○ Log normalizer
○ Expected feature counts (inside-outside algorithm)

  • How to decode?

○ Search algorithms like Viterbi (CKY)
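A hedged sketch of the quantities the bullets list: for the globally normalized model $P(y \mid x; w) \propto \exp\big(\mathrm{score}(x, y; w)\big)$, the gradient of the conditional log-likelihood is

$$\sum_i \Big( f(x_i, y_i) \ -\ \mathbb{E}_{P(y \mid x_i; w)}\big[ f(x_i, y) \big] \Big),$$

so training needs the log normalizer and expected feature counts (computable with the inside-outside algorithm), while decoding only needs the argmax (Viterbi / CKY).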

SLIDE 23

Max-Ent, Structured, Local

  • We assume that we can arrive at a globally optimal solution by making locally optimal choices.

  • We can use arbitrarily complex features over the history and lookahead over the future.

  • We can perform very efficient parsing, often with linear time complexity.

  • Shift-reduce parsers
SLIDE 24

Structured Margin (Primal)

Remember our primal margin objective? Still applies with structured output space!

SLIDE 25

Structured Margin (Primal)

Just need an efficient loss-augmented decode (sketched below). Can still use general subgradient descent methods (e.g. AdaGrad)!
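A sketch of what the loss-augmented decode computes, under the usual structured-margin setup:

$$\hat{y}_i \ =\ \arg\max_y\ \big[\, w^\top f(x_i, y) + \ell(y_i, y) \,\big],$$

i.e. the ordinary decoder with the loss folded into the part scores, which is exactly the piece needed to take a subgradient of the structured hinge objective.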

SLIDE 26

Structured Margin

▪ Remember the constrained version of primal:

SLIDE 27

▪ We want the correct output to outscore every alternative output by its loss margin
▪ Equivalently, one constraint per alternative output (see the reconstruction below)
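A hedged reconstruction of the two missing constraint sets: we want

$$w^\top f(x_i, y_i) \ \ge\ w^\top f(x_i, y) + \ell(y_i, y) \qquad \forall i,\ \forall y,$$

or equivalently

$$w^\top f(x_i, y_i) \ \ge\ \max_y\ \big[\, w^\top f(x_i, y) + \ell(y_i, y) \,\big] \qquad \forall i.$$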

[Figure: many different candidate parse trees for the single sentence 'It was red', each yielding its own margin constraint. Many constraints, a lot!]

SLIDE 28

Structured Margin - Working Set

SLIDE 29

Working Set S-SVM

  • Working Set n-slack Algorithm
  • Working Set 1-slack Algorithm
  • Cutting Plane 1-Slack Algorithm [Joachims et al 09]

○ Requires the dual formulation
○ Much faster convergence
○ In practice, works as fast as the perceptron, with more stable training

SLIDE 30

Duals and Kernels

SLIDE 31

Nearest Neighbor Classification

SLIDE 32

Non-Parametric Classification

SLIDE 33

A Tale of Two Approaches...

SLIDE 34

Perceptron, Again

SLIDE 35

Perceptron Weights

SLIDE 36

Dual Perceptron

SLIDE 37

Dual/Kernelized Perceptron

SLIDE 38

Issues with Dual Perceptron

SLIDE 39

Kernels: Who cares?

SLIDE 40

Example: Kernels

▪ Quadratic kernels

SLIDE 41

Non-Linear Separators

▪ Another view: kernels map an original feature space to some higher-dimensional feature space where the training set is (more) separable

Φ: y → φ(y)

SLIDE 42

Why Kernels?

▪ Can’t you just add these features on your own (e.g. add all pairs of features instead of using the quadratic kernel)?

▪ Yes, in principle, just compute them
▪ No need to modify any algorithms
▪ But the number of features can get large (or infinite)
▪ Some kernels are not as usefully thought of in their expanded representation, e.g. RBF kernels or data-defined kernels [Henderson and Titov 05]

▪ Kernels let us compute with these features implicitly

▪ Example: the implicit dot product in the quadratic kernel takes much less space and time per dot product (see the sketch below)
▪ Of course, there's the cost of using the pure dual algorithms…
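A small sketch (not from the slides) of the implicit-versus-explicit point for the quadratic kernel $K(u, v) = (u \cdot v)^2$: the kernel value equals a dot product over all pairwise coordinate products, without ever building those features.

```python
import numpy as np

def quadratic_kernel(u, v):
    """K(u, v) = (u . v)^2, computed implicitly in O(d) time."""
    return float(np.dot(u, v)) ** 2

def quadratic_features(u):
    """Explicit expansion: all pairwise products u_i * u_j (d^2 features)."""
    return np.outer(u, u).ravel()

u, v = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
implicit = quadratic_kernel(u, v)
explicit = float(np.dot(quadratic_features(u), quadratic_features(v)))
assert np.isclose(implicit, explicit)   # same value, without building d^2 features
```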

SLIDE 43

Tree Kernels

SLIDE 44

Dual Formulation of SVM

SLIDE 45

Dual Formulation II

SLIDE 46

Dual Formulation III

SLIDE 47

Back to Learning SVMs

SLIDE 48

What are these alphas?

SLIDE 49

Comparison

SLIDE 50

To summarize

  • Can solve structured versions of Max-Ent and SVMs

  • Our feature model factors into reasonably local, non-overlapping structures (why?)

  • Issues?

○ Limited Scope of Features

SLIDE 51

Reranking

SLIDE 52

Training the reranker

▪ Training data:
▪ Generate candidate parses for each x
▪ Loss function:

SLIDE 53

Baseline and Oracle Results

Collins Model 2

SLIDE 54

Experiment 1: Only “old” features

SLIDE 55

Right Branching Bias

SLIDE 56

Other Features

▪ Heaviness: what is the span of a rule?

▪ Neighbors of a span
▪ Span shape
▪ N-gram features
▪ Probability of the parse tree
▪ …

SLIDE 57

Results with all the features

SLIDE 58

Reranking

▪ Advantages:

▪ Directly reduces to the non-structured case
▪ No locality restriction on features

▪ Disadvantages:

▪ Stuck with errors of the baseline parser
▪ Baseline system must produce n-best lists
▪ But feedback is possible [McClosky, Charniak, Johnson 2006]
▪ But a reranker (almost) never performs worse than a generative parser, and in practice performs substantially better

SLIDE 59

Reranking in other settings