SLIDE 1

Classification II

Sachin Kumar – CMU. Slides: Dan Klein – UC Berkeley, Taylor Berg-Kirkpatrick – CMU

Algorithms for NLP

SLIDE 2

Parsing as Classification

  • Input: Sentence X
  • Output: Parse Y
  • Potentially millions of candidates

Example input x: "The screen was a sea of red"; output y: its parse tree

SLIDE 3

Generative Model for Parsing

  • PCFG: Model joint probability P(S, Y)
  • Many advantages

○ Learning is often clean and analytical: count and divide

  • Disadvantages?

○ Rigid independence assumptions
○ Lack of sensitivity to lexical information
○ Lack of sensitivity to structural frequencies
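For contrast with the disadvantages above, here is a minimal sketch of the "count and divide" learning mentioned under the advantages: PCFG rule probabilities are just relative frequencies of rules in a treebank. The rule counts below are made up purely for illustration.

```python
from collections import defaultdict

# Toy treebank rules, written as (parent, children) tuples.
# The counts are hypothetical, purely for illustration.
rule_counts = {
    ("NP", ("DT", "NN")): 30,
    ("NP", ("NP", "PP")): 10,
    ("NP", ("PRP",)): 20,
}

# "Count and divide": P(rule) = count(rule) / count(parent nonterminal)
parent_totals = defaultdict(int)
for (parent, _), c in rule_counts.items():
    parent_totals[parent] += c

rule_probs = {rule: c / parent_totals[rule[0]] for rule, c in rule_counts.items()}
print(rule_probs[("NP", ("DT", "NN"))])  # 30 / 60 = 0.5
```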

SLIDE 4

Lack of sensitivity to lexical information

SLIDE 5

Lack of sensitivity to structural frequencies: Coordination Ambiguity

SLIDE 6

Lack of sensitivity to structural frequencies: Close attachment

SLIDE 7

Discriminative Model for Parsing

  • Directly estimate the score of y given X
  • Distribution-free: minimize expected loss
  • Advantages?

○ We get more freedom in defining features

■ No independence assumptions required
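A hedged illustration of what "freedom in defining features" can mean for a linear discriminative scorer: features may inspect any part of the (sentence, parse) pair at once. The parse representation and feature names below are invented for this sketch.

```python
# Invented representation: a parse is a list of (rule, (i, j)) spans over the sentence.
def features(sentence, parse):
    feats = {}
    for rule, (i, j) in parse:
        for name in (f"rule={rule}",                       # plain rule identity
                     f"rule={rule}&len={j - i}",           # structural context (span length)
                     f"rule={rule}&first={sentence[i]}"):  # lexical context (first word)
            feats[name] = feats.get(name, 0) + 1
    return feats

def score(weights, sentence, parse):
    # Linear model: score(x, y) = w . f(x, y); no independence assumptions needed.
    return sum(weights.get(k, 0.0) * v for k, v in features(sentence, parse).items())
```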

SLIDE 8

Example: Right branching

SLIDE 9

Example: Complex Features

SLIDE 10

How to train?

  • Minimize training error?

○ Loss function for each example i: 0 when the label is correct, 1 otherwise

  • Training Error to minimize
SLIDE 11

Objective Function

  • The step function returns 1 when its argument is negative, 0 otherwise
  • Difficult to optimize: zero gradients everywhere
  • Solution: optimize differentiable upper bounds of this function: MaxEnt or SVM
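A reconstruction, in notation consistent with the rest of the section, of the objective these bullets describe (the slide's own symbols are not preserved in the text, so treat the exact form as an assumption):

$$
\mathrm{TrainErr}(w) \;=\; \sum_{i} \operatorname{step}\Big( w^{\top} f(x_i, y_i) \;-\; \max_{y \neq y_i} w^{\top} f(x_i, y) \Big),
\qquad
\operatorname{step}(z) = \begin{cases} 1 & z < 0 \\ 0 & \text{otherwise} \end{cases}
$$

It counts an error whenever some incorrect candidate outscores the correct one; because the step function is piecewise constant, its gradient is zero almost everywhere, which is why the differentiable upper bounds (MaxEnt, SVM) are optimized instead.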

SLIDE 12

Linear Models: Perceptron

▪ The (online) perceptron algorithm:

▪ Start with zero weights w
▪ Visit training instances one by one

▪ Try to classify
▪ If correct, no change!
▪ If wrong: adjust weights
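A minimal Python sketch of the update just listed, assuming a generic feature function f(x, y) and a candidates(x) function that enumerates a small candidate set; those helper names and the small-candidate-set setting are assumptions, not part of the slide.

```python
def perceptron_train(data, candidates, f, epochs=5):
    """data: list of (x, gold_y); candidates(x): possible outputs for x;
    f(x, y): dict of feature counts. Returns a weight dict w."""
    w = {}                                    # start with zero weights
    def score(x, y):
        return sum(w.get(k, 0.0) * v for k, v in f(x, y).items())
    for _ in range(epochs):
        for x, gold in data:                  # visit training instances one by one
            pred = max(candidates(x), key=lambda y: score(x, y))  # try to classify
            if pred != gold:                  # if wrong: adjust weights
                for k, v in f(x, gold).items():
                    w[k] = w.get(k, 0.0) + v  # promote features of the gold output
                for k, v in f(x, pred).items():
                    w[k] = w.get(k, 0.0) - v  # demote features of the wrong guess
    return w
```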

SLIDE 13

Linear Models: Maximum Entropy

▪ Maximum entropy (logistic regression)

▪ Convert scores to probabilities: exponentiate to make them positive, then normalize
▪ Maximize the (log) conditional likelihood of the training data
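A hedged reconstruction of the equation those bullets describe: scores are exponentiated (made positive) and normalized into a distribution, and the weights are chosen to maximize the log conditional likelihood.

$$
P(y \mid x; w) \;=\; \frac{\exp\!\big(w^{\top} f(x, y)\big)}{\sum_{y'} \exp\!\big(w^{\top} f(x, y')\big)}
\qquad\qquad
w^{*} \;=\; \arg\max_{w} \sum_{i} \log P(y_i \mid x_i; w)
$$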

SLIDE 14

Maximum Entropy II

▪ Regularization (smoothing)
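The slide text does not include the formula, so the following L2-penalized objective is an assumption about the standard form of this regularization:

$$
w^{*} \;=\; \arg\max_{w} \;\sum_{i} \log P(y_i \mid x_i; w) \;-\; \frac{\lambda}{2}\,\lVert w \rVert^{2}
$$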

SLIDE 15

Log-Loss

▪ This minimizes the "log loss" on each example
▪ Log loss is an upper bound on zero-one loss
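In the same notation, the per-example log loss (a reconstruction, not copied from the slide):

$$
\ell_{\log}(x_i, y_i; w) \;=\; -\log P(y_i \mid x_i; w)
$$

For two classes, a misclassified example has P(y_i | x_i) ≤ 1/2, so the log loss measured in bits is at least 1; this is the sense in which it upper-bounds the zero-one loss.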

SLIDE 16

How to update weights: Gradient Descent

SLIDE 17

Gradient Descent: MaxEnt

  • What do we need to compute the gradients?

○ Log normalizer
○ Expected feature counts
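A reconstruction of the gradient behind those two bullets: differentiating the log conditional likelihood gives observed minus expected feature counts, where the expectation term comes from differentiating the log normalizer log Z(x_i) = log Σ_y exp(w·f(x_i, y)).

$$
\nabla_{w} \sum_{i} \log P(y_i \mid x_i; w)
\;=\; \sum_{i} \Big( f(x_i, y_i) \;-\; \mathbb{E}_{y \sim P(\cdot \mid x_i; w)}\big[ f(x_i, y) \big] \Big)
$$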

SLIDE 18

Maximum Margin

Linearly Separable

SLIDE 19

Maximum Margin

▪ Non-separable SVMs

▪ Add slack to the constraints
▪ Make objective pay (linearly) for slack

SLIDE 20

Primal SVM

▪ We had a constrained minimization
▪ …but we can solve for ξᵢ
▪ Giving: the hinge loss
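A hedged reconstruction of the primal problem and the hinge-loss form obtained by solving for the slack variables (multiclass notation consistent with the earlier formulas):

$$
\min_{w,\;\xi \ge 0} \;\; \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i} \xi_i
\quad \text{s.t.} \quad
w^{\top} f(x_i, y_i) \;\ge\; w^{\top} f(x_i, y) + 1 - \xi_i \;\;\; \forall\, y \neq y_i
$$

Solving for the smallest feasible slack gives the hinge loss:

$$
\xi_i \;=\; \max\!\Big( 0,\; 1 + \max_{y \neq y_i} w^{\top} f(x_i, y) - w^{\top} f(x_i, y_i) \Big)
$$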

SLIDE 21

How to update weights with hinge loss?

  • Not differentiable everywhere
  • Use sub-gradients instead
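A sketch of one subgradient step on the hinge loss above, reusing the assumed f(x, y) and candidates(x) helpers from the perceptron sketch; the learning rate and helper names are illustrative, not taken from the slides.

```python
def hinge_subgradient_step(w, x, gold, candidates, f, lr=0.1):
    """One subgradient step on max(0, 1 + score(best wrong y) - score(gold))."""
    def score(y):
        return sum(w.get(k, 0.0) * v for k, v in f(x, y).items())
    # assumes candidates(x) contains at least one output other than the gold one
    wrong = max((y for y in candidates(x) if y != gold), key=score)
    if 1 + score(wrong) - score(gold) > 0:    # inside the hinge: loss has a nonzero slope
        for k, v in f(x, gold).items():
            w[k] = w.get(k, 0.0) + lr * v
        for k, v in f(x, wrong).items():
            w[k] = w.get(k, 0.0) - lr * v
    # exactly at the kink, 0 is a valid subgradient, so making no update is also correct
    return w
```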
SLIDE 22

Loss Functions: Comparison

▪ Zero-One Loss
▪ Hinge
▪ Log

SLIDE 23

Structured Margin

▪ Just need efficient loss-augmented decode
▪ Still use general subgradient descent methods!
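A reconstruction of the structured hinge loss and the loss-augmented decode the slide refers to; the notation is an assumption consistent with the earlier formulas:

$$
\ell_i(w) \;=\; \max_{y} \Big( w^{\top} f(x_i, y) + \mathrm{loss}(y, y_i) \Big) \;-\; w^{\top} f(x_i, y_i)
$$

The inner maximization is the loss-augmented decode: it ranges over an exponentially large candidate set, so it must be computable with the same dynamic program used for ordinary parsing, and its maximizer supplies the subgradient for the update.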

SLIDE 24

Duals and Kernels

SLIDE 25

Nearest Neighbor Classification

SLIDE 26

Non-Parametric Classification

SLIDE 27

A Tale of Two Approaches...

SLIDE 28

Perceptron, Again

SLIDE 29

Perceptron Weights

SLIDE 30

Dual Perceptron

SLIDE 31

Dual/Kernelized Perceptron

SLIDE 32

Issues with Dual Perceptron

SLIDE 33

Kernels: Who cares?

SLIDE 34

Example: Kernels

▪ Quadratic kernels
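The slide text omits the kernel itself; one common form, stated here as an assumption about what the slide shows, is

$$
K(x, x') \;=\; \big( x^{\top} x' + 1 \big)^{2}
$$

whose implicit feature space contains all original features, all pairwise products, and a bias term.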

SLIDE 35

Non-Linear Separators

▪ Another view: kernels map an original feature space to some higher-dimensional feature space where the training set is (more) separable

Φ: y → φ(y)

SLIDE 36

Why Kernels?

▪ Can’t you just add these features on your own (e.g. add all pairs of features instead of using the quadratic kernel)?

▪ Yes, in principle, just compute them
▪ No need to modify any algorithms
▪ But, number of features can get large (or infinite)
▪ Some kernels not as usefully thought of in their expanded representation, e.g. RBF or data-defined kernels [Henderson and Titov 05]

▪ Kernels let us compute with these features implicitly

▪ Example: implicit dot product in quadratic kernel takes much less space and time per dot product
▪ Of course, there’s the cost for using the pure dual algorithms…
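A small numeric check of the implicit-dot-product point, using the homogeneous quadratic kernel K(x, z) = (x·z)² as an assumed example: the kernel touches only the d original dimensions, while the explicit feature map materializes all d² pairwise products.

```python
import numpy as np

def quadratic_kernel(x, z):
    return float(np.dot(x, z)) ** 2           # O(d) time, no expanded features stored

def explicit_phi(x):
    return np.outer(x, x).ravel()             # all pairwise products: O(d^2) features

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)
# The kernel value equals the dot product in the expanded feature space.
print(np.isclose(quadratic_kernel(x, z), np.dot(explicit_phi(x), explicit_phi(z))))  # True
```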

SLIDE 37

Tree Kernels

SLIDE 38

Dual Formulation of SVM

SLIDE 39

Dual Formulation II

SLIDE 40

Dual Formulation III

SLIDE 41

Back to Learning SVMs

SLIDE 42

What are these alphas?

SLIDE 43

Comparison

SLIDE 44

Reranking

SLIDE 45

Training the reranker

▪ Training Data:
▪ Generate candidate parses for each x
▪ Loss function:
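A sketch of how the reranker's training set could be assembled from an n-best baseline parser; nbest_parse and f1_loss are hypothetical stand-ins for the actual parser and evaluation loss, not names from the slides.

```python
# Hypothetical sketch: build reranking training examples from n-best candidates.
def build_reranking_data(sentences, gold_parses, nbest_parse, f1_loss, n=50):
    data = []
    for x, gold in zip(sentences, gold_parses):
        candidates = nbest_parse(x, n)                    # candidate parses for x
        losses = [f1_loss(y, gold) for y in candidates]   # loss of each candidate vs. gold
        data.append((x, candidates, losses))              # reranker learns to prefer low-loss y
    return data
```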

SLIDE 46

Baseline and Oracle Results

Collins Model 2

SLIDE 47

Experiment 1: Only “old” features

SLIDE 48

Right Branching Bias

SLIDE 49

Other Features

▪ Heaviness

▪ What is the span of a rule

▪ Neighbors of a span
▪ Span shape
▪ Ngram Features
▪ Probability of the parse tree
▪ ...

SLIDE 50

Results with all the features

SLIDE 51

Reranking

▪ Advantages:

▪ Directly reduce to non-structured case
▪ No locality restriction on features

▪ Disadvantages:

▪ Stuck with errors of baseline parser
▪ Baseline system must produce n-best lists
▪ But, feedback is possible [McClosky, Charniak, Johnson 2006]
▪ But, a reranker (almost) never performs worse than a generative parser, and in practice performs substantially better.

SLIDE 52

Summary

  • Generative parsing has many disadvantages

○ Independence assumptions
○ Difficult to express certain features without making the grammar too large or parsing too complex

  • Discriminative parsing lets us add complex features while still being easy to train

  • The candidate set for discriminative parsing is too large: use reranking instead

SLIDE 53

Another Application of Reranking: Information Retrieval

SLIDE 54

Modern Reranking Methods

SLIDE 55

Learn features using neural networks

Replace the hand-crafted feature function with a neural network

SLIDE 56

Reranking for code generation

SLIDE 57

Reranking for code generation (2)

  • Matching features
SLIDE 58

Reranking for semantic parsing