Algorithms for NLP: Classification II
Sachin Kumar - CMU
Slides: Dan Klein (UC Berkeley), Taylor Berg-Kirkpatrick (CMU)
Parsing as Classification
- Input: Sentence X
- Output: Parse Y
- Potentially millions of candidates
Example: x = "The screen was a sea of red", y = its parse tree
Generative Model for Parsing
- PCFG: model the joint probability P(X, Y)
- Many advantages
○ Learning is often clean and analytical: count and divide (see the relative-frequency sketch after this list)
- Disadvantages?
○ Rigid independence assumptions
○ Lack of sensitivity to lexical information
○ Lack of sensitivity to structural frequencies
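To make the "count and divide" step concrete, here is a minimal sketch (the nested-tuple tree format and helper names are assumptions, not the lecture's code) that estimates PCFG rule probabilities by relative frequency over a toy treebank.

```python
from collections import defaultdict

def estimate_pcfg(treebank):
    """Estimate PCFG rule probabilities by relative frequency ("count and divide").

    `treebank` is assumed to be a list of trees, where each tree is a nested
    tuple (label, child1, child2, ...) and leaves are plain word strings.
    """
    rule_counts = defaultdict(float)   # (lhs, rhs) -> count
    lhs_counts = defaultdict(float)    # lhs -> count

    def count(tree):
        if isinstance(tree, str):      # leaf: a terminal word
            return
        label, *children = tree
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rule_counts[(label, rhs)] += 1
        lhs_counts[label] += 1
        for child in children:
            count(child)

    for tree in treebank:
        count(tree)

    # P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
    return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

# Toy usage: two tiny trees sharing an S -> NP VP expansion
trees = [("S", ("NP", "the", "screen"), ("VP", "was", ("NP", "a", "sea"))),
         ("S", ("NP", "the", "sea"), ("VP", "was", "red"))]
print(estimate_pcfg(trees))
```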
Lack of sensitivity to lexical information
Lack of sensitivity to structural frequencies: Coordination Ambiguity
Lack of sensitivity to structural frequencies: Close attachment
Discriminative Model for Parsing
- Directly estimate the score of Y given X
- Distribution-free: minimize expected loss
- Advantages?
○ We get more freedom in defining features
  ■ No independence assumptions required
Example: Right branching
Example: Complex Features
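To illustrate that freedom, here is a sketch of a linear scoring function score(x, y) = w·f(x, y) over candidate parses; all feature templates and the tree encoding are illustrative assumptions (a lexicalized feature and a right-branching feature echo the two example slides above).

```python
from collections import Counter

def features(sentence, tree):
    """Extract arbitrary, possibly overlapping features of a (sentence, parse) pair.

    Nothing here needs to respect PCFG independence assumptions: features can
    look at the whole tree, lexical items, spans, branching direction, etc.
    `tree` is assumed to be a nested (label, children...) tuple as before.
    """
    feats = Counter()

    def visit(node, depth=0):
        if isinstance(node, str):
            return
        label, *children = node
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        feats[("rule", label, rhs)] += 1                      # local rule feature
        feats[("rule+first-word", label, sentence[0])] += 1   # lexicalized, non-local
        if len(children) == 2 and not isinstance(children[1], str):
            feats[("right-branching-at-depth", depth)] += 1   # structural bias feature
        for child in children:
            visit(child, depth + 1)

    visit(tree)
    return feats

def score(weights, sentence, tree):
    """Linear score w . f(x, y); the discriminative model ranks candidate parses by this."""
    return sum(weights.get(k, 0.0) * v for k, v in features(sentence, tree).items())
```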
How to train?
- Minimize training error?
○ Loss function for each example i: 0 when the predicted label is correct, 1 otherwise
- Training error to minimize: the sum of these per-example losses
Objective Function
- Training error as a function of the weights w (writing the linear score as w·f(x, y); a minimal reconstruction of the slide's formula): error(w) = Σ_i step( w·f(x_i, y_i) − max_{y ≠ y_i} w·f(x_i, y) )
- The step function returns 1 when its argument is negative, 0 otherwise
- Difficult to optimize: the gradient is zero (almost) everywhere
- Solution: optimize differentiable upper bounds of this function: MaxEnt or SVM
Linear Models: Perceptron
▪ The (online) perceptron algorithm:
▪ Start with zero weights w
▪ Visit training instances one by one
  ▪ Try to classify
  ▪ If correct, no change!
  ▪ If wrong: adjust weights
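A minimal sketch of this algorithm in the structured setting, assuming the feature representation above and a hypothetical candidates(x) function that enumerates candidate outputs:

```python
def perceptron_train(data, candidates, features, epochs=5):
    """Online structured perceptron.

    `data` is a list of (x, gold_y) pairs, `candidates(x)` returns the set of
    candidate outputs for x (assumed given), and `features(x, y)` returns a
    Counter of feature counts as sketched earlier.
    """
    w = {}  # sparse weight vector
    def dot(f):
        return sum(w.get(k, 0.0) * v for k, v in f.items())

    for _ in range(epochs):
        for x, gold in data:
            # Predict the highest-scoring candidate under the current weights
            pred = max(candidates(x), key=lambda y: dot(features(x, y)))
            if pred != gold:
                # Mistake-driven update: boost gold features, penalize predicted ones
                for k, v in features(x, gold).items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in features(x, pred).items():
                    w[k] = w.get(k, 0.0) - v
    return w
```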
Linear Models: Maximum Entropy
▪ Maximum entropy (logistic regression)
▪ Convert scores to probabilities: P(y | x; w) = exp( w·f(x, y) ) / Σ_y' exp( w·f(x, y') )
  (the exponential makes the scores positive; the denominator normalizes them)
▪ Maximize the (log) conditional likelihood of the training data
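A sketch of the resulting conditional log-likelihood for a single example (same assumed helpers; the gold output is assumed to be among the candidates, and log-sum-exp is used for numerical stability):

```python
import math

def log_likelihood(w, x, gold, candidates, features):
    """log P(gold | x; w) for a log-linear (MaxEnt) model over candidate outputs."""
    def dot(f):
        return sum(w.get(k, 0.0) * v for k, v in f.items())

    # Assumes `gold` appears in candidates(x)
    scores = {y: dot(features(x, y)) for y in candidates(x)}
    # log-sum-exp for the normalizer, shifted by the max score for stability
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return scores[gold] - log_z
```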
Maximum Entropy II
▪ Regularization (smoothing): e.g. add a penalty −λ‖w‖² to the objective to keep the weights small
Log-Loss
▪ This minimizes the "log loss" on each example
▪ The log loss is an upper bound on the zero-one loss
How to update weights: Gradient Descent
Gradient Descent: MaxEnt
- What do we need to compute the gradients?
  ○ The log normalizer
  ○ Expected feature counts
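Concretely, the gradient of the per-example negative log-likelihood is the expected feature counts minus the observed (gold) feature counts; a sketch under the same assumptions as above:

```python
import math
from collections import Counter

def log_loss_gradient(w, x, gold, candidates, features):
    """Gradient of -log P(gold | x; w): expected feature counts minus gold feature counts."""
    def dot(f):
        return sum(w.get(k, 0.0) * v for k, v in f.items())

    cands = list(candidates(x))
    scores = [dot(features(x, y)) for y in cands]
    m = max(scores)
    exp_scores = [math.exp(s - m) for s in scores]
    z = sum(exp_scores)                          # (shifted) normalizer

    grad = Counter()
    for y, es in zip(cands, exp_scores):
        p = es / z                               # P(y | x; w)
        for k, v in features(x, y).items():
            grad[k] += p * v                     # expected feature counts
    for k, v in features(x, gold).items():
        grad[k] -= v                             # minus observed (gold) feature counts
    return grad                                  # step with w[k] -= learning_rate * grad[k]
```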
Maximum Margin
Linearly Separable
Maximum Margin
▪ Non-separable SVMs
  ▪ Add slack variables ξ_i to the constraints
  ▪ Make the objective pay (linearly) for slack
Primal SVM
▪ We had a constrained minimization
▪ …but we can solve for ξ_i
▪ Giving the hinge loss (one standard form): ξ_i = max( 0, 1 − ( w·f(x_i, y_i) − max_{y ≠ y_i} w·f(x_i, y) ) )
How to update weights with hinge loss?
- Not differentiable everywhere
- Use sub-gradients instead
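A sketch of one subgradient step on the hinge loss above (same assumed helpers; margin fixed at 1):

```python
def hinge_subgradient_step(w, x, gold, candidates, features, lr=0.1):
    """One subgradient step on max(0, 1 - (score(gold) - max_other score)).

    If the margin constraint is satisfied the subgradient is zero; otherwise
    it pushes gold features up and the best-scoring wrong candidate's down.
    """
    def dot(f):
        return sum(w.get(k, 0.0) * v for k, v in f.items())

    others = [y for y in candidates(x) if y != gold]
    if not others:
        return w
    rival = max(others, key=lambda y: dot(features(x, y)))
    if dot(features(x, gold)) - dot(features(x, rival)) < 1.0:   # margin violated
        for k, v in features(x, gold).items():
            w[k] = w.get(k, 0.0) + lr * v
        for k, v in features(x, rival).items():
            w[k] = w.get(k, 0.0) - lr * v
    return w
```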
Loss Functions: Comparison
▪ Zero-one loss
▪ Hinge loss
▪ Log loss
Structured Margin
Just need an efficient loss-augmented decode; we can still use general subgradient descent methods!
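In the structured case the rival output comes from a loss-augmented decode, argmax_y [ w·f(x, y) + Δ(y, y_gold) ]. A sketch over an explicit candidate list, with the loss function Δ passed in as `loss` (both hypothetical names):

```python
def structured_hinge_step(w, x, gold, candidates, features, loss, lr=0.1):
    """Subgradient step on max_y [ w.f(x,y) + loss(y, gold) ] - w.f(x, gold).

    The inner max is the "loss-augmented decode"; over an n-best candidate
    list it is just an argmax, and with decomposable losses it can be folded
    into dynamic-programming parsers.
    """
    def dot(f):
        return sum(w.get(k, 0.0) * v for k, v in f.items())

    rival = max(candidates(x), key=lambda y: dot(features(x, y)) + loss(y, gold))
    if dot(features(x, rival)) + loss(rival, gold) > dot(features(x, gold)):
        for k, v in features(x, gold).items():
            w[k] = w.get(k, 0.0) + lr * v
        for k, v in features(x, rival).items():
            w[k] = w.get(k, 0.0) - lr * v
    return w
```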
Duals and Kernels
Nearest Neighbor Classification
Non-Parametric Classification
A Tale of Two Approaches...
Perceptron, Again
Perceptron Weights
Dual Perceptron
Dual/Kernelized Perceptron
Issues with Dual Perceptron
Kernels: Who cares?
Example: Kernels
▪ Quadratic kernels
Non-Linear Separators
▪ Another view: kernels map an original feature space to some higher-dimensional feature space where the training set is (more) separable
Φ: y → φ(y)
Why Kernels?
▪ Can’t you just add these features on your own (e.g. add all pairs of features instead of using the quadratic kernel)?
▪ Yes, in principle, just compute them
▪ No need to modify any algorithms
▪ But the number of features can get large (or infinite)
▪ Some kernels are not as usefully thought of in their expanded representation, e.g. RBF or data-defined kernels [Henderson and Titov 05]
▪ Kernels let us compute with these features implicitly
▪ Example: the implicit dot product in the quadratic kernel takes much less space and time per dot product
▪ Of course, there’s the cost for using the pure dual algorithms…
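A small sketch of that point for the quadratic kernel K(u, v) = (u·v + 1)²: it equals the dot product under an explicit degree-2 feature map, but costs O(d) instead of O(d²) per evaluation.

```python
import numpy as np

def quadratic_kernel(u, v):
    """K(u, v) = (u.v + 1)^2, computed in O(d) time."""
    return (np.dot(u, v) + 1.0) ** 2

def explicit_quadratic_features(u):
    """The corresponding explicit feature map: all degree-<=2 monomials (size O(d^2)).

    With these constant factors, phi(u).phi(v) == (u.v + 1)^2.
    """
    d = len(u)
    feats = [1.0]
    feats += [np.sqrt(2.0) * u[i] for i in range(d)]
    feats += [np.sqrt(2.0) * u[i] * u[j] for i in range(d) for j in range(i + 1, d)]
    feats += [u[i] ** 2 for i in range(d)]
    return np.array(feats)

u, v = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
phi_dot = np.dot(explicit_quadratic_features(u), explicit_quadratic_features(v))
assert np.isclose(quadratic_kernel(u, v), phi_dot)  # same value, very different cost
```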
Tree Kernels
Dual Formulation of SVM
Dual Formulation II
Dual Formulation III
Back to Learning SVMs
What are these alphas?
Comparison
Reranking
Training the reranker
▪ Training data: sentences paired with gold parses
▪ Generate candidate parses for each x (an n-best list from a baseline parser)
▪ Loss function: measured against the gold parse
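A sketch of reranker training and prediction under these assumptions: the baseline parser supplies an n-best list, `rerank_features` is a hypothetical feature function (it can include the baseline model's probability, as listed under "Other Features" below), and the weights are trained here with the perceptron from earlier.

```python
def rerank(w, x, nbest, rerank_features):
    """Pick the candidate from the baseline parser's n-best list that scores
    highest under the reranker's linear model."""
    def dot(f):
        return sum(w.get(k, 0.0) * v for k, v in f.items())
    return max(nbest, key=lambda y: dot(rerank_features(x, y)))

def train_reranker(data, rerank_features, epochs=5):
    """Perceptron-style reranker training.

    `data` is assumed to be a list of (x, nbest, gold) triples, where `gold`
    is the candidate in `nbest` closest to the true parse (the oracle parse).
    """
    w = {}
    for _ in range(epochs):
        for x, nbest, gold in data:
            pred = rerank(w, x, nbest, rerank_features)
            if pred != gold:
                for k, v in rerank_features(x, gold).items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in rerank_features(x, pred).items():
                    w[k] = w.get(k, 0.0) - v
    return w
```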
Baseline and Oracle Results
Collins Model 2
Experiment 1: Only “old” features
Right Branching Bias
Other Features
▪ Heaviness
▪ What is the span of a rule
▪ Neighbors of a span
▪ Span shape
▪ N-gram features
▪ Probability of the parse tree
▪ ...
Results with all the features
Reranking
▪ Advantages:
▪ Directly reduce to the non-structured case
▪ No locality restriction on features
▪ Disadvantages:
▪ Stuck with the errors of the baseline parser
▪ The baseline system must produce n-best lists
▪ But feedback is possible [McClosky, Charniak, Johnson 2006]
▪ But a reranker (almost) never performs worse than a generative parser, and in practice performs substantially better
Summary
- Generative parsing has many disadvantages
○ Independence assumptions
○ Difficult to express certain features without making the grammar too large or parsing too complex
- Discriminative parsing allows us to add complex features while still being easy to train
- The candidate set for discriminative parsing is too large: use reranking instead
Another Application of Reranking: Information Retrieval
Modern Reranking Methods
Learn features using neural networks
Replace the hand-crafted feature function with a neural network
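A sketch of that replacement (shapes, the encoding of a candidate, and all names are assumptions): a tiny feedforward network scores each candidate from a dense representation instead of hand-built sparse features.

```python
import numpy as np

class NeuralReranker:
    """Tiny feedforward scorer: score(x, y) = v . relu(W h(x, y) + b),
    where h(x, y) is some dense encoding of the (sentence, candidate) pair."""

    def __init__(self, dim_in, dim_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(dim_hidden, dim_in))
        self.b = np.zeros(dim_hidden)
        self.v = rng.normal(scale=0.1, size=dim_hidden)

    def score(self, h):
        # One hidden layer with ReLU, then a linear output score
        return float(self.v @ np.maximum(0.0, self.W @ h + self.b))

    def rerank(self, encoded_nbest):
        # encoded_nbest: list of (candidate, dense_vector) pairs
        return max(encoded_nbest, key=lambda pair: self.score(pair[1]))[0]
```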
Reranking for code generation
Reranking for code generation (2)
- Matching features