Algorithms for NLP: Language Modeling III
Taylor Berg-Kirkpatrick, CMU
Slides: Dan Klein, UC Berkeley
Efficient Hashing
▪ Closed address hashing
  ▪ Resolve collisions with chains
  ▪ Easier to understand, but bigger
▪ Open address hashing (see the sketch below)
  ▪ Resolve collisions with probe sequences
  ▪ Smaller, but easy to mess up
▪ Direct-address hashing
  ▪ No collision resolution: just eject previous entries
  ▪ Not suitable for core LM storage
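As a concrete illustration of the open-address option, here is a minimal Python sketch of a linear-probing table mapping packed n-gram keys to counts; the capacity, the probing scheme, and the convention that 0 marks an empty slot are illustrative assumptions, and there is no resizing or load-factor handling.

```python
class OpenAddressCountTable:
    """Open-address hash map from packed n-gram keys to counts.

    Collisions are resolved with a linear probe sequence: on a
    collision, scan forward until the matching key or an empty
    slot (key == 0) is found. Keys must be nonzero integers.
    """

    def __init__(self, capacity=1 << 20):
        self.capacity = capacity
        self.keys = [0] * capacity      # packed n-gram encodings (0 = empty)
        self.values = [0] * capacity    # counts (or count ranks)

    def _slot(self, key):
        i = hash(key) % self.capacity
        while self.keys[i] not in (0, key):
            i = (i + 1) % self.capacity  # linear probing
        return i

    def increment(self, key):
        i = self._slot(key)
        self.keys[i] = key
        self.values[i] += 1

    def get(self, key):
        i = self._slot(key)
        return self.values[i] if self.keys[i] == key else 0

table = OpenAddressCountTable()
table.increment(15176595)       # e.g. a packed trigram encoding
print(table.get(15176595))      # -> 1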
Integer Encodings
▪ n-gram: “the cat laughed”, count: 233
▪ word ids: the → 7, cat → 1, laughed → 15
Bit Packing
▪ Got 3 numbers under 2^20 to store? They fit in a single primitive 64-bit long (three fields: 20 bits | 20 bits | 20 bits)
▪ 7 → 0…00111, 1 → 0…00001, 15 → 0…01111, each in its own 20-bit field (sketch below)
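A minimal sketch of this packing, assuming 20-bit word ids; the function names and shift layout are illustrative.

```python
BITS = 20
MASK = (1 << BITS) - 1

def pack_trigram(id1, id2, id3):
    """Pack three word ids, each < 2**20, into one 64-bit integer."""
    assert max(id1, id2, id3) < (1 << BITS)
    return (id1 << (2 * BITS)) | (id2 << BITS) | id3

def unpack_trigram(key):
    """Recover the three word ids from a packed trigram key."""
    return (key >> (2 * BITS)) & MASK, (key >> BITS) & MASK, key & MASK

key = pack_trigram(7, 1, 15)        # word ids for "the cat laughed"
print(key, unpack_trigram(key))
```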
Integer Encodings
▪ n-gram: “the cat laughed”, count: 233
▪ n-gram encoding: 15176595
Rank Values
▪ c(the) = 23135851162 < 2^35
▪ 35 bits suffice to represent integers between 0 and 2^35
▪ Store (n-gram encoding, count) pairs: e.g. 15176595 → 233 (60 bits for the encoding, 35 bits for the count)
Rank Values
▪ # unique counts = 770000 < 2^20
▪ 20 bits suffice to represent the ranks of all counts
▪ Store (n-gram encoding, rank) pairs: e.g. 15176595 → rank 3 (60 bits for the encoding, 20 bits for the rank)
▪ A separate rank → freq lookup table recovers the actual count (e.g. 1 → 1, 2 → 2, 3 → 233); see the sketch below
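A minimal sketch of the rank idea; how the slides actually assign ranks is not shown, so here ranks are simply assigned by sorting the unique counts (all names are illustrative).

```python
def build_rank_table(counts):
    """Assign each unique count a small integer rank (here: by sorting),
    keeping the inverse table so the real count can be recovered."""
    rank_to_count = sorted(set(counts))              # rank -> actual count
    count_to_rank = {c: r for r, c in enumerate(rank_to_count)}
    return count_to_rank, rank_to_count

counts = [1, 1, 2, 51, 233]
count_to_rank, rank_to_count = build_rank_table(counts)
rank = count_to_rank[233]             # store this small rank with the n-gram
print(rank, rank_to_count[rank])      # recovers the actual count, 233
```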
So Far
▪ Components so far: word indexer, n-gram encoding scheme, count DB (unigram / bigram / trigram), rank lookup
▪ N-gram encoding scheme:
  ▪ unigram: f(id) = id
  ▪ bigram: f(id1, id2) = ?
  ▪ trigram: f(id1, id2, id3) = ?
Hashing vs Sorting
Maximum Entropy Models
Improving on N-Grams?
▪ N-grams don’t combine multiple sources of evidence well
▪ Here: P(construction | After the demolition was completed, the)
  ▪ “the” gives a syntactic constraint
  ▪ “demolition” gives a semantic constraint
  ▪ Unlikely that the interaction between these two has been densely observed in this specific n-gram
▪ We’d like a model that can be more statistically efficient
Some Definitions
▪ INPUTS: x = “close the ____”
▪ CANDIDATE SET: {door, table, …}
▪ CANDIDATES: door, table
▪ TRUE OUTPUTS: door
▪ FEATURE VECTORS (indicator features on input/candidate pairs):
  ▪ y occurs in x
  ▪ “close” in x ∧ y=“door”
  ▪ x-1=“the” ∧ y=“door”
  ▪ x-1=“the” ∧ y=“table”
More Features, Less Interaction
▪ Example: x = “closing the ____”, y = “doors” (feature-extraction sketch below)
▪ N-Grams: x-1=“the” ∧ y=“doors”
▪ Skips: x-2=“closing” ∧ y=“doors”
▪ Lemmas: x-2=“close” ∧ y=“door”
▪ Caching: y occurs in x
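A minimal sketch of extracting such indicator features for one (x, y) pair; the feature-name format, the tiny lemma map, and the tokenization are illustrative assumptions.

```python
# Illustrative lemma map; a real system would use a lemmatizer.
LEMMAS = {"closing": "close", "doors": "door"}

def extract_features(context_tokens, y):
    """Active indicator features for candidate y, given the tokens
    to the left of the blank (feature names are illustrative)."""
    feats = set()
    feats.add(f"x-1={context_tokens[-1]}^y={y}")          # n-gram feature
    if len(context_tokens) >= 2:
        feats.add(f"x-2={context_tokens[-2]}^y={y}")      # skip feature
        lemma_ctx = LEMMAS.get(context_tokens[-2], context_tokens[-2])
        lemma_y = LEMMAS.get(y, y)
        feats.add(f"lemma:x-2={lemma_ctx}^y={lemma_y}")   # lemma feature
    if y in context_tokens:
        feats.add("y-occurs-in-x")                        # caching feature
    return feats

print(sorted(extract_features(["closing", "the"], "doors")))
```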
Data: Feature Impact
Features             Train Perplexity   Test Perplexity
3-gram indicators    241                350
1-3 grams            126                172
1-3 grams + skips    101                164
Exponential Form
▪ Weights w, features f(x, y)
▪ Linear score: w · f(x, y)
▪ Unnormalized probability: exp(w · f(x, y))
▪ Probability: P(y | x; w) = exp(w · f(x, y)) / Σ_y' exp(w · f(x, y')) (see the sketch below)
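A minimal sketch of these quantities with sparse dictionaries for weights and features; the exp-and-normalize step subtracts the max score for numerical stability (function and feature names are illustrative).

```python
import math

def score(weights, feats):
    """Linear score w · f(x, y) over a candidate's active features."""
    return sum(weights.get(f, 0.0) for f in feats)

def probabilities(weights, feats_by_candidate):
    """P(y | x; w) over the candidate set via exp-and-normalize."""
    scores = {y: score(weights, feats) for y, feats in feats_by_candidate.items()}
    m = max(scores.values())                     # stabilize the exponentials
    unnorm = {y: math.exp(s - m) for y, s in scores.items()}
    Z = sum(unnorm.values())
    return {y: u / Z for y, u in unnorm.items()}

weights = {"x-1=the^y=door": 1.2, "x-1=the^y=table": 0.3}
feats = {"door": ["x-1=the^y=door"], "table": ["x-1=the^y=table"]}
print(probabilities(weights, feats))             # door gets higher probability
```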
Likelihood Objective
▪ Model form: P(y | x; w) = exp(w · f(x, y)) / Σ_y' exp(w · f(x, y'))
▪ Log-likelihood of training data: L(w) = Σ_i log P(y_i | x_i; w) (expanded below)
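Expanding the log of the model form gives the objective in its usual two-term shape:

```latex
L(w) = \sum_i \log P(y_i \mid x_i; w)
     = \sum_i \Big[ w \cdot f(x_i, y_i) - \log \sum_{y'} \exp\big(w \cdot f(x_i, y')\big) \Big]
```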
Training
History of Training
▪ 1990’s: Specialized methods (e.g. iterative scaling)
▪ 2000’s: General-purpose methods (e.g. conjugate gradient)
▪ 2010’s: Online methods (e.g. stochastic gradient)
What Does LL Look Like?
▪ Example
  ▪ Data: xxxy (three x’s, one y)
  ▪ Two outcomes, x and y
  ▪ One indicator feature for each
  ▪ Likelihood: worked out below
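Worked out, with one weight per outcome (w_x, w_y):

```latex
P(\mathrm{x}) = \frac{e^{w_x}}{e^{w_x} + e^{w_y}}, \qquad
P(\mathrm{y}) = \frac{e^{w_y}}{e^{w_x} + e^{w_y}}

L(w) = \log\big[P(\mathrm{x})^3 \, P(\mathrm{y})\big]
     = 3 w_x + w_y - 4 \log\big(e^{w_x} + e^{w_y}\big)
```

Setting the derivative to zero gives P(x) = 3/4, i.e. any w with w_x − w_y = log 3 maximizes the likelihood.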
Convex Optimization
▪ The maxent objective is an unconstrained convex problem
▪ One optimal value*, gradients point the way
Gradients
▪ ∂L/∂w_j = Σ_i [ f_j(x_i, y_i) − Σ_y P(y | x_i; w) f_j(x_i, y) ]
▪ First term: count of features under the target labels
▪ Second term: expected count of features under the model-predicted label distribution (sketch below)
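A minimal sketch of this observed-minus-expected computation, reusing the probabilities helper sketched earlier (the data layout and names are illustrative).

```python
from collections import defaultdict

def gradient(weights, data):
    """dL/dw for data = [(feats_by_candidate, true_y), ...]:
    observed feature counts minus expected counts under the model."""
    grad = defaultdict(float)
    for feats_by_candidate, true_y in data:
        for f in feats_by_candidate[true_y]:             # observed (gold) counts
            grad[f] += 1.0
        probs = probabilities(weights, feats_by_candidate)
        for y, feats in feats_by_candidate.items():      # expected counts
            for f in feats:
                grad[f] -= probs[y]
    return grad
```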
Gradient Ascent
▪ The maxent objective is an unconstrained optimization problem
▪ Gradient ascent
  ▪ Basic idea: move uphill from the current guess
  ▪ Gradient ascent / descent follows the gradient incrementally
  ▪ At a local optimum, the derivative vector is zero
  ▪ Will converge if step sizes are small enough, but not efficient
  ▪ All we need is the ability to evaluate the function and its derivative (see the sketch below)
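A minimal sketch of batch gradient ascent built on the gradient helper above; the step size and iteration count are arbitrary illustrative choices.

```python
from collections import defaultdict

def train_gradient_ascent(data, num_iters=100, step_size=0.1):
    """Batch gradient ascent: repeatedly step uphill along dL/dw."""
    weights = defaultdict(float)
    for _ in range(num_iters):
        grad = gradient(weights, data)        # full-batch gradient
        for f, g in grad.items():
            weights[f] += step_size * g       # uphill step on each weight
    return weights
```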
(Quasi)-Newton Methods
▪ 2nd-order methods: repeatedly create a quadratic approximation and solve it
▪ E.g. L-BFGS, which tracks recent derivatives to approximate the (inverse) Hessian (see the sketch below)
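As a hedged illustration of handing the objective to an off-the-shelf quasi-Newton optimizer, here is SciPy's L-BFGS run on the tiny "xxxy" likelihood from the earlier slide (SciPy minimizes, so the objective and gradient are negated).

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective: the "xxxy" example, with w = [w_x, w_y].
# L(w) = 3*w_x + w_y - 4*log(exp(w_x) + exp(w_y)); negate for minimization.
def neg_log_likelihood(w):
    return -(3 * w[0] + w[1] - 4 * np.logaddexp(w[0], w[1]))

def neg_gradient(w):
    p_x = np.exp(w[0] - np.logaddexp(w[0], w[1]))   # model P(x)
    return -np.array([3 - 4 * p_x, 1 - 4 * (1 - p_x)])

result = minimize(neg_log_likelihood, x0=np.zeros(2),
                  jac=neg_gradient, method="L-BFGS-B")
w_x, w_y = result.x
print(np.exp(w_x - np.logaddexp(w_x, w_y)))         # P(x) -> approx 0.75
```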
Regularization
Regularization Methods
▪ Early stopping
▪ L2: maximize L(w) − ‖w‖₂² (gradient sketch below)
▪ L1: maximize L(w) − ‖w‖₁
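If the L2 penalty is folded into the gradient, every weight just gets an extra pull toward zero; a minimal sketch on top of the earlier gradient helper (the penalty strength is an illustrative parameter).

```python
def l2_regularized_gradient(weights, data, reg_strength=1.0):
    """Gradient of L(w) - reg_strength * ||w||^2."""
    grad = gradient(weights, data)              # dL/dw from the earlier sketch
    for f, w in weights.items():
        grad[f] -= 2.0 * reg_strength * w       # derivative of the squared norm
    return grad
```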
Regularization Effects
▪ Early stopping: don’t do this
▪ L2: weights stay small but non-zero
▪ L1: many weights driven to zero
  ▪ Good for sparsity
  ▪ Usually bad for accuracy for NLP
Scaling
Why is Scaling Hard?
▪ Big normalization terms
▪ Lots of data points
Hierarchical Prediction
▪ Hierarchical prediction / softmax [Mikolov et al 2013]
▪ Noise-Contrastive Estimation [Mnih, 2013]
▪ Self-Normalization [Devlin, 2014]
Stochastic Gradient
▪ View the gradient as an average over data points
▪ Stochastic gradient: take a step after each example (or mini-batch); see the sketch below
▪ Substantial improvements exist, e.g. AdaGrad (Duchi, 11)
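A minimal sketch of stochastic gradient ascent over single examples, reusing the per-example gradient helper from above; the epoch count and step size are illustrative.

```python
import random
from collections import defaultdict

def train_sgd(data, num_epochs=5, step_size=0.1):
    """Stochastic gradient ascent: one uphill step per training example."""
    weights = defaultdict(float)
    for _ in range(num_epochs):
        random.shuffle(data)
        for example in data:
            grad = gradient(weights, [example])   # gradient on a single example
            for f, g in grad.items():
                weights[f] += step_size * g
    return weights
```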
Other Methods
Neural Net LMs
[Figure: neural network LM architecture, from (Bengio et al., 2003)]