CS447: Natural Language Processing
http://courses.engr.illinois.edu/cs447
Julia Hockenmaier
juliahmr@illinois.edu 3324 Siebel Center
Where we're at:
Lecture 25: Word Embeddings and neural LMs
Lecture 26: Recurrent networks
Lecture 27: Sequence labeling and Seq2Seq
Lecture 28: Review for the final exam
Lecture 29: In-class final exam
Simplest variant: single-layer feedforward net
For binary classification tasks:
Input layer: vector x; output unit: scalar y
Single output unit: return 1 if y > 0.5, return 0 otherwise
For multiclass classification tasks:
Input layer: vector x; output layer: vector y
K output units (a vector), where output unit yi scores class i
Return argmaxi(yi)
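A minimal numpy sketch of both cases (the weights and the input here are random placeholders, purely illustrative):

```python
# Single-layer feedforward net: output = x W + b (no hidden layer).
import numpy as np

def single_layer(x, W, b):
    return x @ W + b                    # one score per output unit

rng = np.random.default_rng(0)
x = rng.normal(size=4)                  # input layer: vector x

# Binary classification: a single output unit squashed into (0, 1).
w, b = rng.normal(size=(4, 1)), np.zeros(1)
y = 1 / (1 + np.exp(-single_layer(x, w, b)))
label = int(y[0] > 0.5)                 # return 1 if y > 0.5, else 0

# Multiclass classification: K output units, one per class.
K = 3
W, bK = rng.normal(size=(4, K)), np.zeros(K)
scores = single_layer(x, W, bK)
predicted = int(np.argmax(scores))      # return argmax_i(y_i)
```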
We can generalize this to multi-layer feedforward nets:
Input layer: vector x
Hidden layer: vector h1
…
Hidden layer: vector hn
Output layer: vector y
Multiclass classification = predict one of K classes.
Return the class i with the highest score: argmaxi(yi)
In neural networks, this is typically done with the softmax function, which maps a real-valued vector to a probability distribution.
For a vector z = (z0…zK):
P(i) = softmax(zi) = exp(zi) ∕ ∑k=0…K exp(zk)
(NB: a single layer with a softmax output is just logistic regression)
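A minimal softmax sketch (subtracting max(z) before exponentiating is a standard numerical-stability trick, not something the slide specifies):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shifting by max(z) leaves the result unchanged
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)                  # a distribution: p >= 0, p.sum() == 1
print(p.argmax())               # the predicted class
```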
LMs define a distribution over strings: P(w1…wk)
LMs factor P(w1…wk) into the probability of each word:
P(w1…wk) = P(w1)·P(w2|w1)·P(w3|w1w2)·…·P(wk|w1…wk−1)
A neural LM needs to define a distribution over the V words in the vocabulary, conditioned on the preceding words.
Output layer: V units (one per word in the vocabulary), with a softmax to get a distribution.
Input: represent each preceding word by its d-dimensional embedding.
Task: represent P(w | w1…wk) with a neural net.
Assumptions:
Each word wi ∈ V is represented by a dense vector v(wi) ∈ Rdim(emb)
The input x = [v(w1),…,v(wk)] to the NN represents the context w1…wk
The output layer is a softmax:
P(w | w1…wk) = softmax(hW2 + b2)
Architecture:
Input layer: x = [v(w1)…v(wk)], with v(w) = E[w]
Hidden layer: h = g(xW1 + b1)
Output layer: P(w | w1…wk) = softmax(hW2 + b2)
Parameters:
Embedding matrix: E ∈ R|V|×dim(emb)
Weight matrices and biases:
first layer: W1 ∈ Rk·dim(emb)×dim(h), b1 ∈ Rdim(h)
second layer: W2 ∈ Rdim(h)×|V|, b2 ∈ R|V|
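A sketch of this forward pass; the sizes (V = 1000, dim(emb) = 50, k = 3, dim(h) = 100) and random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim_emb, k, dim_h = 1000, 50, 3, 100

E  = rng.normal(scale=0.1, size=(V, dim_emb))             # embedding matrix
W1, b1 = rng.normal(scale=0.1, size=(k * dim_emb, dim_h)), np.zeros(dim_h)
W2, b2 = rng.normal(scale=0.1, size=(dim_h, V)),           np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_distribution(context_ids):
    x = np.concatenate([E[w] for w in context_ids])       # x = [v(w1),...,v(wk)]
    h = np.tanh(x @ W1 + b1)                              # h = g(x W1 + b1)
    return softmax(h @ W2 + b2)                           # P(w | w1...wk)

p = next_word_distribution([5, 42, 7])                    # three arbitrary word ids
assert abs(p.sum() - 1.0) < 1e-6
```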
Output embeddings: each column in W2 is a dim(h)-dimensional vector that is associated with a vocabulary item w ∈ V.
h is a dense (non-linear) representation of the context.
Words that are similar appear in similar contexts, hence their columns in W2 should be similar.
Input embeddings: each row in the embedding matrix E is a representation of a word.
Main idea: if you use a feedforward network to predict the probability of words that appear in the context of (near) an input word, the hidden layer of that network provides a dense vector representation of the input word.
Words that appear in similar contexts (that have high distributional similarity) will have very similar vector representations.
These models can be trained on large amounts of raw text (and pretrained embeddings can be downloaded).
Modification of the neural LM:
Task: train a classifier to predict a word from its context (or the context from a word).
Idea: use the dense vector representation that this classifier uses as the embedding of the word.
Variants: CBOW or Skip-Gram; trained with negative sampling (NS) or hierarchical softmax.
[Figure 1: New model architectures. The CBOW architecture predicts the current word w(t) from the sum of the projections of its context words w(t−2), w(t−1), w(t+1), w(t+2); the Skip-gram predicts the surrounding words given the current word w(t).]
CBOW = Continuous Bag of Words
Remove the hidden layer and the order information of the context.
Define the context vector c as the sum of the embedding vectors ci of the context words, and the score s(t,c) as the dot product of the target t with c:
c = ∑i=1…k ci
s(t,c) = t·c
P(+|t,c) = 1 ∕ (1 + exp(−(t·c1 + t·c2 + … + t·ck)))
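A small sketch of this scoring scheme with toy (untrained) vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
t = rng.normal(size=50)                            # target word embedding
context = [rng.normal(size=50) for _ in range(4)]  # c1..c4

c = np.sum(context, axis=0)       # c = sum_i c_i: word order is discarded
s = t @ c                         # s(t, c) = t . c
p_pos = 1 / (1 + np.exp(-s))      # P(+ | t, c)
```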
Skip-gram: don't predict the current word based on its context, but predict the context based on the current word.
Predict the surrounding C words (here, typically C = 10).
Each context word is one training example.
Skip-gram with negative sampling:
Treat the target word and a neighboring context word as positive examples.
Randomly sample other words in the lexicon to get negative samples.
Train a classifier to distinguish those two cases.
Training objective: maximize the log-likelihood of the training data D+ ∪ D−:
L(Θ, D+, D−) = ∑(w,c)∈D+ log P(D=1 | w,c) + ∑(w,c)∈D− log P(D=0 | w,c)
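A direct translation of this objective into code, with toy embeddings and hand-picked positive/negative pairs standing in for D+ and D−:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(2)
emb = {w: rng.normal(size=10) for w in ["apricot", "jam", "aardvark"]}

D_pos = [("apricot", "jam")]        # observed (word, context) pairs
D_neg = [("apricot", "aardvark")]   # sampled noise pairs

loglik = sum(np.log(sigmoid(emb[w] @ emb[c])) for w, c in D_pos) \
       + sum(np.log(1 - sigmoid(emb[w] @ emb[c])) for w, c in D_neg)
```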
Training sentence:
... lemon, a [tablespoon]c1 [of]c2 [apricot]target [jam]c3 [a]c4 pinch ...
Assume context words are those in a ±2 word window.
Intuition:
Words are likely to appear near similar words.
Model similarity with the dot product: Similarity(t,c) ∝ t·c
But the dot product is not a probability! (Neither is the cosine.)
The sigmoid lies between 0 and 1:
σ(x) = 1 ∕ (1 + exp(−x))
P(+|t,c) = 1 ∕ (1 + exp(−t·c))
P(−|t,c) = 1 − 1 ∕ (1 + exp(−t·c)) = exp(−t·c) ∕ (1 + exp(−t·c))
Distinguish “good” (correct) word-context pairs (D=1) from “bad” ones (D=0).
Probabilistic objective:
P(D=1 | t,c) is defined by the sigmoid: P(D=1 | w,c) = 1 ∕ (1 + exp(−s(w,c)))
P(D=0 | t,c) = 1 − P(D=1 | t,c)
P(D=1 | t,c) should be high when (t,c) ∈ D+, and low when (t,c) ∈ D−
Assume all context words c1:k are independent:
P(+ | t, c1:k) = ∏i=1…k 1 ∕ (1 + exp(−t·ci))
log P(+ | t, c1:k) = ∑i=1…k log [1 ∕ (1 + exp(−t·ci))]
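The same independence assumption, written out with toy vectors:

```python
import numpy as np

rng = np.random.default_rng(3)
t = rng.normal(size=10)                         # target embedding
cs = [rng.normal(size=10) for _ in range(4)]    # context embeddings c_1..c_k

log_p_pos = sum(np.log(1 / (1 + np.exp(-t @ c))) for c in cs)
```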
Training data: D+ ∪ D−
D+ = actual examples from the training data
Where do we get D− from?
Lots of options. Word2Vec: for each good pair (w,c), sample k words and add each wi as a negative example (wi,c) to D−
(so D− is k times as large as D+)
Words can be sampled according to (smoothed) corpus frequency; word2vec smooths the frequencies with an exponent of 0.75, which gives relatively more weight to rare words.
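A sketch of drawing noise words from this smoothed unigram distribution (the toy counts are made up; the 0.75 exponent is word2vec's choice):

```python
import numpy as np

rng = np.random.default_rng(4)
vocab  = ["the", "apricot", "jam", "aardvark"]
counts = np.array([1000.0, 50.0, 40.0, 2.0])     # toy corpus frequencies

probs = counts ** 0.75
probs /= probs.sum()             # smoothed unigram distribution

negatives = rng.choice(vocab, size=5, p=probs)   # noise words for one pair
```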
Training sentence:
... lemon, a [tablespoon]c1 [of]c2 [apricot]t [jam]c3 [a]c4 pinch ...
Training data: input/output pairs centering on apricot; assume a ±2 word window.
Positive examples: (apricot, tablespoon), (apricot, of), (apricot, jam), (apricot, a)
For each positive example, create k negative examples, using noise words: (apricot, aardvark), (apricot, puddle), …
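A sketch that builds such (target, context, label) pairs from a ±2 window, with a made-up noise vocabulary for the negatives:

```python
import numpy as np

rng = np.random.default_rng(5)
tokens = "lemon a tablespoon of apricot jam a pinch".split()
noise_vocab = ["aardvark", "puddle", "hello", "dear"]
window, k = 2, 2

pairs = []                                     # (target, context, label)
for i, t in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j == i:
            continue
        pairs.append((t, tokens[j], 1))        # positive example
        for noise in rng.choice(noise_vocab, size=k):
            pairs.append((t, str(noise), 0))   # k negative examples
```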
Summary: How to learn word2vec (skip-gram) embeddings
For a vocabulary of size V:
Start with V random 300-dimensional vectors as initial embeddings.
Train a logistic regression classifier to distinguish words that co-occur in the corpus from those that don't:
pairs of words that co-occur are positive examples,
pairs of words that don't co-occur are negative examples.
Train the classifier to distinguish these by slowly adjusting all the embeddings to improve the classifier's performance.
Throw away the classifier code and keep the embeddings.
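A compact sketch of the corresponding SGD updates; this is plain gradient descent on the negative log-likelihood over a toy pair list, not the full word2vec implementation:

```python
import numpy as np

rng = np.random.default_rng(6)
pairs = [("apricot", "jam", 1), ("apricot", "aardvark", 0),
         ("apricot", "of", 1),  ("apricot", "puddle", 0)]

dim, lr = 50, 0.05
words = {w for t, c, _ in pairs for w in (t, c)}
E_t = {w: rng.normal(scale=0.1, size=dim) for w in words}   # target embeddings
E_c = {w: rng.normal(scale=0.1, size=dim) for w in words}   # context embeddings

for epoch in range(5):
    for t, c, label in pairs:
        p = 1 / (1 + np.exp(-(E_t[t] @ E_c[c])))            # P(D=1 | t, c)
        g = p - label                                       # gradient of the loss
        E_t[t], E_c[c] = E_t[t] - lr * g * E_c[c], E_c[c] - lr * g * E_t[t]

embeddings = E_t    # keep the embeddings, throw the classifier away
```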
Compare to human scores on word similarity-type tasks:
WordSim-353 (Finkelstein et al., 2002)
SimLex-999 (Hill et al., 2015)
Stanford Contextual Word Similarity (SCWS) dataset (Huang et al., 2012)
TOEFL dataset: "Levied is closest in meaning to: imposed, believed, requested, correlated"
Similarity depends on window size C:
C = ±2: the nearest words to Hogwarts are Sunnydale, Evernight
C = ±5: the nearest words to Hogwarts are Dumbledore, Malfoy, half-blood
vector(‘king’) − vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)
vector(‘Paris’) − vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)
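A sketch of answering such analogy queries by nearest-neighbor search; real setups use trained embeddings, the toy vectors here only show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(7)
vocab = ["king", "man", "woman", "queen", "apple"]
E = {w: rng.normal(size=50) for w in vocab}     # stand-ins for trained vectors

query = E["king"] - E["man"] + E["woman"]

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Return the nearest vocabulary item to the query, excluding the inputs:
answer = max((w for w in vocab if w not in {"king", "man", "woman"}),
             key=lambda w: cosine(E[w], query))
```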
Assume you have pre-trained embeddings E. How do you use them in your model?
You can fine-tune E directly on your task. Disadvantage: only words in the training data will be affected.
You can instead keep E fixed and add a correction Δ that is learned for your task: use E′ = E + Δ or E′ = E·T + Δ (this learns to adapt specific words).
Embeddings aren’t just for words!
You can take any discrete input feature (with a fixed number K of outcomes, e.g. POS tags) and learn an embedding matrix for that feature.
Where do we get the input embeddings from?
We can learn the embedding matrix during training. Initialization matters: use random weights, but in a special range (e.g. [−1/(2d), +1/(2d)] for d-dimensional embeddings), or use Xavier initialization.
We can also use pre-trained embeddings. LM-based embeddings are useful for many NLP tasks.
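A sketch of both initialization options for a d-dimensional embedding matrix (V and d are arbitrary here):

```python
import numpy as np

rng = np.random.default_rng(8)
V, d = 1000, 100

# Uniform initialization in the special range [-1/(2d), +1/(2d)]:
E_uniform = rng.uniform(-1 / (2 * d), 1 / (2 * d), size=(V, d))

# Xavier (Glorot) initialization, scaled by fan-in and fan-out:
bound = np.sqrt(6 / (V + d))
E_xavier = rng.uniform(-bound, bound, size=(V, d))
```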
Word2vec (Mikolov et al.): https://code.google.com/archive/p/word2vec/
fastText: http://www.fasttext.cc/
GloVe (Pennington, Socher, Manning): http://nlp.stanford.edu/projects/glove/
The input to a feedforward net has a fixed size. How do we handle variable-length inputs? In particular, how do we handle variable-length sequences?
RNNs handle variable-length sequences. There are three main variants of RNNs, which differ in their internal structure:
basic RNNs (Elman nets)
LSTMs
GRUs
Basic RNN: Modify the standard feedforward architecture (which predicts a string w0…wn one word at a time) such that the output of the current step (wi) is given as additional input to the next time step (when predicting the output for wi+1).
“Output” here typically means the (last) hidden layer.
[Figure: a feedforward net vs. a recurrent net; in the recurrent net, the hidden layer's activations are fed back as additional input at the next time step.]
Each time step corresponds to a feedforward net where the hidden layer gets its input not just from the layer below, but also from the activations of the hidden layer at the previous time step.
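A minimal sketch of one such time step, with toy sizes and random weights:

```python
import numpy as np

rng = np.random.default_rng(9)
dim_in, dim_h = 20, 30
W_xh = rng.normal(scale=0.1, size=(dim_in, dim_h))   # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(dim_h, dim_h))    # hidden -> hidden (recurrence)
b_h  = np.zeros(dim_h)

def rnn_step(x_t, h_prev):
    # The hidden layer sees the current input AND the previous hidden state:
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

h = np.zeros(dim_h)                       # initial hidden state
for x_t in rng.normal(size=(5, dim_in)):  # a length-5 input sequence
    h = rnn_step(x_t, h)
```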
If our vocabulary consists of V words, the output layer (at each time step) has V units, one for each word. The softmax gives a distribution over the V words for the next word.
To compute the probability of a string w0w1…wnwn+1 (where w0 = <s> and wn+1 = </s>), feed in wi as input at time step i and compute
∏i=1…n+1 P(wi | w0…wi−1)
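A sketch of scoring a string this way; for simplicity the input embedding dimension equals the hidden size, and all weights are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(10)
V, dim_h = 100, 30
E    = rng.normal(scale=0.1, size=(V, dim_h))      # input embeddings
W_hh = rng.normal(scale=0.1, size=(dim_h, dim_h))  # recurrent weights
W_hy = rng.normal(scale=0.1, size=(dim_h, V))      # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

word_ids = [0, 42, 7, 1]        # <s> w1 w2 </s>, as vocabulary indices
h, logprob = np.zeros(dim_h), 0.0
for w_in, w_next in zip(word_ids[:-1], word_ids[1:]):
    h = np.tanh(E[w_in] + h @ W_hh)   # consume w_i
    p = softmax(h @ W_hy)             # distribution over the next word
    logprob += np.log(p[w_next])      # accumulate log P(w_{i+1} | w_0..w_i)
```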
To generate a string w0w1…wnwn+1 (where w0 = <s> and wn+1 = </s>), give w0 as the first input, and then pick the next word according to the computed probability P(wi | w0…wi−1). Feed this word in as input at the next time step.
Greedy decoding: always pick the word with the highest probability (this only generates a single sentence; why?)
Sampling: sample according to the given distribution
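A sketch of both decoding strategies over the same toy RNN LM; note that the greedy loop is deterministic, which is why it can only ever produce one sentence:

```python
import numpy as np

rng = np.random.default_rng(11)
V, dim_h = 100, 30
E    = rng.normal(scale=0.1, size=(V, dim_h))
W_hh = rng.normal(scale=0.1, size=(dim_h, dim_h))
W_hy = rng.normal(scale=0.1, size=(dim_h, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

BOS, EOS, max_len = 0, 1, 20      # assumed ids for <s> and </s>

def generate(greedy=True):
    h, w, out = np.zeros(dim_h), BOS, []
    for _ in range(max_len):
        h = np.tanh(E[w] + h @ W_hh)
        p = softmax(h @ W_hy)
        w = int(p.argmax()) if greedy else int(rng.choice(V, p=p))
        if w == EOS:
            break
        out.append(w)
    return out     # greedy: always the same sequence; sampling: varies
```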
In sequence labeling, we want to assign a label or tag ti to each word wi. Now the output layer gives a distribution over the T possible tags. The hidden layer contains information about the previous words and the previous tags.
To compute the probability of a tag sequence t1…tn for a given string w1…wn, feed in wi (and possibly ti−1) as input at time step i and compute P(ti | w1…wi, t1…ti−1).
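A sketch of the per-step tag prediction, assuming the hidden states have already been computed by an RNN:

```python
import numpy as np

rng = np.random.default_rng(12)
dim_h, T = 30, 5
W_ht = rng.normal(scale=0.1, size=(dim_h, T))   # hidden -> tag scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

hs = rng.normal(size=(4, dim_h))   # stand-ins for the RNN's hidden states
tags = [int(softmax(h @ W_ht).argmax()) for h in hs]   # one tag per word
```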
If we just want to assign a label to the entire sequence, we don’t need to produce output at each time step, so we can use a simpler architecture. We can use the hidden state of the last word in the sequence as input to a feedforward net:
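A sketch of this simpler architecture with toy sizes: run the RNN over the sequence, keep only the final hidden state, and classify it:

```python
import numpy as np

rng = np.random.default_rng(13)
dim_in, dim_h, K = 20, 30, 3
W_xh = rng.normal(scale=0.1, size=(dim_in, dim_h))
W_hh = rng.normal(scale=0.1, size=(dim_h, dim_h))
W_hk = rng.normal(scale=0.1, size=(dim_h, K))    # classifier on top

h = np.zeros(dim_h)
for x_t in rng.normal(size=(6, dim_in)):         # a length-6 input sequence
    h = np.tanh(x_t @ W_xh + h @ W_hh)

label = int((h @ W_hk).argmax())   # only the last hidden state is used
```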
We can create an RNN that has “vertical” depth (at each time step) by stacking multiple RNN layers: the hidden-state sequence of one layer becomes the input sequence of the layer above.
Unless we need to generate a sequence, we can run two RNNs over the input sequence: one in the forward direction, and one in the backward direction. Their hidden states will capture different context information.
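A sketch of the bidirectional idea: run the same kind of RNN in both directions and concatenate the two hidden states at each position (random weights, toy sizes):

```python
import numpy as np

rng = np.random.default_rng(14)
dim_in, dim_h = 20, 30

def run_rnn(xs, W_xh, W_hh):
    h, states = np.zeros(dim_h), []
    for x_t in xs:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
        states.append(h)
    return states

xs = list(rng.normal(size=(5, dim_in)))          # a length-5 input sequence
fwd = run_rnn(xs, rng.normal(scale=0.1, size=(dim_in, dim_h)),
                  rng.normal(scale=0.1, size=(dim_h, dim_h)))
bwd = run_rnn(xs[::-1], rng.normal(scale=0.1, size=(dim_in, dim_h)),
                        rng.normal(scale=0.1, size=(dim_h, dim_h)))[::-1]

# Position i now carries both left and right context:
states = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```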
Character and substring embeddings
We can also learn embeddings for individual letters. This helps the model generalize better to rare words, typos, etc. These embeddings can be combined with word embeddings (or used instead of an UNK embedding).
Context-dependent embeddings (ELMo, BERT, …)
Word2Vec etc. are static embeddings: they induce a type-based lexicon that doesn't handle polysemy etc. Context-dependent embeddings produce token-specific embeddings that depend on the particular context in which a word appears.