SLIDE 1 ANLP Lecture 22 Lexical Semantics with Dense Vectors
Shay Cohen (Based on slides by Henry Thompson and Dorota Glowacka) 4 November 2019
Last class
Represent a word by a context vector
◮ Each word x is represented by a vector v = (v1, . . . , vn), where each dimension i corresponds to a context word type yi
◮ Each vi measures the level of association between the word x and context word yi
Pointwise Mutual Information
◮ Set each vi to log2 ( p(x, yi) / (p(x) p(yi)) )
◮ Measures “collocationness”
◮ Vectors have many dimensions and are very sparse (when PMI < 0 is changed to 0)
Similarity metric between v and another context vector w:
◮ The cosine of the angle between v and w:
cos(v, w) = (v · w) / (|v| |w|)
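The two quantities above can be sketched in a few lines of plain Python (an illustrative sketch, not code from the lecture; the function names are mine):

```python
import math

def pmi(count_xy, count_x, count_y, total):
    # Positive PMI: log2 of observed vs. expected co-occurrence probability,
    # with negative values clipped to 0 (as described on the slide)
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    if p_xy == 0.0:
        return 0.0
    return max(0.0, math.log2(p_xy / (p_x * p_y)))

def cosine(v, w):
    # Cosine similarity: (v . w) / (|v| |w|)
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))
```

Note that `pmi(1, 10, 10, 100)` is 0: when the observed co-occurrence probability equals the product of the unigram probabilities, the log ratio is 0.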
Today’s Lecture
◮ How to represent a word with vectors that are short (length 50–1,000) and dense (most values are non-zero)
◮ Why short vectors?
◮ Easier to include as features in machine learning systems
◮ Because they contain fewer parameters, they generalise better and are less prone to overfitting
Roadmap for Main Course of Today
◮ Skip-gram models - relying on the idea of pairing words with dense context and target vectors. If a word co-occurs with a context word wc, then its target vector should be similar to the context vector of wc
◮ The computational problem with skip-gram models
◮ An example solution to this problem: negative-sampling skip-grams
SLIDE 2 Before the Main Course, on PMI and TF-IDF
◮ PMI is one way of trying to detect important co-occurrences, based on the divergence between observed and predicted (from unigram MLEs) bigram probabilities
◮ A different take: a word that is common in only some contexts carries more information than one that is common everywhere. How to formalise this idea?
TF-IDF: Main Idea
Key Idea: Combine the frequency of a term in a context (such as a document) with a measure of how rare it is across contexts
◮ This is formalised under the name tf-idf
◮ tf: term frequency
◮ idf: inverse document frequency
◮ Originally from Information Retrieval, where there are lots of documents, often with lots of words in them
◮ Gives an “importance” level of a term in a specific context
TF-IDF: Combine Two Factors
◮ tf: term frequency of a word t in document d:
tf(t, d) = 1 + log10 count(t, d) if count(t, d) > 0, else 0
where count(t, d) is the frequency count of term t in document d
◮ idf: inverse document frequency:
idf(t) = log (N / dft)
◮ N is the total # of docs in the collection
◮ dft is the # of docs that contain term t
◮ Terms such as the or good have very low idf
◮ because dft ≈ N
◮ tf-idf value for word t in document d:
tf-idf(t, d) = tf(t, d) × idf(t)
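As a minimal sketch of the formulas above (illustrative Python, assuming the common log-scaled tf variant and base-10 logs, since the slide leaves the base implicit):

```python
import math
from collections import Counter

def tf(t, doc):
    # Log-scaled term frequency: 1 + log10(count(t, d)), or 0 if t is absent
    c = Counter(doc)[t]
    return 1.0 + math.log10(c) if c > 0 else 0.0

def idf(t, docs):
    # Inverse document frequency: log10(N / df_t)
    df = sum(1 for d in docs if t in d)
    return math.log10(len(docs) / df) if df > 0 else 0.0

def tfidf(t, doc, docs):
    # tf-idf(t, d) = tf(t, d) * idf(t)
    return tf(t, doc) * idf(t, docs)
```

A word like the, which appears in every document, gets idf = log10(N/N) = 0, so its tf-idf is 0 regardless of how frequent it is.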
Summary: TF-IDF
◮ Compare two words using tf-idf cosine to see if they are similar
◮ Compare two documents
◮ Take the centroid of the vectors of all the terms in the document
◮ The centroid document vector is:
d = (t1 + t2 + · · · + tk) / k
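The centroid is just the elementwise mean of the term vectors; a sketch (illustrative, not from the lecture):

```python
def centroid(vectors):
    # Elementwise mean of the term vectors: d = (t1 + t2 + ... + tk) / k
    k = len(vectors)
    return [sum(v[i] for v in vectors) / k for i in range(len(vectors[0]))]
```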
SLIDE 3
TF-IDF and PMI are Sparse Representations
◮ TF-IDF and PMI vectors
◮ have many dimensions (as many as the size of the vocabulary)
◮ are sparse (most elements are zero)
◮ Alternative: dense vectors, which are
◮ short (length 50–1,000)
◮ dense (most elements are non-zero)
Neural network-inspired dense embeddings
◮ Methods for generating dense embeddings inspired by neural network models
Key idea: Each word in the vocabulary is associated with two vectors: a context vector and a target vector. We try to push these two types of vectors such that the target vector of a word is close to the context vectors of words with which it co-occurs.
◮ This is the main idea, and what is important to understand. Now to the details that make it operational...
Skip-gram modelling (or Word2vec)
◮ Instead of counting how often each word occurs near “apricot”
◮ Train a classifier on a binary prediction task:
◮ Is the word likely to show up near “apricot”?
◮ A by-product of learning this classifier will be the context and target vectors discussed.
◮ These are the parameters of the classifier, and we will use these parameters as our word embeddings.
◮ No need for hand-labelled supervision - use text with co-occurrences
Prediction with Skip-Grams
◮ Each word type w is associated with two dense vectors: v(w) (target vector) and c(w) (context vector)
◮ The skip-gram model predicts each neighbouring word in a context window of L words; e.g. for context window L = 2 the context is [wt−2, wt−1, wt+1, wt+2]
◮ Skip-gram calculates the probability p(wk | wj) by computing the dot product between the context vector c(wk) of word wk and the target vector v(wj) of word wj
◮ The higher the dot product between two vectors, the more similar they are
SLIDE 4 Prediction with Skip-grams
◮ We use the softmax function to normalise the dot products into probabilities:
p(wk | wj) = exp(c(wk) · v(wj)) / Σw∈V exp(c(w) · v(wj))
where V is our vocabulary.
◮ If both fruit and apricot co-occur with delicious, then v(fruit) and v(apricot) should both be similar to c(delicious), and as such, similar to each other
◮ Problem: Computing the denominator requires computing the dot product between each word in V and the target word wj, which may take a long time
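A direct sketch of this softmax computation (illustrative Python; the dictionaries v and c mapping words to vectors are my assumed representation):

```python
import math

def skipgram_prob(target, context, v, c, vocab):
    # p(w_k | w_j) = exp(c(w_k) . v(w_j)) / sum over w in V of exp(c(w) . v(w_j))
    # v: word -> target vector, c: word -> context vector
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    numerator = math.exp(dot(c[context], v[target]))
    denominator = sum(math.exp(dot(c[w], v[target])) for w in vocab)
    return numerator / denominator
```

The denominator loops over the whole vocabulary, which is exactly the cost that negative sampling (next slide) avoids.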
Skip-gram with Negative Sampling
◮ Problem with skip-grams: Computing the denominator requires computing the dot product between each word in V and the target word wj, which may take a long time
Instead:
◮ Given a pair of target and context words, predict + or − (telling whether they co-occur together or not)
◮ This turns the task into a binary classification problem, with no normalisation issue
◮ It is easy to get examples for the + label (words that co-occur)
◮ Where do we get examples for − (words that do not co-occur)?
◮ Solution: randomly sample “negative” examples
Skip-gram with Negative Sampling
◮ Training sentence for example target word apricot:
. . . lemon, a tablespoon apricot preserves jam . . .
◮ Select k = 2 noise words for each of the context words:
n1      n2     n3    n4       w        n5     n6     n7      n8
cement  bacon  dear  coaxial  apricot  hence  never  puddle  …
◮ We want the noise words wni to have a low dot product with the target embedding of w
◮ We want the context word to have a high dot product with the target embedding of w
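Sampling the noise words can be sketched as below (an assumption-laden sketch: the slide does not specify the sampling distribution; word2vec in practice weights unigram counts raised to the 0.75 power, which slightly boosts rare words):

```python
import random

def sample_noise_words(unigram_counts, k, exclude):
    # Draw k noise words weighted by (smoothed) unigram frequency,
    # skipping the current target/context words in `exclude`
    words = [w for w in unigram_counts if w not in exclude]
    weights = [unigram_counts[w] ** 0.75 for w in words]
    return random.choices(words, weights=weights, k=k)
```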
Skip-Gram Goal
To recap:
◮ Given a pair (wt, wc) = target, context
◮ (apricot, jam)
◮ (apricot, aardvark)
return the probability that wc is a real context word:
◮ P(+ | wt, wc)
◮ P(− | wt, wc) = 1 − P(+ | wt, wc)
◮ Learn from examples (wt, wc, ℓ) where ℓ ∈ {+, −} and the negative examples are obtained through sampling
SLIDE 5
How to Compute P(+ | wt, wc)?
Intuition:
◮ Words are likely to appear near similar words
◮ Again use the dot product to indicate a positive/negative label, coupled with logistic regression. This means:
P(+ | wt, wc) = 1 / (1 + exp(−v(wt) · c(wc)))
P(− | wt, wc) = 1 − P(+ | wt, wc) = exp(−v(wt) · c(wc)) / (1 + exp(−v(wt) · c(wc)))
The function σ(x) = 1 / (1 + e−x) is also referred to as “the sigmoid”
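These two probabilities are a one-liner each in Python (illustrative sketch; function names are mine):

```python
import math

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

def p_positive(v_t, c_c):
    # P(+ | w_t, w_c) = sigma(v(w_t) . c(w_c)); P(- | ...) = 1 - P(+ | ...)
    return sigmoid(sum(a * b for a, b in zip(v_t, c_c)))
```

A large positive dot product pushes P(+) towards 1; a large negative one pushes it towards 0; orthogonal vectors give exactly 0.5.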
Skip-gram with Negative Sampling
So the learning objective is to maximise:
log P(+ | wt, wc) + Σi=1..k log P(− | wt, wni)
where we have k negative-sampled words wn1, . . . , wnk
◮ We want to maximise the dot product of a word's target vector with a true context word's context vector
◮ We want to minimise the dot products of the target vector with the context vectors of all the untrue (noise) contexts
◮ How do we maximise this learning objective? Using gradient descent (on the negated objective)
How to Use the Context and Target Vectors?
◮ After this learning process, use:
◮ v(w) as the word embedding, discarding c(w)
◮ Or the concatenation of c(w) with v(w)
A good example of representation learning: through our classifier setup, we learned how to represent words so as to fit the classifier model to the data
Food for thought: are c(w) and v(w) going to be similar for each w? Why?
v(fruit) → c(delicious) → v(apricot) → c(fruit)
Some Real Embeddings
Examples of the closest tokens to some target words using a phrase-based extension of the skip-gram algorithm (Mikolov et al. 2013):
target:  Redmond             Havel                   ninjutsu       graffiti     capitulate
         Redmond Wash.       Vaclav Havel            ninja          spray paint  capitulation
         Redmond Washington  President Vaclav Havel  martial arts   graffiti     capitulated
         Microsoft           Velvet Revolution       swordsmanship  taggers      capitulating
SLIDE 6
Properties of Embeddings
Offsets between embeddings can capture relations between words, e.g. vector(king)+ (vector(woman) − vector(man)) is close to vector(queen) Offsets can also capture grammatical number
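The offset trick can be sketched directly (illustrative Python over a toy embedding table; with real word2vec vectors the nearest neighbour is only approximately the expected word):

```python
import math

def analogy(a, b, c, emb):
    # Return the word whose embedding is closest (by cosine) to
    # emb[b] - emb[a] + emb[c], excluding the three query words;
    # e.g. analogy("man", "king", "woman", ...) should give "queen"
    def cosine(u, w):
        dot = sum(x * y for x, y in zip(u, w))
        return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in w)))
    query = [emb[b][i] - emb[a][i] + emb[c][i] for i in range(len(emb[a]))]
    candidates = [w for w in emb if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(emb[w], query))
```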
Summary
◮ Skip-grams (and related approaches such as continuous bag of words (CBOW)) are often referred to as word2vec
◮ Code is available online - try it!
◮ Very fast to train
◮ Idea: predict rather than count