

slide-1
SLIDE 1

Natural Language Processing

Info 159/259
 Lecture 8: Vector semantics (Sept 19, 2017) David Bamman, UC Berkeley

David Packard, A Concordance to Livy (1968)

slide-2
SLIDE 2

Announcements

  • Homework 2 party today 5-7pm: 202 South Hall
  • DB office hours on Monday 9/25, 10am-noon (no office hours this Friday)
  • No quiz 10/3 or 10/5
slide-3
SLIDE 3

http://dlabctawg.github.io
356 Barrows Hall (D-Lab), Wed 3-5pm

slide-4
SLIDE 4

Recurrent neural network

  • RNNs allow arbitrarily-sized conditioning contexts; they condition on the entire sequence history.

from last time

slide-5
SLIDE 5

Recurrent neural network

Goldberg 2017

from last time

slide-6
SLIDE 6
  • Each time step has two inputs:
  • x_i (the observation at time step i): a one-hot vector, feature vector, or distributed representation.
  • s_{i-1} (the output of the previous state); base case: s_0 = the zero vector.

Recurrent neural network

from last time

slide-7
SLIDE 7

Training RNNs

  • Given this definition of an RNN:

s_i = R(x_i, s_{i-1}) = g(s_{i-1} W^s + x_i W^x + b)
y_i = O(s_i) = s_i W^o + b^o

  • We have five sets of parameters to learn: W^s, W^x, W^o, b, b^o

from last time
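For concreteness, a minimal numpy sketch of that recurrence; the dimensions, the tanh nonlinearity for g, and the random toy inputs are assumptions for illustration, not the lecture's exact setup.

```python
import numpy as np

H, D = 4, 3                       # hidden size and input size (assumed for illustration)
Ws = np.random.randn(H, H) * 0.1  # recurrence weights W^s
Wx = np.random.randn(D, H) * 0.1  # input weights W^x
b  = np.zeros(H)
Wo = np.random.randn(H, 2) * 0.1  # output weights W^o (2 output classes, assumed)
bo = np.zeros(2)

def rnn_step(x_i, s_prev):
    """One step: s_i = g(s_{i-1} W^s + x_i W^x + b), y_i = s_i W^o + b^o."""
    s_i = np.tanh(s_prev @ Ws + x_i @ Wx + b)
    y_i = s_i @ Wo + bo
    return s_i, y_i

# Run over a toy sequence of three input vectors; base case s_0 = 0.
s = np.zeros(H)
for x in np.random.randn(3, D):
    s, y = rnn_step(x, s)
```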

slide-8
SLIDE 8

Lexical semantics

“You shall know a word by the company it keeps”
 [Firth 1957]

slide-9
SLIDE 9

everyone likes ______________
a bottle of ______________ is on the table
______________ makes you drunk
a cocktail with ______________ and seltzer

slide-10
SLIDE 10
  • A few different ways we can encode the notion of “company” (or context).

Context

“You shall know a word by the company it keeps”
 [Firth 1957]

slide-11
SLIDE 11

everyone likes ______________
a bottle of ______________ is on the table
______________ makes you drunk
a cocktail with ______________ and seltzer

context

slide-12
SLIDE 12

Distributed representation

  • Vector representation that encodes information about the distribution of contexts a word appears in
  • Words that appear in similar contexts have similar representations (and similar meanings, by the distributional hypothesis).

slide-13
SLIDE 13

Term-document matrix

Documents: Hamlet, Macbeth, Romeo & Juliet, Richard III, Julius Caesar, Tempest, Othello, King Lear

knife: 1 1 4 2 2 2
dog: 2 6 6 2 12
sword: 17 2 7 12 2 17
love: 64 135 63 12 48
like: 75 38 34 36 34 41 27 44

Context = appearing in the same document.

slide-14
SLIDE 14

Vector

Hamlet: 1 2 17 64 75

King Lear: 2 12 17 48 44

Vector representation of the document; vector size = V

slide-15
SLIDE 15

Vectors

knife: 1 1 4 2 2 2
sword: 17 2 7 12 2 17

Vector representation of the term; vector size = number of documents
slide-16
SLIDE 16

Weighting dimensions

  • Not all dimensions are equally informative
slide-17
SLIDE 17

TF-IDF

  • Term frequency-inverse document frequency
  • A scaling that represents a feature as a function of how frequently it appears in a data point, while accounting for its frequency in the overall collection
  • IDF for a given term = the number of documents in the collection / the number of documents that contain the term

slide-18
SLIDE 18

TF-IDF

  • Term frequency (tf_{t,d}) = the number of times term t occurs in document d
  • Inverse document frequency = inverse fraction of the number of documents containing the term (D_t) among the total number of documents N

tfidf(t, d) = tf_{t,d} × log(N / D_t)
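A minimal sketch of this weighting on a toy collection; the document contents are made up for illustration, and the log follows the formula above.

```python
import math
from collections import Counter

docs = {
    "Hamlet":    "sword sword love like knife".split(),
    "Macbeth":   "knife knife dog sword".split(),
    "King Lear": "love love like dog sword".split(),
}

N = len(docs)
tf = {name: Counter(tokens) for name, tokens in docs.items()}
# D_t = number of documents that contain term t
df = Counter(t for tokens in docs.values() for t in set(tokens))

def tfidf(term, doc):
    return tf[doc][term] * math.log(N / df[term])

print(tfidf("knife", "Macbeth"))  # high tf, term appears in 2 of 3 documents
print(tfidf("sword", "Macbeth"))  # term appears in all documents -> idf = log(1) = 0
```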

slide-19
SLIDE 19

IDF

Documents: Hamlet, Macbeth, Romeo & Juliet, Richard III, Julius Caesar, Tempest, Othello, King Lear

knife: 1 1 4 2 2 2
dog: 2 6 6 2 12
sword: 17 2 7 12 2 17
love: 64 135 63 12 48
like: 75 38 34 36 34 41 27 44

IDF: 0.12 0.20 0.12 0.20

IDF captures the informativeness of the terms when comparing documents

slide-20
SLIDE 20

PMI

  • Mutual information provides a measure of how independent two variables (X and Y) are.
  • Pointwise mutual information measures the independence of two outcomes (x and y)

slide-21
SLIDE 21

PMI

PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]

PMI(w, c) = log2 [ P(w, c) / (P(w) P(c)) ]          (w = word, c = context)

PPMI(w, c) = max( log2 [ P(w, c) / (P(w) P(c)) ], 0 )

  • What’s this value for w and c that never occur together?

slide-22
SLIDE 22

Documents: Hamlet, Macbeth, Romeo & Juliet, Richard III, Julius Caesar, Tempest, Othello, King Lear (plus a total column)

knife: 1 1 4 2 2 2 (total 12)
dog: 2 6 6 2 12 (total 28)
sword: 17 2 7 12 2 17 (total 57)
love: 64 135 63 12 48 (total 322)
like: 75 38 34 36 34 41 27 44 (total 329)
document totals: 159 41 186 119 34 59 27 123 (grand total 748)

PMI(love, R&J) = log2 [ (135/748) / ( (186/748) × (322/748) ) ]
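A minimal sketch that reproduces this calculation from the counts above; the clipping to 0 in the last line is the PPMI definition from the previous slide.

```python
import math

total = 748          # grand total over all terms and documents
count_love_rj = 135  # count of "love" in Romeo & Juliet
count_rj = 186       # column total for Romeo & Juliet
count_love = 322     # row total for "love"

p_wc = count_love_rj / total
p_w = count_love / total
p_c = count_rj / total

pmi = math.log2(p_wc / (p_w * p_c))
ppmi = max(pmi, 0.0)   # PPMI clips negative values (and -inf for unseen pairs) to 0
print(pmi, ppmi)
```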

slide-23
SLIDE 23

Term-term matrix

  • Rows and columns are both words; cell counts = the number of times words wi and wj show up in the same document.
  • More common to define document = some smaller context (e.g., a window of 5 tokens)
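A minimal sketch of counting such term-term co-occurrences with a symmetric token window; the toy sentence and the window size of 5 are assumptions for illustration.

```python
from collections import defaultdict

def term_term_counts(tokens, window=5):
    """Count how often each pair of words co-occurs within `window` tokens."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

tokens = "a cocktail with gin and seltzer makes you drunk".split()
counts = term_term_counts(tokens, window=5)
print(counts["gin"]["seltzer"])  # 1: "seltzer" falls inside gin's window
```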

slide-24
SLIDE 24

Term-document matrix

Documents: Hamlet, Macbeth, Romeo & Juliet, Richard III, Julius Caesar, Tempest, Othello, King Lear

knife: 1 1 4 2 2 2
dog: 2 6 6 2 12
sword: 17 2 7 12 2 17
love: 64 135 63 12 48
like: 75 38 34 36 34 41 27 44

slide-25
SLIDE 25

Term-term matrix

        knife  dog  sword  love  like
knife     6     5     6     5     5
dog       5     5     5     5     5
sword     6     5     6     5     5
love      5     5     5     5     5
like      5     5     5     5     8

slide-26
SLIDE 26

Term-term matrix

Jurafsky and Martin 2017

slide-27
SLIDE 27
  • First-order co-occurrence (syntagmatic association): write co-occurs with book in the same sentence.
  • Second-order co-occurrence (paradigmatic association): book co-occurs with poem (since each co-occurs with write)

write a book
write a poem

slide-28
SLIDE 28

Syntactic context

Lin 1998; Levy and Goldberg 2014

slide-29
SLIDE 29

cos(x, y) = ( Σ_{i=1..F} x_i y_i ) / ( √(Σ_{i=1..F} x_i²) × √(Σ_{i=1..F} y_i²) )

Cosine Similarity

  • We can calculate the cosine similarity of two vectors to judge the degree of their similarity [Salton 1971]
  • Euclidean distance measures the magnitude of distance between two points
  • Cosine similarity measures their orientation
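A minimal sketch of the formula above, using the Hamlet and King Lear count vectors from slide 14 as illustrative inputs.

```python
import math

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

hamlet    = [1, 2, 17, 64, 75]   # counts of knife, dog, sword, love, like
king_lear = [2, 12, 17, 48, 44]
print(cosine(hamlet, king_lear))
```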
slide-30
SLIDE 30

Intrinsic Evaluation

  • Relatedness: correlation (Spearman/Pearson) between vector similarity of a pair of words and human judgments

word 1      word 2        human score
midday      noon          9.29
journey     voyage        9.29
car         automobile    8.94
…           …             …
professor   cucumber      0.31
king        cabbage       0.23

WordSim-353 (Finkelstein et al. 2002)
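A minimal sketch of this evaluation using scipy's Spearman correlation; the three pairs are taken from the table above, and the random vectors are placeholders standing in for a real trained model.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# (word1, word2, human relatedness score) pairs, as in WordSim-353
pairs = [("midday", "noon", 9.29), ("journey", "voyage", 9.29),
         ("professor", "cucumber", 0.31)]

# Hypothetical embeddings; in practice these come from your trained model.
vecs = {w: np.random.randn(50) for pair in pairs for w in pair[:2]}

model_scores = [cosine(vecs[w1], vecs[w2]) for w1, w2, _ in pairs]
human_scores = [h for _, _, h in pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(rho)
```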

slide-31
SLIDE 31

Intrinsic Evaluation

  • Analogical reasoning (Mikolov et al. 2013). For the analogy Germany : Berlin :: France : ???, find the closest vector to v(“Berlin”) - v(“Germany”) + v(“France”)

possibly : impossibly :: certain : uncertain
generating : generated :: shrinking : shrank
think : thinking :: look : looking
Baltimore : Maryland :: Oakland : California
shrinking : shrank :: slowing : slowed
Rabat : Morocco :: Astana : Kazakhstan

slide-32
SLIDE 32

Sparse vectors

A, a, aa, aal, aalii, aam, Aani, aardvark = 1, aardwolf, ..., zymotoxic, zymurgy, Zyrenian, Zyrian, Zyryan, zythem, Zythia, zythum, Zyzomys, Zyzzogeton

“aardvark”: a V-dimensional vector, with a single 1 marking the identity of the element

slide-33
SLIDE 33

Dense vectors

(Figure: a dense, low-dimensional vector with real-valued entries, e.g. 0.7, 1.3, -4.5.)

slide-34
SLIDE 34

Singular value decomposition

  • Any n⨉p matrix X can be decomposed into the product of three matrices (where m = the number of linearly independent rows):

(n ⨉ m) ⨉ (m ⨉ m diagonal) ⨉ (m ⨉ p)

(Figure: an example 3⨉3 matrix being decomposed.)

slide-35
SLIDE 35

Singular value decomposition

  • We can approximate the full matrix by only considering the leftmost k terms in the diagonal matrix

(n ⨉ m) ⨉ (m ⨉ m diagonal) ⨉ (m ⨉ p)

slide-36
SLIDE 36

Singular value decomposition

  • We can approximate the full matrix by only considering the leftmost k terms in the diagonal matrix (the k largest singular values)

(n ⨉ m) ⨉ (m ⨉ m diagonal) ⨉ (m ⨉ p)

slide-37
SLIDE 37

Documents: Hamlet, Macbeth, Romeo & Juliet, Richard III, Julius Caesar, Tempest, Othello, King Lear

knife: 1 1 4 2 2 2
dog: 2 6 6 2 12
sword: 17 2 7 12 2 17
love: 64 135 63 12 48
like: 75 38 34 36 34 41 27 44

(Figure: the SVD factors, one indexed by the terms knife, dog, sword, love, like, and one indexed by the documents Hamlet, Macbeth, Romeo & Juliet, Richard III, Julius Caesar, Tempest, Othello, King Lear.)

slide-38
SLIDE 38

(Figure: the SVD factors, one indexed by the terms knife, dog, sword, love, like, and one indexed by the documents Hamlet, Macbeth, Romeo & Juliet, Richard III, Julius Caesar, Tempest, Othello, King Lear.)

Low-dimensional representation for terms (here 2-dim); low-dimensional representation for documents (here 2-dim)

slide-39
SLIDE 39

Latent semantic analysis

  • Latent Semantic Analysis/Indexing (Deerwester et al. 1990) is this process of applying SVD to the term-document co-occurrence matrix
  • Terms typically weighted by tf-idf
  • This is a form of dimensionality reduction (for terms, from a D-dimensional sparse vector to a K-dimensional dense one), K << D.
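A minimal numpy sketch of this reduction; the toy count matrix and K = 2 are assumptions for illustration, and in practice the counts would first be tf-idf weighted as noted above.

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
X = np.array([[ 1.,  0.,   4.,  2.],
              [17.,  2.,   7., 12.],
              [64.,  0., 135., 63.],
              [75., 38.,  34., 36.]])

U, S, Vt = np.linalg.svd(X, full_matrices=False)

K = 2                            # keep the K largest singular values
term_vecs = U[:, :K] * S[:K]     # K-dimensional dense representation of each term
doc_vecs  = Vt[:K].T * S[:K]     # K-dimensional dense representation of each document

print(term_vecs.shape, doc_vecs.shape)
```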

slide-40
SLIDE 40
  • Learning low-dimensional representations of words by framing a prediction task: using context to predict words in a surrounding window
  • Transform this into a supervised prediction problem; similar to language modeling, but we’re ignoring order within the context window

Dense vectors from prediction

slide-41
SLIDE 41

a cocktail with gin and seltzer

Dense vectors from prediction

x          y
a          gin
cocktail   gin
with       gin
and        gin
seltzer    gin

Window size = 3
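A minimal sketch of generating these (x, y) pairs for every target word in the sentence; filtering to y == "gin" reproduces the table above. The symmetric-window convention is an assumption consistent with the slide.

```python
def training_pairs(tokens, window=3):
    """Yield (context_word, target_word) pairs within a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((tokens[j], target))
    return pairs

tokens = "a cocktail with gin and seltzer".split()
for x, y in training_pairs(tokens, window=3):
    if y == "gin":
        print(x, y)   # a gin / cocktail gin / with gin / and gin / seltzer gin
```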

slide-42
SLIDE 42

Dimensionality reduction

(Figure: the word “the” represented as a one-hot vector over the vocabulary (a, an, for, in, on, …, the = 1, …, dog, cat, …) and as a dense 2-dimensional vector [4.1, -0.9].)

“the” is a point in V-dimensional space; “the” is a point in 2-dimensional space

slide-43
SLIDE 43

(Figure: a feed-forward network with one-hot input x = (x1, x2, x3) over the vocabulary gin, cocktail, globe, a hidden layer h = (h1, h2), and an output y over the same vocabulary; W is the input-to-hidden weight matrix and V the hidden-to-output weight matrix, each filled with real-valued weights.)

slide-44
SLIDE 44

(Figure: the same network, with only the input for “cocktail” set to 1.)

Only one of the inputs is nonzero, so multiplying by x really just selects the corresponding row of W (here, the row for “cocktail”).

slide-45
SLIDE 45

(Figure: a one-hot row vector x multiplied by the weight matrix W; the product xW picks out a single row of W.)

xW = this is the embedding of the context

slide-46
SLIDE 46

Word embeddings

  • Can you predict the output word from a vector representation of the input word?
  • Rather than seeing the input as a one-hot encoded vector specifying the word in the vocabulary we’re conditioning on, we can see it as indexing into the appropriate row in the weight matrix W
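A minimal numpy sketch of that equivalence: multiplying a one-hot vector by W gives exactly the row of W for that word. The vocabulary and weight values are illustrative assumptions echoing the toy figures above.

```python
import numpy as np

vocab = {"gin": 0, "cocktail": 1, "globe": 2}
W = np.array([[-0.5, 1.3],    # row for "gin"
              [ 0.4, 0.08],   # row for "cocktail"
              [ 1.7, 3.1]])   # row for "globe"

# One-hot encoding of "cocktail", multiplied by W ...
x = np.zeros(len(vocab))
x[vocab["cocktail"]] = 1.0
h_matmul = x @ W

# ... is exactly the same as indexing into W's row for "cocktail".
h_lookup = W[vocab["cocktail"]]
assert np.allclose(h_matmul, h_lookup)
```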

slide-47
SLIDE 47
  • Similarly, V has one H-dimensional vector for each element in the vocabulary (for the words that are being predicted)

(Figure: the matrix V with one column of real-valued weights per vocabulary word: gin, cocktail, cat, globe.)

Word embeddings

This is the embedding of the word

slide-48
SLIDE 48

(Figure: the words cat, puppy, dog, wrench, screwdriver plotted on x and y axes.)

slide-49
SLIDE 49

the black cat jumped on the table
the black dog jumped on the table
the black puppy jumped on the table
the black skunk jumped on the table
the black shoe jumped on the table

  • Why this behavior? dog and cat show up in similar positions

slide-50
SLIDE 50
  • Why this behavior? dog and cat show up in similar positions

the black [0.4, 0.08] jumped on the table
the black [0.4, 0.07] jumped on the table
the black puppy jumped on the table
the black skunk jumped on the table
the black shoe jumped on the table

To make the same predictions, these numbers need to be close to each other.

slide-51
SLIDE 51

Dimensionality reduction

(Figure: the word “the” represented as a one-hot vector over the vocabulary (a, an, for, in, on, …, the = 1, …, dog, cat, …) and as a dense 2-dimensional vector [4.1, -0.9].)

“the” is a point in V-dimensional space; representations for all words are completely independent. “the” is a point in 2-dimensional space; representations are now structured.

slide-52
SLIDE 52

Analogical inference

  • Mikolov et al. 2013 show that vector representations have some potential for analogical reasoning through vector arithmetic.

king - man + woman ≈ queen
apple - apples ≈ car - cars

Mikolov et al., (2013), “Linguistic Regularities in Continuous Space Word Representations” (NAACL)
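A minimal sketch of this analogy test by nearest-neighbor search; the vectors here are random placeholders, so with real trained embeddings (e.g., word2vec) the nearest neighbor to the combination would ideally be "queen".

```python
import numpy as np

# Hypothetical embeddings; replace with real trained vectors (e.g. word2vec or GloVe).
vecs = {w: np.random.randn(50) for w in ["king", "man", "woman", "queen", "apple"]}

def nearest(target, exclude):
    """Return the vocabulary word whose vector has the highest cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cos(vecs[w], target))

query = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(query, exclude={"king", "man", "woman"}))
```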

slide-53
SLIDE 53
slide-54
SLIDE 54

Low-dimensional distributed representations

  • Low-dimensional, dense word representations are extraordinarily powerful (and are arguably responsible for much of the gains that neural network models have made in NLP).
  • They let your representation of the input share statistical strength with words that behave similarly in terms of their distributional properties (often synonyms or words that belong to the same class).


slide-55
SLIDE 55
  • The labeled data for a specific task (e.g., labeled sentiment for movie reviews): ~2K labels/reviews, ~1.5M words → used to train a supervised model
  • General text (Wikipedia, the web, books, etc.), ~trillions of words → used to train distributed word representations


Two kinds of “training” data

slide-56
SLIDE 56

Using dense vectors

  • In neural models (CNNs, RNNs, LMs), replace the V-dimensional sparse vector with the much smaller K-dimensional dense one.
  • Can also take the derivative of the loss function with respect to those representations to optimize for a particular task.


slide-57
SLIDE 57

Zhang and Wallace 2016, “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification”

slide-58
SLIDE 58
  • (Short) document-level representation: coordinate-wise max, min, or average; use directly in a neural network [Joulin et al. 2016] (see the sketch below)
  • K-means clustering of vectors into distinct partitions (though beware of strange geometry [Mimno and Thompson 2017])

Using dense vectors
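A minimal sketch of the coordinate-wise averaging option mentioned above; the placeholder random vectors stand in for pretrained embeddings, and skipping out-of-vocabulary words is an assumption.

```python
import numpy as np

# Hypothetical pretrained embeddings; in practice load word2vec/GloVe vectors.
vecs = {w: np.random.randn(50) for w in ["the", "movie", "was", "great"]}

def doc_vector(tokens, vecs, dim=50):
    """Coordinate-wise average of the word vectors (skipping unknown words)."""
    rows = [vecs[t] for t in tokens if t in vecs]
    return np.mean(rows, axis=0) if rows else np.zeros(dim)

doc = "the movie was great".split()
print(doc_vector(doc, vecs).shape)   # a single 50-dim representation of the document
```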

slide-59
SLIDE 59

Eisner et al. (2016), “emoji2vec: Learning Emoji Representations from their Description”

emoji2vec

slide-60
SLIDE 60

node2vec

Grover and Leskovec (2016), “node2vec: Scalable Feature Learning for Networks”

slide-61
SLIDE 61

Trained embeddings

  • Word2vec


https://code.google.com/archive/p/word2vec/

  • GloVe


http://nlp.stanford.edu/projects/glove/

  • Levy/Goldberg dependency embeddings


https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
