SLIDE 1

Distributed Representations

CMSC 473/673 UMBC September 27th, 2017

Some slides adapted from 3SLP

SLIDE 2

Course Announcement: Assignment 2

Due Wednesday October 18th by 11:59 AM
“Capstone:” Perform language id with maxent models on code-switched data

SLIDE 3

Course Announcement: Assignment 2

Due Wednesday October 18th by 11:59 AM
“Capstone:” Perform language id with maxent models on code-switched data

  • 1. Develop intuitions about maxent models & feature design
  • 4. Get credit for successfully implementing the gradient
  • 5. Perform classification with the models
SLIDE 4

Recap from last time…

SLIDE 5

Maxent Objective: Log-Likelihood

Wide range of (negative) numbers; sums are more stable.
Differentiating this becomes nicer (even though Z depends on θ).
The objective is implicitly defined with respect to (wrt) your data on hand.

SLIDE 6

Log-Likelihood Gradient

Each component k is the difference between:

the total value of feature f_k in the training data

and

the total value the current model p_θ expects for feature f_k

SLIDE 7

Log-Likelihood Derivative Derivation

$$\frac{\partial \mathcal{L}}{\partial \theta_k} \;=\; \sum_i f_k(x_i, y_i) \;-\; \sum_i \sum_{y'} f_k(x_i, y')\, p_\theta(y' \mid x_i)$$
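As a sanity check on this formula, here is a minimal NumPy sketch (not from the slides; the dense layout F[i, y, k] = f_k(x_i, y) is an assumption for illustration):

```python
import numpy as np

def loglik_gradient(F, theta, gold):
    """F: (N, Y, K) feature values f_k(x_i, y); theta: (K,); gold: (N,) label ids."""
    scores = F @ theta                           # theta . f(x_i, y), shape (N, Y)
    scores -= scores.max(axis=1, keepdims=True)  # stabilize before exponentiating
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)            # p_theta(y | x_i)
    observed = F[np.arange(len(gold)), gold].sum(axis=0)  # sum_i f(x_i, y_i)
    expected = np.einsum("iy,iyk->k", p, F)      # sum_i E_{y'~p}[f(x_i, y')]
    return observed - expected                   # one gradient entry per feature k
```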

SLIDE 8

N-gram Language Models

predict the next word given some context…

$$p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \;\propto\; \mathrm{count}(w_{i-3}, w_{i-2}, w_{i-1}, w_i)$$

[Figure: the context words w_{i-3}, w_{i-2}, w_{i-1} feed beliefs about what word w_i is likely]
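A toy sketch of estimating this from counts (tokenization and helper names are assumptions, not part of the slides):

```python
from collections import Counter

def ngram_counts(tokens, n=4):
    """Count all n-grams (here 4-grams, matching the three-word context above)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def p_next(counts, context, word):
    """p(word | context) proportional to count(context + word)."""
    total = sum(c for ng, c in counts.items() if ng[:-1] == tuple(context))
    return counts[tuple(context) + (word,)] / total if total else 0.0
```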

SLIDE 9

Maxent Language Models

predict the next word given some context… [Figure: as before, w_{i-3}, w_{i-2}, w_{i-1} feed beliefs about the likely w_i]

$$p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \;\propto\; \mathrm{softmax}(\theta \cdot f(w_{i-3}, w_{i-2}, w_{i-1}, w_i))$$

SLIDE 10

Neural Language Models

predict the next word given some context… [Figure: as before, w_{i-3}, w_{i-2}, w_{i-1} feed beliefs about the likely w_i]

$$p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \;\propto\; \mathrm{softmax}(\theta_{w_i} \cdot \mathbf{f}(w_{i-3}, w_{i-2}, w_{i-1}))$$

[Figure: create/use “distributed representations” e_{i-3}, e_{i-2}, e_{i-1}; combine these representations, C = f(e_{i-3}, e_{i-2}, e_{i-1}), via a matrix-vector product; score each candidate word w_i against its output parameters θ_{w_i}]
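A minimal forward-pass sketch in the spirit of such a feed-forward neural LM; the sizes, the tanh combiner, and all variable names are illustrative assumptions:

```python
import numpy as np

V, d, h = 10_000, 64, 128            # vocab size, embedding dim, hidden dim (made up)
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))          # embedding matrix: one row e_w per word
W, b = rng.normal(size=(h, 3 * d)), np.zeros(h)
Theta = rng.normal(size=(V, h))      # output parameters theta_{w_i}, one row per word

def next_word_probs(w3, w2, w1):
    e = np.concatenate([E[w3], E[w2], E[w1]])  # look up e_{i-3}, e_{i-2}, e_{i-1}
    C = np.tanh(W @ e + b)                     # combine the representations: C = f(...)
    scores = Theta @ C                         # theta_{w_i} . C for every candidate w_i
    scores -= scores.max()                     # stabilize the softmax
    p = np.exp(scores)
    return p / p.sum()                         # distribution over the whole vocabulary
```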

SLIDE 12

Word Similarity → Plagiarism Detection

SLIDE 13

Distributional models of meaning = vector-space models of meaning = vector semantics

Zellig Harris (1954):

“oculist and eye-doctor … occur in almost the same environments”

“If A and B have almost identical environments we say that they are synonyms.”

Firth (1957):

“You shall know a word by the company it keeps!”

SLIDE 14

Continuous Meaning

The paper reflected the truth.

SLIDE 15

Continuous Meaning

The paper reflected the truth.

[Figure: “reflected”, “paper”, “truth” shown as points in a continuous space]

SLIDE 16

Continuous Meaning

The paper reflected the truth.

[Figure: “reflected”, “paper”, “truth” alongside “glean”, “hide”, “falsehood” in the same space]

SLIDE 17

(Some) Properties of Embeddings

Capture “like” (similar) words

Mikolov et al. (2013)

SLIDE 18

(Some) Properties of Embeddings

Capture “like” (similar) words Capture relationships

Mikolov et al. (2013)

vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)
vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)
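A generic sketch of this vector arithmetic (the random vectors below are stand-ins; only real trained embeddings make the nearest neighbor meaningful):

```python
import numpy as np

def nearest(query, vectors, exclude=()):
    """Return the word whose vector has the highest cosine with the query."""
    cos = lambda v, w: v @ w / (np.linalg.norm(v) * np.linalg.norm(w))
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cos(query, vectors[w]))

rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in ["king", "man", "woman", "queen"]}
target = vectors["king"] - vectors["man"] + vectors["woman"]
# With trained embeddings this tends to return 'queen'; here it does so trivially,
# since the three input words are excluded from the search.
print(nearest(target, vectors, exclude={"king", "man", "woman"}))
```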

SLIDE 19-21

Semantic Projection

[Figures only; no recoverable text]

SLIDE 22

“You shall know a word by the company it keeps!” Firth (1957)

                 battle  soldier  fool  clown
As You Like It        1        2    37      6
Twelfth Night         1        2    58    117
Julius Caesar         8       12     1      0
Henry V              15       36     5      0

document (↓) - word (→) count matrix

SLIDE 23

“You shall know a word by the company it keeps!” Firth (1957)

(same document-word count matrix)

basic bag-of-words counting

SLIDE 24

“You shall know a word by the company it keeps!” Firth (1957)

(same document-word count matrix)

Assumption: Two documents are similar if their vectors are similar

SLIDE 25

“You shall know a word by the company it keeps!” Firth (1957)

(same document-word count matrix)

Assumption: Two words are similar if their vectors are similar

SLIDE 26

“You shall know a word by the company it keeps!” Firth (1957)

(same document-word count matrix)

Assumption: Two words are similar if their vectors are similar
Issue: Count word vectors are very large, sparse, and skewed!

SLIDE 27

“You shall know a word by the company it keeps!” Firth (1957)

           apricot  pineapple  digital  information
aardvark         0          0        0            0
computer         0          0        2            1
data             0          0        1            6
pinch            1          1        0            0
result           0          0        1            4
sugar            1          1        0            0

context (↓) - word (→) count matrix

Context: those other words within a small “window” of a target word

SLIDE 28

“You shall know a word by the company it keeps!” Firth (1957)

(same context-word count matrix)

Example: “a cloud computer stores digital data on a remote computer”

Context: those other words within a small “window” of a target word
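A sketch of window-based counting over the slide’s example sentence (the window size and function names are assumptions):

```python
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """counts[(context_word, target_word)] over a +/- `window` token window."""
    counts = defaultdict(int)
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(tokens[j], target)] += 1
    return counts

sent = "a cloud computer stores digital data on a remote computer".split()
counts = cooccurrence(sent)
print(counts[("digital", "data")])   # 1: "digital" occurs within 2 words of "data"
```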

SLIDE 29

“You shall know a word by the company it keeps!” Firth (1957)

(same context-word count matrix)

The size of the window depends on your goals:
the shorter the window (±1-3 words), the more syntactic (“syntax-y”) the representation;
the longer the window (±4-10 words), the more semantic (“semantic-y”) the representation.

SLIDE 30

“You shall know a word by the company it keeps!” Firth (1957)

(same context-word count matrix)

Assumption: Two words are similar if their vectors are similar
Issue: Count word vectors are very large, sparse, and skewed!
Context: those other words within a small “window” of a target word

SLIDE 31

Four kinds of vector models

Sparse vector representations:

  • 1. Mutual-information weighted word co-occurrence matrices

Dense vector representations:

  • 2. Singular value decomposition / Latent Semantic Analysis
  • 3. Neural-network-inspired models (skip-grams, CBOW)
  • 4. Brown clusters
SLIDE 32

Shared Intuition

Model the meaning of a word by “embedding” it in a vector space.
The meaning of a word is a vector of numbers.
Contrast: in many computational linguistics applications, word meaning is represented by a vocabulary index (“word number 545”) or the string itself.

SLIDE 33

What’s the Meaning of Life?

SLIDE 34

What’s the Meaning of Life?

LIFE’

SLIDE 35

What’s the Meaning of Life?

LIFE’ (.478, -.289, .897, …)

SLIDE 36

“Embeddings” Did Not Begin In This Century

Hinton (1986): “Learning Distributed Representations of Concepts”

Deerwester et al. (1990): “Indexing by Latent Semantic Analysis”

Brown et al. (1992): “Class-based n-gram models of natural language”

SLIDE 37

Four kinds of vector models

Sparse vector representations:

  • 1. Mutual-information weighted word co-occurrence matrices

Dense vector representations:

  • 2. Singular value decomposition / Latent Semantic Analysis
  • 3. Neural-network-inspired models (skip-grams, CBOW)
  • 4. Brown clusters

You already saw some of this in assignment 1 (question 3)!

SLIDE 38

Pointwise Mutual Information (PMI): Dealing with Problems of Raw Counts

Raw word frequency is not a great measure of association between words

It’s very skewed: “the” and “of” are very frequent, but maybe not the most discriminative

We’d rather have a measure that asks whether a context word is particularly informative about the target word.

(Positive) Pointwise Mutual Information ((P)PMI)

SLIDE 39

Pointwise Mutual Information (PMI): Dealing with Problems of Raw Counts


Pointwise mutual information:

Do events x and y co-occur more than if they were independent?
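Written out, the standard definition (Church & Hanks, 1989) is:

$$\mathrm{PMI}(x, y) \;=\; \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$

PMI is zero when x and y are independent, positive when they co-occur more often than chance, and negative when they co-occur less often. Positive PMI (PPMI) clamps the negative values: $\mathrm{PPMI}(x, y) = \max(\mathrm{PMI}(x, y), 0)$.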

SLIDE 40

Pointwise Mutual Information (PMI): Dealing with Problems of Raw Counts


PMI between two words (Church & Hanks, 1989):

Do words x and y co-occur more than if they were independent?
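A small NumPy sketch of PPMI weighting applied to a context-word count matrix like the earlier one (an unsmoothed formulation; the function name is ours):

```python
import numpy as np

def ppmi(M):
    """M: context-by-word count matrix; returns the PPMI-weighted matrix."""
    p_xy = M / M.sum()                          # joint probabilities P(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)       # context marginals P(x)
    p_y = p_xy.sum(axis=0, keepdims=True)       # word marginals P(y)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_xy / (p_x * p_y))       # -inf / nan where counts are zero
    pmi[~np.isfinite(pmi)] = 0.0                # zero counts get PPMI 0
    return np.maximum(pmi, 0.0)                 # clamp negative associations
```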

SLIDE 41

“Noun Classification from Predicate-Argument Structure,” Hindle (1990)

Object of “drink”   Count    PMI
it                      3    1.3
anything                3    5.2
wine                    2    9.3
tea                     2   11.8
liquid                  2   10.5

“drink it” is more common than “drink wine”

“wine” is a better “drinkable” thing than “it”

SLIDE 42

Four kinds of vector models

Sparse vector representations:

  • 1. Mutual-information weighted word co-occurrence matrices

Dense vector representations:

  • 2. Singular value decomposition / Latent Semantic Analysis
  • 3. Neural-network-inspired models (skip-grams, CBOW)
  • 4. Brown clusters

Learn more in:

  • Your project
  • Paper (673)
  • Other classes (478/678)
SLIDE 44

Brown clustering (Brown et al., 1992)

An agglomerative clustering algorithm that clusters words based on which words precede or follow them.

These word clusters can be turned into a kind of vector.

SLIDE 45

Brown Clusters as vectors

Build a binary tree from bottom to top based on how clusters are merged.
Each word is represented by a binary string = the path from root to leaf.
Each intermediate node is a cluster.

[Figure: a binary merge tree whose leaves include CEO, chairman, president, November, with bit-string paths such as 000, 001, 0010, 0011 and intermediate clusters 00, 01, … up to the root]

In practice, use an available implementation: https://github.com/percyliang/brown-cluster
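One common way to turn these bit strings into features is to take prefixes at several depths, so coarser prefixes give coarser clusters; a sketch with made-up paths (not the tree in the figure):

```python
# Hypothetical Brown-cluster paths; real ones come from e.g. percyliang/brown-cluster.
brown_paths = {"CEO": "0010", "chairman": "0011", "president": "000", "November": "010"}

def prefix_features(word, lengths=(2, 3)):
    """Bit-string prefixes of the word's path, usable as categorical features."""
    path = brown_paths[word]
    return {f"brown_prefix_{n}": path[:n] for n in lengths if n <= len(path)}

print(prefix_features("CEO"))   # {'brown_prefix_2': '00', 'brown_prefix_3': '001'}
```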

SLIDE 46

Brown cluster examples

SLIDE 47

Evaluating Similarity

Extrinsic (task-based, end-to-end) evaluation:

  • Question answering
  • Spell checking
  • Essay grading

Intrinsic evaluation:

  • Correlation between algorithm and human word similarity ratings
    (WordSim353: 353 noun pairs rated 0-10, e.g., sim(plane, car) = 5.77)
  • Taking TOEFL multiple-choice vocabulary tests

SLIDE 48

Cosine: Measuring Similarity

Given two target words v and w, how similar are their vectors?

Dot product or inner product from linear algebra:

High when two vectors have large values in the same dimensions; low (zero, in fact) for orthogonal vectors with zeros in complementary distribution

SLIDE 49

Cosine: Measuring Similarity

Correct for high-magnitude vectors

SLIDE 50

Cosine Similarity

Divide the dot product by the lengths of the two vectors: this is the cosine of the angle between them.
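In symbols, with ‖·‖ the vector length:

$$\cos(\vec{v}, \vec{w}) \;=\; \frac{\vec{v} \cdot \vec{w}}{\lVert \vec{v} \rVert\, \lVert \vec{w} \rVert} \;=\; \frac{\sum_i v_i w_i}{\sqrt{\sum_i v_i^2}\, \sqrt{\sum_i w_i^2}}$$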

SLIDE 51

Cosine as a similarity metric

  • -1: vectors point in opposite directions
  • +1: vectors point in the same direction
  • 0: vectors are orthogonal

SLIDE 52

Example: Word Similarity

             large  data  computer
apricot          2     0         0
digital          0     1         2
information      1     6         1

SLIDE 53

Example: Word Similarity

(same count table)

cosine(apricot, information) =
cosine(digital, information) =
cosine(apricot, digital) =

SLIDE 54

Example: Word Similarity

(same count table)

cosine(apricot, information) = (2 + 0 + 0) / (√(4 + 0 + 0) · √(1 + 36 + 1)) = 0.1622
cosine(digital, information) =
cosine(apricot, digital) =

SLIDE 55

Example: Word Similarity

(same count table)

cosine(apricot, information) = (2 + 0 + 0) / (√(4 + 0 + 0) · √(1 + 36 + 1)) = 0.1622
cosine(digital, information) = (0 + 6 + 2) / (√(0 + 1 + 4) · √(1 + 36 + 1)) = 0.5804
cosine(apricot, digital) = (0 + 0 + 0) / (√(4 + 0 + 0) · √(0 + 1 + 4)) = 0.0
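A quick NumPy check reproducing these three numbers from the table above:

```python
import numpy as np

def cosine(v, w):
    return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))

# rows of the count table; dimensions are (large, data, computer)
apricot     = np.array([2.0, 0.0, 0.0])
digital     = np.array([0.0, 1.0, 2.0])
information = np.array([1.0, 6.0, 1.0])

print(round(cosine(apricot, information), 4))   # 0.1622
print(round(cosine(digital, information), 4))   # 0.5804
print(round(cosine(apricot, digital), 4))       # 0.0
```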

SLIDE 56

Other Similarity Measures

SLIDE 57

Adding Morphology, Syntax, and Semantics to Embeddings

Lin (1998): “Automatic Retrieval and Clustering of Similar Words”

Padó and Lapata (2007): “Dependency-based Construction of Semantic Space Models”

Levy and Goldberg (2014): “Dependency-Based Word Embeddings”

Cotterell and Schütze (2015): “Morphological Word Embeddings”

Ferraro et al. (2017): “Frame-Based Continuous Lexical Semantics through Exponential Family Tensor Factorization and Semantic Proto-Roles”

and many more…