SLIDE 1

Distributed Representations

CMSC 473/673 UMBC

Some slides adapted from 3SLP

SLIDE 2

Outline

Recap
– Maxent models
– Basic neural language models

Continuous representations
– Motivation
– Key idea: represent blobs with vectors
– Two common counting types
– Evaluation
– Common continuous representation models

SLIDE 3

Maxent Objective: Log-Likelihood

Given training pairs $(h_i, x_i)$ (e.g., histories and next words in an n-gram LM), the log-likelihood objective is

$$F(\theta) = \log \prod_i p_\theta(x_i \mid h_i) = \sum_i \log p_\theta(x_i \mid h_i) = \sum_i \left[ \theta^\top f(x_i, h_i) - \log Z(h_i) \right]$$

Differentiating this becomes nicer (even though Z depends on θ).

The objective is implicitly defined with respect to (wrt) your data on hand

SLIDE 4

Log-Likelihood Gradient

Each component k of the gradient is the difference between the total value of feature $f_k$ in the training data and the total value the current model $p_\theta$ expects for feature $f_k$:

$$\frac{\partial F}{\partial \theta_k} = \sum_i f_k(x_i, h_i) \;-\; \sum_i \mathbb{E}_{x' \sim p_\theta(\cdot \mid h_i)}\left[ f_k(x', h_i) \right]$$

SLIDE 5

N-gram Language Models

predict the next word $w_i$ given some context $w_{i-3}, w_{i-2}, w_{i-1}$, computing beliefs about what is likely…

$$p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \propto \mathrm{count}(w_{i-3}, w_{i-2}, w_{i-1}, w_i)$$

SLIDE 6

Maxent Language Models

predict the next word $w_i$ given some context $w_{i-3}, w_{i-2}, w_{i-1}$, computing beliefs about what is likely…

$$p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \propto \mathrm{softmax}(\theta \cdot f(w_{i-3}, w_{i-2}, w_{i-1}, w_i))$$

SLIDE 7

Neural Language Models

predict the next word $w_i$ given some context $w_{i-3}, w_{i-2}, w_{i-1}$, computing beliefs about what is likely…

$$p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \propto \mathrm{softmax}(\theta_{w_i} \cdot f(w_{i-3}, w_{i-2}, w_{i-1}))$$

create/use "distributed representations" $e_{i-3}, e_{i-2}, e_{i-1}$ for the context words, then combine these representations (e.g., with a matrix-vector product) into the context function $f$; each candidate word $w$ is scored against its output embedding $\theta_w$.
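A minimal sketch of this pipeline (a Bengio-style feed-forward LM; the dimensions are hypothetical, and the combination $f$ here is concatenation followed by a linear map and tanh):

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, H = 10, 4, 8            # vocab size, embedding size, hidden size

emb = rng.normal(size=(V, E))       # input ("distributed") representations e_w
W = rng.normal(size=(H, 3 * E))     # combines the three context embeddings
out = rng.normal(size=(V, H))       # output embeddings theta_w, one per word

def next_word_probs(ctx):
    """ctx: three word ids (w_{i-3}, w_{i-2}, w_{i-1}) -> p(. | ctx)."""
    c = np.concatenate([emb[w] for w in ctx])   # look up and stack the e's
    g = np.tanh(W @ c)                          # f(w_{i-3}, w_{i-2}, w_{i-1})
    scores = out @ g                            # theta_w . g for every word w
    scores -= scores.max()                      # numerical stability
    p = np.exp(scores)
    return p / p.sum()                          # softmax

print(next_word_probs((1, 5, 3)).round(3))
```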


SLIDE 9

Outline

Recap
– Maxent models
– Basic neural language models

Continuous representations
– Motivation
– Key idea: represent blobs with vectors
– Two common counting types
– Evaluation
– Common continuous representation models

SLIDE 10

Recall from Deck 2: Representing a Linguistic "Blob"

  • 1. An array of sub-blobs: word → array of characters; sentence → array of words
  • 2. Integer representation / one-hot encoding
  • 3. Dense embedding

Let V = vocab size (# types)

  • 1. Represent each word type with a unique integer i, where 0 ≤ i < V
  • 2. Or equivalently, …

– Assign each word to some index i, where 0 ≤ i < V
– Represent each word w with a V-dimensional binary vector $e_w$, where $e_{w,i} = 1$ and all other entries are 0
SLIDE 11

Recall from Deck 2: One-Hot Encoding Example

  • Let our vocab be {a, cat, saw, mouse, happy}
  • V = # types = 5
  • Assign:

a → 4, cat → 2, saw → 3, mouse → 0, happy → 1

How do we represent "cat"? It has index 2, so $e_{\text{cat}} = [0, 0, 1, 0, 0]$.

How do we represent "happy"? It has index 1, so $e_{\text{happy}} = [0, 1, 0, 0, 0]$.
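The same encoding in a few lines of numpy, using the vocabulary assignment above:

```python
import numpy as np

index = {"a": 4, "cat": 2, "saw": 3, "mouse": 0, "happy": 1}
V = len(index)

def one_hot(word):
    e = np.zeros(V)
    e[index[word]] = 1.0
    return e

print(one_hot("cat"))    # [0. 0. 1. 0. 0.]
print(one_hot("happy"))  # [0. 1. 0. 0. 0.]
```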

SLIDE 12

Recall from Deck 2: Representing a Linguistic โ€œBlobโ€

  • 1. An array of sub-blobs: word → array of characters; sentence → array of words
  • 2. Integer representation / one-hot encoding
  • 3. Dense embedding

Let E be some embedding size (often 100, 200, 300, etc.). Represent each word w with an E-dimensional real-valued vector $e_w$.

SLIDE 13

Recall from Deck 2: A Dense Representation (E=2)

SLIDE 14

Maxent Plagiarism Detector?

Given two documents $x_1, x_2$, predict y = 1 (plagiarized) or y = 0 (not plagiarized). What is/are the:

  • Method/steps for predicting?
  • General formulation?
  • Features?
SLIDE 15

Plagiarism Detection: Word Similarity?

SLIDE 16

Distributional Representations

A dense, "low" dimensional vector representation

SLIDE 17

How have we represented words?

Each word is a distinct item

Bijection between the strings and unique integer ids: "cat" → 3, "kitten" → 792, "dog" → 17394. Are "cat" and "kitten" similar?

SLIDE 18

How have we represented words?

Each word is a distinct item

Bijection between the strings and unique integer ids: "cat" → 3, "kitten" → 792, "dog" → 17394. Are "cat" and "kitten" similar?

Equivalently: "One-hot" encoding

Represent each word type w with a vector the size of the vocabulary. This vector has V−1 zero entries and 1 non-zero (one) entry.

SLIDE 19

Distributional Representations

A dense, "low" dimensional vector representation

An E-dimensional vector, often (but not always) real-valued

SLIDE 20

Distributional Representations

A dense, "low" dimensional vector representation

An E-dimensional vector, often (but not always) real-valued. Up till ~2013: E could be any size; 2013–present: E << vocab size.

SLIDE 21

Distributional Representations

A dense, "low" dimensional vector representation

Many values are not 0 (or at least less sparse than one-hot)

An E-dimensional vector, often (but not always) real-valued. Up till ~2013: E could be any size; 2013–present: E << vocab size.

SLIDE 22

Distributional Representations

A dense, "low" dimensional vector representation

These are also called

  • embeddings
  • Continuous representations
  • (word/sentence/โ€ฆ) vectors
  • Vector-space model

Many values are not 0 (or at least less sparse than one-hot)

An E-dimensional vector, often (but not always) real-valued. Up till ~2013: E could be any size; 2013–present: E << vocab size.

SLIDE 23

Distributional models of meaning = vector-space models of meaning = vector semantics

Zellig Harris (1954):

"oculist and eye-doctor … occur in almost the same environments" "If A and B have almost identical environments we say that they are synonyms."

Firth (1957):

"You shall know a word by the company it keeps!"

SLIDE 24

Continuous Meaning

The paper reflected the truth.

SLIDE 25

Continuous Meaning

The paper reflected the truth.

(figure: "reflected," "paper," and "truth" plotted as points in a continuous space)

SLIDE 26

Continuous Meaning

The paper reflected the truth.

(figure: the same plot, now also showing "glean," "hide," and "falsehood")

SLIDE 27

(Some) Properties of Embeddings

Capture "like" (similar) words

Mikolov et al. (2013)

SLIDE 28

(Some) Properties of Embeddings

Capture "like" (similar) words
Capture relationships

Mikolov et al. (2013)

vector('king') − vector('man') + vector('woman') ≈ vector('queen')
vector('Paris') − vector('France') + vector('Italy') ≈ vector('Rome')

SLIDE 29

"Embeddings" Did Not Begin In This Century

Hinton (1986): "Learning Distributed Representations of Concepts"
Deerwester et al. (1990): "Indexing by Latent Semantic Analysis"
Brown et al. (1992): "Class-based n-gram models of natural language"

SLIDE 30

Outline

Recap
– Maxent models
– Basic neural language models

Continuous representations
– Motivation
– Key idea: represent blobs with vectors
– Two common counting types
– Evaluation
– Common continuous representation models

SLIDE 31

Key Ideas

  • 1. Acquire basic contextual statistics (often counts) for each word type w

SLIDE 32

Key Ideas

  • 1. Acquire basic contextual statistics (often counts) for each word type w
  • 2. Extract a real-valued vector v for each word w from those statistics

SLIDE 33

Key Ideas

  • 1. Acquire basic contextual statistics (often counts) for each word type w
  • 2. Extract a real-valued vector v for each word w from those statistics
  • 3. Use the vectors to represent each word in later tasks

SLIDE 34

Key Ideas: Generalizing to "blobs"

  • 1. Acquire basic contextual statistics (often counts) for each blob type w
  • 2. Extract a real-valued vector v for each blob w from those statistics
  • 3. Use the vectors to represent each blob in later tasks

SLIDE 35

Outline

Recap
– Maxent models
– Basic neural language models

Continuous representations
– Motivation
– Key idea: represent blobs with vectors
– Two common counting types
– Evaluation
– Common continuous representation models

SLIDE 36

"Acquire basic contextual statistics (often counts) for each word type w"

  • Two basic, initial counting approaches:
– Record which words appear in which documents
– Record which words appear together
  • These are good first attempts, but with some large downsides

SLIDE 37

"You shall know a word by the company it keeps!" Firth (1957)

document (↓)–word (→) count matrix:

                 battle  soldier  fool  clown
As You Like It        1        2    37      6
Twelfth Night         1        2    58    117
Julius Caesar         8       12     1      0
Henry V              15       36     5      0

SLIDE 38

"You shall know a word by the company it keeps!" Firth (1957)

(the same document (↓)–word (→) count matrix as above)

basic bag-of-words counting

SLIDE 39

"You shall know a word by the company it keeps!" Firth (1957)

(the same document (↓)–word (→) count matrix as above)

Assumption: Two documents are similar if their vectors are similar

SLIDE 40

"You shall know a word by the company it keeps!" Firth (1957)

(the same document (↓)–word (→) count matrix as above)

Assumption: Two words are similar if their vectors are similar

SLIDE 41

"You shall know a word by the company it keeps!" Firth (1957)

(the same document (↓)–word (→) count matrix as above)

Assumption: Two words are similar if their vectors are similar
Issue: Count word vectors are very large, sparse, and skewed!

SLIDE 42

"You shall know a word by the company it keeps!" Firth (1957)

context (↓)–word (→) count matrix:

          apricot  pineapple  digital  information
aardvark        0          0        0            0
computer        0          0        2            1
data            0          0        1            6
pinch           1          1        0            0
result          0          0        1            4
sugar           1          1        0            0

Context: those other words within a small "window" of a target word

SLIDE 43

"You shall know a word by the company it keeps!" Firth (1957)

(the same context (↓)–word (→) count matrix as above)

Example: "a cloud computer stores digital data on a remote computer"

Context: those other words within a small "window" of a target word

SLIDE 44

"You shall know a word by the company it keeps!" Firth (1957)

(the same context (↓)–word (→) count matrix as above)

The size of the window depends on your goals:
– the shorter the window, the more syntactic the representation (± 1-3: more "syntax-y")
– the longer the window, the more semantic the representation (± 4-10: more "semantic-y")
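A small sketch of window-based co-occurrence counting (a toy corpus and a ±2 window are assumed here):

```python
from collections import defaultdict

def cooccurrence_counts(corpus, window=2):
    """corpus: list of tokenized sentences -> counts[context][word]."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in corpus:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[sent[j]][word] += 1   # context row, target column
    return counts

corpus = [["a", "cloud", "computer", "stores", "digital", "data"]]
counts = cooccurrence_counts(corpus, window=2)
print(counts["digital"]["data"])   # 1: "digital" appears in the window of "data"
```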

SLIDE 45

"You shall know a word by the company it keeps!" Firth (1957)

(the same context (↓)–word (→) count matrix as above)

Assumption: Two words are similar if their vectors are similar
Issue: Count word vectors are very large, sparse, and skewed!
Context: those other words within a small "window" of a target word

SLIDE 46

Outline

Recap
– Maxent models
– Basic neural language models

Continuous representations
– Motivation
– Key idea: represent blobs with vectors
– Two common counting types
– Evaluation
– Common continuous representation models

SLIDE 47

Evaluating Similarity

Extrinsic (task-based, end-to-end) evaluation:
– Question answering
– Spell checking
– Essay grading

Intrinsic evaluation:
– Correlation between algorithm and human word similarity ratings (WordSim353: 353 noun pairs rated 0–10, e.g. sim(plane, car) = 5.77)
– Taking TOEFL multiple-choice vocabulary tests

SLIDE 48

Cosine: Measuring Similarity

Given two target words v and w, how similar are their vectors? Start with the dot product (inner product) from linear algebra.

High when two vectors have large values in the same dimensions; low for orthogonal vectors with zeros in complementary distribution.

SLIDE 49

Cosine: Measuring Similarity

Given two target words v and w, how similar are their vectors? Start with the dot product (inner product) from linear algebra.

High when two vectors have large values in the same dimensions; low for orthogonal vectors with zeros in complementary distribution.

But we must correct for high-magnitude vectors.

SLIDE 50

Cosine Similarity

Divide the dot product by the lengths of the two vectors; this is the cosine of the angle between them:

$$\cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}$$

SLIDE 51

Cosine as a similarity metric

–1: vectors point in opposite directions
+1: vectors point in the same direction
0: vectors are orthogonal

SLIDE 52

Example: Word Similarity

             large  data  computer
apricot          2     0         0
digital          0     1         2
information      1     6         1

$$\cos(\mathbf{x}, \mathbf{y}) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$$

SLIDE 53

Example: Word Similarity

(same counts as above)

cosine(apricot, information) = ?
cosine(digital, information) = ?
cosine(apricot, digital) = ?

$$\cos(\mathbf{x}, \mathbf{y}) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$$

SLIDE 54

Example: Word Similarity

(same counts as above)

$$\cos(\text{apricot}, \text{information}) = \frac{2 + 0 + 0}{\sqrt{4 + 0 + 0}\,\sqrt{1 + 36 + 1}} = 0.1622$$

cosine(digital, information) = ?
cosine(apricot, digital) = ?

SLIDE 55

Example: Word Similarity

(same counts as above)

$$\cos(\text{apricot}, \text{information}) = \frac{2 + 0 + 0}{\sqrt{4 + 0 + 0}\,\sqrt{1 + 36 + 1}} = 0.1622$$

$$\cos(\text{digital}, \text{information}) = \frac{0 + 6 + 2}{\sqrt{0 + 1 + 4}\,\sqrt{1 + 36 + 1}} = 0.5804$$

$$\cos(\text{apricot}, \text{digital}) = \frac{0 + 0 + 0}{\sqrt{4 + 0 + 0}\,\sqrt{0 + 1 + 4}} = 0.0$$
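These worked examples are easy to verify in a few lines of numpy:

```python
import numpy as np

vec = {
    "apricot":     np.array([2.0, 0.0, 0.0]),   # counts for large, data, computer
    "digital":     np.array([0.0, 1.0, 2.0]),
    "information": np.array([1.0, 6.0, 1.0]),
}

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print(round(cosine(vec["apricot"], vec["information"]), 4))  # 0.1622
print(round(cosine(vec["digital"], vec["information"]), 4))  # 0.5804
print(cosine(vec["apricot"], vec["digital"]))                # 0.0
```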

SLIDE 56

Other Similarity Measures

SLIDE 57

Adding Morphology, Syntax, and Semantics to Embeddings

Lin (1998): "Automatic Retrieval and Clustering of Similar Words"
Padó and Lapata (2007): "Dependency-based Construction of Semantic Space Models"
Levy and Goldberg (2014): "Dependency-Based Word Embeddings"
Cotterell and Schütze (2015): "Morphological Word Embeddings"
Ferraro et al. (2017): "Frame-Based Continuous Lexical Semantics through Exponential Family Tensor Factorization and Semantic Proto-Roles"

and many more…

SLIDE 58

Outline

Recap
– Maxent models
– Basic neural language models

Continuous representations
– Motivation
– Key idea: represent blobs with vectors
– Two common counting types
– Evaluation
– Common continuous representation models

SLIDE 59

Shared Intuition

Model the meaning of a word by "embedding" it in a vector space: the meaning of a word is a vector of numbers. Contrast: in many computational linguistic applications, word meaning is instead represented by a vocabulary index ("word number 545") or the string itself.

SLIDE 60

Four kinds of vector models

  • 1. Mutual-information weighted word co-occurrence matrices
  • 2. Singular value decomposition / Latent Semantic Analysis
  • 3. Neural-network-inspired models (skip-grams, CBOW)
  • 4. Brown clusters
SLIDE 61

Four kinds of vector models

  • 1. Mutual-information weighted word co-occurrence matrices
  • 2. Singular value decomposition / Latent Semantic Analysis
  • 3. Neural-network-inspired models (skip-grams, CBOW)
  • 4. Brown clusters

You already saw some of this in assignment 2!

SLIDE 62

Pointwise Mutual Information (PMI): Dealing with Problems of Raw Counts

Raw word frequency is not a great measure of association between words

It's very skewed: "the" and "of" are very frequent, but maybe not the most discriminative

We'd rather have a measure that asks whether a context word is particularly informative about the target word.

(Positive) Pointwise Mutual Information ((P)PMI)

SLIDE 63

Pointwise Mutual Information (PMI): Dealing with Problems of Raw Counts

Pointwise mutual information: Do events x and y co-occur more than if they were independent?

$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$$

SLIDE 64

Pointwise Mutual Information (PMI): Dealing with Problems of Raw Counts

PMI between two words (Church & Hanks, 1989): Do words x and y co-occur more than if they were independent?

$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$$

SLIDE 65

"Noun Classification from Predicate-Argument Structure," Hindle (1990)

Object of "drink"  Count   PMI
it                     3   1.3
anything               3   5.2
wine                   2   9.3
tea                    2  11.8
liquid                 2  10.5

"drink it" is more common than "drink wine," but "wine" is a better "drinkable" thing than "it"

SLIDE 66

Four kinds of vector models

  • 1. Mutual-information weighted word co-occurrence matrices
  • 2. Singular value decomposition / Latent Semantic Analysis
  • 3. Neural-network-inspired models (skip-grams, CBOW)
  • 4. Brown clusters

Learn more in:

  • Your project
  • Paper (673)
  • Other classes (478/678)
SLIDE 67

Four kinds of vector models

  • 1. Mutual-information weighted word co-occurrence matrices
  • 2. Singular value decomposition / Latent Semantic Analysis
  • 3. Neural-network-inspired models (skip-grams, CBOW)
  • 4. Brown clusters
SLIDE 68

Word2Vec

  • Mikolov et al. (2013; NeurIPS): "Distributed Representations of Words and Phrases and their Compositionality"
  • Revisits the context-word approach
  • Learn a model p(c | w) to predict a context word from a target word

SLIDE 69

Word2Vec

  • Mikolov et al. (2013; NeurIPS): "Distributed Representations of Words and Phrases and their Compositionality"
  • Revisits the context-word approach
  • Learn a model p(c | w) to predict a context word from a target word
  • Learn two types of vector representations:

– $h_c \in \mathbb{R}^E$: a vector embedding for each context word c
– $v_w \in \mathbb{R}^E$: a vector embedding for each target word w

$$p(c \mid w) \propto \exp(h_c^\top v_w)$$

SLIDE 70

Word2Vec

(the same context (↓)–word (→) count matrix as above)

Context: those other words within a small "window" of a target word

$$\max_{h, v} \; \sum_{(c, w) \text{ pairs}} \mathrm{count}(c, w) \, \log p(c \mid w)$$

SLIDE 71

Word2Vec

(the same context (↓)–word (→) count matrix as above)

Context: those other words within a small "window" of a target word

$$\max_{h, v} \; \sum_{(c, w) \text{ pairs}} \mathrm{count}(c, w) \left[ h_c^\top v_w - \log \sum_{c'} \exp(h_{c'}^\top v_w) \right]$$

SLIDE 72

Word2Vec has Inspired a Lot of Work

Off-the-shelf embeddings

https://code.google.com/archive/p/word2vec/

Off-the-shelf implementations

https://radimrehurek.com/gensim/models/word2vec.html

Follow-on work

"GloVe: Global Vectors for Word Representation" (Pennington, Socher, and Manning, 2014)

https://nlp.stanford.edu/projects/glove/

Many others; 15,000+ citations
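For example, with the gensim implementation linked above, training and querying a model takes only a few lines (toy corpus here; the parameter names follow gensim 4.x):

```python
from gensim.models import Word2Vec

sentences = [["a", "cloud", "computer", "stores", "digital", "data"],
             ["the", "apricot", "has", "a", "pinch", "of", "sugar"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv["digital"][:5])                 # first 5 dims of the learned vector
print(model.wv.similarity("digital", "data"))  # cosine similarity of two words
```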

SLIDE 73

FastText

  • "Enriching Word Vectors with Subword Information," Bojanowski et al. (2017; TACL)
  • Main idea: learn n-gram embeddings for the target word (not the context) and modify the word2vec model to use these
  • Pre-trained models in 150+ languages
– https://fasttext.cc

SLIDE 74

FastText Details

Main idea: learn n-gram embeddings for the target word (not the context) and modify the word2vec model to use these.

Original word2vec:

$$p(c \mid w) \propto \exp(h_c^\top v_w)$$

FastText:

$$p(c \mid w) \propto \exp\Big(h_c^\top \sum_{\text{n-gram } g \text{ in } w} z_g\Big)$$

SLIDE 75

FastText Details

Main idea: learn n-gram embeddings for the target word (not the context) and modify the word2vec model to use these.

$$p(c \mid w) \propto \exp\Big(h_c^\top \sum_{\text{n-gram } g \text{ in } w} z_g\Big)$$

decompose: fluffy → <fl, flu, luf, uff, ffy, fy>

SLIDE 76

FastText Details

Main idea: learn n-gram embeddings for the target word (not the context) and modify the word2vec model to use these.

$$p(c \mid w) \propto \exp\Big(h_c^\top \sum_{\text{n-gram } g \text{ in } w} z_g\Big)$$

decompose: fluffy → <fl, flu, luf, uff, ffy, fy>

Learn n-gram embeddings $z_g$.

SLIDE 77

FastText Details

Main idea: learn n-gram embeddings for the target word (not the context) and modify the word2vec model to use these.

$$p(c \mid w) \propto \exp\Big(h_c^\top \sum_{\text{n-gram } g \text{ in } w} z_g\Big)$$

decompose: fluffy → <fl, flu, luf, uff, ffy, fy>

Learn n-gram embeddings $z_g$, then deterministically compute word embeddings from them.
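A sketch of the decompose-and-sum step, assuming boundary markers and character trigrams as in the fluffy example (the $z_g$ table below is random just to make it runnable; real FastText hashes n-grams into a fixed number of buckets):

```python
import numpy as np

rng = np.random.default_rng(0)
E = 4

def char_ngrams(word, n=3):
    """Pad with boundary markers, then take all character n-grams."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("fluffy"))   # ['<fl', 'flu', 'luf', 'uff', 'ffy', 'fy>']

# Hypothetical n-gram embedding table z_g.
z = {g: rng.normal(size=E) for g in char_ngrams("fluffy")}

def word_embedding(word):
    """A word's embedding is the sum of its n-gram embeddings."""
    return sum(z[g] for g in char_ngrams(word))

print(word_embedding("fluffy").round(2))
```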

SLIDE 78

SLIDE 79

Contextual Word Embeddings

Word2vec-based models are not context-dependent

Single word type โ†’ single word embedding

If a single word type can have different meanings…

bank, bass, plant, …

… why should we only have one embedding?

SLIDE 80

Contextual Word Embeddings

Word2vec-based models are not context-dependent

Single word type โ†’ single word embedding

If a single word type can have different meanings…

bank, bass, plant, …

… why should we only have one embedding?

Entire task devoted to classifying these meanings:

Word Sense Disambiguation

(we'll get back to it throughout the semester)

SLIDE 81

Contextual Word Embeddings

Growing interest in this. Off-the-shelf use is a bit more difficult: you download and run a model; you can't just download a file of embeddings.

Two to know about (with code):

ELMo: "Deep contextualized word representations," Peters et al. (2018; NAACL)

https://allennlp.org/elmo

BERT: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," Devlin et al. (2019; NAACL)

https://github.com/google-research/bert
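As a sketch of what "download and run a model" looks like: one common way to run the BERT model above is through the HuggingFace transformers library (the snippet assumes that wrapper, not the original repo's API). Note how the same word type, "bank," gets a different vector in each context:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

for sent in ["I deposited cash at the bank .", "We sat on the river bank ."]:
    inputs = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # one vector per wordpiece
    idx = inputs["input_ids"][0].tolist().index(tokenizer.vocab["bank"])
    print(sent, "->", hidden[idx][:4])   # first few dims of "bank"'s vector
```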

SLIDE 82

Your Idea?

SLIDE 83

Four kinds of vector models

  • 1. Mutual-information weighted word co-occurrence matrices
  • 2. Singular value decomposition / Latent Semantic Analysis
  • 3. Neural-network-inspired models (skip-grams, CBOW)
  • 4. Brown clusters
SLIDE 84

Brown clustering (Brown et al., 1992)

An agglomerative clustering algorithm that clusters words based on which words precede or follow them. These word clusters can be turned into a kind of vector (a binary vector).

SLIDE 85

Brown Clusters as vectors

Build a binary tree from bottom to top based on how clusters are merged. Each word is represented by a binary string = the path from the root to its leaf. Each intermediate node is a cluster.

(figure: a binary tree over words; leaves such as CEO, chairman, president, and November sit under paths like 000, 001, 0010, 0011, 01, 010, …)

In practice, use an available implementation: https://github.com/percyliang/brown-cluster
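A sketch of how those bit strings get used as features: prefixes of a word's path give cluster memberships at several granularities (the paths below are hypothetical, chosen to echo the tree in the figure):

```python
# Hypothetical output of a Brown clustering run: word -> root-to-leaf path.
paths = {
    "CEO":       "0010",
    "chairman":  "0011",
    "president": "001",
    "November":  "010",
}

def cluster_features(word, prefix_lengths=(2, 3, 4)):
    """Prefixes of the bit string = cluster ids at coarse-to-fine granularity."""
    p = paths[word]
    return [p[:k] for k in prefix_lengths]

print(cluster_features("CEO"))       # ['00', '001', '0010']
print(cluster_features("chairman"))  # ['00', '001', '0011'] -- shares coarse clusters
```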

SLIDE 86

Brown cluster examples

SLIDE 87

Outline

Recap
– Maxent models
– Basic neural language models

Continuous representations
– Motivation
– Key idea: represent blobs with vectors
– Two common counting types
– Evaluation
– Common continuous representation models