SLIDE 1

Distributed Representations

CMSC 473/673 UMBC September 27th, 2017

Some slides adapted from 3SLP

SLIDE 2

Course Announcement: Assignment 2

Due Wednesday October 18th by 11:59 AM
“Capstone:” Perform language id with maxent models on code-switched data

SLIDE 3

Course Announcement: Assignment 2

Due Wednesday October 18th by 11:59 AM
“Capstone:” Perform language id with maxent models on code-switched data

  • 1. Develop intuitions about maxent models & feature design
  • 4. Get credit for successfully implementing the gradient
  • 5. Perform classification with the models
SLIDE 4

Recap from last time…

SLIDE 5

Maxent Objective: Log-Likelihood

Wide range of (negative) numbers; sums are more stable.
Differentiating this becomes nicer (even though Z depends on θ).
The objective is implicitly defined with respect to (wrt) your data on hand.

SLIDE 6

Log-Likelihood Gradient

Each component k is the difference between:

the total value of feature f_k in the training data

and

the total value the current model p_θ expects for feature f_k

SLIDE 7

Log-Likelihood Derivative Derivation

$$\frac{\partial \mathcal{L}}{\partial \theta_k} \;=\; \sum_i f_k(x_i, y_i) \;-\; \sum_i \sum_{y'} f_k(x_i, y')\, p_\theta(y' \mid x_i)$$
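As a sanity check on this formula, here is a minimal NumPy sketch (not from the slides; the dense layout F[i, y, k] = f_k(x_i, y) is an assumption for illustration):

```python
import numpy as np

def loglik_gradient(F, theta, gold):
    """F: (N, Y, K) feature values f_k(x_i, y); theta: (K,); gold: (N,) label ids."""
    scores = F @ theta                           # theta . f(x_i, y), shape (N, Y)
    scores -= scores.max(axis=1, keepdims=True)  # stabilize before exponentiating
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)            # p_theta(y | x_i)
    observed = F[np.arange(len(gold)), gold].sum(axis=0)  # sum_i f(x_i, y_i)
    expected = np.einsum("iy,iyk->k", p, F)      # sum_i E_{y'~p}[f(x_i, y')]
    return observed - expected                   # one gradient entry per feature k
```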

SLIDE 8

N-gram Language Models

predict the next word given some context…

$$p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \;\propto\; \mathrm{count}(w_{i-3}, w_{i-2}, w_{i-1}, w_i)$$

[Figure: the context words w_{i-3}, w_{i-2}, w_{i-1} feed beliefs about what word w_i is likely]
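A toy sketch of estimating this from counts (tokenization and helper names are assumptions, not part of the slides):

```python
from collections import Counter

def ngram_counts(tokens, n=4):
    """Count all n-grams (here 4-grams, matching the three-word context above)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def p_next(counts, context, word):
    """p(word | context) proportional to count(context + word)."""
    total = sum(c for ng, c in counts.items() if ng[:-1] == tuple(context))
    return counts[tuple(context) + (word,)] / total if total else 0.0
```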

SLIDE 9

Maxent Language Models

predict the next word given some context… [Figure: as before, w_{i-3}, w_{i-2}, w_{i-1} feed beliefs about the likely w_i]

$$p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \;\propto\; \mathrm{softmax}(\theta \cdot f(w_{i-3}, w_{i-2}, w_{i-1}, w_i))$$

SLIDE 10

Neural Language Models

predict the next word given some context… [Figure: as before, w_{i-3}, w_{i-2}, w_{i-1} feed beliefs about the likely w_i]

$$p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \;\propto\; \mathrm{softmax}(\theta_{w_i} \cdot \mathbf{f}(w_{i-3}, w_{i-2}, w_{i-1}))$$

[Figure: create/use “distributed representations” e_{i-3}, e_{i-2}, e_{i-1}; combine these representations, C = f(e_{i-3}, e_{i-2}, e_{i-1}), via a matrix-vector product; score each candidate word w_i against its output parameters θ_{w_i}]
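A minimal forward-pass sketch in the spirit of such a feed-forward neural LM; the sizes, the tanh combiner, and all variable names are illustrative assumptions:

```python
import numpy as np

V, d, h = 10_000, 64, 128            # vocab size, embedding dim, hidden dim (made up)
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))          # embedding matrix: one row e_w per word
W, b = rng.normal(size=(h, 3 * d)), np.zeros(h)
Theta = rng.normal(size=(V, h))      # output parameters theta_{w_i}, one row per word

def next_word_probs(w3, w2, w1):
    e = np.concatenate([E[w3], E[w2], E[w1]])  # look up e_{i-3}, e_{i-2}, e_{i-1}
    C = np.tanh(W @ e + b)                     # combine the representations: C = f(...)
    scores = Theta @ C                         # theta_{w_i} . C for every candidate w_i
    scores -= scores.max()                     # stabilize the softmax
    p = np.exp(scores)
    return p / p.sum()                         # distribution over the whole vocabulary
```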

SLIDE 12

Word Similarity → Plagiarism Detection

SLIDE 13

Distributional models of meaning = vector-space models of meaning = vector semantics

Zellig Harris (1954):

“oculist and eye-doctor … occur in almost the same environments”

“If A and B have almost identical environments we say that they are synonyms.”

Firth (1957):

“You shall know a word by the company it keeps!”

SLIDE 14

Continuous Meaning

The paper reflected the truth.

SLIDE 15

Continuous Meaning

The paper reflected the truth.

[Figure: “reflected”, “paper”, “truth” shown as points in a continuous space]

SLIDE 16

Continuous Meaning

The paper reflected the truth.

[Figure: “reflected”, “paper”, “truth” alongside “glean”, “hide”, “falsehood” in the same space]

SLIDE 17

(Some) Properties of Embeddings

Capture “like” (similar) words

Mikolov et al. (2013)

SLIDE 18

(Some) Properties of Embeddings

Capture “like” (similar) words Capture relationships

Mikolov et al. (2013)

vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)
vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)
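A generic sketch of this vector arithmetic (the random vectors below are stand-ins; only real trained embeddings make the nearest neighbor meaningful):

```python
import numpy as np

def nearest(query, vectors, exclude=()):
    """Return the word whose vector has the highest cosine with the query."""
    cos = lambda v, w: v @ w / (np.linalg.norm(v) * np.linalg.norm(w))
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cos(query, vectors[w]))

rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in ["king", "man", "woman", "queen"]}
target = vectors["king"] - vectors["man"] + vectors["woman"]
# With trained embeddings this tends to return 'queen'; here it does so trivially,
# since the three input words are excluded from the search.
print(nearest(target, vectors, exclude={"king", "man", "woman"}))
```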

SLIDE 19-21

Semantic Projection

[Figures only; no recoverable text]

SLIDE 22

“You shall know a word by the company it keeps!” Firth (1957)

                 battle  soldier  fool  clown
As You Like It        1        2    37      6
Twelfth Night         1        2    58    117
Julius Caesar         8       12     1      0
Henry V              15       36     5      0

document (↓) - word (→) count matrix

SLIDE 23

“You shall know a word by the company it keeps!” Firth (1957)

(same document-word count matrix)

basic bag-of-words counting

SLIDE 24

“You shall know a word by the company it keeps!” Firth (1957)

(same document-word count matrix)

Assumption: Two documents are similar if their vectors are similar

SLIDE 25

“You shall know a word by the company it keeps!” Firth (1957)

(same document-word count matrix)

Assumption: Two words are similar if their vectors are similar

SLIDE 26

“You shall know a word by the company it keeps!” Firth (1957)

(same document-word count matrix)

Assumption: Two words are similar if their vectors are similar
Issue: Count word vectors are very large, sparse, and skewed!

SLIDE 27

“You shall know a word by the company it keeps!” Firth (1957)

           apricot  pineapple  digital  information
aardvark         0          0        0            0
computer         0          0        2            1
data             0          0        1            6
pinch            1          1        0            0
result           0          0        1            4
sugar            1          1        0            0

context (↓) - word (→) count matrix

Context: those other words within a small “window” of a target word

SLIDE 28

“You shall know a word by the company it keeps!” Firth (1957)

(same context-word count matrix)

Example: “a cloud computer stores digital data on a remote computer”

Context: those other words within a small “window” of a target word
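A sketch of window-based counting over the slide’s example sentence (the window size and function names are assumptions):

```python
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """counts[(context_word, target_word)] over a +/- `window` token window."""
    counts = defaultdict(int)
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(tokens[j], target)] += 1
    return counts

sent = "a cloud computer stores digital data on a remote computer".split()
counts = cooccurrence(sent)
print(counts[("digital", "data")])   # 1: "digital" occurs within 2 words of "data"
```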

SLIDE 29

“You shall know a word by the company it keeps!” Firth (1957)

(same context-word count matrix)

The size of the window depends on your goals:
the shorter the window (±1-3 words), the more syntactic (“syntax-y”) the representation;
the longer the window (±4-10 words), the more semantic (“semantic-y”) the representation.

SLIDE 30

“You shall know a word by the company it keeps!” Firth (1957)

(same context-word count matrix)

Assumption: Two words are similar if their vectors are similar
Issue: Count word vectors are very large, sparse, and skewed!
Context: those other words within a small “window” of a target word

SLIDE 31

Four kinds of vector models

Sparse vector representations:

  • 1. Mutual-information weighted word co-occurrence matrices

Dense vector representations:

  • 2. Singular value decomposition / Latent Semantic Analysis
  • 3. Neural-network-inspired models (skip-grams, CBOW)
  • 4. Brown clusters
SLIDE 32

Shared Intuition

Model the meaning of a word by “embedding” it in a vector space.
The meaning of a word is a vector of numbers.
Contrast: in many computational linguistics applications, word meaning is represented by a vocabulary index (“word number 545”) or the string itself.

SLIDE 33

What’s the Meaning of Life?

SLIDE 34

What’s the Meaning of Life?

LIFE’

SLIDE 35

What’s the Meaning of Life?

LIFE’ (.478, -.289, .897, …)

SLIDE 36

“Embeddings” Did Not Begin In This Century

Hinton (1986): “Learning Distributed Representations of Concepts”

Deerwester et al. (1990): “Indexing by Latent Semantic Analysis”

Brown et al. (1992): “Class-based n-gram models of natural language”

SLIDE 37

Four kinds of vector models

Sparse vector representations:

  • 1. Mutual-information weighted word co-occurrence matrices

Dense vector representations:

  • 2. Singular value decomposition / Latent Semantic Analysis
  • 3. Neural-network-inspired models (skip-grams, CBOW)
  • 4. Brown clusters

You already saw some of this in assignment 1 (question 3)!

SLIDE 38

Pointwise Mutual Information (PMI): Dealing with Problems of Raw Counts

Raw word frequency is not a great measure of association between words

It’s very skewed: “the” and “of” are very frequent, but maybe not the most discriminative

We’d rather have a measure that asks whether a context word is particularly informative about the target word.

(Positive) Pointwise Mutual Information ((P)PMI)

SLIDE 39

Pointwise Mutual Information (PMI): Dealing with Problems of Raw Counts


Pointwise mutual information:

Do events x and y co-occur more than if they were independent?
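Written out, the standard definition (Church & Hanks, 1989) is:

$$\mathrm{PMI}(x, y) \;=\; \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$

PMI is zero when x and y are independent, positive when they co-occur more often than chance, and negative when they co-occur less often. Positive PMI (PPMI) clamps the negative values: $\mathrm{PPMI}(x, y) = \max(\mathrm{PMI}(x, y), 0)$.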

SLIDE 40

Pointwise Mutual Information (PMI): Dealing with Problems of Raw Counts


PMI between two words (Church & Hanks, 1989):

Do words x and y co-occur more than if they were independent?
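A small NumPy sketch of PPMI weighting applied to a context-word count matrix like the earlier one (an unsmoothed formulation; the function name is ours):

```python
import numpy as np

def ppmi(M):
    """M: context-by-word count matrix; returns the PPMI-weighted matrix."""
    p_xy = M / M.sum()                          # joint probabilities P(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)       # context marginals P(x)
    p_y = p_xy.sum(axis=0, keepdims=True)       # word marginals P(y)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_xy / (p_x * p_y))       # -inf / nan where counts are zero
    pmi[~np.isfinite(pmi)] = 0.0                # zero counts get PPMI 0
    return np.maximum(pmi, 0.0)                 # clamp negative associations
```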

SLIDE 41

“Noun Classification from Predicate-Argument Structure,” Hindle (1990)

Object of “drink”   Count    PMI
it                      3    1.3
anything                3    5.2
wine                    2    9.3
tea                     2   11.8
liquid                  2   10.5

“drink it” is more common than “drink wine”

“wine” is a better “drinkable” thing than “it”

SLIDE 42

Four kinds of vector models

Sparse vector representations:

  • 1. Mutual-information weighted word co-occurrence matrices

Dense vector representations:

  • 2. Singular value decomposition / Latent Semantic Analysis
  • 3. Neural-network-inspired models (skip-grams, CBOW)
  • 4. Brown clusters

Learn more in:

  • Your project
  • Paper (673)
  • Other classes (478/678)
SLIDE 44

Brown clustering (Brown et al., 1992)

An agglomerative clustering algorithm that clusters words based on which words precede or follow them.

These word clusters can be turned into a kind of vector.

SLIDE 45

Brown Clusters as vectors

Build a binary tree from bottom to top based on how clusters are merged.
Each word is represented by a binary string = the path from root to leaf.
Each intermediate node is a cluster.

[Figure: a binary merge tree whose leaves include CEO, chairman, president, November, with bit-string paths such as 000, 001, 0010, 0011 and intermediate clusters 00, 01, … up to the root]

In practice, use an available implementation: https://github.com/percyliang/brown-cluster
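One common way to turn these bit strings into features is to take prefixes at several depths, so coarser prefixes give coarser clusters; a sketch with made-up paths (not the tree in the figure):

```python
# Hypothetical Brown-cluster paths; real ones come from e.g. percyliang/brown-cluster.
brown_paths = {"CEO": "0010", "chairman": "0011", "president": "000", "November": "010"}

def prefix_features(word, lengths=(2, 3)):
    """Bit-string prefixes of the word's path, usable as categorical features."""
    path = brown_paths[word]
    return {f"brown_prefix_{n}": path[:n] for n in lengths if n <= len(path)}

print(prefix_features("CEO"))   # {'brown_prefix_2': '00', 'brown_prefix_3': '001'}
```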

SLIDE 46

Brown cluster examples

SLIDE 47

Evaluating Similarity

Extrinsic (task-based, end-to-end) evaluation:

  • Question answering
  • Spell checking
  • Essay grading

Intrinsic evaluation:

  • Correlation between algorithm and human word similarity ratings
    (WordSim353: 353 noun pairs rated 0-10, e.g., sim(plane, car) = 5.77)
  • Taking TOEFL multiple-choice vocabulary tests

SLIDE 48

Cosine: Measuring Similarity

Given two target words v and w, how similar are their vectors?

Dot product or inner product from linear algebra:

High when two vectors have large values in the same dimensions; low (zero, in fact) for orthogonal vectors with zeros in complementary distribution

SLIDE 49

Cosine: Measuring Similarity

Correct for high-magnitude vectors

SLIDE 50

Cosine Similarity

Divide the dot product by the lengths of the two vectors: this is the cosine of the angle between them.
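In symbols, with ‖·‖ the vector length:

$$\cos(\vec{v}, \vec{w}) \;=\; \frac{\vec{v} \cdot \vec{w}}{\lVert \vec{v} \rVert\, \lVert \vec{w} \rVert} \;=\; \frac{\sum_i v_i w_i}{\sqrt{\sum_i v_i^2}\, \sqrt{\sum_i w_i^2}}$$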

SLIDE 51

Cosine as a similarity metric

  • -1: vectors point in opposite directions
  • +1: vectors point in the same direction
  • 0: vectors are orthogonal

SLIDE 52

Example: Word Similarity

             large  data  computer
apricot          2     0         0
digital          0     1         2
information      1     6         1

SLIDE 53

Example: Word Similarity

(same count table)

cosine(apricot, information) =
cosine(digital, information) =
cosine(apricot, digital) =

SLIDE 54

Example: Word Similarity

(same count table)

cosine(apricot, information) = (2 + 0 + 0) / (√(4 + 0 + 0) · √(1 + 36 + 1)) = 0.1622
cosine(digital, information) =
cosine(apricot, digital) =

SLIDE 55

Example: Word Similarity

(same count table)

cosine(apricot, information) = (2 + 0 + 0) / (√(4 + 0 + 0) · √(1 + 36 + 1)) = 0.1622
cosine(digital, information) = (0 + 6 + 2) / (√(0 + 1 + 4) · √(1 + 36 + 1)) = 0.5804
cosine(apricot, digital) = (0 + 0 + 0) / (√(4 + 0 + 0) · √(0 + 1 + 4)) = 0.0
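A quick NumPy check reproducing these three numbers from the table above:

```python
import numpy as np

def cosine(v, w):
    return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))

# rows of the count table; dimensions are (large, data, computer)
apricot     = np.array([2.0, 0.0, 0.0])
digital     = np.array([0.0, 1.0, 2.0])
information = np.array([1.0, 6.0, 1.0])

print(round(cosine(apricot, information), 4))   # 0.1622
print(round(cosine(digital, information), 4))   # 0.5804
print(round(cosine(apricot, digital), 4))       # 0.0
```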

SLIDE 56

Other Similarity Measures

SLIDE 57

Adding Morphology, Syntax, and Semantics to Embeddings

Lin (1998): “Automatic Retrieval and Clustering of Similar Words”

Padó and Lapata (2007): “Dependency-based Construction of Semantic Space Models”

Levy and Goldberg (2014): “Dependency-Based Word Embeddings”

Cotterell and Schütze (2015): “Morphological Word Embeddings”

Ferraro et al. (2017): “Frame-Based Continuous Lexical Semantics through Exponential Family Tensor Factorization and Semantic Proto-Roles”

and many more…