SLIDE 1

Distributional Semantics Crash Course

September 11, 2018 CSCI 2952C: Computational Semantics Instructor: Ellie Pavlick HTA: Arun Drelich UTA: Jonathan Chang

SLIDE 2

Agenda

  • Quick Background
  • Your Discussion Questions
  • Step-through of VSMs/word2vec
  • Announcements
SLIDE 3

Agenda

  • Quick Background
  • Your Discussion Questions
  • Step-through of VSMs/word2vec
  • Announcements

… :-D !!! @!#$ ??? >:(

Constant interruptions from you all

SLIDE 4

“You shall know a word by the company it keeps!”

— J. R. Firth

SLIDE 5

Some Historical Context

[Timeline, 1910–2010: Firth, Harris]

SLIDE 6

Some Historical Context

[Timeline, 1910–2010: Firth, Harris; Chomsky (1957: Syntactic Structures); Montague]

SLIDE 7

Some Historical Context

[Timeline, 1910–2010: Firth, Harris; Chomsky (1957: Syntactic Structures); Montague; “Modern” Statistical NLP (1988: IBM Model 1)]

SLIDE 8

Some Historical Context

[Timeline, 1910–2010: Firth, Harris; Chomsky (1957: Syntactic Structures); Montague; “Modern” Statistical NLP (1988: IBM Model 1); Behaviorism (Pavlov, Skinner; 1926: Conditioned Reflexes)]

SLIDE 9

Some Historical Context

[Timeline, 1910–2010: Firth, Harris; Chomsky (1957: Syntactic Structures); Montague; “Modern” Statistical NLP (1988: IBM Model 1); Behaviorism (Pavlov, Skinner; 1926: Conditioned Reflexes); Logic/Computation (Tarski, Church, Turing; 1936: Church-Turing Thesis)]

SLIDE 10

Behaviorism

“Behaviorism was developed with the mandate that only observations that satisfied the criteria of the scientific method, namely that they must be repeatable at different times and by independent observers, were to be admissible as evidence. This effectively dismissed introspection, the main technique of psychologists following Wilhelm Wundt's experimental psychology, the dominant paradigm in psychology in the early twentieth century. Thus, behaviorism can be seen as a form of materialism, denying any independent significance to processes of the mind.”

http://www.newworldencyclopedia.org/entry/Behaviorism

SLIDE 11

Firth (1957)

  • Language is a learned behavior, no different than other learned behaviors
  • Restricted languages and registers
  • Collocations: word types -> meaning
  • Colligations: word categories -> syntax
SLIDE 12

Contextualism vs. “Linguistic Meaning”

SLIDE 13

Contextualism vs. “Linguistic Meaning”

Look-ahead: Frege’s Sense and Reference (for this Thursday)

SLIDE 14

Contextualism vs. “Linguistic Meaning”

SLIDE 15

Contextualism vs. “Linguistic Meaning”

SLIDE 16

Contextualism vs. “Linguistic Meaning”

“the robot” “the autonomous agent” “that little guy”

SLIDE 17

Contextualism vs. “Linguistic Meaning”

“the robot” “the autonomous agent” “that little guy”

SLIDE 18

Contextualism vs. “Linguistic Meaning”

“the robot” “the autonomous agent” “that little guy”

SLIDE 19

Discussion! Firth

  • different contexts for same word “meaning”
  • non-linguistic context, including collocation vs. context, augmented datasets (e.g. tagging)

  • emphasis/speech patterns
  • language vs. dialect
  • slips of the tongue—semantic or prosodic?
  • Alice in Wonderland…what else is lost in translation?
  • learning “online” without first enumerating all the collocations
SLIDE 20

Discussion! VSMs

  • This paper is from 2010—have there been any fundamental advances since?
  • Matrix: multiple levels of context (words, subwords, phrases)? how are patterns chosen? do they make sense out of context? how does context size affect the meaning captured? can we model longer phrases and/or morphological roots on the rows? can we put ngrams on the columns?
  • Frequencies: how should frequent vs. rare events factor into meaning? should/shouldn’t we care more about rare events? what happens with unknown words in the test set?
  • Linear Algebraic Assumptions: what to make of the assumptions about vector spaces, e.g. inverses/associativity? is it fair to say that dimensionality reduction -> “higher order features”? why can’t we represent arbitrary FOL statements?
  • Applications: plagiarism detection? text processing (tokenization/normalization)?
  • Evaluation/Similarity Metrics: should we model relational similarity directly (pair-pattern) or implicitly, via vector arithmetic? could we reduce attributional similarity to relational similarity/when would this help? do these models only work well on “passive” tasks, or can they work in generation tasks which require knowledge/state?
  • Bias/Ethics: how do we prevent these models from encoding biases in the data/evaluations? what are the ethical implications, e.g. “gaming the system” on resume sites, mining personal information?

SLIDE 21

Discussion! word2vec

  • Matrix: word ordering, size of context
  • Frequency: effect of low frequency words, both on rows and columns
  • Representations: what differs between parts of speech? what do polysemous words look like? can these capture different senses and more fine-grained “meanings” (e.g. speaker-dependent, context-dependent)? generalizing to new languages?
  • Vector Arithmetic: what to make of it? why does France - Paris != capital? can this structure be used to build e.g. ontologies? is the a + b - c order-sensitive, or are they hiding some limitations by focusing on this one type of operation?
  • Evaluation/Similarity: can these spaces capture different notions of similarity? why does syntax appear to be easier than semantics? why is it “not surprising” that the NN LM does better than the RNN LM? why is skipgram better than CBOW at semantics? does it have to do with averaging?
  • Loss Functions: would more complex loss functions help to learn e.g. transitive verbs? can analogical reasoning relationships be trained directly/incorporated into loss? can multiple loss functions be combined?
  • Efficiency: does computational complexity matter that much? is the point moot as machines get faster?

SLIDE 22

Vector Space Models

SLIDE 23

You shall know a word by the company it keeps! Words that occur in similar contexts tend to have similar meanings. If words have similar row vectors in a word–context matrix, then they tend to have similar meanings.

Vector Space Models

SLIDE 24

Vector Space Models

Term-Document

[Matrix: rows are terms (markets, below, levinson, olsen, remorse, schuyler, rodents, scrambled, likely, minnesota); columns are documents (doc1–doc10); each cell is a count, e.g. the # of times “remorse” appears in document #4]

SLIDE 25

Vector Space Models

Term-Document

[Same matrix as above]

Documents as bags of words?

SLIDE 26

Vector Space Models

Word-Context

[Matrix: rows are words (markets, below, levinson, olsen, remorse, schuyler, rodents, scrambled, likely, minnesota); columns are context words (chrissie, supernova, berths, landowner, backup, roam, ps, palaiologos, operative, administrative); each cell is a count, e.g. the # of times “remorse” appears next to “landowner”]

SLIDE 27

Vector Space Models

Word-Context

[Same matrix as above; each cell is a count, e.g. the # of times “remorse” appears next to “landowner”]

SLIDE 28

Vector Space Models

Word-Context

[Same matrix as above]

Turney and Pantel note that VSMs aren’t limited to text context

SLIDE 29

Vector Space Models

Pair-Pattern

[Matrix: rows are word pairs (peace/region, enjoyable/block, of/surprise, duties/received, to/morakot, 1942/field, returns/golden, g/overtaken, space/second, infiltrated/hong); columns are patterns (“X and Y”, “the X was Y”, “X has Y”, “Y has X”, “X is Y”, “Y is not X”, “Y or X”, “Y, X”, “the X Yed”, “Y’s X”); each cell is a count, e.g. the # of times the phrase “peace has region” appears]

SLIDE 30

Vector Space Models

Pair-Pattern

[Same matrix as above; each cell is a count, e.g. the # of times the phrase “peace has region” appears]

SLIDE 31

Vector Space Models

Pair-Pattern

[Same matrix as above]

Relationship to Firth’s ideas of word classes/abstraction? Colligation?
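To make the pair-pattern idea concrete, here is a minimal Python sketch (the toy sentences and the helper name pattern_count are invented for illustration; this is not the procedure from the paper). It instantiates a pattern such as “X has Y” with a word pair and counts how often the resulting phrase occurs in a corpus, which corresponds to filling one cell of the matrix above.

def pattern_count(corpus, pair, pattern):
    """Count substring occurrences of `pattern` with X and Y replaced by the word pair."""
    x, y = pair
    phrase = pattern.replace("X", x).replace("Y", y)
    return sum(text.lower().count(phrase) for text in corpus)

# Toy corpus, contrived so the count is non-zero.
corpus = [
    "lasting peace has come to the region",
    "the peace has region-wide support",
]
print(pattern_count(corpus, ("peace", "region"), "X has Y"))  # of times "peace has region" appears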

SLIDE 32

[Row of the Word-Context matrix for “markets”: chrissie 1000, supernova 40, berths 500, landowner 700, backup 400, roam 3, ps 80, palaiologos 100, operative 15, administrative 6]

Vector Space Models

SLIDE 33

[The same “markets” counts shown as a column vector: chrissie 1000, supernova 40, berths 500, landowner 700, backup 400, roam 3, ps 80, palaiologos 100, operative 15, administrative 6]

Vector Space Models
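As a concrete illustration of how such a row vector arises, here is a minimal Python sketch (the tiny corpus and window size are invented for illustration, not taken from the readings). It counts co-occurrences within a symmetric window, so a cell like “remorse next to landowner” can be read off directly.

from collections import defaultdict

def word_context_counts(corpus, window=2):
    """Count how often each word appears within `window` tokens of each context word."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

# Toy corpus, illustrative only.
corpus = [
    "the landowner felt remorse",
    "the landowner sold the markets",
    "remorse kept the landowner awake",
]
counts = word_context_counts(corpus, window=2)
print(counts["remorse"]["landowner"])  # of times "remorse" appears near "landowner"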

SLIDE 34

https://towardsdatascience.com/word2vec-skip-gram-model-part-1-intuition-78614e4d6e0b

SLIDE 35

https://towardsdatascience.com/word2vec-skip-gram-model-part-1-intuition-78614e4d6e0b

SLIDE 36

Clarifications/Procrastinations

  • (Neural) Language Modeling:
  • The quick brown fox ___?
  • Stochastic gradient descent (“SGD”)
  • Back-propagation (“Backprop”)
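As a minimal illustration of the SGD and backprop ideas above (the one-parameter model, data, and learning rate are invented for illustration): sample one example at a time, compute the prediction, differentiate the loss with respect to the parameter by hand, and take a small step downhill.

import random

# Toy data: fit y = w*x with SGD on squared error (illustrative only; true slope is 3.0).
data = [(x, 3.0 * x) for x in range(1, 11)]
w, lr = 0.0, 0.001

for step in range(1000):
    x, y = random.choice(data)     # "stochastic": one example at a time
    y_hat = w * x                  # forward pass
    grad = 2 * (y_hat - y) * x     # d/dw of (y_hat - y)^2, computed by hand ("backprop")
    w -= lr * grad                 # gradient descent update
print(round(w, 3))                 # should end up close to 3.0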
SLIDE 37

CBOW

SLIDE 38

SkipGram
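One way to see the CBOW vs. skip-gram difference is in how training examples are formed from the same window. The sketch below (toy sentence and the helper name training_examples are invented for illustration) shows CBOW predicting the center word from its surrounding context, and skip-gram predicting each context word from the center word.

def training_examples(tokens, window=2):
    """Return (context -> center) pairs for CBOW and (center -> context) pairs for skip-gram."""
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        cbow.append((context, center))                  # CBOW: context predicts center
        skipgram.extend((center, c) for c in context)   # skip-gram: center predicts each context word
    return cbow, skipgram

cbow, skipgram = training_examples("the quick brown fox jumps".split())
print(cbow[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram[:3])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]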

SLIDE 39
  • Cosine — cares about angle but not length
  • Dice/Jaccard — for sets/sparse vectors
  • Metrics with high vs. low frequency biases — What would Firth say?
  • Use as features in ML models (“pretraining”)

Similarity Metrics
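A minimal Python sketch of two of the metrics named above (the vectors and context sets are invented for illustration): cosine depends only on the angle between count vectors, not their lengths, while Jaccard compares the sets of contexts two words occur in.

import math

def cosine(u, v):
    """Cosine similarity: depends on the angle between u and v, not on their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def jaccard(s, t):
    """Jaccard similarity on sets (e.g., the set of contexts each word occurs in)."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t) if s | t else 0.0

print(cosine([1000, 40, 500], [100, 4, 50]))                    # 1.0: same direction, different length
print(jaccard({"landowner", "backup"}, {"landowner", "roam"}))  # 0.333...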

SLIDE 40
  • How much should things like efficiency/scalability matter in a theory of linguistic representation?

Optimizations/ Approximations

SLIDE 41
  • How much should things like efficiency/scalability matter in a theory of linguistic representation?
  • What about computing exactly vs. approximately vs. heuristically?
  • Word embeddings vs. “representation learning”?

Optimizations/ Approximations

SLIDE 42
  • Types vs. tokens
  • Tokenization/Phrasal Collocations — what should we consider to be the “basic units” of the language?

  • Punctuation — “okay…” vs. “okay!”
  • Normalization — “Trump” vs. “trump”
  • Stop words — “pb and jelly” vs. “pb or jelly”
  • Tagging — “fish fish fish fish fish”

Linguistic Preprocessing

SLIDE 43
  • Counts: one-hot, frequency, tf-idf/PMI
  • Limiting vocab size — problems?
  • Subsampling in Skipgram: drop words relative to their frequency—what would Firth say about this? (see the sketch below)
  • Dimensionality/sparsity — does a “bottleneck” lead to better representations?

Mathematical Preprocessing
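A minimal sketch of two of the reweighting ideas above (all counts are invented for illustration; the subsampling rule follows the 1 - sqrt(t/f) discard probability described in the skip-gram paper, though implementations vary in detail): positive PMI computed from raw co-occurrence counts, and frequency-based subsampling that drops very common words most of the time.

import math
import random

def ppmi(count_wc, count_w, count_c, total):
    """Positive pointwise mutual information: max(0, log2 P(w,c) / (P(w)P(c)))."""
    if count_wc == 0:
        return 0.0
    return max(0.0, math.log2((count_wc * total) / (count_w * count_c)))

def keep_word(freq, total, t=1e-5):
    """word2vec-style subsampling: keep a token with probability sqrt(t / f), capped at 1."""
    f = freq / total
    return random.random() < min(1.0, math.sqrt(t / f))

# Illustrative numbers only.
print(ppmi(count_wc=700, count_w=2844, count_c=900, total=100000))  # association strength
print(keep_word(freq=50000, total=1_000_000))  # a very frequent word: usually dropped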

SLIDE 44

Loss Functions

  • Softmax: is the predicted distribution (over all words in the vocabulary) the right one?
  • Hierarchical Softmax: represent the loss function using a binary tree, so we compute the loss over log(V) nodes per word rather than over all V words per word.
  • NCE/Negative Sampling: can you distinguish the real word from a randomly drawn word (or actually, k randomly drawn words)? (sketched below)
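A minimal sketch of the negative-sampling objective just described (the tiny random vectors, the dimension, and k=2 negatives are invented for illustration): score the real (center, context) pair against k randomly drawn context words, using a sigmoid over dot products.

import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """-log sigma(center . context) - sum over negatives of log sigma(-(center . negative))."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    loss = -math.log(sigmoid(dot(center_vec, context_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, neg)))
    return loss

# Tiny random embeddings (dimension 5, k = 2 negatives), illustrative only.
random.seed(0)
def rand_vec():
    return [random.uniform(-0.5, 0.5) for _ in range(5)]

center, context = rand_vec(), rand_vec()
negatives = [rand_vec() for _ in range(2)]
print(negative_sampling_loss(center, context, negatives))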

SLIDE 45

If it isn’t 11:40 or later, then the fact that I am on this slide means you didn’t interrupt enough. If it is 11:40 or later: well done, team!

SLIDE 46

Announcements

  • Reading for Thursday…there is less of it
  • Welcome Jonathan! Office hours TBD (?)
  • Arun’s office hours: 5pm Wednesday
  • My office hours: 5pm this Friday (or some other time?), 4pm Mondays thereafter

SLIDE 47

Assignment 1 is up!

  • Quick overview (Arun)
  • Due September 25 (in 2 weeks)