
An introduction to computational psycholinguistics: Modeling human sentence processing

Shravan Vasishth University of Potsdam, Germany http://www.ling.uni-potsdam.de/∼vasishth vasishth@acm.org September 2005, Bochum

Probabilistic models: (Crocker & Keller, 2005)

  • In ambiguous sentences, a preferred interpretation is immediately assigned, with later backtracking to reanalyze.

    (1) a. The horse raced past the barn fell. (Bever, 1970)
        b. After the student moved the chair broke.
        c. The daughter of the colonel who was standing by the window.
  • On what basis do humans choose one over the other interpretation?
  • One plausible answer: experience. (Can you think of some others?)
  • We’ve seen an instance of how experience could determine parsing decisions: connectionist models.


The role of linguistic experience

  • Experience: the number of times the speaker has encountered a particular entity in the past.

  • It’s impractical to measure or quantify experience of a particular entity over the entire set of linguistic items seen by a speaker; but we can estimate it through (e.g.) corpora and norming (e.g., sentence completion) studies. Consider the S/NP ambiguity of the verb “know”:

    (2) The teacher knew . . .

  • There is a reliable correlation between corpora and norming studies (Lapata, Keller, & Schulte im Walde, 2001).

  • The critical issue is how the human processor uses this experience to resolve ambiguities, and at what level of granularity experience plays a role (lexical, syntactic structure, verb frames).

The granularity issue

  • It’s clear that lexical frequencies play a role. But are frequencies used at the lemma level or the token level? (Roland & Jurafsky, 2002)

  • Structural frequencies: do we use frequencies of individual phrase structure rules? Probabilistic modelers say: yes.


Probabilistic grammars

  • Context-free grammar rules:

    S  -> NP VP
    NP -> Det N

  • Probabilities associated with each rule, derived from a (treebank) corpus:

    1.0 S  -> NP VP
    0.7 NP -> Det N
    0.3 NP -> Nplural

  • A normalization constraint on the PCFG: the probabilities of all rules with the same LHS must sum to 1 (see appendix A of Hale, 2003):

    ∀i  Σ_j P(N^i → ζ_j) = 1    (1)
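The normalization constraint is easy to check mechanically. A minimal Python sketch, using the toy rules above (the probabilities are the slide’s illustrative values):

```python
from collections import defaultdict

# Toy PCFG from the slide: (LHS, RHS, probability).
rules = [
    ("S",  ("NP", "VP"), 1.0),
    ("NP", ("Det", "N"), 0.7),
    ("NP", ("Nplural",), 0.3),
]

def check_normalization(rules, tol=1e-9):
    """For each LHS, the probabilities of its rules must sum to 1."""
    totals = defaultdict(float)
    for lhs, _rhs, p in rules:
        totals[lhs] += p
    return {lhs: abs(total - 1.0) < tol for lhs, total in totals.items()}

print(check_normalization(rules))  # {'S': True, 'NP': True}
```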

  • The probability of a parse tree is the product of the rule probabilities.

    P(t) = Π_{(N→ζ)∈R} P(N → ζ)    (2)

  • Jurafsky (1996) has suggested that the probability of a grammar rule models the ease with which the rule can be accessed by the human sentence processor.

  • An example from Crocker and Keller (2005) shows how this setup can be used to predict parse preferences.

  • Further reading: Manning and Schütze, and Jurafsky and Martin.

  • NLTK demo.
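As a small stand-in for the NLTK demo, here is a self-contained Python sketch that computes a parse tree’s probability as the product of its rule probabilities, per equation (2). The VP rules and all the numbers are invented for illustration:

```python
# Rule probabilities keyed by (LHS, RHS); the values are illustrative only.
probs = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.7,
    ("NP", ("Nplural",)): 0.3,
    ("VP", ("V", "NP")): 0.6,
    ("VP", ("V",)): 0.4,
}

def tree_prob(tree):
    """P(t) = product of the probabilities of all rules used in t.
    A tree is (label, child, ...); a leaf is a bare string."""
    if isinstance(tree, str):
        return 1.0  # terminal: no rule applied here
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = probs[(label, rhs)]
    for c in children:
        p *= tree_prob(c)
    return p

# S -> NP VP, NP -> Det N, VP -> V NP, NP -> Nplural
t = ("S", ("NP", "Det", "N"), ("VP", "V", ("NP", "Nplural")))
print(tree_prob(t))  # 1.0 * 0.7 * 0.6 * 0.3 ≈ 0.126
```

Comparing the probabilities of two competing trees for the same string is then a direct way to predict a parse preference.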


Estimating the rule probabilities and parsing

  • Maximum likelihood: estimates the probability of a rule based on the number of times it occurs in a treebank corpus.

  • Expectation maximization: given a grammar, computes a set of rule probabilities that make the sentences maximally likely.

  • Viterbi algorithm for computing the best parse.
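A minimal sketch of the maximum-likelihood estimator in Python, on a hypothetical two-tree treebank: each rule’s probability is its count divided by the total count of expansions of its LHS.

```python
from collections import defaultdict

# A hypothetical mini-treebank: trees are (label, child, ...), leaves are strings.
treebank = [
    ("S", ("NP", ("Det", "the"), ("N", "horse")), ("VP", ("V", "ran"))),
    ("S", ("NP", ("N", "horses")), ("VP", ("V", "ran"))),
]

def ml_estimates(trees):
    """Relative-frequency (ML) estimate of each rule's probability."""
    rule_counts = defaultdict(int)
    lhs_counts = defaultdict(int)

    def visit(node):
        if isinstance(node, str):
            return
        label, *children = node
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rule_counts[(label, rhs)] += 1
        lhs_counts[label] += 1
        for c in children:
            visit(c)

    for t in trees:
        visit(t)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

est = ml_estimates(treebank)
# NP -> Det N occurs in one of the two NP expansions, so P = 0.5
print(est[("NP", ("Det", "N"))])  # 0.5
```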


Linking probabilities to processing difficulty

  • The goal is usually to model reading times and acceptability judgments.
  • Possible measures:

    – probability ratios of alternatives
    – entropy reduction during incremental parsing: a very different approach from computing probabilities of parses. Hale (2003): “cognitive load is related, perhaps linearly, to the reduction in the perceiver’s uncertainty about what the producer meant.” (p. 102)


Some assumptions in Hale’s approach

  • During comprehension, sentence understanders determine a syntactic structure for the perceived signal.

  • Producer and comprehender share the same grammar.
  • Comprehension is eager; no processing is deferred beyond the first point at which it could happen.

We’re now going to look at the mechanism he builds up, starting with the incredibly beautiful notion of entropy. In the following discussion I rely on Schneider (2005).

An introduction to entropy

Imagine a device D that can emit three symbols A, B, C.

  • Before D can emit anything, we are uncertain about which symbol among the three

possible ones it will emit. We can quantify this uncertainty and say it’s 3.

  • Now a symbol appears, say, A. Our uncertainty decreases. To 2. In other words, we’ve

received some information.

  • Information is decrease in uncertainty.
  • Now suppose that another machine D emits 1 or 2.
  • The composition of D × D results in 6 possible emissions: A1, A2, B1, B2, C1, C2.

So what’s our uncertainty now?

  • It would be nice to be able to talk about increase in uncertainty additively–hence the

use of logarithms.

  • D: log(3) is the new uncertainty; D log(2). D × D=log(6). When we use

base 2, the units of uncertainty are in bits, base 10 (units: digits), e (unit: nats/nits) also possible.

9
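The additivity that motivates logarithms can be checked directly. A quick Python check of the two-device example:

```python
import math

# Uncertainty (in bits) of device D (3 symbols), D' (2 symbols),
# and their composition D x D' (6 symbol pairs).
u_D = math.log2(3)
u_Dprime = math.log2(2)
u_both = math.log2(6)

# Logarithms make uncertainty additive over independent devices:
print(u_both)             # ≈ 2.585 bits
assert math.isclose(u_both, u_D + u_Dprime)
```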


Question

If a device emits only one symbol, what’s the uncertainty? How many bits?


Deriving the formula for uncertainty

Let M be the number of symbol-emissions possible. Uncertainty is now log(M). (From now on, log means log_2.)

    log M = − log(M⁻¹)                          (3)
          = − log(1/M)                          (4)
          = − log(P) . . . [letting P = 1/M]    (5)

Let P_i be the various probabilities of the symbols, such that

    Σ_{i=1}^{M} P_i = 1                         (7)


Surprise

The surprise we get when we see the ith symbol is called “surprisal”:

    u_i = − log(P_i)    (8)

If P_i = 0 then u_i = ∞ . . . we are very surprised.
If P_i = 1 then u_i = 0 . . . we are not surprised.

Average surprisal for an infinite string of symbols: assume a string of length N over M symbols. Let the ith symbol appear N_i times:

    N = Σ_{i=1}^{M} N_i    (9)
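Equation (8) in Python, with the two limiting cases called out:

```python
import math

def surprisal(p):
    """u_i = -log2(P_i); impossible events are infinitely surprising."""
    if p == 0:
        return math.inf
    return -math.log2(p)

print(surprisal(0.5))   # 1.0 bit
print(surprisal(0.0))   # inf -> very surprised
print(surprisal(1.0) == 0.0)  # True -> not surprised
```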

Since there are N_i cases of u_i, the average surprisal is:

    ( Σ_{i=1}^{M} N_i × u_i ) / ( Σ_{i=1}^{M} N_i ) = Σ_{i=1}^{M} (N_i/N) × u_i    (10)

If we measure this for an infinite string of symbols, the frequency N_i/N approaches the probability P_i:

    H = Σ_{i=1}^{M} P_i × u_i          (11)
      = − Σ_{i=1}^{M} P_i log P_i      (12)
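A Python sketch of equations (10)–(12): over a long random string, the average surprisal computed from observed frequencies approaches the entropy H. The symbol probabilities are invented for illustration:

```python
import math
import random

def entropy(probs):
    """H = -sum_i P_i log2 P_i (terms with P_i = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

random.seed(1)
symbols = ["A", "B", "C"]
probs = [0.5, 0.25, 0.25]

# Equation (10): average surprisal weighted by observed frequencies N_i/N.
string = random.choices(symbols, weights=probs, k=100_000)
counts = {s: string.count(s) for s in symbols}
avg_u = sum(counts[s] / len(string) * -math.log2(p)
            for s, p in zip(symbols, probs))

print(avg_u, entropy(probs))  # both close to 1.5 bits
```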


That is Shannon’s formula for uncertainty. Search on Google for Claude Shannon, and you will find the original paper. Read it and memorize it.


Exercise

Suppose P_1 = P_2 = · · · = P_M.

    H = − Σ_{i=1}^{M} P_i log P_i = ?    (14)


Solution

Suppose P_1 = P_2 = · · · = P_M. Then P_i = 1/M for i = 1 · · · M.

    H = − Σ_{i=1}^{M} P_i log P_i
      = − [ (1/M) log(1/M) + · · · + (1/M) log(1/M) ]    (15)
      = − M × (1/M) log(1/M)                             (16)
      = log M                                            (17)

Recall our earlier idea that if we have M outcomes we can express our uncertainty as log(M). Device D was log(3), for example.
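A quick numerical check of the result H = log M for uniform distributions:

```python
import math

def entropy(probs):
    """H = -sum_i P_i log2 P_i."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# For a uniform distribution over M outcomes, H = log2(M):
for M in (2, 3, 8):
    assert math.isclose(entropy([1 / M] * M), math.log2(M))

print(entropy([1 / 3] * 3))  # ≈ 1.585, i.e. log2(3), matching device D
```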

Another exercise

Let p = , p = , · · · , p = . What is the average uncertainty or entropy H?


Hale’s approach

  • Assume a grammar G.
  • Let T_G represent all the possible derivations of G; each derivation has a probability.
  • Let W be all the possible strings in G.

Let’s represent the information conveyed by the first i words of a sentence generated by G as I(T_G | W_{1···i}):

    I(T_G | W_{1···i}) = H(T_G) − H(T_G | W_{1···i})    (18)

Hale’s approach

I(T G | W ···i) = H(T G) − H(T G | W ...i) (19) The above is just like our example with device D: I(D | A) =H(D) − H(D | A) (20) =3 − 2 (21) =1 (22) The above example with D matches our intuition: once an A has been emitted, our uncertainty has been reduced by 1.

19


Information conveyed by a particular word

    I(T_G | W_i = w_i) = H(T_G | w_{1···i−1}) − H(T_G | w_{1···i})    (23)

This gives us the information a comprehender gets from a word.

The entropy of a grammar symbol, say VP,

is the sum of

  • the entropy of a single-rule rewrite decision
  • the expected entropy of any children of that symbol

(Grenander 1967)


The entropy of a grammar symbol

Let the productions in a PCFG G be Q. For a left-hand side ξ, the rules rewriting it are Q(ξ). Example:

    VP -> V NP
    VP -> V

Let h be a vector indexed by the symbols ξ_i, so that h(ξ_1) means the first cell in the vector. For any grammar symbol ξ_1 we can compute the entropy using the usual formula:

    h(ξ_1) = − Σ_{r∈Q(ξ_1)} p_r log p_r    (24)

The entropy of a grammar symbol

More generally:

    h(ξ_i) = − Σ_{r∈Q(ξ_i)} p_r log p_r    (25)

So now the vector h has the entropies of each ξ_i.
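Equations (24)–(25) in Python: the vector h of rewrite-decision entropies for a toy PCFG. The rule probabilities are invented for illustration:

```python
import math

# Q maps each LHS symbol to the probabilities of its rules (illustrative numbers).
Q = {
    "VP": [0.6, 0.4],   # VP -> V NP | VP -> V
    "NP": [0.7, 0.3],   # NP -> Det N | NP -> Nplural
    "S":  [1.0],        # S  -> NP VP
}

def h(probs):
    """h(xi) = -sum over rules rewriting xi of p_r log2 p_r:
    the entropy of the single-rule rewrite decision at that symbol."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

h_vec = {sym: h(ps) for sym, ps in Q.items()}
print(h_vec["S"])    # 0.0: only one way to rewrite S
print(h_vec["VP"])   # ≈ 0.971 bits for the VP rewrite decision
```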


The entropy of a grammar symbol, say VP,

is the sum of

  • the entropy of a single-rule rewrite decision ✔
  • the expected entropy of any children of that symbol

Now we work on the second part.


The expected entropy of any children of a symbol

Suppose rule r rewrites a nonterminal ξ_i as n daughters. For example:

    Rule r1: VP -> V NP
    Rule r2: VP -> V

Here ξ = VP and Q(VP) = {r1, r2}, so:

    H(VP) = h(VP) + [ p_{r1} (H(V) + H(NP)) + p_{r2} H(V) ]    (26)


The expected entropy of any children of a symbol

More generally:

    H(ξ_i) = h(ξ_i) + Σ_{r∈Q(ξ_i)} p_r [ H(ξ_{i1}) + · · · + H(ξ_{in}) ]    (27)
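A Python sketch of equation (27) for a toy PCFG. The grammar must be non-recursive for this naive top-down recursion to terminate (a recursive grammar needs the general solution Hale discusses); all symbols and numbers are invented for illustration:

```python
import math

# Toy, non-recursive PCFG: LHS -> list of (probability, daughter symbols).
# Lowercase symbols are terminals, which contribute zero entropy.
G = {
    "S":  [(1.0, ["NP", "VP"])],
    "NP": [(0.7, ["det", "n"]), (0.3, ["nplural"])],
    "VP": [(0.6, ["v", "NP"]), (0.4, ["v"])],
}

def h(sym):
    """Entropy of the single rewrite decision at sym (eq. 25)."""
    return sum(-p * math.log2(p) for p, _ in G.get(sym, []) if p > 0)

def H(sym):
    """Eq. (27): h(xi) plus the expected entropy of the daughters."""
    if sym not in G:   # terminal
        return 0.0
    return h(sym) + sum(p * sum(H(d) for d in daughters)
                        for p, daughters in G[sym])

# H(VP) exceeds h(VP) because VP may dominate a further NP decision.
print(H("NP"), H("VP"), H("S"))
```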

The information conveyed by a word

    I(T_G | W_i = w_i) = H(T_G | w_{1···i−1}) − H(T_G | w_{1···i})    (28)


Example


The horse raced past the barn fell

Initial conditional entropy of the parser state: H_G(S) = . . .

Input “the”: every sentence in the grammar begins with “the”, so no information is gained and there is no reduction in entropy.
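The word-by-word entropy reduction can be sketched on a hand-enumerated toy ambiguity. The two analyses and their probabilities below are invented for illustration, not Hale’s actual grammar:

```python
import math

# Hand-enumerated analyses: (analysis id, yield, probability). Invented numbers.
analyses = [
    ("main-clause", ["the", "horse", "raced", "past", "the", "barn"], 0.8),
    ("reduced-rel", ["the", "horse", "raced", "past", "the", "barn", "fell"], 0.2),
]

def H_given_prefix(prefix):
    """Entropy over the analyses still compatible with the words seen so far."""
    compatible = [(a, p) for a, ws, p in analyses if ws[:len(prefix)] == prefix]
    total = sum(p for _, p in compatible)
    return sum(-(p / total) * math.log2(p / total) for _, p in compatible)

sentence = ["the", "horse", "raced", "past", "the", "barn", "fell"]
for i in range(1, len(sentence) + 1):
    info = H_given_prefix(sentence[:i - 1]) - H_given_prefix(sentence[:i])
    print(sentence[i - 1], round(info, 3))
```

Since both analyses share the first six words, no word up to “barn” reduces the entropy; all of it is resolved at the disambiguating “fell”, which is where the model predicts processing difficulty.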


The horse raced


The horse raced past the barn fell



Subject relatives


Object relatives



In closing

  • What are some of the similarities and differences between connectionist and probabilistic models?

  • How would one (could one?) reconcile symbolic and connectionist and/or probabilistic approaches?

References

Crocker, M., & Keller, F. (2005). Probabilistic grammars as models of gradience in language processing. In G. Fanselow, C. Féry, R. Vogel, & M. Schlesewsky (Eds.), Gradience in grammar: Generative perspectives. Oxford University Press.

Hale, J. (2003). The information conveyed by words in sentences. Journal of Psycholinguistic Research, 32(2), 101–123.

Jurafsky, D. (1996). A probabilistic model of lexical and syntactic access and disambiguation. Cognitive Science, 20, 137–194.

Lapata, M., Keller, F., & Schulte im Walde, S. (2001). Verb frame frequency as a predictor of verb bias. Journal of Psycholinguistic Research, 30(4), 419–435.

Schneider, T. (2005, March). Information theory primer. Manuscript on the internet.