SLIDE 1

Information-theoretic locality properties of natural language

Richard Futrell
Department of Language Science
Department of Computer Science
University of California, Irvine
@rljfutrell | rfutrell@uci.edu

Quantitative Syntax 2019, 2019-08-26

SLIDE 5

Efficiency Hypothesis

  • Each human language is a solution to the problem of maximally efficient communication…
  • …subject to fixed human information processing constraints.
  • Efficiency Hypothesis: Languages are optimized so that messages we want to express are easy to produce and comprehend accurately.

(Zipf, 1949; Hockett, 1960; Slobin, 1973; Givón, 1991, 1992; Hawkins, 1994, 2004, 2014; Christiansen & Chater, 2008; Jaeger & Tily, 2011; Fedzechkina et al., 2012; MacDonald, 2013)

SLIDE 7

Efficiency Hypothesis

  • Mathematical formalization: human languages are solutions to a constrained optimization problem describing communication subject to cognitive constraints.
  • So, what is the objective function that human languages optimize?

SLIDE 14

Information-Theoretic Models of Natural Language

  • For example, Ferrer i Cancho & Solé (2003) propose that, for a source random variable M, natural languages L are minima of:

J_M(L) = H[M|L] + λ H[L]

where the first term, H[M|L], is the ambiguity of the meaning given the signal (the conditional entropy of the meaning given the signal), and the second term, H[L], is the effort of using the signal (the entropy of the signal).

  • This function is also known as the Deterministic Information Bottleneck (Strouse & Schwab, 2016) and the Infomax Criterion (Bell & Sejnowski, 1995; Friston, 2010).
  • Key part: effort is quantified using entropy (average surprisal).
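To make the objective concrete, here is a minimal sketch (not from the talk) of computing J_M(L) for a toy source distribution and two hypothetical deterministic lexicons; the meanings, signals, and λ values below are invented for illustration.

```python
# Minimal sketch (not from the talk): J_M(L) = H[M|L] + lambda * H[L] for a toy
# source distribution over meanings M and two invented deterministic lexicons.
import math
from collections import defaultdict

def entropy(p):
    """Shannon entropy in bits of a dict {outcome: probability}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def objective(p_m, lexicon, lam):
    """J_M(L) for a deterministic mapping meaning -> signal."""
    p_l = defaultdict(float)                 # marginal distribution over signals
    for m, prob in p_m.items():
        p_l[lexicon[m]] += prob
    # For a deterministic map, H[M, L] = H[M], so H[M|L] = H[M] - H[L].
    h_m_given_l = entropy(p_m) - entropy(p_l)
    return h_m_given_l + lam * entropy(p_l)

meanings = {"CAT": 0.5, "DOG": 0.3, "FISH": 0.2}
ambiguous = {m: "blah" for m in meanings}                  # one signal for everything
distinct = {"CAT": "cat", "DOG": "dog", "FISH": "fish"}    # one signal per meaning

for lam in (0.5, 2.0):
    print(lam, objective(meanings, ambiguous, lam), objective(meanings, distinct, lam))
```

With λ < 1 the fully distinct lexicon wins (ambiguity is expensive); with λ > 1 the single ambiguous signal wins (effort is expensive), which is the trade-off the objective is meant to capture.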

SLIDE 26

Efficiency Models of Word Order

  • The most powerful efficiency-based model of word order in natural language is dependency locality (aka dependency length minimization, dependency distance minimization, domain minimization, early immediate constituents, principle of head proximity, Behaghel’s First Law, …)
  • Bob threw out the trash. ✓
  • Bob threw the trash out. ✓
  • Bob threw out the old trash that had been sitting in the kitchen. ✓
  • Bob threw the old trash that had been sitting in the kitchen out. ✗

SLIDE 34

Efficiency Models of Word Order

  • The most powerful efficiency-based model of word order in natural language is dependency locality:
  • Robust evidence from psycholinguistics that long dependencies cause processing difficulty (Gibson, 1998, 2000; Grodner & Gibson, 2005; Bartek et al., 2011)
  • So the linear distance between words in dependencies should be minimized (for recent reviews, see Dyer, 2017; Temperley & Gildea, 2018; Liu et al., 2018).
  • Explains pervasive word order patterns across languages:
  • Harmonic word order correlations (Greenberg, 1963; Hawkins, 1994)
  • Short-before-long and long-before-short preferences (Hawkins, 1994, 2004, 2014; Wasow, 2002)
  • Tendency to projectivity (Ferrer-i-Cancho, 2006)
SLIDE 44

Focus of this Work

  • Problem. Dependency locality is motivated in terms of heuristic arguments about memory usage.
  • Question. How does dependency locality fit in formally with information-theoretic models of natural language?
  • Answer. When we adopt a more sophisticated model of processing difficulty, we can derive dependency locality as a special case of a new information-theoretic principle:
  • Information locality: Words are under pressure to be close in proportion to their mutual information.
  • I show that information locality makes correct predictions beyond dependency locality in two domains:
  • (1) Differences between different dependencies
  • (2) Relative order of adjectives
SLIDE 45

Information Locality

  • Introduction
  • Information Locality
  • Study 1: Strength of Dependencies
  • Study 2: Adjective Order
  • Conclusion

SLIDE 55

What makes words hard to process?

  • Surprisal theory (Hale, 2001; Levy, 2008; Smith & Levy, 2013; Hale, 2016):
  • Processing difficulty at a word is equal to the surprisal of that word in context:
  • Difficulty(w | context) = -log P(w | context)
  • Accounts for:
  • Garden path effects (Hale, 2001)
  • Antilocality effects (Konieczny, 2000; Levy, 2008)
  • Syntactic construction frequency effects (Levy, 2008)
  • In other words, the average processing difficulty in a language is proportional to the entropy of the language H[L].

Smith & Levy (2013). The effect of word predictability on reading time is logarithmic. Cognition.
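As a minimal illustration of the quantity involved (not the models used in the work cited above), here is a sketch that estimates per-word surprisal from a toy bigram model; the corpus, smoothing constant, and function names are assumptions made up for the example.

```python
# Sketch: per-word surprisal, -log2 P(w | context), under a toy add-alpha
# smoothed bigram model. Corpus and parameters are invented for illustration.
import math
from collections import Counter

corpus = "bob threw out the trash . bob threw the trash out .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab_size = len(unigrams)

def surprisal(word, prev, alpha=0.1):
    """Surprisal of `word` given the single previous word, in bits."""
    p = (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return -math.log2(p)

# Per-word difficulty predicted by surprisal theory for one sentence:
sentence = "bob threw the trash out .".split()
for prev, word in zip(sentence, sentence[1:]):
    print(f"{word:>6}  {surprisal(word, prev):.2f} bits")
```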
SLIDE 65

Limitations of Surprisal Theory

  • Surprisal theory has excellent empirical coverage for observable processing difficulty, except:
  • It does not account for dependency locality effects empirically (Levy, 2008, 2013) and provably cannot theoretically (Levy, 2006; Futrell, 2017).
  • Reason: Surprisal theory has no notion of memory limitations.
  • So how can we build memory limitations into surprisal theory?
SLIDE 76

How to fit memory into Surprisal Theory?

Futrell & Levy (2017)

[Figure: the objective context "Bob threw the old trash that had been sitting in the kitchen" followed by the next word w ("out"); the comprehender holds a lossy memory representation of the context and uses it to predict the next word.]

  • Surprisal: Diff(w | context) = -log P(w | context)
  • Lossy-context surprisal: Diff(w | context) = -log P(w | memory representation)
SLIDE 83

What makes words hard to process?

  • Lossy-context surprisal: Processing difficulty per word is

Diff(w_i | w_1, …, w_{i−1}) ∝ −log p(w_i | m_i),

where m_i is a lossy compression of the context w_1, …, w_{i−1}, i.e. m_i is an approximate epsilon-machine (Feldman & Crutchfield, 1998; Marzen & Crutchfield, 2017).

  • So the average processing difficulty for a language is a cross entropy:

Diff(L) ∝ 𝔼_{w_1,…,i} [−log p(w_i | m_i)] ≡ H_L[L′]
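The following is one possible minimal sketch of the lossy-context idea (not the authors' implementation): the memory representation m_i is taken to be a crude lossy compression that retains only the k most recent context words, and per-word difficulty is the surprisal computed from that representation. The corpus, the model, and all names are assumptions for illustration only.

```python
# Sketch (not the authors' code): lossy-context surprisal where m_i keeps only
# the k most recent context words. Corpus and numbers are toy assumptions.
import math
from collections import Counter, defaultdict

corpus = "bob threw the trash out . bob picked the trash up .".split()

def make_model(context_size):
    """Add-alpha model P(w | last `context_size` words) from the toy corpus."""
    counts, totals = defaultdict(Counter), Counter()
    for i in range(context_size, len(corpus)):
        ctx = tuple(corpus[i - context_size:i])
        counts[ctx][corpus[i]] += 1
        totals[ctx] += 1
    vocab_size = len(set(corpus))
    def prob(word, context, alpha=0.1):
        ctx = tuple(context[-context_size:])
        return (counts[ctx][word] + alpha) / (totals[ctx] + alpha * vocab_size)
    return prob

full = make_model(context_size=4)    # comprehender who retains the whole context
lossy = make_model(context_size=1)   # memory representation keeps only one word

context, word = "bob threw the trash".split(), "out"
print("surprisal given full context:", round(-math.log2(full(word, context)), 2), "bits")
print("lossy-context surprisal     :", round(-math.log2(lossy(word, context)), 2), "bits")
```

The lossy comprehender has lost the distant verb "threw", so the particle "out" is harder to predict: exactly the kind of extra difficulty for long dependencies that plain surprisal theory cannot produce.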

SLIDE 96

Information Locality

[Figure: the objective context "Bob threw the old trash sitting in the kitchen", the next word "out", and the comprehender's lossy memory representation of that context.]

  • If information about words is lost at a constant rate (noisy memory), then the memory representation will have less information about words that have been in memory longer.
  • This leads to information locality. Difficulty increases when words with high mutual information are distant.
  • Theorem (Futrell & Levy, 2017):

Diff(w_i | w_1, …, w_{i−1}) ≈ −log P(w_i) − Σ_{j=1}^{i−1} e_{i−j} pmi(w_i; w_j)

e_d: proportion of information retained about the d'th most recent word. (Under the noisy memory model, this must decrease monotonically.)

Pointwise mutual information (pmi) is the most general statistical measure of how strongly two values predict each other (Church & Hanks, 1990): pmi(w; w′) = log p(w | w′) / p(w)
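A small sketch of what the theorem's approximation computes (not the authors' code): predicted difficulty is unigram surprisal minus distance-discounted pmi with each preceding word. The pmi values, unigram probabilities, and decay rate below are invented toy numbers.

```python
# Sketch (toy numbers, not the authors' code): information-locality approximation
# Diff(w_i) ~ -log2 P(w_i) - sum_j e_{i-j} * pmi(w_i; w_j)
import math

unigram_p = {"threw": 0.01, "out": 0.02, "the": 0.07, "old": 0.01, "trash": 0.005}
pmi_table = {("out", "threw"): 3.0, ("out", "trash"): 0.5}   # in bits; invented

def retention(d, rate=0.5):
    """e_d: proportion of information retained about the d'th most recent word.
    Decreases monotonically in d, as the noisy-memory model requires."""
    return math.exp(-rate * d)

def predicted_difficulty(word, context):
    diff = -math.log2(unigram_p[word])
    for d, prev in enumerate(reversed(context), start=1):
        diff -= retention(d) * pmi_table.get((word, prev), 0.0)
    return diff

print(predicted_difficulty("out", ["threw"]))                         # verb adjacent
print(predicted_difficulty("out", ["threw", "the", "old", "trash"]))  # verb distant
```

The high-pmi pair ("threw", "out") reduces difficulty less and less as the distance between the two words grows, which is the information-locality pressure.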

SLIDE 104

Information Locality

  • Information locality: I predict processing difficulty when words that predict each other (have high mutual information) are far apart.
  • How does this relate to dependency locality?
  • Linking Hypothesis: Words in syntactic dependencies have high mutual information (de Paiva Alves, 1996; Yuret, 1998)
  • Makes sense a priori: Mutual information is a measure of strength of covariance.
  • If this is true, then we can see dependency locality effects as a subset of information locality effects.
  • I have a talk about this tomorrow! (Futrell, Qian, Gibson, Fedorenko & Blank, 2019)

Information locality: words with high mutual information should be close.
Dependency locality: words in dependencies should be close.

SLIDE 105

Information Locality

  • Introduction
  • Information Locality
  • Study 1: Strength of Dependencies
  • Study 2: Adjective Order
  • Conclusion

SLIDE 110

Strength of Dependencies

  • Dependency locality says: All words in dependencies should be close.
  • Information locality says: Words want to be close in proportion to their mutual information.
  • Information locality prediction: Words in dependencies which predict each other strongly will be especially attracted to each other, beyond dependency locality effects.

SLIDE 117

Strength of Dependencies

  • So: Fit a regression predicting the distance between a head and dependent from the pmi of the head and dependent.
  • Fit to UD v2.1 corpora of 50 languages.
  • I measure pmi between POS tags, not wordforms, because wordform mutual information is hard to estimate for natural language (see my talk tomorrow).

[Regression equation figure. Annotations: the outcome is the distance between the words in the r'th dependency in language l; the coefficient of interest is the strength of the pmi-attraction effect.]
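A minimal sketch of this setup (not the code released with the talk): estimate pmi between head and dependent POS tags from dependency-parsed sentences, then regress dependency distance on that pmi. The toy dependency tuples, the data format, and the use of ordinary least squares with a single predictor are assumptions for illustration.

```python
# Sketch (not the released study code): pmi between head and dependent POS tags,
# and a one-predictor regression of dependency distance on that pmi.
import math
from collections import Counter

# Each toy dependency: (head POS, dependent POS, linear distance in words).
toy_dependencies = [
    ("VERB", "NOUN", 1), ("VERB", "NOUN", 1), ("VERB", "NOUN", 2),
    ("NOUN", "ADJ", 1), ("NOUN", "ADJ", 1), ("NOUN", "ADJ", 2),
    ("VERB", "ADV", 4), ("VERB", "ADV", 3),
    ("NOUN", "DET", 1), ("VERB", "NOUN", 2),
]

pair_counts = Counter((h, d) for h, d, _ in toy_dependencies)
head_counts = Counter(h for h, _, _ in toy_dependencies)
dep_counts = Counter(d for _, d, _ in toy_dependencies)
n = len(toy_dependencies)

def pmi(head, dep):
    """Pointwise mutual information (bits) between head and dependent POS tags."""
    return math.log2((pair_counts[(head, dep)] / n)
                     / ((head_counts[head] / n) * (dep_counts[dep] / n)))

# Ordinary least squares for distance ~ pmi (slope = pmi-attraction effect).
xs = [pmi(h, d) for h, d, _ in toy_dependencies]
ys = [dist for _, _, dist in toy_dependencies]
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print("pmi-attraction effect (toy):", round(slope, 2))
```

In the study itself this regression is fit per language to the UD v2.1 treebanks, yielding the negative pmi-attraction effect reported on the next slide.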

SLIDE 122

Are words in dependencies with high pmi closer?

  • I find a significant pmi-attraction effect in 48/50 languages.
  • Average effect size is -0.3:
  • For each bit of pmi between two words, they are 0.3 words closer together on average.

SLIDE 123

Information Locality

  • Introduction
  • Information Locality
  • Study 1: Strength of Dependencies
  • Study 2: Adjective Order
  • Conclusion

SLIDE 136

Adjective Order Constraints

  • The pretty red Italian car ✓
  • The red pretty Italian car ✗
  • The Italian pretty red car ✗
  • The pretty Italian red car ✗
  • There are constraints on relative order of adjectives that are stable across speakers and languages.
  • Strongest empirical generalization: more subjective adjectives are farther out (Scontras et al., 2017)
  • Information locality explanation: Adjectives with high pmi with a noun will appear relatively close to that noun.
  • Possibly conceptually related to subjectivity.
SLIDE 146

Does Adjective Order Correspond to Mutual Information?

  • 1. Gather a large set of adjective-adjective-noun triples from a corpus.
  • 2. Measure pmi between adjectives and nouns.
  • 3. Does pmi(A;N) predict that A will be closer to the noun than the other adjective?
  • Data: Google Syntactic N-grams (8.5 billion adjective-noun pairs)
  • Model: Logistic regression predicting order from pmi.
  • Result: PMI predicts adjective order for held-out data with 66.9% accuracy.
  • Best previously known predictor (subjectivity) gets 68.4%.
  • PMI + Subjectivity gets 72.9% accuracy.
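A hedged sketch of step 3 as a logistic regression (not the released adjorder code): given pmi estimates for each adjective with the noun, predict which order is attested. The toy pmi values and triples, the feature choice (the pmi difference), and the use of scikit-learn are assumptions for illustration.

```python
# Sketch (not the released code): logistic regression predicting adjective order
# from pmi(adjective; noun). Toy pmi values and triples are invented; in the
# study the estimates come from Google Syntactic N-grams.
from sklearn.linear_model import LogisticRegression

# Invented pmi(adjective; noun) estimates, in bits.
pmi = {("italian", "car"): 2.0, ("red", "car"): 1.2, ("pretty", "car"): 0.3,
       ("old", "trash"): 1.5, ("smelly", "trash"): 0.9}

# Observed orders: (first adjective, second adjective, noun); the second is nearer the noun.
attested = [("pretty", "red", "car"), ("pretty", "italian", "car"),
            ("red", "italian", "car"), ("smelly", "old", "trash")]

X, y = [], []
for a, b, n in attested:
    X.append([pmi[(a, n)] - pmi[(b, n)]])  # attested order: label 1
    y.append(1)
    X.append([pmi[(b, n)] - pmi[(a, n)]])  # reversed order: label 0
    y.append(0)

model = LogisticRegression().fit(X, y)
print("coefficient on pmi difference:", model.coef_[0][0])
print("accuracy on the toy data     :", model.score(X, y))
```

The sign of the coefficient is the information-locality prediction: the adjective with higher pmi with the noun tends to be placed closer to it.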
SLIDE 153

Discussion

  • Other theories aim to explain the same data…
  • Dyer’s (2017, 2018) Integration Cost: Involves the conditional entropy of dependency relation labels given words.
  • Hahn et al.’s (2018) Subjective Rational Speech Acts model: Involves noisy incremental memory in the computation of meaning.
  • Scontras et al.’s (2019) Noisy composition model explains adjective order in terms of noisy hierarchical computation of meaning.
  • Future work will have to rigorously disentangle the predictions of these theories.
  • Problem: The relevant information-theoretic quantities are hard to estimate accurately.

SLIDE 154

Information Locality

  • Introduction
  • Information Locality
  • Study 1: Strength of Dependencies
  • Study 2: Adjective Order
  • Conclusion

SLIDE 159

Conclusion

  • Question. How does dependency locality fit in formally with information-theoretic models of natural language?
  • Answer. Replace the effort term H[L] with the cross entropy under lossy memory:

J_M(L) = H[M|L] + λ H_L[L′]

Dependency locality happens in this second term, in the form of information locality.

SLIDE 166

Outstanding Questions for Information Locality

  • Does information locality capture the trade-off of complex morphology and deterministic word order? (Koplenig et al., 2017)
  • Depends on the precise relationship between morphology and inter-word MI.
  • Is the right notion of mutual information purely MI between words, or is it also something that takes into account meaning?
  • E.g., dependency relation types, as in Dyer’s Integration Cost theory.
  • Does information locality make different predictions from dependency locality wrt crossing dependencies?

SLIDE 167

  • All code is available online at http://github.com/langprocgroup/adjorder and http://github.com/langprocgroup/cliqs
  • Thanks to Roger Levy, Ted Gibson, and Tim O’Donnell for discussions.
  • Thanks to the SyntaxFest reviewers for helpful comments.
  • Thanks to the Quasy organizers for a great conference!

Thanks all!