Syntactic-Ngrams over Time from a Very Large Corpus of English Books

Syntactic-Ngrams over Time from a Very Large Corpus of English Books Yoav Goldberg and Jon Orwant Presented at *SEM 2013, Atlanta, GA Many thanks to Google's parsing team Ryan Dipanjan Slav Kuzman Keith Terry Michael Fernando Hao


slide-1
SLIDE 1

Syntactic-Ngrams over Time from a Very Large Corpus of English Books

Yoav Goldberg and Jon Orwant

Presented at *SEM 2013, Atlanta, GA

slide-2
SLIDE 2

Many thanks to Google's parsing team

Ryan Keith Slav Kuzman Dipanjan Fernando Hao Michael Joakim

(at the time)

Terry

slide-3
SLIDE 3

Disclaimer:

I'm a syntax and parsing guy
I don't know much about semantics
I'm not even sure I know what semantics means

(I do know what tensors are, though, and some of you seem to like them)

however, I am pretty sure you will find this useful

slide-4
SLIDE 4

A lexical/syntactic resource
Based on 350 billion parsed words
Time-indexed
Available for download

slide-5
SLIDE 5

distributed on the web under a Creative commons non-commercial share-alike license

slide-6
SLIDE 6

Lexical/syntactic resource

slide-7
SLIDE 7

“You shall know a word by the company it keeps”

  • Firth

what is the company of a word?

slide-8
SLIDE 8

sequential context is widely used The boy ate cake The, boy, ate boy, ate, cake

slide-9
SLIDE 9

The boy with the brown eyes ate the cake ...but sequential context is only a proxy

(often misleading)

slide-10
SLIDE 10

The boy with the brown eyes ate the cake ...but sequential context is only a proxy eyes, ate, the

(often misleading)
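
A minimal sketch of the sequential context discussed on these slides. The function name `sequential_ngrams` is ours, not from the talk. It reproduces the two trigrams of "The boy ate cake" and shows how a plain sliding window over the longer sentence produces the unhelpful trigram "eyes, ate, the".

```python
def sequential_ngrams(tokens, n):
    """Return all contiguous windows of n tokens, in sentence order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


short = sequential_ngrams("The boy ate cake".split(), 3)
print(short)  # [('The', 'boy', 'ate'), ('boy', 'ate', 'cake')]

long_sentence = "The boy with the brown eyes ate the cake".split()
long = sequential_ngrams(long_sentence, 3)
# The window around "ate" pairs it with "eyes", not with its subject "boy".
print(("eyes", "ate", "the") in long)  # True
print(("boy", "ate", "cake") in long)  # False
```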

slide-11
SLIDE 11

The boy with the brown eyes ate the cake ...but sequential context is only a proxy eyes, ate, the brown, eyes, ate, the, cake

(often misleading)

slide-12
SLIDE 12

The boy with the brown eyes ate the cake what we really care for is the syntactic context boy ate cake

nsubj dobj

slide-13
SLIDE 13

The boy with the brown eyes ate the cake what we really care for is the syntactic context boy ate cake

nsubj dobj

boy with eyes ate cake The boy ate the cake

prep pobj nsubj dobj dobj nsubj det det

slide-14
SLIDE 14

The boy with the brown eyes ate the cake what we really care for is the syntactic context boy ate cake

nsubj dobj

boy with eyes ate cake The boy ate the cake

prep pobj nsubj dobj dobj nsubj det det

syntactic ngrams
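
A minimal sketch of how the "boy ate cake" context falls out of a dependency parse. The parse below is written by hand for illustration (it is not parser output), and the helper name `dependents` is hypothetical.

```python
# Hand-written parse of "The boy with the brown eyes ate the cake".
# Each token is (index, word, head_index, label); head 0 is the artificial root.
PARSE = [
    (1, "The", 2, "det"),
    (2, "boy", 7, "nsubj"),
    (3, "with", 2, "prep"),
    (4, "the", 6, "det"),
    (5, "brown", 6, "amod"),
    (6, "eyes", 3, "pobj"),
    (7, "ate", 0, "root"),
    (8, "the", 9, "det"),
    (9, "cake", 7, "dobj"),
]


def dependents(parse, head_word):
    """Return (word, label) for every token whose head is head_word."""
    words = {index: word for index, word, _, _ in parse}
    return [(word, label) for _, word, head, label in parse
            if words.get(head) == head_word]


ate_context = dependents(PARSE, "ate")
print(ate_context)  # [('boy', 'nsubj'), ('cake', 'dobj')]
```

Unlike the sliding window, the result does not change when more material is inserted between "boy" and "ate".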

slide-15
SLIDE 15

We took a large corpus covering many years We parsed it with a good parser We extracted and counted syntactic-ngrams

slide-16
SLIDE 16

We took a large corpus covering many years We parsed it with a good parser We extracted and counted syntactic-ngrams

English Google Books
~3.5M books published between 1520 and 2008 (most after 1800)
~350B words
~100 times larger than previous efforts

slide-18
SLIDE 18

We took a large corpus covering many years We parsed it with a good parser We extracted and counted syntactic-ngrams CRF tagger with cluster features induced from the books corpus

slide-19
SLIDE 19

We took a large corpus covering many years We parsed it with a good parser We extracted and counted syntactic-ngrams arc-eager transition parser beam of size 8 features of Zhang and Nivre (2011) state of the art CRF tagger with cluster features induced from the books corpus

slide-20
SLIDE 20

We took a large corpus covering many years
We parsed it with a good parser
We extracted and counted syntactic-ngrams

Trained on WSJ + Brown + Question-Treebank
arc-eager transition parser
beam of size 8
features of Zhang and Nivre (2011)
state of the art++
CRF tagger with cluster features induced from the books corpus

slide-21
SLIDE 21

We took a large corpus covering many years We parsed it with a good parser We extracted and counted syntactic-ngrams counting at this scale is not trivial (luckily, Google has great infrastructure)

slide-22
SLIDE 22

We took a large corpus covering many years
We parsed it with a good parser
We extracted and counted syntactic-ngrams

What do these look like?

counting at this scale is not trivial (luckily, Google has great infrastructure)

slide-23
SLIDE 23

We provide several datasets, each with a different kind of syntactic-ngrams. they have names: arcs, biarcs, triarcs, quadarcs, … I will describe them shortly

(more details in the paper and website)

slide-24
SLIDE 24

content words vs. functional markers*

focus on relations between content words but retain information about functional markers

*Defined based on dependency labels

slide-25
SLIDE 25

content words vs. functional markers*

focus on relations between content words but retain information about functional markers said, dog, beautiful, quickly, he, John, 59, hundreds, increasing, jumped, ...

*Defined based on dependency labels

slide-26
SLIDE 26

content words vs. functional markers*

focus on relations between content words but retain information about functional markers said, dog, beautiful, quickly, he, John, 59, hundreds, increasing, jumped, ... to, will, his, the, not, did, your, has, some, ...

*Defined based on dependency labels
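
A rough sketch of a label-based split between content words and functional markers. The talk only says the distinction is defined by dependency labels; the exact label set below is our assumption, chosen so that determiners, auxiliaries, negation, possessives, prepositions and coordinators come out as functional. The token labels are also written by hand.

```python
# ASSUMPTION: this label set is a guess, not the definition used for the dataset.
FUNCTIONAL_LABELS = {"det", "aux", "auxpass", "neg", "poss", "prep", "cc"}


def split_tokens(tokens):
    """Split (word, dependency_label) pairs into content and functional words."""
    content = [word for word, label in tokens if label not in FUNCTIONAL_LABELS]
    functional = [word for word, label in tokens if label in FUNCTIONAL_LABELS]
    return content, functional


# Hand-labelled tokens of "The boy with the brown eyes ate the cake".
TOKENS = [
    ("The", "det"), ("boy", "nsubj"), ("with", "prep"), ("the", "det"),
    ("brown", "amod"), ("eyes", "pobj"), ("ate", "root"),
    ("the", "det"), ("cake", "dobj"),
]

content, functional = split_tokens(TOKENS)
print(content)     # ['boy', 'brown', 'eyes', 'ate', 'cake']
print(functional)  # ['The', 'with', 'the', 'the']
```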

slide-28
SLIDE 28

arcs: two content words

slide-29
SLIDE 29

arcs: two content words

port/NN bombarded/VBN

nsubjpass ccomp

slide-30
SLIDE 30

arcs: two content words

bombardment/NN continued/VBD

rcmod dobj

slide-31
SLIDE 31

arcs: two content words

bombarding/NN of/IN heaven/NNP

dobj prep pobj

prepositions not counted as content words

slide-32
SLIDE 32

arcs: two content words

bombarding/VBG and/CC sinking/VBG

pcomp cc conj

coordinators not counted as content words

slide-33
SLIDE 33

arcs: two content words

“arcs” ngrams are very useful

slide-34
SLIDE 34

arcs: two content words

“arcs” ngrams are very useful can answer many natural queries:

  • subjects/objects of a given verb
  • adjectival modifiers of a noun
  • things coordinated with a word
slide-35
SLIDE 35

arcs: two content words

“arcs” ngrams are very useful most work in syntactic vector-space models can be replicated using this set

slide-36
SLIDE 36

arcs: two content words

“arcs” ngrams are very useful most work in syntactic vector-space models can be replicated using this set

My student tried using it with our current model and got a very nice boost in accuracy!

  • C. Biemann, PhD, a few days ago
slide-37
SLIDE 37

arcs: two content words

“arcs” ngrams are very useful

I can inspect the modification patterns of gradable adjectives! This is sooo interesting for me :-)

  • G. Weidman Sassoon, PhD, a real semantician
slide-38
SLIDE 38

arcs: two content words

“arcs” ngrams are very useful

I can inspect the modification patterns of gradable adjectives! This is sooo interesting for me :-)

  • G. Weidman Sassoon, PhD, a real semantician

nearly/RB/advmod/2 tall/JJ/acomp/0 6707
unusually/RB/advmod/2 tall/JJ/acomp/0 4444
extremely/RB/advmod/2 tall/JJ/amod/0 4419
unusually/RB/advmod/2 tall/JJ/amod/0 4331
fairly/RB/advmod/2 tall/JJ/acomp/0 3466
extremely/RB/advmod/2 tall/JJ/acomp/0 3267
fairly/RB/advmod/2 tall/JJ/amod/0 3218
immensely/RB/advmod/2 tall/JJ/amod/0 2806
exceptionally/RB/advmod/2 tall/JJ/amod/0 2623
generally/RB/advmod/2 tall/JJ/acomp/0 2470
relatively/RB/advmod/2 tall/JJ/amod/0 2253
exceptionally/RB/advmod/2 tall/JJ/acomp/0 1929
enormously/RB/advmod/2 tall/JJ/amod/0 1567
nearly/RB/advmod/2 tall/JJ/amod/0 1550
really/RB/advmod/2 tall/JJ/acomp/0 1532
remarkably/RB/advmod/2 tall/JJ/acomp/0 1523
really/RB/advmod/2 tall/JJ/amod/0 1474
immensely/RB/advmod/2 tall/JJ/acomp/0 1452
relatively/RB/advmod/2 tall/JJ/acomp/0 1427
particularly/RB/advmod/2 tall/JJ/amod/0 1422
particularly/RB/advmod/2 tall/JJ/acomp/0 1379
moderately/RB/advmod/2 tall/JJ/amod/0 1360
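
A sketch of querying such records, reading each one as space-separated tokens of the form word/POS/dependency-label/head-index followed by a count. That reading is our interpretation of the slide; the paper and website describe the released file layout. The function name `modifiers_of` is hypothetical, and the six records are copied from the slide.

```python
from collections import Counter

RECORDS = [
    "nearly/RB/advmod/2 tall/JJ/acomp/0 6707",
    "unusually/RB/advmod/2 tall/JJ/acomp/0 4444",
    "extremely/RB/advmod/2 tall/JJ/amod/0 4419",
    "unusually/RB/advmod/2 tall/JJ/amod/0 4331",
    "extremely/RB/advmod/2 tall/JJ/acomp/0 3267",
    "nearly/RB/advmod/2 tall/JJ/amod/0 1550",
]


def modifiers_of(records, head_word, label):
    """Sum the counts of words attached to head_word with the given label."""
    totals = Counter()
    for record in records:
        *fields, count = record.split()
        # rsplit keeps any '/' that is part of the word itself.
        tokens = [field.rsplit("/", 3) for field in fields]
        for word, _pos, dep_label, head in tokens:
            head_index = int(head)  # 1-based; 0 means the head is outside the ngram
            if (head_index and dep_label == label
                    and tokens[head_index - 1][0] == head_word):
                totals[word] += int(count)
    return totals


top = modifiers_of(RECORDS, "tall", "advmod").most_common()
print(top)  # [('unusually', 8775), ('nearly', 8257), ('extremely', 7686)]
```

Summing over the role of "tall" (acomp vs. amod) is a design choice here; keeping the two apart would answer a finer-grained question.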

slide-39
SLIDE 39

arcs: two content words

“arcs” ngrams are very useful there's much more available

slide-40
SLIDE 40

extended-arcs: two content words + all functional markers

slide-41
SLIDE 41

extended-arcs: two content words + all functional markers

an/DT enormous/JJ blunder/NN

amod dobj det

slide-42
SLIDE 42

extended-arcs: two content words + all functional markers

any blunder about that contract

dobj prep pobj

slide-43
SLIDE 43

extended-arcs: two content words + all functional markers

any blunder about that contract

dobj det det prep pobj

slide-44
SLIDE 44

extended-arcs: two content words + all functional markers

ports were bombarded

root nsubjpass

slide-45
SLIDE 45

extended-arcs: two content words + all functional markers

ports were bombarded

aux root nsubjpass

slide-46
SLIDE 46

extended-arcs: two content words + all functional markers

ports may not be bombarded

aux root nsubjpass neg

slide-47
SLIDE 47

extended-arcs: two content words + all functional markers

cute ass

amod

slide-48
SLIDE 48

extended-arcs: two content words + all functional markers

cute ass

amod

1352

slide-49
SLIDE 49

extended-arcs: two content words + all functional markers

cute ass

amod

1352 ?

slide-50
SLIDE 50

extended-arcs: two content words + all functional markers

cute ass

amod

a 445

slide-51
SLIDE 51

extended-arcs: two content words + all functional markers

cute ass

amod

a her your the that his my 445 235 221 167 103 82 71 28

slide-52
SLIDE 52

extended-arcs: two content words + all functional markers

cute ass

amod

a her your the that his my 445 235 221 167 103 82 71 28 cute ass

amod

a her your the that his my fat ass

amod

?

amod

slide-53
SLIDE 53

extended-arcs: two content words + all functional markers

cute ass

amod

a her your the that his my 445 235 221 167 103 82 71 28 cute ass

amod

a her your the that his my fat ass

amod

your his a her my that the … a frog 's

amod

2768 1504 1044 993 764 640 359 173 11

slide-54
SLIDE 54

Functional modifiers of coordinated nouns

the boy and the girl

conj cc conj

slide-55
SLIDE 55

Functional modifiers of coordinated nouns

___ */NN and ___ */NN

conj cc conj

slide-56
SLIDE 56

Functional modifiers of coordinated nouns

___ */NN and ___ */NN

conj cc conj

parallelism?

slide-57
SLIDE 57

79250839 the and the
15031401 a and a
3820439 the and its
2614562 the and his
2467965 his and his
2242856 a and the
2133545 the and a
2030446 the and their
1856827 an and a
1686133 a and an
1020169 their and their
892783 his and the
750079 my and my
714221 her and her
658563 its and its
475910 an and an
467310 our and our
459989 the and her

Functional modifiers of coordinated nouns

___ */NN and ___ */NN

conj cc conj

parallelism?
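
One way to put a number on the parallelism question, as a rough sketch: take the determiner pairs listed on this slide and compute how much of their combined count comes from pairs with the same determiner on both sides. Only the pairs shown on the slide are included, so the figure says nothing about the unlisted tail.

```python
# (first determiner, second determiner, count), copied from the slide.
PAIRS = [
    ("the", "the", 79250839), ("a", "a", 15031401), ("the", "its", 3820439),
    ("the", "his", 2614562), ("his", "his", 2467965), ("a", "the", 2242856),
    ("the", "a", 2133545), ("the", "their", 2030446), ("an", "a", 1856827),
    ("a", "an", 1686133), ("their", "their", 1020169), ("his", "the", 892783),
    ("my", "my", 750079), ("her", "her", 714221), ("its", "its", 658563),
    ("an", "an", 475910), ("our", "our", 467310), ("the", "her", 459989),
]

same = sum(count for first, second, count in PAIRS if first == second)
total = sum(count for _, _, count in PAIRS)
share = same / total
print(f"{share:.1%} of the listed mass has identical determiners")
```

Treating "a" and "an" as the same article would push the share a little higher.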

slide-59
SLIDE 59

biarcs: three content words

slide-60
SLIDE 60

biarcs: three content words

conserve scarce resources

amod dobj root

slide-61
SLIDE 61

biarcs: three content words

farmers conserve resources

nsubj dobj root

slide-62
SLIDE 62

biarcs: three content words

conserve habits possessed

rcmod dobj ccomp

slide-63
SLIDE 63

biarcs: three content words

conserve wildlife by leaving

prep dobj ccomp pobj

slide-65
SLIDE 65

biarcs: three content words

conserve oil and gas

cc dobj xcomp conj

slide-66
SLIDE 66

biarcs: three content words

describes feeling attracted

xcomp root ccomp

slide-67
SLIDE 67

biarcs: three content words

capture interactions between subject, verb and object

slide-68
SLIDE 68

biarcs: three content words

capture interactions between two adjectives of a noun

slide-69
SLIDE 69

biarcs: three content words

capture interactions between verb, adverb and subject

slide-70
SLIDE 70

biarcs: three content words

VSMs not covered by the “arcs” dataset are probably covered by this one

slide-71
SLIDE 71

biarcs: three content words

second-order questions

slide-72
SLIDE 72
old, young, little, other, most, many, first, poor, whole, white, ancient, average, obese, few, hungry, primitive, native, condemned, human, large, wild, black, great, small, starving, american, neotropical, rich, entire, ordinary, pregnant, thin, lean, normal, prehistoric, overweight, elder, fat, grave, wicked, local, holy, wealthy, working, unfortunate, miserable, sick, indian, cannibalistic, indigenous, savage, persian, maori, southern, primate, female, aboriginal, skinny, australian, ...

___ * ate

amod nsubj

adjectives of things that eat
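
A sketch of such a second-order query over biarcs: collect adjectives (amod) whose head noun is the subject (nsubj) of a given verb. The records use the word/POS/label/head-index layout; the adjectives appear on the slide, but the nouns and all counts are placeholders made up for illustration, and `adjectives_of_subjects` is a hypothetical name.

```python
from collections import Counter

# PLACEHOLDER records: counts are invented, not corpus values.
BIARCS = [
    "hungry/JJ/amod/2 boy/NN/nsubj/3 ate/VBD/ROOT/0 3",
    "young/JJ/amod/2 man/NN/nsubj/3 ate/VBD/ROOT/0 2",
    "hungry/JJ/amod/2 dog/NN/nsubj/3 ate/VBD/ROOT/0 1",
    "boy/NN/nsubj/2 ate/VBD/ROOT/0 cake/NN/dobj/2 5",
]


def adjectives_of_subjects(records, verb):
    """Sum counts of adjectives modifying any noun that is the subject of verb."""
    totals = Counter()
    for record in records:
        *fields, count = record.split()
        tokens = [field.rsplit("/", 3) for field in fields]
        for word, _pos, label, head in tokens:
            if label != "amod" or int(head) == 0:
                continue
            noun = tokens[int(head) - 1]
            if noun[2] != "nsubj" or int(noun[3]) == 0:
                continue
            if tokens[int(noun[3]) - 1][0] == verb:
                totals[word] += int(count)
    return totals


eaters = adjectives_of_subjects(BIARCS, "ate")
print(eaters.most_common())  # [('hungry', 4), ('young', 2)]
```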

slide-73
SLIDE 73

last, little, good, same, hearty, more, cold, whole, few, large, much, raw, small, great, hot, human, many, own, first, boiled, only, forbidden, big, other, light, simple, lobe, wild, fresh, green, roast, sweet, several, huge, delicious, quick, enormous, late, boiled, dry, white, frugal, early, next, fried, hasty, different, black, dried, red, fried, stale, canned, chinese, sour, cooked, french, vegetarian, mexican, baked, wonderful, poisoned, scrambled, roasted, enough, broiled, soft, kosher, ...

ate ___ *

amod dobj

adjectives of things being eaten

slide-74
SLIDE 74

triarcs: four content words

slide-75
SLIDE 75

triarcs: four content words

consist of group of short fibers

prep root pobj prep pobj amod

slide-77
SLIDE 77

triarcs: four content words

consist principally of heavier hydrocarbons

prep root advmod pobj amod

slide-78
SLIDE 78

triarcs: four content words

consist vessel crosses spine

advcl root nsubj dobj

slide-79
SLIDE 79

triarcs: four content words

social situation exposed consisted

advmod root rcmod nsubj

slide-80
SLIDE 80

triarcs: four content words

tiny baby and small child

amod pobj amod conj cc

slide-81
SLIDE 81

Adjectival modifiers of coordinated nouns

___ */NN and ___ */NN

conj cc conj

parallelism?

amod amod

slide-82
SLIDE 82

347380 late and early
318353 new and new
143298 good and good
123184 high and low
119851 social and social
87337 high and high
83516 % and %
82964 human and human
78980 low and high
74488 different and different
72617 same and same
68260 great and great
67055 good and bad
62282 many and many
61822 other and other
61126 own and own
58781 more and more
57556 young and young
57392 black and white
54690 white and black

Adjectival modifiers of coordinated nouns

___ */NN and ___ */NN

conj cc conj

parallelism!!

amod amod

slide-83
SLIDE 83

quadarcs: five content words

(but restricted to specific patterns)

slide-84
SLIDE 84

quadarcs: five content words

(but restricted to specific patterns)

parts of compilation constitute one work

num nsubj prep pobj dobj

slide-85
SLIDE 85

quadarcs: five content words

(but restricted to specific patterns)

consecrated emblems distinguished by materials and workmanship

amod nsubjpass prep pobj conj cc

slide-86
SLIDE 86

quadarcs: five content words

(but restricted to specific patterns) A content-word root, with two chains of two content-words each

slide-87
SLIDE 87

There are also the extended versions (with functional markers) of biarcs, triarcs and quadarcs
slide-88
SLIDE 88

nounargs: noun and all its modifiers (+ all functional markers)

slide-89
SLIDE 89

nounargs: noun and all its modifiers (+ all functional markers)

a gradual decrease

det amod

slide-90
SLIDE 90

nounargs: noun and all its modifiers (+ all functional markers)

an exponential gradual decrease

det amod amod

slide-91
SLIDE 91

nounargs: noun and all its modifiers (+ all functional markers)

a decrease in the dimension

det prep pobj det

slide-92
SLIDE 92

nounargs: noun and all its modifiers (+ all functional markers)

a corresponding decrease in the quantity

det prep pobj det amod

slide-93
SLIDE 93

nounargs: noun and all its modifiers (+ all functional markers)

can be used for estimating dependency language models

slide-94
SLIDE 94

nounargs: noun and all its modifiers (+ all functional markers)

other interesting questions:

PP co-occurrence patterns? adjectival co-occurrence? definiteness patterns?

slide-95
SLIDE 95

verbargs: verb and all its modifiers (+ all functional markers)

slide-96
SLIDE 96

verbargs: verb and all its modifiers (+ all functional markers)

he detected a possibility

dobj det nsubj

slide-97
SLIDE 97

verbargs: verb and all its modifiers (+ all functional markers)

he detected and exploited the impulses

dobj det nsubj conj cc

slide-98
SLIDE 98

verbargs: verb and all its modifiers (+ all functional markers)

he detected confusion in the tone

pobj det nsubj dobj prep

slide-99
SLIDE 99

verbargs: verb and all its modifiers (+ all functional markers)

he has been detected

nsubjpass aux aux

slide-100
SLIDE 100

verbargs: verb and all its modifiers (+ all functional markers)

often unfairly dealt with
only briefly dealt with
only marginally dealt with
otherwise severely dealt with
rather hardly dealt with
so directly dealt with
so easily dealt with

advmod advmod prep

slide-101
SLIDE 101

verbargs: verb and all its modifiers (+ all functional markers)

he conquered
he conquered egypt
he conquered for himself a duchy
he conquered himself with an effort
he conquered this obstacle by manufacturing
he conquered during his journey

slide-102
SLIDE 102

verbargs: verb and all its modifiers (+ all functional markers)

frame induction? better SRL? verb groups?

slide-103
SLIDE 103

reasons to use the syntactic ngrams corpus in your research:
freely available and easy to obtain
based on a state-of-the-art parser
x100 bigger than previous efforts and ready-to-use
a standardized dataset means you can compare yourself to others

slide-104
SLIDE 104

“I can use this to improve my super cool brain-emulating parser!” *

* I may be paraphrasing a little

  • W. Schuler, PhD.

Cognitive Computational Linguist Ohio State University

slide-105
SLIDE 105

now for the really cool stuff!

slide-106
SLIDE 106

Time-series information

slide-107
SLIDE 107

for each syntactic ngram, we have counts broken down by year a very strong tool for studying change over time

slide-108
SLIDE 108

Simple stuff: what do people drink? count noun object of drink/drank when subject is proper-noun or pronoun, graph by decade
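
A sketch of the recipe on this slide: keep records where drink/drank has a pronoun or proper-noun subject and a noun object, then sum the per-year counts into decades. Records are simplified to (tokens, {year: count}) pairs. All numbers are placeholders made up for illustration, the tag sets are assumed Penn Treebank tags, and the function names are hypothetical.

```python
from collections import defaultdict

SUBJECT_TAGS = {"NNP", "NNPS", "PRP"}  # proper nouns and pronouns (assumed tags)
OBJECT_TAGS = {"NN", "NNS"}            # common nouns (assumed tags)


def drink_object(tokens):
    """tokens: (word, pos, label, head) tuples. Return the noun object or None."""
    verbs = {i + 1 for i, t in enumerate(tokens) if t[0] in {"drink", "drank"}}
    has_subject = any(label == "nsubj" and pos in SUBJECT_TAGS and head in verbs
                      for _, pos, label, head in tokens)
    if not has_subject:
        return None
    for word, pos, label, head in tokens:
        if label == "dobj" and pos in OBJECT_TAGS and head in verbs:
            return word
    return None


def decade_series(records):
    """Return {object: {decade_start: count}} for the matching records."""
    series = defaultdict(lambda: defaultdict(int))
    for tokens, yearly in records:
        obj = drink_object(tokens)
        if obj is None:
            continue
        for year, count in yearly.items():
            series[obj][year // 10 * 10] += count
    return {obj: dict(sorted(d.items())) for obj, d in series.items()}


# PLACEHOLDER data: counts and years are invented, not corpus values.
RECORDS = [
    ([("he", "PRP", "nsubj", 2), ("drank", "VBD", "ROOT", 0),
      ("coffee", "NN", "dobj", 2)], {1898: 4, 1899: 6, 1907: 9}),
    ([("John", "NNP", "nsubj", 2), ("drank", "VBD", "ROOT", 0),
      ("water", "NN", "dobj", 2)], {1901: 3}),
    ([("horses", "NNS", "nsubj", 2), ("drank", "VBD", "ROOT", 0),
      ("water", "NN", "dobj", 2)], {1901: 7}),  # skipped: common-noun subject
]

trends = decade_series(RECORDS)
print(trends)  # {'coffee': {1890: 10, 1900: 9}, 'water': {1900: 3}}
```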

slide-109
SLIDE 109

drinking trends

water coffee beer

slide-110
SLIDE 110

drinking trends

whisky alcohol brandy

slide-111
SLIDE 111

Somewhat less simple stuff:

but still fairly simple

change in word meaning over time through distributional similarity (hand picked example)

slide-112
SLIDE 112

rock – jazz rock – stone Compute distributional (cosine) similarity between:

@year @year
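
A sketch of the similarity computation named on this slide: cosine similarity between sparse context-count vectors. In the talk's setting there would be one vector per word per year, built from that year's syntactic-ngram counts; here the vectors are tiny placeholders with invented counts, and the (label, word) feature keys are our own convention.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# PLACEHOLDER vectors: feature counts are invented, not corpus values.
rock = {("amod", "hard"): 3, ("dobj_of", "play"): 4}
jazz = {("dobj_of", "play"): 5, ("amod", "cool"): 1}
stone = {("amod", "hard"): 6, ("dobj_of", "throw"): 2}

sim_jazz = cosine(rock, jazz)
sim_stone = cosine(rock, stone)
print(round(sim_jazz, 3), round(sim_stone, 3))
```

Repeating the two comparisons with vectors built from each year's counts gives the two curves the slide refers to.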

slide-113
SLIDE 113

rock – jazz rock – stone

@year @year

slide-114
SLIDE 114

many other fascinating questions:
language evolution!
number of senses over time
syntactic change over time
modification patterns over time
polarity change over time
…

slide-115
SLIDE 115

the tip of an iceberg

slide-116
SLIDE 116

What can YOU do with ready-to-use, time-indexed syntactic dependencies from 350 billion words?

slide-117
SLIDE 117
slide-118
SLIDE 118

Sizes

  • Eng / all: ~320GB compressed (all + 1M + gb + us + fiction: ~680GB)
  • arcs: 919M items (extended: 1.08B)
  • biarcs: 1.78B items (extended: 1.62B)
  • triarcs: 1.87B items (extended: 1.71B)
  • quadarcs: 187M items (extended: 180M)
  • nounargs: 275M items (unlex: 195M)
  • verbargs: 130M items (unlex: 114M)