1
Leaving no token behind: comprehensive (and delicious) annotation of MWEs and supersenses
Nathan Schneider Georgetown University
LAW-MWE-CxG • 25 August 2018 • Santa Fe, NM
Goal: corpora in a language annotated with some form of lexical semantics (for NLP, corpus linguistics)
quality, scalability?
2
Start with a general lexicon like WordNet, apply it to corpus.
LIMITATIONS: coverage (esp. for MWEs),
granularity, language-specificity, cost
3
Annotate general categories/criteria at the token level, and identify types as you go. Some options:
at the token level with general criteria [most work on
MWEs, e.g. PARSEME Shared Tasks 1.0, 1.1 focusing on VMWE subclasses; Savary et al. 2017, Ramisch et al. 2018].
Next session!
at the token level while populating a lexicon or constructicon of types [Dunietz et al. 2015, 2017]. Lori’s keynote tomorrow!
with coarse-grained categories. This talk!
4
Comprehensive annotation of MWEs and supersenses (semantic classes), without starting from a lexicon, is
constructions
5
The story of the STREUSLE corpus:
✦ comprehensive annotation ✦ data exploration: notable MWEs/constructions
✦ nouns & verbs ✦ prepositions
6
7
tiny.cc/streusle
[Schneider et al., LREC 2014]
8
9
We should annotate all kinds of MWEs in AMR!
LOL good luck getting annotators to agree on what’s an MWE
CHALLENGE ACCEPTED
Multiword expression (MWE): 2 or more
in form and/or function
ice cream, daddy longlegs, pay attention
by and large; plural of daddy longlegs?
highly recommended; no amount of … can …
10
11
12
Noam Chomsky
daddy longlegs, hot dog
dry out (e.g. dry the clothes out)
depend on, come across
pay attention (to) (e.g. pay close attention (to); no attention was paid (to))
put up with, give in (to)
under the weather
cut and dry
in spite of
pick up where __ left off (e.g. pick up where they left off)
easy as pie
You’re welcome.
To each his own.
The structure of this paper is as follows.
[Savary et al. 2017, Ramisch et al. 2018]
[Shigeto et al. 2013]
13
verbs
idioms, multiword tlemmas
phrasal idioms
14
examples of many kinds)
15
[Schneider et al., LREC 2014]
16
Noam_Chomsky refused to give_in_to the vicious daddy_longlegs .
17
18
19
20
Alan_Black refused to give_in_to the vicious daddy_longlegs .
Noam_Chomsky refused to give_in_to the vicious daddy_longlegs .
My wife had taken_ her '07_Ford_Fusion _in for a routine oil_change .
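The inline notation in these examples can be read mechanically: an underscore joins the tokens of a strong MWE, a trailing underscore opens a gap, and a leading underscore closes it. The following is a minimal sketch of such a reader, assuming whitespace-separated tokens and at most one open gap at a time. The function name `parse_strong_mwes` is hypothetical and not part of any STREUSLE tooling, and the `~` (weak MWE) joiner is deliberately left uninterpreted.

```python
def parse_strong_mwes(sentence):
    """Return the strong MWEs in a sentence written in the slides' inline notation.

    Assumptions (illustrative only): tokens are whitespace-separated, "_" joins
    MWE parts, "word_" opens a gap and "_word" closes it, one open gap at a time.
    """
    groups = []
    open_group = None  # parts of a gappy MWE whose gap has not been closed yet
    for chunk in sentence.split():
        parts = [p for p in chunk.split("_") if p]
        if chunk.startswith("_") and open_group is not None:
            # Continuation of a gappy MWE after the gap.
            open_group.extend(parts)
            if not chunk.endswith("_"):
                groups.append(open_group)
                open_group = None
        elif chunk.endswith("_") and parts:
            # First half of a gappy MWE; the gap starts after this chunk.
            open_group = parts
        elif len(parts) > 1:
            # Contiguous MWE (possibly inside another MWE's gap).
            groups.append(parts)
    return groups


example = "My wife had taken_ her '07_Ford_Fusion _in for a routine oil_change ."
for mwe in parse_strong_mwes(example):
    print(" ".join(mwe))
```

A contiguous MWE inside a gap is reported before the gappy MWE that surrounds it, because the surrounding expression is only complete once its gap closes.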
21
22
23
English Web Treebank (Bies et al. 2012), fully annotated for MWEs
between at least 2 annotators
24
English-EWT corpus which has gold Universal Dependencies (included in STREUSLE release)
(single annotator)
25
26
tiny.cc/streusle
✦ 55k words of English web reviews
✴ 3,000 strong MWE mentions
PARSEME subtypes
✴ 700 weak MWE mentions
27
What a joy to stroll off historic Canyon_Road in Santa_Fe into a gallery with a gorgeous diversity of art
28
I googled restaurants in the area and Fuji_Sushi came_up and reviews were great so I made_ a carry_out _order
29
30
white-nosed coati
31
32
POS pattern | contig. MWEs | gappy MWEs | most frequent types (lowercased lemmas) and their counts
N_N    | 331 |   1 | customer service: 31, wait staff: 5, garage door: 4
ˆ_ˆ    | 325 |   1 | santa fe: 4
V_P    | 217 |  44 | work with: 27, deal with: 16, look for: 12, have to: 12, ask for: 8
V_T    | 149 |  42 | pick up: 15, check out: 10, show up: 9, end up: 6, give up: 5
V_N    |  31 | 107 | take time: 7, give chance: 5, waste time: 5, have experience: 5
A_N    | 133 |   3 | front desk: 6, top notch: 6, last minute: 5
V_R    | 103 |  30 | come in: 12, come out: 8, take in: 7, stop in: 6, call back: 5
D_N    |  83 |   1 | a lot: 30, a bit: 13, a couple: 9
P_N    |  67 |   8 | in town: 9, in fact: 7
R_R    |  72 |   1 | at least: 10, at best: 7, as well: 6, at all: 5
V_D_N  |  46 |  21 | take the time: 11, do a job: 8
V~N    |   7 |  56 | do job: 9, waste time: 4
ˆ_ˆ_ˆ  |  63 |     | home delivery service: 3, lake forest tots: 3
R~V    |  49 |     | highly recommend: 43, well spend: 1, pleasantly surprise: 1
P_D_N  |  33 |   6 | at this point: 2
A_P    |  39 |     | pleased with: 7, happy with: 6, interested in: 5
P_P    |  39 |     | due to: 9, because of: 7
V_O    |  38 |     | thank you: 26, get it: 2, trust me: 2
V_V    |   8 |  30 | get do: 8, let know: 5, have do: 4
N~N    |  34 |   1 | channel guide: 2, drug seeker: 2, room key: 1, bus route: 1
A~N    |  31 |     | hidden gem: 3, great job: 2, physical address: 2, many thanks: 2, great guy: 1
V_N_P  |  16 |  15 | take care of: 14, have problem with: 5
N_V    |  18 |  10 | mind blow: 2, test drive: 2, home make: 2
ˆ_$    |  28 |     | bj s: 2, fraiser ’s: 2, ham s: 2, alan ’s: 2, max ’s: 2
D_A    |  28 |     | a few: 13, a little: 11
R_P    |  25 |   1 | all over: 3, even though: 3, instead of: 2, even if: 2
V_A    |  19 |   6 | make sure: 7, get busy: 3, get healthy: 2, play dumb: 1
V_P_N  |  14 |   6 | go to school: 2, put at ease: 2, be in hands: 2, keep in mind: 1
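A table of this shape can be produced by grouping MWE mentions by their POS pattern and counting contiguous and gappy mentions separately. The sketch below shows one way to do that. The input records are a hypothetical toy sample rather than STREUSLE data, the `summarize` function and the tuple layout are assumptions made for illustration, and patterns are joined with `_` only (the `~` patterns in the table are not modeled).

```python
from collections import Counter, defaultdict

# Hypothetical toy input: (lowercased lemmas, POS tags, is_gappy) per MWE mention.
mentions = [
    (("customer", "service"), ("N", "N"), False),
    (("pick", "up"), ("V", "T"), False),
    (("pick", "up"), ("V", "T"), True),
    (("take", "time"), ("V", "N"), True),
    (("customer", "service"), ("N", "N"), False),
]


def summarize(mentions):
    """Count contiguous and gappy mentions per POS pattern, plus type frequencies."""
    contig, gappy = Counter(), Counter()
    types = defaultdict(Counter)
    for lemmas, tags, is_gappy in mentions:
        pattern = "_".join(tags)
        (gappy if is_gappy else contig)[pattern] += 1
        types[pattern][" ".join(lemmas)] += 1
    return contig, gappy, types


contig, gappy, types = summarize(mentions)
# Print patterns from most to least frequent overall, with their top types.
for pattern in sorted(types, key=lambda p: -(contig[p] + gappy[p])):
    top = ", ".join(f"{t}: {c}" for t, c in types[pattern].most_common(3))
    print(pattern, contig[pattern], gappy[pattern], top)
```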
33
Canyon_Road; Dr._Lori_Levin; Harry_,_Prince_of_Wales; Santa_Fe~,~NM; Fourth_of_July
north_-_northeast; north_east; macOS_10.13.6; 2002_Toyota_Camry; 10 %; A_+; 3 x the speed; 3 x 4 = 12; #_1; 5_star review; 100 square_miles
34
Indian_elephant; my dog is a yellow_lab; furcifer_pardalis; brown dog; ice_cream; chicken_salad sandwich; macaroni_and_cheese; General_Tso_’s_chicken; cheese~and~crackers; spaghetti~with~meatballs; turkey sandwich; strawberry banana milkshake; green_tea
35
Our shirts are the_same; half_a million dollars; half_a mile away; half of a mile away; a_lot of cats; a_few cats; several cats; plenty of cats
36
Just_Do_It
Construction Grammar: Framework positing continuity between lexicon and grammar
construction = conventionalized form/function pairing of any grammatical shape, level of abstractness
constructicon = structured inventory of constructions characterizing knowledge of a language
37
[Figure: continuum from LEXICAL to GRAMMATICAL, with the examples: cats; kick the bucket; ice cream; SVO; the Xer, the Yer; spill the beans]
thus partially lexical, partially syntactic
are nevertheless long-tail patterns that convey meaning
38
be learned
the topic of X in an institution of higher learning (based on a naming convention common at U.S. universities)
some universities count from 100
because only “101” is fixed, so misses this idiom.
39
These guys took Customer_Service 101 from a Neanderthal.
40
Modell_’s is nearby
business named after person,
to in the possessive [Quirk and
Greenbaum, 1973, pp. 329–330]
Kroger_’s is nearby
[Blodgett & Schneider 2018]
where possessive pronoun agrees with subject
41
John[i] is quick_on_ his[i] _feet
I[i] tried_ my[i] _best
you[i] are on_ your[i] _own
I helped in her hour_of_need
I helped in Mary ’s_hour_of_need
[Blodgett & Schneider 2018]
42
that will take_ some _time
you[i] should take_ your[i] _time
he took_the_time to learn linguistics
a long week/month/year/…
she took_time_out_of her busy schedule to help
you should take_ some _time_off to travel
have/spend/save/waste~time
3 weeks/months/years/…_old
43
she suggested I go check_ it _out have_to
serial verb semi-auxiliary
be going_to
‘must’ ‘should’ ‘will’
I would_like this cookie to_go
polite ‘want’ ‘not to eat here’ Also challenging: light meanings of get (passive, causative, inchoative)
something went_wrong
V+A idiom
wanted to go and test_ it _out for my_self
go and V
44
hair as long as a horse ’s tail load the items one_by_one
complex P PP idiom V+P idiom (IAV) VPC
try not to pass_out pass_out the candy I came_across a nice restaurant
N+P idiom
in_front_of
A+P idiom
X is close_to Y the problem~with prepositions in_trouble at_least
(<2 lexicalized elements)
about minor syntactic patterns in English!
45
46
[Schneider & Smith, NAACL 2015]
more times to add noun and verb supersenses (classes from WordNet).
47
Noun supersenses (26): NATURAL OBJECT, ARTIFACT, LOCATION, PERSON, GROUP, SUBSTANCE, TIME, RELATION, QUANTITY, FEELING, MOTIVE, COMMUNICATION, COGNITION, STATE, ATTRIBUTE, ACT, EVENT, PROCESS, PHENOMENON, SHAPE, POSSESSION, FOOD, BODY, PLANT, ANIMAL, OTHER
Verb supersenses (15): BODY, CHANGE, COGNITION, COMMUNICATION, COMPETITION, CONSUMPTION, CONTACT, CREATION, EMOTION, MOTION, PERCEPTION, POSSESSION, SOCIAL, STATIVE, WEATHER
48
sewer
noun verb
languages
49
SemCor [Miller et al. 1993]
Wikipedia; D. Hovy et al. 2014: English Twitter]
Altun 2006; Paaß & Reichartz 2009; D. Hovy et al. 2014]
2010; Rossi et al., 2013], Chinese [Qiu et al., 2011],
Arabic [Schneider et al., 2013], Danish [Martínez
Alonso et al., 2015], Sanskrit [Hellwig 2017]
50
more times to add noun and verb supersenses.
expressions
51
52
What a joy to stroll off historic Canyon_Road in Santa_Fe into a gallery with a gorgeous diversity of art
N.COGNITION V.MOTION N.LOCATION N.LOCATION N.GROUP N.COGNITION N.COGNITION
53
I googled restaurants in the area and Fuji_Sushi came_up and reviews were great so I made_ a carry_out _order
V.COMMUNICATION N.GROUP N.LOCATION N.GROUP V.COMMUNICATION N.COMMUNICATION V.COMMUNICATION N.POSSESSION
54
55
tiny.cc/streusle
✦ 55k words of English web reviews
✴ 3,000 strong MWE mentions
✴ 700 weak MWE mentions
✴ 9,000 noun mentions
✴ 8,000 verb mentions
supersenses along the lines of Ciaramita & Altun’s (2006) supersense tagger
56
Results on test set (oracle POS):
model      | MWE (F1) | SST (F1) | Tag Acc
full model | 62.7     | 70.7     | 82.5
[Schneider & Smith, NAACL 2015]
SemEval 2016 Task 10 (DiMSUM)
Nathan Schneider Dirk Hovy Anders Johannsen Marine Carpuat
All data at: https://github.com/dimsum16/dimsum-data
[Schneider et al., SemEval 2016]
58
STREUSLE Trustpilot Ritter Lowlands Tweebank IWSLT
(incl. NAIST-NTT)
5800 sentences
[Schneider et al., SemEval 2016]
training; annotation speed of ≈90s/sentence
59
ICL-HD system | Reviews | Tweets | TED
Supersenses   | 58      | 56     | 60
MWEs          | 53      | 59     | 57
Combined      | 57      | 57     | 60
[Schneider et al., SemEval 2016]
[Schneider et al. ACL 2018]
61
62
63
spatial: goal/destination | leave for Paris | go to Paris
temporal: duration | ate for hours | ate over most of an hour
recipient | a gift for mother | give the gift to mother
purpose | go to the store for eggs | go to the store to buy eggs
theme | pay/search for the eggs | spend money on the eggs
spinoffs [Litkowski and Hargraves, 2005, 2007; Litkowski
2014; Ye and Baldwin, 2007; Saint-Dizier 2006; Dahlmeier et al. 2009; Tratz and Hovy 2009; Hovy et al. 2010, 2011; Tratz and Hovy 2013]
Moldovan 2009; O’Hara and Wiebe 2009; Srikumar and Roth 2011, 2013; Müller et al. 2012 for German]
that is comprehensive w.r.t. tokens AND types
[Schneider et al. 2015, 2016, 2018; Hwang et al. 2017]
64
65
Circumstance: Temporal, Time, StartTime, EndTime, Frequency, Duration, Interval, Locus, Source, Goal, Path, Direction, Extent, Means, Manner, Explanation, Purpose
Participant: Causer, Agent, Co-Agent, Theme, Co-Theme, Topic, Stimulus, Experiencer, Originator, Recipient, Cost, Beneficiary, Instrument
Configuration: Identity, Species, Gestalt, Possessor, Whole, Characteristic, Possession, PartPortion, Stuff, Accompanier, InsteadOf, ComparisonRef, RateUnit, Quantity, Approximator, SocialRel, OrgRole
[Schneider et al., ACL 2018]
66
tiny.cc/streusle
✦ 55k words of English web reviews
✴ 3,000 strong MWE mentions
✴ 700 weak MWE mentions
✴ 9,000 noun mentions
✴ 8,000 verb mentions
✴ 4,000 prepositions
✴ 1,000 possessives
Three weeks ago, burglars tried to gain_entry into the rear of my home. Mrs._Tolchin provided us with excellent service and came with a_great_deal of knowledge and professionalism!
67
Characteristic Theme Quantity
[Schneider et al., ACL 2018] (simplified slightly)
Time Goal Whole Possessor
68
P and PP tokens by scene role in web reviews (STREUSLE 4.1):
Spatial 1,148 • Temporal 516 • Other 2,503
[Schneider et al., ACL 2018]
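To put the scene-role counts in proportion, the short calculation below converts them to shares of the total. It uses only the three counts given on the slide; the variable names are illustrative.

```python
# Counts of P and PP tokens by scene role, as given on the slide.
counts = {"Spatial": 1148, "Temporal": 516, "Other": 2503}

total = sum(counts.values())
shares = {role: 100 * n / total for role, n in counts.items()}

for role, share in shares.items():
    print(f"{role}: {counts[role]} tokens, {share:.1f}% of {total}")
```

On these numbers, spatial and temporal uses together account for roughly 40% of the tokens, and the remaining roles for about 60%.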
69
[Schneider et al., ACL 2018]
After a few rounds of pilot annotation on The Little Prince and minor additions to the guidelines: 78% on 216 unseen targets
prepositional, some not)
70
[Bar chart: Prec., Rec., and F1, comparing Gold Syntax and Automatic Syntax]
[Schneider et al., ACL 2018]
71
[Bar chart: Role, Fxn, and Full, comparing Most Frequent, Neural, and Feature-rich linear]
[Schneider et al., ACL 2018]
Can we use the supersenses for case markers and adpositions in other languages?
72
tagging without making MWE decisions
statistical sequence tagging—even gappy MWEs
[Diab & Bhutada 2009, Constant & Sigogne 2011, Schneider et al. 2014, Schneider & Smith 2015, DiMSUM task]
approximates productive constructions
with the long tail
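One way to cast gappy MWE identification as sequence tagging is a BIO-style encoding in which tokens inside another expression's gap receive lowercase tags. The following is a minimal sketch in the spirit of the tagging work cited above, not a reproduction of any particular system: it assumes MWEs are given as lists of token indices, allows one level of nesting, and ignores the strong/weak distinction. The function name `encode_gappy_bio` is hypothetical.

```python
def encode_gappy_bio(n_tokens, mwes):
    """Tag tokens with O/B/I, using lowercase o/b/i inside another MWE's gap.

    n_tokens: sentence length; mwes: list of token-index lists, one per MWE.
    """
    # A token is "in a gap" if it lies strictly inside the span of an MWE
    # that it does not belong to.
    in_gap = [False] * n_tokens
    for group in mwes:
        members = set(group)
        for k in range(min(group) + 1, max(group)):
            if k not in members:
                in_gap[k] = True

    tags = ["o" if gap else "O" for gap in in_gap]
    for group in mwes:
        for rank, k in enumerate(sorted(group)):
            tag = "B" if rank == 0 else "I"
            tags[k] = tag.lower() if in_gap[k] else tag
    return tags


tokens = "I made a carry out order".split()
# made_ ... _order is gappy; carry_out sits inside its gap.
tags = encode_gappy_bio(len(tokens), [[1, 5], [3, 4]])
print(list(zip(tokens, tags)))
```

Because every token still gets exactly one tag, a standard sequence model can be trained on this encoding without a separate mechanism for discontinuous expressions.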
74
representations and annotation methods
constructicons to apply to corpora—good for linguistic precision & detail
coverage/long tail
75
76
Many_thanks
(*Several thanks)
Thanks_a_million
(*Thanks a thousand)
Thanks_a_lot
(?Lots of thanks)
tiny.cc/streusle