
slide-1
SLIDE 1

A Dependency Parser for Tweets

Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith

slide-2
SLIDE 2

NLP for Social Media

michelle obama great. job. and. whit all my. respect she. look. great. congrats. to. her.
-- @OzzieGuillen

(Eisenstein, 2013)

Boom! Ya ur website suxx bro
-- @SarahKSilverman

slide-3
SLIDE 3

NLP for Social Media

Boom ! Ya ur website suxx bro michelle obama great . job . and . whit all my . respect she . look . great . congrats . to . her .

POS tags (Gimpel et al., 2011; Owoputi et al., 2013): ! , ! D N N N ^ ^ A , N , & , V X D , V O , V A , , N , P , O ,

NER (Ritter et al., 2011)

The English Web Treebank (Bies et al., 2012) was sufficient to support a shared task (Petrov and McDonald, 2012) on parsing the web.

slide-4
SLIDE 4

Influential members of the House Ways and Means Committee introduced legislation that would restrict how the new savings-and-loan bailout agency can raise capital, creating another potential obstacle to the government's sale of sick thrifts. — @MitchellMarcus

NLP for Social Media

slide-5
SLIDE 5

How is Twitter syntax different?

Pairwise corpus similarity (×10³) using χ² (Baldwin et al., 2013)

            Twitter-1  Twitter-2   Comments     Forums      Blogs  Wikipedia
Twitter-2         4.0          -          -          -          -          -
Comments         63.7       62.4          -          -          -          -
Forums           91.8       90.6       62.3          -          -          -
Blogs           115.8      119.1      128.4       61.7          -          -
Wikipedia       347.8      360.0      351.4      280.2      157.7          -
BNC             251.8      258.8      245.2      164.1       78.7       92.5
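A rough sketch of how a χ² corpus-similarity score of this kind can be computed. It assumes a Kilgarriff-style statistic over the most frequent words of the two corpora combined; the function name, the `top_n` cutoff, and the toy corpora are illustrative assumptions, not the exact setup of Baldwin et al. (2013).

```python
from collections import Counter

def chi_square_similarity(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of both corpora.

    Lower values mean the two corpora look more alike.
    """
    if not tokens_a or not tokens_b:
        raise ValueError("both corpora must be non-empty")
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    joint = counts_a + counts_b
    chi2 = 0.0
    for word, joint_count in joint.most_common(top_n):
        # Expected counts if both corpora were drawn from one population.
        expected_a = joint_count * total_a / (total_a + total_b)
        expected_b = joint_count * total_b / (total_a + total_b)
        chi2 += (counts_a[word] - expected_a) ** 2 / expected_a
        chi2 += (counts_b[word] - expected_b) ** 2 / expected_b
    return chi2

# Toy corpora, invented for illustration only.
tweets = "lol ur so funny lol".split()
wiki = "the city is located in the north".split()
same = chi_square_similarity(tweets, tweets)
different = chi_square_similarity(tweets, wiki)
```

A corpus compared with itself scores zero, and the score grows as the word distributions diverge, which is the pattern the table shows from Twitter-2 down to Wikipedia.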

slide-7
SLIDE 7

A Parser?

#hardtoparse: POS Tagging and Parsing the Twitterverse (Foster et al., 2011)
Frustratingly Hard Domain Adaptation for Dependency Parsing (Dredze et al., 2007)

Fitting Twitter data to the PTB annotation guidelines?
Fitting the parsing task to Twitter data.

slide-8
SLIDE 8

Building A Parser — Road Map

  • Annotation guidelines
  • An annotated corpus
  • Parser adaptation
  • Useful features

slide-10
SLIDE 10

Not All Tokens Are Syntax

RT @justinbieber : now Hailee get a twitter

Got #college admissions questions ? Ask them tonight during #CampusChat I’m looking forward to advice from @collegevisit http://bit.ly/cchOTk

michelle obama great. job. and. whit all my. respect she. look. great. congrats. to. her.

slide-11
SLIDE 11

Token Selection

RT @justinbieber : now Hailee get a twitter

Got #college admissions questions ? Ask them tonight during #CampusChat I’m looking forward to advice from @collegevisit http://bit.ly/cchOTk

michelle obama great. job. and. whit all my. respect she. look. great. congrats. to. her.

slide-13
SLIDE 13

Token Selection

  • Pre-processing step
  • A first-order sequence model trained using the structured perceptron (Collins, 2002)
  • It achieves 97.4% accuracy (ten-fold cross-validated)
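A minimal sketch of a first-order sequence model trained with the structured perceptron, for the binary token-selection task. The feature set, the label encoding (1 = selected, 0 = unselected), and the toy training pair are assumptions made for illustration; this is a plain (non-averaged) perceptron, not the authors' implementation.

```python
from collections import defaultdict

LABELS = (0, 1)  # 0 = unselected, 1 = selected (part of the syntactic tree)

def features(tokens, i, prev_label, label):
    """Hypothetical features: word identity, simple shape cues, label bigram."""
    tok = tokens[i]
    return [
        ("word", tok.lower(), label),
        ("prefix", tok[0], label),
        ("is_url", tok.startswith("http"), label),
        ("position", min(i, 3), label),
        ("transition", prev_label, label),
    ]

def score(weights, tokens, i, prev_label, label):
    return sum(weights[f] for f in features(tokens, i, prev_label, label))

def viterbi(weights, tokens):
    """Highest-scoring label sequence under the first-order model."""
    if not tokens:
        return []
    n = len(tokens)
    best = [{} for _ in range(n)]  # best[i][label] = (score, backpointer)
    for label in LABELS:
        best[0][label] = (score(weights, tokens, 0, None, label), None)
    for i in range(1, n):
        for label in LABELS:
            best[i][label] = max(
                (best[i - 1][prev][0] + score(weights, tokens, i, prev, label), prev)
                for prev in LABELS
            )
    label = max(LABELS, key=lambda l: best[n - 1][l][0])
    path = [label]
    for i in range(n - 1, 0, -1):
        label = best[i][label][1]
        path.append(label)
    return path[::-1]

def train(data, epochs=10):
    """Perceptron updates: reward gold features, penalise predicted ones."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for tokens, gold in data:
            pred = viterbi(weights, tokens)
            if pred == gold:
                continue
            prev_g = prev_p = None
            for i in range(len(tokens)):
                for f in features(tokens, i, prev_g, gold[i]):
                    weights[f] += 1.0
                for f in features(tokens, i, prev_p, pred[i]):
                    weights[f] -= 1.0
                prev_g, prev_p = gold[i], pred[i]
    return weights

# Toy example: "RT" is treated as unselected, "hello" as selected.
toy_weights = train([(["RT", "hello"], [0, 1])])
toy_prediction = viterbi(toy_weights, ["RT", "hello"])
```

The transition feature is what makes the model first-order: each label is scored jointly with the previous one, so decoding needs Viterbi rather than independent per-token decisions.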

slide-14
SLIDE 14

Multiword Expressions (MWEs)

Annotators are free to group words as explicit MWEs:

  • proper names: Justin Bieber, World Series
  • noncompositional or entrenched nominal compounds: belly button, grilled cheese
  • connectives: as well as
  • prepositions: out of
  • adverbials: so far
  • idioms: giving up, make sure

(Baldwin and Kim, 2010; Finkel and Manning, 2009; Constant and Sigogne, 2011; Schneider et al., 2014; Constant et al., 2012; Green et al., 2012; Candito and Constant, 2014; Le Roux et al., 2014)

From an annotator’s perspective, a multiword expression should be a single node in the dependency parse.

slide-15
SLIDE 15

Multiple Roots

  • PTB: a single root is assumed (one sentence is parsed at a time)
  • Tweets: often contain multiple sentences or fragments (i.e., “utterances”)

We allow multiple attachments to the “wall” symbol (i.e., multi-rooted)

Example (“*” is the wall symbol):

* OMG! You brought an iPhone 6 plus? You are so rich…
slide-16
SLIDE 16

Full Analysis of a Tweet

slide-18
SLIDE 18

Building A Parser — Road Map

  • Annotation guidelines
  • An annotated corpus
  • Parser adaptation
  • Useful features

slide-19
SLIDE 19

Building the Tweebank

  • Penn Treebank annotation:
      • took years and involved thousands of person-hours of work by linguists
  • Tweebank annotation:
      • mostly built in a day by two dozen annotators with only cursory training in the annotation scheme

slide-20
SLIDE 20

Graph Fragment Language

  • A text-based notation that facilitates keyboard entry of parses (Schneider et al., 2013)

bieber is an alien ! :O he went down to earth .

bieber > is** < alien < an
he > [went down]** < to < earth
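The annotation above can be read mechanically. Below is a small sketch that handles only the subset of the notation visible on this slide: arrows that point from a dependent to its head, `**` marking an utterance root, and `[...]` grouping a multiword expression into one node. The function name and the dictionary output are illustrative assumptions. Identifying nodes by their word string is a simplification that breaks when a word repeats.

```python
import re

# A node is a bracketed multiword expression or a single word;
# either may carry the '**' root marker.
TOKEN = re.compile(r"\[[^\]]+\]\*{0,2}|[<>]|[^\s<>\[\]]+")
ARROWS = ("<", ">")

def parse_gfl(annotation):
    """Return (heads, roots): heads maps dependent -> head, roots lists '**' nodes."""
    heads, roots = {}, []
    for line in annotation.strip().splitlines():
        items = TOKEN.findall(line)
        if not items:
            continue
        raw_nodes, arrows = items[0::2], items[1::2]
        malformed = (
            len(items) % 2 == 0
            or any(a not in ARROWS for a in arrows)
            or any(n in ARROWS for n in raw_nodes)
        )
        if malformed:
            raise ValueError(f"expected 'node arrow node ...': {line!r}")
        nodes = []
        for raw in raw_nodes:
            node = raw.rstrip("*").strip("[]")
            if raw.endswith("**"):
                roots.append(node)
            nodes.append(node)
        for left, arrow, right in zip(nodes, arrows, nodes[1:]):
            if arrow == ">":
                heads[left] = right   # left depends on right
            else:
                heads[right] = left   # right depends on left
    return heads, roots

heads, roots = parse_gfl(
    "bieber > is** < alien < an\n"
    "he > [went down]** < to < earth"
)
```

Tokens that never appear in the annotation ("!", ":O", "." in this tweet) get no entry in `heads`, which matches the idea that unselected tokens are left out of the parse. The two `**` nodes give the tweet two roots.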

slide-21
SLIDE 21

(Mordowanec et al., 2014)

slide-22
SLIDE 22

Tweebank

  • Tweebank contains 929 tweets (12,318 tokens) with manual dependency parses.
  • Tweets are drawn from the POS-tagged Twitter corpus of Owoputi et al. (2013), which are tokenized and contain manually annotated POS tags.
  • 170 of the tweets were annotated by multiple users; inter-annotator agreement > 90%

slide-23
SLIDE 23

Statistics of our datasets

                    Train    Test
tweets                717     201
utterances          1,473     429
tokens              9,310   2,839
selected tokens     7,105   2,158

slide-24
SLIDE 24

Building A Parser — Road Map

  • Annotation guidelines
  • An annotated corpus
  • Parser adaptations
  • Useful features

slide-25
SLIDE 25

OMG I ♥ the Biebs & want to have his babies ! —> LA Times : Teen Pop Star Heartthrob is All the Rage on Social Media … #belieber

Parser Adaptation — Baseline

Out-of-the-Box Parser + Remove all the unselected tokens

slide-26
SLIDE 26

OMG I ♥ the Biebs & want to have his babies LA Times Teen Pop Star Heartthrob is All the Rage on Social Media

Parser Adaptation — Baseline

Out-of-the-Box Parser + Remove all the unselected tokens

  • Removing the tokens loses information (Ma et al., 2014)
  • Alternative: keep them “visible” to feature functions, but excluded from the parse tree

slide-27
SLIDE 27

Parser Adaptation —TurboParser

A graph-based dependency parser (Martins et al., 2009; Martins et al., 2014)

Decoding using AD3** (Martins et al., 2014). Many overlapping parts (tree, head automata, etc.) can be handled by making use of separate combinatorial algorithms for efficiently handling subsets of constraints.

** AD3: Alternating Directions Dual Decomposition

slide-28
SLIDE 28

Parser Adaptation —TurboParser

  • Do NOT change the feature function
  • Do NOT remove the unselected tokens
  • Adapt the decoding algorithm to exclude unselected tokens from the tree

Constrain z_arc(i, j) = 0 whenever x_i or x_j is excluded.

For second-order factorization (i.e., sibling [p,c,c’] & grandparent [p,c,g]) (McDonald and Satta, 2007; Carreras, 2007) + grand-sibling head automata (Koo et al., 2010; Martins et al., 2014): head automata for an unselected x_p or x_g, and transitions that consider unselected tokens as children, are eliminated.
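A small sketch of the arc constraint only: every arc that touches an unselected token is ruled out before decoding, while the tokens themselves stay in the sentence so that feature functions can still see them. The score matrix, the greedy head choice, and all names here are illustrative assumptions. TurboParser decodes with AD3 under tree constraints, which this sketch does not attempt, and greedy selection can produce cycles.

```python
NEG_INF = float("-inf")

def mask_unselected(scores, selected):
    """scores[h][m] is the score of arc h -> m; index 0 is the wall symbol.

    Arcs with an unselected head or modifier get -inf, i.e. z_arc(h, m) = 0.
    """
    n = len(scores)
    return [
        [
            scores[h][m] if selected[h] and selected[m] and h != m and m != 0
            else NEG_INF
            for m in range(n)
        ]
        for h in range(n)
    ]

def greedy_heads(masked):
    """Pick the best remaining head for each token (no tree constraint)."""
    n = len(masked)
    heads = {}
    for m in range(1, n):
        best = max(range(n), key=lambda h: masked[h][m])
        if masked[best][m] != NEG_INF:
            heads[m] = best
    return heads

# Tokens: 0 = wall, 1 = "RT" (unselected), 2 = "Hailee", 3 = "get".
# The scores are invented for illustration.
scores = [
    [0.0, 1.0, 0.5, 2.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.2, 0.0, 0.1],
    [0.0, 0.4, 2.5, 0.0],
]
selected = [True, False, True, True]
heads = greedy_heads(mask_unselected(scores, selected))
```

Without the mask, token 2 would attach to the unselected token 1 (score 3.0). With it, token 2 attaches to token 3, and token 1 receives no head at all.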

slide-29
SLIDE 29

Parser Adaptation

Unlabeled Attachment F1 (%) [bar chart, axis 78 to 82]

  Main                                  80.9
  -PA (without parser adaptation)       79.2
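A sketch of how an unlabeled attachment F1 can be computed when the predicted and gold parses may cover different token sets, since token selection is itself predicted. The dictionary representation (modifier index to head index, 0 for the wall) and the precision/recall definition used here are assumptions for illustration, not the evaluation script behind the reported numbers.

```python
def unlabeled_attachment_f1(gold, pred):
    """gold and pred map each selected token index to its head index."""
    correct = sum(1 for mod, head in pred.items() if gold.get(mod) == head)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)   # share of predicted arcs that are right
    recall = correct / len(gold)      # share of gold arcs that were recovered
    return 2 * precision * recall / (precision + recall)

# Toy parses: the prediction wrongly selects token 1 and misses token 4.
gold = {2: 3, 3: 0, 4: 3}
pred = {1: 3, 2: 3, 3: 0}
f1 = unlabeled_attachment_f1(gold, pred)
```

When both parses select exactly the same tokens, precision and recall coincide and the F1 reduces to the usual unlabeled attachment score.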

slide-30
SLIDE 30

Building A Parser — Road Map

  • An annotation guideline
  • An annotated corpus
  • Parser adaptations
  • Useful features

slide-31
SLIDE 31

PTB Features

Hailee Now get Twitter a *

  word pair         PTB model score
  get, Twitter      3.05
  get, Now          2.39
  ……
  Now, Twitter      0.63

Getting the scores from a first-order model trained on the PTB

slide-32
SLIDE 32

PTB Features

Hailee Now get Twitter a *

Features for the arc (get, Twitter):

  wh = “get” & wm = “Twitter”
  ph = “V” & pm = “^”
  direction = “right”
  ……
  PTB model score = 3.05
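A sketch of how the arc features listed on this slide could sit in one feature map next to the PTB model score. The function name, the feature-string format, the tags for tokens other than "get" and "Twitter", and the choice to pass the score in as a raw real value are all assumptions for illustration.

```python
def arc_features(words, tags, head, mod, ptb_score):
    """Feature map for the candidate arc head -> mod (indices into words)."""
    direction = "right" if mod > head else "left"
    return {
        f"wh={words[head]}&wm={words[mod]}": 1.0,
        f"ph={tags[head]}&pm={tags[mod]}": 1.0,
        f"direction={direction}": 1.0,
        # Score of the same arc under a first-order parser trained on the PTB.
        "ptb_score": ptb_score,
    }

# "*" is the wall symbol; only the tags "V" and "^" for get/Twitter
# come from the slide, the rest are assumed.
words = ["*", "Now", "Hailee", "get", "a", "Twitter"]
tags = ["*", "R", "^", "V", "D", "^"]
feats = arc_features(words, tags, head=3, mod=5, ptb_score=3.05)
```

The point of the extra feature is that the Twitter-trained parser can learn how much to trust the PTB model, rather than being trained on PTB trees directly.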

slide-33
SLIDE 33

PTB Features

Unlabeled Attachment F1 (%) [bar chart, axis 78 to 82]

  Main                                  80.9
  -PTB (without PTB features)           80.2

slide-34
SLIDE 34

Brown Clustering

  • Found very useful in dependency parsing and Twitter POS tagging (Brown et al., 1992; Koo et al., 2008; Owoputi et al., 2013)
  • We use clusters trained on 56,345,753 tweets from Owoputi et al. (2012)
  • We implement the Brown clustering features following Koo et al. (2008)
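A sketch of cluster-prefix features in the spirit of Koo et al. (2008), where short prefixes of a word's Brown bit string act as coarse word classes and the full string acts as a fine-grained one. The prefix lengths, the feature names, and the toy bit strings are assumptions for illustration; real cluster paths would come from the released Twitter clusters.

```python
def brown_features(word, clusters, prefix_lengths=(4, 6)):
    """Return cluster-based features for one word."""
    bits = clusters.get(word.lower())
    if bits is None:
        return ["brown=UNK"]
    feats = [f"brown{n}={bits[:n]}" for n in prefix_lengths]
    feats.append(f"brown_full={bits}")
    return feats

# Hypothetical bit strings, invented for this example.
toy_clusters = {
    "lol": "1110101011",
    "haha": "1110101010",
    "the": "0010",
}
lol_feats = brown_features("LOL", toy_clusters)
haha_feats = brown_features("haha", toy_clusters)
```

Words such as "lol" and "haha" then share their prefix features even when one of them never occurs in the small annotated training set.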

slide-35
SLIDE 35

Brown Clustering

Unlabeled Attachment F1 (%) [bar chart, axis 78 to 82]

  Main                                  80.9
  -Brown Clustering                     81.2

slide-36
SLIDE 36

Building A Parser — Road Map

  • Annotation guidelines
  • An annotated corpus
  • Parser adaptations
  • Useful features

slide-37
SLIDE 37

Experiments — Setup

                    Train   Test-New   Test-Foster
tweets                717        201         < 250
utterances          1,473        429           337
tokens              9,310      2,839         2,841
selected tokens     7,105      2,158         2,366

slide-38
SLIDE 38

Experiments

Unlabeled Attachment F1 (%)

              Test-New   Test-Foster
Main Parser       80.9          76.1

On par with state-of-the-art reported results for news text in Turkish (77.6%; Koo et al., 2010) and Arabic (81.1%; Martins et al., 2011).

slide-39
SLIDE 39

Test-New
  • sample: 50% randomly sampled from tweets on 10/27/2010; 50% randomly sampled from 1/2011 through 6/2012
  • OOV rate against the Penn Treebank training set: 45.2%

Test-Foster
  • sample: selected tweets from Bermingham and Smeaton’s (2010) corpus, which uses fifty predefined topics
  • OOV rate against the Penn Treebank training set: 21.6%

(PTB test set: 13.2%)

Experiments — Dataset
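A sketch of an out-of-vocabulary rate like the one reported on this slide. It assumes the rate is counted over test tokens with exact string matching against the training vocabulary; whether the reported figures use tokens or types, or any case normalisation, is not stated in the slides. The vocabulary and tweet are toy data.

```python
def oov_rate(test_tokens, train_vocab):
    """Fraction of test tokens never seen in the training vocabulary."""
    if not test_tokens:
        raise ValueError("test_tokens must be non-empty")
    unseen = sum(1 for tok in test_tokens if tok not in train_vocab)
    return unseen / len(test_tokens)

# Toy stand-in for a newswire vocabulary.
train_vocab = {"the", "website", "is", "great"}
rate = oov_rate("ur website suxx bro".split(), train_vocab)
```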

slide-40
SLIDE 40

Experiments — Preprocessing

                                Test-New
Main Parser                         80.9
(++) Gold POS and TS                83.2
(+) Gold POS, automatic TS          82.0
(+) Automatic POS, gold TS          82.0

TS = token selection

slide-41
SLIDE 41

Unlabeled Attachment F1 (%)

  Baseline, mod. POS**     73.0
  Baseline, POS as-is      73.5
  Main Parser              80.9

Experiments — Which Training Set?

** mod. POS: maps at-mentions to pronouns, and hashtags and URLs to nouns, at test time
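A sketch of the test-time tag mapping described in the footnote. The source tags "@", "#" and "U" are the at-mention, hashtag and URL tags of the Twitter POS tagset (Gimpel et al., 2011); the PTB targets PRP and NN are assumed here as the concrete pronoun and noun tags, and the function name is hypothetical.

```python
# Twitter-specific tag -> PTB-style tag (targets are assumptions).
MOD_POS = {
    "@": "PRP",  # at-mention -> pronoun
    "#": "NN",   # hashtag    -> noun
    "U": "NN",   # URL        -> noun
}

def modify_pos(tags):
    """Replace Twitter-specific tags; leave every other tag unchanged."""
    return [MOD_POS.get(tag, tag) for tag in tags]

mapped = modify_pos(["@", "V", "#", "U"])
```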

slide-42
SLIDE 42

Conclusion

  • TweeboParser: a dependency parser for English tweets that achieves over 80% unlabeled attachment score on a new, high-quality test set.
  • Tweebank: a corpus of 929 tweets (12,318 tokens) with manual dependency parses
  • Adaptations to a statistical parsing algorithm
  • New approach to exploiting data in a better-resourced domain (PTB)

slide-43
SLIDE 43

Thanks!

http://www.ark.cs.cmu.edu/TweetNLP

The dataset and parser are available online!