Presenting TWITTIRÒ-UD: An Italian Twitter Treebank in Universal Dependencies (PowerPoint PPT presentation)

SLIDE 1

Presenting TWITTIRÒ-UD

An Italian Twitter Treebank in Universal Dependencies

Alessandra Teresa Cignarella (a,b), Cristina Bosco (b) and Paolo Rosso (a)

a. Universitat Politècnica de València b. Università degli Studi di Torino

SLIDE 2

Motivation

SLIDE 8

Motivation

  • 1. Sentiment Analysis and Opinion Mining

→ irony, sarcasm, stance, hate speech, misogyny...

  • 2. Dealing with social media texts

→ hard!!

  • 3. Syntax

→ Universal Dependencies are cool!

SLIDE 9

Research Questions

SLIDE 14

Research Questions

  • 1. How can we automatically detect irony?
  • 2. Could syntactic information help in the detection of irony?

...and maybe help in other detection tasks too?

Our approach:

Let’s build a corpus and find out!

SLIDE 15

What is TWITTIRÒ-UD?

SLIDE 20

What is TWITTIRÒ-UD?

A treebank of Italian tweets in Universal Dependencies, annotated for irony and sarcasm.

SLIDE 21

Related Work

SLIDE 23

Related Work

Social media & Twitter:

  • Tagging the Twitterverse (Foster et al., 2011)
  • The French Social Media Bank (Seddah et al., 2012)
  • TWEEBANK (Kong et al., 2014)
  • TWEEBANK v2 (Liu et al., 2018)
  • Arabic (Albogamy and Ramsay, 2017)
  • African-American English (Blodgett et al., 2018)
  • Hindi-English (Bhat et al., 2018)
SLIDE 28

Related Work

Two main references for our work:

  • UD_Italian treebank (Simi et al., 2014)
  • PoSTWITA-UD (Sanguinetti et al., 2018)
SLIDE 29

Data

SLIDE 34

Data

  • 1,424 tweets from TWITTIRÒ (Cignarella et al., 2018)
  • fine-grained irony annotation (Karoui et al., 2017)
  • sarcasm annotation (EVALITA 2018)

Irony types: 1. EXPLICIT, 2. IMPLICIT
Irony categories: 1. ANALOGY, 2. EUPHEMISM, 3. RHETORICAL QUESTION, 4. OXYMORON or PARADOX, 5. FALSE ASSERTION, 6. CONTEXT SHIFT, 7. HYPERBOLE or EXAGGERATION, 8. OTHER

SLIDE 35

Annotation

SLIDE 39

Annotation

# text = Presentato il nuovo iPhone. È già al 36% di batteria.
# irony = EXPLICIT OXYMORON/PARADOX
# sarcasm = 1

Translation: The new iPhone has been launched. Battery is already at 36%.
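The annotation above uses CoNLL-U-style sentence-level comment lines. A minimal sketch of how such metadata can be read back, assuming `# key = value` comments as shown on the slide (the function and sample are illustrative, not an official CoNLL-U library API):

```python
# Minimal sketch: extract sentence-level metadata ("# key = value" comment
# lines) from one CoNLL-U sentence block. The keys (text, irony, sarcasm)
# mirror the slide; token lines are ignored.

def parse_metadata(conllu_block: str) -> dict:
    """Return a dict of the '# key = value' comments in one sentence block."""
    meta = {}
    for line in conllu_block.splitlines():
        line = line.strip()
        if line.startswith("#") and "=" in line:
            key, _, value = line.lstrip("# ").partition("=")
            meta[key.strip()] = value.strip()
    return meta

tweet = """\
# text = Presentato il nuovo iPhone. È già al 36% di batteria.
# irony = EXPLICIT OXYMORON/PARADOX
# sarcasm = 1
1\tPresentato\tpresentare\tVERB
"""

meta = parse_metadata(tweet)
```

Each annotation layer then stays attached to its tweet without altering the token lines themselves.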

SLIDE 40

Data

SLIDE 45

Data

With the tool UDPipe:

  • tokenization
  • lemmatization
  • PoS-tagging
  • dependency parsing

→ 1,424 tweets (17,933 tokens)

Full release in the UD repository: November 2019
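Corpus-level counts like those above (tweets and tokens) fall out directly from the CoNLL-U layout: sentences are blank-line-separated blocks, and token lines carry a plain integer ID. A sketch under those assumptions (the inline sample is illustrative, not corpus data):

```python
# Sketch: count sentences and tokens in CoNLL-U text. Multiword ranges
# ("1-2"), empty nodes ("1.1"), and comment lines are not counted as tokens.

def count_conllu(text: str) -> tuple[int, int]:
    sentences, tokens = 0, 0
    in_sentence = False
    for line in text.splitlines():
        if not line.strip():
            in_sentence = False      # blank line ends the current block
            continue
        if not in_sentence:
            sentences += 1           # first non-blank line opens a block
            in_sentence = True
        first = line.split("\t", 1)[0]
        if first.isdigit():          # skips comments, "1-2", "1.1"
            tokens += 1
    return sentences, tokens

sample = (
    "# text = ciao mondo\n"
    "1\tciao\tciao\tINTJ\n"
    "2\tmondo\tmondo\tNOUN\n"
    "\n"
    "# text = xkè?\n"
    "1-2\txkè?\t_\t_\n"
    "1\txkè\tperché\tADV\n"
    "2\t?\t?\tPUNCT\n"
)

counts = count_conllu(sample)
```

On the sample this yields 2 sentences and 4 tokens; note the multiword line `1-2` is correctly excluded.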

SLIDE 51

Data

  • 1. Fine-grained annotation for irony
  • 2. Morpho-syntactic information
SLIDE 52

Issues Encountered and Lessons Learned

SLIDE 60

Issues Encountered and Lessons Learned

  • Tokenization errors caused by misspelled words (e.g. xkè → perché)
  • Punctuation used irregularly
  • Twitter marks (#hashtags, @mentions)
  • No sentence splitting
  • Single-root constraint
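Two of the issues above (misspellings and Twitter marks) are commonly handled before parsing. A sketch, assuming a small normalisation lexicon (only xkè → perché comes from the slides; further entries would be guesses) and a regex tokenizer that keeps hashtags and mentions as single tokens:

```python
import re

# Sketch of Twitter-aware preprocessing: normalise known misspellings and
# keep #hashtags / @mentions whole. Not the authors' actual pipeline.

NORMALIZE = {"xkè": "perché"}  # slide example; extend with care

# One token = a #hashtag/@mention, or a word, or a single punctuation mark.
TOKEN_RE = re.compile(r"[#@]\w+|\w+|[^\w\s]")

def tokenize(text: str) -> list[str]:
    tokens = TOKEN_RE.findall(text)
    return [NORMALIZE.get(t.lower(), t) for t in tokens]

result = tokenize("xkè no?? @utente #irony")
```

Irregular punctuation still comes out as separate tokens (each `?` on its own), which keeps the choice of how to attach it to the dependency tree downstream.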

SLIDE 67

Other Highlights

SLIDE 70

Other Highlights

  • Punctuation is exploited more extensively in the two social media datasets than in UD_Italian.
  • Mentions and hashtags have a similar distribution in the two social media datasets.
  • The use of the passive voice (aux:pass) is low in PoSTWITA-UD and TWITTIRÒ-UD, indicating a preference for the active voice, as in spoken language.
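Distributional observations like the aux:pass comparison above reduce to relative frequencies of a dependency relation over token lines. A sketch over CoNLL-U text (column 8 is DEPREL; the inline sample is illustrative, not treebank data):

```python
# Sketch: relative frequency of one dependency relation (DEPREL, field 8)
# over all token lines of a CoNLL-U fragment.

def deprel_frequency(conllu: str, deprel: str) -> float:
    total = hits = 0
    for line in conllu.splitlines():
        fields = line.split("\t")
        if len(fields) >= 8 and fields[0].isdigit():
            total += 1
            hits += fields[7] == deprel
    return hits / total if total else 0.0

sample = "\n".join([
    "1\tÈ\tessere\tAUX\t_\t_\t3\taux:pass\t_\t_",
    "2\tstato\tessere\tAUX\t_\t_\t3\taux:pass\t_\t_",
    "3\tpresentato\tpresentare\tVERB\t_\t_\t0\troot\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
])

punct_rate = deprel_frequency(sample, "punct")
```

Running the same function over each of the three treebanks would reproduce the kind of cross-corpus comparison summarised in the bullets.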

SLIDE 71

A Parsing Experiment

SLIDE 76

A Parsing Experiment

We evaluated UDPipe using the TWITTIRÒ-UD gold corpus as a test set, under the following settings:

  • 1. training UDPipe using only UD_Italian
  • 2. training UDPipe using only PoSTWITA-UD
  • 3. training UDPipe using both resources
SLIDE 80

A Parsing Experiment

Results are in line with the state of the art (PoSTWITA-UD; Sanguinetti et al., 2018).
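The standard metrics behind such parsing evaluations are UAS (share of tokens with the correct head) and LAS (correct head and relation label). A minimal sketch; the gold/predicted pairs are invented for illustration, not experiment data:

```python
# Sketch: unlabeled and labeled attachment scores over aligned token lists,
# where each token is a (head, deprel) pair.

def uas_las(gold, pred):
    """gold, pred: lists of (head, deprel) per token, aligned one-to-one."""
    assert len(gold) == len(pred) and gold
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "punct")]

scores = uas_las(gold, pred)
```

Here every head is correct but one label differs, so UAS is 1.0 while LAS drops to 0.75, which is exactly the gap the labeled metric is meant to expose.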

SLIDE 81

Conclusions

SLIDE 84

Conclusions

  • We discuss the annotation of this resource, which encompasses a fine-grained representation of irony and the UD morpho-syntactic analysis.
  • The release of the complete resource (1,424 tweets) is planned for November 2019.
  • It enriches the set of available resources for a text genre that is especially hard to parse (social media texts).

SLIDE 85

Future Work

SLIDE 89

Future Work

  • Investigation of possible relationships between the syntax and the semantics of figurative language (irony in particular)

→ ongoing experiments...

  • A resource whose annotation encompasses both UD relations and a fine-grained description of irony may pave the way for investigating whether syntactic knowledge can help in Sentiment Analysis and other related tasks

→ new NLP features for Sentiment Analysis?

SLIDE 90

Thank you!

cigna@di.unito.it