A Crowd-Annotated Spanish Corpus for Humor Analysis

SLIDE 1

A Crowd-Annotated Spanish Corpus for Humor Analysis

Santiago Castro, Luis Chiruzzo, Aiala Rosá, Diego Garat and Guillermo Moncecchi

July 20th, 2018

Grupo de Procesamiento de Lenguaje Natural, Universidad de la República, Uruguay

SLIDE 2

Outline

  • Background
  • Extraction
  • Annotation
  • Dataset
  • Analysis
  • Conclusion
  • HAHA Task

SLIDE 3

Background

SLIDE 4

Background i

  • Humor Detection is about telling if a text is humorous (e.g., a joke).

My grandpa came to America looking for freedom, but it didn’t work out, in the next flight my grandma was coming.

IT’S REALLY HOT

SLIDE 5

Background ii

  • Some previous work, such as Barbieri and Saggion (2014), Mihalcea and Strapparava (2005), and Sjöbergh and Araki (2007), created binary Humor Classifiers for short texts written in English.
  • They extracted one-liners from the Internet and from Twitter, such as: “Beauty is in the eye of the beer holder.”
  • Castro et al. (2016) worked on Spanish tweets since our group is interested in leveraging tools for Spanish.
  • Back then, we conceived the first and only Spanish dataset to study Humor.

SLIDE 6

Background iii

  • The Castro et al. (2016) corpus provided 40k tweets from 18 accounts, with 34k annotations. The annotators decided whether each tweet was humorous or not, and if so they rated it from 1 to 5.
  • However, the dataset has some issues:
    1. low inter-annotator agreement (Fleiss’ κ = 0.3654)
    2. limited variety of sources (humorous: 9 Twitter accounts; non-humorous: 3 news accounts, 3 about inspirational thoughts and 3 about curious facts)
    3. very few annotations per tweet (fewer than 2 on average, around 500 with ≥ 5 annotations)
    4. only 6k were considered humorous by the crowd
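
The agreement figure in issue 1 is Fleiss’ κ (Fleiss, 1971). A minimal sketch of how that statistic is computed, assuming every item received the same number of ratings; the function name and the toy ratings are illustrative and are not taken from the corpus:

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for items that all received the same number of ratings.

    `ratings` is a list of lists; each inner list holds the category labels
    given to one item. This layout is an assumption made for illustration.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    category_totals = Counter()
    observed = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        observed += sum(c * (c - 1) for c in counts.values()) / (
            n_raters * (n_raters - 1)
        )
    observed /= n_items

    # Chance agreement from the overall category proportions.
    total = n_items * n_raters
    expected = sum((c / total) ** 2 for c in category_totals.values())
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): 4 tweets, 3 annotators each.
toy = [
    ["humor", "humor", "humor"],
    ["humor", "humor", "not"],
    ["not", "not", "not"],
    ["not", "humor", "not"],
]
kappa = fleiss_kappa(toy)
```

The statistic is undefined when only one category occurs in the data, since chance agreement is then 1.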

SLIDE 7

Background iv

SLIDE 8

Related work

Potash, Romanov, and Rumshisky (2017) built a corpus based on tweets in English that aims to distinguish the degree of funniness in a given tweet. They used the tweet set issued in response to a TV game show, labeling which tweets were considered humorous by the show. It was used in SemEval-2017 Task 6: #HashtagWars.

SLIDE 9

Extraction

SLIDE 10

Extraction i

  1. We wanted to have at least 20k tweets, as balanced as possible, with at least 5 annotations each.
  2. We fetched tweets from 50 humorous accounts from Spanish-speaking countries, taking 12k at random.
  3. We fetched tweet samples written in Spanish throughout February 2018, taking 12k at random.

SLIDE 11

Extraction ii

  4. As expected, both sources contained a mix of humorous and non-humorous tweets.

SLIDE 12

Annotation

SLIDE 13

Annotation i

We built a web page, similar to the one used by Castro et al. (2016):

SLIDE 14

Annotation ii

clasificahumor.com

SLIDE 15

Annotation iii

  • Tweets were shown to annotators in random order, avoiding duplicates (by using web cookies).
  • We wanted the UI to be as intuitive and self-explanatory as possible, trying not to induce any bias on users and letting them come up with their own definition of humor.
  • The simple and friendly interface is meant to keep the users engaged and having fun while classifying tweets.
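
The random selection without duplicates could be sketched as follows. This is only an illustration: the `seen` set stands in for whatever the web cookie tracks per session, and the function name and tweet IDs are hypothetical.

```python
import random


def next_tweet(tweet_ids, seen, rng=random):
    """Pick a random tweet this session has not seen yet, or None if none remain.

    `seen` is a set of tweet IDs already shown; it plays the role of the
    per-session state that the site keeps in a cookie (an assumption).
    """
    remaining = [t for t in tweet_ids if t not in seen]
    if not remaining:
        return None
    choice = rng.choice(remaining)
    seen.add(choice)
    return choice


# Hypothetical session over three tweet IDs; the fourth request finds nothing left.
rng = random.Random(0)
seen = set()
shown = [next_tweet([10, 20, 30], seen, rng) for _ in range(4)]
```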

SLIDE 16

Annotation iv

  • People annotated from March 8th to 27th, 2018.
  • The first tweets shown in every session were the same: 3 tweets for which we knew a clear answer.
  • During the annotation process, we added around 4,500 tweets coming from humorous accounts to help the balance.

SLIDE 17

Dataset

SLIDE 18

Dataset i

  • The dataset consists of two CSV files: tweets and annotations.

tweets:

tweet ID   origin
24         humorous account

annotations:

tweet ID   session ID     date                  value
24         YOH113F…C4R    2018-03-15 19:30:34   2
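
A minimal sketch of reading two such files and grouping annotations by tweet. The column names, the second origin value and the session IDs below are assumptions made for illustration; the real headers may differ.

```python
import csv
import io

# Hypothetical file contents; the column names mirror the fields listed on the
# slide, and the session IDs and the "sample" origin are made up.
TWEETS_CSV = """tweet_id,origin
24,humorous account
25,sample
"""

ANNOTATIONS_CSV = """tweet_id,session_id,date,value
24,SESSION_A,2018-03-15 19:30:34,2
24,SESSION_B,2018-03-16 10:02:11,3
25,SESSION_A,2018-03-15 19:31:02,1
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def group_annotations(tweets, annotations):
    """Map each tweet ID to its origin and the list of annotation values."""
    grouped = {
        row["tweet_id"]: {"origin": row["origin"], "values": []}
        for row in tweets
    }
    for row in annotations:
        grouped[row["tweet_id"]]["values"].append(row["value"])
    return grouped


dataset = group_annotations(load_rows(TWEETS_CSV), load_rows(ANNOTATIONS_CSV))
```

Values are kept as strings because the slide does not say how `value` is coded.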

SLIDE 19

Dataset ii

  • 27,282 tweets
  • 117,800 annotations (including 2,959 skips)
  • 107,634 “high quality” annotations (excluding skips)

SLIDE 20

Analysis

SLIDE 21

Annotation Distribution

[Figure: distribution of tweets by number of annotations. Axes: Number of annotations, Tweets.]

SLIDE 22

Class Distribution

  • Excellent: 1%
  • Good: 3.2%
  • Regular: 7%
  • Little Funny: 10.3%
  • Not Funny: 13.3%
  • Not Humorous: 65.2%

SLIDE 23

Annotators Distribution

[Figure: distribution of annotations across annotators. Axes: Annotators, Annotations.]

SLIDE 24

Agreement

  • Krippendorff’s α = 0.5710 (vs. Fleiss’ κ = 0.3654 in the Castro et al. (2016) corpus)
  • If we include the “low quality” annotations, α = 0.5512
  • Funniness: α = 0.1625
  • If we only consider the 11 annotators who tagged more than 1,000 times (50,939 annotations in total), the humor and funniness agreement values are 0.6345 and 0.2635, respectively.
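
For reference, a minimal sketch of Krippendorff’s α with the nominal distance, which suits the humorous / not humorous decision and tolerates a varying number of annotations per tweet. The slides do not state which distance metric was used for funniness, so this is not necessarily the exact computation behind the figures above; the function name and toy data are illustrative.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists, one per item, holding the labels that item
    received. Items may have different numbers of labels; items with fewer
    than two labels carry no agreement information and are skipped.
    The value is undefined when only one label occurs in the data.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Each ordered pair of labels from different annotators counts 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals of the coincidence matrix.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement, both scaled by n.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    return 1.0 - observed / expected


# Toy data (hypothetical): the last tweet has a single label and is ignored.
toy = [
    ["humor", "humor"],
    ["not", "not"],
    ["humor", "not"],
    ["humor", "humor"],
    ["not"],
]
alpha = krippendorff_alpha_nominal(toy)
```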

SLIDE 25

Conclusion

SLIDE 26

Conclusion

  • We created a better version of a dataset to study Humor in Spanish: 27,282 tweets coming from multiple sources, with 107,634 “high quality” annotations.
  • Significant inter-annotator agreement value.
  • It is also a first step to study subjectivity. Although more annotations per tweet would be appropriate, there is a subset of a thousand tweets with at least six annotations that could be used to study people’s opinion on the same instances.

SLIDE 27

HAHA Task

SLIDE 28

HAHA Task

  • An IberEval 2018 task.
  • Two subtasks: Humor Classification and Funniness Average Prediction.
  • Subset of 20k tweets.
  • 3 participants, with 7 and 2 submissions to the two subtasks, respectively.
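
As a rough illustration of the two targets named above, the sketch below derives a humor label and an average funniness score from one tweet's votes. The vote encoding and the majority rule are assumptions; the slides do not give the task's actual labeling criterion.

```python
def gold_labels(votes):
    """Derive two targets from one tweet's votes.

    `votes` holds one entry per annotation: None for "not humorous" or an
    integer 1-5 for the funniness rating. Both this encoding and the simple
    majority rule are assumptions, not the task's documented criterion.
    """
    ratings = [v for v in votes if v is not None]
    is_humorous = len(ratings) > len(votes) / 2
    average = sum(ratings) / len(ratings) if is_humorous else None
    return is_humorous, average


# Hypothetical votes: four funniness ratings and one "not humorous".
label, average = gold_labels([3, 4, None, 2, 5])
```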

SLIDE 29

Analysis

Category       Votes   Hits
Humorous       3/5     52.25%
               4/5     75.33%
               5/5     85.04%
Not humorous   3/5     68.54%
               4/5     80.83%
               5/5     82.42%

SLIDE 30

References i

References

Barbieri, Francesco and Horacio Saggion (2014). “Automatic Detection of Irony and Humour in Twitter”. In: ICCC, pp. 155–162.

Castro, Santiago et al. (2016). “Is This a Joke? Detecting Humor in Spanish Tweets”. In: Ibero-American Conference on Artificial Intelligence. Springer, pp. 139–150. doi: 10.1007/978-3-319-47955-2_12.

SLIDE 31

References ii

Fleiss, Joseph L (1971). “Measuring nominal scale agreement among many raters”. In: Psychological bulletin 76.5, p. 378. doi: 10.1037/h0031619.

Krippendorff, Klaus (2012). Content analysis: An introduction to its methodology. Sage. doi: 10.1111/j.1468-4446.2007.00153_10.x.

Mihalcea, Rada and Carlo Strapparava (2005). “Making Computers Laugh: Investigations in Automatic Humor Recognition”. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. HLT ’05. Vancouver, British Columbia, Canada: Association for Computational Linguistics, pp. 531–538. doi: 10.3115/1220575.1220642.

SLIDE 32

References iii

Potash, Peter, Alexey Romanov, and Anna Rumshisky (2017). “SemEval-2017 Task 6: #HashtagWars: Learning a sense of humor”. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 49–57. doi: 10.18653/v1/s17-2004.

Sjöbergh, Jonas and Kenji Araki (2007). “Recognizing Humor Without Recognizing Meaning”. In: WILF. Ed. by Francesco Masulli, Sushmita Mitra, and Gabriella Pasi. Vol. 4578. Lecture Notes in Computer Science. Springer, pp. 469–476. isbn: 978-3-540-73399-7. doi: 10.1007/978-3-540-73400-0_59.

SLIDE 33

Questions?

https://pln-fing-udelar.github.io/humor/
