Twitter User Profiling: Bot and Gender Identification



SLIDE 1

Twitter User Profiling: Bot and Gender Identification

7th Author Profiling Task PAN 2019 – CLEF Workshop

Dijana Kosmajac, Dr. Vlado Keselj
Faculty of Computer Science, Dalhousie University
Halifax, Nova Scotia, Canada

SLIDE 2

Overview

  • Introduction
  • Bot Detection on Social Media
  • Methodology
  • DNA-inspired User Behaviour Fingerprint
  • Diversity Measures
  • Dataset of 7th Author Profiling Task
  • Experiments and Results
  • Conclusion

Note: for the gender detection approach, please refer to the working notes.

SLIDE 3

Bot Detection on Social Media

  • Social media platforms are convenient places for people to share, communicate, and collaborate.
  • The openness of social media is valuable, but it also enables malicious behaviour such as bullying, terrorist attack planning, and the dissemination of fraudulent information.
  • Important task: detect these abnormal activities as accurately and as early as possible to prevent disasters and attacks.
  • This study addresses one subdomain: bot detection.

Introduction Methodology Dataset Experiments Conclusion

SLIDE 4

Bot and Gender Detection on Social Media

  • DeBot: Twitter Bot Detection via Warped Correlation, Chavoshi et al., 2016
  • DNA-Inspired Online Behavioral Modeling and Its Application to Spambot Detection, Cresci et al., 2016

SLIDE 5

DNA-inspired User Behaviour Fingerprint

  • First introduced by Cresci et al., 2016

[Figure: a user timeline is encoded as a string of labels, each entry mapped to the character ASCII(65 + code); 3 ∗ 2^3 = 24 different labels. Example fingerprint: ACBCADDCCAF…]
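The encoding named on this slide can be sketched roughly as follows. The slide only states the mapping ASCII(65 + code) and the label count 3 ∗ 2^3 = 24; the three tweet types and three binary content flags used below (`TWEET_TYPES`, `has_url`, `has_hashtag`, `has_mention`), the `Tweet` class, and the bit layout are illustrative assumptions and are not taken from the slides.

```python
from dataclasses import dataclass

# ASSUMPTION: the slide only says 3 * 2^3 = 24 labels; these three tweet
# types and the three binary flags below are hypothetical stand-ins.
TWEET_TYPES = ("tweet", "retweet", "reply")


@dataclass
class Tweet:
    kind: str                  # one of TWEET_TYPES (assumed)
    has_url: bool = False      # assumed binary feature
    has_hashtag: bool = False  # assumed binary feature
    has_mention: bool = False  # assumed binary feature


def tweet_code(tweet: Tweet) -> int:
    """Map a tweet to an integer in 0..23 (3 types x 2^3 flag combinations)."""
    type_index = TWEET_TYPES.index(tweet.kind)
    flags = (int(tweet.has_url) << 2) | (int(tweet.has_hashtag) << 1) | int(tweet.has_mention)
    return type_index * 8 + flags


def fingerprint(timeline: list[Tweet]) -> str:
    """Encode a timeline as a string, one character ASCII(65 + code) per tweet."""
    return "".join(chr(65 + tweet_code(t)) for t in timeline)


if __name__ == "__main__":
    timeline = [
        Tweet("tweet"),
        Tweet("retweet", has_url=True),
        Tweet("reply", True, True, True),
    ]
    print(fingerprint(timeline))  # AMX
```

With 24 codes the alphabet runs from A (code 0) to X (code 23), so every fingerprint is a plain uppercase string that ordinary text-processing tools can handle.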

SLIDE 6

DNA-inspired User Behaviour Fingerprint

  • We used 1-, 2-, 3- and 4-grams
  • 3-gram example:
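As a minimal sketch of the n-gram step, the snippet below slides a window over a fingerprint string and counts the resulting character n-grams. It uses only the visible prefix of the example fingerprint from the previous slide; the function names `char_ngrams` and `ngram_counts` are illustrative, not the authors' code.

```python
from collections import Counter


def char_ngrams(fingerprint: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of the fingerprint, in order."""
    return [fingerprint[i:i + n] for i in range(len(fingerprint) - n + 1)]


def ngram_counts(fingerprint: str, n_values=(1, 2, 3, 4)) -> Counter:
    """Count n-grams for every n in n_values (1- to 4-grams by default)."""
    counts = Counter()
    for n in n_values:
        counts.update(char_ngrams(fingerprint, n))
    return counts


if __name__ == "__main__":
    # Visible prefix of the example fingerprint; the rest is elided on the slide.
    print(char_ngrams("ACBCADDCCAF", 3))
    # ['ACB', 'CBC', 'BCA', 'CAD', 'ADD', 'DDC', 'DCC', 'CCA', 'CAF']
```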

SLIDE 7

Diversity Measures

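  • Yule’s $K = C\left[-\frac{1}{N} + \sum_{m=1}^{m_{\max}} V(m,N)\left(\frac{m}{N}\right)^{2}\right]$
  • Shannon’s $H = -\sum_{i=1}^{V(N)} p_i \ln(p_i)$
  • Simpson’s $D = \frac{1}{\sum_{i=1}^{V(N)} p_i^{2}}$
  • Honore’s $R = 100\,\frac{\log(N)}{1 - \frac{V(1,N)}{V(N)}}$
  • Sichel’s $S = \frac{V(2,N)}{N}$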
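The five measures can be computed from a table of n-gram counts, as in the sketch below. It assumes the usual reading of the symbols, which the slide does not spell out: N is the total number of n-gram tokens, V(N) the number of distinct n-grams, V(m, N) the number of distinct n-grams occurring exactly m times, and p_i the relative frequency of the i-th n-gram. The constant C = 10^4 for Yule's K, the natural logarithm in Honore's R, and the function name `diversity_measures` are also assumptions. Sichel's S divides by N as written on the slide.

```python
import math
from collections import Counter


def diversity_measures(counts: Counter, c: float = 1e4) -> dict[str, float]:
    """Compute the five diversity measures from n-gram frequency counts.

    ASSUMPTIONS: c = 1e4 for Yule's K and the natural log in Honore's R.
    """
    freqs = [f for f in counts.values() if f > 0]
    if not freqs:
        raise ValueError("counts must contain at least one n-gram")

    n_tokens = sum(freqs)       # N
    n_types = len(freqs)        # V(N)
    spectrum = Counter(freqs)   # spectrum[m] = V(m, N)
    probs = [f / n_tokens for f in freqs]

    yule_k = c * (-1 / n_tokens
                  + sum(v * (m / n_tokens) ** 2 for m, v in spectrum.items()))
    shannon_h = -sum(p * math.log(p) for p in probs)
    simpson_d = 1 / sum(p * p for p in probs)

    hapax = spectrum.get(1, 0)  # V(1, N)
    # R is undefined when every n-gram occurs exactly once; report infinity.
    honore_r = (100 * math.log(n_tokens) / (1 - hapax / n_types)
                if hapax < n_types else math.inf)

    sichel_s = spectrum.get(2, 0) / n_tokens  # V(2, N) / N, as on the slide

    return {
        "yule_k": yule_k,
        "shannon_h": shannon_h,
        "simpson_d": simpson_d,
        "honore_r": honore_r,
        "sichel_s": sichel_s,
    }


if __name__ == "__main__":
    print(diversity_measures(Counter("AABBC")))
```

A highly repetitive fingerprint concentrates its counts on a few n-grams, which pushes Yule's K up and Shannon's H down compared with a more varied fingerprint of the same length.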

SLIDE 8

Dataset

  • Bot t-SNE visualization: (a) English, (b) Spanish
  • English: 2,880 train and 1,240 dev
  • Spanish: 2,080 train and 920 dev

SLIDE 9

Dataset

  • Diversity measures visualization for English

[Figure panels: Yule’s K, Shannon’s H, Simpson’s D, Honore’s R, Sichel’s S]

SLIDE 10

Dataset

  • Diversity measures visualization for Spanish

[Figure panels: Yule’s K, Shannon’s H, Simpson’s D, Honore’s R, Sichel’s S]

SLIDE 11

Experiments with language-specific training

  • Experiment 1: character n-grams in the range 2-4, without diversity measures.
  • Experiment 2: character n-grams in the range 1-3, with diversity measures.

SLIDE 12

Experiments with combined training

  • Experiment 3: same as Experiment 1, but trained on the combined (English and Spanish) training set.
  • Experiment 4: same as Experiment 2, but trained on the combined (English and Spanish) training set.

SLIDE 13

Official results

  • 13th place overall, ahead of all baselines.

SLIDE 14

Conclusion and Future Work

  • A novel yet simple method for bot detection on social media.
  • Language independent, since it does not use language-specific features.
  • Disadvantage: it does not consider language-specific features, which may be more fine-grained.
  • Explore the effect of the length of the user fingerprint on the ability to differentiate bots from genuine users.
  • Explore the effect of the timespan over which the fingerprint is collected.
  • Explore the effect of using variable-length fingerprints.
  • Explore the possibility of unsupervised bot detection using diversity measures and clustering.
