Multimodal Gender Identification in Twitter PAN-AP-2018 CLEF 2018

6th Author Profiling task at PAN
Multimodal Gender Identification in Twitter
PAN-AP-2018, CLEF 2018, Avignon, 10-14 September
Authors: Francisco Rangel, Paolo Rosso, Manuel Montes y Gómez, Martin Potthast & Benno Stein
Affiliations: Autoritas Consulting & PRHLT Research Center, Universitat Politècnica de València; INAOE, Mexico; Bauhaus-Universität Weimar; PRHLT Research Center, Universitat Politècnica de València

PAN’18 Introduction
Author profiling aims to identify personal traits such as age, gender, personality, native language, or language variety from an author's writing.
This is crucial for:
- Marketing.
- Security.
- Forensics.

Task goal
To investigate the identification of an author’s gender from multimodal information: texts + images.
Three languages:
● English
● Spanish
● Arabic

Corpus
PAN-AP'17 subset extended with images shared in the authors' timelines:
● 100 tweets per author.
● 10 images per author.

Evaluation measures
Accuracy is calculated per modality and language:
● Text-based.
● Image-based.
● Combined.
The final ranking is the average of the combined* accuracies per language:
ranking = (acc_AR + acc_EN + acc_ES) / 3
* If only the textual approach was submitted, its accuracy has been used.
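The ranking rule above can be sketched in a few lines of Python. This is only an illustration: the function name, the dictionary layout and the accuracy values are hypothetical and are not taken from the task's evaluation scripts or results.

```python
from statistics import mean

LANGUAGES = ("ar", "en", "es")


def final_score(results):
    """Average the per-language accuracies used for the final ranking.

    `results` maps a language code to a dict with the keys "text",
    "image" and "combined" (accuracies in [0, 1]); "image" and
    "combined" are optional. When "combined" is missing, the "text"
    accuracy is used instead, mirroring the footnote on the slide.
    """
    per_language = []
    for lang in LANGUAGES:
        scores = results[lang]
        combined = scores.get("combined")
        per_language.append(scores["text"] if combined is None else combined)
    return mean(per_language)


# Hypothetical participant: multimodal runs for AR and EN, text only for ES.
example = {
    "ar": {"text": 0.78, "image": 0.70, "combined": 0.80},
    "en": {"text": 0.80, "image": 0.72, "combined": 0.82},
    "es": {"text": 0.79},
}
print(round(final_score(example), 4))
```

The image-only accuracy is reported per modality but does not enter the final score in this sketch.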

Baselines
● BASELINE-stat:
○ A statistical baseline that emulates random choice.
○ Both modalities.
● BASELINE-bow:
○ Documents represented as bag-of-words.
○ The 5,000 most common words in the training set.
○ Weighted by absolute frequency.
○ Preprocess: lowercase, removal of punctuation signs and numbers, removal of stopwords.
○ Textual modality.
● BASELINE-rgb:
○ RGB color for each pixel in each author's images.
○ The author is represented with the minimum, maximum, mean, median, and standard deviation of the RGB values.
○ Images modality.
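A minimal sketch of the feature extraction behind BASELINE-bow and BASELINE-rgb, using only the Python standard library. Several details are assumptions that the slides do not specify: the stopword list, the tokenisation regex, the use of the population standard deviation, and computing the RGB statistics per channel (15 values) rather than over all values pooled. All function names are hypothetical.

```python
import re
from collections import Counter
from statistics import mean, median, pstdev

# Tiny illustrative stopword list (assumption; the slides do not name one).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def tokenize(text):
    """Lowercase, keep alphabetic runs only (drops punctuation and digits),
    then remove stopwords."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def build_vocabulary(training_docs, size=5000):
    """Return the `size` most common words in the training documents."""
    counts = Counter()
    for doc in training_docs:
        counts.update(tokenize(doc))
    return [word for word, _ in counts.most_common(size)]


def bow_vector(doc, vocabulary):
    """Represent a document by absolute word frequencies over the vocabulary."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]


def rgb_features(images):
    """Summarise all pixels of an author's images.

    `images` is a list of images, each a list of (r, g, b) tuples.
    Returns min, max, mean, median and standard deviation per channel
    (15 values; the per-channel split is an assumption).
    """
    features = []
    for channel in range(3):
        values = [pixel[channel] for image in images for pixel in image]
        features += [min(values), max(values), mean(values),
                     median(values), pstdev(values)]
    return features


docs = ["The cat and the dog!", "A dog is a dog 123."]
vocab = build_vocabulary(docs)
print(vocab, bow_vector("Dog, dog, cat", vocab))
```

Either feature vector would then be passed to a classifier; the slides do not say which one the baselines use.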

23 participants, 22 working notes, 17 countries.
Countries: Brazil, Canada, France, Germany, India, Israel, Japan, Mexico, Netherlands, Portugal, Slovenia, Spain, Sweden, Switzerland, Turkey, UK, USA.

Approaches

Approaches - Preprocessing
● TEXTS
○ Punctuation signs: Ciccone et al., Stout et al., HaCohen-Kerner et al., Veenhoven et al.
○ Character flooding: Ciccone et al., Raiyani et al.
○ Lowercase: Von Däniken et al., Veenhoven et al., Nieuwenhuis et al., Bayot & Gonçalves, Kosse et al., Stout et al., Schaetti, HaCohen-Kerner et al.
○ Stopwords: Ciccone et al., Raiyani et al., HaCohen-Kerner et al., Veenhoven et al.
○ Twitter specific components (hashtags, urls, mentions and RTs): Ciccone et al., Takahashi et al., Stout et al., Raiyani et al., Schaetti, HaCohen-Kerner et al., Von Däniken et al., Martinc et al., Veenhoven et al., Nieuwenhuis et al., Kosse et al.
○ Contractions and abbreviations: Stout et al., Raiyani et al.
○ Normalisation and diacritics removal in Arabic: Ciccone et al.
● IMAGES
○ Resizing, rescaling: Takahashi et al., Martinc et al., Sierra-Loaiza & González
○ Normalisation (subtracting the average RGB value per language): Takahashi et al.
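To make the text preprocessing steps listed above concrete, here is a minimal Python sketch that handles Twitter-specific components, lowercasing and character flooding. It does not reproduce any participant's pipeline: the regular expressions, the placeholder tokens (`<url>`, `<user>`, `<hashtag>`) and the choice to shorten flooded characters to two repetitions are all assumptions made for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RETWEET_RE = re.compile(r"^RT\b:?\s*")
FLOOD_RE = re.compile(r"(.)\1{2,}")  # any character repeated 3+ times


def preprocess_tweet(text):
    """Normalise one tweet with steps several participants report."""
    text = RETWEET_RE.sub("", text)           # drop a leading retweet marker
    text = URL_RE.sub("<url>", text)          # URLs first: they may contain '#'
    text = MENTION_RE.sub("<user>", text)
    text = HASHTAG_RE.sub("<hashtag>", text)
    text = text.lower()
    text = FLOOD_RE.sub(r"\1\1", text)        # "sooooo" -> "soo"
    return " ".join(text.split())             # collapse whitespace


print(preprocess_tweet("RT @bob: Sooooo COOL!!! #fun https://t.co/x"))
# prints "<user>: soo cool!! <hashtag> <url>"
```

Replacing Twitter components with placeholders rather than deleting them keeps their frequency available as a stylistic signal; deleting them outright is an equally plausible choice.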

Approaches - Textual Features
● Stylistic features (ratios of links, hashtags or user mentions, character flooding, emoticons / laughter expressions, domain names): Patra et al., Karlgren et al., HaCohen-Kerner et al., Von Däniken et al.
● N-gram models: Stout et al., Sandroni-Dias & Paraboni, López-Santillán et al., Von Däniken et al., Tellez et al., Nieuwenhuis et al., Kosse et al., Daneshvar, HaCohen-Kerner et al., Ciccone et al., Aragón & López
● LSA: Patra et al.
● Second order representation: Aragón & López
● A variation of LDSE: Garibo-Orts
● Word embeddings: Martinc et al., Veenhoven et al., Bayot & Gonçalves, López-Santillán et al., Takahashi et al., Patra et al.
● Character embeddings: Schaetti
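N-gram models are the most common textual representation in the list above. The following Python sketch counts character n-grams for an author's tweets. The slides do not say whether participants used word or character n-grams, which n ranges they chose, or how they weighted the counts, so the function names, the default range and the raw-count weighting are assumptions.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """Count all character n-grams of length n_min..n_max in `text`."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


def author_profile(tweets, n_min=2, n_max=3):
    """Sum the n-gram counts over all tweets of one author."""
    profile = Counter()
    for tweet in tweets:
        profile.update(char_ngrams(tweet, n_min, n_max))
    return profile


profile = author_profile(["abab", "ab"])
print(profile.most_common(3))
```

Counts like these are typically turned into a fixed-length vector over a shared vocabulary before being passed to a classifier such as an SVM or logistic regression.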

Approaches - Image Features
● Face detection: Stout et al., Ciccone et al., Veenhoven et al.
● Object detection: Ciccone et al.
● Local binary patterns: Ciccone et al.
● Hand-crafted features: HaCohen-Kerner et al.
● Color histogram: Ciccone et al., HaCohen-Kerner et al.
● Bag of Visual Words: Tellez et al.
● Image resources and tools (e.g. ImageNet, TorchVision...): Patra et al., Nieuwenhuis et al., Aragón & López, Schaetti, Takahashi et al.

Approaches - Methods
● Logistic regression: Sandroni-Dias & Paraboni, HaCohen-Kerner et al., Von Däniken et al., Nieuwenhuis et al.
● SVM: López-Santillán et al., Aragón & López, Ciccone et al., Patra et al., Tellez et al., Veenhoven et al.
● Multilayer Perceptron: HaCohen-Kerner et al.
● Basic feed-forward network: Kosse et al.
● Distance-based method: Tellez et al., Karlgren et al.
● IF condition: Garibo-Orts
● RNN: Takahashi et al., Bayot & Gonçalves, Stout et al.
● CNN: Schaetti
● ResNet18: Schaetti
● Bi-LSTM: Veenhoven et al.

Textual modality
● AR: n-grams
● EN: n-grams
● ES: n-grams

Images modality
● Best: Pre-trained CNN w. ImageNet
● 2nd. AR: VGG16 + ResNet50 from ImageNet
● 2nd. EN: VGG16 + ResNet50 from ImageNet
● 2nd. ES: Color histogram + faces + objects + local binary patterns

Improvement with images
● On average, there is almost no improvement.
● Some systems obtain high improvements (up to 7.73%):
○ Pre-trained CNN w. ImageNet.

Improvement (AR)

Improvement (EN)

Improvement (ES)

Final ranking *

PAN-AP 2018 best results

Conclusions
● Several approaches to tackle the task:
○ Deep learning prevailing.
● Textual classification:
○ Best results regarding textual subtask: n-grams + traditional methods (SVM, logistic reg.).
○ The second best result for Spanish: bi-LSTM with word embeddings.
● Image classification approaches based on:
○ Face recognition. <- Failed!
○ Pre-trained models and image processing tools such as ImageNet. <- Best results obtained with semantic features extracted from the images.
○ Hand-crafted features such as color histograms and bag-of-visual-words.
● Texts vs. Images:
○ Textual features discriminate better than images.
○ On average, there is no improvement when images are used.
○ Elaborated representations improve results by up to 7.73% (English).
● Best results:
○ Over 80% on average (EN: 85.84%; ES: 82%; AR: 81.80%).
○ English (85.84%): Takahashi et al. with deep learning techniques (RNN for text, ImageNet + CNN for images).
○ Spanish (82%): Daneshvar with SVM and combinations of n-grams (only textual features).
○ Arabic (81.80%): Tellez et al. with SVM + n-grams, and Bag of Visual Words.
● Insight:
○ Traditional approaches remain competitive, but deep learning is gaining ground.

Task impact

EDITION       PARTICIPANTS  COUNTRIES
PAN-AP 2013   21            16
PAN-AP 2014   10            8
PAN-AP 2015   22            13
PAN-AP 2016   22            15
PAN-AP 2017   22            19
PAN-AP 2018   23            17

Industry at PAN (Author Profiling): Organisation, Sponsors, Participants

2019 -> Robot or human?

On behalf of the author profiling task organisers: thank you very much for participating, and we hope to see you next year!
