Beyond Binary Labels: Political Ideology Prediction of Twitter Users


  1. Beyond Binary Labels: Political Ideology Prediction of Twitter Users
     Daniel Preoțiuc-Pietro
     Joint work with Ye Liu (NUS), Daniel J Hopkins (Political Science), Lyle Ungar (CS)
     2 August 2017

  2. Motivation
     User attribute prediction from text is successful:
     ◮ Age (Rao et al. 2010 ACL)
     ◮ Gender (Burger et al. 2011 EMNLP)
     ◮ Location (Eisenstein et al. 2010 EMNLP)
     ◮ Personality (Schwartz et al. 2013 PLoS One)
     ◮ Impact (Lampos et al. 2014 EACL)
     ◮ Political Orientation (Volkova et al. 2014 ACL)
     ◮ Mental Illness (Coppersmith et al. 2014 ACL)
     ◮ Occupation (Preoțiuc-Pietro et al. 2015 ACL)
     ◮ Income (Preoțiuc-Pietro et al. 2015 PLoS One)
     ... and useful in many applications.

  3. Political Ideology & Text
     Hypothesis: Political ideology of a user is disclosed through language use
     ◮ partisan political mentions or issues
     ◮ cultural differences

  4. Political Ideology & Text
     Previous CS / NLP research used data sets with user labels identified through:
     1. User descriptions
     H1 Users identified this way are far more likely to be politically engaged

  5. Political Ideology & Text
     2. Partisan hashtags
     H2 The prediction problem has so far been over-simplified

  6. Political Ideology & Text
     3. Lists of Conservative / Liberal users
     H3 Neutral users can be identified

  7. Political Ideology & Text
     4. Followers of partisan accounts
     H4 Differences in language use exist between moderate and extreme users

  8. Data
     ◮ Political ideology
       ◮ specific to country and culture
       ◮ our use case is US politics (similar to all previous work)
       ◮ the major US ideology spectrum is Conservative – Liberal
       ◮ seven-point scale

  9. Data
     We collect a new data set:
     ◮ 3,938 users (4.8M tweets)
     ◮ public Twitter handle with > 100 posts
     Political ideology is reported through an online survey
     ◮ only way to obtain unbiased ground truth labels (Flekova et al. 2016 ACL, Carpenter et al. 2016 SPPS)
     ◮ additionally reported age, gender and other demographics

  10. Data
      ◮ Data available at preotiuc.ro
        ◮ full data for research purposes
        ◮ aggregate for replicability
      ◮ Twitter Developer Agreement & Policy VII.A4: "Twitter Content, and information derived from Twitter Content, may not be used by, or knowingly displayed, distributed, or otherwise made available to any entity to target, segment, or profile individuals based on [...] political affiliation or beliefs"
      ◮ Study approved by the Institutional Review Board (IRB) of the University of Pennsylvania

  11. Class Distribution
      [Figure: bar chart of the number of users per ideology class; y-axis from 0 to 1000. Bar labels printed on the slide: 696, 692, 594, 501, 453, 401, 195.]

  12. Data
      For comparison to previous work, we collect a data set:
      ◮ 13,651 users (25.5M tweets)
      ◮ follow liberal / conservative politicians on Twitter

  13. Hypotheses
      H1 Previous studies used users far more likely to be politically engaged
      H2 The prediction problem has so far been over-simplified
      H3 Neutral users can be identified
      H4 Differences in language use exist between moderate and extreme users

  14. Engagement
      H1 Previous studies used users far more likely to be politically engaged
      Manually coded:
      ◮ Political words (234)
      ◮ Political NEs: mentions of politician proper names (39)
      ◮ Media NEs: mentions of political media sources and pundits (20)
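
      The usage measure behind these word lists can be sketched as follows. This is a minimal illustration, assuming usage is the share of a user's tokens that match a list; the slides do not state the exact normalisation. The three tiny lexicons and the function name `usage_percentages` are placeholders, not the hand-coded lists from the talk, and multi-word names are not handled.

```python
import re
from collections import Counter

# Placeholder lexicons (assumption): the talk's manually coded lists
# have 234, 39 and 20 entries and are not reproduced here.
LEXICONS = {
    "political_words": {"congress", "senate", "election"},
    "politician_names": {"obama", "romney"},
    "media_names": {"maddow", "hannity"},
}


def usage_percentages(tweets, lexicons=LEXICONS):
    """Return, per lexicon, the percentage of a user's tokens found in it."""
    # Lowercase and split every tweet into simple word tokens.
    tokens = [
        tok
        for tweet in tweets
        for tok in re.findall(r"[a-z0-9']+", tweet.lower())
    ]
    total = len(tokens)
    if total == 0:
        return {name: 0.0 for name in lexicons}
    counts = Counter(tokens)
    return {
        name: 100.0 * sum(counts[word] for word in words) / total
        for name, words in lexicons.items()
    }


if __name__ == "__main__":
    user_tweets = ["Obama spoke to Congress today", "I love pizza"]
    print(usage_percentages(user_tweets))
```

      Averaging these per-user percentages within each user group would give one bar per group in a chart of political word usage.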

  15. Engagement
      Data set obtained using previous methods
      [Figure: stacked bar chart, "Political word usage across user groups"; y-axis: average percentage of political word usage (0.00 to 4.00); series: Political Words, Politician Names, Media/Pundit Names. Political Words values on the two bars: 2.64 and 2.95.]

  16. Engagement
      Our data set
      [Figure: the same stacked bar chart, "Political word usage across user groups", extended with the seven survey classes; y-axis: average percentage of political word usage (0.00 to 4.00); series: Political Words, Politician Names, Media/Pundit Names. Political Words values as extracted from the slide: 2.64, 0.76, 0.55, 0.42, 0.36, 0.46, 0.51, 0.76, 2.95.]

  17. Engagement
      Our data set
      [Figure: same chart as slide 16.]

  18. Engagement
      Takeaways:
      ◮ 3x more political terms for automatically identified users compared to the highest survey-based scores
      ◮ almost perfectly symmetrical U-shape across all three types of political terms
      ◮ the difference between classes 1 and 2 (and 6 and 7) is larger than between 2 and 3 (and 5 and 6)

  19. Hypotheses
      H1 Previous studies used users far more likely to be politically engaged
      H2 The prediction problem has so far been over-simplified
      H3 Neutral users can be identified
      H4 Differences in language use exist between moderate and extreme users

  20. Over-simplification
      H2 The prediction problem has so far been over-simplified
      [Figure: bar chart of ROC AUC (Logistic Regression, 10-fold cross-validation); series: Topics, Political Terms, Domain Adaptation. Bar values, ascending: CvL .891, .972, .976.]

  21. Over-simplification
      H2 The prediction problem has so far been over-simplified
      [Figure: same chart with 1v7 added. Bar values, ascending: CvL .891, .972, .976; 1v7 .785, .785, .789.]

  22. Over-simplification
      H2 The prediction problem has so far been over-simplified
      [Figure: same chart with 2v6 added. Bar values, ascending: CvL .891, .972, .976; 1v7 .785, .785, .789; 2v6 .662, .679, .690.]

  23. Over-simplification
      H2 The prediction problem has so far been over-simplified
      [Figure: same chart with 3v5 added. Bar values, ascending: CvL .891, .972, .976; 1v7 .785, .785, .789; 2v6 .662, .679, .690; 3v5 .581, .590, .625. ROC AUC, Logistic Regression, 10-fold cross-validation.]
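
      The binary tasks compared above (1v7, 2v6, 3v5) and their ROC AUC score can be illustrated with a short sketch. It assumes each user is a pair of a 1 to 7 survey label and a classifier score, with the higher class treated as positive; the helper names `binary_task` and `roc_auc` are illustrative and the classifier itself is out of scope. The AUC is computed directly as the fraction of (positive, negative) pairs ranked correctly, counting ties as half.

```python
def binary_task(users, low, high):
    """Keep only users in classes `low` or `high`; label `high` as 1, `low` as 0.

    `users` is a list of (ideology, score) pairs, ideology on the 1..7 scale.
    """
    labels, scores = [], []
    for ideology, score in users:
        if ideology == low:
            labels.append(0)
            scores.append(score)
        elif ideology == high:
            labels.append(1)
            scores.append(score)
    return labels, scores


def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outscores a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up scores for illustration only.
    users = [(1, 0.2), (7, 0.9), (4, 0.5), (2, 0.3), (6, 0.4), (7, 0.1)]
    for low, high in [(1, 7), (2, 6)]:
        labels, scores = binary_task(users, low, high)
        print(f"{low}v{high}: AUC = {roc_auc(labels, scores):.3f}")
```

      The pairwise loop is quadratic in the number of users, which is acceptable for a few thousand users; a rank-based formula would give the same value faster.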

  24. Over-simplification
      Predicting continuous political leaning (1 – 7)
      [Figure: bar chart of Pearson R between predictions and true labels (Linear Regression, 10-fold cross-validation) for the Leaning target; feature sets: Unigrams, LIWC, Topics, Emotions, Political, All. Bar values, descending: .369, .300, .294, .286, .256, .145.]
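
      The evaluation protocol named on this slide (linear regression, 10-fold cross-validation, Pearson R between predictions and true labels) can be sketched in plain Python. This is a toy version under stated assumptions: a single numeric feature instead of the talk's feature sets, interleaved folds, and held-out predictions pooled across folds before computing one correlation. The slides do not say whether R was pooled or averaged per fold, and all function names here are illustrative.

```python
import math


def kfold_indices(n, k=10):
    """Yield (train, test) index lists; fold i holds out items i, i+k, i+2k, ..."""
    for i in range(k):
        test = list(range(i, n, k))
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        yield train, test


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def cross_validated_r(feature, labels, k=10):
    """Pool held-out predictions over k folds, then correlate with the labels."""
    preds = [0.0] * len(labels)
    for train, test in kfold_indices(len(labels), k):
        a, b = fit_line([feature[i] for i in train], [labels[i] for i in train])
        for i in test:
            preds[i] = a + b * feature[i]
    return pearson_r(preds, labels)


if __name__ == "__main__":
    # Made-up feature values and 1..7 labels for illustration only.
    feature = [0.1, 0.4, 0.2, 0.9, 0.7, 0.3, 0.8, 0.5, 0.6, 0.0,
               0.15, 0.45, 0.25, 0.95, 0.75, 0.35, 0.85, 0.55, 0.65, 0.05]
    labels = [2, 3, 1, 7, 5, 3, 6, 4, 4, 1, 1, 4, 2, 6, 6, 2, 7, 3, 5, 2]
    print(f"cross-validated Pearson R: {cross_validated_r(feature, labels):.3f}")
```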

  25. Over-simplification
      Seven-class classification
      [Figure: bar chart of accuracy (10-fold cross-validation). Bar values: 27.60%, 26.20%, 24.20%, 22.20%, 19.60%.]
      GR – Logistic regression with Group Lasso regularisation

  26. Hypotheses
      H1 Previous studies used users far more likely to be politically engaged
      H2 The prediction problem has so far been over-simplified
      H3 Neutral users can be identified
      H4 Differences in language use exist between moderate and extreme users

  27. Neutral Users
      H3 Neutral users can be identified
      [Figure: two word clouds, "Words associated with neutral users" and "Words associated with either extreme conservative or liberal users"; legend: correlation strength.]
      Correlations are age and gender controlled. Extreme groups are combined using matched age and gender distributions.

  28. Political Engagement
      H3a There is a separate dimension of political engagement
      Combine the classes into a scale: 4 – 3&5 – 2&6 – 1&7
      [Figure: bar chart of Pearson R between predictions and true labels (Linear Regression, 10-fold cross-validation) for two targets, Leaning and Engagement; feature sets: Unigrams, LIWC, Topics, Emotions, Political, All. Leaning values, descending: .369, .300, .294, .286, .256, .145. Engagement values, descending: .196, .169, .169, .165, .149, .079.]
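
      The class merge on this slide (4, then 3&5, 2&6, 1&7) amounts to folding the seven-point scale around its midpoint. A minimal sketch follows; the numeric coding 0 to 3 and the name `engagement_level` are assumptions for illustration, since the slide gives only the ordering of the merged groups.

```python
def engagement_level(ideology):
    """Map a 1..7 ideology label to distance from the neutral midpoint (4).

    4 -> 0, 3 and 5 -> 1, 2 and 6 -> 2, 1 and 7 -> 3.
    """
    if not 1 <= ideology <= 7:
        raise ValueError("ideology must be on the 1..7 scale")
    return abs(ideology - 4)


if __name__ == "__main__":
    print([engagement_level(i) for i in range(1, 8)])  # [3, 2, 1, 0, 1, 2, 3]
```

      Because the result is ordinal, it can serve as a regression target in the same way as the leaning score.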

  29. Hypotheses
      H1 Previous studies used users far more likely to be politically engaged
      H2 The prediction problem has so far been over-simplified
      H3 Neutral users can be identified
      H4 Differences in language use exist between moderate and extreme users

  30. Moderate Users
      H4 Differences between moderate and extreme users
      [Figure: two word clouds, "Words associated with moderate liberals (5 and 6)" and "Words associated with extreme liberals (7)"; legends: correlation strength, relative frequency.]
      Correlations are age and gender controlled

  31. Takeaways
      ◮ User-level trait acquisition methodologies can generate non-representative samples
      ◮ Political ideology:
        ◮ Goes beyond binary classes
        ◮ The problem has to date been over-simplified
      ◮ New data set available for research
      ◮ New model to identify political leaning and engagement

  32. Questions? www.preotiuc.ro wwbp.org
