An analysis of the user occupational class through Twitter content
Daniel Preoţiuc-Pietro (1), Vasileios Lampos (2), Nikolaos Aletras (2)
(1) Computer and Information Science, University of Pennsylvania
(2) Department of Computer Science, University College London
29 July 2015

Motivation
User attribute prediction from text is successful:
◮ Age (Rao et al. 2010 ACL)
◮ Gender (Burger et al. 2011 EMNLP)
◮ Location (Eisenstein et al. 2011 EMNLP)
◮ Personality (Schwartz et al. 2013 PLoS One)
◮ Impact (Lampos et al. 2014 EACL)
◮ Political orientation (Volkova et al. 2014 ACL)
◮ Mental illness (Coppersmith et al. 2014 ACL)
Downstream applications are benefiting from this:
◮ Sentiment analysis (Volkova et al. 2013 EMNLP)
◮ Text classification (Hovy 2015 ACL)

However...
Socio-economic factors (occupation, social class, education, income) play a vital role in language use (Bernstein 1960, Labov 1972 / 2006)
No large-scale user-level dataset exists to date
Applications:
◮ sociological analysis of language use
◮ embedding in downstream tasks (e.g. controlling for socio-economic status)

At a Glance
Our contributions:
◮ Predicting a new user attribute: occupation
◮ New dataset: user ←→ occupation
◮ Gaussian Process classification for NLP tasks
◮ Feature ranking and analysis using non-linear methods

Standard Occupational Classification
Standardised job classification taxonomy
Developed and used by the UK Office for National Statistics (ONS)
Hierarchical:
◮ 1-digit (major) groups: 9
◮ 2-digit (sub-major) groups: 25
◮ 3-digit (minor) groups: 90
◮ 4-digit (unit) groups: 369
Jobs grouped by skill requirements

Standard Occupational Classification
C1 Managers, Directors and Senior Officials
◮ 11 Corporate Managers and Directors
  ◮ 111 Chief Executives and Senior Officials
    ◮ 1115 Chief Executives and Senior Officials
      Job: chief executive, bank manager
    ◮ 1116 Elected Officers and Representatives
  ◮ 112 Production Managers and Directors
  ◮ 113 Functional Managers and Directors
  ◮ 115 Financial Institution Managers and Directors
  ◮ 116 Managers and Directors in Transport and Logistics
  ◮ 117 Senior Officers in Protective Services
  ◮ 118 Health and Social Services Managers and Directors
  ◮ 119 Managers and Directors in Retail and Wholesale
◮ 12 Other Managers and Proprietors

Standard Occupational Classification
C2 Professional Occupations
  Job: mechanical engineer, pediatrist, postdoctoral researcher
C3 Associate Professional and Technical Occupations
  Job: system administrator, dispensing optician
C4 Administrative and Secretarial Occupations
  Job: legal clerk, company secretary
C5 Skilled Trades Occupations
  Job: electrical fitter, tailor
C6 Caring, Leisure, Other Service Occupations
  Job: school assistant, hairdresser
C7 Sales and Customer Service Occupations
  Job: sales assistant, telephonist
C8 Process, Plant and Machine Operatives
  Job: factory worker, van driver
C9 Elementary Occupations
  Job: shelf stacker, bartender
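Because the SOC codes are hierarchical, each coarser level is a prefix of the finer one. The following is a minimal sketch of that relationship; the function name `soc_levels` and the dictionary keys are illustrative and not part of the original slides.

```python
def soc_levels(unit_code: str) -> dict:
    """Split a 4-digit SOC unit code into its four hierarchy levels.

    Assumption: every level is simply a prefix of the unit code,
    as in the C1 -> 11 -> 111 -> 1115 example above.
    """
    if len(unit_code) != 4 or not unit_code.isdigit():
        raise ValueError(f"expected a 4-digit SOC code, got {unit_code!r}")
    return {
        "major": unit_code[:1],      # 1-digit group, e.g. C1
        "sub_major": unit_code[:2],  # 2-digit group
        "minor": unit_code[:3],      # 3-digit group
        "unit": unit_code,           # 4-digit group
    }


levels = soc_levels("1115")
print(levels["major"], levels["sub_major"], levels["minor"], levels["unit"])
```

Mapping a user's 3-digit label to the 1-digit class used for prediction is then just a matter of taking the first character of the code.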

Data
5,191 users ←→ 3-digit job group
Users collected by self-disclosure of job title in profile
Manually filtered by the authors
10M tweets, average 94.4 users per 3-digit group

Data
Here we classify only at the 1-digit top level group (9 classes)
Feature representation and labels available online
Raw data available for research purposes on request (per Twitter TOS)

Features
User Level features (18), such as:
◮ number of:
  ◮ followers
  ◮ friends
  ◮ listings
  ◮ tweets
◮ proportion of:
  ◮ retweets
  ◮ hashtags
  ◮ @-replies
  ◮ links
◮ average:
  ◮ tweets / day
  ◮ retweets / tweet
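As an illustration of the "proportion of" features, here is a minimal sketch that computes them from a list of tweet strings. The slides do not define these features precisely, so the detection rules below (a retweet starts with `RT @`, a reply starts with `@`, and so on) are assumptions, and the function name is hypothetical.

```python
def user_level_proportions(tweets):
    """Fraction of a user's tweets that are retweets, contain a hashtag,
    are @-replies, or contain a link (all rules are simplifying assumptions)."""
    n = len(tweets)
    if n == 0:
        return {"retweets": 0.0, "hashtags": 0.0, "replies": 0.0, "links": 0.0}
    return {
        "retweets": sum(t.startswith("RT @") for t in tweets) / n,
        "hashtags": sum("#" in t for t in tweets) / n,
        "replies": sum(t.startswith("@") for t in tweets) / n,
        "links": sum("http://" in t or "https://" in t for t in tweets) / n,
    }


example = ["RT @a: hi", "@b hello #tag", "see http://x.co", "plain"]
print(user_level_proportions(example))
```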

Features
Focus on interpretable features for analysis
Compute over reference corpus of 400M tweets:
◮ SVD embeddings and clusters
◮ Word2Vec (W2V) embeddings and clusters

SVD Features
Compute word × word similarity matrix
Similarity metric is Normalized PMI (Bouma 2009) using the entire tweet as context
SVD with different number of dimensions (30, 50, 100, 200)
User is represented by summing its word representations
The low-dimensional features offer no interpretability
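The similarity step can be sketched as follows, using the standard definition NPMI(x, y) = ln(p(x, y) / (p(x) p(y))) / (-ln p(x, y)), with probabilities estimated as the fraction of tweets containing the word or word pair. This covers only the NPMI computation, not the SVD; whitespace tokenisation and the function name `npmi_scores` are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_scores(tweets):
    """Return {(x, y): NPMI} for every word pair co-occurring in a tweet.

    The whole tweet is the context window; pairs are keyed with x < y.
    """
    n = len(tweets)
    word_count = Counter()
    pair_count = Counter()
    for tweet in tweets:
        words = sorted(set(tweet.lower().split()))  # count each word once per tweet
        word_count.update(words)
        pair_count.update(combinations(words, 2))

    scores = {}
    for (x, y), c_xy in pair_count.items():
        p_xy = c_xy / n
        if p_xy == 1.0:
            # NPMI is undefined (0/0) when the pair occurs in every tweet; skip it.
            continue
        pmi = math.log(p_xy / ((word_count[x] / n) * (word_count[y] / n)))
        scores[(x, y)] = pmi / -math.log(p_xy)
    return scores


print(npmi_scores(["a b", "a b", "c d", "c d"]))
```

NPMI ranges from -1 to 1, which makes the resulting word × word matrix easier to compare across frequent and rare words than raw PMI.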

SVD Features
Spectral clustering to get hard clusters of words (30, 50, 100, 200 clusters)
Each cluster consists of distributionally similar words ←→ topic
User is represented by the number of times they use a word from each cluster
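Given a hard word-to-cluster assignment, the user representation described above reduces to counting. A minimal sketch, assuming the clustering has already produced a dictionary `word_to_cluster` (a hypothetical name) and that out-of-vocabulary words are ignored:

```python
from collections import Counter


def cluster_counts(user_tokens, word_to_cluster, n_clusters):
    """Count how often a user uses a word from each cluster.

    Words without a cluster assignment are skipped (an assumption).
    """
    counts = Counter(
        word_to_cluster[w] for w in user_tokens if w in word_to_cluster
    )
    return [counts.get(c, 0) for c in range(n_clusters)]


word_to_cluster = {"art": 0, "design": 0, "cancer": 1}
print(cluster_counts(["art", "design", "art", "cancer", "hello"], word_to_cluster, 3))
```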

Word2Vec Features
Trained Word2Vec (layer size 50) on our Twitter reference corpus
Spectral clustering on the word × word similarity matrix (30, 50, 100, 200 clusters)
Similarity is cosine similarity of words in the embedding space
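The entries of that similarity matrix are plain cosine similarities between word vectors. A minimal sketch with toy vectors (the zero-vector convention is an assumption):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```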

Gaussian Processes
Brings together several key ideas in one framework:
◮ Bayesian
◮ kernelised
◮ non-parametric
◮ non-linear
◮ modelling uncertainty
Elegant and powerful framework, with growing popularity in machine learning and application domains

Gaussian Process
Graphical Model View
$f \sim \mathcal{GP}(m, k)$
$y \sim \mathcal{N}(f(x), \sigma^2)$
◮ $f : \mathbb{R}^D \to \mathbb{R}$ is a latent function
◮ $y$ is a noisy realisation of $f(x)$
◮ $k$ is the covariance function or kernel
◮ $m$ and $\sigma^2$ are learnt from data
[Figure: graphical model with nodes k, f, σ, x and y, and a plate over N]

Gaussian Process Classification
Pass the latent function through a logistic function to squash its value from (−∞, ∞) to (0, 1) and obtain a probability, $\pi(x) = p(y_i = 1 \mid f_i)$ (similar to logistic regression)
The likelihood is non-Gaussian and the solution is not analytical
Inference using Expectation Propagation (EP)
FITC approximation for large data
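The squashing step on its own can be sketched as below. This shows only the logistic link applied to a latent value; it does not perform EP inference or the FITC approximation.

```python
import math


def logistic(f):
    """Map a latent value f in (-inf, inf) to a probability in (0, 1).

    Written in a numerically stable form so large |f| does not overflow.
    """
    if f >= 0:
        return 1.0 / (1.0 + math.exp(-f))
    e = math.exp(f)
    return e / (1.0 + e)


for latent in (-3.0, 0.0, 3.0):
    print(latent, logistic(latent))
```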

Gaussian Process Classification
ARD kernel learns feature importance → features most discriminative between classes
We learn 9 one-vs-all binary classifiers
This way, we find the most predictive features consistent for all classes
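To make the ARD idea concrete, here is a minimal sketch of a kernel with one lengthscale per feature, and a ranking that treats short lengthscales as high importance. The slides only say "ARD kernel"; the squared-exponential form, the function names and the example lengthscales are assumptions for illustration.

```python
import math


def se_ard_kernel(x, x2, lengthscales, variance=1.0):
    """Squared-exponential kernel with a separate lengthscale per feature.

    k(x, x') = variance * exp(-0.5 * sum_d ((x_d - x'_d) / l_d)^2)
    """
    sq_dist = sum(((a - b) / l) ** 2 for a, b, l in zip(x, x2, lengthscales))
    return variance * math.exp(-0.5 * sq_dist)


def rank_features(lengthscales):
    """Feature indices ordered from most to least relevant.

    A short learnt lengthscale means the function varies quickly along
    that feature, so the feature matters more.
    """
    return sorted(range(len(lengthscales)), key=lambda d: lengthscales[d])


print(se_ard_kernel([0.0, 0.0], [1.0, 1.0], [1.0, 100.0]))
print(rank_features([5.0, 0.5, 2.0]))
```

With a very long lengthscale, a difference along that feature barely changes the kernel value, which is why such features are effectively switched off.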

Gaussian Process Resources
◮ Free book
  http://www.gaussianprocess.org/gpml/chapters/
◮ GPs for Natural Language Processing tutorial (ACL 2014)
  http://www.preotiuc.ro
◮ GP Schools in Sheffield and roadshows in Kampala, Pereira, Nyeri, Melbourne
  http://ml.dcs.shef.ac.uk/gpss/
◮ Annotated bibliography and other materials
  http://www.gaussianprocess.org
◮ GPy Toolkit (Python)
  https://github.com/SheffieldML/GPy

Prediction
[Bar chart: accuracy (%) of LR, SVM-RBF and GP per feature set, with a baseline; y-axis from 25 to 55]
Bar values per feature set, in descending order:
◮ User Level: 34.2, 34, 31.5
◮ SVD-E (200): 43.8, 43.1, 40
◮ SVD-C (200): 48.2, 47.9, 44.2
◮ W2V-E (50): 49, 48.4, 42.5
◮ W2V-C (200): 52.7, 51.7, 46.9
Stratified 10 fold cross-validation
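Stratified cross-validation keeps the class proportions roughly equal in every fold, which matters when the 9 classes are of unequal size. A minimal sketch of the fold assignment follows; it deals each class out round-robin and does no shuffling, which is a simplification rather than the procedure used for the reported results.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Assign example indices to k folds so each class is spread evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0  # continue the round-robin across classes to balance fold sizes
    for label in sorted(by_class):
        for i, idx in enumerate(by_class[label]):
            folds[(offset + i) % k].append(idx)
        offset += len(by_class[label])
    return folds


labels = [1] * 20 + [2] * 10
for fold in stratified_folds(labels, k=10):
    print(fold)
```

Each fold is used once as the test set while the remaining nine are used for training.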

Prediction Analysis
User level features have no predictive value
Clusters outperform embeddings
Word2Vec features are better than SVD / NPMI for prediction
Non-linear methods (SVM-RBF and GP) significantly outperform linear methods
52.7% accuracy for 9-class classification is decent

Class Comparison
Jensen-Shannon Divergence between topic distributions across occupational classes
Some clusters of occupations are observable
[Heatmap: pairwise divergence between occupational classes 1 to 9; colour scale from 0.00 to 0.03]
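The Jensen-Shannon divergence between two topic distributions P and Q is the average KL divergence of each from their mixture M = (P + Q) / 2. A minimal sketch, using natural logarithms (the base is an assumption; the slides do not state it):

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_divergence(p, q):
    """JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


print(jensen_shannon_divergence([0.5, 0.5], [0.5, 0.5]))
print(jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0]))
```

Unlike KL divergence, this measure is symmetric and always finite, so it can be shown as a symmetric class-by-class matrix.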

Feature Analysis
Rank | Manual Label | Topic (most frequent words)
1 | Arts | art, design, print, collection, poster, painting, custom, logo, printing, drawing
2 | Health | risk, cancer, mental, stress, patients, treatment, surgery, disease, drugs, doctor
3 | Beauty Care | beauty, natural, dry, skin, massage, plastic, spray, facial, treatments, soap
4 | Higher Education | students, research, board, student, college, education, library, schools, teaching, teachers
5 | Software Engineering | service, data, system, services, access, security, development, software, testing, standard
Most predictive Word2Vec 200 clusters as given by Gaussian Process ARD ranking

Feature Analysis
Rank | Manual Label | Topic (most frequent words)
7 | Football | van, foster, cole, winger, terry, reckons, youngster, rooney, fielding, kenny
8 | Corporate | patent, industry, reports, global, survey, leading, firm, 2015, innovation, financial
9 | Cooking | recipe, meat, salad, egg, soup, sauce, beef, served, pork, rice
12 | Elongated Words | wait, till, til, yay, ahhh, hoo, woo, woot, whoop, woohoo
16 | Politics | human, culture, justice, religion, democracy, religious, humanity, tradition, ancient, racism
Most predictive Word2Vec 200 clusters as given by Gaussian Process ARD ranking

Feature Analysis - Cumulative distribution functions
Higher Education (#21)
[Figure: CDF per occupational class C1 to C9; x-axis: topic proportion (ticks at 0.001, 0.01, 0.05); y-axis: user probability (0 to 1)]
Topic more prevalent → CDF line closer to bottom-right corner

Feature Analysis - Cumulative distribution functions
Arts (#116)
[Figure: CDF per occupational class C1 to C9; x-axis: topic proportion (ticks at 0.001, 0.01, 0.05); y-axis: user probability (0 to 1)]
Topic more prevalent → CDF line closer to bottom-right corner
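Each curve in these plots is an empirical CDF: for a topic proportion x, the fraction of users in a class whose proportion is at most x. A minimal sketch with made-up proportions (the values and the function name are illustrative only):

```python
from bisect import bisect_right


def empirical_cdf(values):
    """Return a function x -> fraction of values that are <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf


# Hypothetical topic proportions for the users of one class.
cdf = empirical_cdf([0.001, 0.01, 0.01, 0.05])
for x in (0.001, 0.01, 0.05):
    print(x, cdf(x))
```

If a class uses the topic heavily, few of its users fall below a given proportion, so its curve stays low for longer and sits closer to the bottom-right corner.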
