SLIDE 1 DON’T REMOVE MY STOP WORDS: IDENTIFYING PERSONALITY TRAITS FROM QUORA ANSWERS
ASHUTOSH BAHETI, 12CS10012
RAHUL GURNANI, 12CS10039
DHRUV JAIN, 12CS30043
NISHKARSH SHASTRI, 12CS10034
SABYASACHEE BARAUH, 12CS30029
SLIDE 2 OBJECTIVE
- Identify the personality of Quora users with respect to
the Big Five personality traits, using linguistic-feature-based analysis of their answers
- Openness to Experience
- Conscientiousness
- Extraversion
- Agreeableness
- Neuroticism
SLIDE 3
RELATED WORK
- Tausczik, Yla R., and James W. Pennebaker. "The psychological meaning of words: LIWC and computerized text analysis methods."
- Mairesse, François, et al. "Using linguistic cues for the automatic recognition of personality in conversation and text."
- Celli, Fabio, Fabio Pianesi, David Stillwell, and Michal Kosinski. Workshop on Computational Personality Recognition.
SLIDE 4
Project Timeline
SLIDE 5
- Classifying essay data with LIWC as features
- Identifying the linguistic features for the Big Five personality traits
- Extracting textual features from the essays
- Classifying based on new features and LIWC
- Surveying Quora users to get a labelled dataset
- Crawling the answers of surveyed users
- Using the Quora dump to expand LIWC
- Training the model on the labelled Quora dataset
- Calculating the accuracy of the trained model
SLIDE 6
Classification of Essay Data
Straightforward ML approach
- Labelled essays with binary values for each personality trait
- Sanitized the data present in the essays
- Created a trie structure for LIWC prefix matching
- Extracted features based on the LIWC word count for each category
- Applied SVM to the data using WEKA
The accuracy of the model was found to be 53%
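The trie-based prefix matching and per-category word counts could look roughly like the sketch below. It assumes LIWC-style patterns where a trailing `*` marks a prefix (for example `happ*`); the class name `LiwcTrie`, the helper `liwc_features`, and the two toy dictionary entries are hypothetical and are not taken from the project code.

```python
import re
from collections import Counter


class LiwcTrie:
    """Trie over LIWC-style patterns; a trailing '*' marks a prefix pattern."""

    def __init__(self):
        self.root = {}

    def add(self, pattern, category):
        is_prefix = pattern.endswith("*")
        node = self.root
        for ch in pattern.rstrip("*").lower():
            node = node.setdefault(ch, {})
        # Multi-character keys cannot collide with single-character edges.
        key = "$prefix" if is_prefix else "$exact"
        node.setdefault(key, set()).add(category)

    def categories(self, word):
        """Return every category whose pattern matches the word."""
        found = set()
        node = self.root
        for ch in word.lower():
            found |= node.get("$prefix", set())
            if ch not in node:
                return found
            node = node[ch]
        found |= node.get("$prefix", set())
        found |= node.get("$exact", set())
        return found


def liwc_features(text, trie):
    """Relative frequency of each matched category over all tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        counts.update(trie.categories(tok))
    total = len(tokens) or 1
    return {cat: n / total for cat, n in counts.items()}


# Toy dictionary (hypothetical entries, not the real LIWC lexicon).
trie = LiwcTrie()
trie.add("happ*", "posemo")
trie.add("sad", "negemo")
features = liwc_features("Happy people are never sad", trie)
```

The resulting dictionary can be flattened into a fixed-order feature vector before it is handed to a classifier.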
SLIDE 7 Features Identified for Extraversion
- Word Variance (repetitivity)
- Type/Token Ratio
- Formality Measure and Informality Measure
  - F-Measure = (noun freq + adjective freq + preposition freq + article freq - pronoun freq - verb freq - adverb freq - interjection freq + 100)/2
  - I-Measure = (Wrong-typed Words freq. + Interjections freq. + Emoticon
- Positivity of Text and Negativity of Text
- Rich Vocabulary, use of difficult words
- Concrete and Frequent Words
- Use of more social words
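As a rough illustration of two of these features, the sketch below computes the type/token ratio and the F-Measure exactly as written on the slide. It assumes the part-of-speech frequencies are already available as percentages of all tokens; the function names and the dictionary keys are hypothetical. The I-Measure is left out because its formula is cut off on the slide.

```python
def type_token_ratio(tokens):
    """Number of distinct (case-folded) tokens divided by total tokens."""
    if not tokens:
        return 0.0
    return len({t.lower() for t in tokens}) / len(tokens)


def formality_measure(freq):
    """F-Measure from the slide; freq maps a POS class to its percentage."""
    plus = ("noun", "adjective", "preposition", "article")
    minus = ("pronoun", "verb", "adverb", "interjection")
    score = sum(freq.get(k, 0.0) for k in plus)
    score -= sum(freq.get(k, 0.0) for k in minus)
    return (score + 100) / 2


ttr = type_token_ratio(["a", "b", "A", "c"])
f_score = formality_measure({"noun": 30, "verb": 10})
```

With no part-of-speech information at all, the F-Measure sits at its midpoint of 50; more nouns, adjectives, prepositions, and articles push it up.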
SLIDE 8
Features Identified for Openness
- Preference for longer words
- Words expressing tentativeness
- Avoidance of 1st person singular pronouns
- Present tense forms
- The avoidance of past tense indicates
SLIDE 9
Features Identified for Conscientiousness
- Avoid negations
- Avoid words reflecting discrepancies (e.g., should and would)
- 2nd person pronouns
- Filler words (in males and not in females): more useful in speech analysis
SLIDE 10
Features Identified for Agreeableness
- More positive emotions, few negative emotions
- Few articles
- Negative and positive emotion words
- Leisure activity
SLIDE 11
Features Identified for Neuroticism
- 1st person singular pronouns
- Noun
- Negative
- Multiple punctuations
- Fewer references to occupation
SLIDE 12
Extraction of Features
- Python scripts using nltk extract the features listed on the previous five slides
- Speech-based features were not extracted
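A few of the surface features can be sketched without any external library. The example below is only an approximation: it uses plain regular expressions where the project used nltk, and the function name, the pronoun list, and the feature names are all assumptions made for illustration.

```python
import re

# Assumed list of 1st person singular pronouns (contractions are not split).
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}


def surface_features(text):
    """Average word length, 1st person singular rate, repeated punctuation."""
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    n = len(tokens) or 1
    lowered = [t.lower() for t in tokens]
    return {
        "avg_word_length": sum(len(t) for t in tokens) / n,
        "first_person_singular": sum(
            t in FIRST_PERSON_SINGULAR for t in lowered
        ) / n,
        # Runs of two or more '!', '?' or '.' count as one occurrence each.
        "multiple_punctuation": len(re.findall(r"[!?.]{2,}", text)),
    }


feats = surface_features("I think my plan works!!! Really??")
```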
SLIDE 13
NLP Techniques based features: Discourse Parsing
- Ran discourse parsing on all the essay data
- Created RST-style discourse trees
- Extracted the main nucleus text from the data
- Extracted the relation counts from the RST trees
- Normalized the relation counts
- Constructed the feature vector to include the discourse relation counts
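The last three steps might be sketched as follows, assuming the parser output has already been reduced to a flat list of relation labels per essay. The slide does not say how the counts were normalized; dividing by the total number of relations is an assumption here, and the four-label vocabulary is a hypothetical stand-in for the parser's real label set.

```python
from collections import Counter

# Hypothetical relation vocabulary; a real RST parser has its own label set.
RELATIONS = ["Elaboration", "Contrast", "Cause", "Attribution"]


def relation_features(relations, vocabulary=RELATIONS):
    """Fixed-order vector of relation counts, normalized to sum to 1."""
    counts = Counter(relations)
    total = sum(counts[r] for r in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[r] / total for r in vocabulary]


vector = relation_features(
    ["Elaboration", "Elaboration", "Contrast", "Cause"]
)
```

This vector can then be concatenated with the LIWC and linguistic features of the same essay.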
SLIDE 14
Expansion of LIWC Word Set
Seeded LDA and Word2Vec Methods
SLIDE 15
Expansion of LIWC: Seeded LDA
Seeded LDA treats each document as a mixture of topics and each topic as a probability distribution over words. We can assign an asymmetric prior probability to a word-topic pair to seed the topic with the given word. We used the gensim package and its eta parameter to implement seeded LDA; however, it did not give better results due to overfitting.
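One way to build such an asymmetric prior is sketched below in plain Python. It assumes one topic per LIWC category and hypothetical `base` and `boost` values; the helper name `build_seed_eta` is also an assumption. gensim's `LdaModel` accepts a topics-by-vocabulary matrix for `eta`, so the result could be converted to a NumPy array and passed in, but the exact values used in the project are not known.

```python
def build_seed_eta(vocab, seed_words_by_topic, num_topics,
                   base=0.01, boost=1.0):
    """Topic-word prior: `boost` for seeded pairs, `base` everywhere else."""
    index = {word: i for i, word in enumerate(vocab)}
    eta = [[base] * len(vocab) for _ in range(num_topics)]
    for topic, words in seed_words_by_topic.items():
        for word in words:
            if word in index:  # seeds missing from the vocabulary are skipped
                eta[topic][index[word]] = boost
    return eta


vocab = ["happy", "sad", "table"]
eta = build_seed_eta(vocab, {0: ["happy"], 1: ["sad", "missing"]}, 2)
# Rows are topics, columns follow the order of `vocab`.
```

A large gap between `boost` and `base` ties a topic tightly to its seed words, which may be one route to the overfitting mentioned above.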
SLIDE 16
Expansion of LIWC: Word2Vec
- Applied Word2Vec modelling on the Quora dump
- Found the most similar words for each word present under a tag
- Compared the similarity with the 1 Billion Wiki text
- Added the most similar words thus found to the new LIWC dictionary
- Trained the models on the new LIWC dictionary
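The nearest-neighbour step can be illustrated with cosine similarity over word vectors. The sketch below uses tiny hand-made vectors in place of a trained Word2Vec model, and the `top_n` and `threshold` values are hypothetical; the comparison against the Wiki text is not modelled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_category(seed_words, vectors, top_n=2, threshold=0.8):
    """Words closest to any seed word, kept only above the threshold."""
    candidates = {}
    for seed in seed_words:
        if seed not in vectors:
            continue
        scored = sorted(
            ((cosine(vectors[seed], vec), word)
             for word, vec in vectors.items() if word not in seed_words),
            reverse=True,
        )
        for score, word in scored[:top_n]:
            if score >= threshold:
                candidates[word] = max(score, candidates.get(word, 0.0))
    return sorted(candidates)


# Toy 2-d vectors (hypothetical), standing in for trained embeddings.
vectors = {
    "happy": [1.0, 0.0],
    "glad": [0.9, 0.1],
    "joyful": [0.8, 0.3],
    "table": [0.0, 1.0],
}
new_words = expand_category({"happy"}, vectors)
```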
SLIDE 17 Expansion of Posemo, Negemo, Funct-words
- Added more positive words and negative words [1]
- Added more function words [2]
- 1. Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews."
- 2. Leah Gilner and Franc Morales at [Sequence Publishing](http://www.sequencepublishing.com) for listing English function words
SLIDE 18
User Survey
SLIDE 19
Survey Method
- Used a 10-question questionnaire, the BFI-10
- Contacted Quora users with more than 30 answers
- 50 users filled in the survey
- Calculated a personality score between 1 and 10 for each of the 5 personality traits
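One plausible way to score the questionnaire is sketched below. It assumes the standard BFI-10 keying (two items per trait, one of them reverse-keyed) with responses on a 1 to 5 scale, and it sums the two items, which yields scores from 2 to 10. Whether the survey used exactly this scoring is an assumption; the names `BFI10_KEY` and `score_bfi10` are hypothetical.

```python
# Trait -> [(item number, reverse-keyed?)], following the usual BFI-10 keying.
BFI10_KEY = {
    "Extraversion": [(1, True), (6, False)],
    "Agreeableness": [(2, False), (7, True)],
    "Conscientiousness": [(3, True), (8, False)],
    "Neuroticism": [(4, True), (9, False)],
    "Openness": [(5, True), (10, False)],
}


def score_bfi10(responses):
    """responses maps item number (1-10) to a 1-5 Likert answer."""
    scores = {}
    for trait, items in BFI10_KEY.items():
        total = 0
        for item, reverse in items:
            answer = responses[item]
            # Reverse-keyed items are flipped on the 1-5 scale.
            total += 6 - answer if reverse else answer
        scores[trait] = total
    return scores


neutral = score_bfi10({item: 3 for item in range(1, 11)})
outgoing = score_bfi10({**{item: 3 for item in range(1, 11)}, 1: 1, 6: 5})
```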
SLIDE 20
Extraction of Data
- Wrote a Python script to crawl all the answers of these users
- Sanitized the answers
- Pruned all answers with fewer than 200 words
- Labelled the dataset thus obtained with the survey results
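The pruning and labelling steps could be sketched as below, assuming the crawled answers are held as a mapping from user to a list of answer texts and the survey results as a mapping from user to trait scores. Both data layouts and the function name `build_dataset` are hypothetical, and words are counted by simple whitespace splitting.

```python
def build_dataset(answers_by_user, scores_by_user, min_words=200):
    """Pair every sufficiently long answer with its author's survey scores."""
    dataset = []
    for user, answers in answers_by_user.items():
        if user not in scores_by_user:
            continue  # no survey response, so no label
        for text in answers:
            if len(text.split()) >= min_words:
                dataset.append((text, scores_by_user[user]))
    return dataset


answers = {
    "u1": ["word " * 200, "too short"],
    "u2": ["word " * 300],  # crawled but not surveyed
}
scores = {"u1": {"Openness": 7}}
dataset = build_dataset(answers, scores)
```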
SLIDE 21
Results
SLIDE 22 Only LIWC Features on Labelled Essays

| Trait | SMO | Logistic | AdaBoost | SVM | Random Forest |
|---|---|---|---|---|---|
| Openness | 60.5348% | 59.9271% | 59.1167% | 51.5397% | 55.1053% |
| Conscientiousness | 55.4295% | 55.3485% | 55.3485% | 50.8104% | 53.4441% |
| Extraversion | 54.5786% | 54.7812% | 54.8622% | 51.7423% | 53.201% |
| Agreeableness | 55.1459% | 53.7682% | 56.0778% | 53.0794% | 54.4165% |
| Neuroticism | 55.9968% | 56.1183% | 54.3355% | 50.0405% | 52.5932% |
SLIDE 23 LIWC Features + New Extracted Features on Labelled Essays

| Trait | SMO | Logistic | AdaBoost | SVM | Random Forest |
|---|---|---|---|---|---|
| Openness | 60.5348% | 60.3728% | 58.59% | 51.9854% | 57.7391% |
| Conscientiousness | 56.3614% | 55.0243% | 55.2269% | 51.2156% | 53.282% |
| Extraversion | 55.1864% | 55.5105% | 55.5105% | 51.2561% | 52.5122% |
| Agreeableness | 54.9028% | 53.6872% | 56.7261% | 53.0389% | 52.107% |
| Neuroticism | 56.9692% | 57.7391% | 54.0924% | 50.6888% | 51.7828% |
SLIDE 24 Expanded LIWC + New Extracted Features on Labelled Essays

| Trait | SMO | Logistic | AdaBoost | SVM | Random Forest |
|---|---|---|---|---|---|
| Openness | 61.1831% | 61.7504% | 59.8865% | 53.0794% | 56.3209% |
| Conscientiousness | 55.5105% | 54.6191% | 53.8088% | 51.5802% | 51.6613% |
| Extraversion | 54.2139% | 54.3355% | 55.8752% | 52.0259% | 50.6078% |
| Agreeableness | 55.3485% | 54.2139% | 54.2139% | 51.6613% | 51.7423% |
| Neuroticism | 57.577% | 56.6856% | 54.8622% | 51.2561% | 51.9449% |
SLIDE 25 Expanded LIWC + New Extracted Features + Discourse Relations on Labelled Essays

| Trait | SMO | Logistic | AdaBoost | SVM | Random Forest |
|---|---|---|---|---|---|
| Openness | 61.4336% | 60.2726% | 58.9601% | 52.3473% | 57.2943% |
| Conscientiousness | 56.4866% | 55.679% | 53.054% | 51.5901% | 51.2367% |
| Extraversion | 54.1141% | 53.4578% | 55.5275% | 52.5492% | 53.054% |
| Agreeableness | 56.7895% | 56.9914% | 54.5684% | 53.8617% | 54.8208% |
| Neuroticism | 56.84% | 57.0419% | 53.8617% | 53.5083% | 53.3569% |
SLIDE 26 Only LIWC Features on Labelled Quora Dataset

| Trait | SMO | Logistic | AdaBoost | SVM | Random Forest |
|---|---|---|---|---|---|
| Openness | 74.8971% | 74.8971% | 70.3704% | 70.535% | 71.6049% |
| Conscientiousness | 68.9712% | 66.9136% | 68.9712% | 68.9712% | 69.7942% |
| Extraversion | 76.2963% | 76.7078% | 76.2963% | 76.2963% | 78.93% |
| Agreeableness | 67.8189% | 66.5844% | 63.4568% | 63.4568% | 66.1728% |
| Neuroticism | 72.9218% | 71.8519% | 72.9218% | 72.9218% | 71.7695% |
SLIDE 27 Expanded LIWC + Features on Quora Dataset

| Trait | SMO | Logistic | AdaBoost | AdaBoost (random forest) | SVM | Random Forest |
|---|---|---|---|---|---|---|
| Openness | 75.3909% | 73.9095% | 72.7572% | 77.284% | 71.0288% | 74.7325% |
| Conscientiousness | 70.1235% | 67.572% | 68.9712% | 73.6626% | 68.5597% | 71.2757% |
| Extraversion | 76.3786% | 77.284% | 76.2963% | 80.4115% | 77.9424% | 79.7531% |
| Agreeableness | 66.9959% | 67.1605% | 63.4568% | 69.3827% | 64.1975% | 66.0905% |
| Neuroticism | 73.0041% | 70.0412% | 72.9218% | 75.2263% | 71.9342% | 72.3457% |
SLIDE 28
Future Work
- Expand LIWC using more unlabelled Quora data
- Gather richer labelled Quora data by conducting paid personality surveys
- Evaluate on more labelled Quora data
- Leverage discourse output to generate better discourse features
- Add more linguistic features by identifying patterns in Quora answers
SLIDE 29
Thank You