DON’T REMOVE MY STOP WORDS: IDENTIFYING PERSONALITY TRAITS FROM QUORA ANSWERS



SLIDE 1

DON’T REMOVE MY STOP WORDS: IDENTIFYING PERSONALITY TRAITS FROM QUORA ANSWERS

ASHUTOSH BAHETI, 12CS10012
RAHUL GURNANI, 12CS10039
DHRUV JAIN, 12CS30043
NISHKARSH SHASTRI, 12CS10034
SABYASACHEE BARAUH, 12CS30029

SLIDE 2

OBJECTIVE

  • Identifying the personality of Quora users with respect to the Big Five personality traits, using linguistic-feature-based analysis of their answers:

  • Openness to Experience
  • Conscientiousness
  • Extraversion
  • Agreeableness
  • Neuroticism

SLIDE 3

RELATED WORK

  • Tausczik, Yla R., and James W. Pennebaker. "The psychological meaning of words: LIWC and computerized text analysis methods."
  • Mairesse, François, et al. "Using linguistic cues for the automatic recognition of personality in conversation and text."
  • Celli, Fabio, Fabio Pianesi, David Stillwell, and Michal Kosinski. Workshop on Computational Personality Recognition.

SLIDE 4

Project Timeline

SLIDE 5

  • Classifying essay data using LIWC categories as features
  • Identifying the linguistic features for the Big Five personality traits
  • Extracting textual features from the essays
  • Classifying based on the new features and LIWC
  • Surveying Quora users to get a labelled dataset
  • Crawling the answers of the surveyed users
  • Using the Quora dump to expand LIWC
  • Training the model on the labelled Quora dataset
  • Calculating the accuracy of the trained model

SLIDE 6

Classification of Essay Data

Straightforward ML approach

  • Labelled the essays with binary values for each personality trait
  • Sanitized the data present in the essays
  • Created a trie structure for LIWC prefix matching
  • Extracted features based on the LIWC word count for each category
  • Applied SVM to the data using WEKA

Accuracy of model found to be 53%
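The trie-based LIWC matching step could look roughly like the sketch below. It assumes LIWC-style entries in which a trailing `*` marks a prefix wildcard (for example `happi*`); the class name, the `$prefix`/`$exact` markers, and the toy categories are illustrative and not taken from the slides.

```python
from collections import Counter


class LiwcTrie:
    """Character trie mapping LIWC-style entries to category names."""

    def __init__(self):
        self.root = {}

    def add(self, entry, categories):
        # A trailing '*' means "match any word starting with this stem".
        is_prefix = entry.endswith("*")
        node = self.root
        for ch in entry.rstrip("*"):
            node = node.setdefault(ch, {})
        key = "$prefix" if is_prefix else "$exact"
        node.setdefault(key, set()).update(categories)

    def lookup(self, word):
        """Return every category whose entry matches `word`."""
        found = set()
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return found
            # Wildcard entries match as soon as their stem is consumed.
            found |= node.get("$prefix", set())
        # Exact entries match only when the whole word is consumed.
        found |= node.get("$exact", set())
        return found


def category_counts(tokens, trie):
    """Count how many tokens fall into each category."""
    counts = Counter()
    for tok in tokens:
        for cat in trie.lookup(tok.lower()):
            counts[cat] += 1
    return counts
```

Each lookup costs time proportional to the word length, regardless of how many dictionary entries exist, which is presumably why a trie was chosen for prefix matching.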

SLIDE 7

Features Identified for Extraversion

  • Word variance (repetitiveness)
  • Type/token ratio
  • Formality measure and informality measure
  • F-Measure = (noun freq. + adjective freq. + preposition freq. + article freq. - pronoun freq. - verb freq. - adverb freq. - interjection freq. + 100) / 2
  • I-Measure = (wrong-typed words freq. + interjections freq. + emoticon freq.) * 100
  • Positivity and negativity of the text
  • Rich vocabulary, use of difficult words
  • Concrete and frequent words
  • Use of more social words
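As a rough illustration, the type/token ratio and the two measures can be computed from token and part-of-speech counts. This sketch assumes the F-Measure frequencies are percentages of all words and the I-Measure frequencies are fractions of all words; the slide does not state the units, and the tag names used as dictionary keys are hypothetical.

```python
FORMAL_TAGS = ("noun", "adjective", "preposition", "article")
DEICTIC_TAGS = ("pronoun", "verb", "adverb", "interjection")


def type_token_ratio(tokens):
    """Distinct words divided by total words (case-insensitive)."""
    lowered = [t.lower() for t in tokens]
    return len(set(lowered)) / len(lowered) if lowered else 0.0


def f_measure(pos_counts, total_words):
    """Formality score; frequencies are assumed to be percentages."""
    def pct(tag):
        return 100.0 * pos_counts.get(tag, 0) / total_words

    formal = sum(pct(t) for t in FORMAL_TAGS)
    deictic = sum(pct(t) for t in DEICTIC_TAGS)
    return (formal - deictic + 100) / 2


def i_measure(misspelled, interjections, emoticons, total_words):
    """Informality score; frequencies are assumed to be count / total words."""
    return 100.0 * (misspelled + interjections + emoticons) / total_words
```

With these conventions the F-Measure ranges from 0 (only pronouns, verbs, adverbs, and interjections) to 100 (only nouns, adjectives, prepositions, and articles).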

SLIDE 8

Features Identified for Openness

  • Preference for longer words
  • Words expressing tentativeness
  • Avoidance of 1st person singular pronouns
  • Present tense forms
  • Avoidance of past tense

SLIDE 9

Features Identified for Conscientiousness

  • Avoidance of negations
  • Avoidance of words reflecting discrepancies (e.g., should and would)
  • 2nd person pronouns
  • Filler words (in males and not in females); more useful in speech analysis

SLIDE 10

Features Identified for Agreeableness

  • More positive emotions, fewer negative emotions
  • Few articles
  • Negative and positive emotion words
  • Leisure activity

SLIDE 11

Features Identified for Neuroticism

  • 1st person singular pronouns
  • Noun
  • Negative
  • Multiple punctuation marks
  • Fewer references to occupation

SLIDE 12

Extraction of Features

  • Python scripts using nltk to extract the features mentioned in the previous five slides
  • Speech-based features were not extracted
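A few of the listed features can be sketched with the standard library alone. This is a simplified stand-in for the nltk-based scripts: the word lists and feature names below are small illustrative assumptions, not the authors' actual lexicons.

```python
import re

# Illustrative word lists; real lexicons would be much larger.
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}
SECOND_PERSON = {"you", "your", "yours", "yourself"}
NEGATIONS = {"no", "not", "never", "none", "nothing"}
DISCREPANCY = {"should", "would", "could", "ought"}


def extract_features(text):
    """Return a dict of simple per-text linguistic features."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens) or 1  # avoid division by zero on empty text

    def rate(vocab):
        return sum(t in vocab for t in tokens) / n

    return {
        "avg_word_len": sum(len(t) for t in tokens) / n,
        "first_person_sg": rate(FIRST_PERSON_SINGULAR),
        "second_person": rate(SECOND_PERSON),
        "negations": rate(NEGATIONS),
        "discrepancy": rate(DISCREPANCY),
        # Runs of two or more '!', '?' or '.' count as multiple punctuation.
        "multi_punct": len(re.findall(r"[!?.]{2,}", text)),
    }
```

Rates are normalized by token count so that long and short texts are comparable.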

SLIDE 13

NLP Techniques based features: Discourse Parsing

  • Ran discourse parsing on all the essay data
  • Created RST-style discourse trees
  • Extracted the main nucleus text from the data
  • Extracted the relation counts from the RST trees
  • Normalized the relation counts
  • Constructed the feature vector to include the discourse relation counts
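The last three steps (counting, normalizing, building the vector) might look like the following sketch. It assumes the parser's output has already been reduced to a flat list of relation labels per essay, and that normalization means dividing by the total number of relations; the slides specify neither. The relation vocabulary is a small illustrative subset.

```python
from collections import Counter

# Illustrative subset of RST relation labels; a real parser emits more.
RELATIONS = ["Elaboration", "Contrast", "Cause", "Condition", "Attribution"]


def relation_features(relations, vocabulary=RELATIONS):
    """Turn a list of relation labels into a fixed-order normalized vector."""
    counts = Counter(relations)
    total = sum(counts[r] for r in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[r] / total for r in vocabulary]
```

A fixed vocabulary order keeps the vector positions aligned across essays, so the result can be appended directly to the other features.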

SLIDE 14

Expansion of LIWC Word Set

Seeded LDA and Word2Vec Methods

SLIDE 15

Expansion of LIWC: Seeded LDA

  • Seeded LDA treats each document as a mixture of topics, and each topic as a probability distribution over words.
  • We can give an asymmetric prior probability to a word-topic pair to seed the topic with the given word.
  • We used the gensim package and its eta parameter to implement seeded LDA; however, it did not give better results, due to overfitting.
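One way to build such an asymmetric prior is sketched below in plain Python. The function name, the `base` and `boost` values, and the topic-to-seed mapping are assumptions for illustration; the slides do not give the actual values used.

```python
def build_seeded_eta(vocab, seed_words, num_topics, base=0.01, boost=1.0):
    """Build a num_topics x len(vocab) topic-word prior as nested lists.

    vocab:      list of words, in the same order as the model's token ids
    seed_words: dict mapping topic index -> iterable of seed words
    base:       prior weight for ordinary word-topic pairs (assumed value)
    boost:      prior weight for seeded word-topic pairs (assumed value)
    """
    index = {word: i for i, word in enumerate(vocab)}
    eta = [[base] * len(vocab) for _ in range(num_topics)]
    for topic, words in seed_words.items():
        for word in words:
            if word in index:  # seeds missing from the vocabulary are skipped
                eta[topic][index[word]] = boost
    return eta


# Hypothetical use with gensim (not executed here): convert the matrix to a
# numpy array and pass it as the `eta` argument of gensim.models.LdaModel,
# with columns ordered by the gensim Dictionary's token ids.
```

A large `boost` relative to `base` pulls each seeded topic strongly toward its seed words, which may be one source of the overfitting noted above.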

SLIDE 16

Expansion of LIWC: Word2Vec

  • Applied Word2Vec modelling to the Quora dump
  • Found the most similar words for each word present under each tag
  • Compared the similarity with the 1 Billion Wiki text
  • Added the most similar words thus found to the new LIWC dictionary
  • Trained the models on the new LIWC dictionary
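The nearest-neighbour expansion step can be illustrated with cosine similarity over a small dictionary of vectors. In practice the vectors would come from the trained Word2Vec model; here they are toy values, and `top_n` and `min_sim` are assumed parameters that the slides do not specify. The comparison against the Wiki text is not modelled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_category(seed_words, vectors, top_n=2, min_sim=0.8):
    """Add each seed's most similar words to the category word set."""
    expanded = set(seed_words)
    for seed in seed_words:
        if seed not in vectors:
            continue
        scored = [
            (cosine(vectors[seed], vec), word)
            for word, vec in vectors.items()
            if word not in seed_words
        ]
        scored.sort(reverse=True)
        # Keep only the closest neighbours that clear the threshold.
        expanded.update(word for sim, word in scored[:top_n] if sim >= min_sim)
    return expanded
```

The similarity threshold guards against adding loosely related words, which would otherwise dilute a category.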

SLIDE 17

Expansion of Posemo, Negemo, Funct-words

  • Added more positive words and negative words [1]
  • Added more function words [2]

  • 1. Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews."
  • 2. Leah Gilner and Franc Morales at Sequence Publishing (http://www.sequencepublishing.com) for listing English function words

SLIDE 18

User Survey

SLIDE 19

Survey Method

  • Used a 10-question questionnaire, the BFI-10
  • Contacted Quora users with more than 30 answers
  • 50 users filled in the survey
  • Calculated a personality score between 1 and 10 for each of the 5 personality traits
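A possible scoring routine for the questionnaire is sketched below. It assumes the commonly published BFI-10 key (two 5-point items per trait, one of them reverse-scored) and sums the two items, which yields scores from 2 to 10; the slides report a 1-10 range, so the authors' exact scaling may differ. The item numbering here is an assumption and should be checked against the questionnaire actually used.

```python
# Assumed BFI-10 key: trait -> (directly scored item, reverse-scored item).
BFI10_KEY = {
    "extraversion": (6, 1),
    "agreeableness": (2, 7),
    "conscientiousness": (8, 3),
    "neuroticism": (9, 4),
    "openness": (10, 5),
}


def score_bfi10(responses):
    """Score one respondent.

    responses: dict mapping item number (1-10) to a Likert answer (1-5).
    Returns a dict mapping trait name to a score between 2 and 10.
    """
    scores = {}
    for trait, (direct, reverse) in BFI10_KEY.items():
        # Reverse-scored items map 1 -> 5, 2 -> 4, ..., 5 -> 1.
        scores[trait] = responses[direct] + (6 - responses[reverse])
    return scores
```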

SLIDE 20

Extraction of Data

  • Wrote a Python script to crawl all the answers of these users
  • Sanitized the answers
  • Pruned all answers with fewer than 200 words
  • Labelled the resulting dataset with the survey results
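The pruning and labelling steps might be implemented along these lines. The data layout (a dict of answers per user and a dict of survey scores per user) and the function name are assumptions made for this sketch; crawling and sanitizing are not shown.

```python
def build_dataset(answers_by_user, survey_scores, min_words=200):
    """Drop short answers and attach each user's survey scores as labels.

    answers_by_user: dict mapping user id -> list of answer texts
    survey_scores:   dict mapping user id -> dict of trait scores
    """
    dataset = []
    for user, answers in answers_by_user.items():
        if user not in survey_scores:
            continue  # only surveyed users have labels
        for text in answers:
            # Whitespace tokenization is a simplification of word counting.
            if len(text.split()) >= min_words:
                dataset.append(
                    {"user": user, "text": text, "labels": survey_scores[user]}
                )
    return dataset
```

Every answer from the same user receives the same labels, so train/test splits would ideally be made per user to avoid leakage between the two sets.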

SLIDE 21

Results

SLIDE 22

Only LIWC Features on labelled Essays

| Trait | SMO | Logistic | AdaBoost | SVM | Random Forest |
|---|---|---|---|---|---|
| Openness | 60.5348% | 59.9271% | 59.1167% | 51.5397% | 55.1053% |
| Conscientiousness | 55.4295% | 55.3485% | 55.3485% | 50.8104% | 53.4441% |
| Extraversion | 54.5786% | 54.7812% | 54.8622% | 51.7423% | 53.201% |
| Agreeableness | 55.1459% | 53.7682% | 56.0778% | 53.0794% | 54.4165% |
| Neuroticism | 55.9968% | 56.1183% | 54.3355% | 50.0405% | 52.5932% |

SLIDE 23

LIWC Features + New Extracted Features on labelled Essays

| Trait | SMO | Logistic | AdaBoost | SVM | Random Forest |
|---|---|---|---|---|---|
| Openness | 60.5348% | 60.3728% | 58.59% | 51.9854% | 57.7391% |
| Conscientiousness | 56.3614% | 55.0243% | 55.2269% | 51.2156% | 53.282% |
| Extraversion | 55.1864% | 55.5105% | 55.5105% | 51.2561% | 52.5122% |
| Agreeableness | 54.9028% | 53.6872% | 56.7261% | 53.0389% | 52.107% |
| Neuroticism | 56.9692% | 57.7391% | 54.0924% | 50.6888% | 51.7828% |

SLIDE 24

Expanded LIWC + New Extracted Features on labelled Essays

| Trait | SMO | Logistic | AdaBoost | SVM | Random Forest |
|---|---|---|---|---|---|
| Openness | 61.1831% | 61.7504% | 59.8865% | 53.0794% | 56.3209% |
| Conscientiousness | 55.5105% | 54.6191% | 53.8088% | 51.5802% | 51.6613% |
| Extraversion | 54.2139% | 54.3355% | 55.8752% | 52.0259% | 50.6078% |
| Agreeableness | 55.3485% | 54.2139% | 54.2139% | 51.6613% | 51.7423% |
| Neuroticism | 57.577% | 56.6856% | 54.8622% | 51.2561% | 51.9449% |

SLIDE 25

Expanded LIWC + New Extracted Features + Discourse Relations on labelled Essays

| Trait | SMO | Logistic | AdaBoost | SVM | Random Forest |
|---|---|---|---|---|---|
| Openness | 61.4336% | 60.2726% | 58.9601% | 52.3473% | 57.2943% |
| Conscientiousness | 56.4866% | 55.679% | 53.054% | 51.5901% | 51.2367% |
| Extraversion | 54.1141% | 53.4578% | 55.5275% | 52.5492% | 53.054% |
| Agreeableness | 56.7895% | 56.9914% | 54.5684% | 53.8617% | 54.8208% |
| Neuroticism | 56.84% | 57.0419% | 53.8617% | 53.5083% | 53.3569% |

SLIDE 26

Only LIWC Features on Labelled Quora Dataset

| Trait | SMO | Logistic | AdaBoost | SVM | Random Forest |
|---|---|---|---|---|---|
| Openness | 74.8971% | 74.8971% | 70.3704% | 70.535% | 71.6049% |
| Conscientiousness | 68.9712% | 66.9136% | 68.9712% | 68.9712% | 69.7942% |
| Extraversion | 76.2963% | 76.7078% | 76.2963% | 76.2963% | 78.93% |
| Agreeableness | 67.8189% | 66.5844% | 63.4568% | 63.4568% | 66.1728% |
| Neuroticism | 72.9218% | 71.8519% | 72.9218% | 72.9218% | 71.7695% |

SLIDE 27

Expanded LIWC + Features on Quora Dataset

| Trait | SMO | Logistic | AdaBoost | AdaBoost (random forest) | SVM | Random Forest |
|---|---|---|---|---|---|---|
| Openness | 75.3909% | 73.9095% | 72.7572% | 77.284% | 71.0288% | 74.7325% |
| Conscientiousness | 70.1235% | 67.572% | 68.9712% | 73.6626% | 68.5597% | 71.2757% |
| Extraversion | 76.3786% | 77.284% | 76.2963% | 80.4115% | 77.9424% | 79.7531% |
| Agreeableness | 66.9959% | 67.1605% | 63.4568% | 69.3827% | 64.1975% | 66.0905% |
| Neuroticism | 73.0041% | 70.0412% | 72.9218% | 75.2263% | 71.9342% | 72.3457% |

SLIDE 28

Future Work

  • Expand LIWC by taking more unlabelled Quora data
  • Gather richer labelled Quora data by conducting paid personality surveys
  • Evaluate on more labelled Quora data
  • Leverage the discourse parser output to generate better discourse features
  • Add more linguistic features by identifying patterns in Quora answers

SLIDE 29

Thank You