1. RECSM Summer School: Social Media and Big Data Research
Pablo Barberá, London School of Economics
www.pablobarbera.com
Course website: pablobarbera.com/social-media-upf

  2. Supervised Machine Learning Applied to Social Media Text

3. Supervised machine learning
Goal: classify documents into pre-existing categories, e.g. authors of documents, sentiment of tweets, ideological position of parties based on manifestos, tone of movie reviews...
What we need:
- A hand-coded (labeled) dataset, to be split into:
  - Training set: used to train the classifier
  - Validation/test set: used to validate the classifier
- A method to extrapolate from the hand coding to unlabeled documents (the classifier): Naive Bayes, regularized regression, SVM, K-nearest neighbors, BART, ensemble methods...
- An approach to validate the classifier: cross-validation
- A performance metric to choose the best classifier and avoid overfitting: confusion matrix, accuracy, precision, recall...
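A minimal end-to-end sketch of this pipeline in Python (not from the original slides), assuming scikit-learn is available; the toy documents and labels are invented for illustration:

```python
# Sketch: train/test split, a Naive Bayes classifier, and validation
# via confusion matrix, precision/recall, and cross-validation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report

docs = ["great movie, loved it", "terrible plot, awful acting",
        "wonderful performance", "boring and predictable",
        "an instant classic", "a complete waste of time"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]  # hand-coded labels

# Bag-of-words representation: documents as rows of a document-term matrix
X = CountVectorizer().fit_transform(docs)

# Split the labeled set into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=42)

clf = MultinomialNB()      # the classifier (Naive Bayes here)
clf.fit(X_train, y_train)  # train on the training set

# Validate on the held-out test set and via 3-fold cross-validation
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision, recall, F1
print(cross_val_score(clf, X, labels, cv=3))  # per-fold accuracy
```

Swapping MultinomialNB for a regularized logistic regression or an SVM changes only the classifier line; the split-train-validate structure stays the same.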

4. Supervised v. unsupervised methods compared
- The goal (in text analysis) is to differentiate documents from one another, treating them as “bags of words”
- Different approaches:
  - Supervised methods require a training set that exemplifies the contrasting classes, identified by the researcher
  - Unsupervised methods scale documents based on patterns of similarity in the term-document matrix, without requiring a training step
- Relative advantage of supervised methods: you already know the dimension being scaled, because you set it in the training stage
- Relative disadvantage of supervised methods: you must already know the dimension being scaled, because you have to feed the classifier good sample documents in the training stage
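One way to see the contrast in code (a sketch with invented toy data, not from the slides): the supervised model is told which dimension matters via the labels, while k-means must find a grouping on its own that the researcher then has to interpret.

```python
# Supervised vs. unsupervised on the same document-term matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cluster import KMeans

docs = ["tax cuts and free markets", "lower taxes, less regulation",
        "public healthcare for all", "invest in social programs"]
X = CountVectorizer().fit_transform(docs)

# Supervised: the researcher sets the dimension through the labels
labels = ["right", "right", "left", "left"]
clf = MultinomialNB().fit(X, labels)
print(clf.predict(X))  # predictions are on the researcher's scale

# Unsupervised: k-means finds two groups from word co-occurrence alone;
# interpreting what the discovered grouping means is left to the researcher
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```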

5. Supervised learning v. dictionary methods
- Dictionary methods:
  - Advantage: not corpus-specific, so the cost of applying them to a new corpus is trivial
  - Disadvantage: not corpus-specific, so performance on a new corpus is unknown (domain shift)
- Supervised learning can be conceptualized as a generalization of dictionary methods, where the features associated with each category (and their relative weights) are learned from the data
- By construction, supervised classifiers will outperform dictionary methods in classification tasks, as long as the training sample is large enough
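A toy sketch of this point (word lists and documents invented for illustration): the fixed dictionary cannot score domain-specific vocabulary, while a supervised model learns corpus-specific feature weights from the labeled examples.

```python
# Dictionary scoring vs. learned weights
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

positive = {"good", "great", "excellent"}
negative = {"bad", "awful", "terrible"}

def dictionary_score(doc):
    # Positive minus negative word counts; the same weights for any corpus
    words = doc.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

docs = ["great phone, excellent battery", "awful screen, bad battery",
        "sick beat, this track slaps", "flat, lifeless production"]
labels = ["pos", "neg", "pos", "neg"]
print([dictionary_score(d) for d in docs])  # misses domain slang like "slaps"

# Supervised learning recovers per-word weights from the data itself,
# so domain-specific vocabulary is picked up automatically
vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)
for word, weight in zip(vec.get_feature_names_out(), clf.coef_[0]):
    print(word, round(weight, 2))
```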

6. Dictionaries vs supervised learning
[Figure] Source: González-Bailón and Paltoglou (2015)

7. Creating a labeled set
How do we obtain a labeled set?
- External sources of annotation
  - Self-reported ideology in users’ profiles
  - Gender in social security records
- Expert annotation
  - “Canonical” dataset: Comparative Manifesto Project
  - In most projects, undergraduate students (expertise comes from training)
- Crowd-sourced coding
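When labels come from several coders, as in crowd-sourced coding, they have to be aggregated into one label per document. A minimal sketch of one common rule, majority vote; the coder annotations below are invented for illustration:

```python
# Majority vote across coders to build a labeled set
from collections import Counter

annotations = {  # hypothetical codes from three coders per document
    "doc1": ["pos", "pos", "neg"],
    "doc2": ["neg", "neg", "neg"],
    "doc3": ["pos", "neg", "pos"],
}

labels = {doc: Counter(codes).most_common(1)[0][0]
          for doc, codes in annotations.items()}
print(labels)  # {'doc1': 'pos', 'doc2': 'neg', 'doc3': 'pos'}
```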
