RECSM Summer School: Social Media and Big Data Research
Supervised Machine Learning Applied to Social Media Text
Pablo Barberá, London School of Economics
www.pablobarbera.com
Course website: pablobarbera.com/social-media-upf
Goal: classify documents into pre-existing categories,
e.g. authors of documents, sentiment of tweets, ideological position of parties based on manifestos, tone of movie reviews...

What we need:
◮ Hand-coded (labeled) dataset, to be split into:
  ◮ Training set: used to train the classifier
  ◮ Validation/test set: used to validate the classifier
◮ Method to extrapolate from hand coding to unlabeled documents (classifier): Naive Bayes, regularized regression, SVM, K-nearest neighbors, BART, ensemble methods...
◮ Approach to validate the classifier: cross-validation
◮ Performance metric to choose the best classifier and avoid overfitting: confusion matrix, accuracy, precision, recall... (see the sketch after this list)
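Below is a minimal sketch of this workflow in Python with scikit-learn. The documents, labels, and split proportion are hypothetical placeholders, not materials from the course.

```python
# Minimal supervised text-classification workflow: label, split, train, validate.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

docs = ["great movie", "terrible plot", "loved it", "waste of time"]  # hypothetical
labels = [1, 0, 1, 0]  # hand-coded: 1 = positive, 0 = negative

# Represent documents as bags of words
X = CountVectorizer().fit_transform(docs)

# Split the labeled data into a training set and a validation/test set
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42)

# Train the classifier on the training set...
clf = LogisticRegression().fit(X_train, y_train)

# ...and validate it on the held-out test set
print(accuracy_score(y_test, clf.predict(X_test)))
```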
◮ The goal (in text analysis) is to differentiate documents from one another, treating them as "bags of words"
◮ Different approaches:
  ◮ Supervised methods require a training set that exemplifies the contrasting classes, identified by the researcher
  ◮ Unsupervised methods scale documents based on patterns in the data, without requiring a training step
◮ Relative advantage of supervised methods: you already know the dimension being scaled, because you set it in the training stage
◮ Relative disadvantage of supervised methods: you must already know the dimension being scaled, because you have to feed the classifier good sample documents in the training stage
◮ Dictionary methods:
  ◮ Advantage: not corpus-specific, so the cost of applying them to a new corpus is trivial
  ◮ Disadvantage: not corpus-specific, so performance on a new corpus is unknown (domain shift)
◮ Supervised learning can be conceptualized as a generalization of dictionary methods, where the features associated with each category (and their relative weights) are learned from the data
◮ By construction, supervised methods will outperform dictionary methods in classification tasks, as long as the training sample is large enough
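The connection can be made concrete with a toy sketch: a dictionary method is a linear classifier whose word weights are fixed by the researcher, while supervised learning estimates those weights from labeled data. The word lists below are hypothetical.

```python
# Dictionary method as a fixed-weight linear classifier (toy example).
positive_words = {"good", "great", "excellent"}  # researcher-defined dictionary
negative_words = {"bad", "awful", "terrible"}

def dictionary_score(doc):
    """Fixed weights: +1 per positive token, -1 per negative token."""
    tokens = doc.lower().split()
    return (sum(t in positive_words for t in tokens)
            - sum(t in negative_words for t in tokens))

print(dictionary_score("a great film with an excellent cast"))  # +2 -> positive
print(dictionary_score("bad script and awful acting"))          # -2 -> negative
```

A supervised classifier replaces the hand-picked ±1 weights with coefficients learned from the training sample, which is why it adapts to the corpus at hand.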
[Figure omitted. Source: González-Bailón]
How do we obtain a labeled set?
◮ External sources of annotation
  ◮ Self-reported ideology in users' profiles
  ◮ Gender in social security records
◮ Expert annotation
  ◮ "Canonical" dataset: Comparative Manifesto Project
  ◮ In most projects, undergraduate students (expertise comes from training)
◮ Crowd-sourced coding
  ◮ Wisdom of crowds: aggregated judgments of non-experts converge to judgments of experts at much lower cost (Benoit et al., 2016)
  ◮ Easy to implement with CrowdFlower or MTurk
Confusion matrix:

                              Actual label
Classification (algorithm)    Negative          Positive
Negative                      True negative     False negative
Positive                      False positive    True positive

Accuracy = (TrueNeg + TruePos) / (TrueNeg + TruePos + FalseNeg + FalsePos)
Precision (positive) = TruePos / (TruePos + FalsePos)
Recall (positive) = TruePos / (TruePos + FalseNeg)

Worked example:

                              Actual label
Classification (algorithm)    Negative    Positive
Negative                      800         100
Positive                      50          50

Accuracy = (800 + 50) / (800 + 50 + 100 + 50) = 0.85
Precision (positive) = 50 / (50 + 50) = 0.50
Recall (positive) = 50 / (50 + 100) = 0.33
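The worked example can be verified in a few lines of Python:

```python
# Metrics from the confusion matrix above:
# 800 true negatives, 100 false negatives, 50 false positives, 50 true positives.
tn, fn, fp, tp = 800, 100, 50, 50

accuracy = (tn + tp) / (tn + tp + fn + fp)  # 850 / 1000 = 0.85
precision = tp / (tp + fp)                  # 50 / 100   = 0.50
recall = tp / (tp + fn)                     # 50 / 150   = 0.33

print(accuracy, precision, round(recall, 2))
```

Note how a classifier can look accurate overall (0.85) while recalling only a third of the positive class, which is why accuracy alone is a poor selection criterion on unbalanced data.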
◮ Classifier is trained to maximize in-sample performance
◮ But generally we want to apply the method to new data
◮ Danger: overfitting
  ◮ Model is too complex, describes noise rather than signal (bias-variance trade-off)
  ◮ Focus on features that perform well in the labeled data but may not generalize (e.g. unpopular hashtags)
  ◮ In-sample performance better than out-of-sample performance
◮ Solutions?
  ◮ Randomly split dataset into training and test set
  ◮ Cross-validation
Intuition:
◮ Create K training and test sets ("folds") within the training set.
◮ For each fold k in K, train the classifier on the remaining folds and estimate its performance on the held-out fold.
◮ Choose the best classifier based on cross-validated performance.
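A sketch of this procedure with scikit-learn; the feature matrix and labels are random stand-ins for a document-feature matrix and hand-coded labels.

```python
# K-fold cross-validation (K = 5): each fold is held out once while the
# classifier is trained on the remaining folds; performance is averaged.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((100, 20))         # stand-in document-feature matrix
y = rng.integers(0, 2, size=100)  # stand-in hand-coded labels

scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")
print(scores.mean())  # cross-validated accuracy, used to compare classifiers
```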
Why do politicians not take full advantage of the interactive affordances of social media? A politician's incentive structure:

Democracy → Dialogue > Mobilisation > Marketing
Politician → Marketing > Mobilisation > Dialogue*

H1: Politicians make broadcasting rather than engaging use of Twitter.
H2: An engaging style of tweeting is positively related to impolite responses.
Data: European Election Study 2014, Social Media Study
◮ List of all candidates with Twitter accounts in 28 EU countries
◮ 2,482 out of 15,527 identified MEP candidates (16%)
◮ Collaboration with TNS Opinion to collect all tweets by candidates and tweets mentioning candidates (tweets, retweets, @-replies), May 5th to June 1st 2014.

Case selection: expected variation in politeness/civility

                        Received bailout    Did not receive bailout
High support for EU     Spain (55.4%)       Germany (68.5%)
Low support for EU      Greece (43.8%)      UK (41.4%)

(Percentages indicate the proportion of each country that considers the EU to be "a good thing".)
Data coverage by country

Country    Lists    Candidates    On Twitter    Tweets
Germany    9        501           123 (25%)     86,777
Greece     9        359           99 (28%)      18,709
Spain      11       648           221 (34%)     463,937
UK         28       733           304 (41%)     273,886
Coded data: random sample of ∼7,000 tweets from each country, labeled by undergraduate students:
◮ Polite: tweet adheres to politeness standards.
◮ Impolite: ill-mannered, disrespectful, offensive language...
◮ Broadcasting: statement, expression of opinion.
◮ Engaging: directed to someone else/another user.
◮ Morality and democracy: tweet makes reference to freedom and human rights, traditional morality, law and order, social harmony, democracy...

Incivility = impoliteness + morality and democracy
Coding process: summary statistics

                        Germany      Greece       Spain        UK
Coded by 1/by 2         2947/2819    2787/2955    3490/1952    3189/3296
Total coded             5766         5742         5442         6485

Impolite                399          1050         121          328
Polite                  5367         4692         5321         6157
% Agreement             92           80           93           95
Krippendorff/Maxwell    0.30/0.85    0.26/0.60    0.17/0.87    0.54/0.90

Broadcasting            2755         2883         1771         1557
Engaging                3011         2859         3671         4928
% Agreement             79           85           84           85
Krippendorff/Maxwell    0.58/0.59    0.70/0.70    0.66/0.69    0.62/0.70

Moral/Dem.              265          204          437          531
Other                   5501         5538         5005         5954
% Agreement             95           97           96           90
Krippendorff/Maxwell    0.50/0.91    0.53/0.93    0.41/0.92    0.39/0.81
Coded tweets as training dataset for a machine learning classifier:
◮ Preprocessing: remove punctuation (except # and @), transliterate to ASCII, stem, and tokenize into unigrams and bigrams. Keep tokens that appear in 2+ tweets but in fewer than 90% of tweets.
◮ Classifier: regularized regression (ridge regression), one per language and variable.
◮ Penalty parameter selected via cross-validation (a rough sketch of the pipeline follows below).
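A rough approximation of such a pipeline in Python with scikit-learn (the study's own code is not shown here): the tweets and labels are hypothetical, and stemming is omitted for brevity.

```python
# Approximation of the preprocessing described above: keep # and @,
# strip accents to ASCII, build unigrams and bigrams, and keep tokens
# appearing in 2+ tweets but in fewer than 90% of them.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

tweets = ["@user thanks for the support! #ep2014",   # hypothetical examples
          "vote for us tomorrow #ep2014",
          "@user completely agree with you",
          "our manifesto is out today"]
labels = [1, 0, 1, 0]  # e.g. 1 = engaging, 0 = broadcasting (hand-coded)

vectorizer = CountVectorizer(
    lowercase=True,
    strip_accents="ascii",      # crude stand-in for ASCII transliteration
    token_pattern=r"[#@]?\w+",  # keep # and @ prefixes on tokens
    ngram_range=(1, 2),         # unigrams and bigrams
    min_df=2,                   # token must appear in 2+ tweets...
    max_df=0.9,                 # ...but in fewer than 90% of tweets
)
X = vectorizer.fit_transform(tweets)

# Ridge-penalized (L2) logistic regression; in the study, one model is fit
# per language and per coded variable.
clf = LogisticRegression(penalty="l2").fit(X, labels)
```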
Classifier performance (5-fold cross-validation)

                            UK       Spain    Greece   Germany
Communication style
  Accuracy                  0.821    0.775    0.863    0.806
  Precision                 0.837    0.795    0.838    0.818
  Recall                    0.946    0.890    0.894    0.832
Polite vs. impolite
  Accuracy                  0.954    0.976    0.821    0.935
  Precision                 0.955    0.977    0.849    0.938
  Recall                    0.998    1.000    0.953    0.997
Morality and democracy
  Accuracy                  0.895    0.913    0.957    0.922
  Precision                 0.734    0.665    0.851    0.770
  Recall                    0.206    0.166    0.080    0.061
Top predictive n-grams

Broadcasting: just, hack, #votegreen2014, :, and, @ ', tonight, candid, up, tonbridg, vote @, im @, follow ukip, ukip @, #telleurop, angri, #ep2014, password, stori, #vote2014, team, #labourdoorstep, crimin, bbc news
Engaging: @ thank, @ ye, you'r, @ it', @ mani, @ pleas, u, @ hi, @ congratul, :), index, vote # skip, @ good, fear, cheer, haven't, lol, @ i'v, you'v, @ that', choice, @ wa, @ who, @ hope
Impolite: cunt, fuck, twat, stupid, shit, dick, tit, wanker, scumbag, moron, cock, foot, racist, fascist, sicken, fart, @ fuck, ars, suck, nigga, nigga ?, smug, idiot, @arsehol, arsehol
Polite: @ thank, eu, #ep2014, thank, know, candid, veri, politician, today, way, differ, europ, democraci, interview, time, tonight, @ think, news, european, sorri, congratul, good, :, democrat, seat
Moral/Dem.: democraci, polic, freedom, media, racist, gay, peac, fraud, discrimin, homosexu, muslim, equal, right, crime, law, violenc, racism, sexist
Others: @ ha, 2, snp, nice, tell, eu, congratul, campaign, leav, alreadi, wonder, vote @, ;), hust, nh, brit, tori, deliv, bad, immigr, #ukip, live, count, got, roma
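Tables like this one can be produced by ranking the vocabulary by the fitted coefficients of the linear classifier. A sketch, reusing the hypothetical `clf` and `vectorizer` from the pipeline sketch above:

```python
# Most predictive n-grams = features with the largest fitted weights.
import numpy as np

def top_ngrams(clf, vectorizer, n=25):
    """N-grams with the largest positive and most negative coefficients."""
    vocab = np.array(vectorizer.get_feature_names_out())
    order = np.argsort(clf.coef_[0])
    return vocab[order[-n:]][::-1], vocab[order[:n]]

top_class1, top_class0 = top_ngrams(clf, vectorizer, n=3)
print("most predictive of class 1:", top_class1)
print("most predictive of class 0:", top_class0)
```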
Citizens are more likely to respond to candidates when they adopt an engaging style
[Figure: average number of responses (by public) plotted against probability of an engaging tweet (candidate), by country: Germany, Greece, Spain, UK.]
Proportion of engaging tweets sent and impolite tweets received, by candidate and country
[Figure: estimated proportion of tweets in each category (engaging sent; impolite received, based on public) per candidate, by country: Germany, Greece, Spain, UK.]
Is engaging style positively related to impolite responses? Three levels of analysis:
◮ Tweet level: engaging tweets receive more impolite responses.
◮ Candidate level, over time: the number of impolite responses increases during the campaign for candidates who send more engaging tweets.
◮ Candidate level: candidates with a more engaging style tend to receive more impolite responses.
General thoughts:
◮ Trade-off between accuracy and interpretability
◮ Parameters need to be cross-validated

Frequently used classifiers:
◮ Naive Bayes
◮ Regularized regression
◮ SVM
◮ Others: k-nearest neighbors, tree-based methods, etc.
◮ Ensemble methods
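A sketch of how several of these could be compared on the same labeled data, using cross-validated accuracy as the selection criterion (random stand-in data):

```python
# Compare frequently used classifiers via 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.integers(0, 5, size=(200, 50))  # stand-in document-term counts
y = rng.integers(0, 2, size=200)        # stand-in hand-coded labels

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "Ridge logistic regression": LogisticRegression(penalty="l2"),
    "Linear SVM": LinearSVC(),
}
for name, model in classifiers.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```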
Assume we have:
◮ i = 1, 2, ..., N documents
◮ Each document i is in class y_i = 0 or y_i = 1
◮ j = 1, 2, ..., J unique features
◮ x_{ij} as the count of feature j in document i

We could build a linear regression model as a classifier, using the values of \beta_0, \beta_1, \ldots, \beta_J that minimize:

RSS = \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \right)^2

But can we?
◮ If J > N, OLS does not have a unique solution
◮ Even with N > J, OLS has low bias but high variance (overfitting)
What can we do? Add a penalty for model complexity, such that we now minimize:

\sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{J} \beta_j^2   → ridge regression

\sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{J} |\beta_j|   → lasso regression

where \lambda is the penalty parameter (to be estimated).
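The practical difference between the two penalties can be seen on simulated data: ridge shrinks all coefficients towards zero, while the lasso sets many of them exactly to zero. (In scikit-learn, `alpha` plays the role of λ.)

```python
# Ridge vs. lasso on simulated data with J = 100 features, N = 50 documents.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.random((50, 100))      # J > N: OLS would have no unique solution
beta = np.zeros(100)
beta[:5] = 2.0                 # only 5 features actually matter
y = X @ beta + rng.normal(scale=0.1, size=50)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("nonzero ridge coefficients:", np.sum(ridge.coef_ != 0))  # typically all 100
print("nonzero lasso coefficients:", np.sum(lasso.coef_ != 0))  # typically far fewer
```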
Why the penalty (shrinkage)?
◮ Reduces the variance
◮ Identifies the model if J > N
◮ Some coefficients become zero (feature selection)

The penalty can take different forms:
◮ Ridge regression: \lambda \sum_{j=1}^{J} \beta_j^2, with \lambda > 0; when \lambda = 0 this becomes OLS
◮ Lasso: \lambda \sum_{j=1}^{J} |\beta_j|, where some coefficients become exactly zero
◮ Elastic net: \lambda_1 \sum_{j=1}^{J} \beta_j^2 + \lambda_2 \sum_{j=1}^{J} |\beta_j| (best of both worlds?)

How to find the best value of \lambda? Cross-validation (see the sketch below).

Evaluation: regularized regression is easy to interpret, but often...
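A sketch of this cross-validation step on simulated data, using scikit-learn's ElasticNetCV, which searches over the penalty λ (called `alpha`) and the ridge/lasso mixing weight at the same time:

```python
# Choose the penalty parameter by 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(3)
X = rng.random((100, 200))     # J > N again
beta = np.zeros(200)
beta[:10] = 1.5                # a handful of truly predictive features
y = X @ beta + rng.normal(scale=0.1, size=100)

model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, y)
print("chosen penalty (lambda):", model.alpha_)
print("chosen ridge/lasso mix (l1_ratio):", model.l1_ratio_)
```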