RECSM Summer School: Facebook + Topic Models
Pablo Barberá
School of International Relations, University of Southern California
pablobarbera.com | Networked Democracy Lab: www.netdem.org
Course website: github.com/pablobarbera/big-data-upf
Collecting Facebook data

Facebook only allows access to public pages' data through the Graph API:

1. Posts on public pages
2. Likes, reactions, comments, replies...

Some public user data (gender, location) was available through previous versions of the API (not anymore). Access to other (anonymized) data used in published studies requires permission from Facebook.

R library: Rfacebook
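As a rough illustration of what Rfacebook does under the hood, a request for a public page's posts can be sketched in Python. The page name, token, API version, and field list below are placeholder assumptions, not the library's actual defaults:

```python
# Sketch of a Graph API request for a public page's posts, assuming the
# Python standard library as a stand-in for Rfacebook's internals.
# The page name, API version, and token below are placeholders.
import urllib.parse

GRAPH_URL = "https://graph.facebook.com/v2.12"  # version is an assumption

def posts_url(page, token, fields="message,created_time,likes.summary(true)"):
    """Build the URL that requests posts from a public page."""
    query = urllib.parse.urlencode({"fields": fields, "access_token": token})
    return f"{GRAPH_URL}/{page}/posts?{query}"

url = posts_url("somepage", "APP_TOKEN")
# The JSON response would then be fetched with, e.g., urllib.request.urlopen(url)
```

The response is paginated, so a real client (as Rfacebook does) would follow the `paging.next` links until the requested number of posts is collected.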
Overview of text as data methods

[Fig. 1 in Grimmer and Stewart (2013): a map of text-as-data methods, with labels including entity recognition (events, quotes, locations, names, ...), supervised methods (cosine similarity, Naive Bayes, other ML methods), mixture models, and models with covariates (sLDA, STM).]
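One of the methods named in the figure, cosine similarity, can be sketched on toy word-count vectors (the documents here are made up for illustration):

```python
# Cosine similarity between term-frequency vectors: 1 when documents use
# words in identical proportions, 0 when they share no words at all.
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two term-frequency vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

doc1 = np.array([2, 1, 0, 3])  # counts of 4 vocabulary words
doc2 = np.array([1, 1, 0, 2])  # similar word usage to doc1
doc3 = np.array([0, 0, 5, 0])  # no vocabulary overlap with doc1
```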
Latent Dirichlet allocation (LDA)

◮ Topic models are powerful tools for exploring large data sets and for making inferences about the content of documents

[Figure: documents linked to latent topics; example topics are "politics" (president, obama, washington), "religion" (hindu, judaism, ethics, buddhism), and "sports" (baseball, soccer, basketball, football).]

◮ Many applications in information retrieval, document summarization, and classification

[Figure: inference for a new document. What is this document about? Given its words w1, ..., wN, estimate θ, its distribution over topics, e.g. weather .50, finance .49, sports .01.]

◮ LDA is one of the simplest and most widely used topic models
Latent Dirichlet Allocation
◮ Document = random mixture over latent topics
◮ Topic = distribution over n-grams

Probabilistic model with 3 steps:

1. Choose θi ∼ Dirichlet(α)
2. Choose βk ∼ Dirichlet(δ)
3. For each word m in document i:
   ◮ Choose a topic zim ∼ Multinomial(θi)
   ◮ Choose a word wim ∼ Multinomial(βk=zim)

where:
α = parameter of the Dirichlet prior on the distribution of topics over documents
θi = topic distribution for document i
δ = parameter of the Dirichlet prior on the distribution of words over topics
βk = word distribution for topic k
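The three steps above can be sketched as a small numpy simulation; K, M, α, δ, and the document length are arbitrary toy values:

```python
# Minimal numpy sketch of the LDA generative process.
import numpy as np

rng = np.random.default_rng(0)
K, M = 3, 10            # topics, vocabulary size (toy values)
alpha, delta = 0.5, 0.1

# Step 2 (shared across documents): word distribution beta_k for each topic
beta = rng.dirichlet([delta] * M, size=K)        # K x M matrix

def generate_document(n_words=20):
    # Step 1: topic distribution theta_i for this document
    theta = rng.dirichlet([alpha] * K)
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)               # step 3a: draw topic z_im
        w = rng.choice(M, p=beta[z])             # step 3b: draw word from beta_{z_im}
        words.append(w)
    return theta, words

theta, words = generate_document()
```

Estimation runs this process in reverse: given only the observed words, infer θ and β.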
Latent Dirichlet Allocation

Key parameters:

1. θ = matrix of dimensions N documents by K topics, where θik is the probability that document i belongs to topic k; e.g., assuming K = 5:

                T1    T2    T3    T4    T5
   Document 1  0.15  0.15  0.05  0.10  0.55
   Document 2  0.80  0.02  0.02  0.10  0.06
   ...
   Document N  0.01  0.01  0.96  0.01  0.01

2. β = matrix of dimensions K topics by M words, where βkm is the probability that word m belongs to topic k; e.g., assuming M = 6:

             W1    W2    W3    W4    W5    W6
   Topic 1  0.40  0.05  0.05  0.10  0.10  0.30
   Topic 2  0.10  0.10  0.10  0.50  0.10  0.10
   ...
   Topic K  0.05  0.60  0.10  0.05  0.10  0.10
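A minimal numpy check of the θ example (the values are the ones shown above): each row is a distribution over K = 5 topics, so rows must sum to one, and the row-wise argmax gives each document's most likely topic:

```python
# The theta matrix from the example, one row per document.
import numpy as np

theta = np.array([
    [0.15, 0.15, 0.05, 0.10, 0.55],   # Document 1
    [0.80, 0.02, 0.02, 0.10, 0.06],   # Document 2
    [0.01, 0.01, 0.96, 0.01, 0.01],   # Document N
])

# Each row is a probability distribution over topics
assert np.allclose(theta.sum(axis=1), 1.0)

# Most likely topic per document (0-indexed): T5, T1, T3
top_topic = theta.argmax(axis=1)
```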
Validation

From Quinn et al, AJPS, 2010:

1. Semantic validity
   ◮ Do the topics identify coherent groups of tweets that are internally homogeneous and related to each other in a meaningful way?
2. Convergent/discriminant construct validity
   ◮ Do the topics match existing measures where they should match?
   ◮ Do they depart from existing measures where they should depart?
3. Predictive validity
   ◮ Does variation in topic usage correspond with expected events?
4. Hypothesis validity
   ◮ Can topic variation be used effectively to test substantive hypotheses?
Example: open-ended survey responses

Bauer, Barberá et al, Political Behavior, 2016.

◮ Data: General Social Survey (2008) in Germany
◮ Responses to the questions: Would you please tell me what you associate with the term "left"? and Would you please tell me what you associate with the term "right"?
◮ Open-ended questions minimize priming and potential interviewer effects
◮ Sparse Additive Generative model instead of LDA (more coherent topics for short text)
◮ K = 4 topics for each question
Example: topics in US legislators' tweets

◮ Data: 651,116 tweets sent by US legislators from January 2013 to December 2014
◮ 2,920 documents = 730 days × 2 chambers × 2 parties
◮ Why aggregate? Applications that aggregate by author or day outperform tweet-level analyses (Hong and Davison, 2010)
◮ K = 100 topics (more on this later)
◮ Validation: http://j.mp/lda-congress-demo
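The aggregation step can be sketched with standard-library Python, pooling tweets into one document per day-chamber-party cell; the field names and toy tweets are illustrative, not the actual data:

```python
# Pool tweets into one document per (day, chamber, party) cell before
# fitting the topic model. The tweets below are made-up examples.
from collections import defaultdict

tweets = [
    {"date": "2013-01-03", "chamber": "house",  "party": "D", "text": "jobs bill"},
    {"date": "2013-01-03", "chamber": "house",  "party": "D", "text": "economy"},
    {"date": "2013-01-03", "chamber": "senate", "party": "R", "text": "budget"},
]

groups = defaultdict(list)
for t in tweets:
    groups[(t["date"], t["chamber"], t["party"])].append(t["text"])

# Each cell becomes one document; 730 days x 2 chambers x 2 parties = 2,920
docs = {key: " ".join(texts) for key, texts in groups.items()}
```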
Choosing the number of topics

◮ Choosing K is "one of the most difficult questions in unsupervised learning" (Grimmer and Stewart, 2013, p. 19)
◮ We chose K = 100 based on cross-validated model fit.

[Figure: perplexity and log-likelihood (as ratios with respect to the worst value, 0.80 to 1.00) across K = 10 to 120 topics]

◮ BUT: "there is often a negative relationship between the best-fitting model and the substantive information provided".
◮ Grimmer and Stewart propose to choose K based on "substantive fit."
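Perplexity, the held-out fit measure plotted above, can be sketched as the exponentiated negative average per-token log-likelihood; lower values mean better fit:

```python
# Perplexity from a held-out log-likelihood and token count.
import math

def perplexity(log_likelihood, n_tokens):
    """exp(-loglik / N): roughly, the size of a uniform vocabulary the
    model is 'as surprised as' when predicting each held-out token."""
    return math.exp(-log_likelihood / n_tokens)

# A model assigning each of 1,000 tokens probability 1/50 on average:
pp = perplexity(1000 * math.log(1 / 50), 1000)
```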
Extensions of LDA

1. Structural topic model (Roberts et al, 2014, AJPS)
2. Dynamic topic model (Blei and Lafferty, 2006, ICML; Quinn et al, 2010, AJPS)
3. Hierarchical topic model (Griffiths and Tenenbaum, 2004, NIPS; Grimmer, 2010, PA)

Why?
◮ Substantive reasons: incorporate specific elements of the data-generating process into estimation
◮ Statistical reasons: structure can lead to better topics.
Structural topic model

◮ Prevalence: the prior on the mixture over topics is now document-specific and can be a function of covariates (documents with similar covariates will tend to be about the same topics)
◮ Content: the distribution over words is now document-specific and can be a function of covariates (documents with similar covariates will tend to use similar words to refer to the same topic)
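A rough numpy sketch of the prevalence idea, with the prior mean of a document's topic mixture depending on covariates through a softmax link; Γ and the covariate values are made up, and the actual STM uses a logistic-normal prior rather than this simplified deterministic mean:

```python
# Covariate-dependent prior mean over K = 3 topics (toy values).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

gamma = np.array([[1.0, -1.0,  0.0],   # effect of covariate 1 on each topic
                  [0.0,  2.0, -2.0]])  # effect of covariate 2 on each topic

def prior_topic_mean(covariates):
    """Documents with similar covariates get similar expected mixtures."""
    return softmax(covariates @ gamma)

a = prior_topic_mean(np.array([1.0, 0.0]))  # leans toward topic 1
b = prior_topic_mean(np.array([0.0, 1.0]))  # leans toward topic 2
```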
Dynamic topic model
Source: Blei, “Modeling Science”
Wordfish (Slapin and Proksch, 2008, AJPS)

◮ Goal: unsupervised scaling of ideological positions
◮ Ideology of politician i, θi, is a position on a latent scale
◮ Word usage is drawn from a Poisson-IRT model:

   Wim ∼ Poisson(λim)
   λim = exp(αi + ψm + βm × θi)

◮ where:
   αi is the "loquaciousness" of politician i
   ψm is the frequency of word m
   βm is the discrimination parameter of word m
◮ Estimation using an EM algorithm
◮ Identification:
   ◮ Unit variance restriction for θi
   ◮ Choose a and b such that θa > θb
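The rate equation and the unit-variance identification restriction can be sketched in numpy; all parameter values below are illustrative draws, not estimates:

```python
# Simulate word counts from the Wordfish model W_im ~ Poisson(lambda_im).
import numpy as np

rng = np.random.default_rng(1)
N, M = 5, 8                      # politicians, words (toy sizes)

theta = rng.normal(size=N)       # latent ideal points
theta = (theta - theta.mean()) / theta.std()   # identification: mean 0, unit variance
alpha = rng.normal(size=N)       # loquaciousness of each politician
psi = rng.normal(size=M)         # baseline frequency of each word
beta = rng.normal(size=M)        # discrimination of each word

# lambda_im = exp(alpha_i + psi_m + beta_m * theta_i): an N x M rate matrix
lam = np.exp(alpha[:, None] + psi[None, :] + beta[None, :] * theta[:, None])
W = rng.poisson(lam)             # simulated word-count matrix
```

Words with large |βm| discriminate strongly between left and right; words with βm near zero are used at similar rates regardless of θi.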