RECSM Summer School: Facebook + Topic Models. Pablo Barberá (PowerPoint presentation)



SLIDE 1

RECSM Summer School: Facebook + Topic Models

Pablo Barberá, School of International Relations, University of Southern California. pablobarbera.com · Networked Democracy Lab: www.netdem.org · Course website:

github.com/pablobarbera/big-data-upf

SLIDE 6

Collecting Facebook data

Facebook only allows access to public pages’ data through the Graph API:

  • 1. Posts on public pages
  • 2. Likes, reactions, comments, replies...

Some public user data (gender, location) was available through previous versions of the API (not anymore)

Access to other (anonymized) data used in published studies requires permission from Facebook

R library: Rfacebook
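The slides point to the Rfacebook package; as a language-agnostic sketch (not from the course materials), the underlying Graph API call is just an HTTP GET whose URL can be assembled like this. The page name, token, and API version below are placeholders, and recent API versions require an app-reviewed access token:

```python
# Minimal sketch: building the Graph API URL for a public page's posts.
# Page name, token, and version are placeholders, not course specifics.
import urllib.parse

def posts_url(page, token, n=25, version="v2.8"):
    """Return the Graph API URL for the /posts edge of a public page."""
    base = f"https://graph.facebook.com/{version}/{page}/posts"
    params = {"access_token": token, "limit": n}
    return base + "?" + urllib.parse.urlencode(params)

url = posts_url("humansofnewyork", "MY_TOKEN", n=100)
print(url)
```

Rfacebook's helpers (e.g. for paginating through posts and comments) wrap exactly this kind of request.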

SLIDE 13

Overview of text as data methods

[Figure: Fig. 1 in Grimmer and Stewart (2013), an overview of text-as-data methods; labels visible on the slide include entity recognition (events, quotes, locations, names, …), cosine similarity, Naive Bayes, mixture models, (ML methods), and models with covariates (sLDA, STM)]
SLIDE 16

Latent Dirichlet allocation (LDA)

◮ Topic models are powerful tools for exploring large data sets and for making inferences about the content of documents

[Figure: documents linked to topics; example topics include politics (president, obama, washington), religion (hindu, judaism, ethics, buddhism), and sports (baseball, soccer, basketball, football)]

◮ Many applications in information retrieval, document summarization, and classification

[Figure: a new document with words w1, …, wN; “What is this document about?” is answered by its distribution of topics θ, e.g. weather .50, finance .49, sports .01]

◮ LDA is one of the simplest and most widely used topic models

SLIDE 18

Latent Dirichlet Allocation

◮ Document = random mixture over latent topics
◮ Topic = distribution over n-grams

Probabilistic model with 3 steps:

  • 1. Choose θi ∼ Dirichlet(α)
  • 2. Choose βk ∼ Dirichlet(δ)
  • 3. For each word in document i:

◮ Choose a topic zim ∼ Multinomial(θi)
◮ Choose a word wim ∼ Multinomial(βk=zim)

where:
α = parameter of the Dirichlet prior on the distribution of topics over documents
θi = topic distribution for document i
δ = parameter of the Dirichlet prior on the distribution of words over topics
βk = word distribution for topic k
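The three steps above can be sketched by sampling from the generative process directly (an illustration, not course code; the corpus sizes and hyperparameter values are invented):

```python
# Sample one toy corpus from the LDA generative process.
import numpy as np

rng = np.random.default_rng(0)
K, V, N_docs, doc_len = 3, 10, 5, 20   # topics, vocabulary, documents, words/doc
alpha, delta = 0.5, 0.1                # Dirichlet hyperparameters

# Step 2: beta_k ~ Dirichlet(delta), one word distribution per topic (K x V)
beta = rng.dirichlet([delta] * V, size=K)

docs = []
for i in range(N_docs):
    theta_i = rng.dirichlet([alpha] * K)   # Step 1: theta_i ~ Dirichlet(alpha)
    words = []
    for m in range(doc_len):               # Step 3: for each word in document i
        z = rng.choice(K, p=theta_i)       #   topic z_im ~ Multinomial(theta_i)
        w = rng.choice(V, p=beta[z])       #   word w_im ~ Multinomial(beta_z)
        words.append(w)
    docs.append(words)

print(len(docs), len(docs[0]))
```

Estimation runs this logic in reverse: given only the words, infer θ and β.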

SLIDE 20

Latent Dirichlet Allocation

Key parameters:

  • 1. θ = matrix of dimensions N documents by K topics, where θik is the probability that document i belongs to topic k; e.g. assuming K = 5:

                 T1    T2    T3    T4    T5
    Document 1   0.15  0.15  0.05  0.10  0.55
    Document 2   0.80  0.02  0.02  0.10  0.06
    ...
    Document N   0.01  0.01  0.96  0.01  0.01

  • 2. β = matrix of dimensions K topics by M words, where βkm is the probability that word m belongs to topic k; e.g. assuming M = 6:

                 W1    W2    W3    W4    W5    W6
    Topic 1      0.40  0.05  0.05  0.10  0.10  0.30
    Topic 2      0.10  0.10  0.10  0.50  0.10  0.10
    ...
    Topic K      0.05  0.60  0.10  0.05  0.10  0.10
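A minimal sketch of estimating these two matrices, assuming scikit-learn's LatentDirichletAllocation rather than the course's R tooling, on an invented toy corpus: θ comes from transform(), and β from row-normalizing components_:

```python
# Fit LDA on a toy corpus and recover the theta (N x K) and beta (K x M) matrices.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "economy taxes budget economy",
    "election campaign vote election",
    "budget taxes vote economy",
]
dtm = CountVectorizer().fit_transform(corpus)   # N documents x M words

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

theta = lda.transform(dtm)                      # N x K; each row sums to 1
beta = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # K x M

print(theta.shape, beta.shape)
```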

SLIDE 29

Validation

From Quinn et al, AJPS, 2010:

  • 1. Semantic validity
◮ Do the topics identify coherent groups of tweets that are internally homogeneous and related to each other in a meaningful way?

  • 2. Convergent/discriminant construct validity
◮ Do the topics match existing measures where they should match?
◮ Do they depart from existing measures where they should depart?

  • 3. Predictive validity
◮ Does variation in topic usage correspond with expected events?

  • 4. Hypothesis validity
◮ Can topic variation be used effectively to test substantive hypotheses?

SLIDE 30

Example: open-ended survey responses

Bauer, Barberá et al, Political Behavior, 2016.

◮ Data: General Social Survey (2008) in Germany
◮ Responses to the questions: Would you please tell me what you associate with the term “left”? and Would you please tell me what you associate with the term “right”?
◮ Open-ended questions minimize priming and potential interviewer effects
◮ Sparse Additive Generative model instead of LDA (more coherent topics for short text)
◮ K = 4 topics for each question

SLIDE 34

Example: topics in US legislators’ tweets

◮ Data: 651,116 tweets sent by US legislators from January 2013 to December 2014
◮ 2,920 documents = 730 days × 2 chambers × 2 parties
◮ Why aggregate? Applications that aggregate by author or day outperform tweet-level analyses (Hong and Davison, 2010)
◮ K = 100 topics (more on this later)
◮ Validation: http://j.mp/lda-congress-demo
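The aggregation step described above can be sketched with pandas (toy data; the column names and example tweets are invented):

```python
# Pool individual tweets into day x chamber x party "documents".
import pandas as pd

tweets = pd.DataFrame({
    "date":    ["2013-01-01", "2013-01-01", "2013-01-01", "2013-01-02"],
    "chamber": ["senate", "senate", "house", "senate"],
    "party":   ["D", "D", "R", "D"],
    "text":    ["fiscal cliff vote", "budget talks", "spending cuts", "gun safety"],
})

# One document per day-chamber-party combination, concatenating the tweets
docs = (tweets.groupby(["date", "chamber", "party"])["text"]
              .apply(" ".join)
              .reset_index())
print(len(docs))  # 3 documents from 4 tweets
```

With 730 days, 2 chambers, and 2 parties, this grouping yields the 2,920 documents on the slide.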

SLIDE 38

Choosing the number of topics

◮ Choosing K is “one of the most difficult questions in unsupervised learning” (Grimmer and Stewart, 2013, p. 19)

◮ We chose K = 100 based on cross-validated model fit.

[Figure: cross-validated perplexity and log-likelihood, as ratios with respect to the worst value (0.80–1.00), plotted against the number of topics (10–120)]

◮ BUT: “there is often a negative relationship between the best-fitting model and the substantive information provided”.

◮ Grimmer and Stewart propose to choose K based on “substantive fit.”
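A sketch of this selection procedure, assuming scikit-learn's LatentDirichletAllocation and its held-out perplexity as the cross-validated fit criterion (the toy corpus and candidate values of K are invented):

```python
# Compare candidate K values by perplexity on a held-out set (lower = better fit).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

corpus = ["budget taxes economy"] * 10 + ["vote election campaign"] * 10
dtm = CountVectorizer().fit_transform(corpus)
train, test = train_test_split(dtm, test_size=0.3, random_state=0)

scores = {}
for k in (2, 5, 10):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(train)
    scores[k] = lda.perplexity(test)   # held-out perplexity for this K

best_k = min(scores, key=scores.get)
print(scores, best_k)
```

As the slide warns, the statistically best-fitting K is not necessarily the most substantively interpretable one.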

SLIDE 39

Extensions of LDA

  • 1. Structural topic model (Roberts et al, 2014, AJPS)
  • 2. Dynamic topic model (Blei and Lafferty, 2006, ICML; Quinn et al, 2010, AJPS)
  • 3. Hierarchical topic model (Griffiths and Tenenbaum, 2004, NIPS; Grimmer, 2010, PA)

Why?

◮ Substantive reasons: incorporate specific elements of the DGP into estimation
◮ Statistical reasons: structure can lead to better topics.

SLIDE 40

Structural topic model

◮ Prevalence: the prior on the mixture over topics is now document-specific, and can be a function of covariates (documents with similar covariates will tend to be about the same topics)

◮ Content: the distribution over words is now document-specific and can be a function of covariates (documents with similar covariates will tend to use similar words to refer to the same topic)

SLIDE 41

Dynamic topic model

Source: Blei, “Modeling Science”

SLIDE 53

Wordfish (Slapin and Proksch, 2008, AJPS)

◮ Goal: unsupervised scaling of ideological positions
◮ Ideology of politician i, θi, is a position on a latent scale.
◮ Word usage is drawn from a Poisson-IRT model:

Wim ∼ Poisson(λim)
λim = exp(αi + ψm + βm × θi)

◮ where:
αi is the “loquaciousness” of politician i
ψm is the frequency of word m
βm is the discrimination parameter of word m

◮ Estimation using the EM algorithm.
◮ Identification:
◮ Unit variance restriction for θi
◮ Choose politicians a and b such that θa > θb
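The Poisson-IRT model above can be simulated directly (an illustrative sketch; the sizes and parameter distributions are invented):

```python
# Simulate word counts W_im ~ Poisson(exp(alpha_i + psi_m + beta_m * theta_i)).
import numpy as np

rng = np.random.default_rng(0)
n_pol, n_words = 10, 50

theta = rng.normal(size=n_pol)       # ideology, on a unit-variance latent scale
alpha = rng.normal(size=n_pol)       # loquaciousness of politician i
psi   = rng.normal(size=n_words)     # overall frequency of word m
beta  = rng.normal(size=n_words)     # discrimination parameter of word m

# lambda_im = exp(alpha_i + psi_m + beta_m * theta_i), built by broadcasting
lam = np.exp(alpha[:, None] + psi[None, :] + beta[None, :] * theta[:, None])
W = rng.poisson(lam)                 # politicians x words count matrix

print(W.shape)
```

Words with large |βm| pull apart politicians at opposite ends of the θ scale, which is what makes them informative for the EM estimates.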