Mixed Membership Word Embeddings for Computational Social Science (PowerPoint PPT Presentation)


SLIDE 1

Mixed Membership Word Embeddings for Computational Social Science

James Foulds (Jimmy), Department of Information Systems, University of Maryland, Baltimore County

UMBC ACM Faculty Talk, April 5, 2018

Paper to be presented at the International Conference on Artificial Intelligence and Statistics (AISTATS 2018)

SLIDE 2

Latent Variable Modeling

[Diagram: data (complicated, noisy, high-dimensional) → understand, explore, predict]

SLIDE 3

Latent Variable Modeling

[Diagram: data (complicated, noisy, high-dimensional) → latent variable model → understand, explore, predict]

SLIDE 4

Latent Variable Modeling

[Diagram: data (complicated, noisy, high-dimensional) → latent variable model → low-dimensional, semantically meaningful representations → understand, explore, predict]

SLIDE 5

Latent Variable Modeling

  • Latent variable modeling is a general, principled approach for making sense of complex data sets
  • Core principles:
    – Dimensionality reduction

Images due to Chris Bishop, Pattern Recognition and Machine Learning

SLIDE 6

Latent Variable Modeling

  • Latent variable modeling is a general, principled approach for making sense of complex data sets
  • Core principles:
    – Dimensionality reduction
    – Probabilistic graphical models

Images due to Chris Bishop, Pattern Recognition and Machine Learning

SLIDE 7

Latent Variable Modeling

  • Latent variable modeling is a general, principled approach for making sense of complex data sets
  • Core principles:
    – Dimensionality reduction
    – Probabilistic graphical models
    – Statistical inference, especially Bayesian inference

Images due to Chris Bishop, Pattern Recognition and Machine Learning

SLIDE 8

Latent Variable Modeling

  • Latent variable modeling is a general, principled approach for making sense of complex data sets
  • Core principles:
    – Dimensionality reduction
    – Probabilistic graphical models
    – Statistical inference, especially Bayesian inference

Latent variable models are, basically, PCA on steroids!

Images due to Chris Bishop, Pattern Recognition and Machine Learning

SLIDE 9

Motivating Applications

  • Industry:
    – User modeling, recommender systems, and personalization, …

SLIDE 10

Motivating Applications

  • Natural language processing:
    – Machine translation
    – Document summarization
    – Parsing
    – Question answering
    – Named entity recognition
    – Sentiment analysis
    – Opinion mining

SLIDE 11

Motivating Applications

  • Furthering scientific understanding in:
    – Cognitive psychology (Griffiths and Tenenbaum, 2006)
    – Sociology (Hoff, 2008)
    – Political science (Gerrish and Blei, 2012)
    – The humanities (Mimno, 2012)
    – Genetics (Pritchard, 2000)
    – Climate science (Bain et al., 2011)
    – …

SLIDE 12

Motivating Applications

  • Social network analysis:
    – Identify latent social groups/communities
    – Test sociological theories (homophily, stochastic equivalence, triadic closure, balance theory, …)

SLIDE 13

Motivating Applications

  • Computational social science, digital humanities, …

SLIDE 14

Example: Mining Classics Journals

SLIDE 15

Example: Do U.S. Senators from the same state prioritize different issues? (Grimmer, 2010)

[Figure labels: “Schiller’s theory is false” vs. “Schiller’s theory is true”]

Grimmer, J. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Analysis, 18(1):1–35, 2010.

SLIDE 16

Example: Influence Relationships in the U.S. Supreme Court

Guo, F., Blundell, C., Wallach, H., and Heller, K. (2015). AISTATS.

SLIDE 17

Box’s Loop

[Diagram: data (complicated, noisy, high-dimensional) → latent variable model → low-dimensional, semantically meaningful representations → understand, explore, predict → evaluate, iterate]

SLIDE 18

Overview of my Research

[Diagram: research overview around the data → latent variable model → understand/explore/predict loop, spanning latent variable models, general-purpose modeling frameworks, evaluation, and privacy. Publications: KDD’15, ACL’15 (x2), EMNLP’13,’15, ICWSM’11, AISTATS’11,’17, SDM’11, UAI’14,’16, KDD’13, ICML’15, RecSys’15, ArXiv’16 (submitted to JMLR); older work: KER’10, DS’10, AusAI’08, APJOR’06, IJOR’06]

SLIDE 19

Topic Models (Blei et al., 2003)

The quick brown fox jumps over the sly lazy dog

SLIDE 20

Topic Models (Blei et al., 2003)

The quick brown fox jumps over the sly lazy dog [5 6 37 1 4 30 5 22 570 12]

SLIDE 21

Topic Models (Blei et al., 2003)

The quick brown fox jumps over the sly lazy dog [5 6 37 1 4 30 5 22 570 12]

Topics: Foxes, Dogs, Jumping [40% 40% 20%]

SLIDE 22

Topics

[Figure: example topics, e.g. Topic 1: reinforcement learning; Topic 2: learning algorithms; Topic 3: character recognition]

A topic is a distribution over all words in the dictionary: a vector of discrete probabilities (sums to one)

SLIDE 23

Topic Models for Computational Social Science

[Figure: topics over time]

SLIDE 24

Naïve Bayes Document Model

[Figure: assumed generative process and graphical model, with plate over documents d = 1:D]

SLIDE 25

Mixed Membership Modeling

  • The naïve Bayes conditional independence assumption is typically too strong, not realistic
  • Mixed membership: relax the “hard clustering” assumption to “soft clustering”
    – Membership distribution over clusters
    – E.g.:
      • Text documents belong to a distribution of topics
      • Social network individuals belong partly to multiple communities
      • Our genes come from multiple different ancestral populations

SLIDE 26

Mixed Membership Modeling

  • Improves representational power for a fixed number of topics/clusters
    – We can have a powerful model with fewer clusters
  • Parameter sharing
    – Can learn on smaller datasets, especially with a Bayesian approach to manage uncertainty in cluster assignments

SLIDE 27

Topic Model Latent Representations

  • Unsupervised naïve Bayes (latent class model)
  • Topic model (mixed membership model)

[Table: under the mixed membership model, each document has a distribution over the topics Foxes, Dogs, Jumping (e.g., Doc 1: 0.4, 0.4, 0.2; Doc 2: 0.5, 0.5; Doc 3: 0.1, 0.9); under unsupervised naïve Bayes, each document places weight 1 on a single topic]

SLIDE 28

Latent Dirichlet Allocation Topic Model (Blei et al., 2003)

Documents have distributions over topics θ(d); topics are distributions over words φ(k)

Assumed generative process (the full model includes priors on θ, φ):

  • For each document d
    • For each word wd,n
      • Draw a topic assignment zd,n ~ Discrete(θ(d))
      • Draw a word from the chosen topic wd,n ~ Discrete(φ(zd,n))
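As a minimal illustration, here is a toy sketch of this generative process in Python (not the paper's code; numpy's Dirichlet and categorical draws stand in for the priors and Discrete(·), and the sizes and hyperparameters are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, D, N = 3, 10, 5, 8        # topics, vocabulary size, documents, words per doc
alpha, beta = 0.1, 0.01         # symmetric Dirichlet hyperparameters (assumed)

phi = rng.dirichlet(beta * np.ones(V), size=K)     # topics: distributions over words
docs = []
for d in range(D):
    theta_d = rng.dirichlet(alpha * np.ones(K))    # document's distribution over topics
    z = rng.choice(K, size=N, p=theta_d)           # topic assignment z_{d,n}
    w = [int(rng.choice(V, p=phi[k])) for k in z]  # word w_{d,n} from the chosen topic
    docs.append(w)
```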

SLIDE 29

Collapsed Gibbs Sampler for LDA (Griffiths and Steyvers, 2004)

  • Marginalize out the parameters θ and φ, and perform inference on the latent variables z only

SLIDE 30

Collapsed Gibbs Sampler for LDA (Griffiths and Steyvers, 2004)

  • The collapsed Gibbs update (annotated on the slide with the topic counts, document-topic counts, word-topic counts, and the smoothing from the prior, similar to Laplace smoothing) is, in standard notation:

$$P(z_{d,n} = k \mid \mathbf{z}_{\neg(d,n)}, \mathbf{w}) \;\propto\; \big(n_{d,k}^{\neg(d,n)} + \alpha\big)\,\frac{n_{k,\,w_{d,n}}^{\neg(d,n)} + \beta}{n_{k}^{\neg(d,n)} + V\beta}$$
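A toy sketch of one sweep of this sampler (illustrative variable names, not the paper's implementation: n_dk is the D x K document-topic count matrix, n_kw the K x V word-topic count matrix, and n_k the length-K topic counts, all kept consistent with the current assignments z):

```python
import numpy as np

def cgs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, rng):
    """One collapsed Gibbs sweep over every token's topic assignment."""
    K, V = n_kw.shape
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1   # remove token's counts
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = int(rng.choice(K, p=p / p.sum()))           # resample its topic
            z[d][n] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1   # add the counts back
```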

SLIDE 31

Word Embeddings

  • Language models which learn to represent dictionary words with vectors:
    dog: (0.11, -1.5, 2.7, …)
    cat: (0.15, -1.2, 3.2, …)
    Paris: (4.5, 0.3, -2.1, …)
  • Nuanced representations for words
  • Improved performance for many NLP tasks
    – translation, part-of-speech tagging, chunking, NER, …
  • NLP “from scratch”? (Collobert et al., 2011)

[Diagram: dog and cat embedded near each other; Paris farther away]

SLIDE 32

Word Embeddings

  • Vector arithmetic solves analogy tasks:

    man is to king as woman is to _____?
    v(king) - v(man) + v(woman) ≈ v(queen)

[Diagram: vectors v(king), v(queen), v(man), v(woman) illustrating the analogy offset]
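A minimal sketch of this vector arithmetic over a hypothetical embedding dictionary (toy 2-d vectors chosen so the analogy works out; real vectors come from a trained model):

```python
import numpy as np

# Hypothetical toy embeddings; in practice these come from a trained model.
emb = {"king": np.array([1.0, 2.0]), "queen": np.array([0.9, 3.0]),
       "man": np.array([0.5, 0.1]), "woman": np.array([0.4, 1.1])}

def analogy(a, b, c):
    """Word closest (by cosine) to v(b) - v(a) + v(c), excluding the inputs."""
    target = emb[b] - emb[a] + emb[c]
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in emb if w not in {a, b, c}),
               key=lambda w: cos(emb[w], target))

print(analogy("man", "king", "woman"))   # -> queen, with these toy vectors
```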

SLIDE 33

The Distributional Hypothesis

  • “There is a correlation between distributional similarity and meaning similarity, which allows us to utilize the former in order to estimate the latter.” (Sahlgren, 2008)

[Diagram: words w1 and w2 appearing in similar contexts have similar meanings]

SLIDE 34

The Distributional Hypothesis

  • “There is a correlation between distributional similarity and meaning similarity, which allows us to utilize the former in order to estimate the latter.” (Sahlgren, 2008)

[Diagram: words w1 and w2 appearing in similar contexts have similar meanings]

SLIDE 35

The Distributional Hypothesis

  • “There is a correlation between distributional similarity and meaning similarity, which allows us to utilize the former in order to estimate the latter.” (Sahlgren, 2008)

[Diagram: words w1 and w2 appearing in similar contexts have similar meanings]

SLIDE 36

Word2vec (Mikolov et al., 2013)

Skip-Gram: a log-bilinear classifier for the context of a given word

Figure due to Mikolov et al. (2013)
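For reference, the standard skip-gram conditional is the softmax of a bilinear score (notation assumed: input vectors $v$, output/context vectors $v'$; the slide shows the architecture figure rather than the formula):

$$p(w_c \mid w_i) \;=\; \frac{\exp\!\big({v'_{w_c}}^{\top} v_{w_i}\big)}{\sum_{w=1}^{V} \exp\!\big({v'_{w}}^{\top} v_{w_i}\big)}$$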

SLIDE 37

The Skip-Gram Encodes the Distributional Hypothesis

  • Word vectors encode the distribution of context words
  • Similar words are assumed to have similar vectors

[Diagram: words w1 and w2 appearing in similar contexts]

SLIDE 38

Word2vec (Mikolov et al., 2013)

  • Key insights:
    – Simple models can be trained efficiently on big data
    – High-dimensional simple embedding models, trained on massive data sets, can outperform sophisticated neural nets

SLIDE 39

Word Embeddings for Computational Social Science?

  • Word embeddings have many advantages
    – Capture similarities between words
    – Often better classification performance than topic models
  • They have not yet been widely adopted for computational social science research, perhaps due to the following limitations:
    – The target corpus of interest is often not big data
    – It is important for the model to be interpretable

SLIDE 40

Contributions of this Work

  • Interpretable, statistically efficient embedding model
  • Efficient training algorithm, using recent advances from both topic models and word embeddings
  • Experimental results and computational social science case studies
  • Practical recommendations and insights
    – on the use of generic big-data embeddings, which is a very common practice in NLP

J. R. Foulds. Mixed Membership Word Embeddings for Computational Social Science. Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

SLIDE 41

The Skip-Gram as a Probabilistic Model

  • Can view the skip-gram as a probabilistic model for “generating” context words
  • Conditional discrete distribution over context words: can identify it with a topic
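In symbols (notation assumed, following the topic-model analogy): each input word's conditional distribution over context words is a discrete distribution that can be read as a per-word “topic”,

$$\phi^{(w_i)}_{w_c} \;\triangleq\; p(w_c \mid w_i)$$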

SLIDE 42

The Skip-Gram as a Probabilistic Model

[Figure annotations: observed “cluster” assignment; naïve Bayes conditional independence; “topic” distribution for input word wi]

SLIDE 43

Analogous Topic Model Corresponding to Skip-Gram

[Figure annotations: observed “cluster” assignment; naïve Bayes conditional independence assumption; “topic” for input word wi]

SLIDE 44

Grid of Models’ “Generative” Processes

SLIDE 45

Grid of Models’ “Generative” Processes

SLIDE 46

Grid of Models’ “Generative” Processes

Identifying word distributions with topics leads to an analogous topic model

SLIDE 47

Grid of Models’ “Generative” Processes

Identifying word distributions with topics leads to an analogous topic model. Relax the naïve Bayes assumption, replace with a mixed membership model:

  • flexible representation for words
  • parameter sharing

SLIDE 48

Grid of Models’ “Generative” Processes

Identifying word distributions with topics leads to an analogous topic model. Relax the naïve Bayes assumption, replace with a mixed membership model:

  • flexible representation for words
  • parameter sharing

Reinstate the word vector representation

SLIDE 49

Mixed Membership Skip-Gram Topic Model

  • Each input word has a distribution over topics
  • Topics shared across all input words

[Figure: graphical model with contexts, context words, vocabulary, and topics]
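A toy sketch of this generative process as described (one topic drawn per input token, under my reading of the model; variable names are illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, C = 10, 4, 2                                 # vocabulary, topics, context size
theta = rng.dirichlet(0.1 * np.ones(K), size=V)    # per-input-word topic distributions
phi = rng.dirichlet(0.01 * np.ones(V), size=K)     # shared topics over context words

def generate_context(w_i):
    """Draw context words for input token w_i under the MMSG topic model."""
    z = rng.choice(K, p=theta[w_i])                # the token's topic
    return [int(rng.choice(V, p=phi[z])) for _ in range(C)]  # context words from topic z
```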

SLIDE 50

Mixed Membership Skip-Gram

[Figure: graphical model with contexts, context words, vocabulary, and topics]

SLIDE 51

Mixed Membership Word Embeddings

Word embeddings are convex combinations of topic embeddings
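In symbols (notation assumed: topic embeddings $u_k$, per-word topic proportions $\theta^{(w)}$):

$$v_w \;=\; \sum_{k=1}^{K} \theta^{(w)}_k \, u_k, \qquad \theta^{(w)}_k \ge 0, \; \sum_{k} \theta^{(w)}_k = 1$$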

SLIDE 52

Mixed Membership Word Embeddings

  • Words have mixed membership distributions over topics
  • Topics have embeddings; words don’t. Resolves polysemy
  • Fewer vectors than words: statistical efficiency on small data
  • Word embeddings recovered as prior mean or posterior mean vectors
    – convex combinations of topic embeddings
  • Interpretable: topics can be interpreted via top-words lists; word embeddings are defined in terms of topic embeddings

SLIDE 53

Mixed Membership Skip-Gram: Posterior Inference for the Topic Vector

  • Context can be leveraged for inferring the topic vector at test time, via Bayes’ rule
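A plausible reconstruction of this Bayes' rule computation (one topic z per token, context words w_c; notation as above):

$$p(z = k \mid w_i, \text{context}) \;\propto\; \theta^{(w_i)}_k \prod_{c \,\in\, \text{context}(i)} \phi^{(k)}_{w_c}$$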

SLIDE 54

Bayesian Inference for MMSG Topic Model

  • Bayesian version of the model, with Dirichlet priors
  • Collapsed Gibbs sampling

SLIDE 55

Bayesian Inference for MMSG Topic Model

  • Challenge 1: want a relatively large number of topics
  • Solution: Metropolis-Hastings-Walker algorithm (Li et al., 2014)
    – Alias table data structure, amortized O(1) sampling
    – Sparse implementation, sublinear in the number of topics K
    – Metropolis-Hastings correction for sampling from stale distributions

SLIDE 56

Metropolis-Hastings-Walker (Li et al., 2014)

  • Approximate the second term of the mixture, sample efficiently via alias tables, correct via Metropolis

[Equation annotations: the Gibbs update splits into a sparse term plus a dense, slow-changing term]
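For the alias-table ingredient, here is a self-contained sketch of Walker's alias method (the standard algorithm, not the paper's implementation): O(K) setup, then O(1) per sample.

```python
import random

def build_alias_table(probs):
    """Walker's alias method: preprocess a discrete distribution for O(1) sampling."""
    n = len(probs)
    scaled = [p * n for p in probs]
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    prob, alias = [1.0] * n, list(range(n))
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l      # column s: keep s w.p. scaled[s], else l
        scaled[l] -= 1.0 - scaled[s]          # l donated the leftover mass
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias

def alias_sample(prob, alias):
    """Draw one sample: uniform column, then a biased coin flip."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
```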

SLIDE 57

Metropolis-Hastings-Walker Proposal

  • Dense part of the Gibbs update is a “product of experts” (Hinton, 2004), with an expert for each context word
  • Use a “mixture of experts” proposal distribution
  • Can sample efficiently from the “experts” via alias tables

SLIDE 58

Metropolis-Hastings-Walker Proposal

  • Dense part of the Gibbs update is a “product of experts” (Hinton, 2004), with an expert for each context word
  • Use a “mixture of experts” proposal distribution
  • Can sample efficiently from the “experts” via alias tables

SLIDE 59

Metropolis-Hastings-Walker Proposal

  • Dense part of the Gibbs update is a “product of experts” (Hinton, 2004), with an expert for each context word
  • Use a “mixture of experts” proposal distribution
  • Can sample efficiently from the “experts” via alias tables

SLIDE 60

Bayesian Inference for MMSG Topic Model

  • Challenge 2: cluster assignment updates are almost deterministic, vulnerable to local maxima
  • Solution: simulated annealing
    – Anneal the temperature of the model by adjusting the Metropolis-Hastings acceptance probabilities
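One generic way to realize this (a standard simulated-annealing recipe, stated as an assumption rather than the paper's exact schedule): raise the target in the acceptance ratio to the power 1/T and decrease the temperature T over iterations,

$$a_T \;=\; \min\!\left(1, \left[\frac{p(z')\, q(z \mid z')}{p(z)\, q(z' \mid z)}\right]^{1/T}\right)$$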

SLIDE 61

Approximate MLE for the Mixed Membership Skip-Gram

  • Online EM is impractical
    – M-step is O(V)
    – E-step is O(KV)
  • Approximate online EM
    – Key insight: the MMSG topic model is equivalent to the word embedding model, up to the Dirichlet prior
      • Pre-solve the E-step via topic model CGS
      • Apply NCE to solve the M-step
    – The entire algorithm approximates maximum likelihood estimation via these two principled approximations

SLIDE 62

Approximate MLE for the Mixed Membership Skip-Gram

  • Online EM is impractical
    – M-step is O(V)
    – E-step is O(KV)
  • Approximate online EM
    – Key insight: the MMSG topic model is equivalent to the word embedding model, up to the Dirichlet prior
      • Pre-solve the E-step via topic model CGS
      • Apply Noise Contrastive Estimation (NCE) to solve the M-step
    – The entire algorithm approximates maximum likelihood estimation via these two principled approximations

SLIDE 63

Qualitative Results, NIPS Corpus

SLIDE 64

Qualitative Results, NIPS Corpus

SLIDE 65

Qualitative Results, NIPS Corpus

SLIDE 66

Qualitative Results, NIPS Corpus

SLIDE 67

Prediction Performance, NIPS Corpus

(similar results on three other small datasets, see the paper)

SLIDE 68

Prediction Performance, NIPS Corpus

Mixed membership models (with the posterior) beat naïve Bayes models, for both word embedding and topic models (similar results on three other small datasets; see the paper)

SLIDE 69

Prediction Performance, NIPS Corpus

Using the full context (posterior over topics, or summing vectors) helps all models except the basic skip-gram (similar results on three other small datasets; see the paper)

SLIDE 70

Prediction Performance, NIPS Corpus

Topic models beat their corresponding embedding models, for both naïve Bayes and mixed membership (similar results on three other small datasets; see the paper)

SLIDE 71

Downstream Tasks: Classification and Regression

  • Document categorization (classification accuracy, larger is better) and predicting the year of SOTU addresses (RMSE, smaller is better)
  • Vectors trained on the target corpus beat generic big-data vectors (except for SOTU, which is very small)
  • Skip-gram beats MMSG for classification/regression: a loss of granularity
  • But concatenating the different vectors improves performance over the individual embeddings
    – MMSG, SG, and generic Google vectors learn complementary information

SLIDE 72

Vector Composition in Topic Space

SLIDE 73

State of the Union Addresses (t-SNE Projection)

[Figure legend: GOP (red), near conservative topics; Democrats (blue), near liberal topics. Early parties: light green = Whigs, pink = Democratic-Republicans, orange = Federalists (John Adams), green = George Washington. Size = recency (year); bigger = more recent. Gray = topics.]

Addresses were embedded near-linearly by year!

A big gap between 1910 and 1930: 1914-1918 WWI; the 1930s Great Depression and FDR’s New Deal; 1939-1945 WWII

SLIDE 74

NIPS Authors

[Figure: t-SNE projection; blue = authors, gray = topics. Visible clusters: reinforcement learning, Bayesian methods, evaluating classifiers]

SLIDE 75

NIPS Documents

[Figure: t-SNE projection; color = recency (year), red = older, blue = newer; gray = topics]

SLIDE 76

Conclusion

  • Proposed mixed membership, topic model versions of skip-gram word embedding models
    – Statistically efficient, interpretable
  • Efficient training via MHW collapsed Gibbs + NCE
  • The proposed models improve prediction and are useful for computational social science
  • Ongoing/future work:
    – Scale to the big data setting
    – Document embeddings

Source code: https://github.com/jrfoulds/MixedMembershipWordEmbeddings

SLIDE 77

My Research Group: The Latent Lab