Neural Models for Documents with Metadata
Dallas Card, Chenhao Tan, Noah A. Smith
July 18, 2018
Outline

Main points of this talk:

1. Introducing Scholar¹: a neural model for documents with metadata
   - Background (LDA, SAGE, SLDA, etc.)
   - Model and related work
   - Experiments and results
2. The power of neural variational inference for interactive modeling

¹ Sparse Contextual Hidden and Observed Language Autoencoder
Latent Dirichlet Allocation
Blei, Ng, and Jordan. Latent Dirichlet Allocation. JMLR, 2003.
David Blei. Probabilistic topic models. Comm. ACM, 2012.
Types of metadata
- Date or time
- Author(s)
- Rating
- Sentiment
- Ideology
- etc.
Variations and extensions
- Author-topic model (Rosen-Zvi et al., 2004)
- Supervised LDA (SLDA; McAuliffe and Blei, 2008)
- Dirichlet-multinomial regression (Mimno and McCallum, 2008)
- Sparse additive generative models (SAGE; Eisenstein et al., 2011)
- Structural topic model (Roberts et al., 2014)
- ...
Desired features of model

- Fast, scalable inference
- Easy modification by end users
- Incorporation of metadata:
  - Covariates: features which influence the text (as in SAGE)
  - Labels: features to be predicted along with the text (as in SLDA)
- Possibility of sparse topics
- Incorporation of additional prior knowledge

→ Use variational autoencoder (VAE) style of inference (Kingma and Welling, 2014)
Desired outcome

- Coherent groupings of words (something like topics), with offsets for observed metadata
- Encoder to map from documents to latent representations
- Classifier to predict labels from the latent representation
Model

- Generator network: p(words | θi) = fg(θi)
- The true posterior p(θi | words) is intractable, so approximate it with a variational distribution q(θi | words)
- Encoder network: q(θi | words) = fe(words)
- ELBO = Eq[log p(words | θi)] − DKL[q(θi | words) ‖ p(θi)]
- Logistic normal on document representations: θi = softmax(ri), with prior ri ∼ N(0, I)
- Approximate the expectation with S Monte Carlo samples:
  ELBO ≈ (1/S) Σ_{s=1}^{S} log p(words | ri^(s)) − DKL[q(ri | words) ‖ p(ri)]
- Reparameterization trick: ri^(s) = μq + ε^(s) ⊙ σq, with ε^(s) ∼ N(0, I) (see the code sketch below)
- (Srivastava and Sutton, 2017; Miao et al., 2016)
- Finally, incorporate labels yi and covariates ci into both the encoder and the generator (detailed on the next slide)
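As a concrete illustration, here is a minimal PyTorch sketch of the sampled ELBO with the reparameterization trick. All names here (`encoder`, `generator`, etc.) are placeholders for exposition, assuming simple bag-of-words inputs; this is not the authors' released implementation.

```python
# Minimal sketch of the Monte Carlo ELBO with the reparameterization trick.
# Assumptions: encoder(word_counts) returns (mu, log_sigma); generator(theta)
# returns unnormalized logits over the vocabulary.
import torch
import torch.nn.functional as F

def elbo(word_counts, encoder, generator, n_samples=1):
    """word_counts: (batch, vocab) bag-of-words count vectors."""
    mu, log_sigma = encoder(word_counts)           # variational parameters
    total_recon = 0.0
    for _ in range(n_samples):
        eps = torch.randn_like(mu)                 # eps ~ N(0, I)
        r = mu + eps * log_sigma.exp()             # reparameterization trick
        theta = F.softmax(r, dim=-1)               # document-topic proportions
        log_p_words = generator(theta).log_softmax(dim=-1)
        total_recon = total_recon + (word_counts * log_p_words).sum(-1)
    recon = total_recon / n_samples
    # Closed-form KL between q(r | words) = N(mu, diag(sigma^2)) and p(r) = N(0, I)
    kl = 0.5 * (mu.pow(2) + (2 * log_sigma).exp() - 2 * log_sigma - 1).sum(-1)
    return (recon - kl).mean()                     # average over the batch
```

Maximizing this quantity with respect to both network parameters trains the encoder and generator jointly, exactly as in the VAE framing above.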
Scholar

- Generator network: p(word | θi, ci) = softmax(d + θiᵀ B(topic) + ciᵀ B(cov)) (sketched in code below)
- Optionally include interactions between topics and covariates
- Label network: p(yi | θi, ci) = fy(θi, ci)
- Encoder: μi = fμ(words, ci, yi) and log σi = fσ(words, ci, yi)
- Optional incorporation of word vectors to embed the input
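A hedged sketch of that generator, assuming plain linear deviation matrices; the class layout and initialization are illustrative choices, not taken from the paper's code.

```python
# Illustrative Scholar-style generator: word probabilities come from a
# background term d plus topic deviations B_topic and covariate deviations B_cov.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, n_topics, n_covariates, vocab_size):
        super().__init__()
        self.d = nn.Parameter(torch.zeros(vocab_size))  # background log-frequencies
        self.B_topic = nn.Parameter(0.01 * torch.randn(n_topics, vocab_size))
        self.B_cov = nn.Parameter(0.01 * torch.randn(n_covariates, vocab_size))

    def forward(self, theta, c):
        # p(word | theta_i, c_i) = softmax(d + theta_i^T B_topic + c_i^T B_cov)
        logits = self.d + theta @ self.B_topic + c @ self.B_cov
        return torch.softmax(logits, dim=-1)
```

Because the deviations enter additively in log space, each row of B(topic) and B(cov) can later be read off directly as a topic- or covariate-specific shift from the background.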
Optimization

- Stochastic optimization using mini-batches of documents
- Tricks from Srivastava and Sutton (2017):
  - Adam optimizer with a high learning rate to bypass mode collapse
  - Batch-norm layers to avoid divergence
  - Annealing away from the batch-norm output to keep results interpretable (sketched below)
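The annealing trick could look roughly like the following sketch: blend the batch-normed and raw logits, moving toward the raw logits over training so the final word distributions reflect the learned parameters directly. The schedule and the Adam hyperparameters in the comment are assumptions based on Srivastava and Sutton (2017), not values stated in this talk.

```python
# Hedged sketch of annealing away from the batch-norm output.
import torch
import torch.nn as nn

def annealed_batchnorm(logits: torch.Tensor, bn: nn.BatchNorm1d, eta: float) -> torch.Tensor:
    """Blend batch-normed and raw logits; anneal eta from 0 to 1 over training."""
    return (1.0 - eta) * bn(logits) + eta * logits

# Assumed optimizer setup, in the spirit of Srivastava and Sutton (2017):
# optimizer = torch.optim.Adam(model.parameters(), lr=0.002, betas=(0.99, 0.999))
```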
Output of Scholar

- B(topic), B(cov): coherent groupings of positive and negative deviations from the background (∼ topics); a topic-extraction sketch follows this list
- fμ, fσ: encoder network mapping from words to topics: θ̂i = softmax(fe(words, ci, yi, ε))
- fy: classifier mapping from θ̂i to labels: ŷi = fy(θ̂i, ci)
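Reading topics off B(topic) amounts to sorting each row; a minimal NumPy sketch (a hypothetical helper, not from the released code):

```python
# Print each topic's most positive and most negative word deviations.
import numpy as np

def top_words(B_topic: np.ndarray, vocab: list, n: int = 8):
    """B_topic: (n_topics, vocab_size) matrix of log-deviation weights."""
    for k, row in enumerate(B_topic):
        order = np.argsort(row)
        pos = [vocab[j] for j in order[-n:][::-1]]  # strongest positive deviations
        neg = [vocab[j] for j in order[:n]]         # strongest negative deviations
        print(f"Topic {k}: +{pos}  -{neg}")
```

The same helper applies to B(cov), where each row instead characterizes how a covariate value shifts word usage.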
Evaluation

1. Performance as a topic model, without metadata (perplexity and coherence; a coherence sketch follows this list)
2. Performance as a classifier, compared to SLDA
3. Exploratory data analysis
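Topic coherence can be measured several ways; one common choice is NPMI over each topic's top words, computed from document-level co-occurrence. The exact metric used in the talk is an assumption here; this sketch just makes the idea concrete.

```python
# NPMI coherence for one topic from document co-occurrence counts (illustrative).
import numpy as np
from itertools import combinations

def npmi_coherence(top_words: list, doc_word_sets: list) -> float:
    """top_words: a topic's top words; doc_word_sets: one set of words per doc."""
    n_docs = len(doc_word_sets)
    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1 = sum(w1 in d for d in doc_word_sets) / n_docs
        p2 = sum(w2 in d for d in doc_word_sets) / n_docs
        p12 = sum(w1 in d and w2 in d for d in doc_word_sets) / n_docs
        if p12 == 0:
            scores.append(-1.0)                # never co-occur: minimum NPMI
        elif p12 >= 1.0:
            scores.append(1.0)                 # degenerate: co-occur everywhere
        else:
            pmi = np.log(p12 / (p1 * p2))
            scores.append(pmi / -np.log(p12))  # normalize PMI into [-1, 1]
    return float(np.mean(scores))
```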
Quantitative results: basic model

[Figure: perplexity, coherence, and sparsity for LDA, SAGE, NVDM, Scholar, Scholar +wv (word vectors), and Scholar +sparsity on the IMDB dataset (Maas et al., 2011)]
Classification results
[Figure: classification accuracy of LR, SLDA, Scholar (labels), and Scholar (covariates) on the IMDB dataset (Maas et al., 2011)]
Exploratory Data Analysis
- Data: Media Frames Corpus (Card et al., 2015)
- Collection of thousands of news articles annotated in terms of tone and framing
- Relevant metadata: year of publication, newspaper, etc.
Tone as a label