

SLIDE 1

Neural Models for Documents with Metadata

Dallas Card, Chenhao Tan, Noah A. Smith
July 18, 2018

SLIDE 2

Outline

Main points of this talk:

  • 1. Introducing Scholar¹: a neural model for documents with metadata
      – Background (LDA, SAGE, SLDA, etc.)
      – Model and related work
      – Experiments and results
  • 2. Power of neural variational inference for interactive modeling

¹ Sparse Contextual Hidden and Observed Language Autoencoder

SLIDE 3

Latent Dirichlet Allocation

Blei, Ng, and Jordan. Latent Dirichlet Allocation. JMLR, 2003.
David Blei. Probabilistic topic models. Comm. ACM, 2012.

SLIDE 4

Types of metadata

  • Date or time
  • Author(s)
  • Rating
  • Sentiment
  • Ideology
  • etc.

SLIDE 5

Variations and extensions

  • Author-topic model (Rosen-Zvi et al., 2004)
  • Supervised LDA (SLDA; McAuliffe and Blei, 2008)
  • Dirichlet multinomial regression (Mimno and McCallum, 2008)
  • Sparse additive generative models (SAGE; Eisenstein et al., 2011)
  • Structural topic model (Roberts et al., 2014)
  • ...

SLIDE 6

Desired features of model

  • Fast, scalable inference
  • Easy modification by end-users

SLIDE 7

Desired features of model

  • Fast, scalable inference
  • Easy modification by end-users
  • Incorporation of metadata:
      – Covariates: features which influence the text (as in SAGE)
      – Labels: features to be predicted along with the text (as in SLDA)

SLIDE 8

Desired features of model

  • Fast, scalable inference
  • Easy modification by end-users
  • Incorporation of metadata:
      – Covariates: features which influence the text (as in SAGE)
      – Labels: features to be predicted along with the text (as in SLDA)
  • Possibility of sparse topics

SLIDE 9

Desired features of model

  • Fast, scalable inference
  • Easy modification by end-users
  • Incorporation of metadata:
      – Covariates: features which influence the text (as in SAGE)
      – Labels: features to be predicted along with the text (as in SLDA)
  • Possibility of sparse topics
  • Incorporate additional prior knowledge

SLIDE 10

Desired features of model

  • Fast, scalable inference
  • Easy modification by end-users
  • Incorporation of metadata:
      – Covariates: features which influence the text (as in SAGE)
      – Labels: features to be predicted along with the text (as in SLDA)
  • Possibility of sparse topics
  • Incorporate additional prior knowledge

→ Use variational autoencoder (VAE) style of inference (Kingma and Welling, 2014)

SLIDE 11

Desired outcome

  • Coherent groupings of words (something like topics), with offsets for observed metadata

SLIDE 12

Desired outcome

  • Coherent groupings of words (something like topics), with offsets for observed metadata
  • Encoder to map from documents to latent representations

SLIDE 13

Desired outcome

  • Coherent groupings of words (something like topics), with offsets for observed metadata
  • Encoder to map from documents to latent representations
  • Classifier to predict labels from the latent representation

SLIDE 14

Model

[Diagram: words w generated from document-topic vector θi (k topics)]

Generator network: p(w | θi) = fg(θi)

SLIDE 15

Model

[Diagram: words w generated from document-topic vector θi (k topics)]

Generator network: p(w | θi) = fg(θi)
Posterior: p(θi | w)

SLIDE 16

Model

[Diagram: words w generated from document-topic vector θi (k topics)]

Generator network: p(w | θi) = fg(θi)
Posterior p(θi | w), approximated by q(θi | w)

SLIDE 17

Model

[Diagram: words w generated from document-topic vector θi (k topics)]

Generator network: p(w | θi) = fg(θi)
Posterior p(θi | w), approximated by q(θi | w)

ELBO = Eq[log p(words | θi)] − DKL[q(θi | words) ‖ p(θi)]

SLIDE 18

Model

[Diagram: words w generated from document-topic vector θi (k topics)]

Generator network: p(w | θi) = fg(θi)
Encoder network: q(θi | w) = fe(w)

ELBO = Eq[log p(words | θi)] − DKL[q(θi | words) ‖ p(θi)]

SLIDE 19

Model

[Diagram: encoder outputs ri, with θi = softmax(ri)]

Generator network: p(w | θi) = fg(θi)
Encoder network: q(ri | w) = fe(w), with θi = softmax(ri)

ELBO = Eq[log p(words | ri)] − DKL[q(ri | words) ‖ p(ri)]

SLIDE 20

Model

i k

words generator network: p(w

i) = fg( )

words encoder network: q(

i

w) = fe( ) ri

k i = softmax(ri)

ELBO ≈ 1

S

S

s=1[log p(words | r (s) i

)] − DKL[q(ri | words)p(ri)]

13

SLIDE 21

Model

i k

words generator network: p(w

i) = fg( )

words encoder network: q(

i

w) = fe( ) ri

k i = softmax(ri)

(0, I)

ELBO ≈ 1

S

S

s=1[log p(words | r (s) i

)] − DKL[q(ri | words)p(ri)]

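Both q(ri | words) and the prior over ri are Gaussian here (following Srivastava and Sutton, 2017, who approximate a Dirichlet prior with a Gaussian), so the KL term requires no sampling. For the simplest case p(ri) = N(0, I), the standard closed form is:

DKL[q(ri | words) ‖ N(0, I)] = ½ Σk (µq,k² + σq,k² − log σq,k² − 1)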

SLIDE 22

Model

[Diagram: encoder outputs µq and σq; ri sampled via reparameterization]

Generator network: p(w | θi) = fg(θi)
Encoder network: q(ri | w) = fe(w), with θi = softmax(ri)
Reparameterization: ri^(s) = µq + ǫ(s) ⊙ σq, with ǫ ∼ N(0, I)

ELBO ≈ (1/S) Σs=1..S log p(words | ri^(s)) − DKL[q(ri | words) ‖ p(ri)]

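Because ri^(s) is now a deterministic function of µq, σq, and the noise ǫ(s), gradients flow through the sampling step and the whole model trains by backpropagation. Below is a minimal NumPy sketch of a single-document ELBO estimate under these equations; the names decode, mu, and log_sigma are illustrative assumptions rather than identifiers from the released code, and the N(0, I) prior is assumed for the KL term:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sampled_elbo(counts, mu, log_sigma, decode, S=1):
    # counts: bag-of-words count vector for one document
    # mu, log_sigma: encoder outputs parameterizing q(r_i | words)
    # decode: callable mapping theta to word probabilities p(w | theta)
    sigma = np.exp(log_sigma)
    recon = 0.0
    for _ in range(S):
        eps = np.random.standard_normal(mu.shape)  # eps ~ N(0, I)
        r = mu + eps * sigma                       # reparameterized sample
        theta = softmax(r)                         # document-topic proportions
        recon += counts @ np.log(decode(theta) + 1e-10)
    # Closed-form KL between N(mu, diag(sigma^2)) and N(0, I)
    kl = 0.5 * np.sum(mu**2 + sigma**2 - 2.0 * log_sigma - 1.0)
    return recon / S - kl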

SLIDE 23

Model

[Diagram: encoder outputs µq and σq; ri sampled via reparameterization]

Generator network: p(w | θi) = fg(θi)
Encoder network: q(ri | w) = fe(w), with θi = softmax(ri)
Reparameterization: ri^(s) = µq + ǫ(s) ⊙ σq, with ǫ ∼ N(0, I)

Srivastava and Sutton, 2017; Miao et al., 2016

SLIDE 24

Model

[Diagram: as above, with labels yi added]

Generator network: p(w | θi) = fg(θi)
Encoder network: q(ri | w) = fe(w), with θi = softmax(ri)
Reparameterization: ri^(s) = µq + ǫ(s) ⊙ σq, with ǫ ∼ N(0, I)

Labels yi

SLIDE 25

Model

[Diagram: as above, with labels yi and covariates ci added]

Generator network: p(w | θi) = fg(θi)
Encoder network: q(ri | w) = fe(w), with θi = softmax(ri)
Reparameterization: ri^(s) = µq + ǫ(s) ⊙ σq, with ǫ ∼ N(0, I)

Labels yi and covariates ci

SLIDE 26

Model

[Diagram: labels yi and covariates ci incorporated into the model]

Generator network: p(w | θi) = fg(θi)
Encoder network: q(ri | w, ci, yi) = fe(w, ci, yi), with θi = softmax(ri)
Reparameterization: ri^(s) = µq + ǫ(s) ⊙ σq, with ǫ ∼ N(0, I)

SLIDE 27

Scholar

Generator network: p(word | θi, ci) = softmax(d + θiᵀ B(topic) + ciᵀ B(cov))

SLIDE 28

Scholar

Generator network: p(word | θi, ci) = softmax(d + θiᵀ B(topic) + ciᵀ B(cov))

Optionally include interactions between topics and covariates

SLIDE 29

Scholar

Generator network: p(word | θi, ci) = softmax(d + θiᵀ B(topic) + ciᵀ B(cov))

Optionally include interactions between topics and covariates

Classifier: p(yi | θi, ci) = fy(θi, ci)

SLIDE 30

Scholar

Generator network: p(word | θi, ci) = softmax(d + θiᵀ B(topic) + ciᵀ B(cov))

Optionally include interactions between topics and covariates

Classifier: p(yi | θi, ci) = fy(θi, ci)

Encoder: µi = fµ(words, ci, yi), log σi = fσ(words, ci, yi)

Optional incorporation of word vectors to embed the input

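As a concrete reading of these equations, here is a NumPy sketch of the two output heads for a single document. The names d, B(topic), and B(cov) follow the slide; the single linear-softmax layer standing in for fy is an illustrative assumption (the model allows fy to be any network):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def word_probs(theta, c, d, B_topic, B_cov):
    # p(word | theta_i, c_i) = softmax(d + theta^T B(topic) + c^T B(cov));
    # optional topic-covariate interaction terms would add a third product.
    return softmax(d + theta @ B_topic + c @ B_cov)

def label_probs(theta, c, W, b):
    # p(y_i | theta_i, c_i) = f_y(theta_i, c_i); f_y is assumed here to be
    # a single softmax layer over [theta; c], purely for illustration.
    return softmax(np.concatenate([theta, c]) @ W + b)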

SLIDE 31

Optimization

  • Stochastic optimization using mini-batches of documents
  • Tricks from Srivastava and Sutton, 2017:
      – Adam optimizer with a high learning rate to bypass mode collapse
      – Batch-norm layers to avoid divergence
      – Annealing away from the batch-norm output to keep results interpretable

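The annealing trick can be read as a simple interpolation between two decoder outputs. A sketch under assumed names and an assumed linear schedule (the released implementation may schedule this differently):

def mixed_logits(logits_bn, logits_raw, step, anneal_steps=10000):
    # Early in training, rely on the batch-normed logits for stability;
    # gradually shift weight onto the raw logits so the learned
    # topic-word weights remain directly interpretable.
    p = min(1.0, step / anneal_steps)
    return (1.0 - p) * logits_bn + p * logits_raw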

SLIDE 32

Output of Scholar

  • B(topic), B(cov): coherent groupings of positive and negative deviations from the background (∼ topics)

SLIDE 33

Output of Scholar

  • B(topic), B(cov): coherent groupings of positive and negative deviations from the background (∼ topics)
  • fµ, fσ: encoder network mapping from words to topics: θ̂i = softmax(fe(words, ci, yi, ǫ))

SLIDE 34

Output of Scholar

  • B(topic), B(cov): coherent groupings of positive and negative deviations from the background (∼ topics)
  • fµ, fσ: encoder network mapping from words to topics: θ̂i = softmax(fe(words, ci, yi, ǫ))
  • fy: classifier mapping from θ̂i to labels: ŷ = fy(θ̂i, ci)

SLIDE 35

Evaluation

  • 1. Performance as a topic model, without metadata (perplexity, coherence)
  • 2. Performance as a classifier, compared to SLDA
  • 3. Exploratory data analysis

SLIDE 36

Quantitative results: basic model

[Bar chart: perplexity, coherence, and sparsity for LDA on the IMDB dataset (Maas et al., 2011)]

SLIDE 37

Quantitative results: basic model

[Bar chart: perplexity, coherence, and sparsity for LDA and SAGE on the IMDB dataset (Maas et al., 2011)]

SLIDE 38

Quantitative results: basic model

[Bar chart: perplexity, coherence, and sparsity for LDA, SAGE, and NVDM on the IMDB dataset (Maas et al., 2011)]

SLIDE 39

Quantitative results: basic model

[Bar chart: perplexity, coherence, and sparsity for LDA, SAGE, NVDM, and Scholar on the IMDB dataset (Maas et al., 2011)]

SLIDE 40

Quantitative results: basic model

[Bar chart: perplexity, coherence, and sparsity for LDA, SAGE, NVDM, Scholar, and Scholar +wv on the IMDB dataset (Maas et al., 2011)]

SLIDE 41

Quantitative results: basic model

[Bar chart: perplexity, coherence, and sparsity for LDA, SAGE, NVDM, Scholar, Scholar +wv, and Scholar +sparsity on the IMDB dataset (Maas et al., 2011)]

SLIDE 42

Classification results

[Bar chart: classification accuracy (0.5–1.0) for LR, SLDA, Scholar (labels), and Scholar (covariates) on the IMDB dataset (Maas et al., 2011)]

SLIDE 43

Exploratory Data Analysis

  • Data: Media Frames Corpus (Card et al., 2015)
  • Collection of thousands of news articles annotated in terms of tone and framing
  • Relevant metadata: year of publication, newspaper, etc.

SLIDE 44

Tone as a label

[Figure: immigration topics, each summarized by its top words (from "arrested charged charges agents operation" through "english language city spanish community"), ordered by p(pro-immigration | topic) from 0 to 1]

SLIDE 45

Tone as a covariate, with interactions

[Table: base topics with their anti-immigration and pro-immigration interaction terms, covering themes such as customs enforcement, courts and asylum, border patrol, visas and citizenship, and guest-worker and welfare legislation]

SLIDE 46

Conclusions

  • Variational autoencoders (VAEs) provide a powerful framework for latent variable modeling
  • We use the VAE framework to create a customizable model for documents with metadata
  • We obtain comparable performance with enhanced flexibility and scalability
  • Code is available: www.github.com/dallascard/scholar