SLIDE 1

Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge

Ryan J. Gallagher

@ryanjgallag

github.com/gregversteeg/corex_topic

SLIDE 2

How to Topic Model with Literally Thousands of Information Bottlenecks 🍿

SLIDE 3

NAACL 2018, New Orleans, LA @ryanjgallag

LDA is a generative topic model

SLIDE 4

LDA is a generative topic model

The Good:

Priors explicitly encode your beliefs about what topics can be, and easily allow for iterative development of new topic models

SLIDE 5

Domain Knowledge via Dirichlet Forest Priors

“Incorporating Domain Knowledge into Topic Modeling via Dirichlet Forest Priors.” Andrzejewski et al. ICML (2009).

SLIDE 6

Domain Knowledge via First-Order Logic

“A Framework for Incorporating General Domain Knowledge into Latent Dirichlet Allocation Using First-Order Logic.” Andrzejewski et al. IJCAI (2011).

SLIDE 7

SeededLDA

“Incorporating Lexical Priors into Topic Models.” Jagarlamudi et al. EACL (2012).

SLIDE 8

Hierarchical LDA

“Hierarchical Topic Models and the Nested Chinese Restaurant Process.” Griffiths et al. Neural Information Processing Systems (2003).

SLIDE 9

A Generative Modeling Tradeoff

The Good:

Priors explicitly encode your beliefs about what topics can be, and easily allow for iterative development of new topic models

The Bad:

Each additional prior takes a very specific view of the problem at hand, which both limits what a topic can be and makes it harder to justify in applications and to domain experts

SLIDE 10

Proposed Work

We propose a topic model that learns topics through information-theoretic criteria, rather than a generative model, within a framework that yields hierarchical and semi-supervised extensions with no additional assumptions

SLIDE 13

A Different Perspective on “Topics”

LDA: a topic is a distribution over words

CorEx: a topic is a binary latent factor

Consider three documents:

SLIDE 19

CorEx Objective (example)

Documents

Probability table: 1/2, 1/2

Words 1 and 2 are related:

Hypothesize a latent factor:

Then, conditioned on the latent factor, words 1 and 2 are independent

Goal: find latent factors that make words conditionally independent
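
The figure for this example did not survive extraction, so the following is only a minimal sketch under an assumed toy corpus: half of the documents contain neither word and half contain both, and the hypothesized latent factor simply records which case applies. The function names and the corpus are illustrative assumptions, not taken from the talk. The sketch checks numerically that the words share one bit of total correlation, and that conditioning on the latent factor removes it.

```python
from collections import defaultdict
from math import log2


def entropy(dist):
    """Shannon entropy in bits of a {outcome: probability} mapping."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def marginal(joint, idx):
    """Marginalize a joint {tuple: prob} onto the coordinates listed in idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return dict(out)


def total_correlation(joint, n):
    """TC(X) = sum_i H(X_i) - H(X), over the first n coordinates."""
    singles = sum(entropy(marginal(joint, [i])) for i in range(n))
    return singles - entropy(marginal(joint, range(n)))


def conditional_total_correlation(joint, n, y_idx):
    """TC(X | Y) = sum_y p(y) * TC(X | Y = y)."""
    tc = 0.0
    for (y,), p_y in marginal(joint, [y_idx]).items():
        cond = {o: p / p_y for o, p in joint.items() if o[y_idx] == y}
        tc += p_y * total_correlation(cond, n)
    return tc


# Assumed toy distribution over (word 1, word 2, latent factor y).
joint = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

tc_x = total_correlation(joint, 2)                        # dependence between the words
tc_x_given_y = conditional_total_correlation(joint, 2, 2)  # what is left after conditioning
print(tc_x, tc_x_given_y)
```

In this assumed corpus the latent factor explains all of the dependence between the two words, which is the situation the slide describes.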

SLIDE 22

CorEx Objective

Goal: find latent factors that make words conditionally independent

Total correlation conditioned on Y

The total correlation conditioned on Y is zero if and only if the topic “explains” all the dependencies (total correlation)

Hence, “Total Correlation Explanation” (CorEx)
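
The equations on this slide were lost in extraction. For reference, the standard definitions of total correlation and its conditional form are sketched below. The notation ($X_G$ for a group of words, $Y$ for the latent topic) is an assumption chosen for this note and may differ from the symbols shown in the talk.

```latex
\begin{align}
TC(X_G) &= \sum_{i \in G} H(X_i) - H(X_G) \\
TC(X_G \mid Y) &= \sum_{i \in G} H(X_i \mid Y) - H(X_G \mid Y) \\
TC(X_G \mid Y) = 0 &\iff p(x_G \mid y) = \prod_{i \in G} p(x_i \mid y)
\end{align}
```

The last line is the conditional-independence statement of the goal: the conditional total correlation vanishes exactly when the words factorize given the topic.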

SLIDE 24

CorEx Objective

Goal: find latent factors that make words conditionally independent

To maximize the information between a group of words and a topic, we consider a tractable lower bound:

We maximize this lower bound over topics

SLIDE 27

CorEx Objective

We can now rewrite the objective:

We transform this from a combinatorial to a continuous optimization by introducing variables and “relaxing” words into informative topics

This relaxation yields a set of update equations that we iterate until convergence
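
The objective itself was an image and is missing from this transcript. The following is a hedged reconstruction. The first line is an identity that follows directly from the definitions of total correlation and mutual information. The second and third lines assume the group notation $G_j$ and the relaxation weights $\alpha_{i,j}$ of the paper this talk presents, and should be checked against that paper before being relied on.

```latex
\begin{align}
TC(X_G; Y) &= TC(X_G) - TC(X_G \mid Y)
            = \sum_{i \in G} I(X_i : Y) - I(X_G : Y) \\
\max_{G_j,\; p(y_j \mid x_{G_j})} &\; \sum_{j=1}^{m}
   \Big( \sum_{i \in G_j} I(X_i : Y_j) - I(X_{G_j} : Y_j) \Big) \\
\max_{\alpha,\; p(y_j \mid x)} &\; \sum_{j=1}^{m}
   \Big( \sum_{i=1}^{n} \alpha_{i,j}\, I(X_i : Y_j) - I(X : Y_j) \Big),
   \qquad \alpha_{i,j} \in [0, 1]
\end{align}
```

Read this way, choosing the groups $G_j$ is the combinatorial problem, and letting the membership indicators $\alpha_{i,j}$ take values in $[0, 1]$ is the continuous relaxation the slide refers to.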

SLIDE 31

CorEx Objective

We can now rewrite the objective:

We transform this from a combinatorial to a continuous optimization by introducing variables and “relaxing” words into informative topics

Under the hood:

1. We introduce a sparsity optimization for the update equations, by assuming words are represented by binary random variables
2. The current relaxation scheme places each word in one topic, resulting in a partition of the vocabulary, rather than mixed membership topics

These are issues of speed, not theory

SLIDE 32

CorEx Topic Examples

Work by Abigail Ross and the Computational Story Lab, University of Vermont

Data: news articles about Hillary Clinton’s presidential campaign, up to August 2016

SLIDE 36

CorEx Topic Examples

Clinton Article Topics

1: server, department, classified, information, private, investigation, fbi, email, emails, secretary
3: sanders, bernie, primary, vermont, win, voters, race, nomination, vote, polls
6: crowd, woman, speech, night, women, stage, man, mother, audience, life
8: percent, poll, points, percentage, margin, survey, according, 10, polling, university
9: federal, its, officials, law, including, committee, staff, statement, director, group
13: islamic, foreign, military, terrorism, war, syria, iraq, isis, u, terrorist
14: trump, donald, trump’s, republican, nominee, party, convention, top, election, him

Words ranked by mutual information with topic

Topics ranked by total correlation (topic 1 is the most informative topic)

SLIDE 40

CorEx Performs Favorably Against LDA

SLIDE 42

CorEx Extensions

With no additional assumptions, the CorEx topic model yields two extensions:

1. A hierarchical topic model
2. A semi-supervised topic model at the word level

SLIDE 45

Hierarchical CorEx

Text (binarized documents) → CorEx → Binary latent topics → …

Input: binary topic representations over docs
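
As a rough illustration of what "binarized documents" means as model input, here is a minimal sketch that turns tokenized documents into a 0/1 document-word matrix. The function name and the tiny corpus are assumptions made for this example, not part of the talk or of the corex_topic code.

```python
def binarize_documents(tokenized_docs):
    """Return (vocabulary, matrix) where matrix[d][w] is 1 if word w occurs in doc d."""
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    index = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for doc in tokenized_docs:
        row = [0] * len(vocabulary)
        for tok in doc:
            row[index[tok]] = 1  # presence only; repeated words are not counted
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical three-document corpus.
docs = [["flood", "relief", "flood"], ["relief", "aid"], ["volcano"]]
vocab, doc_word = binarize_documents(docs)
print(vocab)
print(doc_word)
```

A second layer would receive an analogous 0/1 matrix of shape (# docs) x (# topics), built from the binary topic labels of the first layer instead of from words.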

SLIDE 46

Hierarchical CorEx

Data: ~20,000 humanitarian assistance and disaster relief news articles

SLIDE 51

Anchored CorEx and the Information Bottleneck

Objective:

Maintain information about individual words

Compress documents into topics

Together, these two terms form an information bottleneck

“The Information Bottleneck Method.” Tishby et al. (2000).

A user can anchor words to the latent topics by modifying the weight of the relationship between a word and a topic
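
The objective on this slide was an image and is not in the transcript. The sketch below annotates a plausible form of it with the two labels from the slide. The symbols $\alpha_{i,j}$ (word-topic weights) and $\beta_{i,j}$ (anchor strength) are assumptions based on the paper this talk presents, and the anchoring constraint in the second line is one way to express "modifying the weight", not a confirmed transcription.

```latex
\begin{align}
\max_{\alpha,\; p(y_j \mid x)} \; \sum_{j=1}^{m} \Big(
  \underbrace{\sum_{i=1}^{n} \alpha_{i,j}\, I(X_i : Y_j)}_{\text{maintain information about individual words}}
  \; - \;
  \underbrace{I(X : Y_j)}_{\text{compress documents into topics}}
\Big) \\
\text{anchor word } i \text{ to topic } j: \quad \alpha_{i,j} = \beta_{i,j}, \qquad \beta_{i,j} \ge 1
\end{align}
```

Under this reading, a larger $\beta_{i,j}$ makes topic $j$ keep more information about anchored word $i$, which is how domain knowledge enters without a new prior.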

SLIDE 54

Anchoring Strategies

Topic Representation: anchoring to unveil topics that do not naturally emerge (example anchor words: avalanche, snow, freezing, lava, volcano)

Topic Separability: anchoring to help enforce separation between topics (example anchor words: social, media, platform, science, computational)

Topic Aspects: anchoring to disambiguate different frames around a word (example anchor word: election)

SLIDE 55

Anchoring for Topic Representation

Work by Abigail Ross and the Computational Story Lab, University of Vermont

Data: news articles about the campaigns of Clinton and Trump, up to August 2016

Method: train one CorEx topic model for each corpus, anchor words for comparison

Anchor words: white, whites, immigrants, immigration, muslims, islam

SLIDE 56

Clinton Topic 1: immigration, immigrants, jobs, economic, trade, health, tax, wall, care, economy
Trump Topic 1: immigration, immigrants, illegal, border, mexican, undocumented, rapists, mexico, wall, illegally

SLIDE 57

Clinton Topic 2: muslims, islam, islamic, gun, terrorism, war, military, iraq, terrorist, syria
Trump Topic 2: muslims, islam, united, ban, entering, islamic, muslim, terrorism, terrorist, terrorists

SLIDE 58

Clinton Topic 3: white, i, you, what, do, if, we, it’s, like, people
Trump Topic 3: white, house, whites, supremacists, supremacist, duke, klan, klux, ku, supremacy

SLIDE 59

Anchoring for Topic Aspects

Work by Brendan Kennedy and Greg Ver Steeg, Information Sciences Institute

Data: ~1 million English newswire articles since June 2015 from countries in the Middle East

Anchor word: aleppo

Note: this data broadly covers the Middle East and a priori we do not expect 10 topics to emerge about Aleppo

SLIDE 61

Anchoring for Topic Aspects

Topics anchored on “aleppo”:

1: aleppo, killed, police, security, attack, state, arrested, authorities
2: aleppo, forces, syria, military, war, army, civilians, iraq, militants
3: aleppo, health, medical, food, care, water, small, conditions, treatment, patients
4: country, aleppo, east, across, group, region, middle
5: two, aleppo, took, another, place, taking, leaders
6: aleppo, russia, iran, barack, obama, moscow, washington, putin
7: aleppo, political, court, part, accused, opposition, called, saying, parliament, democratic
8: government, aleppo, minister, foreign, states, united, prime, UN, law, nations
9: aleppo, city, area, near, air, northern, least, town, eastern, injured
10: aleppo, people, children, human, rights, women, social, school, society, lives

SLIDE 67

Shape of the CorEx Topic Model to Come

CorEx Topic Model

By defining topics in terms of information content, the CorEx topic model takes a new perspective on topic modeling

CorEx is competitive with unsupervised and semi-supervised variants of LDA while making far fewer assumptions

Anchoring through the information bottleneck provides a flexible mechanism to retrieve topics of interest and inject expert domain knowledge

Code: github.com/gregversteeg/corex_topic

Future Work

Extend CorEx to efficiently learn multi-membership topics (in progress)

Incorporate count data into the CorEx topic model while preserving the benefits of the sparsity optimization

SLIDE 72

Collaborators

Greg Ver Steeg, Research Professor, Information Sciences Institute

David Kale, CS PhD Candidate, Information Sciences Institute

Kyle Reing, CS PhD Student, Information Sciences Institute

The anchored Clinton and Trump election article topics come from work by Abigail Ross and the Computational Story Lab at the University of Vermont’s Complex Systems Center

SLIDE 73

@ryanjgallag ryanjgallag@gmail.com

github.com/gregversteeg/corex_topic

Thank you for your time!

SLIDE 74

CorEx Implementation

Update Equations

Marginals in terms of the optimization parameter

Probabilistic labels for each latent factor given sample

Sparsity Optimization

Substituting above turns the sum into a matrix multiplication between a matrix of size (# docs) x (# types) and a matrix of size (# types) x (# topics)
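
The update equations are not in the transcript, so the following only sketches the general idea behind a sparsity optimization for binary words, and is not the repository's implementation. A sum of log-probabilities over every word type can be split into a baseline for the all-zero document plus a correction for the words that actually occur. The correction terms form a (# types) x (# topics) table, so applying them to a sparse (# docs) x (# types) matrix is a sparse matrix product. All names and probabilities below are hypothetical.

```python
from math import log


def dense_log_likelihood(doc_row, p_present):
    """Sum over all word types of log p(x_i | y) for one binary document."""
    return sum(log(p) if x else log(1.0 - p) for x, p in zip(doc_row, p_present))


def sparse_log_likelihood(present_ids, p_present, baseline):
    """Same quantity, touching only the word types that occur in the document."""
    correction = sum(log(p_present[i]) - log(1.0 - p_present[i]) for i in present_ids)
    return baseline + correction


# Hypothetical p(x_i = 1 | y) for one topic state, over four word types.
p_present = [0.9, 0.2, 0.5, 0.1]
baseline = sum(log(1.0 - p) for p in p_present)  # log-likelihood of an empty document

doc_row = [1, 0, 0, 1]   # dense binary representation
present_ids = [0, 3]     # sparse representation of the same document

dense = dense_log_likelihood(doc_row, p_present)
sparse = sparse_log_likelihood(present_ids, p_present, baseline)
print(dense, sparse)
```

The sparse version does work proportional to the number of words in the document, not to the vocabulary size, which is where the speedup would come from on real text.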

SLIDE 75

Sparsity Optimization Speed Comparison

SLIDE 76

CorEx Example Topics

Work by Abigail Ross and the Computational Story Lab, University of Vermont

Data: news articles about Clinton and Trump

Method: train one CorEx topic model for each corpus

Clinton Article Topics

1: server, department, classified, information, private, investigation, fbi, email, emails, secretary
3: sanders, bernie, primary, vermont, win, voters, race, nomination, vote, polls
8: percent, poll, points, percentage, margin, survey, according, 10, polling, university
9: federal, its, officials, law, including, committee, staff, statement, director, group
13: islamic, foreign, military, terrorism, war, syria, iraq, isis, u, terrorist
14: trump, donald, trump’s, republican, nominee, party, convention, top, election, him

Trump Article Topics

1: primary, party, win, cruz, delegates, voters, ted, nomination, republicans, vote
4: $, tax, money, million, jobs, economic, companies, billion, pay, taxes
7: percent, poll, percentage, points, polls, survey, 10, polling, margin, according
12: crowd, rally, night, event, speech, stage, audience, spoke, wife, took
14: rubio, marco, jeb, bush, carson, florida, ben, candidates, iowa, gov
25: clinton, hillary, bernie, sanders, democratic, clinton’s, her, she, vermont, secretary

SLIDE 77

Comparisons to Semi-Supervised LDA

SLIDE 78

Anchoring Experiment

Data: HA/DR news articles and clinical health notes

For each document label:

1. Determine anchor words by finding the words with the highest mutual information with the label
2. Anchor one topic of a CorEx topic model with the label's anchor words
3. Run an unsupervised CorEx topic model with the same random seed
4. Compute the difference in the metric as a matched pair
5. Analyze the distribution of the metric across models
6. Repeat 30 times
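
The anchor-word selection step above can be sketched as follows. This is a minimal illustration that assumes binary document-word data and a binary document label. The function names, the toy matrix, and the label are hypothetical and not taken from the experiment.

```python
from math import log2


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length binary sequences."""
    n = len(xs)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(1 for x, y in zip(xs, ys) if x == a and y == b) / n
            p_a = sum(1 for x in xs if x == a) / n
            p_b = sum(1 for y in ys if y == b) / n
            if p_ab > 0:
                mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


def top_anchor_words(doc_word, vocabulary, labels, k):
    """Return the k words with the highest mutual information with the label."""
    scores = []
    for j, word in enumerate(vocabulary):
        column = [row[j] for row in doc_word]
        scores.append((mutual_information(column, labels), word))
    scores.sort(key=lambda s: (-s[0], s[1]))  # highest score first, ties by word
    return [word for _, word in scores[:k]]


# Hypothetical binary document-word matrix and document label.
vocabulary = ["aid", "flood", "relief", "volcano"]
doc_word = [
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]
labels = [1, 1, 0, 0]  # e.g. "document is about flooding"

anchors = top_anchor_words(doc_word, vocabulary, labels, k=1)
print(anchors)
```

The selected words would then be passed as the anchors of a single topic, with the unsupervised run on the same seed serving as the matched control.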

SLIDE 79

Anchoring Experiment: Effect of Parameter

SLIDE 80

Anchoring Experiment: Heterogeneity of Effects

SLIDE 81

Anchoring for Topic Aspects

Data: ~870,000 unique tweets containing #Ferguson from Aug. 9th-Nov. 30th, 2014

Anchor words: protest, protests, riot, riots

“protest” Topics

1: protest, protests, peaceful, violent, continue, night, island, photos, staten, nights
2: protest, protests, #hiphopmoves, #cole, hiphop, nationwide, moves, fo, anheuser, boeing
3: protest, protests, st, louis, guard, national, county, patrol, highway, city
4: protest, protests, paddy, covering, beverly, walmart, wagon, hills, passionately, including
5: protest, protests, solidarity, march, square, rally, #oakland, downtown, nyc, #nyc

“riot” Topics

6: riot, riots, unheard, language, inciting, accidentally, jokingly, watts, waving, dies
7: riot, black, riots, white, #tcot, blacks, men, whites, race, #pjnet
8: riot, riots, looks, like, sounds, acting, act, animals, looked, treated
9: riot, riots, store, looting, businesses, burning, fire, looted, stores, business
10: gas, riot, tear, riots, gear, rubber, bullets, military, molotov, armored