TopicView: Visually Comparing Topic Models of Text Collections




SLIDE 1

TopicView: Visually Comparing Topic Models of Text Collections

November 7, 2011 Patricia Crossno, Andrew Wilson, Timothy Shead, Daniel Dunlavy Sandia National Laboratories

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

SLIDE 2

Modeling Text Data

  • Latent Semantic Analysis (LSA) vs Latent Dirichlet Allocation (LDA)

  • Similarities

    – Bag-of-words modeling
    – Transform text to term-document frequency matrices
    – User-defined number of dimensions
    – Produce weighted term lists for each concept/topic
    – Produce topic weights for each document
    – Results used to compute document relationship measures

  • Differences

    – LSA: truncated singular value decomposition (SVD) -> correlations (-1 to 1)
    – LDA: Bayesian model -> probabilities (0 to 1)
    – Output quantities have different ranges and meanings

  • Direct numeric comparison not meaningful
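
The shared preprocessing step listed above can be sketched as follows. This is a minimal illustration, not the presenters' pipeline: it builds a raw-count term-document matrix from a bag of words and compares documents by cosine similarity. The whitespace tokenizer, the raw counts, the choice of cosine as the relationship measure, and the toy documents are all assumptions made for the example. In practice LSA or LDA would be applied to the matrix before documents are compared.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term i in document j."""
    # Bag-of-words: word order is discarded, only per-document counts remain.
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def document_column(matrix, j):
    """Extract the term-count vector for document j."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy documents (hypothetical): the first two contain the same words in a different order.
docs = [
    "topic models compare text",
    "text models compare topic",
    "dance hall fire",
]
vocab, matrix = term_document_matrix(docs)
sim_01 = cosine(document_column(matrix, 0), document_column(matrix, 1))
sim_02 = cosine(document_column(matrix, 0), document_column(matrix, 2))
```

Because word order is ignored, the first two toy documents have identical count vectors, while the third shares no terms with them.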
SLIDE 3

Comparing LSA and LDA

  • Focus on how the models are used in applications
  • Conceptual content

    – Topic models
    – Labels

  • Document relationships

    – Scatter plots
    – Graphs
    – Landscapes

  • TopicView application

    – Visually compare and interactively explore models
    – Tabbed panels (Conceptual Content & Document Relationships)
    – Linked views
    – Built using Titan Informatics Toolkit

SLIDE 4

Term Topic Table

Detailed Conceptual Similarity

SLIDE 5

Bipartite Graph

High-level Conceptual Similarity

SLIDE 6

[Figure labels: LSA Concepts; LDA Topics]

SLIDE 7

Linked Selection

[Figure labels: Selected Concepts; Green = Selected Edges; Selected Topics]

SLIDE 8

Edge Display Controls

[Figure panels: All Edges; High Weight Edges]
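
One way to picture the two edge displays is as a weighted bipartite edge list that is either shown in full or filtered by a threshold. The sketch below is an assumption-laden illustration: the slides do not say how edge weights are computed, so cosine similarity between term-weight vectors is used here as a stand-in, and the function names, threshold value, and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def bipartite_edges(concepts, topics):
    """Return (concept_index, topic_index, weight) for every concept/topic pair."""
    return [
        (i, j, cosine(c, t))
        for i, c in enumerate(concepts)
        for j, t in enumerate(topics)
    ]


def high_weight_edges(edges, threshold):
    """Keep only edges whose weight reaches the threshold."""
    return [edge for edge in edges if edge[2] >= threshold]


# Toy term-weight vectors over a shared three-term vocabulary (hypothetical values).
concepts = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]  # stand-ins for LSA concepts
topics = [[0.8, 0.2, 0.0], [0.1, 0.1, 0.8]]    # stand-ins for LDA topics

all_edges = bipartite_edges(concepts, topics)     # the "All Edges" view
strong_edges = high_weight_edges(all_edges, 0.5)  # the "High Weight Edges" view
strong_pairs = [(i, j) for i, j, _ in strong_edges]
```

With these toy values every concept is connected to every topic in the full view, and only the two closely matching pairs survive the filter.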

SLIDE 9

Document Relationship Graphs

[Figure panels: LSA Document Similarity Graph; LDA Document Similarity Graph]

SLIDE 10

Document Topic Table

Document-Topic Weights

SLIDE 11

Document Full Text Reader

SLIDE 12

Alphabet Data Case Study

Synthetic Data for verification

– 26 clusters (one per letter), 10 documents each
– Each document contains only words starting with a single letter

  • absorbent autonomic appeals anthology aristocrats …
  • bacquire bairbags baiming babomination battorney bafter …
  • cadvisory cassumption cappears camount canthropology

– Each algorithm given concept/topic count of 26
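
A corpus with this shape can be sketched in a few lines. The generator below is an illustration under stated assumptions, not the presenters' procedure: the sample words on the slide look like ordinary words with the cluster letter prepended (for example `bacquire`), but that is an inference. The function name `make_alphabet_corpus`, the base vocabulary, the document length, and the random seed are all hypothetical.

```python
import random
import string


def make_alphabet_corpus(base_words, docs_per_letter=10, words_per_doc=20, seed=0):
    """Build {letter: [documents]} where every word in a document starts with that letter."""
    rng = random.Random(seed)  # fixed seed so the synthetic corpus is reproducible
    corpus = {}
    for letter in string.ascii_lowercase:
        corpus[letter] = [
            # Assumption: a word is formed by prepending the cluster letter to a base word.
            " ".join(letter + rng.choice(base_words) for _ in range(words_per_doc))
            for _ in range(docs_per_letter)
        ]
    return corpus


# Hypothetical base vocabulary; any list of ordinary words would do.
base_words = ["acquire", "airbags", "aiming", "abomination", "attorney", "after"]
corpus = make_alphabet_corpus(base_words)
```

The result has 26 clusters of 10 documents each, so a model given 26 concepts or topics has a known ground truth to recover.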

SLIDE 13

Alphabet Topic Similarity

SLIDE 14

Term/Topic Comparison

[Figure annotations: L, F, L?, F?]

SLIDE 15

Document-Topic Weights

[Figure annotations: L, F, L?, F? (shown twice)]

SLIDE 16

Clustering Evaluation

SLIDE 17

DUC Data Case Study

Document Understanding Conference (DUC) Data (real world)

– 30 clusters, ~10 documents each
– Human categorized around particular topic/event
– Associated Press articles
– New York Times articles
– Each algorithm given concept/topic count of 30

SLIDE 18

DUC Topic Similarity

SLIDE 19
SLIDE 20

LSA Combines Topics

[Figure labels: Pinochet Arrest; Dance Hall Fire; Timor Unrest]
[Figure annotations: Doc 121 connects Chile, Spanish, Fire; Doc 87 connects Pinochet & Timor]

SLIDE 21

LDA Combines Topics

[Figure labels: Bosnian Tribunal; Iranian Elections]

SLIDE 22

DUC Document Relationships

SLIDE 23

LDA Unexpected Connections

[Figure labels: Pinochet’s Arrest (44); Bosnian War Crimes Tribunal (36); Cold Weather Deaths (37); Palestinian Airport Closing (34)]

SLIDE 24

Documents more strongly connected to Topic 30 than conceptual topics

SLIDE 25

Topic 30 - AP wire source

SLIDE 26

Bridging documents: conceptual content outweighed by source content

SLIDE 27

LDA rerun without header tags

SLIDE 28

Conclusions

  • LSA concepts provide good summarizations over broad document groups

  • LDA topics are focused on smaller groups
  • LDA’s limited groups and probabilistic mechanism provide better labeling

  • LSA’s document relationships do not include extraneous connections between disparate topics

  • Better graphs
  • Better labels