SLIDE 1 TopicView: Visually Comparing Topic Models of Text Collections
November 7, 2011 Patricia Crossno, Andrew Wilson, Timothy Shead, Daniel Dunlavy Sandia National Laboratories
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
SLIDE 2 Modeling Text Data
- Latent Semantic Analysis (LSA) vs Latent Dirichlet Allocation (LDA)
– Bag-of-words modeling
– Transform text to term-document frequency matrices
– User-defined # of dimensions
– Produce weighted term lists for each concept/topic
– Produce topic weights for each document
– Results used to compute document relationship measures
– LSA: truncated singular value decomposition (SVD) -> correlations (-1 to 1)
– LDA: Bayesian model -> probabilities (0 to 1)
– Output quantities have different ranges and meanings
- Direct numeric comparison not meaningful
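The bag-of-words step above can be sketched in a few lines. This is a minimal illustration, not the TopicView or Titan implementation: the function name `term_document_matrix`, the whitespace tokenizer, and the sample documents are all assumptions made for the example.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix), where matrix[i][j] counts term i in document j."""
    # Assumed tokenizer: lowercase and split on whitespace.
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for terms a document does not contain.
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


# Hypothetical toy documents, not taken from the DUC data.
docs = ["fire dance hall fire", "arrest tribunal", "dance tribunal arrest arrest"]
vocab, matrix = term_document_matrix(docs)
```

Both LSA and LDA start from a matrix of this kind; the user-defined number of dimensions then fixes how many concepts or topics are extracted from it.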
SLIDE 3 Comparing LSA and LDA
- Focus on how models are used in applications
- Conceptual content
– Topic models
– Labels
– Scatter plots
– Graphs
– Landscapes
– Visually compare and interactively explore models
– Tabbed panels (Conceptual Content & Document Relationships)
– Linked views
– Built using Titan Informatics Toolkit
SLIDE 4
Term Topic Table
Detailed Conceptual Similarity
SLIDE 5
Bipartite Graph
High-level Conceptual Similarity
SLIDE 6
LSA Concepts LDA Topics
SLIDE 7
Linked Selection
Selected Concepts Green = Selected Edges Selected Topics
SLIDE 8
Edge Display Controls
All Edges High Weight Edges
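The difference between the "All Edges" and "High Weight Edges" displays can be sketched as a threshold filter over weighted concept-topic edges. The slides do not say how edge weights are computed or what cutoff is used, so the edge tuples, the `filter_edges` helper, and the 0.5 cutoff below are illustrative assumptions.

```python
def filter_edges(edges, min_weight=0.0):
    """Keep (concept, topic, weight) edges whose absolute weight is at least min_weight."""
    return [(c, t, w) for (c, t, w) in edges if abs(w) >= min_weight]


# Hypothetical edges between LSA concepts and LDA topics.
edges = [
    ("LSA 1", "LDA 3", 0.92),
    ("LSA 1", "LDA 7", 0.10),
    ("LSA 2", "LDA 7", 0.55),
]

all_edges = filter_edges(edges)                          # "All Edges" view
high_weight_edges = filter_edges(edges, min_weight=0.5)  # "High Weight Edges" view
```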
SLIDE 9
Document Relationship Graphs
LSA Document Similarity Graph LDA Document Similarity Graph
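A document similarity graph of this kind can be derived from per-document topic weights. The sketch below assumes cosine similarity and a fixed threshold. The slides do not state which relationship measure or cutoff TopicView uses, and the weights shown are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_edges(doc_weights, threshold):
    """Return (i, j, similarity) for each document pair at or above the threshold."""
    edges = []
    for i in range(len(doc_weights)):
        for j in range(i + 1, len(doc_weights)):
            s = cosine(doc_weights[i], doc_weights[j])
            if s >= threshold:
                edges.append((i, j, s))
    return edges


# Hypothetical document-topic weights: three documents, three topics.
doc_weights = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 0.1, 0.9]]
edges = similarity_edges(doc_weights, threshold=0.5)
```

Running the same procedure on LSA weights and on LDA weights gives two graphs over the same documents, which is what makes a side-by-side view possible even though the raw weights are not directly comparable.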
SLIDE 10
Document Topic Table
Document-Topic Weights
SLIDE 11
Document Full Text Reader
SLIDE 12 Alphabet Data Case Study
Synthetic Data for verification
– 26 clusters (one per letter), 10 documents each
– Each document contains only words starting with a single letter
- absorbent autonomic appeals anthology aristocrats …
- bacquire bairbags baiming babomination battorney bafter …
- cadvisory cassumption cappears camount canthropology
- …
– Each algorithm given concept/topic count of 26
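A data set with this shape could be generated along the following lines. This is only a sketch under stated assumptions: the sample words above suggest a cluster letter prefixed onto ordinary words, but the actual generation procedure is not given, and `make_alphabet_corpus`, `words_per_doc`, and the seed are hypothetical.

```python
import random
import string

# Base vocabulary borrowed from the sample words shown for the "a" cluster.
BASE_WORDS = ["absorbent", "autonomic", "appeals", "anthology", "aristocrats"]


def make_alphabet_corpus(base_words, docs_per_letter=10, words_per_doc=20, seed=0):
    """Build {letter: [document, ...]} where every word starts with that letter."""
    rng = random.Random(seed)  # fixed seed so the corpus is reproducible
    corpus = {}
    for letter in string.ascii_lowercase:
        # Assumption: prefix the cluster letter unless the word already starts with it.
        vocab = [w if w.startswith(letter) else letter + w for w in base_words]
        corpus[letter] = [
            " ".join(rng.choice(vocab) for _ in range(words_per_doc))
            for _ in range(docs_per_letter)
        ]
    return corpus


corpus = make_alphabet_corpus(BASE_WORDS)
```

Because the clusters share no vocabulary, a model given 26 concepts/topics has a known correct answer, which is what makes the data useful for verification.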
SLIDE 13
Alphabet Topic Similarity
SLIDE 14
Term/Topic Comparison
L F L? F?
SLIDE 15
Document-Topic Weights
L F L? F? L F L? F?
SLIDE 16
Clustering Evaluation
SLIDE 17
DUC Data Case Study
Document Understanding Conference (DUC) Data (real world)
– 30 clusters, ~10 documents each
– Human categorized around particular topic/event
– Associated Press articles
– New York Times articles
– Each algorithm given concept/topic count of 30
SLIDE 18
DUC Topic Similarity
SLIDE 19
SLIDE 20 LSA Combines Topics
Figure labels: Pinochet Arrest, Dance Hall Fire, Timor Unrest
– Doc 121 connects Chile, Spanish, Fire
– Doc 87 connects Pinochet & Timor
SLIDE 21 LDA Combines Topics
Figure labels: Bosnian Tribunal, Iranian Elections
SLIDE 22
DUC Document Relationships
SLIDE 23 LDA Unexpected Connections
– Pinochet’s Arrest (44)
– Bosnian War Crimes Tribunal (36)
– Cold Weather Deaths (37)
– Palestinian Airport Closing (34)
SLIDE 24
Documents more strongly connected to Topic 30 than conceptual topics
SLIDE 25
Topic 30 - AP wire source
SLIDE 26
Bridging documents: conceptual content outweighed by source content
SLIDE 27
LDA rerun without header tags
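One plausible reading of "without header tags" is that header elements were removed from each document before the model was rebuilt, so that source information such as the AP wire label could no longer form its own topic. The sketch below illustrates that preprocessing step only. The tag names in `HEADER_TAGS` and the sample document are hypothetical and are not taken from the DUC files.

```python
import re

# Hypothetical header element names; the real DUC markup may differ.
HEADER_TAGS = ("DOCNO", "FILEID", "HEAD")


def strip_header_elements(raw, header_tags=HEADER_TAGS):
    """Drop header elements with their content, then remove any remaining markup."""
    for tag in header_tags:
        pattern = rf"<{tag}>.*?</{tag}>"
        raw = re.sub(pattern, " ", raw, flags=re.DOTALL | re.IGNORECASE)
    raw = re.sub(r"<[^>]+>", " ", raw)  # strip leftover tags, keep their text
    return " ".join(raw.split())         # normalize whitespace


# Hypothetical document in an SGML-like layout.
raw = "<DOC><DOCNO>X-1</DOCNO><HEAD>AP wire</HEAD><TEXT>Fire at a dance hall.</TEXT></DOC>"
clean = strip_header_elements(raw)
```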
SLIDE 28 Conclusions
- LSA concepts provide good summarizations over broad document groups
- LDA topics are focused on smaller groups
- LDA’s limited groups and probabilistic mechanism provide better labeling
- LSA’s document relationships do not include extraneous connections between disparate topics
- LSA: better graphs
- LDA: better labels