Text Visualization
Maneesh Agrawala
CS 448B: Visualization Winter 2020
Announcements
Final project
New visualization research or data analysis project
- Research: Pose problem, implement creative solution
- Data analysis: Analyze dataset in depth & make a visual explainer
Deliverables
- Research: Implementation of solution
- Data analysis/explainer: Article with multiple interactive visualizations
- 6-8 page paper
Schedule
- Project proposal: Wed 2/19
- Design review and feedback: 3/9 and 3/11
- Final presentation: 3/16 (7-9pm), Location: TBD
- Final code and writeup: 3/18 11:59pm
Grading
- Groups of up to 3 people, graded individually
- Clearly report responsibilities of each member
https://docs.google.com/spreadsheets/d/1BtXmbQHrC3-chPT6kKS51Q-2p9XhbiM3Qct0N847yPM/edit?usp=sharing
- M 3/9, 4-6pm
- T 3/10, 7-8pm (SCPD only)
- W 3/11, 4-6pm
- Short presentation (5 min, mostly demo)
- Make sure there is time for questions
- Articles, books and novels
- Computer programs
- E-mails, web pages, blogs
- Tags, comments
- Messages (e-mail, blogs, tags, comments)
- Social networks (personal profiles)
- Academic collaborations (publications)
Evolution of the collection over time
Correlation with other data, e.g., a social network
- Initiatives by President Clinton
- Overhaul by President Obama

- News articles
- Speech transcriptions
- Legal documents
President Obama’s Health Care Speech to Congress
economix.blogs.nytimes.com/2009/09/09/obama-in-09-vs-clinton-in-93
Barack Obama 2009 Bill Clinton 1993
economix.blogs.nytimes.com/2009/09/09/obama-in-09-vs-clinton-in-93
How well does it convey the properties of the model?
How does the model enable us to reason about the text?
Where possible use text to represent text… … which terms are the most descriptive?
Provide relevant context to aid understanding Show (or provide access to) the source text
Determine your analysis task.
Understand the abstraction of your language models.
Match analysis task with appropriate tools and models.
- Correlations: Hong Kong, San Francisco, Bay Area
- Order: April, February, January, June, March, May
- Membership: Tennis, Running, Swimming, Hiking, Piano
- Hierarchy, antonyms & synonyms, entities, …
Segment text into terms.
- Remove stop words? a, an, the, of, to, be
- Numbers and symbols? #cardinal, @Stanford, OMG!!!!!!!!
- Entities? Palo Alto, O’Connor, U.S.A.
Group together different forms of a word.
- Porter stemmer? visualization(s), visualize(s), visually -> visual
- Lemmatization? goes, went, gone -> go
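A minimal sketch of this normalization pipeline. The stop-word list and suffix rules below are toy stand-ins for illustration, not the real Porter algorithm or a production tokenizer:

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "to", "be"}

def tokenize(text):
    # Lowercase and keep only runs of letters (a deliberately crude segmentation rule),
    # then drop stop words.
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def crude_stem(term):
    # Toy suffix stripping in the spirit of the Porter stemmer (not the real algorithm).
    for suffix in ("izations", "ization", "izes", "ize", "ly"):
        if term.endswith(suffix):
            return term[: -len(suffix)]
    return term

terms = [crude_stem(t) for t in tokenize("The visualization of visualizations")]
# terms == ["visual", "visual"]
```

A real pipeline would use a tested stemmer or lemmatizer (e.g., NLTK's PorterStemmer), since hand-rolled suffix rules quickly break on irregular forms.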
Each term corresponds to a dimension (10,000+).
Each value represents the relevance.
- For example, simple term counts
Document vector space model
Each document is a vector of term weights.
Simplest weighting is to just count occurrences.
Term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      157                    73              0             0        0         0
Brutus      4                      157             0             1        0         0
Caesar      232                    227             0             2        1         1
Calpurnia   0                      10              0             0        0         0
Cleopatra   57                     0               0             0        0         0
mercy       2                      0               3             5        5         1
worser      2                      0               1             1        1         0
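Building one column of such a term-document matrix is just counting. A sketch (the vocabulary and document below are illustrative):

```python
from collections import Counter

def term_vector(doc_terms, vocabulary):
    # One document -> a vector of raw term counts over a fixed vocabulary.
    # Terms absent from the document get a count of 0.
    counts = Counter(doc_terms)
    return [counts[t] for t in vocabulary]

vocab = ["antony", "brutus", "caesar"]
vec = term_vector(["antony", "caesar", "antony"], vocab)
# vec == [2, 0, 1]
```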
http://wordcount.org
https://books.google.com/ngrams/
Can help with gisting and initial query formation
- Sub-optimal visual encoding (size vs. position)
- Inaccurate size encoding (long words are bigger)
- May not facilitate comparison (unstable layout)
- Term frequency may not be meaningful
- Does not show the structure of the text
tf_{t,d} = count(t) in d
Can take log frequency: log(1 + tf_{t,d})
Can normalize to show proportion: tf_{t,d} / Σ_{t'} tf_{t',d}
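These three weightings can be sketched directly (the `mode` parameter name is my own):

```python
import math

def tf_weights(counts, mode="raw"):
    # counts: term -> raw count within a single document.
    if mode == "log":
        # Log frequency dampens the effect of very frequent terms.
        return {t: math.log(1 + c) for t, c in counts.items()}
    if mode == "norm":
        # Normalized frequency shows each term's proportion of the document.
        total = sum(counts.values())
        return {t: c / total for t, c in counts.items()}
    return dict(counts)

w = tf_weights({"caesar": 3, "brutus": 1}, mode="norm")
# w == {"caesar": 0.75, "brutus": 0.25}
```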
tf_{t,d} = count(t) in d
tf.idf_{t,d} = log(1 + tf_{t,d}) × log(N / df_t)
  df_t = # docs containing t; N = # of docs
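The formula above translates directly to code:

```python
import math

def tf_idf(tf_td, df_t, n_docs):
    # Log-scaled term frequency times inverse document frequency,
    # exactly as defined above: log(1 + tf) * log(N / df).
    return math.log(1 + tf_td) * math.log(n_docs / df_t)

# A term that appears in every document carries zero weight,
# no matter how frequent it is in this one:
# tf_idf(5, 10, 10) == 0.0
```

This is why tf.idf surfaces terms that are distinctive to a document, rather than merely frequent in it.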
Not clear that these provide the best description.
- Grammar / part-of-speech
- Position within document
- Recognizable entities
[Chuang 2012]
Bigrams (phrases of 2 words) are the most common.
Keyphrase length declines with more docs & more diversity.
Selected descriptive terms have medium commonness.
People avoid both rare and common words.
Commonness increases with more docs & more diversity.
Understand the limitations of your language model.
Bag of words:
- Easy to compute
- Single words
- Loss of word ordering
Select appropriate model and visualization:
- Generate longer, more meaningful phrases
- Adjective-noun word pairs for reviews
- Show keyphrases within source text
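The simplest step beyond single words is extracting adjacent word pairs. A minimal sketch:

```python
def bigrams(tokens):
    # Adjacent word pairs -- the simplest "longer phrase" unit,
    # recovering some of the word ordering that bag-of-words discards.
    return list(zip(tokens, tokens[1:]))

pairs = bigrams(["great", "battery", "life"])
# pairs == [("great", "battery"), ("battery", "life")]
```

Real keyphrase extraction would filter these by part-of-speech patterns (e.g., adjective-noun) and frequency, but the sliding-window idea is the same.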
TileBars [Hearst]
Lexical: <A> at <B>
Syntactic: <Noun> <Verb> <Object>
Look for specific linking patterns in the text: ‘A and B’, ‘A at B’, ‘A of B’, etc.
Could be output of regexp or parser.
Visualize extracted patterns in a node-link view:
- Occurrences → node size
- Pattern position → edge direction
- Darker color → higher ratio of out-edges to in-edges
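A regexp-based sketch of the extraction step. Matching capitalized words is a rough stand-in for proper entity detection, and the sample sentences are illustrative:

```python
import re
from collections import Counter

def extract_links(text, connector="and"):
    # Find 'A <connector> B' pairs between capitalized words.
    # A parser would do this more robustly; a regexp suffices for a sketch.
    pattern = rf"\b([A-Z][a-z]+) {connector} ([A-Z][a-z]+)\b"
    return Counter(re.findall(pattern, text))

links = extract_links(
    "Elizabeth and Darcy danced. Elizabeth and Darcy argued. Jane and Bingley talked."
)
# links[("Elizabeth", "Darcy")] == 2
```

The resulting counts map directly onto the node-link encoding above: pair frequency sets node and edge size, and the (A, B) order gives edge direction.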
Portrait of the Artist as a Young Man: X and Y
The Bible: X begat Y
Pride & Prejudice: X at Y
18th & 19th Century Novels: X’s Y
Old Testament: X of Y
New Testament: X of Y
Who (senders, receivers)
What (the content of communication)
When (temporal patterns)

What × When → Topic “Zeitgeist”
Who × Who → Social network
Who × Who × What × When → Information flow
Show correspondence patterns in text forums Initiate vs. reply; size and duration of discussion
Conversation by person over time (who x when)
One person over time, TF.IDF weighted terms
Washington Lobbyist
John Smith → PERSON
Soviet Union → COUNTRY
353 Serra St → ADDRESS
(555) 721-4312 → PHONE NUMBER
Simple approach: do they co-occur in a small window of text?
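The windowed co-occurrence test is a few lines of code. A sketch (the window size and sample sentence are illustrative):

```python
def cooccur(tokens, a, b, window=10):
    # Do entities a and b ever appear within `window` tokens of each other?
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

tokens = "John Smith met officials from the Soviet Union yesterday".split()
cooccur(tokens, "Smith", "Union", window=6)  # True
```

Aggregating these hits over a corpus yields the entity-entity edge weights for a relationship network.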
Assume documents are a mixture of topics.
Topics are (roughly) a set of co-occurring terms.
- Latent Semantic Analysis (LSA): reduce term matrix
- Latent Dirichlet Allocation (LDA): statistical model
Show salient (vs. frequent) terms. Seriate rows & columns.
with Jason Chuang, Dan Ramage & Christopher Manning
Oh, the humanities! (Insufficient number of topics)
Where possible use text to represent text… … which terms are the most descriptive?
Understand abstraction of your language models. Match analysis task with appropriate tools & models.