Deconstructing Data Science
David Bamman, UC Berkeley Info 290 Lecture 11: Topic models Feb 29, 2016
Latent variables
A latent variable is one that's unobserved, either because:
we are predicting it (but have … variable for other data points)
Data            | Observed variables           | Latent variables
email           | text, date, sender           | topic
novels          | text, author, pub date       | genre, topic
social network  | nodes, friendship structure  | communities
fitbit data     | accelerometer output         | steps, sleep patterns
legislators     | voting behavior, speeches    | political preference
netflix users   | watching behavior, ratings   | genre preference
probability:
P(x, y) = P(y) P(x | y)
[Figure: graphical model with nodes y and x]
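The factorization P(x, y) = P(y) P(x | y) can be checked on a toy example. This is only a sketch: the labels, words, and probabilities below are hypothetical and do not come from the lecture. Here y plays the role of a latent label and x an observed word.

```python
# Hypothetical toy numbers: y is a latent label, x is an observed word.
p_y = {"sports": 0.6, "politics": 0.4}
p_x_given_y = {
    "sports": {"ball": 0.7, "vote": 0.3},
    "politics": {"ball": 0.1, "vote": 0.9},
}

def joint(x, y):
    """P(x, y) = P(y) * P(x | y)."""
    return p_y[y] * p_x_given_y[y][x]

def posterior(x):
    """P(y | x), obtained by normalizing the joint over all values of y."""
    scores = {y: joint(x, y) for y in p_y}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

print(joint("vote", "politics"))
print(posterior("vote"))
```

Normalizing the joint over y is what lets us reason about the unobserved variable once x has been seen.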
“topics” or “themes” (groups of terms that tend to
How do tokens cluster into topics?
documents, number of clusters to learn.
document
each word in doc
… The messenger, however, does not reach Romeo and, instead, Romeo learns of Juliet's apparent death from his servant Balthasar. Heartbroken, Romeo buys poison from an apothecary and goes to the Capulet
Juliet privately. Believing Romeo to be a vandal, Paris confronts him and, in the ensuing battle, Romeo kills
stabs herself with his dagger. The feuding families and the Prince meet at the tomb to find all three dead. Friar Laurence recounts the story of the two "star-cross'd lovers". The families are reconciled by their children's deaths and agree to end their violent feud. The play ends with the Prince's elegy for the lovers: "For never was a story of more woe / Than this of Juliet and her Romeo."
topic models cluster tokens into “topics”
Topic labels highlighted in the passage: “Death”, “Love”, “Family”, “Etc.”
tokens, not types
“People”
A different Paris token might belong to a “Place” or “French” topic
http://www.rci.rutgers.edu/~ag978/quiet/
Feature                  | Value
follow clinton           | …
follow trump             | …
“republican” in profile  | …
“democrat” in profile    | …
“benghazi”               | 1
topic 1                  | 0.55
topic 2                  | 0.32
topic 3                  | 0.13
x = feature vector
Feature                  | β
follow clinton           | …
follow trump             | 6.8
“republican” in profile  | 7.9
“democrat” in profile    | …
“benghazi”               | …
topic 1                  | 0.3
topic 2                  | …
topic 3                  | 5.7
β = coefficients
http://mallet.cs.umass.edu/
https://radimrehurek.com/gensim/
https://github.com/uwdata/termite-visualizations
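[Figure: plate diagram for the topic model. Nodes: α, θ, z, w, φ, γ; plates: W, D]
θ: document distribution over topics
z: topic indicators for words
w: words
φ: topic distribution
[Figure: example θ, a bar chart over the topics war, love, chases, boats, aliens, family; probability axis from 0.0 to 0.4]
[Figure: example φ, a bar chart over the words death, die, kill, dead, love, like, adore, care, mother, father, child, son, the, do; probability axis from 0.00 to 0.20]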
K=20
[Figure: the plate diagram (z, w, θ, φ, α, γ; plates W, D) is repeated on each of the following steps]
θ for one document, over the topics war, love, chases, boats, aliens, family [Figure: bar chart, probability axis from 0.0 to 0.4]
Draw a topic for each word from P(topic | topic distribution): ? ? ? ? → war ? ? ? → war aliens ? ? → war aliens war ? → war aliens war love
Draw a word for each topic from φ: ? ? ? ? → “fights” “alien” “kills” “marries”
[Figure: example φ, a bar chart over the words death, die, kill, dead, love, like, adore, care, mother, father, child, son, the]
A second document, drawing topics from P(topic | topic distribution): ? ? ? ? → aliens ? ? ? → aliens family ? ? → aliens family aliens ? → aliens family aliens love
Draw a word for each topic from φ: ? ? ? ? → “ET” “mom” “space” “friend”
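The walk-through above (draw a topic for each word from the document's topic distribution, then draw the word from that topic's word distribution) can be sketched in a few lines. The topic names are taken from the slides; every probability, the extra words "runs" and "sails", and the function names are hypothetical, chosen only to make the sketch runnable.

```python
import random

# Topic names come from the slides; all numbers and word lists are hypothetical.
topics = ["war", "love", "chases", "boats", "aliens", "family"]
theta = [0.35, 0.20, 0.05, 0.05, 0.25, 0.10]  # one document's distribution over topics
phi = {  # each topic's distribution over words
    "war": {"fights": 0.5, "kills": 0.5},
    "love": {"marries": 0.6, "friend": 0.4},
    "chases": {"runs": 1.0},
    "boats": {"sails": 1.0},
    "aliens": {"alien": 0.4, "ET": 0.3, "space": 0.3},
    "family": {"mom": 1.0},
}

def sample_categorical(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(theta, phi, topics, n_words, rng):
    """For each word position: draw a topic z from theta, then a word w from phi[z]."""
    doc = []
    for _ in range(n_words):
        topic = topics[sample_categorical(theta, rng)]
        vocab = list(phi[topic])
        word = vocab[sample_categorical([phi[topic][v] for v in vocab], rng)]
        doc.append((topic, word))
    return doc

rng = random.Random(0)
for topic, word in generate_document(theta, phi, topics, 4, rng):
    print(topic, "->", word)
```

Inference runs this story in reverse: only the words are observed, and θ, z, and φ have to be recovered.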
each document?
each word in a document?
each topic?
Find the parameters that maximize the likelihood of the data!
Metropolis Hastings, etc.)
et al. 2013)
method for approximating the joint distribution of a set
Metropolis et al. 1953; Hastings 1970)
Josiah Gibbs
the variables
conditioned on all of the other variables around it (using Bayes’ theorem)
P(θ | X) ∝ P(θ) P(X | θ)
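$$P(\theta_d \mid \alpha, z_d) \propto P(\theta_d \mid \alpha) \prod_i P(z_i \mid \theta_d) \propto \mathrm{Dir}(\theta_d \mid \alpha) \prod_i \mathrm{Cat}(z_i \mid \theta_d)$$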
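Because the Dirichlet is conjugate to the categorical, the product of the Dirichlet prior and the categorical terms is itself a Dirichlet whose k-th parameter is α_k plus the number of words in the document currently assigned to topic k. A minimal sketch of that Gibbs step follows. The prior, the topic indicators, and the function names are hypothetical, and the Dirichlet draw uses the standard trick of normalizing independent Gamma draws.

```python
import random
from collections import Counter

def theta_posterior_params(alpha, z_d):
    """Dirichlet parameters of P(theta_d | alpha, z_d): prior plus per-topic counts."""
    counts = Counter(z_d)
    return [a + counts[k] for k, a in enumerate(alpha)]

def sample_dirichlet(params, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(param, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [g / total for g in draws]

alpha = [1.0, 1.0, 1.0]  # hypothetical symmetric prior over 3 topics
z_d = [0, 0, 2, 0]       # hypothetical topic indicators for one document
params = theta_posterior_params(alpha, z_d)
theta_d = sample_dirichlet(params, random.Random(0))
print(params)   # [4.0, 1.0, 2.0]
print(theta_d)
```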
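$$P(z \mid \theta_d, w, \phi) \propto P(z \mid \theta_d)\, P(w \mid z, \phi) \propto \mathrm{Cat}(z \mid \theta_d)\, \mathrm{Cat}(w \mid z, \phi) \propto \theta_d^{z} \times \phi_z^{w}$$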
z   | P(z|θ) | P(w|z) | P(z|θ) P(w|z) | norm
z=1 | 0.100  | 0.010  | 0.001         | 0.019
z=2 | 0.200  | 0.030  | 0.006         | 0.112
z=3 | 0.070  | 0.020  | 0.001         | 0.026
z=4 | 0.130  | 0.080  | 0.010         | 0.193
z=5 | 0.500  | 0.070  | 0.035         | 0.651
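The normalized column can be reproduced directly from the first two columns. The sketch below uses the table's own values for P(z|θ) and P(w|z); only the variable names are made up.

```python
# Values copied from the table: P(z | theta) and P(w | z) for z = 1..5.
p_z_given_theta = [0.100, 0.200, 0.070, 0.130, 0.500]
p_w_given_z = [0.010, 0.030, 0.020, 0.080, 0.070]

# Unnormalized conditional: P(z | theta) * P(w | z) for each topic.
unnormalized = [pz * pw for pz, pw in zip(p_z_given_theta, p_w_given_z)]

# Divide by the sum so the conditional over z adds up to 1.
total = sum(unnormalized)
normalized = [u / total for u in unnormalized]

for z, (u, n) in enumerate(zip(unnormalized, normalized), start=1):
    print(f"z={z}  product={u:.4f}  norm={n:.3f}")
```

The products shown in the table are rounded to three decimals (for example 0.07 × 0.02 = 0.0014 appears as 0.001), but the normalized values are computed from the unrounded products.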
Probability mass function (PMF): P(z = x) exactly
[Figure: bar chart of P(z = x) for x = 1, …, 5]
Cumulative distribution function (CDF): P(z ≤ x)
[Figure: step plot of P(z ≤ x) for x = 1, …, 5]
Sample p uniformly in [0,1]
Find the point CDF⁻¹(p)
Examples: p = .78, p = .06
CDF values for z = 1, …, 5: ≤0.008 ≤0.059 ≤0.071 ≤0.703 ≤1.000
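Sampling by inverting the CDF is a short lookup: draw p uniformly, then return the first outcome whose cumulative probability reaches p. The sketch below applies this to the cumulative thresholds listed above (0.008, 0.059, 0.071, 0.703, 1.000). Pairing those thresholds with p = .78 and p = .06 is an assumption made for illustration, and the function name is hypothetical.

```python
import bisect
import random

# Cumulative probabilities P(z <= x) for x = 1..5, as listed on the slide.
cdf = [0.008, 0.059, 0.071, 0.703, 1.000]

def inverse_cdf(p, cdf):
    """Return the smallest outcome x (1-indexed) with CDF(x) >= p."""
    return bisect.bisect_left(cdf, p) + 1

print(inverse_cdf(0.78, cdf))  # 5
print(inverse_cdf(0.06, cdf))  # 3

# Drawing a sample: p uniform in [0, 1), then invert.
rng = random.Random(0)
print(inverse_cdf(rng.random(), cdf))
```

Outcomes with a large jump in the CDF (here z=4 and z=5) cover most of the [0,1] interval, so they are drawn most often.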
words are independent of each other given the set of topics for a document)
(Dirichlet prior)
(Dirichlet prior)
the identities of words and how they are divided into documents.
What if you want to encode other assumptions or reason over other observations?
(Wang and McCallum 2006)
Time is drawn from a Beta distribution [0,1]
[Figure: plate diagram extended with a time variable t and parameters αt, βt]
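$$P(z \mid \theta, w, t, \phi, \alpha_t, \beta_t) \propto P(z \mid \theta_d)\, P(w \mid z, \phi)\, P(t \mid z, \alpha_t, \beta_t) \propto \mathrm{Cat}(z \mid \theta_d)\, \mathrm{Cat}(w \mid z, \phi)\, \mathrm{Beta}(t \mid \alpha_t, \beta_t) \propto \theta_d^{z} \times \phi_z^{w} \times \frac{t^{\alpha_t - 1}(1 - t)^{\beta_t - 1}}{B(\alpha_t, \beta_t)}$$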
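With the Beta term included, the conditional for z multiplies three factors per topic and then normalizes. The sketch below assumes each topic carries its own pair of Beta parameters, which is one reading of P(t | z, α, β), and assumes timestamps are scaled so that 0 < t < 1. All numbers and function names are hypothetical.

```python
import math

def beta_pdf(t, a, b):
    """Beta density t^(a-1) (1-t)^(b-1) / B(a, b), for 0 < t < 1."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)  # log B(a, b)
    return math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_norm)

def z_conditional_beta(theta_d, phi_w, t, beta_params):
    """Normalized P(z | ...) proportional to theta_d[z] * phi[z][w] * Beta(t | a_z, b_z)."""
    scores = [th * ph * beta_pdf(t, a, b)
              for th, ph, (a, b) in zip(theta_d, phi_w, beta_params)]
    total = sum(scores)
    return [s / total for s in scores]

theta_d = [0.5, 0.5]             # hypothetical P(z | theta_d) for two topics
phi_w = [0.1, 0.1]               # hypothetical P(w | z) for the current word
beta_params = [(2, 5), (5, 2)]   # topic 0 peaks early in time, topic 1 late
print(z_conditional_beta(theta_d, phi_w, 0.9, beta_params))
```

In this toy case the word is equally likely under both topics, so the late timestamp alone pushes the token toward the topic whose Beta density is concentrated near 1.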
Time is drawn from a Normal distribution (−∞, ∞)
[Figure: plate diagram extended with a time variable t and parameters μ, σ]
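$$P(z \mid \theta, w, t, \phi, \mu, \sigma) \propto P(z \mid \theta_d)\, P(w \mid z, \phi)\, P(t \mid z, \mu_z, \sigma_z) \propto \mathrm{Cat}(z \mid \theta_d)\, \mathrm{Cat}(w \mid z, \phi)\, \mathrm{Norm}(t \mid \mu_z, \sigma_z) \propto \theta_d^{z} \times \phi_z^{w} \times \frac{1}{\sigma_z \sqrt{2\pi}} \exp\left(-\frac{(t - \mu_z)^2}{2\sigma_z^2}\right)$$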
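The Normal variant changes only the time factor: each topic z contributes a Gaussian density with its own mean μ_z and standard deviation σ_z. A minimal sketch follows; the years, the other numbers, and the function names are hypothetical.

```python
import math

def normal_pdf(t, mu, sigma):
    """Gaussian density: exp(-(t - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))."""
    return math.exp(-((t - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def z_conditional_normal(theta_d, phi_w, t, mu, sigma):
    """Normalized P(z | ...) proportional to theta_d[z] * phi[z][w] * Norm(t | mu_z, sigma_z)."""
    scores = [th * ph * normal_pdf(t, m, s)
              for th, ph, m, s in zip(theta_d, phi_w, mu, sigma)]
    total = sum(scores)
    return [s / total for s in scores]

theta_d = [0.5, 0.5]      # hypothetical P(z | theta_d) for two topics
phi_w = [0.1, 0.1]        # hypothetical P(w | z) for the current word
mu = [1900.0, 1990.0]     # hypothetical per-topic mean year
sigma = [20.0, 20.0]      # hypothetical per-topic standard deviation
print(z_conditional_normal(theta_d, phi_w, 1985.0, mu, sigma))
```

Unlike the Beta version, t needs no rescaling to [0,1], which is convenient when timestamps are raw years.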
Time is drawn from a Multinomial distribution [1, … , K]
[Figure: plate diagram extended with a time variable t and parameter ψ]
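$$P(z \mid \theta, w, \phi, t, \psi) \propto P(z \mid \theta_d)\, P(w \mid z, \phi)\, P(t \mid z, \psi) \propto \mathrm{Cat}(z \mid \theta_d)\, \mathrm{Cat}(w \mid z, \phi)\, \mathrm{Cat}(t \mid z, \psi) \propto \theta_d^{z} \times \phi_z^{w} \times \psi_z^{t}$$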
Goldstone and Underwood (2014), The Quiet Transformations of Literary Studies
Grimmer (2010), A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases