Toward Reliable Bayesian Nonparametric Learning
Erik Sudderth
Brown University Department of Computer Science
Joint work with Donglai Wei & Michael Bryant (HDP topics), and Michael Hughes & Emily Fox (BP-HMM)
Documents & Topic Models
Example topic: model, neural, stochastic, recognition, nonparametric, gradient, dynamical, Bayesian, …
Topic models: a framework for unsupervised discovery of low-dimensional latent structure from bag-of-words representations
Algorithms, Neuroscience, Statistics, Vision, …
! pLSA: Probabilistic Latent Semantic Analysis (Hofmann 2001)
! LDA: Latent Dirichlet Allocation (Blei, Ng, & Jordan 2003)
! HDP: Hierarchical Dirichlet Processes (Teh, Jordan, Beal, & Blei 2006)
To organize large time series collections, an essential task is to identify segments whose visual content arises from the same physical cause (e.g., Stir Brownie Mix, Open Fridge, Grate Cheese, Set Oven Temp.).
GOAL: A set of temporal behaviors
Key challenge: growth of model complexity across domains. Can local updates uncover global structure?
! MCMC: Local Gibbs and Metropolis-Hastings proposals
! Variational: Local coordinate ascent optimization
! Do these algorithms live up to our complex models?
Non-traditional modeling and inferential goals
! Nonparametric: Model structure grows and adapts to new data; no need to specify the number of topics/objects/etc.
! Reliable: Our primary goal is often not prediction, but correct recovery of latent cluster/feature structure
! Simple: We often want just a single "good" model, not samples
Bayesian Nonparametrics
! Dirichlet process (DP) mixture models
! Variational methods and the ME algorithm
Reliable Nonparametric Learning
! Hierarchical DP topic models
! ME search in a collapsed representation
! Non-local online variational inference
Nonparametric Temporal Models
! Beta Process Hidden Markov Models (BP-HMM)
! Effective split-merge MCMC methods
Dirichlet process stick-breaking construction (Sethuraman, 1994): mixture weights $\pi_k = v_k \prod_{\ell < k} (1 - v_\ell)$ with $v_k \sim \mathrm{Beta}(1, \alpha)$, where $\alpha$ is the concentration parameter. The assignment $z_i$ indicates which cluster generated each of the $N$ data points.
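As a concrete illustration (a minimal sketch, not from the talk), the stick-breaking weights can be sampled in a few lines, truncating once the leftover stick mass is negligible:

```python
# Minimal sketch: sample DP mixture weights pi ~ GEM(alpha) by stick-breaking.
import numpy as np

def stick_break(alpha, tol=1e-6, rng=np.random.default_rng(0)):
    weights, remaining = [], 1.0
    while remaining > tol:
        v = rng.beta(1.0, alpha)         # v_k ~ Beta(1, alpha)
        weights.append(remaining * v)    # pi_k = v_k * prod_{l<k} (1 - v_l)
        remaining *= 1.0 - v             # leftover stick mass
    return np.array(weights)
```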
Closed form probability for any hypothesized partition of N observations into K clusters:
$$\log p(x, z) = \log \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)} + \sum_{k=1}^{K} \left\{ \log \alpha + \log \Gamma(N_k) + \log \int_{\Theta} \prod_{i \mid z_i = k} f(x_i \mid \theta_k)\, dH(\theta_k) \right\}$$

where $\Gamma(N_k) = (N_k - 1)!$ for the $N_k$ observations assigned to cluster $k$.
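To make the formula concrete, here is a minimal sketch of this collapsed log-joint, assuming 1-D Gaussian likelihoods with known variance and a conjugate Gaussian base measure $H$ (these modeling choices are illustrative assumptions, not the talk's):

```python
# Minimal sketch: collapsed DP-mixture log-joint log p(x, z), assuming
# 1-D Gaussian likelihoods N(x | theta, sigma2) and base measure
# H = N(mu0, tau2), so the integral has a closed conjugate form.
import numpy as np
from scipy.special import gammaln

def log_joint(x, z, alpha, mu0=0.0, tau2=1.0, sigma2=1.0):
    x, z = np.asarray(x, float), np.asarray(z, int)
    N = len(x)
    lp = gammaln(alpha) - gammaln(N + alpha)   # log Gamma(a) / Gamma(N + a)
    for k in np.unique(z):
        xk = x[z == k]
        Nk = len(xk)
        lp += np.log(alpha) + gammaln(Nk)      # CRP term: log a + log (Nk-1)!
        # log \int prod_i N(x_i | theta, sigma2) dH(theta):
        post_var = 1.0 / (1.0 / tau2 + Nk / sigma2)
        post_mean = post_var * (mu0 / tau2 + xk.sum() / sigma2)
        lp += (-0.5 * Nk * np.log(2.0 * np.pi * sigma2)
               + 0.5 * np.log(post_var / tau2)
               + 0.5 * post_mean ** 2 / post_var
               - 0.5 * mu0 ** 2 / tau2
               - 0.5 * np.sum(xk ** 2) / sigma2)
    return lp
```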
Monte Carlo Methods
! Stick-breaking representation: Truncated or slice sampler
! CRP representation: Collapsed Gibbs sampler
! Split-merge samplers, retrospective samplers, …
Variational Methods
! Valid for any hypothesized distribution
! Mean field variational methods optimize within a tractable family
! Truncated stick-breaking representation: Blei & Jordan, 2006
! Collapsed CRP representation: Kurihara, Teh, & Welling, 2007
$$\log p(x \mid \alpha, \lambda) \ge H(q) + \mathbb{E}_q[\log p(x, z, \theta \mid \alpha, \lambda)]$$
EM Algorithm
! E-step: Marginalize latent variables (approximately)
! M-step: Maximize likelihood bound over model parameters
ME Algorithm
! M-step: Maximize likelihood over latent (hard) assignments
! E-step: Marginalize random parameters (exactly)
Kurihara & Welling, 2009
Why Maximization-Expectation?
! Parameter marginalization allows Bayesian "model selection"
! Hard assignments allow efficient algorithms and data structures
! Hard assignments are consistent with clustering objectives
! No need for finite truncation of nonparametric models (a local-search sketch follows below)
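A minimal sketch of ME local search for the DP mixture, reusing `log_joint` from the sketch above (and its Gaussian assumptions): each point is greedily moved to the best existing or brand-new cluster, so the exact collapsed objective never decreases.

```python
# Minimal sketch: hard-assignment coordinate ascent on log p(x, z).
# Assumes N >= 2 data points; `base` passes mu0/tau2/sigma2 to log_joint.
def me_search(x, alpha, n_sweeps=20, **base):
    z = np.zeros(len(x), dtype=int)                # start with one cluster
    for _ in range(n_sweeps):
        changed = False
        for i in range(len(x)):                    # M-step: hard assignments
            old, scores = z[i], {}
            ks = list(np.unique(np.delete(z, i)))  # clusters of other points
            for k in ks + [max(ks) + 1]:           # extra label = new cluster
                z[i] = k
                scores[k] = log_joint(x, z, alpha, **base)  # E-step is exact
            z[i] = max(scores, key=scores.get)
            changed |= (z[i] != old)
        if not changed:                            # reached a local optimum
            break
    return z
```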
Toy data: 200 samples from a mixture of 4 two-dimensional Gaussians
! Stick-breaking variational: Truncate to K = 20 components
! CRP collapsed variational: Truncate to K = 20 components
! ME local search: No finite truncation required
Every run, from hundreds of initializations, produces the same (optimal) partition.
! Dynamics of the inference algorithm often matter more in practice than choice of model representation/approximation
! True for MCMC as well as variational methods
! Easier to design complex algorithms for simple objectives
Bayesian Nonparametrics
! Dirichlet process (DP) mixture models
! Variational methods and the ME algorithm
Reliable Nonparametric Learning
! Hierarchical DP topic models
! ME search in a collapsed representation
! Non-local online variational inference
Nonparametric Temporal Models
! Beta Process Hidden Markov Models (BP-HMM)
! Effective split-merge MCMC methods
Hierarchical Dirichlet Process (Teh, Jordan, Beal, & Blei 2004), building on the Dirichlet process (Ferguson, 1973; Antoniak, 1974):
Global discrete measure: $G_0 \sim \mathrm{DP}(\gamma, H)$
For each of $J$ groups: $G_j \sim \mathrm{DP}(\alpha, G_0)$
For each of $N_j$ data: $\theta_{ji} \sim G_j$, $x_{ji} \sim f(x \mid \theta_{ji})$
Atom locations define topics, atom masses their frequencies; each document has its own topic frequencies over the shared topics. The data: each document's bag of word tokens.
! Latent Dirichlet Allocation (LDA, Blei et al. 2003) is a parametric topic model (a finite Dirichlet approximation to the HDP)
! Griffiths & Steyvers (2004) introduced a collapsed Gibbs sampler, and demonstrated it on a toy "bars" dataset:
10 topic distributions on 25 vocabulary words, and example documents
Can we marginalize both global and document-specific topic frequencies?
$$\log p(x, z, m \mid \alpha, \gamma, \lambda) = \log \frac{\Gamma(\gamma)}{\Gamma(m_{\cdot\cdot} + \gamma)} + \sum_{k=1}^{K} \left\{ \log \gamma + \log \Gamma(m_{\cdot k}) + \log \frac{\Gamma(W\lambda)}{\Gamma(n_{\cdot\cdot k} + W\lambda)} + \sum_{w=1}^{W} \log \frac{\Gamma(\lambda + n^w_{\cdot\cdot k})}{\Gamma(\lambda)} \right\}$$
$$\qquad + \sum_{j=1}^{J} \left\{ \log \frac{\Gamma(\alpha)}{\Gamma(n_{j\cdot\cdot} + \alpha)} + m_{j\cdot} \log \alpha + \sum_{k=1}^{K} \log s(n_{j\cdot k}, m_{jk}) \right\}$$

Count statistics (dots denote marginalized indices):
! $n_{jtk}$: number of tokens in document $j$ assigned to table $t$ and topic $k$
! $n^w_{jtk}$: number of tokens of type (word) $w$ in document $j$ assigned to table $t$ and topic $k$
! $m_{jk}$: number of tables in document $j$ assigned to topic $k$
! $s(n, m)$: number of permutations of $n$ items with $m$ disjoint cycles (unsigned Stirling numbers of the first kind, Antoniak 1974)
Sufficient statistics: global topic assignments, and counts of tables assigned to each topic.
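The Stirling factors can be tabulated once in log space. A minimal sketch (not from the talk), using the standard recurrence $s(n, m) = s(n-1, m-1) + (n-1)\, s(n-1, m)$:

```python
# Minimal sketch: log unsigned Stirling numbers of the first kind s(n, m),
# computed stably in log space via logaddexp.
import numpy as np

def log_stirling_table(n_max):
    """Return L with L[n, m] = log s(n, m), and -inf where s(n, m) = 0."""
    L = np.full((n_max + 1, n_max + 1), -np.inf)
    L[0, 0] = 0.0                                    # s(0, 0) = 1
    for n in range(1, n_max + 1):
        for m in range(1, n + 1):
            # s(n, m) = (n - 1) * s(n-1, m) + s(n-1, m-1)
            a = np.log(n - 1) + L[n - 1, m] if n > 1 else -np.inf
            L[n, m] = np.logaddexp(a, L[n - 1, m - 1])
    return L
```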
! When a word is repeated multiple times within a document, those instances (tokens) have identical likelihood statistics
! We therefore sum over all possible ways of allocating repeated tokens that produce a given set of counts $n^w_{j\cdot k}$
$$\log p(x, n, m \mid \alpha, \gamma, \lambda) = \log \frac{\Gamma(\gamma)}{\Gamma(m_{\cdot\cdot} + \gamma)} + \sum_{k=1}^{K} \left\{ \log \gamma + \log \Gamma(m_{\cdot k}) + \log \frac{\Gamma(W\lambda)}{\Gamma(n_{\cdot\cdot k} + W\lambda)} + \sum_{w=1}^{W} \log \frac{\Gamma(\lambda + n^w_{\cdot\cdot k})}{\Gamma(\lambda)} \right\}$$
$$\qquad + \sum_{j=1}^{J} \left\{ \log \frac{\Gamma(\alpha)}{\Gamma(n_{j\cdot\cdot} + \alpha)} + m_{j\cdot} \log \alpha + \sum_{k=1}^{K} \log s(n_{j\cdot k}, m_{jk}) + \sum_{w=1}^{W} \log \frac{\Gamma(n^w_{j\cdot\cdot} + 1)}{\prod_{k=1}^{K} \Gamma(n^w_{j\cdot k} + 1)} \right\}$$
[Figure: search-space schematic for the objective above: Input Data ($J$ docs over $W$ words), Search Space (count configurations $n$, $m$), and Inferred Topic Distributions ($K$ topics).]
In some random order:
! Assign one word token to the optimal (possibly new) table
! Assign one table to the optimal (possibly new) topic
! Merge two tables, assigning the result to the optimal (possibly new) topic
For some document, fixing the configurations of all others:
! Remove all existing assignments, and sequentially reassign tokens to topics via a conditional CRP sampler
! Refine the configuration with local search (only this document)
! Reject if the new configuration has lower likelihood
For some vocabulary word, fixing the configurations of all others:
! Remove all existing assignments topic by topic, and sequentially reassign tokens to topics via a conditional CRP sampler
! Refine the configuration with local search (only this word type)
! Reject if the new configuration has lower likelihood
For some topic, fixing the configurations of all others:
! Merge with another topic
! Refine or reject the topic: apply reconfigure-document and reconfigure-word moves to this topic's documents/words
! Reject if any new configuration has lower likelihood
(A sketch of the overall move scheduler follows this list.)
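A sketch of how these moves might be scheduled; the `cfg.log_prob()` interface and the move functions are hypothetical stand-ins (not the authors' code), but the accept rule matches the slides: a move is kept only if the exact collapsed objective does not decrease.

```python
# Minimal sketch: ME search over the collapsed HDP configuration `cfg`.
# `cfg` holds the count statistics (n, m); `moves` are functions that
# return a modified copy (token-to-table, table-to-topic, table-merge,
# reconfigure-document, reconfigure-word, reconfigure-topic).
import random

def me_search_hdp(cfg, moves, n_iters=1000, rng=random.Random(0)):
    best = cfg.log_prob()            # collapsed log p(x, n, m | alpha, gamma, lambda)
    for _ in range(n_iters):
        move = rng.choice(moves)     # pick a move type in random order
        proposal = move(cfg, rng)
        lp = proposal.log_prob()
        if lp >= best:               # greedy accept: objective never decreases
            cfg, best = proposal, lp
    return cfg
```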
Recall the toy "bars" dataset of Griffiths & Steyvers (2004): 10 topic distributions on 25 vocabulary words. We generate several harder variants:
! Frequency: Assign geometrically decreasing rates of occurrence to the topics
! Noise: Generate a portion of document words uniformly at random, rather than from the primary topics
! Burstiness: Increase the frequencies of a few randomly chosen words from the most likely topics
(A data-generation sketch follows this list.)
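A minimal data-generation sketch of the noisy and bursty variants; the grid size, token counts, and mixing weights here are illustrative assumptions, not the talk's exact setup.

```python
# Minimal sketch: toy "bars" documents with optional noise and burstiness.
import numpy as np

rng = np.random.default_rng(0)
SIDE, W = 5, 25                                 # 5x5 vocabulary grid

def bar_topics():
    topics = []
    for r in range(SIDE):                       # 5 horizontal bars
        t = np.zeros((SIDE, SIDE)); t[r, :] = 1
        topics.append(t.ravel() / t.sum())
    for c in range(SIDE):                       # 5 vertical bars
        t = np.zeros((SIDE, SIDE)); t[:, c] = 1
        topics.append(t.ravel() / t.sum())
    return np.array(topics)                     # shape (10, 25)

def make_doc(topics, n_tokens=100, noise=0.0, bursty=False):
    theta = rng.dirichlet(np.ones(len(topics))) # document-topic weights
    phi = theta @ topics                        # document word distribution
    phi = (1 - noise) * phi + noise / W         # Noise: uniform contamination
    if bursty:                                  # Burstiness: boost a couple
        w = rng.choice(W, size=2, p=phi)        # of already-likely words
        phi[w] += 0.2
        phi /= phi.sum()
    return rng.multinomial(n_tokens, phi)       # bag-of-words count vector
```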
[Figures: Simulated Corpus (log count of words vs. sorted word index) and Average Simulated Document (average count of words vs. sorted word index), shown for CleanBar, NoisyBar, BurstyBar, and NIPS.]
How well does the HDP capture these properties?
! Frequency: Well, via explicit parameters in the base measure
! Noise: Weakly, by using many extraneous topics with small probability mass
! Burstiness: Completely unmodeled; topics are fixed multinomials with no document-specific variation
[Figure: training log-likelihood per token vs. iterations, comparing an extended run of the Gibbs sampler (GS) against inserting one ME iteration and continuing with Gibbs (+ME-n+GS). Dataset scale: 1,566 documents, 625 words, 50 true topics.]
[Figures: ME-n, ME-z, and GS compared across initializations K0 in {1, 10, 30, 50, 100}; ME runs initialize by a short run of MCMC, run ME search, and output the solution, versus an extended run of the Gibbs sampler.]
[Figure: test log-likelihood per token vs. K0 on NoisyBar and BurstyBar, for ME-n and GS.]
! Predictive likelihood and topic coherence are negatively correlated
! There is some work on modeling burstiness with parametric topic models (Doyle & Elkan, 2009)
[Figures, 1,740 documents, ~1.7 million word tokens. Robustness: learned # topics vs. K0 (λ = 1); Compactness: learned # topics vs. λ (K0 = 40); Prediction: test log-likelihood vs. λ (K0 = 40).]
[Figures, 4,709 documents, ~435,000 word tokens: the same three panels. Robustness: learned # topics vs. K0 (λ = 1); Compactness: learned # topics vs. λ (K0 = 40); Prediction: test log-likelihood vs. λ (K0 = 40).]
[Figures, NIPS corpus (1,740 documents): per-word log-likelihood vs. documents seen, and vs. # topics used, comparing online bHDP (K = 300 and K = 1000) against collapsed Gibbs sampling (CGS, K = 300). Bryant, NIPS 2012.]
[Figures, New York Times corpus (1.8 million documents): per-word log-likelihood vs. # topics used, and vs. documents seen. Bryant, NIPS 2012.]
[Table: top words of one inferred topic tracked from 40,000 to 240,000 documents seen; the topic stays anchored on words such as patterns, pattern, cortex, neurons, neuronal, responses, dendritic, and pyramidal, with gradual reordering. Bryant, NIPS 2012.]
Bayesian Nonparametrics
! Dirichlet process (DP) mixture models
! Variational methods and the ME algorithm
Reliable Nonparametric Learning
! Hierarchical DP topic models
! ME search in a collapsed representation
! Non-local online variational inference
Nonparametric Temporal Models
! Beta Process Hidden Markov Models (BP-HMM)
! Effective split-merge MCMC methods
[Fox et al., NIPS 2009]
[Figure: BP-HMM schematic. A shared behavior library (HMM emission parameters θ) is paired with per-video transition matrices; each of Videos 1–4 has a behavior sequence drawn from the behaviors that its row of the binary behavior matrix F (videos × behaviors) makes available.]
Beta Process (BP): a prior on sparse binary matrices [Ghahramani 2006; Thibaux 2007]
Each video i has a sparse binary vector f_i = [0 1 0 1 0 0 0 …] indicating its available behaviors
Alternative representation: Indian Buffet Process (sketched below)
Number of behaviors per video ~ Poisson(γ); total number of behaviors grows as O(γ log N) for N videos
Encourages behavior sharing, but allows sparsity and rare behaviors
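A minimal sketch of the Indian Buffet Process representation (not the talk's code): video $i$ reuses an existing behavior $k$ with probability $m_k / i$, then adds $\mathrm{Poisson}(\gamma / i)$ brand-new behaviors, which yields $O(\gamma \log N)$ behaviors in expectation.

```python
# Minimal sketch: sample a binary behavior matrix F from the IBP prior.
import numpy as np

def sample_ibp(n_videos, gamma, rng=np.random.default_rng(0)):
    counts = []                                   # videos using each behavior
    rows = []
    for i in range(1, n_videos + 1):
        # Reuse behavior k with probability counts[k] / i ...
        row = [rng.random() < c / i for c in counts]
        counts = [c + int(b) for c, b in zip(counts, row)]
        # ... then add Poisson(gamma / i) brand-new behaviors.
        new = rng.poisson(gamma / i)
        row += [True] * new
        counts += [1] * new
        rows.append(row)
    K = len(counts)                               # E[K] = gamma * H_N = O(gamma log N)
    return np.array([r + [False] * (K - len(r)) for r in rows])
```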
[Figures, toy data: log joint probability vs. CPU time (sec) for split-merge (SM), data-driven (DD), and Prior proposals; example recovered emission parameters for the worst split-merge run, worst data-driven run, and best Prior run.]
The prior proposals of Fox et al. (2009) are already fairly sophisticated: dynamic programming in the feature proposals, and resampling of new transition and emission parameters. Toy data: 100 sequences with 8-dim. Gaussian emissions.
Reversible Jump MCMC: birth and death moves add or delete unique features.
Propose from the prior [Fox et al. NIPS 2009]: $\theta_{k^*} \sim p(\theta)$
Data-driven proposal [Hughes et al. NIPS 2012]: $\theta_{k^*} \sim \tfrac{1}{2} p(\theta) + \tfrac{1}{2} p(\theta \mid \{x_{it} : t \in W\})$, for a random window $W$ of one sequence
Using the mixture ensures a good death-move acceptance rate (a sketch follows).
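A minimal sketch of the data-driven birth proposal, assuming 1-D Gaussian emissions with known variance (an illustrative choice, not the BP-HMM's actual emission model):

```python
# Minimal sketch: birth proposal theta* ~ 0.5 p(theta) + 0.5 p(theta | x_W),
# with prior N(0, tau2), known emission variance sigma2, and a random
# window W of one observation sequence.
import numpy as np

def birth_proposal(x_seq, win_len=25, tau2=10.0, sigma2=1.0,
                   rng=np.random.default_rng(0)):
    if rng.random() < 0.5:
        return rng.normal(0.0, np.sqrt(tau2))          # prior component
    t0 = rng.integers(0, len(x_seq) - win_len + 1)     # random window W
    xW = x_seq[t0:t0 + win_len]
    post_var = 1.0 / (1.0 / tau2 + len(xW) / sigma2)   # conjugate posterior
    post_mean = post_var * xW.sum() / sigma2
    return rng.normal(post_mean, np.sqrt(post_var))    # data-driven component
```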
[Figure: split-merge schematic over time. For anchor sequences i and j, a merge combines features k_a and k_b into k_m, and a split divides k_m into k_a and k_b; the state sequences z are updated to z*.]
MH Acceptance Ratio:
$$\frac{p(x, z^*, F^*)}{p(x, z, F)} \cdot \frac{q_{\mathrm{merge}}(F, z \mid x, F^*, z^*, k_a, k_b)}{q_{\mathrm{split}}(F^*, z^* \mid x, F, z, k_m)} \cdot \frac{q_k(k_a, k_b \mid x, F^*, z^*, i, j)}{q_k(k_m, k_m \mid x, F, z, i, j)}$$
Select anchors: sequences $i, j \sim \mathrm{Unif}(\text{sequences})$, then anchor features $k_i, k_j \sim q_k(\cdot \mid f_i, f_j)$.
If $k_i = k_j$: SPLIT proposal, $F^*, z^* \sim q_{\mathrm{split}}(\cdot \mid k_i)$. Otherwise: MERGE proposal, $F^*, z^* \sim q_{\mathrm{merge}}(\cdot \mid k_i, k_j)$.
Both the joint probability and the proposal construction have the HMM parameters collapsed away; features active in $f_n$ can be used at any time in $z_n$, and sequences outside the affected set are left unchanged. [Hughes, NIPS 2012]
Sequential allocation [Dahl 2005] efficiently gives self-consistent split proposals (sketched below).
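A minimal sketch of sequential allocation for a split, in a simplified mixture setting (1-D Gaussian predictive with known variance; the BP-HMM version allocates whole sequences with HMM parameters collapsed): items join the two halves with probability proportional to cluster size times the posterior predictive, and the accumulated `log_q` enters the MH ratio above.

```python
# Minimal sketch: sequential allocation [Dahl 2005] for a split proposal.
# Assumes x is a 1-D array with at least 2 items.
import numpy as np

def seq_alloc_split(x, sigma2=1.0, tau2=10.0, rng=np.random.default_rng(0)):
    def logpred(xs, xi):                    # log [size * posterior predictive]
        v = 1.0 / (1.0 / tau2 + len(xs) / sigma2)
        m = v * np.sum(xs) / sigma2
        s = v + sigma2                      # posterior-predictive variance
        return (np.log(len(xs))             # CRP-style size weighting
                - 0.5 * np.log(2.0 * np.pi * s) - 0.5 * (xi - m) ** 2 / s)
    order = rng.permutation(len(x))
    a, b = [x[order[0]]], [x[order[1]]]     # two anchor items seed the halves
    log_q = 0.0                             # proposal log-prob, for the MH ratio
    for i in order[2:]:
        la, lb = logpred(a, x[i]), logpred(b, x[i])
        pa = 1.0 / (1.0 + np.exp(lb - la))  # P(assign item to half a)
        take_a = rng.random() < pa
        (a if take_a else b).append(x[i])
        log_q += np.log(pa if take_a else 1.0 - pa)
    return a, b, log_q
```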
[Figure: recovered behavior segmentations vs. fraction of time elapsed for Brownie, Pizza, Sandwich, Salad, and Eggs sequences; discovered behaviors (among 65 others) include Light Switch, Open Fridge, Stir Bowl 1, Stir Bowl 2, Pour Bowl, Grate Cheese, Slice/Chop, and Flip Omelette.]
[Figure, food preparation (CMU Kitchen Database): log joint probability vs. CPU time (sec) for SM+DD, DD, and Prior proposals.]
At this scale, Prior proposals are unusably poor; SM+DD shows reasonable, but not complete, robustness to initialization.
[Figures: Hamming distance and joint log probability vs. CPU time (sec), comparing SM+DD from a one-state initialization, SM+DD from a unique-5 initialization, and Prior proposals from a unique-5 initialization.]
[Figure: example motion-capture trajectories (x, y, z) with recovered state sequences from two SM+DD runs and a Prior run; the SM+DD runs recover consistent shared states, while the Prior run fragments them into redundant states.]
Analyzing all "Physical Activities & Sports" sequences from CMU MoCap, here are 10 of the 33 recovered behaviors: Ballet, Walk, Squat, Sword, Lambada, Dribble Basketball, Box, Climb, Indian Dance, Tai Chi.
Reversibility is too strong a constraint for effective mixing of split-merge MCMC (a widespread problem); one remedy is to reduce the proposal weight in the MH acceptance ratio. [Figure: many trials on 6 sequences.]
[Figure (Ghosh et al., CVPR 2012): best and worst results for mean field variational, EP, and stochastic search inference.]
Toward Reliable Bayesian Nonparametric Learning
! Basic samplers, and conventional variational methods, are not as reliable as you've heard
! Maximization-Expectation search: not the ultimate solution, but proof there's a problem
! Feasible: split-merge MCMC moves inspired by ME, but local reversibility can still cause slow mixing…
Key Challenges
! New "default" learning algorithms, robust to initialization
! Automatic learning for more complex hierarchies, and rich temporal and spatial models of the world