

SLIDE 1

Toward Reliable Bayesian Nonparametric Learning

Erik Sudderth

Brown University Department of Computer Science

Joint work with

Donglai Wei & Michael Bryant (HDP topics)
Michael Hughes & Emily Fox (BP-HMM)
SLIDE 2

Documents & Topic Models

Example topic words: model, neural, stochastic, recognition, nonparametric, gradient, dynamical, Bayesian, …

A framework for unsupervised discovery of low-dimensional latent structure from bag-of-words representations of document collections (e.g., Algorithms, Neuroscience, Statistics, Vision, …):

• pLSA: Probabilistic Latent Semantic Analysis (Hofmann 2001)
• LDA: Latent Dirichlet Allocation (Blei, Ng, & Jordan 2003)
• HDP: Hierarchical Dirichlet Processes (Teh, Jordan, Beal, & Blei 2006)
SLIDE 3

Temporal Activity Understanding

To organize large time-series collections, an essential task is to identify segments whose visual content arises from the same physical cause (e.g., Stir Brownie Mix, Open Fridge, Grate Cheese, Set Oven Temp.).

GOAL: A set of temporal behaviors providing
  • Detailed segmentations
  • Sparse behavior sharing
  • Nonparametric recovery & growth of model complexity
  • A reliable general-purpose tool across domains
SLIDE 4

Learning Challenges

Can local updates uncover global structure?

• MCMC: Local Gibbs and Metropolis-Hastings proposals
• Variational: Local coordinate-ascent optimization
• Do these algorithms live up to our complex models?

Non-traditional modeling and inferential goals

• Nonparametric: Model structure grows and adapts to new data; no need to specify the number of topics/objects/etc.
• Reliable: Our primary goal is often not prediction, but correct recovery of latent cluster/feature structure
• Simple: Often want just a single “good” model, not samples or a full representation of posterior uncertainty
SLIDE 5

Outline

Bayesian Nonparametrics

• Dirichlet process (DP) mixture models
• Variational methods and the ME algorithm

Reliable Nonparametric Learning

• Hierarchical DP topic models
• ME search in a collapsed representation
• Non-local online variational inference

Nonparametric Temporal Models

• Beta Process Hidden Markov Models (BP-HMM)
• Effective split-merge MCMC methods
SLIDE 6

Stick-Breaking and DP Mixtures

The Dirichlet process implies a prior distribution on the weights of a countably infinite mixture (stick-breaking construction):

\pi_k = v_k \prod_{l=1}^{k-1} (1 - v_l), \qquad v_k \sim \mathrm{Beta}(1, \alpha)

where \alpha > 0 is the concentration parameter.

Sethuraman, 1994
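As a concrete illustration (not on the original slide), here is a minimal NumPy sketch of the truncated stick-breaking construction above; the function name `stick_breaking_weights` and the truncation level K are illustrative choices.

```python
import numpy as np

def stick_breaking_weights(alpha, K, seed=None):
    """Sample the first K weights of a DP(alpha) draw via Sethuraman's
    stick-breaking construction: v_k ~ Beta(1, alpha) and
    pi_k = v_k * prod_{l<k} (1 - v_l)."""
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=K)                       # stick proportions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining                                   # weights sum to < 1

pi = stick_breaking_weights(alpha=2.0, K=50)   # most mass on the first few sticks
```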

SLIDE 7

Clustering and DP Mixtures

  • Conjugate priors allow marginalization of cluster parameters
  • Marginalized cluster sizes induce Chinese restaurant process

z_i indicates which cluster generated each observation; the N data points x_i are observed.
SLIDE 8

Chinese Restaurant Process
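Since this slide is illustrated only by a figure, a short simulation may help; this is a hypothetical sketch of sampling a partition from the CRP prior, matching the marginalized cluster sizes described above.

```python
import numpy as np

def crp_partition(N, alpha, seed=None):
    """Sample cluster assignments for N observations from a CRP(alpha):
    customer i joins table k with probability N_k / (i + alpha),
    or opens a new table with probability alpha / (i + alpha)."""
    rng = np.random.default_rng(seed)
    counts = []                  # occupancy N_k of each existing table
    z = np.empty(N, dtype=int)
    for i in range(N):
        probs = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(counts):     # a brand-new table
            counts.append(0)
        counts[k] += 1
        z[i] = k
    return z
```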

SLIDE 9

DP Mixture Marginal Likelihood

Closed form probability for any hypothesized partition of N observations into K clusters:

\log p(x, z) = \log\frac{\Gamma(\alpha)}{\Gamma(N+\alpha)} + \sum_{k=1}^{K}\left[\log\alpha + \log\Gamma(N_k) + \log\int_{\Theta}\prod_{i : z_i = k} f(x_i \mid \theta_k)\, dH(\theta_k)\right]

where \Gamma(N_k) = (N_k - 1)!
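To make the formula concrete, here is a sketch (assumptions: the data `x` is an indexable NumPy array, and `cluster_marginal_loglik` is a user-supplied stand-in for the conjugate integral over cluster parameters) that evaluates the partition score above.

```python
import numpy as np
from scipy.special import gammaln

def dp_partition_loglik(x, z, alpha, cluster_marginal_loglik):
    """Evaluate log p(x, z) for a hypothesized DP mixture partition,
    following the closed form above. `cluster_marginal_loglik(xk)` must
    return the log marginal likelihood of one cluster's data under the
    conjugate prior H (model-specific; supplied by the caller)."""
    z = np.asarray(z)
    total = gammaln(alpha) - gammaln(len(z) + alpha)
    for k in np.unique(z):
        xk = x[z == k]
        total += np.log(alpha) + gammaln(len(xk))   # log alpha + log (N_k - 1)!
        total += cluster_marginal_loglik(xk)
    return total
```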

SLIDE 10

DP Mixture Inference

Monte Carlo Methods

• Stick-breaking representation: truncated or slice sampler
• CRP representation: collapsed Gibbs sampler
• Split-merge samplers, retrospective samplers, …

Variational Methods

• The bound is valid for any hypothesized distribution q(z, θ)
• Mean-field variational methods optimize within a tractable family
• Truncated stick-breaking representation: Blei & Jordan 2006
• Collapsed CRP representation: Kurihara, Teh, & Welling 2007

\log p(x \mid \alpha, \lambda) \ge H(q) + \mathbb{E}_q[\log p(x, z, \theta \mid \alpha, \lambda)]
SLIDE 11

Maximization Expectation

EM Algorithm
  • E-step: Marginalize latent variables (approximately)
  • M-step: Maximize the likelihood bound over model parameters

ME Algorithm
  • M-step: Maximize the likelihood over latent assignments
  • E-step: Marginalize random parameters (exactly)

Kurihara & Welling, 2009

Why Maximization-Expectation?
  • Parameter marginalization allows Bayesian “model selection”
  • Hard assignments allow efficient algorithms and data structures
  • Hard assignments are consistent with clustering objectives
  • No need for finite truncation of nonparametric models
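As a sketch of what an ME-style hard-assignment algorithm can look like (my illustration, not the exact moves of Kurihara & Welling), the sweep below greedily reassigns each observation to whichever existing or brand-new cluster maximizes the collapsed objective; `partition_loglik` is assumed to be a closure over `dp_partition_loglik` from the earlier sketch.

```python
import numpy as np

def me_local_sweep(z, partition_loglik):
    """One greedy ME local-search sweep over hard assignments. A real
    implementation would update sufficient statistics incrementally
    rather than re-scoring the full partition for every candidate."""
    z = z.copy()
    for i in range(len(z)):
        others = np.unique(z[np.arange(len(z)) != i])
        candidates = list(others) + [(others.max() + 1) if len(others) else 0]
        scores = []
        for k in candidates:
            z[i] = k
            scores.append(partition_loglik(z))
        z[i] = candidates[int(np.argmax(scores))]    # keep the best move
    _, z = np.unique(z, return_inverse=True)         # relabel contiguously
    return z
```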

SLIDE 12

A Motivating Example

200 samples from a mixture of 4 two-dimensional Gaussians:
  • Stick-breaking variational: truncate to K = 20 components
  • CRP collapsed variational: truncate to K = 20 components
  • ME local search: no finite truncation required

The search objective is the collapsed marginal likelihood

\log p(x, z) = \log\frac{\Gamma(\alpha)}{\Gamma(N+\alpha)} + \sum_{k=1}^{K}\left[\log\alpha + \log\Gamma(N_k) + \log\int_{\Theta}\prod_{i : z_i = k} f(x_i \mid \theta_k)\, dH(\theta_k)\right]
SLIDE 13

Stick-Breaking Variational

SLIDE 14

Collapsed Variational

SLIDE 15

ME Local Search with Merge

• The dynamics of the inference algorithm often matter more in practice than the choice of model representation/approximation
• This is true for MCMC as well as variational methods
• It is easier to design complex algorithms for simple objectives

Every run, from hundreds of initializations, produces the same (optimal) partition.
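A merge move can be sketched in the same spirit (again my illustration of the idea, not the talk's exact procedure): try pairwise cluster merges and keep any that improve the collapsed objective, which lets the search escape the redundant-cluster local optima that purely local updates get stuck in.

```python
import itertools
import numpy as np

def me_merge_sweep(z, partition_loglik):
    """Greedily accept any pairwise cluster merge that raises the
    collapsed objective log p(x, z); repeat until no merge helps."""
    z = z.copy()
    improved = True
    while improved:
        improved = False
        best = partition_loglik(z)
        for a, b in itertools.combinations(np.unique(z), 2):
            trial = np.where(z == b, a, z)        # relabel cluster b as a
            score = partition_loglik(trial)
            if score > best:
                z, best, improved = trial, score, True
    return z
```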

SLIDE 16

Outline

Bayesian Nonparametrics

• Dirichlet process (DP) mixture models
• Variational methods and the ME algorithm

Reliable Nonparametric Learning

• Hierarchical DP topic models
• ME search in a collapsed representation
• Non-local online variational inference

Nonparametric Temporal Models

• Beta Process Hidden Markov Models (BP-HMM)
• Effective split-merge MCMC methods
SLIDE 17

Distributions and DP Mixtures

(Ferguson 1973; Antoniak 1974)
SLIDE 18

Distributions and HDP Mixtures

Global discrete measure: G_0 \sim \mathrm{DP}(\gamma, H)
For each of J groups: G_j \sim \mathrm{DP}(\alpha, G_0)
For each of N_j data: \theta_{ji} \sim G_j, \quad x_{ji} \sim F(\theta_{ji})

Hierarchical Dirichlet Process (Teh, Jordan, Beal, & Blei 2004)

  • Instance of a dependent Dirichlet process (MacEachern 1999)
  • Closely related to Analysis of Densities (Tomlinson & Escobar 1999)

Atom locations define topics, and atom masses their frequencies. Each document has its own topic frequencies, and is modeled as a bag of word tokens.
SLIDE 19

Chinese Restaurant Franchise
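Since this slide is figure-only, here is a hypothetical sketch of sampling from the Chinese restaurant franchise prior: tokens in each document sit at tables via a local CRP, and each new table draws its dish (topic) from a top-level CRP shared across documents.

```python
import numpy as np

def crf_sample(doc_lengths, alpha, gamma, seed=None):
    """Sample topic assignments from the CRF prior (no likelihoods):
    within document j, a token joins table t w.p. prop. to its occupancy,
    or opens a new table w.p. prop. to alpha; a new table draws its dish
    from existing dishes w.p. prop. to m_.k, or a new dish w.p. prop. to gamma."""
    rng = np.random.default_rng(seed)
    dish_counts = []                        # m_.k: tables serving dish k, all docs
    topics_per_doc = []
    for N in doc_lengths:
        table_counts, table_dish, z = [], [], []
        for _ in range(N):
            p = np.array(table_counts + [alpha], dtype=float)
            t = rng.choice(len(p), p=p / p.sum())
            if t == len(table_counts):      # open a new table, pick its dish
                table_counts.append(0)
                q = np.array(dish_counts + [gamma], dtype=float)
                k = rng.choice(len(q), p=q / q.sum())
                if k == len(dish_counts):
                    dish_counts.append(0)
                dish_counts[k] += 1
                table_dish.append(k)
            table_counts[t] += 1
            z.append(table_dish[t])
        topics_per_doc.append(z)
    return topics_per_doc
```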

SLIDE 20

The Toy Bars Dataset

• Latent Dirichlet Allocation (LDA, Blei et al. 2003) is a parametric topic model (a finite Dirichlet approximation to the HDP)
• Griffiths & Steyvers (2004) introduced a collapsed Gibbs sampler, and demonstrated it on a toy “bars” dataset:

10 topic distributions on 25 vocabulary words, and example documents
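For reference, the toy topics are easy to reproduce; this hypothetical snippet builds the 10 “bar” topics over a 5 × 5 vocabulary grid described in the caption.

```python
import numpy as np

def bars_topics(side=5):
    """Return the 2*side bar topics of Griffiths & Steyvers (2004):
    each topic is uniform over one row or one column of a side x side
    grid of vocabulary words (10 topics over 25 words for side=5)."""
    topics = []
    for axis in (0, 1):                       # rows, then columns
        for idx in range(side):
            t = np.zeros((side, side))
            if axis == 0:
                t[idx, :] = 1.0
            else:
                t[:, idx] = 1.0
            topics.append(t.ravel() / t.sum())
    return np.array(topics)                   # shape (2*side, side*side)
```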

SLIDE 21

The Perfect Sampler?

SLIDE 22

Direct Cluster Assignments

Global topic weights: \beta \sim \mathrm{GEM}(\gamma)
For each of J groups: \pi_j \sim \mathrm{DP}(\alpha, \beta)
For each of N_j data: z_{ji} \sim \pi_j, \quad x_{ji} \sim F(\theta_{z_{ji}})

Can we marginalize both global and document-specific topic frequencies?

SLIDE 23

Direct Assignment Likelihood

\log p(x, z, m \mid \alpha, \gamma, \lambda) = \log\frac{\Gamma(\gamma)}{\Gamma(m_{\cdot\cdot} + \gamma)} + \sum_{k=1}^{K}\left\{\log\gamma + \log\Gamma(m_{\cdot k}) + \log\frac{\Gamma(W\lambda)}{\Gamma(n_{\cdot\cdot k} + W\lambda)} + \sum_{w=1}^{W}\log\frac{\Gamma(\lambda + n^w_{\cdot\cdot k})}{\Gamma(\lambda)}\right\}
\qquad + \sum_{j=1}^{J}\left\{\log\frac{\Gamma(\alpha)}{\Gamma(n_{j\cdot\cdot} + \alpha)} + m_{j\cdot}\log\alpha + \sum_{k=1}^{K}\log\genfrac{[}{]}{0pt}{}{n_{j\cdot k}}{m_{jk}}\right\}

Count notation:
  • n_{jtk}: number of tokens in document j assigned to table t and topic k
  • n^w_{jtk}: number of tokens of type (word) w in document j assigned to table t and topic k
  • m_{jk}: number of tables in document j assigned to topic k

Sufficient statistics: global topic assignments, and counts of tables assigned to each topic. The bracketed coefficient \genfrac{[}{]}{0pt}{}{n_{j\cdot k}}{m_{jk}} is the number of permutations of n_{j\cdot k} items with m_{jk} disjoint cycles (an unsigned Stirling number of the first kind, Antoniak 1974).
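The Stirling coefficients in this objective can be tabulated with a standard recurrence; the following sketch (names mine) computes them in log space for numerical stability.

```python
import numpy as np
from scipy.special import logsumexp

def log_stirling_first(n_max):
    """Table of log unsigned Stirling numbers of the first kind,
    log|s(n, m)| for 0 <= m <= n <= n_max, via the recurrence
    |s(n+1, m)| = n*|s(n, m)| + |s(n, m-1)| (Antoniak 1974 uses these
    to count seatings of n tokens at m tables)."""
    S = np.full((n_max + 1, n_max + 1), -np.inf)
    S[0, 0] = 0.0
    for n in range(n_max):
        for m in range(1, n + 2):
            reuse = np.log(n) + S[n, m] if n > 0 else -np.inf
            S[n + 1, m] = logsumexp([reuse, S[n, m - 1]])
    return S

S = log_stirling_first(5)
assert np.isclose(np.exp(S[3, 2]), 3.0)   # |s(3, 2)| = 3
```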
SLIDE 24

Permuting Identical Observations

Count notation as before: n_{jtk}, n^w_{jtk}, m_{jk}.

• When a word is repeated multiple times within a document, those instances (tokens) have identical likelihood statistics.
• We therefore sum over all possible ways of allocating repeated tokens that produce a given set of counts n^w_{j\cdot k}, adding a combinatorial correction to the objective:

\log p(x, n, m \mid \alpha, \gamma, \lambda) = \log\frac{\Gamma(\gamma)}{\Gamma(m_{\cdot\cdot} + \gamma)} + \sum_{k=1}^{K}\left\{\log\gamma + \log\Gamma(m_{\cdot k}) + \log\frac{\Gamma(W\lambda)}{\Gamma(n_{\cdot\cdot k} + W\lambda)} + \sum_{w=1}^{W}\log\frac{\Gamma(\lambda + n^w_{\cdot\cdot k})}{\Gamma(\lambda)}\right\}
\qquad + \sum_{j=1}^{J}\left\{\log\frac{\Gamma(\alpha)}{\Gamma(n_{j\cdot\cdot} + \alpha)} + m_{j\cdot}\log\alpha + \sum_{k=1}^{K}\log\genfrac{[}{]}{0pt}{}{n_{j\cdot k}}{m_{jk}} + \sum_{w=1}^{W}\log\frac{\Gamma(n^w_{j\cdot\cdot} + 1)}{\prod_{k=1}^{K}\Gamma(n^w_{j\cdot k} + 1)}\right\}
SLIDE 25

HDP Optimization

[Figure: input data (J docs over W words), inferred topic distributions (K topics over W words), and the search space of token/table/topic assignments.]

The search objective is the collapsed likelihood \log p(x, n, m \mid \alpha, \gamma, \lambda) from the previous slide.
SLIDE 26

ME Search: Local Moves

In some random order:
  • Assign one word token to the optimal (possibly new) table
  • Assign one table to the optimal (possibly new) topic
  • Merge two tables, and assign the result to the optimal (possibly new) topic
SLIDE 27

ME Search: Reconfigure Document

For some document, fixing the configurations of all others:
  • Remove all existing assignments, and sequentially assign tokens to topics via a conditional CRP sampler
  • Refine the configuration with local search (only this document)
  • Reject if the new configuration has lower likelihood
SLIDE 28

ME Search: Reconfigure Word

For some vocabulary word, fixing the configurations of all others:
  • Remove all existing assignments topic by topic, and sequentially assign tokens to topics via a conditional CRP sampler
  • Refine the configuration with local search (only this word type)
  • Reject if the new configuration has lower likelihood
SLIDE 29

ME Search: Reconfigure Topic

For some topic, fixing the configurations of all others:
  • Merge with another topic
  • Refine or reject the topic: apply reconfigure-document and reconfigure-word moves to this topic's documents/words
  • Reject if any new configuration has lower likelihood
SLIDE 30

The Toy Bars Dataset

• Latent Dirichlet Allocation (LDA, Blei et al. 2003) is a parametric topic model (a finite Dirichlet approximation to the HDP)
• Griffiths & Steyvers (2004) introduced a collapsed Gibbs sampler, and demonstrated it on a toy “bars” dataset:

10 topic distributions on 25 vocabulary words, and example documents
SLIDE 31

Making More Realistic Bars

• Frequency: Assign geometrically decreasing rates of occurrence to topics, rather than making them equally likely
• Noise: Generate a portion of document words uniformly at random, rather than from the primary topics
• Burstiness: Increase the frequencies of a few randomly chosen words from the most likely topics

[Figure: log word counts by sorted word index for each simulated corpus (CleanBar, NoisyBar, BurstyBar) vs. NIPS, and average word counts for an average simulated document.]
SLIDE 32

Bars with Unequal Frequencies

SLIDE 33

Unequal Bars with Noise

SLIDE 34

Bursty, Noisy, Unequal Bars

SLIDE 35

What Should the HDP Capture?

• Frequency: Well, via explicit parameters in the base measure
• Noise: Weakly, by using many extraneous topics with small probability mass
• Burstiness: Completely unmodeled; topics are fixed multinomials with no document-specific variation

[Figure: same corpus statistics as above (CleanBar, NoisyBar, BurstyBar vs. NIPS).]
SLIDE 36

Unequal Bars: Mixing Rates

[Figure: training log-likelihood per token, log P(x_train) / #tokens, vs. number of Gibbs iterations (8,000-20,000). Legend: Initial GS, +GS (extended run of the Gibbs sampler), and +ME-n+GS (one ME iteration, then continue with Gibbs). Dataset scale: 1,566 documents, 625 words, 50 true topics.]
SLIDE 37

Unequal Bars: Test Likelihoods

[Figure: test log-likelihood per token vs. initial number of topics K0 ∈ {1, 10, 30, 50, 100}. Legend: ME-n and ME-z (initialize by a short run of MCMC, run ME search, output the solution) vs. GS (extended run of the Gibbs sampler).]
SLIDE 38

Unequal Bars: Number of Topics

[Figure: learned number of topics vs. initial number of topics K0 ∈ {1, 10, 30, 50, 100}; the true number is 50. Legend: ME-n, ME-z, GS.]
SLIDE 39

Unequal Bars: KL Divergence

[Figure: mean KL divergence from the true topics vs. K0 ∈ {1, 10, 30, 50, 100}. Legend: ME-n, ME-z, GS.]
SLIDE 40

Unequal Bars: Gibbs Topics

SLIDE 41

Unequal Bars: ME Topics

SLIDE 42

Noisy Bars: Topics

[Figure: topics recovered by the Gibbs sampler (left) vs. ME search (right).]
SLIDE 43

Noisy & Bursty Test Likelihoods

[Figure: test log-likelihood per token vs. K0 ∈ {1, 10, 30, 50, 100} for NoisyBar (left) and BurstyBar (right). Legend: ME-n vs. GS.]
SLIDE 44

Bursty Bars: Topics

[Figure: topics recovered with λ = 4 vs. λ = 0.1.]

• Predictive likelihood and topic coherence are negatively correlated
• There is some work on modeling burstiness with parametric topic models (Doyle & Elkan 2009)
SLIDE 45

NIPS Dataset

[Three panels: learned # topics vs. initial # topics K0 ∈ {40, 80, 120}; learned # topics vs. Dirichlet prior λ ∈ {0.1, 1, 3, 5}; test log-likelihood per token vs. λ ∈ {0.1, 1, 3, 5}.]

  • Red: ME-n Search (5 iterations initialized by brief Gibbs sampling)
  • Blue: Gibbs Sampling (20,000 iterations)
  • All results averaged over runs from 5 random initializations
  • Predictive likelihoods computed via Chib-style MCMC estimator

• Robustness: learned # topics vs. initial # topics K0 (λ = 1)
• Compactness: learned # topics vs. Dirichlet prior λ (K0 = 40)
• Prediction: test log-likelihood vs. Dirichlet prior λ (K0 = 40)

1,740 documents, ~1.7 million word tokens
SLIDE 46

20 Newsgroups Dataset

[Three panels: test log-likelihood per token vs. Dirichlet prior λ ∈ {0.1, 1, 3, 5}; learned # topics vs. initial # topics K0 ∈ {40, 80, 120}; learned # topics vs. λ ∈ {0.1, 1, 3, 5}.]

  • Red: ME-n Search (5 iterations initialized by brief Gibbs sampling)
  • Blue: Gibbs Sampling (20,000 iterations)
  • All results averaged over runs from 5 random initializations
  • Predictive likelihoods computed via Chib-style MCMC estimator

• Robustness: learned # topics vs. initial # topics K0 (λ = 1)
• Compactness: learned # topics vs. Dirichlet prior λ (K0 = 40)
• Prediction: test log-likelihood vs. Dirichlet prior λ (K0 = 40)

4,709 documents, ~435,000 word tokens
SLIDE 47

Online Variational Learning

[Two panels, NIPS corpus (1,740 documents): per-word log-likelihood vs. documents seen, and per-word log-likelihood vs. number of topics used.]

• HDP-SM (K = 100, 300): online variational inference with split-merge proposals
• HDP (K = 300, 1000): online variational inference (direct assignment, local optimization)
• bHDP (K = 300, 1000): batch variational inference (direct assignment, local optimization)
• HDP-CRF (K = 300): online variational inference (Chinese restaurant franchise, Wang et al.)
• CGS (K = 300): collapsed Gibbs sampling (direct assignment)

[Bryant NIPS 2012]
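For context, online (stochastic) variational methods of this kind interleave local optimization on a mini-batch with a natural-gradient blend of the global topic parameters; the sketch below uses the common (t + τ)^(−κ) step schedule, with constants that are illustrative rather than taken from the slide.

```python
def svi_global_update(lam, lam_hat, t, tau=1.0, kappa=0.6):
    """One stochastic variational update of global parameters: blend the
    current natural parameters `lam` with the mini-batch estimate `lam_hat`
    using step size rho_t = (t + tau) ** (-kappa). Split-merge variants
    additionally propose non-local changes to the number of topics."""
    rho = (t + tau) ** (-kappa)
    return (1.0 - rho) * lam + rho * lam_hat
```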

SLIDE 48

Online Variational Learning: New York Times (1.8 million documents)

[Two panels: per-word log-likelihood vs. number of topics used, and per-word log-likelihood vs. documents seen.]

• HDP-SM (K = 200, S = 10000): online variational inference with split-merge proposals (large mini-batches)
• HDP-SM (K = 200, 300): online variational inference with split-merge proposals
• HDP (K = 300, 500; also K = 500 with S = 10000): online variational inference (direct assignment, local optimization)

[Bryant NIPS 2012]
SLIDE 49

Split Topic Evolution

[Table: top words of the original topic and its offspring as the split evolves, shown after 40,000 through 240,000 documents; recurring words include patterns, pattern, cortex, neurons, neuronal, responses, types, inputs, dendritic, postsynaptic, pyramidal, activation, peak, msec.]

[Bryant NIPS 2012]
SLIDE 50

Outline

Bayesian Nonparametrics

• Dirichlet process (DP) mixture models
• Variational methods and the ME algorithm

Reliable Nonparametric Learning

• Hierarchical DP topic models
• ME search in a collapsed representation
• Non-local online variational inference

Nonparametric Temporal Models

• Beta Process Hidden Markov Models (BP-HMM)
• Effective split-merge MCMC methods
SLIDE 51

BP Hidden Markov Model [Fox NIPS 2009]

[Figure: a binary behavior matrix F (videos × behaviors) selects which behaviors from a shared behavior library (HMM emission parameters θ) are available to each of Videos 1-4; each video has its own transition matrix π and behavior sequence z.]
SLIDE 52

Beta Process HMM [Ghahramani 2006; Thibaux 2007]

Beta Process (BP): a prior on sparse binary matrices. Each video i has a sparse binary vector f_i = [0 1 0 1 0 0 0 …] indicating its available behaviors (often called “features” in the machine learning literature). An alternative representation is the Indian Buffet Process: the number of behaviors per video is Poisson(γ), and the total number of behaviors for N videos grows as O(γ log N). This encourages behavior sharing while still allowing sparsity and rare behaviors.
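A small simulation may clarify the Indian buffet representation mentioned above; this hypothetical sketch samples a binary feature (behavior) matrix whose expected number of columns grows as O(γ log N).

```python
import numpy as np

def sample_ibp(N, gamma, seed=None):
    """Sample an N x K binary matrix from the Indian buffet process:
    video i reuses behavior k with probability m_k / i, then adds
    Poisson(gamma / i) brand-new behaviors."""
    rng = np.random.default_rng(seed)
    col_counts, rows = [], []
    for i in range(1, N + 1):
        row = [rng.random() < m / i for m in col_counts]
        new = rng.poisson(gamma / i)
        col_counts = [m + int(b) for m, b in zip(col_counts, row)] + [1] * new
        rows.append(row + [True] * new)
    F = np.zeros((N, len(col_counts)), dtype=bool)
    for i, row in enumerate(rows):
        F[i, :len(row)] = row
    return F
```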

SLIDE 53

BP-HMM Graphical Model

SLIDE 54

Baseline MCMC Learning

[Figure: log joint probability vs. CPU time (sec) for split-merge (SM), data-driven (DD), and Prior proposals; example emission-parameter recoveries for the worst split-merge run, the worst data-driven run, and the best prior run.]

Prior proposals (Fox 2009) are fairly sophisticated:
  • Collapsed: marginalize state sequences via dynamic programming in feature proposals
  • Auxiliary variables: blocked state-sequence resampling to sample new transition and emission parameters
  • But performance is nevertheless very poor…

100 sequences, 8-dim. Gaussian emissions
SLIDE 55

Data-Driven Birth/Death

Reversible Jump MCMC: Add or delete unique features

Propose from the prior [Fox et al., NIPS 2009]:

\theta^*_{k^*} \sim p(\theta)

Data-driven proposal [Hughes et al., NIPS 2012]:
  • Select a random window W of the sequence
  • Propose from a mixture of the prior and the posterior over W:

\theta^*_{k^*} \sim \tfrac{1}{2}\, p(\theta) + \tfrac{1}{2}\, p(\theta \mid x_{it} : t \in W)

[Diagram: a birth move extends (F, \theta) to (F^*, \theta, \theta^*_{k^*}); a death move removes a unique feature.]

Using mixture ensures good death move acceptance rate
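The mixture proposal is simple to write down; this sketch (function and argument names mine, with the model-specific samplers left as caller-supplied stand-ins) mirrors the two-component draw above.

```python
import numpy as np

def data_driven_birth(x_seq, window_len, sample_prior, sample_posterior, seed=None):
    """Data-driven birth proposal in the style of Hughes et al. (2012):
    pick a random window W of the sequence, then draw the new behavior's
    emission parameters from 0.5*p(theta) + 0.5*p(theta | x in W).
    `sample_prior()` and `sample_posterior(x_window)` are caller-supplied
    (e.g., draws from a Normal-inverse-Wishart prior/posterior)."""
    rng = np.random.default_rng(seed)
    T = len(x_seq)
    start = rng.integers(0, max(1, T - window_len + 1))
    x_window = x_seq[start:start + window_len]
    if rng.random() < 0.5:
        return sample_prior()              # prior component of the mixture
    return sample_posterior(x_window)      # data-driven component
```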

SLIDE 56

Sequential Split/Merge

Select anchor sequences: i, j ∼ Unif(sequences)
Select anchor features: k_i, k_j ∼ q_k(· | f_i, f_j)

Is k_i = k_j?
  • YES (call it k_m): SPLIT proposal F^*, z^* ∼ q_split(· | k_m)
  • NO (call them k_a, k_b): MERGE proposal F^*, z^* ∼ q_merge(· | k_a, k_b)

MH acceptance ratio (shown for a split):

\frac{p(x, z^*, F^*)}{p(x, z, F)} \cdot \frac{q_{\mathrm{merge}}(F, z \mid x, F^*, z^*, k_a, k_b)}{q_{\mathrm{split}}(F^*, z^* \mid x, F, z, k_m)} \cdot \frac{q_k(k_a, k_b \mid x, F^*, z^*, i, j)}{q_k(k_m, k_m \mid x, F, z, i, j)}

  • Joint probability: HMM parameters are collapsed away
  • Proposal construction: features active in f_n can appear at any time in z_n; sequences outside the active set are unchanged
  • Sequential allocation [Dahl 2005] efficiently gives self-consistent split proposals

[Hughes NIPS 2012]
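The accept/reject step itself is generic Metropolis-Hastings; here is a minimal sketch in log space, assuming the caller supplies the collapsed joint probabilities and the forward/reverse proposal probabilities from the ratio above.

```python
import numpy as np

def mh_accept(log_joint_new, log_joint_old, log_q_reverse, log_q_forward, seed=None):
    """Accept a split or merge proposal with probability
    min(1, [p(x, z*, F*) q_rev] / [p(x, z, F) q_fwd]), computed in log space."""
    rng = np.random.default_rng(seed)
    log_ratio = (log_joint_new - log_joint_old) + (log_q_reverse - log_q_forward)
    return np.log(rng.random()) < min(0.0, log_ratio)
```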

SLIDE 57

Video Activity Understanding

[Figure, left: segmentations vs. fraction of time elapsed for Brownie, Pizza, Sandwich, Salad, and Eggs recipes, with recovered behaviors such as Light Switch, Open Fridge, Stir Bowl 1, Stir Bowl 2, Pour Bowl, Grate Cheese, Slice/Chop, Flip Omelette, and 65 others. Right: log joint probability vs. CPU time (sec) for SM+DD, DD, and Prior proposals.]

• 126 videos of recipe preparation (CMU Kitchen Database)
• Prior proposals are unusably poor
• Split-merge provides reasonable, but not complete, robustness to initialization
SLIDE 58

Mocap Analysis: 6 Sequences

[Figures: Hamming distance to ground truth and joint log probability vs. CPU time (sec), comparing SM+DD initialized from one state, SM+DD from unique5, and Prior from unique5; example 3D joint trajectories and recovered state sequences for each method.]

  • 6 motion capture sequences (CMU Mocap Database)
  • Human annotation of 12 partially shared exercises (ground truth validation)
  • Huge difference in quality of “typical” chains for different algorithms
SLIDE 59

Mocap Analysis: 124 Sequences

[Figure: analyzing all “Physical Activities & Sports” sequences from CMU Mocap, 10 of the 33 recovered behaviors: Ballet, Walk, Squat, Sword, Lambada, Dribble Basketball, Box, Climb, Indian Dance, Tai Chi.]

• Non-standard annealing: reduce the proposal weight in the MH acceptance ratio
• Hypothesis (from many trials on the 6 sequences): local reversibility is too strong a requirement for effective mixing of split-merge MCMC, a widespread problem
SLIDE 60

Spatial Image Segmentation

[Ghosh CVPR 2012]

[Figure: best and worst image segmentations from mean-field variational inference and from EP stochastic search.]
SLIDE 61

Summary and Outlook

Toward Reliable Bayesian Nonparametric Learning

• Basic samplers, and conventional variational methods, are not as reliable as you've heard
• Maximization-Expectation search: not the ultimate solution, but proof there's a problem
• Feasible: split-merge MCMC moves inspired by ME, but local reversibility can still cause slow mixing…

Key Challenges
  • New “default” learning algorithms, robust to initialization
  • Automatic learning for more complex hierarchies, and rich temporal and spatial models of the world