Reliable Variational Learning for Hierarchical Dirichlet Processes


SLIDE 1

Reliable Variational Learning for Hierarchical Dirichlet Processes

Erik Sudderth

Brown University Computer Science Collaborators:

  • Michael Hughes & Dae Il Kim, Brown University
  • Prem Gopalan & David Blei, Princeton University
SLIDE 2

Learning Structured BNP Models

  • Nonparametric: Data-driven discovery of model structure: topics, behaviors, objects, communities…
  • Reliable: Structure driven by data and modeling assumptions, not heuristic algorithm initializations
  • Parsimonious: Want a single model structure with good predictive power, not full posterior uncertainty

[Graphical model: Hierarchical Dirichlet Process (Teh et al., JASA 2006). Hyperparameters γ, α, λ0; global topic frequencies β; per-document frequencies πd; topic parameters φk; assignments zdn and words xdn for the Nd words in each of D documents.]

Example document: “There are reasons to believe that the genetics of an organism are likely to shift due to the extreme changes in our climate. To protect them, our politicians must pass environmental legislation that can protect our future species from becoming extinct…”

Genetics, Climate Change, Politics, …

SLIDE 3

Memoized Variational Inference for Dirichlet Process Mixture Models

Michael Hughes & E. Sudderth, 2013 Conference on Neural Information Processing Systems

SLIDE 4

[Diagram: stick-breaking construction. Sticks v1, v2, v3, … yield weights π1, π2, π3, … (e.g. 0.2, 0.3, 0.5); cluster shapes φ1, φ2, φ3 generate assignments zn and observations xn.]

Dirichlet Process Stick-Breaking

GOAL: Partition data into an a priori unknown number of discrete clusters.

π1 = v1
π2 = v2(1 − v1)
π3 = v3(1 − v2)(1 − v1)

1 − ∑_{k=1}^K πk = ∏_{k=1}^K (1 − vk)

π ∼ Stick(α)

Stick-Breaking (Sethuraman 1994)

Each cluster k = 1, 2, …
  • Cluster shape: φk ∼ H(λ0)
  • Stick proportion: vk ∼ Beta(1, α)
  • Cluster frequency: πk = vk ∏_{ℓ=1}^{k−1} (1 − vℓ)

[Plot: stick-breaking weights πk sampled with concentration α = 1, 5, and 20; larger α spreads mass over more clusters.]
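To make the construction concrete, here is a minimal NumPy sketch (not from the slides; the function name and settings are illustrative) that samples truncated stick-breaking weights for a given concentration α:

```python
import numpy as np

def stick_breaking_weights(alpha, K, seed=None):
    """Sample the first K stick-breaking weights of a DP(alpha) draw.

    v_k ~ Beta(1, alpha);  pi_k = v_k * prod_{l<k} (1 - v_l).
    Also returns the leftover mass 1 - sum_k pi_k = prod_k (1 - v_k).
    """
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=K)                              # stick proportions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1])) # mass left before stick k
    pi = v * remaining                                            # cluster frequencies
    return pi, np.prod(1.0 - v)

# Larger alpha spreads mass over more clusters:
for alpha in (1, 5, 20):
    pi, leftover = stick_breaking_weights(alpha, K=15, seed=0)
    print(alpha, pi.round(3), round(leftover, 3))
```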

SLIDE 5


Dirichlet Process Mixtures

GOAL: Partition data into an a priori unknown number of discrete clusters.

Each cluster k = 1, 2, …
  • Cluster shape: φk ∼ H(λ0)
  • Stick proportion: vk ∼ Beta(1, α)
  • Cluster frequency: πk, where π ∼ Stick(α)

Each observation n = 1, 2, …, N:
  • Cluster assignment: zn ∼ Cat(π)
  • Observed value: xn ∼ F(φzn)

Assume exponential family likelihoods with conjugate priors:

f(xn | φk) = exp{ φkᵀ t(xn) − a(φk) }

h(φk | λ0) = exp{ λ0ᵀ t̄(φk) − ā(λ0) },   t̄(φk) = [φk, −a(φk)]
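As an illustration of this generative process (not from the slides; the Gaussian likelihood, base measure, and truncation level are assumptions made for the example), a truncated draw from a DP mixture of unit-variance 1-D Gaussians:

```python
import numpy as np

def sample_dp_gmm(N, alpha=1.0, K=50, seed=0):
    """Truncated generative draw from a DP mixture of unit-variance 1-D Gaussians."""
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=K)                             # v_k ~ Beta(1, alpha)
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))   # pi_k = v_k prod_{l<k} (1 - v_l)
    pi /= pi.sum()                                               # renormalize the truncated weights
    phi = rng.normal(0.0, 3.0, size=K)                           # cluster means phi_k ~ H
    z = rng.choice(K, size=N, p=pi)                              # z_n ~ Cat(pi)
    x = rng.normal(phi[z], 1.0)                                  # x_n ~ F(phi_{z_n})
    return x, z
```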

SLIDE 6


Dirichlet Process Mixtures

GOAL: Partition data into an a priori unknown number of discrete clusters.

Each cluster k = 1, 2, …
  • Cluster shape: φk ∼ H(λ0)
  • Stick proportion: vk ∼ Beta(1, α)
  • Cluster frequency: πk, where π ∼ Stick(α)

Each observation n = 1, 2, …, N:
  • Cluster assignment: zn ∼ Cat(π)
  • Observed value: xn ∼ F(φzn)

Visually summarize model structure via a directed graphical model:

[Graphical model: hyperparameters α and λ0; stick weights π; cluster shapes φk; assignments zn and observations xn inside a plate of size N.]

f(xn | zn = k, φ) = exp{ φkᵀ t(xn) − a(φk) }

SLIDE 7


MCMC for DP Mixtures

Can we sample from the posterior distribution over data clusterings?

  • Marginalize the stick-breaking weights π ∼ Stick(α) via the Chinese Restaurant Process, assigning positive probability to all partitions of the data (large support)
  • Given any fixed partition z: via conjugacy of the base measure to the exponential family likelihood, marginalize the cluster shape parameters φk

Gibbs Sampler: (Neal 1992, MacEachern 1994)

Iteratively resample cluster assignment for one observation, fixing all others.
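For intuition about the partition prior this collapsed sampler integrates over, here is a minimal sketch (not from the slides) of a draw from the Chinese Restaurant Process prior:

```python
import numpy as np

def crp_partition(N, alpha=1.0, seed=0):
    """Draw a random partition of N items from the Chinese Restaurant Process prior."""
    rng = np.random.default_rng(seed)
    z = np.zeros(N, dtype=int)
    counts = [1]                                    # first customer opens table 0
    for n in range(1, N):
        probs = np.array(counts + [alpha], float)   # existing tables prop. to size, new table prop. to alpha
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(0)                        # open a new table
        counts[k] += 1
        z[n] = k
    return z
```

The collapsed Gibbs step replaces these prior probabilities with probabilities proportional to the cluster counts (or α for a new cluster) times the marginal likelihood of the held-out observation under each cluster.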

SLIDE 8

Mixing for DP Mixture Samplers

MNIST: 60,000 digits projected to 50 dimensions via PCA.

[Plots: log-probability and number of clusters versus Gibbs sampling iteration.]

  • Five random initializations from K=1, K=50, K=300 clusters
  • Reversible jump MCMC? Proposals slow, acceptance low.
SLIDE 9

Variational Bounds

What is the marginal likelihood of our observed data?

log p(x | α, λ0) = log ∑_z ∫∫ p(x, z, v, φ | α, λ0) dv dφ
                 = log ∑_z ∫∫ q(z, v, φ) [ p(x, z, v, φ | α, λ0) / q(z, v, φ) ] dv dφ
                 = log Eq[ p(x, z, v, φ | α, λ0) / q(z, v, φ) ]
                 ≥ Eq[log p(x, z, v, φ | α, λ0)] − Eq[log q(z, v, φ)] = L(q)

  • Expectation with respect to some variational distribution q(z, v, φ)
  • The final step is Jensen’s inequality; the bound is the expected log-likelihood (negative of the “average energy”) plus the variational entropy


  • Maximizing this bound recovers the true posterior:

L(q) = log p(x | α, λ0) − KL(q(z, v, φ) || p(z, v, φ | x, α, λ0))

  • The simplest mean field variational methods create tractable algorithms via assumed independence:

q(z, v, φ) = q(z) q(v, φ)

SLIDE 10

Approximating Infinite Models

q(z, v, φ) = q(z) q(v, φ) = [ ∏_{n=1}^N q(zn) ] · [ ∏_{k=1}^∞ q(vk) q(φk) ]

where each q(vk) is a Beta distribution and each q(φk) is the exponential family from the conjugate prior.

q(zn = k) = rnk

Categorical distribution with unbounded support, and infinitely many potential clusters!

[Plot: example truncated weights for α = 4, K = 10.]

Top-Down Model Truncation

Blei & Jordan, 2006; Ishwaran & James, 2001

q(v, φ) = [ ∏_{k=1}^K q(φk) ] · [ ∏_{k=1}^{K−1} q(vk) ],   with the final stick fixed so that πK = ∏_{k=1}^{K−1} (1 − vk).

q(zn) = Cat(zn | rn1, rn2, . . . , rnK)


Bottom-Up Assignment Truncation

Bryant & Sudderth, 2012; Teh, Kurihara, & Welling, 2008

q(v, φ) = ∏_{k=1}^∞ q(vk) q(φk)

For any k>K, optimal variational distributions equal prior & need not 
 be explicitly represented

q(zn) = Cat(zn | rn1, rn2, . . . , rnK, 0, 0, 0, . . .)

SLIDE 11

Batch Variational Updates

A Bayesian nonparametric analog of Expectation-Maximization (EM)

q(zn) = Cat(zn | rn1, rn2, . . . , rnK, 0, 0, 0, . . .)

for some K>0

q(z, v, φ) = [ ∏_{n=1}^N q(zn | rn) ] · [ ∏_{k=1}^∞ Beta(vk | αk1, αk0) h(φk | λk) ]

Update Assignments (The Expectation Step): For all N data and k ≤ K,

rnk ∝ exp( Eq[log πk(v)] + Eq[log p(xn | φk)] )

Eq[log πk(v)] = Eq[log vk] + ∑_{ℓ=1}^{k−1} Eq[log(1 − vℓ)],
Eq[log vk] = ψ(αk1) − ψ(αk1 + αk0),   Eq[log(1 − vk)] = ψ(αk0) − ψ(αk1 + αk0)

Update Cluster Parameters (The Other Expectation Step):

N⁰k = ∑_{n=1}^N rnk,   s⁰k ← ∑_{n=1}^N rnk t(xn)

λk ← λ0 + s⁰k
αk1 ← 1 + N⁰k
αk0 ← α + ∑_{ℓ=k+1}^∞ N⁰ℓ = α + ∑_{ℓ=k+1}^K N⁰ℓ,   so that Eq[vk] = αk1 / (αk1 + αk0)

Expected counts and sufficient statistics are only non-zero for the first K clusters.
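To make the two steps concrete, here is a compact sketch of one full sweep, assuming unit-variance 1-D Gaussian likelihoods with a conjugate Normal prior on each cluster mean; all names are illustrative and this is not BNPy's API:

```python
import numpy as np
from scipy.special import digamma

def vb_sweep(x, r, alpha=1.0, m0=0.0, kappa0=0.01):
    """One batch coordinate-ascent sweep for a truncated DP mixture of
    unit-variance 1-D Gaussians (an illustrative sketch).

    x : (N,) observations;  r : (N, K) current responsibilities r_nk.
    """
    # Global step: match expected counts and sufficient statistics.
    Nk = r.sum(axis=0)                                  # N0_k = sum_n r_nk
    sk = r.T @ x                                        # s0_k = sum_n r_nk x_n
    kappa = kappa0 + Nk                                 # conjugate Normal posterior on cluster means
    m = (kappa0 * m0 + sk) / kappa
    a1 = 1.0 + Nk                                       # Beta stick parameters alpha_k1
    a0 = alpha + np.concatenate((np.cumsum(Nk[::-1])[-2::-1], [0.0]))  # alpha + sum_{l>k} N0_l
    # Local step: expected log stick weights and expected log likelihoods.
    Elog_v = digamma(a1) - digamma(a1 + a0)
    Elog_1mv = digamma(a0) - digamma(a1 + a0)
    Elog_pi = Elog_v + np.concatenate(([0.0], np.cumsum(Elog_1mv)[:-1]))
    Elog_lik = -0.5 * (x[:, None] - m[None, :]) ** 2 - 0.5 / kappa[None, :] - 0.5 * np.log(2 * np.pi)
    log_r = Elog_pi[None, :] + Elog_lik
    log_r -= log_r.max(axis=1, keepdims=True)
    r_new = np.exp(log_r)
    r_new /= r_new.sum(axis=1, keepdims=True)           # r_nk propto exp(E[log pi_k] + E[log p(x_n | phi_k)])
    return r_new, dict(m=m, kappa=kappa, a1=a1, a0=a0)
```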

SLIDE 12

Likelihood Bounds & Convergence

L(q) = Eq[log p(x, z, v, φ | α, λ0)] − Eq[log q(z, v, φ)]

Match expected sufficient statistics. For cluster k = 1, 2, …, K:

s⁰k ← ∑_{n=1}^N rnk t(xn),   λk ← λ0 + s⁰k,   αk1 ← 1 + N⁰k,   αk0 ← α + N⁰>k

For data item n = 1, 2, …, N, and K candidate clusters:

q(zn = k) = rnk ∝ exp( Eq[log πk(v) + log p(xn | φk)] )

  • Immediately after a global parameter update, the bound simplifies:

L(q) = H[r] + ∑_{k=1}^K [ ā(λk) − ā(λ0) + log B(αk1, αk0) − log B(1, α) ]

where ā and log B are the log-normalizers for the cluster shape and beta stick-breaking priors, and

H[r] = − ∑_{n=1}^N ∑_{k=1}^∞ rnk log rnk = − ∑_{n=1}^N ∑_{k=1}^K rnk log rnk
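As a sketch of how this simplified bound can be evaluated (the helper below assumes a caller-supplied log-normalizer for whichever conjugate prior family is in use; names are illustrative):

```python
import numpy as np
from scipy.special import gammaln

def elbo_after_global_update(H, lam, lam0, a1, a0, alpha, log_norm_a):
    """L(q) = H[r] + sum_k [ a(lam_k) - a(lam_0) + log B(a_k1, a_k0) - log B(1, alpha) ].

    H          : scalar entropy sum H[r]
    lam        : sequence of K updated conjugate-prior parameters lam_k
    a1, a0     : (K,) Beta stick parameters
    log_norm_a : callable returning the log-normalizer of the prior family
    """
    log_beta = lambda a, b: gammaln(a) + gammaln(b) - gammaln(a + b)
    stick_term = log_beta(a1, a0) - log_beta(1.0, alpha)                     # stick-breaking priors
    shape_term = np.array([log_norm_a(l) for l in lam]) - log_norm_a(lam0)   # cluster shape priors
    return H + np.sum(shape_term + stick_term)
```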

SLIDE 13

Likelihood Bounds & Convergence

L(q) = Eq[log p(x, z, v, φ | α, λ0)] − Eq[log q(z, v, φ)]

  • Immediately after a global parameter update, the bound simplifies:

L(q) = H[r] + ∑_{k=1}^K [ ā(λk) − ā(λ0) + log B(αk1, αk0) − log B(1, α) ]

where ā and log B are the log-normalizers for the cluster shape and beta stick-breaking priors, and

H[r] = − ∑_{n=1}^N ∑_{k=1}^∞ rnk log rnk = − ∑_{n=1}^N ∑_{k=1}^K rnk log rnk

  • Properties of the variational optimization algorithm:

+ Likelihood bound monotonically increasing, guaranteed convergence to a posterior mode
+ Unlike classical EM for MAP estimation, allows Bayesian comparison of hypotheses with varying complexity K, crucial for BNP models
− Truncation level K is assumed fixed
− Sensitive to initialization (many modes)
− Each iteration must examine all data (SLOW)
SLIDE 14

Stochastic Variational Inference

Hoffman, Blei, Paisley, & Wang, JMLR 2013

Stochastically partition the large dataset into B smaller batches x(B1), x(B2), …, x(BB).

Update: For each batch b,

r(Bb) ← Estep(x(Bb), α, λ)

For cluster k = 1, 2, …, K:

sᵇk ← ∑_{n∈Bb} rnk t(xn)
λᵇk ← λ0 + (N / |Bb|) sᵇk
λk ← ρt λᵇk + (1 − ρt) λk

Apply similar updates to stick weights. Batch statistics give a noisy estimate of the (natural) gradient.

[Plot: learning rate ρt versus num. iterations t for three schedules a, b, c.]

Learning Rate: ρt ≜ (ρ0 + t)^(−κ),   κ ∈ (0.5, 1]

Robbins-Monro convergence condition: ∑_t ρt → ∞,   ∑_t ρt² < ∞
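A minimal sketch of the resulting stochastic global step (the generic SVI recipe; function names and schedule constants here are assumptions, not a specific BNPy routine):

```python
import numpy as np

def robbins_monro_rate(t, rho0=1.0, kappa=0.7):
    """Step size rho_t = (rho0 + t)^(-kappa) with kappa in (0.5, 1]."""
    return (rho0 + t) ** (-kappa)

def svi_global_step(lam, lam0, r_batch, t_batch, N, t):
    """One stochastic natural-gradient update of the global parameters lam.

    r_batch : (Nb, K) responsibilities from the batch E-step
    t_batch : (Nb, D) sufficient statistics t(x_n) for the batch
    """
    Nb = r_batch.shape[0]
    s_b = r_batch.T @ t_batch               # batch sufficient statistics s^b_k
    lam_b = lam0 + (N / Nb) * s_b           # noisy estimate of the full-data update
    rho = robbins_monro_rate(t)
    return rho * lam_b + (1.0 - rho) * lam  # blend toward the noisy estimate
```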

Properties of stochastic inference:

+ Per-iteration cost is low
+ Initial iterations often very effective
− Objective is highly non-convex, so the convergence guarantee is weak
− Batch size and learning rate significantly impact efficiency & accuracy
SLIDE 15

Memoized Variational Inference

Hughes & Sudderth, NIPS 2013; Neal & Hinton 1999

Memoization: storage (caching) of results of previous computations. The dataset is again divided into batches x(B1), x(B2), …, x(BB).

Properties of memoized inference:

+ Per-iteration cost is low
+ Initial iterations often very effective
+ Insensitive to chosen B, no learning rate
+ Foundation for inferring the number of clusters K
− Requires storage proportional to the number of batches (NOT the number of observations)

Update: For each batch b,

r(Bb) ← Estep(x(Bb), α, λ)

For cluster k = 1, 2, …, K:

s⁰k ← s⁰k − sᵇk            (subtract the cached summary of batch b)
sᵇk ← ∑_{n∈Bb} rnk t(xn)   (recompute the batch summary)
s⁰k ← s⁰k + sᵇk            (add the fresh summary back)
λk ← λ0 + s⁰k

Apply similar updates to stick weights. Batch statistics allow exact estimation from partial E-steps.

Batch Summaries and Global Summary:

Each batch b = 1, …, B caches its summaries sᵇ1, sᵇ2, …, sᵇK; the global summary s⁰1, s⁰2, …, s⁰K satisfies

s⁰k = s¹k + s²k + … + sᴮk

The entropy terms needed for L(q) are cached the same way:

Hᵇk = − ∑_{n∈Bb} rnk log rnk,   H⁰k = H¹k + H²k + … + Hᴮk
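A sketch of the bookkeeping this caching implies (illustrative class and variable names, not the BNPy implementation):

```python
import numpy as np

class MemoizedSummaries:
    """Cache per-batch sufficient statistics so the global summary can be
    updated exactly after a partial E-step on a single batch."""

    def __init__(self, n_batches, K, D):
        self.s_batch = np.zeros((n_batches, K, D))   # cached s^b_k for every batch
        self.s_global = np.zeros((K, D))             # s^0_k = sum_b s^b_k

    def update_batch(self, b, r_batch, t_batch):
        """Replace batch b's summary after recomputing its responsibilities."""
        self.s_global -= self.s_batch[b]             # subtract the stale cached summary
        self.s_batch[b] = r_batch.T @ t_batch        # s^b_k = sum_{n in B_b} r_nk t(x_n)
        self.s_global += self.s_batch[b]             # add the fresh one back
        return self.s_global                         # exact full-dataset statistics
```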

SLIDE 16

Memoized Cluster Births

Birth moves proceed in three steps:

1) Create new components: subsample data explained by one current cluster, learn a fresh DP-GMM on the subsample via VB, and add the fresh components to expand the original model.
2) Adopt in one pass through the data: batches not yet updated in this pass do not use any new components.
3) Merge to remove redundancy.

[Diagram: “Before” and “After” memoized summaries (expected count N⁰k and statistics sᵇk of each component 1–7), with the current position among batches 1, …, b, b+1, …, B.]

Principles guiding memoized births:

  • BNP models support rare clusters, so random sampling ineffective
  • Target data grouped by some current cluster (likelihood-independent)
  • Memoized updates allow efficient marginal likelihood verification
SLIDE 17

Memoized Cluster Merges

Merge two clusters into one for parsimony, accuracy, efficiency: φka, φkb → φkm.

  • New cluster takes over all responsibility for data assigned to the old clusters:

rn,km ← rn,ka + rn,kb,   N⁰km ← N⁰ka + N⁰kb,   s⁰km ← s⁰ka + s⁰kb

  • Accept or reject via the exact full-dataset likelihood bound: L(qmerge) > L(q)?

Requires memoized entropy sums for candidate pairs of clusters; more efficient alternatives under development.

  • No batch processing required; efficiently evaluate via memoized statistics:

L(q) = H[r] + ∑_{k=1}^K [ ā(s⁰k + λ0) − ā(λ0) + log B(1 + N⁰k, α + N⁰>k) − log B(1, α) ]
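A sketch of how a merge candidate can be scored from memoized summaries alone (the dictionary layout and the elbo_fn callback are illustrative stand-ins, not BNPy's API):

```python
import numpy as np

def evaluate_merge(current, ka, kb, merged_entropy, elbo_fn):
    """Score a proposal that merges clusters ka and kb.

    current        : dict with full-dataset summaries "N0" (K,), "s0" (K, D), "H0" (K,)
    merged_entropy : cached -sum_n (r_nka + r_nkb) log(r_nka + r_nkb)
    elbo_fn        : computes the bound L(q) from such summaries
    """
    keep = np.arange(len(current["N0"])) != kb
    cand = {name: arr.copy() for name, arr in current.items()}
    cand["N0"][ka] += cand["N0"][kb]        # merged cluster absorbs the counts,
    cand["s0"][ka] += cand["s0"][kb]        # ... the sufficient statistics,
    cand["H0"][ka] = merged_entropy         # ... and the entropy of the summed responsibilities
    cand = {name: arr[keep] for name, arr in cand.items()}
    accept = elbo_fn(cand) > elbo_fn(current)   # exact full-dataset comparison
    return (cand if accept else current), accept
```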

SLIDE 18

Example: Finite Gaussian Mixture

[Table: recovered cluster weights for the worst MO-BM run (all 8 components found, weights ≈ 0.12–0.13), the worst MO run, the best SO run, and the worst Batch run (these miss some components, marked “Not found”, and inflate others to ≈ 0.25).]

  • N=100,000 samples from mixture of 8 Gaussians
  • 25-dim. covariance motivated by 5x5 image patches
  • DP mixture variational approximations allow K=25 clusters

[Plots: log evidence L(q) (×10⁶) versus num. passes through the data (N=100,000). One panel: stochastic variational runs SOa, SOb, SOc at K=25 (rates a, b, c). Other panel: Batch (Full, K=25), memoized (MO, K=25), greedy single-batch merge (GreedyMerge), and memoized birth-merge (MO-BM) from K=1.]

From 10 random initializations:

  • Memoized birth-merge from K=1 finds the true clusters every time
  • Greedy decisions to merge based on single batches collapse the model
  • Stochastic is sensitive to learning rate and initialization

SLIDE 19

Clustering Handwritten Digits

MNIST: 60,000 digits projected to 50 dimensions via PCA.

[Bar charts: final log evidence (×10⁶) with 20 and 100 batches, under K-means++ initialization and under random initialization, for SOa, SOb, SOc (stochastic variational with learning rate schedules a, b, c), Full (batch), MO (memoized), MO-BM (memoized birth-merge), and Kuri (Kurihara: accelerated variational, NIPS 2006).]

SLIDE 20

Clustering Handwritten Digits

MNIST: 60,000 digits projected to 50 dimensions via PCA.

[Plots: many-to-one alignment accuracy of clusters to digit labels versus effective num. components K, alongside the log-evidence bar charts from the previous slide (K-means++ and random initialization, 20 and 100 batches, same methods and learning rate schedules).]

  • Memoized birth-merge from K=1 has the highest accuracy while using fewer clusters

SLIDE 21

MNIST: Variational versus Gibbs

[Plots: Gibbs sampler log-probability and number of clusters over iterations.]

  • Five random initializations from K=1, K=50, K=300 clusters
  • Diagonal-covariance Gaussians (change from previous slides)

[Plots: memoized birth-merge log-likelihood bound and number of clusters; annotated gap in cluster counts: tiny clusters.]

SLIDE 22

Clustering Image Patches

SUN Database of Natural Scene Categories: N=108,754

[Plot: likelihood bound (log evidence ×10⁷) versus num. passes through the data (N=108,754, 25 batches) for SOa, SOb, Full, and MO at K=100, and MO-BM from K=1.]

  • Memoized birth-merge learns a more accurate model with only K=28 clusters

8x8 Image Patches (Berkeley Segmentation): N=1.88 million

[Plots: likelihood bound (log evidence ×10⁸) and number of components K versus num. passes through the data (N=1,880,200) for MO-BM from K=1, MO at K=100, and SOa at K=100.]

  • Memoized birth-merge allows growth in model complexity
  • Effective performance as density model for image denoising
SLIDE 23

Memoized Variational Inference for Hierarchical DP Topic Models

Michael Hughes, Dae Il Kim, & E. Sudderth

SLIDE 24

What are Topic Models?

GOAL: Summarize semantic content of a large document corpus.

“There are reasons to believe that the genetics of an organism are likely to shift due to the extreme changes in our climate. To protect them, our politicians must pass environmental legislation that can protect our future species from becoming extinct…”

Documents are represented as mixtures of “topics” used with varying frequencies.

[Figure: Document 1’s topic frequencies (e.g. 0.5 on one topic) over the “Genetics”, “Climate Change”, and “Politics” topics.]

Topics are categorical distributions on a (typically large) discrete vocabulary:

SLIDE 25

Hierarchical DP Topic Model

Generalization of Latent Dirichlet Allocation (LDA, Blei 2003) by Teh et al., JASA 2006.
Dependent Dirichlet process (DDP, MacEachern 1999) with group-specific weights.

[Graphical model: hyperparameters γ, α, λ0; global topic frequencies β; per-document frequencies πd; topics φk; assignments zdn and words xdn for the Nd words in each of D documents.]

  • Global topic frequencies and parameters:

uk ∼ Beta(1, γ),   βk = uk ∏_{ℓ=1}^{k−1} (1 − uℓ),   φk ∼ Dirichlet(λ0)  (sparse)

  • For each of D documents (groups), topic frequencies πd ∼ DP(αβ), via stick-breaking:

vdk ∼ Beta(αk uk, αk (1 − uk)),   αk = α ∏_{ℓ=1}^{k−1} (1 − uℓ),   πdk = vdk ∏_{ℓ=1}^{k−1} (1 − vdℓ)

(Generalized Dirichlet, Connor & Mosimann 1969)

  • For each of the Nd words in document d:
    Topic assignment: zdn ∼ Cat(πd)
    Observed value: xdn ∼ Cat(φzdn)

SLIDE 26

Variational Learning of HDP Topics


q(zdn) = Cat(zdn | rdn1, rdn2, …, rdnK, 0, 0, 0, …) for some K > 0

Update Document Distributions: for k ≤ K,

rdnk ∝ exp( Eq[log πdk(vd)] + Eq[log p(xdn | φk)] )

Eq[log πdk(vd)] = Eq[log vdk] + ∑_{ℓ=1}^{k−1} Eq[log(1 − vdℓ)]

  • Closed form update for beta stick-breaking weights
  • Local iteration between assignments and weights

Update Global Parameters:

s⁰k ← ∑_{d=1}^D ∑_{n=1}^{Nd} rdnk t(xdn),   q(φk) = Dir(φk | λ0 + s⁰k)

  • Closed form for topic-specific word distributions
  • Beta normalization constants have non-conjugate dependence on topic frequencies; requires an additional bound and numerical optimization

Iterate: Batch, Stochastic, or Memoized
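A simplified sketch of the document-level local step, alternating the assignment and stick-weight updates; the fixed Beta prior parameters handed in from the global step, the iteration count, and all names are assumptions for illustration, and the non-conjugate correction terms mentioned above are ignored:

```python
import numpy as np
from scipy.special import digamma

def hdp_doc_local_step(Elog_lik_d, a_prior, b_prior, n_iters=20):
    """Coordinate ascent for one document's assignments and stick weights.

    Elog_lik_d       : (Nd, K) array of E_q[log p(x_dn | phi_k)]
    a_prior, b_prior : (K,) Beta prior parameters for the document sticks v_dk
                       (playing the role of alpha_k u_k and alpha_k (1 - u_k))
    """
    Nd, K = Elog_lik_d.shape
    r = np.full((Nd, K), 1.0 / K)                                  # responsibilities r_dnk
    for _ in range(n_iters):
        Ndk = r.sum(axis=0)                                        # expected topic counts in the document
        Ngt = np.concatenate((np.cumsum(Ndk[::-1])[-2::-1], [0.0]))  # expected counts for topics > k
        a, b = a_prior + Ndk, b_prior + Ngt                        # closed-form Beta update
        Elog_v = digamma(a) - digamma(a + b)
        Elog_1mv = digamma(b) - digamma(a + b)
        Elog_pi = Elog_v + np.concatenate(([0.0], np.cumsum(Elog_1mv)[:-1]))
        log_r = Elog_pi[None, :] + Elog_lik_d                      # r_dnk propto exp(E[log pi_dk] + E[log p(x|phi_k)])
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
    return r, a, b
```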

SLIDE 27

Analysis of Document Corpora

  • Variational: Memoized versus stochastic rate A, stochastic rate C
  • Baseline: Stochastic variational on “expanded” HDP (Wang et al. 2011)

[Plots: training log-likelihood bound and test log-likelihood (MCMC estimator) for the NIPS Conference corpus (D=1,392) and the Huffington Post corpus (D=3,271).]

SLIDE 28

Stochastic Variational Inference for Hierarchical DP Relational Models

D. Kim, P. Gopalan, D. Blei, & E. Sudderth, 2013 Conference on Neural Information Processing Systems

SLIDE 29

What are Relational Models?

GOAL: Unsupervised community discovery from observed relationships.

[Figure: K=3 edge creation parameter matrix (entries such as .8, .6, .2, …) indexed by source and receiver communities; some links are unobserved.]

Stochastic Block Model: (Wang et al., JASA 1987)

  • Assign each node to one latent block/community
  • Predict edge presence or absence from the block assignments of source and receiver nodes

SLIDE 30

Mixed Membership Blockmodels

Parametric mixed membership stochastic blockmodel, Airoldi et al. JMLR 2008

[Figure: edge creation parameter matrix over community pairs; node membership vectors π1, π2; per-edge assignments s12, r12 and s21, r21; edge indicators y12 (edge not present) and y21 (edge present).]

  • Source community assignment: sij ∼ Cat(πi)
  • Receiver community assignment: rij ∼ Cat(πj)
  • Community link probability (edge creation parameter matrix): φkℓ ∼ Beta(τa, τb)
  • Binary edge indicator: yij ∼ Bern(sij φ rijᵀ)

SLIDE 31

HDP Relational Models

[Graphical model: hyperparameters γ, α, λ0; global community frequencies β; membership vectors πi for N nodes; per-edge assignments sij, rij and binary indicators yij for E node pairs; edge creation parameters φk. The edge-level cartoon is the same as on the previous slide.]

sij ∼ Cat(πi),   rij ∼ Cat(πj),   φkℓ ∼ Beta(τa, τb),   yij ∼ Bern(sij φ rijᵀ)

SLIDE 32

Variational Learning of Relations


Assortative likelihoods: p(yij = 1 | sij ≠ rij) = ε,   p(yij = 1 | sij = rij = k) = φk

O(K) storage & computation for the distribution on K² community pairs.

Stochastic variational: process mini-batches #1, #2, #3, …

Variational pruning: drop community k when Θk < (log K) / N, where Θk = ∑_{i=1}^N Eq[πik]
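A one-function sketch of that pruning rule (function name and array layout are assumptions):

```python
import numpy as np

def prune_communities(E_pi):
    """Drop communities whose aggregate expected usage Theta_k = sum_i E_q[pi_ik]
    falls below (log K) / N. E_pi is an (N, K) array of E_q[pi_ik]."""
    N, K = E_pi.shape
    theta = E_pi.sum(axis=0)
    keep = theta >= np.log(K) / N
    return E_pi[:, keep], np.flatnonzero(keep)
```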

SLIDE 33

Analysis of Collaboration Networks

[Plots: perplexity versus number of observed edges, and AUC quantiles, for two collaboration networks: High Energy Physics (HEP, N=11,204) and Condensed Matter Physics (N=21,363).]

AUC: area under the ROC curve (prediction of held-out links). Perplexity: normalized negative log-probability.

Stochastic inference & model variants:

  • Parametric MMSB (aMMSB, K=250 and K=300 for HEP; K=400 and K=450 for Condensed Matter)
  • HDP naïve (aHDPR-Naive, K=500)
  • HDP blocked (aHDPR, K=500)
  • HDP pruned (aHDPR-Pruning)

SLIDE 34

LittleSis Network: Raw Data

Top 200 degree nodes shown; the full network has N=18,831.

SLIDE 35

LittleSis Network Communities

Colors correspond to highest community memberships. Top 200 degree nodes shown; the full network has N=18,831.

Community Distance: Dij = (1/2) ∑_k |πik − πjk|
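This is half the L1 distance between two nodes' membership vectors; a one-line illustration (not the paper's code):

```python
import numpy as np

def community_distance(pi_i, pi_j):
    """D_ij = 0.5 * sum_k |pi_ik - pi_jk| between two membership vectors."""
    return 0.5 * np.abs(np.asarray(pi_i) - np.asarray(pi_j)).sum()
```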

SLIDE 36

Reliable Variational Learning for Hierarchical Dirichlet Processes

  • Scalable: Large-scale learning via stochastic or memoized updates
  • Reliable: Birth-merge recovers structure informed by model & data, not inference algorithm limitations
  • Flexible: Designed to be broadly applicable: space, time, scale, …

BNPy: Bayesian Nonparametric Learning in Python
Erik Sudderth @ Brown CS: http://cs.brown.edu/~sudderth/