Scalable Machine Learning 10. Distributed Inference and Applications - PowerPoint PPT Presentation

SLIDE 1

Scalable Machine Learning

  • 10. Distributed Inference and Applications

Alex Smola Yahoo! Research and ANU

http://alex.smola.org/teaching/berkeley2012 Stat 260 SP 12

slide-2
SLIDE 2

Outline

  • Latent Dirichlet Allocation
    • Basic model
    • Sampling and efficient implementation
  • Parallel Inference
    • Problem templates
    • Solution templates
  • Applications
slide-3
SLIDE 3

Latent Dirichlet Allocation

slide-4
SLIDE 4

Grouping objects

slide-5
SLIDE 5

Grouping objects

Singapore

slide-6
SLIDE 6

Grouping objects

slide-7
SLIDE 7

Grouping objects

slide-8
SLIDE 8

Grouping objects

airline restaurant university

slide-9
SLIDE 9

Grouping objects

Australia Singapore USA

slide-10
SLIDE 10

Topic Models

USA airline

Singapore airline

Singapore food

USA food

Singapore university

Australia university

slide-11
SLIDE 11

Clustering & Topic Models

  • Clustering: group objects by prototypes
  • Topics: decompose objects into prototypes

slide-13
SLIDE 13

Clustering & Topic Models

  clustering: prior α → cluster probability θ → cluster label y → instance x
  Latent Dirichlet Allocation: prior α → topic probability θ → topic label y → instance x

slide-14
SLIDE 14

Clustering & Topic Models

Documents = membership × cluster/topic distributions

  • clustering: (0, 1) matrix
  • topic model: stochastic matrix
  • LSI: arbitrary matrices

slide-15
SLIDE 15

Topics in text

Latent Dirichlet Allocation; Blei, Ng, Jordan, JMLR 2003

slide-16
SLIDE 16

Mathematical Details

  • Dirichlet prior: p(θ|α) = Γ(Σ_i α_i) / Π_i Γ(α_i) · Π_i θ_i^(α_i − 1)
  • Topic label: p(y|θ) = θ_y
  • Word probability: p(x|φ, y) = φ_{x,y}
  • Word model via Φ with prior β

(graphical model: prior α → topic probability θ → topic label y → instance x, with word model Φ drawn from prior β)
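The three distributions above define LDA's generative story: draw θ ∼ Dir(α) once per document, then for each word draw a topic y ∼ θ and a word x ∼ φ_y. A minimal plain-Python sketch of that story (illustrative code, not from the deck; the helper names and vocabulary sizes are made up):

```python
import random

def sample_dirichlet(alpha, rng=random):
    # Dirichlet draw via normalized Gamma variates
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]

def sample_categorical(p, rng=random):
    r, acc = rng.random(), 0.0
    for i, pi in enumerate(p):
        acc += pi
        if r < acc:
            return i
    return len(p) - 1

def generate_document(alpha, phi, n_words):
    # phi[t][w] = p(word w | topic t); one theta ~ Dir(alpha) per document
    theta = sample_dirichlet(alpha)
    doc = []
    for _ in range(n_words):
        y = sample_categorical(theta)   # topic label y ~ theta
        x = sample_categorical(phi[y])  # word x ~ phi_{., y}
        doc.append((y, x))
    return doc
```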

slide-17
SLIDE 17

Collapsed Inference

  • Dirichlet prior p(θ|α) = Γ(Σ_i α_i) / Π_i Γ(α_i) · Π_i θ_i^(α_i − 1): integrate out θ
  • Topic label: p(y|θ) = θ_y
  • Word probability: p(x|φ, y) = φ_{x,y}
  • Word model via Φ with prior β: integrate out Φ

slide-18
SLIDE 18

Collapsed Inference

  • Collapsed Dirichlet

for topics

  • Collapsed Dirichlet

for words (need to partition between topics)

(graphical model: α → topic label y → instance x, with word prior β; both parts exchangeable)

p(y_{d1}, …, y_{dm} | α) and Π_{j=1}^{k} p(W | y_{id} = j, β)

slide-19
SLIDE 19

Recall - collapsing exponentials

  • Conjugate priors

Hence we know how to compute normalization

  • Prediction

p(θ) ∝ p(X_fake|θ)

p(x|X) = ∫ p(x|θ) p(θ|X) dθ ∝ ∫ p(x|θ) p(X|θ) p(X_fake|θ) dθ = ∫ p({x} ∪ X ∪ X_fake | θ) dθ

look up closed form expansions

(Beta, binomial) (Dirichlet, multinomial) (Gamma, Poisson) (Wishart, Gauss)

http://en.wikipedia.org/wiki/Exponential_family

slide-20
SLIDE 20

Collapsing exponentials

  • Conjugate prior
  • Posterior
  • Computing the normalization yields

p(θ) = exp(m_0 ⟨µ_0, θ⟩ − m_0 g(θ) − h(m_0 µ_0, m_0))

p(θ|X) ∝ Π_{i=1}^{m} p(x_i|θ) p(θ) = exp(⟨m_0 µ_0 + Σ_{i=1}^{m} φ(x_i), θ⟩ − (m_0 + m) g(θ) − h(m_0 µ_0, m_0))

p(X|µ_0, m_0) = exp(h(m_0 µ_0 + m µ[X], m_0 + m) − h(m_0 µ_0, m_0))

slide-21
SLIDE 21

Application to Dirichlet prior

  • Normalization
  • In mean (not natural) parameters ...
  • Change in normalization is the Laplace smoother

p(X|µ_0, m_0) = exp(h(m_0 µ_0 + m µ[X], m_0 + m) − h(m_0 µ_0, m_0))

p(θ|α) = Γ(Σ_i α_i) / Π_i Γ(α_i) · Π_i θ_i^(α_i − 1)

h(α) = Σ_i log Γ(α_i) − log Γ(Σ_i α_i)

exp(h(α ∪ X) − h(α)) yields the Laplace-smoothed ratio (α_i + n_i) / Σ_i (α_i + n_i)
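The Laplace-smoother identity is easy to check numerically: the predictive probability is a ratio of the two normalizers h(·). A small sketch using Python's `lgamma` (illustrative code, not from the deck):

```python
from math import lgamma, exp

def log_norm(alpha):
    # h(alpha) = sum_i log Gamma(alpha_i) - log Gamma(sum_i alpha_i)
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))

def predictive(alpha, counts, i):
    # p(x = i | X) = exp(h(alpha + counts + e_i) - h(alpha + counts))
    post = [a + n for a, n in zip(alpha, counts)]
    bumped = list(post)
    bumped[i] += 1
    return exp(log_norm(bumped) - log_norm(post))

alpha, counts = [0.5, 0.5, 0.5], [3, 0, 1]
# matches the Laplace smoothed ratio (alpha_i + n_i) / sum_j (alpha_j + n_j)
assert abs(predictive(alpha, counts, 0) - (0.5 + 3) / (1.5 + 4)) < 1e-12
```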

slide-22
SLIDE 22

Collapsed Inference

  • Collapsed Dirichlet for topics
  • Collapsed Dirichlet for words (need to partition between topics)
  • Unnormalized product (n are counters for y):

(n^{−ij}(t, w) + β_t) / (n^{−i}(t) + Σ_t β_t) · (n^{−ij}(t, d) + α_t) / (n^{−i}(d) + Σ_t α_t)

(graphical model: α → topic label y → instance x, with word prior β; both parts exchangeable)

slide-23
SLIDE 23

Acceleration

  • For most words only few topics are relevant
  • Normalization requires a sum over all topics regardless
  • Exploit sparsity in n

(n^{−ij}(t, w) + β_t) / (n^{−i}(t) + Σ_t β_t) · (n^{−ij}(t, d) + α_t) / (n^{−i}(d) + Σ_t α_t)

slide-24
SLIDE 24

Gibbs Sampler (Griffiths & Steyvers)

  • For 1000 iterations do
  • For each document do
  • For each word in the document do
  • Resample topic for the word
  • Update local (document, topic) table
  • Update CPU local (word, topic) table
  • Update global (word, topic) table
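The loop above is the standard collapsed Gibbs sampler. A compact single-threaded toy version in plain Python (an illustrative sketch, not the deck's implementation; the function and variable names are made up):

```python
import random

def gibbs_lda(docs, V, T, alpha, beta, iters=200, seed=0):
    """Collapsed Gibbs sampler for LDA (Griffiths & Steyvers style).
    docs: list of lists of word ids in [0, V). Returns topic assignments."""
    rng = random.Random(seed)
    n_tw = [[0] * V for _ in range(T)]        # (word, topic) table
    n_t = [0] * T                             # topic totals
    n_dt = [[0] * T for _ in docs]            # (document, topic) table
    z = [[rng.randrange(T) for _ in d] for d in docs]
    for d, doc in enumerate(docs):            # random initialization
        for i, w in enumerate(doc):
            t = z[d][i]
            n_tw[t][w] += 1; n_t[t] += 1; n_dt[d][t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                   # remove current assignment
                n_tw[t][w] -= 1; n_t[t] -= 1; n_dt[d][t] -= 1
                # p(t) ∝ (n(t,w)+beta)/(n(t)+V*beta) * (n(d,t)+alpha)
                weights = [(n_tw[k][w] + beta) / (n_t[k] + V * beta)
                           * (n_dt[d][k] + alpha) for k in range(T)]
                r = rng.random() * sum(weights)
                for k, wk in enumerate(weights):
                    r -= wk
                    if r <= 0:
                        t = k
                        break
                else:
                    t = T - 1
                z[d][i] = t                   # add new assignment
                n_tw[t][w] += 1; n_t[t] += 1; n_dt[d][t] += 1
    return z
```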
slide-25
SLIDE 25

Gibbs Sampler (Griffiths & Steyvers)

this kills parallelism

  • For 1000 iterations do
  • For each document do
  • For each word in the document do
  • Resample topic for the word
  • Update local (document, topic) table
  • Update CPU local (word, topic) table
  • Update global (word, topic) table
SLIDE 26-30

State of the art (UMass Mallet, UC Irvine, Google): the resampling step decomposes as

p(t|w_ij) ∝ α_t β_w / (n(t) + β̄) + n(t, d=i) β_w / (n(t) + β̄) + n(t, w=w_ij) (n(t, d=i) + α_t) / (n(t) + β̄)

(first term changes slowly, second changes rapidly, third moderately fast)

Keeping the (word, topic) table in sync is blocking, network bound, and memory inefficient: the table is out of sync between machines.

slide-31
SLIDE 31

Fully asynchronous sampler

  • For 1000 iterations do (independently per computer)
  • For each thread/core do
  • For each document do
  • For each word in the document do
  • Resample topic for the word
  • Update local (document, topic) table
  • Generate computer local (word, topic) message
  • In parallel update local (word, topic) table
  • In parallel update global (word, topic) table
SLIDE 32-35

Fully asynchronous sampler (annotations)

  • old design: blocking, network bound, memory inefficient; table out of sync
  • new design: continuous synchronization, barrier free
  • concurrent use of cpu, hdd, and network
  • minimal view of the global state
slide-36
SLIDE 36

Multicore Architecture

  • Decouple multithreaded sampling and updating; (almost) avoids stalling for locks in the sampler
  • Joint state table
  • much less memory required
  • samplers synchronized (10 docs vs. millions delay)
  • Hyperparameter update via stochastic gradient descent
  • No need to keep documents in memory (streaming)

(pipeline: tokens/topics file → samplers → combiner → count updater → joint state table → diagnostics & optimization → output to file; built on Intel Threading Building Blocks)

slide-37
SLIDE 37

Cluster Architecture

  • Distributed (key,value) storage via memcached
  • Background asynchronous synchronization
  • single word at a time to avoid deadlocks
  • no need to have joint dictionary
  • uses disk, network, cpu simultaneously

(figure: samplers on each machine synchronized through distributed storage)

slide-38
SLIDE 38

Cluster Architecture

  • Distributed (key,value) storage via ICE
  • Background asynchronous synchronization
  • single word at a time to avoid deadlocks
  • no need to have joint dictionary
  • uses disk, network, cpu simultaneously

(figure: samplers on each machine synchronized through ICE)

slide-39
SLIDE 39

Making it work

  • Startup
  • Randomly initialize topics on each node

(read from disk if already assigned - hotstart)

  • Sequential Monte Carlo for startup much faster
  • Aggregate changes on the fly
  • Failover
  • State constantly being written to disk

(worst case we lose 1 iteration out of 1000)

  • Restart via standard startup routine
  • Achilles heel: need to restart from checkpoint if even

a single machine dies.

slide-40
SLIDE 40

Easily extensible

  • Better language model (topical n-grams)

can process millions of users (vs 1000s)

  • Conditioning on side information (upstream)

estimate topic based on authorship, source, joint user model ...

  • Conditioning on dictionaries (downstream)

integrate topics between different languages

  • Time dependent sampler for user model

approximate inference per episode

slide-41
SLIDE 41

Alternatives

slide-42
SLIDE 42

V1 - Brute force maximization

  • Integrate out latent parameters θ and ψ
  • Discrete maximization problem in Y: maximize p(X, Y|α, β)
  • Hard to implement
  • Overfits a lot (the mode is not a typical sample)
  • Parallelization infeasible

Hal Daume; Joey Gonzalez

slide-43
SLIDE 43

V2 - Brute force maximization

  • Integrate out latent parameters y
  • Continuous nonconvex optimization problem in θ and ψ: maximize p(X, ψ, θ|α, β)
  • Solve by stochastic gradient descent over documents
  • Easy to implement
  • Does not overfit much
  • Great for small datasets
  • Parallelization difficult/impossible
  • Memory storage/access is O(T W) (this breaks for large models)
  • 1M words, 1000 topics = 4GB
  • Per document 1 MFlops/iteration

Hoffmann, Blei, Bach (in VW)

slide-44
SLIDE 44
V3 - Variational approximation

  • Approximate intractable joint distribution by tractable factors
  • Alternating convex optimization problem
  • Dominant cost is matrix-matrix multiply
  • Easy to implement
  • Great for small topics/vocabulary
  • Parallelization easy (aggregate statistics)
  • Memory storage is O(T W) (this breaks for large models)
  • Model not quite as good as sampling

Blei, Ng, Jordan

log p(x) ≥ log p(x) − D(q(y) ‖ p(y|x)) = ∫ dq(y) [log p(x) + log p(y|x) − log q(y)] = ∫ dq(y) log p(x, y) + H[q]

slide-45
SLIDE 45

V4 - Uncollapsed Sampling

  • Sample y_ij | rest: can be done in parallel
  • Sample θ | rest and ψ | rest: can be done in parallel
  • Compatible with MapReduce (only aggregate statistics)
  • Easy to implement
  • Children can be conditionally independent*
  • Memory storage is O(T W) (this breaks for large models)
  • Mixes slowly

*for the right model

slide-46
SLIDE 46
V5 - Collapsed Sampling

  • Integrate out latent parameters θ and ψ
  • Sample one topic assignment y_ij | X, Y_−ij at a time from

  (n^{−ij}(t, w) + β_t) / (n^{−i}(t) + Σ_t β_t) · (n^{−ij}(t, d) + α_t) / (n^{−i}(d) + Σ_t α_t)

  • Fast mixing
  • Easy to implement
  • Memory efficient
  • Parallelization infeasible (variables lock each other)

Griffiths & Steyvers 2005


slide-48
SLIDE 48
V6 - Approximating the Distribution

  • Collapsed sampler per machine
  • Defer synchronization between machines
  • no problem for n(t)
  • big problem for n(t, w)
  • Easy to implement
  • Can be memory efficient
  • Easy parallelization
  • Mixes slowly / worse likelihood

Asuncion, Smyth, Welling, ... (UCI); Mimno, McCallum, ... (UMass)

(n^{−ij}(t, w) + β_t) / (n^{−i}(t) + Σ_t β_t) · (n^{−ij}(t, d) + α_t) / (n^{−i}(d) + Σ_t α_t)

slide-49
SLIDE 49
V7 - Better Approximations of the Distribution

  • Collapsed sampler
  • Make local copies of state
  • Implicit for multicore (delayed updates from samplers)
  • Explicit copies for multi-machine
  • Not a hierarchical model (Welling, Asuncion, et al. 2008)
  • Memory efficient (only need to view its own sufficient statistics)
  • Multicore / multi-machine
  • Convergence speed depends on synchronizer quality

Smola and Narayanamurthy, 2009; Ahmed, Gonzalez, et al., 2012

(n^{−ij}(t, w) + β_t) / (n^{−i}(t) + Σ_t β_t) · (n^{−ij}(t, d) + α_t) / (n^{−i}(d) + Σ_t α_t)

slide-50
SLIDE 50
V8 - Sequential Monte Carlo

  • Integrate out latent θ and ψ
  • Chain conditional probabilities:

  p(X, Y|α, β) = Π_{i=1}^{m} p(x_i, y_i | x_1, y_1, …, x_{i−1}, y_{i−1}, α, β)

  • For each particle sample

  y_i ∼ p(y_i | x_i, x_1, y_1, …, x_{i−1}, y_{i−1}, α, β)

  • Reweight particle by next-step data likelihood

  p(x_{i+1} | x_1, y_1, …, x_i, y_i, α, β)

  • Resample particles if weight distribution is too uneven

Canini, Shi, Griffiths, 2009; Ahmed et al., 2011
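The "resample if the weight distribution is too uneven" step can be sketched as a small multinomial-resampling helper triggered by a low effective sample size (an illustrative sketch, not from the deck; the N/2 threshold is a common but assumed choice):

```python
import bisect
import random

def smc_resample(particles, weights, rng=random):
    """Multinomial resampling for an SMC sampler: when the normalized weights
    become too uneven (low effective sample size), redraw an equally weighted
    particle set."""
    ess = 1.0 / sum(w * w for w in weights)   # effective sample size
    if ess >= len(particles) / 2:             # assumed threshold: N/2
        return particles, weights             # still balanced, keep as is
    cum, acc = [], 0.0
    for w in weights:                         # cumulative weights for sampling
        acc += w
        cum.append(acc)
    new = []
    for _ in particles:
        idx = bisect.bisect_left(cum, rng.random() * cum[-1])
        new.append(particles[min(idx, len(cum) - 1)])
    return new, [1.0 / len(particles)] * len(particles)

# heavily skewed weights trigger a resample into equal weights
parts, ws = smc_resample(["a", "b", "c", "d"], [0.97, 0.01, 0.01, 0.01])
assert len(parts) == 4 and abs(sum(ws) - 1.0) < 1e-12
```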

slide-51
SLIDE 51
V8 - Sequential Monte Carlo (continued)

  • One pass through data
  • Data sequential; parallelization is an open problem
  • Nontrivial to implement
  • Sampler is easy
  • Inheritance tree through particles is messy
  • Need to estimate data likelihood (integration over y), e.g. as part of the sampler
  • This is a multiplicative update algorithm with log loss ...

slide-52
SLIDE 52

Summary of alternatives

Optimization
  • Uncollapsed: overfits, too costly
  • Variational approximation: easy parallelization, big memory footprint
  • Collapsed natural parameters: overfits, too costly; easy to optimize, big memory footprint, difficult parallelization

Sampling
  • Uncollapsed: slow mixing, conditionally independent
  • Variational approximation: n.a.
  • Collapsed natural parameters: fast mixing, difficult parallelization; approximate inference by delayed updates
  • Collapsed topic assignments: particle filtering, sequential sampling, difficult

slide-53
SLIDE 53

Parallel Inference

slide-54
SLIDE 54

3 Problems

(figure: data with cluster IDs; clusters with mean, variance, and cluster weight)

slide-55
SLIDE 55

3 Problems

global state data

local state

slide-56
SLIDE 56

3 Problems

  • global state: too big for a single machine
  • data: huge
  • local state: only local
slide-57
SLIDE 57

3 Problems

data local state global state

Vanilla LDA User profiling

global state


slide-59
SLIDE 59

3 Problems

  • global state is too large: does not fit into memory; network load & barriers
  • local state is too large: does not fit into memory

Solutions:

  • stream local data from disk
  • asynchronous synchronization
  • partial view of the global state
slide-63
SLIDE 63

Distribution

global replica

rack cluster


slide-65
SLIDE 65

Synchronization

  • Child updates local state
  • Start with common state
  • Child stores old and new state
  • Parent keeps global state
  • Transmit differences asynchronously
  • Inverse element for difference
  • Abelian group for commutativity (sum, log-sum, cyclic group, exponential families)

local to global: δ ← x − x_old; x_old ← x; x_global ← x_global + δ

global to local: x ← x + (x_global − x_old); x_old ← x_global
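These update rules can be sketched directly; here a plain dict stands in for the distributed (key,value) store, and `SyncedVar` is a hypothetical name (illustrative code, not the deck's implementation):

```python
class SyncedVar:
    """Local replica of a globally shared counter. `store` is a plain dict
    standing in for the distributed (key,value) storage."""
    def __init__(self, store, key):
        self.store, self.key = store, key
        self.x = store.get(key, 0)       # local working copy
        self.x_old = self.x              # snapshot at last sync

    def local_to_global(self):
        delta = self.x - self.x_old                                  # delta <- x - x_old
        self.x_old = self.x                                          # x_old <- x
        self.store[self.key] = self.store.get(self.key, 0) + delta   # x_global += delta

    def global_to_local(self):
        g = self.store.get(self.key, 0)
        self.x += g - self.x_old         # x <- x + (x_global - x_old)
        self.x_old = g                   # x_old <- x_global

store = {}
a, b = SyncedVar(store, "n"), SyncedVar(store, "n")
a.x += 5; b.x += 3                       # independent local updates on two "machines"
a.local_to_global(); b.local_to_global()
a.global_to_local(); b.global_to_local()
assert store["n"] == 8 and a.x == 8 and b.x == 8
```

Because the updates are additive (an Abelian group, as the slide notes), the two replicas converge to the same total regardless of the order in which their deltas reach the global store.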

slide-66
SLIDE 66

Synchronization

  • Naive approach (dumb master)
  • Global is only (key,value) storage
  • Local node needs to lock/read/write/unlock master
  • Needs 4 TCP/IP round trips - latency bound
  • Better solution (smart master)
  • Client sends message to master / in queue / master incorporates it
  • Master sends message to client / in queue / client incorporates it
  • Bandwidth bound (>10x speedup in practice)


slide-67
SLIDE 67

Distribution

  • Dedicated server for variables
  • Insufficient bandwidth (hotspots)
  • Insufficient memory
  • Select server e.g. via consistent hashing

m(x) = argmin_{m ∈ M} h(x, m)
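The selection rule m(x) = argmin_{m∈M} h(x, m) is rendezvous-style consistent hashing; a small sketch (hypothetical helper, not the deck's code):

```python
import hashlib

def server_for(key, machines):
    """Pick the storage server for `key` as m(x) = argmin over machines of h(x, m).
    Rendezvous (highest-random-weight) variant of consistent hashing."""
    def h(x, m):
        return int(hashlib.md5(f"{x}|{m}".encode()).hexdigest(), 16)
    return min(machines, key=lambda m: h(key, m))

machines = ["node0", "node1", "node2", "node3"]
assert server_for("topic:word:42", machines) in machines
# removing one machine only remaps the keys that machine owned
for k in ["alpha", "beta", "gamma"]:
    owner = server_for(k, machines)
    if owner != "node3":
        assert server_for(k, [m for m in machines if m != "node3"]) == owner
```

This is what avoids hotspots: keys spread evenly over machines, and when a machine joins or leaves, only the keys it owns move.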

slide-68
SLIDE 68

Distribution & fault tolerance

  • Storage is O(1/k) per machine
  • Communication is O(1) per machine
  • Fast snapshots O(1/k) per machine (stop sync and dump state per vertex)
  • O(k) open connections per machine
  • O(1/k) throughput per machine

m(x) = argmin_{m ∈ M} h(x, m)


slide-71
SLIDE 71

Synchronization

  • Data rate between machines is O(1/k)
  • Machines operate asynchronously (barrier free)
  • Solution
  • Schedule message pairs
  • Communicate with r random machines simultaneously

local global r=1

slide-74
SLIDE 74

Synchronization

  • Data rate between machines is O(1/k)
  • Machines operate asynchronously (barrier free)
  • Solution
  • Schedule message pairs
  • Communicate with r random machines simultaneously

local global r=2

0.78 < eff. < 0.89

slide-75
SLIDE 75

Synchronization

  • Data rate between machines is O(1/k)
  • Machines operate asynchronously (barrier free)
  • Solution
  • Schedule message pairs
  • Communicate with r random machines simultaneously
  • Use a Luby-Rackoff pseudo-random permutation generator (PRPG) for load balancing
  • Efficiency guarantee

4 simultaneous connections are sufficient

slide-76
SLIDE 76

Scalability

slide-77
SLIDE 77

Summary Variable Replication

  • Global shared variable
  • Make local copy
  • Distributed (key,value) storage table for global copy
  • Do all bookkeeping locally (store old versions)
  • Sync local copies asynchronously using message passing

(no global locks are needed)

  • This is an approximation!

(figure: each computer keeps local copies x, y', z and synchronizes them with the global x, y, z)

slide-78
SLIDE 78

Summary Asymmetric Message Passing

  • Large global shared state space

(essentially as large as the memory in computer)

  • Distribute global copy over several machines

(distributed key,value storage)

(figure: old copy and current copy on each machine, synchronized against the distributed global state)

slide-79
SLIDE 79

Summary Out of core storage

  • Very large state space
  • Gibbs sampling requires us to traverse the data sequentially many

times (think 1000x)

  • Stream local data from disk and update coupling variable each

time local data is accessed

  • This is exact

(figure: coupling variables x, y, z; pipeline: tokens/topics file → samplers → combiner → count updater → diagnostics & optimization → output to file)

slide-80
SLIDE 80

Advanced Modeling

slide-81
SLIDE 81

Advances in Representation

slide-82
SLIDE 82
  • Prior over document topic vector
  • Usually as Dirichlet distribution
  • Use correlation between topics (CTM)
  • Hierarchical structure over topics
  • Document structure
  • Bag of words
  • n-grams (Li & McCallum)
  • Simplical Mixture (Girolami & Kaban)
  • Side information
  • Upstream conditioning (Mimno & McCallum)
  • Downstream conditioning (Petterson et al.)
  • Supervised LDA (Blei and McAuliffe 2007; Lacoste, Sha and Jordan 2008; Zhu, Ahmed and Xing 2009)

Extensions to topic models

slide-83
SLIDE 83
  • Dirichlet distribution
  • Can only model which topics are hot
  • Does not model relationships between topics
  • Key idea
  • We expect to see documents about sports and

health but not about sports and politics

  • Uses a logistic normal distribution as a prior
  • Conjugacy is no longer maintained
  • Inference is harder than in LDA

Blei & Lafferty 2005; Ahmed & Xing 2007

Correlated topic models


slide-85
SLIDE 85

Dirichlet prior on topics

slide-86
SLIDE 86

Log-normal prior on topics

θ = e^{η − g(η)} with η ∼ N(µ, Σ) and g(η) = log Σ_i e^{η_i}
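Sampling topic proportions from this logistic-normal prior is straightforward; a sketch with a diagonal covariance for brevity (the CTM uses a full Σ to capture topic correlations; illustrative code, not from the paper):

```python
import math
import random

def logistic_normal(mu, sigma, rng=random):
    """theta = softmax(eta) with eta ~ N(mu, diag(sigma^2))."""
    eta = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    mx = max(eta)                            # stabilized softmax = e^{eta - g(eta)}
    ex = [math.exp(e - mx) for e in eta]
    z = sum(ex)
    return [v / z for v in ex]

theta = logistic_normal([0.0, 1.0, -1.0], [0.5, 0.5, 0.5])
assert abs(sum(theta) - 1.0) < 1e-12 and all(t > 0 for t in theta)
```

Unlike the Dirichlet, correlated Gaussian coordinates let "sports" and "health" co-occur more often than "sports" and "politics", at the price of losing conjugacy.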

slide-87
SLIDE 87

Blei and Lafferty 2005

Correlated topics

slide-88
SLIDE 88

Correlated topics

slide-89
SLIDE 89
  • Model the prior as a Directed Acyclic Graph
  • Each document is modeled as multiple paths
  • To sample a word, first select a path and then

sample a word from the final topic

  • The topics reside on the leaves of the tree

Li and McCallum 2006

Pachinko Allocation

slide-90
SLIDE 90

Li and McCallum 2006

Pachinko Allocation

slide-91
SLIDE 91
  • Topics can appear anywhere in the tree
  • Each document is modeled as
  • Single path over the tree (Blei et al., 2004)
  • Multiple paths over the tree (Mimno et al.,2007)

Topic Hierarchies

slide-92
SLIDE 92

Blei et al. 2004

Topic Hierarchies

slide-93
SLIDE 93
  • Documents as bag of words
  • Exploit sequential structure
  • N-gram models
  • Capture longer phrases
  • Switch variables to

determine segments

  • Dynamic programming

needed

Topical n-grams

Girolami & Kaban, 2003; Wallach, 2006; Wang & McCallum, 2007

slide-94
SLIDE 94

Topic n-grams

slide-95
SLIDE 95

Side information

  • Upstream conditioning (Mimno et al., 2008)
  • Document features are informative for topics
  • Estimate topic distribution e.g. based on authors, links,

timestamp

  • Downstream conditioning (Petterson et al., 2010)
  • Word features are informative on topics
  • Estimate topic distribution for words e.g. based on dictionary,

lexical similarity, distributional similarity

  • Class labels (Blei and McAuliffe 2007; Lacoste, Sha and Jordan 2008; Zhu, Ahmed and Xing 2009)

  • Joint model of unlabeled data and labels
  • Joint likelihood - semisupervised learning done right!
slide-96
SLIDE 96

Downstream conditioning

Europarl corpus without alignment

slide-97
SLIDE 97

Recommender Systems

Agarwal & Chen, 2010

slide-98
SLIDE 98

Chinese Restaurant Process

φ2 φ1 φ3

slide-99
SLIDE 99

Problem

  • How many clusters should we pick?
  • How about a prior for infinitely many clusters?
  • Finite model
  • Infinite model

Assume that the total smoother weight is constant.

Finite model: p(y|Y, α) = (n(y) + α_y) / (n + Σ_{y′} α_{y′})

Infinite model: p(y|Y, α) = n(y) / (n + α) and p(new|Y, α) = α / (n + α)
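The infinite model's predictive rule is exactly the Chinese Restaurant Process: existing clusters attract new points in proportion to their size, and a new cluster opens with weight α. A small sketch (illustrative, not the deck's code):

```python
import random

def crp_partition(n, alpha, seed=0):
    """Chinese Restaurant Process: customer i sits at table j with probability
    m_j / (i + alpha) and opens a new table with probability alpha / (i + alpha)."""
    rng = random.Random(seed)
    counts, tables = [], []
    for i in range(n):
        r = rng.random() * (i + alpha)
        for j, m in enumerate(counts):
            r -= m
            if r < 0:
                counts[j] += 1               # the rich get richer
                tables.append(j)
                break
        else:
            counts.append(1)                 # open a new table
            tables.append(len(counts) - 1)
    return tables

t = crp_partition(100, alpha=2.0)
assert len(t) == 100 and t[0] == 0
assert max(t) + 1 == len(set(t))             # table labels are contiguous
```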

slide-100
SLIDE 100

Chinese Restaurant Metaphor

Generative process: for data point x_i
  • Choose table j ∝ m_j and sample x_i ~ f(φ_j)
  • Or choose a new table K+1 ∝ α
  • Sample φ_{K+1} ~ G_0 and sample x_i ~ f(φ_{K+1})

the rich get richer

Pitman; Antoniak; Ishwaran; Jordan et al.; Teh et al.

slide-101
SLIDE 101

Evolutionary Clustering

  • Time series of objects, e.g. news stories
  • Stories appear / disappear
  • Want to keep track of clusters automatically
slide-102
SLIDE 102

Recurrent Chinese Restaurant Process

T=1: tables φ_{1,1}, φ_{2,1}, φ_{3,1}; counts carried to T=2 as m′_{1,1}=2, m′_{2,1}=3, m′_{3,1}=1

SLIDE 103-107

Recurrent Chinese Restaurant Process (animation build): at T=2, tables are seeded from T=1, e.g. sample φ_{1,2} ~ P(·|φ_{1,1})
SLIDE 108

Recurrent Chinese Restaurant Process

T=1: φ_{1,1}, φ_{2,1}, φ_{3,1} with m′_{1,1}=2, m′_{2,1}=3, m′_{3,1}=1
T=2: φ_{1,2}, φ_{2,2}, φ_{4,2}; φ_{3,1} becomes a dead cluster, φ_{4,2} is a new cluster

slide-109
SLIDE 109

Longer History

T=1: φ_{1,1}, φ_{2,1}, φ_{3,1} (m′_{1,1}=2, m′_{2,1}=3, m′_{3,1}=1)
T=2: φ_{1,2}, φ_{2,2}, φ_{4,2}
T=3: φ_{1,2}, φ_{2,2}, φ_{4,2} with history-weighted counts, e.g. m′_{2,3}

slide-110
SLIDE 110

TDPM Generative Power

  • W = ∞, λ = ∞: DPM
  • W = 4, λ = 0.4: TDPM
  • W = 0, λ arbitrary: independent DPMs

power law

slide-111
SLIDE 111

User modeling

(figure: proportion of user activity per day over roughly 40 days for the topics Baseball, Finance, Jobs, Dating)

slide-112
SLIDE 112

Buying a camera

time

SLIDE 113-115

Buying a camera

(timeline: interest builds over time; "show ads now" at the peak, "too late" afterwards)
slide-116
SLIDE 116

Car Deals van job Hiring diet Hiring Salary Diet calories Auto Price Used inspection Flight London Hotel weather Diet Calories Recipe chocolate Movies Theatre Art gallery School Supplies Loan college

SLIDE 118

The same query streams clustered into intents: Cars, Art, Diet, Jobs, Travel, College finance

slide-119
SLIDE 119

Input

  • Queries issued by the user or tags of watched content
  • Snippet of page examined by user
  • Time stamp of each action (day resolution)

Output

  • Users' daily distribution over intents
  • Dynamic intent representation

User modeling

slide-120
SLIDE 120

Time dependent models

  • LDA for topical model of users where
  • User interest distribution changes over time
  • Topics change over time
  • This is like a Kalman filter except that
  • Don’t know what to track (a priori)
  • Can’t afford a Rauch-Tung-Striebel smoother
  • Much more messy than plain LDA
slide-121
SLIDE 121

Graphical Model

(graphical model: plain LDA has α → θ_i → z_ij → w_ij ← φ_k ← β; the time dependent version chains α^{t−1}, α^t, α^{t+1} over the user interest θ^t_i and β^{t−1}, β^t, β^{t+1} over the actions-per-topic φ^t_k, with user actions w_ij)

slide-122
SLIDE 122

job ¡ Career Business Assistant Hiring Part-­‑=me Recep=onist Car Blue Book Kelley Prices Small Speed large Bank Online Credit Card debt ¡ porZolio Finance Chase Recipe Chocolate Pizza Food Chicken Milk Bu\er Powder

Time ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡t ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡t+1 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡

Food Chicken pizza recipe job hiring Part-­‑=me Opening salary food chicken Pizza millage Kelly recipe cuisine

Diet Cars Job Finance Prior ¡for ¡user ¡ ac=ons ¡at ¡=me ¡t


slide-127
SLIDE 127

[Figure: the same per-topic word clouds at times t and t+1. The prior for a
user’s actions at time t combines an all-time (long-term), month, and week
(short-term) window, weighted geometrically (μ, μ², μ³).]
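The layered prior above can be sketched in a few lines. This is a minimal illustration, assuming the slide's geometric weights μ, μ², μ³ apply to the week, month, and all-time counts respectively; the function name, the count dictionaries, and the smoothing constant `alpha` are all hypothetical:

```python
def intent_prior(all_counts, month_counts, week_counts, mu=0.5, alpha=0.1):
    """Blend long- and short-term topic counts into a pseudo-count prior.

    The weights mu, mu**2, mu**3 mirror the slide's geometric decay across
    time scales; which window gets which power is an assumption here.
    """
    topics = set(all_counts) | set(month_counts) | set(week_counts)
    prior = {}
    for t in topics:
        prior[t] = (alpha
                    + mu * week_counts.get(t, 0)        # short-term window
                    + mu**2 * month_counts.get(t, 0)    # medium-term window
                    + mu**3 * all_counts.get(t, 0))     # long-term window
    return prior
```

A user who clicked Food articles all year but Jobs articles this week would thus get a prior that still favors Food while letting Jobs rise quickly.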


slide-131
SLIDE 131

[Figure: story word clouds at times t and t+1 (Food/Chicken, Pizza/mileage,
Car speed offer, Camry/Accord/career), linked by short-term priors to the
per-topic word clouds for Jobs, Cars, Finance, and Food.]

Generative Process

  • For each user interaction:
  • Choose an intent from the local distribution
  • Sample a word from that topic’s word distribution
  • Or choose a new intent with probability ∝ α:
  • Sample the new intent from the global distribution
  • Sample a word from the new topic’s word distribution
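The generative process above is CRP-like: an existing intent is reused with probability proportional to its local count, and a fresh intent is drawn from the global distribution with probability proportional to α. A minimal sketch, with all names and distributions hypothetical:

```python
import random

def generate_actions(n, alpha, global_topics, word_dist, rng=random.Random(0)):
    """CRP-style sketch of the slide's generative process.

    global_topics: topic -> global probability weight.
    word_dist: topic -> {word: probability} word distribution.
    local: this user's intent counts, grown as actions are generated.
    """
    local, actions = {}, []
    for _ in range(n):
        total = sum(local.values()) + alpha
        r = rng.uniform(0, total)
        intent = None
        for k, c in local.items():          # reuse an intent w.p. ∝ its count
            r -= c
            if r < 0:
                intent = k
                break
        if intent is None:                  # w.p. ∝ alpha: draw from global dist
            intent = rng.choices(list(global_topics),
                                 weights=list(global_topics.values()))[0]
        local[intent] = local.get(intent, 0) + 1
        word = rng.choices(list(word_dist[intent]),
                           weights=list(word_dist[intent].values()))[0]
        actions.append((intent, word))
    return actions
```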

slide-132
SLIDE 132

[Diagram: user-level processes for Users 1–3 coupled to a global process
across times t, t+1, t+2, t+3, exchanging counts m, m′ and n, n′.]

slide-133
SLIDE 133

Sample users

[Plots: two sample users’ daily topic proportions over 40 days.]

User 1 topics: Baseball, Finance, Jobs, Dating.
User 2 topics: Baseball, Dating, Celebrity, Health.

Top words — Dating: women, men, dating, singles, personals, seeking, match.
Baseball: league, baseball, basketball, doubleheader, Bergesen, Griffey,
bullpen, Greinke. Celebrity: Snooki, Tom Cruise, Katie Holmes, Pinkett,
Kudrow, Hollywood. Health: skin, body, fingers, cells, toes, wrinkle,
layers. Jobs: job, career, business, assistant, hiring, part-time,
receptionist. Finance: financial, Thomson, chart, real, stock, trading,
currency.


slide-135
SLIDE 135

Datasets

Data

slide-136
SLIDE 136

ROC score improvement

slide-137
SLIDE 137

ROC score improvement

[Bar chart: ROC score (50–62) on Dataset-2, bucketed by user activity
(>1000 down to <20 events), comparing baseline, TLDA, and TLDA+baseline.]

slide-138
SLIDE 138

LDA for user profiling

[Diagram: barrier-synchronized sampling across four workers.
1. Each worker samples z for its shard of users.
— barrier —
2. Each worker writes its counts to memcached; one worker collects the
counts and samples the global state while the others do nothing.
— barrier —
3. All workers read the merged state back from memcached.]
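The barrier-synchronized cycle above can be simulated with standard threads. This is a toy sketch, not the production system: a plain in-memory dict stands in for memcached, the "sampling" step is stubbed out as counting, and all shard contents are made up:

```python
import threading
from collections import Counter

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)
store = {}                      # stand-in for memcached
store_lock = threading.Lock()
merged = {}                     # global state after the collect phase

def worker(wid, user_shard):
    # Phase 1: "sample z" for the local shard of users (stubbed as counting).
    local_counts = Counter(user_shard)
    barrier.wait()
    # Phase 2: write local counts to the shared store.
    with store_lock:
        store[wid] = local_counts
    barrier.wait()
    # Phase 3: one worker collects counts and samples; the rest do nothing.
    if wid == 0:
        total = Counter()
        for c in store.values():
            total.update(c)
        merged.update(total)
    barrier.wait()
    # Phase 4: everyone reads the merged state back.
    snapshot = dict(merged)     # every worker now sees the same global counts
    assert snapshot == merged

shards = [['sports'], ['sports', 'jobs'], ['finance'], ['jobs']]
threads = [threading.Thread(target=worker, args=(i, s))
           for i, s in enumerate(shards)]
for t in threads: t.start()
for t in threads: t.join()
```

The barriers are what make the scheme correct: no worker reads the merged state before the collector has finished writing it.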

slide-139
SLIDE 139

News

slide-140
SLIDE 140

News Stream

slide-141
SLIDE 141

News Stream

slide-142
SLIDE 142

News Stream

  • Over one high-quality news article per second
  • Multiple sources (Reuters, AP, CNN, ...)
  • Same story from multiple sources
  • Stories are related
  • Goals
  • Aggregate articles into a storyline
  • Analyze the storyline (topics, entities)
slide-143
SLIDE 143

Clustering / RCRP

  • Assume an active story distribution at time t
  • Draw a story indicator
  • Draw words from the story’s distribution
  • Down-weight story counts for the next day

Ahmed & Xing, 2008
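The recurrent CRP step can be sketched in two small functions: one that decays yesterday's story counts, and one that draws a story indicator from decayed counts plus a new-story mass. The decay value and the `gamma` concentration are illustrative, not the paper's settings:

```python
import random

def advance_day(story_counts, decay=0.5):
    """Down-weight yesterday's story counts before sampling today's
    articles (recurrent CRP sketch; the decay value is illustrative).
    Stories whose mass falls below a threshold die out."""
    return {s: c * decay for s, c in story_counts.items() if c * decay > 1e-3}

def sample_story(story_counts, gamma, rng):
    """Draw a story indicator: an existing story with prob ∝ its decayed
    count, or a brand-new story with prob ∝ gamma."""
    stories = list(story_counts)
    weights = [story_counts[s] for s in stories] + [gamma]
    return rng.choices(stories + ['<new>'], weights=weights)[0]
```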

slide-144
SLIDE 144

Clustering / RCRP

  • Pro
  • Nonparametric model of story generation

(no need to model frequency of stories)

  • No fixed number of stories
  • Efficient inference via collapsed sampler
  • Con
  • We learn nothing!
  • No content analysis
slide-145
SLIDE 145

Latent Dirichlet Allocation

  • Generate a topic distribution per article
  • Draw a topic per word from the topic distribution
  • Draw words from the topic-specific word distribution

Blei, Ng, Jordan, 2003
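The three steps above are the standard LDA generative process. A self-contained sketch (the topic dictionaries are toy inputs; the Dirichlet draw uses the usual normalized-Gamma construction):

```python
import random

def generate_doc(n_words, alpha, topics, rng=random.Random(1)):
    """LDA generative sketch: theta ~ Dir(alpha); z_i ~ theta; w_i ~ phi_z.

    topics: topic name -> {word: probability} distribution.
    """
    names = list(topics)
    # Symmetric Dirichlet draw via normalized Gamma variates.
    theta = [rng.gammavariate(alpha, 1.0) for _ in names]
    total = sum(theta)
    theta = [t / total for t in theta]
    words = []
    for _ in range(n_words):
        z = rng.choices(names, weights=theta)[0]                  # topic per word
        w = rng.choices(list(topics[z]),
                        weights=list(topics[z].values()))[0]      # word from phi_z
        words.append(w)
    return words
```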

slide-146
SLIDE 146

Latent Dirichlet Allocation

  • Pro
  • Topical analysis of stories
  • Topical analysis of words (meaning, saliency)
  • More documents improve estimates
  • Con
  • No clustering
slide-147
SLIDE 147
More Issues

  • Named entities are special, topics less so
  (e.g. Tiger Woods and his mistresses)
  • Some stories are strange
  (a topical mixture is not enough: dirty models)
  • Articles deviate from the general story
  (Hierarchical DP)


slide-149
SLIDE 149

Storylines

slide-150
SLIDE 150

Storylines Model

  • Topic model
  • Topics per cluster
  • RCRP for clusters
  • Hierarchical DP for articles
  • Separate model for named entities
  • Story-specific correction


slide-157
SLIDE 157

Dynamic Cluster-Topic Hybrid

[Storyline word clouds:
UEFA soccer: champions, goal, coach, striker, midfield, penalty, Juventus,
AC Milan, Lazio, Ronaldo, Lyon.
Tax bill: tax, billion, cut, plan, budget, economy, Bush, Senate, Fleischer,
White House, Republican.
Border tension: nuclear, border, dialogue, diplomatic, militant, insurgency,
missile, Pakistan, India, Kashmir, New Delhi, Islamabad, Musharraf,
Vajpayee.

Topics: Sports (games, won, team, final, season, league, held), Politics
(government, minister, authorities, opposition, officials, leaders, group),
Accidents (police, attack, run, man, group, arrested, move). γ]


slide-160
SLIDE 160

Inference

  • We receive articles as a stream; we want topics & stories now
  • Variational inference is infeasible
  (RCRP, sparse to dense, vocabulary size)
  • We have a ‘tracking problem’
  • Sequential Monte Carlo
  • Use sampled variables of the surviving particles
  • Use ideas from Canini et al., 2009
slide-161
SLIDE 161
Particle Filter

  • Proposal distribution: draw stories s and topics z for each particle
  using Gibbs sampling from p(s_{t+1}, z_{t+1} | x_{1:t+1}, s_{1:t}, z_{1:t})
  (past state plus new data)
  • Reweight each particle via the predictive likelihood
  p(x_{t+1} | x_{1:t}, s_{1:t}, z_{1:t})
  • Resample particles if the l2 norm of the weights is too large
  (resample some assignments for diversity, too)
  • Compare to the multiplicative-updates algorithm;
  in our case the predictive likelihood yields the weights
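The reweight/resample loop above fits in a few lines. A minimal sketch, assuming the "l2 norm too large" test means the squared norm of the normalized weights (i.e. a low effective sample size 1/‖w‖²); the threshold value is illustrative:

```python
import random

def reweight(weights, pred_likelihoods):
    """Multiply each particle weight by its predictive likelihood, renormalize."""
    w = [wi * li for wi, li in zip(weights, pred_likelihoods)]
    s = sum(w)
    return [wi / s for wi in w]

def needs_resample(weights, threshold=0.5):
    # Resample when the l2 norm of the weight vector grows too large,
    # i.e. the effective sample size 1/||w||^2 drops too low.
    return sum(wi * wi for wi in weights) > threshold

def resample(particles, weights, rng=random.Random(0)):
    """Multinomial resampling; weights reset to uniform afterwards."""
    idx = rng.choices(range(len(particles)), weights=weights, k=len(particles))
    return [particles[i] for i in idx], [1.0 / len(particles)] * len(particles)
```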

slide-162
SLIDE 162

Particle Filter

  • s and z are tightly coupled
  • Alternatives to MCMC:
  • Sample s, then sample z (high variance)
  • Sample z, then sample s (doesn’t make sense)
  • Idea (following a similar trick by Jain and Neal):
  • Run a few iterations of MCMC over s and z
  • Take the last sample as the proposed value
slide-163
SLIDE 163

Particle Filter


slide-166
SLIDE 166

Inheritance Tree

[Diagram: particle states stored in a tree. The root holds shared counts
(games: 1, officials: 3, league: 4); each particle owns a leaf. Reads walk
from a leaf toward the root; writes go to the leaf.]

  • Initial tree (ready for threads): particles 1, 2, 3 hang off the root;
  e.g. leaf 3 holds games: 0, season: 2, and leaf 2 is empty.
  • Filter threads update particles:
  0 = get(1,’games’); set(2,’games’,3); set(3,’minister’,7)
  • Resampling copies particles: copy(2,1) makes particles 1 and 2 share
  particle 2’s branch.
  • Prune unused branches and collapse long branches:
  maintain_prune(); maintain_collapse()
  • Create new empty leaves for the surviving particles:
  branch(1); branch(2)
  • New initial tree (ready for threads).
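The read-up/write-down discipline can be sketched as a small copy-on-write store. The class and method names follow the slide's `get`/`set`/`copy` operations, but the implementation details (dict-per-node, default value 0) are assumptions:

```python
class Node:
    """One tree node: a dict of counts plus a parent pointer."""
    def __init__(self, parent=None):
        self.parent, self.data = parent, {}

class InheritanceTree:
    """Copy-on-write particle storage sketch: reads walk up toward the
    root, writes stay in the particle's own leaf."""
    def __init__(self, n_particles):
        self.root = Node()
        self.leaf = {p: Node(self.root) for p in range(1, n_particles + 1)}

    def get(self, p, key):
        node = self.leaf[p]
        while node is not None:           # nearest ancestor that holds key wins
            if key in node.data:
                return node.data[key]
            node = node.parent
        return 0

    def set(self, p, key, value):
        self.leaf[p].data[key] = value    # writes never touch shared nodes

    def copy(self, src, dst):
        # Resampling: dst now shares src's old leaf; both get fresh leaves,
        # so neither can clobber the other's inherited state.
        shared = self.leaf[src]
        self.leaf[src] = Node(shared)
        self.leaf[dst] = Node(shared)
```

Pruning and collapsing (the `maintain_*` steps) would walk the tree and merge single-child chains; they are omitted here for brevity.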

slide-167
SLIDE 167

Inheritance Tree

[Same diagram as the previous slide. Note: naive concurrent updates are
not thread safe.]

slide-168
SLIDE 168

Extended Inheritance Tree

[Diagram: as before, but each key maps to an association list of
(story, count) pairs, e.g. at the root
India: [(I-P tension, 2), (Tax bills, 1)];
Pakistan: [(I-P tension, 2), (Tax bills, 1)];
Congress: [(I-P tension, 1), (Tax bills, 1)].]

[(I-P tension, 2), (Tax bills, 1)] = get_list(1,’India’)
set_entry(3,’India’,’Tax ¡bills’,0)

Note: “I-P tension” is short for “India-Pakistan tension”.

Writes happen only in the leaves (one leaf per thread).
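A leaf of the extended tree can be sketched as follows, mirroring the slide's `get_list`/`set_entry` operations. The shared parent data is treated as read-only, and each thread writes only its own leaf, which is why no locking is needed; the internal representation (sorted association lists) is an assumption:

```python
class ExtendedLeaf:
    """Per-thread leaf: each key (e.g. the entity 'India') maps to a list
    of (story, count) pairs. Parent data is shared and read-only; only
    the owning thread writes this leaf, so no locks are required."""
    def __init__(self, parent_data):
        self.parent_data = parent_data   # shared, read-only
        self.data = {}                   # private overrides

    def get_list(self, key):
        # Leaf entries shadow the inherited (story, count) pairs.
        merged = dict(self.parent_data.get(key, []))
        merged.update(dict(self.data.get(key, [])))
        return sorted(merged.items())

    def set_entry(self, key, story, count):
        pairs = dict(self.data.get(key, []))
        pairs[story] = count
        self.data[key] = sorted(pairs.items())
```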

slide-169
SLIDE 169

Results

slide-170
SLIDE 170

Ablation studies

  • TDT5 (Topic Detection and Tracking):
  macro-averaged minimum detection cost 0.714 (lower is better)
  • Removing features raises the cost:

  removed: time   entities   topics   story words
  cost:    0.84   0.90       0.86     0.75

slide-171
SLIDE 171

Comparison

Hashing & correlation clustering

slide-172
SLIDE 172

Time-accuracy trade-off

slide-173
SLIDE 173

Stories

TOPICS
  • Sports: games, won, team, final, season, league, held
  • Politics: government, minister, authorities, opposition, officials,
  leaders, group
  • Unrest: police, attack, run, man, group, arrested, move

STORYLINES
  • India-Pakistan tension: nuclear, border, dialogue, diplomatic, militant,
  insurgency, missile, Pakistan, India, Kashmir, New Delhi, Islamabad,
  Musharraf, Vajpayee
  • UEFA soccer: champions, goal, leg, coach, striker, midfield, penalty,
  Juventus, AC Milan, Real Madrid, Milan, Lazio, Ronaldo, Lyon
  • Tax bills: tax, billion, cut, plan, budget, economy, lawmakers, Bush,
  Senate, US Congress, Fleischer, White House, Republican

slide-174
SLIDE 174

Related Stories

  • India-Pakistan tension: nuclear, border, dialogue, diplomatic, militant,
  insurgency, missile, Pakistan, India, Kashmir, New Delhi, Islamabad,
  Musharraf, Vajpayee
  • Middle East conflict: peace, roadmap, suicide, violence, settlements,
  bombing, Israel, Palestinian, West Bank, Sharon, Hamas, Arafat
  • North Korea nuclear: nuclear, summit, warning, policy, missile, program,
  North Korea, South Korea, U.S., Bush, Pyongyang

“Show similar stories by topic” / “Show similar stories, require the word
nuclear”

slide-175
SLIDE 175

Detecting Ideologies

Ahmed and Xing, 2010

slide-176
SLIDE 176

Problem Statement

Build a model to describe both collections of data

Visualization

  • How does each ideology view mainstream events?
  • On which topics do they differ?
  • On which topics do they agree?

Ideologies

slide-177
SLIDE 177

Problem Statement

Build a model to describe both collections of data

Visualization

Ideologies

Classification

  • Given a new news article or blog post, the system should infer:
  • From which side it was written
  • Justify its answer on a topical level (views on abortion, taxes,
  health care)
slide-178
SLIDE 178

Problem Statement

Build a model to describe both collections of data

Visualization

Ideologies

Classification / Structured browsing

  • Given a new news article or blog post, the user can ask for:
  • Examples of other articles from the same ideology about the same topic
  • Documents that exemplify alternative views from other ideologies
slide-179
SLIDE 179

[Figure: factored model with ideology backgrounds Ω1, Ω2; shared topics
β1 … βk; and ideology-specific views φ1,1 … φ1,k (Ideology 1) and
φ2,1 … φ2,k (Ideology 2).]

Building a factored model


slide-182
SLIDE 182

[Figure: the same factored model (Ω1, Ω2; β1 … βk; φ1,1 … φ1,k and
φ2,1 … φ2,k); each ideology’s view of topic k mixes the shared topic βk
with weight λ and the ideology-specific distribution with weight 1−λ.]

Building a factored model
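The λ-mixture in the figure can be sketched directly: a word under ideology i and topic k is drawn from the shared topic with probability λ and from the ideology-specific view with probability 1−λ. The function name and the dict-based distributions are my own illustration:

```python
def view_word_dist(topic_dist, ideology_dist, lam):
    """Factored-model sketch: mix the shared topic (weight lam) with the
    ideology-specific view (weight 1 - lam) into one word distribution."""
    words = set(topic_dist) | set(ideology_dist)
    return {w: lam * topic_dist.get(w, 0.0)
               + (1 - lam) * ideology_dist.get(w, 0.0)
            for w in words}
```

Because both inputs are normalized distributions, the mixture is one too, so it can be plugged in wherever a per-topic word distribution is needed.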

slide-183
SLIDE 183

Datasets

  • Bitterlemons:
  • Middle East conflict; documents written by Israeli and Palestinian
  authors
  • ~300 documents from each view, average length 740 words
  • Multi-author collection
  • 80-20 train/test split
  • Political Blog-1:
  • American political blogs (Democrat and Republican)
  • 2040 posts, average post length 100 words
  • Test/train split follows (Yano et al., 2009)
  • Political Blog-2 (tests generalization to a new writing style):
  • Same as Blog-1 but 6 blogs, 3 from each side
  • ~14k posts, ~200 words per post
  • 4 blogs for training, 2 blogs for test

Data

slide-184
SLIDE 184

Example: Bitterlemons corpus

[Figure: topic word clouds, Palestinian view vs. Israeli view.
Shared background: palestinian, israeli, peace, political, process, state,
security, conflict, government, occupation, negotiation.
US role: bush, US, president, american, sharon, administration, powell,
washington, arafat, roadmap, clinton, terrorism.
Roadmap process: roadmap, phase, security, ceasefire, plan, settlement,
implementation, obligation, commitment, assassination, terrorism,
timetable.
Arab involvement: syria, lebanon, negotiate, asad, plo, hizballah, islamic,
iran, intifada, jihad.]

Bitterlemons dataset

slide-185
SLIDE 185

Classification

Classification accuracy

slide-186
SLIDE 186

Generalization to New Blogs

slide-187
SLIDE 187

Getting an Alternative View

  • Given a document written in one ideology, retrieve the equivalent
  document from the other ideology
  • Baseline: SVM + cosine similarity

Finding alternate views

slide-188
SLIDE 188

Can We Use Unlabeled Data?

Unlabeled data


slide-192
SLIDE 192

Can We Use Unlabeled Data?

  • In theory this is simple:
  • Add a step that samples the document view (v)
  • This doesn’t mix in practice because of the tight coupling between v
  and (x1, x2, z)
  • Solution:
  • Sample v and (x1, x2, z) as a block using a Metropolis-Hastings step
  • This is a huge proposal!

Unlabeled data
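The block move above is a standard Metropolis-Hastings accept/reject step, just with an unusually large proposal (the view v together with all of x1, x2, z). A generic sketch; `log_joint` and `propose` stand in for the model-specific pieces and are hypothetical:

```python
import math
import random

def mh_block_step(state, log_joint, propose, rng=random.Random(0)):
    """One Metropolis-Hastings step over a block of variables.

    propose(state, rng) returns (new_state, log_q_forward, log_q_backward);
    the correction term log_q_backward - log_q_forward handles asymmetric
    proposals such as "flip v, then re-sample (x1, x2, z) given v".
    """
    new_state, log_qf, log_qb = propose(state, rng)
    log_accept = (log_joint(new_state) - log_joint(state)) + (log_qb - log_qf)
    if math.log(rng.random()) < min(0.0, log_accept):
        return new_state, True    # block accepted as a whole
    return state, False           # block rejected as a whole
```

Accepting or rejecting the whole block at once is what keeps v and (x1, x2, z) consistent with each other, which single-site Gibbs updates fail to do here.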