 
Exploiting compositionality to explore a large space of model structures
Roger Grosse, Dept. of Computer Science, University of Toronto
Introduction
How has the life of a machine learning engineer changed in the past decade? Many tasks that previously required human experts are starting to be automated: feature engineering, algorithm configuration, probabilistic inference (probabilistic programming, e.g. Stan), and model selection (?).
The probabilistic modeling pipeline: Design a model → Fit the model → Evaluate the model
Can we identify good models automatically? Two challenges:
1. Automating each stage of this pipeline
2. Identifying a promising set of candidate models
The probabilistic modeling pipeline: Design a model → Fit the model → Evaluate the model
Matrix decompositions
Example: Senate votes, 2009-2010. The observation matrix is indexed by Senators and votes: each row holds all of one Senator's votes, and each column is the record of votes on one motion or bill.
Matrix decompositions: clustering the Senators
Observations = Cluster assignments × Cluster centers + Within-cluster variability
Cluster assignments: which cluster a Senator belongs to.
Cluster centers: which groups of Senators vote for a particular bill/motion.
Matrix decompositions: clustering the votes
Observations = Cluster centers × Cluster assignments + Within-cluster variability
Cluster assignments: which cluster a vote belongs to.
Cluster centers: what sorts of bills/motions one Senator tends to vote for, and which Senators tend to vote for one sort of bill/motion.
Matrix decompositions: dimensionality reduction
Observations = Senator representations × Vote representations + Residuals
Each Senator and each vote gets its own low-dimensional representation.
Matrix decompositions: co-clustering Senators and votes
[Figure: the observation matrix written as a sum of three terms]
Matrix decompositions: no structure; cluster columns; cluster rows; dimensionality reduction; co-clustering; …
Building models compositionally
We build models by composing simpler motifs: clustering, binary attributes, dimensionality reduction, heavy-tailed distributions, smoothness, periodicity.
Building models compositionally (Ghahramani, 1999 NIPS tutorial)
Generative models
Latent variables h, observations v.
Generation: tell a story of how datasets get generated. This gives a joint probability distribution over observations and latent variables: p(h, v) = p(h) p(v | h).
Posterior inference: infer a good explanation of how a particular dataset was generated. Find likely values of the latent variables conditioned on the observations: p(h | v).
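The posterior on this slide follows from the joint distribution by Bayes' rule. Written out for discrete latent variables (an assumption made only for notation; the sum becomes an integral when h is continuous):

```latex
p(h \mid v) \;=\; \frac{p(h, v)}{p(v)} \;=\; \frac{p(h)\, p(v \mid h)}{\sum_{h'} p(h')\, p(v \mid h')}
```

The denominator sums over every latent configuration, which is why exact inference is usually replaced by approximate methods.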
Space of models: building blocks
Gaussian (G): $\lambda_i \sim \mathrm{Gamma}(a, b)$, $\nu_j \sim \mathrm{Gamma}(a, b)$, $u_{ij} \sim \mathrm{Normal}(0, \lambda_i^{-1} \nu_j^{-1})$
Multinomial (M): $\pi \sim \mathrm{Dirichlet}(\alpha)$, $u_i \sim \mathrm{Multinomial}(\pi)$
Bernoulli (B): $p_j \sim \mathrm{Beta}(\alpha, \beta)$, $u_{ij} \sim \mathrm{Bernoulli}(p_j)$
Integration (C): $u_{ij} = 1$ if $i \ge j$, and $0$ otherwise
Grosse, Salakhutdinov, Freeman, and Tenenbaum, UAI 2012
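A minimal NumPy sketch of how the four building blocks could be sampled. Several choices here are assumptions, not taken from the slides: the function names, the default hyperparameters, reading b in Gamma(a, b) as a rate, and drawing each Multinomial row as a one-hot vector.

```python
import numpy as np

def sample_gaussian(n, m, rng, a=1.0, b=1.0):
    """G: u_ij ~ Normal(0, 1 / (lambda_i * nu_j)) with Gamma row and column precisions."""
    lam = rng.gamma(shape=a, scale=1.0 / b, size=n)   # row precisions (b read as a rate)
    nu = rng.gamma(shape=a, scale=1.0 / b, size=m)    # column precisions
    std = 1.0 / np.sqrt(np.outer(lam, nu))
    return rng.normal(0.0, 1.0, size=(n, m)) * std

def sample_multinomial(n, k, rng, alpha=1.0):
    """M: each row is a one-hot cluster assignment with pi ~ Dirichlet(alpha)."""
    pi = rng.dirichlet(alpha * np.ones(k))
    labels = rng.choice(k, size=n, p=pi)
    return np.eye(k)[labels]

def sample_bernoulli(n, k, rng, alpha=1.0, beta=1.0):
    """B: u_ij ~ Bernoulli(p_j) with p_j ~ Beta(alpha, beta)."""
    p = rng.beta(alpha, beta, size=k)
    return (rng.random((n, k)) < p).astype(float)

def integration_matrix(n):
    """C: u_ij = 1 if i >= j, else 0 (deterministic lower-triangular ones)."""
    return np.tril(np.ones((n, n)))

rng = np.random.default_rng(0)
G = sample_gaussian(5, 4, rng)
M = sample_multinomial(5, 3, rng)
B = sample_bernoulli(5, 3, rng)
C = integration_matrix(4)
```

Multiplying by C computes cumulative sums down the rows (C @ X equals np.cumsum(X, axis=0)), which is what makes it an "integration" block.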
Space of models: generative process
We represent models as algebraic expressions, e.g. (MG + G)M^T + G. To generate data:
1. Sample all leaf matrices independently from their corresponding prior distributions
2. Evaluate the resulting expression
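The two-step generative process can be sketched for a co-clustering expression of the form (MG + G)M^T + G. The matrix sizes, noise scales, and simplified leaf priors below (uniform one-hot rows, i.i.d. Gaussians) are illustrative assumptions, chosen to keep the shape bookkeeping visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 6, 8          # size of the observation matrix (hypothetical)
k_rows, k_cols = 2, 3          # number of row and column clusters (hypothetical)

def one_hot(n, k):
    """Simplified M leaf: one-hot rows with uniformly random cluster labels."""
    return np.eye(k)[rng.integers(0, k, size=n)]

def gaussian(n, m, scale=1.0):
    """Simplified G leaf: i.i.d. zero-mean Gaussian entries."""
    return scale * rng.standard_normal((n, m))

# Step 1: sample every leaf matrix independently from its prior.
M_rows = one_hot(n_rows, k_rows)          # row-cluster assignments
G_block = gaussian(k_rows, k_cols)        # one value per (row cluster, column cluster) block
G_inner = gaussian(n_rows, k_cols, 0.1)   # per-row deviations within a row cluster
M_cols = one_hot(n_cols, k_cols)          # column-cluster assignments
G_noise = gaussian(n_rows, n_cols, 0.1)   # residual noise on the observations

# Step 2: evaluate the expression (M G + G) M^T + G.
X = (M_rows @ G_block + G_inner) @ M_cols.T + G_noise
```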
Space of models: grammar
Starting symbol: G
Production rules:
clustering: G → MG + G | GM^T + G
low rank: G → GG + G
binary features: G → BG + G | GB^T + G; M → B
linear dynamics: G → CG + G | GC^T + G
sparsity: G → exp(G) ∘ G
Building blocks: Gaussian (G), Multinomial (M), Bernoulli (B), Integration (C)
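One way to see how quickly the grammar grows is to treat each model as a string and rewrite one G at a time. This is only a sketch: it covers the rules that rewrite G, ignores rules that rewrite other symbols, and counts syntactically distinct strings, so its totals should not be read as the talk's model counts. The names G_RULES, expand, and reachable are hypothetical.

```python
G_RULES = [
    "(MG+G)", "(GM^T+G)",      # clustering rows / columns
    "(GG+G)",                  # low rank
    "(BG+G)", "(GB^T+G)",      # binary features
    "(CG+G)", "(GC^T+G)",      # linear dynamics
    "(exp(G)*G)",              # sparsity; * stands for the elementwise product
]

def expand(expr):
    """Return every expression obtained by rewriting one G in expr with one rule."""
    out = set()
    for i, ch in enumerate(expr):
        if ch == "G":
            for rhs in G_RULES:
                out.add(expr[:i] + rhs + expr[i + 1:])
    return out

def reachable(depth):
    """All expressions reachable from the start symbol G in at most `depth` rewrites."""
    frontier, seen = {"G"}, {"G"}
    for _ in range(depth):
        frontier = {e for expr in frontier for e in expand(expr)} - seen
        seen |= frontier
    return seen

level_one = expand("G")
level_two = expand("(MG+G)")
```

For example, level_two contains "(M(GM^T+G)+G)", the row clustering whose centers are themselves clustered by column.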
Example: co-clustering
Apply both clustering productions in sequence, G → MG + G and G → GM^T + G: the first clusters the rows and the second clusters the columns.
Examples from the literature
no structure; clustering; random walk; low-rank approximation (Salakhutdinov and Mnih, 2008); binary features (Griffiths and Ghahramani, 2005); linear dynamical system; co-clustering (e.g. Kemp et al., 2006); sparse coding (e.g. Olshausen and Field, 1996); binary matrix factorization (Meeds et al., 2006); Bayesian clustered tensor factorization (Sutskever et al., 2009); dependent Gaussian scale mixture (e.g. Karklin and Lewicki, 2005); ...
The probabilistic modeling pipeline: Design a model → Fit a model (posterior inference) → Evaluate the model
Algorithms: posterior inference
Recursive initialization: to apply the production G → MG + G, fit a clustering model.
Implement one algorithm per production rule.
Share computation between models.
Choose the model dimension using Bayesian nonparametrics.
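One reading of recursive initialization is that a deeper model is fitted by reusing the simpler model's fit and running the new production's algorithm only on the component it rewrites. The sketch below illustrates that reading under loud assumptions: plain k-means stands in for posterior inference, the number of clusters is fixed by hand instead of chosen with Bayesian nonparametrics, and all function names are hypothetical.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=20):
    """Plain Lloyd's algorithm; returns one-hot assignments (n x k) and centers (k x d)."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # leave an empty cluster's center unchanged
                centers[j] = X[labels == j].mean(axis=0)
    return np.eye(k)[labels], centers

def fit_row_clustering(X, k, rng):
    """Apply G -> MG + G to X, so that X = M @ G + residual."""
    M, G = kmeans(X, k, rng)
    return M, G, X - M @ G

def fit_col_clustering(X, k, rng):
    """Apply G -> GM^T + G to X, so that X = G @ M.T + residual."""
    M, Gt, _ = fit_row_clustering(X.T, k, rng)
    G = Gt.T
    return G, M, X - G @ M.T

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 12))
M1, G1, R1 = fit_row_clustering(X, 3, rng)    # level 1: MG + G
G2, M2, R2 = fit_col_clustering(G1, 2, rng)   # level 2: rewrite the centers G1 as GM^T + G
```

After both steps, X = M1 (G2 M2^T + R2) + R1, which has the form M(GM^T + G) + G; the level-1 assignments M1 are reused rather than refitted.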
Posterior inference algorithms
Can make use of model-specific algorithmic tricks carefully designed for individual production rules:
Linear algebra identities, e.g. $(A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}$
Eliminating variables analytically
Tractable substructures
High-level transition operators
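The identity on this slide (the Woodbury matrix identity) can be checked numerically. In this small sketch the sizes and the choice of a diagonal A are assumptions made so that the only dense inverse needed is k × k.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
A = np.diag(rng.uniform(1.0, 2.0, size=n))   # cheap-to-invert diagonal matrix
U = rng.standard_normal((n, k))
C = np.eye(k)
V = U.T

A_inv = np.diag(1.0 / np.diag(A))
small = np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U)   # only a k x k inverse
woodbury = A_inv - A_inv @ U @ small @ V @ A_inv

direct = np.linalg.inv(A + U @ C @ V)                     # n x n inverse, for comparison
identity_holds = np.allclose(woodbury, direct)
```

When A is diagonal and k is much smaller than n, the left-hand side costs an n × n inversion while the right-hand side needs only a k × k one.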
The probabilistic modeling pipeline: Design a model → Fit a model → Evaluate the model
We evaluate models on the probability they assign to held-out subsets of the observation matrix.
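A toy illustration of held-out evaluation, comparing a "no structure" Gaussian against a row-clustering model on synthetic clustered data. Everything specific here is an assumption for illustration: holding out whole rows, the spherical-Gaussian likelihoods, and fitting the cluster centers with the true training labels (an oracle shortcut that real inference would not have).

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 5
centers = 4.0 * rng.standard_normal((k, d))
labels = rng.integers(0, k, size=300)
X = centers[labels] + rng.standard_normal((300, d))
train, test = X[:200], X[200:]            # hold out the last 100 rows
train_labels = labels[:200]

def gaussian_logpdf(x, mean, var):
    """Log density of a spherical Gaussian, summed over the entries of each row."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var).sum(axis=-1)

# Candidate 1, no structure: every entry is zero-mean Gaussian with one shared variance.
var0 = (train ** 2).mean()
ll_flat = gaussian_logpdf(test, 0.0, var0).mean()

# Candidate 2, row clustering: mixture of spherical Gaussians around fitted centers.
fit_centers = np.stack([train[train_labels == j].mean(axis=0) for j in range(k)])
weights = np.bincount(train_labels, minlength=k) / len(train)
var1 = ((train - fit_centers[train_labels]) ** 2).mean()
comp = gaussian_logpdf(test[:, None, :], fit_centers[None, :, :], var1) + np.log(weights)
peak = comp.max(axis=1, keepdims=True)    # log-sum-exp over clusters, computed stably
ll_clustered = (peak[:, 0] + np.log(np.exp(comp - peak).sum(axis=1))).mean()
```

Both scores are average log probabilities in nats per held-out row; on data that really is clustered, ll_clustered comes out higher than ll_flat.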
The probabilistic modeling pipeline: Design a model → Fit a model → Evaluate the model
We want to search over the large, open-ended space of models.
Key problem: the search space is very large, with over 1000 models reachable in 3 productions. How do we choose a promising set of models to evaluate?
Algorithms: structure search
A brief history of models of natural images…
Sanger, 1988: model patches as linear combinations of uncorrelated basis functions. Fourier representation.
Olshausen and Field, 1994: model the heavy-tailed distributions of coefficients. Oriented edges similar to simple cells.
Karklin and Lewicki, 2005, 2008: model the dependencies between scales of coefficients. High-level texture representation similar to complex cells.
Algorithms: structure search
Refining models = applying productions
Based on this intuition, we apply a greedy search procedure: G → MG + G → M(GM^T + G) + G → ...
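A minimal sketch of a greedy search over refinements. It makes two simplifying assumptions that are not specified on the slide: only the single best model is kept at each level, and the search stops when the best refinement fails to improve the score by more than min_gain. The successor table and scores are made-up stand-ins, not results from the talk.

```python
def greedy_search(start, expand, score, max_depth=3, min_gain=1.0):
    """Greedy structure search: repeatedly move to the best-scoring refinement."""
    best, best_score = start, score(start)
    path = [(best, best_score)]
    for _ in range(max_depth):
        candidates = expand(best)
        if not candidates:
            break
        challenger = max(candidates, key=score)
        if score(challenger) - best_score <= min_gain:
            break                      # no refinement is worth the extra structure
        best, best_score = challenger, score(challenger)
        path.append((best, best_score))
    return path

# Toy stand-ins: a hand-written successor table and made-up scores.
SUCCESSORS = {
    "G": ["MG+G", "GG+G"],
    "MG+G": ["M(GM^T+G)+G", "M(GG+G)+G"],
}
SCORES = {"G": -100.0, "MG+G": -80.0, "GG+G": -85.0,
          "M(GM^T+G)+G": -70.0, "M(GG+G)+G": -79.5}

path = greedy_search("G", lambda e: SUCCESSORS.get(e, []), SCORES.__getitem__)
```

In practice, expand would apply the grammar's productions and score would be a held-out predictive likelihood, so each level costs one model fit per candidate rather than one per model in the whole space.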
Experiments: simulated data
Tested on simulated data where we know the correct structure.
Usually chooses the correct structure in low-noise conditions.
Gracefully falls back to simpler models under heavy noise.
Experiments: real-world data
Senate votes 09-10
Model 1: GM^T + G. Cluster votes. 22 clusters; the largest are party-line Democrat, party-line Republican, and all yea; the others are series of votes on single issues.
Model 2: (MG + G)M^T + G. Cluster Senators. 11 clusters; no cross-party clusters.
No third-level model improves by more than 1 nat.
Experiments: real-world data
Motion capture
Data: motion capture of a person walking. Each row gives a person's displacement and joint angles in one frame.
Model 1: CG + G. Independent Markov chains.
Model 2: C(GG + G) + G. Correlations in joint angles.
Model 3: none selected.
Experiments: real-world data
Image patches
Data: 1,000 12x12 patches from 10 blurred and whitened images.
Model 1: GG + G. Low-rank approximation (PCA).
Model 2: (exp(G) ∘ G)G + G. Sparsify coefficients to get sparse coding.
Model 3: (exp(GG + G) ∘ G)G + G. Model dependencies between scale variables.
...
Experiments: real-world data
Concepts
Data: Mechanical Turk users' judgments to 218 questions about 1000 entities.
Model 1: MG + G. Cluster entities. 39 clusters.
Model 2: M(GG + G) + G. Low-rank representation of cluster centers. 8 dimensions. Dimension 1: living vs. nonliving. Dimension 2: large vs. small.
Model 3: none selected.