

SLIDE 1

A Tutorial on Bayesian Nonparametrics

Fatima Al-Raisi

Carnegie Mellon University fraisi@cs.cmu.edu

October 25, 2016

Fatima Al-Raisi (Carnegie Mellon University) A Tutorial on Bayesian Nonparametrics October 25, 2016 1 / 45

SLIDE 2

1. Introduction

2. Bayesian Nonparametrics: Motivation; Intuitions and Assumptions; Theoretical Motivation; Practical Motivation

3. Dirichlet Process

4. Chinese Restaurant Process; Pitman-Yor Process

5. Discussion and Concluding Remarks

6. List of Tutorials

SLIDE 3

Development of Interest in Topic Over Time

An interesting “interest over time” pattern!

SLIDE 4

Interest Over Time: Deep Learning

SLIDE 5

Interest Over Time: Reinforcement Learning

SLIDE 6

Interest Over Time: Nonparametric Statistics!

SLIDE 7

Interest Over Time: Bayesian Inference!

SLIDE 8

Terminology

What does “Bayesian Nonparametrics” mean?

Bayesian inference: data and parameters, priors and posteriors:
P(parameters | data) ∝ P(parameters) P(data | parameters)
Bayesian inference vs. Bayes' rule (Bayesian inference does not mean simply using Bayes' rule!)
"Non-parametric" (a misnomer): a large or unbounded number of parameters, a growing number of parameters, an infinite parameter space; "the number of parameters grows with the amount of training data"
No (strong) assumption about the underlying distribution of the data
Terminology note: "non-parametric" vs. "nonparametric"
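The proportionality above can be made concrete on a small discrete grid of parameter values. A minimal sketch (the coin-flip model, the grid, and the counts are illustrative assumptions, not from the slides):

```python
# Posterior ∝ prior × likelihood, evaluated on a discrete grid of parameters.
# Illustrative setup: a coin with unknown heads-probability theta.
thetas = [0.1 * k for k in range(1, 10)]          # candidate parameter values
prior = {t: 1.0 / len(thetas) for t in thetas}    # uniform prior P(theta)

def likelihood(theta, heads, tails):
    """P(data | theta) for a sequence with the given counts."""
    return theta ** heads * (1.0 - theta) ** tails

# Observe 7 heads and 3 tails; form the unnormalized posterior and normalize.
unnorm = {t: prior[t] * likelihood(t, 7, 3) for t in thetas}
z = sum(unnorm.values())                          # P(data), the evidence
posterior = {t: p / z for t, p in unnorm.items()}

best = max(posterior, key=posterior.get)
print(round(best, 1))  # posterior mode; mass concentrates near 0.7
```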

SLIDE 9

Terminology

Formal Definition

A statistical model is a collection of distributions {Pθ : θ ∈ Θ} indexed by a parameter θ.
Parametric model: the indexing parameter is a finite-dimensional vector: Θ ⊂ Rk.
Nonparametric model: Θ ⊂ F for some possibly infinite-dimensional space F.
Semiparametric model: the parameter has both a finite-dimensional component and an infinite-dimensional component: Θ ⊂ Rk × F, where F is an infinite-dimensional space.

SLIDE 10

Review

Probabilistic Modeling

Data: x1, x2, . . . , xn
Latent variables: z1, z2, . . . , zn
Parameter: θ
A probabilistic model is a parametrized joint distribution over variables, P(x1, . . . , xn, z1, . . . , zn | θ), typically interpreted as a generative model of the data.
Inference of latent variables given observed data:
P(z1, . . . , zn | x1, . . . , xn, θ) = P(x1, . . . , xn, z1, . . . , zn | θ) / P(x1, . . . , xn | θ)

SLIDE 11

Review

Probabilistic Modeling

Learning (e.g., by maximum likelihood): θ̂ = argmax_θ P(x1, x2, . . . , xn | θ)
Prediction: P(xn+1 | x1, x2, . . . , xn, θ)
Classification: argmax_c P(xn+1 | θc)
Standard algorithms: EM, VI, MCMC, etc.
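A minimal sketch of the learning and prediction steps for a Bernoulli model (the data and the grid are illustrative assumptions):

```python
from math import log

x = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # hypothetical observations x_1..x_n

def log_lik(theta, xs):
    """log P(x_1..n | theta) under a Bernoulli(theta) model."""
    return sum(log(theta) if xi else log(1 - theta) for xi in xs)

# Learning: theta-hat = argmax_theta P(x_1..n | theta), here by grid search...
grid = [k / 100 for k in range(1, 100)]
theta_hat = max(grid, key=lambda t: log_lik(t, x))
assert theta_hat == sum(x) / len(x)  # ...matching the closed-form Bernoulli MLE

# Prediction plugs the estimate back in: P(x_{n+1} = 1 | x_1..n, theta-hat)
print(theta_hat)  # 0.7
```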

SLIDE 12

Review

Bayesian Modeling

Prior distribution: P(θ)
Posterior distribution:
P(z1, . . . , zn, θ | x1, . . . , xn) = P(x1, . . . , xn, z1, . . . , zn | θ) P(θ) / P(x1, . . . , xn)
The above performs both inference and learning.

SLIDE 13

Clustering

Parametric Approach

Think of the data as generated from a number of sources.
Model each cluster using a parametric model.
A data item i is drawn as follows:
zi | π ∼ Discrete(π)
xi | zi, θ⋆ ∼ F(θ⋆_zi), where F is a parametric model (e.g., a Gaussian with parameter vector θ = (µ, σ))
Mixing proportions: π = (π1, . . . , πk) | α ∼ Dirichlet(α/k, . . . , α/k)
More on the Dirichlet distribution later.
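The generative process above can be sketched directly in NumPy; the values k = 3, the cluster means, and the standard deviations below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, alpha = 3, 500, 1.0

# pi | alpha ~ Dirichlet(alpha/k, ..., alpha/k)   (mixing proportions)
pi = rng.dirichlet(np.full(k, alpha / k))
# cluster parameters theta*_k = (mu_k, sigma_k); the values are illustrative
mus, sigmas = np.array([-5.0, 0.0, 5.0]), np.array([1.0, 1.0, 1.0])

# z_i | pi ~ Discrete(pi);  x_i | z_i ~ F(theta*_{z_i}) = N(mu_{z_i}, sigma_{z_i})
z = rng.choice(k, size=n, p=pi)
x = rng.normal(mus[z], sigmas[z])
print(x.shape, np.bincount(z, minlength=k))  # cluster sizes follow pi
```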

SLIDE 14

Motivation

Question: What is the number of sources?

SLIDE 15

Motivation

Question: What is the number of sources? Is it 5?

SLIDE 16

Motivation

Question: What is the number of sources? Or maybe 3?

SLIDE 17

Motivation

Question: What is the number of sources? In practice an ad-hoc approach is followed to decide k. For example:
• guess the number of clusters, run EM for a Gaussian mixture model, inspect the results and goodness of fit, and if needed try again with a different k
• run hierarchical agglomerative clustering, and cut the tree at a “reasonable looking” level
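The guess-and-check loop can be sketched as follows: fit a small 1-D Gaussian mixture by EM for several values of k and score each fit with BIC. The dataset, the quantile initialization, and the BIC criterion are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic 1-D data drawn from three well-separated sources
x = np.concatenate([rng.normal(-4, 1, 100), rng.normal(0, 1, 100),
                    rng.normal(5, 1, 100)])

def gmm_loglik(x, k, iters=200):
    """Tiny EM for a 1-D Gaussian mixture; returns the final log-likelihood."""
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread the initial means
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)           # (n, k) component densities
        r = dens / dens.sum(axis=1, keepdims=True)  # E-step: responsibilities
        nk = r.sum(axis=0)                          # M-step: update parameters
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / len(x)
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
           / np.sqrt(2 * np.pi * var)
    return np.log(dens.sum(axis=1)).sum()

# BIC = -2 log L + (#free params) log n; a 1-D k-mixture has 3k - 1 free params
bics = {k: -2 * gmm_loglik(x, k) + (3 * k - 1) * np.log(len(x))
        for k in range(1, 6)}
print(min(bics, key=bics.get))  # the data were drawn from 3 sources
```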

SLIDE 18

Motivation

Question: What is the number of sources? In practice an ad-hoc approach is followed to decide on k. But we want a principled approach for discovering k. After all, it is an essential part of the problem to be solved!

SLIDE 19

Motivation

Intuitive and Theoretical Motivation

Natural phenomena:
◮ Topics: (Wikipedia) dynamic traversal
◮ Clustering
◮ Species discovery
◮ Annotation and labeling
◮ Knowledge-base entity types
◮ . . .
For any fixed k, as we see more data, there is a positive probability that we will encounter a data point that does not fit in the current scheme; i.e., k grows with the data.

SLIDE 20

Motivation

Theoretical Motivation: De Finetti’s Theorem

Infinite Exchangeability
A data sequence is infinitely exchangeable if the distribution of any N data points does not change under permutation:
p(X1, . . . , XN) = p(Xσ(1), . . . , Xσ(N))

Theorem (De Finetti’s Theorem)

A sequence X1, X2, . . . is infinitely exchangeable if and only if, for all N and some distribution P:
p(X1, . . . , XN) = ∫Θ ∏_{n=1}^{N} p(Xn | θ) P(dθ)

SLIDE 21

Motivation

Theoretical Motivation

De Finetti’s Theorem; general proof: Hewitt & Savage 1955; Aldous 1983

Theorem (De Finetti’s Theorem)

A sequence X1, X2, . . . is infinitely exchangeable if and only if, for all N and some distribution P:
p(X1, . . . , XN) = ∫Θ ∏_{n=1}^{N} p(Xn | θ) P(dθ)

Motivates: parameters, likelihood, priors, and non-parametric Bayesian priors.

SLIDE 22

Motivation

Theoretical Motivation

What happens under the parametric regime?

SLIDE 23

Motivation

Theoretical Motivation

What happens under the parametric regime? Let’s take the example of regression

SLIDE 24

Motivation

Theoretical Motivation

What happens under the parametric regime? When fitting/optimizing, we’re finding the best fit within the chosen (parametric) family of functions; i.e., we’re optimizing to get the closest approximation to the true target function.

SLIDE 25

Motivation

Theoretical Motivation

What happens under the parametric regime? When fitting, we’re finding the best fit within the chosen (parametric) family of functions; i.e., we’re optimizing to get the closest approximation to the true target function.

But this may not be good enough.

SLIDE 26

Motivation

Theoretical Motivation: Non-parametric Bayesian Approach

SLIDE 27

Motivation

Practical Problem-solving Motivation

Human intuitions about high-dimensional problems are often misleading! Example: a recent result from random matrix theory proves the proliferation of saddle points in comparison to local minima in high-dimensional problems [Dauphin et al. 2015]. Assumptions often made when attempting to solve different problems are naturally part of the problem to be solved, e.g.:

SLIDE 28

Motivation

Practical Problem-solving Motivation

Assumptions often made when attempting to solve different problems with data are naturally part of the problem to be solved, e.g.:
• the number of clusters in clustering
• the “type” or class of function in regression
• the number of factors in factor analysis
• . . .
The Bayesian non-parametric approach:
• makes no unreasonable assumptions about the data (i.e., that the true model of a complex phenomenon is governed by a small number of parameters)
• provides a model that can adapt its complexity to the data: let the data determine model complexity naturally
• requires no fitting or model selection → no underfitting or overfitting → no regularization required

SLIDE 29

Motivation

Practical Problem-solving Motivation

Learning structures: a Bayesian prior over combinatorial structures
Lack of an intuitive parametric prior over these complex structures
Nonparametric priors sometimes end up simpler than parametric priors

SLIDE 30

Motivation

Practical Problem-solving Motivation: Structure Learning

SLIDE 31

Motivation

Desirable Properties of Non-parametric Models

Exchangeability
Naturally captures power laws
Flexible ways of building complex models (e.g., hierarchical models)
When conjugate priors are used, problems often become computationally tractable

SLIDE 32

Dirichlet Process

A fundamental concept in Bayesian nonparametrics
Formally defined by [Ferguson 1973] as a distribution over measures
Can be derived in different ways, and as a special case of different processes:
◮ the infinite limit of a Gibbs sampler for finite mixture models
◮ the Chinese restaurant process
◮ the stick-breaking construction
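The stick-breaking construction can be sketched in a few lines: draw stick proportions βk ∼ Beta(1, α) and set πk = βk ∏_{j<k}(1 − βj). The truncation level, the value of α, and the base measure H = N(0, 1) below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, K = 2.0, 1000  # concentration and truncation level for the "stick"

# beta_k ~ Beta(1, alpha); pi_k = beta_k * prod_{j<k} (1 - beta_j)   (GEM(alpha))
beta = rng.beta(1.0, alpha, size=K)
pi = beta * np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))

atoms = rng.normal(0.0, 1.0, size=K)  # theta_k ~ H, here H = N(0, 1)
# G = sum_k pi_k * delta_{theta_k} is (a truncated) draw from DP(alpha, H)
print(round(pi.sum(), 6))  # weights sum to ~1 at this truncation level
```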

SLIDE 33

Chinese Restaurant Process

A partition ̺ of a set S is a disjoint family of non-empty subsets of S whose union is S.
Denote the set of all partitions of S by PS.
Random partitions are random variables taking values in PS.
We will consider partitions of S.

SLIDE 34

Chinese Restaurant Process

Each customer comes into the restaurant and sits at a table: the (n+1)-th customer joins an existing table c with probability |c|/(n + α) and starts a new table with probability α/(n + α).
Customers correspond to elements of S, and tables to clusters in ̺.
Rich-gets-richer: large clusters are more likely to attract more items.
Multiplying the conditional probabilities together, the overall probability of ̺, called the exchangeable partition probability function (EPPF), is:
P(̺) = α^{|̺|} ∏_{c ∈ ̺} (|c| − 1)! / [α]^n_1
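The seating process is easy to simulate; the values of α and the number of customers below are illustrative:

```python
import random

def crp(n, alpha, seed=0):
    """Simulate table assignments for n customers under CRP(alpha)."""
    rng = random.Random(seed)
    tables = []      # tables[k] = number of customers seated at table k
    assignment = []  # assignment[i] = table index of customer i
    for i in range(n):
        # customer i+1 joins table k w.p. tables[k]/(i + alpha),
        # and opens a new table w.p. alpha/(i + alpha)  (rich-gets-richer)
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for k, size in enumerate(tables):
            acc += size
            if r < acc:
                tables[k] += 1
                assignment.append(k)
                break
        else:
            assignment.append(len(tables))
            tables.append(1)
    return assignment, tables

assignment, tables = crp(100, alpha=1.0)
print(len(tables), sorted(tables, reverse=True))  # a few large tables, many small
```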

SLIDE 35

Chinese Restaurant Process

Number of clusters
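Since customer i+1 opens a new table with probability α/(α + i), the expected number of clusters after n customers can be computed exactly and compared with the familiar α log n approximation (the values of α and n below are illustrative):

```python
import math

alpha, n = 2.0, 5000
# Customer i+1 starts a new table with probability alpha / (alpha + i), so
#   E[K_n] = sum_{i=0}^{n-1} alpha / (alpha + i)  ~  alpha * log(n)
expected = sum(alpha / (alpha + i) for i in range(n))
print(round(expected, 2), round(alpha * math.log(n), 2))  # grows logarithmically
```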

SLIDE 36

Nonparametric approach to clustering

Partitions are natural latent objects in clustering.
Given a dataset S, partition it into clusters of similar items.
A cluster c ∈ ̺ is described by a model F(θ⋆_c) parameterized by θ⋆_c.
Bayesian approach: introduce a prior over ̺ and the θ⋆_c, and compute the posterior over both.
CRP mixture model: use a CRP prior over ̺, and an iid prior H over cluster parameters.
Computation becomes efficient when H is the conjugate prior for F.
◮ One of the reasons Gaussians are popular in modeling is their nice mathematical properties, including conjugacy.

SLIDE 37

Nonparametric approach to clustering

Generative model of the data:
̺ ∼ CRP(α)
θ⋆_c | ̺ ∼ H
xi | θ⋆, ̺ ∼ F(θ⋆_c)
The CRP prior is a prior over partitions of the data in which the number of partitions/clusters is unknown a priori and is part of inference.
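A sketch of this generative model with 1-D Gaussian clusters; the base measure H = N(0, 3²), the unit observation noise, and the values of α and n are illustrative assumptions:

```python
import random

def crp_mixture_sample(n, alpha, seed=0):
    """Draw a dataset from a CRP(alpha) mixture of 1-D Gaussians."""
    rng = random.Random(seed)
    sizes, means, data, labels = [], [], [], []
    for i in range(n):
        r = rng.uniform(0, i + alpha)   # rho ~ CRP(alpha): pick a cluster
        acc, k = 0.0, None
        for c, s in enumerate(sizes):
            acc += s
            if r < acc:
                k = c
                sizes[c] += 1
                break
        if k is None:                   # new cluster: theta*_c ~ H = N(0, 3^2)
            k = len(sizes)
            sizes.append(1)
            means.append(rng.gauss(0.0, 3.0))
        labels.append(k)
        data.append(rng.gauss(means[k], 1.0))  # x_i ~ F(theta*_c) = N(theta*_c, 1)
    return data, labels

data, labels = crp_mixture_sample(200, alpha=1.0)
print(len(set(labels)))  # number of clusters generated by the prior
```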

SLIDE 38

Nonparametric approach to clustering

Consider a finite mixture model with K sources. How can we describe the partition of the data into clusters?
The distribution over partitions ̺ is
p(̺ | α, K) = (K! / (K − |̺|)!) ∏_{c ∈ ̺} [α/K]^{|c|}_1 / [α]^n_1
where [x]^a_b = x(x + b) . . . (x + (a − 1)b).
Taking the limit K → ∞, we obtain a distribution over partitions without a limit on the number of sources (K disappears in the limit):
p(̺ | α) = α^{|̺|} ∏_{c ∈ ̺} (|c| − 1)! / [α]^n_1
Note where the exchangeability comes from!

SLIDE 39

Pitman-Yor Process

The Pitman-Yor process is a generalization of the Dirichlet process.
Recall the CRP probabilities: an existing table c is chosen with probability |c|/(n + α), and a new table with probability α/(n + α).
Here the difference is a discount parameter d: an existing table c is chosen with probability (|c| − d)/(n + α), and a new table with probability (α + dK)/(n + α), where K is the current number of tables.
The effect is that as d increases, the model tends to create more tables, each with fewer customers.
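A sketch of the Pitman-Yor seating scheme described above; the values of α, d, and n are illustrative:

```python
import random

def pitman_yor_tables(n, alpha, d, seed=0):
    """Table sizes after n customers under Pitman-Yor(d, alpha) seating.
    An occupied table of size s is chosen w.p. (s - d)/(i + alpha); a new
    table w.p. (alpha + d*K)/(i + alpha), with K the current table count."""
    rng = random.Random(seed)
    tables = []
    for i in range(n):
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for k, s in enumerate(tables):
            acc += s - d
            if r < acc:
                tables[k] += 1
                break
        else:
            tables.append(1)  # remaining mass is alpha + d * len(tables)
    return tables

print(len(pitman_yor_tables(2000, alpha=1.0, d=0.0)))  # d = 0 recovers the CRP
print(len(pitman_yor_tables(2000, alpha=1.0, d=0.5)))  # larger d -> more tables
```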

SLIDE 40

Pitman-Yor Process

The Pitman-Yor process is better at capturing natural power-law phenomena, especially around the tails and peak of the distribution. Example: English word frequencies and ranks [Wood et al. 2011].

SLIDE 41

Discussion

Notes on the statistical properties of nonparametric models:

◮ Consistency
◮ Efficiency (i.e., statistical efficiency)
◮ Coverage (the Bayesian analogue of confidence intervals)

Computationally expensive (also related to the decoupling of models and algorithms)
How to compare against a parametric counterpart:

◮ Accuracy alone is not a good metric for comparison; it is a function of the model and a specific dataset.
◮ Asymptotic performance as the amount of data increases is better for comparison.

Nonparametric models are extremely popular in settings where the data follow a power law.
They should be considered when we suspect a continuous increase in possible configurations as we see more data.
They should not be used when we know that the distribution of the data is likely to follow a parametric form or is generated by a finite number of sources (no coverage guarantees in this case).

SLIDE 42

Tutorials:
A webpage with a list of tutorials on Bayesian nonparametrics: http://stat.columbia.edu/~porbanz/npb-tutorial.html
A tutorial on Bayesian nonparametric models, by Gershman & Blei: http://gershmanlab.webfactional.com/pubs/GershmanBlei12.pdf
Dirichlet Process. Yee Whye Teh. http://www.stats.ox.ac.uk/~teh/research/npbayes/Teh2010a.pdf
A Tutorial on Gaussian Processes. Mark Ebden. http://www.robots.ox.ac.uk/~mebden/reports/GPtutorial.pdf
Video tutorials:
Bayesian Nonparametrics - Yee Whye Teh - MLSS 2013 Tübingen (Max Planck Institute for Intelligent Systems, Tübingen)
Bayesian Nonparametrics - Tamara Broderick - MLSS 2015 Tübingen
Bayesian Nonparametrics Lectures - Larry Wasserman

SLIDE 43

Tutorials and Further References

Courses on Bayesian Nonparametrics

Nonparametric Modeling. UIC. http://georgek.people.uic.edu/Nonparametric.htm
Bayesian Nonparametric Statistics. UNITO. http://www.master-sds.unito.it/do/corsi.pl/Show?_id=meln
Bayesian Nonparametrics - Foundations and Applications. FSU. http://stat.fsu.edu/~sethu/st718outline.pdf
A Course in Bayesian Statistics. Stanford University. http://statweb.stanford.edu/~sabatti/Stat370/

SLIDE 44

Tutorials and Further References

Bayesian Nonparametrics Textbooks

SLIDE 45

Thank you!
