


Non-parametric Bayesian Statistics

Graham Neubig

2011-12-22


Overview

  • About Bayesian Non-parametrics
  • Basic theory
  • Inference using sampling
  • Learning an HMM with sampling
  • From the finite HMM to the infinite HMM
  • Recent developments (in sampling and modeling)
  • Applications to speech and language processing
  • Focus on unsupervised learning for discrete distributions


Non-parametric Bayes

  • The number of parameters is not decided in advance (i.e. it is infinite)
  • Put a prior on the parameters and consider their distribution


Types of Statistical Models

Model                   | Prior on Parameters | # of Parameters (Classes) | Discrete Distribution                          | Continuous Distribution
Maximum Likelihood      | No                  | Finite                    | Multinomial                                    | Gaussian
Bayesian Parametric     | Yes                 | Finite                    | Multinomial + Dirichlet Prior                  | Gaussian + Gaussian Prior
Bayesian Non-parametric | Yes                 | Infinite                  | Multinomial + Dirichlet Process (covered here) | Gaussian Process


Bayesian Basics


Maximum Likelihood (ML)

  • We have an observed sample

X = 1 2 4 5 2 1 4 4 1 4

  • Gather counts
  • Divide counts to get probabilities

C = {c_1, c_2, c_3, c_4, c_5} = {3, 2, 0, 4, 1}

P_x = {0.3, 0.2, 0, 0.4, 0.1}    (a multinomial distribution)

P(x=i) = c_i / Σ_j c_j


Bayesian Inference

  • ML is weak against sparse data
  • Don't actually know parameters
  • Bayesian statistics don't pick one probability
  • Use the expectation instead

P(x=i) = ∫ θ_i P(θ|X) dθ

If c_x = {3, 2, 0, 4, 1}, we could have θ = {0.3, 0.2, 0, 0.4, 0.1},
or we could have θ = {0.35, 0.05, 0.05, 0.35, 0.2}

Calculating Parameter Distributions

  • Decompose with Bayes' law
  • likelihood: easily calculated according to the model
  • prior: chosen according to our belief about probable parameter values
  • regularization requires difficult integration...
  • ... but conjugate priors make things easier

P∣X = PX∣P 

∫ PX∣P d 

likelihood prior regularization coefficient


Conjugate Priors

  • Definition: the product of the likelihood and the prior takes the same form as the prior
  • Because the form is known, there is no need to take the integral to regularize

Multinomial Likelihood × Dirichlet Prior = Dirichlet Posterior
Gaussian Likelihood × Gaussian Prior = Gaussian Posterior
(prior and posterior take the same form)


Dirichlet Distribution/Process

  • Assigns probabilities to multinomial distributions, e.g.:
      P({0.3, 0.2, 0.01, 0.4, 0.09}) = 0.000512
      P({0.35, 0.05, 0.05, 0.35, 0.2}) = 0.0000963
  • Defined over the space of proper probability distributions:
      θ = {θ_1, ..., θ_n},   ∀i: 0 ≤ θ_i ≤ 1,   Σ_{i=1}^n θ_i = 1
  • The Dirichlet process is a generalization of the Dirichlet distribution
  • Can assign probabilities to infinite spaces


Dirichlet Process (DP)

  • α is the "concentration parameter": a larger value means more data is needed to diverge from the prior
  • P_base is the "base measure": the expectation of θ

P(θ; α, P_base) = (1/Z) ∏_{i=1}^n θ_i^(α P_base(x=i) − 1)

Z = ∏_{i=1}^n Γ(α P_base(x=i)) / Γ(Σ_{i=1}^n α P_base(x=i))    (regularization coefficient, Γ = gamma function)

α_i = α P_base(x=i)    (α_i: way of writing in the Dirichlet distribution; α, P_base: way of writing in the Dirichlet process)


Examples of Probability Densities

[Example density plots (from Wikipedia): α = 15, P_base = {0.2, 0.47, 0.33};  α = 9, P_base = {0.22, 0.33, 0.44};  α = 10, P_base = {0.6, 0.2, 0.2};  α = 14, P_base = {0.43, 0.14, 0.43}]


Why is the Dirichlet Conjugate?

  • Likelihood is product of multinomial probabilities
  • Combine multiple instances into a single count
  • Take product of likelihood and prior

Data: x_1=1, x_2=5, x_3=2, x_4=5    →    c(x=i) = {1, 1, 0, 0, 2}

P(X|θ) = p(x=1|θ) p(x=5|θ) p(x=2|θ) p(x=5|θ) = θ_1 θ_5 θ_2 θ_5 = θ_1 θ_2 θ_5^2 = ∏_{i=1}^n θ_i^c(x=i)

∏_{i=1}^n θ_i^c(x=i) × (1/Z_prior) ∏_{i=1}^n θ_i^(α_i − 1)  →  (1/Z_post) ∏_{i=1}^n θ_i^(c(x=i) + α_i − 1)


Expectation of θ in the DP

  • When N = 2 (two dimensions):

E[θ_1] = ∫_0^1 θ_1 (1/Z) θ_1^(α_1 − 1) (1 − θ_1)^(α_2 − 1) dθ_1
       = (1/Z) ∫_0^1 θ_1^α_1 (1 − θ_1)^(α_2 − 1) dθ_1

Integration by parts (∫ u dv = uv − ∫ v du) with:
    u = θ_1^α_1                        du = α_1 θ_1^(α_1 − 1) dθ_1
    dv = (1 − θ_1)^(α_2 − 1) dθ_1      v = −(1 − θ_1)^α_2 / α_2

E[θ_1] = (1/Z) [−θ_1^α_1 (1 − θ_1)^α_2 / α_2]_0^1 − (1/Z) ∫_0^1 (−(1 − θ_1)^α_2 / α_2) α_1 θ_1^(α_1 − 1) dθ_1
       = 0 + (α_1 / α_2) (1/Z) ∫_0^1 θ_1^(α_1 − 1) (1 − θ_1)^α_2 dθ_1
       = (α_1 / α_2) E[θ_2] = (α_1 / α_2) (1 − E[θ_1])

⇒  E[θ_1] = α_1 / (α_1 + α_2)


Multi-Dimensional Expectation

  • Posterior distribution for multinomial with DP prior:

Px=i =∫

1

i 1 Z post ∏i=1

n

i

c x=i i −1

=cx=i ∗P basex=i  c ・ E [i ] = i

∑i =1

n

i = Pbasex=i   = Pbasex=i 

Observed Counts Base Measure Concentration Parameter

  • Same as additive smoothing

Marginal Probability

  • Calculate prob. of observed data using the chain rule

X = 1 2 1 3 1,   α = 1,   P_base(x=1,2,3,4) = 0.25

P(x_i) = (c(x_i) + α P_base(x_i)) / (c(·) + α)

P(x_1=1)               = (0 + 1·0.25) / (0 + 1) = 0.25      c = {0, 0, 0, 0}
P(x_2=2 | x_1)         = (0 + 1·0.25) / (1 + 1) = 0.125     c = {1, 0, 0, 0}
P(x_3=1 | x_{1,2})     = (1 + 1·0.25) / (2 + 1) = 0.417     c = {1, 1, 0, 0}
P(x_4=3 | x_{1,2,3})   = (0 + 1·0.25) / (3 + 1) = 0.063     c = {2, 1, 0, 0}
P(x_5=1 | x_{1,2,3,4}) = (2 + 1·0.25) / (4 + 1) = 0.45      c = {2, 1, 1, 0}

Marginal probability: P(X) = 0.25 · 0.125 · 0.417 · 0.063 · 0.45
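A minimal Python sketch of this chain-rule computation (function name and structure are mine, not from the slide):

```python
def marginal_probability(X, alpha, p_base):
    """Chain-rule marginal probability under a DP prior with base measure p_base."""
    counts = {}
    total, prob = 0, 1.0
    for x in X:
        p = (counts.get(x, 0) + alpha * p_base[x]) / (total + alpha)
        prob *= p
        counts[x] = counts.get(x, 0) + 1
        total += 1
    return prob

# The example above: X = 1 2 1 3 1, alpha = 1, P_base = 0.25 for each value
print(marginal_probability([1, 2, 1, 3, 1], 1.0, {1: .25, 2: .25, 3: .25, 4: .25}))
# ≈ 0.25 * 0.125 * 0.417 * 0.063 * 0.45 ≈ 0.00037
```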


Chinese Restaurant Process

  • A way of expressing the DP and other stochastic processes
  • A Chinese restaurant with an infinite number of tables
  • Each customer enters the restaurant and takes an action:
      P(sits at table i) ∝ c_i        P(sits at a new table) ∝ α
  • When the first customer sits at a table, the food served there is chosen according to P_base

Example: X = 1 2 1 3 1,  α = 1,  N = 4 customers already seated (at tables serving 1, 2, 1, 3)
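A small Python sketch of one CRP seating step (a sketch; function and variable names are mine):

```python
import random

def crp_seat(table_counts, alpha):
    """Sample which table the next customer sits at under a CRP.
    table_counts[i] = number of customers already at table i;
    returning len(table_counts) means a new table is opened."""
    weights = table_counts + [alpha]   # existing tables ∝ c_i, new table ∝ α
    r = random.uniform(0, sum(weights))
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(table_counts)

# Seat 5 customers, as in the example above (α = 1)
tables = []
for _ in range(5):
    i = crp_seat(tables, alpha=1.0)
    if i == len(tables):
        tables.append(0)
    tables[i] += 1
print(tables)   # e.g. [3, 1, 1]
```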


Sampling Basics


Sampling Basics

  • Generate a sample from probability distribution:
  • Count the samples and calculate probabilities
  • More samples = better approximation

Distribution: P(Noun) = 0.5, P(Verb) = 0.3, P(Preposition) = 0.2

Sample: Verb Verb Prep. Noun Noun Prep. Noun Verb Verb Noun ...
Estimates: P(Noun) = 4/10 = 0.4, P(Verb) = 4/10 = 0.4, P(Preposition) = 2/10 = 0.2

[Plot: estimated probabilities of Noun/Verb/Prep. converging to the true values as the number of samples grows from 1 to 10^6]


Actual Algorithm

SampleOne(probs[]):
    z = sum(probs)                 # calculate the sum of probs
    remaining = rand(z)            # generate a number from a uniform distribution over [0, z)
    for each i in 1:probs.size     # iterate over all probabilities
        remaining -= probs[i]      # subtract the current prob. value
        if remaining <= 0          # if it goes below zero,
            return i               # return the current index as the answer

Bug check: beware of overflow!
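A directly runnable Python version of the same routine (a sketch; in Python one could also use random.choices with weights):

```python
import random

def sample_one(probs):
    """Draw an index i with probability proportional to probs[i]."""
    z = sum(probs)
    remaining = random.uniform(0, z)
    for i, p in enumerate(probs):
        remaining -= p
        if remaining <= 0:
            return i
    return len(probs) - 1   # guard against floating-point round-off

print(sample_one([0.5, 0.3, 0.2]))   # 0, 1, or 2
```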


Gibbs Sampling

  • Want to sample a 2-variable distribution P(A,B)
  • … but cannot sample directly from P(A,B)
  • … but can sample from P(A|B) and P(B|A)
  • Gibbs sampling samples variables one-by-one to recover the true distribution
  • Each iteration:
      Leave A fixed, sample B from P(B|A)
      Leave B fixed, sample A from P(A|B)


Example of Gibbs Sampling

  • A parent A and a child B are shopping; what are their sexes?

      P(Mother|Daughter) = 5/6 = 0.833      P(Daughter|Mother) = 2/3 = 0.667
      P(Mother|Son) = 5/8 = 0.625           P(Daughter|Father) = 2/5 = 0.4

  • Initial state: Mother/Daughter
      Sample the parent from P(Mother|Daughter) = 0.833 → chose Mother
      Sample the child  from P(Daughter|Mother) = 0.667 → chose Son;       c(Mother, Son)++
      Sample the parent from P(Mother|Son) = 0.625      → chose Mother
      Sample the child  from P(Daughter|Mother) = 0.667 → chose Daughter;  c(Mother, Daughter)++
      ...
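A runnable Python sketch of this Gibbs sampler, using the conditional probabilities from the slide (structure and names are mine):

```python
import random
from collections import Counter

# P(parent | child) and P(child | parent) from the slide
p_mother_given = {"Daughter": 0.833, "Son": 0.625}
p_daughter_given = {"Mother": 0.667, "Father": 0.4}

parent, child = "Mother", "Daughter"   # initial state
counts = Counter()
for _ in range(100000):
    parent = "Mother" if random.random() < p_mother_given[child] else "Father"
    child = "Daughter" if random.random() < p_daughter_given[parent] else "Son"
    counts[(parent, child)] += 1       # count the sampled joint state

total = sum(counts.values())
print({k: round(v / total, 3) for k, v in counts.items()})
```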


Try it Out:

  • In this case, we can confirm this result by hand

[Plot: estimated probabilities of Mother/Daughter, Mother/Son, Father/Daughter, and Father/Son converging as the number of samples grows from 1 to 10^6]


Learning a Hidden Markov Model Part-of-Speech Tagger with Sampling


Unsupervised Learning

  • Observed Training Data X
  • e.g.: A corpus of natural language text
  • Hidden Variables Y
  • e.g.: States of the HMM = Parts of Speech of words
  • Unobserved Parameters θ
  • Generally probabilities

Task: Unsupervised POS Induction

  • Input: Collection of word strings X

the boats row in a row

  • Output: Collection of clusters Y

1   2     3   4  1   2      (1→Determiner, 2→Noun, 3→Verb, 4→Preposition)

the boats row in a row
Det N     V   P  Det N


Model: HMM

  • Variables Y correspond to hidden states
  • State transition probability:
  • Generate each word from a hidden state
  • Word emission probability:

the boats row in a row
 1    2    3  4  1  2

P_T(1|0) P_T(2|1) P_T(3|2) ...      P_E(the|1) P_E(boats|2) P_E(row|3) ...

P_T(y_i|y_{i-1}) = θ_{T, y_i, y_{i-1}}        P_E(x_i|y_i) = θ_{E, y_i, x_i}


Sampling the HMM

  • Initialize Y randomly
  • Sample each element of Y using Gibbs sampling

the boats row in a row
 1    2    3  4  1  2        (sample one tag at a time, e.g. the tag of "in")


Sampling the HMM

  • Probabilities affected by a single tag
  • Transition from previous tag: PT(yi|yi-1)
  • Transition to next tag: PT(yi+1|yi)
  • Emission probability: PE(xi|yi)
  • Sample the tag value according to these probabilities
  • All variables that have an effect form the "Markov blanket"

the boats row in a row
 1    2    3  4  1  2        (Markov blanket of a tag: the previous tag, the next tag, and the emitted word)


Calculating HMM Probabilities with DP Priors

  • Transition probability:
  • Emission probability:

PT y i∣y i−1=c y i−1 y iT∗PbaseT y i  c y i−1T PE xi∣y i=cy i , x iE∗PbaseE xi cy iE


Sampling Algorithm for One Tag

SampleTag(yi):
    c(yi-1 yi)--; c(yi yi+1)--; c(yi→xi)--              # subtract the current tag's counts
    for each tag in S (all POS tags)                    # calculate the probability of every possible tag
        p[tag] = PT(tag|yi-1) * PT(yi+1|tag) * PE(xi|tag)
    yi = SampleOne(p)                                   # choose a new tag
    c(yi-1 yi)++; c(yi yi+1)++; c(yi→xi)++              # add the new tag's counts


Sampling Algorithm for All Tags

SampleCorpus():
    initialize Y randomly             # randomly initialize the tags
    for N iterations
        for each yi in the corpus     # sample all the tags
            SampleTag(yi)
        save parameters               # save a sample of θ
    average parameters                # average the saved samples of θ


Choosing Hyperparameters

  • Must choose α properly to get the desired effect
  • Small α (< 0.1) creates sparse distributions
      – If we want each word to have one POS tag, we can set α_E of the emission distribution P_E to be small
  • Most distributions are sparse, so α is often set small
  • Best to confirm through experiments
  • Can also give the hyperparameters a prior and sample them as well


From the Finite HMM to the Infinite HMM


Base Measure and Dimensionality

  • Using a uniform distribution as the base measure

[Bar charts: a uniform base measure over 6 parts of speech (each POS gets probability 1/6) and over 20 parts of speech (each POS gets probability 1/20); x-axis: POS number, y-axis: probability]


In the Limit...

  • As the number of POSs goes to infinity:
  • The probability of each individual POS under P_base goes to zero
  • But the total probability of P_base stays the same:

      lim_{N→∞} Σ_{i=1}^N 1/N = 1        (N = number of POSs)

      P(y_i|y_{i-1}) = (c(y_{i-1} y_i) + α P_base(y_i)) / (c(y_{i-1}) + α)

[Bar charts: a uniform base measure over 100 parts of speech and over 1 million parts of speech; each individual probability shrinks toward zero]

Finite HMM and Infinite HMM

  • Finite HMM
      Probability of emitting POS y_i (after y_{i-1}):
          P(y_i|y_{i-1}) = (c(y_{i-1} y_i) + α P_base(y_i)) / (c(y_{i-1}) + α)
  • Infinite HMM
      Probability of emitting an existing POS y_i (after y_{i-1}):
          P(y_i|y_{i-1}) = c(y_{i-1} y_i) / (c(y_{i-1}) + α)
      Probability of emitting a new POS (after y_{i-1}):
          P(y_i = new|y_{i-1}) = α / (c(y_{i-1}) + α)


Example

  • Assume c(y_{i-1}=1 → y_i=1) = 1 and c(y_{i-1}=1 → y_i=2) = 1

When there are 2 possible POSs:
    P(y_i=1 | y_{i-1}=1)    = (1 + α·1/2) / (2 + α)
    P(y_i=2 | y_{i-1}=1)    = (1 + α·1/2) / (2 + α)
    P(y_i≠1,2 | y_{i-1}=1)  = (α·0) / (2 + α)

When there are 20 possible POSs:
    P(y_i=1 | y_{i-1}=1)    = (1 + α·1/20) / (2 + α)
    P(y_i=2 | y_{i-1}=1)    = (1 + α·1/20) / (2 + α)
    P(y_i≠1,2 | y_{i-1}=1)  = (α·18/20) / (2 + α)

When there are infinitely many possible POSs:
    P(y_i=1 | y_{i-1}=1)    = (1 + α·1/∞) / (2 + α)
    P(y_i=2 | y_{i-1}=1)    = (1 + α·1/∞) / (2 + α)
    P(y_i≠1,2 | y_{i-1}=1)  = (α·1) / (2 + α)
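A tiny Python sketch of this calculation for a varying number of POS tags (a sketch; names are mine):

```python
def ihmm_probs(c_existing, alpha, n_pos):
    """Probabilities of the seen tags and of any unseen tag after y_{i-1}=1,
    given the counts c_existing and a uniform base measure over n_pos tags
    (n_pos = float('inf') gives the infinite-HMM case)."""
    total = sum(c_existing.values())
    p_base = 0.0 if n_pos == float('inf') else 1.0 / n_pos
    p_seen = {y: (c + alpha * p_base) / (total + alpha) for y, c in c_existing.items()}
    p_unseen = alpha * (1.0 - p_base * len(c_existing)) / (total + alpha)
    return p_seen, p_unseen

# c(1->1) = c(1->2) = 1 as above, for 2, 20, and infinitely many POSs
for n in (2, 20, float('inf')):
    print(n, ihmm_probs({1: 1, 2: 1}, alpha=1.0, n_pos=n))
```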


Sampling Algorithm

SampleTag(yi):
    c(yi-1 yi)--; c(yi yi+1)--; c(yi→xi)--                   # remove the counts for the current tag
    for each tag in S (existing POSs)                        # calculate the existing POS probabilities
        p[tag] = PT(tag|yi-1) * PT(yi+1|tag) * PE(xi|tag)
    p[|S|+1] = PT(new|yi-1) * PT(yi+1|new) * PE(xi|new)      # calculate the new POS probability
    yi = SampleOne(p)                                        # pick a single value
    c(yi-1 yi)++; c(yi yi+1)++; c(yi→xi)++                   # add the new counts


Non-Uniform Base Measures

  • The previous slides assumed uniform base measures, but this is not required
  • Example: a language model's unknown word model
  • Split each word into characters to give some probability to all words:

      P(word) = (c(word) + α P_base(word)) / (c(·) + α)
      P_base("word") = P_len(4) · P_char(w) · P_char(o) · P_char(r) · P_char(d)

  • The probability is not equal for every word, but every member of an infinite collection gets some probability


Implementation Tips

  • Zero-count classes remain → wasted memory
  • When new classes are made, re-use old class numbers:
      Counts:  c(y_1)=5  c(y_2)=0  c(y_3)=1
      Dumb:    c(y_1)=5  c(y_2)=0  c(y_3)=1  c(y_4)=1
      Smart:   c(y_1)=5  c(y_2)=1  c(y_3)=1
  • When c(y)=0, the probability of reviving that POS becomes 0
  • This model doesn't handle new POSs well: a new POS can only appear after one type of POS
  • Can fix this with a hierarchical model:
      P_T(y_i|y_{i-1}) ~ DP(α, P_T(y_i))      (transition prob.)
      P_T(y_i) ~ DP(α, P_base(y_i))           (POS prob.)


Debugging

  • Unit tests! Unit tests! Unit tests!
  • Remove bugs in both the implementation and the conceptualization
  • Create fail-safe functions for adding/subtracting counts that terminate if a count goes below zero
  • When the program finishes, remove all samples and make sure the counts are exactly zero
  • The likelihood will not always go up, but if it consistently goes down something is probably wrong
  • Set the random seed to a fixed value (srand)

Recent Topics


Block Sampling

  • Often hidden variables depend on each other strongly
  • For example, variables close in time and space
  • Block sampling samples multiple hidden variables at a time, considering their dependence
  • HMMs use forward filtering / backward sampling
  • Context-free grammars, etc. are also possible

[Figure: the tag sequence 1 2 3 4 1 2 for "the boats row in a row"; the sampled tags depend on their neighbors]


Forward Filtering

  • Forward filtering adds up probabilities starting from an initial state

[Figure: a lattice with states s0 ... s5 and edges p(s1|s0), p(s2|s0), p(s3|s1), p(s3|s2), p(s4|s1), p(s4|s2), p(s5|s3), p(s5|s4)]

Forward filtering calculates forward probabilities f:
    f(s0) = 1
    f(s1) = p(s1|s0)·f(s0)
    f(s2) = p(s2|s0)·f(s0)
    f(s3) = p(s3|s1)·f(s1) + p(s3|s2)·f(s2)
    f(s4) = p(s4|s1)·f(s1) + p(s4|s2)·f(s2)
    f(s5) = p(s5|s3)·f(s3) + p(s5|s4)·f(s4)


Backward Sampling

  • Backward sampling starts at the accepting state and samples edges in backwards order, considering edge probabilities and forward probabilities:

    e(s5→x):  p(x=s3) ∝ p(s5|s3)·f(s3)      p(x=s4) ∝ p(s5|s4)·f(s4)
    e(s3→x):  p(x=s1) ∝ p(s3|s1)·f(s1)      p(x=s2) ∝ p(s3|s2)·f(s2)

[Figure: the same lattice, with one sampled path, e.g. s0 → s2 → s3 → s5]
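A compact Python sketch of forward filtering / backward sampling for a discrete HMM (a sketch of the general technique, not code from the slides; names are mine):

```python
import numpy as np

def ffbs(trans, emit, obs, init):
    """Sample a state sequence from P(Y | X) for a discrete HMM.
    trans[i, j] = P(y_t = j | y_{t-1} = i), emit[i, o] = P(x_t = o | y_t = i),
    init[i] = P(y_1 = i), obs = list of observation indices."""
    T, K = len(obs), len(init)
    f = np.zeros((T, K))
    f[0] = init * emit[:, obs[0]]                       # forward filtering
    for t in range(1, T):
        f[t] = (f[t - 1] @ trans) * emit[:, obs[t]]
    y = np.zeros(T, dtype=int)                          # backward sampling
    y[T - 1] = np.random.choice(K, p=f[T - 1] / f[T - 1].sum())
    for t in range(T - 2, -1, -1):
        w = f[t] * trans[:, y[t + 1]]
        y[t] = np.random.choice(K, p=w / w.sum())
    return y

# toy example with 2 states and 2 observation symbols
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.9, 0.1], [0.2, 0.8]])
print(ffbs(trans, emit, obs=[0, 0, 1, 1], init=np.array([0.5, 0.5])))
```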


Type-Based Sampling

  • Sample variables that have the same Markov blanket at once
  • Here, the Markov blanket is "3, in, 1":

      the boats row in a row          he will jump in the pool
       1    2    3  4  1  2            2   5    3   4  1   2


Type-Based Sampling

  • Models based on Dirichlet distributions tend to assign the same tag to similar values (rich-gets-richer)
  • Good for modeling: induces a consistent, compact model
  • Bad for inference: creates "valleys" in the posterior probability
  • We are on the right side of the valley; the left side has more probability, but reaching it requires several variable changes
  • Possible to escape, but it takes a very long time

[Plot: posterior probability as a function of the number of x=1 values, with two modes separated by a low-probability valley]


Type-based Sampling

  • For each type, sample the number of instances with x=1
  • Here, "x=1" has one instance
  • The Markov blankets are identical, so the probabilities are as well
  • Can set one instance to x=1 at random and all the others to x=2

[Plot: probability as a function of the number of x=1 values]


Hierarchical Models

  • Multiple levels using the hierarchical Dirichlet process:

      Transition prob.:      P(y_i|y_{i-1}) = (c(y_{i-1} y_i) + α P_base(y_i)) / (c(y_{i-1}) + α)
      Shared base measure:   P_base(y_i) = (c_base(y_i) + α · 1/N) / (c_base(·) + α)

[Figure: the transition distributions P(y_i|y_{i-1}=1), P(y_i|y_{i-1}=2), P(y_i|y_{i-1}=3) all share the base measure P_base(y_i)]


Counting cbase

  • Use the Chinese restaurant process
  • Add customers to the top level for each data point; add customers to the bottom level for each table in the top level

[Figure: for context y_{i-1}=1 the data y_i = 1 2 1 3 1 are seated at tables serving 1, 2, 1, 3; for context y_{i-1}=2 the data y_i = 1 4 2 2 4 are seated at tables serving 1, 4, 2; the base restaurant receives one customer for each of these tables, at tables serving 1, 2, 3, 4]


Pitman-Yor Process

  • Similar to the Dirichlet process, but adds a table discount d:

      P(x_i) = (c(x_i) − d·t(x_i) + (α + d·t(·)) · P_base(x_i)) / (c(·) + α)

      e.g. for the restaurant above (c(x=1)=3 over t(x=1)=2 tables, t(·)=4 tables, c(·)=5 customers, P_base=0.25):
      P(x_i=1) = (3 − d·2 + (α + d·4)·0.25) / (5 + α)

  • Similar to absolute discounting for language models
  • Able to model power-law distributions, which are common in language


Examples from Speech and Language Processing


Topic Models

  • Latent Dirichlet Allocation (LDA) [Blei+ 03]
  • Infinite topic models [Teh+ 06]
  • Applications to computer vision, document clustering, language modeling (e.g. [Heidel+ 07])

For each document in a collection, generate a multinomial topic distribution (with a Dirichlet prior), e.g. over
{Politics, Entertainment, Sport, Economy, Society, Science}: {0.4, 0.05, 0.3, 0.2, 0.01, 0.04}.
Generate each word's topic from the topic distribution, then generate each word from that topic's word distribution:

    1    1       4    3   3       3
    Bill Clinton buys the Detroit Tigers


Language Models

  • Hierarchical Pitman-Yor language model [Teh 06]: bigram distributions share a unigram base measure
  • Improvements to modeling accuracy from using the Pitman-Yor process
  • Similar accuracy to Kneser-Ney smoothing
  • Used in speech recognition [Huang&Renals 07]

[Figure: the bigram distributions P(w_i|w_{i-1}=1), P(w_i|w_{i-1}=2), P(w_i|w_{i-1}=3) share the unigram base measure P_base(w_i)]


Unsupervised Word Segmentation

  • Generate word sequences from 1-gram or 2-gram models [Goldwater+ 09]
  • Improvements using block sampling and a Pitman-Yor language model [Mochihashi+ 09]

Example (Japanese これは単語です, "this is a word"): sampling chooses between the segmentations

    これ は 単語 で す        or        これ は 単 語 で す

by comparing P(単語) against P(単)·P(語).

Learning a Language Model from Continuous Speech

  • Use a Pitman-Yor language model to learn a language model and word dictionary from speech [Neubig+ 10]
  • Use forward filtering / backward sampling over phoneme lattices
  • Can be used for:
      Learning models for languages with no written text
      Learning models faithful to spoken language

[Figure: speech → acoustic model → phoneme lattice → learning → spoken language model]


Learning Various Types of Linguistic Information

  • POS tags using the infinite HMM [Beal+ 02]
  • CFGs [Johnson+ 07] and the infinite CFG [Liang+ 07]
  • Word and phrase alignment for machine translation [DeNero+ 08, Blunsom+ 09, Neubig+ 11]
  • Non-parametric extensions of unsupervised semantic parsing [Poon+ 09, Titov+ 11]


References

  • M.J. Beal, Z. Ghahramani, and C.E. Rasmussen. 2002. The infinite hidden Markov model. Proceedings of the 16th Annual Conference on Neural Information Processing Systems, 1:577–584.
  • David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.
  • Phil Blunsom, Trevor Cohn, Chris Dyer, and Miles Osborne. 2009. A Gibbs sampler for phrasal synchronous grammar induction. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics, pages 782–790.
  • John DeNero, Alex Bouchard-Côté, and Dan Klein. 2008. Sampling alignment structure under a Bayesian translation model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 314–323.
  • Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2009. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1):21–54.
  • A. Heidel, H. Chang, and L. Lee. 2007. Language model adaptation using latent Dirichlet allocation and an efficient topic inference algorithm. In Proceedings of the 8th Annual Conference of the International Speech Communication Association (InterSpeech).
  • S. Huang and S. Renals. 2007. Hierarchical Pitman-Yor language models for ASR in meetings. In Proceedings of the 2007 IEEE Automatic Speech Recognition and Understanding Workshop, pages 124–129.
  • Mark Johnson, Thomas Griffiths, and Sharon Goldwater. 2007. Bayesian inference for PCFGs via Markov chain Monte Carlo. In Proceedings of the Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 139–146.


References

  • P. Liang, S. Petrov, M. Jordan, and D. Klein. 2007. The infinite PCFG using hierarchical Dirichlet processes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 688–697.
  • Daichi Mochihashi, Takeshi Yamada, and Naonori Ueda. 2009. Bayesian unsupervised word segmentation with nested Pitman-Yor modeling. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics.
  • Graham Neubig, Masato Mimura, Shinsuke Mori, and Tatsuya Kawahara. 2010. Learning a language model from continuous speech. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (InterSpeech), Makuhari, Japan.
  • Graham Neubig, Taro Watanabe, Eiichiro Sumita, Shinsuke Mori, and Tatsuya Kawahara. 2011. An unsupervised model for joint phrase alignment and extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, Oregon, USA.
  • H. Poon and P. Domingos. 2009. Unsupervised semantic parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1–10.
  • Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
  • Yee Whye Teh. 2006. A Bayesian interpretation of interpolated Kneser-Ney. Technical report, School of Computing, National Univ. of Singapore.
  • Ivan Titov and Alexandre Klementiev. 2011. A Bayesian model for unsupervised semantic parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics.