
Tutorial 2

Monday 8th August, 2016

Problem 1. Case for non-IID dataset: In the class, we discussed the case of Bayesian estimation for a univariate Gaussian from dataset D that consisted of IID (independent and identically distributed) observations.

  • Let Pr(X) ∼ N(µ, σ²) and let the data D = x₁ . . . x_m be IID. Let σ² be known.
  • µ_MLE = (1/m) Σᵢ₌₁ᵐ xᵢ and σ²_MLE = (1/m) Σᵢ₌₁ᵐ (xᵢ − µ)²
  • The conjugate prior is Pr(µ) = N(µ₀, σ₀²), and the posterior is Pr(µ | x₁ . . . x_m) = N(µ_m, σ_m²) such that
  • µ_m = (σ² / (mσ₀² + σ²)) µ₀ + (mσ₀² / (mσ₀² + σ²)) µ_MLE and 1/σ_m² = 1/σ₀² + m/σ²

Prove the above.

Answer: We have already done this in class: https://www.cse.iitb.ac.in/~cs725/notes/lecture-slides/lecture-06-unannotated.pdf

Now suppose the examples x₁ . . . x_m in the dataset D are not necessarily independent, their possible dependence is expressed by a known covariance matrix Ω, and they have a common unknown (to be estimated) mean µ ∈ ℝ. Let u = [1, 1, . . . , 1]ᵀ be an m-dimensional vector of 1's, let x = [x₁ . . . x_m]ᵀ, and let

Pr(x₁ . . . x_m; µ, Ω) = (1 / ((2π)^(m/2) |Ω|^(1/2))) exp( −½ (x − µu)ᵀ Ω⁻¹ (x − µu) )

Assume that Ω ∈ ℝ^(m×m) is positive-definite. Now answer the following questions:

  • 1. What would be the maximum likelihood estimate for µ?

Answer: This corresponds to the MLE for a multivariate Gaussian, but with a single data point and with the restriction that the mean vector is of the form µu. We have already seen that maximizing a monotonically increasing transformation of the objective yields the same point of optimality (and we prove this in Problem 2 of this tutorial). Taking the log of the likelihood gives us the log-likelihood:


µ_MLE = argmax_µ −½ (x − µu)ᵀ Ω⁻¹ (x − µu)

Setting the derivative with respect to µ to 0:

d/dµ [ −½ (xᵀΩ⁻¹x − 2µ xᵀΩ⁻¹u + µ² uᵀΩ⁻¹u) ] = xᵀΩ⁻¹u − µ uᵀΩ⁻¹u = 0

⇒ µ_MLE = (xᵀ Ω⁻¹ u) / (uᵀ Ω⁻¹ u)
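The closed-form estimate above can be checked numerically. The following sketch (all values hypothetical: a randomly generated positive-definite Ω, true mean 2.0) computes µ_MLE from the formula and confirms that a brute-force grid search over the log-likelihood peaks at the same point:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5

# Hypothetical positive-definite covariance Omega (A A^T + I is PD).
A = rng.standard_normal((m, m))
Omega = A @ A.T + np.eye(m)

# Correlated observations sharing a common (to-be-estimated) mean.
mu_true = 2.0
u = np.ones(m)
x = rng.multivariate_normal(mu_true * u, Omega)

Omega_inv = np.linalg.inv(Omega)

# Closed-form MLE derived above: mu = (x^T Omega^{-1} u) / (u^T Omega^{-1} u).
mu_mle = (x @ Omega_inv @ u) / (u @ Omega_inv @ u)

# Sanity check: the log-likelihood (up to constants) should peak at mu_mle.
def neg_half_quad(mu):
    d = x - mu * u
    return -0.5 * d @ Omega_inv @ d

grid = np.linspace(mu_mle - 1.0, mu_mle + 1.0, 2001)
mu_grid = grid[np.argmax([neg_half_quad(mu) for mu in grid])]
```

Note that when Ω⁻¹ weights all coordinates equally, the estimate reduces to the sample mean, matching the IID case.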

  • 2. How would you go about doing Bayesian estimation for µ?
  • 3. What will be an appropriate conjugate prior?
  • 4. What will the posterior be? And what will be the MAP and Bayes estimates?

Answers to 2, 3 and 4: As hinted in class, we expect the conjugate prior for the mean µ of the (product of) Gaussians to be Gaussian. Let µ ∼ N(µ₀, σ₀²) with a fixed and known σ₀². Then

N(µ_m, σ_m²) ∝ exp( −(µ − µ_m)² / (2σ_m²) ) = Pr(µ | D) ∝ Pr(D | µ) Pr(µ)
    = (1 / ((2π)^(m/2) |Ω|^(1/2))) (1 / √(2πσ₀²)) exp( −½ (x − µu)ᵀΩ⁻¹(x − µu) − (µ − µ₀)²/(2σ₀²) )
    ∝ exp( −½ (x − µu)ᵀΩ⁻¹(x − µu) − (µ − µ₀)²/(2σ₀²) )

Our reference equality is:

exp( −½ (xᵀΩ⁻¹x − 2µ xᵀΩ⁻¹u + µ² uᵀΩ⁻¹u) − (µ − µ₀)²/(2σ₀²) ) = exp( −(µ − µ_m)²/(2σ_m²) )

Matching coefficients of µ², we get

−µ²/(2σ_m²) = −½ µ² uᵀΩ⁻¹u − µ²/(2σ₀²) ⇒ 1/σ_m² = 1/σ₀² + uᵀΩ⁻¹u

Matching coefficients of µ, we get

2µµ_m/(2σ_m²) = µ ( xᵀΩ⁻¹u + µ₀/σ₀² ) ⇒ µ_m = σ_m² ( xᵀΩ⁻¹u + µ₀/σ₀² ) = (σ₀² xᵀΩ⁻¹u + µ₀) / (1 + σ₀² uᵀΩ⁻¹u)

µ_m will be the MAP estimate of µ. Since the posterior is Gaussian, its mode coincides with its mean, so µ_m is also the Bayes (posterior-mean) estimate.
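The posterior parameters can also be verified numerically. The sketch below (hypothetical values throughout: a random positive-definite Ω, prior N(0.5, 2.0)) computes µ_m and σ_m² from the matched-coefficient formulas, then checks that the special case Ω = σ²I recovers the IID posterior mean from the first part of the problem:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4

# Hypothetical known covariance Omega and prior N(mu0, s0sq).
A = rng.standard_normal((m, m))
Omega = A @ A.T + np.eye(m)
mu0, s0sq = 0.5, 2.0

u = np.ones(m)
x = rng.multivariate_normal(1.5 * u, Omega)
Omega_inv = np.linalg.inv(Omega)

# Posterior parameters from matching coefficients:
#   1/sigma_m^2 = 1/sigma_0^2 + u^T Omega^{-1} u
#   mu_m = (s0sq x^T Omega^{-1} u + mu0) / (1 + s0sq u^T Omega^{-1} u)
utOu = u @ Omega_inv @ u
xtOu = x @ Omega_inv @ u
smsq = 1.0 / (1.0 / s0sq + utOu)
mu_m = (s0sq * xtOu + mu0) / (1.0 + s0sq * utOu)

# Special case Omega = sigma^2 I should recover the IID posterior mean.
sigma_sq = 3.0
Omega_iid_inv = np.eye(m) / sigma_sq
utOu_iid = u @ Omega_iid_inv @ u                      # = m / sigma^2
mu_m_iid = (s0sq * (x @ Omega_iid_inv @ u) + mu0) / (1.0 + s0sq * utOu_iid)

# IID formula from the first part of the problem:
mu_ml = x.mean()
mu_m_ref = (sigma_sq / (m * s0sq + sigma_sq)) * mu0 \
         + (m * s0sq / (m * s0sq + sigma_sq)) * mu_ml
```

The agreement of `mu_m_iid` and `mu_m_ref` is a useful consistency check, and it previews the HOMEWORK question about diagonal Ω.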

HOMEWORK: What about the special cases of Ω being a diagonal matrix with the same or with different values along the diagonal?

Problem 2. We discussed at least two settings where maximizing a monotonically increasing function of the objective is somewhat more intuitive than maximizing the original objective. Recall the two settings. Now prove that maximizing the monotonically increasing transformation of the objective gives the same optimality point as does maximizing the original objective.


Answer: We will prove by contradiction. Let O(θ) be the objective function being maximized, and let θ* = argmax_θ O(θ). Let f be a monotonically increasing function, and let θ̂ = argmax_θ f(O(θ)). Suppose θ̂ ≠ θ* and f(O(θ̂)) > f(O(θ*)). Since f is a monotonically increasing function of its argument, it must be that O(θ̂) > O(θ*), which is a contradiction, since we had θ* = argmax_θ O(θ). Thus it must be that either θ̂ = θ* or f(O(θ̂)) = f(O(θ*)).
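A quick numerical illustration of the claim (with a hypothetical objective chosen only for the example: O(θ) = −(θ − 1.3)², and f = exp, mirroring the likelihood vs. log-likelihood setting):

```python
import numpy as np

# argmax f(O(theta)) should equal argmax O(theta) for monotone increasing f.
theta = np.linspace(-5.0, 5.0, 10001)
O = -(theta - 1.3) ** 2      # maximized at theta = 1.3
f_of_O = np.exp(O)           # monotone transform (likelihood from log-likelihood)

same_argmax = np.argmax(O) == np.argmax(f_of_O)
```

This is exactly why we are free to maximize the log-likelihood instead of the likelihood itself.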