

Assignment 3
Zahra Sheikhbahaee, Zeou Hu & Colin Vandenhof
February 2020

1 [2 points] Mixture of Bernoullis

A mixture of Bernoullis model is like the Gaussian mixture model which we've discussed in this course. Each of the mixture components consists of a collection of independent Bernoulli random variables. In general, a mixture model assumes the data are generated by the following process: first we sample z, and then we sample the observables x from a distribution which depends on z, i.e.

    p(x, z) = p(x | z) p(z)                                                   (1)

In mixture models, p(z) is always a multinomial distribution with parameter π = {π_1, ..., π_K}, the mixture weights, which satisfy

    Σ_{k=1}^K π_k = 1,   π_k ≥ 0                                              (2)

Consider a set of N binary random variables x_i in a D-dimensional space, i = 1, ..., N, where component j of x_i under mixture component k is governed by a Bernoulli distribution with parameter θ_jk:

    p(x_i | z_i = k, θ) = ∏_{j=1}^D θ_jk^{x_ij} (1 − θ_jk)^{1 − x_ij}         (3)

We can write the generative model of a mixture model as

    p(z | π) ∼ Multinomial(π) = ∏_{k=1}^K π_k^{z_k}
    p(x | z, θ) ∼ Bernoulli(θ) = ∏_{k=1}^K [ θ_k^x (1 − θ_k)^{1 − x} ]^{z_k}  (4)

The distribution p(z | π) gives the mixture proportions, and π_k is the weight of the k-th component. So the Bernoulli mixture model for the whole data set is given as

    p(x) = ∏_{i=1}^N Σ_{k=1}^K π_k ∏_{j=1}^D θ_jk^{x_ij} (1 − θ_jk)^{1 − x_ij}   (5)
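To make the generative process in (1)-(5) concrete, here is a minimal sampling and likelihood sketch. It assumes NumPy; the values of π and θ below are arbitrary illustrations and are not part of the assignment.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative parameters: K = 3 components, D = 4 binary features.
    pi = np.array([0.5, 0.3, 0.2])               # mixture weights, sum to 1
    theta = rng.uniform(0.1, 0.9, size=(3, 4))   # theta[k, j] = P(x_j = 1 | z = k)

    def sample_bernoulli_mixture(n):
        """Draw n samples: z ~ Multinomial(pi), then x_j ~ Bernoulli(theta[z, j])."""
        z = rng.choice(len(pi), size=n, p=pi)
        x = (rng.uniform(size=(n, theta.shape[1])) < theta[z]).astype(int)
        return x, z

    def log_likelihood(X):
        """Incomplete-data log likelihood: sum_i log sum_k pi_k prod_j Bern(x_ij; theta_jk)."""
        log_px_given_z = X @ np.log(theta.T) + (1 - X) @ np.log(1 - theta.T)   # shape (n, K)
        return np.logaddexp.reduce(np.log(pi) + log_px_given_z, axis=1).sum()

    X, Z = sample_bernoulli_mixture(1000)
    print(log_likelihood(X))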

• Show the associated directed graphical model and write down the incomplete-data log likelihood. The complete-data log likelihood for this model can be written as

    ln p(x, z | π, θ) = Σ_{i=1}^N Σ_{k=1}^K ln [ π_k ∏_{j=1}^D p(x_ij | θ_jk) ]^{z_ik}   (6)

In order to derive the EM algorithm, we take the expectation of the complete-data log likelihood with respect to the posterior distribution of the latent variable z. Write down Q(ξ; ξ^(old)), where ξ = {θ, π}, and the posterior distribution of the latent variable z.

• Derive the updates for π and θ in the M-step for ML estimation in terms of E[z_ik], and write down E[z_ik]. (A code sketch of these updates appears after this list.)

• Consider a mixture distribution p(x) and show that

    E[x] = Σ_{k=1}^K π_k θ_k
    cov[x] = Σ_{k=1}^K π_k { Σ_k + θ_k θ_k^T } − E[x] E[x]^T   (7)

where Σ_k = diag[θ_ki (1 − θ_ki)].
Hint: Solve the second equation in the general case by adding and subtracting a term which is a function of E[x | k] = θ_k. (A Monte Carlo check of these identities also follows the list.)

• We now consider a Bayesian model in which we impose priors on the parameters. We impose the natural conjugate priors, i.e., a Beta prior for each θ_jk and a Dirichlet prior for π:

    p(π | α) ∼ Dir(α)
    p(θ_jk | a, b) ∼ Beta(a, b)   (8)

Show that the M-step for MAP estimation of a mixture of Bernoullis is given by

    θ_jk = ( Σ_i E[z_ik] x_ij + a − 1 ) / ( Σ_i E[z_ik] + a + b − 2 )
    π_k = ( Σ_i E[z_ik] + α_k − 1 ) / ( N + Σ_k α_k − K )   (9)

Hint: For the maximization w.r.t. π_k in the M-step, you need to use a Lagrange multiplier to enforce the constraint on π.
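The sketch below shows what the responsibilities E[z_ik] and the ML M-step updates look like in code, assuming they take the usual soft-assignment form you are asked to derive. It uses NumPy only; function and variable names are illustrative.

    import numpy as np

    def em_bernoulli_mixture(X, K, n_iter=100, seed=0, eps=1e-9):
        """EM for a mixture of Bernoullis (ML estimation) -- a sketch, not a tuned implementation."""
        rng = np.random.default_rng(seed)
        N, D = X.shape
        pi = np.full(K, 1.0 / K)
        theta = rng.uniform(0.25, 0.75, size=(K, D))

        for _ in range(n_iter):
            # E-step: gamma[i, k] = E[z_ik] = p(z_i = k | x_i, pi, theta), computed in log space.
            log_px = X @ np.log(theta.T + eps) + (1 - X) @ np.log(1 - theta.T + eps)
            log_r = np.log(pi + eps) + log_px
            log_r -= np.logaddexp.reduce(log_r, axis=1, keepdims=True)
            gamma = np.exp(log_r)

            # M-step (ML): pi_k = N_k / N,  theta_jk = sum_i gamma_ik x_ij / N_k.
            Nk = gamma.sum(axis=0)
            pi = Nk / N
            theta = (gamma.T @ X) / Nk[:, None]

        return pi, theta, gamma

Switching to the MAP M-step in (9) would amount to adding the pseudo-counts a − 1, b − 1 and α_k − 1 to the corresponding numerators and denominators.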

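As a quick sanity check on the moment identities in (7), one can compare the analytic mean and covariance with Monte Carlo estimates from mixture samples. A minimal sketch, again assuming NumPy and arbitrary illustrative parameters:

    import numpy as np

    rng = np.random.default_rng(1)
    K, D, n = 3, 4, 200_000
    pi = np.array([0.5, 0.3, 0.2])
    theta = rng.uniform(0.1, 0.9, size=(K, D))   # theta[k] = E[x | z = k]

    # Analytic moments from (7).
    mean = pi @ theta
    second = sum(pi[k] * (np.diag(theta[k] * (1 - theta[k])) + np.outer(theta[k], theta[k]))
                 for k in range(K))
    cov = second - np.outer(mean, mean)

    # Monte Carlo estimates from samples of the mixture.
    z = rng.choice(K, size=n, p=pi)
    X = (rng.uniform(size=(n, D)) < theta[z]).astype(float)
    print(np.abs(X.mean(axis=0) - mean).max())          # small (Monte Carlo error, roughly 1e-3)
    print(np.abs(np.cov(X, rowvar=False) - cov).max())  # small (Monte Carlo error, roughly 1e-3)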
2 [2 points] Variational Lower Bound for the mixture of Bernoullis

In the mixture of Bernoullis, the multinomial distribution chooses the mixture component. One can assume the conditional probability of each observed component follows a Bernoulli distribution as given in Equation 3. We have priors over π and θ:

    p(π | α) = ( Γ(Σ_k α_k) / ∏_k Γ(α_k) ) ∏_{k=1}^K π_k^{α_k − 1}
    p(θ_jk | a, b) = ( Γ(a + b) / (Γ(a) Γ(b)) ) θ_jk^{a − 1} (1 − θ_jk)^{b − 1}   (10)

• If you consider a variational distribution which factorizes between the latent variables and the parameters, show that the lower bound has the following form:

    L = E[ln p(x | z, θ)] + E[ln p(z | π)] + E[ln p(π | α)] + E[ln p(θ | a, b)]
        − E[ln q(z)] − E[ln q(π)] − E[ln q(θ)]   (11)

• Let us assume that the approximate distributions of the parameters of the model have the following form:

    q(θ | η, ν) ∼ Beta(η, ν)
    q(π | ρ) ∼ Dir(ρ)
    q(z_k | τ_k) ∼ Cat(τ_k)   (12)

Here ρ, τ, η and ν are variational parameters. Derive the variational update equations for the three variational distributions using the mean-field approximation, which should yield

    ρ_k = α + Σ_{i=1}^N τ_ik
    η_jk = a + Σ_{i=1}^N τ_ik x_ij
    ν_jk = b + Σ_{i=1}^N τ_ik (1 − x_ij)
    τ_ik ∝ exp( ψ(ρ_k) − ψ(Σ_{k'=1}^K ρ_{k'}) + Σ_{j=1}^D x_ij [ψ(η_jk) − ψ(η_jk + ν_jk)]
                + Σ_{j=1}^D (1 − x_ij) [ψ(ν_jk) − ψ(η_jk + ν_jk)] )   (13)

Hint: Use the following properties:

    E_{q(θ)}[ln θ_jk] = ψ(η_jk) − ψ(η_jk + ν_jk)
    E_{q(θ)}[ln(1 − θ_jk)] = ψ(ν_jk) − ψ(η_jk + ν_jk)
    E_{q(π)}[ln π_k] = ψ(ρ_k) − ψ(Σ_{k'=1}^K ρ_{k'})
    E_{q(z_k)}[z_k] = τ_k   (14)

where ψ(·) is the digamma function. (A coordinate-ascent sketch of the updates in (13) follows.)
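Below is a minimal coordinate-ascent (CAVI) sketch of the updates in (13). It assumes NumPy and SciPy's digamma; initialization, convergence checks and evaluation of the lower bound itself are omitted, and the scalar hyperparameters alpha, a, b are illustrative.

    import numpy as np
    from scipy.special import digamma

    def cavi_bernoulli_mixture(X, K, alpha=1.0, a=1.0, b=1.0, n_iter=50, seed=0):
        """Mean-field updates for the Bernoulli mixture; eta[k, j] and nu[k, j] play the role
        of eta_jk and nu_jk in (13). A sketch, not a polished implementation."""
        rng = np.random.default_rng(seed)
        N, D = X.shape
        tau = rng.dirichlet(np.ones(K), size=N)   # q(z_i) = Cat(tau_i)

        for _ in range(n_iter):
            # Updates for q(pi) = Dir(rho) and q(theta_jk) = Beta(eta_jk, nu_jk).
            rho = alpha + tau.sum(axis=0)         # shape (K,)
            eta = a + tau.T @ X                   # shape (K, D)
            nu = b + tau.T @ (1 - X)              # shape (K, D)

            # Update for q(z): log tau_ik up to an additive constant, then normalize.
            log_tau = (digamma(rho) - digamma(rho.sum())
                       + X @ (digamma(eta) - digamma(eta + nu)).T
                       + (1 - X) @ (digamma(nu) - digamma(eta + nu)).T)
            log_tau -= log_tau.max(axis=1, keepdims=True)
            tau = np.exp(log_tau)
            tau /= tau.sum(axis=1, keepdims=True)

        return rho, eta, nu, tau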

3 [2 points] Kernel methods

1. The k-nearest-neighbors classifier assigns a point x to the majority class of its k nearest neighbors in the training set. Assume that we use the squared Euclidean distance to measure the distance to a point x_n in the training set, ‖x − x_n‖^2. Reformulate this classifier for a nonlinear kernel k using the kernel trick. (A sketch of the kernelized distance appears after this list.)

2. The file circles.csv contains a toy dataset. Each example has two features that represent its coordinates (x_1, x_2) in 2D space. Points belong to one of 5 classes, which correspond to different circles centered at the origin. We would like to perform classification with an additional feature for the squared Euclidean distance to the origin. Write out the appropriate feature map φ((x_1, x_2)) and kernel function k(x, x′).

3. Perform k-nearest-neighbors classification with k = 15 using the kernel from (2) and the standard linear kernel. Compare accuracies over 10-fold cross-validation. Which version gives better results? (A cross-validation sketch also follows the list.)
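For item 1, the kernel trick rests on the identity ‖φ(x) − φ(x_n)‖^2 = k(x, x) − 2 k(x, x_n) + k(x_n, x_n), so nearest neighbors can be found from kernel evaluations alone. A minimal sketch of a kernel k-NN predictor built on this identity (NumPy only; the function name and the way the kernel values are passed in are assumptions):

    import numpy as np

    def kernel_knn_predict(K_test_train, k_test_diag, k_train_diag, y_train, k=15):
        """k-NN using kernel-induced distances:
        ||phi(x) - phi(x_n)||^2 = k(x, x) - 2 k(x, x_n) + k(x_n, x_n).
        K_test_train[m, n] = k(x_m, x_n); the *_diag arguments hold k(x, x) for the
        test and training points. Assumes integer class labels 0..C-1."""
        d2 = k_test_diag[:, None] - 2 * K_test_train + k_train_diag[None, :]
        nearest = np.argsort(d2, axis=1)[:, :k]   # indices of the k closest training points
        votes = y_train[nearest]                  # their labels, one row per test point
        n_classes = y_train.max() + 1
        return np.array([np.bincount(row, minlength=n_classes).argmax() for row in votes])

With the linear kernel k(x, x′) = x^T x′ this reduces to ordinary Euclidean k-NN.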

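For items 2 and 3, since the feature map is low-dimensional it is easiest to apply it explicitly and run standard k-NN on the mapped features; the kernelized version would use the distance identity above. A sketch assuming pandas and scikit-learn are available, and that circles.csv stores two coordinate columns followed by a class label (the file layout is an assumption):

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    data = pd.read_csv("circles.csv")
    X = data.iloc[:, :2].to_numpy()
    y = data.iloc[:, -1].to_numpy()

    # Augmented representation: append the squared distance to the origin.
    X_aug = np.column_stack([X, (X ** 2).sum(axis=1)])

    knn = KNeighborsClassifier(n_neighbors=15)
    acc_linear = cross_val_score(knn, X, y, cv=10).mean()    # plain coordinates
    acc_aug = cross_val_score(knn, X_aug, y, cv=10).mean()   # with the extra feature
    print(f"linear features: {acc_linear:.3f}   augmented features: {acc_aug:.3f}")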
4 [2 points] Support Vector Machine

1. Recall the formulation of the soft-margin (linear) SVM:

    argmin_{w,b,ξ}  (1/2) ‖w‖^2 + C Σ_{i=1}^n ξ_i
    s.t.  y^(i) ( w^T x^(i) + b ) ≥ 1 − ξ_i,   i = 1, ..., n
          ξ_i ≥ 0,   i = 1, ..., n   (15)

In lecture, the support vector machine was introduced geometrically as finding the Max-Margin Classifier. While this geometric interpretation provides useful intuition about how the SVM works, it is hard to relate to other machine learning algorithms such as logistic regression. In this exercise, we show that the soft-margin SVM is equivalent to minimizing a loss function (specifically, the hinge loss) with L2 regularization, and thus connect it to logistic regression and the goal of binary classification. The hinge loss is defined as V(y, f(x)) = (1 − y f(x))_+, where (s)_+ = max(s, 0). Show that

    argmin_{w,b}  (1/(2n)) Σ_{i=1}^n V(y_i, f(x_i)) + λ ‖w‖^2   (16)

is equivalent to formulation (15) for some C, where f(x) = w^T x + b. What is the corresponding C (in terms of n and λ)?

2. In the previous question, we chose V(y, f(x)) = (1 − y f(x))_+ (the hinge loss) as our loss function; however, there are other reasonable loss functions we could choose. For example, we can choose V(y, f(x)) = (1/log 2) log(1 + e^{−y f(x)}), which is usually called the logistic loss, and

    V(y, f(x)) = 0 if y f(x) ≥ 0,  1 if y f(x) < 0,

which is called the 0-1 loss. Please plot the above three loss functions in one figure, with y f(x) on the horizontal axis and V(y, f(x)) on the vertical axis. Explain your observations. (A plotting sketch appears after this question set.)

3. [Bonus] Answer the following questions as precisely as you can. What is (16) if we choose V(y, f(x)) = (1/log 2) log(1 + e^{−y f(x)})? What is (16) if we choose the 0-1 loss? (Long answers receive no score.)
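For item 2, a minimal plotting sketch, assuming NumPy and Matplotlib (the plotted range of y f(x) is arbitrary):

    import numpy as np
    import matplotlib.pyplot as plt

    m = np.linspace(-3, 3, 601)                    # m = y * f(x), the margin

    hinge = np.maximum(1 - m, 0.0)                 # (1 - yf(x))_+
    logistic = np.log1p(np.exp(-m)) / np.log(2)    # (1/log 2) * log(1 + e^{-yf(x)})
    zero_one = (m < 0).astype(float)               # 1 if yf(x) < 0 else 0

    plt.plot(m, hinge, label="hinge loss")
    plt.plot(m, logistic, label="logistic loss")
    plt.plot(m, zero_one, label="0-1 loss")
    plt.xlabel("y f(x)")
    plt.ylabel("V(y, f(x))")
    plt.legend()
    plt.show()

One thing the plot makes visible is that both the hinge loss and the rescaled logistic loss lie on or above the 0-1 loss everywhere, which is the kind of observation the question asks you to discuss.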
