

  1. CS 3750 Advanced Machine Learning
  Latent Variable Generative Models II
  Ahmad Diab, AHD23@cs.pitt.edu, Feb 4, 2020. Based on slides of Professor Milos Hauskrecht.

  Outline
  • Latent Variable Generative Models
  • Cooperative Vector Quantizer Model
    • Model Formulation
    • Expectation Maximization (EM)
    • Variational Approximation
  • Noisy-OR Component Analyzer
    • Model Formulation
    • Variational EM for NOCA
  • References

  2. Latent Variable Generative Models
  • Generative models: unsupervised learning models that capture the underlying structure (e.g., interesting patterns) and causal structure of the data in order to generate new data that resembles it.
  • Latent (hidden) variables are random variables that are hard to observe (e.g., length is measured, but intelligence is not) and are assumed to affect the response variables.
  • The idea: introduce an unobserved latent variable S and use it to express a complex distribution through a more tractable, less complex one:
    p(x, s) = p(x | s) p(s)
    (p(x) is the complex distribution; p(x | s) and p(s) are the simpler distributions)
  • Assumption: observable variables are independent given the latent variables.
  [Figure: bipartite graphical model with latent variables S_1, S_2, ..., S_q connected to observed variables x_1, x_2, ..., x_{d-1}, x_d.]
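  As a concrete illustration of this factorization, the short sketch below (Python/NumPy; the binary latent, the prior probabilities and the two Gaussian conditionals are invented for illustration, not taken from the slides) builds a marginal p(x) out of a simple p(s) and a simple p(x | s), and samples from the model by ancestral sampling: draw s first, then x given s.

    import numpy as np

    rng = np.random.default_rng(0)

    p_s = np.array([0.3, 0.7])        # simple prior p(s) over a binary latent variable
    mu = np.array([-2.0, 3.0])        # p(x | s) = N(mu[s], 1): two simple conditionals

    def gauss_pdf(x, m, sd=1.0):
        return np.exp(-0.5 * ((x - m) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

    def sample(n):
        # ancestral sampling: draw s ~ p(s) first, then x ~ p(x | s)
        s = rng.choice(2, size=n, p=p_s)
        x = rng.normal(loc=mu[s], scale=1.0)
        return x, s

    def p_x(x):
        # marginal p(x) = sum_s p(x | s) p(s): a mixture, more complex than either factor
        return sum(p_s[s] * gauss_pdf(x, mu[s]) for s in range(2))

    x, s = sample(1000)
    print(p_x(0.0))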

  3. Cooperative Vector Quantizer (CVQ)
  • Latent variables (s): binary variables, dimensionality k
  • Observed variables (x): real-valued variables, dimensionality d
  [Figure: bipartite graphical model with sources S_1, S_2, ..., S_k connected to observations x_1, x_2, ..., x_{d-1}, x_d.]

  CVQ – Model Description
  • Model: x = Σ_{i=1}^k s_i w_i + ε, where w_i is the weight (output) contributed by source s_i and ε is Gaussian noise.
  • S: k binary variables. Latent variables s_i follow a Bernoulli distribution with parameter π_i:
    P(s_i | π_i) = π_i^{s_i} (1 − π_i)^{1 − s_i}
  • X: d real-valued variables. Observable variables x follow a Normal distribution with parameters W, Σ:
    P(x | s) = N(Ws, Σ), where W is the d × k matrix of weights w_11, ..., w_dk; we assume Σ = σ²I.
  • Joint for one instance of s and x:
    P(x, s | Θ) = (2πσ²)^{−d/2} exp{ −(1 / 2σ²) (x − Ws)^T (x − Ws) } ∏_{i=1}^k π_i^{s_i} (1 − π_i)^{1 − s_i}
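  A minimal sketch of the CVQ generative process and of the joint density above, assuming the Σ = σ²I parameterization written here; the sizes k, d and the parameter values are illustrative, not from the slides.

    import numpy as np

    rng = np.random.default_rng(0)

    # illustrative sizes and parameters, not taken from the slides
    k, d = 4, 8
    pi = np.full(k, 0.3)              # Bernoulli priors pi_i = P(s_i = 1)
    W = rng.normal(size=(d, k))       # d x k weight matrix
    sigma2 = 0.25                     # noise variance, Sigma = sigma2 * I

    def sample_cvq(n):
        # ancestral sampling: s ~ Bernoulli(pi), x ~ N(W s, sigma2 * I)
        S = (rng.random((n, k)) < pi).astype(float)
        X = S @ W.T + np.sqrt(sigma2) * rng.normal(size=(n, d))
        return X, S

    def log_joint(x, s):
        # log P(x, s | Theta) for one instance, matching the joint written above
        resid = x - W @ s
        log_px_given_s = -0.5 * d * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)
        log_ps = np.sum(s * np.log(pi) + (1 - s) * np.log(1 - pi))
        return log_px_given_s + log_ps

    X, S = sample_cvq(5)
    print(log_joint(X[0], S[0]))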

  4. CVQ – Model Description
  • Objective: learn the parameters of the model: W, π, σ.
  • If both x and s are observable:
    • Use the complete-data log-likelihood:
      Σ_{n=1}^N log P(x^n, s^n | Θ) = Σ_{n=1}^N [ −d log σ − (1 / 2σ²)(x^n − Ws^n)^T (x^n − Ws^n) + Σ_{i=1}^k s_i^{(n)} log π_i + (1 − s_i^{(n)}) log(1 − π_i) ] + c
    • The solution is nice and easy.
  • If only x is observable:
    • Log-likelihood of the data:
      l(D | Θ) = Σ_{n=1}^N log P(x^n | Θ) = Σ_{n=1}^N log Σ_{s^n} P(x^n, s^n | Θ)
    • The solution is hard; we can no longer benefit from the decomposition.
    • Use Expectation Maximization (EM).
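  To make the "nice and easy" fully observed case concrete, the sketch below writes out the standard closed-form maximum-likelihood estimates for this model (empirical frequencies for π, least squares for W, average squared residual for σ²). These estimators are standard results assumed here, not equations from the slides, and the small ridge term eps is only a numerical safeguard.

    import numpy as np

    def complete_data_mle(X, S, eps=1e-9):
        # Closed-form ML estimates when both x (rows of X) and s (rows of S) are observed.
        N, d = X.shape
        pi_hat = S.mean(axis=0)                                   # pi_i = fraction of s_i = 1
        # W solves min_W sum_n ||x^n - W s^n||^2 (eps adds a tiny ridge for stability)
        W_hat = X.T @ S @ np.linalg.inv(S.T @ S + eps * np.eye(S.shape[1]))
        resid = X - S @ W_hat.T
        sigma2_hat = (resid ** 2).sum() / (N * d)                 # shared noise variance
        return W_hat, pi_hat, sigma2_hat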

  5. Expectation Maximization (EM)
  • Let H be the set of all variables with hidden or missing values.
  • P(H, D | Θ) = P(H | D, Θ) P(D | Θ)
  • log P(H, D | Θ) = log P(H | D, Θ) + log P(D | Θ)
  • log P(D | Θ) = log P(H, D | Θ) − log P(H | D, Θ)
  • Average both sides with P(H | D, Θ') for some Θ':
    E_{H|D,Θ'}[ log P(D | Θ) ] = E_{H|D,Θ'}[ log P(H, D | Θ) ] − E_{H|D,Θ'}[ log P(H | D, Θ) ]
  • The left-hand side does not depend on H, so this decomposes the log-likelihood of the data:
    log P(D | Θ) = Q(Θ | Θ') + H(Θ | Θ'),
    where Q(Θ | Θ') = E_{H|D,Θ'}[ log P(H, D | Θ) ] and H(Θ | Θ') = −E_{H|D,Θ'}[ log P(H | D, Θ) ].
  • EM uses the true posterior P(H | D, Θ').

  Expectation Maximization (EM)
  • General EM Algorithm:
    • Initialize parameters Θ
    • Set Θ' = Θ
    • Expectation step:
      Q(Θ | Θ') = ⟨ log P(H, D | Θ) ⟩_{P(H | D, Θ')}
    • Maximization step:
      Θ = argmax_Θ Q(Θ | Θ')
    • Repeat until there is no or only a small improvement in Θ (Θ = Θ')
  • Problem:
    • P(H | D, Θ') = ∏_{n=1}^N P(s^n | x^n, Θ')
    • Each data point requires us to calculate 2^k probabilities.
    • If k is large, this is a bottleneck.
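  The 2^k bottleneck is easiest to see in code: the sketch below (illustrative, not the slides' implementation) computes the exact posterior P(s | x, Θ') for one data point by enumerating every binary source configuration. The Gaussian normalization constant is omitted because it cancels when the posterior is normalized.

    import numpy as np
    from itertools import product

    def exact_posterior(x, W, pi, sigma2):
        # E-step for a single data point by brute-force enumeration of all 2^k
        # source configurations; this is exactly the bottleneck noted above.
        k = len(pi)
        configs = np.array(list(product([0.0, 1.0], repeat=k)))   # all 2^k binary vectors
        resid = x - configs @ W.T                                  # shape (2^k, d)
        log_joint = (-0.5 * (resid ** 2).sum(axis=1) / sigma2
                     + configs @ np.log(pi) + (1 - configs) @ np.log(1 - pi))
        log_joint -= log_joint.max()                               # constants cancel after normalizing
        post = np.exp(log_joint)
        post /= post.sum()                                         # P(s | x, Theta')
        return configs, post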

  6. Variational Approximation
  • An alternative to approximate inference methods that are based on stochastic sampling.
  • Let H be the set of all variables with hidden or missing values.
  • log P(D | Θ) = log P(H, D | Θ) − log P(H | D, Θ)
  • Average both sides using a distribution Q(H | λ) [a surrogate posterior]:
    E_{H|λ}[ log P(D | Θ) ] = E_{H|λ}[ log P(H, D | Θ) ] − E_{H|λ}[ log Q(H | λ) ] + E_{H|λ}[ log Q(H | λ) ] − E_{H|λ}[ log P(H | D, Θ) ]
  • This gives the decomposition
    log P(D | Θ) = F(Q, Θ) + KL(Q, P)
    F(Q, Θ) = Σ_{H} Q(H | λ) log P(H, D | Θ) − Σ_{H} Q(H | λ) log Q(H | λ)
    KL(Q, P) = Σ_{H} Q(H | λ) [ log Q(H | λ) − log P(H | D, Θ) ]
  • Approximation: maximize F(Q, Θ)
  • Parameters: Θ, λ
  • Maximization of F pushes up the lower bound on the log-likelihood: log P(D | Θ) ≥ F(Q, Θ).
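  A small numerical check of this decomposition, assuming the CVQ model and a fully factorized Bernoulli surrogate Q(s | λ) (the function name and the brute-force enumeration are illustrative choices, not from the slides): it computes F(Q, Θ) and the exact log P(x | Θ) for one data point, and recovers KL(Q, P) as their difference, which is always nonnegative.

    import numpy as np
    from itertools import product

    def free_energy_and_kl(x, W, pi, sigma2, lam):
        # Verify log P(x | Theta) = F(Q, Theta) + KL(Q, P) by enumerating the 2^k
        # configurations of s; Q(s | lam) is a factorized Bernoulli distribution.
        k, d = len(pi), len(x)
        configs = np.array(list(product([0.0, 1.0], repeat=k)))
        resid = x - configs @ W.T
        log_joint = (-0.5 * d * np.log(2 * np.pi * sigma2)
                     - 0.5 * (resid ** 2).sum(axis=1) / sigma2
                     + configs @ np.log(pi) + (1 - configs) @ np.log(1 - pi))
        log_q = configs @ np.log(lam) + (1 - configs) @ np.log(1 - lam)
        q = np.exp(log_q)                            # Q(s | lam) over all configurations
        F = np.sum(q * (log_joint - log_q))          # free energy (the lower bound)
        log_px = np.logaddexp.reduce(log_joint)      # exact log P(x | Theta)
        kl = log_px - F                              # = KL(Q, P) >= 0
        return F, kl, log_px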

  7. Kullback-Leibler (KL) divergence
  • A way to measure the difference between two probability distributions over the same variable x.
  • KL(P || Q), where the "||" operator indicates "divergence", here P's divergence from Q.
  • Entropy: the average amount of information of a probability distribution:
    H(P) = E_P[ I_P(X) ] = − Σ_{i=1}^n P(i) log P(i)
  • KL(P || Q) = H(P, Q) − H(P) = − Σ_{i=1}^n P(i) log Q(i) + Σ_{i=1}^n P(i) log P(i) = Σ_{i=1}^n P(i) log( P(i) / Q(i) )
  • If we have some target distribution P, we try to find an approximation Q that gets as close to it as possible by minimizing the KL divergence.

  Variational EM
  • To use variational EM, we hope that if we choose Q(H | λ) well, the optimization of both λ and Θ will become easy.
  • A well-behaved choice for Q(H | λ) is the mean field approximation.
  • Let H be the set of all variables with hidden or missing values:
    • E-step: compute the expectation over the hidden variables.
      Optimize F(Q, Θ) with respect to λ while keeping Θ fixed.
    • M-step: maximize the expected log-likelihood.
      Optimize F(Q, Θ) with respect to Θ while keeping λ fixed.
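  A direct translation of the discrete KL formula into code, with a made-up pair of distributions to show that the divergence is nonnegative and not symmetric:

    import numpy as np

    def kl_divergence(p, q):
        # KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)) for discrete distributions;
        # assumes Q(i) > 0 wherever P(i) > 0.
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    p = np.array([0.7, 0.2, 0.1])
    q = np.array([0.4, 0.4, 0.2])
    print(kl_divergence(p, q), kl_divergence(q, p))   # nonnegative, and not symmetric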

  8. Mean Field Approximation
  • To find the distribution Q, we use the mean field approximation.
  • Assumption:
    • Q(H | λ) is the mean field approximation.
    • The variables H_i in the Q(H) distribution are independent.
    • Q is completely factorized: Q(H | λ) = ∏_i Q_i(H_i | λ_i)
  • For our CVQ model, the hidden variables are the binary sources:
    Q(H | λ) = ∏_{n=1..N} Q(s^n | λ^n)
    Q(s^n | λ^n) = ∏_{i=1..k} Q(s_i^{(n)} | λ_i^{(n)})
    Q(s_i^{(n)}) = (λ_i^{(n)})^{s_i^{(n)}} (1 − λ_i^{(n)})^{1 − s_i^{(n)}}

  Mean Field Approximation
  • Functional F for the mean field:
    F(Q, Θ) = Σ_{H} Q(H | λ) log P(H, D | Θ) − Σ_{H} Q(H | λ) log Q(H | λ)
  • In general, F(Q, Θ) = Σ_{n=1}^N Σ_{s^n} Q(s^n | λ^n) [ log P(x^n, s^n | Θ) − log Q(s^n | λ^n) ]. Assuming just one data point x and the corresponding s:
    F(Q, Θ) = ⟨ −d log σ − (1 / 2σ²)(x − Ws)^T (x − Ws) ⟩_{Q(s|λ)}        (1)
      + ⟨ Σ_{i=1}^k s_i log π_i + (1 − s_i) log(1 − π_i) ⟩_{Q(s|λ)}       (2)
      − ⟨ Σ_{i=1}^k s_i log λ_i + (1 − s_i) log(1 − λ_i) ⟩_{Q(s|λ)}       (3)
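  Because Q is fully factorized, the expectations in terms (1)-(3) have closed forms, using E[s] = λ and E[s s^T] = λ λ^T + diag(λ(1 − λ)). The sketch below evaluates F(Q, Θ) for one data point this way (the additive constant −(d/2) log 2π is dropped, as in term (1)); the function and variable names are illustrative, not from the slides.

    import numpy as np

    def mean_field_free_energy(x, W, pi, sigma2, lam):
        # Closed-form evaluation of terms (1)-(3) for one data point under the
        # factorized Bernoulli Q(s | lam).
        d = len(x)
        Ess = np.outer(lam, lam) + np.diag(lam * (1 - lam))            # E[s s^T]
        quad = x @ x - 2 * x @ W @ lam + np.trace(W.T @ W @ Ess)       # E[(x - Ws)^T (x - Ws)]
        term1 = -0.5 * d * np.log(sigma2) - quad / (2 * sigma2)        # term (1)
        term2 = np.sum(lam * np.log(pi) + (1 - lam) * np.log(1 - pi))  # term (2)
        term3 = -np.sum(lam * np.log(lam) + (1 - lam) * np.log(1 - lam))  # term (3): -E[log Q]
        return term1 + term2 + term3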
