
CS 3750 Advanced Machine Learning

Latent Variable Generative Models II

Ahmad Diab AHD23@cs.pitt.edu Feb 4, 2020

Based on slides of Professor Milos Hauskrecht

Outline

  • Latent Variable Generative Models
  • Cooperative Vector Quantizer (CVQ) Model
    • Model Formulation
    • Expectation Maximization (EM)
    • Variational Approximation
  • Noisy-OR Component Analyzer (NOCA)
    • Model Formulation
    • Variational EM for NOCA
  • References

Latent Variable Generative Models

  • Generative models: unsupervised learning models that capture the underlying structure of data (e.g., interesting patterns and causal relationships) in order to generate new data like it.

  • Latent (hidden) variables are random variables that are hard to observe directly (e.g., length can be measured, but intelligence cannot) and are assumed to affect the response variables.

  • The idea: introduce an unobserved latent variable s and use it to express a complex distribution p(x) in terms of tractable, less complex pieces:

    p(x, s) = p(x | s) p(s)

Here p(x) is the complex distribution, while p(x | s) and p(s) are simpler distributions.
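A minimal numerical sketch of this idea (an illustrative example, not from the slides; assumes NumPy): both pieces are simple, s ~ Bernoulli and x | s Gaussian, yet ancestral sampling of s and then x draws from a complex (here bimodal) marginal p(x).

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_x(n):
        # p(s): Bernoulli(0.3); p(x | s): unit-variance Gaussian whose mean
        # depends on s. Sampling s -> x draws from the marginal p(x).
        s = rng.random(n) < 0.3
        x = rng.normal(np.where(s, 4.0, -1.0), 1.0)
        return x

    print(sample_x(5))   # samples from the bimodal marginal p(x)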

Latent Variable Generative Models

  • Assumption: observable variables are independent given the latent variables.

[Diagram: latent variables S1, S2, …, Sq, each with edges to the observed variables x1, x2, …, xd-1, xd.]


Cooperative Vector Quantizer (CVQ)

  • Latent variables (s): binary variables, dimensionality k
  • Observed variables (x): real-valued variables, dimensionality d

[Diagram: binary sources S1, S2, …, Sk connected to the observed variables x1, x2, …, xd-1, xd.]

CVQ – Model Description

  • Model:

    x = Σ_{j=1..k} s_j w_j + ε

  • Latent variables s_j:
    • ~ Bernoulli distribution with parameter π_j:
      P(s_j | π_j) = π_j^{s_j} (1 − π_j)^{1−s_j}
    • w_j is the weight vector output by source s_j
  • Observable variables x:
    • ~ Normal distribution with parameters W, Σ:
      P(x | s) = N(Ws, Σ), and we assume Σ = σ²I
  • Joint for one instance of x and s:

    P(x, s | Θ) = (2πσ²)^{−d/2} exp{ −(1/(2σ²)) (x − Ws)ᵀ(x − Ws) } ∏_{j=1..k} π_j^{s_j} (1 − π_j)^{1−s_j}

x: d real-valued variables; s: k binary variables; W is the d×k weight matrix

    W = ( w_11  w_12  ..  w_1k )
        ( w_21  w_22  ..  w_2k )
        ( ..    ..    ..  ..   )
        ( w_d1  w_d2  ..  w_dk )
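The generative process defined above is easy to simulate. A minimal sketch (dimensions and parameter values are made up for illustration; assumes NumPy):

    import numpy as np

    rng = np.random.default_rng(1)
    d, k = 4, 3                        # observed / latent dimensionality
    W = rng.normal(size=(d, k))        # weight matrix, columns w_j
    pi = np.full(k, 0.5)               # Bernoulli parameters pi_j
    sigma = 0.1                        # noise level, Sigma = sigma^2 I

    # s_j ~ Bernoulli(pi_j), then x = W s + noise, i.e. x | s ~ N(Ws, sigma^2 I)
    s = (rng.random(k) < pi).astype(float)
    x = W @ s + rng.normal(scale=sigma, size=d)
    print(s, x)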


CVQ – Model Description

  • Objective: learn the parameters of the model: W, π, σ
  • If both x and s are observable, use the complete-data log-likelihood:

    Σ_{n=1..N} log P(x^(n), s^(n) | Θ)
      = Σ_{n=1..N} [ −d log σ − (1/(2σ²)) (x^(n) − W s^(n))ᵀ (x^(n) − W s^(n))
        + Σ_{j=1..k} ( s_j^(n) log π_j + (1 − s_j^(n)) log(1 − π_j) ) ] + c

  • The solution is nice and easy: the objective decomposes over the parameters.

CVQ – Model Description

  • Objective: learn the parameters of the model: W, π, σ
  • If only x is observable, the log-likelihood of the data is:

    log P(D | Θ) = Σ_{n=1..N} log P(x^(n) | Θ) = Σ_{n=1..N} log Σ_{s^(n)} P(x^(n), s^(n) | Θ)

  • The solution is hard: because of the log of a sum, we can no longer benefit from the decomposition.
  • Use Expectation Maximization (EM).


Expectation Maximization (EM)

  • Let H be the set of all variables with hidden or missing values, and D the observed data.
  • P(H, D | Θ, ξ) = P(H | D, Θ, ξ) P(D | Θ, ξ)
  • log P(H, D | Θ, ξ) = log P(H | D, Θ, ξ) + log P(D | Θ, ξ)
  • log P(D | Θ, ξ) = log P(H, D | Θ, ξ) − log P(H | D, Θ, ξ)
  • Average both sides with the posterior P(H | D, Θ′, ξ) for some fixed Θ′:

    E_{H|D,Θ′}[ log P(D | Θ, ξ) ] = E_{H|D,Θ′}[ log P(H, D | Θ, ξ) ] − E_{H|D,Θ′}[ log P(H | D, Θ, ξ) ]

  • The left-hand side is the log-likelihood of the data, so

    log P(D | Θ, ξ) = F(Θ | Θ′) = E(Θ | Θ′) + H(Θ | Θ′)

    where E(Θ | Θ′) is the expected complete-data log-likelihood and H(Θ | Θ′) = −E_{H|D,Θ′}[ log P(H | D, Θ, ξ) ].
  • EM uses the true posterior P(H | D, Θ′, ξ).

Expectation Maximization (EM)

  • General EM algorithm:
    • Initialize parameters Θ
    • Set Θ′ = Θ
    • Expectation step:

      E(Θ | Θ′) = ⟨ log P(H, D | Θ, ξ) ⟩_{P(H|D,Θ′)}

    • Maximization step:

      Θ = argmax_Θ E(Θ | Θ′)

    • Repeat until no or only a small improvement in Θ (Θ = Θ′)
  • Problem (see the sketch below):

    P(H | D, Θ′) = ∏_{n=1..N} P(s^(n) | x^(n), Θ′)

    • Each data point requires us to calculate 2^k posterior probabilities.
    • If k is large, then this is a bottleneck.
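To make the bottleneck concrete, here is a sketch of the exact posterior for a single CVQ data point, enumerating all 2^k source configurations (assumes NumPy, with W, pi, sigma shaped as in the earlier sketch):

    import itertools
    import numpy as np

    def exact_posterior(x, W, pi, sigma):
        # Enumerate all 2^k binary source vectors s: the exponential cost
        # of the exact E-step for the CVQ model.
        k = W.shape[1]
        S = np.array(list(itertools.product([0, 1], repeat=k)), dtype=float)
        resid = x - S @ W.T                              # (2^k, d) residuals
        # log P(x, s | Theta) up to an additive constant
        log_joint = (-0.5 / sigma**2) * (resid ** 2).sum(axis=1) \
                    + S @ np.log(pi) + (1 - S) @ np.log(1 - pi)
        w = np.exp(log_joint - log_joint.max())          # normalize stably
        return S, w / w.sum()                            # posterior P(s | x)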


Variational Approximation

  • A deterministic alternative to approximate inference methods based on stochastic sampling.
  • Let H be the set of all variables with hidden or missing values.
  • log P(D | Θ, ξ) = log P(H, D | Θ, ξ) − log P(H | D, Θ, ξ)
  • Average both sides using a distribution Q(H | λ) [a surrogate posterior]:

    E_{H|λ}[ log P(D | Θ, ξ) ] = E_{H|λ}[ log P(H, D | Θ, ξ) ] − E_{H|λ}[ log Q(H | λ) ]
                                 + E_{H|λ}[ log Q(H | λ) ] − E_{H|λ}[ log P(H | D, Θ, ξ) ]

    log P(D | Θ, ξ) = F(Q, Θ) + KL(Q ‖ P)

    F(Q, Θ) = Σ_{H} Q(H | λ) log P(H, D | Θ, ξ) − Σ_{H} Q(H | λ) log Q(H | λ)

    KL(Q ‖ P) = Σ_{H} Q(H | λ) [ log Q(H | λ) − log P(H | D, Θ) ]

Variational Approximation

    log P(D | Θ, ξ) = F(Q, Θ) + KL(Q ‖ P)

  • Approximation: maximize F(Q, Θ)
  • Parameters: Θ, λ
  • Since KL(Q ‖ P) ≥ 0, maximization of F pushes up a lower bound on the log-likelihood:

    log P(D | Θ, ξ) ≥ F(Q, Θ)


Kullback-Leibler (KL) divergence

  • A measure of the difference between two probability distributions P and Q over the same variable x:

    KL(P ‖ Q)

    where the "‖" operator indicates divergence, i.e., P's divergence from Q.
  • Entropy: the average amount of information of a probability distribution:

    H(P) = E_P[ I_P(X) ] = −Σ_{i=1..n} P(i) log P(i)

  • KL divergence as a difference of entropies:

    KL(P ‖ Q) = H(P, Q) − H(P) = −Σ_{i=1..n} P(i) log Q(i) + Σ_{i=1..n} P(i) log P(i) = Σ_{i=1..n} P(i) log( P(i) / Q(i) )

  • If we have some theoretical target distribution P, we try to find an approximation Q that gets as close as possible to it by minimizing the KL divergence.
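For discrete distributions, the definition translates directly into code (a small sketch, assuming NumPy):

    import numpy as np

    def kl(p, q):
        # KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)); terms with P(i) = 0
        # contribute nothing. Note KL is not symmetric: kl(p, q) != kl(q, p).
        p, q = np.asarray(p, float), np.asarray(q, float)
        m = p > 0
        return float(np.sum(p[m] * np.log(p[m] / q[m])))

    print(kl([0.5, 0.5], [0.9, 0.1]))   # > 0
    print(kl([0.5, 0.5], [0.5, 0.5]))   # 0: identical distributions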


Variational EM

  • To use variational EM, we hope that if we choose Q(H | λ) well, the optimization of both λ and Θ becomes easy.
  • A well-behaved choice for Q(H | λ) is the mean-field approximation.
  • Let H be the set of all variables with hidden or missing values:
    • E-step: compute the expectation over the hidden variables.
      Optimize F(Q, Θ) with respect to λ while keeping Θ fixed.
    • M-step: maximize the expected log-likelihood.
      Optimize F(Q, Θ) with respect to Θ while keeping λ fixed.


Mean Field Approximation

  • To construct the distribution Q, we use the mean-field approximation.
  • Assumptions:
    • Q(H | λ) is the mean-field approximation
    • the variables H_j in the Q(H) distribution are independent
    • Q is completely factorized:

      Q(H | λ) = ∏_j Q_j(H_j | λ_j)

  • For our CVQ model, the hidden variables are the binary sources:

    Q(H | λ) = ∏_{n=1..N} Q(s^(n) | λ^(n))

    Q(s^(n) | λ^(n)) = ∏_{j=1..k} Q(s_j^(n) | λ_j^(n))

    Q(s_j^(n) | λ_j^(n)) = (λ_j^(n))^{s_j^(n)} (1 − λ_j^(n))^{1 − s_j^(n)}
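Because Q factorizes completely, evaluating it is just a product of Bernoulli terms. A one-function sketch (assumes NumPy; s and lam are k-vectors):

    import numpy as np

    def q_s(s, lam):
        # Fully factorized Q(s | lambda) = prod_j lam_j^{s_j} (1 - lam_j)^{1 - s_j}
        return float(np.prod(lam ** s * (1.0 - lam) ** (1.0 - s)))

    print(q_s(np.array([1., 0., 1.]), np.array([0.9, 0.2, 0.5])))  # 0.9*0.8*0.5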

Mean Field Approximation

  • Functional F for the mean field:

    F(Q, Θ) = Σ_{H} Q(H | λ) log P(H, D | Θ, ξ) − Σ_{H} Q(H | λ) log Q(H | λ)

  • Assume just one data point x and the corresponding s:

    F(Q, Θ) = ⟨ log P(x, s | Θ) ⟩_{Q(s|λ)} − ⟨ log Q(s | λ) ⟩_{Q(s|λ)}

      = ⟨ −d log σ − (1/(2σ²)) (x − Ws)ᵀ(x − Ws) ⟩_{Q(s|λ)}                  (1)
      + ⟨ Σ_{j=1..k} s_j log π_j + (1 − s_j) log(1 − π_j) ⟩_{Q(s|λ)}         (2)
      − ⟨ Σ_{j=1..k} s_j log λ_j + (1 − s_j) log(1 − λ_j) ⟩_{Q(s|λ)}         (3)


Mean Field Approximation

  • Functional F, part (1):

    ⟨ −d log σ − (1/(2σ²)) (x − Ws)ᵀ(x − Ws) ⟩_{Q(s|λ)}
      = ⟨ −d log σ − (1/(2σ²)) (x − Σ_{j=1..k} s_j w_j)ᵀ (x − Σ_{j=1..k} s_j w_j) ⟩_{Q(s|λ)}
      = −d log σ − (1/(2σ²)) [ xᵀx − 2 Σ_{j=1..k} ⟨s_j⟩_{Q(s_j|λ_j)} w_jᵀx
        + Σ_{j=1..k} Σ_{m=1..k} ⟨s_j s_m⟩_{Q(s|λ)} w_jᵀw_m ]

    using the mean-field moments

    ⟨s_j⟩_{Q(s_j|λ_j)} = λ_j
    ⟨s_j s_m⟩_{Q(s|λ)} = λ_j λ_m + δ_jm (λ_j − λ_j²)

Mean Field Approximation

  • Functional F, part (2):

    ⟨ Σ_{j=1..k} s_j log π_j + (1 − s_j) log(1 − π_j) ⟩_{Q(s|λ)}
      = Σ_{j=1..k} ⟨s_j⟩_{Q(s_j|λ_j)} log π_j + (1 − ⟨s_j⟩_{Q(s_j|λ_j)}) log(1 − π_j)
      = Σ_{j=1..k} λ_j log π_j + (1 − λ_j) log(1 − π_j)

  • Functional F, part (3):

    ⟨ Σ_{j=1..k} s_j log λ_j + (1 − s_j) log(1 − λ_j) ⟩_{Q(s|λ)}
      = Σ_{j=1..k} λ_j log λ_j + (1 − λ_j) log(1 − λ_j)


Mean Field Approximation

Functional F (for one data point):

    F = −d log σ − (1/(2σ²)) [ xᵀx − 2 Σ_{j=1..k} λ_j w_jᵀx + Σ_{j=1..k} Σ_{m=1..k} ⟨s_j s_m⟩_{Q(s|λ)} w_jᵀw_m ]
      + Σ_{j=1..k} [ λ_j log π_j + (1 − λ_j) log(1 − π_j) ]
      − Σ_{j=1..k} [ λ_j log λ_j + (1 − λ_j) log(1 − λ_j) ]

Parameters: W, π, σ. Mean-field parameters: λ.

Mean Field Approximation

Functional F (for all data points):

    F(Q, Θ) = Σ_{n=1..N} [ ⟨ log P(x^(n), s^(n) | Θ) ⟩_{Q(s^(n)|λ^(n))} − ⟨ log Q(s^(n) | λ^(n)) ⟩_{Q(s^(n)|λ^(n))} ]

      = Σ_{n=1..N} { −d log σ − (1/(2σ²)) [ x^(n)ᵀx^(n) − 2 Σ_{j=1..k} λ_j^(n) w_jᵀx^(n)
          + Σ_{j=1..k} Σ_{m=1..k} ( λ_j^(n) λ_m^(n) + δ_jm (λ_j^(n) − (λ_j^(n))²) ) w_jᵀw_m ]
      + Σ_{j=1..k} [ λ_j^(n) log π_j + (1 − λ_j^(n)) log(1 − π_j) ]
      − Σ_{j=1..k} [ λ_j^(n) log λ_j^(n) + (1 − λ_j^(n)) log(1 − λ_j^(n)) ] }

Parameters: W, π, σ. Mean-field parameters: λ = (λ^(1), λ^(2), …, λ^(N)).


Variational EM

  • E-step
    • Optimize F(Q, Θ) with respect to λ while keeping Θ fixed:

      ∂F/∂λ_v = (1/σ²) (x − Σ_{m≠v} λ_m w_m)ᵀ w_v − (1/(2σ²)) w_vᵀw_v + log( π_v / (1 − π_v) ) − log( λ_v / (1 − λ_v) )

    • Set ∂F/∂λ_v = 0:

      λ_v = g( (1/σ²) (x − Σ_{m≠v} λ_m w_m)ᵀ w_v − (1/(2σ²)) w_vᵀw_v + log( π_v / (1 − π_v) ) ),   g(z) = 1 / (1 + e^{−z})

    • This yields a set of fixed-point equations, iterated over v until convergence.
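The fixed-point equations are straightforward to iterate. A sketch of the E-step for one data point (assumes NumPy; the sweep order and iteration count are arbitrary illustrative choices, not from the slides):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def e_step(x, W, pi, sigma, n_sweeps=50):
        k = W.shape[1]
        lam = np.full(k, 0.5)                     # initialize lambda_v
        for _ in range(n_sweeps):
            for v in range(k):
                # x minus the predictions of all sources except v
                resid = x - W @ lam + lam[v] * W[:, v]
                z = (resid @ W[:, v]) / sigma**2 \
                    - 0.5 * (W[:, v] @ W[:, v]) / sigma**2 \
                    + np.log(pi[v] / (1.0 - pi[v]))
                lam[v] = sigmoid(z)               # lambda_v = g(...)
        return lam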

Variational EM

  • M-step
    • Optimize F(Q, Θ) with respect to Θ while keeping λ fixed.
  • Start with π. For N data points:

    ∂F/∂π_v = Σ_{n=1..N} [ λ_v^(n) / π_v − (1 − λ_v^(n)) / (1 − π_v) ]

  • Set ∂F/∂π_v = 0:

    π_v = ( Σ_{n=1..N} λ_v^(n) ) / N   (closed-form solution)


Variational EM

  • And for the parameter W:

    ∂F/∂w_vj = Σ_{n=1..N} −(1/(2σ²)) [ −2 λ_j^(n) x_v^(n) + 2 Σ_{m≠j} λ_j^(n) λ_m^(n) w_vm + 2 λ_j^(n) w_vj ] = 0

  • For each dimension v of x, these equations define a set of k linear equations (in w_v1, …, w_vk) that can be solved, where

    W = ( w_11  ..  w_1k )
        ( ..    ..  ..   )
        ( w_d1  ..  w_dk )  = ( w_1  w_2  …  w_k ),  with columns w_j
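Both M-step updates can be written as a few matrix operations. A sketch (assumes NumPy; X is the N×d data matrix, Lam the N×k matrix of mean-field parameters; the matrix form of the linear system is a rearrangement of the per-dimension equations above, using the moments ⟨s_j s_m⟩ = λ_j λ_m off the diagonal and λ_j on the diagonal):

    import numpy as np

    def m_step(X, Lam):
        # pi_v = (1/N) sum_n lambda_v^(n): the closed-form update
        pi = Lam.mean(axis=0)
        # A = sum_n <s s^T>: lambda lambda^T off-diagonal, lambda on the diagonal
        A = Lam.T @ Lam + np.diag((Lam - Lam ** 2).sum(axis=0))
        # Solve W A = sum_n x^(n) lambda^(n)^T  (k linear equations per dimension)
        W = np.linalg.solve(A, Lam.T @ X).T
        return W, pi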

Image Separation Experiment

[Figure: source images associated with the latent variables.]


Mixed images

  • Images generated by the model; some of the images are noise.
  • Given enough samples, the model can retrieve the original sources.

[Figure: mixed images and the recovered sources.]


Modeling High-Dimensional Data

  • Definition: the number of dimensions is so high that calculations become extremely difficult (e.g., the number of features exceeds the number of observations).
  • Examples of domains with high-dimensional data:
    • Sensor networks
    • Document repositories
  • Typically, the variables are dependent. How do we model the dependencies?
    • Full model (intractable, overfits)
    • All-independent model (unrealistic)
    • Middle-of-the-road approaches: capture the dependencies in an efficient way (representation, reasoning, learning)

Noisy-OR Component Analyzer

  • Objective: capture dependencies among observables via latent factors and their combinations.
  • The dependencies between the observables are represented using a smaller number of hidden binary factors.
  • The NOCA model has binary nodes:
    • k parameters for each observed node: p_1j, …, p_kj
    • p_ij is interpreted as the "strength of influence" of S_i on the observable variable x_j

    P(x_j = 0 | s) = ∏_{i=1..k} (1 − p_ij)^{s_i}

    P(x_j = 1 | s) = 1 − P(x_j = 0 | s) = 1 − ∏_{i=1..k} (1 − p_ij)^{s_i}

[Diagram: binary sources s_1, s_2, …, s_k with priors π_1, …, π_k, connected through the loadings p to the observed variables x_1, x_2, …, x_{d−1}, x_d.]
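The NOCA conditional is a noisy-OR per observed variable. A sketch of evaluating it for a full source vector (assumes NumPy; P is a hypothetical k×d matrix of loadings p_ij):

    import numpy as np

    def p_x1_given_s(P, s):
        # P(x_j = 0 | s) = prod_i (1 - p_ij)^{s_i}; return P(x_j = 1 | s) per j
        p_zero = np.prod((1.0 - P) ** s[:, None], axis=0)
        return 1.0 - p_zero

    P = np.array([[0.9, 0.0], [0.5, 0.7]])         # k = 2 sources, d = 2 observables
    print(p_x1_given_s(P, np.array([1.0, 1.0])))   # [0.95, 0.7]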


Noisy-OR Component Analyzer

Assumptions:

  • All possible causes U_i for an event X are modeled using nodes (random variables) and their values, with T (or 1) reflecting the presence of the cause and F (or 0) its absence.
  • If one needs to represent unknown causes, one can add a leak node.
  • Parameters: for each cause U_i, define an (independent) probability q_i that represents the probability with which the cause does not lead to X = T (or 1); in other words, the probability that the positive value of X is inhibited when U_i is present.
  • Note: the negated causes ¬U_i (reflecting the absence of the cause) do not have any influence on X. Why?
  • Noisy-OR is a generalization of the logical OR:

    P(x = 1 | U_1, …, U_j, ¬U_{j+1}, …, ¬U_k) = 1 − ∏_{i=1..j} q_i
    P(x = 0 | U_1, …, U_j, ¬U_{j+1}, …, ¬U_k) = ∏_{i=1..j} q_i

Noisy-OR Example

With inhibitor probabilities q_cold = 0.6, q_fl = 0.2, q_mal = 0.1 for the causes Cold, Flu, and Malaria of Fever:

    Cold  Flu  Malaria | P(Fever) | P(¬Fever)
    F     F    F       |  0       |  1
    F     F    T       |  0.9     |  0.1
    F     T    F       |  0.8     |  0.2
    F     T    T       |  0.98    |  0.02  = 0.2 × 0.1
    T     F    F       |  0.4     |  0.6
    T     F    T       |  0.94    |  0.06  = 0.6 × 0.1
    T     T    F       |  0.88    |  0.12  = 0.6 × 0.2
    T     T    T       |  0.988   |  0.012 = 0.6 × 0.2 × 0.1
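The whole table follows from one rule: P(¬Fever | causes) is the product of the inhibitor probabilities of the causes that are present. A sketch reproducing the entries (plain Python; the dictionary of q values mirrors the example):

    q = {"cold": 0.6, "flu": 0.2, "malaria": 0.1}

    def p_fever(present):
        # present: set of active causes, e.g. {"flu", "malaria"}
        p_not_fever = 1.0
        for cause in present:
            p_not_fever *= q[cause]
        return 1.0 - p_not_fever

    print(p_fever({"flu", "malaria"}))          # 0.98
    print(p_fever({"cold", "flu", "malaria"}))  # 0.988
    print(p_fever(set()))                       # 0.0 (no leak node here)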


Noisy-OR parameter reduction

[Figure: the Cold/Flu/Malaria → Fever network with q_cold = 0.6, q_fl = 0.2, q_mal = 0.1.]

  • Please note that, in general, the number of entries defining the CPT (conditional probability table) grows exponentially with the number of parents:
    • for k binary parents the number is 2^k
    • for the noisy-OR CPT the number of parameters is k + 1
  • Example: a full CPT requires 8 different combinations of values for 3 binary parents; noisy-OR needs only 4 parameters.

Noisy-OR Component Analyzer (NOCA)

  • Observed variables x: d dimensions, x ∈ {0,1}^d

    P(x) = Σ_{s} ∏_{j=1..d} P(x_j | s) ∏_{i=1..k} P(s_i)

  • Latent variables s: (k + 1) dimensions (including the leak node s_0), s ∈ {0,1}^k, with

    P(s_i | π_i) = π_i^{s_i} (1 − π_i)^{1−s_i}

  • Loading matrix: p = {p_ij}, i = 1..k, j = 1..d, with k < d

[Diagram: sources s_1, s_2, …, s_k connected to the observed variables x_1, x_2, …, x_{d−1}, x_d.]


Why EM won't work?

  • Take N i.i.d. samples (d-dimensional binary vectors).
  • We will need:
  • The joint distribution:

    P(x, s) = P(x | s) P(s) = P(s) ∏_{j=1..d} [ 1 − ∏_{i=1..k} (1 − p_ij)^{s_i} ]^{x_j} [ ∏_{i=1..k} (1 − p_ij)^{s_i} ]^{1−x_j}

  • The joint over the observables:

    P(x) = Σ_{s} P(x, s) = Σ_{s} [ ∏_{j=1..d} P(x_j | s) ] P(s)

  • Problem 1: the term [ 1 − ∏_i (1 − p_ij)^{s_i} ] is not a product over the s_i.
  • Problem 2: the summation runs over 2^k terms.

Variational EM for NOCA

  • Similar to what we did for CVQ, we simplify the distribution with a decomposable Q(s):

    log P(x | θ) = log ∏_{n=1..N} P(x^(n) | θ) = Σ_{n=1..N} log Σ_{s^(n)} P(x^(n), s^(n) | θ)
      = Σ_{n=1..N} log Σ_{s^(n)} Q(s^(n)) [ P(x^(n), s^(n) | θ) / Q(s^(n)) ]
      ≥ Σ_{n=1..N} [ E_{s^(n)}[ log P(x^(n), s^(n) | θ) ] − E_{s^(n)}[ log Q(s^(n)) ] ]

    (expectations taken with respect to Q(s^(n)))

  • log P(x^(n), s^(n) | θ) still cannot be handled easily:
  • noisy-OR is not in the exponential family.


Variational EM for NOCA

A further lower bound is required.

  • Jensen's inequality: for concave f and Σ_i q_i = 1,

    f( a + Σ_i q_i x_i ) ≥ Σ_i q_i f( a + x_i )

  • The noisy-OR conditional is

    P(x_j | s) = [ 1 − (1 − p_0j) ∏_{i=1..k} (1 − p_ij)^{s_i} ]^{x_j} [ (1 − p_0j) ∏_{i=1..k} (1 − p_ij)^{s_i} ]^{1−x_j}

  • Set θ_ij = −log(1 − p_ij):

    P(x_j | s) = exp{ x_j log( 1 − exp( −θ_0j − Σ_{i=1..k} θ_ij s_i ) ) + (1 − x_j) ( −θ_0j − Σ_{i=1..k} θ_ij s_i ) }

  • P(x_j | s) does not factorize for x_j = 1:

    P(x_j = 1 | s) = exp[ log( 1 − exp( −θ_0j − Σ_{i=1..k} θ_ij s_i ) ) ]
      = exp[ log( 1 − exp( −θ_0j − Σ_{i=1..k} q_j(i) (θ_ij s_i / q_j(i)) ) ) ]
      ≥ exp[ Σ_{i=1..k} q_j(i) log( 1 − exp( −θ_0j − θ_ij s_i / q_j(i) ) ) ]
      = ∏_{i=1..k} exp[ q_j(i) s_i ( log(1 − e^{−θ_0j − θ_ij/q_j(i)}) − log(1 − e^{−θ_0j}) ) + q_j(i) log(1 − e^{−θ_0j}) ]

    (the last step uses s_i ∈ {0,1})
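A quick numerical check of this bound (illustrative θ values and a uniform choice of q_j(i); assumes NumPy):

    import numpy as np

    rng = np.random.default_rng(2)
    k = 4
    theta0 = 0.05                        # leak term theta_0j
    theta = rng.uniform(0.5, 2.0, k)     # theta_ij = -log(1 - p_ij)
    s = rng.integers(0, 2, k).astype(float)
    qv = np.full(k, 1.0 / k)             # variational q_j(i), sums to 1

    exact = 1.0 - np.exp(-theta0 - theta @ s)        # P(x_j = 1 | s)
    bound = np.exp(np.sum(
        qv * s * (np.log(1.0 - np.exp(-theta0 - theta / qv))
                  - np.log(1.0 - np.exp(-theta0)))
        + qv * np.log(1.0 - np.exp(-theta0))))
    print(exact, bound, bound <= exact + 1e-12)      # bound never exceeds exact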

Variational EM for NOCA

A further lower bound is required:

    log P(x | θ) ≥ Σ_{n=1..N} [ E_{s^(n)}[ log P(x^(n), s^(n) | θ) ] − E_{s^(n)}[ log Q(s^(n)) ] ]
      ≥ Σ_{n=1..N} [ E_{s^(n)}[ log P̃(x^(n), s^(n) | θ, q^(n)) ] − E_{s^(n)}[ log Q(s^(n)) ] ]
      = Σ_{n=1..N} [ E_{s^(n)}[ log P̃(x^(n) | s^(n), θ, q^(n)) P(s^(n) | θ) ] − E_{s^(n)}[ log Q(s^(n)) ] ]
      = Σ_{n=1..N} F^(n)( x^(n), Q(s^(n)) )
      = F( x, Q(s) )

where P̃ is the Jensen lower bound from the previous slide.


Variational EM for NOCA

Parameters: q^(n) (variational), θ_ij, θ_0j

  • E-step: update q^(n) to optimize F^(n).
  • Setting ∂F^(n)/∂q_j^(n)(i) = 0 gives fixed-point updates that couple the posterior means ⟨s_i⟩_{Q(s^(n))}, the terms log(1 − e^{−θ_0j}), and the ratios θ_ij / q_j^(n)(i) (see Singliar and Hauskrecht, 2006, for the full update equations).

Structure Recovery Experiment

[Figure: (a) image patterns associated with the hidden sources; (b) example images generated by the NOCA model; (c) images recovered from the source input.]


References

  • X. Lu, M. Hauskrecht, and R.S. Day. Variational Bayesian learning of the cooperative vector quantizer (CVQ) model. Part I: The Theory. Technical Report, Computer Science Department, University of Pittsburgh, 2002.
  • T. Singliar and M. Hauskrecht. Noisy-OR Component Analysis and its Application to Link Analysis. Journal of Machine Learning Research, 2006.
  • https://people.cs.pitt.edu/~milos/courses/cs3750-Fall2007/lectures/class20.pdf
  • http://people.cs.pitt.edu/~milos/courses/cs3750/lectures/class8.pdf
  • http://www.blutner.de/Intension/Noisy%20OR.pdf
  • Tommi S. Jaakkola and Michael I. Jordan. "Variational probabilistic inference and the QMR-DT network." Journal of Artificial Intelligence Research 10 (1999): 291-322.
  • http://bjlkeng.github.io/posts/variational-bayes-and-the-mean-field-approximation/
  • https://machinelearningmastery.com/divergence-between-probability-distributions/
  • https://towardsdatascience.com/deep-latent-variable-models-unravel-hidden-structures-a5df0fd32ae2