Probabilistic Graphical Models
Learning with partial observations
Siamak Ravanbakhsh Fall 2019
Learning objectives: different types of missing data; learning with missing data and hidden variables.
Each instance in D is missing some values.

Why model hidden variables? A hidden mediating cause can compactly explain the dependence among its observed effects.
(figure: a hidden mediating cause with observed effects; image credit: Murphy's book)
Latent variable models are widely used in machine learning.
Notation: X = [X_1, …, X_D]; the observation pattern O_X (e.g., O_X = [1, 0, …, 0, 1]) indicates which variables are observed (1) and which are hidden (0). Partition X accordingly into observed and hidden parts: X = [X_o; X_h].
Missing completely at random (MCAR)

The observation pattern is independent of the values: P(X, O_X) = P(X) P(O_X).

Thumb-tack example: one throw generates the value, p(x) = θ^x (1 − θ)^(1−x); an independent throw decides show/hide, p(o) = ψ^o (1 − ψ)^(1−o).

Each x ∈ D may include values for a different subset of the variables, and the log-likelihood marginalizes out the hidden part:

ℓ(D, θ) = Σ_{x_o ∈ D} log Σ_{x_h} p(x_o, x_h)
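As a quick numerical sanity check (a minimal numpy sketch; the variable names and parameter values are made up), under MCAR the maximum-likelihood estimate of θ can ignore the missingness mechanism entirely and use only the observed entries:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, psi = 0.3, 0.6          # true P(X=1) and P(O=1) -- made-up values
N = 100_000

x = rng.random(N) < theta      # throw to generate the value
o = rng.random(N) < psi        # independent throw to decide show/hide (MCAR)

# Under MCAR the observation pattern carries no information about theta,
# so the MLE simply averages the observed entries and ignores the rest.
theta_hat = x[o].mean()
print(theta_hat)               # close to 0.3
```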
Missing at random (MAR)

The observation pattern may depend on the observed values but not on the hidden ones: O_X ⊥ X_h ∣ X_o.

Example: throw the thumb-tack twice, X = [X_1, X_2], and hide X_1 if X_2 = 1. This is missing at random but not missing completely at random: the pattern depends on the (always observed) X_2.

Under MCAR/MAR there is no "extra" information about θ in the observation pattern, so we can ignore it and maximize

ℓ(D, θ) = Σ_{x_o ∈ D} log Σ_{x_h} p(x_o, x_h)
With fully observed data (a single assignment to the latent variables), the log-likelihood decomposes:

ℓ(D, θ) = Σ_{(x,y,z) ∈ D} log p(x, y, z) = Σ_x log p(x) + Σ_{x,y} log p(y∣x) + Σ_{x,z} log p(z∣x)

When x is hidden, the sum over x moves inside the log:

ℓ(D, θ) = Σ_{(y,z) ∈ D} log Σ_x p(x) p(y∣x) p(z∣x)

and we cannot decompose it!
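A small numeric check of this point (a numpy sketch with made-up CPTs for a network x → y, x → z): with x observed the two forms of the log-likelihood agree term by term, while marginalizing x moves the sum inside the log:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up CPTs for binary x -> y, x -> z.
p_x = np.array([0.4, 0.6])
p_y_given_x = np.array([[0.7, 0.3], [0.2, 0.8]])   # rows: x, cols: y
p_z_given_x = np.array([[0.9, 0.1], [0.5, 0.5]])

# Fully observed data: sample (x, y, z) triples.
x = rng.choice(2, size=500, p=p_x)
y = np.array([rng.choice(2, p=p_y_given_x[xi]) for xi in x])
z = np.array([rng.choice(2, p=p_z_given_x[xi]) for xi in x])

# Joint log-likelihood vs. sum of per-CPT terms: identical when x is observed.
joint = np.sum(np.log(p_x[x] * p_y_given_x[x, y] * p_z_given_x[x, z]))
decomposed = (np.log(p_x[x]).sum()
              + np.log(p_y_given_x[x, y]).sum()
              + np.log(p_z_given_x[x, z]).sum())
assert np.isclose(joint, decomposed)

# With x hidden, the sum over x sits inside the log and no longer splits:
marginal = np.sum(np.log(np.sum(
    p_x[:, None] * p_y_given_x[:, y] * p_z_given_x[:, z], axis=0)))
```

As expected, `marginal` exceeds `joint` (marginalizing over x can only raise each instance's probability).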
Options for learning with hidden variables:
- directed models: directly estimate the gradient, or use EM
- undirected models: directly estimate the gradient (EM is not a good option here)
- variational interpretation

All of these options need inference for each step of learning.
Marginal likelihood (directed models)

Log marginal likelihood with hidden variables, e.g. for instances (a, d) ∈ D with B and C hidden:

ℓ(D) = Σ_{(a,d) ∈ D} log Σ_{b,c} p(a, b, c, d)

Take the derivative with respect to a CPT entry:

∂ℓ(D) / ∂p(d′∣c′) = Σ_{(a,d) ∈ D} (1 / p(d′∣c′)) p(d′, c′ ∣ a, d)

We need inference for p(d′, c′ ∣ a, d). What happens to this expression if every variable is observed? It reduces to counting occurrences of the assignment in the data.

For a Bayesian network with CPTs in general:

∂ℓ(D) / ∂p(x_i∣pa_{x_i}) = Σ_{x_o ∈ D} (1 / p(x_i∣pa_{x_i})) p(x_i, pa_{x_i} ∣ x_o)

where (x_i, pa_{x_i}) is some specific assignment. Notes:
- the gradient is always non-negative
- gradient ascent ignores the constraint Σ_x p(x∣pa_x) = 1, so reparametrize (e.g., using softmax)
- we must run inference for each observation

For other parametrizations (beyond simple CPTs) use the chain rule:

∂ℓ(D; θ) / ∂θ = (∂ℓ(D) / ∂p(d′∣c′)) · (∂p(d′∣c′) / ∂θ)
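The CPT-gradient formula above can be checked against a finite difference on a minimal two-node network (a numpy sketch; the network C → D with C hidden and all parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical network C -> D; C is hidden, only D is observed.
p_c = np.array([0.3, 0.7])
p_d_given_c = np.array([[0.8, 0.2], [0.4, 0.6]])   # rows: c, cols: d

p_d = p_c @ p_d_given_c                            # marginal over D
d_obs = rng.choice(2, size=200, p=p_d)

def loglik(cpt):
    # marginal log-likelihood: sum_m log sum_c p(c) p(d_m | c)
    return np.log((p_c[:, None] * cpt[:, d_obs]).sum(axis=0)).sum()

# Analytic gradient (the slides' formula, adapted to this two-node case):
# dL/dp(d'|c') = sum_m p(d', c' | d_m) / p(d'|c'), nonzero only when d_m = d',
# which simplifies to count(d_m = d') * p(c') / p(d').
grad = np.zeros_like(p_d_given_c)
for c in range(2):
    for d in range(2):
        grad[c, d] = np.sum(d_obs == d) * p_c[c] / p_d[d]

# Check one entry against a finite difference (entries treated as free parameters).
eps = 1e-6
cpt = p_d_given_c.copy(); cpt[0, 1] += eps
fd = (loglik(cpt) - loglik(p_d_given_c)) / eps
assert np.isclose(fd, grad[0, 1], rtol=1e-3)
```

Note also that `grad` is entrywise non-negative, matching the observation above.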
EM (directed models)

E-step: for each instance (a, d) ∈ D, use the current parameters to get the marginals over the hidden variables:

p_{θ,D}(B), p_{θ,D}(A), p_{θ,D}(C), p_{θ,D}(A, B, C), p_{θ,D}(D, C)

More generally, compute expected sufficient statistics, e.g.

p_{θ,D}(C = c′, D = d′) = (1/N) Σ_{(a,d) ∈ D} p_θ(c′, d′ ∣ a, d)

which is nonzero only for d′ = d. In general we need inference to estimate these sufficient statistics.

M-step: use the marginals (similar to completely observed data) to learn θ. E.g., update θ_{D∣C} using p_{θ,D}(C, D) and p_{θ,D}(C):

θ^{new}_{D∣C} = p_{θ,D}(C, D) / p_{θ,D}(C)

For a Bayesian network with CPTs, the E-step computes {p_{θ,D}(X_i), p_{θ,D}(X_i, Pa_{X_i})} and the M-step sets

θ^{new}_{X_i∣Pa_{X_i}} = p_{θ,D}(X_i, Pa_{X_i}) / p_{θ,D}(Pa_{X_i})
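The E- and M-steps above can be sketched for a minimal C → D network with C always hidden (a numpy sketch; the CPT values, initialization, and the choice to hold p(C) fixed are assumptions for illustration). Note the guaranteed non-decrease of the marginal likelihood:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical network C -> D with C hidden; learn theta_{D|C} by EM.
p_c = np.array([0.3, 0.7])                       # held fixed for simplicity
true_cpt = np.array([[0.9, 0.1], [0.2, 0.8]])
c = rng.choice(2, size=1000, p=p_c)
d = np.array([rng.choice(2, p=true_cpt[ci]) for ci in c])   # only d is kept

def marginal_ll(cpt):
    return np.log((p_c[:, None] * cpt[:, d]).sum(axis=0)).sum()

cpt = np.array([[0.6, 0.4], [0.45, 0.55]])       # asymmetric init (symmetry is a fixed point)
lls = []
for _ in range(50):
    lls.append(marginal_ll(cpt))
    # E-step: posterior p(c | d_m) and expected sufficient statistics
    joint = p_c[:, None] * cpt[:, d]             # shape (2, N): p(c, d_m)
    post = joint / joint.sum(axis=0)             # p(c | d_m)
    ess = np.zeros((2, 2))                       # expected counts for (c, d)
    for dv in range(2):
        ess[:, dv] = post[:, d == dv].sum(axis=1)
    # M-step: normalize expected counts, as with fully observed data
    cpt = ess / ess.sum(axis=1, keepdims=True)

# EM never decreases the marginal likelihood.
assert all(b >= a - 1e-6 for a, b in zip(lls, lls[1:]))
```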
For undirected models the M-step is the expensive part; performing an E-step within each iteration of the M-step is equivalent to gradient descent.

Empirical behavior (alarm network, 1000 training instances, 50% of variables observed in each instance): fast initial improvement. (plots: train log-likelihood, test log-likelihood, and change in the different parameter values)

Local optima in EM: even with a single hidden variable the number of local maxima can be large; multiple restarts help. (plots: number of local maxima and the effect of multiple restarts on the alarm network)
EM (directed models)

Goal: adjust θ to maximize the marginal likelihood p_θ(x_o) of a model with hidden h. EM instead alternates:
E-step: soft-complete the data
M-step: maximize the full-data likelihood

So EM maximizes the expected log-likelihood. How are these objectives related? Any guarantees for EM? The variational interpretation relates these two.
Variational interpretation

For any distribution q,

D_KL(q(x) ∥ p(x)) = −H(q) − E_q[log p(x)]

Writing p(x) = p̃(x) / Z,

D_KL(q(x) ∥ p(x)) = −H(q) − E_q[log p̃(x)] + log Z

For a latent variable model, take p to be the posterior p(x_h ∣ x_o) = p(x_o, x_h) / p(x_o), so that p̃(x_h) = p(x_o, x_h) and Z = p(x_o), with q a distribution over x_h:

D_KL(q(x_h) ∥ p_θ(x_h ∣ x_o)) = −H(q) − E_q[log p_θ(x_o, x_h)] + log p_θ(x_o)

Re-arrange:

log p_θ(x_o) = D_KL(q(x_h) ∥ p_θ(x_h ∣ x_o)) + E_q[log p_θ(x_o, x_h)] + H(q)

The middle term is the expected log-likelihood wrt q; the entropy term H(q) is ignored by EM (it does not depend on θ).

Coordinate ascent:
E-step: optimize q for a fixed θ (variational inference)
M-step: optimize θ for a fixed q

This gives guaranteed improvement of log p_θ(x_o) and converges to a local optimum.
ELBO in latent variable models

The evidence lower bound (ELBO) is a lower bound on the log-likelihood:

log p_θ(x_o) = D_KL(q(x_h) ∥ p_θ(x_h ∣ x_o)) + ELBO(q, θ),  where  ELBO(q, θ) = E_q[log p_θ(x_o, x_h)] + H(q)

Since the KL term is non-negative, log p_θ(x_o) ≥ ELBO(q, θ).

Amortization: instead of a separate q(x_h) for each instance, make q a function of the observations, q_ψ(x_h ∣ x_o), and maximize the ELBO jointly over θ and ψ:

log p_θ(x_o) ≥ E_{q_ψ}[log p_θ(x_o, x_h)] + H(q_ψ(x_h ∣ x_o))

Use neural networks to represent the conditional distributions p_θ(x_o ∣ x_h) and q_ψ(x_h ∣ x_o), and use back-propagation for optimization: this gives the Variational Auto-Encoder (VAE).
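The exact decomposition log p_θ(x_o) = KL + ELBO can be verified numerically for a small discrete model (a sketch with made-up probabilities; h has three states and x_o is a single observed outcome):

```python
import numpy as np

rng = np.random.default_rng(4)

# A made-up joint over hidden h (3 states) and one observed value of x_o.
p_h = np.array([0.2, 0.5, 0.3])
p_x_given_h = np.array([0.7, 0.1, 0.4])          # p(x_o = observed | h)

joint = p_h * p_x_given_h                        # p(x_o, h) at the observed x_o
log_px = np.log(joint.sum())                     # log p_theta(x_o)
post = joint / joint.sum()                       # posterior p_theta(h | x_o)

q = rng.dirichlet(np.ones(3))                    # an arbitrary variational q(h)
kl = np.sum(q * np.log(q / post))
elbo = np.sum(q * np.log(joint)) - np.sum(q * np.log(q))   # E_q[log p] + H(q)

assert np.isclose(log_px, kl + elbo)             # exact decomposition
assert elbo <= log_px + 1e-12                    # ELBO is a lower bound
```

Setting q equal to `post` makes the KL term vanish, so the bound is tight at the exact posterior — this is what the E-step achieves.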
Undirected models with latent variables

Recall the gradient for a fully observed log-linear model p(x; θ) = (1/Z(θ)) exp(θ · ϕ(x)):

∇_θ ℓ(θ, D) = ∣D∣ ( E_D[ϕ(x)] − E_{p_θ}[ϕ(x)] )

i.e., the expectation wrt the data minus the expectation wrt the model. With latent variables, x = (x_o, x_h) is only partially observed:

p(x_o; θ) = Σ_{x_h} (1/Z(θ)) exp(θ · ϕ(x_o, x_h))

∇_θ ℓ(θ, D) = ∣D∣ ( E_{D,θ}[ϕ(x)] − E_{p_θ}[ϕ(x)] )

The first expectation is now wrt both the data and the model: we need inference to calculate the expected sufficient statistics (similar to the E-step in EM).
Binary RBM

p(h, v) = (1/Z(θ)) exp( Σ_{i,j} θ_{i,j} v_i h_j ),  for v_i, h_j ∈ {0, 1}

with sufficient statistics ϕ_{i,j}(v, h) = v_i h_j. For visible instances {v^(m)}_m the log-likelihood is

ℓ(D; θ) = Σ_{v ∈ D} log Σ_h (1/Z(θ)) exp( Σ_{i,j} θ_{i,j} v_i h_j )

∂ℓ/∂θ_{i,j} = E_{D,θ}[v_i h_j] − E_{p_θ}[v_i h_j]
            = (1/M) Σ_{v′ ∈ D} v′_i E_{p_θ}[h_j ∣ v′] − E_{p_θ}[v_i h_j]

Sampling-based inference: for the data term, sample h ∣ v; for the model term, use Gibbs sampling to sample both h and v using the current parameters.
Summary: learning with partial observations

- missing data and latent variables can produce expressive probabilistic models, but the problem is not convex
- how to learn the model?
  - directly estimate the gradient (directed and undirected models)
  - use EM (directed models)
  - variational interpretation + relation to the ELBO
Example: mixture of Gaussians

Model parameters θ = [π, {μ_k, Σ_k}]:

p(x; π) = Π_k π_k^{I(x=k)}

p(y ∣ x; {μ_k, Σ_k}) = (1 / √∣2πΣ_x∣) exp( −(1/2) (y − μ_x)^T Σ_x^{−1} (y − μ_x) )

E-step: for each y ∈ D calculate the posterior over the mixture assignment,

p(x = k ∣ y) ∝ p(x = k; π) p(y ∣ x = k; μ, Σ) = π_k N(y; μ_k, Σ_k)

Now we have "probabilistically completed" instances; updating the parameters is easy (as in a fully observed Bayes-net).

M-step: estimate π, μ_k, Σ_k for all k:

π_k^new = (1/N) Σ_{y ∈ D} p(x = k ∣ y)

μ_k^new = Σ_{y ∈ D} p(x = k ∣ y) y / Σ_{y ∈ D} p(x = k ∣ y)    (the mean of a weighted set of instances)

Σ_k^new = Σ_{y ∈ D} p(x = k ∣ y) (y − μ_k)(y − μ_k)^T / Σ_{y ∈ D} p(x = k ∣ y)    (the covariance of a weighted set of instances)
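The E- and M-step updates above can be sketched for 1-D data with two components (a numpy sketch; the data, component parameters, and initialization are made up):

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy 1-D data from two made-up Gaussian components.
y = np.concatenate([rng.normal(-2.0, 0.7, 300),
                    rng.normal(3.0, 1.2, 200)])[:, None]
N, K = len(y), 2

pi = np.full(K, 1.0 / K)
mu = np.array([[-1.0], [1.0]])
var = np.array([1.0, 1.0])                       # Sigma_k in 1-D

for _ in range(100):
    # E-step: responsibilities p(x = k | y) ∝ pi_k N(y; mu_k, var_k)
    logp = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
            - 0.5 * (y - mu.T) ** 2 / var)       # (N, K)
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: weighted mean / covariance of the "soft-completed" data
    Nk = r.sum(axis=0)
    pi = Nk / N
    mu = (r.T @ y) / Nk[:, None]
    var = np.array([np.sum(r[:, k] * (y[:, 0] - mu[k, 0]) ** 2) / Nk[k]
                    for k in range(K)])

print(sorted(mu[:, 0]))                          # roughly [-2, 3]
```

Subtracting the per-row max before exponentiating keeps the E-step numerically stable; the constant cancels in the normalization.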