Probabilistic Graphical Models
introduction to learning
Siamak Ravanbakhsh Fall 2019
Learning objectives
different goals of learning a graphical model
effect of goals on the learning setup
no need to estimate the partition function
D = {X^(1), …, X^(M)},  X^(m) ∼ P*
empirical distribution: P_D(x) = (1/|D|) I(x ∈ D)
goal: find P̂ ∈ 𝒫 that minimizes KL(P* ∥ P̂)
KL(P* ∥ P̂) = E_{X∼P*}[log P*(X)] − E_{X∼P*}[log P̂(X)]
the first term is the negative entropy of P* (does not depend on P̂)
approximating P* by P_D in the second term, minimizing the KL divergence amounts to maximizing Σ_{x∈D} log P̂(x), the log-likelihood
its negative is called the log loss
how to compare two log-likelihood values?
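A minimal sketch (not from the slides) of these quantities on a toy discrete dataset: the empirical distribution, the log-likelihood and log loss of a hypothetical candidate model P̂, and the KL divergence to the empirical distribution. The dataset and P̂ below are made up for illustration.

```python
# Sketch: empirical distribution, log-likelihood, log loss, and KL divergence
# for a discrete candidate model over a small hypothetical dataset D.
from collections import Counter
import math

# hypothetical dataset of discrete samples x^(1), ..., x^(M)
D = ["a", "b", "a", "a", "c", "b"]

# empirical distribution P_D(x) = count(x) / |D|
counts = Counter(D)
P_D = {x: c / len(D) for x, c in counts.items()}

# a hypothetical candidate model P_hat (sums to 1 over the support)
P_hat = {"a": 0.5, "b": 0.3, "c": 0.2}

# average log-likelihood (1/|D|) * sum_{x in D} log P_hat(x); its negative is the log loss
avg_log_lik = sum(math.log(P_hat[x]) for x in D) / len(D)
log_loss = -avg_log_lik

# KL(P_D || P_hat) differs from the log loss only by the entropy of P_D,
# which does not depend on P_hat
kl = sum(p * math.log(p / P_hat[x]) for x, p in P_D.items())
print(avg_log_lik, log_loss, kl)
```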
D = {(X^(m), Y^(m))}
the output in our prediction is structured, e.g. in image segmentation
learn P̂(X ∣ Y) and predict X̂(Y) = argmax_x P̂(x ∣ Y)
0/1 loss (unforgiving): E_{(X,Y)∼P*}[ I(X ≠ X̂(Y)) ]
Hamming loss: E_{(X,Y)∼P*}[ Σ_i I(X_i ≠ X̂(Y)_i) ]
conditional log-likelihood: E_{(X,Y)∼P*}[ log P̂(X ∣ Y) ] (takes prediction uncertainty into account)
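A minimal sketch (assumed toy vectors, not from the slides) of the 0/1 and Hamming losses for a single structured prediction; in practice the expectations over (X, Y) ∼ P* are estimated by averaging these per-example losses over held-out data.

```python
# Sketch: structured-prediction losses, where each output x is a vector of labels,
# e.g. a segmentation mask flattened to 1-D.
import numpy as np

def zero_one_loss(x_true, x_pred):
    # unforgiving: counts as an error unless the whole structured output is exact
    return float(not np.array_equal(x_true, x_pred))

def hamming_loss(x_true, x_pred):
    # sum_i I(X_i != Xhat_i): number of output variables predicted incorrectly
    return int(np.sum(np.asarray(x_true) != np.asarray(x_pred)))

x_true = [1, 0, 1, 1, 0]
x_pred = [1, 0, 0, 1, 0]               # one of five variables is wrong
print(zero_one_loss(x_true, x_pred))   # 1.0
print(hamming_loss(x_true, x_pred))    # 1
```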
D = {X^(m)}
finding conditional independencies or causal relationships
e.g. in gene regulatory networks (image credit: Chen et al., 2014)
two DAGs G and G′ are I-equivalent if I(G) = I(G′)
this holds iff they have the same undirected skeleton and the same immoralities
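A minimal sketch of the I-equivalence check, assuming networkx is available (it is not used in the slides): compare the undirected skeletons and the immoralities (v-structures a → c ← b with a and b non-adjacent) of two DAGs.

```python
# Sketch: two DAGs are I-equivalent iff same skeleton and same immoralities.
import networkx as nx

def immoralities(G):
    imm = set()
    for c in G.nodes:
        parents = list(G.predecessors(c))
        for i in range(len(parents)):
            for j in range(i + 1, len(parents)):
                a, b = parents[i], parents[j]
                if not (G.has_edge(a, b) or G.has_edge(b, a)):   # parents not adjacent
                    imm.add((frozenset((a, b)), c))
    return imm

def i_equivalent(G1, G2):
    skel1 = set(map(frozenset, G1.to_undirected().edges))
    skel2 = set(map(frozenset, G2.to_undirected().edges))
    return skel1 == skel2 and immoralities(G1) == immoralities(G2)

# X -> Y -> Z and X <- Y <- Z encode the same independencies;
# X -> Y <- Z (an immorality) does not.
G_a = nx.DiGraph([("X", "Y"), ("Y", "Z")])
G_b = nx.DiGraph([("Z", "Y"), ("Y", "X")])
G_c = nx.DiGraph([("X", "Y"), ("Z", "Y")])
print(i_equivalent(G_a, G_b))  # True
print(i_equivalent(G_a, G_c))  # False
```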
Recall
image: http://ipython-books.github.io
risk: E_{X∼P*}[loss(X)]
empirical risk: E_{x∈D}[loss(x)]
low empirical risk does not translate to low risk
high variance (as measured by a validation set): different choices of D ∼ P* produce very different models P̂
high bias: the model has a bias, and even a large dataset cannot help
a solution: penalize model complexity
regularization
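A minimal sketch (a toy regression setup with squared error as the loss, not the slides' example): empirical risk keeps dropping as model complexity grows while the held-out risk rises, and an L2 (ridge) penalty tempers this.

```python
# Sketch: empirical risk vs held-out risk for polynomial models of increasing
# complexity, with an optional L2 (ridge) penalty as regularization.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):                                  # stand-in for sampling from P*
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + 0.3 * rng.normal(size=n)

x_tr, y_tr = sample(20)                         # small training set D
x_va, y_va = sample(500)                        # validation set approximates the true risk

def fit(x, y, degree, lam=0.0):
    X = np.vander(x, degree + 1)                # polynomial features
    # ridge-regularized least squares: lam penalizes model complexity
    return np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

def risk(w, x, y):
    return float(np.mean((np.vander(x, len(w)) @ w - y) ** 2))

for degree in (1, 3, 15):
    for lam in (0.0, 1e-2):
        w = fit(x_tr, y_tr, degree, lam)
        print(degree, lam, round(risk(w, x_tr, y_tr), 3), round(risk(w, x_va, y_va), 3))
```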
training
discriminative: learn P̂(X ∣ Y) directly
generative: learn P̂(X, Y) and condition to obtain P̂(X ∣ Y)
naive Bayes: P(X ∣ Y) ∝ P(X) P(Y ∣ X)
trained generatively (log-likelihood)
works better on small datasets (higher bias)
unnecessary cond. ind. assumptions about Y
can deal with missing values & learn from unlabeled data
logistic regression: P(X = 1 ∣ Y) = σ(WᵀY + b)
trained discriminatively (cond. log-likelihood)
works better on large datasets
no assumptions about cond. ind. in Y
figure: naive Bayes vs logistic regression on UCI datasets (from Ng & Jordan 2001)
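A minimal sketch (scikit-learn is an assumption, not part of the slides) in the spirit of the Ng & Jordan 2001 experiment: train both models on growing subsets of the data and compare held-out accuracy. Note that the slides use X for the label and Y for the features, while scikit-learn's convention is the opposite.

```python
# Sketch: generative naive Bayes vs discriminative logistic regression
# as the training set grows, on a synthetic classification dataset.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

features, labels = make_classification(n_samples=5000, n_features=20,
                                        n_informative=10, random_state=0)
f_test, l_test = features[3000:], labels[3000:]

for m in (20, 100, 1000, 3000):                                           # training set sizes
    nb = GaussianNB().fit(features[:m], labels[:m])                       # trained generatively
    lr = LogisticRegression(max_iter=1000).fit(features[:m], labels[:m])  # trained discriminatively
    print(m, round(nb.score(f_test, l_test), 3), round(lr.score(f_test, l_test), 3))
```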
calculating P(x), sampling from P (generative modeling)
discriminative and generative modeling
bias-variance trade-off, regularizing the model