SLIDE 1

Probabilistic Graphical Models

introduction to learning

Siamak Ravanbakhsh Fall 2019

SLIDE 2

Learning objectives

  • different goals of learning a graphical model
  • effect of the goals on the learning setup

SLIDE 3

Where does a graphical model come from?

image: http://blog.londolozi.com/

SLIDE 4-5

Where does a graphical model come from?

designed by domain experts: more suitable for directed models

  • conditional probabilities are more intuitive than unnormalized factors
  • no need to estimate the partition function (see the sketch below)

learning from data:

  • fixed structure: easy for directed models
  • unknown structure
  • fully or partially observed data, hidden variables

image: http://blog.londolozi.com/
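
Because conditional probabilities are easy to elicit and the directed factorization is normalized by construction, an expert can write a small Bayesian network down directly. Below is a minimal Python sketch with made-up CPT values (the rain/sprinkler/wet-grass variables are only an illustration, not from the slides): the joint is a product of local conditionals and already sums to one, so no partition function needs to be estimated.

```python
# Expert-specified directed model: each factor is a conditional probability
# table, so the joint distribution is normalized by construction.

# P(Rain), P(Sprinkler | Rain), P(WetGrass = True | Rain, Sprinkler) -- hypothetical numbers.
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: {True: 0.01, False: 0.99},   # given Rain = True
               False: {True: 0.40, False: 0.60}}  # given Rain = False
p_wet = {(True, True): 0.99, (True, False): 0.80,
         (False, True): 0.90, (False, False): 0.05}

def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) as a product of local conditionals."""
    p_w = p_wet[(rain, sprinkler)]
    return (p_rain[rain]
            * p_sprinkler[rain][sprinkler]
            * (p_w if wet else 1.0 - p_w))

# The joint sums to 1 with no normalization step.
total = sum(joint(r, s, w) for r in (True, False)
            for s in (True, False) for w in (True, False))
print(total)  # 1.0 (up to floating point)
```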

SLIDE 6-10

Goals of learning: density estimation

assumption: the data is an IID sample from P*

  D = {X^(1), …, X^(M)},   X^(m) ∼ P*

empirical distribution:

  P_D(x) = (1/∣D∣) I(x ∈ D)

objective: learn a P̂ ∈ 𝒫 close to P*

  P̂ = argmin_P D_KL(P* ∥ P)
     = E_{P*}[log P*] − E_{P*}[log P]

the first term is the negative entropy of P* and does not depend on P

substituting P_D for P* gives the log-likelihood objective:

  P̂ = argmax_P ∑_{x∈D} log P(x)

its negative is called the log loss

how to compare two log-likelihood values?
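
To make the argmax concrete, here is a small Python sketch (toy data, not from the slides) for a single discrete variable: the maximum-likelihood fit is exactly the empirical distribution P_D, and two candidate models can be compared by the log-likelihood they assign to the same data.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.5, 0.3, 0.2])            # hypothetical P*
D = rng.choice(3, size=200, p=true_p)          # IID sample from P*

counts = np.bincount(D, minlength=3)
p_hat = counts / counts.sum()                  # MLE = empirical distribution P_D

def log_likelihood(p, data):
    """sum of log P(x) over the data; its negative is the log loss."""
    return np.sum(np.log(p[data]))

# Comparing two candidate models by their log-likelihood on the same data:
p_uniform = np.ones(3) / 3
print(log_likelihood(p_hat, D), log_likelihood(p_uniform, D))
# The MLE can only look better on the training data itself; held-out data is
# the fairer comparison (see the bias-variance slides below).
```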

SLIDE 11-12

Goals of learning: prediction

given D = {(X^(m), Y^(m))}

the output in our prediction is structured, e.g., in image segmentation

interested in learning P̂(X ∣ Y)

making a prediction:

  X̂(Y) = argmax_x P̂(x ∣ Y)

error measures:

  • 0/1 loss (unforgiving):  E_{(X,Y)∼P*}[ I(X ≠ X̂(Y)) ]
  • Hamming loss:  E_{(X,Y)∼P*}[ ∑_i I(X_i ≠ X̂(Y)_i) ]
  • conditional log-likelihood:  E_{(X,Y)∼P*}[ log P̂(X ∣ Y) ]  (takes prediction uncertainty into account)
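
The sketch below (made-up predictions and probabilities, not from the slides) computes the three error measures for a structured output with several components; it uses component-wise marginals only to keep the example short.

```python
import numpy as np

X_true = np.array([[1, 0, 1, 1],      # each row: one structured label X^(m)
                   [0, 0, 1, 0]])
X_pred = np.array([[1, 0, 0, 1],      # X_hat(Y^(m)) from some hypothetical model
                   [0, 0, 1, 0]])
# P_hat(X_i = 1 | Y^(m)) for each component, from the same hypothetical model:
probs = np.array([[0.9, 0.2, 0.4, 0.8],
                  [0.1, 0.3, 0.7, 0.2]])

zero_one = np.mean(np.any(X_true != X_pred, axis=1))   # whole labeling wrong at all?
hamming = np.mean(np.sum(X_true != X_pred, axis=1))    # number of wrong components
# conditional log-likelihood of the true labeling (independent components given Y):
cll = np.mean(np.sum(np.where(X_true == 1, np.log(probs), np.log(1 - probs)), axis=1))

print(zero_one, hamming, cll)
```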

SLIDE 13-14

Goals of learning: knowledge discovery

given D = {X^(m)}

finding conditional independencies or causal relationships

interested in learning the structure G or H; it is not always uniquely identifiable

e.g., in gene regulatory networks

recall: two DAGs are I-equivalent if I(G) = I(G′), which holds exactly when they have the same undirected skeleton and the same immoralities

image credit: Chen et al., 2014
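
As a concrete check of the I-equivalence characterization, here is a small Python sketch (toy DAGs encoded as parent dictionaries, not from the slides) that compares undirected skeletons and immoralities.

```python
from itertools import combinations

def skeleton(parents):
    """Set of undirected edges of a DAG given as {node: [parents]}."""
    return {frozenset((u, v)) for v, ps in parents.items() for u in ps}

def immoralities(parents):
    """V-structures x -> z <- y where x and y are not adjacent."""
    skel = skeleton(parents)
    out = set()
    for z, ps in parents.items():
        for x, y in combinations(ps, 2):
            if frozenset((x, y)) not in skel:
                out.add((frozenset((x, y)), z))
    return out

def i_equivalent(g1, g2):
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

# X -> Y -> Z and X <- Y <- Z encode the same independencies;
# the collider X -> Y <- Z does not.
chain1 = {'X': [], 'Y': ['X'], 'Z': ['Y']}
chain2 = {'Z': [], 'Y': ['Z'], 'X': ['Y']}
collider = {'X': [], 'Z': [], 'Y': ['X', 'Z']}
print(i_equivalent(chain1, chain2))    # True
print(i_equivalent(chain1, collider))  # False
```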

SLIDE 15-18

bias-variance trade-off

learning ideally minimizes some risk (expected loss):

  E_{X∼P*}[loss(X)]

in reality we use the empirical risk:

  E_{x∈D}[loss(x)]

if our model is expressive, we can overfit:

  • low empirical risk does not translate to low risk
  • our model does not generalize to samples outside D
  • different choices of D ∼ P* produce very different models P̂
  • high variance, as measured by a validation set

example: overfitting in density estimation

simple models cannot fit the data:

  • the model has a bias, and even a large dataset cannot help
  • high bias

a solution: penalize model complexity, i.e., regularization (see the sketch below)

image: http://ipython-books.github.io
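
A small Python sketch of the trade-off (toy numbers, not from the slides): with many states and few samples, the unregularized maximum-likelihood estimate has low empirical risk but poor held-out risk; a simple complexity penalty, here Laplace smoothing as one possible regularizer, trades a little bias for much lower variance.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 50                                          # many states, few samples: easy to overfit
true_p = rng.dirichlet(np.ones(K))              # hypothetical P*
train = rng.choice(K, size=30, p=true_p)
valid = rng.choice(K, size=1000, p=true_p)      # stands in for a validation set

def fit(data, alpha):
    """MLE for alpha = 0; Laplace-smoothed (regularized) estimate for alpha > 0."""
    counts = np.bincount(data, minlength=K) + alpha
    return counts / counts.sum()

def log_loss(p, data):
    """Negative average log-likelihood (the log loss)."""
    return -np.mean(np.log(p[data]))

for alpha in [0.0, 1.0]:
    p_hat = fit(train, alpha)
    with np.errstate(divide='ignore'):
        print(alpha, log_loss(p_hat, train), log_loss(p_hat, valid))
# The pure MLE looks best on the training set but assigns probability 0 to
# unseen states, so its validation log loss typically blows up; smoothing fixes that.
```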

SLIDE 19-21

Discriminative vs. generative

if the goal is prediction, i.e., learning P̂(X ∣ Y):

  • generative: learn P̂(X, Y) and condition on Y (e.g., MRF)
  • discriminative: directly learn P̂(X ∣ Y) (e.g., CRF)

example: naive Bayes vs. logistic regression

naive Bayes:  P(X ∣ Y) ∝ P(X) P(Y ∣ X)

  • trained generatively (log-likelihood)
  • works better on small datasets (higher bias)
  • makes unnecessary cond. ind. assumptions about Y
  • can deal with missing values & learn from unlabeled data

logistic regression:  P(X = 1 ∣ Y) = σ(W^T Y + b)

  • trained discriminatively (cond. log-likelihood)
  • works better on large datasets
  • no assumptions about cond. ind. in Y

SLIDE 22

Discriminative vs. generative

example: naive Bayes vs. logistic regression on UCI datasets

figure: naive Bayes vs. logistic regression (from Ng & Jordan 2001)
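
For a rough look at the qualitative pattern (synthetic data, not the actual UCI experiments of Ng & Jordan 2001), the sketch below compares scikit-learn's BernoulliNB and LogisticRegression as the training set grows; note that scikit-learn's convention is y = label and X = features, the reverse of the slides' notation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X = (X > 0).astype(int)                       # binarize features for BernoulliNB
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for n in [20, 100, 1000]:                     # small vs. large training sets
    nb = BernoulliNB().fit(X_tr[:n], y_tr[:n])            # generative
    lr = LogisticRegression(max_iter=1000).fit(X_tr[:n], y_tr[:n])  # discriminative
    print(n, nb.score(X_te, y_te), lr.score(X_te, y_te))
# The slide's qualitative claim: naive Bayes tends to do relatively better with
# little data, while logistic regression catches up or wins with more.
```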

SLIDE 23

summary

learning can have different objectives:

  • density estimation
      calculating P(x)
      sampling from P (generative modeling)
  • prediction (conditional density estimation)
      discriminative and generative modeling
  • knowledge discovery

learning is expressed as empirical risk minimization:

  • bias-variance trade-off
  • regularize the model