SLIDE 1

Conditional Probability Estimation

Marco Cattaneo

School of Mathematics and Physical Sciences, University of Hull

PGM 2016, Lugano, Switzerland, 7 September 2016

SLIDE 2

MLE of conditional probability

◮ given: a probabilistic model Pθ with unknown θ, past data D, and events E, Q concerning some new (independent) data

◮ MLE of Pθ(Q | E) = Pθ(Q | D ∩ E):

  P_{\hat{\theta}_D}(Q \mid E)  with  \hat{\theta}_D = \arg\max_{\theta} P_{\theta}(D)   (wrong)

  P_{\hat{\theta}_{D \cap E}}(Q \mid E)  with  \hat{\theta}_{D \cap E} = \arg\max_{\theta} P_{\theta}(D \cap E)   (right)

◮ when Pθ is a (generalized) regression model, and E, Q describe predictors and response, respectively, then there is no difference between (right) and (wrong)

◮ when Pθ is a Bayesian network, D is a training dataset, and E, Q concern some new instances, then the usual MLE is (wrong), and this partially explains the unsatisfactory performance of MLE for Bayesian networks
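To make the distinction concrete, here is a minimal sketch (a toy example, not taken from the slides): θ is assumed to be the success probability of a coin, D an assumed record of 7 heads in 10 past tosses, and E, Q concern 2 further independent tosses (E = "at least one head", Q = "both heads"). The two plug-in estimates of Pθ(Q | E) differ.

```python
# Toy sketch (assumed setup): (wrong) vs (right) plug-in MLE of P_theta(Q | E)
# for a Bernoulli model, with D = 7 heads in 10 past tosses and E, Q about 2 new tosses.

def p_D(theta):                      # P_theta(D): likelihood of the past data
    return theta**7 * (1 - theta)**3

def p_E(theta):                      # P_theta(E) = 1 - P(no heads in the 2 new tosses)
    return 1 - (1 - theta)**2

def p_Q_given_E(theta):              # P_theta(Q | E) = theta^2 / P_theta(E), since Q implies E
    return theta**2 / p_E(theta)

grid = [i / 10000 for i in range(1, 10000)]

# (wrong): maximize P_theta(D) only, then plug into the conditional probability
theta_D = max(grid, key=p_D)
print("wrong :", p_Q_given_E(theta_D))

# (right): maximize P_theta(D ∩ E) = P_theta(D) * P_theta(E), by independence of the new data
theta_DE = max(grid, key=lambda t: p_D(t) * p_E(t))
print("right :", p_Q_given_E(theta_DE))
```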

SLIDE 3

conditional probability estimation in Bayesian networks

◮ given: a DAG with vertices v ∈ V representing categorical variables Xv, a complete training dataset D with counts n(·), and conjugate Dirichlet priors with parameters d(·)

◮ estimates of local probability models:

  \hat{p}_D(x_v \mid x_{pa(v)}) = \frac{n(x_v, x_{pa(v)})}{n(x_{pa(v)})}   (ML)

  \hat{p}_D(x_v \mid x_{pa(v)}) = \frac{n(x_v, x_{pa(v)}) + d(x_v, x_{pa(v)})}{n(x_{pa(v)}) + d(x_{pa(v)})}   (Bayes)

◮ estimates of probabilities concerning a new instance:

  \hat{p}_D(x_Q) = \sum_{x_{V \setminus Q}} \prod_{v \in V} \hat{p}_D(x_v \mid x_{pa(v)}) = \sum_{x_{V \setminus Q}} \prod_{v \in V} \frac{n(x_v, x_{pa(v)})}{n(x_{pa(v)})}   (ML)

  \hat{p}_D(x_Q) = \sum_{x_{V \setminus Q}} \prod_{v \in V} \hat{p}_D(x_v \mid x_{pa(v)}) = \sum_{x_{V \setminus Q}} \prod_{v \in V} \frac{n(x_v, x_{pa(v)}) + d(x_v, x_{pa(v)})}{n(x_{pa(v)}) + d(x_{pa(v)})}   (Bayes)
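A minimal sketch of these count-based estimates, under assumed data and structure (a three-variable network Y → X1, Y → X2 with a hypothetical complete training dataset); the Bayes case takes d(·) = 1 for every cell, i.e. Laplace smoothing:

```python
# Sketch (assumed example, not from the slides): ML and Bayes estimates of the local
# models of the network Y -> X1, Y -> X2, and a marginal estimate by sum-product enumeration.
from itertools import product

data = [  # hypothetical complete training records (y, x1, x2), values in {0, 1}
    (1, 1, 0), (1, 1, 0), (1, 1, 1), (0, 0, 0), (0, 0, 1), (0, 1, 0),
]
parents = {"Y": (), "X1": ("Y",), "X2": ("Y",)}        # DAG structure
index = {"Y": 0, "X1": 1, "X2": 2}

def n(assign):                       # n(.): number of training records matching `assign`
    return sum(all(rec[index[v]] == val for v, val in assign.items()) for rec in data)

def p_hat(var, val, pa_vals, d=0.0): # d = 0 gives (ML); d > 0 gives the (Bayes) estimate
    pa = dict(zip(parents[var], pa_vals))
    child = dict(pa, **{var: val})
    return (n(child) + d) / (n(pa) + 2 * d)            # binary child, so d(x_pa) = 2 * d

def p_joint(y, x1, x2, d=0.0):       # product of the local estimates
    return p_hat("Y", y, (), d) * p_hat("X1", x1, (y,), d) * p_hat("X2", x2, (y,), d)

# estimate p(X1 = 1) by summing the joint estimate over the other variables
for d, label in [(0.0, "ML"), (1.0, "Bayes-Laplace")]:
    print(label, sum(p_joint(y, 1, x2, d) for y, x2 in product((0, 1), repeat=2)))
```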

SLIDE 4

conditional probability estimation in Bayesian networks

◮ estimates of conditional probabilities concerning a new instance:

  \hat{p}_{D,x_E}(x_Q \mid x_E) = \frac{\sum_{x_{V \setminus (Q \cup E)}} \prod_{v \in V} \hat{p}_D(x_v \mid x_{pa(v)})}{\sum_{x_{V \setminus E}} \prod_{v \in V} \hat{p}_D(x_v \mid x_{pa(v)})} = \frac{\sum_{x_{V \setminus (Q \cup E)}} \prod_{v \in V} \frac{n(x_v, x_{pa(v)})}{n(x_{pa(v)})}}{\sum_{x_{V \setminus E}} \prod_{v \in V} \frac{n(x_v, x_{pa(v)})}{n(x_{pa(v)})}}   (wrong ML)

  \hat{p}_{D,x_E}(x_Q \mid x_E) = \frac{\sum_{x_{V \setminus (Q \cup E)}} \prod_{v \in V} \hat{p}_D(x_v \mid x_{pa(v)})}{\sum_{x_{V \setminus E}} \prod_{v \in V} \hat{p}_D(x_v \mid x_{pa(v)})} = \frac{\sum_{x_{V \setminus (Q \cup E)}} \prod_{v \in V} \frac{n(x_v, x_{pa(v)}) + d(x_v, x_{pa(v)})}{n(x_{pa(v)}) + d(x_{pa(v)})}}{\sum_{x_{V \setminus E}} \prod_{v \in V} \frac{n(x_v, x_{pa(v)}) + d(x_v, x_{pa(v)})}{n(x_{pa(v)}) + d(x_{pa(v)})}}   (Bayes)
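As an illustration of the ratio-of-sum-products structure above, here is a minimal sketch with assumed plug-in local estimates for the same three-variable network Y → X1, Y → X2; the numerator sums over the variables outside Q ∪ E and the denominator over those outside E:

```python
# Sketch (assumed numbers): a conditional estimate as a ratio of sum-products of
# plug-in local estimates, for evidence E = {X1 = 1} and query Q = {Y = 1}.
p_y    = {1: 0.5, 0: 0.5}                        # hypothetical plug-in estimate of p(y)
p_x1_y = {(1, 1): 0.99, (1, 0): 0.01,            # plug-in estimate of p(x1 | y)
          (0, 1): 0.01, (0, 0): 0.99}
p_x2_y = {(1, 1): 0.10, (1, 0): 0.90,            # plug-in estimate of p(x2 | y)
          (0, 1): 0.90, (0, 0): 0.10}

def joint(y, x1, x2):                            # product of the local estimates
    return p_y[y] * p_x1_y[(x1, y)] * p_x2_y[(x2, y)]

# numerator: sum over variables outside Q ∪ E (here just X2);
# denominator: sum over variables outside E (here Y and X2)
num = sum(joint(1, 1, x2) for x2 in (0, 1))
den = sum(joint(y, 1, x2) for y in (0, 1) for x2 in (0, 1))
print("estimate of p(Y=1 | X1=1):", num / den)
```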

SLIDE 5

conditional probability estimation in Bayesian networks

◮ estimates of conditional probabilities concerning a new instance:

  \hat{p}_{D,x_E}(x_Q \mid x_E) = \frac{\sum_{x_{V \setminus (Q \cup E)}} \prod_{v \in V} \hat{p}_{D,x_E}(x_v \mid x_{pa(v)})}{\sum_{x_{V \setminus E}} \prod_{v \in V} \hat{p}_{D,x_E}(x_v \mid x_{pa(v)})} = \frac{\sum_{x_{V \setminus (Q \cup E)}} \prod_{v \in V} \frac{n(x_v, x_{pa(v)}) + \hat{e}_{D,x_E}(x_v, x_{pa(v)})}{n(x_{pa(v)}) + \hat{e}_{D,x_E}(x_{pa(v)})}}{\sum_{x_{V \setminus E}} \prod_{v \in V} \frac{n(x_v, x_{pa(v)}) + \hat{e}_{D,x_E}(x_v, x_{pa(v)})}{n(x_{pa(v)}) + \hat{e}_{D,x_E}(x_{pa(v)})}}   (ML)

  \hat{p}_{D,x_E}(x_Q \mid x_E) = \frac{\sum_{x_{V \setminus (Q \cup E)}} \prod_{v \in V} \hat{p}_D(x_v \mid x_{pa(v)})}{\sum_{x_{V \setminus E}} \prod_{v \in V} \hat{p}_D(x_v \mid x_{pa(v)})} = \frac{\sum_{x_{V \setminus (Q \cup E)}} \prod_{v \in V} \frac{n(x_v, x_{pa(v)}) + d(x_v, x_{pa(v)})}{n(x_{pa(v)}) + d(x_{pa(v)})}}{\sum_{x_{V \setminus E}} \prod_{v \in V} \frac{n(x_v, x_{pa(v)}) + d(x_v, x_{pa(v)})}{n(x_{pa(v)}) + d(x_{pa(v)})}}   (Bayes)

◮ \hat{e}_{D,x_E}(·) are the MLE of expected counts for the new instance, obtained from the EM algorithm
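The following sketch (assumed data, not code from the talk) runs this EM procedure for the network Y → X1, Y → X2, with evidence x_E = {X1 = 1, X2 = 1} on a new instance and query Q = {Y}: the expected counts ê(y) of the hidden class play the role of ê_{D,x_E}(·), and the plain plug-in (wrong ML) value is printed for comparison.

```python
# Sketch (assumed data): correct MLE of p(y | x1, x2) by appending the new, partially
# observed instance (X1 = 1, X2 = 1 observed, Y hidden) to the data and running EM.
data = ([(1, 1, 0)] * 44 + [(1, 1, 1)] * 5 + [(1, 0, 0)] * 1 +   # (y, x1, x2) training records
        [(0, 0, 1)] * 44 + [(0, 1, 1)] * 1 + [(0, 0, 0)] * 5)
x1_new, x2_new = 1, 1            # evidence x_E observed on the new instance; Y is hidden
N = len(data)

# complete-data counts n(.)
n_y  = {y: sum(r[0] == y for r in data) for y in (0, 1)}
n_x1 = {(x, y): sum(r[1] == x and r[0] == y for r in data) for x in (0, 1) for y in (0, 1)}
n_x2 = {(x, y): sum(r[2] == x and r[0] == y for r in data) for x in (0, 1) for y in (0, 1)}

def posterior(e):
    """p(Y | x_E) under local models re-estimated from counts n(.) plus expected counts e(.)."""
    p_y  = {y: (n_y[y] + e[y]) / (N + sum(e.values())) for y in (0, 1)}
    p_x1 = {y: (n_x1[(x1_new, y)] + e[y]) / (n_y[y] + e[y]) for y in (0, 1)}
    p_x2 = {y: (n_x2[(x2_new, y)] + e[y]) / (n_y[y] + e[y]) for y in (0, 1)}
    w = {y: p_y[y] * p_x1[y] * p_x2[y] for y in (0, 1)}
    return {y: w[y] / (w[0] + w[1]) for y in (0, 1)}

print("wrong ML (plain plug-in):", posterior({0: 0.0, 1: 0.0})[1])

e = {0: 0.5, 1: 0.5}             # initial expected counts ê(y) for the hidden class
for _ in range(200):             # EM: each E-step output feeds the next M-step
    e = posterior(e)
print("correct MLE via EM      :", posterior(e)[1])
```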

SLIDE 6

performance comparison: √MSE

◮ given: 3 binary variables X1, X2, Y with X1 ⊥ X2 | Y and p(x1 | y) = p(¬x1 | ¬y) = 99%, while p(¬x2 | y) = p(¬x2 | ¬y) = 99%

◮ estimate p(y | x1, x2) on the basis of a complete training dataset of size 100:

[Figure: √MSE of the estimates of p(y | x1, x2) as a function of p(y) ∈ [0, 1], comparing incomplete ML (when it exists), complete ML (when incomplete ML exists), Bayes−Laplace (when incomplete ML exists), complete ML (unconditional), Bayes−Laplace (unconditional), and the probability that incomplete ML exists.]

SLIDE 7

performance comparison: √MSE

◮ given: 3 binary variables X1, X2, Y with X1 ⊥ X2 | Y and p(x1 | y) = p(¬x1 | ¬y) = 99%, while p(¬x2 | y) = p(x2 | ¬y) = 90%

◮ estimate p(y | x1, x2) on the basis of a complete training dataset of size 100:

[Figure: √MSE of the estimates of p(y | x1, x2) as a function of p(y) ∈ [0, 1], comparing incomplete ML (when it exists), complete ML (when incomplete ML exists), Bayes−Laplace (when incomplete ML exists), complete ML (unconditional), Bayes−Laplace (unconditional), and the probability that incomplete ML exists.]
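A minimal Monte Carlo sketch of this kind of comparison, under the assumptions of this slide and using only numpy: it estimates √MSE for just two of the plotted estimators (complete ML and Bayes−Laplace) at a few assumed values of p(y), skipping the rare datasets on which the plug-in conditional is undefined; the incomplete (correct) ML curve would additionally require the EM routine sketched earlier.

```python
# Sketch (assumed setup): root mean squared error of two plug-in estimates of
# p(y=1 | x1=1, x2=1) from complete training datasets of size 100, as p(y) varies.
import numpy as np

rng = np.random.default_rng(0)
a1, a2 = 0.99, 0.90              # p(x1=1 | y=1) = p(x1=0 | y=0), p(x2=0 | y=1) = p(x2=1 | y=0)

def true_cond(py):               # true p(y=1 | x1=1, x2=1)
    num = py * a1 * (1 - a2)
    return num / (num + (1 - py) * (1 - a1) * a2)

def estimate(y, x1, x2, d):      # plug-in estimate from one dataset; d=0: ML, d=1: Bayes-Laplace
    n1, n0 = (y == 1).sum(), (y == 0).sum()
    p_y1 = (n1 + d) / (len(y) + 2 * d)
    p_x1_1 = ((x1[y == 1] == 1).sum() + d) / (n1 + 2 * d)
    p_x2_1 = ((x2[y == 1] == 1).sum() + d) / (n1 + 2 * d)
    p_x1_0 = ((x1[y == 0] == 1).sum() + d) / (n0 + 2 * d)
    p_x2_0 = ((x2[y == 0] == 1).sum() + d) / (n0 + 2 * d)
    num = p_y1 * p_x1_1 * p_x2_1
    den = num + (1 - p_y1) * p_x1_0 * p_x2_0
    return num / den if den > 0 else np.nan            # undefined (0/0) cases are skipped

for py in (0.1, 0.3, 0.5, 0.7, 0.9):                   # assumed grid of p(y) values
    errs = {"complete ML": [], "Bayes-Laplace": []}
    for _ in range(2000):                              # simulated training datasets of size 100
        y = rng.random(100) < py
        x1 = np.where(y, rng.random(100) < a1, rng.random(100) < 1 - a1).astype(int)
        x2 = np.where(y, rng.random(100) < 1 - a2, rng.random(100) < a2).astype(int)
        for name, d in (("complete ML", 0.0), ("Bayes-Laplace", 1.0)):
            errs[name].append((estimate(y.astype(int), x1, x2, d) - true_cond(py)) ** 2)
    print(py, {k: round(float(np.sqrt(np.nanmean(v))), 3) for k, v in errs.items()})
```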

SLIDE 8

conclusion

◮ the following way of using Bayesian networks is in agreement with Bayes estimation, but not with ML estimation: estimate the local probability models of a Bayesian network from data, and then use the resulting global model to calculate conditional probabilities of future events

◮ correct MLE of conditional probabilities can be calculated using the EM algorithm

◮ future work includes empirical studies of the effect of using the correct MLE on the performance of Bayesian network classifiers
