SLIDE 1

Bayesian networks: basic parameter learning

Machine Intelligence
Thomas D. Nielsen
September 2008

SLIDE 2

Estimation

Example: Physical Measurements
The mass of an atomic particle is measured in repeated experiments. Each measurement result = true mass + random error. Estimate of the true mass: the mean value of the normal distribution that best “fits” the data.

SLIDE 3

Estimation

Example: Coin Tossing
Is the Euro fair? Toss the Euro 1000 times and count the number of heads and tails. Result: heads: 521, tails: 479. Probability of the Euro landing heads (estimate): the value that best “fits” the data: 521/1000.

SLIDE 4

Estimation: Classical

Structure of the Estimation Problem
Given: data produced by a random process that is characterized by one or several numerical parameters.
Wanted: infer the value of (some of) the parameters.
(Classical) Method: obtain an estimate for a parameter via a function that maps possible data sets into the parameter space.

SLIDE 5

Estimation: Classical

Parametric Family
Let W be a set and Θ ⊆ R^k for some k ≥ 1. For every θ ∈ Θ let Pθ be a probability distribution on W. Then {Pθ | θ ∈ Θ} is called a parametric family (of distributions).
Example 1: W = {h, t}, Θ = [0, 1]. Pθ: the distribution with P(h) = θ (and P(t) = 1 − θ).
Example 2: W = {w1, . . . , wk}, Θ = {θ = (p1, . . . , pk) ∈ [0, 1]^k | Σ_i pi = 1}. Pθ: the distribution with P(wi) = pi.
Example 3: W = R, Θ = R × R_+. For θ = (µ, σ) ∈ Θ: Pθ is the normal distribution with mean µ and standard deviation σ.
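
As an aside (my addition, not part of the slides), a parametric family can be sketched in code as a map from the parameter θ to a distribution object; the sketch below uses scipy.stats and assumes the encoding h = 1, t = 0 for Example 1.

```python
# Minimal sketch: a parametric family as a map theta -> distribution (scipy.stats).
from scipy.stats import bernoulli, norm

# Example 1: W = {h, t} encoded as {1, 0}, Theta = [0, 1].
def coin_family(theta):
    return bernoulli(theta)              # P(1) = theta, P(0) = 1 - theta

# Example 3: W = R, Theta = R x R_+.
def normal_family(mu, sigma):
    return norm(loc=mu, scale=sigma)     # mean mu, standard deviation sigma

print(coin_family(0.3).pmf(1))           # 0.3
print(normal_family(0.0, 1.0).pdf(0.0))  # ~0.3989
```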

SLIDE 6

Estimation: Classical

Sample
A family X1, . . . , XN of random variables is called independent identically distributed (iid) if the family is independent and P(Xi) = P(Xj) for all i, j. A sample s1, . . . , sN ∈ W of observations (or data items) is interpreted as the observed values of an iid family of random variables with distribution P(Xi) = Pθ.

Likelihood Function
Given: a parametric family {Pθ | θ ∈ Θ} of distributions on W and a sample s = (s1, . . . , sN) ∈ W^N. The function

θ ↦ Pθ(s) := ∏_{i=1}^N Pθ(si),   resp.   θ ↦ log Pθ(s) = Σ_{i=1}^N log Pθ(si),

is called the likelihood function (resp. log-likelihood function) for θ given s.

SLIDE 7

Estimation: Classical

Maximum Likelihood Estimator
Given: a parametric family and a sample s. Every θ∗ ∈ Θ with

θ∗ = arg max_{θ∈Θ} Pθ(s)

is called a maximum likelihood estimate for θ (given s). Since the logarithm is a strictly monotone function, maximum likelihood estimates are also obtained by maximizing the log-likelihood:

θ∗ = arg max_{θ∈Θ} log Pθ(s)
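
A small illustration (my addition, not from the slides): computing the log-likelihood of the Euro data from Slide 3 on a grid of θ values and taking the maximizer; the 0/1 encoding of heads/tails is an assumption made for the example.

```python
# Sketch: maximum likelihood via the log-likelihood  sum_i log P_theta(s_i).
import numpy as np

s = np.array([1] * 521 + [0] * 479)      # Euro data: 521 heads (1), 479 tails (0)

def log_likelihood(theta, s):
    return np.sum(s * np.log(theta) + (1 - s) * np.log(1 - theta))

grid = np.linspace(0.001, 0.999, 999)    # candidate values of theta
best = grid[np.argmax([log_likelihood(t, s) for t in grid])]
print(best)                              # 0.521, the relative frequency of heads
```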

SLIDE 8

Estimation: Classical

Thumbtack Example
We have tossed a thumbtack 100 times. It has landed pin up 80 times, and we now look for the model that best fits the observations/data. Candidate models Mθ, one for each value θ = P(pin up), e.g. M0.1, M0.2, M0.3, . . .

We can measure how well a model fits the data D using:

P(D|Mθ) = P(pin up, pin up, pin down, . . . , pin up|Mθ) = P(pin up|Mθ) P(pin up|Mθ) P(pin down|Mθ) · . . . · P(pin up|Mθ)

This is also called the likelihood of Mθ given D. We select the parameter θ̂ that maximizes it:

θ̂ = arg max_θ P(D|Mθ) = arg max_θ ∏_{i=1}^{100} P(di|Mθ) = arg max_θ µ · θ^80 (1 − θ)^20,

where µ is a constant that does not depend on θ. By setting

d/dθ [µ · θ^80 (1 − θ)^20] = 0

we get the maximum likelihood estimate θ̂ = 0.8.
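
As a quick numerical check (my addition), the constant µ can be ignored and the log-likelihood maximized directly; any standard one-dimensional optimizer recovers θ̂ = 0.8.

```python
# Sketch: maximize theta^80 * (1 - theta)^20 numerically (mu drops out of the argmax).
import numpy as np
from scipy.optimize import minimize_scalar

neg_log_lik = lambda theta: -(80 * np.log(theta) + 20 * np.log(1 - theta))
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)                  # ~0.8, matching the closed-form estimate 80/100
```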

SLIDE 9

Estimation: Classical

Maximum Likelihood Estimates for the Multinomial Distribution
Consider the multinomial family defined by W = {w1, . . . , wk}, Θ = {θ = (p1, . . . , pk) ∈ [0, 1]^k | Σ_i pi = 1}, Pθ: the distribution with P(wi) = pi. For {Pθ | θ ∈ Θ} and s ∈ W^N there exists exactly one maximum likelihood estimate θ∗ = (p∗1, . . . , p∗k), given by

p∗i = (1/N) · |{j ∈ {1, . . . , N} | sj = wi}|

i.e. θ∗ is just the empirical distribution defined by the data on W.
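
In code (a sketch I added, with made-up sample values), the maximum likelihood estimate is simply the vector of relative frequencies:

```python
# Sketch: MLE for a multinomial family = empirical distribution of the sample.
from collections import Counter

s = ["w1", "w2", "w1", "w3", "w1", "w2"]        # hypothetical sample over W = {w1, w2, w3}
N = len(s)
p_star = {w: c / N for w, c in Counter(s).items()}
print(p_star)   # {'w1': 0.5, 'w2': 0.333..., 'w3': 0.166...}
```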

SLIDE 10

Estimation: Classical

Proof (for W = {w1, w2}):

p∗1 = (1/N) · |{j ∈ {1, . . . , N} | sj = w1}|
p∗2 = (1/N) · |{j ∈ {1, . . . , N} | sj = w2}|  (= 1 − p∗1)

log Pθ(s) = Σ_{j=1}^N log Pθ(sj) = N · (p∗1 log(p1) + p∗2 log(p2))
          = N · (p∗1 log(p1) + (1 − p∗1) log(1 − p1))

Differentiating w.r.t. p1: N · (p∗1/p1 − (1 − p∗1)/(1 − p1)). Only root: p1 = p∗1.

SLIDE 11

Estimation: Classical

Consistency
Let W = {w1, . . . , wk}, and let the data s1, s2, . . . , sN be generated by the distribution Pθ with parameters θ = (p1, . . . , pk). Then for all ε > 0 and i = 1, . . . , k:

lim_{N→∞} Pθ(|p∗i − pi| ≥ ε) = 0

Note: p∗ is a function of s. The probability Pθ(|p∗i − pi| ≥ ε) is the probability that sampling from Pθ yields a sample s for which the p∗ computed from s satisfies |p∗i − pi| ≥ ε.

Similar consistency properties hold for many other types of maximum likelihood estimates.

SLIDE 12

Estimation: Classical

Chebyshev’s Inequality
A quantitative bound:

Pθ(|p∗i − pi| ≥ ε) ≤ (1 / (ε²N)) · pi(1 − pi)
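
A small simulation (my addition; the values of p, N and ε are arbitrary) shows that the bound holds but is usually far from tight:

```python
# Sketch: compare the Chebyshev bound p(1-p) / (eps^2 * N) with simulated error rates.
import numpy as np

rng = np.random.default_rng(0)
p, N, eps, runs = 0.3, 100, 0.1, 10_000
p_hat = rng.binomial(N, p, size=runs) / N       # maximum likelihood estimates
print(np.mean(np.abs(p_hat - p) >= eps))        # simulated P(|p* - p| >= eps), well below the bound
print(p * (1 - p) / (eps**2 * N))               # Chebyshev bound: 0.21
```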

SLIDE 13

Estimation: Bayesian

Prior Beliefs
Classical approach: toss the Euro 3 times and always obtain heads, and infer P(heads) = 1. Inference is based solely on the observed data; no use of prior knowledge, beliefs, etc.
Bayesian methods: start with an encoding of prior beliefs, and show how to modify them on the basis of observed data.

SLIDE 14

Estimation: Bayesian

Approach
Model a priori assumptions on the parameter by a probability distribution on Θ. Based on an observed sample s, update to the a posteriori distribution.
Example 1: Apparently symmetric coin. W = {heads, tails}, θ = P(heads). Observation: sample with 3 heads and 7 tails.
[Figure: prior and posterior densities over θ ∈ [0, 1]]

SLIDE 15

Estimation: Bayesian

Example 2: Results of the first clinical tests of a new type of drug. W = {pos, neg}, θ = P(pos). Observation: sample with 3 pos and 7 neg.
[Figure: prior and posterior densities over θ ∈ [0, 1]]

SLIDE 16

Estimation: Bayesian

Example 3: Consider a thumbtack with P(pin up | θ) = θ. If we have no idea about θ, then we set fprior(θ) = 1. Assume we perform one experiment and get pin up:

fpost(θ | pin up) = P(pin up | θ) fprior(θ) / P(pin up) = θ / P(pin up) = 2θ

since P(pin up) = ∫_0^1 θ dθ = 1/2.

Assume that we now get a pin down:

fpost2(θ | pin up, pin down) = P(pin down, pin up | θ) f(θ) / P(pin down, pin up)
                             = P(pin down | θ) P(pin up | θ) f(θ) / P(pin down, pin up)
                             = P(pin down | θ) · θ · 1 / P(pin down, pin up)
                             = (1 − θ)θ / P(pin down, pin up)
                             = 6(1 − θ)θ

Note: P(pin down, pin up) = ∫_0^1 (1 − θ)θ dθ = 1/6.
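
The same posteriors can be reproduced numerically (a sketch I added, not from the slides): represent the prior on a grid, multiply by the likelihood, and renormalize.

```python
# Sketch: grid-based Bayesian updating for the thumbtack example.
import numpy as np

theta = np.linspace(0.0, 1.0, 10_001)
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)                       # f_prior(theta) = 1

def normalize(f):
    return f / (f.sum() * dtheta)                 # approximate the normalizing integral

post1 = normalize(theta * prior)                  # after observing pin up: ~2*theta
post2 = normalize((1 - theta) * theta * prior)    # after pin up, pin down: ~6*theta*(1-theta)
print(post1[5000], 2 * 0.5)                       # both ~1.0 at theta = 0.5
print(post2[5000], 6 * 0.5 * 0.5)                 # both ~1.5 at theta = 0.5
```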

SLIDE 17

Estimation: Bayesian

Updating a Prior Distribution
Let {Pθ | θ ∈ Θ} be a parametric family and fprior a density on Θ. Let s = (s1, . . . , sN) ∈ W^N be a sample. The posterior density fpost on Θ is defined as

fpost(θ) = c · fprior(θ) Pθ(s),

where c ∈ R is a normalization constant such that ∫_Θ fpost(θ) dθ = 1.

SLIDE 18

Estimation: Bayesian

Bayesian Estimator
The Bayesian estimate of the parameter θ is the mean value of the current distribution on Θ:

θ∗ := ∫_Θ θ fcurrent(θ) dθ

[Figure: a posterior density over θ with the mean θ∗ and the mode θmap marked]

Example
In Example 3 we had fpost2(θ | pin up, pin down) = 6(1 − θ)θ, hence

θ∗ := ∫_0^1 θ · 6(1 − θ)θ dθ = 1/2.

Alternative: the MAP (maximum a posteriori) estimate:

θmap := arg max_θ fcurrent(θ)
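
Both estimates can be read off a gridded density (a sketch I added); for the symmetric posterior of Example 3 they happen to coincide, but in general they differ.

```python
# Sketch: posterior mean (Bayesian estimate) and MAP estimate from a gridded density.
import numpy as np

theta = np.linspace(0.0, 1.0, 10_001)
dtheta = theta[1] - theta[0]
f = 6 * (1 - theta) * theta                 # posterior from Example 3

theta_bayes = np.sum(theta * f) * dtheta    # integral of theta * f(theta), ~0.5
theta_map = theta[np.argmax(f)]             # mode of the density, 0.5
print(theta_bayes, theta_map)
```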

SLIDE 19

Estimation: Bayesian

The Beta Distribution
For W = {0, 1}, Θ = [0, 1], Pθ(1) = θ one usually uses Beta distributions, defined by the densities

fa,b(θ) := (Γ(a + b) / (Γ(a) · Γ(b))) · θ^(a−1) (1 − θ)^(b−1)   (a, b > 0)

where Γ(x) := ∫_0^∞ t^(x−1) e^(−t) dt. For n ∈ N: Γ(n) = (n − 1)!.

For positive integer parameters a, b:

fa,b(θ) = ((a + b − 1)! / ((a − 1)! (b − 1)!)) · θ^(a−1) (1 − θ)^(b−1) = (a + b − 1) · C(a + b − 2, a − 1) · θ^(a−1) (1 − θ)^(b−1),

i.e. the product of the likelihood of θ given a sample of a − 1 1's and b − 1 0's, and a normalization factor (a + b − 1).
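
The density can be evaluated directly from the Gamma-function formula and checked against scipy's implementation (a sketch I added; the parameter values are arbitrary):

```python
# Sketch: the Beta density f_{a,b} via the Gamma function, checked against scipy.stats.
from math import gamma
from scipy.stats import beta

def f_beta(theta, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * theta**(a - 1) * (1 - theta)**(b - 1)

print(f_beta(0.3, 4, 8))       # direct formula
print(beta(4, 8).pdf(0.3))     # same value from scipy.stats
```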

SLIDE 20

Estimation: Bayesian

Examples

[Figure: plots of the Beta densities beta(1,1), beta(5,5), beta(50,50), beta(4,8), beta(8,12), beta(53,57)]

SLIDE 21

Estimation: Bayesian

Updating a Beta Prior
Updating a Beta prior fa,b with a sample containing k 1's and l 0's leads to a posterior equal to fa+k,b+l. The mean of the Beta distribution is

θ∗ = ∫_0^1 θ fa,b(θ) dθ = a / (a + b)
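
In code the update is just bookkeeping on the two counts (a sketch I added; the prior and the sample counts are made up):

```python
# Sketch: conjugate updating of a Beta prior and the resulting Bayesian estimate a / (a + b).
def update_beta(a, b, k, l):
    """Beta(a, b) prior + sample with k 1's and l 0's -> Beta(a + k, b + l)."""
    return a + k, b + l

a, b = 1, 1                       # uniform prior beta(1, 1)
a, b = update_beta(a, b, 3, 7)    # observe 3 ones and 7 zeros
print((a, b), a / (a + b))        # (4, 8), posterior mean 1/3
```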

SLIDE 22

Estimation: Bayesian

Asymptotic Independence from the Prior

[Figure: two Beta priors, beta(1,1) and beta(3,7), updated with the same observations: beta(1,1) → beta(6,6) → beta(51,51) and beta(3,7) → beta(8,12) → beta(53,57). As the number of observations grows, the two posteriors become nearly identical.]

SLIDE 23

Estimation: Bayesian

Dirichlet Distribution
For W = {w1, . . . , wk} the Beta distribution is generalized by the Dirichlet distribution:

fa1,...,ak(p1, . . . , pk) := (Γ(a1 + . . . + ak) / (Γ(a1) · · · Γ(ak))) · p1^(a1−1) · · · pk^(ak−1)   (ai > 0)
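
By analogy with the Beta case (my addition, not stated on the slide), updating a Dirichlet prior with multinomial counts just adds the counts to the parameters, and the posterior mean is ai / Σ_j aj. The sketch below also evaluates the density from the Gamma-function formula, with made-up counts.

```python
# Sketch: Dirichlet density and the conjugate update a_i -> a_i + n_i for multinomial counts.
from math import gamma
import numpy as np

def dirichlet_pdf(p, a):
    p, a = np.asarray(p, dtype=float), np.asarray(a, dtype=float)
    const = gamma(a.sum()) / np.prod([gamma(ai) for ai in a])
    return const * np.prod(p ** (a - 1))

a = np.array([1.0, 1.0, 1.0])          # uniform prior on the probability simplex
counts = np.array([5, 3, 2])           # hypothetical counts for w1, w2, w3
a_post = a + counts                    # conjugate update (analogous to the Beta case)
print(dirichlet_pdf([0.5, 0.3, 0.2], a_post))
print(a_post / a_post.sum())           # posterior mean estimate of (p1, p2, p3)
```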

SLIDE 24

Estimation

Literature
  • Cowell et al.: Probabilistic Networks and Expert Systems. Chapter 9.
  • D. Heckerman: A Tutorial on Learning With Bayesian Networks. Microsoft Research Technical Report MSR-TR-95-06. Also in: M. I. Jordan (Ed.), Learning in Graphical Models. MIT Press.
  • E. L. Lehmann, G. Casella: Theory of Point Estimation. Springer.
  • Any other textbook on statistics.
