Learning Bayesian Networks: Naïve and non-Naïve Bayes
Hypothesis Space
– fixed size
– stochastic
– continuous parameters

The Naïve Bayes Model
– y: P(y = k)
– xj: P(xj = v | y = k)  ("class conditional probability")
– Generative story:
  – Choose the class k according to P(y = k)
  – Generate each feature independently according to P(xj = v | y = k)
– The feature values are conditionally independent given the class:
  P(xi, xj | y) = P(xi | y) · P(xj | y)

[Figure: the naïve Bayes network, with a class node y pointing to feature nodes x1, x2, x3, …, xn]
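To make the generative story concrete, here is a minimal Python sketch (the two-class, three-feature setup, the parameter values, and the sample_example helper are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: K = 2 classes, n = 3 binary features.
prior = np.array([0.6, 0.4])        # P(y = k)
cond = np.array([[0.9, 0.2],        # P(x_j = 1 | y = k); rows are features j,
                 [0.3, 0.8],        # columns are classes k
                 [0.5, 0.5]])

def sample_example():
    """Generative story: choose a class, then generate each feature
    independently given that class."""
    k = rng.choice(len(prior), p=prior)            # k ~ P(y = k)
    x = rng.random(cond.shape[0]) < cond[:, k]     # x_j ~ P(x_j | y = k)
    return x.astype(int), k

print(sample_example())
```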
Representing P(xj | y)
– Univariate Gaussian
  – if xj is a continuous random variable, we can use a normal distribution and learn the mean µ and variance σ²
– Multinomial
  – if xj is a discrete random variable, xj ∈ {v1, …, vm}, then we construct the conditional probability table
– Discretization
  – convert continuous xj into a discrete variable
– Kernel Density Estimates
  – apply a kind of nearest-neighbor algorithm to compute P(xj | y) in the neighborhood of the query point
The conditional probability table for a multinomial xj:

             y = 1               y = 2               …   y = K
xj = v1      P(xj = v1 | y=1)    P(xj = v1 | y=2)    …   P(xj = v1 | y=K)
xj = v2      P(xj = v2 | y=1)    P(xj = v2 | y=2)    …   P(xj = v2 | y=K)
…            …                   …                   …   …
xj = vm      P(xj = vm | y=1)    P(xj = vm | y=2)    …   P(xj = vm | y=K)
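Each column of this table is estimated by simple counting within a class. A minimal sketch (estimate_cpt and the integer coding of values and classes are illustrative assumptions):

```python
import numpy as np

def estimate_cpt(xj, y, m, K):
    """Estimate P(x_j = v | y = k) by counting. xj holds integer-coded
    feature values in {0, ..., m-1}; y holds classes in {0, ..., K-1}.
    Returns an (m, K) table with one column per class."""
    cpt = np.zeros((m, K))
    for k in range(K):
        vals = xj[y == k]
        for v in range(m):
            cpt[v, k] = np.mean(vals == v)
    return cpt

xj = np.array([0, 1, 2, 2, 1, 0, 2, 2])   # toy feature values
y  = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # toy class labels
print(estimate_cpt(xj, y, m=3, K=2))
```

Note that a value never seen in some class gets probability 0; this motivates the Laplace estimates discussed below.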
Discretization via decision trees
– To discretize feature xj, grow a decision tree considering only splits on xj. Each leaf of the resulting tree will correspond to a single value of the discretized xj.
– Stopping rule (applied at each node). Stop when

  I < log2(N − 1)/N + [log2(3^K − 2) − (K·H(S) − Kl·H(Sl) − Kr·H(Sr))]/N

  where S is the training data in the parent node; Sl and Sr are the examples in the left and right child; K, Kl, and Kr are the corresponding numbers of classes present in these examples; I is the mutual information between the split and the class, H is the entropy, and N is the number of examples in S. (This is Fayyad and Irani's MDL stopping criterion.)
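In code, the stopping test looks roughly like this (a sketch assuming the formula above; entropy and should_stop are illustrative names):

```python
import numpy as np

def entropy(labels):
    """Entropy H (in bits) of the class distribution in labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def should_stop(S, Sl, Sr):
    """MDL stopping test for splitting the labels S into Sl and Sr:
    stop when the split's information gain does not pay for the cost
    of encoding the split."""
    N = len(S)
    K, Kl, Kr = (len(np.unique(part)) for part in (S, Sl, Sr))
    # I: mutual information between the split and the class (information gain)
    gain = entropy(S) - (len(Sl) / N) * entropy(Sl) - (len(Sr) / N) * entropy(Sr)
    delta = np.log2(3**K - 2) - (K * entropy(S) - Kl * entropy(Sl) - Kr * entropy(Sr))
    return gain < (np.log2(N - 1) + delta) / N

S = np.array([0, 0, 0, 1, 1, 1])
print(should_stop(S, S[:3], S[3:]))   # a perfect split: keep splitting
```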
Kernel Density Estimates
– Center a Gaussian kernel of width σ on each training example of the class and average:

  P(xj = v | y = k) = (1/Nk) · Σ{i : yi = k} (1/(√(2π)·σ)) · exp(−½ ((v − xi,j)/σ)²)

  where Nk is the number of training examples in class k.

[Figure: the resulting estimates of P(xj | y) as a function of xj for kernel widths σ = 0.15 and σ = 0.50; the smaller width gives a bumpier, more flexible density.]
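A direct implementation of this estimate (kde_class_conditional is an illustrative name; sigma plays the role of the kernel width σ):

```python
import numpy as np

def kde_class_conditional(v, xj_train, y_train, k, sigma=0.15):
    """Kernel density estimate of P(x_j = v | y = k): average a Gaussian
    kernel of width sigma centered on each class-k training value."""
    vals = xj_train[y_train == k]      # the N_k training values of x_j in class k
    z = (v - vals) / sigma
    return np.mean(np.exp(-0.5 * z**2) / (np.sqrt(2 * np.pi) * sigma))

xj_train = np.array([0.1, 0.15, 0.3, 0.8, 0.85, 0.9])
y_train  = np.array([0,   0,    0,   1,   1,    1])
print(kde_class_conditional(0.2, xj_train, y_train, k=0))
```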
Naïve Bayes as a Linear Classifier
The classifier compares posterior odds:

  h(x) = P(Y = 1 | X)/P(Y = K | X)
       = [P(x1 = v1 | Y = 1)/P(x1 = v1 | Y = K)] ··· [P(xn = vn | Y = 1)/P(xn = vn | Y = K)] · [P(Y = 1)/P(Y = K)]

using the naïve Bayes factorization

  P(x | Y = y) = P(x1 = v1 | Y = y) · P(x2 = v2 | Y = y) ··· P(xn = vn | Y = y)

Suppose each xj is binary and define

  αj,0 = log [P(xj = 0 | y = 1)/P(xj = 0 | y = K)]
  αj,1 = log [P(xj = 1 | y = 1)/P(xj = 1 | y = K)]

Taking logs of the posterior odds,

  log [P(y = 1 | x)/P(y = K | x)] = log [P(x1 = v1 | y = 1)/P(x1 = v1 | y = K)] + … + log [P(xn = vn | y = 1)/P(xn = vn | y = K)] + log [P(y = 1)/P(y = K)]

so

  log [P(y = 1 | x)/P(y = K | x)] = Σj (αj,1 − αj,0)·xj + (Σj αj,0 + log [P(y = 1)/P(y = K)])

which is linear in x: with binary features, naïve Bayes is a linear threshold classifier.
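The weights can be computed directly from the estimated probabilities. A two-class sketch (nb_linear_weights and the toy numbers are illustrative assumptions):

```python
import numpy as np

def nb_linear_weights(p1, pK, prior1, priorK):
    """Turn two-class naive Bayes with binary features into linear form.
    p1[j] = P(x_j = 1 | y = 1) and pK[j] = P(x_j = 1 | y = K)."""
    a1 = np.log(p1 / pK)                    # alpha_{j,1}
    a0 = np.log((1 - p1) / (1 - pK))        # alpha_{j,0}
    w = a1 - a0                             # per-feature weights
    b = a0.sum() + np.log(prior1 / priorK)  # bias term
    return w, b

# log P(y=1|x)/P(y=K|x) = w . x + b; predict class 1 when it is positive
w, b = nb_linear_weights(np.array([0.9, 0.3]), np.array([0.2, 0.8]), 0.6, 0.4)
x = np.array([1, 0])
print(w @ x + b > 0)
```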
Laplace Estimates
– When we have very little training data, direct probability computation can give probabilities of 0 or 1. Such extreme probabilities are "too strong" and cause problems.
– Suppose we are estimating a probability P(z) and we have n0 examples where z is false and n1 examples where z is true. Our direct estimate is

  P(z) = n1/(n0 + n1)

– Laplace Estimate. Add 1 to the numerator and 2 to the denominator:

  P(z) = (n1 + 1)/(n0 + n1 + 2)

  This says that in the absence of any evidence, we expect P(z) = 0.5, but our belief is weak (equivalent to 1 example for each outcome).
– Generalized Laplace Estimate. If z has K different outcomes, then we estimate it as

  P(z = k) = (nk + 1)/(n1 + n2 + … + nK + K)
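A sketch of the generalized estimate (laplace_estimate is an illustrative name):

```python
import numpy as np

def laplace_estimate(counts):
    """Generalized Laplace estimate: add 1 to each outcome's count and
    K to the denominator, so an outcome never seen still gets
    probability 1/(N + K) rather than 0."""
    counts = np.asarray(counts, dtype=float)
    return (counts + 1) / (counts.sum() + len(counts))

print(laplace_estimate([0, 3]))   # n0 = 0, n1 = 3  ->  [0.2, 0.8]
```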
Bayes Nets and Causality
– Bayes nets work best when arrows follow the direction of causality
  – two things with a common cause are likely to be conditionally independent given the cause; arrows in the causal direction capture this independence
– In a Naïve Bayes network, arrows are often not in the causal direction
  – diabetes does not cause pregnancies
  – diabetes does not cause age
– But some arrows are correct
  – diabetes does cause the level of blood insulin and blood glucose
– A network with causally-directed arrows for insulin (I) and glucose (G) gives

  P(D = d | A, P, M, I, G) = P(I | D = d) · P(G | I, D = d) · P(D = d | A, M, P) / P(I, G)
Comparing learning algorithms (disc = only after discretization):

Criterion                  LMS  Logistic  LDA  Trees  Nets  NNbr  NB    SVM
Mixed data                 no   no        no   yes    no    no    yes   no
Missing values             no   no        yes  yes    no    some  yes   no
Outliers                   no   yes       no   yes    yes   yes   disc  yes
Monotone transformations   no   no        no   yes    some  no    disc  no
Scalability                yes  yes       yes  yes    yes   no    yes   no
Irrelevant inputs          no   no        no   some   no    no    some  some
Linear combinations        yes  yes       yes  no     yes   some  yes   yes
Interpretable              yes  yes       yes  yes    no    no    yes   some
Accurate                   yes  yes       yes  no     yes   no    yes   yes
Summary
– Naïve Bayes is widely used in text classification and information retrieval, where there are many features compared to the number of examples; simple estimation methods work well, as well as more sophisticated methods.
– Produces stochastic classifiers
  – can be combined with utility functions to make optimal decisions
– Easy to incorporate causal knowledge
  – resulting probabilities are easy to interpret
– Very simple learning algorithms
  – if all variables are observed in training data
– Fixed size hypothesis space
  – may underfit or overfit the data
  – may not contain any good classifiers if prior knowledge is wrong
– Harder to handle continuous features