Being Bayesian About Network Structure
A Bayesian Approach to Structure Discovery in Bayesian Networks
Nir Friedman and Daphne Koller
(Presented in CS673, 04/21/2005)
Roadmap

- Bayesian learning of Bayesian networks
  – Exact vs. approximate learning
- Markov Chain Monte Carlo methods
  – MCMC over structures
  – MCMC over orderings
- Experimental results
- Conclusions
Bayesian Networks

- Compact representation of probability distributions via conditional independence

Qualitative part: a directed acyclic graph (DAG)
- Nodes: random variables
- Edges: direct influence

Quantitative part: a set of conditional probability distributions, e.g. for the example network over E, B, R, A, C:

P(A | E, B):
          P(a)   P(!a)
  e  b    0.9    0.1
  e  !b   0.2    0.8
  !e b    0.9    0.1
  !e !b   0.01   0.99

Together they define a unique distribution in factored form:

P(B,E,A,C,R) = P(B) P(E) P(A|B,E) P(R|E) P(C|A)
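To make the factored form concrete, here is a minimal sketch (not from the slides) that evaluates the joint distribution from its factors; the table above supplies P(A | E, B), while the remaining CPT values are purely illustrative.

```python
# Minimal sketch (not from the slides): evaluating the factored joint
# P(B,E,A,C,R) = P(B) P(E) P(A|B,E) P(R|E) P(C|A).
# P(A|E,B) comes from the table above; the other numbers are illustrative.
P_B, P_E = 0.01, 0.02                              # illustrative P(b), P(e)
P_A = {(True, True): 0.9, (True, False): 0.2,      # P(a | E, B), keyed (e, b)
       (False, True): 0.9, (False, False): 0.01}
P_R = {True: 0.95, False: 0.001}                   # illustrative P(r | E)
P_C = {True: 0.7, False: 0.05}                     # illustrative P(c | A)

def bern(p_true, value):
    """Probability of a binary outcome given P(X = true)."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, c, r):
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(e, b)], a)
            * bern(P_R[e], r) * bern(P_C[a], c))

print(joint(b=True, e=False, a=True, c=True, r=False))
```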
Why Learn Bayesian Networks?

- Conditional independencies and the graphical representation capture the structure of many real-world distributions
- Provides insight into the domain
- The graph structure allows "knowledge discovery":
  – Is there a direct connection between X and Y?
  – Does X separate two "subsystems"?
  – Does X causally affect Y?
- Bayesian networks can be used for many tasks
  – Inference, causality, etc.
- Examples: scientific data mining
  – Disease properties and symptoms
  – Interactions between the expression of genes
Learning Bayesian Networks

[Diagram: data plus prior information go into an inducer, which outputs a Bayesian network (structure and CPDs) over E, B, R, A, C.]

- The inducer needs a prior probability distribution P(B) over networks
- Using Bayesian conditioning, update the prior P(B) to the posterior P(B | D)
Why Struggle for Accurate Structure?

[Diagrams: the "true" structure over A, E, B, S, a variant with an added arc, and a variant with a missing arc.]

Adding an arc:
- Increases the number of parameters to be fitted
- Encodes wrong assumptions about causality and domain structure

Missing an arc:
- Cannot be compensated for by accurate parameter fitting
- Also misses the causality and domain structure
Score-based Learning

- Define a scoring function that evaluates how well a structure matches the data

[Figure: candidate structures over E, B, A scored against data samples <E,B,A>: <Y,N,N>, <Y,Y,Y>, <N,Y,Y>, ..., <N,N,N>.]

- Search for the structure that maximizes the score
Bayesian Score of a Model

$$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}$$

where the marginal likelihood is

$$P(D \mid G) = \int P(D \mid G, \theta)\, P(\theta \mid G)\, d\theta$$

(the integrand is the likelihood times the prior over parameters).
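As a concrete illustration of the marginal likelihood (not from the slides): for a single binary variable with no parents and a Beta parameter prior, the integral has a closed form, P(D | G) = B(alpha + n1, beta + n0) / B(alpha, beta). A minimal sketch:

```python
# Hedged sketch: closed-form log marginal likelihood for one binary variable
# with no parents, integrating the Bernoulli parameter against a
# Beta(alpha, beta) prior: P(D | G) = B(alpha + n1, beta + n0) / B(alpha, beta).
from math import lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal_likelihood(n1, n0, alpha=1.0, beta=1.0):
    return log_beta(alpha + n1, beta + n0) - log_beta(alpha, beta)

print(log_marginal_likelihood(7, 3))  # e.g. 7 true and 3 false observations
```

The full Bayesian score multiplies such local terms over all variables and parent configurations; this single-variable case just shows the integral collapsing to a ratio of Beta functions.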
Discovering Structure – Model Selection

[Figure: the posterior P(G | D) over structures, with a single high-scoring network over E, B, R, A, C picked out.]

- Current practice: model selection
  – Pick a single high-scoring model
  – Use that model to infer the domain structure
Discovering Structure – Model Averaging

[Figure: the posterior P(G | D) spread across many structures over E, B, R, A, C.]

- Problem:
  – Small sample size ⇒ many high-scoring models
  – An answer based on one model is often useless
  – We want features common to many models
Bayesian Approach

- Estimate the probability of features:
  – Edge X → Y
  – Markov edge X -- Y
  – Path X ... Y
  – ...

$$P(f \mid D) = \sum_{G} f(G)\, P(G \mid D)$$

where f(G) is the indicator function for the feature (e.g., the edge X → Y) and P(G | D) is the Bayesian score of G (a small enumerated example follows).

- Huge (super-exponential, $2^{\Theta(n^2)}$) number of networks G
- Exact learning is intractable
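For intuition only, a minimal sketch (graphs and numbers hypothetical) of exact feature averaging when the posterior can be enumerated over a handful of graphs:

```python
# Hypothetical example: P(f | D) = sum_G f(G) P(G | D) with an enumerated
# posterior. Graphs are represented as frozensets of directed edges.
def feature_probability(posterior, feature):
    return sum(p for g, p in posterior.items() if feature(g))

posterior = {                                      # made-up P(G | D) values
    frozenset({("E", "A"), ("B", "A")}): 0.6,
    frozenset({("E", "A")}): 0.3,
    frozenset({("B", "A")}): 0.1,
}
has_edge_B_to_A = lambda g: ("B", "A") in g        # indicator f(G)
print(feature_probability(posterior, has_edge_B_to_A))  # 0.7
```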
Approximate Bayesian Learning

- Restrict the search space to G_k, the set of graphs with indegree bounded by k
  – the space is still super-exponential
- Find a set G of high-scoring structures and estimate (see the sketch below)

$$P(f \mid D) \approx \frac{\sum_{G \in \mathcal{G}} f(G)\, P(G \mid D)}{\sum_{G \in \mathcal{G}} P(G \mid D)}$$

- Hill climbing yields only a biased sample of structures
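A minimal sketch of this renormalized average, assuming a hypothetical log_scores mapping from each found structure to its log Bayesian score log P(D, G):

```python
# Hedged sketch: averaging a feature over a found set of high-scoring
# structures, renormalizing the scores within the set.
from math import exp

def averaged_feature_prob(log_scores, feature):
    m = max(log_scores.values())          # subtract max for numerical stability
    weights = {g: exp(s - m) for g, s in log_scores.items()}
    z = sum(weights.values())
    return sum(w for g, w in weights.items() if feature(g)) / z
```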
Markov Chain Monte Carlo over Networks

- MCMC sampling:
  – Define a Markov chain over Bayesian networks
  – Perform a walk through the chain to obtain samples G whose distribution converges to the posterior P(G | D) (a minimal sketch follows this list)
- Possible pitfalls:
  – Still a super-exponential number of networks
  – The time for the chain to converge to the posterior is unknown
  – Islands of high posterior connected by low "bridges"
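A minimal sketch of structure-MCMC (Metropolis-Hastings over DAGs), assuming user-supplied helpers: log_score(G) proportional to log P(D, G), and propose(G) returning an acyclic single-edge modification; for simplicity the sketch assumes a symmetric proposal:

```python
# Hedged sketch of structure-MCMC; log_score and propose are assumed helpers.
import math
import random

def structure_mcmc(g0, log_score, propose, steps):
    g, cur, samples = g0, log_score(g0), []
    for _ in range(steps):
        g_new = propose(g)                # e.g. add/delete/reverse one edge
        new = log_score(g_new)
        # accept with probability min(1, P(D, G') / P(D, G))
        if math.log(random.random()) < new - cur:
            g, cur = g_new, new
        samples.append(g)
    return samples
```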
Better Approach to Approximate Learning

- Put further constraints on the search space:
  – Perform model averaging over the structures consistent with some known (fixed) total ordering ≺
- Ordering of variables: X1 ≺ X2 ≺ ... ≺ Xn means the parents of Xi must come from {X1, X2, ..., Xi-1}
- Intuition: the order decouples the choices of parents
  – The choice of Pa(X7) does not restrict the choice of Pa(X12)
- Given an ordering, we can compute efficiently in closed form (spelled out below):
  – the likelihood P(D | ≺)
  – feature probabilities P(f | D, ≺)
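The decoupling can be written out explicitly (following the decomposition in the Friedman-Koller paper, with notation lightly simplified):

$$P(D \mid \prec) \;=\; \prod_{i=1}^{n} \sum_{U \in \mathcal{U}_{i,\prec}} \mathrm{score}(X_i, U; D)$$

where $\mathcal{U}_{i,\prec}$ is the set of candidate parent sets for $X_i$ that precede it in $\prec$ (with indegree at most $k$), and $\mathrm{score}(X_i, U; D)$ is the local Bayesian score. Because the inner sums range over each variable's parent sets independently, the whole product is computable in closed form.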
Sample Orderings

We can write

$$P(f \mid D) = \sum_{\prec} P(f \mid D, \prec)\, P(\prec \mid D)$$

Sample orderings ≺1, ..., ≺n and approximate

$$P(f \mid D) \approx \frac{1}{n} \sum_{i=1}^{n} P(f \mid D, \prec_i)$$

MCMC sampling (a sketch follows):
- Define a Markov chain over orderings
- Run the chain to get samples from the posterior P(≺ | D)
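A minimal sketch of order-MCMC, using a "flip two positions" proposal (which is symmetric, since undoing a flip is the same move) and an assumed helper log_score_order(order) proportional to log P(D | ≺) + log P(≺):

```python
# Hedged sketch of order-MCMC; log_score_order is an assumed helper.
import math
import random

def order_mcmc(order, log_score_order, steps):
    order = list(order)
    cur, samples = log_score_order(order), []
    for _ in range(steps):
        i, j = random.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]      # propose: flip two slots
        new = log_score_order(order)
        if math.log(random.random()) < new - cur:
            cur = new                                # accept the flip
        else:
            order[i], order[j] = order[j], order[i]  # reject: undo the flip
        samples.append(tuple(order))
    return samples
```

Flips alone already connect the space of orderings, since transpositions generate all permutations of the variables.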
Experiments: Exact Posterior over Orders versus Order-MCMC

[Figure: comparison plots; not recoverable from the extracted text.]
Experiments: Convergence

[Figure: convergence plots; not recoverable from the extracted text.]
Experiments: Structure-MCMC – Posterior Correlation for Two Different Runs

[Figure: scatter plot of posterior estimates from two runs; not recoverable from the extracted text.]
Experiments: Order-MCMC – Posterior Correlation for Two Different Runs

[Figure: scatter plot of posterior estimates from two runs; not recoverable from the extracted text.]
Conclusion

- Order-MCMC performs better than structure-MCMC
References

- N. Friedman and D. Koller. Being Bayesian about Network Structure: A Bayesian Approach to Structure Discovery in Bayesian Networks. Machine Learning Journal, 2002.
- N. Friedman and D. Koller. NIPS 2001 tutorial on learning Bayesian networks from data.
- N. Friedman and M. Goldszmidt. AAAI-98 tutorial on learning Bayesian networks from data.
- D. Heckerman. A Tutorial on Learning with Bayesian Networks. In Learning in Graphical Models, M. Jordan, ed., MIT Press, Cambridge, MA, 1999. Also appears as Technical Report MSR-TR-95-06, Microsoft Research, March 1995; an earlier version appears as Bayesian Networks for Data Mining, Data Mining and Knowledge Discovery, 1:79-119, 1997.
- C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An Introduction to MCMC for Machine Learning. Machine Learning, 2002.
- S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach.