learning in bayes nets bayes nets
play

Learning in Bayes Nets Bayes Nets: 1. Parameter Learning/Estimation: - PowerPoint PPT Presentation

Learning in Bayes Nets Bayes Nets: 1. Parameter Learning/Estimation: infer from data, given G Learning Parameters and Parents P(W|Pa) P(~W|Pa) 1 =? 1 1 ~L,~R Structure 2 =? 1 2 ~L,R 3 =? 1 3 L,~R 4


  1. Learning in Bayes Nets Bayes Nets: 1. Parameter Learning/Estimation: infer from data, given G Θ Learning Parameters and Parents P(W|Pa) P(~W|Pa) θ 1 =? 1 − θ 1 ~L,~R Structure θ 2 =? 1 − θ 2 ~L,R θ 3 =? 1 − θ 3 L,~R θ 4 =? 1 − θ 4 L,R Machine Learning 10-701 2. Structure Learning: inferring G and from data Θ Anna Goldenberg Parents P(W|Pa) P(~W|Pa) ? ? θ 1 =? 1 − θ 1 ~L,~R ? θ 2 =? 1 − θ 2 ~L,R θ 3 =? 1 − θ 3 L,~R θ 4 =? 1 − θ 4 L,R ? ? ? Parents P(W|Pa) P(~W|Pa) θ 1 =? 1 − θ 1 ~L,~R θ 2 =? 1 − θ 2 ~L,R θ 3 =? 1 − θ 3 L,~R Parameter Learning Parameter Estimation Outline θ 4 =? 1 − θ 4 L,R � Frequentist Parameter Estimation � G is a given DAG over N variables � MLE � Goal: Estimate from iid data , D = ( x 1 , . . . , x M ) θ where M is the number of records � example of estimation with discrete data x m = { x m 1 , . . . , x m N } � Each record � MAP � estimate for discrete data � Complete Observability (no missing values) � Bayesian Parameter Estimation � How it’s different from Frequentist

  2. Maximum Likelihood Estimator Example: MLE for one variable � � � Likelihood (for iid data): p ( x m i | x m � Variable X ~ Multinomial with K values (K-sided die) p ( D | θ ) = π i , θ ) m i � Observe M rolls: 1, 4, K, 2, ... � Log likelihood � � log p ( x m i | x m l ( θ ; D ) = log p ( D | θ ) = π i , θ ) � model , (2) � p ( X = k ) = θ k θ k = 1 m i k ˆ θ ML = arg max l ( θ ; D ) � MLE N k log ( θ k ) (1) I ( x m = k ) θ k = I ( x m = k ) log( θ k ) = � � � � � l ( θ ; D ) = log θ m k m k � advantages: has nice statistical properties Maximizing (1) subject to constraint (2): � disadvantages: can overfit θ k,ML = N k ˆ the fraction of times k occurs M Discrete Bayes Nets Continuous Variables Example: Gaussian Variables <=> One variable: X ∼ N ( µ, σ ) ML estimates: � m x m µ ML = ˆ M � Assume each CPD is represented as a table � m ( x m − ˆ µ ML ) 2 ˆ σ 2 ML = M � Loglikelihood: Similarly for several Continuous Variables Another option to estimate parameters: X i ∼ f ( Pa i , θ ) � Parameter Estimator:

  3. Maximum A Posteriori estimate Example: MAP for Multinomial (MAP) θ N ijk � � � MLE is obtained by maximizing loglikelihood Multinomial likelihood: P ( D | θ ) = ijk m ijk � sensitive to small sample sizes ijk θ ( α ijk − 1) � ijk Dirichlet Prior: P ( θ | α ) = Z ( α ) � MAP comes from maximizing posterior θ N ijk + α ijk − 1 � Posterior: P ( θ | D, α ) ∝ p ( θ | D ) ∼ p ( D | θ ) p ( θ ) = likelihood × prior ijk ijk N ijk + α ijk ˆ � prior acts as a smoothing factor θ MAP MAP = ijk � j � ( N ij � k + α ij � k ) can be thought of as virtual pseudo counts α Bayesian vs Frequentist � Frequentist: � are unknown constants θ Questions on � MLE is a very common frequentist estimator Parameter Learning? � Bayesian θ � unknown are random variables � estimates differ based on a prior

  4. What if G is not given? Structural Learning � When? � Constraint Based � Scientific discovery (protein networks, data mining) � Test independencies � Add edges according to the tests � Need a good model for compression, prediction... � Search and Score j_miller m_moore j_kolojejchick u_saranli � Define a selection criterion that measures goodness of a model m_derthick j_kozar � Search in the space of all models (or orders) j_harrison r_munos m_riedmiller � Mix models (recent) j_boyan m_meila a_steinfeld j_schneider b_anderson � Test for almost all independencies t_kanade k_deng a_moore l_kramer � Search and score according to possible l_baird a_ankolekar m_nechyba v_cicirello j_kubica Constraint Based Learning Constraint Based Learning � Define Conditional Independence Test Ind(X i ;X j |S) � Cons: ( O x i ,x j | s − E x i ,x j | s ) 2 χ 2 : � e.g. , � � Independence tests are less reliable on small samples E x i ,x j | s x i ,x j � One incorrect independence test might propagate far (not robust to G 2 , conditional entropy, etc. noise) � if Ind(X i ;X j |S)<p, then independence � Pros: � Choose p with care! � More global decisions => doesn’t get stuck in local minima as much � Construct model consistent with the set of independencies � Works well on sparse nets (small markov blankets, sufficient data)

  5. Score Based Search Maximum likelihood in Outline Information Theoretic terms � Select the highest scoring model! ˆ ˆ � � log P ( D | θ G , G ) = M I ( X i | Pa X i ) − M H ( X i ) � What should the score be? � The entropy does not depend on the current model i i � Specialized structures (trees, TANs) � Thus, it’s enough to maximize mutual information! � Selection operators - how to navigate the space of models? � General case: � Same as constraint search! Theorem: maximizing Bayesian Score for d � 2 � Special case (trees): (not a tree) is NP-hard (Chickering, 2002) � have to consider only all pairs (tree => only one parent): O(N 2 ) Chow Liu tree algorithm Tree Augmented Naive Bayes � Compute empirical distribution: TAN (Friedman et al, 1997) is an extension of Chow Liu C C Naive Bayes TAN � Mutual Information: X1 X2 X3 X M X1 X2 X3 X M � Set as weight per edge between X i and X j TAN: � Find Optimal tree BN by getting the maximum spanning tree for direction: pick a random node as root ˆ ˆ Score(TAN): � � I ( X i , C ) + I ( X j , { Pa X j , C } ) direct in BFS order i j

  6. MI Problem Penalized Likelihood Score � BIC (Bayesian Information Criterion) � Doesn’t penalize complexity: I(A,B) � I(A,{B,C}) θ ML ) − d logP ( D ) ∼ logP ( D | ˆ 2 log ( N ) , where d is the number of free parameters � Adding a parent always increases the score! � AIC (Akaike Information Criterion) � Model will overfit, since the completely connected logP ( D ) ∼ logP ( D | ˆ θ ML ) − d graph would be favored � BIC penalizes complexity more than AIC Minimum Description Length What should the score be? � Consistent : for all G’ I-equivalent to the true G and � Total number of bits needed to describe data is -log 2 P(x) all G* not equivalent to G Score(G)=Score(G’) and Score(G*)<Score(G’) � Instead - send the model and then residuals: � Decomposable : can be locally computed (for efficiency) � Score ( G ; D ) = FamScore ( X i | Pa X i ; D ) -L(D,H) = - logP(H) - log P(D|H) = -log P(H|D) + const i � The best is the one with the shortest message! Example: BIC and AIC are consistent and decomposable

  7. Bayesian Scoring Bayesian Scoring Parameter Prior Parameter Prior � Parameter Prior - important for small datasets! � Bayes Dirichlet equivalent scoring (BDe) : α X i | P a Xi = MP � ( X i , Pa ( X i )) � Is consistent (and decomposable) � Dirichlet Parameters ( from a few slides before ) Theorem: If P(G) assigns the same prior to I-equivalent � For each possible family define a prior distribution structures and Parameter prior is Dirichlet then Bayesian score satisfies score equivalence, if and only if prior is of � Can encode it as a Bayes Net BDe form! � (Usually Independent - product of marginals) � Bayesian Scoring Structure search algorithms Structure Prior � Structure Prior - should satisfy prior modularity � Order in known � Order is unknown � Parameter Modularity: if X has the same set of parents in two different structures, then parameters should be � Search in the space of orderings the same. � Search in the space of DAGs � Search in the space of equivalence classes � Typically set to uniform. 1 � Can be a function of prior counts: α + 1

  8. Order is unknown Order is known Search space of orderings � Select an order according to some heuristic � Suppose the total ordering is � Then for each node X i can find an optimal set of parents in � Use K2 to learn a BN corresponding to the ordering and score it � Choice of parents for X j doesn’t depend on previous X i � Maybe do multiple restarts � Need to search among all choices (where d is the maximum number of parents) for the highest local score � Most recent research: Tessier and Koller (2005) � Greedy search with known order, aka K2 algorithm is Order is unknown Exploiting Decomposable Score Search space of DAGs � Typical search operators � If the operator for edge (X,Y) is valid, then we need only to look at the families of X and Y � Add an edge � Remove an edge � e.g. for addition operator o � Reverse an edge � At most O(n 2 ) steps to get from any graph to any graph � Moves are reversible � Simplest search is Greedy Hillclimbing � Move to proposed new graph if it satisfies constraints

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend