Structure Learning – Daphne Koller – PowerPoint PPT Presentation


SLIDE 1

Daphne Koller

Structure Learning

Probabilistic Graphical Models

BN Structure Learning

SLIDE 2

Why Structure Learning

  • To learn a model for new queries, when domain expertise is not perfect
  • For structure discovery, when inferring the network structure is a goal in itself

SLIDE 3

Importance of Accurate Structure

Missing an arc:
  • Incorrect independencies
  • Correct distribution P* cannot be learned
  • But could generalize better

Adding an arc:
  • Spurious dependencies
  • Can correctly learn P*
  • Increases # of parameters
  • Worse generalization

[Figure: a network over A, B, C, D shown with its true structure, a version missing an arc, and a version with an added arc]

SLIDE 4

Score-Based Learning

[Figure: a dataset of samples over A, B, C and several candidate network structures]

  • Define a scoring function that evaluates how well a structure matches the data
  • Search for a structure that maximizes the score

SLIDE 5

Likelihood Structure Score

SLIDE 6

Likelihood Score

  • Find (G, θ) that maximize the likelihood
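The formula on this slide was an image and did not survive extraction; as a reconstruction of the standard definition used in the course, the likelihood score is:

```latex
\mathrm{score}_L(G : \mathcal{D})
  \;=\; \ell(\hat{\theta}_G : \mathcal{D})
  \;=\; \max_{\theta}\, \log P(\mathcal{D} \mid G, \theta)
```

where θ̂_G denotes the maximum-likelihood parameters for structure G.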
SLIDE 7

Example

[Figure: two networks over X and Y – one with the edge X → Y, one with no edge – and their likelihood scores]
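The scores being compared here were rendered as images; the standard computation for this two-variable example gives:

```latex
\mathrm{score}_L(X \rightarrow Y : \mathcal{D})
  \;-\; \mathrm{score}_L(X \;\; Y : \mathcal{D})
  \;=\; M \cdot I_{\hat{P}}(X; Y)
```

where M is the number of samples and I_P̂ is the mutual information under the empirical distribution P̂, so the edge is preferred whenever the empirical MI is positive.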

SLIDE 8

General Decomposition

  • The likelihood score decomposes as:
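The decomposition itself was an image; the standard form is:

```latex
\mathrm{score}_L(G : \mathcal{D})
  \;=\; M \sum_i I_{\hat{P}}\!\left(X_i ;\, \mathrm{Pa}_{X_i}^{G}\right)
  \;-\; M \sum_i H_{\hat{P}}(X_i)
```

The second (entropy) term does not depend on G, so comparing structures only involves the mutual-information terms.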
SLIDE 9

Limitations of Likelihood Score

  • Mutual information is always ≥ 0
    – Equals 0 iff X, Y are independent in the empirical distribution
  • Adding edges can’t hurt, and almost always helps
  • Score is maximized by the fully connected network

[Figure: two networks over X and Y, with and without the edge X → Y]
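The claim that adding edges "almost always helps" can be checked directly: even when X and Y are truly independent, the empirical mutual information of a finite sample is almost never exactly zero. A minimal sketch (the sampling setup is an illustrative choice, not from the slides):

```python
import math
import random
from collections import Counter

def empirical_mi(pairs):
    """Mutual information I(X; Y) under the empirical distribution of `pairs`."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in pxy.items():
        p = c / n
        # p(x,y) * log[ p(x,y) / (p(x) p(y)) ]
        mi += p * math.log(p * n * n / (px[x] * py[y]))
    return mi

random.seed(0)
# X and Y are sampled as genuinely independent fair coins...
sample = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(100)]
mi = empirical_mi(sample)
# ...yet the empirical MI is (almost surely) strictly positive, so the
# likelihood score still rewards adding the edge X -> Y.
print(mi)
```

Only a sample whose counts factorize exactly gives empirical MI of 0, which is why the likelihood score overfits toward dense structures.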

SLIDE 10

Avoiding Overfitting

  • Restricting the hypothesis space
    – restrict # of parents or # of parameters
  • Scores that penalize complexity:
    – Explicitly
    – Bayesian score averages over all possible parameter values

SLIDE 11

Summary

  • Likelihood score computes log-likelihood of D relative to G, using MLE parameters
    – Parameters optimized for D
  • Nice information-theoretic interpretation in terms of (in)dependencies in G
  • Guaranteed to overfit the training data (if we don’t impose constraints)

SLIDE 12

BIC Score and Asymptotic Consistency

SLIDE 13

Penalizing Complexity

  • Tradeoff between fit to data and model complexity
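The BIC score itself appeared as an equation image; the standard form referenced throughout this section is:

```latex
\mathrm{score}_{BIC}(G : \mathcal{D})
  \;=\; \ell(\hat{\theta}_G : \mathcal{D})
  \;-\; \frac{\log M}{2}\, \mathrm{Dim}[G]
```

where M is the number of samples and Dim[G] is the number of independent parameters in G; the first term rewards fit, the second penalizes complexity.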

SLIDE 14

Asymptotic Behavior

  • Mutual information grows linearly with M while complexity grows logarithmically with M
    – As M grows, more emphasis is given to fit to data

SLIDE 15

Consistency

  • As M → ∞, the true structure G* (or any I-equivalent structure) maximizes the score
    – Asymptotically, spurious edges will not contribute to likelihood and will be penalized
    – Required edges will be added due to linear growth of the likelihood term compared to logarithmic growth of model complexity

SLIDE 16

Summary

  • BIC score explicitly penalizes model complexity (# of independent parameters)
    – Its negation is often called MDL
  • BIC is asymptotically consistent:
    – If data is generated by G*, networks I-equivalent to G* will have the highest score as M grows to ∞

SLIDE 17

Bayesian Score

SLIDE 18

Bayesian Score

P(G | D) = P(D | G) · P(G) / P(D)

where P(D | G) is the marginal likelihood, P(G) is the prior over structures, and P(D) is the marginal probability of the data.

SLIDE 19

Marginal Likelihood of Data Given G

P(D | G) = ∫ P(D | G, θ) P(θ | G) dθ

where P(D | G, θ) is the likelihood and P(θ | G) is the prior over parameters.

SLIDE 20

Marginal Likelihood Intuition

SLIDE 21

Marginal Likelihood: BayesNets

The closed form uses the Gamma function:

Γ(x) = ∫₀^∞ t^(x−1) e^(−t) dt,  Γ(x+1) = x · Γ(x)
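The marginal-likelihood closed form on this slide did not survive extraction; for a Bayesian network with Dirichlet parameter priors, the standard expression (in the notation of the course, where M[·] are the sufficient-statistic counts and α_u = Σ_x α_{x,u}) is:

```latex
P(\mathcal{D} \mid G)
  \;=\; \prod_i \;\prod_{u \,\in\, \mathrm{Val}(\mathrm{Pa}_{X_i}^{G})}
  \frac{\Gamma(\alpha_{u})}{\Gamma(\alpha_{u} + M[u])}
  \prod_{x \,\in\, \mathrm{Val}(X_i)}
  \frac{\Gamma(\alpha_{x,u} + M[x,u])}{\Gamma(\alpha_{x,u})}
```

Each factor compares prior pseudo-counts α against the pseudo-counts updated with the observed counts M, which is why the Gamma identities above are needed to evaluate it.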

SLIDE 22

Marginal Likelihood Decomposition

SLIDE 23

Structure Priors

  • Structure prior P(G)
    – Uniform prior: P(G) ∝ constant
    – Prior penalizing # of edges: P(G) ∝ c^|G| (0 < c < 1)
    – Prior penalizing # of parameters
  • Normalizing constant across networks is similar and can thus be ignored

SLIDE 24

Parameter Priors

  • Parameter prior P(θ | G) is usually the BDe prior
    – α: equivalent sample size
    – B0: network representing prior probability of events
    – Set α(x_i, pa_i^G) = α · P(x_i, pa_i^G | B0)
  • Note: pa_i^G are not the same as the parents of X_i in B0
  • A single network provides priors for all candidate networks
  • Unique prior with the property that I-equivalent networks have the same Bayesian score

SLIDE 25

BDe and BIC

  • As M → ∞, a network G with Dirichlet priors satisfies:
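The asymptotic relation was an equation image; the standard result is:

```latex
\log P(\mathcal{D} \mid G)
  \;=\; \ell(\hat{\theta}_G : \mathcal{D})
  \;-\; \frac{\log M}{2}\, \mathrm{Dim}[G] \;+\; O(1)
```

so the log marginal likelihood approaches the BIC score up to a constant, which is why BDe and BIC agree asymptotically.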

SLIDE 26

Summary

  • Bayesian score averages over parameters to avoid overfitting
  • Most often instantiated as BDe
    – BDe requires assessing a prior network
    – Can naturally incorporate prior knowledge
    – I-equivalent networks have the same score
  • Bayesian score:
    – Asymptotically equivalent to BIC
    – Asymptotically consistent
    – But for small M, BIC tends to underfit

SLIDE 27

Structure Learning in Trees

SLIDE 28

Score-Based Learning

[Figure: a dataset of samples over A, B, C and several candidate network structures]

  • Define a scoring function that evaluates how well a structure matches the data
  • Search for a structure that maximizes the score

SLIDE 29

Optimization Problem

Input:
  – Training data
  – Scoring function (including priors, if needed)
  – Set of possible structures

Output: A network that maximizes the score

Key Property: Decomposability
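Decomposability means the total score is a sum of independent per-family terms; in notation (FamScore is a label introduced here for the per-family term, not the slides' own symbol):

```latex
\mathrm{score}(G : \mathcal{D})
  \;=\; \sum_i \mathrm{FamScore}\!\left(X_i \mid \mathrm{Pa}_{X_i}^{G} : \mathcal{D}\right)
```

The likelihood, BIC, and BDe scores all decompose this way, which is what makes efficient search possible.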

SLIDE 30

Learning Trees/Forests

  • Forests
    – At most one parent per variable
  • Why trees?
    – Elegant math
    – Efficient optimization
    – Sparse parameterization

SLIDE 31

Learning Forests

  • p(i) = parent of X_i, or 0 if X_i has no parent
  • Score = sum of edge scores + constant
    – the constant is the score of the “empty” network
    – each edge score is the improvement over the “empty” network

SLIDE 32

Learning Forests I

  • Set w(i→j) = Score(X_j | X_i) − Score(X_j)
  • For the likelihood score, w(i→j) = M · I_P̂(X_i; X_j), and all edge weights are nonnegative
    ⇒ Optimal structure is always a tree
  • For BIC or BDe, weights can be negative
    ⇒ Optimal structure might be a forest

SLIDE 33

Learning Forests II

  • A score satisfies score equivalence if I-equivalent structures have the same score
    – Such scores include likelihood, BIC, and BDe
  • For such a score, we can show w(i→j) = w(j→i), and use an undirected graph

SLIDE 34

Learning Forests III (for score-equivalent scores)

  • Define undirected graph with nodes {1,…,n}
  • Set w(i,j) = max[Score(X_j | X_i) − Score(X_j), 0]
  • Find forest with maximal weight
    – Standard algorithms for max-weight spanning trees (e.g., Prim’s or Kruskal’s) in O(n²) time
    – Remove all edges of weight 0 to produce a forest
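The three-step procedure above can be sketched directly; a minimal implementation using Kruskal's algorithm with a union-find structure (the `weights` dictionary format and function name are illustrative choices, not from the slides):

```python
def learn_forest(n, weights):
    """Max-weight spanning forest over nodes 0..n-1.

    weights: dict mapping frozenset({i, j}) -> symmetric edge weight w(i, j),
    assumed already clamped at 0 as on the slide. Returns the set of chosen edges.
    """
    parent = list(range(n))  # union-find forest for cycle detection

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    forest = set()
    # Kruskal: consider edges in decreasing weight order, skipping any
    # edge that would close a cycle.
    for edge, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        if w <= 0:
            break  # zero-weight edges are removed, leaving a forest
        i, j = tuple(edge)
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            forest.add(edge)
    return forest
```

With w(i,j) = M · I_P̂(X_i; X_j) this recovers the classic Chow-Liu tree; with BIC or BDe edge weights, dropped nonpositive edges can leave a forest.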

SLIDE 35

Learning Forests: Example

[Figure: tree learned from data generated by the Alarm network, overlaid on the true network, with correct edges and spurious edges marked]

  • Not every edge in the tree is in the original network
  • Inferred edges are undirected – can’t determine direction

SLIDE 36

Summary

  • Structure learning is an optimization over the combinatorial space of graph structures
  • Decomposability ⇒ network score is a sum of terms for different families
  • Optimal tree-structured network can be found using standard MST algorithms
  • Computation takes quadratic time
SLIDE 37

General Graphs: Search

SLIDE 38

Optimization Problem

Input:
  – Training data
  – Scoring function
  – Set of possible structures

Output: A network that maximizes the score

SLIDE 39

Beyond Trees

  • Problem is not obvious for general networks
    – Example: allowing two parents, a greedy algorithm is no longer guaranteed to find the optimal network
  • Theorem:
    – Finding the maximal scoring network structure with at most k parents for each variable is NP-hard for k > 1

SLIDE 40

Heuristic Search

[Figure: a network over A, B, C, D and the neighbors produced by single edge additions, deletions, and reversals]

SLIDE 41

Heuristic Search

  • Search operators:
    – local steps: edge addition, deletion, reversal
    – global steps
  • Search techniques:
    – Greedy hill-climbing
    – Best-first search
    – Simulated annealing
    – ...

SLIDE 42

Search: Greedy Hill Climbing

  • Start with a given network
    – empty network
    – best tree
    – a random network
    – prior knowledge
  • At each iteration
    – Consider score for all possible changes
    – Apply change that most improves the score
  • Stop when no modification improves the score
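The iteration above is ordinary greedy hill-climbing; a minimal generic sketch, where `score` and `neighbors` are placeholders standing in for a decomposable structure score and the local edge-change operators:

```python
def hill_climb(start, score, neighbors):
    """Greedy ascent: repeatedly apply the single best local change.

    start: initial structure; score: structure -> float;
    neighbors: structure -> iterable of candidate structures.
    """
    current, current_score = start, score(start)
    while True:
        # Consider the score of every candidate one local change away.
        best, best_score = None, current_score
        for cand in neighbors(current):
            s = score(cand)
            if s > best_score:
                best, best_score = cand, s
        if best is None:
            # No modification improves the score: stop.
            return current
        current, current_score = best, best_score
```

In the structure-learning setting, `neighbors` would enumerate legal (acyclic) edge additions, deletions, and reversals, and `score` would be BIC or BDe; the strict `>` means the search halts on plateaux as well as at local maxima, which is exactly the pitfall the next slide discusses.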

SLIDE 43

Greedy Hill Climbing Pitfalls

  • Greedy hill-climbing can get stuck in:
    – Local maxima
    – Plateaux
  • Typically because equivalent networks are often neighbors in the search space

SLIDE 44

Why Edge Reversal

[Figure: networks over A, B, C differing in the direction of one edge, illustrating why reversal is needed as a single search operator]

SLIDE 45

A Pretty Good, Simple Algorithm

  • Greedy hill-climbing, augmented with:
  • Random restarts:
    – When we get stuck, take some number of random steps and then start climbing again
  • Tabu list:
    – Keep a list of the K most recently taken steps
    – Search cannot reverse any of these steps

SLIDE 46

Example: ICU-Alarm

[Figure: KL divergence vs. number of samples M (500 to 5000), comparing True Structure/BDe α = 10 against Unknown Structure/BDe α = 10]

SLIDE 47

JamBayes

Horvitz, Apacible, Sarin, & Liao, UAI 2005

SLIDE 48

Predicting Surprises

Horvitz, Apacible, Sarin, & Liao, UAI 2005

SLIDE 49

Learned Model

Horvitz, Apacible, Sarin, & Liao, UAI 2005

SLIDE 50

Influences in Learned Model

Horvitz, Apacible, Sarin, & Liao, UAI 2005

SLIDE 51

Biological Network Reconstruction

[Figure: learned protein-signaling network over PKC, Raf, Erk, Mek, Plcγ, PKA, Akt, Jnk, P38, PIP2, PIP3, distinguishing phospho-proteins, phospho-lipids, nodes perturbed in the data, and an edge subsequently validated in the wet lab]

Edges: Known 15/17, Supported 2/17, Reversed 1, Missed 3

From “Causal protein-signaling networks derived from multiparameter single-cell data”, Sachs et al., Science 308:523, 2005. Reprinted with permission from AAAS. This figure may be used for non-commercial and classroom purposes only. Any other uses require prior written permission from AAAS.

SLIDE 52

Summary

  • Useful for building better predictive models:
    – when domain experts don’t know the structure
    – for knowledge discovery
  • Finding the highest-scoring structure is NP-hard
  • Typically solved using simple heuristic search
    – local steps: edge addition, deletion, reversal
    – hill-climbing with tabu lists and random restarts
  • But there are better algorithms
SLIDE 53

General Graphs: Decomposability

SLIDE 54

Heuristic Search

[Figure: a network over A, B, C, D and the neighbors produced by single edge changes]

SLIDE 55

Naïve Computational Analysis

  • Operators per search step: O(n²)
  • Cost per network evaluation:
    – Components in score
    – Compute sufficient statistics
    – Acyclicity check
  • Total: O(n² (Mn + m)) per search step
SLIDE 56

Exploiting Decomposability

[Figure: a network over A, B, C, D before and after adding the edge B → D]

score = Score(A | {}) + Score(B | {}) + Score(C | {A,B}) + Score(D | {C})
score′ = Score(A | {}) + Score(B | {}) + Score(C | {A,B}) + Score(D | {B,C})

Δscore(D) = Score(D | {B,C}) − Score(D | {C})

SLIDE 57

Exploiting Decomposability

[Figure: a network over A, B, C, D and the neighbors produced by single edge changes]

Δscore(D) = Score(D | {B,C}) − Score(D | {C})
Δscore(C) = Score(C | {A}) − Score(C | {A,B})
Δscore(C) + Δscore(B) = Score(C | {A}) − Score(C | {A,B}) + Score(B | {C}) − Score(B | {})
SLIDE 58

Exploiting Decomposability

[Figure: a network over A, B, C, D and the neighbors produced by single edge changes]

To recompute scores, only need to re-score families that changed in the last move

Δscore(C) = Score(C | {A}) − Score(C | {A,B})
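The delta-score bookkeeping above can be sketched as follows; `fam_score` and the function names are placeholders introduced here for any decomposable family score (likelihood, BIC, BDe), and only the families a move touches are re-scored:

```python
def delta_add_edge(fam_score, parents, u, v):
    """Change in total score from adding edge u -> v (only v's family changes)."""
    old = frozenset(parents[v])
    return fam_score(v, old | {u}) - fam_score(v, old)

def delta_delete_edge(fam_score, parents, u, v):
    """Change in total score from deleting edge u -> v (only v's family changes)."""
    old = frozenset(parents[v])
    return fam_score(v, old - {u}) - fam_score(v, old)

def delta_reverse_edge(fam_score, parents, u, v):
    """Change from reversing u -> v: exactly two families (u and v) change."""
    return (delta_delete_edge(fam_score, parents, u, v)
            + delta_add_edge(fam_score, parents, v, u))
```

Because every other family term cancels in the difference, an edge addition or deletion costs one family re-score and a reversal costs two, rather than re-evaluating the whole network.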
SLIDE 59

Computational Cost

  • Cost per move
    – Compute O(n) delta-scores damaged by the move
    – Each one takes O(M) time
  • Keep a priority queue of operators sorted by delta-score – O(n log n)

SLIDE 60

More Computational Efficiency

  • Reuse and adapt previously computed sufficient statistics
  • Restrict in advance the set of operators considered in the search

SLIDE 61

Summary

  • Even heuristic structure search can get expensive for large n
  • Can exploit decomposability to get orders of magnitude reduction in cost
  • Other tricks are also used for scaling