Structure Learning – Daphne Koller – PowerPoint PPT Presentation


SLIDE 1

Daphne Koller

Structure Learning

Probabilistic Graphical Models

BN Structure Learning

SLIDE 2

Why Structure Learning

  • To learn a model for new queries, when domain expertise is not perfect
  • For structure discovery, when inferring the network structure is a goal in itself

SLIDE 3

Importance of Accurate Structure

Missing an arc:
  • Incorrect independencies
  • Correct distribution P* cannot be learned
  • But could generalize better

Adding an arc:
  • Spurious dependencies
  • Can correctly learn P*
  • Increases # of parameters
  • Worse generalization

[Figure: a network over A, B, C, D shown with its true structure, a version missing an arc, and a version with an added arc]

SLIDE 4

Score-Based Learning

[Figure: a dataset of samples over A, B, C and several candidate network structures]

  • Define a scoring function that evaluates how well a structure matches the data
  • Search for a structure that maximizes the score

SLIDE 5

Likelihood Structure Score

SLIDE 6

Likelihood Score

  • Find (G, θ) that maximize the likelihood
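The formula on this slide was an image and did not survive extraction; as a reconstruction of the standard definition used in the course, the likelihood score is:

```latex
\mathrm{score}_L(G : \mathcal{D})
  \;=\; \ell(\hat{\theta}_G : \mathcal{D})
  \;=\; \max_{\theta}\, \log P(\mathcal{D} \mid G, \theta)
```

where θ̂_G denotes the maximum-likelihood parameters for structure G.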
SLIDE 7

Example

[Figure: two networks over X and Y – one with the edge X → Y, one with no edge – and their likelihood scores]
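The scores being compared here were rendered as images; the standard computation for this two-variable example gives:

```latex
\mathrm{score}_L(X \rightarrow Y : \mathcal{D})
  \;-\; \mathrm{score}_L(X \;\; Y : \mathcal{D})
  \;=\; M \cdot I_{\hat{P}}(X; Y)
```

where M is the number of samples and I_P̂ is the mutual information under the empirical distribution P̂, so the edge is preferred whenever the empirical MI is positive.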

SLIDE 8

General Decomposition

  • The likelihood score decomposes as:
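The decomposition itself was an image; the standard form is:

```latex
\mathrm{score}_L(G : \mathcal{D})
  \;=\; M \sum_i I_{\hat{P}}\!\left(X_i ;\, \mathrm{Pa}_{X_i}^{G}\right)
  \;-\; M \sum_i H_{\hat{P}}(X_i)
```

The second (entropy) term does not depend on G, so comparing structures only involves the mutual-information terms.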
SLIDE 9

Limitations of Likelihood Score

  • Mutual information is always ≥ 0
    – Equals 0 iff X, Y are independent in the empirical distribution
  • Adding edges can’t hurt, and almost always helps
  • Score is maximized by the fully connected network

[Figure: two networks over X and Y, with and without the edge X → Y]
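The claim that adding edges "almost always helps" can be checked directly: even when X and Y are truly independent, the empirical mutual information of a finite sample is almost never exactly zero. A minimal sketch (the sampling setup is an illustrative choice, not from the slides):

```python
import math
import random
from collections import Counter

def empirical_mi(pairs):
    """Mutual information I(X; Y) under the empirical distribution of `pairs`."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in pxy.items():
        p = c / n
        # p(x,y) * log[ p(x,y) / (p(x) p(y)) ]
        mi += p * math.log(p * n * n / (px[x] * py[y]))
    return mi

random.seed(0)
# X and Y are sampled as genuinely independent fair coins...
sample = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(100)]
mi = empirical_mi(sample)
# ...yet the empirical MI is (almost surely) strictly positive, so the
# likelihood score still rewards adding the edge X -> Y.
print(mi)
```

Only a sample whose counts factorize exactly gives empirical MI of 0, which is why the likelihood score overfits toward dense structures.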

SLIDE 10

Avoiding Overfitting

  • Restricting the hypothesis space
    – restrict # of parents or # of parameters
  • Scores that penalize complexity:
    – Explicitly
    – Bayesian score averages over all possible parameter values

SLIDE 11

Summary

  • Likelihood score computes log-likelihood of D relative to G, using MLE parameters
    – Parameters optimized for D
  • Nice information-theoretic interpretation in terms of (in)dependencies in G
  • Guaranteed to overfit the training data (if we don’t impose constraints)

SLIDE 12

BIC Score and Asymptotic Consistency

SLIDE 13

Penalizing Complexity

  • Tradeoff between fit to data and model complexity
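The BIC score itself appeared as an equation image; the standard form referenced throughout this section is:

```latex
\mathrm{score}_{BIC}(G : \mathcal{D})
  \;=\; \ell(\hat{\theta}_G : \mathcal{D})
  \;-\; \frac{\log M}{2}\, \mathrm{Dim}[G]
```

where M is the number of samples and Dim[G] is the number of independent parameters in G; the first term rewards fit, the second penalizes complexity.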

SLIDE 14

Asymptotic Behavior

  • Mutual information grows linearly with M while complexity grows logarithmically with M
    – As M grows, more emphasis is given to fit to data

SLIDE 15

Consistency

  • As M → ∞, the true structure G* (or any I-equivalent structure) maximizes the score
    – Asymptotically, spurious edges will not contribute to likelihood and will be penalized
    – Required edges will be added due to linear growth of the likelihood term compared to logarithmic growth of model complexity

SLIDE 16

Summary

  • BIC score explicitly penalizes model complexity (# of independent parameters)
    – Its negation is often called MDL
  • BIC is asymptotically consistent:
    – If data is generated by G*, networks I-equivalent to G* will have the highest score as M grows to ∞

SLIDE 17

Bayesian Score

SLIDE 18

Bayesian Score

P(G | D) = P(D | G) · P(G) / P(D)

where P(D | G) is the marginal likelihood, P(G) is the prior over structures, and P(D) is the marginal probability of the data.

SLIDE 19

Marginal Likelihood of Data Given G

P(D | G) = ∫ P(D | G, θ) P(θ | G) dθ

where P(D | G, θ) is the likelihood and P(θ | G) is the prior over parameters.

SLIDE 20

Marginal Likelihood Intuition

SLIDE 21

Marginal Likelihood: BayesNets

The closed form uses the Gamma function:

Γ(x) = ∫₀^∞ t^(x−1) e^(−t) dt,  Γ(x+1) = x · Γ(x)
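The marginal-likelihood closed form on this slide did not survive extraction; for a Bayesian network with Dirichlet parameter priors, the standard expression (in the notation of the course, where M[·] are the sufficient-statistic counts and α_u = Σ_x α_{x,u}) is:

```latex
P(\mathcal{D} \mid G)
  \;=\; \prod_i \;\prod_{u \,\in\, \mathrm{Val}(\mathrm{Pa}_{X_i}^{G})}
  \frac{\Gamma(\alpha_{u})}{\Gamma(\alpha_{u} + M[u])}
  \prod_{x \,\in\, \mathrm{Val}(X_i)}
  \frac{\Gamma(\alpha_{x,u} + M[x,u])}{\Gamma(\alpha_{x,u})}
```

Each factor compares prior pseudo-counts α against the pseudo-counts updated with the observed counts M, which is why the Gamma identities above are needed to evaluate it.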

SLIDE 22

Marginal Likelihood Decomposition

SLIDE 23

Structure Priors

  • Structure prior P(G)
    – Uniform prior: P(G) ∝ constant
    – Prior penalizing # of edges: P(G) ∝ c^|G| (0 < c < 1)
    – Prior penalizing # of parameters
  • Normalizing constant across networks is similar and can thus be ignored

SLIDE 24

Parameter Priors

  • Parameter prior P(θ | G) is usually the BDe prior
    – α: equivalent sample size
    – B0: network representing prior probability of events
    – Set α(x_i, pa_i^G) = α · P(x_i, pa_i^G | B0)
  • Note: pa_i^G are not the same as the parents of X_i in B0
  • A single network provides priors for all candidate networks
  • Unique prior with the property that I-equivalent networks have the same Bayesian score

SLIDE 25

BDe and BIC

  • As M → ∞, a network G with Dirichlet priors satisfies:
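The asymptotic relation was an equation image; the standard result is:

```latex
\log P(\mathcal{D} \mid G)
  \;=\; \ell(\hat{\theta}_G : \mathcal{D})
  \;-\; \frac{\log M}{2}\, \mathrm{Dim}[G] \;+\; O(1)
```

so the log marginal likelihood approaches the BIC score up to a constant, which is why BDe and BIC agree asymptotically.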

SLIDE 26

Summary

  • Bayesian score averages over parameters to avoid overfitting
  • Most often instantiated as BDe
    – BDe requires assessing a prior network
    – Can naturally incorporate prior knowledge
    – I-equivalent networks have the same score
  • Bayesian score:
    – Asymptotically equivalent to BIC
    – Asymptotically consistent
    – But for small M, BIC tends to underfit

SLIDE 27

Structure Learning in Trees

SLIDE 28

Score-Based Learning

[Figure: a dataset of samples over A, B, C and several candidate network structures]

  • Define a scoring function that evaluates how well a structure matches the data
  • Search for a structure that maximizes the score

SLIDE 29

Optimization Problem

Input:
  – Training data
  – Scoring function (including priors, if needed)
  – Set of possible structures

Output: A network that maximizes the score

Key Property: Decomposability
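Decomposability means the total score is a sum of independent per-family terms; in notation (FamScore is a label introduced here for the per-family term, not the slides' own symbol):

```latex
\mathrm{score}(G : \mathcal{D})
  \;=\; \sum_i \mathrm{FamScore}\!\left(X_i \mid \mathrm{Pa}_{X_i}^{G} : \mathcal{D}\right)
```

The likelihood, BIC, and BDe scores all decompose this way, which is what makes efficient search possible.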

SLIDE 30

Learning Trees/Forests

  • Forests
    – At most one parent per variable
  • Why trees?
    – Elegant math
    – Efficient optimization
    – Sparse parameterization

SLIDE 31

Learning Forests

  • p(i) = parent of X_i, or 0 if X_i has no parent
  • Score = sum of edge scores + constant
    – the constant is the score of the “empty” network
    – each edge score is the improvement over the “empty” network

SLIDE 32

Learning Forests I

  • Set w(i→j) = Score(X_j | X_i) − Score(X_j)
  • For the likelihood score, w(i→j) = M · I_P̂(X_i; X_j), and all edge weights are nonnegative
    ⇒ Optimal structure is always a tree
  • For BIC or BDe, weights can be negative
    ⇒ Optimal structure might be a forest

SLIDE 33

Learning Forests II

  • A score satisfies score equivalence if I-equivalent structures have the same score
    – Such scores include likelihood, BIC, and BDe
  • For such a score, we can show w(i→j) = w(j→i), and use an undirected graph

SLIDE 34

Learning Forests III (for score-equivalent scores)

  • Define undirected graph with nodes {1,…,n}
  • Set w(i,j) = max[Score(X_j | X_i) − Score(X_j), 0]
  • Find forest with maximal weight
    – Standard algorithms for max-weight spanning trees (e.g., Prim’s or Kruskal’s) in O(n²) time
    – Remove all edges of weight 0 to produce a forest
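The three-step procedure above can be sketched directly; a minimal implementation using Kruskal's algorithm with a union-find structure (the `weights` dictionary format and function name are illustrative choices, not from the slides):

```python
def learn_forest(n, weights):
    """Max-weight spanning forest over nodes 0..n-1.

    weights: dict mapping frozenset({i, j}) -> symmetric edge weight w(i, j),
    assumed already clamped at 0 as on the slide. Returns the set of chosen edges.
    """
    parent = list(range(n))  # union-find forest for cycle detection

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    forest = set()
    # Kruskal: consider edges in decreasing weight order, skipping any
    # edge that would close a cycle.
    for edge, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        if w <= 0:
            break  # zero-weight edges are removed, leaving a forest
        i, j = tuple(edge)
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            forest.add(edge)
    return forest
```

With w(i,j) = M · I_P̂(X_i; X_j) this recovers the classic Chow-Liu tree; with BIC or BDe edge weights, dropped nonpositive edges can leave a forest.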

SLIDE 35

Learning Forests: Example

[Figure: tree learned from data generated by the Alarm network, overlaid on the true network, with correct edges and spurious edges marked]

  • Not every edge in the tree is in the original network
  • Inferred edges are undirected – can’t determine direction

SLIDE 36

Summary

  • Structure learning is an optimization over the combinatorial space of graph structures
  • Decomposability ⇒ network score is a sum of terms for different families
  • Optimal tree-structured network can be found using standard MST algorithms
  • Computation takes quadratic time
SLIDE 37

General Graphs: Search

SLIDE 38

Optimization Problem

Input:
  – Training data
  – Scoring function
  – Set of possible structures

Output: A network that maximizes the score

SLIDE 39

Beyond Trees

  • Problem is not obvious for general networks
    – Example: allowing two parents, a greedy algorithm is no longer guaranteed to find the optimal network
  • Theorem:
    – Finding the maximal scoring network structure with at most k parents for each variable is NP-hard for k > 1

SLIDE 40

Heuristic Search

[Figure: a network over A, B, C, D and the neighbors produced by single edge additions, deletions, and reversals]

SLIDE 41

Heuristic Search

  • Search operators:
    – local steps: edge addition, deletion, reversal
    – global steps
  • Search techniques:
    – Greedy hill-climbing
    – Best-first search
    – Simulated annealing
    – ...

SLIDE 42

Search: Greedy Hill Climbing

  • Start with a given network
    – empty network
    – best tree
    – a random network
    – prior knowledge
  • At each iteration
    – Consider score for all possible changes
    – Apply change that most improves the score
  • Stop when no modification improves the score
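The iteration above is ordinary greedy hill-climbing; a minimal generic sketch, where `score` and `neighbors` are placeholders standing in for a decomposable structure score and the local edge-change operators:

```python
def hill_climb(start, score, neighbors):
    """Greedy ascent: repeatedly apply the single best local change.

    start: initial structure; score: structure -> float;
    neighbors: structure -> iterable of candidate structures.
    """
    current, current_score = start, score(start)
    while True:
        # Consider the score of every candidate one local change away.
        best, best_score = None, current_score
        for cand in neighbors(current):
            s = score(cand)
            if s > best_score:
                best, best_score = cand, s
        if best is None:
            # No modification improves the score: stop.
            return current
        current, current_score = best, best_score
```

In the structure-learning setting, `neighbors` would enumerate legal (acyclic) edge additions, deletions, and reversals, and `score` would be BIC or BDe; the strict `>` means the search halts on plateaux as well as at local maxima, which is exactly the pitfall the next slide discusses.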

SLIDE 43

Greedy Hill Climbing Pitfalls

  • Greedy hill-climbing can get stuck in:
    – Local maxima
    – Plateaux
  • Typically because equivalent networks are often neighbors in the search space

SLIDE 44

Why Edge Reversal

[Figure: networks over A, B, C differing in the direction of one edge, illustrating why reversal is needed as a single search operator]

SLIDE 45

A Pretty Good, Simple Algorithm

  • Greedy hill-climbing, augmented with:
  • Random restarts:
    – When we get stuck, take some number of random steps and then start climbing again
  • Tabu list:
    – Keep a list of the K most recently taken steps
    – Search cannot reverse any of these steps

SLIDE 46

Example: ICU-Alarm

[Figure: KL divergence vs. number of samples M (500 to 5000), comparing True Structure/BDe α = 10 against Unknown Structure/BDe α = 10]

SLIDE 47

JamBayes

Horvitz, Apacible, Sarin, & Liao, UAI 2005

SLIDE 48

Predicting Surprises

Horvitz, Apacible, Sarin, & Liao, UAI 2005

SLIDE 49

Learned Model

Horvitz, Apacible, Sarin, & Liao, UAI 2005

SLIDE 50

Influences in Learned Model

Horvitz, Apacible, Sarin, & Liao, UAI 2005

SLIDE 51

Biological Network Reconstruction

[Figure: learned protein-signaling network over PKC, Raf, Erk, Mek, Plcγ, PKA, Akt, Jnk, P38, PIP2, PIP3, distinguishing phospho-proteins, phospho-lipids, nodes perturbed in the data, and an edge subsequently validated in the wet lab]

Edges: Known 15/17, Supported 2/17, Reversed 1, Missed 3

From “Causal protein-signaling networks derived from multiparameter single-cell data”, Sachs et al., Science 308:523, 2005. Reprinted with permission from AAAS. This figure may be used for non-commercial and classroom purposes only. Any other uses require prior written permission from AAAS.

SLIDE 52

Summary

  • Useful for building better predictive models:
    – when domain experts don’t know the structure
    – for knowledge discovery
  • Finding the highest-scoring structure is NP-hard
  • Typically solved using simple heuristic search
    – local steps: edge addition, deletion, reversal
    – hill-climbing with tabu lists and random restarts
  • But there are better algorithms
SLIDE 53

General Graphs: Decomposability

SLIDE 54

Heuristic Search

[Figure: a network over A, B, C, D and the neighbors produced by single edge changes]

SLIDE 55

Naïve Computational Analysis

  • Operators per search step: O(n²)
  • Cost per network evaluation:
    – Components in score
    – Compute sufficient statistics
    – Acyclicity check
  • Total: O(n² (Mn + m)) per search step
SLIDE 56

Exploiting Decomposability

[Figure: a network over A, B, C, D before and after adding the edge B → D]

score = Score(A | {}) + Score(B | {}) + Score(C | {A,B}) + Score(D | {C})
score′ = Score(A | {}) + Score(B | {}) + Score(C | {A,B}) + Score(D | {B,C})

Δscore(D) = Score(D | {B,C}) − Score(D | {C})

SLIDE 57

Exploiting Decomposability

[Figure: a network over A, B, C, D and the neighbors produced by single edge changes]

Δscore(D) = Score(D | {B,C}) − Score(D | {C})
Δscore(C) = Score(C | {A}) − Score(C | {A,B})
Δscore(C) + Δscore(B) = Score(C | {A}) − Score(C | {A,B}) + Score(B | {C}) − Score(B | {})
SLIDE 58

Exploiting Decomposability

[Figure: a network over A, B, C, D and the neighbors produced by single edge changes]

To recompute scores, only need to re-score families that changed in the last move

Δscore(C) = Score(C | {A}) − Score(C | {A,B})
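The delta-score bookkeeping above can be sketched as follows; `fam_score` and the function names are placeholders introduced here for any decomposable family score (likelihood, BIC, BDe), and only the families a move touches are re-scored:

```python
def delta_add_edge(fam_score, parents, u, v):
    """Change in total score from adding edge u -> v (only v's family changes)."""
    old = frozenset(parents[v])
    return fam_score(v, old | {u}) - fam_score(v, old)

def delta_delete_edge(fam_score, parents, u, v):
    """Change in total score from deleting edge u -> v (only v's family changes)."""
    old = frozenset(parents[v])
    return fam_score(v, old - {u}) - fam_score(v, old)

def delta_reverse_edge(fam_score, parents, u, v):
    """Change from reversing u -> v: exactly two families (u and v) change."""
    return (delta_delete_edge(fam_score, parents, u, v)
            + delta_add_edge(fam_score, parents, v, u))
```

Because every other family term cancels in the difference, an edge addition or deletion costs one family re-score and a reversal costs two, rather than re-evaluating the whole network.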
SLIDE 59

Computational Cost

  • Cost per move
    – Compute O(n) delta-scores damaged by the move
    – Each one takes O(M) time
  • Keep a priority queue of operators sorted by delta-score – O(n log n)

SLIDE 60

More Computational Efficiency

  • Reuse and adapt previously computed sufficient statistics
  • Restrict in advance the set of operators considered in the search

SLIDE 61

Summary

  • Even heuristic structure search can get expensive for large n
  • Can exploit decomposability to get orders of magnitude reduction in cost
  • Other tricks are also used for scaling