ECE 6504: Advanced Topics in Machine Learning (PowerPoint Presentation)


Slide 1

ECE 6504: Advanced Topics in Machine Learning

Probabilistic Graphical Models and Large-Scale Learning

Dhruv Batra Virginia Tech

Topics

– Bayes Nets – (Finish) Structure Learning

Readings: KF 18.4; Barber 9.5, 10.4

Slide 2

Administrativia

  • HW1

– Out
– Due in 2 weeks: Feb 17, Feb 19, 11:59pm
– Please, please start early
– Implementation: TAN, structure + parameter learning
– Please post questions on the Scholar Forum.

(C) Dhruv Batra 2

Slide 3

Recap of Last Time

Slide 4

Learning Bayes nets

Data x(1), …, x(m) → structure and parameters (CPTs: P(Xi | Pa_Xi))

                        Known structure      Unknown structure
Fully observable data   Very easy            Hard
Missing data            Somewhat easy (EM)   Very very hard

Slide Credit: Carlos Guestrin

Slide 5

Types of Errors

  • Truth:
  • Recovered:

[Figure: the true Flu/Allergy/Sinus/Headache/Nose network alongside recovered structures]

Slide 6

Score-based approach

Data: <x1(1), …, xn(1)>, …, <x1(m), …, xn(m)>

Enumerate possible structures over {Flu, Allergy, Sinus, Headache, Nose}; for each candidate, score the structure and learn its parameters. [Figure: three candidate networks with scores 52, 60, and 500]

Slide Credit: Carlos Guestrin

Slide 7

How many graphs?

  • N vertices.
  • How many (undirected) graphs?
  • How many (undirected) trees?
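The counts the slide asks for are standard combinatorics (the answers are not shown on the slide, so take this as a sketch): there are 2^C(N,2) undirected graphs on N labeled vertices, and N^(N-2) labeled trees by Cayley's formula.

```python
def num_undirected_graphs(n):
    # Each of the C(n, 2) = n(n-1)/2 possible edges is present or absent.
    return 2 ** (n * (n - 1) // 2)

def num_labeled_trees(n):
    # Cayley's formula: n^(n-2) labeled trees on n vertices (n >= 2).
    return n ** (n - 2)

# For the 5-node Flu/Allergy/Sinus/Headache/Nose example:
print(num_undirected_graphs(5))  # 1024
print(num_labeled_trees(5))      # 125
```

The gap between the two counts is why restricting the search to trees (Chow-Liu, later in the lecture) makes structure learning tractable.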

Slide 8

What’s a good score?

  • Score(G) = log-likelihood(G : D, θ_MLE) = log P(D | θ_MLE, G)

Slide 9

Information-theoretic interpretation of Maximum Likelihood Score

  • Maximum likelihood score decomposition:

log P(D | θ_MLE, G) = m Σ_i I(X_i; Pa_{X_i}) − m Σ_i H(X_i)

  • Implications:

– Intuitive: higher mutual information → higher score
– Decomposes over families in the BN (a node and its parents)
– Same score for I-equivalent structures!

Slide 10

Log-Likelihood Score Overfits

  • Adding an edge only improves score!

– Thus, MLE = complete graph

  • Two fixes:

– Restrict space of graphs

  • say only d parents allowed (d = 1 → trees)

– Put priors on graphs

  • Prefer sparser graphs

Slide 11

Chow-Liu tree learning algorithm 1

  • For each pair of variables Xi, Xj

– Compute the empirical distribution: P̂(x_i, x_j) = Count(x_i, x_j) / m
– Compute the mutual information: I(X_i; X_j) = Σ_{x_i, x_j} P̂(x_i, x_j) log [ P̂(x_i, x_j) / (P̂(x_i) P̂(x_j)) ]

  • Define a graph

– Nodes X1, …, Xn
– Edge (i, j) gets weight I(X_i; X_j)

Slide Credit: Carlos Guestrin

Slide 12

Chow-Liu tree learning algorithm 2

  • Optimal tree BN

– Compute a maximum-weight spanning tree
– Directions in BN: pick any node as root, and direct edges away from the root

  • breadth-first-search defines directions
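The two steps above (mutual-information weights, then a maximum-weight spanning tree oriented away from a root) can be sketched as a toy implementation, assuming discrete data in an (m, n) integer array:

```python
import numpy as np

def chow_liu(data, root=0):
    """Chow-Liu sketch: weight each pair by empirical mutual information,
    build a maximum-weight spanning tree (Kruskal + union-find), then
    direct edges away from `root` by breadth-first search."""
    m, n = data.shape

    def mi(i, j):
        # Empirical mutual information between columns i and j (in nats).
        x, y = data[:, i], data[:, j]
        total = 0.0
        for a in np.unique(x):
            for b in np.unique(y):
                pxy = np.mean((x == a) & (y == b))
                if pxy > 0:
                    total += pxy * np.log(pxy / (np.mean(x == a) * np.mean(y == b)))
        return total

    # Maximum-weight spanning tree: consider heaviest candidate edges first.
    cand = sorted(((mi(i, j), i, j) for i in range(n) for j in range(i + 1, n)),
                  reverse=True)
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    undirected = []
    for w, i, j in cand:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            undirected.append((i, j))

    # Orient edges away from the root (BFS order defines the directions).
    adj = {u: [] for u in range(n)}
    for i, j in undirected:
        adj[i].append(j)
        adj[j].append(i)
    directed, seen, queue = [], {root}, [root]
    while queue:
        u = queue.pop(0)
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                directed.append((u, v))
                queue.append(v)
    return directed
```

With perfectly correlated columns 0 and 1 and a weakly related column 2, the learned tree keeps 0 and 1 adjacent and attaches 2 by its strongest link.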

Slide Credit: Carlos Guestrin

Slide 13

Can we extend Chow-Liu?

  • Tree augmented naïve Bayes (TAN) [Friedman et al. ’97]

– The naïve Bayes model overcounts evidence, because correlation between features is not considered
– Same as Chow-Liu, but score edges with the conditional mutual information given the class: I(X_i; X_j | C)
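Concretely, TAN swaps the Chow-Liu edge weight I(Xi; Xj) for the conditional mutual information given the class. A sketch for discrete arrays (names illustrative):

```python
import numpy as np

def cond_mutual_info(x, y, c):
    """Empirical conditional mutual information I(X; Y | C) in nats,
    the TAN edge weight: x, y are discrete 1-D integer arrays and c
    is the class variable."""
    total = 0.0
    for cv in np.unique(c):
        mask = (c == cv)
        pc = np.mean(mask)
        xs, ys = x[mask], y[mask]
        for a in np.unique(xs):
            for b in np.unique(ys):
                pxy = np.mean((xs == a) & (ys == b))
                if pxy > 0:
                    total += pc * pxy * np.log(
                        pxy / (np.mean(xs == a) * np.mean(ys == b)))
    return total

# X and Y look independent overall, but are perfectly dependent within
# each class: I(X; Y) = 0 while I(X; Y | C) = log 2.
x = np.array([0, 1, 0, 1])
y = np.array([0, 1, 1, 0])
c = np.array([0, 0, 1, 1])
print(cond_mutual_info(x, y, c))  # ~0.693
```

The example shows exactly the overcounting issue: conditioning on the class reveals a feature correlation the marginal mutual information misses.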

Slide Credit: Carlos Guestrin

Slide 14

Plan for today

  • (Finish) BN Structure Learning

– Bayesian score
– Heuristic search
– Efficient tricks with decomposable scores

Slide 15

Bayesian score

  • Bayesian view → prior distributions:

– Over structures
– Over parameters of a structure

  • Posterior over structures given data:

P(G | D) ∝ P(D | G) P(G), where P(D | G) = ∫ P(D | θ_G, G) P(θ_G | G) dθ_G

Slide 16

Structure Prior

  • Common choices:

– Uniform: P(G) ∝ c
– Sparsity prior: P(G) ∝ c^{|G|}
– Prior penalizing the number of parameters
– P(G) should decompose like the family score

Slide 17

Parameter Prior and Integrals

  • Important Result:

– If P(θ_G | G) is Dirichlet, then the integral has a closed form!
– And it factorizes according to the families in G

Dirichlet marginal likelihood for the multinomial P(X_i | pa_i):

P(D | G) = ∏_i ∏_{pa_i^G} [ Γ(α(pa_i^G)) / Γ(α(pa_i^G) + N(pa_i^G)) ] ∏_{x_i} [ Γ(α(x_i, pa_i^G) + N(x_i, pa_i^G)) / Γ(α(x_i, pa_i^G)) ]
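The closed form can be evaluated per family with log-gamma functions. A sketch (the count/hyperparameter layout is an assumption of this example):

```python
from math import lgamma, exp

def family_log_marginal_likelihood(counts, alpha):
    """Log Dirichlet marginal likelihood for one family P(Xi | Pa_i).
    counts[p][k] is N(xi = k, pa_i = p); alpha[p][k] is the matching
    Dirichlet hyperparameter. Sums the closed form over parent configs."""
    ll = 0.0
    for n_row, a_row in zip(counts, alpha):
        # Gamma(alpha(pa)) / Gamma(alpha(pa) + N(pa)) term
        ll += lgamma(sum(a_row)) - lgamma(sum(a_row) + sum(n_row))
        # Product over values of Xi
        for n, a in zip(n_row, a_row):
            ll += lgamma(a + n) - lgamma(a)
    return ll

# Sanity check: one binary variable, no parents, uniform Dirichlet(1, 1)
# prior; a single observation has marginal likelihood 1/2.
print(exp(family_log_marginal_likelihood([[1, 0]], [[1, 1]])))  # 0.5
```

Because the full score is a sum of such family terms, local search moves (later in the lecture) only need to re-evaluate the families they touch.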

Slide 18

Parameter Prior and Integrals

  • How should we choose Dirichlet hyperparameters?

– K2 prior: fix an α, P(θXi|PaXi) = Dirichlet(α,…, α)

  • K2 is “inconsistent”

Slide 19

BDe Prior

– Remember that Dirichlet parameters are analogous to "fictitious samples"
– Pick a fictitious sample size m′
– Pick a "prior" BN

  • Usually independent (product of marginals)

– Compute P(Xi, Pa_Xi) under this prior BN

  • BDe prior: α(xi, pa_Xi) = m′ · P(xi, pa_Xi)
  • Has the consistency property

Slide 20

Chow-Liu for Bayesian score

  • Edge weight w_{Xj → Xi} is the advantage of adding Xj as a parent of Xi
  • We now have a directed graph, so we need a directed spanning forest

– Note that adding an edge can hurt the Bayesian score, so choose a forest, not a tree
– A maximum spanning forest algorithm works

Slide 21

Structure learning for general graphs

  • In a tree, a node only has one parent
  • Theorem:

– The problem of learning a BN structure with at most d parents is NP-hard for any (fixed) d≥2

  • Most structure learning approaches use heuristics

– Exploit score decomposition
– (Quickly) describe two heuristics that exploit decomposition in different ways

Slide 22

Structure learning using local search

Starting from the Chow-Liu tree, do local search. Possible moves (only if the result stays acyclic!):

  • Add edge
  • Delete edge
  • Invert edge

Select moves using your favorite score.
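These moves can be sketched as a greedy hill climber over DAG edge sets; `score` is a placeholder for whatever (ideally decomposable) score you plug in:

```python
def has_cycle(edges, n):
    """DFS cycle check for a directed edge set over nodes 0..n-1."""
    adj = {u: [] for u in range(n)}
    for u, v in edges:
        adj[u].append(v)
    color = [0] * n  # 0 = unvisited, 1 = on stack, 2 = done

    def dfs(u):
        color[u] = 1
        for v in adj[u]:
            if color[v] == 1 or (color[v] == 0 and dfs(v)):
                return True
        color[u] = 2
        return False

    return any(color[u] == 0 and dfs(u) for u in range(n))

def local_search(n, score, init=frozenset(), iters=100):
    """Greedy hill climbing: try every add, delete, and invert move,
    keep only acyclic candidates, take the best-scoring one, and stop
    at a local maximum."""
    g = frozenset(init)
    for _ in range(iters):
        moves = []
        for u in range(n):
            for v in range(n):
                if u == v:
                    continue
                if (u, v) in g:
                    moves.append(g - {(u, v)})             # delete edge
                    moves.append(g - {(u, v)} | {(v, u)})  # invert edge
                elif (v, u) not in g:
                    moves.append(g | {(u, v)})             # add edge
        moves = [m for m in moves if not has_cycle(m, n)]  # only if acyclic!
        if not moves:
            return g
        best = max(moves, key=score)
        if score(best) <= score(g):
            return g  # local maximum reached
        g = best
    return g

# Toy score: reward the edge (0, 1), lightly penalize every edge.
score = lambda g: (1.0 if (0, 1) in g else 0.0) - 0.1 * len(g)
print(local_search(3, score))  # frozenset({(0, 1)})
```

This naive version rescores whole graphs; the decomposition trick on the next slides makes each move cost only one or two family rescoring operations.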

Slide 23

Structure learning using local search

  • Problems:

– Local maxima
– Plateaus

  • Strategies:

– Random restarts
– Tabu list

Slide 24

Exploit score decomposition in local search

  • Add edge and delete edge:

– Only rescore one family!

  • Reverse edge

– Rescore only two families

[Figure: a local move (Add Edge) shown on two copies of the Flu/Allergy/Sinus/Headache/Nose network]

Slide 25

Example: the Alarm network

Slide 26

Example

  • JamBayes [Horvitz et al UAI05]

Slide 27

Example

  • JamBayes [Horvitz et al UAI05]

Slide 28

Example

  • JamBayes [Horvitz et al]

Slide 29

Bayesian model averaging

  • So far, we have selected a single structure
  • But, if you are really Bayesian, you must average over structures

– Similar to averaging over parameters

Slide 30

BN: Structure Learning: What you need to know

  • Score-based approach

– Log-likelihood score

  • Use θMLE
  • Information theoretic interpretation
  • Overfits! Adding edges only helps

– Bayesian Score

  • Priors over structure and priors over parameters for a structure
  • If Dirichlet: closed-form expression for P(D|G)
  • K2 Dirichlet not enough; need BDe for consistency
  • Structure Search

– For trees

  • Chow-Liu: max-weight spanning tree
  • Can be extended to forests and TAN

– General graphs

  • Heuristic Search
  • Efficiency tricks due to decomposable score
