Advanced Machine Learning Introduction to Probabilistic Graphical - - PowerPoint PPT Presentation



SLIDE 1

Advanced Machine Learning Introduction to Probabilistic Graphical Models

Amit Sethi Electrical Engineering, IIT Bombay

SLIDE 2

Objectives

  • Learn about statistical dependency of variables
  • Understand how this dependency can be coded in graphs
  • Understand the basic intuition behind Bayesian Networks

SLIDE 3

Bayesian models with which you are familiar

  • Bayes' theorem:
    – p(Ck|x) = p(Ck) p(x|Ck) / p(x)
    – Posterior = prior × likelihood / evidence
  • Naïve Bayes:
    – p(Ck|x) = p(Ck|x1,…,xn) ∝ p(Ck) Пi p(xi|Ck)
    – A decision about the class can now be based on the prior and the simplified class-conditional densities of x
    – The log of the posterior probability leads to a linear discriminant for certain class conditionals from the exponential families
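As a concrete illustration of the Naïve Bayes factorization p(Ck|x) ∝ p(Ck) Пi p(xi|Ck), here is a minimal sketch for two hypothetical classes C0 and C1 with binary features; all prior and likelihood numbers are made-up assumptions, not taken from the slides:

```python
# Hypothetical two-class Naive Bayes: posterior via p(Ck|x) ∝ p(Ck) * prod_i p(xi|Ck)
priors = {"C0": 0.6, "C1": 0.4}                       # p(Ck), assumed values
# p(xi=1|Ck) for three binary features, per class (assumed values)
likelihoods = {"C0": [0.2, 0.7, 0.5], "C1": [0.8, 0.3, 0.9]}

def posterior(x, priors, likelihoods):
    """Return p(Ck|x) for a binary feature vector x."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for xi, p1 in zip(x, likelihoods[c]):
            score *= p1 if xi == 1 else (1.0 - p1)    # class-conditional p(xi|Ck)
        scores[c] = score
    z = sum(scores.values())                          # evidence p(x)
    return {c: s / z for c, s in scores.items()}

post = posterior([1, 0, 1], priors, likelihoods)
```

Dividing by the evidence z is what turns the prior-times-likelihood scores into a proper posterior that sums to one.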

SLIDE 4

Consider an inference problem

  • Trying to guess if the family is out:
    – When the wife leaves the house, she leaves the outdoor light on (but she sometimes leaves it on for a guest)
    – When the wife leaves the house, she usually puts the dog out
    – When the dog has a bowel problem, it goes to the backyard
    – If the dog is in the backyard, I will probably hear it (but it might be the neighbor's dog)
  • If the dog is barking and the light is off, is the family out?

Example source: “Bayesian Networks without Tears” by Eugene Charniak, AI Magazine, AAAI 1991

SLIDE 5

Some observations

  • A lot of the events in the world are related
  • The relations are not deterministic but probabilistic
    – Some events are causes and others are effects
    – The effect usually has a sharper conditional distribution given the cause than when the cause is unknown

SLIDE 6

Bayesian Network definition

  • A Bayesian network is a directed graph in which each node (variable) is annotated with a conditional probability distribution that encodes statistical dependency:
    – Each node corresponds to a random variable
    – If there is an arrow (edge) from node X to node Y, X is said to be a parent of Y
    – Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that quantifies the effect of the parents on the node
    – The graph has no directed cycles (and hence is a directed acyclic graph, or DAG)

Source: “Pattern Recognition and Machine Learning”, Book by Christopher Bishop
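The structural part of this definition can be sketched in a few lines: a dict-of-parents encoding of the family-out network from the earlier slide, plus a check that the directed graph is acyclic (the conditional distributions are omitted here for brevity):

```python
# Sketch: a BN's structure as a dict mapping each node to its parent list.
# The family-out structure follows the earlier slide's example.
parents = {
    "fo": [], "bp": [],            # family-out, bowel-problem (no parents)
    "lo": ["fo"],                  # light-on depends on family-out
    "do": ["fo", "bp"],            # dog-out depends on both
    "hb": ["do"],                  # hear-bark depends on dog-out
}

def is_dag(parents):
    """Kahn-style check: repeatedly remove nodes whose parents are all removed."""
    remaining = set(parents)
    while remaining:
        ready = {n for n in remaining if not set(parents[n]) & remaining}
        if not ready:
            return False           # every remaining node waits on another: a cycle
        remaining -= ready
    return True
```

The acyclicity check matters because the chain-rule factorization over P(Xi | Parents(Xi)) only yields a valid joint distribution on a DAG.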

SLIDE 7

Back to our inference problem

  • Trying to guess if the family is out, given that the light is off and the dog is barking
  • 1: Brute force (no independence assumptions):
    – P(fo | lo=1, hb=1) ∝ Σbp Σdo p(fo, bp, do, lo=1, hb=1) = Σbp Σdo p(fo) p(bp|fo) p(do|fo,bp) …
  • 2: Using the factorization property of BNs:
    – P(fo | lo=1, hb=1) ∝ Σbp Σdo p(fo) p(bp) p(lo=1|fo) p(do|fo,bp) p(hb=1|do) = p(fo) p(lo=1|fo) Σbp p(bp) Σdo p(do|fo,bp) p(hb=1|do)

Source: “Pattern Recognition and Machine Learning”, Book by Christopher Bishop
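A small sketch of what pushing the sums inside the product buys us, on the family-out network. All CPT values below are illustrative assumptions, not from the original example; both routes give the same joint-with-evidence, but the factored one touches far fewer terms as networks grow:

```python
# Two marginalization strategies on the family-out network (fo, bp, do, lo, hb).
# All CPT numbers are illustrative assumptions.
p_fo = {1: 0.15, 0: 0.85}                       # p(fo)
p_bp = {1: 0.01, 0: 0.99}                       # p(bp)
p_lo1 = {1: 0.60, 0: 0.05}                      # p(lo=1 | fo)
p_do1 = {(1, 1): 0.99, (1, 0): 0.90,            # p(do=1 | fo, bp)
         (0, 1): 0.97, (0, 0): 0.30}
p_hb1 = {1: 0.70, 0: 0.01}                      # p(hb=1 | do)

def p_do(do, fo, bp):
    return p_do1[(fo, bp)] if do == 1 else 1.0 - p_do1[(fo, bp)]

def brute_force(fo):
    """Sum the full joint p(fo, bp, do, lo=1, hb=1) over the hidden bp and do."""
    return sum(p_fo[fo] * p_bp[bp] * p_lo1[fo] * p_do(do, fo, bp) * p_hb1[do]
               for bp in (0, 1) for do in (0, 1))

def factored(fo):
    """Push sums inside the product: terms without do leave the inner sum."""
    inner = lambda bp: sum(p_do(do, fo, bp) * p_hb1[do] for do in (0, 1))
    return p_fo[fo] * p_lo1[fo] * sum(p_bp[bp] * inner(bp) for bp in (0, 1))

# Normalizing over fo turns the joint-with-evidence into p(fo=1 | lo=1, hb=1)
posterior = brute_force(1) / (brute_force(1) + brute_force(0))
```

On five binary variables the saving is tiny, but on larger networks the factored sum avoids the exponential blow-up of the full joint table.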

SLIDE 8

What have we gained, so far?

  • 1. We have made it easier to visualize relationships between variables
  • 2. We have simplified the joint distribution into a product of lower-dimensional conditional distributions
  • 3. We have simplified marginalization of the joint distribution by taking terms that do not depend on the variable being summed over outside the sums

SLIDE 9

Notion of D-separation

  • The influence of x “flows through” z to y
  • In which of the cases does the influence stop flowing iff z is known (i.e., z d-separates x and y, and the path becomes inactive given z)?
  • Ans: (a), (b) and (c). For (d), the path is inactive iff z is unknown

Source: “Pattern Recognition and Machine Learning”, Book by Christopher Bishop

[Figure: four three-node graphs, labelled (a) to (d), each showing a path between x and y through z]
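The head-to-head case (d) can be checked numerically. The sketch below uses an assumed conditional table for p(z|x,y) and verifies by enumeration that x and y are independent marginally but become dependent once z is observed:

```python
import itertools

# Numeric check of the head-to-head case x -> z <- y: marginally x and y are
# independent, but observing z activates the path. CPT values are assumptions.
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.7, 1: 0.3}
p_z1 = {(0, 0): 0.10, (0, 1): 0.80,   # p(z=1 | x, y)
        (1, 0): 0.80, (1, 1): 0.99}

def joint(x, y, z):
    pz = p_z1[(x, y)]
    return p_x[x] * p_y[y] * (pz if z == 1 else 1.0 - pz)

def marg(**fixed):
    """Probability of the fixed assignments, summing out the other variables."""
    return sum(joint(x, y, z)
               for x, y, z in itertools.product((0, 1), repeat=3)
               if all({"x": x, "y": y, "z": z}[k] == v for k, v in fixed.items()))

# p(x=1, y=1) vs p(x=1) p(y=1): should match (path blocked while z is unknown)
marginal_gap = abs(marg(x=1, y=1) - marg(x=1) * marg(y=1))
# p(x=1, y=1 | z=1) vs p(x=1|z=1) p(y=1|z=1): should differ (path now active)
pz = marg(z=1)
conditional_gap = abs(marg(x=1, y=1, z=1) / pz
                      - (marg(x=1, z=1) / pz) * (marg(y=1, z=1) / pz))
```

This is the “explaining away” effect: two independent causes become anti-correlated once their common effect is observed.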

SLIDE 10

Statistical independence in BNs

  • Independence: x ⊥ y, iff p(x,y) = p(x) p(y)
  • Conditional independence: x ⊥ y | z, iff p(x,y|z) = p(x|z) p(y|z)
    – x is conditionally independent of y given z
  • In Bayesian Networks:
    – x ⊥ NonDescendants(x) | Parents(x)
    – x is conditionally independent of all its non-descendants given its parents

SLIDE 11

Markov Networks aka MRF

  • Definition
    – Graphical models with undirected edges
    – Variables are nodes
    – Relationships between variables are undirected edges
  • Properties
    – The notion of conditional independence is simpler
    – Joint distributions are represented by potentials over cliques (fully connected subsets of nodes)

Source: “Pattern Recognition and Machine Learning”, Book by Christopher Bishop

SLIDE 12

Joint distributions in MRFs

  • If there is no link between two nodes xi and xj, then conditional independence can be expressed as: p(xi, xj | x\{i,j}) = p(xi | x\{i,j}) p(xj | x\{i,j})
  • By the Hammersley-Clifford theorem, the set of distributions represented by the MRF’s conditional independence structure is the same as the set that can be represented by a product of maximal-clique potentials, i.e. the joint distribution is written as a product of potential functions ψc(xc) over the maximal cliques of the graph: p(x) = (1/Z) Пc ψc(xc)
  • Here Z = Σx Пc ψc(xc) is the partition function, a normalization constant

Source: “Pattern Recognition and Machine Learning”, Book by Christopher Bishop
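A minimal sketch of p(x) = (1/Z) Пc ψc(xc) for a hypothetical three-node chain MRF x1 - x2 - x3, whose maximal cliques are {x1,x2} and {x2,x3}; the agreement-rewarding potential is an assumption for illustration:

```python
import itertools
import math

# Joint distribution of a 3-node chain MRF from maximal-clique potentials.
def psi(a, b):
    # A strictly positive pairwise potential that rewards agreement (assumed form)
    return math.exp(1.0 if a == b else -1.0)

states = list(itertools.product((0, 1), repeat=3))

# Unnormalized product over the maximal cliques {x1,x2} and {x2,x3}
unnorm = {s: psi(s[0], s[1]) * psi(s[1], s[2]) for s in states}
Z = sum(unnorm.values())                     # partition function
p = {s: v / Z for s, v in unnorm.items()}    # p(x) = (1/Z) prod_c psi_c(x_c)
```

Note that Z requires a sum over all joint states, which is why computing the partition function is the expensive step for large MRFs.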

SLIDE 13

This clique potential can be represented in terms of an energy function

  • The clique potentials are strictly positive, hence each can be defined in terms of an energy function: ψc(xc) = exp(-E(xc))
  • The product of clique potentials is then equivalent to a sum of energies in the exponent: Пc ψc(xc) = exp(-Σc E(xc))
  • However, unlike the conditional distributions in a Bayesian Network, the clique potentials do not have a specific probabilistic interpretation

SLIDE 14

In general, BNs and MRFs represent overlapping but distinct sets of distributions: neither formalism can express every conditional independence structure of the other

  • There is a directed graph whose conditional independence properties cannot be expressed by any undirected graph
  • There is an undirected graph whose conditional independence properties cannot be expressed by any directed graph

Source: “Pattern Recognition and Machine Learning”, Book by Christopher Bishop

SLIDE 15

An example use of an MRF in image denoising or binary segmentation

  • Objective:
    – Find the underlying clean image
  • Assumptions:
    – Most of the pixels are not corrupted
    – Neighbouring pixels are likely to be the same
  • Define (with values {-1, +1}):
    – xi to be the underlying true pixels
    – yi to be the observed pixels (iid given xi)
  • Energy terms:
    – For the observation: -η xi yi
    – For spatial coherence: -β xi xj
    – For the prior: h xi
  • Total energy: E(x,y) = h Σi xi - β Σ{i,j} xi xj - η Σi xi yi

Source: “Pattern Recognition and Machine Learning”, Book by Christopher Bishop

SLIDE 16

Now, we minimize the energy to get the desired results

  • Energy function: E(x,y) = h Σi xi - β Σ{i,j} xi xj - η Σi xi yi
  • And p(x,y) = (1/Z) exp{-E(x,y)}
  • We initialize x with the observed y, then update the xi so that the energy is minimized. The following are results with two different energy-minimization algorithms:

Source: “Pattern Recognition and Machine Learning”, Book by Christopher Bishop
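The minimization step can be sketched with ICM (iterated conditional modes), one simple energy-minimization algorithm: each site in turn keeps whichever spin lowers its local energy. For brevity this works on a toy 1-D signal rather than an image; the parameter values and the signal are assumptions for illustration:

```python
# ICM for the energy E(x,y) = h*sum(xi) - beta*sum(xi*xj) - eta*sum(xi*yi),
# on a 1-D signal. Parameters chosen so smoothness (beta) dominates the data term.
h, beta, eta = 0.0, 2.0, 1.0

def local_energy(x, y, i, s):
    """Terms of E(x,y) that involve site i, with x[i] set to candidate spin s."""
    e = h * s - eta * s * y[i]
    for j in (i - 1, i + 1):                 # 1-D neighbours
        if 0 <= j < len(x):
            e -= beta * s * x[j]
    return e

def icm(y, sweeps=10):
    x = list(y)                              # initialize x with the noisy y
    for _ in range(sweeps):
        for i in range(len(x)):
            # keep whichever spin in {-1, +1} gives the lower local energy
            x[i] = min((-1, +1), key=lambda s: local_energy(x, y, i, s))
    return x

clean = [1] * 8 + [-1] * 8
noisy = list(clean)
noisy[3], noisy[12] = -1, +1                 # flip two "pixels"
denoised = icm(noisy)
```

ICM only finds a local minimum of the energy; graph-cut methods, for instance, can find the global minimum for this binary model, which is why the slide compares results from two different algorithms.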

SLIDE 17

Factor graphs are the most general form of Graphical Models

  • Factor graphs make the relationships among variables explicit by using factor nodes
  • Factorization:
    – If we can represent p(x) as a product of factors: p(x) = Пs fs(xs) = fa(x1,x2) fb(x1,x2) fc(x2,x3) fd(x3)
  • Then we can draw a bipartite (undirected) graph such that:
    – The set of nodes V represents variables
    – The set of nodes F represents functions or factors
    – No node in V is connected to another node in V
    – No node in F is connected to another node in F

Source: “Pattern Recognition and Machine Learning”, Book by Christopher Bishop
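The slide's example factorization p(x) = fa(x1,x2) fb(x1,x2) fc(x2,x3) fd(x3) can be evaluated directly; the factor functions below are illustrative assumptions, and normalizing their product over all states yields a valid distribution:

```python
import itertools

# Factors of the slide's example; the concrete (positive) forms are assumptions.
fa = lambda x1, x2: 1.0 + x1 * x2              # fa(x1, x2)
fb = lambda x1, x2: 2.0 - x1 * x2              # fb(x1, x2)
fc = lambda x2, x3: 1.0 + 0.5 * x2 * x3        # fc(x2, x3)
fd = lambda x3: 1.5 if x3 == 1 else 0.5        # fd(x3)

def unnorm(x1, x2, x3):
    # p(x) ∝ fa(x1,x2) fb(x1,x2) fc(x2,x3) fd(x3)
    return fa(x1, x2) * fb(x1, x2) * fc(x2, x3) * fd(x3)

states = list(itertools.product((0, 1), repeat=3))
Z = sum(unnorm(*s) for s in states)
p = {s: unnorm(*s) / Z for s in states}
```

Note that fa and fb both connect {x1, x2}: a factor graph keeps them as distinct factor nodes, whereas an MRF over the same variables would merge them into a single clique potential.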

SLIDE 18

Relation between the three PGMs

  • In a Bayesian Network, co-parents need to be moralized (married) to form edges in an MRF (because they are not independent given their children)
  • For an MRF, every clique is represented by a function node
  • Priors of parentless variables can also be incorporated in factor graphs
  • Loops can be avoided in factor graphs by combining the functions that form a loop

Source: “Pattern Recognition and Machine Learning”, Book by Christopher Bishop