
School of Computer Science

Probabilistic Graphical Models (10-708)

Lecture 0: Introduction, Sep 10, 2007

Eric Xing

[Figure: a cellular signal transduction pathway with nodes Receptor A/B, Kinase C/D/E, TF F, and Gene G/H, labeled X1-X8]

Logistics

Class webpage:

  • http://www.cs.cmu.edu/~epxing/Class/10708-07/

Reading:

  • PDF documents

Logistics

  • No formal textbook, but draft chapters will be handed out in class:
    • M. I. Jordan, An Introduction to Probabilistic Graphical Models
    • Daphne Koller and Nir Friedman, Structured Probabilistic Models
  • Mailing lists:
    • To contact the instructors: 10708-07-instr@cs.cmu.edu
    • Class announcements list: 10708-07-announce@cs.cmu.edu
  • TAs:
    • Hetunandan Kamichetty, Doherty 4302C, office hours: Wednesdays, 5:00-6:00 pm
    • Dr. Ramesh Nallapati
  • Class assistant:
    • Monica Hopes, Wean Hall 4616, x8-5527


Logistics

4 homework assignments: 45% of grade

  • Theory exercises
  • Implementation exercises

Final project: 30% of grade

  • Applying PGMs to your research area: NLP, IR, computational biology, vision, robotics, …
  • Theoretical and/or algorithmic work, e.g.:
    • a more efficient approximate inference algorithm
    • a new sampling scheme for a non-trivial model …

Take-home final: 25% of grade

  • Theory exercises and/or analysis

Policies …


Past projects:

  • We will have a prize for the best project(s) …
  • Winner of the 2005 project:
    • J. Yang, Y. Liu, E. P. Xing and A. Hauptmann, Harmonium-Based Models for Semantic Video Representation and Classification, Proceedings of the Seventh SIAM International Conference on Data Mining (SDM 2007). (Recipient of the Best Paper Award)
  • Other projects:
    • Andreas Krause, Jure Leskovec and Carlos Guestrin, Data Association for Topic Intensity Tracking, 23rd International Conference on Machine Learning (ICML 2006).
    • Y. Shi, F. Guo, W. Wu and E. P. Xing, GIMscan: A New Statistical Method for Analyzing Whole-Genome Array CGH Data, The Eleventh Annual International Conference on Research in Computational Molecular Biology (RECOMB 2007).

What is this?

  • Classical AI and ML research ignored this phenomenon
  • The problem (an example):
    • You want to catch a 10:00am flight from Pittsburgh to SF. Can you make it if you leave CMU at 7am and take the 28X?
    • partial observability (road state, other drivers' plans, etc.)
    • noisy sensors (radio traffic reports)
    • uncertainty in action outcomes (flat tire, etc.)
    • immense complexity of modeling and predicting traffic
  • Reasoning under uncertainty!

A universal task …

  • Speech recognition
  • Information retrieval
  • Computer vision
  • Robotic control
  • Planning
  • Games
  • Evolution
  • Pedigree

The Fundamental Questions

Representation

  • How to capture/model uncertainties in possible worlds?
  • How to encode our domain knowledge/assumptions/constraints?

Inference

  • How do I answer questions/queries according to my model and/or based on given data?
  • e.g., compute P(Xi | D)

Learning

  • What model is "right" for my data?
  • e.g., M = argmax_{M ∈ ℳ} F(D; M)

[Figure: a graph over X1-X9 with unknown (?) structure and parameters]

Graphical Models

  • Graphical models are a marriage between graph theory and probability theory
  • One of the most exciting developments in machine learning (knowledge representation, AI, EE, Stats, …) in the last two decades …
  • Some advantages of the graphical model point of view:
    • Inference and learning are treated together
    • Supervised and unsupervised learning are merged seamlessly
    • Missing data are handled nicely
    • A focus on conditional independence and computational issues
    • Interpretability (if desired)
  • They are having significant impact in science, engineering and beyond!


What is a Graphical Model?

The informal blurb:

  • It is a smart way to write/specify/compose/design exponentially-large probability distributions without paying an exponential cost, and at the same time endow the distributions with structured semantics

A more formal description:

  • It refers to a family of distributions on a set of random variables that are compatible with all the probabilistic independence propositions encoded by a graph that connects these variables

[Figure: a graph over nodes A, B, C, D, E, F, G, H]

P(X1, X2, X3, X4, X5, X6, X7, X8)
= P(X1) P(X2) P(X3|X1) P(X4|X2) P(X5|X2) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)


Statistical Inference

[Figure: a probabilistic generative model produces gene expression profiles; statistical inference runs in the reverse direction, from gene expression profiles back to the model]


Multivariate Distribution in High-D Space

A possible world for cellular signal transduction:

[Figure: Receptor A/B (X1, X2), Kinase C/D/E (X3-X5), TF F (X6), Gene G/H (X7, X8)]

Recap of Basic Prob. Concepts

  • Representation: what is the joint probability distribution P(X1, X2, X3, X4, X5, X6, X7, X8) on multiple variables?
    • How many state configurations are there in total? --- 2^8
    • Do they all need to be represented?
    • Do we get any scientific/medical insight?
  • Learning: where do we get all these probabilities?
    • Maximum-likelihood estimation? But how much data do we need?
    • Where do we put domain knowledge, in terms of plausible relationships between variables and plausible values of the probabilities?
  • Inference: if not all variables are observable, how do we compute the conditional distribution of latent variables given evidence?
    • Computing P(H | A) would require summing over all 2^6 configurations of the unobserved variables (see the sketch below)
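To make the cost of naive inference concrete, here is a minimal Python sketch (not from the slides): computing P(X8 = 1 | X1 = a) directly from a full joint table forces a sum over all 2^6 settings of the six unobserved variables. The `joint` table and its contents are illustrative placeholders.

```python
from itertools import product

def p_x8_given_x1(joint, a):
    """P(X8 = 1 | X1 = a) by brute-force marginalization.
    joint[(x1, ..., x8)] -> probability of that full configuration."""
    num = den = 0.0
    for rest in product([0, 1], repeat=6):   # 2^6 = 64 unobserved configurations
        for x8 in (0, 1):
            p = joint[(a,) + rest + (x8,)]
            den += p
            if x8 == 1:
                num += p
    return num / den

# Toy usage: a uniform joint over all 2^8 configurations (placeholder numbers).
uniform = {x: 1.0 / 256 for x in product([0, 1], repeat=8)}
print(p_x8_given_x1(uniform, 1))   # 0.5, as expected under uniformity
```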


What is a Graphical Model?

-- an example from a signal transduction pathway

A possible world for cellular signal transduction:

[Figure: the pathway as variables, Receptor A/B (X1, X2), Kinase C/D/E (X3-X5), TF F (X6), Gene G/H (X7, X8)]

GM: Structure Simplifies Representation

Dependencies among variables:

[Figure: the pathway drawn as a graph over X1-X8, spanning the Membrane, Cytosol, and Nucleus compartments]


Why we may favor a PGM?

  • Incorporation of domain knowledge and causal (logical) structures

If the Xi's are conditionally independent (as described by a PGM), the joint can be factored into a product of simpler terms, e.g.,

P(X1, X2, X3, X4, X5, X6, X7, X8)
= P(X1) P(X2) P(X3|X1) P(X4|X2) P(X5|X2) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)

2+2+4+4+4+8+4+8 = 36, an 8-fold reduction from 2^8 = 256 in representation cost! (See the sketch below.)

Stay tuned for what these independencies are!
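As a sanity check on the arithmetic above, here is a small Python sketch (not from the slides); the `parents` map encodes the DAG implied by the factorization, and all variables are assumed binary.

```python
# Parents of each Xi in the factorization above (1-indexed, binary variables).
parents = {1: (), 2: (), 3: (1,), 4: (2,), 5: (2,), 6: (3, 4), 7: (6,), 8: (5, 6)}

# Each local table P(Xi | pa(Xi)) has 2^(|pa(Xi)| + 1) entries.
cpd_entries = sum(2 ** (len(pa) + 1) for pa in parents.values())
print(cpd_entries)   # 2+2+4+4+4+8+4+8 = 36
print(2 ** 8)        # 256 entries for the unfactored joint table

def joint(x, cpd):
    """P(x) as the product of local conditionals. x[i] in {0, 1};
    cpd[i][pa_values] = P(Xi = 1 | pa(Xi) = pa_values). Illustrative API."""
    p = 1.0
    for i, pa in parents.items():
        p1 = cpd[i][tuple(x[j] for j in pa)]
        p *= p1 if x[i] == 1 else 1.0 - p1
    return p
```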

GM: Data Integration

[Figure: two copies of the pathway graph over X1-X8 (e.g., from heterogeneous data sources) combined into a single model]


Why we may favor a PGM?

  • Incorporation of domain knowledge and causal (logical) structures
  • Modular combination of heterogeneous parts – data fusion

If the Xi's are conditionally independent (as described by a PGM), the joint can be factored into a product of simpler terms, e.g.,

P(X1, X2, X3, X4, X5, X6, X7, X8)
= P(X2) P(X4|X2) P(X5|X2) P(X1) P(X3|X1) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)

2+2+4+4+4+8+4+8 = 36, an 8-fold reduction from 2^8 = 256 in representation cost!

Rational Statistical Inference

The Bayes Theorem:

P(h | d) = P(d | h) P(h) / Σ_{h' ∈ H} P(d | h') P(h')

(posterior probability = likelihood × prior probability, normalized by a sum over the space of hypotheses; see the sketch below)

  • This allows us to capture uncertainty about the model in a principled way
  • But how can we specify and represent a complicated model?
    • Typically, the number of genes that need to be modeled is on the order of thousands!
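A minimal Python rendering of this rule (not from the slides), for a finite hypothesis space; the coin-bias hypotheses and all numbers below are made up for illustration.

```python
def posterior(prior, likelihood, d):
    """Bayes rule over a finite hypothesis space.
    prior: {h: P(h)}; likelihood: function (d, h) -> P(d | h)."""
    unnorm = {h: likelihood(d, h) * p for h, p in prior.items()}
    z = sum(unnorm.values())            # the sum over H in the denominator
    return {h: u / z for h, u in unnorm.items()}

# Toy usage: two hypotheses about a coin, data = 8 heads and 2 tails.
prior = {"fair": 0.5, "biased": 0.5}
heads_prob = {"fair": 0.5, "biased": 0.9}

def lik(d, h):
    q = heads_prob[h]
    return q ** d["heads"] * (1 - q) ** d["tails"]

print(posterior(prior, lik, {"heads": 8, "tails": 2}))
```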

GM: MLE and Bayesian Learning

  • Probabilistic statements about Θ are conditioned on the values of the observed variables A_obs and the prior p(Θ; χ):

p(Θ | A; χ) ∝ p(A | Θ) p(Θ; χ)      (posterior ∝ likelihood × prior; see the sketch below)

Θ_Bayes = ∫ Θ p(Θ | A; χ) dΘ

  • Example data (complete assignments):

(A,B,C,D,E,…) = (T,F,F,T,F,…)
(A,B,C,D,E,…) = (T,F,T,T,F,…)
……
(A,B,C,D,E,…) = (F,T,T,T,F,…)

[Figure: the pathway graph annotated with a CPT for P(F | C, D), with entries such as 0.9/0.1 and 0.2/0.8 for the four parent configurations]
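For the simplest concrete case of posterior ∝ likelihood × prior, here is a hedged Python sketch (not from the slides): a Bernoulli parameter Θ with a Beta prior, where the hyperparameters (α, β) play the role of χ; the conjugate update makes the posterior available in closed form.

```python
def beta_bernoulli_posterior(alpha, beta, data):
    """Beta(alpha, beta) prior + i.i.d. Bernoulli observations in {0, 1}
    gives a Beta(alpha + #ones, beta + #zeros) posterior."""
    ones = sum(data)
    return alpha + ones, beta + (len(data) - ones)

a, b = beta_bernoulli_posterior(1.0, 1.0, [1, 1, 0, 1])
print(a / (a + b))   # posterior mean: the integral of theta under p(theta | A)
```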

Why we may favor a PGM?

  • Incorporation of domain knowledge and causal (logical) structures
  • Modular combination of heterogeneous parts – data fusion
  • Bayesian philosophy: knowledge meets data

If the Xi's are conditionally independent (as described by a PGM), the joint can be factored into a product of simpler terms, e.g.,

α ⇒ θ ⇒ P(X1, X2, X3, X4, X5, X6, X7, X8)
= P(X1) P(X2) P(X3|X1) P(X4|X2) P(X5|X2) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)

2+2+4+4+4+8+4+8 = 36, an 8-fold reduction from 2^8 = 256 in representation cost!


Two types of GMs

  • Directed edges give causality relationships (Bayesian Network or Directed Graphical Model):

P(X1, X2, X3, X4, X5, X6, X7, X8)
= P(X1) P(X2) P(X3|X1) P(X4|X2) P(X5|X2) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)

  • Undirected edges simply give correlations between variables (Markov Random Field or Undirected Graphical Model); see the sketch below:

P(X1, X2, X3, X4, X5, X6, X7, X8)
= (1/Z) exp{E(X1) + E(X2) + E(X3,X1) + E(X4,X2) + E(X5,X2) + E(X6,X3,X4) + E(X7,X6) + E(X8,X5,X6)}
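A minimal Python sketch (not from the slides) of the undirected form: unnormalized clique scores exp{E(…)} multiplied together, with the partition function Z computed by brute-force enumeration over all 2^8 configurations. The energy functions here are toy placeholders.

```python
from itertools import product
import math

cliques = [(1,), (2,), (3, 1), (4, 2), (5, 2), (6, 3, 4), (7, 6), (8, 5, 6)]

def energy(x, E):
    """Sum of clique energies; E maps a clique to a function of its values."""
    return sum(E[c](*(x[i - 1] for i in c)) for c in cliques)

def partition(E):
    return sum(math.exp(energy(x, E)) for x in product([0, 1], repeat=8))

# Toy potentials: reward agreement among the variables in each clique.
E = {c: (lambda *vals: 1.0 if len(set(vals)) == 1 else 0.0) for c in cliques}
Z = partition(E)
x = (1,) * 8
print(math.exp(energy(x, E)) / Z)   # P(all ones) under this toy model
```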

Bayesian Networks

Structure: DAG

  • Meaning: a node is conditionally independent of every other node in the network outside its Markov blanket
  • Local conditional distributions (CPDs) and the DAG completely determine the joint distribution
  • Give causality relationships, and facilitate a generative process

[Figure: a node X with parents Y1 and Y2; its Markov blanket consists of its parents, children, and children's co-parents, while ancestors and descendants lie outside it]


Markov Random Fields

Structure: undirected graph

  • Meaning: a node is conditionally independent of every other node in the network given its direct neighbors
  • Local contingency functions (potentials) and the cliques in the graph completely determine the joint distribution
  • Give correlations between variables, but no explicit way to generate samples

[Figure: a node X with neighbors Y1 and Y2]

An (incomplete) genealogy of graphical models

[Figure: genealogy diagram of graphical model families; picture by Zoubin Ghahramani and Sam Roweis]


Probabilistic Inference

  • Computing statistical queries regarding the network, e.g.:
    • Is node X independent of node Y given nodes Z, W?
    • What is the probability of X = true if (Y = false and Z = true)?
    • What is the joint distribution of (X, Y) if Z = false?
    • What is the likelihood of some full assignment?
    • What is the most likely assignment of values to all or a subset of the nodes of the network?
  • General-purpose algorithms exist to fully automate such computation
    • Computational cost depends on the topology of the network
    • Exact inference:
      • The junction tree algorithm (see the toy elimination sketch below)
    • Approximate inference:
      • Loopy belief propagation, variational inference, Monte Carlo sampling
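To give a flavor of exact inference, here is a toy Python sketch (not from the slides, and much simpler than the junction tree algorithm): variable elimination on a three-node chain X1 → X2 → X3, with made-up CPT numbers.

```python
p_x1 = [0.6, 0.4]                         # P(X1)
p_x2_given_x1 = [[0.7, 0.3], [0.2, 0.8]]  # P(X2 | X1)
p_x3_given_x2 = [[0.9, 0.1], [0.5, 0.5]]  # P(X3 | X2)

# Eliminate X1: tau(x2) = sum_x1 P(x1) P(x2 | x1)
tau = [sum(p_x1[x1] * p_x2_given_x1[x1][x2] for x1 in (0, 1)) for x2 in (0, 1)]
# Eliminate X2: P(x3) = sum_x2 tau(x2) P(x3 | x2)
p_x3 = [sum(tau[x2] * p_x3_given_x2[x2][x3] for x2 in (0, 1)) for x3 in (0, 1)]
print(p_x3)   # [0.7, 0.3]; sums to 1, without ever building the full joint
```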

A few myths about graphical models

  • They require a localist semantics for the nodes
  • They require a causal semantics for the edges
  • They are necessarily Bayesian
  • They are intractable


Applications of GMs

  • Machine Learning
  • Computational statistics
  • Computer vision and graphics
  • Natural language processing
  • Information retrieval
  • Robotic control
  • Decision making under uncertainty
  • Error-control codes
  • Computational biology
  • Genetics and medical diagnosis/prognosis
  • Finance and economics
  • Etc.

Speech recognition

[Figure: a Hidden Markov Model, with chains X1, X2, X3, …, XT and Y1, Y2, Y3, …, YT and transition parameters A along the chain; see the sketch below]
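A minimal Python sketch (not from the slides) of the forward algorithm, the standard way to compute the likelihood of an observation sequence under an HMM like the one pictured; all parameters below are made up.

```python
def hmm_forward(pi, A, B, obs):
    """P(observations) via the forward recursion.
    pi[s]: initial prob; A[s][t]: transition; B[s][o]: emission."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[s] * A[s][t] for s in range(n)) * B[t][o]
                 for t in range(n)]
    return sum(alpha)

pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.8, 0.2], [0.3, 0.7]]
print(hmm_forward(pi, A, B, [0, 1, 0]))   # likelihood of a short sequence
```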


Segmentation and Pattern Recog. (in Bio, Vision, NLP)

[Figure: a nucleotide sequence (… A C G T A G …) segmented into regions m1-m8 with per-segment parameters θ1-θ8, modeled as hidden labels Y over observations X]

Reinforcement learning

Partially observable Markov decision processes (POMDP)


Machine translation

[Figure: an SMT word-alignment example]

  • The HM-BiTAM model (B. Zhao and E. P. Xing, ACL 2006)

Genetic pedigree

[Figure: an allele network over variables A0, A1, Ag, B0, B1, Bg, M1, F0, F1, Fg, C1, Cg, Sg]


Evolution

[Figure: a tree model: an unknown (?) ancestor evolving over T years, under substitution processes Qh and Qm, into observed nucleotides such as A, C, G]

Solid State Physics

  • Ising/Potts model (see the Gibbs-sampling sketch below)
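A minimal Python sketch (not from the slides) of Gibbs sampling in a small 2-D Ising model: each spin's conditional distribution depends only on its four neighbors (its Markov blanket), which is exactly the undirected-model structure discussed above. Grid size, β, and sweep count are arbitrary choices.

```python
import math, random

def gibbs_sweep(grid, beta):
    """One full sweep: resample each +/-1 spin from P(s | neighbors),
    which depends only on the local field of neighboring spins."""
    n = len(grid)
    for i in range(n):
        for j in range(n):
            field = sum(grid[x][y] for x, y in
                        ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                        if 0 <= x < n and 0 <= y < n)
            p_up = 1.0 / (1.0 + math.exp(-2.0 * beta * field))
            grid[i][j] = 1 if random.random() < p_up else -1
    return grid

random.seed(0)
g = [[random.choice([-1, 1]) for _ in range(8)] for _ in range(8)]
for _ in range(100):
    gibbs_sweep(g, beta=0.6)   # beta above the 2-D critical value: spins align
```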


Computer Vision


A Generative GM

  • P(X | b): deformation; P(Y | X; θ): transformation; P(I | Y): image observation
  • I: image observation; b: deformation parameter; θ: pose parameter; X: canonical shape; Y: transformed shape

[Figure: the chain b → X → Y → I, with θ entering the transformation step; annotations: unimodal/multimodal/example-based deformation, rigid/perspective transformation, boundary/regional observation, searching path/region]


A Generative GM (cont'd)

(Gu, Xing, & Kanade, CVPR07)

Why graphical models

  • A language for communication
  • A language for computation
  • A language for development

Origins:

  • Wright 1920’s
  • Independently developed by Spiegelhalter and Lauritzen in statistics and Pearl in computer science in the late 1980’s


Why graphical models

  • Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data.
  • The graph theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly-interacting sets of variables, as well as a data structure that lends itself naturally to the design of efficient general-purpose algorithms.
  • Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition and statistical mechanics are special cases of the general graphical model formalism.
  • The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism.

-- M. Jordan


Plan for the Class

  • Fundamentals of Graphical Models:
    • Bayesian Networks and Markov Random Fields
    • Continuous and hybrid models, exponential family, GLIM
    • Basic representation, inference, and learning
  • Case studies: popular Bayesian networks and MRFs
    • Multivariate Gaussian models
    • Temporal models
    • Tree models
    • Intractable popular BNs and MRFs: e.g., dynamic Bayesian networks, Bayesian admixture models (LDA)
  • Approximate inference
    • Monte Carlo algorithms
    • Variational methods
  • Advanced topics
    • Learning in structured input-output spaces
    • Nonparametric Bayesian models
  • Applications

Notation

  • Variable, value, and index
  • Random variable
  • Random vector
  • Random matrix
  • Parameters