

SLIDE 1

Probabilistic Graphical Models

Lecture 1 – Introduction

CS/CNS/EE 155
Andreas Krause

SLIDE 2

One of the most exciting advances in machine learning (AI, signal processing, coding, control, …) of recent decades

SLIDE 3

How can we gain global insight based on local observations?

SLIDE 4

Key idea:

• Represent the world as a collection of random variables X1, …, Xn with joint distribution P(X1, …, Xn)
• Learn the distribution from data
• Perform “inference”: compute conditional distributions P(Xi | X1 = x1, …, Xm = xm)
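
To make this concrete, here is a minimal Python sketch (a made-up toy example, not from the lecture): three binary random variables represented by their full joint table, with a conditional query answered by fixing the evidence, marginalizing, and renormalizing.

```python
import numpy as np

# Hypothetical joint distribution P(X1, X2, X3) over binary variables,
# stored as a full table with one entry per assignment (2^3 = 8 numbers).
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()  # normalize so the entries sum to 1

# Inference: P(X1 | X3 = 1) = P(X1, X3 = 1) / P(X3 = 1).
slice_x3 = joint[:, :, 1]             # fix the evidence X3 = 1
p_x1_given_x3 = slice_x3.sum(axis=1)  # marginalize out X2
p_x1_given_x3 /= p_x1_given_x3.sum()  # renormalize

print(p_x1_given_x3)  # [P(X1=0 | X3=1), P(X1=1 | X3=1)]
```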

SLIDE 5

Applications: Natural Language Processing

SLIDE 6

Speech recognition

• Infer spoken words from audio signals (“Hidden Markov Models”)

[Diagram: hidden Markov model with phoneme variables Y1–Y6 and word variables X1–X6 for the sentence “He ate the cookies on the couch”]
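
As a hedged illustration of the HMM machinery (the initial, transition, and emission tables below are invented, not the course’s model), the forward algorithm computes the likelihood of an observation sequence in time linear in its length:

```python
import numpy as np

# Hypothetical HMM: hidden states Y_t (e.g., phonemes), observations X_t.
pi = np.array([0.6, 0.4])          # P(Y_1)
A = np.array([[0.7, 0.3],          # A[i, j] = P(Y_{t+1} = j | Y_t = i)
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],          # B[i, k] = P(X_t = k | Y_t = i)
              [0.3, 0.7]])

def forward(obs):
    """Return P(X_1..X_T = obs) via the forward recursion."""
    alpha = pi * B[:, obs[0]]            # alpha_1(i) = P(Y_1 = i, X_1)
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]    # sum over the previous hidden state
    return alpha.sum()

print(forward([0, 1, 1, 0]))
```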

SLIDE 7

Natural language processing

[Diagram: sentence “He ate the cookies on the couch” with one variable X1–X7 per word]

SLIDE 8

Natural language processing

• Need to deal with ambiguity!
• Infer grammatical function from sentence structure (“Probabilistic Grammars”)

[Diagram: sentence “He ate the cookies on the couch” with one variable X1–X7 per word]

SLIDE 9

Evolutionary biology

Reconstruct phylogenetic tree from current species (and their DNA samples)

[Diagram: phylogenetic tree relating DNA sequences ACCGTA…, CCGAA…, CCGTA…, GCGGCT…, GCAATT…, GCAGTT…]

[Friedman et al.]

SLIDE 10

Applications: Computer Vision

SLIDE 11

Image denoising

SLIDE 12

Image denoising

• Markov Random Field
• Xi: noisy pixels
• Yi: “true” pixels
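
A minimal sketch of this idea, assuming a ±1 Ising-style MRF and iterated conditional modes (ICM) for approximate inference; the image, noise level, and coupling parameters are all made up for illustration:

```python
import numpy as np

# Yi: "true" +/-1 pixels; Xi: noisy observed pixels (10% flipped).
rng = np.random.default_rng(1)
true_img = np.ones((20, 20)); true_img[5:15, 5:15] = -1
noisy = np.where(rng.random(true_img.shape) < 0.1, -true_img, true_img)

beta, eta = 1.0, 1.5   # couplings to the neighbors and to the observation
y = noisy.copy()
for _ in range(5):     # a few ICM sweeps over the grid
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            nb = sum(y[a, b]
                     for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                     if 0 <= a < y.shape[0] and 0 <= b < y.shape[1])
            # set the pixel to the value maximizing its local conditional
            y[i, j] = 1 if beta * nb + eta * noisy[i, j] > 0 else -1

print((y == true_img).mean())  # fraction of correctly restored pixels
```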

SLIDE 13

Make3D

• Infer depth from 2D images (“Conditional random fields”)

SLIDE 14

Applications: State estimation

SLIDE 15

Robot localization & mapping

• Infer both location and map from noisy sensor data (“Particle filters”)

[D. Haehnel, W. Burgard, D. Fox, and S. Thrun. IROS-03]

SLIDE 16

Activity recognition

• Predict “goals” from raw GPS data (“Hierarchical Dynamical Bayesian networks”)

[L. Liao, D. Fox, and H. Kautz. AAAI-04]
SLIDE 17

Traffic monitoring

• Deployed sensors: high-accuracy speed data
• What about 148th Ave?

How can we get accurate road speed estimates everywhere?

[Photos: detector loops, traffic cameras]

SLIDE 18

• Cars as a sensor network [Krause, Horvitz et al.]
• (Normalized) speeds as random variables
• Joint distribution allows modeling correlations
• Can predict unmonitored speeds from monitored speeds using P(S5 | S1, S9)
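
One way to make this concrete, as a sketch assuming the normalized speeds are jointly Gaussian (the actual model of [Krause, Horvitz et al.] is not given here, and the numbers are invented): the prediction P(S5 | S1, S9) is then standard Gaussian conditioning.

```python
import numpy as np

# Hypothetical jointly Gaussian model over three segment speeds (S1, S5, S9).
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])

obs_idx, query_idx = [0, 2], [1]    # observe S1 and S9, query S5
s_obs = np.array([0.8, -0.2])       # observed normalized speeds

# P(S5 | S1, S9) is Gaussian with
#   mean = mu_q + S_qo S_oo^{-1} (s_obs - mu_o)
#   var  = S_qq - S_qo S_oo^{-1} S_oq
S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
S_qo = Sigma[np.ix_(query_idx, obs_idx)]
K = S_qo @ np.linalg.inv(S_oo)
cond_mean = mu[query_idx] + K @ (s_obs - mu[obs_idx])
cond_var = Sigma[np.ix_(query_idx, query_idx)] - K @ S_qo.T
print(cond_mean, cond_var)
```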

SLIDE 19

Applications: Structure Prediction

SLIDE 20

Collaborative Filtering and Link Prediction

• Predict “missing links”, ratings, …
• “Collective matrix factorization”, relational models

[L. Brouwer, T. Riley]
SLIDE 21

Analyzing fMRI data

• Predict activation patterns for nouns
• Predict connectivity (Pittsburgh Brain Competition)

Mitchell et al., Science, 2008

SLIDE 22

Other applications

• Coding (LDPC codes, …)
• Medical diagnosis
• Identifying gene regulatory networks
• Distributed control
• Computer music
• Probabilistic logic
• Graphical games
• … and many more!

SLIDE 23

Key challenges: How do we…

• … represent such probabilistic models? (distributions over vectors, maps, shapes, trees, graphs, functions, …)
• … perform inference in such models?
• … learn such models from data?

SLIDE 24

Syllabus overview

• We will study Representation, Inference & Learning
• First in the simplest case:
  • Only discrete variables
  • Fully observed models
  • Exact inference & learning
• Then generalize:
  • Continuous distributions
  • Partially observed models (hidden variables)
  • Approximate inference & learning
• Learn about algorithms, theory & applications

SLIDE 25

Overview

• Course webpage: http://www.cs.caltech.edu/courses/cs155/
• Teaching assistant: Pete Trautman (trautman@cds.caltech.edu)
• Administrative assistant: Sheri Garcia (sheri@cs.caltech.edu)

SLIDE 26

Background & Prerequisites

• Basic probability and statistics
• Algorithms
• CS 156a or permission by instructor
• Please fill out the questionnaire about background (not graded)
• Programming assignments in MATLAB. Do we need a MATLAB review recitation?

SLIDE 27

Coursework

• Grading based on:
  • 4 homework assignments (one per topic) (40%)
  • Course project (40%)
  • Final take-home exam (20%)
• 3 late days
• Discussing assignments is allowed, but everybody must turn in their own solutions
• Start early!

SLIDE 28

Course project

• “Get your hands dirty” with the course material
• Implement an algorithm from the course or from a paper you read, and apply it to some data set
• Ideas on the course website (soon)
• Applying techniques you learned to your own research is encouraged
• Must be something new (e.g., not work done last term)

SLIDE 29

Project: Timeline and grading

• Small groups (2–3 students)
• October 19: project proposals due (1–2 pages); feedback by instructor and TA
• November 9: project milestone
• December 4: project report due; poster session
• Grading based on quality of poster (20%), milestone report (20%), and final report (60%)


SLIDE 31

Review: Probability

This should be familiar to you…

• Probability space (Ω, F, P):
  • Ω: set of “atomic events”
  • F ⊆ 2^Ω: set of all (non-atomic) events; F is a σ-algebra (closed under complements and countable unions)
  • P: F → [0,1] is a probability measure
  • For α ∈ F, P(α) is the probability that event α happens

SLIDE 32

Interpretation of probabilities

Philosophical debate…

• Frequentist interpretation:
  • P(α) is the relative frequency of α in repeated experiments
  • Often difficult to assess with limited data
• Bayesian interpretation:
  • P(α) is a “degree of belief” that α will occur
  • Where does this belief come from?
  • Many different flavors (subjective, pragmatic, …)

Most techniques in this class can be interpreted either way.

SLIDE 33

Independence of events

• Two events α, β ∈ F are independent if P(α ∩ β) = P(α) P(β)
• A collection S of events is independent if for any subset α1, …, αk ∈ S it holds that P(α1 ∩ … ∩ αk) = P(α1) ⋯ P(αk)

SLIDE 34

Conditional probability

Let α, β be events with P(β) > 0. Then:

P(α | β) = P(α ∩ β) / P(β)

SLIDE 35

Most important rule #1: the chain rule

Let α1, …, αk be events with P(α1 ∩ … ∩ αk-1) > 0. Then

P(α1 ∩ … ∩ αk) = P(α1) P(α2 | α1) ⋯ P(αk | α1 ∩ … ∩ αk-1)

SLIDE 36

Most important rule #2: Bayes’ rule

Let α, β be events with P(α) > 0, P(β) > 0. Then

P(α | β) = P(β | α) P(α) / P(β)

SLIDE 37

Random variables

• Events are cumbersome to work with
• Let D be some set (e.g., the integers)
• A random variable X is a mapping X: Ω → D
• For some x ∈ D, we say P(X = x) = P({ω ∈ Ω : X(ω) = x}), the “probability that variable X assumes state x”
• Notation: Val(X) = set D of all values assumed by X

SLIDE 38

Examples

• Bernoulli distribution: “(biased) coin flips”
  • D = {H, T}
  • Specify P(X = H) = p; then P(X = T) = 1 - p
  • Write: X ~ Ber(p)
• Multinomial distribution: “(biased) m-sided dice”
  • D = {1, …, m}
  • Specify P(X = i) = pi, s.t. Σi pi = 1
  • Write: X ~ Mult(p1, …, pm)
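
A quick sketch of sampling from both distributions with NumPy (the parameter values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli "coin flips": X ~ Ber(p) with a hypothetical p = 0.3.
p = 0.3
coin = rng.random(10_000) < p
print(coin.mean())                      # relative frequency of heads, near 0.3

# Multinomial "m-sided die": X ~ Mult(p1, ..., pm) with the pi summing to 1.
probs = np.array([0.2, 0.5, 0.3])
rolls = rng.choice(len(probs), size=10_000, p=probs)
print(np.bincount(rolls) / len(rolls))  # empirical frequencies near probs
```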

SLIDE 39

Multivariate distributions

• Instead of a single random variable, have a random vector X(ω) = [X1(ω), …, Xn(ω)]
• Specify P(X1 = x1, …, Xn = xn)
• Suppose all Xi are Bernoulli variables. How many parameters do we need to specify?

SLIDE 40

Rules for random variables

• Chain rule: P(X1, …, Xn) = P(X1) P(X2 | X1) ⋯ P(Xn | X1, …, Xn-1)
• Bayes’ rule: P(X = x | Y = y) = P(Y = y | X = x) P(X = x) / P(Y = y)
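
Both rules can be checked numerically on a small joint table; a minimal sketch over two binary variables with a random, made-up joint:

```python
import numpy as np

rng = np.random.default_rng(2)
P = rng.random((2, 2)); P /= P.sum()   # P[x, y] = P(X = x, Y = y)

P_x = P.sum(axis=1)                    # P(X)
P_y = P.sum(axis=0)                    # P(Y)
P_y_given_x = P / P_x[:, None]         # P(Y | X)

# Chain rule: P(X, Y) = P(X) P(Y | X)
assert np.allclose(P, P_x[:, None] * P_y_given_x)

# Bayes' rule: P(X | Y) = P(Y | X) P(X) / P(Y)
P_x_given_y = P_y_given_x * P_x[:, None] / P_y[None, :]
assert np.allclose(P_x_given_y, P / P_y[None, :])
print("chain rule and Bayes' rule hold on this table")
```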

SLIDE 41

Marginal distributions

Suppose X and Y are RVs with joint distribution P(X, Y). The marginal distribution of X is obtained by summing out Y:

P(X = x) = Σy P(X = x, Y = y)

SLIDE 42

Marginal distributions

• Suppose we have a joint distribution P(X1, …, Xn). Then
  P(X1 = x1) = Σx2 ⋯ Σxn P(x1, x2, …, xn)
• If all Xi are binary: how many terms does the sum have?
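
A sketch in NumPy (n = 10 is arbitrary): with the joint stored as an n-dimensional table, the marginal is a single sum, but the table itself already has 2^n entries, which is the real bottleneck.

```python
import numpy as np

n = 10
rng = np.random.default_rng(3)
joint = rng.random((2,) * n); joint /= joint.sum()  # P(X1, ..., Xn)

p_x1 = joint.sum(axis=tuple(range(1, n)))           # sum out X2, ..., Xn
print(p_x1, joint.size)                             # 2 numbers from 1024 entries
```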

SLIDE 43

Independent RVs

• What if the RVs are independent?
• RVs X1, …, Xn are independent if for any assignment
  P(X1 = x1, …, Xn = xn) = P(x1) P(x2) ⋯ P(xn)
• How many parameters are needed in this case?
• Independence is too strong an assumption… Is there something weaker?

SLIDE 44

Key concept: Conditional independence

• Events α, β are conditionally independent given γ if P(α ∩ β | γ) = P(α | γ) P(β | γ)
• Random variables X and Y are cond. indep. given Z if for all x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z):
  P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z)
• If P(Y = y | Z = z) > 0, that’s equivalent to P(X = x | Z = z, Y = y) = P(X = x | Z = z)
• Similarly for sets of random variables X, Y, Z
• We write: P ⊨ X ⊥ Y | Z
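
To make the definition operational, here is a brute-force sketch that checks X ⊥ Y | Z on a joint table; the distribution is built as P(Z) P(X | Z) P(Y | Z) from made-up numbers, so the check succeeds by construction:

```python
import numpy as np

def cond_indep(P, tol=1e-9):
    """Check X _|_ Y | Z for a joint table P[x, y, z] by brute force."""
    for z in range(P.shape[2]):
        Pz = P[:, :, z].sum()
        if Pz == 0:
            continue
        joint_z = P[:, :, z] / Pz     # P(X, Y | Z = z)
        prod = (joint_z.sum(axis=1, keepdims=True)
                * joint_z.sum(axis=0, keepdims=True))
        if not np.allclose(joint_z, prod, atol=tol):
            return False
    return True

Pz = np.array([0.4, 0.6])                    # P(Z)
Px_z = np.array([[0.9, 0.1], [0.2, 0.8]])    # P(X | Z), rows indexed by z
Py_z = np.array([[0.5, 0.5], [0.7, 0.3]])    # P(Y | Z)
P = np.einsum('z,zx,zy->xyz', Pz, Px_z, Py_z)
print(cond_indep(P))  # True
```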

SLIDE 45

Why is conditional independence useful?

• Chain rule: P(X1, …, Xn) = P(X1) P(X2 | X1) ⋯ P(Xn | X1, …, Xn-1). How many parameters?
• Now suppose X1, …, Xi-1 ⊥ Xi+1, …, Xn | Xi for all i (a Markov chain). Then
  P(X1, …, Xn) = P(X1) P(X2 | X1) ⋯ P(Xn | Xn-1)
• How many parameters now? Can we compute P(Xn) more efficiently? (See the sketch below.)
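
A sketch of the efficiency gain under this Markov chain assumption (the conditional probability tables are made up and shared across steps): P(Xn) follows from n - 1 small matrix-vector products instead of a sum over the 2^n-entry joint.

```python
import numpy as np

n = 50
p = np.array([0.5, 0.5])         # P(X1)
T = np.array([[0.9, 0.1],        # T[i, j] = P(X_{t+1} = j | X_t = i)
              [0.4, 0.6]])

for _ in range(n - 1):
    p = p @ T                    # P(X_{t+1}) = sum_x P(X_t = x) P(X_{t+1} | x)
print(p)                         # P(Xn), computed with O(n) work
```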

SLIDE 46

Properties of Conditional Independence

• Symmetry: X ⊥ Y | Z ⇒ Y ⊥ X | Z
• Decomposition: X ⊥ Y,W | Z ⇒ X ⊥ Y | Z
• Contraction: (X ⊥ Y | Z) ∧ (X ⊥ W | Y,Z) ⇒ X ⊥ Y,W | Z
• Weak union: X ⊥ Y,W | Z ⇒ X ⊥ Y | Z,W
• Intersection: (X ⊥ Y | Z,W) ∧ (X ⊥ W | Y,Z) ⇒ X ⊥ Y,W | Z
  • Holds only if the distribution is positive, i.e., P > 0
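
These properties can be sanity-checked numerically. A hedged sketch for decomposition: build a random distribution in which X ⊥ Y,W | Z holds by construction, marginalize out W, and verify that X ⊥ Y | Z follows.

```python
import numpy as np

rng = np.random.default_rng(4)
P_z = np.array([0.5, 0.5])
P_x_z = rng.random((2, 2));     P_x_z /= P_x_z.sum(axis=1, keepdims=True)
P_yw_z = rng.random((2, 2, 2)); P_yw_z /= P_yw_z.sum(axis=(1, 2), keepdims=True)

# P(X, Y, W, Z) = P(Z) P(X | Z) P(Y, W | Z)  =>  X _|_ (Y, W) | Z by design.
P = np.einsum('z,zx,zyw->xywz', P_z, P_x_z, P_yw_z)
P_xyz = P.sum(axis=2)                                 # marginalize out W

for z in range(2):
    Pz = P_xyz[:, :, z] / P_xyz[:, :, z].sum()        # P(X, Y | Z = z)
    prod = (Pz.sum(axis=1, keepdims=True)
            * Pz.sum(axis=0, keepdims=True))
    assert np.allclose(Pz, prod)                      # X _|_ Y | Z holds
print("decomposition verified on this example")
```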

SLIDE 47

Key questions

• How do we specify distributions that satisfy particular independence properties? → Representation
• How can we exploit independence properties for efficient computation? → Inference
• How can we identify independence properties present in data? → Learning
• We will now see examples: Bayesian networks

SLIDE 48

Bayesian networks

• A powerful class of probabilistic graphical models
• Compact parametrization of high-dimensional distributions
• In many cases, efficient exact inference possible
• Many applications:
  • Natural language processing
  • State estimation
  • Link prediction
  • …

Demo…

SLIDE 49

Key idea

• Conditional parametrization (instead of joint parametrization)
• For each RV, specify P(Xi | XA) for a set XA of RVs
• Then use the chain rule to get a joint parametrization
• Have to be careful to guarantee a legal distribution…

SLIDE 50

Example: 2 variables

SLIDE 51

Example: 3 variables

SLIDE 52

Example: Naïve Bayes models

• Class variable Y
• Evidence variables X1, …, Xn
• Assume that XA ⊥ XB | Y for all subsets XA, XB of {X1, …, Xn}
• Conditional parametrization:
  • Specify P(Y)
  • Specify P(Xi | Y)
• Joint distribution: P(Y, X1, …, Xn) = P(Y) ∏i P(Xi | Y)

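A minimal Naïve Bayes sketch with made-up tables (binary class Y, three binary evidence variables), computing the posterior P(Y | x1, …, xn) from the factorization above:

```python
import numpy as np

P_y = np.array([0.7, 0.3])       # P(Y)
P_x_given_y = np.array([         # P_x_given_y[i, y, x] = P(Xi = x | Y = y)
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.6, 0.4], [0.1, 0.9]],
    [[0.9, 0.1], [0.5, 0.5]],
])

def posterior(x_obs):
    """P(Y | X1..Xn = x_obs): multiply P(Y) by each P(Xi | Y), renormalize."""
    p = P_y.copy()
    for i, x in enumerate(x_obs):
        p *= P_x_given_y[i, :, x]
    return p / p.sum()

print(posterior([1, 0, 1]))
```
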
SLIDE 53

What you need to know

• Basic probability
• Independence and conditional independence
• Chain rule & Bayes’ rule
• Naïve Bayes models

SLIDE 54

Tasks

• By tomorrow (October 1, 4pm): hand in the questionnaire about background to Sheri Garcia
• Read Chapter 2 in Koller & Friedman
• Start thinking about project teams and ideas (proposals due October 19)